NOVEL COMPUTATIONAL FRAMEWORKS FOR ANALYZING COMPLEX NON-TREE-LIKE EVOLUTION IN GENOMIC SEQUENCE DATA
By
Qiqige Wuyun
A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science - Doctor of Philosophy
2025

ABSTRACT
Phylogenetics studies the evolutionary relationships among species, often represented as phylogenetic trees. However, traditional tree-like models fall short when dealing with interspecific gene flows, such as hybridization/introgression in eukaryotes and horizontal gene transfer (HGT) in prokaryotes, which necessitate the use of phylogenetic networks to capture complex, reticulate evolutionary histories. Although numerous methodologies have been developed to analyze non-tree-like evolution, the advent of high-throughput sequencing technologies has introduced two primary scalability challenges: the large number of taxa present in the data and the intricate and diverse gene flow among these taxa. These challenges have considerably constrained the scope and precision of non-tree-like phylogenetic inference and analysis.
This dissertation addresses these limitations by introducing novel techniques for analyzing non-tree-like evolution in large-scale genomic studies. We developed PHiMM, an introgression detection tool that combines coalescent-based approximations with hidden Markov models (HMMs) to improve scalability and detection accuracy. Comparisons with state-of-the-art methods on both simulated and empirical datasets indicated that PHiMM significantly outperforms previous methods like PhyloNet-HMM in terms of runtime and memory usage, while maintaining comparable inference accuracy. To further enhance PHiMM’s performance, we integrated it with the SERES resampling tool, significantly improving introgression inference accuracy under various model conditions.
Simulation experiments demonstrated that combining the SERES resampling approach with PHiMM substantially improves introgression inference accuracy compared to standalone PHiMM, although it results in longer runtimes.
One major limitation of PHiMM is its requirement for a phylogenetic network as input to constitute the structure of the HMM. To address this, we extended PHiMM to DACS by integrating phylogenetic network inference with introgression detection. To further improve the scalability and accuracy of introgression mapping on ultra-large datasets, we adopted divide-and-conquer and subsampling techniques, allowing us to efficiently handle the complexity of the data while maintaining accuracy.
Moreover, we applied these approaches to metagenomic studies, where data often include thousands of species derived from complex microbial communities. After assembling the sequencing reads into metagenome-assembled genomes (MAGs), we employ DACS to identify reticulate evolutionary events, such as introgression or HGT, under challenging scenarios characterized by noise, incomplete data, and large numbers of taxa. By adapting our methods to accommodate the scale and complexity of metagenomic datasets, we provide a powerful framework for elucidating reticulate evolutionary histories in diverse microbial communities. These advancements provide deeper insights into genetic and biological processes and offer robust tools for a wide range of biological and medical applications.

Copyright by
QIQIGE WUYUN
2025

ACKNOWLEDGEMENTS
First, I would like to express my sincere thanks to my advisor, Professor Kevin Liu, for his continuous training and mentorship throughout my Ph.D. studies. With his patience and immense knowledge, Professor Liu has taught me to learn from every mistake and to care about details, guiding me to become a competent scientist.
I am deeply indebted to my committee members, Professor Emily Dolson, Professor Mohammad Ghassemi, Professor Elizabeth Heath-Heckman, and Professor Adina Howe, for their generous service, valuable comments, and thoughtful questions. It has been a great honor and privilege to have them on my doctoral committee.
I am grateful to BEACON, EEB, and HPCC for their generous support during my Ph.D. studies. I would also like to thank the professors and staff at the Department of Computer Science and Engineering, especially Professor Eric Torng and Professor Sandeep Kulkarni, for their invaluable assistance.
My deepest gratitude also goes to Professor Jishou Ruan, who supervised my master’s thesis at the Department of Mathematics, Nankai University. Professor Ruan has provided invaluable patience and feedback. Without his precious support and inspiration, I would not have achieved the current results. Although he is no longer with us, he will always remain in my heart.
Lastly, I had the pleasure of working with my excellent colleagues (Hussein, Wei, Julia, Meijun, Ahmad, Md, Rei, Byungho, Preethi, and many others), who have impacted and inspired me greatly.

TABLE OF CONTENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND
CHAPTER 3 PHIMM: SCALABLE STATISTICAL INTROGRESSION DETECTION USING THE APPROXIMATE COALESCENT-BASED INFERENCE
CHAPTER 4 APPLICATION OF PHIMM INTROGRESSION DETECTION METHOD TO EMERGING MODEL SYSTEMS
CHAPTER 5 BOOSTING PHIMM-BASED PHYLOGENETIC HMM INFERENCE AND LEARNING BASED ON RANDOM WALK RESAMPLING
CHAPTER 6 DACS: FAST AND ACCURATE ULTRA LARGE-SCALE COESTIMATION OF PHYLOGENETIC NETWORKS AND INTROGRESSIONS USING DIVIDE-AND-CONQUER
CHAPTER 7 APPLICATION OF DACS-BASED METHOD TO HORIZONTAL GENE TRANSFER DETECTION IN METAGENOMICS
CHAPTER 8 CONCLUSIONS AND FUTURE WORK
BIBLIOGRAPHY
APPENDIX A THE COMPARISON OF INTROGRESSION PATTERNS BETWEEN PHYLONET-HMM AND PHIMM ON EACH CHROMOSOME OF MOUSE DATA
APPENDIX B THE INTROGRESSION PATTERNS OF DACS ON DARWIN’S FINCH DATA

CHAPTER 1 INTRODUCTION
Phylogenetics is the study of relationships among different groups of species and their evolutionary developments, and attempts to trace the evolutionary history in the Tree of Life. A phylogeny represents the evolutionary history for a set of organisms and is often depicted in a diagram known as a phylogenetic tree. Phylogenies are used in a range of applications, including studying pathogens [1], representing the relationships between species in the tree of life, studying speciation and extinction [2], studying the spread of antibiotic resistance between species [3], identifying genes and non-coding RNAs in newly sequenced genomes [4], and reconstructing ancestral genomes [5]. These widespread applications have led to the development of many sophisticated methods for phylogeny reconstruction, reflecting the importance of accurate and efficient inference algorithms.
Most phylogenetic reconstruction methods assume that the underlying biological data follow a tree-like structure. However, it is known that this assumption does not always hold due to interspecific gene flows such as hybridization/introgression in eukaryotes and horizontal gene transfer (HGT) in prokaryotes.
Introgression, or introgressive hybridization, denotes the process of gene flow from one species into the gene pool of another through the recurrent backcrossing of an interspecific hybrid with one of its parental species; HGT is the nonsexual movement of genetic material between distantly related organisms, which enables the rapid transmission of genes between species and plays an essential role in the adaptation of microbes to novel environments. Interspecific gene flow is thought to be especially important in the evolution of a wide range of different species, including humans [6], mice [7], butterflies [8], fungi [9], and bacteria [10].
When reticulate events such as introgression or HGT occur, the evolutionary history can no longer be accurately represented by a simple, bifurcating tree. Instead, a phylogenetic network is required to capture these complex genetic exchanges. A phylogenetic network extends the conventional tree framework by incorporating reticulation nodes and reticulation edges, which explicitly capture gene flow. This more general model accurately reflects both vertical (tree-like) and non-vertical (reticulate) evolutionary processes.
Over the past decade, researchers have developed various computational methods aimed at detecting and analyzing non-tree-like evolutionary events [11–13]. However, the field has recently encountered significant challenges stemming from high-throughput sequencing technologies. Two key scalability issues have emerged. First, there is often a large number of taxa (i.e., species or strains) present in the data, potentially spanning hundreds or thousands of organisms. Second, the complexity and diversity of gene flows among these taxa amplify the computational burden. These challenges can undermine the performance of traditional non-tree-like phylogenetic methods, limiting both the size of datasets that can be analyzed and the accuracy of the resulting inferences [11, 14, 15].
1.1 Contributions
To address these scalability and complexity challenges, this dissertation focuses on developing novel phylogenetic methods and models designed for large-scale genomics and metagenomics studies involving non-tree-like evolutionary processes.
We first concentrate on introgression, a prevalent form of gene flow in eukaryotes. We introduce PHiMM, an introgression detection tool tailored to identify introgression in genomic datasets containing dozens of DNA sequences. PHiMM incorporates a newly devised coalescent-based approximation technique with a hidden Markov model (HMM) under a joint model of genetic drift, substitutions, recombination, and gene flow. This integrated approach reduces model complexity and enhances the scalability of introgression detection compared to existing methods. PHiMM requires a phylogenetic network to define the structure of the underlying HMM. Comparisons with other state-of-the-art tools on both simulated and empirical datasets suggest that PHiMM offers significantly improved runtime and memory usage over PhyloNet-HMM while maintaining comparable inference accuracy.
Next, we seek to boost PHiMM’s introgression inference performance and learning capabilities by leveraging SERES, a random-walk-based resampling tool. As a data perturbation technique, SERES is used to perform resampling on the input alignment to generate multiple replicates. PHiMM is then run on these resampled SERES replicates. The introgression probabilities inferred by PHiMM are aggregated across all SERES replicates to obtain final results. Simulation experiments demonstrate that combining the SERES resampling approach with PHiMM substantially improves introgression inference accuracy under various model conditions compared to standalone PHiMM, although this benefit does come at the cost of increased runtimes.
A primary limitation of PHiMM is its dependence on a known phylogenetic network as an input.
Consequently, we introduce DACS, an extension of PHiMM that integrates phylogenetic network inference with introgression detection. By embedding network construction and introgression analysis into a unified framework, DACS bypasses the need for a predefined network. Additionally, DACS employs divide-and-conquer strategies and subsampling techniques to handle ultra-large datasets more efficiently. These approaches allow us to break down large datasets into smaller, manageable subproblems, each of which can be analyzed independently before combining the results, thereby efficiently handling the complexity of the data while maintaining accuracy.
High-throughput metagenomic technologies have produced vast amounts of biological data, and the volume has grown exponentially in recent years. The data generated by metagenomic experiments are also inherently noisy and partial, sometimes containing as many as 10,000 species [16]. Extracting useful biological information from datasets of such magnitude poses substantial computational hurdles for researchers. To address these difficulties, we apply our proposed method, DACS, to metagenomic data. The first step is to process the metagenomic data into metagenome-assembled genomes (MAGs), to which the aforementioned methods can then be applied. Once MAGs are obtained, DACS can be applied to detect introgression or horizontal gene transfer (HGT) among diverse microbial communities. We further study and discuss how the proposed methods differ between genomics and metagenomics in terms of input requirements, accuracy, scalability, and application conditions.
1.2 Organization
The dissertation is organized into the following chapters:
• Chapter 2 provides a background and foundation for the research presented in this dissertation.
We introduce the fundamental concepts behind non-tree-like evolution, including phylogenetic networks, the multi-species network coalescent (MSNC) model, and introgression.
• Chapter 3 details the development and design of PHiMM, our novel introgression mapping algorithm. We demonstrate how PHiMM significantly expands the scalability of introgression detection and achieves competitive or superior performance compared to existing tools.
• Chapter 4 focuses on empirical validations of PHiMM. We apply PHiMM to a variety of real-world datasets, indicating its effectiveness in practical scenarios.
• Chapter 5 explains the application of the SERES random-walk-based resampling approach, illustrating how integrating SERES with PHiMM can further boost introgression inference accuracy in challenging cases.
• Chapter 6 introduces DACS, an improved introgression mapping framework that integrates phylogenetic network inference and adopts divide-and-conquer and subsampling techniques. We highlight its ability to handle ultra-large datasets without prior knowledge of the underlying network.
• Chapter 7 explores the application of DACS-based introgression detection within the context of metagenomic studies. We elaborate on how our method is applied to microbial communities featuring vast species diversity and high levels of reticulation.
• Chapter 8 concludes the dissertation with a summary of our research findings and a discussion of potential future directions.
This dissertation thus aims to bridge the gap between the theoretical frameworks of reticulate evolution and their practical implementation in large-scale genomics and metagenomics. Through the development of scalable algorithms and robust computational tools, we provide researchers with new ways to decode the complexities of gene flow across the Tree of Life.

CHAPTER 2 BACKGROUND
2.1 Phylogenies
2.1.1 Phylogenetic Tree
A phylogenetic tree is a branching structure that depicts the evolutionary relationships between diverse biological species, genes, populations, or even individuals based on similarities and differences in their physical or genetic characteristics.
A tree is a connected acyclic graph. Let 𝑇 = (𝑉, 𝐸) be a phylogenetic tree, 𝑉 be a set of vertices or nodes, and 𝐸 be a set of edges or branches. Each edge 𝑒 ∈ 𝐸 is defined by two connected vertices 𝑒 = (𝑣1, 𝑣2), where 𝑣1, 𝑣2 ∈ 𝑉. The leaf nodes or leaves represent present-day taxa, such as species, populations, genes, or individuals, while the internal nodes typically symbolize extinct ancestors whose sequence data are unavailable. An internal node is also called the most recent common ancestor (MRCA) of its descendants. The ancestor of all sequences is the root of the tree. The branches or edges represent the time estimates of the evolutionary relationships among taxa. A branch or edge that is incident with a leaf is called an external branch or external edge, whereas one connecting two internal nodes is called an internal branch or internal edge.
A tree in which the root is specified is known as a rooted tree, whereas a tree lacking a determined root is referred to as an unrooted tree (Figure 2.1). For 𝑛 taxa, there are (2𝑛 − 3)!! distinct rooted binary trees and (2𝑛 − 5)!! distinct unrooted binary trees. For an unrooted tree, a commonly used strategy to identify the root and produce a rooted tree is outgroup rooting. An outgroup is a species distantly related to all species in an unrooted tree. By including an outgroup in the tree reconstruction, the root can be placed on the branch connecting the outgroup, resulting in a rooted subtree for the ingroups [17].
For readability and legibility in computer programs, rooted trees are commonly written in the Newick format.
It uses a pair of parentheses to group sister taxa into one clade, with taxa separated by commas. A semicolon marks the end of the tree. Branch lengths, if present, are prefixed by colons. Figure 2.1 shows an example of a rooted tree, which can be written as “((((A, B), C), D), E);” without branch lengths, or “((((A: 0.1, B: 0.2): 0.12, C: 0.3): 0.123, D: 0.4): 0.1234, E: 0.5);” with branch lengths. The Newick format can also be used to represent unrooted trees. An unrooted tree has several Newick representations due to the flexibility of root placement on the tree. For example, the unrooted tree shown in Figure 2.1 can also be represented as “((((A, B), C), D), E);”.
Figure 2.1 The illustration of the rooted and unrooted phylogenetic trees. In the rooted tree on the left, a specific ancestor is designated as the root (at the base of the diagram), implying a clear evolutionary direction as lineages diverge over time. In contrast, the unrooted tree on the right shows the same relationships among taxa (A, B, C, D, E) without specifying a common ancestor, thus omitting any time or evolutionary direction. Branch lengths can indicate the amount of sequence divergence (e.g., 0.1 substitutions per site). Taxa labeled A, B, C, D, and E represent different species or taxonomic units. This figure is reproduced from Yang [17].
2.1.2 Phylogenetic Network
Although phylogenetic trees are widely used to represent evolutionary relationships, true evolutionary histories are not always tree-like. Phylogenetic networks are thus introduced to represent evolutionary relationships when reticulation events, such as hybridization, introgression, and horizontal gene transfer (HGT), are involved. Phylogenetic networks explicitly model such histories by introducing reticulation nodes that capture gene flow.
A phylogenetic network is a directed acyclic graph (DAG) with at least one reticulation event [18].
Let 𝑁 = (𝑉, 𝐸) be a phylogenetic network with $V = \{r\} \cup V_L \cup V_T \cup V_N$, where
• $r$ is the root of the network: $indegree(r) = 0$ (the node has no parent);
• $V_N$ is the set of reticulation nodes in 𝑁: $\forall v \in V_N$, $indegree(v) = 2$ (nodes have two parents) and $outdegree(v) = 1$ (nodes have one child);
• $V_T$ is the set of tree nodes in 𝑁: $\forall v \in V_T$, $indegree(v) = 1$ (nodes have one parent) and $outdegree(v) \geq 2$ (nodes have at least two children);
• $V_L$ is the set of leaves in 𝑁: $\forall v \in V_L$, $indegree(v) = 1$ (nodes have one parent) and $outdegree(v) = 0$ (nodes have no child).
$E \subseteq V \times V$ is the set of edges of network 𝑁. There are two distinct types of edges: reticulation edges, which point to reticulation nodes, and tree edges, which point to tree nodes.
A phylogenetic network can be characterized by the set of tree topologies it induces. This set can be obtained in two steps. First, the phylogenetic network is converted to a multilabeled tree (MUL-tree), in which leaves are not uniquely labeled by a taxon set [19, 20]. Then, a set of trees is obtained by keeping only one valid mapping of taxa in the MUL-tree. For simplicity, the set of induced trees is also called the set of MUL-trees. As shown in Figure 2.2, the set of MUL-trees, $T_1$ and $T_2$, is induced from the network 𝑁.
An extended Newick format can be employed to represent phylogenetic networks for easy use in computer programs [21]. It introduces the “#” symbol to annotate the reticulation nodes, which are duplicated and numbered consecutively. For example, the network in Figure 2.2 can be represented as “((A:1, (B:0.5)X#H1:0.5):1, ((X#H1:0,C:0.5):0.5, D:1));” or “((A:1, X#H1:0.5):1, (((B:0.5)X#H1:0,C:0.5):0.5, D:1));”.
2.1.3 Gene Tree, Species Tree, and Species Network
The phylogeny that illustrates the relationships among a group of species is referred to as the species tree or network. The tree corresponding to a set of gene sequences from the species is known as the gene tree.
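As a concrete check of the node classes defined in Section 2.1.2, the following minimal Python sketch classifies the nodes of the Figure 2.2 network by in-degree and out-degree. The edge list is read off the extended Newick string given above; the internal node names "u", "v", and "w" are hypothetical labels introduced only for this illustration, and the function name `classify_nodes` is likewise an assumption rather than part of any tool described in this dissertation.

```python
from collections import defaultdict

# Directed (parent, child) edges of the Figure 2.2 network, read off the
# extended Newick string. "r" is the root and "X" the reticulation node;
# the internal names "u", "v", "w" are hypothetical labels for this sketch.
EDGES = [
    ("r", "u"), ("r", "w"),
    ("u", "A"), ("u", "X"),
    ("w", "v"), ("w", "D"),
    ("v", "X"), ("v", "C"),
    ("X", "B"),
]


def classify_nodes(edges):
    """Label every node as root, reticulation, tree, or leaf by its degrees."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    nodes = set()
    for parent, child in edges:
        outdeg[parent] += 1
        indeg[child] += 1
        nodes.update((parent, child))

    classes = {}
    for v in sorted(nodes):
        i, o = indeg[v], outdeg[v]
        if i == 0:
            classes[v] = "root"
        elif i == 2 and o == 1:
            classes[v] = "reticulation"
        elif i == 1 and o >= 2:
            classes[v] = "tree"
        elif i == 1 and o == 0:
            classes[v] = "leaf"
        else:
            classes[v] = "invalid"  # violates the definition in Section 2.1.2
    return classes


if __name__ == "__main__":
    for node, kind in classify_nodes(EDGES).items():
        print(node, kind)
```

Any node reported as "invalid" would indicate that the edge list does not satisfy the definition, which makes the function usable as a simple sanity check on hand-built networks.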
Figure 2.2 The illustration of a phylogenetic network. (A) A phylogenetic network 𝑁, where 𝑟 is the root node, and 𝑋 is a reticulation node. (B) The MUL-tree induced from the network 𝑁. (C) The induced trees $T_1$ and $T_2$ from network 𝑁.
For a species tree or network, the leaf nodes represent species, each of which includes the entire population of that species. Each internal node symbolizes a speciation event, which results in the split of one species’ population into subsequent species. The gene tree shows the evolutionary path of one particular gene, which is a particular region on the genome of all involved species.
The gene tree is not necessarily identical to the species tree or other gene trees. A variety of factors may render the gene tree different from the species tree, such as introgression, hybridization, horizontal gene transfer, recombination, incomplete lineage sorting, and gene duplication and loss. Some gene tree discordances do not affect the tree structure of the species tree. However, some of the discordances caused by gene flow, where genetic information transfers from one species to another, cannot be properly represented in a tree structure at the species level. In these cases, a tree structure can depict the evolutionary traces of genes, but a network structure with reticulations may be a better form to represent a species phylogeny.
Figure 2.3 Gene evolution within a species tree (((A,B),C),(D,E)) under the multi-species coalescent model. (A) A scenario where the gene tree is the same as the species tree. (B) A scenario where the gene tree has a different tree topology from the species tree, resulting from ILS. (C) A scenario where the gene tree is also the same as the species tree. (D) A scenario where the gene tree also displays a deep coalescence, and yet the gene tree matches the species tree. This figure is reproduced from Stadler and Degnan [22].
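To make the notion of gene tree and species tree discordance concrete, the sketch below compares two rooted Newick strings by their sets of clades. It is a minimal illustration, not a full Newick parser: it assumes well-formed input with unique leaf names, ignores branch lengths and internal labels, and does not handle the extended Newick “#” notation. The function name `clades` and the discordant gene tree used in the usage lines are hypothetical examples; only the species tree (((A,B),C),(D,E)) comes from Figure 2.3.

```python
def clades(newick):
    """Return the clades of a rooted Newick tree as a set of frozensets."""
    s = newick.strip().rstrip(";").replace(" ", "")
    found = set()
    pos = 0

    def skip_to_delimiter():
        # Skip an optional internal label and ":length" up to the next "," or ")".
        nonlocal pos
        while pos < len(s) and s[pos] not in ",)":
            pos += 1

    def parse():
        nonlocal pos
        if s[pos] == "(":
            pos += 1  # consume "("
            leaves = set()
            while True:
                leaves |= parse()
                if s[pos] == ",":
                    pos += 1
                else:
                    break
            pos += 1  # consume ")"
            skip_to_delimiter()
            found.add(frozenset(leaves))
            return leaves
        # Leaf: read the name, then skip any ":length" annotation.
        start = pos
        while pos < len(s) and s[pos] not in ",):":
            pos += 1
        name = s[start:pos]
        skip_to_delimiter()
        return {name}

    parse()
    return found


if __name__ == "__main__":
    species_tree = "(((A,B),C),(D,E));"   # species tree of Figure 2.3
    gene_tree = "(((A,C),B),(D,E));"      # hypothetical discordant gene tree
    print(clades(species_tree) == clades(gene_tree))
    print(clades(species_tree) ^ clades(gene_tree))
```

Comparing clade sets rather than strings makes the check insensitive to the order in which sister taxa are written, which matters because a single rooted topology has many equivalent Newick strings.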
2.1.4 Multi-Species Coalescent (MSC) Model
The multi-species coalescent (MSC) extends the coalescent introduced by Kingman [23] and offers a framework to analyze multi-locus genomic sequence data from different species in diverse inference problems, such as species tree inference.
In the MSC model, each species represents a population of individuals, where each gene involves a set of alleles. The species tree evolves forward in time, while the alleles trace their histories from the leaves to the root backward in time. Each allele at a leaf randomly picks its parent from the prior generation. Over time, two or more alleles may converge upon a common parent, giving rise to a coalescence event, wherein any two alleles have an equal probability of coalescing first. Eventually, all the alleles coalesce into a single allele. Thus, all of the alleles from different individuals constitute a gene tree, which is embedded within the species tree.
For any two alleles, the first chance of coalescence occurs on the edge located above their most recent common ancestor (MRCA). When two or more alleles do not coalesce on that first possible edge, but enter and coalesce on a subsequent edge (i.e., one further back in time and closer to the root), we call this event “deep coalescence”. Since any two alleles are equally likely to coalesce first, deep coalescence has the potential to create gene trees that differ from the species tree and from each other, although the gene trees still fit inside the species tree. Deep coalescence is also called incomplete lineage sorting (ILS).
Figure 2.3 depicts four gene trees that evolved within the same species tree under the MSC. Of the four gene trees, those in panels (A), (C) and (D) show the same tree topology as the species tree, while the incongruence between the gene tree and the species tree in panel (B) is due to deep coalescence or ILS.
Interestingly, the gene tree in panel (D) also illustrates a deep coalescence event, where the alleles from A and B do not coalesce on the edge above the MRCA of A and B, despite the gene tree matching the species tree. Note that ILS can take place without altering the topology.
Mathematically, the MSC model specifies how lineages coalesce through time. The structure of the tree generated by the coalescent model is determined by coalescent times and how lineages are selected to merge at each coalescent event. The probability that two lineages coalesce in the generation immediately prior is equivalent to the chance they inherited their genetic material from the same parent sequence. Assuming a stable effective population size $N_e$, each genetic locus has precisely $2N_e$ candidate parental copies in the previous generation. Under random mating conditions, the probability of two alleles descending from an identical parental sequence is therefore $1/(2N_e)$, and conversely, the probability they do not coalesce at this generation is $1 - 1/(2N_e)$. Considering generations sequentially, the coalescence time follows a geometric distribution. Specifically, the probability that lineages coalesce exactly at generation $t$ is given by the product of the non-coalescence probabilities across the preceding $(t - 1)$ generations and the coalescence probability in generation $t$ itself:
$$P_c(t) = \left(1 - \frac{1}{2N_e}\right)^{t-1} \left(\frac{1}{2N_e}\right)$$
When the effective population size $N_e$ is sufficiently large, this discrete probability distribution is closely approximated by a continuous exponential form:
$$P_c(t) \approx \frac{1}{2N_e}\, e^{-\frac{t-1}{2N_e}}$$
2.1.5 Multi-Species Network Coalescent (MSNC) Model
The MSC model has been introduced in the context of phylogenetic trees. It can be easily extended to the phylogenetic network [24], where the reticulation edges are used for tracing gene flow events, including hybridization, introgression, and horizontal gene transfer (HGT).
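Before turning to networks in detail, the pairwise coalescence-time distribution from Section 2.1.4 can be checked numerically. The following minimal Python sketch evaluates the geometric form and its exponential approximation for a given generation $t$ and effective population size $N_e$. The function names and the specific parameter values are assumptions chosen for illustration only.

```python
import math


def p_coalesce_geometric(t, n_e):
    """Probability that two lineages coalesce exactly t generations back."""
    p = 1.0 / (2.0 * n_e)          # per-generation coalescence probability
    return (1.0 - p) ** (t - 1) * p


def p_coalesce_exponential(t, n_e):
    """Exponential approximation of the geometric form for large n_e."""
    p = 1.0 / (2.0 * n_e)
    return p * math.exp(-(t - 1) * p)


def relative_error(t, n_e):
    exact = p_coalesce_geometric(t, n_e)
    return abs(exact - p_coalesce_exponential(t, n_e)) / exact


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to contrast the two regimes.
    for n_e, t in [(10_000, 20_000), (5, 10)]:
        print(n_e, t, relative_error(t, n_e))
```

Under these assumed values, the two forms agree closely when $N_e$ is large and diverge noticeably when $N_e$ is very small, which is the condition stated in the text for the approximation to hold.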
Figure 2.4 shows four scenarios where gene trees evolve within the same species network under the multi-species network coalescent (MSNC) model. Of the four gene trees, the gene tree in panel (A) has neither gene flow nor ILS, and matches the species network. The gene tree in panel (C) has only gene flow, so it matches the species network with the reticulation edge. The gene tree in panel (B) displays only ILS, resulting in a mismatch to the species network. Interestingly, the gene tree in panel (D) also matches the species network, but includes both gene flow and ILS.
Formally, suppose we have $m$ independent genomic loci, represented by a set of alignments $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$, where each $S_i$ contains sequence information for locus $i$. Typically, each alignment $S_i$ might consist of sequences from multiple species under study or represent data from a single bi-allelic genetic marker (e.g., a vector of binary indicators such as single nucleotide polymorphisms, SNPs). The MSNC model consists of two main components:
1. A phylogenetic network Ψ, characterized by its topological structure and associated continuous parameters, including divergence times;
2. A vector Γ that represents inheritance probabilities.
The likelihood under this model can be mathematically expressed as follows:
$$p(\mathcal{S} \mid \Psi, \Gamma) = \prod_{i=1}^{m} \int_{g} p(S_i \mid g)\, p(g \mid \Psi, \Gamma)\, dg$$
where the integration is taken over all possible gene trees, $p(S_i \mid g)$ is the probability of the sequence alignment $S_i$ given a particular gene tree $g$ [25], and $p(g \mid \Psi, \Gamma)$ is the density of the gene tree (topologies and branch lengths) given the model parameters, defined as [26]:
$$p(g \mid \Psi, \Gamma) = \sum_{h \in H_\Psi(g)} \frac{w(h)}{d(h)} \prod_{b \in E(\Psi)} \frac{w_b(h)}{d_b(h)}\, \Gamma[b, j]^{u_b(h)}\, p_{u_b(h) v_b(h)}(\lambda_b)$$
where $H_\Psi(g)$ denotes the set of all coalescent histories that explain a gene tree $g$ appearing inside a species tree with topology Ψ and a vector of branch lengths $\lambda$, $u_b(h)$ and $v_b(h)$ denote the number of lineages entering and exiting edge $b$ of Ψ under coalescent history $h$, $w_b(h)$ denotes the number of possible ways the coalescent events could have occurred consistently with the gene tree $g$, $d_b(h)$ denotes the number of sequences of coalescences that give the number of coalescent events specified by $h$, and $p_{u_b(h) v_b(h)}(\lambda_b)$ denotes the probability of $u_b$ lineages entering the branch and $v_b$ lineages exiting the branch over a branch of length $\lambda_b$.
The posterior $p(\Psi, \Gamma \mid \mathcal{S})$ of the model is proportional to
$$p(\Psi, \Gamma \mid \mathcal{S}) \propto p(\mathcal{S} \mid \Psi, \Gamma)\, p(\Psi)\, p(\Gamma) = p(\Psi)\, p(\Gamma) \prod_{i=1}^{m} \int_{g} p(S_i \mid g)\, p(g \mid \Psi, \Gamma)\, dg$$
where $p(\Psi)$ and $p(\Gamma)$ are the priors on the phylogenetic network (and its parameters) and the inheritance probabilities, respectively.
While such a full likelihood approach uses all the information in the data, it can be computationally intensive. One common strategy for speeding up inference is to preprocess each locus by estimating its local genealogies. The observed data are then summarized as a set of inferred gene trees $\mathcal{G} = \{G_1, G_2, \ldots, G_m\}$.
In this case, the likelihood formulation simplifies to
$$p(\mathcal{G} \mid \Psi, \Gamma) = \prod_{i=1}^{m} p(G_i \mid \Psi, \Gamma)$$
where $G_i$ is the gene tree that has been inferred for locus $i$, and $p(G_i \mid \Psi, \Gamma)$ is the density of the gene tree $G_i$ (topologies and branch lengths) given the model parameters [26]. Because this approach relies on accurate gene tree inference, any errors in gene tree estimation can propagate to the final network inference.
Despite these optimizations, exact likelihood and posterior computations can still pose significant computational challenges. Hence, pseudo-likelihood methods have been developed. For example, Yu and Nakhleh [27] introduced the pseudo-likelihood of phylogenetic network Ψ and inheritance probabilities Γ given a set of gene trees $G$:
$$L(\Psi, \Gamma \mid G) = p(G \mid \Psi, \Gamma) = \prod_{\{X, Y, Z\} \subseteq \mathcal{X}} f\big(\rho(XY|Z, G),\, \rho(XZ|Y, G),\, \rho(YZ|X, G) \mid \Psi, \Gamma\big)$$
where $\mathcal{X}$ is the set of taxa, $XY|Z$, $XZ|Y$ and $YZ|X$ are binary triples with $\{X, Y, Z\} \subseteq \mathcal{X}$, and $\rho$ is the number of times a binary triple is induced by $G$.
2.2 Phylogenetic Inference
2.2.1 Simulation Studies
Many methods have been developed to address phylogeny inference problems. Each represents a different tradeoff between accuracy with respect to the true phylogeny and computational requirements. Simulation studies are a widely used way to assess the performance of different methods by directly comparing the phylogeny estimated by each method against a true phylogeny. By generating synthetic or simulated datasets, simulation studies provide important insights into the relative performance of different phylogeny inference methods.
A simulation study proceeds according to the following steps:
(1) A model phylogeny (tree or network) is simulated randomly, where the r8s [28] tool is often utilized to generate a random tree, while the steps listed in Hejase et al. [11] can be used to add reticulations to the tree to generate a network.
(2) Local genealogies or gene trees are simulated for independent and identically distributed (i.i.d.) loci following a species phylogeny under an MSC or MSNC model. In this step, msmove [29] and ms [30] are often employed to simulate the local genealogies.
(3) DNA sequences are evolved following local gene trees under a specific substitution model [31], in which seq-gen [32] and INDELible [33] are two popularly used tools.
Figure 2.4 Gene evolution within a species network under the multi-species network coalescent (MSNC) model. (A) A scenario where the gene tree has neither gene flow nor ILS, and matches the species network. (B) A scenario where the gene tree displays only ILS, resulting in a mismatch to the species network. (C) A scenario where the gene tree has only gene flow, and matches the species network with the reticulation edge. (D) A scenario where the gene tree also matches the species network, but includes both gene flow and ILS. This figure is reproduced from Stadler and Degnan [22].
2.2.2 Phylogenetic Tree Inference
2.2.2.1 Gene Tree Inference
The inference of phylogenetic trees is a basic and fundamental problem in phylogenetic studies. A variety of methods have been developed to tackle this issue. These methods can be categorized into two main groups: distance-based methods and optimization-based methods.
The first category of phylogenetic tree inference methods is distance-based methods. In those methods, a distance matrix is first formed by calculating the distances between pairs of sequences, and then converted into a phylogenetic tree by clustering algorithms. The most popular methods within this category are UPGMA (Unweighted Pair Group Method using Arithmetic mean) [34] and neighbor-joining [35].
Other methods are optimization-based methods, which employ an optimality criterion or objective function to measure a tree’s compatibility with the data. The tree that achieves the optimal score is considered the estimate of the true tree.
Examples in this category include maximum parsimony [36], maximum likelihood (ML) [25], and Bayesian methods [37]. In the maximum parsimony method, the tree score is the minimum number of character changes required for the tree, and the maximum parsimony tree or most parsimonious tree is the tree with the smallest tree score. The ML method uses the log-likelihood value of the tree to measure the fit of the tree to the data, and the maximum likelihood tree is the tree with the highest log-likelihood value. The ML method of gene tree estimation was first introduced by Felsenstein [25] and has been implemented in programs such as PAML [38], PhyML [39], RAxML [40], RAxML-NG [41], IQ-Tree [42], and FastTree [43]. In the Bayesian method, the posterior probability of a tree is the probability that the tree is true given the data. The tree with the maximum posterior probability is the estimate of the true tree. The Bayesian method was introduced into molecular phylogenetics in the 1990s [37] and has been implemented in programs such as MrBayes [44], RevBayes [45], BEAST [46, 47], and PhyloBayes [48, 49].

Distance-based methods are often computationally faster than optimization-based methods, and can be easily applied to analyze different kinds of data as long as pairwise distances can be calculated [17]. Among optimization-based methods, parsimony is known to be more prone than likelihood methods to systematic errors, including long branch attraction (LBA) [50], which is the phenomenon where two branches that are in truth not sisters are inferred to be sister branches when using maximum parsimony inference (see Figure 2.5). This occurs because, unlike likelihood, parsimony does not take into account branch lengths when computing the parsimony score. The situations in which phylogenetic trees produce data susceptible to this failure are referred to as the “Felsenstein Zone” [51].

Figure 2.5 An illustration of long branch attraction. This figure contrasts the true phylogenetic tree (left) with a parsimony-predicted tree (right) to highlight the phenomenon of long-branch attraction, where two branches that are in truth not sisters are inferred to be sister branches when using maximum parsimony inference. In the true tree, lineages 1 and 2 share a more recent common ancestor than either does with 3 or 4, yet in the parsimony-predicted tree, lineages 1 and 4 are incorrectly grouped as sister taxa. This occurs because, unlike likelihood, parsimony does not take into account branch lengths when computing the parsimony score. The situations in which phylogenetic trees produce data susceptible to this failure are referred to as the “Felsenstein Zone” [51].

2.2.2.2 Species Tree Inference

Reconstruction of a species tree usually requires the analysis of a set of genes, due to the limited evolutionary signal provided by each gene and the discordance between gene trees. To better infer the species tree, sufficient gene trees, associated with the corresponding gene tree distribution, are needed.

Many methods have been developed to address species tree inference. There are three main categories of these methods: concatenation methods, summary-based methods (or supertree methods), and co-estimation methods (see Figure 2.6).

In the concatenation methods, sequences from multiple genes are first concatenated into a single genome-level alignment, called a supermatrix. Then, the species tree for the taxa of interest is estimated from the supermatrix. However, these methods ignore the discordance between gene trees, and may thus lead to statistically inconsistent trees and poor accuracy [52]. Commonly-used programs implementing the multi-species coalescent (MSC) model include StarBEAST [53, 54] and BPP [55, 56].
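The concatenation step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of any particular program: the function name `concatenate`, the dictionary-based alignment format, and the choice to pad taxa missing from a gene with gap characters are all assumptions made for the example.

```python
def concatenate(gene_alignments, missing="-"):
    """Join per-gene alignments (dicts: taxon -> sequence) into one supermatrix.

    Assumption for this sketch: a taxon absent from a gene is padded with
    `missing` characters so that all rows of the supermatrix have equal length.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    supermatrix = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment must have equal length")
        gene_len = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon].append(aln.get(taxon, missing * gene_len))
    return {taxon: "".join(parts) for taxon, parts in supermatrix.items()}


# Hypothetical toy input: two genes, taxon C missing from the first gene
# and taxon B missing from the second.
genes = [{"A": "ACGT", "B": "ACGA"}, {"A": "TT", "C": "TG"}]
supermatrix = concatenate(genes)
```

A single tree is then estimated from `supermatrix` as if it were one long gene, which is exactly why gene tree discordance is invisible to this class of methods.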
For the summary-based methods, on the other hand, the gene trees are first estimated independently based on their sequences, and then summarized to construct the overall species tree. There are many methods in this category, such as “minimizing deep coalescence (MDC)”-based methods [57–59], and weighted quartet-based methods [60, 61]. Popular summary-based programs include ASTRAL [60, 61] and MP-EST [62]. These methods are computationally efficient and can analyze thousands of genes but may suffer from errors in reconstructed gene trees [50].

The co-estimation methods simultaneously infer both gene trees and the species tree based on a Bayesian model and the Markov Chain Monte Carlo (MCMC) algorithm, and have shown higher accuracy in species tree inference than concatenation methods alone [53, 63].

Figure 2.6 Species tree inference methods, including concatenation methods, summary-based methods (or supertree methods), and co-estimation methods. In the concatenation methods, sequences from multiple genes are first concatenated into a single genome-level alignment, called a supermatrix. Then, the species tree for the taxa of interest is estimated from the supermatrix. However, these methods ignore the discordance between gene trees, and may thus lead to statistically inconsistent trees and poor accuracy [52]. For the summary-based methods, on the other hand, the gene trees are first estimated independently based on their sequences, and then summarized to construct the overall species tree. These methods are computationally efficient and can analyze thousands of genes but may suffer from errors in reconstructed gene trees [50]. The co-estimation methods simultaneously infer both gene trees and the species tree based on a Bayesian model and the Markov Chain Monte Carlo (MCMC) algorithm, and have shown higher accuracy in species tree inference than concatenation methods alone [53, 63].

2.2.2.3 Metric

Sometimes we may wish to quantify the similarity between two trees. For instance, we might be interested in comparing trees derived from different methods, or in evaluating the dissimilarity between the true tree and the estimated tree in a simulation study in order to assess a tree inference method.

Phylogenetic trees can be represented by a collection of bipartitions. A bipartition refers to a distinct split of the leaves obtained by removing one edge of an unrooted tree. Given an unrooted tree $T = (V, E)$, each edge defines a bipartition of the taxa. We denote by $C(T)$ the set of non-trivial bipartitions of $T$, in which each non-trivial bipartition on the leaf set of $T$ is generated by the removal of an internal edge. The Robinson-Foulds (RF) distance [64] of two unrooted phylogenetic trees $T_1$ and $T_2$ is defined as

$$d(T_1, T_2) = |C(T_1) \setminus C(T_2)| + |C(T_2) \setminus C(T_1)|$$

That is, the RF distance is defined as the total number of false positive bipartitions and false negative bipartitions. For ease of use, the RF distance can be normalized between 0 and 1 by dividing $d(T_1, T_2)$ by the maximal possible RF distance, whose value is $2n - 6$ for binary unrooted trees, where $n$ is the number of leaves in a tree. Although the RF distance here is defined for unrooted trees, the same definition can be easily applied to rooted trees by virtually attaching an outgroup to the root.

Furthermore, the missing branch rate is another commonly-used measure for evaluating the difference between the true tree and the estimated tree. The missing branch rate is computed as the proportion of bipartitions present in the true tree but absent in the estimated tree.
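As a concrete illustration of the two measures defined above, the following Python sketch computes the RF distance and the missing branch rate on small trees. It is a minimal sketch under stated assumptions, not the implementation used in this dissertation: trees are written as nested tuples of leaf-name strings (a hypothetical encoding chosen for brevity), both trees are assumed to share the same leaf set with at least four leaves, and the function names are invented for the example.

```python
def leaf_set(tree):
    """Return the set of leaf labels below a nested-tuple node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions C(T) of a tree written as nested tuples."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed reference leaf used to canonicalize each split
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        other = all_leaves - clade
        if len(clade) >= 2 and len(other) >= 2:
            # Store the side that does not contain the anchor leaf, so the
            # same split is always represented by the same set.
            splits.add(other if anchor in clade else clade)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def rf_distance(t1, t2, normalize=False):
    """d(T1, T2) = |C(T1) \\ C(T2)| + |C(T2) \\ C(T1)|."""
    c1, c2 = bipartitions(t1), bipartitions(t2)
    d = len(c1 - c2) + len(c2 - c1)
    if normalize:
        n = len(leaf_set(t1))
        return d / (2 * n - 6)  # maximum RF distance for binary unrooted trees
    return d


def missing_branch_rate(true_tree, est_tree):
    """Proportion of true-tree bipartitions absent from the estimated tree."""
    c_true, c_est = bipartitions(true_tree), bipartitions(est_tree)
    return len(c_true - c_est) / len(c_true)


# Toy example: the two trees disagree on whether B or C is sister to A.
true_tree = ((("A", "B"), "C"), ("D", "E"))
est_tree = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(true_tree, est_tree))                  # 2
print(rf_distance(true_tree, est_tree, normalize=True))  # 0.5
print(missing_branch_rate(true_tree, est_tree))          # 0.5
```

Canonicalizing each split by the side that excludes a fixed reference leaf means the nested-tuple root is ignored, so the comparison is effectively between unrooted trees, in line with the definition above.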
2.2.3 Phylogenetic Network Inference

2.2.3.1 Inference Methods

Numerous computational methodologies have been devised to infer phylogenetic networks from multi-locus data, including the maximum parsimony method, the maximum likelihood method, the maximum pseudo-likelihood method, and the Bayesian inference method.

The maximum parsimony (MP) methods employ the minimizing deep coalescence (MDC) criterion [57], which searches for the network that minimizes the number of deep coalescences needed to account for all input gene trees.

Maximum likelihood methods calculate the model likelihood under the multi-species network coalescent (MSNC) model, where only gene tree topologies (MLE) or gene tree topologies and branch lengths (MLE-length) are used [26]. One weakness of maximum likelihood methods is their high computational cost.

Maximum pseudo-likelihood (MPL) methods use pseudo-likelihood in the optimization criterion [27], which reduces the computational requirements but attains less accuracy than the MLE or MLE-length methods.

By leveraging a prior distribution in conjunction with a set of rooted gene tree topologies as input, the Bayesian inference method can infer the posterior distribution of the network with faster computations. The most widely-used method for the Bayesian inference of the phylogenetic network is the reversible-jump Markov chain Monte Carlo (RJMCMC) [65].

2.2.3.2 Metric

The tripartition-based distance [18] counts the proportion of tripartitions that are not shared between the two networks. For each node of the network, a tripartition consists of three sets of leaves: those that are strict descendants of it, those that are non-strict descendants of it, and those that are not descendants of it.
Given a phylogenetic network $N = (V, E)$ with $L$ as its set of leaves, an ancestor $s$ of a node $u$ in $N$ is referred to as a strict ancestor if all the paths from the root of $N$ to $u$ contain $s$, and otherwise as a non-strict ancestor of $u$. The tripartition of edge $e = (u, v)$ is defined as $(A(e), B(e), C(e))$, where
• $A(e) = \{s \in L \mid u \text{ is a strict ancestor of } s\}$;
• $B(e) = \{s \in L \mid u \text{ is a non-strict ancestor of } s\}$;
• $C(e) = \{s \in L \mid u \text{ is not an ancestor of } s\}$.

For two phylogenetic networks $N_1$ and $N_2$, the tripartition-based distance can be calculated by

$$d(N_1, N_2) = \frac{1}{2} \times \left( \frac{|\{e_1 \in E(N_1) \mid \nexists\, e_2 \in E(N_2),\ A(e_1) = A(e_2),\ B(e_1) = B(e_2),\ C(e_1) = C(e_2)\}|}{|E(N_1)|} + \frac{|\{e_2 \in E(N_2) \mid \nexists\, e_1 \in E(N_1),\ A(e_1) = A(e_2),\ B(e_1) = B(e_2),\ C(e_1) = C(e_2)\}|}{|E(N_2)|} \right)$$

where $E(N_1)$ and $E(N_2)$ denote the sets of edges of networks $N_1$ and $N_2$, respectively.

The reduction-based distance [66] is also commonly used to measure the difference between phylogenetic networks. It quantifies the dissimilarity between two phylogenetic networks by considering their reduced topologies. We first define the reduction of a network. Given a phylogenetic network $N$, its reduced network $N'$ can be obtained by creating symbolic leaf nodes to replace and represent each maximal subtree where reticulation nodes are not included.

For two reduced phylogenetic networks $N_1$ and $N_2$, the reduction-based distance can be calculated by

$$d(N_1, N_2) = \frac{1}{2} \times \left( \sum_{v \in U(N_1)} \max\{0,\ \kappa(v) - \kappa(v')\} + \sum_{u \in U(N_2)} \max\{0,\ \kappa(u) - \kappa(u')\} \right)$$

where $U(N_1)$ and $U(N_2)$ denote the sets of unique nodes of networks $N_1$ and $N_2$, respectively. $v' \in U(N_2)$ is equivalent to $v \in U(N_1)$, while $u' \in U(N_1)$ is equivalent to $u \in U(N_2)$. Two nodes are considered equivalent if they are both leaf nodes and bear identical labels, or if they have the same number of children and their children are equivalent accordingly.
$\kappa(v)$, $\kappa(v')$, $\kappa(u)$ and $\kappa(u')$ represent the number of nodes equivalent to $v$, $v'$, $u$, and $u'$ in networks $N_1$, $N_2$, $N_2$, and $N_1$, respectively. Similar to the above-mentioned RF distance, the reduction-based distance roughly quantifies the number of rooted subnetworks that are present in one network but absent in the other [66].

2.3 Substitution Models

2.3.1 Definition

Substitution models are Markov models that characterize changes occurring over evolutionary time in macromolecules (e.g., DNA sequences) represented as sequences of symbols (A, C, G, and T in the case of DNA). Substitutions at any particular site are described by a Markov chain, wherein the symbols serve as the states of the chain. The primary characteristic of a Markov chain is its lack of memory: “given the present, the future does not depend on the past”. The sites in the sequence are commonly assumed to evolve independently of each other.

2.3.2 Nucleotide Substitution Models

The most general and commonly-used model of nucleotide substitution is the general time-reversible (GTR) model [67]. It is a continuous-time reversible Markov model that is parameterized by a stationary base probability distribution over the nucleotides $\pi$ and a $4 \times 4$ transition rate matrix $Q$,

$$
Q = \{q_{ij}\} =
\begin{array}{c|cccc}
  & T & C & A & G \\ \hline
T & \cdot & a\pi_C & b\pi_A & c\pi_G \\
C & a\pi_T & \cdot & d\pi_A & e\pi_G \\
A & b\pi_T & d\pi_C & \cdot & f\pi_G \\
G & c\pi_T & e\pi_C & f\pi_A & \cdot
\end{array}
$$

where each off-diagonal entry is the instantaneous substitution rate of one nucleotide to another (see Figure 2.7), and each diagonal entry (shown as a dot) is set so that its row sums to zero.

The Jukes-Cantor (JC or JC69) model [31] represents the simplest submodel of the GTR model, as it assumes both equal substitution rates and equal base frequencies, i.e., $a = b = c = d = e = f$ and $\pi_A = \pi_T = \pi_G = \pi_C = \frac{1}{4}$.
The transition rate matrix $Q$ is as follows (see also Figure 2.7),

$$
Q = \{q_{ij}\} =
\begin{array}{c|cccc}
  & T & C & A & G \\ \hline
T & \cdot & \lambda & \lambda & \lambda \\
C & \lambda & \cdot & \lambda & \lambda \\
A & \lambda & \lambda & \cdot & \lambda \\
G & \lambda & \lambda & \lambda & \cdot
\end{array}
$$

Kimura's two-parameter model, also known as the K80 model [68], is another simple submodel of the GTR model that assumes that all of the bases are equally frequent ($\pi_A = \pi_T = \pi_G = \pi_C = \frac{1}{4}$). Let the substitution rates be $\alpha$ for transitions and $\beta$ for transversions, where a transition is a substitution between two pyrimidines ($T \leftrightarrow C$) or between two purines ($A \leftrightarrow G$), while a transversion is a substitution between a pyrimidine and a purine ($T, C \leftrightarrow A, G$). The transition rate matrix $Q$ is as follows (see also Figure 2.7),

$$
Q = \{q_{ij}\} =
\begin{array}{c|cccc}
  & T & C & A & G \\ \hline
T & \cdot & \alpha & \beta & \beta \\
C & \alpha & \cdot & \beta & \beta \\
A & \beta & \beta & \cdot & \alpha \\
G & \beta & \beta & \alpha & \cdot
\end{array}
$$

Let the transition/transversion rate ratio $\kappa$ be $\alpha/\beta$. Up to the scaling factor $\beta$, the transition rate matrix $Q$ can also be represented as,

$$
Q = \{q_{ij}\} =
\begin{array}{c|cccc}
  & T & C & A & G \\ \hline
T & \cdot & \kappa & 1 & 1 \\
C & \kappa & \cdot & 1 & 1 \\
A & 1 & 1 & \cdot & \kappa \\
G & 1 & 1 & \kappa & \cdot
\end{array}
$$

These nucleotide substitution models can be employed to simulate the evolutionary process of nucleotide sequences along a rooted tree $T$, where branch lengths represent the amount of evolutionary time. First, the sequence at the root is randomly generated based on the stationary base probability distribution of nucleotides $\pi$.
Then, the sequences on the descendant nodes are iteratively obtained according to the transition probability matrix $P(t) = e^{Qt}$ [69], where the time $t$ is the branch length, until simulated sequences have been created at all nodes of $T$.

To incorporate rate heterogeneity across sites, the instantaneous substitution rates of any substitution model can be multiplied by a rate sampled from the $\Gamma$ distribution for rate variation across sites [70]. For example, GTR+$\Gamma$ is the GTR substitution model in combination with the $\Gamma$ model to produce rate variation across sites.

Figure 2.7 An illustration of the JC69, K80, and GTR substitution models. In the JC69 model (left), all substitutions, whether transition or transversion, are assumed to occur at the same rate $\lambda$. In the K80 model (middle), transitions (dashed arrows) occur at rate $\alpha$, whereas transversions (solid arrows) occur at rate $\beta$, thus distinguishing purine-purine or pyrimidine-pyrimidine substitutions from purine-pyrimidine substitutions. The GTR (General Time Reversible) model (right) is the most parameter-rich, allowing each of the six possible base substitutions to have its own rate parameter. In addition, GTR incorporates base frequency parameters ($\pi_A$, $\pi_T$, $\pi_G$, $\pi_C$) to account for unequal nucleotide frequencies in the sequences being modeled. These models progressively increase in complexity to capture the diverse evolutionary patterns that arise in real genetic data.

2.4 Gene Flow

2.4.1 Definition

Gene flow is the transfer of genetic material from one population to another. Gene flow between species primarily involves horizontal gene transfer (HGT) in prokaryotes and introgression (also known as introgressive hybridization) in eukaryotes.

2.4.2 Introgression

2.4.2.1 Definition

Introgression, or introgressive hybridization, denotes the process of gene flow from one species into the gene pool of another through the recurrent backcrossing of an interspecific hybrid with one of its parental species.
Introgression differs from hybridization, because simple hybridization results in a relatively even mixture of two parental species, while introgression results in a variable mixture involving a relatively small percentage from the donor species (Figure 2.8). Ancient introgression events have the potential to leave traces of extinct species in present-day genomes, a phenomenon referred to as ghost introgression [71].

Introgression can manifest as either neutral, where its effects on phenotypes go unnoticed, or adaptive, where it influences phenotypes in significant ways. The term “adaptive introgression” is used when the genetic transfer of minimal genomic regions from a donor species leads to positive fitness effects within the gene pool of the recipient species [72].

Figure 2.8 An illustration of introgression and hybridization. Introgression differs from hybridization, because simple hybridization results in a relatively even mixture of two parental species, while introgression results in a variable mixture involving a relatively small percentage from the donor species. For introgression on the left, a single hybridization event is followed by repeated backcrossing of hybrids to one parental species (Species A), causing the proportion of red chromosomes (from Species B) to diminish in each generation (e.g., 50%, 25%, 12.5%, 6.25%) while still persisting in the final population. For hybridization on the right, multiple or ongoing hybridization events occur between the two parental species across successive generations, leading to more complex admixture patterns. Here, the fraction of genetic material from each species can shift in different ways (e.g., returning to 62.5% red in one generation), reflecting repeated interbreeding rather than a single hybridization event followed strictly by backcrossing.

2.4.2.2 Inference Methods

The detection of introgression has become a major area of interest in evolutionary biology. Numerous computational methods have been proposed to distinguish true introgression events from incomplete lineage sorting (ILS), leveraging different features of genomic data, analytical formalisms, and statistical frameworks.

Early approaches for detecting introgression distinguish introgressive hybridization and ILS by using sequence data to construct gene tree topologies evolved down a phylogenetic network. Sang and Zhong [73] and Holder et al. [74] built and tested their model based on the idea that incongruent gene trees show a nearly equal distribution in the case of introgression but a significantly distinct distribution for ILS. Patterson et al. [75] detected introgressive hybridization by computing the genetic divergence of aligned sequences.

A notable class of recent methods employs hidden Markov models (HMMs) to analyze genomic data in the context of incomplete lineage sorting (ILS) [12, 76–79]. CoalHMM [76, 79] was originally proposed to tackle different population genetic inference issues but has since been expanded to address the challenge of statistical introgression mapping. Subsequently, Liu et al. [12] devised the PhyloNet-HMM method, which combined an HMM-based model with phylogenetic networks as input to detect introgression in genomes. This method is accurate but time- and memory-consuming because it takes all possible gene trees evolved down the input phylogenetic network into account.

Other methods bypass gene tree inference and calculate imbalances in minimum genetic distances to infer introgression. The method from Joly et al. [80] was based upon the fact that the calculated minimum pairwise sequence distance between two species is smaller for introgression events than for ILS. Joly [81] developed software called JML based on this idea.
Drawing inspiration from the reconciliation framework used in detecting horizontal gene transfer, Chauve et al. [82] introduced a method to reconcile gene trees within a species network, identifying gene flow events that yield consistent explanations for observed incongruences. This approach has been shown to detect extensive introgression even in large and complex datasets.

Taking advantage of the rapid progress in artificial intelligence technologies, Schrider et al. [83] introduced FILET, a machine learning-based pipeline designed to detect genomic introgression by integrating a comprehensive set of population genetic summary statistics. By capturing patterns of variation across multiple populations, FILET effectively identifies regions of introgressed ancestry. Expanding on this paradigm, Flagel et al. [84] proposed a convolutional neural network (CNN) framework that interprets population genomic alignments as images, thereby eliminating the need for predefined summary statistics. Through the automatic extraction of discriminative features across genomic windows, this CNN-based approach demonstrates high accuracy in detecting introgression events. More recently, Ray et al. [85] introduced IntroUNET, a deep learning framework that similarly treats alignments as images but employs semantic segmentation to pinpoint introgressed alleles with heightened resolution.

The D-statistic is based on the idea that a statistically significant imbalance between two SNP patterns is indicative of an introgression event. The D-statistic was first applied to hominoids using genome-scale SNP data and a sliding window analysis to detect introgression on four-taxon data despite the existence of ILS [6, 86, 87]. An extension of the D-statistic (called $D_{FOIL}$) for the symmetric five-taxon phylogeny was developed by Pease and Hahn [88]. Inspired by $D_{FOIL}$, Elworth et al.
[89] devised a general statistic, $D_{GEN}$, to automatically generate a statistic for detecting introgression in any number of genomes and any set of hybridization events. In recent years, many analogs to the D-statistic have been proposed to identify introgressed regions [90–97]. The disadvantage of this type of statistical method is that it assumes an infinite-sites model and independence across loci.

2.4.3 Horizontal Gene Transfer

2.4.3.1 Definition

Horizontal gene transfer (HGT) or lateral gene transfer (LGT) is the nonsexual movement of genetic material between distantly related organisms. Unlike the “vertical” transfer from parents to offspring, HGT enables the rapid transmission of genes between species and plays an essential role in the adaptation of microbes to novel environments [98]. In particular, there is a growing interest in HGT concerning the dissemination of antibiotic resistance genes among bacteria in diverse communities [99–102]. Most studies in genetics and biology have focused upon the “vertical” transfer, such as hybridization and introgression, but the importance of HGT among single-cell and even some multi-cell organisms is beginning to be acknowledged [103].

While HGT remains the primary mode of horizontal evolution scrutinized in genomic data, it is crucial to acknowledge several other mechanisms of horizontal evolutionary processes as well, including gene duplication and loss, and genetic recombination (Figure 2.9). Gene duplication is defined as the generation of new copies of a gene at distinct loci within the genome [104]. Gene loss refers to the absence or extinction of a gene that is identified when comparing different species, but can also encompass any allelic variant carrying a loss-of-function mutation found within a population [105]. Genetic recombination is the exchange of genetic material between different chromosomes, which renders different histories for neighboring segments within a gene [106].
Due to these processes, gene trees may be expected to be discordant with the species tree. As discussed before, the multi-species coalescent (MSC) model has served as a baseline to investigate gene tree discordance caused by incomplete lineage sorting (ILS), while the multi-species network coalescent (MSNC) model is proposed for analyzing the discordance introduced by gene flow, such as introgression, hybridization, and HGT. Furthermore, these models can also be extended to capture genetic recombination as well as gene duplication and loss. An example is shown in Figure 2.9.

Figure 2.9 Alternative sources of the discordance between the species tree and gene trees from horizontal evolutions. (A) A scenario where an HGT event makes the gene tree (A,(B,C)) incongruent to the species tree ((A,B),C), but matches the species network ((A,(B)X#H1),(X#H1,C)). (B) A scenario where the incongruence is caused by gene duplication and loss. (C) A scenario where the discordance comes from recombination. For the DNA segment depicted in red, the gene tree is ((A,B),C), but for the segment in green, the gene tree is ((A,C),B). This figure is reproduced from Degnan and Rosenberg [104].

2.4.3.2 Inference Methods

A number of approaches have been developed to detect HGT in genomics and metagenomics data. These methods can be broadly classified into two groups: sequence composition-based methods and phylogenetic methods (Figure 2.10).

Sequence composition-based methods search for fragments of a genome significantly different from the genomic average by computing sequence signatures, such as GC content or codon usage [107]. Genomic regions that exhibit distinct compositional characteristics compared to the background are referred to as “genomic islands”. There are two commonly-used tools for the identification of genomic islands: GIST [108] and IslandViewer [109].

Phylogenetic methods analyze the evolutionary histories of genes involved and identify conflicting phylogenies.
These methods can be further divided into explicit methods, which reconstruct and directly compare phylogenetic trees, and implicit methods, which employ surrogate measures in place of the phylogenetic trees [110]. For example, HGTector [111] and DarkHorse [112] are two popular implicit phylogenetic methods, which identify genes in genomes exhibiting taxonomically discordant similarity to genes present within a reference database; AnGST [113] and RANGER-DTL [114] are widely-used explicit phylogenetic methods that reconcile phylogenetic incongruencies between gene and species trees to detect HGT.

These methods address the detection of HGT from distinct perspectives, which may lead to remarkably different HGT inference results. However, these tools can be combined to produce more robust detections [115]. Based on this idea, MetaCHIP [116] has recently been proposed to identify HGT events in a natural community by combining an all-against-all BLASTN with the RANGER-DTL software.

Figure 2.10 An overview of HGT inference methods. The HGT inference methods can be broadly classified into two groups: (1) sequence composition-based methods and (2) phylogenetic methods. (1) Sequence composition-based methods search for fragments of a genome significantly different from the genomic average by computing sequence signatures, such as GC content or codon usage [107]. Genomic regions that exhibit distinct compositional characteristics compared to the background are referred to as “genomic islands”. (2) Phylogenetic methods analyze the evolutionary histories of genes involved and identify conflicting phylogenies. These methods can be further divided into explicit methods, which reconstruct and directly compare phylogenetic trees, and implicit methods, which employ surrogate measures in place of the phylogenetic trees [110]. This figure is reproduced from Ravenhall et al. [117].
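To make the sequence composition-based idea in this section concrete, the following Python sketch scans a genome with a sliding window and reports windows whose GC content departs from the genomic average. It is only a toy illustration: the function names, the fixed absolute-difference `threshold`, and the non-overlapping window scheme are assumptions for the example, and it does not reflect how GIST or IslandViewer actually identify genomic islands.

```python
def gc_content(seq):
    """Fraction of G and C characters in a nucleotide string."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(genome, window, step, threshold):
    """Return (start, gc) for windows whose GC content differs from the
    genome-wide average by more than `threshold` (a hypothetical cutoff)."""
    background = gc_content(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        gc = gc_content(genome[start:start + window])
        if abs(gc - background) > threshold:
            hits.append((start, gc))
    return hits


# Toy genome: an AT-rich background with a short GC-rich insert.
toy_genome = "AT" * 20 + "GC" * 5 + "AT" * 20
print(atypical_windows(toy_genome, window=10, step=10, threshold=0.5))
# [(40, 1.0)]
```

Real compositional methods typically use richer signatures (for example codon usage or k-mer frequencies) and a statistical test rather than a fixed cutoff, but the scan-and-compare-to-background structure is the same.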
CHAPTER 3
PHIMM: SCALABLE STATISTICAL INTROGRESSION DETECTION USING THE APPROXIMATE COALESCENT-BASED INFERENCE

3.1 Introduction

Hybridization, defined as the sexual reproduction of two organisms from distinct species, is a key evolutionary process that can alter genetic and phenotypic diversity. Introgression (or introgressive hybridization) occurs when gene flow from one species enters the gene pool of another through recurrent backcrossing of hybrids [118]. It has been estimated that up to 25% of plant species and 10% of animal species are capable of interspecific hybridization and introgression [119], highlighting the evolutionary significance of this phenomenon.

Introgressed regions in eukaryotic genomes are important, since introgression can be neutral, with no visible effect on phenotypes, but can also be adaptive and impact phenotypes. A widely-studied example of adaptive introgression involves the mice Mus musculus domesticus (M. m. domesticus) and Mus spretus (M. spretus). The anticoagulant rodenticide resistance of M. m. domesticus was acquired through the introgression of the Vkorc1 (vitamin K 2,3-epoxide reductase subcomponent 1) gene from M. spretus into M. m. domesticus [120]. Clearly, introgression plays a vital role in genetic evolution and adaptive divergence. Therefore, state-of-the-art techniques that can detect introgressed regions in the rapidly growing volume of genomic data are urgently needed.

Detecting introgressed regions can be achieved by scanning the genomes of the species and constructing the gene tree at each locus. The incongruence between gene trees evolved from different species trees offers an opportunity to detect introgression events. However, introgression detection remains a challenging problem, since distinguishing between introgression and incomplete lineage sorting (ILS) can be difficult.
ILS is a type of deep coalescence that occurs in the speciation event, which can also result in discordance between gene trees along the genomes. Although other processes, such as gene duplication and loss [121], may also cause incongruence between gene trees and species trees, we focus on introgression and ILS here. Thus, a powerful approach for the detection of introgressed regions in the presence of ILS is much needed but challenging to develop.

Recent methods have been designed to perform introgression mapping in the presence of ILS, such as the D-statistic and an array of hidden Markov model (HMM)-based techniques (see Section 2.4.2.2). Of great relevance to our work, Liu et al. [12] introduced an introgression mapping method, PhyloNet-HMM, which combines the multi-species network coalescent (MSNC) model [20] with an HMM to detect introgression. Although these methods have advanced the field, none are optimized for both speed and accuracy when analyzing datasets comprising dozens of DNA sequences.

To bridge this gap, we developed PHiMM (fast PhyloNet + Hidden Markov Model) [122], a refined introgression detection framework that integrates a coalescent-based approximation with an HMM under a model that combines substitutions, recombination, and gene flow. The simulation study demonstrates that PHiMM decreases the model complexity, which significantly expands the scalability of introgression detection while maintaining inference accuracy. By offering an efficient and accurate tool for introgression detection in increasingly large genomic datasets, PHiMM contributes to a deeper understanding of the evolutionary processes underpinning species divergence and adaptation.

3.2 Related Work

3.2.1 PhyloNet-HMM Algorithm

The PhyloNet-HMM algorithm infers introgression regions based on a hidden Markov model. The algorithm was first proposed by Liu et al.
[12], and then modified and implemented as a part of the PhyloNet version 3.6 package [13, 123].

The inputs of the PhyloNet-HMM algorithm are the aligned DNA sequences $A$ and a phylogenetic network $N$. The input alignment is $A \in \{\mathrm{A}, \mathrm{C}, \mathrm{T}, \mathrm{G}\}^{K \times L}$, where $K$ is the number of taxa and $L$ is the length of the genomic sequence alignment.

The HMM states in the PhyloNet-HMM algorithm correspond to all distinct pairs of a MUL-tree and a gene tree. The MUL-trees are encoded from the input species network $N$ using the methods described by Huber et al. [19] and Yu et al. [20], and the set of MUL-trees can be used to represent the species network $N$. The gene tree can be any possible rooted binary tree on $K$ leaves, so the total number of possible gene trees is $(2K - 3)!!$. Gene flow directionality is reflected in reticulation edge directionality in an explicit phylogenetic network. Let $m$ and $n$ be the number of MUL-trees and gene trees, respectively. The total number of hidden states is then $m \times n$.

As shown in Figure 3.1, the HMM hidden states can be represented by $s_{ij} = (T_i, G_j)$, where $T_i$ is the $i$-th MUL-tree ($1 \le i \le m$) and $G_j$ is the $j$-th gene tree ($1 \le j \le n$). The transition probability from the start state to a state $s_{ij} = (T_i, G_j)$ is calculated as follows:

$$ t_{(T_i, G_j)} = \frac{z(s_{ij})}{\sum_{k,l} z(s_{kl})} $$

where $z(s_{ij})$ is the probability of local gene tree $G_j$ under MUL-tree $T_i$, which can be calculated using the approach in Yu et al. [20].
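To make the size of this state space concrete, the following minimal sketch counts rooted binary gene tree topologies and normalizes a table of gene tree probabilities into start-state probabilities. The function names and the nested-list layout `z[i][j]` (probability of gene tree j under MUL-tree i) are illustrative assumptions, not part of PhyloNet-HMM.

```python
from math import prod


def num_rooted_binary_trees(k: int) -> int:
    """Number of rooted binary topologies on k labeled leaves, (2k-3)!!."""
    if k < 2:
        raise ValueError("need at least two leaves")
    # Product of the odd numbers 1, 3, ..., 2k-3.
    return prod(range(1, 2 * k - 2, 2))


def start_probabilities(z):
    """Normalize z[i][j] over all (MUL-tree, gene tree) pairs.

    z is a hypothetical m x n table; the result follows the start-state
    formula z(s_ij) / sum over all k, l of z(s_kl).
    """
    total = sum(sum(row) for row in z)
    return [[value / total for value in row] for row in z]


if __name__ == "__main__":
    for k in (5, 10):
        print(k, "taxa:", num_rooted_binary_trees(k), "gene tree topologies")
```

With m MUL-trees, the full model would need m times this many hidden states, which illustrates why the state space becomes impractical as the number of taxa grows.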
The transition from a state $s_{ij} = (T_i, G_j)$ to a state $s_{kl} = (T_k, G_l)$ ($1 \le i, k \le m$ and $1 \le j, l \le n$) occurs with the following probability:

$$ a_{(T_i, G_j) \to (T_k, G_l)} = \epsilon(T_i, T_k)\, \delta(G_j, G_l)\, \frac{z(s_{kl})}{\sum_{i,j} z(s_{ij})} $$

where $\epsilon(T_i, T_k)$ and $\delta(G_j, G_l)$ are calculated based on the following formulas:

$$ \epsilon(T_i, T_k) = \begin{cases} 1 - \Delta_T & \text{if } i = k \\ \dfrac{\Delta_T}{m - 1} & \text{if } i \ne k \end{cases} \qquad \delta(G_j, G_l) = \begin{cases} 1 - \Delta_G & \text{if } j = l \\ \dfrac{\Delta_G}{n - 1} & \text{if } j \ne l \end{cases} $$

where $\Delta_G$ represents the probability of switching between gene trees with different topologies (i.e., switching between columns in Figure 3.1), while $\Delta_T$ is the probability of switching between MUL-trees with different topologies (i.e., switching between rows in Figure 3.1).

Given a hidden state $s_{ij} = (T_i, G_j)$, the emission probability is calculated based on the input observation sequence $A \in \{\mathrm{A}, \mathrm{C}, \mathrm{T}, \mathrm{G}\}^{K \times L}$, where $K$ is the number of taxa and $L$ is the length of the genomic sequence alignment. We denote each site of the observation sequence $A$ as $a_i$ ($1 \le i \le L$). The emission probability at each site $a_i$ is calculated by:

$$ e_{s,\phi}(a_i) = P[a_i \mid s, \phi] = P[a_i \mid \ell_T, \ell_G, \phi] $$

where $\ell_T$ are the branch lengths of the MUL-tree and $\ell_G$ are the branch lengths of the gene tree associated with state $s = (T, G)$, and $\phi$ represents the specific substitution model under which emissions occur. In this study, the generalized time-reversible (GTR) model [67] is used.

Given the observation sequences $A$ and the HMM structure, the model parameters $\theta$ are learned to maximize $P(A \mid \theta)$, i.e., $\operatorname{argmax}_\theta P(A \mid \theta)$.
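The transition formula above can be read as a product of three factors. The sketch below evaluates it exactly as written, without any additional row normalization; the function names and the table `z[i][j]` are hypothetical and only serve to illustrate the arithmetic.

```python
def switch_factor(same: bool, delta: float, count: int) -> float:
    """Stay with probability 1 - delta, else spread delta over the others.

    Implements both epsilon (count = m MUL-trees, delta = Delta_T) and
    delta (count = n gene trees, delta = Delta_G) from the text.
    """
    if same:
        return 1.0 - delta
    return delta / (count - 1)


def transition_weight(i, j, k, l, z, delta_t, delta_g):
    """Weight of moving from state (T_i, G_j) to state (T_k, G_l).

    z[i][j] is an assumed table of gene tree probabilities under each
    MUL-tree; indices are zero-based here.
    """
    m, n = len(z), len(z[0])
    total = sum(sum(row) for row in z)
    eps = switch_factor(i == k, delta_t, m)
    dlt = switch_factor(j == l, delta_g, n)
    return eps * dlt * z[k][l] / total


if __name__ == "__main__":
    z = [[0.5, 0.3], [0.1, 0.1]]
    print(transition_weight(0, 0, 0, 0, z, delta_t=0.2, delta_g=0.4))
```

Each switching factor on its own sums to one over its index (one "stay" term plus count - 1 equal "switch" terms), which is what makes $\Delta_T$ and $\Delta_G$ interpretable as switching probabilities.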
The model parameters $\theta$ comprise:

• The set of MUL-trees with both topologies and branch lengths;
• The set of local gene trees with both topologies and branch lengths;
• DNA substitution model parameter $\phi$;
• MUL-tree and gene tree switching probabilities $\Delta_T$ and $\Delta_G$.

While the model likelihood for a fixed $\theta$ can be calculated efficiently using dynamic programming [124], it is computationally difficult to optimize the model likelihood by learning $\theta$. For this reason, local search heuristics, such as the Baum-Welch algorithm and the expectation-maximization algorithm [124], are typically used for HMM learning. In the PhyloNet version 3.6 implementation, the PhyloNet-HMM framework utilizes the BOBYQA algorithm [125] to iteratively perform multivariate optimization as part of a hill-climbing search (whereas the initial implementation of PhyloNet-HMM utilizes Brent's method for univariate optimization [126]).

Given an optimized model $\theta^*$, the forward and backward algorithms are used to calculate the posterior decoding probability:

$$ P(\pi_t = (T_i, G_j) \mid A, \theta^*) = \frac{f_t(i, j)\, b_t(i, j)}{P(A \mid \theta^*)} $$

where $A$ is the observed multiple sequence alignment with $L$ columns, and each aligned column of $A$ is $a_t \in A$ ($1 \le t \le L$); $\pi_t$ is the $t$-th state of the hidden state path $\pi$; $(T_i, G_j)$ ranges over all possible hidden states ($1 \le i \le m$, $1 \le j \le n$) as shown in Figure 3.1; the forward probability $f_t(i, j) = P(a_1, a_2, \ldots, a_t, \pi_t = (T_i, G_j) \mid \theta^*)$ is calculated using the forward algorithm; the backward probability $b_t(i, j) = P(a_{t+1}, a_{t+2}, \ldots, a_L \mid \pi_t = (T_i, G_j), \theta^*)$ is calculated using the backward algorithm; and $P(A \mid \theta^*)$ is the probability of the alignment, which can be computed by either the forward or the backward algorithm.

Finally, a modified posterior decoding approach is used to assess the confidence of introgression inference by the PhyloNet-HMM algorithm.
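As a rough illustration of the forward-backward posterior computation described above, the sketch below runs scaled forward and backward passes on a generic HMM whose states are flattened into a single index. It is a textbook-style sketch under assumed inputs (`init`, `trans`, and per-site emission likelihoods `emit[t][s]`), not the PhyloNet-HMM implementation; the helper `subset_probability` simply sums posteriors over a chosen subset of states.

```python
def posterior_decoding(init, trans, emit):
    """Return post[t][s] = P(state at site t is s | all sites).

    init[s]: start probability of state s (assumed given).
    trans[r][s]: transition weight from state r to state s.
    emit[t][s]: likelihood of alignment column t under state s.
    Per-site scaling keeps the recursion numerically stable.
    """
    n_sites, n_states = len(emit), len(init)
    states = range(n_states)

    fwd, scale = [], []
    for t in range(n_sites):
        if t == 0:
            col = [init[s] * emit[0][s] for s in states]
        else:
            col = [emit[t][s] * sum(fwd[t - 1][r] * trans[r][s] for r in states)
                   for s in states]
        c = sum(col)
        scale.append(c)
        fwd.append([v / c for v in col])

    bwd = [[1.0] * n_states for _ in range(n_sites)]
    for t in range(n_sites - 2, -1, -1):
        bwd[t] = [sum(trans[s][r] * emit[t + 1][r] * bwd[t + 1][r] for r in states)
                  / scale[t + 1]
                  for s in states]

    # Scaled forward times scaled backward equals f_t * b_t / P(A).
    return [[fwd[t][s] * bwd[t][s] for s in states] for t in range(n_sites)]


def subset_probability(posterior, subset):
    """Per-site posterior mass on a chosen subset of state indices."""
    return [sum(row[s] for s in subset) for row in posterior]
```

In a two-dimensional state layout such as the one in Figure 3.1, the subset passed to `subset_probability` would be the flattened indices of every state in the rows of interest.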
The modified posterior decoding probability that a site $a_t$ ($1 \le t \le L$) in the alignment $A$ has an introgressed origin is computed as follows:

$$ p_t = \sum_{\substack{T_i \in \Omega_T \\ 1 \le j \le n}} P(\pi_t = (T_i, G_j) \mid A, \theta^*) $$

where $\Omega_T$ represents the set of MUL-trees having introgressive origins.

3.3 Methods

3.3.1 Problem Definition

Based on the description of PhyloNet-HMM, the inputs of the problem are the aligned DNA sequences $A$ and a phylogenetic network $N$. Consistent with the study of Liu et al. [12], the output of the problem is a sequence of modified posterior decoding probabilities for the columns of the input alignment $A$. At a site $a_t$ ($1 \le t \le L$) in the alignment $A$, the probability of $a_t$ having an introgressive origin is defined by:

$$ p_t = \sum_{\substack{T_i \in \Omega_T \\ 1 \le j \le n}} P(\pi_t = (T_i, G_j) \mid A) $$

where $\Omega_T$ is the set of MUL-trees corresponding to introgression events.

3.3.2 PHiMM Algorithm

The PHiMM algorithm for statistical introgression mapping consists of a three-stage pipeline. The pseudocode of our algorithm is given in Algorithm 3.1, and the pipeline is illustrated in Figure 3.2.

Figure 3.1 An illustration of the PhyloNet-HMM framework. (A) The genomes of three species A, B, and C. (B) The species network of A, B, and C, which has a reticulate evolutionary history, where individuals in B have some genetic material from the common ancestor of B and A, and other genetic material from C. (C) Corresponding gene trees and parental species trees. The blue, red, and green "locus" in the genomes have $G_1$, $G_2$, and $G_3$ as their local gene trees, respectively. Further, gene trees $G_1$ and $G_3$ for the blue and green loci evolved within the parental species tree $T_1$, whereas gene tree $G_2$ for the red locus evolved within the parental species tree $T_2$. (D) The structure of the HMM (only states are shown).
The states $s_{11}$, $s_{12}$, and $s_{13}$ correspond to three possible local gene trees $G_1$, $G_2$, and $G_3$ with evolution following the parental tree $T_1$, while states $s_{21}$, $s_{22}$, and $s_{23}$ correspond to three possible local gene trees $G_1$, $G_2$, and $G_3$ with evolution following the parental species tree $T_2$. $s_{00}$ is the start state. Note that only transitions outgoing from $s_{11} = (T_1, G_1)$ are shown to simplify the presentation. This figure is reproduced from Liu et al. [12].

The first stage consists of the following "truncation" algorithm:

(a) Under the input species network model $\mathcal{N}$, we conduct a Monte Carlo sampling of $z$ local gene tree topologies from the gene tree topology distribution under the MSNC model. Here, $z$ is set to 1000 in the simulation experiments and empirical analyses. The observed frequency distribution of local gene tree topologies is normalized to obtain an estimated probability distribution $\hat{f}_{\mathcal{N}}$.

(b) Topologies in the domain of $\hat{f}_{\mathcal{N}}$ are ranked based on their estimated probabilities. Let $\Delta$ be the top $k_n$ topologies based on the topology ranking. In this study, $k_n$ is set to 30.

(c) The distribution $\hat{f}_{\mathcal{N}}$ is truncated such that the domain consists only of topologies in $\Delta$. Then, a truncated probability distribution $g_{\mathcal{N}}$ is obtained by normalizing the truncated distribution $\hat{f}_{\mathcal{N}}$.

The second stage of the PHiMM algorithm constructs a hidden Markov model (HMM) in a manner similar to the PhyloNet-HMM algorithm with a single modification. The set of MUL-trees $T_i$ ($1 \le i \le m$, where $m$ is the number of MUL-trees) in the MUL-tree representation of $N$ is enumerated using the procedure described by Yu et al. [20]. HMM state construction utilizes the truncated distribution $g_{\mathcal{N}}$ rather than the full theoretical distribution $f$: a row of HMM states is instantiated for each distinct MUL-tree $T_i$, and each state in a row corresponds to a distinct local gene tree topology $G_j$ ($1 \le j \le k_n = |\Delta|$) in the domain of $g_{\mathcal{N}}$.
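Steps (a) to (c) of the truncation stage can be sketched in a few lines. This is a minimal illustration that assumes the sampled topologies are already available as canonical strings (so that identical topologies compare equal); the sampler itself and the function name are not taken from the PHiMM software.

```python
from collections import Counter


def truncate_topology_distribution(sampled_topologies, k_n):
    """Estimate f_hat from samples and renormalize over the top k_n.

    sampled_topologies: list of hypothetical canonical topology strings,
    one per Monte Carlo draw.
    Returns (f_hat, g): the empirical distribution over all distinct
    topologies and the truncated, renormalized distribution.
    """
    counts = Counter(sampled_topologies)
    n_samples = len(sampled_topologies)

    # (a) Normalize observed frequencies into an estimated distribution.
    f_hat = {topo: c / n_samples for topo, c in counts.items()}

    # (b) Rank topologies and keep the k_n most frequent ones.
    top = counts.most_common(k_n)

    # (c) Renormalize over the retained topologies only.
    kept_total = sum(c for _, c in top)
    g = {topo: c / kept_total for topo, c in top}
    return f_hat, g


if __name__ == "__main__":
    draws = ["((A,B),C);"] * 6 + ["((A,C),B);"] * 3 + ["((B,C),A);"]
    f_hat, g = truncate_topology_distribution(draws, k_n=2)
    print(f_hat)
    print(g)
```

Each row of the HMM then needs only `len(g)` states (at most $k_n$) instead of one state per possible rooted topology.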
The final stage of the PHiMM algorithm performs model learning and statistical inference under the fitted model using the same procedures as the PhyloNet-HMM algorithm.

3.4 Materials

3.4.1 Simulation Data

The simulation study is used to evaluate the performance and applicability of the proposed PHiMM method, since we can track the true history of evolutionary events. The simulation data are constructed with several tools: r8s [28], ms [30], msmove [29], and seq-gen [32].

(1) Construction of model phylogenies

The model phylogenies are generated using the procedure of Hejase et al. [11].

Figure 3.2 The PHiMM pipeline. This flowchart illustrates the three-stage PHiMM algorithm for statistical introgression mapping. Stage 1 ("Truncation" Algorithm) begins by performing Monte Carlo sampling of local gene tree topologies under the input species network and ranking these topologies according to their empirical frequencies. Only the top $k_n$ topologies are retained, and a truncated probability distribution is formed. Stage 2 (HMM Construction) leverages the truncated set of topologies rather than the full gene tree distribution. Each MUL-tree derived from the species network is enumerated, and each state in the hidden Markov model corresponds to one of the retained local gene tree topologies attached to a specific MUL-tree. Stage 3 (Model Learning and Inference) applies maximum-likelihood or similar routines (as in PhyloNet-HMM) to the constructed HMM, producing a site-by-site probability of introgression along the input aligned genomes. Through these steps, PHiMM provides a computationally efficient framework to detect introgression signals while accounting for the complexity of network-like evolutionary histories. This figure is reproduced from Wuyun et al. [122].
Algorithm 3.1 PHiMM

procedure PHiMM($N$, $A$)    ⊲ $N$: Phylogenetic network
    $\mathcal{N}$ ← GetSpeciesNetworkModel($N$)    ⊲ $\mathcal{N}$: Species network model
    $\Delta_z$ ← ∅    ⊲ $\Delta_z$: Sampled gene tree topologies
    int $i$ ← 1
    while $i \le z$ do    ⊲ $z$: Sampling size
        $\Delta_z$ ← $\Delta_z$ + GeneTreeMonteCarloSampling($N$, $\mathcal{N}$)
        $i$ ← $i$ + 1
    $\Delta_d$ ← GetDistinctGeneTrees($\Delta_z$)    ⊲ $\Delta_d$: Distinct gene tree topologies in $\Delta_z$
    $\hat{f}_{\mathcal{N}}$ ← EstimateProbability($\Delta_d$)    ⊲ $\hat{f}_{\mathcal{N}}$: Estimated probability distribution of $\Delta_d$
    $\hat{f}_{\mathcal{N}}$ ← RankTopology($\hat{f}_{\mathcal{N}}$)
    $\Delta$ ← Truncate($\hat{f}_{\mathcal{N}}$, $\Delta_d$, $k_n$)    ⊲ $k_n$: Truncation size; $\Delta$: Selected gene tree topologies
    $\hat{g}_{\mathcal{N}}$ ← EstimateTruncatedProbability($\Delta$)    ⊲ $\hat{g}_{\mathcal{N}}$: Estimated probability distribution of $\Delta$
    $\theta$ ← InitializeModelParameters($N$, $\Delta$, $\hat{g}_{\mathcal{N}}$)    ⊲ $\theta$: Model parameters
    while Not reaching the convergence criteria do
        $\theta$ ← HeuristicLearning($\theta$, $A$)    ⊲ $A$: Input multiple sequence alignment with $K$ aligned sequences and $L$ columns
    $\{p_t\}_{1 \le t \le L}$ ← ModifiedPosteriorDecoding($\theta$, $N$, $A$)    ⊲ $p_t$: Introgression probability for each aligned site $t$ ($1 \le t \le L$)
    return $\{p_t\}_{1 \le t \le L}$

First, the r8s [28] tool is used to generate a random phylogenetic tree. The r8s version is 1.81, and the commands are shown below:

    #nexus
    begin r8s;
    simulate diversemodel=bdback seed= charevol=yes ntaxa=<5, 6, 7, 8, 9, or 10> infinite=yes nreps=20;
    end;

where "diversemodel" specifies the model used for tree generation, which is set to the birth-death model denoted by "bdback"; "seed" specifies a random seed for the tree generation; "ntaxa" is the number of taxa, which ranges from 5 to 10 in this study; "nreps" is the number of generated replicates; "charevol=yes" indicates that the model tree is output with branch lengths; and "infinite=yes" indicates that the branch lengths are set to the expected values based on rate and time. Then, the resulting tree is scaled to a height ℎ of 1 by multiplying the length of each edge in the model tree by ℎ.
Furthermore, an outgroup is added to the generated tree at 10.0 coalescent time. To obtain the phylogenetic network used as input, we finally add $r$ reticulations ($r \in \{1, 2\}$) by iterating the following steps: a time $t_M$ between 0 and the tree height is selected uniformly at random; two tree edges whose corresponding ancestral populations existed during a time interval $[t_A, t_B]$ such that $t_M \in [t_A, t_B]$ are randomly selected; and a reticulation at time $t_M$ is added to connect the pair of tree edges. Similar to Leaché et al. [15], the model network can be further classified based upon whether gene flow is "deep" or "non-deep", which is defined by the topological placement of reticulations: non-deep reticulations are placed between two leaf edges, while deep reticulations include all other reticulations.

(2) Generation of local genealogies

The local genealogy at each locus is simulated using msmove [29] or ms [30] following the above species network. The local gene trees are modeled for independent and identically distributed (i.i.d.) loci under a multi-species network coalescent with recombination (MSNCwR) model. msmove is modified from the Monte Carlo simulator ms [30] to allow the tracking of migration events, while ms does not provide this annotation. The following msmove or ms command is used:

    msmove -T -r -I -ej i_1 j_1 -ej i_2 j_2 ... -ej i_k j_k -ev i j

or

    ms -T -r -I -ej i_1 j_1 -ej i_2 j_2 ... -ej i_k j_k -em i j -em i j

where the -T parameter indicates that the gene trees representing the history of the sampled taxa are output. The -I parameter is followed by 𝑘, the number of populations, and by the list of integers (n_1 n_2 ... n_k) giving the number of taxa sampled in each population. In this study, one allele is sampled from each taxon.
The -r parameter is used to set recombination under Hudson's finite-sites recombination model [30], where the crossover rate or recombination rate $\rho$ is set to 1 for msmove or 10 for ms, and the number of sites between which recombination can occur is set to 100 for msmove or 1000 for ms. The -ej parameter specifies moving all lineages in population i to population j at time t. The -ev parameter is specific to msmove and sets migration at time t_m from population i to population j with the migration probability x. The migration rate is selected randomly for each replicate in this study. In contrast, the -em parameter for the ms tool sets migration from subpopulation j to subpopulation i at time t_m1 or t_m2 with migration rate x_1 or x_2.

(3) Simulation of DNA sequences

The DNA sequences are generated using seq-gen [32] under the Jukes-Cantor substitution model [31] with mutation rate $\theta = 2$. The sequence simulation follows a procedure of multi-locus genomic sequence evolution. In detail, we simulate two different classes of loci: "query" loci, which are sampled from the MSNC model with a recombination rate $\rho$ of 1 using the msmove tool to produce shorter sequences with a length of 100 bp, and "non-query" loci, which are sampled with a recombination rate $\rho$ of 10 using the ms tool to generate longer sequences with a length of 1 kb. Loci from the two classes are interleaved, resulting in ten query loci and nine non-query loci sampled per dataset, where each locus is independent and identically distributed (i.i.d.). The following seq-gen command is used:

    seq-gen -mHKY -l -p < genetreefile > seqfile

where the -m parameter specifies the substitution model. Here, the Jukes-Cantor model is specified through the "HKY" option, which reduces to Jukes-Cantor when base frequencies are equal and there is no transition/transversion bias. The -l parameter specifies the sequence length. The -p parameter sets the number of partitions. The genetreefile is the input file providing the gene trees. The seqfile is the output file with simulated sequences under the given gene trees.
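The interleaving of query and non-query loci described above can be sketched as a coordinate layout. The sketch assumes strict alternation that starts and ends with a query locus (the only way ten query and nine non-query loci can alternate); that ordering and the function name are assumptions for illustration, not details confirmed by the text.

```python
def interleaved_layout(n_query=10, query_len=100, n_nonquery=9, nonquery_len=1000):
    """Return (kind, start, end) half-open intervals for each locus.

    Assumes query and non-query loci alternate, beginning with a query
    locus, so n_query must equal n_nonquery + 1.
    """
    if n_query != n_nonquery + 1:
        raise ValueError("strict alternation needs one more query locus")
    layout, position = [], 0
    for index in range(n_query + n_nonquery):
        kind = "query" if index % 2 == 0 else "non-query"
        length = query_len if kind == "query" else nonquery_len
        layout.append((kind, position, position + length))
        position += length
    return layout


if __name__ == "__main__":
    layout = interleaved_layout()
    total = layout[-1][2]
    query_sites = sum(end - start for kind, start, end in layout if kind == "query")
    print("total sites:", total, "query sites:", query_sites)
```

Under these settings the layout has 10,000 sites in total, of which 1,000 belong to query loci, and adjacent query loci are separated by 1 kb of non-query sequence.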
The sampling scheme is designed to ensure that the query loci are sufficiently separated along the sequence. This allows us to assume free recombination between the query loci, based on the observed decay of linkage disequilibrium in previous empirical studies [127].

In the simulation study, we conduct 20 repetitions of the simulation procedure for each model condition. To assess the performance of the methods, we calculate the average and standard error of the performance measures across these 20 replicates. The statistical summary of the simulation dataset is presented in Table 3.1.

3.4.2 Performance Assessments

To evaluate and compare the performance of different approaches on the simulation dataset, where the true history of evolutionary events can be tracked, we use two types of area under the curve (AUC): the area under the receiver-operating characteristic (ROC) curve and the area under the precision-recall (PR) curve, referred to simply as ROC-AUC and PR-AUC, respectively. The ROC curve plots the true positive rate ($\frac{TP}{TP+FN}$) as a function of the false positive rate ($\frac{FP}{FP+TN}$) at various threshold settings, while the PR curve similarly plots the precision ($\frac{TP}{TP+FP}$) against the recall ($\frac{TP}{TP+FN}$) at different thresholds, where TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively. The ROC-AUC and PR-AUC measures represent the tradeoff between type I errors and type II errors under different threshold values. The measures are calculated only on the "query" loci in the simulation dataset, where the true migration events are annotated by msmove. After concatenating all query loci in one simulation replicate, the ROC-AUC and PR-AUC of a method are assessed based on the inferred probability that each site has an introgressive origin. Additionally, we report the memory usage and runtime in order to comprehensively evaluate the scalability of our framework.

Table 3.1 Statistics for the simulation dataset. The reticulation scenarios include one non-deep reticulation, two non-deep reticulations, and one deep reticulation. "Ntaxa" means the number of taxa. "Total Sites" indicates the total number of sites in the simulated genomes. "Sites for analysis" represents the number of "query" sites used to analyze the performance of introgression detection methods. The p-distance of a pair of aligned sequences is calculated by dividing the number of sites where the two sequences had different nucleotides by the number of sites in which both sequences had nucleotides. "Average/Maximum p-dist" is the average/maximum p-distance over all pairs of aligned sequences in the simulation dataset. "Introgression (%)" shows the percentage of introgressed sites over the genome. "Network Height" gives the height of the model network. "Average Branch Length" is the average branch length of the model network. The average and standard error (SE) are shown based on 20 replicates.

Reticulation   Ntaxa  Total  Sites for  Average p-dist (%)  Maximum p-dist (%)  Introgression (%)   Network Height    Average Branch Length
Scenario              Sites  analysis   Average   SE        Average   SE        Average   SE        Average  SE       Average   SE
one non-deep   5      10kb   1kb        68.307    2.339     73.317    0.624     54.000    13.565    1.000    0.000    0.337     0.052
one non-deep   6      10kb   1kb        67.474    2.251     73.254    0.759     51.000    13.379    1.000    0.000    0.288     0.039
one non-deep   7      10kb   1kb        68.039    2.434     73.403    0.386     47.500    12.600    1.000    0.000    0.297     0.050
one non-deep   8      10kb   1kb        68.673    1.368     73.538    0.442     55.000    15.330    1.000    0.000    0.285     0.036
one non-deep   9      10kb   1kb        68.327    1.493     73.446    0.310     60.500    14.654    1.000    0.000    0.270     0.041
one non-deep   10     10kb   1kb        68.878    1.196     73.630    0.330     52.500    13.370    1.000    0.000    0.275     0.037
two non-deep   5      10kb   1kb        66.634    2.567     72.763    0.638     70.000    13.416    1.000    0.000    0.273     0.046
two non-deep   6      10kb   1kb        66.945    3.350     72.645    1.560     79.000    11.790    1.000    0.000    0.272     0.040
two non-deep   7      10kb   1kb        67.551    1.874     73.136    0.346     78.500    11.079    1.000    0.000    0.253     0.035
two non-deep   8      10kb   1kb        68.882    1.270     73.468    0.378     72.500    16.086    1.000    0.000    0.264     0.034
two non-deep   9      10kb   1kb        68.490    1.522     73.325    0.449     75.500    15.322    1.000    0.000    0.246     0.038
two non-deep   10     10kb   1kb        68.256    1.683     73.535    0.312     74.500    11.169    1.000    0.000    0.241     0.033
one deep       5      10kb   1kb        66.605    2.501     72.330    1.996     72.500    16.086    1.000    0.000    0.300     0.042
one deep       6      10kb   1kb        67.891    2.168     72.584    1.360     74.000    10.198    1.000    0.000    0.293     0.052
one deep       7      10kb   1kb        67.262    2.224     73.072    0.503     80.000    10.000    1.000    0.000    0.272     0.054
one deep       8      10kb   1kb        68.372    2.288     73.238    0.546     72.000    13.638    1.000    0.000    0.261     0.042
one deep       9      10kb   1kb        68.056    1.736     73.456    0.400     79.500    13.219    1.000    0.000    0.247     0.048
one deep       10     10kb   1kb        68.936    1.078     73.535    0.404     75.000    12.450    1.000    0.000    0.258     0.032

3.4.3 Software and Data

Our PHiMM algorithm is implemented as custom software, which includes a tailored version of the MSNC-based Monte Carlo algorithm and custom adaptations of the PhyloNet software package [13, 123]. All of our software and study datasets are open-source and publicly available at https://gitlab.msu.edu/liulab/phimm-dataset.

3.5 Results

We present a comprehensive evaluation of the PHiMM framework using simulated datasets designed to capture a variety of factors that could influence introgression detection accuracy. Specifically, we focus on (1) the effect of varying the gene tree truncation size $k_n$, (2) the impact of increasing the number of taxa, and (3) the influence of different numbers and depths of reticulations. We further compare the performance of PHiMM with PhyloNet-HMM in terms of accuracy, runtime, and memory usage.

3.5.1 Effect of gene tree truncation size

An initial step in implementing PHiMM involves determining the optimal gene tree truncation size $k_n$, which directly affects the number of hidden states in the HMM. As depicted in Figure 3.3, both memory usage and runtime exhibit an increasing trend with the number of gene trees, particularly for networks involving 10 taxa.
This outcome is expected because more gene trees translate into a larger state space for the HMM, thereby necessitating greater computational resources. Despite the rise in computational cost, Figure 3.3 also shows that detection accuracy remains relatively stable across a wide range of gene tree counts. This stability implies that many gene trees may be redundant with respect to introgression detection, allowing us to choose a moderate truncation size without sacrificing predictive performance. Based on these findings, we set $k_n = 30$ as a default, thereby balancing accuracy with memory and runtime efficiency.

3.5.2 Impact of the number of taxa and reticulation number and depth

To further explore PHiMM's performance, we assess how the number of taxa and various reticulation scenarios affect both computational requirements (memory and runtime) and detection accuracy. Figure 3.4 gives the performance of the PHiMM approach on different numbers of taxa and different reticulation scenarios. We find that memory usage and runtime increase as the number of taxa grows from 5 to 10. Overall, however, memory usage and runtime remain under 10 GB and 10 hours, respectively. Furthermore, the ROC-AUC of PHiMM remains above 0.75 across different numbers of taxa with non-deep reticulations. However, the ROC-AUC on more than 7 taxa involving deep reticulations falls below 0.7.

Furthermore, we analyze the influence of different numbers or types of reticulations. We restrict the analysis to three scenarios (one non-deep reticulation, two non-deep reticulations, and one deep reticulation) because memory usage and runtime rise as the number of reticulations increases. As shown in Figure 3.4, comparing the results of one non-deep reticulation with those of two non-deep reticulations, we find that the memory usage grows as the number of reticulations increases, while the runtime remains relatively unchanged.
However, the ROC-AUC is between 0.85 and 0.9 for two non-deep reticulations, which is higher than that for one non-deep reticulation. On the other hand, we compare non-deep and deep reticulations. We find that memory usage and runtime are approximately similar for these two scenarios. However, the ROC-AUC is comparatively lower for one deep reticulation (below 0.7 for more than 7 taxa). This is expected, because a deep reticulation is more complex and makes the calculations and optimization in PHiMM harder.

3.5.3 Comparison with PhyloNet-HMM

Finally, we compare the PHiMM framework with PhyloNet-HMM. Since PhyloNet-HMM is time- and memory-intensive, we base the evaluation only on the 5-taxon phylogenetic network with one non-deep reticulation as input. In Table 3.2, the memory usage and runtime for PHiMM are only 2.4 GB and 0.1 hours, which are about 0.7% and 0.3% of those for PhyloNet-HMM (318.6 GB and 40.9 hours), respectively, indicating that the PHiMM framework dramatically reduces the memory usage and runtime for introgression detection when compared with PhyloNet-HMM. At the same time, PHiMM's ROC-AUC and PR-AUC are comparable with PhyloNet-HMM's, even though the ROC-AUC of PHiMM (0.7653) is 2% lower than that of PhyloNet-HMM (0.7806).

Overall, these results demonstrate that PHiMM significantly reduces computational overhead without severely compromising accuracy. Consequently, PHiMM represents a compelling alternative for large-scale introgression mapping, especially in datasets with dozens of DNA sequences and complex reticulation patterns.

3.6 Discussion and Conclusion

In this study, we introduced PHiMM, a novel introgression detection framework that employs a coalescent-based approximation strategy to reduce model complexity and enhance detection performance in large-scale genomic datasets.
Through extensive simulation experiments, we demonstrated that PHiMM achieves significant improvements in runtime and memory usage, by more than two orders of magnitude in the 5-taxon comparison, when compared to the state-of-the-art PhyloNet-HMM.

Figure 3.3 Results for different numbers of gene trees for (A) 5 and (B) 10 taxa input network. The simulation data is generated under the model conditions: one non-deep reticulation. The number of gene trees used ranges from 10 to 100. Performance measures include the area under the receiver operating characteristic curve (denoted by AUC) (red points), runtime (blue rectangles), and memory usage (green triangles). Averages and standard error bars are depicted based on 20 replicates. This figure comes from Wuyun et al. [122].

Figure 3.4 The performance of PHiMM on model conditions with (A) one non-deep reticulation, (B) two non-deep reticulations, and (C) one deep reticulation. Performance measures include the area under the receiver operating characteristic curve (denoted by AUC) (red points), runtime (blue rectangles), and memory usage (green triangles). Averages and standard error bars are depicted based on 20 replicates. This figure comes from Wuyun et al. [122].

Table 3.2 The performance comparison between PHiMM and PhyloNet-HMM on the 5-taxon model condition with one non-deep reticulation. Performance measures include the area under the receiver-operating characteristic curve (ROC-AUC), the area under the precision-recall curve (PR-AUC), runtime, and memory usage. The average and standard error (SE) are calculated based on 20 replicates. This table comes from Wuyun et al. [122].

                  PhyloNet-HMM          PHiMM
                  Average    SE         Average   SE
ROC-AUC           0.7806     0.0534     0.7653    0.0523
PR-AUC            0.7197     0.0871     0.7305    0.0726
Runtime (hour)    40.8968    0.5607     0.1291    0.0035
Memory (GB)       318.6493   4.3099     2.3527    0.2271
This exceptional scalability is primarily attributable to the truncation strategy built into PHiMM, wherein the model approximation procedure effectively curtails the number of parameters involved in local optimizations.

Although model approximations typically raise concerns regarding trade-offs between computational efficiency and inference precision, our results show that PHiMM maintains accuracy levels comparable to PhyloNet-HMM, with only a marginal decrease in ROC-AUC and a slightly higher PR-AUC. One plausible explanation for this robust performance is that discarding redundant gene trees or parameters through truncation simplifies the model without forfeiting critical genetic signals required for introgression inference.

Our findings further indicate that PHiMM's detection accuracy is generally resilient across variations in the number of taxa and the complexity of reticulation patterns. Nonetheless, a drop in ROC-AUC is observed in datasets containing a large number of taxa and deep reticulations. These observations align with recent studies suggesting that deep gene flow or ancient evolutionary events impose greater challenges for phylogenetic inference than more recent divergence scenarios [11, 14, 15]. Moreover, although the computational demands of PHiMM grow with the number of taxa and, in the case of memory, with the number of reticulations, these requirements remain within manageable limits for modern high-performance computing systems.

In conclusion, PHiMM represents a powerful alternative for introgression mapping, combining computational efficiency with reliable inference of hybridization events. The framework's scalability, coupled with its robust accuracy, renders it particularly well-suited to the burgeoning volume and complexity of contemporary genomic datasets.
Future endeavors could focus on refining the approximation strategies to further enhance performance on data characterized by deep reticulations, as well as extending PHiMM to incorporate additional sources of genomic heterogeneity such as gene duplication and loss. Through these developments, PHiMM has the potential to offer a versatile platform for unraveling the evolutionary processes that shape species diversity.

CHAPTER 4
APPLICATION OF PHIMM INTROGRESSION DETECTION METHOD TO EMERGING MODEL SYSTEMS

4.1 Introduction

In our previous work, we introduced the PHiMM approach as a scalable and accurate method for detecting introgression in simulated genomic data. Through a series of computational experiments, we demonstrated that PHiMM maintains high inference accuracy while dramatically enhancing the scalability of introgression detection. Encouraged by these promising simulation-based results, the next logical step is to validate the performance of PHiMM on empirical datasets that capture the complexity and heterogeneity inherent in real-world biological systems. Accordingly, this study applies PHiMM to two well-characterized cases of introgression, thereby evaluating its effectiveness and robustness under practical evolutionary scenarios.

The first empirical dataset is the widely studied case of adaptive introgression in mice, namely Mus musculus domesticus (M. m. domesticus) and Mus spretus (M. spretus). The adaptive significance of this introgression is evident in the acquired resistance to anticoagulant rodenticides by M. m. domesticus. This resistance is conferred by the introgression of the Vkorc1 (vitamin K 2,3-epoxide reductase subcomponent 1) gene from the gene pool of M. spretus into M. m. domesticus [120]. This case not only exemplifies how introgression can confer immediate and profound adaptive benefits but also provides a tractable system in which to assess the accuracy of PHiMM's introgression mapping under a known evolutionary outcome.
The second empirical dataset is from studies of mimicry in the butterfly genus Limenitis. The WntA gene was found to be responsible for wing mimicry among Limenitis [128]. By examining how introgression may have contributed to morphological diversification and ecological adaptation, this dataset allows for a broader evaluation of PHiMM's performance under selective pressures distinct from those observed in rodents. Moreover, these butterflies present an opportunity to interrogate introgression events that shape complex phenotypes, such as wing coloration and patterning.

By applying PHiMM to these two divergent empirical systems (an adaptive introgression conferring rodenticide resistance in mice, and morphological mimicry in butterflies), we aim to (i) verify the consistency of our simulation-derived findings in real datasets, and (ii) explore the broader applicability of PHiMM to evolutionary processes involving diverse selection pressures. This study thereby not only extends PHiMM's utility beyond controlled simulation conditions but also contributes to a deeper understanding of the adaptive and ecological impacts of introgression in natural populations.

4.2 Materials

4.2.1 Mouse Data

Figure 4.1 The phylogenetic network and corresponding parental species trees used for evaluating PhyloNet-HMM and PHiMM on mouse data. The phylogenetic network captures (A) introgression from M. spretus to M. m. domesticus. The parental tree in (B) captures genomic regions of introgressive descent, while the parental tree in (C) captures genomic regions with no introgression.

The data gathered from the mice include empirical genomic sequence datasets with positive and negative control loci. The datasets are sampled from wild-derived and wild mouse samples from M. m. domesticus and M. spretus.
For comparison purposes, we replicate a subset of the PhyloNet-HMM analyses conducted in the research by Liu et al. [7], which utilized genomic sequence data from Didion et al. [131]. We briefly review relevant methodological details here (see [7] for more details). The data were sequenced using an SNP array designed by Yang et al. [129]; raw reads from the array were genotyped using MouseDivGeno software [131]. The genotypic sequence data was phased into haploid genomic sequences using fastPHASE [132]. Comprehensive details about the data can be found in Table 4.1 and 4.2.

Table 4.1 Mouse samples used in this study. We used existing mouse samples from previous studies. Each row lists a sample’s name, species, commonly used alias, the geographic origin of the population, and the corresponding reference source. The M. m. domesticus entries represent house mouse populations from various locations in Spain, Germany, Cyprus, and Georgia, while the M. spretus entry corresponds to a distinct mouse species from the Cadiz Province in Spain. References to original sources are also included for traceability.

| Sample Name | Species | Alias | Origin | Source |
|---|---|---|---|---|
| Spain-Arenal | M. m. domesticus | MWN1279 | Arenal, Mallorca Island, Spain | [129] |
| Spain-RocadelValles | M. m. domesticus | MWN1287 | Roca del Valles, Catalunya, Spain | [129] |
| Germany-Hamm-A | M. m. domesticus | B9 | Hamm, North Rhine-Westphalia, Germany | [7] |
| Germany-Hamm-B | M. m. domesticus | B10 | Hamm, North Rhine-Westphalia, Germany | [7] |
| Germany-Hamm-C | M. m. domesticus | B11 | Hamm, North Rhine-Westphalia, Germany | [7] |
| Germany-Hamm-D | M. m. domesticus | C1 | Hamm, North Rhine-Westphalia, Germany | [7] |
| Germany-Hamm-E | M. m. domesticus | C2 | Hamm, North Rhine-Westphalia, Germany | [7] |
| Germany-Hamm-F | M. m. domesticus | C3 | Hamm, North Rhine-Westphalia, Germany | [7] |
| A-background | M. m. domesticus | DCA | Akotiri, Cyprus | [129, 130] |
| B-background | M. m. domesticus | DCP | Paphos, Cyprus | [129, 130] |
| C-background | M. m. domesticus | DGA | Adjaria, Georgia | [129, 130] |
| Spretus | M. spretus | SPRET/EiJ | Puerto Real, Cadiz Province, Spain | [129, 130] |

Table 4.2 Mouse datasets. Each dataset comprises a subset of Mus samples selected for phylogenetic and comparative genomic analyses using PhyloNet-HMM or PHiMM. For PhyloNet-HMM, four haploid Mus genomes were included, while the extended PHiMM analysis incorporated five Mus genomes. The extended PHiMM datasets (noted in parentheses) specifically incorporate the additional Mus genome to deepen the inference of phylogenetic histories and evolutionary relationships across mouse populations. An additional rat genome (Rattus norvegicus reference genome RGSC Rnor_5.0) served as an outgroup.

| Dataset | Set of samples included |
|---|---|
| Spain-Arenal | Spain-Arenal, A-background, B-background, C-background (for extended PHiMM analysis), Spretus |
| Spain-RocadelValles | Spain-RocadelValles, A-background, B-background, C-background (for extended PHiMM analysis), Spretus |
| Germany-Hamm-A | Germany-Hamm-A, A-background, B-background, C-background (for extended PHiMM analysis), Spretus |
| Germany-Hamm-B | Germany-Hamm-B, A-background, B-background, C-background (for extended PHiMM analysis), Spretus |
| Germany-Hamm-C | Germany-Hamm-C, A-background, B-background, C-background (for extended PHiMM analysis), Spretus |
| Germany-Hamm-D | Germany-Hamm-D, A-background, B-background, C-background (for extended PHiMM analysis), Spretus |
| Germany-Hamm-E | Germany-Hamm-E, A-background, B-background, C-background (for extended PHiMM analysis), Spretus |
| Germany-Hamm-F | Germany-Hamm-F, A-background, B-background, C-background (for extended PHiMM analysis), Spretus |

Table 4.3 Statistics for the mouse datasets. “Haplotype” indicates the haplotype name in each dataset. “Nchrs” means the number of chromosomes in each dataset. “Sites” indicates the number of total sites from all chromosomes. The p-distance of a pair of aligned sequences is calculated by dividing the number of sites where the two sequences had different nucleotides by the number of sites in which both sequences had nucleotides. “Average/Maximum p-dist^a” is the average/maximum p-distance of all pairs of aligned sequences in the PhyloNet-HMM datasets with 4 ingroup taxa, while “Average/Maximum p-dist^b” is for the extended PHiMM datasets with 5 ingroup taxa. The average and standard error (SE) are shown based on all chromosomes.

| Dataset | Haplotype | Nchrs | Sites | Average p-dist^a (Average) | Average p-dist^a (SE) | Maximum p-dist^a (Average) | Maximum p-dist^a (SE) | Average p-dist^b (Average) | Average p-dist^b (SE) | Maximum p-dist^b (Average) | Maximum p-dist^b (SE) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Spain-Arenal | 1 | 20 | 414376 | 26.590 | 1.381 | 35.008 | 0.777 | 25.261 | 1.575 | 35.008 | 0.777 |
| Spain-Arenal | 2 | 20 | 414376 | 26.621 | 1.367 | 35.141 | 0.884 | 25.287 | 1.587 | 35.141 | 0.884 |
| Spain-RocadelValles | 1 | 20 | 414376 | 26.609 | 1.334 | 34.983 | 0.884 | 25.266 | 1.543 | 34.983 | 0.884 |
| Spain-RocadelValles | 2 | 20 | 414376 | 26.642 | 1.352 | 35.142 | 0.921 | 25.295 | 1.579 | 35.142 | 0.921 |
| Germany-Hamm-A | 1 | 20 | 414376 | 27.622 | 1.458 | 36.918 | 0.883 | 26.112 | 1.638 | 36.918 | 0.883 |
| Germany-Hamm-A | 2 | 20 | 414376 | 27.623 | 1.482 | 36.909 | 0.923 | 26.107 | 1.682 | 36.909 | 0.923 |
| Germany-Hamm-B | 1 | 20 | 414376 | 27.644 | 1.462 | 36.918 | 0.908 | 26.123 | 1.641 | 36.918 | 0.908 |
| Germany-Hamm-B | 2 | 20 | 414376 | 27.624 | 1.475 | 36.905 | 0.956 | 26.119 | 1.671 | 36.905 | 0.956 |
| Germany-Hamm-C | 1 | 20 | 414376 | 27.571 | 1.484 | 36.876 | 1.034 | 26.076 | 1.644 | 36.876 | 1.034 |
| Germany-Hamm-C | 2 | 20 | 414376 | 27.623 | 1.462 | 36.890 | 0.994 | 26.107 | 1.667 | 36.890 | 0.994 |
| Germany-Hamm-D | 1 | 20 | 414376 | 27.511 | 1.442 | 36.782 | 1.095 | 25.999 | 1.616 | 36.782 | 1.095 |
| Germany-Hamm-D | 2 | 20 | 414376 | 27.500 | 1.477 | 36.788 | 1.087 | 25.990 | 1.663 | 36.788 | 1.087 |
| Germany-Hamm-E | 1 | 20 | 414376 | 27.513 | 1.416 | 36.792 | 1.022 | 25.994 | 1.601 | 36.792 | 1.022 |
| Germany-Hamm-E | 2 | 20 | 414376 | 27.511 | 1.468 | 36.768 | 1.050 | 26.000 | 1.660 | 36.768 | 1.050 |
| Germany-Hamm-F | 1 | 20 | 414376 | 27.500 | 1.414 | 36.719 | 1.042 | 25.989 | 1.597 | 36.719 | 1.042 |
| Germany-Hamm-F | 2 | 20 | 414376 | 27.554 | 1.479 | 36.775 | 1.026 | 26.038 | 1.669 | 36.775 | 1.026 |

Each dataset consists of genomic sequences from three M. m. domesticus samples, one M.
spretus sample, and one Rattus norvegicus genome (RGSC Rnor_5.0/rn5) that is included as an outgroup (see Figure 4.1). Thus, each dataset consists of four ingroup taxa and one outgroup taxon. Of the three M. m. domesticus samples, one originates from the region of sympatry between M. m. domesticus and M. spretus, and two are from far outside the sympatric region. The M. spretus sample is also from the sympatric region. Therefore, we assume that the network for this dataset involves a reticulation from M. spretus to the sympatric M. m. domesticus.

In addition to the datasets in Liu et al. [7], our study also incorporates larger “extended” datasets with more extensive taxon sampling, which form a strict superset of the data from the original study. The larger size of the extended datasets necessitates the use of PHiMM for introgression mapping. Other than the larger set of taxa in the new datasets, all other aspects of the empirical data are the same (i.e., genotyping, phasing, etc.). Notably, the extended datasets include an extra sample of M. m. domesticus from outside the region of sympatry between M. m. domesticus and M. spretus (Figure 4.1).

For the analyses of the extended datasets, PHiMM is run with the same settings used in the simulation study, except for two modifications. First, the number of iterations for model parameter learning is increased to 1000 (as opposed to 300). Second, model parameters are optimized using chromosome 7 from the extended Spain-Arenal dataset. Summary statistics for these datasets can be found in Table 4.3.

4.2.2 Limenitis Data

Figure 4.2 The phylogenetic network and corresponding parental species trees used for evaluating PhyloNet-HMM and PHiMM on Limenitis data. The phylogenetic network (A) captures introgression from Limenitis arthemis arizonensis to Limenitis arthemis astyanax.
The parental tree in (B) captures genomic regions of introgressive descent, while the parental tree in (C) captures genomic regions with no introgression.

We also conduct a re-analysis of the Limenitis dataset described in the study by Gallant et al. [128]. For this re-analysis, we run PhyloNet-HMM and PHiMM on the Limenitis AC scaffold, which contains the WntA gene. The dataset comprises four ingroup taxa, namely, Limenitis arthemis arizonensis, Limenitis arthemis arthemis, Limenitis arthemis astyanax, and Limenitis archippus floridensis. Our assumption is that Limenitis arthemis arthemis and Limenitis arthemis astyanax coalesce first, followed by their ancestor coalescing with Limenitis arthemis arizonensis, and finally with Limenitis archippus floridensis. In this 4-taxon network, a reticulation from Limenitis arthemis arizonensis to Limenitis arthemis astyanax is postulated (see Figure 4.2).

Due to the scalability limitations of PhyloNet-HMM, it is executed on the 4-taxon dataset, while PHiMM employs the extended dataset that includes an additional Limenitis arthemis arthemis sample (Figure 4.2).

4.3 Results

4.3.1 Mouse Data Re-analysis

In our study, we conduct a performance comparison between PHiMM and PhyloNet-HMM using mouse genomic sequence datasets previously analyzed in the works of Liu et al. [7] and Didion et al. [131]. However, due to the scalability constraints of PhyloNet-HMM, we run PhyloNet-HMM on a smaller dataset consisting of only four ingroup taxa. This smaller dataset is a proper subset of the larger dataset used for PHiMM analyses, which includes more samples of M. m. domesticus but remains otherwise identical. Additional information and detailed explanations of the datasets can be found in the Materials section (Section 4.2).

Previous studies [7, 120] have reported instances of adaptive interspecific introgression involving the region around the Vkorc1 gene, specifically the region on chromosome 7 between coordinates 123 Mb and 134 Mb.
As depicted in Figure 4.3 and Figures A.1–A.20 of Appendix A, both PHiMM and PhyloNet-HMM infer multi-megabase-long introgressed tracts in the eight samples from Spain and Germany, with the exception of the Spain-Arenal sample. Thus, both methods successfully detect interspecific introgression in this positive control. Notably, the genomic region containing the Vkorc1 gene shows the longest introgressed tracts detected in the mouse genome. Within this region, PHiMM infers a greater total sequence length of introgressed tracts than PhyloNet-HMM does.

Beyond Vkorc1, additional genomic regions spanning several hundred kilobases also exhibit substantial introgression signals, as reported by Liu et al. [7]. In many of these regions, PHiMM and PhyloNet-HMM converge on qualitatively similar inferences, reaffirming the presence of introgressed tracts across the genome. However, when examining local inference patterns, two types of differences become evident.

First, there are local differences in the pattern of introgressed and non-introgressed tracts, exemplified by the chromosome 7 region between coordinates 102 Mb and 108 Mb and the chromosome 17 region between coordinates 4 Mb and 54 Mb. Second, PHiMM infers longer and more numerous introgressed tracts than PhyloNet-HMM in certain regions, such as those on chromosomes 10, 12, and 15.

Figure 4.3 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on mouse data. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure 4.4 The comparison of histograms of introgressed tract lengths between (A) PhyloNet-HMM and (B) PHiMM on mouse data. Introgressed tract lengths are shown in units of 100 kilobases (Kb). Results are reported for all recipient M. m. domesticus samples from the region of sympatry between M. m. domesticus and M. spretus. This figure comes from Wuyun et al. [122].

Furthermore, we investigate the genome-wide histogram of introgressed tract lengths, as shown in Figure 4.4. Qualitatively, PhyloNet-HMM and PHiMM return inferences with similar histogram shapes: a large peak over short tract lengths (kilobases or a few megabases of sequence length at most), followed by a long tail over longer sequence lengths. Two primary differences between the histograms of the two methods are noted. First, PHiMM’s histogram has a greater total count of introgressed tracts and a larger total sequence length than PhyloNet-HMM’s. Second, PHiMM’s histogram appears to be bimodal: one part has tract lengths distributed between 0 and 3.5 Mb, and the other part consists of around a dozen much longer introgressed tracts with sequence lengths between 4 and 11 Mb. In comparison, PhyloNet-HMM’s histogram has a long tail with a relatively uniform distribution of introgressed tracts between 2 and 10 Mb.

In summary, these comparisons illustrate that PHiMM and PhyloNet-HMM both detect the known adaptive introgression event in M. m. domesticus, although PHiMM appears more inclined to identify longer or subtler introgressed tracts. Future studies could delve into the biological relevance of such extended signals and refine methodological assumptions to better account for the complexities of natural populations, including varying recombination rates, incomplete lineage sorting, and the mosaic nature of introgressed genomes.
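As an illustration of how a tract-length histogram like the one in Figure 4.4 can be derived from per-site output, the following is a minimal Python sketch. It is not the code used in this study: the function names, the 0.5 probability threshold for calling a site introgressed, and the definition of tract length as the coordinate span between the first and last site of a run are all assumptions made for the example.

```python
from collections import Counter


def introgressed_tracts(positions, probs, threshold=0.5):
    """Return (start, end) coordinates of maximal runs of consecutive sites
    whose introgression probability is at least `threshold`.

    `positions` are genomic coordinates in increasing order and `probs` are
    the matching per-site probabilities. The threshold is an assumption of
    this sketch, not a documented PHiMM setting.
    """
    tracts = []
    start = last = None
    for pos, p in zip(positions, probs):
        if p >= threshold:
            if start is None:
                start = pos          # open a new tract
            last = pos               # extend the current tract
        elif start is not None:
            tracts.append((start, last))
            start = None             # close the tract at the previous site
    if start is not None:
        tracts.append((start, last))
    return tracts


def tract_length_histogram(tracts, bin_size=100_000):
    """Count tracts per length bin; bin k covers [k*bin_size, (k+1)*bin_size).

    Length is taken as end - start, so a single-site tract has length 0.
    """
    return Counter((end - start) // bin_size for start, end in tracts)


positions = [100, 200, 300, 400, 500, 600]
probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.95]
tracts = introgressed_tracts(positions, probs)
print(tracts)                                        # [(200, 300), (500, 600)]
print(tract_length_histogram(tracts, bin_size=100))  # Counter({1: 2})
```

The default bin size of 100 kb mirrors the axis unit stated in the Figure 4.4 caption; any other bin width can be passed in.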
4.3.2 Limenitis Data Re-analysis

In addition to the mouse datasets, we further evaluated PHiMM and PhyloNet-HMM on genomic data from the butterfly genus Limenitis, which exhibits well-characterized mimicry traits. Because of the scalability constraints of PhyloNet-HMM, it could only be run on a four-taxon subset, whereas PHiMM was applied to the larger five-taxon dataset (see Section 4.2 for detailed descriptions of both datasets). This setup facilitates a comparative assessment of the two approaches under varying data sizes and complexities.

As shown in Figure 4.5, the longest introgressed tract inferred by PHiMM closely corresponds to the major tract identified by PhyloNet-HMM, albeit shifted slightly toward lower genomic coordinates. In principle, inferring introgression from datasets with more taxa can be more challenging owing to increased heterogeneity and potential discordance across gene regions. Consequently, the five-taxon analyses conducted by PHiMM exhibit a marginally higher level of noise relative to the smaller, four-taxon analysis by PhyloNet-HMM. This is reflected in the observation that PHiMM tends to detect a larger number of shorter introgressed tracts than PhyloNet-HMM.

Figure 4.5 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on Limenitis data. Introgressed tracts along the genome are shown in kilobases (Kb). Results are reported for the introgression from a donor Limenitis arthemis arizonensis sample to a recipient Limenitis arthemis astyanax sample. This figure comes from Wuyun et al. [122].

Despite these differences, both methods consistently identify introgression within the WntA region, spanning approximately 27 to 101 kb, with a pronounced cluster of introgressed segments between 60 kb and 100 kb. These findings align with the results in Gallant et al. [128], who highlighted the importance of WntA for wing-pattern mimicry in Limenitis.
The concordance of these inferences underscores each method’s ability to capture critical genomic intervals, even when operating on datasets of different sizes and complexities. By confirming introgression signals in an ecologically and evolutionarily significant region, our analysis further validates the utility of both PHiMM and PhyloNet-HMM for dissecting genomic contributions to phenotypic adaptations.

4.4 Discussion and Conclusion

In this study, we evaluate the performance of PHiMM and another state-of-the-art method, PhyloNet-HMM, using two empirical genomic sequence datasets. Our analyses focus on adaptive introgression in house mice and mimicry in the butterfly genus Limenitis, providing a comprehensive evaluation of PHiMM’s capabilities on real-world biological data.

On the mouse empirical datasets, both PHiMM and PhyloNet-HMM provide qualitatively similar inferences regarding the introgressed genomic regions, which align with the molecular hypotheses proposed by Liu et al. [7]. However, the two methods differ in their patterns of local inference. In some genomic regions, such as the Vkorc1-containing region on chromosome 7, PHiMM infers longer and more numerous introgressed tracts than PhyloNet-HMM. Additionally, the distributions of introgressed tract lengths differ between the two methods. PHiMM detects more introgressed tracts and displays a clearer “separation” between two classes of tracts: megabase-long tracts, which are few in number, and shorter tracts, which are more numerous. The presence of the former “long” class of tracts is consistent with the hypothesis of adaptive introgression, where neutral recurrent backcrossing tends to shorten introgressed tracts over time, while positive selection and genetic hitchhiking have an opposing effect [7]. On the other hand, the latter “short” class of tracts aligns with Liu et al.
[7]’s hypothesis regarding more ancient bouts of adaptive interspecific introgression; sympatry between M. musculus and M. spretus is understood to have predated the recent introduction of pesticides [7, 133].

On the Limenitis empirical datasets, both PhyloNet-HMM and PHiMM identify similar introgression tracts that exhibit significant overlap across much of the genomic region containing WntA. These results are in line with the experimental findings presented in the work of Gallant et al. [128].

The observed differences in our empirical study can be attributed to two factors: PHiMM’s competitive statistical power and type I error control relative to PhyloNet-HMM, as well as the denser allele sampling made possible by PHiMM’s improved scalability compared to PhyloNet-HMM.

In conclusion, our study demonstrates that the PHiMM method is a powerful tool for introgression detection, capable of handling large datasets and providing accurate inferences. By applying PHiMM to diverse empirical datasets, we have illustrated its robustness and utility in real-world biological research. These findings contribute to our understanding of the evolutionary processes underlying genetic diversity and adaptive introgression, and pave the way for future applications of the PHiMM approach in various genomic studies.

CHAPTER 5
BOOSTING PHIMM-BASED PHYLOGENETIC HMM INFERENCE AND LEARNING BASED ON RANDOM WALK RESAMPLING

5.1 Introduction

Non-parametric resampling techniques enable researchers to leverage empirical data to construct distributions for obtaining critical values, calculating 𝑝-values, or constructing confidence intervals. For the task of introgression mapping, non-parametric resampling methods can be employed to generate resampled replicates of a genome alignment. Inference and analysis can thus be performed and compared across replicates.

There are two primary classes of resampling methods: non-parametric and parametric.
Among non-parametric methods, the bootstrap method is the most widely used [134, 135]. Given an input set of observations, the bootstrap method resamples observations uniformly at random with replacement. Re-estimation is then conducted on these resampled replicates, and repeatability is assessed by comparing the re-estimated results. Other non-parametric methods include the jackknife and the weighted bootstrap. In contrast, parametric methods resample directly from an explicit statistical model, often requiring the assumption of a hypothesis model because the model that generated the original inputs is unavailable. Non-parametric methods are preferred in many cases because they do not require assuming that the observations were generated under a specific model. Despite their popularity, non-parametric resampling methods such as the bootstrap have significant limitations, particularly the assumption that input observations are independent and identically distributed (i.i.d.). This assumption does not hold when inputs consist of sequences of observations, which is common in genomics and computational biology.

The SEquential RESampling (“SERES”) framework [136] consists of non-parametric or semi-parametric sequential resampling techniques that generalize the standard bootstrap method for non-parametric resampling [134] and the Heads-or-Tails method [137]. A critical feature of SERES is its “neighbor preservation property”, which ensures that neighboring bases within the original sequences are preserved during resampling. This generalized framework employs a random walk along a multiple sequence alignment (MSA) that is composed of either aligned or unaligned biomolecular sequences. In SERES, the random walk is conducted using the following procedure: A starting point and direction for the random walk are chosen uniformly at random across all sites.
As the random walk proceeds, reversals occur with certainty at the start or end of the MSA; reversals can also occur during each step with probability 𝛾. The random walk continues until the number of sampled characters equals the fixed MSA length. For each resampled replicate, re-estimation is performed, and repeatability is then measured by quantifying disagreement among re-estimations.

Initial studies of SERES focused on unaligned sequence inputs [136], rather than aligned inputs. Briefly, the SERES algorithm for unaligned sequence inputs also takes the form of a random walk, with one main difference: resampling “reads” along unaligned sequences occurs in an asynchronous fashion, and a set of anchors serves as synchronization “barriers” in the same way as they do in parallel computing. The SERES algorithm was applied to perform confidence interval placement for a classical problem in computational biology and bioinformatics: multiple sequence alignment (MSA) estimation. Results from synthetic and empirical data demonstrated that SERES random walks within a resampling/re-estimation pipeline yielded comparable or superior type I and type II error rates compared to state-of-the-art methods [136].

In subsequent research, we applied the SERES resampling algorithm to aligned sequences for another classical problem in computational biology and bioinformatics: recombination-aware local genealogical inference [138]. We mainly focused on an HMM-based local genealogical inference method, recHMM [139], for the following reasons. RecHMM identifies local genealogies by applying heuristic searches over the space of all possible partitions, and is powerful in detecting local genealogy changes, especially when the regions involved in recombination are long and the dataset size is large.
Thus, RecHMM reveals the most likely recombination breakpoint locations with high accuracy and fewer requirements on parameter settings such as window schemes, making it an ideal method to focus on. Another reason we focused on recHMM is that it takes aligned sequences as input to annotate mosaic genome structures, making it possible to combine it with the SERES resampling approach. Simulation experiments indicated that combining SERES with recHMM significantly improved recombination detection and local genealogical inference. SERES resampling also holds potential for applications in ancestral recombination inference problems, such as recombination rate estimation [140] and recombination hotspot or coldspot detection [141, 142].

In this study, we propose another application of SERES random walks on aligned sequences, and utilize SERES random walks as a means to “boost” HMM inference/learning performance. As with other non-parametric resampling methods, we demonstrate that SERES can serve as a data perturbation technique in addition to its use in confidence interval placement, as considered by earlier work [136, 138]. In the simulation study, we apply the SERES resampling algorithm to aligned sequences for a classical problem in computational biology and bioinformatics: introgression mapping. We mainly focus on the PHiMM algorithm for introgression mapping. In detail, we compare the performance and scalability of the PHiMM method with and without the SERES resampling approach. Benchmark results suggest that SERES resampling and re-estimation can significantly improve PHiMM’s inference accuracy.

5.2 Methods

5.2.1 Standalone PHiMM pipeline

PHiMM is an HMM-based introgression mapping method that combines inference and learning under a combined model of genetic drift, substitutions, recombination, and gene flow with a coalescent-based approximation technique.
Benchmark analyses indicate that PHiMM improves computational runtime and main memory usage by multiple orders of magnitude relative to PhyloNet-HMM, while returning comparable inference accuracy.

We first run PHiMM alone to establish a baseline performance for introgression mapping on the simulation data. PHiMM analyses are run using default settings, e.g., the number of iterations for model parameter learning is 300, and the number of runs is set to 10. Users also need to assign a gene tree truncation size 𝑘𝑛 for the HMM model of PHiMM. In our simulation study, we run PHiMM with the default setting, 𝑘𝑛 = 15.

Algorithm 5.1 SERES+PHiMM
procedure SERES-based PHiMM(𝑁, 𝐴, 𝑅)
    ⊲ 𝑁: Phylogenetic network
    ⊲ 𝐴: Input multiple sequence alignment with 𝐾 aligned sequences and 𝐿 columns
    ⊲ 𝑅: Number of replicates
    ({𝑝_𝑡}1≤𝑡≤𝐿, 𝜃) ← PHiMM(𝑁, 𝐴)
        ⊲ 𝑝_𝑡: Introgression probability for each aligned site 𝑡 (1 ≤ 𝑡 ≤ 𝐿)
        ⊲ 𝜃: Model parameters
    int 𝑖 ← 1
    while 𝑖 ≤ 𝑅 do
        startsite(𝑖), direction(𝑖) ← SelectStartSite(𝐴)
            ⊲ startsite(𝑖), direction(𝑖): Starting point and direction for SERES random walk
        𝐴(𝑖), mapping(𝑖) ← ∅, ∅
            ⊲ 𝐴(𝑖), mapping(𝑖): Resampled alignment and mapping
        while Length of 𝐴(𝑖) ≤ 𝐿 do
            𝐴(𝑖), mapping(𝑖) ← 𝐴(𝑖), mapping(𝑖) + RandomWalk(𝐴, startsite(𝑖), direction(𝑖), 𝛾)
                ⊲ 𝛾: Reversal probability
        {𝑝_𝑡^(𝑖)}1≤𝑡≤𝐿 ← PHiMM(𝑁, 𝐴(𝑖), 𝜃)
            ⊲ Run PHiMM with fixed model parameters
        𝑖 ← 𝑖 + 1
    {𝑝̄_𝑡} ← AverageProbability({𝑝_𝑡}, {𝑝_𝑡^(1)}, mapping(1), ..., {𝑝_𝑡^(𝑅)}, mapping(𝑅))
    return {𝑝̄_𝑡}1≤𝑡≤𝐿

Consistent with the study by Wuyun et al. [122], the inputs of PHiMM include the aligned DNA sequences 𝐴, with 𝐾 aligned sequences and 𝐿 columns, and a phylogenetic network 𝑁 on 𝐾 taxa. The output is a sequence of modified posterior decoding probabilities for the columns of the input alignment 𝐴.
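For readers unfamiliar with posterior decoding, the sketch below shows the standard (unmodified) computation for a generic discrete-state HMM using the scaled forward-backward recursions. It is only an illustration of the kind of per-column probability that PHiMM outputs: PHiMM’s modified posterior decoding, its state space, and its coalescent-based transition and emission probabilities are not reproduced here, and the function name and argument layout are assumptions of this example.

```python
def posterior_decoding(init, trans, emit_probs):
    """Per-column posterior state probabilities for a generic HMM.

    init[s]          : probability of starting in state s
    trans[r][s]      : probability of moving from state r to state s
    emit_probs[t][s] : likelihood of alignment column t given state s
    Returns post[t][s] = P(state at column t is s | all columns).
    Assumes every column has non-zero likelihood under at least one
    reachable state (otherwise the scaling factor would be zero).
    """
    n_states = len(init)
    n_cols = len(emit_probs)

    # Forward pass, rescaling each column so its entries sum to one.
    fwd, scale = [], []
    for t in range(n_cols):
        if t == 0:
            row = [init[s] * emit_probs[0][s] for s in range(n_states)]
        else:
            prev = fwd[-1]
            row = [
                emit_probs[t][s] * sum(prev[r] * trans[r][s] for r in range(n_states))
                for s in range(n_states)
            ]
        c = sum(row)
        fwd.append([x / c for x in row])
        scale.append(c)

    # Backward pass, reusing the forward scaling factors.
    bwd = [[1.0] * n_states for _ in range(n_cols)]
    for t in range(n_cols - 2, -1, -1):
        bwd[t] = [
            sum(trans[s][r] * emit_probs[t + 1][r] * bwd[t + 1][r] for r in range(n_states))
            / scale[t + 1]
            for s in range(n_states)
        ]

    # Combine and renormalise to guard against rounding drift.
    post = []
    for t in range(n_cols):
        row = [fwd[t][s] * bwd[t][s] for s in range(n_states)]
        z = sum(row)
        post.append([x / z for x in row])
    return post


# Toy two-state example (state 0 = "not introgressed", state 1 = "introgressed").
post = posterior_decoding(
    init=[0.6, 0.4],
    trans=[[0.9, 0.1], [0.2, 0.8]],
    emit_probs=[[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]],
)
print([round(row[1], 3) for row in post])  # per-column probability of state 1
```

In an introgression-mapping setting, the probability of the introgressed state at each column would play the role of 𝑝_𝑡 in Algorithm 5.1.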
5.2.2 The SERES+PHiMM pipeline

The PHiMM framework can be augmented with SERES-based resampling and re-estimation to enhance inference and learning (Algorithm 5.1). First, SERES random walks are conducted to perform resampling on the input alignment 𝐴. Detailed pseudocode for this procedure is shown in Algorithm 5.2 (reproduced from Wang et al. [136, 138]), and Figure 5.1 provides an illustrated example of a SERES random walk on an input MSA. The SERES resampling procedure in our simulation study utilizes a default reversal probability 𝛾 = 0.0001. We also conduct additional experiments with alternative reversal probabilities 𝛾 ∈ {0, 0.0001, 0.001, 0.005, 0.01, 0.1}.

We use the SERES random walk to generate 10 resampled replicates for each dataset in our study. Then, the PHiMM algorithm is run with default settings to perform optimization-based learning on the original input alignment 𝐴.

Algorithm 5.2 SERES
procedure SERES Walk on Aligned Sequences(𝐴, 𝛾, numReplicates)
    ⊲ 𝐴: Input multiple sequence alignment
    ⊲ 𝛾: Walk reversal probability
    ⊲ numReplicates: Number of SERES replicates
    replicates = < >
    for 𝑟 from 1 to numReplicates do
        replicate = < >
        direction = (rand() > 0.5) ? +1 : −1
            ⊲ Uniformly at random choose a direction (right vs. left)
            ⊲ rand() returns floating point number sampled uniformly at random from [0, 1)
        𝑖 = ⌊ length(𝐴) ∗ rand() ⌋ + 1
            ⊲ Uniformly at random draw from [1, length(𝐴)]
        while length(replicate) < length(𝐴) do
            replicate .= 𝐴_𝑖
                ⊲ Read 𝐴_𝑖, which is the 𝑖-th character in alignment 𝐴
                ⊲ Alignment characters 𝐴_𝑖 are one-indexed
            𝑖 += direction
            if (𝑖 ≤ 0) or (𝑖 > length(𝐴)) or (rand() < 𝛾) then
                direction ∗= −1
                    ⊲ Reflection of random walk
                if (𝑖 ≤ 0) or (𝑖 > length(𝐴)) then
                    𝑖 += direction ∗ 2
                        ⊲ Always reflect at start/end of alignment 𝐴
        replicates .= replicate
    return replicates
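The random walk of Algorithm 5.2 and the final averaging step of Algorithm 5.1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the released SERES or PHiMM implementation: the function names are hypothetical, columns are zero-indexed here, and the per-replicate probabilities are assumed to come from some external fixed-parameter PHiMM run. Including the estimate from the original alignment in the average follows the argument list of AverageProbability in Algorithm 5.1.

```python
import random


def seres_walk(n_sites, gamma, rng):
    """Sample one SERES replicate as a list of column indices (0-based).

    The walk starts at a uniformly random column with a uniformly random
    direction, always reflects at either end of the alignment, and reverses
    at interior steps with probability `gamma`.
    """
    i = rng.randrange(n_sites)
    direction = 1 if rng.random() > 0.5 else -1
    mapping = []
    while len(mapping) < n_sites:
        mapping.append(i)                  # read column i
        i += direction
        out_of_bounds = i < 0 or i >= n_sites
        if out_of_bounds or rng.random() < gamma:
            direction = -direction         # reflection of the walk
            if out_of_bounds:
                i += 2 * direction         # step back inside the alignment
    return mapping


def resample_alignment(alignment, mapping):
    """Build the resampled alignment by reading columns in walk order."""
    return ["".join(seq[j] for j in mapping) for seq in alignment]


def average_over_replicates(original_probs, replicate_probs, mappings):
    """Average per-site probabilities over the original analysis and every
    replicate position that maps back to that site."""
    totals = list(original_probs)
    counts = [1] * len(original_probs)
    for probs, mapping in zip(replicate_probs, mappings):
        for p, site in zip(probs, mapping):
            totals[site] += p
            counts[site] += 1
    return [total / count for total, count in zip(totals, counts)]


rng = random.Random(2025)                  # fixed seed for reproducibility
alignment = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTACGAAC"]
mapping = seres_walk(len(alignment[0]), gamma=0.1, rng=rng)
print(mapping)
print(resample_alignment(alignment, mapping))
```

Because the walk moves by exactly one column per step, neighboring columns in a replicate are always neighbors in the original alignment, which is the neighbor preservation property described in Section 5.1.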
The optimized model parameters are then used to perform fixed-parameter-value inference on the resampled SERES replicates. Finally, re-estimated posterior probability distributions are averaged across SERES replicate analyses to obtain a final inferred distribution. Consistent with the study of Wuyun et al. [122], the output of each SERES replicate is a sequence of modified posterior decoding probabilities for the columns of the input alignment 𝐴. For each site 𝑎_𝑡 (1 ≤ 𝑡 ≤ 𝐿) in the alignment 𝐴, the posterior probability distributions are aggregated across all the replicates in which the site appeared. The aggregated distribution is then averaged to obtain a valid probability distribution.

Figure 5.1 Illustrated example of a SERES random walk on an input multiple sequence alignment (MSA). The input of the SERES resampling algorithm is an MSA. The SERES resampling algorithm takes the form of a random walk. First, a start site and initial walk direction are chosen uniformly at random. In this example, the fourth site from the left is chosen as the start site, and the initial walk direction is rightward. The random walk proceeds, where an MSA site is sampled during each step of the walk. Walk reversals occur with certainty at the start and end of the input MSA and with probability 𝛾 at any other point of the walk. The walk concludes when the sampled replicate meets a sequence length criterion, namely when the number of sites in the sampled replicate and the input MSA are equal. In this example, a first reversal occurs in the interior of the input MSA and a second reversal occurs at the left boundary of the input MSA. The output consists of the sampled replicates. This figure comes from Wang et al. [138].

5.3 Materials

5.3.1 Simulation Data

The simulation study is used to evaluate the performance and applicability of the standalone PHiMM method and the SERES+PHiMM method, since we can track the true history of evolutionary events.
The simulation data are constructed using several tools: r8s [28], msmove [29], and seq-gen [32].

(1) Generation of model trees using r8s

A random rooted model tree can be generated using the r8s [28] tool, which utilizes the birth-death model for tree generation.

    #nexus
    begin r8s;
    simulate diversemodel=bdback seed= charevol=yes ntaxa= infinite=yes nreps=1;
    end;

where “diversemodel” means the model used for tree generation, which is generally set to the birth-death model denoted by “bdback”. “seed” specifies a random seed for the tree generation. “ntaxa” represents the number of taxa, which is set from 5 to 10 for comparison. “nreps” indicates the number of generated repeats. “charevol=yes” indicates that the model tree is output with branch lengths. “infinite=yes” indicates that branch lengths are set to the expected values based on rate and time.

The height of the resulting tree is scaled to ℎ by multiplying the length of each edge in the model tree by ℎ. Here, we set ℎ to 5.0 coalescent units. Furthermore, an outgroup is added to the generated tree at a coalescent time of 50.0.

(2) Generation of model networks

A model network can be generated by the steps listed in Hejase et al. [11]. With the model tree obtained from the above step, we then add 𝑟 reticulations (𝑟 ∈ {1, 2}) by iterating the following steps: a time 𝑡_𝑀 between 0 and the tree height is selected uniformly at random; two tree edges for which corresponding ancestral populations exist during a time interval [𝑡_𝐴, 𝑡_𝐵] such that 𝑡_𝑀 ∈ [𝑡_𝐴, 𝑡_𝐵] are randomly selected; and a reticulation at time 𝑡_𝑀 is added to connect the pair of tree edges. Similar to Leaché et al. [15], the model network can be further classified based upon whether gene flow is “deep” or “non-deep”, which is defined by the topological placement of reticulations, i.e., non-deep reticulations are placed between two leaf edges, while deep reticulations include all other reticulations.
(3) Generation of local genealogies using msmove

Given a model species network, 100 local genealogies are simulated using msmove [29] for independent and identically distributed (i.i.d.) loci following a species network under a multi-species network coalescent with recombination (MSNCwR) model. msmove is a modified version of the Monte Carlo simulator ms [30] that allows the tracking of migration events, whereas ms does not provide this annotation. Recombination is modeled using Hudson’s finite-sites recombination model [30].

    msmove -T -r -I -ej i_1 j_1 -ej i_2 j_2 ... -ej i_k j_k -ev i j

where the -T parameter indicates that the gene trees representing the history of the sampled taxa are output. The -I parameter is followed by 𝑘, the number of populations. The list of integers (n_1 n_2 ... n_k) gives the number of taxa sampled in each population. In this study, one allele is sampled from each taxon. The -r parameter sets the crossover (recombination) rate and the number of sites between which recombination can occur; here the recombination rate 𝜌 is set to 0.1, and the number of sites between which recombination can occur is set to 900. The -ej parameter specifies moving all lineages in population i to population j at time t. The -ev parameter is specific to msmove and sets migration at time t_m from population i to population j with migration probability 𝑥, which is 0.1 in this study.

The gene trees are then deviated from ultrametricity using the following steps [143]. First, a deviation factor, 𝑐, that quantifies the deviation level is determined. Then, for each edge in the model tree, its branch length is multiplied by 𝑒^𝑥, where 𝑥 is chosen uniformly at random from the interval [− lg(𝑐), lg(𝑐)]. Similar to Liu et al. [144], 𝑐 is set to 2.0 here.

(4) Simulation of sequences using seq-gen

The DNA sequences can be generated using seq-gen [32] under the general time-reversible (GTR) substitution model [67].
    seq-gen -mGTR -f -r -a -l -p < genetreefile > seqfile

where the -m parameter specifies the substitution model, here the general time-reversible model denoted by “GTR”. The -s parameter sets the mutation rate 𝜃 that scales the branch lengths so that they equal the expected number of substitutions per site for each branch. The mutation rate 𝜃 is set to 1. The -f parameter specifies the frequencies of the four nucleotides A, C, G, and T. The -r parameter sets the relative rates of substitution between nucleotides in the GTR model. The -a parameter specifies the shape of the Γ distribution. The -l parameter specifies the sequence length. Here, the simulated sequence length of each gene tree is set to 900 bp. The -p parameter sets the number of partitions. The genetreefile is the input file providing the gene trees. The seqfile is the output file with the sequences simulated under the given gene trees.

The GTR substitution model parameter values were estimated from empirical analyses of the mouse genomic sequence dataset from the study of Liu et al. [7]. We used all M. m. domesticus and M. spretus samples listed in “Table S1” of Liu et al. [7] and concatenated all chromosomes to obtain the sequence data. RAxML was used to perform concatenated phylogenetic MLE under the GTR model. The estimated parameters used in the seq-gen simulations include base frequencies: 𝜋𝐴 = 0.216, 𝜋𝐶 = 0.284, 𝜋𝐺 = 0.285, 𝜋𝑇 = 0.215; GTR matrix: 1.002820, 5.486349, 1.095939, 0.539209, 5.535556, and 1.000000 for 𝐴 ↔ 𝐶, 𝐴 ↔ 𝐺, 𝐴 ↔ 𝑇, 𝐶 ↔ 𝐺, 𝐶 ↔ 𝑇, and 𝐺 ↔ 𝑇, respectively; shape parameter 𝛼 of the Γ distribution: 0.543225. Then, coalescent times are converted into branch lengths using “Equation (3.1)” in Hein et al. [145].

For each model condition, the simulation procedure is repeated to obtain 30 replicate datasets. Model condition parameters and summary statistics for the simulated datasets are shown in Table 5.1.
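The deviation from ultrametricity described in step (3) amounts to one random multiplier per branch, which the following Python sketch makes concrete. It is a minimal illustration, not the code used in the study: the function name is hypothetical, branch lengths are passed as a plain list, and lg is assumed here to be the natural logarithm, so that each multiplier 𝑒^𝑥 lies between 1/𝑐 and 𝑐.

```python
import math
import random


def deviate_from_ultrametricity(branch_lengths, c, rng):
    """Multiply each branch length by e^x with x drawn uniformly from [-log(c), log(c)].

    Assumes the natural logarithm, so every multiplier falls in [1/c, c].
    """
    if c < 1.0:
        raise ValueError("the deviation factor c must be at least 1")
    bound = math.log(c)
    return [length * math.exp(rng.uniform(-bound, bound)) for length in branch_lengths]


# Example with the deviation factor c = 2.0 used in this study.
perturbed = deviate_from_ultrametricity([0.5, 1.0, 2.5, 4.0], 2.0, random.Random(3))
```

With 𝑐 = 1 the branch lengths are returned unchanged, and larger values of 𝑐 allow stronger departures from a strict molecular clock.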
Table 5.1 Statistics for the simulation dataset. The reticulation scenarios include one non-deep reticulation, two non-deep reticulations, and one deep reticulation. “Ntaxa” means the number of taxa. “Network Height” gives the height of the model network. “Total Sites” indicates the number of total sites in the simulation genomes. The p-distance of a pair of aligned sequences is calculated by dividing the number of sites where the two sequences had different nucleotides by the number of sites in which both sequences had nucleotides. “Average/Maximum p-dist” is the average/maximum p-distance of all pairs of aligned sequences in the simulation dataset. “Introgression (%)” shows the percent of introgressed sites over the genome. “Number of Gene Trees” indicates the number of true gene trees. “Average Branch Length” is the average branch length of the model network. The average and standard error (SE) are shown based on 30 replicates (represented as “average value ± SE value”).

Reticulation   Ntaxa  Total   Network  Average      Maximum      Introgression  Number of        Average Branch
Scenario              Sites   Height   p-dist (%)   p-dist (%)   (%)            Gene Trees       Length
one non-deep   5      90000   5        0.631±0.002  0.704±0.000  0.101±0.006    1196.800±14.416  2.014±0.049
one non-deep   6      90000   5        0.626±0.002  0.704±0.000  0.100±0.008    1227.900±12.191  1.834±0.045
one non-deep   7      90000   5        0.615±0.002  0.704±0.000  0.113±0.008    1256.800±16.283  1.532±0.047
one non-deep   8      90000   5        0.612±0.002  0.704±0.000  0.120±0.010    1260.933±18.156  1.505±0.045
one non-deep   9      90000   5        0.610±0.002  0.704±0.000  0.141±0.015    1270.967±10.680  1.441±0.047
one non-deep   10     90000   5        0.608±0.002  0.704±0.000  0.126±0.012    1351.367±22.363  1.383±0.047
one deep       5      90000   5        0.632±0.002  0.704±0.000  0.208±0.022    1231.433±16.601  2.006±0.065
one deep       6      90000   5        0.624±0.001  0.704±0.000  0.129±0.011    1219.767±16.295  1.757±0.044
one deep       7      90000   5        0.617±0.002  0.704±0.000  0.173±0.017    1250.933±12.212  1.607±0.044
one deep       8      90000   5        0.614±0.002  0.704±0.000  0.147±0.013    1304.167±12.799  1.510±0.049
one deep       9      90000   5        0.609±0.002  0.705±0.000  0.177±0.019    1337.233±15.688  1.400±0.041
one deep       10     90000   5        0.606±0.003  0.705±0.000  0.164±0.017    1289.033±19.507  1.350±0.047
two non-deep   5      90000   5        0.634±0.002  0.704±0.000  0.184±0.014    1208.633±13.822  2.082±0.055
two non-deep   6      90000   5        0.626±0.002  0.704±0.000  0.184±0.014    1237.133±13.853  1.870±0.055
two non-deep   7      90000   5        0.620±0.001  0.704±0.000  0.182±0.011    1258.067±15.582  1.658±0.036
two non-deep   8      90000   5        0.612±0.002  0.704±0.000  0.182±0.015    1255.467±19.117  1.463±0.049
two non-deep   9      90000   5        0.609±0.002  0.704±0.000  0.177±0.009    1290.900±18.244  1.436±0.041
two non-deep   10     90000   5        0.603±0.002  0.704±0.000  0.152±0.009    1280.533±15.278  1.322±0.041

5.3.2 Performance Assessments

To evaluate and compare the performance of different approaches on the simulation datasets, where the true history of evolutionary events can be tracked, we use two types of area under the curve (AUC): the area under the receiver-operating characteristic (ROC) curve and the area under the precision-recall (PR) curve, referred to simply as ROC-AUC and PR-AUC, respectively. The ROC curve plots the true positive rate, TP/(TP+FN), as a function of the false positive rate, FP/(FP+TN), at various threshold settings, while the PR curve similarly plots the precision, TP/(TP+FP), against the recall, TP/(TP+FN), at different thresholds, where TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively. The ROC-AUC and PR-AUC measures represent the tradeoff between type I errors and type II errors under different threshold values. The measures are calculated on all loci in the simulation datasets, where the true migration events are annotated by msmove with an asterisk.
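The two measures can be computed directly from per-site introgression probabilities and the true site labels, as the following Python sketch illustrates. It is a generic threshold-sweep calculation and not the evaluation script used in this study: the function name is hypothetical, ROC-AUC is integrated with the trapezoidal rule, and PR-AUC is summarized as average precision (a step-wise sum over recall increments), which is one common convention among several.

```python
def roc_auc_and_pr_auc(scores, labels):
    """Return (ROC-AUC, PR-AUC) for per-site scores and 0/1 labels (1 = introgressed).

    Thresholds are swept from the highest score downward; sites with tied scores
    are admitted together so that ties do not depend on input order.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    roc_auc = 0.0
    pr_auc = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr = tp / n_pos                     # recall = TP / (TP + FN)
        fpr = fp / n_neg                     # FP / (FP + TN)
        precision = tp / (tp + fp)           # TP / (TP + FP)
        roc_auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0   # trapezoid under the ROC curve
        pr_auc += (tpr - prev_tpr) * precision                 # average-precision step
        prev_tpr, prev_fpr = tpr, fpr
    return roc_auc, pr_auc
```

A perfectly ranked set of sites gives 1.0 for both measures, while a set in which every site receives the same score gives a ROC-AUC of 0.5 and a PR-AUC equal to the fraction of introgressed sites.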
After concatenating all query loci in one simulation replicate, the ROC-AUC and PR-AUC of a method are assessed based on the inferred probability that each site has an introgressive origin. Additionally, we report the memory usage and runtime in order to comprehensively evaluate the scalability of our framework.

5.4 Results

In this study, we compare the performance of SERES-based PHiMM and PHiMM on datasets with different numbers of taxa (5 to 10) and different model conditions (one non-deep reticulation, one deep reticulation, and two non-deep reticulations), since memory usage and runtime rise as the number of taxa, the number of reticulations, and the complexity of reticulations increase.

Figure 5.2 compares the area under the receiver-operating characteristic (ROC) curve (ROC-AUC), the area under the precision-recall curve (PR-AUC), runtime, and memory usage between PHiMM and SERES-based PHiMM on the 5- to 10-taxon model conditions with one non-deep reticulation, one deep reticulation, and two non-deep reticulations. In terms of predictive performance, both methods exhibit relatively high ROC-AUC and PR-AUC values for model conditions with one or two non-deep reticulations. The SERES+PHiMM method consistently achieves superior or comparable ROC-AUC and PR-AUC values across all numbers of taxa when compared to PHiMM alone. The improvement brought by the SERES-based PHiMM method is substantially larger for model conditions with one non-deep reticulation than for model conditions with two non-deep reticulations or one deep reticulation, where the performance gap between the two methods narrows. The model conditions with one deep reticulation show a different trend: the ROC-AUC and PR-AUC values for both methods tend to be lower than in the “non-deep” model conditions, suggesting that deep reticulations are more difficult to classify accurately.
Nevertheless, the SERES+PHiMM method still generally outperforms the PHiMM method, particularly for smaller numbers of taxa (5 to 8 taxa). Overall, the integration of SERES with PHiMM tends to improve or sustain predictive performance across varying numbers of taxa and reticulation scenarios, with the “non-deep” model conditions yielding more robust results than the “deep” model conditions.

The runtime for both methods increases as the number of taxa increases under model conditions with one non-deep reticulation. PHiMM has significantly lower runtimes than SERES+PHiMM across all numbers of taxa, because SERES+PHiMM runs on multiple SERES resampling replicates and aggregates the results. For instance, with 10 taxa, PHiMM takes approximately 50 hours, whereas SERES+PHiMM requires about 255 hours. The discrepancy becomes more pronounced with an increasing number of taxa, indicating that SERES+PHiMM scales less efficiently in these model conditions. In model conditions with one deep reticulation, a similar trend is observed: PHiMM consistently outperforms SERES+PHiMM in terms of runtime. For 10 taxa, PHiMM’s runtime is around 60 hours compared to over 320 hours for SERES+PHiMM. Again, the gap widens as the number of taxa increases, highlighting the higher cost of SERES+PHiMM on larger datasets under the deep model conditions. The two non-deep model conditions also show PHiMM with substantially lower runtimes than SERES+PHiMM. The runtimes for PHiMM remain comparatively low as the number of taxa increases, whereas SERES+PHiMM’s runtime escalates dramatically with the number of taxa. For example, with 10 taxa, PHiMM’s runtime is about 55 hours, while SERES+PHiMM’s runtime is close to 350 hours. Overall, the results indicate that PHiMM is significantly more efficient in runtime than SERES+PHiMM across all tested reticulation scenarios and numbers of taxa.
The efficiency gap between the two methods becomes more substantial as the number of taxa increases, particularly under model conditions with one deep reticulation or two non-deep reticulations, where both methods tend to require longer runtimes. These findings suggest that PHiMM is the more scalable and time-efficient method for large datasets and complex model conditions.

In terms of memory usage, both methods exhibit low memory usage for 5 taxa, with memory consumption increasing as the number of taxa rises. In model conditions with one non-deep reticulation, PHiMM and SERES+PHiMM show comparable memory usage, particularly at higher numbers of taxa, where both methods use approximately 60 GB at 10 taxa. In model conditions with one deep reticulation, memory consumption increases for both methods as the number of taxa grows, with PHiMM showing marginally higher memory usage than SERES+PHiMM for 6 to 10 taxa. In the two non-deep model conditions, both methods show a significant increase in memory consumption as the number of taxa rises. At 10 taxa, PHiMM reaches around 55 GB, while SERES+PHiMM is slightly higher, around 60 GB, indicating similar memory usage overall. Overall, PHiMM and SERES+PHiMM have similar memory consumption across all model conditions, with only slight differences that become more pronounced at higher numbers of taxa. This suggests that both methods have comparable memory efficiency.

Table 5.2 presents a detailed comparison of the area under the receiver-operating characteristic curve (ROC-AUC) values for the two methods, PHiMM and SERES+PHiMM, under varying reversal probabilities 𝛾 for model conditions involving one non-deep reticulation, one deep reticulation, and two non-deep reticulations. The analysis spans 5- to 10-taxon model conditions.
For model conditions with one non-deep reticulation, PHiMM consistently shows high ROC-AUC values, with slight variations as the number of taxa increases. SERES+PHiMM generally achieves higher ROC-AUC values than PHiMM, particularly at lower 𝛾 values. When two non-deep reticulations are considered, both methods show a decrease in performance as the 𝛾 value increases. However, the performance of SERES+PHiMM remains superior, especially at lower 𝛾 values, showing consistent ROC-AUC improvements across most 𝛾 values. The trend continues for model conditions with one deep reticulation, where both methods maintain relatively high ROC-AUC values. SERES+PHiMM outperforms PHiMM across most 𝛾 values and numbers of taxa, most noticeably at 𝛾 values of 0, 0.0001, and 0.001. The consistency in performance across different reticulation scenarios and 𝛾 values indicates that the accuracy gains of SERES+PHiMM are largely robust to the choice of reversal probability. Overall, Table 5.2 shows that SERES+PHiMM matches or improves on the ROC-AUC of PHiMM under most combinations of taxon count, reticulation scenario, and reversal probability.

Table 5.3 presents the comparison of the area under the precision-recall curve (PR-AUC) between the PHiMM and SERES+PHiMM methods across various reversal probabilities 𝛾 for each combination of reticulation scenario and number of taxa. For model conditions with one non-deep reticulation, SERES+PHiMM generally exhibits higher PR-AUC values than PHiMM, especially at lower 𝛾 levels. The improvement diminishes as 𝛾 increases. In scenarios with two non-deep reticulations, a similar trend is observed, but both methods display a notable decline in PR-AUC values.
For model conditions with one deep reticulation, SERES+PHiMM consistently shows superior performance compared to PHiMM across most 𝛾 levels, but the performance gap narrows at higher 𝛾 levels.

Table 5.4 provides a detailed comparison of the runtime (measured in hours) between the PHiMM and SERES+PHiMM methods under various reversal probabilities 𝛾 in 5- to 10-taxon model conditions. For scenarios with one non-deep reticulation, PHiMM generally exhibits shorter runtimes than SERES+PHiMM, with the difference becoming more pronounced as 𝛾 decreases. In scenarios involving two non-deep reticulations, SERES+PHiMM consistently shows significantly longer runtimes than PHiMM across all 𝛾 levels, especially for lower 𝛾 values, with the gap widening as the number of taxa increases. Similarly, in the one deep reticulation scenario, both methods tend to have increased runtimes due to the complexity of the model conditions. SERES+PHiMM’s runtime is substantially longer than PHiMM’s, particularly at lower 𝛾 levels. These results suggest that SERES+PHiMM is computationally more intensive, especially in complex reticulation scenarios and with lower reversal probabilities, while PHiMM is more time-efficient without a significant loss in accuracy, particularly in scenarios with more taxa.

Table 5.5 shows a comparative analysis of memory usage (measured in gigabytes, GB) between the PHiMM and SERES+PHiMM methods under different reversal probabilities 𝛾 for 5- to 10-taxon model conditions. For all three scenarios, PHiMM and SERES+PHiMM exhibit similar memory usage across all 𝛾 levels. Notably, for the SERES+PHiMM method, there is no significant difference in memory usage between different 𝛾 levels.
These findings suggest that SERES+PHiMM requires memory resources similar to those of PHiMM across different reticulation scenarios and reversal probabilities, indicating that SERES+PHiMM achieves better accuracy, particularly for lower 𝛾 values, at a comparable memory cost.

5.5 Discussion and Conclusion

This study introduces the application of SERES random walks on aligned sequences and shows that SERES can serve as a data perturbation technique to improve introgression inference and learning. The simulation experiments show that combining the SERES resampling approach with PHiMM substantially improves introgression inference compared to standalone PHiMM under the model conditions in our study, which suggests that SERES resampling and re-estimation has the potential to “boost” PHiMM’s inference accuracy.

Figure 5.2 The performance comparison between PHiMM and SERES+PHiMM on the 5- to 10-taxon model conditions with one non-deep reticulation, one deep reticulation, and two non-deep reticulations. The measures include the area under the receiver-operating characteristic curve (ROC-AUC), the area under the precision-recall curve (PR-AUC), runtime (Hour), and memory usage (GB). The SERES+PHiMM method was run with reversal probability 𝛾 = 0.0001. The average and standard error (SE) are calculated based on 30 replicates.

Table 5.2 The comparison of the area under the receiver-operating characteristic curve (ROC-AUC) between PHiMM and SERES+PHiMM among different reversal probabilities 𝛾 on the 5- to 10-taxon model conditions with one non-deep reticulation, one deep reticulation, and two non-deep reticulations. The parameter 𝛾 represents the probability of reversals in SERES resampling and is tested at several levels (0, 0.0001, 0.001, 0.005, 0.01, and 0.1). The “no 𝛾” column indicates results for a baseline method where reversals are not considered. “Ntaxa” means the number of taxa. The average and standard error (SE) are calculated based on 30 replicates (represented as “average value ± SE value”).

Reticulation           PHiMM        SERES+PHiMM
Scenario       Ntaxa   no 𝛾         𝛾 = 0        𝛾 = 0.0001   𝛾 = 0.001    𝛾 = 0.005    𝛾 = 0.01     𝛾 = 0.1
one non-deep   5       0.899±0.032  0.965±0.012  0.969±0.011  0.958±0.015  0.958±0.012  0.966±0.010  0.923±0.022
one non-deep   6       0.941±0.020  0.950±0.019  0.953±0.017  0.953±0.018  0.916±0.030  0.944±0.018  0.920±0.025
one non-deep   7       0.885±0.028  0.925±0.019  0.903±0.027  0.888±0.034  0.880±0.031  0.886±0.027  0.885±0.027
one non-deep   8       0.867±0.027  0.909±0.020  0.889±0.025  0.881±0.028  0.890±0.021  0.888±0.025  0.859±0.026
one non-deep   9       0.915±0.019  0.945±0.011  0.935±0.016  0.916±0.025  0.919±0.021  0.937±0.016  0.915±0.021
one non-deep   10      0.888±0.023  0.918±0.020  0.922±0.020  0.909±0.021  0.909±0.022  0.905±0.023  0.901±0.022
two non-deep   5       0.830±0.026  0.849±0.027  0.851±0.031  0.840±0.033  0.826±0.037  0.823±0.032  0.811±0.029
two non-deep   6       0.859±0.030  0.882±0.032  0.887±0.027  0.868±0.029  0.871±0.027  0.877±0.027  0.865±0.026
two non-deep   7       0.886±0.024  0.909±0.026  0.894±0.032  0.887±0.032  0.907±0.025  0.892±0.027  0.877±0.029
two non-deep   8       0.885±0.024  0.907±0.020  0.892±0.030  0.879±0.029  0.880±0.025  0.890±0.023  0.858±0.038
two non-deep   9       0.857±0.026  0.869±0.027  0.865±0.031  0.864±0.027  0.834±0.032  0.854±0.028  0.851±0.027
two non-deep   10      0.855±0.023  0.875±0.019  0.860±0.021  0.851±0.025  0.861±0.019  0.859±0.019  0.852±0.021
one deep       5       0.889±0.024  0.897±0.024  0.899±0.024  0.891±0.026  0.884±0.026  0.892±0.022  0.881±0.024
one deep       6       0.891±0.020  0.899±0.022  0.901±0.021  0.893±0.023  0.904±0.024  0.889±0.020  0.882±0.025
one deep       7       0.910±0.017  0.924±0.015  0.928±0.015  0.926±0.015  0.924±0.016  0.921±0.015  0.904±0.017
one deep       8       0.902±0.023  0.920±0.022  0.918±0.025  0.920±0.023  0.900±0.026  0.912±0.024  0.907±0.021
one deep       9       0.844±0.024  0.870±0.021  0.865±0.022  0.859±0.022  0.855±0.022  0.847±0.021  0.841±0.023
one deep       10      0.922±0.012  0.942±0.011  0.924±0.013  0.921±0.014  0.925±0.014  0.922±0.014  0.914±0.013

To evaluate to what extent the SERES resampling approach boosts the introgression inference accuracy, we compare the performances of the standalone PHiMM method and the combined SERES+PHiMM method. Across all the simulation model conditions, the SERES+PHiMM method has consistently better inference accuracy than the standalone PHiMM method. The improved performance obtained by combining PHiMM inference with SERES resampling and re-estimation is robust across different numbers of taxa and different reticulation scenarios. We also compare the runtime and memory usage of the two methods. We find that both methods have similar memory usage for different numbers of taxa and different reticulation scenarios. The SERES+PHiMM method has longer runtimes than the standalone PHiMM method.

We attribute these findings to two factors. First, the application of SERES resampling and re-estimation appears to be conducive to introgression inference. The SERES resampling approach has the ability to retain the sequence dependence in the input alignment to the resampled

Table 5.3 The comparison of the area under the precision-recall curve (PR-AUC) between PHiMM and SERES+PHiMM among different reversal probabilities 𝛾 on the 5- to 10-taxon model conditions with one non-deep reticulation, one deep reticulation, and two non-deep reticulations. The parameter 𝛾 represents the probability of reversals in SERES resampling and is tested at several levels (0, 0.0001, 0.001, 0.005, 0.01, and 0.1). The “no 𝛾” column indicates results for a baseline method where reversals are not considered. “Ntaxa” means the number of taxa. The average and standard error (SE) are calculated based on 30 replicates (represented as “average value ± SE value”).
Reticulation           PHiMM        SERES+PHiMM
Scenario       Ntaxa   no 𝛾         𝛾 = 0        𝛾 = 0.0001   𝛾 = 0.001    𝛾 = 0.005    𝛾 = 0.01     𝛾 = 0.1
one non-deep   5       0.806±0.040  0.876±0.032  0.880±0.029  0.865±0.036  0.860±0.031  0.861±0.034  0.764±0.053
one non-deep   6       0.819±0.047  0.865±0.046  0.834±0.050  0.862±0.046  0.808±0.054  0.825±0.050  0.788±0.052
one non-deep   7       0.730±0.057  0.761±0.052  0.738±0.059  0.752±0.055  0.712±0.058  0.733±0.052  0.719±0.055
one non-deep   8       0.660±0.052  0.711±0.051  0.687±0.053  0.678±0.061  0.690±0.051  0.682±0.053  0.612±0.056
one non-deep   9       0.750±0.045  0.811±0.041  0.794±0.047  0.760±0.051  0.764±0.048  0.799±0.047  0.735±0.050
one non-deep   10      0.690±0.054  0.784±0.045  0.769±0.047  0.733±0.053  0.740±0.048  0.741±0.049  0.740±0.048
two non-deep   5       0.639±0.053  0.647±0.053  0.666±0.054  0.681±0.049  0.645±0.057  0.636±0.053  0.625±0.052
two non-deep   6       0.685±0.056  0.707±0.058  0.694±0.062  0.660±0.061  0.647±0.058  0.692±0.056  0.653±0.055
two non-deep   7       0.741±0.051  0.781±0.052  0.754±0.058  0.731±0.059  0.777±0.050  0.737±0.056  0.720±0.053
two non-deep   8       0.676±0.058  0.715±0.052  0.712±0.056  0.662±0.059  0.664±0.056  0.701±0.052  0.659±0.057
two non-deep   9       0.626±0.054  0.658±0.051  0.638±0.056  0.637±0.049  0.604±0.056  0.628±0.048  0.632±0.051
two non-deep   10      0.600±0.052  0.617±0.048  0.603±0.047  0.605±0.049  0.612±0.044  0.607±0.046  0.594±0.045
one deep       5       0.745±0.051  0.771±0.050  0.786±0.049  0.773±0.051  0.742±0.052  0.762±0.046  0.740±0.050
one deep       6       0.751±0.045  0.771±0.046  0.780±0.045  0.768±0.047  0.798±0.045  0.753±0.042  0.756±0.046
one deep       7       0.773±0.042  0.800±0.040  0.807±0.039  0.793±0.041  0.806±0.041  0.794±0.038  0.755±0.043
one deep       8       0.801±0.036  0.834±0.031  0.831±0.033  0.838±0.036  0.801±0.035  0.827±0.033  0.801±0.034
one deep       9       0.671±0.045  0.701±0.043  0.697±0.046  0.690±0.047  0.676±0.045  0.666±0.044  0.661±0.046
one deep       10      0.787±0.028  0.830±0.028  0.793±0.036  0.781±0.033  0.786±0.033  0.769±0.036  0.761±0.032

Table 5.4 The comparison of runtime (Hour) between PHiMM and SERES+PHiMM among different reversal probabilities 𝛾 on the 5- to 10-taxon model conditions with one non-deep reticulation, one deep reticulation, and two non-deep reticulations.
The parameter 𝛾 represents the probability of reversals in SERES resampling and is tested at several levels (0, 0.0001, 0.001, 0.005, 0.01, and 0.1). The “no 𝛾” column indicates results for a baseline method where reversals are not considered. “Ntaxa” means the number of taxa. The average and standard error (SE) are calculated based on 30 replicates (represented as “average value ± SE value”). Reticulation Scenario one non-deep two non-deep one deep Ntaxa 5 6 7 8 9 10 5 6 7 8 9 10 5 6 7 8 9 10 PHiMM 𝛾 = 0.0001 𝛾 = 0 no 𝛾 20.031±1.828 16.546±1.216 1.615±0.108 45.421±4.291 4.715±0.454 49.991±4.406 96.198±7.236 13.589±1.263 138.952±11.148 152.545±8.533 25.038±1.869 234.342±14.003 36.166±1.938 307.725±19.453 195.459±10.506 47.028±1.958 387.345±22.982 255.875±11.187 46.201±4.475 35.666±3.726 3.213±0.341 6.008±0.497 67.373±5.306 62.110±4.795 16.251±1.480 159.016±13.826 109.736±10.768 27.248±1.985 273.218±20.707 171.304±14.714 44.984±2.276 424.090±17.552 290.139±11.489 57.054±2.142 526.604±16.344 324.596±12.752 38.941±3.025 38.677±3.422 3.128±0.234 7.639±0.560 84.079±6.956 83.086±6.566 17.409±1.263 179.234±12.396 152.101±10.097 28.617±1.689 277.617±17.329 200.300±12.639 41.516±2.337 404.211±22.775 277.049±14.920 55.080±1.916 508.594±26.932 348.707±12.763 SERES+PHiMM 𝛾 = 0.001 14.912±1.109 31.086±2.952 58.166±5.285 99.570±5.828 123.787±7.184 157.106±6.896 33.429±4.173 42.995±3.690 91.010±8.737 131.127±9.094 195.802±9.559 242.069±7.839 31.505±2.334 62.756±4.682 102.317±7.182 131.271±9.287 191.190±10.614 231.784±8.474 𝛾 = 0.005 15.251±1.472 26.044±2.399 53.561±4.949 70.428±4.312 86.693±5.498 119.800±4.739 29.445±3.340 39.200±3.932 78.998±7.451 104.789±8.835 168.118±7.621 200.630±10.444 40.511±2.762 64.538±4.975 89.755±6.701 125.695±9.284 160.965±9.449 196.986±7.220 𝛾 = 0.01 12.675±0.897 23.157±2.099 43.252±3.472 66.534±3.835 85.434±4.559 109.674±4.363 30.117±3.287 37.157±3.110 70.492±6.905 92.155±8.071 139.561±7.637 197.226±7.346 29.911±2.388 56.337±5.096 82.943±7.334 
109.313±8.128 149.415±8.548 186.991±7.114 𝛾 = 0.1 12.194±0.973 18.694±1.883 36.710±2.977 53.501±3.225 69.443±3.788 90.982±3.310 27.399±3.204 30.806±2.945 65.597±6.562 91.718±7.491 141.756±6.827 170.772±6.564 24.947±2.296 50.446±5.096 71.305±5.565 102.464±8.426 135.226±8.261 170.549±6.458

Table 5.5 The comparison of memory usage (GB) between PHiMM and SERES+PHiMM among different reversal probabilities 𝛾 on the 5- to 10-taxon model conditions with one non-deep reticulation, one deep reticulation, and two non-deep reticulations. The parameter 𝛾 represents the probability of reversals in SERES resampling and is tested at several levels (0, 0.0001, 0.001, 0.005, 0.01, and 0.1). The “no 𝛾” column indicates results for a baseline method where reversals are not considered. “Ntaxa” means the number of taxa. The average and standard error (SE) are calculated based on 30 replicates (represented as “average value ± SE value”).

Reticulation Scenario one non-deep two non-deep one deep Ntaxa 5 6 7 8 9 10 5 6 7 8 9 10 5 6 7 8 9 10 PHiMM no 𝛾 3.974±0.276 10.749±0.953 25.778±1.690 45.796±2.347 47.177±2.270 59.551±1.269 5.249±0.523 12.731±1.622 26.489±1.727 43.038±2.572 56.088±1.308 60.285±1.470 5.072±0.543 12.845±1.433 26.223±1.588 43.157±2.486 51.237±2.288 56.510±1.548 SERES+PHiMM 𝛾 = 0.005 4.733±0.394 𝛾 = 0.01 4.974±0.220 𝛾 = 0.0001 3.552±0.199 8.464±0.603 𝛾 = 0.1 𝛾 = 0.001 𝛾 = 0 5.697±0.607 3.771±0.299 8.116±0.901 14.667±1.367 10.828±0.672 10.873±0.829 11.137±0.820 10.043±0.790 29.367±2.186 25.978±1.447 24.698±2.531 27.031±2.311 27.012±2.305 27.410±2.239 48.084±2.383 47.791±2.098 42.071±2.391 45.293±2.299 41.971±2.167 46.498±2.653 55.784±1.776 48.073±2.073 48.495±2.165 47.093±2.766 46.954±1.888 48.179±1.942 58.761±2.496 58.189±2.046 54.531±1.372 56.675±1.390 55.667±1.521 58.262±1.212 4.994±0.506 8.481±1.018 7.913±0.843 6.254±0.762 15.422±1.600 10.278±1.021 13.111±1.266 8.364±0.688 26.709±2.116 27.908±1.500 24.325±1.920 22.528±1.806 24.571±1.610 23.876±1.749 44.987±2.474
43.014±2.326 41.804±2.565 40.568±2.152 42.366±2.337 44.181±2.484 55.065±1.511 52.504±1.271 53.388±1.730 54.329±1.430 55.747±1.586 57.242±2.377 58.654±1.360 58.981±1.224 58.310±1.434 60.466±2.537 63.296±1.447 60.150±1.259 7.317±0.669 5.166±0.606 19.203±2.126 15.900±1.665 16.421±1.690 12.690±1.370 14.750±1.922 12.010±1.584 33.512±2.154 25.357±1.920 28.008±1.715 29.564±2.369 27.848±1.992 25.206±1.180 43.338±2.153 44.520±2.404 43.925±2.811 48.347±2.414 44.036±2.373 44.941±2.765 56.058±1.783 52.744±2.306 53.838±1.825 61.073±1.933 52.754±2.121 58.877±1.985 59.617±1.799 59.161±1.292 56.849±1.327 58.594±1.345 58.694±1.494 60.472±1.506 6.147±0.742 10.026±0.976 6.739±0.698 9.579±0.721 6.321±0.504 5.855±0.455 7.557±0.602 5.949±0.614

replicates. In addition, the intra-sequence dependence among sites provides additional information on the historical evolutionary events, especially those that caused the dependence. Thus, introgression inference benefits greatly from the SERES resampling and re-estimation process. Second, the SERES resampling algorithm reveals uncertainties in the introgression inference. Incorrect introgression inferences are less repeatable across resampled replicates, so averaging over replicates tends to down-weight them. The accuracy of introgression inference by the SERES+PHiMM method is consistently better than that of standalone PHiMM for all model conditions of the simulated datasets, which indicates that the SERES resampling and re-estimation process produces consistently more correct introgression inferences.

Additional experiments performed to evaluate how the choice of the reversal probability impacts method performance indicate that SERES+PHiMM is robust to the choice of reversal probability 𝛾. The results are consistent with the original motivation for sequence-aware resampling and re-estimation. We note that smaller 𝛾 values mean that longer-distance sequential dependence is retained.
Our results suggest that longer-distance sequential dependence is critical to the performance of resampling and re-estimation for sequence-based inference problems.

CHAPTER 6

DACS: FAST AND ACCURATE ULTRA LARGE-SCALE COESTIMATION OF PHYLOGENETIC NETWORKS AND INTROGRESSIONS USING DIVIDE-AND-CONQUER

6.1 Introduction

Uncovering introgression events among a large number of taxa is a challenging task that demands both efficient computational algorithms and accurate phylogenetic models. In prior studies [12, 122], we introduced the PHiMM framework and demonstrated its efficacy in mapping introgression signals when a known phylogenetic network is available. Specifically, PHiMM uses the topology of an input network to construct the hidden Markov model (HMM) structure, enabling site-by-site inference of introgression probabilities. When the true network is known, as often assumed in simulation studies, PHiMM has been shown to maintain high accuracy while achieving substantial reductions in runtime and memory usage compared to existing methods such as PhyloNet-HMM.

In empirical applications, however, the ground-truth phylogenetic network is rarely available a priori. Consequently, network inference is typically guided by existing knowledge or domain expertise [12, 122]. While this approach is feasible for well-studied, small taxon sets, it becomes intractable for large sets of taxa for which only limited biological information is known. In such cases, robust mathematical and computational methods are needed for scalable network inference. Unfortunately, most current state-of-the-art phylogenetic network inference methods encounter severe computational bottlenecks with datasets containing more than 30 taxa [14]. As the number of taxa increases, the computational requirements grow rapidly, while the corresponding inference accuracy decreases.
These constraints highlight the necessity for novel, scalable techniques that can recover phylogenetic networks efficiently while maintaining high accuracy.

FastNet [11] is a recently developed tool that addresses these challenges by employing a divide-and-conquer strategy to infer phylogenetic networks from large-scale genomic datasets. The key insight lies in partitioning the full taxon set into smaller, more closely related subsets, inferring sub-network topologies, and then combining these partial solutions into a final network. This partitioning not only alleviates computational demands but also improves inference accuracy by reducing the evolutionary divergence within each subset.

Building upon these advances, we propose DACS (Divide-And-Conquer and Subsampling), a new algorithm specifically designed to detect introgression events more efficiently in large phylogenomic datasets by integrating a divide-and-conquer approach with a robust subsampling strategy. Unlike the standard PHiMM pipeline, DACS does not require a predefined network as input. Instead, it leverages FastNet’s network inference capabilities for large-scale analyses. The framework systematically addresses two fundamental challenges in introgression detection: (1) DACS reduces the computational overhead associated with analyzing a large number of taxa simultaneously; and (2) rather than relying on a single, potentially erroneous phylogenetic network, DACS integrates over multiple plausible topologies to reduce errors. As demonstrated by our simulation experiments and empirical analyses, this approach not only scales effectively to dozens or even hundreds of taxa but also maintains high accuracy in identifying introgressed genomic regions. By enabling large-scale, site-by-site introgression mapping without requiring prior knowledge of the true network or an optimal taxon sampling scheme, DACS opens new avenues for investigating complex evolutionary scenarios.
6.2 Methods

6.2.1 Problem Definition

The input must include an alignment of DNA sequences $A \in \{\mathrm{A}, \mathrm{C}, \mathrm{T}, \mathrm{G}\}^{K \times L}$, where $K$ is the number of taxa and $L$ is the length of the genomic sequence alignment. A phylogenetic network $\Psi$ is optional. If $\Psi$ is not provided, however, the number of reticulations in $\Psi$, denoted $R$, must at least be specified. Using that information, the FastNet method [11] can be used to infer a phylogenetic network $\Psi$.

To represent the species network $\Psi$, we use a collection of MUL-trees [19, 20]. Corresponding gene trees are assumed to be any rooted binary tree on $K$ leaves. Let $m$ and $n$ be the number of MUL-trees and gene trees, respectively. Thus, the total number of HMM hidden states is $m \times n$, with each state represented as a pair $(T_i, G_j)$, where $T_i$ is the $i$-th MUL-tree ($1 \le i \le m$) and $G_j$ is the $j$-th gene tree ($1 \le j \le n$).

Consistent with the studies of Liu et al. [12] and Wuyun et al. [122], the output of the problem is a sequence of modified posterior decoding probabilities for the columns of the input alignment $A$. At a site $a_t$ ($1 \le t \le L$) in the alignment $A$, the probability of $a_t$ having an introgressive origin is defined by:

$$p_t = \sum_{\substack{T_i \in \Omega_T \\ 1 \le j \le n}} P(\pi_t = (T_i, G_j) \mid A)$$

where $\Omega_T$ is the set of MUL-trees corresponding to introgression events and $\pi_t$ denotes the hidden state at site $t$.

6.2.2 Proposed Method

Previously, we introduced the PHiMM approach for introgression mapping. Our simulation results indicated that PHiMM reduces runtime and memory usage significantly compared to PhyloNet-HMM, while maintaining comparable accuracy. However, two key challenges remain:

1. Although PHiMM has largely reduced the runtime and memory usage, it cannot be run efficiently on large numbers of taxa. For instance, more than 300 GB of memory may be needed for just 30 taxa with sequence length >100 kb.

2. PHiMM requires a phylogenetic network as input to constitute the structure of the HMM.
In real-world datasets, however, the true phylogenetic network is generally unknown.

To address these challenges, we propose DACS (Divide-And-Conquer and Subsampling), which extends PHiMM by (1) leveraging the divide-and-conquer and subsampling schemes to focus on local reticulation events within smaller taxa subsets, and (2) incorporating FastNet to systematically reduce uncertainty in phylogenetic network inference. The pseudocode of DACS is given in Algorithm 6.1, and an illustration of the pipeline is provided in Figure 6.1.

6.2.2.1 The divide-and-conquer approach with subsampling

To alleviate PHiMM’s high computational demands on large datasets, we integrate a divide-and-conquer approach with subsampling.

Phylogenomic subsampling can be defined as a phylogenomic protocol in which loci are sampled at random to create different-sized locus-by-species matrices, with the goal of exploring the stability of a phylogenetic hypothesis [146]. Subsampling can also be performed on taxa, an approach that many studies have explored [146, 147] and that past studies often called jackknifing. Many kinds of subsampling have been employed throughout the history of phylogenetics and phylogenomics [146–149].

DACS begins with a phylogenetic network $\Psi$. If the true network $\Psi$ is unknown, it can be estimated using a suitable inference method such as FastNet (see the next Chapter for details). Once a network $\Psi$ is obtained, the algorithm identifies all reticulation edges, labeled $\Upsilon_1, ..., \Upsilon_R$ in the input network $\Psi$, where $R$ is the total number of reticulations in the input network. For each non-sister reticulation $\Upsilon_i$ ($1 \le i \le R$), DACS repeatedly draws a small random subset of taxa that contains the reticulation edge $\Upsilon_i$. The subsampling process begins by identifying the most recent common ancestor (MRCA) of all leaves under the putative reticulation.
From there, a specified subnetwork size $C$ (e.g., $C$ set to 7) is enforced by first retaining at least one “visible” [150, 151] leaf from the source node and one from the sink node of the reticulation. To fill out the rest of the subset to reach the size $C$, the remaining leaves are randomly sampled from the overall set, with the constraint that at least one of these sampled leaves must be under the MRCA but not under the reticulation nodes, to ensure that the subnetwork captures the non-sister reticulation event. By limiting each subset to a maximum size $C$ (e.g., 4 or 7 taxa), the method substantially reduces runtime and memory usage compared to analyzing the full dataset. This selective subsampling process is repeated $M$ times for each reticulation $\Upsilon_i$, resulting in multiple random subsamples that ensure the robustness of the final estimates to subsampling variation.

After subsampling, DACS applies the original PHiMM algorithm to each subset to infer the site-by-site posterior decoding probability distribution (i.e., introgression probabilities). PHiMM analyses are run using default settings, e.g., the number of iterations for model parameter learning is 300, and the number of runs is set to 10. Users also need to assign a gene tree truncation size $k_n$ for the HMM model of PHiMM. In our simulation study, we run PHiMM with the default setting, $k_n = 15$.

Concretely, for the $m$-th subsample of reticulation $\Upsilon_i$, DACS obtains an introgression probability $p^{(\Upsilon_i)}_{t,m}$ at each alignment site $t$. These probabilities are then merged across all $M$ subsamples for the same reticulation $\Upsilon_i$ using an “average” merge strategy:

$$p^{(\Upsilon_i)}_t = \frac{1}{M} \sum_{m=1}^{M} p^{(\Upsilon_i)}_{t,m}$$

This averaging step stabilizes the introgression probability estimates for each reticulation by reducing the influence of individual subsamples.
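The taxon-subsampling rule described earlier in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the DACS implementation: the function name `subsample_taxa` and its arguments are hypothetical, the leaf sets (visible leaves on the source and sink sides, leaves under the MRCA, all leaves) are assumed to have been extracted from the network beforehand, and the random fill follows the prose description by drawing from all remaining leaves.

```python
import random
from typing import Iterable, Optional, Set


def subsample_taxa(
    source_leaves: Iterable[str],
    sink_leaves: Iterable[str],
    mrca_leaves: Iterable[str],
    all_leaves: Iterable[str],
    size: int,
    rng: Optional[random.Random] = None,
) -> Set[str]:
    """Draw one taxon subset of `size` leaves around a single reticulation.

    Hypothetical helper: the four leaf sets are assumed inputs.
    """
    rng = rng or random.Random()
    source, sink = set(source_leaves), set(sink_leaves)
    mrca, everything = set(mrca_leaves), set(all_leaves)

    subset: Set[str] = set()
    # One visible leaf from the source side and one from the sink side.
    subset.add(rng.choice(sorted(source)))
    subset.add(rng.choice(sorted(sink)))

    # At least one leaf under the MRCA but outside both reticulation nodes,
    # so that the subnetwork keeps the reticulation non-sister.
    bridge = sorted(mrca - source - sink)
    if not bridge:
        raise ValueError("no leaf under the MRCA outside the reticulation nodes")
    subset.add(rng.choice(bridge))

    # Fill the rest uniformly at random, without replacement.
    remaining = sorted(everything - subset)
    needed = size - len(subset)
    if needed < 0 or needed > len(remaining):
        raise ValueError("subset size is incompatible with the available leaves")
    subset.update(rng.sample(remaining, needed))
    return subset
```

Calling this function $M$ times with independent random draws would give the $M$ subsamples per reticulation; sampling without replacement avoids the retry loop that a draw-and-check scheme needs.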
Next, DACS merges the introgression probabilities from different reticulations $\Upsilon_1, ..., \Upsilon_R$ via a “maximum” merge strategy:

$$p_t = \max_{1 \le i \le R} p^{(\Upsilon_i)}_t$$

This yields a final site-by-site introgression probability distribution $p_t$ for each alignment column after considering all reticulation events in $\Psi$. This choice of merge function identifies the largest introgression signal across any of the detected reticulations at each site.

By focusing only on small subsets of taxa around each reticulation, we drastically reduce the memory and computational time. Meanwhile, repeated subsampling ensures that the resulting estimates are robust to sampling variance.

6.2.2.2 Bypassing the need for a predefined network input: Integration with FastNet

While the procedure above applies directly when a phylogenetic network $\Psi$ is provided, real-world datasets often lack a ground-truth network. To account for this network uncertainty, FastNet [11] can be used to infer one or more candidate phylogenetic networks from sequence alignments. FastNet provides a set of top-ranked network topologies $\Psi_1, ..., \Psi_N$. Each candidate network has a log pseudolikelihood score (defined by Equation (1) in the FastNet paper). When selecting candidate networks, one can either choose the top $N$ scoring networks or choose all networks whose log pseudolikelihood score lies within a specified $\Delta$% range of the best score. To avoid including too few or too many networks, here we set a lower limit of 5 and an upper limit of 15 topologies. This procedure ensures that moderately well-supported network topologies are not excluded while preventing an unmanageably large set of candidates.

For each guide network $\Psi_j$ ($1 \le j \le N$) provided by FastNet, the above introgression mapping procedure (PHiMM with the divide-and-conquer and subsampling strategy) is run to obtain an introgression probability distribution $\{p_{t,j}\}$.
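The per-network procedure referred to here ends with two merge steps: an average over the $M$ subsamples of each reticulation, followed by a site-wise maximum over reticulations. A minimal sketch of those two steps is given below. The function names and the nested-list layout (`per_reticulation[i][m][t]`) are assumptions made for illustration; the probabilities themselves are assumed to come from PHiMM runs that are not shown.

```python
from typing import List, Sequence


def average_merge(subsample_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-site probabilities over the M subsamples of one reticulation."""
    n_subsamples = len(subsample_probs)
    if n_subsamples == 0:
        raise ValueError("at least one subsample is required")
    # zip(*rows) walks the alignment column by column (site t).
    return [sum(site) / n_subsamples for site in zip(*subsample_probs)]


def max_merge(reticulation_probs: Sequence[Sequence[float]]) -> List[float]:
    """Take the site-wise maximum over the R reticulations of one network."""
    if len(reticulation_probs) == 0:
        raise ValueError("at least one reticulation is required")
    return [max(site) for site in zip(*reticulation_probs)]


def merge_one_network(
    per_reticulation: Sequence[Sequence[Sequence[float]]],
) -> List[float]:
    """Combine per_reticulation[i][m][t] into one probability per site t."""
    averaged = [average_merge(subsamples) for subsamples in per_reticulation]
    return max_merge(averaged)
```

All inner lists are assumed to have the same length $L$. With two reticulations, two subsamples each, and two sites, `merge_one_network([[[0.2, 0.4], [0.4, 0.8]], [[0.5, 0.1], [0.7, 0.1]]])` averages to `[0.3, 0.6]` and `[0.6, 0.1]` (up to floating-point rounding) before taking the site-wise maximum.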
Since different networks may vary in quality, DACS assigns a weight $\omega_j$ to each guide network $\Psi_j$ according to its relative log pseudolikelihood:

$$\omega_j = \frac{\max_{1 \le k \le N} \log L(\Psi_k, \Gamma \mid G)}{\log L(\Psi_j, \Gamma \mid G)}$$

Here, $L(\Psi, \Gamma \mid G)$ is the pseudo-likelihood of phylogenetic network $\Psi$ and inheritance probabilities $\Gamma$ given a set of gene trees $G$ (see Chapter 2.1.5). Because log pseudolikelihoods are negative, the best-scoring network receives a weight of 1 and every other network a weight of at most 1. This ensures that networks with higher log pseudolikelihood are more influential.

Finally, DACS combines the introgression probabilities from all candidate networks into a single consensus distribution $\tilde{p}_t$ by adopting a weighted average approach:

$$\tilde{p}_t = \frac{\sum_j \omega_j \, p_{t,j}}{\sum_j \omega_j}$$

Users may alternatively adopt other combination functions (e.g., simple averaging, median, or maximum) in place of the weighted average as deemed appropriate. This multi-network strategy greatly reduces the risk of relying on a single, potentially inaccurate network and captures reticulation signals that might be missed or incorrectly represented by any one topology. It therefore typically improves robustness, particularly for large-scale or complex datasets where reticulation signals can be subtle or where gene flow may be multilayered.

The end result is a site-by-site introgression probability distribution that integrates over potential reticulation events, multiple subsamplings of large sets of taxa, and uncertainty in network inference. Therefore, these steps provide a scalable, memory-efficient, and robust pipeline for detecting introgression events in large phylogenomic datasets while explicitly accounting for phylogenetic network uncertainties.

Figure 6.1 The DACS pipeline. (A) Starting with a multiple DNA sequence alignment, if a phylogenetic network is not already available, FastNet is used to infer a set of top-ranked candidate networks. (B) Each inferred network is then decomposed around its reticulation events and subjected to repeated subsampling (subsets of taxa around each reticulation) to reduce computational demands. PHiMM is run on each subsample, yielding site-by-site introgression probabilities, which are then averaged per reticulation (average merge). Subsequently, the results for all reticulations in a single network are combined (max merge) to produce an introgression distribution for that network. (C) Finally, to account for uncertainty in network inference, the site-by-site distributions from all candidate networks are combined via a weighted average to generate the final introgression probability profile along the genome alignment (bottom). The pipeline thus integrates the possibility of multiple network topologies, subsampling for computational efficiency, and statistical merging steps to yield robust introgression probability estimates.

Algorithm 6.1 DACS

procedure DACSwithInputNetwork(Ψ, A)
    ⊲ Ψ: Phylogenetic network
    ⊲ A_{K×L}: Multiple sequence alignment with K aligned sequences and L columns
    Υ_1, ..., Υ_R ← IdentifyReticulations(Ψ)    ⊲ R: Total number of reticulations in Ψ
    for Υ_i from Υ_1 to Υ_R do
        {p_t^(Υ_i)} ← 0
        for m from 1 to M do    ⊲ M: Number of replicates
            Ψ_m^(Υ_i), A_m^(Υ_i) ← SubsampleTaxa(Ψ, A, Υ_i, C)    ⊲ C: Maximum size of subsampling
            {p_{t,m}^(Υ_i)} ← PHiMM(Ψ_m^(Υ_i), A_m^(Υ_i))
            {p_t^(Υ_i)} ← {p_t^(Υ_i)} + {p_{t,m}^(Υ_i)}
        {p_t^(Υ_i)} ← {p_t^(Υ_i)} / M
    {p_t} ← max_{1≤i≤R} {p_t^(Υ_i)}
    return {p_t}_{1≤t≤L}

procedure DACSwithoutInputNetwork(A, R)
    ⊲ R: Number of reticulations in the network
    ⊲ A_{K×L}: Multiple sequence alignment with K aligned sequences and L columns
    Ψ_1, L(Ψ_1, Γ|G), ..., Ψ_N, L(Ψ_N, Γ|G) ← FastNet(A, R, N)    ⊲ N: Number of candidate networks
    ⊲ L(Ψ, Γ|G): Pseudo-likelihood of phylogenetic network Ψ and inheritance probabilities Γ given a set of gene trees G
    for j from 1 to N do
        {p_{t,j}} ← DACSwithInputNetwork(Ψ_j, A)
        ω_j ← max_{1≤k≤N} log(L(Ψ_k, Γ|G)) / log(L(Ψ_j, Γ|G))
    {p̃_t} ← Σ_j ω_j × {p_{t,j}} / Σ_j ω_j
    return {p̃_t}_{1≤t≤L}

6.3 Materials

6.3.1 Simulation Data

The simulation study is used to evaluate the performance and applicability of the proposed DACS method, since we can track the true history of evolutionary events. The simulation data are constructed through various tools, such as r8s [28], msmove [29], and seq-gen [32].

(1) Generation of model trees using r8s

A random rooted model tree can be generated using the r8s [28] tool, which utilizes the birth-death model for tree generation.
    #nexus
    begin r8s;
    simulate diversemodel=bdback seed= charevol=yes ntaxa= infinite=yes nreps=1;
    end;

where “diversemodel” specifies the model used for tree generation, which is generally set to the birth-death model denoted by “bdback”. “seed” specifies a random seed for the tree generation. “ntaxa” represents the number of taxa, which is set to 10, 20, or 100 for comparison. “nreps” indicates the number of generated replicates. “charevol=yes” indicates that the model tree is output with branch lengths. “infinite=yes” indicates that branch lengths are set to the expected values based on rate and time. The height of the resulting tree is scaled to $h$ by multiplying the length of each edge in the model tree by $h$. Here, we set $h$ to 5.0 coalescent units. Furthermore, an outgroup is added to the generated tree at 50.0 coalescent time.

Algorithm 6.2 Subsampling

procedure SubsampleTaxa(Ψ, A, Υ, C)
    ⊲ Ψ: Phylogenetic network
    ⊲ A_{K×L}: Multiple sequence alignment with K aligned sequences and L columns
    ⊲ Υ: Reticulation
    ⊲ C: Maximum size of subsampling
    source_leaves ← GetSourceVisibleLeaves(Ψ, Υ)
    sink_leaves ← GetSinkVisibleLeaves(Ψ, Υ)
    MRCA_leaves ← GetMRCAleaves(Ψ, Υ)
    all_leaves ← GetAllLeaves(Ψ)
    subset ← {}
    subset.Add(RandomChoice(source_leaves))
    subset.Add(RandomChoice(sink_leaves))
    subset.Add(RandomChoice(MRCA_leaves − source_leaves − sink_leaves))
    while Size(subset) < C do
        candidate_leaf ← RandomChoice(all_leaves)
        if candidate_leaf ∉ MRCA_leaves + source_leaves + sink_leaves then
            subset.Add(candidate_leaf)
    Ψ^(Υ) ← GetSubnetwork(Ψ, Υ, subset)
    A^(Υ) ← GetSubsequences(A, subset)
    return Ψ^(Υ), A^(Υ)

(2) Generation of model networks

A model network can be generated by the steps listed in Hejase et al. [11].
With the model tree obtained from the above step, we then add $r$ reticulations ($r \in \{1, 2, 3, 4, 5\}$) by iterating the following steps: a time $t_M$ between 0 and the tree height is selected uniformly at random; two tree edges for which corresponding ancestral populations exist during a time interval $[t_A, t_B]$ such that $t_M \in [t_A, t_B]$ are randomly selected; and a reticulation at time $t_M$ is added to connect the pair of tree edges. Similar to Leaché et al. [15], the model network can be further classified based upon whether gene flow is “deep” or “non-deep”, which is defined by the topological placement of reticulations, i.e., non-deep reticulations are placed between two leaf edges, while deep reticulations include all other reticulations.

(3) Generation of local genealogies using msmove

Given a model species network, 100 local genealogies can be simulated using msmove [29] for independent and identically distributed (i.i.d.) loci following a species network under a multi-species network coalescent with recombination (MSNCwR) model. msmove is a modified version of the Monte Carlo simulator ms [30] that allows the tracking of migration events, an annotation that ms does not provide. Recombination is modeled using Hudson’s finite-sites recombination model [30].

    msmove -T -r -I -ej i_1 j_1 -ej i_2 j_2 ... -ej i_k j_k -ev i j

where the -T parameter indicates that the gene trees representing the history of the sampled taxa are output. The -I parameter is followed by $k$, which represents the number of populations. The list of integers (n_1 n_2 ... n_k) gives the number of taxa sampled in each population. In this study, one allele is sampled from each taxon. The -r parameter is used to set recombination by the crossover rate and the number of sites between which recombination can occur, where the crossover rate or recombination rate $\rho$ is set to 0.1, and the number of sites between which recombination can occur is set to 900.
The -ej parameter specifies moving all lineages in population i to population j at time t. The -ev parameter is specific to msmove; it sets migration at time t_m from population i to population j, with the migration probability $x = 0.1$ in this study.

Then, the gene trees are deviated from ultrametricity by the following steps [143]. First, a deviation factor, $c$, that quantifies the deviation level is determined. Then, for each edge in the model tree, its branch length is multiplied by $e^x$, where $x$ is chosen uniformly at random from the interval $[-\lg(c), \lg(c)]$. Similar to Liu et al. [144], $c$ is set to 2.0 here.

(4) Simulation of sequences using seq-gen

The DNA sequences can be generated using seq-gen [32] under the general time-reversible (GTR) substitution model [67].

    seq-gen -mGTR -f -r -a -l -p < genetreefile > seqfile

where the -m parameter specifies the general time-reversible (GTR) substitution model, denoted by “GTR”. The -s parameter sets the mutation rate $\theta$, which scales the branch lengths so that they equal the expected number of substitutions per site for each branch. The mutation rate $\theta$ is set to 0.1, 0.5, 1, 2, 5, and 10 for comparison. The default setting is 0.5. The -f parameter specifies the frequencies of the four nucleotides A, C, G, and T. The -r parameter sets the relative rates of substitutions between nucleotides in a GTR model. The -a parameter specifies the shape of the $\Gamma$ distribution. The -l parameter specifies the sequence length. Here, the simulated sequence length of each gene tree is set to 900 bp. The -p parameter sets the number of partitions. The genetreefile is the input file providing the gene trees. The seqfile is the output file with simulated sequences under the given gene trees.

The GTR substitution model parameter values were estimated from empirical analyses of the mouse genomic sequence dataset of Liu et al. [7]. We used all M. m. domesticus and M. spretus samples listed in “Table S1” of Liu et al.
[7] and concatenated all chromosomes to obtain the sequence data. RAxML was used to perform concatenated phylogenetic MLE under the GTR model. The estimated parameters used in seq-gen simulations include base frequencies: $\pi_A = 0.216$, $\pi_C = 0.284$, $\pi_G = 0.285$, $\pi_T = 0.215$; GTR matrix: 1.002820, 5.486349, 1.095939, 0.539209, 5.535556, and 1.000000 for $A \leftrightarrow C$, $A \leftrightarrow G$, $A \leftrightarrow T$, $C \leftrightarrow G$, $C \leftrightarrow T$, and $G \leftrightarrow T$, respectively; and shape parameter $\alpha$ of the $\Gamma$ distribution: 0.543225. Then, coalescent times are converted into branch lengths using “Equation (3.1)” in Hein et al. [145].

For each model condition, the simulation procedure is repeated to obtain 20 replicate datasets. Model condition parameters and summary statistics for simulated datasets are shown in Table 6.1.

Table 6.1 Statistics for the simulation dataset. The reticulation scenarios include solely non-deep reticulations, solely deep reticulations, and a combination of both non-deep and deep reticulations. “Ntaxa” means the number of taxa. “Network Height” gives the height of the model network. “Total Sites” indicates the number of total sites in the simulation genomes. The p-distance of a pair of aligned sequences is calculated by dividing the number of sites where the two sequences had different nucleotides by the number of sites in which both sequences had nucleotides. “Average/Maximum p-dist” is the average/maximum p-distance of all pairs of aligned sequences in the simulation dataset. “Introgression (%)” shows the percent of introgressed sites over the genome. “Average Branch Length” is the average branch length of the model network. The average and standard error (SE) are shown based on 20 replicates.
| Ntaxa | Reticulation Scenario | Total Sites | Network Height | Average p-dist (%): Average | SE | Maximum p-dist (%): Average | SE | Introgression (%): Average | SE | Average Branch Length: Average | SE |
| 10 | 1 non-deep | 90000 | 5 | 55.625 | 0.209 | 68.500 | 0.020 | 13.450 | 1.488 | 1.381 | 0.041 |
| 10 | 1 deep | 90000 | 5 | 55.912 | 0.276 | 68.403 | 0.026 | 14.850 | 1.875 | 1.432 | 0.050 |
| 10 | 2 non-deep | 90000 | 5 | 55.540 | 0.368 | 68.519 | 0.026 | 16.900 | 1.525 | 1.366 | 0.054 |
| 10 | 1 non-deep + 1 deep | 90000 | 5 | 56.094 | 0.210 | 68.452 | 0.023 | 21.000 | 1.515 | 1.469 | 0.049 |
| 20 | 1 non-deep | 90000 | 5 | 53.671 | 0.329 | 68.522 | 0.029 | 9.550 | 0.872 | 1.055 | 0.047 |
| 20 | 1 deep | 90000 | 5 | 53.187 | 0.221 | 68.548 | 0.024 | 19.250 | 1.745 | 0.941 | 0.036 |
| 20 | 2 non-deep | 90000 | 5 | 53.288 | 0.291 | 68.514 | 0.020 | 15.350 | 1.024 | 0.987 | 0.048 |
| 20 | 1 non-deep + 1 deep | 90000 | 5 | 53.304 | 0.208 | 68.529 | 0.016 | 17.750 | 1.710 | 0.968 | 0.032 |
| 100 | 1 non-deep | 90000 | 5 | 52.657 | 0.136 | 68.595 | 0.025 | 9.150 | 1.057 | 0.596 | 0.019 |
| 100 | 2 non-deep | 90000 | 5 | 52.519 | 0.175 | 68.597 | 0.027 | 11.550 | 1.077 | 0.593 | 0.021 |
| 100 | 3 non-deep | 90000 | 5 | 52.501 | 0.302 | 68.572 | 0.015 | 11.700 | 0.995 | 0.624 | 0.023 |
| 100 | 4 non-deep | 90000 | 5 | 52.438 | 0.249 | 68.573 | 0.025 | 15.250 | 0.820 | 0.625 | 0.028 |
| 100 | 5 non-deep | 90000 | 5 | 52.535 | 0.252 | 68.599 | 0.024 | 13.737 | 0.670 | 0.588 | 0.025 |
| 100 | 1 non-deep + 1 deep | 90000 | 5 | 52.245 | 0.303 | 68.590 | 0.025 | 15.350 | 1.044 | 0.596 | 0.025 |
| 100 | 2 deep | 90000 | 5 | 52.278 | 0.278 | 68.605 | 0.029 | 15.350 | 0.844 | 0.605 | 0.020 |
| 100 | 3 deep | 90000 | 5 | 52.403 | 0.178 | 68.576 | 0.026 | 18.100 | 1.013 | 0.606 | 0.020 |

6.3.2 Mosquito Data

We downloaded the MAF whole-genome alignment from high-depth field samples of Anopheles species from Dryad (doi: 10.5061/dryad.f4114) [152]. The data consist of one genome from each of the species An. gambiae, An. coluzzii, An. arabiensis, An. quadriannulatus, An. merus, An. melas, An. christyi, and An. epiroticus. Since the DACS method can be applied to dozens of taxa, we added more background mosquito species from the study of Neafsey et al. [153]. Following the alignment strategy in the studies of Neafsey et al. [153] and Fontaine et al. [152], we use An.
gambiae PEST (AgamP3) as the reference, i.e., all other species assemblies are mapped to the An. gambiae PEST (AgamP3) reference assembly. The all-against-all pairwise LASTZ (https://lastz.github.io/lastz/) alignments are projected to ensure that the reference species is “single-coverage”, i.e., in any pairwise alignment, regions of the reference species may only be present once. Subsequent projection steps are then performed as guided by the species dendrogram (Figure 6.2) to progressively combine the pairwise alignments, and then the multiple alignments, until they encompass the complete phylogeny of all 19 species.

To avoid the impact of input phylogenetic networks on introgression detection, we directly used the species network from the studies of Neafsey et al. [153] and Fontaine et al. [152] for our introgression analyses (see Figure 6.2). Finally, we applied DACS to the 2L, 2R, 3L, 3R, and X chromosomal arms of 19 mosquito species. DACS was run with the same settings used in the simulation study, except that the number of iterations for model parameter learning was increased to 1000 (from 300).

Figure 6.2 The phylogenetic network used on mosquito data. The phylogenetic network captures introgression from An. merus to An. quadriannulatus. Branch colors indicate genus information: orange for 8 Pyretophorus species, blue for 5 Neocellia+Myzomyia species, green for 2 Neomyzomyia species, pink for 2 Anopheles species, and cyan for 2 Nyssorhynchus species.

Table 6.2 Mosquito species used in this study. We used existing mosquito datasets from previous studies. Each row lists the species name and the corresponding data source. References to original sources are also included for traceability.

| Species | Source | Reference |
| An. arabiensis | Dryad (doi: 10.5061/dryad.f4114) | [152] |
| An. christyi | Dryad (doi: 10.5061/dryad.f4114) | [152] |
| An. coluzzii | Dryad (doi: 10.5061/dryad.f4114) | [152] |
| An. epiroticus | Dryad (doi: 10.5061/dryad.f4114) | [152] |
| An. gambiae | Dryad (doi: 10.5061/dryad.f4114) | [152] |
| An. melas | Dryad (doi: 10.5061/dryad.f4114) | [152] |
| An. merus | Dryad (doi: 10.5061/dryad.f4114) | [152] |
| An. quadriannulatus | Dryad (doi: 10.5061/dryad.f4114) | [152] |
| An. albimanus | GenBank Assembly: GCA_000349125.2 | [153] |
| An. atroparvus | GenBank Assembly: GCA_000473505.1 | [153] |
| An. culicifacies | GenBank Assembly: GCA_000473375.1 | [153] |
| An. darlingi | GenBank Assembly: GCA_943734745.1 | [153] |
| An. dirus | GenBank Assembly: GCA_000349145.1 | [153] |
| An. farauti | GenBank Assembly: GCA_000473445.2 | [153] |
| An. funestus | GenBank Assembly: GCA_000349085.1 | [153] |
| An. maculatus | GenBank Assembly: GCA_000473185.1 | [153] |
| An. minimus | GenBank Assembly: GCA_000349025.1 | [153] |
| An. sinensis | GenBank Assembly: GCA_000472065.2 | [153] |
| An. stephensi | GenBank Assembly: GCA_000349045.1 | [153] |

6.3.3 Darwin’s Finch Data

As part of our empirical study, we re-analyzed Darwin’s finch genomic sequence data originally published by Lamichhaney et al. [154]. We began by downloading the original Illumina HiSeq2000 paired-end read data from the NCBI SRA database (accession number PRJNA263122). From these data, we randomly selected one sample per species, yielding a dataset of 20 samples. We also downloaded the assembled whole-genome sequence and gene annotations for the medium ground finch, Geospiza fortis, from the GigaDB database (http://gigadb.org/dataset/100040).

Next-generation sequencing (NGS) read alignment, quality filtering, and variant calling steps followed the study of Lamichhaney et al. [154]. First, paired-end reads for each sample were aligned to the G. fortis reference genome using BWA version 0.7.17 with default parameters [155]. The medium ground finch genome assembly contains 27,239 scaffolds unassigned to chromosomes. Quality filtering, post-processing, and variant calling tasks were then performed using SAMtools [156]. Identified variants consisted of SNPs and short indel polymorphisms. To obtain haplotype sequences, bi-allelic SNPs were phased using fastPHASE version 1.4.8 [132].
The phased calls for bi-allelic SNPs were combined with the genotypic data of homozygous multi-allelic SNPs and homozygous indel polymorphisms. The heterozygous multi-allelic SNPs and heterozygous indels, representing less than 1% of the input data, were treated as missing data.

To obtain a species phylogeny for our introgression analyses, we used FastNet [11] to infer species networks from the genomic alignment of all 27,239 scaffolds. The species tree listed in Lamichhaney et al. [154] was used as a starting tree, and the number of reticulations was set to 2 for FastNet’s local search heuristics, with maximum likelihood estimation (MLE) optimization performed using the PhyloNet software package [13, 123]. As shown in Figure 6.3, we used the top 15 two-reticulation species networks as input to DACS. These networks include introgression between warbler finches and non-warbler finches, introgression between G. propinqua_G and tree finches, and introgression between G. propinqua_G and G. conirostris_E, consistent with previous reports of extensive hybridization and gene flow among Darwin’s finches [154, 157].

Finally, we applied DACS to the 469 largest scaffolds, each with sequence length greater than 100 kb, for computational efficiency. DACS was run with the same settings used in the simulation study, except that the number of iterations for model parameter learning was increased to 1000 (from 300). The summarized statistics of these datasets can be found in Table 6.3.

Figure 6.3 The top 15 phylogenetic networks inferred by FastNet on Darwin’s finch data. Branch colors indicate species groupings: purple for warbler finches, yellow for vegetarian finch, cyan for cocos finch, red for sharp-beaked ground finches, green for tree finches, blue for all other ground finches, and black for outgroups.

Table 6.3 Summary of samples of Darwin’s finches and outgroup species. The table lists key metadata for Darwin’s finches and their outgroups. The “Classification” column indicates whether a sample belongs to Darwin’s finches or an outgroup species. “Sample Name” provides a unique label assigned to each sample. “Common Name” and “Species (Alias)” detail the vernacular name (e.g., large ground finch) and the corresponding scientific name (e.g., Geospiza magnirostris). The “Island (Abbreviation)” column identifies the sampling site for each species (e.g., Daphne Major, abbreviated “M”). “NCBI Accession” refers to the publicly available sequencing record associated with each sample.

| Classification | Sample Name | Common Name | Species (Alias) | Island (Abbreviation) | NCBI Accession |
| Darwin’s finches | G. magnirostris_G | Large ground finch | Geospiza magnirostris | Genovesa (G) | SRR1607485 |
| Darwin’s finches | G. fortis_M | Medium ground finch | Geospiza fortis | Daphne Major (M) | SRR1607458 |
| Darwin’s finches | G. fuliginosa_Z | Small ground finch | Geospiza fuliginosa | Santa Cruz (Z) | SRR1607462 |
| Darwin’s finches | G. propinqua_G | Large cactus finch | Geospiza propinqua | Genovesa (G) | SRR1607365 |
| Darwin’s finches | G. conirostris_E | Large cactus finch | Geospiza conirostris | Española (E) | SRR1607296 |
| Darwin’s finches | G. scandens_M | Common cactus finch | Geospiza scandens | Daphne Major (M) | SRR1607547 |
| Darwin’s finches | G. difficilis_P | Sharp-beaked ground finch | Geospiza difficilis | Pinta (P) | SRR1607420 |
| Darwin’s finches | G. septentrionalis_W | Sharp-beaked ground finch | Geospiza septentrionalis | Wolf (W) | SRR1607440 |
| Darwin’s finches | G. acutirostris_G | Sharp-beaked ground finch | Geospiza acutirostris | Genovesa (G) | SRR1607406 |
| Darwin’s finches | C. heliobates_I | Mangrove finch | Camarhynchus heliobates | Isabela (I) | SRR1607472 |
| Darwin’s finches | C. pallidus_Z | Woodpecker finch | Camarhynchus pallidus | Santa Cruz (Z) | SRR1607494 |
| Darwin’s finches | C. psittacula_P | Large tree finch | Camarhynchus psittacula | Pinta (P) | SRR1607543 |
| Darwin’s finches | C. parvulus_Z | Small tree finch | Camarhynchus parvulus | Santa Cruz (Z) | SRR1607504 |
| Darwin’s finches | C. pauper_F | Medium tree finch | Camarhynchus pauper | Floreana (F) | SRR1607508 |
| Darwin’s finches | P. crassirostris_Z | Vegetarian finch | Platyspiza crassirostris | Santa Cruz (Z) | SRR1607534 |
| Darwin’s finches | C. olivacea_S | Green warbler finch | Certhidea olivacea | Santiago (S) | SRR1607385 |
| Darwin’s finches | C. fusca_E | Grey warbler finch | Certhidea fusca | Española (E) | SRR1607359 |
| Darwin’s finches | P. inornata_C | Cocos finch | Pinaroloxias inornata | Cocos Island (C) | SRR1607529 |
| outgroup | T. bicolor_B | Black-faced grassquit | Tiaris bicolor | Barbados (B) | SRR1607551 |
| outgroup | L. noctis_B | Lesser Antillean bullfinch | Loxigilla noctis | Barbados (B) | SRR1607480 |

6.3.4 Performance Assessments

To evaluate and compare the performance of different approaches on the simulation dataset, where the true history of evolutionary events can be tracked, we use two types of area under the curve (AUC): the area under the receiver-operating characteristic (ROC) curve and the area under the Precision-Recall curve, referred to simply as ROC-AUC and PR-AUC, respectively. The ROC curve plots the true positive rate ($\frac{TP}{TP+FN}$) as a function of the false positive rate ($\frac{FP}{FP+TN}$) at various threshold settings, while the Precision-Recall curve similarly plots the precision ($\frac{TP}{TP+FP}$) against the recall ($\frac{TP}{TP+FN}$) at different thresholds, where TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively. The ROC-AUC and PR-AUC measures represent the tradeoff between type I errors and type II errors under different threshold values. The measures are calculated on all loci in the simulation datasets, where the true migration events are annotated by msmove with an asterisk.
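As a rough illustration of the two measures defined above, the sketch below sweeps the decision threshold over per-site introgression probabilities and integrates the resulting ROC and Precision-Recall curves with the trapezoidal rule. It is not the evaluation code used in this study: the function names are hypothetical, the inputs are assumed to be per-site scores with binary ground-truth labels, and the starting point of the Precision-Recall curve (recall 0 at the precision of the highest threshold) is one of several conventions in use.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def trapezoid_area(points: Sequence[Point]) -> float:
    """Integrate a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def roc_pr_auc(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (ROC-AUC, PR-AUC) for per-site scores and 0/1 truth labels."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both introgressed and non-introgressed sites are required")

    # Visit sites from the highest score down; each distinct score is a threshold.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    roc: List[Point] = [(0.0, 0.0)]
    pr: List[Point] = []
    tp = fp = 0
    i = 0
    while i < len(order):
        threshold = scores[order[i]]
        # Sites with tied scores cross the threshold together.
        while i < len(order) and scores[order[i]] == threshold:
            if labels[order[i]]:
                tp += 1
            else:
                fp += 1
            i += 1
        roc.append((fp / n_neg, tp / n_pos))        # (FPR, TPR)
        pr.append((tp / n_pos, tp / (tp + fp)))     # (recall, precision)

    # Assumed convention: extend the PR curve back to recall 0.
    pr.insert(0, (0.0, pr[0][1]))
    return trapezoid_area(roc), trapezoid_area(pr)
```

Grouping tied scores before emitting a curve point keeps the result independent of the order in which equal-scoring sites happen to be listed.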
After concatenating all query loci in one simulation replicate, the ROC-AUC and PR-AUC of a method are computed from the per-site probability of introgressive origin. Additionally, we report the memory usage and runtime in order to comprehensively evaluate the scalability of our framework.

6.3.5 Software and Data

Our software implementation of the DACS algorithm uses the divide-and-conquer strategy to combine the FastNet [11] and PHiMM [122] methods, jointly estimating the phylogenetic network and the introgression map. Open-source software and open data for all study datasets are publicly accessible at https://gitlab.msu.edu/liulab/phimm2.

6.4 Results

6.4.1 Simulation Studies

6.4.1.1 DACS’s parameter setting on simulation data

To determine optimal parameter settings for DACS under 20-taxon model conditions with two non-deep reticulations, we systematically evaluate the impact of subsampling size, subsampling replicates, number of input networks, and merging strategies on both accuracy and runtime.

In Figure 6.4A and Table 6.4, we first investigate how two key factors, the sampling size and the number of subsampling replicates, affect model performance (ROC-AUC, shown on the left y-axis) and computational cost (runtime in CPU hours, shown on the right y-axis). We find that the higher sampling size (blue markers) consistently yields a greater ROC-AUC than the lower sampling size (red markers). This indicates that including more samples per subsample helps capture the underlying signal more effectively, thereby improving predictive performance. Furthermore, increasing the number of subsampling replicates (shown on the x-axis) improves the ROC-AUC for both sampling sizes, suggesting that more extensive subsampling helps the model generalize and reduces uncertainty in performance estimates.
Figure 6.4 The parameter setting on 20-taxon model conditions with 2 non-deep reticulations. (A) ROC-AUC (left y-axis, bars) and runtime (right y-axis, dashed lines) as functions of the number of subsampling replicates. Red bars/lines represent a sampling size of 4, and blue bars/lines a sampling size of 7. (B) ROC-AUC (left y-axis, bars) and runtime (right y-axis, dashed lines) for four merging strategies of combining multiple input networks: Weighted Average (red), Average (blue), Maximum (green), and Medium (purple). The x-axis shows the number or the fraction of top-scoring candidate networks. (C) The minimum (left y-axis, black dashed lines) and average (right y-axis, red dashed lines) reduction-based distance between true and estimated networks (see Chapter 2.2.3.2 for details) for different numbers or fractions of top-scoring candidate networks.

Table 6.4 The sampling parameter setting on 20-taxon model conditions with 2 non-deep reticulations. The measures include the area under the receiver-operating characteristic curve (ROC-AUC) and runtime (in CPU Hour). DACS method was run using 20 CPUs. The average and standard error (SE) are calculated based on 20 replicates (represented as “average value ± SE value” or only “average” for runtime).

Sampling Size | Number of subsampling | ROC-AUC | Runtime (CPU Hour)
4 | 1 | 0.721±0.040 | 0.157
4 | 2 | 0.852±0.036 | 0.253
4 | 3 | 0.878±0.031 | 0.206
4 | 4 | 0.892±0.028 | 0.297
4 | 5 | 0.933±0.022 | 0.380
4 | 6 | 0.936±0.022 | 0.451
4 | 7 | 0.942±0.018 | 0.508
4 | 8 | 0.946±0.017 | 0.524
4 | 9 | 0.953±0.016 | 0.702
4 | 10 | 0.951±0.017 | 0.724
7 | 1 | 0.935±0.027 | 0.639
7 | 2 | 0.940±0.027 | 1.075
7 | 3 | 0.962±0.013 | 2.380
7 | 4 | 0.962±0.013 | 3.041
7 | 5 | 0.960±0.013 | 3.385
7 | 6 | 0.962±0.012 | 4.253
7 | 7 | 0.967±0.012 | 4.950
7 | 8 | 0.968±0.011 | 5.751
7 | 9 | 0.968±0.011 | 6.318
7 | 10 | 0.969±0.011 | 7.078

However, these gains in accuracy come with a cost in computational time. As the number of subsampling replicates increases, the runtime also rises, and this effect is more pronounced for the larger sampling size.
A similar trend holds for sampling size itself: a sampling size of 7 incurs a higher overall computational burden than a sampling size of 4. These findings highlight the importance of tuning both the sampling strategy and the number of replicates to achieve an acceptable balance of performance and resource usage for a given analysis. In the following sections, we use a sampling size of 7 and 10 subsampling replicates as defaults.

Figure 6.4B and Table 6.5 compare four different strategies (Weighted Average, Average, Maximum, and Medium) for combining multiple input networks, illustrating the trade-off between predictive performance (left y-axis) and computational runtime (right y-axis). The Weighted Average method (pink bars) exhibits lower ROC-AUC when only a small subset of networks is considered (e.g., “Top 1”), but its performance steadily improves as the number of input networks grows. Eventually, Weighted Average surpasses the other methods, indicating a notable advantage in predictive accuracy once enough candidate networks are incorporated. Meanwhile, Average (blue bars), Maximum (green bars), and Medium (purple bars) remain relatively close to one another, suggesting that these three aggregation methods yield comparable performance under different sampling fractions. However, these performance gains come with a pronounced rise in runtime (red dashed line) as the number of input networks increases. The runtime is about 11 CPU hours when only the single top network is selected but exceeds 100 CPU hours once the top 5% or more of the candidate networks are included (Table 6.5). This pattern underscores the computational costs associated with exploring a larger space of input networks, even though such exploration can improve predictive performance.

Table 6.5 The network parameter setting on 20-taxon model conditions with 2 non-deep reticulations. The measures include the area under the receiver-operating characteristic curve (ROC-AUC) and runtime (in CPU Hour). DACS method was run using 20 CPUs. The average and standard error (SE) are calculated based on 20 replicates (represented as “average value ± SE value” or only “average” for runtime).

Merging strategy | Number of networks | ROC-AUC | Runtime (CPU Hour)
Weighted Average | Top 1 | 0.893±0.027 | 11.043
Weighted Average | Top 0.1%≈5.25 | 0.906±0.028 | 57.912
Weighted Average | Top 0.5%≈5.9 | 0.904±0.028 | 64.274
Weighted Average | Top 1%≈6.35 | 0.903±0.028 | 68.933
Weighted Average | Top 5%≈11.3 | 0.899±0.030 | 124.232
Weighted Average | Top 10%≈14.45 | 0.919±0.026 | 159.871
Weighted Average | Top 15%≈14.95 | 0.918±0.026 | 164.963
Weighted Average | Top 20%≈15 | 0.918±0.026 | 165.452
Average | Top 1 | 0.893±0.027 | 11.043
Average | Top 0.1%≈5.25 | 0.906±0.028 | 57.912
Average | Top 0.5%≈5.9 | 0.904±0.028 | 64.274
Average | Top 1%≈6.35 | 0.903±0.028 | 68.933
Average | Top 5%≈11.3 | 0.899±0.030 | 124.232
Average | Top 10%≈14.45 | 0.919±0.026 | 159.871
Average | Top 15%≈14.95 | 0.918±0.026 | 164.963
Average | Top 20%≈15 | 0.918±0.026 | 165.452
Maximum | Top 1 | 0.893±0.027 | 11.043
Maximum | Top 0.1%≈5.25 | 0.906±0.024 | 57.912
Maximum | Top 0.5%≈5.9 | 0.905±0.024 | 64.274
Maximum | Top 1%≈6.35 | 0.899±0.027 | 68.933
Maximum | Top 5%≈11.3 | 0.895±0.026 | 124.232
Maximum | Top 10%≈14.45 | 0.904±0.026 | 159.871
Maximum | Top 15%≈14.95 | 0.902±0.027 | 164.963
Maximum | Top 20%≈15 | 0.902±0.027 | 165.452
Medium | Top 1 | 0.893±0.027 | 11.043
Medium | Top 0.1%≈5.25 | 0.908±0.026 | 57.912
Medium | Top 0.5%≈5.9 | 0.903±0.027 | 64.274
Medium | Top 1%≈6.35 | 0.902±0.028 | 68.933
Medium | Top 5%≈11.3 | 0.892±0.030 | 124.232
Medium | Top 10%≈14.45 | 0.903±0.026 | 159.871
Medium | Top 15%≈14.95 | 0.903±0.026 | 164.963
Medium | Top 20%≈15 | 0.903±0.026 | 165.452

Table 6.6 The network errors of different network parameter settings on 20-taxon model conditions with 2 non-deep reticulations. The measures include the average and minimum reduction-based distance between true and estimated networks (see Chapter 2.2.3.2 for details). DACS method was run using 20 CPUs. The average and standard error (SE) are calculated based on 20 replicates (represented as “average value ± SE value”).

Number of networks | Minimum reduction-based distance | Average reduction-based distance
Top 1 | 13.750±0.864 | 13.750±0.864
Top 0.1%≈5.25 | 13.200±0.842 | 15.541±0.598
Top 0.5%≈5.9 | 13.050±0.866 | 15.534±0.617
Top 1%≈6.35 | 13.050±0.866 | 15.528±0.630
Top 5%≈11.3 | 12.050±0.899 | 15.863±0.554
Top 10%≈14.45 | 11.150±1.091 | 16.048±0.518
Top 15%≈14.95 | 11.150±1.091 | 16.107±0.498
Top 20%≈15 | 11.150±1.091 | 16.097±0.498

To further refine the selection of candidate networks used for merging, we evaluated the topological similarity between true and estimated networks using the reduction-based distance (Figure 6.4C and Table 6.6). As more top-scoring networks were incorporated, the minimum reduction-based distance decreased (from 13.75 for the top network to 11.15 for the top 10%), indicating an improved best-case network topology match. However, the average reduction-based distance plateaued beyond the top 10%, implying diminishing gains from adding more networks beyond this threshold.
These results suggest that while merging a greater number of candidate networks can slightly improve the chance of obtaining a network close to the true topology, the benefit saturates rapidly. Overall, balancing accuracy and computational efficiency, we set the fraction of top-scoring candidate networks to 10% and use the weighted average method as the default in the following sections.

6.4.1.2 Comparison of PHiMM and DACS with true phylogeny input

Figure 6.5 The performance comparison between PHiMM and DACS with true network as input on 20-taxon model conditions. Reticulation scenarios include one non-deep reticulation, one deep reticulation, two non-deep reticulations, and a combination of both non-deep and deep reticulations. (A) The head-to-head comparison of area under the receiver-operating characteristic curve (ROC-AUC) between PHiMM and DACS on different reticulation scenarios. Statistical significance of performance differences between PHiMM and DACS was evaluated using a one-tailed pairwise Student’s t-test (𝛼 = 0.05; 𝑛 = 20). (B) The runtime (in CPU Hour) comparison between PHiMM and DACS. (C) The memory usage (in GB) comparison between PHiMM and DACS. The percentages and arrows (in red) shown in panels (B) and (C) indicate the improvement in DACS’s performance compared to PHiMM’s performance. Both PHiMM and DACS methods were run using 20 CPUs. The average and standard error (SE) are calculated based on 20 replicates.

We compare the PHiMM and DACS methods using the true network as input under various 20-taxon model conditions, covering four reticulation scenarios: one non-deep reticulation, one deep reticulation, two non-deep reticulations, and a combination of both. Figure 6.5A presents the head-to-head comparison of the area under the receiver operating characteristic curve (ROC-AUC) between PHiMM and DACS across the different reticulation scenarios.
The results reveal a significant improvement in ROC-AUC for DACS over PHiMM across all conditions, with all 𝑝-values from the one-tailed pairwise Student’s t-test below 0.05.

Figure 6.5B compares the runtime (in CPU hours) of PHiMM and DACS under the same conditions. DACS not only improves accuracy but also reduces computational time significantly: it achieves runtime reductions of 83.6%, 89.8%, 77.4%, and 83.8% in the one non-deep, one deep, two non-deep, and combined non-deep and deep reticulation scenarios, respectively, again with all 𝑝-values from the one-tailed pairwise Student’s t-test below 0.05. These substantial reductions in runtime underscore the efficiency gains achieved by DACS.

Figure 6.5C further examines memory usage (in GB) for the two methods. Here, DACS consistently demonstrates lower memory consumption across all tested conditions. The reductions range from 63.8% to 82.5%, with the highest memory savings observed in the two non-deep reticulation scenario. These findings suggest that DACS not only enhances computational speed but also optimizes memory usage, making it a more resource-efficient method than PHiMM.

In summary, Figure 6.5 provides clear evidence that DACS outperforms PHiMM in accuracy, runtime, and memory consumption under the 20-taxon reticulation scenarios tested. The statistical significance of the results further supports the robustness of DACS, highlighting its potential for more effective and resource-efficient modeling in phylogenetic studies.

6.4.1.3 Comparison of PHiMM and DACS with inferred phylogeny by FastNet

We extend our comparison of PHiMM and DACS to estimated phylogenetic networks as input under the same reticulation scenarios in 20-taxon model conditions. The trends are similar to those observed with the true network as input (Figure 6.6).
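The runtime and memory reductions quoted for Figure 6.5 can be reproduced from the 20-taxon mean values in Table 6.7 as 100 × (1 − DACS/PHiMM). The short Python check below uses means copied from that table; the helper name and the dictionary layout are illustrative choices, not part of the DACS software.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (1.0 - improved / baseline)


# (PHiMM mean, DACS mean) with the true network as input, 20 taxa (Table 6.7).
runtime_cpu_hours = {
    "1 non-deep": (22.558, 3.690),
    "1 deep": (41.541, 4.223),
    "2 non-deep": (30.488, 6.876),
    "1 non-deep + 1 deep": (61.081, 9.873),
}
memory_gb = {
    "1 non-deep": (31.112, 5.715),
    "1 deep": (27.511, 9.966),
    "2 non-deep": (35.099, 6.155),
    "1 non-deep + 1 deep": (53.147, 13.056),
}

for name, table in (("runtime", runtime_cpu_hours), ("memory", memory_gb)):
    for scenario, (phimm, dacs) in table.items():
        print(f"{name:8s} {scenario:20s} {percent_reduction(phimm, dacs):5.1f}%")
```

Because the percentages are ratios of replicate means, they summarize the average saving and say nothing about replicate-to-replicate variation, which is what the reported standard errors and t-tests address.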
DACS continues to achieve higher ROC-AUC values than PHiMM when estimated networks are used (Figure 6.6A). The improvement is statistically significant across all scenarios, as indicated by the 𝑝-values being below the 0.05 threshold. These results underscore that DACS maintains a robust performance advantage over PHiMM in different reticulation scenarios, even with the potential inaccuracies introduced by estimated networks.

Figure 6.6 The performance comparison between PHiMM and DACS with estimated networks as input on 20-taxon model conditions. Reticulation scenarios include one non-deep reticulation, one deep reticulation, two non-deep reticulations, and a combination of both non-deep and deep reticulations. (A) The head-to-head comparison of area under the receiver-operating characteristic curve (ROC-AUC) between PHiMM and DACS on different reticulation scenarios. Statistical significance of performance differences between PHiMM and DACS was evaluated using a one-tailed pairwise Student’s t-test (𝛼 = 0.05; 𝑛 = 20). (B) The runtime (in CPU Hour) comparison between PHiMM and DACS. (C) The memory usage (in GB) comparison between PHiMM and DACS. The percentages and arrows (in red) shown in panels (B) and (C) indicate the improvement in DACS’s performance compared to PHiMM’s performance. Both PHiMM and DACS methods were run using 20 CPUs. The average and standard error (SE) are calculated based on 20 replicates.

Figure 6.6B shows the comparison of runtime (in CPU hours) between PHiMM and DACS under these conditions. The efficiency improvements seen with true networks (as in Figure 6.5) are also observed when using estimated networks, suggesting that DACS offers substantial computational benefits across different input conditions. Figure 6.6C presents the memory usage (in GB) comparison between the two methods. Once again, DACS demonstrates lower memory consumption across all tested conditions.
These results indicate that DACS not only accelerates computational performance but also optimizes memory utilization, even when estimated networks are used as input.

Overall, Figure 6.6 confirms that DACS significantly outperforms PHiMM with substantial improvements in ROC-AUC, runtime, and memory usage, even when the networks are estimated rather than true. This reinforces the reliability and resource efficiency of DACS in handling complex phylogenetic scenarios.

Tables 6.7 and 6.8 report the same comparisons for the 10-taxon model conditions, using either the true network or estimated networks as input across the same reticulation scenarios. DACS attains higher mean ROC-AUC and PR-AUC and lower memory usage than PHiMM in every 10-taxon scenario, although the ROC-AUC difference for one non-deep reticulation is not significant at 𝛼 = 0.05 (𝑝 = 5.18e-02 with the true network and 3.20e-01 with estimated networks). Unlike the 20-taxon results, DACS’s runtime exceeds PHiMM’s on the 10-taxon conditions (e.g., 2.668 vs. 2.120 CPU hours for one non-deep reticulation with the true network), so the runtime advantage is observed only on the 20-taxon conditions.

6.4.1.4 Impact of network inference on DACS’s introgression detection

To quantify how inaccuracies in the input phylogenetic network influence DACS, we compare ROC-AUC values with the reduction-based distances between true and estimated networks under 20-taxon model conditions with four reticulation scenarios: (i) one non-deep reticulation, (ii) one deep reticulation, (iii) two non-deep reticulations, and (iv) one non-deep plus one deep reticulation (Table 6.9 and Figure 6.7).

In the simplest scenario involving a single non-deep reticulation, DACS remains highly robust with an average ROC-AUC of 0.938, and both the minimum and average reduction-based distances between true and estimated networks are relatively small (6.55 and 12.79, respectively). The weak Pearson correlation coefficient (PCC) between minimum distance and ROC-AUC (PCC=0.088) indicates minimal impact of minor network inference errors on downstream detection performance.
With two non-deep reticulations, ROC-AUC stays high (0.919), yet the correlation between ROC-AUC and the minimum reduction-based distance becomes moderately negative (PCC=-0.419), suggesting that larger topological errors tend to reduce DACS’s detection accuracy.

As the reticulation complexity increases, particularly in scenarios involving deep reticulations, DACS’s performance declines. In the single-deep scenario, the minimum reduction-based distance nearly doubles relative to the single non-deep scenario (13.05) and ROC-AUC falls to 0.810. This trend is most pronounced in the hybrid scenario combining one non-deep and one deep reticulation, where ROC-AUC is the lowest (0.770) and the minimum reduction-based distance reaches 15.85. The negative PCC (-0.219) is consistent with increased network inference error lowering introgression detection performance, although the correlation is weak.

Overall, DACS demonstrates strong tolerance to moderate network-inference errors and maintains acceptable accuracy even for deep reticulations (ROC-AUC remains above 0.80 for the single deep reticulation case), underscoring its practicality for complex evolutionary histories. Nonetheless, achieving reliable introgression detection in highly reticulated or mixed-depth scenarios still benefits from the most accurate network estimates available.

Table 6.7 The performance comparison between PHiMM and DACS with true network as input on 10- and 20-taxon model conditions. Reticulation scenarios include one non-deep reticulation, one deep reticulation, two non-deep reticulations, and a combination of both non-deep and deep reticulations. The measures include the area under the receiver-operating characteristic curve (ROC-AUC), the area under the precision-recall curve (PR-AUC), runtime (in CPU Hour), and memory usage (in GB). Statistical significance of performance differences between PHiMM and DACS was evaluated using a one-tailed pairwise Student’s t-test (𝛼 = 0.05; 𝑛 = 20). Both PHiMM and DACS methods were run using 20 CPUs. The average and standard error (SE) are calculated based on 20 replicates (represented as “average value ± SE value”).

Measure | Ntaxa | Reticulation Scenario | PHiMM | DACS | 𝑝-value
ROC-AUC | 10 | 1 non-deep | 0.917±0.024 | 0.937±0.027 | 5.18e-02
ROC-AUC | 10 | 1 deep | 0.880±0.027 | 0.933±0.020 | 3.55e-04
ROC-AUC | 10 | 2 non-deep | 0.881±0.032 | 0.942±0.020 | 1.16e-02
ROC-AUC | 10 | 1 non-deep + 1 deep | 0.849±0.032 | 0.874±0.028 | 3.43e-03
ROC-AUC | 20 | 1 non-deep | 0.896±0.039 | 0.977±0.010 | 2.46e-02
ROC-AUC | 20 | 1 deep | 0.877±0.022 | 0.980±0.008 | 5.26e-05
ROC-AUC | 20 | 2 non-deep | 0.871±0.032 | 0.969±0.011 | 7.74e-04
ROC-AUC | 20 | 1 non-deep + 1 deep | 0.791±0.037 | 0.922±0.018 | 1.43e-04
PR-AUC | 10 | 1 non-deep | 0.734±0.059 | 0.861±0.045 | 1.62e-04
PR-AUC | 10 | 1 deep | 0.689±0.051 | 0.786±0.051 | 9.03e-03
PR-AUC | 10 | 2 non-deep | 0.731±0.063 | 0.852±0.047 | 7.01e-03
PR-AUC | 10 | 1 non-deep + 1 deep | 0.721±0.048 | 0.751±0.051 | 3.60e-02
PR-AUC | 20 | 1 non-deep | 0.720±0.071 | 0.885±0.045 | 8.64e-03
PR-AUC | 20 | 1 deep | 0.680±0.050 | 0.951±0.019 | 2.11e-05
PR-AUC | 20 | 2 non-deep | 0.731±0.060 | 0.914±0.027 | 5.36e-04
PR-AUC | 20 | 1 non-deep + 1 deep | 0.596±0.063 | 0.813±0.045 | 1.06e-04
Runtime (CPU Hour) | 10 | 1 non-deep | 2.120±0.126 | 2.668±0.222 | 3.96e-03
Runtime (CPU Hour) | 10 | 1 deep | 2.400±0.149 | 3.448±0.318 | 4.85e-04
Runtime (CPU Hour) | 10 | 2 non-deep | 2.534±0.129 | 6.994±0.589 | 5.85e-08
Runtime (CPU Hour) | 10 | 1 non-deep + 1 deep | 3.780±0.253 | 7.572±0.574 | 5.18e-08
Runtime (CPU Hour) | 20 | 1 non-deep | 22.558±2.063 | 3.690±0.309 | 1.75e-08
Runtime (CPU Hour) | 20 | 1 deep | 41.541±3.796 | 4.223±0.407 | 3.12e-09
Runtime (CPU Hour) | 20 | 2 non-deep | 30.488±2.192 | 6.876±0.478 | 2.45e-09
Runtime (CPU Hour) | 20 | 1 non-deep + 1 deep | 61.081±8.119 | 9.873±0.673 | 1.65e-06
Memory (GB) | 10 | 1 non-deep | 25.522±2.893 | 4.091±0.230 | 1.82e-07
Memory (GB) | 10 | 1 deep | 23.349±1.351 | 6.588±0.772 | 3.93e-10
Memory (GB) | 10 | 2 non-deep | 41.691±4.001 | 4.988±0.325 | 1.01e-08
Memory (GB) | 10 | 1 non-deep + 1 deep | 44.740±3.229 | 11.290±1.567 | 2.80e-09
Memory (GB) | 20 | 1 non-deep | 31.112±1.206 | 5.715±0.379 | 1.03e-13
Memory (GB) | 20 | 1 deep | 27.511±0.864 | 9.966±0.997 | 9.87e-10
Memory (GB) | 20 | 2 non-deep | 35.099±0.650 | 6.155±0.371 | 1.76e-20
Memory (GB) | 20 | 1 non-deep + 1 deep | 53.147±2.144 | 13.056±0.860 | 2.61e-13

Table 6.8 The performance comparison between PHiMM and DACS with estimated networks as input on 10- and 20-taxon model conditions. Reticulation scenarios include one non-deep reticulation, one deep reticulation, two non-deep reticulations, and a combination of both non-deep and deep reticulations. The measures include the area under the receiver-operating characteristic curve (ROC-AUC), the area under the precision-recall curve (PR-AUC), runtime (in CPU Hour), and memory usage (in GB). Statistical significance of performance differences between PHiMM and DACS was evaluated using a one-tailed pairwise Student’s t-test (𝛼 = 0.05; 𝑛 = 20). Both PHiMM and DACS methods were run using 20 CPUs. The average and standard error (SE) are calculated based on 20 replicates (represented as “average value ± SE value”).

Measure | Ntaxa | Reticulation Scenario | PHiMM | DACS | 𝑝-value
ROC-AUC | 10 | 1 non-deep | 0.844±0.037 | 0.865±0.056 | 3.20e-01
ROC-AUC | 10 | 1 deep | 0.726±0.044 | 0.901±0.034 | 1.21e-04
ROC-AUC | 10 | 2 non-deep | 0.746±0.051 | 0.853±0.035 | 7.99e-03
ROC-AUC | 10 | 1 non-deep + 1 deep | 0.684±0.039 | 0.779±0.037 | 8.29e-03
ROC-AUC | 20 | 1 non-deep | 0.688±0.051 | 0.938±0.028 | 1.47e-04
ROC-AUC | 20 | 1 deep | 0.622±0.050 | 0.810±0.048 | 9.82e-05
ROC-AUC | 20 | 2 non-deep | 0.698±0.038 | 0.919±0.026 | 1.66e-06
ROC-AUC | 20 | 1 non-deep + 1 deep | 0.607±0.039 | 0.770±0.043 | 2.25e-04
PR-AUC | 10 | 1 non-deep | 0.586±0.071 | 0.768±0.075 | 5.30e-03
PR-AUC | 10 | 1 deep | 0.424±0.065 | 0.772±0.059 | 2.42e-05
PR-AUC | 10 | 2 non-deep | 0.511±0.068 | 0.700±0.062 | 1.44e-03
PR-AUC | 10 | 1 non-deep + 1 deep | 0.473±0.053 | 0.629±0.052 | 3.73e-03
PR-AUC | 20 | 1 non-deep | 0.312±0.067 | 0.795±0.069 | 2.05e-05
PR-AUC | 20 | 1 deep | 0.369±0.062 | 0.608±0.076 | 1.00e-03
PR-AUC | 20 | 2 non-deep | 0.409±0.061 | 0.839±0.046 | 5.38e-08
PR-AUC | 20 | 1 non-deep + 1 deep | 0.372±0.055 | 0.624±0.059 | 2.39e-05
Runtime (CPU Hour) | 10 | 1 non-deep | 23.723±1.916 | 65.806±5.201 | 7.31e-11
Runtime (CPU Hour) | 10 | 1 deep | 29.972±2.453 | 73.934±6.266 | 7.11e-10
Runtime (CPU Hour) | 10 | 2 non-deep | 31.926±2.414 | 123.099±9.094 | 1.67e-11
Runtime (CPU Hour) | 10 | 1 non-deep + 1 deep | 37.327±3.778 | 141.815±11.325 | 2.33e-11
Runtime (CPU Hour) | 20 | 1 non-deep | 234.787±23.018 | 81.629±4.147 | 5.69e-07
Runtime (CPU Hour) | 20 | 1 deep | 273.297±23.213 | 85.071±2.649 | 4.58e-08
Runtime (CPU Hour) | 20 | 2 non-deep | 377.342±25.539 | 159.871±5.853 | 3.73e-08
Runtime (CPU Hour) | 20 | 1 non-deep + 1 deep | 457.258±59.173 | 162.829±6.729 | 1.74e-05
Memory (GB) | 10 | 1 non-deep | 18.949±0.214 | 12.720±0.938 | 5.40e-08
Memory (GB) | 10 | 1 deep | 19.091±0.249 | 14.348±1.147 | 3.71e-05
Memory (GB) | 10 | 2 non-deep | 35.675±1.446 | 16.336±1.378 | 5.78e-22
Memory (GB) | 10 | 1 non-deep + 1 deep | 38.317±4.360 | 15.584±1.445 | 7.04e-07
Memory (GB) | 20 | 1 non-deep | 19.103±0.382 | 14.274±1.169 | 4.24e-04
Memory (GB) | 20 | 1 deep | 20.576±0.221 | 15.173±1.182 | 2.67e-05
Memory (GB) | 20 | 2 non-deep | 42.647±5.645 | 16.422±1.106 | 1.78e-05
Memory (GB) | 20 | 1 non-deep + 1 deep | 43.110±5.673 | 16.628±1.290 | 8.81e-06

Figure 6.7 The head-to-head performance comparison of network inference and introgression detection on 20-taxon model conditions. Reticulation scenarios include one non-deep reticulation, one deep reticulation, two non-deep reticulations, and a combination of both non-deep and deep reticulations. The X-axis shows the minimum reduction-based distance between true and estimated networks; the Y-axis depicts the area under the receiver-operating characteristic curve (ROC-AUC) of DACS with estimated networks as input. DACS method was run using 20 CPUs.

Table 6.9 The network inference errors for DACS on 10-, 20-, and 100-taxon model conditions. Reticulation scenarios include solely non-deep reticulations, solely deep reticulations, and a combination of both non-deep and deep reticulations. The measures include the area under the receiver-operating characteristic curve (ROC-AUC), and the average and minimum reduction-based distance between true and estimated networks (see Chapter 2.2.3.2 for details). Pearson correlation coefficient (PCC) between each reduction-based distance and ROC-AUC was also calculated. DACS method was run using 20 CPUs. The average and standard error (SE) are calculated based on 20 replicates (represented as “average value ± SE value”).

Ntaxa | Reticulation Scenario | ROC-AUC | Minimum reduction-based distance | PCC (minimum distance) | Average reduction-based distance | PCC (average distance)
10 | 1 non-deep | 0.865±0.056 | 2.300±0.726 | -0.764 | 8.217±0.405 | -0.482
10 | 1 deep | 0.901±0.034 | 8.800±0.574 | -0.148 | 10.146±0.278 | -0.145
10 | 2 non-deep | 0.853±0.035 | 6.800±0.756 | -0.296 | 10.667±0.544 | -0.299
10 | 1 non-deep + 1 deep | 0.779±0.037 | 10.100±0.533 | -0.136 | 12.268±0.356 | -0.150
20 | 1 non-deep | 0.938±0.028 | 6.550±0.910 | 0.088 | 12.790±0.512 | 0.083
20 | 1 deep | 0.810±0.048 | 13.050±0.749 | 0.240 | 14.557±0.554 | 0.229
20 | 2 non-deep | 0.919±0.026 | 11.150±1.091 | -0.419 | 16.048±0.518 | -0.348
20 | 1 non-deep + 1 deep | 0.770±0.043 | 15.850±0.638 | -0.219 | 18.257±0.475 | -0.226
100 | 1 non-deep | 0.857±0.047 | 30.050±1.725 | -0.455 | 35.193±1.502 | -0.326
100 | 2 non-deep | 0.839±0.034 | 35.800±1.933 | -0.253 | 40.627±1.605 | -0.114
100 | 3 non-deep | 0.855±0.031 | 39.450±2.498 | -0.482 | 44.827±2.221 | -0.417
100 | 4 non-deep | 0.827±0.028 | 45.800±2.207 | -0.146 | 52.520±1.901 | -0.131
100 | 5 non-deep | 0.812±0.029 | 49.950±2.367 | -0.452 | 55.213±2.036 | -0.454
100 | 1 non-deep + 1 deep | 0.758±0.033 | 42.650±1.918 | -0.244 | 45.777±1.649 | -0.203
100 | 2 deep | 0.710±0.037 | 45.400±1.895 | -0.223 | 47.610±1.834 | -0.252
100 | 3 deep | 0.717±0.031 | 51.800±1.325 | -0.077 | 53.120±1.326 | -0.169
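The Pearson correlation coefficients (PCC) between reduction-based distance and ROC-AUC discussed in this section follow the standard definition, which can be sketched in a few lines of Python. The function name and the per-replicate toy values below are invented for illustration and are not taken from the study data.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-replicate values: network error vs. detection accuracy.
toy_min_distance = [4.0, 6.0, 9.0, 12.0, 15.0]
toy_roc_auc = [0.97, 0.95, 0.96, 0.90, 0.88]
print(pearson_correlation(toy_min_distance, toy_roc_auc))
```

A negative value, as in this toy input, corresponds to ROC-AUC tending to fall as the estimated network drifts further from the true one; with only 20 replicates per condition, coefficients near zero should be read as weak evidence in either direction.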
6.4.1.5 Accuracy of DACS on ultra large datasets

To assess performance on ultra-large datasets, we analyze a 100-taxon model condition under multiple reticulation scenarios. Due to PHiMM’s limited scalability, it was not feasible to run PHiMM on this dataset, preventing a direct comparison. Nevertheless, this analysis showcases DACS’s capability to scale effectively to datasets of this size. Figure 6.8 and Table 6.10 present detailed comparisons of the performance of DACS when using a true network versus estimated networks as input, on ultra-large datasets with 100 taxa.

Figure 6.8 The performance comparison between true network and estimated networks as input for DACS on ultra large datasets with 100 taxa. The measures include (A) the area under the receiver-operating characteristic curve (ROC-AUC), (B) runtime (in CPU Hour), and (C) memory usage (in GB). Reticulation scenarios include solely non-deep reticulations, solely deep reticulations, and a combination of both non-deep and deep reticulations. DACS method was run using 20 CPUs. The percentages and arrows shown in red indicate the performance reduction with the estimated networks as input compared to true network as input. The average and standard error (SE) are calculated based on 20 replicates.

Table 6.10 The performance comparison between true network and estimated networks as input for DACS on ultra large datasets with 100 taxa. Reticulation scenarios include solely non-deep reticulations, solely deep reticulations, and a combination of both non-deep and deep reticulations. The measures include the area under the receiver-operating characteristic curve (ROC-AUC), the area under the precision-recall curve (PR-AUC), runtime (in CPU Hour), and memory usage (in GB). DACS method was run using 20 CPUs. The average and standard error (SE) are calculated based on 20 replicates (represented as “average value ± SE value”).

Measure | Ntaxa | Reticulation Scenario | Estimated | True
ROC-AUC | 100 | 1 non-deep | 0.857±0.047 | 0.972±0.013
ROC-AUC | 100 | 2 non-deep | 0.839±0.034 | 0.960±0.018
ROC-AUC | 100 | 3 non-deep | 0.855±0.031 | 0.970±0.007
ROC-AUC | 100 | 4 non-deep | 0.827±0.028 | 0.941±0.015
ROC-AUC | 100 | 5 non-deep | 0.807±0.030 | 0.939±0.012
ROC-AUC | 100 | 1 non-deep + 1 deep | 0.758±0.033 | 0.936±0.020
ROC-AUC | 100 | 2 deep | 0.710±0.037 | 0.963±0.011
ROC-AUC | 100 | 3 deep | 0.717±0.031 | 0.950±0.019
PR-AUC | 100 | 1 non-deep | 0.671±0.093 | 0.916±0.038
PR-AUC | 100 | 2 non-deep | 0.670±0.062 | 0.889±0.044
PR-AUC | 100 | 3 non-deep | 0.713±0.065 | 0.891±0.028
PR-AUC | 100 | 4 non-deep | 0.716±0.044 | 0.878±0.025
PR-AUC | 100 | 5 non-deep | 0.657±0.058 | 0.850±0.032
PR-AUC | 100 | 1 non-deep + 1 deep | 0.576±0.053 | 0.845±0.044
PR-AUC | 100 | 2 deep | 0.465±0.057 | 0.920±0.020
PR-AUC | 100 | 3 deep | 0.539±0.051 | 0.878±0.038
Runtime (CPU Hour) | 100 | 1 non-deep | 86.124±1.732 | 4.798±0.278
Runtime (CPU Hour) | 100 | 2 non-deep | 188.392±5.733 | 10.435±0.458
Runtime (CPU Hour) | 100 | 3 non-deep | 297.975±7.360 | 14.597±0.825
Runtime (CPU Hour) | 100 | 4 non-deep | 473.023±13.245 | 19.411±0.957
Runtime (CPU Hour) | 100 | 5 non-deep | 630.657±13.456 | 23.366±1.575
Runtime (CPU Hour) | 100 | 1 non-deep + 1 deep | 160.655±5.159 | 12.319±0.706
Runtime (CPU Hour) | 100 | 2 deep | 221.789±4.556 | 14.236±0.847
Runtime (CPU Hour) | 100 | 3 deep | 493.052±6.947 | 20.436±0.726
Memory (GB) | 100 | 1 non-deep | 17.776±1.110 | 6.525±0.228
Memory (GB) | 100 | 2 non-deep | 15.309±1.105 | 8.239±0.421
Memory (GB) | 100 | 3 non-deep | 18.208±1.309 | 8.613±0.538
Memory (GB) | 100 | 4 non-deep | 20.868±1.493 | 8.892±0.446
Memory (GB) | 100 | 5 non-deep | 19.116±1.422 | 7.646±0.208
Memory (GB) | 100 | 1 non-deep + 1 deep | 16.707±1.477 | 15.141±0.825
Memory (GB) | 100 | 2 deep | 20.176±1.547 | 16.294±0.783
Memory (GB) | 100 | 3 deep | 16.658±1.905 | 17.224±0.959

Figure 6.8A displays the ROC-AUC values across various reticulation scenarios, including up to three deep and five non-deep reticulations. DACS maintains high accuracy when true networks are provided, while performance with estimated networks remains strong but shows a moderate decline.
The percentage differences, indicated above each pair, range from an 11.8% decrease in the simplest scenario of one non-deep reticulation to a 25.7% decrease in the most complex scenario of three deep reticulations. These results highlight DACS’s robustness but also underscore the challenges posed by input uncertainty as network complexity increases.

Figure 6.8B presents the runtime (in CPU hours) required to run DACS, which increases substantially when using estimated networks, by 1204.1% to 2339% depending on the scenario. Even the smallest of these increases corresponds to a more than tenfold rise in runtime. The most extreme increase is observed in the scenario involving three deep reticulations, where the runtime rises more than twentyfold. This increase underscores the computational challenges posed by estimated networks, particularly in complex scenarios, and suggests that further optimizations or alternative approaches may be necessary to maintain computational efficiency.

Figure 6.8C illustrates the memory usage (in GB) for the two types of inputs across the different scenarios. Memory consumption increases when estimated networks are used, with the differences ranging from 10.3% to 172.4% depending on the scenario. The largest increase, observed in the one non-deep reticulation case, reflects the extra overhead required to process less accurate input. Nevertheless, DACS remains operable even under the most demanding conditions, and these results affirm that it can manage large-scale inputs even when input quality varies.

Overall, these findings demonstrate that DACS is capable of handling ultra-large datasets with up to 100 taxa, even in the presence of complex reticulation patterns. While using estimated networks incurs costs in accuracy, memory, and runtime, DACS remains applicable and scalable, offering a practical solution for large-scale phylogenetic network analysis.
When true networks are not available, researchers can still leverage DACS, with an understanding of the computational trade-offs, particularly in scenarios with high complexity.

6.4.2 Mosquito Data Re-analysis

Figure 6.9 The introgression patterns of DACS on mosquito data. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from An. merus to An. quadriannulatus.

Figure 6.10 The histograms of introgressed tract lengths by DACS on mosquito data. Introgressed tract lengths are shown in megabases (Mb). Results are reported for the introgression from An. merus to An. quadriannulatus.

We applied DACS to mosquito whole-genome data to infer introgressed genomic tracts from An. merus into An. quadriannulatus. The results revealed widespread introgression across all major autosomes (Figure 6.9), with a particularly dense signal on chromosome arm 3L. This observation is consistent with prior findings of pervasive autosomal introgression between An. merus and An. quadriannulatus, especially in the ∼22 Mb 3La inversion region (Figure 4 from [152]). In our results, this region is marked by high probabilities of introgressed tracts, suggesting nearly complete replacement of the ancestral An. quadriannulatus sequence by its An. merus counterpart in this genomic interval.

Figure 6.10 shows the distribution of inferred tract lengths. Most tracts are relatively short (<1 Mb), but a small number of long tracts (up to ∼8 Mb) were also observed, indicating potentially recent or large-scale introgression events. These patterns are consistent with a model of both ancient and recent gene flow, with the longer tracts possibly reflecting lower recombination rates, such as those expected in regions with historical inversions like 3La [152]. The extensive and contiguous replacement observed in 3L supports the hypothesis that historical recombination suppression facilitated the maintenance of a large introgressed haplotype.
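To make the notion of an introgressed tract concrete, the sketch below shows one simple way to turn per-site introgression probabilities into tracts and tract lengths of the kind summarized in Figures 6.9 and 6.10. It is an illustration only and not the DACS implementation: the 0.5 probability cutoff, the function names, and the toy coordinates are all assumptions.

```python
def call_tracts(positions, probabilities, cutoff=0.5):
    """Group consecutive sites with probability >= cutoff into (start, end) tracts.

    `positions` are genomic coordinates in base pairs, sorted in ascending order.
    The cutoff of 0.5 is a hypothetical choice, not a value taken from the study.
    """
    tracts = []
    start = end = None
    for pos, prob in zip(positions, probabilities):
        if prob >= cutoff:
            if start is None:
                start = pos  # open a new tract at the first qualifying site
            end = pos        # extend the current tract
        elif start is not None:
            tracts.append((start, end))  # close the tract at the last qualifying site
            start = None
    if start is not None:
        tracts.append((start, end))
    return tracts


def tract_lengths_mb(tracts):
    """Inclusive tract lengths converted from base pairs to megabases."""
    return [(end - start + 1) / 1e6 for start, end in tracts]


# Toy input (invented): six sites with their introgression probabilities.
toy_positions = [100, 200, 300, 400, 500, 600]
toy_probabilities = [0.1, 0.9, 0.8, 0.2, 0.7, 0.95]
toy_tracts = call_tracts(toy_positions, toy_probabilities)
print(toy_tracts, tract_lengths_mb(toy_tracts))
```

A histogram of the values returned by `tract_lengths_mb` would give a tract-length distribution analogous to Figure 6.10; the shape of that distribution depends on the chosen cutoff, which is why the cutoff is exposed as a parameter.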
6.4.3 Darwin’s Finch Data Re-analysis

We applied DACS to the 469 largest scaffolds of the Darwin’s finch genome, focusing on those exceeding 100 kb in length (Figure 6.11 and Figures B.1-B.8 of Appendix B). This subset captures the majority of genomic content relevant to introgression while excluding highly fragmented or low-information scaffolds. In total, these 469 scaffolds represent a substantial portion of the genome’s informative content.

The inferred tract length distribution (Figure 6.12) was highly skewed, with most tracts under 100 kb but a subset extending beyond 500 kb. This heterogeneity suggests both ancient and recent introgression events: longer tracts are likely indicative of more recent gene flow or segments maintained by positive selection; in contrast, more fragmented or shorter tracts, scattered throughout various scaffolds, may reflect older events that have undergone increased recombination or simply represent the lower bound of detection accuracy [158].

Figure 6.11 The introgression patterns of DACS on the top 1-50 largest scaffolds of Darwin’s finch data. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from An. merus to An. quadriannulatus.

To explore potential biological relevance, we examined the genomic context of inferred tracts across species (Figure 6.11). Several introgressed regions overlapped genes previously implicated in finch morphology [154, 157]. Notably, we detected tracts overlapping four genes previously associated with craniofacial development, beak morphology, and/or body size variation in mammals or birds, including FGF12 (fibroblast growth factor 12), LRRIQ1 (leucine-rich repeats and IQ motif containing 1), MSRB3 (methionine-R-sulfoxide reductase B3), and HMGA2 (high mobility AT-hook 2).
These signals were especially prominent on scaffolds 172, 166, and 186, consistent with previous reports of extensive hybridization and gene flow among Darwin’s finches [154, 157]. These findings exemplify how integrative approaches such as DACS can reveal biologically meaningful signals of introgression even in complex, reticulate clades.

Figure 6.12 The histograms of introgressed tract lengths by DACS on Darwin’s finch data. Introgressed tract lengths are shown in kilobases (Kb).

6.5 Discussion and Conclusion

In this study, we introduced a novel divide-and-conquer and subsampling-based algorithm, DACS, designed to detect introgression events in large phylogenomic datasets, even when a phylogenetic network is not initially available. Building upon the PHiMM framework, DACS integrates FastNet-based network inference, subsampling, and statistical merging strategies to address the scalability and accuracy limitations commonly encountered in introgression mapping.

Our simulation experiments demonstrated that DACS improves upon PHiMM both in terms of inference accuracy and computational efficiency. Even under challenging reticulation scenarios (e.g., multiple deep or non-deep reticulations), DACS consistently achieved higher ROC-AUC and PR-AUC values than PHiMM. Moreover, it substantially reduced memory usage and runtime, enabling analyses of datasets spanning up to 100 taxa, well beyond the practical limits of standard phylogenetic network inference tools reported in the literature [14].

These performance improvements can be attributed to two main algorithmic advances. First, the divide-and-conquer component selectively focuses on smaller taxa subsets around each reticulation, significantly lowering computational and memory overheads. Second, repeated subsampling further stabilizes the posterior decoding probabilities, reducing biases that can arise from single-run estimates.
When coupled with FastNet’s ability to infer multiple candidate networks, DACS also reduces the risk of conditioning on an inaccurate topology by weighting the final introgression probabilities according to network quality.

Our simulation results additionally highlight important trade-offs. For instance, while DACS achieves high accuracy when true networks are known, performance diminishes and computational costs rise when networks must be estimated. In ultra-large analyses (e.g., 100 taxa), runtime can increase substantially as the number of reticulation events grows. These findings indicate that users should balance the accuracy gains from broader network exploration (i.e., considering more candidate topologies) against the computational burdens that arise. Further methodological optimizations, such as more efficient sampling or heuristic selection of candidate networks, may help alleviate these challenges.

Beyond simulations, our empirical re-analyses of mosquito data and Darwin’s finch genomic data underscore the practical utility of DACS. The method effectively processed real-world, large-scale sequencing data, detecting introgression signals in a biologically rich context where hybridization is known to play a significant role. By accommodating uncertain or unknown phylogenetic networks, DACS offers a robust framework for researchers investigating gene flow across diverse, understudied taxonomic groups.

Despite these advances, some limitations remain. The current pipeline relies on FastNet for network inference, which may itself become computationally expensive for extremely large datasets or highly reticulated evolutionary histories. Moreover, while repeated subsampling helps reduce variance, the choice of sampling size and number of replicates can affect both accuracy and runtime, necessitating careful parameter tuning. Addressing these issues, possibly via adaptive or dynamic sampling strategies, could further improve the method’s efficiency.
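To make the merging idea discussed above more concrete, the sketch below combines per-site probabilities from several subsampling replicates (or candidate networks) into a single weighted average. The function name and the weights (for example, normalized network quality scores) are hypothetical; the exact merging statistic used by DACS is defined in the methods chapter and may differ from this simple average.

```python
from typing import List, Optional, Sequence


def merge_posteriors(
    replicates: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Weighted per-site average of probability vectors from several runs.

    replicates: one vector of per-site probabilities per replicate or network.
    weights: one non-negative weight per replicate (assumed quality scores);
             equal weights are used when omitted.
    """
    if not replicates:
        raise ValueError("at least one replicate is required")
    n_sites = len(replicates[0])
    if any(len(rep) != n_sites for rep in replicates):
        raise ValueError("all replicates must cover the same sites")
    if weights is None:
        weights = [1.0] * len(replicates)
    if len(weights) != len(replicates):
        raise ValueError("one weight per replicate is required")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    merged = [0.0] * n_sites
    for rep, w in zip(replicates, weights):
        for i, prob in enumerate(rep):
            merged[i] += w * prob / total
    return merged


if __name__ == "__main__":
    # Two hypothetical replicates over two alignment columns.
    runs = [[0.0, 1.0], [1.0, 1.0]]
    print(merge_posteriors(runs))                   # equal weights
    print(merge_posteriors(runs, weights=[3, 1]))   # first run trusted more
```

Averaging more replicates lowers the variance of the merged estimate but multiplies runtime, which is the trade-off noted in the limitations above.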
Future work will concentrate on further optimizing both the subsampling and merging steps, broadening the framework to accommodate diverse reticulation types, and investigating refined strategies for network inference in highly reticulate evolutionary contexts. We anticipate that these developments will establish DACS as a powerful resource for evolutionary biologists and genomics researchers seeking to elucidate the dynamics of introgression across increasingly large and complex datasets.

CHAPTER 7
APPLICATION OF DACS-BASED METHOD TO HORIZONTAL GENE TRANSFER DETECTION IN METAGENOMICS

7.1 Introduction

In previous studies, we have developed DACS, an introgression detection tool that significantly enhances the scalability of introgression analysis while preserving high accuracy in inference. Building on this advancement, we propose to adapt and improve the DACS tool for Horizontal Gene Transfer (HGT) detection in microorganisms. This extension represents a more complex challenge due to the involvement of diverse taxa and intricate reticulations in HGT events, which require more robust methods to handle the increased biological complexity.

The advent of high-throughput metagenomic technologies has revolutionized biological research by generating vast amounts of sequencing data, with data volumes expanding exponentially. As a result, there is an urgent need to apply introgression and HGT detection methodologies to metagenomic datasets. However, this task is complicated by the inherent challenges of metagenomic data, including noisy and partial sequences. While shotgun read data can be directly employed for tasks such as taxon identification and functional annotation, the majority of metagenomic workflows focus on reconstructing metagenome-assembled genomes (MAGs) through processes such as filtering, assembly, and binning.
MAGs, which exhibit high completeness, are considered close representations of actual microbial genomes and can be utilized for downstream genomic analyses.

Nevertheless, metagenomic datasets are often derived from heterogeneous microbial communities, sometimes containing over 10,000 distinct species [16]. This complexity, coupled with noisy and incomplete sequence data, presents significant challenges when applying traditional genomics-based methods to metagenomics. Existing tools and methods may perform suboptimally due to these quality issues. Furthermore, many phylogenetic-based approaches, which have proven valuable in genomic studies, have yet to be fully adapted or tested in the metagenomics domain.

In this study, we aim to address the challenges associated with HGT detection in metagenomic datasets by applying our proposed method, DACS, to these data. After assembling sequencing reads into MAGs, we use DACS to identify reticulate evolutionary events, such as introgression and HGT, even under challenging model conditions characterized by noise, incomplete data, and large numbers of taxa. We also compare the performance of DACS in metagenomics with its application to genomics, discussing differences in terms of input requirements, accuracy, scalability, and the specific conditions under which each method is most applicable.

7.2 Background

7.2.1 Metagenomics Definition

Traditionally, microbiology has relied on cultivating pure cultures within laboratory settings, allowing for the sequencing of individual organisms’ genomes one at a time. However, this approach only scratches the surface, as up to 99% of microorganisms elude cultivation, leaving us ignorant of their very existence.
This limitation in cultivation has distorted our perception of microbial diversity and restricted our understanding of the vast microbial realm. In contrast, metagenomics presents a refreshing perspective by directly sampling the genome sequences of entire communities of organisms within their natural habitats, thus circumventing the necessity for isolating and cultivating individual organisms in the lab. Metagenomics offers a comparatively unbiased outlook not only on the community’s structure, encompassing species richness and distribution, but also on the functional and metabolic potential of the community as a whole.

Various habitats or environments, such as soil, sea, mud, forests, space, and even the human body, harbor distinct communities of microorganisms, comprising bacteria, viruses/phages, microbial eukaryotes (e.g., yeast), as well as worms (e.g., helminths and nematodes).

7.2.2 Metagenomics Analysis

The field of metagenomics raises two fundamental questions [159]:

1. Who is there? This question pertains to identifying the various species of microorganisms present in the sample.
2. What are they doing? Here, the objective is to discern the functions and behaviors of the identified microorganisms.

To address these questions, numerous strategies can be employed in the analysis of metagenomics shotgun data [160] (Figure 7.1, Table 7.1, and Table 7.2).

7.2.2.1 Quality filtering

A customary initial step involves running a range of computational tools for quality control or filtering, including trimming sequencing adapters, discarding short and low-quality reads, and eliminating reads with low-quality ends or ambiguous “N” base calls. Following quality control, the reads can follow one of two paths: they are either assembled into longer contiguous sequences known as contigs, or directly employed in taxonomic classifiers or functional annotations (see Figure 7.1).
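The filtering criteria described above can be sketched as a simple per-read predicate. The example below is a minimal illustration, assuming Phred+33 quality encoding and hypothetical thresholds (minimum length 50, minimum mean quality 20, no “N” bases allowed); it is not the behavior of any specific tool listed in Table 7.1.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_filter(
    seq: str,
    quality: str,
    min_length: int = 50,            # assumed threshold
    min_mean_quality: float = 20.0,  # assumed threshold
    max_n: int = 0,                  # assumed: reject any ambiguous base
) -> bool:
    """Return True if a read is long enough, clean enough, and free of excess N calls."""
    if len(seq) != len(quality):
        raise ValueError("sequence and quality strings must have equal length")
    if len(seq) == 0 or len(seq) < min_length:
        return False
    if seq.upper().count("N") > max_n:
        return False
    return mean_phred(quality) >= min_mean_quality


if __name__ == "__main__":
    good = ("ACGT" * 20, "I" * 80)    # 'I' encodes Phred 40
    noisy = ("ACGT" * 20, "#" * 80)   # '#' encodes Phred 2
    print(passes_filter(*good), passes_filter(*noisy))
```

Dedicated tools additionally trim adapters and low-quality read ends rather than only discarding whole reads, which this sketch does not attempt.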
7.2.2.2 Assembly

Assembly, the process of merging collinear metagenomic reads from the same genome into contiguous sequences called contigs, proves valuable in generating longer sequences, thereby simplifying bioinformatic analysis compared to working with unassembled short metagenomic reads. Two primary approaches to assembly are available:

1. De novo assembly: This method involves joining sequenced fragments to generate contigs without relying on a previously sequenced reference genome. The De Bruijn graph method stands as the most popular de novo assembly paradigm and finds implementation in various assemblers.
2. Comparative assembly: This approach utilizes a previously sequenced, closely related organism as a guide to aid the assembly process.

Following assembly, the quality of the obtained contigs can be assessed to facilitate the evaluation and comparison of metagenome data, relying on alignments to close references as a basis for assessment. The resulting contigs obtained after assembly can serve multiple purposes, directly contributing to gene prediction and functional annotation or playing a crucial role in the binning process.

7.2.2.3 Binning

Binning algorithms undertake the task of grouping contigs or scaffolds that originate from the same or closely related organisms. Once these contigs are clustered into bins, subsequent analyses, such as taxonomic profiling and functional annotation, are performed on these bins rather than individual contigs. Binning has proven to be a powerful approach capable of clustering contigs even from rare species. Moreover, it has demonstrated the ability to recover draft genomes from previously uncultivated bacteria [161]. There are two primary binning methods: taxonomy-dependent methods (supervised methods) and taxonomy-independent methods (unsupervised methods).

1. Taxonomy-dependent methods (supervised methods): These methods utilize known reference genomes or characteristics extracted directly from the nucleotide sequences (e.g., oligonucleotide frequencies, GC content) to map the contigs. They are based on aligning metagenomic sequences to a reference. However, a drawback of taxonomy-dependent methods is the limited number of available sequenced genomes in current databases, which can lead to challenges in the alignment process and result in longer computing times.
2. Taxonomy-independent methods (unsupervised methods): In contrast, taxonomy-independent methods do not rely on a reference genome and have become more popular due to their versatility. These methods assume that contigs belonging to the same genome should exhibit similar abundance and/or oligonucleotide composition within the same sample.

Following the binning process, reads can be mapped back to the bins, allowing each bin to be reassembled. If the binning is successful, this step may produce longer contigs, enhancing the overall quality of the assembled genomes [160]. Once the bins have been filtered for contamination and completeness, they are referred to as metagenome-assembled genomes (MAGs) [162].

7.2.2.4 Gene prediction or annotation

Gene prediction determines which metagenomic reads contain coding sequences. Once identified, coding sequences can be functionally annotated. Gene prediction can be conducted on assembled or unassembled metagenomic sequences. One of the most straightforward ways of identifying coding sequences in a metagenome is to map metagenomic reads or contigs to a database of gene sequences. De novo gene prediction, on the other hand, can potentially identify novel genes.
Here, gene prediction models, which are trained by evaluating various properties of microbial genes (e.g., length, codon usage, GC bias), are used to assess whether a metagenomic read or contig contains a gene without relying on sequence similarity to a reference database.

7.2.2.5 Functional annotation

Functional annotation in metagenomics serves the purpose of addressing the question “what are they doing”. By examining the collective functions encoded in the genomes of the microorganisms within a community, metagenomes shed light on the physiological activities of that community.

Once genome assembly, binning, and gene annotation are completed, numerous tools are available to carry out functional annotations. The conventional approach to identifying gene function involves similarity searches using well-established tools like BLAST. Predicted peptides that are classified as homologs of a protein family are annotated with the family’s function. Conducting this analysis across all reads results in a community functional diversity profile. To refine the predicted function, it is essential to utilize comprehensive databases (Table 7.1).

7.2.2.6 Taxonomic profiling

In metagenomics, the question of “who is there” is addressed through taxonomic profiling, providing insights into the taxonomic composition of each analyzed sample. Taxonomic profiling not only identifies the detected taxa in a sample but also estimates their relative abundances and various diversity indices.

Traditionally, taxonomic profiling has been performed using marker gene analysis, a widely adopted approach that relies on specific genes conserved across related organisms to infer taxonomic identities. This technique will be described in greater detail later.

Several software tools are available to carry out taxonomic profiling using either MAGs or contigs. These tools aid in the taxonomic classification and quantification of the microbial communities present in the samples (Table 7.1).
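The relative abundances and diversity indices mentioned above can be illustrated with a short calculation. The sketch below assumes per-taxon read counts are already available and uses the Shannon index as one common choice of diversity index; the taxon names and counts are hypothetical, and profiling tools may normalize counts differently (for example, by genome length).

```python
import math
from typing import Dict


def relative_abundances(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-taxon read counts to fractions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("total count must be positive")
    return {taxon: count / total for taxon, count in counts.items()}


def shannon_index(counts: Dict[str, int]) -> float:
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with non-zero abundance."""
    abundances = relative_abundances(counts)
    return -sum(p * math.log(p) for p in abundances.values() if p > 0)


if __name__ == "__main__":
    # Hypothetical read counts assigned to three taxa.
    sample = {"taxon_A": 600, "taxon_B": 300, "taxon_C": 100}
    print(relative_abundances(sample))
    print(round(shannon_index(sample), 4))
```

A community dominated by a single taxon yields an index near zero, while an even community of S taxa yields ln(S).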
In recent developments, methods have emerged to enable genome profiling of MAGs at the strain level, providing more detailed insights into the diversity within specific taxa. These tools offer a more detailed and precise understanding of the taxonomic structure and strain-level diversity within microbial communities (Table 7.1).

7.2.2.7 Assembly-free approach

Assembled genomes offer evident advantages for subsequent functional analyses. Nonetheless, obtaining accurate assemblies, bins, and even MAGs when dealing with metagenomic samples remains a challenging task. Consequently, a considerable portion of the reads acquired from metagenomic samples will remain non-assembled. Hence, approaches have been developed to enable the direct analysis of raw metagenomic data for both taxonomic classification and functional assignments.

Marker gene analysis represents one of the most straightforward and computationally efficient methods for assessing the taxonomic diversity of a metagenome. This approach involves comparing metagenomic reads to a database of taxonomically informative gene families, commonly known as marker genes. By identifying metagenomic reads that exhibit homology to these marker genes, the sequences are taxonomically annotated through sequence or phylogenetic similarity to the marker gene database. Frequently employed marker genes include ribosomal RNA (rRNA) genes and protein-coding genes that typically exist as single copies and are widespread across microbial genomes. Notably, the 16S rRNA gene, the 18S rRNA gene, and the ITS region are widely regarded as excellent markers for bacteria, eukaryotes, and fungi, respectively.

Various methods have been developed for assembly-free taxonomic classification in metagenomics, including both reference-based and reference-free approaches (Table 7.2). Reference-based methods rely on reference databases for classification.
On the other hand, reference-free methods do not depend on reference databases and offer alternative approaches for taxonomic profiling in metagenomic samples.

When working directly with reads for annotation, conventional tools like BLAST often suffer from slowness due to the substantial amount of data being processed. As a result, novel methods employing optimized strategies have emerged to expedite the process. Additionally, several pipelines have been established to integrate some of these tools, enabling direct annotation from raw sequencing data. These pipelines streamline the annotation process, providing efficient ways to analyze metagenomic data directly from raw reads (Table 7.2).

7.3 Methods

7.3.1 Standalone DACS pipeline

DACS is an advanced algorithm specifically designed to detect introgression events in large phylogenomic datasets efficiently by integrating a divide-and-conquer approach with a robust subsampling strategy. Unlike the standard PHiMM pipeline, which requires a predefined network as input, DACS leverages FastNet’s network-inference capabilities for large-scale analyses. DACS aims to address two fundamental challenges in introgression detection. First, DACS reduces the computational overhead associated with analyzing a large number of taxa simultaneously. Second, rather than relying on a single, potentially erroneous phylogenetic network, DACS integrates over multiple plausible topologies to minimize errors and improve the robustness of the results. Benchmark analyses have demonstrated that DACS not only scales effectively to handle dozens or even hundreds of taxa but also maintains high accuracy in identifying introgressed genomic regions.

To apply DACS to metagenomic datasets, the process begins after assembling sequencing reads into metagenome-assembled genomes (MAGs) (as described in Chapter 7.4).
The DACS pipeline is then run to perform introgression mapping on these MAGs under different model conditions, which account for noise, incomplete data, and large numbers of taxa.

The divide-and-conquer and subsampling procedures in DACS require the user to set the subsampling size and the number of subsampling replicates. By default, we use a subsampling size of 7 and perform 10 subsampling replicates. During the subproblem-solving phase of DACS, the PHiMM tool is used for introgression analysis. PHiMM runs with its default settings, which include 300 iterations for model parameter learning and 10 separate runs for robustness. Additionally, users need to assign a gene tree truncation size, denoted as 𝑘𝑛, for the hidden Markov model (HMM) in PHiMM. Here, we set 𝑘𝑛 = 15, which corresponds to the default setting in PHiMM.

Figure 7.1 Common analysis procedures for metagenomics data. First, raw reads undergo quality control (QC) to remove low-quality bases and contaminants. The QC-controlled reads are then assembled into longer contiguous sequences (contigs), which may be further evaluated and filtered. Next, binned contigs are formed by grouping contigs that likely originate from the same organism, yielding metagenome-assembled genomes (MAGs). Two major downstream analyses follow: Taxonomic Profiling (“Who is there?”) involves comparing reads or contigs to reference databases (either marker-gene based or more comprehensive sequence repositories) to classify the taxa present and infer their evolutionary relationships. Functional Annotation (“What are they doing?”) entails gene prediction and mapping predicted proteins to known families and pathways, thereby identifying the biological roles and metabolic capabilities of the microbial community.

Table 7.1 Common assembly-based tools for metagenomics data. This table compiles a broad range of bioinformatics software used throughout the metagenomics workflow. Tools are organized by major analytical steps: (1) Quality filtering, covering low-quality read trimming, contig assessment, and bin validation; (2) Assembly, including traditional de novo strategies and ensemble methods; (3) Binning, encompassing taxonomy-dependent (supervised), taxonomy-independent (unsupervised), and ensemble approaches; (4) Gene prediction, distinguishing between prokaryotic and eukaryotic prediction tools; (5) Functional annotation, highlighting commonly used databases, aligners, and tools; and (6) Taxonomic profiling, covering broad (basic) as well as strain-level taxonomic resolution. References are provided for each method/tool to offer further guidance on specific algorithms, use cases, implementation details, performance characteristics, and potential applications.

Category | Subcategory | Methods & Reference
Quality filtering | Quality filtering for reads | PRINSEQ [163], Trimmomatic [164], FastQC [165], SeqKit [166], Cutadapt [167], BBDuk [168], Sickle (v.1.33) [169]
Quality filtering | Quality filtering for contigs | MetaQUAST [170]
Quality filtering | Quality filtering for bins | CheckM [171], AMBER [172]
Assembly | De novo assembly (e.g., De Bruijn graph methods) | Minia [173], MetaVelvet [174], MetaVelvet-SL [175], Ray Meta [176], IDBA [177], IDBA-UD [178], MetaSPAdes [179], Megahit [180]
Assembly | Ensemble assembly | MetAMOS [181], MeGAMerge [182], GAM-NGS [183]
Binning | Taxonomy-dependent or supervised | CARMA3 [184], MetaPhyler [185], SOrt-ITEMS [186], PhyloPythiaS+ [187], Phymm/PhymmBL [188], MG-RAST v.4 [189], MEGAN6 [190], TACOA [191], IMG/M v.5.0 [192]
Binning | Taxonomy-independent or unsupervised | Metawatt [193], AbundanceBin [194], LikelyBin [195], MBBC [196], Canopy [197], MetaCluster4 [198], MaxBin2 [199], MetaBAT2 [200], CONCOCT [201], COCACOLA [202], SCIMM [203], CompostBin [204]
Binning | Ensemble binning | Binning-refiner [205], DAS Tool [206], ICoVeR [207], MetaWRAP [208]
Gene prediction | Prokaryotes | MetaGeneMark [209], MetageneAnnotator [210], Glimmer-MG [211], JGI (Joint Genome Institute) [212], MetaProdigal [213], FragGeneScan [214]
Gene prediction | Eukaryotes | GeneMark.hmm [215, 216], AUGUSTUS [217], Gnomon [218], SNAP [219]
Functional annotation | Aligners | BLAST [220]
Functional annotation | Databases | Pfam [221], Interpro [222], PRIAM [223], KEGG [224], Metacyc [225], NCBI taxonomy [226]
Functional annotation | Tools | Prokka [227], DFAST [228], NCBI’s PGAP [229], MetaErg [230]
Taxonomic profiling | Basic | MEGAN6 [190], DIAMOND [231], MAGpy [232], MG-RAST v.4 [189, 233]
Taxonomic profiling | Strain level | MetaMLST [234], StrainPhlAn [235], DESMAN [236], MetaSVN [237]

Table 7.2 Common assembly-free tools for metagenomics data. This table summarizes frequently used software tools that bypass the need for assembly when analyzing metagenomic samples. Tools are grouped into two main categories: (1) Functional annotation, which leverages fast aligners (e.g., USEARCH, DIAMOND) and integrated pipelines (e.g., HUMAnN2, MGS-Fast) to directly map reads to known databases for functional insights; and (2) Taxonomic profiling, subdivided into reference-based methods (e.g., Kraken2, Centrifuge) that match reads to comprehensive taxonomic databases, and reference-free methods (e.g., MetaPhlAn, PhymmBL) that infer taxonomy without relying on external reference genomes. References are provided for each method/tool to offer further guidance on specific algorithms, use cases, implementation details, performance characteristics, and potential applications.

Category | Subcategory | Methods & Reference
Functional annotation | Aligners / Tools | USEARCH [238], BLAT5 [239], RAPSearch2 [240], DIAMOND [231], PALADIN [241], GRASP2 [242], FUN4ME [243], MOCAT2 [244], MGS-Fast [245], HUMAnN2 [246], MetaStorm [247], MG-RASTv.4 [189], IMG/M v.5.0 [192]
Taxonomic profiling | Reference-based | MG-RASTv.4 [189], MEGAN6 [190], CARMA3 [184], Kraken [248], Clark [249], Kraken2 [250], Taxonomer [251], Centrifuge [252], Kaiju [253], taxMaps [254], Taxator-tk [255]
Taxonomic profiling | Reference-free | PhylopythiaS+ [187], PhymmBL [188], MetaPhlAn [256], MetaPhlAn2 [257]

First of all, to make a fair comparison across different metagenomic scenarios and avoid potential confounding effects from network uncertainty, we opt to use the true phylogenetic network as input for DACS (obtained from Chapter 6.3.1). The required inputs for DACS thus include the aligned MAG sequences 𝐴, where 𝐾 represents the number of sequences and 𝐿 is the number of alignment columns, as well as a true phylogenetic network Ψ on the 𝐾 taxa.
The input MAG alignment 𝐴 is constructed based on steps described in Chapter 7.4, while the true phylogenetic network Ψ is taken directly from the DACS simulation studies in Chapter 6.3.1.

However, the ground-truth phylogenetic network is rarely available a priori in real-world data. One of the key advantages of DACS is its ability to bypass the need for a predefined network by employing FastNet [11] to infer one or more candidate phylogenetic networks from the input sequence alignments. Therefore, we also run experiments with phylogenetic networks estimated by FastNet as input for DACS (see steps described in Chapter 6.2.2.2). In this setting, the required inputs for DACS include the aligned MAG sequences 𝐴, where 𝐾 represents the number of sequences and 𝐿 is the number of alignment columns, as well as the number of reticulations 𝑅 (obtained from the dataset construction step in Chapter 7.4).

The output of DACS is a sequence of modified posterior decoding probabilities for the columns of the input MAG alignment. These probabilities are crucial for identifying potential introgression events and understanding the evolutionary dynamics within the dataset.

7.4 Materials

7.4.1 Simulation of reads

Preliminary results presented in this Chapter are derived entirely from fully synthetic reference genomes; the 100-taxon datasets used for DACS analyses (see Chapter 6.3.1) do not correspond to any real-world species. These datasets were generated in silico solely to explore method behavior under extreme evolutionary complexity and should be interpreted as proof-of-concept rather than empirically validated findings.

First, we prepared reference genomes for simulating raw reads from two 100-taxon datasets used in the DACS simulation analysis (see Chapter 6.3.1). Each dataset includes 100 distinct genomes corresponding to different taxa or species, and each dataset consists of 20 replicates.
The first dataset was constructed under a model condition involving 3 deep reticulations, and the second dataset was constructed under a model condition with 5 non-deep reticulations. These reticulation conditions represent highly complex evolutionary scenarios, making them suitable for mimicking real-world data. As a result, they present a challenging and robust foundation for our downstream analyses.

For the 100 reference genomes from each replicate of each dataset, we simulated metagenomic sequence reads using wgsim (available at https://github.com/lh3/wgsim), a widely used simulator initially developed as part of SAMtools [156]. This tool generates error-containing paired-end reads. We simulated both short reads with a length of 250 bp and long reads with a length of 1000 bp, applying the default Illumina sequencing noise with a 2% error rate. All other parameters were left at their default settings.

Since the reconstruction of genes involved in HGT is highly sensitive to sequencing depth and the choice of assembler [258], we set the wgsim parameters to produce target DNA fragments by randomly sampling from the reference sequences. We generated reads at varying coverage depths (5×, 10×, 20×, 30×, 50×, and 100×) to explore the effects of sequencing depth on metagenomic assembly and downstream introgression/HGT detection. Coverage depth (or read depth, sequencing depth) refers to the number of times a specific base (nucleotide) in the DNA is read during the sequencing process.

Thus, we simulated both 250-bp and 1000-bp paired-end reads with coverage depths of 5×, 10×, 20×, 30×, 50×, and 100× for the two 100-taxon datasets, under either the 3 deep reticulation condition or the 5 non-deep reticulation condition.

7.4.2 Metagenomic assembly and MAG construction

For each simulated read dataset, BBDuk [168] was used to remove Illumina adapters as well as PhiX and other Illumina trace contaminants.
After adapter removal, reads were trimmed using Sickle (v.1.33) [169] with a default quality threshold of 20 (quality type set to Sanger, which corresponds to CASAVA v.1.8 or higher). Each physical filter was treated as an independent sample. Specifically, metagenomic reads from a single filter were assembled together, rather than co-assembling reads from all filters or size fractions. Assembly was performed using Megahit (v.1.2.9) [180] with default parameters apart from a k-mer size cascade of 21, 29, 39, 59, 79, 99, 119, 141, 159, 179, 199, 219, 239, and 255 to assemble quality-controlled reads into contigs. Assembled contigs were then scaffolded using the scaffolding function from IDBA-UD [178]. Only scaffolds greater than 1 kb in length were retained for downstream gene prediction and MAG construction.

These scaffolds were aligned to the corresponding reference genomes with BLASTN [220] using the default settings. The BLASTN results were then filtered with an E-value cut-off of < 10^-3, an identity cut-off of > 90%, and an aligned length cut-off of > 100 bp for the scaffolds. Note that each scaffold can be mapped to only one reference genome (i.e., one species), namely the one with the best E-value. These stringent criteria ensured the high quality and relevance of the identified genes.

Subsequently, we derived MAGs by extracting all aligned scaffold segments corresponding to reference genomes for simulated read datasets with different read lengths and coverage levels. Given that our reference genomes were well-aligned, the resulting MAGs were also well-aligned, with gaps representing missing DNA sites. To prepare the MAGs for input into the DACS method, we removed these gap regions, as DACS requires a complete alignment with no gaps.

7.5 Results

Figure 7.2, Figure 7.3, and Table 7.3 summarize the effects of read length (250-bp vs. 1000-bp) and read sequencing depth (5×, 10×, 20×, 30×, 50×, and 100×) on the accuracy of detecting reticulate events using DACS.
Performance was evaluated across two simulated 100-taxon metagenomic datasets, modeled under either 5 non-deep or 3 deep reticulations, using both true and estimated phylogenetic networks as input.

Across all conditions, both the area under the receiver-operating characteristic curve (ROC-AUC) and the completeness of the reconstructed metagenome-assembled genomes (MAG coverage) generally improved with increasing read coverage depth, regardless of read length or reticulation complexity. At low read depths (5×–10×), assemblies from short reads (250-bp) generally produced slightly higher ROC-AUC values than those from long reads (1000-bp). For example, in the 5 non-deep reticulation scenario with the true network as input (Figure 7.2A), the mean ROC-AUC at 5× read depth was 0.926 for 250-bp reads and 0.915 for 1000-bp reads. However, as read depth increased beyond 20×, the differences between the two read lengths diminished, with both read-length assemblies converging on ROC-AUC values around 0.933–0.935. By 100× read depth, short- and long-read assemblies performed nearly identically (both at approximately 0.933–0.934 ROC-AUC), and both approached the “True” reference condition (0.939), in which DACS was run directly on the original reference genomes without any assembly. This same trend was reflected in MAG coverage, which neared 100% for all assemblies once read coverage depth reached 50× and above.

When estimated networks were used instead of true networks, a consistent drop in ROC-AUC was observed across all settings (Figure 7.3), highlighting DACS’s sensitivity to phylogenetic inference accuracy. For the 5 non-deep reticulation model at 5× depth, ROC-AUC dropped to 0.760 for 250-bp reads and 0.749 for 1000-bp reads. The decline was more pronounced in the 3 deep reticulation model, where ROC-AUC at 5× depth fell to 0.671 (250-bp) and 0.661 (1000-bp).
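For reference, the ROC-AUC values reported here can be read as the probability that a randomly chosen positive (e.g., a truly introgressed site) receives a higher score than a randomly chosen negative. The following is a minimal sketch of that rank-based (Mann-Whitney) computation. It is an illustration only, not the evaluation script used in this study; the function name `roc_auc`, the 0/1 label encoding, and the per-site framing are assumptions made for the example.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based (Mann-Whitney) ROC-AUC.

    labels: 1 for a positive (hypothetically, a truly introgressed site), 0 otherwise.
    scores: detector output, higher meaning "more likely positive".
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative")

    # Sort indices by score and give tied scores their average (1-based) rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Mann-Whitney U statistic of the positive class, normalised to [0, 1].
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_pos = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_pos / (n_pos * n_neg)
```

Under this reading, a value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for interpreting the differences between true-network and estimated-network inputs.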
Although performance improved with increasing depth, the gap between true and estimated network inputs remained evident. Even at 100× coverage, the ROC-AUC under estimated networks was only about 0.706–0.799 (0.717–0.807 on the reference genomes themselves), substantially lower than the roughly 0.933–0.949 (0.939–0.950 on the reference genomes) observed with true networks (Table 7.3).

The 3 deep reticulation model posed greater challenges overall: compared to the 5 non-deep reticulation model, ROC-AUC was markedly lower when estimated networks were used as input, and MAG coverage was lower at moderate depths (20×–30×). This suggests that deeper reticulate events are more difficult to resolve, particularly when assemblies are incomplete or networks are inferred. Nonetheless, once coverage reached 30× or more, high-quality MAGs were recovered and DACS achieved reliable HGT detection, even in the presence of complex evolutionary histories.

Overall, these findings demonstrate that DACS maintains robust performance in identifying HGT/introgression events under both shallow and deep reticulation models, and with both true and estimated networks as input, once moderate (≥20–30×) coverage depths are reached. The 250-bp assemblies often offered slightly better performance at 5–10× coverage, plausibly because long-read assembly is more challenging at low depth. However, the gap between short- and long-read assemblies closed quickly with increasing coverage, indicating that both read lengths can achieve accurate HGT detection at moderate to high coverage depths.

7.6 Discussion and Conclusion

In this study, we investigated the performance of our DACS-based approach for detecting horizontal gene transfer (HGT) in metagenomic datasets under a variety of read lengths (250-bp vs. 1000-bp), coverage depths (5×, 10×, 20×, 30×, 50×, and 100×), input phylogenies (true vs. estimated networks), and reticulation complexities (3 deep vs. 5 non-deep reticulations). Our results show that coverage depth is a crucial determinant of both reconstructed MAG completeness and accuracy in detecting reticulate events.
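Because coverage depth plays this central role, it may help to recall how a target depth translates into a number of simulated paired-end reads: depth ≈ (number of read pairs × 2 × read length) / genome length. The sketch below rearranges this relation. It is an illustration under stated assumptions: the function name and the example genome length of 1,000,000 bp are hypothetical (genome lengths are not given in this section), and how read counts were actually chosen in this study is not specified here.

```python
import math


def read_pairs_for_coverage(genome_length_bp: int, read_length_bp: int, depth: float) -> int:
    """Smallest number of read pairs with 2 * pairs * read_length / genome_length >= depth."""
    if genome_length_bp <= 0 or read_length_bp <= 0 or depth <= 0:
        raise ValueError("genome length, read length and depth must be positive")
    return math.ceil(depth * genome_length_bp / (2 * read_length_bp))


# Hypothetical 1 Mbp genome; the depths and read lengths are those used in this chapter.
GENOME_LENGTH_BP = 1_000_000
for read_length in (250, 1000):
    for depth in (5, 10, 20, 30, 50, 100):
        pairs = read_pairs_for_coverage(GENOME_LENGTH_BP, read_length, depth)
        print(f"{read_length}-bp reads, {depth}x: {pairs} read pairs")
```

At a fixed depth, 1000-bp reads therefore require a quarter as many read pairs as 250-bp reads, so the two read-length settings differ in the number of fragments sampled even when the total number of sequenced bases is the same.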
While short reads exhibited slightly better ROC-AUC at low coverage (5–10×), the gap between short- and long-read assemblies diminished as coverage increased, and both achieved near-reference-level accuracy at coverage depths of 20–30× and higher. One potential explanation for the early lead of short reads under low coverage is that shorter reads may assemble more robustly when read coverage is sparse, possibly because the fragmented sequence data are less prone to errors in contig assembly. Conversely, long-read assemblies require sufficient coverage to resolve repetitive regions, complex structural variations, and other complications that can prevent accurate contig construction. Once coverage exceeds a certain threshold (in this study, about 20–30×), both short and long reads form sufficiently complete and accurate assemblies, yielding nearly indistinguishable performance in HGT detection.

We also observed that DACS consistently achieved higher accuracy when using true networks as input, highlighting the method’s sensitivity to the quality of phylogenetic inference. Even at high read depth, the use of estimated networks introduced noticeable performance degradation. This observation suggests that improvements in phylogenetic network inference methods could further enhance the predictive accuracy of DACS and related downstream analyses.

Figure 7.2 Comparison of read lengths (250-bp and 1000-bp) and coverage depths (5×, 10×, 20×, 30×, 50×, and 100×) when running DACS with the true network as input on both 100-taxon datasets, under model conditions with (A) 5 non-deep reticulations and (B) 3 deep reticulations. “True” denotes results obtained directly on the reference genomes. ROC-AUC (left y-axis, bars) and MAG coverage (%) (right y-axis, dashed lines) are shown as functions of read coverage depth. Red bars/lines represent a read length of 250 bp, and blue bars/lines a read length of 1000 bp.

Figure 7.3 Comparison of read lengths (250-bp and 1000-bp) and coverage depths (5×, 10×, 20×, 30×, 50×, and 100×) when running DACS with estimated networks as input on both 100-taxon datasets, under model conditions with (A) 5 non-deep reticulations and (B) 3 deep reticulations. “True” denotes results obtained directly on the reference genomes. ROC-AUC (left y-axis, bars) and MAG coverage (%) (right y-axis, dashed lines) are shown as functions of read coverage depth. Red bars/lines represent a read length of 250 bp, and blue bars/lines a read length of 1000 bp.

Table 7.3 Comparison of read lengths (250-bp and 1000-bp) and coverage depths (5×, 10×, 20×, 30×, 50×, and 100×) on both 100-taxon datasets under model conditions with 5 non-deep reticulations and 3 deep reticulations. The measures include the area under the receiver-operating characteristic curve (ROC-AUC) and MAG coverage (%). DACS was run twice, with the true network or with estimated networks as input. “True” denotes results obtained directly on the reference genomes. The average and standard error (SE) are calculated based on 20 replicates (reported as “average ± SE” for ROC-AUC and as the average only for MAG coverage).

Dataset      Read length   Read coverage depth   ROC-AUC (true network)   ROC-AUC (estimated networks)   MAG coverage (%)
5 non-deep   250-bp        5×                    0.926±0.014              0.760±0.028                    90.467
5 non-deep   250-bp        10×                   0.927±0.015              0.778±0.027                    93.747
5 non-deep   250-bp        20×                   0.931±0.013              0.788±0.028                    95.821
5 non-deep   250-bp        30×                   0.933±0.016              0.789±0.022                    97.877
5 non-deep   250-bp        50×                   0.935±0.015              0.792±0.026                    98.845
5 non-deep   250-bp        100×                  0.933±0.011              0.799±0.029                    99.885
5 non-deep   250-bp        True                  0.939±0.012              0.807±0.030                    100.000
5 non-deep   1000-bp       5×                    0.915±0.013              0.749±0.028                    90.225
5 non-deep   1000-bp       10×                   0.914±0.015              0.760±0.027                    91.486
5 non-deep   1000-bp       20×                   0.930±0.015              0.784±0.026                    95.689
5 non-deep   1000-bp       30×                   0.932±0.014              0.785±0.025                    97.773
5 non-deep   1000-bp       50×                   0.934±0.014              0.789±0.024                    98.797
5 non-deep   1000-bp       100×                  0.933±0.015              0.798±0.030                    99.801
5 non-deep   1000-bp       True                  0.939±0.012              0.807±0.030                    100.000
3 deep       250-bp        5×                    0.927±0.018              0.671±0.022                    91.484
3 deep       250-bp        10×                   0.931±0.018              0.680±0.025                    92.790
3 deep       250-bp        20×                   0.939±0.018              0.699±0.029                    93.803
3 deep       250-bp        30×                   0.941±0.023              0.701±0.029                    93.816
3 deep       250-bp        50×                   0.947±0.018              0.706±0.026                    97.836
3 deep       250-bp        100×                  0.949±0.018              0.710±0.030                    98.840
3 deep       250-bp        True                  0.950±0.019              0.717±0.031                    100.000
3 deep       1000-bp       5×                    0.920±0.018              0.661±0.028                    88.266
3 deep       1000-bp       10×                   0.929±0.018              0.671±0.023                    91.508
3 deep       1000-bp       20×                   0.937±0.019              0.698±0.028                    93.702
3 deep       1000-bp       30×                   0.939±0.020              0.700±0.027                    93.747
3 deep       1000-bp       50×                   0.946±0.018              0.700±0.026                    97.797
3 deep       1000-bp       100×                  0.947±0.015              0.706±0.025                    97.797
3 deep       1000-bp       True                  0.950±0.019              0.717±0.031                    100.000

Despite the increased complexity of the 3 deep reticulation model compared to the 5 non-deep reticulation model, DACS demonstrated robustness across varying levels of reticulation complexity, provided that the assembly quality was adequate. In both scenarios, coverage depth remained the key driver of improved MAG coverage and ROC-AUC, underscoring the importance of sufficient sequencing effort when detecting subtle evolutionary events such as introgression or HGT. Our results also highlight that even complex reticulate events are tractable with a carefully assembled metagenome, provided there is adequate depth of coverage.

Although this study used simulated reads from known reference genomes, real-world metagenomic datasets often introduce additional complexities such as uneven taxon abundances, incomplete reference databases, and the possibility of assembling novel microbial lineages. Future efforts should further evaluate the DACS method on real metagenomic datasets that capture the full diversity and uneven coverage typically encountered in microbial communities. Additional refinements to assembly strategies, binning algorithms, and data integration approaches may further enhance the accuracy of HGT detection by improving the resolution and completeness of MAGs.

Our application of the DACS-based framework to simulated metagenomic datasets demonstrates its robustness in identifying HGT and introgression events under a wide range of coverage depths and read lengths.
Despite the initial advantage of shorter reads under very low coverage, both read lengths ultimately achieve similarly high accuracy in detecting complex reticulate events once coverage depth reaches approximately 20–30×. These findings emphasize the critical role of coverage depth in assembling high-quality MAGs, which in turn drives more reliable HGT detection. The results further underscore the value of DACS for metagenomic research, offering a scalable and accurate tool that can be adapted to diverse microbial communities characterized by extensive reticulation and genomic complexity.

CHAPTER 8
CONCLUSIONS AND FUTURE WORK

8.1 Conclusions

This dissertation has introduced innovative techniques that push the boundaries of what is achievable in introgression studies. The research has laid a solid foundation for advancing the study of introgression by offering novel methods and insights that effectively address the limitations of existing approaches. In particular, our work tackles critical scalability issues and enhances the accuracy of phylogenetic analyses. These advancements not only make significant strides towards understanding genetics and biology, but also provide valuable tools for a wide range of biological and medical applications.

A central achievement of this research is the development of PHiMM, an introgression detection algorithm designed to handle genomic datasets containing dozens of DNA sequences. PHiMM employs a novel coalescent-based approximation strategy combined with a hidden Markov model (HMM) framework. This integration effectively reduces model complexity while preserving detection accuracy. Comparative studies with existing state-of-the-art methods demonstrated that PHiMM achieves substantially better runtime and memory usage than PhyloNet-HMM, while maintaining inference accuracies on par with or surpassing established benchmarks.
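As a reminder of what an HMM contributes to this kind of analysis, the sketch below decodes the most probable sequence of hidden states along an alignment with the Viterbi algorithm. It is a generic two-state toy and not PHiMM's model: the state names, the two-symbol observation alphabet ('c' for a site pattern concordant with the species tree, 'd' for a discordant one), and all probabilities are invented for illustration.

```python
import math
from typing import Dict, List, Sequence

LogTable = Dict[str, Dict[str, float]]


def viterbi(obs: Sequence[str], states: Sequence[str], log_init: Dict[str, float],
            log_trans: LogTable, log_emit: LogTable) -> List[str]:
    """Return the most probable hidden-state path for obs (log-space Viterbi)."""
    if not obs:
        return []
    # best[s]: log-probability of the best path ending in state s at the current site.
    best = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    backpointers: List[Dict[str, str]] = []
    for symbol in obs[1:]:
        new_best, pointer = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[r] + log_trans[r][s])
            new_best[s] = best[prev] + log_trans[prev][s] + log_emit[s][symbol]
            pointer[s] = prev
        best = new_best
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


def log_table(table: Dict[str, Dict[str, float]]) -> LogTable:
    """Convert a table of probabilities to log-probabilities."""
    return {k: {x: math.log(p) for x, p in row.items()} for k, row in table.items()}


# Toy parameters (hypothetical): sticky states, discordant patterns favour "introgressed".
STATES = ["unadmixed", "introgressed"]
LOG_INIT = {"unadmixed": math.log(0.9), "introgressed": math.log(0.1)}
LOG_TRANS = log_table({"unadmixed": {"unadmixed": 0.9, "introgressed": 0.1},
                       "introgressed": {"unadmixed": 0.1, "introgressed": 0.9}})
LOG_EMIT = log_table({"unadmixed": {"c": 0.9, "d": 0.1},
                      "introgressed": {"c": 0.2, "d": 0.8}})

path = viterbi("cccddddccc", STATES, LOG_INIT, LOG_TRANS, LOG_EMIT)
```

With these toy parameters the run of four discordant sites is labeled as an introgressed tract flanked by unadmixed sites. The cost of such decoding grows with the square of the number of hidden states, which is one way to see why reducing model complexity matters for scalability.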
We also employed SERES as a data perturbation technique to enhance introgression inference and learning. Simulation experiments demonstrated that combining the SERES resampling approach with PHiMM substantially improves introgression inference accuracy under various model conditions compared to standalone PHiMM. Although the combined SERES+PHiMM method results in longer runtimes, it maintains memory usage similar to that of standalone PHiMM, demonstrating its potential to “boost” PHiMM’s inference accuracy.

Recognizing that reliable introgression detection is intrinsically linked to accurate phylogenetic network inference, we extended PHiMM to DACS. Unlike traditional methods that require a known network topology, DACS integrates de novo network inference with introgression detection, enabling more flexible and automated analysis. We further improved scalability by leveraging divide-and-conquer and subsampling techniques, allowing large-scale genomic datasets to be partitioned into smaller, more manageable subproblems. These enhancements collectively ensure that DACS is both computationally tractable and capable of producing highly accurate results in ultra-large data contexts.

Given the rising importance of metagenomic studies, particularly those involving complex microbial communities, our work underscores the potential of applying DACS to metagenomic datasets. Through the construction of metagenome-assembled genomes (MAGs), we demonstrated that our methods can be adapted to identify reticulate evolutionary events, such as introgression or horizontal gene transfer (HGT), even in the presence of substantial noise and partial data. These advancements underscore the growing relevance of network-based phylogenetics in metagenomics and pave the way for future research that further refines these approaches for real-world applications.
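To make the MAG construction step concrete: in Section 7.4.2, BLASTN hits between scaffolds and reference genomes were kept only if they had an E-value below 10⁻³, identity above 90%, and aligned length above 100 bp, and each scaffold was assigned to the single reference with the best E-value. The sketch below applies that rule to hits in BLAST's 12-column tabular layout (query, subject, percent identity, alignment length, ..., E-value, bit score). The use of this tabular format, the function name `assign_scaffolds`, and the example hits are assumptions for illustration; this is not the script used in the study.

```python
from typing import Dict, Iterable, Sequence, Tuple


def assign_scaffolds(hits: Iterable[Sequence[str]],
                     max_evalue: float = 1e-3,
                     min_identity: float = 90.0,
                     min_length: int = 100) -> Dict[str, str]:
    """Map each scaffold to the reference of its best-E-value hit that passes all cut-offs."""
    best: Dict[str, Tuple[str, float]] = {}
    for fields in hits:
        scaffold, reference = fields[0], fields[1]
        identity, length, evalue = float(fields[2]), int(fields[3]), float(fields[10])
        passes = evalue < max_evalue and identity > min_identity and length > min_length
        if not passes:
            continue
        # Keep only the hit with the smallest E-value for each scaffold.
        if scaffold not in best or evalue < best[scaffold][1]:
            best[scaffold] = (reference, evalue)
    return {scaffold: reference for scaffold, (reference, _) in best.items()}


# Hypothetical hits: query, subject, %identity, length, mismatches, gap opens,
# q.start, q.end, s.start, s.end, E-value, bit score.
example = [
    "s1\trefA\t95.0\t500\t25\t0\t1\t500\t1\t500\t1e-50\t700",
    "s1\trefB\t99.0\t500\t5\t0\t1\t500\t1\t500\t1e-80\t900",
    "s2\trefA\t85.0\t500\t75\t0\t1\t500\t1\t500\t1e-20\t300",  # fails identity
    "s3\trefC\t97.0\t80\t2\t0\t1\t80\t1\t80\t1e-10\t120",      # fails aligned length
    "s4\trefC\t92.5\t150\t11\t0\t1\t150\t1\t150\t1e-5\t150",
]
assignment = assign_scaffolds(line.split("\t") for line in example)
```

Assigning each scaffold to a single reference keeps the resulting MAGs disjoint, at the cost of discarding secondary hits that could, in real data, point to shared or transferred regions.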
8.2 Future Work

While the methodologies developed in this dissertation provide a robust framework for introgression detection, several key avenues remain open for further investigation.

An important extension involves distinguishing adaptive introgression from the broader spectrum of introgression events. In adaptive introgression, relatively small genomic regions transferred from a donor species confer a selective advantage on the recipient species. This investigation is of great importance because adaptive introgression can broaden genetic diversity and promote species adaptations to various environments. Many methods have been developed to detect adaptive introgression [259–261]. With the rapid progress in deep learning techniques, Gower et al. [262] proposed a convolutional neural network (CNN)-based approach to jointly model introgression and positive selection. However, this method requires simulated data to train the network, with certain parameters and models needing to be specified a priori. Users therefore need experience and expertise to avoid declines in CNN performance caused by misspecification. Inspired by this method, integrating generative adversarial networks (GANs) [263] into the adaptive introgression detection process may allow the model to learn in a more unsupervised or semi-supervised manner. Attention-based mechanisms could also be introduced to incorporate expert knowledge without significantly constraining model flexibility.

Furthermore, making these methods more accessible to a wider audience is a key priority. Developing open-source software packages with user-friendly interfaces and thorough documentation will enable researchers from different fields (such as ecology, epidemiology, and evolutionary biology) to apply these approaches in their own work.
Improved support for reproducible research practices, including containerized computing environments and version-controlled workflows, would further promote transparency and foster broader use of these tools.

Overall, these prospective directions aim to deepen our understanding of reticulate evolution across myriad biological contexts. By harnessing methodological advances in non-tree-like phylogenetics, machine learning, and large-scale data analysis, future work has the potential to further elucidate the complex processes that shape biodiversity and drive adaptation in both eukaryotic and prokaryotic lineages.

BIBLIOGRAPHY

[1] Bryan T. Grenfell, Oliver G. Pybus, Julia R. Gog, James L. N. Wood, Janet M. Daly, Jenny A. Mumford, and Edward C. Holmes. Unifying the epidemiological and evolutionary dynamics of pathogens. Science, 303(5656):327–332, 2004. doi: 10.1126/science.1090727. URL https://www.science.org/doi/abs/10.1126/science.1090727.
[2] Sean B Carroll, Jennifer K Grenier, and Scott D Weatherbee. From DNA to diversity: molecular genetics and the evolution of animal design. John Wiley & Sons, 2013.
[3] Christopher M Thomas and Kaare M Nielsen. Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nature Reviews Microbiology, 3(9):711–721, 2005.
[4] Manolis Kellis, Nick Patterson, Matthew Endrizzi, Bruce Birren, and Eric S Lander. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423(6937):241–254, 2003.
[5] Jian Ma. Reconstructing the history of large-scale genomic changes: Biological questions and computational challenges. Journal of Computational Biology, 18(7):879–893, 2011. doi: 10.1089/cmb.2010.0189. URL https://doi.org/10.1089/cmb.2010.0189. PMID: 21563973.
[6] Richard E. Green, Johannes Krause, Adrian W. Briggs, Tomislav Maricic, Udo Stenzel, Martin Kircher, Nick Patterson, Heng Li, Weiwei Zhai, Markus Hsi-Yang Fritz, Nancy F. Hansen, Eric Y. Durand, Anna-Sapfo Malaspinas, Jeffrey D. Jensen, Tomas Marques-Bonet, Can Alkan, Kay Prüfer, Matthias Meyer, Hernán A. Burbano, Jeffrey M. Good, Rigo Schultz, Ayinuer Aximu-Petri, Anne Butthof, Barbara Höber, Barbara Höffner, Madlen Siegemund, Antje Weihmann, Chad Nusbaum, Eric S. Lander, Carsten Russ, Nathaniel Novod, Jason Affourtit, Michael Egholm, Christine Verna, Pavao Rudan, Dejana Brajkovic, Željko Kucan, Ivan Gušic, Vladimir B. Doronichev, Liubov V. Golovanova, Carles Lalueza-Fox, Marco de la Rasilla, Javier Fortea, Antonio Rosas, Ralf W. Schmitz, Philip L. F. Johnson, Evan E. Eichler, Daniel Falush, Ewan Birney, James C. Mullikin, Montgomery Slatkin, Rasmus Nielsen, Janet Kelso, Michael Lachmann, David Reich, and Svante Pääbo. A Draft Sequence of the Neandertal Genome. Science, 328(5979):710–722, 2010. doi: 10.1126/science.1188021. URL https://www.science.org/doi/abs/10.1126/science.1188021.
[7] Kevin J. Liu, Ethan Steinberg, Alexander Yozzo, Ying Song, Michael H. Kohn, and Luay Nakhleh. Interspecific introgressive origin of genomic diversity in the house mouse. Proceedings of the National Academy of Sciences, 112(1):196–201, 2015. doi: 10.1073/pnas.1406298111. URL https://www.pnas.org/doi/abs/10.1073/pnas.1406298111.
[8] Kanchon K. Dasmahapatra, James R. Walters, Adriana D. Briscoe, John W. Davey, Annabel Whibley, Nicola J. Nadeau, Aleksey V. Zimin, Daniel S. T. Hughes, Laura C. Ferguson, Simon H. Martin, Camilo Salazar, James J. Lewis, Sebastian Adler, Seung-Joon Ahn, Dean A. Baker, Simon W. Baxter, Nicola L. Chamberlain, Ritika Chauhan, Brian A. Counterman, Tamas Dalmay, Lawrence E. Gilbert, Karl Gordon, David G. Heckel, Heather M. Hines, Katharina J. Hoff, Peter W. H. Holland, Emmanuelle Jacquin-Joly, Francis M. Jiggins, Robert T. Jones, Durrell D. Kapan, Paul Kersey, Gerardo Lamas, Daniel Lawson, Daniel Mapleson, Luana S. Maroja, Arnaud Martin, Simon Moxon, William J. Palmer, Riccardo Papa, Alexie Papanicolaou, Yannick Pauchet, David A.
Ray, Neil Rosser, Steven L. Salzberg, Megan A. Supple, Alison Surridge, Ayse Tenger-Trolander, Heiko Vogel, Paul A. Wilkinson, Derek Wilson, James A. Yorke, Furong Yuan, Alexi L. Balmuth, Cathlene Eland, Karim Gharbi, Marian Thomson, Richard A. Gibbs, Yi Han, Joy C. Jayaseelan, Christie Kovar, Tittu Mathew, Donna M. Muzny, Fiona Ongeri, Ling-Ling Pu, Jiaxin Qu, Rebecca L. Thornton, Kim C. Worley, Yuan-Qing Wu, Mauricio Linares, Mark L. Blaxter, Richard H. ffrench-Constant, Mathieu Joron, Marcus R. Kronforst, Sean P. Mullen, Robert D. Reed, Steven E. Scherer, Stephen Richards, James Mallet, W. Owen McMillan, Chris D. Jiggins, and The Heliconius Genome Consortium. Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature, 487(7405):94–98, 2012. doi: 10.1038/nature11041. URL https://doi.org/10.1038/nature11041.
[9] Emile Gluck-Thaler and Jason C. Slot. Dimensions of Horizontal Gene Transfer in Eukaryotic Microbial Pathogens. PLOS Pathogens, 11(10):1–7, 10 2015. doi: 10.1371/journal.ppat.1005156. URL https://doi.org/10.1371/journal.ppat.1005156.
[10] C. Gyles and P. Boerlin. Horizontally Transferred Genetic Elements and Their Role in Pathogenesis of Bacterial Disease. Veterinary Pathology, 51(2):328–340, 2014. doi: 10.1177/0300985813511131. URL https://doi.org/10.1177/0300985813511131. PMID: 24318976.
[11] Hussein A. Hejase, Natalie VandePol, Gregory M. Bonito, and Kevin J. Liu. FastNet: Fast and Accurate Statistical Inference of Phylogenetic Networks Using Large-Scale Genomic Sequence Data. In Mathieu Blanchette and Aïda Ouangraoua, editors, Comparative Genomics, pages 242–259, Cham, 2018. Springer International Publishing. ISBN 978-3-030-00834-5.
[12] Kevin J. Liu, Jingxuan Dai, Kathy Truong, Ying Song, Michael H. Kohn, and Luay Nakhleh. An HMM-Based Comparative Genomic Framework for Detecting Introgression in Eukaryotes. PLOS Computational Biology, 10(6):1–13, 06 2014. doi: 10.1371/journal.pcbi.1003649.
URL https://doi.org/10.1371/journal.pcbi.1003649.
[13] Dingqiao Wen, Yun Yu, Jiafan Zhu, and Luay Nakhleh. Inferring Phylogenetic Networks Using PhyloNet. Systematic Biology, 67(4):735–740, 03 2018. ISSN 1063-5157. doi: 10.1093/sysbio/syy015. URL https://doi.org/10.1093/sysbio/syy015.
[14] Hussein A. Hejase and Kevin J. Liu. A scalability study of phylogenetic network inference methods using empirical datasets and simulations involving a single reticulation. BMC Bioinformatics, 17(1):422, 2016. doi: 10.1186/s12859-016-1277-1. URL https://doi.org/10.1186/s12859-016-1277-1.
[15] Adam D. Leaché, Rebecca B. Harris, Bruce Rannala, and Ziheng Yang. The Influence of Gene Flow on Species Tree Estimation: A Simulation Study. Systematic Biology, 63(1):17–30, 09 2013. ISSN 1063-5157. doi: 10.1093/sysbio/syt049. URL https://doi.org/10.1093/sysbio/syt049.
[16] John C. Wooley, Adam Godzik, and Iddo Friedberg. A Primer on Metagenomics. PLOS Computational Biology, 6(2):1–13, 02 2010. doi: 10.1371/journal.pcbi.1000667. URL https://doi.org/10.1371/journal.pcbi.1000667.
[17] Ziheng Yang. Computational molecular evolution. OUP Oxford, 2006.
[18] B.M.E. Moret, L. Nakhleh, T. Warnow, C.R. Linder, A. Tholse, A. Padolina, J. Sun, and R. Timme. Phylogenetic networks: modeling, reconstructibility, and accuracy. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1):13–23, 2004. doi: 10.1109/TCBB.2004.10.
[19] Katharina T. Huber, Bengt Oxelman, Martin Lott, and Vincent Moulton. Reconstructing the Evolutionary History of Polyploids from Multilabeled Trees. Molecular Biology and Evolution, 23(9):1784–1791, 06 2006. ISSN 0737-4038. doi: 10.1093/molbev/msl045. URL https://doi.org/10.1093/molbev/msl045.
[20] Yun Yu, James H. Degnan, and Luay Nakhleh. The Probability of a Gene Tree Topology within a Phylogenetic Network with Applications to Hybridization Detection. PLOS Genetics, 8(4):1–10, 04 2012. doi: 10.1371/journal.pgen.1002660.
URL https://doi.org/10.1371/journal.pgen.1002660.
[21] Gabriel Cardona, Francesc Rosselló, and Gabriel Valiente. Extended Newick: it is time for a standard representation of phylogenetic networks. BMC Bioinformatics, 9(1):532, 2008. doi: 10.1186/1471-2105-9-532. URL https://doi.org/10.1186/1471-2105-9-532.
[22] Tanja Stadler and James H. Degnan. A polynomial time algorithm for calculating the probability of a ranked gene tree given a species tree. Algorithms for Molecular Biology, 7(1):7, 2012. doi: 10.1186/1748-7188-7-7. URL https://doi.org/10.1186/1748-7188-7-7.
[23] J. F. C. Kingman. On the genealogy of large populations. Journal of Applied Probability, 19(A):27–43, 1982. doi: 10.2307/3213548.
[24] Yun Yu, Cuong Than, James H. Degnan, and Luay Nakhleh. Coalescent Histories on Phylogenetic Networks and Detection of Hybridization Despite Incomplete Lineage Sorting. Systematic Biology, 60(2):138–149, 01 2011. ISSN 1063-5157. doi: 10.1093/sysbio/syq084. URL https://doi.org/10.1093/sysbio/syq084.
[25] Joseph Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17(6):368–376, 1981. doi: 10.1007/BF01734359. URL https://doi.org/10.1007/BF01734359.
[26] Yun Yu, Jianrong Dong, Kevin J. Liu, and Luay Nakhleh. Maximum likelihood inference of reticulate evolutionary histories. Proceedings of the National Academy of Sciences, 111(46):16448–16453, 2014. doi: 10.1073/pnas.1407950111. URL https://www.pnas.org/doi/abs/10.1073/pnas.1407950111.
[27] Yun Yu and Luay Nakhleh. A maximum pseudo-likelihood approach for phylogenetic networks. BMC Genomics, 16(10):S10, 2015. doi: 10.1186/1471-2164-16-S10-S10. URL https://doi.org/10.1186/1471-2164-16-S10-S10.
[28] Michael J. Sanderson. r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics, 19(2):301–302, 01 2003. ISSN 1367-4803. doi: 10.1093/bioinformatics/19.2.301.
URL https://doi.org/10.1093/bioinformatics/19.2.301.
[29] D. Garrigan and A. J. Geneva. msmove, 2014. URL http://dx.doi.org/10.6084/m9.figshare.1060474.
[30] Richard R. Hudson. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338, 02 2002. ISSN 1367-4803. doi: 10.1093/bioinformatics/18.2.337. URL https://doi.org/10.1093/bioinformatics/18.2.337.
[31] T.H. Jukes and C.R. Cantor. Evolution of Protein Molecules, volume 3, pages 21–132. Academic Press, New York, NY, USA, 1969.
[32] Andrew Rambaut and Nicholas C. Grassly. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics, 13(3):235–238, 06 1997. ISSN 1367-4803. doi: 10.1093/bioinformatics/13.3.235. URL https://doi.org/10.1093/bioinformatics/13.3.235.
[33] William Fletcher and Ziheng Yang. INDELible: A Flexible Simulator of Biological Sequence Evolution. Molecular Biology and Evolution, 26(8):1879–1888, 05 2009. ISSN 0737-4038. doi: 10.1093/molbev/msp098. URL https://doi.org/10.1093/molbev/msp098.
[34] Robert R Sokal. A statistical method for evaluating systematic relationships. Univ. Kansas, Sci. Bull., 38:1409–1438, 1958.
[35] N Saitou and M Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425, 07 1987. ISSN 0737-4038. doi: 10.1093/oxfordjournals.molbev.a040454. URL https://doi.org/10.1093/oxfordjournals.molbev.a040454.
[36] Walter M. Fitch. Toward Defining the Course of Evolution: Minimum Change for a Specific Tree Topology. Systematic Biology, 20(4):406–416, 12 1971. ISSN 1063-5157. doi: 10.1093/sysbio/20.4.406. URL https://doi.org/10.1093/sysbio/20.4.406.
[37] Bruce Rannala and Ziheng Yang. Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. Journal of Molecular Evolution, 43(3):304–311, 1996. doi: 10.1007/BF02338839. URL https://doi.org/10.1007/BF02338839.
[38] Ziheng Yang et al. PAML: a program package for phylogenetic analysis by maximum likelihood. Computer Applications in the Biosciences, 13(5):555–556, 1997.
[39] Stéphane Guindon, Jean-François Dufayard, Vincent Lefort, Maria Anisimova, Wim Hordijk, and Olivier Gascuel. New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Systematic Biology, 59(3):307–321, 05 2010. ISSN 1063-5157. doi: 10.1093/sysbio/syq010. URL https://doi.org/10.1093/sysbio/syq010.
[40] Alexandros Stamatakis. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9):1312–1313, 01 2014. ISSN 1367-4803. doi: 10.1093/bioinformatics/btu033. URL https://doi.org/10.1093/bioinformatics/btu033.
[41] Alexey M Kozlov, Diego Darriba, Tomáš Flouri, Benoit Morel, and Alexandros Stamatakis. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics, 35(21):4453–4455, 05 2019. ISSN 1367-4803. doi: 10.1093/bioinformatics/btz305. URL https://doi.org/10.1093/bioinformatics/btz305.
[42] Lam-Tung Nguyen, Heiko A. Schmidt, Arndt von Haeseler, and Bui Quang Minh. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution, 32(1):268–274, 11 2014. ISSN 0737-4038. doi: 10.1093/molbev/msu300. URL https://doi.org/10.1093/molbev/msu300.
[43] Morgan N. Price, Paramvir S. Dehal, and Adam P. Arkin. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLOS ONE, 5(3):1–10, 03 2010. doi: 10.1371/journal.pone.0009490. URL https://doi.org/10.1371/journal.pone.0009490.
[44] John P Huelsenbeck and Fredrik Ronquist. MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8):754–755, 2001.
[45] Sebastian Höhna, Michael J. Landis, Tracy A. Heath, Bastien Boussau, Nicolas Lartillot, Brian R. Moore, John P. Huelsenbeck, and Fredrik Ronquist.
RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Systematic Biology, 65(4):726–736, 05 2016. ISSN 1063-5157. doi: 10.1093/sysbio/syw021. URL https://doi.org/10.1093/sysbio/syw021.
[46] Marc A Suchard, Philippe Lemey, Guy Baele, Daniel L Ayres, Alexei J Drummond, and Andrew Rambaut. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evolution, 4(1):vey016, 2018.
[47] Remco Bouckaert, Timothy G. Vaughan, Joëlle Barido-Sottani, Sebastián Duchêne, Mathieu Fourment, Alexandra Gavryushkina, Joseph Heled, Graham Jones, Denise Kühnert, Nicola De Maio, Michael Matschiner, Fábio K. Mendes, Nicola F. Müller, Huw A. Ogilvie, Louis du Plessis, Alex Popinga, Andrew Rambaut, David Rasmussen, Igor Siveroni, Marc A. Suchard, Chieh-Hsi Wu, Dong Xie, Chi Zhang, Tanja Stadler, and Alexei J. Drummond. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLOS Computational Biology, 15(4):1–28, 04 2019. doi: 10.1371/journal.pcbi.1006650. URL https://doi.org/10.1371/journal.pcbi.1006650.
[48] Nicolas Lartillot, Thomas Lepage, and Samuel Blanquart. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics, 25(17):2286–2288, 06 2009. ISSN 1367-4803. doi: 10.1093/bioinformatics/btp368. URL https://doi.org/10.1093/bioinformatics/btp368.
[49] Nicolas Lartillot, Nicolas Rodrigue, Daniel Stubbs, and Jacques Richer. PhyloBayes MPI: Phylogenetic Reconstruction with Infinite Mixtures of Profiles in a Parallel Environment. Systematic Biology, 62(4):611–615, 04 2013. ISSN 1063-5157. doi: 10.1093/sysbio/syt022. URL https://doi.org/10.1093/sysbio/syt022.
[50] Paschalia Kapli, Ziheng Yang, and Maximilian J Telford. Phylogenetic tree building in the genomic age. Nature Reviews Genetics, 21(7):428–444, 2020.
[51] Joseph Felsenstein. Cases in which parsimony or compatibility methods will be positively misleading.
Systematic Zoology, 27(4):401–410, 1978.
[52] Md Shamsuzzoha Bayzid and Tandy Warnow. Naive binning improves phylogenomic analyses. Bioinformatics, 29(18):2277–2284, 07 2013. ISSN 1367-4803. doi: 10.1093/bioinformatics/btt394. URL https://doi.org/10.1093/bioinformatics/btt394.
[53] Joseph Heled and Alexei J. Drummond. Bayesian Inference of Species Trees from Multilocus Data. Molecular Biology and Evolution, 27(3):570–580, 11 2009. ISSN 0737-4038. doi: 10.1093/molbev/msp274. URL https://doi.org/10.1093/molbev/msp274.
[54] Huw A. Ogilvie, Remco R. Bouckaert, and Alexei J. Drummond. StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates. Molecular Biology and Evolution, 34(8):2101–2114, 04 2017. ISSN 0737-4038. doi: 10.1093/molbev/msx126. URL https://doi.org/10.1093/molbev/msx126.
[55] Ziheng Yang and Bruce Rannala. Unguided species delimitation using DNA sequence data from multiple loci. Molecular Biology and Evolution, 31(12):3125–3135, 10 2014. ISSN 0737-4038. doi: 10.1093/molbev/msu279. URL https://doi.org/10.1093/molbev/msu279.
[56] Tomáš Flouri, Xiyun Jiao, Bruce Rannala, and Ziheng Yang. Species tree inference with BPP using genomic sequences and the multispecies coalescent. Molecular Biology and Evolution, 35(10):2585–2593, 07 2018. ISSN 0737-4038. doi: 10.1093/molbev/msy147. URL https://doi.org/10.1093/molbev/msy147.
[57] Wayne P. Maddison. Gene Trees in Species Trees. Systematic Biology, 46(3):523–536, 09 1997. ISSN 1063-5157. doi: 10.1093/sysbio/46.3.523. URL https://doi.org/10.1093/sysbio/46.3.523.
[58] Cuong Than and Luay Nakhleh. Species Tree Inference by Minimizing Deep Coalescences. PLOS Computational Biology, 5(9):1–12, 09 2009. doi: 10.1371/journal.pcbi.1000501. URL https://doi.org/10.1371/journal.pcbi.1000501.
[59] Yun Yu, Tandy Warnow, and Luay Nakhleh. Algorithms for MDC-Based Multi-Locus Phylogeny Inference: Beyond Rooted Binary Gene Trees on Single Alleles.
Journal of Computational Biology, 18(11):1543–1559, 2011. doi: 10.1089/cmb.2011.0174. URL https://doi.org/10.1089/cmb.2011.0174. PMID: 22035329. [60] S. Mirarab, R. Reaz, Md. S. Bayzid, T. Zimmermann, M. S. Swenson, and T. Warnow. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics, 30(17): i541–i548, 08 2014. ISSN 1367-4803. doi: 10.1093/bioinformatics/btu462. URL https: //doi.org/10.1093/bioinformatics/btu462. [61] Siavash Mirarab and Tandy Warnow. ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes. Bioinformatics, 31(12):i44–i52, 06 2015. ISSN 1367-4803. doi: 10.1093/bioinformatics/btv234. URL https://doi.org/10.1093/ bioinformatics/btv234. [62] Liang Liu, Lili Yu, and Scott V Edwards. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC evolutionary biology, 10:1–18, 2010. [63] Liang Liu. BEST: Bayesian estimation of species trees under the coalescent model. Bioinfor- matics, 24(21):2542–2543, 09 2008. ISSN 1367-4803. doi: 10.1093/bioinformatics/btn484. URL https://doi.org/10.1093/bioinformatics/btn484. [64] D.F. Robinson and L.R. Foulds. Comparison of phylogenetic trees. Mathemat- ical Biosciences, 53(1):131–147, 1981. https://doi.org/ 10.1016/0025-5564(81)90043-2. URL https://www.sciencedirect.com/science/article/pii/ 0025556481900432. ISSN 0025-5564. doi: [65] Jiafan Zhu, Dingqiao Wen, Yun Yu, Heidi M. Meudt, and Luay Nakhleh. Bayesian inference of phylogenetic networks from bi-allelic genetic markers. PLOS Computational Biology, 14(1):1–32, 01 2018. doi: 10.1371/journal.pcbi.1005932. URL https://doi.org/10.1371/ journal.pcbi.1005932. [66] Luay Nakhleh. A Metric on the Space of Reduced Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(2):218–222, 2010. doi: 10.1109/TCBB.2009.2. 148 [67] F. Rodríguez, J.L. Oliver, A. Marín, and J.R. Medina. The general stochastic model of nucleotide substitution. 
Journal of Theoretical Biology, 142(4):485–501, 1990. ISSN 0022-5193. doi: https://doi.org/10.1016/S0022-5193(05)80104-3. URL https://www. sciencedirect.com/science/article/pii/S0022519305801043. [68] Motoo Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of molecular evolution, 16: 111–120, 1980. [69] Joseph Felsenstein and Joseph Felenstein. Inferring phylogenies, volume 2. Sinauer asso- ciates Sunderland, MA, 2004. [70] Z Yang. Maximum-likelihood estimation of phylogeny from DNA sequences when sub- stitution rates differ over sites. Molecular Biology and Evolution, 10(6):1396–1401, ISSN 0737-4038. doi: 10.1093/oxfordjournals.molbev.a040082. URL https: 11 1993. //doi.org/10.1093/oxfordjournals.molbev.a040082. [71] Jente Ottenburghs. Ghost Introgression: Spooky Gene Flow in the Distant Past. BioEs- says, 42(6):2000012, 2020. doi: https://doi.org/10.1002/bies.202000012. URL https: //onlinelibrary.wiley.com/doi/abs/10.1002/bies.202000012. [72] Adriana Suarez-Gonzalez, Christian Lexer, and Quentin C. B. Cronk. Adaptive introgression: a plant perspective. Biology Letters, 14(3):20170688, 2018. doi: 10.1098/rsbl.2017.0688. URL https://royalsocietypublishing.org/doi/abs/10.1098/rsbl.2017.0688. [73] Tao Sang and Yang Zhong. Testing Hybridization Hypotheses Based on Incongruent Gene ISSN 1063-5157. doi: 10.1080/ Trees. Systematic Biology, 49(3):422–434, 09 2000. 10635159950127321. URL https://doi.org/10.1080/10635159950127321. [74] Mark T. Holder, Jennifer A. Anderson, and Alisha K. Holloway. Difficulties in Detecting Hybridization. Systematic Biology, 50(6):978–982, 2001. ISSN 10635157, 1076836X. URL http://www.jstor.org/stable/3070875. [75] Nick Patterson, Daniel J. Richter, Sante Gnerre, Eric S. Lander, and David Reich. Genetic evidence for complex speciation of humans and chimpanzees. Nature, 441(7097):1103– 1108, 2006. doi: 10.1038/nature04789. 
URL https://doi.org/10.1038/nature04789. [76] Asger Hobolth, Ole F Christensen, Thomas Mailund, and Mikkel H Schierup. Genomic Relationships and Speciation Times of Human, Chimpanzee, and Gorilla Inferred from a Coalescent Hidden Markov Model. PLOS Genetics, 3(2):1–11, 02 2007. doi: 10.1371/ journal.pgen.0030007. URL https://doi.org/10.1371/journal.pgen.0030007. [77] Julien Y Dutheil, Ganesh Ganapathy, Asger Hobolth, Thomas Mailund, Marcy K Uyenoyama, and Mikkel H Schierup. Ancestral Population Genomics: The Coalescent Hidden Markov Model Approach. Genetics, 183(1):259–274, 09 2009. ISSN 1943-2631. 149 doi: 10.1534/genetics.109.103010. URL https://doi.org/10.1534/genetics.109.103010. [78] Thomas Mailund, Julien Y. Dutheil, Asger Hobolth, Gerton Lunter, and Mikkel H. Schierup. Estimating Divergence Time and Ancestral Effective Population Size of Bornean and Suma- tran Orangutan Subspecies Using a Coalescent Hidden Markov Model. PLOS Genetics, 7(3): 1–15, 03 2011. doi: 10.1371/journal.pgen.1001319. URL https://doi.org/10.1371/journal. pgen.1001319. [79] Thomas Mailund, Anders E. Halager, Michael Westergaard, Julien Y. Dutheil, Kasper Munch, Lars N. Andersen, Gerton Lunter, Kay Prüfer, Aylwyn Scally, Asger Hobolth, and Mikkel H. Schierup. A New Isolation with Migration Model along Complete Genomes Infers Very Different Divergence Processes among Closely Related Great Ape Species. PLOS Genetics, 8(12):1–19, 12 2012. doi: 10.1371/journal.pgen.1003125. URL https://doi.org/10.1371/journal.pgen.1003125. [80] Simon Joly, Patricia A. McLenachan, and Peter J. Lockhart. A Statistical Approach for Distinguishing Hybridization and Incomplete Lineage Sorting. The American Naturalist, 174(2):E54–E70, 2009. doi: 10.1086/600082. URL https://doi.org/10.1086/600082. PMID: 19519219. [81] Simon Joly. JML: testing hybridization from species trees. Molecular Ecology Resources, 12(1):179–184, 2012. doi: https://doi.org/10.1111/j.1755-0998.2011.03065.x. 
URL https: //onlinelibrary.wiley.com/doi/abs/10.1111/j.1755-0998.2011.03065.x. [82] Cedric Chauve, Jingxue Feng, and Liangliang Wang. Detecting Introgression in Anopheles Mosquito Genomes Using a Reconciliation-Based Approach. In Mathieu Blanchette and Aïda Ometricuangraoua, editors, Comparative Genomics, pages 163–178, Cham, 2018. Springer International Publishing. ISBN 978-3-030-00834-5. [83] Daniel R. Schrider, Julien Ayroles, Daniel R. Matute, and Andrew D. Kern. Supervised machine learning reveals introgressed loci in the genomes of Drosophila simulans and D. sechellia. PLOS Genetics, 14(4):1–29, 04 2018. doi: 10.1371/journal.pgen.1007341. URL https://doi.org/10.1371/journal.pgen.1007341. [84] Lex Flagel, Yaniv Brandvain, and Daniel R Schrider. The unreasonable effectiveness of convolutional neural networks in population genetic inference. Molecular Biology and Evolution, 36(2):220–238, 12 2018. ISSN 0737-4038. doi: 10.1093/molbev/msy224. URL https://doi.org/10.1093/molbev/msy224. [85] Dylan D. Ray, Lex Flagel, and Daniel R. Schrider. Introunet: Identifying introgressed alleles via semantic segmentation. PLOS Genetics, 20(2):1–37, 02 2024. doi: 10.1371/journal. pgen.1010657. URL https://doi.org/10.1371/journal.pgen.1010657. [86] Daniel H. Huson, Tobias Klöpper, Pete J. Lockhart, and Mike A. Steel. Reconstruction In Satoru Miyano, Jill Mesirov, Simon Kasif, of Reticulate Networks from Gene Trees. 150 Sorin Istrail, Pavel A. Pevzner, and Michael Waterman, editors, Research in Computational Molecular Biology, pages 233–249, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. ISBN 978-3-540-31950-4. [87] Eric Y. Durand, Nick Patterson, David Reich, and Montgomery Slatkin. Testing for Ancient Admixture between Closely Related Populations. Molecular Biology and Evolution, 28 (8):2239–2252, 02 2011. ISSN 0737-4038. doi: 10.1093/molbev/msr048. URL https: //doi.org/10.1093/molbev/msr048. [88] James B. Pease and Matthew W. Hahn. 
Detection and Polarization of Introgression in a Five-Taxon Phylogeny. Systematic Biology, 64(4):651–662, 04 2015. ISSN 1063-5157. doi: 10.1093/sysbio/syv023. URL https://doi.org/10.1093/sysbio/syv023. [89] Ryan A. Leo Elworth, Chabrielle Allen, Travis Benedict, Peter Dulworth, and Luay Nakhleh. DGEN: A Test Statistic for Detection of General Introgression Scenarios. In Laxmi Parida and Esko Ukkonen, editors, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018), volume 113 of Leibniz International Proceedings in Informatics (LIPIcs), pages 19:1–19:13, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. ISBN 978-3-95977-082-8. doi: 10.4230/LIPIcs.WABI.2018.19. URL http://drops.dagstuhl. de/opus/volltexte/2018/9321. [90] David Reich, Kumarasamy Thangaraj, Nick Patterson, Alkes L. Price, and Lalji Singh. Reconstructing Indian population history. Nature, 461(7263):489–494, 2009. doi: 10.1038/ nature08365. URL https://doi.org/10.1038/nature08365. [91] David Reich, Nick Patterson, Martin Kircher, Frederick Delfin, Madhusudan R. Nandi- neni, Irina Pugach, Albert Min-Shan Ko, Ying-Chin Ko, Timothy A. Jinam, Maude E. Phipps, Naruya Saitou, Andreas Wollstein, Manfred Kayser, Svante Pääbo, and Mark Stoneking. Denisova Admixture and the First Modern Human Dispersals into South- east Asia and Oceania. The American Journal of Human Genetics, 89(4):516–528, 2011. ISSN 0002-9297. doi: https://doi.org/10.1016/j.ajhg.2011.09.005. URL https: //www.sciencedirect.com/science/article/pii/S0002929711003958. [92] Nick Patterson, Priya Moorjani, Yontao Luo, Swapan Mallick, Nadin Rohland, Yiping Zhan, Teri Genschoreck, Teresa Webster, and David Reich. Ancient Admixture in Human History. Genetics, 192(3):1065–1093, 11 2012. ISSN 1943-2631. doi: 10.1534/genetics.112.145037. URL https://doi.org/10.1534/genetics.112.145037. [93] Simon H. Martin, John W. Davey, and Chris D. Jiggins. Evaluating the Use of ABBA–BABA Statistics to Locate Introgressed Loci. 
Molecular Biology and Evolution, 32(1):244–257, 09 2014. ISSN 0737-4038. doi: 10.1093/molbev/msu269. URL https://doi.org/10.1093/ molbev/msu269. [94] Benjamin M Peter. Admixture, Population Structure, and F-Statistics. Genetics, 202(4): 1485–1501, 02 2016. ISSN 1943-2631. doi: 10.1534/genetics.115.183913. URL https: 151 //doi.org/10.1534/genetics.115.183913. [95] Paul D Blischak, Julia Chifman, Andrea D Wolfe, and Laura S Kubatko. HyDe: A Python Package for Genome-Scale Hybridization Detection. Systematic Biology, 67(5):821–829, 03 2018. ISSN 1063-5157. doi: 10.1093/sysbio/syy023. URL https://doi.org/10.1093/sysbio/ syy023. [96] Laura S. Kubatko and Julia Chifman. An invariants-based method for efficient identification of hybrid species from large-scale genomic data. BMC Evolutionary Biology, 19(1):112, 2019. doi: 10.1186/s12862-019-1439-7. URL https://doi.org/10.1186/s12862-019-1439-7. [97] Bastian Pfeifer and Durrell D. Kapan. Estimates of introgression as a function of pairwise distances. BMC Bioinformatics, 20(1):207, 2019. doi: 10.1186/s12859-019-2747-z. URL https://doi.org/10.1186/s12859-019-2747-z. [98] Howard Ochman, Jeffrey G. Lawrence, and Eduardo A. Groisman. Lateral gene transfer and the nature of bacterial innovation. Nature, 405(6784):299–304, 2000. doi: 10.1038/ 35012500. URL https://doi.org/10.1038/35012500. [99] Brad Spellberg, Robert Guidos, David Gilbert, John Bradley, Helen W. Boucher, W. Michael Scheld, John G. Bartlett, Jr Edwards, John, and the Infectious Diseases Society of Amer- ica. The Epidemic of Antibiotic-Resistant Infections: A Call to Action for the Medi- cal Community from the Infectious Diseases Society of America. Clinical Infectious Diseases, 46(2):155–164, 01 2008. doi: 10.1086/524891. URL https://doi.org/10.1086/524891. ISSN 1058-4838. [100] Alan G. Mathew, Robin Cissell, and S. Liamthong. Antibiotic Resistance in Bacteria Associated with Food Animals: A United States Perspective of Livestock Production. 
Food- borne Pathogens and Disease, 4(2):115–133, 2007. doi: 10.1089/fpd.2006.0066. URL https://doi.org/10.1089/fpd.2006.0066. PMID: 17600481. [101] Rafael Szczepanowski, Burkhard Linke, Irene Krahn, Karl-Heinz Gartemann, Tim Gützkow, Wolfgang Eichler, Alfred Pühler, and Andreas Schlüter. Detection of 140 clinically relevant antibiotic-resistance genes in the plasmid metagenome of wastewater treatment plant bacteria showing reduced susceptibility to selected antibiotics. Microbiology, 155(7):2306–2319, 2009. ISSN 1465-2080. doi: https://doi.org/10.1099/mic.0.028233-0. URL https://www. microbiologyresearch.org/content/journal/micro/10.1099/mic.0.028233-0. [102] Johan Bengtsson-Palme, Fredrik Boulund, Jerker Fick, Erik Kristiansson, and D. G. Joakim Larsson. Shotgun metagenomics reveals a wide array of antibiotic resistance genes and mobile elements in a polluted lake in India. Frontiers in Microbiology, 5, 2014. ISSN 1664-302X. doi: 10.3389/fmicb.2014.00648. URL https://www.frontiersin.org/articles/10. 3389/fmicb.2014.00648. [103] Lauren D. McDaniel, Elizabeth Young, Jennifer Delaney, Fabian Ruhnau, Kim B. Ritchie, 152 and John H. Paul. High Frequency of Horizontal Gene Transfer in the Oceans. Science, 330 (6000):50–50, 2010. doi: 10.1126/science.1192243. URL https://www.science.org/doi/abs/ 10.1126/science.1192243. [104] James H. Degnan and Noah A. Rosenberg. Gene tree discordance, phylogenetic in- ference and the multispecies coalescent. Trends in Ecology & Evolution, 24(6):332– 340, 2009. doi: https://doi.org/10.1016/j.tree.2009.01.009. URL https://www.sciencedirect.com/science/article/pii/S0169534709000846. ISSN 0169-5347. [105] Ricard Albalat and Cristian Cañestro. Evolution by gene loss. Nature Reviews Genetics, 17 (7):379–391, 2016. doi: 10.1038/nrg.2016.39. URL https://doi.org/10.1038/nrg.2016.39. [106] Shweta Meshram, Sunaina Bisht, and Robin Gogoi. Chapter 14 - Current development, application and constraints of biopesticides in plant disease management. 
In Amitava Rakshit, Vijay Singh Meena, P.C. Abhilash, B.K. Sarma, H.B. Singh, Leonardo Fraceto, Manoj Parihar, and Anand Kumar Singh, editors, Biopesticides, Advances in Bio-inoculant Science, pages 207–224. Woodhead Publishing, 2022. ISBN 978-0-12-823355-9. doi: https://doi. org/10.1016/B978-0-12-823355-9.00004-3. URL https://www.sciencedirect.com/science/ article/pii/B9780128233559000043. [107] Jeffrey G. Lawrence and Howard Ochman. Reconciling the many faces of lateral gene doi: https: transfer. Trends in Microbiology, 10(1):1–4, 2002. //doi.org/10.1016/S0966-842X(01)02282-X. URL https://www.sciencedirect.com/science/ article/pii/S0966842X0102282X. ISSN 0966-842X. [108] Mohammad Shabbir Hasan, Qi Liu, Han Wang, John Fazekas, Bernard Chen, and Dong- sheng Che. GIST: Genomic island suite of tools for predicting genomic islands in genomic sequences. Bioinformation, 8(4):203–205, 2012. ISSN 0973-2063 (Electronic); 0973-8894 (Print); 0973-2063 (Linking). doi: 10.6026/97320630008203. [109] Claire Bertelli, Matthew R Laird, Kelly P Williams, Simon Fraser University Research Com- puting Group, Britney Y Lau, Gemma Hoad, Geoffrey L Winsor, and Fiona SL Brinkman. IslandViewer 4: expanded prediction of genomic islands for larger-scale datasets. Nucleic Acids Research, 45(W1):W30–W35, 05 2017. ISSN 0305-1048. doi: 10.1093/nar/gkx343. URL https://doi.org/10.1093/nar/gkx343. [110] Christophe Dessimoz, Daniel Margadant, and Gaston H. Gonnet. DLIGHT – Lateral Gene Transfer Detection Using Pairwise Evolutionary Distances in a Statistical Framework. In Martin Vingron and Limsoon Wong, editors, Research in Computational Molecular Biology, pages 315–330, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg. ISBN 978-3-540- 78839-3. [111] Qiyun Zhu, Michael Kosoy, and Katharina Dittmar. HGTector: an automated method facilitating genome-wide discovery of putative horizontal gene transfers. BMC Ge- nomics, 15(1):717, 2014. doi: 10.1186/1471-2164-15-717. 
URL https://doi.org/10.1186/ 153 1471-2164-15-717. [112] Sheila Podell and Terry Gaasterland. DarkHorse: a method for genome-wide prediction of horizontal gene transfer. Genome Biology, 8(2):R16, 2007. doi: 10.1186/gb-2007-8-2-r16. URL https://doi.org/10.1186/gb-2007-8-2-r16. [113] Lawrence A. David and Eric J. Alm. Rapid evolutionary innovation during an Archaean genetic expansion. Nature, 469(7328):93–96, 2011. doi: 10.1038/nature09649. URL https://doi.org/10.1038/nature09649. [114] Mukul S Bansal, Manolis Kellis, Misagh Kordi, and Soumya Kundu. RANGER-DTL 2.0: rigorous reconstruction of gene-family evolution by duplication, transfer and loss. Bioinformatics, 34(18):3214–3216, 04 2018. ISSN 1367-4803. doi: 10.1093/bioinformatics/ bty314. URL https://doi.org/10.1093/bioinformatics/bty314. [115] Thomas W. Schoenfeld, Senthil K. Murugapiran, Jeremy A. Dodsworth, Sally Floyd, Michael Lodes, David A. Mead, and Brian P. Hedlund. Lateral Gene Transfer of Family A DNA Poly- merases between Thermophilic Viruses, Aquificae, and Apicomplexa. Molecular Biology and Evolution, 30(7):1653–1664, 04 2013. ISSN 0737-4038. doi: 10.1093/molbev/mst078. URL https://doi.org/10.1093/molbev/mst078. [116] Weizhi Song, Bernd Wemheuer, Shan Zhang, Kerrin Steensen, and Torsten Thomas. MetaCHIP: community-level horizontal gene transfer identification through the combi- nation of best-match and phylogenetic approaches. Microbiome, 7(1):36, 2019. doi: 10.1186/s40168-019-0649-y. URL https://doi.org/10.1186/s40168-019-0649-y. [117] Matt Ravenhall, Nives Škunca, Florent Lassalle, and Christophe Dessimoz. Inferring Horizontal Gene Transfer. PLOS Computational Biology, 11(5):1–16, 05 2015. doi: 10.1371/journal.pcbi.1004095. URL https://doi.org/10.1371/journal.pcbi.1004095. [118] Richard G. Harrison and Erica L. Larson. Hybridization, Introgression, and the Nature of Species Boundaries. Journal of Heredity, 105(S1):795–809, 08 2014. ISSN 0022-1503. doi: 10.1093/jhered/esu033. 
URL https://doi.org/10.1093/jhered/esu033. [119] James Mallet. Hybridization as an invasion of the genome. Trends in Ecology & Evolution, ISSN 0169-5347. doi: https://doi.org/10.1016/j.tree.2005.02.010. 20(5):229–237, 2005. URL https://www.sciencedirect.com/science/article/pii/S016953470500039X. Special is- sue: Invasions, guest edited by Michael E. Hochberg and Nicholas J. Gotelli. [120] Ying Song, Stefan Endepols, Nicole Klemann, Dania Richter, Franz-Rainer Matuschka, Ching-Hua Shih, Michael W. Nachman, and Michael H. Kohn. Adaptive Introgression of Anticoagulant Rodent Poison Resistance by Hybridization between Old World Mice. Current Biology, 21(15):1296–1301, 2011. ISSN 0960-9822. doi: https://doi.org/10.1016/j.cub. 2011.06.043. URL https://www.sciencedirect.com/science/article/pii/S0960982211007160. 154 [121] Luay Nakhleh. Computational approaches to species phylogeny inference and gene tree reconciliation. Trends in Ecology & Evolution, 28(12):719–728, 2013. ISSN 0169- 5347. doi: https://doi.org/10.1016/j.tree.2013.09.004. URL https://www.sciencedirect.com/ science/article/pii/S0169534713002139. [122] Qiqige Wuyun, Nicholas W. VanKuren, Marcus Kronforst, Sean P. Mullen, and Kevin J. Liu. Scalable Statistical Introgression Mapping Using Approximate Coalescent-Based Inference. In Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB ’19, page 504–513, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450366663. doi: 10.1145/3307339. 3342165. URL https://doi.org/10.1145/3307339.3342165. [123] Cuong Than, Derek Ruths, and Luay Nakhleh. PhyloNet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC Bioinformatics, 9(1):322, 2008. doi: 10.1186/1471-2105-9-322. URL https://doi.org/10.1186/1471-2105-9-322. [124] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. 
Proceedings of the IEEE, 77(2):257–286, 1989. doi: 10.1109/5.18626. [125] Michael JD Powell. The BOBYQA algorithm for bound constrained optimization without derivatives. Cambridge NA Report NA2009/06, University of Cambridge, Cambridge, pages 26–46, 2009. [126] R. P. Brent. Algorithms for Minimization without Derivatives. Dover Publications, Mineola, New York, 1973. [127] Fabian Staubach, Anna Lorenc, Philipp W. Messer, Kun Tang, Dmitri A. Petrov, and Diethard Tautz. Genome Patterns of Selection and Introgression of Haplotypes in Natural Populations of the House Mouse (Mus musculus). PLOS Genetics, 8(8):1–13, 08 2012. doi: 10.1371/ journal.pgen.1002891. URL https://doi.org/10.1371/journal.pgen.1002891. [128] Jason R. Gallant, Vance E. Imhoff, Arnaud Martin, Wesley K. Savage, Nicola L. Cham- berlain, Ben L. Pote, Chelsea Peterson, Gabriella E. Smith, Benjamin Evans, Robert D. Reed, Marcus R. Kronforst, and Sean P. Mullen. Ancient homology underlies adap- tive mimetic diversity across butterflies. Nature Communications, 5(1):4817, 2014. doi: 10.1038/ncomms5817. URL https://doi.org/10.1038/ncomms5817. [129] Hyuna Yang, Jeremy R Wang, John P Didion, Ryan J Buus, Timothy A Bell, Catherine E Welsh, François Bonhomme, Alex Hon-Tsen Yu, Michael W Nachman, Jaroslav Pialek, Priscilla Tucker, Pierre Boursot, Leonard McMillan, Gary A Churchill, and Fernando Pardo- Manuel de Villena. Subspecific origin and haplotype diversity in the laboratory mouse. Nature Genetics, 43(7):648–655, 2011. doi: 10.1038/ng.847. URL https://doi.org/10.1038/ ng.847. [130] Jean-Louis Guénet and François Bonhomme. Wild mice: an ever-increasing contribu- 155 tion to a popular mammalian model. Trends in Genetics, 19(1):24–31, 2003. 0168-9525. sciencedirect.com/science/article/pii/S0168952502000070. ISSN doi: https://doi.org/10.1016/S0168-9525(02)00007-0. URL https://www. [131] John P. Didion, Hyuna Yang, Keith Sheppard, Chen-Ping Fu, Leonard McMillan, Fernando Pardo-Manuel de Villena, and Gary A. 
Churchill. Discovery of novel variants in genotyping arrays improves genotype retention and reduces ascertainment bias. BMC Genomics, 13(1): 34, 2012. doi: 10.1186/1471-2164-13-34. URL https://doi.org/10.1186/1471-2164-13-34. [132] Paul Scheet and Matthew Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics, 78(4):629–644, 2006. [133] Bettina Harr, Emre Karakoc, Rafik Neme, Meike Teschke, Christine Pfeifle, Željka Pezer, Hiba Babiker, Miriam Linnenbrink, Inka Montero, Rick Scavetta, Mohammad Reza Abai, Marta Puente Molins, Mathias Schlegel, Rainer G. Ulrich, Janine Altmüller, Marek Franitza, Anna Büntge, Sven Künzel, and Diethard Tautz. Genomic resources for wild populations of the house mouse, Mus musculus and its close relative Mus spretus. Scientific Data, 3(1): 160075, 2016. doi: 10.1038/sdata.2016.75. URL https://doi.org/10.1038/sdata.2016.75. [134] Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics: Methodology and distribution, pages 569–593. Springer, 1992. [135] Bradley Efron and Robert Tibshirani. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical science, pages 54–75, 1986. [136] Wei Wang, Jack Smith, Hussein A Hejase, and Kevin J Liu. Non-parametric and semi- parametric support estimation using sequential resampling random walks on biomolecular sequences. Algorithms for Molecular Biology, 15:1–15, 2020. [137] Giddy Landan and Dan Graur. Heads or tails: a simple reliability check for multiple sequence alignments. Molecular biology and evolution, 24(6):1380–1383, 2007. [138] Wei Wang, Qiqige Wuyun, and Kevin J Liu. An application of random walk resampling to phylogenetic hmm inference and learning. IEEE Transactions on NanoBioscience, 19(3): 506–517, 2020. [139] Oscar Westesson and Ian Holmes. 
Accurate detection of recombinant breakpoints in whole- genome alignments. PLOS Computational Biology, 5(3):1–13, 03 2009. doi: 10.1371/ journal.pcbi.1000318. URL https://doi.org/10.1371/journal.pcbi.1000318. [140] Michael PH Stumpf and Gilean AT McVean. Estimating recombination rates from population-genetic data. Nature Reviews Genetics, 4(12):959–968, 2003. [141] Simon Myers, Leonardo Bottolo, Colin Freeman, Gil McVean, and Peter Donnelly. A fine- 156 scale map of recombination rates and hotspots across the human genome. Science, 310 (5746):321–324, 2005. doi: 10.1126/science.1117196. URL https://www.science.org/doi/ abs/10.1126/science.1117196. [142] Adam Auton and Gil McVean. Recombination rate estimation in the presence of hotspots. Genome Research, 17(8):1219–1227, 2007. doi: 10.1101/gr.6386707. URL http://genome. cshlp.org/content/17/8/1219.abstract. [143] Bernard M.E. Moret, Usman Roshan, and Tandy Warnow. Sequence-Length Requirements for Phylogenetic Methods. In Roderic Guigó and Dan Gusfield, editors, Algorithms in Bioinformatics, pages 343–356, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg. ISBN 978-3-540-45784-8. [144] Kevin Liu, Sindhu Raghavan, Serita Nelesen, C Randal Linder, and Tandy Warnow. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. doi: 10.1126/science.1171243. URL Science, 324(5934):1561–1564, 2009. https://www.science.org/doi/abs/10.1126/science.1171243. [145] Jotun Hein, Mikkel Schierup, and Carsten Wiuf. Gene Genealogies, Variation and Evolution: a Primer in Coalescent Theory. Oxford University Press, Oxford, 2004. [146] Sen Song, Liang Liu, Scott V Edwards, and Shaoyuan Wu. Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proceed- ings of the National Academy of Sciences, 109(37):14942–14947, 2012. [147] Mark P Simmons, Daniel B Sloan, and John Gatesy. 
The effects of subsampling gene trees on coalescent methods applied to ancient divergences. Molecular Phylogenetics and Evolution, 97:76–89, 2016. [148] Laura A Katz and Jessica R Grant. Taxon-rich phylogenomic analyses resolve the eukaryotic tree of life and reveal the power of subsampling by sites. Systematic biology, 64(3):406–415, 2015. [149] Alexander Lalejini, Marcos Sanson, Jack Garbus, Matthew Andres Moreno, and Emily Dol- son. Runtime phylogenetic analysis enables extreme subsampling for test-based problems. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, pages 511–514, 2024. [150] Andrew Francis, Daniel H. Huson, and Mike Steel. Normalising phylogenetic net- works. Molecular Phylogenetics and Evolution, 163:107215, 2021. ISSN 1055-7903. doi: https://doi.org/10.1016/j.ympev.2021.107215. URL https://www.sciencedirect.com/ science/article/pii/S1055790321001482. [151] Andrew Francis, Daniele Marchei, and Mike Steel. Phylogenetic network classes through the lens of expanding covers. Journal of Mathematical Biology, 88(5):58, 2024. 157 [152] Michael C. Fontaine, James B. Pease, Aaron Steele, Robert M. Waterhouse, Daniel E. Neafsey, Igor V. Sharakhov, Xiaofang Jiang, Andrew B. Hall, Flaminia Catteruccia, Evdoxia Kakani, Sara N. Mitchell, Yi-Chieh Wu, Hilary A. Smith, R. Rebecca Love, Mara K. Lawniczak, Michel A. Slotman, Scott J. Emrich, Matthew W. Hahn, and Nora J. Besansky. Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science, 347(6217):1258524, 2015. doi: 10.1126/science.1258524. URL https://www. science.org/doi/abs/10.1126/science.1258524. [153] Daniel E. Neafsey, Robert M. Waterhouse, Mohammad R. Abai, Sergey S. Aganezov, Max A. Alekseyev, James E. Allen, James Amon, Bruno Arcà, Peter Arensburger, Gleb Artemov, Lauren A. Assour, Hamidreza Basseri, Aaron Berlin, Bruce W. Birren, Stephanie A. Blandin, Andrew I. Brockman, Thomas R. Burkot, Austin Burt, Clara S. 
Chan, Cedric Chauve, Joanna C. Chiu, Mikkel Christensen, Carlo Costantini, Victoria L. M. Davidson, Elena Deligianni, Tania Dottorini, Vicky Dritsou, Stacey B. Gabriel, Wamdaogo M. Guelbeogo, Andrew B. Hall, Mira V. Han, Thaung Hlaing, Daniel S. T. Hughes, Adam M. Jenkins, Xiaofang Jiang, Irwin Jungreis, Evdoxia G. Kakani, Maryam Kamali, Petri Kemppainen, Ryan C. Kennedy, Ioannis K. Kirmitzoglou, Lizette L. Koekemoer, Njoroge Laban, Nicholas Langridge, Mara K. N. Lawniczak, Manolis Lirakis, Neil F. Lobo, Ernesto Lowy, Robert M. MacCallum, Chunhong Mao, Gareth Maslen, Charles Mbogo, Jenny McCarthy, Kristin Michel, Sara N. Mitchell, Wendy Moore, Katherine A. Murphy, Anastasia N. Naumenko, Tony Nolan, Eva M. Novoa, Samantha O’Loughlin, Chioma Oringanje, Mohammad A. Oshaghi, Nazzy Pakpour, Philippos A. Papathanos, Ashley N. Peery, Michael Povelones, Anil Prakash, David P. Price, Ashok Rajaraman, Lisa J. Reimer, David C. Rinker, Antonis Rokas, Tanya L. Russell, N’Fale Sagnon, Maria V. Sharakhova, Terrance Shea, Felipe A. Simão, Frederic Simard, Michel A. Slotman, Pradya Somboon, Vladimir Stegniy, Clau- dio J. Struchiner, Gregg W. C. Thomas, Marta Tojo, Pantelis Topalis, José M. C. Tubio, Maria F. Unger, John Vontas, Catherine Walton, Craig S. Wilding, Judith H. Willis, Yi- Chieh Wu, Guiyun Yan, Evgeny M. Zdobnov, Xiaofan Zhou, Flaminia Catteruccia, George K. Christophides, Frank H. Collins, Robert S. Cornman, Andrea Crisanti, Martin J. Donnelly, Scott J. Emrich, Michael C. Fontaine, William Gelbart, Matthew W. Hahn, Immo A. Hansen, Paul I. Howell, Fotis C. Kafatos, Manolis Kellis, Daniel Lawson, Christos Louis, Shirley Luckhart, Marc A. T. Muskavitch, José M. Ribeiro, Michael A. Riehle, Igor V. Sharakhov, Zhijian Tu, Laurence J. Zwiebel, and Nora J. Besansky. Highly evolvable malaria vectors: The genomes of 16 anopheles mosquitoes. Science, 347(6217):1258522, 2015. doi: 10.1126/science.1258522. URL https://www.science.org/doi/abs/10.1126/science.1258522. 
[154] Sangeet Lamichhaney, Jonas Berglund, Markus Sällman Almén, Khurram Maqbool, Man- fred Grabherr, Alvaro Martinez-Barrio, Marta Promerová, Carl-Johan Rubin, Chao Wang, Neda Zamani, et al. Evolution of darwin’s finches and their beaks revealed by genome sequencing. Nature, 518(7539):371–375, 2015. [155] Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics, 25(14):1754–1760, 05 2009. ISSN 1367-4803. doi: 10.1093/ bioinformatics/btp324. URL https://doi.org/10.1093/bioinformatics/btp324. 158 [156] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and samtools. Bioinformatics, 25(16):2078–2079, 06 2009. ISSN 1367-4803. doi: 10.1093/bioinformatics/btp352. URL https://doi.org/10.1093/ bioinformatics/btp352. [157] Sangeet Lamichhaney, Fan Han, Jonas Berglund, Chao Wang, Markus Sällman Almén, Matthew T. Webster, B. Rosemary Grant, Peter R. Grant, and Leif Andersson. A beak size locus in darwin’s finches facilitated character displacement during a drought. Science, 352 (6284):470–474, 2016. doi: 10.1126/science.aad8786. URL https://www.science.org/doi/ abs/10.1126/science.aad8786. [158] Stepfanie M Aguillon, Tristram O Dodge, Gabriel A Preising, and Molly Schumer. Intro- gression. Current Biology, 32(16):R865–R868, 2022. [159] Thomas J. Sharpton. An introduction to the analysis of shotgun metagenomic data. Frontiers in Plant Science, 5, 2014. ISSN 1664-462X. doi: 10.3389/fpls.2014.00209. URL https: //www.frontiersin.org/articles/10.3389/fpls.2014.00209. [160] Florian P Breitwieser, Jennifer Lu, and Steven L Salzberg. A review of methods and databases for metagenomic classification and assembly. Briefings in Bioinformatics, 20(4): 1125–1136, 09 2017. ISSN 1477-4054. doi: 10.1093/bib/bbx120. URL https://doi.org/10. 1093/bib/bbx120. 
[161] Mads Albertsen, Philip Hugenholtz, Adam Skarshewski, Kåre L Nielsen, Gene W Tyson, and Per H Nielsen. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nature Biotechnology, 31(6):533–538, 2013. doi: 10.1038/nbt.2579. URL https://doi.org/10.1038/nbt.2579. [162] Luisa W. Hugerth, John Larsson, Johannes Alneberg, Markus V. Lindh, Catherine Legrand, Jarone Pinhassi, and Anders F. Andersson. Metagenome-assembled genomes uncover doi: 10.1186/ a global brackish microbiome. Genome Biology, 16(1):279, 2015. s13059-015-0834-7. URL https://doi.org/10.1186/s13059-015-0834-7. [163] Robert Schmieder and Robert Edwards. Quality control and preprocessing of metage- doi: nomic datasets. Bioinformatics, 27(6):863–864, 01 2011. ISSN 1367-4803. 10.1093/bioinformatics/btr026. URL https://doi.org/10.1093/bioinformatics/btr026. [164] Anthony M. Bolger, Marc Lohse, and Bjoern Usadel. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15):2114–2120, 04 2014. ISSN 1367-4803. doi: 10.1093/bioinformatics/btu170. URL https://doi.org/10.1093/bioinformatics/btu170. [165] Simon Andrews et al. FastQC: a quality control tool for high throughput sequence data, 2010. 159 [166] Wei Shen, Shuai Le, Yan Li, and Fuquan Hu. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE, 11(10):1–10, 10 2016. doi: 10.1371/journal. pone.0163962. URL https://doi.org/10.1371/journal.pone.0163962. [167] Marcel Martin. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1):10–12, 2011. ISSN 2226-6089. doi: 10.14806/ej.17.1.200. URL https://journal.embnet.org/index.php/embnetjournal/article/view/200. [168] Brian Bushnell, Jonathan Rood, and Esther Singer. Bbmerge – accurate paired shotgun read merging via overlap. PLOS ONE, 12(10):1–15, 10 2017. doi: 10.1371/journal.pone. 0185056. URL https://doi.org/10.1371/journal.pone.0185056. 
[169] NA Joshi, JN Fass, et al. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files (version 1.33) [software], 2011. URL https://github.com/najoshi/sickle.
[170] Alla Mikheenko, Vladislav Saveliev, and Alexey Gurevich. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics, 32(7):1088–1090, 11 2015. ISSN 1367-4803. doi: 10.1093/bioinformatics/btv697. URL https://doi.org/10.1093/bioinformatics/btv697.
[171] Donovan H. Parks, Michael Imelfort, Connor T. Skennerton, Philip Hugenholtz, and Gene W. Tyson. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25(7):1043–1055, 2015. doi: 10.1101/gr.186072.114. URL http://genome.cshlp.org/content/25/7/1043.abstract.
[172] Fernando Meyer, Peter Hofmann, Peter Belmann, Ruben Garrido-Oter, Adrian Fritz, Alexander Sczyrba, and Alice C McHardy. AMBER: Assessment of Metagenome BinnERs. GigaScience, 7(6), 06 2018. ISSN 2047-217X. doi: 10.1093/gigascience/giy069. URL https://doi.org/10.1093/gigascience/giy069. giy069.
[173] Rayan Chikhi and Guillaume Rizk. Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms for Molecular Biology, 8(1):22, 2013. doi: 10.1186/1748-7188-8-22. URL https://doi.org/10.1186/1748-7188-8-22.
[174] Toshiaki Namiki, Tsuyoshi Hachiya, Hideaki Tanaka, and Yasubumi Sakakibara. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Research, 40(20):e155–e155, 07 2012. ISSN 0305-1048. doi: 10.1093/nar/gks678. URL https://doi.org/10.1093/nar/gks678.
[175] Afiahayati, Kengo Sato, and Yasubumi Sakakibara. MetaVelvet-SL: an extension of the Velvet assembler to a de novo metagenomic assembler utilizing supervised learning. DNA Research, 22(1):69–77, 11 2014. ISSN 1340-2838. doi: 10.1093/dnares/dsu041. URL https://doi.org/10.1093/dnares/dsu041.
[176] Sébastien Boisvert, Frédéric Raymond, Élénie Godzaridis, François Laviolette, and Jacques Corbeil. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biology, 13(12):R122, 2012. doi: 10.1186/gb-2012-13-12-r122. URL https://doi.org/10.1186/gb-2012-13-12-r122.
[177] Yu Peng, Henry C. M. Leung, S. M. Yiu, and Francis Y. L. Chin. IDBA – A Practical Iterative de Bruijn Graph De Novo Assembler. In Bonnie Berger, editor, Research in Computational Molecular Biology, pages 426–440, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg. ISBN 978-3-642-12683-3.
[178] Yu Peng, Henry C. M. Leung, S. M. Yiu, and Francis Y. L. Chin. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28(11):1420–1428, 04 2012. ISSN 1367-4803. doi: 10.1093/bioinformatics/bts174. URL https://doi.org/10.1093/bioinformatics/bts174.
[179] Sergey Nurk, Dmitry Meleshko, Anton Korobeynikov, and Pavel A. Pevzner. metaSPAdes: a new versatile metagenomic assembler. Genome Research, 27(5):824–834, 2017. doi: 10.1101/gr.213959.116. URL http://genome.cshlp.org/content/27/5/824.abstract.
[180] Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 31(10):1674–1676, 01 2015. ISSN 1367-4803. doi: 10.1093/bioinformatics/btv033. URL https://doi.org/10.1093/bioinformatics/btv033.
[181] Todd J. Treangen, Sergey Koren, Daniel D. Sommer, Bo Liu, Irina Astrovskaya, Brian Ondov, Aaron E. Darling, Adam M. Phillippy, and Mihai Pop. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biology, 14(1):R2, 2013. doi: 10.1186/gb-2013-14-1-r2. URL https://doi.org/10.1186/gb-2013-14-1-r2.
[182] Matthew Scholz, Chien-Chi Lo, and Patrick S. G. Chain. Improved Assemblies Using a Source-Agnostic Pipeline for MetaGenomic Assembly by Merging (MeGAMerge) of Contigs. Scientific Reports, 4(1):6480, 2014. doi: 10.1038/srep06480. URL https://doi.org/10.1038/srep06480.
[183] Riccardo Vicedomini, Francesco Vezzi, Simone Scalabrin, Lars Arvestad, and Alberto Policriti. GAM-NGS: genomic assemblies merger for next generation sequencing. BMC Bioinformatics, 14(7):S6, 2013. doi: 10.1186/1471-2105-14-S7-S6. URL https://doi.org/10.1186/1471-2105-14-S7-S6.
[184] Wolfgang Gerlach and Jens Stoye. Taxonomic classification of metagenomic shotgun sequences with CARMA3. Nucleic Acids Research, 39(14):e91–e91, 05 2011. ISSN 0305-1048. doi: 10.1093/nar/gkr225. URL https://doi.org/10.1093/nar/gkr225.
[185] Bo Liu, Theodore Gibbons, Mohammad Ghodsi, and Mihai Pop. MetaPhyler: Taxonomic profiling for metagenomic sequences. In 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 95–100, 2010. doi: 10.1109/BIBM.2010.5706544.
[186] Monzoorul Haque Mohammed, Tarini Shankar Ghosh, Nitin Kumar Singh, and Sharmila S. Mande. SPHINX—an algorithm for taxonomic binning of metagenomic sequences. Bioinformatics, 27(1):22–30, 10 2010. ISSN 1367-4803. doi: 10.1093/bioinformatics/btq608. URL https://doi.org/10.1093/bioinformatics/btq608.
[187] Ivan Gregor, Johannes Dröge, Melanie Schirmer, Christopher Quince, and Alice C. McHardy. PhyloPythiaS+: a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes. PeerJ, 4:e1603, February 2016. ISSN 2167-8359. doi: 10.7717/peerj.1603. URL https://doi.org/10.7717/peerj.1603.
[188] Arthur Brady and Steven L Salzberg. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nature Methods, 6(9):673–676, 2009. doi: 10.1038/nmeth.1358. URL https://doi.org/10.1038/nmeth.1358.
[189] Folker Meyer, Saurabh Bagchi, Somali Chaterji, Wolfgang Gerlach, Ananth Grama, Travis Harrison, Tobias Paczian, William L Trimble, and Andreas Wilke. MG-RAST version 4—lessons learned from a decade of low-budget ultra-high-throughput metagenome analysis. Briefings in Bioinformatics, 20(4):1151–1159, 09 2017. ISSN 1477-4054. doi: 10.1093/bib/bbx105. URL https://doi.org/10.1093/bib/bbx105.
[190] Daniel H. Huson, Sina Beier, Isabell Flade, Anna Górska, Mohamed El-Hadidi, Suparna Mitra, Hans-Joachim Ruscheweyh, and Rewati Tappu. MEGAN Community Edition - Interactive Exploration and Analysis of Large-Scale Microbiome Sequencing Data. PLOS Computational Biology, 12(6):1–12, 06 2016. doi: 10.1371/journal.pcbi.1004957. URL https://doi.org/10.1371/journal.pcbi.1004957.
[191] Naryttza N. Diaz, Lutz Krause, Alexander Goesmann, Karsten Niehaus, and Tim W. Nattkemper. TACOA – Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics, 10(1):56, 2009. doi: 10.1186/1471-2105-10-56. URL https://doi.org/10.1186/1471-2105-10-56.
[192] I-Min A Chen, Ken Chu, Krishna Palaniappan, Manoj Pillay, Anna Ratner, Jinghua Huang, Marcel Huntemann, Neha Varghese, James R White, Rekha Seshadri, Tatyana Smirnova, Edward Kirton, Sean P Jungbluth, Tanja Woyke, Emiley A Eloe-Fadrosh, Natalia N Ivanova, and Nikos C Kyrpides. IMG/M v.5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes. Nucleic Acids Research, 47(D1):D666–D677, 10 2018. ISSN 0305-1048. doi: 10.1093/nar/gky901. URL https://doi.org/10.1093/nar/gky901.
[193] Marc Strous, Beate Kraft, Regina Bisdorf, and Halina E Tegetmeyer. The binning of metagenomic contigs for microbial physiology of mixed cultures. Frontiers in Microbiology, 3:410, 2012. ISSN 1664-302X. doi: 10.3389/fmicb.2012.00410. URL https://www.frontiersin.org/articles/10.3389/fmicb.2012.00410.
[194] Yu-Wei Wu and Yuzhen Ye. A Novel Abundance-Based Algorithm for Binning Metagenomic Sequences Using l-tuples. Journal of Computational Biology, 18(3):523–534, 2011. doi: 10.1089/cmb.2010.0245. URL https://doi.org/10.1089/cmb.2010.0245. PMID: 21385052.
[195] Andrey Kislyuk, Srijak Bhatnagar, Jonathan Dushoff, and Joshua S. Weitz. Unsupervised statistical clustering of environmental shotgun sequences. BMC Bioinformatics, 10(1):316, 2009. doi: 10.1186/1471-2105-10-316. URL https://doi.org/10.1186/1471-2105-10-316.
[196] Ying Wang, Haiyan Hu, and Xiaoman Li. MBBC: an efficient approach for metagenomic binning based on clustering. BMC Bioinformatics, 16(1):36, 2015. doi: 10.1186/s12859-015-0473-8. URL https://doi.org/10.1186/s12859-015-0473-8.
[197] H Bjørn Nielsen, Mathieu Almeida, Agnieszka Sierakowska Juncker, Simon Rasmussen, Junhua Li, Shinichi Sunagawa, Damian R Plichta, Laurent Gautier, Anders G Pedersen, Emmanuelle Le Chatelier, Eric Pelletier, Ida Bonde, Trine Nielsen, Chaysavanh Manichanh, Manimozhiyan Arumugam, Jean-Michel Batto, Marcelo B Quintanilha dos Santos, Nikolaj Blom, Natalia Borruel, Kristoffer S Burgdorf, Fouad Boumezbeur, Francesc Casellas, Joël Doré, Piotr Dworzynski, Francisco Guarner, Torben Hansen, Falk Hildebrand, Rolf S Kaas, Sean Kennedy, Karsten Kristiansen, Jens Roat Kultima, Pierre Léonard, Florence Levenez, Ole Lund, Bouziane Moumen, Denis Le Paslier, Nicolas Pons, Oluf Pedersen, Edi Prifti, Junjie Qin, Jeroen Raes, Søren Sørensen, Julien Tap, Sebastian Tims, David W Ussery, Takuji Yamada, Agnieszka S Juncker, Pierre Leonard, Pierre Renault, Thomas Sicheritz-Ponten, Peer Bork, Jun Wang, Søren Brunak, S Dusko Ehrlich, Alexandre Jamet, Alexandre Mérieux, Antonella Cultrone, Antonio Torrejon, Benoit Quinquis, Christian Brechot, Christine Delorme, Christine M’Rini, Willem M de Vos, Emmanuelle Maguin, Encarna Varela, Eric Guedon, Falony Gwen, Florence Haimet, François Artiguenave, Gaetana Vandemeulebrouck, Gérard Denariaz, Ghalia Khaci, Hervé Blottière, Jan Knol, Jean Weissenbach, Johan E T van Hylckama Vlieg, Jørgensen Torben, Julian Parkhill, Keith Turner, Maarten van de Guchte, Maria Antolin, Maria Rescigno, Michiel Kleerebezem, Muriel Derrien, Nathalie Galleron, Nicolas Sanchez, Niels Grarup, Patrick Veiga, Raish Oozeer, Rozenn Dervyn, Séverine Layec, Thomas Bruls, Yohanan Winogradski, Zoetendal Erwin G, and MetaHIT Consortium. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nature Biotechnology, 32(8):822–828, 2014. doi: 10.1038/nbt.2939. URL https://doi.org/10.1038/nbt.2939.
[198] Yi Wang, Henry CM Leung, Siu-Ming Yiu, and Francis YL Chin. MetaCluster 4.0: a novel binning algorithm for NGS reads and huge number of species. Journal of Computational Biology, 19(2):241–249, 2012. doi: 10.1089/cmb.2011.0276. URL https://doi.org/10.1089/cmb.2011.0276. PMID: 22300323.
[199] Yu-Wei Wu, Blake A. Simmons, and Steven W. Singer. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics, 32(4):605–607, 10 2015. ISSN 1367-4803. doi: 10.1093/bioinformatics/btv638. URL https://doi.org/10.1093/bioinformatics/btv638.
[200] Dongwan D Kang, Feng Li, Edward Kirton, Ashleigh Thomas, Rob Egan, Hong An, and Zhong Wang. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ, 7:e7359, 2019. ISSN 2167-8359. doi: 10.7717/peerj.7359. URL https://doi.org/10.7717/peerj.7359.
[201] Johannes Alneberg, Brynjar Smári Bjarnason, Ino de Bruijn, Melanie Schirmer, Joshua Quick, Umer Z Ijaz, Leo Lahti, Nicholas J Loman, Anders F Andersson, and Christopher Quince. Binning metagenomic contigs by coverage and composition. Nature Methods, 11(11):1144–1146, 2014. doi: 10.1038/nmeth.3103. URL https://doi.org/10.1038/nmeth.3103.
[202] Yang Young Lu, Ting Chen, Jed A Fuhrman, and Fengzhu Sun. COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment and paired-end read LinkAge. Bioinformatics, 33(6):791–798, 06 2016. ISSN 1367-4803. doi: 10.1093/bioinformatics/btw290. URL https://doi.org/10.1093/bioinformatics/btw290.
[203] David R. Kelley and Steven L. Salzberg. Clustering metagenomic sequences with interpolated Markov models. BMC Bioinformatics, 11(1):544, 2010. doi: 10.1186/1471-2105-11-544. URL https://doi.org/10.1186/1471-2105-11-544.
[204] Sourav Chatterji, Ichitaro Yamazaki, Zhaojun Bai, and Jonathan A Eisen. CompostBin: A DNA composition-based algorithm for binning environmental shotgun reads. In Annual International Conference on Research in Computational Molecular Biology, pages 17–28, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
[205] Wei-Zhi Song and Torsten Thomas. Binning_refiner: improving genome bins through the combination of different binning programs. Bioinformatics, 33(12):1873–1875, 02 2017. ISSN 1367-4803. doi: 10.1093/bioinformatics/btx086. URL https://doi.org/10.1093/bioinformatics/btx086.
[206] Christian M. K. Sieber, Alexander J. Probst, Allison Sharrar, Brian C. Thomas, Matthias Hess, Susannah G. Tringe, and Jillian F. Banfield. Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nature Microbiology, 3(7):836–843, 2018. doi: 10.1038/s41564-018-0171-1. URL https://doi.org/10.1038/s41564-018-0171-1.
[207] Bertjan Broeksema, Magdalena Calusinska, Fintan McGee, Klaas Winter, Francesco Bongiovanni, Xavier Goux, Paul Wilmes, Philippe Delfosse, and Mohammad Ghoniem. ICoVeR – an interactive visualization tool for verification and refinement of metagenomic bins. BMC Bioinformatics, 18(1):233, 2017. doi: 10.1186/s12859-017-1653-5. URL https://doi.org/10.1186/s12859-017-1653-5.
[208] Gherman V. Uritskiy, Jocelyne DiRuggiero, and James Taylor. MetaWRAP—a flexible pipeline for genome-resolved metagenomic data analysis. Microbiome, 6(1):158, 2018. doi: 10.1186/s40168-018-0541-1. URL https://doi.org/10.1186/s40168-018-0541-1.
[209] Wenhan Zhu, Alexandre Lomsadze, and Mark Borodovsky. Ab initio gene identification in metagenomic sequences. Nucleic Acids Research, 38(12):e132–e132, 04 2010. ISSN 0305-1048. doi: 10.1093/nar/gkq275. URL https://doi.org/10.1093/nar/gkq275.
[210] Hideki Noguchi, Takeaki Taniguchi, and Takehiko Itoh. MetaGeneAnnotator: Detecting Species-Specific Patterns of Ribosomal Binding Site for Precise Gene Prediction in Anonymous Prokaryotic and Phage Genomes. DNA Research, 15(6):387–396, 10 2008. ISSN 1340-2838. doi: 10.1093/dnares/dsn027. URL https://doi.org/10.1093/dnares/dsn027.
[211] David R. Kelley, Bo Liu, Arthur L. Delcher, Mihai Pop, and Steven L. Salzberg. Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Research, 40(1):e9–e9, 11 2011. ISSN 0305-1048. doi: 10.1093/nar/gkr1067. URL https://doi.org/10.1093/nar/gkr1067.
[212] Marcel Huntemann, Natalia N. Ivanova, Konstantinos Mavromatis, H. James Tripp, David Paez-Espino, Kristin Tennessen, Krishnaveni Palaniappan, Ernest Szeto, Manoj Pillay, I-Min A. Chen, Amrita Pati, Torben Nielsen, Victor M. Markowitz, and Nikos C. Kyrpides. The standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (MAP v.4). Standards in Genomic Sciences, 11(1):17, 2016. doi: 10.1186/s40793-016-0138-x. URL https://doi.org/10.1186/s40793-016-0138-x.
[213] Doug Hyatt, Philip F. LoCascio, Loren J. Hauser, and Edward C. Uberbacher. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics, 28(17):2223–2230, 07 2012. ISSN 1367-4803. doi: 10.1093/bioinformatics/bts429. URL https://doi.org/10.1093/bioinformatics/bts429.
[214] Mina Rho, Haixu Tang, and Yuzhen Ye. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Research, 38(20):e191–e191, 08 2010. ISSN 0305-1048. doi: 10.1093/nar/gkq747. URL https://doi.org/10.1093/nar/gkq747.
[215] Vardges Ter-Hovhannisyan, Alexandre Lomsadze, Yury O. Chernoff, and Mark Borodovsky. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Research, 18(12):1979–1990, 2008. doi: 10.1101/gr.081612.108. URL http://genome.cshlp.org/content/18/12/1979.abstract.
[216] Alexandre Lomsadze, Vardges Ter-Hovhannisyan, Yury O. Chernoff, and Mark Borodovsky. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research, 33(20):6494–6506, 01 2005. ISSN 0305-1048. doi: 10.1093/nar/gki937. URL https://doi.org/10.1093/nar/gki937.
[217] Mario Stanke and Burkhard Morgenstern. Augustus: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Research, 33(suppl_2):W465–W467, 07 2005. ISSN 0305-1048. doi: 10.1093/nar/gki458. URL https://doi.org/10.1093/nar/gki458.
[218] A Souvorov, Y Kapustin, B Kiryutin, V Chetvernin, T Tatusova, and D Lipman. Gnomon–NCBI eukaryotic gene prediction tool. National Center for Biotechnology Information, pages 1–24, 2010.
[219] Ian Korf. Gene finding in novel genomes. BMC Bioinformatics, 5:1–9, 2004.
[220] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990. ISSN 0022-2836. doi: 10.1016/S0022-2836(05)80360-2. URL https://www.sciencedirect.com/science/article/pii/S0022283605803602.
[221] Robert D. Finn, Alex Bateman, Jody Clements, Penelope Coggill, Ruth Y. Eberhardt, Sean R. Eddy, Andreas Heger, Kirstie Hetherington, Liisa Holm, Jaina Mistry, Erik L. L. Sonnhammer, John Tate, and Marco Punta. Pfam: the protein families database. Nucleic Acids Research, 42(D1):D222–D230, 11 2013. ISSN 0305-1048. doi: 10.1093/nar/gkt1223. URL https://doi.org/10.1093/nar/gkt1223.
[222] Sarah Hunter, Rolf Apweiler, Teresa K Attwood, Amos Bairoch, Alex Bateman, David Binns, Peer Bork, Ujjwal Das, Louise Daugherty, Lauranne Duquenne, et al. InterPro: the integrative protein signature database. Nucleic Acids Research, 37(suppl_1):D211–D215, 10 2008. ISSN 0305-1048. doi: 10.1093/nar/gkn785. URL https://doi.org/10.1093/nar/gkn785.
[223] Clotilde Claudel-Renard, Claude Chevalet, Thomas Faraut, and Daniel Kahn. Enzyme-specific profiles for genome annotation: PRIAM. Nucleic Acids Research, 31(22):6633–6639, 11 2003. ISSN 0305-1048. doi: 10.1093/nar/gkg847. URL https://doi.org/10.1093/nar/gkg847.
[224] Minoru Kanehisa, Susumu Goto, Yoko Sato, Masayuki Kawashima, Miho Furumichi, and Mao Tanabe. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Research, 42(D1):D199–D205, 11 2013. ISSN 0305-1048. doi: 10.1093/nar/gkt1076. URL https://doi.org/10.1093/nar/gkt1076.
[225] Peter D. Karp, Monica Riley, Suzanne M. Paley, and Alida Pellegrini-Toole. The MetaCyc Database. Nucleic Acids Research, 30(1):59–61, 01 2002. ISSN 0305-1048. doi: 10.1093/nar/30.1.59. URL https://doi.org/10.1093/nar/30.1.59.
[226] Scott Federhen. The NCBI Taxonomy database. Nucleic Acids Research, 40(D1):D136–D143, 12 2011. ISSN 0305-1048. doi: 10.1093/nar/gkr1178. URL https://doi.org/10.1093/nar/gkr1178.
[227] Torsten Seemann. Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14):2068–2069, 03 2014. ISSN 1367-4803. doi: 10.1093/bioinformatics/btu153. URL https://doi.org/10.1093/bioinformatics/btu153.
[228] Yasuhiro Tanizawa, Takatomo Fujisawa, and Yasukazu Nakamura. DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication. Bioinformatics, 34(6):1037–1039, 11 2017. ISSN 1367-4803. doi: 10.1093/bioinformatics/btx713. URL https://doi.org/10.1093/bioinformatics/btx713.
[229] Tatiana Tatusova, Michael DiCuccio, Azat Badretdin, Vyacheslav Chetvernin, Eric P. Nawrocki, Leonid Zaslavsky, Alexandre Lomsadze, Kim D. Pruitt, Mark Borodovsky, and James Ostell. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Research, 44(14):6614–6624, 06 2016. ISSN 0305-1048. doi: 10.1093/nar/gkw569. URL https://doi.org/10.1093/nar/gkw569.
[230] Xiaoli Dong and Marc Strous. An integrated pipeline for annotation and visualization of metagenomic contigs. Frontiers in Genetics, 10:999, 2019. ISSN 1664-8021. doi: 10.3389/fgene.2019.00999. URL https://www.frontiersin.org/articles/10.3389/fgene.2019.00999.
[231] Benjamin Buchfink, Chao Xie, and Daniel H Huson. Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1):59–60, 2015. doi: 10.1038/nmeth.3176. URL https://doi.org/10.1038/nmeth.3176.
[232] Robert D Stewart, Marc D Auffret, Timothy J Snelling, Rainer Roehe, and Mick Watson. MAGpy: a reproducible pipeline for the downstream analysis of metagenome-assembled genomes (MAGs). Bioinformatics, 35(12):2150–2152, 11 2018. ISSN 1367-4803. doi: 10.1093/bioinformatics/bty905. URL https://doi.org/10.1093/bioinformatics/bty905.
[233] Kevin P Keegan, Elizabeth M Glass, and Folker Meyer. MG-RAST, a metagenomics service for analysis of microbial community structure and function. In Microbial Environmental Genomics (MEG), pages 207–233. Springer, New York, NY, 2016. ISBN 978-1-4939-3369-3. doi: 10.1007/978-1-4939-3369-3_13. URL https://doi.org/10.1007/978-1-4939-3369-3_13.
[234] Moreno Zolfo, Adrian Tett, Olivier Jousson, Claudio Donati, and Nicola Segata. MetaMLST: multi-locus strain-level bacterial typing from metagenomic samples. Nucleic Acids Research, 45(2):e7–e7, 09 2016. ISSN 0305-1048. doi: 10.1093/nar/gkw837. URL https://doi.org/10.1093/nar/gkw837.
[235] Duy Tin Truong, Adrian Tett, Edoardo Pasolli, Curtis Huttenhower, and Nicola Segata. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Research, 27(4):626–638, 2017. doi: 10.1101/gr.216242.116. URL http://genome.cshlp.org/content/27/4/626.abstract.
[236] Christopher Quince, Tom O. Delmont, Sébastien Raguideau, Johannes Alneberg, Aaron E. Darling, Gavin Collins, and A. Murat Eren. DESMAN: a new tool for de novo extraction of strains from metagenomes. Genome Biology, 18(1):181, 2017. doi: 10.1186/s13059-017-1309-9. URL https://doi.org/10.1186/s13059-017-1309-9.
[237] Paul Igor Costea, Robin Munch, Luis Pedro Coelho, Lucas Paoli, Shinichi Sunagawa, and Peer Bork. metaSNV: A tool for metagenomic strain level analysis. PLOS ONE, 12(7):1–9, 07 2017. doi: 10.1371/journal.pone.0182392. URL https://doi.org/10.1371/journal.pone.0182392.
[238] Robert C. Edgar. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19):2460–2461, 08 2010. ISSN 1367-4803. doi: 10.1093/bioinformatics/btq461. URL https://doi.org/10.1093/bioinformatics/btq461.
[239] W. James Kent. BLAT—The BLAST-Like Alignment Tool. Genome Research, 12(4):656–664, 2002. doi: 10.1101/gr.229202. URL http://genome.cshlp.org/content/12/4/656.abstract.
[240] Yongan Zhao, Haixu Tang, and Yuzhen Ye. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics, 28(1):125–126, 10 2011. ISSN 1367-4803. doi: 10.1093/bioinformatics/btr595. URL https://doi.org/10.1093/bioinformatics/btr595.
[241] Anthony Westbrook, Jordan Ramsdell, Taruna Schuelke, Louisa Normington, R Daniel Bergeron, W Kelley Thomas, and Matthew D MacManes. PALADIN: protein alignment for functional profiling whole metagenome shotgun data. Bioinformatics, 33(10):1473–1478, 01 2017. ISSN 1367-4803. doi: 10.1093/bioinformatics/btx021. URL https://doi.org/10.1093/bioinformatics/btx021.
[242] Cuncong Zhong, Youngik Yang, and Shibu Yooseph. GRASP2: fast and memory-efficient gene-centric assembly and homolog search for metagenomic sequencing data. BMC Bioinformatics, 20(11):276, 2019. doi: 10.1186/s12859-019-2818-1. URL https://doi.org/10.1186/s12859-019-2818-1.
[243] Fatemeh Sharifi and Yuzhen Ye. From gene annotation to function prediction for metagenomics. In Protein Function Prediction, pages 27–34. Springer, New York, NY, 2017. ISBN 978-1-4939-7015-5. doi: 10.1007/978-1-4939-7015-5_3. URL https://doi.org/10.1007/978-1-4939-7015-5_3.
[244] Jens Roat Kultima, Luis Pedro Coelho, Kristoffer Forslund, Jaime Huerta-Cepas, Simone S. Li, Marja Driessen, Anita Yvonne Voigt, Georg Zeller, Shinichi Sunagawa, and Peer Bork. MOCAT2: a metagenomic assembly, annotation and profiling framework. Bioinformatics, 32(16):2520–2523, 04 2016. ISSN 1367-4803. doi: 10.1093/bioinformatics/btw183. URL https://doi.org/10.1093/bioinformatics/btw183.
[245] Stuart M Brown, Hao Chen, Yuhan Hao, Bobby P Laungani, Thahmina A Ali, Changsu Dong, Carlos Lijeron, Baekdoo Kim, Claudia Wultsch, Zhiheng Pei, and Konstantinos Krampis. MGS-Fast: Metagenomic shotgun data fast annotation using microbial gene catalogs. GigaScience, 8(4), 04 2019. ISSN 2047-217X. doi: 10.1093/gigascience/giz020. URL https://doi.org/10.1093/gigascience/giz020. giz020.
[246] Eric A. Franzosa, Lauren J. McIver, Gholamali Rahnavard, Luke R. Thompson, Melanie Schirmer, George Weingart, Karen Schwarzberg Lipson, Rob Knight, J. Gregory Caporaso, Nicola Segata, and Curtis Huttenhower. Species-level functional profiling of metagenomes and metatranscriptomes. Nature Methods, 15(11):962–968, 2018. doi: 10.1038/s41592-018-0176-y. URL https://doi.org/10.1038/s41592-018-0176-y.
[247] Gustavo Arango-Argoty, Gargi Singh, Lenwood S. Heath, Amy Pruden, Weidong Xiao, and Liqing Zhang. MetaStorm: A Public Resource for Customizable Metagenomics Annotation. PLOS ONE, 11(9):1–13, 09 2016. doi: 10.1371/journal.pone.0162442. URL https://doi.org/10.1371/journal.pone.0162442.
[248] Derrick E. Wood and Steven L. Salzberg. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3):R46, 2014. doi: 10.1186/gb-2014-15-3-r46. URL https://doi.org/10.1186/gb-2014-15-3-r46.
[249] Rachid Ounit, Steve Wanamaker, Timothy J. Close, and Stefano Lonardi. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics, 16(1):236, 2015. doi: 10.1186/s12864-015-1419-2. URL https://doi.org/10.1186/s12864-015-1419-2.
[250] Derrick E. Wood, Jennifer Lu, and Ben Langmead. Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1):257, 2019. doi: 10.1186/s13059-019-1891-0. URL https://doi.org/10.1186/s13059-019-1891-0.
[251] Steven Flygare, Keith Simmon, Chase Miller, Yi Qiao, Brett Kennedy, Tonya Di Sera, Erin H. Graf, Keith D. Tardif, Aurélie Kapusta, Shawn Rynearson, Chris Stockmann, Krista Queen, Suxiang Tong, Karl V. Voelkerding, Anne Blaschke, Carrie L. Byington, Seema Jain, Andrew Pavia, Krow Ampofo, Karen Eilbeck, Gabor Marth, Mark Yandell, and Robert Schlaberg. Taxonomer: an interactive metagenomics analysis portal for universal pathogen detection and host mRNA expression profiling. Genome Biology, 17(1):111, 2016. doi: 10.1186/s13059-016-0969-1. URL https://doi.org/10.1186/s13059-016-0969-1.
[252] Daehwan Kim, Li Song, Florian P. Breitwieser, and Steven L. Salzberg. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Research, 26(12):1721–1729, 2016. doi: 10.1101/gr.210641.116. URL http://genome.cshlp.org/content/26/12/1721.abstract.
[253] Peter Menzel, Kim Lee Ng, and Anders Krogh. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications, 7(1):11257, 2016. doi: 10.1038/ncomms11257. URL https://doi.org/10.1038/ncomms11257.
[254] André Corvelo, Wayne E. Clarke, Nicolas Robine, and Michael C. Zody. taxMaps: comprehensive and highly accurate taxonomic classification of short-read data in reasonable time. Genome Research, 28(5):751–758, 2018. doi: 10.1101/gr.225276.117. URL http://genome.cshlp.org/content/28/5/751.abstract.
[255] J. Dröge, I. Gregor, and A. C. McHardy. Taxator-tk: precise taxonomic assignment of metagenomes by fast approximation of evolutionary neighborhoods. Bioinformatics, 31(6):817–824, 11 2014. ISSN 1367-4803. doi: 10.1093/bioinformatics/btu745. URL https://doi.org/10.1093/bioinformatics/btu745.
[256] Nicola Segata, Levi Waldron, Annalisa Ballarini, Vagheesh Narasimhan, Olivier Jousson, and Curtis Huttenhower. Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Methods, 9(8):811–814, 2012. doi: 10.1038/nmeth.2066. URL https://doi.org/10.1038/nmeth.2066.
[257] Duy Tin Truong, Eric A Franzosa, Timothy L Tickle, Matthias Scholz, George Weingart, Edoardo Pasolli, Adrian Tett, Curtis Huttenhower, and Nicola Segata. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nature Methods, 12(10):902–903, 2015. doi: 10.1038/nmeth.3589. URL https://doi.org/10.1038/nmeth.3589.
[258] Weizhi Song, Kerrin Steensen, and Torsten Thomas. HgtSIM: a simulator for horizontal gene transfer (HGT) in microbial communities. PeerJ, 5:e4015, November 2017. ISSN 2167-8359. doi: 10.7717/peerj.4015. URL https://doi.org/10.7717/peerj.4015.
[259] Benjamin Vernot, Serena Tucci, Janet Kelso, Joshua G. Schraiber, Aaron B. Wolf, Rachel M. Gittelman, Michael Dannemann, Steffi Grote, Rajiv C. McCoy, Heather Norton, Laura B. Scheinfeldt, David A. Merriwether, George Koki, Jonathan S. Friedlaender, Jon Wakefield, Svante Pääbo, and Joshua M. Akey. Excavating Neandertal and Denisovan DNA from the genomes of Melanesian individuals. Science, 352(6282):235–239, 2016. doi: 10.1126/science.aad9416. URL https://www.science.org/doi/abs/10.1126/science.aad9416.
[260] Fernando Racimo, Davide Marnetto, and Emilia Huerta-Sánchez. Signatures of Archaic Adaptive Introgression in Present-Day Human Populations. Molecular Biology and Evolution, 34(2):296–317, 08 2016. ISSN 0737-4038. doi: 10.1093/molbev/msw216. URL https://doi.org/10.1093/molbev/msw216.
[261] Derek Setter, Sylvain Mousset, Xiaoheng Cheng, Rasmus Nielsen, Michael DeGiorgio, and Joachim Hermisson. VolcanoFinder: Genomic scans for adaptive introgression. PLOS Genetics, 16(6):1–44, 06 2020. doi: 10.1371/journal.pgen.1008867. URL https://doi.org/10.1371/journal.pgen.1008867.
[262] Graham Gower, Pablo Iáñez Picazo, Matteo Fumagalli, and Fernando Racimo. Detecting adaptive introgression in human evolution using convolutional neural networks. eLife, 10:e64669, May 2021. ISSN 2050-084X. doi: 10.7554/eLife.64669. URL https://doi.org/10.7554/eLife.64669.
[263] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

APPENDIX A
THE COMPARISON OF INTROGRESSION PATTERNS BETWEEN PHYLONET-HMM AND PHIMM ON EACH CHROMOSOME OF MOUSE DATA.

Figure A.1 Comparison of introgression patterns inferred by (A) PhyloNet-HMM and (B) PHiMM on chromosome 1. Positions of introgressed tracts along the genome are given in megabases (Mb). Results are reported for introgression from the donor M. spretus samples into the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown as two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure is from Wuyun et al. [122].

Figure A.2 Comparison of introgression patterns inferred by (A) PhyloNet-HMM and (B) PHiMM on chromosome 2. Positions of introgressed tracts along the genome are given in megabases (Mb). Results are reported for introgression from the donor M. spretus samples into the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown as two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure is from Wuyun et al. [122].
172 (A) PhyloNet-HMM (B) PHiMM Figure A.3 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 3. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the above bar is haplotype 1, while the below represents haplotype 2). This figure comes from Wuyun et al. [122]. 173 (A) PhyloNet-HMM (B) PHiMM Figure A.4 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 4. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the above bar is haplotype 1, while the below represents haplotype 2). This figure comes from Wuyun et al. [122]. 174 (A) PhyloNet-HMM (B) PHiMM Figure A.5 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 5. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the above bar is haplotype 1, while the below represents haplotype 2). This figure comes from Wuyun et al. [122]. 175 (A) PhyloNet-HMM (B) PHiMM Figure A.6 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 6. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. 
domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.7 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 7. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.8 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 8. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.9 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 9. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.10 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 10. Introgressed tracts along the genome are shown in megabases (Mb).
Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.11 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 11. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.12 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 12. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.13 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 13. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].
Figure A.14 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 14. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.15 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 15. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.16 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 16. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.17 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 17. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m.
domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.18 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 18. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.19 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome 19. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

Figure A.20 The comparison of introgression patterns between (A) PhyloNet-HMM and (B) PHiMM on chromosome X. Introgressed tracts along the genome are shown in megabases (Mb). Results are reported for the introgression from donor M. spretus samples to the recipient M. m. domesticus samples (two Spanish mice and six German mice). Each M. m. domesticus sample is shown with two haploid genomes (the upper bar is haplotype 1 and the lower bar is haplotype 2). This figure comes from Wuyun et al. [122].

APPENDIX B

THE INTROGRESSION PATTERNS OF DACS ON DARWIN’S FINCH DATA.

Figure B.1 The introgression patterns of DACS on the 51st to 100th largest scaffolds of Darwin’s finch data.
Introgressed tracts along the genome are shown in megabases (Mb).

Figure B.2 The introgression patterns of DACS on the 101st to 150th largest scaffolds of Darwin’s finch data. Introgressed tracts along the genome are shown in megabases (Mb).

Figure B.3 The introgression patterns of DACS on the 151st to 200th largest scaffolds of Darwin’s finch data. Introgressed tracts along the genome are shown in kilobases (Kb).

Figure B.4 The introgression patterns of DACS on the 201st to 250th largest scaffolds of Darwin’s finch data. Introgressed tracts along the genome are shown in kilobases (Kb).

Figure B.5 The introgression patterns of DACS on the 251st to 300th largest scaffolds of Darwin’s finch data. Introgressed tracts along the genome are shown in kilobases (Kb).

Figure B.6 The introgression patterns of DACS on the 301st to 350th largest scaffolds of Darwin’s finch data. Introgressed tracts along the genome are shown in kilobases (Kb).

Figure B.7 The introgression patterns of DACS on the 351st to 400th largest scaffolds of Darwin’s finch data. Introgressed tracts along the genome are shown in kilobases (Kb).

Figure B.8 The introgression patterns of DACS on the 401st to 469th largest scaffolds of Darwin’s finch data. Introgressed tracts along the genome are shown in kilobases (Kb).