NOVEL COMPUTATIONAL FRAMEWORKS FOR ANALYZING COMPLEX NON-TREE-LIKE EVOLUTION IN GENOMIC SEQUENCE DATA
Phylogenetics studies the evolutionary relationships among species, often represented as phylogenetic trees. However, traditional tree-like models fall short when dealing with interspecific gene flow, such as hybridization/introgression in eukaryotes and horizontal gene transfer (HGT) in prokaryotes, which necessitates the use of phylogenetic networks to capture complex, reticulate evolutionary histories. Although numerous methodologies have been developed to analyze non-tree-like evolution, the advent of high-throughput sequencing technologies has introduced two primary scalability challenges: the large number of taxa present in the data and the intricate and diverse gene flow among these taxa. These challenges have considerably constrained the scope and precision of non-tree-like phylogenetic inference and analysis. This dissertation addresses these limitations by introducing novel techniques for analyzing non-tree-like evolution in large-scale genomic studies. We developed PHiMM, an introgression detection tool that combines coalescent-based approximations with hidden Markov models (HMMs) to improve scalability and detection accuracy. Comparisons with state-of-the-art methods on both simulated and empirical datasets indicated that PHiMM significantly outperforms previous methods such as PhyloNet-HMM in runtime and memory usage while maintaining comparable inference accuracy. To further enhance PHiMM’s performance, we integrated it with the SERES resampling tool. Simulation experiments demonstrated that, under various model conditions, combining the SERES resampling approach with PHiMM substantially improves introgression inference accuracy compared to standalone PHiMM, although it results in longer runtimes. One major limitation of PHiMM is its requirement for a phylogenetic network as input to define the structure of the HMM.
To address this, we extended PHiMM to DACS by integrating phylogenetic network inference with introgression detection. To further improve the scalability and accuracy of introgression mapping on ultra-large datasets, we adopted divide-and-conquer and subsampling techniques, allowing us to efficiently handle the complexity of the data while maintaining accuracy. Moreover, we applied these approaches to metagenomic studies, where data often include thousands of species derived from complex microbial communities. After assembling the sequencing reads into metagenome-assembled genomes (MAGs), we employed DACS to identify reticulate evolutionary events, such as introgression or HGT, under challenging scenarios characterized by noise, incomplete data, and large numbers of taxa. By adapting our methods to accommodate the scale and complexity of metagenomic datasets, we provide a powerful framework for elucidating reticulate evolutionary histories in diverse microbial communities. These advancements provide deeper insights into genetic and biological processes and offer robust tools for a wide range of biological and medical applications.
- In Collections: Electronic Theses & Dissertations
- Copyright Status: In Copyright
- Material Type: Theses
- Authors: WUYUN, QIQIGE
- Thesis Advisors: Liu, Kevin
- Committee Members: Liu, Kevin; Dolson, Emily; Ghassemi, Mohammad; Heath-Heckman, Elizabeth
- Date Published: 2025
- Subjects: Computer science
- Program of Study: Computer Science - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: 204 pages
- Permalink: https://doi.org/doi:10.25335/zswk-q328