COMPUTATIONAL FRAMEWORKS FOR INDEL-AWARE EVOLUTIONARY ANALYSIS USING LARGE-SCALE GENOMIC SEQUENCE DATA By Wei Wang A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy 2021 ABSTRACT COMPUTATIONAL FRAMEWORKS FOR INDEL-AWARE EVOLUTIONARY ANALYSIS USING LARGE-SCALE GENOMIC SEQUENCE DATA By Wei Wang With the development of sequencing techniques, genetic sequencing data has been extensively used in evolutionary studies. The phylogenetic reconstruction problem, which is the reconstruction of evolutionary history from biomolecular sequences, is a fundamental problem. The evolutionary relationship between organisms is often represented by phylogeny, which is a tree or network representation. The most widely-used approach for reconstructing phylogenies from sequencing data involves two phases: multiple sequence alignment and phylogenetic reconstruction from the aligned sequences. As the amount of biomolecular sequence data increases, it has become a major challenge to develop efficient and accurate computational methods for phylogenetic analyses of large-scale sequencing data. Due to the complexity of the phylogenetic reconstruction problem in modern phylogenetic studies, the traditional sequence-based phylogenetic analysis methods involve many over-simplified assumptions. In this thesis, we describe our contribution in relaxing some of these over-simplified assumptions in the phylogenetic analysis. Insertion and deletion events, referred to as indels, carry much phylogenetic information but are often ignored in the reconstruction process of phylogenies. We take into account the indel uncer- tainties in multiple phylogenetic analyses by applying resampling and re-estimation. Another over- simplified assumption that we contributed to is adopted by many commonly used non-parametric algorithms for the resampling of biomolecular sequences, all sites in an MSA are evolved inde- pendently and identically distributed (i.i.d). Many evolution events, such as recombination and hybridization, may produce intra-sequence and functional dependence in biomolecular sequences that violate this assumption. We introduce SERES, a resampling algorithm for biomolecular se- quences that can produce resampled replicates that preserve the intra-sequence dependence. We describe the application of the SERES resampling and re-estimation approach to two classical problems: the multiple sequence alignment support estimation and recombination-aware local genealogical inference. We show that these two statistical inference problems greatly benefit from the indel-aware resampling and re-estimation approach and the reservation of intra-sequence dependence. A major drawback of SERES is that it requires parameters to ensure the synchronization of random walks on unaligned sequences. We introduce RAWR, a non-parametric resampling method designed for phylogenetic tree support estimation that does not require extra parameters. We show that the RAWR-based resampling and re-estimation method produces comparable or typically better performance than the traditional bootstrap approach on the phylogenetic tree support estimation problem. We further relax the commonly used assumption of phylogeny. Evolutionary history is usually considered as a tree structure. Evolutionary events that cause reticulated gene flow are ignored. Previous studies show that alignment uncertainty greatly impacts downstream tree inference and learning. However, there is little discussion about the impact of MSA uncertainties on the phylo- genetic network reconstruction. We show evidence that the errors introduced in MSA estimation decrease the accuracy of the inferred phylogenetic network, and an indel-aware reconstruction method is needed for phylogenetic network analysis. In this dissertation, we introduce our contribution to phylogenetic estimation using biomolecular sequence data involving complex evolutionary histories, such as sequence insertion and deletion processes and non-tree-like evolution. In the loving memory of my parents, Huaijie Wang and Yuehua Hu. iv ACKNOWLEDGEMENTS It has been a long journey since the start of my Ph.D. study in 2016. I would not have made it to the endpoint without the help and support of many people. First and foremost, I would like to thank my advisor, Professor Kevin Liu, for all his guidance and support. He is always very patient and encouraging. As a research advisor and a career mentor, he opened many new doors for me, which I could not imagine when I first started my Ph.D. study. I would like to thank my committee members: Professor Arjun Krishnan, Professor Eric Goodman, and Professor Mohammad Ghassemi, for their generous service, valuable comments, and thoughtful questions. It has been a great honor and privilege for me to have them on my doctoral committee. I would also like to thank Professor Jizhong Lou, who supervised my master’s theses at the Institute of Biophysics (IBP), Chinese Academy of Sciences. Professor Lou has excellent advice and unsurpassed knowledge. I am grateful for his support and inspiration. It’s a great pleasure to work with my excellent colleagues, including Hussein, Zhiwei, Jack, Qiqige, Julia, Ahmad, and many others. I must acknowledge the generous help from Hussein in the first year of my Ph.D. Hussein helped me get familiar with different tools and software. He also generously shared his experience on the Ph.D. study and suggested that do not burn out at the beginning of this marathon. I am grateful to the BEACON, NSF Science and Technology Center for the study of evolution in action for their generous support during my Ph.D. study. I would like to thank the professors and staff at the Computer Science and Engineering department and the Office of International Students and Scholars, especially Professor Eric Torng, Professor Katy Colbry, Professor Sandeep Kulkarni, Erin Dunlop, Brenda Hodge, and Bryce Carlton. I want to thank my friends from Michigan State University. I thank Xi Yin, a best friend of mine and the best roommate ever, for her support, encouragement, and patience during the first two years of my Ph.D. life. We shared many beautiful memories in East Lansing. I thank Nan Du, a v dear friend of mine, and surprisingly turned out to be a distant cousin of my husband, found by an ancestry test service. He is a living Wikipedia who knows everything and generously offers free service to everyone. I thank Qi Wang for her optimistic attitude that deeply affected me. I thank my friends who lived next door to me, Mengying Sun, Deliang Yang, Boyang Liu, and Yunshi Liang, for bringing me endless joy and comfort during this pandemic. I thank my lifelong friends, Xi (Lucy) Lu, Yue Kang, Mengnan Bai, and Xiaolan Li. I want to express my utmost gratitude to my husband, Teng Cao, for his unconditional love. After losing my parents, I couldn’t find a connection with this world until I met him. He saved me from my severe depression and insomnia. He gave me a home where I could go back to and a future to look forward to. I am grateful to my cat, Popcorn. She always gazed at me lovingly. Seeing her every day is a source of mental happiness that relieves me from stress and depression. Finally, I would like to thank my parents from the very bottom of my heart. My father was my best friend who understood me the most. He had a good sense of humor and brought me a lot of joy and happiness when I was young. My personality was largely inherited from him. My mother was the most beautiful and strong woman in the world. She gave me endless love and encouragement. Her optimistic attitude influenced me deeply. Those good memories never fade away. Their love and companionship made me what I am today. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxii CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1 Multiple Sequence Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 Sequence evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.2 Multiple Sequence Alignment Estimation . . . . . . . . . . . . . . . . . . 10 2.2 Phylogenies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Phylogenetic Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.2 Phylogenetic Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3 Gene Trees and Species Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.1 Coalescence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.3.2 Multi-Species Coalescent (MSC) Model . . . . . . . . . . . . . . . . . . . 19 2.4 Phylogenetic Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.4.1 Phylogenetic Tree Reconstruction of Single Gene . . . . . . . . . . . . . . 22 2.4.2 Branch support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.4.3 Phylogenetic Tree Reconstruction of Multiple Genes . . . . . . . . . . . . 25 2.4.4 Phylogenetic Network Reconstruction . . . . . . . . . . . . . . . . . . . . 28 2.4.5 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4.5.1 Comparison of Phylogenetic Trees . . . . . . . . . . . . . . . . . 30 2.4.5.2 Comparison of Phylogenetic Networks . . . . . . . . . . . . . . 31 CHAPTER 3 SERES: THE SEQUENTIAL RESAMPLING AND ITS APPLICATION ON MULTIPLE SEQUENCE ALIGNMENT SUPPORT ESTIMATION . . 33 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2.1 SERES walks on aligned sequences . . . . . . . . . . . . . . . . . . . . . 35 3.2.2 SERES walks on unaligned sequences . . . . . . . . . . . . . . . . . . . . 36 3.2.3 Performance study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2.4 Simulated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2.5 Empirical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.6 Performance Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3.1 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.3.2 Empirical study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 vii CHAPTER 4 APPLICATION OF SERES RESAMPLING APPROACH TO ALIGNED SEQUENCES: PHYLOGENETIC HMM INFERENCE AND LEARNING . 57 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.2.1 Standalone recHMM analysis . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.2.2 The SERES+recHMM pipeline . . . . . . . . . . . . . . . . . . . . . . . . 59 4.2.3 Simulated datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.2.4 Empirical datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.3.1 Simulation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.3.2 Empirical study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 CHAPTER 5 PHYLOGENETIC SUPPORT ESTIMATION WITH THE RANDOM WALK RESAMPLING APPROACH . . . . . . . . . . . . . . . . . . . . . 75 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.2.1 RAWR-based Phylogenetic Support Estimation . . . . . . . . . . . . . . . 76 5.2.2 Bootstrap Phylogenetic Support Estimation. . . . . . . . . . . . . . . . . . 78 5.2.3 Additional Performance Study . . . . . . . . . . . . . . . . . . . . . . . . 79 5.2.4 Simulated datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.2.5 Empirical datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.2.6 Performance measurement. . . . . . . . . . . . . . . . . . . . . . . . . . . 84 5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.3.1 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 5.3.1.1 Performance comparison of RAWR versus bootstrap. . . . . . . . 86 5.3.1.2 RAWR support estimation using reduced resampling replication. . 89 5.3.1.3 Results of Additional Performance Study . . . . . . . . . . . . . 89 5.3.2 Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.3.2.1 Performance comparison of RAWR versus bootstrap. . . . . . . . 96 5.3.2.2 Results of Additional Performance Study . . . . . . . . . . . . . 96 5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 CHAPTER 6 AN APPLICATION OF RANDOM WALK RESAMPLING TO PHY- LOGENOMIC ANALYSIS OF DARWIN’S FINCHES . . . . . . . . . . . . 102 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 6.2.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 6.2.2 Process of the raw sequencing data . . . . . . . . . . . . . . . . . . . . . . 104 6.2.3 Concatenated MLE phylogenetic tree inference . . . . . . . . . . . . . . . 105 6.2.4 Phylogenetic support estimation using bootstrap resampling. . . . . . . . . 106 6.2.5 Phylogenetic support estimation using RAWR resampling. . . . . . . . . . 106 6.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 viii 6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 CHAPTER 7 IMPACT OF MULTIPLE SEQUENCE ALIGNMENT ERROR ON THE SUMMARY-BASED PHYLOGENETIC NETWORK RECONSTRUCTION 113 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 7.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 7.2.1 Simulated Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 7.2.2 Simulation Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 7.2.3 Empirical Datasets and Experiments . . . . . . . . . . . . . . . . . . . . . 118 7.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 7.3.1 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 7.3.1.1 D-statistics for gene flow detection . . . . . . . . . . . . . . . . . 119 7.3.1.2 Phylogenetic network inference . . . . . . . . . . . . . . . . . . 120 7.3.2 Empirical Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 7.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 CHAPTER 8 CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . 129 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 ix LIST OF TABLES Table 3.1: Simulated datasets: parameter values and summary statistics. The simulation model condition parameters consist of the number of taxa, model tree height, and insertion/deletion probability. Each model condition corresponds to a distinct set of model parameter values. The following table columns list average summary statistics for each model condition (𝑛 = 20). “NHD” is the average normalized Hamming distance of a pair of aligned sequences in the true alignment. “Gappiness” is the percentage of true alignment cells which consists of indels. “True align length” is the length of the true alignment. “Est align length” is the length of the MAFFT-estimated alignment [64] which was provided as input to the support estimation methods. “SP-FN” and “SP-FP” are the proportion of homologies that appear in the true alignment but not in the MAFFT-estimated alignment and vice versa, respectively. . . . . . . . . . . 42 Table 3.2: Medium-gap-length model conditions: estimated alignment statistics. The MSA support estimation problem requires an input MSA. Our study in- cluded ClustalW [88] and FSA [12] alignments to explore the impact of input alignment quality on downstream support estimation. The following table columns list average statistics for estimated alignments on each model condi- tion (𝑛 = 20). “Est align length” is the estimated alignment length. “SP-FN” and “SP-FP” are the proportion of homologies that appear in the true alignment but not in the estimated alignment and vice versa, respectively. . . . . . . . . . 42 Table 3.3: Empirical dataset summary statistics. The empirical study made use of ref- erence alignments (“Ref align”) from the CRW database [14]. The column description is identical to Table 3.1. . . . . . . . . . . . . . . . . . . . . . . . . 43 Table 3.4: Support estimation method performance on simulated datasets. Results are shown for simulated datasets. The top rows show AUC comparisons of GUID- ANCE1 (“GUIDANCE1”) vs. SERES combined with parametric techniques from GUIDANCE1 (“SERES+GUIDANCE1”), Results AUC comparisons of GUIDANCE2 (“GUIDANCE2”) vs. SERES combined with parametric tech- niques from GUIDANCE2 (“SERES+GUIDANCE2”); the best AUC is shown in bold. Corrected q-values are reported (𝑛 = 20) and all were significant (𝛼 = 0.05). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 x Table 3.5: Support estimation method performance on long-gap-length model conditions. The performance of GUIDANCE2 and SERES+GUIDANCE2 is compared across model conditions 10.long.A through 10.long.E (named in order of gen- erally increasing sequence divergence). Aggregate PR-AUC and ROC-AUC are reported across all replicate datasets in a model condition (𝑛 = 20), and the best AUC for each model condition is shown in bold. Statistical significance of PR-AUC or ROC-AUC differences was assessed using a one-tailed pairwise t-test or DeLong test [25] test, respectively, and multiple test correction was performed using the method of [8]. Corrected q-values are reported (𝑛 = 20) and all were significant (𝛼 = 0.05). . . . . . . . . . . . . . . . . . . . . . . . . . 47 Table 3.6: SERES+GUIDANCE2 performance using alternative methods for estimating an input MSA. Input MSAs in these experiments were estimated using either ClustalW [88] or FSA [12]. (MAFFT was used to estimate input MSAs throughout the rest of our study.) Results are shown for model conditions 10.A through 10.E (named in order of generally increasing sequence divergence). Otherwise, table layout and description are identical to Table 3.5. . . . . . . . . 48 Table 3.7: Empirical study results. Results are shown for empirical datasets. For each dataset and pairwise method comparison. Table layout, and table description are otherwise identical to Table 3.4. . . . . . . . . . . . . . . . . . . . . . . . . 53 Table 4.1: Simulated dataset statistics. The number of true gene trees and average normal- ized Hamming distance (“ANHD”) are reported for simulated datasets from the simulation study; average (“Avg”) and standard error (“SE”) are shown for all experimental replicates from each model condition (𝑛 = 30). . . . . . . . . . 61 Table 4.2: On 4-taxon model conditions, the posterior probabilities inferred by SERES+recHMM were better correlated with topological accuracy compared with the standalone recHMM. For each method, we calculated the Pearson correlation between the inferred posterior probability of a gene tree 𝑔 and the topological distance between 𝑔 and the true gene tree of a site. Average correlation for a method is calculated across all replicates in a model condition (𝑛 = 30). . . . . . . . . . . . 65 Table 4.3: On 5-taxon model conditions, posterior probabilities inferred by the SERES+recHMM had stronger correlation with topological accuracy compared with the poste- rior probabilities inferred by the standalone recHMM. Otherwise, table layout and description are identical to Table 4.2. . . . . . . . . . . . . . . . . . . . . . 65 Table 4.4: On 6-taxon model conditions, posterior probabilities inferred using SERES+recHMM were more highly correlated with topological accuracy compared to standalone recHMM. Otherwise, table layout and description are identical to Table 4.2. . . . 66 xi Table 4.5: The comparison among different reversal probabilities 𝛾 on 4-, 5- and 6- taxon model conditions. The methods utilize models with 𝜙 = 3 to infer a posterior probability distribution over gene tree topologies. For each method’s inference, we calculated the Pearson correlation between the inferred posterior probability for a gene tree 𝑔 and the topological distance between 𝑔 and the true evolutionary history of a site (i.e., the true local gene tree). The averages are reported across all 𝑛 replicates in a model condition (𝑛 = 30). . . . . . . . . . 69 Table 4.6: The comparison among different number of states 𝜙 on 5-taxon model con- ditions. The methods utilize models with 𝛾 = 0.005 to infer a posterior probability distribution over gene tree topologies. For each method’s infer- ence, we calculated the Pearson correlation between the inferred posterior probability for a gene tree 𝑔 and the topological distance between 𝑔 and the true evolutionary history of a site (i.e., the true local gene tree). The averages are reported across all 𝑛 replicates in a model condition (𝑛 = 30). . . . . . . . . . 70 Table 4.7: The runtime and memory usage information for standalone recHMM and SERES+recHMM methods on simulation study model conditions. Model conditions were parameterized by the number of sequences, recombination rate 𝜌, and mutation rate 𝜃. Both methods utilize models with 𝜙 = 3 and 𝛾 = 0.005 to infer a posterior probability distribution over gene tree topologies. Average runtime in hours and peak memory usage in GiB are reported across all replicates in a model condition (𝑛 = 30). . . . . . . . . . . . . . . . . . . . . 71 Table 5.1: Summary statistics for ClustalW-estimated alignments and RAxML(ClustalW) trees on 10-taxon model conditions. Table layout and description are otherwise identical to Table 5.3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Table 5.2: Long-gap-length model conditions: parameter values and summary statistics. Our simulation study included additional 10-taxon model conditions that uti- lized the long gap length distribution from the study of 2012 study of Liu et al. [80]. The model parameters consisted of model tree height and inser- tion/deletion probability, and each model condition corresponds to a distinct set of model parameter values. The long-gap-length model conditions are named 10.long.A through 10.long.E in order of generally increasing sequence divergence. The following table columns list average summary statistics for each model condition (𝑛 = 20). “NHD” is the average normalized Hamming distance of a pair of aligned sequences in the true alignment. “Gappiness” is the percentage of true alignment cells which consists of indels. “True align length” is the length of the true alignment. “Est align length” is the length of the MAFFT-estimated alignment [64] which was provided as input to the support estimation methods. “SP-FN” and “SP-FP” are the proportion of ho- mologies that appear in the true alignment but not in the MAFFT-estimated alignment and vice versa, respectively. The table and caption are reproduced from the original study of the SERES resampling algorithm [145]. . . . . . . . 80 xii Table 5.3: Model condition parameters and summary statistics of the simulation datasets. Model condition parameters consisted of the number of taxa, tree height, and insertion/deletion probability. The model conditions are named from A to E to represent increasing evolutionary divergence. The average summary statistics are reported for the true alignments, and the MAFFT-estimated alignments over 𝑛 replicate datasets (𝑛 = 20). “ANHD” is the average normalized Ham- ming distance of a pair of aligned sequences in an MSA, “Gappiness” is the proportion of an MSA matrix that consists of indels, “length” is the number of MSA columns, and “SP-FN” and “SP-FP” are the proportions of residue pairs that appear in the true alignment but not in the estimated alignment or vice versa, respectively. The average normalized Robinson-Foulds distance ( “nRF”) between the model tree and the RAxML(MAFFT)-inferred tree is also reported over 𝑛 replicate datasets (𝑛 = 20). . . . . . . . . . . . . . . . . . . . . . 84 Table 5.4: Summary statistics of empirical datasets. Summary statistic calculations and descriptions identical to Table 5.3. . . . . . . . . . . . . . . . . . . . . . . . . . 85 Table 5.5: PR-AUC performances on the simulation datasets. The PR-AUC are aggre- gated over 𝑛 replicate datasets for a model condition (𝑛 = 20). Statistical significance of PR-AUC differences between RAWR and bootstrap were eval- uated using a one-tailed pairwise t-test and a multiple test correction was performed using the method of [8]. Corrected q-values are reported (𝑛 = 20). . . 87 Table 5.6: RAWR support estimation using alternative estimation/re-estimation methods. We compared RAWR support estimation using two different estimation/re- estimation methods: either MAFFT and RAxML(MAFFT) or ClustalW and RAxML(ClustalW). For each of the two methods, aggregate PR-AUC is shown across all replicate datasets of each model condition (𝑛 = 20). . . . . . . . . . . 90 Table 5.7: Simulation study: RAWR support estimation using different choices for rever- sal probability 𝛾. Aggregate PR-AUC is reported across all replicate datasets of each 10-taxon model condition (𝑛 = 20). . . . . . . . . . . . . . . . . . . . . 91 Table 5.8: PR-AUC comparison of bootstrap and RAWR methods on 10-taxon long- gap-length model conditions. MAFFT and RAxML(MAFFT) were used to perform MSA and tree estimation/re-estimation, respectively. Each method’s PR-AUC is reported as an aggregate across all replicate datasets for a model condition (𝑛 = 20). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 xiii Table 5.9: PR-AUC comparison of aLRT and RAWR methods for phylogenetic support estimation. We used PhyML [49] to run two types of aLRT analyses: (1) support estimation or a free tree topology that was also estimated as part of the analysis, and (2) support estimation for a RAxML-inferred tree topology. The latter methodology for obtaining an annotation tree is identical to the approach used in all other experiments in our study, and its PR-AUC performance is therefore directly comparable to other simulation study experiments. Table layout and description are otherwise identical to Table 5.8. . . . . . . . . . . . . 92 Table 5.10: PR-AUC comparison of TBE with bootstrap resampling, TBE with RAWR resampling, and RAWR. TBE was used to estimate phylogenetic support using two different resampling approaches: either (1) standard bootstrap resampling, which corresponds to the method originally proposed and studied by Lemoine et al. [74], or (2) RAWR resampling. The former is denoted “TBE with bootstrap resampling”, and the latter is denoted “TBE with RAWR resam- pling”. For comparison purposes, RAWR resampling and re-estimation was also run as a third method (denoted “RAWR”), and we used the same method- ology as elsewhere in our study (i.e., using a standard branch presence/absence calculation to assess phylogenetic support). . . . . . . . . . . . . . . . . . . . . 94 Table 5.11: PR-AUC comparison of GUIDANCE2 and RAWR phylogenetic support esti- mation methods. MAFFT and RAxML(MAFFT) were used to perform MSA and tree estimation/re-estimation, respectively. We report each method’s ag- gregate PR-AUC across all replicate datasets for a model condition (𝑛 = 20). . . 95 Table 5.12: PR-AUC performance of RAWR+teleport on 10-taxon model conditions. . . . . 95 Table 5.13: PR-AUC performances on the emprircal datasets. PR-AUC comparison of bootstrap and RAWR methods for phylogenetic support estimation. MAFFT and RAxML(MAFFT) were used to perform MSA and tree estimation/re- estimation, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Table 5.14: Empirical study: PR-AUC comparison of aLRT and RAWR methods on CRW benchmarking datasets. Table layout and description are otherwise identical to Table 5.9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 Table 5.15: PR-AUC comparison of TBE and RAWR methods on CRW benchmarking datasets. Table layout and description are otherwise identical to Table 5.9. . . . 97 Table 5.16: PR-AUC comparison of GUIDANCE2 and RAWR methods for phylogenetic support estimation. MAFFT and RAxML(MAFFT) were used to perform MSA and tree estimation/re-estimation, respectively. . . . . . . . . . . . . . . . 98 xiv Table 7.1: Model parameters and summary statistics of the simulated datasets. The 4- taxon model conditions are named 4.A through 4.E in order of increasing evo- lutionary divergence; the 8-taxon model conditions are named 8.A through 8.E similarly. Additional model condition parameters include the insertion/dele- tion rate and the model phylogeny height (see Methods section for details). Average normalized Hamming distance (“ANHD”) and the percentage of true MSA cells that consist of indels (“Gappiness”) are reported as an average for each model condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 Table 7.2: Topological distance pairwise comparison of species networks inferred by MLE on different MSAs of the mosquito dataset. We obtained the reference alignment from the original study of [103]. The estimated alignments were generated by ClustalW, MAFFT, or FSA. We also compared estimation of species networks with differnt reticulations, which represents different model complexity. The MLE method was used to estimate species networks with at most 1, 2, 3, or 4 reticulations (”ret.”). Topological distances [100] of pairwise comparison between estimated networks were measured. Only upper triangular entries in the pairwise distance matrix are shown. . . . . . . . . . . . 125 Table 7.3: Topological distance pairwise comparison of species networks inferred by MLE on different MSAs of the yeast dataset. The MSAs were estimated using ClustalW, MAFFT, or FSA. Table description and layout are identical to Table 7.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 xv LIST OF FIGURES Figure 1.1: Darwin’s first evolutionary diagram, drawn in his Notebook in 1837. . . . . . . 2 Figure 2.1: Illustration of character evolution. Characters evolve on a tree. The branching structure of the tree guides the mutation process. The red dots on the branches represent the substitution events. . . . . . . . . . . . . . . . . . . . . . . . . . 8 Figure 2.2: An example of Multiple Sequence Alignment (MSA). (a) A tree shows the evolutionary history of sequences. Evolutionary events make changes to the content of the observable sequences at leaf nodes. (b) Sequences at the leaf nodes can be observed. The unaligned sequences form a character matrix. (c) Multiple sequence alignments can be used to build a character matrix, each column consists of characters that evolve from the same ancestral character. . . 9 Figure 2.3: Progressive alignment algorithm. The progressive alignment algorithms take a set of sequences as input and build a distance matrix for input sequences. Then a guide tree is estimated based on the distance matrix and pairwise alignment is conducted at each internal node on the tree from leaves to the root. This figure is reproduced from http://readiab.org/. . . . . . . . . . . . . . 12 Figure 2.4: An example of phylogenetic tree. (a) A rooted phylogenetic tree. 𝑉𝑟 is the root node. Taxon 𝐵 is an ancestor of taxon 𝐴. (b) The unrooted version of the same tree by removing the root node 𝑉𝑟 . (c) An polytomy tree. Node 𝐺 has a degree of four, which represents a polytomy node. . . . . . . . . . . . . . 14 Figure 2.5: An example phylogenetic network. (a). A phylogenetic network 𝑁, 𝑟 is the root node, 𝐴 is a tree node, 𝐵 is a reticulation node, and 𝐶, 𝐷, 𝐸 are leaf nodes. The edge in red is a reticulation edge. The tree 𝑇1 and 𝑇2 in (b) and (c) are the inducted trees of 𝑁. . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Figure 2.6: Illustration of coalescence and multi-species coalescence. (a) An illustration of the coalescent process. Each row represents a generation. Generations have no overlap with each other. The new generation is generated by randomly sampling from the previous generation. Each dot represents a gene copy. The blue line represents the inheritance relationship, where one gene is copied from an ancestor gene. Parents are selected at random. When populations are separated, and gene copies only come from samples from the same population. A coalescence event is a point when two lineages merge into a common ancestor. The coalescent process forms a lineage tree. . . . . . . . . . . . . . . 19 xvi Figure 2.7: Illustration of deep coalescence. (a) Multi-species coalescence results in a gene tree inside a species tree. In this example, the gene tree is concordant with the species tree in terms of topology. During the process of coalescence, genes may separate before the species separate, which is called deep coalescence. The most recent common ancestor (MRCA) of gene copies sampled from species B and C is older than the speciation event of B and C. (b) Deep coalescence with incomplete fixation of gene lineages in species lineages may cause discordance between the gene trees and the species tree. In this example, the gene copies sampled from species A and B coalescent first, which is different from the order speciation events. . . . . . . . . . . . . . . . 21 Figure 2.8: An illustration of the phylogenetic tree pipeline for a single gene. First, homologous sequences are selected and aligned. Then the phylogenetic tree is inferred based on the alignment result. The figure is adapted from [33]. . . . 23 Figure 2.9: An illustration of the phylogenetic tree pipeline for multiple genes. (a) Con- catenation methods first concatenate the alignments of genes into a super- matrix. Then the species tree is inferred based on this super matrix. (b) Summary-based methods infer gene trees from the gene alignments first. Then the species tree is inferred from all the gene trees. (c) Co-estimation methods estimate gene trees and species trees simultaneously in a single statistical inference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Figure 3.1: An example of SERES resampling random walk on unaligned sequences. First, we estimate an alignment for the input unaligned sequences. Then, a set of anchors with the highest similarity is estimated using the estimated alignment. The anchors’ boundaries and the start and end of the input se- quences are used as barriers. Finally, the SERES random walk performs on the input sequences with barriers. The random walk starts at a randomly selected barrier and moves to a randomly selected direction to the next bar- rier. It reverses at each encountered barrier with probability 𝛾 and reverses with certainty at the start and end barriers of the input sequence. During the random walk, the sub-sequences between the previous barrier and the current barrier are sampled for the replicate. The resampling procedure ends when the resampled sequences reach the length criteria. . . . . . . . . . . . . . . . . 39 Figure 3.2: SERES+GUIDANCE2 performance using different choices for anchor length. Results are shown for five 10-taxon medium-gap-length model conditions (named 10.A through 10.E in order of generally increasing sequence di- vergence). We evaluated the performance of SERES+GUIDANCE2 where anchor length in bp was either 3, 5, 10, 30, or 50. We calculated each method’s precision-recall (PR) and receiver operating characteristic (ROC) curves. Performance is evaluated based upon aggregate area under curve (AUC) across all replicates for a model condition (𝑛 = 20). . . . . . . . . . . . 49 xvii Figure 3.3: SERES+GUIDANCE2 performance using different choices for the number of anchors. We evaluated the performance of SERES+GUIDANCE2 where the number of anchors used was either 3, 5, 20, 50, or 100. Otherwise, figure layout and description are identical to Figure 3.2. . . . . . . . . . . . . . . . . 50 Figure 3.4: Runtime comparison of methods under study. (a) For each method, average runtime (h) across all replicates in each simulation study model condition is reported (𝑛 = 20); standard error bars are also shown. The 10-taxon model conditions 10.A through 10.E are shown in order from left to right, followed by the 50-taxon model conditions similarly. (b) Method runtimes are shown for each empirical study dataset. Datasets are arranged from left to right in order of increasing dataset size as measured by number of taxa. . . . . . . . . 51 Figure 3.5: Memory usage comparison of methods under study. Memory usage is shown in GiB. Otherwise, figure layout and description are identical to Supplemen- tary Figure 3.4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Figure 4.1: The posterior probability distribution inferred by the standalone recHMM method on 4-taxon model conditions. For each site, we split the local gene tree topologies into true class, which contains the true gene tree topologies for the site, and false class’, which contains all other gene tree topologies. For each class and each replicate dataset in a model condition, the inferred posterior probabilities for gene trees at any site were binned into deciles; the resulting histogram was normalized over all replicates in a model condition (𝑛 = 30). The normalized histograms for the true and false classes are shown in blue and orange, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Figure 4.2: Histogram of posterior probabilities inferred by SERES+recHMM method on 4-taxon model conditions. Figure layout and description are otherwise identical to Figure 4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Figure 4.3: Distribution of posterior probabilities inferred by standalone recHMM method on 6-taxon model conditions. Figure layout and description are otherwise identical to Figure 4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 Figure 4.4: Distribution of posterior probabilities inferred by SERES+recHMM method on 6-taxon model conditions. Figure layout and description are otherwise identical to Figure 4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 xviii Figure 4.5: Posterior probability distribution of local gene tree topologies inferred by stan- dalone recHMM versus SERES+recHMM method on Indian HIV-1 dataset. We re-analyzed a subset of the Indian HIV-1 genome dataset that was pub- lished by [84]; [150] re-analyzed the original dataset using recHMM. Our re-analysis compared local gene tree probabilities computed using standalone recHMM posterior decoding (top panel) versus SERES+recHMM posterior decoding (bottom panel). The plots show posterior decoding probabilities (y-axis) versus genome coordinate (x-axis). Local gene tree probabilities are colored based on the three possible unrooted topologies for the four-taxon dataset (shown in either blue, orange, or green). . . . . . . . . . . . . . . . . . 72 Figure 5.1: An illustrated example of RAWR resampling and re-estimation. The first step of the RAWR-based phylogenetic support estimation is sequence resampling. A random walk is performed on the input MSA. MSA sites are resampled during the random walk. The indels are removed from the resampled sites to produce a resampled replicate. Then, MSA is re-estimated from the resampled sequence. Finally, a phylogenetic tree is re-estimated using the re-estimated MSA as input. The dashed lines in the first and second subplots show the reversal breakpoints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 Figure 5.2: Runtime and memory usage of the phylogenetic support estimation methods on simulation datasets. The top row includes average runtime usage for each model condition in the simulated study. The y-axis shows runtime in hours and is in log-scale. The left, middle right subplots represent 10-taxon, 50-taxon, and 100-taxon model conditions respectively. The bottom row includes average memory usage for each model condition in the simulated study. The y-axis shows runtime peak memory usage in GiB. The left, middle right subplots represent 10-taxon, 50-taxon, and 100-taxon model conditions respectively. The average runtime or peak memory usage were calculated across all replicate datasets in the model condition (𝑛 = 20). . . . . . . . . . . 88 Figure 6.1: Pictures of three Darwin’s finch species. (a) The small tree finch Ca- marhynchus parvulus. (b) The medium tree finch Camarhynchus pauper. (c) The large tree finch Camarhynchus psittacula. All three species live on the Galapagos island of Floreana. Arrows indicate the migration of two pop- ulations of C. psittacula from Isabela and Santa Cruz. Figure comes from [47]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 xix Figure 6.2: Phylogenetic tree of Darwin’s finch species and two close relatives reported in Lamichhaney et al. 2015 study[69]. This phylogenetic tree was reproduced from the Figure 1 panel b in Lamichhaney’s 2015 paper. The branch length were ignored and the tree was rescaled to an ultrametric tree, where all the leaves have the same distance to the root. The color of the branches and the species name represents the group that the species belongs to. We used the same color as the original study, the purple, brown, cyan, red, green, blue and black color represent the group of warbler finches, vegetarian finch, cocos finch, sharp-beaked ground finches, tree finches, all other ground finches and the outgroups. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Figure 6.3: The re-estimated phylogenetic tree for Darwin’s finch species and two close relatives with supports estimated by the standard bootstrap method. We re- estimated a species tree using maximum likelihood estimation (MLE) on the concatenated and partitioned genomic sequence alignment of Darwin’s finch species and two close relatives. We estimated supports using the standard bootstrap method, which is implemented by RAxML version 8.2.9. The branch length were ignored and the tree was rescaled to an ultrametric tree. The color mapping is the same as Figure 6.2. . . . . . . . . . . . . . . . . . . 110 Figure 6.4: The re-estimated phylogenetic tree for Darwin’s finch species and two close relatives with supports estimated by the RAWR-based support estimation method. We calculated supports for the re-estimated phylogenetic tree using the RAWR-based support estimation method. The annotated tree is the same tree as Figure 6.3. The branch length were ignored and the tree was rescaled to an ultrametric tree. The color mapping is the same as Figure 6.2. . . . . . . 111 Figure 7.1: Alignment error for MSA estimation methods of the simulated datasets. The MSA methods in our study consisted of MAFFT, ClustalW, and FSA. We assessed MSA estimation error based on type I and type II error: the former was assessed based on SP-FP proportion (“SPFP”), which is the proportion of nucleotide-nucleotide homologies that appear in the estimated alignment but not the true alignment, and the latter was assessed based on SP-FN proportion (“SPFN”), which is the proportion of nucleotide-nucleotide homologies that appear in the true alignment but not the estimated alignment. Average SPFN and SPFP are shown for each MSA method on each model condition. . . . . . 116 Figure 7.2: Gene flow detection using the D-statistic on simulated datasets. The left subplot shows the D-statistic score distribution on simulated datasets with model networks. The right subplot shows the D-statistic score distribution on simulated datasets with model trees. The D-statistic values were calculated using MAFFT-estimated alignments, which refer to ”estiAln’, and true align- ments, which refer to ”trueAln”. Average D-statistic values are reported, and standard error bars are shown over 20 replicates. . . . . . . . . . . . . . . . . 120 xx Figure 7.3: Topological errors of MLE analysis on simulation datasets. The MLE method was conducted on five different inputs: (1) true MSAs and true gene trees (”trueTree”), (2) true alignments and gene trees estimated using FastTree on the true alignments (”trueAln”), (3) ClustalW-estimated align- ments and gene trees estimated using FastTree on the ClustalW-estimated MSAs (”clustalwAln”), (4) MAFFT-estimated alignments and gene trees es- timated using FastTree on MAFFT-estimated alignments (”mafftAln”), or (5) FSA-estimated alignments and gene trees estimated using FastTree on FSA- estimated alignments (”fsaAln”). Topological error was measured using the reduced distance [100]. Averages and standard error bars are shown for each model condition in the simulation study (𝑛 = 20). . . . . . . . . . . . . . . . . 122 Figure 7.4: Topological errors of MPL analysis on simulation datasets. Figure description and layout are otherwise identical to Figure 7.3. . . . . . . . . . . . . . . . . . 122 Figure 7.5: Computational runtime requirements of summary-based species network in- ference methods for simulation study. The runtime of the MPL and MLE methods on simulation datasets is shown in hours. Averages and standard error bars are shown for each model condition over 20 replicates in the simu- lation study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Figure 7.6: Computational memory requirements of summary-based species network in- ference methods of simulation study. The peak main memory usage of the MPL and MLE method on simulation datasets shows in GiB. Figure descrip- tion and layout are identical to Figure 7.5. . . . . . . . . . . . . . . . . . . . . 124 xxi LIST OF ALGORITHMS Algorithm 3.1: SERES walk on aligned sequences . . . . . . . . . . . . . . . . . . . . . . 35 Algorithm 3.2: SERES resampling of unaligned sequences . . . . . . . . . . . . . . . . . 37 Algorithm 3.3: Obtain anchors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Algorithm 3.4: Modified Hamming distance calculation . . . . . . . . . . . . . . . . . . . 38 Algorithm 5.1: RAWR phylogenetic support estimation . . . . . . . . . . . . . . . . . . . 77 Algorithm 5.2: RAWR+teleport resampling procedure . . . . . . . . . . . . . . . . . . . . 82 xxii CHAPTER 1 INTRODUCTION Phylogenetics is the study of evolutionary relationships. The evolutionary relationships are often depicted by phylogeny, a graphical representation of the evolutionary history of a group of organisms (taxa), such as genes or species. One of the most simple and widely used representations is the phylogenetic tree, a directed acyclic graph. However, the true evolutionary history is not always tree-like. A phylogenetic network can better depict the horizontal genetic material flow, such as hybridization and recombination. Phylogenies show the history of genetic information transmission and thus play an essential role in interpreting information on many aspects of organisms, such as the structure and function of genomics. The reconstructed phylogenies are used in numerous biological studies, such as gene function prediction, protein structure prediction, drug discovery, vaccine development, and many non-biological studies, such as computer security applications and linguistics studies. Since phylogenetic reconstruction is an important and fundamental problem for biological stud- ies, many methods have been developed for this problem [56, 37, 77, 157]. With the technological advances in molecular biology and genomics, increasing amounts of biomolecular sequencing data, such as DNA, RNA, and amino acid sequences, are available for accurate phylogenetic recon- struction, and many of these computational methods infer evolutionary history from biomolecular sequencing data. A general phylogenetic reconstruction pipeline using sequence data consists of the following steps. First, collect samples from a group of closely related species. Samples are processed and sequenced by any selected sequencing technique, such as Sanger sequencing and next-generation sequencing techniques. Sequence data is produced for several genes of interest. Then the sequence data needs to go through a series of preprocessing steps, such as quality control, filtering, and assembly. After preprocessing, there are two fundamental steps for the phylogenetic reconstruction, Multiple Sequence Alignment (MSA) estimation, and phylogenetic inference. 1 Figure 1.1: Darwin’s first evolutionary diagram, drawn in his Notebook in 1837. The first step is Multiple Sequence Alignment (MSA) estimation. The phylogenetic analysis has to be conducted on homologous sequences, which are characters that evolved from the same ancestor. We discuss the MSA estimation problem in detail in Chapter 2. The MSA obtained in the previous step is used as input for the phylogenetic reconstruction. There are many approaches to phylogenetic tree and network reconstruction, which we discuss in detail in Chapter 2. Phylogenetic reconstruction is an important step whose results greatly impact downstream biological studies. However, the traditional sequence-based phylogenetic reconstruction methods involve many over-simplified assumptions. In this thesis, we describe our contribution to relaxing some of these over-simplified assumptions in phylogenetic analysis. 2 Insertions and deletions, known as indels and represented by gaps in MSAs, carry much phy- logenetic information but are frequently overlooked during the phylogeny reconstruction process. Many previous studies have shown that phylogenetic information carried by indels helps infer phy- logenies [127, 7, 86]. There are studies show that taking into account indels helps resolve some deep branches in phylogenies [115, 119]. Traditional methods only utilize historical substitution events to reconstruct phylogenetic trees. Moreover, these methods usually take the input MSA for granted and ignore uncertainty in the MSA estimation. For the following reasons, indels are either discarded or treated as missing values during the phylogenetic reconstruction process. Some studies considered that indels were unreliable for phylogenetic reconstruction [46]. This viewpoint was later proved to be incorrect [128]. Accurate estimation of indels can be very time-consuming. Unlike substitution events, where sequence length is not affected, insertion and deletion events often involve many sites, resulting in dependence among sites and changes in the sequence length. Fur- thermore, multiple indels may overlap, which makes it even more challenging to obtain an accurate estimation. Also, there is no uniform opinion on how to deal with indels in the phylogenetic reconstruction yet. Based on these challenges, it is important to determine the reliability of indels and take into account the uncertainty of estimated MSAs during the phylogenetic analysis. Resampling and re- estimation are often used in confidence interval estimation. Resampling is the process of drawing samples from the original set of observations. It has been widely used in statistical support estima- tion, especially those non-parametric approaches, such as the standard bootstrap [32] and jackknife [140]. Non-parametric approaches do not require a particular model. However, these resampling approaches usually assume that the observations are independent and identically distributed (i.i.d), which is another over-simplified assumption that has been widely used in phylogenetic analysis. This assumption does not always hold for biomolecular sequences. Many evolution events pro- duce intra-sequence dependence and functional dependence in the biomolecular sequences that are inconsistent with this assumption, such as recombination and hybridization. Another over-simplified assumption that we want to address is that the evolutionary relationships 3 are often simplified to a tree structure, and all reticulate gene flows are ignored. Though the phylogenetic tree is the most widely used representation of the evolutionary history of a group of taxa, the true evolutionary history is not always tree-like. Certain evolutionary processes, such as horizontal gene transfer (HGT), hybridization, and recombination, involve reticulate gene flow between two sibling taxa, which is better represented by a phylogenetic network. Previous research has shown that the uncertainties in MSAs have an effect on downstream tree inference and learning [153, 152]. However, there is little discussion about how the phylogenetic network is influenced by MSA uncertainties. In this thesis, we relax the widely used over-simplified assumptions introduced above. We introduce new resampling algorithms for the resampling of biomolecular sequences. We take into account the indel uncertainties in several phylogenetic analyses by applying the resampling and re-estimation of MSAs. Furthermore, we show that indel-aware phylogenetic analysis obtains better accuracy than the traditional methods. The rest of the dissertation is organized as follows. Chapter 2 provides background knowl- edge about multiple sequence alignment and phylogenetic reconstruction. We introduce the basic concepts of phylogenetic analysis and some of the widely used phylogenetic analysis methods. In Chapter 3, we describe our work on a new sequential resampling algorithm, SERES, which relaxes the over-simplified assumption made by the standard bootstrap approach [32] that all sites in the alignments are evolved independently and identically distributed (i.i.d). We applied the SERES resampling approach together with re-estimation to the MSA support estimation problem and achieved comparable or better performance than the state-of-the-art methods. In Chapter 4, we extend the SERES resampling and re-estimation procedure to another classical problem in computa- tional biology and bioinformatics, recombination-aware local genealogical inference. The SERES resampling approach produces local genealogies that greatly improve the topological accuracy. In Chapter 5, we discuss the simplified version of the sequential resampling algorithm, RAWR, which is designed for phylogenetic tree support estimation by resampling and re-estimation using unaligned sequences. For the phylogenetic support estimation problem, the RAWR-based method 4 outperformed the bootstrap method on almost all the simulation model conditions and the empirical datasets, which indicates that the RAWR-based support estimate benefits from the indel uncertain- ties of the input alignment. In Chapter 6, we discuss the application of the RAWR-based support estimation method to the whole-genome sequencing dataset of Darwin’s finch from Larmichhaney’s study in 2015 [69]. In Chapter 7, we further relax the assumption of phylogenetic tree and introduce a performance study on how MSA uncertainties impact the phylogenetic network inference. We show evidence that the errors introduced in MSA estimation and gene tree estimation steps reduce the accuracy of the inferred phylogenetic network, and an indel-aware reconstruction method is needed for phylogenetic network analysis. Finally, we conclude in Chapter 8 with discussion of our work and future research directions. 5 CHAPTER 2 BACKGROUND In this chapter, we briefly introduce the basic concepts that we use in the following chapters. First, we introduce the definition of Multiple Sequence Alignments (MSA) in Section 2.1. We discuss the sequence evolution models and MSA estimation methods in detail. Then we introduce phylogenies in Section 2.2, which includes the definition of phylogenetic tree and phylogenetic network. We further discuss two related conceptions: gene trees and species trees and their discordance in Section 2.3. We also briefly introduce the coalescent model, which integrates the evolutionary process of genes with species. Finally, we discuss the phylogenetic reconstruction methods in Section 2.4 and method performance evaluation in Section 2.4.5. 2.1 Multiple Sequence Alignment The genetic information encoded in DNA sequences is the biological foundation of life. Evolu- tion and heredity ensure the changes and the continuation of all living species. With the development of sequencing techniques, the cost of genome sequencing has dropped dramatically. Now we can easily read the genomic information using various sequencing technologies [94] and get cheap, large biomolecular sequence datasets. DNA sequences and other sequences derived from the DNA sequences, such as RNA sequences and amino acid sequences, can be represented as one-dimension arrays of characters. Sequence alignment is a fundamental problem in molecular biology. It is a problem of arranging the sequences to identify regions of similarity. A set of sequences is considered to be unaligned if they have different lengths. Biomolecular sequences are one-dimensional arrays over the alphabet Σ. For DNA sequences, the alphabet Σ = 𝐴, 𝑇, 𝐶, 𝐺. Due to historical insertions and deletions, sequences that share evolutionary history can have different lengths. The process of sequence alignment is to line up sequences and maximize identical subsequences. The order and the content of the DNA sequences cannot be changed. The only allowed operation 6 is to add gaps (represented by "-") into the unaligned sequences, which represent gaps that are created by the historical insertion and deletion events, which is referred to as indels. The characters of different sequences are aligned into columns or sites by adding gaps in appropriate positions. All characters in a column are assumed to derive from a common ancestral character. A multiple sequence alignment (MSA) represents the evolutionary relationship among a set of unaligned sequences. The alignment can be considered as a matrix, where each sequence is a row in the matrix, and each column, or site, shows the sequence homology. Homologous are two characters that are evolved from the same ancestral character. Such an evolutionary relationship is called homology. An example is shown in Figure 2.2. Each column in the matrix corresponds to a different character and all the characters evolve from the same tree. There are two widely used summary statistics that measure the sequence divergence, the gap- piness and the Normalized Hamming Distance (NHD). Insertion and deletion events in the evo- lutionary history create gaps in the multiple sequence alignment. Gappiness is calculated by the percentage gap characters in the multiple sequence alignment matrix. For a pairwise sequence alignment, the classical Hamming Distance [50] is the number of sites in the alignment that contain different characters, which can be considered as a nucleotide pair with two different nucleotides. The hamming distance of multiple sequence alignment is the sum of the Hamming Distances of all pairs of sequences contained in the multiple sequence alignment. The normalized Hamming Distance (NHD) is the hamming distance normalized by the length of the sequence. 2.1.1 Sequence evolution The evolution process is a series of changes in biomolecular sequences through multiple types of mutations. Substitution, also known as point mutation, is one type of sequence mutation. Substitution is the event that a single base changes into another base. Figure 2.1 shows an example of the substitution process of a single character. This example includes two substitution events along the phylogenetic tree. Note that, we can observe the sequences at leaf nodes, but not the sequences at internal nodes and root nodes. 7 Figure 2.1: Illustration of character evolution. Characters evolve on a tree. The branching structure of the tree guides the mutation process. The red dots on the branches represent the substitution events. Another important sequence evolution process is insertion/deletion. They can make more complicated changes to the genetic sequences. Insertions and deletions, also known as indels, are the events that occur when genetic materials are inserted or deleted during the evolutionary process. Insertion and deletion events can change the length of the sequences. The characters that used to align in the same column will no longer align with each other. Compared with the evolutionary process with only substitution events, the existence of indels makes it more challenge to identify the homologous characters and their evolutionary relationship. An example is shown in Figure 2.2. With the existence of insertion and deletion events, the sequences at leaf nodes have different sequence lengths. The characters at the same position on the sequences are no longer homologous. We will discuss the MSA algorithms in Section 2.1.2. These methods reconstruct the homology relationships of the genetic sequence and produce an MSA, a matrix in which each column contains only homologous characters. Dashes in the alignment represent gaps created by the historical insertion and deletion events. An example is shown in Figure 2.2c. The first two dashes in the second sequence are caused by a single deletion event, labeled as a green dot on the tree in Figure 2.2a. Dashes in the fourth column are caused by an insertion event, labeled as a yellow dot. 8 (b) Unaligned sequences. (a) A tree of the sequence evolution. (c) Multiple Sequence Align- ment (MSA). Figure 2.2: An example of Multiple Sequence Alignment (MSA). (a) A tree shows the evolutionary history of sequences. Evolutionary events make changes to the content of the observable sequences at leaf nodes. (b) Sequences at the leaf nodes can be observed. The unaligned sequences form a character matrix. (c) Multiple sequence alignments can be used to build a character matrix, each column consists of characters that evolve from the same ancestral character. Evolutionary models describe the substitution events on a single site of the biomolecular sequence. Most evolutionary models assume that the sites evolve down a tree where nucleotide substitutions under a Markov process and all sites evolve independently and identically (i.i.d). A rooted tree 𝑇 with edges annotated with substitution rate matrices describes the evolution of a single site down the tree. Besides the model tree, 𝑇, the model also requires a substitution rate matrix. The substitution rate matrix 𝑀 describes the probability of the parent state evolving into a child state along an edge. The rate of base 𝑖 changing to base 𝑗 is represented by element 𝑀𝑖 𝑗 in the substitution rate matrix 𝑀. For DNA sequences, the substitution rate matrix 𝑀 is a 4 × 4 matrix. 𝑎, 𝑏, 𝑐, 𝑑, 𝑒, 𝑓 denote the relative rates that one base changes to another. The frequencies of the bases A, T, C, G are represented by 𝜋 𝐴 , 𝜋𝑇 , 𝜋𝐶 , 𝜋𝐺 respectively. The diagonal elements of M are chosen to ensure that the sum of elements in the corresponding row equals zero. If for all pairs of bases 𝑖, 𝑗, 𝑀𝑖 𝑗 = 𝑀 𝑗𝑖 , the model is considered time-reversible. The most widely used sequence 9 evolution models in phylogenetic analyses are time-reversible. The generalized time-reversible model (GTR) [135] has a substitution rate matrix as follows. ©−(𝑎𝜋𝐶 + 𝑏𝜋𝐺 + 𝑐𝜋𝑇 ) 𝑎𝜋𝐶 𝑏𝜋𝐺 𝑐𝜋𝑇 ª ­ ® −(𝑎𝜋𝐶 + 𝑑𝜋𝐺 + 𝑒𝜋𝑇 ) ­ ® ­ 𝑎𝜋𝐶 𝑑𝜋𝐺 𝑒𝜋𝑇 ® 𝑀 = ­­ ® ® ­ ­ 𝑏𝜋𝐺 𝑑𝜋𝐺 −(𝑑𝜋𝐺 + 𝑑𝜋𝐺 + 𝑓 𝜋𝑇 ) 𝑓 𝜋𝑇 ® ® ­ ® « 𝑐𝜋𝑇 𝑒𝜋𝑇 𝑓 𝜋𝑇 −(𝑐𝜋𝑇 + 𝑒𝜋𝑇 + 𝑓 𝜋𝑇 ) ¬ The Jukes-Cantor model [63] is the simplest substitution model, which is a subset of the GTR model. The Jukes-Cantor model assumes that all base frequencies are equal, 𝜋 𝐴 = 𝜋𝑇 = 𝜋𝐶 = 𝜋𝐺 = 14 , and all substitution rates are equal, 𝑎 = 𝑏 = 𝑐 = 𝑑 = 𝑒 = 𝑓 = 1. The substitution rate matrix of the Jukes-Cantor model shows as follows. 3 1 1 1 ©− 4 4 4 4 ª ­ ® ­ 1 ­ 4 − 34 14 1 ® 4 ® 𝑀 = ­­ ® ® ­ 1 1 3 1 − 4 4 ®® ­ 4 4 ­ ® 1 1 1 − 3 « 4 4 4 4¬ To incorporate the rate heterogeneity across sites into the sequence evolution models. We can multiply the substitution rate matrix with a rate drawn from Γ distribution. The combination of the substitution model with the Γ model produces rate variation across sites [155]. For example, GTR+Γ is the GTR model combined with the Γ model for rate variation. 2.1.2 Multiple Sequence Alignment Estimation In computational biology, the estimation of multiple sequence alignments is one of the most funda- mental problems. Multiple sequence alignment estimation is the initial step of many applications, such as predicting the structure and function of proteins and phylogenetic reconstruction. The following is a definition of a multiple sequence alignment problem. The input is a set of unaligned sequences 𝑆. Add gaps in proper positions to align homologous characters, which 10 are derived from a common ancestral character, into the same column. The output is a multiple sequence alignment 𝐴 that represents the true evolutionary history of 𝑆. However, true evolutionary history is usually not available. The distance between the estimated alignment and the true alignment is not able to be measured. Thus, the multiple sequence alignment problem was simplified to an optimization problem that minimizes the differences between the sequences in the alignment. One measurement of the sequence differences is the sum-of-pairs (SP) criterion. The SP score is the number of mismatched nucleotide pairs in the MSA. The mismatched pairs include pairs of two different non-gap characters and pairs of one non-gap character with one gap character. MSA estimation methods are generalized from the pairwise alignment. The estimation of MSA that minimizes the SP score of other similar metrics has been demonstrated to be an NP-hard problem [10, 144]. Many heuristic methods were designed for the estimation of multiple sequence alignment to find a sub-optimal solution instead, such as progressive alignment methods [38]. Progressive alignment methods first build a guide tree based on the distance of sequences. The pairwise sequence alignment is performed at the parent of two leaf nodes, then recursively performed at internal nodes. The final MSA result will be created at the root, as shown in Figure 2.3. Clustal W [73], MUSCLE [29], MAFFT [64] and T-COFFEE [107] are a few examples of progressive alignment methods. Though progressive alignment has been widely used, there are some disadvantages associated with this type of MSA estimation algorithm. First, the choice of the initial guide tree directly influences the MSA result and the downstream analysis, such as phylogenetic inference. The inferred phylogenetic trees are biased towards the initial guide tree by using such an MSA approach [152]. Furthermore, distantly related sequences tend to be over aligned by progressive alignment [85]. The insertion and deletion events are not evenly penalized in progressive alignment. Many standard metrics are designed for the quantitative measurement of the alignment methods’ performance. The performance measurement usually compares the reference alignment with the estimated alignment. The SP-score is calculated by the percentage of all pairwise homologies in 11 Figure 2.3: Progressive alignment algorithm. The progressive alignment algorithms take a set of sequences as input and build a distance matrix for input sequences. Then a guide tree is estimated based on the distance matrix and pairwise alignment is conducted at each internal node on the tree from leaves to the root. This figure is reproduced from http://readiab.org/. the reference alignment recovered in the estimated alignment. The modeler score is the percentage of all pairwise homologies in the estimated alignment that are found in the reference tree. The pairs score averages the SP-score and the modeler score to penalize false positive and false negative homologies equally. TC score, which is the number of columns that are recovered entirely correctly in the estimated alignment. The most commonly used measurement for the MSA estimation problem is the SP-score. In the following studies, we used software called FastSP [160] to measure the alignment accuracy. 2.2 Phylogenies A phylogeny is a graphical representation that represents the evolutionary history of a group of entities, such as genes and species, which are usually referred to as taxa. There are mainly two types of phylogenies: phylogenetic trees and phylogenetic networks. In the following content, we will discuss about these two types of phylogenies in detail. 12 2.2.1 Phylogenetic Trees One of the most simple and widely used representations is the phylogenetic tree, a directed acyclic graph. A tree 𝑇 consists of a set of vertices 𝑉 and edges 𝐸. The vertices 𝑉 indicate a particular taxon, and the edges 𝐸 indicate evolutionary history. An internal node is a node that has a degree greater than one. The taxon at an internal node represents the ancestral taxon of the descendent taxa. A leaf node or terminal node has a degree of one, which represents an extant taxon. An edge or branch represents the evolutionary relationship between two taxa it connects. The length of the edge in a phylogenetic tree is called branch length, a non-negative number representing the quantitative measurement of the evolutionary changes or evolutionary time between the two related taxa. Trees without branch lengths attached are referred to as topologies. A rooted tree has a root node 𝑣 𝑟 ∈ 𝑉. The root node is the most recent common ancestor of all taxa in this tree. This designation of root vertex allows us to associate ancestor-descendant relationships to the vertices. For example, if a node 𝐵 is located on the path from node 𝐴 to root, node 𝐵 is considered an ancestor of node 𝐴, as illustrated in Figure 2.4a for examples. It usually requires additional knowledge about the taxa included in the tree or molecular clock assumption to root a phylogenetic tree. There are many different rooting methods. Some popular ones include rooting a tree utilizing an outgroup taxon, and rooting a tree by estimating the time of speciation. One method utilizes an outgroup taxon, which has a far more evolutionary relationship than all other taxa under consideration. The other method roots a phylogenetic tree based on the estimated time between speciation. This method can only be used if we assume that the evolution rate of molecular data, which we use to reconstruct a phylogeny, is constant. However, this assumption is often violated in real datasets. Unrooted trees have no such ancestor-descendant relationship. A phylogenetic tree is considered a binary tree if all vertices have a degree of three. A phylogenetic tree is called a polytomy if at least one internal node has a degree greater than three. Polytomies represent unresolved evolutionary relationships. Figure 2.4a shows an example of a rooted binary phylogenetic tree over six taxa, where the direction of evolution is the direction from the root node to the leave nodes. The evolutionary 13 Vr A B C D E F G H (a) A rooted phylogenetic tree. A F B E C D (b) A unrooted phylogenetic tree. A F G B E C D (c) A polytomy tree. Figure 2.4: An example of phylogenetic tree. (a) A rooted phylogenetic tree. 𝑉𝑟 is the root node. Taxon 𝐵 is an ancestor of taxon 𝐴. (b) The unrooted version of the same tree by removing the root node 𝑉𝑟 . (c) An polytomy tree. Node 𝐺 has a degree of four, which represents a polytomy node. relationship among six taxa can be inferred from the tree. Figure 2.4b shows an unrooted version of the phylogenetic tree of Figure 2.4a by removing the root node 𝑉𝑟 . Figure 2.4c shows an example of a polytomy tree. 14 A clade is a rooted subtree in a phylogenetic tree 𝑇, defined by a node 𝑣 in 𝑇 that roots the clade. Taxa included in a clade have a closer evolutionary relationship with each other than the rest of taxa in the same phylogenetic tree. Bipartition is a similar conception as the clade in unrooted trees. A bipartition is defined by edges. By removing an edge 𝑒, the unrooted phylogenetic tree 𝑇 is split into two subtrees 𝑇1 and 𝑇2 , and the set of taxa is also split into two subsets. The leaf bipartition denotes {𝑙1 |𝑙2 }, where 𝑙1 ∪ 𝑙2 is the complete set of leaves of 𝑇. A phylogeny does not necessarily need to be tree-like. In certain scenarios, more complex representations such as phylogenetic networks fit better than the tree structure. 2.2.2 Phylogenetic Network Though the phylogenetic tree is the most widely used representation of the evolutionary history of a group of species, the true evolutionary history is not always tree-like. Certain evolutionary pro- cesses, such as horizontal gene transfer (HGT), hybridization and recombination, involve reticulate gene flow between two sibling taxa, which is better represented by a phylogenetic network. By adding reticulation edges, we can better model the horizontal flow of genetic material. A phylogenetic network 𝑁 is a rooted, directed, acyclic graph (DAG) defined on a set of taxa 𝑆. Vertices 𝑉 of 𝑁 is defined as 𝑉 = {𝑟} ∪ 𝑉𝐿 ∪ 𝑉𝑇 ∪ 𝑉𝑁 , where • 𝑖𝑛𝑑𝑒𝑔(𝑟) = 0 (𝑟 is the root of 𝑁); • ∀𝑣 ∈ 𝑉𝐿 , 𝑖𝑛𝑑𝑒𝑔(𝑣) = 1 and 𝑜𝑢𝑡𝑑𝑒𝑔(𝑣) = 0 (𝑉𝐿 are the external tree nodes or leaves of 𝑁); • ∀𝑣 ∈ 𝑉𝑇 , 𝑖𝑛𝑑𝑒𝑔(𝑣) = 1 and 𝑜𝑢𝑡𝑑𝑒𝑔(𝑣) ≥ 2 (𝑉𝑇 are the tree nodes of 𝑁); • ∀𝑣 ∈ 𝑉𝑁 , 𝑖𝑛𝑑𝑒𝑔(𝑣) = 2 and 𝑜𝑢𝑡𝑑𝑒𝑔(𝑣) ≥ 1 (𝑉𝑁 are the reticulation nodes of 𝑁); 𝐸 ⊆ 𝑉 × 𝑉 are the edges of network 𝑁. There are two types of edges, reticulation edges who point to reticulation nodes, and tree edges who point to tree nodes. The leaves of the network are bijectively labeled by the elements of 𝑆 A phylogenetic network induces a set of trees, called 15 r r r A B C D E C D E C D E (a) A phylogenetic network 𝑁. (b) A induced tree 𝑇1. (c) A induced tree 𝑇2. Figure 2.5: An example phylogenetic network. (a). A phylogenetic network 𝑁, 𝑟 is the root node, 𝐴 is a tree node, 𝐵 is a reticulation node, and 𝐶, 𝐷, 𝐸 are leaf nodes. The edge in red is a reticulation edge. The tree 𝑇1 and 𝑇2 in (b) and (c) are the inducted trees of 𝑁. induced trees. The induced trees are obtained by keeping only one reticulation edge for each reticulation node in the network. Figure 2.5 shows an example of a phylogenetic network 𝑁 on taxa set 𝑆 = {𝐶, 𝐷, 𝐸 }. In the network 𝑁, 𝑟 is the root node with an indegree of 0. 𝐴 is an example of a tree node with an indegree of 1 and an outdegree of 2. 𝐵 is an example of the reticulation node with an indegree of 2 and an outdegree of 1. 𝐶, 𝐷, 𝐸 are leaf nodes with an indegree of 1 and an outdegree of 2. The edge in red is a reticulation edge that points to a reticulation node. Before the reticulation event, all taxa evolved under a tree structure. Following the reticulation event, two lineages merge to form a new lineage, with the genetic material of the new lineage inherited from either ancestral lineage, which can be represented by two induced trees, 𝑇1 and 𝑇2 , which are part of the network 𝑁. 2.3 Gene Trees and Species Trees Phylogenetic trees can be used to represent evolutionary history for different types of entities. Species and genes are two closely related entities that can be modeled by phylogenetic trees. In a species phylogeny, each leaf node represents the entire population of a species. Each internal node represents a speciation event, where the population of one species is split into subsequent species through multiple mechanisms [19], such as allopatric speciation. In an allopatric speciation event, a population is split into two populations due to geographical isolation. Each population then 16 evolves independently until it becomes a new species. The succession of such speciation events can be captured by a tree structure called the species tree. The genetic material of extant species can help us identify ancient speciation events. Gene phylogeny shows the evolution path of particular genes, which are short regions of the genome across all involved species. Gene trees are not necessarily the same as the species tree or other gene trees. For example, humans and chimpanzees have a closer evolutionary relationship than humans with gorillas in some parts of the genome than in other parts [28]. The inconsistency between gene trees and the differences between gene trees and the species tree is so-called gene tree discordance or incongruence. Many evolutionary events, such as hybridization, horizontal gene transfer, recombination, and gene duplication and loss, can influence genome evolution and cause discordance among the evolutionary histories of genes and species [89, 23]. Some of those evolutionary events can cause genes to have different phylogenies compared with the species tree, but they do not contradict the tree-like structure of the species phylogeny. The other evolutionary events may result in complex evolutionary histories, which is hard for tree structure at the species level to depict. The species tree representation is constructed based on the assumption that one species evolved from only one ancestral species. Such an assumption is accurate for many cases of vertical evolution. However, those evolutionary events involved in reticulation evolution may break the tree structure. For example, hybridization events where a new species evolves as a result of hybridization between two species [111]; horizontal gene transfer (HGT) events where genetic material can be obtained from the environment or other sibling species [44, 11]. Under such scenarios, gene phylogenies can still be represented by a tree structure, however, species phylogenies can not be a tree structure. Species networks with reticulation edges are better form of species phylogeny with such gene tree discordance. In such cases, gene phylogenies are still trees, but the species phylogeny is best modeled as a network [101, 61]. 17 2.3.1 Coalescence Before we introduce the concept of coalescence, let’s first introduce some related concepts. Recombination is an evolutionary process in which individuals of current generations have recombined the genomes of multiple individuals in previous generations. In this case, the ancestor of the genetic materials may vary. When we move along on a particular chromosome with recombination, the ancestors, where the genetic material is inherited, may switch for different regions on the chromosome. As the accumulation of the recombination events, the chromosome is divided into multiple regions, where sites in the same region share the same evolutionary history, but the evolutionary process may vary from one region to another. A coalescent gene, also known as a c-gene, is a continuous region on a genome where all sites share the same evolutionary history and where there are no breakpoints caused by recombination events [42, 27, 58]. Note that, the term "gene" usually refers to stretches inside the genome that can be translated into proteins and perform certain functions, which is different from the definition of coalescent gene. The coalescent gene is considered as the most basic unit of the genome for phylogenetic analyses, where only a single evolutionary history is included. The accumulation of recombinations results in the formation of multiple coalescent genes. The evolutionary history of one coalescent gene is represented by one tree. Different coalescent genes may have different evolutionary histories. This phenomenon can be mathematically described by the coalescent process [66]. The coalescent process models the gene variances of a genealogy. This process is easier to understand on the population level. It starts with two gene variants, which are called alleles, of the present time, and then continuously traces back their parent alleles in previous generations until we reach a point where the two alleles share the same ancestor. This point is where the two alleles coalesce. The coalescent model was first proposed by Kingman in 1982 [66]. This model is widely used to estimate parameters such as the recombination rate, population size, and migration rates. Kingman’s coalescent model assumes constant population size, random mating, a large enough population, and no overlap generation. Under such assumptions, the number of generations, which 18 is the time that two random alleles coalesce, is exponentially distributed [66]. Let 𝑇𝑖 be the waiting time for any two alleles from 𝑖 sampled alleles be coalescent, and 𝑁 𝑒 be the population size, the waiting time 𝑇 is calculated by following equations, when 𝑖 = 2 : 1 1 − 𝑁𝑡 𝑇2 (𝑁 𝑒 ) ∼ exp(𝑡; 𝜆 = )= 𝑒 𝑒 𝑁𝑒 𝑁𝑒 More broadly: 𝑖 𝑖 ( 2𝑖 ) 2 2 −𝑡 𝑁 𝑒 𝑇𝑖 (𝑁 𝑒 ) ∼ exp(𝑡; 𝜆 = )= 𝑒 𝑁𝑒 𝑁𝑒 The coalescent history can be represented by An example is shown in Figure 2.6. (a) The coalescent process. (b) The multi-species coalescent. Figure 2.6: Illustration of coalescence and multi-species coalescence. (a) An illustration of the coalescent process. Each row represents a generation. Generations have no overlap with each other. The new generation is generated by randomly sampling from the previous generation. Each dot represents a gene copy. The blue line represents the inheritance relationship, where one gene is copied from an ancestor gene. Parents are selected at random. When populations are separated, and gene copies only come from samples from the same population. A coalescence event is a point when two lineages merge into a common ancestor. The coalescent process forms a lineage tree. 2.3.2 Multi-Species Coalescent (MSC) Model In the previous section, we discussed the coalescent model under a single population. For phyloge- netic analyses involving multiple species and multiple populations, the general framework can be 19 extended to the Multi-Species Coalescent (MSC) model [114]. The MSC model tree is a species tree with branch length represented by coalescent units. Each leaf node of the model tree represents a population of species with fixed population size. Each branch represents one instance of the Kingman coalescent process. Random mating happens between individuals within the same popu- lation. At the internal nodes, where speciation events happen, the lineages that have not coalesced yet in the child populations are moved to the parent population. The coalescent process continues in the parent population. An example of the multi-species coalescent process is shown in Figure 2.6b. The MSA model traces alleles back in time utilizing the following procedure. For three leaf species, A, B, and C and their parent populations, as shown in Figure 2.6b. From each leaf species, we sampled 𝑘 different individuals. At the terminal branch leading to species A, we start with 𝑘 𝑎 individuals at the bottom and trace back the Kingman coalescence for 𝑡 𝑎 generations, where 𝑡 𝑎 is the length of the branch. During this time, some alleles coalesce, and some do not. At the start of the branch, 𝑟 𝑎 denotes the remaining alleles that have not coalesced yet. Assume 𝑟 𝑎 ≤ 𝑘 𝑎 , then 𝑘 𝑎 − 𝑟 𝑎 coalescent events happened on this branch. A similar process also happens in species B and C. Repeat this process on all branches until all the alleles coalesce into the root branch. The MSC model assumes that the coalescent histories in different branches of the species tree are independent. The coalescent history represents a gene tree that evolves inside the species tree. The randomness of the coalescent process may result in different gene trees. Some gene trees may have different topologies compared with the species tree. We will discuss the discordance between the gene tree and species tree in the following section. Incomplete Lineage Sorting (ILS) describes the discordance between gene trees and the species tree. The multispecies coalescent process may result in various gene trees, as the example shown in Figure 2.7 When tracing two lineages from sibling populations back in the multispecies coalescent process, it is possible that two lineages do not coalesce before reaching the nearest ancestral population. If these two lineages do not coalesce, they go further back in time to a deeper ancestral population. 20 (a) A cordant gene tree. (b) A discordance gene tree. Figure 2.7: Illustration of deep coalescence. (a) Multi-species coalescence results in a gene tree inside a species tree. In this example, the gene tree is concordant with the species tree in terms of topology. During the process of coalescence, genes may separate before the species separate, which is called deep coalescence. The most recent common ancestor (MRCA) of gene copies sampled from species B and C is older than the speciation event of B and C. (b) Deep coalescence with incomplete fixation of gene lineages in species lineages may cause discordance between the gene trees and the species tree. In this example, the gene copies sampled from species A and B coalescent first, which is different from the order speciation events. 21 This scenario is called deep coalescence. In the deeper ancestral population, other lineages from other species are also present. Under the random mating assumption, the gene lineages from other species may coalesce with one of the sibling lineages before they coalesce with each other. In this situation, gene trees become discordant with the species tree, and this scenario is called Incomplete Lineage Sorting (ILS). An example of ILS is shown in Figure 2.7b. In the ancestral population of species B and C, the gene lineages from these two species do not coalesce. Both gene lineages go back to the deeper ancestral population of species A, B, and C. In that deep ancestral population, the gene lineages from species A and B coalesce first. Then it coalesces with the linage from C. This ILS results in a gene tree where A and B are sibling species, which is different from the species tree, where B and C are siblings. Each species tree has a unique distribution of gene trees under the MSC model [24], and it can be defined by a unique distribution of the true gene trees [24, 2]. The probability of observing a particular gene tree topology can be calculated by random sampling of a set of gene trees from this distribution. It is possible to infer the true species tree by sampling a sufficient number of gene trees, despite the discordance between the gene trees and the species tree. Yet this is not an easy problem to solve [22]. The most likely gene tree may be inconsistent with the species tree under certain conditions. We will talk about this problem in detail in the following sections. 2.4 Phylogenetic Reconstruction The sequencing data has been used for phylogeny reconstruction for decades [56, 77, 157]. Many different methods and models are designed for phylogenetic inference using sequencing data. In the following content, we discuss the standard pipeline of phylogenetic reconstruction in detail. 2.4.1 Phylogenetic Tree Reconstruction of Single Gene Modern phylogenetic analyses usually take molecular sequence data as input. The first step of the phylogenetic analysis is to collect data from organisms of interest and sequence the samples to get 22 genome data. High throughput sequencing technologies can read the whole genome or transcrip- tome and produce short reads that can be later assembled into longer sequences by computational methods [124]. The most frequently used biomolecular sequence data in phylogenetic analysis are DNA and RNA sequences. Note that the RNA sequences constitute only a small portion of the whole genome since the RNA sequences only contain the genetic materials of the coding genes. The pipeline of phylogenetic tree inference of a single gene mainly consists of two steps: multiple sequence alignment and phylogenetic tree inference. An illustration is shown in Figure 2.8. Figure 2.8: An illustration of the phylogenetic tree pipeline for a single gene. First, homologous sequences are selected and aligned. Then the phylogenetic tree is inferred based on the alignment result. The figure is adapted from [33]. The phylogenetic reconstruction pipeline takes a set of unaligned sequences as input. First, the sequences are aligned into MSA. Then the MSA estimated in the previous step is used as input for the phylogenetic tree inference. Like many other data analysis pipelines, one problem with this two-phase pipeline is that the quality of alignment estimated in the first step impacts the downstream phylogenetic tree inference [15, 109, 85, 143, 78]. The basic concepts and estimation methods of MSA are introduced in Section 2.1. There are mainly four types of phylogenetic tree reconstruction methods, distance-based meth- ods, Bayesian methods, Maximum Parsimony (MP) methods, and Maximum Likelihood (ML) methods. These tree inference methods usually reconstruct the phylogenetic tree based on the historical substitution events. The other more complicated evolutionary events are ignored, such as insertions and deletions. Gaps in the input MSA are treated as missing values. Though the standard version of these methods ignores other evolutionary events, it is possible to extend these methods 23 to more complex evolutionary events. One important conception of a statistical inference method is statistically consistent. A method is considered statistically consistent when the inference value converges to the true value as the amount of input data increases. Many statistically consistent methods have been developed for the phylogenetic reconstruction. For example,*BEAST [44], MP-EST [124], and ASTRAL [151, 108] are some of the most widely used phylogenetic tree inference methods that are statistically consistent. In the following content, we introduce four types of phylogenetic tree inference methods in detail. Distance-based methods infer the evolutionary relationship based on the distance between input taxa. First, we construct a distance matrix that contains pairwise distances between all possible taxa pairs. Then the distance-based methods utilize the distance matrix to reconstruct a phylogenetic tree. Bayesian methods calculate the posterior probability distribution of trees utilizing the prior probability of a particular tree and the likelihood of the input data. Given the prior probability, the likelihood of the data, and the correctness of the likelihood model, the posterior probability of a particular tree is the probability of this tree being the true tree. Bayesian methods usually search the tree space by utilizing the Markov Chain Monte Carlo (MCMC) algorithm. Maximum Parsimony (MP) methods infer a phylogenetic tree by searching for the tree that best explains the observed sequencing data with the minimum number of substitution events. The search for the best tree with the lowest score in the entire tree space has been shown to be NP-hard [20]. For datasets with a large number of taxa, heuristic search algorithms are used for the tree search [134, 9, 41, 43]. Heuristic methods utilize the hill-climbing algorithm to approach the best solution progressively, but there is no guarantee of the optimal solution. Maximum Likelihood (ML) methods reconstruct a phylogenetic tree using the following criteria. Given a sequence evolution model and a set of sequencing data, assume that the sequencing data is generated under the evolution model and search for a tree with the maximum likelihood of producing the observed sequencing data. 24 Finding the ML tree is also proved to be NP-hard [117] as the MP tree. Therefore, many heuristic algorithms have been developed. Instead of finding the ML tree, the heuristic algorithms search for an approximation of the ML tree [37, 98, 112]. There are some widely used ML tree methods, such as PhyML [49], FastTree [112] and RAxML [130]. 2.4.2 Branch support It is not an easy task to infer phylogenetic trees. Phylogenetic trees inferred by any phylogenetic tree inference methods are expected to contain errors. However, we do not have access to the true evolutionary history except in experiment settings [132]. Besides phylogenetic tree estimation, it is also crucial to have a quantitative measurement for the confidence of the inferred tree and individual branches in the inferred tree. The posterior probability distribution of trees, which is by Bayesian methods, can be used as tree support. For the other methods, the support of the inferred tree is usually calculated by the bootstrap method [55, 36]. The bootstrap support estimation method first samples a sufficiently large number of replicate datasets at random. Then trees are inferred for each bootstrap replicate, respectively. A sample of the possible universe of the data that we could have observed is provided by the bootstrap replicates. For each branch in the estimated tree, the frequency that this branch appears in the bootstrap replicates is used as the branch support. 2.4.3 Phylogenetic Tree Reconstruction of Multiple Genes In the previous section, we introduced the standard phylogenetic reconstruction pipeline for a single gene. The limitations of such phylogenetic analysis are rather obvious. First, a single gene usually contains a few hundred to thousand base pairs. Limited sites restrict the phylogenetic signals carried by one gene. Another problem is that the gene tree topologies often disagree with the species tree due to evolutionary events, such as hybridization and horizontal gene transfer. As 25 (a) Pipeline of concatenation methods. (b) Pipeline of summary-based methods. (c) Pipeline of co-estimation methods. Figure 2.9: An illustration of the phylogenetic tree pipeline for multiple genes. (a) Concatenation methods first concatenate the alignments of genes into a supermatrix. Then the species tree is inferred based on this super matrix. (b) Summary-based methods infer gene trees from the gene alignments first. Then the species tree is inferred from all the gene trees. (c) Co-estimation methods estimate gene trees and species trees simultaneously in a single statistical inference. discussed in Section 2.3, the discordance between gene trees and the species tree is very common. Even though we have a perfect computational method that can infer a completely correct gene tree from given sequencing data, it is possible that the species tree has a different topology. Thus, it is critical to comprehensively analyze multiple genes and integrate the overall gene tree distribution when inferring the species tree. Phylogenetic reconstruction with multiple genes can potentially produce a more accurate species tree due to increased input data and better estimation of the gene tree distribution. There are mainly three types of multi-gene phylogenetic reconstruction pipelines: concatenation, summary-based, and co-estimation methods. An illustration is shown in Figure 2.9. 26 In the following content, we introduce these three types of multi-gene phylogenetic tree in- ference pipelines in detail. Concatenation methods are the most basic multi-gene phylogenetic reconstruction pipelines, where all the input sequencing data is simply concatenated into one su- permatrix, and the phylogenetic tree is inferred based on this supermatrix. This method takes into account the statistical power provided by the entire dataset. With the gene tree discordance, the phylogenetic tree inferred from a single gene of a small set of concatenated genes has a higher probability of disagreeing with the true species tree. In contrast, the concatenated analysis of a sufficient number of genes ignores the conflicting signals shown in different genes and produces a fully resolved species tree with maximum support [120]. For the concatenated analysis, we assume that ILS is the only source of discordance between the true gene trees and the species tree. Previous simulation studies showed that the concatenation method might give wrong species trees with high support under this assumption [31, 72, 83, 68]. One reason that may cause the wrong species tree is that the most frequent gene tree can have a different topology than the species tree. The summary-based methods usually take two steps in the phylogenetic tree inference. First, gene trees are inferred independently. Then the gene trees are summarized to infer the species tree. The idea of the summary-based phylogenetic tree reconstruction method comes from the fact that the species tree can be uniquely defined by the probability distribution of gene trees under the MSC model, which we introduced in detail in Section 2.3.2. We can calculate the gene tree distribution by the sequencing data of a sufficiently large number of genes, then estimate the species tree using the gene tree distribution under the MSC model. One big challenge for the summary-based method is that the accuracy of the inferred gene trees has a great impact on the downstream species tree inference. Hign estimation errors in the gene trees, which are very common for many phylogenomic datasets, can result in reduced accuracy of the inferred species tree [96, 97]. There are many summary-based methods that have been developed to infer species trees by summarizing the gene trees. One of the early approaches uses maximum parsimony criteria on 27 minimizing deep coalescence (MDC) [89, 90, 137, 162] to infer the species tree. The main drawback of the summary-based method is that the gene trees are inferred indepen- dently, which limits the available data for each gene tree inference. This problem can be solved by inferring gene trees and the species tree at the same time. Such a method is called the co-estimation method. Since the gene trees are not entirely independent from each other, co-estimation of all gene trees and species tree is the best way to retain the dependence among genes during the phylogenetic reconstruction process. Co-estimation methods infer both gene trees and species tree in one statistical inference, which ensures sufficient data for the inference of each gene tree and retains the dependence among genes. Previous studies have shown that the gene trees inferred by the co-estimation methods have higher accuracy than those inferred by the independent estimation [154, 6]. There are two widely used co-estimation methods, BEST [82] and *BEAST [54]. Both methods conduct Bayesian inference via the MCMC algorithm to simultaneously infer the probability distributions of all gene trees. 2.4.4 Phylogenetic Network Reconstruction In previous sections, we introduced the phylogenetic reconstruction of trees from sequencing data. The MSC model plays a critical role in modeling the gene tree discordance caused by ILS in phylogenetic tree inference. As we introduced in the previous section, the reticulate evolutionary relationship between closely related species requires phylogenetic reconstruction methods that consider both ILS and reticulation events. The phylogenetic network has been proposed to represent such a complicated evolutionary history [60, 61, 101, 21]. Similar to the phylogenetic tree, a phylogenetic network topology is also represented by a rooted, directed, acyclic graph. The major difference between the phylogenetic tree and the network is that the phylogenetic network allows reticulation edges that model the horizontal gene flow. Reticulation edges represent gene flow between two sibling species or populations that exist in the same period of time. The phylogenetic network describes the evolutionary history of species, and gene trees grow within the species network. 28 The MSC model has been used in the phylogenetic tree inference from sequencing data of multiple genes. It has been extended to the phylogenetic network and is used for network inference from multi-gene sequencing data [161, 158]. Many computational approaches are developed for the phylogenetic network reconstruction from multiple genes, for example, the maximum parsimony method, the maximum likelihood method, and the Bayesian inference method. We introduce these methods in detail in the following content. The Maximum Parsimony (MP) method searches for the best tree under the minimizing deep coalescence (MDC) criterion [89]. The MDC criterion was first proposed for phylogenetic tree reconstruction and then extended to the reconstruction of the phylogenetic network [90, 137]. The MP method only considers the gene tree topologies. There are mainly two problems with the MP method and the MDC criterion. One problem is that we cannot estimate parameters other than the network topology under the MDC criterion. Another problem is that the MP method with the MDC criterion is not statistically consistent, especially when short branches exist. The Maximum Likelihood (ML) method was proposed to solve these two problems [159] with the multispecies network coalescent model [158], which models the stochastic process of gene trees growth in the branches of a phylogenetic network. The ML method searches for the optimal phylogenetic network that maximizes the probability of observing given gene trees. Hill-climbing heuristics can be applied to ML inference to improve search efficiency. The ML method can take into account both gene tree topologies and branch lengths. Other than the computational complexity, the ML method is easy to overfit on the input data. The ML method tends to give a higher likelihood score to a more complicated network model [4]. For this reason, the ML method may generate a sub-optimal solution. This may not be a problem for the phylogenetic tree reconstruction, but it is a big obstacle for the network inference. Different from the inference of the phylogenetic tree, the phylogenetic network inference needs to determine the number of reticulation events. However, ML inference prefers more complicated networks. Networks with more reticulation edges will have a higher likelihood score. This makes the ML 29 method not statistically consistent. Adding more reticulation edges to the true network increases the likelihood. Thus, the number of reticulation events needs to be carefully chosen for the ML inference. The Bayesian inference method can reduce the model complexity by regularizing the prior distribution. The bayesian method infers the posterior distribution of the network given a set of rooted gene tree topologies. The reversible-jump Markov chain Monte Carlo (RJMCMC) is often used for the Bayesian inference of the phylogenetic network. 2.4.5 Evaluation Metrics We evaluate multiple methods through experiments in this dissertation. Both simulated and empirical datasets are used in the experiments. In simulated experiments, we generate synthetic data under sequence evolution models with various procedures, where the simulation process is under control, and the ground truth is known. We apply multiple methods to estimate the MSA and reconstruct the phylogeny. Since we have access to the ground truth, it is easy to measure the errors in the phylogenetic reconstruction process. In empirical experiments, we use empirical datasets, where the ground truth is often not available. Therefore, we evaluate the method performance using hand-curated reference alignments and trees or prior knowledge from the previous studies. Many metrics are developed to measure the similarity of two phylogenies. Here, we will introduce some widely used metrics that measure the similarity between two given phylogenies. Note that not all the measurements of tree similarity are symmetric. In the following content, we compare a reference phylogeny to an estimated phylogeny of the same set of taxa. 2.4.5.1 Comparison of Phylogenetic Trees Phylogenetic trees could be determined by a set of bipartitions. A bipartition is a unique split of leaves generated by deleting one internal edge in one unrooted tree. By deleting the edge, the leave nodes are split into two sets. Given an unrooted tree 𝑇, each branch defines a bipartition of taxa. 30 The bipartitions of 𝑇 is 𝐶 (𝑇) = 𝜋(𝑒) : 𝑒 ∈ 𝐸 (𝑇), where 𝜋(𝑒) is the bipartition on the leaf set of 𝑇 produced by removing the internal edge 𝑒 [147]. The False Negative (FN) rate is calculated by the proportion of bipartitions that are shown in the reference tree but not in the estimated tree. This metric is also known as the missing branch rate. The False Positive (FP) rate is the opposite of the FN rate, which is calculated by the proportion of bipartitions that appear in the estimated tree but not in the reference tree. The Robinson-Foulds (RF) distance [116] is the total number of bipartitions that are different between the reference tree and the estimated tree, which includes both false-positive bipartitions and false-negative bipartitions. The Robinson-Foulds (RF) distance[116] of two unrooted phylogenetic trees 𝑇1 and 𝑇2 is defined as 𝑑 (𝑇1 , 𝑇2 ) = |𝐶 (𝑇1 )\𝐶 (𝑇2 )| + |𝐶 (𝑇2 )\𝐶 (𝑇1 )| In the following sections, we use the normalized RF rate, which is the proportion of bipartitions that are different between the two trees. The normalized RF rate is the mean of FN and FP rates. The definition of the normalized RF rate shows as following. 1 |𝐶 (𝑇1 )\𝐶 (𝑇2 )| |𝐶 (𝑇2 )\𝐶 (𝑇1 )| 𝑑˜(𝑇1 , 𝑇2 ) = × ( + ) 2 𝐶 (𝑇1 ) 𝐶 (𝑇2 ) The normalized RF rate ranges from 0 to 1. When two trees are identical, the distance equals 0. FN rate, FP rate, and normalized RF rate are equal when the reference tree and the estimated tree are bifurcating. RF distance is one of the most widely used metrics for tree similarity measurement. However, when the reference tree is not bifurcating, it is inappropriate to use RN distance to compare the reference tree and the estimated tree. 2.4.5.2 Comparison of Phylogenetic Networks Before we introduce the distance measurement between two phylogenetic networks, let’s first introduce some important concepts. For a phylogenetic network 𝑁, its reduced network 𝑁 0 is 31 obtained by applying the reduction procedure. The reduction procedure is described as follows. Given a phylogenetic network 𝑁, use a single node ℎ to replace each maximal subtree 𝑡, which does not include any network nodes. The replace node ℎ is treated as a symbolic leaf that represents the subtree 𝑡. The equivalence mapping between two phylogenetic networks 𝑁1 and 𝑁2 is defined that, for node 𝑢 in 𝑁1 and node 𝑣 in 𝑁2 , two nodes are considered as equivalent if both nodes are leaf nodes and share the same label, or both nodes have 𝑘 children and 𝑢𝑖 is equivalent to 𝑣𝑖 for 1 ≤ 𝑖 ≤ 𝑘. The distance, which we refer to as reduction-based distance, measures the distance between two phylogenetic networks based on their topologies. For two reduced phylogenetic networks 𝑁1 and 𝑁2 can be calculated by the following equation. 1 Õ Õ 𝑑 (𝑁1 , 𝑁2 ) = ( 𝑚𝑎𝑥{0, 𝜅 𝑁1 (𝑣) − 𝜅 𝑁2 (𝑣0)} + 𝑚𝑎𝑥{0, 𝜅 𝑁2 (𝑢) − 𝜅 𝑁1 (𝑢0)}) 2 𝑣∈𝑈 (𝑁1 ) 𝑣∈𝑈 (𝑁2 ) 𝑣0 is a node in 𝑁2 that is equivalent to 𝑣 in 𝑁1 , similar for 𝑢0, where 𝑢0 is a node in 𝑁1 that is equivalent to 𝑢 in 𝑁2 . 𝜅 𝑁1 (𝑣) refers to the number of nodes equivalent to 𝑣 in network 𝑁1 . 𝜅 𝑁2 (𝑣0), 𝜅 𝑁2 (𝑢), and 𝜅 𝑁1 (𝑢0) are defined similarly. [100] The reduction-based distance is equivalent to the number of rooted sub-networks that appear in one network but not in the other. The reduced distance is very sensitive to small perturbations like the Robinson-Foulds distance. 32 CHAPTER 3 SERES: THE SEQUENTIAL RESAMPLING AND ITS APPLICATION ON MULTIPLE SEQUENCE ALIGNMENT SUPPORT ESTIMATION 3.1 Introduction Resampling is the process of drawing samples from the original set of observations. The resampling methods are widely used in computational biology and bioinformatics for statistical support estimation, especially those non-parametric approaches, such as the standard bootstrap [32] and jackknife [140]. Support estimation utilizing resampling techniques usually includes three steps: drawing multiple resampled replicates from the original observation; performing inference or analysis on each resampled replicate; and comparing results among replicates. The standard bootstrap method, which we refer to as the bootstrap method in the following section, is a widely used resampling method for statistical support estimation. This method independently draws objects with replacements from a population. The bootstrap approach does not require a particular model for the support estimation. However, it assumes that the observations are independent and identically distributed (i.i.d). This assumption does not always hold for biomolecular sequences. Many evolution events produce intra-sequence dependence and functional dependence in biomolecular sequences that are inconsistent with this assumption, such as recombination and hybridization. To solve the dilemma of i.i.d assumption on the biomolecular sequences sampling, Landan and Graur proposed the Heads-or-Tails (HoT) algorithm [70] for the support estimation of multiple sequence alignment (MSA). The main idea of the HoT algorithm is that the statistical inference or analysis of MSAs should not be affected by the direction of the input alignment. That means the analysis result should be the same for the original alignment (head direction) and the reversed alignment (tail direction). The two resampled replicates of the HoT algorithm are the original alignment and the reversed 33 alignment. However, for support estimation, hundreds of resampled replicates are needed. Some follow-up studies proposed new support estimation algorithms that combined the idea of the HoT with the perturbation of the parameters of the progressive MSA algorithms [71, 110, 126]. These new algorithms show advanced performance in the support estimation of MSAs compared with other state-of-the-art methods [65, 107]. In this study, we proposed a new non-parametric resampling approach called SERES, which is short for "SEquential RESampling"[145] that maintains the key property needed for non-parametric resampling, which is the neighbor preservation property. The neighbor preservation property means any pair of bases that are neighbors in the original sequence will always be neighbors in the resampled replicates. SERES utilized an improved format of the HoT algorithm, consecutive random walk, which combined the HoT algorithm with the traditional bootstrap algorithm for the resampling of the biomolecular sequences, so that we can resample many non-parametric replicates that reserve the dependence within a sequence. To test the performance of the SERES resampling method, we applied SERES to the problem of MSA support estimation. The SERES-based support estimation performs comparably or better than the state-of-the-art methods GUIDANCE[110] and GUIDANCE2[126]. 3.2 Methods The SERES resampling method combines the bootstrap method with the HoT algorithm. This method preserves the dependence within the input sequences during the resampling process. The neighboring sites are still neighbors in resampled replicates. The SERES method is capable of producing sufficient resampled replicates for support estimation. SERES can take either aligned or unaligned sequences as input for resampling purposes. Let us first introduce the SERES random walk on the aligned sequences. 34 3.2.1 SERES walks on aligned sequences The SERES random walk is performed on a set of aligned sequences. The input alignment consists of MSA sites, which is a column of aligned nucleotides. The random walk starts at a randomly chosen site. The starting point is chosen uniformly at random from the input alignment. The initial moving direction is also chosen uniformly at random. Then the random walk moves on the input alignment in the initial direction. During each step of the random walk, the current site is sampled to the resampled replicate, and the moving direction could reverse at random with the reverse probability 𝛾. The direction certainly changes at the start and end of the input alignment. The random walk ends when the resampled replicate reaches the length of the input alignment. The SERES random walk could potentially introduce bias to the resampled replicates due to the different reversal probabilities of the input alignment. The start and end sites of the input alignment have a reversal probability of 1. But the other sites have a reverse probability of 𝛾. However, for practical choices of walk length and reversal probability 𝛾 sampling bias is expected to be minimal. The detailed pseudocode for a non-parametric SERES walk on a fixed MSA is shown in the Algorithm 3.1. Algorithm 3.1: SERES walk on aligned sequences 1: procedure SERESWalkOnAlignedSequences(𝐴, 𝛾, numReplicates) ⊲ Input: MSA 𝐴, walk reversal probability 𝛾, number of SERES replicates numReplicates ⊲ Output: list of SERES replicates 2: replicates = <> 3: for 𝑖 = 1 to numReplicates do 4: direction = (rand() > 0.5) ? +1 : −1 ⊲ Uniformly at random (UAR) choose direction (right vs. left) 5: 𝑖 = b length(𝐴) * rand() c + 1 ⊲ UAR draw from [1, length( 𝐴)] ⊲ rand() returns floating point number sampled UAR from [0, 1) 6: replicate = <> 7: while length(replicate) < length(𝐴) do 8: replicate .= 𝐴𝑖 ⊲ read 𝐴𝑖 , which is the 𝑖th character in alignment 𝐴 ⊲ Alignment characters 𝐴𝑖 are one-indexed 9: i+= direction 10: if (𝑖 ≤ 0) or (𝑖 > length( 𝐴)) or (rand() < 𝛾) then ⊲ Reflection of random walk 11: direction *= -1 12: if (𝑖 ≤ 0) or (𝑖 > length( 𝐴)) then 13: i+= direction * 2 ⊲ Always reflect at start/end of alignment 𝐴 14: replicates .= replicate 15: return(replicates) 35 3.2.2 SERES walks on unaligned sequences In previous section, we described the SERES random walk on the aligned sequences. For the aligned sequences, all sequences have the same length. There is no need to worry about the inconsistency of the random walk on the sequences. However, for the SERES resampling of unaligned sequences, we needs to consider the synchronization among sequences. We use a set of anchors to ensure synchronization. Anchors are highly conservative short sequence regions. Sub-sequences in one anchor have high sequence similarity to each other. To estimate anchors for the input unaligned sequences, the best practice is to utilize multiple MSA estimation methods, and then select the most conservative short regions that appears in multiple estimated MSA results. In practical, we found that the highly similar regions in a single estimated alignment is sufficient to produce reasonable anchors. The anchors are defined using following steps. First, estimate a guide MSA for the input sequences. Then we use Average Normalized Hamming Distance (ANHD) to measure the similarity of all possible sequence segments of given anchor length. The indels in the estimated alignments are considered as missing data. Finally, regions with the highest sequence similarity, represented by the lowest ANHD, are selected as anchors. Indices of the unaligned sequences corresponding to the start and end of each anchor are used as barriers. The start and end of the input sequences also serve as barriers. The random walks perform between barriers similar to the random walk on the alignment sequences. The change of direction only happens at the barriers with reverse probability 𝛾. The direction certainly changes at the first and last barriers, which are the start and end of the input sequence. The random walk stops when the longest resampled sequence has reached to the length of the longest input sequence. The application of SERES random walk on the unaligned sequences requires parametric MSA estimation for anchor selection. Therefore, the overall process is considered as semi-parametric. An illustrated example shows in Figure 3.1 on a 5-taxon unaligned sequence dataset. First, we estimate an alignment for the input unaligned sequences. Then five anchors were selected based on the sequence similarity. The anchors’ boundaries and the start and end of the input sequences are 36 used as barriers. Finally, the SERES random walk performs on the input sequences with barriers. The red arrows indicates the random walk paths. The dash line showed the reversal events. The random walk reverses at each encountered barrier with probability 𝛾 and reverses with certainty at the start and end barriers of the input sequence. During the random walk, the sub-sequences between previous barrier and current barrier are sampled to the replicate. The resampling procedure ends when the resampled sequences reached the length criteria. The pseudocode for SERES resampling of a set of unaligned sequences 𝑆 is shown in Algorithms 3.2 through Algorithm 3.4. Algorithm 3.2: SERES resampling of unaligned sequences 1: procedure SERESWalkOnUnalignedSequences(𝑆, 𝛾, numReplicates) ⊲ Input: set of unaligned sequences 𝑆, walk reversal probability 𝛾, number of SERES replicates numReplicates ⊲ Output: list of SERES replicates 2: replicates = <> 3: barriers = <> 4: 𝐴init = ObtainGuideAlignments(𝑆) ⊲ See Algorithm 3.3 5: Ψ = GetAnchorsFromGuideAlignments(𝑆, 𝐴init ) ⊲ See Algorithm 3.3 6: AddTrivialBarriers(barriers) 7: for all (𝑎, ® Ψ do ® 𝑏) 8: barriers .= 𝑎® . 𝑏® 9: for 𝑖 = 1 to numReplicates do 10: replicates .= SERESWalkOnUnalignedSequences(𝑆, 𝛾, 𝑖, barriers) 11: return(replicates) 12: static variable maxReplicateLengthFactor ⊲ Maximum replicate length is factor of longest unaligned sequence length 13: procedure SERESWalkOnUnalignedSequences(𝑆, 𝛾, replicateNum, barriers) 14: direction = (rand() > 0.5) ? +1 : −1 ⊲ UAR choose direction (left vs. right) 15: i = b length(barriers) * rand() c + 1 16: replicate = <> 17: while maxLength(replicate) < maxLength(𝑆) * maxReplicateLengthFactor do ⊲ maxLength(𝑆) is length of longest unaligned sequence in 𝑆 18: if ((𝑖 == 1) and (direction == −1)) or ((𝑖 == length(barriers)) and (direction == +1)) then ⊲ reflect at first or last barrier 19: direction *= −1 20: AsynchronousReadBetweenAdjacentBarriers(𝑆, barriers, 𝑖, direction, replicate) ⊲ read result passed by reference to mutable object replicate 21: i += direction 22: if rand() < 𝛾 then ⊲ change walk direction with probability 𝛾 23: direction *= −1 24: return(replicate) 25: procedure AsynchronousReadBetweenAdjacentBarriers(𝑆, barriers, 𝑖, direction, replicate) 26: j = 𝑖 + direction 27: for 𝑧 = 𝑖 to 𝑛 do 28: replicate[𝑧] .= (direction > 0) ? substr(S[𝑧], barriers[𝑖], barriers[ 𝑗]) : reverse(substr(S[𝑧], barriers[ 𝑗] + 1, barriers[𝑖] + 1) ⊲ substr(𝑥, 𝑖, 𝑗) returns substring in index interval [𝑖, 𝑗) if 𝑖 < 𝑗 or empty string if 𝑖 ≥ 𝑗 29: return ⊲ read result passed by reference to mutable object replicate 37 Algorithm 3.3: Obtain anchors 1: static variable 𝑀 ⊲ MSA methods 𝑀 =< 𝑀1, 𝑀2, . . . > 2: procedure ObtainGuideAlignments(𝑆) 3: alignments = <> 4: for all (𝑚) 𝑀 do 5: alignments .= 𝑚(𝑆) 6: return(alignments) 7: procedure GetAnchorsFromGuideAlignments(𝑆, 𝐴init ) 8: 𝛼 = <> 9: 𝛽 = <> 10: canonicalAlignment = 𝐴init [1] ⊲ anchors are indexed based on a fixed alignment in 𝐴init (WLOG chosen to be the first alignment in 𝐴init ) 11: 𝐶strict = GetStrictConsensusColumns(𝐴init ) ⊲ GetStrictConsensusColumns() returns column indices into first alignment in canonicalAlignment 12: 𝛼® strict = MergeAdjacentColumns(𝐴init , 𝐶strict ) ⊲ merges adjacent columns ⊲ returns array of ordered pairs (® 𝑥 , 𝑦®) where start indices 𝑥® and end indices 𝑦® are indexed based on canonicalAlignment 13: SortAnchors(® 𝛼strict , canonicalAlignment) 14: for 𝑧 = 1 to length(® 𝛼strict ) do 15: for 𝑖 = 1 to 𝑛 do 16: (® 𝑥 , 𝑦®) = 𝛼 ® strict [𝑧] 17: if substr(canonicalAlignment[𝑖], 𝑥® [𝑖], 𝑦® [𝑖]) contains only indels then 18: 𝛼[𝑖] [𝑧] = LookupUnalignedSequenceIndex( GetLastNonIndelIndexInPrefix(canonicalAlignment[𝑖], 𝑥 [𝑖])) 19: 𝛽[𝑖] [𝑧] = 𝛼[𝑖] [𝑧] 20: else 21: 𝛼[𝑖] [𝑧] = LookupUnalignedSequenceIndex( GetFirstNonIndelIndexInRange(canonicalAlignment[𝑖], 𝑥 [𝑖], 𝑦[𝑖] + 1)) 22: 𝛽[𝑖] [𝑧] = LookupUnalignedSequenceIndex( GetLastNonIndelIndexInRange(canonicalAlignment[𝑖], 𝑥 [𝑖], 𝑦[𝑖] + 1)) 23: return(𝛼, 𝛽) 24: procedure SortAnchors(® 𝛼, canonicalAlignment) ⊲𝛼 ® is an array of ordered pairs (® 𝑥 , 𝑦®) where start indices 𝑥® and end indices 𝑦® are indexed based on canonicalAlignment 25: sort (ComputeModifiedHammingDistance(𝑢, canonicalAlignment) <=> ComputeModifiedHammingDistance(𝑣, canoni- calAlignment)) 𝛼 ® ⊲ perl sort syntax ⊲ See Algorithm 3.4 Algorithm 3.4: Modified Hamming distance calculation 1: procedure ComputeModifiedHammingDistance(𝑢, 𝐴) 2: dist = 0 3: ( 𝑥®, 𝑦®) = 𝑢 4: for 𝑖 = 1 to 𝑛 do 5: for 𝑗 = 𝑖 + 1 to 𝑛 do 6: dist += ComputeModifiedHammingDistancePair(substr(𝐴[𝑖],® 𝑥 [𝑖],®𝑦 [𝑖]), substr(𝐴[ 𝑗],® 𝑥 [ 𝑗],®𝑦 [ 𝑗]))  7: return(dist / 𝑛2 ) 8: procedure ComputeModifiedHammingDistancePair(𝑥, 𝑦) 9: alignedLength = length(𝑥) ⊲ aligned sequences 𝑥 and 𝑦 have same length 10: matches = 0 11: for 𝑖 = 1 to alignedLength do 12: if (𝑥 [𝑖] != INDEL) and (𝑦[𝑖] != INDEL) and (x[i] == y[i]) then ⊲ homologies involving indels are penalized as mismatch 13: matches++ 14: return(matches / alignedLength) 38 (a) Estimate consensus alignment on input set of unaligned sequences. s1 AGTCTGGACTATAATGAAAGCCGA s2 AGTCTGGTATAATGAAAGCTGGTACGA s3 AGTCTGTACTATAATGGAAGTGGGGACACGTGGACAGCCGA s4 AGTCTGTACTATAATGCGACACGTGGATAGCCGA s5 AGTCTGTACTATAATGGGAGGAAAGCCGA s1 s2 s3 s4 s5 (b) Obtain anchors on consensus alignment. Barriers (dashed lines) consist of anchor boundaries plus trivial start/end barriers. Anchor Anchor Anchor Anchor Anchor 1 2 3 4 5 s1 T A s2 T A s3 T A s4 T A s5 T A Barriers (dashed lines) (c) Choose an initial barrier and walk direction at random. Begin random walk (red arrow) from first barrier to neighboring barrier. As walk proceeds from one barrier to neighboring barrier, sample unaligned sequences between barrier pairs. s1 T s2 T s3 T s4 T s5 T s1 TA Resampled s2 s3 TA TA sequences s4 TA s5 TA (d) Random walk terminates when resampled sequences reach required length. s1 s2 s3 s4 s5 Reversal Reversal s1 TATAATGAAAGCCGAGCCGAAAGCC Resampled s2 s3 TATAATGAAAGCTGGTACGAGCATGGTCGAAAGCTGGTAC TATAATGGAAGTGGGGACACGTGGACAGCCGAGCCGACAGCC sequences s4 TATAATGCGACACGTGGATAGCCGAGCCGATAGCC s5 TATAATGGGAGGAAAGCCGAGCCGAAAGCC Figure 3.1: An example of SERES resampling random walk on unaligned sequences. First, we estimate an alignment for the input unaligned sequences. Then, a set of anchors with the highest similarity is estimated using the estimated alignment. The anchors’ boundaries and the start and end of the input sequences are used as barriers. Finally, the SERES random walk performs on the input sequences with barriers. The random walk starts at a randomly selected barrier and moves to a randomly selected direction to the next barrier. It reverses at each encountered barrier with probability 𝛾 and reverses with certainty at the start and end barriers of the input sequence. During the random walk, the sub-sequences between the previous barrier and the current barrier are sampled for the replicate. The resampling procedure ends when the resampled sequences reach the length criteria. 39 3.2.3 Performance study In this study, we applied the SERES resampling approach to the MSA support estimation problem and measured the performance of the SERES-based support estimation. The MSA support estima- tion is used to quantify the confidence of the estimated MSA result. For an estimated alignment, support values are inferred from multiple resampled replicates for each nucleotide-nucleotide ho- mology in the estimated alignment. The support value indicates the confidence of the aligned residue pair. Many computational methods are designed to solve this problem, such as PSAR[65], TCS[18], HoT[70], GUIDANCE[110] and GUIDANCE2[126]. Among those existing methods, GUIDANCE, which we refer to as GUIDANCE1 in the following section, and GUIDANCE2 are two state-of- the-art methods for MSA support estimation. Both algorithms utilize the uncertainty of the guide tree of the passive MSA methods. By using different guide trees for the passive MSA alignment, multiple alternative MSAs are produced. Then those alternative MSAs are used to calculate the support values for the residue pairs in the input MSA. In addition, GUIDANCE2 also generates alternative MSAs by varying the gap penalty score and co-optimal MSA solution. Both GUIDANCE1 and GUIDANCE2 use the standard bootstrap to generate alternative guide trees. Our study adopted SERES to generate alternative guide trees for the passive MSA alignment. First, we applied the SERES random walk on the input unaligned sequences to generate 100 resampled replicates. The reversal probability 𝛾 = 0.5. Then, we realigned the sampled sequences using MAFFT with default settings. We also investigated MSA estimation using ClustalW [88], and FSA[12]. 𝑘 c anchors with anchor size of 5 bp and a minimum Each SERES replicate utilized a total of b 20 distance between neighboring anchors of 25 bp, where 𝑘 is the length of the input alignment 𝐴. The anchor parameters were selected according to our additional experiments on the impact of the anchor parameter settings, where we found that the SERES-based MSA support estimation is robust to the selection of anchor numbers and anchor size. More details is included in Section 3.3.1. The re-estimated alignments are used as alternative MSAs for the downstream steps of GUID- 40 ANCE1 and GUIDANCE2. To further explore the impact of the parameter choices, we conducted additional SERES- based support estimation experiments with varied parameter settings. Each set of experiments manipulated one parameter setting. The parameters include the number of anchors, anchor length, or the method used to estimate the input MSA. Other than the selected manipulate parameter, all the parameters used default settings for SERES-based support estimation. The number of anchors was selected from the set {3, 5, 20, 50, 100}. Anchor length in bp was chosen from the set {3, 5, 10, 30, 50}. Three different methods were used for estimating an input MSA: ClustalW [88], MAFFT [64], and FSA [12]. 3.2.4 Simulated Data We simulated the datasets under 10 model conditions to comprehensively evaluate the performance of the SERES-based MSA support estimation with different levels of complexity of the evolutionary processes. Model conditions for 10 taxa and 50 taxa are named from A to E, representing increases in sequence diversity. Parameters of model conditions are shown in Table 3.1. We used r8s version 1.7[123] to sample a random model tree under a birth-death process. INDELible version 1.03[39] takes the model tree topology and simulates nucleotide sequences and true alignment according to the guide trees under the General Time-Reversible (GTR) substitution model. We used the simulation parameters from Liu et al. 2012 study[80]. The parameters of the GTR substitution model[118] and the indel model comes from Liu et al. 2012 study[80]. The length of the simulated sequences was 1000bp. This simulation process was repeated twenty times independently to generate twenty replicates for each model condition. All the results were the average of twenty replicates. The summary statistics of the simulated dataset are shown in Table 3.1. Additional summary statistics of estimated alignment produced by ClustalW and FSA shows in Table 3.2. To explore the impact of gap length distribution, our study also included 10-taxon model conditions which utilized the long gap length distribution from the study of Liu et al. [80] instead 41 Model Number Tree Insertion/deletion True align Est align condition of taxa height probability NHD Gappiness length length SP-FN SP-FP 10.A 10 0.4 0.13 0.297 0.474 1965 1552 0.294 0.341 10.B 10 0.7 0.1 0.394 0.512 2165 1564 0.483 0.533 10.C 10 1 0.06 0.514 0.526 2163 1554 0.657 0.684 10.D 10 1.6 0.031 0.599 0.486 1874 1508 0.747 0.753 10.E 10 4.3 0.013 0.693 0.465 1849 1613 0.945 0.943 50.A 50 0.45 0.06 0.281 0.516 2044 1786 0.086 0.088 50.B 50 0.7 0.03 0.398 0.475 1936 1714 0.106 0.103 50.C 50 1 0.02 0.514 0.498 2048 1703 0.245 0.230 50.D 50 1.8 0.012 0.594 0.471 1945 1712 0.455 0.419 50.E 50 4.3 0.004 0.688 0.459 1890 2319 0.963 0.948 Table 3.1: Simulated datasets: parameter values and summary statistics. The simulation model condition parameters consist of the number of taxa, model tree height, and insertion/deletion probability. Each model condition corresponds to a distinct set of model parameter values. The following table columns list average summary statistics for each model condition (𝑛 = 20). “NHD” is the average normalized Hamming distance of a pair of aligned sequences in the true alignment. “Gappiness” is the percentage of true alignment cells which consists of indels. “True align length” is the length of the true alignment. “Est align length” is the length of the MAFFT-estimated alignment [64] which was provided as input to the support estimation methods. “SP-FN” and “SP-FP” are the proportion of homologies that appear in the true alignment but not in the MAFFT-estimated alignment and vice versa, respectively. of the medium gap length distribution that was used in the other datasets of our simulation study. Parameter values and summary statistics for the long-gap-length model conditions are shown in Table 3.5. ClustalW Model Est align condition length SP-FN SP-FP 10.A 1208.5 0.497 0.556 10.B 1186.2 0.624 0.684 10.C 1144.8 0.711 0.754 10.D 1105.7 0.756 0.786 10.E 1060.1 0.896 0.906 FSA Model Est align condition length SP-FN SP-FP 10.A 2289.3 0.334 0.124 10.B 3418.5 0.585 0.164 10.C 4506.6 0.729 0.211 10.D 5000.9 0.800 0.223 10.E 6657.1 0.907 0.531 Table 3.2: Medium-gap-length model conditions: estimated alignment statistics. The MSA support estimation problem requires an input MSA. Our study included ClustalW [88] and FSA [12] alignments to explore the impact of input alignment quality on downstream support estimation. The following table columns list average statistics for estimated alignments on each model condition (𝑛 = 20). “Est align length” is the estimated alignment length. “SP-FN” and “SP-FP” are the proportion of homologies that appear in the true alignment but not in the estimated alignment and vice versa, respectively. 42 3.2.5 Empirical data We used the empirical benchmarks from the Comparative RNA Web (CRW) site database [14] to further test the performance of the SERES-based MSA support estimation. The CRW dataset contains alignments of ribosomal RNA sequences, which covers a wide range of dataset size and sequence divergence. The reference alignments included in the CRW dataset are produced by the combination of automatic alignment software and intensive manual correction with information of the secondary structure. The quality of the reference alignments include in the CRW are very high. This benchmark dataset has been widely used in the evaluation and comparison of MSA approaches. We selected 11 datasets from the CRW dataset with at most 250 sequences, including primary 16S rRNA, primary 23S rRNA, primary intron and seed alignments. Preprocess of the empirical datasets filtered sequences with greater or equal to 99% missing data. The summary statistics of the empirical datasets are shown in Table 3.3. Number Ref align Est align Dataset of taxa NHD Gappiness length length SP-FN SP-FP IGIA 110 0.606 0.915 10368 6675 0.734 0.784 IGIB 202 0.579 0.910 10633 7379 0.825 0.864 IGIC2 32 0.533 0.700 4243 3514 0.689 0.715 IGID 21 0.719 0.782 5061 3023 0.874 0.904 IGIE 249 0.451 0.838 2751 2775 0.393 0.376 IGIIA 174 0.668 0.814 6406 7005 0.816 0.800 PA23 142 0.293 0.267 3991 3552 0.078 0.077 PE23 117 0.300 0.612 9436 10083 0.202 0.213 PM23 102 0.361 0.797 10999 8803 0.262 0.288 SA16 132 0.212 0.205 1866 1673 0.031 0.028 SA23 144 0.304 0.460 4048 3678 0.077 0.081 Table 3.3: Empirical dataset summary statistics. The empirical study made use of reference alignments (“Ref align”) from the CRW database [14]. The column description is identical to Table 3.1. 3.2.6 Performance Measure We evaluate the performance of the MSA support estimation approaches by the receiver operating characteristic (ROC) curves, precision-recall (PR) curves, and area under ROC curves (ROC-AUC) and PR curves (PR-AUC). We focus on this application because the multiple sequence alignment problem is considered to be a classical problem in computational biology and bioinformatics and 43 MSAs are used as inputs for a variety of important computational problems throughout com- putational biology and bioinformatics, such as phylogenetics analysis, proteomics, comparative genomics, etc. It is well known that MSA quality has a major impact on downstream analysis [71, 79, 80]. The MSA support estimation produces support value for each residue pair in estimated align- ment. Both the ROC curve and the PR curve are produced with thresholds ranging from 0 to 1. For each threshold, by comparing the support value with the threshold and whether the homology appears in the true alignment, we divided the homologies in the estimated alignment into four parts. True positive (TP) represents the number of residue pairs with support values greater than or equal threshold and appear in the true alignment. False positive (FP) represents the number of residue pairs with support value greater or equal to the threshold but not in the true alignment. True negative (TN) represents the number of residue pairs with support values less than the threshold and not true alignment. False negative (FN) represents the number of residue pairs with support value less than the threshold but appear in the true alignment. The ROC curve, PR curve, ROC-AUC, and PR-AUC were calculated by the scikit-learn Python library[136]. 3.3 Result 3.3.1 Simulation study For all the model conditions, the SERES-based support estimation method showed better PR- AUC and ROC-AUC performances than the GUIDANCE1 and GUIDANCE2, two state-of-the-art methods. The results of the simulation study are shown in Table 3.4. All the performance improvements achieved by the SERES-based support estimation were statistically significant. The p-values were calculated by the corrected pairwise t-test, or DeLong test [25], respectively, over all replicate datasets on each model condition. Sequence divergence showed a strong impact on the performance improvement achieved by the SERES-based method, although the SERES-based support estimation method consistently outperformed the GUIDANCE1 and GUIDANCE2 using the standard bootstrap resampling approach on all model conditions despite different dataset sizes 44 and sequence divergence. The performance improvement increased as the sequence divergence increased. For the 10-taxon and 50-taxon model conditions, the SERES-based method achieved at most 3% improvement on the datasets with the least sequence divergence. The SERES-based method improved the PR-AUC performance by 28% on the most divergent dataset in the simulation study. Datasets with the highest sequence divergence are the most challenging datasets. All methods showed degraded PR-AUC as the sequence divergence increased, which is con- sistent with the previous GUIDANCE2 study [126]. While the SERES-based method’s PR-AUC performance degraded slower than the original GUIDANCE1 and GUIDANCE2 using the boot- strap resampling approach. The performance improvement achieved by the SERES+GUIDANCE1 method over GUIDANCE1 was generally greater and statistically more significant than the com- parison between the SERES+GUIDANCE2 method and GUIDANCE2. On all model conditions, the GUIDANCE2 produced better PR-AUC and ROC-AUC performance than the GUIDANCE1. Another observation was that the PR-AUC differences were generally larger than the ROC-AUC differences, especially on model conditions with higher sequence divergence. The performance comparisons on the long-gap-length model conditions were mostly the same as the medium-gap-length model conditions. The PR-AUC and ROC-AUC performances of the two methods are shown in Table 3.5. Similar to previous findings, SERES+GUIDANCE2 con- sistently produced significant improvements on both PR-AUC and ROC-AUC in comparison to GUIDANCE2. The statistical significance was calculated by the corrected pairwise t-test or De- Long test [25] across all 20 replicates for each model condition, respectively. Furthermore, as sequence divergence increased, the PR-AUC improvement that SERES+GUIDANCE2 produced relative to GUIDANCE2 tended to improve. When comparing medium-gap-length with corre- sponding long-gap-length model condition pairs, for example, 10.A and 10.long.A, the PR-AUC improvement of SERES+GUIDANCE2 over GUIDANCE2 was similar between the two types of gap length distribution. The differences were less than 1%. A similar finding was observed for ROC-AUC measurements. An exception is the comparison between 10.D and 10.long.D model conditions. For the 10.long.D model condition, the PR-AUC performance improvement achieved 45 PR-AUC(%) ROC-AUC(%) SERES+ Pairwise t-test SERES+ DeLong et al. test Model GUID- GUID- GUID- corrected GUID- corrected condition ANCE1 ANCE1 ANCE1 q-value ANCE1 q-value 10.A 88.74 91.17 5.4 × 10−7 80.22 85.57 < 10−10 10.B 82.21 86.26 1.5 × 10−6 84.83 88.66 < 10−10 10.C 76.23 83.49 1.9 × 10−4 86.98 91.23 < 10−10 10.D 74.65 85.81 1.9 × 10−4 88.55 93.72 < 10−10 10.E 42.61 59.20 3.1 × 10−4 82.24 87.40 < 10−10 50.A 98.22 98.92 5.3 × 10−10 83.09 90.64 < 10−10 50.B 97.84 98.69 2.8 × 10−9 82.85 90.39 < 10−10 50.C 95.08 96.80 5.6 × 10−8 85.54 90.64 < 10−10 50.D 90.79 95.75 5.3 × 10−6 88.89 94.56 < 10−10 50.E 62.47 79.14 8.0 × 10−10 91.02 93.23 < 10−10 PR-AUC(%) ROC-AUC(%) SERES+ Pairwise t-test SERES+ DeLong et al. test Model GUID- GUID- GUID- corrected GUID- corrected condition ANCE2 ANCE2 ANCE2 q-value ANCE2 q-value 10.A 92.55 93.33 7.4 × 10−6 87.17 88.34 < 10−10 10.B 88.08 89.31 8.4 × 10−4 89.45 90.56 < 10−10 10.C 84.28 86.86 3.1 × 10−4 91.36 92.88 < 10−10 10.D 86.03 88.75 1.9 × 10−4 93.34 94.69 < 10−10 10.E 51.17 62.30 1.3 × 10−3 86.00 88.28 < 10−10 50.A 98.98 99.14 5.3 × 10−6 91.17 92.50 < 10−10 50.B 98.79 98.96 1.5 × 10−6 91.24 92.44 < 10−10 50.C 96.86 97.45 3.2 × 10−7 90.81 92.31 < 10−10 50.D 94.04 96.23 1.5 × 10−5 92.67 95.09 < 10−10 50.E 72.61 81.47 1.5 × 10−8 92.94 94.22 < 10−10 Table 3.4: Support estimation method performance on simulated datasets. Results are shown for simulated datasets. The top rows show AUC comparisons of GUIDANCE1 (“GUIDANCE1”) vs. SERES combined with parametric techniques from GUIDANCE1 (“SERES+GUIDANCE1”), Results AUC comparisons of GUIDANCE2 (“GUIDANCE2”) vs. SERES combined with para- metric techniques from GUIDANCE2 (“SERES+GUIDANCE2”); the best AUC is shown in bold. Corrected q-values are reported (𝑛 = 20) and all were significant (𝛼 = 0.05). by the SERES-based method was larger than that seen in the 10.D model condition. We also conducted additional experiments to study the impact of the MSA estimation method choices. The experiments were conducted by SERES+GUIDANCE2 with alternative MSA methods for estimating the input MSA. A direct performance comparison is shown in Table 3.6. The three MSA methods used in our study returned input alignments with different levels of quality. Compared to the other ClustalW and MAFFT, input MSA estimated by FSA had a lower average SP-FP and the best or close to best average SP-FN measurement. Detailed summary statistics are shown in Table 3.2. Downstream support estimation PR-AUC reflected input alignment quality. In the previous experiments, we found that the PR-AUC performance decreased as sequence divergence 46 PR-AUC(%) Model SERES+ Pairwise t-test GUIDANCE2 condition GUIDANCE2 corrected q-value ROC-AUC(%) Model SERES+ DeLong et al. test GUIDANCE2 condition GUIDANCE2 corrected q-value 10.long.A 89.99 90.99 < 10−10 10.long.B 91.84 93.02 < 10−10 10.long.C 93.14 94.59 < 10−10 10.long.D 93.89 96.13 < 10−10 10.long.E 92.62 94.38 < 10−10 Table 3.5: Support estimation method performance on long-gap-length model conditions. The performance of GUIDANCE2 and SERES+GUIDANCE2 is compared across model conditions 10.long.A through 10.long.E (named in order of generally increasing sequence divergence). Ag- gregate PR-AUC and ROC-AUC are reported across all replicate datasets in a model condition (𝑛 = 20), and the best AUC for each model condition is shown in bold. Statistical significance of PR-AUC or ROC-AUC differences was assessed using a one-tailed pairwise t-test or DeLong test [25] test, respectively, and multiple test correction was performed using the method of [8]. Corrected q-values are reported (𝑛 = 20) and all were significant (𝛼 = 0.05). increased. When using FSA-estimated alignments as input, the PR-AUC performance reduction was smaller than that of MAFFT and ClustalW. The PR-AUC and ROC-AUC performance improvements obtained by SERES+GUIDANCE2 over GUIDANCE2 were robust to the quality of input alignment or choice of MSA estimation method. The SERES+GUIDANCE2 outperformed GUIDANCE2 on both PR-AUC and ROC-AUC no matter annotating more accurate input alignments, such as FSA- estimated alignments, or less accurate input alignments, such as alignments estimated by MAFFT or ClustalW. We explored different anchor parameter settings. The method performances using differing choices for anchor length and numbers of anchors are shown in Figures 3.2 and Figure 3.3, respec- tively. For different anchor lengths used for the SERES resampling, the SERES+GUIDANCE2 produced roughly similar PR-AUC and ROC-AUC performance. The average ROC-AUC difference for different choices of anchor length was less than 0.01 for all model conditions. The largest PR- AUC performance difference was 0.058 on the 10.E model condition. This PR-AUC difference is considered very small compared to the PR-AUC improvement obtained by SERES+GUIDANCE2 over GUIDANCE2, which was 0.28 on the 10.E model condition. A similar outcome was seen in experiments involving different choices for the number of anchors, except in the most divergent 47 PR-AUC(%) ClustalW FSA SERES+ Pairwise t-test SERES+ Pairwise t-test Model GUID- GUID- GUID- corrected GUID- corrected condition ANCE2 ANCE2 ANCE2 q-value ANCE2 q-value 10.A 95.37 95.78 2.8 × 10−3 96.36 96.55 8.6 × 10−3 10.B 92.30 92.95 8.2 × 10−4 95.40 95.87 4.9 × 10−3 10.C 89.36 91.23 1.7 × 10−4 95.32 96.06 2.7 × 10−3 10.D 88.53 90.45 8.8 × 10−5 96.21 96.87 2.1 × 10−3 10.E 73.96 76.50 8.2 × 10−4 90.23 92.51 8.6 × 10−3 ROC-AUC(%) ClustalW FSA SERES+ DeLong et al. test SERES+ DeLong et al. test Model GUID- GUID- GUID- corrected GUID- corrected condition ANCE2 ANCE2 ANCE2 q-value ANCE2 q-value 10.A 96.99 97.23 < 10−10 80.85 81.61 < 10−10 10.B 96.64 96.94 < 10−10 81.31 82.89 < 10−10 10.C 96.27 96.88 < 10−10 84.48 86.56 < 10−10 10.D 95.78 96.65 < 10−10 88.63 90.37 < 10−10 10.E 89.84 90.80 < 10−10 89.10 90.83 < 10−10 Table 3.6: SERES+GUIDANCE2 performance using alternative methods for estimating an input MSA. Input MSAs in these experiments were estimated using either ClustalW [88] or FSA [12]. (MAFFT was used to estimate input MSAs throughout the rest of our study.) Results are shown for model conditions 10.A through 10.E (named in order of generally increasing sequence divergence). Otherwise, table layout and description are identical to Table 3.5. 10.E model condition, where an intermediate number of anchors yielded the best PR-AUC. We compared the runtime of the GUIDANCE1 and GUIDANCE2 with or without the SERES resampling and re-estimation, the SERES-based method required slightly more runtime on all model conditions, usually a few minutes for the entire run. On average, all methods in the simulation study completed analysis of each replicate dataset in less than half an hour and with less than 1 GiB of main memory usage. The average runtime of SERES+GUIDANCE1 was longer than GUIDANCE1 alone by 1 minute and 5 minutes on the 10-taxon and 50-taxon model conditions, respectively. The average runtime of SERES+GUIDANCE2 was longer than GUIDANCE2 alone by at most 1.4 minutes and 6.5 minutes, respectively. The runtime shows in Figure 3.4. For the average memory usage on 10-taxon and 50-taxon model conditions, SERES+ GUID- ANCE1 used 0.016 GiB to 0.610 GiB more than GUIDANCE1 alone. A similar outcome was ob- served when comparing SERES+GUIDANCE2 and GUIDANCE2, where SERES+GUIDANCE2 used 0.034 GiB and 0.871 GiB more memories than GUIDANCE2, respectively. The memory 48 3bp 5bp 10bp 30bp 50bp 1.0 0.8 0.6 PR-AUC 0.4 0.2 0.0 10.A 10.B 10.C 10.D 10.E Model Condition 3bp 5bp 10bp 30bp 50bp 1.0 0.8 0.6 ROC-AUC 0.4 0.2 0.0 10.A 10.B 10.C 10.D 10.E Model Condition Figure 3.2: SERES+GUIDANCE2 performance using different choices for anchor length. Re- sults are shown for five 10-taxon medium-gap-length model conditions (named 10.A through 10.E in order of generally increasing sequence divergence). We evaluated the performance of SERES+GUIDANCE2 where anchor length in bp was either 3, 5, 10, 30, or 50. We calculated each method’s precision-recall (PR) and receiver operating characteristic (ROC) curves. Performance is evaluated based upon aggregate area under curve (AUC) across all replicates for a model condition (𝑛 = 20). 49 3 anchors 5 anchors 20 anchors 50 anchors 100 anchors 1.0 0.8 0.6 PR-AUC 0.4 0.2 0.0 10.A 10.B 10.C 10.D 10.E Model Condition Series2 Series3 Series4 Series5 Series6 1.0 0.8 0.6 ROC-AUC 0.4 0.2 0.0 10.A 10.B 10.C 10.D 10.E Model Condition Figure 3.3: SERES+GUIDANCE2 performance using different choices for the number of anchors. We evaluated the performance of SERES+GUIDANCE2 where the number of anchors used was either 3, 5, 20, 50, or 100. Otherwise, figure layout and description are identical to Figure 3.2. usage is shown in Figure 3.5. 50 (a) Time (h) Simulation study model condition (b) Time (h) Empirical study dataset Figure 3.4: Runtime comparison of methods under study. (a) For each method, average runtime (h) across all replicates in each simulation study model condition is reported (𝑛 = 20); standard error bars are also shown. The 10-taxon model conditions 10.A through 10.E are shown in order from left to right, followed by the 50-taxon model conditions similarly. (b) Method runtimes are shown for each empirical study dataset. Datasets are arranged from left to right in order of increasing dataset size as measured by number of taxa. 3.3.2 Empirical study The SERES-based support estimation methods produced better performance on all empirical datasets except for the IGIC2 dataset, where the GUIDANCE2 performed better than the SERES+ GUIDANCE2 method by 1.17% on PR-AUC and 2.12% on ROC-AUC. Consistent with the simulation study, the SERES-based method achieved larger PR-AUC per- formance improvements on datasets with higher sequence divergence. The PR-AUC improvements 51 (a) Memory (GiB) Simulation study model condition (b) Memory (GiB) Empirical study dataset Figure 3.5: Memory usage comparison of methods under study. Memory usage is shown in GiB. Otherwise, figure layout and description are identical to Supplementary Figure 3.4. were less than 1% on seed and primary non-intronic datasets, which were datasets with less ANHD and gappiness, refer to Table 3.3. For the intronic datasets, datasets with higher sequence divergence, the PR-AUC improvements were as much as 13.87%. Performance improvements of the SERES+GUIDANCE1 method over GUIDANCE1 were relatively greater than that of the comparison between the SERES+GUIDANCE2 method and GUIDANCE2, which is also consistent with the simulation study. Finally, GUIDANCE2 consistently generated better performance than GUIDANCE1 on both PR-AUC or ROC-AUC. The runtime comparison between SERES+GUIDANCE2 and GUIDANCE2 showed larger 52 PR-AUC(%) ROC-AUC(%) SERES+ SERES+ Model GUID- GUID- GUID- GUID- condition ANCE1 ANCE1 ANCE1 ANCE1 IGIA 62.67 69.28 89.50 91.62 IGIB 73.60 87.47 94.49 97.39 IGIC2 72.67 75.36 82.25 83.87 IGID 63.74 76.30 95.10 96.73 IGIE 93.56 95.42 90.08 93.30 IGIIA 73.03 83.06 86.49 96.45 PA23 98.54 99.41 82.59 93.63 PE23 98.44 99.27 94.75 97.41 PM23 97.53 98.48 94.20 96.44 SA16 99.72 99.86 91.07 95.57 SA23 98.35 99.24 81.76 92.18 PR-AUC(%) ROC-AUC(%) SERES+ SERES+ Model GUID- GUID- GUID- GUID- condition ANCE2 ANCE2 ANCE2 ANCE2 IGIA 67.4 68.49 91.38 91.94 IGIB 80.66 86.72 96.47 97.38 IGIC2 74.44 73.27 84.63 82.51 IGID 75.15 78.38 96.44 97.09 IGIE 94.6 95.44 91.84 93.49 IGIIA 78.16 85.09 94.50 96.82 PA23 99.24 99.53 91.48 94.88 PE23 99.07 99.34 96.72 97.63 PM23 98.68 98.85 96.93 97.28 SA16 99.88 99.91 96.22 97.22 SA23 99.04 99.33 89.93 93.18 Table 3.7: Empirical study results. Results are shown for empirical datasets. For each dataset and pairwise method comparison. Table layout, and table description are otherwise identical to Table 3.4. differences in the empirical datasets compared to the simulation study. SERES+GUIDANCE2 used at most 2.6 hours on the largest empirical datasets, which have 100-200 taxa. The variance of the runtime difference between the two methods was also larger than that of the simulation study. GUIDANCE2’s main memory usage was not consistently better than SERES+GUIDANCE2 on the empirical datasets. These two methods had comparable memory usage across the empirical datasets. The maximum difference between these two methods was 0.06 GiB. Similar runtime and memory usage comparisons were observed for SERES+GUIDANCE1 and GUIDANCE1, with the former having a maximum overhead relative to the latter of 4.2 hours and 0.07 GiB. 53 3.4 Discussion In both the simulation and the empirical studies, the utilization of the SERES resampling and re-estimation promoted the MSA support estimation performance. According to the experiment results, the performance improvement greatly benefited from the SERES resampling approach, which can produce many distinct replicates and reserve the intra-sequence dependence. Under all model conditions, the support estimation produced by GUIDANCE1 and GUID- ANCE2 with the SERES resampling and re-estimation showed significant improvements in the PR- AUC and ROC-AUC compared to the pipelines without the SERES resampling process. The main difference for the comparison is the resampling approach, SERES, versus the standard bootstrap. The results of the simulation study and the empirical study indicate that the SERES resampling approach outperformed the standard bootstrap resampling method in the application of the MSA support estimation. The SERSE resampling approach was designed to preserve the intra-sequence dependence caused by the insertion and deletion processes and relax the assumption made by the bootstrap resampling method that all the sites are independent and identically distributed. The experiment results showed that the SERES+GUIDANCE1 method achieved greater PR- AUC and ROC-AUC performance improvements over the GUIDANCE1 method than the compar- ison between the SERES+GUIDANCE2 and GUIDANCE2. One reason is that in terms of AUC performance, the GUIDANCE2 outperformed the GUIDANCE1. We used the SERES resam- pling approach together with the support estimation framework of GUIDANCE1 or GUIDANCE2. GUIDANCE2 already yielded very high PR-AUC and ROC-AUC on the simulation datasets and empirical datasets, and it is hard for the SERES resampling approach to make large performance improvements upon that. The experiments on the empirical datasets generated similar results as the simulation study. The non-intronic datasets have low ANHD and gappiness, which means they have low sequence divergence. The intronic datasets have higher sequence divergence. Therefore, the methods’ performances on the non-intronic datasets were similar to low divergence model conditions, and performances on the intronic datasets were similar to high divergence model conditions. The 54 SERES-based support estimation methods consistently showed better performance than either GUIDANCE1 or GUIDANCE2 alone for all the empirical datasets. The SERES-based methods achieved larger performance improvements on the PR-AUC and ROC-AUC as the dataset became more divergent and challenging to align. For small datasets with low sequence divergence, the SERES resampling approach did not greatly improve the performance as on those challenging datasets. One example is the IGIC2 datasets, where the SERES+GUIDANCE2 did not outperform GUIDANCE2. IGIC2 was a small dataset, which is about an order of magnitude smaller than other datasets. This dataset also has a lower ANHD and gappiness than the other intronic datasets. Another small intronic dataset, IGID, has higher ANHD and gappiness compared to the IGIC2 dataset. When compared to the GUIDANCE2, the SERES-based support estimation yielded a relatively small performance improvement of about 3.2%. Higher sequence divergence usually means more challenges for statistical inferences and more inference errors. While the performance of the SERES-based methods degraded much slower than the corresponding non-SERES methods as the sequence divergence increased. The largest performance improvement was seen in the most divergent model conditions and empirical datasets. The combination of GUIDANCE1 and GUIDANCE2 with SERES-based resampling and re- estimation usually leads to an increase in computational runtime in our study. The computational runtime increased by a few minutes for the 10-taxon and 50-taxon simulated datasets and a few hours for the larger empirical datasets with around 100-200 taxa. In the simulation study, the SERES-based methods also required more memory than GUIDANCE1 and GUIDANCE2 alone. The gap between the SERES-based methods and GUIDANCE1 and GUIDANCE2 decreased on the larger empirical datasets with a few hundred taxa. Compared to GUIDANCE1 and GUIDANCE2, the SERES-based methods include an additional MSA re-estimation step, occurring after SERES random walk resampling. This additional step is likely the reason for the increased computational cost. Finally, we noticed that the combination of two different types of methods resulted in better per- formance than either of these methods alone. This finding indicated that the resampling techniques 55 are orthogonal to parametric alternatives, which is consistent with previous studies [110, 126]. 3.5 Conclusions In this study, we introduced a new non-parametric/semi-parametric resampling approach, SERES, for the resampling of biomolecular sequence data. The SERES resampling approach can produce many distinct resampling replicates while reserving the sequential dependence during the resampling process. We applied the SERES resampling approach to a classical problem in computational biology and bioinformatics, the MSA support estimation problem. We tested the SERES-based support estimation on both simulated and empirical datasets. The SERES-based method performed comparably or better than the state-of-the-art approaches. For future research, there are several directions. Create a purely non-parametric resampling approach based on the SERES method. The SERES algorithm in this study requires the estimation of anchors to ensure the synchronization of the random walk on the unaligned input sequences. Anchor estimation is a semi-parametric procedure that makes use of progressive multiple sequence alignment algorithms. Therefore, the SERES resampling approach on the unaligned sequences is also considered a semi-parametric approach. Non-parametric resampling procedure could be obtained by replacing the anchor estimation in the SERES resampling process. The SERES resampling approach could potentially be extended to perform the MSA estima- tion since alternative homologies were produced during the SERES resampling and re-estimation process. Many other problems can also make use of the SERES resampling approach, such as protein structure prediction, read mapping, and assembly. Non-parametric resampling for support estimation is widely used throughout science and engineering, and SERES resampling can also be applied in research areas outside of computational biology and bioinformatics. 56 CHAPTER 4 APPLICATION OF SERES RESAMPLING APPROACH TO ALIGNED SEQUENCES: PHYLOGENETIC HMM INFERENCE AND LEARNING 4.1 Introduction Besides estimation of the confidence intervals, the statistical resampling methods are also used to produce perturbations for statistical inference and learning to improve the accuracy [13]. In this study, we applied the SERES resampling algorithm on the aligned sequences for another classical problem in computational biology and bioinformatics, recombination-aware local genealogical inference. We selected the recombination breakpoint detection problem mainly because the recombination event is one of those evolutionary processes that creates sequence dependence. The sites involved in the recombination share the same evolutionary history. The purpose of the recombination breakpoints detection is to find the breakpoints where two neighboring regions have different local genealogy topologies. Preservation and identification of intra-sequence dependence play crucial roles in the recombination breakpoint detection problem. As introduced in the previous chapter, the SERES resampling algorithm preserves the sequence dependence within the resampled replicates. Thus, the SERES resampling approach has the ability to retain intra-sequence dependence caused by the historical recombination events during its resampling process, which makes it a perfect fit for this problem. Traditionally, the phylogenetic tree is reconstructed under the assumption that all input MSA sites share the same evolutionary history. However, due to the existence of recombination, this assumption does not fit all conditions. Recombination is the process by which genetic materials are transferred between different organisms. This process is an essential source of genetic diversity; for example, disease-causing bacteria acquire resistance to antibiotics. This process also leads to a mixture of genetic materials, where the local genealogies of the affected region are changed and 57 are inconsistent with the evolutionary history of the other regions. The Hidden Markov model was first adopted to describe evolutionary history by the study of Hein [51], where the hidden states were used to represent the topologies of evolutionary history, and the recombination event was interpreted as a transition between different states. McGuire extended the ideas of Hein’s study to detect the local topologies changes caused by the recombination event [92, 93]. The HMM model was combined into a likelihood framework where the maximum likelihood method was used as part of Bayesian inference and the Markov chain was used to assign a prior probability to the sequence of topologies along a sequence alignment. There are many HMM-based methods for local genealogical inference [59, 150, 91, 81]. In this study, We mainly focus on the recHMM [150] for the following reasons. RecHMM identifies local genealogy by applying heuristic searches on the space of all possible partitions. This algorithm is powerful in detecting local genealogy changes, especially when the regions that involve in recombination are long and the dataset size is large. It reveals the most likely recombination breakpoint locations with high accuracy and fewer requirements on parameter settings such as window schemes. Another reason is that recHMM takes aligned sequences as input to annotating mosaic genome structures, making it possible to combine the SERES resampling approach. The recHMM algorithm utilizes a statistical model that combines a finite-sites substitution model and a phylo-HMM to capture intra-sequence dependence due to recombination. An EM- based approach is used to learn recHMM model parameters. [150] also applied a structural EM heuristic [40] to automatically learn the set of local gene trees represented by the recHMM’s states. In the simulation study, we compared the method performance of the recHMM method and the combined method of the recHMM with the SERES resampling approach on local genealogies inference. We also evaluated the combined method performance on an empirical HIV genome sequence dataset. 58 4.2 Methods 4.2.1 Standalone recHMM analysis We first ran the recHMM alone to get a baseline performance on the local genealogy inference in the simulation study. Users need to assign a state size 𝜙 for the HMM model of recHMM. In our simulation study, we ran recHMM with the default setting, 𝜙 = 3. For recHMM, gene trees with either different topologies or branch lengths are considered different trees. In the structural EM used by [150], HMM states are distinguished by both gene tree topologies and branch lengths. In Westesson’s study, recHMM employed 𝜓 independent optimization trials to avoid local optima and selected the best trial under the maximum likelihood criterion. In our simulation study, we applied the same strategy as in the original study of recHMM, where recHMM ran with 𝜓 = 100 independent optimization trials, and selected the best trail. We used the same posterior decoding algorithm as used in the original study to perform statistical inference of local phylogenies [113]. Let 𝐺 be the set of all possible unrooted tree topologies on 𝑛 taxa. The input alignment 𝐴 contains 𝑛 sequences, each sequence comes from one taxon. The length of the input alignment is 𝑘, which means there are 𝑘 sites in 𝐴. The input alignment is assumed to contain recombined regions, where the corresponding local genealogies are different from the other regions due to the historical recombinations [52]. For each site 𝑎𝑖 , where 1 ≤ 𝑖 ≤ 𝑘, in the input alignment 𝐴, recHMM outputs the conditional probability for the gene tree 𝑔 ∈ 𝐺 correspond to each hidden state conditional on all sites in 𝐴 and the fitted HMM model. 4.2.2 The SERES+recHMM pipeline To test the SERES resampling approach on the local genealogy inference problem, we combined recHMM with the SERES random walk. First, we ran SERES resampling on the input alignment 𝐴 with a default reversal probability 𝛾 = 0.005. We also conducted additional experiments with alternative reversal probability 𝛾 ∈ 59 {0, 0.01, 0.1}. We used the SERES random walk to generate ten resampled replicates for each dataset in our study. Then, we ran recHMM on each resampled replicate. To get a fair comparison, we restricted the number of independent learning trials 𝜓 used in the SERES-based pipeline. We ran recHMM on each SERES replicate with 𝜓 = 10. So for each dataset, the total number of independent learning trials used in the SERES-based pipeline was equal to the number of independent learning trials used by the standalone recHMM method. The recHMM generated posterior probabilities for each site over 𝜙 gene tree topologies. For each site, the posterior probability distributions were aggregated across all the replicates in which the site appeared. The aggregated distribution was then normalized to obtain a valid probability distribution. 4.2.3 Simulated datasets We used ms [57] to simulate the gene trees under the coalescent-with-recombination model with either 4, 5, or 6 taxa. The recombination rate of the simulation is 𝜌 ∈ {0.5, 1.0, 2.0} and the total sequence length is 1 Kb per replicate. For each gene tree, we used Seq-Gen [48] to simulate the sequences under the Jukes-Cantor substitution model [63]. The substitution rate is set to be 𝜃 ∈ {0.5, 1.0, 2.0}. The simulation procedures described above were repeated 30 times independently. We got 30 replicate datasets for each model condition. We simulated datasets that covers a wide range of recombination rates and mutation rates over 4, 5, or 6 taxa. The number of taxa are chosen to test the scalability of the SERES resampling algorithm on the HMM inference. The number of unique unrooted tree topologies for 4 taxa is 3. There are 15 unique unrooted tree topologies for 5 taxa, and this number increases to 105 for 6 taxa. Summary statistics for the simulated datasets are shown in Table 4.1. We used the Robinson- Foulds distance [116] to measure the topological accuracy of inferred gene trees compared with the ground truth. The Robinson-Foulds distance is the proportion of bipartitions that appear in an inferred gene tree but not in the true gene tree or vice versa. 60 Number of Recombination Mutation # gene trees # gene trees ANHD ANHD taxa rate 𝜌 rate 𝜃 Avg SE Avg SE 4 0.5 0.5 3.4 1.4 0.359 0.080 4 0.5 1 3.0 1.2 0.498 0.101 4 0.5 2 3.0 1.2 0.620 0.086 4 1 0.5 4.1 1.7 0.344 0.102 4 1 1 4.1 1.7 0.495 0.105 4 1 2 4.1 1.7 0.625 0.079 4 2 0.5 5.6 2.4 0.321 0.091 4 2 1 5.6 2.4 0.462 0.103 4 2 2 5.6 2.4 0.581 0.094 5 0.5 0.5 3.100 1.165 0.331 0.093 5 0.5 1 3.200 1.137 0.464 0.083 5 0.5 2 3.200 1.137 0.585 0.068 5 1 0.5 4.700 1.917 0.357 0.085 5 1 1 4.700 1.917 0.492 0.063 5 1 2 4.700 1.917 0.608 0.044 5 2 0.5 6.367 2.834 0.331 0.078 5 2 1 6.367 2.834 0.467 0.083 5 2 2 6.367 2.834 0.587 0.062 6 0.5 0.5 3.167 1.293 0.336 0.088 6 0.5 1 3.300 1.159 0.439 0.077 6 0.5 2 3.300 1.159 0.554 0.072 6 1 0.5 4.667 2.134 0.304 0.082 6 1 1 4.667 2.134 0.435 0.079 6 1 2 4.667 2.134 0.554 0.064 6 2 0.5 6.100 2.587 0.312 0.079 6 2 1 6.100 2.587 0.448 0.085 6 2 2 6.100 2.587 0.565 0.068 Table 4.1: Simulated dataset statistics. The number of true gene trees and average normalized Hamming distance (“ANHD”) are reported for simulated datasets from the simulation study; average (“Avg”) and standard error (“SE”) are shown for all experimental replicates from each model condition (𝑛 = 30). 4.2.4 Empirical datasets We also re-analyzed an HIV dataset from the study of [150]. The dataset consisted of Indian samples that were originally studied by [84]. The dataset was sub-sampled to include four sequences, including the putatively recombinant sequence 95IN21301. 4.3 Results 4.3.1 Simulation study We calculated the correlation between the inferred posterior probability of a gene tree topology 𝑔 and the topological accuracy of 𝑔 to measure the performance of the standalone method and the SERES-based method. Table 4.2 shows the correlation results of the 4 taxa model conditions. For all 4-taxon model conditions, the posterior probabilities produced by the SERES-based method 61 are consistently better correlated with topological accuracy than the probabilities produced by the standalone recHMM. The absolute correlation improvement was at least 0.203, where the recombination rate 𝜌 = 2 and the mutation rate 𝜃 = 1. The largest correlation improvement was 0.305, where the recombination rate 𝜌 = 1 and the mutation rate 𝜃 = 0.5. This correlation improvement achieved by the SERES-based method is robust for all model conditions with various mutation rates and recombination rates. Then we compared the distributions of the posterior probability inferred by the standalone recHMM and the SERES+recHMM method for the 4-taxon model conditions. Figure 4.1 shows the distribution of the posterior probabilities inferred by recHMM alone. Figure 4.2 shows the distribution of the posterior probabilities inferred by the SERES+recHMM method. In Figure 4.1 and Figure 4.2, "true class" refers to the true gene tree topologies used as guide trees for the sequence simulation, and all the other gene tree topologies are labeled as "false class". Ideally, the posterior probability of the true gene tree topologies should be 100%, and the posterior probability of the other gene tree topologies should be 0%. The right-most blue bar is the highest in Figure 4.2 for all the model conditions, which means the majority of the posterior probabilities inferred by the SERES+recHMM for the true gene tree topologies were higher than 90%. However, in Figure 4.1, the right-most bar and the left-most bar are almost identical in height. This means almost half of the posterior probabilities inferred by the standalone recHMM are less than 10%. An opposite trend was observed for the false class. The orange bar representing the highest posterior probability inferences, which are 90% posterior probability or greater, was the second- highest in Figure 4.1. In contrast, SERES+recHMM consistently returned fewer inferences in the top decile of the posterior probability range. The SERES+recHMM-inferred posterior distribution for the false class of per-site inferences was consistently shifted leftward compared to standalone recHMM. We saw a similar performance on the 5-taxon model conditions. In Table 4.3, the SERES+ recHMM inference had a stronger correlation with topological accuracy compared with the 62 Baseline Probability Distribution RecombRate=0.5 RecombRate=0.5 RecombRate=0.5 MutRate=0.5 MutRate=1 MutRate=2 100% 80% 60% Frequency 40% Frequency Frequency 20% 0% Posterior Probability Posterior Probability Posterior Probability RecombRate=1 RecombRate=1 RecombRate=1 MutRate=0.5 MutRate=1 MutRate=2 100% 80% 60% Frequency 40% Frequency Frequency 20% 0% Posterior Probability Posterior Probability Posterior Probability RecombRate=2 RecombRate=2 RecombRate=2 MutRate=0.5 MutRate=1 MutRate=2 100% true class 80% false class 60% Frequency 40% Frequency Frequency 20% 0% 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Posterior Probability Posterior Probability Posterior Probability Figure 4.1: The posterior probability distribution inferred by the standalone recHMM method on 4-taxon model conditions. For each site, we split the local gene tree topologies into true class, which contains the true gene tree topologies for the site, and false class’, which contains all other gene tree topologies. For each class and each replicate dataset in a model condition, the inferred posterior probabilities for gene trees at any site were binned into deciles; the resulting histogram was normalized over all replicates in a model condition (𝑛 = 30). The normalized histograms for the true and false classes are shown in blue and orange, respectively. 63 SERES Probability Distribution RecombRate=0.5 RecombRate=0.5 RecombRate=0.5 MutRate=0.5 MutRate=1 MutRate=2 100% 80% 60% Frequency 40% Frequency Frequency 20% 0% Posterior Probability Posterior Probability Posterior Probability RecombRate=1 RecombRate=1 RecombRate=1 MutRate=0.5 MutRate=1 MutRate=2 100% 80% 60% Frequency 40% Frequency Frequency 20% 0% Posterior Probability Posterior Probability Posterior Probability RecombRate=2 RecombRate=2 RecombRate=2 MutRate=0.5 MutRate=1 MutRate=2 100% true class 80% false class 60% Frequency 40% Frequency Frequency 20% 0% 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Posterior Probability Posterior Probability Posterior Probability Figure 4.2: Histogram of posterior probabilities inferred by SERES+recHMM method on 4-taxon model conditions. Figure layout and description are otherwise identical to Figure 4.1. 64 Number Recomb- of ination Mutation recHMM SERES+recHMM taxa rate 𝜌 rate 𝜃 correlation correlation 4 0.5 0.5 -0.547 -0.830 4 0.5 1 -0.622 -0.866 4 0.5 2 -0.554 -0.799 4 1 0.5 -0.470 -0.775 4 1 1 -0.460 -0.742 4 1 2 -0.427 -0.677 4 2 0.5 -0.560 -0.855 4 2 1 -0.664 -0.867 4 2 2 -0.609 -0.837 Table 4.2: On 4-taxon model conditions, the posterior probabilities inferred by SERES+recHMM were better correlated with topological accuracy compared with the standalone recHMM. For each method, we calculated the Pearson correlation between the inferred posterior probability of a gene tree 𝑔 and the topological distance between 𝑔 and the true gene tree of a site. Average correlation for a method is calculated across all replicates in a model condition (𝑛 = 30). recHMM inference for all model conditions. The SERES+recHMM improved the correlation coefficients by at least 0.066 when the recom- bination rate 𝜌 = 1 and the mutation rate 𝜃 = 2. The largest correlation improvement was 0.155, where the recombination rate 𝜌 = 2 and the mutation rate 𝜃 = 0.5. Number Recomb- of ination Mutation recHMM SERES+recHMM taxa rate 𝜌 rate 𝜃 correlation correlation 5 0.5 0.5 -0.571 -0.692 5 0.5 1 -0.526 -0.676 5 0.5 2 -0.525 -0.651 5 1 0.5 -0.597 -0.675 5 1 1 -0.569 -0.678 5 1 2 -0.618 -0.684 5 2 0.5 -0.506 -0.661 5 2 1 -0.543 -0.665 5 2 2 -0.56 -0.648 Table 4.3: On 5-taxon model conditions, posterior probabilities inferred by the SERES+recHMM had stronger correlation with topological accuracy compared with the posterior probabilities in- ferred by the standalone recHMM. Otherwise, table layout and description are identical to Table 4.2. Similar to the 4-taxon and 5-taxon dataset comparisons, the SERES+recHMM’s inference was more strongly correlated with topological accuracy across all 6-taxon model conditions when compared to standalone recHMM (Table 4.4). However, compared with the 4-taxon and 5-taxon datasets, the correlation coefficients for both methods were generally weaker. Moreover, the absolute improvement in the correlation achieved by SERES+recHMM was more negligible as 65 well. The comparison of the posterior probability was the same for the false class. For the false class of inferences, the posterior probability distributions inferred by SERES+recHMM were more strongly shifted leftward than recHMM, despite the fact that posterior probabilities of the false class of inferences were more than double of the inferences in the 4-taxon and 5-taxon experiments. However, a different outcome was observed for the true class of per-site inferences. For the true class, the posterior probability distributions inferred by the SERES+recHMM were more diffuse than recHMM rather than a rightward shift. One reason for this observation is that when the number of sequences involved in the HMM inference and learning increased, the computational complexity increased dramatically. Number Recomb- of ination Mutation recHMM SERES+recHMM taxa rate 𝜌 rate 𝜃 correlation correlation 6 0.5 0.5 -0.3312 -0.494 6 0.5 1 -0.251 -0.457 6 0.5 2 -0.360 -0.472 6 1 0.5 -0.376 -0.486 6 1 1 -0.469 -0.473 6 1 2 -0.507 -0.535 6 2 0.5 -0.382 -0.506 6 2 1 -0.504 -0.515 6 2 2 -0.455 -0.554 Table 4.4: On 6-taxon model conditions, posterior probabilities inferred using SERES+recHMM were more highly correlated with topological accuracy compared to standalone recHMM. Other- wise, table layout and description are identical to Table 4.2. We also conducted additional experiments to evaluate the impact of key method parameters. The results are shown in Table 4.5. By comparing the inference accuracy of recHMM versus SERES+recHMM with different SERES reversal probability 𝛾, we found that the performance improvement achieved by SERES+recHMM over the standalone recHMM was robust to the choice of reversal probability 𝛾, as long as the chosen value was not too high. Reasonable choices are equivalent to reversal breakpoints separated by an average of at least 100 bp of sequence length. The results are consistent with the original motivation for sequence-aware resampling and re-estimation. [145]. We tested the impact of the HMM state space size 𝜙 too. The results are shown in Table 4.6. 66 Baseline Probability Distribution RecombRate=0.5 RecombRate=0.5 RecombRate=0.5 MutRate=0.5 MutRate=1 MutRate=2 80% 60% Frequency 40% Frequency Frequency 20% 0% Posterior Probability Posterior Probability Posterior Probability RecombRate=1 RecombRate=1 RecombRate=1 MutRate=0.5 MutRate=1 MutRate=2 80% 60% Frequency 40% Frequency Frequency 20% 0% Posterior Probability Posterior Probability Posterior Probability RecombRate=2 RecombRate=2 RecombRate=2 MutRate=0.5 MutRate=1 MutRate=2 80% true class false class 60% Frequency 40% Frequency Frequency 20% 0% 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Posterior Probability Posterior Probability Posterior Probability Figure 4.3: Distribution of posterior probabilities inferred by standalone recHMM method on 6-taxon model conditions. Figure layout and description are otherwise identical to Figure 4.1. There are approximately 3 to 6 distinct true gene tree topologies in each simulated sequence. For the recHMM, using an HMM state space size 𝜙 larger than the number of local gene tree topologies, the inferred results were easily overfitting. This finding is consistent with the original study of recHMM [150]. The average runtime for the two methods was roughly comparable, and neither method consis- tently ran faster than the other one. We observed low memory usage (less than 100 MiB) for both methods throughout our study. 67 SERES Probability Distribution RecombRate=0.5 RecombRate=0.5 RecombRate=0.5 MutRate=0.5 MutRate=1 MutRate=2 80% 60% Frequency 40% Frequency Frequency 20% 0% Posterior Probability Posterior Probability Posterior Probability RecombRate=1 RecombRate=1 RecombRate=1 MutRate=0.5 MutRate=1 MutRate=2 80% 60% Frequency 40% Frequency Frequency 20% 0% Posterior Probability Posterior Probability Posterior Probability RecombRate=2 RecombRate=2 RecombRate=2 MutRate=0.5 MutRate=1 MutRate=2 true class 80% false class 60% Frequency 40% Frequency Frequency 20% 0% 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Posterior Probability Posterior Probability Posterior Probability Figure 4.4: Distribution of posterior probabilities inferred by SERES+recHMM method on 6-taxon model conditions. Figure layout and description are otherwise identical to Figure 4.1. 4.3.2 Empirical study Figure 4.5 shows the posterior probability distribution obtained by the SERES+recHMM method. The breakpoints refer to positions where the blue topology and the orange topology are switched. The SERES+recHMM results recovered five breakpoints that had been described in both Lole’s study and Westesson’s study [84, 150], which are located at 6402 bp, 6969 bp, 7073 bp, 9431 bp, and 9585 bp on the input alignment. The posterior probability distribution inferred by the 68 Number of Recombination Mutation recHMM SERES+recHMM taxa rate 𝜌 rate 𝜃 No 𝛾 𝛾=0 𝛾 = 0.005 𝛾 = 0.01 𝛾 = 0.1 4 0.5 0.5 -0.614 -0.851 -0.842 -0.854 -0.679 4 0.5 1 -0.670 -0.868 -0.869 -0.847 -0.697 4 0.5 2 -0.651 -0.875 -0.840 -0.876 -0.700 4 1 0.5 -0.554 -0.873 -0.867 -0.888 -0.681 4 1 1 -0.506 -0.803 -0.798 -0.782 -0.575 4 1 2 -0.539 -0.760 -0.725 -0.748 -0.551 4 2 0.5 -0.651 -0.818 -0.844 -0.833 -0.664 4 2 1 -0.756 -0.841 -0.865 -0.845 -0.663 4 2 2 -0.667 -0.851 -0.867 -0.838 -0.676 5 0.5 0.5 -0.571 -0.732 -0.692 -0.673 -0.605 5 0.5 1 -0.526 -0.699 -0.676 -0.679 -0.601 5 0.5 2 -0.525 -0.689 -0.651 -0.662 -0.577 5 1 0.5 -0.597 -0.690 -0.675 -0.654 -0.591 5 1 1 -0.569 -0.705 -0.678 -0.669 -0.601 5 1 2 -0.618 -0.693 -0.684 -0.671 -0.593 5 2 0.5 -0.506 -0.669 -0.661 -0.629 -0.595 5 2 1 -0.547 -0.677 -0.665 -0.638 -0.572 5 2 2 -0.560 -0.675 -0.648 -0.653 -0.573 6 0.5 0.5 -0.489 -0.597 -0.573 -0.549 -0.465 6 0.5 1 -0.343 -0.591 -0.533 -0.533 -0.446 6 0.5 2 -0.374 -0.565 -0.537 -0.557 -0.44 6 1 0.5 -0.314 -0.609 -0.568 -0.556 -0.471 6 1 1 -0.375 -0.608 -0.568 -0.530 -0.453 6 1 2 -0.331 -0.607 -0.589 -0.548 -0.466 6 2 0.5 -0.511 -0.582 -0.558 -0.547 -0.477 6 2 1 -0.498 -0.606 -0.564 -0.556 -0.475 6 2 2 -0.528 -0.610 -0.584 -0.566 -0.509 Table 4.5: The comparison among different reversal probabilities 𝛾 on 4-, 5- and 6-taxon model conditions. The methods utilize models with 𝜙 = 3 to infer a posterior probability distribution over gene tree topologies. For each method’s inference, we calculated the Pearson correlation between the inferred posterior probability for a gene tree 𝑔 and the topological distance between 𝑔 and the true evolutionary history of a site (i.e., the true local gene tree). The averages are reported across all 𝑛 replicates in a model condition (𝑛 = 30). SERES+recHMM method clearly showed inference uncertainty in the first few hundred bp of the input alignment. The study of Westesson [150] also reported two additional breakpoints at 4328 bp and 4401 bp that were not described in the study of Lole [84]. The standalone recHMM found local topology switches in the region of 4000 bp to 4200 bp. However, the SERES-based method results in more uncertainty in the corresponding regions. Inconsistencies exist between the posterior probability distribution inferred by the standalone recHMM and the SERES+recHMM. Some local topology changes found by the standalone recHMM were not supported by the SERES+recHMM analysis, for example, the region from 6000 to 6500 bp. 69 Number of States Recombination Mutation recHMM SERES+recHMM 𝜙 rate 𝜌 rate 𝜃 correlation correlation 3 0.5 0.5 -0.571 -0.692 3 0.5 1 -0.526 -0.676 3 0.5 2 -0.525 -0.651 3 1 0.5 -0.597 -0.675 3 1 1 -0.569 -0.678 3 1 2 -0.618 -0.684 3 2 0.5 -0.506 -0.661 3 2 1 -0.543 -0.665 3 2 2 -0.56 -0.648 10 0.5 0.5 -0.546 -0.681 10 0.5 1 -0.563 -0.671 10 0.5 2 -0.555 -0.66 10 1 0.5 -0.563 -0.677 10 1 1 -0.54 -0.674 10 1 2 -0.534 -0.661 10 2 0.5 -0.433 -0.639 10 2 1 -0.464 -0.651 10 2 2 -0.533 -0.651 15 0.5 0.5 -0.446 -0.685 15 0.5 1 -0.458 -0.685 15 0.5 2 -0.487 -0.655 15 1 0.5 -0.501 -0.679 15 1 1 -0.472 -0.671 15 1 2 -0.501 -0.657 15 2 0.5 -0.388 -0.648 15 2 1 -0.426 -0.652 15 2 2 -0.456 -0.645 Table 4.6: The comparison among different number of states 𝜙 on 5-taxon model conditions. The methods utilize models with 𝛾 = 0.005 to infer a posterior probability distribution over gene tree topologies. For each method’s inference, we calculated the Pearson correlation between the inferred posterior probability for a gene tree 𝑔 and the topological distance between 𝑔 and the true evolutionary history of a site (i.e., the true local gene tree). The averages are reported across all 𝑛 replicates in a model condition (𝑛 = 30). Finally, for the gene tree topology, the posterior probability inferred by SERES+recHMM was generally lower than the standalone recHMM. One example is shown in the region located between 5000 and 8000 bp. The SERES+recHMM inferred almost zero probability for the green topology within this region, whereas the recHMM inferred a highly variable probability. The SERES-based recHMM detected local genealogy changes in the HIV dataset, which is consistent with the previous studies [84, 150]. These pieces of evidence suggested that the sequence 95IN21301 is recombinant. 70 Rec Mut Runtime (h) Memory (GiB) # of rate rate SERES+ SERES+ seq 𝜌 𝜃 recHMM recHMM recHMM recHMM 4 0.5 0.5 1.788 1.731 0.055 0.055 4 0.5 1 1.951 1.838 0.055 0.055 4 0.5 2 1.852 1.944 0.055 0.055 4 1 0.5 1.958 1.868 0.055 0.055 4 1 1 1.935 1.674 0.055 0.055 4 1 2 1.919 1.672 0.055 0.055 4 2 0.5 1.712 1.878 0.055 0.055 4 2 1 1.676 1.887 0.055 0.055 4 2 2 2.199 1.852 0.055 0.055 5 0.5 0.5 2.907 3.001 0.056 0.056 5 0.5 1 3.168 3.030 0.056 0.056 5 0.5 2 3.973 3.139 0.056 0.056 5 1 0.5 3.417 3.468 0.056 0.056 5 1 1 3.164 3.153 0.056 0.056 5 1 2 3.258 2.968 0.056 0.056 5 2 0.5 3.651 2.971 0.056 0.056 5 2 1 3.849 3.253 0.056 0.056 5 2 2 3.109 3.100 0.056 0.056 6 0.5 0.5 5.542 4.959 0.058 0.058 6 0.5 1 4.969 4.748 0.058 0.058 6 0.5 2 4.715 5.353 0.058 0.058 6 1 0.5 5.652 5.423 0.058 0.058 6 1 1 5.471 5.529 0.058 0.058 6 1 2 4.398 5.314 0.058 0.058 6 2 0.5 4.819 6.177 0.059 0.058 6 2 1 4.271 5.136 0.058 0.058 6 2 2 4.221 5.056 0.058 0.058 Table 4.7: The runtime and memory usage information for standalone recHMM and SERES+recHMM methods on simulation study model conditions. Model conditions were pa- rameterized by the number of sequences, recombination rate 𝜌, and mutation rate 𝜃. Both methods utilize models with 𝜙 = 3 and 𝛾 = 0.005 to infer a posterior probability distribution over gene tree topologies. Average runtime in hours and peak memory usage in GiB are reported across all replicates in a model condition (𝑛 = 30). 4.4 Discussion To evaluate to what extent the posterior probability inferred by each method reflects the topo- logical accuracy, we compared the method performances of the standalone recHMM method and the combined SERES+recHMM method. The correlation between the inferred per-site posterior probability for a gene tree topology 𝑔 and the topological accuracy of 𝑔 as measured by the Robinson- Foulds distance between 𝑔 and the true gene tree topology for a site was used to assess method performance. Across all the simulation model conditions, the posterior probabilities inferred by the SERES+recHMM method had a consistently better correlation with topological accuracy compared to standalone recHMM. The improved performance obtained by combining recHMM inference with 71 Standalone recHMM Probability SERES+recHMM Probability Figure 4.5: Posterior probability distribution of local gene tree topologies inferred by standalone recHMM versus SERES+recHMM method on Indian HIV-1 dataset. We re-analyzed a subset of the Indian HIV-1 genome dataset that was published by [84]; [150] re-analyzed the original dataset using recHMM. Our re-analysis compared local gene tree probabilities computed using standalone recHMM posterior decoding (top panel) versus SERES+recHMM posterior decoding (bottom panel). The plots show posterior decoding probabilities (y-axis) versus genome coordinate (x-axis). Local gene tree probabilities are colored based on the three possible unrooted topologies for the four-taxon dataset (shown in either blue, orange, or green). SERES resampling and re-estimation was robust to a wide range of mutation and recombination rates. We also compared the inferred posterior probability distributions produced by these two meth- ods. We found that the standalone recHMM method produced a similar posterior probability distribution for both true classes and false classes. The SERES+recHMM method produced more distinguishable posterior probability distributions for the true classes and false classes. We attribute these findings to two factors. First, the application of the SERES resampling and re-estimation appears to be conducive to the improved inference of true gene tree topologies. The SERES resampling approach has the ability to retain the sequence dependence in the input alignment to the resampled replicates. And, the intra-sequence dependence among sites provides additional information on the historical evolutionary events, especially those that caused the dependence. The local genealogy inference greatly benefited from the SERES resampling and re-estimation process. Second, the SERES resampling algorithm reveals uncertainties in the inference of local 72 genealogy. Incorrect inferences for the gene tree topologies were less repeatable. The posterior probability distribution of the false class was leftward shifted consistently for all model conditions of the simulated datasets, which indicates that the SERES resampling and re-estimation process produced a consistently low posterior probability for those incorrect local genealogy topologies. Even for the larger dataset with 6 taxa, where solution space was an order of magnitude larger than that of the 4-taxon and 5-taxon datasets, the SERES+recHMM method produced consistently low posterior probabilities for those incorrect gene tree topologies. Although the SERES+recHMM produced low posterior probabilities for the false class over all model conditions, it returned diffused posterior probability distribution for the true class when the dataset size was large, which indicates that uncertainties included in the inference of true gene tree topology were also reflected in the posterior probabilities when the dataset size increased. This result may be caused by the increased computational complexity of HMM learning optimization as the number of input sequences increases. It is likely that conservatively limiting SERES-based re-estimation to 10 learning iterations is insufficient for the larger model conditions in our study. More intensive learning optimization may yield improved re-estimation and a greater performance benefit from augmenting recHMM with SERES. Additional experiments that we performed to evaluate how the choice of the parameters impacted the method performance indicated that the performance advantage returned by SERES+recHMM over standalone recHMM was robust to the choice of reversal probability 𝛾. The results are consistent with the original motivation for sequence-aware resampling and re-estimation. We noted the correspondence between an 𝑟th order Markov process and a SERES random walk with a reversal probability 𝛾. For 𝛾 = 0.5, a first-order Markov process suffices; for 𝛾 < 0.5, higher- order Markovian processes are needed to capture sequential dependence. Essentially, smaller 𝛾 values mean that longer-distance sequential dependence is retained. Our results suggest that there is a certain threshold. When passing that threshold, longer-distance sequential dependence is critical to the performance of resampling and re-estimation for sequence-based inference problems. Experiments with alternative settings for the HMM state space size showed SERES+recHMM’s 73 performance was relatively robust to overfitting, as compared to standalone recHMM analysis. 4.5 Conclusions This study introduced the application of SERES random walks on aligned sequences and showed SERES as a data perturbation technique to improve statistical inference and learning. The simulation experiments showed that the combination of the The SERES resampling approach with the recHMM achieved great improvement in the local genealogy inferences. The empirical study on the HIV genome sequence dataset confirmed the breakpoints detected in the previous studies [84, 150]. The SERES resampling approach achieved great success on recombination detection and local genealogical inference problems. SERES resampling and re-estimation may prove to be similarly beneficial in ancestral recombination inference problems other than local genealogical inference, such as recombination rate estimation [133], recombination hotspot or coldspot detection [99, 5], etc. 74 CHAPTER 5 PHYLOGENETIC SUPPORT ESTIMATION WITH THE RANDOM WALK RESAMPLING APPROACH 5.1 Introduction In modern phylogenetic studies, the true evolutionary history is usually not available for anal- ysis. Phylogenies are generally inferred by statistical methods. Therefore, it is critical to evaluate the reproducibility of the inferred phylogenies. In 1985, Felsenstein proposed a standard bootstrap approach to estimate the confidence intervals for an input phylogeny [36]. This bootstrap method takes a set of sequence alignmentss as input. Then this method generates bootstrap replicates by sampling sites from the input alignment at random with replacements. Alignment sites are considered independently and identically distributed (i.i.d). Phylogenies are inferred for all boot- strap replicates. Given an annotation phylogeny, the bootstrap support of the internal edges of the annotation phylogeny is calculated. The support value is the proportion of the re-estimated phylogenies that contain the same internal edge. Bootstrap support estimation has become the de facto standard for assessing reproducibility in phylogenetics and phylogenomics analysis. This method has been used in almost all modern phylogenetic studies to show the confidence of the involved phylogenies. This makes Felsenstein’s ground-breaking work one of the most cited works in history. Felsenstein’s 1985 paper has become the 41st most cited in all of science, according to the survey of [141], which has been cited over 44,000 times. Many alternative computational algorithms can be used for the purposes of phylogenetic support estimation. For example, non-parametric resampling approaches such as the jackknife [140]; the MSA confidence measurement approaches such as GUIDANCE1 [71, 110], GUIDANCE2 [126], PSAR [65], T-COFFEE [107], and Divvier [1]); parametric MSA resampling or filtering methods applied to the problem of phylogenetic support estimation (e.g., TCS [18], Gblocks [17], and Trimal [16], and Bayesian posterior probability methods [156]). However, these alternative methods are 75 not widely used due to limitations such as parametric models and specific application scenarios. In this study, we proposed a new non-parametric resampling approach, RAWR, which is short for "RAndom Walk Resampling." The RAWR resampling method is a simplified version of the SERES resampling approach described in previous sections. The SERES algorithm is a semi- parametric algorithm for the application of unaligned sequences that requires anchor estimation for synchronization purposes. This increases the method complexity by adding extra parameters such as anchor length, count, and sequence similarity measurement. The RAWR resampling approach does not require anchor estimation. This new algorithm is designed for the phylogenetic support estimation problem by resampling and re-estimation using unaligned sequences. The RAWR- based phylogenetic support estimation shows comparable or typically better performance than the traditional bootstrap support estimation approach. 5.2 Methods The computational problem of the phylogenetic support estimation is described below. The inputs contain an MSA 𝐴 and a phylogenetic tree 𝑇. The MSA 𝐴 is estimated using MSA method 𝑓 , and the phylogenetic tree 𝑇 = (𝑉, 𝐸) is estimated using the phylogenetic inference method 𝑔 on the MSA 𝐴. The output is a set of confidence values ranging from 0 to 1 for all bipartitions split by internal edges in 𝑒 ∈ 𝐸. 5.2.1 RAWR-based Phylogenetic Support Estimation The RAWR-based phylogenetic support estimation takes an estimated MSA 𝐴 and a phylogenetic tree 𝑇 = (𝑉, 𝐸). The problem output consists of confidence interval estimates for each bipartitions defined by non-leaf edge 𝑒 ∈ 𝐸. The estimated MSA 𝐴 is inferred from the unaligned sequences 𝑆 by the MSA method 𝑓 . First, the RAWR resampling approach generates sequence replicates from the input MSA 𝐴. The RAWR resampling approach also utilizes the random walk to conduct the resampling process, similar to the SERES method. The random walk starts with a randomly selected start point 76 and a randomly selected direction. The random walk moves along the input alignment in the initial direction. All sites located on the trace of the random walk are sampled to the resampled replicate. The RAWR algorithm does not require estimation of the anchors to ensure synchronization during the random walk, where the random walk only changes its direction at the barriers, which are sites located at the start and end of anchors, with reverse probability 𝛾. The RAWR resampling approach simplifies this setting and uses all sites of the input estimated MSA 𝐴 as barriers. The random walk certainly changes direction when encountering the first and last sites. The reverse probability is set to 𝛾 elsewhere. The random walk procedure ends once the sampled replicate reaches the desired length. The details of this resampling procedure are described in Algorithm 5.1. Algorithm 5.1: RAWR phylogenetic support estimation 1: procedure RAWRSupport(𝐴, 𝑇, 𝑓 (), 𝑔(), 𝛾, 𝑘 𝑟 ) ⊲ Input: MSA 𝐴, phylogenetic tree 𝑇, MSA method 𝑓 (), ⊲ phylogenetic tree estimation method 𝑔(), reversal probability 𝛾, ⊲ number of replicates 𝑘 𝑟 ⊲ Output: phylogenetic support estimates 𝜖 2: reestimates = <> 3: for 𝑖 = 1 to 𝑘 𝑟 do 4: 𝑋𝑖 = resampleRAWRReplicate(𝐴) 5: reestimates .= 𝑔( 𝑓 (𝑋𝑖 )) 6: for all non-leaf edge 𝑒 in 𝑇 do 7: 𝜖 (𝑒) = proportion of 𝑇𝑖 in list reestimates that display bipartition corresponding to 𝑒 8: return(𝜖) 9: procedure resampleRAWRReplicate(𝐴,𝛾) 10: 𝑌 = <> 11: select 𝑖 ∈ [1, | 𝐴|] and walkDirection uniformly at random 12: while !converged(𝑌 ,𝐴) do 13: 𝑌 .= 𝐴[𝑖] ⊲ add 𝑖th column of 𝐴 to 𝑌 14: if reversal(𝛾) || (𝑖 == 1 && walkDirection is left) 15: || (𝑖 == | 𝐴| && walkDirection is right) then 16: reverse(walkDirection) 17: 𝑖 = next column index after 𝑖 in walkDirection order 18: return(unalign(𝑌 )) ⊲ unalign(𝑌 ) drops indels from 𝑌 19: procedure converged(𝑌 ,𝐴) 20: return(length(𝑌 ) ≥ length(𝐴)) ⊲ Sequence-length-based convergence criterion requires ⊲ number of resampled sites ≥ input MSA length In this study, we set the length of the resampled replicate to be equal to the length of 𝐴. Other ending criteria are also feasible for the random walk, for example, statistical criteria based on the random walk procedure. The indels are removed from the sampled alignment to produce a set of 77 unaligned sequences. Each round of the random walk procedure produces a resampled replicate which consists of a set of unaligned sequences. The random walk process is repeated multiple times independently to obtain a set of resampled replicates. The two-phase methods for phylogenetic inference are widely used in systematic studies. First, the MSA is estimated for the unaligned sequence. Then the estimated MSA is used to infer the phylogenetic tree. We used MAFFT version 7.222 [64] with default settings for the MSA re-estimation. The MAFFT [64] is one of the most accurate MSA estimation methods. We also explored the choice of MSA estimation methods, including ClustalW [88], and FSA[12]. Moreover, it is proved to result in good accuracy in the following phylogenetic inference [79, 80]. The summary statistics for the estimated MSAs are shown in Table 5.3. The phylogenetic tree was inferred from the re-estimated MSA by RAxML version 8.2.11 [130] under the GTR+Γ model with maximum likelihood criteria [118, 155, 142]. The phylogenetic support values are calculated for all internal branches of the input phylogenetic tree. The support value of the internal branch 𝑒 ∈ 𝐸 is the fraction of the re-estimated trees that contain the same internal branch 𝑒. In this study, the default setting of 𝛾 was 1 × 10−1 . We also explored other reverse probability, 𝛾 ∈ {1 × 10−3 , 1 × 10−2 , 2 × 10−2 , 5 × 10−2 , 1 × 10−1 , 2 × 10−1 , 3 × 10−1 }. For each experiment, RAWR was conducted to generate 100 resampled replicates for each input dataset. An illustrated example shows in Figures 5.1, where the RAWR random walk is applied on a 8-taxon dataset. Then an alignment is generated based on the RAWR replicate and a tree is inferred from the estimated alignment. 5.2.2 Bootstrap Phylogenetic Support Estimation. We performed the bootstrap analysis using the RAxML version 8.2.11. The bootstrap was imple- mented in the RAxML software. The bootstrap support estimation first generates 100 bootstrap replicates from the input MSA. Then the phylogenetic tree was re-estimated for the bootstrap repli- cate. The support values of the internal branches are calculated using the same process as the 78 t1 - - - - - - - - - A C G A - - - C A A C A A T A T A A C T A A T G A C G G G A C C C A T T G T T T G T T - - - - - - - - - t2 - - - - - - - A C A C A G - - - - - - - A T T A A A G A C A G C A C A G A T G G A T A T A T C T G A C C A G - - - - - - - RAWR t3 - - - - - - - A T C C A G G G T C C G C A G T T A T G C A A A C G G G G A T G A A T A C A T C - - - - - - - - - - - - - - t4 Random Walk t5 A - T - G - A - A - A - T - A A C T C C C C A A A - - - - - - - - - - - T - C - G - T - C - C - A - G - G G T C T A C A G A C C A G C G G G A G T C A T T G A A G A T T G A C T A A - T - C - T - A - A - T - C - A - G - T - T - C - T - C - A - G On Input MSA t6 - - - - - - - - - G C G A - - - - - - - A C T A T G A C T A A T G C T A G G T C C C A T G T T C A A T G T G C T - - - - - t7 A T A A A A T A T C C A A - - - - - T T G T C C A G G T T C G C A C G A T A T A G C G C A - - - - - - - - - - - - - - - - t8 - - - - - - - A C A C A G - - - - - - - A T T A A A G T C A G C A C A G A T G G A T A T G T C T G A C C A G - - - - - - - Reversal breakpoint (not included in RAWR replicate) t1 A G G G C A G T A T G T A A T C A A T A T A A C A A C - - - A G C A - A C G A - - - C A A C A A T A T A A C A A T A T A A t2 G T A G A C A C G C A C G A C A G A A A T T A - - - - - - - G A C A C A C A G - - - - - - - A T T A A A G A G A A A A A G t3 G T A G G G G C A C G C A A A C G T A T T G A C G C C T G G G A C C T C C A G G G T C C G C A G T T A T G C G T A T A T G Resampled t4 T A T A G C A C G C A C G C T T G G A C C T G C T - - - - - A A C C C C C A A - - - - - T C G T C C A G G T G G A C A G G Data t5 t6 G T C G G G G C A C G C A A A C G - - - - - - - - - - - - - - A C C T C C A - - - - - - - - - - - - - - G C G - - - - - G T G G A T C G T A T G T A A T C A G T A T C A - - - - - - - A G C G - G C G A - - - - - - - A C T A T G A C A G T A T G A t7 T A T A G C A C G C A C G C T T G G A C C T G T T - - - - - A A C C T C C A A - - - - - T T G T C C A G G T G G A C A G G t8 G T A G A C A C G C A C G A C T G A A A T T A - - - - - - - G A C A C A C A G - - - - - - - A T T A A A G T G A A A A A G Reversal breakpoint (not included in RAWR replicate) t1 A G G G C A G T A T G T A A T C A A T A T A A C A A C A G C A A C G A C A A C A A T A T A A C A A T A T A A t2 G T A G A C A C G C A C G A C A G A A A T T A G A C A C A C A G A T T A A A G A G A A A A A G Unalign Data t3 G T A G G G G C A C G C A A A C G T A T T G A C G C C T G G G A C C T C C A G G G T C C G C A G T T A T G C G T A T A T G t4 T A T A G C A C G C A C G C T T G G A C C T G C T A A C C C C C A A T C G T C C A G G T G G A C A G G To Obtain t5 G T C G G G G C A C G C A A A C G A C C T C C A G C G G RAWR Replicate t6 T G G A T C G T A T G T A A T C A G T A T C A A G C G G C G A A C T A T G A C A G T A T G A t7 T A T A G C A C G C A C G C T T G G A C C T G T T A A C C T C C A A T T G T C C A G G T G G A C A G G t8 G T A G A C A C G C A C G A C T G A A A T T A G A C A C A C A G A T T A A A G T G A A A A A G t1 - - - - A G G G C A G T A T G T A A T C A A T A T A A C A A C A G C A A C G A C A A C A A T A T A A C A A T A T A A - - - - - - - t2 - - - - G T A G A C A C G C A C G A C A G A A A T T A G A C A C A C A G A T T A A A G A G A A A A A G - - - - - - - - - - - - - - Re-estimate t3 - - - - G T A G G G G C A C G C A A A C G T A T T G A C G C C T G G G A C C T C C A G G G T C C G C A G T T A T G C G T A T A T G t4 MSA On t5 T - A - T - A - G G C T A C C G G G C G A G C C G A C C T G T C G A G A A A C C C - T - G - C - T - A - A - C - C - C - C - C - A - A - T G C A G C T C C T C C A C G A G G T C G G G G A - C - A - G - G - - - - - - - - - - - - - - - - - - - - - - - - - - - - - RAWR Replicate t6 - - - - T G G A T C G T A T G T A A T C A G T A T C A A G C G G C G A A C T A T G A C A G T A T G A - - - - - - - - - - - - - - - t7 T A T A G C A C G C A C G C T T G G A C C T G T T A A C C T C C A A T T G T C C A G G T G G A C A G G - - - - - - - - - - - - - - t8 - - - - G T A G A C A C G C A C G A C T G A A A T T A G A C A C A C A G A T T A A A G T G A A A A A G - - - - - - - - - - - - - - t1 Re-estimate t6 Tree On t3 Re-estimated t5 MSA t7 t4 t8 t2 Figure 5.1: An illustrated example of RAWR resampling and re-estimation. The first step of the RAWR-based phylogenetic support estimation is sequence resampling. A random walk is performed on the input MSA. MSA sites are resampled during the random walk. The indels are removed from the resampled sites to produce a resampled replicate. Then, MSA is re-estimated from the resampled sequence. Finally, a phylogenetic tree is re-estimated using the re-estimated MSA as input. The dashed lines in the first and second subplots show the reversal breakpoints. RAWR-based phylogenetic support estimation process. 5.2.3 Additional Performance Study We explored different parameter settings, such as the choice of the MSA estimation method, for the RAWR-based support estimation method. We explored alternative MSA methods for the estimation and re-estimation of MSAs. We used ClustalW in the performance study to explore the impact of alignment quality on downstream phylogenetic inference and support estimation. We selected ClustalW because it is one of the most widely used MSA methods in computational biology and bioinformatics. For our experiments, we used ClustalW version 2.1 with default settings. The detailed summary statistics of the ClustalW-estimated MSAs are shown in Table 5.1. 79 ClustalW alignment Model RAxML(ClustalW) length SP-FN SP-FP condition nRF 10.A 1216 0.670 0.747 0.207 10.B 1236 0.726 0.809 0.236 10.C 1184 0.835 0.889 0.379 10.D 1171 0.828 0.874 0.500 10.E 1163 0.901 0.926 0.650 Table 5.1: Summary statistics for ClustalW-estimated alignments and RAxML(ClustalW) trees on 10-taxon model conditions. Table layout and description are otherwise identical to Table 5.3. To better evaluate the performance of the RAWR-based support estimation method, we applied the RAWR resampling approach to model conditions with different parameter settings of the insertion/deletion model. We conducted a simulation experiment using additional model conditions from the original study of the SERES resampling algorithm [145] where we utilized the long gap length distribution from the study of Liu et al. [80], rather than the medium gap length distribution used elsewhere in our simulation study. The simulation and experimental procedures were exactly the same across all model conditions in our simulation study. The model parameters and summary statistics of model conditions using the long gap length distribution are listed in Table 5.2. True alignment MAFFT alignment Model Insertion Model tree deletion ANHD Gapiness Length Length SP-FN SP-FP condition height rate 10.long.A 0.4 0.13 0.276 0.440 1804.8 1433.7 0.272 0.315 10.long.B 0.7 0.1 0.363 0.481 1926.7 1447.8 0.381 0.426 10.long.C 1 0.06 0.455 0.456 1853.5 1413.3 0.510 0.537 10.long.D 1.6 0.031 0.542 0.432 1754.1 1403.1 0.725 0.729 10.long.E 4.3 0.013 0.660 0.445 1811.0 1560.1 0.899 0.897 Table 5.2: Long-gap-length model conditions: parameter values and summary statistics. Our simulation study included additional 10-taxon model conditions that utilized the long gap length distribution from the study of 2012 study of Liu et al. [80]. The model parameters consisted of model tree height and insertion/deletion probability, and each model condition corresponds to a distinct set of model parameter values. The long-gap-length model conditions are named 10.long.A through 10.long.E in order of generally increasing sequence divergence. The following table columns list average summary statistics for each model condition (𝑛 = 20). “NHD” is the average normalized Hamming distance of a pair of aligned sequences in the true alignment. “Gappiness” is the percentage of true alignment cells which consists of indels. “True align length” is the length of the true alignment. “Est align length” is the length of the MAFFT-estimated alignment [64] which was provided as input to the support estimation methods. “SP-FN” and “SP-FP” are the proportion of homologies that appear in the true alignment but not in the MAFFT-estimated alignment and vice versa, respectively. The table and caption are reproduced from the original study of the SERES resampling algorithm [145]. 80 We mainly compared the method performance of the RAWR-based support estimation method and the bootstrap method. There are many other well-designed methods for the problem of phy- logenetic tree support estimation. Alternatives include other non-parametric resampling methods such as the jackknife [140] and parametric resampling [147, 37], such as MSA-specific confidence measures and other alignment-oblivious phylogenetic support estimation methods. Alternatives have also been proposed for the last step of the phylogenetic support calculation. For example, the transfer bootstrap expectation (TBE) method [74], which pairs bootstrap resampling of MSAs and phylogenetic tree re-estimation with an alternative support calculation. We conducted experiments using one of the MSA-specific confidence measures, GUIDANCE2, and parametric resampling to perform the support estimation of phylogenetic trees. We also compared the RAWR-based support estimation performance with the alRT, one of the alignment-oblivious phylogenetic sup- port estimation methods. Finally, we compared the method performance of the RAWR+TBE and bootstrap+TBE support estimations. The GUIDANCE2 [126] is one of these methods that was originally developed for the estima- tion of confidence intervals of multiple sequence alignments. GUIDANCE2 combines the HoT algorithm with altering parameters used in progressive MSA estimation methods, such as guide tree and gap penalties, to generate parametric resampling techniques. In our experiment, we ran GUIDANCE2 with default settings. For each dataset, GUIDANCE2 was used to resample 100 replicates, and re-estimation was performed on each resampled replicate using an identical proce- dure as in the RAWR and bootstrap analyses. We calculated the phylogenetic tree support using the RAxML software with the same command as used in other simulation experiments in our study. Although the HoT algorithm was initially designed for MSAs, and GUIDANCE2 was originally developed for MSA confidence interval placement, it is natural to consider the impact of MSA quality on downstream phylogenetic inference. As we demonstrate in the performance study, a new application of these parametric and semi-parametric techniques beyond their originally intended use can bring value. We used the aLRT [3] implementation in PhyML version 3.0 [49] to run two aLRT analy- 81 ses. First, We ran a parametric aLRT support analysis under the general time reversible (GTR) nucleotide substitution model [135], where a phylogenetic tree topology was estimated alongside GTR substitution model parameters and branch lengths. The second analysis is that, we ran a fixed-topology analysis, where the fixed topology consisted of the annotation topology that was estimated by RAxML. The fixed-topology analysis included estimation of branch lengths and GTR substitution rates/frequencies using PhyML. TBE [74] was run using the same annotation MSA, annotation tree, and re-estimated trees as in the rest of our performance study. Re-estimated trees were generated by either bootstrap resampling and re-estimation or RAWR resampling and re-estimation. The equivalent inputs enable a comparison across TBE and other methods in our study. The TBE analyses were run using Booster v. 0.1.2. Since the RAWR random walk is continuous for each replicate, the random walk can be concentrated in a narrow region. In this scenario, sites outside the narrow region covered by the random walk were ignored in the statistical inference involving RAWR resampling. We proposed and evaluated an alternative random-walk-based phylogenetic support estimation procedure to help the RAWR random walk better explore the input alignment. The alternative procedure replaced random reversals in the RAWR resampling procedure with random teleportation. The details of this new support estimation algorithm shows in Algorithm 5.2. For this reason, we refer to the alternative method as "RAWR+teleport." Algorithm 5.2: RAWR+teleport resampling procedure 1: procedure resampleWithRAWR+Teleport(𝐴,𝛾) 2: 𝑌 = <> 3: select 𝑖 ∈ [1, | 𝐴|] and walkDirection uniformly at random 4: while !converged(𝑌 ,𝐴) do 5: 𝑌 .= 𝐴[𝑖] ⊲ add 𝑖th column of 𝐴 to 𝑌 6: if teleport(𝛾) then ⊲ Biased coin flip with bias 𝛾 7: select 𝑖 ∈ [1, | 𝐴|] and walkDirection uniformly at random 8: else 9: if (𝑖 == 1 && walkDirection is left) || (𝑖 == | 𝐴| && walkDirection is right) then 10: reverse(walkDirection) 11: 𝑖 = next column index after 𝑖 in walkDirection order 12: return(unalign(𝑌 )) ⊲ unalign(𝑌 ) drops indels from 𝑌 82 5.2.4 Simulated datasets We utilized the same model conditions and the simulation datasets used in the previous studies [145, 80] for the simulation dataset. The simulated datasets cover a wide range of dataset sizes and evolutionary divergence. The simulation process of the benchmark datasets is described as follows. For the 10-taxon and 50-taxon datasets, the INDELible version 1.03 [39] was used to sample non-ultrametric trees. The branch lengths were sampled randomly within the range of 0 and 1. For the 100-taxon datasets, r8s version 1.7 [123] was used to sample random birth-death model trees. The sampled ultrametric model trees were deviated using the procedure described in the study by Roshan et al. [102]. The deviation factor 𝑐 = 2.0. All model trees were rescaled to the specified tree height ℎ described in the corresponding model conditions. The nucleotide sequences were simulated along the model trees under the general time-reversible (GTR) model of substitution and the insertion/deletion model [39]. Details about the GTR model and insertion/deletion models are described in Chapter 2. The base frequency and substitution rate parameters were obtained from the 2012 study by Liu et al. [80]. The root sequence length was set to be 1kb. INDELible version 1.03 was used to simulate nucleotide sequences for the 10-taxon and 50- taxon model conditions. ROSE was used to simulate nucleotide sequences for the 100-taxon model conditions with the indel model described in Liu’s 2012 study [80] with a medium gap length distribution. The above simulation process was repeated 20 times independently to produce multiple replicate datasets. The model parameters and summary statistics of the simulated datasets are shown in Table 5.3. 5.2.5 Empirical datasets We used the empirical benchmark dataset from the Comparative RNA Website (CRW) database (www.rna.icmb.utexas.edu) [14]. The CRW rRNA datasets [14] contain sequence alignments that were generated based on structural information and intensive manual correction. This benchmark dataset has been widely used in the evaluation and comparison of MSA approaches. We estimated 83 True alignment MAFFT alignment Number Model Insertion Model RAxML of tree deletion ANHD Gapiness Length Length SP-FN SP-FP condition nRF taxa height rate 10.A 10 0.47 0.13 0.380 0.591 2466 1543 0.566 0.629 0.186 10.B 10 0.7 0.1 0.479 0.618 2691 1602 0.687 0.750 0.243 10.C 10 1.2 0.06 0.591 0.645 2832 1588 0.811 0.850 0.443 10.D 10 2 0.031 0.642 0.591 2490 1583 0.815 0.841 0.464 10.E 10 4.4 0.013 0.696 0.578 2390 1623 0.904 0.913 0.664 50.A 50 0.45 0.06 0.415 0.667 3070 2053 0.340 0.336 0.084 50.B 50 0.73 0.03 0.513 0.603 2525 1834 0.451 0.431 0.146 50.C 50 1.2 0.02 0.598 0.620 2646 1950 0.731 0.704 0.322 50.D 50 2 0.012 0.667 0.629 2720 2171 0.902 0.881 0.517 50.E 50 4.3 0.005 0.715 0.591 2474 2385 0.974 0.965 0.755 100.A 100 4 1 × 10−5 0.454 0.331 1682 1533 0.054 0.046 0.075 100.B 100 7 1 × 10−5 0.540 0.439 2263 1861 0.209 0.176 0.119 100.C 100 15 5 × 10−5 0.646 0.571 2317 2418 0.680 0.603 0.470 100.D 100 25 2 × 10−5 0.683 0.634 1837 2799 0.899 0.853 0.607 100.E 100 20 4 × 10−5 0.672 0.614 2487 2701 0.848 0.796 0.661 Table 5.3: Model condition parameters and summary statistics of the simulation datasets. Model condition parameters consisted of the number of taxa, tree height, and insertion/deletion probability. The model conditions are named from A to E to represent increasing evolutionary divergence. The average summary statistics are reported for the true alignments, and the MAFFT-estimated alignments over 𝑛 replicate datasets (𝑛 = 20). “ANHD” is the average normalized Hamming distance of a pair of aligned sequences in an MSA, “Gappiness” is the proportion of an MSA matrix that consists of indels, “length” is the number of MSA columns, and “SP-FN” and “SP-FP” are the proportions of residue pairs that appear in the true alignment but not in the estimated alignment or vice versa, respectively. The average normalized Robinson-Foulds distance ( “nRF”) between the model tree and the RAxML(MAFFT)-inferred tree is also reported over 𝑛 replicate datasets (𝑛 = 20). MLE trees from the reference alignment. The RAWR-based phylogenetic support estimation and the bootstrap phylogenetic support estimation were conducted on the empirical benchmark dataset using the same command as in the simulation study. Simulation studies and empirical benchmarking were designed to focus on non-coding DNA sequence evolution. Therefore, we selected the intronic rRNA datasets with a range of evolutionary divergence and dataset size to perform the empirical experiments for consistency purposes. Sequences with more than 99% missing data were filtered from the analysis. The summary statistics of the empirical datasets are shown in Table 5.4. 5.2.6 Performance measurement. We used the precision-recall (PR) curve and the area under the PR curve (PR-AUC) to evaluate the performance of both phylogenetic support estimation methods. For the following reasons, we 84 Reference alignment MAFFT alignment Number RAxML Dataset ANHD Gapiness Length Length SP-FN SP-FP of taxa nRF IGIA 110 0.606 0.915 10368 6065 0.732 0.780 0.645 IGIB 202 0.579 0.910 10633 7070 0.825 0.863 0.678 IGIC2 32 0.533 0.700 4243 3530 0.691 0.716 0.517 IGID 21 0.719 0.782 5061 3063 0.874 0.905 0.778 IGIE 249 0.451 0.838 2751 2847 0.406 0.389 0.585 IGIIA 174 0.668 0.814 6406 6945 0.817 0.800 0.450 Table 5.4: Summary statistics of empirical datasets. Summary statistic calculations and descriptions identical to Table 5.3. selected the PR curve and its area, PR-AUC, to measure the method performance rather than the Receiver Operating Characteristic (ROC) curve and the area beneath it, ROC-AUC. Although the ROC curve is one of the most commonly used statistics for evaluating the performance of binary- response statistical inferences, it can be misleading when the proportions of bipartitions that appear in or not in the true phylogeny are very imbalanced [121]. The precision is sensitive to the imbalance of the dataset. This makes the PR curve a more accurate measurement of the imbalanced dataset. The PR curve is calculated on thresholds ranges from 0 to 1. A confusion matrix is calculated for each threshold based on the model prediction and the ground truth. The support values estimated by the phylogenetic support estimation methods serve as the model predictions. The phylogenetic support estimation methods generate support values for bipartitions, which is a unique split of the leaf set defined by the internal branches of a given annotation phylogeny. In both simulation studies and empirical study, the annotation phylogenies are inferred from the simulated sequences. The labels indicating whether a bipartition appears in the model tree were generated by the comparison of the inferred tree and the model/reference tree. The bipartitions that appear in the model tree/reference tree are represented by 1, and the other branches are represented by 0. True positives (TP) consist of bipartitions of the estimated tree that have support values greater than or equal to a given threshold and appear in the reference tree. False positives (FP) consist of bipartitions of the estimated tree that have support value greater than or equal to a given threshold but do not appear in the reference tree. False negatives (FN) consist of bipartitions of the estimated tree that have support less than a given threshold but appear in the reference tree. True negatives (TN) consist of bipartitions of the estimated tree that have support less than a given threshold and 85 do not appear in the reference tree. The PR curve plots the recall versus the precision of each |TP| |TP| threshold. The recall is calculated by |TP|+|FN| . And the precision is calculated by |TP|+|FP| . We used custom scripts, and the scikit-learn Python library [136] to calculate the curves and AUC quantities. We also compared the runtime and peak memory usage for these two phylogenetic support estimation methods. All experiments were conducted on computing facilities in the Michigan State University High-Performance Computing Center. We used compute nodes in the intel16-k80 cluster, each with a 2.4 GHz 14-core Intel Xeon E5-2680v4 processor. 5.3 Results 5.3.1 Simulation Study 5.3.1.1 Performance comparison of RAWR versus bootstrap. The PR-AUC results of both methods on the simulation datasets are shown in Table 5.5. The RAWR-based approach produces comparable or better PR-AUC results than the bootstrap method on all simulation datasets. We conducted pairwise t-tests with Benjamini Hochberg correction [8] with 𝑛 = 20 and 𝛼 = 0.05. The results show that the improvements achieved by the RAWR-based approach were statistically significant for all model conditions except for two 100-taxon model conditions with the lowest sequence divergence. The performance of the two phylogenetic support estimation approaches was comparable on these two 100-taxon model conditions. The PR-AUC improvement achieved by the RAWR-based approach increases as the sequence divergence grows. The PR-AUC improvement grew from 0.045 to 0.246 for the 10.A model condition to 10.E model condition, 0.009 to 0.334 for the 50.A model condition to 50.E model condition, and -0.004 to 0.291 for the 100.A to 100.E model condition. This finding indicates that the RAWR support estimates show better performance on more challenging datasets. The largest PR-AUC improvements of the RAWR-based approaches over the bootstrap approach were 0.334 and 0.160, respectively. The average PR-AUC improvements were 0.136 and 0.039, 86 respectively. One possible reason is that the RAWR-based support estimation approaches conduct re-estimation for both MSAs and phylogenetic trees. While the bootstrap support estimation only includes the re-estimation of phylogenetic trees. The bootstrap resampling process breaks the intra-sequence dependence, and the bootstrap replicates lose the sequence homology and cannot produce meaningful alignments. PR-AUC Model Corrected Bootstrap RAWR-reduced RAWR condition q-value 10.A 0.951 0.989 0.996 8.2 × 10−3 10.B 0.920 0.978 0.990 4.2 × 10−3 10.C 0.784 0.927 0.977 4.2 × 10−3 10.D 0.822 0.950 0.968 4.2 × 10−3 10.E 0.679 0.976 0.925 1.5 × 10−4 50.A 0.988 0.993 0.997 4.3 × 10−3 50.B 0.970 0.990 0.994 5.4 × 10−4 50.C 0.900 0.980 0.989 4.9 × 10−6 50.D 0.798 0.981 0.988 < 10−10 50.E 0.663 0.990 0.997 < 10−10 100.A 0.997 0.990 0.993 < 10−10 100.B 0.990 0.986 0.991 < 10−10 100.C 0.828 0.971 0.982 7.2 × 10−9 100.D 0.735 0.973 0.983 < 10−10 100.E 0.695 0.975 0.986 < 10−10 Table 5.5: PR-AUC performances on the simulation datasets. The PR-AUC are aggregated over 𝑛 replicate datasets for a model condition (𝑛 = 20). Statistical significance of PR-AUC differences between RAWR and bootstrap were evaluated using a one-tailed pairwise t-test and a multiple test correction was performed using the method of [8]. Corrected q-values are reported (𝑛 = 20). The runtime and peak memory usage for all methods are shown in Figure 5.2. Compared to the bootstrap method, the RAWR-based support estimation methods require one extra step: re- estimation of the resampled replicates. This additional step greatly increased the runtime of the RAWR-based approaches. The runtimes of all the methods were relatively short for the 10-taxon model conditions, which usually takes less than an hour. As the taxa number and the sequence divergence increased, the runtime increased by an order of magnitude. For the 50-taxon model conditions, the most time-consuming method, the GUIDANCE2 method, took over 10 hours for all five model conditions. For the 100-taxon model condition with the highest sequence divergence, the support estimation procedure costs half a day to multiple days. The runtime increase was mainly caused by the increase in the computational complexity of MSA estimation and phylogenetic tree 87 inference. 102 102 102 Running Time (h) Running Time (h) Running Time (h) 101 101 101 100 100 100 10.A 10.B 10.C 10.D 10.E 50.A 50.B 50.C 50.D 50.E 100.A 100.B 100.C 100.D 100.E bootstrap RAWR-reduced GUIDANCE2 RAWR 0.18 0.18 0.18 0.16 0.16 0.16 0.14 0.14 0.14 Peak Memory (GiB) Peak Memory (GiB) Peak Memory (GiB) 0.12 0.12 0.12 0.10 0.10 0.10 0.08 0.08 0.08 0.06 0.06 0.06 0.04 0.04 0.04 0.02 0.02 0.02 0.00 0.00 0.00 10.A 10.B 10.C 10.D 10.E 50.A 50.B 50.C 50.D 50.E 100.A 100.B 100.C 100.D 100.E bootstrap GUIDANCE2 RAWR-reduced RAWR Figure 5.2: Runtime and memory usage of the phylogenetic support estimation methods on sim- ulation datasets. The top row includes average runtime usage for each model condition in the simulated study. The y-axis shows runtime in hours and is in log-scale. The left, middle right subplots represent 10-taxon, 50-taxon, and 100-taxon model conditions respectively. The bottom row includes average memory usage for each model condition in the simulated study. The y-axis shows runtime peak memory usage in GiB. The left, middle right subplots represent 10-taxon, 50-taxon, and 100-taxon model conditions respectively. The average runtime or peak memory usage were calculated across all replicate datasets in the model condition (𝑛 = 20). For the peak memory usage, the bootstrap method used the least amount of memory compared to the other methods, which is similar to the quantitative comparison of the runtime. The differences in the peak memory usage among all the support estimation methods were relatively small. The peak memory usage of the support estimation methods was usually about a few hundred MiB. However, according to the previous studies [80, 95] and the computational difficulty in this study, we expect that as the dataset size increases, the memory limitations will quickly become a significant bottleneck. Overall, the simulation study experiments indicate that the RAWR-based support estimation ap- 88 proach improves performance compared with the widely-used bootstrap support estimation method. This improvement requires additional time and memory usage compared to a standard bootstrap analysis. 5.3.1.2 RAWR support estimation using reduced resampling replication. In the simulation study, we included a reduced RAWR-based support estimation method, which we refer to as the "RAWR-reduced" method in the following. For the reduced RAWR-based method, we used an order of magnitude fewer resampled replicates compared with the regular RAWR-based method analysis and the bootstrap method. The standard RAWR-based method or the bootstrap method requires 100 resampled replicates, and the reduced RAWR-based approach conducts the support estimation with 10 resampled replicates. Though the resampled replicates were reduced by an order of magnitude, the RAWR-reduced method produced comparable or better performance than the bootstrap method. Like the standard RAWR method, the RAWR-reduced method achieved greater PR-AUC improvements under model conditions with a large dataset size or higher sequence divergence. In comparing the PR-AUC of the RAWR-reduced method to the regular RAWR method, there was an average improvement of 0.007 across all model conditions, which is shown in Table 5.5. The RAWR-reduced method used slightly more runtime than the bootstrap method on all model conditions and less runtime than the regular RAWR method. The RAWR-reduced method used similar peak memory usage compared to the regular RAWR method. The RAWR-reduced method uses much less runtime and comparable peak memory by reducing the resampled replicates, but the performance is comparable or even better than the regular RAWR method. 5.3.1.3 Results of Additional Performance Study As in other performance studies of MSA and phylogenetic tree estimation from unaligned sequence inputs [79, 80], we found that MAFFT generally produced more accurate alignments than ClustalW on the 10-taxon model conditions, although this accuracy improvement did not translate directly to 89 PR-AUC Model MAFFT ClustalW condition 10.A 0.996 0.997 10.B 0.990 0.991 10.C 0.977 0.988 10.D 0.968 0.992 10.E 0.925 0.964 Table 5.6: RAWR support estimation using alternative estimation/re-estimation methods. We compared RAWR support estimation using two different estimation/re-estimation methods: either MAFFT and RAxML(MAFFT) or ClustalW and RAxML(ClustalW). For each of the two methods, aggregate PR-AUC is shown across all replicate datasets of each model condition (𝑛 = 20). more accurate downstream phylogenetic inference. The details of the alignment summary statistics refer to Table 5.3 and Table 5.1. Despite the alignment accuracy, RAWR returned comparable PR-AUC regardless of which of the two MSA methods were used on the 10.A and 10.B model conditions. On the more divergent 10.C through 10.E model conditions, when using ClustalW for estimation and re-estimation of the MSA, RAWR improved the PR-AUC performance by 0.011, 0.024, and 0.039 comparing to that used MAFFT. The results are shown in Table 5.6. Our finding suggests that RAWR support estimation is robust to the quality of the annotation MSA. Furthermore, this result suggests that neighbor-preserving random walks may yield better support estimates where computational problems are more challenging, and estimation uncertainty is greater. For the experiments where we used alternative choices for reversal probability 𝛾 for the RAWR support estimation, on each 10-taxon model condition except for the 10.C model condition, RAWR returned similar PR-AUC as the reversal probability 𝛾 was increased from 0.001 up until a critical threshold. PR-AUC then dropped as 𝛾 increased past the threshold (Table 5.7). The exact threshold varied somewhat across model conditions. More generally, we observed a range of RAWR 𝛾 settings that returned the highest PR-AUC, where the range typically crossed one to two orders of magnitude. For the experiments on long-gap-length model conditions, similar performance outcomes were observed compared to the medium-gap-length simulations in the rest of our simulation study. The RAWR method produced a comparable or better PR-AUC than bootstrap. The improvement 90 Reversal probability 𝛾 Model 1 × 10−3 1 × 10−2 2 × 10−2 5 × 10−2 1 × 10−1 2 × 10−1 3 × 10−1 condition 10.A 0.997 0.998 0.998 0.996 0.994 0.986 0.980 10.B 0.994 0.990 0.991 0.990 0.987 0.985 0.977 10.C 0.942 0.941 0.950 0.977 0.968 0.957 0.901 10.D 0.977 0.982 0.978 0.95 0.944 0.935 0.934 10.E 0.969 0.978 0.971 0.983 0.929 0.923 0.922 Table 5.7: Simulation study: RAWR support estimation using different choices for reversal prob- ability 𝛾. Aggregate PR-AUC is reported across all replicate datasets of each 10-taxon model condition (𝑛 = 20). PR-AUC Model Bootstrap RAWR condition 10.long.A 0.997 0.997 10.long.B 0.992 0.994 10.long.C 0.904 0.937 10.long.D 0.829 0.949 10.long.E 0.552 0.788 Table 5.8: PR-AUC comparison of bootstrap and RAWR methods on 10-taxon long-gap- length model conditions. MAFFT and RAxML(MAFFT) were used to perform MSA and tree estimation/re-estimation, respectively. Each method’s PR-AUC is reported as an aggregate across all replicate datasets for a model condition (𝑛 = 20). achieved by the RAWR method was as much as 0.236. The RAWR-based method achieved a larger PR-AUC advantage on model conditions with greater evolutionary divergence. Both methods returned slightly lower PR-AUC for the long-gap-length model conditions compared to the medium-gap-length model conditions. But the PR-AUC improvements obtained by the RAWR- based support estimation method over the bootstrap method were largely unaffected by the gap length distribution used for simulation. This suggests that RAWR’s performance is robust to increasing dependence between neighboring sites due to longer insertions and deletion events. For the PR-AUC comparisons of aLRT and RAWR-based methods on simulated datasets, the RAWR-based support estimation method consistently produced a better PR-AUC than the aLRT methods over all model conditions in the simulation study. The results are shown in Tables 5.9. Furthermore, as the sequence divergence increases, the PR-AUC improvement of the RAWR support estimation method tended to increase. The original aLRT method was proposed to address the high computational cost of the standard phylogenetic bootstrap method, which requires re- estimation of multiple bootstrap replicates. The computational efficiency of the LRT method 91 PR-AUC Model aLRT with aLRT with RAWR condition free topology fixed topology 10.A 0.952 0.939 0.996 10.B 0.884 0.884 0.990 10.C 0.722 0.745 0.977 10.D 0.757 0.784 0.968 10.E 0.631 0.621 0.925 50.A 0.979 0.980 0.997 50.B 0.960 0.961 0.994 50.C 0.870 0.876 0.989 50.D 0.711 0.710 0.988 50.E 0.548 0.556 0.997 100.A 0.986 0.987 0.993 100.B 0.976 0.969 0.991 100.C 0.775 0.773 0.982 100.D 0.663 0.670 0.983 100.E 0.592 0.593 0.986 Table 5.9: PR-AUC comparison of aLRT and RAWR methods for phylogenetic support estimation. We used PhyML [49] to run two types of aLRT analyses: (1) support estimation or a free tree topology that was also estimated as part of the analysis, and (2) support estimation for a RAxML- inferred tree topology. The latter methodology for obtaining an annotation tree is identical to the approach used in all other experiments in our study, and its PR-AUC performance is therefore directly comparable to other simulation study experiments. Table layout and description are otherwise identical to Table 5.8. is obtained using statistical approximations that represent a potential tradeoff in terms of type I/II error. Our findings support these observations, as the PR-AUC achieved by both the aLRT methods also underperformed the traditional bootstrap support estimation. We note one critical difference between our study and the study of Anisimova and Gascuel [3]. Our study provided estimated annotation MSAs as input to phylogenetic support estimation methods, whereas the study of Anisimova and Gascuel utilized the true alignments. The relative performance comparisons of aLRT and RAWR can be attributed in part to the major impact of MSA quality on downstream phylogenetic and phylogenetic support estimation. In the PR-AUC comparisons of the TBE and RAWR methods on simulated datasets, the RAWR method consistently outperformed the original TBE method [74] over all model conditions in our simulation study. One exception was the least divergent 100-taxon model condition, where both methods returned comparable PR-AUC. The PR-AUC comparisons of the TBE and RAWR methods are shown in Table 5.10. TBE support calculation is downstream of input data resampling. The original TBE support 92 estimation method utilized standard bootstrap resampling, which we refer to as TBE+bootstrap. As noted in the original study of TBE [74], the orthogonality of these two problems allows other resampling techniques to be easily replaced. Thus, we included a third method that joined TBE sup- port calculation with RAWR resampling, which we refer to as TBE+RAWR. TBE+RAWR returned comparable or improved PR-AUC compared to TBE+bootstrap, and the largest improvements were obtained on the most divergent model conditions. However, neither of the TBE methods outperformed standard RAWR support estimation, which uses a traditional binary test for the presence or absence of bipartition to calculate phylogenetic support. We also did not observe PR-AUC comparisons that suggested a type I/II error advantage for the original TBE method over traditional phylogenetic bootstrap support estimation. Our findings differ from the original study of the TBE method [74], which we attribute to the following factors. As noted above, a major difference between the two studies is MSA quality: the former utilizes estimated MSAs, and the latter utilizes true MSAs. Furthermore, Lemoine et al. [74] noted that, by definition of the bipartition transfer distance, TBE support is always greater than or equal to traditional bootstrap support for a given set of inputs. Based on our findings, we conjecture that an optimistic support measure is beneficial for addressing type II errors but could be counterproductive for type I errors. Among other important assumptions, for example, treating indels as missing data or an addi- tional state, theoretical guarantees about TBE and phylogenetic bootstrap support implicitly assume that input sequences are aligned without error. However, incorrect sequence homology and other misalignments will require a different set of theoretical and applied considerations. Our experi- ments suggest that sequence-aware resampling and re-estimation have an important role to play in phylogenetic support estimation. We also compared the performance of RAWR versus GUIDANCE2, a state-of-the-art purpose- built fully parametric method for placing confidence intervals on estimated multiple sequence alignments. The application of an MSA confidence assessment method like GUIDANCE2 to the downstream task of phylogenetic support estimation differs from its original intended purpose. 93 PR-AUC Model TBE with TBE with RAWR condition bootstrap resampling RAWR resampling 10.A 0.943 0.982 0.996 10.B 0.913 0.959 0.990 10.C 0.773 0.894 0.977 10.D 0.823 0.924 0.968 10.E 0.670 0.869 0.925 50.A 0.986 0.983 0.997 50.B 0.965 0.967 0.994 50.C 0.888 0.943 0.989 50.D 0.785 0.950 0.988 50.E 0.655 0.968 0.997 100.A 0.995 0.984 0.993 100.B 0.985 0.962 0.991 100.C 0.806 0.922 0.982 100.D 0.722 0.934 0.983 100.E 0.680 0.930 0.986 Table 5.10: PR-AUC comparison of TBE with bootstrap resampling, TBE with RAWR resam- pling, and RAWR. TBE was used to estimate phylogenetic support using two different resampling approaches: either (1) standard bootstrap resampling, which corresponds to the method originally proposed and studied by Lemoine et al. [74], or (2) RAWR resampling. The former is denoted “TBE with bootstrap resampling”, and the latter is denoted “TBE with RAWR resampling”. For comparison purposes, RAWR resampling and re-estimation was also run as a third method (denoted “RAWR”), and we used the same methodology as elsewhere in our study (i.e., using a standard branch presence/absence calculation to assess phylogenetic support). However, we note that GUIDANCE2 incorporates standard bootstrap resampling as a first step, and subsequent steps focus on guide tree re-estimation and other re-estimation tasks as part of progressive MSA re-estimation. For this reason, GUIDANCE2 can be seen as an adaptation of the standard bootstrap to MSA and tree re-estimation. The performance comparison between RAWR and GUIDANCE2 was qualitatively similar to that of RAWR and bootstrap. RAWR returned comparable or better PR-AUC compared to GUIDANCE2 on the simulated datasets, and RAWR’s PR-AUC advantage over GUIDANCE2 tended to grow as model conditions grew larger and more divergent. The results show in Table 5.11). GUIDANCE2 was the slowest method overall due to the complexity of its special-purpose MSA re-estimation approach, and both GUIDANCE2 and RAWR required more main memory compared to bootstrap. Finally, we note that GUIDANCE2 is purpose-built for MSA re-estimation, whereas bootstrap and RAWR are general-purpose non-parametric resampling methods, both resample an MSA 94 PR-AUC Model GUIDANCE2 RAWR condition 10.A 0.989 0.996 10.B 0.983 0.990 10.C 0.921 0.977 10.D 0.939 0.968 10.E 0.997 0.997 50.B 0.994 0.994 50.C 0.975 0.989 50.D 0.942 0.988 50.E 0.837 0.997 100.A 0.988 0.993 100.B 0.993 0.991 100.C 0.939 0.982 100.D 0.894 0.983 100.E 0.881 0.986 Table 5.11: PR-AUC comparison of GUIDANCE2 and RAWR phylogenetic support estimation methods. MAFFT and RAxML(MAFFT) were used to perform MSA and tree estimation/re- estimation, respectively. We report each method’s aggregate PR-AUC across all replicate datasets for a model condition (𝑛 = 20). Model Condition PR-AUC 10.A 0.998 10.B 0.99 10.C 0.978 10.D 0.966 10.E 0.964 Table 5.12: PR-AUC performance of RAWR+teleport on 10-taxon model conditions. without utilizing an explicit parametric model. Despite this, RAWR was able to match or exceed GUIDANCE2’s PR-AUC performance. For the simulation study experiments with an alternative random walk resampling procedure, the PR-AUC values returned by RAWR+teleport on the 10-taxon model conditions are shown in Table 5.12. In terms of PR-AUC, RAWR+teleport had comparable performance to the standard RAWR method. For the model conditions in our experiment, downstream re-estimation and support calculations may be relatively tolerant of discontinuities introduced by teleportation, at least relative to related desynchronization injected by random reversals in a standard RAWR resampled replicate. Further experimentation will help to clarify the tradeoffs between the different resampling methods. 95 5.3.2 Empirical Study 5.3.2.1 Performance comparison of RAWR versus bootstrap. We evaluated the RAWR-based methods versus the bootstrap methods based on their PR-AUC on the empirical datasets, shown in Table 5.13. The RAWR method returned a similarly better PR-AUC compared to the bootstrap method, which is consistent with the results of the simulation study. The RAWR-based method outperformed the bootstrap method on all empirical datasets, and the average improvement was 0.105. This result indicates that the RAWR-based support estimation method can handle large datasets containing hundreds of sequences and still produce accurate support estimations than the traditional bootstrap method. There are some differences between the empirical study and the simulation study that are worth noting. First, the reference trees do not equal the true evolutionary history. The reference trees used in the empirical study were inferred from the reference alignment by the MLE method. Although the reference alignments were aligned by an automatic method and manual correction, and the alignments are very accurate, there is no guarantee that the reference alignments are true alignments. While in the simulation study, the model trees and the true MSAs are the ground truth. The model trees were used to guide the simulation of the sequence evolution. Secondly, the number of empirical datasets is different from the number of simulation datasets. This is caused by the large amount of effort required to curate reference alignments for empirical datasets. Finally, the intronic rRNA datasets have similar summary statistics compared to the simulated datasets. However, the intronic rRNA datasets are not exactly the same as the simulated datasets. The selected empirical datasets involve secondary structure evolution, strong selective pressures, and other evolutionary and biophysical constraints that are not considered in the simulation process in this study. 5.3.2.2 Results of Additional Performance Study The PR-AUC comparisons of the aLRT and RAWR methods on the empirical benchmarking datasets were consistent with the simulation datasets. RAWR consistently returned PR-AUC improvements 96 PR-AUC Model Bootstrap RAWR condition IGIA 0.725 0.804 IGIB 0.629 0.695 IGIC2 0.778 0.957 IGID 0.670 0.884 IGIE 0.772 0.808 IGIIA 0.830 0.884 Table 5.13: PR-AUC performances on the emprircal datasets. PR-AUC comparison of bootstrap and RAWR methods for phylogenetic support estimation. MAFFT and RAxML(MAFFT) were used to perform MSA and tree estimation/re-estimation, respectively. relative to both aLRT methods across all of the empirical benchmarking datasets. The results are shown in Table 5.14. PR-AUC aLRT with aLRT with Dataset RAWR free topology fixed topology IGIA 0.6696 0.7094 0.7845 IGIB 0.4904 0.5515 0.8332 IGIC2 0.6368 0.7242 0.8808 IGID 0.7998 0.7299 0.8524 IGIE 0.6832 0.6864 0.8206 IGIIA 0.7774 0.8036 0.9053 Table 5.14: Empirical study: PR-AUC comparison of aLRT and RAWR methods on CRW bench- marking datasets. Table layout and description are otherwise identical to Table 5.9. The PR-AUC comparisons of the TBE and RAWR methods on the empirical benchmarking datasets were also consistent with the simulated datasets. The RAWR-based support estimation method consistently outperformed the original TBE method [74] across all empirical benchmarks. The results are shown in Table 5.15). PR-AUC TBE with TBE with Dataset RAWR bootstrap resampling RAWR resampling IGIA 0.7100 0.7195 0.7845 IGIB 0.6194 0.7232 0.8332 IGIC2 0.7508 0.8559 0.8808 IGID 0.5915 0.8044 0.8524 IGIE 0.7235 0.7622 0.8206 IGIIA 0.8252 0.8524 0.9053 Table 5.15: PR-AUC comparison of TBE and RAWR methods on CRW benchmarking datasets. Table layout and description are otherwise identical to Table 5.9. The performance comparison between RAWR and GUIDANCE2 was similar on the empirical datasets compared to the simulated datasets. The RAWR-based support estimation method produced 97 comparable or better PR-AUC compared to GUIDANCE2. The results are shown in Table 5.16, RAWR outperformed GUIDANCE2 on all empirical benchmarks except for IGIB. The average absolute difference between the two methods’ PR-AUC values was 0.055. This discrepancy may be attributed to the relative difficulty that the IGIB dataset presents: the worst PR-AUC values in our entire study were observed on this dataset. One primary factor for this outcome is the high gappiness of the reference alignments for the IGIB and IGIA datasets, for example, the fraction of the reference alignment that consists of indels, as compared to every other dataset in our study. RAWR resampling of datasets with high gappiness may require additional safeguards to mitigate desynchronization. PR-AUC Dataset GUIDANCE2 RAWR IGIA 0.705 0.804 IGIB 0.737 0.695 IGIC2 0.874 0.957 IGID 0.740 0.884 IGIE 0.777 0.808 IGIIA 0.870 0.884 Table 5.16: PR-AUC comparison of GUIDANCE2 and RAWR methods for phylogenetic support estimation. MAFFT and RAxML(MAFFT) were used to perform MSA and tree estimation/re- estimation, respectively. 5.4 Discussion The results of the simulation study and the empirical study indicate that the neighbor preser- vation property plays a critical role in the phylogenetic support estimation. For the phylogenetic support estimation problem, the regular RAWR-based method and the RAWR-reduced method outperformed the bootstrap method on almost all the simulation model conditions and the empir- ical datasets. The difference between the RAWR-based methods and the bootstrap method is that the RAWR resampling approach produces the resampled replicates that retain the intra-sequence dependence followed by the alignment re-estimation, while bootstrap replicates do not. Meaningful sequence homology is retained in the RAWR replicates, which makes it possible to re-estimate alignments from the RAWR replicates. Therefore, the phylogenetic trees are inferred from the re- estimated alignments following the same two-phase pipeline and serve as a better data perturbation 98 source for the phylogenetic support estimation. Compared with the SERES resampling process, the RAWR resampling process is more similar to the bootstrap method. There is no need for anchor estimation for synchronization purposes, and the parameters involved in the anchor estimation steps are unnecessary. The random walk procedure improved the HoT algorithm by producing many replicates with reverse orientation. More replicates usually result in better performance when it comes to support estimation. However, there are problems associated with the resampled replicates. First, synchronization may be broken during the resampling process by frequent reversal of the direction. If true alignments were accessible, all sites would be correctly aligned. In this case, breakpoints caused by the direction change do not break the synchronization of the sequence homology. However, true alignments are usually not available for analysis. Using estimated alignment, which contains mismatches near the reversal breakpoints, the synchronization is interrupted during the resampling process. This may impact the downstream re-estimation. A relatively low reversal probability limits the number of unique resampled replicates. However, high reversal probability results in more reversal breakpoints, which may break the synchronization due to the inaccuracy of the estimated alignments. Our experiments show that the choice of reversal probability has little impact on the downstream support estimates if we use a relatively low reversal probability. But the impact of the choice of reversal probability is still worth noting. Another problem with the RAWR resampling method is that a high reversal probability will reduce the sequential dependence of the RAWR resampled replicates. To the extreme, the bootstrap replicates do not take into account the sequential dependence. The RAWR resampling with 𝛾 = 0 is equivalent to mirrored inputs but with a random start point, and reverse only at the start or end of the input alignments. The RAWR resampling with 𝛾 = 0.5 is a first-order Markovian process, as discussed in [145]. The RAWR resampling with 0 < 𝛾 < 0.5, where 𝛾 is relatively small, the resampled replicates contain more sequential dependence. According to our experiments on the choice of reversal probability 𝛾, smaller reversal probabilities are likely to be more practical for most applications. 99 The resampled site distribution depends on the choice of reversal probability 𝛾. For an extreme example, the RAWR resampling with 𝛾 = 1 is equivalent to sampling a single site many times and leads to a significant data loss. Even if we use a relatively small reversal probability, an individual RAWR replicate can still be biased around the initial start position. Also, the RAWR random walk certainly changes direction at the start and end of the input alignment, which can also introduce RAWR resampling bias. One solution to avoid the sampling bias is to sample a sufficiently large set of RAWR replicates with a relatively small reversal probability and aggregate the statistical inference. The RAWR start positions are chosen uniformly at random. We explored the sampling frequency of the input alignment sites for the RAWR resampling and found that the results are consistent with the above hypothesis. Therefore, the RAWR support estimation is less affected by the resampling bias since many resampled RAWR replicates are re-estimated and aggregated into a single support estimation. The bootstrap support estimation usually treats the input alignment as the true alignment and does not consider the alignment errors. However, true alignments are not available in practice, and the alignment errors are commonly found in the estimated alignments. The RAWR-based support estimation benefits from the inaccuracy of the input alignment. The RAWR replicates serve as an effective data augmentation source, which provides more alternative inference results for the MSA and the tree estimation. This study applies the RAWR resampling and re-estimation to sequence dependence due to insertion and deletion processes. Below, we highlight other factors as part of future work. Many other factors also lead to sequence dependence, which we do not take into account yet. The new random walk resampling and re-estimation techniques paved the road toward solving the i.i.d. assumptions in biomolecular sequence analysis and other topics. There is still more progress to be made. 100 5.5 Conclusion In this study, we introduced a new non-parametric resampling technique, RAWR. We applied the RAWR resampling approach to a classical bioinformatics problem, phylogenetic support estimation. We conducted experiments on both simulated and empirical datasets that covered a wide range of dataset sizes and sequence divergence. The performance study showed that the RAWR-based support estimation method produced comparable or better performance than the widely used bootstrap support estimation method. At the same time, the RAWR-based support estimation methods request longer runtime and memory usage due to the RAWR resampling and re-estimation procedure. The tradeoff between accuracy and computational runtime and memory can be offset by reducing the number of resampled replicates. For future research, application-specified resampling and re-estimation can better utilize the sequence dependence in biomolecular sequence analysis. For example, taking the structure infor- mation into account for the resampling procedure. Another direction is applying the non-parametric resampling that reserves the intra-sequence dependence to other biological analyses that require biomolecular sequences as input or account for the sequential dependence. The non-parametric resampling methods require fewer assumptions and are not restricted to a specific application. Therefore, the RAWR resampling approach can be easily applied to other problems that deal with sequencing data. One important example is the statistical inference of species trees under models of sequence evolution that take into account the sequential dependence [139, 146]. 101 CHAPTER 6 AN APPLICATION OF RANDOM WALK RESAMPLING TO PHYLOGENOMIC ANALYSIS OF DARWIN’S FINCHES 6.1 Introduction We proposed the RAWR resampling algorithm to simplify the SERES random walk and serve the phylogenetic support estimation problem. The RAWR-based method estimates the support of phylogenetic trees by resampling and re-estimation using unaligned DNA sequences. The RAWR- based phylogenetic support estimation shows comparable or typically better performance than the traditional bootstrap support estimation approach. In previous experiments, we tested the RAWR resampling and re-estimation phylogenetic support estimation approach on single loci. Traditionally, phylogenetic inference and learning assume that all sites of the input alignment evolve along the same tree, and many population- level evolutionary events are ignored during the phylogenetic tree reconstruction.[34, 45]. Thus phylogenetic tree reconstruction is often conducted on a single locus or a few loci. Such approaches may work well on distantly related taxa. However, for closely related species, due to evolutionary events such as incomplete lineage sorting (ILS) and horizontal gene flow, a phylogenetic tree inferred from one or a few given loci may be inconsistent with the species tree. The inconsistent gene tree topologies may be obtained across the entire genome[89, 106, 30]. As the costs of next-generation sequencing techniques drop, sequencing a large number of genes or even whole genomes is feasible for phylogenetic inference and learning. Many computational algorithms are designed for coalescent-based species tree inference, which takes multiple gene trees as input and reconstructs the species tree with the presence of both ILS and horizontal gene flow. The whole-genome sequencing dataset is challenging in both its size and the variance of sequence divergence. In this study, we compared the RAWR-based support estimation method and the standard bootstrap method on the whole-genome sequencing dataset of Darwin’s finch from 102 Figure 6.1: Pictures of three Darwin’s finch species. (a) The small tree finch Camarhynchus parvulus. (b) The medium tree finch Camarhynchus pauper. (c) The large tree finch Camarhynchus psittacula. All three species live on the Galapagos island of Floreana. Arrows indicate the migration of two populations of C. psittacula from Isabela and Santa Cruz. Figure comes from [47]. Larmichhaney’s study in 2015 [69]. We also contributed an analysis pipeline for the phylogenetic tree reconstruction and support estimation from the raw sequencing data. 6.2 Methods 6.2.1 Dataset In this study, we reanalyzed the whole-genome sequencing data of Darwin’s finches that were initially studied by Lamichhaney et al. 2015 study [69]. Lamichhaney’s study conducted phylogenetic analysis on the whole-genome sequencing data of 120 individuals representing all of the Darwin’s finch species and two close relatives, in total 25 species. We randomly selected one sample for each species, resulting in a dataset with 25 sam- ples (accession numbers SRR1607296, SRR1607504, SRR1607439, SRR1607359, SRR1607385, SRR1607440, SRR1607547, SRR1607403, SRR1607458, SRR1607472, SRR1607551, SRR1607494, 103 SRR1607399, SRR1607462, SRR1607343, SRR1607534, SRR1607406, SRR1607485, SRR1607508, SRR1607543, SRR1607365, SRR1607420, SRR1607466, SRR1607529, and SRR1607480). We downloaded the raw Illumina HiSeq2000 paired-end short read data from the NCBI SRA database (accession number PRJNA263122 at http://www.ncbi.nlm.nih.gov/sra). The Lamichhaney study used the reference genome of medium ground finch Geospiza fortis for mapping and variance calling purposes. We used the same reference genome and gene annotations in our analysis. The reference genome and annotation file were downloaded from the GigaDB database (http://gigadb.org/dataset/100040). 6.2.2 Process of the raw sequencing data Due to limitations in software and script availability, we explored an alternative process and analysis pipeline for the phylogenetic tree reconstruction and support estimation of the whole-genome sequencing raw data. After the raw NGS read data was downloaded, the raw data went through NGS read mapping, quality filtering, variant calling, and phasing steps to obtain a multi-locus sequence dataset. The details of the raw reads processing are described as below. First, the paired-end raw reads of each sample were mapped to the reference genome using BWA version 0.7.17 with default parameters [75]. Then the mapped results were filtered based on the mapping quality and sorted by coordinate order using SAMtools [76]. We also used SAMtools to conduct variant calling on the preprocessed data to identify variants. The variants include SNPs and short indels polymorphisms. To obtain the haplotype sequences, those bi-allelic SNPs were phased using fastPHASE version 1.4.8 [125]. The phased calls for bi-allelic SNPs were combined with the genotypic data of homozygous multi-allelic SNPs and homozygous indel polymorphisms. The heterozygous multi-allele SNPs and heterozygous indels, representing less than 1% of the input data, were treated as missing data. In total, we got 28,507 scaffold sequences for each haplotype. The gene annotation data provided with the reference genome include coordinate information of locus corresponded to a gene, an intergenic region between annotated genes on a scaffold, or a scaffold with no annotated genes. We extracted loci sequences based on the coordinate information 104 of the gene annotations and the scaffold sequences produced in previous steps. Then we estimated alignment for each locus using MAFFT version 7.222 [64] with default settings. Finally, estimated alignments were filtered by the sequence length to achieve better efficiency. Very short alignments will cause errors in the following tree inference analysis, while very long alignments will cost much time for the resampling process. Alignments with the length between 30bp and 1.5Mb were used in the following analysis. After filtering, we got 34,972 alignments for each haplotype. The final multi-locus sequence dataset consisted of 13,321 loci corresponding to annotated genes, 15,275 loci corresponding to intergenic regions between annotated genes on a scaffold, and 6,376 loci corresponding to scaffolds that lacked annotated genes. 6.2.3 Concatenated MLE phylogenetic tree inference We concatenated multiple sequence alignments of all loci to produce a concatenated genomic sequence alignment 𝐴 for phylogenetic tree inference and support estimation. The 𝑖 th alignment 𝑎𝑖 served as partition 𝑝𝑖 in the concatenated alignment and the following analyses. To obtain an annotation phylogeny for the phylogenetic support estimation methods under study, we estimated a species tree using maximum likelihood estimation (MLE) on the concatenated and partitioned genomic sequence alignment 𝐴. The concatenated alignment 𝐴 had a sequence length of multiple orders of magnitude larger than the other datasets that we used in our previous studies. The RAxML algorithm we used in previous studies was inefficient for such big, partitioned phylogenomic datasets. Therefore, we utilized ExaML [67] to perform MLE phylogenetic tree inference. The ExaML algorithm was designed for large-scale phylogenetic analyses by performing parallelized computation on a high-performance computing cluster. First, we inferred an initial tree by maximum parsimony optimization using RAxML version 8.2.9 [131]. The start tree was used for ExaML’s local search heuristics. Then the concatenated and partitioned alignment 𝐴 were filtered only to contain partitions where all four types of nucleotides. Partitions without full coverage of all types of nucleotides will cause errors during the substitution model parameters inference process. The filtered alignment was transformed into a binary file format by a dedicated parser component of 105 ExaML version 3.0.21. The binary alignment file contains global data information such as alignment length, data types, and partition boundaries, allowing each ExaML process to concurrently read only those parts of the alignment on which it will be computing likelihoods and greatly improving the computational efficiency. Then the ExaML version 3.0.21 was used to perform phylogenetic MLE on the concatenated and partitioned alignment 𝐴. 6.2.4 Phylogenetic support estimation using bootstrap resampling. We used the standard bootstrap method to sample 100 bootstrap replicates from the concatenated and partitioned genomic sequence alignment 𝐴. The bootstrap resampling method was implemented in RAxML version 8.2.9. The bootstrap replicates were filtered only to contain partitions containing all four types of nucleotides. We used ExaML to transform the filtered alignment into binary format and used RAxML to infer an initial tree for each bootstrap replicate. Then, for each bootstrap replicate, an MLE tree was inferred by the ExaML using the initial tree and binary file as input. The software version and commands we used to perform phylogenetic MLE inference were the same as for annotating tree estimation. The re-estimated trees of bootstrap replicates were used to estimate phylogenetic bootstrap support for the annotation tree that we estimated using the original concatenated and partitioned alignment 𝐴. 6.2.5 Phylogenetic support estimation using RAWR resampling. We conducted the RAWR random walk on each partitioned alignment 𝑎𝑖 of the original alignment 𝐴 with default reverse rate 𝛾 = 1 × 10−1 . For each estimated alignment 𝑎𝑖 , we used the RAWR resampling approach to obtain 100 resampled replicates {𝑏𝑖 𝑗 } for 1 ≤ 𝑗 ≤ 100. To produce a single concatenated and partitioned alignment for each resampled replicate, we concatenated estimated alignments {𝑏𝑖 𝑗 } across all partitions 𝑝𝑖 for one resampled replicate 𝑗, where 1 ≤ 𝑗 ≤ 100. As in the bootstrap analyses, the concatenated alignments were filtered to contain partitions that cover all four types of nucleotides. Phylogenetic re-estimation was conducted on each RAWR replicate alignment with the same procedure and parameter settings as the bootstrap 106 support estimation. The re-estimated RAWR trees were used to calculate support for the annotation tree. We used Dendroscope [62] to visualize phylogenetic support estimates on the species tree. 6.3 Results and Discussion We produced a phylogenetic tree for Darwin’s finches using the MLE phylogenetic tree infer- ence method on the concatenated alignment of the whole-genome sequence data. The inferred phylogenetic tree, which is shown in Figure 6.3, was topologically identical to the phylogenetic tree provided by Lamichhaney’s study [69], which is shown in Figure 6.2, except for one single internal branch in the tree of finches clade. The phylogenetic tree produced by the original study showed that the C. pauper has a closer relationship with C. parvulus, and then the clade of these two species coalescents with the other tree finch species. The species tree we got from our MLE analysis showed that instead of C. parvulus, C. pauper is genetically closer to the other three tree finch species, C. psittacula, C. heliobates, and C. pallidus. This result indicates that the analysis pipeline we used in our study produced a reasonable phylogenetic tree, which is mostly consistent with the tree provided by the original study using the whole-genome sequencing data. The change in raw sequencing data processing tools and pipelines did not impact the downstream phylogenetic tree reconstruction, unlike the MSA estimation, where the choice of MSA methods greatly influences the accuracy of downstream phylogenetic analysis. The inconsistent tree finch clade did not mean that the phylogenetic tree inferred in this study was wrong. We found that two support estimation methods produced different estimations for this inconsistent clade. The supports produced by the RAWR-based method are shown in 6.4, and the supports produced by the bootstrap method are shown in 6.3. The RAWR-based support estimation approach produced lower support than the bootstrap method for the tree finch clade. The RAWR-based support estimation approach yielded 46% support on the parent edge for the clade of C. psittacula, C. heliobates, and C. pallidus, and 68% support on the parent edge of C. psittacula, C. heliobates, C. pallidus, and C. pauper. However, the support produced by the bootstrap method is 100% for the two internal branches mentioned above. The different support 107 estimations produced by the two methods on the inconsistent clade indicate that there are more uncertainties in the phylogenetic relationships of the tree finch species. Thus, it is more difficult to obtain a confident inference from the sequencing data for this set of finch species. Furthermore, our findings suggest that the RAWR resampling and re-estimation process aids in revealing the uncertainties of the downstream phylogenetic tree reconstruction, particularly variances introduced by MSA estimation. The bootstrap method is not sensitive to the influence of MSA quality on the downstream phylogenetic reconstruction since the bootstrap replicates lose intra-sequence dependence during the resampling process. Minor discrepancies were also noted in the non-sharp- beaked ground finch clade, at most a 7% difference. Otherwise, both methods closely agreed on estimated bipartition support in the rest of the species phylogeny, with an average support difference of 0.6%. The RAWR-based support estimation took a longer runtime than the standard bootstrap support estimation method on the same multi-locus sequencing dataset because the RAWR-based method requires an additional step than the bootstrap method: the re-estimation of the resampled replicate sequences. This additional step significantly increased the runtime of the RAWR-based approaches. This is consistent with our previous study on the single-locus dataset, where the runtime of the RAWR-based support estimation was an order of magnitude longer than the bootstrap method. Parallel computing can easily be applied to the re-estimation of resampled replicates to solve this problem. 6.4 Conclusion In this study, we reanalyzed the whole-genome sequencing data of Darwin’s finches that were initially studied by Lamichhaney et al. 2015 study[69]. We explored the application of RAWR- based support estimation on the multi-locus sequence dataset. We reproduced the phylogenetic tree inference by concatenated MLE analysis. Though inferring phylogenetic relationships of birds is often challenging because of frequent hybridization and rapid radiations (Grant and Grant 1992; Ericson et al. 2006), the reconstructed phylogeny shares an identical topology with the tree 108 C. olivacea_S C. fusca_E C. fusca_L P. crassirostris_Z P. inornata_C G. difficilis_F G. difficilis_S G. difficilis_P C. pauper_F C. parvulus_Z C. psittacula_P C. heliobates_I C. pallidus_Z G. difficilis_D G. difficilis_W G. fuliginosa_Z G. fuliginosa_S G. fortis_M G. difficilis_G G. scandens_M G. conirostris_G G. magnirostris_G G. conirostris_E T. bicolor_B L. noctis_B Figure 6.2: Phylogenetic tree of Darwin’s finch species and two close relatives reported in Lamich- haney et al. 2015 study[69]. This phylogenetic tree was reproduced from the Figure 1 panel b in Lamichhaney’s 2015 paper. The branch length were ignored and the tree was rescaled to an ultrametric tree, where all the leaves have the same distance to the root. The color of the branches and the species name represents the group that the species belongs to. We used the same color as the original study, the purple, brown, cyan, red, green, blue and black color represent the group of warbler finches, vegetarian finch, cocos finch, sharp-beaked ground finches, tree finches, all other ground finches and the outgroups. 109 100 C. olivacea_S 100 C. fusca_E 100 C. fusca_L P. crassirostris_Z 100 P. inornata_C 100 G. difficilis_F 100 100 G. difficilis_S G. difficilis_P 100 C. parvulus_Z 100 100 C. pauper_F 100 C. psittacula_P 100 C. heliobates_I 100 C. pallidus_Z 100 G. difficilis_D G. difficilis_W 66 G. fuliginosa_Z 100 85 G. fuliginosa_S 58 G. fortis_M 100 G. difficilis_G 85 G. scandens_M 85 G. conirostris_G 100 G. magnirostris_G G. conirostris_E 100 T. bicolor_B L. noctis_B Figure 6.3: The re-estimated phylogenetic tree for Darwin’s finch species and two close relatives with supports estimated by the standard bootstrap method. We re-estimated a species tree using maximum likelihood estimation (MLE) on the concatenated and partitioned genomic sequence alignment of Darwin’s finch species and two close relatives. We estimated supports using the standard bootstrap method, which is implemented by RAxML version 8.2.9. The branch length were ignored and the tree was rescaled to an ultrametric tree. The color mapping is the same as Figure 6.2. 110 100 C. olivacea_S 100 C. fusca_E 100 C. fusca_L P. crassirostris_Z 100 P. inornata_C 100 G. difficilis_F 98 100 G. difficilis_S G. difficilis_P 100 C. parvulus_Z 100 68 C. pauper_F 46 C. psittacula_P 100 C. heliobates_I 100 C. pallidus_Z 100 G. difficilis_D G. difficilis_W 64 G. fuliginosa_Z 100 80 G. fuliginosa_S 51 G. fortis_M 100 G. difficilis_G 81 G. scandens_M 81 G. conirostris_G 100 G. magnirostris_G G. conirostris_E 100 T. bicolor_B L. noctis_B Figure 6.4: The re-estimated phylogenetic tree for Darwin’s finch species and two close relatives with supports estimated by the RAWR-based support estimation method. We calculated supports for the re-estimated phylogenetic tree using the RAWR-based support estimation method. The annotated tree is the same tree as Figure 6.3. The branch length were ignored and the tree was rescaled to an ultrametric tree. The color mapping is the same as Figure 6.2. 111 reported in Lamichhaney’s study except for the tree finches. The only inconsistent internal branch showed uncertainty about the evolutionary relationship of the C. pauper with the other tree finches. The RAWR-based method produced low confidence intervals for this clade, which confirmed that the phylogenetic relationships between tree finches were challenging to infer, and the upstream MSA estimation process had a certain impact on the phylogenetic reconstruction. We performed the bootstrap support estimation on Darwin’s finch dataset, and the bootstrap method returned 100% supports for the clade of the tree finches. This result indicates that the bootstrap method is not sensitive enough to identify uncertain phylogenies caused by the MSA re-estimation. Our experiment on the whole-genome sequencing data of Darwin’s finches confirmed that the neighbor preservation property plays a critical role in the phylogenetic support estimation. For future research, the RAWR-based support estimation can be applied to the analysis of many other types of sequencing data. Transcriptome data has been used in diverse applications, including phylogenetic inference and learning. Though the phylogenetic analysis using transcriptome data is not well developed as the phylogenetic analysis of DNA and protein sequence, it has been proved that the phylogenetic inference using transcriptome data could effectively reproduce previous phylogenies, such as phylogeny of the genus Flaveria reconstructed by Lyu et al. [87] and the phylogenetic analyses of Orchidaceae studied by Deng et al. [26]. The RAWR resampling approach can be extended to phylogenetic studies with transcriptome data, where the phylogenetic analysis can be associated with the evolution of gene expression, post-transcriptional modifications, alternative splicing, and gene fusions. 112 CHAPTER 7 IMPACT OF MULTIPLE SEQUENCE ALIGNMENT ERROR ON THE SUMMARY-BASED PHYLOGENETIC NETWORK RECONSTRUCTION 7.1 Introduction The phylogenetic tree is the traditional model for studying the evolutionary history of a set of organisms. The tree structure can effectively represent the relationship among a group of species or genes. However, phylogenetic trees are unable to depict more complex evolutionary events, such as horizontal gene transfer, hybridization, recombination, introgression, or gene duplication and loss. In these scenarios, the genetic materials of some sites are not inherited vertically from the parents but rather horizontally. Therefore, a more general structure, phylogenetic network, was proposed to better represent the evolutionary relationship with reticulation events involved [61, 101]. Various computational methods are designed to reconstruct phylogenetic networks from large- scale genomic sequence data, such as distance-based, maximum parsimony, and maximum likeli- hood methods. Many of these methods have a two-phase pipeline similar to that of phylogenetic tree inference. The first step is to estimation a set of gene trees from multiple sequence alignments (MSAs) of multiple loci. The second step is to reconstruct the species phylogeny by gene trees. The multiple sequence alignments from the loci are the very initial step of the phylogeny reconstruction. Previous studies show that the error introduced during the alignment estimation greatly impacts the downstream tree inference and learning [153, 152]. However, there is little discussion about the impact of MSA error on the phylogenetic network reconstruction. Here we conducted a performance study to investigate how the MSA error impacts the recon- struction of phylogenetic networks. We performed our study on both simulated datasets and two benchmark datasets. The results show that the errors introduced into the sequence alignments significantly impact the accuracy of the downstream phylogenetic network inference. This effect becomes more pronounced as sequence divergence or taxa size increase. 113 The study offers some critical insights into the quality of input MSAs and a new direction to improve the accuracy of phylogenetic network reconstruction. The computational methods should take into account the alignment error. 7.2 Methods 7.2.1 Simulated Dataset We performed our simulation study on randomly selected model networks with one reticulation event, following the simulation procedure of Hejase and Liu’s 2016 study [53]. The model networks contain either four taxa or eight taxa. We simulated the sequence replicates through three main steps: randomly sample the model network, simulate local genealogies under model networks, simulate sequence evolution for each gene tree. To simulate the model network, we started with a randomly sampled tree using r8s version 1.8.1 [123]. We scaled all the sampled trees to 1.0 to achieve better control of the divergence of simulated sequences. Then we randomly chose a reticulation time 𝑡 𝑀 , where 𝑡 𝑀 ∈ [0.01, 0.25], and two populations. One reticulation event was added at time 𝑡 𝑀 between these two populations with a randomly chosen direction. The reticulation event is shown as a directed edge on the tree, representing the gene flow between two populations at time 𝑡 𝑀 . Last, for rooting purposes, we added one outgroup taxon to the model network at time 1.5. We simulated the local evolutionary history of 1000 loci using ms [57] under the multi-species coalescent model for each model network. The migration event occurs between time 𝑡 𝑀 − 0.01 and 𝑡 𝑀 + 0.01 with a migration rate of 5.0. Then, we applied INDELible v1.03 [39] to simulate the sequence evolution of each gene tree from the previous step. We conducted the sequence simulation under the General Time-Reversible (GTR) nucleotide substitution model [118]. We used the medium gap length distribution for the insertion and deletion model. The parameters of the GTR model and the insertion and deletion models come from the study of Liu et al. [80]. To obtain replicated datasets, we repeat this simulation process independently 100 times for each model condition. 114 The detailed statistical summary of the simulated dataset is listed in Table 7.1. The average nor- malized Hamming distance (ANHD) represents the sequence divergence. The gappiness represents the percentage of gaps in the true alignments. Model Insertion/deletion Model phylogeny condition rate height ANHD Gappiness 4.A 0.1 0.5 0.3146 0.3301 4.B 0.05 0.8 0.4212 0.2855 4.C 0.03 1.4 0.5417 0.2956 4.D 0.02 2.5 0.6388 0.3326 4.E 0.01 5 0.7053 0.3313 8.A 0.03 1 0.4468 0.2825 8.B 0.02 2 0.5844 0.3418 8.C 0.01 3 0.6454 0.2809 8.D 0.006 7 0.7153 0.3531 8.E 0.004 10 0.7270 0.3433 Table 7.1: Model parameters and summary statistics of the simulated datasets. The 4-taxon model conditions are named 4.A through 4.E in order of increasing evolutionary divergence; the 8-taxon model conditions are named 8.A through 8.E similarly. Additional model condition parameters include the insertion/deletion rate and the model phylogeny height (see Methods section for details). Average normalized Hamming distance (“ANHD”) and the percentage of true MSA cells that consist of indels (“Gappiness”) are reported as an average for each model condition. 7.2.2 Simulation Experiments We conducted two sets of experiments to investigate the impact of the MSA error on phylogenetic network inference. Both sets of experiments took MSAs as input. The input alignments were either the true alignments simulated under the local genealogies or the estimated alignments inferred from simulated sequences. We used the true alignments as ground truth, which does not have alignment errors. We aligned the simulated sequences using ClustalW version 2.1, MAFFT version 7.222, and FSA version 1.15.9 [73, 64, 12], three widely used MSA methods. The estimated alignments contain errors introduced by the alignment process. By comparing the signals of gene flow or the accuracy of inferred phylogenies, we can learn the effects of alignment error on the phylogenetic network reconstruction. 115 1.0 clustalw2 mafft 0.8 fsa 0.6 SPFN SPFN 0.4 0.2 0.0 4.A 4.B 4.C 4.D 4.E 8.A 8.B 8.C 8.D 8.E 1.0 clustalw2 mafft 0.8 fsa 0.6 SPFP SPFP 0.4 0.2 0.0 4.A 4.B 4.C 4.D 4.E 8.A 8.B 8.C 8.D 8.E Figure 7.1: Alignment error for MSA estimation methods of the simulated datasets. The MSA methods in our study consisted of MAFFT, ClustalW, and FSA. We assessed MSA estimation error based on type I and type II error: the former was assessed based on SP-FP proportion (“SPFP”), which is the proportion of nucleotide-nucleotide homologies that appear in the estimated alignment but not the true alignment, and the latter was assessed based on SP-FN proportion (“SPFN”), which is the proportion of nucleotide-nucleotide homologies that appear in the true alignment but not the estimated alignment. Average SPFN and SPFP are shown for each MSA method on each model condition. 116 The first set of experiments applied D-statistics on input MSAs to detect the gene flow signals. D-statistics, or ABBA-BABA statistics, is a parsimony algorithm designed for gene flow detection of closely related species. This method takes sequence alignments as input and calculates D-value based on the numbers of ABBA and BABA sites, which represent two types of inconsistent local genealogies with the species tree. Without significant gene flow, ABBA and BABA sites appear in alignment with the same probability. Therefore, the expected D-value is 0 in this scenario. However, if the D-value is significantly different from zero, two non-sibling species are closer than their sibling species. Such D-values indicate that there is a significant difference between the number of ABBA and BABA sites and that two non-sister species are more similar to each other than expected. This is considered as a signal of gene flow. The other set of experiments is to reconstruct phylogenetic network from the simulated se- quences using summary-based algorithms. The state-of-the-art phylogenetic network reconstruc- tion algorithms we used in this study are implemented in the PhyloNet software package [149, 138]. The phylogenetic network inference pipeline is similar to the phylogenetic tree inference pipeline, which also applies two steps to the biomolecular sequences. The first step is to esti- mate an MSA from the simulated sequences. and infer gene tree from the estimated MSA. The summary-based inference methods take the inferred gene trees as input and reconstruct the network from the sampled loci. MSAs in the first pipeline stage consisted of either the true alignment or an estimated alignment that was produced using the above procedure. Gene trees in the second pipeline stage consisted of either the true gene trees or inferred gene trees that were obtained using the following procedure: on either the true MSA or an estimated MSA, we ran FastTree version 2.1.11 [112] with default settings. to perform maximum likelihood estimation of an unrooted gene tree under the GTR+Γ model of nucleotide substitution [118, 35, 105], and rooted gene trees were obtained using outgroup rooting. Since outgroups were used solely for rooting gene trees, the leaf edge to the outgroup taxon was then pruned from each rooted gene tree. Finally, for each set of rooted gene trees – either true gene trees, estimated gene trees that were obtained used MLE on true alignments, or estimated 117 gene trees that were obtained using MLE on estimated alignments – PhyloNet was used to perform summary-based network inference under one of three different optimization criteria, model likelihood given gene tree topologies as input [160], model likelihood given gene tree topologies and branch lengths as input [160], or pseudo-likelihood given gene tree topologies as input [160]. We refer to the two summary-based inference methods as MLE, and MPL. All two methods were run using default settings and version 3.6.0 of the PhyloNet software package. We measured the method performance using the topological errors of the inferred phylogenies compared with the corresponding model phylogenies. We used the Robinson-Foulds distance [116] between the inferred phylogeny and the model phylogeny to measure the topological error of the inferred gene trees. The Robinson-Foulds distance was calculated by the the proportion of bipartitions that appear in the inferred gene tree but not in the true tree or vice versa. We used the metric proposed by Nakhleh’s 2019 study [100] to measure the inferred species networks’ topological error. The metric, which we refer to as the reduced distance in the following section, is calculated on the set of reduced phylogenetic networks by the number of rooted subnetworks that appear in the inferred network but not the model network or vice versa. 7.2.3 Empirical Datasets and Experiments In our empirical study, we re-analyzed the datasets from two previous studies [148, 122]. The mosquito dataset is the dataset used in the study of Wen et al. 2016[148] for adaptive introgression in mosquitoes, which was sampled from the whole genome alignment of Fontaine et al. 2015 [103]. This dataset consists of mosquito genetic sequence data of six species and one outgroup taxon. The mosquito dataset includes the following 6 species: Anopheles gambiae, Anopheles coluzzii, Anopheles arabiensis, Anopheles quadriannulatus, Anopheles merus, and Anopheles melas, which are represented by G, C, A, Q, R, and L, respectively in our analysis. Anopheles christyi serves as the outgroup taxon. A total of 3019 loci are included. The second dataset used in the empirical study was obtained from the study of Salichos and 118 Rokas. [122]. This dataset contains genomic sequence data from 23 yeast species, 4435 loci in total. We used the same summary-based phylogenetic inference approach to reconstruct species networks as the simulation study. We utilized three different MSA methods to estimate alignments for each locus, for the purpose of obtaining estimated alignments of different quality. The MSA methods we used are ClustalW version 2.1, MAFFT version 7.222, and FSA version 1.15.9 [73, 64, 12]. Then we used FastTree to infer unrooted gene tree for each locus based on the estimated alignments of different quality. For the mosquito-6taxa dataset, the inferred gene trees were rooted by removing the outgroup taxon. For yeast dataset, gene trees were rooted under the MDC criterion using the species tree from Neafsey’s 2015 study and Salichos’s 2013 study [104, 122]. Finally, we used the rooted gene trees as input to reconstruct a species network with 𝑟 reticulations, where 𝑟 ∈ [0, 4]. The network was inferred using the MPL approach, which is implemented in PhyloNet version 3.6.0. 7.3 Results 7.3.1 Simulation Study 7.3.1.1 D-statistics for gene flow detection In the first set of experiments in the simulation study, we used the D-statistic analysis to detect the gene flow from estimated alignments. A score obtained by D-statistic analysis, which is also called D-value, that is significantly different from zero indicates gene flow between two taxa. A larger absolute score means a stronger gene flow signal. We conducted D-statistics on sequences derived from either model networks or model trees. For the simulation datasets generated from tree-like model phylogenies, no gene flow was included. Such simulation datasets were considered as the negative control group. Ideally, the D-value inferred from the sequences of the negative control group should be close to zero, which means no gene flow was detected. The average scores produced by the D-statistics on the true alignments were close to zero. These results indicated weak 119 network tree 0.0000 Average D Value −0.0005 estiAln trueAln −0.0010 −0.0015 estiAln trueAln estiAln trueAln method Figure 7.2: Gene flow detection using the D-statistic on simulated datasets. The left subplot shows the D-statistic score distribution on simulated datasets with model networks. The right subplot shows the D-statistic score distribution on simulated datasets with model trees. The D-statistic values were calculated using MAFFT-estimated alignments, which refer to ”estiAln’, and true alignments, which refer to ”trueAln”. Average D-statistic values are reported, and standard error bars are shown over 20 replicates. gene flow signals were detected for the true alignments of the negative control datasets. However, the average score on the MAFFT-estimated alignments are the largest among all the groups of simulated datasets. Gene flow signals detected from the estimated alignments of the negative control datasets are even larger than the gene flow signals detected from the network datasets. The simulation datasets derived from network-like model phylogenies contain gene flow signals, which should result in D-values away from zero. As expected, the D-values of both estimated and true alignments were larger than zero. However, the variance of the D-statistic scores tended to be larger on estimated alignments than that of the corresponding true alignments, which means the gene flow detected from the MAFFT-estimated alignments were stronger than that detected from the true alignments. 7.3.1.2 Phylogenetic network inference We calculated the topological errors of the inferred phylogenetic networks. The topological errors were quantified by the reduction-based distance, which is also called reduced distance, between the 120 inferred networks and the corresponding model networks. The reduced distance results from the inferred phylogenetic networks of the simulated datasets are shown in Figure 7.3. Generally, the species networks generated by the MLE network inference method from the true alignments and the true gene trees were the most accurate. The estimated alignments and the estimated gene trees produced the least accurate species networks. There are two exceptions, 4.A and 8.A, two model conditions with the least sequence divergence. For these two model conditions, all the species networks produced by the MLE inference method had comparable topological accuracy. The topological error difference between species networks inferred from different data sources increased as the sequence divergence increased. For the three least divergent model conditions of 4 taxa and 8 taxa, the MLE method with the true alignments and true gene trees produced similar topological errors compared to the MLE method with true alignments and estimated gene trees. However, for the two most divergent model conditions of 4 taxa and 8 taxa, the MLE method with the true alignments and estimated gene trees returned higher topological error than that using the true gene trees. The species networks inferred by the MLE method using estimated alignments and estimated gene trees were generally less accurate than the other methods, especially for the model conditions with higher sequence divergence. Also, the choice of the MSA method had a minor impact on the topological accuracy of the downstream inferred phylogenetic networks. Furthermore, the topological errors obtained on the 8-taxon model conditions were generally higher than those obtained on the 4-taxon model conditions. The comparisons among MPL methods with different alignments and gene trees were similar to those of the MLE methods. The MPL method generally produced the most accurate phylogenetic networks with true alignments and true gene trees produced, and the least accurate networks with estimated alignments and estimated gene trees, except for the least divergent model conditions of 4 taxa and 8 taxa, where the MPL method produced similar topological errors despite the alignment errors and gene tree errors. The MLE method usually produces more accurate species networks than the MPL method under 121 4taxa MLE 8taxa MLE 10 8 trueTree trueAln reducedDistance reducedDistance 6 clustalwAln mafftAln fsaAln 4 2 0 4.A 4.B 4.C 4.D 4.E 8.A 8.B 8.C 8.D 8.E Figure 7.3: Topological errors of MLE analysis on simulation datasets. The MLE method was conducted on five different inputs: (1) true MSAs and true gene trees (”trueTree”), (2) true align- ments and gene trees estimated using FastTree on the true alignments (”trueAln”), (3) ClustalW- estimated alignments and gene trees estimated using FastTree on the ClustalW-estimated MSAs (”clustalwAln”), (4) MAFFT-estimated alignments and gene trees estimated using FastTree on MAFFT-estimated alignments (”mafftAln”), or (5) FSA-estimated alignments and gene trees esti- mated using FastTree on FSA-estimated alignments (”fsaAln”). Topological error was measured using the reduced distance [100]. Averages and standard error bars are shown for each model condition in the simulation study (𝑛 = 20). 4taxa MPL 8taxa MPL 10 8 trueTree trueAln reducedDistance reducedDistance 6 clustalwAln mafftAln fsaAln 4 2 0 4.A 4.B 4.C 4.D 4.E 8.A 8.B 8.C 8.D 8.E Figure 7.4: Topological errors of MPL analysis on simulation datasets. Figure description and layout are otherwise identical to Figure 7.3. the same model condition. The MLE method utilizes the full model likelihood criterion, while the MPL method uses a pseudolikelihood criterion, which approximates the full model likelihood criterion [160, 129]. The MPL method is faster but less accurate than the MLE method. The runtime and peak memory usage of the simulation study are shown in Figures 7.5 and Figure 7.6. The MPL method consistently used less time and memory than the MLE method on 122 0.030 MPL 20.0 MLE 0.025 17.5 15.0 Running Time (h) Running Time (h) 0.020 12.5 0.015 10.0 0.010 7.5 5.0 0.005 2.5 0.000 0.0 4.A 4.B 4.C 4.D 4.E 8.A 8.B 8.C 8.D 8.E Figure 7.5: Computational runtime requirements of summary-based species network inference methods for simulation study. The runtime of the MPL and MLE methods on simulation datasets is shown in hours. Averages and standard error bars are shown for each model condition over 20 replicates in the simulation study. all model conditions. On the 8.E model condition, which has the highest sequence divergence, the MLE method’s runtime reached almost 20 hours. However, large datasets are very common for modern phylogenomic studies, where the datasets contain many dozens of genomic sequences. The runtime of the MLE method has become a major bottleneck in its application. Our finding is consistent with an earlier performance study [53]. The peak memory usage of both the MPL and MLE methods was less than 800 MiB on all model conditions. Relative differences in peak memory usage were smaller than runtime comparisons as well. 7.3.2 Empirical Study For the mosquito dataset, we compared the topologies of estimated networks using different input MSAs, including reference and estimated alignments estimated by ClustalW, MAFFT, or FSA. The pairwise reduced distances between the inferred networks are shown in Table 7.2. We inferred phylogenetic network with a single reticulation event from each set of input alignment. The networks inferred from the ClustalW-estimated alignments and MAFFT-estimated alignments were identical in topology compared to the networks inferred from the reference alignments. However, networks inferred from the FSA-estimated alignments had different topologies compared to the networks 123 1.0 1.0 MPL MLE 0.8 0.8 Peak Memory (GiB) Peak Memory (GiB) 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 4.A 4.B 4.C 4.D 4.E 8.A 8.B 8.C 8.D 8.E Figure 7.6: Computational memory requirements of summary-based species network inference methods of simulation study. The peak main memory usage of the MPL and MLE method on simulation datasets shows in GiB. Figure description and layout are identical to Figure 7.5. inferred from the other alignments. As the reticulations increased, the networks became more complex, and the topological distance between the different methods also increased. There was no identical network inferred from different alignments with two or more reticulation events, and there was no clear trend in terms of the input alignments. One important difference between the empirical study and the simulation study was that the reference alignments used in the empirical study were partially estimated by the computational approaches, which is not the true alignment. The yeast dataset showed similar results, except the reference alignments were not available. The topological differences were lowest for the networks with a single reticulation event. As the number of reticulations increased, the hypotheses became more complex, and topological differences increased too. As with the mosquito dataset, there were no clear trends in the topological differences between the pairwise comparison of networks inferred using different input MSAs. 7.4 Discussion The results of the simulation study and the empirical study indicate that the estimation error greatly impacts the downstream phylogenetic network reconstruction. The estimation error intro- duced to MSA and gene trees potentially create false positive gene flow signals. According to the 124 1 ret. ClustalW MAFFT FSA Reference ClustalW - 0 6 0 MAFFT - 6 0 FSA 6 - 6 Reference - 2 ret. ClustalW MAFFT FSA Reference ClustalW - 5 7 9 MAFFT - 9 9 FSA - 7 Reference - 3 ret. ClustalW MAFFT FSA Reference ClustalW - 11 6 10 MAFFT - 9 11 FSA - 8 Reference - 4 ret. ClustalW MAFFT FSA Reference ClustalW - 13 9 15 MAFFT - 12 11 FSA - 13 Reference - Table 7.2: Topological distance pairwise comparison of species networks inferred by MLE on different MSAs of the mosquito dataset. We obtained the reference alignment from the original study of [103]. The estimated alignments were generated by ClustalW, MAFFT, or FSA. We also compared estimation of species networks with differnt reticulations, which represents different model complexity. The MLE method was used to estimate species networks with at most 1, 2, 3, or 4 reticulations (”ret.”). Topological distances [100] of pairwise comparison between estimated networks were measured. Only upper triangular entries in the pairwise distance matrix are shown. D-statistics study, the D-values inferred from the estimated alignments showed stronger gene flow signals than those inferred from true alignments for both tree and network datasets. The MSA estimation error enhanced the gene flow signals. The sequence alignment problem is to reconstruct the homology of unaligned sequences. However, if homologous are not correctly aligned, the evolutionary history resolved from such estimated alignments tells a different story. As the dataset size of sequence divergence increases, the MSA estimation becomes more challenging. More estimation errors are included in the estimated alignments of large datasets with high sequence divergence. The dataset size and sequence divergence of sequences greatly impact the accuracy of the inferred species network, which is proved by the simulation experiments. 125 1 ret. ClustalW MAFFT FSA ClustalW - 5 8 MAFFT - 4 FSA - 2 ret. ClustalW MAFFT FSA ClustalW - 13 15 MAFFT - 15 FSA - 3 ret. ClustalW MAFFT FSA ClustalW - 15 20 MAFFT - 22 FSA - 4 ret. ClustalW MAFFT FSA ClustalW - 16 18 MAFFT - 21 FSA - Table 7.3: Topological distance pairwise comparison of species networks inferred by MLE on different MSAs of the yeast dataset. The MSAs were estimated using ClustalW, MAFFT, or FSA. Table description and layout are identical to Table 7.2. Other than the dataset size and sequence divergence, the choice of the MSA method also makes an impact on the topological accuracy of the downstream inferred phylogenetic networks. Although, the FSA method is considered to produce the most accurate MSA estimation. The species network inferred from FSA-estimated alignments usually had the highest topological error. The alignments estimated by ClustalW and MAFFT resulted in comparable topological errors. The FSA method seeks for a global alignment with the minimum number of gap openings across the sequences. Such optimization criterion makes minor impacts on the phylogenetic tree inference since the tree reconstruction assumes that the substitution, insertion, and deletion events occur independently for all sites. However, evolutionary events, such as recombination, hybridization, and horizontal gene transfer, create dependence among sites. Incorrect indels are likely to contribute to false-positive gene flow signals. Generally, the sequence alignment method reconstructs homology based on historical substitution, insertion, and deletion events. More complicated evolutionary events are often ignored. Based on our experiments, more sophisticated MSA methods are needed to take into account those evolutionary processes that result in reticulate gene flow. 126 Similar to MSA estimation, errors introduced to estimated gene trees also impact the phylo- genetic networks, especially under model conditions with high sequence divergence. When the sequence divergence is low, the gene tree errors are less impactable for the network inference, which indicate that less sequence divergence usually represents fewer evolutionary events. However, for more challenging datasets, the estimation errors included in the estimated MSA accumulate into the estimation of gene trees. Another factor that influences the network inference is the number of reticulation hypotheses. As the number of reticulations increased, the hypotheses became more complicated, and the search space of phylogenetic network inference increased dramatically. Expanded search space creates a major obstacle for network inference and possibly results in lower inference accuracy. 7.5 Conclusion In this study, we investigate the impact of MSA quality on the inference of the phylogenetic network. We compared networks inferred from either true alignments or estimated alignments with either true gene trees or inferred gene trees from the input alignments. The networks inferred from the true alignments and true gene trees were the most accurate, and the networks inferred from the estimated alignments and estimated gene trees were the least accurate. The estimation errors included in MSA greatly impacted the downstream phylogenetic network inference and learning. Alignment estimation becomes more difficult as dataset size and sequence divergence increase. As a consequence, the inferred networks were less accurate for larger and more divergent datasets. Besides the MSA estimation errors, the choice of MSA estimation method and the errors included in gene trees also affect the species network. The results are consistent across all model conditions in the simulation study. The topological comparisons of the empirical datasets were also consistent with the simulation study. The phylogenetic network reconstruction is also influenced by the hypothesis of network complexity. As the network complexity increases, the search space of the phylogenetic network also increases dramatically, which makes it harder to approach the optimal solution. 127 Current MSA estimation methods do not take into account the evolutionary processes that produce reticulate gene flow, such as recombination and horizontal gene transfer. Thus, an advanced computational method for MSA-aware phylogenetic network reconstruction is needed for better accuracy. However, computational scalability is a big challenge for such a method. Due to the complexity of the network inference problem, the computational methods that we used in our study are already computationally intensive. For better network inferences, the network inference problem of a large dataset can be divided into small problem sets, and parallel computing can be used to improve the computational scalability. 128 CHAPTER 8 CONCLUSIONS AND FUTURE WORK In this dissertation, we addressed problems caused by some over-simplified assumptions in the phy- logenetic analysis. Phylogenetic reconstruction is an important and fundamental problem for many biological studies. More and more biomolecular sequencing data are available for phylogenetic analyses as the technological advances in molecular biology and genomics. Due to the complexity of the phylogenetic reconstruction problem, many over-simplified assumptions have been made in modern phylogenetic studies. We combined the resampling and re-estimation procedures with multiple phylogenetic analyses to utilize the phylogenetic information carried by indels, which are often ignored in the traditional phylogenetic reconstruction. We proposed a new sequential resam- pling algorithm, SERES, to address the assumption made by many widely used non-parametric resampling algorithms that all sites of an alignment have evolved independently and identically distributed (i.i.d). A biomolecular sequence resampling algorithm is proposed for the resampling of biomolecular sequence data that counts for the sequence dependence. The SERES resampling approach combines the standard bootstrap resampling algorithm with the form of a random walk, which can produce many distinct resampling replicates while reserving the sequential dependence during the resampling process. The SERES resampling approach outperformed the state-of-the-art approaches, which utilize the bootstrap method, on a classical problem in computational biology and bioinformatics, the MSA support estimation problem. The SERES resampling approach has a wide range of applications for various types of data. We introduced the application of SERES random walks on aligned sequences and showed SERES as a data perturbation technique to improve statistical inference and learning. The combination of the SERES resampling approach with the recHMM obtained great improvement in the local genealogy inferences. This finding is confirmed by the breakpoint detection problem of the HIV genome sequence dataset [84, 150]. SERES resampling and re-estimation may be similarly beneficial in ancestral recombination inference problems other than local genealogical inference, such as 129 recombination rate estimation [133], recombination hotspot or coldspot detection [99, 5], etc. Due to the synchronization needs, the SERES algorithm is a semi-parametric algorithm when applied to the unaligned sequences, which requires anchor estimation for synchronization purposes. We introduced a new non-parametric resampling technique, RAWR. The RAWR resampling ap- proach does not require additional parameters. We applied the RAWR resampling approach to another classical bioinformatics problem, phylogenetic support estimation. The performance study showed that the RAWR-based support estimation produced better performance than the widely used bootstrap support estimation method, especially for sequences with higher divergence. This finding is consistent with the SERES resampling approach to the MSA support estimation problem. In this dissertation, we further relaxed the tree structure assumption of phylogenies and investi- gated the impact of MSA uncertainties on the phylogenetic network inference. Previous research has shown that estimation errors in MSA have a great impact on downstream tree inference [153, 152]. We showed that the MSA estimation errors also affect the phylogenetic network reconstruction. The accuracy of the inferred networks decreased as the accumulation of estimation errors of alignments and gene trees. The impact of alignment errors on the topological accuracy grows as the sequence divergence increases. The topological comparison results of the empirical datasets are consistent with the simulation datasets, and they show that the network hypothesis complexity also impacts the phylogenetic network inferences. This dissertation mainly addressed problems caused by over-simplified assumptions that are commonly used in the traditional phylogenetic analysis. We took into account the indels uncertain- ties and the intra-sequence dependence for the analyses of large sequencing datasets in phylogenetic studies, but many such challenges remain. We point out some directions for future work. The SERES and RAWR resampling algorithms require few assumptions and are not restricted to a specific application. It is easy to apply the new resampling algorithms to other evolutionary anal- yses which require biomolecular sequences as input or accounting for the sequential dependence, such as protein structure prediction, reads mapping, and assembly. One important application is the statistical inference of species trees under models of sequence evolution that take into account the 130 sequential dependence. Based on our performance study, the MSA quality has a great impact on the phylogenetic network. An advanced computational method for MSA-aware phylogenetic network reconstruction can utilize the RAWR resampling approach to produce phylogenetic networks with better accuracy. However, computational scalability would be a major challenge for such a method due to the complexity of the network inference problem. For the network inferences with the sequence resampling algorithm, the original inference problem can be divided into small problem sets. In this scenario, parallel computing can be applied to solve small problem sets parallelly and improve computational scalability. Non-parametric resampling is also widely used throughout science and engineering, and SERES/RAWR resampling can also be applied in research areas outside of computational biology and bioinformatics. In this dissertation, we discussed how does the new resampling algorithms deal with the sequence dependence caused by historical insertion and deletion events. Many other factors also lead to sequence dependence, which we do not take into account yet. The non-parametric resampling algorithm can be extended to application-specified resampling and re-estimation. For example, take the structure information into account for the resampling procedure. Such a resampling algorithm can better utilize the sequence dependence caused by particular evolutionary events in the biomolecular sequence analysis involved in those evolutionary events. Since alternative homologous or bipartitions are produced during the resampling and re- estimation process, the sequential resampling approach could potentially be extended to perform the MSA estimation or phylogenetic inference. However, the optimization criterion and computa- tional efficiency are two main challenges for such a method. In the study of RAWR resampling and re-estimation in phylogenetic support estimation, we observed that the re-estimation process is the bottleneck of computational runtime. Since the resampling and re-estimation processes of the resampled replicates are independent to each other, parallel computing can be applied to boost the computational efficiency of such methods. 131 BIBLIOGRAPHY 132 BIBLIOGRAPHY [1] Raja Hashim Ali, Marcin Bogusz, and Simon Whelan. Identifying clusters of high con- fidence homologies in multiple sequence alignments. Molecular biology and evolution, 36(10):2340–2351, 2019. [2] Elizabeth S Allman, James H Degnan, and John A Rhodes. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. Journal of mathematical biology, 62(6):833–862, 2011. [3] Maria Anisimova and Olivier Gascuel. Approximate likelihood-ratio test for branches: a fast, accurate, and powerful alternative. Systematic biology, 55(4):539–552, 2006. [4] Hagai Attias et al. A variational baysian framework for graphical models. In NIPS, volume 12. Citeseer, 1999. [5] Adam Auton and Gil McVean. Recombination rate estimation in the presence of hotspots. Genome research, 17(8):1219–1227, 2007. [6] Md Shamsuzzoha Bayzid and Tandy Warnow. Estimating optimal species trees from incom- plete gene trees under deep coalescence. Journal of Computational Biology, 19(6):591–605, 2012. [7] Frida Belinky, Ofir Cohen, and Dorothée Huchon. Large-scale parsimony analysis of meta- zoan indels in protein-coding genes. Molecular biology and evolution, 27(2):441–451, 2010. [8] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological), 57(1):289–300, 1995. [9] Maria Bonet, Mike Steel, Tandy Warnow, and Shibu Yooseph. Better methods for solving parsimony and compatibility. Journal of Computational Biology, 5(3):391–407, 1998. [10] Paola Bonizzoni and Gianluca Della Vedova. The complexity of multiple sequence alignment with sp-score that is a metric. Theoretical Computer Science, 259(1-2):63–79, 2001. [11] Luis Boto. Horizontal gene transfer in evolution: facts and challenges. Proceedings of the Royal Society B: Biological Sciences, 277(1683):819–827, 2010. [12] Robert K. Bradley, Adam Roberts, Michael Smoot, Sudeep Juvekar, Jaeyoung Do, Colin Dewey, Ian Holmes, and Lior Pachter. Fast statistical alignment. PLoS Computational Biology, 5(5):e1000392, may 2009. [13] Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996. 133 [14] Jamie J Cannone, Sankar Subramanian, Murray N Schnare, James R Collett, Lisa M D’Souza, Yushi Du, Brian Feng, Nan Lin, Lakshmi V Madabusi, Kirsten M Müller, et al. The comparative rna web (crw) site: an online database of comparative sequence and structure information for ribosomal, intron, and other rnas. BMC bioinformatics, 3(1):1–31, 2002. [15] Brandi L Cantarel, Hilary G Morrison, and William Pearson. Exploring the relationship be- tween sequence similarity and accurate phylogenetic trees. Molecular biology and evolution, 23(11):2090–2100, 2006. [16] Salvador Capella-Gutiérrez, José M Silla-Martínez, and Toni Gabaldón. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25(15):1972–1973, 2009. [17] Jose Castresana. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular biology and evolution, 17(4):540–552, 2000. [18] Jia Ming Chang, Paolo Di Tommaso, and Cedric Notredame. TCS: A new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Molecular Biology and Evolution, 31(6):1625–1637, jun 2014. [19] Jerry A Coyne, H Allen Orr, et al. Speciation, volume 37. Sinauer Associates Sunderland, MA, 2004. [20] William HE Day. Computational complexity of inferring phylogenies from dissimilarity matrices. Bulletin of mathematical biology, 49(4):461–467, 1987. [21] Andrew H Debevec and James B Whitfield. Introduction to phylogenetic networks.—david a. morrison., 2013. [22] James H Degnan and Noah A Rosenberg. Discordance of species trees with their most likely gene trees. PLoS genetics, 2(5):e68, 2006. [23] James H Degnan and Noah A Rosenberg. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends in ecology & evolution, 24(6):332–340, 2009. [24] James H Degnan and Laura A Salter. Gene tree distributions under the coalescent process. Evolution, 59(1):24–37, 2005. [25] E R DeLong, D M DeLong, and D L Clarke-Pearson. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3):837–45, 1988. [26] Hua Deng, Guo-Qiang Zhang, Min Lin, Yan Wang, and Zhong-Jian Liu. Mining from tran- scriptomes: 315 single-copy orthologous genes concatenated for the phylogenetic analyses of orchidaceae. Ecology and Evolution, 5(17):3800–3807, 2015. [27] Jeff J Doyle. Trees within trees: genes and species, molecules and morphology. Systematic Biology, 46(3):537–553, 1997. 134 [28] Ingo Ebersberger, Petra Galgoczy, Stefan Taudien, Simone Taenzer, Matthias Platzer, and Arndt Von Haeseler. Mapping human genetic ancestry. Molecular biology and evolution, 24(10):2266–2276, 2007. [29] Robert C Edgar. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5):1792–7, 2004. [30] Scott V Edwards. Is a new and general theory of molecular systematics emerging? Evolution: International Journal of Organic Evolution, 63(1):1–19, 2009. [31] Scott V Edwards, Liang Liu, and Dennis K Pearl. High-resolution species trees without concatenation. Proceedings of the National Academy of Sciences, 104(14):5936–5941, 2007. [32] B Efron. Bootstrap methods: another look at the jackknife. e annals of statistics, 7 (1): 1–26. URL http://www. jstor. org/stable/2958830, 1979. [33] R. A. L. Elworth, H. A. Ogilvie, J. Zhu, and L. Nakhleh. Advances in Computational Methods for Phylogenetic Networks in the Presence of Hybridization. Technical report, 2018. [34] Joseph Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17(6):368–376, 1981. [35] Joseph Felsenstein. Taking variation of evolutionary rates between sites into account in inferring phylogenies. Journal of Molecular Evolution, 53(4-5):447–455, 2001. [36] Joseph Felsenstein. Confidence Limits on Phylogenies: An Approach Using the Bootstrap. Evolution, 39(4):783, jul 2006. [37] Joseph Felsenstein and Joseph Felenstein. Inferring phylogenies, volume 2. Sinauer asso- ciates Sunderland, MA, 2004. [38] Da-Fei Feng and Russell F Doolittle. Journal of Molecular Evolution Progressive Sequence Alignment as a Prerequisite to Correct Phylogenetic Trees. J Mol Evol, 25(4):351–360, 1987. [39] William Fletcher and Ziheng Yang. INDELible: A flexible simulator of biological sequence evolution. Molecular Biology and Evolution, 26(8):1879–1888, 2009. [40] Nir Friedman, Matan Ninio, Itsik Pe’er, and Tal Pupko. A structural em algorithm for phylogenetic inference. In Proceedings of the fifth annual international conference on Computational biology, pages 132–140, 2001. [41] Ganeshkumar Ganapathy, Vijaya Ramachandran, and Tandy Warnow. Better hill-climbing searches for parsimony. In International Workshop on Algorithms in Bioinformatics, pages 245–258. Springer, 2003. 135 [42] John Gatesy and Mark S Springer. Phylogenetic analysis at deep timescales: unreliable gene trees, bypassed hidden support, and the coalescence/concatalescence conundrum. Molecular phylogenetics and evolution, 80:231–266, 2014. [43] Gonzalo Giribet. Tnt: tree analysis using new technology, 2005. [44] J Peter Gogarten and Jeffrey P Townsend. Horizontal gene transfer, genome innovation and evolution. Nature Reviews Microbiology, 3(9):679–687, 2005. [45] Brian Golding and Joe Felsenstein. A maximum likelihood approach to the detection of selection from a phylogeny. Journal of molecular evolution, 31(6):511–523, 1990. [46] Edward M Golenberg, Michael T Clegg, Mary L Durbin, John Doebley, and Din Pow Ma. Evolution of a noncoding region of the chloroplast genome. Molecular phylogenetics and evolution, 2(1):52–64, 1993. [47] Peter R Grant and B Rosemary Grant. Speciation undone. Nature, 507(7491):178–179, 2014. [48] Nicholas C. Grassly, Jun Adachj, and Andrew Rambaut. PSeq-Gen: an application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees. Bioinfor- matics, 13(5):559–560, 2007. [49] Stéphane Guindon, Jean-François Dufayard, Vincent Lefort, Maria Anisimova, Wim Hordijk, and Olivier Gascuel. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of phyml 3.0. Systematic biology, 59(3):307–321, 2010. [50] Richard W Hamming. Error detecting and error correcting codes. The Bell system technical journal, 29(2):147–160, 1950. [51] Jotun Hein. A heuristic method to reconstruct the history of sequences subject to recombi- nation. Journal of Molecular Evolution, 36(4):396–405, 1993. [52] Jotun Hein, Mikkel Schierup, and Carsten Wiuf. Gene genealogies, variation and evolution: a primer in coalescent theory. Oxford University Press, USA, 2004. [53] Hussein A Hejase and Kevin J Liu. A scalability study of phylogenetic network inference methods using empirical datasets and simulations involving a single reticulation. BMC bioinformatics, 17(1):1–12, 2016. [54] Joseph Heled and Alexei J Drummond. Bayesian inference of species trees from multilocus data. Molecular Biology and Evolution, 27(3):570–580, 2009. [55] David M Hillis and James J Bull. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Systematic biology, 42(2):182–192, 1993. [56] David M Hillis, Craig Moritz, Barbara K Mable, and Richard G Olmstead. Molecular systematics, volume 23. Sinauer Associates Sunderland, MA, 1996. 136 [57] Richard R. Hudson. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18(2):337–338, 2002. [58] Richard R Hudson et al. Gene genealogies and the coalescent process. Oxford surveys in evolutionary biology, 7(1):44, 1990. [59] Dirk Husmeier and Frank Wright. Detection of recombination in dna multiple alignments with hidden markov models. Journal of Computational Biology, 8(4):401–427, 2001. [60] Daniel H Huson and David Bryant. Application of phylogenetic networks in evolutionary studies. Molecular biology and evolution, 23(2):254–267, 2006. [61] Daniel H Huson, Regula Rupp, and Celine Scornavacca. Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press, 2010. [62] Daniel H Huson and Celine Scornavacca. Dendroscope 3: an interactive tool for rooted phylogenetic trees and networks. Systematic biology, 61(6):1061–1067, 2012. [63] S. JEFFERY. Evolution of Protein Molecules. Biochemical Society Transactions, 7(2):452– 453, 2015. [64] Kazutaka Katoh and Daron M. Standley. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution, 30(4):772–780, apr 2013. [65] Jaebum Kim and Jian Ma. Psar: measuring multiple sequence alignment reliability by probabilistic sampling. Nucleic acids research, 39(15):6359–6368, 2011. [66] John FC Kingman. On the genealogy of large populations. Journal of Applied Probability, 19(A):27–43, 1982. [67] Alexey M Kozlov, Andre J Aberer, and Alexandros Stamatakis. Examl version 3: a tool for phylogenomic analyses on supercomputers. Bioinformatics, 31(15):2577–2579, 2015. [68] Laura Salter Kubatko and James H Degnan. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Systematic biology, 56(1):17–24, 2007. [69] Sangeet Lamichhaney, Jonas Berglund, Markus Sällman Almén, Khurram Maqbool, Man- fred Grabherr, Alvaro Martinez-Barrio, Marta Promerová, Carl-Johan Rubin, Chao Wang, Neda Zamani, et al. Evolution of darwin’s finches and their beaks revealed by genome sequencing. Nature, 518(7539):371–375, 2015. [70] Giddy Landan and Dan Graur. Heads or tails: A simple reliability check for multiple sequence alignments. Molecular Biology and Evolution, 24(6):1380–1383, mar 2007. [71] Giddy Landan and Dan Graur. Local reliability measures from sets of co-optimal multiple sequence alignments. In Biocomputing 2008, pages 15–24. World Scientific, 2008. [72] Bret R Larget, Satish K Kotha, Colin N Dewey, and Cécile Ané. Bucky: gene tree/species tree reconciliation with bayesian concordance analysis. Bioinformatics, 26(22):2910–2911, 2010. 137 [73] M. A. Larkin, G. Blackshields, N. P. Brown, R. Chenna, P. A. Mcgettigan, H. McWilliam, F. Valentin, I. M. Wallace, A. Wilm, R. Lopez, J. D. Thompson, T. J. Gibson, and D. G. Higgins. Clustal W and Clustal X version 2.0. Bioinformatics, 23(21):2947–2948, nov 2007. [74] Fréderic Lemoine, J-B Domelevo Entfellner, Eduan Wilkinson, Damien Correia, M Dávila Felipe, Tulio De Oliveira, and Olivier Gascuel. Renewing felsenstein’s phylogenetic bootstrap in the era of big data. Nature, 556(7702):452–456, 2018. [75] Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows–wheeler transform. bioinformatics, 25(14):1754–1760, 2009. [76] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The sequence alignment/map format and samtools. Bioinformatics, 25(16):2078–2079, 2009. [77] C Randal Linder and Tandy Warnow. An overview of phylogeny reconstruction. 2001. [78] Kevin Liu, C Randal Linder, and Tandy Warnow. Multiple sequence alignment: a major challenge to large-scale phylogenetics. PLoS currents, 2, 2010. [79] Kevin Liu, Sindhu Raghavan, Serita Nelesen, C Randal Linder, and Tandy Warnow. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Sci- ence, 324(5934):1561–1564, 2009. [80] Kevin Liu, Tandy J. Warnow, Mark T. Holder, Serita M. Nelesen, Jiaye Yu, Alexandros P. Stamatakis, and C. Randal Linder. SATé-II: Very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Systematic Biology, 61(1):90–106, jan 2012. [81] Kevin J Liu, Jingxuan Dai, Kathy Truong, Ying Song, Michael H Kohn, and Luay Nakhleh. An hmm-based comparative genomic framework for detecting introgression in eukaryotes. PLoS computational biology, 10(6):e1003649, 2014. [82] Liang Liu. Best: Bayesian estimation of species trees under the coalescent model. Bioinfor- matics, 24(21):2542–2543, 2008. [83] Liang Liu, Lili Yu, and Scott V Edwards. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC evolutionary biology, 10(1):1– 18, 2010. [84] K S Lole, R C Bollinger, R S Paranjape, D Gadkari, S S Kulkarni, N G Novak, R In- gersoll, H W Sheppard, and S C Ray. Full-length human immunodeficiency virus type 1 genomes from subtype C-infected seroconverters in India, with evidence of intersubtype recombination. Journal of virology, 73(1):152–60, 1999. [85] Ari Löytynoja and Nick Goldman. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science, 320(5883):1632–1635, 2008. 138 [86] Peng-tao Luan, Oliver A Ryder, Heidi Davis, Ya-ping Zhang, and Li Yu. Incorporating indels as phylogenetic characters: impact for interfamilial relationships within arctoidea (mammalia: Carnivora). Molecular Phylogenetics and Evolution, 66(3):748–756, 2013. [87] Ming-Ju Amy Lyu, Udo Gowik, Steve Kelly, Sarah Covshoff, Julia Mallmann, Peter Westhoff, Julian M Hibberd, Matt Stata, Rowan F Sage, Haorong Lu, et al. Rna-seq based phylogeny recapitulates previous phylogeny of the genus flaveria (asteraceae) with some modifications. BMC evolutionary biology, 15(1):1–14, 2015. [88] G Blackshields1 M.A. Larkin1 N.P. Brown3, R. Chenna3, P.A. McGettigan1,, F Valentin4 H. McWilliam4 I.M. Wallace1, A. Wilm1, R. Lopez4, J.D. Thompson2,, T J Gibson3 Higgins, and D G. Sequence analysis, Clustal W and Clustal X version 2.0. Bioinformatics Applications Note, 23(21):2947–2948, 2007. [89] Wayne P Maddison. Gene trees in species trees. Systematic Biology, 46(3):523–536, 1997. [90] Wayne P Maddison and L Lacey Knowles. Inferring phylogeny despite incomplete lineage sorting. Systematic biology, 55(1):21–30, 2006. [91] Thomas Mailund, Julien Y Dutheil, Asger Hobolth, Gerton Lunter, and Mikkel H Schierup. Estimating divergence time and ancestral effective population size of bornean and suma- tran orangutan subspecies using a coalescent hidden markov model. PLoS genetics, 7(3):e1001319, 2011. [92] Grainne McGuire. Statistical methods for DNA sequences: detection of recombination and distance estimation. PhD thesis, University of Edinburgh, 1998. [93] Gráinne McGuire, Frank Wright, and Michael J Prentice. A bayesian model for detecting past recombination events in dna multiple alignments. Journal of Computational Biology, 7(1-2):159–170, 2000. [94] Michael L Metzker. Sequencing technologies—the next generation. Nature reviews genetics, 11(1):31–46, 2010. [95] Siavash Mirarab, Nam Nguyen, Sheng Guo, Li-San Wang, Junhyong Kim, and Tandy Warnow. Pasta: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. Journal of Computational Biology, 22(5):377–386, 2015. [96] Erin K Molloy and Tandy Warnow. To include or not to include: the impact of gene filtering on species tree estimation methods. Systematic Biology, 67(2):285–303, 2018. [97] Erin K Molloy and Tandy Warnow. Statistically consistent divide-and-conquer pipelines for phylogeny estimation using njmerge. Algorithms for Molecular Biology, 14(1):1–17, 2019. [98] Daniel Money and Simon Whelan. Characterizing the phylogenetic tree-search problem. Systematic biology, 61(2):228, 2012. [99] Simon Myers, Leonardo Bottolo, Colin Freeman, Gil McVean, and Peter Donnelly. A fine-scale map of recombination rates and hotspots across the human genome. Science, 310(5746):321–324, 2005. 139 [100] Luay Nakhleh. A metric on the space of reduced phylogenetic networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(2):218–222, 2009. [101] Luay Nakhleh. Evolutionary Phylogenetic Networks: Models and Issues. In Problem Solving Handbook in Computational Biology and Bioinformatics, pages 125–158. Springer, 2010. [102] Luay Nakhleh, Bernard ME Moret, Usman Roshan, Katherine St. John, Jerry Sun, and Tandy Warnow. The accuracy of fast phylogenetic methods for large datasets. In Biocomputing 2002, pages 211–222. World Scientific, 2001. [103] Daniel E Neafsey, Igor V Sharakhov, Xiaofang Jiang, Andrew B Hall, Evdoxia Kakani, Sara N Mitchell, Yi-chieh Wu, Hilary A Smith, Matthew W Hahn, and Nora J Besansky. Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science, 347(6217):1–20, 2015. [104] Daniel E. Neafsey, Robert M. Waterhouse, Mohammad R. Abai, Sergey S. Aganezov, Max A. Alekseyev, James E. Allen, James Amon, Bruno Arcà, Peter Arensburger, Gleb Artemov, Lauren A. Assour, Hamidreza Basseri, Aaron Berlin, Bruce W. Birren, Stephanie A. Blandin, Andrew I. Brockman, Thomas R. Burkot, Austin Burt, Clara S. Chan, Cedric Chauve, Joanna C. Chiu, Mikkel Christensen, Carlo Costantini, Victoria L.M. Davidson, Elena Deligianni, Tania Dottorini, Vicky Dritsou, Stacey B. Gabriel, Wamdaogo M. Guelbeogo, Andrew B. Hall, Mira V. Han, Thaung Hlaing, Daniel S.T. Hughes, Adam M. Jenkins, Xiaofang Jiang, Irwin Jungreis, Evdoxia G. Kakani, Maryam Kamali, Petri Kemppainen, Ryan C. Kennedy, Ioannis K. Kirmitzoglou, Lizette L. Koekemoer, Njoroge Laban, Nicholas Langridge, Mara K.N. Lawniczak, Manolis Lirakis, Neil F. Lobo, Ernesto Lowy, Robert M. MacCallum, Chunhong Mao, Gareth Maslen, Charles Mbogo, Jenny McCarthy, Kristin Michel, Sara N. Mitchell, Wendy Moore, Katherine A. Murphy, Anastasia N. Naumenko, Tony Nolan, Eva M. Novoa, Samantha O’Loughlin, Chioma Oringanje, Mohammad A. Oshaghi, Nazzy Pakpour, Philippos A. Papathanos, Ashley N. Peery, Michael Povelones, Anil Prakash, David P. Price, Ashok Rajaraman, Lisa J. Reimer, David C. Rinker, Antonis Rokas, Tanya L. Russell, N’Fale Sagnon, Maria V. Sharakhova, Terrance Shea, Felipe A. Simão, Frederic Simard, Michel A. Slotman, Pradya Somboon, Vladimir Stegniy, Clau- dio J. Struchiner, Gregg W.C. Thomas, Marta Tojo, Pantelis Topalis, José M.C. Tubio, Maria F. Unger, John Vontas, Catherine Walton, Craig S. Wilding, Judith H. Willis, Yi Chieh Wu, Guiyun Yan, Evgeny M. Zdobnov, Xiaofan Zhou, Flaminia Catteruccia, George K. Christophides, Frank H. Collins, Robert S. Cornman, Andrea Crisanti, Martin J. Donnelly, Scott J. Emrich, Michael C. Fontaine, William Gelbart, Matthew W. Hahn, Immo A. Hansen, Paul I. Howell, Fotis C. Kafatos, Manolis Kellis, Daniel Lawson, Christos Louis, Shirley Luckhart, Marc A.T. Muskavitch, José M. Ribeiro, Michael A. Riehle, Igor V. Sharakhov, Zhijian Tu, Laurence J. Zwiebel, and Nora J. Besansky. Highly evolvable malaria vectors: The genomes of 16 Anopheles mosquitoes. Science, 347(6217):1258522, 2015. [105] MASATOSHI NEi, Ranajit Chakraborty, and Paul A Fuerst. Infinite allele model with varying mutation rate. Proceedings of the National Academy of Sciences, 73(11):4164– 4168, 1976. 140 [106] Richard Nichols. Gene trees and species trees are not the same. Trends in Ecology & Evolution, 16(7):358–364, 2001. [107] C Notredame, D G Higgins, and J Heringa. Turnbaugh et al. 16S rRNA sequences\rT- Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol, 302(1):205–217, 2000. [108] Cédric Notredame, Desmond G. Higgins, and Jaap Heringa. T-coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology, 302(1):205– 217, sep 2000. [109] T Heath Ogden and Michael S Rosenberg. Multiple sequence alignment accuracy and phylogenetic inference. Systematic biology, 55(2):314–328, 2006. [110] Osnat Penn, Eyal Privman, Giddy Landan, Dan Graur, and Tal Pupko. An alignment confidence score capturing robustness to guide tree uncertainty. Molecular Biology and Evolution, 27(8):1759–1767, 2010. [111] David Posada and Keith A Crandall. Intraspecific gene genealogies: trees grafting into networks. Trends in ecology & evolution, 16(1):37–45, 2001. [112] Morgan N Price, Paramvir S Dehal, and Adam P Arkin. Fasttree 2–approximately maximum- likelihood trees for large alignments. PloS one, 5(3):e9490, 2010. [113] Lawrence R Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989. [114] Bruce Rannala and Ziheng Yang. Bayes estimation of species divergence times and ancestral population sizes using dna sequences from multiple loci. Genetics, 164(4):1645–1656, 2003. [115] Maria C Rivera and James A Lake. Evidence that eukaryotes and eocyte prokaryotes are immediate relatives. Science, 257(5066):74–76, 1992. [116] D. F. Robinson and L. R. Foulds. Comparison of phylogenetic trees. Mathematical Bio- sciences, 53(1-2):131–147, feb 1981. [117] Sebastien Roch. A short proof that phylogenetic tree reconstruction by maximum likelihood is hard. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(1):92–94, 2006. [118] F. Rodríguez, J. L. Oliver, A. Marín, and J. R. Medina. The general stochastic model of nucleotide substitution. Journal of Theoretical Biology, 142(4):485–501, feb 1990. [119] Antonis Rokas and Peter WH Holland. Rare genomic changes as a tool for phylogenetics. Trends in ecology & evolution, 15(11):454–459, 2000. [120] Antonis Rokas, Barry L Williams, Nicole King, and Sean B Carroll. Genome-scale ap- proaches to resolving incongruence in molecular phylogenies. Nature, 425(6960):798–804, 2003. 141 [121] Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3):e0118432, 2015. [122] Leonidas Salichos and Antonis Rokas. Inferring ancient divergences requires genes with strong phylogenetic signals. Nature, 497(7449):327–331, 2013. [123] Michael J Sanderson. Sanderson - 2003 - r8s inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock.pdf. Bioinformatics, 19(2):301– 302, 2003. [124] Michael C Schatz, Arthur L Delcher, and Steven L Salzberg. Assembly of large genomes using second-generation sequencing. Genome research, 20(9):1165–1173, 2010. [125] Paul Scheet and Matthew Stephens. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. The American Journal of Human Genetics, 78(4):629–644, 2006. [126] Itamar Sela, Haim Ashkenazy, Kazutaka Katoh, and Tal Pupko. GUIDANCE2: Accurate de- tection of unreliable alignment regions accounting for the uncertainty of multiple parameters. Nucleic Acids Research, 43(W1):W7–W14, 2015. [127] Mark P Simmons and Helga Ochoterena. Gaps as characters in sequence-based phylogenetic analyses. Systematic biology, 49(2):369–381, 2000. [128] Mark P Simmons, Helga Ochoterena, and Timothy G Carr. Incorporation, relative homo- plasy, and effect of gap characters in sequence-based phylogenetic analyses. Systematic Biology, 50(3):454–462, 2001. [129] Claudia Solís-Lemus and Cécile Ané. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS genetics, 12(3):e1005896, 2016. [130] Alexandros Stamatakis. Raxml version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9):1312–1313, 2014. [131] Alexandros Stamatakis, Paul Hoover, and Jacques Rougemont. A rapid bootstrap algorithm for the RAxML web servers. Systematic Biology, 57(5):758–771, oct 2008. [132] Michael Steel. Recovering a tree from the leaf colourations it generates under a markov model. Applied Mathematics Letters, 7(2):19–23, 1994. [133] Michael PH Stumpf and Gilean AT McVean. Estimating recombination rates from population-genetic data. Nature Reviews Genetics, 4(12):959–968, 2003. [134] David L Swofford. Paup: phylogenetic analysis using parsimony. Mac Version 3. 1. 1.(Com- puter program and manual)., 1993. [135] Simon Tavaré. Some probabilistic and statistical problems in the analysis of dna sequences. Lectures on Mathematics in the Life Sciences, 17(2):57–86, 1986. 142 [136] Allan Thach, Elizabeth C., Thompson, Karen J., Morris. A fresh look at followership: A model for matching Followership and leadership styles. Journal of Behavioral & Applied Management, 14(1):1–5, 2006. [137] Cuong Than and Luay Nakhleh. Species tree inference by minimizing deep coalescences. PLoS Computational Biology, 5(9):e1000501, 2009. [138] Cuong Than, Derek Ruths, and Luay Nakhleh. Phylonet: a software package for analyzing and reconstructing reticulate evolutionary relationships. BMC bioinformatics, 9(1):1–16, 2008. [139] Jeffrey L Thorne, Hirohisa Kishino, and Joseph Felsenstein. Inching toward reality: an improved likelihood model of sequence evolution. Journal of molecular evolution, 34(1):3– 16, 1992. [140] John Tukey. Bias and confidence in not quite large samples. Ann. Math. Statist., 29:614, 1958. [141] Richard Van Noorden, Brendan Maher, and Regina Nuzzo. The top 100 papers. Nature News, 514(7524):550, 2014. [142] John Wakeley. Substitution rate variation among sites in hypervariable region 1 of human mitochondrial DNA. Journal of Molecular Evolution, 37(6):613–623, 1993. [143] Li-San Wang, Jim Leebens-Mack, P Kerr Wall, Kevin Beckmann, Claude W DePamphilis, and Tandy Warnow. The impact of multiple protein sequence alignment on phylogenetic esti- mation. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(4):1108– 1119, 2009. [144] LUSHENG WANG and TAO JIANG. On the Complexity of Multiple Sequence Alignment. Journal of Computational Biology, 1(4):337–348, 2009. [145] Wei Wang, Jack Smith, Hussein A. Hejase, and Kevin J. Liu. Non-parametric and semi- parametric support estimation using SEquential RESampling random walks on biomolecular sequences. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 11183 LNBI:294–308, 2018. [146] Zhiwei Wang and Kevin J Liu. A performance study of the impact of recombination on species tree analysis. BMC genomics, 17(10):165–174, 2016. [147] Tandy Warnow. Computational phylogenetics: An introduction to designing methods for phylogeny estimation. Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation, pages 1–379, 2017. [148] Dingqiao Wen, Yun Yu, Matthew W. Hahn, and Luay Nakhleh. Reticulate evolutionary history and extensive introgression in mosquito species revealed by phylogenetic network analysis. Molecular Ecology, 25(11):2361–2372, 2016. [149] Dingqiao Wen, Yun Yu, Jiafan Zhu, and Luay Nakhleh. Inferring phylogenetic networks using phylonet. Systematic Biology, 67(4):735–740, 2018. 143 [150] Oscar Westesson and Ian Holmes. Accurate detection of recombinant breakpoints in whole- genome alignments. PLoS Computational Biology, 5(3):e1000318, mar 2009. [151] Travis J Wheeler and John D Kececioglu. Multiple alignment by aligning alignments. Bioinformatics, 23(13):i559–i568, 2007. [152] K.M. Wong, M.A. Suchard, and J.P. Huelsenbeck. Alignment uncertainty and genomic analysis Supporting online material. Science, 319(5862):473, 2008. [153] Martin Wu, Sourav Chatterji, and Jonathan A. Eisen. Accounting for alignment uncertainty in phylogenomics. PLoS ONE, 7(1):e30288, 2012. [154] Jimmy Yang and Tandy Warnow. Fast and accurate methods for phylogenomic analyses. In BMC bioinformatics, volume 12, pages 1–12. BioMed Central, 2011. [155] Ziheng Yang. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Molecular biology and evolution, 10(6):1396–1401, 1993. [156] Ziheng Yang and Bruce Rannala. Bayesian phylogenetic inference using DNA sequences: A Markov Chain Monte Carlo method. Molecular Biology and Evolution, 14(7):717–724, jul 1997. [157] Ziheng Yang and Bruce Rannala. Molecular phylogenetics: principles and practice. Nature reviews genetics, 13(5):303–314, 2012. [158] Yun Yu, James H Degnan, and Luay Nakhleh. The probability of a gene tree topology within a phylogenetic network with applications to hybridization detection. PLoS genetics, 8(4):e1002660, 2012. [159] Yun Yu, Jianrong Dong, Kevin J Liu, and Luay Nakhleh. Maximum likelihood inference of reticulate evolutionary histories. Proceedings of the National Academy of Sciences, 111(46):16448–16453, 2014. [160] Yun Yu and Luay Nakhleh. A maximum pseudo-likelihood approach for phylogenetic networks. BMC genomics, 16(10):1–10, 2015. [161] Yun Yu, Cuong Than, James H Degnan, and Luay Nakhleh. Coalescent histories on phyloge- netic networks and detection of hybridization despite incomplete lineage sorting. Systematic Biology, 60(2):138–149, 2011. [162] Yun Yu, Tandy Warnow, and Luay Nakhleh. Algorithms for mdc-based multi-locus phy- logeny inference: beyond rooted binary gene trees on single alleles. Journal of Computational Biology, 18(11):1543–1559, 2011. 144