Inference of viral strains using metagenomics data
RNA viruses, such as human immunodeficiency virus (HIV), influenza, and hepatitis C virus (HCV), have a great impact on human health. The high mutation rate of RNA viruses can produce a population of distinct but closely related viral sequences, called a viral quasispecies. To develop prevention and treatment strategies for viral pathogens, an essential step is to characterize the sequences and abundances of the individual strains in a viral quasispecies. Advances in next-generation sequencing (NGS) technologies have opened up new opportunities to study viruses. Viral metagenomic data, which contain the genetic information of many viruses in the same habitat, have become the major resource for characterizing RNA viruses. Although there are many pipelines for analyzing viruses in metagenomic data, they usually fall short in three respects. First, novel viruses or viruses that lack closely related reference genomes cannot be detected with high sensitivity. Second, strain-level analysis is usually missing. Although several assembly tools are designed specifically for viral haplotypes, de novo assembly of virus genomes remains a computationally challenging task because of error-prone short reads, high similarity between related strains, and the unknown number and abundances of virus haplotypes. Third, it is hard to estimate the number of haplotypes and their abundances in a quasispecies. In this dissertation, we develop a pipeline of three tools, TAR-VIR, PEHaplo, and VirBin, to address these challenges in viral metagenomic assembly and analysis. TAR-VIR is a tool for classifying and enriching viral reads from metagenomic data without relying on complete or high-quality reference genomes. It is optimized for identifying RNA viruses in metagenomic data using an efficient overlap-detection method based on the Burrows-Wheeler Transform (BWT). TAR-VIR was tested on both simulated and real viral metagenomic datasets.
The results demonstrate that TAR-VIR competes favorably with the benchmarked tools. PEHaplo is a de novo haplotype reconstruction tool that employs paired-end reads to distinguish highly similar strains in viral quasispecies data. It was applied to both simulated and real quasispecies data, and the results were benchmarked against several recently published de novo haplotype reconstruction tools. The comparison shows that PEHaplo outperforms the benchmarked tools on a comprehensive set of metrics. Given assembled viral contigs, VirBin estimates the number of haplotypes and clusters the contigs into haplotypes. VirBin first identifies windows in the contig alignment profile to estimate the haplotype number and to calculate contig abundances more accurately. It then applies an Expectation-Maximization method to cluster the contigs by abundance level. The experimental results of VirBin on both simulated and real datasets show that the window-based method can precisely estimate the haplotype number and generate more accurate contig abundance estimates, which in turn lead to better clustering results. In addition, this dissertation includes one chapter on our other work, identifying the primary transcription start sites of miRNA genes in Caenorhabditis elegans and mouse with Cap-seq data.
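The abundance-based clustering step described in the abstract can be illustrated with a minimal sketch. This is not VirBin's implementation: it assumes a one-dimensional Gaussian mixture fitted by Expectation-Maximization, assumes the haplotype number k is already known (the abstract says VirBin estimates it from alignment windows), and uses made-up abundance values. The function name `em_cluster` and all parameters are hypothetical.

```python
import math


def em_cluster(abundances, k, iters=100):
    """Fit a 1-D Gaussian mixture by EM; return (means, labels).

    Illustrative only: a generic mixture model, not VirBin's actual method.
    """
    n = len(abundances)
    xs = sorted(abundances)
    # Start the k means at evenly spaced quantiles of the data.
    means = [xs[min(n - 1, int((j + 0.5) * n / k))] for j in range(k)]
    spread = (xs[-1] - xs[0]) / (2 * k) or 1.0
    variances = [spread ** 2] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each contig.
        resp = []
        for x in abundances:
            dens = [
                weights[j]
                * math.exp(-((x - means[j]) ** 2) / (2 * variances[j]))
                / math.sqrt(2 * math.pi * variances[j])
                for j in range(k)
            ]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, abundances)) / nj
            var = sum(
                r[j] * (x - means[j]) ** 2 for r, x in zip(resp, abundances)
            ) / nj
            variances[j] = max(var, 1e-6)  # floor keeps the density finite
    # Assign each contig to its most responsible component.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, labels


# Hypothetical relative abundances for nine contigs from three haplotypes.
contig_abundances = [0.50, 0.52, 0.49, 0.30, 0.31, 0.29, 0.20, 0.19, 0.21]
means, labels = em_cluster(contig_abundances, k=3)
```

Contigs that share a label are assigned to the same putative haplotype. The sketch only separates haplotypes whose abundances differ clearly, which is consistent with the abstract's emphasis on accurate contig abundance estimates as the input to clustering.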
- In Collections: Electronic Theses & Dissertations
- Copyright Status: In Copyright
- Material Type: Theses
- Thesis Advisors: Sun, Yanni
- Committee Members: Arnosti, David; Liu, Kevin; Chen, Jin
- Date: 2018
- Subjects: RNA viruses; Metagenomics
- Program of Study: Computer Science - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: xv, 148 pages
- ISBN: 9780438757370; 0438757378
- Permalink: https://doi.org/doi:10.25335/ecpx-ze34