Hidden Markov model-based homology search and gene prediction in NGS ERA

The exponential reduction in the cost of next-generation sequencing (NGS) has enabled researchers to sequence a large number of organisms in order to answer questions in biology, ecology, health, and other fields. For newly sequenced genomes, gene prediction and homology search against characterized protein sequence databases are two fundamental tasks for annotating functional elements. The main goal of gene prediction is to identify gene loci and their structures. As there is accumulating evidence of important functions of non-coding RNAs (ncRNAs), comprehensive gene prediction should include both protein-coding genes and ncRNAs. Homology search against protein sequences can aid identification of functional elements in genomes. Although there has been intensive research in gene prediction, ncRNA search, and homology search, unaddressed challenges remain. In this dissertation, I made contributions in these three areas. For gene prediction, I designed an HMM-based ab initio gene prediction tool that considers the G+C gradient in grass genomes. For homology search, I designed a method that can align short reads against protein families using profile HMMs. For ncRNA search, I designed an ncRNA alignment tool that can align highly structured ncRNAs using only sequence similarity. Below I summarize my contributions.

Despite decades of research on gene prediction, existing gene prediction tools are not carefully designed to deal with variable G+C content and 5'-3' changing patterns inside coding regions. Thus, these tools can miss genes with a positive or negative G+C gradient in grass genomes such as rice, maize, and sorghum. I implemented a tool named AUGUSTUS-GC that accounts for the 5'-3' G+C gradient. Our tool can accurately predict protein-coding genes in plant genomes, especially grass genomes.

A large number of sequencing projects have produced short reads from whole-genome or transcriptomic data. I designed a short-read homology search tool that employs paired-end reads to improve homology search sensitivity. The experimental results show that our tool can achieve significantly better sensitivity and accuracy in aligning short reads that are part of remote homologs.

Despite extensive studies of ncRNA search, existing tools that depend heavily on secondary structure in homology search cannot efficiently handle the rapidly accumulating RNA-seq data. Ideally, a faster ncRNA homology search tool would offer accuracy similar to that of tools that use secondary structure. I implemented an accurate ncRNA alignment tool called glu-RNA that can achieve accuracy similar to structural alignment tools while keeping the same running time complexity as sequence alignment tools. The experimental results demonstrate that our tool can achieve more accurate alignments than the popular sequence alignment tools and a well-known structural alignment program.
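To make the notion of a 5'-3' G+C gradient in the abstract concrete, the sketch below measures G+C content in sliding windows along a coding sequence and fits a straight line to the result. This only illustrates the quantity being discussed; it is not how AUGUSTUS-GC models the gradient, and the function names, window size, and step size are assumptions chosen for the example.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a nucleotide string (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in "GC") / len(seq)


def gc_profile(cds, window=60, step=30):
    """G+C fraction in sliding windows, ordered from the 5' end to the 3' end.

    The window and step sizes are illustrative defaults, not values
    taken from the thesis.
    """
    if len(cds) < window:
        return [gc_fraction(cds)] if cds else []
    return [
        gc_fraction(cds[start:start + window])
        for start in range(0, len(cds) - window + 1, step)
    ]


def gc_gradient(cds, window=60, step=30):
    """Least-squares slope of window G+C fraction against window index.

    A negative slope means G+C content falls from 5' to 3';
    a positive slope means it rises.
    """
    ys = gc_profile(cds, window, step)
    n = len(ys)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(ys))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


if __name__ == "__main__":
    # Toy sequence: G+C rich at the 5' end, A+T rich at the 3' end.
    toy_cds = "GC" * 30 + "AT" * 30
    print(gc_profile(toy_cds))
    print(gc_gradient(toy_cds))
```

On the toy sequence the slope is negative, which corresponds to what the abstract calls a negative G+C gradient. A gene predictor that assumes uniform G+C content within coding regions would treat the two ends of such a gene as statistically different, which is the failure mode the abstract describes.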

In Collections: Electronic Theses & Dissertations

Copyright Status: In Copyright

Material Type: Theses

Authors: Techa-angkoon, Prapaporn

Thesis Advisors: Sun, Yanni

Committee Members: Torng, Eric
Shiu, Shin-Han
Chen, Jin

Date Published: 2017

Subjects: Sequence alignment (Bioinformatics)
Molecular genetics--Data processing
Hidden Markov models
Genomics

Program of Study: Computer Science - Doctor of Philosophy

Degree Level: Doctoral

Language: English

Pages: xvii, 111 pages

ISBN: 9781369560206
1369560206

Permalink: https://doi.org/doi:10.25335/eg2a-1b96
