Novel computational methods for improving functional analysis for long noisy reads

"Single-molecule, real-time (SMRT) sequencing developed by Pacific Biosciences (PacBio) and nanopore sequencing developed by Oxford Nanopore Technologies (Nanopore) produce longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assemblies, reveal structural variations, and characterize intra-species variations. It also holds the promise of deciphering the community structure of complex microbial communities, because long reads help metagenomic assembly. However, compared with data produced by popular short-read sequencing technologies (such as Illumina), PacBio and Nanopore data have a higher sequencing error rate and lower coverage. Therefore, new algorithms are needed to take full advantage of third-generation sequencing technologies. For example, during an alignment-based homology search, insertion or deletion errors in genes cause frameshifts, which may lead to marginal alignment scores and short alignments. In this case, it is hard to distinguish correct alignments from random alignments, and the ambiguity introduces errors into structural and functional annotation. Existing frameshift correction tools are designed for data with a much lower error rate, and they are not optimized for PacBio data. As an increasing number of groups use SMRT sequencing, there is an urgent need for dedicated homology search tools for PacBio and Nanopore data. Another example is overlap detection. For both PacBio and Nanopore reads, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates. Addressing this need will enable better assembly of metagenomic data produced by third-generation sequencing technologies. In this article, we discuss possible methods for homology search and overlap detection for third-generation sequencing data.
For overlap detection, we designed and implemented an overlap detection program named GroupK. GroupK uses groups of short k-mer hits that satisfy statistically derived distance constraints to increase the sensitivity of small-overlap detection. For homology search, we designed and implemented a profile homology search tool named Frame-Pro, based on the profile hidden Markov model (pHMM) and a consensus sequence finding method. However, Frame-Pro still relies on multiple sequence alignment. We therefore implemented DeepFrame, a deep learning model that predicts the corresponding protein function for third-generation sequencing reads. In experiments on simulated reads of protein-coding sequences and real reads from the human genome, our model outperforms pHMM-based methods and a deep learning-based method. Our model can also reject unrelated DNA reads, and it achieves higher recall with precision comparable to that of the state-of-the-art method."--Pages ii-iii.
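
The idea of grouping short k-mer hits under distance constraints can be illustrated with a small sketch. This is not GroupK's implementation: the function names (`kmer_hits`, `group_hits`, `best_group`), the greedy grouping rule, and the fixed thresholds (`max_gap`, `max_diag_drift`, `min_hits`) are all assumptions made for illustration, whereas the abstract describes constraints that are statistically derived. The sketch finds exact k-mer matches between two reads and keeps runs of hits that stay close to one another and near a common diagonal, so that a single insertion or deletion does not break the group.

```python
from collections import defaultdict


def kmer_hits(a, b, k):
    """Return sorted (i, j) pairs where a[i:i+k] == b[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)
    hits = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], ()):
            hits.append((i, j))
    return sorted(hits)


def group_hits(hits, max_gap, max_diag_drift):
    """Greedily group hits (hypothetical rule, not GroupK's).

    A hit joins the first group whose last hit precedes it in both reads
    by at most max_gap bases and whose diagonal (i - j) differs by at most
    max_diag_drift, which tolerates a few indel errors.
    """
    groups = []
    for i, j in hits:
        for group in groups:
            pi, pj = group[-1]
            close = 0 < i - pi <= max_gap and 0 < j - pj <= max_gap
            same_diag = abs((i - j) - (pi - pj)) <= max_diag_drift
            if close and same_diag:
                group.append((i, j))
                break
        else:
            groups.append([(i, j)])
    return groups


def best_group(a, b, k=4, max_gap=20, max_diag_drift=3, min_hits=3):
    """Return the largest group with at least min_hits hits, or None."""
    groups = group_hits(kmer_hits(a, b, k), max_gap, max_diag_drift)
    groups = [g for g in groups if len(g) >= min_hits]
    return max(groups, key=len, default=None)


# The suffix of read_a overlaps the prefix of read_b; read_b carries
# one inserted base inside the overlap.
read_a = "TTGCACCTACGGTCATGCAATCGGA"
read_b = "ACGGTCATGGCAATCGGACCTTAAGG"
overlap = best_group(read_a, read_b)
```

In this toy case the hits on either side of the inserted base fall on neighboring diagonals, so they can still be collected into one group; two reads with no shared k-mers yield `None`.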
