PROFILE HMM-BASED PROTEIN DOMAIN ANALYSIS OF NEXT-GENERATION SEQUENCING DATA By Yuan Zhang A DISSERTATION Submitted to Michigan State University in partial fulllment of the requirements for the degree of Computer Science  Doctor of Philosophy 2013 ABSTRACT PROFILE HMM-BASED PROTEIN DOMAIN ANALYSIS OF NEXT-GENERATION SEQUENCING DATA By Yuan Zhang Sequence analysis is the process of analyzing DNA, RNA or peptide sequences using a wide range of methodologies in order to understand their functions, structures or evolution history. Next generation sequencing (NGS) technologies generate large-scale sequence data of high coverage and nucleotide level resolution at low costs, beneting a variety of research areas such as gene expression proling, metagenomic annotation, ncRNA identication, etc. Therefore, functional analysis of NGS sequences becomes increasingly important because it provides insightful information, such as gene expression, protein composition, and phylogenetic complexity, of the species from which the sequences are generated. One basic step during the functional analysis is to classify genomic sequences into dierent functional categories, such as protein families or protein domains (or domains for short), which are independent functional units in a majority of annotated protein sequences. The state-of-the-art method for protein domain analysis is based on comparative sequence analysis, which classies query sequences into annotated protein or domain databases. There are two types of domain analysis methods, pairwise alignment and prole-based similarity search. The rst one uses pairwise alignment tools such as BLAST to search query genomic sequences against reference protein sequences in databases such as NCBI-nr. The second one uses prole HMM-based tools such as HMMER to classify query sequences into annotated domain families such as Pfam. Compared to the rst method, the prole HMM-based method has smaller search space and higher sensitivity with remote homolog detection. Therefore, I focus on prole HMM-based protein domain analysis. There are several challenges with protein domain analysis of NGS sequences. First, sequences generated by some NGS platforms such as pyrosequencing have relatively high error rates, making it dicult to classify the sequences into their native domain families. Second, existing protein domain analysis tools have low sensitivity with short query sequences and poorly conserved domain families. Third, the volume of NGS data is usually very large, making it dicult to assemble short reads into longer contigs. In this work, I focus on addressing these three challenges using dierent methods. To be specic, we have proposed four tools, HMM-FRAME, MetaDomain, SALT, and SAT-Assembler. HMM-FRAME focuses on detecting and correcting frameshift errors in sequences generated by pyrosequencing technology, thus accurately classifying metagenomic sequences containing frameshift errors into their native protein domain families. MetaDomain and SALT are both designed for short reads generated by NGS technologies. MetaDomain uses relaxed position-specic score thresholds and alignment positions to increase the sensitivity while keeping the false positive rate at a low level. SALT combines both position-specic score thresholds and graph algorithms and achieves higher accuracy than MetaDomain. SAT-Assembler conducts targeted gene assembly from large-scale NGS data. It has smaller memory usage, higher gene coverage, and lower chimera rate compared with existing tools. Finally, I will make a conclusion on my work and briey talk about some future work. ACKNOWLEDGMENTS First and foremost, I would like to thank my adviser Dr. Yanni Sun. Her decision to admit me as her PhD student four years ago provided me the precious opportunity to study in Michigan State University and led me to the world of bioinformatics. During these four years under her guidance I have made continuous progress in several aspects, including reading research papers, proposing research topics, developing methods, designing experiments, to writing papers. More importantly, I have gradually improved my ability to both independently and collaboratively conduct in-depth analysis into sophisticated research problems and use scientic methodologies to solve the challenging problems. She also gave a lot of suggestions on how to eectively demonstrate our work to the audience, especially to people from other research areas. This ability is very important in that it will profoundly determine my capability to collaborate in a team of people from dierent background. I also want to thank other committee members Dr. Tan, and Dr. James R. Cole. my PhD program. C. Titus Brown, Dr. Pang-Ning They gave a lot of useful suggestions during the course of I also thank my lab mates Rujira Achawanantakun, Jikai Lei, Cheng Yuan, Prapaporn Techa-angkoon, and Jiao He. During these years, we have productive discussion and cooperation on various research topics and I have obtained great help from them. I would like to acknowledge my colleagues from BeachMint Inc. during my summer internship, especially Douglas Cohen, Je Cooper, and Manunya Rozelle. With their help, I learned how to apply theories and methods to solve challenging problems in industry. I gratefully acknowledge other faculties and stas of CSE department, especially Dr. Rong Jin, Dr. Jin Chen, Linda Moore, and Norma Teague. I also owe a lot of thanks to my friends in MSU and in China. All of them give me a lot of support and help during these years. iv My nal and most important acknowledgement must go to my family. They always give me persistent and determined love and support. v TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Next-generation sequencing technologies 1.2 Protein domain analysis of NGS sequences . . . . . . . . . . . . . . . . . . . 1 1.3 Challenges with protein domain analysis of NGS sequences . . . . . . . . . . 3 Chapter 2 Protein domain classication for metagenomic sequences containing frameshift errors . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1 Background 2.2 Related work 2.3 Method . . . . . . . . . . . . . . . . . . . . 1 1.1 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 The augmented Viterbi algorithm for sequencing error correction . . . 11 2.3.3 Running time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.4.1 Accuracy of HMM-FRAME 15 2.4.2 2.4 Error models 2.3.2 Using HMM-FRAME in Targeted Metagenomic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4.2.1 Protein domain analysis of nifH sequences . . . . . . . . . . 18 2.4.2.2 Protein domain analysis of the bacterial aromatic dioxygenase genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Protein domain classication in the deep mine data set . . . . . . . . 22 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 Chapter 3 Prole HMM-based protein domain classication for short sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.4.3 2.5 3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.3 Method 31 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3.1 Pipeline of MetaDomain 3.3.2 The Viterbi algorithm 3.3.3 3.3.3.1 Position specic threshold . . . . . . . . . . . . . . . . . . . 34 3.3.3.2 Alignment trimming 3.3.4 . . . . . . . . . . . . . . . . . . . . . . . . . 32 . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Alignment Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 . . . . . . . . . . . . . . . . . . . . . . 35 Protein domain classication . . . . . . . . . . . . . . . . . . . . . . . 36 vi 3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4.1 37 Identifying transcribed protein domains in transcriptome . . . . . . . 3.4.1.1 Performance of read classication . . . . . . . . . . . . . . . 37 3.4.1.2 Identifying transcribed domains in the transcriptome . . . . 38 Protein domain analysis in a soil metagenomic data set . . . . . . . . 42 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.4.2 3.5 Chapter 4 A Sensitive and accurate protein domain classication tool (SALT) for short reads based on prole HMMs and graph algorithms . 47 4.1 Background 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.3.1 Overview of SALT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.3.2 Stage 1: prole HMM-based ltration . . . . . . . . . . . . . . . . . . 54 4.3.2.1 54 4.3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Position-specic score threshold . . . . . . . . . . . . . . . . Stage 2: contig generation 47 . . . . . . . . . . . . . . . . . . . . . . . . 55 4.3.3.1 Constructing a hit graph for a family . . . . . . . . . . . . . 57 4.3.3.2 Find the K 60 Stage 3: E-value computation and contig selection . . . . . . . . . . . 61 4.3.5 4.4 . . . . . . . . . . . . . . . . . . . 4.3.4 longest paths Running time analysis 62 . . . . . . . . . . . . . . . . . . . . . . . . . . Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.4.1 Protein domain classication of very short reads . . . . . . . . . . . . 64 4.4.1.1 Determining the true membership of reads . . . . . . . . . . 65 4.4.1.2 Performance evaluation . . . . . . . . . . . . . . . . . . . . Arabidopsis 4.4.2 Protein domain classication of an RNA-Seq data of 4.4.3 Protein domain classication of a non-model organism 65 . . . 69 . . . . . . . . 72 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 Chapter 5 A Scalable and Accurate Targeted gene Assembly tool (SATAssembler) for NGS data . . . . . . . . . . . . . . . . . . . . . . . . 78 5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.3.1 Overview of SAT-assembler 82 5.3.2 Prole HMM-based homology search . . . . . . . . . . . . . . . . . . 83 5.3.3 Alignment informed graph construction . . . . . . . . . . . . . . . . . 84 5.3.4 Pruning and optimization of overlap graphs 87 5.3.5 Guided graph traversal using multiple information . . . . . . . . . . . 88 5.3.6 Contig scaolding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.3.7 Running time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 91 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.4.1 . . . . . . . . 93 5.4.1.1 Edge creation performance . . . . . . . . . . . . . . . . . . . 93 5.4.1.2 Performance comparison with other assembly tools 95 5.4 . . . . . . . . . . . . . . . . . . . . . . . Gene assembly in an RNA-Seq data set of vii . . . . . . . . . . . . . . Arabidopsis . . . . . 5.4.2 5.5 Targeted gene assembly in a human gut metagenomic data set . . . . 98 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Chapter 6 Conclusion and future work Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii . . . . . . . . . . . . . . . . . . . . 101 105 LIST OF TABLES Table 2.1 Comparing the error detection performance of HMM-FRAME, GeneWise, and FragGeneScan. Table 3.1 . . . . . . . . . . . . . . . . . . . . . . Number of transcribed and non-transcribed domains using dierent cutos (N) for the number of mapped reads. Table 4.1 40 Burkholderia cenocepacia. . . . . . . . . . . . . 68 Performance comparison of SALT against the other classiers on the RNA-Seq data set of Table 4.3 . . . . . . . . . . . . . Performance comparison of SALT against the other classiers on the RNA-Seq data set of Table 4.2 17 Arabidopsis. . . . . . . . . . . . . . . . . . . . Classication results generated by dierent classiers on the balthica 71 Radix data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Table 4.4 Description of transcribed families uniquely identied by SALT. Table 5.1 Edge creation performance of three strategies on the RNA-Seq data set of Table 5.2 Arabidopsis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 94 Performance comparison between dierent assembly tools on the RNASeq data set of Table 5.3 . . 73 Arabidopsis. . . . . . . . . . . . . . . . . . . . . . . . 98 Performance comparison between dierent assembly tools in assembling genes from butyrate kinase family on the human gut metagenomic data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 99 LIST OF FIGURES Figure 2.1 Frameshifts cause short alignments with marginal scores . . . . . . . Figure 2.2 6 Change of HMMER alignments' scores, lengths, and E-values (in log space) before and after error correction for nifH sequences. (For interpretation of the references to color in this and all other gures, the reader is referred to the electronic version of this dissertation) . . . . Figure 2.3 20 Change of HMMER alignments' lengths, scores, and E-values (in log space) before and after error correction for the bacterial aromatic dioxygenase genes in a soil sample. . . . . . . . . . . . . . . . . . . . Figure 2.4 Protein domain classication results for the black sample in the deep mine data set. Figure 3.1 22 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 Change of the read classication sensitivity of HMMER over read length and the average sequence identity of domain families. . . . . 27 Figure 3.2 Histogram of the average pairwise sequence identity for 2558 domains 29 Figure 3.3 Three types of alignment distributions. . . . . . . . . . . . . . . . . 32 Figure 3.4 Pipeline of MetaDomain. . . . . . . . . . . . . . . . . . . . . . . . . 33 Figure 3.5 Read classication sensitivity and FP rate of HMMER and MetaDomain. The size of each bubble represents the number of data points (i.e., domains) with the same sensitivity and FP rate. . . . . . . . . 39 Figure 3.6 ROC curves of HMMER and MetaDomain. . . . . . . . . . . . . . . 41 Figure 3.7 Read length distribution in the soil data set. . . . . . . . . . . . . . 43 Figure 3.8 Reads aligned by HMMER and MetaDomain. . . . . . . . . . . . . 44 Figure 3.9 The distributions of aligned reads for PF09703 by HMMER and MetaDomain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x 45 Figure 4.1 Two genes, their domain organizations, and the sequenced reads. Domain X occurs in two dierent genes. Both genes are transcribed and sequenced. Red lines: positive reads. Blue lines: negative reads. . . 52 Figure 4.2 The pipeline of SALT. . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Figure 4.3 (A) Thirteen reads and their alignment layout w.r.t. the prole HMM represented by its matching states. The alignment scores are shown in the table. Blue reads: negative reads. Red reads: positive reads. (B) The constructed hit graph when k ∗ = 4. For simplicity of ex- planation, mismatches are not allowed in this simple example (i.e. e = 0). Red nodes are created by positive reads. Blue nodes are cre- ated by negative reads. (C) The hit graph after removing transitive overlaps and adding the root node. . . . . . . . . . . . . . . . . . . . Figure 4.4 ROC curves of dierent classiers. 56 HHblits and SSAKE+HMMER are listed in separate embedded windows because their FP rates are orders of magnitude larger than others. Figure 4.5 67 A Venn diagram of the transcribed families identied by dierent classiers. Figure 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . The pipeline of SAT-assembler. 73 Reads of the same color belong to the same gene family. Reads from dierent genes of the same family are distinguished using dierent patterns. Reads shared by multiple genes from the same family have multiple patterns. . . . . . . . . . . Figure 5.2 (A)Two reads a and b 83 sequenced from dierent genes of the same family are aligned to the prole HMM of the family. Their sequence overlap is indicated in red. (B) Read a and read b have an align- ment overlap of 66 and a sequence overlap of 25 (in bold). (C)The alignment between the translated peptides of Figure 5.3 a and A graph containing reads from two dierent genes red (v1 , gene B v4 , and v7 ) and in blue (v2 , v5 , and b is 22 residues. A and B . Nodes in v8 ) are from gene A and v6 ) are chimeric nodes respectively. Nodes in black (v3 and because they are shared by the two genes. are real edges. Arrows with solid lines Arrows with dotted lines and dashed lines indicate paired-end reads and transitive edges between two nodes respectively. Figure 5.4 86 Three contigs generated from a metagenomic data set. 88 The green parts of the contigs are contained in the target gene and thus are gene segments. The blue parts of the contigs are not gene segmetns. xi 92 Figure 5.5 Chimera rate versus gene coverage when k-mer old changes for dierent assembly tools. size or overlap thresh- These values are average values of the assemblers' performance on 3,188 input families. . . . . xii 96 Chapter 1 Introduction 1.1 Next-generation sequencing technologies In bioinformatics, sequencing means to determine the primary structure of a biological sequence. Prior to new sequencing technologies, Sanger sequencing is the main method for sequencing DNA. However, this technique has several limitations. For example, Sanger sequencing is not applicable to sequencing a small amount of DNA, making it expensive and not accessible to most small labs. Also, the length of the DNA being sequenced is limited. Next generation sequencing (NGS) technologies are developed at the demand of low-cost sequencing technologies. lel method. These new sequencing technologies make use of massive paral- They can produce large-scale sequence data at low costs. These advantages make large scale sequencing within the reach of many scientists. Moreover, new sequencing technologies generate much more sequence data that has high coverage and nucleotide level resolution per run. 1.2 Protein domain analysis of NGS sequences Inferring functions from sequences is important in analyzing dierent types of data generated by NGS technologies. One basic step during the functional analysis is to classify NGS sequences into annotated functional categories, such as protein families or protein do- 1 main families. Protein domain analysis has been widely used for functional annotations of RNA-Seq data [1, 2, 3, 4]. In particular, quantifying the expression levels of protein domains helps us understand how transcriptional changes of domains are associated with sequencing conditions, sampling tissues, or experimental treatments in RNA-Seq data. For example, computational domain analysis was applied to identify domains that play a role in vernalization and eux transporters in the gibberellin response in sugar beet [1]. Do- main analysis is also frequently used to evaluate and compare gene annotation quality of dierent gene-nding tools [3] or to compare domain composition of data sampled using dierent techniques [4]. Protein domain analysis has also been used to understand the phylogenetic complexity and biological functions of mycrobial communities, as well as their interactions with the host [5, 6, 7]. For example, Ellrott et al. investigated the distribution of protein families in the currently available human gut genomic and metagenomic data [8]. Schlüter et al. applied HMMER to understand the genetic diversity and composition of a plasmid metagenome from a wastewater treatment plant [9]. The phylogenetic algorithm CARMA [10] uses all Pfam domain and protein families as phylogenetic markers to identify the source organisms of environmental DNA fragments. There are two major comparative methods for protein domain analysis. The rst method is based on pairwise sequence alignment tools such as BLAST software suite [11]. Query sequences are classied via comparison with annotated protein databases such as NCBI-nr using BLASTX [11]. The second method is prole-based similarity search, which classies queries into characterized protein domain or family databases such as Pfam [12], TIGRFAM [13], FIGfams [14], etc. There also exist comprehensive protein domain search tools such as InterProScan [15], which combines dierent sequence and prole-based domain recognition methods from the InterPro [16] consortium member databases into one resource. 2 Although BLAST is one of the most ecient protein homology search tools, probabilistic model-based methods have much better sensitivity for remote protein homology recognition. Using using prole hidden Markov models (HMMs) to represent a protein family greatly improves homology search sensitivity between highly diverged sequences [17]. Thus it is desirable to conduct protein domain classication using prole HMM-based tools such as HMMER [18]. In conjunction with a fast-growing protein domain family database Pfam [12], which contains over 10,000 annotated protein domain families, HMMER is able to classify sequences into dierent domain families with high accuracy. In addition, the latest implementation of prole HMM-based domain classication tool HMMER 3.0 [18] has achieved comparable speed to BLAST, making it suitable for large-scale protein compositional analysis. For the convenience of discussion, we use HMMER to refer to HMMER 3.0 hereafter unless otherwise specied. 1.3 Challenges with protein domain analysis of NGS sequences Although prole HMM-based methods have been successfully applied to genome-wide domain analysis, there are still many challenges with protein domain analysis of NGS sequences, especially complex metagenomic data. First, sequences generated by some NGS platforms such as pyrosequencing technology have sequencing errors, including insertions or deletions of nucleotides, especially in homopolymer regions. These errors create frameshifts during translation, making it dicult to classify the derived peptide sequences into their native families. Second, when the length of the query reads decreases, existing tools have low sensitivity in classifying these short reads, especially for domain families of poor conservation. Many 3 sequencing technologies such as Illumina still generate short reads of 35 bp to 150 bp. Moreover, protein sequences encoded in individual metagenomic sequence reads may share only a small overlap with existing protein families. Therefore, a sizable portion of various data set still contain short reads. Third, microbial communities usually contain a large number of dierent microbial species, complicating the functional annotation of metagenomic data. In order to address these challenges and improve performance of protein domain classication, We have proposed three tools: HMM-FRAME, MetaDomain, and SALT. HMMFRAME is designed to accurately classify metagenomic sequences containing frameshift errors. MetaDomain and SALT are designed to directly classify short reads into their native protein domain families with better sensitivity than existing tools. Compared to MetaDomain, SALT incorporates graph algorithms to improve accuracy of protein domain classication. In my future work, I will focus on accurate and scalable gene assembly from complex metagenomic data. 4 Chapter 2 Protein domain classication for metagenomic sequences containing frameshift errors 2.1 Background Culture-independent methods and high-throughput sequencing technologies now enable us to obtain community random genomes (metagenomes) from dierent habitats such as arctic soils and mammalian gut. Currently, metagenomic annotation focuses on phylogenetic complexity and protein composition analysis. An important component in protein composition analysis is protein domain classication, which classies a putative protein sequence into annotated domain families and thus aids in functional analysis. Prole HMM-based alignment is the state-of-the-art method for protein domain classication because of its high sensitivity in classifying remote homologs. In conjunction with the Pfam database, HMMER [18] can accurately classify query protein sequences into existing domain families. In addition, the latest version of HMMER can achieve comparable speed to BLAST, making it applicable to large-scale metagenomic data sets. However, HMMER cannot optimally classify sequences containing frameshift errors. In 5 HMMER's domain analysis, six-frame translations of a sequence read or a predicted gene fragment are aligned with annotated protein domain families using HMMER. One problem of this method is that sequencing errors, including insertions or deletions of nucleotides, create frameshifts during translation. As a result, the derived peptide sequences are likely to generate alignments with marginal scores. As HMMER uses alignment scores, E-values, or lengths to determine family membership, these reads become unclassiable or can be falsely recognized as novel" proteins during downstream analysis. Figure 2.1 illustrates how insertion or deletion errors cause marginal alignment scores. X1X2X3 X4X5X6 X7X8X9 X10X11X12 X13X14X15 X16X17X18 X19X20X21 aa11 aa12 aa13 aa14 aa15 aa16 aa17 X1X2X3 X4X5X6 X X7X8X9 X10X11X12 X13X14X15 Y X16X17X18 X19X20X21 aa11 aa12 aa23 aa24 aa25 aa36 aa37 Figure 2.1: Frameshifts cause short alignments with marginal scores In Figure 2.1, is the j th Xi is the ith base of a DNA sequence. Every codon is underscored. amino acid of a peptide sequence derived under reading frame i. ai j The correct peptide sequence can be derived from the error-free sequence (shown on the top of the gure) under reading frame 1. Because of insertions of two nucleotides (bolded X and Y), the correct peptide sequence is the concatenation of three short peptide sequences derived using dierent reading frames. Thus, each peptide sequence derived using one reading frame can only generate short alignments with insignicant scores. This problem is more serious in domain analysis for metagenomic data sets. Given the high complexity of many metagenomic data sets, high-quality genome assembly is not always 6 available. Thus, protein annotation can only be conducted on short sequence reads. The average read length varies from 25-35 to around 400 bases for the next-generation sequencing methods currently in use. On average there is about one open reading frame per 1000 base pairs in bacteria genomes. Depending on gene size, many gene fragments in metagenomic sequence reads may share only a small overlap with existing domain families, generating even shorter prole HMM alignments with signicantly lower scores. Although a number of tools [19, 20, 21, 22, 23, 24] exist for frameshift detection, they are not designed for protein domain classication using prole HMMs. In addition, these tools have not incorporated sequencing error patterns associated with next generation sequencing technologies. A clear disadvantage is that they do not distinguish between error rates in and out of homopolymer regions in pyrosequencing reads. The goal of this work is to design an accurate prole HMM alignment method that can incorporate any given error pattern. Our experiments show that our tool has high sensitivity (> 95%) in detecting sequencing errors and has a low false positive rate (∼ 0.15%). By correcting insertion and deletion errors, it can generate longer alignments with signicantly higher alignment scores, and thus provide more accurate protein domain classication. 2.2 Related work A number of programs exist to handle frameshifts through DNA versus protein sequence alignment. The simplest methods discard sequences that might contain frameshifts rather than trying to correct them. For example, BLASTX provides insightful information about whether a query DNA sequence contains frameshifts using six-frame translations. However, it neither explicitly outputs positions of insertions or deletions that create frameshifts, nor 7 does it try to x them by constructing an alignment from pieces obtained from dierent reading frames. Other tools are available to detect and x frameshift errors automatically. Frame [19] uses BLASTX to compare all six reading frames of the query nucleotide sequence against protein sequences. Then the aligned regions are combined for frameshift detection. Guan et al. [20], Zhang et al. [21], and Halperin et al. [22] describe dynamic programming algorithms for frameshift detection during pairwise DNA and protein sequence alignment. Instead of using all reading frames of a DNA sequence to maximize the alignment score, another group of tools [23, 24] translate a protein sequence back into DNA sequences and formulate the alignment problem as a network matching problem. Frameshift detection has also been applied to nding distant protein homologies where the divergence is the result of frameshift mutations and substitutions [25, 26, 27]. Some gene-nding tools detect frameshifts. FrameD [28] relies on a directed acyclic graph for gene prediction in the presence of frameshifts. Kislyuk et al. [29] apply an ab initio method to detect possible frameshifts from coding potential generated by GeneMark [30]. GeneTack [31] and FragGeneScan [32] use hidden Markov models for ab initio frameshift detection in gene nding. Despite the extensive study of frameshift detection, the above programs are not designed for protein family classication through DNA versus protein family alignment. Alternatively, GeneWise [33], a widely used DNA versus protein alignment tool, allows comparison of a DNA sequence with a prole HMM. Our algorithm diers from GeneWise by explicitly incorporating a position-specic error model that is trained on data from dierent sequencing platforms such as 454 GS FLX Titanium. 8 2.3 Method The representative protein domain classication tool HMMER [18] classies a query protein sequence into a prole HMM-represented protein family using the Viterbi or the Forward algorithm [17]. The Viterbi algorithm aligns a query protein sequence to a prole HMM by searching for the most probable state path in the model. If the alignment score or Evalue meets the pre-dened threshold, the query is classied into the corresponding family. The alignment generated by the Viterbi algorithm only accounts for the dierence caused by evolutionary divergence between a sequence and a protein family. In order to classify error-containing sequences into their native families, the alignment algorithm must detect the dierences resulted from both evolution and sequencing errors. In this section, we describe HMM-FRAME, the implementation of an augmented Viterbi algorithm that searches for the optimal alignment between a DNA query and a prole HMM by considering both evolutionary divergence and sequencing errors. HMM-FRAME diers from HMMER in the following ways: 1) HMM-FRAME directly accepts a DNA sequence as input, 2) HMM-FRAME accepts a sequencing error model as input, 3) HMM-FRAME can detect and x frameshifts caused by sequencing errors in the DNA sequence. The output alignment indicates which bases are inserted or deleted due to evolutionary change or sequencing error. 2.3.1 Error models Here we describe the error models used in our experiments. Dierent sequencing technologies may have dierent types of errors. For example, previous work [34, 35, 36] has shown that insertions and deletions occur more often in homopolymer regions than in non-homopolymer 9 regions for pyrosequencing reads. Substitution errors occur more often than insertions or deletions in Illumina sequencing reads. Because deletion or insertion errors cause frameshifts, we focus on applying HMM-FRAME to pyrosequencing data sets. In this work, we consider two error models. The rst one is a published model trained from GS20 sequencing reads [34]. The insertion and deletion error rates in non-homopolymer and homopolymer regions are 0.0007 and 0.0044, respectively. The second error model is computed on data from FLX Titanium sequencing platform. We obtained a set of Titanium sequence reads (Cole and Wang, unpublished) extracted from the region H of the 16S rRNA, which were amplied from the Baylor mock community (22 strains, 24 sequences). Then we computed error rates using insertions and deletions that were annotated by generating careful Needleman-Wunsch alignments between the Titanium sequencing reads and the control sequences. In total, 7,040 sequences passed the initial quality control of RDP [37] after contamination and chimera detection. There were 1,721 insertion and deletion errors. Note that PCR, which was used to generate the amplicons of the sample, can introduce errors. However, because most of the errors introduced by PCR are substitution errors, we assumed that the deletions and insertions were mainly sequencing errors. The derived error rates for homopolymers of dierent sizes were: 1: 0.000532, 2: 0.000698, 3: 0.00102, 4: 0.000688, 5: 0.0372, 6: 0.00167, 7: 0.143, where the rst number is the size of homopolymer regions (1 means non-homopolymer) and the second number is the rate of insertion and deletion errors. If we sum the error rates for homopolymer regions of dierent sizes, the insertion and deletion error rates for non-homopolymer and homopolymer regions were 0.0005 and 0.001, respectively. They are slightly smaller than the published G20 error rates [34]. We will compare their performance on a data set with annotated errors in the section 2.4. 10 2.3.2 The augmented Viterbi algorithm for sequencing error correction Let π be a state path in a prole HMM in a DNA sequence π∗ x. M. Let r be a set of insertion and deletion positions The augmented Viterbi algorithm searches for the most probable path and the most probably error position set r∗ such that (π ∗ , r∗ ) = argmax(π,r) P (x, π, r). Intuitively this algorithm searches for an optimal alignment between a DNA sequence and a prole HMM by simultaneously considering 1) evolutionary divergence (i.e. the insertion, deletion, and substitution of amino acids) and 2) sequencing errors (i.e. insertion and deletion of nucleotides). To solve the above equation, we rst divide the search space according to dierent types of sequencing errors inside a codon and between two consecutive codons. For each type of error, we search for the most probable state path. Input: M a DNA sequence a prole HMM M, and a sequencing error model. Notations of and the error model will be described below. Output: in x, the optimal alignment between DNA sequence x and M , as well as error positions r. Algorithm: we rst dene notations that will be used in the dynamic programming equa- tions. • Notations about the prole HMM M : sertion, and deletion states in to s2 . es (T (xi−2 xi−1 xi )) T (xi−2 xi−1 xi ), M . as1 s2 States Mj , Ij , and which is translated from the codon M, guide of HMMER [18]. State are matching, in- is the transition probability from state s1 s to emit amino acid xi−2 xi−1 xi . For a detailed de- is the emission probability for state scription of a prole HMM Dj we refer the reader to the textbook [17] and the users' Gj is the only state that is not dened in prole HMMs 11 from HMMER 3.0. It encodes insertions of nucleotides between codons. transition probability from matching state set to the insertion error probability. Mj aGj Gj aMj Gj to nucleotide insertion state is the Gj . It is is the self-transition probability for Gj , encoding the probability of consecutive insertions. When consecutive insertion is not allowed, it is set to 0. matching state Mj . aGj−1 Mj is the transition probability from to the next When only one insertion error is allowed, it is set to 1.0. • Notations about the sequencing error model: pI (xi ) is an insertion error. Gj−1 pD (xi ) is the probability that base xi is the probability that there is a deletion error after base xi . • Subproblems and the recursive equations: Based on our analysis of error patterns, it is very rare that there are consecutive insertions or deletions in a sequence read. Thus, the following DP algorithm assumes that there is at most one insertion or deletion inside a codon. The algorithm can be extended to handle all possible cases.  VjM (i) is the score of the best alignment matching subsequence model up to the matching state Mj , given that xi and this codon encodes an amino acid emitted by  VjI (i) Ij , given that to the sub- is the third base of a codon Mj . is the score of the best alignment matching subsequence model up to the insertion state x1..i T (xi−2 xi−1 xi )  VjG (i) is the score of the best alignment ending in xi x1..i to the sub- is emitted by Ij . being emitted by state Gj , which encodes an insertion of nucleotides between codons.  VjD (i) is the score of the best alignment matching subsequence model up to the deletion state Dj . 12 x1..i to the sub- VjM (i) = max{ case I : no sequencing error in the codon xi−2 xi−1 xi : M eMj (T (xi−2 xi−1 xi )) × Vj−1 (i − 3) × aMj−1 Mj , I eMj (T (xi−2 xi−1 xi )) × Vj−1 (i − 3) × aIj−1 Mj , D eMj (T (xi−2 xi−1 xi )) × Vj−1 (i − 3) × aDj−1 Mj , G eMj (T (xi−2 xi−1 xi )) × pI (xi−3 ) × Vj−1 (i − 3) × aGj−1 Mj , case II : nucleotide xi−1 is an insertion : M eMj (T (xi−3 xi−2 xi )) × pI (xi−1 ) × Vj−1 (i − 4) × aMj−1 Mj , I eMj (T (xi−3 xi−2 xi )) × pI (xi−1 ) × Vj−1 (i − 4) × aIj−1 Mj , D eMj (T (xi−3 xi−2 xi )) × pI (xi−1 ) × Vj−1 (i − 4) × aDj−1 Mj , G eMj (T (xi−3 xi−2 xi )) × pI (xi−1 ) × Vj−1 (i − 4) × aGj−1 Mj , case III : nucleotide xi−2 is an insertion : Repeat the above f our equations f or eMj (T (xi−3 xi−1 xi )), case IV : there is a deleted nucleotide (represented by d) between xi−1 and xi : M eMj (T (xi−1 d xi )) × pD (xi−1 ) × Vj−1 (i − 3) × aMj−1 Mj , I eMj (T (xi−1 d xi )) × pD (xi−1 ) × Vj−1 (i − 3) × aIj−1 Mj , D eMj (T (xi−1 d xi )) × pD (xi−1 ) × Vj−1 (i − 3) × aDj−1 Mj , G eMj (T (xi−1 d xi )) × pD (xi−1 ) × Vj−1 (i − 3) × aGj−1 Mj , case V : there is a deleted nucleotide between xi−2 and xi−1 : Repeat the above f our equations f or eMj (T (d xi−1 xi )). } 13 In cases IV and V, we use emission probability of d to represent the deleted bases. We choose T (xi−1 d xi ) (or T (d xi−1 xi )) d to maximize the in the matching state Mj . VjI (i) = max{eIj (T (xi−2 xi−1 xi )) × VjM (i − 3) × aMj Ij , eIj (T (xi−2 xi−1 xi )) × VjI (i − 3) × aIj Ij } VjG (i) = max{pI (xi ) × VjM (i − 1) × aMj Gj , pI (xi ) × VjG (i − 1) × aGj Gj } M D VjD (i) = max{Vj−1 (i) × aMj−1 Dj , Vj−1 (i) × aDj−1 Dj } 2.3.3 Running time analysis The time complexity of the above dynamic programming algorithm is is the length of input DNA sequence and |M | O(δ|x||M |), is the number of states in M. δ where |x| is the number of dierent types of errors inside a codon plus the case of insertions between two codons. In our current implementation, δ = 26, which renders a longer running time than the standard Viterbi algorithm. Thus, it is not practical to compare millions of metagenomic sequence reads to over 10,000 protein families in Pfam. Instead, we only run HMM-FRAME on sequences that are likely to contain insertion or deletion errors. For large-scale applications, we suggest applying HMMER, which is as fast as BLAST, to all input sequence reads using a big E-value cuto (such as 100). Alignments covering at least 80% of the translated DNA sequence with signicant E-values can be classied by HMMER in this step. Sequence reads that do not yield any partial alignments are unlikely to be members of any protein family. 14 Thus, we only apply HMM-FRAME to reads yielding partial alignment with marginal scores because these reads could potentially contain sequencing errors. 2.4 Results In this section, we compare the sensitivity and false positive rates (FP rates) of HMMFRAME with GeneWise [33] and FragGeneScan [32]. We then apply HMM-FRAME to Targeted Metagenomics and a published metagenomic data set. Our experimental results show that the length, scores, and E-values of prole HMM alignments are signicantly improved after error correction. As prole HMM-based alignment tools determine membership by comparing E-value or length with user-dened thresholds, the improvement of these parameters enables more error-containing sequences to be classied into their native families. 2.4.1 Accuracy of HMM-FRAME In order to evaluate the accuracy of HMM-FRAME in detecting insertion and deletion errors, we obtained a control data set with annotated error positions from RDP (Cole and Wang, unpublished). In this data set, NifH gene families from the strain DCB-2, the Burkholderia xenovorans Desultobacterium hafniense strain LB40, and the PCC 7120 strain of abaena were amplied and then sequenced using 454 Titanium. An- The sequenced gene families were aligned with the nifH genes in these three organisms using the Needleman-Wunsch algorithm. Insertion and deletion errors were identied from the alignments. After contamination and chimera screening, we had 18,900 sequences, of which 3,408 sequences contained 4,623 insertion or deletion errors. We conducted the protein domain analysis on the 18,900 sequences using HMM-FRAME under the two error models presented in the 15 Method Sec- tion. The input prole HMM was trained on 25 nifH genes obtained from RDP's functional gene repository website [38]. We evaluated the performance of error-prediction tools using two types of sensitivity and FP rates. Let S S+ be the set of error-containing sequences in the control data set. Let be the set of predicted error-containing sequences. FP rate S∩S + S+ are and S−S + , S respectively. The Similarly, let Sequence-level sensitivity Q+ be the set of insertion and deletion positions in error-containing sequences from the control data set. Let of predicted error positions. The Base-level sensitivity and and FP rate are Q Q+ ∩Q Q+ be the set and Q−Q+ Q , respectively. Using the control data set, we rst evaluated the performance of HMM-FRAME under the published GS20 and our self-trained Titanium error models. Then we compared the performance of HMM-FRAME with GeneWise [33] and FragGeneScan [32]. Similar to HMMFRAME, GeneWise can directly compare DNA sequences with a prole HMM and can accept user-dened error rates. We tested GeneWise using dierent parameters including error rates and the alignment score thresholds (ranging from 0 to 20). The results with the best tradeo between sensitivity and FP rate were kept for comparison with HMM-FRAME. FragGeneScan [32] is a newly developed gene prediction tool for short and error-prone sequences. It predicts genes and identies sequencing errors inside predicted genes. We applied FragGeneScan on the above sequence set (all genes) and tested its sensitivity and FP rate. FragGeneScan successfully recognized all input as protein-coding genes, rendering a high gene-prediction sensitivity in this data set. However, FragGeneScan had higher FP rates than HMM-FRAME in error detection. The results are summarized in Table 2.1. Sensitivity and FP rate of each program when detecting annotated insertion and deletion errors in nifH genes. seq-sen: sequence-level sensitivity. base-sen: base-level sensitivity. seq- 16 FP: sequence-level FP rate. base-FP: base-level FP rate. The score cuto of GeneWise is set to zero to maximize the sensitivity. As GeneWise has low sequence-level sensitivity, we did not evaluate its performance at the base-level. Table 2.1: Comparing the error detection performance of HMM-FRAME, GeneWise, and FragGeneScan. HMM-FRAME: HMM-FRAME: G20 self-trained base-sen 95.25% 85.08% seq-FP 0.154% base-FP 2.1% seq-sen 90.6% GeneWise FragGeneScan 53.8% 83.04% 82.4% 0 0.003% 53.39% 0.001% 0.7% 59.57% Sensitivity and FP rate of each program when detecting annotated insertion and deletion errors in nifH genes. seq-sen: sequence-level sensitivity. basesen: base-level sensitivity. seq-FP: sequence-level FP rate. base-FP: baselevel FP rate. The score cuto of GeneWise is set to zero to maximize the sensitivity. As GeneWise has low sequence-level sensitivity, we did not evaluate its performance at the base-level. As shown in Table 2.1, each tool has higher sensitivity and smaller FP rates in identifying error-containing sequences than in locating error positions. HMM-FRAME has a better tradeo between sensitivity and FP rate than both GeneWise and FragGeneScan. Both GS20 and our self-trained Titanium error models have small FP rates in predicting error positions, but GS20 has higher sensitivity. Thus, we plan to use GS20 in all further experiments. 2.4.2 Using HMM-FRAME in Targeted Metagenomic In this section, we present the utility of HMM-FRAME in two applications of Targeted Metagenomics", where one or several gene families are amplied from environmental DNA and these amplicons are sequenced using high-throughput sequencing platforms. One typical application of Targeted Metagenomics is to sequence the amplicons of the 16S rRNA gene for phylogenetic complexity analysis. Besides 16S rRNA, protein-coding genes that are 17 important to a particular habitat can be amplied and sequenced for targeted functional analysis in metagenomic data sets. For example, Targeted Metagenomics of the nifH gene, which encodes nitrogenase reductase, is important for analyzing microbial genomes sequenced from soil. Although these sequences are sampled from one or several targeted gene families, frameshift errors can cause short alignments with marginal scores between the input and the targeted gene families. As a result, sequences lacking signicant alignment length and scores will be regarded as contaminants and be discarded. Thus, it is desirable to x frameshift errors to maximize the number of usable samples. Given a DNA read and a prole HMM built from a set of known protein sequences, HMM-FRAME can be applied to detect and correct frameshift errors in amplicon reads. 2.4.2.1 Protein domain analysis of nifH sequences In the rst experiment, we obtained 3,937 nifH sequences of an average length of 76 bases generated by the 454 FLX sequencing technology. In order to discard contaminants that originated from non-target genes, we aligned the 3,937 sequences with the nifH gene family, which was built on a small set of 25 expert-veried full-length nifH protein reference sequences from RDP's functional gene repository [38]. In the gene family building process, we rst applied ClustalW [39] to align the 25 reference sequences. Then we applied HMMER 3.0's hmmbuild program to derive a prole HMM from the multiple sequence alignment. Of the 3,937 454 FLX sequences, 111 were found to be contaminants and were excluded from further analysis. Of the remaining 3,826 sequences, HMM-FRAME detected 296 insertions and deletions in 256 sequences. Thus, approximately, 7% of the samples contained frameshift errors. Of the 256 sequences containing insertion or deletion errors, 224 (87.5%) only contained one insertion or deletion error. 24 (9.4%) sequences contained two errors, and eight 18 (3.1%) contained three errors. Of the 296 insertions or deletions, 224 (75.7%) were inside or beside homopolymer regions. Because protein domain classication tools compare alignment lengths, scores, and Evalues with pre-dened thresholds to determine a sequence's membership, the changes in the alignments aect the nal domain composition analysis. After error correction, prole HMM-based alignment tools are expected to generate longer alignments with bigger scores and smaller E-values. This gives error-containing sequences a better chance of being classied into the correct families rather than being labeled contaminants. In order to conduct a fair comparison on alignments before and after error correction, we choose a third-party tool HMMER to generate alignments for original and corrected sequences. The changes of alignments' E-values and lengths due to error correction are presented in Figure 2.2. In this gure, the changes of alignments are presented for 256 sequences in which HMM-FRAME detects errors. Original" refers to HMMER alignments on sequences before error correction. Corrected" refers to HMMER alignments on sequences after error correction by HMM-FRAME. As a comparison, we also plot the length of the original sequence reads (with the legend sequence read"). They largely overlap with the length of corrected alignments, indicating that complete sequence reads can be aligned with the nifH prole HMM after error correction. In order to test whether the improvement was statistically signicant, we conducted a two-sample Kolmogorov-Smirnov test (K-S test) on the alignments' lengths and E-values before and after error correction. The p-values for the alignments' length and E-value distributions were 3.1037e-010 and 1.1802e-045, respectively. In particular, the comparison between alignments' lengths and the sequence reads' lengths shows that most partial alignments generated by error-containing sequences become complete alignments after error cor- 19 90 Lengths and E-values of pHMM alignments 70 50 LOG(original E-value) LOG(corrected E-value) original length corrected length Sequence read length 30 10 -10 1 21 41 61 81 101 121 141 161 181 201 221 241 -30 -50 NifH family reads Figure 2.2: Change of HMMER alignments' scores, lengths, and E-values (in log space) before and after error correction for nifH sequences. (For interpretation of the references to color in this and all other gures, the reader is referred to the electronic version of this dissertation) rection. Thus, when comparatively longer alignments (e.g., 23 amino acids or 69 bases) are required for domain classication, more sequence reads (213 more under when the threshold is 69 bases) will be classied into their native families. 2.4.2.2 Protein domain analysis of the bacterial aromatic dioxygenase genes In the second experiment, we obtained 2486 pyrosequencing samples of an average length of 224 bases from the bacterial aromatic dioxygenase genes in a soil sample [40]. Although these pyrosequencing reads were sequenced from the 5' end of PCR amplicons of bacterial aromatic dioxygenase genes, we were interested in classifying them into three sub-families of dioxygenase genes: toluene/biphenyl, naphthalene, and benzoate [41]. 20 Note that there is another subfamily (phthalate). However, due to lack of training proteins for this family (Dr. Iwai, personal communication), we only searched for members of three sub-families. Three sets of reference protein sequences were extracted from Pfam [12] for toluene/biphenyl, naphthalene, and benzoate [41]. Based on these training sets, we built three prole HMMs using ClustalW and HMMER. Then we applied HMM-FRAME to align the 2486 reads with the three prole HMMs. HMM-FRAME detected 77 insertions and 52 deletions, which were distributed in 121 sequences. Of the 121 error-containing sequences, 77 could not be classied into any subfamily by HMMER under the E-value threshold 0.1. After error correction using HMM-FRAME, these 77 sequences were classied into dierent families with an average Evalue of 3.3e-06, indicating that they were highly likely to be true members of the underlying families. For other error-containing sequences, the prole HMM alignments' E-values and lengths were signicantly increased after error correction. The change is plotted in Figure 2.3. In this gure, the data sets is sequenced from bacterial aromatic dioxygenase genes in a soil sample. All alignments are generated by HMMER for a fair comparison. Original" refers to HMMER alignments on sequences before error correction. Corrected" refers to HMMER alignments on sequences after error correction by HMM-FRAME. We also applied a two-sample K-S test on the alignments' lengths and E-values before and after error correction. The p-values for the length and E-value distributions were 8.0609e011and 1.9776e-040, respectively. The improved alignment lengths and E-values provide stronger evidence for the membership of the input samples. In total, after error correction by HMM-FRAME, we could classify 1,214 sequences into three subfamilies. were members of the naphthalene subfamily. 1,042 reads 96 reads belonged to the benzoate subfam- ily. 76 reads belonged to the toluene/biphenyl subfamily. The remaining 1272 reads could potentially be members of the subfamily phthalate (Dr. Iwai, personal communication). 21 89 Lengths and E-values of PHMM alignments 69 49 29 original length corrected length LOG(original E-value) LOG(corrected E-value) 9 -11 -31 -51 1 10 19 28 37 46 55 Soil sample sequence reads Figure 2.3: Change of HMMER alignments' lengths, scores, and E-values (in log space) before and after error correction for the bacterial aromatic dioxygenase genes in a soil sample. 2.4.3 Protein domain classication in the deep mine data set In order to show the utility of HMM-FRAME in a metagenomic data set containing members of multiple domain families, we applied HMM-FRAME to the rst 454 sequencing project for environment samples, which were sequenced from two sites in the Soudan Mine, Minnesota, USA [42]. In this experiment, we downloaded the Black Sample from the paper's supplementary data website. This data set contains 388,627 sequence reads with an average length of 99 bases. 22 There were two steps in the annotation. First, we applied gene-prediction tools. Second, we conducted the domain classication on predicted genes. tools are available for metagenomic data sets. A number of gene-prediction However, not every tool can handle short reads. Glimmer [43] did not output meaningful predictions when it was applied to this data set. The sensitivity of Metagene [44] drops to 59% for 100-base sequences [45]. We thus chose FragGeneScan, a newly developed gene-prediction tool for short reads. FragGeneScan predicted 281,658 genes, of which 72,355 contained errors. For convenience in discussion, let S be the set of genes predicted by FragGeneScan. Let S' be the raw read set corresponding to genes in S. Thus, 72,355 sequences in S were dierent from their raw reads in S' because FragGeneScan predicted and corrected errors in S'. We compared three domain classication pipelines: 1) apply HMMER 3.0 on raw reads S', 2) apply FragGeneScan and then HMMER on corrected reads S, and 3) apply HMM-FRAME on raw reads S'. We recorded how many reads could be classied into one of the 2,558 Pfam domain families that contain the keyword bacteria". The number of classiable reads for the three pipelines were: 13,544 for HMMER, 12,328 for FragGeneScan + HMMER, and 17,496 for HMM-FRAME. The classication results have large overlaps, which are illustrated in Figure 2.4. In this gure, sequence sets that can be classied by HMM-FRAME, HMMER, and FragGeneScan+HMMER are represented by three sets A, B, and C. |A| = 17, 496. |B| = 13, 544. |C| = 12, 328. B − C = 2224. C − B = 1008. C − A = 4. A − (B + C) = 2948. In summary, HMM-FRAME was able to classify 2,948 more reads than the other two annotation pipelines. HMM-FRAME found errors in all of these 2,948 reads. Thus, it is likely that other two pipelines failed to classify them because of frameshifts. HMM-FRAME failed to classify four reads that can be aligned by FrageGeneScan+HMMER. A closer examination showed that FragGeneScan and HMM-FRAME output dierent error positions in these four 23 B: HMMER alone C: FragGeneScan +HMMER A: HMMFRAME Figure 2.4: Protein domain classication results for the black sample in the deep mine data set. sequences. The performance evaluation of FragGeneScan must consider both gene-prediction and error-prediction. Of the 281,658 predicted genes, only 12,328 could be classied into existing domain families. Further analysis is needed to examine whether other predictions are novel genes or wrong predictions. It is worth noting that FragGeneScan could classify 1,008 more sequences after its error correction than applying HMMER 3.0 alone on raw reads. However, while 2,224 raw reads could be classied into existing domain families by HMMER 3.0, they could not be aligned with any family after error correction by FragGeneScan. This indicates that FragGeneScan might have over-predicted errors in the 2,224 sequences. This is consistent with our observation that FragGeneScan has a high FP rate in the control data set. 24 2.5 Conclusion Despite the advances of high-throughput sequencing technologies, sequencing errors still pose challenges for data annotation. In particular, our error model analysis shows that 454 FLX Titanium only slightly decreases the insertion and deletion error rates compared to GS20. Thus, correcting frameshifts caused by insertion or deletion errors is still important for metagenomic sequence annotation. In this work, we introduce a protein domain classication tool HMM-FRAME, which can classify error-prone DNA sequence reads into protein domain families. HMM-FRAME can accept any error model trained on data from high-throughput sequencing technologies and thus achieve high detection sensitivity while maintaining a low false positive rate. Applying HMM-FRAME to a data set with annotated errors shows its high sensitivity and accuracy in error detection. In particular, by xing frameshift errors, we can obtain signicantly longer prole HMM alignments with smaller E-values. As alignments' lengths, scores, and E-values are often used to determine family membership, improving them helps to classify more sequences into the native domain families. In our experiments, sequences that fail HMMER 3.0 under the default E-value or score threshold are classied into correct domain families using HMM-FRAME. Thus, HMM-FRAME can be used as a complementary tool to HMMER 3.0 on error-prone sequences. 25 Chapter 3 Prole HMM-based protein domain classication for short sequences 3.1 Background With the advent of next-generation sequencing and culture-independent methods, an enormous amount of metagenomic data have been sequenced from microbial communities from dierent habitats. In order to understand the phylogenetic complexity and biological functions of microbial communities, as well as their interactions with the host, automatic annotation tools such as CAMERA [5], MG-RAST [6], and MEGAN [7] are being used for annotating metagenomic data sets. As an important component of these metagenomic annotation tools, protein homology search provides basis for identifying putative genes and assigning those genes to annotated functional categories (e.g. protein domain families). Because of the high sensitivity of remote homology recognition, HMMER has been successfully applied to genome-wide domain analysis. However, its sensitivity is signicantly limited by the short reads of metagenomic data sets and poorly conserved domains. In order to investigate how read length and domain identity aect the sensitivity of HMMER, we randomly sampled 200 peptides with lengths of 12, 20, and 28 amino acids from the seed sequences of each of the 2,558 Pfam domains, which contain the word Bacteria" in their de- 26 scriptions. The peptides were aligned with the domain families using HMMER. We used the E-value cuto 1000 in order to boost the sensitivity. For each domain, the read classication sensitivity of HMMER is measured as the ratio of the number of aligned reads to the total number of sampled reads. We sort all data points by domain identity in ascending order and plot them in Figure 3.1. For domains with the same identity, their average sensitivity is reported. 1 0.9 Sensitivity of HMMER3 0.8 0.7 0.6 0.5 0.4 0.3 read length: 36 bp read length: 60 bp read length: 84 bp 0.2 0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Average sequence identity of domain Figure 3.1: Change of the read classication sensitivity of HMMER over read length and the average sequence identity of domain families. Figure 3.1 shows that the sensitivity of HMMER deteriorates with the decrease of the query sequence length and domain identity. The sensitivity is decreased from 90% to 65-70% when the lengths of reads change from 28 residues (i.e., 84 bp for corresponding DNA reads) 27 to 20 residues (i.e., 60 bp for DNA reads) for domains with identity around 40%. Although next-generation sequencing technologies are producing longer reads and assembly tools may be available to assemble short reads into longer contigs, there is still a need for a protein domain analysis tool for short reads. First, many nished or on-going metagenomic sequencing projects contain reads with lengths from 35 to around 400 bp depending on the chosen sequencing technologies. In addition, peptide sequences encoded in individual metagenomic sequence reads may share only small overlaps with existing domain families. Thus, a sizable portion of many available data still contains short reads. Second, the sheer amount of data and the complexity of many metagenomic data sets pose a great challenge for assembly tools [46]. A large portion of short reads cannot be correctly assembled into longer contigs. Third, many domain families exhibit low average sequence identity, which poses a challenge for short and medium-sized reads. Figure 3.2 shows the histogram of pairwise sequence identity for domains related to bacteria. Of 2558 domains, there are about 43% domains with average identity no greater than 0.3. For these domains, the sensitivity of HMMER is between 0.7 and 0.8 for reads of length 84 bp, between 0.4 and 0.6 for reads of length 60 bp, and smaller than 0.1 for reads of length 36 bp. As a result, although a large number of reads are sequenced from genes, which are highly compact in microbial genomes, only a small percentage of the short reads can be classied into their native domains using existing tools. In this work, we introduce MetaDomain, a protein domain classication tool designed for short reads in metagenomic data sets. MetaDomain provides a complementary protein analysis tool to HMMER on assigning short reads into their native families. 28 800 700 Number of domains 600 500 400 300 200 100 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Average sequence identity of domain 0.9 1 Figure 3.2: Histogram of the average pairwise sequence identity for 2558 domains 29 3.2 Related Work Prole HMM-based protein homology search is widely used for mining microbial genomes. Knowing the composition of dierent domain families encoded in a metagenomic data set helps us understand which functions are important for a particular habitat. For example, Ellrott et al. [8] investigated the distribution of protein families in the available human gut genomic and metagenomic data. As the data set contains assembled contigs, using HMMER is expected to achieve high sensitivity. Schlüter et al. [9] used HMMER to understand the genetic diversity and composition of a plasmid metagenome from a wastewater treatment plant. The reads have an average length of 104 bp, which is also adequate for HMMER to achieve high sensitivity. Besides providing a basis for functional proling, prole HMM-based homology search was also used for phylogenetic complexity analysis in metagenomic data. The phylogenetic algorithm CARMA [10] uses all Pfam domain and protein families as phylogenetic markers to identify the source organisms of environmental DNA fragments as short as 80 bp. As we show in Figure 3.1, prole HMM-based tools have sensitivity of at least 0.9 in classifying reads of 80 bp into domains with average sequence identity above 40%. However, for poorlyconserved domains, a signicant number of reads might be missed. A similar but faster tool Treephyler [47] conducted community proling in metagenomics and metatranscriptomics based on Pfam domain assignments. Treephyler was applied to a data set with average read length of 200 bp. It is unclear how shorter reads aect its performance. Our previous work designed a tool HMM-FRAME [48], which can identify and correct frame-shift errors in pyrosequencing reads during protein domain classication using prole HMM-based alignment. However, it was not specically designed to handle short reads. 30 Finally, we note that the method used in MetaDomain shares a similar rationale to the recent work by Weng et al. [49]. Weng et al. reported that taxonomic binning tools for metagenomes discard 30-40% of Sanger sequencing data due to the stringency of BLAST cut-os. Thus, they re-analyzed the discarded reads using less stringent cut-os. In or- der to control the false positive matches introduced by the relaxed cut-os, they used the evolutionary conservation of adjacency between neighboring genes as an additional criterion. 3.3 Method HMMER uses E-values as the discrimination threshold to determine the membership of a query sequence. However, short reads may only generate low alignment scores and thus insignicant E-values. In particular, the conservation across the entire length of a domain family can be highly variable, posing a great challenge for classifying reads sequenced from poorly conserved sub-regions. In order to increase the sensitivity of aligning remotely-related short reads, we propose position-specic score cutos, by which poorly conserved regions allow more relaxed discrimination thresholds than well-conserved regions. However, the low thresholds can easily incur random matches. In order to control the false positive rate, we examine the position distribution of read alignments. The position distribution of read alignments on a truly encoded domain is expected to be more uniform than a domain that incurs random read alignments [50, 51]. Figure 3.3 shows the schematic representations of three types of distributions of read alignments along a domain. The alignments in (A) and (B) are more likely to be random. Thus the domains may not be encoded in the data set. The alignment distribution in (C) exhibits a much more uniform distribution, providing strong evidence for the existence of the underlying domain in the data set. Thus, by using relaxed 31 position-specic score cutos and inspecting the distribution of alignments, we expect to classify more short reads into the correct domain families while not falsely reporting domains that are not characterized in the data. Figure 3.3: Three types of alignment distributions. 3.3.1 Pipeline of MetaDomain The input to MetaDomain includes sequence reads and a list of protein domains. The output is a list of domains encoded in the underlying data set and the number of aligned reads. Figure 3.4 shows a schematic representation of the pipeline of MetaDomain. MetaDomain consists of three main stages: short read alignment, ltering, and classication. In the alignment stage, we use the Viterbi algorithm [17] to search for the best local alignment between a query sequence and a prole HMM-represented domain family. In the ltering stage, we rst apply a position-specic score threshold to eliminate insignicant alignments. Then we remove stacked alignments with the same alignment positions inside a poorly conserved region. In the nal stage, we use the number of aligned reads and the distribution of alignment positions to determine whether a domain is encoded. 32 Sequence reads Viterbi algorithm Pfam domains Optimal local alignments Position-specific threshold Filtering Trimmed alignments Read number and domain coverage thresholds Classification Transcribed or encoded protein domains Figure 3.4: Pipeline of MetaDomain. 33 3.3.2 The Viterbi algorithm The Viterbi algorithm aligns a query sequence to a prole HMM by searching for the most probable state path in the model. Unlike HMMER, MetaDomain directly aligns a DNA sequence to a prole HMM. To do so, we implicitly align translated peptides under dierent reading frames with a prole HMM. Let π be a state path in a prole HMM M and let be a query DNA sequence. The Viterbi algorithm searches for the most probable path such that π ∗ = argmaxπ (x, π). x π∗ The output of the Viterbi algorithm includes the optimal alignment and its score. As Viterbi is a standard algorithm designed for HMMs, we refer readers to Durbin et al.[17] for a detailed illustration of the dynamic programming equations for nding π∗. The major dierence between our implementation and the standard Viterbi algorithm includes : 1) our implementation accepts a DNA rather than a peptide sequence as input; 2) a local alignment can start and end with any state without incurring insertion or deletion penalties. 3.3.3 Alignment Filtering MetaDomain employs two ltering mechanisms to increase its sensitivity in aligning short reads while maintaining a low false positive rate: position-specic thresholds (PSTs) and trimming. 3.3.3.1 Position specic threshold PST allows dierent alignment thresholds for well conserved and poorly conserved regions. Let the length of a query DNA sequence be Mi,j L (in bp). Denote the prole HMM as be a sub-model formed by all consecutive states from the 34 i th match state M. Mi Let to the j th match state Mj . The upper bound of the alignment score against Mi,j score that can be generated by aligning any input sequence of length Let ai,j denote the transition probability from state probability of state Mi,j Mi emitting amino acid a. Mi to state Mj . is the maximum j−i+1 Let Then the upper bound ei (a) Ui,j with Mi,j . denote the for sub-model is calculated as follows: j ak,k+1 × max(ek (a)) Ui,j = k=i where aj,j+1 is set to 1 because j is the ending state of the sub-model. We dene PST for the submodel Mi,j as: P STi,j = γUi,j where the coecient γ is a user-specied parameter in the range of [0,1]. It can be exibly adjusted to control the trade-o between sensitivity and false positive rate of MetaDomain. The default value is 0.6, which is used in our experiments. 3.3.3.2 Alignment trimming Alignment with scores larger than their corresponding PSTs will pass the rst ltering stage. As each domain has various conservation along the entire length of the model, well-conserved sub-regions have high PSTs while poorly-conserved sub-regions yield low PSTs. Thus, random sequences tend to be aligned to poorly-conserved regions by MetaDomain, incurring a high FP rate. Our empirical experiments show that dozens of reads that are not sequenced from the underlying domain can be aligned to the same position in a poorly-conserved subregion. In order to minimize the eects of noise, we discard stacked alignments that have 35 the same alignment positions. 3.3.4 Protein domain classication In this stage we extract two features from the collected read alignments for each domain: the number of aligned reads and the domain coverage. The domain coverage is the fraction of positions covered by at least one read alignment in a domain. MetaDomain then applies a simple decision tree to classify all the target domains into two classes: encoded domains and non-encoded domains. If both features of a domain are equal to or bigger than their corresponding thresholds, this domain will be classied as encoded. Otherwise it is not encoded in the sample. By default, the cuto for domain coverage is 30%. Ideally, the cuto for the number of aligned read should be determined based on the properties of data such as sequencing depth. If users do not specify this value, we use 20 by default. 3.4 Results In order to evaluate the performance of MetaDomain on real data generated by nextgeneration sequencing technologies, we applied MetaDomain to protein domain analysis in two data sets. The rst one is the transcriptome generated using RNA-seq for cenocepacia. Burkholderia As both the reference genome and its domain annotations are available, we can quantify the sensitivity and false positive (FP) rate of MetaDomain. The second one is metagenome data sequenced from soil. We applied MetaDomain to identify domains encoded in the underlying data. In addition, we compared HMMER and MetaDomain in both applications. 36 3.4.1 Identifying transcribed protein domains in transcriptome In this experiment, we conducted transcribed domain analysis in the transcriptome from one strain of B. cenocepacia named AU1054 [52]. By using Illumina RNA-seq, the authors generated multiple samples for AU1054 in two growth media. We used one replicate of cDNA sample of AU1054 in the growth medium cystic brosis. In total, 3,361,008 reads of a length of 41 bp were downloaded from the website provided by the authors. We evaluated the performance of read classication and domain identication of MetaDomain and HMMER. 3.4.1.1 Performance of read classication The performance of read classication is quantied using both read classication sensitivity and FP (false positive) rate. In this experiment, the read classication performance is computed on reads that can be mapped to annotated domains. Below we sketch the main steps to obtain mapped reads for a domain using the reference genome and the domain annotations. First, we downloaded the genome of AU1054 and the annotated genes and domains from the IMG website [53]. There are 2,181 annotated Pfam domains. Second, the reads were mapped to the reference genome using Bowtie [54] with two mismatches allowed. Third, we compared the positions of read mapping and annotated domains. For a domain, all reads that fall into it are dened as mapped" reads. Denote the set of mapped reads as All other (unmapped) reads constitute set aligned reads for a domain be a domain are A∩M M and A. A−M , U U. M. For a domain classication tool, let the set of Thus, the sensitivity and FP rate of read classication for respectively. A perfect sensitivity indicates that all mapped reads can be aligned. A zero FP rate indicates that only mapped reads can be aligned to a domain. Of the 2,181 annotated families, we evaluated the performance of HMMER and Meta- 37 Domain on 1406 families which have at least 1 mapped reads. Of the 1406 tested domains, HMMER could not align any read to 1150 domains, resulting in zero sensitivity and FP rate. For the rest 256 domains, all aligned reads by HMMER are non-mappable reads, resulting in zero sensitivity and a positive FP rate. The comparison between HMMER and MetaDomain is summarized using a bubble chart in Figure 3.5. The biggest bubble indicates that HMMER has zero sensitivity and zero FP rate for 1150 domains. As we can see, it is highly dicult for HMMER to correctly align reads as short as 41 bp. There are two reasons for the low sensitivity of HMMER on short reads. First, the parameter training in E-value calculation of HMMER are based on much longer reads (100 amino acids). Thus, the small alignment scores generated by the short reads yield large E-values and cannot pass the E-value threshold. Second, the small alignment scores of short reads may not pass the ltration stage of HMMER. 3.4.1.2 Identifying transcribed domains in the transcriptome Figure 3.5 only shows the read classication performance. MetaDomain uses both aligned read number and domain coverage as thresholds for domain identication. We expect that the additional constraint will reduce the false positive rate in domain identication. Because of the low read classication sensitivity, we speculate that HMMER will have low sensitivity in identifying transcribed domains. In order to quantify the performance of domain identication, we need to build positive and negative test sets, which include transcribed and non-transcribed domains based on mapped reads. There is no commonly accepted criterion to dene transcribed genes using the number of mapped reads. Various expression scores such as an average coverage depth across the entire length of each gene [55] and reads per kilobase of exon model per million 38 1 0.9 HMMER3 MetaDomain 0.8 Sensitivity 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 1 2 3 4 FP rate 5 6 7 −5 x 10 Figure 3.5: Read classication sensitivity and FP rate of HMMER and MetaDomain. The size of each bubble represents the number of data points (i.e., domains) with the same sensitivity and FP rate. mapped reads (RPKM) are used to quantify transcriptional level. In addition, the cutos of dening highly transcribed, lowly transcribed, or non-transcribed genes are variable in dierent applications [56]. In this work, we dene transcribed domains based on the rationale that a truly transcribed domain should be mapped by a number of reads at dierent positions. Correspondingly, we use the following criteria to determine whether a domain inside a gene is transcribed: 1) at least N reads are mapped to a domain; 2) at least 30% of positions in a domain are mapped by reads. of mapped read is zero. A domain is labeled non-transcribed" if the number For domains that fall between the criteria for transcribed and non-transcribed domains, they are labeled unknown" and are excluded from the test sets. Table 3.1 shows the size change of the positive and negative test sets over the cuto N. 39 Table 3.1: Number of transcribed and non-transcribed domains using dierent cutos (N) for the number of mapped reads. N transcribed unknown nonetranscribed 10 318 1317 546 15 262 1373 546 20 226 1409 546 25 195 1440 546 30 169 1466 546 Intuitively, bigger N creates an easier case for domain classication than smaller N. We align all reads to the transcribed and non-transcribed domains using MetaDomain and HMMER. The unknown" domains are removed due to their ambiguity. For HMMER, we rst translated the short reads into peptide sequences using 6-frame translations. We then aligned the domains with the translated sequences using 1000 as the E-value threshold, which is chosen to maximize the sensitivity. For MetaDomain we directly aligned the short reads with the domains. The pipeline in Figure 3.4 was used to output a list of transcribed domains for MetaDomain. Let D+ and D− be the number of transcribed and non-transcribed domains identied using the read mapping results in Section 3.4.1.2. Let M + and M − be the predicted number of transcribed and non-transcribed domains by MetaDomain or HMMER. The sensitivity and FP rate of domain classication tools are dened using the following equations: Sensitivity FP rate = + + = D ∩M + D − ∩M + D D− The values of D and M are aected by several options. First, D+ and D− can change over the cuto N as shown in Table 3.1. Second, we used both the domain coverage and the 40 number of aligned reads to determine whether a domain is encoded or transcribed. In this experiment, the cuto for domain coverage is 30%, which we found reasonable across dierent experiments. Thus, M+ M− and mainly change over the required number of aligned reads to a domain. For simplicity, we denote the cuto as τ. Increasing τ implies a more stringent constraint for dening transcribed domains, and thus might result in lower sensitivity and a smaller FP rate. Decreasing τ is likely to increase the sensitivity while incurring a higher FP rate. In order to compare the performance of MetaDomain and HMMER under dierent we plotted the ROC curves by changing N=10 HMMER3 MetaDomain 0.1 FP rate 0.8 0.6 0.4 HMMER3 MetaDomain Sensitivity 0.4 1 0.8 Sensitivity Sensitivity N=30 1 0.6 0 0 from 1 to N for N=10, 20, and 30 in Figure 3.6. N=20 0.8 0.2 τ 0.2 0.2 τ, 0 0 0.6 0.4 0.2 0.1 FP rate 0.2 0 0 HMMER3 MetaDomain 0.1 FP rate 0.2 Figure 3.6: ROC curves of HMMER and MetaDomain. Figure 3.6 shows that HMMER is highly specic (FP rate ≤ 1.3%). However, as we speculated, its sensitivity is low, with the highest sensitivity being only 0.135. HMMER misses a large portion of short reads that can be mapped to protein domains even when we use a very relaxed E-value cuto. When both tools incur an FP rate of 0.02, the sensitivity of MetaDomain is 0.53 vs. 0.13 for HMMER. When N decreases from 30 to 10, the size of the positive test set D+ becomes larger and the sensitivity of both HMMER and MetaDomain 41 decreases. Note that the sensitivity and FP rate of HMMER keep the same for many dierent thresholds (i.e., τ ), resulting in compact ROC curves. Overall, the ROC curves show that MetaDomain can achieve higher sensitivity while keeping a similar FP rate as HMMER for domain classication in this experiment. In addition, Figure 3.6 provides guidance on determining appropriate τ for MetaDomain in order to achieve desired sensitivity and FP rate. On average, it took MetaDomain 280 seconds to align 752,156 reads with one domain on a 2.2GHz dual-core AMD Opteron machine. There are 2181 domains in this experiment. 3.4.2 Protein domain analysis in a soil metagenomic data set In the rst experiment, we demonstrated the accuracy of MetaDomain in classifying short RNA-seq reads into their native domain families. In this section, we present the utility of MetaDomain in identifying encoded protein domains in a complicated metagenomic data set, which is sequenced from the microbes dwelling in the soil from a long term cultivated corn site at Iowa using Illumina HiSeq platform 1. There are 520,346,510 sequence reads of various lengths, ranging from 31 bp to 114 bp. The average length of the reads is ∼ 73 bp. Figure 3.7 shows the distribution of the read lengths in this data set. The sheer amount of data and the complexity of this data set pose a great challenge for read assembly programs. Thus, we directly apply MetaDomain and HMMER to unassembled reads. We rst used HMMER to align all the reads against the 2558 Pfam domains that contain the word Bacteria in their descriptions. Using E-value 1000, HMMER classied 34,602,784 reads into 2558 Pfam domains, accounting for 6.65% of all reads in the data set. classiable reads have an average length of 1 Sequenced ∼ The 80 bp. A large number of reads shorter than by James Tiedje Lab at Michigan State University. Unpublished yet. 42 0.5 0.45 0.4 Percentage 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 30 40 50 60 70 80 90 100 110 Read length (bp) Figure 3.7: Read length distribution in the soil data set. 60 bp were not classied. By conducting a complementary domain analysis on short reads using MetaDomain, we expect to classify more reads to their native families. As many domains have been aligned with a large number of reads by HMMER, it is highly likely that they are encoded by the bacterial species in the soil data set. We thus excluded them from further screening. There are 80 domains with less than 20 reads aligned or with a smaller domain coverage than 30%. The left panel of Figure 3.9 shows an example of such domain. The small number of aligned reads and their biased distribution do not support the representation of this domain in this data set. We thus applied MetaDomain to the 80 domains and investigated whether they are encoded. On average, it took MetaDomain about 31 CPU minutes to align 11,194,176 sequences that were not classied by HMMER and are shorter than 60 bp against one domain family. Figure 3.8 presents the number of aligned reads and the domain coverage output by 43 HMMER and MetaDomain. Note that MetaDomain was only applied to reads that were not classiable by HMMER. Thus, the total number of reads that can be classied into each domain should be the sum of the output of HMMER and MetaDomain. This gure shows that signicantly more reads can be classied into the corresponding domains. We need to specify thresholds for domain coverage and the number of aligned reads in order to dene an encoded domain. Similar to previous experiments, the domain coverage cuto is 30%. 10 Log(Aligned read number) of HMMER3 Domain coverage of HMMER3 Log(Aligned read number) of MetaDomain Domain coverage of MetaDomain Aligned read number and domain coverage 9 8 7 6 5 4 3 2 1 0 0 10 20 30 40 50 Pfam domain index 60 70 80 Figure 3.8: Reads aligned by HMMER and MetaDomain. According to Figure 3.6, the sensitivity and FP rate of MetaDomain are 0.34 and 0.004 when τ is 20. We thus choose 20 as the cuto for the number of aligned reads. Out of the 80 Pfam domains, 24 have less than 20 reads aligned or a domain coverage no greater 44 than 30%. So these 24 domains are not likely to be encoded in this data set. For the other 56 domains, their average domain coverage by MetaDomain alone is 97.25%. The number of aligned reads by MetaDomain is 169.52 versus 15.27 by HMMER. This provides strong evidence that these protein domains are actually encoded. The average read length aligned by MetaDomain is 38 bp. As an example, Figure 3.9 shows the distribution of aligned reads by HMMER and MetaDomain for the domain PF09703. In summary, by using MetaDomain, we are able to identify 56 more domains encoded in this data set. 130,930 (0.025%) more reads are classied into these domain families. 0 100 200 300 Aligned read positions by HMMER3 0 100 200 300 Aligned read positions by MetaDomain Figure 3.9: The distributions of aligned reads for PF09703 by HMMER and MetaDomain. Among these 56 protein domains, 21 have unknown functions. 6 domains are CRISPRassociated domains. These kinds of domains are found in the genomes of approximately 40% of bacteria and 90% of archaea. More detailed analysis is needed to understand whether the functions of these domains are important to the specic habitat. 45 3.5 Conclusion In this work, we introduce MetaDomain, a protein domain classication tool for short reads produced by next-generation sequencing technologies. It provides a complementary domain classication tool to HMMER on classifying short reads into domain families with low sequence identity. Our experimental results show that it can achieve a better tradeo between sensitivity and FP rate than HMMER in classifying short sequences. Its current version is based on a faithful implementation of Viterbi and is slow when applied to thousands of millions of reads and the whole Pfam database. We plan to improve its eciency by using ltration strategies such as ungapped alignment and parallel programming. In addition, we plan to improve the method of designing position-specic score thresholds in order to achieve a better discrimination power. 46 Chapter 4 A Sensitive and accurate protein domain classication tool (SALT) for short reads based on prole HMMs and graph algorithms 4.1 Background Protein domain analysis has been widely used for functional annotations of RNA-Seq data [1, 2, 3, 4]. In particular, quantifying the expression levels of protein domains helps us understand how transcriptional changes of domains are associated with sequencing conditions, sampling tissues, or experimental treatments in RNA-Seq data. For example, computational domain analysis was applied to identify domains that play a role in vernalization and eux transporters in the gibberellin response in sugar beet [1]. Domain analysis is also frequently used to evaluate and compare gene annotation quality of dierent gene-nding tools [3] or to compare domain composition of data sampled using dierent techniques [4]. Existing domain classication tools are mainly designed for complete or near-complete domain member sequences rather than fragmentary and short reads in NGS data. Prole- 47 based domain classication tools rely on alignment scores to distinguish true homologous sequences from false ones. Short reads incur marginal alignment scores and thus can be easily missed by these tools. In particular, although HMMER can achieve high sensitivity in recognizing remote homologs of a domain family, it shows low sensitivity in classifying partial sequences of remote homologs [57]. When the reference genomes are available, genome-wide gene and domain annotations and read mapping positions can be combined to determine the membership of reads. However, many NGS data sets lack complete or quality reference genomes, such as complicated metagenomic data sets and transcriptomic data of some non-model organisms. For these data, protein domain analysis is applied to the short reads directly or to the contigs that are generated by de novo sequence assembly tools. Contigs produced by de novo sequence assembly tools enable NGS data analysis to take advantage of conventional bioinformatics tools designed for longer sequences. However, using sequence assembly tools as the rst step is not ideal for protein domain classication. First, the quality of read assembly deteriorates signicantly in complicated NGS data sets [58]. Dierent sequence assembly tools generate dierent sets of contigs. There is no consensus on the best assembly tool. Second, successful de novo assembly requires high sequencing depth, which is dicult to achieve for all domain regions. It is often observed that some domain regions are highly transcribed while others are transcribed much less. Third, dierent protein-coding genes can share the same domain. As a a result, similar domain regions can contribute to the nodes forming crossing points between dierent paths in a De Bruijn graph. for de novo Those repeat-like sequences add diculty sequence assembly. Thus, there is a need for an alternative and better domain classication tool for NGS data lacking reference genomes. In this work, we designed a Sensitive and Accurate protein domain cLassication Tool 48 (SALT), which uses prole HMMs and family-specic contig generation algorithms to classify short reads into domains. SALT is mainly designed for domain classication in transcriptomic data of non-model organisms that lack complete or high-quality reference genomes, and for metagenomic data sets. In this work, we focus on RNA-Seq of non-model organisms. In order to evaluate the performance of SALT, we applied SALT to RNA-Seq data of species with quality reference genomes. Read mapping and genome-wide domain annotations are combined as the ground truth for evaluating the read classication sensitivity and specicity of SALT. We compared SALT with HMMER and HHblits, another prole HMM-based protein homology search tool [59]. HMMER was applied to short reads directly or to contigs assembled by de novo assembly tools. SALT can correctly classify signicantly more reads into their native domains than HMMER and HHblits while keeping the same or better specicity when tested on two RNA-Seq data sets with dierent read lengths. Finally, we applied SALT to an RNA-Seq data set of a non-model organism and compared SALT with the alternative pipeline using de novo assembly tools and HMMER for domain classication. It is also worth noting that although we will use protein domains when describing the methods and results, SALT can be applied to both domain families and protein families. In this work, we used the families in Pfam [12], which include both domain families and protein families. 4.2 Related Work HMMER is the representative implementation of prole HMM-based domain classication tool. In conjunction with the Pfam database [12], which contains over 10,000 annotated protein domain families, HMMER can accurately classify query protein sequences into ex- 49 isting domain families. However, HMMER can fail to recognize short reads sequenced from remote homologs of a domain family. Our previous work [57] evaluated how read lengths aected the performance of HMMER on a large number of domains. Prole HMM-based tools have sensitivity of 0.9 in classifying reads of 80 bp into domains with average sequence identity above 40%. However, for poorly conserved domains, which are not rare, a signicant number of reads cannot be correctly classied. Even for a domain with good average conservation along each position, short reads that are sequenced from less well-conserved regions tend to be missed by existing methods. MetaDomain [57] partially solves this problem using position-specic score cutos. Using MetaDomain, signicantly more reads could be correctly classied into their native domains, however, the specicity was not satisfactory. HHblits uses HMM-HMM alignment to conduct fast iterative protein sequence search and can generate accurate alignment. However, it is not especially designed for NGS data and has unsatisfactory classication sensitivity for short and fragmentary reads, which we show in our experiments. In order to take advantage of conventional functional analysis pipelines, de novo sequence assembly tools can be applied to produce contigs, which are used as input to gene nding or domain analysis tools. The performance of short read assemblers depends on read length, complexity of the data, sequencing depth, etc. A recent review [58, 60] summarized several challenges of de novo assembly tools. Recently, a comparison of several popular de novo Radix assembly tools was conducted on the transcriptome of a non-model snail species ( balthica ) [61]. The assemblies generated by the tested tools diered in total number of contigs, contig length, and number and quality of gene hits, showing the need for better assembly methodology. Similarly, two dierent assembly tools were applied to RNA-Seq data from the non-model species sugar beet and output dierent sets of assemblies [1]. 50 Incomplete or error-containing assemblies directly aect the performance of the downstream analysis. In addition, this work focuses on domain analysis, which does not need contigs outside domain regions of genes. Thus, we deliver an alternative solution that does not rely on de novo sequence assembly tools. HMM-FRAME [48] aimed at classifying for reads containing frame-shift errors. The insertion or deletion errors in homopolymer regions of pyrosequencing reads can cause frameshifts, compounding domain classication. This work focused on short reads which are usually produced by Illumina platform, which tends to have substitution rather than insertion or deletion errors. SALT is able to deal with substitution errors during domain classication. 4.3 4.3.1 Methods Overview of SALT As a protein domain classication tool, SALT takes query reads and a list of protein domain families of interest as input. The output is a list of transcribed (or encoded) families and the classied reads for each family. Figure 4.1 shows a piece of genomic sequence with annotated domains and the sequenced reads. In this example, if a read is sequenced from Domain X in a gene, we call it a positive read w.r.t. Domain X. Otherwise, it is a negative read w.r.t Domain X. An ideal domain classication program should classify only the positive reads into the corresponding domain family. Positive reads have the following features: 1) many positive reads tend to share higher sequence similarity with the underlying family than negative reads, yielding higher alignment scores; 2) positive reads can be assembled into contigs that have statistically signicant alignment scores against their native families. Existing domain classication tools such as 51 HMMER mainly uses the rst feature. However, as short and fragmentary positive reads can yield as low scores as negative reads do, existing tools can miss a large number of positive reads when they are applied directly to short reads. De novo assembly tools can partially solve this problem but their performance depends a lot on sequencing depth, similarities of domain copies in dierent genes, etc. reads genomic sequence domain X domain X gene A gene B Figure 4.1: Two genes, their domain organizations, and the sequenced reads. Domain X occurs in two dierent genes. Both genes are transcribed and sequenced. Red lines: positive reads. Blue lines: negative reads. SALT takes advantage of both features by incorporating prole-based domain alignment methodology into a supervised contig generation algorithm. Figure 4.2 illustrates the SALT pipeline, which can be divided into three main stages: prole HMM-based ltration, contig generation, and contig selection. In the rst stage, we align query reads against prole HMMs, which represent the underlying domain families, using a modied Viterbi algorithm. A loose position-specic score cuto [57] is used in order to keep most positive reads. The rst stage trades specicity for sensitivity, as some negative reads pass the ltration stage. The rationale behind the second and third stages for removing negative reads is that the positive reads tend to be sequenced from continuous regions of a domain and their assembled contigs can yield statistically signicant alignment scores and E-values. The second stage builds a family-specic hit graph (see the Methods Section) using a supervised graph construction 52 algorithm. Contigs can be eciently generated from each hit graph. The third stage aligns these contigs to input domain families using HMMER. The computed E-values are used to choose sequences that are part of domain regions. Reads from the high-scoring contigs are classied into the corresponding input family. Query reads Profile HMM-based filtration Profile HMMs Hits Construction of hit graphs Hit graphs Contig generation Find the K longest paths Candidate contigs Contig selection Classified contigs Read classification Classified reads Figure 4.2: The pipeline of SALT. 53 4.3.2 Stage 1: prole HMM-based ltration The rst stage of SALT aligns reads to prole HMMs, which are built on homologous domains sequences from Pfam [12]. The software suite HMMER uses Viterbi and Forward algorithms to determine the membership of query sequences via E-value cutos. However, the sensitivity of HMMER drops signicantly for short sequences or poorly conserved protein domain families [57]. In order to address this problem, SALT uses position-specic score thresholds (PSST) and a modied Viterbi algorithm [57] to align reads to prole HMMs. For the standard Viterbi algorithm on a prole HMM, we refer users to the detailed introduction in [17]. The modied Viterbi algorithm used in SALT diers from the standard one in the following aspects: 1) SALT directly aligns a DNA sequence against a prole HMM by implicitly translating the DNA sequence into peptides under dierent reading frames; 2) a local alignment can start and end with any state without any transition penalty; 3) SALT uses PSSTs as alignment score cutos. 4.3.2.1 Position-specic score threshold PSST was described for MetaDomain [57]. As users of SALT can specify parameters to control the ltration performance of the rst stage, we summarize the main properties of PSST here. Let Mi,j to the Let L be the length of a query sequence in bp and M be the prole HMM. be a sub-model that includes all consecutive states from the j th match state Mj . any query sequence against Let Mi,j . Ui,j ith match state Mi be the upper bound of the alignment score between According to the scoring scheme of prole HMMs, always positive. If a query sequence is aligned to 54 Mi,j Ui,j is by the Viterbi algorithm, the PSST associated with this alignment is calculated by the following equation: PSST where the coecient γ = γUi,j , is a user-specied parameter in the range of [0,1]. The parameter controls the strength of the ltration. If γ γ is too large, the number of positive reads that pass this stage will be small and the sensitivity of SALT will be very low. If γ is too small, there will be an overwhelming number of negative reads passing ltration, aecting the accuracy of domain classication in the whole pipeline. specic needs. The default value of γ γ Users can adjust γ to accommodate their is 0.3, which was used in all our experiments. When is small, a read may be classied into multiple families in the rst stage. Therefore, in practice, we require that each read be assigned to at most three input families that generate the best alignment scores. 4.3.3 Stage 2: contig generation The ltration stage uses loose PSSTs to trade specicity for sensitivity. Some negative reads can pass the ltration stage. Given medium to high sequencing coverage, positive reads are usually sequenced from continuous regions in a domain. Accordingly, their alignments to the underlying prole HMM overlap. This stage assembles reads with overlapping alignments. Specically, we construct a family-specic hit graph, which is similar to a weighted overlap graph, using reads classied to the underlying family by the rst stage. A path searching algorithm is then applied to each hit graph to generate contigs for each domain family. 55 (A) Hit h1 h2 h3 h4 h5 h6 h7 h8 h9 h10 h11 h12 h13 Score 12.0 6.2 11.4 10.5 7.1 14.4 13.5 14.6 18.6 6.8 h h3 h5 ACGTCGTCGTC h4 ATCGTCGTAG h8 h h1 h6 h7 h CACGCCGTCA 5.6 9.2 h13 AAACGTACAG AAAACGTACA 12 AACGTAAAAC 11 10 CGTAACGTAA ACACGACGTC GTACACGCGA 9 TATGTACACG CGTCACGTCG 2 GTATGTACACG ACGTATGTAC 5.4 h h profile HMM graph construction (B) overlap = 4 weight = 6.84 v1 v3 v2 v4 v6 v5 overlap = 4 overlap = 8 weight = 1.24 weight = 8.1 overlap = 5 weight = 7.3 v8 v7 overlap = 5 weight = 9.3 overlap = 5 weight = 5.7 v10 4 v9 overlap==11.16 v12 weight add root remove transitive overlaps (C) v11 overlap = 7 weight = 1.62 overlap = 5 weight = 2.8 overlap = 9 weight = 0.92 v13 root v10 v1 v3 v2 v6 v4 v5 v11 v8 v7 v12 v9 v13 Figure 4.3: (A) Thirteen reads and their alignment layout w.r.t. the prole HMM represented by its matching states. The alignment scores are shown in the table. Blue reads: negative k ∗ = 4. For simplicity (i.e. e = 0). Red nodes reads. Red reads: positive reads. (B) The constructed hit graph when of explanation, mismatches are not allowed in this simple example are created by positive reads. Blue nodes are created by negative reads. (C) The hit graph after removing transitive overlaps and adding the root node. 56 4.3.3.1 Constructing a hit graph for a family Overlap graphs are commonly used in is dened as de novo sequence assembly. A standard overlap graph G = (V, E), in which each read is a node, and overlaps larger than a given cuto are indicated by directed edges. A hit graph is also used to assembly contigs. However, its construction and the graph traversal algorithms are dierent from a standard overlap graph. We describe the hit graph construction procedure below using Fig. 4.3 as an example. The dierences between a hit graph and a standard overlap graph will be highlighted as well. In the rst step, a node is created for each read passing the ltration stage for a domain family. In Fig. 4.3, a read hi creates a node vi in the graph. Thus, dierent from a standard overlap graph, only reads that are likely to be classied into a family are used to build the hit graph. In addition, a hit graph is family specic. N hit graphs will be built for N input families. In the second step, edges are created using alignment positions of reads. reads hi and hj that are classied into a domain family by the ltration stage, their overlap is computed if and only if the alignments of and hj from hi is at least to hj . k∗, h3 and hj overlap. If the overlap between hi For example, in Figure 4.3.(A), which shows the alignment layout for 13 h1 and only tests whether a sux of and hi which is a user-specied overlap threshold, a directed edge is created reads, the alignment of h1 , h2 , For any two h2 h1 overlap and is a prex of h1 h2 . has a smaller starting position. SALT In addition, as the alignments of reads share no overlap with alignments of reads from h6 to h13 , test the sequence overlaps between any two reads from the two sets. SALT does not If two reads have the same starting position in their alignments, the bi-directional overlaps will be computed. Figure 4.3.(B) shows the added edges. Dierent from standard overlap graph construction, 57 this step takes advantage of the alignment positions and presents an ecient family-specic graph construction procedure. The graph construction stage allows substitution sequencing error, which is the major error type in Illumina sequencing technology. During the sequence overlap computation, at most e errors are permitted; e=2 adjusted to t the error rate of the given data. Given is the size of the longest sux of Users can specify of kmer size in a k∗, e in the current implementation. The parameter hi e, the overlap that has hamming distance ≤e oi,j from read to a prex of can be hi to hj hj . which can be estimated based on the same rationale as the choice de novo sequence assembly tool. The default value of k∗ is set to half of the average read length for relatively short reads. When the reads are long (∼100 bp), using the default value is too stringent for the hit graph construction. In addition, the ltration stage is more specic with longer reads. Thus we recommend to use a smaller value than the default one. It is worth mentioning that using alignment information to supervise the graph construction was also used for RNA virus population estimation [62]. But the application, the input data, and the alignment stage are all dierent from SALT. The graph G built so far contains many near-identical paths because of edges created by transitive overlaps, which are implied by other longer overlaps [63]. overlap between reads h1 and h3 overlaps exist between h1 and h2 , h2 For example, the in Figure 4.3.(A) is a transitive overlap because longer and h3 . Transitive overlaps add unnecessary edges without contributing to read classication because those reads can be identied using an alternative (usually longer) path. Thus, the third step of graph construction removes all edges corresponding to transitive overlaps. In Figure 4.3.(B), the edge resulting G is in Fig. 4.3.(C)). 58 (v1 , v3 ) will be removed (the In the fourth step, we add edge weights, which are dened based on both the overlaps and the alignment scores of reads. Contigs connecting positive reads can be aligned to the underlying domain family with much better scores than the short reads, rendering signicantly better discriminative power. Thus, we assign the edge weights so that the path weight is proportional to the alignment score of the corresponding contig. Negative reads that are sequenced from a continuous region may happen to pass the ltration stage and thus can form a path in G as well. However, their summed alignment score (i.e. the path weight) will be much smaller. Thus, by outputting the long paths, we have better chance to pick positive reads. The weight of an edge (vi , vj ) between reads alignment score contributed by the ending read where hj .score between hi is alignment score of the read and hj dened in this section. hi and hj is proportional to the increased hj : (vi , vj ).weight = hj . |hj | is the read length. hj .score∗(|hj |−oi,j ) , |hj | oi,j is the overlap The edge weights computed using the above equation are shown in Figure 4.3.(B). In order to incorporate alignment scores for nodes that have zero in-degree (i.e. no incoming edges, such as add a root and v6 in Figure 4.3.(B)), we G such that there is an edge from the root to every node with zero overlap of these edges are set to 0. Naturally, the edge weights are equal to node to in-degree. The v1 the alignment scores of these nodes (reads). For examples, refer to the edges (root, v6 ) (root, v1 ) and in Figure 4.3.(C). The above steps are applied to each family that has at least one classied read after the ltration stage. The pseudocode for the construction of 59 G can be found in the tool's website. 4.3.3.2 Find the K longest paths Every path in a hit graph denes a contig. The weight of a path is the sum of weights of all the edges in the path. Contigs assembled from positive reads generally have larger scores and thus larger path weights. Therefore, in order to classify positive reads, we need to nd the K longest paths (KLPs) in G, where K is a parameter that controls the number of contigs that will be generated. To use a small set of paths to cover a majority of positive reads, we should avoid outputting a path that is a subpath of any other generated path. Considering that the weights of all edges are positive, the KLPs will always begin with the root and end with a node without out-going edges (i.e. sinks). Therefore, the KLPs output by SALT start with the root and end with a sink node. Using the sorted alignment positions to construct graph (DAG). In order to nd the KLPs in G, G ensures that G we apply a dynamic programming algorithm extended from the algorithm that nds the longest path in a DAG. Let be the list of all nodes after the topological sorting of Let P [0..N ] end with be an array of vj . P [0] N +1 is a directly acyclic lists such that P [j] G. Let S V = {v1 , v2 , ..., vN } be a list of all sinks in keeps at most K G. longest paths that is reserved for the root and is initialized to 0. Then we can ll P [0..N ] using the following recurrence relation: P [j] = KLargest (vi ,vj )∈E,1≤k≤K where KLargest returns the K (P [i][k] + (vi , vj ).weight), largest values in a list. This equation shows that the KLPs ending with a node are chosen from all the KLPs ending with its predecessors. The KLPs of G are the KLPs among all the paths that end with a sink node. 60 There are several algorithms available to nd the K shortest simple (loopless) paths (KSPs) between two nodes in a graph [64, 65, 66]. These algorithms can also be applied to the KLP problem by simply negating the weights of all edges. A hit graph is usually sparse and therefore |E| is close to |V| (data will be shown in the Results Section). generally smaller than |V|. Moreover, K is Therefore, Our DP algorithm is more ecient to solve our KLP problem than other general KSP algorithms. The RNA virus population estimation in [62] solved the contig generation problem based on Dilworth's theorem. Simply speaking, the authors tried to identify the least number of paths that can cover all the nodes. In their application, all the nodes are positive" while our hit graph contains both positive and negative reads. Thus, their algorithm is not applicable to this problem. The choice of K aims at including contigs assembled from positive reads. Large down the algorithm; small K K slows can make SALT miss positive reads. As the size of the graph is not large because of the ltration step, we trade eciency for sensitivity in the current parameter choice. By default, hit graph is large, a smaller 4.3.4 K K = |S|, where |S| is the number of sinks in G. When the is recommended for eciency purpose. Stage 3: E-value computation and contig selection While most of the top K longest paths contain only positive reads, some of them may be assembled from negative reads. In principle, some negative reads that are sequenced from continuous regions outside of domains can pass the ltration stage because of the loose PSSTs. These reads can form paths in negative reads will be included in KLPs. G. When K is large, the paths containing This happens more often for domains with low sequence conservations. Thus we need a reliable and well-tested scoring system and cutos 61 to distinguish between positive and negative contigs, which consist of positive and negative reads, respectively. Towards this end, we chose HMMER because it can provide ecient and reliable E-value computation. We rst translate candidate contigs into peptides using multiframe translation. Then these peptides are aligned against the input family using HMMER. As the contigs are much longer than the reads, HMMER is able to select positive contigs from the input using E-value cutos. Users can specify the E-value cuto. The default Evalue cuto is 10−6 for this stage to ensure high specicity. All reads of the selected contigs will be classied into the input family. This is the nal classication result for each family. 4.3.5 Running time analysis Let the number of input reads be of the rst stage is a family. M there are N1 O(N |h|M ) N and the average read length be for one family, where M |h|. The time complexity is the number of match states in is mainly determined by the size of a domain family and reads passing the ltration stage. the time complexity of graph construction is Usually, O(τ |h|2 N1 ), N1 where N. τ M N. Suppose In the second stage, is the average number of overlapping alignments. During graph construction, we use alignment positions to guide the overlap computation, avoiding all-against-all comparison needed in a standard overlap graph construction. The time complexity of searching for the KLPs is |V| is the number of nodes and graph construction procedure, |E| O(KlogK|E| + |V|), where is the number of edges in the graph. According to the |E| ≈ |V| = N1 . Suppose there are N2 contigs produced from the hit graph. The time complexity of the last stage is determined by HMMER. Because of various optimization techniques and heuristics, the latest version of HMMER is as fast as BLAST [18]. In addition, under the default setup, in the hit graph and thus N2 < N1 . N2 is at most the number of sink nodes As a result, the last stage only adds a small overhead to 62 the overall time complexity. The rst stage is the bottleneck because N is very large for a typical NGS data set. The overall time complexity of SALT is determined by the rst stage. 4.4 Results In order to show the utility of SALT, we applied it to three RNA-Seq data sets and compared its performance with HMMER, HHblits, and a pipeline that uses de novo sequence assembly tools and HMMER. For brevity of description, we use assembly+HMMER to refer to the complete pipeline, where assembly" can be replaced by the name of a specic sequence assembly tool. This pipeline was used to classify query reads into input families based on the following steps: 1) use de novo sequence assembly tools to generate contigs from query reads, 2) use HMMER to align translated contigs against input families, 3) classify reads in the aligned regions of the contigs into the corresponding families. Note that the inputs to HMMER and HHblits are always the translated peptides from the reads or contigs. For simplicity, when we say we classify reads into protein domain families using HMMER or HHblits, the translated peptides are actually used. For the rst two experiments, annotated reference genomes were available. We rst determined the true membership of query reads based on protein domain annotations and read mapping positions in the genome. We then classied query reads into input families using four dierent types of classiers: HMMER, HHblits, assembly+HMMER, and SALT. Performance of the classiers was determined by comparing the true membership of reads and predicted membership by the classiers. We used four metrics to evaluate the performance of classiers: sensitivity, false positive rate (FP rate), positive predictive value (PPV), and F-score. Let D represent an input family. Let 63 A and B be the set of positive and negative reads of D, respectively. Let C be the set of reads classied to sensitivity of a classier is dened by dened by |A∩C| . |C| |A∩C| . |A| D by a classier. The FP rate is dened by |C−A| . |B| The The PPV is F-score considers both sensitivity and PPV and can be used as a single metric to evaluate the performance of a classier. It is dened by: F-score = 2 × sensitivity × PPV . sensitivity + PPV In order to compare the performance of the classiers on all input families, we rst calculate a metric for each input family and report the average of the values of the metric over all input families. In the last experiment, we conducted protein domain classication on an RNA-Seq data set sequenced from a non-model organism. Although we did not have the annotations of the reference genome we were able to show that SALT identied more transcribed protein domain families that were related to this species. 4.4.1 Protein domain classication of very short reads In this experiment, we conducted protein domain classication on an Illumina RNA-Seq data set sequenced from the transcriptome of one strain of AU1054 in the growth medium cystic brosis [52]. Burkholderia cenocepacia named We downloaded a total number of 3,361,008 reads of 41 bp from the website provided by the authors. For the pipeline of assembly+HMMER, we chose Velvet [67] and SSAKE [68] because Velvet is widely used and SSAKE is designed for very short reads. The experimental results show that SALT has much better performance than HMMER and assembly+HMMER in classifying very short reads. 64 4.4.1.1 Determining the true membership of reads First, we downloaded the genome and its protein domain annotations from the IMG website [53]. There are totally 2,181 annotated protein domains. Second, we mapped the reads back to the genome using Bowtie [69] with two mismatches allowed. Third, we compared the mapping positions of the reads and protein domains in the genome. Let of the overlap between a read and a protein domain and l ≥ 0.8L, L l be the length be the length of the read. If this read has a signicant overlap with the protein domain and is thus dened to be a positive read of the protein domain family. If l < 0.5L, the overlap is insignicant and this read is a negative read of the protein domain family. Otherwise, the signicance of the overlap is ambiguous and this type of reads will not be evaluated in our experiments. 4.4.1.2 Performance evaluation Out of the 2,181 annotated protein domain families, we are interested in those that are transcribed in the input data set. There is no commonly accepted criterion to dene transcribed protein domains. In this work, we dene protein domains that have at least 10 mapped reads as transcribed protein domains. There are 378 protein domain families with at least 10 mapped reads, which are used as the input families. When we adjusted this threshold in the experiment, the performance comparison between dierent classiers was generally unchanged (data can be found in SALT's website). We tried several dierent k -mer sizes for SSAKE and found that when the k -mer size was 16 SSAKE had the best overall performance. It generated 25,944 contigs with an average length of 60.36 bp. For Velvet, we used VelvetOptimizer [103] to search for the best assembly result by trying k -mer sizes from 15 bp to 39 bp. We tried both Lcon and tbp as the optimization function of VelvetOptimiser. The former optimizes the total number of base 65 pairs in large contigs and the latter optimizes the total number of base pairs in all contigs. We found that Lcon produced much higher sensitivity with similar PPV compared to tbp and the optimal k -mer size for Lcon was 25. Therefore, we report the performance of Velvet+HMMER using Lcon as the optimization function. For SALT, we set γ to the default value (0.3) for the ltration stage. On average, 0.23% of the input reads passed the ltration for one input family, showing that only a small portion of reads could pass the ltration stage and be used as the input to the graph construction. The overlap threshold k∗ was set to 20 by default (k ∗ length = read 2 ). The average numbers of nodes and edges in a hit graph were 1,575 and 1,879, respectively. E-value cutos determine the trade-o between sensitivity and FP rate of all classiers. We plotted the ROC curves of these classiers by changing HMMER's E-value cuto from 10 to 10−9 with ratio 0.1, as shown in Figure 4.4. Figure 4.4 shows that HMMER is highly specic with FP rate ≤ 1.2 × 10−7 . However, its sensitivity is only 9.81% even with very relaxed E-value cutos. When the E-value cuto changes from smaller than 10−5 10−6 all the classiers. When the FP rate to 10−6 , its sensitivity drops signicantly. When the E-value cuto is its sensitivity is almost 0. The performance of SSAKE is the worst among Its highest sensitivity is 8.25% and its lowest FP rate is ≤ 7.00 × 10−8 , 1.65 × 10−6 . Velvet has higher sensitivity than HMMER. When its FP rate is higher than that, it has lower sensitivity than HMMER. This shows that although Velvet generated longer contigs than SSAKE, both of them missed most positive reads. The sensitivity of SALT is much higher than that of HMMER and Velvet with comparable FP rate. The experimental results also show that when we increase the E-value cuto from 0.01 to 10, the performance of these classiers almost remains unchanged. In order to boost the sensitivity of HMMER and assembly+HMMER we will use 0.01 as their default E-value 66 0.25 0.08 HMMER Velvet+HMMER SALT 0.06 0.2 0.04 HHblits 0.02 0 Sensitivity 0.15 0 0.5 1 −4 x 10 0.09 0.08 0.1 0.07 0.06 0.05 SSAKE+HMMER 0.05 0.04 0.03 1.5 2 2.5 3 −6 x 10 0 0 1 2 3 4 FP rate Figure 4.4: ROC curves of dierent classiers. 5 6 7 −7 x 10 HHblits and SSAKE+HMMER are listed in separate embedded windows because their FP rates are orders of magnitude larger than others. cuto in our experiments hereafter unless otherwise specied. the default E-value cuto for SALT is set to 10−6 . Based on this experiment, Although this is a stringent cuto, as it yields good trade-o between sensitivity and PPV for very short reads, it is expected to be appropriate for data sets containing longer reads. Table 4.1 shows the comparison of the 67 average performance of these classiers under their default E-value cutos. Table 4.1: Performance comparison of SALT against the other classiers on the RNA-Seq data set of Classiers Sens HMMER 9.81% HHblits 6.40% SSAKE Burkholderia cenocepacia. 1.20e-07 PPV F-score 89.37% 0.1767 0.03 4.51e-05 27.86% 0.1041 0.65 8.06% 7.23e-06 50.53% 1474 2.16e-07 94.03% 0.1415 9.50% Velvet 0.1725 85.44% 0.3385 2.50 3.00e-07 21.10% SALT FP rate Time (m) 8.68 All these classiers were run on a 2.2GHz dual-core AMD Opteron machine. The running time is the average running time on all input families. SSAKE and Velvet represent the pipelines of SSAKE+HMMER and Velvet+HMMER, respectively. HMMER was highly specic with FP rate ≤ 1.2×10−7 . However, its sensitivity was only 9.81% even with very relaxed E-value cutos. When the E-value cuto was changed from 10−5 to 10−6 , its sensitivity dropped signicantly. 10−6 its sensitivity was almost 0. Both HHblits and SSAKE+HMMER had low sensitivity and high FP rate. was 6.75 × 10−7 . rate was When the E-value cuto was smaller than The highest sensitivity of HHblits was 6.72% and its lowest FP rate The highest sensitivity of SSAKE+HMMER was 8.25% and its lowest FP 1.65 × 10−6 . Although Velvet generated longer contigs than SSAKE, both of them missed most positive reads. The sensitivity of SALT was much higher than that of HMMER and Velvet+HMMER with a comparable FP rate. The experimental results also show that when we increased the E-value cuto from 0.01 to 10, the performance of these classiers remained almost unchanged. In order to boost the sensitivity of HMMER, HHblits, and assembly+HMMER we used 0.01 as their default E-value cuto in our experiments hereafter unless otherwise specied. Based on this experiment, the default E-value cuto for SALT was set to 10−6 . Although this is a stringent cuto, as it yields good trade-o between sensitivity and PPV for very short reads, it is expected to be appropriate for data sets containing longer 68 reads. Table 4.1 compares the average performance of these classiers under their default E-value cutos. HMMER had the lowest FP rate and Velvet had the highest PPV. However, SALT had much higher sensitivity with good FP rate and PPV. Therefore its F-score was signicantly larger than the other classiers, showing that it had the best overall classication performance. The running time of HMMER was much smaller than that of the other classiers. The ltration stage of SALT is the bottleneck and costs most of the running time. The average running time of the three stages of SALT was 6.20 min, 2.47 min, and 0.01 min, respectively. Since dierent input families can be analyzed independently, parallelization techniques can be applied to speed up this stage. This will be part of our future work. As we described in the Methods Section, the ltration stage trades specicity for sensitivity. It determines the upper bound of the sensitivity of SALT using loose score cutos. The remaining stages are used to reduce the high FP rate introduced by the ltration stage. Our experimental results are consistent with the expectation. After ltration, the sensitivity was 39.32% and the PPV was 13.22%. The nal sensitivity and PPV were 21.10% and 85.44%, respectively. 4.4.2 Protein domain classication of an RNA-Seq data of Ara- bidopsis In this experiment, we applied SALT to an RNA-Seq data set sequenced from a normalized cDNA library of Arabidopsis using paired-end Illumina sequencing [70]. There were a total of 9,559,784 reads of 76 bp from each end. The length of query reads was much longer than in the rst experiment. We used Velvet and Oases [71] as the 69 de novo sequence assembly tools in the assembly+HMMER pipeline. Oases takes into account a dynamic range of expression levels and is specially designed for the assembly of RNA-Seq data. The experimental results show that although HMMER, HHblits, and assembly+HMMER have higher sensitivity with long query reads, SALT still had the best performance. We used the starting reads as our benchmark data set. We downloaded all the coding sequences (CDS) of Arabidopsis thaliana reference genome version TAIR10 [72] and mapped the reads to the CDS using Bowtie with 2 mismatches allowed. Totally, 30.26% of the reads were mapped to the CDS. We annotated the protein domains in the CDS using HMMER with gathering thresholds (GAs). The set of reads mapped to each protein domain was determined using the same methodology as in the rst experiment. There are 3,188 protein domain families with at least 10 mapped reads. Velvet achieved the best assembly result when the k -mer size was 45. contigs with an average length of 443.99 bp. The range of k -mer It generated 44,161 sizes used for Oases was from 31 to 61 and it generated 57,824 contigs with an average length of 356.46 bp. parameters of SALT were set to their default values. Specically, k ∗ = 38 All because the read length is 76. Table 4.2 shows the comparison of the average performance of these classiers on all input families. SALT had the best performance on all metrics (Table 4.2). Oases+HMMER had better sensivitiy and similar PPV compared to Velvet+HMMER. However, both its sensitivity and PPV were much lower than for SALT. For all input families, the total number of reads correctly classied by SALT, HMMER, HHblits, Velvet+HMMER, and Oases+HMMER were 968,068, 681,306, 741,160, 681,831, and 776,805, respectively. Signicantly more reads (> 1.91×105 ) were correctly identied by SALT. Meanwhile, the sensitivities of all classiers were signicantly improved compared to those in the rst experiment. 70 This shows that Table 4.2: Performance comparison of SALT against the other classiers on the RNA-Seq data set of Arabidopsis. Classiers Sens FP rate PPV F-score HMMER 49.72% 2.45e-06 86.26% 0.6308 0.31 HHblits 50.09% 2.32e-04 14.91% 0.2298 11.98 Velvet 35.93% 9.19e-06 64.45% 0.4614 57.65 42.30% 1.02e-05 64.59% 0.5112 400.21 Oases SALT 58.46% 2.84e-06 87.89% 0.7021 Time (m) 171.65 All these classiers were run on a 2.2GHz dual-core AMD Opteron machine. The running time is the average running time on all input families. Velvet and Oases represent the pipelines of Vel- vet+HMMER and Oases+HMMER, respectively. prole HMM-based methods have better discriminative power with long sequences. Note that although Velvet+HMMER and Oases+HMMER correctly classied more reads than HMMER alone, their average sensitivities over all families were much lower than that of HMMER. A closer look at the results showed that Velvet/Oases+HMMER tended to perform well on highly transcribed families but missed a majority of reads for lowly transcribed families. On the other hand, HMMER and SALT are not as sensitive as Velvet and Oases to the transcriptional levels of domains. Thus, the numbers of families that are identied as transcribed by Velvet+HMMER and Oases+HMMER are smaller than for the other classiers. In order to show the performance of dierent classiers on input families of dierent transcription levels, we investigated the performance of HMMER, Velvet+HMMER, Oases+HMMER, and SALT on transcribed families when we changed the threshold for the number of mapped reads from 10 to 100 with a step size of 10. For input families that have at least 100 mapped reads, the RPKM (Reads per kilo base per million) was 193.81. The performance of HMMER remained almost unchanged when we changed the threshold. This is because HMMER aligns each read independently. 71 The sensitivity of Velvet+HMMER increased from 35.93% to 38.18% and its PPV increased from 64.45% to 72.58%. The sensitivity of Oases+HMMER increased from 42.30% to 44.32% and its PPV increased from 64.59% to 75.52%. The sensitivity of SALT increased from 58.46% to 59.39% and its PPV increased from 87.89% to 91.05%. These results show that while Velvet/Oases+HMMER require sucient coverage to assemble short reads, SALT can assemble short reads of variant coverage and achieves better classication performance. 4.4.3 Protein domain classication of a non-model organism In this experiment, we show the utility of SALT in classifying a paired-end RNA-Seq data set from the non-model organism Radix balthica [61] into protein domain families. When the reference genome is not available, read mapping and genome annotation cannot be used to provide ground truth" about read membership. Thus, we focus on comparing the transcribed domains as well as the total number of reads classied into these families. In this experiment, the pipelines of assembler+HMMER will be represented by the assemblers' names for simplicity. We rst downloaded the data set from NCBI Short Read Archive (SRP005151). There are 8,283,222 reads of 101 bp in each end. several de novo We also downloaded the contigs generated by sequence assembly tools from [61]. These tools include Velvet, SeqMan NGen [73] (hereafter called NGen), and Oases [71]. The authors used 31 as the k -mer sizes for Velvet and NGen. For Oases, they used both 21 (Oases21) and 31 (Oases31) as the k -mer size. We searched the keywords animal and snail in the Pfam website and obtained a list of 84 protein domain families that had the keywords in the description of the families. These protein domain families were used as the input families. 72 If more than 10 reads can be Velvet 3 1 5 Oases31 3 1 5 2 0 2 1 1 NGen 1 SALT 7 Figure 4.5: A Venn diagram of the transcribed families identied by dierent classiers. classied to a family, we report that this family is transcribed. We aligned the contigs generated by the assembly tools against the input families using HMMER with 0.01 as the E-value cuto. The parameters we used for SALT were: −6 (default) and E-value=10 k ∗ = 41. (default). previous two experiments, we used a smaller k∗ k∗ γ = 0.3 As the reads are much longer than the than the default value (50). Our choice of is already larger than kmer sizes used by the tested assembly tools and is not likely to introduce random overlaps. Table 4.3 shows the classication results of these classiers. Table 4.3: Classication results generated by dierent classiers on the classiers Radix balthica data set. # of transcribed families # of classied reads Velvet 17 996 NGen 13 8338 Oases21 19 1336 Oases31 18 1687 SALT 23 4457 # of classied reads: the number of reads that are classied into transcribed families. 73 In Table 4.3, SALT identied more transcribed families than the other classiers even with a more stringent E-value cuto. Moreover, it classied more reads into transcribed families than the other classiers except NGen. Although NGen classied the largest number of reads, it only identied 13 transcribed families, which are thus likely to be highly transcribed families. Being consistent with our second experiment, Velvet+HMMER identied a smaller number of transcribed domains than SALT. The relationship of the sets of transcribed domains identied by dierent strategies is shown in Figure 4.5. Table 4.4: Description of transcribed families uniquely identied by SALT. Accession ID Description PF00654 Voltage_CLC Voltage gated chloride channel PF00907 T-box transcription factors involved in limb and heart development PF03388 Lectin_leg-like PF04275 P-mevalo_kinase Legume-like lectin family Phosphomevalonate kinase PF07684 NODP PF07829 Toxin_14 NOTCH protein Alpha-A conotoxin PIVA-like protein PF12349 Sterol-sensing Sterol-sensing domain of SREBP cleavage- activation These classiers identied a total number of 34 transcribed families, 5 of which were shared among them. When we do pairwise comparisons, we nd that the transcribed family sets identied by Velvet and Oases31 have 12 members in common, showing a higher similarity than any other pair of classiers. NGen has a similar number of shared families with Velvet and Oases31. SALT identied 7 transcribed families that were not identied by any other tool. In order to verify these uniquely identied families, we investigated whether 74 these domains exist in the contigs generated by the sequence assembly tools. The output of running HMMER between these families and the contigs includes statistically signicant alignments. However, the number of reads consisting of the aligned regions is smaller than 10 and thus these domains are not reported as transcribed by assembly+HMMER. This indicates that these domains are very likely to be transcribed but the de novo sequence assembly tools fail to assemble many positive reads in the data set. A complete list of all the transcribed families identied by dierent classiers are available as supplementary data in the tool's website. SALT is able to identify more functional protein domains that reect the species's reactions to the environment. For example, PF07829 is an Alpha-A conotoxin PIVA-like protein family. It is the major paralytic toxin found in the venom produced by the piscivorous snail Conus purpurascens [12]. Table 4.4 shows a brief description of the families uniquely identied by SALT. The detailed annotations of these families are listed in Pfam's website (http://pfam.sanger.ac.uk/). 4.5 Discussion Most existing domain analysis in RNA-Seq data still relies on traditional domain classication tools. In this work, we analyzed why a commonly used tool HMMER can incur low sensitivity for analyzing NGS data. We conducted a correlation study between the sensitivity of HMMER and other features based on the data set in the second experiment. These features include: 1) the alignment score between the domain region and the input family (F1 ); 2) the length of the domain region in the transcript (F2 ); 3) the normalized alignment score of the domain region in the transcript (F3 ), which is dened as: 75 F F3 = F1 . 2 The correlation coecients between the sensitivity and these features are -0.0214, -0.0788, and 0.8415, respectively. The normalized alignment score of the domain region in the transcript has a strong positive linear relationship with the sensitivity of HMMER. This implies that reads sequenced from close homologs of a domain family can be better classied using prole HMM-based methods. On the other hand, reads sequenced from remote homologs of a domain family are harder to classify. We further found that the correlation coecient between the sensitivity and average pairwise sequence identity in a domain is 0.5728. This indicates that the performance of prole HMM-based methods are related to the sequence conservation level of a domain. The choice of parameters is important for SALT to provide good overall performance. Here we give some analysis of the parameter The default value of small, this K K K, the number of generated candidate contigs. is the number of sinks in the hit graph. When the graph size is value works well. However, when the graph size is large, such as the hit graphs constructed from metagenomic data sets, the eciency of SALT will decrease signicantly. Theoretically, the number of sequenced contigs of a domain is the product of the average number of contigs in a domain and the number of copies of the domain in the genome. The former one can be estimated by the sequencing depth and the latter one can be inferred from existing domain organizations in Pfam. This approximation provides a better trade-o between classication performance and eciency. However, if the users do not have a good estimation of these values the default K value is recommended. 76 4.6 Conclusion In this work, we introduced a sensitive and accurate protein domain classication tool (SALT) designed for transcriptomic data of organisms without high-quality reference genomes. Experiments on two RNA-Seq data sets from annotated genomes showed that SALT had higher sensitivity than existing tools with comparable specicity. When we applied SALT to an RNA-Seq data set of a non-model organism SALT identied more transcribed species-related families than alternative pipelines. 77 Chapter 5 A Scalable and Accurate Targeted gene Assembly tool (SAT-Assembler) for NGS data 5.1 Background Advances of next-generation sequencing (NGS) technologies enable sequencing of transcriptoms of a large number of non-model organisms (RNA-Seq) and species from various environmental samples (metagenomic data). analyzing these NGS data sets. Functional annotation is an important step in For RNA-Seq of non-model organisms and metagenomic data, which lack quality reference genomes, a commonly used functional annotation pipeline conducts de novo assembly rst and applies functional annotation analysis to the assembled contigs. This pipeline is widely adopted in functional analysis of RNA-Seq data [1, 2, 3, 4] and gene-centric metagenomic analysis [74, 53, 75, 76, 6]. One of the most informative and practical steps in functional analysis is homology search, which assigns functions to contigs using the signicant matches against non-redundant protein sequence databases or secondary databases including protein families, domains and functional sites. The performance of downstream functional analysis largely depends on the quality of the 78 de novo assembly, which is still a challenging problem for RNA-Seq and metagenomic data. The problem of assembling NGS reads has received considerable eorts [58, 67]. Dierent programs have been developed for transcriptomic assembly [71, 77, 78] and metagenomic assembly [79, 80, 81, 82, 83, 84, 85]. RNA-Seq de novo assembly aims to identify transcribed genes, separate iso-forms, and quantify gene expression while metagenomic de novo assembly intends to recover individual genomes in an environmental sample. Often there are specic sets of genes in pathways that are of special interest, for example carbon and nitrogen cycling pathways in the response of permafrost to global warming [86] in soil metagenomic data, while much of bulk de novo assemblies consist of less interesting housekeeping genes" and genes with unknown function that give little insight to the specic question at hand. Thus, gene-centric analysis using homology search is a direct and eective strategy to study complicated transcriptomic and metagenomic data. De novo gene assembly for RNA-Seq and metagenomic data shares similar challenges in algorithm design. First, genes that have divergent expression levels in RNA-Seq data or highly dierent abundance in metagenomic data lead to a wide range of sequence coverage. Even worse, the sequence coverage along the same gene can be highly dierent due to bias in sequencing protocols. Using one set of assembly parameters for all genes or even for dierent regions of the same gene leads to unsatisfactory performance, resulting in either fragmented contigs or chimeric contigs. Although methods using multiple k-mer sizes have been developed to identify transcripts in a broader coverage range, the choice of size is still ad hoc. k-mer Second, expression of gene iso-forms due to alternative splicing and expression of genes with shared short identical sequence regions in RNA-Seq data as well as the high similarity of orthologous genes in metagenomic data grossly compound de novo assembly. Chimeric contigs tend to be produced between highly similar transcripts or gene 79 homologs. Third, existing assembly programs suer from either long computational time or high memory usage or both. New developments in sequencing technology such as today's Illumina HiSeq can produce up to 150 bp run in single-end mode. × 180 million reads for a total of 27 Gbp in a single Using all seven lanes can output up to 189 Gbp in one run and close to double that in paired-end mode. Two popular RNA-Seq assembly programs, Trinity and Oases, incur the longest running time and the maximum memory usage in comparison with other assembly tools [87], respectively. Using small k-mer size in Osases may require resources exceeding the memory space of a typical computing server on existing data [87]. Thus, scalable methods and tools are in urgent need to analyze new sequencing data. To improve the performance of gene assembly in RNA-Seq and metagenomic data that lack quality reference genomes, we propose a dierent scheme, which conducts homology search rst and then family-specic gene assembly. The proposed assembly method takes genes of interest as input, which can include specic sets of genes in pathways that are of special interest or well-characterized protein or domain families in existing databases. Note that although we conduct homology search for reads, assembly is still indispensable for functional analysis because the preferred products for a majority of users are relatively complete genes. The main novelties of our method and how those are used to address the above challenges are summarized below. First, by conducting homology search on reads directly, all reads will be rst classied into dierent gene families. for reads in each family. to bulk assembly. De novo assembly will only be conducted The input size for assembly is signicantly decreased compared Second, the alignments output by homology search are used to guide family-specic gene assembly. We propose a novel overlap graph construction algorithm that can achieve maximum connectivity and accuracy for genes of both low and high sequence 80 coverage. Third, we use a modied depth-rst-search (DFS) that carefully incorporates multiple information including paired-end reads, transitive edges, and coverage information to guide graph traversal. The novel graph construction algorithm together with the carefully designed traversal algorithm can maximumly avoid generation of chimeric contigs. 5.2 Related Work As gene families of interest are used as input, our algorithm employs homology search against gene families in assembly graph construction. Using genomes or proteome of related species to boost and optimize genome assembly has been proposed or implemented in a group of assembly programs [88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 85]. The contigs belonging to a single gene or a block of genome in the related species are ordered, orientated, and assembled. Most of these programs are designed to improve genome assembly. A few of these comparative or gene-boosted assembly programs are specically designed for RNA-Seq or metagenomic assembly. For example, Surget-Groba et al. [92] carefully considered the highly heterogeneous sequence coverage of transcripts and employed multi-k and proteome of a related species to optimize transcriptomic assembly. Dutilh et al. [91] used one closely related reference genome to increase assembly performance of microbial genomes in metagenomic data. Ye's group used homologous genes to stitch gene fragments for gene assembly in metagenomic data [85]. Our work is dierent from existing comparative assembly approaches in the following aspects. First, our tool does not require any related species as input. Most of the existing comparative approaches are limited by the availability of closely related reference genomes. Low similarity between the to-be-assembled genes or genomes and the related genes or genomes 81 can lead to wrongly assembled contigs. Our tool uses well-characterized genes of particular interest or ubiquitously represented sequence families such as those from family databases of proteins, domains, or functional sites as input to guide assembly. Second, as we use sequence families rather than a single sequence as reference, prole-based alignment methods rather than pairwise sequence alignment or exact sequence mapping are applied to conduct homology search. Prole-based methods tend to be more sensitive for remote homology search. Third, to our best knowledge, SAT-Assembler is the rst tool that uses consistency between sequence overlap and alignment overlap for edge creation in an overlap graph. 5.3 5.3.1 Methods Overview of SAT-assembler Our tool can be divided into two main stages. First, we align reads against prole hidden Markov models (HMMs), which eectively represent the underlying gene families. This stage classies the whole input data set into subsets of reads sequenced from dierent gene families. Second, SAT-Assembler constructs a family-specic overlap graph and assembles reads from the same family into contigs using a graph traversal algorithm. The graph construction is supervised by the alignment information from the rst stage and aims to obtain maximum connectivity between reads while avoiding false connections. In particular, it can accurately capture small overlaps between reads from lowly sequenced regions and improves the assembly of lowly transcribed or encoded genes. The graph traversal is guided by multiple types of information to avoid generation of chimeric contigs. Finally, paired-end reads are used to scaold contigs from the same genes into schematic representation of the pipeline of SAT-Assembler. 82 super contigs. Figure 5.1 is a reads profile HMM-based alignments contigs Figure 5.1: The pipeline of SAT-assembler. Reads of the same color belong to the same gene family. Reads from dierent genes of the same family are distinguished using dierent patterns. Reads shared by multiple genes from the same family have multiple patterns. 5.3.2 Prole HMM-based homology search Our method conducts homology search on reads rst. Depending on the algorithms and the target databases, homology search methods can be divided into two types. The rst is to compare the sequences against protein sequence databases using pairwise alignment tools such as BLAST [11]. The second method is prole-based homology search, which classies queries into characterized protein domain or family databases such as Pfam [12, 98], TIGRFAM [13], FIGfams [14], InterProScan [15], etc. Applying prole-based homology search to NGS reads has several advantages. First, the 83 number of gene families is signicantly smaller than the number of sequences, rendering much smaller search space. For example, there are only about 13,000 manually curated protein families in Pfam. However, these cover nearly 80% of the UniProt Knowledgebase and the coverage is increasing every year as enough information becomes available to form new families [98]. As the prole-based homology search tool HMMER is as fast as BLAST [18], using prole-based search provides shorter search time. Second, previous work [17] has shown that using family information can improve the sensitivity of remote protein/ncRNA homology search. For non-model organism transcriptome and metagenomes, sensitive remote homology search is especially important for identifying possible new homologs. Third, although short reads can pose challenges to both types of homology search [99, 57], empirical studies on thousands of families [57] showed that the performance of prole-based homology search improved quickly with the increase of read size. For read length of 85 bases, the sensitivity is close to 1.0 for moderately and highly conserved domains. Thus, for read lengths pro- duced by modern NGS technologies, prole-based homology search methods are capable of classifying most reads into their native families with high specicity. 5.3.3 Alignment informed graph construction The rst stage not only classies query reads into their native families but provides important alignment information for de novo assembly. The alignment positions are used to guide overlap graph construction. A standard overlap graph is dened as G = (V, E), where each non-duplicate read is a node and an overlap larger than a given cuto is indicated by a directed edge. Our overlap graph is dierent from a standard overlap graph [100, 101, 102] in the edge creation criteria and graph construction procedure. In a standard overlap graph, edge creation only depends on the sequence overlap, which 84 is not ideal for genes of heterogeneous sequence coverage. the relationship between two types of overlaps: For simplicity of explanation, a read r2 , r alignment overlap and sequence overlap. corresponds to vertex an edge is created from the corresponding node are satised: and r2 1) the alignment position of overlap by at least t∗ , r1 We add edges by considering r1 r in G. For two reads to node is smaller than r2 r2 ; s1 and e1 , and if the following criteria 2) the alignments of r1 a user-dened threshold; 3) the sequence overlap of the two reads is consistent with the overlap in their alignment positions. Suppose model between r1 and r2 aligns to the model between s2 overlap is the number of bases in the overlapping region between and s2 e2 . and r1 aligns to the The alignment e1 . Criterion 3 is the key observation to connect reads that are sequenced from the same gene rather than from orthologous or paralogous genes because the latter can have very dierent sequence and alignment overlaps. r2 An example is given in Figure 5.2, in which read are from two homologous genes. and 66 respectively. and r2 when the r1 and read Their sequence overlap and alignment overlap are 25 Other assembly tools such as Trinity will create an edge between k-mer r1 size is 25. However, because their alignment overlap and sequence overlap are inconsistent SAT-Assembler does not connect them, avoiding a wrong connection between reads from homologous genes. The consistency-based edge creation also allows us to improve connectivity in regions with low sequence coverage. For relatively small overlaps, we still allow an edge if the alignment overlap and sequence overlap are similar to each other. The intuition is that the chance that reads with random overlaps can be aligned to the same model with similar alignment overlap is small. To quantify the consistency between the two types of overlaps we introduce the overlap dierence dened by d = |t−k| t , where 85 t is alignment overlap and k relative is sequence r1 r2 ... profile HMM (A) r1 AAGAAAGACAACCCCGTAACTGACACTGTGTCACATGAGGAAGAAAGGATAAGCCAGTTACCGGAACCG AAGGATAAGCCAGTTACCGGAACCGTGTCCAAGGAGGAGGATAGAATAAGCCAGTTACCGGAACCAGAAC alignment overlap 66 bp r2 (B) r1 KKDNPVTDTVSHEEERISQLPEP KDKPVTGTVSKEEDRISQLPEPE r2 high similarity (C) Figure 5.2: (A)Two reads a and b sequenced from dierent genes of the same family are aligned to the prole HMM of the family. Their sequence overlap is indicated in red. (B) Read a and read b have an alignment overlap of 66 and a sequence overlap of 25 (in bold). (C)The alignment between the translated peptides of overlap. Criterion 3 is satised only when default value of 0.15. d ≤ d∗ , a and where b d∗ is 22 residues. is a predened cuto with a We examined relative overlap dierence in both real RNA-Seq and metagenomic data sets. For reads sequenced from the same gene and from dierent genes, the average relative overlap dierence is 0.072 and 0.89 respectively. overlaps, we use 20 as the default value for t∗ , To avoid small random which is the alignment overlap threshold. Our overlap graph construction is dierent from standard overlap graph construction. It does not need all-against-all sequence comparison. We rst sort the reads by their alignment positions in a non-decreasing order. We only check the sequence overlap between two reads if their alignment overlap passes t∗ . Therefore, the alignment information also increases the eciency of graph construction (the running time analysis can be found in Section 5.3.7). To incorporate substitution sequencing errors introduced by some NGS platforms, we allow 86 a certain number two hits r2 . r1 and r2 e of of mismatches in the sequence overlap. That is, the overlap between is the longest sux of In our current implementation, r1 e = 2. that has a Hamming distance The parameter e ≤e to a prex of can be adjusted to t the error rate of the input data. 5.3.4 Pruning and optimization of overlap graphs Transitive edges correspond to edges whose two end nodes are connected by an alternative path (usually with higher coverage). They add unnecessary edges without contributing to the connectivity of the graph and will be removed before de novo assembly. Before removing them, SAT-Assembler keeps a record of all the pairs of nodes connected by transitive edges because a transitive edge indicates that a pair of nodes are from the same gene region. This information will be used to guide the graph traversal. If a node has only one outgoing edge that points to another node that has only one incoming edge these two nodes can be merged as one node. Tips and bubbles are identied and removed using the standard topology-based pruning methods as in Velvet [67]. Although our edge creation method excludes most random sequence overlaps, some erroneous edges still exist. An edge is highly likely to be erroneous if it is inferior to another edge that shares a head node or tail node with it. An edge a the following two criteria are met: 1) the sequence overlap of of b; 2) the Hamming distance of sequence overlap of a is inferior to another edge a b if is smaller than half of that is larger than that of b. A random overlap is more likely to be much smaller and have more mismatches than a true overlap. Therefore, these two criteria will help us remove most erroneous edges. 87 v1 v2 v3 14 v5 v4 31 v6 v7 v8 29 13 A and B . Nodes in red (v1 , v7 ) and in blue (v2 , v5 , and v8 ) are from gene A and gene B respectively. Nodes black (v3 and v6 ) are chimeric nodes because they are shared by the two genes. Arrows Figure 5.3: A graph containing reads from two dierent genes v4 , in and with solid lines are real edges. Arrows with dotted lines and dashed lines indicate paired-end reads and transitive edges between two nodes respectively. 5.3.5 Guided graph traversal using multiple information Once a family-specic graph is constructed and optimized, the goal is to conduct a graph traversal to output paths corresponding to genes. The traversal starts with nodes without incoming edges. The challenge arises when two or more genes contain a common or similar subsequence, leading to chimeric nodes such as v3 and v6 in Figure 5.3. Chimeric nodes add to the complexity of the graph traversal by leading to chimeric contigs. For example, the path < v1 , v3 , v4 , v6 , v8 > contains nodes exclusively from both genes and thus is a chimeric path. SAT-Assembler employs three types of information to guide the graph traversal to recover correct gene paths: transitive edges, paired-end reads, 88 and coverage. We describe the key steps of our graph traversal algorithm using Figure 5.3. The goal is to output two correct paths corresponding to the two genes. A paired-end read represents two reads appearing in the same genome with known order (by our homology search) and distance range (insert size). Although transitive edges are removed at the stage of graph pruning they can act as a paired-end read with a small insert size. Therefore, both transitive edges and paired-end reads can be used to examine whether two nodes are from the same gene. Two nodes that are not connected by an edge are said to have supports or be supported if there are transitive edges or paired-end reads between them. For supports of paired-end reads, we further require that their distance in the path is consistent with the insert size. Dierent from previous traversal algorithms, we divide supports into two types, supports and non-critical supports. critical Critical supports can be used to resolve branches in graph traversal while non-critical supports are not able to distinguish dierent gene paths. For example, a graph traversal generates a path edges (v3 , v4 ) and (v3 , v5 ). < v1 , v3 >. If there is a support between edge in Figure 5.3, the traversal will be guided to visit v4 v1 The node and v4 , v3 has two outgoing such as the transitive in next step. This transitive edge provides a critical support for correct traversal. However, the support between not critical" for guiding the graph traversal because any path that has visited visit v6 . In Figure 5.3 , the support between v3 and v6 v3 v3 and v6 is needs to is a non-critical support while all the other supports are critical supports. When there is no support between two non-chimeric nodes , node coverage will be used to resolve the branches. The coverage of a node is the total size of reads normalized by the length of the assembled sequence of the node. For protein-coding genes, although the sequence coverage is usually not uniform along the genes its change is gradual rather than sharp. 89 Thus, the coverage of two consecutive non-chimeric nodes in the same path should reect this observation. Any sharp change indicates a wrong path. For example, in Figure 5.3, and v4 v7 have similar coverage, as do v5 and v8 . v4 and v8 , however, have signicantly dierent coverage. Therefore, a path that has visited v4 and v6 should next visit v7 instead of v8 . We use a bounded depth-rst search (DFS) algorithm to generate correct paths. While a typical DFS takes exponential time to generate all simple paths between two nodes our graph traversal method makes use of critical supports to bound the search and only visits correct paths, eectively reducing the time complexity of path generation. During search, we will proceed to successors of the current node that provide critical supports. If none of the successive non-chimeric nodes has supports with any of the previously visited non-chimeric nodes, we will proceed to the one that has a similar coverage to the recently visited nonchimeric node given that their coverage is similar enough. Otherwise, it is highly likely that the current node is not from the same gene with any of its successors. Therefore, we will output the current path and start a new path from its successors. The traversal stops when there is no appropriate successive node available. All paths with critical supports above a given threshold will be output. 5.3.6 Contig scaolding Assembly tools may output multiple contigs from the same gene. There are two main reasons for the fragmentation: 1) some regions between contigs are not sequenced due to sequencing bias, PCR bias, low transcription level or abundance; 2) reads from lowly conserved regions of the gene may not pass the homology search and thus are not used to construct the graph. The contigs are oriented and connected using their alignment positions against the underlying prole HMM and paired-end reads. The scaolding results in super contigs. 90 5.3.7 Running time analysis Let the number of input reads be of the homology search stage is prole HMM and N1 Usually, M N. N. N and the average read length be O(N |r|M ) Suppose N1 |r|. The time complexity for one input family, where M is length of the reads have passed the homology search stage. The time complexity of graph construction is O(τ |r|2 N1 ), the average number of overlapping alignments longer than a given cuto. where τ is During graph construction, we use alignment positions to guide the overlap computation, avoiding the allagainst-all comparison needed in a standard overlap graph construction. The time complexity of graph traveral is of edges, n O(|E|+|V|+n2 N2 ), where |V| is the number of nodes, |E| is the number is the number of read pairs that have critical supports, and N2 correct paths in the graph. The time complexity of the scaoding stage is is the number of 2 O(N2 ). Because of various optimization techniques and heuristics, the latest version of HMMER is as fast as BLAST [18]. Considering n N2 , the time complexity of scaolding is much smaller than graph traversal. Therefore, the overall running time is determined by the graph traversal stage. 5.4 Results To show the utility of SAT-Assembler we applied it to an RNA-Seq data set of Arabidopsis and a human gut metagenomic data set. We compared the completeness and correctness of assembly, length of contigs, memory usage, and running time of dierent de novo assembly tools, including SAT-Assembler, Velvet, Oases, Trinity, Meta-IDBA, and MetaVelvet. For targeted gene assembly we are interested in evaluating gene segments recovered by contigs generated by assembly tools. Figure 5.4 shows an example of gene segments assembled 91 contigs a b 400 bp 300 bp c 1000 bp gene Figure 5.4: Three contigs generated from a metagenomic data set. The green parts of the contigs are contained in the target gene and thus are gene segments. The blue parts of the contigs are not gene segmetns. from a metagenoic data set. For the two experiments, by mapping reads to annotated reference genome or charaterized genes from existing databases, we can construct a set of reference genes, which are transcribed or encoded in an NGS data set. All contigs output by assembly tools will be aligned against reference genes using BLAST with an E-value cuto of 10−5 . BLAST hits with at least 95% identity were regarded as recovered gene segments. To quantify the completeness, correctness, and length of gene segments we propose three metrics: gene coverage, chimera rate, and contig coverage. For one reference gene, gene coverage is dened as the fraction of bases in the reference gene covered by at least one gene segment. Chimera rate is the fraction of contigs consisting of reads from dierent genes. As read mapping results can be used to determine the origin of each read, we can evaluate whether a contig is a chimera by checking the origin of each read in the contig. Contig coverage is the length of the gene segment normalized by the length of the target gene. This normalized contig length is used to prevent gene segments of long reference genes from dominating the average contig coverage of an assembly tool. In Figure 5.4, the length of the 92 gene is 1000 bp and the length of the gene segments from contig a and contig b is 700 bp. Therefore, its gene coverage is 70%. The contig coverages of the two gene segments are 40% and 30% respectively. To compare the performance of assembly tools on all input families, we rst calculate a metric for each family and then report the average of the values of the metric over all families. 5.4.1 Gene assembly in an RNA-Seq data set of Arabidopsis In this experiment, we applied SAT-Assembler to an RNA-Seq data set sequenced from a normalized cDNA library of Arabidopsis using paired-end Illumina sequencing [70]. There were a total of 9,559,784 paired-end reads of 76 bp. Pfam was used as our database of input families. We compared the performance of SAT-Assembler with Velvet, Oases, and Trinity. Velvet is a widely used short read de novo assembly tool. Oases and Trinity are assembly tools specially designed for transcriptomic data. To obtain transcribed genes in this data set, we conducted read mapping (using Bowtie) on all the coding sequences (CDS) of Arabidopsis thaliana of version TAIR10 [72]. There is no commonly accepted criterion to dene transcribed genes. In this work, we dened CDS with at least 10 mapped reads as transcribed CDS. Assembly results of dierent tools were compared on these transcribed CDS. 3,188 protein or domain families from Pfam that can be aligned to these CDS using HMMER with gathering thresholds (GAs) were used as input to SAT-Assembler. 5.4.1.1 Edge creation performance Before we evaluate the assembly performance of SAT-Assembler we rst evaluate the performance of the proposed edge creation strategy. We name the overlap threshold of SAT- 93 Assembler 20+consistency (or consistency for simplicity), where 20 is the default threshold for alignment overlap and consistency stands for the consistency between alignment overlap and sequence overlap. We compared edge creation performance of three strategies that used dierent overlap thresholds: i) 20; ii) 38; iii) consistency. Three metrics were used in the comparison: the number of true positive edges (T P ), the number of false positive edges (F P ), and positive predictive value (P P V ), which is TP T P +F P . True positive edges connect nodes that are from the same genes. As the number of correct connections is the same for these strategies, TP is proportional to the sensitivity. Higher TP indicates higher sensi- tivity. False positive edges connect nodes that are from dierent genes. The performance comparison of three strategies is shown in Table 5.1. Table 5.1: Edge creation performance of three strategies on the RNA-Seq data set of Arabidopsis. TP FP PPV 20 2,377,680 254,331 90.34% 38 2,127,910 15,639 99.27% consistency 2,346,341 84,121 96.54% Strategy The metrics are evaluated on all edges in overlap graphs of 3,188 families. The consistency strategy provided a better tradeo for edge creation than the stringent overlap threshold of 38 and the very loose overlap threshold of 20. Compared with the 20 threshold strategy, the consistency strategy avoided 170,210 false positive connections at the expense of missing only 31,339 positive connections, leading to 6.20% increase in PPV . Compared with the 38 threshold strategy, it successfully captured 218,431 more positive connections, leading to 10.27% increase in TP at the expense of 2.73% decrease in PPV . This showed that the consistency constraint successfully eliminated most random overlaps while preserving overlaps between many reads from the same genes. 94 5.4.1.2 Performance comparison with other assembly tools For Velvet we tried of k-mer k-mer sizes from 35 to 61 with a step size of 2. Oases accepts a range sizes. It rst generates assemblies on dierent We xed the beginning k-mer k-mer sizes and then merges them. size as 35 and changed the ending with a step size of 2. Trinity uses a xed k-mer k-mer sizes from 35 to 61 size of 25 in its current implementation. We ran it with its default parameters. For SAT-Assembler, we used consistency" as its edge creation strategy and default values for the other parameters. To compare the consistency strategy with the simple overlap threshold strategy we also ran SAT-Assembler with an overlap threshold from 35 to 45 without the consistency constraint. Figure 5.5 shows chimera rate versus gene coverage when we changed k-mer sizes or overlap thresholds for dierent assembly tools. Velvet was more sensitive to the change of k-mer sizes than the other assembly tools. Its largest and smallest gene coverage were 82.03% and 59.53% respectively. Its best overall performance was achieved when the k-mer size was 37. We also ran VelvetOptimiser [103] to search for the best assembly result by trying k -mer sizes from 35 to 61 bp with  K50 the optimization function. It reported an optimal k-mer as size of 57, which generated a gene coverage of 66.31% and a chimera rate of 18.85%. When the range of k-mer sizes was changed the performance of Oases did not change much. This is because Oases merges assemblies generated by dierent k-mer sizes and there is large redundancy in its nal assemblies. The overall performance of Trinity was slightly better than Oases. The performance of SAT-Assembler was also stable when the simple overlap threshold was changed. Overall, using the consistency strategy shows better gene coverage while keeping similar chimera rate compared to the simple overlap threshold. 95 0.85 0.8 Velvet Oases Trinity SAT without consistency SAT with consistency Gene coverage 0.75 0.7 0.65 0.6 0.55 0.05 0.1 0.15 0.2 Chimera rate Figure 5.5: Chimera rate versus gene coverage when k-mer 0.25 0.3 size or overlap threshold changes for dierent assembly tools. These values are average values of the assemblers' performance on 3,188 input families. 96 We further compared the performance of SAT-Assembler with the overlap 45 threshold strategy and the consistency strategy on individual target genes. We found that the im- provement of overall gene coverage was more likely to be generated by target genes with low depth of reads that passed the homology search stage. Given a cuto for the depth of reads d∗ , we compared gene coverage of two strategies on target genes that had a depth lower than d∗ . When d∗ = 10, the consistency strategy had better gene coverage in 11.74% target genes and the 45 threshold had better gene coverage in 4.60% target genes. For the rest of the target genes both strategies had almost the same gene coverage. When d∗ = 20, the consistency strategy had better gene coverage in 10.92% target genes and the 45 threshold had better gene coverage in 6.06% target genes. When we further increased d∗ we found that the performance improvement of the consistency strategy became less signicant, showing that the consistency strategy had better ability to assemble reads from lowly transcribed or encoded target genes. We then compared the metrics of gene coverage, chimera rate, contig coverage, memory usage, and running time of dierent assembly tools when their best performance of gene coverage and chimera rate were achieved in Table 5.2. Contigs generated by all assembly tools covered about 80% of genes on average. SAT-Assembler had the lowest chimera rate due to our consistency-based edge creation strategy. Oases generated the longest contigs (indicated by contig coverage) on average. But the price paid was the highest chimera rate, longest running time, and second highest memory usage, even with a single When users apply a range of k-mer k-mer size. sizes, its memory usage will be further increased. SAT- Assembler had the lowest memory usage and second shortest running time because of the eective classication in the homology search stage. Our current implementation of SATAssembler uses the Python library of Networkx [104], which contributes to a large portion 97 of both memory usage and running time. We plan to implement SAT-Assembler using C++ in the future. Table 5.2: Performance comparison between dierent assembly tools on the RNA-Seq data set of Arabidopsis. Assembly tool Gene coverage Chimera rate Contig coverage Memory (MB) Time (m) Velvet 80.15% 11.04% 64.69% 8034 140.85 Oases 80.75% 29.18% 84.55% 22475 5128.74 Trinity 81.60% 26.10% 71.59% 34691 1145.05 SAT 81.32% 9.96% 78.76% 2592 228.14 The memory usage for all tools is based on a single overlap threshold or k-mer. The running time was the average running time on all input families. 5.4.2 Targeted gene assembly in a human gut metagenomic data set In this experiment, we compared the performance of SAT-Assembler with Velvet, MetaIDBA, and MetaVelvet on a human gut metagenomic data set. Meta-IDBA and MetaVelvet are both de Bruijn graph based and specially designed for data. de novo assembly of metagenomic There were 47,117,906 paired-end and 5,528,102 unpaired reads of various lengths. The average length of the query reads was 95.72 bp and 75% of them were 100 bp. In this experiment, we were interested in assembling butyrate kinase pathway genes, which play important roles in butyrate synthesis. We downloaded 2,352 annotated genes of butyrate kinase family from RDP's functional gene repository [38]. By using read mapping, a total of 58 genes with at least 10 mapped reads were identied and were used to evaluate the performance of all assembly tools. For SAT-Assembler, a prole HMM built from 77 seed genes was used as the input family. We used VelvetOptimiser to search for the best assembly result by trying 98 k-mer sizes from 51 to 81 bp with  K50 as the optimization function. VelvetOptimiser reported 51 as the optimal k-mer size. For Meta-IDBA we used the same range of Meta-Velvet used the hash table generated by Velvet and its k-mer k-mer sizes as Velvet. size was thus 51 as well. For SAT-Assembler we used its default parameters. A total of 16,136 reads were classied into butyrate kinase family by the homology search stage, which accounted for 0.31% of the query reads. Table 5.3 shows a performance comparison between these assembly tools. Table 5.3: Performance comparison between dierent assembly tools in assembling genes from butyrate kinase family on the human gut metagenomic data set Assembly tool Velvet Meta-IDBA MetaVelvet SAT-Assembler Gene coverage 51.97% 60.55% 43.52% 68.89% Chimera rate 59.26% 36.67% 36.73% 16.87% Contig coverage 38.99% 62.06% 49.30% 48.56% 41248 16648 20330 283 1736.08 1065.77 834.10 1530.63 Memory usage (MB) Time (m) The memory usage for all tools is based on a single overlap threshold or k-mer. The running time was the average running time on all input families. SAT-Assembler had the best gene coverage, chimera rate, and memory usage. Its contigs were usually shorter than Meta-IDBA. A closer examination reveals the reason: a number of reads did not pass the prole HMM homology search and thus were not used as input to assembly. Gene coverage of assembly tools in this experiment was much lower than in the rst experiment because of lower sequencing depth and higher data complexity. MetaVelvet had the best running time performance because it directly used the optimal k-mer size and hash table from VelvetOptimiser while Velvet and Meta-IDBA both ran a range of k-mer sizes. The low memory usage of SAT-Assembler further showed the advantage of using homology search in targeted gene assembly for large scale NGS data. Due to the high complexity of the metagenomic data set SAT-Assembler constructed a much more complex overlap graph, leading to higher run-time overhead. 99 5.5 Conclusion Both experiments on RNA-Seq and metagenomic data sets show that our novel consistencybased edge creation strategy and guided graph traversal can eectively avoid chimeric contigs. Moreover, by reducing the original search space into a much smaller subset of reads from targeted genes the memory usage was signicantly decreased, making it a more ecient tool for large scale targeted gene assembly. 100 Chapter 6 Conclusion and future work In this thesis, I rst introduced next-generation sequencing technologies and its applications. I then discussed the three main challenges in the analysis of NGS data, including large data size, frameshift sequencing errors, and short sequences. In order to address these challenges, I have developed several tools and compared their performance with some existing tools. Here I summarize the contributions of the thesis as follows: • In Chapter 2, I introduced HMM-FRAME, which was designed to accurately detect and correct frameshift errors in NGS data and thus improved protein domain classication of NGS sequences. It was based on an augmented Viterbi algorithm and could incorporate dierent error models for dierent error rates. • In Chapter 3, I introduced MetaDomain, which was designed to improve sensitivity of protein domain classication of short reads while maintaining high specicity. MetaDomain employed a position-specic score threshold and signicantly improved the classication performance of short reads from poorly conserved domain regions. Compared with existing protein domain classication tools, it had much better sensitivity with high specicity. • In Chapter 4, I introduced SALT, which was designed to sensitively and accurately classify short reads into their native protein domain families. It combined homology 101 search with a guided family-specic de novo assembly stage. Compared with MetaDo- main, the specicity of SALT was signicantly improved with a comparable sensitivity. • In Chapter 5, I introduced SAT-Assembler, which was designed to conduct scalable and accurate targeted gene assembly. Its novelties include the family-specic gene assembly, alignment-informed graph construction, and a guided graph traversal. It can accurately assemble short reads from the same target genes while avoiding chimeric contigs. • All these tools were designed for large-scale NGS data. The prole HMM-based methods enable us to split the input set of reads into subsets of reads corresponding to different families. Therefore, the subsequent steps can be conducted in a family-specic manner, eectively reducing the search space and memory usage. Moreover, analysis of reads from dierent families can be naturally parallellized. This methodology has greatly improved the scalability of the analysis of large-scale NGS data. Based on my previous studies there are several directions for future work: • HMM-FRAME is more computationally expensive than HMMER 3.0 mainly because of diverse sequencing errors inside codons. We plan to improve the eciency of our DNA versus prole HMM alignment algorithm so that it can be used eciently in large-scale protein domain analysis. Besides applying DP matrix pruning techniques to reduce the computational cost, we plan to use a faster but less accurate Viterbi algorithm as a ltration stage. Specically, we can apply a faster Viterbi algorithm to predict whether there are any errors inside a codon before identifying error positions. If such errors exist, we can then use a more sensitive method to determine the exact number and positions of insertions or deletions. 102 • We plan to adapt SALT to domain classication in metagenomic data. In addition, when the protein families are available, we will test SALT's performance on gene assembly. We also plan to improve the eciency of SALT in several aspects. First, the hit graph construction is currently implemented using the Python library of NetworkX [104]. The overhead of object constructions can be greatly decreased if we implement it using C++. Second, the independence of dierent input families allows us to use parallelization techniques such as MPI. Finally, we will provide users with a systematical way to better estimate K based on the sequencing data and domain information. • There are still some challenges to address to further improve SAT-Assembler's performance. First, gene segments from some poorly conserved gene regions are fragmented because some reads from these regions fail to pass the homology search. This problem can be alleviated by increasing the sensitivity of the homology search. In the future, we will incorporate our proposed position-specic score threshold (PSST) [57, 105] into SAT-Assembler to classify more reads into their native families. Second, although the edge creation strategy of SAT-Assembler captured more overlaps between reads from the same genes, some positive overlaps still failed to be captured because of substitution errors in the overlapping regions. Masella et al. [106] proposed a sophisticated method that can probabilistically correct these errors based on the overlap data from the paired-end reads. We plan to use this method in our edge creation strategy to generate more positive connections. Third, the running time of the graph traversal stage is the bottleneck of SAT-Assembler, especially for complex metagenomic data. Therefore, we plan to add more bounding strategies into the graph traversal, such as a more 103 stringent threshold for critical supports. Moreover, we will implement SAT-Assembler using C++ to reduce its running time. 104 BIBLIOGRAPHY 105 BIBLIOGRAPHY [1] Ee Mutasa-Göttgens, Anagha Joshi, Helen Holmes, Peter Hedden, and Berthold Gottgens. A new RNAseq-based reference transcriptome for sugar beet and its application in transcriptome-scale analysis of vernalization and gibberellin responses. Genomics, 13(1):99, 2012. BMC [2] Angela M. Orshinsky, Jinnan Hu, Stephen O. Opiyo, Venu Reddyvari-Channarayappa, Thomas K. Mitchell, and Michael J. Boehm. RNA-seq analysis of the Sclerotinia homoeocarpa - Creeping Bentgrass Pathosystem. PLoS ONE, 7(8):e41150, 2012. [3] Zhen Li, Zhonghua Zhang, Pengcheng Yan, Sanwen Huang, Zhangjun Fei, and Kui Lin. RNA-seq improves annotation of protein-coding genes in the cucumber genome. BMC Genomics, 12(1):540, 2011. [4] W. Marc Schmid, Anja Schmidt, C. Ulrich Klostermeie, Matthias Barann, Philip Rosenstiel, and Ueli Grossniklaus. A powerful method for transcriptional proling of specic cell types in eukaryotes: Laser-assisted microdissection and rna sequencing. PLoS ONE, 7(1):e29685, 2012. [5] CAMERA: Community cyberinfrastructure for advanced microbial ecology research and analysis. http://camera.calit2.net/. [6] F Meyer, D Paarmann, M D'Souza, R Olson, EM Glass, M Kubal, T Paczian, A Rodriguez, R Stevens, A Wilke, J Wilkening, and RA Edwards. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 9(1):386, 2008. [7] Daniel H. Huson, Alexander F. Auch, Ji Qi, and Stephan C. Schuster. MEGAN analysis of metagenomic data. Genome Research, 17(3):377386, 2007. [8] Kyle Ellrott, Lukasz Jaroszewski, Weizhong Li, John C. Wooley, and Adam Godzik. Expansion of the protein repertoire in newly explored environments: Human gut microbiome specic protein families. PLoS Comput Biol, 6(6):e1000798, 2010. [9] Andreas Schlüter, Lutz Krause, Rafael Szczepanowski, Alexander Goesmann, and Alfred Pühler. Genetic diversity and composition of a plasmid metagenome from a wastewater treatment plant. Journal of Biotechnology, 136(1-2):6576, 2008. 106 [10] Lutz Krause, Naryttza N. Diaz, Alexander Goesmann, Scott Kelley, Tim W. Nattkemper, Forest Rohwer, Robert A. Edwards, and Jens Stoye. Phylogenetic classication of short environmental DNA fragments. Nucleic Acids Research, 36(7):22302239, 2008. [11] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipmanl. Basic local alignment search tool. Journal of Molecular Biology, 215:403  410, 1990. [12] Robert D. Finn, Jaina Mistry, John Tate, Penny Coggill, Andreas Heger, Joanne E. Pollington, O. Luke Gavin, Prasad Gunasekaran, Goran Ceric, Kristoer Forslund, Liisa Holm, Erik L. L. Sonnhammer, Sean R. Eddy, and Alex Bateman. The Pfam protein families database. Nucleic Acids Research, 38(suppl 1):D211D222, 2010. [13] Daniel H. Haft, Jeremy D. Selengut, and Owen White. The TIGRFAMs database of protein families. Nucleic Acids Research, 31(1):371373, 2003. [14] F. Meyer, R. Overbeek, and Rodriguez A. FIGfams: yet another set of protein families. Nucleic Acids Research, 37(20):66436654, 2009. [15] E Quevillon, V Silventoinen, S Pillai, N Harte, N Mulder, R Apweiler, and R Lopez. InterProScan: protein domains identier. Nucleic Acids Research, 33(suppl 2):W116 W120, 2005. [16] Sarah Hunter, Rolf Apweiler, Teresa K. Attwood, Amos Bairoch, Alex Bateman, David Binns, Peer Bork, Ujjwal Das, Louise Daugherty, Lauranne Duquenne, Robert D. Finn, Julian Gough, Daniel Haft, Nicolas Hulo, Daniel Kahn, Elizabeth Kelly, Aurelie Laugraud, Ivica Letunic, David Lonsdale, Rodrigo Lopez, Martin Madera, John Maslen, Craig McAnulla, Jennifer McDowall, Jaina Mistry, Alex Mitchell, Nicola Mulder, Darren Natale, Christine Orengo, Antony F. Quinn, Jeremy D. Selengut, Christian J. A. Sigrist, Manjula Thimma, Paul D. Thomas, Franck Valentin, Derek Wilson, Cathy H. Wu, and Corin Yeats. InterPro: the integrative protein signature database. Acids Research, 37(suppl 1):D211D215, 2009. [17] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Probabilistic Models of Proteins and Nucleic Acids. Nucleic Biological Sequence Analysis Cambridge University Press, UK, 1998. [18] Sear R. Eddy. inference. A new generation of homology search tools based on probabilistic Genome Inform., 23(1):20511, 2009. [19] NP Brown, C Sander, and P Bork. Frame: detection of genomic sequencing errors. Bioinformatics, 14(4):36771, 1998. 107 [20] X Guan and E.C. Uberbacher. Alignments of dna and protein sequences containing frameshift errors. Comput Appl Biosci., 12(1):3140, 1996. [21] Z Zhang, RW Pearson, and W Miller. Aligning a dna sequence with a protein sequence. Proc. of RECOMB 97: the rst international conference on computational molecular biology, pages 337343. ACM press, 1997. In [22] E Halperin, S Faigler, and R Gill-More. FramePlus: aligning DNA to protein sequences. Bioinformatics, 15:867873, 1999. [23] M. Peltola, H. Soderlund, and E. Ukkonen. Algorithms for the search of amino acid Nucl. Acids Res., 14:99107, 1986. patterns in nucleic acid sequences. [24] W. I. Chang and E.L. Lawler. Sublinear expected time approximate string matching and biological applications. Algorithmica, 12:32744, 1994. [25] M Pellegrini and TO Yeates. Searching for frameshift evolutionary relationships be- Proteins, 37(2):27883, 1999. tween protein sequence families. [26] M. Girdea, L. Noe, and G. Kucherov. Back-translation for discovering distant pro- tein homologies. In T. Warnow and S. Salzberg, editors, September 12-13; Philadelphia, pages 108120, 2009. Proceedings of WABI 2009: [27] M. Girdea, L. Noe, and G. Kucherov. Back-translation for discovering distant protein homologies in the presence of frameshift mutations. Algorithms for Molecular Biology, 5(6), 2010. [28] Thomas Schiex, J Gouzy, Annick Moisan, and Yannick Oliveira. Framed: a exi- ble program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences. Nucleic Acids Res., 31(13):37383741, 2003. [29] A. Kislyuk, A. Lomsadze, A. L. Lapidus, and M. Borodovsky. Frameshift detection in prokaryotic genomic sequences. International Journal of Bioinformatics Research and Applications, 5(4):458477, 2009. [30] M. Borodovsky and J. McIninch. strands. GeneMark: parallel gene recognition for both DNA Computers and Chemistry, 17(19):123133, 1993. [31] I Antonov and M Borodovsky. Genetack: frameshift identication in protein-coding sequences by the viterbi algorithm. J Bioinform Comput Biol., 8(3):535551, 2010. 108 [32] M. Rho, H. Tang, and Y. Ye. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res., 1(38):e191, 2010. [33] E Birney, M Clamp, and R Durbin. Genewise and genomewise. Genome Research, 14:988995, 2004. [34] C Wang, Y Mitsuya, B Gharizadeh, M Ronaghi, and RW Shafer. Characterization of mutation spectra with ultra-deep pyrosequencing: application to hiv-1 drug resistance. Genome Res., 17(8):1195201, 2007. [35] Nicholas Eriksson, Lior Pachter, Yumi Mitsuya, Soo-Yon Rhee, Chunlin Wang, Baback Gharizadeh, Mostafa Ronaghi, Robert W. Shafer, and Niko Beerenwinkel. Viral pop- PLoS Comput Biol, 4(5):e1000074, 2008. ulation estimation using pyrosequencing. [36] Christopher Quince, Anders Lanzen, Thomas P Curtis, Russell J Davenport, Neil Hall, Ian M Head, L Fiona Read, and William T Sloan. Accurate determination of microbial diversity from 454 pyrosequencing data. Nature Methods, 6:639641, 2009. [37] J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris, A. S. Kulam-SyedMohideen, D. M. McGarrell, T. Marsh, G. M. Garrity, and J. M. Tiedje. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Acids Research, 37(suppl 1):D141D145, 2009. Nucleic [38] FunGene: functional gene pipeline and repository. http://fungene.cme.msu.edu/. [39] M.A. Larkin, G. Blackshields, N.P. Brown, R. Chenna, P.A. McGettigan, McWilliam. H., F. Valentin, I.M. Wallace, Lopez R. Wilm, A., J.D. Thompson, T.J. Gibson, and D.G. Higgins. ClustalW and ClustalX version 2. Bioinformatics, 23(21):29472948, 2007. [40] S Iwai, B Chai, WJ Sul, JR Cole, SA Hashsham, and JM Tiedje. Gene-targeted- metagenomics reveals extensive diversity of aromatic dioxygenase genes in the environment. ISME J., 4(2):279285, 2010. [41] D. T. Gibson and R. E. Parales. Aromatic hydrocarbon dioxygenases in environmental biotechnology. Current Opinion in Biotechnology, 11(3):236243, 2000. [42] Robert Edwards, Beltran Rodriguez-Brito, Linda Wegley, Matthew Haynes, Mya Breitbart, Dean Peterson, Martin Saar, Scott Alexander, E Calvin Alexander, and Forest Rohwer. Using pyrosequencing to shed light on deep mine microbial ecology. Genomics, 7(1):57, 2006. 109 BMC [43] Arthur L. Delcher, Kirsten A. Bratke, Edwin C. Powers, and Steven L. Salzberg. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics, 23(6):673679, 2007. [44] Hideki Noguchi, Jungho Park, and Toshihisa Takagi. nding from environmental genome shotgun sequences. MetaGene: prokaryotic gene Nucl. Acids Res., 34(19):5623 5630, 2006. [45] Weizhong Li. Analysis and comparison of very large metagenomes with fast clustering and functional annotation. BMC Bioinformatics, 10(1):359, 2009. [46] Stephan C Schuster. Next-generation sequencing transforms today's biology. Nat Meth, 5(1):1618, 2008. [47] Fabian Schreiber, Peter Gumrich, Rolf Daniel, and Peter Meinicke. Treephyler: fast taxonomic proling of metagenomes. Bioinformatics, 26(7):960961, 2010. [48] Yuan Zhang and Yanni Sun. HMM-FRAME: accurate protein domain classication for metagenomic sequences containing frameshift errors. BMC Bioinformatics, 12(1):198, 2011. [49] Francis Weng, Chien Hao Su, Ming Tsung Hsu, Tse Yi Wang, Huai Kuang Tsai, and Daryi Wang. Reanalyze unassigned reads in sanger based metagenomic data using conserved gene adjacency. BMC Bioinformatics, 11(1):565, 2010. [50] C. Juliane Dohm, Claudio Lottaz, Tatiana Borodina, and Heinz Himmelbauer. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research, 36(16):e105, 2008. [51] D. Kasper Hansen, E. Steven Brenner, and Sandrine Dudoit. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research, 2010. [52] D. R. Yoder-Himes, P. S. G. Chain, Y. Zhu, O. Wurtzel, E. M. Rubin, James M. Tiedje, and R. Sorek. Mapping the burkholderia cenocepacia niche response via high-throughput sequencing. Proceedings of the National Academy of Sciences, 106(10):39763981, 2009. [53] IMG: Integrated microbial genomes. http://img.jgi.doe.gov/. 110 [54] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. Ultrafast and memory-ecient alignment of short DNA sequences to the human genome. Biology, 10:R25, 2009. [55] Karla D. Passalacqua, Anjana Varadarajan, Brian D. Ondov, Genome David T. Okou, Michael E. Zwick, and Nicholas H. Bergman. Structure and complexity of a Bacterial transcriptome. J. Bacteriol., 191(10):32033211, 2009. [56] Brian A Williams, Brian A Williams, Kenneth McCue, Lorian Schaeer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Meth, 5(7):621628, 2008. [57] Yuan Zhang and Yanni Sun. MetaDomain: a prole hmm-based protein domain classication tool for short sequences. In (PSB), 2012. Proceedings of Pacic Symposium on Biocomputing [58] A. Martin1 Jerey and Wang Zhong. Next-generation transcriptome assembly. Reviews Genetics, 12:671682, 2011. Nature [59] Michael Remmert, Andreas Biegert, Andreas Hauser, and Johannes Söding. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Methods, 9:173175, 2012. Nature [60] Jason R. Miller, Sergey Koren, and Granger Sutton. Assembly algorithms for nextgeneration sequencing data. Genomics, 95(6):315  327, 2010. [61] Barbara Feldmeyer, W. Christopher Wheat, Nicolas Krezdorn, Bjórn Rotter, and Markus Pfenninger. Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance. BMC Genomics, 12(317), 2012. [62] Nicholas Eriksson, Lior Pachter, Yumi Mitsuya, Soo-Yon Rhee, Chunlin Wang, Baback Gharizadeh, Mostafa Ronaghi, Robert W. Shafer, and Niko Beerenwinkel. Viral population estimation using pyrosequencing. PLoS Comput Biol, 4(5), 2008. [63] Arthur L. Delcher Michael C. Schatz and Steven L. Salzberg. genomes using second-generation sequencing. Genome Res., 20(9):11651173, 2010. [64] Jin Y. Yen. Finding the K shortest loopless paths in a network. 17(11):712716, 1971. 111 Assembly of large Management Science, [65] D. Eppstain. Finding the k shortest paths. In Proc. 25th IEEE Annual Symposium on Foundation of Computer Science, pages 154165, 1994. [66] W. A. Brander and C. M. Sinclair. A comparative study of k-shortest path algorithms. Proc. 11th UK Performance Engineering Workshop for Computer and Telecommunications Systems, 1995. In [67] R. Daniel Zerbino and Ewan Birney. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18(5):821829, 2008. [68] L. René Warren, G. Granger Sutton, J. M. Steven Jones, and A. Rober Holt. Assem- Bioinformatics, 23(4):500501, [69] Ben Langmead, Cole Trapnell, Mihai Pop, and L. Steven Salzberg. Ultrafast and bling millions of short DNA sequences using SSAKE. 2006. memory-ecient alignment of short dna sequences to the human genome. Biology, 10(R25), 2009. Genome [70] Yamile Marquez, W.S. John Brown, Craig Simpson, Andrea Barta, and Maria Kalyna. Transcriptome survey reveals increased complexity of the alternative splicing landscape in Arabidopsis. Genome Res., 22:11841195, 2012. [71] H. Marcel Schulz, R. Zerbino Daniel, Martin Vingron, and Ewan Birney. Oases: Robust de novo RNA-seq assembly across the dynamic range of expression levels. matics, 28(8):10861092, 2012. [72] TAIR: The Arabidopsis Information Resource. Bioinfor- www.arabidopsis.org. [73] SeqMan NGen - Software for Next-Gen Sequence Assembly. com/t-sub-products-genomics-seqman-ngen.aspx. http://www.dnastar. [74] Susannah Green Tringe, Christian Von Mering, Arthur Kobayashi, Asaf A Salamov, Kevin Chen, Hwai W Chang, Mircea Podar, Jay M Short, , E. J. Mathur, and et al. Comparative metagenomics of microbial communities. Science, 308(5721):554557, 2005. [75] EBI Metagenomics. [76] T. Lingner, https://www.ebi.ac.uk/metagenomics/. K. P. Aÿhauer, F. Schreiber, and P. Meinicke. server for comparative functional proling of metagenomes. 39(suppl_2):W518, 2011. 112 CoMet - a web Nucleic Acids Research, [77] Manfred G Grabherr, Brian J Haas, Moran Yassour, Joshua Z Levin, Dawn A Thompson, Ido Amit, Xian Adiconis, Lin Fan, Raktima Raychowdhury, Qiandong Zeng, Zehua Chen, Evan Mauceli, Nir Hacohen, Andreas Gnirke, Nicholas Rhind, Federica di Palma, Bruce W Birren, Chad Nusbaum, Kerstin Lindblad-Toh, Nir Friedman, and Aviv Regev. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, (7):644652, 2011. [78] I Birol, Shaun D. Jackman, Cydney B. Nielsen, Jenny Q. Qian, Richard Varhol, Greg Stazyk, Ryan D. Morin, Yongjun Zhao, Martin Hirst, Jacqueline E. Schein, Doug E. Horsman, Joseph M. Connors, Randy D. Gascoyne, Marco A. Marra, and Steven J. M. Jones. De novo transcriptome assembly with ABySS. Bioinformatics, 25(21):2872 2877, 2009. [79] T Namiki, T Hachiya, H Tanaka, and Y Sakakibara. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Acids Res., 40(20):e155, 2012. Nucleic [80] Todd Treangen, Sergey Koren, Daniel Sommer, Bo Liu, Irina Astrovskaya, Brian Ondov, Aaron Darling, Adam Phillippy, and Mihai Pop. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. [81] J Laserson, V Jojic, and D Koller. Comput Biol, 18(3):429443, 2011. Genome Biology, 14(1):R2, 2013. Genovo: de novo assembly for metagenomes. J [82] Yu Peng, Henry C. M. Leung, S. M. Yiu, and Francis Y. L. Chin. Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics, 27(13):i94i101, 2011. [83] C Luo, D Tsementzi, NC Kyrpides, and KT Konstantinidis. Individual genome assembly from complex community short-read metagenomic datasets. ISME J., 6(4):898901, 2012. [84] Steven L. Salzberg, Daniel D. Sommer, Daniela Puiu, and Vincent T. Lee. boosted assembly of a novel bacterial genome from very short reads. Biol, 4(9):e1000186, 09 2008. Gene- PLOS Comput [85] YuWei Wu, Mina Rho, Thomas G. Doak, and Yuzhen Ye. Stitching gene fragments with a network matching algorithm improves gene assembly for metagenomics. formatics, 28(18):i363i369, 2012. Bioin- [86] Rachel Mackelprang, Mark P. Waldrop, Kristen M. DeAngelis, Maude M. David, Krystle L. Chavarria, Steven J. Blazewicz, Edward M. Rubin, and Janet K. Jansson. 113 Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature, 480(7377):368  371, 2011. [87] QiongYi Zhao, Yi Wang, YiMeng Kong, Da Luo, Xuan Li Li, and Pei Hao. Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study. BMC Bioinformatics, 12(Suppl 14)(S2), 2011. [88] Mihai Pop, Adam Phillippy, Arthur L. Delcher, and Steven L. Salzberg. Comparative genome assembly. Briengs in Bioinformatics, 5:237  248, 2004. [89] S. L. Salzberg, D. D. Sommer, D. Puiu, and V. T. Lee. Gene-boosted assembly of a novel bacterial genome from very short reads. PLOS Comput. Biol., 4(9):e1000186, 2008. [90] T. Rausch, S. Koren, G. Denisov, D. Weese, A. K. Emde, A. Doring, and K. Reinert. A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads. Bioinformatics, 25(9):11181124, May 2009. [91] B. E. Dutilh, M. A. Huynen, and M. Strous. Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly. Bioinformatics, 25(21):28782881, Nov 2009. [92] Yann Surget-Groba and Juan I. Montoya-Burgos. Optimization of de novo transcriptome assembly from next-generation sequencing data. Genome Research, 20(10):1432 1440, 2010. [93] J. Nijkamp, W. Winterbach, M. van den Broek, J. M. Daran, M. Reinders, and D. de Ridder. Integrating genome assemblies with MAIA. Bioinformatics, 26(18):i433 439, Sep 2010. [94] J. D. Klein, S. Ossowski, K. Schneeberger, D. Weigel, and D. H. Huson. LOCASa low coverage assembly tool for resequencing projects. PLOS ONE, 6(8):e23455, 2011. [95] Y. Ji, Y. Shi, G. Ding, and Y. Li. A new strategy for better genome assembly from very short reads. BMC Bioinformatics, 12:493, 2011. [96] Y Nishito, Y Osana, T Hachiya, K Popendorf, A Toyoda, A Fujiyama, M Itaya, and Y Sakakibara. Whole genome assembly of a natto production strain Bacillus subtilis natto from very short read data. BMC Genomics, 11:243, 2010. [97] Y. I. Li and R. R. Copley. Scaolding low quality genomes using orthologous protein sequences. Bioinformatics, 29(2):160165, Jan 2013. 114 [98] M. Punta, P. C. Coggill, R. Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang, K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E. L. Sonnhammer, S. R. Eddy, A. Bateman, and R. D. Finn. The Pfam protein families database. Nucleic Acids Res., 40(Database issue):290301, Jan 2012. [99] K. Eric Wommack, Jaysheel Bhavsar, and Jacques Ravel. Metagenomics: read length matters. Appl. Environ. Microbiol., 74(5):14531463, 2008. [100] Eugene W. Myers. The fragment assembly string graph. Bioinformatics, 21(suppl 2):ii79ii85, 2005. [101] Jared T. Simpson and Richard Durbin. Ecient construction of an assembly string graph using the FM-index. Bioinformatics, 26(12):i367i373, 2010. [102] J. T. Simpson and R. Durbin. compressed data structures. Ecient de novo assembly of large genomes using Genome Res., 22(3):549556, Mar 2012. [103] Victorian bioinformatics consortium: VelvetOptimiser. software.velvetoptimiser.shtml. bioinformatics.net.au/ [104] NetworkX: High-productivity software for complex networks. github.io/. http://networkx. [105] Y. Zhang, Y. Sun, and J. R. Cole. A Sensitive and Accurate protein domain cLassication Tool (SALT) for short reads. Bioinformatics, 29(17):21032111, Sep 2013. [106] A. P. Masella, A. K. Bartram, J. M. Truszkowski, D. G. Brown, and J. D. Neufeld. PANDAseq: paired-end assembler for illumina sequences. 2012. 115 BMC Bioinformatics, 13:31,