PROFILE HMM-BASED PROTEIN DOMAIN ANALYSIS OF NEXT-GENERATION
SEQUENCING DATA
By
Yuan Zhang

A DISSERTATION
Submitted to
Michigan State University
in partial fulllment of the requirements
for the degree of
Computer Science  Doctor of Philosophy
2013

ABSTRACT
PROFILE HMM-BASED PROTEIN DOMAIN ANALYSIS OF
NEXT-GENERATION SEQUENCING DATA
By
Yuan Zhang
Sequence analysis is the process of analyzing DNA, RNA or peptide sequences using a
wide range of methodologies in order to understand their functions, structures or evolution history. Next generation sequencing (NGS) technologies generate large-scale sequence
data of high coverage and nucleotide level resolution at low costs, beneting a variety of
research areas such as gene expression proling, metagenomic annotation, ncRNA identication, etc. Therefore, functional analysis of NGS sequences becomes increasingly important
because it provides insightful information, such as gene expression, protein composition, and
phylogenetic complexity, of the species from which the sequences are generated. One basic
step during the functional analysis is to classify genomic sequences into dierent functional
categories, such as protein families or protein domains (or domains for short), which are
independent functional units in a majority of annotated protein sequences.
The state-of-the-art method for protein domain analysis is based on comparative sequence
analysis, which classies query sequences into annotated protein or domain databases. There
are two types of domain analysis methods, pairwise alignment and prole-based similarity
search. The rst one uses pairwise alignment tools such as BLAST to search query genomic
sequences against reference protein sequences in databases such as NCBI-nr. The second one
uses prole HMM-based tools such as HMMER to classify query sequences into annotated
domain families such as Pfam. Compared to the rst method, the prole HMM-based method
has smaller search space and higher sensitivity with remote homolog detection. Therefore, I

focus on prole HMM-based protein domain analysis.
There are several challenges with protein domain analysis of NGS sequences. First, sequences generated by some NGS platforms such as pyrosequencing have relatively high error
rates, making it dicult to classify the sequences into their native domain families. Second,
existing protein domain analysis tools have low sensitivity with short query sequences and
poorly conserved domain families.

Third, the volume of NGS data is usually very large,

making it dicult to assemble short reads into longer contigs. In this work, I focus on addressing these three challenges using dierent methods. To be specic, we have proposed four
tools, HMM-FRAME, MetaDomain, SALT, and SAT-Assembler. HMM-FRAME focuses on
detecting and correcting frameshift errors in sequences generated by pyrosequencing technology, thus accurately classifying metagenomic sequences containing frameshift errors into their
native protein domain families. MetaDomain and SALT are both designed for short reads
generated by NGS technologies. MetaDomain uses relaxed position-specic score thresholds
and alignment positions to increase the sensitivity while keeping the false positive rate at
a low level.

SALT combines both position-specic score thresholds and graph algorithms

and achieves higher accuracy than MetaDomain.

SAT-Assembler conducts targeted gene

assembly from large-scale NGS data. It has smaller memory usage, higher gene coverage,
and lower chimera rate compared with existing tools. Finally, I will make a conclusion on
my work and briey talk about some future work.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my adviser Dr. Yanni Sun. Her decision to admit me as her PhD student four years ago provided me the precious opportunity to study in
Michigan State University and led me to the world of bioinformatics. During these four years
under her guidance I have made continuous progress in several aspects, including reading
research papers, proposing research topics, developing methods, designing experiments, to
writing papers. More importantly, I have gradually improved my ability to both independently and collaboratively conduct in-depth analysis into sophisticated research problems
and use scientic methodologies to solve the challenging problems. She also gave a lot of
suggestions on how to eectively demonstrate our work to the audience, especially to people
from other research areas. This ability is very important in that it will profoundly determine
my capability to collaborate in a team of people from dierent background.
I also want to thank other committee members Dr.
Tan, and Dr.

James R. Cole.

my PhD program.

C. Titus Brown, Dr.

Pang-Ning

They gave a lot of useful suggestions during the course of

I also thank my lab mates Rujira Achawanantakun, Jikai Lei, Cheng

Yuan, Prapaporn Techa-angkoon, and Jiao He.

During these years, we have productive

discussion and cooperation on various research topics and I have obtained great help from
them. I would like to acknowledge my colleagues from BeachMint Inc. during my summer
internship, especially Douglas Cohen, Je Cooper, and Manunya Rozelle. With their help,
I learned how to apply theories and methods to solve challenging problems in industry. I
gratefully acknowledge other faculties and stas of CSE department, especially Dr. Rong
Jin, Dr. Jin Chen, Linda Moore, and Norma Teague. I also owe a lot of thanks to my friends
in MSU and in China. All of them give me a lot of support and help during these years.

iv

My nal and most important acknowledgement must go to my family. They always give
me persistent and determined love and support.

v

TABLE OF CONTENTS

LIST OF TABLES .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

ix

LIST OF FIGURES

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

x

Chapter 1 Introduction

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Next-generation sequencing technologies

1.2

Protein domain analysis of NGS sequences

. . . . . . . . . . . . . . . . . . .

1

1.3

Challenges with protein domain analysis of NGS sequences . . . . . . . . . .

3

Chapter 2 Protein domain classication for metagenomic sequences containing frameshift errors . . . . . . . . . . . . . . . . . . . . . . . .

5

2.1

Background

2.2

Related work

2.3

Method

. . . . . . . . . . . . . . . . . . . .

1

1.1

1

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

2.3.1

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

The augmented Viterbi algorithm for sequencing error correction . . .

11

2.3.3

Running time analysis

. . . . . . . . . . . . . . . . . . . . . . . . . .

14

Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

2.4.1

Accuracy of HMM-FRAME

15

2.4.2

2.4

Error models

2.3.2

Using HMM-FRAME in Targeted Metagenomic

. . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . .

17

2.4.2.1

Protein domain analysis of nifH sequences . . . . . . . . . .

18

2.4.2.2

Protein domain analysis of the bacterial aromatic dioxygenase genes . . . . . . . . . . . . . . . . . . . . . . . . . . . .

20

Protein domain classication in the deep mine data set . . . . . . . .

22

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

Chapter 3 Prole HMM-based protein domain classication for short sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

26

2.4.3
2.5

3.1

Background

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3.2

Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

30

3.3

Method

31

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

26

3.3.1

Pipeline of MetaDomain

3.3.2

The Viterbi algorithm

3.3.3

3.3.3.1

Position specic threshold . . . . . . . . . . . . . . . . . . .

34

3.3.3.2

Alignment trimming

3.3.4

. . . . . . . . . . . . . . . . . . . . . . . . .

32

. . . . . . . . . . . . . . . . . . . . . . . . . .

34

Alignment Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . .

34

. . . . . . . . . . . . . . . . . . . . . .

35

Protein domain classication . . . . . . . . . . . . . . . . . . . . . . .

36

vi

3.4

Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

36

3.4.1

37

Identifying transcribed protein domains in transcriptome . . . . . . .
3.4.1.1

Performance of read classication . . . . . . . . . . . . . . .

37

3.4.1.2

Identifying transcribed domains in the transcriptome . . . .

38

Protein domain analysis in a soil metagenomic data set . . . . . . . .

42

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

46

3.4.2
3.5

Chapter 4 A Sensitive and accurate protein domain classication tool (SALT)
for short reads based on prole HMMs and graph algorithms . 47
4.1

Background

4.2

Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

49

4.3

Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

51

4.3.1

Overview of SALT

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

51

4.3.2

Stage 1: prole HMM-based ltration . . . . . . . . . . . . . . . . . .

54

4.3.2.1

54

4.3.3

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Position-specic score threshold . . . . . . . . . . . . . . . .

Stage 2: contig generation

47

. . . . . . . . . . . . . . . . . . . . . . . .

55

4.3.3.1

Constructing a hit graph for a family . . . . . . . . . . . . .

57

4.3.3.2

Find the

K

60

Stage 3: E-value computation and contig selection . . . . . . . . . . .

61

4.3.5
4.4

. . . . . . . . . . . . . . . . . . .

4.3.4

longest paths

Running time analysis

62

. . . . . . . . . . . . . . . . . . . . . . . . . .

Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

63

4.4.1

Protein domain classication of very short reads . . . . . . . . . . . .

64

4.4.1.1

Determining the true membership of reads . . . . . . . . . .

65

4.4.1.2

Performance evaluation

. . . . . . . . . . . . . . . . . . . .

Arabidopsis

4.4.2

Protein domain classication of an RNA-Seq data of

4.4.3

Protein domain classication of a non-model organism

65

. . .

69

. . . . . . . .

72

4.5

Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

75

4.6

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

77

Chapter 5 A Scalable and Accurate Targeted gene Assembly tool (SATAssembler) for NGS data . . . . . . . . . . . . . . . . . . . . . . . .

78

5.1

Background

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

78

5.2

Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

81

5.3

Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

82

5.3.1

Overview of SAT-assembler

82

5.3.2

Prole HMM-based homology search

. . . . . . . . . . . . . . . . . .

83

5.3.3

Alignment informed graph construction . . . . . . . . . . . . . . . . .

84

5.3.4

Pruning and optimization of overlap graphs

87

5.3.5

Guided graph traversal using multiple information . . . . . . . . . . .

88

5.3.6

Contig scaolding . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

90

5.3.7

Running time analysis

. . . . . . . . . . . . . . . . . . . . . . . . . .

91

Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

91

5.4.1

. . . . . . . .

93

5.4.1.1

Edge creation performance . . . . . . . . . . . . . . . . . . .

93

5.4.1.2

Performance comparison with other assembly tools

95

5.4

. . . . . . . . . . . . . . . . . . . . . . .

Gene assembly in an RNA-Seq data set of

vii

. . . . . . . . . . . . . .

Arabidopsis

. . . . .

5.4.2
5.5

Targeted gene assembly in a human gut metagenomic data set . . . .

98

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

100

Chapter 6 Conclusion and future work
Bibliography

.

.

.

.

.

.

.

.

.

.

.

.

.

. . . . . . . . . . . . . . . . . . . . . .
.

viii

.

. . . . . . . . . . . . . . . . . . .

101
105

LIST OF TABLES

Table 2.1

Comparing the error detection performance of HMM-FRAME, GeneWise, and FragGeneScan.

Table 3.1

. . . . . . . . . . . . . . . . . . . . . .

Number of transcribed and non-transcribed domains using dierent
cutos (N) for the number of mapped reads.

Table 4.1

40

Burkholderia cenocepacia.

. . . . . . . . . . . .

68

Performance comparison of SALT against the other classiers on the
RNA-Seq data set of

Table 4.3

. . . . . . . . . . . . .

Performance comparison of SALT against the other classiers on the
RNA-Seq data set of

Table 4.2

17

Arabidopsis.

. . . . . . . . . . . . . . . . . . .

Classication results generated by dierent classiers on the

balthica

71

Radix

data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Table 4.4

Description of transcribed families uniquely identied by SALT.

Table 5.1

Edge creation performance of three strategies on the RNA-Seq data
set of

Table 5.2

Arabidopsis.

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

74

94

Performance comparison between dierent assembly tools on the RNASeq data set of

Table 5.3

. .

73

Arabidopsis. .

. . . . . . . . . . . . . . . . . . . . . .

98

Performance comparison between dierent assembly tools in assembling genes from butyrate kinase family on the human gut metagenomic data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

ix

99

LIST OF FIGURES

Figure 2.1

Frameshifts cause short alignments with marginal scores . . . . . . .

Figure 2.2

6

Change of HMMER alignments' scores, lengths, and E-values (in log
space) before and after error correction for nifH sequences. (For interpretation of the references to color in this and all other gures, the
reader is referred to the electronic version of this dissertation) . . . .

Figure 2.3

20

Change of HMMER alignments' lengths, scores, and E-values (in log
space) before and after error correction for the bacterial aromatic
dioxygenase genes in a soil sample. . . . . . . . . . . . . . . . . . . .

Figure 2.4

Protein domain classication results for the black sample in the deep
mine data set.

Figure 3.1

22

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

Change of the read classication sensitivity of HMMER over read
length and the average sequence identity of domain families.

. . . .

27

Figure 3.2

Histogram of the average pairwise sequence identity for 2558 domains

29

Figure 3.3

Three types of alignment distributions.

. . . . . . . . . . . . . . . .

32

Figure 3.4

Pipeline of MetaDomain.

. . . . . . . . . . . . . . . . . . . . . . . .

33

Figure 3.5

Read classication sensitivity and FP rate of HMMER and MetaDomain. The size of each bubble represents the number of data points
(i.e., domains) with the same sensitivity and FP rate.

. . . . . . . .

39

Figure 3.6

ROC curves of HMMER and MetaDomain. . . . . . . . . . . . . . .

41

Figure 3.7

Read length distribution in the soil data set.

. . . . . . . . . . . . .

43

Figure 3.8

Reads aligned by HMMER and MetaDomain.

. . . . . . . . . . . .

44

Figure 3.9

The distributions of aligned reads for PF09703 by HMMER and MetaDomain.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

x

45

Figure 4.1

Two genes, their domain organizations, and the sequenced reads. Domain X occurs in two dierent genes. Both genes are transcribed and
sequenced. Red lines: positive reads. Blue lines: negative reads.

. .

52

Figure 4.2

The pipeline of SALT. . . . . . . . . . . . . . . . . . . . . . . . . . .

53

Figure 4.3

(A) Thirteen reads and their alignment layout w.r.t. the prole HMM
represented by its matching states. The alignment scores are shown
in the table. Blue reads: negative reads. Red reads: positive reads.
(B) The constructed hit graph when

k ∗ = 4.

For simplicity of ex-

planation, mismatches are not allowed in this simple example (i.e.

e = 0).

Red nodes are created by positive reads. Blue nodes are cre-

ated by negative reads. (C) The hit graph after removing transitive
overlaps and adding the root node. . . . . . . . . . . . . . . . . . . .
Figure 4.4

ROC curves of dierent classiers.

56

HHblits and SSAKE+HMMER

are listed in separate embedded windows because their FP rates are
orders of magnitude larger than others.
Figure 4.5

67

A Venn diagram of the transcribed families identied by dierent
classiers.

Figure 5.1

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The pipeline of SAT-assembler.

73

Reads of the same color belong to

the same gene family. Reads from dierent genes of the same family
are distinguished using dierent patterns. Reads shared by multiple
genes from the same family have multiple patterns. . . . . . . . . . .
Figure 5.2

(A)Two reads

a

and

b

83

sequenced from dierent genes of the same

family are aligned to the prole HMM of the family. Their sequence
overlap is indicated in red.

(B) Read

a

and read

b

have an align-

ment overlap of 66 and a sequence overlap of 25 (in bold). (C)The
alignment between the translated peptides of
Figure 5.3

a

and

A graph containing reads from two dierent genes
red (v1 ,
gene

B

v4 ,

and

v7 )

and in blue (v2 ,

v5 ,

and

b

is 22 residues.

A and B .

Nodes in

v8 ) are from gene A and
v6 ) are chimeric nodes

respectively. Nodes in black (v3 and

because they are shared by the two genes.
are real edges.

Arrows with solid lines

Arrows with dotted lines and dashed lines indicate

paired-end reads and transitive edges between two nodes respectively.
Figure 5.4

86

Three contigs generated from a metagenomic data set.

88

The green

parts of the contigs are contained in the target gene and thus are
gene segments. The blue parts of the contigs are not gene segmetns.

xi

92

Figure 5.5

Chimera rate versus gene coverage when

k-mer

old changes for dierent assembly tools.

size or overlap thresh-

These values are average

values of the assemblers' performance on 3,188 input families. . . . .

xii

96

Chapter 1
Introduction
1.1

Next-generation sequencing technologies

In bioinformatics, sequencing means to determine the primary structure of a biological sequence.

Prior to new sequencing technologies, Sanger sequencing is the main method for

sequencing DNA. However, this technique has several limitations. For example, Sanger sequencing is not applicable to sequencing a small amount of DNA, making it expensive and
not accessible to most small labs. Also, the length of the DNA being sequenced is limited.
Next generation sequencing (NGS) technologies are developed at the demand of low-cost
sequencing technologies.
lel method.

These new sequencing technologies make use of massive paral-

They can produce large-scale sequence data at low costs.

These advantages

make large scale sequencing within the reach of many scientists. Moreover, new sequencing
technologies generate much more sequence data that has high coverage and nucleotide level
resolution per run.

1.2

Protein domain analysis of NGS sequences

Inferring functions from sequences is important in analyzing dierent types of data generated by NGS technologies.

One basic step during the functional analysis is to classify

NGS sequences into annotated functional categories, such as protein families or protein do-

1

main families.

Protein domain analysis has been widely used for functional annotations

of RNA-Seq data [1, 2, 3, 4].

In particular, quantifying the expression levels of protein

domains helps us understand how transcriptional changes of domains are associated with
sequencing conditions, sampling tissues, or experimental treatments in RNA-Seq data. For
example, computational domain analysis was applied to identify domains that play a role
in vernalization and eux transporters in the gibberellin response in sugar beet [1].

Do-

main analysis is also frequently used to evaluate and compare gene annotation quality of
dierent gene-nding tools [3] or to compare domain composition of data sampled using
dierent techniques [4]. Protein domain analysis has also been used to understand the phylogenetic complexity and biological functions of mycrobial communities, as well as their
interactions with the host [5, 6, 7]. For example, Ellrott et al. investigated the distribution
of protein families in the currently available human gut genomic and metagenomic data [8].
Schlüter et al. applied HMMER to understand the genetic diversity and composition of a
plasmid metagenome from a wastewater treatment plant [9].

The phylogenetic algorithm

CARMA [10] uses all Pfam domain and protein families as phylogenetic markers to identify
the source organisms of environmental DNA fragments.
There are two major comparative methods for protein domain analysis. The rst method
is based on pairwise sequence alignment tools such as BLAST software suite [11].

Query

sequences are classied via comparison with annotated protein databases such as NCBI-nr
using BLASTX [11]. The second method is prole-based similarity search, which classies
queries into characterized protein domain or family databases such as Pfam [12], TIGRFAM [13], FIGfams [14], etc. There also exist comprehensive protein domain search tools
such as InterProScan [15], which combines dierent sequence and prole-based domain recognition methods from the InterPro [16] consortium member databases into one resource.

2

Although BLAST is one of the most ecient protein homology search tools, probabilistic
model-based methods have much better sensitivity for remote protein homology recognition.

Using using prole hidden Markov models (HMMs) to represent a protein family

greatly improves homology search sensitivity between highly diverged sequences [17]. Thus
it is desirable to conduct protein domain classication using prole HMM-based tools such as
HMMER [18]. In conjunction with a fast-growing protein domain family database Pfam [12],
which contains over 10,000 annotated protein domain families, HMMER is able to classify
sequences into dierent domain families with high accuracy. In addition, the latest implementation of prole HMM-based domain classication tool HMMER 3.0 [18] has achieved
comparable speed to BLAST, making it suitable for large-scale protein compositional analysis. For the convenience of discussion, we use HMMER to refer to HMMER 3.0 hereafter
unless otherwise specied.

1.3

Challenges with protein domain analysis of NGS sequences

Although prole HMM-based methods have been successfully applied to genome-wide domain analysis, there are still many challenges with protein domain analysis of NGS sequences,
especially complex metagenomic data. First, sequences generated by some NGS platforms
such as pyrosequencing technology have sequencing errors, including insertions or deletions of
nucleotides, especially in homopolymer regions. These errors create frameshifts during translation, making it dicult to classify the derived peptide sequences into their native families.
Second, when the length of the query reads decreases, existing tools have low sensitivity
in classifying these short reads, especially for domain families of poor conservation. Many

3

sequencing technologies such as Illumina still generate short reads of 35 bp to 150 bp. Moreover, protein sequences encoded in individual metagenomic sequence reads may share only
a small overlap with existing protein families. Therefore, a sizable portion of various data
set still contain short reads. Third, microbial communities usually contain a large number
of dierent microbial species, complicating the functional annotation of metagenomic data.
In order to address these challenges and improve performance of protein domain classication, We have proposed three tools: HMM-FRAME, MetaDomain, and SALT. HMMFRAME is designed to accurately classify metagenomic sequences containing frameshift errors. MetaDomain and SALT are designed to directly classify short reads into their native
protein domain families with better sensitivity than existing tools. Compared to MetaDomain, SALT incorporates graph algorithms to improve accuracy of protein domain classication. In my future work, I will focus on accurate and scalable gene assembly from complex
metagenomic data.

4

Chapter 2
Protein domain classication for
metagenomic sequences containing
frameshift errors
2.1

Background

Culture-independent methods and high-throughput sequencing technologies now enable us
to obtain community random genomes (metagenomes) from dierent habitats such as arctic
soils and mammalian gut. Currently, metagenomic annotation focuses on phylogenetic complexity and protein composition analysis. An important component in protein composition
analysis is protein domain classication, which classies a putative protein sequence into annotated domain families and thus aids in functional analysis. Prole HMM-based alignment
is the state-of-the-art method for protein domain classication because of its high sensitivity
in classifying remote homologs. In conjunction with the Pfam database, HMMER [18] can
accurately classify query protein sequences into existing domain families. In addition, the
latest version of HMMER can achieve comparable speed to BLAST, making it applicable to
large-scale metagenomic data sets.
However, HMMER cannot optimally classify sequences containing frameshift errors. In

5

HMMER's domain analysis, six-frame translations of a sequence read or a predicted gene
fragment are aligned with annotated protein domain families using HMMER. One problem
of this method is that sequencing errors, including insertions or deletions of nucleotides,
create frameshifts during translation. As a result, the derived peptide sequences are likely
to generate alignments with marginal scores. As HMMER uses alignment scores, E-values,
or lengths to determine family membership, these reads become unclassiable or can be
falsely recognized as novel" proteins during downstream analysis. Figure 2.1 illustrates how
insertion or deletion errors cause marginal alignment scores.

X1X2X3 X4X5X6 X7X8X9 X10X11X12 X13X14X15 X16X17X18 X19X20X21
aa11

aa12

aa13

aa14

aa15

aa16

aa17

X1X2X3 X4X5X6 X X7X8X9 X10X11X12 X13X14X15 Y X16X17X18 X19X20X21
aa11

aa12

aa23

aa24

aa25

aa36

aa37

Figure 2.1: Frameshifts cause short alignments with marginal scores

In Figure 2.1,
is the

j th

Xi

is the

ith

base of a DNA sequence. Every codon is underscored.

amino acid of a peptide sequence derived under reading frame

i.

ai
j

The correct

peptide sequence can be derived from the error-free sequence (shown on the top of the
gure) under reading frame 1. Because of insertions of two nucleotides (bolded X and Y),
the correct peptide sequence is the concatenation of three short peptide sequences derived
using dierent reading frames. Thus, each peptide sequence derived using one reading frame
can only generate short alignments with insignicant scores.
This problem is more serious in domain analysis for metagenomic data sets. Given the
high complexity of many metagenomic data sets, high-quality genome assembly is not always

6

available. Thus, protein annotation can only be conducted on short sequence reads. The
average read length varies from 25-35 to around 400 bases for the next-generation sequencing
methods currently in use. On average there is about one open reading frame per 1000 base
pairs in bacteria genomes. Depending on gene size, many gene fragments in metagenomic
sequence reads may share only a small overlap with existing domain families, generating
even shorter prole HMM alignments with signicantly lower scores.
Although a number of tools [19, 20, 21, 22, 23, 24] exist for frameshift detection, they are
not designed for protein domain classication using prole HMMs. In addition, these tools
have not incorporated sequencing error patterns associated with next generation sequencing
technologies. A clear disadvantage is that they do not distinguish between error rates in and
out of homopolymer regions in pyrosequencing reads. The goal of this work is to design an
accurate prole HMM alignment method that can incorporate any given error pattern. Our
experiments show that our tool has high sensitivity (>

95%)

in detecting sequencing errors

and has a low false positive rate (∼ 0.15%). By correcting insertion and deletion errors, it
can generate longer alignments with signicantly higher alignment scores, and thus provide
more accurate protein domain classication.

2.2

Related work

A number of programs exist to handle frameshifts through DNA versus protein sequence
alignment. The simplest methods discard sequences that might contain frameshifts rather
than trying to correct them. For example, BLASTX provides insightful information about
whether a query DNA sequence contains frameshifts using six-frame translations. However,
it neither explicitly outputs positions of insertions or deletions that create frameshifts, nor

7

does it try to x them by constructing an alignment from pieces obtained from dierent
reading frames. Other tools are available to detect and x frameshift errors automatically.
Frame [19] uses BLASTX to compare all six reading frames of the query nucleotide sequence
against protein sequences. Then the aligned regions are combined for frameshift detection.
Guan et al. [20], Zhang et al. [21], and Halperin et al. [22] describe dynamic programming
algorithms for frameshift detection during pairwise DNA and protein sequence alignment.
Instead of using all reading frames of a DNA sequence to maximize the alignment score,
another group of tools [23, 24] translate a protein sequence back into DNA sequences and
formulate the alignment problem as a network matching problem. Frameshift detection has
also been applied to nding distant protein homologies where the divergence is the result of
frameshift mutations and substitutions [25, 26, 27].
Some gene-nding tools detect frameshifts.

FrameD [28] relies on a directed acyclic

graph for gene prediction in the presence of frameshifts. Kislyuk et al. [29] apply an

ab initio

method to detect possible frameshifts from coding potential generated by GeneMark [30].
GeneTack [31] and FragGeneScan [32] use hidden Markov models for

ab initio

frameshift

detection in gene nding.
Despite the extensive study of frameshift detection, the above programs are not designed
for protein family classication through DNA versus protein family alignment. Alternatively,
GeneWise [33], a widely used DNA versus protein alignment tool, allows comparison of
a DNA sequence with a prole HMM. Our algorithm diers from GeneWise by explicitly
incorporating a position-specic error model that is trained on data from dierent sequencing
platforms such as 454 GS FLX Titanium.

8

2.3

Method

The representative protein domain classication tool HMMER [18] classies a query protein
sequence into a prole HMM-represented protein family using the Viterbi or the Forward
algorithm [17].

The Viterbi algorithm aligns a query protein sequence to a prole HMM

by searching for the most probable state path in the model. If the alignment score or Evalue meets the pre-dened threshold, the query is classied into the corresponding family.
The alignment generated by the Viterbi algorithm only accounts for the dierence caused
by evolutionary divergence between a sequence and a protein family.

In order to classify

error-containing sequences into their native families, the alignment algorithm must detect
the dierences resulted from both evolution and sequencing errors.
In this section, we describe HMM-FRAME, the implementation of an augmented Viterbi
algorithm that searches for the optimal alignment between a DNA query and a prole HMM
by considering both evolutionary divergence and sequencing errors. HMM-FRAME diers
from HMMER in the following ways: 1) HMM-FRAME directly accepts a DNA sequence
as input, 2) HMM-FRAME accepts a sequencing error model as input, 3) HMM-FRAME
can detect and x frameshifts caused by sequencing errors in the DNA sequence.

The

output alignment indicates which bases are inserted or deleted due to evolutionary change
or sequencing error.

2.3.1

Error models

Here we describe the error models used in our experiments. Dierent sequencing technologies
may have dierent types of errors. For example, previous work [34, 35, 36] has shown that
insertions and deletions occur more often in homopolymer regions than in non-homopolymer

9

regions for pyrosequencing reads.

Substitution errors occur more often than insertions or

deletions in Illumina sequencing reads. Because deletion or insertion errors cause frameshifts,
we focus on applying HMM-FRAME to pyrosequencing data sets.
In this work, we consider two error models. The rst one is a published model trained
from GS20 sequencing reads [34]. The insertion and deletion error rates in non-homopolymer
and homopolymer regions are 0.0007 and 0.0044, respectively.

The second error model is

computed on data from FLX Titanium sequencing platform. We obtained a set of Titanium
sequence reads (Cole and Wang, unpublished) extracted from the region H of the 16S rRNA,
which were amplied from the Baylor mock community (22 strains, 24 sequences).

Then

we computed error rates using insertions and deletions that were annotated by generating
careful Needleman-Wunsch alignments between the Titanium sequencing reads and the control sequences. In total, 7,040 sequences passed the initial quality control of RDP [37] after
contamination and chimera detection. There were 1,721 insertion and deletion errors. Note
that PCR, which was used to generate the amplicons of the sample, can introduce errors.
However, because most of the errors introduced by PCR are substitution errors, we assumed
that the deletions and insertions were mainly sequencing errors. The derived error rates for
homopolymers of dierent sizes were: 1: 0.000532, 2: 0.000698, 3: 0.00102, 4: 0.000688,
5: 0.0372, 6: 0.00167, 7: 0.143, where the rst number is the size of homopolymer regions
(1 means non-homopolymer) and the second number is the rate of insertion and deletion
errors. If we sum the error rates for homopolymer regions of dierent sizes, the insertion
and deletion error rates for non-homopolymer and homopolymer regions were 0.0005 and
0.001, respectively. They are slightly smaller than the published G20 error rates [34]. We
will compare their performance on a data set with annotated errors in the section 2.4.

10

2.3.2

The augmented Viterbi algorithm for sequencing error correction

Let

π

be a state path in a prole HMM

in a DNA sequence

π∗

x.

M.

Let

r

be a set of insertion and deletion positions

The augmented Viterbi algorithm searches for the most probable path

and the most probably error position set

r∗

such that

(π ∗ , r∗ ) = argmax(π,r) P (x, π, r).

Intuitively this algorithm searches for an optimal alignment between a DNA sequence and
a prole HMM by simultaneously considering 1) evolutionary divergence (i.e. the insertion,
deletion, and substitution of amino acids) and 2) sequencing errors (i.e. insertion and deletion
of nucleotides). To solve the above equation, we rst divide the search space according to
dierent types of sequencing errors inside a codon and between two consecutive codons. For
each type of error, we search for the most probable state path.

Input:
M

a DNA sequence

a prole HMM

M,

and a sequencing error model. Notations of

and the error model will be described below.

Output:
in

x,

the optimal alignment between DNA sequence

x and M ,

as well as error positions

r.

Algorithm:

we rst dene notations that will be used in the dynamic programming equa-

tions.

• Notations about the prole HMM M :
sertion, and deletion states in
to

s2 . es (T (xi−2 xi−1 xi ))

T (xi−2 xi−1 xi ),

M . as1 s2

States

Mj , Ij ,

and

which is translated from the codon

M,

guide of HMMER [18]. State

are matching, in-

is the transition probability from state

s1

s

to emit amino acid

xi−2 xi−1 xi .

For a detailed de-

is the emission probability for state

scription of a prole HMM

Dj

we refer the reader to the textbook [17] and the users'

Gj

is the only state that is not dened in prole HMMs

11

from HMMER 3.0. It encodes insertions of nucleotides between codons.
transition probability from matching state
set to the insertion error probability.

Mj

aGj Gj

aMj Gj

to nucleotide insertion state

is the

Gj .

It is

is the self-transition probability for

Gj ,

encoding the probability of consecutive insertions. When consecutive insertion is not
allowed, it is set to 0.
matching state

Mj .

aGj−1 Mj

is the transition probability from

to the next

When only one insertion error is allowed, it is set to 1.0.

• Notations about the sequencing error model: pI (xi )
is an insertion error.

Gj−1

pD (xi )

is the probability that base

xi

is the probability that there is a deletion error after base

xi .
• Subproblems and the recursive equations:

Based on our analysis of error patterns, it

is very rare that there are consecutive insertions or deletions in a sequence read. Thus,
the following DP algorithm assumes that there is at most one insertion or deletion
inside a codon. The algorithm can be extended to handle all possible cases.

 VjM (i)

is the score of the best alignment matching subsequence

model up to the matching state

Mj ,

given that

xi

and this codon encodes an amino acid emitted by

 VjI (i)

Ij ,

given that

to the sub-

is the third base of a codon

Mj .

is the score of the best alignment matching subsequence

model up to the insertion state

x1..i

T (xi−2 xi−1 xi )

 VjG (i) is the score of the best alignment ending in xi

x1..i

to the sub-

is emitted by

Ij .

being emitted by state

Gj ,

which encodes an insertion of nucleotides between codons.

 VjD (i)

is the score of the best alignment matching subsequence

model up to the deletion state

Dj .

12

x1..i

to the sub-

VjM (i) = max{
case I : no sequencing error in the codon xi−2 xi−1 xi :
M
eMj (T (xi−2 xi−1 xi )) × Vj−1 (i − 3) × aMj−1 Mj ,
I
eMj (T (xi−2 xi−1 xi )) × Vj−1 (i − 3) × aIj−1 Mj ,
D
eMj (T (xi−2 xi−1 xi )) × Vj−1 (i − 3) × aDj−1 Mj ,
G
eMj (T (xi−2 xi−1 xi )) × pI (xi−3 ) × Vj−1 (i − 3) × aGj−1 Mj ,

case II : nucleotide xi−1 is an insertion :
M
eMj (T (xi−3 xi−2 xi )) × pI (xi−1 ) × Vj−1 (i − 4) × aMj−1 Mj ,
I
eMj (T (xi−3 xi−2 xi )) × pI (xi−1 ) × Vj−1 (i − 4) × aIj−1 Mj ,
D
eMj (T (xi−3 xi−2 xi )) × pI (xi−1 ) × Vj−1 (i − 4) × aDj−1 Mj ,
G
eMj (T (xi−3 xi−2 xi )) × pI (xi−1 ) × Vj−1 (i − 4) × aGj−1 Mj ,

case III : nucleotide xi−2 is an insertion :
Repeat the above f our equations f or eMj (T (xi−3 xi−1 xi )),
case IV : there is a deleted nucleotide (represented by d) between xi−1 and xi :
M
eMj (T (xi−1 d xi )) × pD (xi−1 ) × Vj−1 (i − 3) × aMj−1 Mj ,
I
eMj (T (xi−1 d xi )) × pD (xi−1 ) × Vj−1 (i − 3) × aIj−1 Mj ,
D
eMj (T (xi−1 d xi )) × pD (xi−1 ) × Vj−1 (i − 3) × aDj−1 Mj ,
G
eMj (T (xi−1 d xi )) × pD (xi−1 ) × Vj−1 (i − 3) × aGj−1 Mj ,

case V : there is a deleted nucleotide between xi−2 and xi−1 :
Repeat the above f our equations f or eMj (T (d xi−1 xi )).
}

13

In cases IV and V, we use
emission probability of

d

to represent the deleted bases. We choose

T (xi−1 d xi )

(or

T (d xi−1 xi ))

d

to maximize the

in the matching state

Mj .

VjI (i) = max{eIj (T (xi−2 xi−1 xi )) × VjM (i − 3) × aMj Ij , eIj (T (xi−2 xi−1 xi )) × VjI (i − 3) × aIj Ij }

VjG (i) = max{pI (xi ) × VjM (i − 1) × aMj Gj , pI (xi ) × VjG (i − 1) × aGj Gj }

M
D
VjD (i) = max{Vj−1 (i) × aMj−1 Dj , Vj−1 (i) × aDj−1 Dj }

2.3.3

Running time analysis

The time complexity of the above dynamic programming algorithm is
is the length of input DNA sequence and

|M |

O(δ|x||M |),

is the number of states in

M. δ

where

|x|

is the number

of dierent types of errors inside a codon plus the case of insertions between two codons. In
our current implementation,

δ = 26,

which renders a longer running time than the standard

Viterbi algorithm. Thus, it is not practical to compare millions of metagenomic sequence
reads to over 10,000 protein families in Pfam.

Instead, we only run HMM-FRAME on

sequences that are likely to contain insertion or deletion errors. For large-scale applications,
we suggest applying HMMER, which is as fast as BLAST, to all input sequence reads using
a big E-value cuto (such as 100). Alignments covering at least 80% of the translated DNA
sequence with signicant E-values can be classied by HMMER in this step. Sequence reads
that do not yield any partial alignments are unlikely to be members of any protein family.

14

Thus, we only apply HMM-FRAME to reads yielding partial alignment with marginal scores
because these reads could potentially contain sequencing errors.

2.4

Results

In this section, we compare the sensitivity and false positive rates (FP rates) of HMMFRAME with GeneWise [33] and FragGeneScan [32].

We then apply HMM-FRAME to

Targeted Metagenomics and a published metagenomic data set. Our experimental results
show that the length, scores, and E-values of prole HMM alignments are signicantly improved after error correction. As prole HMM-based alignment tools determine membership
by comparing E-value or length with user-dened thresholds, the improvement of these parameters enables more error-containing sequences to be classied into their native families.

2.4.1

Accuracy of HMM-FRAME

In order to evaluate the accuracy of HMM-FRAME in detecting insertion and deletion errors,
we obtained a control data set with annotated error positions from RDP (Cole and Wang,
unpublished).

In this data set, NifH gene families from the

strain DCB-2, the

Burkholderia xenovorans

Desultobacterium hafniense

strain LB40, and the PCC 7120 strain of

abaena were amplied and then sequenced using 454 Titanium.

An-

The sequenced gene families

were aligned with the nifH genes in these three organisms using the Needleman-Wunsch
algorithm. Insertion and deletion errors were identied from the alignments. After contamination and chimera screening, we had 18,900 sequences, of which 3,408 sequences contained
4,623 insertion or deletion errors. We conducted the protein domain analysis on the 18,900
sequences using HMM-FRAME under the two error models presented in the

15

Method

Sec-

tion. The input prole HMM was trained on 25 nifH genes obtained from RDP's functional
gene repository website [38].
We evaluated the performance of error-prediction tools using two types of sensitivity
and FP rates. Let

S

S+

be the set of error-containing sequences in the control data set. Let

be the set of predicted error-containing sequences.

FP rate

S∩S +
S+

are

and

S−S + ,
S

respectively.

The

Similarly, let

Sequence-level sensitivity

Q+

be the set of insertion and

deletion positions in error-containing sequences from the control data set. Let
of predicted error positions. The

Base-level sensitivity

and

and

FP rate

are

Q

Q+ ∩Q
Q+

be the set

and

Q−Q+
Q ,

respectively.
Using the control data set, we rst evaluated the performance of HMM-FRAME under
the published GS20 and our self-trained Titanium error models.

Then we compared the

performance of HMM-FRAME with GeneWise [33] and FragGeneScan [32]. Similar to HMMFRAME, GeneWise can directly compare DNA sequences with a prole HMM and can accept
user-dened error rates. We tested GeneWise using dierent parameters including error rates
and the alignment score thresholds (ranging from 0 to 20). The results with the best tradeo
between sensitivity and FP rate were kept for comparison with HMM-FRAME.
FragGeneScan [32] is a newly developed gene prediction tool for short and error-prone
sequences.

It predicts genes and identies sequencing errors inside predicted genes.

We

applied FragGeneScan on the above sequence set (all genes) and tested its sensitivity and
FP rate. FragGeneScan successfully recognized all input as protein-coding genes, rendering
a high gene-prediction sensitivity in this data set. However, FragGeneScan had higher FP
rates than HMM-FRAME in error detection. The results are summarized in Table 2.1.
Sensitivity and FP rate of each program when detecting annotated insertion and deletion
errors in nifH genes. seq-sen: sequence-level sensitivity. base-sen: base-level sensitivity. seq-

16

FP: sequence-level FP rate. base-FP: base-level FP rate. The score cuto of GeneWise is
set to zero to maximize the sensitivity. As GeneWise has low sequence-level sensitivity, we
did not evaluate its performance at the base-level.
Table 2.1: Comparing the error detection performance of HMM-FRAME,
GeneWise, and FragGeneScan.
HMM-FRAME:

HMM-FRAME:

G20

self-trained

base-sen

95.25%
85.08%

seq-FP

0.154%

base-FP

2.1%

seq-sen

90.6%

GeneWise

FragGeneScan

53.8%

83.04%

82.4%

0
0.003%

53.39%
0.001%

0.7%
59.57%

Sensitivity and FP rate of each program when detecting annotated insertion
and deletion errors in nifH genes. seq-sen: sequence-level sensitivity. basesen: base-level sensitivity. seq-FP: sequence-level FP rate. base-FP: baselevel FP rate. The score cuto of GeneWise is set to zero to maximize the
sensitivity. As GeneWise has low sequence-level sensitivity, we did not evaluate its performance at the base-level.

As shown in Table 2.1, each tool has higher sensitivity and smaller FP rates in identifying
error-containing sequences than in locating error positions.

HMM-FRAME has a better

tradeo between sensitivity and FP rate than both GeneWise and FragGeneScan. Both GS20
and our self-trained Titanium error models have small FP rates in predicting error positions,
but GS20 has higher sensitivity. Thus, we plan to use GS20 in all further experiments.

2.4.2

Using HMM-FRAME in Targeted Metagenomic

In this section, we present the utility of HMM-FRAME in two applications of Targeted
Metagenomics", where one or several gene families are amplied from environmental DNA
and these amplicons are sequenced using high-throughput sequencing platforms. One typical
application of Targeted Metagenomics is to sequence the amplicons of the 16S rRNA gene
for phylogenetic complexity analysis.

Besides 16S rRNA, protein-coding genes that are

17

important to a particular habitat can be amplied and sequenced for targeted functional
analysis in metagenomic data sets. For example, Targeted Metagenomics of the nifH gene,
which encodes nitrogenase reductase, is important for analyzing microbial genomes sequenced
from soil. Although these sequences are sampled from one or several targeted gene families,
frameshift errors can cause short alignments with marginal scores between the input and the
targeted gene families. As a result, sequences lacking signicant alignment length and scores
will be regarded as contaminants and be discarded. Thus, it is desirable to x frameshift
errors to maximize the number of usable samples. Given a DNA read and a prole HMM
built from a set of known protein sequences, HMM-FRAME can be applied to detect and
correct frameshift errors in amplicon reads.

2.4.2.1

Protein domain analysis of nifH sequences

In the rst experiment, we obtained 3,937 nifH sequences of an average length of 76 bases
generated by the 454 FLX sequencing technology.

In order to discard contaminants that

originated from non-target genes, we aligned the 3,937 sequences with the nifH gene family,
which was built on a small set of 25 expert-veried full-length nifH protein reference sequences from RDP's functional gene repository [38]. In the gene family building process, we
rst applied ClustalW [39] to align the 25 reference sequences. Then we applied HMMER
3.0's hmmbuild program to derive a prole HMM from the multiple sequence alignment. Of
the 3,937 454 FLX sequences, 111 were found to be contaminants and were excluded from
further analysis. Of the remaining 3,826 sequences, HMM-FRAME detected 296 insertions
and deletions in 256 sequences. Thus, approximately, 7% of the samples contained frameshift
errors. Of the 256 sequences containing insertion or deletion errors, 224 (87.5%) only contained one insertion or deletion error. 24 (9.4%) sequences contained two errors, and eight

18

(3.1%) contained three errors. Of the 296 insertions or deletions, 224 (75.7%) were inside or
beside homopolymer regions.
Because protein domain classication tools compare alignment lengths, scores, and Evalues with pre-dened thresholds to determine a sequence's membership, the changes in
the alignments aect the nal domain composition analysis. After error correction, prole
HMM-based alignment tools are expected to generate longer alignments with bigger scores
and smaller E-values. This gives error-containing sequences a better chance of being classied
into the correct families rather than being labeled contaminants.
In order to conduct a fair comparison on alignments before and after error correction,
we choose a third-party tool HMMER to generate alignments for original and corrected
sequences.

The changes of alignments' E-values and lengths due to error correction are

presented in Figure 2.2.

In this gure, the changes of alignments are presented for 256

sequences in which HMM-FRAME detects errors. Original" refers to HMMER alignments
on sequences before error correction. Corrected" refers to HMMER alignments on sequences
after error correction by HMM-FRAME. As a comparison, we also plot the length of the
original sequence reads (with the legend sequence read").

They largely overlap with the

length of corrected alignments, indicating that complete sequence reads can be aligned with
the nifH prole HMM after error correction.
In order to test whether the improvement was statistically signicant, we conducted a
two-sample Kolmogorov-Smirnov test (K-S test) on the alignments' lengths and E-values
before and after error correction. The p-values for the alignments' length and E-value distributions were 3.1037e-010 and 1.1802e-045, respectively.

In particular, the comparison

between alignments' lengths and the sequence reads' lengths shows that most partial alignments generated by error-containing sequences become complete alignments after error cor-

19

90

Lengths and E-values of pHMM alignments

70
50

LOG(original E-value)
LOG(corrected E-value)
original length
corrected length
Sequence read length

30

10
-10

1

21

41

61

81

101

121

141

161

181

201

221

241

-30
-50
NifH family reads

Figure 2.2:

Change of HMMER alignments' scores, lengths, and E-values (in log space)

before and after error correction for nifH sequences.

(For interpretation of the references

to color in this and all other gures, the reader is referred to the electronic version of this
dissertation)

rection. Thus, when comparatively longer alignments (e.g., 23 amino acids or 69 bases) are
required for domain classication, more sequence reads (213 more under when the threshold
is 69 bases) will be classied into their native families.

2.4.2.2

Protein domain analysis of the bacterial aromatic dioxygenase genes

In the second experiment, we obtained 2486 pyrosequencing samples of an average length
of 224 bases from the bacterial aromatic dioxygenase genes in a soil sample [40]. Although
these pyrosequencing reads were sequenced from the 5' end of PCR amplicons of bacterial
aromatic dioxygenase genes, we were interested in classifying them into three sub-families
of dioxygenase genes: toluene/biphenyl, naphthalene, and benzoate [41].

20

Note that there

is another subfamily (phthalate). However, due to lack of training proteins for this family
(Dr.

Iwai, personal communication), we only searched for members of three sub-families.

Three sets of reference protein sequences were extracted from Pfam [12] for toluene/biphenyl,
naphthalene, and benzoate [41]. Based on these training sets, we built three prole HMMs
using ClustalW and HMMER. Then we applied HMM-FRAME to align the 2486 reads with
the three prole HMMs. HMM-FRAME detected 77 insertions and 52 deletions, which were
distributed in 121 sequences. Of the 121 error-containing sequences, 77 could not be classied
into any subfamily by HMMER under the E-value threshold 0.1. After error correction using
HMM-FRAME, these 77 sequences were classied into dierent families with an average Evalue of 3.3e-06, indicating that they were highly likely to be true members of the underlying
families. For other error-containing sequences, the prole HMM alignments' E-values and
lengths were signicantly increased after error correction. The change is plotted in Figure 2.3.
In this gure, the data sets is sequenced from bacterial aromatic dioxygenase genes in a soil
sample. All alignments are generated by HMMER for a fair comparison. Original" refers
to HMMER alignments on sequences before error correction. Corrected" refers to HMMER
alignments on sequences after error correction by HMM-FRAME.
We also applied a two-sample K-S test on the alignments' lengths and E-values before and
after error correction. The p-values for the length and E-value distributions were 8.0609e011and 1.9776e-040, respectively.

The improved alignment lengths and E-values provide

stronger evidence for the membership of the input samples. In total, after error correction
by HMM-FRAME, we could classify 1,214 sequences into three subfamilies.
were members of the naphthalene subfamily.

1,042 reads

96 reads belonged to the benzoate subfam-

ily. 76 reads belonged to the toluene/biphenyl subfamily. The remaining 1272 reads could
potentially be members of the subfamily phthalate (Dr. Iwai, personal communication).

21

89

Lengths and E-values of PHMM alignments

69

49

29
original length
corrected length
LOG(original E-value)
LOG(corrected E-value)

9

-11

-31

-51
1

10

19

28

37

46

55

Soil sample sequence reads
Figure 2.3: Change of HMMER alignments' lengths, scores, and E-values (in log space) before
and after error correction for the bacterial aromatic dioxygenase genes in a soil sample.

2.4.3

Protein domain classication in the deep mine data set

In order to show the utility of HMM-FRAME in a metagenomic data set containing members
of multiple domain families, we applied HMM-FRAME to the rst 454 sequencing project
for environment samples, which were sequenced from two sites in the Soudan Mine, Minnesota, USA [42]. In this experiment, we downloaded the Black Sample from the paper's
supplementary data website. This data set contains 388,627 sequence reads with an average
length of 99 bases.

22

There were two steps in the annotation. First, we applied gene-prediction tools. Second,
we conducted the domain classication on predicted genes.
tools are available for metagenomic data sets.

A number of gene-prediction

However, not every tool can handle short

reads. Glimmer [43] did not output meaningful predictions when it was applied to this data
set.

The sensitivity of Metagene [44] drops to 59% for 100-base sequences [45].

We thus

chose FragGeneScan, a newly developed gene-prediction tool for short reads. FragGeneScan
predicted 281,658 genes, of which 72,355 contained errors. For convenience in discussion, let
S be the set of genes predicted by FragGeneScan. Let S' be the raw read set corresponding
to genes in S. Thus, 72,355 sequences in S were dierent from their raw reads in S' because
FragGeneScan predicted and corrected errors in S'. We compared three domain classication
pipelines: 1) apply HMMER 3.0 on raw reads S', 2) apply FragGeneScan and then HMMER
on corrected reads S, and 3) apply HMM-FRAME on raw reads S'. We recorded how many
reads could be classied into one of the 2,558 Pfam domain families that contain the keyword
bacteria". The number of classiable reads for the three pipelines were: 13,544 for HMMER,
12,328 for FragGeneScan + HMMER, and 17,496 for HMM-FRAME. The classication
results have large overlaps, which are illustrated in Figure 2.4.

In this gure, sequence

sets that can be classied by HMM-FRAME, HMMER, and FragGeneScan+HMMER are
represented by three sets A, B, and C.

|A| = 17, 496. |B| = 13, 544. |C| = 12, 328. B − C =

2224. C − B = 1008. C − A = 4. A − (B + C) = 2948.
In summary, HMM-FRAME was able to classify 2,948 more reads than the other two
annotation pipelines. HMM-FRAME found errors in all of these 2,948 reads. Thus, it is likely
that other two pipelines failed to classify them because of frameshifts. HMM-FRAME failed
to classify four reads that can be aligned by FrageGeneScan+HMMER. A closer examination
showed that FragGeneScan and HMM-FRAME output dierent error positions in these four

23

B: HMMER
alone

C:
FragGeneScan
+HMMER

A: HMMFRAME
Figure 2.4: Protein domain classication results for the black sample in the deep mine data
set.

sequences.
The performance evaluation of FragGeneScan must consider both gene-prediction and
error-prediction. Of the 281,658 predicted genes, only 12,328 could be classied into existing
domain families. Further analysis is needed to examine whether other predictions are novel
genes or wrong predictions. It is worth noting that FragGeneScan could classify 1,008 more
sequences after its error correction than applying HMMER 3.0 alone on raw reads. However,
while 2,224 raw reads could be classied into existing domain families by HMMER 3.0,
they could not be aligned with any family after error correction by FragGeneScan.

This

indicates that FragGeneScan might have over-predicted errors in the 2,224 sequences. This
is consistent with our observation that FragGeneScan has a high FP rate in the control data
set.

24

2.5

Conclusion

Despite the advances of high-throughput sequencing technologies, sequencing errors still
pose challenges for data annotation.

In particular, our error model analysis shows that

454 FLX Titanium only slightly decreases the insertion and deletion error rates compared to
GS20. Thus, correcting frameshifts caused by insertion or deletion errors is still important for
metagenomic sequence annotation. In this work, we introduce a protein domain classication
tool HMM-FRAME, which can classify error-prone DNA sequence reads into protein domain
families. HMM-FRAME can accept any error model trained on data from high-throughput
sequencing technologies and thus achieve high detection sensitivity while maintaining a low
false positive rate.
Applying HMM-FRAME to a data set with annotated errors shows its high sensitivity
and accuracy in error detection.

In particular, by xing frameshift errors, we can obtain

signicantly longer prole HMM alignments with smaller E-values. As alignments' lengths,
scores, and E-values are often used to determine family membership, improving them helps
to classify more sequences into the native domain families. In our experiments, sequences
that fail HMMER 3.0 under the default E-value or score threshold are classied into correct
domain families using HMM-FRAME. Thus, HMM-FRAME can be used as a complementary
tool to HMMER 3.0 on error-prone sequences.

25

Chapter 3
Prole HMM-based protein domain
classication for short sequences
3.1

Background

With the advent of next-generation sequencing and culture-independent methods, an enormous amount of metagenomic data have been sequenced from microbial communities from
dierent habitats. In order to understand the phylogenetic complexity and biological functions of microbial communities, as well as their interactions with the host, automatic annotation tools such as CAMERA [5], MG-RAST [6], and MEGAN [7] are being used for
annotating metagenomic data sets. As an important component of these metagenomic annotation tools, protein homology search provides basis for identifying putative genes and
assigning those genes to annotated functional categories (e.g. protein domain families).
Because of the high sensitivity of remote homology recognition, HMMER has been successfully applied to genome-wide domain analysis.

However, its sensitivity is signicantly

limited by the short reads of metagenomic data sets and poorly conserved domains. In order
to investigate how read length and domain identity aect the sensitivity of HMMER, we
randomly sampled 200 peptides with lengths of 12, 20, and 28 amino acids from the seed
sequences of each of the 2,558 Pfam domains, which contain the word Bacteria" in their de-

26

scriptions. The peptides were aligned with the domain families using HMMER. We used the
E-value cuto 1000 in order to boost the sensitivity. For each domain, the read classication
sensitivity of HMMER is measured as the ratio of the number of aligned reads to the total
number of sampled reads.

We sort all data points by domain identity in ascending order

and plot them in Figure 3.1. For domains with the same identity, their average sensitivity
is reported.

1
0.9

Sensitivity of HMMER3

0.8
0.7
0.6
0.5
0.4
0.3

read length: 36 bp
read length: 60 bp
read length: 84 bp

0.2
0.1
0
0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Average sequence identity of domain
Figure 3.1: Change of the read classication sensitivity of HMMER over read length and the
average sequence identity of domain families.

Figure 3.1 shows that the sensitivity of HMMER deteriorates with the decrease of the
query sequence length and domain identity. The sensitivity is decreased from 90% to 65-70%
when the lengths of reads change from 28 residues (i.e., 84 bp for corresponding DNA reads)

27

to 20 residues (i.e., 60 bp for DNA reads) for domains with identity around 40%.
Although next-generation sequencing technologies are producing longer reads and assembly tools may be available to assemble short reads into longer contigs, there is still a need
for a protein domain analysis tool for short reads. First, many nished or on-going metagenomic sequencing projects contain reads with lengths from 35 to around 400 bp depending
on the chosen sequencing technologies. In addition, peptide sequences encoded in individual
metagenomic sequence reads may share only small overlaps with existing domain families.
Thus, a sizable portion of many available data still contains short reads. Second, the sheer
amount of data and the complexity of many metagenomic data sets pose a great challenge
for assembly tools [46]. A large portion of short reads cannot be correctly assembled into
longer contigs. Third, many domain families exhibit low average sequence identity, which
poses a challenge for short and medium-sized reads. Figure 3.2 shows the histogram of pairwise sequence identity for domains related to bacteria. Of 2558 domains, there are about
43% domains with average identity no greater than 0.3. For these domains, the sensitivity
of HMMER is between 0.7 and 0.8 for reads of length 84 bp, between 0.4 and 0.6 for reads
of length 60 bp, and smaller than 0.1 for reads of length 36 bp. As a result, although a large
number of reads are sequenced from genes, which are highly compact in microbial genomes,
only a small percentage of the short reads can be classied into their native domains using
existing tools.
In this work, we introduce

MetaDomain,

a protein domain classication tool designed

for short reads in metagenomic data sets. MetaDomain provides a complementary protein
analysis tool to HMMER on assigning short reads into their native families.

28

800

700

Number of domains

600

500

400

300

200

100

0
0

0.1

0.2

0.3

0.4
0.5
0.6
0.7
0.8
Average sequence identity of domain

0.9

1

Figure 3.2: Histogram of the average pairwise sequence identity for 2558 domains

29

3.2

Related Work

Prole HMM-based protein homology search is widely used for mining microbial genomes.
Knowing the composition of dierent domain families encoded in a metagenomic data set
helps us understand which functions are important for a particular habitat. For example,
Ellrott et al. [8] investigated the distribution of protein families in the available human gut
genomic and metagenomic data. As the data set contains assembled contigs, using HMMER
is expected to achieve high sensitivity. Schlüter et al. [9] used HMMER to understand the
genetic diversity and composition of a plasmid metagenome from a wastewater treatment
plant. The reads have an average length of 104 bp, which is also adequate for HMMER to
achieve high sensitivity.
Besides providing a basis for functional proling, prole HMM-based homology search
was also used for phylogenetic complexity analysis in metagenomic data. The phylogenetic
algorithm CARMA [10] uses all Pfam domain and protein families as phylogenetic markers
to identify the source organisms of environmental DNA fragments as short as 80 bp. As we
show in Figure 3.1, prole HMM-based tools have sensitivity of at least 0.9 in classifying
reads of 80 bp into domains with average sequence identity above 40%. However, for poorlyconserved domains, a signicant number of reads might be missed. A similar but faster tool
Treephyler [47] conducted community proling in metagenomics and metatranscriptomics
based on Pfam domain assignments. Treephyler was applied to a data set with average read
length of 200 bp. It is unclear how shorter reads aect its performance.
Our previous work designed a tool HMM-FRAME [48], which can identify and correct
frame-shift errors in pyrosequencing reads during protein domain classication using prole
HMM-based alignment. However, it was not specically designed to handle short reads.

30

Finally, we note that the method used in MetaDomain shares a similar rationale to the
recent work by Weng et al. [49].

Weng et al.

reported that taxonomic binning tools for

metagenomes discard 30-40% of Sanger sequencing data due to the stringency of BLAST
cut-os.

Thus, they re-analyzed the discarded reads using less stringent cut-os.

In or-

der to control the false positive matches introduced by the relaxed cut-os, they used the
evolutionary conservation of adjacency between neighboring genes as an additional criterion.

3.3

Method

HMMER uses E-values as the discrimination threshold to determine the membership of a
query sequence.

However, short reads may only generate low alignment scores and thus

insignicant E-values. In particular, the conservation across the entire length of a domain
family can be highly variable, posing a great challenge for classifying reads sequenced from
poorly conserved sub-regions. In order to increase the sensitivity of aligning remotely-related
short reads, we propose position-specic score cutos, by which poorly conserved regions
allow more relaxed discrimination thresholds than well-conserved regions. However, the low
thresholds can easily incur random matches.

In order to control the false positive rate,

we examine the position distribution of read alignments. The position distribution of read
alignments on a truly encoded domain is expected to be more uniform than a domain that
incurs random read alignments [50, 51]. Figure 3.3 shows the schematic representations of
three types of distributions of read alignments along a domain. The alignments in (A) and
(B) are more likely to be random. Thus the domains may not be encoded in the data set. The
alignment distribution in (C) exhibits a much more uniform distribution, providing strong
evidence for the existence of the underlying domain in the data set. Thus, by using relaxed

31

position-specic score cutos and inspecting the distribution of alignments, we expect to
classify more short reads into the correct domain families while not falsely reporting domains
that are not characterized in the data.

Figure 3.3: Three types of alignment distributions.

3.3.1

Pipeline of MetaDomain

The input to MetaDomain includes sequence reads and a list of protein domains.

The

output is a list of domains encoded in the underlying data set and the number of aligned
reads. Figure 3.4 shows a schematic representation of the pipeline of MetaDomain.
MetaDomain consists of three main stages: short read alignment, ltering, and classication.

In the alignment stage, we use the Viterbi algorithm [17] to search for the best

local alignment between a query sequence and a prole HMM-represented domain family. In
the ltering stage, we rst apply a position-specic score threshold to eliminate insignicant
alignments. Then we remove stacked alignments with the same alignment positions inside
a poorly conserved region. In the nal stage, we use the number of aligned reads and the
distribution of alignment positions to determine whether a domain is encoded.

32

Sequence reads

Viterbi algorithm

Pfam domains

Optimal local
alignments

Position-specific
threshold

Filtering

Trimmed alignments

Read number and
domain coverage
thresholds

Classification

Transcribed or encoded
protein domains

Figure 3.4: Pipeline of MetaDomain.

33

3.3.2

The Viterbi algorithm

The Viterbi algorithm aligns a query sequence to a prole HMM by searching for the most
probable state path in the model.

Unlike HMMER, MetaDomain directly aligns a DNA

sequence to a prole HMM. To do so, we implicitly align translated peptides under dierent
reading frames with a prole HMM. Let

π

be a state path in a prole HMM

M

and let

be a query DNA sequence. The Viterbi algorithm searches for the most probable path
such that

π ∗ = argmaxπ (x, π).

x

π∗

The output of the Viterbi algorithm includes the optimal

alignment and its score. As Viterbi is a standard algorithm designed for HMMs, we refer
readers to Durbin et al.[17] for a detailed illustration of the dynamic programming equations
for nding

π∗.

The major dierence between our implementation and the standard Viterbi

algorithm includes : 1) our implementation accepts a DNA rather than a peptide sequence
as input; 2) a local alignment can start and end with any state without incurring insertion
or deletion penalties.

3.3.3

Alignment Filtering

MetaDomain employs two ltering mechanisms to increase its sensitivity in aligning short
reads while maintaining a low false positive rate: position-specic thresholds (PSTs) and
trimming.

3.3.3.1

Position specic threshold

PST allows dierent alignment thresholds for well conserved and poorly conserved regions.
Let the length of a query DNA sequence be

Mi,j

L

(in bp). Denote the prole HMM as

be a sub-model formed by all consecutive states from the

34

i th

match state

M.

Mi

Let

to the

j th

match state

Mj .

The upper bound of the alignment score against

Mi,j

score that can be generated by aligning any input sequence of length
Let

ai,j

denote the transition probability from state

probability of state

Mi,j

Mi

emitting amino acid

a.

Mi

to state

Mj .

is the maximum

j−i+1
Let

Then the upper bound

ei (a)

Ui,j

with

Mi,j .

denote the

for sub-model

is calculated as follows:

j

ak,k+1 × max(ek (a))

Ui,j =
k=i
where

aj,j+1

is set to 1 because

j

is the ending state of the sub-model.

We dene PST for the submodel

Mi,j

as:

P STi,j = γUi,j

where the coecient

γ

is a user-specied parameter in the range of [0,1]. It can be exibly

adjusted to control the trade-o between sensitivity and false positive rate of MetaDomain.
The default value is 0.6, which is used in our experiments.

3.3.3.2

Alignment trimming

Alignment with scores larger than their corresponding PSTs will pass the rst ltering stage.
As each domain has various conservation along the entire length of the model, well-conserved
sub-regions have high PSTs while poorly-conserved sub-regions yield low PSTs. Thus, random sequences tend to be aligned to poorly-conserved regions by MetaDomain, incurring a
high FP rate. Our empirical experiments show that dozens of reads that are not sequenced
from the underlying domain can be aligned to the same position in a poorly-conserved subregion. In order to minimize the eects of noise, we discard stacked alignments that have

35

the same alignment positions.

3.3.4

Protein domain classication

In this stage we extract two features from the collected read alignments for each domain:
the number of aligned reads and the domain coverage. The domain coverage is the fraction
of positions covered by at least one read alignment in a domain. MetaDomain then applies
a simple decision tree to classify all the target domains into two classes: encoded domains
and non-encoded domains. If both features of a domain are equal to or bigger than their
corresponding thresholds, this domain will be classied as encoded.

Otherwise it is not

encoded in the sample. By default, the cuto for domain coverage is 30%. Ideally, the cuto
for the number of aligned read should be determined based on the properties of data such
as sequencing depth. If users do not specify this value, we use 20 by default.

3.4

Results

In order to evaluate the performance of MetaDomain on real data generated by nextgeneration sequencing technologies, we applied MetaDomain to protein domain analysis in
two data sets. The rst one is the transcriptome generated using RNA-seq for

cenocepacia.

Burkholderia

As both the reference genome and its domain annotations are available, we

can quantify the sensitivity and false positive (FP) rate of MetaDomain. The second one
is metagenome data sequenced from soil. We applied MetaDomain to identify domains encoded in the underlying data. In addition, we compared HMMER and MetaDomain in both
applications.

36

3.4.1

Identifying transcribed protein domains in transcriptome

In this experiment, we conducted transcribed domain analysis in the transcriptome from
one strain of

B. cenocepacia

named AU1054 [52]. By using Illumina RNA-seq, the authors

generated multiple samples for AU1054 in two growth media. We used one replicate of cDNA
sample of AU1054 in the growth medium cystic brosis. In total, 3,361,008 reads of a length
of 41 bp were downloaded from the website provided by the authors.

We evaluated the

performance of read classication and domain identication of MetaDomain and HMMER.

3.4.1.1

Performance of read classication

The performance of read classication is quantied using both read classication sensitivity
and FP (false positive) rate.

In this experiment, the read classication performance is

computed on reads that can be mapped to annotated domains. Below we sketch the main
steps to obtain mapped reads for a domain using the reference genome and the domain
annotations.

First, we downloaded the genome of AU1054 and the annotated genes and

domains from the IMG website [53]. There are 2,181 annotated Pfam domains. Second, the
reads were mapped to the reference genome using Bowtie [54] with two mismatches allowed.
Third, we compared the positions of read mapping and annotated domains. For a domain, all
reads that fall into it are dened as mapped" reads. Denote the set of mapped reads as
All other (unmapped) reads constitute set
aligned reads for a domain be
a domain are

A∩M
M

and

A.

A−M ,
U

U.

M.

For a domain classication tool, let the set of

Thus, the sensitivity and FP rate of read classication for

respectively. A perfect sensitivity indicates that all mapped

reads can be aligned. A zero FP rate indicates that only mapped reads can be aligned to a
domain.
Of the 2,181 annotated families, we evaluated the performance of HMMER and Meta-

37

Domain on 1406 families which have at least 1 mapped reads. Of the 1406 tested domains,
HMMER could not align any read to 1150 domains, resulting in zero sensitivity and FP
rate. For the rest 256 domains, all aligned reads by HMMER are non-mappable reads, resulting in zero sensitivity and a positive FP rate. The comparison between HMMER and
MetaDomain is summarized using a bubble chart in Figure 3.5. The biggest bubble indicates
that HMMER has zero sensitivity and zero FP rate for 1150 domains.

As we can see, it

is highly dicult for HMMER to correctly align reads as short as 41 bp.

There are two

reasons for the low sensitivity of HMMER on short reads. First, the parameter training in
E-value calculation of HMMER are based on much longer reads (100 amino acids). Thus,
the small alignment scores generated by the short reads yield large E-values and cannot pass
the E-value threshold. Second, the small alignment scores of short reads may not pass the
ltration stage of HMMER.

3.4.1.2

Identifying transcribed domains in the transcriptome

Figure 3.5 only shows the read classication performance. MetaDomain uses both aligned
read number and domain coverage as thresholds for domain identication. We expect that
the additional constraint will reduce the false positive rate in domain identication. Because
of the low read classication sensitivity, we speculate that HMMER will have low sensitivity
in identifying transcribed domains.
In order to quantify the performance of domain identication, we need to build positive
and negative test sets, which include transcribed and non-transcribed domains based on
mapped reads. There is no commonly accepted criterion to dene transcribed genes using
the number of mapped reads. Various expression scores such as an average coverage depth
across the entire length of each gene [55] and reads per kilobase of exon model per million

38

1
0.9
HMMER3
MetaDomain

0.8

Sensitivity

0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0

1

2

3

4
FP rate

5

6

7
−5

x 10

Figure 3.5: Read classication sensitivity and FP rate of HMMER and MetaDomain. The
size of each bubble represents the number of data points (i.e., domains) with the same
sensitivity and FP rate.

mapped reads (RPKM) are used to quantify transcriptional level. In addition, the cutos
of dening highly transcribed, lowly transcribed, or non-transcribed genes are variable in
dierent applications [56]. In this work, we dene transcribed domains based on the rationale
that a truly transcribed domain should be mapped by a number of reads at dierent positions.
Correspondingly, we use the following criteria to determine whether a domain inside a gene
is transcribed: 1) at least

N

reads are mapped to a domain; 2) at least 30% of positions

in a domain are mapped by reads.
of mapped read is zero.

A domain is labeled non-transcribed" if the number

For domains that fall between the criteria for transcribed and

non-transcribed domains, they are labeled unknown" and are excluded from the test sets.
Table 3.1 shows the size change of the positive and negative test sets over the cuto N.

39

Table 3.1: Number of transcribed and non-transcribed domains using dierent cutos (N)
for the number of mapped reads.
N

transcribed

unknown

nonetranscribed

10

318

1317

546

15

262

1373

546

20

226

1409

546

25

195

1440

546

30

169

1466

546

Intuitively, bigger N creates an easier case for domain classication than smaller N.
We align all reads to the transcribed and non-transcribed domains using MetaDomain
and HMMER. The unknown" domains are removed due to their ambiguity. For HMMER,
we rst translated the short reads into peptide sequences using 6-frame translations.

We

then aligned the domains with the translated sequences using 1000 as the E-value threshold,
which is chosen to maximize the sensitivity. For MetaDomain we directly aligned the short
reads with the domains. The pipeline in Figure 3.4 was used to output a list of transcribed
domains for MetaDomain. Let

D+ and D− be the number of transcribed and non-transcribed

domains identied using the read mapping results in Section 3.4.1.2. Let

M + and M − be the

predicted number of transcribed and non-transcribed domains by MetaDomain or HMMER.
The sensitivity and FP rate of domain classication tools are dened using the following
equations:

Sensitivity
FP rate

=

+
+
= D ∩M
+

D
− ∩M +
D
D−

The values of D and M are aected by several options. First,

D+

and

D−

can change

over the cuto N as shown in Table 3.1. Second, we used both the domain coverage and the

40

number of aligned reads to determine whether a domain is encoded or transcribed. In this
experiment, the cuto for domain coverage is 30%, which we found reasonable across dierent
experiments. Thus,

M+

M−

and

mainly change over the required number of aligned reads

to a domain. For simplicity, we denote the cuto as

τ.

Increasing

τ

implies a more stringent

constraint for dening transcribed domains, and thus might result in lower sensitivity and a
smaller FP rate. Decreasing

τ

is likely to increase the sensitivity while incurring a higher FP

rate. In order to compare the performance of MetaDomain and HMMER under dierent
we plotted the ROC curves by changing

N=10

HMMER3
MetaDomain

0.1
FP rate

0.8

0.6
0.4

HMMER3
MetaDomain

Sensitivity

0.4

1

0.8
Sensitivity

Sensitivity

N=30

1

0.6

0
0

from 1 to N for N=10, 20, and 30 in Figure 3.6.

N=20

0.8

0.2

τ

0.2

0.2

τ,

0
0

0.6
0.4
0.2

0.1
FP rate

0.2

0
0

HMMER3
MetaDomain

0.1
FP rate

0.2

Figure 3.6: ROC curves of HMMER and MetaDomain.

Figure 3.6 shows that HMMER is highly specic (FP rate

≤

1.3%).

However, as we

speculated, its sensitivity is low, with the highest sensitivity being only 0.135.

HMMER

misses a large portion of short reads that can be mapped to protein domains even when we
use a very relaxed E-value cuto. When both tools incur an FP rate of 0.02, the sensitivity
of MetaDomain is 0.53 vs. 0.13 for HMMER. When N decreases from 30 to 10, the size of the
positive test set

D+

becomes larger and the sensitivity of both HMMER and MetaDomain

41

decreases. Note that the sensitivity and FP rate of HMMER keep the same for many dierent
thresholds (i.e.,

τ ),

resulting in compact ROC curves. Overall, the ROC curves show that

MetaDomain can achieve higher sensitivity while keeping a similar FP rate as HMMER
for domain classication in this experiment. In addition, Figure 3.6 provides guidance on
determining appropriate

τ

for MetaDomain in order to achieve desired sensitivity and FP

rate.
On average, it took MetaDomain 280 seconds to align 752,156 reads with one domain on
a 2.2GHz dual-core AMD Opteron machine. There are 2181 domains in this experiment.

3.4.2

Protein domain analysis in a soil metagenomic data set

In the rst experiment, we demonstrated the accuracy of MetaDomain in classifying short
RNA-seq reads into their native domain families. In this section, we present the utility of
MetaDomain in identifying encoded protein domains in a complicated metagenomic data
set, which is sequenced from the microbes dwelling in the soil from a long term cultivated
corn site at Iowa using Illumina HiSeq platform

1.

There are 520,346,510 sequence reads of

various lengths, ranging from 31 bp to 114 bp. The average length of the reads is

∼

73 bp.

Figure 3.7 shows the distribution of the read lengths in this data set. The sheer amount of
data and the complexity of this data set pose a great challenge for read assembly programs.
Thus, we directly apply MetaDomain and HMMER to unassembled reads.
We rst used HMMER to align all the reads against the 2558 Pfam domains that contain
the word Bacteria in their descriptions. Using E-value 1000, HMMER classied 34,602,784
reads into 2558 Pfam domains, accounting for 6.65% of all reads in the data set.
classiable reads have an average length of

1 Sequenced

∼

The

80 bp. A large number of reads shorter than

by James Tiedje Lab at Michigan State University. Unpublished yet.

42

0.5
0.45
0.4

Percentage

0.35
0.3
0.25
0.2
0.15
0.1
0.05
0

30

40

50

60

70

80

90

100

110

Read length (bp)
Figure 3.7: Read length distribution in the soil data set.

60 bp were not classied. By conducting a complementary domain analysis on short reads
using MetaDomain, we expect to classify more reads to their native families.

As many

domains have been aligned with a large number of reads by HMMER, it is highly likely
that they are encoded by the bacterial species in the soil data set. We thus excluded them
from further screening.

There are 80 domains with less than 20 reads aligned or with a

smaller domain coverage than 30%. The left panel of Figure 3.9 shows an example of such
domain.

The small number of aligned reads and their biased distribution do not support

the representation of this domain in this data set. We thus applied MetaDomain to the 80
domains and investigated whether they are encoded. On average, it took MetaDomain about
31 CPU minutes to align 11,194,176 sequences that were not classied by HMMER and are
shorter than 60 bp against one domain family.
Figure 3.8 presents the number of aligned reads and the domain coverage output by

43

HMMER and MetaDomain.

Note that MetaDomain was only applied to reads that were

not classiable by HMMER. Thus, the total number of reads that can be classied into each
domain should be the sum of the output of HMMER and MetaDomain. This gure shows
that signicantly more reads can be classied into the corresponding domains. We need to
specify thresholds for domain coverage and the number of aligned reads in order to dene
an encoded domain. Similar to previous experiments, the domain coverage cuto is 30%.

10
Log(Aligned read number) of HMMER3
Domain coverage of HMMER3
Log(Aligned read number) of MetaDomain
Domain coverage of MetaDomain

Aligned read number and domain coverage

9
8
7
6
5
4
3
2
1
0
0

10

20

30
40
50
Pfam domain index

60

70

80

Figure 3.8: Reads aligned by HMMER and MetaDomain.

According to Figure 3.6, the sensitivity and FP rate of MetaDomain are 0.34 and 0.004
when

τ

is 20.

We thus choose 20 as the cuto for the number of aligned reads.

Out of

the 80 Pfam domains, 24 have less than 20 reads aligned or a domain coverage no greater

44

than 30%. So these 24 domains are not likely to be encoded in this data set. For the other
56 domains, their average domain coverage by MetaDomain alone is 97.25%. The number
of aligned reads by MetaDomain is 169.52 versus 15.27 by HMMER. This provides strong
evidence that these protein domains are actually encoded. The average read length aligned
by MetaDomain is 38 bp. As an example, Figure 3.9 shows the distribution of aligned reads
by HMMER and MetaDomain for the domain PF09703. In summary, by using MetaDomain,
we are able to identify 56 more domains encoded in this data set. 130,930 (0.025%) more
reads are classied into these domain families.

0

100

200

300

Aligned read positions by HMMER3

0

100

200

300

Aligned read positions by MetaDomain

Figure 3.9: The distributions of aligned reads for PF09703 by HMMER and MetaDomain.

Among these 56 protein domains, 21 have unknown functions. 6 domains are CRISPRassociated domains. These kinds of domains are found in the genomes of approximately 40%
of bacteria and 90% of archaea. More detailed analysis is needed to understand whether the
functions of these domains are important to the specic habitat.
45

3.5

Conclusion

In this work, we introduce MetaDomain, a protein domain classication tool for short reads
produced by next-generation sequencing technologies. It provides a complementary domain
classication tool to HMMER on classifying short reads into domain families with low sequence identity. Our experimental results show that it can achieve a better tradeo between
sensitivity and FP rate than HMMER in classifying short sequences.

Its current version

is based on a faithful implementation of Viterbi and is slow when applied to thousands of
millions of reads and the whole Pfam database. We plan to improve its eciency by using
ltration strategies such as ungapped alignment and parallel programming. In addition, we
plan to improve the method of designing position-specic score thresholds in order to achieve
a better discrimination power.

46

Chapter 4
A Sensitive and accurate protein domain
classication tool (SALT) for short reads
based on prole HMMs and graph
algorithms
4.1

Background

Protein domain analysis has been widely used for functional annotations of RNA-Seq data [1,
2, 3, 4]. In particular, quantifying the expression levels of protein domains helps us understand how transcriptional changes of domains are associated with sequencing conditions,
sampling tissues, or experimental treatments in RNA-Seq data. For example, computational
domain analysis was applied to identify domains that play a role in vernalization and eux
transporters in the gibberellin response in sugar beet [1]. Domain analysis is also frequently
used to evaluate and compare gene annotation quality of dierent gene-nding tools [3] or
to compare domain composition of data sampled using dierent techniques [4].
Existing domain classication tools are mainly designed for complete or near-complete
domain member sequences rather than fragmentary and short reads in NGS data. Prole-

47

based domain classication tools rely on alignment scores to distinguish true homologous
sequences from false ones.

Short reads incur marginal alignment scores and thus can be

easily missed by these tools. In particular, although HMMER can achieve high sensitivity
in recognizing remote homologs of a domain family, it shows low sensitivity in classifying
partial sequences of remote homologs [57].
When the reference genomes are available, genome-wide gene and domain annotations
and read mapping positions can be combined to determine the membership of reads. However, many NGS data sets lack complete or quality reference genomes, such as complicated
metagenomic data sets and transcriptomic data of some non-model organisms.

For these

data, protein domain analysis is applied to the short reads directly or to the contigs that
are generated by

de novo

sequence assembly tools. Contigs produced by

de novo

sequence

assembly tools enable NGS data analysis to take advantage of conventional bioinformatics
tools designed for longer sequences. However, using sequence assembly tools as the rst step
is not ideal for protein domain classication. First, the quality of read assembly deteriorates
signicantly in complicated NGS data sets [58]. Dierent sequence assembly tools generate
dierent sets of contigs. There is no consensus on the best assembly tool. Second, successful

de novo assembly requires high sequencing depth, which is dicult to achieve for all domain
regions. It is often observed that some domain regions are highly transcribed while others
are transcribed much less. Third, dierent protein-coding genes can share the same domain.
As a a result, similar domain regions can contribute to the nodes forming crossing points
between dierent paths in a De Bruijn graph.
for

de novo

Those repeat-like sequences add diculty

sequence assembly. Thus, there is a need for an alternative and better domain

classication tool for NGS data lacking reference genomes.
In this work, we designed a Sensitive and Accurate protein domain cLassication Tool

48

(SALT), which uses prole HMMs and family-specic contig generation algorithms to classify
short reads into domains. SALT is mainly designed for domain classication in transcriptomic data of non-model organisms that lack complete or high-quality reference genomes,
and for metagenomic data sets. In this work, we focus on RNA-Seq of non-model organisms.
In order to evaluate the performance of SALT, we applied SALT to RNA-Seq data of species
with quality reference genomes. Read mapping and genome-wide domain annotations are
combined as the ground truth for evaluating the read classication sensitivity and specicity of SALT. We compared SALT with HMMER and HHblits, another prole HMM-based
protein homology search tool [59]. HMMER was applied to short reads directly or to contigs assembled by

de novo

assembly tools.

SALT can correctly classify signicantly more

reads into their native domains than HMMER and HHblits while keeping the same or better
specicity when tested on two RNA-Seq data sets with dierent read lengths. Finally, we
applied SALT to an RNA-Seq data set of a non-model organism and compared SALT with
the alternative pipeline using

de novo assembly tools and HMMER for domain classication.

It is also worth noting that although we will use protein domains when describing the
methods and results, SALT can be applied to both domain families and protein families. In
this work, we used the families in Pfam [12], which include both domain families and protein
families.

4.2

Related Work

HMMER is the representative implementation of prole HMM-based domain classication
tool.

In conjunction with the Pfam database [12], which contains over 10,000 annotated

protein domain families, HMMER can accurately classify query protein sequences into ex-

49

isting domain families. However, HMMER can fail to recognize short reads sequenced from
remote homologs of a domain family.

Our previous work [57] evaluated how read lengths

aected the performance of HMMER on a large number of domains.

Prole HMM-based

tools have sensitivity of 0.9 in classifying reads of 80 bp into domains with average sequence
identity above 40%. However, for poorly conserved domains, which are not rare, a signicant number of reads cannot be correctly classied. Even for a domain with good average
conservation along each position, short reads that are sequenced from less well-conserved regions tend to be missed by existing methods. MetaDomain [57] partially solves this problem
using position-specic score cutos. Using MetaDomain, signicantly more reads could be
correctly classied into their native domains, however, the specicity was not satisfactory.
HHblits uses HMM-HMM alignment to conduct fast iterative protein sequence search
and can generate accurate alignment. However, it is not especially designed for NGS data
and has unsatisfactory classication sensitivity for short and fragmentary reads, which we
show in our experiments.
In order to take advantage of conventional functional analysis pipelines,

de novo sequence

assembly tools can be applied to produce contigs, which are used as input to gene nding or
domain analysis tools. The performance of short read assemblers depends on read length,
complexity of the data, sequencing depth, etc. A recent review [58, 60] summarized several
challenges of

de novo

assembly tools.

Recently, a comparison of several popular

de novo
Radix

assembly tools was conducted on the transcriptome of a non-model snail species (

balthica )

[61].

The assemblies generated by the tested tools diered in total number of

contigs, contig length, and number and quality of gene hits, showing the need for better
assembly methodology.

Similarly, two dierent assembly tools were applied to RNA-Seq

data from the non-model species sugar beet and output dierent sets of assemblies [1].

50

Incomplete or error-containing assemblies directly aect the performance of the downstream
analysis.

In addition, this work focuses on domain analysis, which does not need contigs

outside domain regions of genes. Thus, we deliver an alternative solution that does not rely
on

de novo

sequence assembly tools.

HMM-FRAME [48] aimed at classifying for reads containing frame-shift errors. The insertion or deletion errors in homopolymer regions of pyrosequencing reads can cause frameshifts,
compounding domain classication.

This work focused on short reads which are usually

produced by Illumina platform, which tends to have substitution rather than insertion or
deletion errors. SALT is able to deal with substitution errors during domain classication.

4.3

4.3.1

Methods

Overview of SALT

As a protein domain classication tool, SALT takes query reads and a list of protein domain
families of interest as input. The output is a list of transcribed (or encoded) families and the
classied reads for each family. Figure 4.1 shows a piece of genomic sequence with annotated
domains and the sequenced reads. In this example, if a read is sequenced from Domain X
in a gene, we call it a positive read w.r.t. Domain X. Otherwise, it is a negative read w.r.t
Domain X. An ideal domain classication program should classify only the positive reads
into the corresponding domain family.
Positive reads have the following features: 1) many positive reads tend to share higher
sequence similarity with the underlying family than negative reads, yielding higher alignment
scores; 2) positive reads can be assembled into contigs that have statistically signicant
alignment scores against their native families. Existing domain classication tools such as

51

HMMER mainly uses the rst feature. However, as short and fragmentary positive reads can
yield as low scores as negative reads do, existing tools can miss a large number of positive
reads when they are applied directly to short reads.

De novo

assembly tools can partially

solve this problem but their performance depends a lot on sequencing depth, similarities of
domain copies in dierent genes, etc.

reads

genomic
sequence
domain X

domain X

gene A

gene B

Figure 4.1: Two genes, their domain organizations, and the sequenced reads.

Domain X

occurs in two dierent genes. Both genes are transcribed and sequenced. Red lines: positive
reads. Blue lines: negative reads.

SALT takes advantage of both features by incorporating prole-based domain alignment
methodology into a supervised contig generation algorithm. Figure 4.2 illustrates the SALT
pipeline, which can be divided into three main stages: prole HMM-based ltration, contig
generation, and contig selection.

In the rst stage, we align query reads against prole

HMMs, which represent the underlying domain families, using a modied Viterbi algorithm.
A loose position-specic score cuto [57] is used in order to keep most positive reads. The rst
stage trades specicity for sensitivity, as some negative reads pass the ltration stage. The
rationale behind the second and third stages for removing negative reads is that the positive
reads tend to be sequenced from continuous regions of a domain and their assembled contigs
can yield statistically signicant alignment scores and E-values. The second stage builds a
family-specic

hit graph

(see the Methods Section) using a supervised graph construction

52

algorithm. Contigs can be eciently generated from each hit graph. The third stage aligns
these contigs to input domain families using HMMER. The computed E-values are used to
choose sequences that are part of domain regions. Reads from the high-scoring contigs are
classied into the corresponding input family.

Query reads

Profile HMM-based filtration

Profile HMMs

Hits

Construction of hit graphs

Hit graphs
Contig generation
Find the K longest paths

Candidate contigs

Contig selection

Classified contigs

Read classification

Classified reads

Figure 4.2: The pipeline of SALT.

53

4.3.2

Stage 1: prole HMM-based ltration

The rst stage of SALT aligns reads to prole HMMs, which are built on homologous domains sequences from Pfam [12].

The software suite HMMER uses Viterbi and Forward

algorithms to determine the membership of query sequences via E-value cutos. However,
the sensitivity of HMMER drops signicantly for short sequences or poorly conserved protein domain families [57]. In order to address this problem, SALT uses position-specic score
thresholds (PSST) and a modied Viterbi algorithm [57] to align reads to prole HMMs.
For the standard Viterbi algorithm on a prole HMM, we refer users to the detailed introduction in [17]. The modied Viterbi algorithm used in SALT diers from the standard one
in the following aspects: 1) SALT directly aligns a DNA sequence against a prole HMM by
implicitly translating the DNA sequence into peptides under dierent reading frames; 2) a
local alignment can start and end with any state without any transition penalty; 3) SALT
uses PSSTs as alignment score cutos.

4.3.2.1

Position-specic score threshold

PSST was described for MetaDomain [57].

As users of SALT can specify parameters to

control the ltration performance of the rst stage, we summarize the main properties of
PSST here.
Let

Mi,j

to the

Let

L

be the length of a query sequence in bp and

M

be the prole HMM.

be a sub-model that includes all consecutive states from the

j th

match state

Mj .

any query sequence against

Let

Mi,j .

Ui,j

ith

match state

Mi

be the upper bound of the alignment score between

According to the scoring scheme of prole HMMs,

always positive. If a query sequence is aligned to

54

Mi,j

Ui,j

is

by the Viterbi algorithm, the PSST

associated with this alignment is calculated by the following equation:

PSST

where the coecient

γ

= γUi,j ,

is a user-specied parameter in the range of [0,1]. The parameter

controls the strength of the ltration. If

γ

γ

is too large, the number of positive reads that pass

this stage will be small and the sensitivity of SALT will be very low. If

γ

is too small, there

will be an overwhelming number of negative reads passing ltration, aecting the accuracy
of domain classication in the whole pipeline.
specic needs. The default value of

γ

γ

Users can adjust

γ

to accommodate their

is 0.3, which was used in all our experiments. When

is small, a read may be classied into multiple families in the rst stage. Therefore, in

practice, we require that each read be assigned to at most three input families that generate
the best alignment scores.

4.3.3

Stage 2: contig generation

The ltration stage uses loose PSSTs to trade specicity for sensitivity. Some negative reads
can pass the ltration stage. Given medium to high sequencing coverage, positive reads are
usually sequenced from continuous regions in a domain. Accordingly, their alignments to the
underlying prole HMM overlap. This stage assembles reads with overlapping alignments.
Specically, we construct a family-specic

hit graph, which is similar to a weighted overlap

graph, using reads classied to the underlying family by the rst stage. A path searching
algorithm is then applied to each hit graph to generate contigs for each domain family.

55

(A) Hit

h1 h2 h3 h4 h5 h6 h7 h8 h9 h10 h11 h12 h13

Score 12.0 6.2 11.4 10.5 7.1 14.4 13.5 14.6 18.6 6.8

h

h3

h5 ACGTCGTCGTC
h4 ATCGTCGTAG

h8

h

h1

h6

h7

h

CACGCCGTCA

5.6

9.2

h13

AAACGTACAG
AAAACGTACA
12
AACGTAAAAC
11
10 CGTAACGTAA

ACACGACGTC
GTACACGCGA
9
TATGTACACG CGTCACGTCG
2 GTATGTACACG

ACGTATGTAC

5.4

h
h

profile HMM
graph construction

(B)
overlap = 4
weight = 6.84

v1

v3

v2

v4

v6

v5

overlap = 4
overlap = 8
weight = 1.24 weight = 8.1

overlap = 5
weight = 7.3

v8

v7

overlap = 5
weight = 9.3

overlap = 5
weight = 5.7

v10

4
v9 overlap==11.16 v12
weight

add root
remove transitive overlaps
(C)

v11

overlap = 7
weight = 1.62
overlap = 5
weight = 2.8
overlap = 9
weight = 0.92

v13

root

v10
v1
v3

v2

v6

v4 v5

v11
v8

v7

v12
v9

v13

Figure 4.3: (A) Thirteen reads and their alignment layout w.r.t. the prole HMM represented
by its matching states. The alignment scores are shown in the table. Blue reads: negative

k ∗ = 4. For simplicity
(i.e. e = 0). Red nodes

reads. Red reads: positive reads. (B) The constructed hit graph when
of explanation, mismatches are not allowed in this simple example

are created by positive reads. Blue nodes are created by negative reads. (C) The hit graph
after removing transitive overlaps and adding the root node.

56

4.3.3.1

Constructing a hit graph for a family

Overlap graphs are commonly used in
is dened as

de novo sequence assembly.

A standard overlap graph

G = (V, E), in which each read is a node, and overlaps larger than a given cuto

are indicated by directed edges. A hit graph is also used to assembly contigs. However, its
construction and the graph traversal algorithms are dierent from a standard overlap graph.
We describe the hit graph construction procedure below using Fig. 4.3 as an example. The
dierences between a hit graph and a standard overlap graph will be highlighted as well.
In the rst step, a node is created for each read passing the ltration stage for a domain
family. In Fig. 4.3, a read

hi

creates a node

vi

in the graph. Thus, dierent from a standard

overlap graph, only reads that are likely to be classied into a family are used to build the
hit graph. In addition, a hit graph is family specic.

N

hit graphs will be built for

N

input

families.
In the second step, edges are created using alignment positions of reads.
reads

hi

and

hj

that are classied into a domain family by the ltration stage, their overlap

is computed if and only if the alignments of
and

hj

from

hi

is at least
to

hj .

k∗,

h3

and

hj

overlap. If the overlap between

hi

For example, in Figure 4.3.(A), which shows the alignment layout for 13

h1

and

only tests whether a sux of
and

hi

which is a user-specied overlap threshold, a directed edge is created

reads, the alignment of

h1 , h2 ,

For any two

h2

h1

overlap and

is a prex of

h1

h2 .

has a smaller starting position.

SALT

In addition, as the alignments of reads

share no overlap with alignments of reads from

h6

to

h13 ,

test the sequence overlaps between any two reads from the two sets.

SALT does not

If two reads have

the same starting position in their alignments, the bi-directional overlaps will be computed.
Figure 4.3.(B) shows the added edges. Dierent from standard overlap graph construction,

57

this step takes advantage of the alignment positions and presents an ecient family-specic
graph construction procedure.
The graph construction stage allows substitution sequencing error, which is the major
error type in Illumina sequencing technology. During the sequence overlap computation, at
most

e

errors are permitted;

e=2

adjusted to t the error rate of the given data. Given
is the size of the longest sux of
Users can specify
of kmer size in a

k∗,

e

in the current implementation. The parameter

hi

e,

the overlap

that has hamming distance

≤e

oi,j

from read

to a prex of

can be

hi

to

hj

hj .

which can be estimated based on the same rationale as the choice

de novo

sequence assembly tool. The default value of

k∗

is set to half of

the average read length for relatively short reads. When the reads are long (∼100 bp), using
the default value is too stringent for the hit graph construction. In addition, the ltration
stage is more specic with longer reads. Thus we recommend to use a smaller value than
the default one.
It is worth mentioning that using alignment information to supervise the graph construction was also used for RNA virus population estimation [62]. But the application, the input
data, and the alignment stage are all dierent from SALT.
The graph

G

built so far contains many near-identical paths because of edges created

by transitive overlaps, which are implied by other longer overlaps [63].
overlap between reads

h1

and

h3

overlaps exist between

h1

and

h2 , h2

For example, the

in Figure 4.3.(A) is a transitive overlap because longer
and

h3 .

Transitive overlaps add unnecessary edges

without contributing to read classication because those reads can be identied using an
alternative (usually longer) path. Thus, the third step of graph construction removes all edges
corresponding to transitive overlaps. In Figure 4.3.(B), the edge
resulting

G

is in Fig. 4.3.(C)).

58

(v1 , v3 ) will be removed (the

In the fourth step, we add edge weights, which are dened based on both the overlaps
and the alignment scores of reads. Contigs connecting positive reads can be aligned to the
underlying domain family with much better scores than the short reads, rendering signicantly better discriminative power. Thus, we assign the edge weights so that the path weight
is proportional to the alignment score of the corresponding contig. Negative reads that are
sequenced from a continuous region may happen to pass the ltration stage and thus can
form a path in

G

as well. However, their summed alignment score (i.e. the path weight)

will be much smaller. Thus, by outputting the long paths, we have better chance to pick
positive reads.
The

weight

of an edge

(vi , vj )

between reads

alignment score contributed by the ending read
where

hj .score

between

hi

is alignment score of the read

and

hj

dened in this section.

hi

and

hj

is proportional to the increased

hj : (vi , vj ).weight =

hj . |hj |

is the read length.

hj .score∗(|hj |−oi,j )
,
|hj |

oi,j

is the overlap

The edge weights computed using the above

equation are shown in Figure 4.3.(B). In order to incorporate alignment scores for nodes
that have zero in-degree (i.e. no incoming edges, such as
add a

root

and

v6

in Figure 4.3.(B)), we

G

such that there is an edge from the root to every node with zero

overlap

of these edges are set to 0. Naturally, the edge weights are equal to

node to

in-degree. The

v1

the alignment scores of these nodes (reads). For examples, refer to the edges

(root, v6 )

(root, v1 )

and

in Figure 4.3.(C).

The above steps are applied to each family that has at least one classied read after the
ltration stage. The pseudocode for the construction of

59

G can be found in the tool's website.

4.3.3.2

Find the K longest paths

Every path in a hit graph denes a contig. The weight of a path is the sum of weights of all
the edges in the path. Contigs assembled from positive reads generally have larger scores and
thus larger path weights. Therefore, in order to classify positive reads, we need to nd the

K

longest paths (KLPs) in

G,

where

K

is a parameter that controls the number of contigs

that will be generated.
To use a small set of paths to cover a majority of positive reads, we should avoid outputting a path that is a subpath of any other generated path. Considering that the weights
of all edges are positive, the KLPs will always begin with the root and end with a node
without out-going edges (i.e. sinks). Therefore, the KLPs output by SALT start with the
root and end with a sink node.
Using the sorted alignment positions to construct
graph (DAG). In order to nd the KLPs in

G,

G

ensures that

G

we apply a dynamic programming algorithm

extended from the algorithm that nds the longest path in a DAG. Let
be the list of all nodes after the topological sorting of
Let

P [0..N ]

end with

be an array of

vj . P [0]

N +1

is a directly acyclic

lists such that

P [j]

G.

Let

S

V = {v1 , v2 , ..., vN }

be a list of all sinks in

keeps at most

K

G.

longest paths that

is reserved for the root and is initialized to 0. Then we can ll

P [0..N ]

using the following recurrence relation:

P [j] = KLargest

(vi ,vj )∈E,1≤k≤K

where

KLargest

returns the

K

(P [i][k] + (vi , vj ).weight),

largest values in a list. This equation shows that the KLPs

ending with a node are chosen from all the KLPs ending with its predecessors. The KLPs
of

G

are the KLPs among all the paths that end with a sink node.

60

There are several algorithms available to nd the

K

shortest simple (loopless) paths

(KSPs) between two nodes in a graph [64, 65, 66]. These algorithms can also be applied to
the KLP problem by simply negating the weights of all edges. A hit graph is usually sparse
and therefore

|E| is close to |V| (data will be shown in the Results Section).

generally smaller than

|V|.

Moreover,

K

is

Therefore, Our DP algorithm is more ecient to solve our KLP

problem than other general KSP algorithms.
The RNA virus population estimation in [62] solved the contig generation problem based
on Dilworth's theorem. Simply speaking, the authors tried to identify the least number of
paths that can cover all the nodes. In their application, all the nodes are positive" while our
hit graph contains both positive and negative reads. Thus, their algorithm is not applicable
to this problem.
The choice of

K

aims at including contigs assembled from positive reads. Large

down the algorithm; small

K

K

slows

can make SALT miss positive reads. As the size of the graph

is not large because of the ltration step, we trade eciency for sensitivity in the current
parameter choice. By default,
hit graph is large, a smaller

4.3.4

K

K = |S|,

where

|S|

is the number of sinks in

G.

When the

is recommended for eciency purpose.

Stage 3: E-value computation and contig selection

While most of the top K longest paths contain only positive reads, some of them may
be assembled from negative reads.

In principle, some negative reads that are sequenced

from continuous regions outside of domains can pass the ltration stage because of the
loose PSSTs.

These reads can form paths in

negative reads will be included in KLPs.

G.

When

K

is large, the paths containing

This happens more often for domains with low

sequence conservations. Thus we need a reliable and well-tested scoring system and cutos

61

to distinguish between positive and negative contigs, which consist of positive and negative
reads, respectively. Towards this end, we chose HMMER because it can provide ecient and
reliable E-value computation. We rst translate candidate contigs into peptides using multiframe translation. Then these peptides are aligned against the input family using HMMER.
As the contigs are much longer than the reads, HMMER is able to select positive contigs
from the input using E-value cutos. Users can specify the E-value cuto. The default Evalue cuto is

10−6

for this stage to ensure high specicity. All reads of the selected contigs

will be classied into the input family. This is the nal classication result for each family.

4.3.5

Running time analysis

Let the number of input reads be
of the rst stage is
a family.

M

there are

N1

O(N |h|M )

N

and the average read length be

for one family, where

M

|h|.

The time complexity

is the number of match states in

is mainly determined by the size of a domain family and
reads passing the ltration stage.

the time complexity of graph construction is

Usually,

O(τ |h|2 N1 ),

N1
where

N.
τ

M

N.

Suppose

In the second stage,

is the average number of

overlapping alignments. During graph construction, we use alignment positions to guide the
overlap computation, avoiding all-against-all comparison needed in a standard overlap graph
construction. The time complexity of searching for the KLPs is

|V|

is the number of nodes and

graph construction procedure,

|E|

O(KlogK|E| + |V|),

where

is the number of edges in the graph. According to the

|E| ≈ |V| = N1 .

Suppose there are

N2

contigs produced from

the hit graph. The time complexity of the last stage is determined by HMMER. Because of
various optimization techniques and heuristics, the latest version of HMMER is as fast as
BLAST [18]. In addition, under the default setup,
in the hit graph and thus

N2 < N1 .

N2

is at most the number of sink nodes

As a result, the last stage only adds a small overhead to

62

the overall time complexity. The rst stage is the bottleneck because

N

is very large for a

typical NGS data set. The overall time complexity of SALT is determined by the rst stage.

4.4

Results

In order to show the utility of SALT, we applied it to three RNA-Seq data sets and compared
its performance with HMMER, HHblits, and a pipeline that uses

de novo sequence assembly

tools and HMMER. For brevity of description, we use assembly+HMMER to refer to the
complete pipeline, where assembly" can be replaced by the name of a specic sequence
assembly tool. This pipeline was used to classify query reads into input families based on
the following steps: 1) use

de novo

sequence assembly tools to generate contigs from query

reads, 2) use HMMER to align translated contigs against input families, 3) classify reads
in the aligned regions of the contigs into the corresponding families. Note that the inputs
to HMMER and HHblits are always the translated peptides from the reads or contigs. For
simplicity, when we say we classify reads into protein domain families using HMMER or
HHblits, the translated peptides are actually used.
For the rst two experiments, annotated reference genomes were available.

We rst

determined the true membership of query reads based on protein domain annotations and
read mapping positions in the genome. We then classied query reads into input families
using four dierent types of classiers: HMMER, HHblits, assembly+HMMER, and SALT.
Performance of the classiers was determined by comparing the true membership of reads and
predicted membership by the classiers. We used four metrics to evaluate the performance
of classiers: sensitivity, false positive rate (FP rate), positive predictive value (PPV), and
F-score. Let

D

represent an input family. Let

63

A

and

B

be the set of positive and negative

reads of

D,

respectively.

Let

C

be the set of reads classied to

sensitivity of a classier is dened by
dened by

|A∩C|
.
|C|

|A∩C|
.
|A|

D

by a classier.

The FP rate is dened by

|C−A|
.
|B|

The

The PPV is

F-score considers both sensitivity and PPV and can be used as a single

metric to evaluate the performance of a classier. It is dened by:

F-score

=

2 × sensitivity × PPV
.
sensitivity + PPV

In order to compare the performance of the classiers on all input families, we rst calculate
a metric for each input family and report the average of the values of the metric over all
input families.
In the last experiment, we conducted protein domain classication on an RNA-Seq data
set sequenced from a non-model organism.

Although we did not have the annotations of

the reference genome we were able to show that SALT identied more transcribed protein
domain families that were related to this species.

4.4.1

Protein domain classication of very short reads

In this experiment, we conducted protein domain classication on an Illumina RNA-Seq
data set sequenced from the transcriptome of one strain of
AU1054 in the growth medium cystic brosis [52].

Burkholderia cenocepacia

named

We downloaded a total number of

3,361,008 reads of 41 bp from the website provided by the authors.

For the pipeline of

assembly+HMMER, we chose Velvet [67] and SSAKE [68] because Velvet is widely used and
SSAKE is designed for very short reads. The experimental results show that SALT has much
better performance than HMMER and assembly+HMMER in classifying very short reads.

64

4.4.1.1

Determining the true membership of reads

First, we downloaded the genome and its protein domain annotations from the IMG website [53]. There are totally 2,181 annotated protein domains. Second, we mapped the reads
back to the genome using Bowtie [69] with two mismatches allowed. Third, we compared
the mapping positions of the reads and protein domains in the genome. Let
of the overlap between a read and a protein domain and

l ≥ 0.8L,

L

l

be the length

be the length of the read.

If

this read has a signicant overlap with the protein domain and is thus dened to

be a positive read of the protein domain family. If

l < 0.5L,

the overlap is insignicant and

this read is a negative read of the protein domain family. Otherwise, the signicance of the
overlap is ambiguous and this type of reads will not be evaluated in our experiments.

4.4.1.2

Performance evaluation

Out of the 2,181 annotated protein domain families, we are interested in those that are transcribed in the input data set. There is no commonly accepted criterion to dene transcribed
protein domains.

In this work, we dene protein domains that have at least 10 mapped

reads as transcribed protein domains. There are 378 protein domain families with at least
10 mapped reads, which are used as the input families. When we adjusted this threshold
in the experiment, the performance comparison between dierent classiers was generally
unchanged (data can be found in SALT's website).
We tried several dierent

k -mer sizes for SSAKE and found that when the k -mer size was

16 SSAKE had the best overall performance. It generated 25,944 contigs with an average
length of 60.36 bp. For Velvet, we used VelvetOptimizer [103] to search for the best assembly
result by trying

k -mer

sizes from 15 bp to 39 bp. We tried both Lcon and tbp as the

optimization function of VelvetOptimiser. The former optimizes the total number of base

65

pairs in large contigs and the latter optimizes the total number of base pairs in all contigs.
We found that Lcon produced much higher sensitivity with similar PPV compared to
tbp and the optimal

k -mer size for Lcon

was 25. Therefore, we report the performance

of Velvet+HMMER using Lcon as the optimization function. For SALT, we set

γ

to the

default value (0.3) for the ltration stage. On average, 0.23% of the input reads passed the
ltration for one input family, showing that only a small portion of reads could pass the
ltration stage and be used as the input to the graph construction. The overlap threshold

k∗

was set to 20 by default (k

∗

length
= read 2
).

The average numbers of nodes and edges in

a hit graph were 1,575 and 1,879, respectively.
E-value cutos determine the trade-o between sensitivity and FP rate of all classiers.
We plotted the ROC curves of these classiers by changing HMMER's E-value cuto from

10

to

10−9

with ratio 0.1, as shown in Figure 4.4.

Figure 4.4 shows that HMMER is highly specic with FP rate

≤ 1.2 × 10−7 .

However,

its sensitivity is only 9.81% even with very relaxed E-value cutos. When the E-value cuto
changes from
smaller than

10−5

10−6

all the classiers.
When the FP rate

to

10−6 ,

its sensitivity drops signicantly. When the E-value cuto is

its sensitivity is almost 0. The performance of SSAKE is the worst among
Its highest sensitivity is 8.25% and its lowest FP rate is

≤ 7.00 × 10−8 ,

1.65 × 10−6 .

Velvet has higher sensitivity than HMMER. When its FP

rate is higher than that, it has lower sensitivity than HMMER. This shows that although
Velvet generated longer contigs than SSAKE, both of them missed most positive reads. The
sensitivity of SALT is much higher than that of HMMER and Velvet with comparable FP
rate.

The experimental results also show that when we increase the E-value cuto from

0.01 to 10, the performance of these classiers almost remains unchanged. In order to boost
the sensitivity of HMMER and assembly+HMMER we will use 0.01 as their default E-value

66

0.25
0.08

HMMER
Velvet+HMMER
SALT

0.06

0.2

0.04

HHblits

0.02

0

Sensitivity

0.15

0

0.5

1
−4

x 10
0.09
0.08

0.1

0.07
0.06
0.05
SSAKE+HMMER

0.05

0.04
0.03
1.5

2

2.5

3
−6

x 10
0

0

1

2

3

4

FP rate
Figure 4.4: ROC curves of dierent classiers.

5

6

7
−7

x 10

HHblits and SSAKE+HMMER are listed

in separate embedded windows because their FP rates are orders of magnitude larger than
others.

cuto in our experiments hereafter unless otherwise specied.
the default E-value cuto for SALT is set to

10−6 .

Based on this experiment,

Although this is a stringent cuto, as

it yields good trade-o between sensitivity and PPV for very short reads, it is expected to
be appropriate for data sets containing longer reads. Table 4.1 shows the comparison of the

67

average performance of these classiers under their default E-value cutos.
Table 4.1: Performance comparison of SALT against the other classiers on the RNA-Seq data set of
Classiers

Sens

HMMER

9.81%

HHblits

6.40%

SSAKE

Burkholderia cenocepacia.

1.20e-07

PPV

F-score

89.37%

0.1767

0.03

4.51e-05

27.86%

0.1041

0.65

8.06%

7.23e-06

50.53%

1474

2.16e-07

94.03%

0.1415

9.50%

Velvet

0.1725

85.44%

0.3385

2.50

3.00e-07

21.10%

SALT

FP rate

Time (m)

8.68

All these classiers were run on a 2.2GHz dual-core AMD Opteron
machine.

The running time is the average running time on all

input families.

SSAKE and Velvet represent the pipelines of

SSAKE+HMMER and Velvet+HMMER, respectively.

HMMER was highly specic with FP rate

≤ 1.2×10−7 .

However, its sensitivity was only

9.81% even with very relaxed E-value cutos. When the E-value cuto was changed from

10−5

to

10−6 , its sensitivity dropped signicantly.

10−6

its sensitivity was almost 0. Both HHblits and SSAKE+HMMER had low sensitivity

and high FP rate.
was

6.75 × 10−7 .

rate was

When the E-value cuto was smaller than

The highest sensitivity of HHblits was 6.72% and its lowest FP rate

The highest sensitivity of SSAKE+HMMER was 8.25% and its lowest FP

1.65 × 10−6 .

Although Velvet generated longer contigs than SSAKE, both of them

missed most positive reads. The sensitivity of SALT was much higher than that of HMMER
and Velvet+HMMER with a comparable FP rate. The experimental results also show that
when we increased the E-value cuto from 0.01 to 10, the performance of these classiers
remained almost unchanged.

In order to boost the sensitivity of HMMER, HHblits, and

assembly+HMMER we used 0.01 as their default E-value cuto in our experiments hereafter
unless otherwise specied. Based on this experiment, the default E-value cuto for SALT was
set to

10−6 .

Although this is a stringent cuto, as it yields good trade-o between sensitivity

and PPV for very short reads, it is expected to be appropriate for data sets containing longer

68

reads. Table 4.1 compares the average performance of these classiers under their default
E-value cutos.
HMMER had the lowest FP rate and Velvet had the highest PPV. However, SALT had
much higher sensitivity with good FP rate and PPV. Therefore its F-score was signicantly
larger than the other classiers, showing that it had the best overall classication performance. The running time of HMMER was much smaller than that of the other classiers.
The ltration stage of SALT is the bottleneck and costs most of the running time.

The

average running time of the three stages of SALT was 6.20 min, 2.47 min, and 0.01 min,
respectively.

Since dierent input families can be analyzed independently, parallelization

techniques can be applied to speed up this stage. This will be part of our future work.
As we described in the Methods Section, the ltration stage trades specicity for sensitivity. It determines the upper bound of the sensitivity of SALT using loose score cutos. The
remaining stages are used to reduce the high FP rate introduced by the ltration stage. Our
experimental results are consistent with the expectation. After ltration, the sensitivity was
39.32% and the PPV was 13.22%. The nal sensitivity and PPV were 21.10% and 85.44%,
respectively.

4.4.2

Protein domain classication of an RNA-Seq data of Ara-

bidopsis
In this experiment, we applied SALT to an RNA-Seq data set sequenced from a normalized
cDNA library of

Arabidopsis using paired-end Illumina sequencing [70].

There were a total of

9,559,784 reads of 76 bp from each end. The length of query reads was much longer than in
the rst experiment. We used Velvet and Oases [71] as the

69

de novo

sequence assembly tools

in the assembly+HMMER pipeline. Oases takes into account a dynamic range of expression
levels and is specially designed for the assembly of RNA-Seq data. The experimental results
show that although HMMER, HHblits, and assembly+HMMER have higher sensitivity with
long query reads, SALT still had the best performance.
We used the starting reads as our benchmark data set. We downloaded all the coding
sequences (CDS) of

Arabidopsis thaliana

reference genome version TAIR10 [72] and mapped

the reads to the CDS using Bowtie with 2 mismatches allowed. Totally, 30.26% of the reads
were mapped to the CDS. We annotated the protein domains in the CDS using HMMER
with gathering thresholds (GAs).

The set of reads mapped to each protein domain was

determined using the same methodology as in the rst experiment. There are 3,188 protein
domain families with at least 10 mapped reads.
Velvet achieved the best assembly result when the

k -mer size was 45.

contigs with an average length of 443.99 bp. The range of

k -mer

It generated 44,161

sizes used for Oases was

from 31 to 61 and it generated 57,824 contigs with an average length of 356.46 bp.
parameters of SALT were set to their default values. Specically,

k ∗ = 38

All

because the read

length is 76. Table 4.2 shows the comparison of the average performance of these classiers
on all input families.
SALT had the best performance on all metrics (Table 4.2). Oases+HMMER had better
sensivitiy and similar PPV compared to Velvet+HMMER. However, both its sensitivity and
PPV were much lower than for SALT. For all input families, the total number of reads
correctly classied by SALT, HMMER, HHblits, Velvet+HMMER, and Oases+HMMER
were 968,068, 681,306, 741,160, 681,831, and 776,805, respectively. Signicantly more reads
(>

1.91×105 ) were correctly identied by SALT. Meanwhile, the sensitivities of all classiers

were signicantly improved compared to those in the rst experiment.

70

This shows that

Table 4.2: Performance comparison of SALT against the other classiers on the RNA-Seq data set of

Arabidopsis.

Classiers

Sens

FP rate

PPV

F-score

HMMER

49.72%

2.45e-06

86.26%

0.6308

0.31

HHblits

50.09%

2.32e-04

14.91%

0.2298

11.98

Velvet

35.93%

9.19e-06

64.45%

0.4614

57.65

42.30%

1.02e-05

64.59%

0.5112

400.21

Oases
SALT

58.46% 2.84e-06 87.89% 0.7021

Time (m)

171.65

All these classiers were run on a 2.2GHz dual-core AMD Opteron
machine. The running time is the average running time on all input families.

Velvet and Oases represent the pipelines of Vel-

vet+HMMER and Oases+HMMER, respectively.

prole HMM-based methods have better discriminative power with long sequences.

Note

that although Velvet+HMMER and Oases+HMMER correctly classied more reads than
HMMER alone, their average sensitivities over all families were much lower than that of
HMMER. A closer look at the results showed that Velvet/Oases+HMMER tended to perform
well on highly transcribed families but missed a majority of reads for lowly transcribed
families. On the other hand, HMMER and SALT are not as sensitive as Velvet and Oases
to the transcriptional levels of domains. Thus, the numbers of families that are identied
as transcribed by Velvet+HMMER and Oases+HMMER are smaller than for the other
classiers.
In order to show the performance of dierent classiers on input families of dierent transcription levels, we investigated the performance of HMMER, Velvet+HMMER,
Oases+HMMER, and SALT on transcribed families when we changed the threshold for the
number of mapped reads from 10 to 100 with a step size of 10. For input families that have
at least 100 mapped reads, the RPKM (Reads per kilo base per million) was 193.81. The
performance of HMMER remained almost unchanged when we changed the threshold. This
is because HMMER aligns each read independently.

71

The sensitivity of Velvet+HMMER

increased from 35.93% to 38.18% and its PPV increased from 64.45% to 72.58%. The sensitivity of Oases+HMMER increased from 42.30% to 44.32% and its PPV increased from
64.59% to 75.52%. The sensitivity of SALT increased from 58.46% to 59.39% and its PPV
increased from 87.89% to 91.05%.

These results show that while Velvet/Oases+HMMER

require sucient coverage to assemble short reads, SALT can assemble short reads of variant
coverage and achieves better classication performance.

4.4.3

Protein domain classication of a non-model organism

In this experiment, we show the utility of SALT in classifying a paired-end RNA-Seq data
set from the non-model organism

Radix balthica

[61] into protein domain families.

When

the reference genome is not available, read mapping and genome annotation cannot be
used to provide ground truth" about read membership. Thus, we focus on comparing the
transcribed domains as well as the total number of reads classied into these families. In
this experiment, the pipelines of assembler+HMMER will be represented by the assemblers'
names for simplicity.
We rst downloaded the data set from NCBI Short Read Archive (SRP005151). There
are 8,283,222 reads of 101 bp in each end.
several

de novo

We also downloaded the contigs generated by

sequence assembly tools from [61].

These tools include Velvet, SeqMan

NGen [73] (hereafter called NGen), and Oases [71]. The authors used 31 as the

k -mer sizes

for Velvet and NGen. For Oases, they used both 21 (Oases21) and 31 (Oases31) as the

k -mer

size.
We searched the keywords animal and snail in the Pfam website and obtained a list
of 84 protein domain families that had the keywords in the description of the families. These
protein domain families were used as the input families.

72

If more than 10 reads can be

Velvet
3
1
5

Oases31
3

1
5

2

0
2

1

1

NGen
1

SALT
7

Figure 4.5: A Venn diagram of the transcribed families identied by dierent classiers.

classied to a family, we report that this family is transcribed.
We aligned the contigs generated by the assembly tools against the input families using
HMMER with 0.01 as the E-value cuto. The parameters we used for SALT were:

−6

(default) and E-value=10

k ∗ = 41.

(default).

previous two experiments, we used a smaller

k∗

k∗

γ = 0.3

As the reads are much longer than the

than the default value (50). Our choice of

is already larger than kmer sizes used by the tested assembly tools and is not likely to

introduce random overlaps. Table 4.3 shows the classication results of these classiers.
Table 4.3: Classication results generated by dierent classiers on the
classiers

Radix balthica

data set.

# of transcribed families

# of classied reads

Velvet

17

996

NGen

13

8338

Oases21

19

1336

Oases31

18

1687

SALT

23

4457

# of classied reads: the number of reads that are classied
into transcribed families.

73

In Table 4.3, SALT identied more transcribed families than the other classiers even
with a more stringent E-value cuto.

Moreover, it classied more reads into transcribed

families than the other classiers except NGen. Although NGen classied the largest number of reads, it only identied 13 transcribed families, which are thus likely to be highly
transcribed families. Being consistent with our second experiment, Velvet+HMMER identied a smaller number of transcribed domains than SALT. The relationship of the sets of
transcribed domains identied by dierent strategies is shown in Figure 4.5.
Table 4.4: Description of transcribed families uniquely
identied by SALT.
Accession

ID

Description

PF00654

Voltage_CLC

Voltage gated chloride
channel

PF00907

T-box

transcription

factors

involved in limb and
heart development
PF03388

Lectin_leg-like

PF04275

P-mevalo_kinase

Legume-like

lectin

family
Phosphomevalonate
kinase
PF07684

NODP

PF07829

Toxin_14

NOTCH protein
Alpha-A

conotoxin

PIVA-like protein
PF12349

Sterol-sensing

Sterol-sensing domain
of

SREBP

cleavage-

activation

These classiers identied a total number of 34 transcribed families, 5 of which were
shared among them. When we do pairwise comparisons, we nd that the transcribed family sets identied by Velvet and Oases31 have 12 members in common, showing a higher
similarity than any other pair of classiers. NGen has a similar number of shared families
with Velvet and Oases31. SALT identied 7 transcribed families that were not identied by
any other tool. In order to verify these uniquely identied families, we investigated whether

74

these domains exist in the contigs generated by the sequence assembly tools. The output
of running HMMER between these families and the contigs includes statistically signicant
alignments. However, the number of reads consisting of the aligned regions is smaller than
10 and thus these domains are not reported as transcribed by assembly+HMMER. This
indicates that these domains are very likely to be transcribed but the

de novo

sequence

assembly tools fail to assemble many positive reads in the data set. A complete list of all
the transcribed families identied by dierent classiers are available as supplementary data
in the tool's website.
SALT is able to identify more functional protein domains that reect the species's reactions to the environment. For example, PF07829 is an Alpha-A conotoxin PIVA-like protein
family.

It is the major paralytic toxin found in the venom produced by the piscivorous

snail Conus purpurascens [12]. Table 4.4 shows a brief description of the families uniquely
identied by SALT. The detailed annotations of these families are listed in Pfam's website
(http://pfam.sanger.ac.uk/).

4.5

Discussion

Most existing domain analysis in RNA-Seq data still relies on traditional domain classication tools.

In this work, we analyzed why a commonly used tool HMMER can incur low

sensitivity for analyzing NGS data. We conducted a correlation study between the sensitivity of HMMER and other features based on the data set in the second experiment. These
features include: 1) the alignment score between the domain region and the input family
(F1 ); 2) the length of the domain region in the transcript (F2 ); 3) the normalized alignment score of the domain region in the transcript (F3 ), which is dened as:

75

F

F3 = F1 .
2

The

correlation coecients between the sensitivity and these features are -0.0214, -0.0788, and
0.8415, respectively. The normalized alignment score of the domain region in the transcript
has a strong positive linear relationship with the sensitivity of HMMER. This implies that
reads sequenced from close homologs of a domain family can be better classied using prole HMM-based methods.

On the other hand, reads sequenced from remote homologs of

a domain family are harder to classify.

We further found that the correlation coecient

between the sensitivity and average pairwise sequence identity in a domain is 0.5728. This
indicates that the performance of prole HMM-based methods are related to the sequence
conservation level of a domain.
The choice of parameters is important for SALT to provide good overall performance.
Here we give some analysis of the parameter
The default value of
small, this

K

K

K,

the number of generated candidate contigs.

is the number of sinks in the hit graph.

When the graph size is

value works well. However, when the graph size is large, such as the hit graphs

constructed from metagenomic data sets, the eciency of SALT will decrease signicantly.
Theoretically, the number of sequenced contigs of a domain is the product of the average
number of contigs in a domain and the number of copies of the domain in the genome. The
former one can be estimated by the sequencing depth and the latter one can be inferred
from existing domain organizations in Pfam. This approximation provides a better trade-o
between classication performance and eciency. However, if the users do not have a good
estimation of these values the default

K

value is recommended.

76

4.6

Conclusion

In this work, we introduced a sensitive and accurate protein domain classication tool (SALT)
designed for transcriptomic data of organisms without high-quality reference genomes. Experiments on two RNA-Seq data sets from annotated genomes showed that SALT had higher
sensitivity than existing tools with comparable specicity.

When we applied SALT to an

RNA-Seq data set of a non-model organism SALT identied more transcribed species-related
families than alternative pipelines.

77

Chapter 5
A Scalable and Accurate Targeted gene
Assembly tool (SAT-Assembler) for NGS
data
5.1

Background

Advances of next-generation sequencing (NGS) technologies enable sequencing of transcriptoms of a large number of non-model organisms (RNA-Seq) and species from various environmental samples (metagenomic data).
analyzing these NGS data sets.

Functional annotation is an important step in

For RNA-Seq of non-model organisms and metagenomic

data, which lack quality reference genomes, a commonly used functional annotation pipeline
conducts

de novo

assembly rst and applies functional annotation analysis to the assembled

contigs. This pipeline is widely adopted in functional analysis of RNA-Seq data [1, 2, 3, 4]
and gene-centric metagenomic analysis [74, 53, 75, 76, 6]. One of the most informative and
practical steps in functional analysis is homology search, which assigns functions to contigs
using the signicant matches against non-redundant protein sequence databases or secondary
databases including protein families, domains and functional sites. The performance of downstream functional analysis largely depends on the quality of the

78

de novo

assembly, which is

still a challenging problem for RNA-Seq and metagenomic data.
The problem of assembling NGS reads has received considerable eorts [58, 67]. Dierent
programs have been developed for transcriptomic assembly [71, 77, 78] and metagenomic
assembly [79, 80, 81, 82, 83, 84, 85]. RNA-Seq

de novo

assembly aims to identify transcribed

genes, separate iso-forms, and quantify gene expression while metagenomic

de novo assembly

intends to recover individual genomes in an environmental sample. Often there are specic
sets of genes in pathways that are of special interest, for example carbon and nitrogen
cycling pathways in the response of permafrost to global warming [86] in soil metagenomic
data, while much of bulk

de novo

assemblies consist of less interesting housekeeping genes"

and genes with unknown function that give little insight to the specic question at hand.
Thus, gene-centric analysis using homology search is a direct and eective strategy to study
complicated transcriptomic and metagenomic data.

De novo

gene assembly for RNA-Seq and metagenomic data shares similar challenges

in algorithm design. First, genes that have divergent expression levels in RNA-Seq data or
highly dierent abundance in metagenomic data lead to a wide range of sequence coverage.
Even worse, the sequence coverage along the same gene can be highly dierent due to bias
in sequencing protocols.

Using one set of assembly parameters for all genes or even for

dierent regions of the same gene leads to unsatisfactory performance, resulting in either
fragmented contigs or chimeric contigs. Although methods using multiple

k-mer

sizes have

been developed to identify transcripts in a broader coverage range, the choice of
size is still ad hoc.

k-mer

Second, expression of gene iso-forms due to alternative splicing and

expression of genes with shared short identical sequence regions in RNA-Seq data as well
as the high similarity of orthologous genes in metagenomic data grossly compound

de novo

assembly. Chimeric contigs tend to be produced between highly similar transcripts or gene

79

homologs. Third, existing assembly programs suer from either long computational time or
high memory usage or both. New developments in sequencing technology such as today's
Illumina HiSeq can produce up to 150 bp
run in single-end mode.

× 180 million reads for a total of 27 Gbp in a single

Using all seven lanes can output up to 189 Gbp in one run and

close to double that in paired-end mode. Two popular RNA-Seq assembly programs, Trinity
and Oases, incur the longest running time and the maximum memory usage in comparison
with other assembly tools [87], respectively. Using small

k-mer

size in Osases may require

resources exceeding the memory space of a typical computing server on existing data [87].
Thus, scalable methods and tools are in urgent need to analyze new sequencing data.
To improve the performance of gene assembly in RNA-Seq and metagenomic data that
lack quality reference genomes, we propose a dierent scheme, which conducts homology
search rst and then family-specic gene assembly. The proposed assembly method takes
genes of interest as input, which can include specic sets of genes in pathways that are
of special interest or well-characterized protein or domain families in existing databases.
Note that although we conduct homology search for reads, assembly is still indispensable
for functional analysis because the preferred products for a majority of users are relatively
complete genes.
The main novelties of our method and how those are used to address the above challenges
are summarized below. First, by conducting homology search on reads directly, all reads will
be rst classied into dierent gene families.
for reads in each family.
to bulk assembly.

De novo

assembly will only be conducted

The input size for assembly is signicantly decreased compared

Second, the alignments output by homology search are used to guide

family-specic gene assembly. We propose a novel overlap graph construction algorithm that
can achieve maximum connectivity and accuracy for genes of both low and high sequence

80

coverage.

Third, we use a modied depth-rst-search (DFS) that carefully incorporates

multiple information including paired-end reads, transitive edges, and coverage information
to guide graph traversal. The novel graph construction algorithm together with the carefully
designed traversal algorithm can maximumly avoid generation of chimeric contigs.

5.2

Related Work

As gene families of interest are used as input, our algorithm employs homology search against
gene families in assembly graph construction. Using genomes or proteome of related species
to boost and optimize genome assembly has been proposed or implemented in a group
of assembly programs [88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 85].

The contigs belonging

to a single gene or a block of genome in the related species are ordered, orientated, and
assembled. Most of these programs are designed to improve genome assembly. A few of these
comparative or gene-boosted assembly programs are specically designed for RNA-Seq or
metagenomic assembly. For example, Surget-Groba et al. [92] carefully considered the highly
heterogeneous sequence coverage of transcripts and employed multi-k and proteome of a
related species to optimize transcriptomic assembly. Dutilh et al. [91] used one closely related
reference genome to increase assembly performance of microbial genomes in metagenomic
data.

Ye's group used homologous genes to stitch gene fragments for gene assembly in

metagenomic data [85].
Our work is dierent from existing comparative assembly approaches in the following aspects. First, our tool does not require any related species as input. Most of the existing comparative approaches are limited by the availability of closely related reference genomes. Low
similarity between the to-be-assembled genes or genomes and the related genes or genomes

81

can lead to wrongly assembled contigs. Our tool uses well-characterized genes of particular
interest or ubiquitously represented sequence families such as those from family databases of
proteins, domains, or functional sites as input to guide assembly. Second, as we use sequence
families rather than a single sequence as reference, prole-based alignment methods rather
than pairwise sequence alignment or exact sequence mapping are applied to conduct homology search. Prole-based methods tend to be more sensitive for remote homology search.
Third, to our best knowledge, SAT-Assembler is the rst tool that uses consistency between
sequence overlap and alignment overlap for edge creation in an overlap graph.

5.3

5.3.1

Methods

Overview of SAT-assembler

Our tool can be divided into two main stages. First, we align reads against prole hidden
Markov models (HMMs), which eectively represent the underlying gene families.

This

stage classies the whole input data set into subsets of reads sequenced from dierent gene
families. Second, SAT-Assembler constructs a family-specic overlap graph and assembles
reads from the same family into contigs using a graph traversal algorithm.

The graph

construction is supervised by the alignment information from the rst stage and aims to
obtain maximum connectivity between reads while avoiding false connections. In particular,
it can accurately capture small overlaps between reads from lowly sequenced regions and
improves the assembly of lowly transcribed or encoded genes. The graph traversal is guided
by multiple types of information to avoid generation of chimeric contigs. Finally, paired-end
reads are used to scaold contigs from the same genes into
schematic representation of the pipeline of SAT-Assembler.

82

super contigs.

Figure 5.1 is a

reads

profile HMM-based
alignments

contigs
Figure 5.1: The pipeline of SAT-assembler.

Reads of the same color belong to the same

gene family. Reads from dierent genes of the same family are distinguished using dierent
patterns. Reads shared by multiple genes from the same family have multiple patterns.

5.3.2

Prole HMM-based homology search

Our method conducts homology search on reads rst.

Depending on the algorithms and

the target databases, homology search methods can be divided into two types.

The rst

is to compare the sequences against protein sequence databases using pairwise alignment
tools such as BLAST [11].

The second method is prole-based homology search, which

classies queries into characterized protein domain or family databases such as Pfam [12, 98],
TIGRFAM [13], FIGfams [14], InterProScan [15], etc.
Applying prole-based homology search to NGS reads has several advantages. First, the

83

number of gene families is signicantly smaller than the number of sequences, rendering much
smaller search space. For example, there are only about 13,000 manually curated protein
families in Pfam. However, these cover nearly 80% of the UniProt Knowledgebase and the
coverage is increasing every year as enough information becomes available to form new families [98]. As the prole-based homology search tool HMMER is as fast as BLAST [18], using
prole-based search provides shorter search time. Second, previous work [17] has shown that
using family information can improve the sensitivity of remote protein/ncRNA homology
search. For non-model organism transcriptome and metagenomes, sensitive remote homology search is especially important for identifying possible new homologs. Third, although
short reads can pose challenges to both types of homology search [99, 57], empirical studies
on thousands of families [57] showed that the performance of prole-based homology search
improved quickly with the increase of read size. For read length of 85 bases, the sensitivity
is close to 1.0 for moderately and highly conserved domains.

Thus, for read lengths pro-

duced by modern NGS technologies, prole-based homology search methods are capable of
classifying most reads into their native families with high specicity.

5.3.3

Alignment informed graph construction

The rst stage not only classies query reads into their native families but provides important
alignment information for

de novo

assembly.

The alignment positions are used to guide

overlap graph construction. A standard overlap graph is dened as

G = (V, E),

where each

non-duplicate read is a node and an overlap larger than a given cuto is indicated by a
directed edge. Our overlap graph is dierent from a standard overlap graph [100, 101, 102]
in the edge creation criteria and graph construction procedure.
In a standard overlap graph, edge creation only depends on the sequence overlap, which

84

is not ideal for genes of heterogeneous sequence coverage.
the relationship between two types of overlaps:
For simplicity of explanation, a read

r2 ,

r

alignment overlap and sequence overlap.

corresponds to vertex

an edge is created from the corresponding node

are satised:
and

r2

1) the alignment position of

overlap by at least

t∗ ,

r1

We add edges by considering

r1

r

in G. For two reads

to node

is smaller than

r2

r2 ;

s1

and

e1 ,

and

if the following criteria
2) the alignments of

r1

a user-dened threshold; 3) the sequence overlap of the two

reads is consistent with the overlap in their alignment positions. Suppose
model between

r1

and

r2

aligns to the model between

s2

overlap is the number of bases in the overlapping region between

and

s2

e2 .

and

r1

aligns to the

The alignment

e1 .

Criterion 3 is

the key observation to connect reads that are sequenced from the same gene rather than
from orthologous or paralogous genes because the latter can have very dierent sequence
and alignment overlaps.

r2

An example is given in Figure 5.2, in which read

are from two homologous genes.

and 66 respectively.
and

r2

when the

r1

and read

Their sequence overlap and alignment overlap are 25

Other assembly tools such as Trinity will create an edge between

k-mer

r1

size is 25. However, because their alignment overlap and sequence

overlap are inconsistent SAT-Assembler does not connect them, avoiding a wrong connection
between reads from homologous genes.
The consistency-based edge creation also allows us to improve connectivity in regions with
low sequence coverage. For relatively small overlaps, we still allow an edge if the alignment
overlap and sequence overlap are similar to each other. The intuition is that the chance that
reads with random overlaps can be aligned to the same model with similar alignment overlap
is small.
To quantify the consistency between the two types of overlaps we introduce the

overlap dierence

dened by

d =

|t−k|
t ,

where

85

t

is alignment overlap and

k

relative

is sequence

r1
r2

...
profile HMM
(A)

r1
AAGAAAGACAACCCCGTAACTGACACTGTGTCACATGAGGAAGAAAGGATAAGCCAGTTACCGGAACCG
AAGGATAAGCCAGTTACCGGAACCGTGTCCAAGGAGGAGGATAGAATAAGCCAGTTACCGGAACCAGAAC

alignment overlap
66 bp

r2

(B)

r1

KKDNPVTDTVSHEEERISQLPEP
KDKPVTGTVSKEEDRISQLPEPE

r2

high similarity
(C)
Figure 5.2: (A)Two reads

a

and

b

sequenced from dierent genes of the same family are

aligned to the prole HMM of the family. Their sequence overlap is indicated in red. (B)
Read

a

and read

b

have an alignment overlap of 66 and a sequence overlap of 25 (in bold).

(C)The alignment between the translated peptides of

overlap. Criterion 3 is satised only when
default value of 0.15.

d ≤ d∗ ,

a

and

where

b

d∗

is 22 residues.

is a predened cuto with a

We examined relative overlap dierence in both real RNA-Seq and

metagenomic data sets. For reads sequenced from the same gene and from dierent genes,
the average relative overlap dierence is

0.072 and 0.89 respectively.

overlaps, we use 20 as the default value for

t∗ ,

To avoid small random

which is the alignment overlap threshold.

Our overlap graph construction is dierent from standard overlap graph construction. It
does not need all-against-all sequence comparison. We rst sort the reads by their alignment
positions in a non-decreasing order. We only check the sequence overlap between two reads
if their alignment overlap passes

t∗ .

Therefore, the alignment information also increases the

eciency of graph construction (the running time analysis can be found in Section 5.3.7).
To incorporate substitution sequencing errors introduced by some NGS platforms, we allow

86

a certain number
two hits

r2 .

r1

and

r2

e

of of mismatches in the sequence overlap. That is, the overlap between

is the longest sux of

In our current implementation,

r1

e = 2.

that has a Hamming distance

The parameter

e

≤e

to a prex of

can be adjusted to t the error

rate of the input data.

5.3.4

Pruning and optimization of overlap graphs

Transitive edges correspond to edges whose two end nodes are connected by an alternative
path (usually with higher coverage). They add unnecessary edges without contributing to
the connectivity of the graph and will be removed before

de novo

assembly. Before removing

them, SAT-Assembler keeps a record of all the pairs of nodes connected by transitive edges
because a transitive edge indicates that a pair of nodes are from the same gene region. This
information will be used to guide the graph traversal.
If a node has only one outgoing edge that points to another node that has only one
incoming edge these two nodes can be merged as one node. Tips and bubbles are identied
and removed using the standard topology-based pruning methods as in Velvet [67].
Although our edge creation method excludes most random sequence overlaps, some erroneous edges still exist. An edge is highly likely to be erroneous if it is inferior to another
edge that shares a head node or tail node with it. An edge

a

the following two criteria are met: 1) the sequence overlap of
of

b;

2) the Hamming distance of sequence overlap of

a

is inferior to another edge

a

b

if

is smaller than half of that

is larger than that of

b.

A random

overlap is more likely to be much smaller and have more mismatches than a true overlap.
Therefore, these two criteria will help us remove most erroneous edges.

87

v1

v2

v3

14

v5

v4

31
v6

v7

v8
29

13

A and B . Nodes in red (v1 ,
v7 ) and in blue (v2 , v5 , and v8 ) are from gene A and gene B respectively. Nodes
black (v3 and v6 ) are chimeric nodes because they are shared by the two genes. Arrows

Figure 5.3: A graph containing reads from two dierent genes

v4 ,
in

and

with solid lines are real edges. Arrows with dotted lines and dashed lines indicate paired-end
reads and transitive edges between two nodes respectively.

5.3.5

Guided graph traversal using multiple information

Once a family-specic graph is constructed and optimized, the goal is to conduct a graph
traversal to output paths corresponding to genes. The traversal starts with nodes without
incoming edges. The challenge arises when two or more genes contain a common or similar
subsequence, leading to

chimeric nodes

such as

v3

and

v6

in Figure 5.3.

Chimeric nodes add to the complexity of the graph traversal by leading to chimeric
contigs. For example, the path

< v1 , v3 , v4 , v6 , v8 >

contains nodes exclusively from both

genes and thus is a chimeric path. SAT-Assembler employs three types of information to
guide the graph traversal to recover correct gene paths: transitive edges, paired-end reads,

88

and coverage. We describe the key steps of our graph traversal algorithm using Figure 5.3.
The goal is to output two correct paths corresponding to the two genes.
A paired-end read represents two reads appearing in the same genome with known order
(by our homology search) and distance range (insert size).

Although transitive edges are

removed at the stage of graph pruning they can act as a paired-end read with a small insert
size. Therefore, both transitive edges and paired-end reads can be used to examine whether
two nodes are from the same gene. Two nodes that are not connected by an edge are said
to have

supports

or be

supported

if there are transitive edges or paired-end reads between

them. For supports of paired-end reads, we further require that their distance in the path is
consistent with the insert size.
Dierent from previous traversal algorithms, we divide supports into two types,

supports

and

non-critical supports.

critical

Critical supports can be used to resolve branches in

graph traversal while non-critical supports are not able to distinguish dierent gene paths.
For example, a graph traversal generates a path
edges

(v3 , v4 )

and

(v3 , v5 ).

< v1 , v3 >.

If there is a support between

edge in Figure 5.3, the traversal will be guided to visit

v4

v1

The node
and

v4 ,

v3

has two outgoing

such as the transitive

in next step. This transitive edge

provides a critical support for correct traversal. However, the support between
not critical" for guiding the graph traversal because any path that has visited
visit

v6 .

In Figure 5.3 , the support between

v3

and

v6

v3
v3

and

v6

is

needs to

is a non-critical support while all the

other supports are critical supports.
When there is no support between two non-chimeric nodes ,

node coverage will be used to

resolve the branches. The coverage of a node is the total size of reads normalized by the length
of the assembled sequence of the node.

For protein-coding genes, although the sequence

coverage is usually not uniform along the genes its change is gradual rather than sharp.

89

Thus, the coverage of two consecutive non-chimeric nodes in the same path should reect
this observation. Any sharp change indicates a wrong path. For example, in Figure 5.3,
and

v4

v7 have similar coverage, as do v5 and v8 . v4 and v8 , however, have signicantly dierent

coverage. Therefore, a path that has visited

v4

and

v6

should next visit

v7

instead of

v8 .

We use a bounded depth-rst search (DFS) algorithm to generate correct paths. While
a typical DFS takes exponential time to generate all simple paths between two nodes our
graph traversal method makes use of critical supports to bound the search and only visits
correct paths, eectively reducing the time complexity of path generation. During search, we
will proceed to successors of the current node that provide critical supports. If none of the
successive non-chimeric nodes has supports with any of the previously visited non-chimeric
nodes, we will proceed to the one that has a similar coverage to the recently visited nonchimeric node given that their coverage is similar enough. Otherwise, it is highly likely that
the current node is not from the same gene with any of its successors. Therefore, we will
output the current path and start a new path from its successors. The traversal stops when
there is no appropriate successive node available. All paths with critical supports above a
given threshold will be output.

5.3.6

Contig scaolding

Assembly tools may output multiple contigs from the same gene. There are two main reasons
for the fragmentation: 1) some regions between contigs are not sequenced due to sequencing
bias, PCR bias, low transcription level or abundance; 2) reads from lowly conserved regions
of the gene may not pass the homology search and thus are not used to construct the graph.
The contigs are oriented and connected using their alignment positions against the underlying
prole HMM and paired-end reads. The scaolding results in super contigs.

90

5.3.7

Running time analysis

Let the number of input reads be
of the homology search stage is
prole HMM and

N1

Usually,

M

N.

N.

N

and the average read length be

O(N |r|M )

Suppose

N1

|r|.

The time complexity

for one input family, where

M

is length of the

reads have passed the homology search stage.

The time complexity of graph construction is

O(τ |r|2 N1 ),

the average number of overlapping alignments longer than a given cuto.

where

τ

is

During graph

construction, we use alignment positions to guide the overlap computation, avoiding the allagainst-all comparison needed in a standard overlap graph construction. The time complexity
of graph traveral is
of edges,

n

O(|E|+|V|+n2 N2 ), where |V| is the number of nodes, |E| is the number

is the number of read pairs that have critical supports, and

N2

correct paths in the graph. The time complexity of the scaoding stage is

is the number of

2
O(N2 ).

Because

of various optimization techniques and heuristics, the latest version of HMMER is as fast as
BLAST [18]. Considering

n

N2 ,

the time complexity of scaolding is much smaller than

graph traversal. Therefore, the overall running time is determined by the graph traversal
stage.

5.4

Results

To show the utility of SAT-Assembler we applied it to an RNA-Seq data set of Arabidopsis
and a human gut metagenomic data set. We compared the completeness and correctness of
assembly, length of contigs, memory usage, and running time of dierent

de novo

assembly

tools, including SAT-Assembler, Velvet, Oases, Trinity, Meta-IDBA, and MetaVelvet.
For targeted gene assembly we are interested in evaluating gene segments recovered by
contigs generated by assembly tools. Figure 5.4 shows an example of gene segments assembled

91

contigs
a

b

400 bp

300 bp

c

1000 bp

gene
Figure 5.4: Three contigs generated from a metagenomic data set. The green parts of the
contigs are contained in the target gene and thus are gene segments. The blue parts of the
contigs are not gene segmetns.

from a metagenoic data set.

For the two experiments, by mapping reads to annotated

reference genome or charaterized genes from existing databases, we can construct a set of
reference genes, which are transcribed or encoded in an NGS data set. All contigs output by
assembly tools will be aligned against reference genes using BLAST with an E-value cuto
of

10−5 .

BLAST hits with at least 95% identity were regarded as recovered gene segments.

To quantify the completeness, correctness, and length of gene segments we propose three
metrics:

gene coverage, chimera rate, and contig coverage.

For one reference gene, gene

coverage is dened as the fraction of bases in the reference gene covered by at least one gene
segment.

Chimera rate is the fraction of contigs consisting of reads from dierent genes.

As read mapping results can be used to determine the origin of each read, we can evaluate
whether a contig is a chimera by checking the origin of each read in the contig.

Contig

coverage is the length of the gene segment normalized by the length of the target gene.
This normalized contig length is used to prevent gene segments of long reference genes from
dominating the average contig coverage of an assembly tool. In Figure 5.4, the length of the

92

gene is 1000 bp and the length of the gene segments from contig

a

and contig

b

is 700 bp.

Therefore, its gene coverage is 70%. The contig coverages of the two gene segments are 40%
and 30% respectively. To compare the performance of assembly tools on all input families,
we rst calculate a metric for each family and then report the average of the values of the
metric over all families.

5.4.1

Gene assembly in an RNA-Seq data set of Arabidopsis

In this experiment, we applied SAT-Assembler to an RNA-Seq data set sequenced from a
normalized cDNA library of

Arabidopsis

using paired-end Illumina sequencing [70]. There

were a total of 9,559,784 paired-end reads of 76 bp. Pfam was used as our database of input
families. We compared the performance of SAT-Assembler with Velvet, Oases, and Trinity.
Velvet is a widely used short read

de novo

assembly tool. Oases and Trinity are assembly

tools specially designed for transcriptomic data.
To obtain transcribed genes in this data set, we conducted read mapping (using Bowtie)
on all the coding sequences (CDS) of

Arabidopsis thaliana

of version TAIR10 [72]. There is

no commonly accepted criterion to dene transcribed genes. In this work, we dened CDS
with at least 10 mapped reads as transcribed CDS. Assembly results of dierent tools were
compared on these transcribed CDS. 3,188 protein or domain families from Pfam that can
be aligned to these CDS using HMMER with gathering thresholds (GAs) were used as input
to SAT-Assembler.

5.4.1.1

Edge creation performance

Before we evaluate the assembly performance of SAT-Assembler we rst evaluate the performance of the proposed edge creation strategy. We name the overlap threshold of SAT-

93

Assembler

20+consistency

(or

consistency

for simplicity), where 20 is the default threshold

for alignment overlap and consistency stands for the consistency between alignment overlap
and sequence overlap. We compared edge creation performance of three strategies that used
dierent overlap thresholds: i) 20; ii) 38; iii) consistency. Three metrics were used in the
comparison:

the number of true positive edges (T P ), the number of false positive edges

(F P ), and positive predictive value (P P V ), which is

TP
T P +F P .

True positive edges connect

nodes that are from the same genes. As the number of correct connections is the same for
these strategies,

TP

is proportional to the sensitivity.

Higher

TP

indicates higher sensi-

tivity. False positive edges connect nodes that are from dierent genes. The performance
comparison of three strategies is shown in Table 5.1.
Table 5.1:

Edge creation performance of

three strategies on the RNA-Seq data set of

Arabidopsis.

TP

FP

PPV

20

2,377,680

254,331

90.34%

38

2,127,910

15,639

99.27%

consistency

2,346,341

84,121

96.54%

Strategy

The metrics are evaluated on all edges in
overlap graphs of 3,188 families.

The consistency strategy provided a better tradeo for edge creation than the stringent
overlap threshold of 38 and the very loose overlap threshold of 20. Compared with the 20
threshold strategy, the consistency strategy avoided 170,210 false positive connections at
the expense of missing only 31,339 positive connections, leading to 6.20% increase in

PPV .

Compared with the 38 threshold strategy, it successfully captured 218,431 more positive
connections, leading to 10.27% increase in

TP

at the expense of 2.73% decrease in

PPV .

This showed that the consistency constraint successfully eliminated most random overlaps
while preserving overlaps between many reads from the same genes.

94

5.4.1.2

Performance comparison with other assembly tools

For Velvet we tried
of

k-mer

k-mer

sizes from 35 to 61 with a step size of 2. Oases accepts a range

sizes. It rst generates assemblies on dierent

We xed the beginning

k-mer

k-mer

sizes and then merges them.

size as 35 and changed the ending

with a step size of 2. Trinity uses a xed

k-mer

k-mer

sizes from 35 to 61

size of 25 in its current implementation. We

ran it with its default parameters. For SAT-Assembler, we used consistency" as its edge
creation strategy and default values for the other parameters. To compare the consistency
strategy with the simple overlap threshold strategy we also ran SAT-Assembler with an
overlap threshold from 35 to 45 without the consistency constraint. Figure 5.5 shows chimera
rate versus gene coverage when we changed

k-mer

sizes or overlap thresholds for dierent

assembly tools.
Velvet was more sensitive to the change of

k-mer

sizes than the other assembly tools.

Its largest and smallest gene coverage were 82.03% and 59.53% respectively. Its best overall
performance was achieved when the

k-mer

size was 37. We also ran VelvetOptimiser [103]

to search for the best assembly result by trying

k -mer sizes from 35 to 61 bp with  K50

the optimization function. It reported an optimal

k-mer

as

size of 57, which generated a gene

coverage of 66.31% and a chimera rate of 18.85%. When the range of

k-mer sizes was changed

the performance of Oases did not change much. This is because Oases merges assemblies
generated by dierent

k-mer

sizes and there is large redundancy in its nal assemblies.

The overall performance of Trinity was slightly better than Oases.

The performance of

SAT-Assembler was also stable when the simple overlap threshold was changed.

Overall,

using the consistency strategy shows better gene coverage while keeping similar chimera rate
compared to the simple overlap threshold.

95

0.85

0.8
Velvet
Oases
Trinity
SAT without consistency
SAT with consistency

Gene coverage

0.75

0.7

0.65

0.6

0.55
0.05

0.1

0.15

0.2
Chimera rate

Figure 5.5: Chimera rate versus gene coverage when

k-mer

0.25

0.3

size or overlap threshold changes

for dierent assembly tools. These values are average values of the assemblers' performance
on 3,188 input families.

96

We further compared the performance of SAT-Assembler with the overlap 45 threshold
strategy and the consistency strategy on individual target genes.

We found that the im-

provement of overall gene coverage was more likely to be generated by target genes with low
depth of reads that passed the homology search stage. Given a cuto for the depth of reads

d∗ ,

we compared gene coverage of two strategies on target genes that had a depth lower

than

d∗ .

When

d∗ = 10,

the consistency strategy had better gene coverage in 11.74% target

genes and the 45 threshold had better gene coverage in 4.60% target genes. For the rest of
the target genes both strategies had almost the same gene coverage.

When

d∗ = 20,

the

consistency strategy had better gene coverage in 10.92% target genes and the 45 threshold
had better gene coverage in 6.06% target genes. When we further increased

d∗

we found that

the performance improvement of the consistency strategy became less signicant, showing
that the consistency strategy had better ability to assemble reads from lowly transcribed or
encoded target genes.
We then compared the metrics of gene coverage, chimera rate, contig coverage, memory
usage, and running time of dierent assembly tools when their best performance of gene
coverage and chimera rate were achieved in Table 5.2. Contigs generated by all assembly
tools covered about 80% of genes on average. SAT-Assembler had the lowest chimera rate
due to our consistency-based edge creation strategy.

Oases generated the longest contigs

(indicated by contig coverage) on average. But the price paid was the highest chimera rate,
longest running time, and second highest memory usage, even with a single
When users apply a range of

k-mer

k-mer

size.

sizes, its memory usage will be further increased. SAT-

Assembler had the lowest memory usage and second shortest running time because of the
eective classication in the homology search stage. Our current implementation of SATAssembler uses the Python library of Networkx [104], which contributes to a large portion

97

of both memory usage and running time. We plan to implement SAT-Assembler using C++
in the future.
Table 5.2: Performance comparison between dierent assembly tools on the RNA-Seq data set
of

Arabidopsis.

Assembly tool

Gene coverage

Chimera rate

Contig coverage

Memory (MB)

Time (m)

Velvet

80.15%

11.04%

64.69%

8034

140.85

Oases

80.75%

29.18%

84.55%

22475

5128.74

Trinity

81.60%

26.10%

71.59%

34691

1145.05

SAT

81.32%

9.96%

78.76%

2592

228.14

The memory usage for all tools is based on a single overlap threshold or

k-mer.

The running

time was the average running time on all input families.

5.4.2

Targeted gene assembly in a human gut metagenomic data
set

In this experiment, we compared the performance of SAT-Assembler with Velvet, MetaIDBA, and MetaVelvet on a human gut metagenomic data set. Meta-IDBA and MetaVelvet
are both de Bruijn graph based and specially designed for
data.

de novo

assembly of metagenomic

There were 47,117,906 paired-end and 5,528,102 unpaired reads of various lengths.

The average length of the query reads was 95.72 bp and 75% of them were 100 bp. In this
experiment, we were interested in assembling butyrate kinase pathway genes, which play
important roles in butyrate synthesis.

We downloaded 2,352 annotated genes of butyrate

kinase family from RDP's functional gene repository [38]. By using read mapping, a total
of 58 genes with at least 10 mapped reads were identied and were used to evaluate the
performance of all assembly tools. For SAT-Assembler, a prole HMM built from 77 seed
genes was used as the input family.
We used VelvetOptimiser to search for the best assembly result by trying

98

k-mer

sizes

from 51 to 81 bp with  K50 as the optimization function. VelvetOptimiser reported 51 as
the optimal

k-mer

size. For Meta-IDBA we used the same range of

Meta-Velvet used the hash table generated by Velvet and its

k-mer

k-mer

sizes as Velvet.

size was thus 51 as well.

For SAT-Assembler we used its default parameters. A total of 16,136 reads were classied
into butyrate kinase family by the homology search stage, which accounted for 0.31% of the
query reads. Table 5.3 shows a performance comparison between these assembly tools.
Table 5.3: Performance comparison between dierent assembly tools in assembling genes from butyrate kinase family on the human gut metagenomic data
set
Assembly tool

Velvet

Meta-IDBA

MetaVelvet

SAT-Assembler

Gene coverage

51.97%

60.55%

43.52%

68.89%

Chimera rate

59.26%

36.67%

36.73%

16.87%

Contig coverage

38.99%

62.06%

49.30%

48.56%

41248

16648

20330

283

1736.08

1065.77

834.10

1530.63

Memory usage (MB)
Time (m)

The memory usage for all tools is based on a single overlap threshold or

k-mer.

The running time was the average running time on all input families.

SAT-Assembler had the best gene coverage, chimera rate, and memory usage. Its contigs
were usually shorter than Meta-IDBA. A closer examination reveals the reason: a number
of reads did not pass the prole HMM homology search and thus were not used as input
to assembly. Gene coverage of assembly tools in this experiment was much lower than in
the rst experiment because of lower sequencing depth and higher data complexity. MetaVelvet had the best running time performance because it directly used the optimal

k-mer

size and hash table from VelvetOptimiser while Velvet and Meta-IDBA both ran a range
of

k-mer

sizes. The low memory usage of SAT-Assembler further showed the advantage of

using homology search in targeted gene assembly for large scale NGS data. Due to the high
complexity of the metagenomic data set SAT-Assembler constructed a much more complex
overlap graph, leading to higher run-time overhead.

99

5.5

Conclusion

Both experiments on RNA-Seq and metagenomic data sets show that our novel consistencybased edge creation strategy and guided graph traversal can eectively avoid chimeric contigs.
Moreover, by reducing the original search space into a much smaller subset of reads from
targeted genes the memory usage was signicantly decreased, making it a more ecient tool
for large scale targeted gene assembly.

100

Chapter 6
Conclusion and future work
In this thesis, I rst introduced next-generation sequencing technologies and its applications.
I then discussed the three main challenges in the analysis of NGS data, including large data
size, frameshift sequencing errors, and short sequences. In order to address these challenges,
I have developed several tools and compared their performance with some existing tools.
Here I summarize the contributions of the thesis as follows:

•

In Chapter 2, I introduced HMM-FRAME, which was designed to accurately detect
and correct frameshift errors in NGS data and thus improved protein domain classication of NGS sequences. It was based on an augmented Viterbi algorithm and could
incorporate dierent error models for dierent error rates.

•

In Chapter 3, I introduced MetaDomain, which was designed to improve sensitivity of
protein domain classication of short reads while maintaining high specicity. MetaDomain employed a position-specic score threshold and signicantly improved the
classication performance of short reads from poorly conserved domain regions. Compared with existing protein domain classication tools, it had much better sensitivity
with high specicity.

•

In Chapter 4, I introduced SALT, which was designed to sensitively and accurately
classify short reads into their native protein domain families. It combined homology

101

search with a guided family-specic

de novo

assembly stage. Compared with MetaDo-

main, the specicity of SALT was signicantly improved with a comparable sensitivity.

•

In Chapter 5, I introduced SAT-Assembler, which was designed to conduct scalable and
accurate targeted gene assembly. Its novelties include the family-specic gene assembly,
alignment-informed graph construction, and a guided graph traversal. It can accurately
assemble short reads from the same target genes while avoiding chimeric contigs.

•

All these tools were designed for large-scale NGS data. The prole HMM-based methods enable us to split the input set of reads into subsets of reads corresponding to different families. Therefore, the subsequent steps can be conducted in a family-specic
manner, eectively reducing the search space and memory usage. Moreover, analysis
of reads from dierent families can be naturally parallellized. This methodology has
greatly improved the scalability of the analysis of large-scale NGS data.

Based on my previous studies there are several directions for future work:

•

HMM-FRAME is more computationally expensive than HMMER 3.0 mainly because
of diverse sequencing errors inside codons. We plan to improve the eciency of our
DNA versus prole HMM alignment algorithm so that it can be used eciently in
large-scale protein domain analysis. Besides applying DP matrix pruning techniques
to reduce the computational cost, we plan to use a faster but less accurate Viterbi
algorithm as a ltration stage. Specically, we can apply a faster Viterbi algorithm to
predict whether there are any errors inside a codon before identifying error positions.
If such errors exist, we can then use a more sensitive method to determine the exact
number and positions of insertions or deletions.

102

•

We plan to adapt SALT to domain classication in metagenomic data. In addition,
when the protein families are available, we will test SALT's performance on gene assembly.

We also plan to improve the eciency of SALT in several aspects.

First,

the hit graph construction is currently implemented using the Python library of NetworkX [104].

The overhead of object constructions can be greatly decreased if we

implement it using C++. Second, the independence of dierent input families allows
us to use parallelization techniques such as MPI. Finally, we will provide users with
a systematical way to better estimate

K

based on the sequencing data and domain

information.

•

There are still some challenges to address to further improve SAT-Assembler's performance. First, gene segments from some poorly conserved gene regions are fragmented
because some reads from these regions fail to pass the homology search. This problem
can be alleviated by increasing the sensitivity of the homology search. In the future,
we will incorporate our proposed position-specic score threshold (PSST) [57, 105] into
SAT-Assembler to classify more reads into their native families. Second, although the
edge creation strategy of SAT-Assembler captured more overlaps between reads from
the same genes, some positive overlaps still failed to be captured because of substitution errors in the overlapping regions. Masella et al. [106] proposed a sophisticated
method that can probabilistically correct these errors based on the overlap data from
the paired-end reads. We plan to use this method in our edge creation strategy to generate more positive connections. Third, the running time of the graph traversal stage
is the bottleneck of SAT-Assembler, especially for complex metagenomic data. Therefore, we plan to add more bounding strategies into the graph traversal, such as a more

103

stringent threshold for critical supports. Moreover, we will implement SAT-Assembler
using C++ to reduce its running time.

104

BIBLIOGRAPHY

105

BIBLIOGRAPHY
[1] Ee Mutasa-Göttgens, Anagha Joshi, Helen Holmes, Peter Hedden, and Berthold
Gottgens. A new RNAseq-based reference transcriptome for sugar beet and its application in transcriptome-scale analysis of vernalization and gibberellin responses.

Genomics, 13(1):99, 2012.

BMC

[2] Angela M. Orshinsky, Jinnan Hu, Stephen O. Opiyo, Venu Reddyvari-Channarayappa,
Thomas K. Mitchell, and Michael J. Boehm.

RNA-seq analysis of the Sclerotinia

homoeocarpa - Creeping Bentgrass Pathosystem.

PLoS ONE, 7(8):e41150, 2012.

[3] Zhen Li, Zhonghua Zhang, Pengcheng Yan, Sanwen Huang, Zhangjun Fei, and Kui
Lin. RNA-seq improves annotation of protein-coding genes in the cucumber genome.

BMC Genomics, 12(1):540, 2011.

[4] W. Marc Schmid, Anja Schmidt, C. Ulrich Klostermeie, Matthias Barann, Philip
Rosenstiel, and Ueli Grossniklaus.

A powerful method for transcriptional proling

of specic cell types in eukaryotes: Laser-assisted microdissection and rna sequencing.

PLoS ONE, 7(1):e29685, 2012.

[5] CAMERA: Community cyberinfrastructure for advanced microbial ecology research
and analysis.

http://camera.calit2.net/.

[6] F Meyer, D Paarmann, M D'Souza, R Olson, EM Glass, M Kubal, T Paczian, A Rodriguez, R Stevens, A Wilke, J Wilkening, and RA Edwards. The metagenomics RAST
server - a public resource for the automatic phylogenetic and functional analysis of
metagenomes.

BMC Bioinformatics, 9(1):386, 2008.

[7] Daniel H. Huson, Alexander F. Auch, Ji Qi, and Stephan C. Schuster. MEGAN analysis
of metagenomic data.

Genome Research, 17(3):377386, 2007.

[8] Kyle Ellrott, Lukasz Jaroszewski, Weizhong Li, John C. Wooley, and Adam Godzik.
Expansion of the protein repertoire in newly explored environments: Human gut microbiome specic protein families.

PLoS Comput Biol, 6(6):e1000798, 2010.

[9] Andreas Schlüter, Lutz Krause, Rafael Szczepanowski, Alexander Goesmann, and
Alfred Pühler.

Genetic diversity and composition of a plasmid metagenome from a

wastewater treatment plant.

Journal of Biotechnology, 136(1-2):6576, 2008.
106

[10] Lutz Krause, Naryttza N. Diaz, Alexander Goesmann, Scott Kelley, Tim W. Nattkemper, Forest Rohwer, Robert A. Edwards, and Jens Stoye. Phylogenetic classication of
short environmental DNA fragments.

Nucleic Acids Research, 36(7):22302239, 2008.

[11] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J.
Lipmanl. Basic local alignment search tool.

Journal of Molecular Biology,

215:403 

410, 1990.

[12] Robert D. Finn, Jaina Mistry, John Tate, Penny Coggill, Andreas Heger, Joanne E.
Pollington, O. Luke Gavin, Prasad Gunasekaran, Goran Ceric, Kristoer Forslund,
Liisa Holm, Erik L. L. Sonnhammer, Sean R. Eddy, and Alex Bateman. The Pfam
protein families database.

Nucleic Acids Research, 38(suppl 1):D211D222, 2010.

[13] Daniel H. Haft, Jeremy D. Selengut, and Owen White. The TIGRFAMs database of
protein families.

Nucleic Acids Research, 31(1):371373, 2003.

[14] F. Meyer, R. Overbeek, and Rodriguez A. FIGfams: yet another set of protein families.

Nucleic Acids Research, 37(20):66436654, 2009.

[15] E Quevillon, V Silventoinen, S Pillai, N Harte, N Mulder, R Apweiler, and R Lopez.
InterProScan: protein domains identier.

Nucleic Acids Research, 33(suppl 2):W116

W120, 2005.

[16] Sarah Hunter, Rolf Apweiler, Teresa K. Attwood, Amos Bairoch, Alex Bateman, David
Binns, Peer Bork, Ujjwal Das, Louise Daugherty, Lauranne Duquenne, Robert D. Finn,
Julian Gough, Daniel Haft, Nicolas Hulo, Daniel Kahn, Elizabeth Kelly, Aurelie Laugraud, Ivica Letunic, David Lonsdale, Rodrigo Lopez, Martin Madera, John Maslen,
Craig McAnulla, Jennifer McDowall, Jaina Mistry, Alex Mitchell, Nicola Mulder, Darren Natale, Christine Orengo, Antony F. Quinn, Jeremy D. Selengut, Christian J. A.
Sigrist, Manjula Thimma, Paul D. Thomas, Franck Valentin, Derek Wilson, Cathy H.
Wu, and Corin Yeats. InterPro: the integrative protein signature database.

Acids Research, 37(suppl 1):D211D215, 2009.

[17] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison.

Probabilistic Models of Proteins and Nucleic Acids.

Nucleic

Biological Sequence Analysis

Cambridge University Press, UK,

1998.

[18] Sear R. Eddy.
inference.

A new generation of homology search tools based on probabilistic

Genome Inform., 23(1):20511, 2009.

[19] NP Brown, C Sander, and P Bork. Frame: detection of genomic sequencing errors.

Bioinformatics, 14(4):36771, 1998.

107

[20] X Guan and E.C. Uberbacher. Alignments of dna and protein sequences containing
frameshift errors.

Comput Appl Biosci., 12(1):3140, 1996.

[21] Z Zhang, RW Pearson, and W Miller. Aligning a dna sequence with a protein sequence.

Proc. of RECOMB 97: the rst international conference on computational molecular
biology, pages 337343. ACM press, 1997.
In

[22] E Halperin, S Faigler, and R Gill-More. FramePlus: aligning DNA to protein sequences.

Bioinformatics, 15:867873, 1999.

[23] M. Peltola, H. Soderlund, and E. Ukkonen. Algorithms for the search of amino acid

Nucl. Acids Res., 14:99107, 1986.

patterns in nucleic acid sequences.

[24] W. I. Chang and E.L. Lawler. Sublinear expected time approximate string matching
and biological applications.

Algorithmica, 12:32744, 1994.

[25] M Pellegrini and TO Yeates. Searching for frameshift evolutionary relationships be-

Proteins, 37(2):27883, 1999.

tween protein sequence families.

[26] M. Girdea, L. Noe, and G. Kucherov.

Back-translation for discovering distant pro-

tein homologies. In T. Warnow and S. Salzberg, editors,

September 12-13; Philadelphia, pages 108120, 2009.

Proceedings of WABI 2009:

[27] M. Girdea, L. Noe, and G. Kucherov. Back-translation for discovering distant protein
homologies in the presence of frameshift mutations.

Algorithms for Molecular Biology,

5(6), 2010.

[28] Thomas Schiex, J Gouzy, Annick Moisan, and Yannick Oliveira.

Framed:

a exi-

ble program for quality check and gene prediction in prokaryotic genomes and noisy
matured eukaryotic sequences.

Nucleic Acids Res., 31(13):37383741, 2003.

[29] A. Kislyuk, A. Lomsadze, A. L. Lapidus, and M. Borodovsky. Frameshift detection in
prokaryotic genomic sequences.

International Journal of Bioinformatics Research and

Applications, 5(4):458477, 2009.

[30] M. Borodovsky and J. McIninch.
strands.

GeneMark: parallel gene recognition for both DNA

Computers and Chemistry, 17(19):123133, 1993.

[31] I Antonov and M Borodovsky. Genetack: frameshift identication in protein-coding
sequences by the viterbi algorithm.

J Bioinform Comput Biol., 8(3):535551, 2010.
108

[32] M. Rho, H. Tang, and Y. Ye. FragGeneScan: predicting genes in short and error-prone
reads.

Nucleic Acids Res., 1(38):e191, 2010.

[33] E Birney, M Clamp, and R Durbin.

Genewise and genomewise.

Genome Research,

14:988995, 2004.

[34] C Wang, Y Mitsuya, B Gharizadeh, M Ronaghi, and RW Shafer. Characterization of
mutation spectra with ultra-deep pyrosequencing: application to hiv-1 drug resistance.

Genome Res., 17(8):1195201, 2007.

[35] Nicholas Eriksson, Lior Pachter, Yumi Mitsuya, Soo-Yon Rhee, Chunlin Wang, Baback
Gharizadeh, Mostafa Ronaghi, Robert W. Shafer, and Niko Beerenwinkel. Viral pop-

PLoS Comput Biol, 4(5):e1000074, 2008.

ulation estimation using pyrosequencing.

[36] Christopher Quince, Anders Lanzen, Thomas P Curtis, Russell J Davenport, Neil Hall,
Ian M Head, L Fiona Read, and William T Sloan. Accurate determination of microbial
diversity from 454 pyrosequencing data.

Nature Methods, 6:639641, 2009.

[37] J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris, A. S. Kulam-SyedMohideen, D. M. McGarrell, T. Marsh, G. M. Garrity, and J. M. Tiedje. The Ribosomal
Database Project:

improved alignments and new tools for rRNA analysis.

Acids Research, 37(suppl 1):D141D145, 2009.

Nucleic

[38] FunGene: functional gene pipeline and repository. http://fungene.cme.msu.edu/.

[39] M.A. Larkin, G. Blackshields, N.P. Brown, R. Chenna, P.A. McGettigan, McWilliam.
H., F. Valentin, I.M. Wallace, Lopez R. Wilm, A., J.D. Thompson, T.J. Gibson, and
D.G. Higgins. ClustalW and ClustalX version 2.

Bioinformatics,

23(21):29472948,

2007.

[40] S Iwai, B Chai, WJ Sul, JR Cole, SA Hashsham, and JM Tiedje.

Gene-targeted-

metagenomics reveals extensive diversity of aromatic dioxygenase genes in the environment.

ISME J., 4(2):279285, 2010.

[41] D. T. Gibson and R. E. Parales. Aromatic hydrocarbon dioxygenases in environmental
biotechnology.

Current Opinion in Biotechnology, 11(3):236243, 2000.

[42] Robert Edwards, Beltran Rodriguez-Brito, Linda Wegley, Matthew Haynes, Mya Breitbart, Dean Peterson, Martin Saar, Scott Alexander, E Calvin Alexander, and Forest
Rohwer. Using pyrosequencing to shed light on deep mine microbial ecology.

Genomics, 7(1):57, 2006.

109

BMC

[43] Arthur L. Delcher, Kirsten A. Bratke, Edwin C. Powers, and Steven L. Salzberg.
Identifying bacterial genes and endosymbiont DNA with Glimmer.

Bioinformatics,

23(6):673679, 2007.

[44] Hideki Noguchi, Jungho Park, and Toshihisa Takagi.
nding from environmental genome shotgun sequences.

MetaGene:

prokaryotic gene

Nucl. Acids Res., 34(19):5623

5630, 2006.

[45] Weizhong Li. Analysis and comparison of very large metagenomes with fast clustering
and functional annotation.

BMC Bioinformatics, 10(1):359, 2009.

[46] Stephan C Schuster. Next-generation sequencing transforms today's biology.

Nat Meth,

5(1):1618, 2008.

[47] Fabian Schreiber, Peter Gumrich, Rolf Daniel, and Peter Meinicke. Treephyler: fast
taxonomic proling of metagenomes.

Bioinformatics, 26(7):960961, 2010.

[48] Yuan Zhang and Yanni Sun. HMM-FRAME: accurate protein domain classication for
metagenomic sequences containing frameshift errors.

BMC Bioinformatics, 12(1):198,

2011.

[49] Francis Weng, Chien Hao Su, Ming Tsung Hsu, Tse Yi Wang, Huai Kuang Tsai, and
Daryi Wang.

Reanalyze unassigned reads in sanger based metagenomic data using

conserved gene adjacency.

BMC Bioinformatics, 11(1):565, 2010.

[50] C. Juliane Dohm, Claudio Lottaz, Tatiana Borodina, and Heinz Himmelbauer. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing.

Nucleic Acids Research, 36(16):e105, 2008.

[51] D. Kasper Hansen, E. Steven Brenner, and Sandrine Dudoit. Biases in Illumina transcriptome sequencing caused by random hexamer priming.

Nucleic Acids Research,

2010.

[52] D. R. Yoder-Himes, P. S. G. Chain, Y. Zhu, O. Wurtzel, E. M. Rubin, James M.
Tiedje, and R. Sorek.

Mapping the burkholderia cenocepacia niche response via

high-throughput sequencing.

Proceedings of the National Academy of Sciences,

106(10):39763981, 2009.

[53] IMG: Integrated microbial genomes.

http://img.jgi.doe.gov/.

110

[54] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg.

Ultrafast and

memory-ecient alignment of short DNA sequences to the human genome.

Biology, 10:R25, 2009.

[55] Karla D. Passalacqua,

Anjana Varadarajan,

Brian D. Ondov,

Genome

David T. Okou,

Michael E. Zwick, and Nicholas H. Bergman. Structure and complexity of a Bacterial
transcriptome.

J. Bacteriol., 191(10):32033211, 2009.

[56] Brian A Williams, Brian A Williams, Kenneth McCue, Lorian Schaeer, and Barbara
Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq.

Nat Meth,

5(7):621628, 2008.

[57] Yuan Zhang and Yanni Sun. MetaDomain: a prole hmm-based protein domain classication tool for short sequences. In

(PSB), 2012.

Proceedings of Pacic Symposium on Biocomputing

[58] A. Martin1 Jerey and Wang Zhong. Next-generation transcriptome assembly.

Reviews Genetics, 12:671682, 2011.

Nature

[59] Michael Remmert, Andreas Biegert, Andreas Hauser, and Johannes Söding. HHblits:
lightning-fast iterative protein sequence searching by HMM-HMM alignment.

Methods, 9:173175, 2012.

Nature

[60] Jason R. Miller, Sergey Koren, and Granger Sutton. Assembly algorithms for nextgeneration sequencing data.

Genomics, 95(6):315  327, 2010.

[61] Barbara Feldmeyer, W. Christopher Wheat, Nicolas Krezdorn, Bjórn Rotter, and
Markus Pfenninger. Short read Illumina data for the de novo assembly of a non-model
snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a
comparison of assembler performance.

BMC Genomics, 12(317), 2012.

[62] Nicholas Eriksson, Lior Pachter, Yumi Mitsuya, Soo-Yon Rhee, Chunlin Wang, Baback
Gharizadeh, Mostafa Ronaghi, Robert W. Shafer, and Niko Beerenwinkel. Viral population estimation using pyrosequencing.

PLoS Comput Biol, 4(5), 2008.

[63] Arthur L. Delcher Michael C. Schatz and Steven L. Salzberg.
genomes using second-generation sequencing.

Genome Res., 20(9):11651173, 2010.

[64] Jin Y. Yen. Finding the K shortest loopless paths in a network.
17(11):712716, 1971.

111

Assembly of large

Management Science,

[65] D. Eppstain. Finding the k shortest paths. In

Proc. 25th IEEE Annual Symposium on

Foundation of Computer Science, pages 154165, 1994.

[66] W. A. Brander and C. M. Sinclair. A comparative study of k-shortest path algorithms.

Proc. 11th UK Performance Engineering Workshop for Computer and Telecommunications Systems, 1995.
In

[67] R. Daniel Zerbino and Ewan Birney. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs.

Genome Res., 18(5):821829, 2008.

[68] L. René Warren, G. Granger Sutton, J. M. Steven Jones, and A. Rober Holt. Assem-

Bioinformatics,

23(4):500501,

[69] Ben Langmead, Cole Trapnell, Mihai Pop, and L. Steven Salzberg.

Ultrafast and

bling millions of short DNA sequences using SSAKE.
2006.

memory-ecient alignment of short dna sequences to the human genome.

Biology, 10(R25), 2009.

Genome

[70] Yamile Marquez, W.S. John Brown, Craig Simpson, Andrea Barta, and Maria Kalyna.
Transcriptome survey reveals increased complexity of the alternative splicing landscape
in Arabidopsis.

Genome Res., 22:11841195, 2012.

[71] H. Marcel Schulz, R. Zerbino Daniel, Martin Vingron, and Ewan Birney. Oases: Robust
de novo RNA-seq assembly across the dynamic range of expression levels.

matics, 28(8):10861092, 2012.

[72] TAIR: The Arabidopsis Information Resource.

Bioinfor-

www.arabidopsis.org.

[73] SeqMan NGen - Software for Next-Gen Sequence Assembly.

com/t-sub-products-genomics-seqman-ngen.aspx.

http://www.dnastar.

[74] Susannah Green Tringe, Christian Von Mering, Arthur Kobayashi, Asaf A Salamov,
Kevin Chen, Hwai W Chang, Mircea Podar, Jay M Short, , E. J. Mathur, and et al.
Comparative metagenomics of microbial communities.

Science,

308(5721):554557,

2005.

[75] EBI Metagenomics.

[76] T. Lingner,

https://www.ebi.ac.uk/metagenomics/.

K. P. Aÿhauer,

F. Schreiber,

and P. Meinicke.

server for comparative functional proling of metagenomes.
39(suppl_2):W518, 2011.

112

CoMet - a web

Nucleic Acids Research,

[77] Manfred G Grabherr, Brian J Haas, Moran Yassour, Joshua Z Levin, Dawn A Thompson, Ido Amit, Xian Adiconis, Lin Fan, Raktima Raychowdhury, Qiandong Zeng, Zehua Chen, Evan Mauceli, Nir Hacohen, Andreas Gnirke, Nicholas Rhind, Federica di
Palma, Bruce W Birren, Chad Nusbaum, Kerstin Lindblad-Toh, Nir Friedman, and
Aviv Regev. Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Nature Biotechnology, (7):644652, 2011.

[78] I Birol, Shaun D. Jackman, Cydney B. Nielsen, Jenny Q. Qian, Richard Varhol, Greg
Stazyk, Ryan D. Morin, Yongjun Zhao, Martin Hirst, Jacqueline E. Schein, Doug E.
Horsman, Joseph M. Connors, Randy D. Gascoyne, Marco A. Marra, and Steven J. M.
Jones.

De novo transcriptome assembly with ABySS.

Bioinformatics,

25(21):2872

2877, 2009.

[79] T Namiki, T Hachiya, H Tanaka, and Y Sakakibara.

MetaVelvet: an extension of

Velvet assembler to de novo metagenome assembly from short sequence reads.

Acids Res., 40(20):e155, 2012.

Nucleic

[80] Todd Treangen, Sergey Koren, Daniel Sommer, Bo Liu, Irina Astrovskaya, Brian Ondov, Aaron Darling, Adam Phillippy, and Mihai Pop. MetAMOS: a modular and open
source metagenomic assembly and analysis pipeline.

[81] J Laserson, V Jojic, and D Koller.

Comput Biol, 18(3):429443, 2011.

Genome Biology, 14(1):R2, 2013.

Genovo: de novo assembly for metagenomes.

J

[82] Yu Peng, Henry C. M. Leung, S. M. Yiu, and Francis Y. L. Chin. Meta-IDBA: a de
novo assembler for metagenomic data.

Bioinformatics, 27(13):i94i101, 2011.

[83] C Luo, D Tsementzi, NC Kyrpides, and KT Konstantinidis. Individual genome assembly from complex community short-read metagenomic datasets.

ISME J., 6(4):898901,

2012.

[84] Steven L. Salzberg, Daniel D. Sommer, Daniela Puiu, and Vincent T. Lee.
boosted assembly of a novel bacterial genome from very short reads.

Biol, 4(9):e1000186, 09 2008.

Gene-

PLOS Comput

[85] YuWei Wu, Mina Rho, Thomas G. Doak, and Yuzhen Ye. Stitching gene fragments
with a network matching algorithm improves gene assembly for metagenomics.

formatics, 28(18):i363i369, 2012.

Bioin-

[86] Rachel Mackelprang, Mark P. Waldrop, Kristen M. DeAngelis, Maude M. David,
Krystle L. Chavarria, Steven J. Blazewicz, Edward M. Rubin, and Janet K. Jansson.

113

Metagenomic analysis of a permafrost microbial community reveals a rapid response
to thaw.

Nature, 480(7377):368  371, 2011.

[87] QiongYi Zhao, Yi Wang, YiMeng Kong, Da Luo, Xuan Li Li, and Pei Hao. Optimizing
de novo transcriptome assembly from short-read RNA-Seq data: a comparative study.

BMC Bioinformatics, 12(Suppl 14)(S2), 2011.

[88] Mihai Pop, Adam Phillippy, Arthur L. Delcher, and Steven L. Salzberg. Comparative
genome assembly.

Briengs in Bioinformatics, 5:237  248, 2004.

[89] S. L. Salzberg, D. D. Sommer, D. Puiu, and V. T. Lee. Gene-boosted assembly of a
novel bacterial genome from very short reads.

PLOS Comput. Biol.,

4(9):e1000186,

2008.

[90] T. Rausch, S. Koren, G. Denisov, D. Weese, A. K. Emde, A. Doring, and K. Reinert.
A consistency-based consensus algorithm for de novo and reference-guided sequence
assembly of short reads.

Bioinformatics, 25(9):11181124, May 2009.

[91] B. E. Dutilh, M. A. Huynen, and M. Strous. Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly.

Bioinformatics,

25(21):28782881, Nov 2009.

[92] Yann Surget-Groba and Juan I. Montoya-Burgos. Optimization of de novo transcriptome assembly from next-generation sequencing data.

Genome Research, 20(10):1432

1440, 2010.

[93] J. Nijkamp, W. Winterbach, M. van den Broek, J. M. Daran, M. Reinders, and
D. de Ridder. Integrating genome assemblies with MAIA.

Bioinformatics, 26(18):i433

439, Sep 2010.

[94] J. D. Klein, S. Ossowski, K. Schneeberger, D. Weigel, and D. H. Huson. LOCASa
low coverage assembly tool for resequencing projects.

PLOS ONE, 6(8):e23455, 2011.

[95] Y. Ji, Y. Shi, G. Ding, and Y. Li. A new strategy for better genome assembly from
very short reads.

BMC Bioinformatics, 12:493, 2011.

[96] Y Nishito, Y Osana, T Hachiya, K Popendorf, A Toyoda, A Fujiyama, M Itaya, and
Y Sakakibara. Whole genome assembly of a natto production strain Bacillus subtilis
natto from very short read data.

BMC Genomics, 11:243, 2010.

[97] Y. I. Li and R. R. Copley. Scaolding low quality genomes using orthologous protein
sequences.

Bioinformatics, 29(2):160165, Jan 2013.
114

[98] M. Punta, P. C. Coggill, R. Y. Eberhardt, J. Mistry, J. Tate, C. Boursnell, N. Pang,
K. Forslund, G. Ceric, J. Clements, A. Heger, L. Holm, E. L. Sonnhammer, S. R. Eddy,
A. Bateman, and R. D. Finn. The Pfam protein families database.

Nucleic Acids Res.,

40(Database issue):290301, Jan 2012.

[99] K. Eric Wommack, Jaysheel Bhavsar, and Jacques Ravel. Metagenomics: read length
matters.

Appl. Environ. Microbiol., 74(5):14531463, 2008.

[100] Eugene W. Myers.

The fragment assembly string graph.

Bioinformatics,

21(suppl

2):ii79ii85, 2005.

[101] Jared T. Simpson and Richard Durbin. Ecient construction of an assembly string
graph using the FM-index.

Bioinformatics, 26(12):i367i373, 2010.

[102] J. T. Simpson and R. Durbin.
compressed data structures.

Ecient de novo assembly of large genomes using

Genome Res., 22(3):549556, Mar 2012.

[103] Victorian bioinformatics consortium:

VelvetOptimiser.

software.velvetoptimiser.shtml.

bioinformatics.net.au/

[104] NetworkX: High-productivity software for complex networks.

github.io/.

http://networkx.

[105] Y. Zhang, Y. Sun, and J. R. Cole. A Sensitive and Accurate protein domain cLassication Tool (SALT) for short reads.

Bioinformatics, 29(17):21032111, Sep 2013.

[106] A. P. Masella, A. K. Bartram, J. M. Truszkowski, D. G. Brown, and J. D. Neufeld.
PANDAseq: paired-end assembler for illumina sequences.
2012.

115

BMC Bioinformatics, 13:31,