APPROACHES TO SCALING AND IMPROVING METAGENOME SEQUENCE ASSEMBLY

By Jason Pell

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science – Doctor of Philosophy

2013

ABSTRACT

APPROACHES TO SCALING AND IMPROVING METAGENOME SEQUENCE ASSEMBLY

By Jason Pell

Since the completion of the Human Genome Project in the early 2000s, new high-throughput sequencing technologies have been developed that produce more DNA sequence reads at a much lower cost. Because of this, large quantities of data have been generated that are difficult to analyze computationally, not only because of the sheer number of reads but also because of sequencing errors. One area where this is a particularly difficult problem is metagenomics, where an ensemble of microbes in an environmental sample is sequenced. In this scenario, blends of species with varying abundance levels must be processed together in a Bioinformatics pipeline. One common goal with a sequencing dataset is to assemble the genome from the set of reads, but since comparing reads with one another scales quadratically, new algorithms had to be developed to handle the large quantity of short reads generated by the latest sequencers. These assembly algorithms frequently use de Bruijn graphs, where reads are broken down into k-mers, or small DNA words of a fixed size k. Despite these algorithmic advances, DNA sequence assembly still scales poorly due to errors and memory inefficiency.

In this dissertation, we develop approaches to tackle the current shortcomings in metagenome sequence assembly. First, we devise the novel use of a Bloom filter, a probabilistic data structure with false positives, for storing a de Bruijn graph in memory. We study the properties of the de Bruijn graph with false positives in detail and observe that the components in the graph abruptly connect together at a specific false positive rate. Then, we analyze the memory efficiency of a partitioning algorithm at various false positive rates and find that this approach can lead to a 40x decrease in memory usage. Extending the idea of a probabilistic de Bruijn graph, we then develop a two-pass error correction algorithm that effectively discards erroneous reads and corrects the remaining majority to be more accurate. In the first pass, we use the digital normalization algorithm to collect novelty and discard reads that are already represented at sufficient coverage. In the second, a read-to-graph alignment strategy is used to correct reads, with several heuristics employed to improve performance. We evaluate the algorithm with an E. coli dataset as well as a mock human gut metagenome dataset and find that the error correction strategy works as intended.

To my wife, Heather.

ACKNOWLEDGMENTS

I could not have finished this dissertation without the assistance of a great number of people. First of all, I want to thank my advisor, Dr. C. Titus Brown. He was an outstanding mentor to me personally and professionally. His commitment to open science was what initially drew me into his lab, but he has since taught me how to be an effective interdisciplinary scientist. I am deeply grateful to have been able to learn from him over the past few years. I also thank the other members of my thesis committee, Dr. Ian Dworkin, Dr. William Punch, and Dr. Yanni Sun, for their guidance and direction. Thanks for providing me critical feedback and teaching great courses. Thanks to the members of the GED Lab for their camaraderie over the years.
I learned a great deal from being in a lab consisting of such a diverse group, and I’ll always remember the great times at lab get-togethers. Thanks to Dr. Adina Howe and Dr. Arend Hintze in particular for being excellent research mentors to me. Lastly, I am thankful to my wife, Heather, as well as my parents for their love and support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1 Introduction
  1.1 Overview
  1.2 Assembly Algorithms
  1.3 Metagenomics
  1.4 Why Assemble?
  1.5 Problem Statement
  1.6 Significance of Research
  1.7 Outline of Dissertation

Chapter 2 Review of Relevant Literature
  2.1 Data Structures in Assembly
    2.1.1 Hash Tables
    2.1.2 Trie
      2.1.2.1 Suffix Tree
      2.1.2.2 Suffix Array
      2.1.2.3 FM-Index
  2.2 Assemblers
    2.2.1 Genome Assembly
    2.2.2 Metagenome Assemblers
      2.2.2.1 MetaVelvet
      2.2.2.2 Meta-IDBA
      2.2.2.3 Ray Meta
  2.3 Recent Works in Metagenomics
  2.4 Sequence Filtering
    2.4.1 Abundance Filtering
    2.4.2 Digital Normalization
  2.5 Graph Alignment
  2.6 Error Correction
    2.6.1 Quake
    2.6.2 Hammer
    2.6.3 SEQuel
    2.6.4 Musket

Chapter 3 Probabilistic de Bruijn Graphs
  3.1 Properties of de Bruijn Graphs
  3.2 Probabilistic de Bruijn Graphs
  3.3 Percolation of DNA de Bruijn graphs
  3.4 Effect of False Positives on Long-Range Graph Connectivity
  3.5 Partitioning a Metagenomic Dataset
  3.6 Sequencing Errors Outnumber False Positives
  3.7 Memory Efficiency vs. Exact Data Structures
  3.8 K-mer Counting with Counting Bloom Filters
  3.9 Implications for Non-Probabilistic de Bruijn Graphs
  3.10 Data
    3.10.1 Genomic Data
    3.10.2 Metagenomic Data

Chapter 4 Read Error Correction
  4.1 K-mer Abundance Distributions
    4.1.1 K-mer Abundance Distribution of One Genome
    4.1.2 Digital Normalization Convergence
  4.2 Error Correction Algorithm Components
    4.2.1 Read to Graph Alignment
    4.2.2 Two-pass Error Correction
    4.2.3 Error Region Heuristics
  4.3 Error Correction on Genome Sequence Data
  4.4 Evaluation on Human Gut Mock Dataset
  4.5 Data
    4.5.1 E. coli Dataset
    4.5.2 Human Gut Mock Dataset

Chapter 5 Conclusions
  5.1 Summary
  5.2 Future Work

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1 Partitioning results on a soil metagenome at k=31.
Table 3.2 Effects of loading E. coli data at different false positive rates.
Table 4.1 Scores for error correction graph alignment.
Table 4.2 Error correction and mapping results on HMP Mock Dataset.
Table 4.3 Even HMP Mock Dataset: raw versus error corrected.
Table 4.4 Staggered HMP Mock Dataset: raw versus error corrected.

LIST OF FIGURES

Figure 1.1 Three main types of assembly graphs.
Figure 3.1 Example of a simple probabilistic de Bruijn graph with k=4, consisting of two reads. Due to the false positive rate, an erroneous k-mer, AGGA, exists though it doesn’t exist in the set of reads. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.
Figure 3.2 Illustration between a classical Bloom filter (left side) and the Bloom filter in the khmer software package (right side).
Figure 3.3 Comparison between the optimal number of hash tables in khmer with the theoretical optimal for Bloom filters.
Figure 3.4 Graph visualization of a circular chromosome at four different false positive rates (1%, 5%, 10%, and 15%).
Figure 3.5 Average cluster size versus false positive rate. The average cluster size appears to approach a vertical asymptote.
Figure 3.6 A 2D lattice.
Figure 3.7 The diameter (i.e. longest shortest path) decreases abruptly after the percolation threshold, suggesting long-range connections within the graph structure.
Figure 3.8 After partitioning a simulated dataset at increasing false positive rates up to 15%, the number of partitions found does not change. Closer to the percolation threshold, the partitioning process becomes more difficult computationally.
Figure 3.9 Comparison between a Bloom filter at four different false positive rates and the information-theoretic lower bound for four values of k.
Figure 4.1 K-mer abundance distribution using simulated reads with 20 reads per k-mer, a circular genome size of 100,000 k-mers, and 200,000 reads.
Figure 4.2 The number of reads kept and discarded as reads are processed through a simulation of the digital normalization algorithm in the same experiment as Figure 4.1.
Figure 4.3 Simplified example of a graph alignment beginning at a seed k-mer. Two paths exist due to a single SNP, which forms a bubble structure. The goal is to correct the read by replacing the substituted base with the original one. The abbreviations for the vertex classes are as follows: MN = match and non-erroneous, ME = match and erroneous, SN = substitution and non-erroneous.
Figure 4.4 The raw k-mer abundance distribution of an E. coli reads dataset. Note that the y-axis uses logarithmic scaling.
Figure 4.5 Illustration of the error footprint for a read. Each binary number represents the status of one k-mer: if the number is 1, the k-mer is deemed correct; if 0, the k-mer is considered erroneous. Rather than aligning an entire read, the graph search need only span an erroneous region in the graph.
Figure 4.6 Histogram of the lengths of “error regions” in 100,000 reads of an E. coli reads dataset. The spike at 20 is due to a single base pair error causing k erroneous k-mers; in this case, k = 20. The spike at the end is due to reads with no k-mers called as correct.
Figure 4.7 Histogram of the largest error region in a 100,000 read subset of the E. coli reads dataset.
Figure 4.8 The k-mer abundance distribution of the E. coli dataset after running the digital normalization algorithm with C=20. Note that the y-axis uses logarithmic scaling.
Figure 4.9 K-mer abundance distributions of the E. coli dataset after error correction using both the genome (top) and the dataset post-digital normalization (bottom) as the abundance distribution used for error correction.

Chapter 1 Introduction

1.1 Overview

In Bioinformatics, the genome assembly problem is among the most challenging and pressing to date.
Informally, the goal of assembly is to determine, from shredded pieces of DNA fragments, the original DNA sequence. Though assembly algorithms have been in development for decades, an accurate and efficient assembler for all but the simplest genomes has not been achieved, for reasons we will discuss briefly. One principal reason for the difficulty in assembling genomes is the proliferation of so-called next-generation sequencing technologies. Though these newer platforms have dramatically reduced the cost of sequencing, they typically produce shorter, more error-prone reads, between which it is more difficult to find overlaps. For example, Illumina’s HiSeq 2000 System is capable of producing 6 billion reads of approximately 100 base pairs in length. In contrast, the traditional, capillary-based methods (i.e. Sanger sequencing) have generally produced longer reads with fewer errors but at a greater cost and lower throughput. As newer technologies are developed, the phrase next-generation sequencing becomes more of a misnomer, as companies are constantly developing new approaches that produce a combination of higher throughput, fewer errors, and longer reads.

1.2 Assembly Algorithms

As sequencing technologies have evolved, so have the assembly algorithms. The primary method for assembling traditional Sanger data is called Overlap-Layout-Consensus (OLC). These assemblers typically apply three primary stages using the long reads that capillary-based sequencers produce. In the Overlap stage, an assembly/overlap graph is constructed by checking for overlaps between reads. In the next step, the reads are laid out along a putative genome. Finally, the reads are aligned to one another and a consensus sequence is determined. Examples of such assemblers include Celera[36], PCAP[20], and Arachne[2]. Due to the large number of reads generated by next-generation sequencing platforms, the overlap step of OLC assemblers became intractable, being quadratic with respect to the number of reads. To better handle these types of datasets, k-mer graph, or DNA de Bruijn graph, methods took over as the predominant assembly approach. Rather than checking for overlaps among entire reads, each read is broken up into words of fixed size k. Each k-mer is stored in a data structure, generally a hash table, to enable fast lookup. An assembler can represent k-mers as vertices in an assembly graph, where two k-mers share an edge if they share a k-1 word overlap. The algorithms for traversing the graph to generate contigs are assembler-dependent. One of the first de Bruijn graph assemblers, EULER, finds an Eulerian path through the graph[40]. Other assemblers, such as Velvet[55] and SOAPdenovo[28], use Dijkstra’s algorithm to detect sequencing errors and heterozygosity between two k-mers. Assemblers apply their own heuristics to handle repetitive elements, errors, and heterozygosity[35]. Though the de Bruijn graph approach is the predominant one in next-generation sequence assembly, other methods exist. One such approach uses string graphs[37].

Figure 1.1: Three main types of assembly graphs. The figure shows an overlap graph, a de Bruijn graph (k=4), and a string graph constructed from four example reads (R1: GGGGCCAA, R2: GGCCAAAATA, R3: CCAAAAT, R4: AAATAGCAC).

A string graph is similar to the overlap graph used in OLC approaches, but it does not necessarily need to know all of the overlaps among reads. Instead, it requires only those read overlaps that cannot be transitively reduced.
For example, if read A overlaps with read B, read B overlaps with read C, and read A overlaps with read C (and reads B and C overlap the same side of the read as A), then the overlap from A to C does not need to be stored, since it is transitively implied by the overlaps from A to B and from B to C. Another feature of the string graph is that it does not contain the entire set of reads from a sequencing dataset. If a read is contained within another read (i.e. it is a substring), then it does not need to be part of the initial string graph. However, knowing all of the reads may be important for determining coverage levels that are useful for resolving repetitive regions and copy number variations (CNVs). Initially, the string graph approach was not utilized because the running time was still quadratic with respect to the number of reads. However, using the FM-index data structure, the running time was brought down to O(N)[45]. In the next chapter, we go into more detail about how the FM-index can be used to store assembly graphs and compare it to other data structures and approaches.

Figure 1.1 provides a visual comparison between the three main types of assembly graphs. One immediate observation is that the overlap graph and string graph scale with the number of reads, while the k-mer graph scales with the number of distinct k-mers in the sample. This implies a trade-off where longer reads make string and overlap graphs more memory efficient, while de Bruijn graph-based assemblers are better suited for shorter and more abundant reads. Another observation is that the string graph does not utilize read 3 since it is contained within read 2. This offers a potentially significant memory reduction since many reads may not be needed. Furthermore, sequencing projects utilizing both short and long reads benefit from this approach because many more short reads will be contained within the longer reads.

1.3 Metagenomics

The assembly problem is made even more challenging when presented with metagenomic data. Metagenomics, or environmental sequencing, is the study of sequence data obtained from environmental samples, which can contain different species of microbes with varying abundance levels. This is particularly useful when it is difficult to isolate a microbial species from a sample or when one is interested in the entire population as a whole. For example, one may be interested in the microbial population in the human gut or in a water sample from the ocean. Since many species exist in a sample, more questions can be answered by looking at the entire population than by isolating individual species, especially considering the complex dynamics that occur in microbial ecology. In contrast to metagenome sequencing, genomic sequencing is considered an easier problem because, other than issues such as contamination and PCR biases, only one genome is isolated and sequenced. A further hindrance for metagenomes is the amount of coverage needed. As of 2010, the total amount of environmental sequencing constitutes less than 1% of what would be found in a gram of soil[17]. In addition, due to the uneven abundances of microbial species, deep sequencing is needed to discover all of the diversity in such a sample. For example, if two species exist in a sample with one being 100 times more common than the second, one would need roughly 1000X coverage of the sample in order to obtain at least 10X coverage for the rarer species.
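To make the arithmetic concrete, the following back-of-the-envelope sketch (a hypothetical two-species mixture, not a calculation from any dataset in this dissertation) computes the total sequencing depth required to reach a target depth on the rarer member:

```python
# Hypothetical mixture: species B is 100 times less abundant than species A.
rel_abundance = 1 / 100                            # B relative to A
fraction_b = rel_abundance / (1 + rel_abundance)   # B is ~1% of the sampled DNA
target_depth_b = 10                                # desired coverage of species B
total_coverage = target_depth_b / fraction_b       # total depth required overall
print(round(total_coverage))                       # ~1010, i.e. roughly 1000X
```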
The difference between the least and most abundant species in a metagenomic sample is likely to be much greater in highly diverse populations than in others.

1.4 Why Assemble?

A variety of tools are available that do not require an assembly to obtain information about a metagenome sequence dataset. Furthermore, depending on the analysis, it may not make sense to obtain the shotgun sequence of the environmental sample. If the goal is to determine the diversity within a population, it is useful to target and amplify the ribosomal 16S sequence and sequence it with a platform that provides longer reads, such as 454. This allows one to construct a phylogenetic tree that can be compared with similar samples from different treatments or locations. If one is interested in the functional composition of a microbial population, then the entire metagenome should be sampled rather than the ribosomal 16S sequence. In this case, a variety of analyses are possible without assembling. Gene fragments can be obtained by looking at the six reading frames and searching through a protein database. Many tools, such as MG-RAST[34] and FragGeneScan[42], are available that can assist in metagenome sequence analyses that do not require an assembly.

However, despite the options for metagenome sequence analysis without a reference assembly, there are many limitations. First, because a sequence obtained from next-generation sequencers has a specific maximum read length, only gene fragments can be obtained rather than entire genes. This limits how effective metagenome analysis tools can be at analyzing a sample, since it may only be possible to find a match to a protein domain rather than a whole protein. This lowers the likelihood that a function can be assigned to the sequence. By assembling, one has the ability to pull out longer contigs and possibly large portions of entire genomes.

1.5 Problem Statement

Though current sequencing technologies have enabled deep sampling of highly diverse microbial ecosystems, algorithmic methods for analyzing these datasets are still in their infancy. Assembling less diverse metagenomic datasets is possible, especially with the computing resources available at large sequencing centers and supercomputing facilities. However, high-throughput sequencing is now inexpensive enough that labs with modest budgets are able to sequence environmental samples of interest. Thus, a clear gap exists between the ability to sequence complex microbial populations and the ability to analyze these large datasets. Though the running time of sequence assembly is a concern, the primary bottleneck that needs to be addressed is memory usage.

Another concern with metagenome sequence assembly is accuracy. Since an estimated 99% of microbes cannot be cultured in a laboratory[48], it is unlikely that reference genomes can be obtained for the majority of microbes in an environmental sample. Since many assembly validation techniques require the use of a reference genome, new techniques must be developed to produce accurate assemblies. Prior to assembly, one of the most important procedures in an assembly pipeline is read error correction. However, many techniques rely on being able to obtain the coverage of the sequenced genome. Metagenome assembly pipelines need methods for improving the accuracy of assemblies without the ability to compare to a reference.

1.6 Significance of Research

In this dissertation, we will establish a set of approaches that enable the assembly of extremely diverse metagenomic samples.
Currently, this is not possible even in high-performance computing facilities, let alone on commodity machines. By making metagenomic assembly easier to do on the cloud and in supercomputing facilities, we can enable researchers in any lab to obtain a metagenomic assembly. By attempting to tackle both the high memory and compute requirements of assembly, this research will reduce the bottlenecks that currently plague assembly pipelines.

Solving the metagenome sequence assembly problem has far-reaching consequences in basic and applied sciences. As mentioned previously, few microbes can be cultured in the laboratory, so sequencing environmental populations is one of the few ways to better understand individual microbial species and populations. By deep and broad sampling throughout the Earth, it is likely that novel genes will be discovered that can be used to produce better biofuels and lead to new insights into drug development. By improving soil assemblies, it is possible to better understand how soil and its microbial ecosystem sequester carbon, fix nitrogen, and provide a fertile location for crops to grow. Though studies on the human gut microbiome are in progress, better assemblies will lead to more insight into the interactions between microbial populations and human cells within the human ecosystem. Given that so much diversity has not been sampled, new approaches that make the analysis of metagenomic datasets tractable will lead to more discoveries.

1.7 Outline of Dissertation

To aid in the assembly of metagenomes, we use a Bloom filter (or counting Bloom filter, where appropriate) to store sequence data in order to accurately and efficiently pre-process next-generation sequencing datasets, which will improve the overall quality of assembled metagenomes. In the first approach, we will show how metagenome partitioning can greatly reduce the amount of memory needed to assemble datasets. We evaluate this approach on an E. coli dataset as well as a small soil metagenome. For the second part, we utilize a read error correction algorithm that uses k-mer coverage information to discern between erroneous and true k-mers. We evaluate the error correction procedure on another E. coli dataset and a mock human gut microbiome dataset. By combining these approaches, it is possible to assemble much larger datasets than with current methods.

Chapter 2 Review of Relevant Literature

DNA sequence assembly is a rapidly changing field. Early algorithms were designed to handle longer reads of higher quality, but next-generation sequencing platforms have forced Bioinformaticians to rethink previous approaches. In this chapter, we present several existing works on sequence assembly, particularly for metagenomes. In addition, we introduce pre-assembly filtering and error correction methods that are relevant to the framework of this dissertation. However, because a focus of this research is on scaling assembly, we begin our discussion with a review of data structures that are useful in assembly pipelines.

2.1 Data Structures in Assembly

To scale assembly, the choice of data structures is critical, as an assembly tool must be able to handle processing and storing billions of reads quickly. In this section, we will discuss many of the data structures used for k-mer storage in pre-assembly tools as well as those used for storing assembly graphs.
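As a minimal illustration of what such k-mer storage has to support — independent of any particular assembler, and using a plain Python dictionary purely as a stand-in for the data structures surveyed below — two operations recur throughout this chapter: recording the k-mers of a read set, and querying a k-mer's possible graph neighbors.

```python
def build_kmer_counts(reads, k):
    """Record every k-mer in the read set with its count (the role a
    hash table typically plays in de Bruijn graph assemblers)."""
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def neighbors(counts, kmer):
    """Return the stored k-mers adjacent to `kmer` in the de Bruijn graph:
    up to four sharing its (k-1)-suffix and up to four sharing its
    (k-1)-prefix."""
    found = set()
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate != kmer and candidate in counts:
                found.add(candidate)
    return found
```

Every data structure discussed in this section is, in essence, a different way of making these two operations fit in memory and run quickly at the scale of billions of reads.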
2.1.1 Hash Tables Hash-style data structures are likely the most popular choice for storing assembly graphs, particularly in de Bruijn graph-based assemblers. Hash tables are one of the most important 10 data structures in Computer Science, and many variants exist. A hash table maps a key to a value. For k-mer based assemblers, generally the k-mer is the hash and the value is the number of occurrences for that k-mer in the dataset. Once k-mers are stored, it is possible to query for the existence of neighboring k-mers to traverse the de Bruijn graph of the dataset and assemble contigs. For a representative example of how a hash table is used in an assembler, we will discuss how ABySS[47] implements a distributed de Bruijn graph. ABySS uses Google’s SparseHash implementation of a hash map. This is advantageous because within the range of k used in assembly (generally greater than 20), the number of possible k-mers is much larger than the actual number in a dataset. Thus, using SparseHash keeps the memory overhead relatively low. For each k-mer, 8 bits are used to represent the up to 8 possible edges that a k-mer can have. There are eight possible neighbors because a k-mer could have four k-mers whose (k−1)-mer prefix matches the (k−1)-mer suffix, and vice versa. If the corresponding neighbor exists, then the bit is flipped. This is advantageous to querying the k-mer itself because if the bin where the neighbor k-mer is located is on another node, it will be considerably slower than simply checking the status of a bit. To distribute the assembly graph across multiple nodes, ABySS uses a simple scheme. Each k-mer (with its reverse complement) is converted to a number, say i. If there are n hash tables, then the k-mer is located at node i%n. Though this is simple to implement, there are some drawbacks depending on the type of high-performance computing environment where the assembler is being run. Due to the nature of de Bruijn graphs, the neighboring k-mers for a particular k-mer are probably on different nodes. The 8 bit scheme discussed earlier partially remedies this by keeping the check for edges local to the node of a given k-mer, but if a neighboring k-mer does indeed exist, then the node on which that k-mer exists must be 11 queried. Thus, components in the assembly graph are likely to be located on several different nodes if many nodes are used by the assembler. In the Genome Assembler section, we will discuss in greater detail how ABySS implements a k-mer graph. Hash tables offer a number of advantages over other data structures concerning scaling behavior. As we discussed previously with ABySS, it is possible to easily distribute hash tables on multiple nodes. Another advantage is that while query and insertion times are not guaranteed to be constant time given a sub-par hash function, for practical purposes hash tables are generally very fast. However, handling collisions can add to the memory footprint. 2.1.2 Trie Though a regular trie data structure is rarely used in sequence assemblers, special cases such as suffix trees, suffix arrays, and so on have been used in assemblers in the past[8, 18]. Trie-style data structures offer a number of interesting properties and are useful for string processing applications in Bioinformatics. A trie, or prefix tree, is a tree where the root is an empty string and words build from it. The first character in each word of the data structure is a child node of the root. 
Thus, if two words share the same prefix, they share a set of nodes that contains that prefix. Several variants of tries exist that offer a number of advantages over a basic trie. 2.1.2.1 Suffix Tree An important building block in examining trie-style data structures in assemblers is the suffix tree. A suffix tree is a trie where each possible suffix of a tree is added to a trie. In 1973, the first linear-time construction of a suffix tree enabled faster string searching at the expense of a string indexing step[53]. However, one drawback that keeps classic suffix trees 12 from being useful in Bioinformatics is that they are more memory intensive than the leading approaches. 2.1.2.2 Suffix Array Though suffix arrays[31] have not been extensively used for de Bruijn graph-based assembly (though exceptions such as the Edena[18] assembler exist), they have been useful for a variety of pre-assembly filtering and sequence analysis purposes. Suffix arrays use up to three to five times less memory than suffix trees[53]. A related data structure, the FM-index[16], has recently been used as a way to store string assembly graphs, which we will elaborate on briefly. To construct a suffix array, first one must generate all of the suffixes for a string (or a set of strings). These suffixes are then sorted, and the starting position of the suffix is stored as the array element where the array index is the sorted order for that suffix. The result is a single array with the number of elements equal to the size of the string. However, this alone does not enable fast lookup since a binary search would be required to determine whether a pattern exists in the original string. Adding an additional data structure such as an LCP (longest common prefix) array improves the speed of searching. Suffix arrays enable many rapid string operations that are amenable for Bioinformatics algorithms. One application for this data structure is for k-mer counting. Jellyfish[32] and tallymer[50] are two examples of k-mer counting software packages that utilize some kind of suffix array. Though suffix arrays improve the speed of searching for a pattern within a dataset, they can significantly add to the memory footprint. Tallymer tackles this problem by giving the user an option to split the suffix array into several different partitions. If the array is divided into two partitions, then only half is put into memory at a time. If 13 the entire suffix array cannot fit into the amount of memory given, this can significantly improve the running time of the data structure construction even if multiple passes through the dataset are required. Jellyfish handles the problem differently; it uses a specialized, memory-efficient hash table and takes advantage of compare-and-swap in modern CPUs to enable parallelization. Combining these approaches allow jellyfish to be much more memory efficient than tallymer and other k-mer counters. 2.1.2.3 FM-Index Short read mappers such as Bowtie[26] and BWA[27] used the Burrows-Wheeler transform (BWT)[7] to reduce the memory requirements for storing the reference genome on which the short reads are mapped. However, the use of the BWT was not used in assembly algorithms until a method for storing a string graph with an FM-index was discovered[16]. An FMindex, which we will describe in greater detail below, uses the BWT as a component of the FM-index. This approach led to the creation of the String Graph Assembler (SGA)[46]. FM-indices offer a number of advantages compared to other data structures. 
However, while FM-indices enable fast lookup after graph construction, constructing the data structure from a set of reads is relatively time consuming. Another issue is that the data structure scales with the number of reads inserted rather than k-mers. With effective compression, FMindices can compare favorably to other implementations, but for highly diverse metagenomics datasets this is not the case. The first step of an FM-index is to take the Burrows-Wheeler Transform of the set of reads. The transform takes every cyclic permutation of a string and then sorts them. Since each read is a separate string, a unique character can be used between each read to enable fast lookup by read ID. One important feature of the Burrows-Wheeler Transform is that it 14 can compress the reads due to the way the strings are sorted. In addition to this transform, the FM-index requires two additional data structures: a counts table and an occurrences table. The counts table contains a simple lookup between a character (which could only be the four letters used to represent nucleotides or an expanded alphabet) and the number of times it appears in the dataset along with any characters lexicographically smaller. The occurrences table provides a lookup for the number of times a character occurs in a prefix of the transformed text. Using these three data structures together, it is possible to enable fast queries of the text. When reads are added to an FM-index, it is possible to determine the reads that overlap another using the algorithm described in [45]. However, this algorithm returns all reads that overlap rather than an irreducible set, which they also include. One drawback to the paper (which is explicitly discussed) is that it does not handle mismatches, which is necessary for real sequencing data. In another paper, the same authors outline the algorithm used in SGA that enables the assembler to handle inexact overlaps[46]. 2.2 Assemblers Currently, most mainstream assemblers are tailored specifically for genome assembly. Some assemblers, such as SOAPdenovo[41] include an option on the command line to denote a metagenome assembly. In other cases, 3rd party add-ons are built around the original assembler to handle non-genomics datasets (e.g. MetaVelvet[38]). Finally, many other assemblers are built specifically for metagenome assembly. We will discuss many of the approaches and current metagenome sequencing projects that have occurred to date. 15 2.2.1 Genome Assembly EULER[40] was the first assembler that used a de Bruijn graph and was published in 2001, a few years before the first 2nd generation sequencers were released. Thus, EULER made use of the long reads produced by Sanger, and it was built primarily to resolve repeats rather than improve memory efficiency. It also made use of k-mer abundance based error correction. Once sequencing platforms from Solexa (now Illumina) and 454 were released, many new assemblers emerged. Velvet[55] was released in 2008 and is still maintained and used in many assembly pipelines today. In addition to being a de Bruijn graph assembler, it includes procedures for removing tips (i.e. erroneous, hanging ends in the read) and bubbles (i.e. cycles in the graph caused by errors or heterozygosity). ABySS[47] was released shortly after and included many similar error correction methods. What distinguished ABySS was its use of a distributed hash table to enable high-memory assembly of large genomes. 
However, because k-mers that are part of the same component frequently end up on different nodes, the assembler can run into performance issues due to the speed of the network. A recent assembler/variant caller, Cortex[21], introduced the concept of colored de Bruijn graphs. Rather than loading reads from different samples into the same bins in a hash table, Cortex assigns a different color to each sample so that variants such as SNPs, indels, and structural variation can be called from the assembly phase. Cortex was primarily designed for bacterial variant calling and for the human genome project to enable the ability to call variants in 1000s of different human genomes. 16 2.2.2 Metagenome Assemblers Assemblers created specifically for metagenomes have not been available until recently. One reason for this is that sequencing platforms have only now been able to produce the deep sampling required to probe highly diverse samples. Another reason is that they are inherently more challenging than genome assemblers: while many genome assemblers can take a more or less uniform sampling for granted, metagenomes contain varying abundance levels. Furthermore, with the sequencing of a single species, while heterozygosity is a challenging problem when determining whether a SNP is real or a sequencing error, the concept of a single species is more fuzzy for a population of microbes. Thus, metagenome assemblers are still in their infancy and there is much room for progress in the field. 2.2.2.1 MetaVelvet One such assembler is called MetaVelvet[38]. MetaVelvet uses much of Velvet’s code base (and thus is a de Bruijn graph based assembler) but is tailored to handle the uneven coverage level found in most metagenomes. One key feature behind MetaVelvet is that it does a type of partitioning to separate different microbial species based on graph connectivity and coverage level. Whereas Velvet truncates sequence when the coverage level in the graph differs largely from the expected coverage, MetaVelvet uses a mixture model to approximate different coverage levels and separates them in order to prevent chimeric contigs. This is advantageous compared to neglecting to incorporate coverage levels because species that share sequences could end up being assembled together, leading to a misassembly. While MetaVelvet was shown to produce better results (measured by N50 length and other metrics) compared to other methods, it is very memory intensive, which makes it impractical for large datasets. 17 2.2.2.2 Meta-IDBA Another assembler created specifically for metagenomes is called Meta-IDBA[39]. MetaIDBA, too, uses a de Bruijn graph for assembly. The authors make a distinction between cr-branches, which are branches in the graph that result from similar genes, and sp-branches, which are branches that result from variation, such as SNPs and indels. These types of branches do not occur as often when assembling a single genome, so Meta-IDBA is designed to detect and handle these cases. However, the method to resolving these issues are similar to genome assemblers, except when encountering a branch (i.e. a bubble or tip as described in Velvet), both paths are generally kept while a single genome assembler would generally discard one branch as an error or treat the bubble as being caused by heterozygosity. 2.2.2.3 Ray Meta In late 2012, a metagenome assembler was published called Ray Meta[5], which specializes in distributed assembly of metagenomes. 
It is based off a genome assembler called Ray[4], which also focuses on the scalability of genome assembly through a better distributed computing model. Ray Meta is a de Bruijn graph based assembler, but rather than partitioning the graph based on coverage levels, component finding, or graph simplification, it applies a “heuristic-guided graph traversal” that result in good assemblies. To demonstrate the effectiveness of the approach, the authors simulated two metagenomes containing 100 and 1,000 genomes where each genome had a different coverage level. They also assembled and analyzed a dataset of 124 gut microbiomes. These results were compared with the MetaVelvet assembler with the Ray Meta assembler showing superior results. Ray Meta shows that distributing a de Bruijn graph generated from metagenomic data is effective in scaling. However, it is not clear how the approach will scale to highly diverse 18 datasets like soil since a simulated dataset of 1,000 genomes required 1,024 cores with 1.5GB of memory per core. A soil sample could potentially contain 1,000 times more genomes and the graph could be more complex. Nonetheless, the approach is well-demonstrated and will likely be improved upon to handle such scale. 2.3 Recent Works in Metagenomics Though some types of metagenomics, particularly those that used 16S ribosomal sequencing, have been around for a longer period of time, it was not until recently that shotgun sequencing of environmental samples has become popular. A seminal work in the field involved shotgun sequencing an Acid Mine Drainage sample[51]. Due to the high abundance and low heterozygosity of only a handful of the species in the sample, it was possible to almost completely construct two genomes (Leptospirillum group II and Ferroplasma type II) and partially reconstruct three additional genomes. Another important work in metagenomics was led by Craig Venter where they sequenced a water sample from the Sargasso Sea[52]. They obtained approximately a gigabase of data, and attempted an assembly using the Celera assembler[36]. However, they experienced a now-commonly encountered problem where reads from highly abundant species would be flagged as repetitive regions in the assembler, which would result in a poor assembly. To resolve this, they had to incorporate some manual steps to resolve the issue. Ultimately, 1,800 different species were found where 148 were newly-discovered. They also discovered more than a million genes, suggesting that a great deal of novelty is yet to be discovered using metagenomics. As this paper covered a small number of samples in the Sargasso Sea, different communities are likely to live in other waters and depth levels. Furthermore, 19 because Sanger sequencing was used, there is likely even more to be discovered in the water sample taken. Both of the previously mentioned works used Sanger sequencing to obtain their data. More recently, a number of works have been published using Illumina technology. The MetaHIT Consortium published the initial results of their international human gut microbiome project[41]. They sequenced the microbes in 124 different faecal samples with the goal of obtaining a gene catalog. Overall, approximately 576Gb of sequence data was obtained. Because of the short reads in Illumina platforms, they used the SOAPdenovo assembler to obtain contigs before cataloging the genes found. Initially, the dataset for each sample was assembled separately, and then any of the reads leftover were pooled together and assembled. 
Prior to any assembly, the reads were filtered to remove any low abundance reads that would be unlikely to be part of any assembly. From the resulting assemblies, they found over three million genes in total. Other metagenomics efforts that have required assembly include the cow rumen[19] and permafrost soil[30]. To assemble the cow rumen metagenome dataset, a filtering program developed at the Joint Genome Institute was used to remove well-known sequencing artifacts. Furthermore, low abundance reads were removed since their coverage level would not meet the threshold required by most assemblers. The permafrost soil dataset was filtered and assembled in a similar manner, as it was also a project of the Joint Genome Institute. In 2012, Iverson et al. reported that they were able to close the assembly of a highly predominant microbial species found in the Puget Sound[22]. The genome was from marine group II Euryarchaeota, which has not yet been cultured. They used several custom filtering steps prior to assembly, primarily for dealing with SOLiD reads. These procedures perform simple read trimming using quality scores, read error correction, and PCR amplification bias removal.

Frequently, previously published metagenome datasets are reanalyzed to reproduce the results of a paper or to look for new insights. For example, Wu et al. developed a method to look for previously unknown branches in the evolutionary tree[54]. Using their approach, they identified homologous genes from the Sargasso Sea dataset that had not been characterized. They concluded that these may be from viruses, ancient paralogs, or even genes belonging to organisms only distantly related to any that have been discovered and characterized. However, because neither these genes nor the organisms or viruses to which they belong have been studied in greater detail, it is impossible to determine with certainty how the homologs should be classified.

2.4 Sequence Filtering

Due to the large size of datasets, it is becoming more common to filter out a large number of reads to make it feasible to assemble or do some type of high-memory analysis of the data. However, several different methods exist, and they can pose problems for downstream analyses. We will introduce a couple of these methods and discuss their advantages and shortcomings.

2.4.1 Abundance Filtering

Most commonly for metagenomic datasets, reads are filtered by using k-mer coverage as a way of removing erroneous k-mers and reads[41, 19]. In some cases, a read is trimmed at the position where its k-mer coverage falls below a certain threshold as one walks from the 5’ end of the read to the 3’ end. If all of the k-mers in a read fall below the threshold, then the entire read is discarded. This method has been effectively used to lower the memory requirements for assemblies, specifically for metagenomic datasets. However, the biggest shortcoming of this approach is that low-abundance novelty will be discarded since coverage levels are uneven.

2.4.2 Digital Normalization

A recently developed method, digital normalization, looks at the median k-mer coverage of a read[6]. If the median falls below a specific threshold, the read is kept; otherwise, it is discarded. While the algorithm is simple and easy to implement, it has the effect of evening out the level of coverage in a dataset. This is particularly attractive for metagenomic datasets because most assemblers expect an even level of coverage for parameter estimation.
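The decision rule is compact enough to sketch in a few lines. The following is a minimal illustration only — it uses an exact Python dictionary as the k-mer counter, whereas the implementation discussed below relies on a memory-efficient probabilistic counter — but it shows the single streaming pass over the reads:

```python
from statistics import median

def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only if the median abundance of its k-mers, among the
    reads retained so far, is still below the cutoff C; retained reads
    then contribute their k-mers to the counts."""
    counts = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        if median(counts.get(km, 0) for km in kmers) < cutoff:
            for km in kmers:
                counts[km] = counts.get(km, 0) + 1
            yield read
        # otherwise the read is redundant at this coverage and is discarded
```

Reads drawn from regions that are already well covered fail the median test and are dropped, which is what flattens the coverage distribution.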
The digital normalization algorithm relies on the fact that the coverage levels of the k-mers within a read are similar to one another. By using a Count-Min Sketch[11], a data structure that enables memory-efficient counting with a false positive rate and possible overestimation, it is possible to process large amounts of sequence data with a small memory footprint. By treating a sequence read dataset as a stream, the digital normalization algorithm can be considered an online algorithm, as it only requires one pass through a dataset. Though the digital normalization paper used genome and mRNAseq datasets, the results clearly showed that the “correct” k-mers were retained while a large portion of erroneous k-mers were thrown out.

2.5 Graph Alignment

One recent advance in gene-targeted assembly combines A* graph search with the alignment of profile hidden Markov models (profile HMMs) to de Bruijn graphs in a software package called Xander (Jordan Fish, personal communication). In this approach, the profile HMMs are treated as the query sequence and a set of sequence reads is considered the database. The sequence reads are initially loaded into a Bloom filter with a low false positive rate. To assemble a gene, one searches the corresponding profile HMM against the de Bruijn graph found in the Bloom filter. More accurately, the search is done by combining the elements of both k-mers and the states of a hidden Markov model. Thus, a vertex has a k-mer, read position, and state (e.g. insertion, match, deletion). The edges connecting these vertices are determined by the probabilities of the HMM. Since the database of sequence reads is in nucleotide space, the profile HMM must be transformed to nucleotide space in order to find seeds. The graph is then walked three bases at a time to ensure that valid codons are being searched. Once seeds are found, an A* graph search strategy is applied using the optimistic heuristic that the remaining nodes in the path to the goal (i.e. the end of the alignment) will be perfect matches. Because this heuristic is optimistic, it is admissible, and thus the graph search strategy will find a shortest path, i.e. an optimal alignment.

The biggest advantage of the approach in Xander is that it gives biologists a method for finding genes of interest in metagenomic samples. For many reasons, especially uneven coverage and errors, a de novo metagenome assembly is unlikely to assemble all of a biologist’s genes of interest even if they exist in the dataset. However, using a profile HMM introduces biases due to the fact that profile HMMs are generally built using genes that have already been discovered. A homologous gene in a metagenomic sample (or any sample for that matter) could belong to a phylogenetically under-sampled outgroup, or the gene may have undergone rapid selection compared to other genes in the family. These biases would cause some of the novelty in the original set of reads to be lost in favor of the biases of the profile HMM. Another shortcoming of the approach is that the Bloom filter does not incorporate coverage information. This makes it difficult to discern between a sequencing error and true novelty. Incorporating this information could be a way to overcome some of the biases with using profile HMMs that were mentioned previously.

2.6 Error Correction

Given that errors in reads are one of the most challenging aspects to address in assembly, many error correction algorithms have been developed to attempt to address these challenges.
While many error correction algorithms are implemented as separate packages that can be incorporated into an assembly pipeline, many assemblers include their own error correction procedures. In this section, we will focus on the error correction algorithms that are most closely related to the error correction method that will be introduced in a later chapter. While these algorithms are primarily developed for genomic reads, we will address how they may be useful for metagenomic datasets as well. 2.6.1 Quake One read error corrector for whole genome assembly that has been part of many assembly pipelines for evaluation purposes such as GAGE[44] and the Assemblathon[13] is Quake[23]. The central idea behind Quake is that it incorporates base quality values into the k-mer coverage distribution of the dataset in order to better discriminate between erroneous kmers and low coverage true k-mers. To determine the k-mer coverage distribution, all of the k-mers must first be counted. However, the count for a k-mer is not simply incremented each time it is discovered. Instead, for each base in the k-mer, the probability that the base is correct based on the quality value is multiplied. Initially, Quake used a Hadoopstyle algorithm to quickly count the k-mers. However, after the Jellyfish[32] k-mer counting software package was released they switched to using it by default. Quake uses a mixture 24 model on the k-mer coverage distribution to predict whether or not a k-mer is erroneous. For the true k-mers, a Gaussian distribution was used for coverage, and the zeta distribution was used for copy number. For the error k-mer distribution, the gamma distribution was used. In order to correct a read, each k-mer is determined to be either “trusted” or not. The end goal is to correct the reads so that all of the k-mers are trusted. By looking at the pattern of erroneous k-mers in a particular read, Quake localizes the errors and then uses maximum likelihood to find the most probable set of trusted k-mers. One of the biggest advantages to using Quake is that it converts a regular k-mer into “q-mers” by incorporating base quality values to prevent low coverage real k-mers from being falsely changed. However, as is, Quake would be ineffective in correcting metagenomic datasets. The primary reason is that erroneous and real k-mers are much more difficult to separate by using the k-mer coverage distribution in these datasets. Since the abundance levels for species in environmental samples differ, it is unclear what the expected k-mer coverage level will be. Furthermore, it seems unlikely that the mixture distribution used in Quake would be effective for metagenomes, particularly the use of the Zeta function to estimate copy number. In order to make the use of a Quake-style error correction algorithm, one possibility would be to incorporate digital normalization to even out the coverage level. 2.6.2 Hammer Not all error correction software packages use the k-mer coverage distribution to correct reads. One such package, called Hammer[33], uses a Hamming (edit-distance) graph to attempt to handle datasets with non-uniform coverage such as single-cell sequencing, mRNAseq, and metagenomics. In the first step of the error correction algorithm, Hammer finds the set of distinct k-mers 25 in the reads. Then, it finds all of the components in the Hamming graph using a “spaced seed” approach since comparing each pair of k-mers would be computationally intractable. 
A union-find data structure is used to quickly determine which k-mers are part of the same components. Next, each component is processed in order to attempt to find a consensus k-mer. If the size of the component is 1, then the k-mer is either correct or it has too many errors to connect within the edit-distance parameter that was set. In a manner similar to Quake, Hammer weights the score of a k-mer using its quality value. If the size of the component is greater than 1, then the highest occurring k-mer in the cluster is considered to be the consensus. Hammer was shown to perform better than other error correctors on a single-cell sequencing dataset, but it did not perform as well as Quake on a regular E. coli dataset. One advantage to Hammer compared to other k-mer based approaches is that it is able to reconstruct k-mers that are not present in the dataset. This is particularly useful for Hammer’s intended purpose, single cell sequencing datasets. While approaches that use Hamming graphs are useful, there are many drawbacks to consider. First, if one were to take into account indels, the Hamming graph is likely not a suitable candidate approach because the problem becomes more computationally prohibitive. A drawback to Hammer is that it assumes that no more than one k-mer in a cluster is correct. This calls into question the use of the software package on non-microbial genomes that are large and/or highly repetitive. 2.6.3 SEQuel While most error correctors are generally employed in an assembly pipeline prior to assembly, some correction methods are used to refine an assembly and thus are placed after the assembly step in a pipeline. This is the idea behind SEQuel[43], which is a post-assembly error 26 correction tool specifically tailored for single and multi-cell sequencing datasets. As with Hammer, SEQuel specializes in datasets with non-uniform coverage. SEQuel goes through each contig and finds the reads that have a unique mapping to that contig. From here, SEQuel constructs what they coin a positional de Bruijn graph. This is similar to a de Bruijn graph except that it also accounts for the k-mer’s position within a contig as well, so the same k-mer could occur in different vertices in the graph. Along with the k-mer sequence and position is the count that is obtained from the original mapping of the read to the contig. Since the k-mers are anchored to the contig, the graph is much simpler, which makes refining the assembly easier. Given the simplified graph, common errors such as bubbles can be corrected, and indeed SEQuel is shown to correct a large number of substitution and indel errors in the final assembly. One of the biggest shortcomings to SEQuel is that it relies on a read mapper. Unfortunately, this is not easy to overcome but it limits the capabilities of the algorithm to the sensitivity and specificity of the mapper and the parameters chosen for that mapper. Another issue is that it is unclear how well the algorithm would work with more complex genomes because the paper only used microbial genomes as examples. However, what was perhaps most relevant to this dissertation is the use of a positional de Bruijn graph, which can place a k-mer at separate positions in a read and can simplify the overall structure of the assembly graph. 2.6.4 Musket A more recent addition to read error correction software packages is Musket[29]. 
Musket uses many of the features from the previously mentioned error correctors, along with parallelization support, making it highly effective in both accuracy and performance. There are two stages to the error corrector: k-mer counting and coverage cut-off analysis, followed by the actual error correction stage.

In the first stage, Musket uses a Bloom filter to eliminate as many distinct k-mers as possible. Thus, only k-mers that have already been seen by the Bloom filter make it into the hash table. Using a Bloom filter produces false positives, but no non-distinct k-mers will be lost during the process. To parallelize this process, a master-slave model is used where k-mers are dissected from reads by master nodes, which then distribute the k-mers to slave nodes. Each slave handles a different set of possible k-mers. Once the hash table in each slave has been loaded with k-mers (after having distinct k-mers filtered out by the Bloom filter), all remaining distinct k-mers are removed from the hash tables, and the result is the k-mer abundance distribution of the dataset with all distinct k-mers removed. Finally, Musket uses the k-mer abundance distribution to estimate the cut-off between erroneous and trusted k-mers. Rather than using a mixed model as with Quake, Musket simply finds the point with the lowest k-mer count between the peaks, though this can be overridden by a parameter.

In the second stage, Musket combines three different methods to achieve error correction. First, it uses a "two-sided correction" that introduces the concept of trusted bases. A trusted base is simply a base that is part of any trusted k-mer, as determined in the previous stage. Next, Musket attempts to replace the untrusted bases with ones that would result in trusted k-mers. To speed up the process, only the end k-mers covering an untrusted base are used so that multiple combinations do not need to be tested. However, this approach is too conservative when used on its own and only works when there is a single error in a k-mer. To address this, Musket adds another step to its correction workflow that attempts to correct untrusted k-mers adjacent to trusted k-mers so that the resulting k-mer has the highest count of any of the given alternatives. Musket has further steps in its workflow to quality-check the corrections made and make changes as needed.

Being one of the most recently published error correctors, Musket offers a combination of features from other read error correction packages that give it good scaling characteristics as well as accuracy. The use of a Bloom filter is an effective way to screen out distinct k-mers that are likely to be erroneous. While some distinct k-mers make it through the first pass, they are taken care of in the next stage. Another interesting component in Musket is the use of the point of lowest density between the two peaks of the k-mer abundance distribution as the error cut-off. Other error correctors such as Quake use more complicated mixed models, but this simple approach seems to work just as well. While Musket is a viable option for correcting sequence datasets from genomes, it has yet to be proven in more complicated scenarios such as metagenomes. It is unlikely that the aforementioned cut-off method would work when examining the k-mer spectra of a metagenomic dataset. Thus, further pre-error-correction filtering would be required to handle these situations.

Chapter 3

Probabilistic de Bruijn Graphs

In this chapter, we introduce the idea of a probabilistic de Bruijn graph.
This novel graph representation uses a Bloom filter data structure to store a k-mer assembly graph. We start by describing useful properties of de Bruijn graphs, then describe how a Bloom filter can be used to store an assembly graph. We introduce an algorithm called partitioning that finds graph components in an assembly graph and apply it to an actual soil metagenome. Next, we analyze the point at which the partitioning approach breaks down due to an increasing false positive rate.

3.1 Properties of de Bruijn Graphs

In graph theory, de Bruijn graphs[12] are a well-studied class of graphs due to their interesting properties and their application to computer networks. Recently, they have been shown to be useful for analyzing DNA sequence datasets. In this section, we introduce some basic properties of de Bruijn graphs, define a DNA de Bruijn graph, and describe how such graphs are used as DNA sequence assembly graphs.

An $(s, k)$ de Bruijn graph is a graph where $s$ different symbols are used to form words of length $k$. If two words share an overlap of length $k - 1$, they are considered to be adjacent. Though a directed formulation is used in some literature, we will consider only simple de Bruijn graphs with no self-loops or directed edges. In total, there are $s^k$ vertices in the graph, and each vertex can have up to $2s$ adjacent vertices. While most vertices indeed have $2s$ neighbors, two special cases exist. For vertices that contain only a single repeated symbol (e.g. 0000 in a binary de Bruijn graph with $k = 4$ and $s = 2$), there are $2s - 2$ adjacent vertices, since we do not include the self-loop that would exist in a directed de Bruijn graph. The other case is a word consisting of an alternating sequence (e.g. 0101 in a binary de Bruijn graph with $k = 4$ and $s = 2$). These words (i.e. k-mers) have $2s - 1$ adjacent vertices because the directed version would contain parallel edges between that k-mer and the other alternating-sequence word.

One property of de Bruijn graphs that makes them a topic of study in computer networking theory is their short diameter, $k$. The diameter is generally defined as the "longest shortest" path: when comparing the shortest paths between every possible pair of vertices, the diameter is the longest of these paths.

We will now consider the application of de Bruijn graphs to DNA. In this case, we set $s = 4$ and restrict the alphabet to the symbols $a \in \{A, C, G, T\}$. The word size, $k$, can vary depending on the DNA sequence dataset of interest. Since DNA is double stranded, we may want to treat the reverse complement of a k-mer as the same vertex. In that case, rather than having a graph of $4^k$ vertices, the graph will contain $\frac{4^k}{2}$ vertices if $k$ is odd, or $\frac{4^k}{2} - 4^{k/2}$ if $k$ is even. The difference is due to the presence of k-mers that are their own reverse complement. These can be counted by taking the first $k/2$ bases of a sequence, constructing their reverse complement, and concatenating it to the original half. For example, if $k = 4$, we can choose GT to be the first two letters of a sequence. The reverse complement is AC, so the full sequence is GTAC, which is its own reverse complement.

3.2 Probabilistic de Bruijn Graphs

To better scale assembly, we used a Bloom filter[3] data structure to store k-mers originating from next-generation sequencing datasets. A Bloom filter is a hash-based probabilistic data structure for storing sets.
In the khmer software package, the Bloom filter is implemented as multiple hash tables, each with a different hash function. To add an element to the data structure, the bin that the k-mer maps to in each hash table is set to 1. To query whether a k-mer is present, one checks the corresponding bin in each hash table. If any bin contains a 0, then the k-mer is definitely absent from the set; if all bins contain a 1, then the k-mer might be present. Thus, while false positives are possible, false negatives are not. The false positive rate can be tuned by optimizing the number of hash tables used, given the expected number of k-mers to be inserted and the total amount of memory one is willing to allocate to the Bloom filter.

To load the Bloom filter, we step through each read, break it into k-mers, and insert each k-mer into the Bloom filter. After loading the set of reads, it is possible to traverse the de Bruijn graph by checking for the presence of each of the up to eight possible neighbor k-mers (those sharing a k − 1 overlap). Due to the one-sided error caused by using a Bloom filter, we call this graph a probabilistic de Bruijn graph, an example of which is illustrated in Figure 3.1. If we use an ancillary data structure that tags at least one k-mer per read, we can employ a graph traversal algorithm to discover which reads transitively connect to one another by finding components in the de Bruijn graph.

Figure 3.1: Example of a simple probabilistic de Bruijn graph with k=4, constructed from two reads (Read 1: ATGGACCA, Read 2: ACCAGATG). Due to the false positive rate, an erroneous k-mer, AGGA, appears in the graph even though it does not exist in the set of reads. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

It is important to note that the data structure used in khmer is not a Bloom filter by the classical definition. A classic Bloom filter consists of one hash table with k different hash functions, where each hash function is assumed to be independent, as this keeps the false positive rate as low as possible. In contrast, the data structure in khmer consists of k different hash tables, each with a different hash function, and the hash functions used in khmer are not necessarily independent. Figure 3.2 illustrates the difference between the two data structures, with two k-mers inserted into a classical Bloom filter and into the Bloom filter variant used in khmer; in this particular example, each requires three hash functions.

Figure 3.2: Illustration between a classical Bloom filter (left side) and the Bloom filter in the khmer software package (right side).

Currently, khmer uses prime numbers above a given minimum hash table size parameter as the sizes of the hash tables. The hash function for a table is then the numeric encoding of the k-mer modulo that prime. By mapping A to 00, C to 01, G to 10, and T to 11, we can convert a k-mer to its binary number form. If we count the reverse complement of a k-mer as the same k-mer, we compute both numbers and choose whichever is smaller. We will show two ways that the Bloom filter variant data structure used in khmer has similar properties to the theoretical properties of a classical Bloom filter.
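Before that comparison, the insert-and-query behavior just described can be made concrete with a minimal sketch. This is illustrative only and not khmer's actual implementation; the class name `PrimeBloomFilter` and the specific primes are made up for the example.

```python
# Minimal sketch of a khmer-style Bloom filter: one bit-array per prime-sized
# table, hash = (canonical k-mer encoded as an integer) mod prime.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical_value(kmer):
    """Encode the smaller of k-mer / reverse complement as an integer (2 bits per base)."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    values = []
    for seq in (kmer, rc):
        v = 0
        for base in seq:
            v = (v << 2) | ENCODE[base]
        values.append(v)
    return min(values)

class PrimeBloomFilter:
    def __init__(self, primes):
        self.primes = primes                      # one prime table size per "hash function"
        self.tables = [bytearray(p) for p in primes]

    def add(self, kmer):
        v = canonical_value(kmer)
        for table, p in zip(self.tables, self.primes):
            table[v % p] = 1                      # set the bin in every table

    def __contains__(self, kmer):
        v = canonical_value(kmer)
        # absent if any table has a 0; "maybe present" only if all bins are 1
        return all(table[v % p] for table, p in zip(self.tables, self.primes))

bf = PrimeBloomFilter([1000003, 1000033, 1000037])    # illustrative primes
bf.add("ATGG")
print("ATGG" in bf, "CCCC" in bf)
```

Graph traversal then amounts to querying the up-to-eight neighbor k-mers of a vertex against this membership structure.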
First, we compare simple analytic derivations showing that the two are equivalent (assuming independent hash functions), and then we show that the optimal number of hash tables matches the optimal number of hash functions for a classical Bloom filter.

For the Bloom filter variant in khmer, we begin by considering the probability that a particular bin in one of the hash tables is occupied. Assume that each hash table has the same size, and let $m$ be the total amount of memory allocated to the hash tables, $k$ the number of hash tables, $h$ the size of each hash table, and $n$ the number of k-mers inserted into the data structure. Furthermore, assume that the hash functions are independent of one another. Examining one hash table, the probability that a bin is occupied after one k-mer is inserted is

$$\frac{1}{h}. \quad (3.1)$$

Thus, the probability that a bin is unoccupied after a single insertion is

$$1 - \frac{1}{h}. \quad (3.2)$$

Extending this to $n$ insertions, the probability that a bin is unoccupied after $n$ insertions is

$$\left(1 - \frac{1}{h}\right)^n. \quad (3.3)$$

Thus, the false positive rate for a single hash table is

$$1 - \left(1 - \frac{1}{h}\right)^n. \quad (3.4)$$

Since the data structure has $k$ hash tables, the false positive rate for the given variables is

$$\left(1 - \left(1 - \frac{1}{h}\right)^n\right)^k. \quad (3.5)$$

Since $\left(1 - \frac{1}{h}\right)^n$ is approximately $e^{-n/h}$, we obtain

$$\left(1 - e^{-n/h}\right)^k. \quad (3.6)$$

The false positive rate for a classical Bloom filter is obtained in a similar manner and, given the same variables, is $\left(1 - e^{-kn/m}\right)^k$ [14]. Since $m = kh$, we see that the false positive rate properties of both data structures are analytically the same.

Figure 3.3: Comparison between the optimal number of hash tables in khmer with the theoretical optimal for Bloom filters.

While the analytic derivation matches, it makes the critical assumption that the Bloom filter used in khmer has independent hash functions. Thus, it is important to compare the theoretical properties with the actual properties of the software implementation. Given a fixed amount of memory allocated to our Bloom filter variant, we can vary the number of hash tables and measure the false positive rate for each. As Figure 3.3 shows, the false positive rate over a range of hash table counts matches the theoretical behavior for the corresponding number of hash functions in a classical Bloom filter on the given dataset. Given that our variant matches the theoretical properties of a classical Bloom filter, we will henceforth refer to the data structure in khmer as a Bloom filter.

Figure 3.4: Graph visualization of a circular chromosome at four different false positive rates (1%, 5%, 10%, and 15%).

3.3 Percolation of DNA de Bruijn graphs

Intuitively, it is clear that traversing the graph at a high enough false positive rate would result in erroneous k-mers connecting together in such a way that false paths join components of the graph. We visualized this effect by inserting a randomly generated 1,000 bp circular chromosome into Bloom filters at four different false positive rates. By starting at the first k-mer and then exploring the graph using the Bloom filter, we can see the effects of false positives as the false positive rate grows larger (Figure 3.4). Figure 3.5 shows that as the false positive rate increases linearly, the average cluster size appears to rise nonlinearly and approaches a vertical asymptote.
As we approach this asymptote, it becomes computationally intractable to calculate average cluster sizes for large k because the number of possible k-mers is so large and false positives erroneously connect clusters together.

Figure 3.5: Average cluster size versus false positive rate. The average cluster size appears to approach a vertical asymptote.

The problem of determining when clusters begin to connect together has a strong theoretical framework called percolation theory. In graphs, the percolation threshold is the point at which a unique, giant cluster exists. In infinite graphs, this cluster is infinitely sized. For finite graphs such as de Bruijn graphs, the giant cluster has a size proportional to the number of vertices in the graph, whereas below the percolation threshold the largest clusters have sizes proportional to the logarithm of the number of vertices. For a particular graph or lattice, there are actually two percolation problems: site and bond. In site percolation, the vertices are randomly "on" with probability p; if an edge is shared between two vertices that are "on," they are considered part of the same cluster. In bond percolation, the edges are randomly "on" with probability p.

The most well-studied lattice for percolation analysis is the 2D lattice (Figure 3.6), which consists of an infinite number of vertices in a grid layout; each vertex has an adjacent vertex above, below, to the left, and to the right. Though analytic solutions have been attempted for both the site and bond percolation thresholds, only the bond percolation threshold has been proven, at 0.5[24]. Through simulation, the site percolation threshold for 2D lattices has been found to be around 0.59[15].

Figure 3.6: A 2D lattice.

Simulating either the site or bond percolation threshold for 2D lattices is considerably easier than for DNA de Bruijn graphs because of the structural differences between the two graphs. An effective way to find the percolation threshold for 2D lattices is to create a large, finite-sized grid and find at what p there exists a path from the left border to the right border. This calculation is made easier by using a union-find data structure to determine whether two vertices are part of the same component. However, de Bruijn graphs do not appear to have natural boundaries that can be easily tested. In order to estimate the DNA de Bruijn graph site percolation threshold, a different type of simulation must be used.

The site percolation threshold can be estimated by examining the cluster size distribution as vertices are randomly flipped on. As p increases, the distribution is fitted using both linear and quadratic regression. The percolation threshold is found at the point where the quadratic regression fits better than the linear one, due to the appearance of the giant component[49]. We calculate this using the F-statistic

$$F = \frac{RSS_1 - RSS_2}{q_2 - q_1} \times \frac{n - q_2}{RSS_2}$$

where $RSS_i$ is the residual sum of squares for model $i$, $q_i$ is the number of parameters for model $i$, and $n$ is the number of simulations run for a particular probability. The percolation threshold, $p_c$, is found by locating the second local maximum. In order to account for finite sampling error, we employ the threshold binning method[1]. Using this technique, we found that the percolation threshold for DNA de Bruijn graphs is approximately 0.183.
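The model-comparison step of this procedure can be sketched as follows. The sketch assumes NumPy, the cluster-size data is made up, and only a single occupancy probability is shown; the full procedure sweeps p, computes F at each value, and locates the second local maximum.

```python
import numpy as np

def f_statistic(x, y):
    """Compare a linear fit (q1 = 2 params) against a quadratic fit (q2 = 3 params)
    on the same data using F = ((RSS1 - RSS2) / (q2 - q1)) * ((n - q2) / RSS2)."""
    n = len(x)
    rss = []
    for degree in (1, 2):                      # residual sum of squares for each model
        coeffs = np.polyfit(x, y, degree)
        resid = y - np.polyval(coeffs, x)
        rss.append(np.sum(resid ** 2))
    rss1, rss2 = rss
    q1, q2 = 2, 3
    return (rss1 - rss2) / (q2 - q1) * (n - q2) / rss2

# Made-up example: average cluster size versus vertex occupancy probability p.
p = np.linspace(0.05, 0.30, 26)
cluster_size = 1.0 + 2.0 * p + 40.0 * p ** 2 + np.random.normal(0, 0.2, p.size)
print(f_statistic(p, cluster_size))
```

A large F indicates that the quadratic term is needed, i.e. that cluster growth has become super-linear.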
3.4 Effect of False Positives on Long-Range Graph Connectivity

Though the percolation results suggest that components in the graph will not fuse together below a specific false positive rate, we wanted to explore how false positives affect long-range graph structure. To do this, we examined how the diameter, or "longest shortest path," of the graph changes as the false positive rate increases. Because we wanted to find whether long-range connections occur at false positive rates beyond the percolation threshold, and because it is computationally difficult to explore that regime with a large k, we simulated 50 bp circular chromosomes at k = 8 (i.e. 58 bp reads with the last 8 bp equivalent to the first 8) and then calculated the shortest path between each pair of true k-mers, the longest of these being the diameter. The experiment was repeated 500 times for each false positive rate. From these experiments, we conclude that long-range false connections do not occur until at or beyond the percolation threshold (Figure 3.7).

Figure 3.7: The diameter (i.e. longest shortest path) decreases abruptly after the percolation threshold, suggesting long-range connections within the graph structure.

3.5 Partitioning a Metagenomic Dataset

Using the probabilistic de Bruijn graph, it is possible to separate the graph into components that are likely to assemble together, which we call read partitioning. With a probabilistic de Bruijn graph, we are able to partition the reads with far less memory than with exact (i.e. non-probabilistic) data structures, as long as we remain below the percolation threshold.

The partitioning algorithm works as follows. A set of reads is inserted into a Bloom filter. When each read is inserted, at least one k-mer is tagged in a C++ STL map so that the k-mer is mapped to a read ID. The next step is to perform a local traversal of the probabilistic de Bruijn graph based on the distance between tagged k-mers in the reads. Tagging fewer k-mers requires less memory but more computation, since a larger portion of the graph must be traversed. Because a local traversal only captures the reads that share k-mers with a particular read, a merging step then combines these subsets to determine which reads belong to the same graph component. Finally, these separate sets of reads (i.e. partitions) can be assembled separately on different nodes to reduce memory requirements.

One concern with the partitioning algorithm is how it performs as the false positive rate increases. Though we determined a specific percolation threshold, this only applies when vertices are randomly turned on and off. This clearly does not hold for genomic DNA, since large chromosomes will consist of large components with many k-mers. If two chromosomes share a single k-mer, they will be part of the same graph component. This may not present a problem when it happens infrequently, but if the false positive rate is high enough, erroneous connections could occur in the graph if false paths are introduced between components.

Table 3.1: Partitioning results on a soil metagenome at k=31.

  False positive rate    Total memory use (improvement)    Largest partition size in reads
  1%                     1.75 GB (18.8x)                   344,426
  5%                     1.20 GB (27.5x)                   344,426
  10%                    0.96 GB (34.37x)                  344,426
  15%                    0.83 GB (39.75x)                  344,426
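The tag-traverse-merge structure of the partitioning workflow described above can be summarized in a simplified sketch. This is not khmer's implementation: the single-tag-per-read policy, the unbounded traversal, and the `neighbors` helper are illustrative simplifications, and `bloom` can be any object supporting `add()` and `in` (a plain Python set gives the exact, non-probabilistic behavior; the Bloom filter sketch above gives the probabilistic one).

```python
from collections import deque

def neighbors(kmer, bloom):
    """Yield the up-to-eight neighbor k-mers present in the graph structure."""
    for base in "ACGT":
        for cand in (kmer[1:] + base, base + kmer[:-1]):
            if cand in bloom:
                yield cand

def find(parent, x):                 # union-find lookup with path compression
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def partition(reads, bloom, ksize):
    tags = {}                        # tagged k-mer -> read ID
    for rid, read in enumerate(reads):
        for i in range(len(read) - ksize + 1):
            bloom.add(read[i:i + ksize])
        tags[read[:ksize]] = rid     # tag (at least) the first k-mer of each read

    parent = {rid: rid for rid in range(len(reads))}
    for tag, rid in tags.items():    # local traversal from each tag; merge read sets
        seen, queue = {tag}, deque([tag])
        while queue:
            kmer = queue.popleft()
            if kmer in tags:         # reached another read's tag: same partition
                parent[find(parent, rid)] = find(parent, tags[kmer])
            for nb in neighbors(kmer, bloom):
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
    return {rid: find(parent, rid) for rid in range(len(reads))}

print(partition(["ATGGACCA", "ACCAGATG", "TTTTTTTT"], set(), 4))
```

In practice the traversal distance between tags is bounded and partition assignments are stored far more compactly; the sketch only conveys the overall flow.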
To test whether such erroneous connections occur, we randomly generated 1,000 DNA sequences 1,000 base pairs long. Varying the false positive rate at k=31, we used the partitioning algorithm to determine the resulting number of partitions. Though the time to partition increases in a similar fashion to the expected average cluster size, the number of partitions does not change (Figure 3.8).

Figure 3.8: After partitioning a simulated dataset at increasing false positive rates up to 15%, the number of partitions found does not change. Closer to the percolation threshold, the partitioning process becomes more computationally difficult.

Using a small soil metagenomic dataset, we demonstrated the memory efficiency and utility of the partitioning algorithm in the khmer software package. To do this, we utilized a small script called memusg (available at https://gist.github.com/netj/526585) that checked the memory usage every 0.1 seconds. With all of the partitioning and assembly steps placed in an sh script, the memusg script reported the highest memory measurement. Assembling the entire dataset without partitioning requires approximately 33 GB of RAM, but partitioning at a 15% false positive rate requires almost 40 times less memory. The memory requirements for partitioning and assembly are roughly linear and match the per-k-mer estimates (Table 3.1).

3.6 Sequencing Errors Outnumber False Positives

No sequencing platform produces error-free reads, so we explored how the number of erroneous k-mers resulting from sequencing errors compares to the number of erroneous k-mers resulting from false positives in a Bloom filter. We used two different datasets: the E. coli genome and a set of Illumina GA-II reads obtained from sequencing the same strain of E. coli. We loaded both the reads and the genome into separate "exact" Bloom filters with k=17; exact in the sense that there is a bin for every possible 17-mer, so no collisions are possible. We then loaded each of the datasets into Bloom filters at three different false positive rates (1%, 5%, and 15%). As Table 3.2 shows, even with no false positives, the number of erroneous k-mers in the reads is almost 10x larger than the number of real k-mers. Inserting the E. coli genome into a Bloom filter with a 1% false positive rate yields approximately 5.9% of the k-mers as false positives. However, it should be noted that as the false positive rate approaches the percolation threshold, the percentage of real k-mers falls dramatically.

Table 3.2: Effects of loading E. coli data at different false positive rates.

  Graph             Total k-mers    False connected k-mers    % Real    Deg > 2        Mem (bytes)
  E. coli at 0%     4,530,123       0                         100       50,605         2.1 × 10^9
  E. coli at 1%     4,814,050       283,927                   94.1      313,844        5.4 × 10^6
  E. coli at 5%     6,349,301       1,819,178                 71.3      1,339,102      3.5 × 10^6
  E. coli at 15%    31,109,523      26,579,400                14.6      10,522,482     2.2 × 10^6
  Reads at 0%       45,566,033      41,036,029                9.9       7,699,309      2.1 × 10^9
  Reads at 1%       48,182,271      43,652,265                9.4       31,600,208     5.4 × 10^7
  Reads at 5%       62,019,545      57,489,537                7.3       42,642,203     3.6 × 10^7
  Reads at 15%      231,687,063     227,157,037               1.9       113,489,902    2.3 × 10^7

Another important comparison is the number of k-mers with degree greater than 2, which serves as a proxy for graph complexity. When the reads are loaded into an exact Bloom filter, there are 7.7 million k-mers with 3 or more degrees.
On the other hand, storing the genome in a Bloom filter with a 1% false positive rate yields only 313,844 false k-mers, implying that at low false positive rates the erroneous k-mers generated by sequencing errors yield a comparatively more complex graph than false positive k-mers do.

3.7 Memory Efficiency vs. Exact Data Structures

Assuming that the percolation threshold for DNA de Bruijn graphs is 0.183, we can compress an assembly graph down to approximately 4 bits per k-mer, independent of the size of k. Given a set of $n$ total possible items and $x$ items to be inserted, it is possible to represent the set in $\log_2 \binom{n}{x}$ total bits. This problem has been considered in the context of DNA assembly graphs previously[10]. While many efforts have been undertaken to create a succinct representation of these types of datasets that matches the lossless information-theoretic lower bound, none have solved this problem in the general case. Regardless, a probabilistic data structure such as a Bloom filter or Counting Bloom filter can push below this lower bound. Using the equation above, we compared Bloom filters at four different false positive rates (1%, 5%, 10%, and 15%) with the information-theoretic lossless lower bound. Figure 3.9 shows that for sparsely populated k-mer bitmaps (typical of any sequencing dataset with a reasonably high k), the Bloom filters have better memory efficiency than any exact data structure. It is also important to note that the lossless lower bound depends on the size of k chosen: a larger k value means a larger number of possible k-mers that could be part of the set. The memory usage of a Bloom filter, in contrast, is independent of k, but the Bloom filter begins to lose its advantage over lossless data structures as the set becomes more dense. Intuitively, this makes sense because when a bitmap is used to represent a set, it is impossible to compress further once half of the bits are randomly occupied.

Figure 3.9: Comparison between a Bloom filter at four different false positive rates and the information-theoretic lower bound for four values of k.

3.8 K-mer Counting with Counting Bloom Filters

One limitation of Bloom filters is that they can only report the likely presence or certain absence of a particular item in a set. Since many genomics analyses use coverage information to help find erroneous, repetitive, or other types of sequence, it is sometimes necessary to use a data structure that can store k-mer counts rather than simply presence or absence. A natural extension of the Bloom filter is to increase the bin size from 1 bit to the largest count one would like to store (e.g. 8 bits would allow one to count from 0 to 255). Then, to obtain the count for a particular k-mer, we look up the count for that k-mer in each hash table and take the minimum value. As with Bloom filters, false negatives are not possible, but false positives are. However, for k-mer counting there is yet another one-sided error that must be considered: if a k-mer exists in the dataset, we will correctly see that it is present, but its count may be over-inflated. This only occurs when every bin that corresponds to the k-mer of interest collides with another k-mer. This type of error depends largely on the count distribution as well as the false positive rate.
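A minimal sketch of this counting extension, again illustrative rather than khmer's implementation, replaces each bit with a small saturating counter and reports the minimum value across tables; the simple two-bit encoding below ignores reverse complements for brevity.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_value(kmer):
    """Encode a k-mer as an integer, two bits per base (reverse complements ignored here)."""
    v = 0
    for base in kmer:
        v = (v << 2) | ENCODE[base]
    return v

class PrimeCountingBloomFilter:
    """Counting Bloom filter sketch: one byte-wide counter array per prime-sized table."""
    MAX_COUNT = 255                                   # 8-bit bins saturate at 255

    def __init__(self, primes):
        self.primes = primes
        self.tables = [bytearray(p) for p in primes]  # all counters start at 0

    def add(self, kmer):
        v = kmer_value(kmer)
        for table, p in zip(self.tables, self.primes):
            if table[v % p] < self.MAX_COUNT:
                table[v % p] += 1                     # increment the bin in every table

    def count(self, kmer):
        v = kmer_value(kmer)
        # the reported count is the smallest bin value; collisions can only inflate it
        return min(table[v % p] for table, p in zip(self.tables, self.primes))
```

Counts can be over-reported but never under-reported, mirroring the one-sided error described above.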
3.9 Implications for Non-Probabilistic de Bruijn Graphs

In this chapter, we examined the false positive rate at which components in a probabilistic de Bruijn graph begin to erroneously connect together and form a giant cluster. However, the idea of percolation can also be applied to non-probabilistic graphs. When choosing a low k value, one must consider that if the occupancy rate of possible vertices is above the percolation threshold, then it is highly likely that the graph is already percolated and that most of the reads will belong to one large cluster. This, of course, makes assembly more difficult and should be taken into consideration when selecting a value for the k parameter of assemblers. Given that there are 4^k possible k-mers for a selected k value (without taking reverse complements into account), this becomes less of an issue once k is greater than 25. On the other hand, selecting too high a value for k will lower sensitivity, since fewer overlaps will be found between k-mers. Furthermore, even for non-probabilistic de Bruijn graphs, it is still possible for a single component to span different species or chromosomes even though they were not sampled from the same place. The classic Birthday Problem suggests that even for higher k values, say between 30 and 40, it is still possible for these types of problems to occur.

3.10 Data

3.10.1 Genomic Data

The data used to demonstrate the effect of false positives compared to sequencing errors in a de Bruijn graph was the E. coli K-12 MG1655 genome, which is available on NCBI under accession number NC_000913.2. The Illumina reads datasets used for comparison were a 200bp insert library (SRA0016655) and a 500bp insert library (SRA001666). Both were sequenced using the GA-II platform.

3.10.2 Metagenomic Data

The dataset used for the demonstration of the partitioning algorithm and its memory efficiency was a small soil metagenome sample named MSB2. It was sequenced on one lane of Illumina's GA-II platform. It contains approximately 35 million reads and totals around 2.8 Gbp. Since soil is so highly diverse, this dataset does not contain nearly enough coverage to assemble the entire metagenome. The dataset exists in NCBI's Short Read Archive under accession number SRA050710.

Chapter 4

Read Error Correction

Though assembling a set of reads together is challenging, the errors present in reads make it even more difficult. Rather than detecting exact overlaps, one must develop algorithms that can efficiently find inexact overlaps between reads. In the context of a de Bruijn graph, error k-mers greatly complicate assembly. A single base pair error in a sequence read can create up to k erroneous k-mers. Given an error rate greater than 1%, it is clear that erroneous k-mers generate the bulk of the memory requirements for assemblers when not properly handled. Furthermore, they can make the graph so complex in certain regions that it becomes impossible to traverse those regions of the graph. This is particularly troublesome in repetitive sequence. Due to the problems that errors create in assembly, read error correctors have become a ubiquitous part of assembly pipelines.

In this chapter, we introduce a two-pass algorithm for error correcting metagenomic sequencing datasets. Currently, few options exist for error correcting metagenomic datasets. Many effective algorithms exist for correcting genomic reads, but these generally rely on a more uniform coverage level across the sample.
Because metagenomes can contain millions of species at varying abundance levels, it is difficult to use k-mer abundance information without conducting an additional round of filtering. We take advantage of the k-mer abundance distribution that results from applying the digital normalization algorithm[6] to a dataset in order to error correct metagenome datasets.

4.1 K-mer Abundance Distributions

Before we proceed with an implementation of a read error correction algorithm, we must first consider the k-mer abundance distribution for reads containing erroneously called bases, if we are to incorporate such information into the algorithm. However, we must first make a number of unrealistic assumptions to make the analysis feasible. Such assumptions are not unfounded; we make similar ones to those used by Lander and Waterman in their paper outlining the statistics behind shotgun sequencing[25]. One assumption we make is that there are no repetitive or low complexity regions in the genome, so that every k-mer in the genome is unique. This is clearly false: not only will k-mers in different locations of the genome be identical by random chance for lower k values, but repetitive and low complexity regions mean that even for large k values there will be many identical k-mers. Another assumption is that sequencing errors occur uniformly across a read. This, too, is unrealistic, as sequencers tend to have different error profiles; in Illumina data, it is well known that most errors occur at the ends of reads, with the majority on the 3' end. We will also assume that indels do not occur. These are relatively uncommon compared to SNPs in Illumina reads and have a similar effect on the k-mer abundance distribution; by excluding them, we can assume the length of each read is the same. The last assumption we make is that when a sequencing error occurs, the new erroneous k-mers generated are distinct. This simplifies calculating the k-mer abundance distribution with sequencing errors, since any erroneous k-mer will have a count of 1.

4.1.1 K-mer Abundance Distribution of One Genome

Let $G$ be the number of base pairs in a circular genome. Since the genome is circular, $G$ is also the number of k-mers, where $k$ is the chosen word size. Let $e_r$ be the probability that a particular base in a sequence read is erroneous. For a given sequence read, let $R$ be the length of the read in bases, and let $M = R - k + 1$ be the number of k-mers in a read. Lastly, let $N$ be the number of reads in the dataset.

To determine the k-mer abundance distribution, consider the probability that a particular k-mer occurs in $c$ reads without any errors. First, consider the probability that the k-mer occurs in a particular read. Since a sequence read contains $M$ k-mers, this probability is $\frac{M}{G}$, and the probability that the k-mer does not occur in the read is $\frac{G-M}{G}$. Putting this together, the probability that a k-mer occurs in $c$ reads without any errors is

$$\binom{N}{c} \left(\frac{M}{G}\right)^c \left(\frac{G-M}{G}\right)^{N-c}. \quad (4.1)$$

This means that for a genome of size $G$, the k-mer abundance distribution is

$$G \cdot \binom{N}{c} \left(\frac{M}{G}\right)^c \left(\frac{G-M}{G}\right)^{N-c}. \quad (4.2)$$

Now we incorporate errors into the distribution. The probability that a k-mer in a read is error-free is $(1 - e_r)^k$. This implies that the probability that a k-mer contains at least one error (and is thus an error k-mer) is $1 - (1 - e_r)^k$. Thus, the expected number of correct k-mers per read is $M(1 - e_r)^k$ and the expected number of error k-mers per read is $M(1 - (1 - e_r)^k)$.
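To give these expressions a concrete sense of scale, here is a quick worked example using illustrative values (R = 100, k = 20, e_r = 0.01; these numbers are not taken from the datasets in this dissertation):

```latex
% Worked example with assumed values R = 100, k = 20, e_r = 0.01
M = R - k + 1 = 81, \qquad (1 - e_r)^k = 0.99^{20} \approx 0.818 \\
\text{expected correct k-mers per read} \approx M (1 - e_r)^k \approx 66.3 \\
\text{expected error k-mers per read} \approx M \left(1 - (1 - e_r)^k\right) \approx 14.7
```

Even a 1% per-base error rate therefore makes roughly one in six k-mers of a 100 bp read erroneous, which is why error k-mers dominate the abundance-one portion of the distribution.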
To simplify the notation, let $L = M(1 - e_r)^k$; we can then replace $M$ with $L$ in the previous equation. Due to our assumption that all error k-mers are unique, the piecewise function for the k-mer abundance distribution of a single circular genome with sequencing errors is

$$F(G, R, e_r, k, c) = \begin{cases} G \cdot \binom{N}{c} \left(\frac{L}{G}\right)^c \left(\frac{G-L}{G}\right)^{N-c} + M\left(1 - (1 - e_r)^k\right), & \text{if } c = 1 \\ G \cdot \binom{N}{c} \left(\frac{L}{G}\right)^c \left(\frac{G-L}{G}\right)^{N-c}, & \text{otherwise.} \end{cases} \quad (4.3)$$

Because the binomial distribution appears here, this function simplifies to

$$F(G, R, e_r, k, c) = \begin{cases} G \cdot \mathrm{bin}(c, N, \tfrac{L}{G}) + M\left(1 - (1 - e_r)^k\right), & \text{if } c = 1 \\ G \cdot \mathrm{bin}(c, N, \tfrac{L}{G}), & \text{otherwise.} \end{cases} \quad (4.4)$$

4.1.2 Digital Normalization Convergence

An important issue when applying the digital normalization algorithm to a dataset is whether the k-mer coverage converges towards a specific value (ideally, the C parameter). If convergence can be expected, then it is useful for predicting when a dataset has saturated or whether more novelty can be expected. For the case where the length of the read is equal to k, it is simple to show that the k-mer coverage approaches C, the parameter for digital normalization. We define k-mer coverage to be the expected number of reads in a dataset that "contain" a particular k-mer. Intuitively, we can see that this is the case, since the genome will continue to be sampled until each possible k-mer has a count of C.

We can also derive the probability mass function for the k-mer abundance distribution in this special case, given the size of the genome in k-mers (G), the number of reads processed so far (N), and the digital normalization parameter (C). To do this we must consider three cases: the k-mer count is less than C, the count is equal to C, and the count is greater than C. In the last case, the number of k-mers having a count greater than C must be zero, because a read the size of a k-mer whose k-mer already has a count of C will be discarded. In the first case, the count behaves as in the binomial case; any k-mer whose count is not less than C must therefore have a count of exactly C. Putting this together, we obtain

$$\mathrm{pmf}(N, C, G, X) = \begin{cases} G \cdot \binom{N}{X} \left(\frac{1}{G}\right)^X \left(\frac{G-1}{G}\right)^{N-X}, & \text{if } X < C \\ G \cdot \left(1 - \sum_{i=0}^{C-1} \binom{N}{i} \left(\frac{1}{G}\right)^i \left(\frac{G-1}{G}\right)^{N-i}\right), & \text{if } X = C \\ 0, & \text{if } X > C \end{cases} \quad (4.5)$$

Furthermore, we can use the probability mass function to derive the number of reads kept by the digital normalization algorithm in this same case, where the length of the read is k. Since no k-mer can have an abundance greater than C, a read (i.e. k-mer) is only added when its count is less than C, which gives

$$\mathrm{readskept}(N, C, G) = \sum_{i=1}^{N} \sum_{j=0}^{C-1} \binom{i}{j} \left(\frac{1}{G}\right)^j \left(\frac{G-1}{G}\right)^{i-j}. \quad (4.6)$$

Though it is straightforward to analyze the probability mass function and the number of reads kept by digital normalization in the case where there are no errors and the read length equals k, the analysis becomes considerably more complicated when extended to longer reads. The most challenging part of this analysis is the use of the median function in the algorithm. To determine the count of a particular k-mer after N reads have been processed, one must consider the ordering of the reads that covered that k-mer: each of these reads may not have been retained, because the median k-mer count of the read exceeded C. To do this analysis, one would need to take into account the median, rather than the expectation, at each point in time.
Since the digital normalization algorithm skews the original read distribution, this complicates the analysis. Regardless, it is possible to evaluate the convergence of the algorithm by simulating it. We simulated a set of 200,000 reads where each read contains 20 k-mers. Given a circular genome with 100,000 k-mers, we apply the digital normalization algorithm with C = 10. After processing the reads with the algorithm, we see that the mode of the k-mer abundance distribution is 10, but that some k-mers have an abundance greater than C due to how the reads stack along the genome as they are processed (Figure 4.1). Furthermore, the algorithm clearly causes the number of reads kept to converge to a specific value (Figure 4.2).

Figure 4.1: K-mer abundance distribution using simulated reads with 20 reads per k-mer, a circular genome size of 100,000 k-mers, and 200,000 reads.

Figure 4.2: The number of reads kept and discarded as reads are processed through a simulation of the digital normalization algorithm in the same experiment as Figure 4.1.

However, there is no convergence towards the C parameter in the presence of errors. Under the error model introduced earlier, a read may contain only error k-mers as long as there is at least one base error every k or fewer bp. Given an infinite stream of reads, this means that any possible read, and thus any k-mer, can be generated. This implies that the data structure used would eventually collect every possible k-mer, while the k-mer coverage would still end up around C (since the erroneous k-mers would dominate the expectation calculation at that point). Because these error k-mers would begin to match the coverage of trusted ones, the algorithm would need to terminate much sooner and would need to be modified heavily to form the basis of an online, streaming algorithm. Due to the use of a Counting Bloom filter, we instead use a multi-pass approach that terminates well before these errors cause such an issue.

4.2 Error Correction Algorithm Components

Currently, few methods exist for reliably error correcting metagenomic reads, due to the uneven abundances among "species" within an environmental sample and the deep sequencing required to sufficiently sample the least abundant organisms. As mentioned previously, several error correction methods, such as Quake and the procedure used in the EULER assembler, rely on the k-mer coverage distribution to estimate which k-mers can be trusted and which are likely erroneous. However, the k-mer coverage of metagenomic data differs from that of genomic datasets. Digital normalization[6] can be used to even out the coverage level in a metagenomic sample. This results in a k-mer coverage distribution from which one can separate erroneous k-mers from non-erroneous ones. By creating only one peak of "trusted" k-mers, we need only track two peaks in the abundance distribution. As the digital normalization algorithm uses one pass through the reads dataset, we use an error correction method that requires two passes through the dataset, combining elements of probabilistic de Bruijn graph traversal and alignment with the digital normalization algorithm. The error correction algorithm applies digital normalization to the dataset of interest and uses a Counting Bloom filter as the underlying data structure for k-mer counting.
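A sketch of this first, digital-normalization pass is shown below. It keeps a read only when the median count of its k-mers is still below C and counts k-mers only for kept reads, which is a common formulation of digital normalization but a simplification of the actual khmer script; `counts` is assumed to be any counting table with `add()` and `count()` methods, such as the Counting Bloom filter sketch earlier.

```python
import statistics

def digital_normalization_pass(reads, counts, ksize, C=20):
    """First pass: build the counting table while discarding redundant reads."""
    kept = []
    for read in reads:
        kmers = [read[i:i + ksize] for i in range(len(read) - ksize + 1)]
        if not kmers:
            continue
        median_count = statistics.median(counts.count(km) for km in kmers)
        if median_count < C:
            for km in kmers:
                counts.add(km)        # novelty: record this read's k-mers
            kept.append(read)
        # otherwise the read is already covered to depth ~C and is discarded
    return kept, counts
```

For error correction, only the resulting counting table is saved for the second pass; the kept reads themselves are not needed.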
As the algorithm walks through the reads, the k-mer abundance distribution is built up using the digital normalization algorithm as a filter. We use a mixture model containing two Poisson distributions to discern between erroneous and non-erroneous k-mers, though better-fitting distributions and models can be used in circumstances where the distribution is better characterized.

4.2.1 Read to Graph Alignment

In order to pull the best candidate alignment out of the graph for a given sequence read, we use a strategy similar to the one used in Xander (Jordan Fish, personal communication). Given a Counting Bloom filter with a word size of k, we scan from the beginning of each sequence read until we reach a k-mer that exists in both the Counting Bloom filter and the read. To avoid aligning from every k-mer, we do not start alignments from k-mers that are covered by previously obtained alignments. We consider this k-mer to be the starting vertex in an A* search. We maintain a closed set of explored nodes and an open set of nodes in a priority queue (implemented as a heap from C++'s STL). For each vertex, we check whether any of its possible neighbors exist in the Counting Bloom filter. However, unlike with a regular de Bruijn graph, we must account for up to four insertion vertices and a deletion vertex, in addition to the one possible exact-match vertex and up to three mismatch vertices. These vertices are added to the open set, and by employing the A* graph search algorithm, we can find the optimal alignment between the read and the graph. It is important to note that for each seed k-mer that is not at the end of a read, we must do two traversals: one in the forward direction and one in the backward direction.

Figure 4.3 demonstrates an example of a read containing one error being aligned to the graph, which contains both the erroneous k-mers of the read (with a k-mer coverage of only one) and the correct sequence, where all of the k-mers have a coverage of either 19 or 20. The algorithm must prefer the path that is more likely to contain correct sequence over the path that more closely matches the read but consists of erroneous k-mers.

Figure 4.3: Simplified example of a graph alignment beginning at a seed k-mer (original sequence: AGCCGGAGGTCCCGAATCTGATGGGGAGGCG; read: AGCCGGAGGTACCGAATCTGATGGGGAGGCG; seed k-mer: CGAATCTGAT). Two paths exist due to a single SNP, which forms a bubble structure. The goal is to correct the read by replacing the substituted base with the original one. The abbreviations for the vertex classes are as follows: MN = match and non-erroneous, ME = match and erroneous, SN = substitution and non-erroneous.

Table 4.1: Scores for error correction graph alignment.

  Type            Score
  Match           0
  Substitution    8
  Insertion       4
  Deletion        4

To score the alignment, we assign scores for matches, substitutions, insertions, and deletions (Table 4.1). However, due to the nature of real biological sequence data, erroneous k-mers are certain to exist in the dataset. Rather than traverse down paths that contain error k-mers, we choose to stop traversal and not consider such error k-mers. We will discuss the method for determining whether or not a k-mer can be trusted later on. However, if the algorithm were to consider erroneous paths (e.g.
to retain, rather than discard, error-prone reads), one could assign separate scores for erroneous matches, SNPs, and indels. These scores must necessarily be lower than any of the scores for correct k-mers that exist in the graph. The scores used for the experiments in this work are listed in Table 4.1.

4.2.2 Two-pass Error Correction

To handle metagenomic data, we must first be able to reliably distinguish an erroneous k-mer from a correct one. Given errors in reads, the mode of the abundance distribution is generally 1, with another peak occurring at the expected k-mer coverage. Figure 4.4 illustrates this effect with an E. coli reads dataset (see the Data section). Due to the large number of error k-mers, the second peak is a small fraction of the size of the peak of k-mers with abundance one. Though this is typical, other types of sequencing datasets, such as mRNAseq, MDA single-cell, and metagenomic datasets, are likely to have very different distributions.

Figure 4.4: The raw k-mer abundance distribution on an E. coli reads dataset. Note that the y-axis uses logarithmic scaling.

This highlights the importance of a pre-assembly filtering method that can effectively normalize the distribution. The problem is even worse with metagenomic datasets because, with an ensemble of species in the sample, it is possible for no other easily discernible peaks to exist. To ameliorate this issue, we employ the digital normalization algorithm[6] to even out the coverage level among different species. By limiting the k-mer coverage to a specific value, we know where to look for the second peak in the dataset. As long as all "species" in the sample were sequenced to a coverage level higher than digital normalization's C parameter, we can retain the novelty in the sample while creating a peak from which we can discern trusted k-mers from ones likely to be erroneous. Figure 4.8 shows the effect of applying the digital normalization algorithm to the E. coli reads dataset. Clearly, a peak occurs around 20, which can be used to predict whether a particular k-mer is erroneous or correct. Knowing where the peak will be, thanks to the C parameter, is also useful because we do not need to keep track of the entire k-mer abundance distribution. Rather, we know there will be a peak at an abundance level of 1 due to error k-mers as well as a peak near C.

We will now outline the two-pass error correction algorithm. A two-pass algorithm means that the sequence reads dataset is processed twice. In the first pass, we build a Counting Bloom filter while applying the digital normalization algorithm, which keeps any read whose median k-mer count is below C and discards any other read. For the purposes of error correction, the set of reads retained by the digital normalization algorithm is not needed; instead, the Counting Bloom filter used to obtain the counts is saved so that it can be used in the second pass of the algorithm. In the second pass through the entire reads dataset, we utilize the read-to-graph alignment strategy outlined earlier. To properly score the possible alignments, we must be able to determine whether a k-mer is likely to be erroneous or correct. We use the post-digital-normalization abundance distribution as a guide. Though it is possible to use a bimodal distribution obtained analytically by making the assumptions from the previous sections, in practice there are many issues with this approach.
First, since the assumptions about repetitive and low complexity regions are unrealistic, there is a possibility that the model would underestimate the number of erroneous k-mers. Ideally, an error correction algorithm should limit the number of "false corrections" as much as possible, so a more conservative approach is key. Second, the model requires many parameters that may not be known, so a simpler solution is desirable. For these reasons, we choose to use a mixed model containing two Poisson distributions, one belonging to error k-mers and the other to correct k-mers. The lambda parameter for the correct k-mer distribution can be tied directly to the C parameter of the digital normalization algorithm, which means that the k-mer abundance distribution of the dataset after applying digital normalization does not have to be obtained beforehand. To classify a k-mer as either erroneous or correct, we simply pick whichever distribution the k-mer of interest fits best.

4.2.3 Error Region Heuristics

A pressing limitation of a graph-based alignment strategy is that in complex regions of the assembly graph, graph search algorithms have a difficult time quickly finding the lowest-cost path. Furthermore, aligning an entire read to the graph exacerbates the issue, since this leads to more possible paths for the graph aligner to traverse. These issues can make the algorithm unreasonably slow for longer reads. To address this, we introduce a simple heuristic to shorten the portion of the read that is aligned. Using the method described in the previous section to determine whether or not a k-mer is erroneous, we can walk along a read and obtain its "error footprint." Since a single base pair error results in up to k erroneous k-mers, we expect that if an erroneous k-mer exists, there are likely to be several more in the same region. The longer this region, the longer the graph alignment and thus the more time it takes to complete. To avoid aligning an entire read, we adjust the graph alignment algorithm to align only from the last correct k-mer before an erroneous region to the next correct k-mer. Figure 4.5 illustrates how an error footprint may look within a read. The binary values indicate whether or not each k-mer is considered correct (1 for correct, 0 for erroneous). The red line illustrates the region that is aligned with the graph. To produce a complete alignment, the stretches of 1's are added to the result along with the alignments of the 0 regions. This approach makes the necessary assumption that any so-called correct k-mers are in fact correct; aligning the entire read becomes prohibitive for longer stretches of sequence, and thus this heuristic is necessary.

Figure 4.5: Illustration of the error footprint for a read (e.g. 1111111100000011111000000000000011111111, read 5' to 3'). Each binary value indicates whether a k-mer is considered correct: if the value is 1, the k-mer is deemed correct; if 0, the k-mer is considered erroneous. Rather than aligning an entire read, the graph search need only span an erroneous region of the graph.

Furthermore, we assume that the first and last k-mers of the starting alignments are correct and exist in the Counting Bloom filter. An exception is when the erroneous region is at the end of the read, where we instead require only that a starting k-mer is correct. If this requirement is not met, then the alignment cannot be made and the entire read is discarded.
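The per-k-mer trusted/erroneous call and the error footprint can be sketched together as follows. The two-Poisson comparison below follows the description above (lambda for the correct component tied to C, error component near 1), but its exact parameterization, and the `counts.count()` interface borrowed from the earlier counting sketch, are assumptions rather than khmer's implementation.

```python
import math

def poisson_log_pmf(count, lam):
    """log P(X = count) under a Poisson distribution with mean lam."""
    return count * math.log(lam) - lam - math.lgamma(count + 1)

def make_is_trusted(C, lam_err=1.0):
    """Trusted if the count fits the 'correct' Poisson (mean ~C) better than the
    'error' Poisson (mean ~1)."""
    def is_trusted(count):
        return count > 0 and poisson_log_pmf(count, C) > poisson_log_pmf(count, lam_err)
    return is_trusted

def error_footprint(read, counts, ksize, is_trusted):
    """One flag per k-mer in the read: 1 if its count is called trusted, else 0."""
    return [1 if is_trusted(counts.count(read[i:i + ksize])) else 0
            for i in range(len(read) - ksize + 1)]

def error_regions(footprint):
    """(start, end) index pairs of maximal runs of untrusted k-mers."""
    regions, start = [], None
    for i, flag in enumerate(footprint):
        if flag == 0 and start is None:
            start = i
        elif flag == 1 and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(footprint) - 1))
    return regions
```

Only the spans returned by `error_regions` are then aligned to the graph, anchored on the trusted k-mers that flank them.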
Even with the heuristic of aligning only over erroneous regions, there are still likely to be issues when trying to align long stretches. Figure 4.6 shows a histogram of the lengths of the error regions in an E. coli reads dataset, with the y-axis on a log scale. As the region size grows, the number of such regions decreases exponentially. However, there are very noticeable spikes at 20 and 80. The spike at 20 is due to single SNPs in reads, which generate k erroneous k-mers in a row; in this case, k is set to 20. The spike at 80 occurs when the entire read is erroneous. These reads are discarded, since no seed k-mers exist that can be used to anchor the alignment(s) between the read and the graph.

Figure 4.6: Histogram of the lengths of "error regions" in 100,000 reads of an E. coli reads dataset. The spike at 20 is due to a single base pair error causing k erroneous k-mers; in this case, k = 20. The spike at the end is due to reads with no k-mers called as correct.

When examining the largest error region size on a per-read basis, it is worth noting that most reads have small error regions (Figure 4.7). Since the time to error correct a large error region becomes computationally prohibitive, we add another option, called max-error-region-size, that error corrects a read only if the largest error region in the read is less than or equal to the max-error-region-size parameter. If the read's largest error region is larger than this parameter, then the read is discarded. Adding this option to the read error correction procedure allows some flexibility between speed and the number of reads that are corrected. Furthermore, it is likely that aligning longer erroneous segments is less accurate than aligning shorter ones, so being conservative with this parameter may lead to higher specificity.

Figure 4.7: Histogram of the largest error region in a 100,000 read subset of the E. coli reads dataset.

4.3 Error Correction on Genome Sequence Data

Having introduced all of the ingredients of our read error correction algorithm, we now use it to correct reads from a 5-million-read subset of an E. coli sequence dataset (see the Data section of this chapter). We first run the digital normalization stage of the algorithm. After applying the digital normalization algorithm at C = 20, a peak appears around 20 in the k-mer abundance distribution (Figure 4.8). Compared to the raw distribution (Figure 4.4), the number of erroneous k-mers is greatly reduced. However, the existence of the larger peak allows us to better separate the error k-mers from correct ones, so we can proceed with the correction stage of the algorithm.

Figure 4.8: The k-mer abundance distribution of the E. coli dataset after running the digital normalization algorithm with C=20. Note that the y-axis uses logarithmic scaling.

The top graph of Figure 4.9 shows the k-mer abundance distribution after the reads have been corrected using the genome itself. This is effectively the gold standard for the error correction algorithm, since it is not possible for a k-mer that does not belong to the genome to be considered correct. Though it is still possible to have false corrections due to the same k-mer occurring at multiple locations in the genome, this type of error is difficult to mitigate. The graph shows a clear peak around 75, which is where a slight peak existed in the raw reads abundance distribution.
We see that though the distribution does not quite match the "gold standard," particularly at the left tail, the error correction algorithm performs well when used against the normalized data (shown at the bottom of Figure 4.9).

Figure 4.9: K-mer abundance distributions on the E. coli dataset after error correction using both the genome (top) and the post-digital-normalization dataset (bottom) as the abundance distribution used for error correction.

4.4 Evaluation on Human Gut Mock Dataset

We will now evaluate the error correction procedure on a metagenomic dataset. Because most metagenomic datasets have incomplete references, where very few microbes in the sample are fully sequenced, it is difficult, if not impossible, to properly evaluate error correction in these circumstances. However, one mock metagenome dataset exists in which microbes whose genomes have been fully sequenced were combined into a sample and then sequenced. These data were sequenced by the Human Microbiome Project, which offers two Illumina datasets. The first, an even sample, contains 20 different known genomes at close to equal abundance. The second, a staggered sample, contains the same 20 genomes but at varying levels of abundance. The Data section has more information about where to obtain the datasets.

The goal of the evaluation is to demonstrate that the error correction algorithm considerably reduces the error rate in the reads. Since we have the reference genomes available, we can map the reads back to the genomes and count the mismatches in the alignments. However, there are some drawbacks to this method. First, since we will be using Bowtie, insertions and deletions will not be captured. Though this is not ideal, it is not a big issue, since most errors from Illumina chemistries are SNPs rather than indels. Another issue is that Bowtie limits the number of mismatches that it checks for, presumably because the mapping problem becomes combinatorially more difficult as the number of tolerated mismatches increases. Because we are limited in the number of mismatches with which we can map a read, we are unable to map the most error-prone reads. This means that any error rate derived from the mappings is an underestimate. However, we can combine the percentage of reads that map to the reference genomes with the error rates of the mappable ones to get a good idea of the relative number of errors. Though we could change the parameters of Bowtie to map as many reads as possible, we used the default parameters, since we are only interested in comparing reads that the error correction algorithm is likely to keep and correct. Since the error correction algorithm detects and discards particularly error-prone reads, we are unlikely to find them in the output reads file.

Table 4.2: Error correction and mapping results on the HMP mock datasets.

  Dataset             Total Reads    Mappable Reads      EC Reads            EC Mappable
  Even Dataset        6433051        4598477 (71.48%)    4245210 (65.99%)    4148620 (97.72%)
  Staggered Dataset   7686356        3741334 (48.68%)    5073565 (66.01%)    4740514 (93.44%)

To do the evaluation, we first map the raw reads of the even and staggered HMP datasets to a collection of reference genomes using the default parameters of Bowtie. We then run the digital normalization algorithm against the datasets with C = 10 and k = 23, followed by the second pass of the error correction algorithm with the max-error-region parameter set to 25.
We set this parameter just above k because we wanted to make sure to include reads whose only error regions are caused by a single base pair error, which is ultimately a large number of reads. At 25, we trade some sensitivity for speed, but since we are more interested in obtaining high-quality reads than in correcting as many reads as possible, this trade-off is favorable. Finally, we map the reads that were successfully error corrected back to the reference genomes. We then summarize the results by counting the number of reads that mapped to the reference genomes. Each mapping reports the number of mismatches between the read and the reference, and by counting these we can compute the error rate of the mappable reads.

Though the even and staggered datasets use the same reference genomes, the percentage of mappable reads differs greatly, at 71.48% and 48.68%, respectively (Table 4.2). After correcting each of the read datasets, 65.99% of the even reads were kept and corrected along with 66.01% of the staggered. Even though fewer staggered reads were mappable, a similar proportion were able to be corrected. Even more promising, 97.72% of reads in the error-corrected even dataset were successfully mapped back to the reference genomes, while 93.44% of the reads in the error-corrected staggered dataset mapped back. This strongly implies that the reads are being accurately corrected and that highly error-prone reads are eliminated. Though we do not know how many mismatches existed in the unmappable reads, the specificity is clearly much better compared to the raw dataset.

Tables 4.3 and 4.4 summarize the mapping results broken down by genome. The first surprising result is that the even dataset is not very evenly distributed by abundance level. Approximately 35.69% of the raw reads mapped to the D. radiodurans genome, and only four genomes had more than 5% of the raw reads map to them. Because of the unevenness of this sample, it is likely that the digital normalization algorithm filtered out most of the reads of many genomes because there was not enough coverage for even C = 10. Low-abundance organisms like L. gasseri and S. agalactiae were almost completely filtered out. Though concerning, this does suggest that the error correction procedure works as intended, because the vast majority of the k-mers in those low-abundance genomes would likely be called erroneous due to the low coverage. With the lower-abundance genomes filtered out, the high-abundance organisms were corrected to have many more reads mappable to the genome than in the raw dataset. The difference in the number of errors between the raw and error-corrected datasets is quite dramatic, with some genomes showing several-fold fewer erroneous bases. The most abundant species, D. radiodurans, had a more than 10x decrease in the number of errors. However, the difference is likely to be even more extreme, as the results in Table 4.2 suggest that the percentage of reads that map to the reference genomes in the error-corrected dataset is much higher than in the raw dataset. Since the most error-prone raw reads are excluded from this analysis, the true gap in errors between the raw and error-corrected datasets is even larger.
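The per-genome counts reported in Tables 4.3 and 4.4 can be tallied directly from the alignments. The sketch below shows one way to do this from a SAM file (for example, Bowtie run with SAM output via -S), using the standard NM optional tag as the per-alignment edit distance; the file name and the assumption that each reference sequence name identifies a single genome are illustrative, since in practice a genome may comprise several reference sequences.

```python
# Sketch: tally mappable reads and mismatch counts per reference genome from
# a SAM file. The NM:i tag is the standard per-alignment edit distance; with
# ungapped Bowtie alignments it reduces to the mismatch count.
from collections import defaultdict

def per_genome_stats(sam_path):
    reads = defaultdict(int)
    errors = defaultdict(int)
    with open(sam_path) as fh:
        for line in fh:
            if line.startswith("@"):            # skip SAM header lines
                continue
            fields = line.rstrip("\n").split("\t")
            flag, rname = int(fields[1]), fields[2]
            if flag & 4 or rname == "*":        # unmapped read
                continue
            nm = 0
            for tag in fields[11:]:
                if tag.startswith("NM:i:"):
                    nm = int(tag[5:])
                    break
            reads[rname] += 1
            errors[rname] += nm
    return reads, errors

if __name__ == "__main__":
    reads, errors = per_genome_stats("even_corrected.sam")  # illustrative file name
    total = sum(reads.values())
    for genome in sorted(reads):
        print(genome, reads[genome], errors[genome],
              "%.4f" % (reads[genome] / total))
```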
Table 4.3: Even HMP Mock Dataset: raw versus error corrected.

Genome | # Reads (Raw) | # Reads (EC) | Errors (Raw) | Errors (EC) | % Reads (Raw) | % Reads (EC) | % Errors (Raw) | % Errors (EC)
Acinetobacter baumannii | 603321 | 669106 | 169100 | 4978 | 0.1312 | 0.1613 | 0.00049031 | 0.00001644
Actinomyces odontolyticus | 105273 | 49946 | 50145 | 254 | 0.0229 | 0.0120 | 0.00014540 | 0.00000084
Bacillus cereus | 40404 | 922 | 10642 | 39 | 0.0088 | 0.0002 | 0.00003086 | 0.00000013
Bacteroides vulgatus | 464746 | 409685 | 138186 | 1629 | 0.1011 | 0.0988 | 0.00040067 | 0.00000538
Clostridium beijerinckii | 195538 | 49529 | 47656 | 233 | 0.0425 | 0.0119 | 0.00013818 | 0.00000077
Deinococcus radiodurans | 1641112 | 2076680 | 802391 | 68753 | 0.3569 | 0.5006 | 0.00232654 | 0.00022701
Enterococcus faecalis | 64467 | 9536 | 17883 | 61 | 0.0140 | 0.0023 | 0.00005185 | 0.00000020
Escherichia coli K12 | 74713 | 6965 | 26375 | 65 | 0.0162 | 0.0017 | 0.00007647 | 0.00000021
Helicobacter pylori | 137925 | 110753 | 37924 | 379 | 0.0300 | 0.0267 | 0.00010996 | 0.00000125
Lactobacillus gasseri | 1419 | 63 | 398 | 13 | 0.0003 | 0.0000 | 0.00000115 | 0.00000004
Listeria monocytogenes | 95721 | 24576 | 26173 | 84 | 0.0208 | 0.0059 | 0.00007589 | 0.00000028
Methanobrevibacter smithii | 35121 | 3846 | 8912 | 15 | 0.0076 | 0.0009 | 0.00002584 | 0.00000005
Neisseria meningitidis | 116171 | 68039 | 44164 | 1501 | 0.0253 | 0.0164 | 0.00012805 | 0.00000496
Propionibacterium acnes | 186478 | 144171 | 80143 | 828 | 0.0406 | 0.0348 | 0.00023238 | 0.00000273
Pseudomonas aeruginosa | 31009 | 492 | 15761 | 53 | 0.0067 | 0.0001 | 0.00004570 | 0.00000017
Rhodobacter sphaeroides | 135655 | 44006 | 71492 | 330 | 0.0295 | 0.0106 | 0.00020729 | 0.00000109
Staphylococcus aureus | 139129 | 78098 | 69307 | 22721 | 0.0303 | 0.0188 | 0.00020096 | 0.00007502
Staphylococcus epidermidis | 258945 | 225277 | 64960 | 1518 | 0.0563 | 0.0543 | 0.00018835 | 0.00000501
Streptococcus agalactiae | 4521 | 278 | 1175 | 21 | 0.0010 | 0.0001 | 0.00000341 | 0.00000007
Streptococcus mutans | 88149 | 34929 | 23603 | 122 | 0.0192 | 0.0084 | 0.00006844 | 0.00000040
Streptococcus pneumoniae | 178648 | 141723 | 49017 | 1021 | 0.0388 | 0.0342 | 0.00014213 | 0.00000337

Table 4.4: Staggered HMP Mock Dataset: raw versus error corrected.
Genome | # Reads (Raw) | # Reads (EC) | Errors (Raw) | Errors (EC) | % Reads (Raw) | % Reads (EC) | % Errors (Raw) | % Errors (EC)
Acinetobacter baumannii | 19219 | 419 | 7219 | 11 | 0.0051 | 0.0001 | 0.00002573 | 0.00000003
Actinomyces odontolyticus | 305 | 3 | 165 | 0 | 0.0001 | 0.0000 | 0.00000059 | 0.00000000
Bacillus cereus | 13209 | 465 | 4771 | 58 | 0.0035 | 0.0001 | 0.00001700 | 0.00000017
Bacteroides vulgatus | 1014 | 9 | 426 | 5 | 0.0003 | 0.0000 | 0.00000152 | 0.00000001
Clostridium beijerinckii | 59670 | 3250 | 20078 | 68 | 0.0159 | 0.0007 | 0.00007155 | 0.00000019
Deinococcus radiodurans | 8562 | 160 | 5722 | 6 | 0.0023 | 0.0000 | 0.00002039 | 0.00000002
Enterococcus faecalis | 336 | 179 | 171 | 39 | 0.0001 | 0.0000 | 0.00000061 | 0.00000011
Escherichia coli K12 | 309905 | 306980 | 149326 | 2771 | 0.0828 | 0.0648 | 0.00053217 | 0.00000793
Helicobacter pylori | 4186 | 8 | 1541 | 3 | 0.0011 | 0.0000 | 0.00000549 | 0.00000001
Lactobacillus gasseri | 358 | 97 | 140 | 10 | 0.0001 | 0.0000 | 0.00000050 | 0.00000003
Listeria monocytogenes | 3313 | 271 | 1320 | 46 | 0.0009 | 0.0001 | 0.00000470 | 0.00000013
Methanobrevibacter smithii | 183680 | 204721 | 63333 | 627 | 0.0491 | 0.0432 | 0.00022571 | 0.00000179
Neisseria meningitidis | 7495 | 3759 | 4208 | 1344 | 0.0020 | 0.0008 | 0.00001500 | 0.00000385
Propionibacterium acnes | 8071 | 23 | 4846 | 4 | 0.0022 | 0.0000 | 0.00001727 | 0.00000001
Pseudomonas aeruginosa | 75960 | 10827 | 52546 | 462 | 0.0203 | 0.0023 | 0.00018726 | 0.00000132
Rhodobacter sphaeroides | 754540 | 1249509 | 542721 | 41796 | 0.2017 | 0.2636 | 0.00193414 | 0.00011964
Staphylococcus aureus | 896931 | 1191682 | 511311 | 357082 | 0.2397 | 0.2514 | 0.00182221 | 0.00102213
Staphylococcus epidermidis | 1019837 | 1306522 | 352521 | 22359 | 0.2726 | 0.2756 | 0.00125631 | 0.00006400
Streptococcus agalactiae | 21450 | 1383 | 7779 | 53 | 0.0057 | 0.0003 | 0.00002772 | 0.00000015
Streptococcus mutans | 352806 | 460061 | 129918 | 2779 | 0.0943 | 0.0970 | 0.00046300 | 0.00000795
Streptococcus pneumoniae | 475 | 186 | 207 | 32 | 0.0001 | 0.0000 | 0.00000074 | 0.00000009

Unsurprisingly, the gap in mappable raw reads between the most abundant genomes and the rest is greatest in the staggered dataset, but there was not a single dominant species. Three genomes, R. sphaeroides, S. aureus, and S. epidermidis, each accounted for between 20% and 30% of the mappable raw reads, while the rest were below 10%. Compared to the even dataset, more genomes were at low abundance and had their reads discarded by the error correction algorithm. In general, the trend appears to be that when the error correction algorithm is run, more reads from high-abundance organisms become mappable while fewer reads from lower-abundance organisms do. Despite this, as long as the abundance of a species is high enough, the number of reads that are kept and mappable either increases or remains steady compared to the raw dataset. Furthermore, the difference in the number of errors between the raw and error-corrected datasets is quite large even though the numbers of mappable reads in the two datasets are quite similar. Once again, since the percentage of mappable reads in the raw dataset is far lower than in the error-corrected dataset, the number of errors in the raw dataset is far greater than what the tables present.

4.5 Data

4.5.1 E. coli Dataset

The E. coli dataset originates from [9] but is a subset containing only five million reads. This dataset can be obtained from https://s3.amazonaws.com/public.ged.msu.edu/ecoli_ref-5m.fastq.gz and was used in the digital normalization paper [6].

4.5.2 Human Gut Mock Dataset

The human gut mock dataset was obtained from the NCBI Short Read Archive with accession numbers SRX055380 (6.6 million reads) and SRX055381 (7.9 million reads), where the former corresponds to the even sample and the latter to the staggered sample.
The genomes of the microbes used in the mock dataset were obtained from ftp://ftp.hgsc.bcm.tmc.edu/pub/misc/HMP/mock/mock.all.genome.fa.

Chapter 5

Conclusions

We have established a framework for scaling metagenome sequence assembly. This dissertation covers an overview of existing approaches and how they scale. Building on this foundation, we introduced a probabilistic de Bruijn graph and described its properties. Furthermore, we evaluated the utility of a partitioning algorithm that can be used as a precursor to assembly to bring down memory requirements. We then established a metagenome error correction algorithm. In this chapter, we briefly summarize how these approaches help to tackle the metagenome assembly problem and discuss possible avenues for future work.

5.1 Summary

The overarching goal of the projects discussed in this dissertation is to better scale metagenome sequence assembly. Scaling can be approached in a number of different ways, but we focus on two. In the first, we improve the memory requirements for assembly by a constant factor. In the second, we reduce the number of errors in the assembly graph, which speeds up the assembly process, improves accuracy, and further lowers the memory footprint. Combined, these approaches effectively scale metagenome assembly in terms of both memory and running time.

By using a simple probabilistic data structure, a Bloom filter, we are able to store k-mers in a much more memory-efficient manner compared to exact methods. However, this added efficiency comes at the cost of false positives, where k-mers that are not present in the dataset are declared present when queried. When adding the false positive k-mers to a de Bruijn graph constructed from a set of reads, we call the resulting graph a probabilistic de Bruijn graph. We found that below a false positive rate of approximately 0.183, graph components do not erroneously fuse together, which enables a component-separation step prior to assembly. Assuming extremely large components do not exist in the original graph, this can lead to memory savings in later stages of an assembly pipeline. We successfully tested this algorithm on a real soil metagenomic dataset, and compared the false positive k-mers generated by the Bloom filter with error k-mers caused by sequencing technology.

By increasing the number of bits per entry in a Bloom filter, we obtain a Counting Bloom filter that can be used to count the k-mers in a sequencing dataset. In addition to false positives, where a k-mer that is not in the dataset is declared present, there is an overestimation error: if every counter a k-mer hashes to has suffered at least one collision, the reported count will be an overestimate. Regardless, the generalization of the Bloom filter to larger bin sizes enables fairly accurate k-mer counting, especially for metagenome datasets that have many low-abundance k-mers.

Using a Counting Bloom filter, we can obtain the k-mer abundance distribution for a DNA sequence dataset. For a simple genome, the distribution will be bimodal, with many low-frequency k-mers due to errors and a second peak whose mode lies at the expected k-mer coverage. This distribution does not generally occur in metagenomic datasets because of the variability in abundance levels among the different microbial "species" in the sample. However, by applying the digital normalization algorithm, one can normalize the abundance levels to even out the frequencies within the sample.
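The decision rule at the heart of digital normalization is simple: a read is kept only while the median abundance of its k-mers, as observed so far, is below the coverage cutoff C. The sketch below illustrates this rule with an exact dictionary standing in for the counting Bloom filter used in practice; the parameter values and function name are illustrative.

```python
# Minimal sketch of the digital normalization decision rule, assuming exact
# k-mer counting. In the actual pipeline a counting Bloom filter is used, so
# counts may be slightly overestimated but the rule is the same.
from collections import Counter
from statistics import median

def normalize(reads, k=20, C=20):
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue
        # keep the read only if its median k-mer abundance is still below C
        if median(counts[km] for km in kmers) < C:
            kept.append(read)
            for km in kmers:          # only count k-mers from reads we keep
                counts[km] += 1
    return kept
```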
With a new k-mer abundance distribution, we can appropriately discriminate between erroneous and correct k-mers by fitting a mixture model. Then, we can align each read in the dataset to the post-digital-normalization graph to error correct the reads. By comparing the k-mer abundance distributions of the E. coli dataset corrected with the genome and corrected with digital normalization, we showed that the error correction method produces a much improved k-mer spectrum. We used a human gut mock dataset to show that the error correction procedure fixes the vast majority of reads while discarding those that would complicate assembly. Because the microbes in this mock metagenomic dataset have fully sequenced reference genomes, we were able to accurately map reads back to the references and determine that the error correction algorithm successfully discarded a large number of erroneous reads and greatly improved the accuracy of the remaining ones.

Altogether, these memory reduction and error correction methods can be combined as part of a pre-assembly pipeline to improve the accuracy of a metagenomics dataset. Highly diverse samples require deep sampling to discover the novelty in sufficient detail, so a buildup of error k-mers is to be expected, especially for highly abundant species. The memory reduction filtering approaches effectively remove erroneous k-mers without thoroughly examining them (though sufficient coverage of these regions is assured through the digital normalization algorithm), while the error correction method examines the k-mer abundance distribution to correct erroneous k-mers to trusted ones. The elimination and correction of erroneous k-mers is ultimately what makes these approaches successful.

5.2 Future Work

Though the approaches detailed in this dissertation reduce the memory requirements for metagenome sequence assembly and improve accuracy, there is still plenty of room for improvement. Due to the speed at which new sequencing technologies are being developed and improved upon, it is likely that all of the datasets used in these works are from chemistries several generations old. It is important to develop algorithms that generalize well across different sequencing platforms. While it is difficult to predict how sequencing will improve in the next 2-5 years, it is reasonable to expect that read lengths will continue to grow and error rates will continue to decrease. With these points in mind, there are many areas worth exploring to continue to improve metagenome sequence assembly.

One area of improvement is to develop more sensitive metagenome sequence error correction algorithms. Though the approach elaborated in this work successfully improves the quality of metagenome assemblies, it is not sensitive enough for rare variant detection. It is possible to map the reads back to the metagenome assembly to detect these variants, but ideally they would be detected and kept rather than potentially discarded as redundant sequence due to the effects of the digital normalization algorithm.

The error correction algorithm implemented in Chapter 4 could be greatly improved by a new seeding strategy. Currently, the method requires that at least one k-mer in the read is considered correct. However, for larger values of k, multiple sequencing errors can result in large error regions that are difficult to align via the graph.
Currently, the maximum error region size heuristic drastically improves performance at the expense of discarding exceptionally error-prone reads. This heuristic is undesirable for longer reads and for sequencing technologies with higher error rates. To address this, a new seeding strategy should not require exact matching. One possibility is to employ a Hamming distance graph, in a similar manner to the Hammer read error corrector, to find seeds along the read that can then be bridged using the graph alignment strategy. On the other hand, with deep enough sequencing, it may make sense to willingly discard exceptionally error-prone reads.

Another way the two-pass error correction method could be improved is to use a one-pass, online formulation. In order to do this, one would need to apply the digital normalization algorithm while simultaneously error correcting reads. There would likely need to be a "burn-in" period during which many errors are discarded, but eventually there would be enough information to correct reads based on the post-digital-normalization k-mer abundance distribution. Another issue that would arise with an online formulation is the use of Bloom filters. The basic type of Bloom filter used in the partitioning and error correction approaches does not remove k-mers over time. To be able to handle a dataset of any size, a Bloom filter variant (or a streaming data structure) that can remove k-mers over time would need to be used. By putting this framework together, it would be possible to error correct a dataset of arbitrary size. However, even with this method it would be difficult to extract low-abundance species from a dataset if the data structure is not able to hold enough novelty. This implies that the bottleneck would continue to be memory and that a memory-efficient approach is required.

Along similar lines, online metagenome sequence assembly is a worthy goal for Bioinformaticians to pursue. Deep sampling of environmental samples is required to fully capture the diversity within a microbial ecosystem, so longer reads and fewer errors alone will not help to sequence the least abundant microbes. Thus, it will be necessary to develop algorithms that treat a reads dataset as a stream where only one pass through the data is possible. This would require incorporating memory efficiency and error correction into the assembly process. As the more abundant sequences are assembled, those parts of the assembly graph must be removed over time and screened from being added again. This will help to keep memory requirements low while allowing low-abundance organisms to be sampled, which can be functionally important.

APPENDIX

This appendix contains a glossary of commonly used terms within this dissertation.

Assembler: A computer program that takes multiple DNA sequence reads, finds overlaps between them, and outputs an assembly.

Assembly: A collection of contigs generated by an assembler.

Contig: Short for contiguous sequence, a contig is a collection of zero or more overlaps from one or more reads.

Coverage: The coverage at a particular location in a genome is the number of reads that map to that location. This definition can be extended to assembly graphs, where the k-mer coverage is the number of reads that contain a given k-mer.

de Bruijn Graph: A type of graph that is used for DNA sequence assembly. A de Bruijn graph can have an alphabet size of 2 or larger and a word size of 1 or more. In the case of DNA, the alphabet size is 4 and the word size varies with the dataset.
A vertex is a DNA word of fixed size k, and its neighbors are any k-mers that share a (k − 1)-mer overlap.

Genome: The collection of DNA sequences that make up a cell, organism, or species. A eukaryotic species like human does not possess a single genome, but rather one in each cell. Each individual within a species also has a unique genome, even though the genome for a species may be collectively referred to as the genome for that species (e.g., the Human Genome).

Indel: When comparing two aligned sequences, an indel is an insertion of one or more base pairs in one sequence and a corresponding deletion in the other sequence.

K-mer: A k-mer is a DNA word of a fixed size k. If k = 4, an example of a 4-mer is CGAT.

Mapper: A mapper is a computer program that aligns a DNA sequence read to a reference genome or sequence.

Metagenome: A collection of genomes obtained from sequencing an environmental sample.

OTU: Short for Operational Taxonomic Unit. It is used to bin similar organisms together based on similarity as a way to form an arbitrary definition of a species in a microbial context.

Overlap: Where a prefix of one read is the same as the suffix of another. Usually, a putative overlap must be of a certain size to be considered a true overlap. Inexact overlaps, where mismatches exist between reads, are also possible.

Paired-End Read: When the two ends of a longer DNA sequence fragment are sequenced. The middle part of the sequence is not known, but the distance between the two sequenced reads is known. This added information assists in generating and validating sequence assemblies.

Read: A single sequence obtained from a sequencer.

Reverse Complement: Given a DNA sequence, the reverse complement is the opposite strand of the DNA sequence read in the opposite order, since a DNA strand is generally read in only one direction. The reverse complement is a combination of a string reverse operation and a DNA complement operation (substitute A for T, T for A, C for G, and G for C).

SNP: A single nucleotide polymorphism. This is a single substitution between two or more aligned sequences.

Scaffold: The gap between two contigs. That is, the sequence between the contigs is not known, but the relative orientation of and distance between the contigs are.

Supercontig: A collection of contigs whose relative positions to one another are known due to scaffolds.

Shotgun Sequencing: A method for sequencing where DNA sequence reads are essentially randomly sampled from a genome (or multiple genomes).

Transcriptome: The collection of mRNA transcripts from a cell or organism.

BIBLIOGRAPHY

[1] C. Adami and J. Chu. Critical and near-critical branching processes. Physical Review E, 66(1):011907, 2002.

[2] S. Batzoglou, D. Jaffe, K. Stanley, J. Butler, S. Gnerre, E. Mauceli, B. Berger, J. Mesirov, and E. Lander. Arachne: a whole-genome shotgun assembler. Genome Research, 12(1):177, 2002.

[3] B. Bloom. Space/time tradeoffs in hash coding with allowable errors. CACM, 13(7):422–426, 1970.

[4] S. Boisvert, F. Laviolette, and J. Corbeil. Ray: Simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of Computational Biology, 17(11):1519–1533, 2010.

[5] S. Boisvert, F. Raymond, E. Godzaridis, F. Laviolette, and J. Corbeil. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biology, 13(12):R122, 2012.

[6] C. Brown, A. Howe, Q. Zhang, A. Pyrkosz, and T. Brom. A reference-free algorithm for computational normalization of shotgun sequencing data. In review at PLoS One, July 2012.
Preprint at http://arxiv.org/abs/1203.4802, 2012.

[7] M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. 1994.

[8] T. Chen and S. Skiena. Trie-based data structures for sequence assembly. In Combinatorial Pattern Matching, pages 206–223. Springer, 1997.

[9] H. Chitsaz, J. L. Yee-Greenbaum, G. Tesler, M.-J. Lombardo, C. L. Dupont, J. H. Badger, M. Novotny, D. B. Rusch, L. J. Fraser, N. A. Gormley, et al. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nature Biotechnology, 2011.

[10] T. C. Conway and A. J. Bromage. Succinct data structures for assembling large genomes. Bioinformatics, 27(4):479–486, Feb 2011.

[11] G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

[12] N. de Bruijn and P. Erdos. A combinatorial problem. Koninklijke Nederlandse Akademie van Wetenschappen, 49:758–764, 1946.

[13] D. Earl, K. Bradnam, J. St John, A. Darling, D. Lin, J. Fass, H. O. Yu, V. Buffalo, D. R. Zerbino, M. Diekhans, N. Nguyen, P. N. Ariyaratne, W. K. Sung, Z. Ning, M. Haimel, J. T. Simpson, N. A. Fonseca, I. Birol, T. R. Docking, I. Y. Ho, D. S. Rokhsar, R. Chikhi, D. Lavenier, G. Chapuis, D. Naquin, N. Maillet, M. C. Schatz, D. R. Kelley, A. M. Phillippy, S. Koren, S. P. Yang, W. Wu, W. C. Chou, A. Srivastava, T. I. Shaw, J. G. Ruby, P. Skewes-Cox, M. Betegon, M. T. Dimon, V. Solovyev, I. Seledtsov, P. Kosarev, D. Vorobyev, R. Ramirez-Gonzalez, R. Leggett, D. MacLean, F. Xia, R. Luo, Z. Li, Y. Xie, B. Liu, S. Gnerre, I. MacCallum, D. Przybylski, F. J. Ribeiro, S. Yin, T. Sharpe, G. Hall, P. J. Kersey, R. Durbin, S. D. Jackman, J. A. Chapman, X. Huang, J. L. DeRisi, M. Caccamo, Y. Li, D. B. Jaffe, R. E. Green, D. Haussler, I. Korf, and B. Paten. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res., 21(12):2224–2241, Dec 2011.

[14] L. Fan, P. Cao, J. Almeida, and A. Broder. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking (TON), 8(3):281–293, 2000.

[15] X. Feng, Y. Deng, and H. Blöte. Percolation transitions in two dimensions. Physical Review E, 78(3):031136, 2008.

[16] P. Ferragina and G. Manzini. Opportunistic data structures with applications. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 390–398. IEEE, 2000.

[17] J. Gilbert, F. Meyer, J. Jansson, J. Gordon, N. Pace, J. Tiedje, R. Ley, N. Fierer, D. Field, N. Kyrpides, F. Glockner, H. Klenk, K. Wommack, E. Glass, K. Docherty, R. Gallery, R. Stevens, and R. Knight. The Earth Microbiome Project: Meeting report of the '1st EMP meeting on sample selection and acquisition' at Argonne National Laboratory October 6 2010. Stand Genomic Sci, 3(3):249–53, 2010.

[18] D. Hernandez, P. François, L. Farinelli, M. Østerås, and J. Schrenzel. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research, 18(5):802–809, 2008.

[19] M. Hess, A. Sczyrba, R. Egan, T. W. Kim, H. Chokhawala, G. Schroth, S. Luo, D. S. Clark, F. Chen, T. Zhang, R. I. Mackie, L. A. Pennacchio, S. G. Tringe, A. Visel, T. Woyke, Z. Wang, and E. M. Rubin. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science, 331:463–467, Jan 2011.

[20] X. Huang and S. Yang. Generating a genome assembly with PCAP. Current Protocols in Bioinformatics, 2005.
[21] Z. Iqbal, M. Caccamo, I. Turner, P. Flicek, and G. McVean. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet., 44(2):226–232, Feb 2012.

[22] V. Iverson, R. M. Morris, C. D. Frazar, C. T. Berthiaume, R. L. Morales, and E. V. Armbrust. Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science, 335(6068):587–590, Feb 2012.

[23] D. R. Kelley, M. C. Schatz, and S. L. Salzberg. Quake: quality-aware detection and correction of sequencing errors. Genome Biol., 11(11):R116, 2010.

[24] H. Kesten. The critical probability of bond percolation on the square lattice equals 1/2. Communications in Mathematical Physics, 74(1):41–59, 1980.

[25] E. S. Lander and M. S. Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2(3):231–239, 1988.

[26] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10(3):R25, 2009.

[27] H. Li and R. Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589–595, Mar 2010.

[28] R. Li, H. Zhu, J. Ruan, W. Qian, X. Fang, Z. Shi, Y. Li, S. Li, G. Shan, K. Kristiansen, S. Li, H. Yang, J. Wang, and J. Wang. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res., 20(2):265–272, Feb 2010.

[29] Y. Liu, J. Schröder, and B. Schmidt. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics, 29(3):308–315, 2013.

[30] R. Mackelprang, M. P. Waldrop, K. M. DeAngelis, M. M. David, K. L. Chavarria, S. J. Blazewicz, E. M. Rubin, and J. K. Jansson. Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature, 480(7377):368–371, Dec 2011.

[31] U. Manber and G. Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.

[32] G. Marcais and C. Kingsford. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764–770, Mar 2011.

[33] P. Medvedev, E. Scott, B. Kakaradov, and P. Pevzner. Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics, 27(13):i137–141, Jul 2011.

[34] F. Meyer, D. Paarmann, M. D'Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, and R. A. Edwards. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 9:386, 2008.

[35] J. Miller, S. Koren, and G. Sutton. Assembly algorithms for next-generation sequencing data. Genomics, 95(6):315–327, 2010.

[36] E. Myers, G. Sutton, A. Delcher, I. Dew, D. Fasulo, M. Flanigan, S. Kravitz, C. Mobarry, K. Reinert, K. Remington, et al. A whole-genome assembly of Drosophila. Science, 287(5461):2196, 2000.

[37] E. W. Myers. The fragment assembly string graph. Bioinformatics, 21 Suppl 2:79–85, Sep 2005.

[38] T. Namiki, T. Hachiya, H. Tanaka, and Y. Sakakibara. MetaVelvet: An extension of Velvet assembler to de novo metagenome assembly from short sequence reads. ACM Conference on Bioinformatics, Computational Biology and Biomedicine, 2011.

[39] Y. Peng, H. Leung, S. Yiu, and F. Chin. Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics, 27(13):i94–i101, 2011.

[40] P. A. Pevzner, H. Tang, and M. S. Waterman. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. U.S.A., 98:9748–9753, Aug 2001.
[41] J. Qin, R. Li, J. Raes, M. Arumugam, K. S. Burgdorf, C. Manichanh, T. Nielsen, N. Pons, F. Levenez, T. Yamada, D. R. Mende, J. Li, J. Xu, S. Li, D. Li, J. Cao, B. Wang, H. Liang, H. Zheng, Y. Xie, J. Tap, P. Lepage, M. Bertalan, J. M. Batto, T. Hansen, D. Le Paslier, A. Linneberg, H. B. Nielsen, E. Pelletier, P. Renault, T. Sicheritz-Ponten, K. Turner, H. Zhu, C. Yu, S. Li, M. Jian, Y. Zhou, Y. Li, X. Zhang, S. Li, N. Qin, H. Yang, J. Wang, S. Brunak, J. Dore, F. Guarner, K. Kristiansen, O. Pedersen, J. Parkhill, J. Weissenbach, P. Bork, S. D. Ehrlich, J. Wang, M. Antolin, F. Artiguenave, H. Blottiere, N. Borruel, T. Bruls, F. Casellas, C. Chervaux, A. Cultrone, C. Delorme, G. Denariaz, R. Dervyn, M. Forte, C. Friss, M. van de Guchte, E. Guedon, F. Haimet, A. Jamet, C. Juste, G. Kaci, M. Kleerebezem, J. Knol, M. Kristensen, S. Layec, K. Le Roux, M. Leclerc, E. Maguin, R. M. Minardi, R. Oozeer, M. Rescigno, N. Sanchez, S. Tims, T. Torrejon, E. Varela, W. de Vos, Y. Winogradsky, and E. Zoetendal. A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464:59–65, Mar 2010.

[42] M. Rho, H. Tang, and Y. Ye. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res., 38(20):e191, Nov 2010.

[43] R. Ronen, C. Boucher, H. Chitsaz, and P. Pevzner. SEQuel: improving the accuracy of genome assemblies. Bioinformatics, 28(12):i188–196, Jun 2012.

[44] S. L. Salzberg, A. M. Phillippy, A. Zimin, D. Puiu, T. Magoc, S. Koren, T. J. Treangen, M. C. Schatz, A. L. Delcher, M. Roberts, G. Marcais, M. Pop, and J. A. Yorke. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res., 22(3):557–567, Mar 2012.

[45] J. Simpson and R. Durbin. Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26(12):i367–i373, 2010.

[46] J. Simpson and R. Durbin. Efficient de novo assembly of large genomes using compressed data structures. Genome Research, 22(3):549–556, 2012.

[47] J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J. Jones, and I. Birol. ABySS: a parallel assembler for short read sequence data. Genome Res., 19:1117–1123, Jun 2009.

[48] J. T. Staley and A. Konopka. Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Annual Reviews in Microbiology, 39(1):321–346, 1985.

[49] D. Stauffer. Scaling theory of percolation clusters. Physics Reports, 54(1):1–74, 1979.

[50] S. Kurtz, A. Narechania, J. Stein, and D. Ware. A new method to compute k-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics, 9, 2008.

[51] G. W. Tyson, J. Chapman, P. Hugenholtz, E. E. Allen, R. J. Ram, P. M. Richardson, V. V. Solovyev, E. M. Rubin, D. S. Rokhsar, and J. F. Banfield. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, 428(6978):37–43, Mar 2004.

[52] J. C. Venter, K. Remington, J. F. Heidelberg, A. L. Halpern, D. Rusch, J. A. Eisen, D. Wu, I. Paulsen, K. E. Nelson, W. Nelson, D. E. Fouts, S. Levy, A. H. Knap, M. W. Lomas, K. Nealson, O. White, J. Peterson, J. Hoffman, R. Parsons, H. Baden-Tillson, C. Pfannkoch, Y. H. Rogers, and H. O. Smith. Environmental genome shotgun sequencing of the Sargasso Sea. Science, 304(5667):66–74, Apr 2004.

[53] P. Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, IEEE Conference Record of 14th Annual Symposium on, pages 1–11. IEEE, 1973.
[54] D. Wu, M. Wu, A. Halpern, D. Rusch, S. Yooseph, M. Frazier, J. Venter, and J. Eisen. Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in marker gene phylogenetic trees. PLoS One, 6(3):e18011, 2011.

[55] D. R. Zerbino and E. Birney. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18:821–829, May 2008.