STUDYING THE EFFECTS OF SAMPLING ON THE EFFICIENCY
AND ACCURACY OF KMER INDEXES
By
Meznah Almutairy
A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Computer Science  Doctor of Philosophy
2017
ABSTRACT
STUDYING THE EFFECTS OF SAMPLING ON THE EFFICIENCY AND
ACCURACY OF KMER INDEXES
By
Meznah Almutairy
Searching for local alignments is a critical step in many bioinformatics applications and
pipelines. This search process is often sped up by finding shared exact matches of a minimum length. Depending on the application, the shared exact matches are extended to
maximal exact matches, and these are often extended further to local alignments by allowing mismatches and/or gaps. In this dissertation, we focus on searching for all maximal
exact matches (M EM s) and all highly similar local alignments (HSLAs) between a query
sequence and a database of sequences. We focus on finding M EM s and HSLAs over nucleotide sequences.
One of the most common ways to search for all M EM s and HSLAs is to use a kmer
index such as BLAST. A major problem with kmer indexes is the space required to store
the lists of all occurrences of all kmers in the database. One method for reducing the space
needed, and also query time, is sampling where only some kmer occurrences are stored.
We classify sampling strategies used to create kmer indexes in two ways: how they
choose kmers and how many kmers they choose. The kmers can be chosen in two ways:
fixed sampling and minimizer sampling. A sampling method might select enough kmers
such that the kmer index reaches full accuracy. We refer to this sampling as hard sampling.
Alternatively, a sampling method might select fewer kmers to reduce the index size even
further but the index does not guarantee full accuracy. We refer to this sampling as soft
sampling. In the current literature, no systematic study has been done to compare the
different sampling methods and their relative benefits/weakness.
It is well known that fixed sampling will produce a smaller index, typically by roughly a
factor of two, whereas it is generally assumed that minimizer sampling will produce faster
query times since query kmers can also be sampled. However, no direct comparison of
fixed and minimizer sampling has been performed to verify these assumptions. Also, most
previous work uses hard sampling, in which all similar sequences are guaranteed to be found.
In contrast, we study soft sampling, which further reduces the kmer index at a cost of
decreasing query accuracy.
We systematically compare fixed and minimizer sampling to find all M EM s between
large genomes such as the human genome and the mouse genome. We also study soft
sampling to find all HSLAs using the NCBI BLAST tool with the human genome and
human ESTs. We use BLAST, since it is the most widely used tool to search for HSLAs.
We compared the sampling methods with respect to index size, query time, and query
accuracy.
We reach the following conclusions. First, using larger kmers reduces query time for both
fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same
kmer size for both methods, fixed sampling requires typically half as much space whereas
minimizer sampling processes queries slightly faster. If we are allowed to use any kmer size
for each method, then we can choose a kmer size such that fixed sampling both uses less
space and processes queries faster than minimizer sampling. When identifying HSLAs, we
find that soft sampling significantly reduces both index size and query time with relatively
small losses in query accuracy. The results demonstrate that soft sampling is a simple but
effective strategy for performing efficient searches for HSLAs. We also provide a new model
for sampling with BLAST that predicts empirical retention rates with reasonable accuracy.
To my parents, my husband, and my children.
iv
ACKNOWLEDGMENTS
It is my earnest desire to express my gratitude and appreciation for my dissertation adviser
Dr. Eric Torng who has provided constant guidance and encouragement throughout my
graduate studies. His guidance was essential to the completion of this dissertation and has
taught me innumerable lessons and insights on the academic research.
I would like to acknowledge my dissertation committee members Dr. Yanni Sun, Dr.
Kevin Liu, and Dr. Jason Gallant for their expertise and suggestions that have significantly
improved this dissertation. Also, I would like to acknowledge the support of the Michigan
State University High Performance Computing Center and the Institute for Cyber Enabled
Research.
I am truly grateful to my parents Rasheed and Haya for their everlasting love and prayers.
I wish to give my heartfelt thanks to my husband, Nasser, who always believed in me. Lastly,
I am deeply thankful to my children Joud, Jenna, and Ayah for bringing smiles to my heart
and joys to my spirit.
v
TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
x
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . .
1.1 Sequence similarity search . . . . . . . . . . . . . . . . . .
1.2 Maximal exact matches and highly similar local alignments
1.3 Overview of searching methods . . . . . . . . . . . . . . .
1.3.1 Sequence alignments types . . . . . . . . . . . . . .
1.3.2 Online searching methods . . . . . . . . . . . . . .
1.3.3 Offline searching methods . . . . . . . . . . . . . .
1.3.3.1 Defining kmer indexes . . . . . . . . . . .
1.4 Overview of kmer sampling strategies . . . . . . . . . . .
1.4.1 Fixed vs. minimizer sampling . . . . . . . . . . . .
1.4.2 Hard fixed vs. soft fixed sampling . . . . . . . . . .
1.5 Research questions and major results . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
1
2
2
2
4
5
6
7
8
9
9
Chapter 2 Related work . . . . . . . . . . . .
2.1 Fixed sampling versus minimizer sampling
2.2 Hard fixed versus soft fixed sampling . . .
2.3 Soft fixed sampling in EST mapping . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
13
14
18
19
Chapter 3 Fixed versus minimizer sampling . . . . . . . . . . . . . .
3.1 The M EM enumeration problem . . . . . . . . . . . . . . . . . . . .
3.2 Using kmer indexes to find M EM s . . . . . . . . . . . . . . . . . . .
3.3 Fixed sampling versus minimizer sampling . . . . . . . . . . . . . . .
3.3.1 Sampling methods . . . . . . . . . . . . . . . . . . . . . . . .
3.3.2 Indexing and querying . . . . . . . . . . . . . . . . . . . . . .
3.3.3 Experimental setting and evaluation metics . . . . . . . . . .
3.3.3.1 Database and query sets . . . . . . . . . . . . . . . .
3.3.3.2 Index parameters and metrics . . . . . . . . . . . . .
3.3.3.3 Querying parameters and metrics . . . . . . . . . . .
3.3.3.4 System specification/configuration . . . . . . . . . .
3.4 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . . .
3.4.1 Index size and index construction time . . . . . . . . . . . . .
3.4.2 Query time . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.4.2.1 The theoretical vs. empirical query time expectation
3.4.3 Space and speed . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
21
22
24
26
30
33
35
35
37
37
38
38
38
41
44
46
vi
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
Chapter 4 Soft fixed sampling . . . . . . . . . . . . . . .
4.1 Highly similar local alignments (HSLA) . . . . . . . .
4.2 Using NCBI BLAST kmer indexes to finding HSLAs
4.3 Hard versus soft Sampling . . . . . . . . . . . . . . . .
4.4 Retention rates and false positives . . . . . . . . . . . .
4.5 Using NCBI BLAST kmer indexes in EST mapping .
4.6 Materials and method . . . . . . . . . . . . . . . . . .
4.6.1 Experimental settings . . . . . . . . . . . . . . .
4.6.2 Analytical modeling . . . . . . . . . . . . . . .
4.7 Results and discussion . . . . . . . . . . . . . . . . . .
4.7.1 Index size . . . . . . . . . . . . . . . . . . . . .
4.7.2 Retention rate of HSLAs . . . . . . . . . . . . .
4.7.3 Possible improvements for the BLAST model .
4.7.4 Query time . . . . . . . . . . . . . . . . . . . .
4.7.5 Mapping results . . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
50
51
53
55
56
57
58
58
65
69
69
70
74
77
79
Conclusions and future work . . . . . . . . . . . . . . . . . . . . .
82
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
84
Chapter 5
vii
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
LIST OF TABLES
Table 3.1: Possible minimizer sampling versions based on the optimization
techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
Table 3.2: datasets used for testing. . . . . . . . . . . . . . . . . . . . . . . . . .
36
Table 3.3: The number of M EM s for each query set and the human genome
for both choices of L given our preprocessing into sequences of
length 1000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
36
Table 3.4: Query times (in hours) for all sampling methods and all choices
of k when L = 50. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
42
Table 3.5: Query times (in hours) or all sampling methods and all choices
of k when L = 100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
43
Table 3.6: The number of shared kmer occurrences (in billions) for all sampling methods and all choices of k when L = 50. . . . . . . . . . . .
45
Table 3.7: The number of shared kmer occurrences (in billions) for all sampling methods and all choices of k when L = 100. . . . . . . . . . .
46
Table 3.8: The mean and standard deviation of the length of a kmer’s list
of occurrences using the human genome for all sampling methods
and all choices of k when L = 50. . . . . . . . . . . . . . . . . . . . . .
47
Table 3.9: The mean and standard deviation of the length of a kmer’s list
of occurrences using the human genome for all sampling methods
and all choices of k when L = 100. . . . . . . . . . . . . . . . . . . . .
47
Table 4.1: Human genome volume characteristics. . . . . . . . . . . . . . . . .
59
Table 4.2: A summary of the parameters used in our experiments for (1)
finding HSLAs and (2) EST mappings. For HSLA, we consider
all four choices of l. For EST, we only consider the first three
choices of l. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
60
Table 4.3: The probability that the number of mismatches exceeds (1 − t)l
for various choices of l and t. . . . . . . . . . . . . . . . . . . . . . . .
77
viii
Table 4.4: The retention rate (RRmap ) and query time reduction (QT R)
results for all 1000 queries when l = 50 and w0 = 5, where 879
were mappable queries. . . . . . . . . . . . . . . . . . . . . . . . . . .
79
Table 4.5: The retention rate (RRmap )and query time reduction (QT R) results for all 1000 queries when l = 100 and w0 = 14, where 794
were mappable queries. . . . . . . . . . . . . . . . . . . . . . . . . . .
80
Table 4.6: The retention rate (RRmap )and query time reduction (QT R) results for all 1000 queries when l = 200 and and w0 = 17, where
528 were mappable queries. . . . . . . . . . . . . . . . . . . . . . . . .
80
ix
LIST OF FIGURES
Figure 3.1: The dictionary sizes, lists sizes, and construction times for kmer
indexes built using fixed sampling (f ix) and minimizer sampling
minrand,one , minrand,many , and minlex,one . For parts (a), (b), and (c),
we use L = 50. For parts (d), (e), and (f), we use L = 100. For all graphs,
12 ≤ k ≤ 32. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
39
Figure 3.2: Comparing the space and speed of fixed sampling (f ix), minimizer sampling minrand,one , minrand,many , and minlex,one . We use
L = 50 for the three upper figures and L = 100 for the three lower figures.
The values for minlex,one when k = 12 are very large and removed from
figure below. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
48
Figure 4.1: Example database sequence s and query sequence q and two local
alignments A1 and A2 . The symbol () identifies two mapped identical
positions and (∗) is an inserted gap position in one of the two sequences.
51
Figure 4.2: Illustration of EST mapping process. The HSLAs (A, A∗), (B, B∗),
(C, C∗), and (D, D∗) are used to report the final mapping. . . . . . . .
58
Figure 4.3: The sampled index SI(w) size (percentage) as a function of sampling step size w of SI(w) versus sampled index SI(w0 ). The
k 0 mer indexes are built with k 0 = 12 and w ≥ w0 where w0 = l − k + 1.
70
Figure 4.4: The actual HSLA retention rate RR(w, w0 ), the actual short HSLA
retention rate RRshort (w, w0 ), and the expected short HSLA retention rate using both Kent’s model E[RRK (w, w0 )] and BLAST
model E[RRB (w, w0 )] for (a) l = 50, w0 = 5, (b) l = 100, w0 = 14,
(c) l = 200, w0 = 17, and (d) l = 400, w0 = 30. For other parameters
values see Table 4.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
71
Figure 4.5: Distribution of predicted and empirical MAXMEM lengths in
HSLAs. The predicted MAXMEM lengths are computed from a Monte
Carlo simulation. (a) l = 50, t = 96%, (b) l = 100, t = 97%, (c) l = 200,
t = 97%, and (d) l = 400, t = 97%. . . . . . . . . . . . . . . . . . . . .
76
Figure 4.6: The average median query time reduction (QTR) percentages
and the actual false positive reduction (FPR) percentages as a
function of sampling step w. (a) l = 50, w0 = 5, (b) l = 100, w0 = 14,
(c) l = 200, w0 = 17, and (d) l = 400, w0 = 30. For other parameters
values see Table 4.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
78
x
Chapter 1
Introduction
1.1
Sequence similarity search
With each passing year, advances in sequencing technologies, such as ROCHE/454, Illumina/Solexa, and Pacific Biosciences (PacBio), are producing DNA sequences both faster
and cheaper. Right now, the average number of sequences generated from one sequencing
run is on the order of hundreds of millions to billions.
While this explosive growth in DNA datasets yields exciting new possibilities for biologists, the vast size of the datasets also presents significant challenges for many computeintensive biology applications. These applications include homogenous search [69, 6, 104, 56],
detection of single nucleotide polymorphisms (SNP) [34, 78, 65], mapping cDNA sequences
against the corresponding genome [36, 66, 99]. sequence assembly [82, 70, 71], sequence
clustering [25, 28, 46], and sequence classification [98, 7, 23]. A core operation in all these
applications is to search the dataset for sequences that are similar to a given query sequence.
The similarity between a dataset sequence and a query sequence may be defined at the sequence to sequence, subsequence to sequence, or subsequence to subsequence level. These
similarity search levels are also known as searching for global alignments, semiglobal alignments, and local alignments, respectively. In this dissertation, we focus on applications that
rely on finding subsequence to subsequence similarity levels such as homogenous search,
1
sequence mapping, and SNP detections.
1.2
Maximal exact matches and highly similar local
alignments
Searching for local alignments is a critical step in many bioinformatics applications and
pipelines. This search process is often sped up by finding shared exact matches of a minimum
length. The value of minimum length is usually set to ensure all the desired similar sequences
are recognized. Depending on the application, the shared exact matches are extended to
maximal exact matches [91, 38], and these are often extended further to local alignments
by allowing mismatches and/or gaps [92, 56, 75, 36].
In this dissertation, we focus on searching for all maximal exact matches (M EM s) and
all highly similar local alignments (HSLAs) between a query sequence and a database of
sequences. We focus on finding M EM s and HSLAs over nucleotide sequences where nucleotides are represented by A, C, G, and T . The M EM s and HSLAs are commonly used
in applications that compare sequences within the same species or closely related species.
1.3
1.3.1
Overview of searching methods
Sequence alignments types
In bioinformatics, a sequence alignment is a procedure of arranging nucleotide or protein
sequences to determine regions of similarity. These alignments are usually used to find functional, structural, or evolutionary relationships between the sequences. Sequence alignments
2
are also used for nonbiological sequences, such as computing the similarity level between
strings in a natural language, text mining, and financial data.
Sequence alignments can be classified into global and local alignments. Global alignments
try to align every residue in every sequence. The NeedlemanWunsch algorithm [64] is a traditional global alignment method that is based on dynamic programming. On the other hand,
local alignments try to find regions of similarity between sequences. The SmithWaterman
algorithm [84] is a traditional local alignment method that is also based on dynamic programming. Semiglobal alignments [14, 53] are hybrid methods that try to align one sequence
entirely within the other sequence. These are especially useful when one sequence is short,
and the other is very long. Many of semiglobal alignment algorithms are extended from
global alignment algorithms such as the modified NeedlemanWunsch algorithm [53]. In
these methods, gaps at the beginning and/or end of an alignment (also known as starting or
trailing spacing) are ignored.
Sequence alignments can also be classified into pairwise sequence alignment and multiple
sequence alignment. In pairwise sequence alignment, we align two sequences. Some standard
pairwise sequence alignment methods are the NeedlemanWunsch algorithm [53], the SmithWaterman algorithm [84], FASTA [69], BLAST and its improved versions [6, 104, 57, 17].
The pairwise sequence alignments are usually computed using a 2dimensional matrix. Multiple sequence alignment methods align more than two sequences at a time. A common way
to perform multiple sequence alignments is to generalize pairwise sequence alignment methods. One way to generalize pairwise sequence alignment methods is to use an ndimensional
matrix instead of a 2dimensional matrix to align n sequences instead of 2 sequences. This
technique is computationally expensive but is guaranteed to find a global optimum solution
and is only used when for small n. A more practical solution for large n is to use heuristics
3
such as progressive alignments as in ClustalW [88] MUSCLE [24] MAFFT [35]. Progressive
alignment methods use a guide tree to reduce the multiple sequence alignments problem into
a series of 2dimensional pairwise sequence alignment problems.
Another way to align sequences is to do structural alignments such as DALI [32], SSAP [87],
and Combinatorial Extension methods [81]. Structural alignments are commonly used for
protein sequences and sometimes RNA sequences. They align sequences using information
about the secondary and tertiary structure. These methods typically find local alignments
and can be applied to two or more sequences.
1.3.2
Online searching methods
Searches can be online or offline. We first describe the online searching methods. The online methods produce the best alignments but are slow use dynamic programming. The
NeedlemanWunsch algorithm [53] and SmithWaterman algorithm [84] are based on dynamic programming and find global and local alignments, respectively.
Faster heuristic methods are often used as dynamic programming requires quadratic time
complexity. One heuristic, known as seedendextend, first finds seeds, wordtoword matches
of a given length, and then tries to extend these seeds into local alignments. FASTA[50, 69]
was one of the earliest seedandextend searching tools. Today, the dominant method for
sequence comparison is the Basic Local Alignment Search Tool, or BLAST [5]. Given a
query, BLAST performs a linear scan over the sequence database searching for a set of
seeds/words shared with some substrings of the query. It then extends them in both directions until the accumulated similarity score begins to decrease. Finally, BLAST reports these
matching regions with high statistical significance. Further improvements of BLAST include
MegaBLAST [104], MPBLAST [40], and miBLAST [39]. MegaBLAST is a greedy algorithm
4
that can efficiently align sequences that are highly similar. MPBLAST and miBLAST are
different versions of BLAST used for parallel queries.
The running time of all online search methods is proportional to the size of the dataset,
which is increasingly prohibitive. Therefore, much work has been done to develop searching
methods that rely on dataset preprocessing to create a search index. These search methods
are known as offline search methods.
1.3.3
Offline searching methods
In offline search, we first preprocess a dataset by building a persistent data structure, called
an index, to support fast search. This data structure is often designed to store the occurrences
of patterns in the dataset to support fast search later on. The patterns might be strings or
substrings of fixed or variable lengths. Most previous works fall into one of two categories:
suffixarray methods and seedandextend heuristic methods.
The suffixarray methods convert the dataset into a suffixarray data structure, which
simulates a suffixtree of the dataset. In the case of DNA datasets, edges of this suffixtree
are labeled with one of the four nucleotides. To form a unique suffix of the dataset, the
algorithms traverse the tree from the root to a leaf while concatenating all the nucleotides
on the edges along the path. Every leaf node stores all matching locations of this unique
suffix in the dataset. Searching for a query read is performed by traversing through the
dataset suffixtree from the root to a leaf node following the query sequence. If there exists
a path from the root to a leaf such that the corresponding suffix of the path matches the
query read, then all the locations stored in the leaf node are returned as matching locations.
Suffix arrays simulate the suffixtree traversal process with a much smaller memory footprint
by using the Ferragina Manzini index [27] and the BurrowsWheeler Transform [16]. There
5
are two significant problems with the suffix array approach: it does not manage mismatches
efficiently, and it uses lots of memory, even with these space saving optimizations.
The offline seedandextend method, similar to the online seedandextend method, is
developed based on the observation that for a correct matching, a query sequence and a
database sequence must share some small regions of exact matches or seeds. By identifying
the seeds of a query, the search methods limit the search range from the whole dataset to
only the neighborhood of each seed. In offline search, seeds are found by preprocessing the
dataset and storing the locations of their occurrences in a separate data structure called an
index. During the search process, a seedandextend method first identifies the seeds in a
query. Then the search method tries to extend the matches at each of the seed locations using
dynamic programming. SSAHA [66] and BLAT [36] are early examples of indexed seedandextend search methods. In both methods, hash tables are used to save seeds extracted from
a database. In 2008, Morgulis et al. [56] allow creating and passing an index to MegaBLAST
to perform indexed search. Instead of hash tables, the index used by Morgulis et al. is an
array of all possible seeds associated with the lists of occurrences of each seed.
1.3.3.1
Defining kmer indexes
Searching with kmer indexes is one of the earliest and major types of indexed seedandextend search methods. In this search method, the seeds are substring of length exactly k
which is known as kmers. A kmer index consists of two major parts: the kmer dictionary
and the inverted lists. The kmer dictionary is composed of all/some possible kmers that
can be extracted from the dataset sequences. Each kmer in the kmer dictionary has an
inverted list which is the list of all/some if its occurrences in the dataset.
Finding maximal exact matches M EM s and highly similar local alignments HSLAs is
6
often sped up using kmer indexes. Similar to any seedandextend offline method, a kmer
index supports quickly finding shared kmers. That is, when a query sequence is given, we
extract kmers from the query, check if that kmer appears in the dictionary, and find the
corresponding shared kmers by using the stored list of occurrences. We typically extend
these shared kmers in two stages. In the first stage, we try to extend every shared kmer into
an exact matche of a given minimum length. If the first stage is successful, we then further
extend the match into an M EM . In the context of searching for HSLAs, every M EM is
extended even further by allowing mismatches and/or gaps. We describe this search process
more precisely in chapter 3 and chapter 4.
In this dissertation, we only focus on indexed seedandextend searching methods that
are used to find M EM s and HSLAs. More precisely, we focus only on kmer index search
methods where seeds are kmers.
1.4
Overview of kmer sampling strategies
One of the biggest problems with using kmer indexes is that the size of the index is significantly larger than the underlying database/ datasets. As biological databases/datasets
rapidly increase in size, the size of the resulting kmer indexes increases which makes using a
kmer index infeasible for many applications. Furthermore, query time increases rapidly as
the database’s size and/or the number of queries increases. To ensure kmer indexes remain
viable, we must mitigate kmer index size and query time.
One of the most effective and widely used ways of mitigating kmer index size and query
time is to perform sampling, in which we omit some kmer occurrences from the index. We
classify sampling strategies in two ways: how they choose kmers and how many kmers they
7
choose.
1.4.1
Fixed vs. minimizer sampling
There are two major ways to choose kmers: fixed sampling and minimizer sampling. In
fixed sampling, for a given sampling step w ≥ 1, we choose the kmers that occur at every
wth position [56, 66, 36]; this is the sampling option supported by Indexed BLAST. When
a query sequence is given, all the kmers from the query are used during the search process.
On the other hand, in minimizer sampling, we choose a “minimum” kmer within a given
window [75, 76, 101, 18, 59, 43]. More specifically, for a window of w adjacent kmers, the
kmer that is alphabetically minimum is selected. The next window then starts one position
to right from the previous window. When a query sequence is given, we also sample the
kmers from query sequence in the same fashion. Thus, only the sampled kmers from the
query are used during the search process.
It has widely been assumed that fixed sampling produces smaller indexes and has less
index construction times than minimizer sampling, but that minimizer sampling leads to
faster query processing than fixed sampling. For example, Roberts et al. [75] highlight the
importance of sampling kmers at query time during the search procedure saying that “the
procedure would still be more efficient if we could compare only a fraction of the kmers
in T to the database” where T is their notation for a set of query strings. However, these
beliefs have never been empirically verified. In fact, few studies have empirically tested either
method on its own [75, 56]. In chapter 3, we fill this gap by systematically evaluating and
comparing fixed sampling and minimizer sampling to assess how these methods perform with
respect to index construction time, index size, and query processing time. Specifically, we
compare and contrast the construction time, index size, and query processing time required
8
to find all M EM s using both fixed and minimizer sampling.
1.4.2
Hard fixed vs. soft fixed sampling
In fixed sampling, for a given sampling step w ≥ 1, a kmer that occurs at every wth position
is saved. We distinguish between two types of fixed sampling: hard sampling [56, 66] and
soft sampling [36]. In hard sampling, we choose w small enough such that we are guaranteed
to find all desired HSLAs. On the other hand, in soft sampling, we consider large w values,
and thus we risk missing some HSLAs.
In chapter 4, we study how best to sample a kmer index to manage index size, query time,
and accuracy where accuracy refers to finding all desired HSLAs. We evaluate a broad range
of sampling rates w that includes existing choices and new sampling choices. In particular,
we study soft sampling, or sampling especially sparsely to reduce further index size and
query time at the risk of missing some HSLAs. We show that using soft sampling, which
has largely been ignored in previous studies, significantly reduces index size and computation
times with very little loss in accuracy.
1.5
Research questions and major results
We study the problem of trying to find the best sampling strategy to create simultaneously
efficient and accurate kmer indexes in the context of finding M EM s and HSLAs using
DNA datasets. In particular, we want to answer two main questions. The first question
is to compare the effectiveness of the two major sampling strategies, fixed sampling and
minimizer sampling. The second question is to study the effectiveness of soft or sparse fixed
sampling. To address these two major questions, we systematically investigate the effect of
9
sampling on kmer indexes.
Most biological applications use one of two major types of sampling: fixed sampling and
minimizer sampling. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling
will produce faster query times since query kmers can also be sampled. However, no direct
comparison of fixed and minimizer sampling has been performed. In chapter 3, we systematically compare fixed and minimizer sampling with respect to index size and query processing
time. We use the resulting kmer indexes for fixed sampling and minimizer sampling to find
all M EM s.
In chapter 4, we study the effects of fixed soft sampling with NCBI indexed BLAST
when searching for HSLAs. Our work is unique in that there is little prior work that has
considered soft sampling, and the little work that has considered it has not systematically
studied how sampling parameter w affects accuracy. We specifically study the effect of w on
the size, accuracy, and query time of the kmer index. We also extend previous analytical
models to work with a wider range of w and k values. Then, we compare our empirical
results with predictions from both the original and the extended analytical models.
To assess the effectiveness of soft sampling in a real biological application, we use soft
sampling kmer index in mapping human ESTs on the human genome. We focus on kmerbased mappers [30, 3, 77, 2, 33, 96] that are typically fully sensitive mappers that “can detect
sequences missed by other tools” [31] but may be relatively slow. We study whether soft
sampling kmer indexes might increase the speed of these mappers with relatively little loss in
sensitivity. These methods work in two stages. First, they find the set of all HSLAs between
an EST and a genome. Then they map the EST to the genome by selecting and linking
these HSLAs. The mappers usually differ in how to modify, evaluate, and use the resulting
10
HSLAs to assess the final mapping process. In chapter 4, we assess precisely whether the
correct mapping is retained when we use softsampled kmer indexes to complete the first
stage of finding HSLAs. We only simulate the mapping process because we want our results
to be general and independent from the details of the final mapping process of a mapper.
We summarize our major contributions.
1. We systematically compare fixed sampling with minimizer sampling using real biological datasets to assess how well they find all M EM s with respect to index construction
time, index size, and query processing time. Our results show the surprising result
that fixed sampling typically answers queries at least as fast as minimizer sampling
and often is faster when both methods use the same space regardless of k.
2. We evaluate the impact of the k value on the effectiveness of fixed and minimizer
sampling methods to find all M EM s. Previous studies usually focus on only one value
of k. We show that the value of k has a significant impact on a sampling method’s
index size and query processing time. When the value of k decrease, fewer number of
kmer occurrences are saved resulting in smaller indexes. However, when the value of k
increases, the index processes queries much faster. On average, the reduction in query
times for all query sets when k = 32 compared to k = 12 is 37 and 136 times faster for
fixed sampling and minimizer sampling respectively.
3. We systematically assess how well BLAST can find HSLAs when using soft sampling
when working with the human genome as our database. In particular, we study how
accurate BLAST is in retaining all HSLAs. We show that BLAST’s accuracy is
high, even for large choices of w. Furthermore, the false positive rate, in the form
of shared M EM s that do not extend into HSLA, is significantly reduced, leading
11
to a corresponding significant reduction in query time. This demonstrates that soft
sampling is a simple but effective method to increase index efficiency with surprisingly
little loss of accuracy.
4. We design a new analytical model that we call the BLAST model by extending previously developed analytical models to work with our choices of k and w. We compare
the theoretical predictions from our new BLAST model and old models with our empirical results. We show that the new BLAST model is reasonably accurate whereas
other analytical models are not accurate in our context.
5. Finally, we study the effects of using soft sampling for the problem of mapping human
ESTs against the human genome. We conservatively simulate the process because
either existing mapping tools do not support soft sampling or do not allow us to
replace the first phase of finding HSLAs. We show that we are able to map more than
98% of the query ESTs perfectly while reducing index size by 35 times and query time
by 23.3% when compared to hard sampling.
12
Chapter 2
Related work
Over the last decade, there has been a dramatic increase in the use of kmer indexes in
biological applications and pipelines to accelerate the search for maximal exact matches
(M EM s) and/or highly similar local alignments (HSLAs). Since kmer indexes require a
lot of memory, sampling has been widely used to reduce the index size. We classify sampling
strategies used to create kmer indexes in two ways: how they choose kmers and how many
kmers they choose. The kmers can be sampled in two ways: fixed sampling and minimizer
sampling. The number of sampled kmers depends on the size of a sampling parameter w. If
w is relatively small, and thus enough kmers are sampled, then the kmer index reaches full
accuracy, and we refer to this sampling as hard sampling. If w is large, then fewer kmers
are sampled, then the index does not guarantee full accuracy, and we refer to this sampling
as soft sampling. In the current literature, no systematic study has been done to compare
the different sampling methods and their relative benefits/weakness.
In this dissertation, we study the problem of trying to find the best sampling strategy to
create simultaneously efficient and accurate kmer indexes. We study kmer indexes in the
context of finding all M EM s and all HSLAs between a query sequence and a database of
sequences, which are our two primary motivating applications.
13
2.1
Fixed sampling versus minimizer sampling
Two sampling methods are commonly used to build kmer indexes: fixed sampling and
minimizer sampling. No previous work has carefully compared these two sampling methods
and their relative benefits/weakness.
Both Roberts et al. [75] and Morgulis et al. [56] use sampling to reduce the size of kmer
indexes with application to find HSLAs. In 2004, Roberts et al. were the first to propose
minimizer sampling to reduce the size of a kmer index (in particular the kmer lists) to find
HSLAs. Assuming the data is random, Roberts et al. computed the expected theoretical
reduction in kmer lists size to be 2/(w + 1) compared to saving complete lists. In 2008,
Morgulis et al. proposed indexed megaBLAST to accelerate searching for HSLAs. Morgulis
et al. used fixed sampling to build kmer indexes and reported the reduction in list size to be
1/w, compared to saving all complete lists. Morgulis et al. did not conduct any comparisons
between fixed and minimizer sampling.
Finding M EM s is the first step in finding local alignments. In seedandextend searching
methods, shared seeds, which are exact matches, are extended to M EM s, and then to local
alignments. In comparative genomics studies, M EM s define anchor points for comparing
different genomes, In a recent paper, M EM s are also used to map long NGS reads against
a genome [92]. Although suffix trees have been the traditional data structure of choice when
searching for M EM s [41, 1, 37, 91], Khiste and Ilie recently showed that kmer indexes
outperform suffix trees when searching for M EM s in large genomes when the M EM size
is relatively large [38]. Specifically, Khiste and Ilie use fixed sampling to build a memoryefficient kmer index to search for M EM s. Although minimizer sampling has been actively
used as an alternative way to reduce kmer index size in other studies and allows for sampling
14
kmers from a query sequence, Khiste and Ilie do not compare fixed sampling to minimizer
sampling.
Roberts et al. [75] defined low complexity regions as one possible complication for minimizer sampling. Roberts et al. write “If a string contains many consecutive zeros (or As
in the case of genomic data), then several consecutive kmers may be minimizers. While
this is not a major problem, it counteracts our goal of sampling a fraction of the kmers.”
The low complexity regions, as defined by Roberts et al., is one simple type of repeats. In
general, repeats, or repeated sequences, are short substrings that occur multiple times in a
genome. To handle the low complexity regions, Roberts et al. write “In general, we want to
devise our ordering to increase the chance of rare kmers being minimizers.” They propose
strategies to try to accomplish this task. However, it is unclear how effective these schemes
are for dealing with repeats. In particular, it is not clear if the proposed schemes work for
all types or repeats, all numbers of repeats, and all possible choices of k.
Schleimer et al. [80] independently introduced minimizer sampling calling it winnowing
sampling. While winnowing is identical to minimizer, winnowing has been studied and used
for for different problems and domains than minimizer. Minimizer sampling has been used
to sample kmers, called minimizers, to solve the problem of searching for local alignments
using biological datasets while winnowing method has been used to solve the problem of kmer counting and ranking using text documents datasets for document plagiarism detection.
A key difference between how minimizer and winnowing sampling is that minimizer sampling
requires saving sampled kmer positions while winnowing method has not. This is different
areas because in M EM and HSLA search problems the positions define anchor points to
compare sequences whereas in kmer counting and ranking problems only the number of
occurrences is used to estimate the similarity between two documents. Similar to Roberts
15
et al., Schleimer et al. observed that in low complexity regions (or lowentropy strings in
text mining literature), a kmer might occur more than once and all of its occurrences are
sampled. Schleimer et al. suggested a simple fix to this problem by not sampling duplicate
kmers in both the indexing and the querying phase and call this version robust winnowing. If
it is desired to find positional information, in robust winnowing, we should inspect all the w
adjacent kmers to find the correct matching positions. In the context of kmer counting and
plagiarism detection, Schleimer et al. did not explore more winnowing and robust winnowing
methods. Also, Schleimer et al. did not compare the performance of winnowing and robust
winnowing.
The kmer sampling method has been used in sequence assembly. Ye et al. [101] proposed
using sampled kmers, instead of all kmers, to reduce the memory requirements for De Bruijn
graph (DBG) based assemblers. Ye et al. used fixed sampling to sample kmers; the penalty
is that links or edges between kmers are longer and slightly more complex. Ye et al. report
that fixed sampling with step w reduces their dictionary by roughly 1/w compared to tools
that use a full list of kmers [102, 83, 45]. Ye et al. note the existence of minimizer sampling
and express interest in comparing minimizer sampling to fixed sampling in future work but
did not compare the two in their work. In genome assembly, we only use the dictionary
(the list of kmers); we do not use the lists of kmer occurrences. In other applications,
where we use the lists of kmer occurrences, the size of these lists is the dominant factor
in index size. Therefore it is important to understand how fixed and minimizer sampling
affect both the number of kmers and the number of kmer occurrences. Li et al. [48] and
Movahedi et al. [59] both proposed diskbased DBG assemblers to avoid loading the whole
graph into RAM. They load small segments of the graph incrementally and complete the
assembly in this fashion. Since completing the assembly requires identifying adjacent exact
16
matches, both papers use minimizer sampling as a hashing mechanism, to find adjacent exact
matches and group them into the same segment. We do not focus on these applications for
two reasons. First, our goal is to focus on applications that use kmer indexes in RAM,
which means that index size is critical. Second, minimizer sampling, in the above context,
is used as a hashing function to minimize disk I/O operations rather than reducing the list
of kmers.
The kmer sampling method has been used to solve other problems in bioinformatics. In
the kmer counting problem, the task is to build a histogram of occurrences of every kmer
in a given dataset where k is relatively large (k > 20), and it is infeasible to list all kmers in
RAM. Similar to diskbased DBG assemblers, minimizer sampling is used to select mmers
(m < k) from every kmer. These mmers are later used to reduce disk I/O operations in
diskbased counting kmers tools such as MSPKmerCounter [47] and KMC2 [21]. Again,
this problem is significantly different than our motivating problems which are searching for
M EM s and HSLAs. In the M EM and HSLA search problems, the location of sampled
kmers is important.
In metagenomic sequence classification, Kraken program [98] uses the idea of minimizer to
accelerate the classification process in large data sets. Kraken starts with creating a database
that contains entries of a Lmer and the lowest common ancestor LCA of all organisms whose
genomes contain that Lmer. When a query sequence is given, Kraken searches the database
for each Lmer in a sequence, and then using the resulting set of LCA taxa to determine an
appropriate label for the sequence. To finding all Lmers effectively, Kraken builds a kmer
index, (k < L) where each kmer is associated with all Lmers containing this kmer as a its
minimizer. Since a simple lexicographical ordering of kmers can be biased to sample more
minimizers over lowcomplexity regains, Kraken uses the exclusiveor (XOR) operation to
17
scrambles the standard ordering of each kmer’s before comparing the kmers to each other
using lexicographical ordering.
2.2
Hard fixed versus soft fixed sampling
To find HSLAs, we use a sampled kmer index with sampling step w ≥ 1. We focus on
HSLAs that have at least one M EM with a minimum of length L. In chapter 4, we discuss
in detail how to compute the right value of L to find all desired HSLAs.
Most previous studies of sampled kmer indexes have focused on hard fixed sampling (or
hard sampling for short) with limited study of soft fixed sampling (or soft sampling for short)
and thus have not studied the effect of choosing a large sampling step w on index accuracy,
query time, or false positive rate. For example, Morgulis et al. [56] built Indexed BLAST,
which uses w = L − k + 1 and supports k values up to 15. Ning et al. [66] built the index in
SSAHA with k = 1/2(L + 1) and w = k. Morgulis et al. and Ning et al. use hard sampling
to find the desired HSLAs. In both cases, the value of w is small enough such that for each
M EM of length L, there is a sampled kmer and thus we can find all M EM s and thus all
HSLAs.
Kent [36] has performed the main previous study of soft sampling. Kent developed
an analytical model for estimating the likelihood of retaining matches and creating false
positives for a variety of indexed search strategies. These include searching with one kmer,
two nearby small kmers, and one large kmer with one allowed error. In all cases, he built
a soft sampled kmer index where w = k. Kent computed the best choice of k such that the
expected accuracy to find all HSLAs was above a given threshold and the number of shared
kmers that did not lead to HSLAs were as small as possible.
18
Kent’s work differs from ours in several key ways. First, we consider only k = 12 so
that we can use BLAST to perform our searches, whereas Kent considered multiple k values.
Second, we consider a wide range of w values, whereas Kent only considered w = k. Thus,
Kent’s work does not allow a true study of the effect of w on index performance since, in
his work, k is always changing in addition to w.dataset We extend Kent’s analytical model
to work with our choices of k < L and w and we call this new model the BLAST model.
We compare our empirical accuracy with the predicted accuracies of both Kent’s original
model and our BLAST model. Our results show that our BLAST model is reasonably
accurate in predictingHSLA retention rate. On the other hand, Kent’s model significantly
underestimates HSLA retention rate in our experiments with the human genome. This is
expected since Kent’s model is not designed to handle the case when k < L.
2.3
Soft fixed sampling in EST mapping
We apply soft fixed sampling to the problem of EST mapping on a genome, which builds
upon finding HSLAs. Mapping ESTs on a genome is a fundamental procedure in genome
research. These mappings are used to discover the intronexon structure of genes, SNPs, and
cDNA insertions and deletions, to name just a few applications. Many different mapping
tools are available, each with their own advantages [31]. We focus on kmer–based mappers
such as mrFAST/mrsFAST [30, 3], SHRiMP [77], Hobbes [2], drFAST [33], and RazerS [96].
These mappers are typically fully sensitive mappers that “can detect reads missed by other
tools” [31] but may be relatively slow. In all these mappers, only hard sampling is used when
building a kmer index. We study whether soft sampling kmer indexes might increase the
speed of these mappers with relatively little loss in sensitivity when working with the human
19
genome as our database. Specifically, we assess whether the correct mapping is retained
when we use softsampled kmer indexes to complete the first stage of finding HSLAs. We
measure the effect of sampling on both the index size and the query time. We only simulate
the mapping process because we want our results to be general and independent from the
details of the final mapping process of a mapper.
Finally, Xin et al. [100] proposed two general techniques to accelerate kmer based mappers. The first technique is to use the set of adjacent kmers as supporting evidence for the
existence of a true match. The second is to use shared infrequent kmers to select the best
mapping location. Similar to other studies, they only used w = k while evaluating these
techniques. In contrast, we test a broader range of w values and demonstrate that using
a larger w greatly reduces query time and index size while suffering only a small loss of
sensitivity.
20
Chapter 3
Fixed versus minimizer sampling
Searching for similar sequences is a critical step in many bioinformatics applications and
pipelines, including identifying homologues, sequence classifications, sequence mapping, and
sequence assembly. This search process is often sped up by finding shared exact matches
(EM ) of a minimum length L. The value of L is usually set to ensure all the desired similar
sequences are recognized. Depending on the application, the shared EM s can be extended
to longer EM s [98]. In some applications, the EM s are extended to maximal exact matches
(M EM s) [91, 38], and M EM s are often extended further to local alignments by allowing
mismatches and/or gaps [92, 56, 75, 36].
Finding EM s, M EM s, and local alignments is often sped up using kmer indexes (k <
L). One of the biggest problems with using kmer indexes is that the size of the index is
significantly larger than the underlying database/ datasets. One of the most effective and
widely used ways of mitigating kmer index size and query time is to perform sampling, in
which we omit some kmer occurrences from the index. A sampling strategy can be classified
based on how it chooses kmers. There are two major ways to choose kmers: fixed sampling
and minimizer sampling.
It has widely been assumed that fixed sampling produces smaller indexes and has smaller
index construction times than minimizer sampling but that minimizer sampling leads to
faster query processing than fixed sampling. However, these beliefs have never been empiri
21
cally verified. In fact, few studies have empirically tested either method on its own [75, 56].
In this chapter, we fill this gap by systematically evaluating and comparing fixed sampling
and minimizer sampling to assess how these methods perform with respect to index construction time, index size, and query processing time. Specifically, we compare and contrast
the construction time, index size, and query processing time required to find all MEMs using
both fixed and minimizer sampling.
We start by formalizing the problem of finding M EM s between two sequences. Then,
we illustrate how kmer indexes are used to find M EM s. Next, we formally describe the
two kmers sampling methods: fixed sampling and minimize sampling. We highlight the
key similarities and differences between these two sampling methods. Finally, we set the
comparison framework and conclude with the comparison results.
3.1
The M EM enumeration problem
Let Σ be a finite ordered alphabet. We focus on the alphabet for nucleotide databases
Σ = {A, C, G, T }. Let s be a string over Σ of length s. We use s[i] to denote the character
at position i in s, for 0 ≤ i < s. We use s[0] to denote the first character in string s. We use
the ordered pair s(i, j) to denote the substring in s starting with the character at position
i and ending with the character at position j for 0 ≤ i < j < s. We note that substring
s(i, j) is also denoted as s[i..j] in some papers, but we only use s(i, j) in this chapter.
Definition 1 Exact Match (EM ) For any two strings s1 and s2 , a pair of substrings
(s1 (i1 , j1 ), s2 (i2 , j2 )) is an exact match if and only if s1 (i1 , j1 ) = s2 (i2 , j2 ). Also, a substring
ω is an exact match if there is an exact match pair (s1 (i1 , j1 ), s2 (i2 , j2 )) such that ω =
22
s1 (i1 , j1 ). The length of an exact match is j1 − i1 + 1.
Definition 2 Maximal Exact Match (M EM ) An exact match (s1 (i1 , j1 ), s2 (i2 , j2 )) is
called maximal if s1 [i1 − 1] 6= s2 [i2 − 1] and s1 [j1 + 1] 6= s2 [j2 + 1]. Also, a substring ω is a
maximal exact match if there is a maximal exact match pair (s1 (i1 , j1 ), s2 (i2 , j2 )) such that
ω = s1 (i1 , j1 ).
We now formalize the problem of finding M EM s between two datasets.
Definition 3 M EM Enumeration Problem Given two datasets of sequences D1 and
D2 and an integer L, the M EM enumeration problem is to find the set of all M EM s of
length at least L between all sequences in D1 and all sequences in D2. We denote this set as
M EM (D1 , D2 , L). We use M EM (L) if D1 and D2 are clear from the context.
We illustrate many of these and later definitions using the following example where
D1 = {s1 } and D2 = {s2 } and s1 and s2 are as follows:
s1 = GT AC T AGG CT A CT A GGGG with length s1  = 18
s2 = GT AC A AGG CT A CT A CT A T T T T with length s2  = 21
The two string s1 and s2 have two M EM s of length at least 6: AGGCT ACT A =
(s1 (5, 13), s2 (5, 13)) with length 9 and CT ACT A = (s1 (8, 13), s2 (11, 16)) with length 6.
Thus, M EM ({s1 }, {s2 }, 6) = {(s1 (5, 13), s2 (5, 13)), (s1 (8, 13), s2 (11, 16))} whereas
M EM ({s1 }, {s2 }, 8) = {(s1 (5, 13), s2 (5, 13))}.
In this study, we focus on finding M EM s of a minimum length L between a query
sequence and a database of sequences because it is a critical step in searching for local
alignments with tools such as NCBI BLAST.
23
3.2
Using kmer indexes to find M EM s
Finding M EM s with a minimum length L is often sped up using a kmer index, k ≤ L, at
the cost of additional space. A kmer index supports quickly finding EM s of length k which
are also known as shared kmers. We typically extend these shared kmers in two stages.
In the first stage, we try to extend every shared kmer into an EM of length L ≥ k. If the
first stage is successful, we then further extend the match into an M EM . In the context of
searching for local alignments, every M EM is extended even further by allowing mismatches
and/or gaps.
It is possible to skip the first extension step and build an Lmer index to find all shared
Lmers. However, due to technical limitations and the huge memory requirements necessary
for building an Lmer index for L > 32, it is common to build the index using k ≤ 32 < L
[38, 98, 56, 75].
We typically work with a kmer index as follows. We save the list of kmers present in
the database and we refer to this list as dictionary. For each saved kmer, we save some of its
occurrences into a list. When given a query sequence, we extract kmers from the query, see
if that kmer appears in the dictionary, and find the corresponding shared kmers by using
the stored list of occurrences.
We describe this search process more precisely as follows.
Definition 4 (kmer and kmer occurrence) Consider any length k substring s(j − k +
1, j) of string s where k − 1 ≤ j ≤ s − 1. We call that substring a kmer and more concisely
represent this kmer occurrence using the ordered pair (s, j).
Definition 5 (Shared kmers and shared kmer occurrences) Consider any two strings
s1 and s2 that have an exact match (s1 (i1 , j1 ), s2 (i2 , j2 )) of length k. We call the common
24
substring s1 (i1 , j1 ) (equivalently s2 (i2 , jw )) a shared kmer and more concisely represent the
corresponding shared kmer occurrence using the quadruple (s1 , j1 , s2 , j2 ).
Any string s of length s contains exactly s − k + 1 kmer occurrences. Using our
previous example with k = 3, s1 and s2 have 16 and 19 3mer occurrences, respectively.
Furthermore, s1 and s2 have exactly seven shared 3mers: ACT , AGG, CT A, GCT , GGC,
GT A, and T AC. These shared 3mers result in 24 different shared 3mer occurrences as
follows. GCT , GGC, and GT A appear exactly once in s1 and s2 , and thus each of them
has exactly one shared 3mer occurrence: (s1 , 9, s2 , 9), (s1 , 8, s2 , 8), and (s1 , 2, s2 , 2). The
3mer AGG occurs 2 times in s1 and 1 time in s2 , and thus AGG is part of two different
shared 3mer occurrences: (s1 , 5, s2 , 7) and (s1 , 15, s2 , 7). Since shared 3mer ACT occurs 2
times in both s1 and in s2 , the shared 3mer ACT is part of 2 × 2 different shared 3mer
occurrences: (s1 , 4, s2 , 12), (s1 , 4, s2 , 15), (s1 , 12, s2 , 12), and (s1 , 12, s2 , 15). Similarly, T AC
occurs 2 times in s1 and 3 times in s2 , so shared 3mer T AC is part of 3 × 2 different shared
3mer occurrences: (s1 , 3, s2 , 3), (s1 , 3, s2 , 11), (s1 , 3, s2 , 14), (s1 , 11, s2 , 3), (s1 , 11, s2 , 11),
and (s1 , 11, s2 , 14). Finally, CT A occurs 3 times in s1 and 3 times in s2 , leading to 3 × 3
different shared 3mer occurrences: (s1 , 5, s2 , 10), (s1 , 5, s2 , 13), (s1 , 5, s2 , 16), (s1 , 10, s2 , 10),
(s1 , 10, s2 , 13), (s1 , 10, s2 , 16), (s1 , 13, s2 , 10), (s1 , 13, s2 , 13), and (s1 , 13, s2 , 16).
Since k ≤ L, it is possible that a shared kmer occurrence is not part of an M EM of
length at least L; we call such a shared kmer occurrence a false positive. In general, decreasing the value of k increases the chance that a shared kmer occurrence is a false positive.
Using the above 24 shared 3mers occurrences and assuming L = 6, the M EM =
25
AGGCT ACT A can be found by extending any of the following seven 3mer occurrences:
(s1 , 7, s2 , 7), (s1 , 8, s2 , 8), (s1 , 9, s2 , 9), (s1 , 10, s2 , 10), (s1 , 11, s2 , 11), (s1 , 12, s2 , 12), or
(s1 , 13, s2 , 13). Similarly the M EM = CT ACT A can be found by extending any of the following four 3mer occurrences: (s1 , 10, s2 , 13), (s1 , 11, s2 , 14), (s1 , 12, s2 , 15), or (s1 , 13, s2 , 16).
The remaining twelve shared 3mer occurrences are false positives. If L = 8, then the seven
shared 3mer occurrences that can be extended to AGGCT ACT A are not false positives.
The remaining sixteen shared 3mer occurrences are false positives.
Every EM of length L has L − k + 1 shared kmer occurrences. Finding and extending
one of these shared kmer occurrences is sufficient for finding that EM . Therefore, when
building kmer indexes, we can store a sampled subset of kmer occurrences in the index
and still find every possible M EM of length at least L. With sampling, we not only reduce
the index’s memory requirements, we also reduce query time by not discovering the same
M EM multiple times. Sampling, therefore, is a very effective method for improving a kmer
index’s efficiency (reducing construction time, space, and query time).
3.3
Fixed sampling versus minimizer sampling
In bioinformatics, two sampling methods are commonly used to build kmer indexes: fixed
sampling [38, 56] and minimizer sampling [75, 98]. To ensure that a kmer index achieves
100% sensitivity which means that it finds all shared M EM s of length at least L, both
methods ensure that within every M EM of length at least L, at least one kmer occurrence
is saved to the index. We now define both sampling methods comparing and contrasting
their relative strengths and weaknesses.
26
Fixed sampling is simple greedy sampling strategy that minimizes the number of kmer
occurrences stored in the index. The goal is to ensure we choose one complete kmer from
every possible substring of length L from each database sequence s. For example, we must
choose one kmer from s(0, L − 1) to store in the index; we greedily choose the kmer that
ends at s[L − 1] since it not only covers this substring but also the next L − k − 1 substrings
up to but not including s(L − k + 1, 2L − k). To cover that substring s(L − k + 1, 2L − k),
we again greedily choose the kmer that ends at s[2L − k] since it again covers the next
L − k − 1 substrings. In general, the jth kmer occurrence that we sample ends at position
L − 1 + (j − 1)w where w = L − k + 1. We typically refer to w = L − k + 1 as our sampling
step or sampling window for fixed sampling. During the query phase, we extract every kmer from the query sequence q to search for shared kmer occurrences. Since every kmer is
extracted from q, if s and q have an M EM of length at least L, then some shared kmer from
that M EM will be in the kmer index and the M EM can be recovered. Fixed sampling
has several advantages. First, it is very fast to construct the index. Second, it stores the
minimum possible number of kmer occurrences in the index to guarantee 100% sensitivity
and thus minimizes index size. The disadvantage is that all kmers from the query sequence
need to be processed which may slow query time.
Minimizer sampling uses a more sophisticated sampling strategy that allows sampling
of both the database sequence s and the query sequence q. We must again choose one kmer from s(0, L − 1) to store in the index. This time, we choose to store the minimum
kmer from s(0, L − 1) in our index where we order substrings in some canonical order.
For simplicity, one can use the alphabetical order where A < C < T < G which implies
AAA < AGA < AGG < T AA. To improve performance and increase the probability that
rare kmers become minimizers, different orderings are often used. For example, Roberts et
27
al. proposed using C < A < T < G in odd numbered bases and the reverse ordering in evennumbered bases [75]. Alternatively, Wood et al. [98] suggest using the exclusiveor (XOR)
operation to scramble the standard ordering of each kmer before comparing the kmers
to each other using lexicographical ordering. Once an ordering is established, we process
each length L substring of s (equivalently each window of w = L − k + 1 kmers) in turn
storing the minimum kmer occurrence in the index. The first view focusing on substrings
of length L seems more intuitive; the second view focusing on windows of w kmers is useful
when predicting the expected size reduction from using minimizer sampling. We note a few
things. First, there may be multiple occurrences of a minimum kmer within a length L
substring; in this case for the original minimizer algorithm but not the one we focus on in
this chapter, each occurrence is stored in the index. Second, minimizer sampling is likely to
store more occurrences than fixed sampling as it does not maximize the distance between
kmer occurrences stored in the index. Third, the time to construct the index is greater as
more substrings need to be considered. The advantage that minimizer has comes at query
time. Rather than choosing all kmers from query q, it does the same sampling. That is,
we consider every substring of length L and choose only the minimum kmers from within
each length L substring to consider for extension. Since both the query and each database
string extract the minimum kmer(s) from every substring of length L, if there is an EM
of length L, the same minimum kmer will be extracted and then extended into the EM
and then M EM . In summary, minimizer sampling requires more time to construct its index
and builds a larger index than fixed sampling, but it processes fewer kmers from the query
sequence and thus may have faster query processing times.
We illustrate the two algorithms using our previous example where we use k = 3 and
L = 8 so w = L − k + 1 = 6, and we use D1 = {s1 } as our database and D2 = {s2 } as our
28
query dataset. The goal is to return M EM ({s1 }, {s2 }, 8). With fixed sampling, we store
the 3mer AGG with its occurrence (s1 , 7) and the 3mer CT A and its occurrence (s1 , 13)
in the 3mer index.
All 19 3mers from s2 are extracted with both AGG and CT A being shared 3mers.
Since CT A occurs three times in s2 , we consider four shared 3mer occurrences for extension: (s1 , 7, s2 , 7), (s1 , 13, s2 , 10), (s1 , 13, s2 , 13), (s1 , 13, s2 , 16). The first and third can be
extended to the same M EM of length 9 whereas the other two cannot be extended to a
length 8 M EM and thus are false positives.
With minimizer sampling, we store three 3mer occurrences to the index, two with ACT
and one with AGG: (s1 , 4), (s1 , 7), and (s1 , 12). Minimizer sampling is also applied to the
query s2 and three 3mer occurrences are chosen to test for extension, two with ACT and
one with AAG: (s2 , 7), (s2 , 12), and (s2 , 15). The only shared 3mer is ACT , and since ACT
appears twice in both sequences, we consider four shared 3mer occurrences for extension:
(s1 , 4, s2 , 12), (s1 , 4, s2 , 15), (s1 , 12, s2 , 12), and (s1 , 12, s2 , 15). Only (s1 , 12, s2 , 12) can be
extended to an M EM of length at least 8; the other three shared kmer occurrences are
false positives.
Schleimer et al. [80] independently introduced minimizer sampling as winnowing sampling. While winnowing is identical to minimizer, winnowing has been studied and used
for different problems and domains than minimizer. As in [75], Schleimer observed that in
low complexity regions, a kmer might occur more than once and all of its occurrences are
sampled. Schleimer et al. suggested a simple fix to this problem which they called robust
winnowing where they do not sample duplicate kmers. The idea is that for each window,
we select the minimizer. If there is more than one occurrence of the minimizer, we select
the same occurrences as the previous window. If not, we select the rightmost minimizer
29
occurrence. This results in a possible loss of the correct minimizer occurrence match. They
suggested looking at all w adjacent minimizers to find the correct match.
As we described, there are two possible optimizations that can be used to improve the
performance of minimizer sampling. The first one is to avoid using lexicographical ordering.
The second one is not to sample duplicate minimizers. In this chapter, we study the effects of
each optimization individually and combined. To avoid lexicographical ordering, we use the
randomization method suggested by Wood [98]. To prevent sampling duplicate minimizers,
we use the robust winnowing method proposed by Schleimer [80], but only apply robust
winnowing to the index, not to the query sequences. We sample all minimizers from a query
sequence window. In this scenario, the correct minimizer occurrence matches are guaranteed
to be found. We finally apply both methods together to test the effectiveness of using both
optimizations simultaneously.
We represent nucleotides using two bits and store kmers for k ≤ 32 in a 64bit block.
All indexes are saved as hash tables where a key is a kmer and its value is a pointer to that
kmer’s list of occurrences. We store each kmer’s list of occurrences in a set data structure;
each occurrence is an ordered pair of 64bit positive integers (s, j) where s is a sequence ID
and j is the ending position of this kmer occurrence in s.
3.3.1
Sampling methods
We compare fixed sampling and minimizer sampling. Minimizer sampling can be improved
using two different optimizations: randomized ordering and duplicate minimizer removal.
We list all possible combinations of our two optimizations for minimizer sampling in the
following table.
Using both optimizations will result in the most effective minimizer method minrand,one .
30
Table 3.1: Possible minimizer sampling versions based on the optimization techniques.
Duplicate handling \Ordering schema Lexicographical Randomized
Sample all duplicate minimizers
minlex,many
minrand,many
Remove duplicate minimizers
minlex,one
minrand,one
Thus, we compare minrand,one with fixed sampling (f ix) to test the effectiveness of the
two major sampling methods. We compare minlex,one with min minrand,one to determine
how much effect the randomization optimization has, and we compare minrand,many with
min minrand,one to determine how effective duplicate removal is. We will not consider
minlex,many in any comparison since it is the worst version of minimizer and known to
be inefficient. Next, we formally describe each sampling method. Note that we choose
parameters that ensure we achieve 100% sensitivity which means we will find all M EM s.
In fixed sampling (f ix), as we described earlier, we build the kmer index for a database
of sequences by sampling from every database sequence s the kmer occurrences ending at
positions L − 1 + (j − 1)w where w = L − k + 1 and 0 ≤ j ≤ b(s − L + 1)/wc. We refer to
w = L−k +1 as our sampling step. During the query phase, we extract all kmer occurrences
from each query sequence q and consider them for extension. Since every kmer is extracted
from q, if a database sequence s and q have an M EM of length at least L, then some shared
kmer from that M EM will be in the kmer index and the M EM can be recovered.
In standard minimizer sampling without any optimization (minlex,many ), for every substring of length L in a database sequence s, we store the minimum kmer occurrence in
our index; if there is more than one minimum kmer within any length L substring, all
minimum kmer occurrences are stored. We use the normal lexicographical ordering where
A < C < T < G to define minimum kmers in our work. Unlike fixed sampling, during the
31
query phase, we use the same sampling for a query sequence q. That is, we consider only the
minimum kmers from each substring of length L in q for extension. Since both the query
and each database string extract the minimum kmer(s) from every substring of length L, if
there is an EM of length L, the same minimum kmer will be extracted and then extended
into the EM and then M EM .
We find minimum kmers from s and q using a sliding window approach where we use the
minimum kmer from the previous window to speed up the search for the minimum kmer(s)
from the new window. For the first substring s(0, L − 1) or q(0, L − 1), we must examine all
L − k + 1 kmers and choose the minimum one(s). We store the rightmost minimum kmer
and its position as our current minimum kmer. If the current minimum kmer belongs
to the next window, then we can find the minimum kmer for the next window by simply
comparing the new kmer in that window with the current minimum kmer. If the new
kmer is no larger than the current minimum kmer, we update the current minimum kmer
to be the new kmer. Otherwise, the new current minimum kmer is the same as the old
minimum kmer. If the current minimum kmer is not part of the next window, then we
must again examine all L − k + 1 kmers and choose the minimum one(s).
Now, we describe how to apply optimizations to reduce the number of kmers sampled
from the database which leads to a significant speedup of query time. The first optimization
is to use randomized ordering instead of lexicographical ordering. To do this, we first create
a random kmer mask by uniformly selecting k letters from A, C, G, or T in each positions.
We then view a kmer as a 2k bit string. For any kmer or equivalently 2k bit string, we
create a new scrambled 2k bit string by doing an exclusiveor (XOR) operation between
every bit of the 2k bit string and the 2k bit mask. We then sort all the scrambled 2k bit
strings to identify a minimizer. For example, the bit string for the 4mer AACC is 00000101
32
using lexicographical ordering, where A = 00, C = 01, G = 10, and T = 11. Let the random
4mer mask be CGAT which is equivalent to the 8bit string 01100011. After applying the
XOR operation between AACC (00000101) and CGAT (01100011), the resulting scrambled
new bit string for AACC is 01100110. We refer to minimizer sampling that uses randomized
ordering as minrand .
The second optimization prevents sampling duplicate minimizers in the indexing phase.
There are two occasions where standard minimizer stores duplicate minimizers in the index.
The first occurs when we the current minimizer is not part of the next window and we
must examine all w = L − k + 1 kmers in that window. If we find multiple minimizers,
all are stored in the index using the standard minimizer sampling strategy. To apply the
duplicate removal optimization, we store only the rightmost minimizer in the index. The
second possibility for storing duplicate minimizers occurs when the current minimizer for
the previous window still lies within the next window and is identical to the one new kmer
for that window. In this scenario, to remove duplicates, we do not store this duplicate copy
at this time in the index. However, we do track its position so that if no new minimizer
is found before the current minimizer moves out of the current window, we can use this
kmer to replace the current minimizer at that time and still do only one comparison for
that window. At that time, we would have to store this minimizer in the index if it is the
minimizer of that window. We refer to minimizer sampling that uses duplicate removal as
minone .
3.3.2
Indexing and querying
In the indexing phase, we create a kmer index for a given database and sampling method as
follows. We sample kmers and their occurrences from each sequence based on the selected
33
sampling method (f ix, minrand,one , minrand,many , and minlex,one ). We save the sampled
kmers into the index dictionary, and for each kmer occurrence, we update the corresponding
list of kmer occurrences.
We then proceed to the query phase where we sequentially process each query sequence.
If we use fixed sampling (f ix), we extract all kmer occurrences from query sequence q. For
all minimizer methods minrand,one , minrand,many , and minlex,one , we extract the minimum
kmer occurrences including duplicates from each window in q.
Once we extract the kmer occurrences from q, we use the index to find shared kmer
occurrences and then M EM s as follows. For every kmer occurrence in q, we check if the
kmer is in the index dictionary. If the kmer is found, then we use the kmer’s associated list
of occurrences to find all shared kmer occurrences between q and the database of sequences.
We perform this search in a manner similar to NCBI BLAST with the goal of minimizing
the number of database read operations. Specifically, we group the shared kmer occurrences
between q and DB by database sequence ID s. For the list of kmer occurrences shared
between q and s, we sort them in alphabetical order and then positional order. We store
this information in a hash table with key s where the hash table entries are pointers to the
sorted lists of shared kmer occurrences. We then read in each relevant database sequence s
exactly once and process all the corresponding shared kmers in alphabetical order of kmer.
For every query sequence q and a database sequence s, we report any shared kmer
occurrence that can be extended to length at least L as an M EM . Our extension method
is similar to that of Khiste and Ilie [38]. Before we try to extend a shared kmer occurrence,
we first check if it is contained within our list of discovered M EM s which is, of course,
initially empty. If so, we skip this shared kmer occurrence and move on to the next one. If
not, then we try to extend the kmer in both directions to see if it is part of an M EM with
34
length at least L. If the extension succeeds, we add the new M EM to our list of discovered
M EM s. If the extension fails, we report this shared kmer occurrence as a false positive.
This ensures we only extend one shared kmer occurrence within any M EM .
We check if a shared kmer occurrence is part of a discovered M EM using the following
properties. A shared kmer occurrence (q, jq0 , s, js0 ) is part of a shared M EM M EM =
(q(iq , jq ), s(is , js )) if the following conditions hold: (1) iq ≤ jq0 − k + 1 < jq0 ≤ jq , (2),
is ≤ js0 − k + 1 < js0 ≤ js , and (3) (jq0 − k + 1) − iq = (js0 − k + 1) − is . Checking these
conditions can be done in constant time per discovered M EM , and typically the number of
discovered M EM s per pair of sequences q and S is small, so this verification step typically
takes constant time.
3.3.3
Experimental setting and evaluation metics
3.3.3.1
Database and query sets
We consider only nucleotide datasets. Our database is the human genome. We use three
query sets: the mouse genome, the chimp genome and an NGS dataset. All the datasets are
publicly available. The genome datasets can be downloaded from UCSC
(http://hgdownload.cse.ucsc.edu). The NGS dataset can be downloaded from Sequence
Read Archive (SRA) on the NCBI website (https://www.ncbi.nlm.nih.gov/sra) Table 4.1
describes each dataset used. According to Koning et al. [20], two third of the human genome
consists of repetitive sequences. It is also known that the mouse genome contains many
repeats too. [56, 58]. It is unclear if this is the case for the chimp and NGS datasets.
Performing the queries using each query set directly would require large amounts of
computing power, memory and time. To support parallelization of queries on MSU’s High
35
Performance Computing Cluster, we preprocess all the datasets as follows. We first divide
every sequence into nonoverlapping sequences of length 1000. The number of resulting
sequences is show in Table 4.1. This allows us to run our experiments in parallel and to
ensure that the set of shared kmers and M EM s can be saved in RAM. We also only save
letters in {A, C, G, T }; that is, ambiguous characters are removed. For each preprocessed
query set, we partition the set into 1000 query sets of equal size (except the last set my be
slightly smaller). We recognize that we may not be able to find M EM s that extend across
the preprocessed sequences, but this should not significantly change our results.
Table 3.2: datasets used for testing.
Datasets
Homo sapiens (Human)
Mus musculus (Mouse)
Pan troglodytes (Chimp)
SRA:SRR003161 (NGS)
Size (Mbp)
3137
2731
3218
788.5
#Seq.
93
66
24,132
1,376,701
Type
Database
Query set
Query set
Query set
# Processed Seq.
2,897,341
5,306
5,818
2,792
The database set is only one large set. The query sets are partitioned into 1000 small query
sets where each small set has the indicated number of processed sequences (except the last
set may have fewer sequences). The processed sequences are of length 1000 (except the last
processed sequence for every sequence may be shorter).
For each of our preprocessed query sets, we compute the actual number of MEMs between
that query set and the preprocessed human genome for both choices of L. These results are
shown in Table 3.3.
Table 3.3: The number of M EM s for each query set and the human genome for
both choices of L given our preprocessing into sequences of length 1000.
L
50
100
Mouse
Chimp
NGS
838,857,328 2,077,183,744 940,731
428,609
101,868,611
457,512
36
3.3.3.2
Index parameters and metrics
We study the impact of the sampling methods on the kmer index creation phase. We
consider the following sampling methods f ix, minrand,one , minlex,one , and minrand,many .
We use the index to find all M EM s of length at least L where L ∈ {50, 100}. For each
sampling method, we create set of indexes for k ∈ [12, 32]. We consider L = 50 and L = 100,
because it frequently used in biological applications that compare mouse and chimp against
human genome [91, 38] or map the NGS dataset against human genome [92, 36].
For each index, we report the dictionary size, lists size and the total index size which
is the sum of dictionary and lists sizes. The dictionary size is measured by counting the
number of kmers. The lists size is measured by counting the number of kmers occurrences
in all lists. We also report the index construction time.
3.3.3.3
Querying parameters and metrics
The index is used to find all M EM s of length at least L where L ∈ {50, 100}. For L and for
each sampling method, we used set of kmer indexes where k ∈ {12, 16, 20, 24, 28, 32}. The
total number of indexes considered is 2 × 3 × 6 = 48 indexes. All the indexes give the same
final results, namely all M EM s of length at least L.
For each query phase, we report the time and the number of “false positives”. The
number of false positives for a query set is the number of shared kmer occurrences that
failed to be extended to a M EM of length at least L. Recall that we partition each query
set into 1000 query set partitions. The reported time is the sum of times that an index needs
to answer all queries in all query set partitions. Likewise, the number of false positives for a
query set is the sum of the number false positives for all queries in all query set partitions..
37
3.3.3.4
System specification/configuration
We run the experiments on a cluster that runs the Community Enterprise Operating System
(CentOS) 6.6. The cluster has 24 nodes where each node has two 2.5Ghz 10core Intel Xeon
E52670v2 processors, 256 GB RAM, and 500 GB local disk.
3.4
3.4.1
Results and discussion
Index size and index construction time
Fixed sampling (f ix) produces indexes that are less than half the size of those produced by
all minimizer sampling methods (minrand,one , minrand,many , and minlex,one ) for almost all
choices of k. Likewise, we can construct fixed sampling’s index roughly 3 to 4 times as fast
as we can construct minimizer’s index. We provide full index size and construction time
results in Figure 4.2.
We now explore why fixed sampling produces indexes that are roughly half the size
of indexes produced by minimizer sampling minrand,one . We start with the size of the
occurrence lists. For a fixed value of k, we can accurately predict the size of fixed sampling’s
kmer occurrence lists because 1/w of the total number of kmer occurrences will be sampled.
For minimizer, Roberts et al. showed that for random sequences, the number of minimizers
would be roughly 2/(w + 1) of the total number of kmer occurrences [75]. Basically, each
minimizer would cover roughly half a window of length w rather than a full window of
length w as we get from fixed sampling. Thus, we would expect fixed sampling to produce
occurrence lists that are roughly (w + 1)/2w the size of the occurrence lists produced by
minimizer sampling; that is, the occurrence lists should be just more than half the size.
38
Figure 3.1: The dictionary sizes, lists sizes, and construction times for kmer
indexes built using fixed sampling (f ix) and minimizer sampling minrand,one ,
minrand,many , and minlex,one . For parts (a), (b), and (c), we use L = 50. For parts (d),
(e), and (f), we use L = 100. For all graphs, 12 ≤ k ≤ 32.
Note, Roberts et al. observed that the actual proportion of minimizers in practice can be a
few percent above 2/(w + 1) for several reasons. In our experiments, we see that the number
of sampled occurrences for fixed sampling divided by the number of sampled occurrences for
minimizer sampling actually ranges from 48% to 55% for both L = 50 and L = 100. This
is consistent with the observation of Roberts et al. that the actual proportion of minimizers
can be a few percent above 2/(w + 1).
39
Minimizer sampling minrand,many has essentially identical results to minimizer sampling
minrand,one with respect to the size of occurrence lists; the optimization to remove duplicate
minimizers from a window does not have much effect on the total number of sampled occurrences. Minimizer sampling minlex,one produced indexes that are larger than minimizer
sampling minrand,one with respect to the size of occurrence lists; the optimization to use
randomized ordering, instead of lexicographical ordering, effectively reduces over sampling
the same kmer in regions with many repeats resulting in 15% to 20% reduction for L = 50
and L = 100, respectively.
We now consider the dictionary size. For the smallest values of k that we consider,
mainly 1215, minimizer typically has much smaller dictionaries than fixed sampling. For
k = 12 and L = 100, minimizer’s dictionary is almost 6 times smaller than fixed sampling’s
dictionary. For these small values of k, many of the sampled kmers are chosen many times,
and this is especially true for minimizer which leads to its smaller dictionary. However, for
these k values, because many of the sampled kmers are chosen many times, the dictionaries
are much smaller than the occurrence lists, so fixed sampling still has a total index size that
is roughly half that of minimizer. For example, for k = 12 and L = 50, for fixed sampling,
each kmer in the dictionary appears roughly 6.5 times in the occurrence lists whereas for
minimizer sampling minrand,one , each kmer in the dictionary appears roughly 52 times in
the occurrence lists.
Once we consider k ≥ 16, for fixed sampling, each dictionary consists of mostly unique
kmers. For example, for k = 16 and L = 50, each kmer in fixed sampling’s dictionary
appears roughly 1.23 times in the occurrence lists. For minimizer sampling minrand,one ,
this starts to happen around k = 21. For example for k = 21 and L = 50, each kmer in
minimizer sampling’s dictionary appears roughly 1.24 times in the occurrence lists. By the
40
time k = 32, each dictionary kmer appears less than 1.13 times in the occurrence lists for
both fixed sampling and minimizer sampling. This implies that for large k, the dictionary
size is comparable to the occurrence lists size. Specifically, we see that fixed sampling’s
dictionaries are roughly half the size of minimizer sampling’s dictionaries for k ≥ 21 for
both L = 50 and L = 100. Finally, we note that the dictionary size for minimizer sampling minrand,many is identical to that of minimizer sampling minrand,one as minrand,many
only omits some repeated occurrences for the same kmer. Minimizer sampling minlex,one
produced dictionaries that are larger than minimizer sampling minrand,one , again because
we reduce oversampling the same kmer in regions with many repeats. For example, when
k > 16 the reduction ranges from 12% to 23% for L = 50 and L = 100, respectively.
We note that for all sampling methods, increasing the value of k increases the size of the
index. This is expected, since the sampling step w = L − k + 1 decreases as k increases.
Finally, fixed sampling’s faster construction time is easily explained. First, the number of
sampled occurrences is less than half as many as minimizer sampling. Second, no comparisons
are needed; fixed sampling simply grabs every wth kmer whereas minimizer sampling needs
to consider every kmer and do comparisons to determine if new kmers are minimizers.
However, as we show in later, the reduction has significant impact on reducing query time.
3.4.2
Query time
We first start with our query time results. Our full query time results for each of our three
sampling methods for L = 50 and L = 100 are shown in Tables 3.4 and 3.5.
Our key query time result is that for the same k values and query data with many repeats,
such as in the mouse genome, minrand,one processes queries significantly faster than fixed
sampling, especially for commonly used small k values like 12 and 16 [56, 30, 3, 77, 2, 33,
41
Table 3.4: Query times (in hours) for all sampling methods and all choices of k
when L = 50.
Query set
Mouse
Chimp
NGS
k
12
16
20
24
28
32
12
16
20
24
28
32
12
16
20
24
28
32
f ix
447.00
106.49
52.91
33.31
20.84
13.75
1493.09
661.67
340.00
180.67
98.20
55.25
237.04
88.02
41.75
21.33
11.83
6.77
minrand,one
284.50
36.08
26.96
17.36
15.48
12.43
1294.66
510.99
471.84
184.19
116.06
71.34
197.73
79.71
49.40
22.72
14.98
10.68
minrand,many
437.16
142.22
107.46
58.38
48.77
32.79
1407.43
641.86
859.09
221.33
146.56
79.92
205.70
72.66
47.99
22.21
15.73
9.80
minlex,one
1008.30
204.39
86.13
48.54
28.87
18.58
2340.63
932.27
455.38
205.94
116.46
56.58
486.43
134.30
59.96
26.69
14.38
7.36
96, 98]. For example, when k = 12 and k = 16, minrand,one answer the queries 126.14% and
369.14%, on average, faster than fixed sampling for L = 50 and L = 100, respectively. For
large k, k > 16, minrand,one processes the queries 58.36% and 271.64%, on average, faster
than fixed sampling for L = 50 and L = 100, respectively. When there are only a few repeats
in the query data, such as in the chimp genome and NGS datasets, and for small k values,
minrand,one is 15.15% to 37.65% faster than fixed sampling, on average. On the other hand,
when the value of k > 16, minrand,one is slower than fixed by 7.73% to 19.82%.
While we observe that minimizer sampling process queries faster than fixed sampling for
the same choice of k, we also observe that minimizer sampling uses more space than fixed
sampling for the same choice of k. We will later compare minimizer sampling with fixed
sampling when they are restricted to indexes of the same size to determine which is indeed
42
Table 3.5: Query times (in hours) or all sampling methods and all choices of k
when L = 100.
Query set
Mouse
Chimp
NGS
k
12
16
20
24
28
32
12
16
20
24
28
32
12
16
20
24
28
32
f ix
198.46
41.74
21.31
12.37
7.28
4.92
732.51
276.38
131.61
61.68
33.20
17.78
111.03
33.33
15.58
7.27
4.12
2.32
minrand,one
101.05
5.63
3.59
3.76
2.17
2.16
524.91
203.60
160.61
50.29
41.01
30.91
79.95
27.39
13.93
5.91
5.01
4.47
minrand,many
102.06
7.25
4.52
4.29
2.48
2.35
505.38
203.90
254.88
50.03
40.88
31.53
82.78
28.17
14.47
5.85
5.34
4.43
minlex,one
692.17
142.50
40.38
15.60
7.49
4.89
1744.26
560.79
242.23
98.51
53.60
21.94
364.51
83.38
27.85
10.90
6.08
2.52
faster. When exploring this tradeoff, we find that fixed sampling is faster than minimizer
sampling when both methods have equal sized indexes.
Our next query time result is that increasing k significantly decreases the query processing
time of all methods. For all query sets, increasing k from 12 to 16 reduces the query time of
fixed sampling and minrand,one by roughly 35 times; the one exception is minrand,one with
the mouse genome query set for L = 50 and L = 100 where the reduction is only 7.89% and
17.96% times, respectively. For all query sets and all methods, increasing k by an additive
factor of 4 above 16 roughly halves the method’s query processing time with a couple of
outliers in both directions.
Our final query time result is that the optimization using randomized ordering is significantly more effective than the optimization that removes duplicate minimizers; especially
43
for large L and small k values. For the mouse genome and for k = 12 and k = 16, the
randomized ordering is, on average, 360.46% to 1508.86% faster than lexicographical ordering. For the chimp and NGS datasets and k = 12 and k = 16, the randomized ordering
is, on average, 81.62% to 280.17% faster. On the other hand, duplicate minimizer removal
improves minimizer sampling using the mouse genome by 14.94% to 173.93%, on average,
for k = 12 and k = 16. For the chimp and NGS datasets, duplicate minimizer removal does
not improve minimizer sampling; the one exception is chimp data when L = 100, it gives
17.16% faster query processing for k = 12 and k = 16.
3.4.2.1
The theoretical vs. empirical query time expectation
Because minimizer sampling only tests some kmers extracted from the query sequence to see
if they are shared kmers, one might expect that minimizer sampling would process queries
as much as w times faster than fixed sampling. However, the query time results show this
is not the case; the speedup is typically much less than w and often less than twice as fast.
This is explained by counting the total number of shared kmer occurrences found by both
fixed and minrand,one . These counts are shown in Tables 3.6 and 3.7.
Recall fixed sampling will test all q − k + 1 kmers from a query sequence q. On the
other hand, the expected number of query kmers that minimizer sampling will test is 2(q −
k + 1)/(w + 1) or roughly 2/(w + 1) times smaller if the sequences are generated uniformly
at random [75]. Each tested kmer will generate x shared kmer occurrences where x is
the length of that kmer’s occurrence list in the index. Theoretically, if x = c for fixed
sampling, then we expect that x = 2c for minimizer sampling. Then fixed sampling will test
c(q−k+1) kmers occurrences and minimizer sampling will test 2c(q−k+1)/(w+1) of kmers
occurrences; since minimizer sampling produce lists large by a factor of two. However, this
44
Table 3.6: The number of shared kmer occurrences (in billions) for all sampling
methods and all choices of k when L = 50.
Query set
Mouse
Chimp
NGS
k
12
16
20
24
28
32
12
16
20
24
28
32
12
16
20
24
28
32
f ix
281.90
103.19
55.96
33.28
19.55
11.74
714.46
310.33
155.08
75.52
37.92
22.12
101.81
30.29
12.95
6.41
3.44
2.09
minrand,one
100.58
25.02
20.84
11.33
11.68
9.36
445.52
176.65
255.68
62.58
46.97
27.64
64.62
24.05
17.20
7.24
5.11
3.42
minrand,many
348.92
202.80
161.16
83.47
73.18
47.14
515.02
223.43
520.61
85.37
109.50
45.33
71.33
26.56
20.26
8.24
5.95
3.91
minlex,one
870.72
242.16
101.93
52.93
29.27
16.98
1699.86
601.94
221.16
87.17
45.74
23.79
390.05
86.30
28.24
11.00
5.86
3.06
is not always the case. For example, for q = 1000 and L = 100, then we expect minimizer
sampling to be 95% faster than fixed sampling when 20 ≤ k ≤ 32. For chimp and NGS
datasets, the empirical results show that minimizer sampling is 7.73% to 14.24% slower than
fixed sampling.
To understand the query time, we need to compute the average size of x for kmers that
are in the index dictionary. We show this in Tables 3.8 and 3.9 for each sampling method;
specifically, these tables show the mean and the standard deviation of occurrence list lengths
in each index. For L = 100 and k = 12, the mean length of a minimizer sampling occurrence
list is just over 13 times larger than the mean length of a fixed sampling occurrence list.
For k = 16, this falls to roughly 1.55 times larger, and for larger k, this falls to just a bit
larger. Even more dramatic, the standard deviation for minimizer’s occurrence list lengths
45
Table 3.7: The number of shared kmer occurrences (in billions) for all sampling
methods and all choices of k when L = 100.
Query set
Mouse
Chimp
NGS
k
12
16
20
24
28
32
12
16
20
24
28
32
12
16
20
24
28
32
f ix
124.51
40.72
20.98
10.94
6.04
3.25
355.18
129.05
58.20
24.31
11.72
6.08
44.95
11.89
4.86
2.08
1.06
0.58
minrand,one
25.91
1.49
0.69
0.57
0.10
0.10
157.24
61.04
87.84
13.01
10.77
8.09
22.98
7.81
4.67
1.54
1.36
1.09
minrand,many
35.00
4.24
2.33
1.27
0.36
0.50
160.65
61.30
174.00
13.05
11.62
8.10
23.41
7.86
4.71
1.55
1.37
1.10
minlex,one
663.35
156.26
44.94
13.70
5.09
2.82
1499.97
418.83
125.58
37.12
15.88
6.35
318.71
59.64
14.79
4.17
1.82
0.73
ranges from 6.58 times up to 21.5 times larger than the standard deviation of fixed samplings
occurrence list lengths.
What this shows is that some kmers in minimizer sampling have very large occurrence
lists. Furthermore, the kmers that have large occurrence lists are exactly the kmers that
are most likely to be extracted from a query sequence since the sampling method is biased
to choose them. This explains why, despite testing relatively few query kmers, minimizer
sampling have much larger query times than expected theoretically.
3.4.3
Space and speed
We summarize our comparison of the the sampling methods by plotting the space and speed
of the resulting index for each query set and both choices of L in Figure 4.3.
46
Table 3.8: The mean and standard deviation of the length of a kmer’s list of
occurrences using the human genome for all sampling methods and all choices
of k when L = 50.
Mean
Std. Dev.
k
12
16
20
24
28
32
12
16
20
24
28
32
f ix
6.50
1.23
1.13
1.10
1.08
1.07
37.61
10.17
6.65
4.66
3.41
2.65
minrand,one
51.72
2.01
1.26
1.19
1.16
1.13
419.57
67.30
32.14
20.06
17.56
10.61
minrand,many
53.20
2.05
1.27
1.19
1.16
1.13
483.85
66.02
36.96
22.73
16.91
11.84
minlex,one
58.70
2.26
1.26
1.19
1.15
1.12
616.48
69.38
32.79
19.98
13.74
9.09
Table 3.9: The mean and standard deviation of the length of a kmer’s list of
occurrences using the human genome for all sampling methods and all choices
of k when L = 100.
Mean
Std. Dev.
k
12
16
20
24
28
32
12
16
20
24
28
32
f ix
3.63
1.17
1.11
1.08
1.06
1.05
18.68
6.21
4.02
2.62
1.87
1.37
minrand,one
45.20
1.81
1.24
1.19
1.16
1.13
401.81
67.08
25.38
17.51
16.71
9.02
minrand,many
44.70
1.95
1.25
1.19
1.15
1.13
402.42
61.54
30.99
19.63
14.84
10.11
minlex,one
53.02
2.34
1.28
1.19
1.15
1.12
712.16
80.71
36.05
20.47
13.85
8.61
If we ask both methods to use the same space regardless of k, we find that fixed sampling
typically answers queries at least as fast as minrand,one and often is faster. For example, the
index created using fixed sampling with k = 16 has roughly the same size as the index created
47
Figure 3.2: Comparing the space and speed of fixed sampling (f ix), minimizer
sampling minrand,one , minrand,many , and minlex,one . We use L = 50 for the three upper
figures and L = 100 for the three lower figures. The values for minlex,one when k = 12 are
very large and removed from figure below.
using minimizer sampling with k = 12. However, fixed sampling is 37.43%, 51.11%, and
44.52% faster than minimizer sampling for the mouse, chimp, and NGS datasets, respectively.
Combined with the fact that fixed sampling is much simpler than minimizer sampling, it
is clear that fixed sampling is always the best choice if we optimize both time and space
regardless of k.
Finally, we conclude by observing the randomized ordering optimization is more effective
48
than the duplicate removal optimization. We can see that for all k values we consider,
the effectiveness of minrand,many and minrand,one are very similar. On the other hand,
minlex,one is significantly worst than both minrand,one and minrand,many .
49
Chapter 4
Soft fixed sampling
Finding highly similar local alignments (HSLAs) is an important step in bioinformatics
and computational biology. Finding HSLAs is often sped up using kmer indexes. NCBI
BLAST, and in particular indexed MegaBLAST [56], is the classic and the most widely used
program to find HSLAs using kmers. Since the size of kmer indexes is large, sampling kmer occurrences has been used as an effective way to reduce index size and query time. Based
on the results of the last chapter, and also because BLAST only supports fixed sampling,
we focus on fixed sampling. We distinguish between two types of fixed sampling: hard
sampling [56, 66] and soft sampling [36]. In hard sampling, we choose w small enough such
that we are guaranteed to find all desired HSLAs. On the other hand, in soft sampling, we
consider large w values, and thus we risk missing some HSLAs.
In this chapter, we study how best to sample a kmer index to manage index size, query
time, and accuracy where accuracy refers to finding all desired HSLAs. We show that using
soft sampling, which has largely been ignored in previous studies, significantly reduces index
size and computation times with very little loss in accuracy. We study soft sampled kmer
indexes in the context of finding HLSAs between a query sequence q and a database of
sequences DB. We also study that soft sampling can be effectively used in mapping ESTs
to a genome for mapping tools that first find HSLAs.
We start by formalizing the problem of finding HSLAs. Then we illustrate how to
50
use NCBI BLAST kmer indexes to find HSLAs. Next, we formally define hard and soft
sampling. We then describe how to use soft sampling to improve EST mapping. Next, we
illustrate the experimental settings and analytical modeling. We conclude with results that
demonstrate that soft sampling is a simple but effective strategy for performing efficient
searches for HSLAs.
4.1
Highly similar local alignments (HSLA)
We now formally define the first application, finding HSLAs. We start by defining a local
alignment A(s, q) between two sequences s and q. For simplicity, we denote A(s, q) as just
A.
Definition 6 (Local alignment) A local alignment A(s, q) between any two sequences s
and q is a triple (x, y, m) where x is a contiguous subsequence of s, y is a contiguous subsequence of q, and m is an injective and monotonically increasing mapping from positions in
x to positions in y.
Figure 4.1: Example database sequence s and query sequence q and two local
alignments A1 and A2 . The symbol () identifies two mapped identical positions and (∗)
is an inserted gap position in one of the two sequences.
s : CCAACGAT ACCCCCCT T T T T CT GCGT CC ∗ ∗
     
         
q : ∗ ∗ AACGA ∗ AGGGGGGT T T T T GT GCGT GGGG

{z
A1
}

{z
A2
}
Within an alignment A = (x, y, m), some positions in x may map to no positions in y
and vice versa. Let map(A) denote the number of positions in x that map to positions in
51
y, and let match(A) denote the number of mapped positions that are identical. We then
define the length of A to be A = x + y − map(A), and we define the match percentage
to be mp(A) = match(A)/A. Finally, we define E(A) = A − match(A) to be the number
of errors in alignment A.
To illustrate these definitions, consider the example in Fig 4.1 with two local alignments
A1 and A2. We have map(A1) = 6, match(A1) = 6, A1 = 7, mp(A1) = 85.7%, and
E(A) = 1 whereas map(A2) = 11, match(A2) = 10, A2 = 11, mp(A2) = 91%, and
E(A) = 1.
When searching for local alignments, our goal is to find all HSLAs that have a minimum
length and match percentage. We formally define our targeted HSLAs, which we also refer
to as true matches, as follows:
Definition 7 (True match or HSLA) For a database of sequences DB, a query sequence
q, an alignment length threshold l, and a match threshold t, we define HSLA(DB, q, l, t) =
{A(s, q)  s ∈ DB, A(s, q) ≥ l and mp(A(s, q)) ≥ t}.
We use HSLA(q, t) when DB and l are clear from context. We also define short HSLAs to
represent HSLAs that are barely in HSLA(DB, q, l, t) and which are the hardest to find.
Definition 8 (Short HSLA) For a database of sequences DB, a query sequence q, an
alignment length threshold l, and a match threshold t, we define HSLAshort (DB, q, l, t) =
{A(s, q)  s ∈ DB, l ≤ A(s, q) ≤ (2 − t)l and mp(A(s, q)) ≥ t}.
For example, HSLA({s}, q, 6, 85%) = {A1, A2} whereas HSLA({s}, q, 11, 85%) = {A2}; A1
is omitted because it does not meet the length threshold of 11. Likewise, HSLA({s}, q, 6, 90%) =
{A2}; A1 is dropped because it does not meet the match percentage threshold. Focusing on
short HSLAs, HSLAshort ({s}, q, 6, 85%) = {A1}. A2 is dropped because it is too long. Note
52
that HSLA({s}, q, 6, 90%) actually includes several alignments that overlap significantly with
A1; we follow standard practice and only include the longest alignment with highest match
percentage from any group of highly overlapping alignments in HSLA(s, q, l, t).
4.2
Using NCBI BLAST kmer indexes to finding HSLAs
We now describe how indexed BLAST [56] is typically used to find HSLAs in HSLA(DB, q, l, t).
Specifically, indexed BLAST uses a seedandextend search process where we have one seed
phase and two extension phases. In the seed phase, for a given k value k 0 , indexed BLAST
uses a k 0 mer index to find shared k 0 mers, where a shared k 0 mer is a substring formed by
k 0 consecutive letters that appear in both a database sequence s ∈ DB and in the query
q. More specifically, indexed BLAST identifies the locations or occurrences of these shared
k 0 mers. Once shared k 0 mer occurrences are found, BLAST performs the first extension
phase. In this phase, each occurrence is extended in both directions to find a maximal exact
match (MEM), which is an exact match that cannot be extended in either direction without
introducing mismatches. If a found MEM has length at least some threshold k ∗ (defined
below), BLAST performs the second extension phase where it tries to extend the MEM into
an HSLA. BLAST’s extension process in this second phase is slightly more complex than
the process from its first phase since BLAST must allow some mismatches and gaps in this
second phase.
To illustrate this process, consider our example from Fig 4.1 and suppose we use BLAST
to search for HSLA({s}, q, 11, 90%) with k 0 = 4 and k ∗ = 5. Suppose the seed phase returns
the shared 4mers AACG, TTTT, and TGCG. When BLAST performs the first extension
phase, it would find the MEMs AACGA, TTTTT, and TGCGT. Since k ∗ = 5, BLAST
53
would then try to extend the three MEMs to HSLAs. The latter two would extend to A2
whereas the MEM AACGA cannot be extended into an HSLA.
We now describe BLAST’s first two phases in more detail starting with the seed phase.
BLAST constructs a k 0 mer index as follows. The k 0 mer index saves a list of database k 0 0
mers in a lookup table of all possible k 0 mers, which is 4k entries. We refer to this lookup
table as a dictionary. For each k 0 mer in the dictionary, BLAST saves some of its occurrences
in an inverted list (also known as an offset list). A k 0 mer occurrence is an ordered pair (s, i)
where s is the string containing this occurrence and i is the position of the last character in
this occurrence.
BLAST then finds shared k 0 mers as follows. BLAST extracts all k 0 mers from query
sequence q. BLAST then searches for each extracted k 0 mer in the dictionary. If the extracted
k 0 mer is in the dictionary, it represents a shared k 0 mer for q and some s ∈ DB. BLAST
uses that k 0 mer’s inverted list to find occurrences of that k 0 mer in DB.
A key choice is what value of k 0 should be used. Typically, k 0 is chosen to be at most 16
0
so that the list of all possible k 0 mers (which has 4k entries) can be stored as an array in
RAM. Since we use BLAST to perform our experiments, we use BLAST’s default value of
k 0 = 12.
We next describe the first extension phase where BLAST searches for MEM * s which are
k
MEMs of length at least k ∗ . The extension itself is straightforward since mismatches and
gaps are not allowed. The key issue for this phase is what k ∗ should be. We want k ∗ to be as
large as possible to reduce the number of false positives, which are MEM * s that cannot be
k
extended into HSLAs. It is well known how to compute k ∗ given a target alignment length
L and a maximum number of errors E [94, 95, 19]. Specifically, k ∗ = bL/(1 + E)c. The
basic idea is that the worst case is when the errors are evenly spaced. The question then is
54
what value of L and E should be used. The hardest HSLAs to find are the short HSLAs
defined in Definition 8; basically those of length exactly l and match percentage t. Thus, we
use L = l and E = b(1 − t)lc, which leads to k ∗ = bl/(1 + b(1 − t)lc)c.
4.3
Hard versus soft Sampling
The fundamental issue with using k 0 mer indexes to search for HSLAs is that the k 0 mer
index can be very large. Most systems including BLAST control dictionary size by limiting
k 0 to a small value such as 12. With this choice of k 0 , the problem is that there are too many
k 0 mer occurrences because the total number of k 0 mer occurrences is roughly the total length
of all the sequences in the database DB. The human genome is roughly 3 billion base pairs,
so this would mean roughly 3 billion k 0 mer occurrences.
For this reason, k 0 mer indexes are typically sampled where we only save some k 0 mer
occurrences rather than all of them. We focus on fixed sampling where for a given sampling
step w ≥ 1, a kmer that occurs at every wth position is saved. We distinguish between two
types of fixed sampling: hard sampling [56, 66] and soft sampling [36].
In hard sampling, we choose w ≤ k ∗ − k 0 + 1 so that we are guaranteed to find a k 0 mer
within every MEM * . Thus, when we apply the first extension step, we will find the resulting
k
MEM * . Since we find all MEM * s after the first extension step, we are guaranteed to find
k
k
all HSLAs after the second extension step. Without loss of generality, for hard sampling,
we assume w = k ∗ − k 0 + 1 since this maximizes the space savings with no loss in accuracy.
We refer to this w value as w0 (w0 = k ∗ − k 0 + 1).
In soft sampling, we consider w > w0 . Because we no longer are guaranteed to choose
a k 0 mer from every MEM * , when we apply the first extension phase, we may miss some
k
55
MEM * s which may lead to missing some HSLAs in the next extension phase. Thus, if we
k
use soft sampling, we risk missing some HSLAs.
4.4
Retention rates and false positives
Recall our goal is to find HSLA(DB, q, l, t). We denote the HSLAs and short HSLAs found
by using indexed BLAST with parameter values k 0 and w to be HSLA(DB, q, l, t, k 0 , w)
and HSLAshort (DB, q, l, t, k 0 , w), respectively. With hard sampling (w = w0 ), we know
HSLA(DB, q, l, t, k 0 , w0 ) = HSLA(DB, q, l, t). With soft sampling (w > w0 ),
HSLA(DB, q, l, t) − HSLA(DB, q, l, t, k 0 , w) 6= ∅ is possible. We define the retention rate of
HSLAs as a function of w as follows.
Definition 9 (Retention Rate) For a k 0 mer index with a sampling step w (SI(w)), the
retention rate for HSLA (RR(w, w0 )) and the retention rate for HSLAshort RRshort (w, w0 )
are:
RR(w, w0 ) =  HSLA(DB, q, l, t, k 0 , w)/ HSLA(DB, q, l, t),
RRshort (w, w0 ) =  HSLAshort (DB, q, l, t, k 0 , w)/ HSLAshort (DB, q, l, t).
We typically express these ratios as percentages. We will study how RR(w) and RRshort (w)
change as a function of w. Because short HSLAs are the hardest true matches to find, we
expect RR(w) > RRshort (w) in most cases.
We present a new analytical model to compute the expected retention rate of HSLAs
in HSLAshort (DB, q, l, t). The new model is an extension to Kent’s analytical model [36]
where he essentially assumed w = k 0 = k ∗ in his model. On the other hand, we propose a
new model where we assume k 0 < k ∗ and w ≥ 1. We refer to the new model as the BLAST
model since it accounts for typical parameters used in BLAST searches.
56
Searching with sampled k 0 mer index produces two intermediate results: shared k 0 mers
and MEM * s. The second extension process, extending MEM * s into HSLAs, is more
k
k
complex and costly than the first extension process since we are allowing some mismatches
and gaps. We thus define MEM * s that do not extend into HSLAs to be false positives.
k
We will also study how the number of false positives changes as a function of w.
4.5
Using NCBI BLAST kmer indexes in EST mapping
Our second motivating application, which builds upon the first, is mapping ESTs on a
genome, a fundamental procedure in genome research. Many different mapping tools are
available, each with their own advantages [31]. We focus on hash table–based, seedandextend mappers such as mrFAST/mrsFAST [30, 3], SHRiMP [77], Hobbes [2], drFAST [33],
and RazerS [96]. These mappers are typically fully sensitive mappers that “can detect reads
missed by other tools” [31] but may be relatively slow.
We study whether soft sampling kmer indexes might increase the speed of these mappers
with relatively little loss in sensitivity when working with the human genome as our database.
These methods work in two stages. First, they find the set of all HSLAs between an EST and
a genome. Then they map the EST to the genome by selecting and linking these HSLAs.
The mappers usually differ in how to modify, evaluate, and use the resulting HSLAs to
assess the final mapping process. Fig 4.2 illustrates the mapping procedure.
In this chapter, we assess the effectiveness of soft sampling in mapping human ESTs on
a human genome. Specifically, we assess whether the correct mapping is retained when we
use softsampled k 0 mer indexes to complete the first stage of finding HSLAs. We measure
57
the effect of sampling on both the index size and the query time. We only simulate the
mapping process because we want our results to be general and independent from the details
of the final mapping process of a mapper. We hope our findings encourage more developers
to allow the use of a wider range of k 0 and w values in their mappers.
Figure 4.2: Illustration of EST mapping process. The HSLAs (A, A∗), (B, B∗), (C, C∗),
and (D, D∗) are used to report the final mapping.
4.6
Materials and method
We evaluate the effect of soft sampling on using BLAST to (i) find HSLAs and (ii) map
ESTs to the human genome. For both applications, we describe our database and how we
create sampled indexes. We then describe our query sets and how we perform queries. We
next describe our evaluation metrics. Finally, we describe how we extend Kent’s analytical
model to work with our choices of k 0 = 12 and w.
4.6.1
Experimental settings
Database: For both applications of finding HSLAs and EST mapping, we use the human genome database provided by Morgulis et al. from their MegaBLAST paper [56] as
58
our database. Morgulis et al. note that the human genome database was the most frequently searched database in NCBI in 2007 with 10,000 submitted queries per weekday.
They partitioned the human genome database into volumes, each of which is roughly 1 GB in
size and available at (ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/indexed_megablast/
fasta/human). We summarize key characteristics of each volume in Table 4.1. We did experiments with both masked and unmasked data but report results only for the unmasked data
since the results were similar. As in [56], we treat each volume as a separate database. That
is, we create an index for each volume separately and search each volume’s index separately.
To obtain results for the human database, we then simply union the results found for each
volume.
Table 4.1: Human genome volume characteristics.
Name
Chr. 15, unmasked
Chr. 613, unmasked
Chr. 14Y, unmasked
Chr. 18, masked
Chr. 9Y, masked
Size(Mbytes)
1039.86
1093.27
778.75
1517.93
1400.78
Size(bp)
1,025,201,451
1,077,856,590
767,769,314
1,493,033,824
1,377,793,531
Sampled index construction: For finding HSLAs, we use four different minimum alignment lengths l: 50, 100, 200, and 400 and a match threshold t = 96% or t = 97%. For each
of our four choices of l, we use k ∗ = l/(1 + b(1 − t)lc). For mapping ESTs, similar to Kent’s
design of BLAT [36], we use the same choices except we omit l = 400. Specifically, Kent
used l = 100; we also include l = 50 and l = 200 to study EST mapping under a wider set
√
of possible choices. We use a geometric progression with base 2 to choose w values for
√ i
soft sampling indexes. Specifically, we consider w = 2 . For each l, we ignore w less than
w0 = k ∗ − k 0 + 1 since w0 is the largest hard sampling value. Likewise, we ignore w ≥ l as
59
these can completely skip over a potential alignment of length l. This results in a total of
eleven choices ranging from w = 8 to w = 256. Combined with four choices of w0 for hard
sampling and three volumes, we create a total of 15 × 3 = 45 sampled indexes. We use SI(w)
to denote a sampled index created with sampling parameter w; note SI(w0 ) denotes a hard
sampling index. These choices are summarized in Table 4.2. Note some sampled indexes are
used with multiple l values. For example, the sampled indexes SI(22) and SI(32) are used
for each choice of l.
Table 4.2: A summary of the parameters used in our experiments for (1) finding
HSLAs and (2) EST mappings. For HSLA, we consider all four choices of l. For
EST, we only consider the first three choices of l.
Sampling
l
t
50 96%
100 97%
200 97%
400 97%
parameters
k ∗ k 0 w0
16 12 5
25 12 14
28 12 17
30 12 19
w > w0
8, 11, 16, 22, 32
16, 22, 32, 45, 64
22, 32, 45, 64, 90, 128
22, 32, 45, 64, 90, 128, 181, 256
The k 0 mer indexes are built using BLAST with sampling steps w0 and w. True matches
HSLAs are of length ≥ l and a match percentage ≥ t. Only HSLAs that have shared
k ∗ mers are reported by BLAST.
We build our sampled indexes using the BLAST program makembindex for the three
volumes of the unmasked human genome database using BLAST’s default value of k 0 = 12.
Query Sets and Mappable Queries: For HSLA, we use the same query sets that Morgulis
et al. used to evaluate Indexed BLAST [56]. Morgulis et al. organized the queries into three
sets based on the average query length: qsmall (average length 500), qmedium (average
length 10,000), and qlarge (average length 100,000). Each set has 100 queries for 300 total
queries. We group all the queries into a single set of 300 queries and report all results using
60
this single query set. The query sets are available at the following url: ftp://ftp.ncbi.
nlm.nih.gov/pub/agarwala/indexed_megablast/queries/human. For EST mapping, we
form our query set Q by randomly selecting 1000 human ESTs (average length 490) from Expressed Sequence Tags database from NCBI https://www.ncbi.nlm.nih.gov/dbEST. For
each length l, we define Q(l) to be the subset of Q that has a nonempty HSLA(DB, q, l, t)
and refer to these as the mappable queries for length l.
Query Processing: For every query q in the query set, we run BLAST using the blastn
program with the task megablast option using its default settings except we select MEM *
k
value using wordsize k ∗ , we use multiple values of w, and we set the matching threshold,
also known as identity percentage, using percidentity 96% for l = 50 and percidentity
97% for all other l. This will return HSLA(DB, q, k ∗ , t, k 0 , w) where every alignment
must have an MEM * . That is, the match percentage t will be satisfied, but the lengths are
k
only guaranteed to be at least k ∗ , not l. We filter out any HSLAs that are too short to
produce HSLA(DB, q, l, t, k 0 , w).
False Positives: For any query q and any w, we report the number of false positives
F P (q, w) as the number of alignments in HSLA(DB, q, k ∗ , t, k 0 , w) − HSLA(DB, q, l, t).
This should be very close to the number of MEM * s that do not extend to alignments in
k
HSLA(DB, q, l, t); the two numbers might differ if multiple MEM * s are part of the same
k
alignment in HSLA(DB, q, k ∗ , t) − HSLA(DB, q, l, t).
Experimental System: We run the experiments on a cluster that runs the Community
Enterprise Operating System (CentOS) 6.6. The cluster has 24 nodes where each node has
61
two 2.5Ghz 10core Intel Xeon E52670v2 processors, 20 cores, and 500 GB.
The HSLA Evaluation Metrics: We evaluate the effectiveness of a given k 0 mer index SI(w) as a function of w and w0 using three metrics: (1) index size reduction, (2)
retention rate of HSLAs, and (3) query time reduction. For retention rate, we consider
retention of all HSLAs as denoted by HSLA(DB, q, l, t) and short HSLAs as denoted by
HSLAshort (DB, q, l, t). To help explain query time reduction, we also measure false positive
reduction. We describe each metric in more detail.
For each SI(w) and each choice of w > w0 , we define the sampled index size reduction
as
SIR(w, w0 ) =
SI(w)
SI(w0 )
(4.1)
where I is the size of index I. Index size is the sum of dictionary size, measured by counting
the number of k 0 mers, and inverted lists’ size, measured by counting the number of k 0 mer
occurrences in all the inverted lists. Since the human genome is split into three volumes and
we create a sampled index for each volume, we compute the total index size for all indexes
over all volumes. For the total dictionary size, we take the union of all three dictionaries, and
then we measure the total dictionary size by counting the number of k 0 mers in the union
set. For the total inverted lists’ size, we take the sum over all three inverted lists’ sizes.
For each SI(w) and each choice of w > w0 , we report the full retention rate RR(w, w0 )
and the short retention rate RRshort (w, w0 ) which we define as follows. For w, w0 and q,
we define
RR(q, w, w0 ) =
 HSLA(DB, q, l, t, k 0 , w)
 HSLA(DB, q, l, t, k 0 , w0 )
62
(4.2)
and
RRshort (q, w, w0 ) =
 HSLAshort (DB, q, l, t, k 0 , w)
 HSLAshort (DB, q, l, t, k 0 , w0 )
(4.3)
We say that RR(q, w, w0 ) or RRshort (q, w, w0 ) is undefined if the denominator is 0. We
typically report both ratios as percentages. We use all three volumes to get these percentages. We then set RR(w, w0 ) and RRshort (w, w0 ) to be the average of RR(q, w, w0 ) and
RRshort (q, w, w0 ), respectively, where we only include query q in the average if RR(q, w, w0 )
or RRshort (q, w, w0 ), respectively, is defined. We report RR(w, w0 ) since this is a typical
user query. We specifically define RRshort (w, w0 ) to fairly compare empirical retention rate
with expected retention rate. Intuitively, RRshort (w, w0 ) focuses on the hardest to retain
HSLAs.
For each SI(w) and each choice of w > w0 , we report the average query time reduction
percentage QT R(w, w0 ) which we define as follows. We start by defining the query time
QT (q, w) for a given query q and sampled index SI(w) (including SI(w0 )) as follows. We
process each query q on SI(w) five times using BLAST and we set QT (q, w) to be the
median of the five values. Since SI(w) is partitioned into three volumes, the query time for
a given q is the sum of the query times over the three volumes. The query time reduction
QT R(q, w, w0 ) is then
QT R(q, w, w0 ) =
QT (q, w)
.
QT (q, w0 )
(4.4)
Finally, the average query time reduction QT R(w, w0 ) is the average of QT R(q, w, w0 ) over
all q.
Finally, to help explain the query time reduction results, for each SI(w) and each choice
of w > w0 , we report the average false positive reduction rate F P R(w, w0 ) which we define
as follows. For a given query q and sampled index SI(w) (including SI(w0 )), we define
63
F P (q, w) to be the number of false positive; that is, HSLAs that do not lead to elements of
HSLA(DB, q, l, t) when we apply the second, more expensive, extension phase. We believe
that F P (q, w) decreases as w increases, and this may help explain any reduction in query
time. To test this, we define the false positive reduction rate F P R(q, w, w0 ) to be
F P R(q, w, w0 ) =
F P (q, w)
F P (q, w0 )
(4.5)
Finally, the average false positive reduction rate F P R(w, w0 ) is the average of F P R(q, w)
over all q.
EST Mapping Evaluation Metrics: For each soft sampled index SI(w) and a given
length l, we report its retention rate, RRmap (w, l), as the percentage of Q(l) such that all of
HSLA(DB, q, l, t) is found using SI(w). We use this requirement because this implies that
the mapping result for SI(w) for the given query q will be identical to the mapping result
for SI(w0 ) and q regardless of the mapping procedure used. Otherwise, at least one highly
similar local alignment is lost and we pessimistically assume that the mapping result would
be lost as well.
More formally, for a given mappable queries set Q(l) and k 0 mer index SI(w), we define
the set Q0 (l) ⊂ Q(l) as follows
Q0 (l) = {∀q ∈ Q(l)  HSLA(DB, q, l, t, k 0 , w) = HSLA(DB, q, l, t, k 0 , w0 )}
64
(4.6)
Then, we define the index retention rate RRmap as follows:
RRmap (w, w0 ) =
Q0 (l)
Q(l)
(4.7)
We also report the effect of w on query time using the same process as with HSLA,
namely, running each query five times, taking the median time, and then reporting the
average reduction in query time over all 1000 queries. Note that we use all queries rather
than just the mappable queries when reporting query time.
4.6.2
Analytical modeling
We now describe how we analytically model two of the evaluation metrics, index size reduction and retention rate.
Predicting Index Size: We first show how we compute the expected size of a sampled
index SI(w). For the dictionary, we assume the k 0 mer dictionary is full and thus the size of
0
a k 0 mer dictionary is 4k entries which, in our case, is 412 . This may not be accurate, but
since the dictionary size is typically much smaller than the inverted lists size given k 0 = 12,
this is accurate enough. The number of k 0 mer occurrences stored in the inverted lists is
simply (D − S(k 0 + 1))/w where D = DB, the number of positions in DB, and S is the
number of distinct sequences in DB. Thus, the predicted size of SI(w) is simply
0
size(SI(w)) = 4k +
D − S(k 0 + 1)
w
(4.8)
Predicting Retention Rate: We present a new analytical model to compute the expected retention rate of HSLAs in HSLAshort (DB, q, l, t). We start by presenting Kent’s
65
analytical model [36] where he essentially assumed w = k 0 = k ∗ in his model. We then
propose a new model that we refer to as the BLAST model to account for typical parameters used in BLAST searches. We refer to the expected retention rates as E[RRK ] and
E[RRB ]. For both retention rates, we make a few simplifying assumptions and refer to the
two models generically as E[RR] when describing these common assumptions. First, we
restrict our attention to HSLAs that have length exactly l. Second, we assume each HSLA
in HSLAshort (DB, q, l, t, k 0 , w) is retained with the same probability, and this probability is
independent of other HSLAs. This implies
E[RR] = E[
 HSLAshort (DB, q, l, t, k 0 , w)
]
 HSLAshort (DB, q, l, t)
(4.9)
simplifies to just p(A)
E[RR] = p(A)
(4.10)
which represents the probability that a short HSLA A is retained. This allows us to focus
on a single short HSLA A in the rest of this analysis. Finally, we assume each position in
A is independent of other positions and the probability that any position in A is a match is
exactly t.
Kent’s original retention rate model (E[RRK ]): We start with Kent’s original model [36].
where he assumes w = k ∗ = k 0 . The number of k ∗ mers that are guaranteed to be chosen
from x within A is
T = b(x − k ∗ + 1)/wc
(4.11)
Furthermore, these k ∗ mers will be adjacent to each other with no gaps. For A to be retained,
66
at least one of these chosen k ∗ mers from x must exactly match the corresponding k ∗ mer
in y from q. The probability of such an exact match assuming each position is independent
and that the overall match percentage within A is t is then
p = tk
∗
(4.12)
Since the sampled k ∗ mers do not overlap, the probability that all fail to match is then
(1 − p)T . Thus, the probability that at least one will match and alignment A will be found
is p(A) = 1 − (1 − p)T . Since E[RRK ] = p(A), we have
E[RRK ] = 1 − (1 − p)T
(4.13)
BLAST retention rate model (E[RRB ]): To extend this analysis to the typical BLAST
setting with distinct w, k ∗ and k 0 , we must modify the formula in two ways. The first key
issue is that we sample k 0 mers but then extend them to search for k ∗ mers. The sampled
0
k 0 mer must be an exact match, which again happens with probability p = tk The key issue
after this is whether this can be extended to an MEM * . Suppose this can extend exactly
k
0 ≤ l ≤ k ∗ − k 0 − 1 characters to the left before we get a mismatch. We then need it to
extend at least k ∗ − l − k 0 characters to the right. The probability we can extend exactly
l characters to the left is tl (1 − t). The probability we can extend at least k ∗ − l − k 0
∗ 0
characters to the right is tk −k −l . Thus, the probability that we have a k 0 mer, it extends
exactly 0 ≤ l ≤ k ∗ − k 0 − 1 characters to the left, and it extends at least k ∗ − l − k 0 characters
67
∗ 0
∗
0
to the right is then tk tl (1 − t)tk −k −l = tk (1 − t) There are k ∗ − k 0 − 1 choices for l leading
∗
to a final probability of (k ∗ −k 0 +1)tk (1−t). The other possibility is that it extends at least
∗
k ∗ − k 0 characters to the left which occurs with probability tk giving us a total probability
of
∗
p0 = (k ∗ − k 0 − 1)tk (1 − t) + tk
∗
(4.14)
The second key issue is that in Equation 4.11, we used the floor function as this is the
number of k ∗ mers from x within A that are guaranteed to be chosen. Using the floor
function ignores the possibility that we may have an additional k ∗ mer chosen from x. That
is, the number of k ∗ mers from k that will be sampled might be either
Tf = b(x − k ∗ + 1)/wc
(4.15)
Tc = d(x − k ∗ + 1)/we
(4.16)
where Tf = T from Equation 4.11. If we assume that each possible window for w is equally
likely, then
p(Tf ) =
w × Tc − (x − k ∗ + 1)
w
p(Tc ) = 1 − p(Tf )
(4.17)
(4.18)
For the case where b(x − k ∗ + 1)/wc = (x − k ∗ + 1)/w, p(Tf ) = 0 which means p(Tc ) = 1
so the result is still correct.
In our new BLAST model, we update Equation 4.13 replacing p with p0 and replacing T
with Tc and Tf as follows.
68
T
E[RRB ] = 1 − p(Tf )(1 − p0 ) f − p(Tc )(1 − p0 )Tc
(4.19)
We will compare both Kent’s model and our new BLAST model in our results.
4.7
Results and discussion
We report the impact of sampling on the efficiency of a k 0 mer index on the index size and
query performance. We report both the expected and the actual impact of sampling.
4.7.1
Index size
As expected, the index size is inversely proportional to the sampling step w. This means that
soft sampling does lead to a significant reduction in space when compared to hard sampling.
For example, when w/w0 is roughly 1.7 and 4.4, the index size reduces by 38% and 74%
for all values of l we considered. The percentage of reduction increases as l increases. For
example when l = 400 and w/w0 is almost 10, the index size reduces by 90%.
With hard sampling w0 = k ∗ − k 0 + 1, the space reduction is limited by k ∗ . With soft
sampling w > k ∗ − k 0 + 1, w is limited primarily by l, where typically l k ∗ (see Table 4.2).
We plot results for the percentage reduction in index size in Fig 4.3. Since the expected
index size (see Eq 4.8) and the actual index size are almost identical, the expected size is
omitted.
Sampling reduces index size because it reduces the number of sampled k 0 mers leading to
a factor of w reduction in inverted lists size, the dominant component of index size. On the
other hand, although sampling does reduce dictionary size, the reduction is relatively small
and does not greatly affect the final index size. For example when w/w0 is roughly 1.7 and
69
Figure 4.3: The sampled index SI(w) size (percentage) as a function of sampling
step size w of SI(w) versus sampled index SI(w0 ). The k 0 mer indexes are built
with k 0 = 12 and w ≥ w0 where w0 = l − k + 1.
4.4, the dictionary is about 9% and 17% of the the index size and the average reduction in
the dictionary size is 9% and 27% respectively.
4.7.2
Retention rate of HSLAs
We first examine RR(w, w0 ) to study how the overall HSLA retention rate changes as a
function of w, w0 , and l. We first observe that RR(w, w0 ) improves as l increases. In
particular, as can be seen from our RR(w, w0 ) results from Figure 4.4, if we look at choices
for w and w0 that have a similar ratio w/w0 , the RR(w, w0 ) retention result is higher for
larger l. In particular, whereas RR(32, 5) for l = 50 falls below 80%, RR(w, w0 ) ≥ 96.6%
for l ≥ 100 for all tested values of w, and RR(256, 30) = 97.5% for l = 400. Thus, for
70
large values of l, we can use soft sampling where w/w0 approaches even 10 and still achieve
retention rates of close to 100%.
Figure 4.4: The actual HSLA retention rate RR(w, w0 ), the actual short HSLA
retention rate RRshort (w, w0 ), and the expected short HSLA retention rate using
both Kent’s model E[RRK (w, w0 )] and BLAST model E[RRB (w, w0 )] for (a) l = 50,
w0 = 5, (b) l = 100, w0 = 14, (c) l = 200, w0 = 17, and (d) l = 400, w0 = 30. For other
parameters values see Table 4.2.
71
We now focus on the retention rate of short HSLAs. First we compare RR(w, w0 ) with
RRshort (w, w0 ) as a function of w, w0 , and l. Figure 4.4 also contains our RRshort (w, w0 ) results. We observe that if w/w0 < 4, then the difference between RR(w, w0 ) and RRshort (w, w0 )
is small (less than 4%) for almost all choices of l; the one outlier is l = 100 where we see a
difference of 11% for w/w0 = 45/14 ≈ 3.2. For example, for all values of l except 100, the
difference between RR(w, w0 ) and RRshort (w, w0 ) is at most 1.8%. We do see a significant
difference between RR(w, w0 ) and RRshort (w, w0 ) for our largest choices of w which means
that RRshort (w, w0 ) does fall off by w/w0 = 10 or so. Thus, in contrast to RR(w, w0 ), there
is an upper limit to how much we can soft sample before RRshort (w, w0 ) suffers.
Next we examine how RRshort (w, w0 ) changes as a function of w, w0 , and l. Similar to
RR(w, w0 ), we observe RRshort (w, w0 ) generally improves as l increases given roughly the
same ratio of w/w0 . For example, when w/w0 is roughly 6, RRshort (w, w0 ) is 65%, and
85%, and 95% for l equal to 50, 200, and 400, respectively. In general, we can retain 90%
more short HSLAs for either small w/w0 ratios (less than 2 or 3) or large l values (200 or
400).
We now want to compare empirical retention rate with predicted retention rate as a
function of w, w0 and l. Comparing RR(w, w0 ) to E[RRB (w, w0 )] is not fair as RR(w, w0 )
includes many alignments significantly longer than l whereas E[RRB (w, w0 )] focuses only
on alignments with length exactly l. To more fairly compare empirical retention rate to
expected retention rate, we compare RRshort (w, w0 ), where the length of the alignment is in
the range [l, (2 − t)l], with E[RRB (w, w0 )], where an alignment is assumed to have a length
l. We consider HSLAs with length up to (2 − t)l to ensure there are a reasonable number
of HSLAs. We also note that if we assume that the (1 − t)l errors were all insertions rather
than substitutions, this would increase the length of the HSLA to (2 − t)l. Figure 4.4 also
72
contains our E[RRB (w, w0 )] results.
We observe that the BLAST model predicts actual retention rate of short HSLAs with
reasonable accuracy, particularly for small w/w0 and for larger l. For example, the typical
difference between RRshort (w, w0 ) and E[RRB (w, w0 )] when w/w0 ≤ 2 is less than 5% for
all choices of l and is less than 1% for large l = 200 and l = 400. The typical difference
stays below 10% for almost all choices of w/w0 and l with only a few exceptions. The
difference between RRshort (w, w0 ) and E[RRB (w, w0 )] does grow as w/w0 increases, but at
a relatively slow rate, typically maximized at the largest choice of w/w0 , though this does
not hold for l = 400. We do note that we have relatively few empirical HSLAs for the large
w values for l = 400, so perhaps with more samples, the difference between RRshort (w, w0 )
and E[RRB (w, w0 )] might increase for these l and w choices.
Finally, we compare the predictions from Kent’s model E[RRK (w, w0 )] and our new
BLAST model E[RRB (w, w0 )]. Our new BLAST model is significantly more accurate than
Kent’s model, especially as w/w0 increases. For example, for w/w0 equal to 1.7, 3.4 and 4.7,
E[RRK ] is on average less than E[RRB ] by 5%, 18%, and 23%, respectively, for all l values
we consider. Kent’s model has several issues. First, because of the floor function used in
Eq. 4.11, it underestimates the number of sampled k ∗ mers from a given HSLA resulting
in common retention rate predictions for multiple values of w. For example when l = 100,
E[RRK ] = 46.70% for both w = 45 and w = 64. The second flaw is that Kent’s model was
not designed to handle different values for k 0 , k ∗ , and w which is what is typically used in
BLAST. Because our BLAST model is designed to overcome both issues, it achieves better
results, particularly for larger w/w0 and for larger l.
73
4.7.3
Possible improvements for the BLAST model
While our new BLAST model is much more accurate than Kent’s original model, it still
underestimates actual retention rates for large w/w0 . We now explore possible explanations
for this underestimate. We believe the fundamental problem with our new BLAST model
(as well as Kent’s model) is that for any HSLAshort (DB, q, l, t), it only assumes that each
position is a match with probability t.
We demonstrate the shortcomings of this assumption in two different ways. We first
show that using this assumption, we greatly underestimate the length of the maximum MEM
within any HSLA; we refer to this maximum MEM length as MAXMEM. Long MEMs are
relevant because long MEMs significantly increase the likelihood of recovering an HSLA.
For example, if an HSLA includes an MEM of length w + k − 1, then it is guaranteed the
HSLA will be found since one kmer is guaranteed to be chosen from within the MEM.
We perform this comparison as follows. We first obtain an empirical distribution of
MAXMEM by recording the length of the longest MEM in every HSLA in RRshort (w, w0 ).
We then use the BLAST model’s fundamental assumption that each position in an HSLA is
identical with probability t to create a corresponding predicted distribution of MAXMEM.
For this predicted distribution of MAXMEM, we assume that the length of the HSLA is
l, the number of mismatches is exactly (1 − t)l, and each position is equally likely to be a
mismatch. All told, there are l choose (1−t)l different combinations of errors that are equally
likely. We can then compute the predicted distribution by enumerating all possibilities for
l = 50 and l = 100.
For l > 100, it takes too much time to enumerate all possibilities. Thus, we use Monte
Carlo simulation to compute a second predicted distribution for MAXMEM. We create short
74
HSLAs as follows. We start with an alignment of length l, a set S of mismatch positions
which is initialized to empty, and a count C of the number of mismatch positions which is
initially 0. Then, we repeatedly choose a position in the range [0, L−1] uniformly at random.
If the position is not in S, we add the position to S and increment C by one. Otherwise, we
do nothing with the chosen position and choose another one. When C reaches (1 − t)l, we
stop with a complete HSLA. We then record its longest MEM. We do this until we have
recorded one million such longest MEMs.
For l = 50 and l = 100, Monte Carlo simulation and complete enumeration produce
essentially identical distributions for MAXMEM. Thus, we only show results from our Monte
Carlo simulations since these cover all choices of l. We show the results for our experimental
and Monte Carlo distribution of MAXMEM for all four choices of l in Figure 4.5. We
observe that the empirical MAXMEM distribution is weighted more heavily towards longer
MEMs than the predicted distribution. This demonstrates that the assumptions used in the
BLAST model do not correctly predict the distribution for MAXMEM length; in general,
they underestimate the probability for finding longer MEMs. While the distribution of
MAXMEM is not identical to retention rate of HSLAs, this finding provides evidence that
we need stronger assumptions to better predict retention rate of HSLAs.
We now show another fundamental flaw with the base assumption of the BLAST model.
Consider an HSLA of length l, and suppose we assume that each query position is independent and matches its corresponding database position with probability t (similar to the
BLAST model’s assumptions). Then, there is a (1 − t) probability that each position does
not match. In this scenario, the total number of mismatches has the binomial distribution
Bin(l, (1−t)) which has an expected number of mismatches of exactly (1−t)l. If the number
of mismatches exceeds this expected value, we would no longer have an HSLA, but this is
75
Figure 4.5: Distribution of predicted and empirical MAXMEM lengths in HSLAs.
The predicted MAXMEM lengths are computed from a Monte Carlo simulation. (a) l = 50,
t = 96%, (b) l = 100, t = 97%, (c) l = 200, t = 97%, and (d) l = 400, t = 97%.
clearly contradicts with the first assumption that we start with an HSLA. The probability
that the number of mismatches exceeds (1 − t)l is given in Table 4.3. Given this weak as
76
sumption, the BLAST model essentially starts with a probability ranging from 32% to 43%
that the given HSLA is not an HSLA.
Table 4.3: The probability that the number of mismatches exceeds (1 − t)l for
various choices of l and t.
l
50
100
200
400
t
Probablity mismatches exceeds (1 − t)l
0.96
32.3%
0.97
35.3%
0.97
39.4%
0.97
42.4%
Under the assumption that the number of mismatches in a HSLA follow Bin(l, (1 − t)),
Kents’s model underestimates the the existence of the HSLA by 30%  40%.
We have shown that the weak assumption used in the BLAST model (1)underestimates
the probability of longer MAXMEMs and (2) gives a significant probability for HSLAs
to not be HSLAs. Taken together, we believe a new model with stronger assumptions is
needed to produce more accurate predictions about retention rate of short HSLAs.
4.7.4
Query time
We now examine how soft sampling affects query time. The query time is approximately
inversely proportional to the sampling step w for all l values without significantly reducing
retention rate RR(w, w0 ). For example, for l = 200, the median query time for hard sampling is 26 hours. When w/w0 is 3.8 and 5.3, the median query time reduction percentages
are 64.51% (median query time 10 hours) and 73.38% (median query time 7.4 hours), respectively, while maintaining RR > 99%. Similarly, for l = 400, the median query time for hard
sampling is 18.6 hours. When w/w0 is 6.7 and 9.3, the query time reduction percentages are
78.36% (median query time 4.3 hours) and 83.99% (median query time 3.3 hours), respectively, while maintaining RR > 99%. Fig 4.6 shows our full query time results represented
77
as query time reduction (QT R) percentages.
Figure 4.6: The average median query time reduction (QTR) percentages and the
actual false positive reduction (FPR) percentages as a function of sampling step
w. (a) l = 50, w0 = 5, (b) l = 100, w0 = 14, (c) l = 200, w0 = 17, and (d) l = 400, w0 = 30.
For other parameters values see Table 4.2.
The reduction in false positive rate mostly explains the reduction in query time. Re78
call that false positives are alignments in HSLA(DB, q, k ∗ , t) − HSLA(DB, q, l, t), which is
roughly the number of MEM * s that do not extend into alignments in HSLA(DB, q, l, t).
k
This implies much of the query time is spent ruling out false positives and that using soft
sampling not only has little affect on retention rate but also significantly reduces false positives. Our full false positive reduction rate results are shown in Fig 4.6. As can be seen
from this figure, the plots for false positive reduction rate (F P R) and query time reduction
(QT R) percentage are very similar.
4.7.5
Mapping results
We report our retention rate and query time results for EST mapping in Tables 4.4, 4.5,
and 4.6. Our results show that the number of mappable queries that retain all HSLAs is
very high even when we use soft sampling. Furthermore, we achieve significant reductions
in query processing time. Recall that a query q is mappable is there is at least one HSLA
between q and the reference (the human genome in our case). When an index SI(w) is used
where w > w0 , a query is lost if even a single HSLA is not found by SI(w).
Table 4.4: The retention rate (RRmap ) and query time reduction (QT R) results
for all 1000 queries when l = 50 and w0 = 5, where 879 were mappable queries.
w
5
8
11
16
22
32
# retained queries
879
876
864
849
811
677
RRmap
100.00%
99.66%
98.29%
96.59%
92.26%
77.02%
QT R
100.00%
100.00%
89.72%
75.67%
64.19%
51.40%
Using soft sampling, we are able to greatly reduce the index size, significantly reduce
query time, and correctly map more than 95% of the mappable queries for l ≥ 100. In fact,
79
Table 4.5: The retention rate (RRmap )and query time reduction (QT R) results
for all 1000 queries when l = 100 and w0 = 14, where 794 were mappable queries.
w
14
16
22
32
45
64
# retained queries
794
794
794
793
789
784
RRmap
100.00%
100.00%
100.00%
99.87%
99.37%
98.74%
QT R
100.00%
95.60%
85.67%
83.07%
80.26%
76.68%
Table 4.6: The retention rate (RRmap )and query time reduction (QT R) results
for all 1000 queries when l = 200 and and w0 = 17, where 528 were mappable
queries.
w
# retained queries
17
528
22
528
32
528
45
525
64
528
90
523
128
506
RRmap
100.00%
100.00%
100.00%
99.43%
100.00%
99.05%
95.83%
QT R
100.00%
93.76%
91.91%
87.07%
84.08%
80.36%
78.76%
for l ≥ 100, we correctly map almost 99% of the mappable queries for w/w0 approaching 5.
For l = 200, we correctly map at least 95% of the mappable queries for w/w0 = 7.5. For
l = 50, we still see good retention rates but the drop off is a bit faster. Specifically, for l = 50,
for w/w0 approaching 3, we correctly map 96% of the mappable queries. For w/w0 between
4 and 5, we correctly map 92% of the mappable queries, and for w/w0 between 6 and 7, we
correctly map 77% of the mappable queries. Finally, the actual retention rates may be even
better than the ones reported as we required that all HSLAs be retained whereas mapping
may proceed properly even if some HSLAs are lost.
In human EST mapping, it is common to use HSLAs of length l = 100 to search for
80
the best mapping [36, 66, 99]. To study a broader range of possibilities, we also consider
l = 50 (more HSLAs and thus more mappable queries) and l = 200 (fewer HSLAs and thus
fewer mappable queries). Our results imply that soft sampling can be used with relatively
small loss in sensitivity for the commonly used case of l = 100. Given that the index sizes
are significantly reduced and query times are also reduced, soft sampling may allow for EST
mapping using kmer based methods for larger genomes with only a small loss in sensitivity.
81
Chapter 5
Conclusions and future work
In many biological applications, kmer indexes are widely used to find similar sequences. A
major problem with using kmerdataset indexes is that the index size is significantly larger
than the underlying datasets. To ensure kmer indexes are feasible, kmer index size and
query time must be mitigated. One of the most effective approaches to mitigate kmer index
size and query time is to use sampling. In this dissertation, I systematically study the effect
of sampling on kmer indexes. Specifically, I study how key parameters such as sampling
strategy, sampling step size, and kmer size affect index size, query time, and query accuracy.
The results show that the choice of the right sampling method is not trivial, and in fact has
a great impact on both the index size and query time.
We now summarize our main conclusions. When comparing fixed and minimizer sampling, our results show that fixed sampling typically answers queries at least as fast as
minimizer sampling and often is faster when both methods to use the same space. In particularly for small k values, such as k ≤ 16, fixed sampling is 37.43% to 51.11 % faster. As
is common in many applications, there is a space versus speed tradeoff. Using a larger k
requires more space but results in smaller query times. The key benefit of increasing k is
that there are many fewer false positives which leads to much faster query processing.
Based on our experiments with the human genome and NCBI BLAST, fixed soft sampling achieves significant space and time savings while also retaining highly similar local
82
alignments (HSLAs) with much higher probabilities than predicted by analytical modeling.
Even better, when applied to EST mapping, fixed soft sampling achieves significant space
and time savings while retaining 98% of all mapping results when the length of HSLAs is
at least 100 characters, and the retention results may be even better as we pessimistically
assume that losing even a single highly similar local alignment will lead to an incorrect mapping result. Further, because we performed all of our local alignments using BLAST, these
results can be easily tested and adopted by other researchers.
We can extend this work in many directions. We would like to test the effectiveness
of fixed soft sampling using other real biological datasets and in other applications such as
clustering or SNP detection. Recently, a complementary approach of spaceefficient referentially compressed search indexes has been proposed to support similarity searches on genome
datasets [94, 95]. In this method, genomes are compressed against some reference genome(s).
Given a query, the index then searches two parts: the reference and all genomespecific individual differences. Both parts are saved in compressed suffix trees. Danek et al. [19] extend
referencebased compression with the use of a kmer index. We think employing the complementary approach of reference compression in unison with fixed soft sampled kmer indexes
may be fruitful.
83
BIBLIOGRAPHY
84
BIBLIOGRAPHY
[1] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. Replacing suffix
trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2(1):53–86, 2004.
[2] Athena Ahmadi, Alexander Behm, Nagesh Honnalli, Chen Li, Lingjie Weng, and Xiaohui Xie. Hobbes: Optimized grambased methods for efficient read alignment. Nucleic
Acids Research, 40(6):e41–e41, 2012.
[3] Can Alkan, Jeffrey M Kidd, Tomas MarquesBonet, Gozde Aksay, Francesca Antonacci, Fereydoun Hormozdiari, Jacob O Kitzman, Carl Baker, Maika Malig, Onur
Mutlu, et al. Personalized copy number and segmental duplication maps using nextgeneration sequencing. Nature Genetics, 41(10):1061–1067, 2009.
[4] Meznah Almutairy and Eric Torng. The effects of sampling on the efficiency and
accuracy of k mer indexes: Theoretical and empirical comparisons using the human
genome. PLOS ONE, 12(7):e0179046, 2017.
[5] Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman.
Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, 1990.
[6] Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng
Zhang, Webb Miller, and David J Lipman. Gapped BLAST and PSIBLAST: A new
generation of protein database search programs. Nucleic Acids Research, 25(17):3389–
3402, 1997.
[7] Sasha K Ames, David A Hysom, Shea N Gardner, G Scott Lloyd, Maya B Gokhale,
and Jonathan E Allen. Scalable metagenomic taxonomy classification using a reference
genome database. Bioinformatics, 29(18):2253–2260, 2013.
[8] Richard Arratia, Louis Gordon, and Michael Waterman. An extreme value theory for
sequence matching. The annals of statistics, pages 971–993, 1986.
[9] Ergude Bao, Tao Jiang, Isgouhi Kaloshian, and Thomas Girke. SEED: Efficient clustering of nextgeneration sequences. Bioinformatics, 27(18):2502–2509, 2011.
[10] Alexander Behm, Shengyue Ji, Chen Li, and Jiaheng Lu. Spaceconstrained grambased indexing for efficient approximate string search. In Data Engineering, 2009.
ICDE’09. IEEE 25th International Conference on, pages 604–615. IEEE, 2009.
85
[11] Medha Bhagwat, Lynn Young, and Rex R Robison. Using BLAT to find sequence
similarity in closely related genomes. Current Protocols in Bioinformatics, pages 10–8,
2012.
[12] Jacek Blazewicz, Wojciech Frohmberg, Michal Kierzynka, Erwin Pesch, and Pawel
Wojciechowski. Protein alignment algorithms with an efficient backtracking routine
on multiple gpus. BMC bioinformatics, 12(1):181, 2011.
[13] Leonid Boytsov. Indexing methods for approximate dictionary searching: Comparative
analysis. Journal of Experimental Algorithmics (JEA), 16:1–1, 2011.
[14] Michael Brudno, Sanket Malde, Alexander Poliakov, Chuong B Do, Olivier Couronne,
Inna Dubchak, and Serafim Batzoglou. Glocal alignment: Finding rearrangements
during alignment. Bioinformatics, 19(suppl 1):i54–i62, 2003.
[15] Jeremy Buhler, Uri Keich, and Yanni Sun. Designing seeds for similarity search in
genomic DNA. In Proceedings of the Seventh Annual International Conference on
Research in Computational Molecular Biology, pages 67–75. ACM, 2003.
[16] Michael Burrows and David J Wheeler. A blocksorting lossless data compression
algorithm. Digital Equipment Corporation, 1994.
[17] Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L Madden. BLAST+: Architecture and applications.
BMC Bioinformatics, 10(1):421, 2009.
[18] Rayan Chikhi, Antoine Limasset, Shaun Jackman, Jared T Simpson, and Paul
Medvedev. On the representation of de Bruijn graphs. In Research in Computational
Molecular Biology, pages 35–55. Springer, 2014.
[19] Agnieszka Danek, Sebastian Deorowicz, and Szymon Grabowski. Indexes of large
genome collections on a PC. PLOS ONE, 9(10):e109384, 2014.
[20] AP Jason de Koning, Wanjun Gu, Todd A Castoe, Mark A Batzer, and David D
Pollock. Repetitive elements may comprise over twothirds of the human genome.
PLOS Genetic, 7(12):e1002384, 2011.
[21] Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, and Agnieszka DebudajGrabysz. KMC 2: Fast and resourcefrugal kmer counting. Bioinformatics,
31(10):1569–1576, 2015.
[22] Todd Z DeSantis, Keith Keller, Ulas Karaoz, Alexander V Alekseyenko, Navjeet NS
Singh, Eoin L Brodie, Zhiheng Pei, Gary L Andersen, and Niels Larsen. Simrank:
Rapid and sensitive generalpurpose kmer search tool. BMC ecology, 11(1):11, 2011.
86
[23] Naryttza N Diaz, Lutz Krause, Alexander Goesmann, Karsten Niehaus, and Tim W
Nattkemper. TACOA–taxonomic classification of environmental genomic fragments
using a kernelized nearest neighbor approach. BMC Bioinformatics, 10(1):56, 2009.
[24] Robert C Edgar. MUSCLE: Multiple sequence alignment with high accuracy and high
throughput. Nucleic acids research, 32(5):1792–1797, 2004.
[25] Robert C Edgar. Search and clustering orders of magnitude faster than BLAST.
Bioinformatics, 26(19):2460–2461, 2010.
[26] Peter Elias. Universal codeword sets and representations of the integers. Information
Theory, IEEE Transactions on, 21(2):194–203, 1975.
[27] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium
on, pages 390–398. IEEE, 2000.
[28] Mohammadreza Ghodsi, Bo Liu, and Mihai Pop. DNACLUST: Accurate and efficient
clustering of phylogenetic marker genes. BMC bioinformatics, 12(1):271, 2011.
[29] Louis Gordon, Mark F Schilling, and Michael S Waterman. An extreme value theory
for long head runs. Probability Theory and Related Fields, 72(2):279–287, 1986.
[30] Faraz Hach, Fereydoun Hormozdiari, Can Alkan, Farhad Hormozdiari, Inanc Birol,
Evan E Eichler, and S Cenk Sahinalp. mrsFAST: A cacheoblivious algorithm for
shortread mapping. Nature Methods, 7(8):576–577, 2010.
[31] Ayat Hatem, Doruk Bozdağ, Amanda E Toland, and Ümit V Çatalyürek. Benchmarking short sequence mapping tools. BMC Bioinformatics, 14(1):1, 2013.
[32] Liisa Holm, Chris Sander, et al. Mapping the protein universe. ScienceNew York
Then Washington, pages 595–602, 1996.
[33] Farhad Hormozdiari, Faraz Hach, S Cenk Sahinalp, Evan E Eichler, and Can Alkan.
Sensitive and fast mapping of dibase encoded reads. Bioinformatics, 27(14):1915–
1921, 2011.
[34] Kris Irizarry, Vlad Kustanovich, Cheng Li, Nik Brown, Stanley Nelson, Wing Wong,
and Christopher J Lee. Genomewide analysis of singlenucleotide polymorphisms in
human expressed sequences. Nature Genetics, 26(2):233–236, 2000.
[35] Kazutaka Katoh, Kazuharu Misawa, Keiichi Kuma, and Takashi Miyata. MAFFT: A
novel method for rapid multiple sequence alignment based on fast fourier transform.
Nucleic acids research, 30(14):3059–3066, 2002.
87
[36] W James Kent. BLATthe BLASTlike alignment tool. Genome Research, 12(4):656–
664, 2002.
[37] Zia Khan, Joshua S Bloom, Leonid Kruglyak, and Mona Singh. A practical algorithm
for finding maximal exact matches in large sequence datasets using sparse suffix arrays.
Bioinformatics, 25(13):1609–1616, 2009.
[38] Nilesh Khiste and Lucian Ilie. EMEM: Efficient computation of maximal exact
matches for very large genomes. Bioinformatics, 31(4):509–514, 2015.
[39] You Jung Kim, Andrew Boyd, Brian D Athey, and Jignesh M Patel. miBLAST:
Scalable evaluation of a batch of nucleotide sequence queries with blast. Nucleic Acids
Research, 33(13):4335–4344, 2005.
[40] Ian Korf and Warren Gish. MPBLAST: Improved blast performance with multiplexed
queries. Bioinformatics, 16(11):1052–1053, 2000.
[41] Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway,
Corina Antonescu, and Steven L Salzberg. Versatile and open software for comparing
large genomes. Genome Biology, 5(2):R12, 2004.
[42] Chen Li, Bin Wang, and Xiaochun Yang. VGRAM: Improving performance of approximate queries on string collections using variablelength grams. In Proceedings
of the 33rd international conference on Very large data bases, pages 303–314. VLDB
Endowment, 2007.
[43] Heng Li. Minimap and miniasm: Fast mapping and de novo assembly for noisy long
sequences. Bioinformatics, page btw152, 2016.
[44] Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows–
wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.
[45] Ruiqiang Li, Hongmei Zhu, Jue Ruan, Wubin Qian, Xiaodong Fang, Zhongbin Shi,
Yingrui Li, Shengting Li, Gao Shan, Karsten Kristiansen, et al. De novo assembly
of human genomes with massively parallel short read sequencing. Genome Research,
20(2):265–272, 2010.
[46] Weizhong Li and Adam Godzik. Cdhit: A fast program for clustering and comparing
large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 2006.
[47] Yang Li et al. MSPKmerCounter: A fast and memory efficient approach for kmer
counting. arXiv preprint arXiv:1505.06550, 2015.
88
[48] Yang Li, Pegah Kamousi, Fangqiu Han, Shengqi Yang, Xifeng Yan, and Subhash
Suri. Memory efficient minimum substring partitioning. In Proceedings of the VLDB
Endowment, volume 6(3), pages 169–180. VLDB Endowment, 2013.
[49] Hao Lin, Zefeng Zhang, Michael Q Zhang, Bin Ma, and Ming Li. ZOOM! zillions of
oligos mapped. Bioinformatics, 24(21):2431–2437, 2008.
[50] David J Lipman and William R Pearson. Rapid and sensitive protein similarity
searches. Science, 227(4693):1435–1441, 1985.
[51] Bin Ma, John Tromp, and Ming Li. PatternHunter: Faster and more sensitive homology search. Bioinformatics, 18(3):440–445, 2002.
[52] M McIlroy. Development of a spelling list. Communications, IEEE Transactions on,
30(1):91–99, 1982.
[53] Hamish McWilliam, Weizhong Li, Mahmut Uludag, Silvano Squizzato, Young Mi Park,
Nicola Buso, Andrew Peter Cowley, and Rodrigo Lopez. Analysis tool web services
from the emblebi. Nucleic acids research, 41(W1):W597–W600, 2013.
[54] Giles Miclotte, Mahdi Heydari, Piet Demeester, Pieter Audenaert, and Jan Fostier.
Jabba: Hybrid error correction for long sequencing reads using maximal exact matches.
In International Workshop on Algorithms in Bioinformatics, pages 175–188. Springer,
2015.
[55] Luciano Milanesi and Igor B Rogozin. ESTMAP: A system for expressed sequence tags
mapping on genomic sequences. NanoBioscience, IEEE Transactions on, 2(2):75–78,
2003.
[56] Aleksandr Morgulis, George Coulouris, Yan Raytselis, Thomas L Madden, Richa Agarwala, and Alejandro A Schäffer. Database indexing for production MegaBLAST
searches. Bioinformatics, 24(16):1757–1764, 2008.
[57] Aleksandr Morgulis, E Michael Gertz, Alejandro A Schäffer, and Richa Agarwala. A
fast and symmetric DUST implementation to mask lowcomplexity DNA sequences.
Journal of Computational Biology, 13(5):1028–1040, 2006.
[58] Aleksandr Morgulis, E Michael Gertz, Alejandro A Schäffer, and Richa Agarwala. WindowMasker: Windowbased masker for sequenced genomes. Bioinformatics, 22(2):134–
141, 2006.
[59] Narjes S Movahedi, Elmirasadat Forouzmand, and Hamidreza Chitsaz. De novo
coassembly of bacterial genomes from multiple single cells. In Bioinformatics and
Biomedicine (BIBM), 2012 IEEE International Conference on, pages 1–5. IEEE, 2012.
89
[60] Gonzalo Navarro. A guided tour to approximate string matching. ACM computing
surveys (CSUR), 33(1):31–88, 2001.
[61] Gonzalo Navarro, Ricardo A. BaezaYates, Erkki Sutinen, and Jorma Tarhio. Indexing
methods for approximate string matching. IEEE Data Engineering Bulletin, 24(4):19–
27, 2001.
[62] Gonzalo Navarro, Erkki Sutinen, Jani Tanninen, and Jorma Tarhio. Indexing text with
approximate qgrams. In Combinatorial Pattern Matching, pages 350–363. Springer,
2000.
[63] Gonzalo Navarro, Erkki Sutinen, and Jorma Tarhio. Indexing text with approximate
qgrams. Journal of Discrete Algorithms, 3(2):157–175, 2005.
[64] Saul B Needleman and Christian D Wunsch. A general method applicable to the
search for similarities in the amino acid sequence of two proteins. Journal of molecular
biology, 48(3):443–453, 1970.
[65] Pauline C Ng and Steven Henikoff. Predicting deleterious amino acid substitutions.
Genome Research, 11(5):863–874, 2001.
[66] Zemin Ning, Anthony J Cox, and James C Mullikin. SSAHA: A fast search method
for large DNA databases. Genome Research, 11(10):1725–1729, 2001.
[67] Panagiotis Papapetrou. Embeddingbased subsequence matching in large sequence
databases. PhD thesis, Citeseer, 2010.
[68] Panagiotis Papapetrou, Vassilis Athitsos, George Kollios, and Dimitrios Gunopulos.
Referencebased alignment in large sequence databases. Proceedings of the VLDB
Endowment, 2(1):205–216, 2009.
[69] William R Pearson and David J Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85(8):2444–2448, 1988.
[70] Jason Pell, Arend Hintze, Rosangela CaninoKoning, Adina Howe, James M Tiedje,
and C Titus Brown. Scaling metagenome sequence assembly with probabilistic de
Bruijn graphs. Proceedings of the National Academy of Sciences, 109(33):13272–13277,
2012.
[71] Pierre Peterlongo and Rayan Chikhi. Mapsembler, targeted and micro assembly of
large NGS datasets on a desktop computer. BMC Bioinformatics, 13(1):48, 2012.
[72] Nicolas Philippe, Mikael Salson, Thierry Lecroq, Martine Leonard, Therese Commes,
and Eric Rivals. Querying large read collections in main memory: A versatile data
structure. BMC bioinformatics, 12(1):242, 2011.
90
[73] Peter Rice, Ian Longden, Alan Bleasby, et al. EMBOSS: The european molecular
biology open software suite. Trends in genetics, 16(6):276–277, 2000.
[74] Todd Richmond. Gene recognition via spliced alignment. Genome Biology, 1(1):reports233, 2000.
[75] Michael Roberts, Wayne Hayes, Brian R Hunt, Stephen M Mount, and James A Yorke.
Reducing storage requirements for biological sequence comparison. Bioinformatics,
20(18):3363–3369, 2004.
[76] Michael Roberts, Brian R Hunt, James A Yorke, Randall A Bolanos, and Arthur L
Delcher. A preprocessor for shotgun assembly of large genomes. Journal of Computational Biology, 11(4):734–752, 2004.
[77] Stephen M Rumble, Phil Lacroute, Adrian V Dalca, Marc Fiume, Arend Sidow, and
Michael Brudno. Shrimp: Accurate mapping of short colorspace reads. PLOS ONE
Computational Biology, 5(5):e1000386, 2009.
[78] Ravi Sachidanandam, David Weissman, Steven C Schmidt, Jerzy M Kakol, Lincoln D
Stein, Gabor Marth, Steve Sherry, James C Mullikin, Beverley J Mortimore, David L
Willey, et al. A map of human genome sequence variation containing 1.42 million single
nucleotide polymorphisms. Nature, 409(6822):928–933, 2001.
[79] Mark F Schilling. The surprising predictability of long runs. Mathematics Magazine,
85(2):141–149, 2012.
[80] Saul Schleimer, Daniel S Wilkerson, and Alex Aiken. Winnowing: local algorithms
for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD international
conference on Management of data, pages 76–85. ACM, 2003.
[81] Ilya N Shindyalov and Philip E Bourne. Protein structure alignment by incremental
combinatorial extension (ce) of the optimal path. Protein engineering, 11(9):739–747,
1998.
[82] Jared T Simpson and Richard Durbin. Efficient construction of an assembly string
graph using the FMindex. Bioinformatics, 26(12):i367–i373, 2010.
[83] Jared T Simpson, Kim Wong, Shaun D Jackman, Jacqueline E Schein, Steven JM
Jones, and Inanç Birol. ABySS: A parallel assembler for short read sequence data.
Genome Research, 19(6):1117–1123, 2009.
[84] Temple F Smith and Michael S Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
91
[85] Yanni Sun and Jeremy Buhler. Designing multiple simultaneous seeds for DNA similarity search. Journal of Computational Biology, 12(6):847–861, 2005.
[86] Erkki Sutinen and Jorma Tarhio. On using qgram locations in approximate string
matching. Algorithms – ESA’95, pages 327–340, 1995.
[87] William R Taylor, Tomas P Flores, and Christine A Orengo. Multiple protein structure
alignment. Protein Science, 3(10):1858–1870, 1994.
[88] Julie D Thompson, Desmond G Higgins, and Toby J Gibson. CLUSTAL W: improving
the sensitivity of progressive multiple sequence alignment through sequence weighting, positionspecific gap penalties and weight matrix choice. Nucleic acids research,
22(22):4673–4680, 1994.
[89] Niko Välimäki and Eric Rivals. Scalable and versatile kmer indexing for highthroughput sequencing data. In Bioinformatics Research and Applications, pages 237–
248. Springer, 2013.
[90] Jayendra Venkateswaran, Deepak Lachwani, Tamer Kahveci, and Christopher Jermaine. Referencebased indexing of sequence databases. In Proceedings of the 32nd
international conference on Very large data bases, pages 906–917. VLDB Endowment,
2006.
[91] Michaël Vyverman, Bernard De Baets, Veerle Fack, and Peter Dawyndt. essaMEM:
Finding maximal exact matches using enhanced sparse suffix arrays. Bioinformatics,
29(6):802–804, 2013.
[92] Michaël Vyverman, Bernard De Baets, Veerle Fack, and Peter Dawyndt. A long
fragment aligner called ALFALFA. BMC Bioinformatics, 16(1):159, 2015.
[93] Sebastian Wandelt and Ulf Leser. QGramProjector: Qgram projection for indexing
highlysimilar strings. In Advances in Databases and Information Systems, pages 260–
273. Springer, 2013.
[94] Sebastian Wandelt and Ulf Leser. MRCSI: Compressing and searching string collections
with multiple references. Proceedings of the VLDB Endowment, 8(5):461–472, 2015.
[95] Sebastian Wandelt, Johannes Starlinger, Marc Bux, and Ulf Leser. RCSI: Scalable
similarity search in thousand(s) of genomes. Proceedings of the VLDB Endowment,
6(13):1534–1545, August 2013.
[96] David Weese, AnneKatrin Emde, Tobias Rausch, Andreas Döring, and Knut Reinert.
RazerS: Fast read mapping with sensitivity control. Genome Research, 19(9):1646–
1654, 2009.
92
[97] Thomas Wiehe, Steffi GebauerJung, Thomas MitchellOlds, and Roderic Guigo. SGP1: Prediction and validation of homologous genes based on sequence alignments.
Genome Research, 11(9):1574–1583, 2001.
[98] Derrick E Wood and Steven L Salzberg. Kraken: Ultrafast metagenomic sequence
classification using exact alignments. Genome Biology, 15(3):R46, 2014.
[99] Thomas D Wu and Colin K Watanabe. GMAP: A genomic mapping and alignment
program for mRNA and EST sequences. Bioinformatics, 21(9):1859–1875, 2005.
[100] Hongyi Xin, Donghyuk Lee, Farhad Hormozdiari, Samihan Yedkar, Onur Mutlu, and
Can Alkan. Accelerating read mapping with fasthash. BMC Genomics, 14(Suppl
1):S13, 2013.
[101] Chengxi Ye, Zhanshan Sam Ma, Charles H Cannon, Mihai Pop, and W Yu Douglas.
Exploiting sparseness in de novo genome assembly. BMC Bioinformatics, 13(6):1, 2012.
[102] Daniel R Zerbino and Ewan Birney. Velvet: Algorithms for de novo short read assembly
using de bruijn graphs. Genome Research, 18(5):821–829, 2008.
[103] Hongyu Zhang. Alignment of BLAST highscoring segment pairs based on the longest
increasing subsequence algorithm. Bioinformatics, 19(11):1391–1396, 2003.
[104] Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller. A greedy algorithm
for aligning DNA sequences. Journal of Computational Biology, 7(12):203–214, 2000.
[105] Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM computing
surveys (CSUR), 38(2):6, 2006.
93