You are here
(1 - 1 of 1)
- Studying the effects of sampling on the efficiency and accuracy of k-mer indexes
- Almutairy, Meznah
- Electronic Theses & Dissertations
"Searching for local alignments is a critical step in many bioinformatics applications and pipelines. This search process is often sped up by finding shared exact matches of a minimum length. Depending on the application, the shared exact matches are extended to maximal exact matches, and these are often extended further to local alignments by allowing mismatches and/or gaps. In this dissertation, we focus on searching for all maximal exact matches (MEMs) and all highly similar local...
Show more"Searching for local alignments is a critical step in many bioinformatics applications and pipelines. This search process is often sped up by finding shared exact matches of a minimum length. Depending on the application, the shared exact matches are extended to maximal exact matches, and these are often extended further to local alignments by allowing mismatches and/or gaps. In this dissertation, we focus on searching for all maximal exact matches (MEMs) and all highly similar local alignments (HSLAs) between a query sequence and a database of sequences. We focus on finding MEMs and HSLAs over nucleotide sequences. One of the most common ways to search for all MEMs and HSLAs is to use a k-mer index such as BLAST. A major problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some k-mer occurrences are stored. We classify sampling strategies used to create k-mer indexes in two ways: how they choose k-mers and how many k-mers they choose. The k-mers can be chosen in two ways: fixed sampling and minimizer sampling. A sampling method might select enough k-mers such that the k-mer index reaches full accuracy. We refer to this sampling as hard sampling. Alternatively, a sampling method might select fewer k-mers to reduce the index size even further but the index does not guarantee full accuracy. We refer to this sampling as soft sampling. In the current literature, no systematic study has been done to compare the different sampling methods and their relative benefits/weakness. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. Also, most previous work uses hard sampling, in which all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the k-mer index at a cost of decreasing query accuracy. We systematically compare fixed and minimizer sampling to find all MEMs between large genomes such as the human genome and the mouse genome. We also study soft sampling to find all HSLAs using the NCBI BLAST tool with the human genome and human ESTs. We use BLAST, since it is the most widely used tool to search for HSLAs. We compared the sampling methods with respect to index size, query time, and query accuracy. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. The results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy."--Pages ii-iii.