You are here
Search results
(1 - 5 of 5)
- Title
- Studying the effects of sampling on the efficiency and accuracy of k-mer indexes
- Creator
- Almutairy, Meznah
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- Description
-
"Searching for local alignments is a critical step in many bioinformatics applications and pipelines. This search process is often sped up by finding shared exact matches of a minimum length. Depending on the application, the shared exact matches are extended to maximal exact matches, and these are often extended further to local alignments by allowing mismatches and/or gaps. In this dissertation, we focus on searching for all maximal exact matches (MEMs) and all highly similar local...
Show more"Searching for local alignments is a critical step in many bioinformatics applications and pipelines. This search process is often sped up by finding shared exact matches of a minimum length. Depending on the application, the shared exact matches are extended to maximal exact matches, and these are often extended further to local alignments by allowing mismatches and/or gaps. In this dissertation, we focus on searching for all maximal exact matches (MEMs) and all highly similar local alignments (HSLAs) between a query sequence and a database of sequences. We focus on finding MEMs and HSLAs over nucleotide sequences. One of the most common ways to search for all MEMs and HSLAs is to use a k-mer index such as BLAST. A major problem with k-mer indexes is the space required to store the lists of all occurrences of all k-mers in the database. One method for reducing the space needed, and also query time, is sampling where only some k-mer occurrences are stored. We classify sampling strategies used to create k-mer indexes in two ways: how they choose k-mers and how many k-mers they choose. The k-mers can be chosen in two ways: fixed sampling and minimizer sampling. A sampling method might select enough k-mers such that the k-mer index reaches full accuracy. We refer to this sampling as hard sampling. Alternatively, a sampling method might select fewer k-mers to reduce the index size even further but the index does not guarantee full accuracy. We refer to this sampling as soft sampling. In the current literature, no systematic study has been done to compare the different sampling methods and their relative benefits/weakness. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. Also, most previous work uses hard sampling, in which all similar sequences are guaranteed to be found. In contrast, we study soft sampling, which further reduces the k-mer index at a cost of decreasing query accuracy. We systematically compare fixed and minimizer sampling to find all MEMs between large genomes such as the human genome and the mouse genome. We also study soft sampling to find all HSLAs using the NCBI BLAST tool with the human genome and human ESTs. We use BLAST, since it is the most widely used tool to search for HSLAs. We compared the sampling methods with respect to index size, query time, and query accuracy. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. When identifying HSLAs, we find that soft sampling significantly reduces both index size and query time with relatively small losses in query accuracy. The results demonstrate that soft sampling is a simple but effective strategy for performing efficient searches for HSLAs. We also provide a new model for sampling with BLAST that predicts empirical retention rates with reasonable accuracy."--Pages ii-iii.
Show less
- Title
- Exploiting node mobility for energy optimization in wireless sensor networks
- Creator
- El-Moukaddem, Fatme Mohammad
- Date
- 2012
- Collection
- Electronic Theses & Dissertations
- Description
-
Wireless Sensor Networks (WSNs) have become increasingly available for data-intensive applications such as micro-climate monitoring, precision agriculture, and audio/video surveillance. A key challenge faced by data-intensive WSNs is to transmit the sheer amount of data generated within an application's lifetime to the base station despite the fact that sensor nodes have limited power supplies such as batteries or small solar panels. The availability of numerous low-cost robotic units (e.g....
Show moreWireless Sensor Networks (WSNs) have become increasingly available for data-intensive applications such as micro-climate monitoring, precision agriculture, and audio/video surveillance. A key challenge faced by data-intensive WSNs is to transmit the sheer amount of data generated within an application's lifetime to the base station despite the fact that sensor nodes have limited power supplies such as batteries or small solar panels. The availability of numerous low-cost robotic units (e.g. Robomote and Khepera) has made it possible to construct sensor networks consisting of mobile sensor nodes. It has been shown that the controlled mobility offered by mobile sensors can be exploited to improve the energy efficiency of a network.In this thesis, we propose schemes that use mobile sensor nodes to reduce the energy consumption of data-intensive WSNs. Our approaches differ from previous work in two main aspects. First, our approaches do not require complex motion planning of mobile nodes, and hence can be implemented on a number of low-cost mobile sensor platforms. Second, we integrate the energy consumption due to both mobility and wireless communications into a holistic optimization framework.We consider three problems arising from the limited energy in the sensor nodes. In the first problem, the network consists of mostly static nodes and contains only a few mobile nodes. In the second and third problems, we assume essentially that all nodes in the WSN are mobile. We first study a new problem called max-data mobile relay configuration (MMRC) that finds the positions of a set of mobile sensors, referred to as relays, that maximize the total amount of data gathered by the network during its lifetime. We show that the MMRC problem is surprisingly complex even for a trivial network topology due to the joint consideration of the energy consumption of both wireless communication and mechanical locomotion. We present optimal MMRC algorithms and practical distributed implementations for several important network topologies and applications. Second, we consider the problem of minimizing the total energy consumption of a network. We design an iterative algorithm that improves a given configuration by relocating nodes to new positions. We show that this algorithm converges to the optimal configuration for the given transmission routes. Moreover, we propose an efficient distributed implementation that does not require explicit synchronization. Finally, we consider the problem of maximizing the lifetime of the network. We propose an approach that exploits the mobility of the nodes to balance the energy consumption throughout the network. We develop efficient algorithms for single and multiple round approaches. For all three problems, we evaluate the efficiency of our algorithms through simulations. Our simulation results based on realistic energy models obtained from existing mobile and static sensor platforms show that our approaches significantly improve the network's performance and outperform existing approaches.
Show less
- Title
- A sorted partitioning approach to fast and scalable dynamic packet classification
- Creator
- Yingchareonthawornchai, Sorrachai
- Date
- 2019
- Collection
- Electronic Theses & Dissertations
- Description
-
"The advent of Software-Defined Networking (SDN) leads to two key challenges for packet classification on the dramatically increased dynamism and dimensionality. Although packet classification is a well-studied problem, no existing solution satisfies these new requirements without sacrificing classification speed. Decision tree methods, such as HyperCuts, EffiCuts, and SmartSplit can achieve high-speed packet classification, but support neither fast updates nor high dimensionality. The Tuple...
Show more"The advent of Software-Defined Networking (SDN) leads to two key challenges for packet classification on the dramatically increased dynamism and dimensionality. Although packet classification is a well-studied problem, no existing solution satisfies these new requirements without sacrificing classification speed. Decision tree methods, such as HyperCuts, EffiCuts, and SmartSplit can achieve high-speed packet classification, but support neither fast updates nor high dimensionality. The Tuple Space Search (TSS) algorithm used in Open vSwitch achieves fast updates and high dimensionality but not high-speed packet classification. We propose a hybrid approach, PartitionSort, that combines the benefits of both TSS and decision trees achieving high-speed packet classification, fast updates and high dimensionality. A key to PartitionSort is a novel notion of ruleset sortability that provides two key benefits. First, it results in far fewer partitions than TSS. Second, it allows the use of Multi-dimensional Interval Trees to achieve logarithmic classification and update time for each sortable ruleset partition. Our extensive experimental results show that PartitionSort is an order of magnitude faster than TSS in classifying packets while achieving comparable update time. PartitionSort is a few orders of magnitude faster in construction time than SmartSplit, a state-of-the-art decision tree classifier, while achieving a competitive classification time. Finally, PartitionSort is scalable to an arbitrary number of fields and requires only linear space."-Page ii.
Show less
- Title
- TCAM reduction techniques for all-match classifiers
- Creator
- Wender, Nicholas Jon
- Date
- 2012
- Collection
- Electronic Theses & Dissertations
- Description
-
Network intrusion detection systems require all-match packet classication, where all rules matching a packet are reported by the system. The problem of eciently reporting all matching rules is known as the all-match optimization problem. One solution is to convert an all-match classier into a first-match classier (in which only the rst classier rule that matches a packet is reported), and use ternary content addressable memory (TCAM) forpacket classication.In this thesis, we evaluate two...
Show moreNetwork intrusion detection systems require all-match packet classication, where all rules matching a packet are reported by the system. The problem of eciently reporting all matching rules is known as the all-match optimization problem. One solution is to convert an all-match classier into a first-match classier (in which only the rst classier rule that matches a packet is reported), and use ternary content addressable memory (TCAM) forpacket classication.In this thesis, we evaluate two classier minimization approaches. First, we consider the use of all-match classier-specic optimization algorithms. Second, we use state-of-the-art first-match classier optimization algorithms in conjunction with all-match algorithms. Our results indicate the appropriate approach is related to the number of TCAM chips available for classication. When using one TCAM chip, we attain 70.85% TCAM space savings using rst-match classier optimization algorithms instead of all-match classier optimization algorithms. When using multiple TCAM chips, we found that it is best to use all-match specic optimization algorithms.
Show less
- Title
- Algorithms for deep packet inspection
- Creator
- Patel, Jignesh
- Date
- 2012
- Collection
- Electronic Theses & Dissertations
- Description
-
The core operation in network intrusion detection and prevention systems is Deep Packet Inspection (DPI), in which each security threat is represented as a signature, and the payload of each data packet is matched against the set of current security threat signatures. DPI is also used for other networking applications like advanced QoS mechanisms, protocol identification etc.. In the past, attack signatures were specified as strings, and a great deal of research has been done in string...
Show moreThe core operation in network intrusion detection and prevention systems is Deep Packet Inspection (DPI), in which each security threat is represented as a signature, and the payload of each data packet is matched against the set of current security threat signatures. DPI is also used for other networking applications like advanced QoS mechanisms, protocol identification etc.. In the past, attack signatures were specified as strings, and a great deal of research has been done in string matching for network applications. Today most DPI systems use Regular Expression (RE) to represent signatures. RE matching is more diffcult than string matching, and current string matching solutions don't work well for REs. RE matching for networking applications is diffcult for several reasons. First, the DPI application is usually implemented in network devices, which have limited computing resources. Second, as new threats are discovered, size of the signature set grows over time. Last, the matching needs to be done at network speeds, the growth of which out paces improvements in computing speed; so there is a need for novel solutions that can deliver higher throughput. So RE matching for DPI is a very important and active research area.In our research, we investigate the existing methods proposed for RE matching, identify their limitations, and propose new methods to overcome these limitations. RE matching remains a fundamentally challenging problem due to the diffculty in compactly encoding DFA. While the DFA for any one RE is typically small, the DFA that corresponds to the entire set of REs is usually too large to be constructed or deployed. To address this issue, many alternative automata implementations that compress the size of the final automaton have been proposed. However, previously proposed automata construction algorithms employ a “Union then Minimize” framework where the automata for each RE are first joined before minimization occurs. This leads to expensive minimization on a large automata, and a large intermediate memory footprint. We propose a “Minimize then Union” framework for constructing compact alternative automata, which minimizes smaller automata first before combining them. This approach required much less time and memory, allowing us to handle a much larger RE set. Prior hardware based RE matching algorithms typically use FPGA. The drawback of FPGA is that resynthesizing and updating FPGA circuitry to handle RE updates is slow and diffcult. We propose the first hardware-based RE matching approach that uses Ternary Content Addressable Memory (TCAM). TCAMs have already been widely used in modern networking devices for tasks such as packet classification, so our solutions can be easily deployed. Our methods support easy RE updates, and we show that we can achieve very high throughput. The main reason combined DFAs for multiple REs grow exponentially in size is because of replication of states. We developed a new overlay automata model which exploit this replication to compress the size of the DFA. The idea is to group together the replicated DFA structures instead of repeating them multiple times. The result is that we get a final automata size that is close to that of a NFA (which is linear in the size of the RE set), and simultaneously achieve fast deterministic matching speed of a DFA.
Show less