Search results (21 - 31 of 31)
- Title
- Near duplicate image search
- Creator
- Li, Fengjie
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
- Information retrieval addresses the fundamental problem of how to identify the objects in a database that satisfy the information needs of users. In the face of information overload, the major challenge in search algorithm design is to ensure that useful information can be found both accurately and efficiently in large databases. To address this challenge, different indexing and retrieval methods have been proposed for different types of data, namely sparse data (e.g. documents), dense data (e.g. dense feature vectors), and bags-of-features (e.g. images represented by local features). For sparse data, inverted indexes and document retrieval models have proven very effective for large scale retrieval problems. For dense data and bag-of-features data, however, there are still open problems. For example, Locality Sensitive Hashing, a state-of-the-art method for searching high dimensional vectors, often fails to make a good tradeoff between precision and recall: it tends to achieve high precision with low recall, or vice versa. The bag-of-words model, a popular approach for searching objects represented as bags-of-features, has limited performance because of the information loss during the quantization procedure. Since the general problem of searching objects represented as dense vectors and bags-of-features may be too challenging, this dissertation focuses on near duplicate search, in which the matched objects are almost identical to the query. By effectively exploiting the statistical properties of near duplicates, we can design more effective indexing schemes and search algorithms. Thus, the focus of this dissertation is to design new indexing methods and retrieval algorithms for near duplicate search in large scale databases that accurately capture data similarity and deliver more accurate and efficient search. Below, we summarize the main contributions of this dissertation. Our first contribution is a new algorithm for searching near duplicate bag-of-features data. The proposed algorithm, named random seeding quantization, is more efficient at generating bag-of-words representations for near duplicate images. The new scheme is motivated by approximating the optimal partial matching between bags-of-features, and thus produces a bag-of-words representation that captures the true similarities of the data, leading to more accurate and efficient retrieval of bag-of-features data. Our second contribution, termed Random Projection Filtering (RPF), is a search algorithm designed for efficient near duplicate vector search. By explicitly exploiting the statistical properties of near duplicates, the algorithm projects high dimensional vectors into a lower dimensional space and filters out irrelevant items. This effective filtering procedure makes RPF more accurate and efficient at identifying near duplicate objects in databases. Our third contribution is to develop and evaluate a new randomized range search algorithm for near duplicate vectors in high dimensional spaces, termed Random Projection Search. Unlike RPF, this algorithm is suitable for a wider range of applications because it does not require sparsity constraints for high search accuracy. The key idea is to project both the data points and the query point into a one dimensional space by a random projection, and to perform a one dimensional range search, using binary search, to find the subset of data points that lie within a given range of the query. We prove a theoretical guarantee for the proposed algorithm and evaluate its empirical performance on a dataset of 1.1 billion image features.
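The abstract describes Random Projection Search concretely enough to sketch: project the data and the query onto a random direction, then binary-search the sorted projections. The Python below is a minimal illustration of that one idea under assumed parameters (a single projection and a caller-chosen radius r); it is not the dissertation's implementation, which supplies the analysis behind its accuracy guarantee.

```python
import bisect
import numpy as np

def build_rps_index(points, rng):
    """Project all points onto one random unit direction and sort the keys."""
    direction = rng.standard_normal(points.shape[1])
    direction /= np.linalg.norm(direction)
    keys = points @ direction              # 1-D projections
    order = np.argsort(keys)
    return direction, keys[order], order

def rps_candidates(query, r, direction, sorted_keys, order):
    """Indices of points whose projection lies within r of the query's."""
    q = query @ direction
    lo = bisect.bisect_left(sorted_keys, q - r)    # binary search, O(log n)
    hi = bisect.bisect_right(sorted_keys, q + r)
    return order[lo:hi]

rng = np.random.default_rng(0)
data = rng.standard_normal((10000, 128))
direction, keys, order = build_rps_index(data, rng)
near_dup = data[0] + 0.01 * rng.standard_normal(128)
print(len(rps_candidates(near_dup, 0.5, direction, keys, order)))
```

A single projection only prunes: points far apart in the original space can still project close together, so a practical version would verify candidates with exact distances and might intersect candidates from several independent projections.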
- Title
- The evolutionary potential of populations on complex fitness landscapes
- Creator
- Bryson, David Michael
- Date
- 2012
- Collection
- Electronic Theses & Dissertations
- Description
- Evolution is a highly contingent process, where the quality of the solutions produced is affected by many factors. I explore and describe the contributions of three such aspects that influence overall evolutionary potential: the prior history of a population, the type and frequency of mutations that the organisms are subject to, and the composition of the underlying genetic hardware. I have systematically tested changes to a digital evolution system, Avida, measuring evolutionary potential in seven different computational environments that range in the complexity of their underlying fitness landscapes. I have examined the trends and general principles that these measurements demonstrate and used my results to optimize the evolutionary potential of the system, broadly enhancing performance. The results of this work show that history and mutation rate play significant roles in evolutionary potential, but the final fitness levels of populations are remarkably stable under substantial changes in the genetic hardware and a broad range of mutation types.
- Title
- Scheduling for CPU Packing and node shutdown to reduce the energy consumption of high performance computing centers
- Creator
- Vudayagiri, Srikanth Phani
- Date
- 2010
- Collection
- Electronic Theses & Dissertations
- Description
- During the past decade, there has been tremendous growth in the high performance computing and data center arenas. The huge energy requirements in these sectors have prompted researchers to investigate possible ways to reduce energy consumption. Reducing energy consumption is beneficial to an organization not only economically but also environmentally. In this thesis, we focus our attention on high performance scientific computing clusters. We first perform experiments with the CPU Packing feature available in Linux using programs from the SPEC CPU2000 suite. We then look at an energy-aware scheduling algorithm for the cluster that assumes CPU Packing is enabled on all the nodes. Using simulations, we compare the scheduling done by this algorithm to that done by the existing, commercial Moab scheduler in the cluster. We experiment with the Moab Green Computing feature and, based on our observations, we implement the shutdown mechanism used by Moab in our simulations. Our results show that Moab Green Computing could provide about a 13% energy savings on average for the HPC cluster without any noticeable decrease in the performance of jobs.
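The abstract names two cooperating mechanisms: CPU Packing (consolidating jobs onto fewer nodes) and shutting down idle nodes. The toy greedy policy below sketches how those two pieces interact in a scheduler; the 8-core node model and the greedy heuristic are invented for illustration and are not Moab's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Node:
    cores: int = 8
    used: int = 0
    powered: bool = False

def pack_job(nodes, job_cores):
    """Greedy packing: fill already-powered nodes first, waking a
    sleeping node only when no powered node has room."""
    for wake_allowed in (False, True):
        for n in nodes:
            if (n.powered or wake_allowed) and n.cores - n.used >= job_cores:
                n.powered = True
                n.used += job_cores
                return n
    return None  # no capacity: the job waits in the queue

def shutdown_idle(nodes):
    """Shutdown step: power off any node with no running work."""
    for n in nodes:
        if n.powered and n.used == 0:
            n.powered = False

cluster = [Node() for _ in range(4)]
for cores in (4, 4, 2):            # three jobs land on two nodes, not three
    pack_job(cluster, cores)
shutdown_idle(cluster)
print(sum(n.powered for n in cluster), "node(s) drawing power")
```

Packing concentrates load so that the shutdown step has idle nodes to harvest; the energy saving comes from the interplay, not from either mechanism alone.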
- Title
- Statistical and learning algorithms for the design, analysis, measurement, and modeling of networking and security systems
- Creator
- Shahzad, Muhammad (College teacher)
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
- "The goal of this thesis is to develop statistical and learning algorithms for the design, analysis, measurement, and modeling of networking and security systems with specific focus on RFID systems, network performance metrics, user security, and software security. Next, I give a brief overview of these four areas of focus." -- Abstract.
- Title
- Consistency for distributed data stores
- Creator
- Roohitavaf, Mohammad
- Date
- 2019
- Collection
- Electronic Theses & Dissertations
- Geo-replicated data stores are an integral part of today's Internet services. Service providers usually replicate their data on different data centers worldwide to achieve higher performance and data durability. However, with this approach, the consistency between replicas becomes a concern. At the highest level of consistency, we want strong consistency, which provides the illusion of having only a single copy of the data; however, strong consistency comes with high performance and availability costs. In this work, we focus on weaker consistency models that allow us to provide high performance and availability while preventing certain inconsistencies. Session guarantees (also known as client-centric consistency models) are one such class of weaker consistency models, preventing some inconsistencies from occurring within a client session. We provide modified versions of session guarantees that, unlike traditional session guarantees, do not cause the problem of slowdown cascade in partitioned systems. We present a protocol that provides session guarantees for eBay NuKV, a key-value store designed for eBay's internal services with high performance and availability requirements. We utilize Hybrid Logical Clocks (HLCs) to provide wait-free write operations while providing session guarantees. Our experiments, conducted on the eBay cloud platform, show that our protocol does not cause significant overhead compared with eventual consistency. In addition to session guarantees, a large portion of this dissertation is dedicated to causal consistency. Causal consistency is especially interesting, as it has been proven to be the strongest consistency model that allows the system to remain available even during network partitions. We provide the CausalSpartanX protocol that, using HLCs, improves current time-based protocols by eliminating the effect of clock anomalies such as clock skew between servers. CausalSpartanX also supports non-blocking causally consistent read-only transactions that allow applications to read a set of values that are causally consistent with each other. Read-only transactions provide a powerful abstraction that cannot be replaced by a set of basic read operations. CausalSpartanX, like other causal consistency protocols, assumes sticky clients (i.e. clients that never change the replica they access). We prove that if one wants immediate visibility for local updates in a data center, clients have to be sticky. Based on the structure of CausalSpartanX, we provide our Adaptive Causal Consistency Framework (ACCF), a configurable framework that generalizes current consistency protocols. ACCF provides a basis for designing adaptive protocols that constantly monitor the system and clients' usage patterns and adjust themselves to provide better performance and availability. Finally, we present our Distributed Key-Value Framework (DKVF), a framework for rapid prototyping and benchmarking of consistency protocols. DKVF lets protocol designers focus only on their high-level protocols, delegating all lower-level communication and storage tasks to the framework.
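The wait-free writes rest on Hybrid Logical Clocks, whose update rules are published (Kulkarni et al.). The sketch below implements just those rules; it is not the NuKV protocol or CausalSpartanX, and the nanosecond wall-clock source is an assumption.

```python
import time
from dataclasses import dataclass

@dataclass
class HLC:
    """Hybrid Logical Clock: a (l, c) pair that respects causality while
    staying close to physical time."""
    l: int = 0  # highest physical time seen so far, in ns
    c: int = 0  # logical counter breaking ties within one physical tick

    def send(self):
        """Timestamp a local event or an outgoing message."""
        pt = time.time_ns()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def receive(self, ml, mc):
        """Merge a remote timestamp (ml, mc) into the local clock."""
        pt = time.time_ns()
        l_new = max(self.l, ml, pt)
        if l_new == self.l == ml:
            self.c = max(self.c, mc) + 1
        elif l_new == self.l:
            self.c += 1
        elif l_new == ml:
            self.c = mc + 1
        else:
            self.c = 0
        self.l = l_new
        return (self.l, self.c)
```

Timestamps compare lexicographically: every send or receive yields a value strictly greater than everything the node has seen, which preserves causal order without ever blocking a write on clock synchronization.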
- Title
- Exploiting cross-technology interference for efficient network services in wireless systems
- Creator
- Zhou, Ruogu
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- In the last decade, we have witnessed the wide adoption of a variety of wireless technologies like WiFi, Cellular, Bluetooth, ZigBee, and Near-field Communication (NFC). However, the fast growth of wireless networks generates significant cross-technology interference, which leads to network performance degradation and potential security breaches. In this dissertation, we propose two novel physical layer techniques to deal with the interference, and improve the performance and security of sensor networks and mobile systems, respectively. First, we exploit WiFi interference as a "blessing" in the design of sensor networks and develop novel WiFi interference detection techniques for ZigBee sensors. Second, utilizing these techniques, we design three efficient network services: WiFi discovery, which detects the existence of nearby WiFi networks using ZigBee sensors; WiFi performance monitoring, which measures and tracks the performance of WiFi networks using a ZigBee sensor network; and time synchronization, which provides synchronized clocks for sensor networks based on WiFi signals. Third, we design a novel, noninvasive NFC security system called nShield to reduce the transmission power of NFC radios, which protects NFC against passive eavesdropping. nShield implements a novel adaptive RF attenuation scheme, in which the extra RF energy of NFC transmissions is determined and absorbed by nShield. At the same time, nShield scavenges the extra RF energy to sustain perpetual operation. Together with its extremely low-power design, this enables nShield to provide the host uninterrupted protection against malicious eavesdropping. The above systems are implemented and extensively evaluated on a testbed of sensor networks and smartphones.
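The abstract does not say how the ZigBee-side WiFi detectors work, so the following is only a plausible toy built on one known signal property: WiFi access points beacon roughly every 102.4 ms by default, so near-periodic energy at that lag in an RSSI trace hints at a WiFi network. The sampling rate, threshold, and autocorrelation test are all invented for illustration.

```python
import numpy as np

def looks_like_wifi(rssi, fs=1000, beacon_ms=102.4, tol_ms=3.0):
    """Heuristic: declare WiFi if the RSSI autocorrelation has a strong
    peak near the default WiFi beacon interval. Illustrative only; the
    dissertation's detectors are more sophisticated."""
    x = np.asarray(rssi, dtype=float)
    x -= x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    lag_ms = np.arange(len(ac)) * 1000.0 / fs
    win = np.abs(lag_ms - beacon_ms) <= tol_ms
    return ac[win].max() > 0.3 * ac[0]                 # periodic peak?

# Synthetic RSSI trace at 1 kHz with an energy burst every ~102 ms.
rng = np.random.default_rng(2)
trace = rng.normal(0, 1, 2048)
trace[::102] += 10.0
print(looks_like_wifi(trace))
```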
- Title
- Geometric and topological modeling techniques for large and complex shapes
- Creator
- Feng, Xin
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- The past few decades have witnessed incredible advancements in modeling, digitizing, and visualizing techniques for three-dimensional shapes. Those advancements led to an explosion in the number of three-dimensional models being created for design, manufacture, architecture, medical imaging, etc. At the same time, the structure, function, stability, and dynamics of proteins, subcellular structures, organelles, and multiprotein complexes have emerged as a leading interest in structural biology, another major source of large and complex geometric models. Geometric modeling not only provides visualizations of shapes for large biomolecular complexes but also fills the gap between structural information and theoretical modeling, enabling the understanding of function, stability, and dynamics. We first propose, for tessellated volumes of arbitrary topology, a compact data structure that offers constant-time-complexity incidence queries among cells of any dimension. Our data structure is simple to implement, easy to use, and allows for arbitrary, user-defined 3-cells such as prisms and hexahedra, while remaining highly efficient in memory usage compared to previous work. We also provide an analysis of its time complexity for commonly used incidence and adjacency queries such as vertex and edge one-rings. We then introduce a suite of computational tools for volumetric data processing, information extraction, surface mesh rendering, geometric measurement, and curvature estimation for biomolecular complexes. Particular emphasis is given to the modeling of Electron Microscopy Data Bank (EMDB) data and Protein Data Bank (PDB) data. Lagrangian and Cartesian representations are discussed for the surface presentation. Based on these representations, practical algorithms are developed for surface area and surface-enclosed volume calculation, and curvature estimation. Methods for volumetric meshing are also presented. Because technological development in computer science and mathematics has led to a variety of choices at each stage of geometric modeling, we discuss the rationales behind the design and selection of various algorithms. Analytical test models are designed to verify the computational accuracy and convergence of the proposed algorithms. We selected six EMDB datasets and six PDB datasets to demonstrate the efficacy of the proposed algorithms in handling biomolecular surfaces and to explore their capability of geometric characterization of binding targets. Thus, our toolkit offers a comprehensive protocol for the geometric modeling of proteins, subcellular structures, organelles, and multiprotein complexes. Furthermore, we present a method for computing "choking" loops: a set of surface loops that describe the narrowing of the volumes inside or outside of the surface and extend the notion of surface homology and homotopy loops. The intuition behind their definition is that a choking loop represents the region where an offset of the original surface would get pinched. Our generalized loops naturally include the usual 2g handles/tunnels computed based on the topology of the genus-g surface, but also include loops that identify chokepoints or bottlenecks, i.e., boundaries of small membranes separating the inside or outside volume of the surface into disconnected regions. Our definition is based on persistent homology theory, which assigns a measure to topological structures, thus providing resilience to noise and a well-defined way to determine topological feature size. Finally, we explore the application of persistent homology theory to protein folding analysis. The extremely complex process of protein folding brings challenges for both experimental study and theoretical modeling. The persistent homology approach studies the Euler characteristics of the protein conformations during the folding process. More precisely, the persistence is measured by the variation of the van der Waals radius, which leads to changes in the protein's 3D structure and uncovers its inter-connectivity. Our results on fullerenes demonstrate the potential of our geometric and topological approach for protein stability analysis.
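The persistence idea, grow a scale parameter and record when topological features are born and die, can be made concrete with a small self-contained example. The sketch below computes only 0-dimensional persistence (connected components) of a growing-radius filtration using union-find and the elder rule; the dissertation's analysis tracks richer features and filters by van der Waals radius rather than plain pairwise distance.

```python
import itertools
import numpy as np

def persistence_0d(points):
    """0-dim persistence of a growing-radius filtration: every point is
    born at radius 0; a component dies when an edge merges it into an
    older component (elder rule)."""
    n = len(points)
    edges = sorted(
        (float(np.linalg.norm(points[i] - points[j])), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    bars = []
    for r, i, j in edges:
        a, b = find(i), find(j)
        if a != b:
            bars.append((0.0, r))      # one component dies at radius r
            parent[b] = a
    bars.append((0.0, float("inf")))   # the last component never dies
    return bars

pts = np.array([[0, 0], [0.1, 0], [5, 5], [5.1, 5]])
print(persistence_0d(pts))  # two short bars, one long, one infinite
```

Short bars correspond to noise (the tight point pairs); long bars correspond to genuine structure (the two clusters), which is exactly the noise resilience the abstract appeals to.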
- Title
- A study of Bluetooth Frequency Hopping sequence : modeling and a practical attack
- Creator
- Albazrqaoe, Wahhab
- Date
- 2011
- Collection
- Electronic Theses & Dissertations
- Bluetooth is a wireless interface that enables electronic devices to establish short-range, ad-hoc wireless connections. This kind of short-range wireless networking is known as a Wireless Personal Area Network (WPAN). Because of its attractive features of small size, low cost, and low power, Bluetooth has gained worldwide usage. It is embedded in many portable computing devices and is considered a good replacement for local wire connections. Since wireless data is inherently exposed to eavesdropping, security and confidentiality are central issues for wireless standards, Bluetooth included. To maintain the security and confidentiality of wireless packets, the Bluetooth system mainly relies on the Frequency Hopping mechanism to confound an adversary. Under this technique, each wireless packet is transmitted on a single channel selected in a pseudo-random way. This randomness in channel selection makes it difficult for an eavesdropper to predict the next channel to be accessed; hence, capturing Bluetooth wireless packets is a challenge. In this work, we investigate the Frequency Hopping sequence and specifically the hop selection kernel. We analyze the operation of the kernel hardware by partitioning it into three parts. Based on this modeling, we propose an attack on the hop selection kernel. The proposed method shows how to expose the clock value hidden in the kernel. This makes it possible to predict the Bluetooth hopping sequence and, hence, to capture Bluetooth wireless packets.
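To make the attack surface concrete: the hop selection kernel is a deterministic function of the master's clock and device address, so recovering the hidden clock value lets an eavesdropper replay the kernel and follow every future hop. The toy below captures only that dependency; its mixing arithmetic is a placeholder, not the standardized Bluetooth kernel.

```python
def toy_hop(clock, address, num_channels=79):
    """Toy stand-in for the hop selection kernel: mix clock bits with the
    device address and map the result onto one of the hop channels. The
    real kernel is a specific circuit of XORs, adders, and a permutation
    stage; this only illustrates the clock -> channel dependency."""
    x = (clock >> 1) & 0x3FFFFFF                        # toy choice of clock bits
    mixed = ((x ^ (address & 0xFFFFFFF)) * 2654435761) & 0xFFFFFFFF
    return mixed % num_channels

# Once the clock is exposed, the whole hopping sequence is predictable:
addr = 0x1A2B3C4D
print([toy_hop(t, addr) for t in range(0, 12, 2)])
```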
- Title
- Towards machine learning based source identification of encrypted video traffic
- Creator
- Shi, Yan (Of Michigan State University)
- Date
- 2019
- Collection
- Electronic Theses & Dissertations
- The rapid growth of the Internet has helped to popularize video streaming services, which now supply the dominant content on the Internet. The management of video streaming traffic is complicated by its enormous volume, diverse communication protocols and data formats, and the widespread adoption of encryption. In this thesis, the aim is to develop a novel firewall framework, named Soft-margined Firewall, for managing encrypted video streaming traffic while avoiding violation of user privacy. The system distinguishes itself from conventional firewall systems by incorporating machine learning and Traffic Analysis (TA) as a traffic detection and blocking mechanism. The goal is to detect unknown network traffic, including traffic that is encrypted, tunneled through a Virtual Private Network, or obfuscated, in realistic application scenarios. Existing TA methods are limited in that they can deal only with simple traffic patterns: usually, only a single source of traffic is allowed in a tunnel, and a trained classifier is not portable between network locations, requiring redundant training. This work aims to address these limitations with new techniques in machine learning. The three main contributions of this work are: 1) new statistical features around traffic surge periods that can better identify websites with dynamic content; 2) a two-stage classifier architecture that solves the mixed-traffic problem with state-of-the-art TA features; and 3) a novel natural-language-inspired feature that solves the mixed-traffic problem using deep learning methods. A fully working Soft-margined Firewall with the above distinctive features has been designed, implemented, and verified for both conventional classifiers and the proposed deep-learning-based classifiers. The efficacy of the proposed system is confirmed via experiments conducted on actual network setups with a custom-built prototype firewall and OpenVPN servers. The proposed feature-classifier combinations show superior performance compared to previous state-of-the-art results. The solution that combines the natural-language-inspired traffic feature with deep learning is demonstrated to solve the mixed-traffic problem and is capable of predicting multiple labels associated with one sample. Additionally, the classifier can classify traffic recorded at locations different from where the training traffic was collected. These results are the first of their kind and are expected to lead the way toward next-generation TA-based firewall systems.
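The "statistical features around traffic surge periods" are described only at a high level. The sketch below shows one hedged reading: cut a packet trace into bursts at idle gaps, summarize each trace with burst statistics, and feed those to an off-the-shelf classifier. The gap threshold, the feature set, and the synthetic traces are all assumptions, not the thesis's feature definitions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def surge_features(times, sizes, gap=0.25):
    """Split a packet trace into surge periods separated by idle gaps
    longer than `gap` seconds, then summarize with burst statistics."""
    times, sizes = np.asarray(times), np.asarray(sizes)
    cuts = np.where(np.diff(times) > gap)[0] + 1
    totals = np.array([b.sum() for b in np.split(sizes, cuts)])
    return [len(totals), totals.mean(), totals.std(), float(sizes.sum())]

# Synthetic stand-in for labeled traces: class 0 sends short surges,
# class 1 sends long ones.
rng = np.random.default_rng(1)
X, y = [], []
for label, burst_len in ((0, 5), (1, 20)):
    for _ in range(50):
        t, times, sizes = 0.0, [], []
        for _ in range(10):                    # ten surges per trace
            for _ in range(burst_len):
                t += rng.exponential(0.01)     # packets within a surge
                times.append(t)
                sizes.append(int(rng.integers(100, 1500)))
            t += 1.0                           # idle gap between surges
        X.append(surge_features(times, sizes))
        y.append(label)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.score(X, y))
```

In a two-stage architecture of the kind the abstract names, a first classifier of this sort could flag which flows deserve a second, finer-grained model, though the actual staging logic is not given in the abstract.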
- Title
- Hidden Markov model-based homology search and gene prediction in NGS ERA
- Creator
- Techa-angkoon, Prapaporn
- Date
- 2017
- Collection
- Electronic Theses & Dissertations
- The exponential cost reduction of next-generation sequencing (NGS) has enabled researchers to sequence a large number of organisms in order to answer various questions in biology, ecology, health, etc. For newly sequenced genomes, gene prediction and homology search against characterized protein sequence databases are two fundamental tasks for annotating functional elements in the genomes. The main goal of gene prediction is to identify gene loci and their structures. As there is accumulating evidence showing important functions of noncoding RNAs (ncRNAs), comprehensive gene prediction should include both protein-coding genes and ncRNAs. Homology search against protein sequences can aid the identification of functional elements in genomes. Although there is intensive research in the fields of gene prediction, ncRNA search, and homology search, there are still unaddressed challenges. In this dissertation, I made contributions in these three areas. For gene prediction, I designed an HMM-based ab initio gene prediction tool that considers the G+C gradient in grass genomes. For homology search, I designed a method that can align short reads against protein families using profile HMMs. For ncRNA search, I designed an ncRNA alignment tool that can align highly structured ncRNAs using only sequence similarity. Below I summarize my contributions. Despite decades of research on gene prediction, existing gene prediction tools are not carefully designed to deal with variant G+C content and 5'-3' changing patterns inside coding regions. Thus, these tools can miss genes with a positive or negative G+C gradient in grass genomes such as rice, maize, sorghum, etc. I implemented a tool named AUGUSTUS-GC that accounts for the 5'-3' G+C gradient. Our tool can accurately predict protein-coding genes in plant genomes, especially grass genomes. A large number of sequencing projects have produced short reads from whole genomes or transcriptomic data. I designed a short-read homology search tool that employs paired-end reads to improve homology search sensitivity. The experimental results show that our tool achieves significantly better sensitivity and accuracy in aligning short reads that are part of remote homologs. Despite extensive studies of ncRNA search, existing tools that depend heavily on secondary structure in homology search cannot efficiently handle RNA-seq data, which is accumulating rapidly. It would be ideal to have a faster ncRNA homology search tool with accuracy similar to those adopting secondary structure. I implemented an accurate ncRNA alignment tool called glu-RNA that achieves accuracy similar to structural alignment tools while keeping the same running time complexity as sequence alignment tools. The experimental results demonstrate that our tool can produce more accurate alignments than popular sequence alignment tools and a well-known structural alignment program.
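Both the gene finder and the profile-HMM read aligner rest on HMM decoding. As background, here is a generic Viterbi decoder in log space; the two-state "high/low G+C" toy model it runs on is invented for illustration and is not the AUGUSTUS-GC model, which has many more states and a position-dependent G+C component.

```python
import numpy as np

def viterbi(obs, start, trans, emit):
    """Most likely hidden state path for a discrete HMM (log space)."""
    dp = np.log(start) + np.log(emit[:, obs[0]])
    back = []
    for o in obs[1:]:
        scores = dp[:, None] + np.log(trans)   # prev state x next state
        back.append(scores.argmax(axis=0))     # best predecessor per state
        dp = scores.max(axis=0) + np.log(emit[:, o])
    path = [int(dp.argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))          # walk the backpointers
    return path[::-1]

# Toy 2-state model: state 0 = "coding-like", state 1 = "non-coding-like";
# observations 0/1 stand for high/low G+C windows. Probabilities are made up.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit  = np.array([[0.8, 0.2], [0.3, 0.7]])
print(viterbi([0, 0, 1, 1, 1, 0], start, trans, emit))
```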
- Title
- Privacy and integrity preserving computation in distributed systems
- Creator
- Chen, Fei
- Date
- 2011
- Collection
- Electronic Theses & Dissertations
- Preserving the privacy and integrity of private data has become a core requirement for many distributed systems that span different parties. In these systems, one party may try to compute or aggregate useful information from the private data of other parties. However, this party may not be fully trusted by the other parties. Therefore, it is important to design security protocols for preserving such private data. Furthermore, one party may want to query the useful information computed from such private data. However, query results may be modified by a malicious party. Thus, it is important to design query protocols such that the integrity of query results can be verified. In this dissertation, we study four important privacy and integrity preserving problems for different distributed systems. For two-tiered sensor networks, where storage nodes serve as an intermediate tier between sensors and a sink for storing data and processing queries, we propose SafeQ, a protocol that prevents compromised storage nodes from gaining information from both sensor-collected data and sink-issued queries, while still allowing storage nodes to process queries over encrypted data and the sink to detect compromised storage nodes when they misbehave. For cloud computing, where a cloud provider hosts the data of an organization and returns query results to the customers of the organization, we propose novel privacy and integrity preserving schemes for multi-dimensional range queries such that the cloud provider can process encoded queries over encoded data without knowing the actual values, and customers can verify the integrity of query results with high probability. For distributed firewall policies, we propose the first privacy-preserving protocol for cross-domain firewall policy optimization: for any two adjacent firewalls belonging to two different administrative domains, our protocol can identify in each firewall the rules that can be removed because of the other firewall. For network reachability, one of the key factors in capturing end-to-end network behavior and detecting violations of security policies, we propose the first cross-domain privacy-preserving protocol for quantifying network reachability.
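The abstract does not spell out SafeQ's encoding or the cloud schemes, so the sketch below only illustrates the general shape of the problem with classical bucketization: an untrusted server matches keyed bucket tags without learning values or their order, and the client filters false positives after decryption. The bucket width, the HMAC tagging, and the placeholder "ciphertexts" are all assumptions, not the dissertation's constructions.

```python
import hashlib
import hmac
import os

SECRET = os.urandom(16)   # shared by data owner and querier, not the server

def bucket_tag(value, width=10):
    """Keyed, order-hiding tag for the bucket containing `value`."""
    b = value // width
    return hmac.new(SECRET, str(b).encode(), hashlib.sha256).hexdigest()

# The untrusted server stores only (tag, ciphertext) pairs; the tuple
# ("ct", v) stands in for a real ciphertext of v.
store = [(bucket_tag(v), ("ct", v)) for v in (3, 17, 42, 58, 61)]

def range_query(lo, hi, width=10):
    """Client sends tags of every bucket overlapping [lo, hi]; the server
    returns matching ciphertexts; the client decrypts and drops false hits."""
    wanted = {bucket_tag(b * width) for b in range(lo // width, hi // width + 1)}
    hits = [ct for tag, ct in store if tag in wanted]   # server-side matching
    return [v for _, v in hits if lo <= v <= hi]        # client-side filtering

print(range_query(15, 60))   # -> [17, 42, 58]
```

The gap between this toy and the dissertation is exactly its contribution: bucketization leaks bucket membership and offers no integrity check, whereas the proposed schemes aim to hide values from the provider and let customers verify that no result was dropped or altered.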