Search results
(1 - 20 of 117)
- Title
- Collaborative learning : theory, algorithms, and applications
- Creator
- Lin, Kaixiang
- Date
- 2020
- Collection
- Electronic Theses & Dissertations
- Description
-
Human intelligence prospers with the advantage of collaboration. To solve one or a set of challenging tasks, we can effectively interact with peers, fuse knowledge from different sources, and continuously inspire, contribute, and develop expertise for the benefit of shared objectives. Human collaboration is flexible, adaptive, and scalable in terms of its various cooperative constructions, collaborating across interdisciplinary, even seemingly unrelated, domains and building large-scale disciplined organizations for extremely complex tasks. On the other hand, while machine intelligence has achieved tremendous success in the past decade, its ability to collaboratively solve complicated tasks is still limited compared to human intelligence.

In this dissertation, we study the problem of collaborative learning: building flexible, generalizable, and scalable collaborative strategies to facilitate the efficiency of learning one or a set of objectives. Toward this goal, we investigate the following concrete and fundamental problems:

1. In the context of multi-task learning, can we enforce flexible forms of interaction among multiple tasks and adaptively incorporate human expert knowledge to guide the collaboration?
2. In reinforcement learning, can we design collaborative methods that effectively collaborate among heterogeneous learning agents to improve sample efficiency?
3. In multi-agent learning, can we develop a scalable collaborative strategy to coordinate a massive number of learning agents accomplishing a shared task?
4. In federated learning, can we obtain a provable benefit from increasing the number of collaborating learning agents?

This thesis provides the first line of research to view the above learning fields in a unified framework, which includes novel algorithms for flexible, adaptive collaboration, real-world applications using scalable collaborative learning solutions, and fundamental theories that propel the understanding of collaborative learning.
- Title
- Achieving reliable distributed systems : through efficient run-time monitoring and predicate detection
- Creator
- Tekken Valapil, Vidhya
- Date
- 2020
- Collection
- Electronic Theses & Dissertations
- Description
-
Runtime monitoring of distributed systems to perform predicate detection is a critical as well as challenging task. It is critical because it ensures the reliability of the system by detecting all possible violations of system requirements. It is challenging because guaranteeing the absence of violations requires analyzing every possible ordering of system events, which is an expensive task. In this report, we focus on ordering events in a system run using HLC (Hybrid Logical Clocks) timestamps, which are O(1)-sized timestamps, and present efficient algorithms to perform predicate detection using HLC. Since the runtime monitor cannot find all possible orderings of system events with HLC, we present a new type of clock called Biased Hybrid Logical Clocks (BHLC) that is capable of finding more possible orderings than HLC. We thus show that BHLC-based predicate detection can find more violations than HLC-based predicate detection. Since predicate detection based on either HLC or BHLC does not guarantee detection of all possible violations in a system run, we present an SMT (Satisfiability Modulo Theories) solver-based predicate detection approach that guarantees the detection of all possible violations in a system run. While a runtime monitor that performs predicate detection using SMT solvers is accurate, the time taken by the solver to detect the presence or absence of a violation can be high. To reduce the time taken by the runtime monitor, we propose an efficient two-layered monitoring approach, where the first layer of the monitor is efficient but less accurate and the second layer is accurate but less efficient. Together they drastically reduce the overall time taken to perform predicate detection while still guaranteeing detection of all possible violations.
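As an illustration of the HLC timestamps mentioned above, here is a minimal sketch of the standard Hybrid Logical Clock update and comparison rules (a generic textbook version, not code from the dissertation; the nanosecond physical clock is an illustrative assumption):

```python
import time
from dataclasses import dataclass

@dataclass
class HLCTimestamp:
    l: int  # logical component tracking the max physical time seen
    c: int  # counter breaking ties within the same l

class HLC:
    """Minimal Hybrid Logical Clock with the standard send/receive rules."""
    def __init__(self):
        self.l = 0
        self.c = 0

    def _now(self) -> int:
        return time.time_ns()  # physical clock (assumed nanosecond resolution)

    def send_or_local(self) -> HLCTimestamp:
        l_old = self.l
        self.l = max(l_old, self._now())
        self.c = self.c + 1 if self.l == l_old else 0
        return HLCTimestamp(self.l, self.c)

    def receive(self, m: HLCTimestamp) -> HLCTimestamp:
        l_old = self.l
        self.l = max(l_old, m.l, self._now())
        if self.l == l_old and self.l == m.l:
            self.c = max(self.c, m.c) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == m.l:
            self.c = m.c + 1
        else:
            self.c = 0
        return HLCTimestamp(self.l, self.c)

def hlc_before(a: HLCTimestamp, b: HLCTimestamp) -> bool:
    """(l, c) lexicographic order; consistent with happened-before."""
    return (a.l, a.c) < (b.l, b.c)
```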
- Title
- Multi-objective regression with application to the climate domain
- Creator
- Abraham, Zubin John
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
Regression-based approaches are widely used in climate research to derive the statistical, spatial, and temporal relationships among climate variables. Despite the extensive literature, existing approaches are insufficient to address the unique challenges arising from the data characteristics and requirements of this domain. For example, climate variables such as precipitation have zero-inflated distributions, which render ineffective any linear regression models constructed from the data. In addition, whereas traditional regression-based approaches emphasize minimizing the discrepancy between observed and predicted values, there is a growing demand for regression outputs that satisfy other domain-specific criteria. To address these challenges, this thesis presents multi-objective regression frameworks designed to extend current regression-based approaches to meet the needs of climate researchers. First, a framework called Integrated Classification and Regression (ICR) is developed to accurately capture both the timing of rain events and the magnitude of rain amounts in zero-inflated precipitation data. The second multi-objective regression framework focuses on modeling the extreme values of a distribution without degrading its overall accuracy in predicting non-extreme values. The third framework minimizes the divergence between the regression output and the observed data while maximizing the fit of their cumulative distribution functions. The fourth contribution extends this framework to a multi-output setting to ensure that the joint distribution of the multiple regression outputs is realistic and consistent with the true observations.
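The ICR idea of pairing a classifier (did it rain?) with a regressor (how much rain?) for zero-inflated data can be illustrated with a minimal two-stage sketch; this is a generic zero-inflated regression pattern rather than the dissertation's exact formulation, and the synthetic data and scikit-learn models are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                       # climate predictors (synthetic)
rain = (X[:, 0] + rng.normal(size=500)) > 0.8       # zero-inflated: many dry days
amount = np.where(rain, np.exp(0.5 * X[:, 1]) + 1.0, 0.0)

# Stage 1: classify rain vs. no-rain (timing of rain events).
clf = LogisticRegression().fit(X, rain)
# Stage 2: regress the amount only on wet days (magnitude of rain).
reg = LinearRegression().fit(X[rain], amount[rain])

def predict(X_new):
    wet = clf.predict(X_new).astype(bool)
    y = np.zeros(len(X_new))
    y[wet] = np.clip(reg.predict(X_new[wet]), 0.0, None)  # amounts are nonnegative
    return y

print(predict(X[:10]))
```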
- Title
- Automatic verification and revision of multitolerant programs
- Creator
- Chen, Jingshu
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
The notion of multitolerance is based on the observation that modern programs are often subject to multiple faults, and that the requirements in the presence of these faults vary based on the nature of the faults, their severity, and the cost of providing fault-tolerance for them. Moreover, assurance of multitolerant systems is necessary because they are integral parts of our lives. This dissertation proposes to provide such assurance via automated verification and revision.

Regarding verification, we focus on verification of self-stabilization, which is the ability of a program to recover from arbitrary states. We consider verification of self-stabilization because several multitolerant systems are indeed stabilizing. Also, most of the literature on verification of fault-tolerance focuses on safety properties; our work complements it by considering liveness properties. Hence, we envision verification of multitolerant programs by using existing approaches for verifying safety and using the results from this dissertation for verifying liveness. We propose a technique that is based on a bottleneck (fairness requirements) identified in existing work on verification of stabilization. Our approach uses the role of fairness along with symbolic model checking, and hence reduces the cost of verification substantially. We also propose a constraint-based approach that reduces the task of verifying self-stabilization to the well-studied problem of constraint solving, so that one can leverage existing highly optimized solutions (SAT/SMT solvers) to reduce the verification cost.

Regarding revision, we focus on revising existing programs to obtain corresponding multitolerant ones. Revising a program manually is expensive, since it requires additional verification steps to guarantee correctness, and manual revision may violate existing requirements. For these reasons, we propose an automatic approach to revise a given program to add multitolerance to the given class(es) of faults. Specifically, we characterize multitolerance in terms of strong multitolerance and weak multitolerance. Intuitively, strong multitolerance provides higher guarantees than weak multitolerance. However, there are scenarios where designing a strong multitolerant program is expensive or impossible even though designing a weak multitolerant one is feasible. We investigate the complexity of automatic revision for adding multitolerance. In particular, we identify instances where adding weak multitolerance is NP-hard even though adding strong multitolerance in the same setting is in P. We also develop algorithms (and heuristics) for automatically revising existing programs to add multitolerance, and we implement these algorithms in a model repair tool. Additionally, we build a lightweight framework that utilizes our model repair tool to automatically revise UML state diagrams to add fault-tolerance. This framework has practical and methodological significance for the development of concurrent software: it allows designers to revise an existing UML model to add fault-tolerance without detailed knowledge of the formalism behind the model repair algorithms.
- Title
- Inferring regulatory interactions in transcriptional regulatory networks
- Creator
- Mahmoud, Sherine Awad
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
Living cells are realized by complex gene expression programs that are moderated by regulatory proteins called transcription factors (TFs). The TFs control the differential expression of target genes in the context of transcriptional regulatory networks (TRNs), either individually or in groups. Deciphering the mechanisms by which TFs control the expression of target genes is a challenging task, especially when multiple TFs collaboratively participate in transcriptional regulation. Recent developments in biotechnology have been applied to uncover TF-target binding relationships and to reconstruct draft regulatory circuits at a systems level. Furthermore, to identify regulatory interactions in vivo and consequently reveal their functions, TF single/double knockouts and over-expression experiments have been systematically carried out. However, the results of many single or even double-knockout experiments are often non-conclusive, since many genes are regulated by multiple TFs with complementary functions.

To predict the TF combinations whose knockout is most likely to bring about a phenotypic change, we developed a new computational tool called TRIM that models the interactions between TFs and their target genes in terms of both the TF-target interaction's function (activation or repression) and its corresponding logical role (necessary and/or sufficient). We used DNA-protein binding and gene expression data to construct regulatory modules for inferring the transcriptional regulatory interaction models for the TFs and their corresponding target genes. Our TRIM algorithm is based on an HMM and a set of constraints that relate gene expression patterns to regulatory interaction models. However, TRIM infers interactions of at most two TFs. Inferring the collaborative interactions of multiple TFs is a computationally difficult task, because when multiple TFs simultaneously or sequentially control their target genes, a single gene responds to merged inputs, resulting in complex gene expression patterns. We developed mTRIM to solve this problem with a modified association rule mining approach. mTRIM is a novel method to infer TF collaborations in transcriptional regulatory networks; it can identify not only TF groups that collaboratively regulate common targets but also TFs with complementary functions. However, mTRIM ignores the effect of miRNAs on target genes. To take miRNAs' effects into consideration, we developed a new computational model called TmiRNA that incorporates miRNAs into the inference. TmiRNA infers the interactions between a set of regulators, including both TFs and miRNAs, and the set of their target genes. We used our model to study the combinatorial code of human cancer transcriptional regulation.
- Title
- Profile HMM-based protein domain analysis of next-generation sequencing data
- Creator
- Zhang, Yuan
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
Sequence analysis is the process of analyzing DNA, RNA, or peptide sequences using a wide range of methodologies in order to understand their functions, structures, or evolutionary history. Next-generation sequencing (NGS) technologies generate large-scale sequence data of high coverage and nucleotide-level resolution at low cost, benefiting a variety of research areas such as gene expression profiling, metagenomic annotation, and ncRNA identification. Therefore, functional analysis of NGS sequences becomes increasingly important because it provides insightful information, such as gene expression, protein composition, and phylogenetic complexity, about the species from which the sequences are generated. One basic step in functional analysis is to classify genomic sequences into different functional categories, such as protein families or protein domains (domains for short), which are independent functional units in a majority of annotated protein sequences. The state-of-the-art method for protein domain analysis is based on comparative sequence analysis, which classifies query sequences against annotated protein or domain databases. There are two types of domain analysis methods: pairwise alignment and profile-based similarity search. The first uses pairwise alignment tools such as BLAST to search query genomic sequences against reference protein sequences in databases such as NCBI-nr. The second uses profile HMM-based tools such as HMMER to classify query sequences into annotated domain families such as Pfam. Compared to the first method, the profile HMM-based method has a smaller search space and higher sensitivity for remote homolog detection. Therefore, I focus on profile HMM-based protein domain analysis.

There are several challenges in protein domain analysis of NGS sequences. First, sequences generated by some NGS platforms such as pyrosequencing have relatively high error rates, making it difficult to classify the sequences into their native domain families. Second, existing protein domain analysis tools have low sensitivity on short query sequences and poorly conserved domain families. Third, the volume of NGS data is usually very large, making it difficult to assemble short reads into longer contigs. In this work, I focus on addressing these three challenges using different methods. Specifically, we propose four tools: HMM-FRAME, MetaDomain, SALT, and SAT-Assembler. HMM-FRAME detects and corrects frameshift errors in sequences generated by pyrosequencing technology, thus accurately classifying metagenomic sequences containing frameshift errors into their native protein domain families. MetaDomain and SALT are both designed for short reads generated by NGS technologies. MetaDomain uses relaxed position-specific score thresholds and alignment positions to increase sensitivity while keeping the false positive rate at a low level. SALT combines position-specific score thresholds with graph algorithms and achieves higher accuracy than MetaDomain. SAT-Assembler conducts targeted gene assembly from large-scale NGS data; it has smaller memory usage, higher gene coverage, and a lower chimera rate compared with existing tools. Finally, I conclude the work and briefly discuss future directions.
- Title
- Holistic performance control for mission-critical cyber-physical systems
- Creator
- Chen, Jinzhu
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
-
Recent years have seen growing deployments of Cyber-Physical Systems (CPSs) in many mission-critical applications such as security, civil infrastructure, and transportation. These applications often impose stringent performance requirements on system sensing fidelity, timeliness, energy efficiency, and reliability. However, existing approaches treat these concerns in isolation and hence are not suitable for CPSs, where the system performances are dependent on each other because of the tight integration of computational and physical processes. In this dissertation, we investigate the dependencies between these performances and propose holistic performance control approaches for two typical mission-critical CPSs: Wireless Cyber-Physical Surveillance (WCS) systems and data centers.

We first propose a holistic approach called Fidelity-Aware Utilization Controller (FAUC) for WCS systems that combine low-end sensors with cameras for large-scale ad hoc surveillance in unplanned environments. By integrating data fusion with feedback control, FAUC enforces a CPU utilization upper bound to ensure the system's real-time schedulability under the dynamic CPU workloads caused by stochastic detection results at runtime. At the same time, FAUC optimizes system fidelity and adaptively adjusts the control objective of CPU utilization in the presence of variations in target/noise characteristics. Testbed experiments and trace-driven simulations show that FAUC can achieve robust fidelity and real-time guarantees in dynamic environments.

We then present a proactive thermal and energy control approach for data centers to improve energy efficiency while ensuring data center reliability. It consists of a high-fidelity real-time temperature prediction system and a Predictive Thermal and Energy Control (PTEC) system. The prediction system integrates Computational Fluid Dynamics (CFD) modeling, in situ wireless sensing, and real-time data-driven prediction. To ensure forecasting fidelity, we leverage the realistic physical thermodynamic models of CFD to generate transient temperature distributions and calibrate them using sensor feedback. Both the simulated temperature distribution and the sensor measurements are then used to train a real-time prediction algorithm. Based on the temperature prediction system, we propose the PTEC system, which leverages the servers' built-in sensors and monitoring utilities, as well as a network of wireless sensors, to monitor the thermal and power conditions of a data center. It predicts the server inlet temperatures in real time, and optimizes the temperature setpoints and cold air supply rates of the cooling systems, as well as the speeds of the servers' internal fans, to minimize their overall energy consumption. To ensure data center reliability, PTEC enforces a set of thermal safety requirements, including upper bounds on server inlet temperatures and their variations, to prevent server overheating and reduce the server hardware failure rate. A partition-based approach is proposed to solve the control problem efficiently for large-scale data centers. Extensive testbed experiments and trace-driven CFD simulations show that PTEC can safely reduce substantial cooling and circulation energy consumption compared with traditional approaches, and can adapt to realistic and dynamic data center workloads.
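The FAUC idea of feedback-controlling CPU utilization against an upper bound while trading off sensing fidelity can be sketched with a simple integral controller; the gain, the sampling-rate knob, and the simulated stochastic workload below are assumptions for illustration, not the dissertation's actual controller design:

```python
import random

U_BOUND = 0.70      # CPU utilization upper bound for real-time schedulability
KI = 0.5            # integral gain (assumed value)

def measure_utilization(sampling_rate: float) -> float:
    """Stand-in for measured CPU utilization; the workload is stochastic
    because detection results vary with targets and noise."""
    return min(1.0, sampling_rate * (0.8 + random.uniform(-0.1, 0.3)))

sampling_rate = 1.0  # fidelity knob: higher rate -> better fidelity, more CPU
for step in range(50):
    u = measure_utilization(sampling_rate)
    error = U_BOUND - u
    # Integral feedback: raise the rate (fidelity) when CPU is underused,
    # lower it when the utilization bound is about to be violated.
    sampling_rate = max(0.05, min(1.0, sampling_rate + KI * error))
    if step % 10 == 0:
        print(f"step={step:2d}  util={u:.2f}  rate={sampling_rate:.2f}")
```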
- Title
- Near duplicate image search
- Creator
- Li, Fengjie
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
-
Information retrieval addresses the fundamental problem of how to identify the objects in a database that satisfy the information needs of users. In the face of information overload, the major challenge in search algorithm design is to ensure that useful information can be found both accurately and efficiently in large databases. To address this challenge, different indexing and retrieval methods have been proposed for different types of data, namely sparse data (e.g., documents), dense data (e.g., dense feature vectors), and bag-of-features data (e.g., images represented by local features). For sparse data, inverted indices and document retrieval models have proven very effective for large-scale retrieval problems. For dense data and bag-of-features data, however, there are still open problems. For example, Locality Sensitive Hashing, a state-of-the-art method for searching high dimensional vectors, often fails to make a good tradeoff between precision and recall: it tends to achieve high precision with low recall, or vice versa. The bag-of-words model, a popular approach for searching objects represented by bag-of-features, has limited performance because of the information loss during the quantization procedure.

Since the general problem of searching objects represented as dense vectors or bag-of-features may be too challenging, in this dissertation we focus on near duplicate search, in which the matched objects are almost identical to the query. By effectively exploiting the statistical properties of near duplicates, we are able to design more effective indexing schemes and search algorithms. Thus, the focus of this dissertation is to design new indexing methods and retrieval algorithms for near duplicate search in large-scale databases that accurately capture data similarity and deliver more accurate and efficient search. Below, we summarize the main contributions of this dissertation.

Our first contribution is a new algorithm for searching near duplicate bag-of-features data. The proposed algorithm, named random seeding quantization, is more efficient in generating bag-of-words representations for near duplicate images. The new scheme is motivated by approximating the optimal partial matching between bags of features, and thus produces a bag-of-words representation that captures the true similarities of the data, leading to more accurate and efficient retrieval of bag-of-features data.

Our second contribution, termed Random Projection Filtering (RPF), is a search algorithm designed for efficient near duplicate vector search. By explicitly exploiting the statistical properties of near duplicates, the algorithm projects high dimensional vectors into a lower dimensional space and filters out irrelevant items. Our effective filtering procedure makes RPF more accurate and efficient at identifying near duplicate objects in databases.

Our third contribution is to develop and evaluate a new randomized range search algorithm for near duplicate vectors in high dimensional spaces, termed Random Projection Search. Different from RPF, the algorithm presented in this chapter is suitable for a wider range of applications because it does not require sparsity constraints for high search accuracy. The key idea is to project both the data points and the query point into a one-dimensional space by a random projection, and to perform a one-dimensional range search, using binary search, to find the subset of data points that are within the range of a given query. We prove a theoretical guarantee for the proposed algorithm and evaluate its empirical performance on a dataset of 1.1 billion image features.
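The random projection range search described above (project the data and the query onto one random direction, binary-search the resulting 1D interval, then verify candidates exactly) can be sketched as follows; this is a generic illustration of the idea with arbitrary dimensions and radius, not the dissertation's exact algorithm or its theoretical analysis:

```python
import numpy as np
from bisect import bisect_left, bisect_right

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 128))         # database of feature vectors
w = rng.normal(size=128)
w /= np.linalg.norm(w)                     # random unit projection direction

proj = X @ w
order = np.argsort(proj)
proj_sorted = proj[order]                  # 1D index built once, offline

def range_search(q, r):
    """Return indices of points within distance r of q (near duplicates).
    |<x - q, w>| <= ||x - q|| <= r, so every true neighbor falls in the
    1D window [q.w - r, q.w + r]; candidates are then verified exactly."""
    qp = q @ w
    lo = bisect_left(proj_sorted, qp - r)
    hi = bisect_right(proj_sorted, qp + r)
    cand = order[lo:hi]
    d = np.linalg.norm(X[cand] - q, axis=1)
    return cand[d <= r]

q = X[42] + 0.01 * rng.normal(size=128)    # a near duplicate of point 42
print(range_search(q, r=0.5))
```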
- Title
- Toward efficient spectrum use in multicarrier wireless networks
- Creator
- Huang, Pei
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
-
The last decade has witnessed growing interest in dynamic spectrum access, which is motivated by the observation that a large portion of the radio spectrum has been licensed but remains highly underutilized, while the few small unlicensed bands that are open to anyone are getting more crowded due to the explosive expansion of wireless services. To provide more flexible access to radio spectrum, dynamic spectrum access is introduced to enable unlicensed users to opportunistically utilize vacant spectrum chunks (known as spectrum holes) in licensed frequency bands.

In dynamic spectrum access, non-contiguous orthogonal frequency division multiplexing (NC-OFDM) is widely adopted to efficiently utilize fragmented spectrum because it is convenient to keep silent on some spectrum fragments to avoid interference with licensed users. In NC-OFDM, a band of spectrum is divided into many orthogonal subcarriers and data are transmitted on a subset of them simultaneously; the subcarriers that would interfere with licensed users are deactivated. Because each subcarrier can be managed independently, this dissertation introduces a series of techniques that exploit the subcarriers to address problems in dynamic spectrum access.

When unlicensed users, called secondary users (SUs), are granted permission to operate in the licensed bands, they must ensure that the interference they cause to licensed users, known as primary users (PUs), is within a limit. Even without such a requirement, SUs should avoid as many collisions as possible. To improve the spectrum hole extraction rate and reduce the collision rate, we propose a spectrum occupancy prediction model that helps estimate spectrum availability. It measures a wide band of spectrum with OFDM and groups subcarriers into subchannels based on spectrum use activities. In each subchannel, frequent spectrum occupancy patterns are identified and used to predict future channel states (i.e., busy or idle). With the prediction, SUs are able to make use of spectrum holes more aggressively without introducing undue interference to PUs.

In the spectrum holes discovered above, a mechanism is needed to coordinate medium access between devices. Because devices opportunistically utilize spectrum holes, a device may experience severe contention with devices from various applications. We propose a collision detection and bitwise arbitration (CDBA) mechanism that quickly identifies the winner of a contention using combined information from both the time domain and the frequency domain. It enables collision detection in the frequency domain by selectively deactivating subcarriers at each transmitter.

CDBA assumes that all devices adopt the same channel width, but different radio technologies have different requirements on channel width. When heterogeneous radios coexist in a contention domain, wideband devices can hardly win medium access opportunities in contention with narrowband devices. To address the problem, we propose an adaptive channel bonding protocol in which a wideband device initiates transmission as long as there exist some idle narrow channels and gradually increases its channel width during transmission whenever new narrow channels become available. To increase the chance of winning some narrow channels, a wideband device contends on the subcarriers of each narrow channel with a different priority.

After the contention problem is addressed, we study how to increase the transmission speed when a device is granted permission to transmit. As wireless networks move toward wider channel widths, it is common for different subcarriers to experience different fading. To cope with frequency-selective fading, the modulation scheme for each subcarrier should be selected based on the subcarrier's channel quality. We exploit this modulation diversity in our modulation scheme coding to improve network throughput.

Besides unicast, broadcast is another fundamental mechanism in wireless networks. Because devices utilize spectrum fragments opportunistically, different receivers may have different vacant spectrum fragments at different locations. Broadcast is more challenging in dynamic spectrum access because the transmitter needs to consider the diversity of spectrum availability. We propose a spectrum fragment agile broadcast (SFAB) protocol to support broadcast under nonuniform spectrum availability. It encodes unique sequences on the subcarriers of each spectrum fragment to achieve fast spectrum agreement between the transmitter and the receivers.
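The bitwise arbitration idea behind CDBA (contenders reveal a priority bit per round on the arbitration subcarriers; a device that stays silent but senses energy withdraws) can be illustrated with a small simulation; the priority width and random priorities are assumptions for illustration, not the protocol's actual subcarrier encoding:

```python
import random

BITS = 8  # width of the arbitration priority (assumed)

def arbitrate(priorities):
    """Simulate dominant-bit arbitration: in each bit round, contenders whose
    current priority bit is 1 assert energy on the arbitration subcarrier;
    a contender with bit 0 that senses energy withdraws."""
    alive = list(range(len(priorities)))
    for b in range(BITS - 1, -1, -1):           # most significant bit first
        asserting = [i for i in alive if (priorities[i] >> b) & 1]
        if asserting:                            # energy present in this round
            alive = asserting                    # bit-0 contenders back off
    return alive                                 # ties share the same priority

prios = [random.getrandbits(BITS) for _ in range(5)]
print("priorities:", prios, "-> winner(s):", arbitrate(prios))
```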
- Title
- Finding optimized bounding boxes of polytopes in d-dimensional space and their properties in k-dimensional projections
- Creator
- Shahid, Salman (Of Michigan State University)
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
-
Using minimal bounding boxes to encapsulate or approximate a set of points in d-dimensional space is a non-trivial problem that has applications in a variety of fields, including collision detection, object rendering, high dimensional databases, and statistical analysis, to name a few. While a significant amount of work has been done on the three-dimensional variant of the problem (i.e., finding the minimum volume bounding box of a set of points in three dimensions), it is difficult to find a simple method that does the same for higher dimensions. Even in three dimensions, existing methods suffer from either high time complexity or suboptimal results traded for faster execution. In this thesis we present a new approach to find optimized minimum bounding boxes of a set of points defining convex polytopes in d-dimensional space. The solution also gives the optimal bounding box in three dimensions with a much simpler implementation, while significantly speeding up the execution time for large numbers of vertices. The basis of the proposed approach is a series of unique properties of the k-dimensional projections that are leveraged into an algorithm. The algorithm works by constructing the convex hulls of a given set of points and optimizing the projections of those hulls in two-dimensional space using the new concept of Simultaneous Local Optimal. We show that the proposed algorithm provides significantly better performance than the current state-of-the-art approach in terms of both time and accuracy. To illustrate the importance of the result for a real-world application, the optimized bounding box algorithm is used to develop a method for carrying out range queries in high dimensional databases. This method uses data transformation techniques in conjunction with a set of heuristics to provide significant performance improvement.
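To give a flavor of the two-dimensional building block involved, here is a minimal rotating-calipers-style sketch that finds the minimum-area oriented bounding rectangle of a 2D point set by checking each convex hull edge direction; this is a standard 2D routine, not the dissertation's d-dimensional algorithm or its Simultaneous Local Optimal concept:

```python
import numpy as np
from scipy.spatial import ConvexHull

def min_area_rect(points):
    """Minimum-area oriented bounding rectangle of 2D points.
    The optimal rectangle has a side collinear with some hull edge,
    so it suffices to try each hull edge direction."""
    hull = points[ConvexHull(points).vertices]
    best = (np.inf, None)
    for i in range(len(hull)):
        e = hull[(i + 1) % len(hull)] - hull[i]
        e = e / np.linalg.norm(e)
        R = np.array([[e[0], e[1]], [-e[1], e[0]]])   # rotate edge onto x-axis
        p = hull @ R.T
        (xmin, ymin), (xmax, ymax) = p.min(axis=0), p.max(axis=0)
        area = (xmax - xmin) * (ymax - ymin)
        if area < best[0]:
            best = (area, R)
    return best   # (minimum area, rotation aligning the box with the axes)

pts = np.random.default_rng(0).normal(size=(200, 2))
area, R = min_area_rect(pts)
print("min-area OBB area:", round(area, 3))
```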
- Title
- Non-coding RNA identification in large-scale genomic data
- Creator
- Yuan, Cheng
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
-
Noncoding RNAs (ncRNAs), which function directly as RNAs without translating into proteins, play diverse and important biological roles. ncRNAs function not only through their primary structures but also through their secondary structures, which are defined by interactions between Watson-Crick and wobble base pairs. Common types of ncRNA include microRNA, rRNA, snoRNA, and tRNA, and the functions of ncRNAs vary among the different types. Recent studies suggest the existence of a large number of ncRNA genes. Identification of novel and known ncRNAs becomes increasingly important in order to understand their functionalities and the underlying communities.

Next-generation sequencing (NGS) technology sheds light on more comprehensive and sensitive ncRNA annotation. Lowly transcribed ncRNAs, or ncRNAs from rare species with low abundance, may be identified via deep sequencing. However, several challenges exist in ncRNA identification in large-scale genomic data. First, the massive volume of datasets can lead to very long computation times, making existing algorithms infeasible. Second, NGS has a relatively high error rate, which can further complicate the problem. Third, high sequence similarity among related ncRNAs can make them difficult to identify, resulting in incorrect output. Fourth, while secondary structures should be adopted for accurate ncRNA identification, they usually incur high computational complexity. In particular, some ncRNAs contain pseudoknot structures, which cannot be effectively modeled by the state-of-the-art approach; as a result, ncRNAs containing pseudoknots are hard to annotate.

In my PhD work, I aimed to tackle the above challenges in ncRNA identification. First, I designed a progressive search pipeline to identify ncRNAs containing pseudoknot structures. The algorithms are more efficient than the state-of-the-art approaches and can be used for large-scale data. Second, I designed an ncRNA classification tool for short reads in NGS data lacking quality reference genomes. The initial homology search phase significantly reduces the size of the original input, making the tool feasible for large-scale data. Last, I focused on identifying 16S ribosomal RNAs from NGS data. 16S ribosomal RNAs are a very important type of ncRNA that can be used for phylogenetic studies. A set of graph-based assembly algorithms was applied to form longer or full-length 16S rRNA contigs. I utilized paired-end information in the NGS data so that low-abundance 16S genes can also be identified. To reduce the complexity of the problem and make the tool practical for large-scale data, I designed a set of error correction and graph reduction techniques for graph simplification.
- Title
- Structure and evolutionary dynamics in fitness landscapes
- Creator
- Pakanati, Anuraag R.
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
-
Evolution can be conceptualized as an optimization algorithm that allows populations to search through genotypes for those that produce high fitness solutions. This search process is commonly depicted as exploring a “fitness landscape”, which combines similarity relationships among genotypes with the concept of a genotype-fitness map. As populations adapt to their fitness landscape, they accumulate information about the fitness landscape in which they live. A greater understanding of evolution on fitness landscapes will help elucidate fundamental evolutionary processes. I examine methods of estimating information acquisition in evolving populations and find that these techniques have largely ignored the effects of common descent. Since information is estimated by measuring conserved genomic regions across a population, common descent can create a severe bias by increasing similarities among unselected regions. I introduce a correction method to compensate for the effects of common descent on genomic information and empirically demonstrate its efficacy.

Next, I explore three instantiations of fitness landscapes, NK, Avida, and RNA, to better understand structural properties such as the distribution of peaks and the size of basins of attraction. I find that the fitness of peaks is correlated with the fitness of peaks within their neighborhood, and that the size of a peak's basin of attraction tends to be proportional to the height of the peak. Finally, I visualize local dynamics and perform a detailed comparison between the space of evolutionary trajectories that are technically possible from a single starting point and the results of actual evolving populations.
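As an illustration of the kind of structural analysis described (locating peaks and estimating basins of attraction), here is a small sketch that builds a random NK fitness landscape and hill-climbs from every genotype; the parameter values and the greedy adaptive walk are illustrative choices, not the dissertation's exact setup:

```python
import itertools, random
from collections import Counter

N, K = 10, 2
random.seed(0)
# NK model: each locus i contributes a fitness component that depends on
# locus i and K other loci; contributions are random values in [0, 1).
neighbors = [random.sample([j for j in range(N) if j != i], K) for i in range(N)]
tables = [{} for _ in range(N)]

def fitness(g):
    total = 0.0
    for i in range(N):
        key = (g[i],) + tuple(g[j] for j in neighbors[i])
        total += tables[i].setdefault(key, random.random())
    return total / N

def hill_climb(g):
    """Greedy adaptive walk: move to the best one-mutant neighbor until a peak."""
    while True:
        best = max((g[:i] + (1 - g[i],) + g[i+1:] for i in range(N)), key=fitness)
        if fitness(best) <= fitness(g):
            return g                      # local peak reached
        g = best

basins = Counter(hill_climb(g) for g in itertools.product((0, 1), repeat=N))
print("number of peaks:", len(basins))
for peak, size in basins.most_common(3):
    print(f"peak fitness {fitness(peak):.3f}  basin size {size}")
```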
- Title
- Unobtrusive physiological monitoring using smartphones
- Creator
- Hao, Tian (Research scientist)
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
-
"This thesis presents an in-depth investigation in unobtrusive smartphone-based physiological monitoring, which aims to help people get healthier and fitter in a more efficient and less costly way." -- Abstract.
- Title
- Data clustering with pairwise constraints
- Creator
- Yi, Jinfeng
- Date
- 2014
- Collection
- Electronic Theses & Dissertations
- Description
-
Classical unsupervised clustering is an ill-posed problem due to the absence of a unique clustering criterion. This issue can be addressed by introducing additional supervised information, usually cast in the form of pairwise constraints, into the clustering procedure. Depending on the source, most pairwise constraints can be classified into two categories: (i) pairwise constraints collected from a set of non-expert crowd workers, which leads to the problem of crowdclustering, and (ii) pairwise constraints collected from an oracle or experts, which leads to the problem of semi-supervised clustering. In both cases, collecting pairwise constraints can be expensive, so it is important to identify the minimal number of pairwise constraints needed to accurately recover the underlying true data partition, also known as the sample complexity problem.

In this thesis, we first analyze the sample complexity of crowdclustering. We propose a novel crowdclustering approach based on the theory of matrix completion. Unlike the existing crowdclustering algorithm that is based on a Bayesian generative model, the proposed approach is more desirable since it needs far fewer crowdsourced pairwise annotations to accurately cluster all the objects. Our theoretical analysis shows that in order to accurately cluster N objects, only O(N log^2 N) randomly sampled pairs need to be annotated by crowd workers. To further reduce the sample complexity, we then introduce a semi-crowdsourced clustering framework that is able to effectively incorporate the low-level features of the objects to be clustered. In this framework, we only need to sample a subset of n << N objects and generate their pairwise constraints via crowdsourcing. After completing an n x n similarity matrix using the proposed crowdclustering algorithm, we can recover an N x N similarity matrix by applying a regression-based distance metric learning algorithm to the completed smaller similarity matrix. This enables us to reliably cluster N objects with only O(n log^2 n) crowdsourced pairwise constraints.

Next, we study the problem of sample complexity in semi-supervised clustering. To this end, we propose a novel convex semi-supervised clustering approach based on the theory of matrix completion. In order to reduce the number of pairwise constraints needed to achieve a perfect data partitioning, we apply a natural assumption that the feature representations of the objects are able to reflect the similarities between objects. This enables us to utilize only O(log N) pairwise constraints to perfectly recover the data partition of N objects.

Lastly, in addition to sample complexity, which relates to labeling costs, we also consider the computational costs of semi-supervised clustering. Specifically, we study the problem of efficiently updating clustering results when the pairwise constraints are generated sequentially, a common case in various real-world applications such as social networks. To address this issue, we develop a dynamic semi-supervised clustering algorithm that casts the clustering problem as a search problem in a feasible convex space, i.e., a convex hull whose extreme points are an ensemble of multiple data partitions. Unlike classical semi-supervised clustering algorithms that need to re-optimize their objective functions when new pairwise constraints are generated, the proposed method only needs to update a low-dimensional vector, and its time complexity is independent of the number of data points to be clustered. This enables us to update large-scale clustering results in an extremely efficient way.
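A toy version of the crowdclustering-by-matrix-completion idea (annotate a random subset of pairs, complete the partial similarity matrix with a low-rank step, then cluster it) is sketched below; the noise level, the simple hard-impute loop, and the spectral k-means step are simplifications assumed for illustration, not the dissertation's algorithm or its sample complexity guarantees:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N, k = 300, 3
truth = rng.integers(0, k, size=N)                 # hidden ground-truth partition

# Crowdsourced same/different labels for roughly N log^2 N random pairs (10% noise).
m = int(N * np.log2(N) ** 2)
pairs = rng.integers(0, N, size=(m, 2))
S = np.full((N, N), np.nan)
np.fill_diagonal(S, 1.0)
for i, j in pairs:
    if i == j:
        continue
    label = float(truth[i] == truth[j])
    if rng.random() < 0.1:
        label = 1.0 - label                         # worker error
    S[i, j] = S[j, i] = label

# Low-rank completion of the partial similarity matrix (simple hard-impute:
# alternate between rank-k SVD truncation and restoring observed entries).
mask = ~np.isnan(S)
X = np.where(mask, S, 0.5)
for _ in range(30):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X = (U[:, :k] * s[:k]) @ Vt[:k]
    X[mask] = S[mask]

# Cluster the completed similarity matrix (k-means on the leading singular vectors).
U, s, _ = np.linalg.svd(X)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U[:, :k] * s[:k])
print("recovered cluster sizes:", np.bincount(labels))
```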
- Title
- Computational identification and analysis of non-coding RNAs in large-scale biological data
- Creator
- Lei, Jikai
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
-
Non-protein-coding RNAs (ncRNAs) are RNA molecules that function directly at the level of RNA without translating into protein. They perform important biological functions in all three domains of life, i.e., Eukarya, Bacteria, and Archaea. To understand the working mechanisms and the functions of ncRNAs in various species, a fundamental step is to identify both known and novel ncRNAs from large-scale biological data.

Large-scale genomic data include both genomic sequence data and NGS data, and both types provide great opportunities for identifying ncRNAs. For genomic sequence data, many ncRNA identification tools that use comparative sequence analysis have been developed. These methods work well for ncRNAs that have strong sequence similarity; however, they are not well suited for detecting ncRNAs that are remotely homologous. Next-generation sequencing (NGS), while opening a new horizon for annotating and understanding known and novel ncRNAs, also introduces many challenges. First, existing genomic sequence searching tools cannot be readily applied to NGS data because NGS technology produces short, fragmentary reads. Second, most NGS data sets are large-scale, and existing algorithms are infeasible on them because of high resource requirements. Third, metagenomic sequencing, which utilizes NGS technology to sequence uncultured, complex microbial communities directly from their natural habitats, further aggravates the difficulties. Thus, the massive amount of genomic sequence and NGS data calls for efficient algorithms and tools for ncRNA annotation.

In this dissertation, I present three computational methods and tools to efficiently identify ncRNAs from large-scale biological data. Chain-RNA is a tool that combines both sequence similarity and structure similarity to locate cross-species conserved RNA elements with low sequence similarity in genomic sequence data. It can achieve significantly higher sensitivity in identifying remotely conserved ncRNA elements than sequence-based methods such as BLAST, and is much faster than existing structural alignment tools. miR-PREFeR (miRNA PREdiction From small RNA-Seq data) utilizes the expression patterns of miRNA and follows the criteria for plant microRNA annotation to accurately predict plant miRNAs from one or more small RNA-Seq data samples. It is sensitive, accurate, fast, and has a low memory footprint. metaCRISPR focuses on identifying Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) from large-scale metagenomic sequencing data. It uses a k-mer hash table to efficiently detect reads that belong to CRISPRs in the raw metagenomic data set. Overlap-graph-based clustering is then conducted on the reduced data set to separate different CRISPRs, and a set of graph-based algorithms is used to assemble and recover CRISPRs from the clusters.
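The k-mer hash-table filtering step described for metaCRISPR (keep only reads that share k-mers with CRISPR repeats before any expensive graph construction) can be sketched in a few lines; the k value, the example repeat, and the toy reads are illustrative assumptions:

```python
K = 21  # k-mer length (assumed)

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Hash table of k-mers from (hypothetical) CRISPR repeat sequences.
repeats = ["GTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAAC"]
repeat_kmers = set().union(*(kmers(r) for r in repeats))

def crispr_candidates(reads, min_hits=1):
    """Keep only reads sharing at least min_hits k-mers with a repeat;
    downstream clustering/assembly then runs on this reduced set."""
    return [r for r in reads if len(kmers(r) & repeat_kmers) >= min_hits]

reads = [
    "ACGTGTTTTAGAGCTATGCTGTTTTGAATGGTCCCAAAACACGT",   # contains a repeat
    "TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT",   # unrelated read
]
print(crispr_candidates(reads))
```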
- Title
- Automated addition of fault-tolerance via lazy repair and graceful degradation
- Creator
- Lin, Yiyan
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
-
In this dissertation, we concentrate on the problem of automated addition of fault-tolerance, which transforms a fault-intolerant program into a fault-tolerant program. We solve this problem via model repair. Model repair is a correct-by-construction technique to revise an existing model so that the revised model satisfies given correctness criteria, such as safety, liveness, or fault-tolerance. We consider two problems in using model repair to add fault-tolerance. First, if the repaired model violates the assumptions (e.g., partial observability, inability to detect crashed processes, etc.) made in the underlying system, then it cannot be implemented; we denote these requirements as realizability constraints. Second, the addition of fault-tolerance may fail if the program cannot fully recover after certain faults occur. In this dissertation, we propose a lazy repair approach to address realizability issues in adding fault-tolerance. Additionally, we propose a technique to automatically add graceful degradation to a program, so that the program can recover with partial functionality (identified by the designer as the critical functionality) if full recovery is impossible.

A model repair technique transforms a model into another model that satisfies a new set of properties. Such a transformation should also maintain the mapping between the model and the underlying program. For example, in a distributed program, every process is restricted to read (or write) only some variables in other processes. A model that represents this program should also disallow the process from reading (or writing) those inaccessible variables. If these constraints are violated, the corresponding model is unrealizable. An unrealizable model (in this context, a model that violates the read/write restrictions) may make it impossible to obtain the corresponding implementation, or may complicate the implementation by requiring extra modifications to the program, which may in turn break the program's correctness. Resolving realizability constraints increases the complexity of model repair. Existing model repair techniques introduce heuristics to reduce this complexity; however, this heuristic-based approach is designed and optimized specifically for distributed programs. We need a more generic model repair approach for other types of programs, e.g., synchronous programs and cyber-physical programs. Hence, in this dissertation, we propose a model repair technique, lazy repair, to add fault-tolerance to programs with different types of realizability constraints. It involves two steps. First, we focus only on repairing to obtain a model that satisfies the correctness criteria while ignoring realizability constraints. In the second step, we repair this model further by removing behaviors while ensuring that the desired specification is preserved. The lazy repair approach simplifies the process of developing heuristics, and provides a tradeoff between the time saved in the first step and the extra work required in the second step. We demonstrate that lazy repair is applicable in the context of distributed systems, synchronous systems, and cyber-physical systems.

In addition, safety-critical systems such as airplanes, automobiles, and elevators should operate with high dependability in the presence of faults. If the occurrence of faults breaks down some components, the system may not be able to fully recover. In this scenario, the system can still operate with the remaining resources and deliver partial but core functionality, i.e., display graceful degradation. Existing model repair approaches, such as addition of fault-tolerance, cannot transform a program to provide graceful degradation. In this dissertation, we propose a technique to add fault-tolerance to a program with graceful degradation. In the absence of faults, such a program exhibits ideal behaviors; in the presence of faults, the program is allowed to recover with reduced functionality. This technique involves two steps. First, it automatically generates a program with graceful degradation based on the input fault-intolerant program. Second, it adds fault-tolerance to the output program from the first step. We demonstrate that this technique is applicable in the context of high atomicity programs as well as low atomicity programs (i.e., distributed programs). We also present a case study on adding multi-graceful degradation to a dangerous gas detection and ventilation system. Through this case study, we show that our approach can assist the designer in obtaining a program that behaves like the deployed system.
- Title
- Large-scale high dimensional distance metric learning and its application to computer vision
- Creator
- Qian, Qi
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
-
Learning an appropriate distance function (i.e., similarity) is one of the key tasks in machine learning, especially for distance-based machine learning algorithms, e.g., the k-nearest neighbor classifier, k-means clustering, etc. Distance metric learning (DML), the subject studied in this dissertation, is designed to learn a metric that pulls the examples from the same class together and pushes the examples from different classes away from each other. Although many DML algorithms have been developed in the past decade, most of them can handle only small data sets with hundreds of features, significantly limiting their use in real-world applications that often involve millions of training examples represented by hundreds of thousands of features. Three main challenges are encountered when learning a metric from such large-scale, high dimensional data: (i) to make sure that the learned metric is a Positive Semi-Definite (PSD) matrix, a projection onto the PSD cone is required at every iteration, whose cost is cubic in the dimensionality, making it unsuitable for high dimensional data; (ii) the number of variables to be optimized in DML is quadratic in the dimensionality, which results in slow convergence and high memory requirements; and (iii) the number of constraints used by DML is at least quadratic, if not cubic, in the number of examples, depending on whether pairwise or triplet constraints are used. In addition, features can be redundant in high dimensional representations (e.g., face features), so DML with feature selection is preferred for such applications.

The main contribution of this dissertation is to address these challenges both theoretically and empirically. First, for the challenge arising from the PSD projection, we exploit a mini-batch strategy and adaptive sampling with a smooth loss function to significantly reduce the number of updates (i.e., projections) while maintaining similar performance. Second, for the challenge arising from high dimensionality, we propose a dual random projection approach, which enjoys light computation due to the use of random projection and, at the same time, significantly improves its effectiveness. Third, for the challenge of large-scale constraints, we develop a novel multi-stage metric learning framework: it divides the original optimization problem into multiple stages and reduces the computation by adaptively sampling a small subset of constraints at each stage. Finally, to handle redundant features with a group property, we develop a greedy algorithm that selects a feature group and learns the corresponding metric simultaneously at each iteration, further improving learning efficiency when combined with the adaptive mini-batch strategy and incremental sampling. Besides the theoretical and empirical investigation of DML on machine learning benchmark datasets, we also apply the proposed methods to several important computer vision applications, namely fine-grained visual categorization (FGVC) and face recognition.
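To make the PSD-projection bottleneck concrete, here is a minimal mini-batch, triplet-based metric learning loop that performs the expensive eigendecomposition-based projection only periodically; the synthetic data, step size, batch size, and projection schedule are illustrative assumptions rather than the dissertation's adaptive-sampling algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 1000
X = rng.normal(size=(n, d))
y = rng.integers(0, 5, size=n)

M = np.eye(d)                      # Mahalanobis metric to learn
lr, margin = 0.01, 1.0

def sample_triplet():
    """(anchor, positive, negative): same class pulled together, different pushed apart."""
    while True:
        a = rng.integers(n)
        pos = np.flatnonzero(y == y[a]); neg = np.flatnonzero(y != y[a])
        if len(pos) > 1:
            return a, rng.choice(pos), rng.choice(neg)

def project_psd(M):
    """Project onto the PSD cone by clipping negative eigenvalues (the costly step)."""
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0, None)) @ V.T

for t in range(1, 2001):
    grad = np.zeros((d, d))
    for _ in range(16):            # mini-batch of triplets
        a, p, q = sample_triplet()
        dp, dq = X[a] - X[p], X[a] - X[q]
        if dp @ M @ dp + margin > dq @ M @ dq:        # hinge loss is active
            grad += np.outer(dp, dp) - np.outer(dq, dq)
    M -= lr * grad / 16
    if t % 100 == 0:               # project only occasionally, not every update
        M = project_psd(M)
M = project_psd(M)
```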
Show less
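As a rough, hedged illustration of the saving described above (this is not the dissertation's algorithm; every function name and parameter below is invented for the example), the following Python sketch runs mini-batch sub-gradient updates on a Mahalanobis matrix under a triplet hinge loss and projects onto the PSD cone only once every few hundred updates, rather than after every update:

import numpy as np

def psd_projection(M):
    # Project a symmetric matrix onto the PSD cone via an O(d^3) eigendecomposition.
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.clip(w, 0, None)) @ V.T

def minibatch_dml(X, triplets, lr=0.01, batch=64, epochs=5, project_every=100):
    # X: (n, d) data matrix; triplets: list of (i, j, k) with i, j same class and i, k different.
    d = X.shape[1]
    M = np.eye(d)
    step = 0
    for _ in range(epochs):
        np.random.shuffle(triplets)
        for start in range(0, len(triplets), batch):
            G = np.zeros((d, d))
            for i, j, k in triplets[start:start + batch]:
                dij, dik = X[i] - X[j], X[i] - X[k]
                # Hinge triplet loss: penalize when dist(i, j) + 1 > dist(i, k).
                if dij @ M @ dij + 1 > dik @ M @ dik:
                    G += np.outer(dij, dij) - np.outer(dik, dik)
            M -= lr * G / batch
            step += 1
            if step % project_every == 0:  # infrequent PSD projection
                M = psd_projection(M)
    return psd_projection(M)  # final projection keeps the returned metric valid

The eigendecomposition inside psd_projection is the cubic-cost step the description refers to; amortizing it over many mini-batch updates is where the savings would come from in a scheme of this general shape.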
- Title
- Image annotation and tag completion via kernel metric learning and noisy matrix recovery
- Creator
- Feng, Zheyun
- Date
- 2016
- Collection
- Electronic Theses & Dissertations
- Description
-
In the last several years, with the ever-growing popularity of digital photography and social media, the number of images with user-provided tags has increased enormously. Due to the large amount and content versatility of these images, there is an urgent need to categorize, index, retrieve, and browse them via semantic tags (also called attributes or keywords). Following this trend, image annotation and tag completion from missing and noisy tags over large-scale datasets have become highly active topics at the intersection of machine learning and computer vision.

The overarching goal of this thesis is to revisit image annotation and tag completion algorithms so that they capture the essential relationships both between and within images and tags, even when the given tag information is incomplete or noisy, and thereby achieve better effectiveness and efficiency in image annotation and other tag-related tasks, including tag completion, tag ranking, and tag refinement.

One of the key challenges in search-based image annotation models is to define an appropriate similarity measure (distance metric) between images, so that unlabeled images can be assigned the tags shared among similar labeled training images. Many kernel metric learning (KML) algorithms have been developed to serve as such a nonlinear distance metric. However, most of them suffer from high computational cost, since the learned kernel metric needs to be projected onto the positive semi-definite (PSD) cone. Moreover, in image annotation tasks, existing KML algorithms require converting image annotation tags into binary constraints, which leads to significant semantic information loss and severely reduces annotation performance.

In this dissertation we propose a robust kernel metric learning (RKML) algorithm based on a regression technique that directly utilizes the image tags. RKML is computationally efficient since the PSD property is automatically ensured by the regression technique. Numeric constraints over tags are also applied to better exploit the tag information and hence improve annotation accuracy. Further, theoretical guarantees for RKML are provided, and its efficiency and effectiveness are verified empirically by comparing it to state-of-the-art approaches to both distance metric learning and image annotation.

Since user-provided image tags are often incomplete and noisy, we also propose a tag completion algorithm based on noisy matrix recovery (TCMR) that simultaneously fills in missing tags and removes noisy ones. TCMR assumes that the observed tags are independently sampled from unknown distributions represented by a tag matrix, and the goal is to recover that tag matrix from the partially revealed, possibly noisy tags. We provide theoretical guarantees for TCMR in the form of recovery error bounds. In addition, a graph-Laplacian-based component is introduced to enforce consistency between the recovered tags and the visual content of the images. Our empirical study on multiple benchmark datasets for image tagging shows that the proposed algorithm outperforms state-of-the-art approaches in both effectiveness and efficiency when handling missing and noisy tags.
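To make the graph-Laplacian idea in the description concrete (a toy sketch only, not the TCMR model; the least-squares objective and all names below are assumptions made for illustration), one can complete a partially observed image-tag matrix by solving, independently for each tag, a small linear system that trades off agreement with the observed entries against smoothness over a visual-similarity graph:

import numpy as np

def visual_laplacian(features, sigma=1.0):
    # Unnormalized graph Laplacian from a Gaussian similarity over image features (n, d).
    sq = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return np.diag(W.sum(axis=1)) - W

def complete_tags(Y, observed, L, lam=0.1):
    # Y: (n, m) noisy 0/1 tag matrix; observed: (n, m) boolean mask of revealed entries.
    # For each tag column t, solve (diag(mask) + lam * L) t = mask * y.
    n, m = Y.shape
    T = np.zeros((n, m))
    for j in range(m):
        mask = observed[:, j].astype(float)
        A = np.diag(mask) + lam * L + 1e-8 * np.eye(n)  # tiny ridge keeps A invertible
        T[:, j] = np.linalg.solve(A, mask * Y[:, j])
    return T

Each tag column decouples, so the cost is m solves of an n-by-n system; the TCMR formulation summarized above goes further by modeling how the observed tags are sampled and by providing recovery error bounds.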
- Title
- Cortex-inspired developmental learning networks for stereo vision
- Creator
- Solgi, Mojtaba
- Date
- 2013
- Collection
- Electronic Theses & Dissertations
- Description
-
How does the human brain make sense of the 3D world when its visual inputs, the retinal images, are only two-dimensional? The brain exploits multiple depth cues to create a 3D model of the world. Despite the importance of this subject for both scientists and engineers, the underlying computational mechanisms of stereo vision in the human brain are still largely unknown. This thesis is an attempt at creating a developmental model of stereo vision in the visual cortex. By developmental we mean that the features of each neuron are developed rather than hand-crafted, so that limited resources are used optimally. This approach helps us learn more about biological stereo vision and also yields results superior to those of traditional computer vision approaches, e.g., under weak textures. Developmental networks, such as Where-What Networks (WWN), have shown promise for simultaneous attention and recognition while handling variations in scale, location, and type as well as inter-class variations. Moreover, in a simpler prior setting, they have shown sub-pixel accuracy in disparity detection on challenging natural images. However, the previous work on stereo vision was limited to 20-pixel stripes of shifted images and unable to scale to real-world problems. This dissertation presents work on building neuromorphic developmental models for stereo vision, focusing on (1) dynamic synapse retraction and growth as a method of developing more efficient receptive fields, (2) training on images that involve complex natural backgrounds, and (3) integration of depth perception with location and type information. In a setting with 5 object classes, 7 × 7 = 49 locations, and 11 disparity levels, the network achieves above a 95% recognition rate for object shapes, with under one pixel of disparity detection error and under 10 pixels of location error. These results are reported using challenging natural and synthetic textures on both background and foreground objects in disjoint testing.
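As a loose illustration of the synapse retraction-and-growth mechanism mentioned in the description (a generic toy, not the Where-What Network's actual update rule; the thresholds and names are invented), one can periodically retract a neuron's weakest active synapses and regrow a small number of inactive ones:

import numpy as np

def retract_and_grow(weights, retract_frac=0.1, grow_frac=0.05, rng=None):
    # weights: 1-D array of synaptic weights for one neuron; zeros mark retracted synapses.
    rng = np.random.default_rng() if rng is None else rng
    w = weights.copy()
    active_idx = np.flatnonzero(w != 0)
    # Retract: zero out the weakest fraction of currently active synapses.
    n_retract = int(retract_frac * active_idx.size)
    if n_retract > 0:
        weakest = active_idx[np.argsort(np.abs(w[active_idx]))[:n_retract]]
        w[weakest] = 0.0
    # Grow: re-initialize a small fraction of the inactive synapses with tiny weights.
    dead_idx = np.flatnonzero(w == 0)
    n_grow = min(int(grow_frac * w.size), dead_idx.size)
    if n_grow > 0:
        reborn = rng.choice(dead_idx, size=n_grow, replace=False)
        w[reborn] = rng.normal(0.0, 0.01, size=n_grow)
    return w

Repeating this during training concentrates a neuron's limited synapses on informative inputs, which is the general sense in which retraction and growth can shape more efficient receptive fields.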
- Title
- Signal processing and machine learning approaches to enabling advanced sensing and networking capabilities in everyday infrastructure and electronics
- Creator
- Ali, Kamran (Scientist)
- Date
- 2020
- Collection
- Electronic Theses & Dissertations
- Description
-
Mainstream commercial off-the-shelf (COTS) electronic devices of daily use are usually designed and manufactured to serve a very specific purpose. For example, WiFi routers and network interface cards (NICs) are designed for high-speed wireless communication, RFID readers and tags are designed to identify and track items in supply chains, and smartphone vibration motors are designed to provide haptic feedback (e.g., notifications in silent mode) to users. This dissertation focuses on revisiting the physical layer of such everyday COTS electronic devices, either to leverage the signals obtained from their physical layers to develop novel sensing applications, or to modify and improve their PHY/MAC layer protocols to enable even more useful deployment scenarios and networking applications, while keeping their original purpose intact, by introducing only software/firmware-level changes and avoiding any hardware changes. Adding new functionality to existing everyday infrastructure and electronics has advantages in both cost and convenience of deployment, as those devices (and their protocols) are already mainstream, easily available, and often already purchased and deployed for their original purpose.

In our work on sensing with WiFi signals, we propose signal processing and machine learning approaches to enable fine-grained gesture recognition and sleep monitoring using COTS WiFi devices. In our work on gesture recognition, we show for the first time that WiFi signals can be used to recognize small gestures with high accuracy. In our work on sleep monitoring, we propose for the first time a WiFi CSI based sleep quality monitoring scheme that robustly tracks breathing and body/limb activity related vital signs during sleep throughout a night, in an individual- and environment-independent manner.

In our work on sensing with RFID signals, we propose signal processing and machine learning approaches to effectively image customer activity in front of display items in places such as retail stores using COTS monostatic RFID devices (i.e., devices that use a single antenna at a time for both transmitting and receiving RFID signals to and from the tags). The key novelty of this work is achieving multi-person activity tracking in front of display items by constructing coarse-grained images via robust, analytical-model-driven, deep-learning-based RFID imaging. We implemented our scheme using a COTS RFID reader and tags.

In our work on smartphone vibration based sensing, we propose a robust and practical vibration-based sensing scheme that works with smartphones with different hardware, can extract fine-grained vibration signatures of different surfaces, and is robust to environmental noise and hardware-based irregularities. A useful application of this sensing is symbolic localization/tagging, e.g., determining whether a user's device is in their hand, in their pocket, or on their bedroom table. Such symbolic tagging of locations provides indirect information about user activities and intentions without any dedicated infrastructure, enabling useful services such as context-aware notifications and alarms. To make our scheme easily scalable and compatible with COTS smartphones, we design our signal processing and machine learning pipeline so that it relies only on the built-in vibration motor and microphone for sensing, and so that it is robust to hardware irregularities and background environmental noise. We tested our scheme on two different Android smartphones.

In our work on powerline communications (PLC), we propose a distributed spectrum sharing scheme for enterprise-level PLC mesh networks. This work is a major step towards using existing COTS PLC devices to connect different types of Internet of Things (IoT) devices for sensing and control applications in large campuses such as enterprises. Our work is based on identifying a key weakness of the existing HomePlug AV (HPAV) PLC protocol: it does not support spectrum sharing, i.e., each link operates over the whole available spectrum, so only one link can operate at a time. Our proposed spectrum sharing scheme significantly boosts both aggregate and per-link throughput by allowing multiple links to communicate concurrently, while requiring only a few modifications to the existing HPAV protocol.
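As a hedged sketch of the signal-processing-plus-learning pipeline these sensing works share (not the dissertation's actual pipeline; the window length, filter choice, and feature set are assumptions), one might denoise per-subcarrier WiFi CSI amplitude streams, cut them into overlapping windows, and extract simple statistics for a standard classifier:

import numpy as np
from scipy.signal import medfilt

def csi_features(amplitudes, window=256, step=128):
    # amplitudes: (time, subcarriers) array of CSI magnitudes from a COTS WiFi NIC.
    feats = []
    for start in range(0, amplitudes.shape[0] - window + 1, step):
        seg = medfilt(amplitudes[start:start + window], kernel_size=(5, 1))  # remove impulsive noise
        feats.append(np.concatenate([
            seg.mean(axis=0),                           # per-subcarrier mean amplitude
            seg.std(axis=0),                            # per-subcarrier variability
            np.abs(np.diff(seg, axis=0)).mean(axis=0),  # average frame-to-frame change
        ]))
    return np.array(feats)

# The windowed feature matrix could then be fed to any off-the-shelf classifier
# (e.g., a random forest) trained on labeled gesture or sleep-activity segments.

Gesture recognition, breathing tracking, RFID imaging, and vibration signatures would each need their own front end, but the pattern of robust preprocessing followed by learned classification or regression is the common thread across the works described above.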