Search results (1–5 of 5)
 Title
 Computational identification and analysis of noncoding RNAs in largescale biological data
 Creator
 Lei, Jikai
 Date
 2015
 Collection
 Electronic Theses & Dissertations
 Description

Non-protein-coding RNAs (ncRNAs) are RNA molecules that function directly at the level of RNA without being translated into protein. They play important biological roles in all three domains of life: Eukarya, Bacteria, and Archaea. To understand the working mechanisms and functions of ncRNAs in various species, a fundamental step is to identify both known and novel ncRNAs in large-scale biological data.

Large-scale genomic data includes both genomic sequence data and NGS sequencing data, and both types provide great opportunities for identifying ncRNAs. For genomic sequence data, many ncRNA identification tools based on comparative sequence analysis have been developed. These methods work well for ncRNAs with strong sequence similarity, but they are not well suited to detecting ncRNAs that are only remotely homologous. Next-generation sequencing (NGS), while it opens a new horizon for annotating and understanding known and novel ncRNAs, also introduces many challenges. First, existing genomic sequence search tools cannot be readily applied to NGS data because NGS technology produces short, fragmentary reads. Second, most NGS data sets are large-scale, and existing algorithms are infeasible on them because of their high resource requirements. Third, metagenomic sequencing, which uses NGS technology to sequence uncultured, complex microbial communities directly from their natural habitats, further aggravates these difficulties. The massive amount of genomic sequence data and NGS data thus calls for efficient algorithms and tools for ncRNA annotation.

In this dissertation, I present three computational methods and tools to efficiently identify ncRNAs in large-scale biological data. Chain-RNA combines sequence similarity and structural similarity to locate cross-species conserved RNA elements with low sequence similarity in genomic sequence data. It achieves significantly higher sensitivity in identifying remotely conserved ncRNA elements than sequence-based methods such as BLAST, and it is much faster than existing structural alignment tools. miR-PREFeR (miRNA PREdiction From small RNA-Seq data) uses the expression patterns of miRNAs and follows the criteria for plant microRNA annotation to accurately predict plant miRNAs from one or more small RNA-Seq data samples. It is sensitive, accurate, fast, and has a low memory footprint. metaCRISPR identifies Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) in large-scale metagenomic sequencing data. It uses a k-mer hash table to efficiently detect reads that belong to CRISPRs in the raw metagenomic data set; overlap-graph-based clustering is then performed on the reduced data set to separate different CRISPRs, and a set of graph-based algorithms assembles and recovers the CRISPRs from the clusters.
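The k-mer hash table step can be illustrated with a minimal sketch (not the dissertation's actual metaCRISPR algorithm): because CRISPR loci contain near-identical repeats, reads drawn from them share many k-mers with other reads, so counting k-mers in a hash table and keeping only repeat-rich reads shrinks the data set before any clustering. The thresholds `k`, `min_kmer_count`, and `min_frac` here are illustrative assumptions.

```python
from collections import Counter

def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def filter_crispr_candidates(reads, k=8, min_kmer_count=3, min_frac=0.5):
    """Keep reads in which at least min_frac of the k-mers occur at least
    min_kmer_count times across the whole data set (repeat-rich reads)."""
    counts = Counter()                 # the k-mer hash table
    for r in reads:
        counts.update(kmers(r, k))
    kept = []
    for r in reads:
        kms = kmers(r, k)
        if not kms:
            continue
        frequent = sum(1 for km in kms if counts[km] >= min_kmer_count)
        if frequent / len(kms) >= min_frac:
            kept.append(r)
    return kept
```

The surviving reads would then feed the overlap-graph clustering stage described above.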
 Title
 Estimating covariance structure in high dimensions
 Creator
 Maurya, Ashwini
 Date
 2016
 Collection
 Electronic Theses & Dissertations
 Description

"Many of scientific domains rely on extracting knowledge from highdimensional data sets to provide insights into complex mechanisms underlying these data. Statistical modeling has become ubiquitous in the analysis of high dimensional data for exploring the largescale gene regulatory networks in hope of developing better treatments for deadly diseases, in search of better understanding of cognitive systems, and in prediction of volatility in stock market in the hope of averting the potential...
Show more"Many of scientific domains rely on extracting knowledge from highdimensional data sets to provide insights into complex mechanisms underlying these data. Statistical modeling has become ubiquitous in the analysis of high dimensional data for exploring the largescale gene regulatory networks in hope of developing better treatments for deadly diseases, in search of better understanding of cognitive systems, and in prediction of volatility in stock market in the hope of averting the potential risk. Statistical analysis in these highdimensional data sets yields better results only if an estimation procedure exploits hidden structures underlying the data. This thesis develops flexible estimation procedures with provable theoretical guarantees for estimating the unknown covariance structures underlying data generating process. Of particular interest are procedures that can be used on high dimensional data sets where the number of samples n is much smaller than the ambient dimension p. Due to the importance of structure estimation, the methodology is developed for the estimation of both covariance and its inverse in parametric and as well in nonparametric framework."Page ii.
 Title
 Hierarchical learning for large multiclass classification in network data
 Creator
 Liu, Lei
 Date
 2015
 Collection
 Electronic Theses & Dissertations
 Description

Multiclass learning from network data is an important but challenging problem with many applications, including malware detection in computer networks, user modeling in social networks, and protein function prediction in biological networks. Despite extensive research on large multiclass learning, numerous issues have not been sufficiently addressed, such as the efficiency of model testing, the interpretability of the induced models, and the ability to handle imbalanced classes. To overcome these challenges, there has been increasing interest in recent years in hierarchical learning methods for large multiclass problems. Unfortunately, none of them were designed for classifying network data. In addition, very few studies are devoted to classifying heterogeneous networks, where the nodes may have different feature sets. This thesis aims to overcome these limitations with the following contributions.

First, since the number of classes in big data applications can be very large, ranging from thousands to possibly millions, two hierarchical learning schemes are proposed for these so-called extreme multiclass learning problems. The first approach, recursive non-negative matrix factorization (RNMF), is designed to achieve sublinear runtime when classifying test data. Although RNMF reduces test time significantly, it may assign the same class to multiple leaf nodes, which hampers the interpretability of the model as a concept hierarchy for the classes. Furthermore, because RNMF employs a greedy strategy to partition the classes, there is no theoretical guarantee that the partitions generated by the tree lead to a globally optimal solution.

To address the limitations of RNMF, an alternative hierarchical learning method, the matrix factorization tree (MF-Tree), is proposed. Unlike RNMF, MF-Tree is designed to optimize a global objective function while learning its taxonomy structure. A formal proof shows the equivalence between the objective function of MF-Tree and the Hilbert-Schmidt Independence Criterion (HSIC). Furthermore, to improve training efficiency, a fast algorithm for inducing an approximate MF-Tree is developed.

Next, an extension of MF-Tree to network data is proposed. This approach seamlessly integrates both the link structure and node attribute information into a unified learning framework. To the best of our knowledge, this is the first study that automatically constructs a taxonomy structure to predict large multiclass problems for network classification. Empirical results suggest that the approach effectively captures the relationships between classes and generates class taxonomy structures that resemble those produced by human experts. The approach is also easily parallelized and has been implemented in a MapReduce framework.

Finally, we introduce a network learning task known as co-classification, which classifies heterogeneous nodes in multiple networks. Unlike existing node classification problems, the goal of co-classification is to learn the classifiers for multiple networks jointly rather than classifying each network independently. The framework proposed in this thesis can exploit prior information about the relationships between classes in different networks to improve prediction accuracy.
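The sublinear-test-time idea behind such hierarchical schemes can be sketched generically (RNMF and MF-Tree use matrix factorization to build their trees; this toy version instead splits class centroids with a deterministic 2-means): a test point descends a binary tree over the classes, so it is compared against O(log C) node centers rather than all C classes.

```python
import numpy as np

class Node:
    def __init__(self, classes):
        self.classes = classes            # class ids reachable from this node
        self.left = self.right = None
        self.left_center = self.right_center = None

def build_label_tree(centroids, classes):
    """Recursively split the classes into two groups via a simple 2-means
    on their centroids, seeded with the two most distant centroids."""
    node = Node(classes)
    if len(classes) == 1:
        return node
    pts = centroids[classes]
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    c0, c1 = pts[i], pts[j]
    assign = np.linalg.norm(pts - c0, axis=1) <= np.linalg.norm(pts - c1, axis=1)
    for _ in range(10):                   # a few Lloyd iterations
        c0 = pts[assign].mean(axis=0)
        c1 = pts[~assign].mean(axis=0)
        new = np.linalg.norm(pts - c0, axis=1) <= np.linalg.norm(pts - c1, axis=1)
        if new.all() or (~new).all() or (new == assign).all():
            break
        assign = new
    node.left_center, node.right_center = c0, c1
    node.left = build_label_tree(centroids, [c for c, a in zip(classes, assign) if a])
    node.right = build_label_tree(centroids, [c for c, a in zip(classes, assign) if not a])
    return node

def predict(node, x):
    """Descend from the root, always following the nearer child center."""
    while node.left is not None:
        near_left = (np.linalg.norm(x - node.left_center)
                     <= np.linalg.norm(x - node.right_center))
        node = node.left if near_left else node.right
    return node.classes[0]
```

Unlike this greedy toy, MF-Tree optimizes a global objective while the tree is learned, which is precisely the property the paragraph above contrasts with RNMF.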
 Title
 Kernelbased clustering of big data
 Creator
 Chitta, Radha
 Date
 2015
 Collection
 Electronic Theses & Dissertations
 Description

There has been a rapid increase in the volume of digital data in recent years. A study by IDC and EMC Corporation predicted the creation of 44 zettabytes (10^21 bytes) of digital data by the year 2020. Analyzing these massive amounts of data, popularly known as big data, requires highly scalable data analysis techniques. Clustering is an exploratory data analysis tool used to discover the underlying groups in data. The state-of-the-art algorithms for clustering big data sets are linear clustering algorithms, which assume that the data are linearly separable in the input space and use measures such as Euclidean distance to define inter-point similarities. Though efficient, linear clustering algorithms do not achieve high cluster quality on real-world data sets, which are not linearly separable. Kernel-based clustering algorithms employ nonlinear similarity measures and, as a result, can identify clusters of arbitrary shapes and densities. However, kernel-based clustering suffers from two major limitations: (i) its running time and memory complexity grow quadratically with the size of the data set, so it cannot scale to data sets containing billions of points; and (ii) its performance is highly sensitive to the choice of kernel similarity function. Ad hoc approaches relying on prior domain knowledge are currently used to choose the kernel function, and it is difficult to determine the appropriate kernel for a given data set.

In this thesis, we develop scalable approximate kernel-based clustering algorithms using random sampling and matrix approximation techniques. They can cluster big data sets containing billions of high-dimensional points not only as efficiently as linear clustering algorithms but also as accurately as classical kernel-based clustering algorithms.

Our first contribution is based on the premise that the similarity matrices of big data sets can usually be well approximated by low-rank matrices built from a subset of the data. We develop an approximate kernel-based clustering algorithm that clusters using a low-rank approximate kernel matrix constructed from a uniformly sampled small subset of the data. We show that the proposed algorithm has linear running time and low memory requirements, and that it achieves high cluster quality when provided with a sufficient number of samples. We also demonstrate that it can easily be parallelized to handle distributed data sets. We then employ nonlinear random feature maps to approximate the kernel similarity function, and we design clustering algorithms that improve the efficiency of both kernel-based clustering and label assignment for previously unseen data points.

Our next contribution is an online kernel-based clustering algorithm that can cluster potentially unbounded streaming data in real time. It intelligently samples the data stream and finds the cluster labels using the sampled points. The proposed scheme is more effective than current kernel-based and linear stream clustering techniques in terms of both efficiency and cluster quality.

We finally address high dimensionality and scalability to data sets containing a large number of clusters. Under the assumption that the kernel matrix is sparse when the number of clusters is large, we modify the above online kernel-based clustering scheme to perform clustering in a low-dimensional space spanned by the top eigenvectors of the sparse kernel matrix. The combination of sampling and sparsity further reduces the running time and memory complexity.

The proposed clustering algorithms can be applied in a number of real-world settings. We demonstrate their efficacy on several large benchmark text and image data sets. For instance, the proposed batch kernel clustering algorithms were used to cluster large image data sets (e.g., Tiny Images) containing up to 80 million images, and the proposed stream kernel clustering algorithm was used to cluster over a billion tweets from Twitter for hashtag recommendation.
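The sampling-plus-low-rank idea in the first contribution can be illustrated with a standard Nyström-style sketch (a toy under stated assumptions, not the thesis's exact algorithm): approximate the full kernel matrix from m uniformly sampled landmark points, embed all n points into the resulting low-dimensional space, and run ordinary k-means there, so the cost is linear in n rather than quadratic.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_features(X, m, gamma=0.5, seed=0):
    """Nystrom embedding: K is approximated by K_nm K_mm^+ K_nm^T, so the
    rows of K_nm K_mm^{-1/2} serve as low-dimensional feature vectors."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]
    K_mm = rbf(landmarks, landmarks, gamma)
    K_nm = rbf(X, landmarks, gamma)
    w, V = np.linalg.eigh(K_mm)
    keep = w > 1e-6 * w.max()            # drop near-null directions
    return K_nm @ V[:, keep] * (w[keep] ** -0.5)

def kmeans(Z, k, iters=50):
    """Plain Lloyd k-means with deterministic farthest-point seeding."""
    centers = [Z[0]]
    for _ in range(k - 1):
        d = np.min([((Z - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((Z[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = Z[labels == j].mean(axis=0)
    return labels
```

With m fixed, building the embedding costs O(nm) kernel evaluations plus an m x m eigendecomposition, which is the source of the linear running time claimed above.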
 Title
 Two applications of quantitative methods in education : sampling design effects in largescale data and causal inference of classsize effects
 Creator
 Shen, Ting (Graduate of Michigan State University)
 Date
 2018
 Collection
 Electronic Theses & Dissertations
 Description

"This dissertation is a collection of four papers in which the former two papers address the issues of external validity concerning incorporating complex sampling design in model analysis in largescale data and the latter two papers address issues of internal validity involving statistical methods that facilitate causal inference of class size effects. Chapter 1 addressed whether, when and how to apply complex sampling weights via empirical, simulation and software investigations in the...
Show more"This dissertation is a collection of four papers in which the former two papers address the issues of external validity concerning incorporating complex sampling design in model analysis in largescale data and the latter two papers address issues of internal validity involving statistical methods that facilitate causal inference of class size effects. Chapter 1 addressed whether, when and how to apply complex sampling weights via empirical, simulation and software investigations in the context of largescale educational data focusing on fixed effects. The empirical evidences reveal that unweighted estimates agree with the weighted cases and two scaling methods make no difference. The possible difference between weighted single versus multilevel model may lie in the scaling procedure in the latter. The simulation results indicate that relative bias of the estimates in the models of unweighted single level, unweighted multilevel, weighted single level and weighted multilevel varies across different variables, but unweighted multilevel has the smallest root mean square errors consistently while weighted single model has the largest values for levelone variables. The software finding indicates that STATA and Mplus are more flexible and capable especially for weighted multilevel models where scaling is required. Chapter 2 investigated how to account for informative design arising from unequal probability of selection in multilevel modeling with a focus of the multilevel pseudo maximum likelihood (MPML) and the sample distribution approach (SDA). The Monte Carlo simulation evaluated the performance of MPML considering sampling weights and scaling. The results indicate that unscaled estimates have substantial positive bias for estimating cluster and individuallevel variations, thus the scaling procedure is essential. 
The SDA is conducted using empirical data, and the results are similar to the unweighted case which seems that the sampling design is not that informative or SDA is not working well in practice. Chapter 3 examined the longterm and causal inferences of class size effects on reading and mathematics achievement as well as on noncognitive outcomes in early grades via applying individual fixed effects models and propensity scores methods on the data of ECLSK 2011. Results indicate that attending smaller class improves reading and math achievement. In general, evidence of class size effects on noncognitive outcomes is not significant. Considering potential measurement errors involved in noncognitive variables, evidence of class size effects on noncognitive domain is less reliable. Chapter 4 applied instrumental variables (IV) methods and regression discontinuity designs (RDD) on TIMSS data in 2003, 2007 and 2011 to investigate whether class size has effects on eighth grader's cognitive achievement and noncognitive outcomes in math and four science subjects across four European countries (i.e., Hungary, Lithuania, Romania and Slovenia). The results of the IV analyses indicate that in Romania smaller class size has significant positive effects on academic scores for math, physics, chemistry and earth science as well as for math enjoyment in 2003. In Lithuania, class size effects on noncognitive skills are not consistent between IV and RDD analyses in 2007. Overall, the small class size benefit on achievement scores is only observed in Romania in 2003 while evidence of classsize effects on noncognitive skills may lack of reliability."Pages iiiii.
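The role of the sampling weights discussed in Chapters 1 and 2 can be illustrated with the simplest design-weighted estimator (a textbook sketch, not taken from the dissertation): weighting each sampled unit by the inverse of its inclusion probability corrects the bias an unweighted estimate incurs under an informative design in which some units are oversampled.

```python
import numpy as np

def hajek_mean(y, incl_prob):
    """Hajek (ratio) estimator of a population mean: each sampled unit is
    weighted by the inverse of its inclusion probability, w_i = 1 / pi_i."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(incl_prob, dtype=float)
    return float(np.sum(w * y) / np.sum(w))
```

For example, if a unit with y = 50 was five times as likely to be sampled as two units with y = 10, the weighted mean down-weights it toward the population value, while the unweighted mean overstates it.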