Search results
(1 - 5 of 5)
- Title
- Two applications of quantitative methods in education : sampling design effects in large-scale data and causal inference of class-size effects
- Creator
- Shen, Ting (Graduate of Michigan State University)
- Date
- 2018
- Collection
- Electronic Theses & Dissertations
- Description
-
"This dissertation is a collection of four papers in which the former two papers address the issues of external validity concerning incorporating complex sampling design in model analysis in large-scale data and the latter two papers address issues of internal validity involving statistical methods that facilitate causal inference of class size effects. Chapter 1 addressed whether, when and how to apply complex sampling weights via empirical, simulation and software investigations in the...
Show more"This dissertation is a collection of four papers in which the former two papers address the issues of external validity concerning incorporating complex sampling design in model analysis in large-scale data and the latter two papers address issues of internal validity involving statistical methods that facilitate causal inference of class size effects. Chapter 1 addressed whether, when and how to apply complex sampling weights via empirical, simulation and software investigations in the context of large-scale educational data focusing on fixed effects. The empirical evidences reveal that unweighted estimates agree with the weighted cases and two scaling methods make no difference. The possible difference between weighted single versus multi-level model may lie in the scaling procedure in the latter. The simulation results indicate that relative bias of the estimates in the models of unweighted single level, unweighted multilevel, weighted single level and weighted multi-level varies across different variables, but unweighted multilevel has the smallest root mean square errors consistently while weighted single model has the largest values for level-one variables. The software finding indicates that STATA and Mplus are more flexible and capable especially for weighted multi-level models where scaling is required. Chapter 2 investigated how to account for informative design arising from unequal probability of selection in multilevel modeling with a focus of the multilevel pseudo maximum likelihood (MPML) and the sample distribution approach (SDA). The Monte Carlo simulation evaluated the performance of MPML considering sampling weights and scaling. The results indicate that unscaled estimates have substantial positive bias for estimating cluster- and individual-level variations, thus the scaling procedure is essential. The SDA is conducted using empirical data, and the results are similar to the unweighted case which seems that the sampling design is not that informative or SDA is not working well in practice. Chapter 3 examined the long-term and causal inferences of class size effects on reading and mathematics achievement as well as on non-cognitive outcomes in early grades via applying individual fixed effects models and propensity scores methods on the data of ECLS-K 2011. Results indicate that attending smaller class improves reading and math achievement. In general, evidence of class size effects on non-cognitive outcomes is not significant. Considering potential measurement errors involved in non-cognitive variables, evidence of class size effects on noncognitive domain is less reliable. Chapter 4 applied instrumental variables (IV) methods and regression discontinuity designs (RDD) on TIMSS data in 2003, 2007 and 2011 to investigate whether class size has effects on eighth grader's cognitive achievement and non-cognitive outcomes in math and four science subjects across four European countries (i.e., Hungary, Lithuania, Romania and Slovenia). The results of the IV analyses indicate that in Romania smaller class size has significant positive effects on academic scores for math, physics, chemistry and earth science as well as for math enjoyment in 2003. In Lithuania, class size effects on noncognitive skills are not consistent between IV and RDD analyses in 2007. Overall, the small class size benefit on achievement scores is only observed in Romania in 2003 while evidence of class-size effects on non-cognitive skills may lack of reliability."--Pages ii-iii.
Show less
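The weighting question in Chapter 1 of the abstract above comes down to comparing design-weighted and unweighted estimators under an informative sampling design. The following sketch is a minimal, hypothetical Monte Carlo illustration of that contrast; the data-generating process, selection rule, and replication count are assumptions chosen for illustration and are not taken from the dissertation.

```python
# Hypothetical sketch: unweighted vs. design-weighted regression estimates
# under an informative (unequal-probability) sampling design.
import numpy as np

rng = np.random.default_rng(0)

def one_replication(n_pop=10_000, n_sample=1_000):
    # Population model (assumed): y = 1 + 2*x + e
    x = rng.normal(size=n_pop)
    y = 1.0 + 2.0 * x + rng.normal(size=n_pop)
    # Informative design: selection probability depends on the outcome y.
    p = 1.0 / (1.0 + np.exp(-(y - y.mean())))
    p = p / p.sum() * n_sample                 # expected sample size ~ n_sample
    keep = rng.random(n_pop) < p
    xs, ys, w = x[keep], y[keep], 1.0 / p[keep]  # design weights = inverse inclusion probabilities

    X = np.column_stack([np.ones(keep.sum()), xs])
    # Unweighted OLS: beta = (X'X)^{-1} X'y
    b_unw = np.linalg.solve(X.T @ X, X.T @ ys)
    # Design-weighted (pseudo-likelihood style) estimator: beta = (X'WX)^{-1} X'Wy
    XtW = X.T * w
    b_w = np.linalg.solve(XtW @ X, XtW @ ys)
    return b_unw, b_w

reps = [one_replication() for _ in range(200)]
print("mean unweighted estimates:", np.mean([r[0] for r in reps], axis=0))
print("mean weighted estimates:  ", np.mean([r[1] for r in reps], axis=0))
```

Because selection here depends on the outcome, the unweighted fit is biased, while the inverse-probability weights should approximately recover the population coefficients (1, 2); when selection is not informative, the two estimators agree, which mirrors the empirical finding reported above.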
- Title
- Kernel-based clustering of big data
- Creator
- Chitta, Radha
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
-
There has been a rapid increase in the volume of digital data in recent years. A study by IDC and EMC Corporation predicted the creation of 44 zettabytes (10^21 bytes) of digital data by the year 2020. Analysis of these massive amounts of data, popularly known as big data, necessitates highly scalable data analysis techniques. Clustering is an exploratory data analysis tool used to discover the underlying groups in the data. The state-of-the-art algorithms for clustering big data sets are linear clustering algorithms, which assume that the data is linearly separable in the input space and use measures such as the Euclidean distance to define inter-point similarities. Though efficient, linear clustering algorithms do not achieve high cluster quality on real-world data sets, which are not linearly separable. Kernel-based clustering algorithms employ non-linear similarity measures to define the inter-point similarities. As a result, they are able to identify clusters of arbitrary shapes and densities. However, kernel-based clustering techniques suffer from two major limitations: (i) their running time and memory complexity increase quadratically with the size of the data set, so they cannot scale up to data sets containing billions of data points; and (ii) their performance is highly sensitive to the choice of the kernel similarity function. Ad hoc approaches relying on prior domain knowledge are currently employed to choose the kernel function, and it is difficult to determine the appropriate kernel similarity function for a given data set.

In this thesis, we develop scalable approximate kernel-based clustering algorithms using random sampling and matrix approximation techniques. They can cluster big data sets containing billions of high-dimensional points not only as efficiently as linear clustering algorithms but also as accurately as classical kernel-based clustering algorithms.

Our first contribution is based on the premise that the similarity matrices corresponding to big data sets can usually be well approximated by low-rank matrices built from a subset of the data. We develop an approximate kernel-based clustering algorithm which uses a low-rank approximate kernel matrix, constructed from a uniformly sampled small subset of the data, to perform clustering. We show that the proposed algorithm has linear running time complexity and low memory requirements, and that it achieves high cluster quality when provided with a sufficient number of data samples. We also demonstrate that the proposed algorithm can be easily parallelized to handle distributed data sets. We then employ non-linear random feature maps to approximate the kernel similarity function, and design clustering algorithms which enhance the efficiency of kernel-based clustering as well as label assignment for previously unseen data points. Our next contribution is an online kernel-based clustering algorithm that can cluster potentially unbounded stream data in real time. It intelligently samples the data stream and finds the cluster labels using these sampled points. The proposed scheme is more effective than current kernel-based and linear stream clustering techniques, both in terms of efficiency and cluster quality. We finally address the issues of high dimensionality and scalability to data sets containing a large number of clusters. Under the assumption that the kernel matrix is sparse when the number of clusters is large, we modify the above online kernel-based clustering scheme to perform clustering in a low-dimensional space spanned by the top eigenvectors of the sparse kernel matrix. The combination of sampling and sparsity further reduces the running time and memory complexity. The proposed clustering algorithms can be applied in a number of real-world applications. We demonstrate the efficacy of our algorithms using several large benchmark text and image data sets. For instance, the proposed batch kernel clustering algorithms were used to cluster large image data sets (e.g., Tiny) containing up to 80 million images. The proposed stream kernel clustering algorithm was used to cluster over a billion tweets from Twitter for hashtag recommendation.
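The first contribution described above, clustering with a low-rank kernel approximation built from a uniformly sampled subset, is in the spirit of Nyström-style approximate kernel k-means. Below is a minimal sketch under that interpretation; the RBF kernel, bandwidth, sample size, and toy data are illustrative assumptions and do not reproduce the thesis algorithms.

```python
# A minimal Nystrom-style sketch of approximate kernel clustering:
# build a low-rank feature map from a random subset, then run linear k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def rbf(A, B, gamma):
    # Pairwise RBF kernel between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def approx_kernel_kmeans(X, k, m=200, gamma=0.5, seed=0):
    n = X.shape[0]
    idx = np.random.default_rng(seed).choice(n, size=min(m, n), replace=False)
    K_nm = rbf(X, X[idx], gamma)                 # kernel between all points and the sample
    K_mm = K_nm[idx]                             # kernel among the sampled points
    # Low-rank feature map: Phi = K_nm @ V @ Lambda^{-1/2}
    vals, vecs = np.linalg.eigh(K_mm)
    keep = vals > 1e-8
    Phi = K_nm @ (vecs[:, keep] / np.sqrt(vals[keep]))
    # Linear k-means in the approximate feature space ~ kernel k-means.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Phi)

# Toy usage: two concentric rings, a standard non-linearly-separable example.
t = rng.uniform(0, 2 * np.pi, 600)
r = np.repeat([1.0, 3.0], 300)
X = np.column_stack([r * np.cos(t), r * np.sin(t)]) + 0.05 * rng.normal(size=(600, 2))
labels = approx_kernel_kmeans(X, k=2, m=100, gamma=2.0)
print(np.bincount(labels))                        # cluster sizes
```

The memory cost of this sketch is O(nm) rather than O(n^2), which is the basic trade-off the abstract describes: accuracy close to full kernel clustering once enough points are sampled, at a cost close to linear clustering.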
- Title
- Estimating covariance structure in high dimensions
- Creator
- Maurya, Ashwini
- Date
- 2016
- Collection
- Electronic Theses & Dissertations
- Description
-
"Many of scientific domains rely on extracting knowledge from high-dimensional data sets to provide insights into complex mechanisms underlying these data. Statistical modeling has become ubiquitous in the analysis of high dimensional data for exploring the large-scale gene regulatory networks in hope of developing better treatments for deadly diseases, in search of better understanding of cognitive systems, and in prediction of volatility in stock market in the hope of averting the potential...
Show more"Many of scientific domains rely on extracting knowledge from high-dimensional data sets to provide insights into complex mechanisms underlying these data. Statistical modeling has become ubiquitous in the analysis of high dimensional data for exploring the large-scale gene regulatory networks in hope of developing better treatments for deadly diseases, in search of better understanding of cognitive systems, and in prediction of volatility in stock market in the hope of averting the potential risk. Statistical analysis in these high-dimensional data sets yields better results only if an estimation procedure exploits hidden structures underlying the data. This thesis develops flexible estimation procedures with provable theoretical guarantees for estimating the unknown covariance structures underlying data generating process. Of particular interest are procedures that can be used on high dimensional data sets where the number of samples n is much smaller than the ambient dimension p. Due to the importance of structure estimation, the methodology is developed for the estimation of both covariance and its inverse in parametric and as well in non-parametric framework."--Page ii.
Show less
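As a concrete but generic example of exploiting hidden structure when n is much smaller than p, the sketch below soft-thresholds the off-diagonal entries of the sample covariance for data drawn from a sparse (banded) covariance. This is a standard thresholding estimator chosen for illustration; it is not the estimator proposed in the thesis, and the threshold and simulated model are assumptions.

```python
# Generic sketch: entrywise soft-thresholding of the sample covariance
# in a p >> n setting with a sparse (tridiagonal) true covariance.
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold_cov(X, lam):
    """Soft-threshold the off-diagonal entries of the sample covariance."""
    S = np.cov(X, rowvar=False)
    T = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
    np.fill_diagonal(T, np.diag(S))        # leave the variances untouched
    return T

# Simulate n = 50 samples in p = 200 dimensions.
n, p = 50, 200
Sigma = np.eye(p) + 0.4 * np.eye(p, k=1) + 0.4 * np.eye(p, k=-1)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

S_hat = np.cov(X, rowvar=False)
T_hat = soft_threshold_cov(X, lam=0.3)
print("Frobenius error, sample covariance:", np.linalg.norm(S_hat - Sigma))
print("Frobenius error, thresholded      :", np.linalg.norm(T_hat - Sigma))
```

With n = 50 and p = 200 the raw sample covariance is noisy and rank-deficient; shrinking small off-diagonal entries toward zero exploits the sparsity of the true structure, which is the kind of structural assumption the abstract refers to.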
- Title
- Community detection in temporal multi-layer networks
- Creator
- Al-sharoa, Esraa Mustafa
- Date
- 2019
- Collection
- Electronic Theses & Dissertations
- Description
-
"Many real world systems and relational data can be modeled as networks or graphs. With the availability of large amounts of network data, it is important to be able to reduce the network's dimensionality and extract useful information from it. A key approach to network data reduction is community detection. The objective of community detection is to summarize the network by a set of modules, where the similarity within the modules is maximized while the similarity between different modules...
Show more"Many real world systems and relational data can be modeled as networks or graphs. With the availability of large amounts of network data, it is important to be able to reduce the network's dimensionality and extract useful information from it. A key approach to network data reduction is community detection. The objective of community detection is to summarize the network by a set of modules, where the similarity within the modules is maximized while the similarity between different modules is minimized. Early work in graph based community detection methods focused on static or single layer networks. This type of networks is usually considered as an oversimplification of many real world complex systems, such as social networks where there may be different types of relationships that evolve with time. Consequently, there is a need for a meaningful representation of such complex systems. Recently, multi-layer networks have been used to model complex systems where the objects may interact through different mechanisms. However, there is limited amount of work in community detection methods for dynamic and multi-layer networks. In this thesis, we focus on detecting and tracking the community structure in dynamic and multi-layer networks. Two particular applications of interest are considered including temporal social networks and dynamic functional connectivity networks (dFCNs) of the brain. In order to detect the community structure in dynamic single-layer and multi-layer networks, we have developed methods that capture the structure of these complex networks. In Chapter 2, a low-rank + sparse estimation based evolutionary spectral clustering approach is proposed to detect and track the community structure in temporal networks. The proposed method tries to decompose the network into low-rank and sparse parts and obtain smooth cluster assignments by minimizing the subspace distance between consecutive time points, simultaneously. Effectiveness of the proposed approach is evaluated on several synthetic and real social temporal networks and compared to the existing state-of-the-art algorithms. As the method developed in Chapter 2 is limited to dynamic single-layer networks and can only take limited amount of historic information into account, a tensor-based approach is developed in Chapter 3 to detect the community structure in dynamic single-layer and multi-layer networks. The proposed framework is used to track the change points as well as identify the community structure across time and multiple subjects of dFCNs constructed from resting state functional magnetic resonance imaging (rs-fMRI) data. The dFCNs are summarized into a set of FC states that are consistent over time and subjects. The detected community structures are evaluated using a consistency measure. In Chapter 4, an information-theoretic approach is introduced to aggregate the dynamic networks and identify the time points that are topologically similar to combine them into a tensor. The community structure of the reduced network is then detected using a tensor based approach similar to the one described in Chapter 3. In Chapter 5, a temporal block spectral clustering framework is introduced to detect and track the community structure of multi-layer temporal networks. A set of intra- and inter-adjacency matrices is constructed and combined to create a set of temporal supra-adjacency matrices. In particular, both the connections between nodes of the network within a time window, i.e. 
intra-layer adjacency, as well as the connections between nodes across different time windows, i.e. inter-layer adjacency are taken into account. The community structure is then detected by applying spectral clustering to these supra-adjacency matrices. The proposed approach is evaluated on dFCNs constructed from rs-fMRI across time and subjects revealing dynamic connectivity patterns between the resting state networks (RSNs)."--Pages ii-iii.
Show less
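The supra-adjacency construction in Chapter 5 of the abstract above can be illustrated with a small sketch: stack the intra-layer adjacency matrices on the block diagonal, couple each node to itself in adjacent time windows, and apply spectral clustering to the result. The coupling weight, planted-partition toy layers, and use of the unnormalized Laplacian below are illustrative assumptions, not the exact formulation in the thesis.

```python
# Sketch: spectral community detection on a temporal supra-adjacency matrix.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def supra_adjacency(layers, omega=1.0):
    """Block-diagonal intra-layer adjacencies plus node-to-itself coupling
    between consecutive layers, with coupling weight omega."""
    T, n = len(layers), layers[0].shape[0]
    A = np.zeros((T * n, T * n))
    for t, At in enumerate(layers):
        A[t * n:(t + 1) * n, t * n:(t + 1) * n] = At
    for t in range(T - 1):
        A[t * n:(t + 1) * n, (t + 1) * n:(t + 2) * n] = omega * np.eye(n)
        A[(t + 1) * n:(t + 2) * n, t * n:(t + 1) * n] = omega * np.eye(n)
    return A

def spectral_communities(A, k):
    L = np.diag(A.sum(1)) - A                    # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]                              # k smallest eigenvectors
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Toy example: 3 layers of a 2-block planted-partition graph on 40 nodes.
n, k = 40, 2
blocks = np.repeat([0, 1], n // 2)
layers = []
for _ in range(3):
    P = np.where(blocks[:, None] == blocks[None, :], 0.4, 0.05)
    At = (rng.random((n, n)) < P).astype(float)
    At = np.triu(At, 1)
    At = At + At.T                               # symmetric, no self-loops
    layers.append(At)

labels = spectral_communities(supra_adjacency(layers, omega=0.5), k)
print(labels.reshape(3, n))                      # community labels per layer
```

Because the inter-layer coupling links each node to its own copy in neighboring windows, the recovered labels tend to be consistent across layers, which is what makes this construction suitable for tracking communities over time.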
- Title
- Hierarchical learning for large multi-class classification in network data
- Creator
- Liu, Lei
- Date
- 2015
- Collection
- Electronic Theses & Dissertations
- Description
-
Multi-class learning from network data is an important but challenging problem with many applications, including malware detection in computer networks, user modeling in social networks, and protein function prediction in biological networks. Despite the extensive research on large multi-class learning, there are still numerous issues that have not been sufficiently addressed, such as the efficiency of model testing, the interpretability of the induced models, and the ability to handle imbalanced classes. To overcome these challenges, there has been increasing interest in recent years in developing hierarchical learning methods for large multi-class problems. Unfortunately, none of them were designed for classification of network data. In addition, there are very few studies devoted to classification of heterogeneous networks, where the nodes may have different feature sets. This thesis aims to overcome these limitations with the following contributions.

First, as the number of classes in big data applications can be very large, ranging from thousands to possibly millions, two hierarchical learning schemes are proposed to deal with so-called extreme multi-class learning problems. The first approach, known as recursive non-negative matrix factorization (RNMF), is designed to achieve sublinear runtime in classifying test data. Although RNMF reduces the test time significantly, it may also assign the same class to multiple leaf nodes, which hampers the interpretability of the model as a concept hierarchy for the classes. Furthermore, since RNMF employs a greedy strategy to partition the classes, there is no theoretical guarantee that the partitions generated by the tree lead to a globally optimal solution.

To address the limitations of RNMF, an alternative hierarchical learning method known as matrix factorization tree (MF-Tree) is proposed. Unlike RNMF, MF-Tree is designed to optimize a global objective function while learning its taxonomy structure. A formal proof is provided to show the equivalence between the objective function of MF-Tree and the Hilbert-Schmidt Independence Criterion (HSIC). Furthermore, to improve training efficiency, a fast algorithm for inducing an approximate MF-Tree is also developed.

Next, an extension of MF-Tree to network data is proposed. This approach can seamlessly integrate both the link structure and node attribute information into a unified learning framework. To the best of our knowledge, this is the first study that automatically constructs a taxonomy structure to predict large multi-class problems for network classification. Empirical results suggest that the approach can effectively capture the relationships between classes and generate class taxonomy structures that resemble those produced by human experts. The approach is also easily parallelizable and has been implemented in a MapReduce framework.

Finally, we introduce a network learning task known as co-classification to classify heterogeneous nodes in multiple networks. Unlike existing node classification problems, the goal of co-classification is to learn the classifiers in multiple networks jointly, instead of learning to classify each network independently. The framework proposed in this thesis can utilize prior information about the relationships between classes in different networks to improve its prediction accuracy.
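The recursive, greedy partitioning idea behind RNMF in the abstract above can be sketched as follows: at each tree node, factorize a non-negative class-representation matrix with rank 2 and send each class to the branch with the larger factor loading. The class-centroid representation, stopping rule, and toy data are hypothetical choices for illustration; this is not the RNMF or MF-Tree algorithm as specified in the thesis.

```python
# Hedged sketch: build a binary label tree by recursive rank-2 NMF
# over non-negative class "centroid" vectors.
import numpy as np
from sklearn.decomposition import NMF

def build_label_tree(centroids, classes, min_leaf=2):
    """Recursively split a set of class indices into a binary taxonomy."""
    if len(classes) <= min_leaf:
        return {"classes": classes}
    W = NMF(n_components=2, init="nndsvda", max_iter=500,
            random_state=0).fit_transform(centroids[classes])
    branch = W.argmax(axis=1)                    # assign each class to one of 2 groups
    left = [c for c, b in zip(classes, branch) if b == 0]
    right = [c for c, b in zip(classes, branch) if b == 1]
    if not left or not right:                    # degenerate split: stop here
        return {"classes": classes}
    return {"left": build_label_tree(centroids, left, min_leaf),
            "right": build_label_tree(centroids, right, min_leaf)}

# Toy usage: 8 classes described by non-negative centroid feature vectors,
# with the first and last four classes built around two different prototypes.
rng = np.random.default_rng(0)
prototypes = np.repeat(np.eye(2), 4, axis=0) @ rng.random((2, 30)) * 3
centroids = np.abs(rng.normal(size=(8, 30)) + prototypes)
tree = build_label_tree(centroids, classes=list(range(8)))
print(tree)
```

The greedy nature of the split is visible here: each node optimizes only its local factorization, which is exactly why, as the abstract notes, such a tree carries no global optimality guarantee and motivates the globally optimized MF-Tree.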