TOPOLOGICAL REPRESENTATION AND DIMENSIONALITY REDUCTION OF SINGLE CELL RNA SEQUENCING AND VIRAL PHYLOGENETIC DATA
Advances in technology now allow us to observe and collect large datasets, especially in biology. Sequencing technology has advanced rapidly, allowing for large-scale sequencing with increased precision. This has resulted in large datasets with high dimensionality. One trade-off with high dimensionality is the increase in sparsity. There are multiple causes of sparsity, including sequencing errors, nonuniform noise, and low read depth. It is paramount that we find an effective way to reduce the dimensionality of the data and select important features for downstream analysis.

DNA sequencing is a vital aspect of computational biology. Analyzing the sequences gives insight into relationships between species as well as within species, such as evolution, genetic drift, mutation, common ancestry, and more. We gathered over 3.6 million SARS-CoV-2 sequences from all over the world and performed multiple sequence alignments to extract mutations. Utilizing the mutation data, we performed UMAP dimensionality reduction followed by clustering to analyze mutational trends in the US and the world. However, such alignment methods come with limitations, namely in computational cost and in the assumptions they make. To this end, we developed k-mer topology, an alignment-free method for analyzing DNA sequences that utilizes techniques from topological data analysis. The method produces a uniform set of features for all sequences, regardless of their length.

Another recent sequencing technology is single-cell RNA-sequencing (scRNA-seq), in which a gene expression profile can be extracted for each cell. With current technology, roughly 10,000 cells and over 20,000 genes can be sequenced, which causes problems due to the high dimensionality of the data.
Most dimensionality reduction methods employ frequency domain representations obtained from matrix diagonalization and may not be efficient for large datasets with relatively high intrinsic dimensions. To address this challenge, Correlated Clustering and Projection (CCP) offers a novel data domain strategy that does not require solving any matrix problem. CCP partitions high-dimensional features into correlated clusters and then projects the correlated features in each cluster into a one-dimensional representation based on sample correlations. Additionally, we propose topological nonnegative matrix factorization (TNMF), a matrix decomposition algorithm that incorporates multiscale geometric and topological information in the form of persistent Laplacians. Due to the nonnegativity constraint of the method, the reduced features are interpretable. Both CCP and TNMF are validated on scRNA-seq data, where they show superior performance in clustering, classification, and visualization.

Traditional visualization methods require a dimensionality reduction to 2 or 3 dimensions. However, such aggressive reduction can lead to misleading conclusions. We therefore introduce the Residue-Similarity (RS) plot, a visualization tool for data with more than 3 dimensions. The RS plot is constructed from two measures, the residue (R) and similarity (S) scores. The R-score measures the interclass difference, and the S-score measures the intraclass similarity. The Residue Similarity Index (RSI) is introduced to evaluate the R-score and S-score and is shown to correlate with clustering and classification accuracies across a variety of benchmark datasets.
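
The two-step idea behind CCP described above (group correlated features, then collapse each group to one dimension using relationships between samples) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `ccp_sketch`, the use of k-means to partition features, the Gaussian kernel, and the choice of kernel scale are all assumptions made here for illustration only.

```python
import numpy as np

def ccp_sketch(X, n_clusters=3, n_iter=20, seed=0):
    """Illustrative cluster-then-project reduction (hypothetical helper).

    X has shape (samples, features); the result has shape (samples, n_clusters).
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape

    # Standardize each feature. For standardized features, squared Euclidean
    # distance equals 2 * n_samples * (1 - Pearson correlation), so grouping
    # by distance groups positively correlated features together.
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

    # Step 1 (assumed choice): partition features with a plain k-means loop.
    F = Z.T  # one row per feature
    centers = F[rng.choice(n_features, n_clusters, replace=False)]
    for _ in range(n_iter):
        d = ((F[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = F[labels == c].mean(axis=0)

    # Step 2 (assumed choice): for each feature cluster, score every sample by
    # the sum of Gaussian-kernel similarities to all samples (itself included),
    # computed only on that cluster's features. No eigendecomposition is used.
    out = np.zeros((n_samples, n_clusters))
    for c in range(n_clusters):
        cols = Z[:, labels == c]
        if cols.shape[1] == 0:
            continue  # an empty cluster leaves a zero column
        dist = np.linalg.norm(cols[:, None, :] - cols[None, :, :], axis=2)
        scale = dist.mean() + 1e-12  # assumed kernel scale
        out[:, c] = np.exp(-(dist / scale) ** 2).sum(axis=1)
    return out

# Toy usage on random data: 30 samples, 12 features reduced to 3 dimensions.
X_demo = np.random.default_rng(1).normal(size=(30, 12))
Y_demo = ccp_sketch(X_demo, n_clusters=3)
```

The sketch materializes a dense samples-by-samples distance array for each cluster, so it is suitable only for small toy inputs.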
    
- In Collections: Electronic Theses & Dissertations
- Copyright Status: In Copyright
- Material Type: Theses
- Authors: Hozumi, Yuta
- Thesis Advisors: Wei, Guo-Wei
- Committee Members: Iwen, Mark; Rapinchuk, Ekaterina; Huang, Longxiu
- Date Published: 2024
- Program of Study: Applied Mathematics - Doctor of Philosophy
- Degree Level: Doctoral
- Language: English
- Pages: 265 pages
- Permalink: https://doi.org/doi:10.25335/vmgw-zp73