SUPERVISED DIMENSION REDUCTION TECHNIQUES FOR HIGH-DIMENSIONAL DATA By Dylan Molho A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science, and Engineering – Doctor of Philosophy 2022 ABSTRACT SUPERVISED DIMENSION REDUCTION TECHNIQUES FOR HIGH-DIMENSIONAL DATA By Dylan Molho The data sets arising in modern science and engineering are often extremely large, befitting the era of big data. But these data sets are not only large in the number of samples they have, they may also have a large number of features, placing each data point in a high-dimensional space. However, unique problems arise when the dimension of the data has the same or even greater order than the sample size. This scenario in statistics is known as the High Dimension, Low Sample Size problem (HDLSS). In this paradigm, many standard statistical estimators are shown to perform sub-optimally and in some cases can not be computed at all. This dissertation develops two novel algorithms that successfully operate in the paradigm of HDLSS. We first propose the Generalized Eigenvalue (GEV) estimator, a unified sparse projection regression framework for estimating generalized eigenvector problems. Unlike existing work, we reformulate a sequence of computationally intractable non-convex generalized Rayleigh quotient optimization problems into a computationally efficient simultaneous linear regression problem, padded with a sparse penalty to deal with high-dimensional predictors. We showcase the applica- tions of our method by considering three iconic problems in statistics: the sliced inverse regression (SIR), linear discriminant analysis (LDA), and canonical correlation analysis (CCA). We show the reformulated linear regression problem is able to recover the same projection space obtained by the original generalized eigenvalue problem. Statistically, we establish the nonasymptotic error bounds for the proposed estimator in the applications of SIR and LDA, and prove these rates are minimax optimal. We present how the GEV is applied to the CCA problem, and adapt the method for a robust Huber-loss based formulation for noisy data. We test our framework on both synthetic and real datasets and demonstrate its superior performance compared with other state-of-the-art methods in high dimensional statistics. The second algorithm is the scJEGNN, a graphical neural network (GNN) tailored to the task of data integration for HDLSS single-cell sequencing data. We show that with its unique model, the GNN is able to leverage structural information of the biological data relations in order to perform a joint embedding of multiple modalities of single-cell gene expression data. The model is applied to data from the NeurIPS 2021 competition for Open Problems in Single-Cell Analysis, and we demonstrate that our model is able to outperform top teams from the joint embedding task. ACKNOWLEDGEMENTS I would like to thank my advisor and committee chair, Dr. Yuying Xie. Since the start of my time with him, he has shown commitment and passion to research while also being endlessly patient and adaptive. His knowledge of a diverse range of applied and theoretical techniques has been invaluable. He has been the perfect combination of professional and personable. I also want to thank my co-advisor, Dr. Qiang Sun. His depth of expertise in statistics and mathematics has been inspiring and served as constant motivation for my own work. 
His insight and clarity in our work has helped guide both my research and development as a scholar. I regret Covid taking away an opportunity to have worked with him more in person. I would also like to thank my other committee members, Dr. Ming Yan and Dr. Rongrong Wang for their knowledgeable feedback and suggestions into further research directions. I also want to thank the CMSE community, Lisa Roy, Heather Williams, etc. They are always there to offer help. Lastly, thank you to friends and family for providing so much support during this time. iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1 Mathematical Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Single-Cell Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 CHAPTER 3 THEORETICAL PROPERTIES OF THE GEV ESTIMATOR . . . . . . . . 22 3.1 General Error Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.1.1 Proof of Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.1.2 Proof of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.1.3 Proof of Theorem 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.3.1 Proof of Lemma 7 . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.3.2 Proof of Lemma 8 . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Sliced Inverse Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.1 Consistency for SIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2.2 Proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3.1 Consistency for LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3.1.1 Proof of Theorem 19 . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.1.2 Proof of Theorem 22 . . . . . . . . . . . . . . . . . . . . . . . . 45 3.4 Minimax Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.4.1 Proof of Theorem 23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.4.2 Proof of Corollary 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.5 Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 CHAPTER 4 EMPIRICAL RESULTS OF THE GEV ESTIMATOR . . . . . . . . . . . . 52 4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.1.1 Robust Modification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2 Sliced Inverse Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2.1 Heavy Noise Slice Inverse Regression . . . . . . . . . . . . . . . . . . . . 59 4.3 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.4 Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
62 4.5 Application to Tumor-Infiltrating Lymphocytes Data . . . . . . . . . . . . . . . . 64 4.6 Application to Single-Cell RNAseq Data . . . . . . . . . . . . . . . . . . . . . . 66 v CHAPTER 5 GRAPHICAL NEURAL NETWORKS FOR MULTI-MODAL DATA IN- TEGRATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.2.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.2.2 Graph Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.2.3 Graph Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.2.4 Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 CHAPTER 6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 vi LIST OF TABLES Table 4.1: Summary of estimation accuracy for categorical response in low and high di- mensions. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Table 4.2: Summary of estimation accuracy for continuous response in low dimensions. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. . . 58 Table 4.3: Summary of estimation accuracy for continuous response in high dimensions. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. . . 59 Table 4.4: Summary of estimation accuracy for Huber loss estimation in low dimensions with high noise. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 Table 4.5: Summary of estimation accuracy for Huber loss estimation in high dimensions with high noise. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Table 4.6: Summary statistics reporting performance of the GEV, LDA−ℓ1 , Direct and Oracle methods. We report the means of the FD with its standard deviation in parentheses. The results are based on 100 replications . . . . . . . . . . . . . . 62 Table 4.7: Summary statistics reporting performance of the GEV and PMA methods. We report the means of the FD with its standard deviation in parentheses. The results are based on 100 replications . . . . . . . . . . . . . . . . . . . . . . . . 64 Table 5.1: Performances for Joint Embedding Task . . . . . . . . . . . . . . . . . . . . . . 79 vii LIST OF FIGURES Figure 1.1: Unlabeled 2-dimensional data. . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Figure 1.2: Labeled 2-dimensional data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
3 Figure 1.3: Data projection on the x and y axis. . . . . . . . . . . . . . . . . . . . . . . . . 3 Figure 1.4: The LDA decision boundary between samples from Class 1 and Class2. White x’s mark the sample means, and the yellow x gives the midpoint be- tween the two. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Figure 2.1: A simple graph with adjacency matrix. . . . . . . . . . . . . . . . . . . . . . . . . 11 Figure 2.2: The architecture of a feedforward network. The nodes represent the coordinates of each layer, and the edges represent the weights of the linear transformations. . . . . . 13 Figure 2.3: The general framework of an autoencoder. The hidden layer gives the code of the encoder, and the lower dimension of the code constrains the representation. . . . . . . 15 Figure 2.4: An example matrix for single-cell data. . . . . . . . . . . . . . . . . . . . . . . . . 19 Figure 4.1: Comparison of convergence rates of different algorithms. . . . . . . . . . . . . . . . 53 Figure 4.2: Relationship between plasma cells and GEV-SIR direction. The left panel shows the distribution of eigenvalues of Ω̂. The scatter plots in the middle and right panels show the relationship between the tumor infiltrated plasma cell and the GEV-SIR direction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Figure 4.3: GEV-SIR analysis of embryoid body scRNAseq data. . . . . . . . . . . . . . . . . 67 Figure 5.1: scJEGNN graph construction process. The input data determines the value of the weighted edges between the cell nodes and feature nodes, values of zero indicate no edge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Figure 5.2: scJEGNN graph convolution. Multiple convolution layers propagate infor- mation from the weighted edges to update cell and feature nodes. . . . . . . . . 76 Figure 5.3: scJEGNN Autoencoder architecture. Each layer is fully connected, and the encoder layers feature drop out and batch normalization steps. . . . . . . . . . . 79 viii LIST OF ALGORITHMS Algorithm 4.1: A fast iterative shrinkage-thresholding algorithm for GEV. . . . . . . . . 52 Algorithm 4.2: Huber loss algorithm for robust GEV. . . . . . . . . . . . . . . . . . . . 55 ix CHAPTER 1 INTRODUCTION In the past twenty years, we have witnessed an explosion of data from different domain areas, including medical imaging, finance, and genomics. The data sets arising in modern science and engineering are often extremely large, befitting the era of big data. But these data sets are not only large in the number of samples they have, they may also have a large number of features, plac- ing each data point in a high-dimensional space. Data like this is common in fields like biology, where for instance measurements of the gene expression of cells can have tens of thousands or even hundreds of thousands features. This enrichment of data offers promises of solutions to many challenging goals, including detecting genes underlying complex diseases and designing novel drug treatments. However, unique problems arise when the dimension of the data has the same order as or even greater than the sample size. This scenario in statistics is known as a High Dimen- sion, Low Sample Size problem (HDLSS) [HMN05, SSZM16]. In such a regime, many classical statistical methods no longer have guarantees of success, and standard asymptotic theory often fails to provide useful predictions. 
As well, in high dimensions, our intuitions about basic concepts such as the distance between points begin to break down. As a result of the "curse of dimensionality", higher-dimensional versions of the cube no longer have the majority of their mass near the center of the cube; instead the vast majority of the volume is found near the corners. Similarly, multivariate normal distributions act more and more like uniform distributions on hyperspheres, and estimators like k-nearest neighbors become unreliable because distances between points in a high-dimensional data set are all very similar.

To better illustrate this behavior, we give a simple example of Linear Discriminant Analysis (LDA), a classical machine learning technique that trains a classifier for the data. The classifier is determined by finding a linear projection of the data to a lower-dimensional space that best separates the data based on its class. In Figure 1.1 we see unlabeled data in 2 dimensions that have a high spread along the x-axis, with very little variation along the y-axis. If we sought to simplify the data to a one-dimensional representation, we could simply project the data to the x-axis, preserving most of the variation of the data.

Figure 1.1: Unlabeled 2-dimensional data.

Once the data is given additional class information, we see that most of the pertinent information about how the classes are separated would be lost upon projection to the x-axis. We see in Figure 1.3 that the projection of the data on the x-axis mixes much of the class data, making it nearly impossible to find a good point to separate the classes. On the other hand, projection onto the y-axis does a good job of separating the class information, despite the data being more compactly spaced after projection. It is not too surprising that this is the case: the data is generated from two multivariate normal distributions with mean µ1 = (13, 1) for class 1 and mean µ2 = (21, 2) for class 2, with both classes sharing the diagonal covariance
\[
\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 0.1 \end{pmatrix}.
\]

Figure 1.2: Labeled 2-dimensional data.

Figure 1.3: Data projection on the x and y axis.

While the projection of the data on the y-axis fared well in distinguishing the class information, we can do even better. The solution provided by LDA takes into account the location of the sample means, and corrects for the covariance of the data to make a more optimal separation of the data. Let µ̂1 and µ̂2 be our estimated class means from the data, marked by the white x's in Figure 1.4, and let Σ̂ be our estimated covariance. Then the linear discriminant function applied to a new data point x ∈ R2 is given by the inner product
\[
\widehat{\Psi}(x) = \Big\langle \hat{\mu}_1 - \hat{\mu}_2,\; \widehat{\Sigma}^{-1}\Big(x - \frac{\hat{\mu}_1 + \hat{\mu}_2}{2}\Big) \Big\rangle.
\]
If this value is greater than 0, we label the point x as belonging to class 1, and if it is less than 0, it is labeled as belonging to class 2. This provides a linear decision boundary in the plane, given by the yellow dotted line in Figure 1.4, along which Ψ̂(x) = 0. The decision boundary shows how the LDA classifier labels new points, and if we wanted to project the data to best separate the samples by class, we would project the points onto a line that is orthogonal to this decision boundary.

Figure 1.4: The LDA decision boundary between samples from Class 1 and Class 2. White x's mark the sample means, and the yellow x gives the midpoint between the two.
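The discriminant function is straightforward to compute once the class means and the pooled covariance have been estimated. The following is a minimal NumPy sketch that re-creates the toy example above; the random draw, and hence the fitted quantities, will not match the figures exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters of the toy example: two Gaussian classes in R^2 sharing a
# diagonal covariance, 40 samples per class (80 total).
mu1, mu2 = np.array([13.0, 1.0]), np.array([21.0, 2.0])
Sigma = np.diag([25.0, 0.1])
n_per_class = 40

X1 = rng.multivariate_normal(mu1, Sigma, size=n_per_class)
X2 = rng.multivariate_normal(mu2, Sigma, size=n_per_class)

# Estimated class means and pooled within-class covariance.
mu1_hat, mu2_hat = X1.mean(axis=0), X2.mean(axis=0)
Sigma_hat = 0.5 * (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False))

def lda_discriminant(x):
    # Psi_hat(x) = < mu1_hat - mu2_hat, Sigma_hat^{-1} (x - (mu1_hat + mu2_hat)/2) >
    centered = x - 0.5 * (mu1_hat + mu2_hat)
    return (mu1_hat - mu2_hat) @ np.linalg.solve(Sigma_hat, centered)

# A positive discriminant assigns the point to class 1, a negative one to class 2.
test_point = np.array([17.0, 1.8])
label = 1 if lda_discriminant(test_point) > 0 else 2
print(f"Psi_hat(test_point) = {lda_discriminant(test_point):.3f} -> class {label}")
```

Projecting the data onto the estimated direction Σ̂⁻¹(µ̂1 − µ̂2) gives the one-dimensional representation orthogonal to the decision boundary.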
If we assume that both classes are equally likely, then we can calculate the error probability of the LDA decision rule as
\[
\mathrm{Err}(\widehat{\Psi}) = \frac{1}{2}\,\mathbb{P}_1\big[\widehat{\Psi}(x') \le 0\big] + \frac{1}{2}\,\mathbb{P}_2\big[\widehat{\Psi}(x'') > 0\big],
\]
where x′ and x′′ are samples drawn independently from the probability distributions P1 and P2 of class 1 and class 2, respectively. We can analyze the behavior of the error asymptotically, i.e. look at the limiting behavior as our sample size n increases to infinity. Define the value γ = ∥µ1 − µ2∥2, the distance between the two class means, and let us simplify the problem further by assuming the covariance Σ is the identity matrix. Then the random error Err(Ψ̂) converges in probability to a fixed number,
\[
\mathrm{Err}(\widehat{\Psi}_{\mathrm{Id}}) \xrightarrow{\;\mathrm{prob.}\;} \Phi\Big(-\frac{\gamma}{2}\Big),
\]
where Φ(t) = P[Z ≤ t] is the cumulative distribution function of the standard normal variable. Thus the error behaves purely as a function of the distance between the class means as n increases: as the means grow farther apart, the error goes down.

But this assumes that the dimension of the data remains fixed. If instead the dimension d were held at a fixed ratio with the sample size n, we would see a different picture emerge. In many cases this fixed ratio is more realistic for applications where the dimension is high and collecting much larger samples is infeasible. Assume the ratio d/n converges to some positive constant α > 0. Under this high-dimensional scaling, our error converges to a different, sub-optimal value,
\[
\mathrm{Err}(\widehat{\Psi}_{\mathrm{Id}}) \xrightarrow{\;\mathrm{prob.}\;} \Phi\bigg(-\frac{\gamma^2}{2\sqrt{\gamma^2 + 2\alpha}}\bigg).
\]
In this case the classical prediction Φ(−γ/2) drastically underestimates the error rate. Our example application of LDA used data that is only 2-dimensional with a sample size of 80. We would see a quick deterioration in performance if the dimension were much higher, e.g. d = 40. At the point d > n, we are unable to even compute the estimator, since the estimate of the covariance Σ̂ is no longer full rank, which makes computing its inverse impossible.

These HDLSS phenomena necessitate the development of new theory as well as new methods in order to manage these difficult problems and achieve the desired outcomes for downstream data science tasks. To overcome the barriers found in HDLSS scenarios, one must make additional assumptions on the data, either with explicit formulations or with implicit beliefs about the behavior of the data. The first type of research leads to structural assumptions placed on the probability model that generates the data, which allow for alterations to classical methods that yield theoretically optimal estimators for well-defined tasks. The second type of research, in contrast, makes general assumptions usually based on the causal nature of the chosen real-world data application, where the data is assumed to have dependencies between the various parameters. While there are no theoretical guarantees for such methods, the strength of the estimator is instead demonstrated empirically on simulated or real data sets. We explore ideas from both of these fields, and develop two novel algorithms that successfully operate in the paradigm of HDLSS.

In the first case, we develop an estimator for high-dimensional data with response variables, under the assumption that the data has an underlying low-rank structure and that the lower-dimensional representation is obtained with sparse projection directions.
Sparsity and low rank representations are natural assumptions to make on high-dimensional data due to the likelihood of having very few of the potential thousands of variables contribute meaningfully to a response variable. Data exhibiting this structure can be found in a diverse collection of applications, ranging from ge- nomics to economics. Our estimator produces a supervised linear dimension reduction of the data that attempts to maximally preserve the relationship between the covariate data and the response variables. We achieve this through finding vectors that relate the covariance of the covariates with the covariance between the data and response variables, solving a type of generalized eigenvalue optimization problem. We name the method the Generalized Eigenvalue (GEV) Estimator, and show that this method is able to solve three separate classical statistical problems in the HDLSS paradigm: Linear Discriminant Analysis (LDA), Sliced Inverse Regression (SIR), and Canoni- cal Correlation Analysis (CCA). We give theoretical guarantees of convergence that are shown to be minimax optimal for both LDA and SIR, and give empirical results of GEV outperforming competitor methods in all three applications on simulated data, as well as applications of GEV to real-world data analysis tasks on gene expression data. 6 For the second case, we design a method for multi-modal data integration using deep learning. We tailor a graphical neural network (GNN) for use on single-cell sequencing data, a rich new data source in biology that has revolutionized the field. Single-cell sequencing data produces matrices where each row represents a cell, and each column gives a value corresponding to the expression of some gene. This data often falls in the HDLSS regime, since sequencing cells is an expensive process, but can produce for each cell potentially hundreds of thousands of features given by different genes. We show that by using a bipartite graphical model of the data that represents both cell and genes as nodes, we are able to leverage the causal structure of the gene expression data to create a low-dimensional representation that preserves important biological information. The GNN is combined with an autoencoder model in order to train a low-dimensional representation through latent feature regularization. We apply our model named scJEGNN to the task for single- cell multi-modality data integration in the NeurIPS 2021 special competition for Open Problems in Single-Cell Analysis, and we demonstrate our model is able to outperform the best performing competitor models. The structure of this dissertation is as follows: in Chapter 2 we go over some preliminary mathematical background leading up to the GEV estimator, introduce the basics of neural networks and the specific models we use from deep learning, and give some description of single-cell data, which features prominently in our applications. In Chapter 3 we develop the formal theory of the GEV estimator, including the nonasymptotic convergence theorems and minimax bounds. In Chapter 4 we give the computational algorithm for the GEV estimator, and show the application of the estimator to simulated and real data problems. In Chapter 5 we detail the scJEGNN architecture and show its performance on the single-cell sequencing data integration task. 7 CHAPTER 2 BACKGROUND In this chapter we develop the necessary mathematical background to understand the work. 
We assume some familiarity with linear algebra, multivariate calculus, and probability theory, but we first review some relevant concepts from these areas below. Then we review the basics of neural networks, including feedforward networks, autoencoders, and graphical neural networks. Lastly we introduce single-cell data and its structure and behavior.

2.1 Mathematical Preliminaries

Probability. Let X be a continuous real-valued random variable with probability distribution P. If one wishes to understand how spread out X is from its mean E[X] = µ, then we can look at the tail probabilities P(|X − µ| ≥ t) for t > 0. A bound on this value gives us a concentration inequality, which tells us how likely it is for X to be a distance of t from its mean µ. It is often beneficial to have random variables that concentrate around their mean. A common example is when we have a population parameter we are trying to estimate with a function of the data, which is a random variable whose mean equals the population parameter. If X ∼ N(µ, σ²) is normally distributed, then we can show that X has quick decay of probability in its tails:
\[
\mathbb{P}\big[|X - \mu| \ge t\big] \le 2e^{-\frac{t^2}{2\sigma^2}} \quad \text{for all } t \ge 0. \tag{2.1}
\]
This quick rate of decay as a function of t is a property that we wish to generalize. To that end we call a random variable X sub-Gaussian if there is a positive number σ such that equation (2.1) holds for X. A similar but weaker property is the following: X is sub-exponential with parameters (ν, α) if
\[
\mathbb{P}\big[|X - \mu| \ge t\big] \le
\begin{cases}
2e^{-\frac{t^2}{2\nu^2}} & \text{if } 0 \le t \le \frac{\nu^2}{\alpha}, \\[4pt]
2e^{-\frac{t}{2\alpha}} & \text{if } t > \frac{\nu^2}{\alpha}.
\end{cases} \tag{2.2}
\]
For sub-exponential variables, when t is small enough the concentration inequality is sub-Gaussian in nature (i.e. with the exponent quadratic in t), but for larger t the exponential component of the bound scales linearly in t. The location of this shift in behavior is controlled by the parameter α, and in the limit α → 0 we recover sub-Gaussian inequalities.

If we have two continuous random variables X and X′′ with probability distributions P and Q respectively, then the Kullback-Leibler divergence (KL divergence) between the distributions is defined to be
\[
D_{\mathrm{KL}}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log\Big(\frac{p(x)}{q(x)}\Big)\, dx,
\]
where p and q denote the probability density functions of P and Q. For simplicity denote [n] = {1, . . . , n}, the discrete set from 1 to n. Lastly, if we have a collection of random variables X1, . . . , Xn, we define the first order statistic X(1) to be the minimum value of the collection {Xi}i∈[n], the second order statistic X(2) to be the second smallest value of the collection, and so on up to the nth order statistic X(n), which is the largest value of the collection.

Linear Algebra. Let A ∈ Rm×n be a real m by n matrix, and assume that m > n and that A has full rank. The singular value decomposition of A is the identity
\[
A = U \Sigma V^{\top} = \sum_{i=1}^{n} \sigma_i u_i v_i^{\top},
\]
where Σ = diag(σ1, σ2, . . . , σn) ∈ Rn×n is a diagonal matrix with positive real numbers on the diagonal and 0's elsewhere, U = [u1 |u2 | · · · |un ] ∈ Rm×n is an orthonormal matrix with columns ui ∈ Rm, and V = [v1 |v2 | · · · |vn ] ∈ Rn×n is an orthonormal matrix with columns vi ∈ Rn. If A is a square symmetric matrix in Rn×n, then the eigendecomposition of A is the identity
\[
A = V \Lambda V^{\top} = \sum_{i=1}^{n} \lambda_i v_i v_i^{\top},
\]
where Λ = diag(λ1, λ2, . . . , λn) ∈ Rn×n is a diagonal matrix and V = [v1 |v2 | · · · |vn ] ∈ Rn×n is an orthogonal matrix in which each (vi, λi) is an eigenvector/eigenvalue pair of A satisfying Avi = λi vi.
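Both decompositions are available in standard numerical libraries. The short check below is a sketch using NumPy on a random matrix, verifying the two identities above numerically; the matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4

# Singular value decomposition of a full-rank rectangular matrix A (m > n).
A = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U is m x n
assert np.allclose(A, U @ np.diag(s) @ Vt)          # A = U Sigma V^T

# Eigendecomposition of a symmetric matrix B = V Lambda V^T.
B = A.T @ A                                         # symmetric positive semidefinite
lam, V = np.linalg.eigh(B)
assert np.allclose(B, V @ np.diag(lam) @ V.T)
assert np.allclose(B @ V[:, 0], lam[0] * V[:, 0])   # each column is an eigenvector

print("singular values of A:", np.round(s, 3))
print("eigenvalues of A^T A:", np.round(lam, 3))    # squared singular values (ascending order)
```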
For two matrices A, B ∈ Rd×d , we define the vector/scalar pair (v, ρ) with v ∈ Rd and ρ ∈ R to be 9 a generalized eigenpair if Av = ρBv. In the case that B is nonsingular, the pairs (v, ρ) correspond to the eigenpairs of the matrix B−1 A. Alternatively, finding these eigenpairs is equivalent to finding the vectors that are critical points of Rayleigh quotient v⊤ Av v⊤ Bv with the corresponding generalized eigenvalue ρ equal to the value of the quotient. This framework occurs commonly in statistics; when seeking a linear projection of the data many classic methods solve a form of the above generalized eigenvalue problem with A and B acting as covariance matrices of the data. Further notation we use is as follows. For two matrices A and B, let ⟨A, B⟩ = tr A⊤ B denotes  1/q the trace inner product. For a vector u ∈ Rd and q ∈ [0, ∞], ∥u∥q = ∑dj=1 |u j |q is the ℓq norm if 0 ≤ q ≤ ∞; when q = 0, ∥u∥0 is the number of nonzero entries of u; when q = ∞, ∥u∥∞ = max1≤ j≤d |u j |. For a matrix A ∈ Rm×n , we use tr(A) to denote its trace and Ai j for the (i, j)-th element, and for q ∈ (0, ∞), ∥A∥q is ℓq operator norm, while ∥A∥F and ∥A∥max are used to denote the Frobenius norm and the entry-wise maximum norm, respectively. For q1 , q2 ∈ [0, ∞], the matrix  (q1 , q2 )-pseudonorm ∥A∥q1 ,q2 of A is defined as ∥A∗1 ∥q2 , ∥A∗2 ∥q2 , . . . , ∥A∗m ∥q2 q , where A∗i 1 denotes the ith column of A. If J ∈ [m] is a subset of indices of size j, AJ ∈ R j×n is the submatrix given by the j rows with indices in J. We use ρmax (A) and ρmin (A) to denote the maximum eigenvalue and minimum eigenvalue, respectively. C,C1 , and C2 are constants that may vary in different instances of usage. Graphs. A graph, denoted G = {V , E }, is a set of nodes V = {v1 , . . . , vn }, and a set of edges E = {e1 , . . . , em }. Nodes commonly represent entities in a data science problem, while the edges give the relations between them, e.g. social media users as nodes and edges representing friendships, or chemical atoms as nodes, and chemical bonds as edges. An edge e ∈ E connects two nodes v1e , v2e , thus e can be represented as (v1e , v2e ), an element of V × V . A node vi is adjacent to another node v j 10 Figure 2.1: A simple graph with adjacency matrix. if and only if there exists an edge between them. We define the (first-order) neighbors of a node vi , denoted N (vi ), as the set of nodes that are adjacent to vi . A graph G can be equivalently represented as an adjacency matrix which describes the connectivity between the nodes. Let A ∈ {0, 1}n×n be a matrix where Ai, j = 1 if vi is adjacent to v j , and equal to 0 otherwise. The degree of a node vi is the number of nodes adjacent to vi , d(vi ) = ∑ 1v j ∈N (vi ) v j ∈V where 1v j ∈N (vi ) = 1 in the event v j ∈ N (vi ) and 0 otherwise. The degree matrix D is a diagonal matrix defined as Di,i = d(vi ), Di, j = 0 if i ̸= j. 2.2 Deep Learning Deep learning is a class of machine learning algorithms that are built from artificial neural net- works. Originating to a linear model in [MP43], it was further developed into the perceptron in [Ros58], which can learn parameters for the function given training samples. Neural networks (NNs) had a renaissance of interest and research in the early 2000s with the advent of “big data” sources and more powerful computational machines to train the models. Since then deep learn- ing models have consistently proven to outperform state-of-the-art traditional methods in a large 11 number of applications. 
The power and flexibility of different deep learning models has firmly established the popularity of the models in machine learning tasks. We introduce the basics of common NN models that we use in our work. A fuller treatment of the subject of deep learn- ing can be found in [GBC16], and [MT21] provides an excellent reference for graphical neural networks. Feedforward Networks. A feedfoward network is simply a special type of function that is made by composing a collection of simpler functions. As in all machine learning tasks, the feedfoward network is an approximation of a sought after function f ∗ ()b, for instance if the task is classifi- cation, one wishes to find a mapping f (x|Θ) that best approximates the ideal classifier f ∗ (x) = y. The values of the parameters Θ that determine the feedforward network f (x|Θ) are learned during training. In feedfoward networks, the function f : Rd → Rk given by the network is a composition of simpler functions that are referred to as the layers of the network. The output dimension k is chosen to suit the chosen application of the network. A single layer generally has an affine transformation followed by a nonlinear “activation function” applied pointwise. This means for the input vector x, the first layer would produce h1 = σ b1 + W1 x ,  where W1 ∈ Rd×d1 , b1 ∈ Rd1 , and σ : Rd1 → Rd1 is the activation function. The output h1 is the first “hidden layer” of the network, and the depth of the network is determined by the number of layers. In general if a network has depth m, then the network function would be f (x) = bm + Wm σ (bm−1 + Wm σ (· · · σ (b1 + Wm x)))), where Wi ∈ Rdi ×di−1 and b ∈ Rdi , and our output dimension k would be equal to dm . The values of the weights of Wi and bi , i ∈ [m] give the collection of parameters Θ that determine the function, and are learned during training. These transformations between layers are depicted in Figure 2.2 as bipartite graphs, where the coordinates before and after are given as left and right nodes, and the edges between them represent the weights of the matrix Wi . 12 Figure 2.2: The architecture of a feedforward network. The nodes represent the coordinates of each layer, and the edges represent the weights of the linear transformations. Activation functions were originally designed to recreate the behaviors of biological neurons, which receive a signal and either kill it or propagate it to further neurons. These functions introduce non-linearity into the neural network which leads to strong theorems guaranteeing the function’s approximation capabilities under certain conditions. There are a collection of commonly used functions, and one of the most used is the Rectified Linear Unit (ReLU). The ReLU function is linear (identity) for all positive inputs and 0 for all negative values; ReLU(z) = max(0, z). Many variants of the function exist, many of which attempt to address the lack of gradient the function has for negative inputs, like the LeakyReLU, ELU, and GELU. Prior to the use of ReLU, using sigmoid functions were the norm, like the logisitic sigmoid 1 σ (z) = 1 + exp(−z) due to the belief that activation functions had to be continuous to properly train the model. The output of the function is run through a chosen loss function that allows for optimization methods like gradient descent to learn the parameters. 
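As a concrete illustration of the layer composition described above, the following sketch implements a small feedforward network as a pure NumPy forward pass. The layer widths are arbitrary and the weights are random, untrained stand-ins for parameters that would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # ReLU activation: identity for positive inputs, zero otherwise.
    return np.maximum(0.0, z)

# Layer widths: input d = 8, two hidden layers of width 16, output k = 3.
dims = [8, 16, 16, 3]
weights = [rng.standard_normal((dims[i + 1], dims[i])) * 0.1 for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]

def forward(x):
    # h_i = sigma(b_i + W_i h_{i-1}); no activation on the final (output) layer.
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(b + W @ h)
    return biases[-1] + weights[-1] @ h

x = rng.standard_normal(8)
print("network output:", forward(x))
```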
If the task is regression, so that each training 13 sample has the pair (x, y), y ∈ Rk , the output f (x) = ŷ ∈ Rk could be run through the simple squared loss function to measure the difference between the predicted ŷ and ground truth y: k L(y, ŷ) = ∑ (yi − ŷi )2 . i=1 If the task is classification, the neural network needs to output the class label of the input. Instead of a discrete output from a finite set of labels, probabilities are given for each class of C possible classes, so that the output ŷ would be a vector in RC . The softmax function is used to output values between 0 and 1 so that the total adds up to 1: exp(zi ) ŷi = softmax(z)i = , i ∈ [C], ∑ j exp(z j ) where zi is the ith element of the vector z. Then with the predicted ŷ the loss function of cross- entropy is used to measure the difference from the truth C L(y, ŷ) = − ∑ yi log(ŷi ). i=1 Here yi = 1 if the class of x is i, and 0 otherwise. During inference, an unlabeled input is given label i if ŷi is the largest value among all the coordinates of ŷ. Autoencoders. An autoencoder is a special type of neural network that tries to reproduce its input as its output. The autoencoder consists of two components: an encoder h = f (x), which encodes x into a hidden representation (called a code) h, and a decoder g which attempts to reconstruct x from h, represented g(h) = x̂. The network is trained to minimize the reconstruction error L(x, x̂) = L(x, g( f (x))), where L(x, x̂) measures the difference between x and x̂. The utility of an autoencoder comes from its limitations; perfectly recreating the input is made to be impossible, so good approximations x̂ are trained by encoding only important information in the code h. This limitation is achieved through the creation of a bottleneck that the input x is forced through. As seen in Figure 2.3, this bottleneck occurs at the code layer that constrains its representation in some manner. 14 Figure 2.3: The general framework of an autoencoder. The hidden layer gives the code of the encoder, and the lower dimension of the code constrains the representation. There are two main ways this constraint in the autoencoder is implemented, by making the di- mension of the code smaller than the input, or by placing certainly penalties on the latent represen- tation that discourages memorization between input and output. The first leads to undercomplete autoencoders, which forces the lower dimension code h to preserve the most important features of the input. The second leads to regularized autoencoders, which have additional terms added to the loss function L(x, g( f (x))) + η · Ω(h), where Ω(h) is the regularization term applied to the code h and η is a hyperparameter controlling the amount of penalty. One regularization is the ℓ1 -regularization Ω(h) = ∥h∥1 which promotes sparsity in the code h. Other regularizations may explicitly promote certain fea- tures that are data specific to be preserved in the hidden representation, such as the cell type of of single-cell gene expression data. 15 Graphical Neural Networks. Graph neural networks (GNNs) are a collection of deep learning architectures that are designed to deal with graph-structured data. Other architectures like feedfor- ward networks or convolutional neural networks (not covered) are more amenable to data that is structured as a regular grid, like vectors or matrices. GNNs expand the functionality of neural net- works to to this more multifarious data, and allow for both node-based and graph-based learning tasks. 
These models are quite recent innovations, as the first GNN model was published as recently as 2005 in [SYG+ 05]. Like all neural networks, GNNs act as a type of representation learning of its input data, and it is through learning a good representation of its input data that it is able to perform well in designated tasks. Since the input data for GNNs are graphs, there are two ways to go about this representation. For node-based tasks, the GNN learns good features for each node, using the graph structure to facilitate the calculation of this representation. For graph-based tasks, the GNN aims to learn features to represent the entire graph, and learning node features occurs only as an intermediate ancillary step. To learn updated node features on the graph, the GNN takes in both the input node features and the graph structure given by the adjacency matrix A ∈ Rn×n , where n gives the number of nodes. If the nodes features are given by vectors in Rd , the collection of features can be given by a matrix F ∈ Rn×d , and the update takes the form F1 = h(A, F) for some function h called the “graph filter”. For node-based tasks, the graph filtering operation is usually sufficient, and the GNN consists of multiple graph filters stacked consecutively to generate final node features. Other operations are necessary for graph-based tasks to generate the features for the entire graph from the node features. Tailored functions that take into account the graph structure like graph pooling operations are used to generate global features. We forego further discussion of graph-based task models, since our applications are node-based only. Like feedforward networks, the general framework for node-based GNNs is a composition of multiple steps of graph filtering followed by a non-linear activation. If the network has depth m, 16 then the collection of operations would be denoted Fm = hm (A, σ (hm−1 (A, · · · σ (h1 (A1 , F))))). The final output Fm is then used for a downstream task related, e.g. classification on the nodes. The non-linear activation functions come from the same collection of activation functions used for other neural networks, but they can be combined with the graph filterings in novel ways. The spectral filtering process, for instance, has the nodes transformed via a Graph Fourier Transform, applies the activation function to the transformed coefficients, and then reconstructs the nodes from the spectral representation. While there are a large collection of graph filters, including a whole class of spectral-based filters, we focus here on the simplest spatial-based graph filter. Let our filter hi , followed by the activation function σ , be defined as σ (hi (A, Fi−1 )) = σ AFi−1 Wi ,  for Wi ∈ Rdi−1 ×di . This transformation is the same as the feedforward network with the exception of the multiplication by A. This product AFi−1 means that for every node, we sum up all the feature vectors of all the neighboring nodes but not the node itself (unless there are self-loops in the graph.) This allows the topology of the graph to be taken into account during these transformations, but one generally wishes for a node to propagate its own features into its next updated representation. To correct for this, the adjacency matrix is replaced by A b = A + I for I the identity matrix in Rn , so that A b supplies self-loops to the graph. Furthermore, since A is typically not normalized, the product AFi−1 can change the scale of the feature nodes based on the number of neighbors a node has. 
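To make the filtering operation concrete, the sketch below applies one spatial graph filter to a toy graph in NumPy, adding self-loops via Â = A + I as described above. The weight matrix is a random stand-in for learned parameters, and the printed degrees illustrate the scaling issue that the normalization discussed next is meant to address.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Toy graph on 4 nodes with 3-dimensional node features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
F = np.arange(12, dtype=float).reshape(4, 3)      # feature matrix F in R^{n x d}

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2)) * 0.1             # learned weights (random stand-in)

A_hat = A + np.eye(4)                             # add self-loops: A_hat = A + I
F1 = relu(A_hat @ F @ W)                          # one spatial filtering step

# Degrees of A_hat; high-degree nodes aggregate more mass, which motivates
# the symmetric normalization described in the text that follows.
print("degrees:", A_hat.sum(axis=1))
print("updated node features:\n", np.round(F1, 3))
```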
One can normalize A via multiplication with the inverse degree matrix D−1 , but more often a symmetric normalization is used giving D−1/2 AD−1/2 . Combining these two we get   −1/2 b b −1/2 i σ (hi (A, Fi−1 )) = σ D b AD Fi−1 W , for D b the degree matrix of A. b Training. The training of deep learning models occurs through the minimization of a loss func- tion L with respect to the parameters of the models – the weights of the affine transformations. We 17 denote the loss function as L(Θ) where Θ denotes all the parameters to be optimized. Gradient descent, a first-order iterative optimization algorithm, is often used to minimize the loss function. At each iteration the parameters Θ are updated by shifting them in the direction of the negative gradient (the direction the loss function decreases the most at that location): Θ′ = Θ − η · ∇Θ L(Θ), where ∇Θ L(Θ) denotes the gradient of L at Θ, and η > 0 is the learning rate, which is usually fixed at a small constant. The gradient is usually averaged over a collection of training samples in a batch, which provides greater statistical consistency and computational efficiency. The process is iterated until convergence or some condition is met. 2.3 Single-Cell Data In the biological sciences, the advent of single-cell technologies has revolutionized the investiga- tion of cellular behavior in the context of its microenvironment. Single-cell sequencing is able to measure multiple molecular features in multiple modalities in a cell, such as gene expressions, protein abundance and chromatin accessibility. Measurements at the single-cell level allow for unprecedented resolution for studying cell-to-cell heterogeneity. Such data sheds new insights across biological disciplines including oncology [LYS18], neurology [RWM+ 18], and immunol- ogy [SDB+ 18]. Technologies like single-cell transcriptome sequencing (scRNA-seq) and single- cell assay for transposase-accessible chromatin with sequencing (scATAC-seq) provide data on the RNA and DNA gene expression of individual cells respectively. scRNA-seq data makes it possible to measure transcriptome-wide gene expression, and enables researchers to distinguish different cell types based on their gene expression, organize cell populations, and identify cells transition- ing between states [AHB+ 16, HBR+ 17, Con18]. Similarly, scATAC-seq studies can reveal somatic clonal structures such as those found in cancer [ZNC+ 19], helping monitor cell lineage develop- ment. The technologies that produce these data sources are very interesting in their own right, but for our purposes it suffices to understand the format this data is presented to us after the sampling 18 Figure 2.4: An example matrix for single-cell data. takes place. For both scRNA-seq and scATAC-seq, the data can be simply represented as a n × d matrix where n gives the number of cells sampled and d gives the “sequencing depth”, i.e. the number of genes tested for in the study. Then each row of the matrix gives a unique cell, and each column give a unique gene. The values of the matrix consist of non-negative integers giving the count data of how many times the gene expressed for a particular cell. These matrices can have further information supplementing the data, such as cell type annotations assigning a class to each cell, or cell-cycle scores quantifying the developmental stage of the cell. 
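Concretely, such a matrix might look like the toy example sketched below, where the cell and gene names are invented for illustration and a cell-type annotation column sits alongside the count columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

genes = ["GeneA", "GeneB", "GeneC", "GeneD"]        # columns: genes (features)
cells = [f"cell_{i}" for i in range(5)]             # rows: individual cells

# Non-negative integer counts; real single-cell matrices are typically very sparse.
counts = rng.poisson(lam=0.6, size=(len(cells), len(genes)))

data = pd.DataFrame(counts, index=cells, columns=genes)
data["cell_type"] = ["B cell", "B cell", "T cell", "T cell", "Monocyte"]  # annotation column

print(data)
```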
The cell-cycle scoring is based on the expression of G2/M and S phase markers, where S is the synthesis phase for the replication of the chromosomes (also part of interphase), G2 is the gap 2 phase representing the end of interphase, prior to entering the mitotic phase, and the M phase is the nuclear division of the cell (consisting of prophase, metaphase, anaphase and telophase) [NHS+ 16]. This additional information would be included as a collection of additional columns at the end of the matrix, indicating the appropriate labels or scores. An example matrix giving the gene expression data of a single-cell experiment is given in Figure 2.4. There are two important aspects of single-cell data that are necessary to understand before using it for analysis: batch effects and dropout. In single-cell sequencing methods, data is organized 19 into separate batches, where large groups of cells are potentially sampled in multiple laboratories using different cell dissociation and handling protocols, library preparation technologies and/or sequencing platforms. These different factors result in batch effects [TBH+ 17] that can change the expression of genes systematically from one batch to another. Such differences can mask underlying biology or introduce spurious structure in the data, and must be corrected prior to further analysis to avoid misleading conclusions [AHMM18]. Since becoming a standard data preprocessing step for singe-cell analysis, there have been many advanced techniques developed to address batch effect removal, including techniques based on CCA [ZWT19] and a number of deep learning models [LWL+ 20, SSZ+ 17]. Dropout is the name given to the technical error that occurs when performing single-cell sam- pling, which leads to artificial counts of zero in the gene expression read outs. The error occurs during library preparation – a technical step that duplicates a gene many times in order to be counted during sampling – and occurs more frequently for genes that express at low levels in their respective cell types [Qui20]. This leads to sequencing data that is notoriously sparse, where the vast majority of features may be zero in a typical dataset due to dropout [SNL+ 17], which can be even more pronounced in multi-modal data [LHH20]. Dropout events lead to increased technical variability and noise in the single-cell data, and makes it difficult to differentiate “true” zero counts from “false” ones. Here, true zero counts indicate that a gene is not expressed in a particular cell type, which could act as important information to differentiate cell types. Addressing dropout requires specialized data processing methods such as imputation. Imputation takes in data and at- tempts to replace artificial zero counts with realistic count values, hopefully while preserving true zero counts. A diverse collection of methods exist for imputation, most of which rely on gaining information about cell behaviors from similar cells in the dataset, or by transferring knowledge of cell behaviors from other datasets. Methods like Phenograph [LSB+ 18], MAGIC [vDSN+ 18], and Seurat [BHS+ 18] use K-nearest neighbor (KNN) graphs to model relationship between cells, but the high sparsity of the data may cause these neighborhood estimates to be unreliable, and may over-simplify the complex cell and gene relationships of cell population. Deep learning methods 20 have improved on these [WAH+ 19, APYG19], with the top performing methods using GNNs, such as GraphSCI [RZL+ 21] and scGNN [WMC+ 21]. 
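The neighbor-based imputation strategy mentioned above can be sketched in a few lines: zeros in a cell's profile are replaced by the average expression of that gene among the cell's nearest neighbors. This is only a schematic illustration of the idea, not the procedure used by any of the specific packages cited here; a real method would also try to preserve biologically meaningful zero counts.

```python
import numpy as np

def knn_impute(X, k=3):
    """Replace zero counts in X (cells x genes) by the mean of the k nearest cells' values."""
    n = X.shape[0]
    # Pairwise Euclidean distances between cell profiles.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                  # exclude the cell itself
    imputed = X.astype(float).copy()
    for i in range(n):
        neighbors = np.argsort(dists[i])[:k]
        zero_mask = X[i] == 0
        # Average the neighbors' expression over the zero entries only.
        imputed[i, zero_mask] = X[neighbors][:, zero_mask].mean(axis=0)
    return imputed

rng = np.random.default_rng(0)
# Sparse toy counts: Poisson counts with random dropout-style zeroing.
X = rng.poisson(1.0, size=(6, 5)) * rng.binomial(1, 0.6, size=(6, 5))
print(knn_impute(X, k=2))
```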
CHAPTER 3

THEORETICAL PROPERTIES OF THE GEV ESTIMATOR

Let x be a d-dimensional random vector of covariates with covariance matrix Σ, let y be a one-dimensional continuous response, let x | y denote the covariate vector conditioned on y, and define Ω = cov(E[x|y]). It is advantageous to find a linear projection of x to a subspace of dimension K ≪ d such that the population centroids, E[x|y], separate the most in the projected space. This amounts to finding vectors that maximize v⊤Ωv while minimizing the overlap of the data after projection, i.e. minimizing v⊤cov(x|y)v. Assuming cov(x|y) is the same for all y, we can consider the following optimization procedure, maximizing a sequence of Rayleigh quotients:
\[
v_k^* = \operatorname*{argmax}_{v_k} \frac{v_k^{\top} \Omega v_k}{v_k^{\top} \Sigma v_k}, \quad \text{s.t. } v_k^{\top} \Sigma v_j = 0, \text{ for all } 1 \le j < k \le K, \tag{3.1}
\]
where, in classification problems with y taking discrete values (such as LDA), Ω and Σ are frequently referred to as the "between-class" and "within-class" covariance matrices, respectively.¹ It is easily verified that the quotient v_k⊤Ωv_k / v_k⊤Σv_k has critical points at the generalized eigenvectors of (Ω, Σ), so that v*_k is a critical point if Ωv*_k = ρΣv*_k for some ρ ∈ R. If Σ is invertible, the problem reduces to finding eigenvectors of Σ^{−1}Ω, but since estimates of Σ are singular in the high-dimensional regime, other means of solving the problem have to be used.

Thus the first projection vector is sought to maximize the between-class covariance relative to the within-class covariance. The procedure then seeks a second projection vector that maximizes the between-class covariance subject to the constraint that it is orthogonal to the first projection direction with respect to Σ. This is continued up to K times, where K is chosen to fully capture the signal. In this work we focus on the case K = rank(Ω), where in the applications we consider rank(Ω) ≪ d. The K projection vectors are then concatenated to obtain VK = {v*_1, . . . , v*_K}, and the projection space, V_K, is obtained by setting V_K = span{VK}, the space spanned by the linear combinations of the K projection vectors.

¹Note that the actual within-class covariance would be cov(x|y) = Σ_{x|y}. However, with the assumption of homoscedasticity, i.e. that Σ_{x|y} = Σ_{x|y′} for all y, y′, the pooled covariance Σ_P equals any within-class covariance, and with the law of total covariance Σ = Σ_P + Ω we have the equivalence
\[
\operatorname*{argmax}_{v_k} \frac{v_k^{\top} \Omega v_k}{v_k^{\top} \Sigma v_k}
= \operatorname*{argmax}_{v_k} \frac{v_k^{\top} \Omega v_k}{v_k^{\top} \Sigma_P v_k + v_k^{\top} \Omega v_k}
= \operatorname*{argmax}_{v_k} \frac{1}{\frac{v_k^{\top} \Sigma_P v_k}{v_k^{\top} \Omega v_k} + 1}
= \operatorname*{argmax}_{v_k} \frac{v_k^{\top} \Omega v_k}{v_k^{\top} \Sigma_{x|y} v_k}.
\]

However, as it stands, the above approach is undesirable from both a computational standpoint and an estimation perspective. Each subsequent projection vector relies on the estimates of all previous projection directions, so propagation of the estimation error is possible. In addition, the corresponding optimization problems are nonconvex; hence, the convergence of any optimization algorithm to the global optima is not assured. This computational intractability poses additional theoretical challenges, and thus most methods that are based on (3.1) do not have theoretical guarantees. We reformulate (3.1) such that the projection space V_K can be recovered in a simultaneous manner via a convex optimization problem we call the Generalized EigenValue (GEV) projection. Using a sparsity assumption, we formulate the proposed GEV procedure as an optimization analogous to a matrix lasso regression problem.
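As a reference point for the reformulation, note that in the classical low-dimensional setting the whole sequence (3.1) can be recovered at once from the generalized eigendecomposition of the pair (Ω, Σ). The sketch below does this on small synthetic matrices with SciPy; it is this classical route, not the GEV estimator developed in this chapter, that breaks down once the estimate of Σ becomes singular.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d, K = 10, 2

# Synthetic positive definite Sigma and rank-K positive semidefinite Omega.
B = rng.standard_normal((d, d))
Sigma = B @ B.T + d * np.eye(d)
U = rng.standard_normal((d, K))
Omega = U @ U.T

# Generalized eigenproblem  Omega v = rho Sigma v ; eigh returns ascending eigenvalues
# with eigenvectors normalized so that V^T Sigma V = I.
rho, V = eigh(Omega, Sigma)
V_K = V[:, -K:]                      # top-K generalized eigenvectors span the projection space

# The constraint v_k^T Sigma v_j = 0 for j != k holds by construction:
print(np.round(V_K.T @ Sigma @ V_K, 6))          # (approximately) diagonal
print("top generalized eigenvalues:", np.round(rho[-K:], 4))
```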
3.1 General Error Bound

Without loss of generality, we assume that µ̄ ≡ E(x) = 0. If µ̄ ≠ 0, we can center the data via x′ ≡ x − E(x), taking x′ as our centered data with mean 0. Let Ω have the eigendecomposition
\[
\Omega = \operatorname{var}\{E[x|y]\} = \sum_{k=1}^{K} \rho_k u_k u_k^{\top} = U U^{\top},
\]
where U = (√ρ1 u1, . . . , √ρK uK) ∈ Rd×K. Let W* be the solution to the following optimization problem:
\[
W^* = \operatorname*{argmin}_{W \in \mathbb{R}^{d \times K}} \; \frac{1}{2} \big\| \Sigma^{1/2} W - \Sigma^{-1/2} U \big\|_F^2. \tag{3.2}
\]

Theorem 1. Let W = span{W*}. Then we have V_K = W.

Proof. Let Q(W) = ½∥Σ^{1/2}W − Σ^{−1/2}U∥²_F. Then W* is the minimizer of Q(W) and W* satisfies ∇Q(W*) = ΣW* − U = 0, which yields W* = Σ^{−1}U. To proceed, we need the following two lemmas. Our first lemma concerns the relation between the eigenpairs of Σ^{−1/2}AΣ^{−1/2} and those of Σ^{−1}A.

Lemma 2. Let A be a symmetric matrix and Σ a symmetric positive definite matrix. If (ρ, v) is an eigenpair of Σ^{−1/2}AΣ^{−1/2}, then (ρ, Σ^{−1/2}v) is an eigenpair of Σ^{−1}A; and vice versa.

Our next lemma connects the sequential optimization problem (3.1) to the eigenpairs of Σ^{−1}Ω.

Lemma 3. The eigenvectors of Σ^{−1}Ω corresponding to the nonzero eigenvalues solve (3.1).

Using Lemma 3, we only need to show that W* is equal to the eigenvectors of Σ^{−1}Ω up to an orthogonal transformation. Suppose we have the eigendecomposition U⊤Σ^{−1}U = PΛP⊤. Left-multiplying both sides by Σ^{−1}U and right-multiplying by P, we have Σ^{−1}UU⊤Σ^{−1}UP = Σ^{−1}UPΛ, or equivalently Σ^{−1}ΩW*P = W*PΛ. Using the fact that V_K = span{W*P} = span{W*} = W finishes the proof.

3.1.1 Proof of Lemma 2

Proof. Suppose (ρ, v) is an eigenpair of Σ^{−1/2}AΣ^{−1/2}; then
\[
\Sigma^{-1/2} A \Sigma^{-1/2} v = \rho v. \tag{3.3}
\]
Multiplying both sides of (3.3) by Σ^{−1/2}, we obtain that (ρ, Σ^{−1/2}v) is an eigenpair of Σ^{−1}A, and vice versa. Since Σ^{−1/2}AΣ^{−1/2} and Σ^{−1}A have the same rank (because Σ is full rank), we further conclude that the eigenpairs of Σ^{−1/2}AΣ^{−1/2} and those of Σ^{−1}A are in one-to-one correspondence with the same eigenvalues.

3.1.2 Proof of Lemma 3

Proof. We rewrite problem (3.1) as:
\[
u_1^* = \operatorname*{argmax}_{u_1} \frac{u_1^{\top} \Sigma^{-1/2} \Omega \Sigma^{-1/2} u_1}{u_1^{\top} u_1}, \qquad
u_k^* = \operatorname*{argmax}_{u_k} \frac{u_k^{\top} \Sigma^{-1/2} \Omega \Sigma^{-1/2} u_k}{u_k^{\top} u_k} \;\; \text{s.t. } u_k \perp u_j, \; \forall\, 1 \le j < k \le K,
\]
with v*_k = Σ^{−1/2}u*_k. Applying Lemma 2 finishes the proof of Lemma 3.

Formulation (3.2) resembles the least squares loss in linear models, and the loss function in (3.2) can be regarded as a matrix version of the least squares loss. Despite its simplicity, it recovers the same projection space as produced by (3.1). Note that any estimator function Q(W) that satisfies ∇Q(W) = ΣW − U will recover V_K. We will exploit this later for an alternative algorithm that makes use of Huber norm regularization for noisy data. As it stands, however, the estimator Q(W) is not able to statistically recover W* in the paradigm of high-dimensional data. To handle high-dimensional features, we impose a sparsity assumption on the structure of W and propose to solve the following penalized regression problem:
\[
\operatorname*{argmin}_{W} \; \Big\{ \frac{1}{2} \operatorname{tr}\big(W^{\top} \Sigma W\big) - \operatorname{tr}\big(W^{\top} U\big) + \lambda \|W\|_{1,1} \Big\}. \tag{3.4}
\]
The first two terms above are just an expansion of ½∥Σ^{1/2}W − Σ^{−1/2}U∥²_F. Let Σ̂ and Û be the estimates of Σ and U, respectively. Plugging these estimates into (3.4), we obtain the sample version
\[
\widehat{W} = \operatorname*{argmin}_{W} \; \Big\{ \frac{1}{2} \operatorname{tr}\big(W^{\top} \widehat{\Sigma} W\big) - \operatorname{tr}\big(W^{\top} \widehat{U}\big) + \lambda \|W\|_{1,1} \Big\}. \tag{3.5}
\]
Let S = {(i, j) : w*_{i,j} ≠ 0} be the total support set of W* and assume W* is s-sparse, that is, |S| = s. To give the main theoretical result for the estimation error, we need the following definition of the generalized restricted eigenvalue (GRE) for matrices and the corresponding GRE condition.
To give the main theoretical result for the estimation error, we need the following definition of the generalized restricted eigenvalue (GRE) for matrices and the corresponding GRE condition. 25 Definition 4 (Generalized Restricted Eigenvalue for Matrices). Let K, m ∈ N, and γ ∈ R. The generalized restricted eigenvalue (GRE) for matrices is defined as n o κ+ (K, m, γ) = supV tr V⊤ ΣV ∥V∥2F : V ∈ C (K, m, γ) ,  b n o (3.6) κ− (K, m, γ) = infV tr V⊤ ΣV ∥V∥2F : V ∈ C (K, m, γ) ,  b where C (K, m, γ) = V ∈ Rd×K : S ⊆ J, |J| ≤ m, ∥VJ c ∥1,1 ≤ γ∥VJ ∥1,1 and J ⊂ [d].  Assumption 5. There exists K, m ∈ N and γ ∈ R and constants κ∗ , κ ∗ ∈ R such that 0 < κ∗ ≤ κ− (K, m, γ) ≤ κ+ (K, m, γ) ≤ κ ∗ < ∞. The assumption above is necessary for our theoretical development and was first proposed by [BRT09] for the vector case and various versions are standard in high-dimensional estimation literature. Our definition is a direct extension of theirs to the matrix case. Define L(W) ≡ 12 tr W⊤ ΣW − tr W⊤ U   b b as our cost function without the regularization term. As well, assume that there exists M, ρ > 0 such that 1/M ≤ ρmin (Σ) ≤ ρmax (Σ) ≤ M and ρ ≤ ρK (Ω). We are ready to state our first result on the estimation error, which concerns the W, under the event that ∥∇L(W∗ )∥∞,∞ ≤ λ /2 .  performance of c Theorem 6. Assume that Assumption 5 holds with k = K, m = s, γ = 3. Suppose that ∇L(W∗ ) ∞,∞ ≤ λ /2. Then we have √ Pc W − PW∗ F ≤ 3Mκ∗−1 ρ −1 λ s. (3.7) Remark. The theorems above follow a general method of giving error bounds on M-estimators. M-estimators are a family of estimators that combine a cost function with a regularizer, which require two properties for consistency in high dimensions: the decomposibility of the regular- izer and restricted strong convexity of their respective cost function. See [Wai19] chapter 9 for an explanation of these methods. Note that by conditioning on the random event G(λ ) = ∥∇L(W∗ )∥∞,∞ ≤ λ /2 these theorems give deterministic bounds; it is by further specifying the  application-dependent structure of Ω and Σ that leads to probabilistic bounds. In all applications this will yield a statement of the form G(λ ) holds with high probability which identifies λ = f (n, d) with a function of the sample size and dimension of the data. 26 3.1.3 Proof of Theorem 6 Sd Proof. Let S = j=1 S j be the union of supports for each projection direction. We first need the following lemma. Lemma 7. Suppose that ∥∇L(W∗ )∥∞ ≤ λ /2. For a E such that S ⊆ E and ∥E ∥0 ≤ 2s, we have W − W∗ W − W∗ )E  c E c 1,1 ≤ 3 (c 1,1 . 3.1.3.1 Proof of Lemma 7 Proof. By the mean value theorem, there exists a f W, some convex combination of c W and W∗ , W − ∇L W∗ = H(c W − W∗ ), where H = ∇2 L(f W) ∈ R p×K×p×K is a forth-order   such that ∇L c tensor   ∇ ∂∂Lw(W) ··· ∇ ∂∂Lw(W) f f  11 1K   .. ..  H=  . . .    ∇ ∂∂Lw(W) · · · ∇ ∂∂Lw(W) f f d1 dK The tensor-matrix product is defined as HW = (ai j ) p×K ∈ Rd×K , with ai j = ⟨∇∂ L(f W)/∂ wi j , W⟩. Let Γb be some sub-gradient matrix of ∥W∥1,1 evaluated at c W. The Karush-Tuhn-Tucker (KKT) W imply conditions of c W − W∗  0 = ∇L c W + λ Γ,b c W − ∇L W∗ + ∇L W∗ + λ Γ, W − W∗    = ∇L c b c W − W∗ + ∇L W∗ + λ Γ, W − W∗   = H c b c W − W∗ , we further have  Using the positive semi-definiteness of the quadratic form H c W−W ,c 0 ≤ − ∇L(W∗ ), c W − W∗ − λ Γ, b c W − W∗ . (3.8) | {z } | {z } I1 I2 Next, we bound I1 and I2 respectively. Applying Hölder inequality to I1 obtains us that ∇L(W∗ ), c W − W∗ ≥ − ∇L(W∗ ) ∞,∞ W − W∗ c 1,1 . 
27 For I2 , separating the support of Γ W − W∗ into E and E c , using the assumption E c ∩ S = 0, b and c / we have W∗E c = 0 and thus Γ W − W∗ )E c = 1E c , c b E c , (c WE c = 1E c , (c W − W∗ )E c = ∥(c W − W∗ )E c ∥1,1 since Γ b i j = sign(c Wi j ) when c Wi j ̸= 0. On the other hand, we have ⟨Γ W − W∗ )E ⟩ ≥ −∥Γ b E , (c W − W∗ )E ∥1,1 b E ∥∞,∞ ∥(c = ∥(c W − W∗ )E ∥1,1 by the Hölder inequality and the identity ∥Γ b E ∥∞,∞ = 1. Plugging the derived inequalities above, we obtain λ Γ, b cW − W∗ = λ Γ W − W∗ )E c + λ Γ b E c , (c W − W∗ )E b E , (c ≥ λ (cW − W∗ ) E c 1,1 − λ (c W − W∗ )E 1,1 . Plugging the bounds for I1 and I2 back into (3.8) yields that 0 ≤ ∇L(W∗ ) ∞,∞ W − W∗ 1,1 − λ (c c W − W∗ )E c 1,1 + λ (c W − W∗ )E 1,1 ≤ − λ − ∇L(W∗ ) ∞,∞ (c W − W∗ )E c ∥1,1  + λ + ∇L(W∗ ) ∞,∞ W − W∗ E 1,1 ,   c which further yields that λ + ∇L W∗  W − W∗ E c W − W∗ W − W∗ )E  ∞,∞  c 1,1 ≤ c E 1,1 ≤ 3 (c 1,1 , W∗  λ − ∇L ∞,∞ which completes the proof. Taking E = S in Lemma 7 and using the restrictive eigenvalue condition with lower bound κ− (K, s, 3) ≥ κ∗ , we obtain 2 W − W∗ Wt − ∇L W∗ , c Wt − W∗ .   κ∗ c 2,2 ≤ ∇L c 28 since ∇L(W ) = ΣW − U. We note that, for any matrix A, we have ∥A∥F = ∥A∥2,2 , that is the Frobenius norm and the ℓ2,2 -norm coincides. b be defined as above as a sub-gradient matrix of ∥W∥1,1 evaluated at c Let Γ W. To bound the right hand side of the inequality above, we add λ Γ, b cW − W∗ to both sides and obtain 2 W − W∗ ∇L(W∗ ) + λ Γ, W − W∗ ≤ ∇L c W − W∗ .  κ∗ c 2,2 + b c W + λ Γ, b c (3.9) Since we have W − W∗ = 0,  ∇L c W + λ Γ,b c plugging the above equality back into (3.9), we obtain 2 κ∗ cW − W∗ 1,2 ≤ − ∇L(W∗ ), c W − W∗ + − λ Γ, b cW − W∗ . (3.10) | {z } | {z } II1 II2 In a similar argument to the proof of Lemma 7 above, applying Hölder inequality to both II1 and II2 , we obtain − ∇L(W∗ ), c W − W∗ ≤ ∇L(W∗ ) ∞,∞ W − W∗ 1,1 c    = ∇L(W∗ ) ∞,∞ W − W∗ S + W − W∗  c c Sc ≤ 4 ∇L(W∗ ) W − W∗  S 1,1 , c ∞,∞ and W − W∗ ≤ −λ W − W∗ W − W∗   − λ Γ, c c S c 1,1 + λ c S 1,1 , where we use Lemma 7 in the first displayed inequality. Plugging the bounds above for II1 and II2 W −W∗ S 1,1 =  back into (3.10) and then applying the Cauchy-Schwartz inequality to the term c 1S , (cW − W∗ )S , we further obtain 2 W − W∗ ≤ 4 ∇ L W∗ W − W∗ W − W∗     κ∗ c +λ S 1,1 − λ c c F ∞,∞ S c 1,1 W − W∗ S   ≤ 4λ /2 + λ c 1,1 W − W∗ S 1,1  ≤ 3λ c √ W − W∗ S F  ≤ 3λ s c √ ≤ 3λ s c W − W∗ F 29 Canceling the term c W − W∗ F on both sides, we obtain −1 √ W − W∗ c F ≤ 3κ ∗ λ s. (3.11) We need the following lemma. Lemma 8. Let ρK,W∗ be the K-th largest singular value of W∗ . Then we must have √ −1 PcW − PW∗ F ≤ 2ρK,W∗ c W − W∗ F . 3.1.3.2 Proof of Lemma 8 Proof. Let W∗ have the singular value decomposition that W∗ = EDF⊤ , F ∈ RK×K , E ∈ R p×K are orthogonal, and D ∈ RK×K is a diagonal matrix. Then we have PW∗ = PE . Looking at the 2 regression problem infQ E − c WQ F , the least squares solution is Q = (cW⊤ cW)−1 c W⊤ E, giving 2 2 inf E − cWQ F = E − Pc W E F Q = tr EE⊤ − Pc EE⊤ − EE⊤ Pc EE⊤ Pc  W W + Pc W W    = tr I − Pc W E P using identities PcW =cW(c W⊤ c W)−1 cW⊤ , P2c = Pc W , EE⊤ = PE , and the cylic property of trace. W     tr I − Pc W E P = K − tr Pc W E P 1  1  ≥ tr (PE ) − tr Pc W P E + tr P W 2 2 c 1 2 = PE − Pc W F 2 1 2 = PW∗ − Pc W F , 2 30 where the first equality and first inequality uses the fact that PE is rank K and Pc W is at most rank 2 K. 
Therefore, taking Q = FD−1 , we can bound Pc W − PW∗ F as √ Pc W − PW∗ F ≤ 2 inf cWQ − E F Q √ ≤ 2 cWFD−1 − EDF⊤ FD−1 F √ ≤ 2 c W − W∗ F ∥D−1 ∥2 √ −1 ∗ 2 ≤ 2ρK,W ∗ W−W c F . Now using Lemma 8, we obtain that √ −1 −1 √ PcW − PW∗ F ≤ 3 2ρK,W∗ κ∗ λ s. Then it remains to lower bound ρK (W∗ ). We start by writing W∗ (W∗ )⊤ = Σ−1 ΩΣ−1 . We need the following lemma, which can be proved by the min-max theorem and thus the proof is omitted. Lemma 9. Let A ∈ Rd×d be a symmetric positive definite matrix and B ∈ Rd×d a symmetric positive semidefinite matrix. Then for any 1 ≤ k ≤ d, we have ρmin (A)ρk (B) ≤ ρk (AB) ≤ ρmax (A)ρk (B). Applying Lemma 9, we obtain that −1/2 2 ρK,W ∗ ≥ ρK (Σ ΩΣ−1/2 )ρmax−1 −2 (Σ) ≥ ρK (Ω)ρmax (Σ). Therefore, we complete the proof of desired statement by plugging the bound above. 3.2 Sliced Inverse Regression Supervised dimension reduction that preserves the conditional dependence of the data has a history in Sufficient Dimension Reduction (SDR), [Coo98]. As a method for performing SDR, the SIR method first developed in [Li91]. For a random vector x with elliptic distribution and univariate 31 response variable y = f (x), the goal of finding a low-dimensional representation of x should take into account the relationship of the data to y, ideally without losing any information which is es- sential in predicting y. The objective of SDR methods is to find, without knowing f , a subspace V ⊆ Rd such that y ⊥ x|PV x. A subspace that satisfies this property is called an effective dimen- sion reduction (EDR) space. Under some minor conditions, the intersection of all EDR spaces is also an EDR space with minimum dimension, called the central space. The minimal model for SDR methods is to assume the multiple index model where the link function takes the form ⊤ y = f (v⊤ 1 x, . . . , vK x, ε) for vi ∈ Rd for i = 1, . . . , K and the error ε is independent of x and E[ε] = 0. Thus, under this model it suffices to find span{v1 , . . . , vK } to determine the central space. The SIR estimator was one of many techniques developed for SDR, but was favored due to its simplicity and computational efficiency. The name Sliced Inverse Regression comes from both the use of the “inverse regression curve” E[x|y] as well as the sliced estimator of Ω = cov(E[x|y]) to determine the central space. As proved in [Li91], since the column space col(Ω) = Σspan{v1 , . . . , vK }, the central space can be estimated via Σ̂−1 col(Ω̂). The additional linearity condition that ∀b, vi ∈ Rd , E[b⊤ x|v⊤ ⊤ K ⊤ 1 x, . . . , vK x] = c0 + ∑i=1 ci vk x, where c0 , . . . , cK ∈ R, is re- quired, but this is automatically satisfied assuming x is elliptically distributed. [? ] proves the consistency of SIR holds if and only if lim d/n = 0, motivating high dimension methods. For the application of SDR, numerous competing procedures have been developed in the past couple decades, including [CL98, LN06, Li07]. While many approaches built flexible semi- parametric models such as projection pursuit regression [FS81], and MAVE [XTLZ02], none of these function in HDLSS scenarios. A major breakthrough was achieved in regularlized SDR methods proposed by [LZL19] using the Lasso SIR method for the HDLSS under the sparsity assumption. The GEV method is closely related to the Lasso SIR method, but has important dif- ferences from their own that leads to performance improvements in scenarios that have significant eigengaps, and has consistent performance that is as good or better elsewhere. 
A closely related method to the GEV estimator can be found in [WCZZ18] which stems from similar motivations 32 but attempts only to compute the optimal projected coordinates of the data instead of determining the projection explicitly. Their regularization uses a group lasso approach in contrast to the GEV estimator, and the non asymptotic rates of error they demonstrated are suboptimal. 3.2.1 Consistency for SIR The space Σ−1 col(Ω) is given by the span of the generalized eigenvectors of (Ω, Σ), justifying the GEV estimator. As in the original SIR estimation technique, we use the sliced estimator of Ω, defined as follows. Given the samples (xi , yi ), i ∈ [n], for a chosen constant H ∈ N, K < H < n, divide the data into H groups determined by the order statistics of y(i) that give H “slices” of the domain of y in the form of intervals (y(hi ) , y(hi+1 ) ], i ∈ {2, . . . , H − 1}, with (−∞, y(h1 ) ] and (y(hH−1 ) , ∞) at the tails. In general these may lead to different sized groups, but without loss of generality we may assume n = cH so that each slice is chosen to consist of c points. We may relabel the data as (xh, j , yh, j ) for the j-th sample in slice number h, i.e., yh, j = y(c(h−1)+ j) and xh, j = x(c(h−1)+ j) . Then the sample mean in h-th slice is x̄h ≡ 1c ∑cj=1 xh, j , allowing us to estimate b H = 1 ∑H x̄h x̄⊤ = 1 XH X⊤ , where XH = (x̄1 , . . . , x̄H ) is the d ×H matrix with the slice means Ω H h=1 h H H as columns. Here, the intuition for the estimate Ω = cov(E[x|y]) is that each sample slice mean x̄h serves as a local estimate of E[x|y ∈ Ih ] for Ih = (yh−1,c , yh,c ]. First, it is worth noting that it is not immediate can be treated as coming from the distribution x| (y ∈ Ih ); this is proven S that the samples j∈[c] xh, j to be the case in [LZL18]. Furthermore, given the estimate 1 H ∑H E[x|y ∈ Ih ]E[x|y ∈ Ih ]⊤ of Ω, it is not guaranteed to be consistent. We can compare this estimate with the classic consistent estimator 1 ⊤ of Ω. A necessary condition for the sliced based estimate to be consistent is n ∑i E[x|yi ]E[x|yi ] the average loss of variance in each slice decreases to zero as H increases. This holds automatically if E[x|y] is smooth and y is compactly supported. In general though, one needs a basic assumption on the smoothness and tail distribution of E[x|y]. We give that assumption below, ϑ -stability ([LZL19]), which is minimal and holds for many distributions. Note that while an estimate based on samples of E[x|yi ] is possible, the estimate would fair very poorly, since each mean E[x|yi ] 33 would have at best a single point estimate given by xi from the pair (xi , yi ). Given this structure of Ω b and the usual estimator for Σ b = 1 ∑ xi x⊤ , we show consistency of the n i GEV estimator given the assumption of the Stability condition. To do so, given a decomposition of ∥∇L(W∗ )∥∞,∞ into terms involving estimation error of Σ b and U,b we will use standard techniques to bound the former, but the latter will require extra work. We will show that while ∥U b − U∥∞,∞ has inherent difficulties in controlling an upper bound, if one instead uses ∥U b − Ũ∥∞,∞ there is a direct way to control the upper bound with high probability, where Ũ = PU (U) b is the projection of the estimated eigenvectors on the span of U. As we will see, this avoids any cumbersome eigengap assumptions, but requires rank(Ũ) = rank(U). To make this substitution one needs to guarantee that an estimator that uses Ũ instead of U will recover the desired subspace. 
This is proven showing lower bounds on the norms of projected eigenvectors hold with high probability given the assumption of the stability condition. It is important to note that our parameter U = √ √  ρ1 u1 , . . . , ρK ud with the additional coefficients of the square roots of estimated eigenvalues makes this easier to bound than if there was no coefficients as found in the [LZL19] model. Assumption 10 (Stability). For positive constants α1 < 1 < α2 let AH (α1 , α2 ) be the collection of all partitions −∞ = a0 < a1 < · · · < aH = ∞ of R satisfying α1 α2 ≤ P(ai ≤ y < ai+1 ) ≤ . H H The inverse regression curve E[x|y] is ϑ -sliced stable with respect to y for some ϑ > 0 if ∃α1 , α2 , α3 such that for any v ∈ Rd and partition AH (α1 , α2 ) H−1 1 α3 H ∑ var(v⊤E[x|y]|ah ≤ y ≤ ah+1) ≤ H ϑ var(v⊤ E[x|y]) h=0 for sufficiently large H. The curve is stable if it is ϑ -sliced stable for some positive constant ϑ . Theorem 11. Assume that assumption (10) holds for E[x|y], x is sub-Gaussian with variance q proxy σ , n = ρK d for some α > 1/2, and λ = C log(d) α n for some constant C. Then, there exists constants C1 ,C2 such that with probability at least 1 −C1 exp (−C2 log(d)) r ∗ log(d) ∥∇L(W )∥∞,∞ ≤ C . n 34 Corollary 12. There exist constants C,C1 ,C2 such that r s log(d) PcW − PW∗ F ≤ CMκ∗−1 ρ −1 n holds with probability at least 1 −C1 exp(−C2 log(d)) . 3.2.2 Proof of Theorem 11 Proof. We have ∇L(W∗ ) ∞,∞ ≤ ΣW b ∗ − ΣW∗ ∞,∞ + ΣW∗ − U b ∞,∞ . | {z } | {z } I1 I2 We first bound I1 . Since we have ∥AB∥∞,∞ ≤ ∥A∥∞,∞ ∥B∥1 where the latter is the operator norm, since ! ∥AB∥∞,∞ = max ∑ Ai,k Bk, j i, j k ≤ max |Ai, j | max ∑ |Bk,ℓ | i, j ℓ k = ∥A∥∞,∞ ∥B∥1 Then I1 ≤ ∥Σb − Σ∥∞,∞ ∥W∗ ∥1 , where the second factor may be treated as a constant. It suffices to bound the estimation error of Σ. √ Lemma 13. Let M e = maxi Σii . Suppose that n > 6 log d. Then, for some universal constant C, with probability at least 1 − exp(− log(d/4)), we must have s b − Σ∥∞,∞ ≤ C M log d . e ∥Σ n Proof of Lemma 13. Using a similar argument to the proof of Lemma 1 in [RWRY11], which is p  omitted for simplicity, we obtain, for all t ∈ 0,C1 M e ,    nt 2  P Σ b i j − Σi j ≥ t ≤ 4 exp − , C2 Me 35 q where C1 = 3200 and C2 = 40 are two universal constants. Therefore, taking t = 3C2 Mn e −1 log d p such that t < C1 M e and applying union bound over the d 2 entries gives s ! 3C 2 e log d M P Σ b −Σ ∞,∞ ≥ ≤ 4d −1 n q p e −1 Observing that t = 3C2 Mn log d < C1 M e implies n > 6 log d and this finishes the proof. What remains is bounding I2 . Using identity ΣW∗ = U, we have I2 = ∥U − U∥ b ∞,∞ . Let V̂ ≡ [û1 , . . . , ûK ] and Λ̂ = diag(ρ̂1 , . . . , ρ̂K ). Then let A = √1 X⊤ V̂ be a H × K matrix. Let x = z + w H H be the orthogonal decomposition with respect to col(Ω) and its complement. Then we have the decomposition XH = ZH + WH . This leads to the identity   1 b= U XH XH V̂ Λ̂−1/2 ⊤ H   1 1 = √ ZH A + √ WH A Λ̂−1/2 H H and e = PU (U) U b = √1 ZH AΛ̂−1/2 . H Then U b −U e= √1 WH AΛ̂−1/2 and H 1 1 ∥ √ WH AΛ̂−1/2 ∥∞,∞ ≤ ∥ √ WH ∥∞,∞ ∥AΛ̂−1/2 ∥1,∞ H H √ √ Note that ∥Λ̂−1/2 A∥1,∞ ≤ H∥Λ̂−1/2 A∥2,∞ by the basic inequality ∥v∥1 ≤ H∥v∥2 for any v ∈ q RH . Then ∀i ∈ [d], we have ∥A∗,i ∥2 = H1 V̂⊤ ⊤ p ∗,i XH XH V̂∗,i = ρ̂i , thus 1 ∥ √ WH AΛ̂−1/2 ∥∞,∞ ≤ ∥WH ∥∞,∞ . (3.12) H The benefit of having U in our model appears – if we had taken V̂ as our parameter and not p b we would have had the above bound times maxi ρ̂i . 
Instead we merely need to bound the U, behavior of WH to give an upper bound on I2 given the legitimacy of the substitution of U e for U, which we will see is very manageable. 36 To show the substitution of Ũ is legitimate, define W∗∗ as   ∗∗ 1 1/2 W = argmin ∥Σ W − Σ−1/2 U∥ e 2F (3.13) W 2 We show with high probability that this model recovers the desired reduction space. It is trivial to show that assuming U e is of rank K, span(W∗ ) = span(W∗∗ ). To show that the rank is K, it suffices to show that none of the projected vectors of ũi = PU (ûi ) are 0, and that the projection of the K vectors are injective. Thus we give a positive lower bound on the norms ∥ũ∥2 , and lower bounds on the angles between ∠(ũi , ũ j ) ∀i, j ∈ [K], i ̸= j. These important results come directly from [LZL19]. Theorem 14. If nρK = d α for some α > 1/2, there exists positive constants C1 , C2 , C3 , C4 , such that 1. for j = 1, . . . , K r ρK ∥ũ j ∥2 ≥ C1 ρ̂ j 2. for j = K + 1, . . . , H p r d log(d) ρK ∥ũ j ∥2 ≤ C2 nρK ρ̂ j hold with probability at least 1 −C3 exp (−C4 log(d)) . Its the first inequality that matters for our purposes; it not only gives us the lower bound on projected norms, it is also used in the next theorem: Theorem 15. The angles between any two vectors in {ũ1 , . . . , ũd } are nearly π/2 with high prob- ability. In particular there exist constants C1 ,C2 ,C3 such that p d log(d) | cos(∠(ũi , ũ j ))| ≤ C1 nρK holds with probability at least 1 −C2 exp (−C3 log(d)) . for any i ̸= j. Both proofs above are done in the case that x is Gaussian. The proofs rely on two core lemmas to show their claims hold with high probability. The first was proven in [? ] and is already done in 37 the case that x is sub-Gaussian. The other lemma gives basic tail bounds for χ 2 variables and can be extended to the case of sub-exponential random variables. Lemma 16. Let c1 , . . . , c p be positive constants. We have the following statements: 1. For d sub-Gaussian random variables x1 , . . . , xd , there exist constants C1 and C2 such that for any sufficiently small a we have ! ! 1 d 2 a2 ci (xi2 − E[xi2 ]) > a ≤ C1 exp − . (3.14) d∑ P i C2 ∑ j c2j 2. For 2d sub-Gaussian random variables x1 , . . . , x p , y1 , . . . , y p with E[xi ] = E[yi ] = 0 for all i ∈ [d], there exist constants C1 and C2 such that for any sufficiently small a, we have ! ! 1 d 2 a2 ci xi yi > a ≤ C1 exp − . (3.15) d∑ P i C2 ∑ j c2j Proof. If xi is sub-Gaussian with parameter σ , then trivially xi2 is sub-exponential since ∀t > 0 we √ −t have P xi2 ≥ t = P |xi | ≥ t ≤ 2 exp( 2σ  2 ) where the inequality comes from the sub-Gaussian property. It is straightforward to show that xi2 has sub-exponential parameters (νi , σ ) for any νi2 > E[xi4 ]. As well for any ci ∈ R, ci xi2 is sub-exponential with parameters (ci νi , ci σ ) since we have the tail bound −t 2   P(ci xi2 ≥ t) = P(xi2 ≥ t/ci ) ≤ exp 2c2i νi2 √ c2i νi2 1 ∑i c2i νi2 σ (maxi ci ) for all 0 ≤ t/ci ≤ νi2 /σ ⇔0≤t ≤ ci σ . Then d ∑i ci xi2 is sub-exponential with parameters ( d , d ). ∑i c2i νi2 Then for a ≤ dσ maxi ci we have ! 1 c2 a2 ci (xi2 − E[xi2 ]) > a ≤ 2 exp(− ). (3.16) d∑ P i ∑ j c2j νi2 (xi +y j )2 −xi2 −y2j The second statement follows likewise since xi y j = 2 is sub-exponential. √ H d log(d) Corollary 17. Let Σ1 ≡ cov(w) and IH the identity on R . Then if n is sufficiently small, the event p   1 ⊤ tr(Σ1 ) d log(d) Ω = ω ∥ WH WH − IH ∥F ≤ a H n n 38 happens with probability at least 1 −C1 exp (−C2 log(d)) . Proof. 
Since x is sub-Gaussian, XH has columns that are averages of c independent sub-Gaussians, and under linear projection each entry of WH is likewise sub-Gaussian. The term inside the Frobe- nius norm is a matrix with entries that are a sum of sub-exponential random variables, with the subtraction of the means for the squared terms. In particular if Wi j is the i j-th entry of WH and δi j is the usual Kronecker delta, ! d WkiWk j 1 ⊤ tr(Σ1 ) tr(Σ1 ) WH WH − IH = ∑ H − δi j n . H n k ij Since every Wki = 1c ∑ce=1 wki e for c = n/H, " #" # d WkiWk j 1 d c c ∑ H = c2 n ∑ ∑ wkie ∑ wke f . k k e=1 e=1 W2 h i tr(Σ1 ) When i = j we have E ∑kp Hki = n . Then each WkiWk j is sub-exponential and p ! 1 ⊤ tr(Σ1 ) d log(d) P ∥ WH WH − IH ∥F ≤ a H n n   1 ⊤ p =P ∥ 2 WH WH − tr(Σ1 )IH ∥F ≤ a d log(d) c  " #2  i=H, j=H d 1 =P  2 ∑ ∑ WkiWk j − δi j tr(Σ1 ) ≤ a2 d log(d) c ij k p ! d a d log(d) 1 ≤H 2 P 2 ∑ WkiWk j − δi j tr(Σ1 ) ≤ c k H log(d)a2   ≤C1 exp − C2 1 where the last inequality comes from the application of (16) to c2 ∑dk WkiWk j − δi j tr(Σ1 ) with first sub-exponential parameter being O(d). Given the above, the proof of Theorem 14 goes through exactly as in [LZL19]. The proof of Theorem 15 uses a transformation of XH via an orthogonal matrix T such that √1 T ZH H = (A⊤ , 0)⊤ 39 and √1 T WH H = (0, B⊤ )⊤ , where A ∈ RK×H and B ∈ R(d−K)×H . Then via this transformation the proof depends on the following events happening with high probability: (i) ρmin (AA⊤ ) ≥ λ , (ii) √ d log(d) q ∥PT ZH (T û j )∥2 ≥ C ρρ̂Kj for all j ∈ [K], and (iii) ∥B⊤ B − λ IH ∥F ≤ C n for some scalar λ > 0. (i) follows from the Sine-Theta Theorem, (ii) follows from Theorem 14, and (iii) follows from Lemma 16. q log(d) Lemma 18 (Bounding WH ). Assume λ = C n for some constant C. Then there exist con- stants C1 ,C2 such that 1 ∥ √ WH AΛ̂−1/2 ∥∞,∞ ≤ λ /2 H with probability at least 1 −C1 exp(−C2 log(d)). Proof. Using the bound in (3.12), it suffices to bound ∥WH ∥∞,∞ . As a linear function of a sub- √ Gaussian variable, each entry Wi j = 1c ∑ck=1 wkij of WH is sub-Gaussian with parameter σw / c, for some σw ∈ R, i ∈ [d] and j ∈ [H]. Then using a union bound we have r ! CH log(dH) √ p  P max |Wi j | ≥ ≤ dHP | cWi j | ≥ C log(dH) i, j 4n ≤ 2e−(C−1) log(dH) Combining the bounds for I1 and I2 we may seek bounds       ∗ λ ∗ ∗ λ ∗ λ P ∥∇L(W )∥∞,∞ ≤ ≤ P ΣW b − ΣW ∞,∞ ≤ + P ΣW − U b ≤ 2 4 ∞,∞ 4 q Setting λ = 2∥W ∥1C log(d) ∗ n yields the desired probabilities, with a change of constants. This completes the proof of the bound. 3.3 Linear Discriminant Analysis Linear Discriminant Analysis (LDA) is a classification technique, that assumes x ∈ Rd is a random vector and y ∈ {1, . . . , K + 1} is a discrete random variable over K + 1 classes, such that x|y = k ∼ 40 N(µk , Σ) for k ∈ {1, . . . , K + 1}. Given the normal model, the Bayes rule for estimating y can be explicitely derived as µk ŷ = argmax v⊤k (x − ) + log(πk ), k 2 where πk = P(y = k) and vk = Σ−1 µk for k ∈ [K + 1]. Alternatively, LDA can be viewed from the perspective of dimensionality reduction, given by Fisher’s discriminant problem where one seeks a low dimensional projection of the data such that the between-class variance is large relative to the within class variance. This problem is formulated as max v⊤ ⊤ ⊤ k Ωvk subject to vk Σvk ≤ 1, vk Σvi = 0 ∀i < k. vk where Ω ≡ cov(E[x|y]) as before. It is simple to show that this gives us exactly the same problem that motivates the GEV esti- mator for the case of a discrete response variable y. 
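For intuition, in the classical regime where $d < n$ and the covariance estimate is invertible, these discriminant directions can be computed directly from plug-in estimates of $\Omega$ and $\Sigma$ via a generalized eigensolve. The sketch below (NumPy/SciPy; names are illustrative) is only meant to make the construction concrete; in the HDLSS setting the penalized estimator (3.5) replaces the final eigensolve.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, K):
    """Unpenalized Fisher/GEV discriminant directions (classical d < n regime).

    X: (n, d) data matrix; y: (n,) integer class labels; returns d x K directions.
    """
    Xc = X - X.mean(axis=0)                        # center the data, so E[x] = 0
    Sigma_hat = np.cov(Xc, rowvar=False)           # covariance of x
    classes = np.unique(y)
    means = np.stack([Xc[y == k].mean(axis=0) for k in classes])
    Omega_hat = means.T @ means / len(classes)     # between-class covariance estimate
    # generalized eigenproblem Omega_hat v = rho * Sigma_hat v (needs Sigma_hat invertible)
    rho, V = eigh(Omega_hat, Sigma_hat)            # eigenvalues in ascending order
    return V[:, ::-1][:, :K]                       # top-K generalized eigenvectors
```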
Note here that Σ is the homoskedastic within- class covariance, and not cov(x). However, it is straightforward to show using the law of total covariance that the generalized eigenvectors of (Ω, Σ) are the same as those of (Ω, cov(x)) (with different eigenvalues). Thus we can treat Σ above as cov(x) without affecting the problem. The vectors vk determined by the optimization are called the discriminant vectors, and for a d × n data matrix X a classification rule is obtained by computing the projection V⊤ X for V ≡ [v1 , . . . , vK ], and then assigning each observation to the nearest V⊤ µk . In recent years, several references have extended LDA to the high dimensional setting using sparsity, most of which use a lasso penalty [THNC03, Len08, CHWE11]. The derivation of LDA using Fisher’s Discriminant is leveraged in the well-known method found in [WT11], that uses a group lasso regularization with a diagonal estimate for the covariance of the data. A powerful technique called MSDS for multiclass sparse discriminant analysis is given in [ZMY18], which has both theoretical and empirical justification. This estimator attempts to find sparse discriminant vectors from the Bayes rule of the normal model for LDA, whereas our own method uses the Fisher’s Discriminant derivation. The authors also show the estimators of ROAD [FFT12], and DSDA [MZY12], occur as special cases of MSDS up to particular constants. While error bounds 41 are given on the estimator, they are not in the form of a function of the data dimension and sample size. We show our estimator performs better empirically and has strong error bounds in terms of the data dimension and sample size. 3.3.1 Consistency for LDA Fisher’s Discriminant problem in many ways can be seen as a special case of SIR for a discrete output of y. With Ω as defined above and with the assumption that E[x] = 0, its estimator simplifies 1 K+1 ⊤ 1 to Ω̂ = K+1 ∑k=1 x̄k x̄k where x̄k = nk ∑ni=1 xi 1(yi = k). Here we use 1(yi = k) as the indicator variable that xi is of class k, and nk = ∑ni=1 1(yi = k) is the number of samples that are of class p p k. Let Û = ( ρ̂1 û1 , . . . , ρ̂K ûK ) be columns using the eigenpairs (ρ̂k , ûk ) of Ω̂. Then the GEV solution Ŵ of (3.5) using Σ̂ and Û gives us K estimated discriminant vectors which can be used for LDA classification. Due to the similarity of LDA to SIR, much of the proof of consistency of the discriminant vectors can proceed analogously to that of SIR. Here, the Stability assumption (10) takes the form of a “balanced clusters” assumption. Since the Stability assumption holds for any H sufficiently large, we may take H = K + 1 so that for γ1 = K + 1 mink∈[K+1] πk and γ2 = K maxk∈[K] πk , any partition AH (γ1 , γ2 ) will separate the support values of y into intervals [ak−1 , ak ], k − 1 < ak−1 < k < ak < k + 1. Thus for k ∈ [K + 1], the random vector  E[x|y] ak−1 ≤ y ≤ ak = E[x|y = k] is a constant. Then var(v⊤ E[x|y] ak−1 ≤ y ≤ ak ) = 0 for all v ∈ Rd , k ∈ [K + 1]. The existence of positive constants γ1 < 1 < γ2 that serve as bounds to the probabilities of the classes of x guarantee that there are no asymptotic behaviors of the form limd→∞ πk = 0 as a function of dimensionality of x. For the sake of simplicity, we assume all probabilities πk are equal, with the extension to the general case being straightforward. Theorem 19. Let x be multivariate Gaussian, n = ρK d α for some α > 1/2, and y = f (x) has finite support along with assumption (10) holding for E[x|y]. 
Then, with probability at least 1 − 42 C1 exp (−C2 log(d)), we have r ∗ log(d) ∥∇L(W )∥∞,∞ ≤ C . n Corollary 20. Assume that 1/M ≤ ρmin (Σ) ≤ ρmax (Σ) ≤ M and ρ ≤ ρK (Ω). Then there exist constants C,C1 ,C2 such that r s log(d) PcW − PW∗ F ≤ CMκ∗−1 ρ −1 n holds with probability at least 1 −C1 exp(−C2 log(d)) . 3.3.1.1 Proof of Theorem 19 Proof. For the sake of simplicity, we assume all probabilities πk are equal, with the extension to the general case being straightforward. As earlier, we have ∇L(W∗ ) ∞,∞ ≤ ΣW b ∗ − ΣW∗ ∞,∞ + U−U b ∞,∞ . | {z } | {z } I1 I2 The bound on I1 follows from the theorem used earlier in [RWRY11]. What remains is bound- ing I2 . Let Vb = [b u1 , . . . , u b = diag(ρ̂1 , . . . , ρ̂K ), and XK+1 = (x̄1 , . . . , x̄K+1 ). Then let A = √ 1 X⊤ V b K ], Λ b K+1 K+1 be a (K + 1) × K matrix. Using the same decomposition x = z + w with respect to col(Ω) and its complement we get XK+1 = ZK+1 + WK+1 . Since 1 ⊤ = Ω, K+1 XK+1 XK+1 b we get the identity   1 1 b −1/2 b= U √ ZK+1 A + √ WK+1 A Λ K +1 K +1 and e = PΩ (U) U b = √1 ZK+1 AΛ b −1/2 . H Then U b −Ue= √ 1 WK+1 AΛ b −1/2 and K+1 1 b −1/2 ∥∞,∞ ≤ ∥ √ 1 WK+1 ∥∞,∞ ∥AΛ b −1/2 ∥1,∞ . ∥√ WK+1 AΛ K +1 K +1 43 √ Since ∥Λb −1/2 A∥1,∞ ≤ K + 1∥Λ b −1/2 A∥2,∞ , ∀i ∈ [d], and r 1 b⊤ p ∥A∗,i ∥2 = V∗,i XK+1 X⊤ K+1 b ∗,i = ρ̂i , V K +1 we have 1 b −1/2 ∥∞,∞ ≤ ∥WK+1 ∥∞,∞ . ∥√ WK+1 AΛ K +1 As in the SIR application, we are able to use U e in a substitute model (3.13) that recovers the desired reduction space with high probability. All that needs to be shown is that with high probability U e is of rank K. We may directly use the results of (14) and (15), since all assumptions of SIR hold in the LDA application, with the only requirement being a similar simplifying assumption that ∃c > 0 such that c(K + 1) = n. Thus we get that U e is of rank K with probability at least 1 −C1 log(−C2 d) for constants C1 ,C2 . q log(d) Lemma 21 (Bounding WK+1 ). Assume λ = C n for some constant C. Then there exist constants C1 ,C2 such that 1 b −1/2 ∥∞,∞ ≤ λ /2 ∥√ WK+1 AΛ K +1 with probability at least 1 −C1 exp(−C2 log(d)). Proof. Without loss of generality, each entry Wi j = 1c ∑ck=1 wkij of WH is Gaussian with variance var(w)/c2 , assuming each wkij is Gaussian with variance var(w) for i ∈ [d] and j ∈ [K + 1]. Then using a union bound we have r ! C(K + 1) log(d(K + 1)) P max |Wi j | ≥ i, j 4n  p  ≤ d(K + 1)P |c2Wi j | ≥ C log(d(K + 1)) ≤ 2e−(C−1) log(d(K+1)) This completes the proof. 44 Given the consistency of c W, it is straightforward to show the consistency of the classification rate of the estimated classifier. The classifier Ybc W gives the label k to a new point x′ if µk is the nearest class centroid to x′ after projection by c W: Ybc W (X) ≡ argmin ∥cW⊤ x′ − c W⊤ µ̂k ∥22 . k Let the misclassification error rate be given by Rn ≡ P(Yb ̸= Y |observed data) where Y is the true la- bel. Likewise define R as the misclassification rate of the population classifier using the parameters W∗ and µk . Then we have the following. Theorem 22. There exist constants C,C1 ,C2 such that r s log(d) |Rn − R| ≤ Cκ∗−1 ρ −1 n with probability at least 1 −C1 exp(−C2 log(d)) . 3.3.1.2 Proof of Theorem 22 Proof. Let Y be the classifier using the population parameters of W∗ and µk . Define lk = ∥W∗⊤ X− W∗⊤ µk ∥22 and lˆk = ∥c W⊤ X − c W⊤ µ̂k ∥22 . For any ε > 0, we have Rn − R ≤ P(Yb ̸= Y ) ≤ 1 − P(|lˆk − lk | < ε/2, |lk − lk′ | > ε, for any k, k′ ) ≤ P(|lˆk − lk | ≥ ε/2 for some k) + P(|lk − lk′ | ≤ ε for some k, k′ ). 
For the first term |lˆk − lk | ≤ |(c WcW⊤ µ̂k − W∗ W∗⊤ µk )⊤ X| + |µ̂⊤ cc⊤ ⊤ ∗ ∗⊤ k WW µ̂k − µk W W µk |. √ It is straightforward to show given (3.11) that bounds ∥c W − W∗ ∥F ≤ 3κ∗−1 λ s, we have ∥cWcW⊤ − √ W∗ W∗⊤ ∥2 ≤ Cκ∗−1 λ s and given a standard error bound on sample mean estimation we get |lˆk − q lk | ≤ Cκ∗−1 s log(d) n with probability 1 −C1 exp(−C2 log(d)). The expression |lk − lk′ | is normally distributed with mean (µk −µk′ )⊤ W∗ W∗⊤ (µk +µk′ ) and variance (µk −µk′ )⊤ W∗ W∗⊤ ΣW∗ W∗⊤ (µk − 45 µk′ ), and since we have the assumption (5) which implies (µk − µk′ ) is bounded away from 0, with q high probability |lk − lk′ | > Cκ∗−1 s log(d) ′ n . Applying a union bound over k, k for both terms yields the result. 3.4 Minimax Rate We show that both the error rates for SIR and LDA fall under a model which achieve minimax rates. The setup closely follows [TSY20]. Define the GEV parameter space as follows: let {J1 , . . . , JH } for H > K be a measurable partition of the sample space of y, and define ỹ = ∑H c=1 c · 1(y ∈ Jc ) as the discretized version of y. Then we may define the corresponding conditional covariance e ≡ cov[E(x|ỹ)]. Let F (s, d, K, γ̃; κ, M) be the set of all pairs of matrices (Σ, Ω) Ω e with generalized eigenpairs (γ̃i , ṽi ), γ̃i ∈ R, ṽi ∈ Rd for i ∈ [K], such that 1. ∑K i=1 ∥ṽi ∥0 = s. 2. 1/M ≤ ρmin (Σ) ≤ ρmax (Σ) ≤ M. 3. κ γ̃ ≥ γ̃1 ≥ · · · ≥ γ̃K ≥ γ̃ > 0 for a fixed constant κ > 1. For simplicity denote F (s, d, K, γ̃; κ, M) by F . Let L(x) denote the distribution of a random variable x. Then the probability spaces for the GEV problem are P (n, H, s, d, K, γ̃; κ, M) = {L((x1 , ỹ1 ), . . . , (xn , ỹn )) : (xi , ỹi )’s are i.i.d. such that (cov[E(xi |ỹi )], cov(xi )) ∈ F }, where n is the sample size, and parameters s, d, and γ̃ may depend on n, while κ, M are fixed constants. Note that this framework gives a fixed slicing scheme where H is treated as a bounded integer, so that K is also bounded above by H − 1. Denote V e ∈ Rd×K as the matrix where each column is one of the generalized eigenvectors of (Σ, Ω) e with nonzero eigenvalue. Let V b be any estimator for V.e The following provides a lower bound on the minimax rate among all estimators. 46 Theorem 23. Assume nγ̃ 2 ≥ C log eds for some sufficiently large constant C0 . Then there exist positive constants C1 and C2 such that   s log(ed/s) inf sup P ∥PV̂ − PṼ ∥2F ≥ C1 ∧C2 ≥ 0.8, V̂ P∈P nγ̃ 2 where P = P (n, H, s, d, K, γ̃; κ, M). 3.4.1 Proof of Theorem 23 The proof follows from [TSY20], for completeness we include the outline of the proof and the relevant papers for intermediate steps. Since any special case of the estimation problem yields a lower bound for the general case, we are able to specify further that the data distribution has an assumption of normality on the conditional distribution x|ỹ. We specify a subset of the parameter space as follows: let K = 1, H = 2, and for i = 1, 2, let xi |(ỹi = 1) ∼ Nd ((1 − α)v, Id − Ω̃), P(ỹ = 1) = α, xi |(ỹi = 2) ∼ Nd (−αv, Id − Ω̃), P(ỹ = 2) = 1 − α. Here we have v ∈ O(d, 1), (unit length d-vectors), and for 0 < α < 1 and γ = α we have E[x] = 0, Ω̃ = cov(E[x|ỹ]) = γvv⊤ , and Σ = cov(x) = Id for Id the identity matrix on Rd . The minimax results are derived using Fano’s Lemma found in [Yu97] (Lemma 3). Lemma 24 (Fano’s Lemma). Let (Θ, ρ) be a metric space and {Pθ : θ ∈ Θ} a collection of proba- bility measures. 
For any totally bounded T ⊂ Θ, denote by M(T, ρ, ε) the ε-packing number of T with respect to ρ, that is, the maximimal number of points in T whose pairwise minimum distance in ρ is at least ε. Define the Kullback-Leibler diameter of T by dKL (T ) ≡ sup D(Pθ ∥Pθ ′ ) θ ,θ ′ ∈T for D(Pθ ||Pθ ′ ) the KL divergence between distributions Pθ and Pθ ′ . Then ε2   2 dKL (T ) + log 2 inf sup Eθ [ρ (θ̂ (x), θ )] ≥ sup sup 1− θ̂ θ ∈Θ T ⊂Θ ε>0 4 log M(T, ρ, ε) 47 and equivalently, ε2   2 dKL (T ) + log 2 inf sup Pθ ρ (θ̂ (x), θ ) ≥ ≥ 1− . θ̂ θ ∈Θ 4 log M(T, ρ, ε) Then two steps are required, first determining the Kullback-Leibler divergence between the data distributions in T , and second determining the ε-packing of T . The first task is straightforward. For i = 1, 2, let Σ = Id and Ω e i = γvi v⊤ , for γ ∈ (0, 1), vi ∈ i O(d, 1). Then let P(Ω e i , Σ) denote the distribution of a random i.i.d. sample of size n from the mixture Guassian distribution P(Ω e i , Σ) = αP1 (Ω e i , Σ) + (1 − α)P2 (Ω e i , Σ), where P1 (Ω e i , Σ) and e i , Σ) denote multivariate normal distributions Nd ((1 − α)vi , Id − Ω P2 (Ω e i ) and Nd (−αvi , Id − Ω e i ), respectively. Then using convexity of the K-L divergence, for two mixture-Gaussian distributions P(Ωe 1 , Σ) and P(Ω e 2 , Σ) we have D(P(Ω e 1 , Σ)||P(Ωe 2 , Σ)) ≤ αD(P1 (Ω e 1 ||Σ), P1 (Ω e 2 , Σ)) + (1 − α)D(P2 (Ω e 1 , Σ)||P2 (Ω e 2 , Σ)), thus it is sufficient to bound the K-L divergence between two Gaussian distributions. Using the explicit formula for the K-L divergence between Gaussian distributions and the prop- erties of the chosen parameters, the authors of [TSY20] determine the upper bound 2 D(P(Ωe 1 , Σ)||P(Ω e 2 , Σ)) ≤ 3γ · n∥v1 − v2 ∥22 . 1 − γ2 Once the K-L divergence is determined, the second task proceeds according to a well estab- lished packing argument that can be found in [CGRZ13] or [CMW13]. Corollary 25. Assume that ρ is a lower bound of both ρK and MγK , where γK is the K-th general- ized eigenvalue of the pair (Ω, Σ). Then GEV estimator obtains the minimax rate up to constants. 3.4.2 Proof of Corollary 25 Proof. The assumption that ∑K ∗ i=1 ∥ṽi ∥0 = s is equivalent to the sparsity assumptions on W . We have ∃γ > 0 such that γ < γK for generalized eigenpair (vK , γK ) of (Ω, Σ). It is simple to show that with the assumption ρ < MγK , then we may substitute γ with ρ/M for a lower bound on γK . 48 One needs to compare γ with γ̃. We can do this with the stability theorem and Weyl’s theorem. Let P(y ∈ Jh ) = Ph and µh = E[x|y ∈ Jh ]. We need the following from [? ] (Lemma 11 in supplementary materials). Lemma 26. Define the event E(ε) = {ω |Ph − H1 | > ε, ∀H}. There exist a positive constant C such 4 that, for any ε > Hc−1 , we have √ ε2 P(E(ε)) < CH 2 Hc + 1 exp(−(Hc + 1) ) 32 for sufficiently large H and c. Then for any v ∈ Rd we have |v⊤ Ωv⊤ − v⊤ Ωve ⊤ | = |v⊤ cov(E[x|y])v − v⊤ cov(E[x|ỹ])v| H H = v⊤ ∑ Ph cov(E[x|y]|y ∈ Jh )v − v⊤ ∑ Ph µh µh⊤ v h h H = ∑ Ph v⊤ cov(E[x|y]|y ∈ Jh )v − v⊤ µh µh⊤ v h  1 ≤ H +ε ∑ v⊤cov(E[x|y]|y ∈ Jh)v h α3 ≤ (1 + Hε) ϑ v⊤ Ωv H where the last inequality follows from the Stability Assumption (10). Taking maximum over norm 1 vectors yields ∥Ω − Ω∥ e 2 ≤ (1 + Hε) αϑ3 ρ1 . From Weyl’s inequality we have |ρK (Ω) − H e ≤ ∥Ω − Ω∥ ρK (Ω)| e 2 . Thus ρ − (1 + Hε) αϑ3 ρ1 < ρK (Ω), e and serves as a lower bound. Then we H α ρ−(1+Hε) ϑ3 H can replace γ̃ with M . This completes the proof. 
3.5 Canonical Correlation Analysis

The GEV estimator can also be applied to the Canonical Correlation Analysis (CCA) problem [Hot33], for which a number of techniques have been proposed in the past two decades to perform the task under HDLSS conditions where sparsity is assumed on the canonical directions [WKI08, WTH09, PTB09, HST11]. [CGRZ13] first gives the characterization of the probabilistic CCA model for sparse canonical directions and presents the CAPIT method for the problem. Rates of convergence are given that depend on an independent estimate of the precision matrices of the two data sources, for which even diagonal approximations can be difficult to compute. A modern standard for estimation in this problem is the Penalized Multivariate Analysis (PMA) method [WTH09], which estimates a regularized version of the singular value decomposition. Our estimator is shown to perform better empirically on simulations for sparse CCA.

Canonical Correlation Analysis is a classical technique that finds the linear combinations of two sets of random variables with maximal correlation. It has been applied to a number of different fields, including psychology, neurology, genomics, and economics. Let $x \in \mathbb{R}^{d_1}$ and $y \in \mathbb{R}^{d_2}$ be zero-mean random vectors with joint covariance matrix
$$\Sigma = \begin{pmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{pmatrix},$$
where $\Sigma_x = (\Sigma_{x,k\ell})$ and $\Sigma_y = (\Sigma_{y,k\ell})$ are the covariance matrices of $x$ and $y$, respectively, and $\Sigma_{xy} = (\Sigma_{xy,k\ell}) = \Sigma_{yx}^\top$ is the cross-covariance matrix between $x$ and $y$. Then CCA determines the $K$ canonical direction vectors by solving
$$\max\ v_{xk}^\top \Sigma_{xy} v_{yk}, \quad \text{subject to } v_{xk}^\top \Sigma_x v_{xk} = v_{yk}^\top \Sigma_y v_{yk} = 1, \ \ v_{xk}^\top \Sigma_x v_{xj} = v_{yk}^\top \Sigma_y v_{yj} = 0 \tag{3.17}$$
for $k \in [K]$ and $j < k$. The optimization problem can be solved by applying the singular value decomposition to the matrix $\Sigma_x^{-1/2} \Sigma_{xy} \Sigma_y^{-1/2}$, and a sample version is given by replacing the covariances with their usual estimators. This leads to consistent estimation of the canonical directions when the dimensions $d_1$ and $d_2$ are fixed and the sample size $n$ increases.

In the high-dimensional setting, when the dimensions exceed the sample size, one cannot compute the inverses of the sample covariances. This leads to the structural assumption of sparsity in the canonical directions, which allows for successful estimation. As shown in [CGRZ13], the pairs $(v_{xk}, v_{yk})$ are solutions to (3.17) if and only if
$$\Sigma_{xy} = \Sigma_x \Bigl( \sum_{k=1}^{K} \lambda_k v_{xk} v_{yk}^\top \Bigr) \Sigma_y$$
for some $\lambda_k > 0$, giving the correlation weights. We show that (3.17) is a special case of the generalized eigenvalue problem (3.1) with
$$\Omega = \begin{pmatrix} 0 & \Sigma_{xy} \\ \Sigma_{yx} & 0 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_x & 0 \\ 0 & \Sigma_y \end{pmatrix}, \qquad \text{and} \quad v_k = \begin{pmatrix} v_{xk} \\ v_{yk} \end{pmatrix}.$$
Substituting the above into (3.1) yields
$$v_k^* = \mathop{\mathrm{argmax}}_{v_{xk}, v_{yk}} \frac{2\, v_{xk}^\top \Sigma_{xy} v_{yk}}{v_{xk}^\top \Sigma_x v_{xk} + v_{yk}^\top \Sigma_y v_{yk}}.$$
It is straightforward to show this is equivalent to (3.17). Now assume that $\Sigma_{xy}$ has the singular value decomposition
$$\Sigma_{xy} = \sum_{k=1}^{K} \sigma_k\, u_{x,k}\, u_{y,k}^\top = O_x D O_y^\top,$$
where $O_x$ and $O_y$ collect the left and right singular vectors as columns and $D = \mathrm{diag}(\sigma_1, \ldots, \sigma_K)$. Then we have
$$\Omega = \begin{pmatrix} 0 & O_x D O_y^\top \\ O_y D O_x^\top & 0 \end{pmatrix}.$$
We need the following lemma, which gives the eigendecomposition of $\Omega$.

Lemma 27. We have $\Omega = Q \Lambda Q^\top$, where
$$Q = \frac{1}{\sqrt{2}} \begin{pmatrix} O_x & O_x \\ O_y & -O_y \end{pmatrix} \qquad \text{and} \qquad \Lambda = \begin{pmatrix} D & 0 \\ 0 & -D \end{pmatrix}.$$

Proof. The proof of this lemma follows from direct calculation.

Using Lemma 27, we observe that $U$ in (3.4) can be taken as $(O_x^\top, O_y^\top)^\top / \sqrt{2}$. An estimator of $U$ can be obtained by concatenating the top-$K$ left and right singular vectors of $\hat{\Sigma}_{xy} = n^{-1} \sum_{i=1}^{n} X_i Y_i^\top$, where the $X_i$'s and $Y_i$'s are independent and identically distributed samples of $x$ and $y$, respectively.
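To make the construction concrete, the plug-in pair $(\hat{\Sigma}, \hat{U})$ for the CCA instance of the GEV problem can be assembled directly from centered data matrices. The following minimal NumPy sketch mirrors the description above; the function name and shapes are our own illustrative choices, not a packaged implementation.

```python
import numpy as np

def cca_gev_inputs(X, Y, K):
    """Plug-in Sigma_hat (block-diagonal) and U_hat (stacked singular vectors / sqrt(2)).

    X: (n, d1) and Y: (n, d2) with centered columns; K: number of canonical pairs.
    """
    n = X.shape[0]
    Sxy = X.T @ Y / n                                 # cross-covariance estimate
    Sx, Sy = X.T @ X / n, Y.T @ Y / n                 # marginal covariance estimates
    Sigma_hat = np.block([[Sx, np.zeros_like(Sxy)],
                          [np.zeros_like(Sxy.T), Sy]])
    Ox, sig, OyT = np.linalg.svd(Sxy, full_matrices=False)
    U_hat = np.vstack([Ox[:, :K], OyT[:K, :].T]) / np.sqrt(2.0)
    return Sigma_hat, U_hat
```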
Thus we may apply the GEV estimator (3.5) to the CCA problem with the estimates for $\hat{U}$ and $\hat{\Sigma}$ defined above, and recover the spaces given by $\mathrm{span}\{v_{x1}, \ldots, v_{xK}\}$ and $\mathrm{span}\{v_{y1}, \ldots, v_{yK}\}$.

CHAPTER 4
EMPIRICAL RESULTS OF THE GEV ESTIMATOR

4.1 Implementation

To efficiently solve for $W$ in (3.5), we implement the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) of [BT09]. FISTA is a modification of the iterative first-order method ISTA used to solve $\ell_1$-regularized convex optimization problems; the modification uses a version of Nesterov acceleration [Nes83] to achieve a convergence rate of $O(1/k^2)$. Define $S_\lambda(\cdot)$, the element-wise soft-thresholding operator arising from the proximal step for the penalty in (3.4), as follows:
$$\bigl(S_\lambda(A)\bigr)_{ij} = \begin{cases} \mathrm{sgn}(a_{ij})\,(|a_{ij}| - \lambda), & \text{if } |a_{ij}| > \lambda, \\ 0, & \text{otherwise.} \end{cases}$$

Algorithm 4.1: A fast iterative shrinkage-thresholding algorithm for GEV.
Input: $U$, $\Sigma$, and $\lambda$. Initialization: take $W_X^{(0)} \in \mathbb{R}^{d \times K}$, $\rho^{-1} = 1/\rho_1(\Sigma)$, $t^{(0)} = 1$, and $k = 0$.
Output: $W$
1 repeat
2   $W_Y^{(k+1)} = S_{\rho^{-1}\lambda}\bigl( W_X^{(k)} - \rho^{-1} (\Sigma W_X^{(k)} - U) \bigr)$;
3   $t^{(k+1)} = \bigl(1 + \sqrt{1 + 4 (t^{(k)})^2}\,\bigr)/2$;
4   $W_X^{(k+1)} = W_Y^{(k+1)} + \frac{t^{(k)} - 1}{t^{(k+1)}} \bigl( W_Y^{(k+1)} - W_Y^{(k)} \bigr)$
5 until convergence;

Since the GEV estimator acts like a matrix version of Lasso regression, there are many algorithms one could apply to this $\ell_1$-regularized optimization problem. We compared the following methods for a fixed choice of the hyper-parameter $\lambda$ giving the regularization weight: the subgradient method [Nes04, Chapter 3.1], the proximal method [Roc70], FISTA, ADMM [BPC+11], and Chambolle-Pock [CP11]. The model used to test convergence is taken from the SIR simulations below, using Model 1 of the continuous responses, with Gaussian features and noise, and $d = 50$, $n = 200$.

Figure 4.1: Comparison of convergence rates of different algorithms.

As expected, in Figure 4.1 the subgradient method is suboptimal compared to all other methods, with a very slow convergence rate. Because only a small number of iterations is required to reach a stable critical point, all the other methods performed essentially equivalently in terms of iterations, with the exception of Chambolle-Pock, which had convergence problems after the first iteration. Important differences also occurred in run-time; ADMM in particular requires the computation of a matrix inverse, which drastically increases its run-time. Due to the simplicity of implementation, the speed of convergence, and the quick run-time, we retain the FISTA implementation of the GEV method above.

4.1.1 Robust Modification

The algorithm above is sufficient for most applications of the GEV method, but an important case arises for data drawn from heavy-tailed noise distributions or data containing outliers. In many applications, the assumption of sub-Gaussian tails is unrealistic; functional magnetic resonance imaging (fMRI) data [ENK16] and microarray gene expression data [WPL15] have been observed to have heavy tails and large kurtosis, regardless of the normalization methods used. The kurtosis of a random variable $X$ is defined as the fourth standardized moment $E\bigl[\bigl((X - \mu)/\sigma\bigr)^4\bigr]$, and high values indicate either that the probability mass is concentrated around the mean while the data-generating process occasionally produces values far from the mean (i.e., outliers), or that the probability mass is concentrated in the tails of the distribution.
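For reference, the FISTA iteration of Algorithm 4.1 is short enough to state as code. The sketch below is a minimal NumPy rendering of the update rules above, with our own illustrative function names and a simple stopping rule; it is not the packaged implementation used in the experiments.

```python
import numpy as np

def soft_threshold(A, thr):
    """Element-wise soft-thresholding operator S_thr(A)."""
    return np.sign(A) * np.maximum(np.abs(A) - thr, 0.0)

def fista_gev(Sigma_hat, U_hat, lam, n_iter=500, tol=1e-8):
    """Minimal sketch of Algorithm 4.1 applied to the penalized problem (3.5)."""
    d, K = U_hat.shape
    step = 1.0 / np.linalg.eigvalsh(Sigma_hat).max()   # step size 1 / rho_1(Sigma_hat)
    W_x = np.zeros((d, K))
    W_y = W_x.copy()
    t = 1.0
    for _ in range(n_iter):
        W_y_prev = W_y
        grad = Sigma_hat @ W_x - U_hat                 # gradient of the smooth part
        W_y = soft_threshold(W_x - step * grad, step * lam)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        W_x = W_y + ((t - 1.0) / t_next) * (W_y - W_y_prev)
        t = t_next
        if np.linalg.norm(W_y - W_y_prev) < tol:       # simple convergence check
            break
    return W_y
```

Since the smooth term being minimized here is a squared loss, a few grossly outlying entries in the inputs can dominate the gradient, which is precisely the weakness the robust modification addresses.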
To that end, we robustify the matrix square loss by introducing the following matrix Huber loss as a substitute for the Frobenius loss term in (3.2):
$$L_\alpha(W) = \sum_{ij} \ell_\alpha\bigl( [\Sigma W]_{ij} - U_{ij} \bigr),$$
where
$$\ell_\alpha(x) = \begin{cases} x^2, & \text{if } |x| \le \alpha^{-1}, \\ 2\alpha^{-1} |x| - \alpha^{-2}, & \text{otherwise.} \end{cases}$$
The Huber loss [Hub73] $\ell_\alpha(x)$ is quadratic for small values of $x$ and becomes linear when $|x|$ gets larger. The parameter $\alpha$ controls the blending of quadratic and linear loss. The least squares loss and the least absolute deviation (LAD) loss can be regarded as the two extremes of the Huber loss, corresponding to $\alpha = 0$ and $\alpha = \infty$, respectively. Using this loss in the $\ell_1$-regularized scheme, we get the estimate
$$\hat{W}_{\alpha,\lambda} = \mathop{\mathrm{argmin}}_{W \in \mathbb{R}^{d \times K}} \Bigl\{ \sum_{ij} \ell_\alpha\bigl( [\hat{\Sigma} W]_{ij} - \hat{U}_{ij} \bigr) + \lambda \|W\|_{1,1} \Bigr\}. \tag{4.1}$$
Notice, however, that the term $(\Sigma^{1/2} W - \Sigma^{-1/2} U)$ has been replaced inside the Huber loss by the term $(\Sigma W - U)$. As noted after the proof of Theorem 1, either expression, when used in the Frobenius norm loss, yields the same solution for the space $\mathcal{V}$, but they give different models when combined with the $\ell_1$-regularization term and used in (3.4). Computationally, if the expression $(\Sigma W - U)$ is used in Algorithm 4.1, the gradient requires the computation of $\Sigma^2$, which leads to greater run-time and worse performance, likely due to floating-point error, compared with the expression $(\Sigma^{1/2} W - \Sigma^{-1/2} U)$. However, when using the Huber loss, the gradient computation with the expression $(\Sigma^{1/2} W - \Sigma^{-1/2} U)$ would require $\Sigma^{-1/2}$. This is statistically and computationally undesirable, especially in the case $d > n$ where $\hat{\Sigma}$ is singular. Then, with $D_\alpha$ as the gradient of the Huber loss,
$$\bigl(D_\alpha(A)\bigr)_{ij} = \begin{cases} 2 a_{ij}, & \text{if } |a_{ij}| \le \alpha^{-1}, \\ 2\alpha^{-1} \mathrm{sgn}(a_{ij}), & \text{otherwise,} \end{cases}$$
we define the algorithm below.

Algorithm 4.2: Huber loss algorithm for robust GEV.
Input: $U$, $\Sigma$, $\lambda$ and $\alpha$. Initialization: take $W_X^{(0)} \in \mathbb{R}^{d \times K}$, $\rho^{-1} = 1/\rho_1(\Sigma)$, $t^{(0)} = 1$, and $k = 0$.
Output: $W$
1 repeat
2   $W_Y^{(k+1)} = S_{\alpha\lambda}\bigl( W_X^{(k)} - \rho^{-2}\, \Sigma D_\alpha(\Sigma W_X^{(k)} - U) \bigr)$;
3   $t^{(k+1)} = \bigl(1 + \sqrt{1 + 4 (t^{(k)})^2}\,\bigr)/2$;
4   $W_X^{(k+1)} = W_Y^{(k+1)} + \frac{t^{(k)} - 1}{t^{(k+1)}} \bigl( W_Y^{(k+1)} - W_Y^{(k)} \bigr)$
5 until convergence;

4.2 Sliced Inverse Regression

We compare our method of applying GEV to the sliced inverse regression problem against the classical method, which we label SIR [Li91], and the modern method LassoSIR (LSIR) [LZL19] used for high-dimensional problems. To facilitate a fair comparison with SIR and LSIR, all the simulation studies are generated under forward models including both categorical and continuous responses for low (d = 100) and high (d = 1000) dimensional predictors. Throughout the simulations, we use K-fold cross-validation (CV) to select the tuning parameters and quantify the estimation accuracy using three different metrics defined as follows: the canonical correlation (CCA) between
, k − 1, 1 + ∑K−1 j=1 f j (x) where K is the number of categories and fk (x) is the component connecting x with y. We consider the following two different models of fk (x): 1. Model 1: fk (x) = sin(x⊤ βk /4) + 1; 2. Model 2: fk (x) = exp(x⊤ βk ). For both models, the fk (x) components are monotone within the domain of x, so they are favorable to SIR. Moreover, we can see that model 2 is actually the multinomial logistic regression. The simulation includes comparing all combination of the two scenarios, two models and the two configurations (n, d) = (200, 50) and (500, 1000) with 100 replicates. As shown in Table 4.1, GEV-SIR dominates SIR and LSIR for all three metrics. 56 Table 4.1: Summary of estimation accuracy for categorical response in low and high dimensions. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. Sample and Error Type Model Method (d, n) FD TC CCA GEV 50, 200 3.38 (.359) .154 (.09) .602 (.188) Model 1 SIR 50, 200 3.75 (.137) .062 (.034) .393 (.098) LSIR 50, 200 3.49 (.397) .127 (.099) .562 (.232) GEV 50, 200 2.06 (.321) .484 (.08) .950 (.035) Model 2 SIR 50, 200 2.90 (.237) .276 (.059) .811 (.042) Gaussian-X LSIR 50, 200 2.26 (.386) .436 (.096) .945 (.034) GEV 1000, 500 3.61 (.319) .097 (.08) .209 (.120) Model 1 LSIR 1000, 500 3.71 (.326) .073 (.082) .305 (.206) GEV 1000, 500 1.97 (.379) .507 (.095) .495 (.146) Model 2 LSIR 1000, 500 2.08 (.432) .479 (.108) .718 (.178) GEV 50, 200 3.31 (.383) .174 (.096) .652 (.192) Model 1 SIR 50, 200 3.73 (.165) .068 (.041) .420 (.100) LSIR 50, 200 3.43 (.424) .143 (.106) .620 (.218) GEV 50, 200 1.87 (.385) .534 (.096) .954 (.034) Model 2 SIR 50, 200 2.83 (.275) .292 (.069) .813 (.035) Elliptical-X LSIR 50, 200 2.10 (.400) .474 (.101) .951 (.029) GEV 1000, 500 3.41 (.436) .149 (.109) .353 (.182) Model 1 LSIR 1000, 500 3.48 (.416) .131 (.104) .443 (.208) GEV 1000, 500 1.88 (.461) .530 (.115) .618 (.169) Model 2 LSIR 1000, 500 1.89 (.486) .527 (.121) .727 (.167) For continuous response, we consider the following four scenarios for both low (d = 50) and high (d = 1000) dimensional data with either 1 : Gaussian predictors and Gaussian noise. 2 : Gaussian predictors and elliptical noise. 3 : Elliptical predictors and Gaussian noise. We randomly generate n = 500 predictors x from either a multivariate normal or elliptical distribution with mean zero and and the same covariance matrix as in categorical cases. For the continuous responses, we then generate the responses variable according to the following three models: 1. Model 1: y = (x⊤ β1 )/{0.5 + (x⊤ β2 + 1.5)2 } + 0.5ε; 2. Model 2: y = x⊤ β1 + 2 + exp(x⊤ β2 ) + 0.5 ∗ ε|x⊤ β1 + 2|, 57 3. Model 3: y = (x⊤ β1 + 1)2 + (x⊤ β2 + 1)2 + 0.5 ∗ ε, The ε’s are independently generated from either standard normal or t5 distribution. Here we set √ √ β1 = (1, . . . , 1, 0, . . . , 0)⊤ / 6 and β2 = (1, −1, 1, −1, 1, −1, 0, . . . , 0)⊤ / 6 with the first 6 elements of both vectors being nonzero. The results are found in tables 4.2 and 4.3. Table 4.2: Summary of estimation accuracy for continuous response in low dimensions. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. 
Sample and Error Type Model Method (d, n) FD TC CCA GEV 50, 200 2.27 (.247) .433 (.062) .974 (.018) Model 1 SIR 50, 200 3.22 (.293) .196 (.073) .793 (.061) LSIR 50, 200 2.16 (.220) .460 (.055) .973 (.017) Gaussian-X, Gaussian-Error GEV 50, 200 2.08 (.120) .478 (.030) .985 (.014) Model 2 SIR 50, 200 2.66 (.225) .335 (.056) .859 (.049) LSIR 50, 200 2.11 (.136) .474 (.034) .971 (.019) GEV 50, 200 2.11 (.179) .472 (.045) .982 (.020) Model 3 SIR 50, 200 3.19 (.348) .202 (.087) .733 (.138) LSIR 50, 200 2.19 (.199) .452 (.050) .965 (.030) GEV 50, 200 2.38 (.303) .404 (.076) .962 (.027) Model 1 SIR 50, 200 3.47 (.237) .131 (.059) .683 (.111) LSIR 50, 200 2.30 (.262) .424 (.065) .961 (.024) Gaussian-X, Elliptical-Error GEV 50, 200 2.08 (.183) .478 (.046) .983 (.018) Model 2 SIR 50, 200 2.82 (.243) .296 (.061) .813 (.060) LSIR 50, 200 2.13 (.171) .469 (.043) .965 (.023) GEV 50, 200 2.18 (.278) .456 (.070) .972 (.068) Model 3 SIR 50, 200 3.21 (.351) .199 (.088) .734 (.128) LSIR 50, 200 2.24 (.278) .439 (.070) .958 (.056) GEV 50, 200 2.11 (.290) .473 (.072) .978 (.016) Model 1 SIR 50, 200 3.21 (.299) .199 (.075) .786 (.070) LSIR 50, 200 2.04 (.310) .490 (.077) .961 (.017) Elliptical-X, Gaussian-Error GEV 50, 200 2.25 (.261) .437 (.065) .953 (.056) Model 2 SIR 50, 200 2.99 (.327) .253 (.082) .770 (.094) LSIR 50, 200 2.33 (.300) .418 (.075) .929 (.060) GEV 50, 200 2.75 (.434) .312 (.108) .828 (.137) Model 3 SIR 50, 200 3.50 (.286) .126 (.067) .558 (.154) LSIR 50, 200 2.85 (.437) .287 (.109) .801 (.144) 58 Table 4.3: Summary of estimation accuracy for continuous response in high dimensions. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. Sample and Error Type Model Method (d, n) FD TC CCA GEV 1000, 500 3.06 (.133) .236 (.033) .898 (.028) Model 1 LSIR 1000, 500 3.13 (.123) .217 (.031) .889 (.029) Gaussian-X, Gaussian-Error GEV 1000, 500 2.69 (.164) .329 (.041) .907 (.027) Model 2 LSIR 1000, 500 2.79 (.152) .303 (.038) .895 (.027) GEV 1000, 500 3.13 (.256) .216 (.064) .828 (.076) Model 3 LSIR 1000, 500 3.23 (.223) .194 (.056) .812 (.084) GEV 1000, 500 3.25 (.238) .187 (.060) .827 (.133) Model 1 LSIR 1000, 500 3.31 (.235) .171 (.059) .813 (.137) Gaussian-X, Elliptical-Error GEV 1000, 500 2.84 (.291) .291 (.073) .879 (.058) Model 2 LSIR 1000, 500 2.97 (.247) .259 (.062) .851 (.085) GEV 1000, 500 3.19 (.296) .203 (.074) .817 (.078) Model 3 LSIR 1000, 500 3.25 (.232) .186 (.058) .804 (.088) GEV 1000, 500 3.12 (.176) .220 (.044) .866 (.036) Model 1 LSIR 1000, 500 3.11 (1.01) .183 (.044) .830 (.068) Elliptical-X, Gaussian-Error GEV 1000, 500 3.37 (.193) .158 (.048) .777 (.070) Model 2 LSIR 1000, 500 3.49 (.174) .127 (.044) .743 (.088) GEV 1000, 500 3.87 (.088) .032 (.022) .417 (.138) Model 3 LSIR 1000, 500 3.91(.081) .023 (.020) .353 (.159) 4.2.1 Heavy Noise Slice Inverse Regression Here we show our adapted Huber loss GEV method when applied to the SIR problem with in- creased noise. We use the same models in the continuous response section with all the same conditions, with the exception of the coefficient of the noise term being raised from 0.5 to 1. As well we included an additional model Model 4: y = (x⊤ β1 + 3)2 + 2|x⊤ β2 + 3| + ε|x⊤ β2 |; and the scenario of having elliptical features for x and elliptical noise. We compare this to both SIR and LSIR. 59 Table 4.4: Summary of estimation accuracy for Huber loss estimation in low dimensions with high noise. 
We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. Sample and Error Type Model Method (d, n) FD TC CCA GEV 50, 200 3.25 (.321) .188 (.080) .699 (.153) Model 1 SIR 50, 200 3.83 (.125) .04 (.031) .286 (.114) LSIR 50, 200 3.56 (.318) .108 (.080) .573 (.167) Gaussian-X, Gaussian-Error GEV 50, 200 2.00 (.238) .498 (.080) .992 (.047) Model 2 SIR 50, 200 2.27 (.346) .432 (.087) .979 (.102) LSIR 50, 200 2.01 (.234) .497 (.058) .997 (.041) GEV 50, 200 2.41 (.332) .397 (.083) .992 (.075) Model 3 SIR 50, 200 3.55 (.249) .111 (.062) .560 (.131) LSIR 50, 200 2.43 (.398) .390 (.100) .915 (.089) GEV 50, 200 2.43 (.376) .391 (.094) .913 (.158) Model 4 SIR 50, 200 3.61 (.227) .099 (.058) .523 (.178) LSIR 50, 200 2.53 (.434) .367 (.108) .900 (.176) GEV 50, 200 3.28 (.374) .181 (.096) .651 (.176) Model 1 SIR 50, 200 3.84 (.116) .040 (.029) .290 (.096) LSIR 50, 200 3.60 (.303) .100 (.076) .545 (.159) Gaussian-X, Elliptical-Error GEV 50, 200 2.02 (.260) .493 (.065) .991 (.046) Model 2 SIR 50, 200 2.38 (.356) .406 (.089) .971 (.093) LSIR 50, 200 2.02 (.270) .492 (.068) .995 (.040) GEV 50, 200 2.47 (.309) .381 (.077) .915 (.097) Model 3 SIR 50, 200 3.62 (.242) .096 (.061) .511 (.159) LSIR 50, 200 2.54 (.364) .366 (.091) .911 (.105) GEV 50, 200 2.59 (.416) .353 (.104) .890 (.169) Model 4 SIR 50, 200 3.71 (.250) .073 (.063) .431 (.180) LSIR 50, 200 2.69 (.489) .329 (.122) .874 (.193) GEV 50, 200 3.17(.389) .207 (.097) .727 (.203) Model 1 SIR 50, 200 3.83 (.117) .043 (.029) .309 (.097) LSIR 50, 200 3.44 (.249) .141 (.062) .666 (.176) Elliptical-X, Gaussian-Error GEV 50, 200 2.17 (.072) .456 (.018) .955 (.006) Model 2 SIR 50, 200 2.75 (.148) .312 (.037) .887 (.008) LSIR 50, 200 2.19 (.045) .451 (.011) .966 (.002) GEV 50, 200 2.58 (.301) .355 (.075) .881 (.065) Model 3 SIR 50, 200 3.54 (.302) .114 (.075) .621 (.170) LSIR 50, 200 2.63 (.419) .341 (.105) .888 (.117) GEV 50, 200 2.96 (.341) .261 (.085) .742 (.103) Model 4 SIR 50, 200 3.68 (.248) .080 (.062) .463 (.192) LSIR 50, 200 3.06 (.429) .234 (.107) .741 (.136) GEV 50, 200 3.27 (.368) .183 (.092) .670 (.188) Model 1 SIR 50, 200 3.84 (.112) .041 (.028) .287 (.092) LSIR 50, 200 3.51 (.275) .121 (.069) .621 (.175) GEV 50, 200 2.19 (.070) .454 (.017) .953 (.006) Elliptical-X, Elliptical-Error Model 2 SIR 50, 200 2.81 (.187) .296 (.047) .878 (.011) LSIR 50, 200 2.20 (.071) .450 (.018) .964 (.003) GEV 50, 200 2.66 (.268) .334 (.067) .860 (.062) Model 3 SIR 50, 200 3.61 (.237) .097 (.059) .553 (.161) LSIR 50, 200 2.73 (.334) .317 (.084) .871 (.078) GEV 50, 200 3.00 (.337) .249 (.084) .725 (.088) Model 4 SIR 50, 200 3.70 (.198) .076 (.049) .447 (.172) LSIR 50, 200 3.12 (.434) .218 (.108) .712 (.127) 60 Table 4.5: Summary of estimation accuracy for Huber loss estimation in high dimensions with high noise. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. 
Sample and Error Type Model Method (d, n) FD TC CCA GEV 1000, 500 3.55 (.100) .112 (.025) .282 (.113) Model 1 LSIR 1000, 500 3.65 (.116) .087 (.029) .464 (129) Gaussian-X, Gaussian-Error GEV 1000, 500 2.12 (.042) .470 (.010) .670 (.157) Model 2 LSIR 1000, 500 2.22 (.079) .445 (.020) .448 (.152) GEV 1000, 500 2.68 (.160) .329 (.040) .647 (.110) Model 3 LSIR 1000, 500 2.92 (.192) .271 (.048) .389 (.130) GEV 1000, 500 3.13 (.215) .216 (.054) .578 (.102) Model 4 LSIR 1000, 500 3.31 (.177) .173 (.044) .320 (.130) GEV 1000, 500 3.67 (.107) .082 (.027) .384 (.108) Model 1 LSIR 1000, 500 3.80 (.112) .050 (028) .266 (.114) Gaussian-X, Elliptical-Error GEV 1000, 500 2.13 (.044) .467 (.011) .665 (.159) Model 2 LSIR 1000, 500 2.23 (.080) .441 (.022) .472 (.160) GEV 1000, 500 2.79 (.167) .302 (.042) .614 (.105) Model 3 LSIR 1000, 500 3.03 (.191) .234 (.048) .377 (.125) GEV 1000, 500 3.17 (.208) .206 (.052) .586 (.097) Model 4 LSIR 1000, 500 3.32 (.188) .170 (.047) .327 (.137) GEV 1000, 500 3.52 (.151) .121 (.038) .519 (.106) Model 1 LSIR 1000, 500 3.61 (.134) .098 (.034) .337 (.125) Elliptical-X, Gaussian-Error GEV 1000, 500 2.66 (.240) .334 (.060) .572 (.123) Model 2 LSIR 1000, 500 3.10 (.249) .224 (.062) .385 (.139) GEV 1000, 500 3.40 (.188) .149 (.047) .476 (.099) Model 3 LSIR 1000, 500 3.59 (.164) .101 (.041) .334 (.115) GEV 1000, 500 3.84 (.083) .040 (.021) .228 (.075) Model 4 LSIR 1000, 500 3.94 (.064) .015 (.016) .236 (.118) GEV 1000, 500 3.68 (.116) .083 (.023) .420 (.085) Model 1 LSIR 1000, 500 3.76 (123) .059 (.031) .309 (.125) GEV 1000, 500 2.68 (.213) .330 (.053) .589 (.116) Model 2 LSIR 1000, 500 3.14 (.225) .216 (.064) .456 (.135) Elliptical-X, Elliptical-Error GEV 1000, 500 3.44 (.193) .138 (.048) .495 (.080) Model 3 LSIR 1000, 500 3.61 (.153) .097 (.038) .394 (.108) GEV 1000, 500 3.84 (.087) .039 (022) .228 (.076) Model 4 LSIR 1000, 500 3.94 (.070) .015 (.018) .230 (.123) 4.3 Linear Discriminant Analysis In this section, we investigated the performance of GEV method under high-dimensional Linear Discriminant Analysis (LDA) framework for both binary and multi-class classification problems by applying them to simulated data generated under two types of within group covariance matri- ces: block Toeplitz suggested by [WT11] and Sparse precision matrix as described above. For comparison, we also included the ℓ1 −penalized linear discriminant analysis (LDA-ℓ1) [WT11] us- 61 ing R package penalizedLDA and the direct approach for discriminant analysis [MZY12, ZMY18] implemented in R packages dsda and msda for binary or multi-class cases respectively. To serve as a benchmark, we also included the Oracle classifier derived from the population parameters Σw and Σb . For each simulation, we generate 100K samples with d = 500 features, where K is the number of classes. We consider simulation settings for binary and multi-class cases as follows: • Binary case: we set µ1 = 0 and µ2 j ∼ N(0.3, 0.5) for j ∈ {1, . . . , 20} and µ2 j = 0 otherwise. For the block Toeplitz, the block diagonal matrix, Σw , consists five equal size blocks with the (i, j)th element of each block equals to 0.7|i− j| . In term of the sparse precision matrix, we simulated the K-nearest-neighbor networks as describe above. Both the covariance struc- tures were used to mimic the biological gene networks with sparse conditional dependency structure [WT11, XLV16]. We then simulate xi ∼ N(µk , Σw ) for i ∈ Ck . • Multi-class case: we consider K = 3 classes and set µ1 = 0, µ2 j = 0 ∼ N(0.3, 0.5) for j ∈ {1, . . . 
In Table 4.6, we report the prediction accuracy based on 100 replicates.

Table 4.6: Summary statistics reporting performance of the GEV, LDA-ℓ1, Direct and Oracle methods. We report the mean number of classification errors with its standard deviation in parentheses. The results are based on 100 replications.

Case         Σw Type         LDA-ℓ1          Direct          GEV             Oracle
Binary       Toeplitz Error  75.12 (23.36)   31.96 (16.61)   30.58 (17.04)   14.54 (9.77)
             NN Error        60.12 (8.53)    54.88 (13.73)   52.70 (11.71)   23.08 (4.64)
Multi-class  Toeplitz Error  155.88 (24.26)  70.14 (31.54)   60.02 (28.30)   27.34 (14.09)
             NN Error        103.1 (7.60)    83.3 (7.90)     40.68 (24.46)   35.06 (17.04)

4.4 Canonical Correlation Analysis

In this section, we assess the performance of GEV under the sparse CCA framework by applying it to several simulated datasets. In all settings, we let X and Y have the same dimension (p = q = d) and set Σx = Σy = Σ. Following the formulation in [CGRZ13], we model Cov(X, Y) = Σxy as

Σxy = Σx U Λ V′ Σy,   (4.2)

where U = (U1, U2) and V = (V1, V2) are d × 2 matrices, and Λ is a 2 × 2 diagonal matrix with λ1 = 1 and λ2 = 0.7. We set the nonzero rows of U1, U2, V1 and V2 to {1, 2, . . . , 6}, {7, . . . , 12}, {d − 5, . . . , d} and {d − 11, . . . , d − 6}, respectively. The values in the nonzero rows are sampled uniformly from (−1, −0.5) ∪ (0.5, 1) and then normalized with respect to Σ so that U′ΣU = I and V′ΣV = I. To capture different dependency structures, we consider the following three settings; a sketch of this data-generating process follows the list.

• Identity: Σ = Id.
• Toeplitz: Σ = (σij) with σij = 0.3^{|i−j|} for all i, j ∈ [d].
• Sparse precision matrix: Ω = Σ^{−1} is sparse. Specifically, we generate the sparse precision matrix through the nearest-neighbour networks algorithm of [LG06] with the number of neighbors set to m = 5.
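For concreteness, here is a minimal NumPy sketch of the covariance construction in (4.2) under the Toeplitz setting (our own illustration; function names are ours). The Σ-normalization is implemented with a symmetric inverse square root, which is one standard way to enforce U′ΣU = I.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 300

# Toeplitz setting: sigma_ij = 0.3^|i-j|
Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))

def sparse_loadings(rows1, rows2, d, rng):
    """d x 2 matrix whose nonzero rows are drawn uniformly from (-1, -0.5) U (0.5, 1)."""
    M = np.zeros((d, 2))
    for col, rows in enumerate((rows1, rows2)):
        M[rows, col] = rng.uniform(0.5, 1.0, len(rows)) * rng.choice([-1.0, 1.0], len(rows))
    return M

def normalize_wrt(M, Sigma):
    """Rescale M so that M' Sigma M = I (multiply by the inverse square root of M' Sigma M)."""
    G = M.T @ Sigma @ M
    w, Q = np.linalg.eigh(G)
    return M @ (Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T)

U = normalize_wrt(sparse_loadings(np.arange(0, 6), np.arange(6, 12), d, rng), Sigma)
V = normalize_wrt(sparse_loadings(np.arange(d - 6, d), np.arange(d - 12, d - 6), d, rng), Sigma)
Lam = np.diag([1.0, 0.7])

Sigma_xy = Sigma @ U @ Lam @ V.T @ Sigma                    # equation (4.2)
Joint = np.block([[Sigma, Sigma_xy], [Sigma_xy.T, Sigma]])  # joint covariance of (X, Y)
# lambda_1 = 1 puts the joint covariance on the PSD boundary (singular), so skip the check.
XY = rng.multivariate_normal(np.zeros(2 * d), Joint, size=n, check_valid="ignore")
X, Y = XY[:, :d], XY[:, d:]
```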
We compare the performance of our method with the Penalized Multivariate Analysis (PMA) method proposed by [WTH09] by examining the Frobenius norm distance (FD) between the true and estimated subspaces. The sparsity tuning parameters in PMA are chosen using permutation, as suggested by the R package PMA, while the tuning parameter λ in GEV is selected by cross-validation. Results of the simulations are reported in Table 4.7; summary statistics are based on 100 replicate trials under each of the six conditions. In general, the GEV method yields a smaller Frobenius norm distance for both U and V. Under all settings, the GEV method outperforms the PMA method, with greater improvements as the dependency structures become more complicated.

Table 4.7: Summary statistics reporting performance of the GEV and PMA methods. We report the means of the FD with its standard deviation in parentheses. The results are based on 100 replications.

(p, q, n)         Σ         U-PMA          V-PMA          U-GEV          V-GEV
(50, 50, 300)     Identity  0.589 (0.085)  0.543 (0.126)  0.515 (0.120)  0.496 (0.116)
                  Toeplitz  0.839 (0.089)  0.833 (0.080)  0.623 (0.106)  0.607 (0.099)
                  NN        1.365 (0.132)  1.353 (0.141)  0.774 (0.137)  0.778 (0.171)
(500, 500, 2000)  Identity  0.255 (0.076)  0.399 (0.043)  0.208 (0.074)  0.297 (0.071)
                  Toeplitz  0.696 (0.030)  0.745 (0.023)  0.426 (0.059)  0.424 (0.060)
                  NN        1.031 (0.152)  1.031 (0.076)  0.433 (0.079)  0.420 (0.090)

4.5 Application to Tumor-Infiltrating Lymphocytes Data

To demonstrate our approach's potential utility, we apply the GEV-SIR algorithm to the Tumor-Infiltrating Lymphocytes (TILs) data inferred from The Cancer Genome Atlas (TCGA) Ovarian serous cystadenocarcinoma (Ovarian cancer) and Lung Squamous Cell Carcinoma (Lung cancer) cohorts using CIBERSORT [NLG+15]. Compelling clinical evidence suggests that the presence of effector immune cells, such as CD8+ T cells and plasma cells, is positively associated with superior survival in patients with ovarian cancer. Notably, an inflamed tumor microenvironment, which is characterized by the infiltration of CD8+ T cells, also attracts plasma cells. A higher percentage of plasma cell infiltration is significantly correlated with the highest levels of CD8+, CD4+, and CD20+ TILs, and with superior clinical outcomes in patients with ovarian cancer [SJ15]. Indeed, a pan-cancer analysis also identified plasma cells as a novel prognostic factor for superior survival [WN18]. However, the mechanism of plasma cell homing to the tumor bed remains unclear. Identifying oncogenic signaling pathways that shape the plasticity of plasma cell recruitment and differentiation holds promise for better classifying patients based on their immune-editing profiles.

We extract the expression of the 2,000 genes with the largest variance among samples. We first determine the number of dimensions using an eigendecomposition. As shown in the left panel of Figure 4.2, GEV-SIR favors d = 1 because of the large gap between the first and second largest eigenvalues. The tuning parameter λ is then selected via the cross-validation procedure. The middle panel of Figure 4.2 shows a strong monotonic relationship between the GEV-SIR score and the plasma cell abundances. To validate the predictive ability of the derived GEV-SIR score, we use the TCGA Lung cancer data as a test set. Specifically, we select the same 2,000 genes as in the Ovarian cancer data and project them onto the estimated Ovarian cancer GEV-SIR direction. The right panel of Figure 4.2 demonstrates a similar monotonically decreasing pattern between the GEV-SIR direction and plasma cell recruitment in the Lung cancers.

Figure 4.2: Relationship between plasma cells and GEV-SIR direction. The left panel shows the distribution of eigenvalues of Ω̂. The scatter plots in the middle and right panels show the relationship between the tumor-infiltrated plasma cells and the GEV-SIR direction.

To better understand the biological significance of the reduced dimensions, we performed a GO pathway enrichment analysis using GSEA [STM+05]. In the reduced dimension, gene clusters that regulate immune cell differentiation (p value < 0.001), effector function such as enzyme activity (p value < 0.001), regulation of apoptosis (p value < 0.05), and chemotaxis signaling (p value < 0.05) are significantly enriched.
Among the strongest pathways that are positively associated with plasma cell recruitment in the second GEV-SIR direction are the defense response (p value < 0.001) and the type I interferon (IFN-I) network (p value < 0.05). The defense response pathway is informed by immune detection of danger-associated molecular patterns (DAMPs) and pathogen-associated molecular patterns (PAMPs). In tumor immune detection, DAMPs released by damaged cancer cells, such as DNA, can alert immune cells and promote an "inflamed" tumor microenvironment [GGK10], which is amenable to plasma cell recruitment. IFN-I signatures have emerged as a central signaling pathway that facilitates anti-tumor immunity [ID14]. IFN-I functions by binding to its receptor on target cells and launching a large transcriptional program of interferon-stimulated genes (ISGs), among which are chemokines such as CXCL9, CXCL10, and CXCL12. CXCL9 and CXCL10 are essential mediators of effector immune cell chemotaxis [SCR14]. Downregulation of these chemokines severely compromises anti-tumor immunity. Importantly, CXCL12 potently promotes plasma cell recruitment [DPN14].

4.6 Application to Single-Cell RNAseq Data

To demonstrate the ability of GEV-SIR to handle noisy data, we next use GEV-SIR to analyze a dataset of human embryonic stem cells grown over a 27-day time course from [MvDW+19]. The [MvDW+19] dataset comprises expression measurements of 33,694 genes over 31,000 cells obtained with single-cell RNAseq (scRNAseq) technology, where cells were sampled at the following differentiation time intervals: (Day 0-3), (Day 6-9), (Day 12-15), (Day 18-21), and (Day 24-27). Unlike the measurements from bulk tissue as in the tumor-infiltrating lymphocytes data, the scRNAseq data suffer from a high noise level, contamination with outliers, and a large proportion of missing values due to the limited initial mRNA in each cell. Taking the developmental time as the response and gene expression as the predictors, we aim to reveal the driving factors/pathways of the differentiation process of embryonic cells. Following the same preprocessing procedure as in [MvDW+19], we applied our GEV-SIR method to the normalized scRNAseq data and show the two-dimensional embedding in Figure 4.3. Our analysis successfully captured the smooth transition of the embryonic differentiation process, with the GEV-SIR1 direction capturing the differences among all developmental intervals and the GEV-SIR2 direction mainly reflecting the differences between the last two intervals and the first three.

Figure 4.3: GEV-SIR analysis of embryoid body scRNAseq data. Cells are plotted by their GEV_SIR1 and GEV_SIR2 coordinates and colored by developmental time.

CHAPTER 5
GRAPHICAL NEURAL NETWORKS FOR MULTI-MODAL DATA INTEGRATION

With the emergence of joint platforms for single-cell sequencing, the data produced by methods like scRNA-seq and scATAC-seq can be combined for multi-modality cell sequencing, attributing both mRNA and DNA data to each cell. This new kind of data poses unique challenges in how best to incorporate the large amount of multi-modal information for data analysis purposes. Both data streams are very high-dimensional, with sample sizes often of the same order as, or smaller than, the number of features, placing the data analysis problem in the HDLSS scenario. Furthermore, the sequencing data are notoriously sparse: the vast majority of features may be zero in a typical dataset due to technical dropout [SNL+17], an issue that can be even more pronounced in multi-modal data [LHH20].
One main goal for multi-modal data is achieving data integration, that is, combining the heterogeneous data so that downstream tasks can make better use of them. One way to achieve this is by performing a joint embedding of the features of the two modalities in a shared low-dimensional space. Such an embedding ideally captures a meaningful representation of the complex cellular states from the different types of measurements.

Deep learning techniques have recently been used to solve the task of multi-modal data integration for single-cell data with great success [GZP21, AAB+20]. However, most current methods fail to take into account high-order interactions among cells or between modalities, and instead treat each cell as a separate input. This higher-order information is essential: it gives structure to the data that allows for learning a proper low-dimensional representation of high-dimensional and sparse cell features. Graph neural networks (GNNs) [LDJ+21, KW17] provide unique tools for capturing the higher-order information required for data integration. GNNs aggregate information from neighborhoods to update node embeddings iteratively, which allows high-order structural information to be encoded through multiple aggregation layers. In addition, GNNs smooth the features by aggregating neighbors' embeddings, which provides an extra denoising mechanism [MLZ+21]. Hence, by modeling the interactions between cells and their features as a graph, we can adopt GNNs to exploit the structural information.

We implement a GNN framework for multimodal data integration, designed in concert with [WDJ+22], called scJEGNN (single-cell Joint Embedding GNN). We apply this model to the benchmark datasets provided by NeurIPS 2021 [LBC+21] for a multimodal single-cell data integration competition and compare its performance with competitor submissions.

5.1 Problem Statement

The two modalities we operate on are GEX, the mRNA data, and ATAC, the DNA data. Each modality is represented as a matrix Xi ∈ R^{n×di}, i = 1, 2, where n is the number of cells and di denotes the feature dimension for each cell. In our application, the GEX modality has dimension d1 = 13,431, the ATAC modality has dimension d2 = 116,490, and the total sample size is n = 42,492. The data are also highly sparse; only 9.75% of GEX and 2.9% of ATAC features are nonzero on average. The data come with expert annotations giving a cell type label for each cell, with a total of 22 different cell type classes, and two real-valued scores indicating cell-cycle developmental stage. In addition, the two modalities are sequenced in a total of 10 batches, introducing the possibility of large batch effects.

The goal, then, is to learn an embedded representation of the cells in R^{d3} that best leverages the underlying information of the two modalities in order to preserve cell information and remove spurious batch effects from the representations. The evaluation of how well the embedding represents pertinent biological information is carried out by a collection of metrics M : R^{n×d3} → R^k, M(X) = (m1(X), . . . , mk(X)), where the k metrics mi : R^{n×d3} → R, i ∈ [k], are defined so that higher values indicate better performance of the embedding. The problem can be formally defined as follows: given modality X1 ∈ R^{n×d1} and modality X2 ∈ R^{n×d2}, learn three mapping functions fθ1, fθ2 and fθ3, parameterized by θ1, θ2 and θ3, to obtain a representation H ∈ R^{n×d3},

H = fθ3(CONCAT(fθ1(X1), fθ2(X2))),   (5.1)

that best maximizes the coordinates of M(H). Here fθ1(X1) ∈ R^{n×d1′} and fθ2(X2) ∈ R^{n×d2′} correspond to new representations learned from modalities X1 and X2, and CONCAT : R^{n×d1′} × R^{n×d2′} → R^{n×(d1′+d2′)} is the function that concatenates the two representations row by row, joining each cell's two feature vectors into one.
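A shape-level sketch of (5.1) follows. The mappings here are random linear maps used only to make the composition and dimensions explicit; in scJEGNN the actual fθ's are learned (LSI, GNN, and autoencoder, described below), and toy sizes stand in for the real data dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2 = 200, 500, 800        # toy sizes (the real data have n = 42,492, d1 = 13,431, d2 = 116,490)
d1p, d2p, d3 = 100, 65, 39       # reduced and embedding dimensions used later in this chapter

# Stand-in mappings: random linear maps only illustrate the shapes in equation (5.1).
f1 = lambda X: X @ rng.normal(size=(d1, d1p))
f2 = lambda X: X @ rng.normal(size=(d2, d2p))
f3 = lambda Z: Z @ rng.normal(size=(d1p + d2p, d3))

X1, X2 = rng.random((n, d1)), rng.random((n, d2))    # GEX and ATAC placeholders
H = f3(np.concatenate([f1(X1), f2(X2)], axis=1))     # H in R^{n x d3}
print(H.shape)                                       # (200, 39)
```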
For our application, we have k = 3 metrics measuring the performance of the embedding H. The three measurements are given by a cell-type conservation metric, a cell-cycle conservation metric, and a batch removal metric:

• NMI cluster/label: The normalized mutual information (NMI) [MGH11] compares the overlap of two clusterings. The NMI is applied to the integrated data to compare the cell type labels with an automated clustering (based on Louvain clustering). NMI scores of 0 and 1 correspond to uncorrelated clusterings and a perfect match, respectively. Automated Louvain clustering is performed at resolutions ranging from 0.1 to 2 in steps of 0.1, and the clustering output with the highest NMI with the label set is used.

• Cell-cycle conservation: The cell-cycle conservation score evaluates the amount of variance explained by the cell-cycle per batch prior to integration versus the amount of variance after integration. The relative differences of var_before_i and var_after_i per batch i are aggregated into a final score between 0 and 1 via

CC_conservation = (1/B) ∑_{i=1}^{B} ( 1 − |var_before_i − var_after_i| / var_before_i ),

where B is the number of batches. Values near 0 indicate little conservation of the variance explained by the cell-cycle, while values near 1 indicate nearly perfect conservation.

• Batch ASW: The average silhouette width (ASW) is used to quantify batch mixing by taking into account the incompatibility of batch labels per cell type cluster. The batch ASW considers the absolute silhouette width s(i), computed on batch labels, for each cell i. Here, an absolute silhouette width of 0 indicates that batches are thoroughly mixed, while any deviation from 0 indicates the presence of a batch effect. The metric rescales this score so that higher values imply better batch mixing, using the equation below for each cell type label j:

batchASW_j = 1 − (1/|C_j|) ∑_{i∈C_j} |s(i)|,

where C_j is the set of cells with cell label j and |C_j| denotes the number of cells in that set. To obtain the final batchASW score, the label-specific batchASW_j scores are averaged:

batchASW = (1/C) ∑_{j=1}^{C} batchASW_j,

where C is the number of unique cell labels. A batchASW value of 1 indicates optimal batch mixing, and a value of 0 indicates fully separated batch clusters.
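The last two metrics reduce to short formulas once their inputs are available, so a minimal re-implementation (our own sketch, not the official competition evaluation code) is given below. It assumes the per-batch variances explained by the cell-cycle scores and the per-cell silhouette widths on batch labels have already been computed elsewhere (e.g., the latter via sklearn.metrics.silhouette_samples); the NMI metric additionally requires Louvain clustering and is omitted.

```python
import numpy as np

def cell_cycle_conservation(var_before, var_after):
    """CC_conservation: mean over batches of 1 - |var_before - var_after| / var_before.
    Clipping to [0, 1] keeps the score in the stated range."""
    var_before, var_after = np.asarray(var_before, float), np.asarray(var_after, float)
    scores = 1.0 - np.abs(var_before - var_after) / var_before
    return float(np.clip(scores, 0.0, 1.0).mean())

def batch_asw(silhouette, cell_labels):
    """Batch ASW: silhouette holds s(i) computed on *batch* labels in the embedding,
    cell_labels holds each cell's cell type label."""
    s = np.abs(np.asarray(silhouette, float))
    cell_labels = np.asarray(cell_labels)
    per_label = [1.0 - s[cell_labels == lab].mean() for lab in np.unique(cell_labels)]
    return float(np.mean(per_label))
```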
5.2 Method

In this section, we introduce the scJEGNN framework for multimodal data integration. An illustration of the framework is given in Figures 5.1-5.3. Specifically, our framework can be divided into four stages: data preprocessing, graph construction, cell-feature graph convolution, and an autoencoder architecture for the final embedding.

5.2.1 Data Preprocessing

Both modalities, X1 for GEX and X2 for ATAC, go through standard preprocessing steps regularly used in single-cell sequencing tasks; the sequence of operations below describes both fθ1 and fθ2. First, the matrices are ℓ1-normalized, meaning that each row vector (cell) is divided by the sum of the absolute values of all its features, which normalizes the weight of each cell's total expression output. Then the data are log-transformed, so that each normalized value X̄ij is updated to log(X̄ij · 10^4 + 1). These values are then divided by the standard deviation of each column, which normalizes the variation of each feature. Lastly, both modalities go through an initial dimension reduction using latent semantic indexing (LSI), a transformation based on the singular value decomposition (SVD). For some choice of k, the transformation keeps the top k singular vectors and projects each cell onto them, so that each of the k coordinates is the inner product with one of these vectors. We choose different values of k for the two modalities, giving d1′ for X1 and d2′ for X2. In our experiments, we found that choosing d1′ = 100 for GEX and d2′ = 65 for ATAC gave the best performance. The two reduced matrices are then concatenated, giving an output X̂ ∈ R^{n×(d1′+d2′)}. We simplify notation and write d′ = d1′ + d2′ for the combined feature dimension.

5.2.2 Graph Construction

Given the preprocessed X̂, we construct a graph to which the GNN can be applied. We construct a cell-feature bipartite graph, depicted in Figure 5.1, where the cells and their biological features are treated as different nodes, giving us a collection of cell nodes and a collection of feature nodes. The edges are designed to run strictly between the two collections, so that an edge connecting a cell node i to a feature node j directly represents the value of that cell's feature, given by X̂ij. This requires a weighted-edge graph instead of the usual 0-1 adjacency matrix. As we will see, given a proper choice of node embedding values, this graph is able to propagate information from cells to pertinent feature nodes, and likewise feature nodes can propagate their information to the cell nodes that express them highly.

We denote the bipartite graph as G = (U, V, E). Here U is the set of nodes representing the n cells {u1, . . . , un}, and V is the set of nodes representing the d′ features {v1, . . . , vd′}, with one node for each feature dimension of the input data. The set E ⊆ U × V gives the edges between the nodes of U and V, which describe the relations between the cells and the features. The graph can be represented by the weighted adjacency matrix

A = [[0, X̂], [X̂⊤, 0]] ∈ R^{(n+d′)×(n+d′)},

where 0 denotes a matrix of all zeros and X̂ ∈ R^{n×d′} is the input feature matrix of cells. A is designed to give the structure of a bipartite graph, since nodes of the cell set or the feature set only have possible edges to nodes of the other set, not within the same set. The initial embeddings of the feature and cell nodes are given by matrices V and U, respectively, where each row is the embedding vector of one node. The feature nodes {v1, . . . , vd′} ⊂ R^{d′} are initialized as one-hot vectors, so that each vi has a one in index i and zeros elsewhere, making V ∈ R^{d′×d′} an identity matrix. The cell nodes are initialized as zero vectors of the same dimension, so that U ∈ R^{n×d′} has all zero entries. The lack of prior information on the cells leads to this uniform initialization for the cell nodes, and the one-hot embedding works well with the chosen graph convolution, which recreates the gene expression and propagates it into the cell node embeddings. Our cell-feature graph can then be denoted G = (A, V, U).

Figure 5.1: scJEGNN graph construction process. The input data determine the values of the weighted edges between the cell nodes and the feature nodes; values of zero indicate no edge.
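The following dense NumPy sketch illustrates the preprocessing of Section 5.2.1 and the graph construction of Section 5.2.2 on toy data; it is our own illustration under the assumptions stated in the comments, not the scJEGNN implementation, which operates on sparse matrices at the real data sizes.

```python
import numpy as np

def preprocess(X, k, eps=1e-8):
    """l1-normalize each cell, log-transform, scale each feature, then LSI-reduce to k dims."""
    X = X / (np.abs(X).sum(axis=1, keepdims=True) + eps)   # l1 normalization per cell
    X = np.log(X * 1e4 + 1.0)                              # log transform
    X = X / (X.std(axis=0, keepdims=True) + eps)           # per-feature scaling
    _, _, Vt = np.linalg.svd(X, full_matrices=False)       # LSI via a truncated SVD
    return X @ Vt[:k].T                                    # project each cell to k coordinates

rng = np.random.default_rng(0)
n = 200                                                    # toy size; the real data have n = 42,492
X1 = rng.poisson(0.3, size=(n, 500)).astype(float)         # stand-in GEX counts
X2 = rng.poisson(0.1, size=(n, 800)).astype(float)         # stand-in ATAC counts

X_hat = np.concatenate([preprocess(X1, 100), preprocess(X2, 65)], axis=1)
n, d = X_hat.shape                                         # d = d1' + d2' = 165

# Cell-feature bipartite graph of Section 5.2.2: weighted adjacency, one-hot feature
# nodes, all-zero cell nodes.
A = np.block([[np.zeros((n, n)), X_hat],
              [X_hat.T, np.zeros((d, d))]])
V0 = np.eye(d)            # initial feature-node embeddings (identity matrix)
U0 = np.zeros((n, d))     # initial cell-node embeddings
```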
5.2.3 Graph Convolution

Given the constructed cell-feature graph, we wish to choose a graph convolution that captures higher-order structural information from the links between nodes in order to create better cell node representations. In each layer of a GNN, the embedding of a node is updated according to the values propagated from its neighbors, each given by the edge weight times the neighbor's embedding. In the field of GNNs, this type of information propagation between neighbors is called "message passing" [GSR+17]. While the two node types could in principle use different message passing schemes, we use the same updates for both. To illustrate our method, we give the notation for the updates applied to nodes in U; the updates for nodes in V are completely analogous. Let H^l = {h_1^l, . . . , h_n^l}, with h_i^l ∈ R^{d′}, be the input node embeddings at the l-th layer, where h_i^l corresponds to node ui. Then the output embedding of the l-th layer can be expressed as

h_i^{l+1} = Update(h_i^l, Agg({h_j^l | j ∈ Ni})),

where Ni is the set of first-order neighbors of node ui, Agg(·) is an aggregation function applied to the neighbor nodes' embeddings, and Update(·) is an update function that generates a new node embedding from the previous one and the aggregation output.

While there are many choices of aggregation and update functions in GNNs, we choose the common and simple graph convolutional network (GCN) [KW17] model for these layer updates. In general, the GCN creates a message m_{i,l} for node i at layer l as follows:

m_{i,l} = σ( b^l + ∑_{j∈Ni} (e_{ji} / c_{ji}) h_j^l W^l ),

where j varies over the neighbors of ui in V, e_{ji} denotes the edge weight between ui and vj, W^l and b^l are trainable parameters, σ(·) is an activation function, and c_{ji} is a normalization term defined as

c_{ji} = sqrt( ∑_{k∈Nj} e_{jk} ) · sqrt( ∑_{k∈Ni} e_{ki} ).

After generating the messages from the neighborhoods, we update the embeddings of the nodes in U as

h_i^{l+1} = h_i^l + m_{i,l}.

This simple residual mechanism adds the previous embedding to the new message, enhancing self-information by combining the node embedding with its aggregated neighborhood information.

We choose to decouple the propagation and transformation of the node embeddings. This means that we set W^l = I_{d′}, the identity matrix, and b^l = 0 for all layers l; the activation function σ is set to the identity as well. The choice to remove the learnable parameters and activation function may seem like a strong limitation on the transformation, but recent work [WSZ+19, HDW+20] has shown that if later transformations occur (as in our model), the performance of the model is often improved by the use of simplified GCN layers, and the computational efficiency is greatly increased. [HDW+20] in particular found that when later transformations follow simplified GCN layers, the model produces representations with the same or better performance than it would with the earlier trainable parameters. This choice also means that the hidden layers keep the same dimension throughout, so the final output is H_U^L ∈ R^{n×d′}, where H_U^L is the hidden-layer representation of the cell nodes in the last layer L. In our application we found L = 3 to give the best performance. We take advantage of this consistent hidden-layer size by completing the GNN output with a weighted summation of all the hidden layers, giving

Ĥ = ∑_{i=1}^{L} wi · H_U^i.

Our GCN model is illustrated in Figure 5.2, with the summation Ĥ of the GCN layers used as input for the final autoencoder.
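A minimal NumPy sketch of the simplified propagation just described is given below (our own dense illustration; the layer weights wi are trainable in the model but fixed constants here, and the real implementation uses sparse matrices).

```python
import numpy as np

def simplified_gcn(A, H0, num_layers=3, layer_weights=(1.0, 1.0, 1.0)):
    """Simplified GCN propagation with W^l = I, b^l = 0, sigma = identity, and the residual
    update h^{l+1} = h^l + m_{i,l}; returns the weighted sum of the hidden layers."""
    deg = A.sum(axis=1)                                      # weighted degrees sum_k e_ik
    deg[deg == 0] = 1.0                                      # guard against isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # entries e_ji / c_ji

    H, layers = H0, []
    for _ in range(num_layers):
        H = H + A_norm @ H                                   # residual message-passing update
        layers.append(H)
    return sum(w * Hl for w, Hl in zip(layer_weights, layers))

# Tiny self-contained example in the notation of Section 5.2.2.
rng = np.random.default_rng(0)
n, d = 6, 4
X_hat = rng.random((n, d))
A = np.block([[np.zeros((n, n)), X_hat], [X_hat.T, np.zeros((d, d))]])
H0 = np.vstack([np.zeros((n, d)), np.eye(d)])                # zero cell nodes, one-hot feature nodes
H_hat = simplified_gcn(A, H0)[:n]                            # cell-node output fed to the autoencoder
print(H_hat.shape)                                           # (6, 4)
```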
It is worth noting that the simplified GCN computation and the choice of node embeddings lead to a recreation of each cell's original representation in the first two updates. If a cell i has features given by the i-th row of X̂, then the cell node update equals u_i = ∑_{k=1}^{d′} X̂_{ik} e_k, where e_k is the k-th unit vector with a 1 in coordinate k and 0's elsewhere. This value is also the output of the second layer for the cell nodes, because the feature nodes receive no update in the first layer from the all-zero initial cell node embeddings. After the second layer, the cell nodes update in a novel manner, weighing messages from features that are expressed at a greater value in that cell more highly. This leads to increasingly similar embeddings for cells that have high coexpression of features.

Figure 5.2: scJEGNN graph convolution. Multiple convolution layers propagate information along the weighted edges to update the cell and feature nodes.

5.2.4 Autoencoder

In order to train the final cell embeddings, we use an autoencoder model, presented in Figure 5.3, to achieve the desired joint embedding of the data. The autoencoder consists of an encoder E and a decoder D, both modeled as fully connected perceptrons. The encoder E takes in Ĥ, and each layer except the last is computed as

Ĥ^{l+1} = DO(BN(σ(Ĥ^l W^l)), p),

where Ĥ^l is the output of the l-th layer, W^l ∈ R^{dl×d(l+1)} is a trainable linear transformation, σ is an activation function, BN is the batch normalization function [IS15], and DO is a dropout function with parameter 0 < p < 1. Batch normalization operates by normalizing the empirical mean and variance of each batch and is a well-established technique in deep learning for improving training. The dropout function randomly zeroes some of the elements of the input matrix with probability p, using samples from a Bernoulli distribution, and also improves training by forcing information redundancy in the connections between layers. The last layer removes the batch normalization and dropout, leaving

Ĥ^L = σ(Ĥ^{L−1} W^L).

We choose L = 4 and p = 0.2, and reduce the dimension of the input iteratively with d1 = 150, d2 = 120, d3 = 100, down to the final embedding dimension of 39. For all layers, σ is chosen to be the ReLU function, ReLU(x) = max(0, x). The decoder is a simple two-layer transformation of the joint embedding back to the original dimension d′, given by

D(Ĥ^L) = σ(σ(Ĥ^L W1) W2),

where σ = ReLU, W1 ∈ R^{dL×d1}, and W2 ∈ R^{d1×d′}, with dL = 39 the embedding dimension.
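A PyTorch sketch of this encoder/decoder, following the dimensions stated above, is shown below. It is a minimal illustration of the described architecture, not the scJEGNN code itself, and the class name is ours.

```python
import torch
import torch.nn as nn

class JointEmbeddingAE(nn.Module):
    """Sketch of the autoencoder described above: four encoder layers
    (Linear -> ReLU -> BatchNorm -> Dropout, except the last) down to a
    39-dimensional embedding, and a two-layer decoder back to d_prime."""
    def __init__(self, d_prime, hidden=(150, 120, 100), d_embed=39, p=0.2):
        super().__init__()
        dims = (d_prime,) + hidden
        enc = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            enc += [nn.Linear(d_in, d_out), nn.ReLU(),
                    nn.BatchNorm1d(d_out), nn.Dropout(p)]
        enc += [nn.Linear(dims[-1], d_embed), nn.ReLU()]     # last layer: no BN / dropout
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(nn.Linear(d_embed, hidden[0]), nn.ReLU(),
                                     nn.Linear(hidden[0], d_prime), nn.ReLU())

    def forward(self, H_hat):
        H_L = self.encoder(H_hat)           # joint embedding, n x 39
        return H_L, self.decoder(H_L)       # embedding and reconstruction

model = JointEmbeddingAE(d_prime=165)
H_hat = torch.randn(64, 165)                # a batch of GNN outputs (toy values)
H_L, recon = model(H_hat)
print(H_L.shape, recon.shape)               # torch.Size([64, 39]) torch.Size([64, 165])
```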
The goal of the autoencoder is to take in the GNN output Ĥ and learn a low-dimensional representation that properly captures the biological information we care about. In order to do so, the model is trained via latent feature regularization, which forces chosen latent features to predict the cell type and cell-cycle phase scores, and to blur the batch features in order to remove spurious batch effects. We combine these with the usual reconstruction loss used to train autoencoders to produce salient features in the encoder representation. Thus the autoencoder is trained with both supervised and self-supervised losses: three supervised losses are applied to the output of the encoder, which gives the final joint embedding, and one self-supervised loss is applied to the output of the decoder. The hidden layer size of Ĥ^L is specifically designed to accommodate enough features for each of these supervised losses. We allocate 22 features for the number of cell types, 10 features for the number of batches, 2 features for the cell-cycle scores, and 5 extra features to allow additional pertinent information to improve the reconstruction loss, giving a total of 39. The losses are then summed to give the total loss L. In detail, we have

L = Lrecon + Lcell type + Lbatch + Lcell-cycle.

The reconstruction loss is simply the mean squared error,

Lrecon(D(E(Ĥ))) = (1/n) ∑_{i=1}^{n} ‖X̂_i − D(E(Ĥ))_i‖²,

where X̂_i and D(E(Ĥ))_i denote the i-th rows of the input and the reconstruction. The cell type loss is a cross-entropy loss on the classification task: if Ĥ^L_{i,1:J} gives the weights corresponding to the J cell classes for input i, and Ci ∈ {1, . . . , J} denotes the class of cell i, then the loss is

Lcell type(Ĥ) = − ∑_{i=1}^{n} ∑_{j=1}^{J} 1_{Ci = cj} log(Ŷ_{ij}),

where 1_{Ci = cj} equals 1 if Ci = cj and 0 otherwise, and

Ŷ_{ij} = exp(Ĥ^L_{ij}) / ∑_{k=1}^{J} exp(Ĥ^L_{ik})

is the softmax function. The loss Lbatch for the batch effect is also a cross-entropy loss, but it attempts to remove batch effects by training the classifier to learn random batch labels. To do so, we use a uniform distribution to generate batch labels for the input and apply the same cross-entropy loss to the 10 batch features in Ĥ^L. The last term, Lcell-cycle, is a simple mean squared error loss comparing each cell's cell-cycle phase scores with the 2 allocated features in Ĥ^L that are supposed to capture these values.
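The combined loss can be sketched as follows, continuing the autoencoder sketch above. The slicing of the 39 latent features (cell-type logits first, then batch logits, then cell-cycle scores, then free features) is an assumption made for illustration; the function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def total_loss(X_hat, H_L, recon, cell_type, cc_score, n_types=22, n_batches=10):
    """Combined loss described above, with the latent features of H_L sliced as:
    [0:22] cell-type logits, [22:32] batch logits, [32:34] cell-cycle scores, [34:39] free.
    cell_type: integer class labels (LongTensor); cc_score: per-cell 2-dim phase scores."""
    l_recon = F.mse_loss(recon, X_hat)                                  # self-supervised
    l_type = F.cross_entropy(H_L[:, :n_types], cell_type)               # supervised cell type
    # batch-confusion loss: the classifier is trained toward *random* batch labels
    random_batches = torch.randint(0, n_batches, (H_L.shape[0],))
    l_batch = F.cross_entropy(H_L[:, n_types:n_types + n_batches], random_batches)
    l_cc = F.mse_loss(H_L[:, n_types + n_batches:n_types + n_batches + 2], cc_score)
    return l_recon + l_type + l_batch + l_cc

# Example usage with the JointEmbeddingAE sketch above:
# H_L, recon = model(H_hat)
# loss = total_loss(H_hat, H_L, recon, cell_type_labels, cell_cycle_scores)
# loss.backward()
```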
Figure 5.3: scJEGNN autoencoder architecture. Each layer is fully connected, and the encoder layers include dropout and batch normalization steps.

5.3 Experimental Results

Table 5.1: Performance on the joint embedding task.

Model          NMI Cluster/label  Cell-Cycle Conservation  Batch ASW  Average Metric
Baseline       .6502              .8259                    .7178      .7313
GLUE           .7754              .8355                    .9100      .8403
Amateur (JAE)  .7723              .9195                    .8898      .8610
scJEGNN        .8057              .9204                    .9112      .8791

We demonstrate the effectiveness of our framework scJEGNN in the joint embedding task for GEX and ATAC and show that the model outperforms the competitor submissions on the three evaluation metrics. Of the 25 teams that submitted models to the NeurIPS 2021 joint embedding task competition, we choose the top two performing teams and show their performance alongside our own. Team Amateur submitted JAE, the autoencoder from which we designed our own: it uses the same latent feature regularization but has additional residual connections and layers and no graphical component. GLUE is also an autoencoder model, guided by an external knowledge graph. We additionally provide a baseline obtained by simply evaluating a concatenation of the two modalities after dimension reduction by principal component analysis (PCA). In Table 5.1 we can see that our model outperforms the other models on every metric, improving the average metric by more than 0.1 over the baseline and by nearly 0.02 over the best competing submission.

CHAPTER 6
CONCLUSION

In this work we have developed two methods for finding low-dimensional representations of high-dimensional data. The first is a unified framework for generalized eigenvalue problems, the GEV estimator. This sparse projection regression framework is a reformulation of an intractable Rayleigh quotient problem and achieves great computational efficiency. We established nonasymptotic error bounds on the proposed estimators for the applications of SIR and LDA, and showed that these rates are minimax optimal. We showed how GEV applies to the CCA problem and adapted the algorithm to a robust Huber-loss based formulation. We tested our framework on both synthetic and real datasets and demonstrated the algorithm's superior performance compared with other state-of-the-art methods for high-dimensional data. The second method is scJEGNN, a graphical neural network tailored to data integration for HDLSS single-cell sequencing data. We showed that, with its unique model, the GNN is able to leverage structural information about the biological data relations in order to perform a joint embedding of multiple modalities of single-cell gene expression data.

6.1 Future Work

GEV. One obvious goal is to show that the same statistical consistency the GEV estimator enjoys for SIR and LDA also holds for the application to CCA. The adaptation of the GEV estimator to CCA required changing the structure of U from a collection of products of eigenvalues and eigenvectors of Ω = cov(E[x|y]) to a combination of the right and left singular vectors of Σxy = Ox D Oy⊤. Setting U = (Ox⊤, Oy⊤)⊤/√2 requires additional work to show that the difference ‖Û − U‖∞,∞ is bounded above by the desired rate of C√(log(d)/n). Similarly, the robust version of the GEV estimator using the Huber loss also offers the possibility of proving strong theoretical rates of convergence. This work requires a new derivation of the bound ‖∇L(W*)‖∞,∞ ≤ λ/2 due to the change in the gradient of the loss; the derivative of the Huber loss is more complicated than that of the Frobenius norm and leads to a piecewise-defined function with additional multiplications by Σ. Lastly, an extension of the GEV estimator to a nonlinear dimension reduction technique is also likely possible. An arbitrary manifold can be approximated locally by linear spaces, which can be estimated using K-nearest neighbors from the sample data. Given these connected affine spaces, we can apply GEV to each to obtain a collection of projections, which we can carefully combine to project the data to a lower-dimensional space. This likely imposes much more stringent requirements on the sample size in each portion of the linear approximation.

GNNs for single-cell tasks. The cell-gene graph of scJEGNN can be used in a number of tasks in single-cell data analysis. The representations gained for the cells after multiple convolution layers may lead to much better estimates for methods like K-nearest neighbors, which many imputation methods rely on. If naively applied, the KNN estimate on the initial, highly sparse data is likely to be unreliable; if the estimate can be improved by applying it to the updated cell representations, the downstream steps taken for imputation could be drastically improved. In addition, these representations could be used for other common node-based tasks like classification and clustering, which lead to cell annotation methods and biological clustering in the single-cell world, or for graph-based tasks, which would lead to methods for disease prediction given cell populations from distinct patients. The methods using this graphical model can be further enhanced with some key alterations to the graph structure. The bipartite graph structure can be extended to a fuller graph that has edges between gene nodes and edges between cell nodes. Gene node edges are important for representing gene pathways, which capture any number of gene causal relations, including gene co-expression or regulatory networks.
In spatial transcriptomics data [Rus16], single-cell sequencing data is given new geometric context so that each small cluster of cells is placed in a 2 or 3-dimensional grid. This spatial data can be included in the cell-gene graph with additional edges between cells (or group of cells) indicating adjacency relations. These collection of methods utilizing this GNN model seem promising for the variety of applications listed, and are being actively developed together into a full GNN-based 82 package for single cell data analysis methods. 83 BIBLIOGRAPHY 84 BIBLIOGRAPHY [AAB+ 20] Ricard Argelaguet, Damien Arnol, Danila Bredikhin, Yonatan Deloro, Britta Velten, John Marioni, and Oliver Stegle. Mofa+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome biology, 21(1):1–17, 2020. [AHB+ 16] Benedict Anchang, Tom Hart, Sean Bendall, Peng Qiu, Zach Bjornson, Michael Linderman, Garry Nolan, and Sylvia Plevritis. Visualization and cellular hierarchy inference of single-cell data using spade. Nature Protocols, 11(7):1264–1279, 2016. [AHMM18] Lun Aaron, Laleh Haghverdi, Michael Morgan, and John Marioni. Batch effects in single-cell rna-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology, 36(5):421–427, 2018. [APYG19] Cédric Arisdakessian, Olivier Poirion, Xun Yunits, Breck Zhu, and Lana Garmire. Deepimpute: an accurate, fast, and scalable deep neural network method to impute single-cell rna-seq data. Genome Biology, 20:1–14, 2019. [BHS+ 18] Andrew Butler, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5):411–420, 2018. [BPC+ 11] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Dis- tributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, pages 1–122, 2011. [BRT09] Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 37:1705–1732, 2009. [BT09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009. [CGRZ13] Mengjie Chen, Chao Gao, Zhao Ren, and Harrison H. Zhou. Sparse CCA via Pre- cision Adjusted Iterative Thresholding. arXiv:1311.6186 [math, stat], November 2013. arXiv: 1311.6186. [CHWE11] Line Clemmensen, Trevor Hastie, Daniela Witten, and Bjarne Ersbøll. Sparse dis- criminant analysis. Technometrics, 53(4):406–413, 2011. [CL98] Chun-Houh Chen and Ker-Chau Li. Can SIR be as popular as multiple linear re- gression? Statistica Sinica, 8:289–316, 1998. [CMW13] Tianwen Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and 85 adaptive estimation. The Annals of Statistics, 41(6):3074–3110, 2013. [Con18] Tabula Muris Consortium. Single-cell transcriptomics of 20 mouse organs creates a tabula muris. Nature, 562(7727):367–372, 2018. [Coo98] Ralph Dennis Cook. Regression graphics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, 1998. [CP11] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for con- vex problems with applications to imaging. Journal of Mathematical Imaging and Vision, (40):120–145, 2011. [DPN14] Yvonne Döring, Christian Pawig, Lukas Weber, and Heidi Noels. 
The cxcl12/cxcr4 chemokine ligand/receptor axis in cardiovascular disease. Frontiers in Physiology, 5, 2014. [ENK16] Anders Eklund, Thomas Nichols, and Hans Knutsson. Cluster failure: Why fmri inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences, (113):7900–7905, 2016. [FFT12] Jianqing Fan, Yang Feng, and Xin Tong. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4):745–771, September 2012. [FS81] Jerome Friedman and Werner Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817–823, 1981. [GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org. [GGK10] Sergei Grivennikov, Florian Greten, and Michael Karin. Immunity, inflammation, and cancer. Cell, 140(6):883–899, 2010. [GSR+ 17] Justin Gilmer, Samuel Schoenholz, Patrick Riley, Oriol Vinyals, and George Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML, 2017. [GZP21] Boying Gong, Yun Zhou, and Elizabeth Purdom. Cobolt: Joint analysis of multi- modal single-cell sequencing data. bioRxiv, 2021. [HBR+ 17] Adam Haber, Moshe Biton, Noga Rogel, Rebecca H Herbst, Karthik Shekhar, Christopher Smillie, Grace Burgin, Toni M Delorey, Michael R Howitt, Yarden Katz, Itay Tirosh, Semir Beyaz, Danielle Dionne, Mei Zhang, Raktima Raychowd- hury, Wendy Garrett, Orit Rozenblatt-Rosen, Hai Ning Shi, Omer Yilmaz, Ramnik J Xavier, and Aviv Regev. A single-cell survey of the small intestinal epithelium. 86 Nature, 551(7680):333–339, 2017. [HDW+ 20] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, YongDong Zhang, and Meng Wang. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation, page 639–648. Association for Computing Machinery, New York, NY, USA, 2020. [HMN05] Peter Hall, James Stephen Marron, and Amnon Neeman. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):427–444, 2005. [Hot33] Harold Hotelling. Analysis of a complex of statistical variables into principal com- ponents. Journal of Educational Psychology, 24(6):417, 1933. [HST11] David Hardoon and John Shawe-Taylor. Sparse canonical correlation analysis. Machine Learning, 83(3):331–353, 2011. [Hub73] Peter Huber. Robust Regression: Asymptotics, Conjectures and Monte Carlo. The Annals of Statistics, 1(5):799–821, 1973. [ID14] Lionel Ivashkiv and Laura Donlin. Regulation of type i interferon responses. Nature Reviews Immunology, 14(1):36–49, 2014. [IS15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 448–456. JMLR.org, 2015. [KW17] Thomas Kipf and Max Welling. Semi-supervised classification with graph convolu- tional networks. In ICLR, 2017. [LBC+ 21] Malte Luecken, Daniel Bernard Burkhardt, Robrecht Cannoodt, Christopher Lance, Aditi Agrawal, Hananeh Aliee, Ann Chen, Louise Deconinck, Angela Detweiler, Alejandro Granados, et al. A sandbox for prediction and integration of dna, rna, and proteins in single cells. In NeurIPS Datasets and Benchmarks Track (Round 2), 2021. 
[LDJ+ 21] Xiaorui Liu, Jiayuan Ding, Wei Jin, Han Xu, Yao Ma, Zitao Liu, and Jiliang Tang. Graph neural networks with adaptive residual. Advances in Neural Information Processing Systems, 34, 2021. [Len08] Chenlei Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417– 425, 2008. 87 [LG06] Hongzhe Li and Jiang Gui. Gradient directed regularization for sparse Gaussian con- centration graphs, with applications to inference of genetic networks. Biostatistics (Oxford, England), 7(2):302–317, April 2006. [LHH20] Jeongwoo Lee, Do Young Hyeon, and Daehee Hwang. Single-cell multiomics: technologies and data analysis methods. Experimental Molecular Medicine, 52(9):1428—-1442, 2020. [Li91] Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991. [Li07] Lexin Li. Sparse sufficeint dimension reduction. Biometrika, 94(3):603–613, 2007. [LN06] Lexin Li and Christopher Nachtsheim. Sparse sliced inverse regression. Technometrics, 48(4):503–510, 2006. [LSB+ 18] Jacob Levine, Erin Simonds, Sean Bendall, Kara Davis, El ad Amir, Michelle D Tadmor, Oren Litvin, Harris Fienberg, Astraea Jager, Eli Zunder, Rachel Finck, Amanda Gedman, Ina Radtke, James R Downing, Dana Pe’er, and Garry Nolan. Data-driven phenotypic dissection of aml reveals progenitor-like cells that correlate with prognosis. Cell, 162(1):184–197, 2018. [LWL+ 20] Xiangjie Li, Kui Wang, Yafei Lyu, Huize Pan, Jingxiao Zhang, Dwight Stambolian, Katalin Susztak, Muredach P Reilly, Gang Hu, and Mingyao Li. Deep learning enables accurate clustering with batch effect removal in single-cell rna-seq analysis. Nature Communications, 11(1), 2020. [LYS18] Hanna Levitin, Jinzhou Yuan, and Peter Sims. Single-cell transcriptomic analysis of tumor heterogeneity. Trends Cancer, 4(4):264–268, 2018. [LZL18] Qian Lin, Zhigen Zhao, and Jun Liu. On consistency and sparsity for sliced inverse regression in high dimensions. The Annals of Statistics, 46(2):482–1492, 2018. [LZL19] Qian Lin, Zhigen Zhao, and Jun S. Liu. Sparse sliced inverse regression via lasso. Journal of the American Statistical Association, 114:1726–1739, 2019. [MGH11] Aaron McDaid, Derek Greene, and Neil Hurley. Normalized mutual infor- mation to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515, 2011. [MLZ+ 21] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A uni- fied view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information Knowledge Management, pages 1202–1211, 2021. 88 [MP43] W.S. McCulloch and W Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5:115–133, 1943. [MT21] Yao Ma and Jiliang Tang. Deep Learning on Graphs. Cambridge University Press, 2021. [MvDW+ 19] Kevin Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel Burkhardt, William Chen, Kristina Yim, Antonia van den Elzen, Matthew Hirn, Ronald Coif- man, Natalia Ivanova, Guy Wolf, and Smita Krishnaswamy. Visualizing struc- ture and transitions in high-dimensional biological data. Nature Biotechnology, 37(12):1482–1492, 2019. [MZY12] Qing Mai, Hui Zou, and Ming Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, March 2012. [Nes83] Yurii Nesterov. 
A method for solving the convex programming problem with conver- gence rate o(1/k2 ). Proceedings of the USSR Academy of Sciences, 269:543–547, 1983. [Nes04] Yu. Nesterov. Introductory Lectures on Convex Optimization. A Basic Course. 2004. [NHS+ 16] Sonia Nestorowa, Fiona Hamey, Blanca Pijuan Sala, Evangelia Diamanti, Mairi Shepherd, Elisa Laurenti, Nicola Wilson, David Kent, and Berthold Göttgens. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differ- entiation. Blood, 128(8):20–31, 2016. [NLG+ 15] Aaron Newman, Chih Long Liu, Michael Green, Andrew Gentles, Weiguo Feng, Yue Xu, Chuong Hoang, Maximilian Diehn, and Ash Alizadeh. Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, (5):453–457, 2015. [PTB09] Elena Parkhomenko, David Tritchler, and Joseph Beyene. Sparse canonical correla- tion analysis with application to genomic data integration. Statistical Applications in Genetics and Molecular Biology, 8(1):1–34, 2009. [Qui20] Peng Qui. Embracing the dropouts in single-cell rna-seq analysis. Nature Communications, 11(1), 2020. [Roc70] Ralph Tyrell Rockafellar. Convex analysis. Princeton University Press, Princeton, 1970. [Ros58] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958. [Rus16] Nicole Rusk. Spatial transcriptomics. Nature Methods, 13(710), 2016. 89 [RWM+ 18] Bushra Raj, Daniel E Wagner, Aaron McKenna, Shristi Pandey, Allon M Klein, Jay Shendure, James A Gagnon, and Alexander Schier. Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nature Biotechnology, 36(5):442–450, 2018. [RWRY11] Pradeep Ravikumar, Martin Wainwright, Garvesh Raskutti, and Bin Yu. High- dimensional covariance estimation by minimizing l1-penalized log-determinant di- vergence. Electronic Journal of Statistics, 5:935–980, 2011. [RZL+ 21] Jiahua Rao, Xiang Zhou, Yutong Lu, Huiying Zhao, and Yuedong Yang. Imputing single-cell rna-seq data by combining graph convolution and autoencoder neural networks. iScience, 24(5):102393, 2021. [SCR14] William Schneider, Meike Chevillotte, and Charles Rice. Interferon-stimulated genes: a complex web of host defenses. Annual Review of Immunology, 32:513– 545, 2014. [SDB+ 18] William Stephenson, Laura Donlin, Andrew Butler, Cristina Rozo, Bernadette Bracken, Ali Rashidfarrokhi, Susan Goodman, Lionel Ivashkiv, Vivian Bykerk, Dana Orange, Robert Darnell, Harold Swerdlow, and Rahul Satija. Single-cell rna- seq of rheumatoid arthritis synovial tissue using low-cost microfluidic instrumenta- tion. Nature Communications, 9(1):791, 2018. [SJ15] Phillip Santoiemma and Daniel Powell Jr. Tumor infiltrating lymphocytes in ovarian cancer. Cancer Biology Therapy, 16(6):807–820, 2015. [SNL+ 17] Valentine Svensson, Kedar Natarajan, Lam-Ha Ly, Ricardo Miragaia, Charlotte La- balette, Iain Macaulay, Ana Cvejic, and Sarah Teichmann. Power analysis of single- cell rna-sequencing experiments. Nature Methods, 14:381–387, 2017. [SSZ+ 17] Uri Shaham, Kelly Stanton, Jun Zhao, Huamin Li, Khadir Raddassi, Ruth Mont- gomery, and Yuval Kluger. Removal of batch effects using distribution-matching residual networks. Bioinformatics, 33(16):2539—-2546, 2017. [SSZM16] Dan Shen, Haipeng Shen, Hongtu Zhu, and JS Marron. The statistics and mathemat- ics of high dimension low sample size asymptotics. Statistica Sinica, 26(4):1747, 2016. 
[STM+ 05] Aravind Subramanian, Pablo Tamayo, Vamsi Mootha, Sayan Mukherjee, Benjamin Ebert, Michael Gillette, Amanda Paulovich, Scott Pomeroy, Todd Golub, Eric Lander, and Jill Mesirov. Gene set enrichment analysis: a knowledge-based ap- proach for interpreting genome-wide expression profiles. Proceeds of the National Academy of Sciences of the United States of America, 102(43):15545–15550, 2005. [SYG+ 05] Franco Scarselli, Sweah Liang Yong, Marco Gori, Markus Hagenbuchner, 90 Ah Chung Tsoi, and Marco Maggini. Graph neural networks for ranking web pages. Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pages 666–672, 2005. [TBH+ 17] Po-Yuan Tung, John Blischak, Chiaowen Joyce Hsiao, David Knowles, Jonathan Burnett, Jonathan Pritchard, and Yoav Gilad. Batch effects and the effective design of single-cell gene expression studies. Scientific reports, 7, 2017. [THNC03] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Class prediction by nearest shrunken centroids, with applications to dna microar- rays. Statistical Science, 18(1):104–117, 2003. [TSY20] Kai Tan, Lei Shi, and Zhou Yu. Sparse sir: Optimal rates and adaptive estimation. The Annals of Statistics, 48(1):64–85, 2020. [vDSN+ 18] David van Dijk, Roshan Sharma, Juozas Nainys, Kristina Yim, Pooja Kathail, Am- brose Carr, Cassandra Burdziak, Kevin Moon, Christine L Chaffer, Diwakar Pat- tabiraman, Brian Bierie, Linas Mazutis, Guy Wolf, Smita Krishnaswamy, and Dana Pe’er. Recovering gene interactions from single-cell data using data diffusion. Cell, 174(3):716–729, 2018. [WAH+ 19] Jingshu Wang, Divyansh Agarwal, Mo Huang, Gang Hu, Zilu Zhou, Chengzhong Ye, and Nancy Zhang. Data denoising with transfer learning in single-cell transcrip- tomics. Nature Methods, 16:875—-878, 2019. [Wai19] Martin Wainright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019. [WCZZ18] Tao Wang, Mengjie Chen, Hongyu Zhao, and Lixing Zhu. Estimating a sparse re- duction for general regression in high dimensions. Statistics and Computing, 28:33– 46, 2018. [WDJ+ 22] Hongzhi Wen, Jiayuan Ding, Wei Jin, Xie Yuying, and Jiliang Tang. Graph neural networks for multimodal single-cell data integration, 2022. [WKI08] Ami Wiesel, Mark Kliger, and Alfred Hero III. A greedy approach to sparse canon- ical correlation analysis. arXiv preprint arXiv:0801.2748, 2008. [WMC+ 21] Juexin Wang, Anjun Ma, Yuzhou Chang, Jianting Gong, Yuexu Jiang, Ren Qi, Cankun Wang, Hongjun Fu, Qin Ma, and Dong Xu. scGNN is a novel graph neu- ral network framework for single-cell RNA-seq analyses. Nature Communications, 12(1882), 2021. [WN18] Maartje Wouters and Brad Nelson. Prognostic significance of tumor-infiltrating b cells and plasma cells in human cancer. Clinical Cancer Research, 24(24):6125– 91 6135, 2018. [WPL15] Lan Wang, Bo Peng, and Runze Li. A high-dimensional nonparametric multivariate test for mean vector. Journal of the American Statistical Association, (110):1658– 1669, 2015. [WSZ+ 19] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Wein- berger. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, pages 6861–6871, 2019. [WT11] D M Witten and R Tibshirani. Penalized classification using Fisher’s linear discrim- inant. Journal of Royal Statistical Society, Series B, 73:753–772, 2011. [WTH09] Daniela Witten, Robert Tibshirani, and Trevor Hastie. 
A penalized matrix decom- position, with applications to sparse principal components and canonical correlation analysis. Biostatistics (Oxford, England), 10(3):515–534, 2009. [XLV16] Yuying Xie, Yufeng Liu, and William Valdar. Joint estimation of multiple depen- dent Gaussian graphical models with applications to mouse genomics. Biometrika, 103(3):493–511, September 2016. [XTLZ02] Yingcun Xia, Howell Tong, W. K. Li, and Li-Xing Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363–410, 2002. [Yu97] Bin Yu. Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pages 423–435, 1997. [ZMY18] Hui Zou, Qing Mai, and Yi Yang. Multiclass Sparse Discriminant Analysis. Statistica Sinica, 2018. [ZNC+ 19] Hamim Zafar, Nicholas Navin, Ken Chen, , and Luay Nakhleh. SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data. Genome research, 29(11):1847– 1859, 2019. [ZWT19] Feng Zhang, Yu Wu, and Weidong Tian. A novel approach to remove the batch effect of single-cell data. Cell Discovery, 5(46), 2019. 92