SUPERVISED DIMENSION REDUCTION TECHNIQUES FOR HIGH-DIMENSIONAL DATA By Dylan Molho A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science, and Engineering – Doctor of Philosophy 2022 ABSTRACT SUPERVISED DIMENSION REDUCTION TECHNIQUES FOR HIGH-DIMENSIONAL DATA By Dylan Molho The data sets arising in modern science and engineering are often extremely large, befitting the era of big data. But these data sets are not only large in the number of samples they have, they may also have a large number of features, placing each data point in a high-dimensional space. However, unique problems arise when the dimension of the data has the same or even greater order than the sample size. This scenario in statistics is known as the High Dimension, Low Sample Size problem (HDLSS). In this paradigm, many standard statistical estimators are shown to perform sub-optimally and in some cases can not be computed at all. This dissertation develops two novel algorithms that successfully operate in the paradigm of HDLSS. We first propose the Generalized Eigenvalue (GEV) estimator, a unified sparse projection regression framework for estimating generalized eigenvector problems. Unlike existing work, we reformulate a sequence of computationally intractable non-convex generalized Rayleigh quotient optimization problems into a computationally efficient simultaneous linear regression problem, padded with a sparse penalty to deal with high-dimensional predictors. We showcase the applica- tions of our method by considering three iconic problems in statistics: the sliced inverse regression (SIR), linear discriminant analysis (LDA), and canonical correlation analysis (CCA). We show the reformulated linear regression problem is able to recover the same projection space obtained by the original generalized eigenvalue problem. Statistically, we establish the nonasymptotic error bounds for the proposed estimator in the applications of SIR and LDA, and prove these rates are minimax optimal. We present how the GEV is applied to the CCA problem, and adapt the method for a robust Huber-loss based formulation for noisy data. We test our framework on both synthetic and real datasets and demonstrate its superior performance compared with other state-of-the-art methods in high dimensional statistics. The second algorithm is the scJEGNN, a graphical neural network (GNN) tailored to the task of data integration for HDLSS single-cell sequencing data. We show that with its unique model, the GNN is able to leverage structural information of the biological data relations in order to perform a joint embedding of multiple modalities of single-cell gene expression data. The model is applied to data from the NeurIPS 2021 competition for Open Problems in Single-Cell Analysis, and we demonstrate that our model is able to outperform top teams from the joint embedding task. ACKNOWLEDGEMENTS I would like to thank my advisor and committee chair, Dr. Yuying Xie. Since the start of my time with him, he has shown commitment and passion to research while also being endlessly patient and adaptive. His knowledge of a diverse range of applied and theoretical techniques has been invaluable. He has been the perfect combination of professional and personable. I also want to thank my co-advisor, Dr. Qiang Sun. His depth of expertise in statistics and mathematics has been inspiring and served as constant motivation for my own work. 
His insight and clarity in our work has helped guide both my research and development as a scholar. I regret Covid taking away an opportunity to have worked with him more in person. I would also like to thank my other committee members, Dr. Ming Yan and Dr. Rongrong Wang for their knowledgeable feedback and suggestions into further research directions. I also want to thank the CMSE community, Lisa Roy, Heather Williams, etc. They are always there to offer help. Lastly, thank you to friends and family for providing so much support during this time. iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1 Mathematical Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Single-Cell Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 CHAPTER 3 THEORETICAL PROPERTIES OF THE GEV ESTIMATOR . . . . . . . . 22 3.1 General Error Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.1.1 Proof of Lemma 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.1.2 Proof of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.1.3 Proof of Theorem 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.3.1 Proof of Lemma 7 . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.1.3.2 Proof of Lemma 8 . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Sliced Inverse Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.1 Consistency for SIR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2.2 Proof of Theorem 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.3 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.3.1 Consistency for LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3.1.1 Proof of Theorem 19 . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.1.2 Proof of Theorem 22 . . . . . . . . . . . . . . . . . . . . . . . . 45 3.4 Minimax Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.4.1 Proof of Theorem 23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.4.2 Proof of Corollary 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.5 Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 CHAPTER 4 EMPIRICAL RESULTS OF THE GEV ESTIMATOR . . . . . . . . . . . . 52 4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.1.1 Robust Modification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2 Sliced Inverse Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2.1 Heavy Noise Slice Inverse Regression . . . . . . . . . . . . . . . . . . . . 59 4.3 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.4 Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
62 4.5 Application to Tumor-Infiltrating Lymphocytes Data . . . . . . . . . . . . . . . . 64 4.6 Application to Single-Cell RNAseq Data . . . . . . . . . . . . . . . . . . . . . . 66 v CHAPTER 5 GRAPHICAL NEURAL NETWORKS FOR MULTI-MODAL DATA IN- TEGRATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.2.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.2.2 Graph Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.2.3 Graph Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.2.4 Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 CHAPTER 6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 vi LIST OF TABLES Table 4.1: Summary of estimation accuracy for categorical response in low and high di- mensions. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Table 4.2: Summary of estimation accuracy for continuous response in low dimensions. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. . . 58 Table 4.3: Summary of estimation accuracy for continuous response in high dimensions. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. . . 59 Table 4.4: Summary of estimation accuracy for Huber loss estimation in low dimensions with high noise. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 Table 4.5: Summary of estimation accuracy for Huber loss estimation in high dimensions with high noise. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Table 4.6: Summary statistics reporting performance of the GEV, LDA−ℓ1 , Direct and Oracle methods. We report the means of the FD with its standard deviation in parentheses. The results are based on 100 replications . . . . . . . . . . . . . . 62 Table 4.7: Summary statistics reporting performance of the GEV and PMA methods. We report the means of the FD with its standard deviation in parentheses. The results are based on 100 replications . . . . . . . . . . . . . . . . . . . . . . . . 64 Table 5.1: Performances for Joint Embedding Task . . . . . . . . . . . . . . . . . . . . . . 79 vii LIST OF FIGURES Figure 1.1: Unlabeled 2-dimensional data. . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Figure 1.2: Labeled 2-dimensional data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
3 Figure 1.3: Data projection on the x and y axis. . . . . . . . . . . . . . . . . . . . . . . . . 3 Figure 1.4: The LDA decision boundary between samples from Class 1 and Class2. White x’s mark the sample means, and the yellow x gives the midpoint be- tween the two. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 Figure 2.1: A simple graph with adjacency matrix. . . . . . . . . . . . . . . . . . . . . . . . . 11 Figure 2.2: The architecture of a feedforward network. The nodes represent the coordinates of each layer, and the edges represent the weights of the linear transformations. . . . . . 13 Figure 2.3: The general framework of an autoencoder. The hidden layer gives the code of the encoder, and the lower dimension of the code constrains the representation. . . . . . . 15 Figure 2.4: An example matrix for single-cell data. . . . . . . . . . . . . . . . . . . . . . . . . 19 Figure 4.1: Comparison of convergence rates of different algorithms. . . . . . . . . . . . . . . . 53 Figure 4.2: Relationship between plasma cells and GEV-SIR direction. The left panel shows the distribution of eigenvalues of Ω̂. The scatter plots in the middle and right panels show the relationship between the tumor infiltrated plasma cell and the GEV-SIR direction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 Figure 4.3: GEV-SIR analysis of embryoid body scRNAseq data. . . . . . . . . . . . . . . . . 67 Figure 5.1: scJEGNN graph construction process. The input data determines the value of the weighted edges between the cell nodes and feature nodes, values of zero indicate no edge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Figure 5.2: scJEGNN graph convolution. Multiple convolution layers propagate infor- mation from the weighted edges to update cell and feature nodes. . . . . . . . . 76 Figure 5.3: scJEGNN Autoencoder architecture. Each layer is fully connected, and the encoder layers feature drop out and batch normalization steps. . . . . . . . . . . 79 viii LIST OF ALGORITHMS Algorithm 4.1: A fast iterative shrinkage-thresholding algorithm for GEV. . . . . . . . . 52 Algorithm 4.2: Huber loss algorithm for robust GEV. . . . . . . . . . . . . . . . . . . . 55 ix CHAPTER 1 INTRODUCTION In the past twenty years, we have witnessed an explosion of data from different domain areas, including medical imaging, finance, and genomics. The data sets arising in modern science and engineering are often extremely large, befitting the era of big data. But these data sets are not only large in the number of samples they have, they may also have a large number of features, plac- ing each data point in a high-dimensional space. Data like this is common in fields like biology, where for instance measurements of the gene expression of cells can have tens of thousands or even hundreds of thousands features. This enrichment of data offers promises of solutions to many challenging goals, including detecting genes underlying complex diseases and designing novel drug treatments. However, unique problems arise when the dimension of the data has the same order as or even greater than the sample size. This scenario in statistics is known as a High Dimen- sion, Low Sample Size problem (HDLSS) [HMN05, SSZM16]. In such a regime, many classical statistical methods no longer have guarantees of success, and standard asymptotic theory often fails to provide useful predictions. 
As well, in high dimensions, our intuitions about basic concepts such as the distance between points begin to break down. As a result of the "curse of dimensionality", higher-dimensional versions of the cube no longer have the majority of their mass near the center of the cube; instead the vast majority of the volume is found near the corners. Similarly, multivariate normal distributions act more and more like uniform distributions on hyperspheres, and estimators like k-nearest neighbors become unreliable because distances between points in a high-dimensional data set are all very similar.

To better illustrate this behavior, we give a simple example of Linear Discriminant Analysis (LDA), a classical machine learning technique that trains a classifier for the data. The classifier is determined by finding a linear projection of the data to a lower-dimensional space that best separates the data based on its class. In Figure 1.1 we see unlabeled data in 2 dimensions that have a high spread along the x-axis, with very little variation along the y-axis. If we sought to simplify the data to a one-dimensional representation, we could simply project the data to the x-axis, preserving most of the variation of the data.

Figure 1.1: Unlabeled 2-dimensional data.

Once the data is given additional class information, we see that most of the pertinent information about how the classes are separated would be lost upon projection to the x-axis. We see in Figure 1.3 that the projection of the data on the x-axis mixes much of the class data, making it nearly impossible to find a good point to separate the classes. On the other hand, projection onto the y-axis does a good job of separating the class information, despite the data being more compactly spaced after projection. It is not too surprising that this is the case: the data is generated from two multivariate normal distributions with mean µ1 = (13, 1) for class 1 and mean µ2 = (21, 2) for class 2, with both classes sharing the diagonal covariance
\[
\Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 0.1 \end{pmatrix}.
\]

Figure 1.2: Labeled 2-dimensional data.

Figure 1.3: Data projection on the x and y axis.

While the projection of the data on the y-axis fared well in distinguishing the class information, we can do even better. The solution provided by LDA takes into account the location of the sample means, and corrects for the covariance of the data to make a more optimal separation of the data. Let µ̂1 and µ̂2 be our estimated class means from the data, marked by the white x's in Figure 1.4, and let Σ̂ be our estimated covariance. Then the linear discriminant function applied to a new data point x ∈ R2 is given by the inner product
\[
\widehat{\Psi}(x) = \Big\langle \hat{\mu}_1 - \hat{\mu}_2,\; \widehat{\Sigma}^{-1}\Big(x - \frac{\hat{\mu}_1 + \hat{\mu}_2}{2}\Big) \Big\rangle.
\]
If this value is greater than 0, we label the point x as belonging to class 1, and if it is less than 0, it is labeled as belonging to class 2. This provides a linear decision boundary in the plane, given by the yellow dotted line in Figure 1.4, along which Ψ̂(x) = 0. The decision boundary shows how the LDA classifier labels new points, and if we wanted to project the data to best separate the samples by class, we would project the points onto a line that is orthogonal to this decision boundary.

Figure 1.4: The LDA decision boundary between samples from Class 1 and Class 2. White x's mark the sample means, and the yellow x gives the midpoint between the two.
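The discriminant function is straightforward to compute once the class means and the pooled covariance have been estimated. The following is a minimal NumPy sketch that re-creates the toy example above; the random draw, and hence the fitted quantities, will not match the figures exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters of the toy example: two Gaussian classes in R^2 sharing a
# diagonal covariance, 40 samples per class (80 total).
mu1, mu2 = np.array([13.0, 1.0]), np.array([21.0, 2.0])
Sigma = np.diag([25.0, 0.1])
n_per_class = 40

X1 = rng.multivariate_normal(mu1, Sigma, size=n_per_class)
X2 = rng.multivariate_normal(mu2, Sigma, size=n_per_class)

# Estimated class means and pooled within-class covariance.
mu1_hat, mu2_hat = X1.mean(axis=0), X2.mean(axis=0)
Sigma_hat = 0.5 * (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False))

def lda_discriminant(x):
    # Psi_hat(x) = < mu1_hat - mu2_hat, Sigma_hat^{-1} (x - (mu1_hat + mu2_hat)/2) >
    centered = x - 0.5 * (mu1_hat + mu2_hat)
    return (mu1_hat - mu2_hat) @ np.linalg.solve(Sigma_hat, centered)

# A positive discriminant assigns the point to class 1, a negative one to class 2.
test_point = np.array([17.0, 1.8])
label = 1 if lda_discriminant(test_point) > 0 else 2
print(f"Psi_hat(test_point) = {lda_discriminant(test_point):.3f} -> class {label}")
```

Projecting the data onto the estimated direction Σ̂⁻¹(µ̂1 − µ̂2) gives the one-dimensional representation orthogonal to the decision boundary.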
If we assume that both classes are equally likely, then we can calculate the error probability of the LDA decision rule as
\[
\mathrm{Err}(\widehat{\Psi}) = \frac{1}{2}\,\mathbb{P}_1\big[\widehat{\Psi}(x') \le 0\big] + \frac{1}{2}\,\mathbb{P}_2\big[\widehat{\Psi}(x'') > 0\big],
\]
where x′ and x′′ are samples drawn independently from the probability distributions P1 and P2 of class 1 and class 2, respectively. We can analyze the behavior of the error asymptotically, i.e. look at the limiting behavior as our sample size n increases to infinity. Define the value γ = ∥µ1 − µ2∥2, the distance between the two class means, and let us simplify the problem further by assuming the covariance Σ is the identity matrix. Then the random error Err(Ψ̂) converges in probability to a fixed number,
\[
\mathrm{Err}(\widehat{\Psi}_{\mathrm{Id}}) \xrightarrow{\;\mathrm{prob.}\;} \Phi\Big(-\frac{\gamma}{2}\Big),
\]
where Φ(t) = P[Z ≤ t] is the cumulative distribution function of the standard normal variable. Thus the error behaves purely as a function of the distance between the class means as n increases: as the means grow farther apart, the error goes down.

But this assumes that the dimension of the data remains fixed. If instead the dimension d were held at a fixed ratio with the sample size n, we would see a different picture emerge. In many cases this fixed ratio is more realistic for applications where the dimension is high and collecting much larger samples is infeasible. Assume the ratio d/n converges to some positive constant α > 0. Under this high-dimensional scaling, our error converges to a different, sub-optimal value,
\[
\mathrm{Err}(\widehat{\Psi}_{\mathrm{Id}}) \xrightarrow{\;\mathrm{prob.}\;} \Phi\bigg(-\frac{\gamma^2}{2\sqrt{\gamma^2 + 2\alpha}}\bigg).
\]
In this case the classical prediction Φ(−γ/2) drastically underestimates the error rate. Our example application of LDA used data that is only 2-dimensional with a sample size of 80. We would see a quick deterioration in performance if the dimension were much higher, e.g. d = 40. At the point d > n, we are unable to even compute the estimator, since the estimate of the covariance Σ̂ is no longer full rank, which makes computing its inverse impossible.

These HDLSS phenomena necessitate the development of new theory as well as new methods in order to manage these difficult problems and achieve the desired outcomes for downstream data science tasks. To overcome the barriers found in HDLSS scenarios, one must make additional assumptions on the data, either with explicit formulations or with implicit beliefs about the behavior of the data. The first type of research leads to structural assumptions placed on the probability model that generates the data, which allow for alterations to classical methods that yield theoretically optimal estimators for well-defined tasks. The second type of research, in contrast, makes general assumptions usually based on the causal nature of the chosen real-world data application, where the data is assumed to have dependencies between the various parameters. While there are no theoretical guarantees for such methods, the strength of the estimator is instead demonstrated empirically on simulated or real data sets. We explore ideas from both of these fields, and develop two novel algorithms that successfully operate in the paradigm of HDLSS.

In the first case, we develop an estimator for high-dimensional data with response variables, under the assumption that the data has an underlying low-rank structure and that the lower-dimensional representation is obtained with sparse projection directions.
Sparsity and low rank representations are natural assumptions to make on high-dimensional data due to the likelihood of having very few of the potential thousands of variables contribute meaningfully to a response variable. Data exhibiting this structure can be found in a diverse collection of applications, ranging from ge- nomics to economics. Our estimator produces a supervised linear dimension reduction of the data that attempts to maximally preserve the relationship between the covariate data and the response variables. We achieve this through finding vectors that relate the covariance of the covariates with the covariance between the data and response variables, solving a type of generalized eigenvalue optimization problem. We name the method the Generalized Eigenvalue (GEV) Estimator, and show that this method is able to solve three separate classical statistical problems in the HDLSS paradigm: Linear Discriminant Analysis (LDA), Sliced Inverse Regression (SIR), and Canoni- cal Correlation Analysis (CCA). We give theoretical guarantees of convergence that are shown to be minimax optimal for both LDA and SIR, and give empirical results of GEV outperforming competitor methods in all three applications on simulated data, as well as applications of GEV to real-world data analysis tasks on gene expression data. 6 For the second case, we design a method for multi-modal data integration using deep learning. We tailor a graphical neural network (GNN) for use on single-cell sequencing data, a rich new data source in biology that has revolutionized the field. Single-cell sequencing data produces matrices where each row represents a cell, and each column gives a value corresponding to the expression of some gene. This data often falls in the HDLSS regime, since sequencing cells is an expensive process, but can produce for each cell potentially hundreds of thousands of features given by different genes. We show that by using a bipartite graphical model of the data that represents both cell and genes as nodes, we are able to leverage the causal structure of the gene expression data to create a low-dimensional representation that preserves important biological information. The GNN is combined with an autoencoder model in order to train a low-dimensional representation through latent feature regularization. We apply our model named scJEGNN to the task for single- cell multi-modality data integration in the NeurIPS 2021 special competition for Open Problems in Single-Cell Analysis, and we demonstrate our model is able to outperform the best performing competitor models. The structure of this dissertation is as follows: in Chapter 2 we go over some preliminary mathematical background leading up to the GEV estimator, introduce the basics of neural networks and the specific models we use from deep learning, and give some description of single-cell data, which features prominently in our applications. In Chapter 3 we develop the formal theory of the GEV estimator, including the nonasymptotic convergence theorems and minimax bounds. In Chapter 4 we give the computational algorithm for the GEV estimator, and show the application of the estimator to simulated and real data problems. In Chapter 5 we detail the scJEGNN architecture and show its performance on the single-cell sequencing data integration task. 7 CHAPTER 2 BACKGROUND In this chapter we develop the necessary mathematical background to understand the work. 
We assume some familiarity with linear algebra, multivariate calculus, and probability theory, but we first review some relevant concepts from these areas below. Then we review the basics of neural networks, including feedforward networks, autoencoders, and graphical neural networks. Lastly we introduce single-cell data and its structure and behavior.

2.1 Mathematical Preliminaries

Probability. Let X be a continuous real-valued random variable with probability distribution P. If one wishes to understand how spread out X is from its mean E[X] = µ, then we can look at the tail probabilities P(|X − µ| ≥ t) for t > 0. A bound on this value gives us a concentration inequality, which tells us how likely it is for X to be a distance of t from its mean µ. It is often beneficial to have random variables that concentrate around their mean. A common example is when we have a population parameter we are trying to estimate with a function of the data, which is a random variable whose mean equals the population parameter. If X ∼ N(µ, σ²) is normally distributed, then we can show that X has quick decay of probability in its tails:
\[
\mathbb{P}\big[|X - \mu| \ge t\big] \le 2e^{-\frac{t^2}{2\sigma^2}} \quad \text{for all } t \ge 0. \tag{2.1}
\]
This quick rate of decay as a function of t is a property that we wish to generalize. To that end we call a random variable X sub-Gaussian if there is a positive number σ such that equation (2.1) holds for X. A similar but weaker property is the following: X is sub-exponential with parameters (ν, α) if
\[
\mathbb{P}\big[|X - \mu| \ge t\big] \le
\begin{cases}
2e^{-\frac{t^2}{2\nu^2}} & \text{if } 0 \le t \le \frac{\nu^2}{\alpha}, \\[4pt]
2e^{-\frac{t}{2\alpha}} & \text{if } t > \frac{\nu^2}{\alpha}.
\end{cases} \tag{2.2}
\]
For sub-exponential variables, when t is small enough the concentration inequality is sub-Gaussian in nature (i.e. with the exponent quadratic in t), but for larger t the exponential component of the bound scales linearly in t. The location of this shift in behavior is controlled by the parameter α, and in the limit α → 0 we recover sub-Gaussian inequalities.

If we have two continuous random variables X and X′′ with probability distributions P and Q respectively, then the Kullback-Leibler divergence (KL divergence) between the distributions is defined to be
\[
D_{\mathrm{KL}}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log\Big(\frac{p(x)}{q(x)}\Big)\, dx,
\]
where p and q denote the probability density functions of P and Q. For simplicity denote [n] = {1, . . . , n}, the discrete set from 1 to n. Lastly, if we have a collection of random variables X1, . . . , Xn, we define the first order statistic X(1) to be the minimum value of the collection {Xi}i∈[n], the second order statistic X(2) to be the second smallest value of the collection, and so on up to the nth order statistic X(n), which is the largest value of the collection.

Linear Algebra. Let A ∈ Rm×n be a real m by n matrix, and assume that m > n and that A has full rank. The singular value decomposition of A is the identity
\[
A = U \Sigma V^{\top} = \sum_{i=1}^{n} \sigma_i u_i v_i^{\top},
\]
where Σ = diag(σ1, σ2, . . . , σn) ∈ Rn×n is a diagonal matrix with positive real numbers on the diagonal and 0's elsewhere, U = [u1 |u2 | · · · |un ] ∈ Rm×n is an orthonormal matrix with columns ui ∈ Rm, and V = [v1 |v2 | · · · |vn ] ∈ Rn×n is an orthonormal matrix with columns vi ∈ Rn. If A is a square symmetric matrix in Rn×n, then the eigendecomposition of A is the identity
\[
A = V \Lambda V^{\top} = \sum_{i=1}^{n} \lambda_i v_i v_i^{\top},
\]
where Λ = diag(λ1, λ2, . . . , λn) ∈ Rn×n is a diagonal matrix and V = [v1 |v2 | · · · |vn ] ∈ Rn×n is an orthogonal matrix in which each (vi, λi) is an eigenvector/eigenvalue pair of A satisfying Avi = λi vi.
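Both decompositions are available in standard numerical libraries. The short check below is a sketch using NumPy on a random matrix, verifying the two identities above numerically; the matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 4

# Singular value decomposition of a full-rank rectangular matrix A (m > n).
A = rng.standard_normal((m, n))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U is m x n
assert np.allclose(A, U @ np.diag(s) @ Vt)          # A = U Sigma V^T

# Eigendecomposition of a symmetric matrix B = V Lambda V^T.
B = A.T @ A                                         # symmetric positive semidefinite
lam, V = np.linalg.eigh(B)
assert np.allclose(B, V @ np.diag(lam) @ V.T)
assert np.allclose(B @ V[:, 0], lam[0] * V[:, 0])   # each column is an eigenvector

print("singular values of A:", np.round(s, 3))
print("eigenvalues of A^T A:", np.round(lam, 3))    # squared singular values (ascending order)
```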
For two matrices A, B ∈ Rd×d , we define the vector/scalar pair (v, ρ) with v ∈ Rd and ρ ∈ R to be 9 a generalized eigenpair if Av = ρBv. In the case that B is nonsingular, the pairs (v, ρ) correspond to the eigenpairs of the matrix B−1 A. Alternatively, finding these eigenpairs is equivalent to finding the vectors that are critical points of Rayleigh quotient v⊤ Av v⊤ Bv with the corresponding generalized eigenvalue ρ equal to the value of the quotient. This framework occurs commonly in statistics; when seeking a linear projection of the data many classic methods solve a form of the above generalized eigenvalue problem with A and B acting as covariance matrices of the data. Further notation we use is as follows. For two matrices A and B, let ⟨A, B⟩ = tr A⊤ B denotes  1/q the trace inner product. For a vector u ∈ Rd and q ∈ [0, ∞], ∥u∥q = ∑dj=1 |u j |q is the ℓq norm if 0 ≤ q ≤ ∞; when q = 0, ∥u∥0 is the number of nonzero entries of u; when q = ∞, ∥u∥∞ = max1≤ j≤d |u j |. For a matrix A ∈ Rm×n , we use tr(A) to denote its trace and Ai j for the (i, j)-th element, and for q ∈ (0, ∞), ∥A∥q is ℓq operator norm, while ∥A∥F and ∥A∥max are used to denote the Frobenius norm and the entry-wise maximum norm, respectively. For q1 , q2 ∈ [0, ∞], the matrix  (q1 , q2 )-pseudonorm ∥A∥q1 ,q2 of A is defined as ∥A∗1 ∥q2 , ∥A∗2 ∥q2 , . . . , ∥A∗m ∥q2 q , where A∗i 1 denotes the ith column of A. If J ∈ [m] is a subset of indices of size j, AJ ∈ R j×n is the submatrix given by the j rows with indices in J. We use ρmax (A) and ρmin (A) to denote the maximum eigenvalue and minimum eigenvalue, respectively. C,C1 , and C2 are constants that may vary in different instances of usage. Graphs. A graph, denoted G = {V , E }, is a set of nodes V = {v1 , . . . , vn }, and a set of edges E = {e1 , . . . , em }. Nodes commonly represent entities in a data science problem, while the edges give the relations between them, e.g. social media users as nodes and edges representing friendships, or chemical atoms as nodes, and chemical bonds as edges. An edge e ∈ E connects two nodes v1e , v2e , thus e can be represented as (v1e , v2e ), an element of V × V . A node vi is adjacent to another node v j 10 Figure 2.1: A simple graph with adjacency matrix. if and only if there exists an edge between them. We define the (first-order) neighbors of a node vi , denoted N (vi ), as the set of nodes that are adjacent to vi . A graph G can be equivalently represented as an adjacency matrix which describes the connectivity between the nodes. Let A ∈ {0, 1}n×n be a matrix where Ai, j = 1 if vi is adjacent to v j , and equal to 0 otherwise. The degree of a node vi is the number of nodes adjacent to vi , d(vi ) = ∑ 1v j ∈N (vi ) v j ∈V where 1v j ∈N (vi ) = 1 in the event v j ∈ N (vi ) and 0 otherwise. The degree matrix D is a diagonal matrix defined as Di,i = d(vi ), Di, j = 0 if i ̸= j. 2.2 Deep Learning Deep learning is a class of machine learning algorithms that are built from artificial neural net- works. Originating to a linear model in [MP43], it was further developed into the perceptron in [Ros58], which can learn parameters for the function given training samples. Neural networks (NNs) had a renaissance of interest and research in the early 2000s with the advent of “big data” sources and more powerful computational machines to train the models. Since then deep learn- ing models have consistently proven to outperform state-of-the-art traditional methods in a large 11 number of applications. 
The power and flexibility of different deep learning models has firmly established the popularity of the models in machine learning tasks. We introduce the basics of common NN models that we use in our work. A fuller treatment of the subject of deep learn- ing can be found in [GBC16], and [MT21] provides an excellent reference for graphical neural networks. Feedforward Networks. A feedfoward network is simply a special type of function that is made by composing a collection of simpler functions. As in all machine learning tasks, the feedfoward network is an approximation of a sought after function f ∗ ()b, for instance if the task is classifi- cation, one wishes to find a mapping f (x|Θ) that best approximates the ideal classifier f ∗ (x) = y. The values of the parameters Θ that determine the feedforward network f (x|Θ) are learned during training. In feedfoward networks, the function f : Rd → Rk given by the network is a composition of simpler functions that are referred to as the layers of the network. The output dimension k is chosen to suit the chosen application of the network. A single layer generally has an affine transformation followed by a nonlinear “activation function” applied pointwise. This means for the input vector x, the first layer would produce h1 = σ b1 + W1 x ,  where W1 ∈ Rd×d1 , b1 ∈ Rd1 , and σ : Rd1 → Rd1 is the activation function. The output h1 is the first “hidden layer” of the network, and the depth of the network is determined by the number of layers. In general if a network has depth m, then the network function would be f (x) = bm + Wm σ (bm−1 + Wm σ (· · · σ (b1 + Wm x)))), where Wi ∈ Rdi ×di−1 and b ∈ Rdi , and our output dimension k would be equal to dm . The values of the weights of Wi and bi , i ∈ [m] give the collection of parameters Θ that determine the function, and are learned during training. These transformations between layers are depicted in Figure 2.2 as bipartite graphs, where the coordinates before and after are given as left and right nodes, and the edges between them represent the weights of the matrix Wi . 12 Figure 2.2: The architecture of a feedforward network. The nodes represent the coordinates of each layer, and the edges represent the weights of the linear transformations. Activation functions were originally designed to recreate the behaviors of biological neurons, which receive a signal and either kill it or propagate it to further neurons. These functions introduce non-linearity into the neural network which leads to strong theorems guaranteeing the function’s approximation capabilities under certain conditions. There are a collection of commonly used functions, and one of the most used is the Rectified Linear Unit (ReLU). The ReLU function is linear (identity) for all positive inputs and 0 for all negative values; ReLU(z) = max(0, z). Many variants of the function exist, many of which attempt to address the lack of gradient the function has for negative inputs, like the LeakyReLU, ELU, and GELU. Prior to the use of ReLU, using sigmoid functions were the norm, like the logisitic sigmoid 1 σ (z) = 1 + exp(−z) due to the belief that activation functions had to be continuous to properly train the model. The output of the function is run through a chosen loss function that allows for optimization methods like gradient descent to learn the parameters. 
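As a concrete illustration of the layer composition described above, the following sketch implements a small feedforward network as a pure NumPy forward pass. The layer widths are arbitrary and the weights are random, untrained stand-ins for parameters that would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # ReLU activation: identity for positive inputs, zero otherwise.
    return np.maximum(0.0, z)

# Layer widths: input d = 8, two hidden layers of width 16, output k = 3.
dims = [8, 16, 16, 3]
weights = [rng.standard_normal((dims[i + 1], dims[i])) * 0.1 for i in range(len(dims) - 1)]
biases = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]

def forward(x):
    # h_i = sigma(b_i + W_i h_{i-1}); no activation on the final (output) layer.
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(b + W @ h)
    return biases[-1] + weights[-1] @ h

x = rng.standard_normal(8)
print("network output:", forward(x))
```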
If the task is regression, so that each training 13 sample has the pair (x, y), y ∈ Rk , the output f (x) = ŷ ∈ Rk could be run through the simple squared loss function to measure the difference between the predicted ŷ and ground truth y: k L(y, ŷ) = ∑ (yi − ŷi )2 . i=1 If the task is classification, the neural network needs to output the class label of the input. Instead of a discrete output from a finite set of labels, probabilities are given for each class of C possible classes, so that the output ŷ would be a vector in RC . The softmax function is used to output values between 0 and 1 so that the total adds up to 1: exp(zi ) ŷi = softmax(z)i = , i ∈ [C], ∑ j exp(z j ) where zi is the ith element of the vector z. Then with the predicted ŷ the loss function of cross- entropy is used to measure the difference from the truth C L(y, ŷ) = − ∑ yi log(ŷi ). i=1 Here yi = 1 if the class of x is i, and 0 otherwise. During inference, an unlabeled input is given label i if ŷi is the largest value among all the coordinates of ŷ. Autoencoders. An autoencoder is a special type of neural network that tries to reproduce its input as its output. The autoencoder consists of two components: an encoder h = f (x), which encodes x into a hidden representation (called a code) h, and a decoder g which attempts to reconstruct x from h, represented g(h) = x̂. The network is trained to minimize the reconstruction error L(x, x̂) = L(x, g( f (x))), where L(x, x̂) measures the difference between x and x̂. The utility of an autoencoder comes from its limitations; perfectly recreating the input is made to be impossible, so good approximations x̂ are trained by encoding only important information in the code h. This limitation is achieved through the creation of a bottleneck that the input x is forced through. As seen in Figure 2.3, this bottleneck occurs at the code layer that constrains its representation in some manner. 14 Figure 2.3: The general framework of an autoencoder. The hidden layer gives the code of the encoder, and the lower dimension of the code constrains the representation. There are two main ways this constraint in the autoencoder is implemented, by making the di- mension of the code smaller than the input, or by placing certainly penalties on the latent represen- tation that discourages memorization between input and output. The first leads to undercomplete autoencoders, which forces the lower dimension code h to preserve the most important features of the input. The second leads to regularized autoencoders, which have additional terms added to the loss function L(x, g( f (x))) + η · Ω(h), where Ω(h) is the regularization term applied to the code h and η is a hyperparameter controlling the amount of penalty. One regularization is the ℓ1 -regularization Ω(h) = ∥h∥1 which promotes sparsity in the code h. Other regularizations may explicitly promote certain fea- tures that are data specific to be preserved in the hidden representation, such as the cell type of of single-cell gene expression data. 15 Graphical Neural Networks. Graph neural networks (GNNs) are a collection of deep learning architectures that are designed to deal with graph-structured data. Other architectures like feedfor- ward networks or convolutional neural networks (not covered) are more amenable to data that is structured as a regular grid, like vectors or matrices. GNNs expand the functionality of neural net- works to to this more multifarious data, and allow for both node-based and graph-based learning tasks. 
These models are quite recent innovations, as the first GNN model was published as recently as 2005 in [SYG+ 05]. Like all neural networks, GNNs act as a type of representation learning of its input data, and it is through learning a good representation of its input data that it is able to perform well in designated tasks. Since the input data for GNNs are graphs, there are two ways to go about this representation. For node-based tasks, the GNN learns good features for each node, using the graph structure to facilitate the calculation of this representation. For graph-based tasks, the GNN aims to learn features to represent the entire graph, and learning node features occurs only as an intermediate ancillary step. To learn updated node features on the graph, the GNN takes in both the input node features and the graph structure given by the adjacency matrix A ∈ Rn×n , where n gives the number of nodes. If the nodes features are given by vectors in Rd , the collection of features can be given by a matrix F ∈ Rn×d , and the update takes the form F1 = h(A, F) for some function h called the “graph filter”. For node-based tasks, the graph filtering operation is usually sufficient, and the GNN consists of multiple graph filters stacked consecutively to generate final node features. Other operations are necessary for graph-based tasks to generate the features for the entire graph from the node features. Tailored functions that take into account the graph structure like graph pooling operations are used to generate global features. We forego further discussion of graph-based task models, since our applications are node-based only. Like feedforward networks, the general framework for node-based GNNs is a composition of multiple steps of graph filtering followed by a non-linear activation. If the network has depth m, 16 then the collection of operations would be denoted Fm = hm (A, σ (hm−1 (A, · · · σ (h1 (A1 , F))))). The final output Fm is then used for a downstream task related, e.g. classification on the nodes. The non-linear activation functions come from the same collection of activation functions used for other neural networks, but they can be combined with the graph filterings in novel ways. The spectral filtering process, for instance, has the nodes transformed via a Graph Fourier Transform, applies the activation function to the transformed coefficients, and then reconstructs the nodes from the spectral representation. While there are a large collection of graph filters, including a whole class of spectral-based filters, we focus here on the simplest spatial-based graph filter. Let our filter hi , followed by the activation function σ , be defined as σ (hi (A, Fi−1 )) = σ AFi−1 Wi ,  for Wi ∈ Rdi−1 ×di . This transformation is the same as the feedforward network with the exception of the multiplication by A. This product AFi−1 means that for every node, we sum up all the feature vectors of all the neighboring nodes but not the node itself (unless there are self-loops in the graph.) This allows the topology of the graph to be taken into account during these transformations, but one generally wishes for a node to propagate its own features into its next updated representation. To correct for this, the adjacency matrix is replaced by A b = A + I for I the identity matrix in Rn , so that A b supplies self-loops to the graph. Furthermore, since A is typically not normalized, the product AFi−1 can change the scale of the feature nodes based on the number of neighbors a node has. 
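To make the filtering operation concrete, the sketch below applies one spatial graph filter to a toy graph in NumPy, adding self-loops via Â = A + I as described above. The weight matrix is a random stand-in for learned parameters, and the printed degrees illustrate the scaling issue that the normalization discussed next is meant to address.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Toy graph on 4 nodes with 3-dimensional node features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
F = np.arange(12, dtype=float).reshape(4, 3)      # feature matrix F in R^{n x d}

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2)) * 0.1             # learned weights (random stand-in)

A_hat = A + np.eye(4)                             # add self-loops: A_hat = A + I
F1 = relu(A_hat @ F @ W)                          # one spatial filtering step

# Degrees of A_hat; high-degree nodes aggregate more mass, which motivates
# the symmetric normalization described in the text that follows.
print("degrees:", A_hat.sum(axis=1))
print("updated node features:\n", np.round(F1, 3))
```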
One can normalize A via multiplication with the inverse degree matrix D−1 , but more often a symmetric normalization is used giving D−1/2 AD−1/2 . Combining these two we get   −1/2 b b −1/2 i σ (hi (A, Fi−1 )) = σ D b AD Fi−1 W , for D b the degree matrix of A. b Training. The training of deep learning models occurs through the minimization of a loss func- tion L with respect to the parameters of the models – the weights of the affine transformations. We 17 denote the loss function as L(Θ) where Θ denotes all the parameters to be optimized. Gradient descent, a first-order iterative optimization algorithm, is often used to minimize the loss function. At each iteration the parameters Θ are updated by shifting them in the direction of the negative gradient (the direction the loss function decreases the most at that location): Θ′ = Θ − η · ∇Θ L(Θ), where ∇Θ L(Θ) denotes the gradient of L at Θ, and η > 0 is the learning rate, which is usually fixed at a small constant. The gradient is usually averaged over a collection of training samples in a batch, which provides greater statistical consistency and computational efficiency. The process is iterated until convergence or some condition is met. 2.3 Single-Cell Data In the biological sciences, the advent of single-cell technologies has revolutionized the investiga- tion of cellular behavior in the context of its microenvironment. Single-cell sequencing is able to measure multiple molecular features in multiple modalities in a cell, such as gene expressions, protein abundance and chromatin accessibility. Measurements at the single-cell level allow for unprecedented resolution for studying cell-to-cell heterogeneity. Such data sheds new insights across biological disciplines including oncology [LYS18], neurology [RWM+ 18], and immunol- ogy [SDB+ 18]. Technologies like single-cell transcriptome sequencing (scRNA-seq) and single- cell assay for transposase-accessible chromatin with sequencing (scATAC-seq) provide data on the RNA and DNA gene expression of individual cells respectively. scRNA-seq data makes it possible to measure transcriptome-wide gene expression, and enables researchers to distinguish different cell types based on their gene expression, organize cell populations, and identify cells transition- ing between states [AHB+ 16, HBR+ 17, Con18]. Similarly, scATAC-seq studies can reveal somatic clonal structures such as those found in cancer [ZNC+ 19], helping monitor cell lineage develop- ment. The technologies that produce these data sources are very interesting in their own right, but for our purposes it suffices to understand the format this data is presented to us after the sampling 18 Figure 2.4: An example matrix for single-cell data. takes place. For both scRNA-seq and scATAC-seq, the data can be simply represented as a n × d matrix where n gives the number of cells sampled and d gives the “sequencing depth”, i.e. the number of genes tested for in the study. Then each row of the matrix gives a unique cell, and each column give a unique gene. The values of the matrix consist of non-negative integers giving the count data of how many times the gene expressed for a particular cell. These matrices can have further information supplementing the data, such as cell type annotations assigning a class to each cell, or cell-cycle scores quantifying the developmental stage of the cell. 
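Concretely, such a matrix might look like the toy example sketched below, where the cell and gene names are invented for illustration and a cell-type annotation column sits alongside the count columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

genes = ["GeneA", "GeneB", "GeneC", "GeneD"]        # columns: genes (features)
cells = [f"cell_{i}" for i in range(5)]             # rows: individual cells

# Non-negative integer counts; real single-cell matrices are typically very sparse.
counts = rng.poisson(lam=0.6, size=(len(cells), len(genes)))

data = pd.DataFrame(counts, index=cells, columns=genes)
data["cell_type"] = ["B cell", "B cell", "T cell", "T cell", "Monocyte"]  # annotation column

print(data)
```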
The cell-cycle scoring is based on the expression of G2/M and S phase markers, where S is the synthesis phase for the replication of the chromosomes (also part of interphase), G2 is the gap 2 phase representing the end of interphase, prior to entering the mitotic phase, and the M phase is the nuclear division of the cell (consisting of prophase, metaphase, anaphase and telophase) [NHS+ 16]. This additional information would be included as a collection of additional columns at the end of the matrix, indicating the appropriate labels or scores. An example matrix giving the gene expression data of a single-cell experiment is given in Figure 2.4. There are two important aspects of single-cell data that are necessary to understand before using it for analysis: batch effects and dropout. In single-cell sequencing methods, data is organized 19 into separate batches, where large groups of cells are potentially sampled in multiple laboratories using different cell dissociation and handling protocols, library preparation technologies and/or sequencing platforms. These different factors result in batch effects [TBH+ 17] that can change the expression of genes systematically from one batch to another. Such differences can mask underlying biology or introduce spurious structure in the data, and must be corrected prior to further analysis to avoid misleading conclusions [AHMM18]. Since becoming a standard data preprocessing step for singe-cell analysis, there have been many advanced techniques developed to address batch effect removal, including techniques based on CCA [ZWT19] and a number of deep learning models [LWL+ 20, SSZ+ 17]. Dropout is the name given to the technical error that occurs when performing single-cell sam- pling, which leads to artificial counts of zero in the gene expression read outs. The error occurs during library preparation – a technical step that duplicates a gene many times in order to be counted during sampling – and occurs more frequently for genes that express at low levels in their respective cell types [Qui20]. This leads to sequencing data that is notoriously sparse, where the vast majority of features may be zero in a typical dataset due to dropout [SNL+ 17], which can be even more pronounced in multi-modal data [LHH20]. Dropout events lead to increased technical variability and noise in the single-cell data, and makes it difficult to differentiate “true” zero counts from “false” ones. Here, true zero counts indicate that a gene is not expressed in a particular cell type, which could act as important information to differentiate cell types. Addressing dropout requires specialized data processing methods such as imputation. Imputation takes in data and at- tempts to replace artificial zero counts with realistic count values, hopefully while preserving true zero counts. A diverse collection of methods exist for imputation, most of which rely on gaining information about cell behaviors from similar cells in the dataset, or by transferring knowledge of cell behaviors from other datasets. Methods like Phenograph [LSB+ 18], MAGIC [vDSN+ 18], and Seurat [BHS+ 18] use K-nearest neighbor (KNN) graphs to model relationship between cells, but the high sparsity of the data may cause these neighborhood estimates to be unreliable, and may over-simplify the complex cell and gene relationships of cell population. Deep learning methods 20 have improved on these [WAH+ 19, APYG19], with the top performing methods using GNNs, such as GraphSCI [RZL+ 21] and scGNN [WMC+ 21]. 
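The neighbor-based imputation strategy mentioned above can be sketched in a few lines: zeros in a cell's profile are replaced by the average expression of that gene among the cell's nearest neighbors. This is only a schematic illustration of the idea, not the procedure used by any of the specific packages cited here; a real method would also try to preserve biologically meaningful zero counts.

```python
import numpy as np

def knn_impute(X, k=3):
    """Replace zero counts in X (cells x genes) by the mean of the k nearest cells' values."""
    n = X.shape[0]
    # Pairwise Euclidean distances between cell profiles.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                  # exclude the cell itself
    imputed = X.astype(float).copy()
    for i in range(n):
        neighbors = np.argsort(dists[i])[:k]
        zero_mask = X[i] == 0
        # Average the neighbors' expression over the zero entries only.
        imputed[i, zero_mask] = X[neighbors][:, zero_mask].mean(axis=0)
    return imputed

rng = np.random.default_rng(0)
# Sparse toy counts: Poisson counts with random dropout-style zeroing.
X = rng.poisson(1.0, size=(6, 5)) * rng.binomial(1, 0.6, size=(6, 5))
print(knn_impute(X, k=2))
```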
CHAPTER 3

THEORETICAL PROPERTIES OF THE GEV ESTIMATOR

Let x be a d-dimensional random vector of covariates with covariance matrix Σ, let y be a one-dimensional continuous response, let x | y denote the covariate vector conditioned on y, and define Ω = cov(E[x|y]). It is advantageous to find a linear projection of x to a subspace of dimension K ≪ d such that the population centroids, E[x|y], separate the most in the projected space. This amounts to finding vectors that maximize v⊤Ωv while minimizing the overlap of the data after projection, i.e. minimizing v⊤cov(x|y)v. Assuming cov(x|y) is the same for all y, we can consider the following optimization procedure, maximizing a sequence of Rayleigh quotients:
\[
v_k^* = \operatorname*{argmax}_{v_k} \frac{v_k^{\top} \Omega v_k}{v_k^{\top} \Sigma v_k}, \quad \text{s.t. } v_k^{\top} \Sigma v_j = 0, \text{ for all } 1 \le j < k \le K, \tag{3.1}
\]
where, in classification problems with y taking discrete values (such as LDA), Ω and Σ are frequently referred to as the "between-class" and "within-class" covariance matrices, respectively.¹ It is easily verified that the quotient v_k⊤Ωv_k / v_k⊤Σv_k has critical points at the generalized eigenvectors of (Ω, Σ), so that v*_k is a critical point if Ωv*_k = ρΣv*_k for some ρ ∈ R. If Σ is invertible, the problem reduces to finding eigenvectors of Σ^{−1}Ω, but since estimates of Σ are singular in the high-dimensional regime, other means of solving the problem have to be used.

Thus the first projection vector is sought to maximize the between-class covariance relative to the within-class covariance. The procedure then seeks a second projection vector that maximizes the between-class covariance subject to the constraint that it is orthogonal to the first projection direction with respect to Σ. This is continued up to K times, where K is chosen to fully capture the signal. In this work we focus on the case K = rank(Ω), where in the applications we consider rank(Ω) ≪ d. The K projection vectors are then concatenated to obtain VK = {v*_1, . . . , v*_K}, and the projection space, V_K, is obtained by setting V_K = span{VK}, the space spanned by the linear combinations of the K projection vectors.

¹Note that the actual within-class covariance would be cov(x|y) = Σ_{x|y}. However, with the assumption of homoscedasticity, i.e. that Σ_{x|y} = Σ_{x|y′} for all y, y′, the pooled covariance Σ_P equals any within-class covariance, and with the law of total covariance Σ = Σ_P + Ω we have the equivalence
\[
\operatorname*{argmax}_{v_k} \frac{v_k^{\top} \Omega v_k}{v_k^{\top} \Sigma v_k}
= \operatorname*{argmax}_{v_k} \frac{v_k^{\top} \Omega v_k}{v_k^{\top} \Sigma_P v_k + v_k^{\top} \Omega v_k}
= \operatorname*{argmax}_{v_k} \frac{1}{\frac{v_k^{\top} \Sigma_P v_k}{v_k^{\top} \Omega v_k} + 1}
= \operatorname*{argmax}_{v_k} \frac{v_k^{\top} \Omega v_k}{v_k^{\top} \Sigma_{x|y} v_k}.
\]

However, as it stands, the above approach is undesirable from both a computational standpoint and an estimation perspective. Each subsequent projection vector relies on the estimates of all previous projection directions, so propagation of the estimation error is possible. In addition, the corresponding optimization problems are nonconvex; hence, the convergence of any optimization algorithm to the global optima is not assured. This computational intractability poses additional theoretical challenges, and thus most methods that are based on (3.1) do not have theoretical guarantees. We reformulate (3.1) such that the projection space V_K can be recovered in a simultaneous manner via a convex optimization problem we call the Generalized EigenValue (GEV) projection. Using a sparsity assumption, we formulate the proposed GEV procedure as an optimization analogous to a matrix lasso regression problem.
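As a reference point for the reformulation, note that in the classical low-dimensional setting the whole sequence (3.1) can be recovered at once from the generalized eigendecomposition of the pair (Ω, Σ). The sketch below does this on small synthetic matrices with SciPy; it is this classical route, not the GEV estimator developed in this chapter, that breaks down once the estimate of Σ becomes singular.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d, K = 10, 2

# Synthetic positive definite Sigma and rank-K positive semidefinite Omega.
B = rng.standard_normal((d, d))
Sigma = B @ B.T + d * np.eye(d)
U = rng.standard_normal((d, K))
Omega = U @ U.T

# Generalized eigenproblem  Omega v = rho Sigma v ; eigh returns ascending eigenvalues
# with eigenvectors normalized so that V^T Sigma V = I.
rho, V = eigh(Omega, Sigma)
V_K = V[:, -K:]                      # top-K generalized eigenvectors span the projection space

# The constraint v_k^T Sigma v_j = 0 for j != k holds by construction:
print(np.round(V_K.T @ Sigma @ V_K, 6))          # (approximately) diagonal
print("top generalized eigenvalues:", np.round(rho[-K:], 4))
```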
3.1 General Error Bound

Without loss of generality, we assume that µ̄ ≡ E(x) = 0. If µ̄ ≠ 0, we can center the data via x′ ≡ x − E(x), taking x′ as our centered data with mean 0. Let Ω have the eigendecomposition
\[
\Omega = \operatorname{var}\{E[x|y]\} = \sum_{k=1}^{K} \rho_k u_k u_k^{\top} = U U^{\top},
\]
where U = (√ρ1 u1, . . . , √ρK uK) ∈ Rd×K. Let W* be the solution to the following optimization problem:
\[
W^* = \operatorname*{argmin}_{W \in \mathbb{R}^{d \times K}} \; \frac{1}{2} \big\| \Sigma^{1/2} W - \Sigma^{-1/2} U \big\|_F^2. \tag{3.2}
\]

Theorem 1. Let W = span{W*}. Then we have V_K = W.

Proof. Let Q(W) = ½∥Σ^{1/2}W − Σ^{−1/2}U∥²_F. Then W* is the minimizer of Q(W) and W* satisfies ∇Q(W*) = ΣW* − U = 0, which yields W* = Σ^{−1}U. To proceed, we need the following two lemmas. Our first lemma concerns the relation between the eigenpairs of Σ^{−1/2}AΣ^{−1/2} and those of Σ^{−1}A.

Lemma 2. Let A be a symmetric matrix and Σ a symmetric positive definite matrix. If (ρ, v) is an eigenpair of Σ^{−1/2}AΣ^{−1/2}, then (ρ, Σ^{−1/2}v) is an eigenpair of Σ^{−1}A; and vice versa.

Our next lemma connects the sequential optimization problem (3.1) to the eigenpairs of Σ^{−1}Ω.

Lemma 3. The eigenvectors of Σ^{−1}Ω corresponding to the nonzero eigenvalues solve (3.1).

Using Lemma 3, we only need to show that W* is equal to the eigenvectors of Σ^{−1}Ω up to an orthogonal transformation. Suppose we have the eigendecomposition U⊤Σ^{−1}U = PΛP⊤. Left-multiplying both sides by Σ^{−1}U and right-multiplying by P, we have Σ^{−1}UU⊤Σ^{−1}UP = Σ^{−1}UPΛ, or equivalently Σ^{−1}ΩW*P = W*PΛ. Using the fact that V_K = span{W*P} = span{W*} = W finishes the proof.

3.1.1 Proof of Lemma 2

Proof. Suppose (ρ, v) is an eigenpair of Σ^{−1/2}AΣ^{−1/2}; then
\[
\Sigma^{-1/2} A \Sigma^{-1/2} v = \rho v. \tag{3.3}
\]
Multiplying both sides of (3.3) by Σ^{−1/2}, we obtain that (ρ, Σ^{−1/2}v) is an eigenpair of Σ^{−1}A, and vice versa. Since Σ^{−1/2}AΣ^{−1/2} and Σ^{−1}A have the same rank (because Σ is full rank), we further conclude that the eigenpairs of Σ^{−1/2}AΣ^{−1/2} and those of Σ^{−1}A are in one-to-one correspondence with the same eigenvalues.

3.1.2 Proof of Lemma 3

Proof. We rewrite problem (3.1) as:
\[
u_1^* = \operatorname*{argmax}_{u_1} \frac{u_1^{\top} \Sigma^{-1/2} \Omega \Sigma^{-1/2} u_1}{u_1^{\top} u_1}, \qquad
u_k^* = \operatorname*{argmax}_{u_k} \frac{u_k^{\top} \Sigma^{-1/2} \Omega \Sigma^{-1/2} u_k}{u_k^{\top} u_k} \;\; \text{s.t. } u_k \perp u_j, \; \forall\, 1 \le j < k \le K,
\]
with v*_k = Σ^{−1/2}u*_k. Applying Lemma 2 finishes the proof of Lemma 3.

Formulation (3.2) resembles the least squares loss in linear models, and the loss function in (3.2) can be regarded as a matrix version of the least squares loss. Despite its simplicity, it recovers the same projection space as produced by (3.1). Note that any estimator function Q(W) that satisfies ∇Q(W) = ΣW − U will recover V_K. We will exploit this later for an alternative algorithm that makes use of Huber norm regularization for noisy data. As it stands, however, the estimator Q(W) is not able to statistically recover W* in the paradigm of high-dimensional data. To handle high-dimensional features, we impose a sparsity assumption on the structure of W and propose to solve the following penalized regression problem:
\[
\operatorname*{argmin}_{W} \; \Big\{ \frac{1}{2} \operatorname{tr}\big(W^{\top} \Sigma W\big) - \operatorname{tr}\big(W^{\top} U\big) + \lambda \|W\|_{1,1} \Big\}. \tag{3.4}
\]
The first two terms above are just an expansion of ½∥Σ^{1/2}W − Σ^{−1/2}U∥²_F. Let Σ̂ and Û be the estimates of Σ and U, respectively. Plugging these estimates into (3.4), we obtain the sample version
\[
\widehat{W} = \operatorname*{argmin}_{W} \; \Big\{ \frac{1}{2} \operatorname{tr}\big(W^{\top} \widehat{\Sigma} W\big) - \operatorname{tr}\big(W^{\top} \widehat{U}\big) + \lambda \|W\|_{1,1} \Big\}. \tag{3.5}
\]
Let S = {(i, j) : w*_{i,j} ≠ 0} be the total support set of W* and assume W* is s-sparse, that is, |S| = s. To give the main theoretical result for the estimation error, we need the following definition of the generalized restricted eigenvalue (GRE) for matrices and the corresponding GRE condition.
To give the main theoretical result for the estimation error, we need the following definition of the generalized restricted eigenvalue (GRE) for matrices and the corresponding GRE condition. 25 Definition 4 (Generalized Restricted Eigenvalue for Matrices). Let K, m ∈ N, and γ ∈ R. The generalized restricted eigenvalue (GRE) for matrices is defined as n o κ+ (K, m, γ) = supV tr V⊤ ΣV ∥V∥2F : V ∈ C (K, m, γ) ,  b n o (3.6) κ− (K, m, γ) = infV tr V⊤ ΣV ∥V∥2F : V ∈ C (K, m, γ) ,  b where C (K, m, γ) = V ∈ Rd×K : S ⊆ J, |J| ≤ m, ∥VJ c ∥1,1 ≤ γ∥VJ ∥1,1 and J ⊂ [d].  Assumption 5. There exists K, m ∈ N and γ ∈ R and constants κ∗ , κ ∗ ∈ R such that 0 < κ∗ ≤ κ− (K, m, γ) ≤ κ+ (K, m, γ) ≤ κ ∗ < ∞. The assumption above is necessary for our theoretical development and was first proposed by [BRT09] for the vector case and various versions are standard in high-dimensional estimation literature. Our definition is a direct extension of theirs to the matrix case. Define L(W) ≡ 12 tr W⊤ ΣW − tr W⊤ U   b b as our cost function without the regularization term. As well, assume that there exists M, ρ > 0 such that 1/M ≤ ρmin (Σ) ≤ ρmax (Σ) ≤ M and ρ ≤ ρK (Ω). We are ready to state our first result on the estimation error, which concerns the W, under the event that ∥∇L(W∗ )∥∞,∞ ≤ λ /2 .  performance of c Theorem 6. Assume that Assumption 5 holds with k = K, m = s, γ = 3. Suppose that ∇L(W∗ ) ∞,∞ ≤ λ /2. Then we have √ Pc W − PW∗ F ≤ 3Mκ∗−1 ρ −1 λ s. (3.7) Remark. The theorems above follow a general method of giving error bounds on M-estimators. M-estimators are a family of estimators that combine a cost function with a regularizer, which require two properties for consistency in high dimensions: the decomposibility of the regular- izer and restricted strong convexity of their respective cost function. See [Wai19] chapter 9 for an explanation of these methods. Note that by conditioning on the random event G(λ ) = ∥∇L(W∗ )∥∞,∞ ≤ λ /2 these theorems give deterministic bounds; it is by further specifying the  application-dependent structure of Ω and Σ that leads to probabilistic bounds. In all applications this will yield a statement of the form G(λ ) holds with high probability which identifies λ = f (n, d) with a function of the sample size and dimension of the data. 26 3.1.3 Proof of Theorem 6 Sd Proof. Let S = j=1 S j be the union of supports for each projection direction. We first need the following lemma. Lemma 7. Suppose that ∥∇L(W∗ )∥∞ ≤ λ /2. For a E such that S ⊆ E and ∥E ∥0 ≤ 2s, we have W − W∗ W − W∗ )E  c E c 1,1 ≤ 3 (c 1,1 . 3.1.3.1 Proof of Lemma 7 Proof. By the mean value theorem, there exists a f W, some convex combination of c W and W∗ , W − ∇L W∗ = H(c W − W∗ ), where H = ∇2 L(f W) ∈ R p×K×p×K is a forth-order   such that ∇L c tensor   ∇ ∂∂Lw(W) ··· ∇ ∂∂Lw(W) f f  11 1K   .. ..  H=  . . .    ∇ ∂∂Lw(W) · · · ∇ ∂∂Lw(W) f f d1 dK The tensor-matrix product is defined as HW = (ai j ) p×K ∈ Rd×K , with ai j = ⟨∇∂ L(f W)/∂ wi j , W⟩. Let Γb be some sub-gradient matrix of ∥W∥1,1 evaluated at c W. The Karush-Tuhn-Tucker (KKT) W imply conditions of c W − W∗  0 = ∇L c W + λ Γ,b c W − ∇L W∗ + ∇L W∗ + λ Γ, W − W∗    = ∇L c b c W − W∗ + ∇L W∗ + λ Γ, W − W∗   = H c b c W − W∗ , we further have  Using the positive semi-definiteness of the quadratic form H c W−W ,c 0 ≤ − ∇L(W∗ ), c W − W∗ − λ Γ, b c W − W∗ . (3.8) | {z } | {z } I1 I2 Next, we bound I1 and I2 respectively. Applying Hölder inequality to I1 obtains us that ∇L(W∗ ), c W − W∗ ≥ − ∇L(W∗ ) ∞,∞ W − W∗ c 1,1 . 
27 For I2 , separating the support of Γ W − W∗ into E and E c , using the assumption E c ∩ S = 0, b and c / we have W∗E c = 0 and thus Γ W − W∗ )E c = 1E c , c b E c , (c WE c = 1E c , (c W − W∗ )E c = ∥(c W − W∗ )E c ∥1,1 since Γ b i j = sign(c Wi j ) when c Wi j ̸= 0. On the other hand, we have ⟨Γ W − W∗ )E ⟩ ≥ −∥Γ b E , (c W − W∗ )E ∥1,1 b E ∥∞,∞ ∥(c = ∥(c W − W∗ )E ∥1,1 by the Hölder inequality and the identity ∥Γ b E ∥∞,∞ = 1. Plugging the derived inequalities above, we obtain λ Γ, b cW − W∗ = λ Γ W − W∗ )E c + λ Γ b E c , (c W − W∗ )E b E , (c ≥ λ (cW − W∗ ) E c 1,1 − λ (c W − W∗ )E 1,1 . Plugging the bounds for I1 and I2 back into (3.8) yields that 0 ≤ ∇L(W∗ ) ∞,∞ W − W∗ 1,1 − λ (c c W − W∗ )E c 1,1 + λ (c W − W∗ )E 1,1 ≤ − λ − ∇L(W∗ ) ∞,∞ (c W − W∗ )E c ∥1,1  + λ + ∇L(W∗ ) ∞,∞ W − W∗ E 1,1 ,   c which further yields that λ + ∇L W∗  W − W∗ E c W − W∗ W − W∗ )E  ∞,∞  c 1,1 ≤ c E 1,1 ≤ 3 (c 1,1 , W∗  λ − ∇L ∞,∞ which completes the proof. Taking E = S in Lemma 7 and using the restrictive eigenvalue condition with lower bound κ− (K, s, 3) ≥ κ∗ , we obtain 2 W − W∗ Wt − ∇L W∗ , c Wt − W∗ .   κ∗ c 2,2 ≤ ∇L c 28 since ∇L(W ) = ΣW − U. We note that, for any matrix A, we have ∥A∥F = ∥A∥2,2 , that is the Frobenius norm and the ℓ2,2 -norm coincides. b be defined as above as a sub-gradient matrix of ∥W∥1,1 evaluated at c Let Γ W. To bound the right hand side of the inequality above, we add λ Γ, b cW − W∗ to both sides and obtain 2 W − W∗ ∇L(W∗ ) + λ Γ, W − W∗ ≤ ∇L c W − W∗ .  κ∗ c 2,2 + b c W + λ Γ, b c (3.9) Since we have W − W∗ = 0,  ∇L c W + λ Γ,b c plugging the above equality back into (3.9), we obtain 2 κ∗ cW − W∗ 1,2 ≤ − ∇L(W∗ ), c W − W∗ + − λ Γ, b cW − W∗ . (3.10) | {z } | {z } II1 II2 In a similar argument to the proof of Lemma 7 above, applying Hölder inequality to both II1 and II2 , we obtain − ∇L(W∗ ), c W − W∗ ≤ ∇L(W∗ ) ∞,∞ W − W∗ 1,1 c    = ∇L(W∗ ) ∞,∞ W − W∗ S + W − W∗  c c Sc ≤ 4 ∇L(W∗ ) W − W∗  S 1,1 , c ∞,∞ and W − W∗ ≤ −λ W − W∗ W − W∗   − λ Γ, c c S c 1,1 + λ c S 1,1 , where we use Lemma 7 in the first displayed inequality. Plugging the bounds above for II1 and II2 W −W∗ S 1,1 =  back into (3.10) and then applying the Cauchy-Schwartz inequality to the term c 1S , (cW − W∗ )S , we further obtain 2 W − W∗ ≤ 4 ∇ L W∗ W − W∗ W − W∗     κ∗ c +λ S 1,1 − λ c c F ∞,∞ S c 1,1 W − W∗ S   ≤ 4λ /2 + λ c 1,1 W − W∗ S 1,1  ≤ 3λ c √ W − W∗ S F  ≤ 3λ s c √ ≤ 3λ s c W − W∗ F 29 Canceling the term c W − W∗ F on both sides, we obtain −1 √ W − W∗ c F ≤ 3κ ∗ λ s. (3.11) We need the following lemma. Lemma 8. Let ρK,W∗ be the K-th largest singular value of W∗ . Then we must have √ −1 PcW − PW∗ F ≤ 2ρK,W∗ c W − W∗ F . 3.1.3.2 Proof of Lemma 8 Proof. Let W∗ have the singular value decomposition that W∗ = EDF⊤ , F ∈ RK×K , E ∈ R p×K are orthogonal, and D ∈ RK×K is a diagonal matrix. Then we have PW∗ = PE . Looking at the 2 regression problem infQ E − c WQ F , the least squares solution is Q = (cW⊤ cW)−1 c W⊤ E, giving 2 2 inf E − cWQ F = E − Pc W E F Q = tr EE⊤ − Pc EE⊤ − EE⊤ Pc EE⊤ Pc  W W + Pc W W    = tr I − Pc W E P using identities PcW =cW(c W⊤ c W)−1 cW⊤ , P2c = Pc W , EE⊤ = PE , and the cylic property of trace. W     tr I − Pc W E P = K − tr Pc W E P 1  1  ≥ tr (PE ) − tr Pc W P E + tr P W 2 2 c 1 2 = PE − Pc W F 2 1 2 = PW∗ − Pc W F , 2 30 where the first equality and first inequality uses the fact that PE is rank K and Pc W is at most rank 2 K. 
Therefore, taking Q = FD−1 , we can bound Pc W − PW∗ F as √ Pc W − PW∗ F ≤ 2 inf cWQ − E F Q √ ≤ 2 cWFD−1 − EDF⊤ FD−1 F √ ≤ 2 c W − W∗ F ∥D−1 ∥2 √ −1 ∗ 2 ≤ 2ρK,W ∗ W−W c F . Now using Lemma 8, we obtain that √ −1 −1 √ PcW − PW∗ F ≤ 3 2ρK,W∗ κ∗ λ s. Then it remains to lower bound ρK (W∗ ). We start by writing W∗ (W∗ )⊤ = Σ−1 ΩΣ−1 . We need the following lemma, which can be proved by the min-max theorem and thus the proof is omitted. Lemma 9. Let A ∈ Rd×d be a symmetric positive definite matrix and B ∈ Rd×d a symmetric positive semidefinite matrix. Then for any 1 ≤ k ≤ d, we have ρmin (A)ρk (B) ≤ ρk (AB) ≤ ρmax (A)ρk (B). Applying Lemma 9, we obtain that −1/2 2 ρK,W ∗ ≥ ρK (Σ ΩΣ−1/2 )ρmax−1 −2 (Σ) ≥ ρK (Ω)ρmax (Σ). Therefore, we complete the proof of desired statement by plugging the bound above. 3.2 Sliced Inverse Regression Supervised dimension reduction that preserves the conditional dependence of the data has a history in Sufficient Dimension Reduction (SDR), [Coo98]. As a method for performing SDR, the SIR method first developed in [Li91]. For a random vector x with elliptic distribution and univariate 31 response variable y = f (x), the goal of finding a low-dimensional representation of x should take into account the relationship of the data to y, ideally without losing any information which is es- sential in predicting y. The objective of SDR methods is to find, without knowing f , a subspace V ⊆ Rd such that y ⊥ x|PV x. A subspace that satisfies this property is called an effective dimen- sion reduction (EDR) space. Under some minor conditions, the intersection of all EDR spaces is also an EDR space with minimum dimension, called the central space. The minimal model for SDR methods is to assume the multiple index model where the link function takes the form ⊤ y = f (v⊤ 1 x, . . . , vK x, ε) for vi ∈ Rd for i = 1, . . . , K and the error ε is independent of x and E[ε] = 0. Thus, under this model it suffices to find span{v1 , . . . , vK } to determine the central space. The SIR estimator was one of many techniques developed for SDR, but was favored due to its simplicity and computational efficiency. The name Sliced Inverse Regression comes from both the use of the “inverse regression curve” E[x|y] as well as the sliced estimator of Ω = cov(E[x|y]) to determine the central space. As proved in [Li91], since the column space col(Ω) = Σspan{v1 , . . . , vK }, the central space can be estimated via Σ̂−1 col(Ω̂). The additional linearity condition that ∀b, vi ∈ Rd , E[b⊤ x|v⊤ ⊤ K ⊤ 1 x, . . . , vK x] = c0 + ∑i=1 ci vk x, where c0 , . . . , cK ∈ R, is re- quired, but this is automatically satisfied assuming x is elliptically distributed. [? ] proves the consistency of SIR holds if and only if lim d/n = 0, motivating high dimension methods. For the application of SDR, numerous competing procedures have been developed in the past couple decades, including [CL98, LN06, Li07]. While many approaches built flexible semi- parametric models such as projection pursuit regression [FS81], and MAVE [XTLZ02], none of these function in HDLSS scenarios. A major breakthrough was achieved in regularlized SDR methods proposed by [LZL19] using the Lasso SIR method for the HDLSS under the sparsity assumption. The GEV method is closely related to the Lasso SIR method, but has important dif- ferences from their own that leads to performance improvements in scenarios that have significant eigengaps, and has consistent performance that is as good or better elsewhere. 
A closely related method to the GEV estimator can be found in [WCZZ18] which stems from similar motivations 32 but attempts only to compute the optimal projected coordinates of the data instead of determining the projection explicitly. Their regularization uses a group lasso approach in contrast to the GEV estimator, and the non asymptotic rates of error they demonstrated are suboptimal. 3.2.1 Consistency for SIR The space Σ−1 col(Ω) is given by the span of the generalized eigenvectors of (Ω, Σ), justifying the GEV estimator. As in the original SIR estimation technique, we use the sliced estimator of Ω, defined as follows. Given the samples (xi , yi ), i ∈ [n], for a chosen constant H ∈ N, K < H < n, divide the data into H groups determined by the order statistics of y(i) that give H “slices” of the domain of y in the form of intervals (y(hi ) , y(hi+1 ) ], i ∈ {2, . . . , H − 1}, with (−∞, y(h1 ) ] and (y(hH−1 ) , ∞) at the tails. In general these may lead to different sized groups, but without loss of generality we may assume n = cH so that each slice is chosen to consist of c points. We may relabel the data as (xh, j , yh, j ) for the j-th sample in slice number h, i.e., yh, j = y(c(h−1)+ j) and xh, j = x(c(h−1)+ j) . Then the sample mean in h-th slice is x̄h ≡ 1c ∑cj=1 xh, j , allowing us to estimate b H = 1 ∑H x̄h x̄⊤ = 1 XH X⊤ , where XH = (x̄1 , . . . , x̄H ) is the d ×H matrix with the slice means Ω H h=1 h H H as columns. Here, the intuition for the estimate Ω = cov(E[x|y]) is that each sample slice mean x̄h serves as a local estimate of E[x|y ∈ Ih ] for Ih = (yh−1,c , yh,c ]. First, it is worth noting that it is not immediate can be treated as coming from the distribution x| (y ∈ Ih ); this is proven S that the samples j∈[c] xh, j to be the case in [LZL18]. Furthermore, given the estimate 1 H ∑H E[x|y ∈ Ih ]E[x|y ∈ Ih ]⊤ of Ω, it is not guaranteed to be consistent. We can compare this estimate with the classic consistent estimator 1 ⊤ of Ω. A necessary condition for the sliced based estimate to be consistent is n ∑i E[x|yi ]E[x|yi ] the average loss of variance in each slice decreases to zero as H increases. This holds automatically if E[x|y] is smooth and y is compactly supported. In general though, one needs a basic assumption on the smoothness and tail distribution of E[x|y]. We give that assumption below, ϑ -stability ([LZL19]), which is minimal and holds for many distributions. Note that while an estimate based on samples of E[x|yi ] is possible, the estimate would fair very poorly, since each mean E[x|yi ] 33 would have at best a single point estimate given by xi from the pair (xi , yi ). Given this structure of Ω b and the usual estimator for Σ b = 1 ∑ xi x⊤ , we show consistency of the n i GEV estimator given the assumption of the Stability condition. To do so, given a decomposition of ∥∇L(W∗ )∥∞,∞ into terms involving estimation error of Σ b and U,b we will use standard techniques to bound the former, but the latter will require extra work. We will show that while ∥U b − U∥∞,∞ has inherent difficulties in controlling an upper bound, if one instead uses ∥U b − Ũ∥∞,∞ there is a direct way to control the upper bound with high probability, where Ũ = PU (U) b is the projection of the estimated eigenvectors on the span of U. As we will see, this avoids any cumbersome eigengap assumptions, but requires rank(Ũ) = rank(U). To make this substitution one needs to guarantee that an estimator that uses Ũ instead of U will recover the desired subspace. 
This is proven showing lower bounds on the norms of projected eigenvectors hold with high probability given the assumption of the stability condition. It is important to note that our parameter U = √ √  ρ1 u1 , . . . , ρK ud with the additional coefficients of the square roots of estimated eigenvalues makes this easier to bound than if there was no coefficients as found in the [LZL19] model. Assumption 10 (Stability). For positive constants α1 < 1 < α2 let AH (α1 , α2 ) be the collection of all partitions −∞ = a0 < a1 < · · · < aH = ∞ of R satisfying α1 α2 ≤ P(ai ≤ y < ai+1 ) ≤ . H H The inverse regression curve E[x|y] is ϑ -sliced stable with respect to y for some ϑ > 0 if ∃α1 , α2 , α3 such that for any v ∈ Rd and partition AH (α1 , α2 ) H−1 1 α3 H ∑ var(v⊤E[x|y]|ah ≤ y ≤ ah+1) ≤ H ϑ var(v⊤ E[x|y]) h=0 for sufficiently large H. The curve is stable if it is ϑ -sliced stable for some positive constant ϑ . Theorem 11. Assume that assumption (10) holds for E[x|y], x is sub-Gaussian with variance q proxy σ , n = ρK d for some α > 1/2, and λ = C log(d) α n for some constant C. Then, there exists constants C1 ,C2 such that with probability at least 1 −C1 exp (−C2 log(d)) r ∗ log(d) ∥∇L(W )∥∞,∞ ≤ C . n 34 Corollary 12. There exist constants C,C1 ,C2 such that r s log(d) PcW − PW∗ F ≤ CMκ∗−1 ρ −1 n holds with probability at least 1 −C1 exp(−C2 log(d)) . 3.2.2 Proof of Theorem 11 Proof. We have ∇L(W∗ ) ∞,∞ ≤ ΣW b ∗ − ΣW∗ ∞,∞ + ΣW∗ − U b ∞,∞ . | {z } | {z } I1 I2 We first bound I1 . Since we have ∥AB∥∞,∞ ≤ ∥A∥∞,∞ ∥B∥1 where the latter is the operator norm, since ! ∥AB∥∞,∞ = max ∑ Ai,k Bk, j i, j k ≤ max |Ai, j | max ∑ |Bk,ℓ | i, j ℓ k = ∥A∥∞,∞ ∥B∥1 Then I1 ≤ ∥Σb − Σ∥∞,∞ ∥W∗ ∥1 , where the second factor may be treated as a constant. It suffices to bound the estimation error of Σ. √ Lemma 13. Let M e = maxi Σii . Suppose that n > 6 log d. Then, for some universal constant C, with probability at least 1 − exp(− log(d/4)), we must have s b − Σ∥∞,∞ ≤ C M log d . e ∥Σ n Proof of Lemma 13. Using a similar argument to the proof of Lemma 1 in [RWRY11], which is p  omitted for simplicity, we obtain, for all t ∈ 0,C1 M e ,    nt 2  P Σ b i j − Σi j ≥ t ≤ 4 exp − , C2 Me 35 q where C1 = 3200 and C2 = 40 are two universal constants. Therefore, taking t = 3C2 Mn e −1 log d p such that t < C1 M e and applying union bound over the d 2 entries gives s ! 3C 2 e log d M P Σ b −Σ ∞,∞ ≥ ≤ 4d −1 n q p e −1 Observing that t = 3C2 Mn log d < C1 M e implies n > 6 log d and this finishes the proof. What remains is bounding I2 . Using identity ΣW∗ = U, we have I2 = ∥U − U∥ b ∞,∞ . Let V̂ ≡ [û1 , . . . , ûK ] and Λ̂ = diag(ρ̂1 , . . . , ρ̂K ). Then let A = √1 X⊤ V̂ be a H × K matrix. Let x = z + w H H be the orthogonal decomposition with respect to col(Ω) and its complement. Then we have the decomposition XH = ZH + WH . This leads to the identity   1 b= U XH XH V̂ Λ̂−1/2 ⊤ H   1 1 = √ ZH A + √ WH A Λ̂−1/2 H H and e = PU (U) U b = √1 ZH AΛ̂−1/2 . H Then U b −U e= √1 WH AΛ̂−1/2 and H 1 1 ∥ √ WH AΛ̂−1/2 ∥∞,∞ ≤ ∥ √ WH ∥∞,∞ ∥AΛ̂−1/2 ∥1,∞ H H √ √ Note that ∥Λ̂−1/2 A∥1,∞ ≤ H∥Λ̂−1/2 A∥2,∞ by the basic inequality ∥v∥1 ≤ H∥v∥2 for any v ∈ q RH . Then ∀i ∈ [d], we have ∥A∗,i ∥2 = H1 V̂⊤ ⊤ p ∗,i XH XH V̂∗,i = ρ̂i , thus 1 ∥ √ WH AΛ̂−1/2 ∥∞,∞ ≤ ∥WH ∥∞,∞ . (3.12) H The benefit of having U in our model appears – if we had taken V̂ as our parameter and not p b we would have had the above bound times maxi ρ̂i . 
Instead we merely need to bound the U, behavior of WH to give an upper bound on I2 given the legitimacy of the substitution of U e for U, which we will see is very manageable. 36 To show the substitution of Ũ is legitimate, define W∗∗ as   ∗∗ 1 1/2 W = argmin ∥Σ W − Σ−1/2 U∥ e 2F (3.13) W 2 We show with high probability that this model recovers the desired reduction space. It is trivial to show that assuming U e is of rank K, span(W∗ ) = span(W∗∗ ). To show that the rank is K, it suffices to show that none of the projected vectors of ũi = PU (ûi ) are 0, and that the projection of the K vectors are injective. Thus we give a positive lower bound on the norms ∥ũ∥2 , and lower bounds on the angles between ∠(ũi , ũ j ) ∀i, j ∈ [K], i ̸= j. These important results come directly from [LZL19]. Theorem 14. If nρK = d α for some α > 1/2, there exists positive constants C1 , C2 , C3 , C4 , such that 1. for j = 1, . . . , K r ρK ∥ũ j ∥2 ≥ C1 ρ̂ j 2. for j = K + 1, . . . , H p r d log(d) ρK ∥ũ j ∥2 ≤ C2 nρK ρ̂ j hold with probability at least 1 −C3 exp (−C4 log(d)) . Its the first inequality that matters for our purposes; it not only gives us the lower bound on projected norms, it is also used in the next theorem: Theorem 15. The angles between any two vectors in {ũ1 , . . . , ũd } are nearly π/2 with high prob- ability. In particular there exist constants C1 ,C2 ,C3 such that p d log(d) | cos(∠(ũi , ũ j ))| ≤ C1 nρK holds with probability at least 1 −C2 exp (−C3 log(d)) . for any i ̸= j. Both proofs above are done in the case that x is Gaussian. The proofs rely on two core lemmas to show their claims hold with high probability. The first was proven in [? ] and is already done in 37 the case that x is sub-Gaussian. The other lemma gives basic tail bounds for χ 2 variables and can be extended to the case of sub-exponential random variables. Lemma 16. Let c1 , . . . , c p be positive constants. We have the following statements: 1. For d sub-Gaussian random variables x1 , . . . , xd , there exist constants C1 and C2 such that for any sufficiently small a we have ! ! 1 d 2 a2 ci (xi2 − E[xi2 ]) > a ≤ C1 exp − . (3.14) d∑ P i C2 ∑ j c2j 2. For 2d sub-Gaussian random variables x1 , . . . , x p , y1 , . . . , y p with E[xi ] = E[yi ] = 0 for all i ∈ [d], there exist constants C1 and C2 such that for any sufficiently small a, we have ! ! 1 d 2 a2 ci xi yi > a ≤ C1 exp − . (3.15) d∑ P i C2 ∑ j c2j Proof. If xi is sub-Gaussian with parameter σ , then trivially xi2 is sub-exponential since ∀t > 0 we √ −t have P xi2 ≥ t = P |xi | ≥ t ≤ 2 exp( 2σ  2 ) where the inequality comes from the sub-Gaussian property. It is straightforward to show that xi2 has sub-exponential parameters (νi , σ ) for any νi2 > E[xi4 ]. As well for any ci ∈ R, ci xi2 is sub-exponential with parameters (ci νi , ci σ ) since we have the tail bound −t 2   P(ci xi2 ≥ t) = P(xi2 ≥ t/ci ) ≤ exp 2c2i νi2 √ c2i νi2 1 ∑i c2i νi2 σ (maxi ci ) for all 0 ≤ t/ci ≤ νi2 /σ ⇔0≤t ≤ ci σ . Then d ∑i ci xi2 is sub-exponential with parameters ( d , d ). ∑i c2i νi2 Then for a ≤ dσ maxi ci we have ! 1 c2 a2 ci (xi2 − E[xi2 ]) > a ≤ 2 exp(− ). (3.16) d∑ P i ∑ j c2j νi2 (xi +y j )2 −xi2 −y2j The second statement follows likewise since xi y j = 2 is sub-exponential. √ H d log(d) Corollary 17. Let Σ1 ≡ cov(w) and IH the identity on R . Then if n is sufficiently small, the event p   1 ⊤ tr(Σ1 ) d log(d) Ω = ω ∥ WH WH − IH ∥F ≤ a H n n 38 happens with probability at least 1 −C1 exp (−C2 log(d)) . Proof. 
Since x is sub-Gaussian, XH has columns that are averages of c independent sub-Gaussians, and under linear projection each entry of WH is likewise sub-Gaussian. The term inside the Frobe- nius norm is a matrix with entries that are a sum of sub-exponential random variables, with the subtraction of the means for the squared terms. In particular if Wi j is the i j-th entry of WH and δi j is the usual Kronecker delta, ! d WkiWk j 1 ⊤ tr(Σ1 ) tr(Σ1 ) WH WH − IH = ∑ H − δi j n . H n k ij Since every Wki = 1c ∑ce=1 wki e for c = n/H, " #" # d WkiWk j 1 d c c ∑ H = c2 n ∑ ∑ wkie ∑ wke f . k k e=1 e=1 W2 h i tr(Σ1 ) When i = j we have E ∑kp Hki = n . Then each WkiWk j is sub-exponential and p ! 1 ⊤ tr(Σ1 ) d log(d) P ∥ WH WH − IH ∥F ≤ a H n n   1 ⊤ p =P ∥ 2 WH WH − tr(Σ1 )IH ∥F ≤ a d log(d) c  " #2  i=H, j=H d 1 =P  2 ∑ ∑ WkiWk j − δi j tr(Σ1 ) ≤ a2 d log(d) c ij k p ! d a d log(d) 1 ≤H 2 P 2 ∑ WkiWk j − δi j tr(Σ1 ) ≤ c k H log(d)a2   ≤C1 exp − C2 1 where the last inequality comes from the application of (16) to c2 ∑dk WkiWk j − δi j tr(Σ1 ) with first sub-exponential parameter being O(d). Given the above, the proof of Theorem 14 goes through exactly as in [LZL19]. The proof of Theorem 15 uses a transformation of XH via an orthogonal matrix T such that √1 T ZH H = (A⊤ , 0)⊤ 39 and √1 T WH H = (0, B⊤ )⊤ , where A ∈ RK×H and B ∈ R(d−K)×H . Then via this transformation the proof depends on the following events happening with high probability: (i) ρmin (AA⊤ ) ≥ λ , (ii) √ d log(d) q ∥PT ZH (T û j )∥2 ≥ C ρρ̂Kj for all j ∈ [K], and (iii) ∥B⊤ B − λ IH ∥F ≤ C n for some scalar λ > 0. (i) follows from the Sine-Theta Theorem, (ii) follows from Theorem 14, and (iii) follows from Lemma 16. q log(d) Lemma 18 (Bounding WH ). Assume λ = C n for some constant C. Then there exist con- stants C1 ,C2 such that 1 ∥ √ WH AΛ̂−1/2 ∥∞,∞ ≤ λ /2 H with probability at least 1 −C1 exp(−C2 log(d)). Proof. Using the bound in (3.12), it suffices to bound ∥WH ∥∞,∞ . As a linear function of a sub- √ Gaussian variable, each entry Wi j = 1c ∑ck=1 wkij of WH is sub-Gaussian with parameter σw / c, for some σw ∈ R, i ∈ [d] and j ∈ [H]. Then using a union bound we have r ! CH log(dH) √ p  P max |Wi j | ≥ ≤ dHP | cWi j | ≥ C log(dH) i, j 4n ≤ 2e−(C−1) log(dH) Combining the bounds for I1 and I2 we may seek bounds       ∗ λ ∗ ∗ λ ∗ λ P ∥∇L(W )∥∞,∞ ≤ ≤ P ΣW b − ΣW ∞,∞ ≤ + P ΣW − U b ≤ 2 4 ∞,∞ 4 q Setting λ = 2∥W ∥1C log(d) ∗ n yields the desired probabilities, with a change of constants. This completes the proof of the bound. 3.3 Linear Discriminant Analysis Linear Discriminant Analysis (LDA) is a classification technique, that assumes x ∈ Rd is a random vector and y ∈ {1, . . . , K + 1} is a discrete random variable over K + 1 classes, such that x|y = k ∼ 40 N(µk , Σ) for k ∈ {1, . . . , K + 1}. Given the normal model, the Bayes rule for estimating y can be explicitely derived as µk ŷ = argmax v⊤k (x − ) + log(πk ), k 2 where πk = P(y = k) and vk = Σ−1 µk for k ∈ [K + 1]. Alternatively, LDA can be viewed from the perspective of dimensionality reduction, given by Fisher’s discriminant problem where one seeks a low dimensional projection of the data such that the between-class variance is large relative to the within class variance. This problem is formulated as max v⊤ ⊤ ⊤ k Ωvk subject to vk Σvk ≤ 1, vk Σvi = 0 ∀i < k. vk where Ω ≡ cov(E[x|y]) as before. It is simple to show that this gives us exactly the same problem that motivates the GEV esti- mator for the case of a discrete response variable y. 
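For intuition, in the classical regime where $d < n$ and the covariance estimate is invertible, these discriminant directions can be computed directly from plug-in estimates of $\Omega$ and $\Sigma$ via a generalized eigensolve. The sketch below (NumPy/SciPy; names are illustrative) is only meant to make the construction concrete; in the HDLSS setting the penalized estimator (3.5) replaces the final eigensolve.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(X, y, K):
    """Unpenalized Fisher/GEV discriminant directions (classical d < n regime).

    X: (n, d) data matrix; y: (n,) integer class labels; returns d x K directions.
    """
    Xc = X - X.mean(axis=0)                        # center the data, so E[x] = 0
    Sigma_hat = np.cov(Xc, rowvar=False)           # covariance of x
    classes = np.unique(y)
    means = np.stack([Xc[y == k].mean(axis=0) for k in classes])
    Omega_hat = means.T @ means / len(classes)     # between-class covariance estimate
    # generalized eigenproblem Omega_hat v = rho * Sigma_hat v (needs Sigma_hat invertible)
    rho, V = eigh(Omega_hat, Sigma_hat)            # eigenvalues in ascending order
    return V[:, ::-1][:, :K]                       # top-K generalized eigenvectors
```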
Note here that Σ is the homoskedastic within- class covariance, and not cov(x). However, it is straightforward to show using the law of total covariance that the generalized eigenvectors of (Ω, Σ) are the same as those of (Ω, cov(x)) (with different eigenvalues). Thus we can treat Σ above as cov(x) without affecting the problem. The vectors vk determined by the optimization are called the discriminant vectors, and for a d × n data matrix X a classification rule is obtained by computing the projection V⊤ X for V ≡ [v1 , . . . , vK ], and then assigning each observation to the nearest V⊤ µk . In recent years, several references have extended LDA to the high dimensional setting using sparsity, most of which use a lasso penalty [THNC03, Len08, CHWE11]. The derivation of LDA using Fisher’s Discriminant is leveraged in the well-known method found in [WT11], that uses a group lasso regularization with a diagonal estimate for the covariance of the data. A powerful technique called MSDS for multiclass sparse discriminant analysis is given in [ZMY18], which has both theoretical and empirical justification. This estimator attempts to find sparse discriminant vectors from the Bayes rule of the normal model for LDA, whereas our own method uses the Fisher’s Discriminant derivation. The authors also show the estimators of ROAD [FFT12], and DSDA [MZY12], occur as special cases of MSDS up to particular constants. While error bounds 41 are given on the estimator, they are not in the form of a function of the data dimension and sample size. We show our estimator performs better empirically and has strong error bounds in terms of the data dimension and sample size. 3.3.1 Consistency for LDA Fisher’s Discriminant problem in many ways can be seen as a special case of SIR for a discrete output of y. With Ω as defined above and with the assumption that E[x] = 0, its estimator simplifies 1 K+1 ⊤ 1 to Ω̂ = K+1 ∑k=1 x̄k x̄k where x̄k = nk ∑ni=1 xi 1(yi = k). Here we use 1(yi = k) as the indicator variable that xi is of class k, and nk = ∑ni=1 1(yi = k) is the number of samples that are of class p p k. Let Û = ( ρ̂1 û1 , . . . , ρ̂K ûK ) be columns using the eigenpairs (ρ̂k , ûk ) of Ω̂. Then the GEV solution Ŵ of (3.5) using Σ̂ and Û gives us K estimated discriminant vectors which can be used for LDA classification. Due to the similarity of LDA to SIR, much of the proof of consistency of the discriminant vectors can proceed analogously to that of SIR. Here, the Stability assumption (10) takes the form of a “balanced clusters” assumption. Since the Stability assumption holds for any H sufficiently large, we may take H = K + 1 so that for γ1 = K + 1 mink∈[K+1] πk and γ2 = K maxk∈[K] πk , any partition AH (γ1 , γ2 ) will separate the support values of y into intervals [ak−1 , ak ], k − 1 < ak−1 < k < ak < k + 1. Thus for k ∈ [K + 1], the random vector  E[x|y] ak−1 ≤ y ≤ ak = E[x|y = k] is a constant. Then var(v⊤ E[x|y] ak−1 ≤ y ≤ ak ) = 0 for all v ∈ Rd , k ∈ [K + 1]. The existence of positive constants γ1 < 1 < γ2 that serve as bounds to the probabilities of the classes of x guarantee that there are no asymptotic behaviors of the form limd→∞ πk = 0 as a function of dimensionality of x. For the sake of simplicity, we assume all probabilities πk are equal, with the extension to the general case being straightforward. Theorem 19. Let x be multivariate Gaussian, n = ρK d α for some α > 1/2, and y = f (x) has finite support along with assumption (10) holding for E[x|y]. 
Then, with probability at least 1 − 42 C1 exp (−C2 log(d)), we have r ∗ log(d) ∥∇L(W )∥∞,∞ ≤ C . n Corollary 20. Assume that 1/M ≤ ρmin (Σ) ≤ ρmax (Σ) ≤ M and ρ ≤ ρK (Ω). Then there exist constants C,C1 ,C2 such that r s log(d) PcW − PW∗ F ≤ CMκ∗−1 ρ −1 n holds with probability at least 1 −C1 exp(−C2 log(d)) . 3.3.1.1 Proof of Theorem 19 Proof. For the sake of simplicity, we assume all probabilities πk are equal, with the extension to the general case being straightforward. As earlier, we have ∇L(W∗ ) ∞,∞ ≤ ΣW b ∗ − ΣW∗ ∞,∞ + U−U b ∞,∞ . | {z } | {z } I1 I2 The bound on I1 follows from the theorem used earlier in [RWRY11]. What remains is bound- ing I2 . Let Vb = [b u1 , . . . , u b = diag(ρ̂1 , . . . , ρ̂K ), and XK+1 = (x̄1 , . . . , x̄K+1 ). Then let A = √ 1 X⊤ V b K ], Λ b K+1 K+1 be a (K + 1) × K matrix. Using the same decomposition x = z + w with respect to col(Ω) and its complement we get XK+1 = ZK+1 + WK+1 . Since 1 ⊤ = Ω, K+1 XK+1 XK+1 b we get the identity   1 1 b −1/2 b= U √ ZK+1 A + √ WK+1 A Λ K +1 K +1 and e = PΩ (U) U b = √1 ZK+1 AΛ b −1/2 . H Then U b −Ue= √ 1 WK+1 AΛ b −1/2 and K+1 1 b −1/2 ∥∞,∞ ≤ ∥ √ 1 WK+1 ∥∞,∞ ∥AΛ b −1/2 ∥1,∞ . ∥√ WK+1 AΛ K +1 K +1 43 √ Since ∥Λb −1/2 A∥1,∞ ≤ K + 1∥Λ b −1/2 A∥2,∞ , ∀i ∈ [d], and r 1 b⊤ p ∥A∗,i ∥2 = V∗,i XK+1 X⊤ K+1 b ∗,i = ρ̂i , V K +1 we have 1 b −1/2 ∥∞,∞ ≤ ∥WK+1 ∥∞,∞ . ∥√ WK+1 AΛ K +1 As in the SIR application, we are able to use U e in a substitute model (3.13) that recovers the desired reduction space with high probability. All that needs to be shown is that with high probability U e is of rank K. We may directly use the results of (14) and (15), since all assumptions of SIR hold in the LDA application, with the only requirement being a similar simplifying assumption that ∃c > 0 such that c(K + 1) = n. Thus we get that U e is of rank K with probability at least 1 −C1 log(−C2 d) for constants C1 ,C2 . q log(d) Lemma 21 (Bounding WK+1 ). Assume λ = C n for some constant C. Then there exist constants C1 ,C2 such that 1 b −1/2 ∥∞,∞ ≤ λ /2 ∥√ WK+1 AΛ K +1 with probability at least 1 −C1 exp(−C2 log(d)). Proof. Without loss of generality, each entry Wi j = 1c ∑ck=1 wkij of WH is Gaussian with variance var(w)/c2 , assuming each wkij is Gaussian with variance var(w) for i ∈ [d] and j ∈ [K + 1]. Then using a union bound we have r ! C(K + 1) log(d(K + 1)) P max |Wi j | ≥ i, j 4n  p  ≤ d(K + 1)P |c2Wi j | ≥ C log(d(K + 1)) ≤ 2e−(C−1) log(d(K+1)) This completes the proof. 44 Given the consistency of c W, it is straightforward to show the consistency of the classification rate of the estimated classifier. The classifier Ybc W gives the label k to a new point x′ if µk is the nearest class centroid to x′ after projection by c W: Ybc W (X) ≡ argmin ∥cW⊤ x′ − c W⊤ µ̂k ∥22 . k Let the misclassification error rate be given by Rn ≡ P(Yb ̸= Y |observed data) where Y is the true la- bel. Likewise define R as the misclassification rate of the population classifier using the parameters W∗ and µk . Then we have the following. Theorem 22. There exist constants C,C1 ,C2 such that r s log(d) |Rn − R| ≤ Cκ∗−1 ρ −1 n with probability at least 1 −C1 exp(−C2 log(d)) . 3.3.1.2 Proof of Theorem 22 Proof. Let Y be the classifier using the population parameters of W∗ and µk . Define lk = ∥W∗⊤ X− W∗⊤ µk ∥22 and lˆk = ∥c W⊤ X − c W⊤ µ̂k ∥22 . For any ε > 0, we have Rn − R ≤ P(Yb ̸= Y ) ≤ 1 − P(|lˆk − lk | < ε/2, |lk − lk′ | > ε, for any k, k′ ) ≤ P(|lˆk − lk | ≥ ε/2 for some k) + P(|lk − lk′ | ≤ ε for some k, k′ ). 
For the first term |lˆk − lk | ≤ |(c WcW⊤ µ̂k − W∗ W∗⊤ µk )⊤ X| + |µ̂⊤ cc⊤ ⊤ ∗ ∗⊤ k WW µ̂k − µk W W µk |. √ It is straightforward to show given (3.11) that bounds ∥c W − W∗ ∥F ≤ 3κ∗−1 λ s, we have ∥cWcW⊤ − √ W∗ W∗⊤ ∥2 ≤ Cκ∗−1 λ s and given a standard error bound on sample mean estimation we get |lˆk − q lk | ≤ Cκ∗−1 s log(d) n with probability 1 −C1 exp(−C2 log(d)). The expression |lk − lk′ | is normally distributed with mean (µk −µk′ )⊤ W∗ W∗⊤ (µk +µk′ ) and variance (µk −µk′ )⊤ W∗ W∗⊤ ΣW∗ W∗⊤ (µk − 45 µk′ ), and since we have the assumption (5) which implies (µk − µk′ ) is bounded away from 0, with q high probability |lk − lk′ | > Cκ∗−1 s log(d) ′ n . Applying a union bound over k, k for both terms yields the result. 3.4 Minimax Rate We show that both the error rates for SIR and LDA fall under a model which achieve minimax rates. The setup closely follows [TSY20]. Define the GEV parameter space as follows: let {J1 , . . . , JH } for H > K be a measurable partition of the sample space of y, and define ỹ = ∑H c=1 c · 1(y ∈ Jc ) as the discretized version of y. Then we may define the corresponding conditional covariance e ≡ cov[E(x|ỹ)]. Let F (s, d, K, γ̃; κ, M) be the set of all pairs of matrices (Σ, Ω) Ω e with generalized eigenpairs (γ̃i , ṽi ), γ̃i ∈ R, ṽi ∈ Rd for i ∈ [K], such that 1. ∑K i=1 ∥ṽi ∥0 = s. 2. 1/M ≤ ρmin (Σ) ≤ ρmax (Σ) ≤ M. 3. κ γ̃ ≥ γ̃1 ≥ · · · ≥ γ̃K ≥ γ̃ > 0 for a fixed constant κ > 1. For simplicity denote F (s, d, K, γ̃; κ, M) by F . Let L(x) denote the distribution of a random variable x. Then the probability spaces for the GEV problem are P (n, H, s, d, K, γ̃; κ, M) = {L((x1 , ỹ1 ), . . . , (xn , ỹn )) : (xi , ỹi )’s are i.i.d. such that (cov[E(xi |ỹi )], cov(xi )) ∈ F }, where n is the sample size, and parameters s, d, and γ̃ may depend on n, while κ, M are fixed constants. Note that this framework gives a fixed slicing scheme where H is treated as a bounded integer, so that K is also bounded above by H − 1. Denote V e ∈ Rd×K as the matrix where each column is one of the generalized eigenvectors of (Σ, Ω) e with nonzero eigenvalue. Let V b be any estimator for V.e The following provides a lower bound on the minimax rate among all estimators. 46 Theorem 23. Assume nγ̃ 2 ≥ C log eds for some sufficiently large constant C0 . Then there exist positive constants C1 and C2 such that   s log(ed/s) inf sup P ∥PV̂ − PṼ ∥2F ≥ C1 ∧C2 ≥ 0.8, V̂ P∈P nγ̃ 2 where P = P (n, H, s, d, K, γ̃; κ, M). 3.4.1 Proof of Theorem 23 The proof follows from [TSY20], for completeness we include the outline of the proof and the relevant papers for intermediate steps. Since any special case of the estimation problem yields a lower bound for the general case, we are able to specify further that the data distribution has an assumption of normality on the conditional distribution x|ỹ. We specify a subset of the parameter space as follows: let K = 1, H = 2, and for i = 1, 2, let xi |(ỹi = 1) ∼ Nd ((1 − α)v, Id − Ω̃), P(ỹ = 1) = α, xi |(ỹi = 2) ∼ Nd (−αv, Id − Ω̃), P(ỹ = 2) = 1 − α. Here we have v ∈ O(d, 1), (unit length d-vectors), and for 0 < α < 1 and γ = α we have E[x] = 0, Ω̃ = cov(E[x|ỹ]) = γvv⊤ , and Σ = cov(x) = Id for Id the identity matrix on Rd . The minimax results are derived using Fano’s Lemma found in [Yu97] (Lemma 3). Lemma 24 (Fano’s Lemma). Let (Θ, ρ) be a metric space and {Pθ : θ ∈ Θ} a collection of proba- bility measures. 
For any totally bounded T ⊂ Θ, denote by M(T, ρ, ε) the ε-packing number of T with respect to ρ, that is, the maximimal number of points in T whose pairwise minimum distance in ρ is at least ε. Define the Kullback-Leibler diameter of T by dKL (T ) ≡ sup D(Pθ ∥Pθ ′ ) θ ,θ ′ ∈T for D(Pθ ||Pθ ′ ) the KL divergence between distributions Pθ and Pθ ′ . Then ε2   2 dKL (T ) + log 2 inf sup Eθ [ρ (θ̂ (x), θ )] ≥ sup sup 1− θ̂ θ ∈Θ T ⊂Θ ε>0 4 log M(T, ρ, ε) 47 and equivalently, ε2   2 dKL (T ) + log 2 inf sup Pθ ρ (θ̂ (x), θ ) ≥ ≥ 1− . θ̂ θ ∈Θ 4 log M(T, ρ, ε) Then two steps are required, first determining the Kullback-Leibler divergence between the data distributions in T , and second determining the ε-packing of T . The first task is straightforward. For i = 1, 2, let Σ = Id and Ω e i = γvi v⊤ , for γ ∈ (0, 1), vi ∈ i O(d, 1). Then let P(Ω e i , Σ) denote the distribution of a random i.i.d. sample of size n from the mixture Guassian distribution P(Ω e i , Σ) = αP1 (Ω e i , Σ) + (1 − α)P2 (Ω e i , Σ), where P1 (Ω e i , Σ) and e i , Σ) denote multivariate normal distributions Nd ((1 − α)vi , Id − Ω P2 (Ω e i ) and Nd (−αvi , Id − Ω e i ), respectively. Then using convexity of the K-L divergence, for two mixture-Gaussian distributions P(Ωe 1 , Σ) and P(Ω e 2 , Σ) we have D(P(Ω e 1 , Σ)||P(Ωe 2 , Σ)) ≤ αD(P1 (Ω e 1 ||Σ), P1 (Ω e 2 , Σ)) + (1 − α)D(P2 (Ω e 1 , Σ)||P2 (Ω e 2 , Σ)), thus it is sufficient to bound the K-L divergence between two Gaussian distributions. Using the explicit formula for the K-L divergence between Gaussian distributions and the prop- erties of the chosen parameters, the authors of [TSY20] determine the upper bound 2 D(P(Ωe 1 , Σ)||P(Ω e 2 , Σ)) ≤ 3γ · n∥v1 − v2 ∥22 . 1 − γ2 Once the K-L divergence is determined, the second task proceeds according to a well estab- lished packing argument that can be found in [CGRZ13] or [CMW13]. Corollary 25. Assume that ρ is a lower bound of both ρK and MγK , where γK is the K-th general- ized eigenvalue of the pair (Ω, Σ). Then GEV estimator obtains the minimax rate up to constants. 3.4.2 Proof of Corollary 25 Proof. The assumption that ∑K ∗ i=1 ∥ṽi ∥0 = s is equivalent to the sparsity assumptions on W . We have ∃γ > 0 such that γ < γK for generalized eigenpair (vK , γK ) of (Ω, Σ). It is simple to show that with the assumption ρ < MγK , then we may substitute γ with ρ/M for a lower bound on γK . 48 One needs to compare γ with γ̃. We can do this with the stability theorem and Weyl’s theorem. Let P(y ∈ Jh ) = Ph and µh = E[x|y ∈ Jh ]. We need the following from [? ] (Lemma 11 in supplementary materials). Lemma 26. Define the event E(ε) = {ω |Ph − H1 | > ε, ∀H}. There exist a positive constant C such 4 that, for any ε > Hc−1 , we have √ ε2 P(E(ε)) < CH 2 Hc + 1 exp(−(Hc + 1) ) 32 for sufficiently large H and c. Then for any v ∈ Rd we have |v⊤ Ωv⊤ − v⊤ Ωve ⊤ | = |v⊤ cov(E[x|y])v − v⊤ cov(E[x|ỹ])v| H H = v⊤ ∑ Ph cov(E[x|y]|y ∈ Jh )v − v⊤ ∑ Ph µh µh⊤ v h h H = ∑ Ph v⊤ cov(E[x|y]|y ∈ Jh )v − v⊤ µh µh⊤ v h  1 ≤ H +ε ∑ v⊤cov(E[x|y]|y ∈ Jh)v h α3 ≤ (1 + Hε) ϑ v⊤ Ωv H where the last inequality follows from the Stability Assumption (10). Taking maximum over norm 1 vectors yields ∥Ω − Ω∥ e 2 ≤ (1 + Hε) αϑ3 ρ1 . From Weyl’s inequality we have |ρK (Ω) − H e ≤ ∥Ω − Ω∥ ρK (Ω)| e 2 . Thus ρ − (1 + Hε) αϑ3 ρ1 < ρK (Ω), e and serves as a lower bound. Then we H α ρ−(1+Hε) ϑ3 H can replace γ̃ with M . This completes the proof. 
3.5 Canonical Correlation Analysis

The GEV estimator can also be applied to the Canonical Correlation Analysis (CCA) problem [Hot33], for which a number of techniques have been proposed in the past two decades to perform the task under HDLSS conditions where sparsity is assumed on the canonical directions [WKI08, WTH09, PTB09, HST11]. [CGRZ13] first gives the characterization of the probabilistic CCA model for sparse canonical directions and presents the CAPIT method for the problem. Rates of convergence are given that depend on an independent estimate of the precision matrices of the two data sources, for which even diagonal approximations can be difficult to compute. A modern standard for estimation in this problem is the Penalized Multivariate Analysis (PMA) method [WTH09], which estimates a regularized version of the singular value decomposition. Our estimator is shown to perform better empirically on simulations for sparse CCA.

Canonical Correlation Analysis is a classical technique that finds the linear combinations of two sets of random variables with maximal correlation. It has been applied to a number of different fields, including psychology, neurology, genomics, and economics. Let $x \in \mathbb{R}^{d_1}$ and $y \in \mathbb{R}^{d_2}$ be zero-mean random vectors with joint covariance matrix
$$\Sigma = \begin{pmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{pmatrix},$$
where $\Sigma_x = (\Sigma_{x,k\ell})$ and $\Sigma_y = (\Sigma_{y,k\ell})$ are the covariance matrices of $x$ and $y$, respectively, and $\Sigma_{xy} = (\Sigma_{xy,k\ell}) = \Sigma_{yx}^\top$ is the cross-covariance matrix between $x$ and $y$. Then CCA determines the $K$ canonical direction vectors by solving
$$\max\ v_{xk}^\top \Sigma_{xy} v_{yk}, \quad \text{subject to } v_{xk}^\top \Sigma_x v_{xk} = v_{yk}^\top \Sigma_y v_{yk} = 1, \ \ v_{xk}^\top \Sigma_x v_{xj} = v_{yk}^\top \Sigma_y v_{yj} = 0 \tag{3.17}$$
for $k \in [K]$ and $j < k$. The optimization problem can be solved by applying the singular value decomposition to the matrix $\Sigma_x^{-1/2} \Sigma_{xy} \Sigma_y^{-1/2}$, and a sample version is given by replacing the covariances with their usual estimators. This leads to consistent estimation of the canonical directions when the dimensions $d_1$ and $d_2$ are fixed and the sample size $n$ increases.

In the high-dimensional setting, when the dimensions exceed the sample size, one cannot compute the inverses of the sample covariances. This leads to the structural assumption of sparsity in the canonical directions, which allows for successful estimation. As shown in [CGRZ13], the pairs $(v_{xk}, v_{yk})$ are solutions to (3.17) if and only if
$$\Sigma_{xy} = \Sigma_x \Bigl( \sum_{k=1}^{K} \lambda_k v_{xk} v_{yk}^\top \Bigr) \Sigma_y$$
for some $\lambda_k > 0$, giving the correlation weights. We show that (3.17) is a special case of the generalized eigenvalue problem (3.1) with
$$\Omega = \begin{pmatrix} 0 & \Sigma_{xy} \\ \Sigma_{yx} & 0 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_x & 0 \\ 0 & \Sigma_y \end{pmatrix}, \qquad \text{and} \quad v_k = \begin{pmatrix} v_{xk} \\ v_{yk} \end{pmatrix}.$$
Substituting the above into (3.1) yields
$$v_k^* = \mathop{\mathrm{argmax}}_{v_{xk}, v_{yk}} \frac{2\, v_{xk}^\top \Sigma_{xy} v_{yk}}{v_{xk}^\top \Sigma_x v_{xk} + v_{yk}^\top \Sigma_y v_{yk}}.$$
It is straightforward to show this is equivalent to (3.17). Now assume that $\Sigma_{xy}$ has the singular value decomposition
$$\Sigma_{xy} = \sum_{k=1}^{K} \sigma_k\, u_{x,k}\, u_{y,k}^\top = O_x D O_y^\top,$$
where $O_x$ and $O_y$ collect the left and right singular vectors as columns and $D = \mathrm{diag}(\sigma_1, \ldots, \sigma_K)$. Then we have
$$\Omega = \begin{pmatrix} 0 & O_x D O_y^\top \\ O_y D O_x^\top & 0 \end{pmatrix}.$$
We need the following lemma, which gives the eigendecomposition of $\Omega$.

Lemma 27. We have $\Omega = Q \Lambda Q^\top$, where
$$Q = \frac{1}{\sqrt{2}} \begin{pmatrix} O_x & O_x \\ O_y & -O_y \end{pmatrix} \qquad \text{and} \qquad \Lambda = \begin{pmatrix} D & 0 \\ 0 & -D \end{pmatrix}.$$

Proof. The proof of this lemma follows from direct calculation.

Using Lemma 27, we observe that $U$ in (3.4) can be taken as $(O_x^\top, O_y^\top)^\top / \sqrt{2}$. An estimator of $U$ can be obtained by concatenating the top-$K$ left and right singular vectors of $\hat{\Sigma}_{xy} = n^{-1} \sum_{i=1}^{n} X_i Y_i^\top$, where the $X_i$'s and $Y_i$'s are independent and identically distributed samples of $x$ and $y$, respectively.
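To make the construction concrete, the plug-in pair $(\hat{\Sigma}, \hat{U})$ for the CCA instance of the GEV problem can be assembled directly from centered data matrices. The following minimal NumPy sketch mirrors the description above; the function name and shapes are our own illustrative choices, not a packaged implementation.

```python
import numpy as np

def cca_gev_inputs(X, Y, K):
    """Plug-in Sigma_hat (block-diagonal) and U_hat (stacked singular vectors / sqrt(2)).

    X: (n, d1) and Y: (n, d2) with centered columns; K: number of canonical pairs.
    """
    n = X.shape[0]
    Sxy = X.T @ Y / n                                 # cross-covariance estimate
    Sx, Sy = X.T @ X / n, Y.T @ Y / n                 # marginal covariance estimates
    Sigma_hat = np.block([[Sx, np.zeros_like(Sxy)],
                          [np.zeros_like(Sxy.T), Sy]])
    Ox, sig, OyT = np.linalg.svd(Sxy, full_matrices=False)
    U_hat = np.vstack([Ox[:, :K], OyT[:K, :].T]) / np.sqrt(2.0)
    return Sigma_hat, U_hat
```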
Thus we may apply the GEV estimator (3.5) to the CCA problem with the estimates for $\hat{U}$ and $\hat{\Sigma}$ defined above, and recover the spaces given by $\mathrm{span}\{v_{x1}, \ldots, v_{xK}\}$ and $\mathrm{span}\{v_{y1}, \ldots, v_{yK}\}$.

CHAPTER 4
EMPIRICAL RESULTS OF THE GEV ESTIMATOR

4.1 Implementation

To efficiently solve for $W$ in (3.5), we implement the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) of [BT09]. FISTA is a modification of the iterative first-order method ISTA used to solve $\ell_1$-regularized convex optimization problems; the modification uses a version of Nesterov acceleration [Nes83] to achieve a convergence rate of $O(1/k^2)$. Define $S_\lambda(\cdot)$, the element-wise soft-thresholding operator arising from the proximal step for the penalty in (3.4), as follows:
$$\bigl(S_\lambda(A)\bigr)_{ij} = \begin{cases} \mathrm{sgn}(a_{ij})\,(|a_{ij}| - \lambda), & \text{if } |a_{ij}| > \lambda, \\ 0, & \text{otherwise.} \end{cases}$$

Algorithm 4.1: A fast iterative shrinkage-thresholding algorithm for GEV.
Input: $U$, $\Sigma$, and $\lambda$. Initialization: take $W_X^{(0)} \in \mathbb{R}^{d \times K}$, $\rho^{-1} = 1/\rho_1(\Sigma)$, $t^{(0)} = 1$, and $k = 0$.
Output: $W$
1 repeat
2   $W_Y^{(k+1)} = S_{\rho^{-1}\lambda}\bigl( W_X^{(k)} - \rho^{-1} (\Sigma W_X^{(k)} - U) \bigr)$;
3   $t^{(k+1)} = \bigl(1 + \sqrt{1 + 4 (t^{(k)})^2}\,\bigr)/2$;
4   $W_X^{(k+1)} = W_Y^{(k+1)} + \frac{t^{(k)} - 1}{t^{(k+1)}} \bigl( W_Y^{(k+1)} - W_Y^{(k)} \bigr)$
5 until convergence;

Since the GEV estimator acts like a matrix version of Lasso regression, there are many algorithms one could apply to this $\ell_1$-regularized optimization problem. We compared the following methods for a fixed choice of the hyper-parameter $\lambda$ giving the regularization weight: the subgradient method [Nes04, Chapter 3.1], the proximal method [Roc70], FISTA, ADMM [BPC+11], and Chambolle-Pock [CP11]. The model used to test convergence is taken from the SIR simulations below, using Model 1 of the continuous responses, with Gaussian features and noise, and $d = 50$, $n = 200$.

Figure 4.1: Comparison of convergence rates of different algorithms.

As expected, in Figure 4.1 the subgradient method is suboptimal compared to all other methods, with a very slow convergence rate. Because only a small number of iterations is required to reach a stable critical point, all the other methods performed essentially equivalently in terms of iterations, with the exception of Chambolle-Pock, which had convergence problems after the first iteration. Important differences also occurred in run-time; ADMM in particular requires the computation of a matrix inverse, which drastically increases its run-time. Due to the simplicity of implementation, the speed of convergence, and the quick run-time, we retain the FISTA implementation of the GEV method above.

4.1.1 Robust Modification

The algorithm above is sufficient for most applications of the GEV method, but an important case arises for data drawn from heavy-tailed noise distributions or data containing outliers. In many applications, the assumption of sub-Gaussian tails is unrealistic; functional magnetic resonance imaging (fMRI) data [ENK16] and microarray gene expression data [WPL15] have been observed to have heavy tails and large kurtosis, regardless of the normalization methods used. The kurtosis of a random variable $X$ is defined as the fourth standardized moment $E\bigl[\bigl((X - \mu)/\sigma\bigr)^4\bigr]$, and high values indicate either that the probability mass is concentrated around the mean while the data-generating process occasionally produces values far from the mean (i.e., outliers), or that the probability mass is concentrated in the tails of the distribution.
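For reference, the FISTA iteration of Algorithm 4.1 is short enough to state as code. The sketch below is a minimal NumPy rendering of the update rules above, with our own illustrative function names and a simple stopping rule; it is not the packaged implementation used in the experiments.

```python
import numpy as np

def soft_threshold(A, thr):
    """Element-wise soft-thresholding operator S_thr(A)."""
    return np.sign(A) * np.maximum(np.abs(A) - thr, 0.0)

def fista_gev(Sigma_hat, U_hat, lam, n_iter=500, tol=1e-8):
    """Minimal sketch of Algorithm 4.1 applied to the penalized problem (3.5)."""
    d, K = U_hat.shape
    step = 1.0 / np.linalg.eigvalsh(Sigma_hat).max()   # step size 1 / rho_1(Sigma_hat)
    W_x = np.zeros((d, K))
    W_y = W_x.copy()
    t = 1.0
    for _ in range(n_iter):
        W_y_prev = W_y
        grad = Sigma_hat @ W_x - U_hat                 # gradient of the smooth part
        W_y = soft_threshold(W_x - step * grad, step * lam)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        W_x = W_y + ((t - 1.0) / t_next) * (W_y - W_y_prev)
        t = t_next
        if np.linalg.norm(W_y - W_y_prev) < tol:       # simple convergence check
            break
    return W_y
```

Since the smooth term being minimized here is a squared loss, a few grossly outlying entries in the inputs can dominate the gradient, which is precisely the weakness the robust modification addresses.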
To that end, we robustify the matrix square loss by introducing the following matrix Huber loss as a substitute for the Frobenius loss term in (3.2):
$$L_\alpha(W) = \sum_{ij} \ell_\alpha\bigl( [\Sigma W]_{ij} - U_{ij} \bigr),$$
where
$$\ell_\alpha(x) = \begin{cases} x^2, & \text{if } |x| \le \alpha^{-1}, \\ 2\alpha^{-1} |x| - \alpha^{-2}, & \text{otherwise.} \end{cases}$$
The Huber loss [Hub73] $\ell_\alpha(x)$ is quadratic for small values of $x$ and becomes linear when $|x|$ gets larger. The parameter $\alpha$ controls the blending of quadratic and linear loss. The least squares loss and the least absolute deviation (LAD) loss can be regarded as the two extremes of the Huber loss, corresponding to $\alpha = 0$ and $\alpha = \infty$, respectively. Using this loss in the $\ell_1$-regularized scheme, we get the estimate
$$\hat{W}_{\alpha,\lambda} = \mathop{\mathrm{argmin}}_{W \in \mathbb{R}^{d \times K}} \Bigl\{ \sum_{ij} \ell_\alpha\bigl( [\hat{\Sigma} W]_{ij} - \hat{U}_{ij} \bigr) + \lambda \|W\|_{1,1} \Bigr\}. \tag{4.1}$$
Notice, however, that the term $(\Sigma^{1/2} W - \Sigma^{-1/2} U)$ has been replaced inside the Huber loss by the term $(\Sigma W - U)$. As noted after the proof of Theorem 1, either expression, when used in the Frobenius norm loss, yields the same solution for the space $\mathcal{V}$, but they give different models when combined with the $\ell_1$-regularization term and used in (3.4). Computationally, if the expression $(\Sigma W - U)$ is used in Algorithm 4.1, the gradient requires the computation of $\Sigma^2$, which leads to greater run-time and worse performance, likely due to floating-point error, compared with the expression $(\Sigma^{1/2} W - \Sigma^{-1/2} U)$. However, when using the Huber loss, the gradient computation with the expression $(\Sigma^{1/2} W - \Sigma^{-1/2} U)$ would require $\Sigma^{-1/2}$. This is statistically and computationally undesirable, especially in the case $d > n$ where $\hat{\Sigma}$ is singular. Then, with $D_\alpha$ as the gradient of the Huber loss,
$$\bigl(D_\alpha(A)\bigr)_{ij} = \begin{cases} 2 a_{ij}, & \text{if } |a_{ij}| \le \alpha^{-1}, \\ 2\alpha^{-1} \mathrm{sgn}(a_{ij}), & \text{otherwise,} \end{cases}$$
we define the algorithm below.

Algorithm 4.2: Huber loss algorithm for robust GEV.
Input: $U$, $\Sigma$, $\lambda$ and $\alpha$. Initialization: take $W_X^{(0)} \in \mathbb{R}^{d \times K}$, $\rho^{-1} = 1/\rho_1(\Sigma)$, $t^{(0)} = 1$, and $k = 0$.
Output: $W$
1 repeat
2   $W_Y^{(k+1)} = S_{\alpha\lambda}\bigl( W_X^{(k)} - \rho^{-2}\, \Sigma D_\alpha(\Sigma W_X^{(k)} - U) \bigr)$;
3   $t^{(k+1)} = \bigl(1 + \sqrt{1 + 4 (t^{(k)})^2}\,\bigr)/2$;
4   $W_X^{(k+1)} = W_Y^{(k+1)} + \frac{t^{(k)} - 1}{t^{(k+1)}} \bigl( W_Y^{(k+1)} - W_Y^{(k)} \bigr)$
5 until convergence;

4.2 Sliced Inverse Regression

We compare our method of applying GEV to the sliced inverse regression problem against the classical method, which we label SIR [Li91], and the modern method LassoSIR (LSIR) [LZL19] used for high-dimensional problems. To facilitate a fair comparison with SIR and LSIR, all the simulation studies are generated under forward models including both categorical and continuous responses for low (d = 100) and high (d = 1000) dimensional predictors. Throughout the simulations, we use K-fold cross-validation (CV) to select the tuning parameters and quantify the estimation accuracy using three different metrics defined as follows: the canonical correlation (CCA) between
, k − 1, 1 + ∑K−1 j=1 f j (x) where K is the number of categories and fk (x) is the component connecting x with y. We consider the following two different models of fk (x): 1. Model 1: fk (x) = sin(x⊤ βk /4) + 1; 2. Model 2: fk (x) = exp(x⊤ βk ). For both models, the fk (x) components are monotone within the domain of x, so they are favorable to SIR. Moreover, we can see that model 2 is actually the multinomial logistic regression. The simulation includes comparing all combination of the two scenarios, two models and the two configurations (n, d) = (200, 50) and (500, 1000) with 100 replicates. As shown in Table 4.1, GEV-SIR dominates SIR and LSIR for all three metrics. 56 Table 4.1: Summary of estimation accuracy for categorical response in low and high dimensions. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. Sample and Error Type Model Method (d, n) FD TC CCA GEV 50, 200 3.38 (.359) .154 (.09) .602 (.188) Model 1 SIR 50, 200 3.75 (.137) .062 (.034) .393 (.098) LSIR 50, 200 3.49 (.397) .127 (.099) .562 (.232) GEV 50, 200 2.06 (.321) .484 (.08) .950 (.035) Model 2 SIR 50, 200 2.90 (.237) .276 (.059) .811 (.042) Gaussian-X LSIR 50, 200 2.26 (.386) .436 (.096) .945 (.034) GEV 1000, 500 3.61 (.319) .097 (.08) .209 (.120) Model 1 LSIR 1000, 500 3.71 (.326) .073 (.082) .305 (.206) GEV 1000, 500 1.97 (.379) .507 (.095) .495 (.146) Model 2 LSIR 1000, 500 2.08 (.432) .479 (.108) .718 (.178) GEV 50, 200 3.31 (.383) .174 (.096) .652 (.192) Model 1 SIR 50, 200 3.73 (.165) .068 (.041) .420 (.100) LSIR 50, 200 3.43 (.424) .143 (.106) .620 (.218) GEV 50, 200 1.87 (.385) .534 (.096) .954 (.034) Model 2 SIR 50, 200 2.83 (.275) .292 (.069) .813 (.035) Elliptical-X LSIR 50, 200 2.10 (.400) .474 (.101) .951 (.029) GEV 1000, 500 3.41 (.436) .149 (.109) .353 (.182) Model 1 LSIR 1000, 500 3.48 (.416) .131 (.104) .443 (.208) GEV 1000, 500 1.88 (.461) .530 (.115) .618 (.169) Model 2 LSIR 1000, 500 1.89 (.486) .527 (.121) .727 (.167) For continuous response, we consider the following four scenarios for both low (d = 50) and high (d = 1000) dimensional data with either 1 : Gaussian predictors and Gaussian noise. 2 : Gaussian predictors and elliptical noise. 3 : Elliptical predictors and Gaussian noise. We randomly generate n = 500 predictors x from either a multivariate normal or elliptical distribution with mean zero and and the same covariance matrix as in categorical cases. For the continuous responses, we then generate the responses variable according to the following three models: 1. Model 1: y = (x⊤ β1 )/{0.5 + (x⊤ β2 + 1.5)2 } + 0.5ε; 2. Model 2: y = x⊤ β1 + 2 + exp(x⊤ β2 ) + 0.5 ∗ ε|x⊤ β1 + 2|, 57 3. Model 3: y = (x⊤ β1 + 1)2 + (x⊤ β2 + 1)2 + 0.5 ∗ ε, The ε’s are independently generated from either standard normal or t5 distribution. Here we set √ √ β1 = (1, . . . , 1, 0, . . . , 0)⊤ / 6 and β2 = (1, −1, 1, −1, 1, −1, 0, . . . , 0)⊤ / 6 with the first 6 elements of both vectors being nonzero. The results are found in tables 4.2 and 4.3. Table 4.2: Summary of estimation accuracy for continuous response in low dimensions. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. 
Sample and Error Type Model Method (d, n) FD TC CCA GEV 50, 200 2.27 (.247) .433 (.062) .974 (.018) Model 1 SIR 50, 200 3.22 (.293) .196 (.073) .793 (.061) LSIR 50, 200 2.16 (.220) .460 (.055) .973 (.017) Gaussian-X, Gaussian-Error GEV 50, 200 2.08 (.120) .478 (.030) .985 (.014) Model 2 SIR 50, 200 2.66 (.225) .335 (.056) .859 (.049) LSIR 50, 200 2.11 (.136) .474 (.034) .971 (.019) GEV 50, 200 2.11 (.179) .472 (.045) .982 (.020) Model 3 SIR 50, 200 3.19 (.348) .202 (.087) .733 (.138) LSIR 50, 200 2.19 (.199) .452 (.050) .965 (.030) GEV 50, 200 2.38 (.303) .404 (.076) .962 (.027) Model 1 SIR 50, 200 3.47 (.237) .131 (.059) .683 (.111) LSIR 50, 200 2.30 (.262) .424 (.065) .961 (.024) Gaussian-X, Elliptical-Error GEV 50, 200 2.08 (.183) .478 (.046) .983 (.018) Model 2 SIR 50, 200 2.82 (.243) .296 (.061) .813 (.060) LSIR 50, 200 2.13 (.171) .469 (.043) .965 (.023) GEV 50, 200 2.18 (.278) .456 (.070) .972 (.068) Model 3 SIR 50, 200 3.21 (.351) .199 (.088) .734 (.128) LSIR 50, 200 2.24 (.278) .439 (.070) .958 (.056) GEV 50, 200 2.11 (.290) .473 (.072) .978 (.016) Model 1 SIR 50, 200 3.21 (.299) .199 (.075) .786 (.070) LSIR 50, 200 2.04 (.310) .490 (.077) .961 (.017) Elliptical-X, Gaussian-Error GEV 50, 200 2.25 (.261) .437 (.065) .953 (.056) Model 2 SIR 50, 200 2.99 (.327) .253 (.082) .770 (.094) LSIR 50, 200 2.33 (.300) .418 (.075) .929 (.060) GEV 50, 200 2.75 (.434) .312 (.108) .828 (.137) Model 3 SIR 50, 200 3.50 (.286) .126 (.067) .558 (.154) LSIR 50, 200 2.85 (.437) .287 (.109) .801 (.144) 58 Table 4.3: Summary of estimation accuracy for continuous response in high dimensions. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. Sample and Error Type Model Method (d, n) FD TC CCA GEV 1000, 500 3.06 (.133) .236 (.033) .898 (.028) Model 1 LSIR 1000, 500 3.13 (.123) .217 (.031) .889 (.029) Gaussian-X, Gaussian-Error GEV 1000, 500 2.69 (.164) .329 (.041) .907 (.027) Model 2 LSIR 1000, 500 2.79 (.152) .303 (.038) .895 (.027) GEV 1000, 500 3.13 (.256) .216 (.064) .828 (.076) Model 3 LSIR 1000, 500 3.23 (.223) .194 (.056) .812 (.084) GEV 1000, 500 3.25 (.238) .187 (.060) .827 (.133) Model 1 LSIR 1000, 500 3.31 (.235) .171 (.059) .813 (.137) Gaussian-X, Elliptical-Error GEV 1000, 500 2.84 (.291) .291 (.073) .879 (.058) Model 2 LSIR 1000, 500 2.97 (.247) .259 (.062) .851 (.085) GEV 1000, 500 3.19 (.296) .203 (.074) .817 (.078) Model 3 LSIR 1000, 500 3.25 (.232) .186 (.058) .804 (.088) GEV 1000, 500 3.12 (.176) .220 (.044) .866 (.036) Model 1 LSIR 1000, 500 3.11 (1.01) .183 (.044) .830 (.068) Elliptical-X, Gaussian-Error GEV 1000, 500 3.37 (.193) .158 (.048) .777 (.070) Model 2 LSIR 1000, 500 3.49 (.174) .127 (.044) .743 (.088) GEV 1000, 500 3.87 (.088) .032 (.022) .417 (.138) Model 3 LSIR 1000, 500 3.91(.081) .023 (.020) .353 (.159) 4.2.1 Heavy Noise Slice Inverse Regression Here we show our adapted Huber loss GEV method when applied to the SIR problem with in- creased noise. We use the same models in the continuous response section with all the same conditions, with the exception of the coefficient of the noise term being raised from 0.5 to 1. As well we included an additional model Model 4: y = (x⊤ β1 + 3)2 + 2|x⊤ β2 + 3| + ε|x⊤ β2 |; and the scenario of having elliptical features for x and elliptical noise. We compare this to both SIR and LSIR. 59 Table 4.4: Summary of estimation accuracy for Huber loss estimation in low dimensions with high noise. 
We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. Sample and Error Type Model Method (d, n) FD TC CCA GEV 50, 200 3.25 (.321) .188 (.080) .699 (.153) Model 1 SIR 50, 200 3.83 (.125) .04 (.031) .286 (.114) LSIR 50, 200 3.56 (.318) .108 (.080) .573 (.167) Gaussian-X, Gaussian-Error GEV 50, 200 2.00 (.238) .498 (.080) .992 (.047) Model 2 SIR 50, 200 2.27 (.346) .432 (.087) .979 (.102) LSIR 50, 200 2.01 (.234) .497 (.058) .997 (.041) GEV 50, 200 2.41 (.332) .397 (.083) .992 (.075) Model 3 SIR 50, 200 3.55 (.249) .111 (.062) .560 (.131) LSIR 50, 200 2.43 (.398) .390 (.100) .915 (.089) GEV 50, 200 2.43 (.376) .391 (.094) .913 (.158) Model 4 SIR 50, 200 3.61 (.227) .099 (.058) .523 (.178) LSIR 50, 200 2.53 (.434) .367 (.108) .900 (.176) GEV 50, 200 3.28 (.374) .181 (.096) .651 (.176) Model 1 SIR 50, 200 3.84 (.116) .040 (.029) .290 (.096) LSIR 50, 200 3.60 (.303) .100 (.076) .545 (.159) Gaussian-X, Elliptical-Error GEV 50, 200 2.02 (.260) .493 (.065) .991 (.046) Model 2 SIR 50, 200 2.38 (.356) .406 (.089) .971 (.093) LSIR 50, 200 2.02 (.270) .492 (.068) .995 (.040) GEV 50, 200 2.47 (.309) .381 (.077) .915 (.097) Model 3 SIR 50, 200 3.62 (.242) .096 (.061) .511 (.159) LSIR 50, 200 2.54 (.364) .366 (.091) .911 (.105) GEV 50, 200 2.59 (.416) .353 (.104) .890 (.169) Model 4 SIR 50, 200 3.71 (.250) .073 (.063) .431 (.180) LSIR 50, 200 2.69 (.489) .329 (.122) .874 (.193) GEV 50, 200 3.17(.389) .207 (.097) .727 (.203) Model 1 SIR 50, 200 3.83 (.117) .043 (.029) .309 (.097) LSIR 50, 200 3.44 (.249) .141 (.062) .666 (.176) Elliptical-X, Gaussian-Error GEV 50, 200 2.17 (.072) .456 (.018) .955 (.006) Model 2 SIR 50, 200 2.75 (.148) .312 (.037) .887 (.008) LSIR 50, 200 2.19 (.045) .451 (.011) .966 (.002) GEV 50, 200 2.58 (.301) .355 (.075) .881 (.065) Model 3 SIR 50, 200 3.54 (.302) .114 (.075) .621 (.170) LSIR 50, 200 2.63 (.419) .341 (.105) .888 (.117) GEV 50, 200 2.96 (.341) .261 (.085) .742 (.103) Model 4 SIR 50, 200 3.68 (.248) .080 (.062) .463 (.192) LSIR 50, 200 3.06 (.429) .234 (.107) .741 (.136) GEV 50, 200 3.27 (.368) .183 (.092) .670 (.188) Model 1 SIR 50, 200 3.84 (.112) .041 (.028) .287 (.092) LSIR 50, 200 3.51 (.275) .121 (.069) .621 (.175) GEV 50, 200 2.19 (.070) .454 (.017) .953 (.006) Elliptical-X, Elliptical-Error Model 2 SIR 50, 200 2.81 (.187) .296 (.047) .878 (.011) LSIR 50, 200 2.20 (.071) .450 (.018) .964 (.003) GEV 50, 200 2.66 (.268) .334 (.067) .860 (.062) Model 3 SIR 50, 200 3.61 (.237) .097 (.059) .553 (.161) LSIR 50, 200 2.73 (.334) .317 (.084) .871 (.078) GEV 50, 200 3.00 (.337) .249 (.084) .725 (.088) Model 4 SIR 50, 200 3.70 (.198) .076 (.049) .447 (.172) LSIR 50, 200 3.12 (.434) .218 (.108) .712 (.127) 60 Table 4.5: Summary of estimation accuracy for Huber loss estimation in high dimensions with high noise. We report the means of three accuracy metrics (CCA, FD and TC) with their standard deviations in parentheses. The results are based on 100 replications. 
Sample and Error Type Model Method (d, n) FD TC CCA GEV 1000, 500 3.55 (.100) .112 (.025) .282 (.113) Model 1 LSIR 1000, 500 3.65 (.116) .087 (.029) .464 (129) Gaussian-X, Gaussian-Error GEV 1000, 500 2.12 (.042) .470 (.010) .670 (.157) Model 2 LSIR 1000, 500 2.22 (.079) .445 (.020) .448 (.152) GEV 1000, 500 2.68 (.160) .329 (.040) .647 (.110) Model 3 LSIR 1000, 500 2.92 (.192) .271 (.048) .389 (.130) GEV 1000, 500 3.13 (.215) .216 (.054) .578 (.102) Model 4 LSIR 1000, 500 3.31 (.177) .173 (.044) .320 (.130) GEV 1000, 500 3.67 (.107) .082 (.027) .384 (.108) Model 1 LSIR 1000, 500 3.80 (.112) .050 (028) .266 (.114) Gaussian-X, Elliptical-Error GEV 1000, 500 2.13 (.044) .467 (.011) .665 (.159) Model 2 LSIR 1000, 500 2.23 (.080) .441 (.022) .472 (.160) GEV 1000, 500 2.79 (.167) .302 (.042) .614 (.105) Model 3 LSIR 1000, 500 3.03 (.191) .234 (.048) .377 (.125) GEV 1000, 500 3.17 (.208) .206 (.052) .586 (.097) Model 4 LSIR 1000, 500 3.32 (.188) .170 (.047) .327 (.137) GEV 1000, 500 3.52 (.151) .121 (.038) .519 (.106) Model 1 LSIR 1000, 500 3.61 (.134) .098 (.034) .337 (.125) Elliptical-X, Gaussian-Error GEV 1000, 500 2.66 (.240) .334 (.060) .572 (.123) Model 2 LSIR 1000, 500 3.10 (.249) .224 (.062) .385 (.139) GEV 1000, 500 3.40 (.188) .149 (.047) .476 (.099) Model 3 LSIR 1000, 500 3.59 (.164) .101 (.041) .334 (.115) GEV 1000, 500 3.84 (.083) .040 (.021) .228 (.075) Model 4 LSIR 1000, 500 3.94 (.064) .015 (.016) .236 (.118) GEV 1000, 500 3.68 (.116) .083 (.023) .420 (.085) Model 1 LSIR 1000, 500 3.76 (123) .059 (.031) .309 (.125) GEV 1000, 500 2.68 (.213) .330 (.053) .589 (.116) Model 2 LSIR 1000, 500 3.14 (.225) .216 (.064) .456 (.135) Elliptical-X, Elliptical-Error GEV 1000, 500 3.44 (.193) .138 (.048) .495 (.080) Model 3 LSIR 1000, 500 3.61 (.153) .097 (.038) .394 (.108) GEV 1000, 500 3.84 (.087) .039 (022) .228 (.076) Model 4 LSIR 1000, 500 3.94 (.070) .015 (.018) .230 (.123) 4.3 Linear Discriminant Analysis In this section, we investigated the performance of GEV method under high-dimensional Linear Discriminant Analysis (LDA) framework for both binary and multi-class classification problems by applying them to simulated data generated under two types of within group covariance matri- ces: block Toeplitz suggested by [WT11] and Sparse precision matrix as described above. For comparison, we also included the ℓ1 −penalized linear discriminant analysis (LDA-ℓ1) [WT11] us- 61 ing R package penalizedLDA and the direct approach for discriminant analysis [MZY12, ZMY18] implemented in R packages dsda and msda for binary or multi-class cases respectively. To serve as a benchmark, we also included the Oracle classifier derived from the population parameters Σw and Σb . For each simulation, we generate 100K samples with d = 500 features, where K is the number of classes. We consider simulation settings for binary and multi-class cases as follows: • Binary case: we set µ1 = 0 and µ2 j ∼ N(0.3, 0.5) for j ∈ {1, . . . , 20} and µ2 j = 0 otherwise. For the block Toeplitz, the block diagonal matrix, Σw , consists five equal size blocks with the (i, j)th element of each block equals to 0.7|i− j| . In term of the sparse precision matrix, we simulated the K-nearest-neighbor networks as describe above. Both the covariance struc- tures were used to mimic the biological gene networks with sparse conditional dependency structure [WT11, XLV16]. We then simulate xi ∼ N(µk , Σw ) for i ∈ Ck . • Multi-class case: we consider K = 3 classes and set µ1 = 0, µ2 j = 0 ∼ N(0.3, 0.5) for j ∈ {1, . . . 
In Table 4.6, we report the prediction accuracy based on 100 replicates.

Table 4.6: Summary statistics reporting performance of the GEV, LDA-ℓ1, Direct and Oracle methods. We report the mean number of classification errors with its standard deviation in parentheses. The results are based on 100 replications.

Case         Σw Type         LDA-ℓ1          Direct          GEV             Oracle
Binary       Toeplitz Error  75.12 (23.36)   31.96 (16.61)   30.58 (17.04)   14.54 (9.77)
             NN Error        60.12 (8.53)    54.88 (13.73)   52.70 (11.71)   23.08 (4.64)
Multi-class  Toeplitz Error  155.88 (24.26)  70.14 (31.54)   60.02 (28.30)   27.34 (14.09)
             NN Error        103.1 (7.60)    83.3 (7.90)     40.68 (24.46)   35.06 (17.04)

4.4 Canonical Correlation Analysis

In this section, we assess the performance of GEV under the sparse CCA framework by applying it to several simulated datasets. In all settings, we let X and Y have the same dimension (p = q = d) and set Σx = Σy = Σ. Following the formulation in [CGRZ13], we model Cov(X, Y) = Σxy as

Σxy = Σx U Λ V′ Σy,   (4.2)

where U = (U1, U2) and V = (V1, V2) are d × 2 matrices, and Λ is a 2 × 2 diagonal matrix with λ1 = 1 and λ2 = 0.7. We set the nonzero rows of U1, U2, V1 and V2 to {1, 2, . . . , 6}, {7, . . . , 12}, {d − 5, . . . , d} and {d − 11, . . . , d − 6}, respectively. The values in the nonzero rows are sampled uniformly from (−1, −0.5) ∪ (0.5, 1) and then normalized with respect to Σ so that U′ΣU = I and V′ΣV = I. To capture different dependency structures, we consider the following three settings; a sketch of this data-generating process follows the list.

• Identity: Σ = Id.
• Toeplitz: Σ = (σij) with σij = 0.3^{|i−j|} for all i, j ∈ [d].
• Sparse precision matrix: Ω = Σ^{−1} is sparse. Specifically, we generate the sparse precision matrix through the nearest-neighbour networks algorithm of [LG06] with the number of neighbors set to m = 5.
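For concreteness, here is a minimal NumPy sketch of the covariance construction in (4.2) under the Toeplitz setting (our own illustration; function names are ours). The Σ-normalization is implemented with a symmetric inverse square root, which is one standard way to enforce U′ΣU = I.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 300

# Toeplitz setting: sigma_ij = 0.3^|i-j|
Sigma = 0.3 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))

def sparse_loadings(rows1, rows2, d, rng):
    """d x 2 matrix whose nonzero rows are drawn uniformly from (-1, -0.5) U (0.5, 1)."""
    M = np.zeros((d, 2))
    for col, rows in enumerate((rows1, rows2)):
        M[rows, col] = rng.uniform(0.5, 1.0, len(rows)) * rng.choice([-1.0, 1.0], len(rows))
    return M

def normalize_wrt(M, Sigma):
    """Rescale M so that M' Sigma M = I (multiply by the inverse square root of M' Sigma M)."""
    G = M.T @ Sigma @ M
    w, Q = np.linalg.eigh(G)
    return M @ (Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T)

U = normalize_wrt(sparse_loadings(np.arange(0, 6), np.arange(6, 12), d, rng), Sigma)
V = normalize_wrt(sparse_loadings(np.arange(d - 6, d), np.arange(d - 12, d - 6), d, rng), Sigma)
Lam = np.diag([1.0, 0.7])

Sigma_xy = Sigma @ U @ Lam @ V.T @ Sigma                    # equation (4.2)
Joint = np.block([[Sigma, Sigma_xy], [Sigma_xy.T, Sigma]])  # joint covariance of (X, Y)
# lambda_1 = 1 puts the joint covariance on the PSD boundary (singular), so skip the check.
XY = rng.multivariate_normal(np.zeros(2 * d), Joint, size=n, check_valid="ignore")
X, Y = XY[:, :d], XY[:, d:]
```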
We compare the performance of our method with the Penalized Multivariate Analysis (PMA) method proposed by [WTH09] by examining the Frobenius norm distance (FD) between the true and estimated subspaces. The sparsity tuning parameters in PMA are chosen using permutation, as suggested by the R package PMA, while the tuning parameter λ in GEV is selected by cross-validation. Results of the simulations are reported in Table 4.7; summary statistics are based on 100 replicate trials under each of the six conditions. In general, the GEV method yields a smaller Frobenius norm distance for both U and V. Under all settings, the GEV method outperforms the PMA method, with greater improvements as the dependency structures become more complicated.

Table 4.7: Summary statistics reporting performance of the GEV and PMA methods. We report the means of the FD with its standard deviation in parentheses. The results are based on 100 replications.

(p, q, n)         Σ         U-PMA          V-PMA          U-GEV          V-GEV
(50, 50, 300)     Identity  0.589 (0.085)  0.543 (0.126)  0.515 (0.120)  0.496 (0.116)
                  Toeplitz  0.839 (0.089)  0.833 (0.080)  0.623 (0.106)  0.607 (0.099)
                  NN        1.365 (0.132)  1.353 (0.141)  0.774 (0.137)  0.778 (0.171)
(500, 500, 2000)  Identity  0.255 (0.076)  0.399 (0.043)  0.208 (0.074)  0.297 (0.071)
                  Toeplitz  0.696 (0.030)  0.745 (0.023)  0.426 (0.059)  0.424 (0.060)
                  NN        1.031 (0.152)  1.031 (0.076)  0.433 (0.079)  0.420 (0.090)

4.5 Application to Tumor-Infiltrating Lymphocytes Data

To demonstrate our approach's potential utility, we apply the GEV-SIR algorithm to the Tumor-Infiltrating Lymphocytes (TILs) data inferred from The Cancer Genome Atlas (TCGA) Ovarian serous cystadenocarcinoma (Ovarian cancer) and Lung Squamous Cell Carcinoma (Lung cancer) cohorts using CIBERSORT [NLG+15]. Compelling clinical evidence suggests that the presence of effector immune cells, such as CD8+ T cells and plasma cells, is positively associated with superior survival in patients with ovarian cancer. Notably, an inflamed tumor microenvironment, which is characterized by the infiltration of CD8+ T cells, also attracts plasma cells. A higher percentage of plasma cell infiltration is significantly correlated with the highest levels of CD8+, CD4+, and CD20+ TILs, and with superior clinical outcomes in patients with ovarian cancer [SJ15]. Indeed, a pan-cancer analysis also identified plasma cells as a novel prognostic factor for superior survival [WN18]. However, the mechanism of plasma cell homing to the tumor bed remains unclear. Identifying oncogenic signaling pathways that shape the plasticity of plasma cell recruitment and differentiation holds promise for better classifying patients based on their immune-editing profiles.

We extract the expression of the 2,000 genes with the largest variance among samples. We first determine the number of dimensions using an eigendecomposition. As shown in the left panel of Figure 4.2, GEV-SIR favors d = 1 because of the large gap between the first and second largest eigenvalues. The tuning parameter λ is then selected via the cross-validation procedure. The middle panel of Figure 4.2 shows a strong monotonic relationship between the GEV-SIR score and the plasma cell abundances. To validate the predictive ability of the derived GEV-SIR score, we use the TCGA Lung cancer data as a test set. Specifically, we select the same 2,000 genes as in the Ovarian cancer data and project them onto the estimated Ovarian cancer GEV-SIR direction. The right panel of Figure 4.2 demonstrates a similar monotonically decreasing pattern between the GEV-SIR direction and plasma cell recruitment in the Lung cancers.

Figure 4.2: Relationship between plasma cells and GEV-SIR direction. The left panel shows the distribution of eigenvalues of Ω̂. The scatter plots in the middle and right panels show the relationship between the tumor-infiltrated plasma cells and the GEV-SIR direction.

To better understand the biological significance of the reduced dimensions, we performed a GO pathway enrichment analysis using GSEA [STM+05]. In the reduced dimension, gene clusters that regulate immune cell differentiation (p value < 0.001), effector function such as enzyme activity (p value < 0.001), regulation of apoptosis (p value < 0.05), and chemotaxis signaling (p value < 0.05) are significantly enriched.
Among the strongest pathways that are positively associated with plasma cell recruitment in the second GEV-SIR direction are the defense response (p value < 0.001) and the type I interferon (IFN-I) network (p value < 0.05). The defense response pathway is informed by immune detection of danger-associated molecular patterns (DAMPs) and pathogen-associated molecular patterns (PAMPs). In tumor immune detection, DAMPs released by damaged cancer cells, such as DNA, can alert immune cells and promote an "inflamed" tumor microenvironment [GGK10], which is amenable to plasma cell recruitment. IFN-I signatures have emerged as a central signaling pathway that facilitates anti-tumor immunity [ID14]. IFN-I functions by binding to its receptor on target cells and launching a large transcriptional program of interferon-stimulated genes (ISGs), among which are chemokines such as CXCL9, CXCL10, and CXCL12. CXCL9 and CXCL10 are essential mediators of effector immune cell chemotaxis [SCR14]. Downregulation of these chemokines severely compromises anti-tumor immunity. Importantly, CXCL12 potently promotes plasma cell recruitment [DPN14].

4.6 Application to Single-Cell RNAseq Data

To demonstrate the ability of GEV-SIR to handle noisy data, we next use GEV-SIR to analyze a dataset of human embryonic stem cells grown over a 27-day time course from [MvDW+19]. The [MvDW+19] dataset comprises expression measurements of 33,694 genes over 31,000 cells obtained with single-cell RNAseq (scRNAseq) technology, where cells were sampled at the following differentiation time intervals: (Day 0-3), (Day 6-9), (Day 12-15), (Day 18-21), and (Day 24-27). Unlike the measurements from bulk tissue as in the tumor-infiltrating lymphocytes data, the scRNAseq data suffer from a high noise level, contamination with outliers, and a large proportion of missing values due to the limited initial mRNA in each cell. Taking the developmental time as the response and gene expression as the predictors, we aim to reveal the driving factors/pathways of the differentiation process of embryonic cells. Following the same preprocessing procedure as in [MvDW+19], we applied our GEV-SIR method to the normalized scRNAseq data and show the two-dimensional embedding in Figure 4.3. Our analysis successfully captured the smooth transition of the embryonic differentiation process, with the GEV-SIR1 direction capturing the differences among all developmental intervals and the GEV-SIR2 direction mainly reflecting the differences between the last two intervals and the first three.

Figure 4.3: GEV-SIR analysis of embryoid body scRNAseq data. Cells are plotted by their GEV_SIR1 and GEV_SIR2 coordinates and colored by developmental time.

CHAPTER 5
GRAPHICAL NEURAL NETWORKS FOR MULTI-MODAL DATA INTEGRATION

With the emergence of joint platforms for single-cell sequencing, the data produced by methods like scRNA-seq and scATAC-seq can be combined for multi-modality cell sequencing, attributing both mRNA and DNA data to each cell. This new kind of data poses unique challenges in how best to incorporate the large amount of multi-modal information for data analysis purposes. Both data streams are very high-dimensional, with sample sizes often of the same order as, or smaller than, the number of features, placing the data analysis problem in the HDLSS scenario. Furthermore, the sequencing data are notoriously sparse: the vast majority of features may be zero in a typical dataset due to technical dropout [SNL+17], an issue that can be even more pronounced in multi-modal data [LHH20].
One main goal for multi-modal data is achieving data integration, that is, combining the heterogeneous data so that downstream tasks can make better use of them. One way to achieve this is by performing a joint embedding of the features of the two modalities in a shared low-dimensional space. Such an embedding ideally captures a meaningful representation of the complex cellular states from the different types of measurements.

Deep learning techniques have recently been used to solve the task of multi-modal data integration for single-cell data with great success [GZP21, AAB+20]. However, most current methods fail to take into account high-order interactions among cells or between modalities, and instead treat each cell as a separate input. This higher-order information is essential: it gives structure to the data that allows for learning a proper low-dimensional representation of high-dimensional and sparse cell features. Graph neural networks (GNNs) [LDJ+21, KW17] provide unique tools for capturing the higher-order information required for data integration. GNNs aggregate information from neighborhoods to update node embeddings iteratively, which allows high-order structural information to be encoded through multiple aggregation layers. In addition, GNNs smooth the features by aggregating neighbors' embeddings, which provides an extra denoising mechanism [MLZ+21]. Hence, by modeling the interactions between cells and their features as a graph, we can adopt GNNs to exploit the structural information.

We implement a GNN framework for multimodal data integration, designed in concert with [WDJ+22], called scJEGNN (single-cell Joint Embedding GNN). We apply this model to the benchmark datasets provided by NeurIPS 2021 [LBC+21] for a multimodal single-cell data integration competition and compare its performance with competitor submissions.

5.1 Problem Statement

The two modalities we operate on are GEX, the mRNA data, and ATAC, the DNA data. Each modality is represented as a matrix Xi ∈ R^{n×di}, i = 1, 2, where n is the number of cells and di denotes the feature dimension for each cell. In our application, the GEX modality has dimension d1 = 13,431, the ATAC modality has dimension d2 = 116,490, and the total sample size is n = 42,492. The data are also highly sparse; only 9.75% of GEX and 2.9% of ATAC features are nonzero on average. The data come with expert annotations giving a cell type label for each cell, with a total of 22 different cell type classes, and two real-valued scores indicating cell-cycle developmental stage. In addition, the two modalities are sequenced in a total of 10 batches, introducing the possibility of large batch effects.

The goal, then, is to learn an embedded representation of the cells in R^{d3} that best leverages the underlying information of the two modalities in order to preserve cell information and remove spurious batch effects from the representations. The evaluation of how well the embedding represents pertinent biological information is carried out by a collection of metrics M : R^{n×d3} → R^k, M(X) = (m1(X), . . . , mk(X)), where the k metrics mi : R^{n×d3} → R, i ∈ [k], are defined so that higher values indicate better performance of the embedding. The problem can be formally defined as follows: given modality X1 ∈ R^{n×d1} and modality X2 ∈ R^{n×d2}, learn three mapping functions fθ1, fθ2 and fθ3, parameterized by θ1, θ2 and θ3, to obtain a representation H ∈ R^{n×d3},

H = fθ3(CONCAT(fθ1(X1), fθ2(X2))),   (5.1)

that best maximizes the coordinates of M(H). Here fθ1(X1) ∈ R^{n×d1′} and fθ2(X2) ∈ R^{n×d2′} correspond to new representations learned from modalities X1 and X2, and CONCAT : R^{n×d1′} × R^{n×d2′} → R^{n×(d1′+d2′)} is the function that concatenates the two representations row by row, joining each cell's two feature vectors into one.
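A shape-level sketch of (5.1) follows. The mappings here are random linear maps used only to make the composition and dimensions explicit; in scJEGNN the actual fθ's are learned (LSI, GNN, and autoencoder, described below), and toy sizes stand in for the real data dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1, d2 = 200, 500, 800        # toy sizes (the real data have n = 42,492, d1 = 13,431, d2 = 116,490)
d1p, d2p, d3 = 100, 65, 39       # reduced and embedding dimensions used later in this chapter

# Stand-in mappings: random linear maps only illustrate the shapes in equation (5.1).
f1 = lambda X: X @ rng.normal(size=(d1, d1p))
f2 = lambda X: X @ rng.normal(size=(d2, d2p))
f3 = lambda Z: Z @ rng.normal(size=(d1p + d2p, d3))

X1, X2 = rng.random((n, d1)), rng.random((n, d2))    # GEX and ATAC placeholders
H = f3(np.concatenate([f1(X1), f2(X2)], axis=1))     # H in R^{n x d3}
print(H.shape)                                       # (200, 39)
```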
For our application, we have k = 3 metrics measuring the performance of the embedding H. The three measurements are given by a cell-type conservation metric, a cell-cycle conservation metric, and a batch removal metric:

• NMI cluster/label: The normalized mutual information (NMI) [MGH11] compares the overlap of two clusterings. The NMI is applied to the integrated data to compare the cell type labels with an automated clustering (based on Louvain clustering). NMI scores of 0 and 1 correspond to uncorrelated clusterings and a perfect match, respectively. Automated Louvain clustering is performed at resolutions ranging from 0.1 to 2 in steps of 0.1, and the clustering output with the highest NMI with the label set is used.

• Cell-cycle conservation: The cell-cycle conservation score evaluates the amount of variance explained by the cell-cycle per batch prior to integration versus the amount of variance after integration. The relative differences of var_before_i and var_after_i per batch i are aggregated into a final score between 0 and 1 via

CC_conservation = (1/B) ∑_{i=1}^{B} ( 1 − |var_before_i − var_after_i| / var_before_i ),

where B is the number of batches. Values near 0 indicate little conservation of the variance explained by the cell-cycle, while values near 1 indicate nearly perfect conservation.

• Batch ASW: The average silhouette width (ASW) is used to quantify batch mixing by taking into account the incompatibility of batch labels per cell type cluster. The batch ASW considers the absolute silhouette width s(i), computed on batch labels, for each cell i. Here, an absolute silhouette width of 0 indicates that batches are thoroughly mixed, while any deviation from 0 indicates the presence of a batch effect. The metric rescales this score so that higher values imply better batch mixing, using the equation below for each cell type label j:

batchASW_j = 1 − (1/|C_j|) ∑_{i∈C_j} |s(i)|,

where C_j is the set of cells with cell label j and |C_j| denotes the number of cells in that set. To obtain the final batchASW score, the label-specific batchASW_j scores are averaged:

batchASW = (1/C) ∑_{j=1}^{C} batchASW_j,

where C is the number of unique cell labels. A batchASW value of 1 indicates optimal batch mixing, and a value of 0 indicates fully separated batch clusters.
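The last two metrics reduce to short formulas once their inputs are available, so a minimal re-implementation (our own sketch, not the official competition evaluation code) is given below. It assumes the per-batch variances explained by the cell-cycle scores and the per-cell silhouette widths on batch labels have already been computed elsewhere (e.g., the latter via sklearn.metrics.silhouette_samples); the NMI metric additionally requires Louvain clustering and is omitted.

```python
import numpy as np

def cell_cycle_conservation(var_before, var_after):
    """CC_conservation: mean over batches of 1 - |var_before - var_after| / var_before.
    Clipping to [0, 1] keeps the score in the stated range."""
    var_before, var_after = np.asarray(var_before, float), np.asarray(var_after, float)
    scores = 1.0 - np.abs(var_before - var_after) / var_before
    return float(np.clip(scores, 0.0, 1.0).mean())

def batch_asw(silhouette, cell_labels):
    """Batch ASW: silhouette holds s(i) computed on *batch* labels in the embedding,
    cell_labels holds each cell's cell type label."""
    s = np.abs(np.asarray(silhouette, float))
    cell_labels = np.asarray(cell_labels)
    per_label = [1.0 - s[cell_labels == lab].mean() for lab in np.unique(cell_labels)]
    return float(np.mean(per_label))
```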
5.2 Method

In this section, we introduce the scJEGNN framework for multimodal data integration. An illustration of the framework is given in Figures 5.1-5.3. Specifically, our framework can be divided into four stages: data preprocessing, graph construction, cell-feature graph convolution, and an autoencoder architecture for the final embedding.

5.2.1 Data Preprocessing

Both modalities, X1 for GEX and X2 for ATAC, go through standard preprocessing steps regularly used in single-cell sequencing tasks; the sequence of operations below describes both fθ1 and fθ2. First, the matrices are ℓ1-normalized, meaning that each row vector (cell) is divided by the sum of the absolute values of all its features, which normalizes the weight of each cell's total expression output. Then the data are log-transformed, so that each normalized value X̄ij is updated to log(X̄ij · 10^4 + 1). These values are then divided by the standard deviation of each column, which normalizes the variation of each feature. Lastly, both modalities go through an initial dimension reduction using latent semantic indexing (LSI), a transformation based on the singular value decomposition (SVD). For some choice of k, the transformation keeps the top k singular vectors and projects each cell onto them, so that each of the k coordinates is the inner product with one of these vectors. We choose different values of k for the two modalities, giving d1′ for X1 and d2′ for X2. In our experiments, we found that choosing d1′ = 100 for GEX and d2′ = 65 for ATAC gave the best performance. The two reduced matrices are then concatenated, giving an output X̂ ∈ R^{n×(d1′+d2′)}. We simplify notation and write d′ = d1′ + d2′ for the combined feature dimension.

5.2.2 Graph Construction

Given the preprocessed X̂, we construct a graph to which the GNN can be applied. We construct a cell-feature bipartite graph, depicted in Figure 5.1, where the cells and their biological features are treated as different nodes, giving us a collection of cell nodes and a collection of feature nodes. The edges are designed to run strictly between the two collections, so that an edge connecting a cell node i to a feature node j directly represents the value of that cell's feature, given by X̂ij. This requires a weighted-edge graph instead of the usual 0-1 adjacency matrix. As we will see, given a proper choice of node embedding values, this graph is able to propagate information from cells to pertinent feature nodes, and likewise feature nodes can propagate their information to the cell nodes that express them highly.

We denote the bipartite graph as G = (U, V, E). Here U is the set of nodes representing the n cells {u1, . . . , un}, and V is the set of nodes representing the d′ features {v1, . . . , vd′}, with one node for each feature dimension of the input data. The set E ⊆ U × V gives the edges between the nodes of U and V, which describe the relations between the cells and the features. The graph can be represented by the weighted adjacency matrix

A = [[0, X̂], [X̂⊤, 0]] ∈ R^{(n+d′)×(n+d′)},

where 0 denotes a matrix of all zeros and X̂ ∈ R^{n×d′} is the input feature matrix of cells. A is designed to give the structure of a bipartite graph, since nodes of the cell set or the feature set only have possible edges to nodes of the other set, not within the same set. The initial embeddings of the feature and cell nodes are given by matrices V and U, respectively, where each row is the embedding vector of one node. The feature nodes {v1, . . . , vd′} ⊂ R^{d′} are initialized as one-hot vectors, so that each vi has a one in index i and zeros elsewhere, making V ∈ R^{d′×d′} an identity matrix. The cell nodes are initialized as zero vectors of the same dimension, so that U ∈ R^{n×d′} has all zero entries. The lack of prior information on the cells leads to this uniform initialization for the cell nodes, and the one-hot embedding works well with the chosen graph convolution, which recreates the gene expression and propagates it into the cell node embeddings. Our cell-feature graph can then be denoted G = (A, V, U).

Figure 5.1: scJEGNN graph construction process. The input data determine the values of the weighted edges between the cell nodes and the feature nodes; values of zero indicate no edge.
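The following dense NumPy sketch illustrates the preprocessing of Section 5.2.1 and the graph construction of Section 5.2.2 on toy data; it is our own illustration under the assumptions stated in the comments, not the scJEGNN implementation, which operates on sparse matrices at the real data sizes.

```python
import numpy as np

def preprocess(X, k, eps=1e-8):
    """l1-normalize each cell, log-transform, scale each feature, then LSI-reduce to k dims."""
    X = X / (np.abs(X).sum(axis=1, keepdims=True) + eps)   # l1 normalization per cell
    X = np.log(X * 1e4 + 1.0)                              # log transform
    X = X / (X.std(axis=0, keepdims=True) + eps)           # per-feature scaling
    _, _, Vt = np.linalg.svd(X, full_matrices=False)       # LSI via a truncated SVD
    return X @ Vt[:k].T                                    # project each cell to k coordinates

rng = np.random.default_rng(0)
n = 200                                                    # toy size; the real data have n = 42,492
X1 = rng.poisson(0.3, size=(n, 500)).astype(float)         # stand-in GEX counts
X2 = rng.poisson(0.1, size=(n, 800)).astype(float)         # stand-in ATAC counts

X_hat = np.concatenate([preprocess(X1, 100), preprocess(X2, 65)], axis=1)
n, d = X_hat.shape                                         # d = d1' + d2' = 165

# Cell-feature bipartite graph of Section 5.2.2: weighted adjacency, one-hot feature
# nodes, all-zero cell nodes.
A = np.block([[np.zeros((n, n)), X_hat],
              [X_hat.T, np.zeros((d, d))]])
V0 = np.eye(d)            # initial feature-node embeddings (identity matrix)
U0 = np.zeros((n, d))     # initial cell-node embeddings
```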
5.2.3 Graph Convolution

Given the constructed cell-feature graph, we wish to choose a graph convolution that captures higher-order structural information from the links between nodes in order to create better cell node representations. In each layer of a GNN, the embedding of a node is updated according to the values propagated from its neighbors, each given by the edge weight times the neighbor's embedding. In the field of GNNs, this type of information propagation between neighbors is called "message passing" [GSR+17]. While the two node types could in principle use different message passing schemes, we use the same updates for both. To illustrate our method, we give the notation for the updates applied to nodes in U; the updates for nodes in V are completely analogous. Let H^l = {h_1^l, . . . , h_n^l}, with h_i^l ∈ R^{d′}, be the input node embeddings at the l-th layer, where h_i^l corresponds to node ui. Then the output embedding of the l-th layer can be expressed as

h_i^{l+1} = Update(h_i^l, Agg({h_j^l | j ∈ Ni})),

where Ni is the set of first-order neighbors of node ui, Agg(·) is an aggregation function applied to the neighbor nodes' embeddings, and Update(·) is an update function that generates a new node embedding from the previous one and the aggregation output.

While there are many choices of aggregation and update functions in GNNs, we choose the common and simple graph convolutional network (GCN) [KW17] model for these layer updates. In general, the GCN creates a message m_{i,l} for node i at layer l as follows:

m_{i,l} = σ( b^l + ∑_{j∈Ni} (e_{ji} / c_{ji}) h_j^l W^l ),

where j varies over the neighbors of ui in V, e_{ji} denotes the edge weight between ui and vj, W^l and b^l are trainable parameters, σ(·) is an activation function, and c_{ji} is a normalization term defined as

c_{ji} = sqrt( ∑_{k∈Nj} e_{jk} ) · sqrt( ∑_{k∈Ni} e_{ki} ).

After generating the messages from the neighborhoods, we update the embeddings of the nodes in U as

h_i^{l+1} = h_i^l + m_{i,l}.

This simple residual mechanism adds the previous embedding to the new message, enhancing self-information by combining the node embedding with its aggregated neighborhood information.

We choose to decouple the propagation and transformation of the node embeddings. This means that we set W^l = I_{d′}, the identity matrix, and b^l = 0 for all layers l; the activation function σ is set to the identity as well. The choice to remove the learnable parameters and activation function may seem like a strong limitation on the transformation, but recent work [WSZ+19, HDW+20] has shown that if later transformations occur (as in our model), the performance of the model is often improved by the use of simplified GCN layers, and the computational efficiency is greatly increased. [HDW+20] in particular found that when later transformations follow simplified GCN layers, the model produces representations with the same or better performance than it would with the earlier trainable parameters. This choice also means that the hidden layers keep the same dimension throughout, so the final output is H_U^L ∈ R^{n×d′}, where H_U^L is the hidden-layer representation of the cell nodes in the last layer L. In our application we found L = 3 to give the best performance. We take advantage of this consistent hidden-layer size by completing the GNN output with a weighted summation of all the hidden layers, giving

Ĥ = ∑_{i=1}^{L} wi · H_U^i.

Our GCN model is illustrated in Figure 5.2, with the summation Ĥ of the GCN layers used as input for the final autoencoder.
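A minimal NumPy sketch of the simplified propagation just described is given below (our own dense illustration; the layer weights wi are trainable in the model but fixed constants here, and the real implementation uses sparse matrices).

```python
import numpy as np

def simplified_gcn(A, H0, num_layers=3, layer_weights=(1.0, 1.0, 1.0)):
    """Simplified GCN propagation with W^l = I, b^l = 0, sigma = identity, and the residual
    update h^{l+1} = h^l + m_{i,l}; returns the weighted sum of the hidden layers."""
    deg = A.sum(axis=1)                                      # weighted degrees sum_k e_ik
    deg[deg == 0] = 1.0                                      # guard against isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # entries e_ji / c_ji

    H, layers = H0, []
    for _ in range(num_layers):
        H = H + A_norm @ H                                   # residual message-passing update
        layers.append(H)
    return sum(w * Hl for w, Hl in zip(layer_weights, layers))

# Tiny self-contained example in the notation of Section 5.2.2.
rng = np.random.default_rng(0)
n, d = 6, 4
X_hat = rng.random((n, d))
A = np.block([[np.zeros((n, n)), X_hat], [X_hat.T, np.zeros((d, d))]])
H0 = np.vstack([np.zeros((n, d)), np.eye(d)])                # zero cell nodes, one-hot feature nodes
H_hat = simplified_gcn(A, H0)[:n]                            # cell-node output fed to the autoencoder
print(H_hat.shape)                                           # (6, 4)
```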
It is worth noting that the simplified GCN computation and the choice of node embeddings lead to a recreation of each cell's original representation in the first two updates. If a cell i has features given by the i-th row of X̂, then the cell node update equals u_i = ∑_{k=1}^{d′} X̂_{ik} e_k, where e_k is the k-th unit vector with a 1 in coordinate k and 0's elsewhere. This value is also the output of the second layer for the cell nodes, because the feature nodes receive no update in the first layer from the all-zero initial cell node embeddings. After the second layer, the cell nodes update in a novel manner, weighing messages from features that are expressed at a greater value in that cell more highly. This leads to increasingly similar embeddings for cells that have high coexpression of features.

Figure 5.2: scJEGNN graph convolution. Multiple convolution layers propagate information along the weighted edges to update the cell and feature nodes.

5.2.4 Autoencoder

In order to train the final cell embeddings, we use an autoencoder model, presented in Figure 5.3, to achieve the desired joint embedding of the data. The autoencoder consists of an encoder E and a decoder D, both modeled as fully connected perceptrons. The encoder E takes in Ĥ, and each layer except the last is computed as

Ĥ^{l+1} = DO(BN(σ(Ĥ^l W^l)), p),

where Ĥ^l is the output of the l-th layer, W^l ∈ R^{dl×d(l+1)} is a trainable linear transformation, σ is an activation function, BN is the batch normalization function [IS15], and DO is a dropout function with parameter 0 < p < 1. Batch normalization operates by normalizing the empirical mean and variance of each batch and is a well-established technique in deep learning for improving training. The dropout function randomly zeroes some of the elements of the input matrix with probability p, using samples from a Bernoulli distribution, and also improves training by forcing information redundancy in the connections between layers. The last layer removes the batch normalization and dropout, leaving

Ĥ^L = σ(Ĥ^{L−1} W^L).

We choose L = 4 and p = 0.2, and reduce the dimension of the input iteratively with d1 = 150, d2 = 120, d3 = 100, down to the final embedding dimension of 39. For all layers, σ is chosen to be the ReLU function, ReLU(x) = max(0, x). The decoder is a simple two-layer transformation of the joint embedding back to the original dimension d′, given by

D(Ĥ^L) = σ(σ(Ĥ^L W1) W2),

where σ = ReLU, W1 ∈ R^{dL×d1}, and W2 ∈ R^{d1×d′}, with dL = 39 the embedding dimension.
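A PyTorch sketch of this encoder/decoder, following the dimensions stated above, is shown below. It is a minimal illustration of the described architecture, not the scJEGNN code itself, and the class name is ours.

```python
import torch
import torch.nn as nn

class JointEmbeddingAE(nn.Module):
    """Sketch of the autoencoder described above: four encoder layers
    (Linear -> ReLU -> BatchNorm -> Dropout, except the last) down to a
    39-dimensional embedding, and a two-layer decoder back to d_prime."""
    def __init__(self, d_prime, hidden=(150, 120, 100), d_embed=39, p=0.2):
        super().__init__()
        dims = (d_prime,) + hidden
        enc = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            enc += [nn.Linear(d_in, d_out), nn.ReLU(),
                    nn.BatchNorm1d(d_out), nn.Dropout(p)]
        enc += [nn.Linear(dims[-1], d_embed), nn.ReLU()]     # last layer: no BN / dropout
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(nn.Linear(d_embed, hidden[0]), nn.ReLU(),
                                     nn.Linear(hidden[0], d_prime), nn.ReLU())

    def forward(self, H_hat):
        H_L = self.encoder(H_hat)           # joint embedding, n x 39
        return H_L, self.decoder(H_L)       # embedding and reconstruction

model = JointEmbeddingAE(d_prime=165)
H_hat = torch.randn(64, 165)                # a batch of GNN outputs (toy values)
H_L, recon = model(H_hat)
print(H_L.shape, recon.shape)               # torch.Size([64, 39]) torch.Size([64, 165])
```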
The goal of the autoencoder is to take in the GNN output Ĥ and learn a low-dimensional representation that properly captures the biological information we care about. In order to do so, the model is trained via latent feature regularization, which forces chosen latent features to predict the cell type and cell-cycle phase scores, and to blur the batch features in order to remove spurious batch effects. We combine these with the usual reconstruction loss used to train autoencoders to produce salient features in the encoder representation. Thus the autoencoder is trained with both supervised and self-supervised losses: three supervised losses are applied to the output of the encoder, which gives the final joint embedding, and one self-supervised loss is applied to the output of the decoder. The hidden layer size of Ĥ^L is specifically designed to accommodate enough features for each of these supervised losses. We allocate 22 features for the number of cell types, 10 features for the number of batches, 2 features for the cell-cycle scores, and 5 extra features to allow additional pertinent information to improve the reconstruction loss, giving a total of 39. The losses are then summed to give the total loss L. In detail, we have

L = Lrecon + Lcell type + Lbatch + Lcell-cycle.

The reconstruction loss is simply the mean squared error,

Lrecon(D(E(Ĥ))) = (1/n) ∑_{i=1}^{n} ‖X̂_i − D(E(Ĥ))_i‖²,

where X̂_i and D(E(Ĥ))_i denote the i-th rows of the input and the reconstruction. The cell type loss is a cross-entropy loss on the classification task: if Ĥ^L_{i,1:J} gives the weights corresponding to the J cell classes for input i, and Ci ∈ {1, . . . , J} denotes the class of cell i, then the loss is

Lcell type(Ĥ) = − ∑_{i=1}^{n} ∑_{j=1}^{J} 1_{Ci = cj} log(Ŷ_{ij}),

where 1_{Ci = cj} equals 1 if Ci = cj and 0 otherwise, and

Ŷ_{ij} = exp(Ĥ^L_{ij}) / ∑_{k=1}^{J} exp(Ĥ^L_{ik})

is the softmax function. The loss Lbatch for the batch effect is also a cross-entropy loss, but it attempts to remove batch effects by training the classifier to learn random batch labels. To do so, we use a uniform distribution to generate batch labels for the input and apply the same cross-entropy loss to the 10 batch features in Ĥ^L. The last term, Lcell-cycle, is a simple mean squared error loss comparing each cell's cell-cycle phase scores with the 2 allocated features in Ĥ^L that are supposed to capture these values.
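The combined loss can be sketched as follows, continuing the autoencoder sketch above. The slicing of the 39 latent features (cell-type logits first, then batch logits, then cell-cycle scores, then free features) is an assumption made for illustration; the function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def total_loss(X_hat, H_L, recon, cell_type, cc_score, n_types=22, n_batches=10):
    """Combined loss described above, with the latent features of H_L sliced as:
    [0:22] cell-type logits, [22:32] batch logits, [32:34] cell-cycle scores, [34:39] free.
    cell_type: integer class labels (LongTensor); cc_score: per-cell 2-dim phase scores."""
    l_recon = F.mse_loss(recon, X_hat)                                  # self-supervised
    l_type = F.cross_entropy(H_L[:, :n_types], cell_type)               # supervised cell type
    # batch-confusion loss: the classifier is trained toward *random* batch labels
    random_batches = torch.randint(0, n_batches, (H_L.shape[0],))
    l_batch = F.cross_entropy(H_L[:, n_types:n_types + n_batches], random_batches)
    l_cc = F.mse_loss(H_L[:, n_types + n_batches:n_types + n_batches + 2], cc_score)
    return l_recon + l_type + l_batch + l_cc

# Example usage with the JointEmbeddingAE sketch above:
# H_L, recon = model(H_hat)
# loss = total_loss(H_hat, H_L, recon, cell_type_labels, cell_cycle_scores)
# loss.backward()
```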
Figure 5.3: scJEGNN autoencoder architecture. Each layer is fully connected, and the encoder layers include dropout and batch normalization steps.

5.3 Experimental Results

Table 5.1: Performance on the joint embedding task.

Model          NMI Cluster/label  Cell-Cycle Conservation  Batch ASW  Average Metric
Baseline       .6502              .8259                    .7178      .7313
GLUE           .7754              .8355                    .9100      .8403
Amateur (JAE)  .7723              .9195                    .8898      .8610
scJEGNN        .8057              .9204                    .9112      .8791

We demonstrate the effectiveness of our framework scJEGNN in the joint embedding task for GEX and ATAC and show that the model outperforms the competitor submissions on the three evaluation metrics. Of the 25 teams that submitted models to the NeurIPS 2021 joint embedding task competition, we choose the top two performing teams and show their performance alongside our own. Team Amateur submitted JAE, the autoencoder from which we designed our own: it uses the same latent feature regularization but has additional residual connections and layers and no graphical component. GLUE is also an autoencoder model, guided by an external knowledge graph. We additionally provide a baseline obtained by simply evaluating a concatenation of the two modalities after dimension reduction by principal component analysis (PCA). In Table 5.1 we can see that our model outperforms the other models on every metric, improving the average metric by more than 0.1 over the baseline and by nearly 0.02 over the best competing submission.

CHAPTER 6
CONCLUSION

In this work we have developed two methods for finding low-dimensional representations of high-dimensional data. The first is a unified framework for generalized eigenvalue problems, the GEV estimator. This sparse projection regression framework is a reformulation of an intractable Rayleigh quotient problem and achieves great computational efficiency. We established nonasymptotic error bounds on the proposed estimators for the applications of SIR and LDA, and showed that these rates are minimax optimal. We showed how GEV applies to the CCA problem and adapted the algorithm to a robust Huber-loss based formulation. We tested our framework on both synthetic and real datasets and demonstrated the algorithm's superior performance compared with other state-of-the-art methods for high-dimensional data. The second method is scJEGNN, a graphical neural network tailored to data integration for HDLSS single-cell sequencing data. We showed that, with its unique model, the GNN is able to leverage structural information about the biological data relations in order to perform a joint embedding of multiple modalities of single-cell gene expression data.

6.1 Future Work

GEV. One obvious goal is to show that the same statistical consistency the GEV estimator enjoys for SIR and LDA also holds for the application to CCA. The adaptation of the GEV estimator to CCA required changing the structure of U from a collection of products of eigenvalues and eigenvectors of Ω = cov(E[x|y]) to a combination of the right and left singular vectors of Σxy = Ox D Oy⊤. Setting U = (Ox⊤, Oy⊤)⊤/√2 requires additional work to show that the difference ‖Û − U‖∞,∞ is bounded above by the desired rate of C√(log(d)/n). Similarly, the robust version of the GEV estimator using the Huber loss also offers the possibility of proving strong theoretical rates of convergence. This work requires a new derivation of the bound ‖∇L(W*)‖∞,∞ ≤ λ/2 due to the change in the gradient of the loss; the derivative of the Huber loss is more complicated than that of the Frobenius norm and leads to a piecewise-defined function with additional multiplications by Σ. Lastly, an extension of the GEV estimator to a nonlinear dimension reduction technique is also likely possible. An arbitrary manifold can be approximated locally by linear spaces, which can be estimated using K-nearest neighbors from the sample data. Given these connected affine spaces, we can apply GEV to each to obtain a collection of projections, which we can carefully combine to project the data to a lower-dimensional space. This likely imposes much more stringent requirements on the sample size in each portion of the linear approximation.

GNNs for single-cell tasks. The cell-gene graph of scJEGNN can be used in a number of tasks in single-cell data analysis. The representations gained for the cells after multiple convolution layers may lead to much better estimates for methods like K-nearest neighbors, which many imputation methods rely on. If naively applied, the KNN estimate on the initial, highly sparse data is likely to be unreliable; if the estimate can be improved by applying it to the updated cell representations, the downstream steps taken for imputation could be drastically improved. In addition, these representations could be used for other common node-based tasks like classification and clustering, which lead to cell annotation methods and biological clustering in the single-cell world, or for graph-based tasks, which would lead to methods for disease prediction given cell populations from distinct patients. The methods using this graphical model can be further enhanced with some key alterations to the graph structure. The bipartite graph structure can be extended to a fuller graph that has edges between gene nodes and edges between cell nodes. Gene node edges are important for representing gene pathways, which capture any number of gene causal relations, including gene co-expression or regulatory networks.
In spatial transcriptomics data [Rus16], single-cell sequencing data is given new geometric context so that each small cluster of cells is placed in a 2 or 3-dimensional grid. This spatial data can be included in the cell-gene graph with additional edges between cells (or group of cells) indicating adjacency relations. These collection of methods utilizing this GNN model seem promising for the variety of applications listed, and are being actively developed together into a full GNN-based 82 package for single cell data analysis methods. 83 BIBLIOGRAPHY 84 BIBLIOGRAPHY [AAB+ 20] Ricard Argelaguet, Damien Arnol, Danila Bredikhin, Yonatan Deloro, Britta Velten, John Marioni, and Oliver Stegle. Mofa+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome biology, 21(1):1–17, 2020. [AHB+ 16] Benedict Anchang, Tom Hart, Sean Bendall, Peng Qiu, Zach Bjornson, Michael Linderman, Garry Nolan, and Sylvia Plevritis. Visualization and cellular hierarchy inference of single-cell data using spade. Nature Protocols, 11(7):1264–1279, 2016. [AHMM18] Lun Aaron, Laleh Haghverdi, Michael Morgan, and John Marioni. Batch effects in single-cell rna-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology, 36(5):421–427, 2018. [APYG19] Cédric Arisdakessian, Olivier Poirion, Xun Yunits, Breck Zhu, and Lana Garmire. Deepimpute: an accurate, fast, and scalable deep neural network method to impute single-cell rna-seq data. Genome Biology, 20:1–14, 2019. [BHS+ 18] Andrew Butler, Paul Hoffman, Peter Smibert, Efthymia Papalexi, and Rahul Satija. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology, 36(5):411–420, 2018. [BPC+ 11] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Dis- tributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, pages 1–122, 2011. [BRT09] Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 37:1705–1732, 2009. [BT09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009. [CGRZ13] Mengjie Chen, Chao Gao, Zhao Ren, and Harrison H. Zhou. Sparse CCA via Pre- cision Adjusted Iterative Thresholding. arXiv:1311.6186 [math, stat], November 2013. arXiv: 1311.6186. [CHWE11] Line Clemmensen, Trevor Hastie, Daniela Witten, and Bjarne Ersbøll. Sparse dis- criminant analysis. Technometrics, 53(4):406–413, 2011. [CL98] Chun-Houh Chen and Ker-Chau Li. Can SIR be as popular as multiple linear re- gression? Statistica Sinica, 8:289–316, 1998. [CMW13] Tianwen Tony Cai, Zongming Ma, and Yihong Wu. Sparse PCA: Optimal rates and 85 adaptive estimation. The Annals of Statistics, 41(6):3074–3110, 2013. [Con18] Tabula Muris Consortium. Single-cell transcriptomics of 20 mouse organs creates a tabula muris. Nature, 562(7727):367–372, 2018. [Coo98] Ralph Dennis Cook. Regression graphics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, 1998. [CP11] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for con- vex problems with applications to imaging. Journal of Mathematical Imaging and Vision, (40):120–145, 2011. [DPN14] Yvonne Döring, Christian Pawig, Lukas Weber, and Heidi Noels. 
The cxcl12/cxcr4 chemokine ligand/receptor axis in cardiovascular disease. Frontiers in Physiology, 5, 2014. [ENK16] Anders Eklund, Thomas Nichols, and Hans Knutsson. Cluster failure: Why fmri inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences, (113):7900–7905, 2016. [FFT12] Jianqing Fan, Yang Feng, and Xin Tong. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(4):745–771, September 2012. [FS81] Jerome Friedman and Werner Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817–823, 1981. [GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org. [GGK10] Sergei Grivennikov, Florian Greten, and Michael Karin. Immunity, inflammation, and cancer. Cell, 140(6):883–899, 2010. [GSR+ 17] Justin Gilmer, Samuel Schoenholz, Patrick Riley, Oriol Vinyals, and George Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML, 2017. [GZP21] Boying Gong, Yun Zhou, and Elizabeth Purdom. Cobolt: Joint analysis of multi- modal single-cell sequencing data. bioRxiv, 2021. [HBR+ 17] Adam Haber, Moshe Biton, Noga Rogel, Rebecca H Herbst, Karthik Shekhar, Christopher Smillie, Grace Burgin, Toni M Delorey, Michael R Howitt, Yarden Katz, Itay Tirosh, Semir Beyaz, Danielle Dionne, Mei Zhang, Raktima Raychowd- hury, Wendy Garrett, Orit Rozenblatt-Rosen, Hai Ning Shi, Omer Yilmaz, Ramnik J Xavier, and Aviv Regev. A single-cell survey of the small intestinal epithelium. 86 Nature, 551(7680):333–339, 2017. [HDW+ 20] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, YongDong Zhang, and Meng Wang. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation, page 639–648. Association for Computing Machinery, New York, NY, USA, 2020. [HMN05] Peter Hall, James Stephen Marron, and Amnon Neeman. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):427–444, 2005. [Hot33] Harold Hotelling. Analysis of a complex of statistical variables into principal com- ponents. Journal of Educational Psychology, 24(6):417, 1933. [HST11] David Hardoon and John Shawe-Taylor. Sparse canonical correlation analysis. Machine Learning, 83(3):331–353, 2011. [Hub73] Peter Huber. Robust Regression: Asymptotics, Conjectures and Monte Carlo. The Annals of Statistics, 1(5):799–821, 1973. [ID14] Lionel Ivashkiv and Laura Donlin. Regulation of type i interferon responses. Nature Reviews Immunology, 14(1):36–49, 2014. [IS15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 448–456. JMLR.org, 2015. [KW17] Thomas Kipf and Max Welling. Semi-supervised classification with graph convolu- tional networks. In ICLR, 2017. [LBC+ 21] Malte Luecken, Daniel Bernard Burkhardt, Robrecht Cannoodt, Christopher Lance, Aditi Agrawal, Hananeh Aliee, Ann Chen, Louise Deconinck, Angela Detweiler, Alejandro Granados, et al. A sandbox for prediction and integration of dna, rna, and proteins in single cells. In NeurIPS Datasets and Benchmarks Track (Round 2), 2021. 
[LDJ+ 21] Xiaorui Liu, Jiayuan Ding, Wei Jin, Han Xu, Yao Ma, Zitao Liu, and Jiliang Tang. Graph neural networks with adaptive residual. Advances in Neural Information Processing Systems, 34, 2021. [Len08] Chenlei Leng. Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Computational Biology and Chemistry, 32(6):417– 425, 2008. 87 [LG06] Hongzhe Li and Jiang Gui. Gradient directed regularization for sparse Gaussian con- centration graphs, with applications to inference of genetic networks. Biostatistics (Oxford, England), 7(2):302–317, April 2006. [LHH20] Jeongwoo Lee, Do Young Hyeon, and Daehee Hwang. Single-cell multiomics: technologies and data analysis methods. Experimental Molecular Medicine, 52(9):1428—-1442, 2020. [Li91] Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991. [Li07] Lexin Li. Sparse sufficeint dimension reduction. Biometrika, 94(3):603–613, 2007. [LN06] Lexin Li and Christopher Nachtsheim. Sparse sliced inverse regression. Technometrics, 48(4):503–510, 2006. [LSB+ 18] Jacob Levine, Erin Simonds, Sean Bendall, Kara Davis, El ad Amir, Michelle D Tadmor, Oren Litvin, Harris Fienberg, Astraea Jager, Eli Zunder, Rachel Finck, Amanda Gedman, Ina Radtke, James R Downing, Dana Pe’er, and Garry Nolan. Data-driven phenotypic dissection of aml reveals progenitor-like cells that correlate with prognosis. Cell, 162(1):184–197, 2018. [LWL+ 20] Xiangjie Li, Kui Wang, Yafei Lyu, Huize Pan, Jingxiao Zhang, Dwight Stambolian, Katalin Susztak, Muredach P Reilly, Gang Hu, and Mingyao Li. Deep learning enables accurate clustering with batch effect removal in single-cell rna-seq analysis. Nature Communications, 11(1), 2020. [LYS18] Hanna Levitin, Jinzhou Yuan, and Peter Sims. Single-cell transcriptomic analysis of tumor heterogeneity. Trends Cancer, 4(4):264–268, 2018. [LZL18] Qian Lin, Zhigen Zhao, and Jun Liu. On consistency and sparsity for sliced inverse regression in high dimensions. The Annals of Statistics, 46(2):482–1492, 2018. [LZL19] Qian Lin, Zhigen Zhao, and Jun S. Liu. Sparse sliced inverse regression via lasso. Journal of the American Statistical Association, 114:1726–1739, 2019. [MGH11] Aaron McDaid, Derek Greene, and Neil Hurley. Normalized mutual infor- mation to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515, 2011. [MLZ+ 21] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A uni- fied view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information Knowledge Management, pages 1202–1211, 2021. 88 [MP43] W.S. McCulloch and W Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5:115–133, 1943. [MT21] Yao Ma and Jiliang Tang. Deep Learning on Graphs. Cambridge University Press, 2021. [MvDW+ 19] Kevin Moon, David van Dijk, Zheng Wang, Scott Gigante, Daniel Burkhardt, William Chen, Kristina Yim, Antonia van den Elzen, Matthew Hirn, Ronald Coif- man, Natalia Ivanova, Guy Wolf, and Smita Krishnaswamy. Visualizing struc- ture and transitions in high-dimensional biological data. Nature Biotechnology, 37(12):1482–1492, 2019. [MZY12] Qing Mai, Hui Zou, and Ming Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99(1):29–42, March 2012. [Nes83] Yurii Nesterov. 
A method for solving the convex programming problem with conver- gence rate o(1/k2 ). Proceedings of the USSR Academy of Sciences, 269:543–547, 1983. [Nes04] Yu. Nesterov. Introductory Lectures on Convex Optimization. A Basic Course. 2004. [NHS+ 16] Sonia Nestorowa, Fiona Hamey, Blanca Pijuan Sala, Evangelia Diamanti, Mairi Shepherd, Elisa Laurenti, Nicola Wilson, David Kent, and Berthold Göttgens. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differ- entiation. Blood, 128(8):20–31, 2016. [NLG+ 15] Aaron Newman, Chih Long Liu, Michael Green, Andrew Gentles, Weiguo Feng, Yue Xu, Chuong Hoang, Maximilian Diehn, and Ash Alizadeh. Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, (5):453–457, 2015. [PTB09] Elena Parkhomenko, David Tritchler, and Joseph Beyene. Sparse canonical correla- tion analysis with application to genomic data integration. Statistical Applications in Genetics and Molecular Biology, 8(1):1–34, 2009. [Qui20] Peng Qui. Embracing the dropouts in single-cell rna-seq analysis. Nature Communications, 11(1), 2020. [Roc70] Ralph Tyrell Rockafellar. Convex analysis. Princeton University Press, Princeton, 1970. [Ros58] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958. [Rus16] Nicole Rusk. Spatial transcriptomics. Nature Methods, 13(710), 2016. 89 [RWM+ 18] Bushra Raj, Daniel E Wagner, Aaron McKenna, Shristi Pandey, Allon M Klein, Jay Shendure, James A Gagnon, and Alexander Schier. Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nature Biotechnology, 36(5):442–450, 2018. [RWRY11] Pradeep Ravikumar, Martin Wainwright, Garvesh Raskutti, and Bin Yu. High- dimensional covariance estimation by minimizing l1-penalized log-determinant di- vergence. Electronic Journal of Statistics, 5:935–980, 2011. [RZL+ 21] Jiahua Rao, Xiang Zhou, Yutong Lu, Huiying Zhao, and Yuedong Yang. Imputing single-cell rna-seq data by combining graph convolution and autoencoder neural networks. iScience, 24(5):102393, 2021. [SCR14] William Schneider, Meike Chevillotte, and Charles Rice. Interferon-stimulated genes: a complex web of host defenses. Annual Review of Immunology, 32:513– 545, 2014. [SDB+ 18] William Stephenson, Laura Donlin, Andrew Butler, Cristina Rozo, Bernadette Bracken, Ali Rashidfarrokhi, Susan Goodman, Lionel Ivashkiv, Vivian Bykerk, Dana Orange, Robert Darnell, Harold Swerdlow, and Rahul Satija. Single-cell rna- seq of rheumatoid arthritis synovial tissue using low-cost microfluidic instrumenta- tion. Nature Communications, 9(1):791, 2018. [SJ15] Phillip Santoiemma and Daniel Powell Jr. Tumor infiltrating lymphocytes in ovarian cancer. Cancer Biology Therapy, 16(6):807–820, 2015. [SNL+ 17] Valentine Svensson, Kedar Natarajan, Lam-Ha Ly, Ricardo Miragaia, Charlotte La- balette, Iain Macaulay, Ana Cvejic, and Sarah Teichmann. Power analysis of single- cell rna-sequencing experiments. Nature Methods, 14:381–387, 2017. [SSZ+ 17] Uri Shaham, Kelly Stanton, Jun Zhao, Huamin Li, Khadir Raddassi, Ruth Mont- gomery, and Yuval Kluger. Removal of batch effects using distribution-matching residual networks. Bioinformatics, 33(16):2539—-2546, 2017. [SSZM16] Dan Shen, Haipeng Shen, Hongtu Zhu, and JS Marron. The statistics and mathemat- ics of high dimension low sample size asymptotics. Statistica Sinica, 26(4):1747, 2016. 
[STM+ 05] Aravind Subramanian, Pablo Tamayo, Vamsi Mootha, Sayan Mukherjee, Benjamin Ebert, Michael Gillette, Amanda Paulovich, Scott Pomeroy, Todd Golub, Eric Lander, and Jill Mesirov. Gene set enrichment analysis: a knowledge-based ap- proach for interpreting genome-wide expression profiles. Proceeds of the National Academy of Sciences of the United States of America, 102(43):15545–15550, 2005. [SYG+ 05] Franco Scarselli, Sweah Liang Yong, Marco Gori, Markus Hagenbuchner, 90 Ah Chung Tsoi, and Marco Maggini. Graph neural networks for ranking web pages. Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, pages 666–672, 2005. [TBH+ 17] Po-Yuan Tung, John Blischak, Chiaowen Joyce Hsiao, David Knowles, Jonathan Burnett, Jonathan Pritchard, and Yoav Gilad. Batch effects and the effective design of single-cell gene expression studies. Scientific reports, 7, 2017. [THNC03] Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Class prediction by nearest shrunken centroids, with applications to dna microar- rays. Statistical Science, 18(1):104–117, 2003. [TSY20] Kai Tan, Lei Shi, and Zhou Yu. Sparse sir: Optimal rates and adaptive estimation. The Annals of Statistics, 48(1):64–85, 2020. [vDSN+ 18] David van Dijk, Roshan Sharma, Juozas Nainys, Kristina Yim, Pooja Kathail, Am- brose Carr, Cassandra Burdziak, Kevin Moon, Christine L Chaffer, Diwakar Pat- tabiraman, Brian Bierie, Linas Mazutis, Guy Wolf, Smita Krishnaswamy, and Dana Pe’er. Recovering gene interactions from single-cell data using data diffusion. Cell, 174(3):716–729, 2018. [WAH+ 19] Jingshu Wang, Divyansh Agarwal, Mo Huang, Gang Hu, Zilu Zhou, Chengzhong Ye, and Nancy Zhang. Data denoising with transfer learning in single-cell transcrip- tomics. Nature Methods, 16:875—-878, 2019. [Wai19] Martin Wainright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019. [WCZZ18] Tao Wang, Mengjie Chen, Hongyu Zhao, and Lixing Zhu. Estimating a sparse re- duction for general regression in high dimensions. Statistics and Computing, 28:33– 46, 2018. [WDJ+ 22] Hongzhi Wen, Jiayuan Ding, Wei Jin, Xie Yuying, and Jiliang Tang. Graph neural networks for multimodal single-cell data integration, 2022. [WKI08] Ami Wiesel, Mark Kliger, and Alfred Hero III. A greedy approach to sparse canon- ical correlation analysis. arXiv preprint arXiv:0801.2748, 2008. [WMC+ 21] Juexin Wang, Anjun Ma, Yuzhou Chang, Jianting Gong, Yuexu Jiang, Ren Qi, Cankun Wang, Hongjun Fu, Qin Ma, and Dong Xu. scGNN is a novel graph neu- ral network framework for single-cell RNA-seq analyses. Nature Communications, 12(1882), 2021. [WN18] Maartje Wouters and Brad Nelson. Prognostic significance of tumor-infiltrating b cells and plasma cells in human cancer. Clinical Cancer Research, 24(24):6125– 91 6135, 2018. [WPL15] Lan Wang, Bo Peng, and Runze Li. A high-dimensional nonparametric multivariate test for mean vector. Journal of the American Statistical Association, (110):1658– 1669, 2015. [WSZ+ 19] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Wein- berger. Simplifying graph convolutional networks. In Proceedings of the 36th International Conference on Machine Learning, pages 6861–6871, 2019. [WT11] D M Witten and R Tibshirani. Penalized classification using Fisher’s linear discrim- inant. Journal of Royal Statistical Society, Series B, 73:753–772, 2011. [WTH09] Daniela Witten, Robert Tibshirani, and Trevor Hastie. 
A penalized matrix decom- position, with applications to sparse principal components and canonical correlation analysis. Biostatistics (Oxford, England), 10(3):515–534, 2009. [XLV16] Yuying Xie, Yufeng Liu, and William Valdar. Joint estimation of multiple depen- dent Gaussian graphical models with applications to mouse genomics. Biometrika, 103(3):493–511, September 2016. [XTLZ02] Yingcun Xia, Howell Tong, W. K. Li, and Li-Xing Zhu. An adaptive estimation of dimension reduction space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):363–410, 2002. [Yu97] Bin Yu. Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pages 423–435, 1997. [ZMY18] Hui Zou, Qing Mai, and Yi Yang. Multiclass Sparse Discriminant Analysis. Statistica Sinica, 2018. [ZNC+ 19] Hamim Zafar, Nicholas Navin, Ken Chen, , and Luay Nakhleh. SiCloneFit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data. Genome research, 29(11):1847– 1859, 2019. [ZWT19] Feng Zhang, Yu Wu, and Weidong Tian. A novel approach to remove the batch effect of single-cell data. Cell Discovery, 5(46), 2019. 92