EXPLORING LOW-RANK PRIOR IN HIGH-DIMENSIONAL DATA

By He Lyu

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science and Engineering - Doctor of Philosophy

2023

ABSTRACT

High-dimensional data plays a ubiquitous role in real applications, ranging from biology and computer vision to social media. The large dimensionality poses new challenges to statistical methods due to the "curse of dimensionality". To overcome these challenges, many statistical and machine learning approaches have been developed that impose additional assumptions on the data. One popular assumption is the low-rank prior, which assumes that the high-dimensional data lies in a low-dimensional subspace and therefore approximately exhibits low-rank structure. In this dissertation, we explore various applications of the low-rank prior.

Chapter 2 studies the stability of leading singular subspaces. Various widely used algorithms in numerical analysis, matrix completion, and matrix denoising are based on the low-rank assumption, such as Principal Component Analysis and Singular Value Hard Thresholding. Many of these methods involve the computation of the Singular Value Decomposition (SVD). To study the stability of these algorithms, in Chapter 2 we establish a useful set of formulae for the $\sin\Theta$ distance between the original and the perturbed singular subspaces. Following this, we further derive a collection of new results on SVD perturbation related problems.

In Chapter 3, we employ the low-rank prior for manifold denoising problems. Specifically, we generalize the Robust PCA (RPCA) method to the manifold setting and propose an optimization framework that separates the sparse component from the noisy data. It is worth noting that in this chapter, we generalize the low-rank prior to a more general form to accommodate data with a more complex structure: instead of assuming the data itself lies in a low-dimensional subspace as in RPCA, we assume the clean data is distributed around a low-dimensional manifold. Therefore, if we consider a local neighborhood, the corresponding sub-matrix will be approximately low rank.

Subsequently, in Chapter 4 we study the stability of invariant subspaces for eigensystems. Specifically, we focus on the case where the eigensystem is ill-conditioned and explore how the condition numbers affect the stability of invariant subspaces.

The material presented in this dissertation encompasses several publications and preprints in the fields of Statistics, Numerical Linear Algebra, and Machine Learning, including Lyu and Wang (2020a); Lyu et al. (2019); Lyu and Wang (2022).

Copyright by HE LYU 2023

To my family for their love and support.

ACKNOWLEDGEMENTS

There are many people to whom I am indebted for their generous support throughout my doctorate period. First of all, I would like to express my deepest gratitude to my advisor, Professor Rongrong Wang. Words alone cannot fully express the profound influence she has had on my professional development. Her brilliance, passion, and diligence greatly inspired me during my doctoral studies, and I sincerely appreciate her constant guidance and support during these five years. I am extremely lucky to have an advisor who always responds to my questions so promptly and patiently, and who cares so much about my research as well as my personal development.
I also would like to thank all my other committee members, Professor Yuying Xie, Professor Mark Iwen, and Professor Jiliang Tang, for their guidance and advice, and for providing helpful feedback and suggestions.

I am fortunate to have a wonderful group of colleagues and friends who have greatly encouraged and supported me over the years. I thank Jieqian He, Shuyang Qin, Ningyu Sha, Hao Wang, Runze Su, Zhuangdi Zhu, Avrajit Ghosh, Siddhant Gautam, Shijun Liang, Angqi Li, Xitong Zhang, Guangliang Liu, Zhiyu Xue, and Haitao Mao for making my graduate years colorful, and my gratitude goes to Wei Ao, Kang Yu, Tianxudong Tang, Ziyi Xi, and Jiaxin Yang for their help during my thumb injury. It is unfortunately impossible to name everyone here, so please accept my apologies if I have left anyone out.

Most of all, I owe my deepest thanks to my family, who have always been my strongest supporters. Their encouragement and love have been the backbone of my success. I feel incredibly fortunate to have them in my life.

TABLE OF CONTENTS

CHAPTER 1  INTRODUCTION ........ 1
CHAPTER 2  MATRIX PERTURBATION ANALYSIS AND ITS APPLICATIONS ........ 15
CHAPTER 3  MANIFOLD DENOISING BY NONLINEAR ROBUST PCA ........ 51
CHAPTER 4  PERTURBATION OF INVARIANT SUBSPACES FOR ILL-CONDITIONED EIGENSYSTEM ........ 74
BIBLIOGRAPHY ........ 89
APPENDIX A  APPENDIX FOR CHAPTER 2 ........ 95
APPENDIX B  APPENDIX FOR CHAPTER 3 ........ 102
APPENDIX C  APPENDIX FOR CHAPTER 4 ........ 117

CHAPTER 1  INTRODUCTION

In the fields of statistics and machine learning, one frequently encounters the task of analyzing high-dimensional data. Since high dimensionality poses a great challenge to traditional methods, new methods specifically designed for high-dimensional data have been developed. A promising approach to tackle the curse of dimensionality is to make prior assumptions on the data Candes et al. (2011); Lee et al. (2006); Lyu and Wang (2020b); Krahmer et al. (2023). In this dissertation, we focus on the low-intrinsic-dimensionality prior, which assumes that the high-dimensional data lies around a low-dimensional manifold. In the special case when the manifold is a linear subspace, this prior reduces to the standard low-rank prior.

The low-rank assumption underlies many popular statistical and machine learning algorithms, such as Principal Component Analysis and Singular Value Hard Thresholding. A large portion of this dissertation explores the robustness of the reconstruction under the low-rank prior for various applications. In particular, we analyze the fundamental perturbation problem of the Singular Value Decomposition (SVD). Due to the great importance of the SVD to data science and its sensitivity to noise, studying its stability is crucial for the reliability of the many machine learning algorithms that involve the SVD. In Chapter 2, we study the stability of each of the SVD factors as well as those of their combinations, and derive a collection of new results on SVD perturbation problems.
Beyond linear models, which are suitable for the original PCA Pearson (1901), real-world data may be generated from more complicated nonlinear models, and different types of noise may be present in the data, such as sparse noise and Gaussian noise. In Chapter 3, we generalize the low-rank assumption to the manifold setting, where we assume that the observed data is generated from clean data distributed around a low-dimensional manifold, contaminated by sparse noise and Gaussian noise. An optimization framework is proposed to separate the sparse noise from the data. The performance of the proposed method is tested on both synthetic and real datasets. In addition, we also provide a theoretical error bound for the proposed algorithm.

In the study of network analysis, the eigenvectors can provide useful information on the network, such as the centrality of nodes and their connectivity. It is often desirable to identify how sensitive the corresponding invariant subspaces are to noise. In Chapter 4, we investigate the stability of invariant subspaces for diagonalizable matrices, with a specific focus on the case when the eigensystem is ill-conditioned. We explore the impact of condition numbers on the stability of invariant subspaces, and derive an improved perturbation bound for the ill-conditioned situation.

Before presenting more detailed discussions of our methods in each chapter of this dissertation, we first introduce some necessary notation.

1.1 Notations

Throughout this dissertation, for any vector $x \in \mathbb{R}^n$, we denote by $\|x\|$ its $\ell_2$-norm, and by $\|x\|_\infty = \max_i |x_i|$ its infinity norm. For any matrix $A$, let $\sigma_i(A)$ be the $i$th largest singular value, $\|A\| = \sigma_1(A)$ denote the spectral norm, $\|A\|_F = \sqrt{\sum_{i,j} |A_{i,j}|^2} = \sqrt{\sum_i \sigma_i^2(A)}$ represent the Frobenius norm, $\|A\|_\infty = \max_{i,j} |A_{i,j}|$ be the maximum of the entrywise absolute values, and $\|A\|_1 = \sum_{i,j} |A_{i,j}|$ denote the sum of the absolute values. We use $A = Q_A R_A$ to denote the QR decomposition of $A$, $S(A)$ to denote the spectrum of $A$, and $\rho(A)$ the spectral radius of $A$. For a complex matrix $A$, $A^*$ is its conjugate transpose; if the matrix $A$ is real, we denote its transpose by $A^T$. The condition number of $A$ is $\kappa_2(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$, where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ are the largest and smallest singular values of $A$, respectively. In addition, $\#(\cdot)$ is the cardinality of a set, and $\mathbb{O}_r$ is the orthogonal matrix group in dimension $r$. For two functions $f(n)$ and $g(n)$, the big-O notation $f = O(g)$ means that $f$ is asymptotically upper bounded by $g$ as $n \to +\infty$, and $f = \Omega(g)$ means that $f$ is asymptotically lower bounded by $g$ as $n \to +\infty$. Furthermore, we use $e_i$ to denote the $i$th canonical basis vector, and $B(c, r)$ to denote a disk in the complex plane centered at $c$ with radius $r$.

1.2 Problem setup for perturbation analysis

This dissertation prominently features matrix perturbation analysis. Here we present the problem settings of singular subspace perturbation and invariant subspace perturbation, which are the focus of Chapter 2 and Chapter 4, respectively.
1.2.1 The singular subspace perturbation problem

For a rectangular matrix $A \in \mathbb{R}^{n \times m}$, let $\Delta A$ denote an unobserved perturbation matrix, so that the observed perturbed matrix is $\widetilde{A} = A + \Delta A$. We write the conformal SVDs of the original matrix $A$ and the perturbed matrix $\widetilde{A}$ in block matrix form
$$A = U \Sigma V^T = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Sigma_1 & \\ & \Sigma_2 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix}, \qquad \widetilde{A} = \widetilde{U} \widetilde{\Sigma} \widetilde{V}^T = \begin{pmatrix} \widetilde{U}_1 & \widetilde{U}_2 \end{pmatrix} \begin{pmatrix} \widetilde{\Sigma}_1 & \\ & \widetilde{\Sigma}_2 \end{pmatrix} \begin{pmatrix} \widetilde{V}_1^T \\ \widetilde{V}_2^T \end{pmatrix}. \tag{1.1}$$
Here $U_1 \in \mathbb{R}^{n \times r}$, $U_2 \in \mathbb{R}^{n \times (n-r)}$, $V_1 \in \mathbb{R}^{m \times r}$, $V_2 \in \mathbb{R}^{m \times (m-r)}$, $[U_1, U_2] \in \mathbb{R}^{n \times n}$ and $[V_1, V_2] \in \mathbb{R}^{m \times m}$ are orthogonal matrices, $\Sigma_1 = \mathrm{diag}\{\sigma_1, \sigma_2, \ldots, \sigma_r\} \in \mathbb{R}^{r \times r}$, $\Sigma_2 = \mathrm{diag}\{\sigma_{r+1}, \sigma_{r+2}, \ldots, \sigma_{\min\{m,n\}}\} \in \mathbb{R}^{(n-r) \times (m-r)}$, and the singular values are indexed in non-increasing order, i.e., $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r \geq \sigma_{r+1} \geq \cdots \geq \sigma_{\min\{m,n\}}$. When $n \neq m$, $\Sigma_2$ is rectangular, and the extra columns/rows are padded with 0s. The decomposition of $\widetilde{A}$ has a similar structure with non-increasing singular values $\widetilde{\Sigma}_1 = \mathrm{diag}\{\widetilde{\sigma}_1, \widetilde{\sigma}_2, \ldots, \widetilde{\sigma}_r\} \in \mathbb{R}^{r \times r}$, $\widetilde{\Sigma}_2 = \mathrm{diag}\{\widetilde{\sigma}_{r+1}, \widetilde{\sigma}_{r+2}, \ldots, \widetilde{\sigma}_{\min\{m,n\}}\} \in \mathbb{R}^{(n-r) \times (m-r)}$.

This dissertation is mostly concerned with the case where there is a significant gap between $\sigma_r$ and $\sigma_{r+1}$. In Chapter 2, we investigate the stability of each of the SVD factors and their combinations; one example is to estimate the distance between the leading singular subspaces under perturbation, i.e., the distance between $\mathrm{span}(U_1)$ and $\mathrm{span}(\widetilde{U}_1)$.

1.2.2 The invariant subspace perturbation problem

Let $A \in \mathbb{C}^{n \times n}$ be a diagonalizable matrix. An invariant subspace $\mathcal{X}$ of $A$ is one that satisfies $A\mathcal{X} \subseteq \mathcal{X}$. When a perturbation $\Delta A$ is added to $A$, its invariant subspace $\mathcal{X}$ will be perturbed accordingly. More precisely, any invariant subspace of a diagonalizable matrix is spanned by a subset of the right eigenvectors. Suppose $A = X \Lambda X^{-1}$ is the eigen-decomposition of $A$, where $X$ contains the normalized eigenvectors as columns. Partition $X$ into two blocks $X = [X_1, X_2]$,
$$A [X_1, X_2] = [X_1, X_2] \begin{pmatrix} \Lambda_1 & \\ & \Lambda_2 \end{pmatrix}, \tag{1.2}$$
where $\Lambda_1$ and $\Lambda_2$ are diagonal matrices containing eigenvalues of $A$; then $\mathcal{X}_1 = \mathrm{span}(X_1)$ is an invariant subspace of $A$. In addition, suppose that the eigenvalues in $\Lambda_1$ and $\Lambda_2$ have a gap, which ensures that with a sufficiently small perturbation, the eigen-decomposition of the perturbed matrix $\widetilde{A} = A + \Delta A$ has a similar block structure,
$$\widetilde{A} [\widetilde{X}_1, \widetilde{X}_2] = [\widetilde{X}_1, \widetilde{X}_2] \begin{pmatrix} \widetilde{\Lambda}_1 & \\ & \widetilde{\Lambda}_2 \end{pmatrix}, \tag{1.3}$$
and therefore we can estimate $\mathcal{X}_1$ by $\widetilde{\mathcal{X}}_1 = \mathrm{span}(\widetilde{X}_1)$. We study the behavior of the invariant subspace $\mathcal{X}_1$ under perturbation, which is measured by the distance between $\mathcal{X}_1$ and $\widetilde{\mathcal{X}}_1$.
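To make this setup concrete, here is a minimal numerical sketch in Python/NumPy; the matrix size, eigenvalues, and noise level are illustrative assumptions, not quantities from the dissertation. It builds a diagonalizable matrix with a gap between the leading and remaining eigenvalues, perturbs it, and compares the leading invariant subspace before and after. The last lines compute the largest principal angle between the two subspaces, a notion made precise in Section 1.2.3 below.

import numpy as np

# Sketch of the invariant subspace perturbation setup (illustrative values).
rng = np.random.default_rng(0)
n, r = 8, 3

# Diagonalizable A = X Lambda X^{-1} with a gap between Lambda_1 (the first r
# eigenvalues) and Lambda_2 (the rest).
X = rng.standard_normal((n, n))
lam = np.concatenate([[10.0, 9.0, 8.0], rng.uniform(0.0, 1.0, n - r)])
A = X @ np.diag(lam) @ np.linalg.inv(X)
A_tilde = A + 1e-3 * rng.standard_normal((n, n))   # perturbed matrix

def leading_invariant_subspace(M, r):
    """Orthonormal basis for the span of the r eigenvectors of M
    whose eigenvalues have the largest modulus."""
    w, V = np.linalg.eig(M)
    idx = np.argsort(-np.abs(w))[:r]
    Q, _ = np.linalg.qr(V[:, idx])                 # orthonormalize the basis
    return Q

Q1 = leading_invariant_subspace(A, r)
Q1_t = leading_invariant_subspace(A_tilde, r)

# Largest principal angle between span(X_1) and its perturbed version.
gammas = np.linalg.svd(Q1.conj().T @ Q1_t, compute_uv=False)
print("largest sin(theta):", np.sqrt(max(0.0, 1.0 - gammas.min() ** 2)))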
1.2.3 Distance between subspaces

Notice that in both singular subspace perturbation and invariant subspace perturbation, the target is to estimate the distance between two subspaces. It is therefore necessary to have a well-defined metric that measures the difference between any two subspaces of the same dimension.

For two matrices $U_1, \widetilde{U}_1 \in \mathbb{C}^{n \times r}$ with orthonormal columns, a natural measure of the distance between $\mathrm{span}(U_1)$ and $\mathrm{span}(\widetilde{U}_1)$ is given by their canonical angles. Let the singular values of $U_1^* \widetilde{U}_1$ be $\gamma_1 \geq \gamma_2 \geq \cdots \geq \gamma_r \geq 0$; then $\cos^{-1} \gamma_i$, $i = 1, \ldots, r$, are the principal angles, and the $\sin\Theta$ matrix is the following diagonal matrix:
$$\sin\Theta(U_1, \widetilde{U}_1) = \mathrm{diag}\left\{ \sin(\cos^{-1}(\gamma_1)), \sin(\cos^{-1}(\gamma_2)), \ldots, \sin(\cos^{-1}(\gamma_r)) \right\}.$$
The matrix that holds all the $\tan\Theta$ angles is
$$\tan\Theta(U_1, \widetilde{U}_1) = \mathrm{diag}\left\{ \tan(\cos^{-1}(\gamma_1)), \tan(\cos^{-1}(\gamma_2)), \ldots, \tan(\cos^{-1}(\gamma_r)) \right\}.$$
The angles are usually measured under either the spectral norm $\|\sin\Theta(U_1, \widetilde{U}_1)\|$ or the Frobenius norm $\|\sin\Theta(U_1, \widetilde{U}_1)\|_F$. It is well known that (see e.g., Cai and Zhang (2018); Knyazev and Argentati (2002))
$$\|\sin\Theta(U_1, \widetilde{U}_1)\| = \|U_2^* \widetilde{U}_1\| = \|\widetilde{U}_2^* U_1\|, \tag{1.4}$$
$$\|\sin\Theta(U_1, \widetilde{U}_1)\|_F = \|U_2^* \widetilde{U}_1\|_F = \|\widetilde{U}_2^* U_1\|_F, \tag{1.5}$$
where $U_2$ is the orthogonal complement of $U_1$ as defined in (1.1).

1.3 Mathematical background

Let us warm up with some mathematical background needed for this dissertation. We start by introducing the relevant norms and norm relations in Section 1.3.1, followed by a brief review of classical matrix perturbation results in Section 1.3.2. Then we move to fundamental large deviation inequalities, which are used extensively in our proofs.

1.3.1 Norms and norm relations

The content in Chapter 2 features the following $\ell_{2,\infty}$-norm.

Definition 1.3.1. For a matrix $A$, its $\ell_{2,\infty}$-norm is defined as
$$\|A\|_{2,\infty} := \sup_{\|x\|_2 = 1} \|Ax\|_\infty.$$

Remark 1.3.1. One can directly see that $\|A\|_{2,\infty} = \max_i \|a_i\|_2$, where $a_i^T$ is the $i$th row of $A$. This relation indicates that the $\ell_{2,\infty}$-norm of a matrix $A$ is the maximum of the $\ell_2$-norms of its rows.

Here we list several helpful propositions on matrix norms that are used extensively in the proofs of this dissertation.

Proposition 1.3.2. For a matrix $A \in \mathbb{R}^{n \times m}$, the following relation between the spectral norm and the $\ell_{2,\infty}$-norm holds:
$$\|A\|_{2,\infty} \leq \|A\| \leq \sqrt{n}\, \|A\|_{2,\infty}.$$

Proof: For any unit vector $x \in \mathbb{R}^m$, $\|Ax\| = \sqrt{\sum_{i=1}^n |a_i^T x|^2} \leq \sqrt{n \max_i |a_i^T x|^2} \leq \sqrt{n}\, \|A\|_{2,\infty}$. On the other hand, $\|Ax\| = \sqrt{\sum_{i=1}^n |a_i^T x|^2} \geq \max_i |a_i^T x|$, and taking the supremum over unit vectors $x$ gives $\|A\| \geq \|A\|_{2,\infty}$. □

Proposition 1.3.2 indicates that when the number of rows $n$ is large, $\|A\|_{2,\infty}$ can be much smaller than the spectral norm of $A$. For example, when $A$ has all 1s in its first column and 0s elsewhere, $\|A\|_{2,\infty} = 1$, while the spectral norm is $\|A\| = \sqrt{n}$.

Proposition 1.3.3. For a matrix $A \in \mathbb{R}^{n \times m}$, the following relation between the spectral norm and the Frobenius norm holds:
$$\|A\| \leq \|A\|_F \leq \sqrt{\mathrm{rank}(A)}\, \|A\|.$$

Proposition 1.3.3 establishes that the Frobenius norm is lower bounded by the spectral norm and upper bounded by the spectral norm multiplied by the square root of the matrix rank.
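The norm relations above are easy to check numerically. Below is a small sanity check, as a Python/NumPy sketch with illustrative sizes, of Propositions 1.3.2 and 1.3.3 using the example just discussed (all 1s in the first column).

import numpy as np

# Sanity check of Propositions 1.3.2 and 1.3.3 (illustrative sizes).
n, m = 100, 20
A = np.zeros((n, m))
A[:, 0] = 1.0                                      # all 1s in the first column

l2inf = np.max(np.linalg.norm(A, axis=1))          # ||A||_{2,inf}: largest row l2-norm
spec = np.linalg.norm(A, ord=2)                    # spectral norm ||A||
fro = np.linalg.norm(A, ord='fro')                 # Frobenius norm ||A||_F

assert l2inf <= spec <= np.sqrt(n) * l2inf                       # Proposition 1.3.2
assert spec <= fro <= np.sqrt(np.linalg.matrix_rank(A)) * spec   # Proposition 1.3.3
print(l2inf, spec, fro)                            # 1.0, 10.0, 10.0 here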
For matrix ๐ด โˆˆ R๐‘›ร—๐‘š , ๐ต โˆˆ R๐‘šร—๐‘˜ , then โˆฅ ๐ด๐ตโˆฅ 2,โˆž โ‰ค โˆฅ ๐ดโˆฅ 2,โˆž โˆฅ๐ตโˆฅ. Proof: โˆฅ ๐ด๐ตโˆฅ 2,โˆž = max โˆฅ๐‘Ž๐‘‡๐‘– ๐ตโˆฅ โ‰ค max โˆฅ๐‘Ž๐‘‡๐‘– โˆฅโˆฅ๐ตโˆฅ = โˆฅ ๐ดโˆฅ 2,โˆž โˆฅ๐ตโˆฅ, ๐‘– ๐‘– where ๐‘Ž๐‘‡๐‘– is the ๐‘–th row of ๐ด. โ–ก Remark 1.3.5. Compared with โ„“2 -norm, โ„“2,โˆž -norm can provide finer entrywise control, which is beneficial in applications such as recommender systems since it is usually desirable to have a uniform bound on all individuals. Besides, bounding โ„“2,โˆž -norm is of great importance in the analysis of exact recovery for many statistical problems, such as Stochastic Block Model (SBM), Matrix completion, and Censored Block Model Abbe et al. (2020). In singular subspace perturbation analysis, we are interested in using โ„“2,โˆž -norm to characterize the distance between singular subspaces ๐‘ ๐‘๐‘Ž๐‘›(๐‘ˆ1 ) and ๐‘ ๐‘๐‘Ž๐‘›(๐‘ˆ e1 ). A desirable quantity we aim to bound is inf โˆฅ๐‘ˆ e1 โˆ’ ๐‘ˆ1 ๐‘„โˆฅ 2,โˆž , ๐‘„โˆˆO๐‘Ÿ where ๐‘„ is a rotation matrix and O๐‘Ÿ is the orthogonal matrix group in dimension ๐‘Ÿ. In other words, we consider the difference between ๐‘ˆ e1 and ๐‘ˆ1 after they are maximally aligned by a proper rotation ๐‘„. Since O๐‘Ÿ is compact, the infimum can be achieved. Proposition 1.3.6 (Lemma 1 in Cai and Zhang (2018)). Write the SVD of ๐‘ˆ1๐‘‡ ๐‘ˆ e1 as ๐‘ˆ๐‘‡ ๐‘ˆ e = ๐‘„ 1 ๐‘†๐‘„๐‘‡ , 1 1 2 denote ๐‘„๐‘ˆ = ๐‘„ 1 ๐‘„๐‘‡2 , then โˆš โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ e1 )โˆฅ โ‰ค inf โˆฅ๐‘ˆ e1 โˆ’ ๐‘ˆ1 ๐‘„โˆฅ โ‰ค โˆฅ๐‘ˆ e1 โˆ’ ๐‘ˆ1 ๐‘„๐‘ˆ โˆฅ โ‰ค 2โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ e1 )โˆฅ, ๐‘„โˆˆO๐‘Ÿ โˆš โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ e1 )โˆฅ ๐น โ‰ค inf โˆฅ๐‘ˆ e1 โˆ’ ๐‘ˆ1 ๐‘„โˆฅ ๐น = โˆฅ๐‘ˆ e1 โˆ’ ๐‘ˆ1 ๐‘„๐‘ˆ โˆฅ ๐น โ‰ค 2โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ e1 )โˆฅ ๐น . ๐‘„โˆˆO๐‘Ÿ This proposition relates the quantity inf ๐‘„โˆˆO๐‘Ÿ โˆฅ๐‘ˆ e1 โˆ’๐‘ˆ1 ๐‘„โˆฅ to the sin ฮ˜ distance, which combined with Proposition 1.3.2 further leads to a trivial upper bound of the โ„“2,โˆž -norm perturbation error (see (2.1)). 6 Proposition 1.3.7 (Weylโ€™s inequality). Assume matrix ๐ด โˆˆ R๐‘›ร—๐‘š has singular values indexed in non-increasing order ๐œŽ1 โ‰ซ ๐œŽ2 โ‰ซ ยท ยท ยท โ‰ซ ๐œŽmin{๐‘›,๐‘š} , the perturbed matrix ๐ด e = ๐ด + ฮ”๐ด has singular values e ๐œŽ1 โ‰ซ e ๐œŽ2 โ‰ซ ยท ยท ยท โ‰ซ e ๐œŽmin{๐‘›,๐‘š} . For any 1 โ‰ค ๐‘˜ โ‰ค min{๐‘›, ๐‘š}, the following inequality holds |๐œŽ๐‘˜ โˆ’ e ๐œŽ๐‘˜ | โ‰ค โˆฅฮ”๐ดโˆฅ. Weylโ€™s inequality above provides an error bound on the singular values under perturbation, it indicates singular values are fairly stable under relatively small perturbation ฮ”๐ด. There is a wealth of applications of Weylโ€™s bound in stability analysis. However, for eigenvalues of diagonalizable matrices which are not necessarily symmetric, the following theorem indicates their sensitivity to noise needs to be estimated by the condition number. Theorem 1.3.8 (Bauer-Fike theorem). Let ๐ด โˆˆ C๐‘›ร—๐‘› be a diagonalizable matrix with eigen de- composition ๐ด = ๐‘‹ฮ›๐‘‹ โˆ’1 , where ฮ› is a diagonal matrix containing eigenvalues and ๐‘‹ โˆˆ C๐‘›ร—๐‘› is a non-singular eigenvector matrix. Suppose ๐œ‡ is an eigenvalue of the perturbed matrix ๐ด e = ๐ด + ฮ”๐ด, then there exists ๐œ† โˆˆ ฮ›, such that |๐œ† โˆ’ ๐œ‡| โ‰ค ๐œ…2 (๐‘‹)โˆฅฮ”๐ดโˆฅ, where ๐œ…2 (๐‘‹) = โˆฅ ๐‘‹ โˆฅโˆฅ ๐‘‹ โˆ’1 โˆฅ is the condition number of ๐‘‹. We will also need the definition of incoherence for the perturbation analysis. 
We will also need the definition of incoherence for the perturbation analysis.

Definition 1.3.2 (Incoherent). A matrix $U \in \mathbb{R}^{n \times r}$ with orthonormal columns ($n \geq r$) is said to be $\mu$-incoherent ($\mu \geq 1$) if $\|U\|_{2,\infty} \leq \mu \sqrt{r/n}$.

Remark 1.3.9. It is worth noting that since $U \in \mathbb{R}^{n \times r}$ has orthonormal columns, if we denote $U = [u_1, u_2, \cdots, u_n]^T$, then $\sum_i \|u_i\|^2 = r$, and by Remark 1.3.1 we directly have $\|U\|_{2,\infty} = \max_i \|u_i\| \geq \sqrt{r/n}$. On the other hand, by Proposition 1.3.2, $\|U\|_{2,\infty} \leq \|U\| = 1$. Therefore, we have upper and lower bounds for $\mu$: $1 \leq \mu \leq \sqrt{n/r}$. The upper bound is achieved for a standard basis, where each $u_i$ has 1 on one coordinate and 0s elsewhere. When every entry in the matrix $U$ has the same magnitude, the lower bound is achieved. A small $\mu$ indicates that the mass of $U$ is well spread across its entries. It has been pointed out in Candes and Recht (2012) that, with high probability, the coherence is naturally bounded by some absolute constant for random orthogonal models. Indeed, the bounded coherence assumption is used extensively in the analysis of matrix completion, matrix denoising, and matrix perturbation Candes et al. (2011); Candes and Recht (2012); Cape et al. (2019b); Abbe et al. (2020, 2022).

1.3.2 Classical singular perturbation results

Singular subspace perturbation has been studied extensively in the literature; here we include two classical results, the Davis-Kahan bound Davis and Kahan (1970) and Wedin's $\sin\Theta$ theorem Wedin (1972). Interested readers are referred to Stewart (2006); Dopico (2000); Dopico and Moro (2002) for more relevant perturbation bounds.

Theorem 1.3.10 (Davis-Kahan $\sin\Theta$ theorem). Assume $A \in \mathbb{R}^{n \times n}$ and $\widetilde{A} = A + \Delta A \in \mathbb{R}^{n \times n}$ are two symmetric matrices with the following eigen-decompositions:
$$A = U \Lambda U^T = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Lambda_1 & \\ & \Lambda_2 \end{pmatrix} \begin{pmatrix} U_1^T \\ U_2^T \end{pmatrix}, \qquad \widetilde{A} = \widetilde{U} \widetilde{\Lambda} \widetilde{U}^T = \begin{pmatrix} \widetilde{U}_1 & \widetilde{U}_2 \end{pmatrix} \begin{pmatrix} \widetilde{\Lambda}_1 & \\ & \widetilde{\Lambda}_2 \end{pmatrix} \begin{pmatrix} \widetilde{U}_1^T \\ \widetilde{U}_2^T \end{pmatrix}. \tag{1.6}$$
Here $U_1 \in \mathbb{R}^{n \times r}$, $U_2 \in \mathbb{R}^{n \times (n-r)}$, and $[U_1, U_2] \in \mathbb{R}^{n \times n}$ is an orthogonal matrix. The eigenvalue matrices are $\Lambda_1 = \mathrm{diag}\{\lambda_1, \lambda_2, \cdots, \lambda_r\}$ and $\Lambda_2 = \mathrm{diag}\{\lambda_{r+1}, \lambda_{r+2}, \cdots, \lambda_n\}$, with the eigenvalues indexed in non-increasing order, i.e., $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r \geq \lambda_{r+1} \geq \cdots \geq \lambda_n$. The factors of $\widetilde{A}$ have similar structures. Denote $\delta = \min_{1 \leq i \leq r,\, r+1 \leq j \leq n} |\lambda_i - \widetilde{\lambda}_j|$; if $\delta > 0$, then
$$\|\sin\Theta(U_1, \widetilde{U}_1)\|_F \leq \frac{\|\Delta A\, U_1\|_F}{\delta}.$$
In fact, both occurrences of the Frobenius norm above can be replaced with the spectral norm. The Davis-Kahan theorem shows that for symmetric matrices, the $\sin\Theta$ distance depends on $\|\Delta A\|$ and the eigengap; it provides a tight bound for general symmetric matrices with no additional assumption on the noise. Wedin extended the theorem to general rectangular matrices.
Let ๐ด, ๐ด e โˆˆ R๐‘›ร—๐‘š be two rectangular matrices with singular decomposition (1.1) as defined in Section 1.2.1, provided ๐›ฟ = min{ min |e๐œŽ๐‘– โˆ’ ๐œŽ ๐‘— |, min e ๐œŽ๐‘– } > 0, 1โ‰ค๐‘–โ‰ค๐‘Ÿ,๐‘Ÿ+1โ‰ค ๐‘— โ‰คmin{๐‘›,๐‘š} 1โ‰ค๐‘–โ‰ค๐‘Ÿ then โˆฅ๐‘ˆe๐‘‡ ฮ”๐ดโˆฅ 2 + โˆฅฮ”๐ด๐‘‰ e1 โˆฅ 2 e1 )โˆฅ 2 + โˆฅ sin ฮ˜(๐‘‰1 , ๐‘‰ e1 )โˆฅ 2 โ‰ค 1 ๐น ๐น โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ . (1.7) ๐น ๐น ๐›ฟ2 8 Here we make a few observations of the Wedinโ€™s bound. First, the definition of ๐›ฟ requires that the singular values in e ฮฃ1 and ฮฃ2 are separated. If โˆฅฮ”๐ดโˆฅ is small compared to the gap in singular values, then essentially it requires the singular values in ฮฃ1 and ฮฃ2 to be separated. Second, (1.7) is a uniform perturbation bound on both the right singular subspace ๐‘ˆ1 and the left singular subspace ๐‘‰1 . When the matrix size ๐‘› and ๐‘š are significantly different, ๐‘ˆ1 and ๐‘‰1 may have quite different stability, and it will be desirable to derive a one-sided perturbation bound for each of them. More discussions on this issue can be found in Section 2.5.2. Third, these classical results are quite tight for general matrices with no statistical assumption on the noise, however, when the perturbation matrix ฮ”๐ด is random, the bounds on the right-hand side become random quantities, since they involve the singular values of the perturbed matrix ๐ด. e To solve this problem, deterministic variants of the Davis-Kahanโ€™s sin ฮ˜ theorem were introduced in Yu et al. (2015); Cai and Zhang (2018) that are particularly useful for statistical applications. Additionally, various asymptotic bounds on eigenvector perturbations have also been derived in Cape et al. (2019a); Tang and Preibe (2018); Fan et al. (2022); Agterberg et al. (2022). 1.3.3 Large deviation inequalities The proof techniques in this dissertation also feature high-dimensional probability, for com- pleteness, here we present a brief review of relevant large deviation inequalities. Interested readers are referred to Vershynin (2018) for a comprehensive understanding of large deviation bounds. For certain types of random variables we are concerned with, these large deviation inequalities char- acterize the exponential decline in the probability of tail events, hence provide us with statistical guarantees on bounding these random variables. For a finite sequence of independent bounded random variables, the following Hoeffdingโ€™s inequality provides a bound on the deviation of their summation to the expectation of the summation. Theorem 1.3.12 (Hoeffdingโ€™s inequality Vershynin (2018)). Let ๐‘‹1 , ๐‘‹2 , ยท ยท ยท , ๐‘‹๐‘› be independent random variables, for each 1 โ‰ค ๐‘– โ‰ค ๐‘›, ๐‘Ž๐‘– โ‰ค ๐‘‹๐‘– โ‰ค ๐‘๐‘– . Then for any ๐‘ก > 0, the following inequaity holds ! ! 2๐‘ก 2 ๐‘› โˆ‘๏ธ P (๐‘‹๐‘– โˆ’ E๐‘‹๐‘– ) โ‰ฅ ๐‘ก โ‰ค 2 exp โˆ’ ร๐‘› . (1.8) (๐‘ โˆ’ ๐‘Ž ) 2 ๐‘–=1 ๐‘–=1 ๐‘– ๐‘– Remark 1.3.13. If ๐‘‹1 , ๐‘‹2 , ยท ยท ยท , ๐‘‹๐‘› are i.i.d. (independently and identically distributed) random 9 variables within [๐‘Ž, ๐‘], E๐‘‹๐‘– = ๐œ‡, then (1.8) reduces to ! ! 2๐‘ก 2 ๐‘› โˆ‘๏ธ P (๐‘‹๐‘– โˆ’ ๐œ‡) โ‰ฅ ๐‘ก โ‰ค 2 exp โˆ’ , ๐‘–=1 ๐‘›(๐‘ โˆ’ ๐‘Ž) 2 which further leads to ! ๐‘›   1 โˆ‘๏ธ 2๐‘› P ๐‘‹๐‘– โˆ’ ๐œ‡ โ‰ฅ ๐‘ก โ‰ค 2 exp โˆ’ . ๐‘› ๐‘–=1 (๐‘ โˆ’ ๐‘Ž) 2 We see that for bounded i.i.d. 
Suppose $X \sim B(n, p)$ follows the Binomial distribution with success probability $p$; then $X = X_1 + X_2 + \cdots + X_n$, where the $X_i$ are i.i.d. binary random variables taking value 1 with probability $p$ and value 0 with probability $1 - p$. Since $0 \leq X_i \leq 1$, Hoeffding's inequality can be used to bound the deviation of $X$ from its expectation $\mathbb{E}X = np$. However, when $p$ is very small, the bound given by Hoeffding's inequality can be pessimistic. In this case, the following theorem provides an improved deviation bound. Similar results can also be derived using Chernoff's inequality (see e.g., Vershynin (2018), Theorem 2.3.1).

Theorem 1.3.14 (Theorem 1 in Janson (2016)). Suppose $X$ follows the Binomial distribution $B(n, p)$, and let $q = 1 - p$. Then for every $a \geq 0$,
$$\mathbb{P}(X > \mathbb{E}X + a) \leq \exp\left( -\frac{a^2}{2(npq + a/3)} \right), \tag{1.9}$$
where $\mathbb{E}X = np$ is the expectation of $X$.

Remark 1.3.15. For some $0 < \delta \leq 1$, plugging $a = \delta \mathbb{E}X = \delta np$ into (1.9), we obtain
$$\mathbb{P}(X > (1 + \delta)\mathbb{E}X) \leq \exp(-c\, np\, \delta^2),$$
for some absolute constant $c$.

For independent and bounded random variables, if information about the variance is available, tighter bounds can be obtained when the variance is small using the following Bernstein inequality.

Theorem 1.3.16 (Bernstein's inequality). Let $X_1, X_2, \cdots, X_n$ be independent, centered, real random variables, and assume each one is uniformly bounded:
$$\mathbb{E}X_i = 0, \quad |X_i| \leq L, \quad \forall\, 1 \leq i \leq n.$$
Let $X$ be the sum $X = \sum_{i=1}^n X_i$, and let $\nu(X)$ denote the variance of $X$,
$$\nu(X) = \mathbb{E}X^2 = \sum_{i=1}^n \mathbb{E}X_i^2.$$
Then for all $t \geq 0$,
$$\mathbb{P}(|X| \geq t) \leq 2 \exp\left( -\frac{t^2/2}{\nu(X) + Lt/3} \right). \tag{1.10}$$

Comparing Hoeffding's bound (1.8) and Bernstein's bound (1.10), we see that when the variance is significantly smaller than the bound on the random variables, i.e., $\sum_{i=1}^n \mathbb{E}X_i^2 \ll nL^2$, Bernstein's inequality gives a tighter bound. This is intuitive, because its assumption provides more information about the variance. Bernstein's inequality can also be generalized to sums of random matrices Tropp et al. (2015), bounding the spectral norm of a sum of independent centered random matrices.

Theorem 1.3.17 (Matrix Bernstein inequality). Let $S_1, S_2, \cdots, S_n \in \mathbb{R}^{d_1 \times d_2}$ be independent, centered random matrices, and assume that each matrix is uniformly bounded, i.e.,
$$\mathbb{E}S_k = 0, \quad \|S_k\| \leq L, \quad \forall\, k = 1, 2, \cdots, n.$$
Denote the sum
$$Z = \sum_{k=1}^n S_k,$$
and let $v(Z)$ be the matrix variance statistic of the sum,
$$v(Z) = \max\{ \|\mathbb{E}(Z Z^*)\|, \|\mathbb{E}(Z^* Z)\| \} = \max\left\{ \left\| \sum_{k=1}^n \mathbb{E}(S_k S_k^*) \right\|, \left\| \sum_{k=1}^n \mathbb{E}(S_k^* S_k) \right\| \right\}.$$
Then
$$\mathbb{P}(\|Z\| \geq t) \leq (d_1 + d_2) \exp\left( -\frac{t^2/2}{v(Z) + Lt/3} \right), \quad \forall\, t \geq 0.$$
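The advantage of Bernstein's bound (1.10) over Hoeffding's bound (1.8) when the variance is small is easy to see numerically. The following sketch, with assumed illustrative parameters, evaluates both bounds for a sum of $n$ centered Bernoulli($p$) variables with small $p$.

import numpy as np

# Hoeffding vs. Bernstein for centered Bernoulli(p) with small p
# (illustrative parameters).
n, p, t = 10_000, 0.01, 50.0
L = 1.0                                   # |X_i - p| <= 1 (crude uniform bound)
var = n * p * (1 - p)                     # nu(X) = sum of Var(X_i) = 99 here

hoeffding = 2 * np.exp(-2 * t**2 / n)                    # uses only the range
bernstein = 2 * np.exp(-(t**2 / 2) / (var + L * t / 3))  # uses the variance too
print(f"Hoeffding: {hoeffding:.3e}   Bernstein: {bernstein:.3e}")
# With var = 99 << n L^2 = 10^4, Bernstein is dramatically tighter.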
The next one is a concentration inequality for the norm of a random vector with independent sub-Gaussian coordinates. In high-dimensional probability theory, the sub-Gaussian distributions form a very important family, as it contains many fundamental distributions, including the Gaussian distribution. Moreover, concentration inequalities such as Hoeffding's bound can be derived for all sub-Gaussian distributions. Formally, a random variable $X$ is sub-Gaussian if there exists a constant $K$ such that for all $t \geq 0$, the tails of $X$ satisfy
$$\mathbb{P}(|X| \geq t) \leq 2 \exp(-t^2 / K^2).$$
Intuitively, this sub-Gaussian property indicates that the tail probability decays at least as fast as that of a normal distribution. Examples of sub-Gaussian distributions include the Gaussian distribution, the Bernoulli distribution, and all bounded random variables. The sub-Gaussian norm is defined as
$$\|X\|_{\psi_2} = \inf\{ t > 0 : \mathbb{E} \exp(X^2 / t^2) \leq 2 \}.$$
The following theorem bounds the norm of sub-Gaussian random vectors.

Theorem 1.3.18 (Theorem 3.1.1 in Vershynin (2018)). Let $X = [X_1, X_2, \cdots, X_n] \in \mathbb{R}^n$ be a random vector with independent sub-Gaussian coordinates $X_i$ satisfying $\mathbb{E}X_i^2 = 1$. Then
$$\mathbb{P}\left(|\|X\| - \sqrt{n}| \geq t\right) \leq 2 \exp\left( -\frac{ct^2}{K^4} \right),$$
where $K = \max_i \|X_i\|_{\psi_2}$ and $c$ is an absolute constant.

Remark 1.3.19. Noting that Gaussian random variables are also sub-Gaussian, in Theorem 1.3.18, when $X = [X_1, X_2, \cdots, X_n] \in \mathbb{R}^n$ is a random vector with independent Gaussian coordinates $X_i \sim N(0, \sigma^2)$, then
$$\mathbb{P}\left(|\|X\| - \sigma\sqrt{n}| \geq t\right) \leq 2 \exp\left( -\frac{ct^2}{\sigma^2} \right).$$
Here we used the fact that for each Gaussian random variable $X_i$, $\|X_i\|_{\psi_2} \leq C\sigma$ for some absolute constant $C$ (see e.g., Vershynin (2018)).

1.4 Overview of this dissertation

Equipped with the mathematical background presented above, we are ready to provide a more detailed overview of our approaches for the applications in the following chapters.

Chapter 2 studies the problem of singular perturbation analysis and its applications. As a fundamental tool in computational mathematics, the SVD plays a ubiquitous role in numerical and statistical algorithms; examples can be found in Principal Component Analysis, matrix completion, matrix denoising, community detection, etc. To better understand the performance of these algorithms, it is important to study the stability of the SVD steps. In this dissertation, we take a different approach from the classical singular perturbation analyses. We first derive a set of exact formulae for the $\sin\Theta$ distance between the original and the perturbed singular subspaces; from these formulae, one can see how the perturbation of the original matrix propagates into the singular vectors and singular subspaces. More importantly, these formulae provide a direct way of analyzing the singular perturbation error. Based on or motivated by these exact formulae, we derive a collection of new results on SVD perturbation related problems. The newly derived results have three components: 1) This dissertation derives a tighter bound on $\ell_{2,\infty}$-norm singular subspace perturbation errors under Gaussian noise. The bound holds both when the matrix is low-rank and when it is a general matrix. Compared with existing works, the proposed result requires minimal assumptions and achieves a tighter bound for general matrices. 2) We also provide a novel stability analysis of Principal Component Analysis for general full-rank matrices.
As one of the arguably most popular statistical dimension reduction methods, the stability of PCA has been studied extensively in the literature. However, most existing works focus on analyzing the stability of singular vectors or singular values. This dissertation complements the literature by investigating the stability of PC scores. 3) In addition, this dissertation presents a new error bound for singular value truncation, which computes the best low-rank approximation of a given matrix with a target rank. A tight error bound for low-rank matrices already exists in the literature, but it does not directly generalize to general matrices. In this dissertation, we take a different approach and derive a new singular value truncation bound for general matrices. When the matrix is indeed low rank, the proposed result reduces to the existing tight bound. Results of this chapter have given rise to the manuscript Lyu and Wang (2020a).

In Chapter 3, we employ the low-rank prior in the manifold setting. Here we follow the manifold hypothesis popular in dimension reduction techniques, which states that real-world high-dimensional data usually lies around low-dimensional manifolds embedded in a high-dimensional space. Therefore, if we consider a local neighborhood, the corresponding sub-matrix is approximately low-rank. Specifically, we consider the manifold denoising problem, where the observed data is generated from clean data distributed around a low-dimensional manifold, contaminated by sparse noise and possibly Gaussian noise, and our goal is to denoise the dataset. Toward this goal, we utilize the low-rank assumption in each local neighborhood and propose an optimization framework to separate the sparse noise from the data. Our approach is a generalization of Robust PCA, and its performance is tested on both synthetic and real-world datasets. In addition, we also provide a theoretical error bound under some incoherence conditions and a near-optimal choice of the tuning parameters. Results of this chapter have given rise to the paper Lyu et al. (2019).

In Chapter 4, we investigate the stability of invariant subspaces for diagonalizable matrices, with a specific focus on the situation when the eigensystem is ill-conditioned. Let $\mathcal{X}_1$ be some invariant subspace of a diagonalizable matrix $A$ and $X_1$ be the matrix storing the right eigenvectors that span $\mathcal{X}_1$; this chapter studies the stability of $\mathcal{X}_1$ under noise and explores the impact of condition numbers on its stability. Previous works all suggest that as the condition number $\kappa_2(X_1)$ grows, the invariant subspace $\mathcal{X}_1$ becomes unstable to perturbation. In this dissertation, we make the point that the growth of $\kappa_2(X_1)$ alone is not enough to destroy the stability. We illustrate this point by deriving a new perturbation error bound that does not contain $\kappa_2(X_1)$, which implies that when the matrix $A$ gets closer to a Jordan form, one may still estimate its invariant subspaces stably from the noisy data. Another implication of the derived result is that for matrices with ill-conditioned eigensystems, the invariant subspaces may be more stable than the eigenvalues under matrix perturbation. Results of this chapter have given rise to the manuscript Lyu and Wang (2022).

CHAPTER 2  MATRIX PERTURBATION ANALYSIS AND ITS APPLICATIONS

2.1 Introduction

2.1.1 Overview of Chapter 2

This chapter studies the problem of singular perturbation analysis and its applications.
As a fundamental tool in computational mathematics, the Singular Value Decomposition (SVD) is commonly used in statistical and machine learning algorithms. Studying the stability of the SVD is vital to understanding the performance of these algorithms. In this chapter, we establish a useful set of formulae for the $\sin\Theta$ distance between the original and the perturbed singular subspaces. These formulae explicitly show how the perturbation of the original matrix propagates into the singular vectors and singular subspaces, thus providing a direct way of analyzing them. Following this, we derive a collection of new results on SVD perturbation related problems, which have given rise to the paper Lyu and Wang (2020a).

Our first new result is a tighter bound on the $\ell_{2,\infty}$-norm of the singular subspace perturbation error under Gaussian noise. Our analysis complements the literature on $\ell_{2,\infty}$-norm singular subspace perturbation bounds, and the derived bound matches the minimax lower bound. Compared with previous works, our result requires minimal assumptions and holds for general full-rank matrices with Gaussian noise. The second new result is a novel stability analysis of Principal Component Analysis for general full-rank matrices. We establish new error bounds for PCA scores under perturbation, which has been less explored in the literature compared with the stability of singular vectors or singular values. The third new result is an error bound on singular value truncation. A tight error bound for low-rank matrices has been derived in Luo et al. (2021); here we consider the perturbation bound for general full-rank matrices. When the matrix is indeed low rank, our result reduces to the tight bounds obtained in Luo et al. (2021).

2.1.2 Problem setting

Singular value decomposition (SVD) is a fundamental tool in computational mathematics. Many widely used algorithms in numerical analysis and statistics (e.g., Principal Component Analysis Pearson (1901); Cai et al. (2021); Abbe et al. (2022), matrix completion Candes and Recht (2012); Candes and Plan (2010); Keshavan et al. (2009), matrix denoising Donoho and Gavish (2014); Gavish and Donoho (2014), community detection Yun and Proutiere (2014); Chin et al. (2015); Abbe (2017), graph inference Tang and Priebe (2018); Athreya et al. (2021), etc.) involve the SVD computation. Since the singular vectors and singular subspaces can be sensitive to noise, the SVD step is often the stability bottleneck of the entire algorithm. Therefore, deriving optimal perturbation bounds is vital to understanding the performance of these algorithms.

In general, perturbation theory asks how a function changes when its argument is subject to a perturbation Stewart (1990). In the setting of the SVD, the goal of perturbation analysis is to study the sensitivity of the SVD factors (i.e., $U$, $\Sigma$, $V$) or their combinations to the perturbation $\Delta A$. One major focus of this chapter is on the perturbation of singular subspaces, which are subspaces spanned by singular vectors corresponding to a set of singular values. However, if one singular value corresponds to multiple singular vectors, it is not possible to bound the perturbation of these individual singular vectors. Let us illustrate this point with the following example.
Consider the diagonal matrix
$$A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 5 \end{pmatrix}.$$
We can directly see that $A$ has singular values $1, 1, 5$, and we can choose the right singular vectors corresponding to the singular value 1 as $u_1 = [1, 0, 0]^T$ and $u_2 = [0, 1, 0]^T$. For some small constant $0 < \epsilon \ll 1$, let the perturbed matrix be
$$\widetilde{A} = \begin{pmatrix} 1 & \epsilon & 0 \\ \epsilon & 1 & 0 \\ 0 & 0 & 5 \end{pmatrix}.$$
In $\widetilde{A}$, the singular values are $1 + \epsilon$, $1 - \epsilon$, $5$, and the singular vectors corresponding to the singular values $1 + \epsilon$ and $1 - \epsilon$ are $\widetilde{u}_1 = [\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 0]^T$ and $\widetilde{u}_2 = [\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}, 0]^T$, which are significantly different from $u_1$ and $u_2$. However, after a more careful look, we see that the subspaces spanned by $[u_1, u_2]$ and by $[\widetilde{u}_1, \widetilde{u}_2]$ are the same. This example indicates that an individual singular vector can change dramatically even as the perturbation $\|\Delta A\| \to 0$, but the singular subspaces are more stable.
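The example is easy to verify numerically; the following Python/NumPy sketch computes both the change in an individual singular vector and the $\sin\Theta$ distance between the two-dimensional subspaces ($\epsilon$ is an illustrative value).

import numpy as np

# Numerical check of the 3x3 example: individual singular vectors differ
# substantially, while the spanned 2-dimensional subspaces coincide.
eps = 1e-6
A = np.diag([1.0, 1.0, 5.0])
A_tilde = A.copy()
A_tilde[0, 1] = A_tilde[1, 0] = eps

# numpy returns singular values in decreasing order, so the subspace for the
# (repeated) singular value 1 corresponds to the last two right singular vectors.
_, _, Vt = np.linalg.svd(A)
_, _, Vt_t = np.linalg.svd(A_tilde)
V1, V1_t = Vt[1:].T, Vt_t[1:].T                    # 3 x 2 bases

print("vector change:     ", np.linalg.norm(V1[:, 0] - V1_t[:, 0]))  # O(1)
gammas = np.linalg.svd(V1.T @ V1_t, compute_uv=False)
print("largest sin(theta):", np.sqrt(max(0.0, 1.0 - gammas.min() ** 2)))  # ~0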
Chapter 1 provides the mathematical preparations for singular subspace perturbation. Let $A = U \Sigma V^T$ be the SVD of a matrix $A$, and $\widetilde{A} = \widetilde{U} \widetilde{\Sigma} \widetilde{V}^T$ be the SVD of its noisy version $\widetilde{A} = A + \Delta A$. The singular subspace perturbation problem then studies the stability of a left or a right singular subspace of $A$ under the perturbation $\Delta A$. In the setting of Section 1.2.1, we aim to study the distance between the leading singular subspaces $U_1$ and $\widetilde{U}_1$ under perturbation. Besides, we also investigate the stability of combinations of the individual factors of the SVD (e.g., in PCA, we need to analyze the PC scores $U_1 \Sigma_1$). Specifically, in this dissertation, we are concerned with the following three perturbation problems:

• Singular subspace perturbation. This dissertation studies the stability of the leading subspace $U_1$ (or $V_1$) of a matrix $A$ under the perturbation matrix $\Delta A$. Specifically, we are interested in deriving an $\ell_{2,\infty}$-norm bound for the distance between the leading subspaces $U_1$ and $\widetilde{U}_1$ under perturbation, i.e., we aim to bound $\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty}$.

• Stability of PCA scores, which are the projections of the data matrix $A$ onto its PC directions; using our notation in Section 1.2.1, the PC scores are given by $U_1 \Sigma_1$. Our goal is to bound $\min_{Q \in \mathbb{O}_r} |||U_1 \Sigma_1 - \widetilde{U}_1 \widetilde{\Sigma}_1 Q|||$, where $||| \cdot |||$ can be either the spectral norm or the Frobenius norm.

• Stability of singular value truncation. The best rank-$r$ approximations of the original matrix $A$ and the perturbed matrix $\widetilde{A}$ are given by $A_r = U_1 \Sigma_1 V_1^T$ and $\widetilde{A}_r = \widetilde{U}_1 \widetilde{\Sigma}_1 \widetilde{V}_1^T$, respectively. We study $|||A_r - \widetilde{A}_r|||$, with $||| \cdot |||$ being either the spectral norm or the Frobenius norm.

Perturbation analysis of these three problems is crucial to understanding the performance of many spectral methods in statistics. There is a plurality of papers addressing SVD perturbation and its statistical applications Cape et al. (2019b); Abbe et al. (2020); Cai et al. (2021); Abbe et al. (2022); Lei (2019). To list a few statistical applications: in the study of network models such as the SBM, spectral methods are commonly used in community detection and clustering, where the eigenvectors of the adjacency matrix provide connectivity and centrality information about the nodes, and the $\ell_{2,\infty}$-norm perturbation bound can facilitate the analysis of exact recovery for such methods Cape et al. (2019b); Abbe et al. (2020); Cai et al. (2021). PCA is one of the most popular dimension reduction methods, and the perturbation analysis of PC scores helps provide a more precise characterization of the stability of the low-dimensional embedding Abbe et al. (2022). In matrix completion with noisy entries, Singular Value Truncation is a fundamental method, and it is desirable to study how close the reconstructed matrix is to the true matrix.

2.2 $\ell_{2,\infty}$-norm perturbation bound of the singular subspace perturbation

Using the notation of Section 1.2.1, the purpose of the $\ell_{2,\infty}$-norm perturbation analysis is to investigate the quantity
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty},$$
which characterizes the difference between the leading singular subspaces $U_1$ and $\widetilde{U}_1$ under perturbation. In many applications, the $\ell_{2,\infty}$-metric is better suited than the $\sin\Theta$ one, since it provides finer entrywise control. For instance, in clustering, classification, and dimension reduction, one cares about the classification accuracy or the embedding quality of each data point, which corresponds to the row-wise $\ell_2$ error of the leading singular vector matrix, and the maximal row-wise $\ell_2$ error is exactly the $\ell_{2,\infty}$-norm.

Careful readers may have noticed that, by Proposition 1.3.2 and Proposition 1.3.6, $\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty}$ is smaller than the $\sin\Theta$ distance up to a constant Cai and Zhang (2018),
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq \min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\| \leq \sqrt{2}\, \|\sin\Theta(U_1, \widetilde{U}_1)\|. \tag{2.1}$$
This provides a trivial bound on the $\ell_{2,\infty}$-norm error, but it can be very pessimistic.

To see why, let us consider the case where the matrix $A$ has rank $r$ with $r \ll n$, the perturbation matrix $\Delta A$ has i.i.d. Gaussian entries $N(0, \sigma^2)$, the matrix $U_1$ of leading singular vectors is $\mu$-incoherent (see Definition 1.3.2 for the definition of incoherence), and the following gap condition holds:
$$c_1 \sigma \sqrt{\max\{n, m\}} \leq \sigma_r(A) - \sigma_{r+1}(A).$$
Under these assumptions, (2.1) combined with Wedin's $\sin\Theta$ bound yields that, with high probability,
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq \sqrt{2}\, \min\left\{ 1,\ \frac{c_2 \sqrt{\max\{n, m\}}\, \sigma}{\sigma_r - \sigma_{r+1}} \right\} \sim O(1). \tag{2.2}$$
Here $c_1, c_2$ are absolute constants, and the big-O notation is with respect to the size variables $n$ and $m$. The gap condition implies that the noise level $\sigma$ can be as large as $O(1/\sqrt{\max\{n, m\}})$, so the order of $\sigma \sqrt{\max\{n, m\}}$ is $O(1)$. Hence the bound in (2.2) is $O(1)$.

In contrast, it has been shown that an $\ell_{2,\infty}$-norm bound of order $O(\sqrt{r/n})$ can be achieved under the same conditions Abbe et al. (2020); Lyu and Wang (2020a), which implies that the trivial bound derived in (2.1) can be very pessimistic.
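The gap between the aligned $\ell_{2,\infty}$ error and the trivial bound (2.1) can be observed directly. The sketch below (sizes and noise level are illustrative assumptions) computes an estimate of $\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty}$ using the rotation $Q_U$ from Proposition 1.3.6 and compares it with $\sqrt{2}\|\sin\Theta(U_1, \widetilde{U}_1)\|$.

import numpy as np

# Aligned l_{2,inf} error vs. the trivial bound (2.1) for a random rank-r
# matrix plus Gaussian noise (illustrative sizes).
rng = np.random.default_rng(3)
n, m, r, sigma = 1000, 1000, 5, 0.5

A = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank-r signal
A_tilde = A + sigma * rng.standard_normal((n, m))

U1 = np.linalg.svd(A, full_matrices=False)[0][:, :r]
U1_t = np.linalg.svd(A_tilde, full_matrices=False)[0][:, :r]

# Optimal rotation Q_U = Q1 Q2^T from the SVD of U1^T U1_tilde (Prop. 1.3.6).
Q1, s, Q2t = np.linalg.svd(U1.T @ U1_t)
Q_U = Q1 @ Q2t

aligned_err = np.max(np.linalg.norm(U1_t - U1 @ Q_U, axis=1))   # l_{2,inf}
trivial = np.sqrt(2) * np.sqrt(max(0.0, 1.0 - s.min() ** 2))    # sqrt(2)*||sin Theta||
print(f"l_2,inf error: {aligned_err:.2e}   trivial bound (2.1): {trivial:.2e}")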
2.2.1 Different bounds on the $\ell_{2,\infty}$-norm perturbation

Various $\ell_{2,\infty}$-norm bounds for singular vector perturbation have been established by Abbe et al. (2020); Chen et al. (2021a); Cheng et al. (2020); Cape et al. (2019b). In particular, Cape et al. (2019b) derived a Procrustean matrix decomposition, based on which the authors further obtained $\ell_{2,\infty}$-norm perturbation bounds. For the leading subspaces $U_1$ and $\widetilde{U}_1$ under perturbation, let the SVD of $U_1^T \widetilde{U}_1$ be $U_1^T \widetilde{U}_1 = Q_1 S Q_2^T$, and denote $Q_U = Q_1 Q_2^T$. Proposition 1.3.6 suggests that $\|\widetilde{U}_1 - U_1 Q_U\|_{2,\infty}$ is a good estimate of the target quantity $\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty}$. The Procrustean decomposition of $\widetilde{U}_1 - U_1 Q_U$ in Cape et al. (2019b) reads as the following theorem.

Theorem 2.2.1 (Theorem 3.1 in Cape et al. (2019b)). In the setting of Section 1.2.1, if $\widetilde{A}$ has rank at least $r$, then $\widetilde{U}_1 - U_1 Q_U \in \mathbb{R}^{n \times r}$ admits the following decomposition:
$$\begin{aligned} \widetilde{U}_1 - U_1 Q_U &= (I - U_1 U_1^T)\, \Delta A\, V_1 Q_V \widetilde{\Sigma}_1^{-1} + (I - U_1 U_1^T)\, \Delta A\, (\widetilde{V}_1 - V_1 Q_V)\, \widetilde{\Sigma}_1^{-1} \\ &\quad + (I - U_1 U_1^T)\, A\, (\widetilde{V}_1 - V_1 V_1^T \widetilde{V}_1)\, \widetilde{\Sigma}_1^{-1} + U_1 (U_1^T \widetilde{U}_1 - Q_U). \end{aligned} \tag{2.3}$$
The decomposition also holds when the orthogonal matrices $Q_U$ and $Q_V$ are replaced with any real matrices $T_1$ and $T_2$.

The intuition behind this decomposition is that each of the four terms in (2.3) can be bounded fairly easily. Perturbation bounds on singular subspaces were developed in Cape et al. (2019b) based on Theorem 2.2.1. When applied to the case where the perturbation matrix $\Delta A$ has i.i.d. Gaussian entries, which is the focus of this chapter, Cape et al. (2019b) derived the following bound.

Theorem 2.2.2 (Theorem 4.3 in Cape et al. (2019b)). Suppose $A \in \mathbb{R}^{n \times m}$ has rank $r$ with $r \leq m \leq n$ and $\sigma_r(A) \geq C \sigma n / \sqrt{m}$. When $\Delta A$ has i.i.d. Gaussian entries $N(0, \sigma^2)$, there is a constant $C_r$ depending on $r$ such that, with high probability,
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq C_r \left[ \frac{\sigma \sqrt{\log n}}{\sigma_r(A)} \left( 1 + \frac{\sigma m}{\sigma_r(A) \sqrt{\log n}} \right) + \frac{\sigma \sqrt{m}}{\sigma_r(A)} \|U_1\|_{2,\infty} \right], \tag{2.4}$$
where $C_r \sim O(\sqrt{r})$.

This bound is quite tight when $m \ll n$; however, when $n \approx m$, the second term in (2.4) becomes quite large and the bound becomes pessimistic.

In the setting where $A \in \mathbb{R}^{n \times m}$ has noisy and missing entries and the dimensions are highly unbalanced, i.e., $m \gg n$, Cai et al. (2021) derived an $\ell_{2,\infty}$ perturbation bound, where the SVD is conducted on a rescaled Gram matrix with the diagonal entries deleted. When restricted to the case where there are no missing entries, $A$ has rank $r$, and the perturbation matrix $\Delta A$ has i.i.d. $N(0, \sigma^2)$ entries, the result in Cai et al. (2021) reads as the following Theorem 2.2.3.

Theorem 2.2.3 (Theorem 3.1 in Cai et al. (2021)). Assume the following assumptions hold:
$$\sigma_r(A) \geq c_1 \sigma \sqrt{\log m}\, \max\{\kappa (mn)^{1/4},\ \kappa^3 \sqrt{n}\}, \quad \text{and} \quad \|U_1\|_{2,\infty} \leq \frac{c_2}{\kappa^2},$$
where $\kappa = \frac{\sigma_1(A)}{\sigma_r(A)}$, $c_1$ is some sufficiently large constant, and $c_2$ is sufficiently small. Then with high probability,
$$\|\widetilde{U}_1 - U_1 Q_U\|_{2,\infty} \lesssim \kappa^2 \sqrt{r}\, \mu \left( \frac{\sigma^2 \sqrt{m}}{\sigma_r^2(A)} + \frac{\sigma \kappa}{\sigma_r(A)} \right) + \kappa^2 \|U_1\|_{2,\infty}^2. \tag{2.5}$$
Here $\mu = \max\left\{ \sqrt{\frac{n}{r}}\, \|U_1\|_{2,\infty},\ \sqrt{\frac{m}{r}}\, \|V_1\|_{2,\infty},\ \sqrt{nm}\, \frac{\max_{i,j} |A_{i,j}|}{\|A\|_F} \right\} \geq 1$ characterizes the incoherence of the matrices $U_1$, $V_1$, and $A$.

For simplicity of presentation, we have omitted the logarithmic terms. When $\mu$ and the condition number $\kappa$ are of constant order, the bound on the RHS of (2.5) is near-optimal. However, if the matrix $A$ has a large condition number $\kappa$, it is not desirable to have $\kappa$ in the bound.
It is worth mentioning that Cai et al. (2021) also derived the following minimax lower bound.

Theorem 2.2.4 (Theorem 3.3 in Cai et al. (2021)). Suppose $1 \leq r \leq n/2$ and $(\Delta A)_{ij} \overset{i.i.d.}{\sim} N(0, \sigma^2)$. Define
$$\mathcal{M} := \{ B \in \mathbb{R}^{n \times m} \mid \mathrm{rank}(B) = r,\ \sigma_r(B) \in [0.9\sigma_r^*, 1.1\sigma_r^*] \}.$$
Denote by $U(B) \in \mathbb{R}^{n \times r}$ the matrix containing the $r$ leading left singular vectors of $B$. Then there exists some universal constant $c_{lb} > 0$ such that
$$\inf_{\widehat{U}} \sup_{A \in \mathcal{M}} \mathbb{E} \min_{Q \in \mathbb{O}_r} \|\widehat{U} Q - U(A)\|_{2,\infty} \geq c_{lb} \min\left\{ \frac{\sigma^2}{\sigma_r^{*2}} \sqrt{nm} + \frac{\sigma}{\sigma_r^*} \sqrt{n},\ 1 \right\} \frac{1}{\sqrt{n}}, \tag{2.6}$$
where the infimum is taken over all estimators for $U(A)$ based on the noisy observation $A + \Delta A$.

Theorem 2.2.4 provides a minimax lower bound on the $\ell_{2,\infty}$-norm of singular subspace perturbation. When $A$ is rank-$r$, $n \asymp m$, and $\sigma_r \gtrsim \sigma\sqrt{n}$, the lower bound on the RHS becomes $O(1/\sqrt{n})$.

Some $\ell_{2,\infty}$-norm singular perturbation bounds were derived from the perturbation analysis of eigen subspaces of symmetric matrices. Using the Hermitian dilation trick Paulsen (2002), these results on eigen subspaces can be extended to singular subspaces, which leads to a uniform bound on $\|\sin\Theta(U_1, \widetilde{U}_1)\|$ and $\|\sin\Theta(V_1, \widetilde{V}_1)\|$. Relevant works that fall into this category include Lei (2019) and Abbe et al. (2020). In particular, Abbe et al. (2020) established an $\ell_{2,\infty}$-norm bound for eigenspaces of symmetric random matrices whose expectations are of low rank. Here we consider a simplified version of the results in Abbe et al. (2020) when the perturbation matrix has i.i.d. Gaussian entries. Suppose $A \in \mathbb{R}^{n \times n}$; denote the eigenvalues of $A$ by $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, with associated eigenvectors $\{u_j\}_{j=1}^n$. Analogously, for $\widetilde{A} = A + \Delta A$, the eigenvalues and eigenvectors are $\widetilde{\lambda}_1 \geq \widetilde{\lambda}_2 \geq \cdots \geq \widetilde{\lambda}_n$ and $\{\widetilde{u}_j\}_{j=1}^n$, respectively. Let $U_1 = [u_1, u_2, \cdots, u_r]$ and $\widetilde{U}_1 = [\widetilde{u}_1, \widetilde{u}_2, \cdots, \widetilde{u}_r]$. Denote $\Delta = \min\{\lambda_r - \lambda_{r+1},\ \min_{1 \leq i \leq r} |\lambda_i|\}$ and $\kappa = \max_{1 \leq i \leq r} |\lambda_i| / \Delta$. In addition, assume there exists $\gamma \geq 0$ such that
$$\|A\|_{2,\infty} \leq \gamma \Delta \tag{2.7}$$
and $c\kappa\gamma \leq 1$ for some constant $c$, and for some $\delta_0 \in (0, 1)$,
$$\mathbb{P}(\|\Delta A\| \leq \gamma \Delta) \geq 1 - \delta_0. \tag{2.8}$$

Theorem 2.2.5 (Theorem 2.1 in Abbe et al. (2020)). Under assumptions (2.7) and (2.8), with high probability,
$$\|\widetilde{U}_1 - U_1 Q_U\|_{2,\infty} \leq (c + \kappa^2 \gamma) \|U_1\|_{2,\infty} + \gamma \|A\|_{2,\infty} / \Delta. \tag{2.9}$$

The bound in Theorem 2.2.5 is quite tight when the condition number $\kappa$ is of constant order, $\|U_1\|_{2,\infty}$ is of order $O(\sqrt{r/n})$, and $A$ has exact rank $r$, since in this case $\|A\|_{2,\infty}/\Delta = \|U_1 \Sigma_1 U_1^T\|_{2,\infty}/\Delta \leq \kappa \|U_1\|_{2,\infty} \sim O(\sqrt{r/n})$. However, for general full-rank matrices or matrices with a large condition number, the bound (2.9) can be improved.

2.2.2 A near-optimal $\ell_{2,\infty}$-bound under i.i.d. Gaussian noise

In summary, various singular subspace perturbation bounds have been derived in the literature.
However, most existing results only work for low-rank matrices; some results require additional assumptions on the $\ell_{2,\infty}$-norm of the matrix $A$, and may not give an informative bound when the condition number becomes large. In this section, we present an improved perturbation bound under the assumption that the perturbation matrix $\Delta A$ has i.i.d. Gaussian entries. The proposed result complements the existing literature in that it requires only minimal assumptions, and holds both when the matrix $A$ is of low rank and when it is a general full-rank matrix.

We establish an improved $\ell_{2,\infty}$-norm singular subspace perturbation bound in the case where each entry of the perturbation matrix follows an i.i.d. Gaussian distribution. Statistical applications that fall into this regime include the Gaussian Mixture Model with isotropic noise covariance matrix Löffler et al. (2021). Admittedly, the proposed result requires the perturbation matrix $\Delta A$ to have i.i.d. Gaussian entries due to the proof technique we use; whether similar results can be obtained when $\Delta A$ is sub-Gaussian remains open.

Theorem 2.2.6. Suppose $\widetilde{A} = A + \Delta A \in \mathbb{R}^{n \times m}$, $\bar{n} := \max\{n, m\}$, $\Delta A$ has i.i.d. $N(0, \sigma^2)$ entries, and assume $\sigma_r(A) - \sigma_{r+1}(A) > 21\sigma\sqrt{\bar{n}}$. Then with probability at least $1 - \frac{c}{n^2}$,
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq c_1 \frac{\sigma^2 \bar{n}}{(\sigma_r(A) - \sigma_{r+1}(A))^2} \|U_1\|_{2,\infty} + c_2 \sigma \frac{R(r, n)}{\sigma_r(A) - \sigma_{r+1}(A)}, \tag{2.10}$$
where
$$R(r, n) = \begin{cases} \sqrt{r} + \sqrt{\log n}, & \text{if } A \text{ is rank } r; \\ r + \sqrt{r \log n}, & \text{else}, \end{cases}$$
and $c, c_1, c_2$ are absolute constants independent of $n$ and $m$.

The following corollary of Theorem 2.2.6 may be easier to digest.

Corollary 2.2.7. Under the same assumptions as in Theorem 2.2.6, if we additionally assume that the matrix $U_1$ holding the leading $r$ left singular vectors of $A$ is $\mu_1$-incoherent for some constant $\mu_1$, then with probability at least $1 - \frac{c}{n^2}$,
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq C \cdot \frac{\mu_1 \sqrt{r} + \sqrt{r \log n} + r}{\sqrt{n}} \sim O\left( \frac{r}{\sqrt{n}} \right),$$
where $c$ and $C$ are absolute constants, and the logarithmic factors are omitted in the big-O expression. A similar result holds for the right singular subspace.

For low-rank matrices, Corollary 2.2.7 can be further improved.

Corollary 2.2.8. When $A$ is of rank $r$, (2.10) reduces to
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq c_1 \frac{\sigma^2 \bar{n}}{\sigma_r^2(A)} \|U_1\|_{2,\infty} + c_2 \sigma \frac{\sqrt{r} + \sqrt{\log n}}{\sigma_r(A)}. \tag{2.11}$$
Under the same assumptions as in Corollary 2.2.7, with probability at least $1 - \frac{c}{n^2}$,
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq C \cdot \frac{\mu_1 \sqrt{r} + \sqrt{r} + \sqrt{\log n}}{\sqrt{n}} \sim O\left( \sqrt{\frac{r}{n}} \right),$$
where $c$ and $C$ are absolute constants. A similar result holds for the right singular subspace.
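As an informal check of the predicted $O(\sqrt{r/n})$ rate in Corollary 2.2.8, the Monte Carlo sketch below tracks the aligned $\ell_{2,\infty}$ error as $n$ grows for a rank-$r$ signal with incoherent singular vectors; the sizes, noise level, and signal strength are illustrative assumptions.

import numpy as np

# Monte Carlo check of the sqrt(r/n) scaling for rank-r signals with
# incoherent U1 and i.i.d. Gaussian noise (illustrative parameters).
rng = np.random.default_rng(4)
r, sigma = 3, 0.5

for n in [500, 1000, 2000]:
    # Random orthonormal factors are incoherent with high probability.
    U1 = np.linalg.qr(rng.standard_normal((n, r)))[0]
    V1 = np.linalg.qr(rng.standard_normal((n, r)))[0]
    A = 50 * np.sqrt(n) * sigma * (U1 @ V1.T)        # sigma_r(A) = 50*sigma*sqrt(n)
    A_tilde = A + sigma * rng.standard_normal((n, n))

    U1_t = np.linalg.svd(A_tilde, full_matrices=False)[0][:, :r]
    Q1, _, Q2t = np.linalg.svd(U1.T @ U1_t)          # alignment rotation Q_U
    err = np.max(np.linalg.norm(U1_t - U1 @ (Q1 @ Q2t), axis=1))
    print(f"n = {n:5d}   l_2,inf error = {err:.2e}   sqrt(r/n) = {np.sqrt(r/n):.2e}")
# The error decreases roughly in proportion to sqrt(r/n) as n grows.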
ห† โˆ’ ๐‘ˆ ( ๐ด)โˆฅ 2,โˆž โ‰ฅ ๐‘ โˆš1 ,   inf sup E min โˆฅ๐‘ˆ๐‘„ (2.12) ๐‘ˆห† ๐ดโˆˆM ๐‘„โˆˆO๐‘Ÿ ๐‘› โˆš๏ธ โˆš Achieve ๐‘‚ ( ๐‘Ÿ/๐‘›) for Achieve ๐‘‚ (๐‘Ÿ/ ๐‘›) for Do not require ๐œ… = ๐‘‚ (1) Do not require addition assumptions rank-๐‘Ÿ matrices general matrices Cape et al. (2019b) โœ— โœ“ โœ“ โœ— Lei (2019) โœ“ โœ“ โœ— โœ— Abbe et al. (2020) โœ“ โœ— โœ“ โœ— Cai et al. (2021) โœ“ โœ— โœ— โœ— This dissertation โœ“ โœ“ โœ“ โœ“ Table 2.1 Comparison of result in this dissertation with existing works about the โ„“2,โˆž bound. Here we compare these results under the assumptions that ฮ”๐ด has i.i.d. ๐‘ (0, ๐œŽ 2 ) entries, โˆฅ๐‘ˆ1 โˆฅ 2,โˆž โ‰ค โˆš๏ธ ๐œŽ ( ๐ด) ๐‘ ๐‘Ÿ/๐‘›, where ๐œŽ, ๐‘ are constants, and ๐‘› โ‰ ๐‘š. Here ๐œ… = ๐œŽ1 ( ๐ด) is the condition number of ๐ด. For ๐‘Ÿ simplicity, we ignore log ๐‘› in the big-O notation. We take a moment here to make a comparison between several existing works with ours (Theorem 2.1 and its corollaries). Table 2.1 summarizes the โ„“2,โˆž -norm bound and requirements in each work. The purpose of the comparison is to show the effectiveness of our results under the setting where the perturbation matrix ฮ”๐ด has i.i.d. Gaussian entries and the matrix ๐ด is nearly square (๐‘› โ‰ ๐‘š). To be fair, we would like to mention that some of the existing results might be better suited for other settings (such as when ๐‘š โ‰ซ ๐‘›). โˆš๏ธ From the table, we can see that our corollaries achieve the ๐‘‚ ( ๐‘Ÿ/๐‘›) order upper bound for โˆš low-rank matrices and ๐‘‚ (๐‘Ÿ/ ๐‘›) for full-rank matrices. In comparison, previous results in Cape โˆš๏ธ et al. (2019b) do not achieve the ๐‘‚ ( ๐‘Ÿ/๐‘›) order for the low-rank case under the assumptions as in โˆš๏ธ Corollary 2.2.8. The result in Lei (2019) achieved the same order of accuracy ๐‘‚ ( ๐‘Ÿ/๐‘›) but only for rank-๐‘Ÿ matrices and under a more restrictive gap condition (below is a simplified version) โˆš   ๐œŽ1 ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 โ‰ฅ ๐‘ min{ , 2๐‘Ÿ}๐œŽ ๐‘›ยฏ + โˆฅ ๐ดโˆฅ 2,โˆž , ๐œŽ๐‘Ÿ where ๐‘ is a constant. Abbe et al. (2020); Chen et al. (2021a); Cheng et al. (2020); Eldridge et al. (2018) consider the perturbation of one eigen-vector instead of a set of eigen-vectors. Abbe et al. 24 โˆš๏ธ (2020) obtained an ๐‘‚ ( ๐‘Ÿ/๐‘›) perturbation bound for low-rank matrices, but for full rank matrices, ๐œŽ ( ๐ด) 1 their bound is ๐‘‚ (1). Besides, the bound contains in it the condition number e ๐œ… ( ๐ด) = ๐œŽ ( ๐ด)โˆ’๐œŽ , ๐‘Ÿ ๐‘Ÿ+1 ( ๐ด) which potentially makes it very large. Likewise, the result in Cai et al. (2021) also contains a ๐œŽ ( ๐ด) condition number ๐œ…( ๐ด) = ๐œŽ1 ( ๐ด) . In contrast, the condition number does not show up in our results, ๐‘Ÿ hence Theorem 2.2.6 is more suitable for matrices with large condition numbers. In addition, all previous analyses except for Cape et al. (2019b) and Cai et al. (2021) are based on techniques developed for eigen-decomposition. When they are applied to rectangular matrices ๐ด โˆˆ R๐‘›ร—๐‘š , the matrix needs to be symmetrized, causing the resulting upper bounds potentially depend on both the left and the right singular vectors. In contrast, our bound (2.10) is one-sided, in that the perturbation of ๐‘ˆ1 only depends on โˆฅ๐‘ˆ1 โˆฅ 2,โˆž but not โˆฅ๐‘‰1 โˆฅ 2,โˆž . We make a more detailed comparison between our Theorem 2.2.6 and Theorem 4.3 in Cape et al. (2019b). For a fair comparison, we restrict ourselves to the setting where both Cape et al. 
For a fair comparison, we restrict ourselves to the setting where both Cape et al. (2019b) and our results hold, that is, when the data matrix $A\in\mathbb R^{n\times m}$ is of rank $r$ and the perturbation matrix $\Delta A$ has i.i.d. Gaussian $N(0,\sigma^2)$ entries with some constant $\sigma$. As discussed in Theorem 2.2.2, Theorem 4.3 in Cape et al. (2019b) reads: if $n \ge m$ and $\sigma_r(A) \gtrsim \sigma(n/\sqrt m)$, then (2.4) holds. Comparing (2.4) and (2.11), we notice that under the allowable gap condition $\sigma_r(A) \gtrsim \sigma\sqrt{\max\{n,m\}}$ and the incoherence condition $\|U_1\|_{2,\infty} \le c\sqrt r/\sqrt n$, the second term in (2.4), which reads $C\sqrt r\,\sigma^2 m/\sigma_r^2(A)$, can be as large as $O(\sqrt r)$ when $n \approx m$, which is much larger than our bound $O(\sqrt{r/n})$. Admittedly, our current result only holds for Gaussian perturbations due to the proof techniques we use. We leave the study of other perturbation types as future work.

2.3 Stability of Principal Component Analysis (PCA)

As arguably one of the most popular tools for data visualization and exploration, PCA is used to extract the main features from a dataset or to reduce the dimensionality of the data. There is a vast literature on the analysis of PCA. Most previous works focused on the consistency of principal component directions or eigenvalues Chen et al. (2021b); Cai et al. (2021); Vaswani and Narayanamurthy (2017); Narayanamurthy and Vaswani (2020), while the stability of PC scores (i.e., the projection of the data matrix $A$ onto its PC directions; using our notation from Section 2.2, the PC scores are given by $U_1\Sigma_1$) is less explored, despite its importance in the analysis of various spectral methods. There are several relevant works investigating the stability of PC scores, but their analyses were conducted under different settings. For completeness, we include a brief review here. Abbe et al. (2022) developed an $\ell_p$ analysis for a hollowed version of PCA, where the SVD is conducted on the hollowed Gram matrix $G = \mathcal H(AA^T)$. Here, $A$ is the data matrix and the operator $\mathcal H(\cdot)$ zeros out all diagonal entries of a square matrix. The PC scores are given by $U\Lambda^{1/2}$, where $U$ and $\Lambda$ are the eigenvector matrix and corresponding eigenvalues of $G$. Perturbation bounds on PC scores in the $\ell_{2,p}$-norm were derived in Abbe et al. (2022) to characterize the entrywise behaviour of PCA. Another line of research studied adjacency spectral embedding (ASE) for random dot product graphs (RDPG), which is closely related to PCA in that both return a weighted singular vector matrix. Central limit theorems for rows of ASE have been provided in Tang and Preibe (2018); Athreya et al. (2021). Different from these previous studies, in this section we focus on the stability of the PC scores of the original PCA algorithm, which does not use the hollowed Gram matrix. Given a centered data matrix $A$ and its conformal SVD, PCA returns $U_1\Sigma_1$ (or $V_1\Sigma_1$) as the low-dimensional projection into $\mathbb R^r$. Due to the possible similarity among the singular values within $\Sigma_1$, the PCA embedding may be subject to rotations. Hence when computing the error, we mod out this rotation and aim to bound $\min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\|$ or $\min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\|_F$, where $\|\cdot\|$ denotes the spectral norm.
The main difference between these quantities and the $\sin\Theta$ angle between singular subspaces is that $U_1$ is now multiplied by the corresponding singular values, and it is the perturbation of this product that we want to analyze. Naively, one may expect that the perturbation of $U_1\Sigma_1$ is approximately equal to the perturbation of $U_1$ times $\|\Sigma_1\|$ plus the perturbation of $\Sigma_1$ times $\|U_1\|$, and the perturbation of $U_1$ can in turn be controlled by the $\sin\Theta$ theorem. This argument leads to
$$\min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\| \le c\cdot\frac{\sigma_1(A)\|\Delta A\|}{\sigma_r - \sigma_{r+1}}, \qquad (2.13)$$
where $c$ is some absolute constant. However, this bound is quite large due to the presence of $\sigma_1(A)$ in the numerator. Noticing that $\sigma_1(A)$ appears in (2.13) because we treat $U_1\Sigma_1$ as a whole, in the following theorem we show that the perturbed singular vectors corresponding to different singular values actually have different levels of stability, which in turn enables a tighter bound on the PC scores. More specifically, the next theorem shows that the singular vectors associated with larger singular values are more stable.

Theorem 2.3.1. For $j = 1,\ldots,r$, let $\sin\Theta(\tilde u_j, U_1)$ be the $\sin\Theta$ angle between the $j$th perturbed left singular vector $\tilde u_j$ and the leading $r$-dimensional singular subspace $\mathrm{span}(U_1)$ of $A$. Then, provided that $3\|\Delta A\| \le \sigma_r - \sigma_{r+1}$, we have
$$\|\sin\Theta(\tilde u_j, U_1)\| \le \frac{C\|\Delta A\|}{\sigma_j - \sigma_{r+1}},$$
where $C$ is some universal constant, and by definition $\|\sin\Theta(\tilde u_j,U_1)\| \equiv \|\tilde u_j^T U_2\|$, with $U_2$ the orthogonal complement of $U_1$.

The different levels of stability of the singular vectors observed in Theorem 2.3.1 will help us get rid of $\sigma_1(A)$ and establish a tighter bound on the PC scores.

Theorem 2.3.2. Let $\tilde A = A + \Delta A$, let $U_1\Sigma_1$ be the PCA embedding of $A$ and $\tilde U_1\tilde\Sigma_1$ that of $\tilde A$. Then we have
$$\min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\| \le 3\|\Delta A\| + 3\sigma_{r+1}\min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\},$$
$$\min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\|_F \le \left(2\|(\Delta A)_r\|_F^2 + 3\Big(\|(\Delta A)_r\|_F + \|(\Sigma_2)_r\|_F\min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\}\Big)^2\right)^{1/2} + \|(\Delta A)_r\|_F + \|(\Sigma_2)_r\|_F\min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\}.$$
Here $(\Delta A)_r$ is the best rank-$r$ approximation of $\Delta A$.

The upper bound is tighter than (2.13) and can be used to facilitate the error analysis of PCA-related methods (e.g., Little et al. (2018)).

Remark 2.3.3. When $A$ is rank-$r$, the above result reduces to
$$\min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\| \le 3\|\Delta A\|, \qquad \min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\|_F \le (\sqrt 5 + 1)\|(\Delta A)_r\|_F.$$

Remark 2.3.4. A tight $\ell_{2,p}$-norm perturbation bound of PC scores for hollowed PCA was developed in Abbe et al. (2022). However, unlike the previous theorem for vanilla PCA, it seems impossible to eliminate $\sigma_1(A)$ from the bound in hollowed PCA, due to the fact that hollowed PCA conducts the decomposition on the Gram matrix instead of the original data matrix $A$. In the noisy setting, the noise on the Gram matrix contains the term $A^T\Delta A$, whose norm may reach $O(\sigma_1(A)\|\Delta A\|)$, with $\sigma_1$ included in the expression.
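The gain over the naive bound (2.13) is easy to see numerically. The sketch below (our own illustration, with illustrative sizes and spectra; the minimum over rotations is again upper-bounded by a Procrustes alignment) compares the PC-score perturbation against both the bound of Theorem 2.3.2 and the $\sigma_1(A)\|\Delta A\|/(\sigma_r-\sigma_{r+1})$ scale of (2.13).

```python
import numpy as np

rng = np.random.default_rng(3)
n = m = 300
r = 3

U0, _ = np.linalg.qr(rng.standard_normal((n, n)))
V0, _ = np.linalg.qr(rng.standard_normal((m, m)))
s = np.concatenate([[50000.0, 5000.0, 500.0], 0.5 * np.ones(m - r)])  # huge sigma_1
A = U0 @ np.diag(s) @ V0.T
dA = rng.standard_normal((n, m))

U, sv, _ = np.linalg.svd(A)
Ut, svt, _ = np.linalg.svd(A + dA)
P, Pt = U[:, :r] * sv[:r], Ut[:, :r] * svt[:r]     # PCA scores U1 Sigma1
W1, _, W2t = np.linalg.svd(Pt.T @ P)               # Procrustes alignment
err = np.linalg.norm(P - Pt @ (W1 @ W2t), 2)
nd = np.linalg.norm(dA, 2)
gap = sv[r - 1] - sv[r]
print(err)
print(3 * nd + 3 * sv[r] * min(2 * nd / gap, 1))   # Theorem 2.3.2 (spectral norm)
print(sv[0] * nd / gap)                            # naive scale of (2.13): far larger
```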
2.4 Stability of singular value truncation

In addition to studying the perturbations of $U_1$ and $U_1\Sigma_1$, we also investigate the stability of the hard singular value thresholding operator, which provides the best rank-$r$ approximation of $A$, i.e., $A_r = U_1\Sigma_1V_1^T$. This operator, also known as singular value truncation, is widely used in matrix completion and matrix denoising for promoting low-rankness or reducing noise Tanner and Wei (2013); Donoho and Gavish (2014); Cai et al. (2010); Gavish and Donoho (2014). Let $\tilde A = A + \Delta A$ be the noisy matrix, and let $\tilde A_r$ denote its best rank-$r$ approximation. We characterize the stability of the hard singular value thresholding operator through a bound on $\|A_r - \tilde A_r\|$. Previous works have investigated the stability of the truncated SVD Luo et al. (2021); Vu et al. (2021), and tight error bounds for low-rank matrices have been derived. However, a tight bound for general matrices is still missing in the literature. For a rank-$r$ matrix $A$ with $r < \min\{m,n\}$, the following perturbation result was obtained in Luo et al. (2021):
$$\|A - \tilde A_r\| \le 2\|\Delta A\|. \qquad (2.14)$$
Since in practice $A$ may not be exactly rank-$r$, we hope to establish upper bounds for general full-rank matrices. We comment that although we can easily derive an upper bound for full-rank matrices from the low-rank result, the resulting bound is not tight. Explicitly, for a full-rank matrix $A$, $A_r$ is of low rank, so we can apply (2.14) to $A_r$ to get
$$\|A_r - \tilde A_r\| = \|A_r - (A+\Delta A)_r\| = \|A_r - (A_r + \tilde E)_r\| \le 2\|\tilde E\| \le 2\|\Delta A\| + 2\|A - A_r\| = 2\|\Delta A\| + 2\sigma_{r+1},$$
where $\tilde E = \Delta A + A - A_r$ and the first inequality used (2.14). Apparently, this bound is not optimal, as it does not shrink to 0 when $\Delta A \to 0$. This motivates us to establish the following tighter bound.

Theorem 2.4.1 (Perturbation result on singular value truncation). Let $A\in\mathbb R^{n\times m}$ be any $n\times m$ matrix and $\tilde A = A + \Delta A$ be its noisy version. Denote by $A_r$ and $\tilde A_r$ their rank-$r$ thresholdings, with all but the first $r$ singular values set to 0. Let $\sigma_i$ be the $i$th largest singular value of $A$ and $\Sigma_2$ be the diagonal matrix containing $\sigma_{r+1},\ldots,\sigma_{\min}$ (the $(r+1)$th to the last singular values of $A$) on the diagonal. Then
$$\|A_r - \tilde A_r\| \le 2\|\Delta A\| + 2\sigma_{r+1}\min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\}, \qquad (2.15)$$
$$\|A_r - \tilde A_r\|_F \le \left(2\|(\Delta A)_r\|_F^2 + 3\Big(\|(\Delta A)_r\|_F + \|(\Sigma_2)_r\|_F\min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\}\Big)^2\right)^{1/2}. \qquad (2.16)$$

This error bound has exactly the same form as the PCA perturbation bound established in the previous section, except that here $A_r$ and $\tilde A_r$ do not differ by a rotation. Intuitively, this indicates that the noise-induced rotation on $\tilde U_1$ and that on $\tilde V_1^T$ can essentially cancel each other.

Remark 2.4.2. When $A$ is a rank-$r$ matrix, the bound in Theorem 2.4.1 reduces to the result in Luo et al. (2021):
$$\|A_r - \tilde A_r\| \le 2\|\Delta A\|, \qquad \|A_r - \tilde A_r\|_F \le \sqrt 5\,\|(\Delta A)_r\|_F. \qquad (2.17)$$
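The behaviour of (2.15) is easy to check empirically. The sketch below (our own illustration; all sizes and spectra are assumptions) truncates a noisy full-rank matrix at rank $r$ for shrinking noise levels and compares the error with the bound, which, unlike the crude $2\|\Delta A\| + 2\sigma_{r+1}$ estimate above, vanishes as the noise does.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, r = 200, 150, 4

def trunc(M, r):
    """Best rank-r approximation (hard singular value thresholding)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

U0, _ = np.linalg.qr(rng.standard_normal((n, m)))
V0, _ = np.linalg.qr(rng.standard_normal((m, m)))
tail = np.sort(rng.uniform(0.1, 1.0, m - r))[::-1]
s = np.concatenate([[100.0, 90.0, 80.0, 70.0], tail])   # full rank, gap after r
A = U0 @ np.diag(s) @ V0.T

for scale in [1.0, 0.1, 0.01]:
    dA = scale * rng.standard_normal((n, m))
    err = np.linalg.norm(trunc(A, r) - trunc(A + dA, r), 2)
    nd = np.linalg.norm(dA, 2)
    bound = 2 * nd + 2 * s[r] * min(2 * nd / (s[r - 1] - s[r]), 1)   # (2.15)
    print(err, bound)   # the bound holds and shrinks to zero with the noise
```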
๏ฃด โˆฅ ๐ด๐‘Ÿ โˆ’ ๐ด ๏ฃณ 2.5 Closed-form expression of sin ฮ˜ distance between two singular spaces The several new results presented in the previous sections are derived either directly or indirectly from a set of sinฮ˜ formulae we shall establish in this section. In other words, these sinฮ˜ formulae serve as useful tools to analyze SVD based perturbation problems. 2.5.1 First order equivalent expressions of the sin ฮ˜ distance Following the same notation as in Section 1.2.1, our goal is to compute the exact expressions of perturbation angles of the leading left singular subspace ๐‘ˆ1 under noise ฮ”๐ด. (1.4) and (1.5) indicate that the matrices ๐‘ˆ2๐‘‡ ๐‘ˆe1 and ๐‘ˆe๐‘‡ ๐‘ˆ1 are key intermediate quantities to bound the sinฮ˜ angles. In the 2 following theorem, we provide useful expressions of these key quantities. 29 Theorem 2.5.1 (Angular perturbation formula). Let ๐ด, ๐ด e = ๐ด + ฮ”๐ด be two ๐‘› ร— ๐‘š matrices and their conformal SVDs are defined as (1.1). The rank of ๐ด is at least ๐‘Ÿ. Assume there is a gap between the ๐‘Ÿth and the (๐‘Ÿ + 1)th singular values, i.e., ๐œŽ๐‘Ÿ โˆ’ e ๐œŽ๐‘Ÿ+1 > 0 and e ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 > 0. Then the following expressions hold: ๐‘ˆ1๐‘‡ ๐‘ˆe2 = ๐น 12 โ—ฆ (๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰ ฮฃ๐‘‡2 + ฮฃ1๐‘‰1๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ e2e e2 ), ๐‘ˆ 1 ๐‘ˆ2๐‘‡ ๐‘ˆe1 = ๐น 21 โ—ฆ (๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰ ฮฃ๐‘‡1 + ฮฃ2๐‘‰2๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ e1e e1 ), ๐‘ˆ 2 (2.18) ๐‘‰1๐‘‡ ๐‘‰ e2 = ๐น๐‘‰12 โ—ฆ (ฮฃ๐‘‡1 ๐‘ˆ1๐‘‡ (ฮ”๐ด)๐‘‰ e2 +๐‘‰ ๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ 1 ฮฃ2 ), e2e e1 = ๐น 21 โ—ฆ (ฮฃ๐‘‡ ๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰ ๐‘‰2๐‘‡ ๐‘‰ e1 +๐‘‰ ๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ ฮฃ1 ). e1e ๐‘‰ 2 2 2 More specifically, the assumption ๐œŽ๐‘Ÿ โˆ’e ๐œŽ๐‘Ÿ+1 > 0 is required for the first and the third expressions of (2.18) to hold, and e ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 > 0 is required for the second and the last expressions to hold. Here โ—ฆ means the Hadamard product, or element-wise product between two matrices. ๐น๐‘ˆ12 โˆˆ R๐‘Ÿร—(๐‘›โˆ’๐‘Ÿ) has entries (๐น๐‘ˆ12 )๐‘–, ๐‘— = 2 1 2 , 1 โ‰ค ๐‘– โ‰ค ๐‘Ÿ, 1 โ‰ค ๐‘— โ‰ค ๐‘› โˆ’ ๐‘Ÿ; ๐น๐‘ˆ21 โˆˆ R (๐‘›โˆ’๐‘Ÿ)ร—๐‘Ÿ has entries (๐น๐‘ˆ21 )๐‘–, ๐‘— = ๐œŽ e โˆ’๐œŽ ๐‘—+๐‘Ÿ ๐‘– 1 , 1 โ‰ค ๐‘– โ‰ค ๐‘› โˆ’ ๐‘Ÿ, 1 โ‰ค ๐‘— โ‰ค ๐‘Ÿ. Similarly, ๐น๐‘‰12 โˆˆ R๐‘Ÿร—(๐‘šโˆ’๐‘Ÿ) has entries (๐น๐‘‰12 )๐‘–, ๐‘— = 2 1 2 , 1 โ‰ค ๐œŽ 2 โˆ’๐œŽ 2 e ๐œŽ e โˆ’๐œŽ ๐‘— ๐‘–+๐‘Ÿ ๐‘—+๐‘Ÿ ๐‘– ๐‘– โ‰ค ๐‘Ÿ, 1 โ‰ค ๐‘— โ‰ค ๐‘š โˆ’๐‘Ÿ; and ๐น๐‘‰21 โˆˆ R (๐‘šโˆ’๐‘Ÿ)ร—๐‘Ÿ with entries (๐น๐‘‰21 )๐‘–, ๐‘— = 2 1 2 , 1 โ‰ค ๐‘– โ‰ค ๐‘š โˆ’๐‘Ÿ, 1 โ‰ค ๐‘— โ‰ค ๐‘Ÿ. ๐œŽ โˆ’๐œŽ e ๐‘— ๐‘–+๐‘Ÿ Here if ๐‘– > min{๐‘›, ๐‘š}, we enforce ๐œŽ๐‘– and e ๐œŽ๐‘– to be 0. Taking the spectral norm on both hand sides of (2.18) gives us the following new expression of the sinฮ˜ distance. Corollary 2.5.2. If the condition in Theorem 2.5.1 is satisfied, then the sin ฮ˜ distances between the ๐‘Ÿ leading singular spaces of the original and the perturbed matrices satisfy โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ e1 )โˆฅ = โˆฅ๐น 12 โ—ฆ (๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰ e2eฮฃ๐‘‡2 + ฮฃ1๐‘‰1๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ e2 )โˆฅ ๐‘ˆ 1 = โˆฅ๐น๐‘ˆ21 โ—ฆ (๐‘ˆ2๐‘‡ (ฮ”๐ด)๐‘‰ e1eฮฃ๐‘‡1 + ฮฃ2๐‘‰2๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ e1 )โˆฅ, โˆฅ sin ฮ˜(๐‘‰1 , ๐‘‰ e1 )โˆฅ = โˆฅ๐น 12 โ—ฆ (ฮฃ๐‘‡ ๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰ e2 +๐‘‰ ๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ e2e ฮฃ2 )โˆฅ ๐‘‰ 1 1 1 = โˆฅ๐น๐‘‰21 โ—ฆ (ฮฃ๐‘‡2 ๐‘ˆ2๐‘‡ (ฮ”๐ด)๐‘‰ e1 +๐‘‰ ๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ 2 e1e ฮฃ1 )โˆฅ. Remark 2.5.3. 
Remark 2.5.3. In the expressions of Corollary 2.5.2, the singular value gaps enter through the denominators of the terms $F_U^{21}$, $F_U^{12}$, $F_V^{21}$, and $F_V^{12}$. In this sense, (2.18) conveys the same insight as Wedin's $\sin\Theta$ theorem.

Everything else on the right-hand sides of Corollary 2.5.2 is straightforward to bound, except perhaps the Hadamard products. The following lemma shows that the Hadamard products are also relatively easy to treat.

Lemma 2.5.4. Assume $\sigma_r - \tilde\sigma_{r+1} > 0$ and $\tilde\sigma_r - \sigma_{r+1} > 0$; let $F_U^{12}$, $F_U^{21}$, $\Sigma_1$, $\tilde\Sigma_1$, $\Sigma_2$, $\tilde\Sigma_2$ be the same as in Theorem 2.5.1, and let $H_1\in\mathbb R^{(n-r)\times r}$, $H_2\in\mathbb R^{(m-r)\times r}$, $H_3\in\mathbb R^{r\times(m-r)}$, $H_4\in\mathbb R^{r\times(n-r)}$ be arbitrary matrices. Then
$$|||F_U^{21}\circ(H_1\tilde\Sigma_1)||| \le \frac{\tilde\sigma_r}{\tilde\sigma_r^2-\sigma_{r+1}^2}|||H_1|||, \qquad |||F_U^{21}\circ(\Sigma_2H_2)||| \le \frac{\sigma_{r+1}}{\tilde\sigma_r^2-\sigma_{r+1}^2}|||H_2|||, \qquad (2.19)$$
$$|||F_U^{12}\circ(H_3\tilde\Sigma_2^T)||| \le \frac{\tilde\sigma_{r+1}}{\sigma_r^2-\tilde\sigma_{r+1}^2}|||H_3|||, \qquad |||F_U^{12}\circ(\Sigma_1H_4)||| \le \frac{\sigma_r}{\sigma_r^2-\tilde\sigma_{r+1}^2}|||H_4|||, \qquad (2.20)$$
where $|||\cdot|||$ can be either the spectral or the Frobenius norm. Similar results also hold for $F_V^{12}$ and $F_V^{21}$.

2.5.2 Examples in using Theorem 2.5.1

We demonstrate how to use Theorem 2.5.1 to simplify the proofs of some existing perturbation bounds in the literature. The theorem we use to derive all the new results in this dissertation is in the next section (Theorem 2.5.7); curious readers may safely jump to the next section from here.

Example 1: The angular perturbation formulae in Theorem 2.5.1 naturally yield the one-sided $\sin\Theta$ bounds first discovered in Cai and Zhang (2018). Theorem 2.5.1 now offers a very straightforward derivation of these bounds.

Theorem 2.5.5 (One-sided $\sin\Theta$ theorem). Using the same notation and quantities as in Theorem 2.5.1, if $\sigma_r - \tilde\sigma_{r+1} > 0$ and $\tilde\sigma_r - \sigma_{r+1} > 0$, then
$$\|\sin\Theta(U_1,\tilde U_1)\| \le \min\left\{\frac{\tilde\sigma_r\|(\Delta A)\tilde V_1\|}{\tilde\sigma_r^2-\sigma_{r+1}^2} + \frac{\sigma_{r+1}\|(\Delta A)^T\tilde U_1\|}{\tilde\sigma_r^2-\sigma_{r+1}^2},\ \frac{\sigma_r\|(\Delta A)V_1\|}{\sigma_r^2-\tilde\sigma_{r+1}^2} + \frac{\tilde\sigma_{r+1}\|U_1^T(\Delta A)\|}{\sigma_r^2-\tilde\sigma_{r+1}^2}\right\}, \qquad (2.21)$$
$$\|\sin\Theta(V_1,\tilde V_1)\| \le \min\left\{\frac{\tilde\sigma_r\|\tilde U_1^T(\Delta A)\|}{\tilde\sigma_r^2-\sigma_{r+1}^2} + \frac{\sigma_{r+1}\|(\Delta A)\tilde V_1\|}{\tilde\sigma_r^2-\sigma_{r+1}^2},\ \frac{\sigma_r\|U_1^T(\Delta A)\|}{\sigma_r^2-\tilde\sigma_{r+1}^2} + \frac{\tilde\sigma_{r+1}\|(\Delta A)V_1\|}{\sigma_r^2-\tilde\sigma_{r+1}^2}\right\}. \qquad (2.22)$$
Moreover,
$$\max\{\|\sin\Theta(U_1,\tilde U_1)\|, \|\sin\Theta(V_1,\tilde V_1)\|\} \le \min\left\{\frac{1}{\sigma_r-\tilde\sigma_{r+1}}, \frac{1}{\tilde\sigma_r-\sigma_{r+1}}\right\}\|\Delta A\|. \qquad (2.23)$$

(2.21) and (2.22) are individual bounds on $\|\sin\Theta(U_1,\tilde U_1)\|$ and $\|\sin\Theta(V_1,\tilde V_1)\|$, while the classical Wedin $\sin\Theta$ theorem is a uniform bound on both. The benefit of obtaining the individual bounds was clearly pointed out in Cai and Zhang (2018) by an example: let $A\in\mathbb R^{n\times m}$ be a fixed rank-$r$ matrix with $r < n \ll m$, and let $\Delta A\in\mathbb R^{n\times m}$ be a small random matrix with i.i.d. standard normal entries. Wedin's theorem implies
$$\max\{\|\sin\Theta(U_1,\tilde U_1)\|, \|\sin\Theta(V_1,\tilde V_1)\|\} \le \frac{C\max\{\sqrt n,\sqrt m\}}{\sigma_r}, \qquad (2.24)$$
while the one-sided bounds approximately give
$$\|\sin\Theta(U_1,\tilde U_1)\| \le \frac{C\sqrt n}{\sigma_r}, \qquad \|\sin\Theta(V_1,\tilde V_1)\| \le \frac{C\sqrt m}{\sigma_r}. \qquad (2.25)$$
Since we assumed $n \ll m$, only the one-sided bound successfully indicates that $U_1$ is more stable than $V_1$.
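The asymmetry in (2.25) is easy to reproduce numerically. The following sketch (our own illustration, with an assumed spectrum and noise level) computes the two $\sin\Theta$ distances via principal angles for $n \ll m$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 50, 2000, 5

Q1, _ = np.linalg.qr(rng.standard_normal((n, r)))
Q2, _ = np.linalg.qr(rng.standard_normal((m, r)))
A = Q1 @ np.diag(np.linspace(100.0, 60.0, r)) @ Q2.T    # rank-r, n << m
dA = 0.5 * rng.standard_normal((n, m))

def sin_theta(X, Y):
    """Spectral sin-Theta distance between the column spans of X and Y
    (X and Y have orthonormal columns of the same dimension)."""
    sv = np.linalg.svd(X.T @ Y, compute_uv=False)
    return np.sqrt(max(0.0, 1.0 - sv.min() ** 2))

U, s, Vt = np.linalg.svd(A)
Ut, st, Vtt = np.linalg.svd(A + dA)
print(sin_theta(U[:, :r], Ut[:, :r]))     # ~ C*sqrt(n)*0.5/sigma_r: small
print(sin_theta(Vt[:r].T, Vtt[:r].T))     # ~ C*sqrt(m)*0.5/sigma_r: much larger
```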
The proof of Theorem 2.5.5 is a simple application of Theorem 2.5.1.

Proof: From Theorem 2.5.1 we have
$$U_2^T\tilde U_1 = F_U^{21}\circ(U_2^T(\Delta A)\tilde V_1\tilde\Sigma_1^T + \Sigma_2V_2^T(\Delta A)^T\tilde U_1), \qquad U_1^T\tilde U_2 = F_U^{12}\circ(U_1^T(\Delta A)\tilde V_2\tilde\Sigma_2^T + \Sigma_1V_1^T(\Delta A)^T\tilde U_2).$$
By (2.19) in Lemma 2.5.4,
$$\|U_2^T\tilde U_1\| \le \frac{\tilde\sigma_r}{\tilde\sigma_r^2-\sigma_{r+1}^2}\|U_2^T(\Delta A)\tilde V_1\| + \frac{\sigma_{r+1}}{\tilde\sigma_r^2-\sigma_{r+1}^2}\|V_2^T(\Delta A)^T\tilde U_1\| \le \frac{\tilde\sigma_r}{\tilde\sigma_r^2-\sigma_{r+1}^2}\|(\Delta A)\tilde V_1\| + \frac{\sigma_{r+1}}{\tilde\sigma_r^2-\sigma_{r+1}^2}\|(\Delta A)^T\tilde U_1\| \qquad (2.26)$$
$$\le \frac{\|\Delta A\|}{\tilde\sigma_r-\sigma_{r+1}}. \qquad (2.27)$$
Similarly,
$$\|U_1^T\tilde U_2\| \le \frac{\tilde\sigma_{r+1}}{\sigma_r^2-\tilde\sigma_{r+1}^2}\|U_1^T\Delta A\| + \frac{\sigma_r}{\sigma_r^2-\tilde\sigma_{r+1}^2}\|(\Delta A)V_1\| \qquad (2.28)$$
$$\le \frac{\|\Delta A\|}{\sigma_r-\tilde\sigma_{r+1}}. \qquad (2.29)$$
Inserting (2.26) and (2.28) into $\|\sin\Theta(U_1,\tilde U_1)\| = \min\{\|U_1^T\tilde U_2\|, \|U_2^T\tilde U_1\|\}$, we obtain (2.21). Similarly, (2.22) also holds. (2.23) is obtained by using (2.27) and (2.29). □

Example 2: In this example, we show that one may obtain interesting results when applying Theorem 2.5.1 to some less usual choices of $\Delta A$. Explicitly, we use Theorem 2.5.1 to re-derive a useful result in Cai and Zhang (2018) with a more straightforward proof. The result, copied in Proposition 2.5.6, concerns the $\sin\Theta$ distance between the leading singular subspace of a matrix $A$ and an arbitrary subspace.

Proposition 2.5.6 (Proposition 1 in Cai and Zhang (2018)). Suppose $A\in\mathbb R^{n\times m}$. The orthonormal matrix $V = [V_1, V_2]\in\mathbb R^{m\times m}$ is the matrix of right singular vectors of $A$, i.e., $V_1\in\mathbb R^{m\times r}$ and $V_2\in\mathbb R^{m\times(m-r)}$ correspond to the first $r$ and last $m-r$ singular vectors, respectively. $[W_1,W_2]\in\mathbb R^{m\times m}$ is any orthonormal matrix with $W_1\in\mathbb R^{m\times r}$, $W_2\in\mathbb R^{m\times(m-r)}$. Given that $\sigma_r(AW_1) > \sigma_{r+1}(A)$, we have
$$\|\sin\Theta(V_1,W_1)\| \le \min\left\{\frac{\sigma_r(AW_1)\|P_{AW_1}AW_2\|}{\sigma_r^2(AW_1)-\sigma_{r+1}^2(A)}, 1\right\}, \qquad (2.30)$$
$$\|\sin\Theta(V_1,W_1)\|_F \le \min\left\{\frac{\sigma_r(AW_1)\|P_{AW_1}AW_2\|_F}{\sigma_r^2(AW_1)-\sigma_{r+1}^2(A)}, \sqrt r\right\}. \qquad (2.31)$$

In order to use Theorem 2.5.1 to prove Proposition 2.5.6, we recognize that Proposition 2.5.6 is actually a $\sin\Theta$ bound under a special perturbation. Specifically, if we set $\Delta A = AW_1W_1^T - A$, then the quantity $\sin\Theta(V_1,W_1)$ bounded in Proposition 2.5.6 is exactly the $\sin\Theta$ angle between $A$ and $\tilde A = A + \Delta A$. In addition, this particular choice of $\Delta A$ has a small norm, therefore leading to a small perturbation bound.
Proof: Apply Theorem 2.5.1 to $A$ and $\tilde A = AW_1W_1^T$, which means $\Delta A = \tilde A - A = AW_1W_1^T - A = -AW_2W_2^T$. Let $U_i, V_i, \Sigma_i, \tilde U_i, \tilde V_i, \tilde\Sigma_i$, $i = 1,2$, be from the conformal SVDs (1.1) of this $A$ and $\tilde A$. Then, using the notation of Theorem 2.5.1, we have $(F_V^{21})_{i,j} = \frac{1}{\sigma_j^2(AW_1)-\sigma_{i+r}^2(A)}$, $\tilde V_1 = W_1$, and $\Sigma_2^TU_2^T(\Delta A)\tilde V_1 = 0$. Theorem 2.5.1 in this case gives
$$V_2^TW_1 = F_V^{21}\circ(V_2^T(\Delta A)^T\tilde U_1\tilde\Sigma_1).$$
By Lemma 2.5.4, this implies
$$|||V_2^TW_1||| \le \frac{\sigma_r(AW_1)\,|||\tilde U_1^TAW_2W_2^TV_2|||}{\sigma_r^2(AW_1)-\sigma_{r+1}^2(A)} \le \frac{\sigma_r(AW_1)\,|||P_{AW_1}AW_2|||}{\sigma_r^2(AW_1)-\sigma_{r+1}^2(A)},$$
where $|||\cdot|||$ can be either the spectral or the Frobenius norm. Also, we directly have $\|V_2^TW_1\| \le 1$ and $\|V_2^TW_1\|_F \le \sqrt r$; thus (2.30) and (2.31) hold. □

2.5.3 High-order sin Θ distance formulae using series expansions

Although the formulae in Theorem 2.5.1 are already quite useful, they are still only first-order formulae in the following sense. Looking at the first formula in (2.18) of Theorem 2.5.1, a closer examination shows that the unknown left-hand side $U_1^T\tilde U_2$ also appears implicitly on the right-hand side, albeit in higher-order terms. Since we consider upper bounds in the non-asymptotic regime, higher-order errors may sometimes affect the tightness of the bound, so we wish to remove them. To be more specific about the implicit appearance of the higher-order terms, we denote the left-hand sides of the four formulae in Theorem 2.5.1 by $X, Y, W, Z$:
$$X := U_1^T\tilde U_2, \quad Y := U_2^T\tilde U_1, \quad W := V_1^T\tilde V_2, \quad Z := V_2^T\tilde V_1.$$
First focus on the expression of $Y$ in Theorem 2.5.1:
$$\begin{aligned}
Y \equiv U_2^T\tilde U_1 &= F_U^{21}\circ(U_2^T(\Delta A)\tilde V_1\tilde\Sigma_1^T + \Sigma_2V_2^T(\Delta A)^T\tilde U_1)\\
&= F_U^{21}\circ(\Sigma_2V_2^T(\Delta A)^TU_1U_1^T\tilde U_1 + \Sigma_2V_2^T(\Delta A)^TU_2U_2^T\tilde U_1 + U_2^T(\Delta A)V_1V_1^T\tilde V_1\tilde\Sigma_1^T + U_2^T(\Delta A)V_2V_2^T\tilde V_1\tilde\Sigma_1^T)\\
&= \underbrace{F_U^{21}\circ(\Sigma_2\alpha_{12}^TU_1^T\tilde U_1 + \alpha_{21}V_1^T\tilde V_1\tilde\Sigma_1^T)}_{=:C_1} + F_U^{21}\circ(\Sigma_2\alpha_{22}^TY) + F_U^{21}\circ(\alpha_{22}Z\tilde\Sigma_1^T), \qquad (2.32)
\end{aligned}$$
where the second line used $U_1U_1^T + U_2U_2^T = I$ and $V_1V_1^T + V_2V_2^T = I$, the third line is a regrouping of terms, and $\alpha_{ij} := U_i^T\Delta AV_j$, $1\le i,j\le 2$. We can get the same type of expression for $Z$:
$$\begin{aligned}
Z \equiv V_2^T\tilde V_1 &= F_V^{21}\circ(\Sigma_2^TU_2^T(\Delta A)\tilde V_1 + V_2^T(\Delta A)^T\tilde U_1\tilde\Sigma_1)\\
&= F_V^{21}\circ(V_2^T(\Delta A)^TU_1U_1^T\tilde U_1\tilde\Sigma_1 + V_2^T(\Delta A)^TU_2U_2^T\tilde U_1\tilde\Sigma_1 + \Sigma_2^TU_2^T(\Delta A)V_1V_1^T\tilde V_1 + \Sigma_2^TU_2^T(\Delta A)V_2V_2^T\tilde V_1)\\
&= \underbrace{F_V^{21}\circ(\alpha_{12}^TU_1^T\tilde U_1\tilde\Sigma_1 + \Sigma_2^T\alpha_{21}V_1^T\tilde V_1)}_{=:C_2} + F_V^{21}\circ(\alpha_{22}^TY\tilde\Sigma_1) + F_V^{21}\circ(\Sigma_2^T\alpha_{22}Z). \qquad (2.33)
\end{aligned}$$
Looking at the last right-hand sides of (2.32) and (2.33), we see that $Y$ and $Z$ are contained in the second and third terms, respectively, so they appear on both sides.
To highlight this structure, we shorten the notation by letting $\mathcal F$ be the linear operator defined as
$$\mathcal F\begin{pmatrix}Y\\Z\end{pmatrix} = \begin{pmatrix}F_U^{21}\circ(\Sigma_2\alpha_{22}^TY) + F_U^{21}\circ(\alpha_{22}Z\tilde\Sigma_1^T)\\ F_V^{21}\circ(\alpha_{22}^TY\tilde\Sigma_1) + F_V^{21}\circ(\Sigma_2^T\alpha_{22}Z)\end{pmatrix}.$$
Then (2.32) and (2.33) become
$$\begin{pmatrix}Y\\Z\end{pmatrix} = \begin{pmatrix}C_1\\C_2\end{pmatrix} + \mathcal F\begin{pmatrix}Y\\Z\end{pmatrix}.$$
Clearly, this is an implicit system of equations for $Y$ and $Z$. Provided $\|\mathcal F\| < 1$, we can move $\mathcal F$ to the left and take the inverse:
$$\begin{pmatrix}U_2^T\tilde U_1\\V_2^T\tilde V_1\end{pmatrix} \equiv \begin{pmatrix}Y\\Z\end{pmatrix} = (I-\mathcal F)^{-1}\begin{pmatrix}C_1\\C_2\end{pmatrix} = \sum_{k=0}^{\infty}\mathcal F^k\begin{pmatrix}C_1\\C_2\end{pmatrix}.$$
This gives us a series expression for the quantities $U_2^T\tilde U_1$ and $V_2^T\tilde V_1$, which allows us to derive Theorem 2.2.6 and Theorem 2.3.1 presented before. We summarize this result in the following theorem.

Theorem 2.5.7 (Angular perturbation formula using series expansion). Using the same notation and quantities as in Theorem 2.5.1, we have
$$\begin{pmatrix}U_2^T\tilde U_1\\V_2^T\tilde V_1\end{pmatrix} = \begin{pmatrix}C_1\\C_2\end{pmatrix} + \mathcal F\begin{pmatrix}U_2^T\tilde U_1\\V_2^T\tilde V_1\end{pmatrix}, \qquad \begin{pmatrix}U_1^T\tilde U_2\\V_1^T\tilde V_2\end{pmatrix} = \begin{pmatrix}C_3\\C_4\end{pmatrix} + \mathcal G\begin{pmatrix}U_1^T\tilde U_2\\V_1^T\tilde V_2\end{pmatrix}. \qquad (2.34)$$
In addition, provided that $\|\mathcal F\| < 1$ and $\|\mathcal G\| < 1$, we have
$$\begin{pmatrix}U_2^T\tilde U_1\\V_2^T\tilde V_1\end{pmatrix} = \sum_{k=0}^{\infty}\mathcal F^k\begin{pmatrix}C_1\\C_2\end{pmatrix}, \qquad \begin{pmatrix}U_1^T\tilde U_2\\V_1^T\tilde V_2\end{pmatrix} = \sum_{k=0}^{\infty}\mathcal G^k\begin{pmatrix}C_3\\C_4\end{pmatrix}. \qquad (2.35)$$
Here
$$\mathcal F\begin{pmatrix}C_1\\C_2\end{pmatrix} = \begin{pmatrix}F_U^{21}\circ(\Sigma_2\alpha_{22}^TC_1) + F_U^{21}\circ(\alpha_{22}C_2\tilde\Sigma_1^T)\\ F_V^{21}\circ(\alpha_{22}^TC_1\tilde\Sigma_1) + F_V^{21}\circ(\Sigma_2^T\alpha_{22}C_2)\end{pmatrix}, \qquad \mathcal G\begin{pmatrix}C_3\\C_4\end{pmatrix} = \begin{pmatrix}F_U^{12}\circ(\alpha_{11}C_4\tilde\Sigma_2^T) + F_U^{12}\circ(\Sigma_1\alpha_{11}^TC_3)\\ F_V^{12}\circ(\Sigma_1^T\alpha_{11}C_4) + F_V^{12}\circ(\alpha_{11}^TC_3\tilde\Sigma_2)\end{pmatrix},$$
$$C_1 = F_U^{21}\circ(\Sigma_2\alpha_{12}^TU_1^T\tilde U_1 + \alpha_{21}V_1^T\tilde V_1\tilde\Sigma_1^T), \qquad C_2 = F_V^{21}\circ(\alpha_{12}^TU_1^T\tilde U_1\tilde\Sigma_1 + \Sigma_2^T\alpha_{21}V_1^T\tilde V_1),$$
$$C_3 = F_U^{12}\circ(\alpha_{12}V_2^T\tilde V_2\tilde\Sigma_2^T + \Sigma_1\alpha_{21}^TU_2^T\tilde U_2), \qquad C_4 = F_V^{12}\circ(\Sigma_1^T\alpha_{12}V_2^T\tilde V_2 + \alpha_{21}^TU_2^T\tilde U_2\tilde\Sigma_2),$$
and $\alpha_{ij} := U_i^T\Delta AV_j$.

Remark 2.5.8. Careful readers may observe that, although we removed all the cross terms $U_1^T\tilde U_2$, $U_2^T\tilde U_1$, $V_1^T\tilde V_2$, $V_2^T\tilde V_1$ from the right-hand sides of the expressions (2.35), terms like $U_1^T\tilde U_1$ and $V_1^T\tilde V_1$ still appear on the right-hand side. In fact, these terms are of order $O(1)$, and thus will not degrade the tightness of the upper bounds by any order of magnitude; they only possibly affect the constants.
When $A$ has rank $r$, Theorem 2.5.7 reduces to the following simpler formulae.

Corollary 2.5.9. Using the definitions above, when $A$ has rank $r$ and $\|\Delta A\| < \sigma_r(\tilde A)$,
$$U_2^T\tilde U_1 = \sum_{k=0}^{+\infty}(\alpha_{22}\alpha_{22}^T)^k\big(\alpha_{21}V_1^T\tilde V_1 + \alpha_{22}\alpha_{12}^TU_1^T\tilde U_1\tilde\Sigma_1^{-1}\big)\tilde\Sigma_1^{-(2k+1)},$$
$$V_2^T\tilde V_1 = \sum_{k=0}^{+\infty}(\alpha_{22}^T\alpha_{22})^k\big(\alpha_{12}^TU_1^T\tilde U_1 + \alpha_{22}^T\alpha_{21}V_1^T\tilde V_1\tilde\Sigma_1^{-1}\big)\tilde\Sigma_1^{-(2k+1)}. \qquad (2.36)$$

Remark 2.5.10. When $A$ has rank $r$ and $\alpha_{22}$ is full rank, Corollary 2.5.9 can also be derived from the series expansion for Sylvester-type equations. Denote
$$M = \begin{pmatrix}0 & \alpha_{22}\\ \alpha_{22}^T & 0\end{pmatrix}^{-1}, \qquad X = \begin{pmatrix}U_2^T\tilde U_1\\ V_2^T\tilde V_1\end{pmatrix}, \qquad B = \tilde\Sigma_1^{-1}, \qquad Y = M\begin{pmatrix}\alpha_{21}V_1^T\tilde V_1\\ \alpha_{12}^TU_1^T\tilde U_1\end{pmatrix}\tilde\Sigma_1^{-1}.$$
A direct calculation gives $MX - XB = Y$. By the assumption in Corollary 2.5.9, $\|\alpha_{22}\| \le \|\Delta A\| < \sigma_r(\tilde A)$, so every eigenvalue $\lambda$ of the matrix $M$ satisfies $|\lambda| > \frac{1}{\sigma_r(\tilde A)}$. The classical series expansion for Sylvester-type equations (Theorem VII.2.2 in Bhatia (2013)) then also leads to equation (2.36).

2.5.4 Examples of Using Theorem 2.5.7

Theorem 2.5.7 is used to derive the refined $\ell_{2,\infty}$ bound (Theorem 2.2.6) and the $\sin\Theta$ bound between singular vectors and the singular subspace in which they reside (Theorem 2.3.1), which provide the main intuition behind our PCA and singular value truncation results (Theorem 2.3.2 and Theorem 2.4.1) in Section 2.3 and Section 2.4. Here we only present the proof of Theorem 2.3.1; the proofs of the main results are provided in Section 2.6.
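Before turning to that proof, the series (2.36) can be sanity-checked numerically. The sketch below (our own illustration, written under our reading of the first formula in (2.36), with assumed sizes and spectrum) sums a truncation of the series for a small exactly rank-$r$ example and compares it with $U_2^T\tilde U_1$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, r = 7, 6, 2

# exactly rank-r matrix, small Gaussian perturbation
Q1, _ = np.linalg.qr(rng.standard_normal((n, r)))
Q2, _ = np.linalg.qr(rng.standard_normal((m, r)))
A = Q1 @ np.diag([5.0, 4.0]) @ Q2.T
dA = 1e-2 * rng.standard_normal((n, m))

U, s, Vt = np.linalg.svd(A, full_matrices=True); V = Vt.T
Ut, st, Vtt = np.linalg.svd(A + dA, full_matrices=True); Vtil = Vtt.T
U1, U2, V1, V2 = U[:, :r], U[:, r:], V[:, :r], V[:, r:]
Ut1, Vt1 = Ut[:, :r], Vtil[:, :r]

a12, a21, a22 = U1.T @ dA @ V2, U2.T @ dA @ V1, U2.T @ dA @ V2
S1inv = np.diag(1.0 / st[:r])

# partial sums of the first series in (2.36)
core = a21 @ V1.T @ Vt1 + a22 @ a12.T @ U1.T @ Ut1 @ S1inv
Y, M = np.zeros((n - r, r)), np.eye(n - r)
for k in range(60):
    Y += M @ core @ np.linalg.matrix_power(S1inv, 2 * k + 1)
    M = M @ (a22 @ a22.T)
print(np.max(np.abs(Y - U2.T @ Ut1)))   # should be tiny: the series recovers U_2^T Utilde_1
```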
๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 e 2 Summing up the first inequality multiplied by e ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 โˆ’๐œŽ ๐‘Ÿ+1 โˆฅ๐›ผ22 โˆฅ and the second inequality multiplied by e ๐œŽ ๐‘— โˆฅ๐›ผ22 โˆฅ, after some simplification we get ๐œŽ ๐‘— โˆฅ๐›ผ21 โˆฅ + ๐œŽ๐‘Ÿ+1 โˆฅ๐›ผ12 โˆฅ + โˆฅ๐›ผ22 โˆฅโˆฅ๐›ผ12 โˆฅ e โˆฅ๐‘Œ ๐‘— โˆฅ = โˆฅ๐‘ˆ2๐‘‡ e ๐‘ข๐‘—โˆฅ โ‰ค ๐œŽ 2๐‘— โˆ’ (๐œŽ๐‘Ÿ+1 + โˆฅ๐›ผ22 โˆฅ) 2 e (๐œŽ ๐‘— โˆ’ โˆฅฮ”๐ดโˆฅ)โˆฅฮ”๐ดโˆฅ + ๐œŽ๐‘Ÿ+1 โˆฅฮ”๐ดโˆฅ + โˆฅฮ”๐ดโˆฅ 2 โ‰ค (๐œŽ ๐‘— โˆ’ โˆฅฮ”๐ดโˆฅ) 2 โˆ’ (๐œŽ๐‘Ÿ+1 + โˆฅฮ”๐ดโˆฅ) 2 โˆฅฮ”๐ดโˆฅ โ‰ค ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 โˆ’ 2โˆฅฮ”๐ดโˆฅ 3โˆฅฮ”๐ดโˆฅ โ‰ค , ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 provided that 3โˆฅฮ”๐ดโˆฅ โ‰ค ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 . Here the second inequality is because the upper bound on the right-hand side is decreasing with respect to e ๐œŽ ๐‘— and increasing with respect to โˆฅ๐›ผ22 โˆฅ. Similarly, we also have e๐œŽ ๐‘— โˆฅ๐›ผ12 โˆฅ + ๐œŽ๐‘Ÿ+1 โˆฅ๐›ผ21 โˆฅ + โˆฅ๐›ผ22 โˆฅโˆฅ๐›ผ21 โˆฅ 3โˆฅฮ”๐ดโˆฅ โˆฅ๐‘ ๐‘— โˆฅ = โˆฅ๐‘‰2๐‘‡ e ๐‘ฃ๐‘—โˆฅ โ‰ค โ‰ค . e๐œŽ 2๐‘— โˆ’ (๐œŽ๐‘Ÿ+1 + โˆฅ๐›ผ22 โˆฅ) 2 ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 โ–ก 37 2.6 Proof of the main results In Section 2.6.1, we derive the proof of Theorem 2.5.1 and Lemma 2.5.4. After that, we present the proof of Theorem 2.2.6 in Section 2.6.2. Since the proof of one key lemma (Lemma 2.6.3) is long and involved, we divide it into low-rank case and full-rank case. We prove the low-rank case in Section 2.6.3, and the proof of full-rank case is deferred to appendix. In Section 2.6.4 we provide the proof of Theorem 2.4.1, while the proof of Theorem 2.3.2 can be found in Section 2.6.5. 2.6.1 Proof of Theorem 2.5.1 and Lemma 2.5.4 Proof of Theorem 2.5.1: First, decompose the perturbation ฮ”๐ด in the following two ways: ฮ”๐ด = ๐ด eโˆ’ ๐ด = ๐‘ˆ eeฮฃ๐‘‰e๐‘‡ โˆ’ ๐‘ˆฮฃ๐‘‰ ๐‘‡ = (๐‘ˆ + ฮ”๐‘ˆ)e ฮฃ๐‘‰e๐‘‡ โˆ’ ๐‘ˆฮฃ(๐‘‰ e โˆ’ ฮ”๐‘‰)๐‘‡ (2.37) = ๐‘ˆeฮฃ๐‘‰ e๐‘‡ + (ฮ”๐‘ˆ)e e๐‘‡ โˆ’ ๐‘ˆฮฃ๐‘‰ ฮฃ๐‘‰ e๐‘‡ + ๐‘ˆฮฃ(ฮ”๐‘‰)๐‘‡ = ๐‘ˆ (ฮ”ฮฃ)๐‘‰ e๐‘‡ + (ฮ”๐‘ˆ)e ฮฃ๐‘‰e๐‘‡ + ๐‘ˆฮฃ(ฮ”๐‘‰)๐‘‡ , and ฮ”๐ด = ๐ด eโˆ’ ๐ด = ๐‘ˆ eeฮฃ๐‘‰e๐‘‡ โˆ’ ๐‘ˆฮฃ๐‘‰ ๐‘‡ =๐‘ˆ ฮฃ(๐‘‰ + ฮ”๐‘‰)๐‘‡ โˆ’ (๐‘ˆ ee e โˆ’ ฮ”๐‘ˆ)ฮฃ๐‘‰ ๐‘‡ (2.38) =๐‘ˆ ฮฃ๐‘‰ ๐‘‡ + ๐‘ˆ ee eeฮฃ(ฮ”๐‘‰)๐‘‡ โˆ’ ๐‘ˆฮฃ๐‘‰ e ๐‘‡ + (ฮ”๐‘ˆ)ฮฃ๐‘‰ ๐‘‡ =๐‘ˆ e(ฮ”ฮฃ)๐‘‰ ๐‘‡ + ๐‘ˆ eeฮฃ(ฮ”๐‘‰)๐‘‡ + (ฮ”๐‘ˆ)ฮฃ๐‘‰ ๐‘‡ . Multiplying (2.37) with ๐‘ˆ๐‘‡ on the left and ๐‘‰ e on the right leads to ๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰ e = ฮ”ฮฃ + ๐‘ˆ๐‘‡ (ฮ”๐‘ˆ)e ฮฃ + ฮฃ(ฮ”๐‘‰)๐‘‡ ๐‘‰e. (2.39) Similarly, multiplying (2.38) with ๐‘ˆe๐‘‡ on the left and ๐‘‰ on the right we obtain ๐‘ˆe๐‘‡ (ฮ”๐ด)๐‘‰ = ฮ”ฮฃ + e ฮฃ(ฮ”๐‘‰)๐‘‡ ๐‘‰ + ๐‘ˆ e๐‘‡ (ฮ”๐‘ˆ)ฮฃ. (2.40) Denote ๐‘‘๐‘ƒ = ๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰, e ๐‘‘ ๐‘ƒยฏ = ๐‘ˆe๐‘‡ (ฮ”๐ด)๐‘‰, ฮ”ฮฉ๐‘ˆ = ๐‘ˆ๐‘‡ (ฮ”๐‘ˆ), ฮ”ฮฉ๐‘‰ = ๐‘‰ ๐‘‡ (ฮ”๐‘‰). Notice that ๐ผ = e๐‘‡ ๐‘ˆ ๐‘ˆ e = ๐‘ˆ๐‘‡ ๐‘ˆ gives (๐‘ˆ + ฮ”๐‘ˆ)๐‘‡ ๐‘ˆ e = ๐‘ˆ ๐‘‡ (๐‘ˆ e โˆ’ ฮ”๐‘ˆ), hence ๐‘ˆ๐‘‡ ฮ”๐‘ˆ = โˆ’ฮ”๐‘ˆ๐‘‡ ๐‘ˆ. e Similarly, we also have 38 ๐‘‰ ๐‘‡ ฮ”๐‘‰ = โˆ’ฮ”๐‘‰ ๐‘‡ ๐‘‰. e Plugging these into (2.39) and (2.40), we have ๏ฃฑ ๐‘‡ ๏ฃฒ ๐‘‘๐‘ƒ = ๐‘ˆ ฮ”๐ด๐‘‰ ๏ฃด ๏ฃด ๏ฃด e = ฮ”ฮฃ + ฮ”ฮฉ๐‘ˆ e ฮฃ โˆ’ ฮฃฮ”ฮฉ๐‘‰ , (2.41) ๏ฃด ๐‘‘ ๐‘ƒยฏ = ๐‘ˆe๐‘‡ ฮ”๐ด๐‘‰ = ฮ”ฮฃ + e ๐‘‡ โˆ’ ฮ”ฮฉ๐‘‡ ฮฃ. 
2.6 Proof of the main results

In Section 2.6.1, we present the proofs of Theorem 2.5.1 and Lemma 2.5.4. After that, we present the proof of Theorem 2.2.6 in Section 2.6.2. Since the proof of one key lemma (Lemma 2.6.3) is long and involved, we divide it into the low-rank case and the full-rank case: we prove the low-rank case in Section 2.6.3 and defer the proof of the full-rank case to the appendix. In Section 2.6.4 we provide the proof of Theorem 2.4.1, while the proof of Theorem 2.3.2 can be found in Section 2.6.5.

2.6.1 Proof of Theorem 2.5.1 and Lemma 2.5.4

Proof of Theorem 2.5.1: First, decompose the perturbation $\Delta A$ in the following two ways:
$$\Delta A = \tilde A - A = \tilde U\tilde\Sigma\tilde V^T - U\Sigma V^T = (U+\Delta U)\tilde\Sigma\tilde V^T - U\Sigma(\tilde V-\Delta V)^T = U(\Delta\Sigma)\tilde V^T + (\Delta U)\tilde\Sigma\tilde V^T + U\Sigma(\Delta V)^T, \qquad (2.37)$$
and
$$\Delta A = \tilde A - A = \tilde U\tilde\Sigma(V+\Delta V)^T - (\tilde U-\Delta U)\Sigma V^T = \tilde U(\Delta\Sigma)V^T + \tilde U\tilde\Sigma(\Delta V)^T + (\Delta U)\Sigma V^T. \qquad (2.38)$$
Multiplying (2.37) with $U^T$ on the left and $\tilde V$ on the right leads to
$$U^T(\Delta A)\tilde V = \Delta\Sigma + U^T(\Delta U)\tilde\Sigma + \Sigma(\Delta V)^T\tilde V. \qquad (2.39)$$
Similarly, multiplying (2.38) with $\tilde U^T$ on the left and $V$ on the right, we obtain
$$\tilde U^T(\Delta A)V = \Delta\Sigma + \tilde\Sigma(\Delta V)^TV + \tilde U^T(\Delta U)\Sigma. \qquad (2.40)$$
Denote $dP = U^T(\Delta A)\tilde V$, $d\bar P = \tilde U^T(\Delta A)V$, $\Delta\Omega_U = U^T(\Delta U)$, $\Delta\Omega_V = V^T(\Delta V)$. Notice that $I = \tilde U^T\tilde U = U^TU$ gives $(U+\Delta U)^T\tilde U = U^T(\tilde U-\Delta U)$, hence $U^T\Delta U = -\Delta U^T\tilde U$. Similarly, we also have $V^T\Delta V = -\Delta V^T\tilde V$. Plugging these into (2.39) and (2.40), we have
$$\begin{cases} dP = U^T\Delta A\tilde V = \Delta\Sigma + \Delta\Omega_U\tilde\Sigma - \Sigma\Delta\Omega_V,\\ d\bar P = \tilde U^T\Delta AV = \Delta\Sigma + \tilde\Sigma\Delta\Omega_V^T - \Delta\Omega_U^T\Sigma.\end{cases} \qquad (2.41)$$
Next, from (2.41) we can cancel $\Delta\Omega_V$ via
$$G_U := dP\tilde\Sigma^T + \Sigma d\bar P^T = \Delta\Sigma\tilde\Sigma^T + \Sigma(\Delta\Sigma)^T + \Delta\Omega_U\tilde\Sigma\tilde\Sigma^T - \Sigma\Sigma^T\Delta\Omega_U = \tilde\Sigma\tilde\Sigma^T - \Sigma\Sigma^T + \Delta\Omega_U\tilde\Sigma\tilde\Sigma^T - \Sigma\Sigma^T\Delta\Omega_U.$$
Let $\Delta\Omega_U = \{w_{ij}\}_{i,j=1}^n$; then for all $1\le i,j\le n$ the following equations hold:
$$(G_U)_{ij} = \begin{cases}(\tilde\sigma_j^2-\sigma_i^2)w_{ij}, & i\ne j,\\ (\tilde\sigma_j^2-\sigma_j^2)(w_{jj}+1), & i = j.\end{cases} \qquad (2.42)$$
Here if $i > \min\{n,m\}$, we define $\sigma_i$ and $\tilde\sigma_i$ to be 0. Also, define $F_U^{12}, F_U^{21}, F_V^{12}, F_V^{21}$ as in the statement of Theorem 2.5.1. By assumption, $\tilde\sigma_r - \sigma_{r+1} > 0$ and $\sigma_r - \tilde\sigma_{r+1} > 0$, so one can directly check that the denominators in these four matrices have only nonzero entries and are thus well defined. Consider the upper-right part of $\Delta\Omega_U = U^T(\Delta U)$, that is, $1\le i\le r$, $r+1\le j\le n$; from (2.42) we have
$$w_{ij} = \frac{1}{\tilde\sigma_j^2-\sigma_i^2}(G_U)_{ij}, \qquad 1\le i\le r,\ r+1\le j\le n.$$
Therefore,
$$U_1^T\tilde U_2 = U_1^T(\Delta U_2) = F_U^{12}\circ(G_U^{12}) = F_U^{12}\circ(U_1^T(\Delta A)\tilde V_2\tilde\Sigma_2^T + \Sigma_1V_1^T(\Delta A)^T\tilde U_2).$$
Following the same reasoning, we also obtain
$$U_2^T\tilde U_1 = U_2^T(\Delta U_1) = F_U^{21}\circ(U_2^T(\Delta A)\tilde V_1\tilde\Sigma_1^T + \Sigma_2V_2^T(\Delta A)^T\tilde U_1),$$
$$V_1^T\tilde V_2 = V_1^T(\Delta V_2) = F_V^{12}\circ(\Sigma_1^TU_1^T(\Delta A)\tilde V_2 + V_1^T(\Delta A)^T\tilde U_2\tilde\Sigma_2),$$
$$V_2^T\tilde V_1 = V_2^T(\Delta V_1) = F_V^{21}\circ(\Sigma_2^TU_2^T(\Delta A)\tilde V_1 + V_2^T(\Delta A)^T\tilde U_1\tilde\Sigma_1). \qquad \Box$$

Proof of Lemma 2.5.4: Here we only prove the first inequality in (2.19), i.e., $|||F_U^{21}\circ(H_1\tilde\Sigma_1)||| \le \frac{\tilde\sigma_r}{\tilde\sigma_r^2-\sigma_{r+1}^2}|||H_1|||$; the other three inequalities can be proved similarly. Recall that the definition of $F_U^{21}$
is $(F_U^{21})_{i-r,j} = \frac{1}{\tilde\sigma_j^2-\sigma_i^2}$ for $r+1\le i\le n$, $1\le j\le r$. We directly have
$$F_U^{21}\circ(H_1\tilde\Sigma_1) = \bar F_U^{21}\circ H_1, \qquad (\bar F_U^{21})_{i-r,j} = \frac{\tilde\sigma_j}{\tilde\sigma_j^2-\sigma_i^2}, \quad r+1\le i\le n,\ 1\le j\le r.$$
Let $B_1 = F_U^{21}\circ(H_1\tilde\Sigma_1)$; then $H_1 = \check F_U^{21}\circ B_1$, where
$$(\check F_U^{21})_{i-r,j} = \frac{\tilde\sigma_j^2-\sigma_i^2}{\tilde\sigma_j} = \tilde\sigma_j - \frac{\sigma_i^2}{\tilde\sigma_j}, \quad r+1\le i\le n,\ 1\le j\le r.$$
Inserting this expression of $\check F_U^{21}$ into $H_1 = \check F_U^{21}\circ B_1$, we have
$$H_1 = B_1\,\mathrm{diag}(\tilde\sigma_1,\ldots,\tilde\sigma_r) - \mathrm{diag}(\sigma_{r+1}^2,\ldots,\sigma_n^2)\,B_1\,\mathrm{diag}\Big(\frac{1}{\tilde\sigma_1},\ldots,\frac{1}{\tilde\sigma_r}\Big).$$
Taking norms on both sides, we obtain
$$|||H_1||| \ge \tilde\sigma_r|||B_1||| - \frac{\sigma_{r+1}^2}{\tilde\sigma_r}|||B_1||| = \frac{\tilde\sigma_r^2-\sigma_{r+1}^2}{\tilde\sigma_r}|||B_1|||,$$
which further gives
$$|||B_1||| \le \frac{\tilde\sigma_r}{\tilde\sigma_r^2-\sigma_{r+1}^2}|||H_1|||. \qquad \Box$$

Remark 2.6.1. When $\tilde\sigma_r > \sigma_{r+1}$ and $\sigma_r > \tilde\sigma_{r+1}$, the bounds in Lemma 2.5.4 are tight. That is, in this case there exist $H_i$, $1\le i\le 4$, such that the equalities in (2.19) and (2.20) hold. Specifically, let $H_1\in\mathbb R^{(n-r)\times r}$ and $H_2\in\mathbb R^{(m-r)\times r}$ be the matrices whose entries are all zero except for the $(1,r)$ entry, which equals $\epsilon$, and let $H_3\in\mathbb R^{r\times(m-r)}$ and $H_4\in\mathbb R^{r\times(n-r)}$ be the matrices whose entries are all zero except for the $(r,1)$ entry, which equals $\epsilon$; then one can directly check that the equalities in (2.19) and (2.20) hold.

2.6.2 Proof of Theorem 2.2.6

To prove Theorem 2.2.6, we need to decompose $\tilde U_1 - U_1Q$ into a sum of several components and bound them separately. For convenience, we put the decomposition in the following proposition, which is similar in nature to Theorem 3.1 in Cape et al. (2019b).

Proposition 2.6.2. Set the rotation $Q$ to be $Q = Q_1Q_2^T$, where $Q_1$ and $Q_2$ are the left and right singular vectors from the SVD $U_1^T\tilde U_1 = Q_1SQ_2^T$; then
$$\tilde U_1 - U_1Q = U_2U_2^T\Delta AV_1V_1^T\tilde V_1\tilde\Sigma_1^{-1} + U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1} + U_2\Sigma_2V_2^T\tilde V_1\tilde\Sigma_1^{-1} + U_1Q_1(S-I)Q_2^T, \qquad (2.43)$$
and
$$\|S-I\| \le \|\sin\Theta(U_1,\tilde U_1)\|^2. \qquad (2.44)$$
Proof: By direct calculation, we have
$$\tilde U_1 - U_1Q = \tilde U_1 - U_1Q_1Q_2^T = \tilde U_1 - U_1Q_1SQ_2^T + U_1Q_1(S-I)Q_2^T = \tilde U_1 - U_1U_1^T\tilde U_1 + U_1Q_1(S-I)Q_2^T = U_2U_2^T\tilde U_1 + U_1Q_1(S-I)Q_2^T$$
$$= U_2U_2^T\Delta A\tilde V_1\tilde\Sigma_1^{-1} + U_2\Sigma_2V_2^T\tilde V_1\tilde\Sigma_1^{-1} + U_1Q_1(S-I)Q_2^T \qquad (2.45)$$
$$= U_2U_2^T\Delta AV_1V_1^T\tilde V_1\tilde\Sigma_1^{-1} + U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1} + U_2\Sigma_2V_2^T\tilde V_1\tilde\Sigma_1^{-1} + U_1Q_1(S-I)Q_2^T.$$
In addition, since $\|S\| = \|U_1^T\tilde U_1\| \le 1$,
$$\|S-I\| = 1 - \min_i S_i \le 1 - \min_i S_i^2 = \|\sin\Theta(U_1,\tilde U_1)\|^2,$$
where $S_i$ is the $i$th diagonal entry of $S$. Hence
$$\|U_1Q_1(S-I)Q_2^T\|_{2,\infty} \le \|U_1\|_{2,\infty}\|S-I\| \le \|U_1\|_{2,\infty}\|\sin\Theta(U_1,\tilde U_1)\|^2. \qquad \Box$$

The first and last terms in the expansion (2.43) are easy to bound; the following lemma is devoted to bounding the middle terms, which requires invoking the angular perturbation formula of Theorem 2.5.7.

Lemma 2.6.3. Under the assumptions of Theorem 2.2.6, it holds that
$$\max\{\|U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty},\ \|U_2\Sigma_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty}\} \le C\frac{\sigma R(r,n)}{\sigma_r(A)-\sigma_{r+1}(A)},$$
where $C$ is some constant and
$$R(r,n) = \begin{cases}\sqrt r+\sqrt{\log n}, & \text{if } A \text{ is of rank } r;\\ r+\sqrt{r\log n}, & \text{else}.\end{cases}$$
Before proving this lemma, let us first see how to use it to prove Theorem 2.2.6.
Proof of Theorem 2.2.6: Due to (2.43), we have
$$\min_{Q\in\mathbb O_r}\|\tilde U_1 - U_1Q\|_{2,\infty} \le \underbrace{\|U_2U_2^T\Delta AV_1V_1^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty}}_{\text{(I)}} + \|U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty} + \|U_2\Sigma_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty} + \underbrace{\|U_1\|_{2,\infty}\|\sin\Theta(U_1,\tilde U_1)\|^2}_{\text{(II)}}.$$
The two middle terms are bounded in Lemma 2.6.3. We are left to bound the first and last terms. For the last term, we have
$$\text{(II)} \le \Big(\frac{2\|\Delta A\|}{\sigma_r(A)-\sigma_{r+1}(A)}\Big)^2\|U_1\|_{2,\infty} \le \frac{36\sigma^2\bar n}{(\sigma_r(A)-\sigma_{r+1}(A))^2}\|U_1\|_{2,\infty}.$$
Here the first inequality used (2.50) in Lemma 2.6.7; the second used Corollary 7.3.3 of Vershynin (2018), which bounds the spectral norm of i.i.d. Gaussian matrices: with probability at least $1-e^{-c\bar n}$ for some absolute constant $c$, $\|\Delta A\| \le 3\sigma\sqrt{\bar n}$.
Next we bound (I):
$$\text{(I)} = \|U_2U_2^T\Delta AV_1V_1^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty} \le \|U_2U_2^T\Delta AV_1\|_{2,\infty}\|V_1^T\tilde V_1\tilde\Sigma_1^{-1}\| \le \frac{1}{\sigma_r(\tilde A)}\|U_2U_2^T\Delta AV_1\|_{2,\infty} \le \frac{7}{6\sigma_r(A)}\|U_2U_2^T\Delta AV_1\|_{2,\infty}. \qquad (2.46)$$
Here the last inequality is by Weyl's bound and the assumption $\sigma_r(A) > 21\sigma\sqrt{\bar n}$. (2.46) implies that it suffices to bound the row norms of $U_2U_2^T\Delta AV_1$. Since $\Delta A$ is i.i.d. $N(0,\sigma^2)$, $U_2$ and $V_1$ are independent of $\Delta A$, and $\|U_2\| = \|V_1\| = 1$, each row of $U_2U_2^T\Delta AV_1$ is a Gaussian vector having independent Gaussian entries with mean 0 and variance at most $\sigma^2$. By exactly the same proof as Theorem 3.1.1 in Vershynin (2018), there exists a constant $c$ such that for all $t>0$,
$$\mathbb P\big(\|u_i^TU_2^T\Delta AV_1\| - \sigma\|u_i^TU_2^T\|\sqrt r \ge t\big) < 2e^{-\frac{ct^2}{\sigma^2\|u_i^TU_2^T\|^2}},$$
where $u_i^T$ is the $i$th row of $U_2$. Setting $t = \sigma\|u_i^TU_2^T\|\sqrt{3\log n/c}$ in the above, with probability at least $1-\frac{2}{n^3}$,
$$\|u_i^TU_2^T\Delta AV_1\| \le c_1\sigma(\sqrt r+\sqrt{\log n})\|u_i^TU_2^T\| \le c_1\sigma(\sqrt r+\sqrt{\log n}),$$
with some constant $c_1$. By the union bound, the probability of failure over all the rows is at most $\frac{2}{n^2}$. Hence with probability at least $1-\frac{2}{n^2}$, it holds that
$$\|U_2U_2^T\Delta AV_1\|_{2,\infty} \le c_1\sigma(\sqrt r+\sqrt{\log n}).$$
Plugging this into (2.46), we obtain
$$\text{(I)} \le \frac{c_1\sigma(\sqrt r+\sqrt{\log n})}{\sigma_r(A)}.$$
Combining the bounds on (I), (II) and Lemma 2.6.3 completes the proof. □

2.6.3 Proof of Lemma 2.6.3

Here we first provide the proof for the low-rank case to give the reader some intuition. The full-rank case follows a similar idea but is notationally heavy; we defer the proof of Lemma 2.6.3 for the full-rank case to the appendix.

Proof of Lemma 2.6.3 — the low-rank case: When $A$ is of rank $r$, the second quantity to be bounded in Lemma 2.6.3 is 0, so we focus on the first quantity $\|U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty}$. Let $u_i^T$ be the $i$th row of $U_2$; then by Corollary 2.5.9, the $i$th row of $U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}$ can be expressed as
โˆ‘๏ธ = ๐‘ข๐‘‡๐‘– ๐›ผ22 (๐›ผ๐‘‡22 ๐›ผ22 ) ๐‘˜ (๐›ผ๐‘‡12๐‘ˆ1๐‘‡ ๐‘ˆ e1e ฮฃ1โˆ’1 + ๐›ผ๐‘‡22 ๐›ผ21๐‘‰1๐‘‡ ๐‘‰e1eฮฃ1โˆ’2 )(eฮฃ1โˆ’2 ) ๐‘˜ e ฮฃ1โˆ’1 (2.47) ๐‘˜=0 โˆž ! โˆ‘๏ธ = ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ (๐›ผ22 ๐›ผ๐‘‡12๐‘ˆ1๐‘‡ ๐‘ˆ e1e ฮฃ1โˆ’1 + ๐›ผ22 ๐›ผ๐‘‡22 ๐›ผ21๐‘‰1๐‘‡ ๐‘‰ e1eฮฃ1โˆ’2 )(e ฮฃ1โˆ’2 ) ๐‘˜ e ฮฃ1โˆ’1 , ๐‘˜=0 where ๐›ผ๐‘– ๐‘— = ๐‘ˆ๐‘–๐‘‡ ฮ”๐ด๐‘‰ ๐‘— . Due to the orthogonality of ๐‘ˆ and ๐‘‰, the entries in each ๐›ผ๐‘– ๐‘— follow i.i.d. ๐‘ (0, ๐œŽ 2 ) distribution, and ๐›ผ22 is independent of ๐›ผ12 . This further implies that ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 and ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 ๐›ผ๐‘‡22 are independent of ๐›ผ12 and ๐›ผ21 , respectively. Conditional on ๐›ผ22 , ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 ๐›ผ๐‘‡12 varies with ๐›ผ12 , and it follows normal distribution. Again by Theorem 3.1.1 in Vershynin (2018), for fixed ๐‘˜ = 0, ..., there exists a constant ๐‘ such that ! โˆš ๐‘๐‘ก 2 P( โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 ๐›ผ๐‘‡12 โˆฅ โˆ’ ๐œŽ ๐‘Ÿ โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 โˆฅ > ๐‘ก) โ‰ค 2 exp โˆ’ . ๐œŽ 2 โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 โˆฅ 2 โˆš๏ธ Setting in the above ๐‘ก = ๐œŽโˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 โˆฅ log(๐‘›3 ยท 2 ๐‘˜ )/๐‘, we get with probability at least 1 โˆ’ ๐‘˜2 3 , 2 ๐‘› โˆš โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 ๐›ผ๐‘‡12 โˆฅ โ‰ค ๐œŽ ๐‘Ÿ โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 โˆฅ + ๐‘ก โˆš โˆš๏ธ โˆš โ‰ค ๐‘ 2 ๐œŽโˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 โˆฅ( ๐‘Ÿ + log ๐‘› + ๐‘˜), (2.48) where ๐‘ 2 is some absolute constant. Then ๐œŽ โˆฅ๐›ผ22 โˆฅ 2๐‘˜+1 โˆš โˆš๏ธ โˆš   โˆ’(2๐‘˜+2) โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 ๐›ผ๐‘‡12๐‘ˆ1๐‘‡ ๐‘ˆ e1eฮฃ1 โˆฅ โ‰ค ๐‘2 ( ๐‘Ÿ + log ๐‘› + ๐‘˜). ๐œŽ๐‘Ÿ e e๐œŽ๐‘Ÿ โˆฅ๐›ผ22 โˆฅ โˆš Let ๐œ† = e ๐œŽ๐‘Ÿ . We next argue that ๐œ† < 1/2. By Corollary 7.3.3 of Vershynin (2018), โˆฅฮ”๐ดโˆฅ โ‰ค 3๐œŽ ๐‘›ยฏ with probability at least 1 โˆ’ ๐‘’ โˆ’๐‘๐‘›ยฏ . On this event, by Weylโ€™s bound, โˆš โˆš ๐œŽ๐‘Ÿ โ‰ฅ ๐œŽ๐‘Ÿ โˆ’ โˆฅฮ”๐ดโˆฅ โ‰ฅ ๐œŽ๐‘Ÿ โˆ’ 3๐œŽ ๐‘›ยฏ โ‰ฅ 18๐œŽ ๐‘›ยฏ โ‰ฅ 6โˆฅฮ”๐ดโˆฅ โ‰ฅ 6โˆฅ๐›ผ22 โˆฅ, e 44 โˆš which implies ๐œ† < 1/2. The third inequality above is due to the assumption 21๐œŽ ๐‘›ยฏ < ๐œŽ๐‘Ÿ . By union bound on the probability of failure of (2.48) over all ๐‘˜ = 0, ..., we have with probability at least 1 โˆ’ 43 , ๐‘› โˆž โˆž โˆž ! โˆ‘๏ธ โˆ’(2๐‘˜+2) ๐œŽ โˆ‘๏ธ โˆš 2๐‘˜+1 โˆš๏ธ โˆš โˆ‘๏ธ โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 ๐›ผ๐‘‡12๐‘ˆ1๐‘‡ ๐‘ˆ e1e ฮฃ1 โˆฅ2 โ‰ค ๐‘3 ๐‘˜๐œ† + ( log ๐‘› + ๐‘Ÿ) ๐œ†2๐‘˜+1 ๐œŽ๐‘Ÿ e ๐‘˜=0 ๐‘˜=0 ๐‘˜=0 ๐œŽ โˆš โˆš๏ธ โ‰ค ๐‘ 4 ( ๐‘Ÿ + log ๐‘›) ๐œŽ๐‘Ÿ e โˆš โˆš๏ธ ๐‘Ÿ + log ๐‘› โ‰ค ๐‘5 ๐œŽ , ๐œŽ๐‘Ÿ ( ๐ด) with ๐‘ 3 โˆ’ ๐‘ 5 being absolute constants, where the last inequality used Weylโ€™s bound and the as- โˆš sumption ๐œŽ๐‘Ÿ ( ๐ด) > 21๐œŽ ๐‘›, ยฏ and the second inequality used the fact that for any 0 < ๐œ† < 1/2, we have โˆž โˆš โˆž โˆš โˆž   โˆ‘๏ธ โˆ‘๏ธ โˆ‘๏ธ ๐‘‘ 1 2๐œ† ๐‘˜๐œ†2๐‘˜+1 โ‰ค ๐‘˜๐œ† ๐‘˜ โ‰ค (๐‘˜ + 1)๐œ† ๐‘˜ = โˆ’1 โ‰ค < 4. 
$$\sum_{k=0}^{\infty}\sqrt k\,\lambda^{2k+1} \le \sum_{k=0}^{\infty}\sqrt k\,\lambda^k \le \sum_{k=1}^{\infty}(k+1)\lambda^k = \frac{d}{d\lambda}\Big(\frac{1}{1-\lambda}\Big) - 1 \le \frac{2\lambda}{(1-\lambda)^2} < 4. \qquad (2.49)$$
Following the same reasoning, with probability at least $1-\frac{4}{n^3}$,
$$\sum_{k=0}^{\infty}\|u_i^T(\alpha_{22}\alpha_{22}^T)^{k+1}\alpha_{21}V_1^T\tilde V_1\tilde\Sigma_1^{-(2k+3)}\|_2 \le c_6\sigma\frac{\sqrt r+\sqrt{\log n}}{\sigma_r(A)},$$
for some constant $c_6$. Using these in (2.47), by the union bound, we obtain that with probability at least $1-\frac{8}{n^2}$,
$$\|U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty} \le c_7\sigma\frac{\sqrt r+\sqrt{\log n}}{\sigma_r(A)},$$
where $c_7$ is some constant. □

2.6.4 Proof of Theorem 2.4.1

Although Theorem 2.4.1 is motivated by, and could be proved with, Theorem 2.3.1, we provide an alternative proof that is more straightforward. For this purpose, we will need the following lemmas.

Lemma 2.6.4. Under the same assumptions as Theorem 2.4.1, we have
$$A_r - \tilde A_r = U\begin{pmatrix}-U_1^T\Delta A\tilde V_1 & -U_1^T\Delta A\tilde V_2\\ -U_2^TAV_2V_2^T\tilde V_1 & 0\end{pmatrix}\tilde V^T + U\begin{pmatrix}0 & U_1^T\tilde U_2\tilde U_2^T\tilde A\tilde V_2\\ -U_2^T\Delta A\tilde V_1 & 0\end{pmatrix}\tilde V^T.$$
Proof: The lemma can be verified straightforwardly using the relations $A_r = U_1\Sigma_1V_1^T$ and $\tilde A_r = \tilde U_1\tilde\Sigma_1\tilde V_1^T$. □

Lemma 2.6.5 (Lemma 2 in Luo et al. (2021)). Suppose $x_1\ge x_2\ge\cdots\ge x_k\ge 0$ and $y_1\ge y_2\ge\cdots\ge y_k\ge 0$, and for any $1\le j\le k$, $\sum_{i=1}^jx_i \le \sum_{i=1}^jy_i$. Then for any $p\ge 1$,
$$\sum_{i=1}^kx_i^p \le \sum_{i=1}^ky_i^p.$$
The equality holds if and only if $(x_1,x_2,\ldots,x_k) = (y_1,y_2,\ldots,y_k)$.

Lemma 2.6.6 (Theorem 1 in Thompson (1975)). Assume $A$, $B$, $C = A + B$ are (not necessarily square) matrices of the same size, with singular values $\alpha_1\ge\alpha_2\ge\cdots$, $\beta_1\ge\beta_2\ge\cdots$, $\gamma_1\ge\gamma_2\ge\cdots$, respectively. Let $i_1<i_2<\cdots<i_m$ and $j_1<j_2<\cdots<j_m$ be positive integers, and set $k_t = i_t+j_t-t$, $t = 1,2,\ldots,m$. Then the singular values of $A$, $B$, $C$ satisfy
$$\sum_{t=1}^m\gamma_{k_t} \le \sum_{t=1}^m\alpha_{i_t} + \sum_{t=1}^m\beta_{j_t}.$$

Lemma 2.6.7. We have the following uniform error bound on the $\sin\Theta$ distance:
$$\max\{\|\sin\Theta(U_1,\tilde U_1)\|,\|\sin\Theta(V_1,\tilde V_1)\|\} \le \min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\}. \qquad (2.50)$$
Proof of Lemma 2.6.7: If $\sigma_r = \sigma_{r+1}$, (2.50) holds trivially; here we consider the case $\sigma_r > \sigma_{r+1}$, and consider the two possibilities $\sigma_r-\sigma_{r+1} > 2\|\Delta A\|$ and $\sigma_r-\sigma_{r+1} \le 2\|\Delta A\|$. When $\sigma_r-\sigma_{r+1} > 2\|\Delta A\|$, this and Weyl's bound $|\tilde\sigma_r-\sigma_r| \le \|\Delta A\|$ together give
$$\tilde\sigma_r - \sigma_{r+1} > \sigma_r - \sigma_{r+1} - \|\Delta A\| > \frac12(\sigma_r-\sigma_{r+1}) > 0,$$
which ensures that the assumption of Theorem 2.5.5 holds, and then (2.23) in Theorem 2.5.5 implies
$$\|\sin\Theta(U_1,\tilde U_1)\| = \|U_2^T\tilde U_1\| \le \frac{\|\Delta A\|}{\sigma_r-\|\Delta A\|-\sigma_{r+1}} \le \frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}}.$$
When $\sigma_r-\sigma_{r+1} \le 2\|\Delta A\|$, we directly have
$$\|U_2^T\tilde U_1\| \le 1 \le \frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}}.$$
Putting the two cases together, we have
$$\|\sin\Theta(U_1,\tilde U_1)\| \le \min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\}.$$
โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 Following the same reasoning, we also have โˆฅ sin ฮ˜(๐‘‰1 , ๐‘‰ e1 )โˆฅ โ‰ค min{ 2โˆฅฮ”๐ดโˆฅ , 1}, thus (2.50) holds. ๐œŽ๐‘Ÿ โˆ’๐œŽ ๐‘Ÿ+1 โ–ก Proof of Theorem 2.4.1: By Lemma 2.6.4, ยฉ โˆ’๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰ e1 โˆ’๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰ e2 ยช ยฉ 0 ๐‘ˆ1๐‘‡ ๐‘ˆ e๐‘‡ ๐ด e2๐‘ˆ 2 e๐‘‰e2 ยช โˆฅ ๐ด๐‘Ÿ โˆ’ ๐ด e๐‘Ÿ โˆฅ โ‰ค ยญ ยญ ๐‘‡ ยฎ + ยญ ยฎ ยญ ๐‘‡ ยฎ ยฎ ๐‘‡ โˆ’๐‘ˆ2 ๐ด๐‘‰2๐‘‰2 ๐‘‰1 e 0 โˆ’๐‘ˆ2 ฮ”๐ด๐‘‰1 e 0 ยซ ยฌ ยซ ยฌ โˆš๏ธƒ โ‰ค โˆฅ๐‘ˆ1๐‘‡ ฮ”๐ดโˆฅ 2 + โˆฅ๐‘ˆ2๐‘‡ ๐ด๐‘‰2๐‘‰2๐‘‡ ๐‘‰ e1 โˆฅ 2 + max{โˆฅ๐‘ˆ๐‘‡ ฮ”๐ด๐‘‰ 2 e1 โˆฅ, โˆฅ๐‘ˆ๐‘‡ ๐‘ˆe e๐‘‡ ee 1 2๐‘ˆ2 ๐ด๐‘‰2 โˆฅ} โˆš๏ธƒ = โˆฅ๐‘ˆ1๐‘‡ ฮ”๐ดโˆฅ 2 + ๐œ 2 + max{โˆฅ๐‘ˆ2๐‘‡ ฮ”๐ด๐‘‰ e1 โˆฅ, ๐œˆ}, (2.51) where we have let ๐œˆ = โˆฅ๐‘ˆ1๐‘‡ ๐‘ˆ e2๐‘ˆe๐‘‡ ๐ด 2 e๐‘‰e2 โˆฅ and ๐œ = โˆฅ๐‘ˆ๐‘‡ ๐ด๐‘‰2๐‘‰ ๐‘‡ ๐‘‰ 2 e โˆฅ, we next bound ๐œ and ๐œˆ. 2 1 ๐œˆ = โˆฅ๐‘ˆ1๐‘‡ ๐‘ˆ e2๐‘ˆe๐‘‡ ๐ด ๐‘‡e e 2 ๐‘‰2 โˆฅ = โˆฅ๐‘ˆ1 ๐‘ˆ2 ฮฃ2 โˆฅ โ‰ค e ee ๐œŽ๐‘Ÿ+1 โˆฅ๐‘ˆ1๐‘‡ ๐‘ˆ e2 โˆฅ. (2.52) Due to the Weylโ€™s bound, we also have |e ๐œŽ๐‘Ÿ+1 โˆ’ ๐œŽ๐‘Ÿ+1 | < โˆฅฮ”๐ดโˆฅ. (2.53) By (2.50),     2โˆฅฮ”๐ดโˆฅ 2โˆฅฮ”๐ดโˆฅ ๐œˆ โ‰ค (๐œŽ๐‘Ÿ+1 + โˆฅฮ”๐ดโˆฅ) min , 1 โ‰ค โˆฅฮ”๐ดโˆฅ + ๐œŽ๐‘Ÿ+1 min ,1 . ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 Similarly, we can also derive   2โˆฅฮ”๐ดโˆฅ ๐œ= โˆฅฮฃ2๐‘‰2๐‘‡ ๐‘‰e1 โˆฅ โ‰ค ๐œŽ๐‘Ÿ+1 min ,1 . ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 47 Inserting the upper bounds of ๐œ and ๐œˆ back to (2.51) completes the proof of (2.15). For Frobenius norm: 2 โˆ’๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰ e1 โˆ’๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰ e2 + ๐‘ˆ๐‘‡ ๐‘ˆ e๐‘ˆ 1 2 2 e๐‘‡ ๐ดe๐‘‰ e2 ยช e๐‘Ÿ โˆฅ 2 = ยฉ โˆฅ ๐ด๐‘Ÿ โˆ’ ๐ด ๐น ยญ ยญ ๐‘‡ ยฎ ยฎ โˆ’๐‘ˆ2 ฮ”๐ด๐‘‰ e1 โˆ’ ๐‘ˆ๐‘‡ ๐ด๐‘‰2๐‘‰ ๐‘‡ ๐‘‰ 0 2 1 e ยซ 2 ยฌ ๐น 2 2 โˆ’๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰ ยฉโˆ’๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰ e2 + ๐‘ˆ๐‘‡ ๐‘ˆe๐‘ˆ e๐‘‡ ๐ดe๐‘‰e2 ยช 1 2 2 ยฉ e1 ยช = ยญ ยญ ยฎ ยฎ + ยญ ยญ ยฎ . ยฎ ๐‘‡ โˆ’๐‘ˆ2 ฮ”๐ด๐‘‰ ๐‘‡ e1 โˆ’ ๐‘ˆ ๐ด๐‘‰2๐‘‰ ๐‘‰ ๐‘‡ e1 0 ยซ 2 2 ยฌ ๐น ยซ ยฌ ๐น | {z } | {z } B๐‘…1 B๐‘…2 Next we bound ๐‘…1 and ๐‘…2 separately. First consider ๐‘…2 , let ๐‘€ฮ›๐‘Š ๐‘‡ be the singular value decom- position of ๐‘ˆ1๐‘‡ ๐‘ˆe2 , where ๐‘€ โˆˆ R๐‘Ÿร—๐‘Ÿ , ฮ› โˆˆ R๐‘Ÿร—๐‘Ÿ , ๐‘Š โˆˆ R (๐‘›โˆ’๐‘Ÿ)ร—๐‘Ÿ . Then ๐‘…2 = โˆฅ โˆ’ ๐‘ˆ1 ฮ”๐ด๐‘‰ e2 + ๐‘ˆ1๐‘ˆ e2eฮฃ2 โˆฅ 2๐น โ‰ค 2โˆฅ๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰e2 โˆฅ 2 + 2โˆฅ๐‘ˆ๐‘‡ ๐‘ˆ ee 2 ๐น 1 2 ฮฃ2 โˆฅ ๐น โ‰ค 2โˆฅ(ฮ”๐ด)๐‘Ÿ โˆฅ 2๐น + 2โˆฅ๐‘ˆ1๐‘‡ ๐‘ˆ e2๐‘Š๐‘Š ๐‘‡ e ฮฃ2 โˆฅ 2๐น    2 2 2โˆฅฮ”๐ดโˆฅ โ‰ค 2โˆฅ(ฮ”๐ด)๐‘Ÿ โˆฅ ๐น + 2 min ,1 โˆฅ๐‘Š ๐‘‡ eฮฃ2 โˆฅ 2๐น ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1    2 โˆ‘๏ธ๐‘Ÿ 2 2โˆฅฮ”๐ดโˆฅ 2 ( ๐ด). โ‰ค 2โˆฅ(ฮ”๐ด)๐‘Ÿ โˆฅ ๐น + 2 min ,1 ๐œŽ๐‘Ÿ+๐‘˜ e ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 ๐‘˜=1 In the second to last inequality, we used the fact that โˆฅ ๐ด๐ตโˆฅ ๐น โ‰ค โˆฅ ๐ดโˆฅโˆฅ๐ตโˆฅ ๐น and in the last inequality, we used โˆฅ๐‘ƒฮฉ ๐ดโˆฅ ๐น โ‰ค โˆฅ ๐ด๐‘Ÿ โˆฅ ๐น for any ๐‘Ÿ-dimensional subspace ฮฉ. By Lemma 2.6.6, โˆ‘๏ธ๐‘˜ โˆ‘๏ธ๐‘˜ โˆ‘๏ธ๐‘˜ โˆ‘๏ธ๐‘˜ โˆ‘๏ธ๐‘˜ ๐œŽ๐‘Ÿ+๐‘– ( ๐ด) e = ๐œŽ๐‘Ÿ+๐‘– ( ๐ด + ฮ”๐ด) โ‰ค ๐œŽ๐‘– (ฮ”๐ด) + ๐œŽ๐‘Ÿ+๐‘– ( ๐ด) = (๐œŽ๐‘– (ฮ”๐ด) + ๐œŽ๐‘Ÿ+๐‘– ( ๐ด)) , 1 โ‰ค ๐‘˜ โ‰ค ๐‘Ÿ. ๐‘–=1 ๐‘–=1 ๐‘–=1 ๐‘–=1 ๐‘–=1 From Lemma 2.6.5, we have โˆ‘๏ธ๐‘Ÿ ๐‘Ÿ โˆ‘๏ธ 2 ( ๐ด) ๐œŽ๐‘Ÿ+๐‘˜ e โ‰ค (๐œŽ๐‘˜ (ฮ”๐ด) + ๐œŽ๐‘Ÿ+๐‘˜ ( ๐ด)) 2 โ‰ค (โˆฅ(ฮ”๐ด)๐‘Ÿ โˆฅ ๐น + โˆฅ(ฮฃ2 )๐‘Ÿ โˆฅ ๐น ) 2 . 
Hence

$$R_2 \le 2\|(\Delta A)_r\|_F^2 + 2\left(\min\left\{\frac{2\|\Delta A\|}{\sigma_r - \sigma_{r+1}},\ 1\right\}\right)^2 \big(\|(\Delta A)_r\|_F + \|(\Sigma_2)_r\|_F\big)^2. \qquad (2.54)$$

Next we consider $R_1$. Notice that

$$R_1 = \left\| \begin{pmatrix} -U_1^T \Delta A \widetilde V_1 \\ -U_2^T \Delta A \widetilde V_1 \end{pmatrix} + \begin{pmatrix} 0 \\ -U_2^T A V_2 V_2^T \widetilde V_1 \end{pmatrix} \right\|_F^2 \le \big(\|(\Delta A)_r\|_F + \|\Sigma_2 V_2^T \widetilde V_1\|_F\big)^2.$$

Following the same reasoning as in bounding $R_2$, we have

$$R_1 \le \left( \|(\Delta A)_r\|_F + \|(\Sigma_2)_r\|_F \min\left\{\frac{2\|\Delta A\|}{\sigma_r - \sigma_{r+1}},\ 1\right\} \right)^2. \qquad (2.55)$$

Combining (2.54) and (2.55), we obtain

$$\|A_r - \widetilde A_r\|_F^2 \le 2\|(\Delta A)_r\|_F^2 + 3\left( \|(\Delta A)_r\|_F + \|(\Sigma_2)_r\|_F \min\left\{\frac{2\|\Delta A\|}{\sigma_r - \sigma_{r+1}},\ 1\right\} \right)^2. \quad \square$$

2.6.5 Proof of Theorem 2.3.2

Proof of Theorem 2.3.2: Let $\widetilde V_1^T V_1 = Q_1 S Q_2^T$ be the SVD of $\widetilde V_1^T V_1$. Define a special rotation $\hat Q = Q_1 Q_2^T$, and bound $|||U_1\Sigma_1 - \widetilde U_1 \widetilde\Sigma_1 \hat Q|||$, where $|||\cdot|||$ can be either the spectral or the Frobenius norm. This yields an upper bound on $\min_{Q \in \mathbb{O}_r} |||U_1\Sigma_1 - \widetilde U_1 \widetilde\Sigma_1 Q|||$. By a direct calculation,

$$|||U_1\Sigma_1 - \widetilde U_1\widetilde\Sigma_1 \hat Q||| \le |||U_1\Sigma_1 - \widetilde U_1\widetilde\Sigma_1\widetilde V_1^T V_1 + \widetilde U_1\widetilde\Sigma_1\widetilde V_1^T V_1 - \widetilde U_1\widetilde\Sigma_1 \hat Q|||$$
$$\le |||U_1\Sigma_1 - \widetilde U_1\widetilde\Sigma_1\widetilde V_1^T V_1||| + |||\widetilde U_1\widetilde\Sigma_1\widetilde V_1^T V_1 - \widetilde U_1\widetilde\Sigma_1 \hat Q|||$$
$$= |||(U_1\Sigma_1 V_1^T - \widetilde U_1\widetilde\Sigma_1\widetilde V_1^T) V_1||| + |||\widetilde U_1\widetilde\Sigma_1 (\widetilde V_1^T V_1 - \hat Q)|||$$
$$\le |||U_1\Sigma_1 V_1^T - \widetilde U_1\widetilde\Sigma_1\widetilde V_1^T||| + |||\widetilde U_1\widetilde\Sigma_1 (\widetilde V_1^T V_1 - \hat Q)|||$$
$$= |||A_r - \widetilde A_r||| + |||\widetilde U_1\widetilde\Sigma_1 (\widetilde V_1^T V_1 - \hat Q)|||. \qquad (2.56)$$

The first term of (2.56) can be bounded by Theorem 2.4.1. Let us focus on the second term. Let $Z = V_2^T \widetilde V_1$. Observe that $Z^T Z + \widetilde V_1^T V_1 (\widetilde V_1^T V_1)^T = I_r$; this implies $\widetilde V_1^T V_1 = \sqrt{I_r - Z^T Z}\, \hat Q$ (Lemma 2.6.8), where the square root of a positive semi-definite matrix $B$ is defined to be the positive semi-definite matrix $\widetilde B$ such that $\widetilde B \widetilde B = B$. Using this observation on the quantity inside the norm of the second term on the right-hand side of (2.56), we have

$$\widetilde U_1\widetilde\Sigma_1 (\widetilde V_1^T V_1 - \hat Q) = \widetilde U_1\widetilde\Sigma_1 \big(\sqrt{I - Z^T Z} - I\big)\hat Q = -\widetilde U_1\widetilde\Sigma_1 Z^T Z \big(\sqrt{I - Z^T Z} + I\big)^{-1} \hat Q,$$

where the last equality used the fact that $(\sqrt{I - Z^T Z} - I)(\sqrt{I - Z^T Z} + I) = -Z^T Z$. Hence

$$|||\widetilde U_1\widetilde\Sigma_1(\widetilde V_1^T V_1 - \hat Q)||| \le |||\widetilde U_1\widetilde\Sigma_1 Z^T Z||| \cdot \big\|\big(\sqrt{I - Z^T Z} + I\big)^{-1}\hat Q\big\| \le |||\widetilde\Sigma_1 Z^T|||. \qquad (2.57)$$

The last inequality used $\|Z\| \le 1$ and $\|(\sqrt{I - Z^T Z} + I)^{-1}\| \le 1$. Notice that

$$(\widetilde\Sigma_1 Z^T)^T = Z\widetilde\Sigma_1 = V_2^T \widetilde V_1 \widetilde\Sigma_1 = V_2^T \widetilde A^T \widetilde U_1 = V_2^T \Delta A^T \widetilde U_1 + V_2^T A^T \widetilde U_1 = V_2^T \Delta A^T \widetilde U_1 + \Sigma_2^T U_2^T \widetilde U_1.$$

Then for the spectral norm of $\widetilde\Sigma_1 Z^T$, we have

$$\|\widetilde\Sigma_1 Z^T\| \le \|\Delta A\| + \sigma_{r+1}\|U_2^T \widetilde U_1\| \le \|\Delta A\| + \sigma_{r+1}\min\left\{\frac{2\|\Delta A\|}{\sigma_r - \sigma_{r+1}},\ 1\right\}.$$
๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 For the Frobenius norm, we have   2โˆฅฮ”๐ดโˆฅ โˆฅe ฮฃ1 ๐‘๐‘‡ โˆฅ ๐น โ‰ค โˆฅ๐‘‰2๐‘‡ ฮ”๐ด๐‘‡ ๐‘ˆe1 โˆฅ ๐น + โˆฅฮฃ2๐‘ˆ๐‘‡ ๐‘ˆ 2 1โˆฅ๐น e โ‰ค โˆฅ(ฮ”๐ด)๐‘Ÿ โˆฅ ๐น + โˆฅ(ฮฃ2 )๐‘Ÿ โˆฅ ๐น min ,1 . ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 Combining this with (2.56) and (2.57) completes the proof. โ–ก Lemma 2.6.8. Let ๐ด โˆˆ R๐‘Ÿร—๐‘Ÿ be a semi-definite matrix with eigenvalues no greater than 1, ๐ต โˆˆ R๐‘Ÿร—๐‘Ÿ has SVD ๐ต = ๐‘ˆ ๐ต ๐‘† ๐ต๐‘‰๐ต๐‘‡ . In addition, ๐ด and ๐ต satisfy ๐ด + ๐ต๐ต๐‘‡ = ๐ผ, then โˆš ๐ต = ๐ผ โˆ’ ๐ด๐‘ˆ ๐ต๐‘‰๐ต๐‘‡ . โˆš โˆš Proof: Since ๐ต๐ต๐‘‡ = ๐‘ˆ ๐ต ๐‘† 2๐ต๐‘ˆ๐‘‡๐ต , then ๐ต๐ต๐‘‡ = ๐‘ˆ ๐ต ๐‘† ๐ต๐‘ˆ๐‘‡๐ต , and therefore ๐ต = ๐ต๐ต๐‘‡ ๐‘ˆ ๐ต๐‘‰๐ต๐‘‡ . By as- sumption, ๐ต๐ต๐‘‡ = ๐ผ โˆ’ ๐ด, then the result of the lemma follows. โ–ก 50 CHAPTER 3 MANIFOLD DENOISING BY NONLINEAR ROBUST PCA 3.1 Introduction 3.1.1 Overview of Chapter 3 This chapter considers the problem of manifold denoising. In the study of statistics and ma- chine learning, there is a common manifold assumption that underlies many popular dimensional reduction algorithms including Isomap (Isometric Feature Mapping), LLE (Local Linear Embed- ding), and PCA (Principal Component Analysis), which states that real-world high-dimensional data lies near a low-dimensional manifold embedded in the high-dimensional space. Therefore, if we consider a local neighborhood of the data points, this local submatrix is approximately low-rank. Based on this observation, in this chapter, we extend Robust Principal Component Analysis (RPCA) to the manifold setting. Suppose that the observed data matrix is composed of a sparse component and a component drawn from some low-dimensional manifold. We propose and analyze an optimization framework that separates the sparse component from the manifold under noisy data. Theoretical error bounds are provided when the tangent spaces of the manifold satisfy certain incoherence conditions. We also provide a near-optimal choice of the tuning parameters for the proposed optimization formulation with the help of a new curvature estimation method. The efficacy of our method is demonstrated on both synthetic and real datasets. Results of this chapter has given rise to the conference proceeding Lyu et al. (2019). 3.1.2 Manifold denoising Manifold learning is nowadays widely used in computer vision, image processing, and biological data analysis on tasks such as classification, anomaly detection, data interpolation, and denoising. Many machine learning and statistical methods are based on the assumption that real-world high- dimensional data actually lies on low-dimensional manifolds embedded in the high-dimensional space, examples can be found in dimensional reduction Roweis and Saul (2000); Balasubramanian and Schwartz (2002); He et al. (2023) and clustering. However, in practice the data we observe almost never lies perfectly on a manifold, usually they contain noise coming from various sources, 51 which can greatly jeopardize the quality of methods that are sensitive to noise, such as graph-based methods. In recent years, several methods have been proposed to tackle this problem. One approach involves detecting and eliminating outliers from the dataset Du et al. (2013); Sathe and Aggarwal (2016). In particular, the authors of Du et al. (2013) proposed to measure the likelihood of each sample being an outlier using a reliability score based on contextual distance. 
In order to utilize both local and global manifold structures, an iterative algorithm was designed to update the reliability score matrix. In Sathe and Aggarwal (2016), the authors proposed an iterative algorithm called LODES, a spectral embedding approach based on local density. When constructing the weight matrix, LODES took into consideration the difference in local density for each pair of data points, and the output of the algorithm is an outlier score for each data point. It is worth noting that these methods focus on detecting and removing outliers, and may suffer from loss of information.

Along another research line, graph-based denoising methods have been proposed in the literature Hein and Maier (2006); Deutsch et al. (2016). In Hein and Maier (2006), a weighted kNN graph was created for the noisy dataset based on the Gaussian kernel, and the noise was modeled as a diffusion process governed by the graph Laplacian of the weighted kNN graph. The denoising algorithm was designed by reversing the diffusion process and was suitable for denoising datasets with Gaussian, or more generally isotropic, high-dimensional noise. Deutsch et al. (2016) also started with a weighted graph based on the Gaussian kernel; a spectral graph wavelet transform was then performed in each feature dimension, and a non-iterative denoising method was realized by removing all the scaling and wavelet coefficients above some threshold. This method was motivated by the observation that the energy of the smooth manifold coordinate signals concentrates on low-frequency spectral wavelets, while the energy of the noise is spread equally across all frequencies.

In this chapter, we consider the manifold denoising problem in the presence of both sparse and Gaussian noise. Specifically, we are concerned with the mixed noise model

$$\tilde X_i = X_i + S_i + E_i, \qquad i = 1, \dots, n, \qquad (3.1)$$

where $X_i \in \mathbb{R}^p$ is the noiseless data independently drawn from some manifold $\mathcal M$ with an intrinsic dimension $d \ll p$, $E_i$ is i.i.d. Gaussian noise with small magnitudes, and $S_i$ is sparse noise with possibly large magnitudes. If $S_i$ has a large entry, then the corresponding $\tilde X_i$ is usually considered an outlier. A desirable denoising algorithm can simultaneously recover $X_i$ and $S_i$ from $\tilde X_i$, $i = 1, \dots, n$. There are several benefits in recovering the noise term $S_i$ along with the signal $X_i$. First, the support of $S_i$ indicates the locations of the anomaly, which is informative in many applications. For example, if $X_i$ is the gene expression data from the $i$-th patient, the nonzero elements in $S_i$ indicate the differentially expressed genes that are candidates for personalized medicine. Similarly, if $S_i$ is the result of malfunctioning hardware, its nonzero elements indicate the locations of the malfunctioning parts. Secondly, the recovery of $S_i$ allows the "outliers" to be pulled back to the data manifold instead of simply being discarded. This prevents waste of information and is especially beneficial in cases where data is insufficient. Thirdly, in some applications the sparse $S_i$ is part of the clean data rather than a noise term; the algorithm then provides a natural decomposition of the data into a sparse and a non-sparse component that may carry different pieces of information.
3.1.3 Robust PCA

The method we are about to propose in Section 3.2 is closely related to Robust Principal Component Analysis (RPCA) Candes et al. (2011) and differential geometry. In Section 3.1.3 and Section 3.1.4, we provide some background on RPCA and geometry, respectively. Readers already familiar with these areas can safely jump to Section 3.2 for the proposed method.

Assume a data matrix has a low-rank component and a sparse component. The authors in Candes et al. (2011) proved that, under the assumptions that the singular vectors of the low-rank matrix are reasonably spread and the support set of the noise matrix is uniformly distributed, with high probability the Principal Component Pursuit (PCP) method can exactly recover both the low-rank component and the sparse component. Formally, assume a large data matrix $M \in \mathbb{R}^{p\times n}$ satisfies

$$M = L_0 + S_0, \qquad (3.2)$$

where $L_0$ admits the singular value decomposition $L_0 = U\Sigma V^* = \sum_{i=1}^{r} \sigma_i u_i v_i^*$, with $r$ being its rank and $\sigma_1, \sigma_2, \dots, \sigma_r$ being its positive singular values. Here $U = [u_1, u_2, \dots, u_r]$ and $V = [v_1, v_2, \dots, v_r]$ are singular vector matrices satisfying the following incoherence conditions with some constant $\mu$:

$$\|U\|_{2,\infty}^2 \le \frac{\mu r}{p}, \qquad \|V\|_{2,\infty}^2 \le \frac{\mu r}{n}, \qquad \|UV^*\|_\infty \le \sqrt{\frac{\mu r}{np}}. \qquad (3.3)$$

Here $\|\cdot\|_{2,\infty}$ is defined as in Definition 1.3.1, and $\|M\|_\infty = \max_{i,j}|M_{ij}|$ is the maximum of the absolute values of the entries of $M$. PCP estimates the low-rank component and the sparse component by solving the optimization problem

$$\min \|L\|_* + \lambda\|S\|_1 \quad \text{subject to} \quad L + S = M.$$

Denote $n_{(1)} = \max\{n, p\}$ and $n_{(2)} = \min\{n, p\}$, and choose $\lambda = 1/\sqrt{n_{(1)}}$; the following theorem has been proved in Candes et al. (2011).

Theorem 3.1.1 (Theorem 1.1 in Candes et al. (2011)). Suppose $L_0$ satisfies the incoherence conditions (3.3), and the support set of $S_0$ is uniformly distributed among all sets of cardinality $m$. Then there exists a constant $c$ such that with probability at least $1 - cn_{(1)}^{-10}$, PCP exactly recovers $L_0$ and $S_0$, provided $\mathrm{rank}(L_0) \le \rho_r n_{(2)}\mu^{-1}(\log n_{(1)})^{-2}$ and $m \le \rho_s np$, where $\rho_r$ and $\rho_s$ are positive constants.

Robust PCA provides an efficient method of handling noisy data with outliers; it has received considerable attention and has demonstrated its success in separating data from sparse noise in many applications. However, its assumption that the data lies in a low-dimensional subspace is somewhat strict. In Section 3.2, we generalize the Robust PCA idea to the non-linear manifold setting. The intuition behind our method is that the small intrinsic dimension assumption ensures the local data matrix is approximately low rank.
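To make the PCP formulation concrete, here is a minimal NumPy sketch solving it with a standard inexact augmented Lagrangian iteration (singular-value soft-thresholding on L, entry-wise soft-thresholding on S). It is an illustrative implementation under common default choices ($\lambda = n_{(1)}^{-1/2}$ and a usual step-size heuristic), not the code used in this dissertation.

```python
import numpy as np

def pcp(M, max_iter=500, tol=1e-7):
    """Principal Component Pursuit: min ||L||_* + lam ||S||_1 s.t. L + S = M,
    via an inexact augmented Lagrangian iteration."""
    p, n = M.shape
    lam = 1.0 / np.sqrt(max(p, n))                  # lam = n_(1)^{-1/2}
    mu = 0.25 * p * n / (np.abs(M).sum() + 1e-12)   # common step-size heuristic
    Y = np.zeros_like(M)                            # dual variable
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # L-step: singular value soft-thresholding at level 1/mu
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # S-step: entry-wise soft-thresholding at level lam/mu
        T = M - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        Y += mu * (M - L - S)                       # dual ascent on L + S = M
        if np.linalg.norm(M - L - S) <= tol * np.linalg.norm(M):
            break
    return L, S

# Sanity check: recovery of a rank-2 matrix corrupted by sparse spikes.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 80))
S0 = np.zeros((100, 80))
S0.flat[rng.choice(8000, 200, replace=False)] = 10 * rng.standard_normal(200)
L, S = pcp(L0 + S0)
print(np.linalg.norm(L - L0) / np.linalg.norm(L0))  # should be near 0
```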
3.1.4 Geometric background

A fundamental assumption in this chapter is that the clean data lies on a low-dimensional manifold embedded in a high-dimensional space. Before presenting the proposed method, let us warm up with some geometric background. We first present a definition of a $d$-dimensional manifold Lee (2012).

Definition 3.1.1 (Manifolds). We say that a topological space $\mathcal M$ is a $d$-dimensional manifold if: 1. $\mathcal M$ is a Hausdorff space; 2. there is a countable basis for the topology of $\mathcal M$; 3. $\mathcal M$ is locally Euclidean of dimension $d$, meaning that each point of $\mathcal M$ has a neighborhood that is homeomorphic to an open subset of $\mathbb{R}^d$. More specifically, for any $p \in \mathcal M$, we can find an open set $U \subset \mathcal M$, an open set $\bar U \subset \mathbb{R}^d$, and a homeomorphism $\varphi: U \to \bar U$; such a pair $(U, \varphi)$ is called a chart.

Intuitively, a $d$-dimensional manifold locally resembles the Euclidean space $\mathbb{R}^d$. In order to define derivatives and distances on manifolds, we further need a smooth structure on the manifold $\mathcal M$. The definition of smoothness on manifolds is based on the calculus of maps on Euclidean spaces.

Definition 3.1.2 (Smooth compatibility). Let $\mathcal M$ be a $d$-dimensional manifold. If $(U, \varphi)$ and $(V, \psi)$ are two charts with $U \cap V \neq \emptyset$, the transition map from $\varphi$ to $\psi$ is defined as $\psi \circ \varphi^{-1}: \varphi(U \cap V) \to \psi(U \cap V)$. We say two charts $(U, \varphi)$ and $(V, \psi)$ are smoothly compatible if either $U \cap V = \emptyset$ or the transition map $\psi \circ \varphi^{-1}$ is a diffeomorphism.

Definition 3.1.3 (Smooth manifold). A smooth manifold is a pair $(\mathcal M, \mathcal A)$, where $\mathcal M$ is a manifold and $\mathcal A$ is a family of charts whose domains cover $\mathcal M$, and any two charts in $\mathcal A$ are smoothly compatible with each other.

Next, we introduce the concepts of tangent space and curvature on smooth manifolds. For illustration purposes, we focus here on their intuition instead of the formal definitions from Riemannian geometry. For a point $p$ on a smooth manifold $\mathcal M$, we denote the tangent space at $p$ by $T_p$. Intuitively, it is the subspace that is "tangent" to $\mathcal M$ at $p$, representing all possible directions of curves passing through this point. In the example of a 2-dimensional manifold in Figure 3.1, $T_p$ is the tangent space at point $p$.

Intuitively, curvature measures how much a surface is bent. The curvature of a circle in the 2D plane is the reciprocal of its radius. More formally, the principal curvatures at a point of a high-dimensional manifold are defined as the singular values of the second fundamental form Kobayashi and Nomizu (1996).

Figure 3.1 Local manifold geometry.

As estimating all the singular values from noisy data may not be stable, we are only interested in estimating the root mean square curvature, that is, the root mean square of the principal curvatures. For simplicity of illustration, we review the related concepts using a 2D surface $\mathcal M$ embedded in $\mathbb{R}^3$ (Figure 3.1). For any curve $\gamma(s)$ in $\mathcal M$ parametrized by arclength with unit tangent vector $t_\gamma(s)$, its curvature is the norm of the covariant derivative of $t_\gamma$: $\|dt_\gamma(s)/ds\| = \|\gamma''(s)\|$. In particular, we have the following decomposition:

$$\gamma''(s) = k_g(s)\,\hat v(s) + k_n(s)\,\hat n(s),$$

where $\hat n(s)$ is the unit normal direction of the manifold at $\gamma(s)$ and $\hat v$ is the direction perpendicular to $\hat n(s)$ and $t_\gamma(s)$, i.e., $\hat v = \hat n \times t_\gamma(s)$. The coefficient $k_n(s)$ along the normal direction is called the normal curvature, and the coefficient $k_g(s)$ along the perpendicular direction $\hat v$ is called the geodesic curvature. The principal curvatures depend purely on $k_n$. In particular, in 2D the principal curvatures are precisely the maximum and minimum of $k_n$ among all possible directions. A natural way to compute the normal curvature is through geodesic curves.
The geodesic curve between two points is the shortest curve connecting them; therefore, geodesic curves are usually viewed as "straight lines" on the manifold. Geodesic curves have the favorable property that their curvature has zero contribution from $k_g$. That is to say, the second-order derivative of a geodesic curve parameterized by arclength is exactly $k_n$.

3.2 Methodology

Let $\tilde X = [\tilde X_1, \dots, \tilde X_n] \in \mathbb{R}^{p\times n}$ be the noisy data matrix containing $n$ samples. Each sample is a vector in $\mathbb{R}^p$ independently drawn from (3.1). The overall data matrix has the representation $\tilde X = X + S + E$, where $X$ is the clean data matrix, $S$ is the matrix of the sparse noise, and $E$ is the matrix of the Gaussian noise. We further assume that the clean data $X$ lies on some $d$-dimensional manifold $\mathcal M$ with a small dimension $d \ll p$ embedded in $\mathbb{R}^p$, and that the samples are sufficient ($n \ge p$). The small-intrinsic-dimension assumption ensures that the data is locally low-dimensional, so that the corresponding local data matrix is of low rank. This property allows the data to be separated from the sparse noise.

The key idea behind our method is to handle the data locally. We use the $k$ Nearest Neighbors ($k$NN) to construct local data matrices, where $k$ is larger than the intrinsic dimension $d$. For a data point $X_i \in \mathbb{R}^p$, we define the local patch centered at it to be the set consisting of its $k$NN and itself; the local data matrix $X^{(i)}$ associated with this patch is $X^{(i)} = [X_{i_1}, X_{i_2}, \dots, X_{i_k}, X_i]$, where $X_{i_j}$ is the $j$-th nearest neighbor of $X_i$. Let $\mathcal P_i$ be the restriction operator to the $i$-th patch, i.e., $\mathcal P_i(X) = XP_i$, where $P_i$ is the $n\times(k+1)$ matrix that selects the columns of $X$ in the $i$-th patch. Then $X^{(i)} = \mathcal P_i(X)$. Similarly, we define $S^{(i)} = \mathcal P_i(S)$, $E^{(i)} = \mathcal P_i(E)$, and $\tilde X^{(i)} = \mathcal P_i(\tilde X)$.

Since each local data matrix $X^{(i)}$ is nearly of low rank and $S$ is sparse, we can decompose the noisy data matrix into low-rank parts and sparse parts by solving the following optimization problem:

$$\{\hat S, \{\hat S^{(i)}\}_{i=1}^n, \{\hat L^{(i)}\}_{i=1}^n\} = \arg\min_{S,\, S^{(i)},\, L^{(i)}} F(S, \{S^{(i)}\}_{i=1}^n, \{L^{(i)}\}_{i=1}^n)$$
$$\equiv \arg\min_{S,\, S^{(i)},\, L^{(i)}} \sum_{i=1}^n \Big(\lambda_i\|\tilde X^{(i)} - L^{(i)} - S^{(i)}\|_F^2 + \|\mathcal C(L^{(i)})\|_* + \beta\|S^{(i)}\|_1\Big) \quad \text{subject to } S^{(i)} = \mathcal P_i(S), \qquad (3.4)$$

where we take $\beta = \max\{k+1, p\}^{-1/2}$ as in RPCA, $\tilde X^{(i)} = \mathcal P_i(\tilde X)$ is the local data matrix on the $i$-th patch, and $\mathcal C$ is the centering operator that subtracts the column mean: $\mathcal C(Z) = Z\big(I - \frac{1}{k+1}\mathbf 1\mathbf 1^T\big)$, where $\mathbf 1$ is the $(k+1)$-dimensional column vector of all ones. Here we decompose the data on each patch into a low-rank part $L^{(i)}$ and a sparse part $S^{(i)}$ by imposing the nuclear norm and the entry-wise $\ell_1$-norm on $L^{(i)}$ and $S^{(i)}$, respectively.
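For concreteness, the following NumPy sketch shows one way to build the kNN patches and the centering operator used in (3.4); `build_patches` and `centering` are hypothetical helper names, and the brute-force distance computation is only for illustration.

```python
import numpy as np

def build_patches(X, k):
    """For each column X_i of X (p x n), return the indices of its k nearest
    neighbors followed by i itself, so that X[:, idx] equals X^(i) = P_i(X)."""
    sq = (X ** 2).sum(axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)    # squared distances
    np.fill_diagonal(D2, np.inf)             # a point is not its own neighbor
    patches = []
    for i in range(X.shape[1]):
        knn = np.argsort(D2[i])[:k]
        patches.append(np.concatenate([knn, [i]]))      # [X_{i_1},...,X_{i_k}, X_i]
    return patches

def centering(Z):
    """The operator C(Z) = Z (I - 11^T/(k+1)): subtract the column mean."""
    return Z - Z.mean(axis=1, keepdims=True)

# Usage: the local data matrix on the i-th patch.
X = np.random.default_rng(0).standard_normal((20, 500))
patches = build_patches(X, k=15)
X_0 = X[:, patches[0]]                       # a 20 x 16 local matrix X^(0)
```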
There are two key components in this formulation. 1) The local patches are overlapping (for example, the first data point $X_1$ may belong to several patches); thus, the constraint $S^{(i)} = \mathcal P_i(S)$ is particularly important because it ensures that copies of the same point on different patches (and those of the sparse noise on different patches) remain the same. 2) We do not require the $L^{(i)}$ to be restrictions of a universal $L$ to the $i$-th patch, because the $L^{(i)}$'s correspond to the local affine tangent spaces, and there is no reason for a point on the manifold to have the same projection on different tangent spaces. This seemingly subtle difference has a large impact on the final result.

Next, we provide a geometric intuition for the formulation (3.4). Write the clean data matrix $X^{(i)}$ on the $i$-th patch via its Taylor expansion along the manifold,

$$X^{(i)} = X_i\mathbf 1^T + T^{(i)} + R^{(i)}, \qquad (3.5)$$

where the Taylor series is expanded at $X_i$ (the center point of the $i$-th patch), $T^{(i)}$ stores the first-order term and its columns lie in the tangent space of the manifold at $X_i$, and $R^{(i)}$ contains all the higher-order terms. The sum of the first two terms, $X_i\mathbf 1^T + T^{(i)}$, is the linear approximation to $X^{(i)}$, which is unknown if the tangent space is not given. This linear approximation precisely corresponds to the $L^{(i)}$ in (3.4), i.e., $L^{(i)} = X_i\mathbf 1^T + T^{(i)}$. Since the tangent space has the same dimensionality $d$ as the manifold, with randomly chosen points we have, with probability one, $\mathrm{rank}(T^{(i)}) = d$. As a result, $\mathrm{rank}(L^{(i)}) = \mathrm{rank}(X_i\mathbf 1^T + T^{(i)}) \le d + 1$. By the assumption that $d < \min\{p, k\}$, we know that $L^{(i)}$ is indeed low rank. Combining (3.5) with $\tilde X^{(i)} = X^{(i)} + S^{(i)} + E^{(i)}$, we find that the misfit term $\tilde X^{(i)} - L^{(i)} - S^{(i)}$ in (3.4) equals $E^{(i)} + R^{(i)}$. This implies that the misfit contains the higher-order residues (i.e., the linear approximation error) and the Gaussian noise.

We solve (3.4) for the sparse component $\hat S$. If the data contains only sparse noise, i.e., $E = 0$, then $\hat X \equiv \tilde X - \hat S$ is the final estimate of $X$. If $E \neq 0$, we apply Singular Value Hard Thresholding Gavish and Donoho (2014) to truncate $\mathcal C(\tilde X^{(i)} - \mathcal P_i(S))$ and remove the Gaussian noise (see Section 3.5), and use the resulting $\hat L^{(i)}_{\tau^*}$ to construct a final estimate $\hat X$ of $X$ via least-squares fitting:

$$\hat X = \arg\min_{Z\in\mathbb{R}^{p\times n}} \sum_{i=1}^n \lambda_i\|\mathcal P_i(Z) - \hat L^{(i)}_{\tau^*}\|_F^2. \qquad (3.6)$$

The following discussion revolves around (3.4) and (3.6), and the structure of the chapter is as follows. In Section 3.3, we establish theoretical recovery guarantees for (3.4), which justify our choice of $\beta$ and allow us to theoretically choose $\lambda$. The calculation of $\lambda$ uses the curvature of the manifold, so in Section 3.4 we provide a simple method to estimate the average manifold curvature; the method is robust to sparse noise. The optimization algorithms that solve (3.4) and (3.6) are presented in Section 3.5, and the numerical experiments are in Section 3.6.

3.3 Theoretical choice of tuning parameters

To establish the error bound, we need a coherence condition on the tangent spaces of the manifold.
Let ๐‘ˆ โˆˆ R๐‘šร—๐‘Ÿ (๐‘š โ‰ฅ ๐‘Ÿ) be a matrix with ๐‘ˆ โˆ—๐‘ˆ = ๐ผ, the coherence of ๐‘ˆ is defined as ๐‘š ๐œ‡(๐‘ˆ) = max โˆฅ๐‘ˆ โˆ— e ๐‘˜ โˆฅ 22 , ๐‘Ÿ ๐‘˜ โˆˆ{1,...,๐‘š} where e ๐‘˜ is the ๐‘˜th element of the canonical basis. For a subspace ๐‘‡, its coherence is defined as ๐‘š ๐œ‡(๐‘‰) = max โˆฅ๐‘‰ โˆ— e ๐‘˜ โˆฅ 22 , ๐‘Ÿ ๐‘˜ โˆˆ{1,...,๐‘š} where ๐‘‰ is an orthonormal basis of ๐‘‡. The coherence is independent of the choice of basis. The following theorem is proved for local patches constructed using the ๐œ–-neighborhoods. We use ๐‘˜NN in the experiments because ๐‘˜NN is more robust to insufficient samples. The full version of Theorem 3.3.1 can be found in the appendix. Theorem 3.3.1. [succinct version] Let each ๐‘‹๐‘– โˆˆ R ๐‘ , ๐‘– = 1, ..., ๐‘›, be independently drawn from a compact manifold M โІ R ๐‘ with an intrinsic dimension ๐‘‘ and endowed with the uniform distribution. Let ๐‘‹๐‘– ๐‘— , ๐‘— = 1, . . . , ๐‘˜ ๐‘– be the ๐‘˜ ๐‘– points falling in an ๐œ‚-neighborhood of ๐‘‹๐‘– with radius ๐œ‚, where ๐œ‚ > 0 is some fixed small constant. These points form the matrix ๐‘‹ (๐‘–) = [๐‘‹๐‘–1 , . . . , ๐‘‹๐‘– ๐‘˜ , ๐‘‹๐‘– ]. For any ๐‘– ๐‘ž โˆˆ M, let ๐‘‡๐‘ž be the tangent space of M at ๐‘ž and define ๐œ‡ยฏ = sup๐‘žโˆˆM ๐œ‡(๐‘‡๐‘ž ). Suppose the support of the noise matrix ๐‘† (๐‘–) is uniformly distributed among all sets of cardinality ๐‘š๐‘– . Then as long as 59 ๐‘‘ < ๐œŒ๐‘Ÿ min{๐‘˜, ๐‘} ๐œ‡ยฏ โˆ’1 logโˆ’2 max{ ๐‘˜, ยฏ ๐‘}, and ๐‘š๐‘– โ‰ค 0.4๐œŒ ๐‘  ๐‘๐‘˜ (here ๐œŒ๐‘Ÿ and ๐œŒ ๐‘  are positive constants, ๐‘˜ยฏ = max๐‘– ๐‘˜ ๐‘– , and ๐‘˜ = min๐‘– ๐‘˜ ๐‘– ) , then with probability over 1 โˆ’ ๐‘ 1 ๐‘› max{๐‘˜, ๐‘}โˆ’10 โˆ’ ๐‘’ โˆ’๐‘2 ๐‘˜ for some constants ๐‘ 1 and ๐‘ 2 , the minimizer ๐‘†ห† to (3.4) with weights min{๐‘˜ ๐‘– + 1, ๐‘}1/2 ๐œ†๐‘– = , ๐›ฝ๐‘– = max{๐‘˜ ๐‘– + 1, ๐‘}โˆ’1/2 (3.7) ๐œ–๐‘– has the error bound ห† โˆ’ ๐‘† (๐‘–) โˆฅ 2,1 โ‰ค ๐ถ โˆš ๐‘๐‘› ๐‘˜ยฏ โˆฅ๐œ– โˆฅ 2 . โˆ‘๏ธ โˆฅP๐‘– ( ๐‘†) ๐‘– Here ๐œ–๐‘– = โˆฅ ๐‘‹หœ (๐‘–) โˆ’ ๐‘‹๐‘– 1๐‘‡ โˆ’ ๐‘‡ (๐‘–) โˆ’ ๐‘† (๐‘–) โˆฅ ๐น will be estimated in the next section, ๐œ– = [๐œ–1 , ..., ๐œ– ๐‘› ], โˆฅ ยท โˆฅ 2,1 stands for taking โ„“2 norm along rows and โ„“1 norm along columns, and ๐‘‡ (๐‘–) is the projection of ๐‘‹ (๐‘–) โˆ’ ๐‘‹๐‘– 1๐‘‡ to the tangent space ๐‘‡๐‘‹๐‘– . Remark. We can interpret ๐œ– as the total noise in the data. As explained in Section 3.1.4, โˆฅ ๐‘‹หœ (๐‘–) โˆ’ ๐‘‹๐‘– 1๐‘‡ โˆ’ ๐‘‡ (๐‘–) โˆ’ ๐‘† (๐‘–) โˆฅ ๐น = โˆฅ๐‘… (๐‘–) + ๐ธ (๐‘–) โˆฅ ๐น , thus ๐œ– = 0 if the manifold is linear and the Gaussian โˆš noise is absent. The factor ๐‘› in front of โˆฅ๐œ– โˆฅ 2 takes into account the use of different norms on the two hand sides (the right-hand side is the Frobenius norm of the noise matrix {๐‘… (๐‘–) + ๐ธ (๐‘–) }๐‘–=1 ๐‘› โˆš obtained by stacking the ๐‘… (๐‘–) + ๐ธ (๐‘–) associated with each patch into one big matrix). The factor ๐‘ is due to the small weight ๐›ฝ๐‘– of โˆฅ๐‘† (๐‘–) โˆฅ 1 compared to the weight 1 on โˆฅ ๐‘‹หœ (๐‘–) โˆ’ ๐ฟ (๐‘–) โˆ’ ๐‘† (๐‘–) โˆฅ 2๐น . The factor ๐‘˜ยฏ appears because on average, each column of ๐‘†ห† โˆ’ ๐‘† is added about ๐‘˜ := ๐‘›1 ๐‘– ๐‘˜ ๐‘– times on the ร left hand side. 
3.4 Estimating the curvature

The definition of $\lambda_i$ in (3.7) involves an unknown quantity $\epsilon_i^2 = \|\tilde X^{(i)} - X_i\mathbf 1^T - T^{(i)} - S^{(i)}\|_F^2 \equiv \|R^{(i)} + E^{(i)}\|_F^2$. We assume the standard deviation $\sigma$ of the i.i.d. Gaussian entries of $E^{(i)}$ is known, so $\|E^{(i)}\|_F^2$ can be approximated. Since $R^{(i)}$ is independent of $E^{(i)}$, the cross term $\langle R^{(i)}, E^{(i)}\rangle$ is small. Our main task is estimating $\|R^{(i)}\|_F^2$, the linear approximation error defined in Section 3.1.4. In local regions, second-order terms dominate the linear approximation residue; hence estimating $\|R^{(i)}\|_F^2$ requires curvature information.

3.4.1 The proposed method

All existing curvature estimation methods we are aware of are in the field of computer vision, where the objects are 2D surfaces in 3D Flynn and Jain (1989); Eppel (2006); Tong and Tang (2005); Meek and Walton (2000). Most of these methods are difficult to generalize to high (> 3) dimensions, with the exception of the integral-invariant-based approaches Pottmann et al. (2007). However, the integral-invariant-based approach is not robust to sparse noise and is unsuited to our problem. We propose a new method to estimate the root mean square curvature from the noisy data. Although the graphic illustration is made in 3D, the method is dimension independent.

To compute the average normal curvature at a point $p \in \mathcal M$, we randomly pick $m$ points $q_i \in \mathcal M$ on the manifold lying within a proper distance to $p$, as specified in Algorithm 3.1. Let $\gamma_i$ be the geodesic curve between $p$ and $q_i$. For each $i$, we compute the pairwise Euclidean distance $\|p - q_i\|_2$ and the pairwise geodesic distance $d_g(p, q_i)$ using Dijkstra's algorithm. Through a circular approximation of the geodesic curve, as drawn in Figure 3.1, we can compute the curvature of the geodesic curve as the inverse of the radius:

$$\|\gamma_i''(p)\| = 1/R_{\gamma_i'}, \qquad (3.8)$$

where $\gamma_i'$ is the tangent direction along which the curvature is calculated and $R_{\gamma_i'}$ is the radius of the circular approximation to the curve $\gamma_i$ at $p$, which can be solved, along with the angle $\theta_{\gamma_i'}$, through the geometric relations

$$2R_{\gamma_i'}\sin(\theta_{\gamma_i'}/2) = \|p - q_i\|_2, \qquad R_{\gamma_i'}\theta_{\gamma_i'} = d_g(p, q_i), \qquad (3.9)$$

as indicated in Figure 3.1. Finally, we define the average curvature $\bar\Gamma(p)$ at $p$ to be

$$\bar\Gamma(p) := \big(\mathbb E_{q_i}\|\gamma_i''(p)\|^2\big)^{1/2} \equiv \big(\mathbb E_{q_i} R_{\gamma_i}^{-2}\big)^{1/2}. \qquad (3.10)$$

To estimate the root mean square curvature from the data, we construct two matrices $D$ and $A$. $D \in \mathbb{R}^{n\times n}$ is the pairwise distance matrix, where $D_{ij}$ denotes the Euclidean distance between two points $X_i$ and $X_j$. $A$ is a type of adjacency matrix, defined as follows, used to compute the pairwise geodesic distances from the data:

$$A_{ij} = \begin{cases} D_{ij} & \text{if } X_j \text{ is among the } k \text{ nearest neighbors of } X_i, \\ 0 & \text{otherwise.}\end{cases} \qquad (3.11)$$

Algorithm 3.1 estimates the root mean square curvature at some point $p$, and Algorithm 3.2 estimates the overall curvature within some region $\Omega$ on the manifold.
Algorithm 3.1 Estimate the root mean square curvature Γ̄(p) at some point p
Input: Distance matrix D, adjacency matrix A, some proper constants r1 < r2, number of pairs m
Output: The estimated root mean square curvature Γ̄(p)
1: for i = 1 to m do
2:     Randomly pick some point q_i ∈ B(p, r2) \ B(p, r1)
3:     Calculate the geodesic distance d_g(p, q_i) using A
4:     Solve for the radius R_i based on (3.9)
5: end
6: Compute the estimated curvature Γ̄(p) = ((1/m) Σ_{i=1}^m R_i^{-2})^{1/2}

Algorithm 3.2 Estimate the overall curvature Γ̄(Ω) for some region Ω
Input: Distance matrix D, adjacency matrix A, some proper constants r1 < r2, number of pairs m
Output: The estimated overall curvature Γ̄(Ω)
1: for i = 1 to m do
2:     Randomly pick a pair of points p_i, q_i ∈ Ω such that r1 ≤ d(p_i, q_i) ≤ r2
3:     Calculate the geodesic distance d_g(p_i, q_i) using A
4:     Solve for the radius R_i based on (3.9)
5: end
6: Compute the estimated curvature Γ̄(Ω) = ((1/m) Σ_{i=1}^m R_i^{-2})^{1/2}

The geodesic distance is computed using Dijkstra's algorithm, which is not accurate when $p$ and $q$ are too close to each other. The constant $r_1$ in Algorithms 3.1 and 3.2 is thus used to make sure that $p$ and $q$ are sufficiently far apart. The constant $r_2$ makes sure that $q$ is not too far away from $p$ since, after all, we are computing the root mean square curvature around $p$.
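A possible Python transcription of Algorithm 3.2 is sketched below. It uses scipy's Dijkstra routine for the geodesic distances and solves the arc system (3.9) by bisection on $\sin(t)/t = \text{chord}/\text{geodesic}$ with $t = \theta/2$; the function names, the sampling guard, and the choice of Euclidean distance in the pair-selection test are our own assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def solve_radius(chord, geo):
    """Solve the arc system (3.9): 2R sin(theta/2) = chord, R*theta = geo.
    Substituting R = geo/theta reduces it to sin(t)/t = chord/geo with
    t = theta/2, solved by bisection on (0, pi)."""
    ratio = chord / geo
    if ratio >= 1.0:
        return np.inf                        # numerically a straight line
    lo, hi = 1e-12, np.pi
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.sin(mid) / mid > ratio:        # sin(t)/t decreases on (0, pi)
            lo = mid
        else:
            hi = mid
    theta = lo + hi                          # theta = 2 * t
    return geo / theta

def rms_curvature(X, k=10, r1=0.5, r2=2.0, m=200, seed=0):
    """Algorithm 3.2: root mean square curvature over the data set X (p x n)."""
    n = X.shape[1]
    sq = (X ** 2).sum(axis=0)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0))
    A = np.zeros_like(D)                     # kNN adjacency (3.11)
    for i in range(n):
        nbr = np.argsort(D[i])[1:k + 1]
        A[i, nbr] = D[i, nbr]
    Dg = dijkstra(csr_matrix(A), directed=False)
    rng = np.random.default_rng(seed)
    curv2 = []
    for _ in range(100 * m):                 # guard against too few valid pairs
        i, j = rng.integers(n, size=2)
        if i != j and r1 <= D[i, j] <= r2 and np.isfinite(Dg[i, j]):
            R = solve_radius(D[i, j], max(Dg[i, j], D[i, j]))
            curv2.append(0.0 if np.isinf(R) else 1.0 / R ** 2)
            if len(curv2) == m:
                break
    return np.sqrt(np.mean(curv2))
```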
3.4.2 Estimating λi from the root mean square curvature

We provide a way to approximate $\lambda_i$ when the number of points $n$ is finite. In the asymptotic limit ($k \to \infty$, $k/n \to 0$), all the approximation signs "≈" below become "=".

Fix a point $p \in \mathcal M$ and another point $q_i$ in the $\eta$-neighborhood of $p$. Let $\gamma_i$ be the geodesic curve between them. With the computed curvature $\bar\Gamma(p)$, we can estimate the linear approximation error of expanding $q_i$ at $p$: $q_i \approx p + P_{T_p}(q_i - p)$, where $P_{T_p}$ is the projection onto the tangent space at $p$. Let $\mathcal E$ be the error of this linear approximation:

$$\mathcal E(q_i, p) = q_i - p - P_{T_p}(q_i - p) = P_{T_p^\perp}(q_i - p),$$

where $T_p^\perp$ is the orthogonal complement of the tangent space. From Figure 3.1, the relation between $\|\mathcal E(p, q_i)\|_2$, $\|p - q_i\|_2$, and $\theta_{\gamma_i'}$ is

$$\|\mathcal E(p, q_i)\|_2 \approx \|p - q_i\|_2 \sin\frac{\theta_{\gamma_i'}}{2} = \frac{\|p - q_i\|_2^2}{2R_{\gamma_i'}}. \qquad (3.12)$$

To obtain a closed-form formula for $\mathcal E$, we assume that for the fixed $p$ and a randomly chosen $q_i$ in an $\eta$-neighborhood of $p$, the projection $P_{T_p}(q_i - p)$ follows a uniform distribution in a ball with radius $\eta'$ (in fact $\eta' \approx \eta$: when $\eta$ is small, the projection of $q - p$ is almost $q - p$ itself, so the radius of the projected ball is almost equal to the radius of the original neighborhood). Under this assumption, let $r_i = \|P_{T_p}(q_i - p)\|_2$ be the magnitude of the projection and $\phi_i = P_{T_p}(q_i - p)/\|P_{T_p}(q_i - p)\|_2$ the direction; by Vershynin (2018), $r_i$ and $\phi_i$ are independent of each other. As the curvature $R_{\gamma_i}$ only depends on the direction, the numerator and the denominator of the right-hand side of (3.12) are independent of each other. Therefore,

$$\mathbb E\|\mathcal E(p, q_i)\|_2^2 \approx \mathbb E\frac{\|p - q_i\|_2^4}{4R_{\gamma_i'}^2} = \frac{\mathbb E\|p - q_i\|_2^4}{4}\,\mathbb E R_{\gamma_i'}^{-2} = \frac{\mathbb E\|p - q_i\|_2^4}{4}\cdot\bar\Gamma^2(p), \qquad (3.13)$$

where the first equality used the independence and the last equality used the definition of the root mean square curvature in the previous subsection. Now we apply this estimation to the neighborhood of $X_i$. Let $p = X_i$ and $q_j = X_{i_j}$ be the neighbors of $X_i$. Using (3.13), the average linear approximation error on this patch is

$$\frac1k\|R^{(i)}\|_F^2 := \frac1k\sum_{j=1}^{k}\|\mathcal E(X_{i_j}, X_i)\|_2^2 \xrightarrow{k\to\infty} \frac{\mathbb E\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i), \qquad (3.14)$$

where the right-hand side can also be estimated by

$$\frac1k\sum_{j=1}^{k}\frac{\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i) \xrightarrow{k\to\infty} \frac{\mathbb E\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i). \qquad (3.15)$$

Therefore, when $k$ is sufficiently large, $\frac1k\|R^{(i)}\|_F^2$ is also close to $\frac1k\sum_{j=1}^{k}\frac{\|X_i - X_{i_j}\|_2^4}{4}\bar\Gamma^2(X_i)$, which can be computed entirely from the data. Combining this with the argument at the beginning of this section, we get

$$\epsilon_i = \|R^{(i)} + E^{(i)}\|_F \approx \big(\|R^{(i)}\|_F^2 + \|E^{(i)}\|_F^2\big)^{1/2} \approx \Big((k+1)p\sigma^2 + \sum_{j=1}^{k}\frac{\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i)\Big)^{1/2} =: \hat\epsilon_i.$$

Thus we can set $\lambda_i = \frac{\min\{k+1, p\}^{1/2}}{\hat\epsilon_i}$ in view of (3.7). We show in the appendix that $\frac{\hat\lambda_i - \lambda_i^*}{\lambda_i^*} \xrightarrow{k\to\infty} 0$, where $\lambda_i^* = \frac{\min\{k+1, p\}^{1/2}}{\epsilon_i}$ as in (3.7).

3.5 Optimization algorithm

To solve the convex optimization problem (3.4) in a memory-economic way, we first write $L^{(i)}$ as a function of $S$ and eliminate it from the problem. We can do so by fixing $S$ and minimizing the objective function with respect to $L^{(i)}$:

$$\hat L^{(i)} = \arg\min_{L^{(i)}}\ \lambda_i\|\tilde X^{(i)} - L^{(i)} - S^{(i)}\|_F^2 + \|\mathcal C(L^{(i)})\|_*$$
$$= \arg\min_{L^{(i)}}\ \lambda_i\|\mathcal C(L^{(i)}) - \mathcal C(\tilde X^{(i)} - S^{(i)})\|_F^2 + \|\mathcal C(L^{(i)})\|_* + \lambda_i\|(I - \mathcal C)\big(L^{(i)} - (\tilde X^{(i)} - S^{(i)})\big)\|_F^2. \qquad (3.16)$$

Notice that $L^{(i)}$ can be decomposed as $L^{(i)} = \mathcal C(L^{(i)}) + (I - \mathcal C)(L^{(i)})$. Set $A = \mathcal C(L^{(i)})$ and $B = (I - \mathcal C)(L^{(i)})$; then (3.16) is equivalent to

$$(\hat A, \hat B) = \arg\min_{A,B}\ \lambda_i\|A - \mathcal C(\tilde X^{(i)} - S^{(i)})\|_F^2 + \|A\|_* + \lambda_i\|B - (I - \mathcal C)(\tilde X^{(i)} - S^{(i)})\|_F^2,$$

which decouples into

$$\hat A = \arg\min_A\ \lambda_i\|A - \mathcal C(\tilde X^{(i)} - S^{(i)})\|_F^2 + \|A\|_*, \qquad \hat B = \arg\min_B\ \lambda_i\|B - (I - \mathcal C)(\tilde X^{(i)} - S^{(i)})\|_F^2.$$

These problems have the closed-form solutions

$$\hat A = \mathcal T_{1/2\lambda_i}\big(\mathcal C(\tilde X^{(i)} - \mathcal P_i(S))\big), \qquad \hat B = (I - \mathcal C)\big(\tilde X^{(i)} - \mathcal P_i(S)\big), \qquad (3.17)$$

where $\mathcal T_\mu$ is the soft-thresholding operator on the singular values, $\mathcal T_\mu(Z) = U\max\{\Sigma - \mu I, 0\}V^*$, with $U\Sigma V^*$ the SVD of $Z$.
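The closed forms (3.17) only require the operator $\mathcal T_\mu$; a direct NumPy transcription (with our own function names) is:

```python
import numpy as np

def svt(Z, mu):
    """Singular value soft-thresholding: T_mu(Z) = U max(Sigma - mu I, 0) V*."""
    U, sig, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(sig - mu, 0.0)) @ Vt

def centering(Z):
    """C(Z) = Z (I - 11^T/(k+1)): subtract the column mean."""
    return Z - Z.mean(axis=1, keepdims=True)

def patch_solution(Xt_i, S_i, lam_i):
    """Closed forms (3.17): A = T_{1/(2 lam_i)}(C(Xt_i - S_i)),
    B = (I - C)(Xt_i - S_i)."""
    R = Xt_i - S_i
    Rc = centering(R)
    return svt(Rc, 1.0 / (2.0 * lam_i)), R - Rc
```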
64 Combing ๐ดห† and ๐ต, ห† we have derived the closed form solution for ๐ฟห† (๐‘–) ๐ฟห† (๐‘–) (๐‘†) = T1/2๐œ†๐‘– (C( ๐‘‹หœ (๐‘–) โˆ’ P๐‘– (๐‘†))) + (๐ผ โˆ’ C)( ๐‘‹หœ (๐‘–) โˆ’ P๐‘– (๐‘†)). (3.18) Plugging (3.18) into ๐น in (3.4), the resulting optimization problem solely depends on ๐‘†. Then we apply FISTA Beck and Teboulle (2009); Sha et al. (2019) to find the optimal solution ๐‘†ห† with ๐‘†ห† = arg min ๐น ( ๐ฟห† (๐‘–) (๐‘†), ๐‘†). (3.19) ๐‘† Once ๐‘†ห† is found, if the data has no Gaussian noise, then the final estimation for ๐‘‹ is ๐‘‹ห† โ‰ก ๐‘‹หœ โˆ’ ๐‘†; ห† if (๐‘–) there is Gaussian noise, we use the following denoised local patches ๐ฟห† ๐œโˆ— (๐‘–) ๐ฟห† ๐œ โˆ— = ๐ป๐œ โˆ— (C( ๐‘‹หœ (๐‘–) โˆ’ P๐‘– ( ๐‘†))) ห† + (๐ผ โˆ’ C)( ๐‘‹หœ (๐‘–) โˆ’ P๐‘– ( ๐‘†)), ห† (3.20) where ๐ป๐œ โˆ— is the Singular Value Hard Thresholding Operator with the optimal threshold as defined (๐‘–) in Gavish and Donoho (2014). This optimal thresholding removes the Gaussian noise from ๐ฟห† ๐œ โˆ— . (๐‘–) With the denoised ๐ฟห† ๐œโˆ— , we solve (3.6) to obtain the denoised data ๐‘› ๐‘› (๐‘–) โˆ‘๏ธ โˆ‘๏ธ ๐‘‹ห† = ( ๐œ†๐‘– ๐ฟห† ๐œ โˆ— ๐‘ƒ๐‘–๐‘‡ )( ๐œ†๐‘– ๐‘ƒ๐‘– ๐‘ƒ๐‘–๐‘‡ ) โˆ’1 . (3.21) ๐‘–=1 ๐‘–=1 The proposed Nonlinear Robust Principle Component Analysis (NRPCA) algorithm is summarized in Algorithm 3.3. There is one caveat in solving (3.4): the strong sparse noise may result in a wrong Algorithm 3.3 Nonlinear Robust PCA Input: Noisy data matrix ๐‘‹, หœ ๐‘˜ (number of neighbors in each local patch), ๐‘‡ (number of neighbor- hood updates iterations) Output: The denoised data ๐‘‹, ห† the estimated sparse noise ๐‘†ห† 1: Estimate the curvature using (3.10) 2: Estimate ๐œ†๐‘– , ๐‘– = 1, . . . , ๐‘› as in Section 3.4, set ๐›ฝ as in (3.4) 3: ๐‘†ห† โ† 0 4: for iter = 1 to T do 5: Find the ๐‘˜NN for each point using ๐‘‹หœ โˆ’ ๐‘†ห† and construct the restriction operators {P๐‘– }๐‘–=1 ๐‘› 6: Construct the local data matrices ๐‘‹หœ (๐‘–) = P๐‘– ( ๐‘‹) หœ using P๐‘– and the noisy data ๐‘‹หœ 7: ๐‘†ห† โ† minimizer of (3.19) iteratively using FISTA 8: end 9: Compute each ๐ฟ ห† (๐‘–)โˆ— from (3.20) and assign ๐‘‹ห† from (3.21) ๐œ 65 neighborhood assignment when constructing the local patches. Therefore, once ๐‘†ห† is obtained and removed from the data, we update the neighborhood assignment and re-compute ๐‘†. ห† This procedure is repeated ๐‘‡ times. 3.6 Numerical experiment Figure 3.2 NRPCA applied to the noisy 3D Swiss roll dataset. ๐‘‹หœ โˆ’ ๐‘†ห† is the result after subtracting the sparse noise estimated by setting ๐‘‡ = 1 in NRPCA, i.e., no neighbour update; โ€œ ๐‘‹หœ โˆ’ ๐‘†ห† with one neighbor updateโ€ used the ๐‘†ห† obtained by setting ๐‘‡ = 2 in NRPCA; clearly, the neighbour update helped to remove more sparse noise; ๐‘‹ห† is the data obtained via fitting the denoised tangent spaces as in (3.6). Compared toโ€œ ๐‘‹หœ โˆ’ ๐‘†ห† with one neighbor updateโ€, it further removed the Gaussian noise from the data; โ€Patch-wise Robust PCAโ€ refers to the ad-hoc application of the vanilla Robust PCA to each local patch independently, whose performance is worse than the proposed joint-recovery formulation. Simulated Swiss roll: We demonstrate the superior performance of NRPCA on a synthetic dataset following the mixed noise model (3.1). We sampled 2000 noiseless data ๐‘‹๐‘– uniformly from a 3D Swiss roll and generated the Gaussian noise matrix with i.i.d. entries obeying N (0, 0.25). 
3.6 Numerical experiment

Figure 3.2 NRPCA applied to the noisy 3D Swiss roll dataset. X̃ − Ŝ is the result after subtracting the sparse noise estimated by setting T = 1 in NRPCA, i.e., no neighbor update; "X̃ − Ŝ with one neighbor update" used the Ŝ obtained by setting T = 2 in NRPCA; clearly, the neighbor update helped to remove more sparse noise; X̂ is the data obtained via fitting the denoised tangent spaces as in (3.6). Compared to "X̃ − Ŝ with one neighbor update", it further removes the Gaussian noise from the data; "Patch-wise Robust PCA" refers to the ad-hoc application of the vanilla Robust PCA to each local patch independently, whose performance is worse than the proposed joint-recovery formulation.

Simulated Swiss roll: We demonstrate the superior performance of NRPCA on a synthetic dataset following the mixed noise model (3.1). We sampled 2000 noiseless data points $X_i$ uniformly from a 3D Swiss roll and generated the Gaussian noise matrix with i.i.d. entries obeying $\mathcal N(0, 0.25)$. The sparse noise matrix $S$ is generated by randomly replacing 100 entries of a zero $p\times n$ matrix with i.i.d. samples generated from $(-1)^y\cdot z$, where $y \sim \text{Bernoulli}(0.5)$ and $z \sim \mathcal N(5, 0.09)$. We applied NRPCA to the simulated data with patch size $k = 15$. Figure 3.2 reports the denoising results in the original space (3D), looking down from above. We compare two ways of using the outputs of NRPCA: 1) only remove the sparse noise from the data, $\tilde X - \hat S$; 2) remove both the sparse and the Gaussian noise from the data, $\hat X$. In addition, we plotted $\tilde X - \hat S$ with and without the neighborhood update. These results are all superior to an ad-hoc application of Robust PCA on the individual local patches.

High-dimensional Swiss roll: We carried out the same simulation on a high-dimensional Swiss roll and obtained better distinguishability among 1)-3). We also observed an overall improvement in the performance of NRPCA, which matches our intuition that the assumptions of Theorem 3.3.1 are more likely to be satisfied in high dimensions. The denoised results are displayed in Figure 3.3, where we clearly see that the use of $\hat X$ instead of $\tilde X - \hat S$ allows a significant amount of Gaussian noise to be removed from the data. In the high-dimensional simulation, we generated a Swiss roll in $\mathbb{R}^{20}$ as follows (a code sketch of this generation procedure is given after Figure 3.3's caption below):
1. Choose the number of samples $n = 2000$;
2. Let $t$ be the vector of length $n$ containing the $n$ uniform grid points in the interval $[0, 4\pi]$ with grid spacing $4\pi/(n-1)$;
3. Set the first three dimensions of the data the same way as the 3D Swiss roll: for $i = 1, \dots, n$, $X_i(1) = (t(i)+1)\cos(t(i))$, $X_i(2) = (t(i)+1)\sin(t(i))$, and $X_i(3) \sim \text{unif}([0, 8\pi])$, where $\text{unif}([0, 8\pi])$ denotes the uniform distribution on the interval $[0, 8\pi]$;
4. Set dimensions 4-20 of the data to contain pure sinusoids with various frequencies: $X_i(k) = t(i)\sin(f_k t(i))$, $k = 4, \dots, 20$, where $f_k = k/21$ is the frequency for the $k$-th dimension.
The noisy data is obtained by adding i.i.d. Gaussian noise $\mathcal N(0, 0.25)$ to each entry of $X$ and adding sparse noise to 600 randomly chosen entries, where the noise added to each chosen entry obeys $\mathcal N(5, 0.09)$.

Figure 3.3 NRPCA applied to the noisy 20D Swiss roll data set. X̃ − Ŝ is the result after subtracting the estimated sparse noise via NRPCA with T = 1; "X̃ − Ŝ with one neighbor update" is that with T = 2, i.e., patches are reassigned once; X̂ is the denoised data obtained via fitting the tangent spaces in NRPCA with T = 2; "Patch-wise Robust PCA" refers to the ad-hoc application of the vanilla RPCA to each local patch independently, whose performance is clearly worse than the proposed joint-recovery formulation.
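A sketch of the 20D Swiss roll generation just described (our own variable names; distributions exactly as stated above):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 2000, 20
t = np.linspace(0.0, 4 * np.pi, n)             # uniform grid on [0, 4 pi]

X = np.zeros((p, n))
X[0] = (t + 1) * np.cos(t)                     # 3D Swiss roll coordinates
X[1] = (t + 1) * np.sin(t)
X[2] = rng.uniform(0.0, 8 * np.pi, size=n)
for k in range(3, p):                          # dimensions 4..20: sinusoids
    f_k = (k + 1) / 21.0                       # f_k = k/21 in 1-based indexing
    X[k] = t * np.sin(f_k * t)

E = 0.5 * rng.standard_normal((p, n))          # N(0, 0.25): variance 0.25
S = np.zeros((p, n))
idx = rng.choice(p * n, size=600, replace=False)
S.flat[idx] = rng.normal(5.0, 0.3, size=600)   # N(5, 0.09): std 0.3
X_noisy = X + S + E                            # mixed noise model (3.1)
```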
MNIST: We observe some interesting dimension reduction results on the MNIST dataset with the help of NRPCA. It is well known that the handwritten digits 4 and 9 have such a high similarity that some popular dimension reduction methods, such as Isomap and Laplacian Eigenmaps (LE), are not able to separate them into two clusters (first column of Figure 3.4). Despite the similarity, a few other methods (such as t-SNE) are able to distinguish them to a much higher degree, which suggests the possibility of improving the results of Isomap and LE with proper data pre-processing. We conjecture that the overlapping parts in Figure 3.4 (the left column) are caused by personalized writing styles with different beginning or finishing strokes. This type of difference can be better modelled by sparse noise than by Gaussian or Poisson noise.

Figure 3.4 Laplacian eigenmaps and Isomap results for the original and the NRPCA denoised digits 4 and 9 from the MNIST dataset.

The right column of Figure 3.4 confirms this conjecture: after the NRPCA denoising (with k = 10), we see a much better separability of the two digits using the first two coordinates of Isomap and Laplacian Eigenmaps. Here we used 2000 randomly drawn images of 4 and 9 from the MNIST training dataset. Figure 3.5 used another random set of the same cardinality and k = 5, but both demonstrate that the denoising step greatly facilitates the dimensionality reduction. In addition, we observe some emerging trajectory (or skeleton) patterns in the plots of the denoised embeddings (right column of Figure 3.4 and Figure 3.5). Mathematically speaking, this is due to the nuclear norm penalty on the tangent spaces in the optimization formulation, which forces the denoised data to have a small intrinsic dimension. However, since the small intrinsic dimensionality is not manually inputted but implicitly imposed via an automatic calculation of the data curvature and the weight parameters λ_i, we do not think the trajectory pattern is a human artifact. To further examine the meaning of the trajectories, we replaced the dots in the bottom two scatter plots of Figure 3.5 by the original images of the digits, obtaining Figure 3.6 and Figure 3.7. We can see that 1) the digits are better grouped in the denoised embedding than in the original one, and 2) the trajectories in the denoised embedding correspond to gradual transitions between the two images at the two ends. If two images are connected by two trajectories, this indicates two ways for one image to gradually deform into the other. Furthermore, Figure 3.8 lists a few images of 4 and 9 before and after denoising, which shows which parts of the images are detected as sparse noise and changed by NRPCA.

Figure 3.5 Laplacian eigenmaps and Isomap results for the original and the NRPCA denoised digits 4 and 9 from the MNIST dataset.

Figure 3.6 Isomap embedding using the original data from the MNIST dataset.
Figure 3.7 Isomap embedding using the denoised data via NRPCA.

Figure 3.8 A comparison of the original and the NRPCA denoised images of digits 4 and 9.

Figure 3.9 shows the results of NRPCA denoising with more iterations of patch reassignment; the results show almost no visible difference for T > 2. Since the patch reassignment is in the outer iteration, increasing its frequency greatly increases the computation time. Fortunately, we find that two iterations are often enough to deliver a good denoising result.

Figure 3.9 NRPCA denoising results with more iterations of patch reassignment (noisy images and denoised images with T = 1, ..., 5).

Biological data: We illustrate the potential usefulness of the NRPCA algorithm on an embryoid body (EB) differentiation dataset over a 27-day time course, which consists of gene expressions for 31,000 cells measured with single-cell RNA-sequencing technology (scRNAseq) Martin and Evans (1975); Moon et al. (2019). This EB data, comprising expression measurements for cells originating from embryoid bodies at different stages, is developmental in nature and should exhibit a progressive structure, such as a tree structure, because all cells arise from a single oocyte and then develop into different highly differentiated tissues. This progressive character is often missing when we directly apply dimension reduction methods to the data, as shown in Figure 3.10, because biological data, including scRNAseq, is highly noisy and often contaminated with outliers from different sources, including environmental effects and measurement error. In this case, we aim to reveal the progressive nature of the single-cell data from transcript abundance as measured by scRNAseq.

We first normalized the scRNAseq data following the procedure described in Moon et al. (2019) and randomly selected 1000 cells using a stratified sampling framework to maintain the ratios among different developmental stages. We applied our NRPCA method to the normalized subset of the EB data and then applied Locally Linear Embedding (LLE) to the denoised results. The two-dimensional LLE results are shown in Figure 3.10. Our analysis demonstrates that although LLE is unable to show the progression structure using the noisy data, after the NRPCA denoising LLE successfully extracts the trajectory structure in the data, which reflects the underlying smooth differentiation processes of embryonic cells. Interestingly, using the denoised data from X̃ − Ŝ with neighbor update, the LLE embedding shows a branching at around day 9 and increased variance at later time points, which was confirmed by manual analysis using 80 biomarkers in Moon et al. (2019).

Figure 3.10 LLE results for the denoised scRNAseq data set.

CHAPTER 4
PERTURBATION OF INVARIANT SUBSPACES FOR ILL-CONDITIONED EIGENSYSTEM

4.1 Introduction

4.1.1 Overview of Chapter 4

Given a diagonalizable matrix $A$, this chapter studies the stability of its invariant subspaces when its matrix of eigenvectors is ill-conditioned. Let $\mathcal X_1$ be some invariant subspace of $A$ and $X_1$ be the matrix of the right eigenvectors that span $\mathcal X_1$. It is generally believed that when the condition number $\kappa_2(X_1)$ gets large, the corresponding invariant subspace $\mathcal X_1$ becomes unstable under perturbation. This chapter proves that this is not always the case. Specifically, we show that the growth of $\kappa_2(X_1)$ alone is not enough to destroy the stability.
As a direct application, the result in this chapter ensures that when $A$ gets close to a Jordan form, one may still stably estimate its invariant subspaces from noisy data. The result in this chapter also suggests that for matrices with ill-conditioned eigensystems, the invariant subspaces may be more stable than the eigenvalues under matrix perturbation. Results of this chapter have given rise to the manuscript Lyu and Wang (2022).

4.2 Invariant subspace perturbation analysis

In Chapter 2 we discussed the stability of singular subspaces; for symmetric matrices, the techniques in Chapter 2 can also be extended to studying the stability of invariant subspaces. This is because symmetric matrices enjoy the following nice properties: 1. all the eigenvalues are real; 2. for any symmetric matrix $A$, there exists an eigendecomposition $A = Q\Lambda Q^T$, where $Q$ is a unitary matrix and $\Lambda$ is a diagonal matrix containing the eigenvalues. However, when $A \in \mathbb{C}^{n\times n}$ is a general square matrix with eigendecomposition $A = X\Lambda X^{-1}$, studying the invariant subspaces of $A$ becomes more challenging. First, the eigenvalues in $\Lambda$ may not be real. Moreover, the eigenvector matrix $X$ may not be unitary, and the eigenvectors in $X$ may not be orthogonal to each other either. As a result, the condition numbers $\kappa_2(X)$ and $\kappa_2(X_1)$ can be arbitrarily large; examples arise when the matrix $A$ is close to a Jordan form (e.g., Example 4.3.1). To better understand the behaviour of the invariant subspace $\mathrm{span}(X_1)$ under perturbation, it is important to investigate the impact of the condition numbers on its stability. Chapter 1 provides the mathematical techniques for invariant subspace perturbation analysis.

In the setting of Section 4.2, we consider a diagonalizable matrix $A \in \mathbb{C}^{n\times n}$ whose eigenvector matrix is partitioned into two blocks $X = [X_1, X_2]$. Assume the perturbed matrix $\widetilde A = A + \Delta A$ has a similar block structure: $\widetilde X$ is the eigenvector matrix of $\widetilde A$ with partition $\widetilde X = [\widetilde X_1, \widetilde X_2]$. Define $V := (X^{-1})^*$ and $\widetilde V := (\widetilde X^{-1})^*$; then $V^*X = \widetilde V^*\widetilde X = I$. Similarly to the partitions of $X$ and $\widetilde X$, we also partition the matrices $V$ and $\widetilde V$ into two blocks, i.e., $V = [V_1, V_2]$ and $\widetilde V = [\widetilde V_1, \widetilde V_2]$.

Using the definitions above, we study the distance between the subspaces $\mathcal X_1 = \mathrm{span}(X_1)$ and $\widetilde{\mathcal X}_1 = \mathrm{span}(\widetilde X_1)$. Specifically, we are interested in bounding the quantity

$$\|\sin\Theta(\mathcal X_1, \widetilde{\mathcal X}_1)\| = \|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\|,$$

where the $\sin\Theta$ distance between two subspaces is defined in Section 1.2.3.
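Numerically, this quantity can be evaluated from the principal angles between the two spans; a small self-contained sketch (the helper name is ours):

```python
import numpy as np

def sin_theta_dist(B1, B2):
    """Spectral-norm sin(Theta) distance between span(B1) and span(B2).
    After orthonormalizing, the singular values of Q1* Q2 are the cosines of
    the principal angles; the distance is the sine of the largest angle."""
    Q1, _ = np.linalg.qr(B1)
    Q2, _ = np.linalg.qr(B2)
    c = np.linalg.svd(Q1.conj().T @ Q2, compute_uv=False)
    c = np.clip(c, 0.0, 1.0)
    return np.sqrt(1.0 - c.min() ** 2)
```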
4.2.1 Related works

Since the exact $\sin\Theta$ distance is hard to calculate, a central task in eigen-perturbation analysis is establishing useful upper bounds with simpler expressions Varah (1970); Stewart (1990); Greenbaum et al. (2020); Karow and Kressner (2014); Ipsen (2003); Golub and Wilkinson (1976); Demmel (1986); Kato (2013); Chatelin (2011); Gohberg et al. (2006); Davis and Kahan (1970). Explicitly, we prefer upper bounds expressed through simple quantities related to $A$ and $\Delta A$, such as $\|\Delta A\|$, $\|A\|$, the condition number of $X$, and the gap between $\Lambda_1$ and $\Lambda_2$, since these quantities are more likely to be given as prior information and/or are easier to estimate when the exact $A$ is unknown.

The most well-known bound is perhaps the one given by the Davis-Kahan theorem Davis and Kahan (1970) (see Theorem 1.3.10), which states that for Hermitian matrices, the $\sin\Theta$ distance depends only on $\|\Delta A\|$ and the eigengap. For non-Hermitian matrices, however, it is believed that the stability of $\mathcal X_1$ also depends on the condition number of the eigenvector matrix $X$: an ill-conditioned $X$ would cause instability of the invariant subspaces. However, a tight relationship between the condition number and the $\sin\Theta$ distance is yet to be established. This problem has been partially studied in several papers Haviv and Ritov (1994); Ipsen (2003); Stewart (1973, 1990); Varah (1970). The best known relation between the $\sin\Theta$ distance and the condition numbers was provided in Stewart (1971); here we state a slightly simplified version of this classical result.

Theorem 4.2.1 (simplified version of Theorem 4.1 in Stewart (1971)). Provided that

$$\|\Delta A\|(\|A\| + \|\Delta A\|) < \frac14\big(\mathrm{sep}(Q_{X_1}^* A Q_{X_1}, Q_{V_2}^* A Q_{V_2}) - 2\|\Delta A\|\big)^2, \qquad (4.1)$$

the following error bound holds:

$$\|\tan\Theta(Q_{X_1}, Q_{\widetilde X_1})\| < \frac{2\|\Delta A\|}{\mathrm{sep}(Q_{X_1}^* A Q_{X_1}, Q_{V_2}^* A Q_{V_2}) - 2\|\Delta A\|}, \qquad (4.2)$$

where for any pair of matrices $L_1, L_2$, $\mathrm{sep}(L_1, L_2) := \inf_{\|T\|=1}\|TL_1 - L_2T\|$.

Since $\|\tan\Theta\|$ is larger than $\|\sin\Theta\|$, this also gives a $\sin\Theta$ bound. Theorem 4.2.1 indicates that the key quantity determining the stability of invariant subspaces is neither the eigengap nor the condition numbers, but a new quantity called the separation, defined through the norm of a Sylvester operator related to $A$. Despite its efficacy in characterizing subspace stability, the separation is difficult to estimate in practice when the original matrix $A$ is not fully known. In contrast, it is much easier to come up with a rough estimate of (or be given some prior knowledge of) the condition number of the eigensystem as well as the eigengap. Therefore, characterizing the stability directly in terms of the condition number (of the eigensystem) and the eigengap is still of great practical interest.

One can replace the separation in (4.2) with the condition numbers using the following inequalities:

$$\mathrm{sep}(Q_{X_1}^* A Q_{X_1}, Q_{V_2}^* A Q_{V_2}) = \mathrm{sep}(R_{X_1}\Lambda_1 R_{X_1}^{-1}, R_{V_2}\Lambda_2 R_{V_2}^{-1}) \ge \frac{\mathrm{sep}(\Lambda_1, \Lambda_2)}{\kappa_2(R_{X_1})\kappa_2(R_{V_2})}, \qquad (4.3)$$

where the equality follows by direct calculation and the inequality is from Chapter V of Stewart (1990). Noticing that $\kappa_2(R_{X_1}) = \kappa_2(X_1)$ and $\kappa_2(R_{V_2}) = \kappa_2(V_2)$, substituting (4.3) into (4.2) leads to the following bound in terms of condition numbers:

$$\|\tan\Theta(\mathcal X_1, \widetilde{\mathcal X}_1)\| < \frac{2\kappa_2(X_1)\kappa_2(V_2)\|\Delta A\|}{[\mathrm{sep}(\Lambda_1, \Lambda_2) - 2\kappa_2(X_1)\kappa_2(V_2)\|\Delta A\|]_+}, \qquad (4.4)$$

where $[\cdot]_+$ is the positive part of the input, used to guarantee that the bound is positive. The separation $\mathrm{sep}(\Lambda_1, \Lambda_2)$ only depends on the eigenvalues of $A$ and can be bounded by a constant smaller than the common eigengap $\delta_1 := \min_{\lambda_i\in S(\Lambda_1),\,\lambda_j\in S(\Lambda_2)}|\lambda_i - \lambda_j|$; more discussion can be found in the Appendix. We note that the bound (4.4) is quite tight when $\kappa_2(X_1)$ and $\kappa_2(V_2)$ are small.
The minimal condition number is achieved when $A$ is Hermitian, for which $X$ is unitary and $\kappa_2(X) = \kappa_2(X_1) = \kappa_2(V_2) = 1$. Then (4.4) reduces to
\[
\|\tan\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| < \frac{2\|\Delta A\|}{\delta_1 - 2\|\Delta A\|}, \tag{4.5}
\]
which matches the tight Davis–Kahan bound Davis and Kahan (1970) ensuring the stability of the subspace whenever it has a sufficient eigengap from the others. However, the bound (4.4) suggests that $\mathcal{X}_1$ becomes unstable (zero tolerance of noise) as $\kappa_2(X_1) \to \infty$. That is, with fixed eigenvalues, fixed $\kappa_2(V_2)$, and fixed noise magnitude $\|\Delta A\|$, as $\kappa_2(X_1) \to \infty$ the right-hand side of (4.4) also goes to infinity and the bound becomes uninformative.

A similar result has been derived in Varah (1970). The original result in Varah (1970) was stated for both diagonalizable matrices and Jordan forms; here, to avoid distractions, we only state it for diagonalizable matrices:
\[
\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| \le \frac{C_r\, \|\Delta A\|}{\sigma_{\min}(X)\,\sigma_{\min}(X_1)}. \tag{4.6}
\]
Here $C_r$ is a constant depending on $r := \dim(\mathcal{X}_1)$ and the eigengap $\delta_1$ defined above. Recall that we required the columns of $X$ to be normalized, which leads to $1 \le \sigma_{\max}(X_1) = \kappa_2(X_1)\,\sigma_{\min}(X_1) \le \sqrt r$. From this, we see that the bound (4.6) is no smaller than
\[
\frac{C_r\, \kappa_2(X_1)\,\|\Delta A\|}{\sqrt r\,\sigma_{\min}(X)}.
\]
Thus (4.6) also suggests an instability of $\mathcal{X}_1$ as $\kappa_2(X_1) \to \infty$, which is quite pessimistic. In Ipsen (2003), the author proved that one can replace the absolute error $\|\Delta A\|$ in (4.4) by a relative error $\|A^{-k}\, \Delta A\, \widetilde A^{-l}\|$, where $k$ and $l$ are positive numbers. However, the dependence on the condition numbers was not improved.

4.3 Motivating example

As discussed in Section 4.2.1, (4.4) is the state-of-the-art relation between the $\sin\Theta$ distance and the condition numbers of the eigensystem. However, when $\kappa_2(X_1)$ gets large, (4.4) is no longer tight, as can be seen from the following example. It motivates us to derive a new perturbation bound in Section 4.4.

Example 4.3.1. Consider the following matrix
\[
A = \begin{bmatrix} B & 0 \\ 0 & 1/2 \end{bmatrix}, \qquad B = \begin{bmatrix} 1 & 1 \\ \epsilon & 1 \end{bmatrix}.
\]
Assume the perturbation matrix $\Delta A \in \mathbb{C}^{3\times 3}$ is arbitrary with $\epsilon_1 := \|\Delta A\| = o(1)$. Also assume $\epsilon = o(1)$. Let $X_1$ be the $3\times 2$ matrix containing the two eigenvectors of $A$ corresponding to the block $B$. We want to determine the stability of $\mathcal{X}_1 = \mathrm{span}(X_1)$.

Since $B$ is close to a Jordan block, $\kappa_2(X_1)$ must be large. To verify this, we first obtain the closed-form expressions for $X_1$ and $X_2$:
\[
X_1 = \frac{1}{\sqrt{1+\epsilon}} \begin{bmatrix} 1 & 1 \\ \epsilon^{1/2} & -\epsilon^{1/2} \\ 0 & 0 \end{bmatrix}, \qquad X_2 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.
\]
This immediately implies $\mathcal{X}_1 = \mathrm{span}\{e_1, e_2\}$ ($e_i$, $i = 1, 2$, are the canonical basis vectors) and $\kappa_2(X_1) = \epsilon^{-1/2} \gg 1$. In addition, the eigengap of this $A$ is sufficiently large, since the eigenvalues in $\Lambda_1$ are $1 \pm \epsilon^{1/2}$, and $\Lambda_2 = 1/2$.

With a general perturbation $\Delta A$, there is no closed-form expression for $\widetilde{\mathcal{X}}_1$. We therefore first use some special perturbations to make our point. Let $E_{i,j}$ be the $3\times 3$ matrix whose $(i,j)$th entry equals 1 and whose other entries equal 0.
Consider special perturbations of the form $\Delta A = \epsilon_1 E_{i,j}$ with $i, j \in \{1, 2, 3\}$, where $\epsilon_1 = o(1)$ is a small constant different from $\epsilon$. If $(i,j) \in \{(1,1), (1,2), (2,1), (2,2), (1,3), (2,3), (3,3)\}$, one can verify that $\widetilde{\mathcal{X}}_1$ is exactly the same as $\mathcal{X}_1$, so $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| = 0$. Otherwise, $(i,j) = (3,1)$ or $(i,j) = (3,2)$. For these two, one can verify that under the perturbation $\Delta A = \epsilon_1 E_{i,j}$, the eigenvalues of $A$ do not change. As for the eigenvectors, if $(i,j) = (3,1)$, then the perturbed eigenvectors are
\[
\widetilde X_1 = \mathrm{normalize}\!\left( \begin{bmatrix} 1 & 1 \\[2pt] \epsilon^{1/2} & -\epsilon^{1/2} \\[2pt] \dfrac{2\epsilon_1}{1+2\sqrt\epsilon} & \dfrac{2\epsilon_1}{1-2\sqrt\epsilon} \end{bmatrix} \right),
\]
where $\mathrm{normalize}$ stands for column-wise normalization. We can directly compute the distance between $\mathcal{X}_1$ and $\widetilde{\mathcal{X}}_1$ to get $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| \le O(\epsilon_1) = O(\|\Delta A\|)$. Similarly, for $(i,j) = (3,2)$, the same calculation again yields $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| \le O(\|\Delta A\|)$. Therefore, for all the special perturbations of the form $\Delta A = \epsilon_1 E_{i,j}$, we have $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| \le O(\|\Delta A\|)$. Notice that the $\sin\Theta$ distance does not get worse as $\kappa_2(X_1) = \epsilon^{-1/2} \to \infty$, suggesting that the bound (4.4) may be suboptimal.

To get additional supporting evidence, we tested random perturbations and summarized the values of $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\|$ in Table 4.1.

$\epsilon$                 1e-2      1e-4      1e-6      1e-8      1e-10
Estimated by (4.4)         5.00e-5   4.08e-4   4.00e-3   0.0042    0.67
True $\sin\Theta$ distance 2.07e-6   1.99e-6   1.99e-6   1.99e-7   1.99e-6

Table 4.1: Comparison of the true $\sin\Theta$ distance in Example 4.3.1 with its upper bound computed from (4.4) for various values of $\epsilon$. The perturbation matrix $\Delta A$ is a realization of a random Gaussian matrix rescaled to the fixed norm $\|\Delta A\| = \epsilon_1 = 10^{-6}$. With this fixed $\Delta A$, we let the condition number $\kappa_2(X_1) \to \infty$ by letting $\epsilon \to 0$ in Example 4.3.1. We see that the true $\sin\Theta$ distance does not vary with $\epsilon$ while the upper bound (4.4) does, suggesting the suboptimality of (4.4).

In the simulation, we added a random perturbation with energy $\epsilon_1$ to the matrix $A$ defined in Example 4.3.1 and let the $\epsilon$ in $A$ go to 0. For each value of $\epsilon$, we computed the true $\sin\Theta$ distance as well as the value of the upper bound (4.4). We observed that the true $\sin\Theta$ distance does not change much as $\epsilon \to 0$ while the upper bound blows up, which suggests a suboptimality of the bound. A few more details about the simulation: the random perturbation $\Delta A$ in this experiment was obtained by rescaling an i.i.d. Gaussian matrix to have a spectral norm of $10^{-6}$. Because in Example 4.3.1 the columns of $X_1$ are the eigenvectors of $A$ corresponding to the two eigenvalues of largest magnitude, in the simulation we also took $\widetilde X_1$ to be the eigenvectors of $\widetilde A$ associated with the two eigenvalues of largest magnitude. The true $\sin\Theta$ distance in the table was computed by $\sqrt{1 - \sigma_{\min}^2\!\left(Q_{X_1}^* Q_{\widetilde X_1}\right)}$, which equals $\|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\|$.
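The experiment behind Table 4.1 is straightforward to reproduce. Below is a sketch under stated assumptions (Python/NumPy; the random seed, and hence the exact printed values, are ours and will differ slightly from the table):

```python
import numpy as np

def sin_theta(U, W):
    """Spectral-norm sin(Theta) distance between span(U) and span(W)."""
    Qu, _ = np.linalg.qr(U)
    Qw, _ = np.linalg.qr(W)
    c = np.linalg.svd(Qu.conj().T @ Qw, compute_uv=False)
    return np.sqrt(max(0.0, 1.0 - c.min() ** 2))

def top2_eigvecs(M):
    """Eigenvectors of M for its two largest-magnitude eigenvalues."""
    vals, vecs = np.linalg.eig(M)
    return vecs[:, np.argsort(-np.abs(vals))[:2]]

rng = np.random.default_rng(0)
eps1 = 1e-6
Delta = rng.standard_normal((3, 3))
Delta *= eps1 / np.linalg.norm(Delta, 2)   # fix the spectral norm to eps1

for eps in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10]:
    A = np.array([[1.0, 1.0, 0.0],
                  [eps, 1.0, 0.0],
                  [0.0, 0.0, 0.5]])
    d = sin_theta(top2_eigvecs(A), top2_eigvecs(A + Delta))
    print(f"eps = {eps:.0e}:  sin Theta = {d:.2e}")
```

The printed distances stay at the level of eps1 as eps shrinks, while the value of (4.4) blows up, which is exactly the discrepancy recorded in Table 4.1.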
4.4 A new invariant subspace perturbation bound

Both the theoretical argument for special perturbations and the numerical results for random perturbations suggest that for the matrix $A$ in Example 4.3.1, the perturbation of $\mathcal{X}_1$ is $O(\epsilon_1) = O(\|\Delta A\|)$, which is unaffected by the large condition number $\kappa_2(X_1) = \epsilon^{-1/2}$ as $\epsilon \to 0$. In this section, we show that this phenomenon is no coincidence, and that $\kappa_2(X_1)$ can indeed be removed from the previous bound (4.4).

In order to present the main results of this chapter, we need to assume that the spectra of $A$ and $\widetilde A$ have gaps. Since the study of eigenvalue perturbation is out of the scope of this dissertation, we simply take the existence of eigengaps as an assumption.

Assumption 1 [Eigengap]: Suppose $S(\Lambda_1)$ and $S(\Lambda_2)$ are well-separated in the sense that $\min_{\lambda \in S(\Lambda_1),\, \sigma \in S(\Lambda_2)} |\lambda - \sigma| > 0$.

The following assumption requires that the gap still exists after perturbation.

Assumption 2 [Eigengap under perturbation]: Suppose $S(\widetilde\Lambda_1)$ and $S(\Lambda_2)$ are well-separated with some eigengap $\delta_\lambda > 0$. More explicitly, $0 < \delta_\lambda := \min_{\lambda \in S(\widetilde\Lambda_1),\, \sigma \in S(\Lambda_2)} |\lambda - \sigma|$.

The following theorem improves upon (4.4) when $\kappa_2(X_1)$ is large.

Theorem 4.4.1. Assume $A$ and $\widetilde A$ are diagonalizable matrices with decompositions (1.2) and (1.3), satisfying Assumption 2 with an eigengap $\delta_\lambda$. Let $r = \#(S(\Lambda_1))$ be the number of eigenvalues in $\Lambda_1$ and denote by $\widetilde\lambda_j$, $j = 1, \ldots, r$, the $j$th diagonal element of $\widetilde\Lambda_1$. Then we have
\[
\left\|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\right\| \le \frac{\kappa_2(V_2)\,\|\Delta A\|_F}{a} \prod_{j=1}^{r} \left( 1 + \frac{a}{\min_{\lambda_k \in S(\Lambda_2)} |\widetilde\lambda_j - \lambda_k|} \right) \le \frac{\kappa_2(V_2)\,\|\Delta A\|_F}{a} \left( 1 + \frac{a}{\delta_\lambda} \right)^{\!r}, \tag{4.7}
\]
where $a = \|A\| + \|\Delta A\| + \rho(\Lambda_2)$, and $\rho(\Lambda_2)$ is the spectral radius of $\Lambda_2$.

Remark 4.4.2. Notice that (4.7) does not contain $\kappa_2(X_1)$ and, when applied to Example 4.3.1, agrees with the numerical observation.

We note that Theorem 4.4.1 states that
\[
\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| \le \kappa_2(V_2)\, f\!\left(\|A\|, \|\Delta A\|, \delta_1, r\right), \tag{4.8}
\]
where $f$ is some function of $\|A\|$, $\|\Delta A\|$, $\delta_1$, and $r \equiv \dim(\mathcal{X}_1)$. The new bound ensures that the stability will not keep deteriorating as $\kappa_2(X_1) \to \infty$. In particular, when $A$ approaches a Jordan form, we may still stably estimate its invariant subspaces from noisy data. Note that a previous result Varah (1970) guaranteed the stability of invariant subspaces for deficient matrices (which correspond to $\kappa_2(X_1) = \infty$), while our result holds for the much larger class of diagonalizable matrices with large condition numbers; the proof technique is also different.

Another important implication of Theorem 4.4.1 is that if a matrix has an ill-conditioned eigensystem, then its invariant subspaces are likely to be more stable than its eigenvalues under perturbation. To see why, recall that the perturbation of eigenvalues is controlled by the Bauer–Fike theorem, $|\lambda - \widetilde\lambda| \le \kappa_2(X)\,\|\Delta A\|$, which is known to be tight. As the matrix gets more and more ill-conditioned, this bound is likely to be larger than (4.8) since $\kappa_2(X) \ge \kappa_2(V_2)$. Hence the eigenvalues are perturbed more than the subspaces.
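Before comparing with the Bauer–Fike bound more concretely, note that the right-hand side of (4.7) is directly computable from $A$ and $\Delta A$. The following hedged sketch (Python/NumPy; the function name and the convention of matching eigenvalue blocks by magnitude are our choices) evaluates the first bound in (4.7); applied to Example 4.3.1 it stays at the level $O(\epsilon_1)$ as $\epsilon \to 0$, consistent with Remark 4.4.2.

```python
import numpy as np

def bound_4_7(A, Delta, r):
    """Evaluate the first upper bound in (4.7) for the invariant subspace
    associated with the r largest-magnitude eigenvalues of A."""
    vals, X = np.linalg.eig(A)
    order = np.argsort(-np.abs(vals))
    vals, X = vals[order], X[:, order]
    X = X / np.linalg.norm(X, axis=0)          # normalized eigenvector columns
    lam2 = vals[r:]                            # spectrum of Lambda_2
    V = np.linalg.inv(X).conj().T              # V^* X = I
    s = np.linalg.svd(V[:, r:], compute_uv=False)
    kappa_V2 = s.max() / s.min()               # kappa_2(V_2)
    vals_t = np.linalg.eigvals(A + Delta)
    lam1_t = vals_t[np.argsort(-np.abs(vals_t))][:r]   # spectrum of Lambda_1~
    a = np.linalg.norm(A, 2) + np.linalg.norm(Delta, 2) + np.abs(lam2).max()
    prod = np.prod([1 + a / np.abs(lt - lam2).min() for lt in lam1_t])
    return kappa_V2 * np.linalg.norm(Delta, 'fro') / a * prod
```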
To be a bit more concrete, for the $A$ in Example 4.3.1, the Bauer–Fike bound on the eigenvalues implies $|\lambda - \widetilde\lambda| \sim O\!\left(\frac{\epsilon_1}{\sqrt\epsilon}\right)$, while our $\sin\Theta$ bound on the invariant subspace $\mathcal{X}_1$ implies $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| = O(\epsilon_1)$. Of course, for the invariant subspaces to be detectable after perturbation, we need the eigengap to persist in the perturbed matrix, which can be guaranteed by the Bauer–Fike theorem if we require $2\epsilon_1 \lesssim \sqrt\epsilon$. Under this requirement, the Bauer–Fike theorem indicates that the perturbed eigenvalues are at most $O(1)$ away from the original ones, while our result indicates that the perturbed invariant subspace is only $O(\sqrt\epsilon)$ away from the original one. Hence the invariant subspace is much more stable than the eigenvalues in this case.

4.4.1 Tightness of the bound (4.7)

At first glance, the bound in (4.7) contains the $r$th power of the eigengap in the denominator, which seems unusual. In this section, we demonstrate that it is actually tight for general matrices.

The dependence on $\delta_\lambda^r$ is tight. We give examples showing that the dependence on $\delta_\lambda$ is tight, i.e., the $r$th power of $\delta_\lambda$ in the upper bound can be attained.

Example 4.4.3. We first present an example for $r = 2$:
\[
A = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1-\delta & 0 \\ 0 & 0 & 1-2\delta \end{bmatrix}, \qquad \Delta A = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & \epsilon & 0 \end{bmatrix}.
\]
Here we set $\epsilon = \min\{o(1), O(\delta^2)\}$ and $0 < \delta < 1$. We consider the perturbation of the subspace $\mathcal{X}_1$ spanned by the two eigenvectors of $A$ with the largest eigenvalues, so $r := \dim(\mathcal{X}_1) = 2$. It is immediate to verify that $\mathcal{X}_1 = \mathrm{span}(e_1, e_2)$ ($e_i$ is the $i$th canonical basis vector), $V_2 = e_3$, and the eigengap between the two largest eigenvalues and the smallest one is $\delta_\lambda = \delta$. With the given perturbation, one can verify that $\lambda_i = \widetilde\lambda_i$ for $i = 1, 2, 3$. The perturbed subspace can also be calculated:
\[
\widetilde{\mathcal{X}}_1 = \mathrm{span}\left\{ \begin{bmatrix} 1 \\[2pt] \frac{1}{\delta} \\[2pt] \frac{\epsilon}{2\delta^2} \end{bmatrix}, \begin{bmatrix} 0 \\[2pt] 1 \\[2pt] \frac{\epsilon}{\delta} \end{bmatrix} \right\}.
\]
In $\widetilde{\mathcal{X}}_1$, we pick a special vector that has a large angle with the original subspace $\mathcal{X}_1$: pick $x = \left[1,\, 0,\, -\frac{\epsilon}{2\delta^2}\right]^T \in \widetilde{\mathcal{X}}_1$. One can verify that the sine of the angle between $x$ and $\mathcal{X}_1$ reaches $\Omega\!\left(\frac{\epsilon}{\delta^2}\right) = \Omega\!\left(\kappa_2(V_2)\frac{\|\Delta A\|_F}{\delta_\lambda^r}\right)$, since $\kappa_2(V_2) = 1$, $r = 2$, and $\|\Delta A\|_F = \epsilon$ (the Big-$\Omega$ notation was defined in Section 1.1). Since $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\|$ is at least as large as the sine of the angle between $x$ and $\mathcal{X}_1$, we get $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| \ge \Omega\!\left(\kappa_2(V_2)\frac{\|\Delta A\|_F}{\delta_\lambda^r}\right)$, and therefore the $r$th power of the eigengap is attained for the case $r = 2$. The above construction can be easily generalized to any $r > 2$.
Example 4.4.4. For general $r$, let
\[
A = I_{r+1} - \begin{bmatrix}
0 & & & & \\
-1 & \delta & & & \\
 & \ddots & \ddots & & \\
 & & -1 & (r-1)\delta & \\
 & & & 0 & r\delta
\end{bmatrix}, \qquad
\Delta A = \begin{bmatrix}
0 & \cdots & \cdots & 0 \\
\vdots & & & \vdots \\
0 & \cdots & 0 & 0 \\
0 & \cdots & \epsilon & 0
\end{bmatrix},
\]
where $\Delta A \in \mathbb{R}^{(r+1)\times(r+1)}$ has its only nonzero entry $\epsilon$ in position $(r+1, r)$; except for this $\epsilon$, all entries of the perturbation matrix $\Delta A$ are zero. Again, we set $\epsilon = \min\{o(1), O(\delta^r)\}$, $0 < \delta < 1$.

Consider the perturbation of the subspace spanned by the eigenvectors associated with the largest $r$ eigenvalues, which is the subspace $\mathcal{X}_1 = \mathrm{span}(e_1, e_2, \ldots, e_r)$. One can easily verify that $V_2 = e_{r+1}$, and the eigengap is still $\delta_\lambda = \delta$. We can again pick a special vector in the perturbed invariant subspace $\widetilde{\mathcal{X}}_1$:
\[
x = \left[1,\, 0,\, 0,\, \cdots,\, 0,\, \frac{(-1)^{r-1}\epsilon}{r!\,\delta^r}\right]^T.
\]
The sine of the angle between $x$ and $\mathcal{X}_1$ then reaches $\Omega\!\left(\frac{\epsilon}{\delta^r}\right) = \Omega\!\left(\kappa_2(V_2)\frac{\|\Delta A\|_F}{\delta_\lambda^r}\right)$. Therefore the power $r$ of the eigengap in the denominator of (4.7) is attained.

The $\kappa_2(V_2)$ in (4.7) cannot be removed. The bound in Theorem 4.4.1 successfully got rid of $\kappa_2(X_1)$. The next example shows that we cannot further remove $\kappa_2(V_2)$ from it.

Example 4.4.5. We first consider a matrix of dimension three:
\[
A = \begin{bmatrix} 1+\delta & 0 & 0 \\ 0 & 1 & 0 \\ 0 & \frac{1}{2} & 1-\delta_1 \end{bmatrix}, \qquad \Delta A = \begin{bmatrix} 0 & 0 & 0 \\ \epsilon & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},
\]
where $\epsilon = \min\{o(1), O(\delta^2)\}$ and $0 < \delta, \delta_1 \ll 1$. Consider $\mathcal{X}_1$ to be the one-dimensional subspace spanned by the eigenvector $[1, 0, 0]^T$ of $A$ associated with the largest eigenvalue, so $r = 1$. Under the given perturbation, one can check that the perturbed eigenvector associated with the largest eigenvalue is $\left[1,\, \frac{\epsilon}{\delta},\, \frac{\epsilon}{2\delta(\delta+\delta_1)}\right]^T$, so $\widetilde{\mathcal{X}}_1 = \mathrm{span}\left\{\left[1,\, \frac{\epsilon}{\delta},\, \frac{\epsilon}{2\delta(\delta+\delta_1)}\right]^T\right\}$. As a result,
\[
\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| = \Omega\!\left(\frac{\epsilon}{\delta(\delta+\delta_1)}\right).
\]
It is also immediate that $\kappa_2(V_2) = \Omega\!\left(\frac{1}{\delta_1}\right)$, $\delta_\lambda = \delta$, $\|\Delta A\|_F = \epsilon$, and $\|A\| = O(1)$. Plugging these into (4.7), we see that our upper bound is $O\!\left(\frac{\epsilon}{\delta\delta_1}\right)$, which indeed bounds the actual $\sin\Theta$ angle. But if $\kappa_2(V_2)$ were absent from the bound (4.7), then (4.7) would only be $O\!\left(\frac{\epsilon}{\delta}\right)$, which is smaller than the actual $\sin\Theta$ angle. Therefore the appearance of $\kappa_2(V_2)$ is necessary.

The same idea allows us to construct such examples in any dimension. Specifically, for any dimension $n$, we can define
\[
A = \begin{bmatrix} 1+\delta & & & \\ & 1 & & \\ & \frac{1}{2} & 1-\delta_1 & \\ & & & (1-2\delta_1)\, I_{n-3} \end{bmatrix}, \qquad
\Delta A = \begin{bmatrix} 0 & \cdots & 0 \\ \epsilon & & \vdots \\ \vdots & & \vdots \\ 0 & \cdots & 0 \end{bmatrix}.
\]
Here the perturbation matrix $\Delta A$ contains only one nonzero element, at its $(2,1)$ entry. Let $\mathcal{X}_1$ be the subspace spanned by the first eigenvector, so again $r = 1$. Direct calculation gives $\mathcal{X}_1 = \mathrm{span}(e_1)$; the eigenvector of $\widetilde A$ associated with the largest eigenvalue is $\left[1,\, \frac{\epsilon}{\delta},\, \frac{\epsilon}{2\delta(\delta+\delta_1)},\, 0, \ldots, 0\right]^T$, hence $\widetilde{\mathcal{X}}_1 = \mathrm{span}\left\{\left[1,\, \frac{\epsilon}{\delta},\, \frac{\epsilon}{2\delta(\delta+\delta_1)},\, 0, \ldots, 0\right]^T\right\}$. We also have $\kappa_2(V_2) = \Omega\!\left(\frac{1}{\delta_1}\right)$, $\|\Delta A\|_F = \epsilon$, and $\|A\| = O(1)$.
Then the same discussion implies that the appearance of $\kappa_2(V_2)$ in the upper bound is necessary.

A remark on the $a$ in the bound. Observe that the bound in (4.7) also contains $a$, which is essentially the spectral norm of $A$. We argue that the presence of $a$ is necessary, as it ensures that the bound is scaling invariant. More specifically, replacing $A$ and $\widetilde A$ by $tA$ and $t\widetilde A$ for any scalar $t \ne 0$, we see that our bound in (4.7) does not change, which matches the fact that the angle between the original and the perturbed subspaces is invariant to a universal scaling.

4.4.2 Proof of the theorem

In order to prove Theorem 4.4.1, we first state an equivalent expression of $\|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\|$:
\[
\|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\| = \|Q_{V_2}^* Q_{\widetilde X_1}\|.
\]
Here $\widetilde X_1 = Q_{\widetilde X_1} R_{\widetilde X_1}$ and $V_2 = Q_{V_2} R_{V_2}$ are the QR decompositions of $\widetilde X_1$ and $V_2$, respectively. The proof is based on the following lemma from Li (1994).

Lemma 4.4.6 (Lemma 2.1 in Li (1994)). Let $U_1, \widetilde U_1 \in \mathbb{C}^{n\times r}$ ($1 \le r \le n-1$) with $U_1^* U_1 = \widetilde U_1^* \widetilde U_1 = I$, and let $\mathcal{X}_1 = \mathrm{span}(U_1)$ and $\widetilde{\mathcal{X}}_1 = \mathrm{span}(\widetilde U_1)$. If $\widetilde U = [\widetilde U_1, \widetilde U_2]$ is a unitary matrix, then
\[
\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| = \|\widetilde U_2^* U_1\|.
\]
In Lemma 4.4.6, let $U_1 = Q_{\widetilde X_1}$, $\widetilde U_1 = Q_{X_1}$, and $\widetilde U = [Q_{X_1}, Q_{V_2}]$. Noticing that $V^* X = I$, we can verify that $\widetilde U^* \widetilde U = I$. Therefore, it holds that $\|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\| = \|Q_{V_2}^* Q_{\widetilde X_1}\|$.

The next lemma gives an equivalent expression of $Q_{V_2}^* Q_{\widetilde X_1}$.

Lemma 4.4.7. Using the notation of Section 4.2, it holds that
\[
Q_{V_2}^* Q_{\widetilde X_1} = (R_{V_2}^{-1})^* \left( F \circ \left( V_2^*\, \Delta A\, \widetilde X_1 \right) \right) R_{\widetilde X_1}^{-1}, \tag{4.9}
\]
where $\circ$ denotes the Hadamard product, and $F \in \mathbb{C}^{(n-r)\times r}$ is defined as $F_{i,j} = \left(\widetilde\lambda_j - \lambda_{r+i}\right)^{-1}$ for $i = 1, \ldots, n-r$, $j = 1, \ldots, r$.

Lemma 4.4.7 has been implicitly derived in Li (1998); for completeness, we provide its proof here. An alternative proof using complex analysis can be found in the appendix.

Proof: Since $X^{-1} A X = \Lambda$ and $\widetilde X^{-1} \widetilde A \widetilde X = \widetilde\Lambda$, we have
\[
X^{-1}\, \Delta A\, \widetilde X = -X^{-1} (A - \widetilde A) \widetilde X = -\Lambda X^{-1} \widetilde X + X^{-1} \widetilde X \widetilde\Lambda.
\]
Considering the $(n-r) \times r$ block in the lower-left corner of this equation, we have
\[
V_2^*\, \Delta A\, \widetilde X_1 = \bar F \circ \left( V_2^* \widetilde X_1 \right), \tag{4.10}
\]
where $\bar F_{i,j} = \widetilde\lambda_j - \lambda_{r+i}$, $1 \le i \le n-r$, $1 \le j \le r$. Let $F = 1/\bar F$, where the division is carried out elementwise; then (4.10) becomes
\[
V_2^* \widetilde X_1 = F \circ \left( V_2^*\, \Delta A\, \widetilde X_1 \right).
\]
Last, we replace $V_2^*$ and $\widetilde X_1$ on the left-hand side by their QR decompositions and move the $R$ factors to the right-hand side to obtain the equation in the statement of this lemma. □

Denoting $M = \left( F \circ \left( V_2^*\, \Delta A\, \widetilde X_1 \right) \right) R_{\widetilde X_1}^{-1}$, by Lemma 4.4.7 we have
\[
\left\|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\right\| = \left\| Q_{V_2}^* Q_{\widetilde X_1} \right\| = \left\| (R_{V_2}^{-1})^* M \right\| \le \frac{\|M\|}{\sigma_{\min}(R_{V_2})}. \tag{4.11}
\]
In order to bound $\|M\|$, we establish the following lemma, which gives an equivalent expression of $M$.

Lemma 4.4.8. Denote $M = [m_1, m_2, \ldots, m_{n-r}]^*$.
Then for 1 โ‰ค ๐‘– โ‰ค ๐‘› โˆ’ ๐‘Ÿ, the ๐‘–th row of ๐‘€ can be expressed as 1  โˆ— ฮ”๐ด ๐ดห† ๐‘Ÿโˆ’1 โˆ’ ๐œŽ ๐ดห† ๐‘Ÿโˆ’2 + ๐œŽ ๐ดห† ๐‘Ÿโˆ’3 โˆ’ ยท ยท ยท + (โˆ’1) ๐‘Ÿโˆ’1 ๐œŽ  ๐‘š๐‘–โˆ— = ๐‘‰2,๐‘– 1 2 ๐‘Ÿโˆ’1 ๐ผ๐‘› ๐‘„ ๐‘‹ e , (4.12) (โˆ’1) ๐‘Ÿ+1 ๐œŽ๐‘Ÿ 1 where ๐‘‰2,๐‘– is the ๐‘–th column of ๐‘‰2 , ๐ดห† = ๐ด e โˆ’ ๐œ†๐‘–+๐‘Ÿ ๐ผ๐‘› , ๐ผ๐‘› is the identity matrix of size ๐‘›, ๐œ†๐‘–+๐‘Ÿ is the ๐‘–th diagonal element in ฮ›2 , ๐œŽ๐‘˜ is the homogeneous symmetric polynomial of order ๐‘˜ in ๐‘Ÿ variables, that is โˆ‘๏ธ ๐œŽ๐‘˜ = ๐œ†ห†๐‘–1 ๐œ†ห†๐‘–2 ยท ยท ยท ๐œ†ห†๐‘– ๐‘˜ , ๐‘˜ = 1, 2, ..., ๐‘Ÿ. 1โ‰ค๐‘– 1 <๐‘–2 ,...,๐‘– ๐‘˜โˆ’1 <๐‘– ๐‘˜ โ‰ค๐‘Ÿ Here ๐œ†ห† ๐‘— := ๐œ†e๐‘— โˆ’ ๐œ†๐‘–+๐‘Ÿ , ๐‘— = 1, ..., ๐‘Ÿ, and ๐œ† e๐‘— the ๐‘—th diagonal element in e ฮ›1 . By Assumption 2, ๐œ†ห† ๐‘— โ‰  0. Proof of Lemma 4.4.8: Let ๐‘๐‘–โˆ— be the ๐‘–th row of the (๐‘› โˆ’ ๐‘Ÿ) ร— ๐‘Ÿ matrix ๐‘‰2โˆ— ฮ”๐ด. Then by Lemma 4.4.7 , the ๐‘–th row of ๐‘€ is ๏ฃฎ 1 ๏ฃน ๏ฃฏ ๏ฃบ ๏ฃฏ ๐œ†e โˆ’๐œ†๐‘–+๐‘Ÿ ๏ฃฏ 1 ๏ฃบ  ๏ฃบ โˆ— โˆ— . ๏ฃบ โˆ’1 ๐‘š ๐‘– = ๐‘ ๐‘– ๐‘‹1 ๏ฃฏ . (4.13) ๏ฃฏ e . ๏ฃบ ๐‘…e . ๏ฃฏ ๏ฃบ ๐‘‹1 ๏ฃฏ ๏ฃบ ๏ฃฏ 1 ๏ฃบ ๏ฃฏ e๐‘Ÿ โˆ’๐œ†๐‘–+๐‘Ÿ ๏ฃบ ๐œ† ๏ฃฐ ๏ฃป Let ๐‘ฆ be an arbitrary unit vector, and let ๐œ†ห† ๐‘— := ๐œ† e๐‘— โˆ’๐œ†๐‘–+๐‘Ÿ , for ๐‘— = 1, ..., ๐‘Ÿ. In addition, define ๐‘ = ๐‘… โˆ’1 ๐‘ฆ, ๐‘‹ e 1 which means โˆฅ๐‘‹ e1 ๐‘โˆฅ = ๐‘„ e ๐‘ฆ = 1. ๐‘‹1 (4.14) Then (4.13) yields ๐‘Ÿ โˆ‘๏ธ 1 ๐‘š๐‘–โˆ— ๐‘ฆ = ๐‘๐‘–โˆ— ยญ (4.15) ยฉ ยช ๐‘ ๐‘—e ๐‘ฅ๐‘—ยฎ, ๐‘—=1 ๐œ†ห† ๐‘— ยซ ยฌ where for 1 โ‰ค ๐‘— โ‰ค ๐‘Ÿ, e ๐‘ฅ ๐‘— is the ๐‘—th column of ๐‘‹ e1 . Next, we derive an equivalent representation of the summation in (4.15) with the help of the characteristic polynomial. Define ๐ดห† = ๐ด e โˆ’ ๐œ†๐‘–+๐‘Ÿ ๐ผ, and 86 define its characteristic polynomial ๐‘ž(๐‘ง) = ๐‘ง โˆ’ ๐œ†ห† 1 ๐‘ง โˆ’ ๐œ†ห† 2 ยท ยท ยท ๐‘ง โˆ’ ๐œ†ห† ๐‘Ÿ .    Expanding ๐‘ž(๐‘ง) leads to ๐‘ž(๐‘ง) = ๐‘ง๐‘Ÿ โˆ’ ๐œŽ1 ๐‘ง๐‘Ÿโˆ’1 + ๐œŽ2 ๐‘ง๐‘Ÿโˆ’2 โˆ’ ยท ยท ยท + (โˆ’1) ๐‘Ÿโˆ’1 ๐œŽ๐‘Ÿโˆ’1 ๐‘ง + (โˆ’1) ๐‘Ÿ ๐œŽ๐‘Ÿ , (4.16) where ๐œŽ๐‘˜ , ๐‘˜ = 1, ..., ๐‘Ÿ are the homogeneous symmetric polynomials of order ๐‘˜ in ๐‘Ÿ variables, that is โˆ‘๏ธ ๐œŽ๐‘˜ = ๐œ†ห†๐‘–1 ๐œ†ห†๐‘–2 ยท ยท ยท ๐œ†ห†๐‘– ๐‘˜ . 1โ‰ค๐‘– 1 <๐‘– 2 ,...,๐‘– ๐‘˜โˆ’1 <๐‘– ๐‘˜ โ‰ค๐‘Ÿ Notice that ๐‘‹e1 is also invariant to ๐ด. ห† Since       ๐ดห† โˆ’ ๐œ†ห† ๐‘— ๐ผ ๐‘‹ e1 = ๐ด eโˆ’ ๐œ†e๐‘— ๐‘‹ e1 = ๐‘‹ e1 e ฮ›1 โˆ’ ๐œ† e๐‘— ๐ผ๐‘› , 1 โ‰ค ๐‘— โ‰ค ๐‘Ÿ,      ห† ๐‘‹ then ๐‘ž( ๐ด) e1 = ๐‘‹e1 e ฮ›1 โˆ’ ๐œ† e1 ๐ผ e ฮ›1 โˆ’ ๐œ† e2 ๐ผ ยท ยท ยท e ฮ›1 โˆ’ ๐œ† e๐‘Ÿ ๐ผ = 0. This means for any ๐‘ โˆˆ C๐‘Ÿ , we have   ห† ห† ห† ห†  ๐‘Ÿ ๐‘Ÿโˆ’1 ๐‘Ÿโˆ’1 ๐‘Ÿ 0 = ๐‘ž ๐ด ๐‘‹1 ๐‘ = ๐ด โˆ’ ๐œŽ1 ๐ด e โˆ’ ยท ยท ยท + (โˆ’1) ๐œŽ๐‘Ÿโˆ’1 ๐ด + (โˆ’1) ๐œŽ๐‘Ÿ ๐ผ๐‘› ๐‘‹ e1 ๐‘. Let us move the last term in the right-hand side to the left and for the terms left on the right, pull one ๐ดห† out of the bracket,   (โˆ’1) ๐‘Ÿ+1 ๐œŽ๐‘Ÿ ๐‘‹หœ 1 ๐‘ = ๐ดห† ๐‘Ÿโˆ’1 โˆ’ ๐œŽ1 ๐ดห† ๐‘Ÿโˆ’2 + ๐œŽ2 ๐ดห† ๐‘Ÿโˆ’3 โˆ’ ยท ยท ยท + (โˆ’1) ๐‘Ÿโˆ’1 ๐œŽ๐‘Ÿโˆ’1 ๐ผ๐‘› ๐ดห† ๐‘‹ e1 ๐‘. (4.17) Now let us take ๐‘ to be the vector consisting of ๐‘ ๐‘— = ๐‘ ๐‘— /๐œ†ห† ๐‘— , for ๐‘— = 1, ..., ๐‘Ÿ, then ๐ดห† ๐‘‹ e1 ๐‘ = ๐‘‹ e1 ๐‘ and e1 ๐‘ = ร๐‘Ÿ ๐‘‹ 1 ๐‘ฅ ๐‘ . 
Plugging these two relations into (4.17), we get
\[
(-1)^{r+1}\sigma_r \left( \sum_{j=1}^{r} c_j\, \frac{1}{\hat\lambda_j}\, \widetilde x_j \right) = \left( \hat A^{r-1} - \sigma_1 \hat A^{r-2} + \sigma_2 \hat A^{r-3} - \cdots + (-1)^{r-1}\sigma_{r-1} I_n \right) \widetilde X_1 c,
\]
or equivalently,
\[
\sum_{j=1}^{r} \frac{1}{\hat\lambda_j}\, c_j\, \widetilde x_j = \frac{\left( \hat A^{r-1} - \sigma_1 \hat A^{r-2} + \sigma_2 \hat A^{r-3} - \cdots + (-1)^{r-1}\sigma_{r-1} I_n \right) \widetilde X_1 c}{(-1)^{r+1}\sigma_r}. \tag{4.18}
\]
Plugging this back into the formula (4.15) for $m_i^* y$, we get
\[
m_i^* y = \frac{1}{(-1)^{r+1}\sigma_r}\, V_{2,i}^*\, \Delta A \left( \hat A^{r-1} - \sigma_1 \hat A^{r-2} + \sigma_2 \hat A^{r-3} - \cdots + (-1)^{r-1}\sigma_{r-1} I_n \right) Q_{\widetilde X_1}\, y.
\]
The equation above holds for arbitrary $y \in \mathbb{C}^r$, hence (4.12) holds. □

Next, we prove Theorem 4.4.1.

Proof of Theorem 4.4.1: Denote $b_i^* = V_{2,i}^* \Delta A$. By Lemma 4.4.8 we have
\[
\|m_i^*\| = \left\| b_i^*\, \frac{\hat A^{r-1} - \sigma_1 \hat A^{r-2} + \sigma_2 \hat A^{r-3} - \cdots + (-1)^{r-1}\sigma_{r-1} I_n}{\sigma_r}\, Q_{\widetilde X_1} \right\|
\le \|b_i^*\|\, \frac{\left\| \hat A^{r-1} - \sigma_1 \hat A^{r-2} + \sigma_2 \hat A^{r-3} - \cdots + (-1)^{r-1}\sigma_{r-1} I_n \right\|}{|\sigma_r|}
\le \|b_i^*\|\, \frac{\|\hat A\|^{r-1} + |\sigma_1|\,\|\hat A\|^{r-2} + |\sigma_2|\,\|\hat A\|^{r-3} + \cdots + |\sigma_{r-1}|}{|\sigma_r|}.
\]
Notice that
\[
|\sigma_k| = \left| \sum_{1 \le i_1 < i_2 < \cdots < i_k \le r} \hat\lambda_{i_1}\hat\lambda_{i_2}\cdots\hat\lambda_{i_k} \right| \le \sum_{1 \le i_1 < i_2 < \cdots < i_k \le r} |\hat\lambda_{i_1}| \cdot |\hat\lambda_{i_2}| \cdots |\hat\lambda_{i_k}|.
\]
Define an auxiliary polynomial
\[
\bar q(z) = \left(z + |\hat\lambda_1|\right)\left(z + |\hat\lambda_2|\right)\cdots\left(z + |\hat\lambda_r|\right) = z^r + \bar\sigma_1 z^{r-1} + \bar\sigma_2 z^{r-2} + \cdots + \bar\sigma_r.
\]
Here $\bar\sigma_k = \sum_{1 \le i_1 < i_2 < \cdots < i_k \le r} |\hat\lambda_{i_1}| \cdot |\hat\lambda_{i_2}| \cdots |\hat\lambda_{i_k}| \ge |\sigma_k|$ for $1 \le k \le r$, and $\bar\sigma_r = |\hat\lambda_1| \cdot |\hat\lambda_2| \cdots |\hat\lambda_r| = |\sigma_r|$. Then
\[
\|m_i\| \le \|b_i^*\|\, \frac{\|\hat A\|^{r-1} + \bar\sigma_1 \|\hat A\|^{r-2} + \bar\sigma_2 \|\hat A\|^{r-3} + \cdots + \bar\sigma_{r-1}}{\bar\sigma_r}
\le \|b_i^*\|\, \frac{\bar q(a) - \bar\sigma_r}{a\, \bar\sigma_r}
= \frac{\|b_i\|}{a} \left( \prod_{j=1}^{r} \left( 1 + \frac{a}{|\hat\lambda_j|} \right) - 1 \right)
\le \frac{\|b_i\|}{a} \prod_{j=1}^{r} \left( 1 + \frac{a}{\min_{\lambda_k \in S(\Lambda_2)} |\widetilde\lambda_j - \lambda_k|} \right),
\]
where $a = \|A\| + \|\Delta A\| + \rho(\Lambda_2) \ge \|\hat A\|$. Combining the bounds for all $1 \le i \le n-r$ leads to
\[
\|M\| \le \frac{\left\| V_2^* \Delta A \right\|_F}{a} \prod_{j=1}^{r} \left( 1 + \frac{a}{\min_{\lambda_k \in S(\Lambda_2)} |\widetilde\lambda_j - \lambda_k|} \right) \le \frac{\|V_2\|\, \|\Delta A\|_F}{a} \prod_{j=1}^{r} \left( 1 + \frac{a}{\min_{\lambda_k \in S(\Lambda_2)} |\widetilde\lambda_j - \lambda_k|} \right).
\]
By (4.11), we further obtain (4.7). □
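The identity in Lemma 4.4.7 is exact and easy to check numerically. The following sketch (Python/NumPy; the upper-triangular test matrix with distinct diagonal is our choice, so that the eigenvalue ordering between $A$ and $\widetilde A$ is unambiguous) verifies (4.9) and the $\sin\Theta$ expression to machine precision:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 6, 2
# Upper-triangular test matrix: eigenvalues are exactly its diagonal 1..n.
A = np.triu(rng.standard_normal((n, n)), 1) + np.diag(np.arange(1.0, n + 1))
Delta = 1e-6 * rng.standard_normal((n, n))

def sorted_eig(M):
    vals, X = np.linalg.eig(M)
    order = np.argsort(-vals.real)            # descending real part
    return vals[order], X[:, order]

lam, X = sorted_eig(A)
lam_t, Xt = sorted_eig(A + Delta)
V = np.linalg.inv(X).conj().T                 # V^* X = I
V2, X1_t = V[:, r:], Xt[:, :r]

F = 1.0 / (lam_t[None, :r] - lam[r:, None])   # F_{ij} = 1/(lam~_j - lam_{r+i})
Qv, Rv = np.linalg.qr(V2)
Qx, Rx = np.linalg.qr(X1_t)

lhs = Qv.conj().T @ Qx                        # Q_{V2}^* Q_{X1~}
rhs = np.linalg.inv(Rv).conj().T @ (F * (V2.conj().T @ Delta @ X1_t)) @ np.linalg.inv(Rx)
print(np.max(np.abs(lhs - rhs)))              # ~ machine precision: (4.9) holds
print(np.linalg.norm(lhs, 2))                 # = ||sin Theta(Q_{X1}, Q_{X1~})||
```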
BIBLIOGRAPHY

Abbe, E. (2017). Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446–6531.

Abbe, E., Fan, J., and Wang, K. (2022). An ℓp theory of PCA and spectral clustering. The Annals of Statistics, 50(4):2359–2385.

Abbe, E., Fan, J., Wang, K., and Zhong, Y. (2020). Entrywise eigenvector analysis of random matrices with low expected rank. The Annals of Statistics, 48(3):1452–1474.

Agterberg, J., Lubberts, Z., and Priebe, C. E. (2022). Entrywise estimation of singular vectors of low-rank matrices with heteroskedasticity and dependence. IEEE Transactions on Information Theory, 68(7):4618–4650.

Athreya, A., Tang, M., Park, Y., and Priebe, C. E. (2021). On estimation and inference in latent structure random graphs. Statistical Science, 36(1):68–88.

Balasubramanian, M. and Schwartz, E. L. (2002). The isomap algorithm and topological stability. Science, 295(5552):7–7.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Bhatia, R. (2013). Matrix analysis, volume 169. Springer Science & Business Media.

Cai, C., Li, G., Chi, Y., Poor, H. V., and Chen, Y. (2021). Subspace estimation from unbalanced and incomplete data matrices: ℓ2,∞ statistical guarantees. The Annals of Statistics, 49(2):944–967.

Cai, J. F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982.

Cai, T. T. and Zhang, A. (2018). Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89.

Candes, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust Principal Component Analysis? J. ACM, 58(3):11:1–11:37.

Candes, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936.

Candes, E. J. and Recht, B. (2012). Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119.

Cape, J., Tang, M., and Priebe, C. E. (2019a). Signal-plus-noise matrix models: eigenvector deviations and fluctuations. Biometrika, 106(1):243–250.

Cape, J., Tang, M., and Priebe, C. E. (2019b). The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics. The Annals of Statistics, 47(5):2405–2439.

Chatelin, F. (2011). Spectral approximation of linear operators. SIAM.

Chen, Y., Cheng, C., and Fan, J. (2021a). Asymmetry helps: Eigenvalue and eigenvector analyses of asymmetrically perturbed low-rank matrices. The Annals of Statistics, 49(1):435.

Chen, Y., Chi, Y., Fan, J., Ma, C., et al. (2021b). Spectral methods for data science: A statistical perspective. Foundations and Trends in Machine Learning, 14(5):566–806.

Cheng, C., Wei, Y., and Chen, Y. (2020). Inference for linear forms of eigenvectors under minimal eigenvalue separation: Asymmetry and heteroscedasticity. arXiv preprint arXiv:2001.04620.

Chin, P., Rao, A., and Vu, V. (2015). Stochastic block model and community detection in sparse graphs: A spectral algorithm with optimal rate of recovery. In Conference on Learning Theory, pages 391–423. PMLR.

Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46.

Demmel, J. W. (1986). Computing stable eigendecompositions of matrices. Linear Algebra and its Applications, 79:163–193.

Deutsch, S., Ortega, A., and Medioni, G. (2016). Manifold denoising based on spectral graph wavelets. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4673–4677. IEEE.

Donoho, D. and Gavish, M. (2014). Minimax risk of matrix denoising by singular value thresholding. The Annals of Statistics, 42(6):2413–2440.
Dopico, F. M. (2000). A note on sin θ theorems for singular subspace variations. BIT Numerical Mathematics, 40(2):395–403.

Dopico, F. M. and Moro, J. (2002). Perturbation theory for simultaneous bases of singular subspaces. BIT Numerical Mathematics, 42(1):84–109.

Du, C., Sun, J., Zhou, S., and Zhao, J. (2013). An outlier detection method for robust manifold learning. In Yin, Z., Pan, L., and Fang, X., editors, Proceedings of The Eighth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA), 2013, Advances in Intelligent Systems and Computing, pages 353–360. Springer Berlin Heidelberg.

Eldridge, J., Belkin, M., and Wang, Y. (2018). Unperturbed: spectral analysis beyond Davis–Kahan. In Algorithmic Learning Theory, pages 321–358. PMLR.

Eppel, S. (2006). Using curvature to distinguish between surface reflections and vessel contents in computer vision based recognition of materials in transparent vessels. arXiv preprint arXiv:1602.00177.

Fan, J., Fan, Y., Han, X., and Lv, J. (2022). Asymptotic theory of eigenvectors for random matrices with diverging spikes. Journal of the American Statistical Association, 117(538):996–1009.

Flynn, P. J. and Jain, A. K. (1989). On reliable curvature estimation. Computer Vision and Pattern Recognition, 89:110–116.

Gavish, M. and Donoho, D. L. (2014). The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory, 60(8):5040–5053.

Gohberg, I., Lancaster, P., and Rodman, L. (2006). Invariant subspaces of matrices with applications. SIAM.

Golub, G. H. and Wilkinson, J. H. (1976). Ill-conditioned eigensystems and the computation of the Jordan canonical form. SIAM Review, 18(4):578–619.

Greenbaum, A., Li, R. C., and Overton, M. L. (2020). First-order perturbation theory for eigenvalues and eigenvectors. SIAM Review, 62(2):463–482.

Haviv, M. and Ritov, Y. (1994). Bounds on the error of an approximate invariant subspace for non-self-adjoint matrices. Numerische Mathematik, 67(4):491–500.

He, Y., Tian, Y., Wang, M., Chen, F., Yu, L., Tang, M., Chen, C., Zhang, N., Kuang, B., and Prakash, A. (2023). Que2Engage: Embedding-based retrieval for relevant and engaging products at Facebook Marketplace. arXiv preprint arXiv:2302.11052.

Hein, M. and Maier, M. (2006). Manifold denoising. Advances in Neural Information Processing Systems, 19.

Ipsen, I. C. (2003). A note on unifying absolute and relative perturbation bounds. Linear Algebra and its Applications, 358(1-3):239–253.

Janson, S. (2016). Large deviation inequalities for sums of indicator variables. arXiv preprint arXiv:1609.00533.

Karow, M. and Kressner, D. (2014). On a perturbation bound for invariant subspaces of matrices. SIAM Journal on Matrix Analysis and Applications, 35(2):599–618.

Kato, T. (2013). Perturbation theory for linear operators, volume 132. Springer Science & Business Media.

Keshavan, R., Montanari, A., and Oh, S. (2009). Matrix completion from noisy entries. Advances in Neural Information Processing Systems, 22.

Knyazev, A. V. and Argentati, M. E. (2002). Principal angles between subspaces in an A-based scalar product: algorithms and perturbation estimates. SIAM Journal on Scientific Computing, 23(6):2008–2040.

Kobayashi, S. and Nomizu, K. (1996). Foundations of differential geometry. 2.

Krahmer, F., Lyu, H., Saab, R., Veselovska, A., and Wang, R. (2023). Quantization of bandlimited graph signals. In Fourteenth International Conference on Sampling Theory and Applications.
Lee, H., Battle, A., Raina, R., and Ng, A. (2006). Efficient sparse coding algorithms. Advances in Neural Information Processing Systems, 19.

Lee, J. M. (2012). Smooth manifolds. Springer.

Lei, L. (2019). Unified ℓ2→∞ eigenspace perturbation theory for symmetric random matrices. arXiv preprint arXiv:1909.04798.

Li, R. C. (1994). On perturbations of matrix pencils with real spectra. Mathematics of Computation, 62(205):231–265.

Li, R. C. (1998). Spectral variations and Hadamard products: Some problems. Linear Algebra and its Applications, 278(1-3):317–326.

Little, A., Xie, Y., and Sun, Q. (2018). An analysis of classical multidimensional scaling. arXiv preprint arXiv:1812.11954.

Löffler, M., Zhang, A. Y., and Zhou, H. H. (2021). Optimality of spectral clustering in the Gaussian mixture model. The Annals of Statistics, 49(5):2506–2530.

Luo, Y., Han, R., and Zhang, A. (2021). A Schatten-q low-rank matrix perturbation analysis via perturbation projection error bound. Linear Algebra and its Applications, 630:225–240.

Lyu, H., Sha, N., Qin, S., Yan, M., Xie, Y., and Wang, R. (2019). Manifold denoising by nonlinear robust principal component analysis. Advances in Neural Information Processing Systems, 32.

Lyu, H. and Wang, R. (2020a). An exact sin θ formula for matrix perturbation analysis and its applications. arXiv preprint arXiv:2011.07669.

Lyu, H. and Wang, R. (2020b). Sigma Delta quantization for images. arXiv preprint arXiv:2005.08487.

Lyu, H. and Wang, R. (2022). Perturbation of invariant subspaces for ill-conditioned eigensystem. arXiv preprint arXiv:2203.00068.

Martin, G. R. and Evans, M. J. (1975). Differentiation of clonal lines of teratocarcinoma cells: formation of embryoid bodies in vitro. Proceedings of the National Academy of Sciences, 72(4):1441–1445.

Meek, D. S. and Walton, D. J. (2000). On surface normal and Gaussian curvature approximations given data sampled from a smooth surface. Computer Aided Geometric Design, 17(6):521–543.

Moon, K., Dijk, D. V., Wang, Z., Gigante, S., Burkhardt, D. B., Chen, W. S., Yim, K., Elzen, A. v. d., Hirn, M. J., Coifman, R. R., Ivanova, N. B., Wolf, G., and Krishnaswamy, S. (2019). Visualizing Structure and Transitions for Biological Data Exploration. bioRxiv, page 120378.

Narayanamurthy, P. and Vaswani, N. (2020). Fast robust subspace tracking via PCA in sparse data-dependent noise. IEEE Journal on Selected Areas in Information Theory, 1(3):723–744.

Paulsen, V. (2002). Completely bounded maps and operator algebras. Number 78. Cambridge University Press.

Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572.

Pottmann, H., Wallner, J., Yang, Y. L., Lai, Y. K., and Hu, S. M. (2007). Principal curvatures from the integral invariant viewpoint. Computer Aided Geometric Design, 24(8):428–442.

Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326.

Sathe, S. and Aggarwal, C. (2016). LODES: Local density meets spectral outlier detection. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 171–179. SIAM.

Sha, N., Yan, M., and Lin, Y. (2019). Efficient seismic denoising techniques using robust principal component analysis. In SEG Technical Program Expanded Abstracts 2019, pages 2543–2547. Society of Exploration Geophysicists.
Stewart, G. (1990). Matrix perturbation theory.

Stewart, G. W. (1971). Error bounds for approximate invariant subspaces of closed linear operators. SIAM Journal on Numerical Analysis, 8(4):796–808.

Stewart, G. W. (1973). Error and perturbation bounds for subspaces associated with certain eigenvalue problems. SIAM Review, 15(4):727–764.

Stewart, M. (2006). Perturbation of the SVD in the presence of small singular values. Linear Algebra and its Applications, 419(1):53–77.

Tang, M. and Priebe, C. E. (2018). Limit theorems for eigenvectors of the normalized Laplacian for random graphs. The Annals of Statistics, 46(5):2360–2415.

Tanner, J. and Wei, K. (2013). Normalized iterative hard thresholding for matrix completion. SIAM Journal on Scientific Computing, 35(5):S104–S125.

Thompson, R. C. (1975). Singular value inequalities for matrix sums and minors. Linear Algebra and its Applications, 11(3):251–269.

Tong, W. S. and Tang, C. K. (2005). Robust estimation of adaptive tensors of curvature by tensor voting. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(3):434–449.

Tropp, J. A. et al. (2015). An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230.

Varah, J. M. (1970). Computing invariant subspaces of a general matrix when the eigensystem is poorly conditioned. Mathematics of Computation, 24(109):137–149.

Vaswani, N. and Narayanamurthy, P. (2017). Finite sample guarantees for PCA in non-isotropic and data-dependent noise. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 783–789. IEEE.

Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press.

Vu, T., Chunikhina, E., and Raich, R. (2021). Perturbation expansions and error bounds for the truncated singular value decomposition. Linear Algebra and its Applications, 627:94–139.

Wedin, P. (1972). Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111.

Yu, Y., Wang, T., and Samworth, R. J. (2015). A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323.

Yun, S. Y. and Proutiere, A. (2014). Accurate community detection in the stochastic block model via spectral algorithms. arXiv preprint arXiv:1412.7335.

Zhan, J. and Vaswani, N. (2015). Robust PCA with partial subspace knowledge. IEEE Transactions on Signal Processing, 63(13):3332–3347.

APPENDIX A

APPENDIX FOR CHAPTER 2

Proof of Lemma 2.6.3 (the full-rank case): We first bound $\|U_2 U_2^T \Delta A V_2 V_2^T \widetilde V_1 \widetilde\Sigma_1^{-1}\|_{2,\infty}$. To do so, we need Theorem 2.5.7 to obtain the expansion of $V_2^T \widetilde V_1$. Let us first check that in the setting of this lemma (i.e., $21\sigma\sqrt{\bar n} < \sigma_r(A) - \sigma_{r+1}(A)$), the condition of Theorem 2.5.7 is satisfied with high probability, that is, $\|\mathcal F\| < 1$, where
\[
\mathcal F\begin{pmatrix} C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} F_U^{21} \circ (\Sigma_2 \alpha_{22}^T C_1) + F_U^{21} \circ (\alpha_{22} C_2 \widetilde\Sigma_1^T) \\[4pt] F_V^{21} \circ (\alpha_{22}^T C_1 \widetilde\Sigma_1) + F_V^{21} \circ (\Sigma_2^T \alpha_{22} C_2) \end{pmatrix}.
\]
As discussed in the proof of Theorem 2.2.6, $\|\Delta A\| \le 3\sigma\sqrt{\bar n}$ with probability at least $1 - e^{-c\bar n}$ for some constant $c$.
By the assumption $21\sigma\sqrt{\bar n} < \sigma_r(A) - \sigma_{r+1}(A)$, we have with probability at least $1 - e^{-c\bar n}$,
\[
\left\|\mathcal F\begin{pmatrix} C_1 \\ C_2 \end{pmatrix}\right\|
\le \|F_U^{21} \circ (\Sigma_2\alpha_{22}^T C_1)\| + \|F_U^{21} \circ (\alpha_{22} C_2 \widetilde\Sigma_1^T)\| + \|F_V^{21} \circ (\alpha_{22}^T C_1 \widetilde\Sigma_1)\| + \|F_V^{21} \circ (\Sigma_2^T \alpha_{22} C_2)\|
\]
\[
\le \frac{\sigma_{r+1}}{\widetilde\sigma_r^2 - \sigma_{r+1}^2}\|\alpha_{22}\|\|C_1\| + \frac{\widetilde\sigma_r}{\widetilde\sigma_r^2 - \sigma_{r+1}^2}\|\alpha_{22}\|\|C_2\| + \frac{\widetilde\sigma_r}{\widetilde\sigma_r^2 - \sigma_{r+1}^2}\|\alpha_{22}\|\|C_1\| + \frac{\sigma_{r+1}}{\widetilde\sigma_r^2 - \sigma_{r+1}^2}\|\alpha_{22}\|\|C_2\|
\]
\[
= \left(\|C_1\| + \|C_2\|\right)\frac{\|\alpha_{22}\|}{\widetilde\sigma_r - \sigma_{r+1}}
\le \frac{\|\Delta A\|}{\sigma_r - \sigma_{r+1} - \|\Delta A\|}\cdot\sqrt{2\left(\|C_1\|^2 + \|C_2\|^2\right)}
\le \frac{\sqrt 2}{6}\sqrt{\|C_1\|^2 + \|C_2\|^2}.
\]
Here the second inequality is due to Lemma 2.5.4. Now we have $\|\mathcal F\| < 1$ with high probability, which enables us to use Theorem 2.5.7 to decompose $U_2 U_2^T \Delta A V_2 V_2^T \widetilde V_1 \widetilde\Sigma_1^{-1}$. Denote
\[
a_1 = F_U^{21} \circ (\Sigma_2 \alpha_{12}^T U_1^T \widetilde U_1), \quad a_2 = F_U^{21} \circ (\alpha_{21} V_1^T \widetilde V_1 \widetilde\Sigma_1^T), \quad a_3 = F_V^{21} \circ (\alpha_{12}^T U_1^T \widetilde U_1 \widetilde\Sigma_1), \quad a_4 = F_V^{21} \circ (\Sigma_2^T \alpha_{21} V_1^T \widetilde V_1),
\]
and
\[
f_1(X) = F_U^{21} \circ (\Sigma_2 \alpha_{22}^T X), \quad f_2(X) = F_U^{21} \circ (\alpha_{22} X \widetilde\Sigma_1^T), \quad f_3(X) = F_V^{21} \circ (\alpha_{22}^T X \widetilde\Sigma_1), \quad f_4(X) = F_V^{21} \circ (\Sigma_2^T \alpha_{22} X).
\]
By Theorem 2.5.7, each term in the expansion of $V_2^T \widetilde V_1$ is of the form
\[
f_{i_1}(f_{i_2}(\cdots(f_{i_k}(a_{i_0})))), \qquad 1 \le i_0, i_1, \ldots, i_k \le 4, \quad k = 0, 1, 2, \ldots.
\]
Now assume $i_1, \ldots, i_k$ and $k$ are fixed. Let $w$ be the $i$th row of $U_2\alpha_{22}\, f_{i_1}(f_{i_2}(\cdots(f_{i_k}(a_{i_0}))))\,\widetilde\Sigma_1^{-1}$, and let $b^T = u_i^T \alpha_{22}$, where $u_i^T$ is the $i$th row of $U_2$. Then $w = b^T f_{i_1}(f_{i_2}(\cdots(f_{i_k}(a_{i_0}))))\,\widetilde\Sigma_1^{-1}$.

Notice that $a_{i_0}$ and each $f_{i_s}$, $1 \le s \le k$, contains either $\Sigma_2$ or $\widetilde\Sigma_1$; let $h_{i_s} = 1$ if $f_{i_s}$ contains $\Sigma_2$, and $h_{i_s} = 0$ if $f_{i_s}$ contains $\widetilde\Sigma_1$. Likewise, let $h_{i_0} = 1$ if $a_{i_0}$ contains $\Sigma_2$, and $h_{i_0} = 0$ if $a_{i_0}$ contains $\widetilde\Sigma_1$. Also, let $m$ be the total number of times that $\widetilde\Sigma_1$ appears among the $f_{i_s}$ and $a_{i_0}$. Then
\[
h_{i_0} + h_{i_1} + \cdots + h_{i_k} + m = k + 1. \tag{A.1}
\]
Likewise, each $f_{i_s}$, $1 \le s \le k$, contains either $\alpha_{22}$ or $\alpha_{22}^T$. Let $d_{i_s} = \alpha_{22}$ if $f_{i_s}$ contains $\alpha_{22}$, and $d_{i_s} = \alpha_{22}^T$ if it contains $\alpha_{22}^T$. Also, let $\gamma = \alpha_{12}^T$ if $a_{i_0}$ contains $\alpha_{12}^T$, and $\gamma = \alpha_{21}$ if $a_{i_0}$ contains $\alpha_{21}$. Last, denote $\beta = V_1^T \widetilde V_1$ if $a_{i_0}$ contains $V_1^T \widetilde V_1$, and $\beta = U_1^T \widetilde U_1$ if $a_{i_0}$ contains $U_1^T \widetilde U_1$. For the $\gamma$ and $\beta$ defined above, let $\gamma_l^T$ be the $l$th row of $\gamma$, i.e., $\gamma = [\gamma_1^T; \gamma_2^T; \ldots; \gamma_{n-r}^T]$, and let $\beta_i$ be the $i$th column of $\beta$, i.e., $\beta = [\beta_1, \beta_2, \ldots, \beta_r]$.
Then for 1 โ‰ค ๐‘— โ‰ค ๐‘Ÿ, the ๐‘—th entry in ๐‘ค is ๐‘ค๐‘— โ„Ž๐‘– โ„Ž๐‘– โ„Ž๐‘– โ„Ž๐‘– 0 โˆ‘๏ธ ๐‘ ๐‘™1 ๐œŽ๐‘Ÿ+๐‘™1 โˆ‘๏ธ (๐‘‘๐‘–1 )๐‘™1 ๐‘™2 ๐œŽ๐‘Ÿ+๐‘™2 ๐‘˜ โˆ‘๏ธ (๐‘‘๐‘– ๐‘˜โˆ’1 )๐‘™ ๐‘˜โˆ’1 ๐‘™ ๐‘˜ ๐œŽ๐‘Ÿ+๐‘™ โˆ‘๏ธ (๐‘‘๐‘– ๐‘˜ )๐‘™ ๐‘˜ ๐‘™0 ๐œŽ๐‘Ÿ+๐‘™ 1 2 ๐‘˜ 0 ๐‘‡ = ... ๐›พ๐‘™ (e ๐œŽ ๐‘šโˆ’1 ๐›ฝ ๐‘— ) ๐‘™1 e ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 0 ๐‘— 1 ๐‘™2 2 ๐‘™๐‘˜ ๐‘˜ ๐‘™0 0 โ„Ž๐‘– โ„Ž๐‘– โ„Ž๐‘– 0 โˆ‘๏ธ ๐‘ ๐‘™1 ๐œŽ๐‘Ÿ+๐‘™1 โˆ‘๏ธ (๐‘‘๐‘–1 )๐‘™1 ๐‘™2 ๐œŽ๐‘Ÿ+๐‘™2 โˆ‘๏ธ (๐‘‘๐‘– ๐‘˜ )๐‘™ ๐‘˜ ๐‘™0 ๐œŽ๐‘Ÿ+๐‘™ * + 1 2 0 = ... ๐›พ๐‘™0 ,e ๐œŽ ๐‘šโˆ’1 ๐›ฝ๐‘— . ๐œŽ e 2 โˆ’ ๐œŽ2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐‘— ๐‘™1 ๐‘— ๐‘Ÿ+๐‘™1 ๐‘™2 2 ๐‘™0 0 | {z } โ‰ก๐‘€ ๐‘— In the above, let โ„Ž๐‘– โ„Ž๐‘– โ„Ž๐‘– โ„Ž๐‘– 0 โˆ‘๏ธ ๐‘ ๐‘™1 ๐œŽ๐‘Ÿ+๐‘™1 โˆ‘๏ธ (๐‘‘๐‘–1 )๐‘™1 ๐‘™2 ๐œŽ๐‘Ÿ+๐‘™2 ๐‘˜ โˆ‘๏ธ (๐‘‘๐‘– ๐‘˜โˆ’1 )๐‘™ ๐‘˜โˆ’1 ๐‘™ ๐‘˜ ๐œŽ๐‘Ÿ+๐‘™ โˆ‘๏ธ (๐‘‘๐‘– ๐‘˜ )๐‘™ ๐‘˜ ๐‘™0 ๐œŽ๐‘Ÿ+๐‘™ 1 2 ๐‘˜ 0 ๐‘€๐‘— = ... ๐›พ๐‘™ 0 . ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ ๐‘™1 e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 1 ๐‘™2 2 ๐‘™๐‘˜ ๐‘˜ ๐‘™0 0 96 We first bound โˆฅ๐‘€ ๐‘— โˆฅ. Denote ๐œ‚ ๐‘— = ๐œŽ 2๐‘— โˆ’ e ๐œŽ 2๐‘— , and ฮ” ๐‘—๐‘™ = ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ 2 . Notice that 1 1 1 = = ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ 2 + (e ๐œŽ 2๐‘— โˆ’ ๐œŽ 2๐‘— ) e๐œŽ 2 โˆ’๐œŽ 2 2 2 (๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ )(1 + 2 2 ) ๐‘— ๐‘— ๐œŽ โˆ’๐œŽ ๐‘— ๐‘Ÿ+๐‘™ 1 1 ๐œ‚๐‘— ๐œ‚๐‘— 2 = ๐œ‚๐‘— = (1 + +( ) + ...). ฮ” ๐‘—๐‘™ (1 โˆ’ ฮ” ) ฮ” ๐‘—๐‘™ ฮ” ๐‘—๐‘™ ฮ” ๐‘—๐‘™ ๐‘—๐‘™ Hence รŽ๐‘˜ รŽ๐‘˜ โ„Ž๐‘– ๐‘  ๐‘ =1 (๐‘‘๐‘– ๐‘  )๐‘™ ๐‘  ๐‘™ ๐‘ +1 ๐‘ =0 ๐œŽ๐‘Ÿ+๐‘™ ๐‘  ๐‘ ๐‘™1 ๐‘˜   โˆ‘๏ธ ร– ๐œ‚๐‘— ๐œ‚๐‘— 2 โˆฅ๐‘€๐‘— โˆฅ = 1+ +( ) + ... ๐›พ๐‘™0 ฮ” ๐‘—๐‘™ ๐‘  ฮ” ๐‘—๐‘™ ๐‘  รŽ๐‘˜ ๐‘™1 ,...,๐‘™ ๐‘˜ ,๐‘™0 ๐‘ =0 ฮ” ๐‘—๐‘™ ๐‘  ๐‘ =0 โ„Ž๐‘– ๐‘  โˆž รŽ๐‘˜ รŽ๐‘˜ โˆ‘๏ธ โˆ‘๏ธ ๐‘ ๐‘™1 ๐‘ =1 (๐‘‘ ๐‘– ๐‘  ) ๐‘™ ๐‘  ๐‘™ ๐œŽ ๐‘ +1 ๐‘ =0 ๐‘Ÿ+๐‘™ ๐‘  ร๐‘˜ ๐‘ž๐‘  = รŽ๐‘˜  รŽ๐‘˜ ๐‘ž๐‘  ๐›พ๐‘™0 ยท ๐œ‚ ๐‘— ๐‘ =0 ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜ =0 ๐‘™1 ,...,๐‘™ ๐‘˜ ,๐‘™0 ๐‘ =0 ฮ” ๐‘—๐‘™ ๐‘ =0 ฮ” ๐‘—๐‘™ ๐‘  โ„Ž๐‘– ๐‘  โˆž รŽ๐‘˜ รŽ๐‘˜ ๐‘ =1 (๐‘‘๐‘– ๐‘  )๐‘™ ๐‘  ๐‘™ ๐‘ +1 ๐‘ =0 ๐œŽ๐‘Ÿ+๐‘™ ๐‘  ๐‘ ๐‘™1 ร๐‘˜ โˆ‘๏ธ โˆ‘๏ธ ๐‘ž๐‘  โ‰ค ๐›พ๐‘™ 0 ยท ๐œ‚ ๐‘— ๐‘ =0 . รŽ๐‘˜ 1+๐‘ž ๐‘  ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜ =0 ๐‘™ 1 ,...,๐‘™ ๐‘˜ ,๐‘™ 0 ๐‘ =0 ๐‘—๐‘™ ๐‘ ฮ” | {z } ๐‘‡ ( ๐‘—,๐‘ž 0 ,...,๐‘ž ๐‘˜ ) Here we let ๐‘™ ๐‘˜+1 = ๐‘™ 0 . In the above, denote รŽ๐‘˜ รŽ๐‘˜ โ„Ž๐‘– ๐‘  ๐‘ ๐‘™1 ๐‘ =1 (๐‘‘๐‘– ๐‘  )๐‘™ ๐‘  ๐‘™ ๐‘ +1 ๐‘ =0 ๐œŽ๐‘Ÿ+๐‘™ ๐‘  โˆ‘๏ธ ๐‘‡ ( ๐‘—, ๐‘ž 0 , ..., ๐‘ž ๐‘˜ ) โ‰ก ๐›พ๐‘™ 0 โˆˆ R๐‘Ÿร—1 . รŽ๐‘˜ 1+๐‘ž ๐‘  ๐‘™ 1 ,...,๐‘™ ๐‘˜ ,๐‘™ 0 ฮ” ๐‘ =0 ๐‘—๐‘™ ๐‘  Next, we bound the โ„“2 norm of ๐‘‡ ( ๐‘—, ๐‘ž 0 , ..., ๐‘ž ๐‘˜ ). Notice that we can rewrite ๐‘‡ ( ๐‘—, ๐‘ž 0 , ..., ๐‘ž ๐‘˜ ) in the following way ๐‘ž ๐‘ž ๐‘‡  ๐‘ž ๐‘ž  ๐‘‡ ( ๐‘—, ๐‘ž 0 , ..., ๐‘ž ๐‘˜ ) = ๐‘๐‘‡ ๐‘“ห†๐‘– 1 ๐‘“ห†๐‘– 2 ... ๐‘“ห†๐‘– ๐‘˜ ๐‘Žห†๐‘– 0 , 1 โ‰ค ๐‘–0 , ๐‘– 2 , ..., ๐‘– ๐‘˜ โ‰ค 4, ๐‘˜ โ‰ฅ 0. 
Here the matrices $\hat f_{i_s}^{\,q}$ are modified from the functions $f_{i_s}$, and the matrix $\hat a_{i_0}^{\,q}$ is modified from $a_{i_0}$. Explicitly,
\[
\hat a_1^{\,q} = \hat F_U^{21,q}\, \Sigma_2 \alpha_{12}^T, \quad \hat a_2^{\,q} = \hat F_U^{21,q}\, \alpha_{21}, \quad \hat a_3^{\,q} = \hat F_V^{21,q}\, \alpha_{12}^T, \quad \hat a_4^{\,q} = \hat F_V^{21,q}\, \Sigma_2^T \alpha_{21},
\]
\[
\hat f_1^{\,q} = \hat F_U^{21,q}\, \Sigma_2 \alpha_{22}^T, \quad \hat f_2^{\,q} = \hat F_U^{21,q}\, \alpha_{22}, \quad \hat f_3^{\,q} = \hat F_V^{21,q}\, \alpha_{22}^T, \quad \hat f_4^{\,q} = \hat F_V^{21,q}\, \Sigma_2^T \alpha_{22},
\]
where $\hat F_U^{21,q}$ and $\hat F_V^{21,q}$ are diagonal matrices with diagonal entries
\[
\left(\hat F_U^{21,q}\right)_{i'-r,\, i'-r} = \frac{1}{(\sigma_j^2 - \sigma_{i'}^2)^{1+q}}, \quad r+1 \le i' \le n, \qquad \left(\hat F_V^{21,q}\right)_{i'-r,\, i'-r} = \frac{1}{(\sigma_j^2 - \sigma_{i'}^2)^{1+q}}, \quad r+1 \le i' \le m.
\]
As in Theorem 2.5.1, if $i' > \min\{n, m\}$, we define $\sigma_{i'}$ to be 0.

As before, in the above expression of $T(j, q_0, \ldots, q_k)$, $\hat a_{i_0}^{\,q_0}$ contains either $\alpha_{21}$ or $\alpha_{12}^T$; whichever it contains coincides with the $\gamma$ defined in the paragraph below (A.1). Conditional on $\alpha_{22}$, the only random variable in $T(j, q_0, \ldots, q_k)$ is $\alpha_{21}$ or $\alpha_{12}^T$, that is, $\gamma$. Therefore, if we write $T(j, q_0, \ldots, q_k) = G(\gamma)^T$, then the linear operator $G$ is independent of $\gamma$, and it is straightforward to check that
\[
\|G\| \le \underbrace{\|b\|\, \frac{\sigma_{r+1}^{\,\sum_{s=0}^k h_{i_s}}\, \|\alpha_{22}\|^k}{(\sigma_j^2 - \sigma_{r+1}^2)^{\,k+1+\sum_{s=0}^k q_s}}}_{\equiv K}, \qquad 1 \le j \le r.
\]
Again, conditional on $\alpha_{22}$, since for each $1 \le p \le r$ the $p$th entry of $T(j, q_0, \ldots, q_k)$ only depends on the $p$th column of $\alpha_{21}$ or $\alpha_{12}^T$, the entries of $T(j, q_0, \ldots, q_k)$ are independent of each other, each following a Gaussian distribution $\mathcal N(0, \sigma^2 \xi_p^2)$ with $\xi_p \le K$, $1 \le p \le r$, where $K$ denotes the bound above. By Theorem 3.1.1 in Vershynin (2018), there exists some constant $c$ such that
\[
\mathbb P\left( \|T(j, q_0, \ldots, q_k)\| - K\sigma\sqrt r \ge t \right) \le 2 e^{-c t^2 / (\sigma^2 K^2)}.
\]
Let $t = K\sigma\sqrt{\log\left(n^3 \cdot 2^{\sum_{s=0}^k q_s} \cdot 8^k\, r\right)/c}$; then with probability at least $1 - \frac{2}{n^3 \cdot 2^{\sum_{s=0}^k q_s} \cdot 8^k\, r}$,
\[
\|T(j, q_0, \ldots, q_k)\| \le c_1 \sigma \|b\|\, \frac{\sigma_{r+1}^{\,\sum_{s=0}^k h_{i_s}}\, \|\alpha_{22}\|^k}{(\sigma_j^2 - \sigma_{r+1}^2)^{\,k+1+\sum_{s=0}^k q_s}} \left( \sqrt{\log n} + \sqrt{\textstyle\sum_{s=0}^k q_s} + \sqrt k + \sqrt r \right),
\]
where $c_1$ is some constant.
Hence โˆฅ๐‘€ ๐‘— โˆฅ โˆž โˆ‘๏ธ ร๐‘˜ ๐‘ž๐‘  โ‰ค โˆฅ๐‘‡ ( ๐‘—, ๐‘ž 0 , ..., ๐‘ž ๐‘˜ ) โˆฅ ยท ๐œ‚ ๐‘— ๐‘ =0 ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜ =0 ร๐‘˜ ร๐‘˜ โ„Ž๐‘– ๐‘  ๐‘ž โˆฅ๐‘โˆฅ๐œŽ๐‘Ÿ+1 ๐‘ =0 โˆฅ๐›ผ22 โˆฅ ๐‘˜ โˆž ๐‘˜ โˆ‘๏ธ โˆš โˆš โˆš ๐‘ =0 ๐‘  โˆ‘๏ธ โˆš๏ธ ๐œ‚๐‘— โ‰ค ๐‘1 ๐œŽ ( log ๐‘› + ๐‘ž ๐‘  + ๐‘˜ + ๐‘Ÿ) ยท (๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ) ๐‘˜+1 ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜ =0 ๐‘ =0 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+12 ร๐‘˜ โ„Ž๐‘– โˆฅ๐‘โˆฅ๐œŽ๐‘Ÿ+1๐‘ =0 ๐‘  โˆฅ๐›ผ22 โˆฅ ๐‘˜ ยญ โˆš๏ธ โˆš โˆš ยฉ ยช ยญ ( log ๐‘› + ๐‘˜ + ๐‘Ÿ) 1 2(๐‘˜ + 1) ยฎ โ‰ค ๐‘1 ๐œŽ + ยฎ (๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ) ๐‘˜+1 |๐œ‚ ๐‘— | |๐œ‚ ๐‘— | ยญ ยฎ ยญ (1 โˆ’ 2 2 ) ๐‘˜+1 (1 โˆ’ 2 2 ) ๐‘˜+1 ยฎ ๐œŽ โˆ’๐œŽ ๐œŽ โˆ’๐œŽ ยซ ๐‘— ๐‘Ÿ+1 ๐‘— ๐‘Ÿ+1 ยฌ ร๐‘˜ โ„Ž๐‘– โˆฅ๐‘โˆฅ๐œŽ๐‘Ÿ+1๐‘ =0 ๐‘  โˆฅ๐›ผ22 โˆฅ ๐‘˜ โˆš๏ธ โˆš = ๐‘2 ๐œŽ ( log ๐‘› + ๐‘Ÿ + ๐‘˜), 2 2 (๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 โˆ’ |๐œ‚ ๐‘— |) ๐‘˜+1 where ๐‘ 2 is a constant. In the second inequality above, we used the fact that ร๐‘˜ ร ๐‘˜โˆ’1 ๐‘ž ๐‘ž ๐‘ž๐‘˜ โˆž ๐‘ =0 ๐‘  โˆž ๐‘ =0 ๐‘  โˆ‘๏ธ โˆž โˆ‘๏ธ ๐œ‚๐‘— โˆ‘๏ธ ๐œ‚๐‘— ๐œ‚๐‘— = 2 2 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+12 ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜ =0 ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜โˆ’1 =0 ๐‘ž ๐‘˜ =0 โˆž 1 โˆ‘๏ธ ๐œ‚๐‘— ร ๐‘˜โˆ’1 ๐‘ž = ๐‘ =0 ๐‘  |๐œ‚ ๐‘— | 2 2 (1 โˆ’ 2 2 ) ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜โˆ’1 =0 ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 ๐œŽ โˆ’๐œŽ ๐‘— ๐‘Ÿ+1 1 = ... = . |๐œ‚ ๐‘— | (1 โˆ’ 2 2 ) ๐‘˜+1 ๐œŽ โˆ’๐œŽ ๐‘— ๐‘Ÿ+1 99 and that โˆž ๐‘˜ โˆ‘๏ธ โˆ‘๏ธ โˆš ๐œ‚๐‘— ร๐‘˜ ๐‘ž ( ๐‘ž๐‘ ) ๐‘ =0 ๐‘  ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜ =0 ๐‘ =0 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ๐‘˜ โˆž โˆž โˆ‘๏ธ โˆ‘๏ธ ๐œ‚๐‘— ร ๐‘™โ‰ ๐‘  ๐‘ž ๐‘™ โˆ‘๏ธ โˆš ๐œ‚๐‘— ๐‘ž๐‘  = ๐‘ž๐‘  2 2 2 2 ๐‘ =0 ๐‘ž 0 ,...,๐‘ž ๐‘ โˆ’1 ,๐‘ž ๐‘ +1 ,...,๐‘ž ๐‘˜ =0 ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 ๐‘ž ๐‘  =0 ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 ๐‘˜ โˆž โˆ‘๏ธ 2 โˆ‘๏ธ ๐œ‚๐‘— ร ๐‘™โ‰ ๐‘  ๐‘ž ๐‘™ โ‰ค ๐‘ =0 (1 โˆ’ |๐œ‚ ๐‘— | ) ๐‘ž 0 ,...,๐‘ž ๐‘ โˆ’1 ,๐‘ž ๐‘ +1 ,...,๐‘ž ๐‘˜ =0 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ๐œŽ โˆ’๐œŽ 2 2 ๐‘— ๐‘Ÿ+1 2(๐‘˜ + 1) = . |๐œ‚ ๐‘— | (1 โˆ’ 2 2 ) ๐‘˜+1 ๐œŽ โˆ’๐œŽ ๐‘— ๐‘Ÿ+1 Here the inequality is due to (2.49), which holds under the condition |๐œ‚ ๐‘— | ๐œŽ 2๐‘— โˆ’ ๐œŽ 2๐‘— | |e (๐œŽ ๐‘— + โˆฅฮ”๐ดโˆฅ) 2 โˆ’ ๐œŽ 2๐‘— = โ‰ค ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 2๐œŽ ๐‘— โˆฅฮ”๐ดโˆฅ + โˆฅฮ”๐ดโˆฅ 2 (2๐œŽ ๐‘— + 17 ๐œŽ ๐‘— )โˆฅฮ”๐ดโˆฅ 1 โ‰ค โ‰ค < . (๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 )(๐œŽ ๐‘— + ๐œŽ๐‘Ÿ+1 ) 7โˆฅฮ”๐ดโˆฅ๐œŽ ๐‘— 2 Therefore with probability at least 1 โˆ’ 3 4 ๐‘˜ , ๐‘› ยท4 ๐‘Ÿ โˆฅ๐‘ค ๐‘— โˆฅ โ‰ค โˆฅ ๐‘€ ๐‘— โˆฅ ยท e ๐œŽ ๐‘šโˆ’1๐‘— โˆฅ๐›ฝ๐‘— โˆฅ ร๐‘˜ โ„Ž๐‘– โˆฅ๐‘โˆฅe ๐œŽ ๐‘— ๐œŽ๐‘Ÿ+1๐‘ =0 ๐‘  โˆฅ๐›ผ22 โˆฅ ๐‘˜ ๐‘šโˆ’1 โˆš๏ธ โˆš โ‰ค ๐‘2 ๐œŽ ( log ๐‘› + ๐‘Ÿ + ๐‘˜) (๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 โˆ’ |๐œ‚ |)(๐œŽ 2 โˆ’ ๐œŽ 2 โˆ’ |๐œ‚ |) ๐‘˜ ๐‘— ๐‘— ๐‘Ÿ+1 ๐‘— โˆฅ๐‘โˆฅ( 78 ๐œŽ ๐‘— ) ๐‘˜ โˆฅฮ”๐ดโˆฅ ๐‘˜ โˆš๏ธ โˆš โ‰ค ๐‘2 ๐œŽ ( log ๐‘› + ๐‘Ÿ + ๐‘˜) 2 (๐œŽ 2 โˆ’ ๐œŽ 2 )( 34 ๐œŽ โˆฅฮ”๐ดโˆฅ) ๐‘˜ 3 ๐‘— ๐‘Ÿ+1 7 ๐‘— 3 โˆฅ๐‘โˆฅ 4 โˆš๏ธ โˆš โ‰ค ๐‘2 ๐œŽ ( ) ๐‘˜ ( log ๐‘› + ๐‘Ÿ + ๐‘˜). 
2 ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ 2 17 ๐‘Ÿ+1 ร๐‘˜ Here the third inequality is due to ๐‘ =0 โ„Ž๐‘– ๐‘  + ๐‘š โˆ’ 1 = ๐‘˜ and ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 โˆ’ |e ๐œŽ 2๐‘— โˆ’ ๐œŽ 2๐‘— | โ‰ฅ ๐œŽ ๐‘— (๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 ) โˆ’ ((๐œŽ ๐‘— + โˆฅฮ”๐ดโˆฅ) 2 โˆ’ ๐œŽ 2๐‘— ) 34 โ‰ฅ 7๐œŽ ๐‘— โˆฅฮ”๐ดโˆฅ โˆ’ 2๐œŽ ๐‘— โˆฅฮ”๐ดโˆฅ โˆ’ โˆฅฮ”๐ดโˆฅ 2 โ‰ฅ ๐œŽ ๐‘— โˆฅฮ”๐ดโˆฅ, 7 100 as well as ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 โˆ’ |e ๐œŽ 2๐‘— โˆ’ ๐œŽ 2๐‘— | โ‰ฅ ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 โˆ’ 2๐œŽ โˆฅฮ”๐ดโˆฅ โˆ’ โˆฅฮ”๐ดโˆฅ 2 ๐‘— โ‰ฅ ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+12 โˆ’ โˆฅฮ”๐ดโˆฅ(2๐œŽ + โˆฅฮ”๐ดโˆฅ) ๐‘— โ‰ฅ ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+12 โˆ’ 1 (๐œŽ โˆ’ ๐œŽ ) ยท 15 (๐œŽ + ๐œŽ ) ๐‘— ๐‘Ÿ+1 ๐‘— ๐‘Ÿ+1 7 7 2 โ‰ฅ (๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ). 3 With probability at least 1 โˆ’ ๐‘˜4 3 , 4 ๐‘› โˆš ๐‘‡ 3 ๐‘Ÿ โˆฅ๐‘ข๐‘– ๐›ผ22 โˆฅ 4 ๐‘˜ โˆš๏ธ โˆš โˆฅ๐‘คโˆฅ โ‰ค ๐‘ 2 ๐œŽ ( ) ( log ๐‘› + ๐‘Ÿ + ๐‘˜). 2 ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ๐‘Ÿ+12 17 By Theorem 2.5.7, we can see that there are 2 ๐‘˜+1 terms in the expansion of ๐‘‰2๐‘‡ ๐‘‰ e1 that has order ๐‘˜, hence by (2.49), with probability at least 1 โˆ’ 163 , ๐‘› โˆž โˆš ๐‘‡ โˆš๏ธ ๐‘‡ ๐‘‡ โˆ’1 โˆ‘๏ธ ๐‘Ÿ โˆฅ๐‘ข ๐‘– ๐›ผ 22 โˆฅ 8 ๐‘˜ โˆš๏ธ โˆš ( ๐‘Ÿ log ๐‘› + ๐‘Ÿ)โˆฅ๐‘ข๐‘‡๐‘– ๐›ผ22 โˆฅ โˆฅ๐‘ข๐‘– ๐›ผ22๐‘‰2 ๐‘‰1 ฮฃ1 โˆฅ โ‰ค e e 3๐‘ 2 ๐œŽ ( ) ( log ๐‘› + ๐‘Ÿ + ๐‘˜) โ‰ค ๐ถ๐œŽ . ๐‘˜=0 ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ 2 ๐‘Ÿ+1 17 ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ 2 ๐‘Ÿ+1 By the union bound, with probability at least 1 โˆ’ 162 , ๐‘› โˆš๏ธ ( ๐‘Ÿ log ๐‘› + ๐‘Ÿ)โˆฅ๐›ผ22 โˆฅ โˆฅ๐‘ˆ2๐‘ˆ2๐‘‡ ฮ”๐ด๐‘‰2๐‘‰2๐‘‡ ๐‘‰ e1eฮฃ1โˆ’1 โˆฅ 2,โˆž โ‰ค ๐ถ๐œŽ ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ๐‘Ÿ+1 2 โˆš๏ธ โˆš๏ธ ( ๐‘Ÿ log ๐‘› + ๐‘Ÿ)โˆฅฮ”๐ดโˆฅ ๐‘Ÿ log ๐‘› + ๐‘Ÿ = ๐ถ๐œŽ โ‰ค ๐ถ๐œŽ . (๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 )(๐œŽ๐‘Ÿ + ๐œŽ๐‘Ÿ+1 ) ๐œŽ๐‘Ÿ + ๐œŽ๐‘Ÿ+1 Next, we consider โˆฅ๐‘ˆ2 ฮฃ2๐‘‰2๐‘‡ ๐‘‰ e1e ฮฃ1โˆ’1 โˆฅ 2,โˆž . Following the same reasoning, we have with probability at least 1 โˆ’ 162 , ๐‘› โˆš๏ธ โˆš๏ธ ( ๐‘Ÿ log ๐‘› + ๐‘Ÿ)โˆฅฮฃ2 โˆฅ ( ๐‘Ÿ log ๐‘› + ๐‘Ÿ)๐œŽ๐‘Ÿ+1 โˆฅ๐‘ˆ2 ฮฃ2๐‘‰2๐‘‡ ๐‘‰ ฮฃ1โˆ’1 โˆฅ 2,โˆž โ‰ค ๐ถ๐œŽ e1e = ๐ถ๐œŽ . ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ๐‘Ÿ+1 2 ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ๐‘Ÿ+12 Combining the above two bounds, max{โˆฅ๐‘ˆ2 ฮฃ2๐‘‰2๐‘‡ ๐‘‰ e1e ฮฃ1โˆ’1 โˆฅ 2,โˆž , โˆฅ๐‘ˆ2๐‘ˆ2๐‘‡ ฮ”๐ด๐‘‰2๐‘‰2๐‘‡ ๐‘‰ ฮฃ1โˆ’1 โˆฅ 2,โˆž } e1e โˆš๏ธ โˆš๏ธ ( ๐‘Ÿ log ๐‘› + ๐‘Ÿ)๐œŽ๐‘Ÿ+1 ๐‘Ÿ log ๐‘› + ๐‘Ÿ โ‰ค ๐ถ๐œŽ( + ) 2 ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 2 ๐œŽ๐‘Ÿ + ๐œŽ๐‘Ÿ+1 โˆš๏ธ ๐‘Ÿ log ๐‘› + ๐‘Ÿ โ‰ค ๐ถ๐œŽ . ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 โ–ก 101 APPENDIX B APPENDIX FOR CHAPTER 3 ๐œ†ห†๐‘– โˆ’๐œ†๐‘–โˆ— ๐‘˜โ†’โˆž This appendix contains the proof of Theorem 3.3.1 and the proof of | | โˆ’โˆ’โˆ’โˆ’โˆ’โ†’ 0 in Section ๐œ†โˆ— ๐‘– 3.4. B.1 Proof of Theorem 3.3.1 Definition B.1.1. Let M be a compact manifold endowed with a continuous measure ๐œ‡. For any ๐‘ง โˆˆ M, its (๐œ‚(๐œ), ๐œ)-neighborhood N is the neighbourhood with radius ๐œ‚(๐œ) and measure ๐œ, i.e., ๐œ‡(N ) = ๐œ, N = M โˆฉ ๐ต2 (๐‘ง, ๐œ‚(๐œ)), and ๐œ‚(๐œ) = min{๐‘Ÿ : ๐œ‡(M โˆฉ ๐ต2 (๐‘ง, ๐‘Ÿ)) = ๐œ}. Since M is compact, its measure is finite, say ๐œ‡(M) = ๐ด, and the radii of all the ๐œ- neighbourhoods are bounded by some constant ๐œ‚: sup |๐œ‚๐‘– | 2 โ‰ค ๐œ‚2 . ๐‘– Theorem B.1.1 (Full version of Theorem 3.3.1). Given the dataset ๐‘‹ = [๐‘‹1 , ๐‘‹2 , ยท ยท ยท , ๐‘‹๐‘› ], let each ๐‘‹๐‘– be independently drawn from a compact manifold M โІ R ๐‘ with intrinsic dimension ๐‘‘ and endowed with the uniform distribution ๐œ‡. 
Fix some ๐‘ž > 0, let ๐‘‹๐‘– ๐‘— , ๐‘— = 1, . . . , ๐‘˜ ๐‘– be the ๐‘˜ ๐‘– points falling in the (๐œ‚๐‘– , ๐‘ž)-neighbourhood of ๐‘‹๐‘– . Together they form a matrix ๐‘‹ (๐‘–) = [๐‘‹๐‘–1 , . . . , ๐‘‹๐‘– ๐‘˜ , ๐‘‹๐‘– ]. Suppose ๐‘– the i.i.d. projections ๐‘ฆ๐‘–, ๐‘— โ‰ก ๐‘ƒ๐‘‡๐‘‹ (๐‘‹๐‘– ๐‘— โˆ’ ๐‘‹๐‘– ) where ๐‘‡๐‘‹๐‘– is the tangent space at ๐‘‹๐‘– obey the same ๐‘– distribution as some ๐‘Ž๐‘– for all ๐‘—, i.e., ๐‘ฆ๐‘–, ๐‘— โˆผ ๐‘Ž๐‘– (โˆผ means the two vectors are identically distributed), and the matrix E(๐‘Ž๐‘– โˆ’ E๐‘Ž๐‘– )(๐‘Ž๐‘– โˆ’ E๐‘Ž๐‘– ) โˆ— has a finite condition number for each ๐‘–. In addition, suppose the support of the noise matrix ๐‘† (๐‘–) is uniformly distributed among all sets of cardinality ๐‘š๐‘– . For any ๐œ โˆˆ M, let ๐‘‡๐œ be the tangent space of M at ๐œ and define ๐œ‡1 := sup๐œ โˆˆM ๐œ‡(๐‘‡๐œ ). Then as long as ๐‘ž๐‘› โ‰ฅ ๐‘ log ๐‘›, ๐‘‘ < ๐œŒ๐‘Ÿ min{๐‘›๐‘ž/2, ๐‘}๐œ‡1โˆ’1 logโˆ’2 max{2๐‘›๐‘ž, ๐‘}, and ๐‘๐‘˜๐‘– โ‰ค 0.4๐œŒ ๐‘  (here ๐‘, ๐œŒ๐‘Ÿ and ๐œŒ ๐‘  are positive ๐‘š ๐‘– numerical constants), then with probability over 1 โˆ’ ๐‘ 1 (๐‘› max{๐‘›๐‘ž/2, ๐‘}โˆ’10 + exp(โˆ’๐‘ 2 ๐‘›๐‘ž)) for some min{๐‘˜ ๐‘– +1,๐‘}1/2 constants ๐‘ 1 and ๐‘ 2 , the minimizer ๐‘†ห† to (2) with ๐œ†๐‘– = ๐œ–๐‘– , and ๐›ฝ๐‘– = max{๐‘˜ ๐‘– + 1, ๐‘}โˆ’1/2 has the error bound ห† โˆ’ ๐‘† (๐‘–) โˆฅ 2,1 โ‰ค ๐ถ โˆš ๐‘๐‘› ๐‘˜ยฏ โˆฅ๐œ– โˆฅ 2 . โˆ‘๏ธ โˆฅP๐‘– ( ๐‘†) ๐‘– 102 Here ๐‘˜ยฏ = max๐‘– ๐‘˜ ๐‘– satisfying ๐‘›๐‘ž/2 โ‰ค ๐‘˜ยฏ โ‰ค 2๐‘›๐‘ž, ๐œ–๐‘– = โˆฅ ๐‘‹หœ (๐‘–) โˆ’ ๐‘‹๐‘– 1๐‘‡ โˆ’ ๐‘‡ (๐‘–) โˆ’ ๐‘† (๐‘–) โˆฅ ๐น , ๐œ– = [๐œ–1 , ..., ๐œ– ๐‘› ], โˆฅ ยท โˆฅ 2,1 stands for taking โ„“2 norm along rows and โ„“1 norm along the columns, and ๐‘‡ (๐‘–) is the projection of ๐‘‹ (๐‘–) โˆ’ ๐‘‹๐‘– to the tangent space ๐‘‡๐‘‹๐‘– . The proof the Theorem B.1.1 uses similar techniques as Zhan and Vaswani (2015). The main difference is that in Zhan and Vaswani (2015), both the left and the right singular vectors of the data matrix are required to satisfy the coherence conditions, while here we show that only the left singular vectors that corresponding to the tangent spaces are relevant. In other words, the recovery guarantee is built solely upon assumptions on the intrinsic properties of the manifold, i.e., the tangent spaces. The proof architecture is as follows. In Section B.1.1, we derive the error bound in Theorem 3.3.1 under a small coherence condition for both the left and the right singular vectors of ๐ฟ (๐‘–) . In Section B.1.2, we show that the requirement on the right singular vectors can be removed using the i.i.d. assumption on the samples. B.1.1 Deriving the error bound in Theorem B.1.1 under coherence conditions on both the right and the left singular vectors In Section 3.1.4, we explained that ๐ฟ (๐‘–) = ๐‘‹๐‘– 1๐‘‡ +๐‘‡ (๐‘–) corresponds to the linear approximation of the ๐‘–th patch. After the centering C(๐ฟ (๐‘–) ) = C(๐‘‡ (๐‘–) ), one gets rid of the first term and the resulting matrix has a column span coincide with ๐‘‡ (๐‘–) . This indicates that the columns of C(๐ฟ (๐‘–) ) lie in the column space of the tangent space ๐‘ ๐‘๐‘Ž๐‘›(๐‘‡ (๐‘–) ), this also indicates that the rows of ๐ฟ (๐‘–) are in ๐‘ ๐‘๐‘Ž๐‘›{1๐‘‡ ,๐‘‡ (๐‘–) }. One can view the knowledge that 1๐‘‡ is in the row space of ๐ฟ (๐‘–) as a prior knowledge of the left singular vectors of ๐ฟ (๐‘–) . 
Robust PCA with prior knowledge is studied in Zhan and Vaswani (2015), and we will use some of the results therein. Specifically, we adapt the dual certificate approach in Zhan and Vaswani (2015) to derive the error bound for our new problem in the theorem, and choose proper $\lambda_i$, $i=1,2,\cdots,n$ and $\beta_i$ accordingly. We first state the following assumptions from Zhan and Vaswani (2015):

Assumption B.1.1 (Incoherence conditions). In each local patch, $L^{(i)}\in\mathbb{R}^{p\times(k_i+1)}$; denote $n_{(1)}=\max\{p,k_i+1\}$ and $n_{(2)}=\min\{p,k_i+1\}$. Let $\mathcal{C}(L^{(i)})=U_i\Sigma_iV_i^*$ be the singular value decomposition of each $\mathcal{C}(L^{(i)})$, where $U_i\in\mathbb{R}^{p\times d}$, $\Sigma_i\in\mathbb{R}^{d\times d}$, $V_i^*\in\mathbb{R}^{d\times(k_i+1)}$. Let $\tilde{V}_i$ be the orthonormal basis of $span\{1,V_i\}$, and assume that for each $i\in\{1,2,\cdots,n\}$ the following hold with a sufficiently small constant $\rho_r$:
$$\max_j\|U_i^*e_j\|^2\le\frac{\rho_rd}{p},\qquad(B.1)$$
$$\max_j\|\tilde{V}_i^*e_j\|^2\le\frac{\rho_rd}{k_i},\qquad(B.2)$$
$$\|U_iV_i^*\|_\infty\le\sqrt{\frac{\rho_rd}{pk_i}},\qquad(B.3)$$
and $\rho_r$, $\rho_s$, $p$, $k_i$ satisfy the following assumptions:

Assumption B.1.2 (Zhan and Vaswani (2015), Assumption III.2).
(a) $\rho_r\le\min\{10^{-4},C_1\}$,
(b) $\rho_s\le\min\{1-1.5b_1(\rho_r),0.0156\}$,
(c) $n_{(1)}\ge\max\{C_2(\rho_r),1024\}$,
(d) $n_{(2)}\ge100\log^2n_{(1)}$,
(e) $\frac{(p+k_i)^{1/6}}{\log(p+k_i)}>\frac{10.5}{(\rho_s)^{1/6}(1-5.6561\sqrt{\rho_s})}$,
(f) $\frac{pk_i}{500\log n_{(1)}}>\frac{1}{\rho_s^2}$.
Here $b_1(\rho_r)$, $C_2(\rho_r)$ are some constants related to $\rho_r$.

Denote by $\Pi_i$ the linear space of matrices for each local patch (note that this is different from the tangent space $T^{(i)}$ of the manifold):
$$\Pi_i:=\{U_iX^*+Y\tilde{V}_i^*,\;X\in\mathbb{R}^{(k_i+1)\times d},\;Y\in\mathbb{R}^{p\times(d+1)}\}.$$
As shown by Zhan and Vaswani (2015), the following lemma holds, indicating that if the incoherence conditions are satisfied, then with high probability there exists a desirable dual certificate $(W,F)$.

Lemma B.1.2 (Zhan and Vaswani (2015), Lemma V.8, Lemma V.9). For fixed $i=1,2,\cdots,n$, if assumptions (B.1), (B.2), (B.3), Assumption B.1.2 and the other assumptions in Theorem B.1.1 hold, then with probability at least $1-cn_{(1)}^{-10}$, $\|\mathcal{P}_{\Omega_i}\mathcal{P}_{\Pi_i}\|\le1/4$, where $\Omega_i$ is the support set of $S^{(i)}$ and $\beta<\frac{10}{3}$. In addition, there exists a pair $(W_i,F_i)$ obeying
$$U_iV_i^*+W_i=\beta\big(sgn(S^{(i)})+F_i+\mathcal{P}_{\Omega_i}D_i\big),\qquad(B.4)$$
with
$$\mathcal{P}_{\Pi_i}W_i=0,\quad\|W_i\|\le\frac{9}{10},\quad\mathcal{P}_{\Omega_i}F_i=0,\quad\|F_i\|_\infty\le\frac{9}{10},\quad\|\mathcal{P}_{\Omega_i}D_i\|_F\le\frac{1}{4}.\qquad(B.5)$$
Therefore, by the union bound, with probability over $1-cn\,n_{(1)}^{-10}$, for each local patch there exists a pair $(W_i,F_i)$ obeying (B.4) and (B.5). In Section B.1.2, we will show that under our assumption that the data is independently drawn from a manifold $\mathcal{M}\subseteq\mathbb{R}^p$ with intrinsic dimension $d$ endowed with the uniform distribution, (B.2) and (B.3) are satisfied with high probability, so we only need Assumption B.1.2 and (B.1), which relates only to the tangent spaces of the manifold itself.
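The incoherence quantities in (B.1), (B.2) and (B.3) are easy to inspect numerically. The sketch below (an illustration under our own synthetic model, not a verification of the assumption) draws a rank-d patch, forms the basis of span{1, V} as in Assumption B.1.1, and prints the three coherence statistics next to their natural scales d/p, d/k and sqrt(d/(pk)):

import numpy as np

rng = np.random.default_rng(1)
p, k, d = 200, 150, 3
U0, _ = np.linalg.qr(rng.standard_normal((p, d)))           # tangent-space basis
L = U0 @ rng.standard_normal((d, k + 1))                    # a rank-d patch L^{(i)}
G = np.eye(k + 1) - np.ones((k + 1, k + 1)) / (k + 1)       # centering
U, s, Vt = np.linalg.svd(L @ G, full_matrices=False)
U, V = U[:, :d], Vt[:d].T

one = np.ones((k + 1, 1)) / np.sqrt(k + 1)
Vtil, _ = np.linalg.qr(np.hstack([one, V]))                 # basis of span{1, V}

print(np.max(np.sum(U ** 2, axis=1)), d / p)                # (B.1): row norms of U vs d/p
print(np.max(np.sum(Vtil ** 2, axis=1)), d / k)             # (B.2): row norms of V~ vs d/k
print(np.max(np.abs(U @ V.T)), np.sqrt(d / (p * k)))        # (B.3): entries vs sqrt(d/(pk))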
In Lemma B.1.4, we prove that in our setting, where each $X_i$ is drawn from a manifold $\mathcal{M}\subseteq\mathbb{R}^p$ independently and uniformly, with high probability, for all $i=1,2,\cdots,n$, $k_i$ is some integer within the range $[qn/2,2qn]$. Now we use that to prove Theorem B.1.1; the result is stated in the following lemma.

Lemma B.1.3. If for every local patch $i=1,2,\cdots,n$ there exists a pair $(W_i,F_i)$ obeying (B.4) and (B.5), then the minimizer $\hat{S}$ to (2) with $\lambda_i=\frac{\min\{k_i+1,p\}^{1/2}}{\epsilon_i}$ and $\beta_i=\max\{k_i+1,p\}^{-1/2}$ has the error bound
$$\sum_i\|\mathcal{P}_i(\hat{S})-S^{(i)}\|_{2,1}\le C\sqrt{pn}\,\bar{k}\,\|\epsilon\|_2.$$
Here $\epsilon_i=\|\tilde{X}^{(i)}-X_i1^T-T^{(i)}-S^{(i)}\|_F$ and $\epsilon=[\epsilon_1,\dots,\epsilon_n]$ are defined as in Theorem B.1.1.

Proof: To simplify notation, let us start with the problem for only one local patch:
$$\min_{L,S}\;\lambda\|\tilde{X}-L-S\|_F^2+\|LG\|_*+\beta\|S\|_1.\qquad(B.6)$$
Here $\tilde{X}\in\mathbb{R}^{p\times(k+1)}$, where $k$ denotes the number of neighbors in the local patch, and $G=I-\frac{1}{k+1}11^T$ is the centering matrix. Recall that the noisy data $\tilde{X}$ is $\tilde{X}=X+S+E=L+R+S+E$ with $\|R+E\|_F=\|\tilde{X}-L-S\|_F\le\epsilon$ (to be more accurate, $\epsilon_i$ for patch $i$), where $X$ is the clean data on the manifold, $L$ is the first-order Taylor approximation of $X$, $R$ collects the higher-order terms, and $E$ denotes the random noise. Also denote the solution to problem (B.6) as $\hat{L}=L+H_1$, $\hat{S}=S+H_2$. We choose $\beta=\frac{1}{\sqrt{n_{(1)}}}=\frac{1}{\sqrt{\max\{k+1,p\}}}$.

Since $\hat{L}$, $\hat{S}$ are the solution to (B.6), the following holds:
$$\lambda\|\tilde{X}-L-S\|_F^2+\|LG\|_*+\beta\|S\|_1$$
$$\ge\lambda\|\tilde{X}-(L+H_1)-(S+H_2)\|_F^2+\|(L+H_1)G\|_*+\beta\|S+H_2\|_1$$
$$\ge\lambda\|H_1+H_2-(R+E)\|_F^2+\|LG\|_*+\langle H_1G,UV^*+W_0\rangle+\beta\|S\|_1+\beta\langle H_2,sgn(S)+F_0\rangle$$
$$=\lambda\|H_1+H_2\|_F^2+\lambda\|R+E\|_F^2-2\lambda\langle R+E,H_1+H_2\rangle+\|LG\|_*+\langle H_1G,UV^*\rangle+\beta\|S\|_1+\beta\langle H_2,sgn(S)\rangle+\|\mathcal{P}_{\Pi^\perp}(H_1G)\|_*+\beta\|\mathcal{P}_{\Omega^\perp}H_2\|_1.$$
Here we choose $W_0$ and $F_0$ such that $\langle H_1G,W_0\rangle=\|\mathcal{P}_{\Pi^\perp}(H_1G)\|_*$ and $\langle H_2,F_0\rangle=\|\mathcal{P}_{\Omega^\perp}H_2\|_1$, as in Candes et al. (2011). Note that $LG=U\Sigma V^*$, $G=I-\frac{1}{k+1}11^T$ is an orthogonal projector, and $LG1=0$ implies $V^*1=0$, so we have
$$\langle H_1G,UV^*\rangle=\langle H_1,UV^*G\rangle=\Big\langle H_1,UV^*\Big(I-\frac{1}{k+1}11^T\Big)\Big\rangle=\langle H_1,UV^*\rangle,$$
$$\mathcal{P}_{\Pi^\perp}(H_1G)=(I-UU^*)H_1G(I-\tilde{V}\tilde{V}^*)=(I-UU^*)H_1\Big(I-\frac{1}{k+1}11^T\Big)(I-\tilde{V}\tilde{V}^*)=\mathcal{P}_{\Pi^\perp}H_1.$$
For the second equality we use the fact that $1$ lies in the subspace spanned by $\tilde{V}$, so $(I-\tilde{V}\tilde{V}^*)1=0$, and that for any matrix $M$, $\mathcal{P}_{\Pi^\perp}M=(I-UU^*)M(I-\tilde{V}\tilde{V}^*)$.
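As an aside, problem (B.6) can be solved by alternating proximal steps; here is a minimal sketch for one patch, assuming the standard singular-value and entrywise soft-thresholding proximal operators. The step size, iteration count, and the way the prox of ||LG||_* is split through the projector G are our own illustrative choices, not the algorithm used in Chapter 3:

import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: prox of tau*||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft-thresholding: prox of tau*||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def denoise_patch(Xt, lam, beta, iters=300):
    # proximal gradient on  lam*||Xt - L - S||_F^2 + ||L G||_* + beta*||S||_1
    p, k1 = Xt.shape
    G = np.eye(k1) - np.ones((k1, k1)) / k1   # centering projector
    L, S = np.zeros_like(Xt), np.zeros_like(Xt)
    step = 1.0 / (4.0 * lam)                  # gradient of the smooth term is 2*lam*(L+S-Xt)
    for _ in range(iters):
        R = L + S - Xt
        Lg = L - step * 2 * lam * R
        # G is an orthogonal projector, so the prox of ||LG||_* thresholds the
        # centered part Lg @ G and leaves the column-mean part Lg @ (I - G) unchanged
        L = svt(Lg @ G, step) + Lg @ (np.eye(k1) - G)
        S = soft(S - step * 2 * lam * R, step * beta)
    return L, S

This splitting works because L @ G and L @ (I - G) are orthogonal in the Frobenius inner product, so the two parts of the proximal problem decouple.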
Denote ๐ป = ๐ป1 + ๐ป2 , plug in the equations above, we obtain 2๐œ†โŸจ๐‘… + ๐ธ, ๐ปโŸฉ โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น + โŸจ๐ป1 + ๐ป2 ,๐‘ˆ๐‘‰ โˆ— โŸฉ + โŸจ๐ป2 , ๐›ฝ๐‘ ๐‘”๐‘›(๐‘†) โˆ’ ๐‘ˆ๐‘‰ โˆ— โŸฉ + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + ๐›ฝโˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ โˆฅ๐ป โˆฅ ๐น โˆฅ๐‘ˆ๐‘‰ โˆ— โˆฅ ๐น + โŸจ๐ป2 ,๐‘Š โˆ’ ๐›ฝ๐น โˆ’ ๐›ฝPฮฉ ๐ทโŸฉ + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + ๐›ฝโˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 9 9 ๐›ฝ โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ ๐‘› (2) โˆฅ๐ป โˆฅ ๐น โˆ’ โˆฅPฮ  โŠฅ ๐ป2 โˆฅ โˆ— โˆ’ ๐›ฝโˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 โˆ’ โˆฅPฮฉ ๐ป2 โˆฅ ๐น + โˆš๏ธ 10 10 4 โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + ๐›ฝโˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 . In the 3rd inequality we used 9 |โŸจ๐ป2 ,๐‘ŠโŸฉ| = |โŸจ๐ป2 , Pฮ  โŠฅ ๐‘ŠโŸฉ| = |โŸจPฮ  โŠฅ ๐ป2 ,๐‘ŠโŸฉ| โ‰ค โˆฅPฮ  โŠฅ ๐ป2 โˆฅ โˆ— โˆฅ๐‘Š โˆฅ โ‰ค โˆฅP โŠฅ ๐ป โˆฅ โˆ— , 10 ฮ  2 106 9 |โŸจ๐ป2 , ๐นโŸฉ| = |โŸจ๐ป2 , PฮฉโŠฅ ๐นโŸฉ| = |โŸจPฮฉโŠฅ ๐ป2 , ๐นโŸฉ| โ‰ค โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 โˆฅ๐น โˆฅ โˆž โ‰ค โˆฅP โŠฅ ๐ป โˆฅ , 10 ฮฉ 2 1 and 1 |โŸจ๐ป2 , Pฮฉ ๐ทโŸฉ| โ‰ค |โŸจPฮฉ ๐ป2 , Pฮฉ ๐ทโŸฉ| โ‰ค โˆฅP ๐ป โˆฅ . 4 ฮฉ 2 ๐น Assume โˆฅ๐‘… + ๐ธ โˆฅ ๐น โ‰ค ๐œ–, for all ๐‘– = 1, 2, ยท ยท ยท , ๐‘›. Also note that โˆฅPฮฉ ๐ป2 โˆฅ ๐น โ‰ค โˆฅPฮฉ Pฮ  ๐ป2 โˆฅ ๐น + โˆฅPฮฉ Pฮ  โŠฅ ๐ป2 โˆฅ ๐น 1 โ‰ค โˆฅ๐ป โˆฅ + โˆฅPฮ  โŠฅ ๐ป2 โˆฅ ๐น 4 2 ๐น 1 1 โ‰ค โˆฅPฮฉ ๐ป2 โˆฅ ๐น + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ ๐น + โˆฅPฮ  โŠฅ ๐ป2 โˆฅ ๐น , 4 4 then we have 1 4 1 4 โˆฅPฮฉ ๐ป2 โˆฅ ๐น โ‰ค โˆฅPฮฉโŠฅ ๐ป2 โˆฅ ๐น + โˆฅPฮ  โŠฅ ๐ป2 โˆฅ ๐น โ‰ค โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป2 โˆฅ โˆ— . 3 3 3 3 Plug into the previous inequality, also note that for ๐‘› (1) โ‰ฅ 16, ๐›ฝ = โˆš๐‘›1 โ‰ค 14 , it gives (1) 2๐œ†๐œ– โˆฅ๐ป โˆฅ ๐น 9 ๐›ฝ ๐›ฝ โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ ๐‘› (2) โˆฅ๐ป โˆฅ ๐น โˆ’ ( + )โˆฅPฮ  โŠฅ ๐ป2 โˆฅ โˆ— + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— โˆš๏ธ 10 3 60 ๐›ฝ 1 59 โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ ๐‘› (2) โˆฅ๐ป โˆฅ ๐น + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + (โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— โˆ’ โˆฅPฮ  โŠฅ ๐ป2 โˆฅ โˆ— ) โˆš๏ธ 60 60 60 ๐›ฝ 1 59 = ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ ๐‘› (2) โˆฅ๐ป โˆฅ ๐น + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + (โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— โˆ’ โˆฅPฮ  โŠฅ (โˆ’๐ป2 )โˆฅ โˆ— ) โˆš๏ธ 60 60 60 ๐›ฝ 1 59 โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ ๐‘› (2) โˆฅ๐ป โˆฅ ๐น + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— โˆ’ โˆฅPฮ  โŠฅ (๐ป1 + ๐ป2 )โˆฅ โˆ— โˆš๏ธ 60 60 60 ๐›ฝ 1 โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ ๐‘› (2) โˆฅ๐ป โˆฅ ๐น + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— โˆ’ โˆฅ๐ป โˆฅ โˆ— . โˆš๏ธ 60 60 The last inequality is due to โˆฅPฮ  โŠฅ ๐ป โˆฅ โˆ— = sup โŸจPฮ  โŠฅ ๐ป, ๐‘‹โŸฉ = sup โŸจ๐ป, Pฮ  โŠฅ ๐‘‹โŸฉ โˆฅ ๐‘‹ โˆฅ 2 โ‰ค1 โˆฅ ๐‘‹ โˆฅ 2 โ‰ค1 โ‰ค sup โŸจ๐ป, Pฮ  โŠฅ ๐‘‹โŸฉ โ‰ค sup โŸจ๐ป, ๐‘‹โŸฉ = โˆฅ๐ป โˆฅ โˆ— . โˆฅPฮ  โŠฅ ๐‘‹ โˆฅ 2 โ‰ค1 โˆฅ ๐‘‹ โˆฅ 2 โ‰ค1 โˆš Note that โˆฅ๐ป โˆฅ โˆ— โ‰ค ๐‘› (2) โˆฅ๐ป โˆฅ ๐น , then we obtain ๐›ฝ 1 2๐œ†๐œ– โˆฅ๐ป โˆฅ ๐น โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ 2 ๐‘› (2) โˆฅ๐ป โˆฅ ๐น + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— . โˆš๏ธ 60 60 107 Rewrite this inequality gives ๐›ฝ 1 โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— โ‰ค โˆ’๐œ†โˆฅ๐ป โˆฅ 2๐น + 2( ๐‘› (2) + ๐œ†๐œ–)โˆฅ๐ป โˆฅ ๐น โˆš๏ธ 60 60 โˆš โˆš ๐‘› (2) + ๐œ†๐œ– 2 ๐‘› (2) โˆš = โˆ’๐œ†(โˆฅ๐ป โˆฅ ๐น โˆ’ ( )) + ( โˆš + ๐œ†๐œ–) 2 ๐œ† ๐œ† โˆš ๐‘› (2) โˆš 2 โ‰ค ( โˆš + ๐œ†๐œ–) . 
๐œ† Recall that in our original optimization problem, we should consider above inequalities for the summation of all the local patches, denote โ„Ž๐‘– โ‰ก โˆฅ๐ป (๐‘–) โˆฅ ๐น , then ๐‘› ๐‘› โˆ‘๏ธ ๐›ฝ๐‘– (๐‘–) 1 โˆ‘๏ธ (๐‘–) โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— 60 ๐‘– 60 ๐‘– ๐‘–=1 ๐‘–=1 โˆ‘๏ธ๐‘› โˆ’๐œ†๐‘– โˆฅ๐ป (๐‘–) โˆฅ 2๐น + 2 min{๐‘˜ ๐‘– + 1, ๐‘}โˆฅ๐ป (๐‘–) โˆฅ ๐น + 2๐œ†๐‘– ๐œ–๐‘– โˆฅ๐ป (๐‘–) โˆฅ ๐น โˆš๏ธ โ‰ค ๐‘–=1 ๐‘› โˆ‘๏ธ โˆ’๐œ†๐‘– โ„Ž๐‘–2 + 2 min{๐‘˜ ๐‘– + 1, ๐‘}โ„Ž๐‘– + 2๐œ†๐‘– ๐œ–๐‘– โ„Ž๐‘– โˆš๏ธ = ๐‘–=1 ๐‘› โˆš๏ธ โˆš๏ธ โˆ‘๏ธ min{๐‘˜ ๐‘– + 1, ๐‘} + ๐œ†๐‘– ๐œ–๐‘– 2 min{๐‘˜ ๐‘– + 1, ๐‘} โˆš๏ธ = โˆ’๐œ†๐‘– (โ„Ž๐‘– โˆ’ ) +( โˆš + ๐œ†๐‘– ๐œ–๐‘– ) 2 ๐œ†๐‘– ๐œ†๐‘– ๐‘–=1 โˆ‘๏ธ๐‘› โˆš๏ธ โ‰ค4 min{๐‘˜ ๐‘– + 1, ๐‘}๐œ–๐‘– , ๐‘–=1 โˆš min{๐‘˜ ๐‘– +1,๐‘} 1 where we choose ๐œ†๐‘– = ๐œ–๐‘– , and ๐›ฝ๐‘– = โˆš . max{๐‘˜ ๐‘– +1,๐‘} ร๐‘› (๐‘–) ร๐‘› (๐‘–) Then we have the bound for ๐‘–=1 โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— and ๐‘–=1 โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 , ๐‘– ๐‘– ๐‘› ๐‘› โˆš โˆš๏ธƒ โˆš๏ธƒ (๐‘–) โˆ‘๏ธ โˆ‘๏ธ โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— ยฏ โ‰ค ๐ถ min{ ๐‘˜, ๐‘} ๐œ–๐‘– โ‰ค ๐ถ min{ ๐‘˜, ยฏ ๐‘} ๐‘›โˆฅ๐œ– โˆฅ 2 , ๐‘– ๐‘–=1 ๐‘–=1 108 ๐‘› ๐‘› โˆš๏ธ (๐‘–) โˆ‘๏ธ โˆš๏ธƒ โˆ‘๏ธ โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 โ‰ค ๐ถ max max{๐‘˜ ๐‘– , ๐‘} min{๐‘˜ ๐‘– , ๐‘}๐œ–๐‘– ๐‘– ๐‘– ๐‘–=1 ๐‘–=1 โˆš๏ธƒ ๐‘› โˆš๏ธ โˆ‘๏ธ = ๐ถ max{ ๐‘˜, ๐‘} ยฏ min{๐‘˜ ๐‘– , ๐‘}๐œ–๐‘– ๐‘–=1 โˆš๏ธƒ โˆ‘๏ธ๐‘› ยฏ โ‰ค ๐ถ max{ ๐‘˜, ๐‘} min{ ๐‘˜, ๐‘} ยฏ ๐œ–๐‘– ๐‘–=1 โˆš โˆš๏ธƒ โ‰ค ๐ถ ๐‘ ๐‘˜ยฏ ๐‘›โˆฅ๐œ– โˆฅ 2 . (๐‘–) Denote ๐ป2 โ‰ก P๐‘– ( ๐‘†) ห† โˆ’ ๐‘† (๐‘–) , to estimate the error bound of ร๐‘› โˆฅ๐ป (๐‘–) โˆฅ 2,1 , we decompose ๐ป (๐‘–) into ๐‘–=1 2 2 three parts, for each ๐‘– = 1, 2, ยท ยท ยท ๐‘› (๐‘–) (๐‘–) (๐‘–) (๐‘–) โˆฅ๐ป2 โˆฅ ๐น โ‰ค โˆฅ(๐ผ โˆ’ Pฮฉ๐‘– )๐ป2 โˆฅ ๐น + โˆฅ(Pฮฉ๐‘– โˆ’ Pฮฉ๐‘– Pฮ ๐‘– )๐ป2 โˆฅ ๐น + โˆฅPฮฉ๐‘– Pฮ ๐‘– ๐ป2 โˆฅ ๐น (๐‘–) (๐‘–) 1 (๐‘–) โ‰ค โˆฅPฮฉโŠฅ ๐ป2 โˆฅ ๐น + โˆฅPฮ  โŠฅ ๐ป2 โˆฅ ๐น + โˆฅ๐ป2 โˆฅ ๐น , ๐‘– ๐‘– 4 which leads to (๐‘–) 4 (๐‘–) (๐‘–) โˆฅ๐ป2 โˆฅ ๐น โ‰ค (โˆฅPฮฉโŠฅ ๐ป2 โˆฅ ๐น + โˆฅPฮ  โŠฅ ๐ป2 โˆฅ ๐น ) 3 ๐‘– ๐‘– 4 (๐‘–) (๐‘–) = (โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + โˆฅPฮ  โŠฅ ๐ป (๐‘–) โˆฅ ๐น ) 3 ๐‘– ๐‘– ๐‘– 4 (๐‘–) (๐‘–) โ‰ค (โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + โˆฅ๐ป (๐‘–) โˆฅ ๐น ). 
3 ๐‘– ๐‘– โˆš ร๐‘› (๐‘–) min{๐‘˜ ๐‘– +1,๐‘} Next, we need to bound ๐‘–=1 โˆฅ๐ป โˆฅ ๐น , note that ๐œ†๐‘– = ๐œ– and ๐‘– โˆ‘๏ธ๐‘› โˆ’๐œ†๐‘– โ„Ž๐‘–2 + 2 min{๐‘˜ ๐‘– + 1, ๐‘}โ„Ž๐‘– + 2๐œ†๐‘– ๐œ–๐‘– โ„Ž๐‘– โ‰ฅ 0, โˆš๏ธ ๐‘–=1 which gives โˆš๏ธƒ โˆ‘๏ธ๐‘› ๐‘› โˆš๏ธ โˆ‘๏ธ ยฏ 4 min{ ๐‘˜ + 1, ๐‘} โ„Ž๐‘– โ‰ฅ 4 min{๐‘˜ ๐‘– + 1, ๐‘}โ„Ž๐‘– ๐‘–=1 ๐‘–=1 โˆ‘๏ธ๐‘› โˆš๏ธ โ„Ž2 โ‰ฅ min{๐‘˜ ๐‘– + 1, ๐‘} ๐‘– ๐œ–๐‘– ๐‘–=1 โˆš๏ธƒ ๐‘› โ„Ž2 โˆ‘๏ธ ๐‘– โ‰ฅ min{๐‘˜ + 1, ๐‘} , ๐œ–๐‘– ๐‘–=1 109 by Cauchy inequality ๐‘› โ„Ž2 โ„Ž๐‘– ) 2 ( ๐‘–=1 โ„Ž๐‘– ) 2 ร๐‘› ร๐‘› โˆ‘๏ธ ๐‘– ( ๐‘–=1 โ‰ฅ ร๐‘› โ‰ฅ โˆš , ๐‘–=1 ๐‘– ๐œ– ๐‘–=1 ๐œ–๐‘– ๐‘›โˆฅ๐œ– โˆฅ 2 then we obtain โˆš๏ธ„ ๐‘› โˆ‘๏ธ min{ ๐‘˜ยฏ + 1, ๐‘} โˆš โˆš โ„Ž๐‘– โ‰ค 4 ๐‘›โˆฅ๐œ– โˆฅ 2 โ‰ค ๐ถ ๐‘›โˆฅ๐œ– โˆฅ 2 , min{๐‘˜ + 1, ๐‘} ๐‘–=1 โ‰ค ๐‘˜ โ‰ค ๐‘˜ยฏ โ‰ค 2๐‘›๐‘ž, which is guaranteed with high probability by Lemma ๐‘›๐‘ž the last inequality is due to 2 B.1.4, thus ๐‘› ๐‘› ๐‘› ๐‘› โˆ‘๏ธ (๐‘–) 4 โˆ‘๏ธ (๐‘–) โˆ‘๏ธ (๐‘–) โˆ‘๏ธ โˆฅ๐ป2 โˆฅ ๐น โ‰ค ( โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + โˆฅ๐ป (๐‘–) โˆฅ ๐น ) 3 ๐‘– ๐‘– ๐‘–=1 ๐‘–=1 ๐‘–=1 ๐‘–=1 ๐‘› ๐‘› ๐‘› 4 โˆ‘๏ธ (๐‘–) โˆ‘๏ธ (๐‘–) โˆ‘๏ธ = ( โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + โ„Ž๐‘– ) 3 ๐‘– ๐‘– ๐‘–=1 ๐‘–=1 ๐‘–=1 โˆš๏ธƒ โ‰ค ๐ถ ๐‘ ๐‘˜๐‘›โˆฅ๐œ–ยฏ โˆฅ2. (๐‘–) (๐‘–) Now letโ€™s divide ๐ป2 into columns to get the โ„“2,1 norm error bound, denote (๐ป2 ) ๐‘— as the ๐‘—th (๐‘–) column in ๐ป2 , then we can derive the โ„“2,1 norm error bound in Lemma B.1.3 v u u +1 โˆš๏ธƒ ๐‘› ๐‘› t ๐‘˜โˆ‘๏ธ ๐‘– (๐‘–) (๐‘–) โˆ‘๏ธ โˆ‘๏ธ ๐ถ ๐‘ ๐‘˜๐‘›โˆฅ๐œ–ยฏ โˆฅ2 โ‰ฅ โˆฅ๐ป2 โˆฅ ๐น = โˆฅ(๐ป2 ) ๐‘— โˆฅ 22 ๐‘–=1 ๐‘–=1 ๐‘—=1 v u u ๐‘› โˆ‘๏ธ t ๐‘˜ ๐‘– 1 โˆ‘๏ธ (๐‘–) โ‰ณ ( โˆฅ(๐ป2 ) ๐‘— โˆฅ 2 ) 2 ๐‘˜๐‘– ๐‘–=1 ๐‘—=1 ๐‘› โˆ‘๏ธ ๐‘˜๐‘– 1 โˆ‘๏ธ (๐‘–) โ‰ณโˆš โˆฅ(๐ป2 ) ๐‘— โˆฅ 2 . ๐‘˜ยฏ ๐‘–=1 ๐‘—=1 Then we obtain ๐‘› ๐‘˜โˆ‘๏ธ ๐‘– +1 โˆ‘๏ธ ห† โˆ’ ๐‘† (๐‘–) โˆฅ โˆ‘๏ธ (๐‘–) โˆš โˆฅP๐‘– ( ๐‘†) 2,1 = โˆฅ(๐ป2 ) ๐‘— โˆฅ 2 โ‰ค ๐ถ ๐‘๐‘› ๐‘˜ยฏ โˆฅ๐œ– โˆฅ 2 . ๐‘– ๐‘–=1 ๐‘—=1 โ–ก ๐‘ž๐‘› Lemma B.1.4. If ๐‘ž๐‘› โ‰ฅ 9 log ๐‘›, with probability at least 1 โˆ’ 2 exp(โˆ’๐‘ 3 ๐‘ž๐‘›), 2 โ‰ค ๐‘˜ ๐‘– โ‰ค 2๐‘ž๐‘›, for all ๐‘– = 1, 2, ยท ยท ยท , ๐‘›, here ๐‘ 3 is some constants not related to ๐‘ž and ๐‘›. 110 Proof: Since each ๐‘‹๐‘– is drawn from a manifold M โІ R ๐‘ independently and uniformly, for some fixed (๐œ‚๐‘–0 , ๐‘ž)-neighborhood of ๐‘‹๐‘–0 , for each ๐‘— = {1, 2, ยท ยท ยท , ๐‘›}\{๐‘– 0 }, the probability that ๐‘‹ ๐‘— falls into (๐œ‚๐‘–0 , ๐‘ž)-neighborhood is ๐‘ž. Since {๐‘‹๐‘– }๐‘–=1,2,ยทยทยท๐‘› , ๐‘˜ ๐‘– follows i.i.d binomial distribution ๐ต(๐‘›, ๐‘ž), we can apply large deviations inequalities to derive an upper and lower bound for ๐‘˜ ๐‘– . By Theorem 1 in Janson (2016), we have that for each ๐‘– = 1, 2, ยท ยท ยท , ๐‘› (๐‘ž๐‘›) 2 3 P(๐‘˜ ๐‘– > 2๐‘ž๐‘›) โ‰ค exp(โˆ’ ) โ‰ค exp(โˆ’ ๐‘ž๐‘›), 2(๐‘ž๐‘›(1 โˆ’ ๐‘ž) + ๐‘ž๐‘›/3) 8 ๐‘ž๐‘› (๐‘ž๐‘›/2) 2 1 P(๐‘˜ ๐‘– < ) โ‰ค exp(โˆ’ ) = exp(โˆ’ ๐‘ž๐‘›). 2 2๐‘ž๐‘› 8 Therefore by Union Bound Theorem ๐‘ž๐‘› 3 1 P( โ‰ค ๐‘˜ ๐‘– โ‰ค 2๐‘ž๐‘›, โˆ€๐‘– = 1, 2, ยท ยท ยท ๐‘›) โ‰ฅ 1 โˆ’ ๐‘›(exp(โˆ’ ๐‘ž๐‘›) + exp(โˆ’ ๐‘ž๐‘›)) 2 8 8 1 โ‰ฅ 1 โˆ’ 2๐‘› exp(โˆ’ ๐‘ž๐‘›) 8 1 = 1 โˆ’ 2 exp(โˆ’ ๐‘ž๐‘› + log ๐‘›) 8 1 โ‰ฅ 1 โˆ’ 2 exp(โˆ’ ๐‘ž๐‘›). 
72 โ–ก B.1.2 Removing (B.2) and (B.3) in Assumption B.1.1 We will show that under our assumption that points are uniformly drawn from the manifold, (B.2) and (B.3) in Assumption B.1.1 automatically hold provided (B.1) holds, thus they can be removed from the requirements. Let us again restrict our attention to an individual patch and for the simplicity of notation, ignore the superscript ๐‘– (the treatment for all patches are the same). Recall that C(๐ฟ) = C(๐‘‡) = ๐‘ˆฮฃ๐‘‰ โˆ— , and ๐‘‰หœ is the orthonormal basis of ๐‘ ๐‘๐‘Ž๐‘›([1,๐‘‰]), since 0 = C(๐‘‡)1 = ๐‘ˆฮฃ๐‘‰ โˆ— 1, we have ๐‘‰ โˆ— 1 = 0, then 1 โŠฅ ๐‘ ๐‘๐‘Ž๐‘›(๐‘‰), thus we can write one basis for ๐‘ ๐‘๐‘Ž๐‘›([1,๐‘‰]) as [ โˆš 1 1,๐‘‰], which indicates that ๐‘˜+1 in order to remove (B.2), we only need to show that with high probability, ๐‘‰ has small coherence. Also, recall that ๐‘‡ (๐‘–) = ๐‘ƒ๐‘‡๐‘‹ (๐‘‹ (๐‘–) โˆ’ ๐‘‹๐‘– 1๐‘‡ ), since each ๐‘‹๐‘– is independent, each column in ๐‘‡ (๐‘–) is ๐‘– also independent. In addition, each column is in the span of the tangent space with ๐‘ˆ being 111 an orthonormal basis. Therefore ๐‘‡ = ๐‘ˆฮ› โ‰ก ๐‘ˆ [๐›ผ1 , ๐›ผ2 , ..., ๐›ผ ๐‘˜ , 0], where ๐›ผ๐‘– , ๐‘– = 1, 2, ยท ยท ยท , ๐‘˜ is the ๐‘–th column of ฮ›, which corresponds to the coefficients of the ๐‘–th column of ๐‘‡ under ๐‘ˆ, the last column is zero vector since it corresponds to ๐‘‹๐‘– itself. Since columns of ๐‘‡ are i.i.d, then ๐›ผ๐‘– s are also i.i.d., so they all obey the same distribution as a random vector ๐›ผ. We establish the following lemma for the right singular vectors of ๐‘‡. Lemma B.1.5. Let C(๐‘‡) = ๐‘ˆฮฃ๐‘‰ โˆ— be the reduced singular vector decomposition of C(๐‘‡), assume ๐ถ โ‰ก E((๐›ผ โˆ’ E๐›ผ)(๐›ผ โˆ’ E๐›ผ) โˆ— ) has a finite condition number. Then, with probability at least 1 โˆ’ 2๐‘‘ exp(โˆ’๐‘๐‘˜), the right singular vector ๐‘‰ obeys ๐‘ max โˆฅ๐‘‰ โˆ— e ๐‘— โˆฅ 2 โ‰ค , 1โ‰ค ๐‘— โ‰ค๐‘˜ ๐‘˜ and with (1) in Assumption B.1.1 โˆš๏ธ„ ๐‘๐‘‘ โˆฅ๐‘ˆ๐‘‰ โˆ— โˆฅ โˆž โ‰ค . ๐‘๐‘˜ Proof: As discussed above, C(๐‘‡) has the following representation C(๐‘‡) = ๐‘‡๐บ = ๐‘ˆ [๐›ผ1 , ๐›ผ2 , ยท ยท ยท , ๐›ผ ๐‘˜ , 0]๐บ , where ๐‘ˆ โˆˆ R ๐‘ร—๐‘‘ is an orthonormal basis of the tangent space, and ฮ› = [๐›ผ1 , ๐›ผ2 , ..., ๐›ผ ๐‘˜ , 0] โˆˆ R๐‘‘ร—(๐‘˜+1) is the coefficients of randomly drawn points in a neighbourhood projected to the tangent space. Since points are randomly drawn from an neighbourhood contained in a ball of radius at most ๐œ‚, one can easily verify that โˆฅ๐›ผ ๐‘— โˆฅ 2 โ‰ค ๐œ‚ for each ๐‘— = 1, ..., ๐‘˜. Assume ๐‘‡๐บ and ฮ› have the reduced SVD of the form ๐‘‡๐บ = ๐‘ˆฮฃ๐‘‰ โˆ— , ฮ›๐บ = ๐‘ˆฮ› ฮฃฮ›๐‘‰ฮ›โˆ— , Then ๐‘‡ can be written as ๐‘‡๐บ = ๐‘ˆฮฃ๐‘‰ โˆ— = ๐‘ˆ๐‘ˆฮ› ฮฃฮ›๐‘‰ฮ›โˆ— . It can be verified that null(๐‘‡๐บ) is the span of columns in (๐‘‰ฮ› ) ๐ถ , then we have ๐‘ ๐‘๐‘Ž๐‘›(๐‘‰ฮ› ) = ๐‘ ๐‘๐‘Ž๐‘›(๐‘‰), since both ๐‘‰ฮ› and ๐‘‰ are orthonormal, they are equal up to a rotation, i.e. โˆƒ๐‘… โˆˆ R๐‘‘,๐‘‘ , ๐‘… โˆ— ๐‘… = ๐‘…๐‘… โˆ— = ๐ผ, such that ๐‘‰ = ๐‘‰ฮ› ๐‘…. Then max โˆฅ๐‘‰ โˆ— e ๐‘— โˆฅ 2 = max โˆฅ๐‘… โˆ—๐‘‰ฮ›โˆ— e ๐‘— โˆฅ 2 = max โˆฅ๐‘‰ฮ›โˆ— e ๐‘— โˆฅ 2 . 1โ‰ค ๐‘— โ‰ค๐‘˜ 1โ‰ค ๐‘— โ‰ค๐‘˜ 1โ‰ค ๐‘— โ‰ค๐‘˜ 112 Next we bound the coherence of ๐‘‰ฮ› . 
Since ๐‘‰ฮ›โˆ— = ฮฃฮ› โˆ’1๐‘ˆ โˆ— ฮ›๐บ, we have ฮ› max โˆฅฮฃฮ› โˆ’1๐‘ˆ โˆ— ฮ›๐บe โˆฅ โ‰ค โˆฅฮฃโˆ’1 โˆฅ max โˆฅ๐‘ˆ โˆ— ฮ›๐บe โˆฅ 1โ‰ค ๐‘— โ‰ค๐‘˜ ฮ› ๐‘— ฮ› 1โ‰ค๐‘–โ‰ค๐‘˜ ฮ› ๐‘— = โˆฅฮฃฮ›โˆ’1 โˆฅ max โˆฅฮ›๐บe โˆฅ ๐‘— 1โ‰ค ๐‘— โ‰ค๐‘˜ โ‰ค โˆฅฮฃฮ›โˆ’1 โˆฅ max โˆฅ๐›ผ โˆ’ ๐›ผโˆฅ ๐‘— ยฏ 1โ‰ค ๐‘— โ‰ค๐‘˜ โ‰ค 2๐œ‚โˆฅฮฃฮ› โˆ’1 โˆฅ. Recall that 1 ฮ›๐บ = [๐›ผ1 , ๐›ผ2 , ยท ยท ยท , ๐›ผ ๐‘˜ , 0] (๐ผ โˆ’ 11๐‘‡ ) ๐‘˜ +1 = [๐›ผ1 โˆ’ ๐›ผ, ยฏ ๐›ผ2 โˆ’ ๐›ผ, ยฏ ยท ยท ยท , ๐›ผ ๐‘˜ โˆ’ ๐›ผ, ยฏ โˆ’๐›ผ]ยฏ = [๐›ผ1 โˆ’ E๐›ผ, ๐›ผ2 โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผ ๐‘˜ โˆ’ E๐›ผ, 0] โˆ’ [ ๐›ผยฏ โˆ’ E๐›ผ, ๐›ผยฏ โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผยฏ โˆ’ E๐›ผ, ๐›ผ], ยฏ 1 ร ๐‘˜ ๐›ผ , thus where ๐›ผยฏ = ๐‘˜+1 ๐‘–=1 ๐‘– |๐œŽ๐‘‘ (ฮ›๐บ) โˆ’ ๐œŽ๐‘‘ ([๐›ผ1 โˆ’ E๐›ผ, ๐›ผ2 โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผ ๐‘˜ โˆ’ E๐›ผ, 0])| โ‰คโˆฅ [ ๐›ผยฏ โˆ’ E๐›ผ, ๐›ผยฏ โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผยฏ โˆ’ E๐›ผ, ๐›ผ] ยฏ โˆฅ2 โ‰คโˆฅ [ ๐›ผยฏ โˆ’ E๐›ผ, ๐›ผยฏ โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผยฏ โˆ’ E๐›ผ, ๐›ผยฏ โˆ’ E๐›ผ] โˆฅ 2 + โˆฅE๐›ผโˆฅ 2 (B.7) โˆš 1 โˆ‘๏ธ ๐‘˜ 1 โ‰ค ๐‘˜ + 1โˆฅ (๐›ผ๐‘– โˆ’ E๐›ผ) โˆ’ E๐›ผโˆฅ 2 + ๐œ‚ ๐‘˜ +1 ๐‘˜ +1 ๐‘–=1 ๐‘˜ 1 โˆ‘๏ธ โ‰คโˆฅ โˆš (๐›ผ๐‘– โˆ’ E๐›ผ)โˆฅ 2 + 2๐œ‚ ๐‘˜ + 1 ๐‘–=1 First, we want to use Bernstein Matrix Inequality to bound the โ„“2 -norm in the last inequality. Denote ๐›ฝ๐‘– = โˆš 1 (๐›ผ๐‘– โˆ’ E๐›ผ), ๐‘ = ๐‘–=1 ร๐‘˜ ๐›ฝ๐‘– , then ๐›ฝ๐‘– is independent, we also have ๐‘˜+1 1 2๐œ‚ E๐›ฝ๐‘– = 0, โˆฅ ๐›ฝ๐‘– โˆฅ 2 โ‰ค โˆš (โˆฅ๐›ผ๐‘– โˆฅ 2 + โˆฅE๐›ผโˆฅ 2 ) โ‰ค โˆš , ๐‘˜ +1 ๐‘˜ 113 which means ๐›ฝ๐‘– has mean zero and is uniformly bounded, also ๐œˆ(๐‘) = max{โˆฅE(๐‘ ๐‘ โˆ— )โˆฅ 2 , โˆฅE(๐‘ โˆ— ๐‘)โˆฅ 2 } โˆ‘๏ธ๐‘› ๐‘› โˆ‘๏ธ = max{โˆฅ E(๐›ฝ๐‘– ๐›ฝ๐‘–โˆ— )โˆฅ 2 , โˆฅ E(๐›ฝ๐‘–โˆ— ๐›ฝ๐‘– )โˆฅ 2 } ๐‘–=1 ๐‘–=1 ๐‘˜ = max{โˆฅE(๐›ผ๐‘– โˆ’ E๐›ผ)(๐›ผ๐‘– โˆ’ E๐›ผ)๐‘‡ โˆฅ 2 , E tr((๐›ผ๐‘– โˆ’ E๐›ผ)(๐›ผ๐‘– โˆ’ E๐›ผ)๐‘‡ ))} ๐‘˜ +1 < max{โˆฅ๐ถ โˆฅ 2 , tr(๐ถ)} < ๐‘‘๐œŽ1 (๐ถ). By assumption, ๐ถ has finite condition number, and ๐‘‘ โ‰ช ๐‘˜, by Matrix Bernstein inequality, we are able to bound the spectral norm of ๐‘ โˆ’๐‘ก 2 ๐‘ƒ(โˆฅ๐‘ โˆฅ 2 โ‰ฅ ๐‘ก) โ‰ค (๐‘‘ + 1) exp( ) 2๐œ‚๐‘ก ๐‘‘๐œŽ1 (๐ถ) + โˆš 3 ๐‘˜ โˆš ๐œŽ๐‘‘ (๐ถ)๐‘˜ Let ๐‘ก = 4 ,we have โˆš๏ธ ๐œŽ๐‘‘ (๐ถ)๐‘˜ ๐‘ƒ(โˆฅ๐‘ โˆฅ 2 โ‰ฅ ) โ‰ค ๐‘‘ exp(โˆ’๐‘๐‘˜). (B.8) 4 Next, equipped with Matrix Bernstein inequality again, we can prove that ๐œŽ๐‘‘ ([๐›ผ1 โˆ’ E๐›ผ, ๐›ผ2 โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผ ๐‘˜ โˆ’ E๐›ผ, 0]) concentrates around ๐œŽ๐‘‘ (๐ถ). Note that ๐œŽ๐‘‘2 ([๐›ผ1 โˆ’ E๐›ผ, ๐›ผ2 โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผ ๐‘˜ โˆ’ ร๐‘˜ E๐›ผ, 0]) = ๐œŽ๐‘‘ ( ๐‘–=1 (๐›ผ๐‘– โˆ’ E๐›ผ)(๐›ผ๐‘– โˆ’ E๐›ผ)๐‘‡ ), we consider ๐‘˜ โˆ‘๏ธ โˆ‘๏ธ๐‘› |๐œŽ๐‘‘ ( (๐›ผ๐‘– โˆ’ E๐›ผ)(๐›ผ๐‘– โˆ’ E๐›ผ)๐‘‡ ) โˆ’ ๐‘˜๐œŽ๐‘‘ (๐ถ)| โ‰ค โˆฅ (๐›ผ๐‘– โˆ’ E๐›ผ)(๐›ผ๐‘– โˆ’ E๐›ผ)๐‘‡ โˆ’ ๐‘˜๐ถ โˆฅ 2 ๐‘–=1 ๐‘–=1 Similar as what we discussed above, let ๐‘ ๐‘— = (๐›ผ ๐‘— โˆ’ E๐›ผ)(๐›ผ ๐‘— โˆ’ E๐›ผ)๐‘‡ โˆ’ ๐ถ, ๐‘— = 1, 2, ยท ยท ยท , ๐‘˜. It can be verified that ๐‘ ๐‘— is bounded โˆฅ๐‘ ๐‘— โˆฅ 2 โ‰ค โˆฅ๐›ผ ๐‘— โˆ’ E๐›ผโˆฅ 22 + ๐œŽ1 (๐ถ) โ‰ค 2๐œ‚2 + ๐œŽ1 (๐ถ) โ‰ก ๐‘ 4 . Since ๐‘ ๐‘— follows i.i.d distribution, we also have ๐œˆ(๐‘) โ‰ค ๐‘˜๐‘ 5 for some constant ๐‘ 5 which represents the variance of ๐‘ ๐‘— . 
Applying the matrix Bernstein inequality, we obtain
$$\mathbb{P}\Bigg(\Big\|\sum_{j=1}^k(\alpha_j-\mathbb{E}\alpha)(\alpha_j-\mathbb{E}\alpha)^T-kC\Big\|_2\ge t\Bigg)\le2d\exp\Bigg(\frac{-t^2/2}{kc_5+\frac{c_4t}{3}}\Bigg).$$
Furthermore, taking $t=\frac{3k\sigma_d(C)}{4}$, with probability over $1-2d\exp(-c_6k)$ for some constant $c_6$, the following holds:
$$\Big|\sigma_d\Big(\sum_{i=1}^k(\alpha_i-\mathbb{E}\alpha)(\alpha_i-\mathbb{E}\alpha)^T\Big)-k\sigma_d(C)\Big|\le\Big\|\sum_{i=1}^k(\alpha_i-\mathbb{E}\alpha)(\alpha_i-\mathbb{E}\alpha)^T-kC\Big\|_2<\frac{3k\sigma_d(C)}{4},$$
which leads to
$$\sigma_d^2([\alpha_1-\mathbb{E}\alpha,\alpha_2-\mathbb{E}\alpha,\cdots,\alpha_k-\mathbb{E}\alpha])=\sigma_d\Big(\sum_{i=1}^k(\alpha_i-\mathbb{E}\alpha)(\alpha_i-\mathbb{E}\alpha)^T\Big)>\frac{k\sigma_d(C)}{4},$$
thus
$$\sigma_d([\alpha_1-\mathbb{E}\alpha,\alpha_2-\mathbb{E}\alpha,\cdots,\alpha_k-\mathbb{E}\alpha])>\frac{\sqrt{k\sigma_d(C)}}{2}.\qquad(B.9)$$
Combining (B.7), (B.8) and (B.9), we have proved that with probability at least $1-d\exp(-ck)$, $\sigma_d(\Lambda G)\gtrsim\sqrt{k}$, therefore $\|\Sigma_\Lambda^{-1}\|\lesssim\frac{1}{\sqrt{k}}$, which further gives $\max_{1\le j\le k+1}\|V^*e_j\|^2\lesssim\frac{1}{k}$. Finally, with (B.1) in Assumption B.1.1, (B.3) is also satisfied with the same probability, since
$$\|UV^*\|_\infty\le\max_j\|U^*e_j\|_2\,\max_l\|V^*e_l\|_2\le\sqrt{\frac{cd}{pk}}.$$
Hence (B.3) in Assumption B.1.1 can also be removed. $\Box$

The above discussion is valid for each patch individually, i.e., with probability at least $1-d\exp(-ck_i)\ge1-d\exp(-ck)$, (B.2) and (B.3) hold for any fixed $i=1,2,\cdots,n$. By the union bound, with probability at least $1-nd\exp(-ck)$, (B.2) and (B.3) hold for all the local patches. Note that $1-nd\exp(-ck)=1-\exp(-ck+\log n)$, where we omit $d$ since it is very small. By Lemma B.1.4, with probability at least $1-2\exp(-c_1qn)$, $\frac{nq}{2}\le k_i\le2nq$ for all $i=1,2,\cdots,n$. Using the assumption in Theorem 3.3.1 that $qn\ge c_2\log n$ for some constant $c_2$ large enough, we can see that with probability over $1-\exp(-c_3k)$, the requirements (B.2) and (B.3) automatically hold due to the i.i.d. assumption on the samples, which enables us to remove these assumptions from Theorem 3.3.1.

B.1.3 Proof of the convergence of $\frac{\hat{\lambda}_i-\lambda_i^*}{\lambda_i^*}$ as $k\to\infty$

When $k$ is large enough, $\min\{k+1,p\}=p$, $\hat{\lambda}_i=\frac{\sqrt{p}}{\hat{\epsilon}_i}$, $\lambda_i^*=\frac{\sqrt{p}}{\epsilon_i}$, then
$$\frac{\hat{\lambda}_i-\lambda_i^*}{\lambda_i^*}=\frac{\frac{\sqrt{p}}{\hat{\epsilon}_i}-\frac{\sqrt{p}}{\epsilon_i}}{\frac{\sqrt{p}}{\epsilon_i}}=\frac{\epsilon_i-\hat{\epsilon}_i}{\hat{\epsilon}_i}=\frac{\epsilon_i}{\hat{\epsilon}_i}-1.$$
In order to show $\big|\frac{\hat{\lambda}_i-\lambda_i^*}{\lambda_i^*}\big|\xrightarrow{k\to\infty}0$, it is sufficient to prove that $\frac{\epsilon_i^2-\hat{\epsilon}_i^2}{\hat{\epsilon}_i^2}=\frac{\epsilon_i^2}{\hat{\epsilon}_i^2}-1\xrightarrow{k\to\infty}0$, since this yields $\frac{\epsilon_i}{\hat{\epsilon}_i}\xrightarrow{k\to\infty}1$ and hence $\frac{\hat{\lambda}_i-\lambda_i^*}{\lambda_i^*}\xrightarrow{k\to\infty}0$.
Notice that
$$\frac{\epsilon_i^2-\hat{\epsilon}_i^2}{\hat{\epsilon}_i^2}=\frac{\|R^{(i)}+N^{(i)}\|_F^2-\Big((k+1)p\sigma^2+\bar{\Gamma}^2(X_i)\sum_{j=1}^k\frac{\|X_i-X_{i_j}\|_2^4}{4}\Big)}{(k+1)p\sigma^2+\bar{\Gamma}^2(X_i)\sum_{j=1}^k\frac{\|X_i-X_{i_j}\|_2^4}{4}}$$
$$\le\frac{\Big(\|N^{(i)}\|_F^2-(k+1)p\sigma^2\Big)+\Big(\|R^{(i)}\|_F^2-\bar{\Gamma}^2(X_i)\sum_{j=1}^k\frac{\|X_i-X_{i_j}\|_2^4}{4}\Big)+2\langle N^{(i)},R^{(i)}\rangle}{kp\sigma^2}$$
$$\le\frac{\|N^{(i)}\|_F^2-(k+1)p\sigma^2}{kp\sigma^2}+\frac{\|R^{(i)}\|_F^2-\bar{\Gamma}^2(X_i)\sum_{j=1}^k\frac{\|X_i-X_{i_j}\|_2^4}{4}}{kp\sigma^2}+\frac{2\sum_j\langle N_j^{(i)},R_j^{(i)}\rangle}{(k+1)p\sigma^2}.$$
Since each entry of $N^{(i)}$ is i.i.d. obeying $\mathcal{N}(0,\sigma^2)$, the $\langle N_j^{(i)},R_j^{(i)}\rangle$ are also i.i.d. with $\mathbb{E}\langle N_j^{(i)},R_j^{(i)}\rangle=0$; by the law of large numbers, the first and third terms approach $0$ as $k\to\infty$. Also, by (3.14) and (3.15) in Section 3.4, the second term approaches $0$ as well; thus $\frac{\epsilon_i^2-\hat{\epsilon}_i^2}{\hat{\epsilon}_i^2}\xrightarrow{k\to\infty}0$.

APPENDIX C
APPENDIX FOR CHAPTER 4

In this appendix, we discuss how to bound the separation $sep(\Lambda_1,\Lambda_2)$ between two diagonal matrices in (4.4) and provide an alternative proof of Lemma 4.4.7.

C.0.1 Discussions on bounding $sep(\Lambda_1,\Lambda_2)$

Define an eigen-gap
$$\delta_0=\max_{t_0\in\mathbb{C}}\Big\{\max\Big\{\min_{\lambda\in S(\Lambda_1)}|\lambda-t_0|-\max_{\mu\in S(\Lambda_2)}|\mu-t_0|,\;\min_{\lambda\in S(\Lambda_2)}|\lambda-t_0|-\max_{\mu\in S(\Lambda_1)}|\mu-t_0|\Big\}\Big\}.$$
Here $S(\Lambda_i)$, $i=1,2$ are the sets of eigenvalues contained in $\Lambda_i$, $i=1,2$, respectively, and are called the spectral sets. We can show that $\delta_0$ is a lower bound of $sep(\Lambda_1,\Lambda_2)$. By the definition of $sep$,
$$sep(\Lambda_1,\Lambda_2)=\inf_{\|T\|=1}\|T\Lambda_1-\Lambda_2T\|=\inf_{\|T\|=1}\|T(\Lambda_1-t_0I)-(\Lambda_2-t_0I)T\|$$
$$\ge\inf_{\|T\|=1}\Big(\min_{\lambda\in S(\Lambda_1)}|\lambda-t_0|\,\|T\|-\max_{\mu\in S(\Lambda_2)}|\mu-t_0|\,\|T\|\Big)=\min_{\lambda\in S(\Lambda_1)}|\lambda-t_0|-\max_{\mu\in S(\Lambda_2)}|\mu-t_0|.$$
Similarly,
$$sep(\Lambda_1,\Lambda_2)\ge\min_{\lambda\in S(\Lambda_2)}|\lambda-t_0|-\max_{\mu\in S(\Lambda_1)}|\mu-t_0|.$$
Hence we have $sep(\Lambda_1,\Lambda_2)\ge\delta_0$; plugging into (4.4), it gives
$$\|\tan\Theta(\mathcal{X}_1,\widetilde{\mathcal{X}}_1)\|<\frac{2\kappa_2(X_1)\kappa_2(V_2)\|\Delta A\|}{\big[\delta_0-2\kappa_2(X_1)\kappa_2(V_2)\|\Delta A\|\big]_+}.\qquad(C.1)$$
Let us provide some intuition for $\delta_0$ to assist understanding. $\delta_0>0$ essentially means that there exists a disk in the complex plane that separates $S(\Lambda_1)$ from $S(\Lambda_2)$, i.e., there exist some $t\in\mathbb{C}$ and radius $\rho>0$ such that the disk $B(t,\rho)$ satisfies either (i) $S(\Lambda_2)\subseteq B(t,\rho)$ and $S(\Lambda_1)\subseteq\mathbb{C}\backslash B(t,\rho)$; or (ii) $S(\Lambda_1)\subseteq B(t,\rho)$ and $S(\Lambda_2)\subseteq\mathbb{C}\backslash B(t,\rho)$. That is to say, $\delta_0>0$ is equivalent to the existence of a disk with one of the two spectral sets completely inside it, and the other completely outside it. Comparing $\delta_0$ with another common definition of the gap, $\delta_1:=\min_{\lambda_i\in S(\Lambda_1),\lambda_j\in S(\Lambda_2)}|\lambda_i-\lambda_j|$, we see that $\delta_0\le\delta_1$. As a result, a $\sin\Theta$ bound that requires $\delta_0>0$ is likely to be weaker than one that requires $\delta_1>0$.
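The disk intuition and the inequality $\delta_0\le\delta_1$ are easy to probe numerically. In the sketch below (illustrative only; for diagonal matrices, sep reduces to the pairwise eigenvalue gap $\delta_1$), the set $S(\Lambda_2)$ straddles $S(\Lambda_1)$ on the real line, so no disk separates them, and the grid search finds $\delta_0\le0$ even though $\delta_1=2$:

import numpy as np

S1 = np.array([-2.0, 2.0])      # S(Lambda_1)
S2 = np.array([0.0, 5.0])       # S(Lambda_2): 0 lies between the points of S1

def delta0(A, B, grid):
    best = -np.inf
    for t0 in grid:             # candidate disk centers t0 in the complex plane
        best = max(best,
                   np.min(np.abs(A - t0)) - np.max(np.abs(B - t0)),
                   np.min(np.abs(B - t0)) - np.max(np.abs(A - t0)))
    return best

grid = [x + 1j * y for x in np.linspace(-8, 8, 161) for y in np.linspace(-8, 8, 161)]
d0 = delta0(S1, S2, grid)
d1 = np.min(np.abs(S1[:, None] - S2[None, :]))   # delta_1, which equals sep here
print(d0, d1)   # d0 <= 0: disks are convex, so no disk can separate interleaved sets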
In the literature, despite this usage of the weaker eigen-gap $\delta_0$, (4.4) provides the best known relation between the $\sin\Theta$ distance and the condition numbers.

C.0.2 Alternative proof of Lemma 4.4.7 using complex analysis

Assume $(S(\Lambda_1)\cup S(\widetilde{\Lambda}_1))\cap(S(\Lambda_2)\cup S(\widetilde{\Lambda}_2))=\emptyset$; then there always exists a positively oriented simple closed curve $\Gamma$ in the complex plane enclosing the eigenvalues in $\Lambda_1$ and $\widetilde{\Lambda}_1$ while leaving those in $\Lambda_2$ and $\widetilde{\Lambda}_2$ outside. It has been shown in Kato (2013) that $P_{Q_{X_1}}=\frac{1}{2\pi i}\int_\Gamma(\lambda I-A)^{-1}d\lambda$, where $P_{Q_{X_1}}=Q_{X_1}Q_{X_1}^*$ is the projector matrix onto the subspace spanned by the columns of $Q_{X_1}$. Similarly, $P_{Q_{\widetilde{X}_1}}=Q_{\widetilde{X}_1}Q_{\widetilde{X}_1}^*=\frac{1}{2\pi i}\int_\Gamma(\lambda I-\widetilde{A})^{-1}d\lambda$; then we have
$$Q_{V_2}^*Q_{\widetilde{X}_1}=Q_{V_2}^*\big(P_{Q_{X_1}}-P_{Q_{\widetilde{X}_1}}\big)Q_{\widetilde{X}_1}$$
$$=\frac{1}{2\pi i}Q_{V_2}^*\Big(\int_\Gamma\big((\lambda I-A)^{-1}-(\lambda I-\widetilde{A})^{-1}\big)d\lambda\Big)Q_{\widetilde{X}_1}$$
$$=-\frac{1}{2\pi i}\int_\Gamma Q_{V_2}^*(\lambda I-A)^{-1}\Delta A\,(\lambda I-\widetilde{A})^{-1}Q_{\widetilde{X}_1}\,d\lambda$$
$$=-\frac{1}{2\pi i}\int_\Gamma Q_{V_2}^*X(\lambda I-\Lambda)^{-1}X^{-1}\Delta A\,\widetilde{X}(\lambda I-\widetilde{\Lambda})^{-1}\widetilde{X}^{-1}Q_{\widetilde{X}_1}\,d\lambda$$
$$=-Q_{V_2}^*X_2\Big(\frac{1}{2\pi i}\int_\Gamma(\lambda I-\Lambda_2)^{-1}V_2^*\Delta A\,\widetilde{X}_1(\lambda I-\widetilde{\Lambda}_1)^{-1}d\lambda\Big)\widetilde{V}_1^*Q_{\widetilde{X}_1}$$
$$=-(R_{V_2}^{-1})^*\underbrace{\Big(\frac{1}{2\pi i}\int_\Gamma(\lambda I-\Lambda_2)^{-1}V_2^*\Delta A\,\widetilde{X}_1(\lambda I-\widetilde{\Lambda}_1)^{-1}d\lambda\Big)}_{G}R_{\widetilde{X}_1}^{-1},$$
where the second-to-last equality used the fact that $V^*X=I$ and $Q_{V_2}^*X_1=0$. The contour integral $G=\frac{1}{2\pi i}\int_\Gamma(\lambda I-\Lambda_2)^{-1}V_2^*\Delta A\,\widetilde{X}_1(\lambda I-\widetilde{\Lambda}_1)^{-1}d\lambda$ has poles at $\widetilde{\lambda}_j$, $1\le j\le r$. Hence the $(i,j)$th entry of $G$ can be computed by Cauchy's Residue Theorem as
$$G_{ij}=\frac{1}{2\pi i}\int_\Gamma(\lambda-\widetilde{\lambda}_j)^{-1}(\lambda-\lambda_{i+r})^{-1}(V_2^*\Delta A\,\widetilde{X}_1)_{ij}\,d\lambda=(\widetilde{\lambda}_j-\lambda_{i+r})^{-1}(V_2^*\Delta A\,\widetilde{X}_1)_{ij}.$$
Plugging this back into the expression of $Q_{V_2}^*Q_{\widetilde{X}_1}$ gives (4.9).
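The residue computation behind $G_{ij}$ can be reproduced numerically; the following sketch (with arbitrary sample values for $\widetilde{\lambda}_j$ and $\lambda_{i+r}$, and the unit circle as $\Gamma$, purely for illustration) compares a discretized contour integral with the closed form $(\widetilde{\lambda}_j-\lambda_{i+r})^{-1}$:

import numpy as np

# integrate (lam - mu)^{-1} (lam - lt)^{-1} over a circle enclosing lt but not mu,
# and compare with the residue value 1/(lt - mu)
lt, mu = 0.5 + 0.2j, 3.0 - 1.0j       # lt ~ eigenvalue of tilde-Lambda_1, mu of Lambda_2
theta = np.linspace(0, 2 * np.pi, 20000, endpoint=False)
lam = np.exp(1j * theta)              # contour Gamma: unit circle around the origin
dlam = 1j * lam * (theta[1] - theta[0])
integral = np.sum(dlam / ((lam - mu) * (lam - lt))) / (2j * np.pi)
print(integral, 1.0 / (lt - mu))      # the two values agree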