EXPLORING LOW-RANK PRIOR IN HIGH-DIMENSIONAL DATA

By He Lyu

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science and Engineering - Doctor of Philosophy

2023

ABSTRACT

High-dimensional data plays a ubiquitous role in real applications, ranging from biology and computer vision to social media. The large dimensionality poses new challenges to statistical methods due to the "curse of dimensionality". To overcome these challenges, many statistical and machine learning approaches have been developed that impose additional assumptions on the data. One popular assumption is the low-rank prior, which assumes that the high-dimensional data lies in a low-dimensional subspace and therefore approximately exhibits low-rank structure. In this dissertation, we explore various applications of the low-rank prior.

Chapter 2 studies the stability of leading singular subspaces. Various widely used algorithms in numerical analysis, matrix completion, and matrix denoising are based on the low-rank assumption, such as Principal Component Analysis and Singular Value Hard Thresholding. Many of these methods involve the computation of the Singular Value Decomposition (SVD). To study the stability of these algorithms, in Chapter 2 we establish a useful set of formulae for the $\sin\Theta$ distance between the original and the perturbed singular subspaces. Following this, we further derive a collection of new results on SVD perturbation related problems.

In Chapter 3, we employ the low-rank prior for manifold denoising problems. Specifically, we generalize the Robust PCA (RPCA) method to the manifold setting and propose an optimization framework that separates the sparse component from the noisy data. It is worth noting that in this chapter, we generalize the low-rank prior to a more general form to accommodate data with a more complex structure: instead of assuming the data itself lies in a low-dimensional subspace as in RPCA, we assume the clean data is distributed around a low-dimensional manifold. Therefore, if we consider a local neighborhood, the corresponding sub-matrix will be approximately low rank.

Subsequently, in Chapter 4 we study the stability of invariant subspaces for eigensystems. Specifically, we focus on the case where the eigensystem is ill-conditioned and explore how the condition numbers affect the stability of invariant subspaces.

The material presented in this dissertation encompasses several publications and preprints in the fields of Statistics, Numerical Linear Algebra, and Machine Learning, including Lyu and Wang (2020a); Lyu et al. (2019); Lyu and Wang (2022).

Copyright by HE LYU 2023

To my family for their love and support.

ACKNOWLEDGEMENTS

There are many people to whom I am indebted for their generous support throughout my doctorate period. First of all, I would like to express my deepest gratitude to my advisor, Professor Rongrong Wang. Words alone cannot fully express the profound influence she has had on my professional development. Her brilliance, passion, and diligence greatly inspired me during my doctoral studies, and I sincerely appreciate her constant guidance and support during these five years. I am extremely lucky to have an advisor who always responds to my questions so promptly and patiently, and who cares so much about my research as well as my personal development.
I also would like to thank all my other committee members, Professor Yuying Xie, Professor Mark Iwen, and Professor Jiliang Tang, for their guidance and advice, and for providing helpful feedback and suggestions.

I am fortunate to have a wonderful group of colleagues and friends who have greatly encouraged and supported me over the years. I thank Jieqian He, Shuyang Qin, Ningyu Sha, Hao Wang, Runze Su, Zhuangdi Zhu, Avrajit Ghosh, Siddhant Gautam, Shijun Liang, Angqi Li, Xitong Zhang, Guangliang Liu, Zhiyu Xue, and Haitao Mao for making my graduate years colorful, and my gratitude goes to Wei Ao, Kang Yu, Tianxudong Tang, Ziyi Xi, and Jiaxin Yang for their help during my thumb injury. It is unfortunately impossible to name everyone here, so please accept my apologies if I have left anyone out.

Most of all, I owe my deepest thanks to my family, who have always been my strongest supporters. Their encouragement and love have been the backbone of my success. I feel incredibly fortunate to have them in my life.

TABLE OF CONTENTS

CHAPTER 1  INTRODUCTION ........ 1
CHAPTER 2  MATRIX PERTURBATION ANALYSIS AND ITS APPLICATIONS ........ 15
CHAPTER 3  MANIFOLD DENOISING BY NONLINEAR ROBUST PCA ........ 51
CHAPTER 4  PERTURBATION OF INVARIANT SUBSPACES FOR ILL-CONDITIONED EIGENSYSTEM ........ 74
BIBLIOGRAPHY ........ 89
APPENDIX A  APPENDIX FOR CHAPTER 2 ........ 95
APPENDIX B  APPENDIX FOR CHAPTER 3 ........ 102
APPENDIX C  APPENDIX FOR CHAPTER 4 ........ 117

CHAPTER 1  INTRODUCTION

In the fields of statistics and machine learning, one frequently encounters the task of analyzing high-dimensional data. Since high dimensionality poses a great challenge to traditional methods, new methods specifically designed for high-dimensional data have been developed. A promising approach to tackle the curse of dimensionality is to make prior assumptions on the data Candes et al. (2011); Lee et al. (2006); Lyu and Wang (2020b); Krahmer et al. (2023). In this dissertation, we focus on the low-intrinsic-dimensionality prior, which assumes that the high-dimensional data lies around a low-dimensional manifold. In the special case when the manifold is a linear subspace, this prior reduces to the standard low-rank prior.

The low-rank assumption underlies many popular statistical and machine learning algorithms, such as Principal Component Analysis and Singular Value Hard Thresholding. A large portion of this dissertation explores the robustness of the reconstruction under the low-rank prior for various applications. In particular, we analyze the fundamental perturbation problem of the Singular Value Decomposition (SVD). Due to the great importance of the SVD to data science and its sensitivity to noise, studying its stability is crucial for the reliability of the many machine learning algorithms that involve the SVD. In Chapter 2, we study the stability of each of the SVD factors as well as those of their combinations, and derive a collection of new results on SVD perturbation problems.
Beyond linear models, which are suitable for the original PCA Pearson (1901), real-world data may be generated from more complicated nonlinear models, and different types of noise may be present in the data, such as sparse noise and Gaussian noise. In Chapter 3, we generalize the low-rank assumption to the manifold setting, where we assume that the observed data is generated from clean data distributed around a low-dimensional manifold, contaminated by sparse noise and Gaussian noise. An optimization framework is proposed to separate the sparse noise from the data. The performance of the proposed method is tested on both synthetic and real datasets. In addition, we also provide a theoretical error bound for the proposed algorithm.

In the study of network analysis, the eigenvectors can provide useful information on the network, such as the centrality of nodes and their connectivity. It is often desirable to identify how sensitive the corresponding invariant subspaces are to noise. In Chapter 4, we investigate the stability of invariant subspaces for diagonalizable matrices, with a specific focus on the case when the eigensystem is ill-conditioned. We explore the impact of condition numbers on the stability of invariant subspaces, and derive an improved perturbation bound for the ill-conditioned situation.

Before presenting more detailed discussions of our methods in each chapter of this dissertation, we first introduce some necessary notation.

1.1 Notations

Throughout this dissertation, for any vector $x \in \mathbb{R}^n$, we denote by $\|x\|$ its $\ell_2$-norm, and by $\|x\|_\infty = \max_i |x_i|$ its infinity norm. For any matrix $A$, let $\sigma_i(A)$ be the $i$th largest singular value, $\|A\| = \sigma_1(A)$ denote the spectral norm, $\|A\|_F = \sqrt{\sum_{i,j} |A_{i,j}|^2} = \sqrt{\sum_i \sigma_i^2(A)}$ represent the Frobenius norm, $\|A\|_\infty = \max_{i,j} |A_{i,j}|$ be the maximum of the entrywise absolute values, and $\|A\|_1 = \sum_{i,j} |A_{i,j}|$ denote the sum of the absolute values. We use $A = Q_A R_A$ to denote the QR decomposition of $A$, $S(A)$ to denote the spectrum of $A$, and $\rho(A)$ the spectral radius of $A$. For a complex matrix $A$, $A^*$ is its conjugate transpose; if the matrix $A$ is real, we denote its transpose by $A^T$. The condition number of $A$ is $\kappa_2(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$, where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ are the largest and smallest singular values of $A$, respectively. In addition, $\#(\cdot)$ is the cardinality of a set, and $\mathbb{O}_r$ is the orthogonal matrix group in dimension $r$. For two functions $f(n)$ and $g(n)$, the big-O notation $f = O(g)$ means that $f$ is asymptotically upper bounded by $g$ as $n \to +\infty$, and $f = \Omega(g)$ means that $f$ is asymptotically lower bounded by $g$ as $n \to +\infty$. Furthermore, we use $e_i$ to denote the $i$th canonical basis vector, and $B(c, r)$ to denote a disk in the complex plane centered at $c$ with radius $r$.

1.2 Problem setup for perturbation analysis

This dissertation prominently features matrix perturbation analysis. Here we present the problem settings of singular subspace perturbation and invariant subspace perturbation, which are the focus of Chapter 2 and Chapter 4, respectively.
1.2.1 The singular subspace perturbation problem

For a rectangular matrix $A \in \mathbb{R}^{n \times m}$, let $\Delta A$ denote an unobserved perturbation matrix, so that the observed perturbed matrix is $\widetilde{A} = A + \Delta A$. We write the conformal SVDs of the original matrix $A$ and the perturbed matrix $\widetilde{A}$ in block matrix form
$$A = U \Sigma V^T = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Sigma_1 & \\ & \Sigma_2 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix}, \qquad \widetilde{A} = \widetilde{U} \widetilde{\Sigma} \widetilde{V}^T = \begin{pmatrix} \widetilde{U}_1 & \widetilde{U}_2 \end{pmatrix} \begin{pmatrix} \widetilde{\Sigma}_1 & \\ & \widetilde{\Sigma}_2 \end{pmatrix} \begin{pmatrix} \widetilde{V}_1^T \\ \widetilde{V}_2^T \end{pmatrix}. \tag{1.1}$$
Here $U_1 \in \mathbb{R}^{n \times r}$, $U_2 \in \mathbb{R}^{n \times (n-r)}$, $V_1 \in \mathbb{R}^{m \times r}$, $V_2 \in \mathbb{R}^{m \times (m-r)}$, $[U_1, U_2] \in \mathbb{R}^{n \times n}$ and $[V_1, V_2] \in \mathbb{R}^{m \times m}$ are orthogonal matrices, $\Sigma_1 = \mathrm{diag}\{\sigma_1, \sigma_2, \ldots, \sigma_r\} \in \mathbb{R}^{r \times r}$, $\Sigma_2 = \mathrm{diag}\{\sigma_{r+1}, \sigma_{r+2}, \ldots, \sigma_{\min\{m,n\}}\} \in \mathbb{R}^{(n-r) \times (m-r)}$, and the singular values are indexed in non-increasing order, i.e., $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r \geq \sigma_{r+1} \geq \cdots \geq \sigma_{\min\{m,n\}}$. When $n \neq m$, $\Sigma_2$ is rectangular, and the extra columns/rows are padded with 0s. The decomposition of $\widetilde{A}$ has a similar structure with non-increasing singular values $\widetilde{\Sigma}_1 = \mathrm{diag}\{\widetilde{\sigma}_1, \widetilde{\sigma}_2, \ldots, \widetilde{\sigma}_r\} \in \mathbb{R}^{r \times r}$, $\widetilde{\Sigma}_2 = \mathrm{diag}\{\widetilde{\sigma}_{r+1}, \widetilde{\sigma}_{r+2}, \ldots, \widetilde{\sigma}_{\min\{m,n\}}\} \in \mathbb{R}^{(n-r) \times (m-r)}$.

This dissertation is mostly concerned with the case where there is a significant gap between $\sigma_r$ and $\sigma_{r+1}$. In Chapter 2, we investigate the stability of each of the SVD factors and their combinations; one example is to estimate the distance between the leading singular subspaces under perturbation, i.e., the distance between $\mathrm{span}(U_1)$ and $\mathrm{span}(\widetilde{U}_1)$.

1.2.2 The invariant subspace perturbation problem

Let $A \in \mathbb{C}^{n \times n}$ be a diagonalizable matrix. An invariant subspace $\mathcal{X}$ of $A$ is one that satisfies $A\mathcal{X} \subseteq \mathcal{X}$. When a perturbation $\Delta A$ is added to $A$, its invariant subspace $\mathcal{X}$ will be perturbed accordingly. More precisely, any invariant subspace of a diagonalizable matrix is spanned by a subset of the right eigenvectors. Suppose $A = X \Lambda X^{-1}$ is the eigen-decomposition of $A$, where $X$ contains the normalized eigenvectors as columns. Partition $X$ into two blocks $X = [X_1, X_2]$,
$$A [X_1, X_2] = [X_1, X_2] \begin{pmatrix} \Lambda_1 & \\ & \Lambda_2 \end{pmatrix}, \tag{1.2}$$
where $\Lambda_1$ and $\Lambda_2$ are diagonal matrices containing eigenvalues of $A$; then $\mathcal{X}_1 = \mathrm{span}(X_1)$ is an invariant subspace of $A$. In addition, suppose that the eigenvalues in $\Lambda_1$ and $\Lambda_2$ have a gap, which ensures that with a sufficiently small perturbation, the eigen-decomposition of the perturbed matrix $\widetilde{A} = A + \Delta A$ has a similar block structure,
$$\widetilde{A} [\widetilde{X}_1, \widetilde{X}_2] = [\widetilde{X}_1, \widetilde{X}_2] \begin{pmatrix} \widetilde{\Lambda}_1 & \\ & \widetilde{\Lambda}_2 \end{pmatrix}, \tag{1.3}$$
and therefore we can estimate $\mathcal{X}_1$ by $\widetilde{\mathcal{X}}_1 = \mathrm{span}(\widetilde{X}_1)$. We study the behavior of the invariant subspace $\mathcal{X}_1$ under perturbation, which is measured by the distance between $\mathcal{X}_1$ and $\widetilde{\mathcal{X}}_1$.
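To make this setup concrete, here is a minimal numerical sketch in Python/NumPy; the matrix size, eigenvalues, and noise level are illustrative assumptions, not quantities from the dissertation. It builds a diagonalizable matrix with a gap between the leading and remaining eigenvalues, perturbs it, and compares the leading invariant subspace before and after. The last lines compute the largest principal angle between the two subspaces, a notion made precise in Section 1.2.3 below.

import numpy as np

# Sketch of the invariant subspace perturbation setup (illustrative values).
rng = np.random.default_rng(0)
n, r = 8, 3

# Diagonalizable A = X Lambda X^{-1} with a gap between Lambda_1 (the first r
# eigenvalues) and Lambda_2 (the rest).
X = rng.standard_normal((n, n))
lam = np.concatenate([[10.0, 9.0, 8.0], rng.uniform(0.0, 1.0, n - r)])
A = X @ np.diag(lam) @ np.linalg.inv(X)
A_tilde = A + 1e-3 * rng.standard_normal((n, n))   # perturbed matrix

def leading_invariant_subspace(M, r):
    """Orthonormal basis for the span of the r eigenvectors of M
    whose eigenvalues have the largest modulus."""
    w, V = np.linalg.eig(M)
    idx = np.argsort(-np.abs(w))[:r]
    Q, _ = np.linalg.qr(V[:, idx])                 # orthonormalize the basis
    return Q

Q1 = leading_invariant_subspace(A, r)
Q1_t = leading_invariant_subspace(A_tilde, r)

# Largest principal angle between span(X_1) and its perturbed version.
gammas = np.linalg.svd(Q1.conj().T @ Q1_t, compute_uv=False)
print("largest sin(theta):", np.sqrt(max(0.0, 1.0 - gammas.min() ** 2)))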
1.2.3 Distance between subspaces

Notice that in both singular subspace perturbation and invariant subspace perturbation, the target is to estimate the distance between two subspaces. It is therefore necessary to have a well-defined metric that measures the difference between any two subspaces of the same dimension.

For two matrices $U_1, \widetilde{U}_1 \in \mathbb{C}^{n \times r}$ with orthonormal columns, a natural measure of the distance between $\mathrm{span}(U_1)$ and $\mathrm{span}(\widetilde{U}_1)$ is given by their canonical angles. Let the singular values of $U_1^* \widetilde{U}_1$ be $\gamma_1 \geq \gamma_2 \geq \cdots \geq \gamma_r \geq 0$; then $\cos^{-1} \gamma_i$, $i = 1, \ldots, r$, are the principal angles, and the $\sin\Theta$ matrix is the following diagonal matrix:
$$\sin\Theta(U_1, \widetilde{U}_1) = \mathrm{diag}\left\{ \sin(\cos^{-1}(\gamma_1)), \sin(\cos^{-1}(\gamma_2)), \ldots, \sin(\cos^{-1}(\gamma_r)) \right\}.$$
The matrix that holds all the $\tan\Theta$ angles is
$$\tan\Theta(U_1, \widetilde{U}_1) = \mathrm{diag}\left\{ \tan(\cos^{-1}(\gamma_1)), \tan(\cos^{-1}(\gamma_2)), \ldots, \tan(\cos^{-1}(\gamma_r)) \right\}.$$
The angles are usually measured under either the spectral norm $\|\sin\Theta(U_1, \widetilde{U}_1)\|$ or the Frobenius norm $\|\sin\Theta(U_1, \widetilde{U}_1)\|_F$. It is well known that (see e.g., Cai and Zhang (2018); Knyazev and Argentati (2002))
$$\|\sin\Theta(U_1, \widetilde{U}_1)\| = \|U_2^* \widetilde{U}_1\| = \|\widetilde{U}_2^* U_1\|, \tag{1.4}$$
$$\|\sin\Theta(U_1, \widetilde{U}_1)\|_F = \|U_2^* \widetilde{U}_1\|_F = \|\widetilde{U}_2^* U_1\|_F, \tag{1.5}$$
where $U_2$ is the orthogonal complement of $U_1$ as defined in (1.1).

1.3 Mathematical background

Let us warm up with some mathematical background needed for this dissertation. We start by introducing the relevant norms and norm relations in Section 1.3.1, followed by a brief review of classical matrix perturbation results in Section 1.3.2. Then we move to fundamental large deviation inequalities, which are used extensively in our proofs.

1.3.1 Norms and norm relations

The content in Chapter 2 features the following $\ell_{2,\infty}$-norm.

Definition 1.3.1. For a matrix $A$, its $\ell_{2,\infty}$-norm is defined as
$$\|A\|_{2,\infty} := \sup_{\|x\|_2 = 1} \|Ax\|_\infty.$$

Remark 1.3.1. One can directly see that $\|A\|_{2,\infty} = \max_i \|a_i\|_2$, where $a_i^T$ is the $i$th row of $A$. This relation indicates that the $\ell_{2,\infty}$-norm of a matrix $A$ is the maximum of the $\ell_2$-norms of its rows.

Here we list several helpful propositions on matrix norms that are used extensively in the proofs of this dissertation.

Proposition 1.3.2. For a matrix $A \in \mathbb{R}^{n \times m}$, the following relation between the spectral norm and the $\ell_{2,\infty}$-norm holds:
$$\|A\|_{2,\infty} \leq \|A\| \leq \sqrt{n}\, \|A\|_{2,\infty}.$$

Proof: For any unit vector $x \in \mathbb{R}^m$, $\|Ax\| = \sqrt{\sum_{i=1}^n |a_i^T x|^2} \leq \sqrt{n \max_i |a_i^T x|^2} \leq \sqrt{n}\, \|A\|_{2,\infty}$. On the other hand, $\|Ax\| = \sqrt{\sum_{i=1}^n |a_i^T x|^2} \geq \max_i |a_i^T x|$, and taking the supremum over unit vectors $x$ gives $\|A\| \geq \|A\|_{2,\infty}$. □

Proposition 1.3.2 indicates that when the number of rows $n$ is large, $\|A\|_{2,\infty}$ can be much smaller than the spectral norm of $A$. For example, when $A$ has all 1s in its first column and 0s elsewhere, $\|A\|_{2,\infty} = 1$, while the spectral norm is $\|A\| = \sqrt{n}$.

Proposition 1.3.3. For a matrix $A \in \mathbb{R}^{n \times m}$, the following relation between the spectral norm and the Frobenius norm holds:
$$\|A\| \leq \|A\|_F \leq \sqrt{\mathrm{rank}(A)}\, \|A\|.$$

Proposition 1.3.3 establishes that the Frobenius norm is lower bounded by the spectral norm and upper bounded by the spectral norm multiplied by the square root of the matrix rank.
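The norm relations above are easy to check numerically. Below is a small sanity check, as a Python/NumPy sketch with illustrative sizes, of Propositions 1.3.2 and 1.3.3 using the example just discussed (all 1s in the first column).

import numpy as np

# Sanity check of Propositions 1.3.2 and 1.3.3 (illustrative sizes).
n, m = 100, 20
A = np.zeros((n, m))
A[:, 0] = 1.0                                      # all 1s in the first column

l2inf = np.max(np.linalg.norm(A, axis=1))          # ||A||_{2,inf}: largest row l2-norm
spec = np.linalg.norm(A, ord=2)                    # spectral norm ||A||
fro = np.linalg.norm(A, ord='fro')                 # Frobenius norm ||A||_F

assert l2inf <= spec <= np.sqrt(n) * l2inf                       # Proposition 1.3.2
assert spec <= fro <= np.sqrt(np.linalg.matrix_rank(A)) * spec   # Proposition 1.3.3
print(l2inf, spec, fro)                            # 1.0, 10.0, 10.0 here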
For matrix ๐ด โˆˆ R๐‘›ร—๐‘š , ๐ต โˆˆ R๐‘šร—๐‘˜ , then โˆฅ ๐ด๐ตโˆฅ 2,โˆž โ‰ค โˆฅ ๐ดโˆฅ 2,โˆž โˆฅ๐ตโˆฅ. Proof: โˆฅ ๐ด๐ตโˆฅ 2,โˆž = max โˆฅ๐‘Ž๐‘‡๐‘– ๐ตโˆฅ โ‰ค max โˆฅ๐‘Ž๐‘‡๐‘– โˆฅโˆฅ๐ตโˆฅ = โˆฅ ๐ดโˆฅ 2,โˆž โˆฅ๐ตโˆฅ, ๐‘– ๐‘– where ๐‘Ž๐‘‡๐‘– is the ๐‘–th row of ๐ด. โ–ก Remark 1.3.5. Compared with โ„“2 -norm, โ„“2,โˆž -norm can provide finer entrywise control, which is beneficial in applications such as recommender systems since it is usually desirable to have a uniform bound on all individuals. Besides, bounding โ„“2,โˆž -norm is of great importance in the analysis of exact recovery for many statistical problems, such as Stochastic Block Model (SBM), Matrix completion, and Censored Block Model Abbe et al. (2020). In singular subspace perturbation analysis, we are interested in using โ„“2,โˆž -norm to characterize the distance between singular subspaces ๐‘ ๐‘๐‘Ž๐‘›(๐‘ˆ1 ) and ๐‘ ๐‘๐‘Ž๐‘›(๐‘ˆ e1 ). A desirable quantity we aim to bound is inf โˆฅ๐‘ˆ e1 โˆ’ ๐‘ˆ1 ๐‘„โˆฅ 2,โˆž , ๐‘„โˆˆO๐‘Ÿ where ๐‘„ is a rotation matrix and O๐‘Ÿ is the orthogonal matrix group in dimension ๐‘Ÿ. In other words, we consider the difference between ๐‘ˆ e1 and ๐‘ˆ1 after they are maximally aligned by a proper rotation ๐‘„. Since O๐‘Ÿ is compact, the infimum can be achieved. Proposition 1.3.6 (Lemma 1 in Cai and Zhang (2018)). Write the SVD of ๐‘ˆ1๐‘‡ ๐‘ˆ e1 as ๐‘ˆ๐‘‡ ๐‘ˆ e = ๐‘„ 1 ๐‘†๐‘„๐‘‡ , 1 1 2 denote ๐‘„๐‘ˆ = ๐‘„ 1 ๐‘„๐‘‡2 , then โˆš โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ e1 )โˆฅ โ‰ค inf โˆฅ๐‘ˆ e1 โˆ’ ๐‘ˆ1 ๐‘„โˆฅ โ‰ค โˆฅ๐‘ˆ e1 โˆ’ ๐‘ˆ1 ๐‘„๐‘ˆ โˆฅ โ‰ค 2โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ e1 )โˆฅ, ๐‘„โˆˆO๐‘Ÿ โˆš โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ e1 )โˆฅ ๐น โ‰ค inf โˆฅ๐‘ˆ e1 โˆ’ ๐‘ˆ1 ๐‘„โˆฅ ๐น = โˆฅ๐‘ˆ e1 โˆ’ ๐‘ˆ1 ๐‘„๐‘ˆ โˆฅ ๐น โ‰ค 2โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ e1 )โˆฅ ๐น . ๐‘„โˆˆO๐‘Ÿ This proposition relates the quantity inf ๐‘„โˆˆO๐‘Ÿ โˆฅ๐‘ˆ e1 โˆ’๐‘ˆ1 ๐‘„โˆฅ to the sin ฮ˜ distance, which combined with Proposition 1.3.2 further leads to a trivial upper bound of the โ„“2,โˆž -norm perturbation error (see (2.1)). 6 Proposition 1.3.7 (Weylโ€™s inequality). Assume matrix ๐ด โˆˆ R๐‘›ร—๐‘š has singular values indexed in non-increasing order ๐œŽ1 โ‰ซ ๐œŽ2 โ‰ซ ยท ยท ยท โ‰ซ ๐œŽmin{๐‘›,๐‘š} , the perturbed matrix ๐ด e = ๐ด + ฮ”๐ด has singular values e ๐œŽ1 โ‰ซ e ๐œŽ2 โ‰ซ ยท ยท ยท โ‰ซ e ๐œŽmin{๐‘›,๐‘š} . For any 1 โ‰ค ๐‘˜ โ‰ค min{๐‘›, ๐‘š}, the following inequality holds |๐œŽ๐‘˜ โˆ’ e ๐œŽ๐‘˜ | โ‰ค โˆฅฮ”๐ดโˆฅ. Weylโ€™s inequality above provides an error bound on the singular values under perturbation, it indicates singular values are fairly stable under relatively small perturbation ฮ”๐ด. There is a wealth of applications of Weylโ€™s bound in stability analysis. However, for eigenvalues of diagonalizable matrices which are not necessarily symmetric, the following theorem indicates their sensitivity to noise needs to be estimated by the condition number. Theorem 1.3.8 (Bauer-Fike theorem). Let ๐ด โˆˆ C๐‘›ร—๐‘› be a diagonalizable matrix with eigen de- composition ๐ด = ๐‘‹ฮ›๐‘‹ โˆ’1 , where ฮ› is a diagonal matrix containing eigenvalues and ๐‘‹ โˆˆ C๐‘›ร—๐‘› is a non-singular eigenvector matrix. Suppose ๐œ‡ is an eigenvalue of the perturbed matrix ๐ด e = ๐ด + ฮ”๐ด, then there exists ๐œ† โˆˆ ฮ›, such that |๐œ† โˆ’ ๐œ‡| โ‰ค ๐œ…2 (๐‘‹)โˆฅฮ”๐ดโˆฅ, where ๐œ…2 (๐‘‹) = โˆฅ ๐‘‹ โˆฅโˆฅ ๐‘‹ โˆ’1 โˆฅ is the condition number of ๐‘‹. We will also need the definition of incoherence for the perturbation analysis. 
We will also need the definition of incoherence for the perturbation analysis.

Definition 1.3.2 (Incoherent). A matrix $U \in \mathbb{R}^{n \times r}$ with orthonormal columns ($n \geq r$) is said to be $\mu$-incoherent ($\mu \geq 1$) if $\|U\|_{2,\infty} \leq \mu \sqrt{r/n}$.

Remark 1.3.9. It is worth noting that since $U \in \mathbb{R}^{n \times r}$ has orthonormal columns, if we denote $U = [u_1, u_2, \cdots, u_n]^T$, then $\sum_i \|u_i\|^2 = r$, and by Remark 1.3.1 we directly have $\|U\|_{2,\infty} = \max_i \|u_i\| \geq \sqrt{r/n}$. On the other hand, by Proposition 1.3.2, $\|U\|_{2,\infty} \leq \|U\| = 1$. Therefore, we have upper and lower bounds for $\mu$: $1 \leq \mu \leq \sqrt{n/r}$. The upper bound is achieved for a standard basis, where each $u_i$ has 1 on one coordinate and 0s elsewhere. When every entry in the matrix $U$ has the same magnitude, the lower bound is achieved. A small $\mu$ indicates that the mass of $U$ is well spread across its entries. It has been pointed out in Candes and Recht (2012) that, with high probability, the coherence is naturally bounded by some absolute constant for random orthogonal models. Indeed, the bounded coherence assumption is used extensively in the analysis of matrix completion, matrix denoising, and matrix perturbation Candes et al. (2011); Candes and Recht (2012); Cape et al. (2019b); Abbe et al. (2020, 2022).

1.3.2 Classical singular perturbation results

Singular subspace perturbation has been studied extensively in the literature; here we include two classical results, the Davis-Kahan bound Davis and Kahan (1970) and Wedin's $\sin\Theta$ theorem Wedin (1972). Interested readers are referred to Stewart (2006); Dopico (2000); Dopico and Moro (2002) for more relevant perturbation bounds.

Theorem 1.3.10 (Davis-Kahan $\sin\Theta$ theorem). Assume $A \in \mathbb{R}^{n \times n}$ and $\widetilde{A} = A + \Delta A \in \mathbb{R}^{n \times n}$ are two symmetric matrices with the following eigen-decompositions:
$$A = U \Lambda U^T = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Lambda_1 & \\ & \Lambda_2 \end{pmatrix} \begin{pmatrix} U_1^T \\ U_2^T \end{pmatrix}, \qquad \widetilde{A} = \widetilde{U} \widetilde{\Lambda} \widetilde{U}^T = \begin{pmatrix} \widetilde{U}_1 & \widetilde{U}_2 \end{pmatrix} \begin{pmatrix} \widetilde{\Lambda}_1 & \\ & \widetilde{\Lambda}_2 \end{pmatrix} \begin{pmatrix} \widetilde{U}_1^T \\ \widetilde{U}_2^T \end{pmatrix}. \tag{1.6}$$
Here $U_1 \in \mathbb{R}^{n \times r}$, $U_2 \in \mathbb{R}^{n \times (n-r)}$, and $[U_1, U_2] \in \mathbb{R}^{n \times n}$ is an orthogonal matrix. The eigenvalue matrices are $\Lambda_1 = \mathrm{diag}\{\lambda_1, \lambda_2, \cdots, \lambda_r\}$ and $\Lambda_2 = \mathrm{diag}\{\lambda_{r+1}, \lambda_{r+2}, \cdots, \lambda_n\}$, with the eigenvalues indexed in non-increasing order, i.e., $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r \geq \lambda_{r+1} \geq \cdots \geq \lambda_n$. The factors of $\widetilde{A}$ have similar structures. Denote $\delta = \min_{1 \leq i \leq r,\, r+1 \leq j \leq n} |\lambda_i - \widetilde{\lambda}_j|$; if $\delta > 0$, then
$$\|\sin\Theta(U_1, \widetilde{U}_1)\|_F \leq \frac{\|\Delta A\, U_1\|_F}{\delta}.$$
In fact, both occurrences of the Frobenius norm above can be replaced with the spectral norm. The Davis-Kahan theorem shows that for symmetric matrices, the $\sin\Theta$ distance depends on $\|\Delta A\|$ and the eigengap; it provides a tight bound for general symmetric matrices with no additional assumption on the noise. Wedin extended the theorem to general rectangular matrices.
Let ๐ด, ๐ด e โˆˆ R๐‘›ร—๐‘š be two rectangular matrices with singular decomposition (1.1) as defined in Section 1.2.1, provided ๐›ฟ = min{ min |e๐œŽ๐‘– โˆ’ ๐œŽ ๐‘— |, min e ๐œŽ๐‘– } > 0, 1โ‰ค๐‘–โ‰ค๐‘Ÿ,๐‘Ÿ+1โ‰ค ๐‘— โ‰คmin{๐‘›,๐‘š} 1โ‰ค๐‘–โ‰ค๐‘Ÿ then โˆฅ๐‘ˆe๐‘‡ ฮ”๐ดโˆฅ 2 + โˆฅฮ”๐ด๐‘‰ e1 โˆฅ 2 e1 )โˆฅ 2 + โˆฅ sin ฮ˜(๐‘‰1 , ๐‘‰ e1 )โˆฅ 2 โ‰ค 1 ๐น ๐น โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ . (1.7) ๐น ๐น ๐›ฟ2 8 Here we make a few observations of the Wedinโ€™s bound. First, the definition of ๐›ฟ requires that the singular values in e ฮฃ1 and ฮฃ2 are separated. If โˆฅฮ”๐ดโˆฅ is small compared to the gap in singular values, then essentially it requires the singular values in ฮฃ1 and ฮฃ2 to be separated. Second, (1.7) is a uniform perturbation bound on both the right singular subspace ๐‘ˆ1 and the left singular subspace ๐‘‰1 . When the matrix size ๐‘› and ๐‘š are significantly different, ๐‘ˆ1 and ๐‘‰1 may have quite different stability, and it will be desirable to derive a one-sided perturbation bound for each of them. More discussions on this issue can be found in Section 2.5.2. Third, these classical results are quite tight for general matrices with no statistical assumption on the noise, however, when the perturbation matrix ฮ”๐ด is random, the bounds on the right-hand side become random quantities, since they involve the singular values of the perturbed matrix ๐ด. e To solve this problem, deterministic variants of the Davis-Kahanโ€™s sin ฮ˜ theorem were introduced in Yu et al. (2015); Cai and Zhang (2018) that are particularly useful for statistical applications. Additionally, various asymptotic bounds on eigenvector perturbations have also been derived in Cape et al. (2019a); Tang and Preibe (2018); Fan et al. (2022); Agterberg et al. (2022). 1.3.3 Large deviation inequalities The proof techniques in this dissertation also feature high-dimensional probability, for com- pleteness, here we present a brief review of relevant large deviation inequalities. Interested readers are referred to Vershynin (2018) for a comprehensive understanding of large deviation bounds. For certain types of random variables we are concerned with, these large deviation inequalities char- acterize the exponential decline in the probability of tail events, hence provide us with statistical guarantees on bounding these random variables. For a finite sequence of independent bounded random variables, the following Hoeffdingโ€™s inequality provides a bound on the deviation of their summation to the expectation of the summation. Theorem 1.3.12 (Hoeffdingโ€™s inequality Vershynin (2018)). Let ๐‘‹1 , ๐‘‹2 , ยท ยท ยท , ๐‘‹๐‘› be independent random variables, for each 1 โ‰ค ๐‘– โ‰ค ๐‘›, ๐‘Ž๐‘– โ‰ค ๐‘‹๐‘– โ‰ค ๐‘๐‘– . Then for any ๐‘ก > 0, the following inequaity holds ! ! 2๐‘ก 2 ๐‘› โˆ‘๏ธ P (๐‘‹๐‘– โˆ’ E๐‘‹๐‘– ) โ‰ฅ ๐‘ก โ‰ค 2 exp โˆ’ ร๐‘› . (1.8) (๐‘ โˆ’ ๐‘Ž ) 2 ๐‘–=1 ๐‘–=1 ๐‘– ๐‘– Remark 1.3.13. If ๐‘‹1 , ๐‘‹2 , ยท ยท ยท , ๐‘‹๐‘› are i.i.d. (independently and identically distributed) random 9 variables within [๐‘Ž, ๐‘], E๐‘‹๐‘– = ๐œ‡, then (1.8) reduces to ! ! 2๐‘ก 2 ๐‘› โˆ‘๏ธ P (๐‘‹๐‘– โˆ’ ๐œ‡) โ‰ฅ ๐‘ก โ‰ค 2 exp โˆ’ , ๐‘–=1 ๐‘›(๐‘ โˆ’ ๐‘Ž) 2 which further leads to ! ๐‘›   1 โˆ‘๏ธ 2๐‘› P ๐‘‹๐‘– โˆ’ ๐œ‡ โ‰ฅ ๐‘ก โ‰ค 2 exp โˆ’ . ๐‘› ๐‘–=1 (๐‘ โˆ’ ๐‘Ž) 2 We see that for bounded i.i.d. 
Suppose $X \sim B(n, p)$ follows the Binomial distribution with success probability $p$; then $X = X_1 + X_2 + \cdots + X_n$, where the $X_i$ are i.i.d. binary random variables taking value 1 with probability $p$ and value 0 with probability $1 - p$. Since $0 \leq X_i \leq 1$, Hoeffding's inequality can be used to bound the deviation of $X$ from its expectation $\mathbb{E}X = np$. However, when $p$ is very small, the bound given by Hoeffding's inequality can be pessimistic. In this case, the following theorem provides an improved deviation bound. Similar results can also be derived using Chernoff's inequality (see e.g., Vershynin (2018), Theorem 2.3.1).

Theorem 1.3.14 (Theorem 1 in Janson (2016)). Suppose $X$ follows the Binomial distribution $B(n, p)$, and let $q = 1 - p$. Then for every $a \geq 0$,
$$\mathbb{P}(X > \mathbb{E}X + a) \leq \exp\left( -\frac{a^2}{2(npq + a/3)} \right), \tag{1.9}$$
where $\mathbb{E}X = np$ is the expectation of $X$.

Remark 1.3.15. For some $0 < \delta \leq 1$, plugging $a = \delta \mathbb{E}X = \delta np$ into (1.9), we obtain
$$\mathbb{P}(X > (1 + \delta)\mathbb{E}X) \leq \exp(-c\, np\, \delta^2),$$
for some absolute constant $c$.

For independent and bounded random variables, if information about the variance is available, tighter bounds can be obtained when the variance is small using the following Bernstein inequality.

Theorem 1.3.16 (Bernstein's inequality). Let $X_1, X_2, \cdots, X_n$ be independent, centered, real random variables, and assume each one is uniformly bounded:
$$\mathbb{E}X_i = 0, \quad |X_i| \leq L, \quad \forall\, 1 \leq i \leq n.$$
Let $X$ be the sum $X = \sum_{i=1}^n X_i$, and let $\nu(X)$ denote the variance of $X$,
$$\nu(X) = \mathbb{E}X^2 = \sum_{i=1}^n \mathbb{E}X_i^2.$$
Then for all $t \geq 0$,
$$\mathbb{P}(|X| \geq t) \leq 2 \exp\left( -\frac{t^2/2}{\nu(X) + Lt/3} \right). \tag{1.10}$$

Comparing Hoeffding's bound (1.8) and Bernstein's bound (1.10), we see that when the variance is significantly smaller than the bound on the random variables, i.e., $\sum_{i=1}^n \mathbb{E}X_i^2 \ll nL^2$, Bernstein's inequality gives a tighter bound. This is intuitive, because its assumption provides more information about the variance. Bernstein's inequality can also be generalized to sums of random matrices Tropp et al. (2015), bounding the spectral norm of a sum of independent centered random matrices.

Theorem 1.3.17 (Matrix Bernstein inequality). Let $S_1, S_2, \cdots, S_n \in \mathbb{R}^{d_1 \times d_2}$ be independent, centered random matrices, and assume that each matrix is uniformly bounded, i.e.,
$$\mathbb{E}S_k = 0, \quad \|S_k\| \leq L, \quad \forall\, k = 1, 2, \cdots, n.$$
Denote the sum
$$Z = \sum_{k=1}^n S_k,$$
and let $v(Z)$ be the matrix variance statistic of the sum,
$$v(Z) = \max\{ \|\mathbb{E}(Z Z^*)\|, \|\mathbb{E}(Z^* Z)\| \} = \max\left\{ \left\| \sum_{k=1}^n \mathbb{E}(S_k S_k^*) \right\|, \left\| \sum_{k=1}^n \mathbb{E}(S_k^* S_k) \right\| \right\}.$$
Then
$$\mathbb{P}(\|Z\| \geq t) \leq (d_1 + d_2) \exp\left( -\frac{t^2/2}{v(Z) + Lt/3} \right), \quad \forall\, t \geq 0.$$
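The advantage of Bernstein's bound (1.10) over Hoeffding's bound (1.8) when the variance is small is easy to see numerically. The following sketch, with assumed illustrative parameters, evaluates both bounds for a sum of $n$ centered Bernoulli($p$) variables with small $p$.

import numpy as np

# Hoeffding vs. Bernstein for centered Bernoulli(p) with small p
# (illustrative parameters).
n, p, t = 10_000, 0.01, 50.0
L = 1.0                                   # |X_i - p| <= 1 (crude uniform bound)
var = n * p * (1 - p)                     # nu(X) = sum of Var(X_i) = 99 here

hoeffding = 2 * np.exp(-2 * t**2 / n)                    # uses only the range
bernstein = 2 * np.exp(-(t**2 / 2) / (var + L * t / 3))  # uses the variance too
print(f"Hoeffding: {hoeffding:.3e}   Bernstein: {bernstein:.3e}")
# With var = 99 << n L^2 = 10^4, Bernstein is dramatically tighter.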
The next one is a concentration inequality for the norm of a random vector with independent sub-Gaussian coordinates. In high-dimensional probability theory, the sub-Gaussian distributions form a very important family, as it contains many fundamental distributions, including the Gaussian distribution. Moreover, concentration inequalities such as Hoeffding's bound can be derived for all sub-Gaussian distributions. Formally, a random variable $X$ is sub-Gaussian if there exists a constant $K$ such that for all $t \geq 0$, the tails of $X$ satisfy
$$\mathbb{P}(|X| \geq t) \leq 2 \exp(-t^2 / K^2).$$
Intuitively, this sub-Gaussian property indicates that the tail probability decays at least as fast as that of a normal distribution. Examples of sub-Gaussian distributions include the Gaussian distribution, the Bernoulli distribution, and all bounded random variables. The sub-Gaussian norm is defined as
$$\|X\|_{\psi_2} = \inf\{ t > 0 : \mathbb{E} \exp(X^2 / t^2) \leq 2 \}.$$
The following theorem bounds the norm of sub-Gaussian random vectors.

Theorem 1.3.18 (Theorem 3.1.1 in Vershynin (2018)). Let $X = [X_1, X_2, \cdots, X_n] \in \mathbb{R}^n$ be a random vector with independent sub-Gaussian coordinates $X_i$ satisfying $\mathbb{E}X_i^2 = 1$. Then
$$\mathbb{P}\left(|\|X\| - \sqrt{n}| \geq t\right) \leq 2 \exp\left( -\frac{ct^2}{K^4} \right),$$
where $K = \max_i \|X_i\|_{\psi_2}$ and $c$ is an absolute constant.

Remark 1.3.19. Noting that Gaussian random variables are also sub-Gaussian, in Theorem 1.3.18, when $X = [X_1, X_2, \cdots, X_n] \in \mathbb{R}^n$ is a random vector with independent Gaussian coordinates $X_i \sim N(0, \sigma^2)$, then
$$\mathbb{P}\left(|\|X\| - \sigma\sqrt{n}| \geq t\right) \leq 2 \exp\left( -\frac{ct^2}{\sigma^2} \right).$$
Here we used the fact that for each Gaussian random variable $X_i$, $\|X_i\|_{\psi_2} \leq C\sigma$ for some absolute constant $C$ (see e.g., Vershynin (2018)).

1.4 Overview of this dissertation

Equipped with the mathematical background presented above, we are ready to provide a more detailed overview of our approaches for the applications in the following chapters.

Chapter 2 studies the problem of singular perturbation analysis and its applications. As a fundamental tool in computational mathematics, the SVD plays a ubiquitous role in numerical and statistical algorithms; examples can be found in Principal Component Analysis, matrix completion, matrix denoising, community detection, etc. To better understand the performance of these algorithms, it is important to study the stability of the SVD steps. In this dissertation, we take a different approach from the classical singular perturbation analyses. We first derive a set of exact formulae for the $\sin\Theta$ distance between the original and the perturbed singular subspaces; from these formulae, one can see how the perturbation of the original matrix propagates into the singular vectors and singular subspaces. More importantly, these formulae provide a direct way of analyzing the singular perturbation error. Based on or motivated by these exact formulae, we derive a collection of new results on SVD perturbation related problems. The newly derived results have three components: 1) This dissertation derives a tighter bound on $\ell_{2,\infty}$-norm singular subspace perturbation errors under Gaussian noise. The bound holds both when the matrix is low-rank and when it is a general matrix. Compared with existing works, the proposed result requires minimal assumptions and achieves a tighter bound for general matrices. 2) We also provide a novel stability analysis of Principal Component Analysis for general full-rank matrices.
As one of the arguably most popular statistical dimension reduction methods, the stability of PCA has been studied extensively in the literature. However, most existing works focus on analyzing the stability of singular vectors or singular values. This dissertation complements the literature by investigating the stability of PC scores. 3) In addition, this dissertation presents a new error bound for singular value truncation, which computes the best low-rank approximation of a given matrix with a target rank. A tight error bound for low-rank matrices already exists in the literature, but it does not directly generalize to general matrices. In this dissertation, we take a different approach and derive a new singular value truncation bound for general matrices. When the matrix is indeed low rank, the proposed result reduces to the existing tight bound. Results of this chapter have given rise to the manuscript Lyu and Wang (2020a).

In Chapter 3, we employ the low-rank prior in the manifold setting. Here we follow the manifold hypothesis popular in dimension reduction techniques, which states that real-world high-dimensional data usually lies around low-dimensional manifolds embedded in a high-dimensional space. Therefore, if we consider a local neighborhood, the corresponding sub-matrix is approximately low-rank. Specifically, we consider the manifold denoising problem, where the observed data is generated from clean data distributed around a low-dimensional manifold, contaminated by sparse noise and possibly Gaussian noise, and our goal is to denoise the dataset. Toward this goal, we utilize the low-rank assumption in each local neighborhood and propose an optimization framework to separate the sparse noise from the data. Our approach is a generalization of Robust PCA, and its performance is tested on both synthetic and real-world datasets. In addition, we also provide a theoretical error bound under some incoherence conditions and a near-optimal choice of the tuning parameters. Results of this chapter have given rise to the paper Lyu et al. (2019).

In Chapter 4, we investigate the stability of invariant subspaces for diagonalizable matrices, with a specific focus on the situation when the eigensystem is ill-conditioned. Let $\mathcal{X}_1$ be some invariant subspace of a diagonalizable matrix $A$ and $X_1$ be the matrix storing the right eigenvectors that span $\mathcal{X}_1$; this chapter studies the stability of $\mathcal{X}_1$ under noise and explores the impact of condition numbers on its stability. Previous works all suggest that as the condition number $\kappa_2(X_1)$ grows, the invariant subspace $\mathcal{X}_1$ becomes unstable to perturbation. In this dissertation, we make the point that the growth of $\kappa_2(X_1)$ alone is not enough to destroy the stability. We illustrate this point by deriving a new perturbation error bound that does not contain $\kappa_2(X_1)$, which implies that when the matrix $A$ gets closer to a Jordan form, one may still estimate its invariant subspaces stably from the noisy data. Another implication of the derived result is that for matrices with ill-conditioned eigensystems, the invariant subspaces may be more stable than the eigenvalues under matrix perturbation. Results of this chapter have given rise to the manuscript Lyu and Wang (2022).

CHAPTER 2  MATRIX PERTURBATION ANALYSIS AND ITS APPLICATIONS

2.1 Introduction

2.1.1 Overview of Chapter 2

This chapter studies the problem of singular perturbation analysis and its applications.
As a fundamental tool in computational mathematics, the Singular Value Decomposition (SVD) is commonly used in statistical and machine learning algorithms. Studying the stability of the SVD is vital to understanding the performance of these algorithms. In this chapter, we establish a useful set of formulae for the $\sin\Theta$ distance between the original and the perturbed singular subspaces. These formulae explicitly show how the perturbation of the original matrix propagates into the singular vectors and singular subspaces, thus providing a direct way of analyzing them. Following this, we derive a collection of new results on SVD perturbation related problems, which have given rise to the paper Lyu and Wang (2020a).

Our first new result is a tighter bound on the $\ell_{2,\infty}$-norm of the singular subspace perturbation error under Gaussian noise. Our analysis complements the literature on $\ell_{2,\infty}$-norm singular subspace perturbation bounds, and the derived bound matches the minimax lower bound. Compared with previous works, our result requires minimal assumptions and holds for general full-rank matrices with Gaussian noise. The second new result is a novel stability analysis of Principal Component Analysis for general full-rank matrices. We establish new error bounds for PCA scores under perturbation, which has been less explored in the literature compared with the stability of singular vectors or singular values. The third new result is an error bound on singular value truncation. A tight error bound for low-rank matrices has been derived in Luo et al. (2021); here we consider the perturbation bound for general full-rank matrices. When the matrix is indeed low rank, our result reduces to the tight bounds obtained in Luo et al. (2021).

2.1.2 Problem setting

Singular value decomposition (SVD) is a fundamental tool in computational mathematics. Many widely used algorithms in numerical analysis and statistics (e.g., Principal Component Analysis Pearson (1901); Cai et al. (2021); Abbe et al. (2022), matrix completion Candes and Recht (2012); Candes and Plan (2010); Keshavan et al. (2009), matrix denoising Donoho and Gavish (2014); Gavish and Donoho (2014), community detection Yun and Proutiere (2014); Chin et al. (2015); Abbe (2017), graph inference Tang and Priebe (2018); Athreya et al. (2021), etc.) involve the SVD computation. Since the singular vectors and singular subspaces can be sensitive to noise, the SVD step is often the stability bottleneck of the entire algorithm. Therefore, deriving optimal perturbation bounds is vital to understanding the performance of these algorithms.

In general, perturbation theory asks how a function changes when its argument is subject to a perturbation Stewart (1990). In the setting of the SVD, the goal of perturbation analysis is to study the sensitivity of the SVD factors (i.e., $U$, $\Sigma$, $V$) or their combinations to the perturbation $\Delta A$. One major focus of this chapter is on the perturbation of singular subspaces, which are subspaces spanned by singular vectors corresponding to a set of singular values. However, if one singular value corresponds to multiple singular vectors, it is not possible to bound the perturbation of these individual singular vectors. Let us illustrate this point with the following example.
Consider the diagonal matrix
$$A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 5 \end{pmatrix}.$$
We can directly see that $A$ has singular values $1, 1, 5$, and we can choose the right singular vectors corresponding to the singular value 1 as $u_1 = [1, 0, 0]^T$ and $u_2 = [0, 1, 0]^T$. For some small constant $0 < \epsilon \ll 1$, let the perturbed matrix be
$$\widetilde{A} = \begin{pmatrix} 1 & \epsilon & 0 \\ \epsilon & 1 & 0 \\ 0 & 0 & 5 \end{pmatrix}.$$
In $\widetilde{A}$, the singular values are $1 + \epsilon$, $1 - \epsilon$, $5$, and the singular vectors corresponding to the singular values $1 + \epsilon$ and $1 - \epsilon$ are $\widetilde{u}_1 = [\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}, 0]^T$ and $\widetilde{u}_2 = [\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}, 0]^T$, which are significantly different from $u_1$ and $u_2$. However, after a more careful look, we see that the subspaces spanned by $[u_1, u_2]$ and by $[\widetilde{u}_1, \widetilde{u}_2]$ are the same. This example indicates that an individual singular vector can change dramatically even as the perturbation $\|\Delta A\| \to 0$, but the singular subspaces are more stable.
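The example is easy to verify numerically; the following Python/NumPy sketch computes both the change in an individual singular vector and the $\sin\Theta$ distance between the two-dimensional subspaces ($\epsilon$ is an illustrative value).

import numpy as np

# Numerical check of the 3x3 example: individual singular vectors differ
# substantially, while the spanned 2-dimensional subspaces coincide.
eps = 1e-6
A = np.diag([1.0, 1.0, 5.0])
A_tilde = A.copy()
A_tilde[0, 1] = A_tilde[1, 0] = eps

# numpy returns singular values in decreasing order, so the subspace for the
# (repeated) singular value 1 corresponds to the last two right singular vectors.
_, _, Vt = np.linalg.svd(A)
_, _, Vt_t = np.linalg.svd(A_tilde)
V1, V1_t = Vt[1:].T, Vt_t[1:].T                    # 3 x 2 bases

print("vector change:     ", np.linalg.norm(V1[:, 0] - V1_t[:, 0]))  # O(1)
gammas = np.linalg.svd(V1.T @ V1_t, compute_uv=False)
print("largest sin(theta):", np.sqrt(max(0.0, 1.0 - gammas.min() ** 2)))  # ~0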
Chapter 1 provides the mathematical preparations for singular subspace perturbation. Let $A = U \Sigma V^T$ be the SVD of a matrix $A$, and $\widetilde{A} = \widetilde{U} \widetilde{\Sigma} \widetilde{V}^T$ be the SVD of its noisy version $\widetilde{A} = A + \Delta A$. The singular subspace perturbation problem then studies the stability of a left or a right singular subspace of $A$ under the perturbation $\Delta A$. In the setting of Section 1.2.1, we aim to study the distance between the leading singular subspaces $U_1$ and $\widetilde{U}_1$ under perturbation. Besides, we also investigate the stability of combinations of the individual factors of the SVD (e.g., in PCA, we need to analyze the PC scores $U_1 \Sigma_1$). Specifically, in this dissertation, we are concerned with the following three perturbation problems:

• Singular subspace perturbation. This dissertation studies the stability of the leading subspace $U_1$ (or $V_1$) of a matrix $A$ under the perturbation matrix $\Delta A$. Specifically, we are interested in deriving an $\ell_{2,\infty}$-norm bound for the distance between the leading subspaces $U_1$ and $\widetilde{U}_1$ under perturbation, i.e., we aim to bound $\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty}$.

• Stability of PCA scores, which are the projections of the data matrix $A$ onto its PC directions; using our notation in Section 1.2.1, the PC scores are given by $U_1 \Sigma_1$. Our goal is to bound $\min_{Q \in \mathbb{O}_r} |||U_1 \Sigma_1 - \widetilde{U}_1 \widetilde{\Sigma}_1 Q|||$, where $||| \cdot |||$ can be either the spectral norm or the Frobenius norm.

• Stability of singular value truncation. The best rank-$r$ approximations of the original matrix $A$ and the perturbed matrix $\widetilde{A}$ are given by $A_r = U_1 \Sigma_1 V_1^T$ and $\widetilde{A}_r = \widetilde{U}_1 \widetilde{\Sigma}_1 \widetilde{V}_1^T$, respectively. We study $|||A_r - \widetilde{A}_r|||$, with $||| \cdot |||$ being either the spectral norm or the Frobenius norm.

Perturbation analysis of these three problems is crucial to understanding the performance of many spectral methods in statistics. There is a plurality of papers addressing SVD perturbation and its statistical applications Cape et al. (2019b); Abbe et al. (2020); Cai et al. (2021); Abbe et al. (2022); Lei (2019). To list a few statistical applications: in the study of network models such as the SBM, spectral methods are commonly used in community detection and clustering, where the eigenvectors of the adjacency matrix provide connectivity and centrality information about the nodes, and the $\ell_{2,\infty}$-norm perturbation bound can facilitate the analysis of exact recovery for such methods Cape et al. (2019b); Abbe et al. (2020); Cai et al. (2021). PCA is one of the most popular dimension reduction methods, and the perturbation analysis of PC scores helps provide a more precise characterization of the stability of the low-dimensional embedding Abbe et al. (2022). In matrix completion with noisy entries, Singular Value Truncation is a fundamental method, and it is desirable to study how close the reconstructed matrix is to the true matrix.

2.2 $\ell_{2,\infty}$-norm perturbation bound of the singular subspace perturbation

Using the notation of Section 1.2.1, the purpose of the $\ell_{2,\infty}$-norm perturbation analysis is to investigate the quantity
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty},$$
which characterizes the difference between the leading singular subspaces $U_1$ and $\widetilde{U}_1$ under perturbation. In many applications, the $\ell_{2,\infty}$-metric is better suited than the $\sin\Theta$ one, since it provides finer entrywise control. For instance, in clustering, classification, and dimension reduction, one cares about the classification accuracy or the embedding quality of each data point, which corresponds to the row-wise $\ell_2$ error of the leading singular vector matrix, and the maximal row-wise $\ell_2$ error is exactly the $\ell_{2,\infty}$-norm.

Careful readers may have noticed that, by Proposition 1.3.2 and Proposition 1.3.6, $\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty}$ is smaller than the $\sin\Theta$ distance up to a constant Cai and Zhang (2018),
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq \min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\| \leq \sqrt{2}\, \|\sin\Theta(U_1, \widetilde{U}_1)\|. \tag{2.1}$$
This provides a trivial bound on the $\ell_{2,\infty}$-norm error, but it can be very pessimistic.

To see why, let us consider the case where the matrix $A$ has rank $r$ with $r \ll n$, the perturbation matrix $\Delta A$ has i.i.d. Gaussian entries $N(0, \sigma^2)$, the matrix $U_1$ of leading singular vectors is $\mu$-incoherent (see Definition 1.3.2 for the definition of incoherence), and the following gap condition holds:
$$c_1 \sigma \sqrt{\max\{n, m\}} \leq \sigma_r(A) - \sigma_{r+1}(A).$$
Under these assumptions, (2.1) combined with Wedin's $\sin\Theta$ bound yields that, with high probability,
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq \sqrt{2}\, \min\left\{ 1,\ \frac{c_2 \sqrt{\max\{n, m\}}\, \sigma}{\sigma_r - \sigma_{r+1}} \right\} \sim O(1). \tag{2.2}$$
Here $c_1, c_2$ are absolute constants, and the big-O notation is with respect to the size variables $n$ and $m$. The gap condition implies that the noise level $\sigma$ can be as large as $O(1/\sqrt{\max\{n, m\}})$, so the order of $\sigma \sqrt{\max\{n, m\}}$ is $O(1)$. Hence the bound in (2.2) is $O(1)$.

In contrast, it has been shown that an $\ell_{2,\infty}$-norm bound of order $O(\sqrt{r/n})$ can be achieved under the same conditions Abbe et al. (2020); Lyu and Wang (2020a), which implies that the trivial bound derived in (2.1) can be very pessimistic.
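The gap between the aligned $\ell_{2,\infty}$ error and the trivial bound (2.1) can be observed directly. The sketch below (sizes and noise level are illustrative assumptions) computes an estimate of $\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty}$ using the rotation $Q_U$ from Proposition 1.3.6 and compares it with $\sqrt{2}\|\sin\Theta(U_1, \widetilde{U}_1)\|$.

import numpy as np

# Aligned l_{2,inf} error vs. the trivial bound (2.1) for a random rank-r
# matrix plus Gaussian noise (illustrative sizes).
rng = np.random.default_rng(3)
n, m, r, sigma = 1000, 1000, 5, 0.5

A = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank-r signal
A_tilde = A + sigma * rng.standard_normal((n, m))

U1 = np.linalg.svd(A, full_matrices=False)[0][:, :r]
U1_t = np.linalg.svd(A_tilde, full_matrices=False)[0][:, :r]

# Optimal rotation Q_U = Q1 Q2^T from the SVD of U1^T U1_tilde (Prop. 1.3.6).
Q1, s, Q2t = np.linalg.svd(U1.T @ U1_t)
Q_U = Q1 @ Q2t

aligned_err = np.max(np.linalg.norm(U1_t - U1 @ Q_U, axis=1))   # l_{2,inf}
trivial = np.sqrt(2) * np.sqrt(max(0.0, 1.0 - s.min() ** 2))    # sqrt(2)*||sin Theta||
print(f"l_2,inf error: {aligned_err:.2e}   trivial bound (2.1): {trivial:.2e}")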
2.2.1 Different bounds on the $\ell_{2,\infty}$-norm perturbation

Various $\ell_{2,\infty}$-norm bounds for singular vector perturbation have been established by Abbe et al. (2020); Chen et al. (2021a); Cheng et al. (2020); Cape et al. (2019b). In particular, Cape et al. (2019b) derived a Procrustean matrix decomposition, based on which the authors further obtained $\ell_{2,\infty}$-norm perturbation bounds. For the leading subspaces $U_1$ and $\widetilde{U}_1$ under perturbation, let the SVD of $U_1^T \widetilde{U}_1$ be $U_1^T \widetilde{U}_1 = Q_1 S Q_2^T$, and denote $Q_U = Q_1 Q_2^T$. Proposition 1.3.6 suggests that $\|\widetilde{U}_1 - U_1 Q_U\|_{2,\infty}$ is a good estimate of the target quantity $\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty}$. The Procrustean decomposition of $\widetilde{U}_1 - U_1 Q_U$ in Cape et al. (2019b) reads as the following theorem.

Theorem 2.2.1 (Theorem 3.1 in Cape et al. (2019b)). In the setting of Section 1.2.1, if $\widetilde{A}$ has rank at least $r$, then $\widetilde{U}_1 - U_1 Q_U \in \mathbb{R}^{n \times r}$ admits the following decomposition:
$$\begin{aligned} \widetilde{U}_1 - U_1 Q_U &= (I - U_1 U_1^T)\, \Delta A\, V_1 Q_V \widetilde{\Sigma}_1^{-1} + (I - U_1 U_1^T)\, \Delta A\, (\widetilde{V}_1 - V_1 Q_V)\, \widetilde{\Sigma}_1^{-1} \\ &\quad + (I - U_1 U_1^T)\, A\, (\widetilde{V}_1 - V_1 V_1^T \widetilde{V}_1)\, \widetilde{\Sigma}_1^{-1} + U_1 (U_1^T \widetilde{U}_1 - Q_U). \end{aligned} \tag{2.3}$$
The decomposition also holds when the orthogonal matrices $Q_U$ and $Q_V$ are replaced with any real matrices $T_1$ and $T_2$.

The intuition behind this decomposition is that each of the four terms in (2.3) can be bounded fairly easily. Perturbation bounds on singular subspaces were developed in Cape et al. (2019b) based on Theorem 2.2.1. When applied to the case where the perturbation matrix $\Delta A$ has i.i.d. Gaussian entries, which is the focus of this chapter, Cape et al. (2019b) derived the following bound.

Theorem 2.2.2 (Theorem 4.3 in Cape et al. (2019b)). Suppose $A \in \mathbb{R}^{n \times m}$ has rank $r$ with $r \leq m \leq n$ and $\sigma_r(A) \geq C \sigma n / \sqrt{m}$. When $\Delta A$ has i.i.d. Gaussian entries $N(0, \sigma^2)$, there is a constant $C_r$ depending on $r$ such that, with high probability,
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq C_r \left[ \frac{\sigma \sqrt{\log n}}{\sigma_r(A)} \left( 1 + \frac{\sigma m}{\sigma_r(A) \sqrt{\log n}} \right) + \frac{\sigma \sqrt{m}}{\sigma_r(A)} \|U_1\|_{2,\infty} \right], \tag{2.4}$$
where $C_r \sim O(\sqrt{r})$.

This bound is quite tight when $m \ll n$; however, when $n \approx m$, the second term in (2.4) becomes quite large and the bound becomes pessimistic.

In the setting where $A \in \mathbb{R}^{n \times m}$ has noisy and missing entries and the dimensions are highly unbalanced, i.e., $m \gg n$, Cai et al. (2021) derived an $\ell_{2,\infty}$ perturbation bound, where the SVD is conducted on a rescaled Gram matrix with the diagonal entries deleted. When restricted to the case where there are no missing entries, $A$ has rank $r$, and the perturbation matrix $\Delta A$ has i.i.d. $N(0, \sigma^2)$ entries, the result in Cai et al. (2021) reads as the following Theorem 2.2.3.

Theorem 2.2.3 (Theorem 3.1 in Cai et al. (2021)). Assume the following assumptions hold:
$$\sigma_r(A) \geq c_1 \sigma \sqrt{\log m}\, \max\{\kappa (mn)^{1/4},\ \kappa^3 \sqrt{n}\}, \quad \text{and} \quad \|U_1\|_{2,\infty} \leq \frac{c_2}{\kappa^2},$$
where $\kappa = \frac{\sigma_1(A)}{\sigma_r(A)}$, $c_1$ is some sufficiently large constant, and $c_2$ is sufficiently small. Then with high probability,
$$\|\widetilde{U}_1 - U_1 Q_U\|_{2,\infty} \lesssim \kappa^2 \sqrt{r}\, \mu \left( \frac{\sigma^2 \sqrt{m}}{\sigma_r^2(A)} + \frac{\sigma \kappa}{\sigma_r(A)} \right) + \kappa^2 \|U_1\|_{2,\infty}^2. \tag{2.5}$$
Here $\mu = \max\left\{ \sqrt{\frac{n}{r}}\, \|U_1\|_{2,\infty},\ \sqrt{\frac{m}{r}}\, \|V_1\|_{2,\infty},\ \sqrt{nm}\, \frac{\max_{i,j} |A_{i,j}|}{\|A\|_F} \right\} \geq 1$ characterizes the incoherence of the matrices $U_1$, $V_1$, and $A$.

For simplicity of presentation, we have omitted the logarithmic terms. When $\mu$ and the condition number $\kappa$ are of constant order, the bound on the RHS of (2.5) is near-optimal. However, if the matrix $A$ has a large condition number $\kappa$, it is not desirable to have $\kappa$ in the bound.
It is worth mentioning that Cai et al. (2021) also derived the following minimax lower bound.

Theorem 2.2.4 (Theorem 3.3 in Cai et al. (2021)). Suppose $1 \leq r \leq n/2$ and $(\Delta A)_{ij} \overset{i.i.d.}{\sim} N(0, \sigma^2)$. Define
$$\mathcal{M} := \{ B \in \mathbb{R}^{n \times m} \mid \mathrm{rank}(B) = r,\ \sigma_r(B) \in [0.9\sigma_r^*, 1.1\sigma_r^*] \}.$$
Denote by $U(B) \in \mathbb{R}^{n \times r}$ the matrix containing the $r$ leading left singular vectors of $B$. Then there exists some universal constant $c_{lb} > 0$ such that
$$\inf_{\widehat{U}} \sup_{A \in \mathcal{M}} \mathbb{E} \min_{Q \in \mathbb{O}_r} \|\widehat{U} Q - U(A)\|_{2,\infty} \geq c_{lb} \min\left\{ \frac{\sigma^2}{\sigma_r^{*2}} \sqrt{nm} + \frac{\sigma}{\sigma_r^*} \sqrt{n},\ 1 \right\} \frac{1}{\sqrt{n}}, \tag{2.6}$$
where the infimum is taken over all estimators for $U(A)$ based on the noisy observation $A + \Delta A$.

Theorem 2.2.4 provides a minimax lower bound on the $\ell_{2,\infty}$-norm of singular subspace perturbation. When $A$ is rank-$r$, $n \asymp m$, and $\sigma_r \gtrsim \sigma\sqrt{n}$, the lower bound on the RHS becomes $O(1/\sqrt{n})$.

Some $\ell_{2,\infty}$-norm singular perturbation bounds were derived from the perturbation analysis of eigen subspaces of symmetric matrices. Using the Hermitian dilation trick Paulsen (2002), these results on eigen subspaces can be extended to singular subspaces, which leads to a uniform bound on $\|\sin\Theta(U_1, \widetilde{U}_1)\|$ and $\|\sin\Theta(V_1, \widetilde{V}_1)\|$. Relevant works that fall into this category include Lei (2019) and Abbe et al. (2020). In particular, Abbe et al. (2020) established an $\ell_{2,\infty}$-norm bound for eigenspaces of symmetric random matrices whose expectations are of low rank. Here we consider a simplified version of the results in Abbe et al. (2020) when the perturbation matrix has i.i.d. Gaussian entries. Suppose $A \in \mathbb{R}^{n \times n}$; denote the eigenvalues of $A$ by $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, with associated eigenvectors $\{u_j\}_{j=1}^n$. Analogously, for $\widetilde{A} = A + \Delta A$, the eigenvalues and eigenvectors are $\widetilde{\lambda}_1 \geq \widetilde{\lambda}_2 \geq \cdots \geq \widetilde{\lambda}_n$ and $\{\widetilde{u}_j\}_{j=1}^n$, respectively. Let $U_1 = [u_1, u_2, \cdots, u_r]$ and $\widetilde{U}_1 = [\widetilde{u}_1, \widetilde{u}_2, \cdots, \widetilde{u}_r]$. Denote $\Delta = \min\{\lambda_r - \lambda_{r+1},\ \min_{1 \leq i \leq r} |\lambda_i|\}$ and $\kappa = \max_{1 \leq i \leq r} |\lambda_i| / \Delta$. In addition, assume there exists $\gamma \geq 0$ such that
$$\|A\|_{2,\infty} \leq \gamma \Delta \tag{2.7}$$
and $c\kappa\gamma \leq 1$ for some constant $c$, and for some $\delta_0 \in (0, 1)$,
$$\mathbb{P}(\|\Delta A\| \leq \gamma \Delta) \geq 1 - \delta_0. \tag{2.8}$$

Theorem 2.2.5 (Theorem 2.1 in Abbe et al. (2020)). Under assumptions (2.7) and (2.8), with high probability,
$$\|\widetilde{U}_1 - U_1 Q_U\|_{2,\infty} \leq (c + \kappa^2 \gamma) \|U_1\|_{2,\infty} + \gamma \|A\|_{2,\infty} / \Delta. \tag{2.9}$$

The bound in Theorem 2.2.5 is quite tight when the condition number $\kappa$ is of constant order, $\|U_1\|_{2,\infty}$ is of order $O(\sqrt{r/n})$, and $A$ has exact rank $r$, since in this case $\|A\|_{2,\infty}/\Delta = \|U_1 \Sigma_1 U_1^T\|_{2,\infty}/\Delta \leq \kappa \|U_1\|_{2,\infty} \sim O(\sqrt{r/n})$. However, for general full-rank matrices or matrices with a large condition number, the bound (2.9) can be improved.

2.2.2 A near-optimal $\ell_{2,\infty}$-bound under i.i.d. Gaussian noise

In summary, various singular subspace perturbation bounds have been derived in the literature.
However, most existing results only work for low-rank matrices; some results require additional assumptions on the $\ell_{2,\infty}$-norm of the matrix $A$, and may not give an informative bound when the condition number becomes large. In this section, we present an improved perturbation bound under the assumption that the perturbation matrix $\Delta A$ has i.i.d. Gaussian entries. The proposed result complements the existing literature in that it requires only minimal assumptions, and holds both when the matrix $A$ is of low rank and when it is a general full-rank matrix.

We establish an improved $\ell_{2,\infty}$-norm singular subspace perturbation bound in the case where each entry of the perturbation matrix follows an i.i.d. Gaussian distribution. Statistical applications that fall into this regime include the Gaussian Mixture Model with isotropic noise covariance matrix Löffler et al. (2021). Admittedly, the proposed result requires the perturbation matrix $\Delta A$ to have i.i.d. Gaussian entries due to the proof technique we use; whether similar results can be obtained when $\Delta A$ is sub-Gaussian remains open.

Theorem 2.2.6. Suppose $\widetilde{A} = A + \Delta A \in \mathbb{R}^{n \times m}$, $\bar{n} := \max\{n, m\}$, $\Delta A$ has i.i.d. $N(0, \sigma^2)$ entries, and assume $\sigma_r(A) - \sigma_{r+1}(A) > 21\sigma\sqrt{\bar{n}}$. Then with probability at least $1 - \frac{c}{n^2}$,
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq c_1 \frac{\sigma^2 \bar{n}}{(\sigma_r(A) - \sigma_{r+1}(A))^2} \|U_1\|_{2,\infty} + c_2 \sigma \frac{R(r, n)}{\sigma_r(A) - \sigma_{r+1}(A)}, \tag{2.10}$$
where
$$R(r, n) = \begin{cases} \sqrt{r} + \sqrt{\log n}, & \text{if } A \text{ is rank } r; \\ r + \sqrt{r \log n}, & \text{else}, \end{cases}$$
and $c, c_1, c_2$ are absolute constants independent of $n$ and $m$.

The following corollary of Theorem 2.2.6 may be easier to digest.

Corollary 2.2.7. Under the same assumptions as in Theorem 2.2.6, if we additionally assume that the matrix $U_1$ holding the leading $r$ left singular vectors of $A$ is $\mu_1$-incoherent for some constant $\mu_1$, then with probability at least $1 - \frac{c}{n^2}$,
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq C \cdot \frac{\mu_1 \sqrt{r} + \sqrt{r \log n} + r}{\sqrt{n}} \sim O\left( \frac{r}{\sqrt{n}} \right),$$
where $c$ and $C$ are absolute constants, and the logarithmic factors are omitted in the big-O expression. A similar result holds for the right singular subspace.

For low-rank matrices, Corollary 2.2.7 can be further improved.

Corollary 2.2.8. When $A$ is of rank $r$, (2.10) reduces to
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq c_1 \frac{\sigma^2 \bar{n}}{\sigma_r^2(A)} \|U_1\|_{2,\infty} + c_2 \sigma \frac{\sqrt{r} + \sqrt{\log n}}{\sigma_r(A)}. \tag{2.11}$$
Under the same assumptions as in Corollary 2.2.7, with probability at least $1 - \frac{c}{n^2}$,
$$\min_{Q \in \mathbb{O}_r} \|\widetilde{U}_1 - U_1 Q\|_{2,\infty} \leq C \cdot \frac{\mu_1 \sqrt{r} + \sqrt{r} + \sqrt{\log n}}{\sqrt{n}} \sim O\left( \sqrt{\frac{r}{n}} \right),$$
where $c$ and $C$ are absolute constants. A similar result holds for the right singular subspace.
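As an informal check of the predicted $O(\sqrt{r/n})$ rate in Corollary 2.2.8, the Monte Carlo sketch below tracks the aligned $\ell_{2,\infty}$ error as $n$ grows for a rank-$r$ signal with incoherent singular vectors; the sizes, noise level, and signal strength are illustrative assumptions.

import numpy as np

# Monte Carlo check of the sqrt(r/n) scaling for rank-r signals with
# incoherent U1 and i.i.d. Gaussian noise (illustrative parameters).
rng = np.random.default_rng(4)
r, sigma = 3, 0.5

for n in [500, 1000, 2000]:
    # Random orthonormal factors are incoherent with high probability.
    U1 = np.linalg.qr(rng.standard_normal((n, r)))[0]
    V1 = np.linalg.qr(rng.standard_normal((n, r)))[0]
    A = 50 * np.sqrt(n) * sigma * (U1 @ V1.T)        # sigma_r(A) = 50*sigma*sqrt(n)
    A_tilde = A + sigma * rng.standard_normal((n, n))

    U1_t = np.linalg.svd(A_tilde, full_matrices=False)[0][:, :r]
    Q1, _, Q2t = np.linalg.svd(U1.T @ U1_t)          # alignment rotation Q_U
    err = np.max(np.linalg.norm(U1_t - U1 @ (Q1 @ Q2t), axis=1))
    print(f"n = {n:5d}   l_2,inf error = {err:.2e}   sqrt(r/n) = {np.sqrt(r/n):.2e}")
# The error decreases roughly in proportion to sqrt(r/n) as n grows.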
ห† โˆ’ ๐‘ˆ ( ๐ด)โˆฅ 2,โˆž โ‰ฅ ๐‘ โˆš1 ,   inf sup E min โˆฅ๐‘ˆ๐‘„ (2.12) ๐‘ˆห† ๐ดโˆˆM ๐‘„โˆˆO๐‘Ÿ ๐‘› โˆš๏ธ โˆš Achieve ๐‘‚ ( ๐‘Ÿ/๐‘›) for Achieve ๐‘‚ (๐‘Ÿ/ ๐‘›) for Do not require ๐œ… = ๐‘‚ (1) Do not require addition assumptions rank-๐‘Ÿ matrices general matrices Cape et al. (2019b) โœ— โœ“ โœ“ โœ— Lei (2019) โœ“ โœ“ โœ— โœ— Abbe et al. (2020) โœ“ โœ— โœ“ โœ— Cai et al. (2021) โœ“ โœ— โœ— โœ— This dissertation โœ“ โœ“ โœ“ โœ“ Table 2.1 Comparison of result in this dissertation with existing works about the โ„“2,โˆž bound. Here we compare these results under the assumptions that ฮ”๐ด has i.i.d. ๐‘ (0, ๐œŽ 2 ) entries, โˆฅ๐‘ˆ1 โˆฅ 2,โˆž โ‰ค โˆš๏ธ ๐œŽ ( ๐ด) ๐‘ ๐‘Ÿ/๐‘›, where ๐œŽ, ๐‘ are constants, and ๐‘› โ‰ ๐‘š. Here ๐œ… = ๐œŽ1 ( ๐ด) is the condition number of ๐ด. For ๐‘Ÿ simplicity, we ignore log ๐‘› in the big-O notation. We take a moment here to make a comparison between several existing works with ours (Theorem 2.1 and its corollaries). Table 2.1 summarizes the โ„“2,โˆž -norm bound and requirements in each work. The purpose of the comparison is to show the effectiveness of our results under the setting where the perturbation matrix ฮ”๐ด has i.i.d. Gaussian entries and the matrix ๐ด is nearly square (๐‘› โ‰ ๐‘š). To be fair, we would like to mention that some of the existing results might be better suited for other settings (such as when ๐‘š โ‰ซ ๐‘›). โˆš๏ธ From the table, we can see that our corollaries achieve the ๐‘‚ ( ๐‘Ÿ/๐‘›) order upper bound for โˆš low-rank matrices and ๐‘‚ (๐‘Ÿ/ ๐‘›) for full-rank matrices. In comparison, previous results in Cape โˆš๏ธ et al. (2019b) do not achieve the ๐‘‚ ( ๐‘Ÿ/๐‘›) order for the low-rank case under the assumptions as in โˆš๏ธ Corollary 2.2.8. The result in Lei (2019) achieved the same order of accuracy ๐‘‚ ( ๐‘Ÿ/๐‘›) but only for rank-๐‘Ÿ matrices and under a more restrictive gap condition (below is a simplified version) โˆš   ๐œŽ1 ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 โ‰ฅ ๐‘ min{ , 2๐‘Ÿ}๐œŽ ๐‘›ยฏ + โˆฅ ๐ดโˆฅ 2,โˆž , ๐œŽ๐‘Ÿ where ๐‘ is a constant. Abbe et al. (2020); Chen et al. (2021a); Cheng et al. (2020); Eldridge et al. (2018) consider the perturbation of one eigen-vector instead of a set of eigen-vectors. Abbe et al. 24 โˆš๏ธ (2020) obtained an ๐‘‚ ( ๐‘Ÿ/๐‘›) perturbation bound for low-rank matrices, but for full rank matrices, ๐œŽ ( ๐ด) 1 their bound is ๐‘‚ (1). Besides, the bound contains in it the condition number e ๐œ… ( ๐ด) = ๐œŽ ( ๐ด)โˆ’๐œŽ , ๐‘Ÿ ๐‘Ÿ+1 ( ๐ด) which potentially makes it very large. Likewise, the result in Cai et al. (2021) also contains a ๐œŽ ( ๐ด) condition number ๐œ…( ๐ด) = ๐œŽ1 ( ๐ด) . In contrast, the condition number does not show up in our results, ๐‘Ÿ hence Theorem 2.2.6 is more suitable for matrices with large condition numbers. In addition, all previous analyses except for Cape et al. (2019b) and Cai et al. (2021) are based on techniques developed for eigen-decomposition. When they are applied to rectangular matrices ๐ด โˆˆ R๐‘›ร—๐‘š , the matrix needs to be symmetrized, causing the resulting upper bounds potentially depend on both the left and the right singular vectors. In contrast, our bound (2.10) is one-sided, in that the perturbation of ๐‘ˆ1 only depends on โˆฅ๐‘ˆ1 โˆฅ 2,โˆž but not โˆฅ๐‘‰1 โˆฅ 2,โˆž . We make a more detailed comparison between our Theorem 2.2.6 and Theorem 4.3 in Cape et al. (2019b). For a fair comparison, we restrict ourselves to the setting where both Cape et al. 
For a fair comparison, we restrict ourselves to the setting where both Cape et al. (2019b) and our results hold, that is, when the data matrix $A\in\mathbb R^{n\times m}$ is of rank $r$ and the perturbation matrix $\Delta A$ has i.i.d. Gaussian $N(0,\sigma^2)$ entries with some constant $\sigma$. As discussed in Theorem 2.2.2, Theorem 4.3 in Cape et al. (2019b) reads: if $n \ge m$ and $\sigma_r(A) \gtrsim \sigma(n/\sqrt m)$, then (2.4) holds. Comparing (2.4) and (2.11), we notice that under the allowable gap condition $\sigma_r(A) \gtrsim \sigma\sqrt{\max\{n,m\}}$ and the incoherence condition $\|U_1\|_{2,\infty} \le c\sqrt r/\sqrt n$, the second term in (2.4), which reads $C\sqrt r\,\sigma^2 m/\sigma_r^2(A)$, can be as large as $O(\sqrt r)$ when $n \approx m$, which is much larger than our bound $O(\sqrt{r/n})$. Admittedly, our current result only holds for Gaussian perturbations due to the proof techniques we use. We leave the study of other perturbation types as future work.

2.3 Stability of Principal Component Analysis (PCA)

As arguably one of the most popular tools for data visualization and exploration, PCA is used to extract the main features from a dataset or to reduce the dimensionality of the data. There is a vast literature on the analysis of PCA. Most previous works focused on the consistency of principal component directions or eigenvalues Chen et al. (2021b); Cai et al. (2021); Vaswani and Narayanamurthy (2017); Narayanamurthy and Vaswani (2020), while the stability of PC scores (i.e., the projection of the data matrix $A$ onto its PC directions; using our notation from Section 2.2, the PC scores are given by $U_1\Sigma_1$) is less explored, despite its importance in the analysis of various spectral methods. There are several relevant works investigating the stability of PC scores, but their analyses were conducted under different settings. For completeness, we include a brief review here. Abbe et al. (2022) developed an $\ell_p$ analysis for a hollowed version of PCA, where the SVD is conducted on the hollowed Gram matrix $G = \mathcal H(AA^T)$. Here, $A$ is the data matrix and the operator $\mathcal H(\cdot)$ zeros out all diagonal entries of a square matrix. The PC scores are given by $U\Lambda^{1/2}$, where $U$ and $\Lambda$ are the eigenvector matrix and corresponding eigenvalues of $G$. Perturbation bounds on PC scores in the $\ell_{2,p}$-norm were derived in Abbe et al. (2022) to characterize the entrywise behaviour of PCA. Another line of research studied adjacency spectral embedding (ASE) for random dot product graphs (RDPG), which is closely related to PCA in that both return a weighted singular vector matrix. Central limit theorems for rows of ASE have been provided in Tang and Preibe (2018); Athreya et al. (2021). Different from these previous studies, in this section we focus on the stability of the PC scores of the original PCA algorithm, which does not use the hollowed Gram matrix. Given a centered data matrix $A$ and its conformal SVD, PCA returns $U_1\Sigma_1$ (or $V_1\Sigma_1$) as the low-dimensional projection into $\mathbb R^r$. Due to the possible similarity among the singular values within $\Sigma_1$, the PCA embedding may be subject to rotations. Hence when computing the error, we mod out this rotation and aim to bound $\min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\|$ or $\min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\|_F$, where $\|\cdot\|$ denotes the spectral norm.
The main difference between these quantities and the $\sin\Theta$ angle between singular subspaces is that $U_1$ is now multiplied by the corresponding singular values, and it is the perturbation of this product that we want to analyze. Naively, one may expect that the perturbation of $U_1\Sigma_1$ is approximately equal to the perturbation of $U_1$ times $\|\Sigma_1\|$ plus the perturbation of $\Sigma_1$ times $\|U_1\|$, and the perturbation of $U_1$ can in turn be controlled by the $\sin\Theta$ theorem. This argument leads to
$$\min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\| \le c\cdot\frac{\sigma_1(A)\|\Delta A\|}{\sigma_r - \sigma_{r+1}}, \qquad (2.13)$$
where $c$ is some absolute constant. However, this bound is quite large due to the presence of $\sigma_1(A)$ in the numerator. Noticing that $\sigma_1(A)$ appears in (2.13) because we treat $U_1\Sigma_1$ as a whole, in the following theorem we show that the perturbed singular vectors corresponding to different singular values actually have different levels of stability, which in turn enables a tighter bound on the PC scores. More specifically, the next theorem shows that the singular vectors associated with larger singular values are more stable.

Theorem 2.3.1. For $j = 1,\ldots,r$, let $\sin\Theta(\tilde u_j, U_1)$ be the $\sin\Theta$ angle between the $j$th perturbed left singular vector $\tilde u_j$ and the leading $r$-dimensional singular subspace $\mathrm{span}(U_1)$ of $A$. Then, provided that $3\|\Delta A\| \le \sigma_r - \sigma_{r+1}$, we have
$$\|\sin\Theta(\tilde u_j, U_1)\| \le \frac{C\|\Delta A\|}{\sigma_j - \sigma_{r+1}},$$
where $C$ is some universal constant, and by definition $\|\sin\Theta(\tilde u_j,U_1)\| \equiv \|\tilde u_j^T U_2\|$, with $U_2$ the orthogonal complement of $U_1$.

The different levels of stability of the singular vectors observed in Theorem 2.3.1 will help us get rid of $\sigma_1(A)$ and establish a tighter bound on the PC scores.

Theorem 2.3.2. Let $\tilde A = A + \Delta A$, let $U_1\Sigma_1$ be the PCA embedding of $A$ and $\tilde U_1\tilde\Sigma_1$ that of $\tilde A$. Then we have
$$\min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\| \le 3\|\Delta A\| + 3\sigma_{r+1}\min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\},$$
$$\min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\|_F \le \left(2\|(\Delta A)_r\|_F^2 + 3\Big(\|(\Delta A)_r\|_F + \|(\Sigma_2)_r\|_F\min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\}\Big)^2\right)^{1/2} + \|(\Delta A)_r\|_F + \|(\Sigma_2)_r\|_F\min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\}.$$
Here $(\Delta A)_r$ is the best rank-$r$ approximation of $\Delta A$.

The upper bound is tighter than (2.13) and can be used to facilitate the error analysis of PCA-related methods (e.g., Little et al. (2018)).

Remark 2.3.3. When $A$ is rank-$r$, the above result reduces to
$$\min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\| \le 3\|\Delta A\|, \qquad \min_{Q\in\mathbb O_r}\|U_1\Sigma_1 - \tilde U_1\tilde\Sigma_1 Q\|_F \le (\sqrt 5 + 1)\|(\Delta A)_r\|_F.$$

Remark 2.3.4. A tight $\ell_{2,p}$-norm perturbation bound of PC scores for hollowed PCA was developed in Abbe et al. (2022). However, unlike the previous theorem for vanilla PCA, it seems impossible to eliminate $\sigma_1(A)$ from the bound in hollowed PCA, due to the fact that hollowed PCA conducts the decomposition on the Gram matrix instead of the original data matrix $A$. In the noisy setting, the noise on the Gram matrix contains the term $A^T\Delta A$, whose norm may reach $O(\sigma_1(A)\|\Delta A\|)$, with $\sigma_1$ included in the expression.
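The gain over the naive bound (2.13) is easy to see numerically. The sketch below (our own illustration, with illustrative sizes and spectra; the minimum over rotations is again upper-bounded by a Procrustes alignment) compares the PC-score perturbation against both the bound of Theorem 2.3.2 and the $\sigma_1(A)\|\Delta A\|/(\sigma_r-\sigma_{r+1})$ scale of (2.13).

```python
import numpy as np

rng = np.random.default_rng(3)
n = m = 300
r = 3

U0, _ = np.linalg.qr(rng.standard_normal((n, n)))
V0, _ = np.linalg.qr(rng.standard_normal((m, m)))
s = np.concatenate([[50000.0, 5000.0, 500.0], 0.5 * np.ones(m - r)])  # huge sigma_1
A = U0 @ np.diag(s) @ V0.T
dA = rng.standard_normal((n, m))

U, sv, _ = np.linalg.svd(A)
Ut, svt, _ = np.linalg.svd(A + dA)
P, Pt = U[:, :r] * sv[:r], Ut[:, :r] * svt[:r]     # PCA scores U1 Sigma1
W1, _, W2t = np.linalg.svd(Pt.T @ P)               # Procrustes alignment
err = np.linalg.norm(P - Pt @ (W1 @ W2t), 2)
nd = np.linalg.norm(dA, 2)
gap = sv[r - 1] - sv[r]
print(err)
print(3 * nd + 3 * sv[r] * min(2 * nd / gap, 1))   # Theorem 2.3.2 (spectral norm)
print(sv[0] * nd / gap)                            # naive scale of (2.13): far larger
```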
2.4 Stability of singular value truncation

In addition to studying the perturbations of $U_1$ and $U_1\Sigma_1$, we also investigate the stability of the hard singular value thresholding operator, which provides the best rank-$r$ approximation of $A$, i.e., $A_r = U_1\Sigma_1V_1^T$. This operator, also known as singular value truncation, is widely used in matrix completion and matrix denoising for promoting low-rankness or reducing noise Tanner and Wei (2013); Donoho and Gavish (2014); Cai et al. (2010); Gavish and Donoho (2014). Let $\tilde A = A + \Delta A$ be the noisy matrix, and let $\tilde A_r$ denote its best rank-$r$ approximation. We characterize the stability of the hard singular value thresholding operator through a bound on $\|A_r - \tilde A_r\|$. Previous works have investigated the stability of the truncated SVD Luo et al. (2021); Vu et al. (2021), and tight error bounds for low-rank matrices have been derived. However, a tight bound for general matrices is still missing in the literature. For a rank-$r$ matrix $A$ with $r < \min\{m,n\}$, the following perturbation result was obtained in Luo et al. (2021):
$$\|A - \tilde A_r\| \le 2\|\Delta A\|. \qquad (2.14)$$
Since in practice $A$ may not be exactly rank-$r$, we hope to establish upper bounds for general full-rank matrices. We comment that although we can easily derive an upper bound for full-rank matrices from the low-rank result, the resulting bound is not tight. Explicitly, for a full-rank matrix $A$, $A_r$ is of low rank, so we can apply (2.14) to $A_r$ to get
$$\|A_r - \tilde A_r\| = \|A_r - (A+\Delta A)_r\| = \|A_r - (A_r + \tilde E)_r\| \le 2\|\tilde E\| \le 2\|\Delta A\| + 2\|A - A_r\| = 2\|\Delta A\| + 2\sigma_{r+1},$$
where $\tilde E = \Delta A + A - A_r$ and the first inequality used (2.14). Apparently, this bound is not optimal, as it does not shrink to 0 when $\Delta A \to 0$. This motivates us to establish the following tighter bound.

Theorem 2.4.1 (Perturbation result on singular value truncation). Let $A\in\mathbb R^{n\times m}$ be any $n\times m$ matrix and $\tilde A = A + \Delta A$ be its noisy version. Denote by $A_r$ and $\tilde A_r$ their rank-$r$ thresholdings, with all but the first $r$ singular values set to 0. Let $\sigma_i$ be the $i$th largest singular value of $A$ and $\Sigma_2$ be the diagonal matrix containing $\sigma_{r+1},\ldots,\sigma_{\min}$ (the $(r+1)$th to the last singular values of $A$) on the diagonal. Then
$$\|A_r - \tilde A_r\| \le 2\|\Delta A\| + 2\sigma_{r+1}\min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\}, \qquad (2.15)$$
$$\|A_r - \tilde A_r\|_F \le \left(2\|(\Delta A)_r\|_F^2 + 3\Big(\|(\Delta A)_r\|_F + \|(\Sigma_2)_r\|_F\min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\}\Big)^2\right)^{1/2}. \qquad (2.16)$$

This error bound has exactly the same form as the PCA perturbation bound established in the previous section, except that here $A_r$ and $\tilde A_r$ do not differ by a rotation. Intuitively, this indicates that the noise-induced rotation on $\tilde U_1$ and that on $\tilde V_1^T$ can essentially cancel each other.

Remark 2.4.2. When $A$ is a rank-$r$ matrix, the bound in Theorem 2.4.1 reduces to the result in Luo et al. (2021):
$$\|A_r - \tilde A_r\| \le 2\|\Delta A\|, \qquad \|A_r - \tilde A_r\|_F \le \sqrt 5\,\|(\Delta A)_r\|_F. \qquad (2.17)$$
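The behaviour of (2.15) is easy to check empirically. The sketch below (our own illustration; all sizes and spectra are assumptions) truncates a noisy full-rank matrix at rank $r$ for shrinking noise levels and compares the error with the bound, which, unlike the crude $2\|\Delta A\| + 2\sigma_{r+1}$ estimate above, vanishes as the noise does.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, r = 200, 150, 4

def trunc(M, r):
    """Best rank-r approximation (hard singular value thresholding)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

U0, _ = np.linalg.qr(rng.standard_normal((n, m)))
V0, _ = np.linalg.qr(rng.standard_normal((m, m)))
tail = np.sort(rng.uniform(0.1, 1.0, m - r))[::-1]
s = np.concatenate([[100.0, 90.0, 80.0, 70.0], tail])   # full rank, gap after r
A = U0 @ np.diag(s) @ V0.T

for scale in [1.0, 0.1, 0.01]:
    dA = scale * rng.standard_normal((n, m))
    err = np.linalg.norm(trunc(A, r) - trunc(A + dA, r), 2)
    nd = np.linalg.norm(dA, 2)
    bound = 2 * nd + 2 * s[r] * min(2 * nd / (s[r - 1] - s[r]), 1)   # (2.15)
    print(err, bound)   # the bound holds and shrinks to zero with the noise
```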
๏ฃด โˆฅ ๐ด๐‘Ÿ โˆ’ ๐ด ๏ฃณ 2.5 Closed-form expression of sin ฮ˜ distance between two singular spaces The several new results presented in the previous sections are derived either directly or indirectly from a set of sinฮ˜ formulae we shall establish in this section. In other words, these sinฮ˜ formulae serve as useful tools to analyze SVD based perturbation problems. 2.5.1 First order equivalent expressions of the sin ฮ˜ distance Following the same notation as in Section 1.2.1, our goal is to compute the exact expressions of perturbation angles of the leading left singular subspace ๐‘ˆ1 under noise ฮ”๐ด. (1.4) and (1.5) indicate that the matrices ๐‘ˆ2๐‘‡ ๐‘ˆe1 and ๐‘ˆe๐‘‡ ๐‘ˆ1 are key intermediate quantities to bound the sinฮ˜ angles. In the 2 following theorem, we provide useful expressions of these key quantities. 29 Theorem 2.5.1 (Angular perturbation formula). Let ๐ด, ๐ด e = ๐ด + ฮ”๐ด be two ๐‘› ร— ๐‘š matrices and their conformal SVDs are defined as (1.1). The rank of ๐ด is at least ๐‘Ÿ. Assume there is a gap between the ๐‘Ÿth and the (๐‘Ÿ + 1)th singular values, i.e., ๐œŽ๐‘Ÿ โˆ’ e ๐œŽ๐‘Ÿ+1 > 0 and e ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 > 0. Then the following expressions hold: ๐‘ˆ1๐‘‡ ๐‘ˆe2 = ๐น 12 โ—ฆ (๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰ ฮฃ๐‘‡2 + ฮฃ1๐‘‰1๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ e2e e2 ), ๐‘ˆ 1 ๐‘ˆ2๐‘‡ ๐‘ˆe1 = ๐น 21 โ—ฆ (๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰ ฮฃ๐‘‡1 + ฮฃ2๐‘‰2๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ e1e e1 ), ๐‘ˆ 2 (2.18) ๐‘‰1๐‘‡ ๐‘‰ e2 = ๐น๐‘‰12 โ—ฆ (ฮฃ๐‘‡1 ๐‘ˆ1๐‘‡ (ฮ”๐ด)๐‘‰ e2 +๐‘‰ ๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ 1 ฮฃ2 ), e2e e1 = ๐น 21 โ—ฆ (ฮฃ๐‘‡ ๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰ ๐‘‰2๐‘‡ ๐‘‰ e1 +๐‘‰ ๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ ฮฃ1 ). e1e ๐‘‰ 2 2 2 More specifically, the assumption ๐œŽ๐‘Ÿ โˆ’e ๐œŽ๐‘Ÿ+1 > 0 is required for the first and the third expressions of (2.18) to hold, and e ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 > 0 is required for the second and the last expressions to hold. Here โ—ฆ means the Hadamard product, or element-wise product between two matrices. ๐น๐‘ˆ12 โˆˆ R๐‘Ÿร—(๐‘›โˆ’๐‘Ÿ) has entries (๐น๐‘ˆ12 )๐‘–, ๐‘— = 2 1 2 , 1 โ‰ค ๐‘– โ‰ค ๐‘Ÿ, 1 โ‰ค ๐‘— โ‰ค ๐‘› โˆ’ ๐‘Ÿ; ๐น๐‘ˆ21 โˆˆ R (๐‘›โˆ’๐‘Ÿ)ร—๐‘Ÿ has entries (๐น๐‘ˆ21 )๐‘–, ๐‘— = ๐œŽ e โˆ’๐œŽ ๐‘—+๐‘Ÿ ๐‘– 1 , 1 โ‰ค ๐‘– โ‰ค ๐‘› โˆ’ ๐‘Ÿ, 1 โ‰ค ๐‘— โ‰ค ๐‘Ÿ. Similarly, ๐น๐‘‰12 โˆˆ R๐‘Ÿร—(๐‘šโˆ’๐‘Ÿ) has entries (๐น๐‘‰12 )๐‘–, ๐‘— = 2 1 2 , 1 โ‰ค ๐œŽ 2 โˆ’๐œŽ 2 e ๐œŽ e โˆ’๐œŽ ๐‘— ๐‘–+๐‘Ÿ ๐‘—+๐‘Ÿ ๐‘– ๐‘– โ‰ค ๐‘Ÿ, 1 โ‰ค ๐‘— โ‰ค ๐‘š โˆ’๐‘Ÿ; and ๐น๐‘‰21 โˆˆ R (๐‘šโˆ’๐‘Ÿ)ร—๐‘Ÿ with entries (๐น๐‘‰21 )๐‘–, ๐‘— = 2 1 2 , 1 โ‰ค ๐‘– โ‰ค ๐‘š โˆ’๐‘Ÿ, 1 โ‰ค ๐‘— โ‰ค ๐‘Ÿ. ๐œŽ โˆ’๐œŽ e ๐‘— ๐‘–+๐‘Ÿ Here if ๐‘– > min{๐‘›, ๐‘š}, we enforce ๐œŽ๐‘– and e ๐œŽ๐‘– to be 0. Taking the spectral norm on both hand sides of (2.18) gives us the following new expression of the sinฮ˜ distance. Corollary 2.5.2. If the condition in Theorem 2.5.1 is satisfied, then the sin ฮ˜ distances between the ๐‘Ÿ leading singular spaces of the original and the perturbed matrices satisfy โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ e1 )โˆฅ = โˆฅ๐น 12 โ—ฆ (๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰ e2eฮฃ๐‘‡2 + ฮฃ1๐‘‰1๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ e2 )โˆฅ ๐‘ˆ 1 = โˆฅ๐น๐‘ˆ21 โ—ฆ (๐‘ˆ2๐‘‡ (ฮ”๐ด)๐‘‰ e1eฮฃ๐‘‡1 + ฮฃ2๐‘‰2๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ e1 )โˆฅ, โˆฅ sin ฮ˜(๐‘‰1 , ๐‘‰ e1 )โˆฅ = โˆฅ๐น 12 โ—ฆ (ฮฃ๐‘‡ ๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰ e2 +๐‘‰ ๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ e2e ฮฃ2 )โˆฅ ๐‘‰ 1 1 1 = โˆฅ๐น๐‘‰21 โ—ฆ (ฮฃ๐‘‡2 ๐‘ˆ2๐‘‡ (ฮ”๐ด)๐‘‰ e1 +๐‘‰ ๐‘‡ (ฮ”๐ด)๐‘‡ ๐‘ˆ 2 e1e ฮฃ1 )โˆฅ. Remark 2.5.3. 
Remark 2.5.3. In the expressions of Corollary 2.5.2, the singular value gaps enter through the denominators of the terms $F_U^{21}$, $F_U^{12}$, $F_V^{21}$, and $F_V^{12}$. In this sense, (2.18) conveys the same insight as Wedin's $\sin\Theta$ theorem.

Everything else on the right-hand sides of Corollary 2.5.2 is straightforward to bound, except perhaps the Hadamard products. The following lemma shows that the Hadamard products are also relatively easy to treat.

Lemma 2.5.4. Assume $\sigma_r - \tilde\sigma_{r+1} > 0$ and $\tilde\sigma_r - \sigma_{r+1} > 0$; let $F_U^{12}$, $F_U^{21}$, $\Sigma_1$, $\tilde\Sigma_1$, $\Sigma_2$, $\tilde\Sigma_2$ be the same as in Theorem 2.5.1, and let $H_1\in\mathbb R^{(n-r)\times r}$, $H_2\in\mathbb R^{(m-r)\times r}$, $H_3\in\mathbb R^{r\times(m-r)}$, $H_4\in\mathbb R^{r\times(n-r)}$ be arbitrary matrices. Then
$$|||F_U^{21}\circ(H_1\tilde\Sigma_1)||| \le \frac{\tilde\sigma_r}{\tilde\sigma_r^2-\sigma_{r+1}^2}|||H_1|||, \qquad |||F_U^{21}\circ(\Sigma_2H_2)||| \le \frac{\sigma_{r+1}}{\tilde\sigma_r^2-\sigma_{r+1}^2}|||H_2|||, \qquad (2.19)$$
$$|||F_U^{12}\circ(H_3\tilde\Sigma_2^T)||| \le \frac{\tilde\sigma_{r+1}}{\sigma_r^2-\tilde\sigma_{r+1}^2}|||H_3|||, \qquad |||F_U^{12}\circ(\Sigma_1H_4)||| \le \frac{\sigma_r}{\sigma_r^2-\tilde\sigma_{r+1}^2}|||H_4|||, \qquad (2.20)$$
where $|||\cdot|||$ can be either the spectral or the Frobenius norm. Similar results also hold for $F_V^{12}$ and $F_V^{21}$.

2.5.2 Examples in using Theorem 2.5.1

We demonstrate how to use Theorem 2.5.1 to simplify the proofs of some existing perturbation bounds in the literature. The theorem we use to derive all the new results in this dissertation is in the next section (Theorem 2.5.7); curious readers may safely jump to the next section from here.

Example 1: The angular perturbation formulae in Theorem 2.5.1 naturally yield the one-sided $\sin\Theta$ bounds first discovered in Cai and Zhang (2018). Theorem 2.5.1 now offers a very straightforward derivation of these bounds.

Theorem 2.5.5 (One-sided $\sin\Theta$ theorem). Using the same notation and quantities as in Theorem 2.5.1, if $\sigma_r - \tilde\sigma_{r+1} > 0$ and $\tilde\sigma_r - \sigma_{r+1} > 0$, then
$$\|\sin\Theta(U_1,\tilde U_1)\| \le \min\left\{\frac{\tilde\sigma_r\|(\Delta A)\tilde V_1\|}{\tilde\sigma_r^2-\sigma_{r+1}^2} + \frac{\sigma_{r+1}\|(\Delta A)^T\tilde U_1\|}{\tilde\sigma_r^2-\sigma_{r+1}^2},\ \frac{\sigma_r\|(\Delta A)V_1\|}{\sigma_r^2-\tilde\sigma_{r+1}^2} + \frac{\tilde\sigma_{r+1}\|U_1^T(\Delta A)\|}{\sigma_r^2-\tilde\sigma_{r+1}^2}\right\}, \qquad (2.21)$$
$$\|\sin\Theta(V_1,\tilde V_1)\| \le \min\left\{\frac{\tilde\sigma_r\|\tilde U_1^T(\Delta A)\|}{\tilde\sigma_r^2-\sigma_{r+1}^2} + \frac{\sigma_{r+1}\|(\Delta A)\tilde V_1\|}{\tilde\sigma_r^2-\sigma_{r+1}^2},\ \frac{\sigma_r\|U_1^T(\Delta A)\|}{\sigma_r^2-\tilde\sigma_{r+1}^2} + \frac{\tilde\sigma_{r+1}\|(\Delta A)V_1\|}{\sigma_r^2-\tilde\sigma_{r+1}^2}\right\}. \qquad (2.22)$$
Moreover,
$$\max\{\|\sin\Theta(U_1,\tilde U_1)\|, \|\sin\Theta(V_1,\tilde V_1)\|\} \le \min\left\{\frac{1}{\sigma_r-\tilde\sigma_{r+1}}, \frac{1}{\tilde\sigma_r-\sigma_{r+1}}\right\}\|\Delta A\|. \qquad (2.23)$$

(2.21) and (2.22) are individual bounds on $\|\sin\Theta(U_1,\tilde U_1)\|$ and $\|\sin\Theta(V_1,\tilde V_1)\|$, while the classical Wedin $\sin\Theta$ theorem is a uniform bound on both. The benefit of obtaining the individual bounds was clearly pointed out in Cai and Zhang (2018) by an example: let $A\in\mathbb R^{n\times m}$ be a fixed rank-$r$ matrix with $r < n \ll m$, and let $\Delta A\in\mathbb R^{n\times m}$ be a small random matrix with i.i.d. standard normal entries. Wedin's theorem implies
$$\max\{\|\sin\Theta(U_1,\tilde U_1)\|, \|\sin\Theta(V_1,\tilde V_1)\|\} \le \frac{C\max\{\sqrt n,\sqrt m\}}{\sigma_r}, \qquad (2.24)$$
while the one-sided bounds approximately give
$$\|\sin\Theta(U_1,\tilde U_1)\| \le \frac{C\sqrt n}{\sigma_r}, \qquad \|\sin\Theta(V_1,\tilde V_1)\| \le \frac{C\sqrt m}{\sigma_r}. \qquad (2.25)$$
Since we assumed $n \ll m$, only the one-sided bound successfully indicates that $U_1$ is more stable than $V_1$.
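The asymmetry in (2.25) is easy to reproduce numerically. The following sketch (our own illustration, with an assumed spectrum and noise level) computes the two $\sin\Theta$ distances via principal angles for $n \ll m$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, r = 50, 2000, 5

Q1, _ = np.linalg.qr(rng.standard_normal((n, r)))
Q2, _ = np.linalg.qr(rng.standard_normal((m, r)))
A = Q1 @ np.diag(np.linspace(100.0, 60.0, r)) @ Q2.T    # rank-r, n << m
dA = 0.5 * rng.standard_normal((n, m))

def sin_theta(X, Y):
    """Spectral sin-Theta distance between the column spans of X and Y
    (X and Y have orthonormal columns of the same dimension)."""
    sv = np.linalg.svd(X.T @ Y, compute_uv=False)
    return np.sqrt(max(0.0, 1.0 - sv.min() ** 2))

U, s, Vt = np.linalg.svd(A)
Ut, st, Vtt = np.linalg.svd(A + dA)
print(sin_theta(U[:, :r], Ut[:, :r]))     # ~ C*sqrt(n)*0.5/sigma_r: small
print(sin_theta(Vt[:r].T, Vtt[:r].T))     # ~ C*sqrt(m)*0.5/sigma_r: much larger
```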
The proof of Theorem 2.5.5 is a simple application of Theorem 2.5.1.

Proof: From Theorem 2.5.1 we have
$$U_2^T\tilde U_1 = F_U^{21}\circ(U_2^T(\Delta A)\tilde V_1\tilde\Sigma_1^T + \Sigma_2V_2^T(\Delta A)^T\tilde U_1), \qquad U_1^T\tilde U_2 = F_U^{12}\circ(U_1^T(\Delta A)\tilde V_2\tilde\Sigma_2^T + \Sigma_1V_1^T(\Delta A)^T\tilde U_2).$$
By (2.19) in Lemma 2.5.4,
$$\|U_2^T\tilde U_1\| \le \frac{\tilde\sigma_r}{\tilde\sigma_r^2-\sigma_{r+1}^2}\|U_2^T(\Delta A)\tilde V_1\| + \frac{\sigma_{r+1}}{\tilde\sigma_r^2-\sigma_{r+1}^2}\|V_2^T(\Delta A)^T\tilde U_1\| \le \frac{\tilde\sigma_r}{\tilde\sigma_r^2-\sigma_{r+1}^2}\|(\Delta A)\tilde V_1\| + \frac{\sigma_{r+1}}{\tilde\sigma_r^2-\sigma_{r+1}^2}\|(\Delta A)^T\tilde U_1\| \qquad (2.26)$$
$$\le \frac{\|\Delta A\|}{\tilde\sigma_r-\sigma_{r+1}}. \qquad (2.27)$$
Similarly,
$$\|U_1^T\tilde U_2\| \le \frac{\tilde\sigma_{r+1}}{\sigma_r^2-\tilde\sigma_{r+1}^2}\|U_1^T\Delta A\| + \frac{\sigma_r}{\sigma_r^2-\tilde\sigma_{r+1}^2}\|(\Delta A)V_1\| \qquad (2.28)$$
$$\le \frac{\|\Delta A\|}{\sigma_r-\tilde\sigma_{r+1}}. \qquad (2.29)$$
Inserting (2.26) and (2.28) into $\|\sin\Theta(U_1,\tilde U_1)\| = \min\{\|U_1^T\tilde U_2\|, \|U_2^T\tilde U_1\|\}$, we obtain (2.21). Similarly, (2.22) also holds. (2.23) is obtained by using (2.27) and (2.29). □

Example 2: In this example, we show that one may obtain interesting results when applying Theorem 2.5.1 to some less usual choices of $\Delta A$. Explicitly, we use Theorem 2.5.1 to re-derive a useful result in Cai and Zhang (2018) with a more straightforward proof. The result, copied in Proposition 2.5.6, concerns the $\sin\Theta$ distance between the leading singular subspace of a matrix $A$ and an arbitrary subspace.

Proposition 2.5.6 (Proposition 1 in Cai and Zhang (2018)). Suppose $A\in\mathbb R^{n\times m}$. The orthonormal matrix $V = [V_1, V_2]\in\mathbb R^{m\times m}$ is the matrix of right singular vectors of $A$, i.e., $V_1\in\mathbb R^{m\times r}$ and $V_2\in\mathbb R^{m\times(m-r)}$ correspond to the first $r$ and last $m-r$ singular vectors, respectively. $[W_1,W_2]\in\mathbb R^{m\times m}$ is any orthonormal matrix with $W_1\in\mathbb R^{m\times r}$, $W_2\in\mathbb R^{m\times(m-r)}$. Given that $\sigma_r(AW_1) > \sigma_{r+1}(A)$, we have
$$\|\sin\Theta(V_1,W_1)\| \le \min\left\{\frac{\sigma_r(AW_1)\|P_{AW_1}AW_2\|}{\sigma_r^2(AW_1)-\sigma_{r+1}^2(A)}, 1\right\}, \qquad (2.30)$$
$$\|\sin\Theta(V_1,W_1)\|_F \le \min\left\{\frac{\sigma_r(AW_1)\|P_{AW_1}AW_2\|_F}{\sigma_r^2(AW_1)-\sigma_{r+1}^2(A)}, \sqrt r\right\}. \qquad (2.31)$$

In order to use Theorem 2.5.1 to prove Proposition 2.5.6, we recognize that Proposition 2.5.6 is actually a $\sin\Theta$ bound under a special perturbation. Specifically, if we set $\Delta A = AW_1W_1^T - A$, then the quantity $\sin\Theta(V_1,W_1)$ bounded in Proposition 2.5.6 is exactly the $\sin\Theta$ angle between $A$ and $\tilde A = A + \Delta A$. In addition, this particular choice of $\Delta A$ has a small norm, therefore leading to a small perturbation bound.
Proof: Apply Theorem 2.5.1 to $A$ and $\tilde A = AW_1W_1^T$, which means $\Delta A = \tilde A - A = AW_1W_1^T - A = -AW_2W_2^T$. Let $U_i, V_i, \Sigma_i, \tilde U_i, \tilde V_i, \tilde\Sigma_i$, $i = 1,2$, be from the conformal SVDs (1.1) of this $A$ and $\tilde A$. Then, using the notation of Theorem 2.5.1, we have $(F_V^{21})_{i,j} = \frac{1}{\sigma_j^2(AW_1)-\sigma_{i+r}^2(A)}$, $\tilde V_1 = W_1$, and $\Sigma_2^TU_2^T(\Delta A)\tilde V_1 = 0$. Theorem 2.5.1 in this case gives
$$V_2^TW_1 = F_V^{21}\circ(V_2^T(\Delta A)^T\tilde U_1\tilde\Sigma_1).$$
By Lemma 2.5.4, this implies
$$|||V_2^TW_1||| \le \frac{\sigma_r(AW_1)\,|||\tilde U_1^TAW_2W_2^TV_2|||}{\sigma_r^2(AW_1)-\sigma_{r+1}^2(A)} \le \frac{\sigma_r(AW_1)\,|||P_{AW_1}AW_2|||}{\sigma_r^2(AW_1)-\sigma_{r+1}^2(A)},$$
where $|||\cdot|||$ can be either the spectral or the Frobenius norm. Also, we directly have $\|V_2^TW_1\| \le 1$ and $\|V_2^TW_1\|_F \le \sqrt r$; thus (2.30) and (2.31) hold. □

2.5.3 High-order sin Θ distance formulae using series expansions

Although the formulae in Theorem 2.5.1 are already quite useful, they are still only first-order formulae in the following sense. Looking at the first formula in (2.18) of Theorem 2.5.1, a closer examination shows that the unknown left-hand side $U_1^T\tilde U_2$ also appears implicitly on the right-hand side, albeit in higher-order terms. Since we consider upper bounds in the non-asymptotic regime, higher-order errors may sometimes affect the tightness of the bound, so we wish to remove them. To be more specific about the implicit appearance of the higher-order terms, we denote the left-hand sides of the four formulae in Theorem 2.5.1 by $X, Y, W, Z$:
$$X := U_1^T\tilde U_2, \quad Y := U_2^T\tilde U_1, \quad W := V_1^T\tilde V_2, \quad Z := V_2^T\tilde V_1.$$
First focus on the expression of $Y$ in Theorem 2.5.1:
$$\begin{aligned}
Y \equiv U_2^T\tilde U_1 &= F_U^{21}\circ(U_2^T(\Delta A)\tilde V_1\tilde\Sigma_1^T + \Sigma_2V_2^T(\Delta A)^T\tilde U_1)\\
&= F_U^{21}\circ(\Sigma_2V_2^T(\Delta A)^TU_1U_1^T\tilde U_1 + \Sigma_2V_2^T(\Delta A)^TU_2U_2^T\tilde U_1 + U_2^T(\Delta A)V_1V_1^T\tilde V_1\tilde\Sigma_1^T + U_2^T(\Delta A)V_2V_2^T\tilde V_1\tilde\Sigma_1^T)\\
&= \underbrace{F_U^{21}\circ(\Sigma_2\alpha_{12}^TU_1^T\tilde U_1 + \alpha_{21}V_1^T\tilde V_1\tilde\Sigma_1^T)}_{=:C_1} + F_U^{21}\circ(\Sigma_2\alpha_{22}^TY) + F_U^{21}\circ(\alpha_{22}Z\tilde\Sigma_1^T), \qquad (2.32)
\end{aligned}$$
where the second line used $U_1U_1^T + U_2U_2^T = I$ and $V_1V_1^T + V_2V_2^T = I$, the third line is a regrouping of terms, and $\alpha_{ij} := U_i^T\Delta AV_j$, $1\le i,j\le 2$. We can get the same type of expression for $Z$:
$$\begin{aligned}
Z \equiv V_2^T\tilde V_1 &= F_V^{21}\circ(\Sigma_2^TU_2^T(\Delta A)\tilde V_1 + V_2^T(\Delta A)^T\tilde U_1\tilde\Sigma_1)\\
&= F_V^{21}\circ(V_2^T(\Delta A)^TU_1U_1^T\tilde U_1\tilde\Sigma_1 + V_2^T(\Delta A)^TU_2U_2^T\tilde U_1\tilde\Sigma_1 + \Sigma_2^TU_2^T(\Delta A)V_1V_1^T\tilde V_1 + \Sigma_2^TU_2^T(\Delta A)V_2V_2^T\tilde V_1)\\
&= \underbrace{F_V^{21}\circ(\alpha_{12}^TU_1^T\tilde U_1\tilde\Sigma_1 + \Sigma_2^T\alpha_{21}V_1^T\tilde V_1)}_{=:C_2} + F_V^{21}\circ(\alpha_{22}^TY\tilde\Sigma_1) + F_V^{21}\circ(\Sigma_2^T\alpha_{22}Z). \qquad (2.33)
\end{aligned}$$
Looking at the last right-hand sides of (2.32) and (2.33), we see that $Y$ and $Z$ are contained in the second and third terms, respectively, so they appear on both sides.
To highlight this structure, we shorten the notation by letting $\mathcal F$ be the linear operator defined as
$$\mathcal F\begin{pmatrix}Y\\Z\end{pmatrix} = \begin{pmatrix}F_U^{21}\circ(\Sigma_2\alpha_{22}^TY) + F_U^{21}\circ(\alpha_{22}Z\tilde\Sigma_1^T)\\ F_V^{21}\circ(\alpha_{22}^TY\tilde\Sigma_1) + F_V^{21}\circ(\Sigma_2^T\alpha_{22}Z)\end{pmatrix}.$$
Then (2.32) and (2.33) become
$$\begin{pmatrix}Y\\Z\end{pmatrix} = \begin{pmatrix}C_1\\C_2\end{pmatrix} + \mathcal F\begin{pmatrix}Y\\Z\end{pmatrix}.$$
Clearly, this is an implicit system of equations for $Y$ and $Z$. Provided $\|\mathcal F\| < 1$, we can move $\mathcal F$ to the left and take the inverse:
$$\begin{pmatrix}U_2^T\tilde U_1\\V_2^T\tilde V_1\end{pmatrix} \equiv \begin{pmatrix}Y\\Z\end{pmatrix} = (I-\mathcal F)^{-1}\begin{pmatrix}C_1\\C_2\end{pmatrix} = \sum_{k=0}^{\infty}\mathcal F^k\begin{pmatrix}C_1\\C_2\end{pmatrix}.$$
This gives us a series expression for the quantities $U_2^T\tilde U_1$ and $V_2^T\tilde V_1$, which allows us to derive Theorem 2.2.6 and Theorem 2.3.1 presented before. We summarize this result in the following theorem.

Theorem 2.5.7 (Angular perturbation formula using series expansion). Using the same notation and quantities as in Theorem 2.5.1, we have
$$\begin{pmatrix}U_2^T\tilde U_1\\V_2^T\tilde V_1\end{pmatrix} = \begin{pmatrix}C_1\\C_2\end{pmatrix} + \mathcal F\begin{pmatrix}U_2^T\tilde U_1\\V_2^T\tilde V_1\end{pmatrix}, \qquad \begin{pmatrix}U_1^T\tilde U_2\\V_1^T\tilde V_2\end{pmatrix} = \begin{pmatrix}C_3\\C_4\end{pmatrix} + \mathcal G\begin{pmatrix}U_1^T\tilde U_2\\V_1^T\tilde V_2\end{pmatrix}. \qquad (2.34)$$
In addition, provided that $\|\mathcal F\| < 1$ and $\|\mathcal G\| < 1$, we have
$$\begin{pmatrix}U_2^T\tilde U_1\\V_2^T\tilde V_1\end{pmatrix} = \sum_{k=0}^{\infty}\mathcal F^k\begin{pmatrix}C_1\\C_2\end{pmatrix}, \qquad \begin{pmatrix}U_1^T\tilde U_2\\V_1^T\tilde V_2\end{pmatrix} = \sum_{k=0}^{\infty}\mathcal G^k\begin{pmatrix}C_3\\C_4\end{pmatrix}. \qquad (2.35)$$
Here
$$\mathcal F\begin{pmatrix}C_1\\C_2\end{pmatrix} = \begin{pmatrix}F_U^{21}\circ(\Sigma_2\alpha_{22}^TC_1) + F_U^{21}\circ(\alpha_{22}C_2\tilde\Sigma_1^T)\\ F_V^{21}\circ(\alpha_{22}^TC_1\tilde\Sigma_1) + F_V^{21}\circ(\Sigma_2^T\alpha_{22}C_2)\end{pmatrix}, \qquad \mathcal G\begin{pmatrix}C_3\\C_4\end{pmatrix} = \begin{pmatrix}F_U^{12}\circ(\alpha_{11}C_4\tilde\Sigma_2^T) + F_U^{12}\circ(\Sigma_1\alpha_{11}^TC_3)\\ F_V^{12}\circ(\Sigma_1^T\alpha_{11}C_4) + F_V^{12}\circ(\alpha_{11}^TC_3\tilde\Sigma_2)\end{pmatrix},$$
$$C_1 = F_U^{21}\circ(\Sigma_2\alpha_{12}^TU_1^T\tilde U_1 + \alpha_{21}V_1^T\tilde V_1\tilde\Sigma_1^T), \qquad C_2 = F_V^{21}\circ(\alpha_{12}^TU_1^T\tilde U_1\tilde\Sigma_1 + \Sigma_2^T\alpha_{21}V_1^T\tilde V_1),$$
$$C_3 = F_U^{12}\circ(\alpha_{12}V_2^T\tilde V_2\tilde\Sigma_2^T + \Sigma_1\alpha_{21}^TU_2^T\tilde U_2), \qquad C_4 = F_V^{12}\circ(\Sigma_1^T\alpha_{12}V_2^T\tilde V_2 + \alpha_{21}^TU_2^T\tilde U_2\tilde\Sigma_2),$$
and $\alpha_{ij} := U_i^T\Delta AV_j$.

Remark 2.5.8. Careful readers may observe that, although we removed all the cross terms $U_1^T\tilde U_2$, $U_2^T\tilde U_1$, $V_1^T\tilde V_2$, $V_2^T\tilde V_1$ from the right-hand sides of the expressions (2.35), terms like $U_1^T\tilde U_1$ and $V_1^T\tilde V_1$ still appear on the right-hand side. In fact, these terms are of order $O(1)$, and thus will not degrade the tightness of the upper bounds by any order of magnitude; they only possibly affect the constants.
When $A$ has rank $r$, Theorem 2.5.7 reduces to the following simpler formulae.

Corollary 2.5.9. Using the definitions above, when $A$ has rank $r$ and $\|\Delta A\| < \sigma_r(\tilde A)$,
$$U_2^T\tilde U_1 = \sum_{k=0}^{+\infty}(\alpha_{22}\alpha_{22}^T)^k\big(\alpha_{21}V_1^T\tilde V_1 + \alpha_{22}\alpha_{12}^TU_1^T\tilde U_1\tilde\Sigma_1^{-1}\big)\tilde\Sigma_1^{-(2k+1)},$$
$$V_2^T\tilde V_1 = \sum_{k=0}^{+\infty}(\alpha_{22}^T\alpha_{22})^k\big(\alpha_{12}^TU_1^T\tilde U_1 + \alpha_{22}^T\alpha_{21}V_1^T\tilde V_1\tilde\Sigma_1^{-1}\big)\tilde\Sigma_1^{-(2k+1)}. \qquad (2.36)$$

Remark 2.5.10. When $A$ has rank $r$ and $\alpha_{22}$ is full rank, Corollary 2.5.9 can also be derived from the series expansion for Sylvester-type equations. Denote
$$M = \begin{pmatrix}0 & \alpha_{22}\\ \alpha_{22}^T & 0\end{pmatrix}^{-1}, \qquad X = \begin{pmatrix}U_2^T\tilde U_1\\ V_2^T\tilde V_1\end{pmatrix}, \qquad B = \tilde\Sigma_1^{-1}, \qquad Y = M\begin{pmatrix}\alpha_{21}V_1^T\tilde V_1\\ \alpha_{12}^TU_1^T\tilde U_1\end{pmatrix}\tilde\Sigma_1^{-1}.$$
A direct calculation gives $MX - XB = Y$. By the assumption in Corollary 2.5.9, $\|\alpha_{22}\| \le \|\Delta A\| < \sigma_r(\tilde A)$, so every eigenvalue $\lambda$ of the matrix $M$ satisfies $|\lambda| > \frac{1}{\sigma_r(\tilde A)}$. The classical series expansion for Sylvester-type equations (Theorem VII.2.2 in Bhatia (2013)) then also leads to equation (2.36).

2.5.4 Examples of Using Theorem 2.5.7

Theorem 2.5.7 is used to derive the refined $\ell_{2,\infty}$ bound (Theorem 2.2.6) and the $\sin\Theta$ bound between singular vectors and the singular subspace in which they reside (Theorem 2.3.1), which provide the main intuition behind our PCA and singular value truncation results (Theorem 2.3.2 and Theorem 2.4.1) in Section 2.3 and Section 2.4. Here we only present the proof of Theorem 2.3.1; the proofs of the main results are provided in Section 2.6.
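Before turning to that proof, the series (2.36) can be sanity-checked numerically. The sketch below (our own illustration, written under our reading of the first formula in (2.36), with assumed sizes and spectrum) sums a truncation of the series for a small exactly rank-$r$ example and compares it with $U_2^T\tilde U_1$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, r = 7, 6, 2

# exactly rank-r matrix, small Gaussian perturbation
Q1, _ = np.linalg.qr(rng.standard_normal((n, r)))
Q2, _ = np.linalg.qr(rng.standard_normal((m, r)))
A = Q1 @ np.diag([5.0, 4.0]) @ Q2.T
dA = 1e-2 * rng.standard_normal((n, m))

U, s, Vt = np.linalg.svd(A, full_matrices=True); V = Vt.T
Ut, st, Vtt = np.linalg.svd(A + dA, full_matrices=True); Vtil = Vtt.T
U1, U2, V1, V2 = U[:, :r], U[:, r:], V[:, :r], V[:, r:]
Ut1, Vt1 = Ut[:, :r], Vtil[:, :r]

a12, a21, a22 = U1.T @ dA @ V2, U2.T @ dA @ V1, U2.T @ dA @ V2
S1inv = np.diag(1.0 / st[:r])

# partial sums of the first series in (2.36)
core = a21 @ V1.T @ Vt1 + a22 @ a12.T @ U1.T @ Ut1 @ S1inv
Y, M = np.zeros((n - r, r)), np.eye(n - r)
for k in range(60):
    Y += M @ core @ np.linalg.matrix_power(S1inv, 2 * k + 1)
    M = M @ (a22 @ a22.T)
print(np.max(np.abs(Y - U2.T @ Ut1)))   # should be tiny: the series recovers U_2^T Utilde_1
```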
๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 e 2 Summing up the first inequality multiplied by e ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 โˆ’๐œŽ ๐‘Ÿ+1 โˆฅ๐›ผ22 โˆฅ and the second inequality multiplied by e ๐œŽ ๐‘— โˆฅ๐›ผ22 โˆฅ, after some simplification we get ๐œŽ ๐‘— โˆฅ๐›ผ21 โˆฅ + ๐œŽ๐‘Ÿ+1 โˆฅ๐›ผ12 โˆฅ + โˆฅ๐›ผ22 โˆฅโˆฅ๐›ผ12 โˆฅ e โˆฅ๐‘Œ ๐‘— โˆฅ = โˆฅ๐‘ˆ2๐‘‡ e ๐‘ข๐‘—โˆฅ โ‰ค ๐œŽ 2๐‘— โˆ’ (๐œŽ๐‘Ÿ+1 + โˆฅ๐›ผ22 โˆฅ) 2 e (๐œŽ ๐‘— โˆ’ โˆฅฮ”๐ดโˆฅ)โˆฅฮ”๐ดโˆฅ + ๐œŽ๐‘Ÿ+1 โˆฅฮ”๐ดโˆฅ + โˆฅฮ”๐ดโˆฅ 2 โ‰ค (๐œŽ ๐‘— โˆ’ โˆฅฮ”๐ดโˆฅ) 2 โˆ’ (๐œŽ๐‘Ÿ+1 + โˆฅฮ”๐ดโˆฅ) 2 โˆฅฮ”๐ดโˆฅ โ‰ค ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 โˆ’ 2โˆฅฮ”๐ดโˆฅ 3โˆฅฮ”๐ดโˆฅ โ‰ค , ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 provided that 3โˆฅฮ”๐ดโˆฅ โ‰ค ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 . Here the second inequality is because the upper bound on the right-hand side is decreasing with respect to e ๐œŽ ๐‘— and increasing with respect to โˆฅ๐›ผ22 โˆฅ. Similarly, we also have e๐œŽ ๐‘— โˆฅ๐›ผ12 โˆฅ + ๐œŽ๐‘Ÿ+1 โˆฅ๐›ผ21 โˆฅ + โˆฅ๐›ผ22 โˆฅโˆฅ๐›ผ21 โˆฅ 3โˆฅฮ”๐ดโˆฅ โˆฅ๐‘ ๐‘— โˆฅ = โˆฅ๐‘‰2๐‘‡ e ๐‘ฃ๐‘—โˆฅ โ‰ค โ‰ค . e๐œŽ 2๐‘— โˆ’ (๐œŽ๐‘Ÿ+1 + โˆฅ๐›ผ22 โˆฅ) 2 ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 โ–ก 37 2.6 Proof of the main results In Section 2.6.1, we derive the proof of Theorem 2.5.1 and Lemma 2.5.4. After that, we present the proof of Theorem 2.2.6 in Section 2.6.2. Since the proof of one key lemma (Lemma 2.6.3) is long and involved, we divide it into low-rank case and full-rank case. We prove the low-rank case in Section 2.6.3, and the proof of full-rank case is deferred to appendix. In Section 2.6.4 we provide the proof of Theorem 2.4.1, while the proof of Theorem 2.3.2 can be found in Section 2.6.5. 2.6.1 Proof of Theorem 2.5.1 and Lemma 2.5.4 Proof of Theorem 2.5.1: First, decompose the perturbation ฮ”๐ด in the following two ways: ฮ”๐ด = ๐ด eโˆ’ ๐ด = ๐‘ˆ eeฮฃ๐‘‰e๐‘‡ โˆ’ ๐‘ˆฮฃ๐‘‰ ๐‘‡ = (๐‘ˆ + ฮ”๐‘ˆ)e ฮฃ๐‘‰e๐‘‡ โˆ’ ๐‘ˆฮฃ(๐‘‰ e โˆ’ ฮ”๐‘‰)๐‘‡ (2.37) = ๐‘ˆeฮฃ๐‘‰ e๐‘‡ + (ฮ”๐‘ˆ)e e๐‘‡ โˆ’ ๐‘ˆฮฃ๐‘‰ ฮฃ๐‘‰ e๐‘‡ + ๐‘ˆฮฃ(ฮ”๐‘‰)๐‘‡ = ๐‘ˆ (ฮ”ฮฃ)๐‘‰ e๐‘‡ + (ฮ”๐‘ˆ)e ฮฃ๐‘‰e๐‘‡ + ๐‘ˆฮฃ(ฮ”๐‘‰)๐‘‡ , and ฮ”๐ด = ๐ด eโˆ’ ๐ด = ๐‘ˆ eeฮฃ๐‘‰e๐‘‡ โˆ’ ๐‘ˆฮฃ๐‘‰ ๐‘‡ =๐‘ˆ ฮฃ(๐‘‰ + ฮ”๐‘‰)๐‘‡ โˆ’ (๐‘ˆ ee e โˆ’ ฮ”๐‘ˆ)ฮฃ๐‘‰ ๐‘‡ (2.38) =๐‘ˆ ฮฃ๐‘‰ ๐‘‡ + ๐‘ˆ ee eeฮฃ(ฮ”๐‘‰)๐‘‡ โˆ’ ๐‘ˆฮฃ๐‘‰ e ๐‘‡ + (ฮ”๐‘ˆ)ฮฃ๐‘‰ ๐‘‡ =๐‘ˆ e(ฮ”ฮฃ)๐‘‰ ๐‘‡ + ๐‘ˆ eeฮฃ(ฮ”๐‘‰)๐‘‡ + (ฮ”๐‘ˆ)ฮฃ๐‘‰ ๐‘‡ . Multiplying (2.37) with ๐‘ˆ๐‘‡ on the left and ๐‘‰ e on the right leads to ๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰ e = ฮ”ฮฃ + ๐‘ˆ๐‘‡ (ฮ”๐‘ˆ)e ฮฃ + ฮฃ(ฮ”๐‘‰)๐‘‡ ๐‘‰e. (2.39) Similarly, multiplying (2.38) with ๐‘ˆe๐‘‡ on the left and ๐‘‰ on the right we obtain ๐‘ˆe๐‘‡ (ฮ”๐ด)๐‘‰ = ฮ”ฮฃ + e ฮฃ(ฮ”๐‘‰)๐‘‡ ๐‘‰ + ๐‘ˆ e๐‘‡ (ฮ”๐‘ˆ)ฮฃ. (2.40) Denote ๐‘‘๐‘ƒ = ๐‘ˆ๐‘‡ (ฮ”๐ด)๐‘‰, e ๐‘‘ ๐‘ƒยฏ = ๐‘ˆe๐‘‡ (ฮ”๐ด)๐‘‰, ฮ”ฮฉ๐‘ˆ = ๐‘ˆ๐‘‡ (ฮ”๐‘ˆ), ฮ”ฮฉ๐‘‰ = ๐‘‰ ๐‘‡ (ฮ”๐‘‰). Notice that ๐ผ = e๐‘‡ ๐‘ˆ ๐‘ˆ e = ๐‘ˆ๐‘‡ ๐‘ˆ gives (๐‘ˆ + ฮ”๐‘ˆ)๐‘‡ ๐‘ˆ e = ๐‘ˆ ๐‘‡ (๐‘ˆ e โˆ’ ฮ”๐‘ˆ), hence ๐‘ˆ๐‘‡ ฮ”๐‘ˆ = โˆ’ฮ”๐‘ˆ๐‘‡ ๐‘ˆ. e Similarly, we also have 38 ๐‘‰ ๐‘‡ ฮ”๐‘‰ = โˆ’ฮ”๐‘‰ ๐‘‡ ๐‘‰. e Plugging these into (2.39) and (2.40), we have ๏ฃฑ ๐‘‡ ๏ฃฒ ๐‘‘๐‘ƒ = ๐‘ˆ ฮ”๐ด๐‘‰ ๏ฃด ๏ฃด ๏ฃด e = ฮ”ฮฃ + ฮ”ฮฉ๐‘ˆ e ฮฃ โˆ’ ฮฃฮ”ฮฉ๐‘‰ , (2.41) ๏ฃด ๐‘‘ ๐‘ƒยฏ = ๐‘ˆe๐‘‡ ฮ”๐ด๐‘‰ = ฮ”ฮฃ + e ๐‘‡ โˆ’ ฮ”ฮฉ๐‘‡ ฮฃ. 
2.6 Proof of the main results

In Section 2.6.1, we present the proofs of Theorem 2.5.1 and Lemma 2.5.4. After that, we present the proof of Theorem 2.2.6 in Section 2.6.2. Since the proof of one key lemma (Lemma 2.6.3) is long and involved, we divide it into the low-rank case and the full-rank case: we prove the low-rank case in Section 2.6.3 and defer the proof of the full-rank case to the appendix. In Section 2.6.4 we provide the proof of Theorem 2.4.1, while the proof of Theorem 2.3.2 can be found in Section 2.6.5.

2.6.1 Proof of Theorem 2.5.1 and Lemma 2.5.4

Proof of Theorem 2.5.1: First, decompose the perturbation $\Delta A$ in the following two ways:
$$\Delta A = \tilde A - A = \tilde U\tilde\Sigma\tilde V^T - U\Sigma V^T = (U+\Delta U)\tilde\Sigma\tilde V^T - U\Sigma(\tilde V-\Delta V)^T = U(\Delta\Sigma)\tilde V^T + (\Delta U)\tilde\Sigma\tilde V^T + U\Sigma(\Delta V)^T, \qquad (2.37)$$
and
$$\Delta A = \tilde A - A = \tilde U\tilde\Sigma(V+\Delta V)^T - (\tilde U-\Delta U)\Sigma V^T = \tilde U(\Delta\Sigma)V^T + \tilde U\tilde\Sigma(\Delta V)^T + (\Delta U)\Sigma V^T. \qquad (2.38)$$
Multiplying (2.37) with $U^T$ on the left and $\tilde V$ on the right leads to
$$U^T(\Delta A)\tilde V = \Delta\Sigma + U^T(\Delta U)\tilde\Sigma + \Sigma(\Delta V)^T\tilde V. \qquad (2.39)$$
Similarly, multiplying (2.38) with $\tilde U^T$ on the left and $V$ on the right, we obtain
$$\tilde U^T(\Delta A)V = \Delta\Sigma + \tilde\Sigma(\Delta V)^TV + \tilde U^T(\Delta U)\Sigma. \qquad (2.40)$$
Denote $dP = U^T(\Delta A)\tilde V$, $d\bar P = \tilde U^T(\Delta A)V$, $\Delta\Omega_U = U^T(\Delta U)$, $\Delta\Omega_V = V^T(\Delta V)$. Notice that $I = \tilde U^T\tilde U = U^TU$ gives $(U+\Delta U)^T\tilde U = U^T(\tilde U-\Delta U)$, hence $U^T\Delta U = -\Delta U^T\tilde U$. Similarly, we also have $V^T\Delta V = -\Delta V^T\tilde V$. Plugging these into (2.39) and (2.40), we have
$$\begin{cases} dP = U^T\Delta A\tilde V = \Delta\Sigma + \Delta\Omega_U\tilde\Sigma - \Sigma\Delta\Omega_V,\\ d\bar P = \tilde U^T\Delta AV = \Delta\Sigma + \tilde\Sigma\Delta\Omega_V^T - \Delta\Omega_U^T\Sigma.\end{cases} \qquad (2.41)$$
Next, from (2.41) we can cancel $\Delta\Omega_V$ via
$$G_U := dP\tilde\Sigma^T + \Sigma d\bar P^T = \Delta\Sigma\tilde\Sigma^T + \Sigma(\Delta\Sigma)^T + \Delta\Omega_U\tilde\Sigma\tilde\Sigma^T - \Sigma\Sigma^T\Delta\Omega_U = \tilde\Sigma\tilde\Sigma^T - \Sigma\Sigma^T + \Delta\Omega_U\tilde\Sigma\tilde\Sigma^T - \Sigma\Sigma^T\Delta\Omega_U.$$
Let $\Delta\Omega_U = \{w_{ij}\}_{i,j=1}^n$; then for all $1\le i,j\le n$ the following equations hold:
$$(G_U)_{ij} = \begin{cases}(\tilde\sigma_j^2-\sigma_i^2)w_{ij}, & i\ne j,\\ (\tilde\sigma_j^2-\sigma_j^2)(w_{jj}+1), & i = j.\end{cases} \qquad (2.42)$$
Here if $i > \min\{n,m\}$, we define $\sigma_i$ and $\tilde\sigma_i$ to be 0. Also, define $F_U^{12}, F_U^{21}, F_V^{12}, F_V^{21}$ as in the statement of Theorem 2.5.1. By assumption, $\tilde\sigma_r - \sigma_{r+1} > 0$ and $\sigma_r - \tilde\sigma_{r+1} > 0$, so one can directly check that the denominators in these four matrices have only nonzero entries and are thus well defined. Consider the upper-right part of $\Delta\Omega_U = U^T(\Delta U)$, that is, $1\le i\le r$, $r+1\le j\le n$; from (2.42) we have
$$w_{ij} = \frac{1}{\tilde\sigma_j^2-\sigma_i^2}(G_U)_{ij}, \qquad 1\le i\le r,\ r+1\le j\le n.$$
Therefore,
$$U_1^T\tilde U_2 = U_1^T(\Delta U_2) = F_U^{12}\circ(G_U^{12}) = F_U^{12}\circ(U_1^T(\Delta A)\tilde V_2\tilde\Sigma_2^T + \Sigma_1V_1^T(\Delta A)^T\tilde U_2).$$
Following the same reasoning, we also obtain
$$U_2^T\tilde U_1 = U_2^T(\Delta U_1) = F_U^{21}\circ(U_2^T(\Delta A)\tilde V_1\tilde\Sigma_1^T + \Sigma_2V_2^T(\Delta A)^T\tilde U_1),$$
$$V_1^T\tilde V_2 = V_1^T(\Delta V_2) = F_V^{12}\circ(\Sigma_1^TU_1^T(\Delta A)\tilde V_2 + V_1^T(\Delta A)^T\tilde U_2\tilde\Sigma_2),$$
$$V_2^T\tilde V_1 = V_2^T(\Delta V_1) = F_V^{21}\circ(\Sigma_2^TU_2^T(\Delta A)\tilde V_1 + V_2^T(\Delta A)^T\tilde U_1\tilde\Sigma_1). \qquad \Box$$

Proof of Lemma 2.5.4: Here we only prove the first inequality in (2.19), i.e., $|||F_U^{21}\circ(H_1\tilde\Sigma_1)||| \le \frac{\tilde\sigma_r}{\tilde\sigma_r^2-\sigma_{r+1}^2}|||H_1|||$; the other three inequalities can be proved similarly. Recall that the definition of $F_U^{21}$
is $(F_U^{21})_{i-r,j} = \frac{1}{\tilde\sigma_j^2-\sigma_i^2}$ for $r+1\le i\le n$, $1\le j\le r$. We directly have
$$F_U^{21}\circ(H_1\tilde\Sigma_1) = \bar F_U^{21}\circ H_1, \qquad (\bar F_U^{21})_{i-r,j} = \frac{\tilde\sigma_j}{\tilde\sigma_j^2-\sigma_i^2}, \quad r+1\le i\le n,\ 1\le j\le r.$$
Let $B_1 = F_U^{21}\circ(H_1\tilde\Sigma_1)$; then $H_1 = \check F_U^{21}\circ B_1$, where
$$(\check F_U^{21})_{i-r,j} = \frac{\tilde\sigma_j^2-\sigma_i^2}{\tilde\sigma_j} = \tilde\sigma_j - \frac{\sigma_i^2}{\tilde\sigma_j}, \quad r+1\le i\le n,\ 1\le j\le r.$$
Inserting this expression of $\check F_U^{21}$ into $H_1 = \check F_U^{21}\circ B_1$, we have
$$H_1 = B_1\,\mathrm{diag}(\tilde\sigma_1,\ldots,\tilde\sigma_r) - \mathrm{diag}(\sigma_{r+1}^2,\ldots,\sigma_n^2)\,B_1\,\mathrm{diag}\Big(\frac{1}{\tilde\sigma_1},\ldots,\frac{1}{\tilde\sigma_r}\Big).$$
Taking norms on both sides, we obtain
$$|||H_1||| \ge \tilde\sigma_r|||B_1||| - \frac{\sigma_{r+1}^2}{\tilde\sigma_r}|||B_1||| = \frac{\tilde\sigma_r^2-\sigma_{r+1}^2}{\tilde\sigma_r}|||B_1|||,$$
which further gives
$$|||B_1||| \le \frac{\tilde\sigma_r}{\tilde\sigma_r^2-\sigma_{r+1}^2}|||H_1|||. \qquad \Box$$

Remark 2.6.1. When $\tilde\sigma_r > \sigma_{r+1}$ and $\sigma_r > \tilde\sigma_{r+1}$, the bounds in Lemma 2.5.4 are tight. That is, in this case there exist $H_i$, $1\le i\le 4$, such that the equalities in (2.19) and (2.20) hold. Specifically, let $H_1\in\mathbb R^{(n-r)\times r}$ and $H_2\in\mathbb R^{(m-r)\times r}$ be the matrices whose entries are all zero except for the $(1,r)$ entry, which equals $\epsilon$, and let $H_3\in\mathbb R^{r\times(m-r)}$ and $H_4\in\mathbb R^{r\times(n-r)}$ be the matrices whose entries are all zero except for the $(r,1)$ entry, which equals $\epsilon$; then one can directly check that the equalities in (2.19) and (2.20) hold.

2.6.2 Proof of Theorem 2.2.6

To prove Theorem 2.2.6, we need to decompose $\tilde U_1 - U_1Q$ into a sum of several components and bound them separately. For convenience, we put the decomposition in the following proposition, which is similar in nature to Theorem 3.1 in Cape et al. (2019b).

Proposition 2.6.2. Set the rotation $Q$ to be $Q = Q_1Q_2^T$, where $Q_1$ and $Q_2$ are the left and right singular vectors from the SVD $U_1^T\tilde U_1 = Q_1SQ_2^T$; then
$$\tilde U_1 - U_1Q = U_2U_2^T\Delta AV_1V_1^T\tilde V_1\tilde\Sigma_1^{-1} + U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1} + U_2\Sigma_2V_2^T\tilde V_1\tilde\Sigma_1^{-1} + U_1Q_1(S-I)Q_2^T, \qquad (2.43)$$
and
$$\|S-I\| \le \|\sin\Theta(U_1,\tilde U_1)\|^2. \qquad (2.44)$$
Proof: By direct calculation, we have
$$\tilde U_1 - U_1Q = \tilde U_1 - U_1Q_1Q_2^T = \tilde U_1 - U_1Q_1SQ_2^T + U_1Q_1(S-I)Q_2^T = \tilde U_1 - U_1U_1^T\tilde U_1 + U_1Q_1(S-I)Q_2^T = U_2U_2^T\tilde U_1 + U_1Q_1(S-I)Q_2^T$$
$$= U_2U_2^T\Delta A\tilde V_1\tilde\Sigma_1^{-1} + U_2\Sigma_2V_2^T\tilde V_1\tilde\Sigma_1^{-1} + U_1Q_1(S-I)Q_2^T \qquad (2.45)$$
$$= U_2U_2^T\Delta AV_1V_1^T\tilde V_1\tilde\Sigma_1^{-1} + U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1} + U_2\Sigma_2V_2^T\tilde V_1\tilde\Sigma_1^{-1} + U_1Q_1(S-I)Q_2^T.$$
In addition, since $\|S\| = \|U_1^T\tilde U_1\| \le 1$,
$$\|S-I\| = 1 - \min_i S_i \le 1 - \min_i S_i^2 = \|\sin\Theta(U_1,\tilde U_1)\|^2,$$
where $S_i$ is the $i$th diagonal entry of $S$. Hence
$$\|U_1Q_1(S-I)Q_2^T\|_{2,\infty} \le \|U_1\|_{2,\infty}\|S-I\| \le \|U_1\|_{2,\infty}\|\sin\Theta(U_1,\tilde U_1)\|^2. \qquad \Box$$

The first and last terms in the expansion (2.43) are easy to bound; the following lemma is devoted to bounding the middle terms, which requires invoking the angular perturbation formula of Theorem 2.5.7.

Lemma 2.6.3. Under the assumptions of Theorem 2.2.6, it holds that
$$\max\{\|U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty},\ \|U_2\Sigma_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty}\} \le C\frac{\sigma R(r,n)}{\sigma_r(A)-\sigma_{r+1}(A)},$$
where $C$ is some constant and
$$R(r,n) = \begin{cases}\sqrt r+\sqrt{\log n}, & \text{if } A \text{ is of rank } r;\\ r+\sqrt{r\log n}, & \text{else}.\end{cases}$$
Before proving this lemma, let us first see how to use it to prove Theorem 2.2.6.
Proof of Theorem 2.2.6: Due to (2.43), we have
$$\min_{Q\in\mathbb O_r}\|\tilde U_1 - U_1Q\|_{2,\infty} \le \underbrace{\|U_2U_2^T\Delta AV_1V_1^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty}}_{\text{(I)}} + \|U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty} + \|U_2\Sigma_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty} + \underbrace{\|U_1\|_{2,\infty}\|\sin\Theta(U_1,\tilde U_1)\|^2}_{\text{(II)}}.$$
The two middle terms are bounded in Lemma 2.6.3. We are left to bound the first and last terms. For the last term, we have
$$\text{(II)} \le \Big(\frac{2\|\Delta A\|}{\sigma_r(A)-\sigma_{r+1}(A)}\Big)^2\|U_1\|_{2,\infty} \le \frac{36\sigma^2\bar n}{(\sigma_r(A)-\sigma_{r+1}(A))^2}\|U_1\|_{2,\infty}.$$
Here the first inequality used (2.50) in Lemma 2.6.7; the second used Corollary 7.3.3 of Vershynin (2018), which bounds the spectral norm of i.i.d. Gaussian matrices: with probability at least $1-e^{-c\bar n}$ for some absolute constant $c$, $\|\Delta A\| \le 3\sigma\sqrt{\bar n}$.
Next we bound (I):
$$\text{(I)} = \|U_2U_2^T\Delta AV_1V_1^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty} \le \|U_2U_2^T\Delta AV_1\|_{2,\infty}\|V_1^T\tilde V_1\tilde\Sigma_1^{-1}\| \le \frac{1}{\sigma_r(\tilde A)}\|U_2U_2^T\Delta AV_1\|_{2,\infty} \le \frac{7}{6\sigma_r(A)}\|U_2U_2^T\Delta AV_1\|_{2,\infty}. \qquad (2.46)$$
Here the last inequality is by Weyl's bound and the assumption $\sigma_r(A) > 21\sigma\sqrt{\bar n}$. (2.46) implies that it suffices to bound the row norms of $U_2U_2^T\Delta AV_1$. Since $\Delta A$ is i.i.d. $N(0,\sigma^2)$, $U_2$ and $V_1$ are independent of $\Delta A$, and $\|U_2\| = \|V_1\| = 1$, each row of $U_2U_2^T\Delta AV_1$ is a Gaussian vector having independent Gaussian entries with mean 0 and variance at most $\sigma^2$. By exactly the same proof as Theorem 3.1.1 in Vershynin (2018), there exists a constant $c$ such that for all $t>0$,
$$\mathbb P\big(\|u_i^TU_2^T\Delta AV_1\| - \sigma\|u_i^TU_2^T\|\sqrt r \ge t\big) < 2e^{-\frac{ct^2}{\sigma^2\|u_i^TU_2^T\|^2}},$$
where $u_i^T$ is the $i$th row of $U_2$. Setting $t = \sigma\|u_i^TU_2^T\|\sqrt{3\log n/c}$ in the above, with probability at least $1-\frac{2}{n^3}$,
$$\|u_i^TU_2^T\Delta AV_1\| \le c_1\sigma(\sqrt r+\sqrt{\log n})\|u_i^TU_2^T\| \le c_1\sigma(\sqrt r+\sqrt{\log n}),$$
with some constant $c_1$. By the union bound, the probability of failure over all the rows is at most $\frac{2}{n^2}$. Hence with probability at least $1-\frac{2}{n^2}$, it holds that
$$\|U_2U_2^T\Delta AV_1\|_{2,\infty} \le c_1\sigma(\sqrt r+\sqrt{\log n}).$$
Plugging this into (2.46), we obtain
$$\text{(I)} \le \frac{c_1\sigma(\sqrt r+\sqrt{\log n})}{\sigma_r(A)}.$$
Combining the bounds on (I), (II) and Lemma 2.6.3 completes the proof. □

2.6.3 Proof of Lemma 2.6.3

Here we first provide the proof for the low-rank case to give the reader some intuition. The full-rank case follows a similar idea but is notationally heavy; we defer the proof of Lemma 2.6.3 for the full-rank case to the appendix.

Proof of Lemma 2.6.3 — the low-rank case: When $A$ is of rank $r$, the second quantity to be bounded in Lemma 2.6.3 is 0, so we focus on the first quantity $\|U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty}$. Let $u_i^T$ be the $i$th row of $U_2$; then by Corollary 2.5.9, the $i$th row of $U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}$ can be expressed as
โˆ‘๏ธ = ๐‘ข๐‘‡๐‘– ๐›ผ22 (๐›ผ๐‘‡22 ๐›ผ22 ) ๐‘˜ (๐›ผ๐‘‡12๐‘ˆ1๐‘‡ ๐‘ˆ e1e ฮฃ1โˆ’1 + ๐›ผ๐‘‡22 ๐›ผ21๐‘‰1๐‘‡ ๐‘‰e1eฮฃ1โˆ’2 )(eฮฃ1โˆ’2 ) ๐‘˜ e ฮฃ1โˆ’1 (2.47) ๐‘˜=0 โˆž ! โˆ‘๏ธ = ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ (๐›ผ22 ๐›ผ๐‘‡12๐‘ˆ1๐‘‡ ๐‘ˆ e1e ฮฃ1โˆ’1 + ๐›ผ22 ๐›ผ๐‘‡22 ๐›ผ21๐‘‰1๐‘‡ ๐‘‰ e1eฮฃ1โˆ’2 )(e ฮฃ1โˆ’2 ) ๐‘˜ e ฮฃ1โˆ’1 , ๐‘˜=0 where ๐›ผ๐‘– ๐‘— = ๐‘ˆ๐‘–๐‘‡ ฮ”๐ด๐‘‰ ๐‘— . Due to the orthogonality of ๐‘ˆ and ๐‘‰, the entries in each ๐›ผ๐‘– ๐‘— follow i.i.d. ๐‘ (0, ๐œŽ 2 ) distribution, and ๐›ผ22 is independent of ๐›ผ12 . This further implies that ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 and ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 ๐›ผ๐‘‡22 are independent of ๐›ผ12 and ๐›ผ21 , respectively. Conditional on ๐›ผ22 , ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 ๐›ผ๐‘‡12 varies with ๐›ผ12 , and it follows normal distribution. Again by Theorem 3.1.1 in Vershynin (2018), for fixed ๐‘˜ = 0, ..., there exists a constant ๐‘ such that ! โˆš ๐‘๐‘ก 2 P( โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 ๐›ผ๐‘‡12 โˆฅ โˆ’ ๐œŽ ๐‘Ÿ โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 โˆฅ > ๐‘ก) โ‰ค 2 exp โˆ’ . ๐œŽ 2 โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 โˆฅ 2 โˆš๏ธ Setting in the above ๐‘ก = ๐œŽโˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 โˆฅ log(๐‘›3 ยท 2 ๐‘˜ )/๐‘, we get with probability at least 1 โˆ’ ๐‘˜2 3 , 2 ๐‘› โˆš โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 ๐›ผ๐‘‡12 โˆฅ โ‰ค ๐œŽ ๐‘Ÿ โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 โˆฅ + ๐‘ก โˆš โˆš๏ธ โˆš โ‰ค ๐‘ 2 ๐œŽโˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 โˆฅ( ๐‘Ÿ + log ๐‘› + ๐‘˜), (2.48) where ๐‘ 2 is some absolute constant. Then ๐œŽ โˆฅ๐›ผ22 โˆฅ 2๐‘˜+1 โˆš โˆš๏ธ โˆš   โˆ’(2๐‘˜+2) โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 ๐›ผ๐‘‡12๐‘ˆ1๐‘‡ ๐‘ˆ e1eฮฃ1 โˆฅ โ‰ค ๐‘2 ( ๐‘Ÿ + log ๐‘› + ๐‘˜). ๐œŽ๐‘Ÿ e e๐œŽ๐‘Ÿ โˆฅ๐›ผ22 โˆฅ โˆš Let ๐œ† = e ๐œŽ๐‘Ÿ . We next argue that ๐œ† < 1/2. By Corollary 7.3.3 of Vershynin (2018), โˆฅฮ”๐ดโˆฅ โ‰ค 3๐œŽ ๐‘›ยฏ with probability at least 1 โˆ’ ๐‘’ โˆ’๐‘๐‘›ยฏ . On this event, by Weylโ€™s bound, โˆš โˆš ๐œŽ๐‘Ÿ โ‰ฅ ๐œŽ๐‘Ÿ โˆ’ โˆฅฮ”๐ดโˆฅ โ‰ฅ ๐œŽ๐‘Ÿ โˆ’ 3๐œŽ ๐‘›ยฏ โ‰ฅ 18๐œŽ ๐‘›ยฏ โ‰ฅ 6โˆฅฮ”๐ดโˆฅ โ‰ฅ 6โˆฅ๐›ผ22 โˆฅ, e 44 โˆš which implies ๐œ† < 1/2. The third inequality above is due to the assumption 21๐œŽ ๐‘›ยฏ < ๐œŽ๐‘Ÿ . By union bound on the probability of failure of (2.48) over all ๐‘˜ = 0, ..., we have with probability at least 1 โˆ’ 43 , ๐‘› โˆž โˆž โˆž ! โˆ‘๏ธ โˆ’(2๐‘˜+2) ๐œŽ โˆ‘๏ธ โˆš 2๐‘˜+1 โˆš๏ธ โˆš โˆ‘๏ธ โˆฅ๐‘ข๐‘‡๐‘– (๐›ผ22 ๐›ผ๐‘‡22 ) ๐‘˜ ๐›ผ22 ๐›ผ๐‘‡12๐‘ˆ1๐‘‡ ๐‘ˆ e1e ฮฃ1 โˆฅ2 โ‰ค ๐‘3 ๐‘˜๐œ† + ( log ๐‘› + ๐‘Ÿ) ๐œ†2๐‘˜+1 ๐œŽ๐‘Ÿ e ๐‘˜=0 ๐‘˜=0 ๐‘˜=0 ๐œŽ โˆš โˆš๏ธ โ‰ค ๐‘ 4 ( ๐‘Ÿ + log ๐‘›) ๐œŽ๐‘Ÿ e โˆš โˆš๏ธ ๐‘Ÿ + log ๐‘› โ‰ค ๐‘5 ๐œŽ , ๐œŽ๐‘Ÿ ( ๐ด) with ๐‘ 3 โˆ’ ๐‘ 5 being absolute constants, where the last inequality used Weylโ€™s bound and the as- โˆš sumption ๐œŽ๐‘Ÿ ( ๐ด) > 21๐œŽ ๐‘›, ยฏ and the second inequality used the fact that for any 0 < ๐œ† < 1/2, we have โˆž โˆš โˆž โˆš โˆž   โˆ‘๏ธ โˆ‘๏ธ โˆ‘๏ธ ๐‘‘ 1 2๐œ† ๐‘˜๐œ†2๐‘˜+1 โ‰ค ๐‘˜๐œ† ๐‘˜ โ‰ค (๐‘˜ + 1)๐œ† ๐‘˜ = โˆ’1 โ‰ค < 4. 
$$\sum_{k=0}^{\infty}\sqrt k\,\lambda^{2k+1} \le \sum_{k=0}^{\infty}\sqrt k\,\lambda^k \le \sum_{k=1}^{\infty}(k+1)\lambda^k = \frac{d}{d\lambda}\Big(\frac{1}{1-\lambda}\Big) - 1 \le \frac{2\lambda}{(1-\lambda)^2} < 4. \qquad (2.49)$$
Following the same reasoning, with probability at least $1-\frac{4}{n^3}$,
$$\sum_{k=0}^{\infty}\|u_i^T(\alpha_{22}\alpha_{22}^T)^{k+1}\alpha_{21}V_1^T\tilde V_1\tilde\Sigma_1^{-(2k+3)}\|_2 \le c_6\sigma\frac{\sqrt r+\sqrt{\log n}}{\sigma_r(A)},$$
for some constant $c_6$. Using these in (2.47), by the union bound, we obtain that with probability at least $1-\frac{8}{n^2}$,
$$\|U_2U_2^T\Delta AV_2V_2^T\tilde V_1\tilde\Sigma_1^{-1}\|_{2,\infty} \le c_7\sigma\frac{\sqrt r+\sqrt{\log n}}{\sigma_r(A)},$$
where $c_7$ is some constant. □

2.6.4 Proof of Theorem 2.4.1

Although Theorem 2.4.1 is motivated by, and could be proved with, Theorem 2.3.1, we provide an alternative proof that is more straightforward. For this purpose, we will need the following lemmas.

Lemma 2.6.4. Under the same assumptions as Theorem 2.4.1, we have
$$A_r - \tilde A_r = U\begin{pmatrix}-U_1^T\Delta A\tilde V_1 & -U_1^T\Delta A\tilde V_2\\ -U_2^TAV_2V_2^T\tilde V_1 & 0\end{pmatrix}\tilde V^T + U\begin{pmatrix}0 & U_1^T\tilde U_2\tilde U_2^T\tilde A\tilde V_2\\ -U_2^T\Delta A\tilde V_1 & 0\end{pmatrix}\tilde V^T.$$
Proof: The lemma can be verified straightforwardly using the relations $A_r = U_1\Sigma_1V_1^T$ and $\tilde A_r = \tilde U_1\tilde\Sigma_1\tilde V_1^T$. □

Lemma 2.6.5 (Lemma 2 in Luo et al. (2021)). Suppose $x_1\ge x_2\ge\cdots\ge x_k\ge 0$ and $y_1\ge y_2\ge\cdots\ge y_k\ge 0$, and for any $1\le j\le k$, $\sum_{i=1}^jx_i \le \sum_{i=1}^jy_i$. Then for any $p\ge 1$,
$$\sum_{i=1}^kx_i^p \le \sum_{i=1}^ky_i^p.$$
The equality holds if and only if $(x_1,x_2,\ldots,x_k) = (y_1,y_2,\ldots,y_k)$.

Lemma 2.6.6 (Theorem 1 in Thompson (1975)). Assume $A$, $B$, $C = A + B$ are (not necessarily square) matrices of the same size, with singular values $\alpha_1\ge\alpha_2\ge\cdots$, $\beta_1\ge\beta_2\ge\cdots$, $\gamma_1\ge\gamma_2\ge\cdots$, respectively. Let $i_1<i_2<\cdots<i_m$ and $j_1<j_2<\cdots<j_m$ be positive integers, and set $k_t = i_t+j_t-t$, $t = 1,2,\ldots,m$. Then the singular values of $A$, $B$, $C$ satisfy
$$\sum_{t=1}^m\gamma_{k_t} \le \sum_{t=1}^m\alpha_{i_t} + \sum_{t=1}^m\beta_{j_t}.$$

Lemma 2.6.7. We have the following uniform error bound on the $\sin\Theta$ distance:
$$\max\{\|\sin\Theta(U_1,\tilde U_1)\|,\|\sin\Theta(V_1,\tilde V_1)\|\} \le \min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\}. \qquad (2.50)$$
Proof of Lemma 2.6.7: If $\sigma_r = \sigma_{r+1}$, (2.50) holds trivially; here we consider the case $\sigma_r > \sigma_{r+1}$, and consider the two possibilities $\sigma_r-\sigma_{r+1} > 2\|\Delta A\|$ and $\sigma_r-\sigma_{r+1} \le 2\|\Delta A\|$. When $\sigma_r-\sigma_{r+1} > 2\|\Delta A\|$, this and Weyl's bound $|\tilde\sigma_r-\sigma_r| \le \|\Delta A\|$ together give
$$\tilde\sigma_r - \sigma_{r+1} > \sigma_r - \sigma_{r+1} - \|\Delta A\| > \frac12(\sigma_r-\sigma_{r+1}) > 0,$$
which ensures that the assumption of Theorem 2.5.5 holds, and then (2.23) in Theorem 2.5.5 implies
$$\|\sin\Theta(U_1,\tilde U_1)\| = \|U_2^T\tilde U_1\| \le \frac{\|\Delta A\|}{\sigma_r-\|\Delta A\|-\sigma_{r+1}} \le \frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}}.$$
When $\sigma_r-\sigma_{r+1} \le 2\|\Delta A\|$, we directly have
$$\|U_2^T\tilde U_1\| \le 1 \le \frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}}.$$
Putting the two cases together, we have
$$\|\sin\Theta(U_1,\tilde U_1)\| \le \min\Big\{\frac{2\|\Delta A\|}{\sigma_r-\sigma_{r+1}},1\Big\}.$$
โˆฅ sin ฮ˜(๐‘ˆ1 , ๐‘ˆ ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 Following the same reasoning, we also have โˆฅ sin ฮ˜(๐‘‰1 , ๐‘‰ e1 )โˆฅ โ‰ค min{ 2โˆฅฮ”๐ดโˆฅ , 1}, thus (2.50) holds. ๐œŽ๐‘Ÿ โˆ’๐œŽ ๐‘Ÿ+1 โ–ก Proof of Theorem 2.4.1: By Lemma 2.6.4, ยฉ โˆ’๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰ e1 โˆ’๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰ e2 ยช ยฉ 0 ๐‘ˆ1๐‘‡ ๐‘ˆ e๐‘‡ ๐ด e2๐‘ˆ 2 e๐‘‰e2 ยช โˆฅ ๐ด๐‘Ÿ โˆ’ ๐ด e๐‘Ÿ โˆฅ โ‰ค ยญ ยญ ๐‘‡ ยฎ + ยญ ยฎ ยญ ๐‘‡ ยฎ ยฎ ๐‘‡ โˆ’๐‘ˆ2 ๐ด๐‘‰2๐‘‰2 ๐‘‰1 e 0 โˆ’๐‘ˆ2 ฮ”๐ด๐‘‰1 e 0 ยซ ยฌ ยซ ยฌ โˆš๏ธƒ โ‰ค โˆฅ๐‘ˆ1๐‘‡ ฮ”๐ดโˆฅ 2 + โˆฅ๐‘ˆ2๐‘‡ ๐ด๐‘‰2๐‘‰2๐‘‡ ๐‘‰ e1 โˆฅ 2 + max{โˆฅ๐‘ˆ๐‘‡ ฮ”๐ด๐‘‰ 2 e1 โˆฅ, โˆฅ๐‘ˆ๐‘‡ ๐‘ˆe e๐‘‡ ee 1 2๐‘ˆ2 ๐ด๐‘‰2 โˆฅ} โˆš๏ธƒ = โˆฅ๐‘ˆ1๐‘‡ ฮ”๐ดโˆฅ 2 + ๐œ 2 + max{โˆฅ๐‘ˆ2๐‘‡ ฮ”๐ด๐‘‰ e1 โˆฅ, ๐œˆ}, (2.51) where we have let ๐œˆ = โˆฅ๐‘ˆ1๐‘‡ ๐‘ˆ e2๐‘ˆe๐‘‡ ๐ด 2 e๐‘‰e2 โˆฅ and ๐œ = โˆฅ๐‘ˆ๐‘‡ ๐ด๐‘‰2๐‘‰ ๐‘‡ ๐‘‰ 2 e โˆฅ, we next bound ๐œ and ๐œˆ. 2 1 ๐œˆ = โˆฅ๐‘ˆ1๐‘‡ ๐‘ˆ e2๐‘ˆe๐‘‡ ๐ด ๐‘‡e e 2 ๐‘‰2 โˆฅ = โˆฅ๐‘ˆ1 ๐‘ˆ2 ฮฃ2 โˆฅ โ‰ค e ee ๐œŽ๐‘Ÿ+1 โˆฅ๐‘ˆ1๐‘‡ ๐‘ˆ e2 โˆฅ. (2.52) Due to the Weylโ€™s bound, we also have |e ๐œŽ๐‘Ÿ+1 โˆ’ ๐œŽ๐‘Ÿ+1 | < โˆฅฮ”๐ดโˆฅ. (2.53) By (2.50),     2โˆฅฮ”๐ดโˆฅ 2โˆฅฮ”๐ดโˆฅ ๐œˆ โ‰ค (๐œŽ๐‘Ÿ+1 + โˆฅฮ”๐ดโˆฅ) min , 1 โ‰ค โˆฅฮ”๐ดโˆฅ + ๐œŽ๐‘Ÿ+1 min ,1 . ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 Similarly, we can also derive   2โˆฅฮ”๐ดโˆฅ ๐œ= โˆฅฮฃ2๐‘‰2๐‘‡ ๐‘‰e1 โˆฅ โ‰ค ๐œŽ๐‘Ÿ+1 min ,1 . ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 47 Inserting the upper bounds of ๐œ and ๐œˆ back to (2.51) completes the proof of (2.15). For Frobenius norm: 2 โˆ’๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰ e1 โˆ’๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰ e2 + ๐‘ˆ๐‘‡ ๐‘ˆ e๐‘ˆ 1 2 2 e๐‘‡ ๐ดe๐‘‰ e2 ยช e๐‘Ÿ โˆฅ 2 = ยฉ โˆฅ ๐ด๐‘Ÿ โˆ’ ๐ด ๐น ยญ ยญ ๐‘‡ ยฎ ยฎ โˆ’๐‘ˆ2 ฮ”๐ด๐‘‰ e1 โˆ’ ๐‘ˆ๐‘‡ ๐ด๐‘‰2๐‘‰ ๐‘‡ ๐‘‰ 0 2 1 e ยซ 2 ยฌ ๐น 2 2 โˆ’๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰ ยฉโˆ’๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰ e2 + ๐‘ˆ๐‘‡ ๐‘ˆe๐‘ˆ e๐‘‡ ๐ดe๐‘‰e2 ยช 1 2 2 ยฉ e1 ยช = ยญ ยญ ยฎ ยฎ + ยญ ยญ ยฎ . ยฎ ๐‘‡ โˆ’๐‘ˆ2 ฮ”๐ด๐‘‰ ๐‘‡ e1 โˆ’ ๐‘ˆ ๐ด๐‘‰2๐‘‰ ๐‘‰ ๐‘‡ e1 0 ยซ 2 2 ยฌ ๐น ยซ ยฌ ๐น | {z } | {z } B๐‘…1 B๐‘…2 Next we bound ๐‘…1 and ๐‘…2 separately. First consider ๐‘…2 , let ๐‘€ฮ›๐‘Š ๐‘‡ be the singular value decom- position of ๐‘ˆ1๐‘‡ ๐‘ˆe2 , where ๐‘€ โˆˆ R๐‘Ÿร—๐‘Ÿ , ฮ› โˆˆ R๐‘Ÿร—๐‘Ÿ , ๐‘Š โˆˆ R (๐‘›โˆ’๐‘Ÿ)ร—๐‘Ÿ . Then ๐‘…2 = โˆฅ โˆ’ ๐‘ˆ1 ฮ”๐ด๐‘‰ e2 + ๐‘ˆ1๐‘ˆ e2eฮฃ2 โˆฅ 2๐น โ‰ค 2โˆฅ๐‘ˆ1๐‘‡ ฮ”๐ด๐‘‰e2 โˆฅ 2 + 2โˆฅ๐‘ˆ๐‘‡ ๐‘ˆ ee 2 ๐น 1 2 ฮฃ2 โˆฅ ๐น โ‰ค 2โˆฅ(ฮ”๐ด)๐‘Ÿ โˆฅ 2๐น + 2โˆฅ๐‘ˆ1๐‘‡ ๐‘ˆ e2๐‘Š๐‘Š ๐‘‡ e ฮฃ2 โˆฅ 2๐น    2 2 2โˆฅฮ”๐ดโˆฅ โ‰ค 2โˆฅ(ฮ”๐ด)๐‘Ÿ โˆฅ ๐น + 2 min ,1 โˆฅ๐‘Š ๐‘‡ eฮฃ2 โˆฅ 2๐น ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1    2 โˆ‘๏ธ๐‘Ÿ 2 2โˆฅฮ”๐ดโˆฅ 2 ( ๐ด). โ‰ค 2โˆฅ(ฮ”๐ด)๐‘Ÿ โˆฅ ๐น + 2 min ,1 ๐œŽ๐‘Ÿ+๐‘˜ e ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 ๐‘˜=1 In the second to last inequality, we used the fact that โˆฅ ๐ด๐ตโˆฅ ๐น โ‰ค โˆฅ ๐ดโˆฅโˆฅ๐ตโˆฅ ๐น and in the last inequality, we used โˆฅ๐‘ƒฮฉ ๐ดโˆฅ ๐น โ‰ค โˆฅ ๐ด๐‘Ÿ โˆฅ ๐น for any ๐‘Ÿ-dimensional subspace ฮฉ. By Lemma 2.6.6, โˆ‘๏ธ๐‘˜ โˆ‘๏ธ๐‘˜ โˆ‘๏ธ๐‘˜ โˆ‘๏ธ๐‘˜ โˆ‘๏ธ๐‘˜ ๐œŽ๐‘Ÿ+๐‘– ( ๐ด) e = ๐œŽ๐‘Ÿ+๐‘– ( ๐ด + ฮ”๐ด) โ‰ค ๐œŽ๐‘– (ฮ”๐ด) + ๐œŽ๐‘Ÿ+๐‘– ( ๐ด) = (๐œŽ๐‘– (ฮ”๐ด) + ๐œŽ๐‘Ÿ+๐‘– ( ๐ด)) , 1 โ‰ค ๐‘˜ โ‰ค ๐‘Ÿ. ๐‘–=1 ๐‘–=1 ๐‘–=1 ๐‘–=1 ๐‘–=1 From Lemma 2.6.5, we have โˆ‘๏ธ๐‘Ÿ ๐‘Ÿ โˆ‘๏ธ 2 ( ๐ด) ๐œŽ๐‘Ÿ+๐‘˜ e โ‰ค (๐œŽ๐‘˜ (ฮ”๐ด) + ๐œŽ๐‘Ÿ+๐‘˜ ( ๐ด)) 2 โ‰ค (โˆฅ(ฮ”๐ด)๐‘Ÿ โˆฅ ๐น + โˆฅ(ฮฃ2 )๐‘Ÿ โˆฅ ๐น ) 2 . 
Hence

$$R_2 \le 2\|(\Delta A)_r\|_F^2 + 2\left(\min\left\{\frac{2\|\Delta A\|}{\sigma_r - \sigma_{r+1}},\ 1\right\}\right)^2 \big(\|(\Delta A)_r\|_F + \|(\Sigma_2)_r\|_F\big)^2. \qquad (2.54)$$

Next we consider $R_1$. Notice that

$$R_1 = \left\| \begin{pmatrix} -U_1^T \Delta A \widetilde V_1 \\ -U_2^T \Delta A \widetilde V_1 \end{pmatrix} + \begin{pmatrix} 0 \\ -U_2^T A V_2 V_2^T \widetilde V_1 \end{pmatrix} \right\|_F^2 \le \big(\|(\Delta A)_r\|_F + \|\Sigma_2 V_2^T \widetilde V_1\|_F\big)^2.$$

Following the same reasoning as in bounding $R_2$, we have

$$R_1 \le \left( \|(\Delta A)_r\|_F + \|(\Sigma_2)_r\|_F \min\left\{\frac{2\|\Delta A\|}{\sigma_r - \sigma_{r+1}},\ 1\right\} \right)^2. \qquad (2.55)$$

Combining (2.54) and (2.55), we obtain

$$\|A_r - \widetilde A_r\|_F^2 \le 2\|(\Delta A)_r\|_F^2 + 3\left( \|(\Delta A)_r\|_F + \|(\Sigma_2)_r\|_F \min\left\{\frac{2\|\Delta A\|}{\sigma_r - \sigma_{r+1}},\ 1\right\} \right)^2. \quad \square$$

2.6.5 Proof of Theorem 2.3.2

Proof of Theorem 2.3.2: Let $\widetilde V_1^T V_1 = Q_1 S Q_2^T$ be the SVD of $\widetilde V_1^T V_1$. Define a special rotation $\hat Q = Q_1 Q_2^T$, and bound $|||U_1\Sigma_1 - \widetilde U_1 \widetilde\Sigma_1 \hat Q|||$, where $|||\cdot|||$ can be either the spectral or the Frobenius norm. This yields an upper bound on $\min_{Q \in \mathbb{O}_r} |||U_1\Sigma_1 - \widetilde U_1 \widetilde\Sigma_1 Q|||$. By a direct calculation,

$$|||U_1\Sigma_1 - \widetilde U_1\widetilde\Sigma_1 \hat Q||| \le |||U_1\Sigma_1 - \widetilde U_1\widetilde\Sigma_1\widetilde V_1^T V_1 + \widetilde U_1\widetilde\Sigma_1\widetilde V_1^T V_1 - \widetilde U_1\widetilde\Sigma_1 \hat Q|||$$
$$\le |||U_1\Sigma_1 - \widetilde U_1\widetilde\Sigma_1\widetilde V_1^T V_1||| + |||\widetilde U_1\widetilde\Sigma_1\widetilde V_1^T V_1 - \widetilde U_1\widetilde\Sigma_1 \hat Q|||$$
$$= |||(U_1\Sigma_1 V_1^T - \widetilde U_1\widetilde\Sigma_1\widetilde V_1^T) V_1||| + |||\widetilde U_1\widetilde\Sigma_1 (\widetilde V_1^T V_1 - \hat Q)|||$$
$$\le |||U_1\Sigma_1 V_1^T - \widetilde U_1\widetilde\Sigma_1\widetilde V_1^T||| + |||\widetilde U_1\widetilde\Sigma_1 (\widetilde V_1^T V_1 - \hat Q)|||$$
$$= |||A_r - \widetilde A_r||| + |||\widetilde U_1\widetilde\Sigma_1 (\widetilde V_1^T V_1 - \hat Q)|||. \qquad (2.56)$$

The first term of (2.56) can be bounded by Theorem 2.4.1. Let us focus on the second term. Let $Z = V_2^T \widetilde V_1$. Observe that $Z^T Z + \widetilde V_1^T V_1 (\widetilde V_1^T V_1)^T = I_r$; this implies $\widetilde V_1^T V_1 = \sqrt{I_r - Z^T Z}\, \hat Q$ (Lemma 2.6.8), where the square root of a positive semi-definite matrix $B$ is defined to be the positive semi-definite matrix $\widetilde B$ such that $\widetilde B \widetilde B = B$. Using this observation on the quantity inside the norm of the second term on the right-hand side of (2.56), we have

$$\widetilde U_1\widetilde\Sigma_1 (\widetilde V_1^T V_1 - \hat Q) = \widetilde U_1\widetilde\Sigma_1 \big(\sqrt{I - Z^T Z} - I\big)\hat Q = -\widetilde U_1\widetilde\Sigma_1 Z^T Z \big(\sqrt{I - Z^T Z} + I\big)^{-1} \hat Q,$$

where the last equality used the fact that $(\sqrt{I - Z^T Z} - I)(\sqrt{I - Z^T Z} + I) = -Z^T Z$. Hence

$$|||\widetilde U_1\widetilde\Sigma_1(\widetilde V_1^T V_1 - \hat Q)||| \le |||\widetilde U_1\widetilde\Sigma_1 Z^T Z||| \cdot \big\|\big(\sqrt{I - Z^T Z} + I\big)^{-1}\hat Q\big\| \le |||\widetilde\Sigma_1 Z^T|||. \qquad (2.57)$$

The last inequality used $\|Z\| \le 1$ and $\|(\sqrt{I - Z^T Z} + I)^{-1}\| \le 1$. Notice that

$$(\widetilde\Sigma_1 Z^T)^T = Z\widetilde\Sigma_1 = V_2^T \widetilde V_1 \widetilde\Sigma_1 = V_2^T \widetilde A^T \widetilde U_1 = V_2^T \Delta A^T \widetilde U_1 + V_2^T A^T \widetilde U_1 = V_2^T \Delta A^T \widetilde U_1 + \Sigma_2^T U_2^T \widetilde U_1.$$

Then for the spectral norm of $\widetilde\Sigma_1 Z^T$, we have

$$\|\widetilde\Sigma_1 Z^T\| \le \|\Delta A\| + \sigma_{r+1}\|U_2^T \widetilde U_1\| \le \|\Delta A\| + \sigma_{r+1}\min\left\{\frac{2\|\Delta A\|}{\sigma_r - \sigma_{r+1}},\ 1\right\}.$$
๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 For the Frobenius norm, we have   2โˆฅฮ”๐ดโˆฅ โˆฅe ฮฃ1 ๐‘๐‘‡ โˆฅ ๐น โ‰ค โˆฅ๐‘‰2๐‘‡ ฮ”๐ด๐‘‡ ๐‘ˆe1 โˆฅ ๐น + โˆฅฮฃ2๐‘ˆ๐‘‡ ๐‘ˆ 2 1โˆฅ๐น e โ‰ค โˆฅ(ฮ”๐ด)๐‘Ÿ โˆฅ ๐น + โˆฅ(ฮฃ2 )๐‘Ÿ โˆฅ ๐น min ,1 . ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 Combining this with (2.56) and (2.57) completes the proof. โ–ก Lemma 2.6.8. Let ๐ด โˆˆ R๐‘Ÿร—๐‘Ÿ be a semi-definite matrix with eigenvalues no greater than 1, ๐ต โˆˆ R๐‘Ÿร—๐‘Ÿ has SVD ๐ต = ๐‘ˆ ๐ต ๐‘† ๐ต๐‘‰๐ต๐‘‡ . In addition, ๐ด and ๐ต satisfy ๐ด + ๐ต๐ต๐‘‡ = ๐ผ, then โˆš ๐ต = ๐ผ โˆ’ ๐ด๐‘ˆ ๐ต๐‘‰๐ต๐‘‡ . โˆš โˆš Proof: Since ๐ต๐ต๐‘‡ = ๐‘ˆ ๐ต ๐‘† 2๐ต๐‘ˆ๐‘‡๐ต , then ๐ต๐ต๐‘‡ = ๐‘ˆ ๐ต ๐‘† ๐ต๐‘ˆ๐‘‡๐ต , and therefore ๐ต = ๐ต๐ต๐‘‡ ๐‘ˆ ๐ต๐‘‰๐ต๐‘‡ . By as- sumption, ๐ต๐ต๐‘‡ = ๐ผ โˆ’ ๐ด, then the result of the lemma follows. โ–ก 50 CHAPTER 3 MANIFOLD DENOISING BY NONLINEAR ROBUST PCA 3.1 Introduction 3.1.1 Overview of Chapter 3 This chapter considers the problem of manifold denoising. In the study of statistics and ma- chine learning, there is a common manifold assumption that underlies many popular dimensional reduction algorithms including Isomap (Isometric Feature Mapping), LLE (Local Linear Embed- ding), and PCA (Principal Component Analysis), which states that real-world high-dimensional data lies near a low-dimensional manifold embedded in the high-dimensional space. Therefore, if we consider a local neighborhood of the data points, this local submatrix is approximately low-rank. Based on this observation, in this chapter, we extend Robust Principal Component Analysis (RPCA) to the manifold setting. Suppose that the observed data matrix is composed of a sparse component and a component drawn from some low-dimensional manifold. We propose and analyze an optimization framework that separates the sparse component from the manifold under noisy data. Theoretical error bounds are provided when the tangent spaces of the manifold satisfy certain incoherence conditions. We also provide a near-optimal choice of the tuning parameters for the proposed optimization formulation with the help of a new curvature estimation method. The efficacy of our method is demonstrated on both synthetic and real datasets. Results of this chapter has given rise to the conference proceeding Lyu et al. (2019). 3.1.2 Manifold denoising Manifold learning is nowadays widely used in computer vision, image processing, and biological data analysis on tasks such as classification, anomaly detection, data interpolation, and denoising. Many machine learning and statistical methods are based on the assumption that real-world high- dimensional data actually lies on low-dimensional manifolds embedded in the high-dimensional space, examples can be found in dimensional reduction Roweis and Saul (2000); Balasubramanian and Schwartz (2002); He et al. (2023) and clustering. However, in practice the data we observe almost never lies perfectly on a manifold, usually they contain noise coming from various sources, 51 which can greatly jeopardize the quality of methods that are sensitive to noise, such as graph-based methods. In recent years, several methods have been proposed to tackle this problem. One approach involves detecting and eliminating outliers from the dataset Du et al. (2013); Sathe and Aggarwal (2016). In particular, the authors of Du et al. (2013) proposed to measure the likelihood of each sample being an outlier using a reliability score based on contextual distance. 
In order to utilize both local and global manifold structures, an iterative algorithm was designed to update the reliability score matrix. In Sathe and Aggarwal (2016), the authors proposed an iterative algorithm called LODES, a spectral embedding approach based on local density. When constructing the weight matrix, LODES took into consideration the difference in local density for each pair of data points, and the output of the algorithm is an outlier score for each data point. It is worth noting that these methods focus on detecting and removing outliers, and may suffer from loss of information.

Along another research line, graph-based denoising methods have been proposed in the literature Hein and Maier (2006); Deutsch et al. (2016). In Hein and Maier (2006), a weighted kNN graph was created for the noisy dataset based on the Gaussian kernel, and the noise was modeled as a diffusion process governed by the graph Laplacian of the weighted kNN graph. The denoising algorithm was designed by reversing the diffusion process and was suitable for denoising datasets with Gaussian, or more generally isotropic, high-dimensional noise. Deutsch et al. (2016) also started with a weighted graph based on the Gaussian kernel; a spectral graph wavelet transform was then performed in each feature dimension, and a non-iterative denoising method was realized by removing all the scaling and wavelet coefficients above some threshold. This method was motivated by the observation that the energy of the smooth manifold coordinate signals concentrates on low-frequency spectral wavelets, while the energy of the noise is spread equally across all frequencies.

In this chapter, we consider the manifold denoising problem in the presence of both sparse and Gaussian noise. Specifically, we are concerned with the mixed noise model

$$\tilde X_i = X_i + S_i + E_i, \qquad i = 1, \dots, n, \qquad (3.1)$$

where $X_i \in \mathbb{R}^p$ is the noiseless data independently drawn from some manifold $\mathcal M$ with an intrinsic dimension $d \ll p$, $E_i$ is i.i.d. Gaussian noise with small magnitudes, and $S_i$ is sparse noise with possibly large magnitudes. If $S_i$ has a large entry, then the corresponding $\tilde X_i$ is usually considered an outlier. A desirable denoising algorithm can simultaneously recover $X_i$ and $S_i$ from $\tilde X_i$, $i = 1, \dots, n$. There are several benefits in recovering the noise term $S_i$ along with the signal $X_i$. First, the support of $S_i$ indicates the locations of the anomaly, which is informative in many applications. For example, if $X_i$ is the gene expression data from the $i$-th patient, the nonzero elements in $S_i$ indicate the differentially expressed genes that are candidates for personalized medicine. Similarly, if $S_i$ is the result of malfunctioning hardware, its nonzero elements indicate the locations of the malfunctioning parts. Secondly, the recovery of $S_i$ allows the "outliers" to be pulled back to the data manifold instead of simply being discarded. This prevents waste of information and is especially beneficial in cases where data is insufficient. Thirdly, in some applications the sparse $S_i$ is part of the clean data rather than a noise term; the algorithm then provides a natural decomposition of the data into a sparse and a non-sparse component that may carry different pieces of information.
3.1.3 Robust PCA

The method we are about to propose in Section 3.2 is closely related to Robust Principal Component Analysis (RPCA) Candes et al. (2011) and differential geometry. In Section 3.1.3 and Section 3.1.4, we provide some background on RPCA and geometry, respectively. Readers already familiar with these areas can safely jump to Section 3.2 for the proposed method.

Assume a data matrix has a low-rank component and a sparse component. The authors in Candes et al. (2011) proved that, under the assumptions that the singular vectors of the low-rank matrix are reasonably spread and the support set of the noise matrix is uniformly distributed, with high probability the Principal Component Pursuit (PCP) method can exactly recover both the low-rank component and the sparse component. Formally, assume a large data matrix $M \in \mathbb{R}^{p\times n}$ satisfies

$$M = L_0 + S_0, \qquad (3.2)$$

where $L_0$ admits the singular value decomposition $L_0 = U\Sigma V^* = \sum_{i=1}^{r} \sigma_i u_i v_i^*$, with $r$ being its rank and $\sigma_1, \sigma_2, \dots, \sigma_r$ being its positive singular values. Here $U = [u_1, u_2, \dots, u_r]$ and $V = [v_1, v_2, \dots, v_r]$ are singular vector matrices satisfying the following incoherence conditions with some constant $\mu$:

$$\|U\|_{2,\infty}^2 \le \frac{\mu r}{p}, \qquad \|V\|_{2,\infty}^2 \le \frac{\mu r}{n}, \qquad \|UV^*\|_\infty \le \sqrt{\frac{\mu r}{np}}. \qquad (3.3)$$

Here $\|\cdot\|_{2,\infty}$ is defined as in Definition 1.3.1, and $\|M\|_\infty = \max_{i,j}|M_{ij}|$ is the maximum of the absolute values of the entries of $M$. PCP estimates the low-rank component and the sparse component by solving the optimization problem

$$\min \|L\|_* + \lambda\|S\|_1 \quad \text{subject to} \quad L + S = M.$$

Denote $n_{(1)} = \max\{n, p\}$ and $n_{(2)} = \min\{n, p\}$, and choose $\lambda = 1/\sqrt{n_{(1)}}$; the following theorem has been proved in Candes et al. (2011).

Theorem 3.1.1 (Theorem 1.1 in Candes et al. (2011)). Suppose $L_0$ satisfies the incoherence conditions (3.3), and the support set of $S_0$ is uniformly distributed among all sets of cardinality $m$. Then there exists a constant $c$ such that with probability at least $1 - cn_{(1)}^{-10}$, PCP exactly recovers $L_0$ and $S_0$, provided $\mathrm{rank}(L_0) \le \rho_r n_{(2)}\mu^{-1}(\log n_{(1)})^{-2}$ and $m \le \rho_s np$, where $\rho_r$ and $\rho_s$ are positive constants.

Robust PCA provides an efficient method of handling noisy data with outliers; it has received considerable attention and has demonstrated its success in separating data from sparse noise in many applications. However, its assumption that the data lies in a low-dimensional subspace is somewhat strict. In Section 3.2, we generalize the Robust PCA idea to the non-linear manifold setting. The intuition behind our method is that the small intrinsic dimension assumption ensures the local data matrix is approximately low rank.
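To make the PCP formulation concrete, here is a minimal NumPy sketch solving it with a standard inexact augmented Lagrangian iteration (singular-value soft-thresholding on L, entry-wise soft-thresholding on S). It is an illustrative implementation under common default choices ($\lambda = n_{(1)}^{-1/2}$ and a usual step-size heuristic), not the code used in this dissertation.

```python
import numpy as np

def pcp(M, max_iter=500, tol=1e-7):
    """Principal Component Pursuit: min ||L||_* + lam ||S||_1 s.t. L + S = M,
    via an inexact augmented Lagrangian iteration."""
    p, n = M.shape
    lam = 1.0 / np.sqrt(max(p, n))                  # lam = n_(1)^{-1/2}
    mu = 0.25 * p * n / (np.abs(M).sum() + 1e-12)   # common step-size heuristic
    Y = np.zeros_like(M)                            # dual variable
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # L-step: singular value soft-thresholding at level 1/mu
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # S-step: entry-wise soft-thresholding at level lam/mu
        T = M - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        Y += mu * (M - L - S)                       # dual ascent on L + S = M
        if np.linalg.norm(M - L - S) <= tol * np.linalg.norm(M):
            break
    return L, S

# Sanity check: recovery of a rank-2 matrix corrupted by sparse spikes.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 80))
S0 = np.zeros((100, 80))
S0.flat[rng.choice(8000, 200, replace=False)] = 10 * rng.standard_normal(200)
L, S = pcp(L0 + S0)
print(np.linalg.norm(L - L0) / np.linalg.norm(L0))  # should be near 0
```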
3.1.4 Geometric background

A fundamental assumption in this chapter is that the clean data lies on a low-dimensional manifold embedded in a high-dimensional space. Before presenting the proposed method, let us warm up with some geometric background. We first present a definition of a $d$-dimensional manifold Lee (2012).

Definition 3.1.1 (Manifolds). We say that a topological space $\mathcal M$ is a $d$-dimensional manifold if: 1. $\mathcal M$ is a Hausdorff space; 2. there is a countable basis for the topology of $\mathcal M$; 3. $\mathcal M$ is locally Euclidean of dimension $d$, meaning that each point of $\mathcal M$ has a neighborhood that is homeomorphic to an open subset of $\mathbb{R}^d$. More specifically, for any $p \in \mathcal M$, we can find an open set $U \subset \mathcal M$, an open set $\bar U \subset \mathbb{R}^d$, and a homeomorphism $\varphi: U \to \bar U$; such a pair $(U, \varphi)$ is called a chart.

Intuitively, a $d$-dimensional manifold locally resembles the Euclidean space $\mathbb{R}^d$. In order to define derivatives and distances on manifolds, we further need a smooth structure on the manifold $\mathcal M$. The definition of smoothness on manifolds is based on the calculus of maps on Euclidean spaces.

Definition 3.1.2 (Smooth compatibility). Let $\mathcal M$ be a $d$-dimensional manifold. If $(U, \varphi)$ and $(V, \psi)$ are two charts with $U \cap V \neq \emptyset$, the transition map from $\varphi$ to $\psi$ is defined as $\psi \circ \varphi^{-1}: \varphi(U \cap V) \to \psi(U \cap V)$. We say two charts $(U, \varphi)$ and $(V, \psi)$ are smoothly compatible if either $U \cap V = \emptyset$ or the transition map $\psi \circ \varphi^{-1}$ is a diffeomorphism.

Definition 3.1.3 (Smooth manifold). A smooth manifold is a pair $(\mathcal M, \mathcal A)$, where $\mathcal M$ is a manifold and $\mathcal A$ is a family of charts whose domains cover $\mathcal M$, and any two charts in $\mathcal A$ are smoothly compatible with each other.

Next, we introduce the concepts of tangent space and curvature on smooth manifolds. For illustration purposes, we focus here on their intuition instead of the formal definitions from Riemannian geometry. For a point $p$ on a smooth manifold $\mathcal M$, we denote the tangent space at $p$ by $T_p$. Intuitively, it is the subspace that is "tangent" to $\mathcal M$ at $p$, representing all possible directions of curves passing through this point. In the example of a 2-dimensional manifold in Figure 3.1, $T_p$ is the tangent space at point $p$.

Intuitively, curvature measures how much a surface is bent. The curvature of a circle in the 2D plane is the reciprocal of its radius. More formally, the principal curvatures at a point of a high-dimensional manifold are defined as the singular values of the second fundamental form Kobayashi and Nomizu (1996).

Figure 3.1 Local manifold geometry.

As estimating all the singular values from noisy data may not be stable, we are only interested in estimating the root mean square curvature, that is, the root mean square of the principal curvatures. For simplicity of illustration, we review the related concepts using a 2D surface $\mathcal M$ embedded in $\mathbb{R}^3$ (Figure 3.1). For any curve $\gamma(s)$ in $\mathcal M$ parametrized by arclength with unit tangent vector $t_\gamma(s)$, its curvature is the norm of the covariant derivative of $t_\gamma$: $\|dt_\gamma(s)/ds\| = \|\gamma''(s)\|$. In particular, we have the following decomposition:

$$\gamma''(s) = k_g(s)\,\hat v(s) + k_n(s)\,\hat n(s),$$

where $\hat n(s)$ is the unit normal direction of the manifold at $\gamma(s)$ and $\hat v$ is the direction perpendicular to $\hat n(s)$ and $t_\gamma(s)$, i.e., $\hat v = \hat n \times t_\gamma(s)$. The coefficient $k_n(s)$ along the normal direction is called the normal curvature, and the coefficient $k_g(s)$ along the perpendicular direction $\hat v$ is called the geodesic curvature. The principal curvatures depend purely on $k_n$. In particular, in 2D the principal curvatures are precisely the maximum and minimum of $k_n$ among all possible directions. A natural way to compute the normal curvature is through geodesic curves.
The geodesic curve between two points is the shortest curve connecting them; therefore, geodesic curves are usually viewed as "straight lines" on the manifold. Geodesic curves have the favorable property that their curvature has zero contribution from $k_g$. That is to say, the second-order derivative of a geodesic curve parameterized by arclength is exactly $k_n$.

3.2 Methodology

Let $\tilde X = [\tilde X_1, \dots, \tilde X_n] \in \mathbb{R}^{p\times n}$ be the noisy data matrix containing $n$ samples. Each sample is a vector in $\mathbb{R}^p$ independently drawn from (3.1). The overall data matrix has the representation $\tilde X = X + S + E$, where $X$ is the clean data matrix, $S$ is the matrix of the sparse noise, and $E$ is the matrix of the Gaussian noise. We further assume that the clean data $X$ lies on some $d$-dimensional manifold $\mathcal M$ with a small dimension $d \ll p$ embedded in $\mathbb{R}^p$, and that the samples are sufficient ($n \ge p$). The small-intrinsic-dimension assumption ensures that the data is locally low-dimensional, so that the corresponding local data matrix is of low rank. This property allows the data to be separated from the sparse noise.

The key idea behind our method is to handle the data locally. We use the $k$ Nearest Neighbors ($k$NN) to construct local data matrices, where $k$ is larger than the intrinsic dimension $d$. For a data point $X_i \in \mathbb{R}^p$, we define the local patch centered at it to be the set consisting of its $k$NN and itself; the local data matrix $X^{(i)}$ associated with this patch is $X^{(i)} = [X_{i_1}, X_{i_2}, \dots, X_{i_k}, X_i]$, where $X_{i_j}$ is the $j$-th nearest neighbor of $X_i$. Let $\mathcal P_i$ be the restriction operator to the $i$-th patch, i.e., $\mathcal P_i(X) = XP_i$, where $P_i$ is the $n\times(k+1)$ matrix that selects the columns of $X$ in the $i$-th patch. Then $X^{(i)} = \mathcal P_i(X)$. Similarly, we define $S^{(i)} = \mathcal P_i(S)$, $E^{(i)} = \mathcal P_i(E)$, and $\tilde X^{(i)} = \mathcal P_i(\tilde X)$.

Since each local data matrix $X^{(i)}$ is nearly of low rank and $S$ is sparse, we can decompose the noisy data matrix into low-rank parts and sparse parts by solving the following optimization problem:

$$\{\hat S, \{\hat S^{(i)}\}_{i=1}^n, \{\hat L^{(i)}\}_{i=1}^n\} = \arg\min_{S,\, S^{(i)},\, L^{(i)}} F(S, \{S^{(i)}\}_{i=1}^n, \{L^{(i)}\}_{i=1}^n)$$
$$\equiv \arg\min_{S,\, S^{(i)},\, L^{(i)}} \sum_{i=1}^n \Big(\lambda_i\|\tilde X^{(i)} - L^{(i)} - S^{(i)}\|_F^2 + \|\mathcal C(L^{(i)})\|_* + \beta\|S^{(i)}\|_1\Big) \quad \text{subject to } S^{(i)} = \mathcal P_i(S), \qquad (3.4)$$

where we take $\beta = \max\{k+1, p\}^{-1/2}$ as in RPCA, $\tilde X^{(i)} = \mathcal P_i(\tilde X)$ is the local data matrix on the $i$-th patch, and $\mathcal C$ is the centering operator that subtracts the column mean: $\mathcal C(Z) = Z\big(I - \frac{1}{k+1}\mathbf 1\mathbf 1^T\big)$, where $\mathbf 1$ is the $(k+1)$-dimensional column vector of all ones. Here we decompose the data on each patch into a low-rank part $L^{(i)}$ and a sparse part $S^{(i)}$ by imposing the nuclear norm and the entry-wise $\ell_1$-norm on $L^{(i)}$ and $S^{(i)}$, respectively.
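For concreteness, the following NumPy sketch shows one way to build the kNN patches and the centering operator used in (3.4); `build_patches` and `centering` are hypothetical helper names, and the brute-force distance computation is only for illustration.

```python
import numpy as np

def build_patches(X, k):
    """For each column X_i of X (p x n), return the indices of its k nearest
    neighbors followed by i itself, so that X[:, idx] equals X^(i) = P_i(X)."""
    sq = (X ** 2).sum(axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)    # squared distances
    np.fill_diagonal(D2, np.inf)             # a point is not its own neighbor
    patches = []
    for i in range(X.shape[1]):
        knn = np.argsort(D2[i])[:k]
        patches.append(np.concatenate([knn, [i]]))      # [X_{i_1},...,X_{i_k}, X_i]
    return patches

def centering(Z):
    """The operator C(Z) = Z (I - 11^T/(k+1)): subtract the column mean."""
    return Z - Z.mean(axis=1, keepdims=True)

# Usage: the local data matrix on the i-th patch.
X = np.random.default_rng(0).standard_normal((20, 500))
patches = build_patches(X, k=15)
X_0 = X[:, patches[0]]                       # a 20 x 16 local matrix X^(0)
```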
There are two key components in this formulation. 1) The local patches are overlapping (for example, the first data point $X_1$ may belong to several patches); thus, the constraint $S^{(i)} = \mathcal P_i(S)$ is particularly important because it ensures that copies of the same point on different patches (and those of the sparse noise on different patches) remain the same. 2) We do not require the $L^{(i)}$ to be restrictions of a universal $L$ to the $i$-th patch, because the $L^{(i)}$'s correspond to the local affine tangent spaces, and there is no reason for a point on the manifold to have the same projection on different tangent spaces. This seemingly subtle difference has a large impact on the final result.

Next, we provide a geometric intuition for the formulation (3.4). Write the clean data matrix $X^{(i)}$ on the $i$-th patch via its Taylor expansion along the manifold,

$$X^{(i)} = X_i\mathbf 1^T + T^{(i)} + R^{(i)}, \qquad (3.5)$$

where the Taylor series is expanded at $X_i$ (the center point of the $i$-th patch), $T^{(i)}$ stores the first-order term and its columns lie in the tangent space of the manifold at $X_i$, and $R^{(i)}$ contains all the higher-order terms. The sum of the first two terms, $X_i\mathbf 1^T + T^{(i)}$, is the linear approximation to $X^{(i)}$, which is unknown if the tangent space is not given. This linear approximation precisely corresponds to the $L^{(i)}$ in (3.4), i.e., $L^{(i)} = X_i\mathbf 1^T + T^{(i)}$. Since the tangent space has the same dimensionality $d$ as the manifold, with randomly chosen points we have, with probability one, $\mathrm{rank}(T^{(i)}) = d$. As a result, $\mathrm{rank}(L^{(i)}) = \mathrm{rank}(X_i\mathbf 1^T + T^{(i)}) \le d + 1$. By the assumption that $d < \min\{p, k\}$, we know that $L^{(i)}$ is indeed low rank. Combining (3.5) with $\tilde X^{(i)} = X^{(i)} + S^{(i)} + E^{(i)}$, we find that the misfit term $\tilde X^{(i)} - L^{(i)} - S^{(i)}$ in (3.4) equals $E^{(i)} + R^{(i)}$. This implies that the misfit contains the higher-order residues (i.e., the linear approximation error) and the Gaussian noise.

We solve (3.4) for the sparse component $\hat S$. If the data contains only sparse noise, i.e., $E = 0$, then $\hat X \equiv \tilde X - \hat S$ is the final estimate of $X$. If $E \neq 0$, we apply Singular Value Hard Thresholding Gavish and Donoho (2014) to truncate $\mathcal C(\tilde X^{(i)} - \mathcal P_i(S))$ and remove the Gaussian noise (see Section 3.5), and use the resulting $\hat L^{(i)}_{\tau^*}$ to construct a final estimate $\hat X$ of $X$ via least-squares fitting:

$$\hat X = \arg\min_{Z\in\mathbb{R}^{p\times n}} \sum_{i=1}^n \lambda_i\|\mathcal P_i(Z) - \hat L^{(i)}_{\tau^*}\|_F^2. \qquad (3.6)$$

The following discussion revolves around (3.4) and (3.6), and the structure of the chapter is as follows. In Section 3.3, we establish theoretical recovery guarantees for (3.4), which justify our choice of $\beta$ and allow us to theoretically choose $\lambda$. The calculation of $\lambda$ uses the curvature of the manifold, so in Section 3.4 we provide a simple method to estimate the average manifold curvature; the method is robust to sparse noise. The optimization algorithms that solve (3.4) and (3.6) are presented in Section 3.5, and the numerical experiments are in Section 3.6.

3.3 Theoretical choice of tuning parameters

To establish the error bound, we need a coherence condition on the tangent spaces of the manifold.
Let ๐‘ˆ โˆˆ R๐‘šร—๐‘Ÿ (๐‘š โ‰ฅ ๐‘Ÿ) be a matrix with ๐‘ˆ โˆ—๐‘ˆ = ๐ผ, the coherence of ๐‘ˆ is defined as ๐‘š ๐œ‡(๐‘ˆ) = max โˆฅ๐‘ˆ โˆ— e ๐‘˜ โˆฅ 22 , ๐‘Ÿ ๐‘˜ โˆˆ{1,...,๐‘š} where e ๐‘˜ is the ๐‘˜th element of the canonical basis. For a subspace ๐‘‡, its coherence is defined as ๐‘š ๐œ‡(๐‘‰) = max โˆฅ๐‘‰ โˆ— e ๐‘˜ โˆฅ 22 , ๐‘Ÿ ๐‘˜ โˆˆ{1,...,๐‘š} where ๐‘‰ is an orthonormal basis of ๐‘‡. The coherence is independent of the choice of basis. The following theorem is proved for local patches constructed using the ๐œ–-neighborhoods. We use ๐‘˜NN in the experiments because ๐‘˜NN is more robust to insufficient samples. The full version of Theorem 3.3.1 can be found in the appendix. Theorem 3.3.1. [succinct version] Let each ๐‘‹๐‘– โˆˆ R ๐‘ , ๐‘– = 1, ..., ๐‘›, be independently drawn from a compact manifold M โІ R ๐‘ with an intrinsic dimension ๐‘‘ and endowed with the uniform distribution. Let ๐‘‹๐‘– ๐‘— , ๐‘— = 1, . . . , ๐‘˜ ๐‘– be the ๐‘˜ ๐‘– points falling in an ๐œ‚-neighborhood of ๐‘‹๐‘– with radius ๐œ‚, where ๐œ‚ > 0 is some fixed small constant. These points form the matrix ๐‘‹ (๐‘–) = [๐‘‹๐‘–1 , . . . , ๐‘‹๐‘– ๐‘˜ , ๐‘‹๐‘– ]. For any ๐‘– ๐‘ž โˆˆ M, let ๐‘‡๐‘ž be the tangent space of M at ๐‘ž and define ๐œ‡ยฏ = sup๐‘žโˆˆM ๐œ‡(๐‘‡๐‘ž ). Suppose the support of the noise matrix ๐‘† (๐‘–) is uniformly distributed among all sets of cardinality ๐‘š๐‘– . Then as long as 59 ๐‘‘ < ๐œŒ๐‘Ÿ min{๐‘˜, ๐‘} ๐œ‡ยฏ โˆ’1 logโˆ’2 max{ ๐‘˜, ยฏ ๐‘}, and ๐‘š๐‘– โ‰ค 0.4๐œŒ ๐‘  ๐‘๐‘˜ (here ๐œŒ๐‘Ÿ and ๐œŒ ๐‘  are positive constants, ๐‘˜ยฏ = max๐‘– ๐‘˜ ๐‘– , and ๐‘˜ = min๐‘– ๐‘˜ ๐‘– ) , then with probability over 1 โˆ’ ๐‘ 1 ๐‘› max{๐‘˜, ๐‘}โˆ’10 โˆ’ ๐‘’ โˆ’๐‘2 ๐‘˜ for some constants ๐‘ 1 and ๐‘ 2 , the minimizer ๐‘†ห† to (3.4) with weights min{๐‘˜ ๐‘– + 1, ๐‘}1/2 ๐œ†๐‘– = , ๐›ฝ๐‘– = max{๐‘˜ ๐‘– + 1, ๐‘}โˆ’1/2 (3.7) ๐œ–๐‘– has the error bound ห† โˆ’ ๐‘† (๐‘–) โˆฅ 2,1 โ‰ค ๐ถ โˆš ๐‘๐‘› ๐‘˜ยฏ โˆฅ๐œ– โˆฅ 2 . โˆ‘๏ธ โˆฅP๐‘– ( ๐‘†) ๐‘– Here ๐œ–๐‘– = โˆฅ ๐‘‹หœ (๐‘–) โˆ’ ๐‘‹๐‘– 1๐‘‡ โˆ’ ๐‘‡ (๐‘–) โˆ’ ๐‘† (๐‘–) โˆฅ ๐น will be estimated in the next section, ๐œ– = [๐œ–1 , ..., ๐œ– ๐‘› ], โˆฅ ยท โˆฅ 2,1 stands for taking โ„“2 norm along rows and โ„“1 norm along columns, and ๐‘‡ (๐‘–) is the projection of ๐‘‹ (๐‘–) โˆ’ ๐‘‹๐‘– 1๐‘‡ to the tangent space ๐‘‡๐‘‹๐‘– . Remark. We can interpret ๐œ– as the total noise in the data. As explained in Section 3.1.4, โˆฅ ๐‘‹หœ (๐‘–) โˆ’ ๐‘‹๐‘– 1๐‘‡ โˆ’ ๐‘‡ (๐‘–) โˆ’ ๐‘† (๐‘–) โˆฅ ๐น = โˆฅ๐‘… (๐‘–) + ๐ธ (๐‘–) โˆฅ ๐น , thus ๐œ– = 0 if the manifold is linear and the Gaussian โˆš noise is absent. The factor ๐‘› in front of โˆฅ๐œ– โˆฅ 2 takes into account the use of different norms on the two hand sides (the right-hand side is the Frobenius norm of the noise matrix {๐‘… (๐‘–) + ๐ธ (๐‘–) }๐‘–=1 ๐‘› โˆš obtained by stacking the ๐‘… (๐‘–) + ๐ธ (๐‘–) associated with each patch into one big matrix). The factor ๐‘ is due to the small weight ๐›ฝ๐‘– of โˆฅ๐‘† (๐‘–) โˆฅ 1 compared to the weight 1 on โˆฅ ๐‘‹หœ (๐‘–) โˆ’ ๐ฟ (๐‘–) โˆ’ ๐‘† (๐‘–) โˆฅ 2๐น . The factor ๐‘˜ยฏ appears because on average, each column of ๐‘†ห† โˆ’ ๐‘† is added about ๐‘˜ := ๐‘›1 ๐‘– ๐‘˜ ๐‘– times on the ร left hand side. 
3.4 Estimating the curvature

The definition of $\lambda_i$ in (3.7) involves an unknown quantity $\epsilon_i^2 = \|\tilde X^{(i)} - X_i\mathbf 1^T - T^{(i)} - S^{(i)}\|_F^2 \equiv \|R^{(i)} + E^{(i)}\|_F^2$. We assume the standard deviation $\sigma$ of the i.i.d. Gaussian entries of $E^{(i)}$ is known, so $\|E^{(i)}\|_F^2$ can be approximated. Since $R^{(i)}$ is independent of $E^{(i)}$, the cross term $\langle R^{(i)}, E^{(i)}\rangle$ is small. Our main task is estimating $\|R^{(i)}\|_F^2$, the linear approximation error defined in Section 3.1.4. In local regions, second-order terms dominate the linear approximation residue; hence estimating $\|R^{(i)}\|_F^2$ requires curvature information.

3.4.1 The proposed method

All existing curvature estimation methods we are aware of are in the field of computer vision, where the objects are 2D surfaces in 3D Flynn and Jain (1989); Eppel (2006); Tong and Tang (2005); Meek and Walton (2000). Most of these methods are difficult to generalize to high (> 3) dimensions, with the exception of the integral-invariant-based approaches Pottmann et al. (2007). However, the integral-invariant-based approach is not robust to sparse noise and is unsuited to our problem. We propose a new method to estimate the root mean square curvature from the noisy data. Although the graphic illustration is made in 3D, the method is dimension independent.

To compute the average normal curvature at a point $p \in \mathcal M$, we randomly pick $m$ points $q_i \in \mathcal M$ on the manifold lying within a proper distance to $p$, as specified in Algorithm 3.1. Let $\gamma_i$ be the geodesic curve between $p$ and $q_i$. For each $i$, we compute the pairwise Euclidean distance $\|p - q_i\|_2$ and the pairwise geodesic distance $d_g(p, q_i)$ using Dijkstra's algorithm. Through a circular approximation of the geodesic curve, as drawn in Figure 3.1, we can compute the curvature of the geodesic curve as the inverse of the radius:

$$\|\gamma_i''(p)\| = 1/R_{\gamma_i'}, \qquad (3.8)$$

where $\gamma_i'$ is the tangent direction along which the curvature is calculated and $R_{\gamma_i'}$ is the radius of the circular approximation to the curve $\gamma_i$ at $p$, which can be solved, along with the angle $\theta_{\gamma_i'}$, through the geometric relations

$$2R_{\gamma_i'}\sin(\theta_{\gamma_i'}/2) = \|p - q_i\|_2, \qquad R_{\gamma_i'}\theta_{\gamma_i'} = d_g(p, q_i), \qquad (3.9)$$

as indicated in Figure 3.1. Finally, we define the average curvature $\bar\Gamma(p)$ at $p$ to be

$$\bar\Gamma(p) := \big(\mathbb E_{q_i}\|\gamma_i''(p)\|^2\big)^{1/2} \equiv \big(\mathbb E_{q_i} R_{\gamma_i}^{-2}\big)^{1/2}. \qquad (3.10)$$

To estimate the root mean square curvature from the data, we construct two matrices $D$ and $A$. $D \in \mathbb{R}^{n\times n}$ is the pairwise distance matrix, where $D_{ij}$ denotes the Euclidean distance between two points $X_i$ and $X_j$. $A$ is a type of adjacency matrix, defined as follows, used to compute the pairwise geodesic distances from the data:

$$A_{ij} = \begin{cases} D_{ij} & \text{if } X_j \text{ is among the } k \text{ nearest neighbors of } X_i, \\ 0 & \text{otherwise.}\end{cases} \qquad (3.11)$$

Algorithm 3.1 estimates the root mean square curvature at some point $p$, and Algorithm 3.2 estimates the overall curvature within some region $\Omega$ on the manifold.
Algorithm 3.1 Estimate the root mean square curvature Γ̄(p) at some point p
Input: Distance matrix D, adjacency matrix A, some proper constants r1 < r2, number of pairs m
Output: The estimated root mean square curvature Γ̄(p)
1: for i = 1 to m do
2:     Randomly pick some point q_i ∈ B(p, r2) \ B(p, r1)
3:     Calculate the geodesic distance d_g(p, q_i) using A
4:     Solve for the radius R_i based on (3.9)
5: end
6: Compute the estimated curvature Γ̄(p) = ((1/m) Σ_{i=1}^m R_i^{-2})^{1/2}

Algorithm 3.2 Estimate the overall curvature Γ̄(Ω) for some region Ω
Input: Distance matrix D, adjacency matrix A, some proper constants r1 < r2, number of pairs m
Output: The estimated overall curvature Γ̄(Ω)
1: for i = 1 to m do
2:     Randomly pick a pair of points p_i, q_i ∈ Ω such that r1 ≤ d(p_i, q_i) ≤ r2
3:     Calculate the geodesic distance d_g(p_i, q_i) using A
4:     Solve for the radius R_i based on (3.9)
5: end
6: Compute the estimated curvature Γ̄(Ω) = ((1/m) Σ_{i=1}^m R_i^{-2})^{1/2}

The geodesic distance is computed using Dijkstra's algorithm, which is not accurate when $p$ and $q$ are too close to each other. The constant $r_1$ in Algorithms 3.1 and 3.2 is thus used to make sure that $p$ and $q$ are sufficiently far apart. The constant $r_2$ makes sure that $q$ is not too far away from $p$ since, after all, we are computing the root mean square curvature around $p$.
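A possible Python transcription of Algorithm 3.2 is sketched below. It uses scipy's Dijkstra routine for the geodesic distances and solves the arc system (3.9) by bisection on $\sin(t)/t = \text{chord}/\text{geodesic}$ with $t = \theta/2$; the function names, the sampling guard, and the choice of Euclidean distance in the pair-selection test are our own assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra

def solve_radius(chord, geo):
    """Solve the arc system (3.9): 2R sin(theta/2) = chord, R*theta = geo.
    Substituting R = geo/theta reduces it to sin(t)/t = chord/geo with
    t = theta/2, solved by bisection on (0, pi)."""
    ratio = chord / geo
    if ratio >= 1.0:
        return np.inf                        # numerically a straight line
    lo, hi = 1e-12, np.pi
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.sin(mid) / mid > ratio:        # sin(t)/t decreases on (0, pi)
            lo = mid
        else:
            hi = mid
    theta = lo + hi                          # theta = 2 * t
    return geo / theta

def rms_curvature(X, k=10, r1=0.5, r2=2.0, m=200, seed=0):
    """Algorithm 3.2: root mean square curvature over the data set X (p x n)."""
    n = X.shape[1]
    sq = (X ** 2).sum(axis=0)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0))
    A = np.zeros_like(D)                     # kNN adjacency (3.11)
    for i in range(n):
        nbr = np.argsort(D[i])[1:k + 1]
        A[i, nbr] = D[i, nbr]
    Dg = dijkstra(csr_matrix(A), directed=False)
    rng = np.random.default_rng(seed)
    curv2 = []
    for _ in range(100 * m):                 # guard against too few valid pairs
        i, j = rng.integers(n, size=2)
        if i != j and r1 <= D[i, j] <= r2 and np.isfinite(Dg[i, j]):
            R = solve_radius(D[i, j], max(Dg[i, j], D[i, j]))
            curv2.append(0.0 if np.isinf(R) else 1.0 / R ** 2)
            if len(curv2) == m:
                break
    return np.sqrt(np.mean(curv2))
```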
3.4.2 Estimating λi from the root mean square curvature

We provide a way to approximate $\lambda_i$ when the number of points $n$ is finite. In the asymptotic limit ($k \to \infty$, $k/n \to 0$), all the approximation signs "≈" below become "=".

Fix a point $p \in \mathcal M$ and another point $q_i$ in the $\eta$-neighborhood of $p$. Let $\gamma_i$ be the geodesic curve between them. With the computed curvature $\bar\Gamma(p)$, we can estimate the linear approximation error of expanding $q_i$ at $p$: $q_i \approx p + P_{T_p}(q_i - p)$, where $P_{T_p}$ is the projection onto the tangent space at $p$. Let $\mathcal E$ be the error of this linear approximation:

$$\mathcal E(q_i, p) = q_i - p - P_{T_p}(q_i - p) = P_{T_p^\perp}(q_i - p),$$

where $T_p^\perp$ is the orthogonal complement of the tangent space. From Figure 3.1, the relation between $\|\mathcal E(p, q_i)\|_2$, $\|p - q_i\|_2$, and $\theta_{\gamma_i'}$ is

$$\|\mathcal E(p, q_i)\|_2 \approx \|p - q_i\|_2 \sin\frac{\theta_{\gamma_i'}}{2} = \frac{\|p - q_i\|_2^2}{2R_{\gamma_i'}}. \qquad (3.12)$$

To obtain a closed-form formula for $\mathcal E$, we assume that for the fixed $p$ and a randomly chosen $q_i$ in an $\eta$-neighborhood of $p$, the projection $P_{T_p}(q_i - p)$ follows a uniform distribution in a ball with radius $\eta'$ (in fact $\eta' \approx \eta$: when $\eta$ is small, the projection of $q - p$ is almost $q - p$ itself, so the radius of the projected ball is almost equal to the radius of the original neighborhood). Under this assumption, let $r_i = \|P_{T_p}(q_i - p)\|_2$ be the magnitude of the projection and $\phi_i = P_{T_p}(q_i - p)/\|P_{T_p}(q_i - p)\|_2$ the direction; by Vershynin (2018), $r_i$ and $\phi_i$ are independent of each other. As the curvature $R_{\gamma_i}$ only depends on the direction, the numerator and the denominator of the right-hand side of (3.12) are independent of each other. Therefore,

$$\mathbb E\|\mathcal E(p, q_i)\|_2^2 \approx \mathbb E\frac{\|p - q_i\|_2^4}{4R_{\gamma_i'}^2} = \frac{\mathbb E\|p - q_i\|_2^4}{4}\,\mathbb E R_{\gamma_i'}^{-2} = \frac{\mathbb E\|p - q_i\|_2^4}{4}\cdot\bar\Gamma^2(p), \qquad (3.13)$$

where the first equality used the independence and the last equality used the definition of the root mean square curvature in the previous subsection. Now we apply this estimation to the neighborhood of $X_i$. Let $p = X_i$ and $q_j = X_{i_j}$ be the neighbors of $X_i$. Using (3.13), the average linear approximation error on this patch is

$$\frac1k\|R^{(i)}\|_F^2 := \frac1k\sum_{j=1}^{k}\|\mathcal E(X_{i_j}, X_i)\|_2^2 \xrightarrow{k\to\infty} \frac{\mathbb E\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i), \qquad (3.14)$$

where the right-hand side can also be estimated by

$$\frac1k\sum_{j=1}^{k}\frac{\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i) \xrightarrow{k\to\infty} \frac{\mathbb E\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i). \qquad (3.15)$$

Therefore, when $k$ is sufficiently large, $\frac1k\|R^{(i)}\|_F^2$ is also close to $\frac1k\sum_{j=1}^{k}\frac{\|X_i - X_{i_j}\|_2^4}{4}\bar\Gamma^2(X_i)$, which can be computed entirely from the data. Combining this with the argument at the beginning of this section, we get

$$\epsilon_i = \|R^{(i)} + E^{(i)}\|_F \approx \big(\|R^{(i)}\|_F^2 + \|E^{(i)}\|_F^2\big)^{1/2} \approx \Big((k+1)p\sigma^2 + \sum_{j=1}^{k}\frac{\|X_i - X_{i_j}\|_2^4}{4}\,\bar\Gamma^2(X_i)\Big)^{1/2} =: \hat\epsilon_i.$$

Thus we can set $\lambda_i = \frac{\min\{k+1, p\}^{1/2}}{\hat\epsilon_i}$ in view of (3.7). We show in the appendix that $\frac{\hat\lambda_i - \lambda_i^*}{\lambda_i^*} \xrightarrow{k\to\infty} 0$, where $\lambda_i^* = \frac{\min\{k+1, p\}^{1/2}}{\epsilon_i}$ as in (3.7).

3.5 Optimization algorithm

To solve the convex optimization problem (3.4) in a memory-economic way, we first write $L^{(i)}$ as a function of $S$ and eliminate it from the problem. We can do so by fixing $S$ and minimizing the objective function with respect to $L^{(i)}$:

$$\hat L^{(i)} = \arg\min_{L^{(i)}}\ \lambda_i\|\tilde X^{(i)} - L^{(i)} - S^{(i)}\|_F^2 + \|\mathcal C(L^{(i)})\|_*$$
$$= \arg\min_{L^{(i)}}\ \lambda_i\|\mathcal C(L^{(i)}) - \mathcal C(\tilde X^{(i)} - S^{(i)})\|_F^2 + \|\mathcal C(L^{(i)})\|_* + \lambda_i\|(I - \mathcal C)\big(L^{(i)} - (\tilde X^{(i)} - S^{(i)})\big)\|_F^2. \qquad (3.16)$$

Notice that $L^{(i)}$ can be decomposed as $L^{(i)} = \mathcal C(L^{(i)}) + (I - \mathcal C)(L^{(i)})$. Set $A = \mathcal C(L^{(i)})$ and $B = (I - \mathcal C)(L^{(i)})$; then (3.16) is equivalent to

$$(\hat A, \hat B) = \arg\min_{A,B}\ \lambda_i\|A - \mathcal C(\tilde X^{(i)} - S^{(i)})\|_F^2 + \|A\|_* + \lambda_i\|B - (I - \mathcal C)(\tilde X^{(i)} - S^{(i)})\|_F^2,$$

which decouples into

$$\hat A = \arg\min_A\ \lambda_i\|A - \mathcal C(\tilde X^{(i)} - S^{(i)})\|_F^2 + \|A\|_*, \qquad \hat B = \arg\min_B\ \lambda_i\|B - (I - \mathcal C)(\tilde X^{(i)} - S^{(i)})\|_F^2.$$

These problems have the closed-form solutions

$$\hat A = \mathcal T_{1/2\lambda_i}\big(\mathcal C(\tilde X^{(i)} - \mathcal P_i(S))\big), \qquad \hat B = (I - \mathcal C)\big(\tilde X^{(i)} - \mathcal P_i(S)\big), \qquad (3.17)$$

where $\mathcal T_\mu$ is the soft-thresholding operator on the singular values, $\mathcal T_\mu(Z) = U\max\{\Sigma - \mu I, 0\}V^*$, with $U\Sigma V^*$ the SVD of $Z$.
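The closed forms (3.17) only require the operator $\mathcal T_\mu$; a direct NumPy transcription (with our own function names) is:

```python
import numpy as np

def svt(Z, mu):
    """Singular value soft-thresholding: T_mu(Z) = U max(Sigma - mu I, 0) V*."""
    U, sig, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(sig - mu, 0.0)) @ Vt

def centering(Z):
    """C(Z) = Z (I - 11^T/(k+1)): subtract the column mean."""
    return Z - Z.mean(axis=1, keepdims=True)

def patch_solution(Xt_i, S_i, lam_i):
    """Closed forms (3.17): A = T_{1/(2 lam_i)}(C(Xt_i - S_i)),
    B = (I - C)(Xt_i - S_i)."""
    R = Xt_i - S_i
    Rc = centering(R)
    return svt(Rc, 1.0 / (2.0 * lam_i)), R - Rc
```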
64 Combing ๐ดห† and ๐ต, ห† we have derived the closed form solution for ๐ฟห† (๐‘–) ๐ฟห† (๐‘–) (๐‘†) = T1/2๐œ†๐‘– (C( ๐‘‹หœ (๐‘–) โˆ’ P๐‘– (๐‘†))) + (๐ผ โˆ’ C)( ๐‘‹หœ (๐‘–) โˆ’ P๐‘– (๐‘†)). (3.18) Plugging (3.18) into ๐น in (3.4), the resulting optimization problem solely depends on ๐‘†. Then we apply FISTA Beck and Teboulle (2009); Sha et al. (2019) to find the optimal solution ๐‘†ห† with ๐‘†ห† = arg min ๐น ( ๐ฟห† (๐‘–) (๐‘†), ๐‘†). (3.19) ๐‘† Once ๐‘†ห† is found, if the data has no Gaussian noise, then the final estimation for ๐‘‹ is ๐‘‹ห† โ‰ก ๐‘‹หœ โˆ’ ๐‘†; ห† if (๐‘–) there is Gaussian noise, we use the following denoised local patches ๐ฟห† ๐œโˆ— (๐‘–) ๐ฟห† ๐œ โˆ— = ๐ป๐œ โˆ— (C( ๐‘‹หœ (๐‘–) โˆ’ P๐‘– ( ๐‘†))) ห† + (๐ผ โˆ’ C)( ๐‘‹หœ (๐‘–) โˆ’ P๐‘– ( ๐‘†)), ห† (3.20) where ๐ป๐œ โˆ— is the Singular Value Hard Thresholding Operator with the optimal threshold as defined (๐‘–) in Gavish and Donoho (2014). This optimal thresholding removes the Gaussian noise from ๐ฟห† ๐œ โˆ— . (๐‘–) With the denoised ๐ฟห† ๐œโˆ— , we solve (3.6) to obtain the denoised data ๐‘› ๐‘› (๐‘–) โˆ‘๏ธ โˆ‘๏ธ ๐‘‹ห† = ( ๐œ†๐‘– ๐ฟห† ๐œ โˆ— ๐‘ƒ๐‘–๐‘‡ )( ๐œ†๐‘– ๐‘ƒ๐‘– ๐‘ƒ๐‘–๐‘‡ ) โˆ’1 . (3.21) ๐‘–=1 ๐‘–=1 The proposed Nonlinear Robust Principle Component Analysis (NRPCA) algorithm is summarized in Algorithm 3.3. There is one caveat in solving (3.4): the strong sparse noise may result in a wrong Algorithm 3.3 Nonlinear Robust PCA Input: Noisy data matrix ๐‘‹, หœ ๐‘˜ (number of neighbors in each local patch), ๐‘‡ (number of neighbor- hood updates iterations) Output: The denoised data ๐‘‹, ห† the estimated sparse noise ๐‘†ห† 1: Estimate the curvature using (3.10) 2: Estimate ๐œ†๐‘– , ๐‘– = 1, . . . , ๐‘› as in Section 3.4, set ๐›ฝ as in (3.4) 3: ๐‘†ห† โ† 0 4: for iter = 1 to T do 5: Find the ๐‘˜NN for each point using ๐‘‹หœ โˆ’ ๐‘†ห† and construct the restriction operators {P๐‘– }๐‘–=1 ๐‘› 6: Construct the local data matrices ๐‘‹หœ (๐‘–) = P๐‘– ( ๐‘‹) หœ using P๐‘– and the noisy data ๐‘‹หœ 7: ๐‘†ห† โ† minimizer of (3.19) iteratively using FISTA 8: end 9: Compute each ๐ฟ ห† (๐‘–)โˆ— from (3.20) and assign ๐‘‹ห† from (3.21) ๐œ 65 neighborhood assignment when constructing the local patches. Therefore, once ๐‘†ห† is obtained and removed from the data, we update the neighborhood assignment and re-compute ๐‘†. ห† This procedure is repeated ๐‘‡ times. 3.6 Numerical experiment Figure 3.2 NRPCA applied to the noisy 3D Swiss roll dataset. ๐‘‹หœ โˆ’ ๐‘†ห† is the result after subtracting the sparse noise estimated by setting ๐‘‡ = 1 in NRPCA, i.e., no neighbour update; โ€œ ๐‘‹หœ โˆ’ ๐‘†ห† with one neighbor updateโ€ used the ๐‘†ห† obtained by setting ๐‘‡ = 2 in NRPCA; clearly, the neighbour update helped to remove more sparse noise; ๐‘‹ห† is the data obtained via fitting the denoised tangent spaces as in (3.6). Compared toโ€œ ๐‘‹หœ โˆ’ ๐‘†ห† with one neighbor updateโ€, it further removed the Gaussian noise from the data; โ€Patch-wise Robust PCAโ€ refers to the ad-hoc application of the vanilla Robust PCA to each local patch independently, whose performance is worse than the proposed joint-recovery formulation. Simulated Swiss roll: We demonstrate the superior performance of NRPCA on a synthetic dataset following the mixed noise model (3.1). We sampled 2000 noiseless data ๐‘‹๐‘– uniformly from a 3D Swiss roll and generated the Gaussian noise matrix with i.i.d. entries obeying N (0, 0.25). 
3.6 Numerical experiment

Figure 3.2 NRPCA applied to the noisy 3D Swiss roll dataset. X̃ − Ŝ is the result after subtracting the sparse noise estimated by setting T = 1 in NRPCA, i.e., no neighbor update; "X̃ − Ŝ with one neighbor update" used the Ŝ obtained by setting T = 2 in NRPCA; clearly, the neighbor update helped to remove more sparse noise; X̂ is the data obtained via fitting the denoised tangent spaces as in (3.6). Compared to "X̃ − Ŝ with one neighbor update", it further removes the Gaussian noise from the data; "Patch-wise Robust PCA" refers to the ad-hoc application of the vanilla Robust PCA to each local patch independently, whose performance is worse than the proposed joint-recovery formulation.

Simulated Swiss roll: We demonstrate the superior performance of NRPCA on a synthetic dataset following the mixed noise model (3.1). We sampled 2000 noiseless data points $X_i$ uniformly from a 3D Swiss roll and generated the Gaussian noise matrix with i.i.d. entries obeying $\mathcal N(0, 0.25)$. The sparse noise matrix $S$ is generated by randomly replacing 100 entries of a zero $p\times n$ matrix with i.i.d. samples generated from $(-1)^y\cdot z$, where $y \sim \text{Bernoulli}(0.5)$ and $z \sim \mathcal N(5, 0.09)$. We applied NRPCA to the simulated data with patch size $k = 15$. Figure 3.2 reports the denoising results in the original space (3D), looking down from above. We compare two ways of using the outputs of NRPCA: 1) only remove the sparse noise from the data, $\tilde X - \hat S$; 2) remove both the sparse and the Gaussian noise from the data, $\hat X$. In addition, we plotted $\tilde X - \hat S$ with and without the neighborhood update. These results are all superior to an ad-hoc application of Robust PCA on the individual local patches.

High-dimensional Swiss roll: We carried out the same simulation on a high-dimensional Swiss roll and obtained better distinguishability among 1)-3). We also observed an overall improvement in the performance of NRPCA, which matches our intuition that the assumptions of Theorem 3.3.1 are more likely to be satisfied in high dimensions. The denoised results are displayed in Figure 3.3, where we clearly see that the use of $\hat X$ instead of $\tilde X - \hat S$ allows a significant amount of Gaussian noise to be removed from the data. In the high-dimensional simulation, we generated a Swiss roll in $\mathbb{R}^{20}$ as follows (a code sketch of this generation procedure is given after Figure 3.3's caption below):
1. Choose the number of samples $n = 2000$;
2. Let $t$ be the vector of length $n$ containing the $n$ uniform grid points in the interval $[0, 4\pi]$ with grid spacing $4\pi/(n-1)$;
3. Set the first three dimensions of the data the same way as the 3D Swiss roll: for $i = 1, \dots, n$, $X_i(1) = (t(i)+1)\cos(t(i))$, $X_i(2) = (t(i)+1)\sin(t(i))$, and $X_i(3) \sim \text{unif}([0, 8\pi])$, where $\text{unif}([0, 8\pi])$ denotes the uniform distribution on the interval $[0, 8\pi]$;
4. Set dimensions 4-20 of the data to contain pure sinusoids with various frequencies: $X_i(k) = t(i)\sin(f_k t(i))$, $k = 4, \dots, 20$, where $f_k = k/21$ is the frequency for the $k$-th dimension.
The noisy data is obtained by adding i.i.d. Gaussian noise $\mathcal N(0, 0.25)$ to each entry of $X$ and adding sparse noise to 600 randomly chosen entries, where the noise added to each chosen entry obeys $\mathcal N(5, 0.09)$.

Figure 3.3 NRPCA applied to the noisy 20D Swiss roll data set. X̃ − Ŝ is the result after subtracting the estimated sparse noise via NRPCA with T = 1; "X̃ − Ŝ with one neighbor update" is that with T = 2, i.e., patches are reassigned once; X̂ is the denoised data obtained via fitting the tangent spaces in NRPCA with T = 2; "Patch-wise Robust PCA" refers to the ad-hoc application of the vanilla RPCA to each local patch independently, whose performance is clearly worse than the proposed joint-recovery formulation.
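A sketch of the 20D Swiss roll generation just described (our own variable names; distributions exactly as stated above):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 2000, 20
t = np.linspace(0.0, 4 * np.pi, n)             # uniform grid on [0, 4 pi]

X = np.zeros((p, n))
X[0] = (t + 1) * np.cos(t)                     # 3D Swiss roll coordinates
X[1] = (t + 1) * np.sin(t)
X[2] = rng.uniform(0.0, 8 * np.pi, size=n)
for k in range(3, p):                          # dimensions 4..20: sinusoids
    f_k = (k + 1) / 21.0                       # f_k = k/21 in 1-based indexing
    X[k] = t * np.sin(f_k * t)

E = 0.5 * rng.standard_normal((p, n))          # N(0, 0.25): variance 0.25
S = np.zeros((p, n))
idx = rng.choice(p * n, size=600, replace=False)
S.flat[idx] = rng.normal(5.0, 0.3, size=600)   # N(5, 0.09): std 0.3
X_noisy = X + S + E                            # mixed noise model (3.1)
```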
MNIST: We observe some interesting dimension reduction results on the MNIST dataset with the help of NRPCA. It is well known that the handwritten digits 4 and 9 have such a high similarity that some popular dimension reduction methods, such as Isomap and Laplacian Eigenmaps (LE), are not able to separate them into two clusters (first column of Figure 3.4). Despite the similarity, a few other methods (such as t-SNE) are able to distinguish them to a much higher degree, which suggests the possibility of improving the results of Isomap and LE with proper data pre-processing. We conjecture that the overlapping parts in Figure 3.4 (the left column) are caused by personalized writing styles with different beginning or finishing strokes. This type of difference can be better modelled by sparse noise than by Gaussian or Poisson noise.

Figure 3.4 Laplacian eigenmaps and Isomap results for the original and the NRPCA denoised digits 4 and 9 from the MNIST dataset.

The right column of Figure 3.4 confirms this conjecture: after the NRPCA denoising (with k = 10), we see a much better separability of the two digits using the first two coordinates of Isomap and Laplacian Eigenmaps. Here we used 2000 randomly drawn images of 4 and 9 from the MNIST training dataset. Figure 3.5 used another random set of the same cardinality and k = 5, but both demonstrate that the denoising step greatly facilitates the dimensionality reduction. In addition, we observe some emerging trajectory (or skeleton) patterns in the plots of the denoised embeddings (right column of Figure 3.4 and Figure 3.5). Mathematically speaking, this is due to the nuclear norm penalty on the tangent spaces in the optimization formulation, which forces the denoised data to have a small intrinsic dimension. However, since the small intrinsic dimensionality is not manually inputted but implicitly imposed via an automatic calculation of the data curvature and the weight parameters λ_i, we do not think the trajectory pattern is a human artifact. To further examine the meaning of the trajectories, we replaced the dots in the bottom two scatter plots of Figure 3.5 by the original images of the digits, obtaining Figure 3.6 and Figure 3.7. We can see that 1) the digits are better grouped in the denoised embedding than in the original one, and 2) the trajectories in the denoised embedding correspond to gradual transitions between the two images at the two ends. If two images are connected by two trajectories, this indicates two ways for one image to gradually deform into the other. Furthermore, Figure 3.8 lists a few images of 4 and 9 before and after denoising, which shows which parts of the images are detected as sparse noise and changed by NRPCA.

Figure 3.5 Laplacian eigenmaps and Isomap results for the original and the NRPCA denoised digits 4 and 9 from the MNIST dataset.

Figure 3.6 Isomap embedding using the original data from the MNIST dataset.
Figure 3.7 Isomap embedding using the denoised data via NRPCA.

Figure 3.8 A comparison of the original and the NRPCA denoised images of digits 4 and 9.

Figure 3.9 shows the results of NRPCA denoising with more iterations of patch reassignment; the results show almost no visible difference for T > 2. Since the patch reassignment is in the outer iteration, increasing its frequency greatly increases the computation time. Fortunately, we find that two iterations are often enough to deliver a good denoising result.

Figure 3.9 NRPCA denoising results with more iterations of patch reassignment (noisy images and denoised images with T = 1, ..., 5).

Biological data: We illustrate the potential usefulness of the NRPCA algorithm on an embryoid body (EB) differentiation dataset over a 27-day time course, which consists of gene expressions for 31,000 cells measured with single-cell RNA-sequencing technology (scRNAseq) Martin and Evans (1975); Moon et al. (2019). This EB data, comprising expression measurements for cells originating from embryoid bodies at different stages, is developmental in nature and should exhibit a progressive structure, such as a tree structure, because all cells arise from a single oocyte and then develop into different highly differentiated tissues. This progressive character is often missing when we directly apply dimension reduction methods to the data, as shown in Figure 3.10, because biological data, including scRNAseq, is highly noisy and often contaminated with outliers from different sources, including environmental effects and measurement error. In this case, we aim to reveal the progressive nature of the single-cell data from transcript abundance as measured by scRNAseq.

We first normalized the scRNAseq data following the procedure described in Moon et al. (2019) and randomly selected 1000 cells using a stratified sampling framework to maintain the ratios among different developmental stages. We applied our NRPCA method to the normalized subset of the EB data and then applied Locally Linear Embedding (LLE) to the denoised results. The two-dimensional LLE results are shown in Figure 3.10. Our analysis demonstrates that although LLE is unable to show the progression structure using the noisy data, after the NRPCA denoising LLE successfully extracts the trajectory structure in the data, which reflects the underlying smooth differentiation processes of embryonic cells. Interestingly, using the denoised data from X̃ − Ŝ with neighbor update, the LLE embedding shows a branching at around day 9 and increased variance at later time points, which was confirmed by manual analysis using 80 biomarkers in Moon et al. (2019).

Figure 3.10 LLE results for the denoised scRNAseq data set.

CHAPTER 4
PERTURBATION OF INVARIANT SUBSPACES FOR ILL-CONDITIONED EIGENSYSTEM

4.1 Introduction

4.1.1 Overview of Chapter 4

Given a diagonalizable matrix $A$, this chapter studies the stability of its invariant subspaces when its matrix of eigenvectors is ill-conditioned. Let $\mathcal X_1$ be some invariant subspace of $A$ and $X_1$ be the matrix of the right eigenvectors that span $\mathcal X_1$. It is generally believed that when the condition number $\kappa_2(X_1)$ gets large, the corresponding invariant subspace $\mathcal X_1$ becomes unstable under perturbation. This chapter proves that this is not always the case. Specifically, we show that the growth of $\kappa_2(X_1)$ alone is not enough to destroy the stability.
As a direct application, the result in this chapter ensures that when $A$ gets close to a Jordan form, one may still stably estimate its invariant subspaces from noisy data. The result in this chapter also suggests that for matrices with ill-conditioned eigensystems, the invariant subspaces may be more stable than the eigenvalues under matrix perturbation. Results of this chapter have given rise to the manuscript Lyu and Wang (2022).

4.2 Invariant subspace perturbation analysis

In Chapter 2 we discussed the stability of singular subspaces; for symmetric matrices, the techniques in Chapter 2 can also be extended to studying the stability of invariant subspaces. This is because symmetric matrices enjoy the following nice properties: 1. all the eigenvalues are real; 2. for any symmetric matrix $A$, there exists an eigendecomposition $A = Q\Lambda Q^T$, where $Q$ is a unitary matrix and $\Lambda$ is a diagonal matrix containing the eigenvalues. However, when $A \in \mathbb{C}^{n\times n}$ is a general square matrix with eigendecomposition $A = X\Lambda X^{-1}$, studying the invariant subspaces of $A$ becomes more challenging. First, the eigenvalues in $\Lambda$ may not be real. Moreover, the eigenvector matrix $X$ may not be unitary, and the eigenvectors in $X$ may not be orthogonal to each other either. As a result, the condition numbers $\kappa_2(X)$ and $\kappa_2(X_1)$ can be arbitrarily large; examples arise when the matrix $A$ is close to a Jordan form (e.g., Example 4.3.1). To better understand the behaviour of the invariant subspace $\mathrm{span}(X_1)$ under perturbation, it is important to investigate the impact of the condition numbers on its stability. Chapter 1 provides the mathematical techniques for invariant subspace perturbation analysis.

In the setting of Section 4.2, we consider a diagonalizable matrix $A \in \mathbb{C}^{n\times n}$ whose eigenvector matrix is partitioned into two blocks $X = [X_1, X_2]$. Assume the perturbed matrix $\widetilde A = A + \Delta A$ has a similar block structure: $\widetilde X$ is the eigenvector matrix of $\widetilde A$ with partition $\widetilde X = [\widetilde X_1, \widetilde X_2]$. Define $V := (X^{-1})^*$ and $\widetilde V := (\widetilde X^{-1})^*$; then $V^*X = \widetilde V^*\widetilde X = I$. Similarly to the partitions of $X$ and $\widetilde X$, we also partition the matrices $V$ and $\widetilde V$ into two blocks, i.e., $V = [V_1, V_2]$ and $\widetilde V = [\widetilde V_1, \widetilde V_2]$.

Using the definitions above, we study the distance between the subspaces $\mathcal X_1 = \mathrm{span}(X_1)$ and $\widetilde{\mathcal X}_1 = \mathrm{span}(\widetilde X_1)$. Specifically, we are interested in bounding the quantity

$$\|\sin\Theta(\mathcal X_1, \widetilde{\mathcal X}_1)\| = \|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\|,$$

where the $\sin\Theta$ distance between two subspaces is defined in Section 1.2.3.
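Numerically, this quantity can be evaluated from the principal angles between the two spans; a small self-contained sketch (the helper name is ours):

```python
import numpy as np

def sin_theta_dist(B1, B2):
    """Spectral-norm sin(Theta) distance between span(B1) and span(B2).
    After orthonormalizing, the singular values of Q1* Q2 are the cosines of
    the principal angles; the distance is the sine of the largest angle."""
    Q1, _ = np.linalg.qr(B1)
    Q2, _ = np.linalg.qr(B2)
    c = np.linalg.svd(Q1.conj().T @ Q2, compute_uv=False)
    c = np.clip(c, 0.0, 1.0)
    return np.sqrt(1.0 - c.min() ** 2)
```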
4.2.1 Related works

Since the exact $\sin\Theta$ distance is hard to calculate, a central task in eigen-perturbation analysis is establishing useful upper bounds with simpler expressions Varah (1970); Stewart (1990); Greenbaum et al. (2020); Karow and Kressner (2014); Ipsen (2003); Golub and Wilkinson (1976); Demmel (1986); Kato (2013); Chatelin (2011); Gohberg et al. (2006); Davis and Kahan (1970). Explicitly, we prefer upper bounds expressed through simple quantities related to $A$ and $\Delta A$, such as $\|\Delta A\|$, $\|A\|$, the condition number of $X$, and the gap between $\Lambda_1$ and $\Lambda_2$, since these quantities are more likely to be given as prior information and/or are easier to estimate when the exact $A$ is unknown.

The most well-known bound is perhaps the one given by the Davis-Kahan theorem Davis and Kahan (1970) (see Theorem 1.3.10), which states that for Hermitian matrices, the $\sin\Theta$ distance depends only on $\|\Delta A\|$ and the eigengap. For non-Hermitian matrices, however, it is believed that the stability of $\mathcal X_1$ also depends on the condition number of the eigenvector matrix $X$: an ill-conditioned $X$ would cause instability of the invariant subspaces. However, a tight relationship between the condition number and the $\sin\Theta$ distance is yet to be established. This problem has been partially studied in several papers Haviv and Ritov (1994); Ipsen (2003); Stewart (1973, 1990); Varah (1970). The best known relation between the $\sin\Theta$ distance and the condition numbers was provided in Stewart (1971); here we state a slightly simplified version of this classical result.

Theorem 4.2.1 (simplified version of Theorem 4.1 in Stewart (1971)). Provided that

$$\|\Delta A\|(\|A\| + \|\Delta A\|) < \frac14\big(\mathrm{sep}(Q_{X_1}^* A Q_{X_1}, Q_{V_2}^* A Q_{V_2}) - 2\|\Delta A\|\big)^2, \qquad (4.1)$$

the following error bound holds:

$$\|\tan\Theta(Q_{X_1}, Q_{\widetilde X_1})\| < \frac{2\|\Delta A\|}{\mathrm{sep}(Q_{X_1}^* A Q_{X_1}, Q_{V_2}^* A Q_{V_2}) - 2\|\Delta A\|}, \qquad (4.2)$$

where for any pair of matrices $L_1, L_2$, $\mathrm{sep}(L_1, L_2) := \inf_{\|T\|=1}\|TL_1 - L_2T\|$.

Since $\|\tan\Theta\|$ is larger than $\|\sin\Theta\|$, this also gives a $\sin\Theta$ bound. Theorem 4.2.1 indicates that the key quantity determining the stability of invariant subspaces is neither the eigengap nor the condition numbers, but a new quantity called the separation, defined through the norm of a Sylvester operator related to $A$. Despite its efficacy in characterizing subspace stability, the separation is difficult to estimate in practice when the original matrix $A$ is not fully known. In contrast, it is much easier to come up with a rough estimate of (or be given some prior knowledge of) the condition number of the eigensystem as well as the eigengap. Therefore, characterizing the stability directly in terms of the condition number (of the eigensystem) and the eigengap is still of great practical interest.

One can replace the separation in (4.2) with the condition numbers using the following inequalities:

$$\mathrm{sep}(Q_{X_1}^* A Q_{X_1}, Q_{V_2}^* A Q_{V_2}) = \mathrm{sep}(R_{X_1}\Lambda_1 R_{X_1}^{-1}, R_{V_2}\Lambda_2 R_{V_2}^{-1}) \ge \frac{\mathrm{sep}(\Lambda_1, \Lambda_2)}{\kappa_2(R_{X_1})\kappa_2(R_{V_2})}, \qquad (4.3)$$

where the equality follows by direct calculation and the inequality is from Chapter V of Stewart (1990). Noticing that $\kappa_2(R_{X_1}) = \kappa_2(X_1)$ and $\kappa_2(R_{V_2}) = \kappa_2(V_2)$, substituting (4.3) into (4.2) leads to the following bound in terms of condition numbers:

$$\|\tan\Theta(\mathcal X_1, \widetilde{\mathcal X}_1)\| < \frac{2\kappa_2(X_1)\kappa_2(V_2)\|\Delta A\|}{[\mathrm{sep}(\Lambda_1, \Lambda_2) - 2\kappa_2(X_1)\kappa_2(V_2)\|\Delta A\|]_+}, \qquad (4.4)$$

where $[\cdot]_+$ is the positive part of the input, used to guarantee that the bound is positive. The separation $\mathrm{sep}(\Lambda_1, \Lambda_2)$ only depends on the eigenvalues of $A$ and can be bounded by a constant smaller than the common eigengap $\delta_1 := \min_{\lambda_i\in S(\Lambda_1),\,\lambda_j\in S(\Lambda_2)}|\lambda_i - \lambda_j|$; more discussion can be found in the Appendix. We note that the bound (4.4) is quite tight when $\kappa_2(X_1)$ and $\kappa_2(V_2)$ are small.
The minimal condition number is achieved when $A$ is Hermitian, for which $X$ is unitary and $\kappa_2(X) = \kappa_2(X_1) = \kappa_2(V_2) = 1$. Then (4.4) reduces to
\[
\|\tan\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| < \frac{2\|\Delta A\|}{\delta_1 - 2\|\Delta A\|}, \tag{4.5}
\]
which matches the tight Davis–Kahan bound Davis and Kahan (1970) ensuring the stability of the subspace whenever it has a sufficient eigengap from the others. However, the bound (4.4) suggests that $\mathcal{X}_1$ becomes unstable (zero tolerance of noise) as $\kappa_2(X_1) \to \infty$. That is, with fixed eigenvalues, fixed $\kappa_2(V_2)$, and fixed noise magnitude $\|\Delta A\|$, as $\kappa_2(X_1) \to \infty$ the right-hand side of (4.4) also goes to infinity and the bound becomes uninformative.

A similar result has been derived in Varah (1970). The original result in Varah (1970) was stated for both diagonalizable matrices and Jordan forms; here, to avoid distractions, we only state it for diagonalizable matrices:
\[
\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| \le \frac{C_r\, \|\Delta A\|}{\sigma_{\min}(X)\,\sigma_{\min}(X_1)}. \tag{4.6}
\]
Here $C_r$ is a constant depending on $r := \dim(\mathcal{X}_1)$ and the eigengap $\delta_1$ defined above. Recall that we required the columns of $X$ to be normalized, which leads to $1 \le \sigma_{\max}(X_1) = \kappa_2(X_1)\,\sigma_{\min}(X_1) \le \sqrt r$. From this, we see that the bound (4.6) is no smaller than
\[
\frac{C_r\, \kappa_2(X_1)\,\|\Delta A\|}{\sqrt r\,\sigma_{\min}(X)}.
\]
Thus (4.6) also suggests an instability of $\mathcal{X}_1$ as $\kappa_2(X_1) \to \infty$, which is quite pessimistic. In Ipsen (2003), the author proved that one can replace the absolute error $\|\Delta A\|$ in (4.4) by a relative error $\|A^{-k}\, \Delta A\, \widetilde A^{-l}\|$, where $k$ and $l$ are positive numbers. However, the dependence on the condition numbers was not improved.

4.3 Motivating example

As discussed in Section 4.2.1, (4.4) is the state-of-the-art relation between the $\sin\Theta$ distance and the condition numbers of the eigensystem. However, when $\kappa_2(X_1)$ gets large, (4.4) is no longer tight, as can be seen from the following example. It motivates us to derive a new perturbation bound in Section 4.4.

Example 4.3.1. Consider the following matrix
\[
A = \begin{bmatrix} B & 0 \\ 0 & 1/2 \end{bmatrix}, \qquad B = \begin{bmatrix} 1 & 1 \\ \epsilon & 1 \end{bmatrix}.
\]
Assume the perturbation matrix $\Delta A \in \mathbb{C}^{3\times 3}$ is arbitrary with $\epsilon_1 := \|\Delta A\| = o(1)$. Also assume $\epsilon = o(1)$. Let $X_1$ be the $3\times 2$ matrix containing the two eigenvectors of $A$ corresponding to the block $B$. We want to determine the stability of $\mathcal{X}_1 = \mathrm{span}(X_1)$.

Since $B$ is close to a Jordan block, $\kappa_2(X_1)$ must be large. To verify this, we first obtain the closed-form expressions for $X_1$ and $X_2$:
\[
X_1 = \frac{1}{\sqrt{1+\epsilon}} \begin{bmatrix} 1 & 1 \\ \epsilon^{1/2} & -\epsilon^{1/2} \\ 0 & 0 \end{bmatrix}, \qquad X_2 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}.
\]
This immediately implies $\mathcal{X}_1 = \mathrm{span}\{e_1, e_2\}$ ($e_i$, $i = 1, 2$, are the canonical basis vectors) and $\kappa_2(X_1) = \epsilon^{-1/2} \gg 1$. In addition, the eigengap of this $A$ is sufficiently large, since the eigenvalues in $\Lambda_1$ are $1 \pm \epsilon^{1/2}$, and $\Lambda_2 = 1/2$.

With a general perturbation $\Delta A$, there is no closed-form expression for $\widetilde{\mathcal{X}}_1$. We therefore first use some special perturbations to make our point. Let $E_{i,j}$ be the $3\times 3$ matrix whose $(i,j)$th entry equals 1 and whose other entries equal 0.
Consider special perturbations of the form $\Delta A = \epsilon_1 E_{i,j}$ with $i, j \in \{1, 2, 3\}$, where $\epsilon_1 = o(1)$ is a small constant different from $\epsilon$. If $(i,j) \in \{(1,1), (1,2), (2,1), (2,2), (1,3), (2,3), (3,3)\}$, one can verify that $\widetilde{\mathcal{X}}_1$ is exactly the same as $\mathcal{X}_1$, so $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| = 0$. Otherwise, $(i,j) = (3,1)$ or $(i,j) = (3,2)$. For these two, one can verify that under the perturbation $\Delta A = \epsilon_1 E_{i,j}$, the eigenvalues of $A$ do not change. As for the eigenvectors, if $(i,j) = (3,1)$, then the perturbed eigenvectors are
\[
\widetilde X_1 = \mathrm{normalize}\!\left( \begin{bmatrix} 1 & 1 \\[2pt] \epsilon^{1/2} & -\epsilon^{1/2} \\[2pt] \dfrac{2\epsilon_1}{1+2\sqrt\epsilon} & \dfrac{2\epsilon_1}{1-2\sqrt\epsilon} \end{bmatrix} \right),
\]
where $\mathrm{normalize}$ stands for column-wise normalization. We can directly compute the distance between $\mathcal{X}_1$ and $\widetilde{\mathcal{X}}_1$ to get $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| \le O(\epsilon_1) = O(\|\Delta A\|)$. Similarly, for $(i,j) = (3,2)$, the same calculation again yields $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| \le O(\|\Delta A\|)$. Therefore, for all the special perturbations of the form $\Delta A = \epsilon_1 E_{i,j}$, we have $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| \le O(\|\Delta A\|)$. Notice that the $\sin\Theta$ distance does not get worse as $\kappa_2(X_1) = \epsilon^{-1/2} \to \infty$, suggesting that the bound (4.4) may be suboptimal.

To get additional supporting evidence, we tested random perturbations and summarized the values of $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\|$ in Table 4.1.

$\epsilon$                 1e-2      1e-4      1e-6      1e-8      1e-10
Estimated by (4.4)         5.00e-5   4.08e-4   4.00e-3   0.0042    0.67
True $\sin\Theta$ distance 2.07e-6   1.99e-6   1.99e-6   1.99e-7   1.99e-6

Table 4.1: Comparison of the true $\sin\Theta$ distance in Example 4.3.1 with its upper bound computed from (4.4) for various values of $\epsilon$. The perturbation matrix $\Delta A$ is a realization of a random Gaussian matrix rescaled to the fixed norm $\|\Delta A\| = \epsilon_1 = 10^{-6}$. With this fixed $\Delta A$, we let the condition number $\kappa_2(X_1) \to \infty$ by letting $\epsilon \to 0$ in Example 4.3.1. We see that the true $\sin\Theta$ distance does not vary with $\epsilon$ while the upper bound (4.4) does, suggesting the suboptimality of (4.4).

In the simulation, we added a random perturbation with energy $\epsilon_1$ to the matrix $A$ defined in Example 4.3.1 and let the $\epsilon$ in $A$ go to 0. For each value of $\epsilon$, we computed the true $\sin\Theta$ distance as well as the value of the upper bound (4.4). We observed that the true $\sin\Theta$ distance does not change much as $\epsilon \to 0$ while the upper bound blows up, which suggests a suboptimality of the bound. A few more details about the simulation: the random perturbation $\Delta A$ in this experiment was obtained by rescaling an i.i.d. Gaussian matrix to have a spectral norm of $10^{-6}$. Because in Example 4.3.1 the columns of $X_1$ are the eigenvectors of $A$ corresponding to the two eigenvalues of largest magnitude, in the simulation we also took $\widetilde X_1$ to be the eigenvectors of $\widetilde A$ associated with the two eigenvalues of largest magnitude. The true $\sin\Theta$ distance in the table was computed by $\sqrt{1 - \sigma_{\min}^2\!\left(Q_{X_1}^* Q_{\widetilde X_1}\right)}$, which equals $\|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\|$.
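The experiment behind Table 4.1 is straightforward to reproduce. Below is a sketch under stated assumptions (Python/NumPy; the random seed, and hence the exact printed values, are ours and will differ slightly from the table):

```python
import numpy as np

def sin_theta(U, W):
    """Spectral-norm sin(Theta) distance between span(U) and span(W)."""
    Qu, _ = np.linalg.qr(U)
    Qw, _ = np.linalg.qr(W)
    c = np.linalg.svd(Qu.conj().T @ Qw, compute_uv=False)
    return np.sqrt(max(0.0, 1.0 - c.min() ** 2))

def top2_eigvecs(M):
    """Eigenvectors of M for its two largest-magnitude eigenvalues."""
    vals, vecs = np.linalg.eig(M)
    return vecs[:, np.argsort(-np.abs(vals))[:2]]

rng = np.random.default_rng(0)
eps1 = 1e-6
Delta = rng.standard_normal((3, 3))
Delta *= eps1 / np.linalg.norm(Delta, 2)   # fix the spectral norm to eps1

for eps in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10]:
    A = np.array([[1.0, 1.0, 0.0],
                  [eps, 1.0, 0.0],
                  [0.0, 0.0, 0.5]])
    d = sin_theta(top2_eigvecs(A), top2_eigvecs(A + Delta))
    print(f"eps = {eps:.0e}:  sin Theta = {d:.2e}")
```

The printed distances stay at the level of eps1 as eps shrinks, while the value of (4.4) blows up, which is exactly the discrepancy recorded in Table 4.1.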
4.4 A new invariant subspace perturbation bound

Both the theoretical argument for special perturbations and the numerical results for random perturbations suggest that for the matrix $A$ in Example 4.3.1, the perturbation of $\mathcal{X}_1$ is $O(\epsilon_1) = O(\|\Delta A\|)$, which is unaffected by the large condition number $\kappa_2(X_1) = \epsilon^{-1/2}$ as $\epsilon \to 0$. In this section, we show that this phenomenon is no coincidence, and that $\kappa_2(X_1)$ can indeed be removed from the previous bound (4.4).

In order to present the main results of this chapter, we need to assume that the spectra of $A$ and $\widetilde A$ have gaps. Since the study of eigenvalue perturbation is out of the scope of this dissertation, we simply take the existence of eigengaps as an assumption.

Assumption 1 [Eigengap]: Suppose $S(\Lambda_1)$ and $S(\Lambda_2)$ are well-separated in the sense that $\min_{\lambda \in S(\Lambda_1),\, \sigma \in S(\Lambda_2)} |\lambda - \sigma| > 0$.

The following assumption requires that the gap still exists after perturbation.

Assumption 2 [Eigengap under perturbation]: Suppose $S(\widetilde\Lambda_1)$ and $S(\Lambda_2)$ are well-separated with some eigengap $\delta_\lambda > 0$. More explicitly, $0 < \delta_\lambda := \min_{\lambda \in S(\widetilde\Lambda_1),\, \sigma \in S(\Lambda_2)} |\lambda - \sigma|$.

The following theorem improves upon (4.4) when $\kappa_2(X_1)$ is large.

Theorem 4.4.1. Assume $A$ and $\widetilde A$ are diagonalizable matrices with decompositions (1.2) and (1.3), satisfying Assumption 2 with an eigengap $\delta_\lambda$. Let $r = \#(S(\Lambda_1))$ be the number of eigenvalues in $\Lambda_1$ and denote by $\widetilde\lambda_j$, $j = 1, \ldots, r$, the $j$th diagonal element of $\widetilde\Lambda_1$. Then we have
\[
\left\|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\right\| \le \frac{\kappa_2(V_2)\,\|\Delta A\|_F}{a} \prod_{j=1}^{r} \left( 1 + \frac{a}{\min_{\lambda_k \in S(\Lambda_2)} |\widetilde\lambda_j - \lambda_k|} \right) \le \frac{\kappa_2(V_2)\,\|\Delta A\|_F}{a} \left( 1 + \frac{a}{\delta_\lambda} \right)^{\!r}, \tag{4.7}
\]
where $a = \|A\| + \|\Delta A\| + \rho(\Lambda_2)$, and $\rho(\Lambda_2)$ is the spectral radius of $\Lambda_2$.

Remark 4.4.2. Notice that (4.7) does not contain $\kappa_2(X_1)$ and, when applied to Example 4.3.1, agrees with the numerical observation.

We note that Theorem 4.4.1 states that
\[
\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| \le \kappa_2(V_2)\, f\!\left(\|A\|, \|\Delta A\|, \delta_1, r\right), \tag{4.8}
\]
where $f$ is some function of $\|A\|$, $\|\Delta A\|$, $\delta_1$, and $r \equiv \dim(\mathcal{X}_1)$. The new bound ensures that the stability will not keep deteriorating as $\kappa_2(X_1) \to \infty$. In particular, when $A$ approaches a Jordan form, we may still stably estimate its invariant subspaces from noisy data. Note that a previous result Varah (1970) guaranteed the stability of invariant subspaces for deficient matrices (which correspond to $\kappa_2(X_1) = \infty$), while our result holds for the much larger class of diagonalizable matrices with large condition numbers; the proof technique is also different.

Another important implication of Theorem 4.4.1 is that if a matrix has an ill-conditioned eigensystem, then its invariant subspaces are likely to be more stable than its eigenvalues under perturbation. To see why, recall that the perturbation of eigenvalues is controlled by the Bauer–Fike theorem, $|\lambda - \widetilde\lambda| \le \kappa_2(X)\,\|\Delta A\|$, which is known to be tight. As the matrix gets more and more ill-conditioned, this bound is likely to be larger than (4.8) since $\kappa_2(X) \ge \kappa_2(V_2)$. Hence the eigenvalues are perturbed more than the subspaces.
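Before comparing with the Bauer–Fike bound more concretely, note that the right-hand side of (4.7) is directly computable from $A$ and $\Delta A$. The following hedged sketch (Python/NumPy; the function name and the convention of matching eigenvalue blocks by magnitude are our choices) evaluates the first bound in (4.7); applied to Example 4.3.1 it stays at the level $O(\epsilon_1)$ as $\epsilon \to 0$, consistent with Remark 4.4.2.

```python
import numpy as np

def bound_4_7(A, Delta, r):
    """Evaluate the first upper bound in (4.7) for the invariant subspace
    associated with the r largest-magnitude eigenvalues of A."""
    vals, X = np.linalg.eig(A)
    order = np.argsort(-np.abs(vals))
    vals, X = vals[order], X[:, order]
    X = X / np.linalg.norm(X, axis=0)          # normalized eigenvector columns
    lam2 = vals[r:]                            # spectrum of Lambda_2
    V = np.linalg.inv(X).conj().T              # V^* X = I
    s = np.linalg.svd(V[:, r:], compute_uv=False)
    kappa_V2 = s.max() / s.min()               # kappa_2(V_2)
    vals_t = np.linalg.eigvals(A + Delta)
    lam1_t = vals_t[np.argsort(-np.abs(vals_t))][:r]   # spectrum of Lambda_1~
    a = np.linalg.norm(A, 2) + np.linalg.norm(Delta, 2) + np.abs(lam2).max()
    prod = np.prod([1 + a / np.abs(lt - lam2).min() for lt in lam1_t])
    return kappa_V2 * np.linalg.norm(Delta, 'fro') / a * prod
```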
To be a bit more concrete, for the $A$ in Example 4.3.1, the Bauer–Fike bound on the eigenvalues implies $|\lambda - \widetilde\lambda| \sim O\!\left(\frac{\epsilon_1}{\sqrt\epsilon}\right)$, while our $\sin\Theta$ bound on the invariant subspace $\mathcal{X}_1$ implies $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| = O(\epsilon_1)$. Of course, for the invariant subspaces to be detectable after perturbation, we need the eigengap to persist in the perturbed matrix, which can be guaranteed by the Bauer–Fike theorem if we require $2\epsilon_1 \lesssim \sqrt\epsilon$. Under this requirement, the Bauer–Fike theorem indicates that the perturbed eigenvalues are at most $O(1)$ away from the original ones, while our result indicates that the perturbed invariant subspace is only $O(\sqrt\epsilon)$ away from the original one. Hence the invariant subspace is much more stable than the eigenvalues in this case.

4.4.1 Tightness of the bound (4.7)

At first glance, the bound in (4.7) contains the $r$th power of the eigengap in the denominator, which seems unusual. In this section, we demonstrate that it is actually tight for general matrices.

The dependence on $\delta_\lambda^r$ is tight. We give examples showing that the dependence on $\delta_\lambda$ is tight, i.e., the $r$th power of $\delta_\lambda$ in the upper bound can be attained.

Example 4.4.3. We first present an example for $r = 2$:
\[
A = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1-\delta & 0 \\ 0 & 0 & 1-2\delta \end{bmatrix}, \qquad \Delta A = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & \epsilon & 0 \end{bmatrix}.
\]
Here we set $\epsilon = \min\{o(1), O(\delta^2)\}$ and $0 < \delta < 1$. We consider the perturbation of the subspace $\mathcal{X}_1$ spanned by the two eigenvectors of $A$ with the largest eigenvalues, so $r := \dim(\mathcal{X}_1) = 2$. It is immediate to verify that $\mathcal{X}_1 = \mathrm{span}(e_1, e_2)$ ($e_i$ is the $i$th canonical basis vector), $V_2 = e_3$, and the eigengap between the two largest eigenvalues and the smallest one is $\delta_\lambda = \delta$. With the given perturbation, one can verify that $\lambda_i = \widetilde\lambda_i$ for $i = 1, 2, 3$. The perturbed subspace can also be calculated:
\[
\widetilde{\mathcal{X}}_1 = \mathrm{span}\left\{ \begin{bmatrix} 1 \\[2pt] \frac{1}{\delta} \\[2pt] \frac{\epsilon}{2\delta^2} \end{bmatrix}, \begin{bmatrix} 0 \\[2pt] 1 \\[2pt] \frac{\epsilon}{\delta} \end{bmatrix} \right\}.
\]
In $\widetilde{\mathcal{X}}_1$, we pick a special vector that has a large angle with the original subspace $\mathcal{X}_1$: pick $x = \left[1,\, 0,\, -\frac{\epsilon}{2\delta^2}\right]^T \in \widetilde{\mathcal{X}}_1$. One can verify that the sine of the angle between $x$ and $\mathcal{X}_1$ reaches $\Omega\!\left(\frac{\epsilon}{\delta^2}\right) = \Omega\!\left(\kappa_2(V_2)\frac{\|\Delta A\|_F}{\delta_\lambda^r}\right)$, since $\kappa_2(V_2) = 1$, $r = 2$, and $\|\Delta A\|_F = \epsilon$ (the Big-$\Omega$ notation was defined in Section 1.1). Since $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\|$ is at least as large as the sine of the angle between $x$ and $\mathcal{X}_1$, we get $\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| \ge \Omega\!\left(\kappa_2(V_2)\frac{\|\Delta A\|_F}{\delta_\lambda^r}\right)$, and therefore the $r$th power of the eigengap is attained for the case $r = 2$. The above construction can be easily generalized to any $r > 2$.
Example 4.4.4. For general $r$, let
\[
A = I_{r+1} - \begin{bmatrix}
0 & & & & \\
-1 & \delta & & & \\
 & \ddots & \ddots & & \\
 & & -1 & (r-1)\delta & \\
 & & & 0 & r\delta
\end{bmatrix}, \qquad
\Delta A = \begin{bmatrix}
0 & \cdots & \cdots & 0 \\
\vdots & & & \vdots \\
0 & \cdots & 0 & 0 \\
0 & \cdots & \epsilon & 0
\end{bmatrix},
\]
where $\Delta A \in \mathbb{R}^{(r+1)\times(r+1)}$ has its only nonzero entry $\epsilon$ in position $(r+1, r)$; except for this $\epsilon$, all entries of the perturbation matrix $\Delta A$ are zero. Again, we set $\epsilon = \min\{o(1), O(\delta^r)\}$, $0 < \delta < 1$.

Consider the perturbation of the subspace spanned by the eigenvectors associated with the largest $r$ eigenvalues, which is the subspace $\mathcal{X}_1 = \mathrm{span}(e_1, e_2, \ldots, e_r)$. One can easily verify that $V_2 = e_{r+1}$, and the eigengap is still $\delta_\lambda = \delta$. We can again pick a special vector in the perturbed invariant subspace $\widetilde{\mathcal{X}}_1$:
\[
x = \left[1,\, 0,\, 0,\, \cdots,\, 0,\, \frac{(-1)^{r-1}\epsilon}{r!\,\delta^r}\right]^T.
\]
The sine of the angle between $x$ and $\mathcal{X}_1$ then reaches $\Omega\!\left(\frac{\epsilon}{\delta^r}\right) = \Omega\!\left(\kappa_2(V_2)\frac{\|\Delta A\|_F}{\delta_\lambda^r}\right)$. Therefore the power $r$ of the eigengap in the denominator of (4.7) is attained.

The $\kappa_2(V_2)$ in (4.7) cannot be removed. The bound in Theorem 4.4.1 successfully got rid of $\kappa_2(X_1)$. The next example shows that we cannot further remove $\kappa_2(V_2)$ from it.

Example 4.4.5. We first consider a matrix of dimension three:
\[
A = \begin{bmatrix} 1+\delta & 0 & 0 \\ 0 & 1 & 0 \\ 0 & \frac{1}{2} & 1-\delta_1 \end{bmatrix}, \qquad \Delta A = \begin{bmatrix} 0 & 0 & 0 \\ \epsilon & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix},
\]
where $\epsilon = \min\{o(1), O(\delta^2)\}$ and $0 < \delta, \delta_1 \ll 1$. Consider $\mathcal{X}_1$ to be the one-dimensional subspace spanned by the eigenvector $[1, 0, 0]^T$ of $A$ associated with the largest eigenvalue, so $r = 1$. Under the given perturbation, one can check that the perturbed eigenvector associated with the largest eigenvalue is $\left[1,\, \frac{\epsilon}{\delta},\, \frac{\epsilon}{2\delta(\delta+\delta_1)}\right]^T$, so $\widetilde{\mathcal{X}}_1 = \mathrm{span}\left\{\left[1,\, \frac{\epsilon}{\delta},\, \frac{\epsilon}{2\delta(\delta+\delta_1)}\right]^T\right\}$. As a result,
\[
\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| = \Omega\!\left(\frac{\epsilon}{\delta(\delta+\delta_1)}\right).
\]
It is also immediate that $\kappa_2(V_2) = \Omega\!\left(\frac{1}{\delta_1}\right)$, $\delta_\lambda = \delta$, $\|\Delta A\|_F = \epsilon$, and $\|A\| = O(1)$. Plugging these into (4.7), we see that our upper bound is $O\!\left(\frac{\epsilon}{\delta\delta_1}\right)$, which indeed bounds the actual $\sin\Theta$ angle. But if $\kappa_2(V_2)$ were absent from the bound (4.7), then (4.7) would only be $O\!\left(\frac{\epsilon}{\delta}\right)$, which is smaller than the actual $\sin\Theta$ angle. Therefore the appearance of $\kappa_2(V_2)$ is necessary.

The same idea allows us to construct such examples in any dimension. Specifically, for any dimension $n$, we can define
\[
A = \begin{bmatrix} 1+\delta & & & \\ & 1 & & \\ & \frac{1}{2} & 1-\delta_1 & \\ & & & (1-2\delta_1)\, I_{n-3} \end{bmatrix}, \qquad
\Delta A = \begin{bmatrix} 0 & \cdots & 0 \\ \epsilon & & \vdots \\ \vdots & & \vdots \\ 0 & \cdots & 0 \end{bmatrix}.
\]
Here the perturbation matrix $\Delta A$ contains only one nonzero element, at its $(2,1)$ entry. Let $\mathcal{X}_1$ be the subspace spanned by the first eigenvector, so again $r = 1$. Direct calculation gives $\mathcal{X}_1 = \mathrm{span}(e_1)$; the eigenvector of $\widetilde A$ associated with the largest eigenvalue is $\left[1,\, \frac{\epsilon}{\delta},\, \frac{\epsilon}{2\delta(\delta+\delta_1)},\, 0, \ldots, 0\right]^T$, hence $\widetilde{\mathcal{X}}_1 = \mathrm{span}\left\{\left[1,\, \frac{\epsilon}{\delta},\, \frac{\epsilon}{2\delta(\delta+\delta_1)},\, 0, \ldots, 0\right]^T\right\}$. We also have $\kappa_2(V_2) = \Omega\!\left(\frac{1}{\delta_1}\right)$, $\|\Delta A\|_F = \epsilon$, and $\|A\| = O(1)$.
Then the same discussion implies that the appearance of $\kappa_2(V_2)$ in the upper bound is necessary.

A remark on the $a$ in the bound. Observe that the bound in (4.7) also contains $a$, which is essentially the spectral norm of $A$. We argue that the presence of $a$ is necessary, as it ensures that the bound is scaling invariant. More specifically, replacing $A$ and $\widetilde A$ by $tA$ and $t\widetilde A$ for any scalar $t \ne 0$, we see that our bound in (4.7) does not change, which matches the fact that the angle between the original and the perturbed subspaces is invariant to a universal scaling.

4.4.2 Proof of the theorem

In order to prove Theorem 4.4.1, we first state an equivalent expression of $\|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\|$:
\[
\|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\| = \|Q_{V_2}^* Q_{\widetilde X_1}\|.
\]
Here $\widetilde X_1 = Q_{\widetilde X_1} R_{\widetilde X_1}$ and $V_2 = Q_{V_2} R_{V_2}$ are the QR decompositions of $\widetilde X_1$ and $V_2$, respectively. The proof is based on the following lemma from Li (1994).

Lemma 4.4.6 (Lemma 2.1 in Li (1994)). Let $U_1, \widetilde U_1 \in \mathbb{C}^{n\times r}$ ($1 \le r \le n-1$) with $U_1^* U_1 = \widetilde U_1^* \widetilde U_1 = I$, and let $\mathcal{X}_1 = \mathrm{span}(U_1)$ and $\widetilde{\mathcal{X}}_1 = \mathrm{span}(\widetilde U_1)$. If $\widetilde U = [\widetilde U_1, \widetilde U_2]$ is a unitary matrix, then
\[
\|\sin\Theta(\mathcal{X}_1, \widetilde{\mathcal{X}}_1)\| = \|\widetilde U_2^* U_1\|.
\]
In Lemma 4.4.6, let $U_1 = Q_{\widetilde X_1}$, $\widetilde U_1 = Q_{X_1}$, and $\widetilde U = [Q_{X_1}, Q_{V_2}]$. Noticing that $V^* X = I$, we can verify that $\widetilde U^* \widetilde U = I$. Therefore, it holds that $\|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\| = \|Q_{V_2}^* Q_{\widetilde X_1}\|$.

The next lemma gives an equivalent expression of $Q_{V_2}^* Q_{\widetilde X_1}$.

Lemma 4.4.7. Using the notation of Section 4.2, it holds that
\[
Q_{V_2}^* Q_{\widetilde X_1} = (R_{V_2}^{-1})^* \left( F \circ \left( V_2^*\, \Delta A\, \widetilde X_1 \right) \right) R_{\widetilde X_1}^{-1}, \tag{4.9}
\]
where $\circ$ denotes the Hadamard product, and $F \in \mathbb{C}^{(n-r)\times r}$ is defined as $F_{i,j} = \left(\widetilde\lambda_j - \lambda_{r+i}\right)^{-1}$ for $i = 1, \ldots, n-r$, $j = 1, \ldots, r$.

Lemma 4.4.7 has been implicitly derived in Li (1998); for completeness, we provide its proof here. An alternative proof using complex analysis can be found in the appendix.

Proof: Since $X^{-1} A X = \Lambda$ and $\widetilde X^{-1} \widetilde A \widetilde X = \widetilde\Lambda$, we have
\[
X^{-1}\, \Delta A\, \widetilde X = -X^{-1} (A - \widetilde A) \widetilde X = -\Lambda X^{-1} \widetilde X + X^{-1} \widetilde X \widetilde\Lambda.
\]
Considering the $(n-r) \times r$ block in the lower-left corner of this equation, we have
\[
V_2^*\, \Delta A\, \widetilde X_1 = \bar F \circ \left( V_2^* \widetilde X_1 \right), \tag{4.10}
\]
where $\bar F_{i,j} = \widetilde\lambda_j - \lambda_{r+i}$, $1 \le i \le n-r$, $1 \le j \le r$. Let $F = 1/\bar F$, where the division is carried out elementwise; then (4.10) becomes
\[
V_2^* \widetilde X_1 = F \circ \left( V_2^*\, \Delta A\, \widetilde X_1 \right).
\]
Last, we replace $V_2^*$ and $\widetilde X_1$ on the left-hand side by their QR decompositions and move the $R$ factors to the right-hand side to obtain the equation in the statement of this lemma. □

Denoting $M = \left( F \circ \left( V_2^*\, \Delta A\, \widetilde X_1 \right) \right) R_{\widetilde X_1}^{-1}$, by Lemma 4.4.7 we have
\[
\left\|\sin\Theta(Q_{X_1}, Q_{\widetilde X_1})\right\| = \left\| Q_{V_2}^* Q_{\widetilde X_1} \right\| = \left\| (R_{V_2}^{-1})^* M \right\| \le \frac{\|M\|}{\sigma_{\min}(R_{V_2})}. \tag{4.11}
\]
In order to bound $\|M\|$, we establish the following lemma, which gives an equivalent expression of $M$.

Lemma 4.4.8. Denote $M = [m_1, m_2, \ldots, m_{n-r}]^*$.
Then for 1 โ‰ค ๐‘– โ‰ค ๐‘› โˆ’ ๐‘Ÿ, the ๐‘–th row of ๐‘€ can be expressed as 1  โˆ— ฮ”๐ด ๐ดห† ๐‘Ÿโˆ’1 โˆ’ ๐œŽ ๐ดห† ๐‘Ÿโˆ’2 + ๐œŽ ๐ดห† ๐‘Ÿโˆ’3 โˆ’ ยท ยท ยท + (โˆ’1) ๐‘Ÿโˆ’1 ๐œŽ  ๐‘š๐‘–โˆ— = ๐‘‰2,๐‘– 1 2 ๐‘Ÿโˆ’1 ๐ผ๐‘› ๐‘„ ๐‘‹ e , (4.12) (โˆ’1) ๐‘Ÿ+1 ๐œŽ๐‘Ÿ 1 where ๐‘‰2,๐‘– is the ๐‘–th column of ๐‘‰2 , ๐ดห† = ๐ด e โˆ’ ๐œ†๐‘–+๐‘Ÿ ๐ผ๐‘› , ๐ผ๐‘› is the identity matrix of size ๐‘›, ๐œ†๐‘–+๐‘Ÿ is the ๐‘–th diagonal element in ฮ›2 , ๐œŽ๐‘˜ is the homogeneous symmetric polynomial of order ๐‘˜ in ๐‘Ÿ variables, that is โˆ‘๏ธ ๐œŽ๐‘˜ = ๐œ†ห†๐‘–1 ๐œ†ห†๐‘–2 ยท ยท ยท ๐œ†ห†๐‘– ๐‘˜ , ๐‘˜ = 1, 2, ..., ๐‘Ÿ. 1โ‰ค๐‘– 1 <๐‘–2 ,...,๐‘– ๐‘˜โˆ’1 <๐‘– ๐‘˜ โ‰ค๐‘Ÿ Here ๐œ†ห† ๐‘— := ๐œ†e๐‘— โˆ’ ๐œ†๐‘–+๐‘Ÿ , ๐‘— = 1, ..., ๐‘Ÿ, and ๐œ† e๐‘— the ๐‘—th diagonal element in e ฮ›1 . By Assumption 2, ๐œ†ห† ๐‘— โ‰  0. Proof of Lemma 4.4.8: Let ๐‘๐‘–โˆ— be the ๐‘–th row of the (๐‘› โˆ’ ๐‘Ÿ) ร— ๐‘Ÿ matrix ๐‘‰2โˆ— ฮ”๐ด. Then by Lemma 4.4.7 , the ๐‘–th row of ๐‘€ is ๏ฃฎ 1 ๏ฃน ๏ฃฏ ๏ฃบ ๏ฃฏ ๐œ†e โˆ’๐œ†๐‘–+๐‘Ÿ ๏ฃฏ 1 ๏ฃบ  ๏ฃบ โˆ— โˆ— . ๏ฃบ โˆ’1 ๐‘š ๐‘– = ๐‘ ๐‘– ๐‘‹1 ๏ฃฏ . (4.13) ๏ฃฏ e . ๏ฃบ ๐‘…e . ๏ฃฏ ๏ฃบ ๐‘‹1 ๏ฃฏ ๏ฃบ ๏ฃฏ 1 ๏ฃบ ๏ฃฏ e๐‘Ÿ โˆ’๐œ†๐‘–+๐‘Ÿ ๏ฃบ ๐œ† ๏ฃฐ ๏ฃป Let ๐‘ฆ be an arbitrary unit vector, and let ๐œ†ห† ๐‘— := ๐œ† e๐‘— โˆ’๐œ†๐‘–+๐‘Ÿ , for ๐‘— = 1, ..., ๐‘Ÿ. In addition, define ๐‘ = ๐‘… โˆ’1 ๐‘ฆ, ๐‘‹ e 1 which means โˆฅ๐‘‹ e1 ๐‘โˆฅ = ๐‘„ e ๐‘ฆ = 1. ๐‘‹1 (4.14) Then (4.13) yields ๐‘Ÿ โˆ‘๏ธ 1 ๐‘š๐‘–โˆ— ๐‘ฆ = ๐‘๐‘–โˆ— ยญ (4.15) ยฉ ยช ๐‘ ๐‘—e ๐‘ฅ๐‘—ยฎ, ๐‘—=1 ๐œ†ห† ๐‘— ยซ ยฌ where for 1 โ‰ค ๐‘— โ‰ค ๐‘Ÿ, e ๐‘ฅ ๐‘— is the ๐‘—th column of ๐‘‹ e1 . Next, we derive an equivalent representation of the summation in (4.15) with the help of the characteristic polynomial. Define ๐ดห† = ๐ด e โˆ’ ๐œ†๐‘–+๐‘Ÿ ๐ผ, and 86 define its characteristic polynomial ๐‘ž(๐‘ง) = ๐‘ง โˆ’ ๐œ†ห† 1 ๐‘ง โˆ’ ๐œ†ห† 2 ยท ยท ยท ๐‘ง โˆ’ ๐œ†ห† ๐‘Ÿ .    Expanding ๐‘ž(๐‘ง) leads to ๐‘ž(๐‘ง) = ๐‘ง๐‘Ÿ โˆ’ ๐œŽ1 ๐‘ง๐‘Ÿโˆ’1 + ๐œŽ2 ๐‘ง๐‘Ÿโˆ’2 โˆ’ ยท ยท ยท + (โˆ’1) ๐‘Ÿโˆ’1 ๐œŽ๐‘Ÿโˆ’1 ๐‘ง + (โˆ’1) ๐‘Ÿ ๐œŽ๐‘Ÿ , (4.16) where ๐œŽ๐‘˜ , ๐‘˜ = 1, ..., ๐‘Ÿ are the homogeneous symmetric polynomials of order ๐‘˜ in ๐‘Ÿ variables, that is โˆ‘๏ธ ๐œŽ๐‘˜ = ๐œ†ห†๐‘–1 ๐œ†ห†๐‘–2 ยท ยท ยท ๐œ†ห†๐‘– ๐‘˜ . 1โ‰ค๐‘– 1 <๐‘– 2 ,...,๐‘– ๐‘˜โˆ’1 <๐‘– ๐‘˜ โ‰ค๐‘Ÿ Notice that ๐‘‹e1 is also invariant to ๐ด. ห† Since       ๐ดห† โˆ’ ๐œ†ห† ๐‘— ๐ผ ๐‘‹ e1 = ๐ด eโˆ’ ๐œ†e๐‘— ๐‘‹ e1 = ๐‘‹ e1 e ฮ›1 โˆ’ ๐œ† e๐‘— ๐ผ๐‘› , 1 โ‰ค ๐‘— โ‰ค ๐‘Ÿ,      ห† ๐‘‹ then ๐‘ž( ๐ด) e1 = ๐‘‹e1 e ฮ›1 โˆ’ ๐œ† e1 ๐ผ e ฮ›1 โˆ’ ๐œ† e2 ๐ผ ยท ยท ยท e ฮ›1 โˆ’ ๐œ† e๐‘Ÿ ๐ผ = 0. This means for any ๐‘ โˆˆ C๐‘Ÿ , we have   ห† ห† ห† ห†  ๐‘Ÿ ๐‘Ÿโˆ’1 ๐‘Ÿโˆ’1 ๐‘Ÿ 0 = ๐‘ž ๐ด ๐‘‹1 ๐‘ = ๐ด โˆ’ ๐œŽ1 ๐ด e โˆ’ ยท ยท ยท + (โˆ’1) ๐œŽ๐‘Ÿโˆ’1 ๐ด + (โˆ’1) ๐œŽ๐‘Ÿ ๐ผ๐‘› ๐‘‹ e1 ๐‘. Let us move the last term in the right-hand side to the left and for the terms left on the right, pull one ๐ดห† out of the bracket,   (โˆ’1) ๐‘Ÿ+1 ๐œŽ๐‘Ÿ ๐‘‹หœ 1 ๐‘ = ๐ดห† ๐‘Ÿโˆ’1 โˆ’ ๐œŽ1 ๐ดห† ๐‘Ÿโˆ’2 + ๐œŽ2 ๐ดห† ๐‘Ÿโˆ’3 โˆ’ ยท ยท ยท + (โˆ’1) ๐‘Ÿโˆ’1 ๐œŽ๐‘Ÿโˆ’1 ๐ผ๐‘› ๐ดห† ๐‘‹ e1 ๐‘. (4.17) Now let us take ๐‘ to be the vector consisting of ๐‘ ๐‘— = ๐‘ ๐‘— /๐œ†ห† ๐‘— , for ๐‘— = 1, ..., ๐‘Ÿ, then ๐ดห† ๐‘‹ e1 ๐‘ = ๐‘‹ e1 ๐‘ and e1 ๐‘ = ร๐‘Ÿ ๐‘‹ 1 ๐‘ฅ ๐‘ . 
Plugging these two relations into (4.17), we get
\[
(-1)^{r+1}\sigma_r \left( \sum_{j=1}^{r} c_j\, \frac{1}{\hat\lambda_j}\, \widetilde x_j \right) = \left( \hat A^{r-1} - \sigma_1 \hat A^{r-2} + \sigma_2 \hat A^{r-3} - \cdots + (-1)^{r-1}\sigma_{r-1} I_n \right) \widetilde X_1 c,
\]
or equivalently,
\[
\sum_{j=1}^{r} \frac{1}{\hat\lambda_j}\, c_j\, \widetilde x_j = \frac{\left( \hat A^{r-1} - \sigma_1 \hat A^{r-2} + \sigma_2 \hat A^{r-3} - \cdots + (-1)^{r-1}\sigma_{r-1} I_n \right) \widetilde X_1 c}{(-1)^{r+1}\sigma_r}. \tag{4.18}
\]
Plugging this back into the formula (4.15) for $m_i^* y$, we get
\[
m_i^* y = \frac{1}{(-1)^{r+1}\sigma_r}\, V_{2,i}^*\, \Delta A \left( \hat A^{r-1} - \sigma_1 \hat A^{r-2} + \sigma_2 \hat A^{r-3} - \cdots + (-1)^{r-1}\sigma_{r-1} I_n \right) Q_{\widetilde X_1}\, y.
\]
The equation above holds for arbitrary $y \in \mathbb{C}^r$, hence (4.12) holds. □

Next, we prove Theorem 4.4.1.

Proof of Theorem 4.4.1: Denote $b_i^* = V_{2,i}^* \Delta A$. By Lemma 4.4.8 we have
\[
\|m_i^*\| = \left\| b_i^*\, \frac{\hat A^{r-1} - \sigma_1 \hat A^{r-2} + \sigma_2 \hat A^{r-3} - \cdots + (-1)^{r-1}\sigma_{r-1} I_n}{\sigma_r}\, Q_{\widetilde X_1} \right\|
\le \|b_i^*\|\, \frac{\left\| \hat A^{r-1} - \sigma_1 \hat A^{r-2} + \sigma_2 \hat A^{r-3} - \cdots + (-1)^{r-1}\sigma_{r-1} I_n \right\|}{|\sigma_r|}
\le \|b_i^*\|\, \frac{\|\hat A\|^{r-1} + |\sigma_1|\,\|\hat A\|^{r-2} + |\sigma_2|\,\|\hat A\|^{r-3} + \cdots + |\sigma_{r-1}|}{|\sigma_r|}.
\]
Notice that
\[
|\sigma_k| = \left| \sum_{1 \le i_1 < i_2 < \cdots < i_k \le r} \hat\lambda_{i_1}\hat\lambda_{i_2}\cdots\hat\lambda_{i_k} \right| \le \sum_{1 \le i_1 < i_2 < \cdots < i_k \le r} |\hat\lambda_{i_1}| \cdot |\hat\lambda_{i_2}| \cdots |\hat\lambda_{i_k}|.
\]
Define an auxiliary polynomial
\[
\bar q(z) = \left(z + |\hat\lambda_1|\right)\left(z + |\hat\lambda_2|\right)\cdots\left(z + |\hat\lambda_r|\right) = z^r + \bar\sigma_1 z^{r-1} + \bar\sigma_2 z^{r-2} + \cdots + \bar\sigma_r.
\]
Here $\bar\sigma_k = \sum_{1 \le i_1 < i_2 < \cdots < i_k \le r} |\hat\lambda_{i_1}| \cdot |\hat\lambda_{i_2}| \cdots |\hat\lambda_{i_k}| \ge |\sigma_k|$ for $1 \le k \le r$, and $\bar\sigma_r = |\hat\lambda_1| \cdot |\hat\lambda_2| \cdots |\hat\lambda_r| = |\sigma_r|$. Then
\[
\|m_i\| \le \|b_i^*\|\, \frac{\|\hat A\|^{r-1} + \bar\sigma_1 \|\hat A\|^{r-2} + \bar\sigma_2 \|\hat A\|^{r-3} + \cdots + \bar\sigma_{r-1}}{\bar\sigma_r}
\le \|b_i^*\|\, \frac{\bar q(a) - \bar\sigma_r}{a\, \bar\sigma_r}
= \frac{\|b_i\|}{a} \left( \prod_{j=1}^{r} \left( 1 + \frac{a}{|\hat\lambda_j|} \right) - 1 \right)
\le \frac{\|b_i\|}{a} \prod_{j=1}^{r} \left( 1 + \frac{a}{\min_{\lambda_k \in S(\Lambda_2)} |\widetilde\lambda_j - \lambda_k|} \right),
\]
where $a = \|A\| + \|\Delta A\| + \rho(\Lambda_2) \ge \|\hat A\|$. Combining the bounds for all $1 \le i \le n-r$ leads to
\[
\|M\| \le \frac{\left\| V_2^* \Delta A \right\|_F}{a} \prod_{j=1}^{r} \left( 1 + \frac{a}{\min_{\lambda_k \in S(\Lambda_2)} |\widetilde\lambda_j - \lambda_k|} \right) \le \frac{\|V_2\|\, \|\Delta A\|_F}{a} \prod_{j=1}^{r} \left( 1 + \frac{a}{\min_{\lambda_k \in S(\Lambda_2)} |\widetilde\lambda_j - \lambda_k|} \right).
\]
By (4.11), we further obtain (4.7). □
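The identity in Lemma 4.4.7 is exact and easy to check numerically. The following sketch (Python/NumPy; the upper-triangular test matrix with distinct diagonal is our choice, so that the eigenvalue ordering between $A$ and $\widetilde A$ is unambiguous) verifies (4.9) and the $\sin\Theta$ expression to machine precision:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 6, 2
# Upper-triangular test matrix: eigenvalues are exactly its diagonal 1..n.
A = np.triu(rng.standard_normal((n, n)), 1) + np.diag(np.arange(1.0, n + 1))
Delta = 1e-6 * rng.standard_normal((n, n))

def sorted_eig(M):
    vals, X = np.linalg.eig(M)
    order = np.argsort(-vals.real)            # descending real part
    return vals[order], X[:, order]

lam, X = sorted_eig(A)
lam_t, Xt = sorted_eig(A + Delta)
V = np.linalg.inv(X).conj().T                 # V^* X = I
V2, X1_t = V[:, r:], Xt[:, :r]

F = 1.0 / (lam_t[None, :r] - lam[r:, None])   # F_{ij} = 1/(lam~_j - lam_{r+i})
Qv, Rv = np.linalg.qr(V2)
Qx, Rx = np.linalg.qr(X1_t)

lhs = Qv.conj().T @ Qx                        # Q_{V2}^* Q_{X1~}
rhs = np.linalg.inv(Rv).conj().T @ (F * (V2.conj().T @ Delta @ X1_t)) @ np.linalg.inv(Rx)
print(np.max(np.abs(lhs - rhs)))              # ~ machine precision: (4.9) holds
print(np.linalg.norm(lhs, 2))                 # = ||sin Theta(Q_{X1}, Q_{X1~})||
```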
BIBLIOGRAPHY

Abbe, E. (2017). Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446–6531.

Abbe, E., Fan, J., and Wang, K. (2022). An ℓp theory of PCA and spectral clustering. The Annals of Statistics, 50(4):2359–2385.

Abbe, E., Fan, J., Wang, K., and Zhong, Y. (2020). Entrywise eigenvector analysis of random matrices with low expected rank. The Annals of Statistics, 48(3):1452–1474.

Agterberg, J., Lubberts, Z., and Priebe, C. E. (2022). Entrywise estimation of singular vectors of low-rank matrices with heteroskedasticity and dependence. IEEE Transactions on Information Theory, 68(7):4618–4650.

Athreya, A., Tang, M., Park, Y., and Priebe, C. E. (2021). On estimation and inference in latent structure random graphs. Statistical Science, 36(1):68–88.

Balasubramanian, M. and Schwartz, E. L. (2002). The isomap algorithm and topological stability. Science, 295(5552):7–7.

Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Bhatia, R. (2013). Matrix analysis, volume 169. Springer Science & Business Media.

Cai, C., Li, G., Chi, Y., Poor, H. V., and Chen, Y. (2021). Subspace estimation from unbalanced and incomplete data matrices: ℓ2,∞ statistical guarantees. The Annals of Statistics, 49(2):944–967.

Cai, J. F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982.

Cai, T. T. and Zhang, A. (2018). Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89.

Candes, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust Principal Component Analysis? J. ACM, 58(3):11:1–11:37.

Candes, E. J. and Plan, Y. (2010). Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936.

Candes, E. J. and Recht, B. (2012). Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119.

Cape, J., Tang, M., and Priebe, C. E. (2019a). Signal-plus-noise matrix models: eigenvector deviations and fluctuations. Biometrika, 106(1):243–250.

Cape, J., Tang, M., and Priebe, C. E. (2019b). The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics. The Annals of Statistics, 47(5):2405–2439.

Chatelin, F. (2011). Spectral approximation of linear operators. SIAM.

Chen, Y., Cheng, C., and Fan, J. (2021a). Asymmetry helps: Eigenvalue and eigenvector analyses of asymmetrically perturbed low-rank matrices. The Annals of Statistics, 49(1):435.

Chen, Y., Chi, Y., Fan, J., Ma, C., et al. (2021b). Spectral methods for data science: A statistical perspective. Foundations and Trends in Machine Learning, 14(5):566–806.

Cheng, C., Wei, Y., and Chen, Y. (2020). Inference for linear forms of eigenvectors under minimal eigenvalue separation: Asymmetry and heteroscedasticity. arXiv preprint arXiv:2001.04620.

Chin, P., Rao, A., and Vu, V. (2015). Stochastic block model and community detection in sparse graphs: A spectral algorithm with optimal rate of recovery. In Conference on Learning Theory, pages 391–423. PMLR.

Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46.

Demmel, J. W. (1986). Computing stable eigendecompositions of matrices. Linear Algebra and its Applications, 79:163–193.

Deutsch, S., Ortega, A., and Medioni, G. (2016). Manifold denoising based on spectral graph wavelets. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4673–4677. IEEE.

Donoho, D. and Gavish, M. (2014). Minimax risk of matrix denoising by singular value thresholding. The Annals of Statistics, 42(6):2413–2440.
Dopico, F. M. (2000). A note on sin θ theorems for singular subspace variations. BIT Numerical Mathematics, 40(2):395–403.

Dopico, F. M. and Moro, J. (2002). Perturbation theory for simultaneous bases of singular subspaces. BIT Numerical Mathematics, 42(1):84–109.

Du, C., Sun, J., Zhou, S., and Zhao, J. (2013). An outlier detection method for robust manifold learning. In Yin, Z., Pan, L., and Fang, X., editors, Proceedings of The Eighth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA), 2013, Advances in Intelligent Systems and Computing, pages 353–360. Springer Berlin Heidelberg.

Eldridge, J., Belkin, M., and Wang, Y. (2018). Unperturbed: spectral analysis beyond Davis–Kahan. In Algorithmic Learning Theory, pages 321–358. PMLR.

Eppel, S. (2006). Using curvature to distinguish between surface reflections and vessel contents in computer vision based recognition of materials in transparent vessels. arXiv preprint arXiv:1602.00177.

Fan, J., Fan, Y., Han, X., and Lv, J. (2022). Asymptotic theory of eigenvectors for random matrices with diverging spikes. Journal of the American Statistical Association, 117(538):996–1009.

Flynn, P. J. and Jain, A. K. (1989). On reliable curvature estimation. Computer Vision and Pattern Recognition, 89:110–116.

Gavish, M. and Donoho, D. L. (2014). The optimal hard threshold for singular values is 4/√3. IEEE Transactions on Information Theory, 60(8):5040–5053.

Gohberg, I., Lancaster, P., and Rodman, L. (2006). Invariant subspaces of matrices with applications. SIAM.

Golub, G. H. and Wilkinson, J. H. (1976). Ill-conditioned eigensystems and the computation of the Jordan canonical form. SIAM Review, 18(4):578–619.

Greenbaum, A., Li, R. C., and Overton, M. L. (2020). First-order perturbation theory for eigenvalues and eigenvectors. SIAM Review, 62(2):463–482.

Haviv, M. and Ritov, Y. (1994). Bounds on the error of an approximate invariant subspace for non-self-adjoint matrices. Numerische Mathematik, 67(4):491–500.

He, Y., Tian, Y., Wang, M., Chen, F., Yu, L., Tang, M., Chen, C., Zhang, N., Kuang, B., and Prakash, A. (2023). Que2Engage: Embedding-based retrieval for relevant and engaging products at Facebook Marketplace. arXiv preprint arXiv:2302.11052.

Hein, M. and Maier, M. (2006). Manifold denoising. Advances in Neural Information Processing Systems, 19.

Ipsen, I. C. (2003). A note on unifying absolute and relative perturbation bounds. Linear Algebra and its Applications, 358(1-3):239–253.

Janson, S. (2016). Large deviation inequalities for sums of indicator variables. arXiv preprint arXiv:1609.00533.

Karow, M. and Kressner, D. (2014). On a perturbation bound for invariant subspaces of matrices. SIAM Journal on Matrix Analysis and Applications, 35(2):599–618.

Kato, T. (2013). Perturbation theory for linear operators, volume 132. Springer Science & Business Media.

Keshavan, R., Montanari, A., and Oh, S. (2009). Matrix completion from noisy entries. Advances in Neural Information Processing Systems, 22.

Knyazev, A. V. and Argentati, M. E. (2002). Principal angles between subspaces in an A-based scalar product: algorithms and perturbation estimates. SIAM Journal on Scientific Computing, 23(6):2008–2040.

Kobayashi, S. and Nomizu, K. (1996). Foundations of differential geometry. 2.

Krahmer, F., Lyu, H., Saab, R., Veselovska, A., and Wang, R. (2023). Quantization of bandlimited graph signals. In Fourteenth International Conference on Sampling Theory and Applications.
Lee, H., Battle, A., Raina, R., and Ng, A. (2006). Efficient sparse coding algorithms. Advances in Neural Information Processing Systems, 19.

Lee, J. M. (2012). Smooth manifolds. Springer.

Lei, L. (2019). Unified ℓ2→∞ eigenspace perturbation theory for symmetric random matrices. arXiv preprint arXiv:1909.04798.

Li, R. C. (1994). On perturbations of matrix pencils with real spectra. Mathematics of Computation, 62(205):231–265.

Li, R. C. (1998). Spectral variations and Hadamard products: Some problems. Linear Algebra and its Applications, 278(1-3):317–326.

Little, A., Xie, Y., and Sun, Q. (2018). An analysis of classical multidimensional scaling. arXiv preprint arXiv:1812.11954.

Löffler, M., Zhang, A. Y., and Zhou, H. H. (2021). Optimality of spectral clustering in the Gaussian mixture model. The Annals of Statistics, 49(5):2506–2530.

Luo, Y., Han, R., and Zhang, A. (2021). A Schatten-q low-rank matrix perturbation analysis via perturbation projection error bound. Linear Algebra and its Applications, 630:225–240.

Lyu, H., Sha, N., Qin, S., Yan, M., Xie, Y., and Wang, R. (2019). Manifold denoising by nonlinear robust principal component analysis. Advances in Neural Information Processing Systems, 32.

Lyu, H. and Wang, R. (2020a). An exact sin θ formula for matrix perturbation analysis and its applications. arXiv preprint arXiv:2011.07669.

Lyu, H. and Wang, R. (2020b). Sigma Delta quantization for images. arXiv preprint arXiv:2005.08487.

Lyu, H. and Wang, R. (2022). Perturbation of invariant subspaces for ill-conditioned eigensystem. arXiv preprint arXiv:2203.00068.

Martin, G. R. and Evans, M. J. (1975). Differentiation of clonal lines of teratocarcinoma cells: formation of embryoid bodies in vitro. Proceedings of the National Academy of Sciences, 72(4):1441–1445.

Meek, D. S. and Walton, D. J. (2000). On surface normal and Gaussian curvature approximations given data sampled from a smooth surface. Computer Aided Geometric Design, 17(6):521–543.

Moon, K., Dijk, D. V., Wang, Z., Gigante, S., Burkhardt, D. B., Chen, W. S., Yim, K., Elzen, A. v. d., Hirn, M. J., Coifman, R. R., Ivanova, N. B., Wolf, G., and Krishnaswamy, S. (2019). Visualizing Structure and Transitions for Biological Data Exploration. bioRxiv, page 120378.

Narayanamurthy, P. and Vaswani, N. (2020). Fast robust subspace tracking via PCA in sparse data-dependent noise. IEEE Journal on Selected Areas in Information Theory, 1(3):723–744.

Paulsen, V. (2002). Completely bounded maps and operator algebras. Number 78. Cambridge University Press.

Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572.

Pottmann, H., Wallner, J., Yang, Y. L., Lai, Y. K., and Hu, S. M. (2007). Principal curvatures from the integral invariant viewpoint. Computer Aided Geometric Design, 24(8):428–442.

Roweis, S. T. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326.

Sathe, S. and Aggarwal, C. (2016). LODES: Local density meets spectral outlier detection. In Proceedings of the 2016 SIAM International Conference on Data Mining, pages 171–179. SIAM.

Sha, N., Yan, M., and Lin, Y. (2019). Efficient seismic denoising techniques using robust principal component analysis. In SEG Technical Program Expanded Abstracts 2019, pages 2543–2547. Society of Exploration Geophysicists.
Stewart, G. (1990). Matrix perturbation theory.

Stewart, G. W. (1971). Error bounds for approximate invariant subspaces of closed linear operators. SIAM Journal on Numerical Analysis, 8(4):796–808.

Stewart, G. W. (1973). Error and perturbation bounds for subspaces associated with certain eigenvalue problems. SIAM Review, 15(4):727–764.

Stewart, M. (2006). Perturbation of the SVD in the presence of small singular values. Linear Algebra and its Applications, 419(1):53–77.

Tang, M. and Priebe, C. E. (2018). Limit theorems for eigenvectors of the normalized Laplacian for random graphs. The Annals of Statistics, 46(5):2360–2415.

Tanner, J. and Wei, K. (2013). Normalized iterative hard thresholding for matrix completion. SIAM Journal on Scientific Computing, 35(5):S104–S125.

Thompson, R. C. (1975). Singular value inequalities for matrix sums and minors. Linear Algebra and its Applications, 11(3):251–269.

Tong, W. S. and Tang, C. K. (2005). Robust estimation of adaptive tensors of curvature by tensor voting. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(3):434–449.

Tropp, J. A. et al. (2015). An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230.

Varah, J. M. (1970). Computing invariant subspaces of a general matrix when the eigensystem is poorly conditioned. Mathematics of Computation, 24(109):137–149.

Vaswani, N. and Narayanamurthy, P. (2017). Finite sample guarantees for PCA in non-isotropic and data-dependent noise. In 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 783–789. IEEE.

Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press.

Vu, T., Chunikhina, E., and Raich, R. (2021). Perturbation expansions and error bounds for the truncated singular value decomposition. Linear Algebra and its Applications, 627:94–139.

Wedin, P. (1972). Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111.

Yu, Y., Wang, T., and Samworth, R. J. (2015). A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323.

Yun, S. Y. and Proutiere, A. (2014). Accurate community detection in the stochastic block model via spectral algorithms. arXiv preprint arXiv:1412.7335.

Zhan, J. and Vaswani, N. (2015). Robust PCA with partial subspace knowledge. IEEE Transactions on Signal Processing, 63(13):3332–3347.

APPENDIX A

APPENDIX FOR CHAPTER 2

Proof of Lemma 2.6.3 (the full-rank case): We first bound $\|U_2 U_2^T \Delta A V_2 V_2^T \widetilde V_1 \widetilde\Sigma_1^{-1}\|_{2,\infty}$. To do so, we need Theorem 2.5.7 to obtain the expansion of $V_2^T \widetilde V_1$. Let us first check that in the setting of this lemma (i.e., $21\sigma\sqrt{\bar n} < \sigma_r(A) - \sigma_{r+1}(A)$), the condition of Theorem 2.5.7 is satisfied with high probability, that is, $\|\mathcal F\| < 1$, where
\[
\mathcal F\begin{pmatrix} C_1 \\ C_2 \end{pmatrix} = \begin{pmatrix} F_U^{21} \circ (\Sigma_2 \alpha_{22}^T C_1) + F_U^{21} \circ (\alpha_{22} C_2 \widetilde\Sigma_1^T) \\[4pt] F_V^{21} \circ (\alpha_{22}^T C_1 \widetilde\Sigma_1) + F_V^{21} \circ (\Sigma_2^T \alpha_{22} C_2) \end{pmatrix}.
\]
As discussed in the proof of Theorem 2.2.6, $\|\Delta A\| \le 3\sigma\sqrt{\bar n}$ with probability at least $1 - e^{-c\bar n}$ for some constant $c$.
By the assumption $21\sigma\sqrt{\bar n} < \sigma_r(A) - \sigma_{r+1}(A)$, we have with probability at least $1 - e^{-c\bar n}$,
\[
\left\|\mathcal F\begin{pmatrix} C_1 \\ C_2 \end{pmatrix}\right\|
\le \|F_U^{21} \circ (\Sigma_2\alpha_{22}^T C_1)\| + \|F_U^{21} \circ (\alpha_{22} C_2 \widetilde\Sigma_1^T)\| + \|F_V^{21} \circ (\alpha_{22}^T C_1 \widetilde\Sigma_1)\| + \|F_V^{21} \circ (\Sigma_2^T \alpha_{22} C_2)\|
\]
\[
\le \frac{\sigma_{r+1}}{\widetilde\sigma_r^2 - \sigma_{r+1}^2}\|\alpha_{22}\|\|C_1\| + \frac{\widetilde\sigma_r}{\widetilde\sigma_r^2 - \sigma_{r+1}^2}\|\alpha_{22}\|\|C_2\| + \frac{\widetilde\sigma_r}{\widetilde\sigma_r^2 - \sigma_{r+1}^2}\|\alpha_{22}\|\|C_1\| + \frac{\sigma_{r+1}}{\widetilde\sigma_r^2 - \sigma_{r+1}^2}\|\alpha_{22}\|\|C_2\|
\]
\[
= \left(\|C_1\| + \|C_2\|\right)\frac{\|\alpha_{22}\|}{\widetilde\sigma_r - \sigma_{r+1}}
\le \frac{\|\Delta A\|}{\sigma_r - \sigma_{r+1} - \|\Delta A\|}\cdot\sqrt{2\left(\|C_1\|^2 + \|C_2\|^2\right)}
\le \frac{\sqrt 2}{6}\sqrt{\|C_1\|^2 + \|C_2\|^2}.
\]
Here the second inequality is due to Lemma 2.5.4. Now we have $\|\mathcal F\| < 1$ with high probability, which enables us to use Theorem 2.5.7 to decompose $U_2 U_2^T \Delta A V_2 V_2^T \widetilde V_1 \widetilde\Sigma_1^{-1}$. Denote
\[
a_1 = F_U^{21} \circ (\Sigma_2 \alpha_{12}^T U_1^T \widetilde U_1), \quad a_2 = F_U^{21} \circ (\alpha_{21} V_1^T \widetilde V_1 \widetilde\Sigma_1^T), \quad a_3 = F_V^{21} \circ (\alpha_{12}^T U_1^T \widetilde U_1 \widetilde\Sigma_1), \quad a_4 = F_V^{21} \circ (\Sigma_2^T \alpha_{21} V_1^T \widetilde V_1),
\]
and
\[
f_1(X) = F_U^{21} \circ (\Sigma_2 \alpha_{22}^T X), \quad f_2(X) = F_U^{21} \circ (\alpha_{22} X \widetilde\Sigma_1^T), \quad f_3(X) = F_V^{21} \circ (\alpha_{22}^T X \widetilde\Sigma_1), \quad f_4(X) = F_V^{21} \circ (\Sigma_2^T \alpha_{22} X).
\]
By Theorem 2.5.7, each term in the expansion of $V_2^T \widetilde V_1$ is of the form
\[
f_{i_1}(f_{i_2}(\cdots(f_{i_k}(a_{i_0})))), \qquad 1 \le i_0, i_1, \ldots, i_k \le 4, \quad k = 0, 1, 2, \ldots.
\]
Now assume $i_1, \ldots, i_k$ and $k$ are fixed. Let $w$ be the $i$th row of $U_2\alpha_{22}\, f_{i_1}(f_{i_2}(\cdots(f_{i_k}(a_{i_0}))))\,\widetilde\Sigma_1^{-1}$, and let $b^T = u_i^T \alpha_{22}$, where $u_i^T$ is the $i$th row of $U_2$. Then $w = b^T f_{i_1}(f_{i_2}(\cdots(f_{i_k}(a_{i_0}))))\,\widetilde\Sigma_1^{-1}$.

Notice that $a_{i_0}$ and each $f_{i_s}$, $1 \le s \le k$, contains either $\Sigma_2$ or $\widetilde\Sigma_1$; let $h_{i_s} = 1$ if $f_{i_s}$ contains $\Sigma_2$, and $h_{i_s} = 0$ if $f_{i_s}$ contains $\widetilde\Sigma_1$. Likewise, let $h_{i_0} = 1$ if $a_{i_0}$ contains $\Sigma_2$, and $h_{i_0} = 0$ if $a_{i_0}$ contains $\widetilde\Sigma_1$. Also, let $m$ be the total number of times that $\widetilde\Sigma_1$ appears among the $f_{i_s}$ and $a_{i_0}$. Then
\[
h_{i_0} + h_{i_1} + \cdots + h_{i_k} + m = k + 1. \tag{A.1}
\]
Likewise, each $f_{i_s}$, $1 \le s \le k$, contains either $\alpha_{22}$ or $\alpha_{22}^T$. Let $d_{i_s} = \alpha_{22}$ if $f_{i_s}$ contains $\alpha_{22}$, and $d_{i_s} = \alpha_{22}^T$ if it contains $\alpha_{22}^T$. Also, let $\gamma = \alpha_{12}^T$ if $a_{i_0}$ contains $\alpha_{12}^T$, and $\gamma = \alpha_{21}$ if $a_{i_0}$ contains $\alpha_{21}$. Last, denote $\beta = V_1^T \widetilde V_1$ if $a_{i_0}$ contains $V_1^T \widetilde V_1$, and $\beta = U_1^T \widetilde U_1$ if $a_{i_0}$ contains $U_1^T \widetilde U_1$. For the $\gamma$ and $\beta$ defined above, let $\gamma_l^T$ be the $l$th row of $\gamma$, i.e., $\gamma = [\gamma_1^T; \gamma_2^T; \ldots; \gamma_{n-r}^T]$, and let $\beta_i$ be the $i$th column of $\beta$, i.e., $\beta = [\beta_1, \beta_2, \ldots, \beta_r]$.
Then for 1 โ‰ค ๐‘— โ‰ค ๐‘Ÿ, the ๐‘—th entry in ๐‘ค is ๐‘ค๐‘— โ„Ž๐‘– โ„Ž๐‘– โ„Ž๐‘– โ„Ž๐‘– 0 โˆ‘๏ธ ๐‘ ๐‘™1 ๐œŽ๐‘Ÿ+๐‘™1 โˆ‘๏ธ (๐‘‘๐‘–1 )๐‘™1 ๐‘™2 ๐œŽ๐‘Ÿ+๐‘™2 ๐‘˜ โˆ‘๏ธ (๐‘‘๐‘– ๐‘˜โˆ’1 )๐‘™ ๐‘˜โˆ’1 ๐‘™ ๐‘˜ ๐œŽ๐‘Ÿ+๐‘™ โˆ‘๏ธ (๐‘‘๐‘– ๐‘˜ )๐‘™ ๐‘˜ ๐‘™0 ๐œŽ๐‘Ÿ+๐‘™ 1 2 ๐‘˜ 0 ๐‘‡ = ... ๐›พ๐‘™ (e ๐œŽ ๐‘šโˆ’1 ๐›ฝ ๐‘— ) ๐‘™1 e ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 0 ๐‘— 1 ๐‘™2 2 ๐‘™๐‘˜ ๐‘˜ ๐‘™0 0 โ„Ž๐‘– โ„Ž๐‘– โ„Ž๐‘– 0 โˆ‘๏ธ ๐‘ ๐‘™1 ๐œŽ๐‘Ÿ+๐‘™1 โˆ‘๏ธ (๐‘‘๐‘–1 )๐‘™1 ๐‘™2 ๐œŽ๐‘Ÿ+๐‘™2 โˆ‘๏ธ (๐‘‘๐‘– ๐‘˜ )๐‘™ ๐‘˜ ๐‘™0 ๐œŽ๐‘Ÿ+๐‘™ * + 1 2 0 = ... ๐›พ๐‘™0 ,e ๐œŽ ๐‘šโˆ’1 ๐›ฝ๐‘— . ๐œŽ e 2 โˆ’ ๐œŽ2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐‘— ๐‘™1 ๐‘— ๐‘Ÿ+๐‘™1 ๐‘™2 2 ๐‘™0 0 | {z } โ‰ก๐‘€ ๐‘— In the above, let โ„Ž๐‘– โ„Ž๐‘– โ„Ž๐‘– โ„Ž๐‘– 0 โˆ‘๏ธ ๐‘ ๐‘™1 ๐œŽ๐‘Ÿ+๐‘™1 โˆ‘๏ธ (๐‘‘๐‘–1 )๐‘™1 ๐‘™2 ๐œŽ๐‘Ÿ+๐‘™2 ๐‘˜ โˆ‘๏ธ (๐‘‘๐‘– ๐‘˜โˆ’1 )๐‘™ ๐‘˜โˆ’1 ๐‘™ ๐‘˜ ๐œŽ๐‘Ÿ+๐‘™ โˆ‘๏ธ (๐‘‘๐‘– ๐‘˜ )๐‘™ ๐‘˜ ๐‘™0 ๐œŽ๐‘Ÿ+๐‘™ 1 2 ๐‘˜ 0 ๐‘€๐‘— = ... ๐›พ๐‘™ 0 . ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ ๐‘™1 e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 1 ๐‘™2 2 ๐‘™๐‘˜ ๐‘˜ ๐‘™0 0 96 We first bound โˆฅ๐‘€ ๐‘— โˆฅ. Denote ๐œ‚ ๐‘— = ๐œŽ 2๐‘— โˆ’ e ๐œŽ 2๐‘— , and ฮ” ๐‘—๐‘™ = ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ 2 . Notice that 1 1 1 = = ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ e 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ 2 + (e ๐œŽ 2๐‘— โˆ’ ๐œŽ 2๐‘— ) e๐œŽ 2 โˆ’๐œŽ 2 2 2 (๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+๐‘™ )(1 + 2 2 ) ๐‘— ๐‘— ๐œŽ โˆ’๐œŽ ๐‘— ๐‘Ÿ+๐‘™ 1 1 ๐œ‚๐‘— ๐œ‚๐‘— 2 = ๐œ‚๐‘— = (1 + +( ) + ...). ฮ” ๐‘—๐‘™ (1 โˆ’ ฮ” ) ฮ” ๐‘—๐‘™ ฮ” ๐‘—๐‘™ ฮ” ๐‘—๐‘™ ๐‘—๐‘™ Hence รŽ๐‘˜ รŽ๐‘˜ โ„Ž๐‘– ๐‘  ๐‘ =1 (๐‘‘๐‘– ๐‘  )๐‘™ ๐‘  ๐‘™ ๐‘ +1 ๐‘ =0 ๐œŽ๐‘Ÿ+๐‘™ ๐‘  ๐‘ ๐‘™1 ๐‘˜   โˆ‘๏ธ ร– ๐œ‚๐‘— ๐œ‚๐‘— 2 โˆฅ๐‘€๐‘— โˆฅ = 1+ +( ) + ... ๐›พ๐‘™0 ฮ” ๐‘—๐‘™ ๐‘  ฮ” ๐‘—๐‘™ ๐‘  รŽ๐‘˜ ๐‘™1 ,...,๐‘™ ๐‘˜ ,๐‘™0 ๐‘ =0 ฮ” ๐‘—๐‘™ ๐‘  ๐‘ =0 โ„Ž๐‘– ๐‘  โˆž รŽ๐‘˜ รŽ๐‘˜ โˆ‘๏ธ โˆ‘๏ธ ๐‘ ๐‘™1 ๐‘ =1 (๐‘‘ ๐‘– ๐‘  ) ๐‘™ ๐‘  ๐‘™ ๐œŽ ๐‘ +1 ๐‘ =0 ๐‘Ÿ+๐‘™ ๐‘  ร๐‘˜ ๐‘ž๐‘  = รŽ๐‘˜  รŽ๐‘˜ ๐‘ž๐‘  ๐›พ๐‘™0 ยท ๐œ‚ ๐‘— ๐‘ =0 ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜ =0 ๐‘™1 ,...,๐‘™ ๐‘˜ ,๐‘™0 ๐‘ =0 ฮ” ๐‘—๐‘™ ๐‘ =0 ฮ” ๐‘—๐‘™ ๐‘  โ„Ž๐‘– ๐‘  โˆž รŽ๐‘˜ รŽ๐‘˜ ๐‘ =1 (๐‘‘๐‘– ๐‘  )๐‘™ ๐‘  ๐‘™ ๐‘ +1 ๐‘ =0 ๐œŽ๐‘Ÿ+๐‘™ ๐‘  ๐‘ ๐‘™1 ร๐‘˜ โˆ‘๏ธ โˆ‘๏ธ ๐‘ž๐‘  โ‰ค ๐›พ๐‘™ 0 ยท ๐œ‚ ๐‘— ๐‘ =0 . รŽ๐‘˜ 1+๐‘ž ๐‘  ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜ =0 ๐‘™ 1 ,...,๐‘™ ๐‘˜ ,๐‘™ 0 ๐‘ =0 ๐‘—๐‘™ ๐‘ ฮ” | {z } ๐‘‡ ( ๐‘—,๐‘ž 0 ,...,๐‘ž ๐‘˜ ) Here we let ๐‘™ ๐‘˜+1 = ๐‘™ 0 . In the above, denote รŽ๐‘˜ รŽ๐‘˜ โ„Ž๐‘– ๐‘  ๐‘ ๐‘™1 ๐‘ =1 (๐‘‘๐‘– ๐‘  )๐‘™ ๐‘  ๐‘™ ๐‘ +1 ๐‘ =0 ๐œŽ๐‘Ÿ+๐‘™ ๐‘  โˆ‘๏ธ ๐‘‡ ( ๐‘—, ๐‘ž 0 , ..., ๐‘ž ๐‘˜ ) โ‰ก ๐›พ๐‘™ 0 โˆˆ R๐‘Ÿร—1 . รŽ๐‘˜ 1+๐‘ž ๐‘  ๐‘™ 1 ,...,๐‘™ ๐‘˜ ,๐‘™ 0 ฮ” ๐‘ =0 ๐‘—๐‘™ ๐‘  Next, we bound the โ„“2 norm of ๐‘‡ ( ๐‘—, ๐‘ž 0 , ..., ๐‘ž ๐‘˜ ). Notice that we can rewrite ๐‘‡ ( ๐‘—, ๐‘ž 0 , ..., ๐‘ž ๐‘˜ ) in the following way ๐‘ž ๐‘ž ๐‘‡  ๐‘ž ๐‘ž  ๐‘‡ ( ๐‘—, ๐‘ž 0 , ..., ๐‘ž ๐‘˜ ) = ๐‘๐‘‡ ๐‘“ห†๐‘– 1 ๐‘“ห†๐‘– 2 ... ๐‘“ห†๐‘– ๐‘˜ ๐‘Žห†๐‘– 0 , 1 โ‰ค ๐‘–0 , ๐‘– 2 , ..., ๐‘– ๐‘˜ โ‰ค 4, ๐‘˜ โ‰ฅ 0. 
Here the matrices $\hat f_{i_s}^{\,q}$ are modified from the functions $f_{i_s}$, and the matrix $\hat a_{i_0}^{\,q}$ is modified from $a_{i_0}$. Explicitly,
\[
\hat a_1^{\,q} = \hat F_U^{21,q}\, \Sigma_2 \alpha_{12}^T, \quad \hat a_2^{\,q} = \hat F_U^{21,q}\, \alpha_{21}, \quad \hat a_3^{\,q} = \hat F_V^{21,q}\, \alpha_{12}^T, \quad \hat a_4^{\,q} = \hat F_V^{21,q}\, \Sigma_2^T \alpha_{21},
\]
\[
\hat f_1^{\,q} = \hat F_U^{21,q}\, \Sigma_2 \alpha_{22}^T, \quad \hat f_2^{\,q} = \hat F_U^{21,q}\, \alpha_{22}, \quad \hat f_3^{\,q} = \hat F_V^{21,q}\, \alpha_{22}^T, \quad \hat f_4^{\,q} = \hat F_V^{21,q}\, \Sigma_2^T \alpha_{22},
\]
where $\hat F_U^{21,q}$ and $\hat F_V^{21,q}$ are diagonal matrices with diagonal entries
\[
\left(\hat F_U^{21,q}\right)_{i'-r,\, i'-r} = \frac{1}{(\sigma_j^2 - \sigma_{i'}^2)^{1+q}}, \quad r+1 \le i' \le n, \qquad \left(\hat F_V^{21,q}\right)_{i'-r,\, i'-r} = \frac{1}{(\sigma_j^2 - \sigma_{i'}^2)^{1+q}}, \quad r+1 \le i' \le m.
\]
As in Theorem 2.5.1, if $i' > \min\{n, m\}$, we define $\sigma_{i'}$ to be 0.

As before, in the above expression of $T(j, q_0, \ldots, q_k)$, $\hat a_{i_0}^{\,q_0}$ contains either $\alpha_{21}$ or $\alpha_{12}^T$; whichever it contains coincides with the $\gamma$ defined in the paragraph below (A.1). Conditional on $\alpha_{22}$, the only random variable in $T(j, q_0, \ldots, q_k)$ is $\alpha_{21}$ or $\alpha_{12}^T$, that is, $\gamma$. Therefore, if we write $T(j, q_0, \ldots, q_k) = G(\gamma)^T$, then the linear operator $G$ is independent of $\gamma$, and it is straightforward to check that
\[
\|G\| \le \underbrace{\|b\|\, \frac{\sigma_{r+1}^{\,\sum_{s=0}^k h_{i_s}}\, \|\alpha_{22}\|^k}{(\sigma_j^2 - \sigma_{r+1}^2)^{\,k+1+\sum_{s=0}^k q_s}}}_{\equiv K}, \qquad 1 \le j \le r.
\]
Again, conditional on $\alpha_{22}$, since for each $1 \le p \le r$ the $p$th entry of $T(j, q_0, \ldots, q_k)$ only depends on the $p$th column of $\alpha_{21}$ or $\alpha_{12}^T$, the entries of $T(j, q_0, \ldots, q_k)$ are independent of each other, each following a Gaussian distribution $\mathcal N(0, \sigma^2 \xi_p^2)$ with $\xi_p \le K$, $1 \le p \le r$, where $K$ denotes the bound above. By Theorem 3.1.1 in Vershynin (2018), there exists some constant $c$ such that
\[
\mathbb P\left( \|T(j, q_0, \ldots, q_k)\| - K\sigma\sqrt r \ge t \right) \le 2 e^{-c t^2 / (\sigma^2 K^2)}.
\]
Let $t = K\sigma\sqrt{\log\left(n^3 \cdot 2^{\sum_{s=0}^k q_s} \cdot 8^k\, r\right)/c}$; then with probability at least $1 - \frac{2}{n^3 \cdot 2^{\sum_{s=0}^k q_s} \cdot 8^k\, r}$,
\[
\|T(j, q_0, \ldots, q_k)\| \le c_1 \sigma \|b\|\, \frac{\sigma_{r+1}^{\,\sum_{s=0}^k h_{i_s}}\, \|\alpha_{22}\|^k}{(\sigma_j^2 - \sigma_{r+1}^2)^{\,k+1+\sum_{s=0}^k q_s}} \left( \sqrt{\log n} + \sqrt{\textstyle\sum_{s=0}^k q_s} + \sqrt k + \sqrt r \right),
\]
where $c_1$ is some constant.
Hence โˆฅ๐‘€ ๐‘— โˆฅ โˆž โˆ‘๏ธ ร๐‘˜ ๐‘ž๐‘  โ‰ค โˆฅ๐‘‡ ( ๐‘—, ๐‘ž 0 , ..., ๐‘ž ๐‘˜ ) โˆฅ ยท ๐œ‚ ๐‘— ๐‘ =0 ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜ =0 ร๐‘˜ ร๐‘˜ โ„Ž๐‘– ๐‘  ๐‘ž โˆฅ๐‘โˆฅ๐œŽ๐‘Ÿ+1 ๐‘ =0 โˆฅ๐›ผ22 โˆฅ ๐‘˜ โˆž ๐‘˜ โˆ‘๏ธ โˆš โˆš โˆš ๐‘ =0 ๐‘  โˆ‘๏ธ โˆš๏ธ ๐œ‚๐‘— โ‰ค ๐‘1 ๐œŽ ( log ๐‘› + ๐‘ž ๐‘  + ๐‘˜ + ๐‘Ÿ) ยท (๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ) ๐‘˜+1 ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜ =0 ๐‘ =0 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+12 ร๐‘˜ โ„Ž๐‘– โˆฅ๐‘โˆฅ๐œŽ๐‘Ÿ+1๐‘ =0 ๐‘  โˆฅ๐›ผ22 โˆฅ ๐‘˜ ยญ โˆš๏ธ โˆš โˆš ยฉ ยช ยญ ( log ๐‘› + ๐‘˜ + ๐‘Ÿ) 1 2(๐‘˜ + 1) ยฎ โ‰ค ๐‘1 ๐œŽ + ยฎ (๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ) ๐‘˜+1 |๐œ‚ ๐‘— | |๐œ‚ ๐‘— | ยญ ยฎ ยญ (1 โˆ’ 2 2 ) ๐‘˜+1 (1 โˆ’ 2 2 ) ๐‘˜+1 ยฎ ๐œŽ โˆ’๐œŽ ๐œŽ โˆ’๐œŽ ยซ ๐‘— ๐‘Ÿ+1 ๐‘— ๐‘Ÿ+1 ยฌ ร๐‘˜ โ„Ž๐‘– โˆฅ๐‘โˆฅ๐œŽ๐‘Ÿ+1๐‘ =0 ๐‘  โˆฅ๐›ผ22 โˆฅ ๐‘˜ โˆš๏ธ โˆš = ๐‘2 ๐œŽ ( log ๐‘› + ๐‘Ÿ + ๐‘˜), 2 2 (๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 โˆ’ |๐œ‚ ๐‘— |) ๐‘˜+1 where ๐‘ 2 is a constant. In the second inequality above, we used the fact that ร๐‘˜ ร ๐‘˜โˆ’1 ๐‘ž ๐‘ž ๐‘ž๐‘˜ โˆž ๐‘ =0 ๐‘  โˆž ๐‘ =0 ๐‘  โˆ‘๏ธ โˆž โˆ‘๏ธ ๐œ‚๐‘— โˆ‘๏ธ ๐œ‚๐‘— ๐œ‚๐‘— = 2 2 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+12 ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜ =0 ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜โˆ’1 =0 ๐‘ž ๐‘˜ =0 โˆž 1 โˆ‘๏ธ ๐œ‚๐‘— ร ๐‘˜โˆ’1 ๐‘ž = ๐‘ =0 ๐‘  |๐œ‚ ๐‘— | 2 2 (1 โˆ’ 2 2 ) ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜โˆ’1 =0 ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 ๐œŽ โˆ’๐œŽ ๐‘— ๐‘Ÿ+1 1 = ... = . |๐œ‚ ๐‘— | (1 โˆ’ 2 2 ) ๐‘˜+1 ๐œŽ โˆ’๐œŽ ๐‘— ๐‘Ÿ+1 99 and that โˆž ๐‘˜ โˆ‘๏ธ โˆ‘๏ธ โˆš ๐œ‚๐‘— ร๐‘˜ ๐‘ž ( ๐‘ž๐‘ ) ๐‘ =0 ๐‘  ๐‘ž 0 ,๐‘ž 1 ,...,๐‘ž ๐‘˜ =0 ๐‘ =0 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ๐‘˜ โˆž โˆž โˆ‘๏ธ โˆ‘๏ธ ๐œ‚๐‘— ร ๐‘™โ‰ ๐‘  ๐‘ž ๐‘™ โˆ‘๏ธ โˆš ๐œ‚๐‘— ๐‘ž๐‘  = ๐‘ž๐‘  2 2 2 2 ๐‘ =0 ๐‘ž 0 ,...,๐‘ž ๐‘ โˆ’1 ,๐‘ž ๐‘ +1 ,...,๐‘ž ๐‘˜ =0 ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 ๐‘ž ๐‘  =0 ๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 ๐‘˜ โˆž โˆ‘๏ธ 2 โˆ‘๏ธ ๐œ‚๐‘— ร ๐‘™โ‰ ๐‘  ๐‘ž ๐‘™ โ‰ค ๐‘ =0 (1 โˆ’ |๐œ‚ ๐‘— | ) ๐‘ž 0 ,...,๐‘ž ๐‘ โˆ’1 ,๐‘ž ๐‘ +1 ,...,๐‘ž ๐‘˜ =0 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ๐œŽ โˆ’๐œŽ 2 2 ๐‘— ๐‘Ÿ+1 2(๐‘˜ + 1) = . |๐œ‚ ๐‘— | (1 โˆ’ 2 2 ) ๐‘˜+1 ๐œŽ โˆ’๐œŽ ๐‘— ๐‘Ÿ+1 Here the inequality is due to (2.49), which holds under the condition |๐œ‚ ๐‘— | ๐œŽ 2๐‘— โˆ’ ๐œŽ 2๐‘— | |e (๐œŽ ๐‘— + โˆฅฮ”๐ดโˆฅ) 2 โˆ’ ๐œŽ 2๐‘— = โ‰ค ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 2๐œŽ ๐‘— โˆฅฮ”๐ดโˆฅ + โˆฅฮ”๐ดโˆฅ 2 (2๐œŽ ๐‘— + 17 ๐œŽ ๐‘— )โˆฅฮ”๐ดโˆฅ 1 โ‰ค โ‰ค < . (๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 )(๐œŽ ๐‘— + ๐œŽ๐‘Ÿ+1 ) 7โˆฅฮ”๐ดโˆฅ๐œŽ ๐‘— 2 Therefore with probability at least 1 โˆ’ 3 4 ๐‘˜ , ๐‘› ยท4 ๐‘Ÿ โˆฅ๐‘ค ๐‘— โˆฅ โ‰ค โˆฅ ๐‘€ ๐‘— โˆฅ ยท e ๐œŽ ๐‘šโˆ’1๐‘— โˆฅ๐›ฝ๐‘— โˆฅ ร๐‘˜ โ„Ž๐‘– โˆฅ๐‘โˆฅe ๐œŽ ๐‘— ๐œŽ๐‘Ÿ+1๐‘ =0 ๐‘  โˆฅ๐›ผ22 โˆฅ ๐‘˜ ๐‘šโˆ’1 โˆš๏ธ โˆš โ‰ค ๐‘2 ๐œŽ ( log ๐‘› + ๐‘Ÿ + ๐‘˜) (๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 โˆ’ |๐œ‚ |)(๐œŽ 2 โˆ’ ๐œŽ 2 โˆ’ |๐œ‚ |) ๐‘˜ ๐‘— ๐‘— ๐‘Ÿ+1 ๐‘— โˆฅ๐‘โˆฅ( 78 ๐œŽ ๐‘— ) ๐‘˜ โˆฅฮ”๐ดโˆฅ ๐‘˜ โˆš๏ธ โˆš โ‰ค ๐‘2 ๐œŽ ( log ๐‘› + ๐‘Ÿ + ๐‘˜) 2 (๐œŽ 2 โˆ’ ๐œŽ 2 )( 34 ๐œŽ โˆฅฮ”๐ดโˆฅ) ๐‘˜ 3 ๐‘— ๐‘Ÿ+1 7 ๐‘— 3 โˆฅ๐‘โˆฅ 4 โˆš๏ธ โˆš โ‰ค ๐‘2 ๐œŽ ( ) ๐‘˜ ( log ๐‘› + ๐‘Ÿ + ๐‘˜). 
2 ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ 2 17 ๐‘Ÿ+1 ร๐‘˜ Here the third inequality is due to ๐‘ =0 โ„Ž๐‘– ๐‘  + ๐‘š โˆ’ 1 = ๐‘˜ and ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 โˆ’ |e ๐œŽ 2๐‘— โˆ’ ๐œŽ 2๐‘— | โ‰ฅ ๐œŽ ๐‘— (๐œŽ ๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 ) โˆ’ ((๐œŽ ๐‘— + โˆฅฮ”๐ดโˆฅ) 2 โˆ’ ๐œŽ 2๐‘— ) 34 โ‰ฅ 7๐œŽ ๐‘— โˆฅฮ”๐ดโˆฅ โˆ’ 2๐œŽ ๐‘— โˆฅฮ”๐ดโˆฅ โˆ’ โˆฅฮ”๐ดโˆฅ 2 โ‰ฅ ๐œŽ ๐‘— โˆฅฮ”๐ดโˆฅ, 7 100 as well as ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 โˆ’ |e ๐œŽ 2๐‘— โˆ’ ๐œŽ 2๐‘— | โ‰ฅ ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 โˆ’ 2๐œŽ โˆฅฮ”๐ดโˆฅ โˆ’ โˆฅฮ”๐ดโˆฅ 2 ๐‘— โ‰ฅ ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+12 โˆ’ โˆฅฮ”๐ดโˆฅ(2๐œŽ + โˆฅฮ”๐ดโˆฅ) ๐‘— โ‰ฅ ๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+12 โˆ’ 1 (๐œŽ โˆ’ ๐œŽ ) ยท 15 (๐œŽ + ๐œŽ ) ๐‘— ๐‘Ÿ+1 ๐‘— ๐‘Ÿ+1 7 7 2 โ‰ฅ (๐œŽ 2๐‘— โˆ’ ๐œŽ๐‘Ÿ+1 2 ). 3 With probability at least 1 โˆ’ ๐‘˜4 3 , 4 ๐‘› โˆš ๐‘‡ 3 ๐‘Ÿ โˆฅ๐‘ข๐‘– ๐›ผ22 โˆฅ 4 ๐‘˜ โˆš๏ธ โˆš โˆฅ๐‘คโˆฅ โ‰ค ๐‘ 2 ๐œŽ ( ) ( log ๐‘› + ๐‘Ÿ + ๐‘˜). 2 ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ๐‘Ÿ+12 17 By Theorem 2.5.7, we can see that there are 2 ๐‘˜+1 terms in the expansion of ๐‘‰2๐‘‡ ๐‘‰ e1 that has order ๐‘˜, hence by (2.49), with probability at least 1 โˆ’ 163 , ๐‘› โˆž โˆš ๐‘‡ โˆš๏ธ ๐‘‡ ๐‘‡ โˆ’1 โˆ‘๏ธ ๐‘Ÿ โˆฅ๐‘ข ๐‘– ๐›ผ 22 โˆฅ 8 ๐‘˜ โˆš๏ธ โˆš ( ๐‘Ÿ log ๐‘› + ๐‘Ÿ)โˆฅ๐‘ข๐‘‡๐‘– ๐›ผ22 โˆฅ โˆฅ๐‘ข๐‘– ๐›ผ22๐‘‰2 ๐‘‰1 ฮฃ1 โˆฅ โ‰ค e e 3๐‘ 2 ๐œŽ ( ) ( log ๐‘› + ๐‘Ÿ + ๐‘˜) โ‰ค ๐ถ๐œŽ . ๐‘˜=0 ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ 2 ๐‘Ÿ+1 17 ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ 2 ๐‘Ÿ+1 By the union bound, with probability at least 1 โˆ’ 162 , ๐‘› โˆš๏ธ ( ๐‘Ÿ log ๐‘› + ๐‘Ÿ)โˆฅ๐›ผ22 โˆฅ โˆฅ๐‘ˆ2๐‘ˆ2๐‘‡ ฮ”๐ด๐‘‰2๐‘‰2๐‘‡ ๐‘‰ e1eฮฃ1โˆ’1 โˆฅ 2,โˆž โ‰ค ๐ถ๐œŽ ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ๐‘Ÿ+1 2 โˆš๏ธ โˆš๏ธ ( ๐‘Ÿ log ๐‘› + ๐‘Ÿ)โˆฅฮ”๐ดโˆฅ ๐‘Ÿ log ๐‘› + ๐‘Ÿ = ๐ถ๐œŽ โ‰ค ๐ถ๐œŽ . (๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 )(๐œŽ๐‘Ÿ + ๐œŽ๐‘Ÿ+1 ) ๐œŽ๐‘Ÿ + ๐œŽ๐‘Ÿ+1 Next, we consider โˆฅ๐‘ˆ2 ฮฃ2๐‘‰2๐‘‡ ๐‘‰ e1e ฮฃ1โˆ’1 โˆฅ 2,โˆž . Following the same reasoning, we have with probability at least 1 โˆ’ 162 , ๐‘› โˆš๏ธ โˆš๏ธ ( ๐‘Ÿ log ๐‘› + ๐‘Ÿ)โˆฅฮฃ2 โˆฅ ( ๐‘Ÿ log ๐‘› + ๐‘Ÿ)๐œŽ๐‘Ÿ+1 โˆฅ๐‘ˆ2 ฮฃ2๐‘‰2๐‘‡ ๐‘‰ ฮฃ1โˆ’1 โˆฅ 2,โˆž โ‰ค ๐ถ๐œŽ e1e = ๐ถ๐œŽ . ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ๐‘Ÿ+1 2 ๐œŽ๐‘Ÿ2 โˆ’ ๐œŽ๐‘Ÿ+12 Combining the above two bounds, max{โˆฅ๐‘ˆ2 ฮฃ2๐‘‰2๐‘‡ ๐‘‰ e1e ฮฃ1โˆ’1 โˆฅ 2,โˆž , โˆฅ๐‘ˆ2๐‘ˆ2๐‘‡ ฮ”๐ด๐‘‰2๐‘‰2๐‘‡ ๐‘‰ ฮฃ1โˆ’1 โˆฅ 2,โˆž } e1e โˆš๏ธ โˆš๏ธ ( ๐‘Ÿ log ๐‘› + ๐‘Ÿ)๐œŽ๐‘Ÿ+1 ๐‘Ÿ log ๐‘› + ๐‘Ÿ โ‰ค ๐ถ๐œŽ( + ) 2 ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 2 ๐œŽ๐‘Ÿ + ๐œŽ๐‘Ÿ+1 โˆš๏ธ ๐‘Ÿ log ๐‘› + ๐‘Ÿ โ‰ค ๐ถ๐œŽ . ๐œŽ๐‘Ÿ โˆ’ ๐œŽ๐‘Ÿ+1 โ–ก 101 APPENDIX B APPENDIX FOR CHAPTER 3 ๐œ†ห†๐‘– โˆ’๐œ†๐‘–โˆ— ๐‘˜โ†’โˆž This appendix contains the proof of Theorem 3.3.1 and the proof of | | โˆ’โˆ’โˆ’โˆ’โˆ’โ†’ 0 in Section ๐œ†โˆ— ๐‘– 3.4. B.1 Proof of Theorem 3.3.1 Definition B.1.1. Let M be a compact manifold endowed with a continuous measure ๐œ‡. For any ๐‘ง โˆˆ M, its (๐œ‚(๐œ), ๐œ)-neighborhood N is the neighbourhood with radius ๐œ‚(๐œ) and measure ๐œ, i.e., ๐œ‡(N ) = ๐œ, N = M โˆฉ ๐ต2 (๐‘ง, ๐œ‚(๐œ)), and ๐œ‚(๐œ) = min{๐‘Ÿ : ๐œ‡(M โˆฉ ๐ต2 (๐‘ง, ๐‘Ÿ)) = ๐œ}. Since M is compact, its measure is finite, say ๐œ‡(M) = ๐ด, and the radii of all the ๐œ- neighbourhoods are bounded by some constant ๐œ‚: sup |๐œ‚๐‘– | 2 โ‰ค ๐œ‚2 . ๐‘– Theorem B.1.1 (Full version of Theorem 3.3.1). Given the dataset ๐‘‹ = [๐‘‹1 , ๐‘‹2 , ยท ยท ยท , ๐‘‹๐‘› ], let each ๐‘‹๐‘– be independently drawn from a compact manifold M โІ R ๐‘ with intrinsic dimension ๐‘‘ and endowed with the uniform distribution ๐œ‡. 
Fix some ๐‘ž > 0, let ๐‘‹๐‘– ๐‘— , ๐‘— = 1, . . . , ๐‘˜ ๐‘– be the ๐‘˜ ๐‘– points falling in the (๐œ‚๐‘– , ๐‘ž)-neighbourhood of ๐‘‹๐‘– . Together they form a matrix ๐‘‹ (๐‘–) = [๐‘‹๐‘–1 , . . . , ๐‘‹๐‘– ๐‘˜ , ๐‘‹๐‘– ]. Suppose ๐‘– the i.i.d. projections ๐‘ฆ๐‘–, ๐‘— โ‰ก ๐‘ƒ๐‘‡๐‘‹ (๐‘‹๐‘– ๐‘— โˆ’ ๐‘‹๐‘– ) where ๐‘‡๐‘‹๐‘– is the tangent space at ๐‘‹๐‘– obey the same ๐‘– distribution as some ๐‘Ž๐‘– for all ๐‘—, i.e., ๐‘ฆ๐‘–, ๐‘— โˆผ ๐‘Ž๐‘– (โˆผ means the two vectors are identically distributed), and the matrix E(๐‘Ž๐‘– โˆ’ E๐‘Ž๐‘– )(๐‘Ž๐‘– โˆ’ E๐‘Ž๐‘– ) โˆ— has a finite condition number for each ๐‘–. In addition, suppose the support of the noise matrix ๐‘† (๐‘–) is uniformly distributed among all sets of cardinality ๐‘š๐‘– . For any ๐œ โˆˆ M, let ๐‘‡๐œ be the tangent space of M at ๐œ and define ๐œ‡1 := sup๐œ โˆˆM ๐œ‡(๐‘‡๐œ ). Then as long as ๐‘ž๐‘› โ‰ฅ ๐‘ log ๐‘›, ๐‘‘ < ๐œŒ๐‘Ÿ min{๐‘›๐‘ž/2, ๐‘}๐œ‡1โˆ’1 logโˆ’2 max{2๐‘›๐‘ž, ๐‘}, and ๐‘๐‘˜๐‘– โ‰ค 0.4๐œŒ ๐‘  (here ๐‘, ๐œŒ๐‘Ÿ and ๐œŒ ๐‘  are positive ๐‘š ๐‘– numerical constants), then with probability over 1 โˆ’ ๐‘ 1 (๐‘› max{๐‘›๐‘ž/2, ๐‘}โˆ’10 + exp(โˆ’๐‘ 2 ๐‘›๐‘ž)) for some min{๐‘˜ ๐‘– +1,๐‘}1/2 constants ๐‘ 1 and ๐‘ 2 , the minimizer ๐‘†ห† to (2) with ๐œ†๐‘– = ๐œ–๐‘– , and ๐›ฝ๐‘– = max{๐‘˜ ๐‘– + 1, ๐‘}โˆ’1/2 has the error bound ห† โˆ’ ๐‘† (๐‘–) โˆฅ 2,1 โ‰ค ๐ถ โˆš ๐‘๐‘› ๐‘˜ยฏ โˆฅ๐œ– โˆฅ 2 . โˆ‘๏ธ โˆฅP๐‘– ( ๐‘†) ๐‘– 102 Here ๐‘˜ยฏ = max๐‘– ๐‘˜ ๐‘– satisfying ๐‘›๐‘ž/2 โ‰ค ๐‘˜ยฏ โ‰ค 2๐‘›๐‘ž, ๐œ–๐‘– = โˆฅ ๐‘‹หœ (๐‘–) โˆ’ ๐‘‹๐‘– 1๐‘‡ โˆ’ ๐‘‡ (๐‘–) โˆ’ ๐‘† (๐‘–) โˆฅ ๐น , ๐œ– = [๐œ–1 , ..., ๐œ– ๐‘› ], โˆฅ ยท โˆฅ 2,1 stands for taking โ„“2 norm along rows and โ„“1 norm along the columns, and ๐‘‡ (๐‘–) is the projection of ๐‘‹ (๐‘–) โˆ’ ๐‘‹๐‘– to the tangent space ๐‘‡๐‘‹๐‘– . The proof the Theorem B.1.1 uses similar techniques as Zhan and Vaswani (2015). The main difference is that in Zhan and Vaswani (2015), both the left and the right singular vectors of the data matrix are required to satisfy the coherence conditions, while here we show that only the left singular vectors that corresponding to the tangent spaces are relevant. In other words, the recovery guarantee is built solely upon assumptions on the intrinsic properties of the manifold, i.e., the tangent spaces. The proof architecture is as follows. In Section B.1.1, we derive the error bound in Theorem 3.3.1 under a small coherence condition for both the left and the right singular vectors of ๐ฟ (๐‘–) . In Section B.1.2, we show that the requirement on the right singular vectors can be removed using the i.i.d. assumption on the samples. B.1.1 Deriving the error bound in Theorem B.1.1 under coherence conditions on both the right and the left singular vectors In Section 3.1.4, we explained that ๐ฟ (๐‘–) = ๐‘‹๐‘– 1๐‘‡ +๐‘‡ (๐‘–) corresponds to the linear approximation of the ๐‘–th patch. After the centering C(๐ฟ (๐‘–) ) = C(๐‘‡ (๐‘–) ), one gets rid of the first term and the resulting matrix has a column span coincide with ๐‘‡ (๐‘–) . This indicates that the columns of C(๐ฟ (๐‘–) ) lie in the column space of the tangent space ๐‘ ๐‘๐‘Ž๐‘›(๐‘‡ (๐‘–) ), this also indicates that the rows of ๐ฟ (๐‘–) are in ๐‘ ๐‘๐‘Ž๐‘›{1๐‘‡ ,๐‘‡ (๐‘–) }. One can view the knowledge that 1๐‘‡ is in the row space of ๐ฟ (๐‘–) as a prior knowledge of the left singular vectors of ๐ฟ (๐‘–) . 
Robust PCA with prior knowledge is studied in Zhan and Vaswani (2015), and we will use some of the results therein. Specifically, we adapt the dual certificate approach in Zhan and Vaswani (2015) to derive the error bound for our new problem in the theorem, and choose proper $\lambda_i$, $i=1,2,\cdots,n$ and $\beta_i$ accordingly. We first state the following assumptions from Zhan and Vaswani (2015):

Assumption B.1.1 (Incoherence conditions). In each local patch, $L^{(i)}\in\mathbb{R}^{p\times(k_i+1)}$; denote $n_{(1)}=\max\{p,k_i+1\}$ and $n_{(2)}=\min\{p,k_i+1\}$. Let $\mathcal{C}(L^{(i)})=U_i\Sigma_iV_i^*$ be the singular value decomposition of each $\mathcal{C}(L^{(i)})$, where $U_i\in\mathbb{R}^{p\times d}$, $\Sigma_i\in\mathbb{R}^{d\times d}$, $V_i^*\in\mathbb{R}^{d\times(k_i+1)}$. Let $\tilde{V}_i$ be the orthonormal basis of $span\{1,V_i\}$, and assume that for each $i\in\{1,2,\cdots,n\}$ the following hold with a sufficiently small constant $\rho_r$:
$$\max_j\|U_i^*e_j\|^2\le\frac{\rho_rd}{p},\qquad(B.1)$$
$$\max_j\|\tilde{V}_i^*e_j\|^2\le\frac{\rho_rd}{k_i},\qquad(B.2)$$
$$\|U_iV_i^*\|_\infty\le\sqrt{\frac{\rho_rd}{pk_i}},\qquad(B.3)$$
and $\rho_r$, $\rho_s$, $p$, $k_i$ satisfy the following assumptions:

Assumption B.1.2 (Zhan and Vaswani (2015), Assumption III.2).
(a) $\rho_r\le\min\{10^{-4},C_1\}$,
(b) $\rho_s\le\min\{1-1.5b_1(\rho_r),0.0156\}$,
(c) $n_{(1)}\ge\max\{C_2(\rho_r),1024\}$,
(d) $n_{(2)}\ge100\log^2n_{(1)}$,
(e) $\frac{(p+k_i)^{1/6}}{\log(p+k_i)}>\frac{10.5}{(\rho_s)^{1/6}(1-5.6561\sqrt{\rho_s})}$,
(f) $\frac{pk_i}{500\log n_{(1)}}>\frac{1}{\rho_s^2}$.
Here $b_1(\rho_r)$, $C_2(\rho_r)$ are some constants related to $\rho_r$.

Denote by $\Pi_i$ the linear space of matrices for each local patch (note that this is different from the tangent space $T^{(i)}$ of the manifold):
$$\Pi_i:=\{U_iX^*+Y\tilde{V}_i^*,\;X\in\mathbb{R}^{(k_i+1)\times d},\;Y\in\mathbb{R}^{p\times(d+1)}\}.$$
As shown by Zhan and Vaswani (2015), the following lemma holds, indicating that if the incoherence conditions are satisfied, then with high probability there exists a desirable dual certificate $(W,F)$.

Lemma B.1.2 (Zhan and Vaswani (2015), Lemma V.8, Lemma V.9). For fixed $i=1,2,\cdots,n$, if assumptions (B.1), (B.2), (B.3), Assumption B.1.2 and the other assumptions in Theorem B.1.1 hold, then with probability at least $1-cn_{(1)}^{-10}$, $\|\mathcal{P}_{\Omega_i}\mathcal{P}_{\Pi_i}\|\le1/4$, where $\Omega_i$ is the support set of $S^{(i)}$ and $\beta<\frac{10}{3}$. In addition, there exists a pair $(W_i,F_i)$ obeying
$$U_iV_i^*+W_i=\beta\big(sgn(S^{(i)})+F_i+\mathcal{P}_{\Omega_i}D_i\big),\qquad(B.4)$$
with
$$\mathcal{P}_{\Pi_i}W_i=0,\quad\|W_i\|\le\frac{9}{10},\quad\mathcal{P}_{\Omega_i}F_i=0,\quad\|F_i\|_\infty\le\frac{9}{10},\quad\|\mathcal{P}_{\Omega_i}D_i\|_F\le\frac{1}{4}.\qquad(B.5)$$
Therefore, by the union bound, with probability over $1-cn\,n_{(1)}^{-10}$, for each local patch there exists a pair $(W_i,F_i)$ obeying (B.4) and (B.5). In Section B.1.2, we will show that under our assumption that the data is independently drawn from a manifold $\mathcal{M}\subseteq\mathbb{R}^p$ with intrinsic dimension $d$ endowed with the uniform distribution, (B.2) and (B.3) are satisfied with high probability, so we only need Assumption B.1.2 and (B.1), which relates only to the tangent spaces of the manifold itself.
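The incoherence quantities in (B.1), (B.2) and (B.3) are easy to inspect numerically. The sketch below (an illustration under our own synthetic model, not a verification of the assumption) draws a rank-d patch, forms the basis of span{1, V} as in Assumption B.1.1, and prints the three coherence statistics next to their natural scales d/p, d/k and sqrt(d/(pk)):

import numpy as np

rng = np.random.default_rng(1)
p, k, d = 200, 150, 3
U0, _ = np.linalg.qr(rng.standard_normal((p, d)))           # tangent-space basis
L = U0 @ rng.standard_normal((d, k + 1))                    # a rank-d patch L^{(i)}
G = np.eye(k + 1) - np.ones((k + 1, k + 1)) / (k + 1)       # centering
U, s, Vt = np.linalg.svd(L @ G, full_matrices=False)
U, V = U[:, :d], Vt[:d].T

one = np.ones((k + 1, 1)) / np.sqrt(k + 1)
Vtil, _ = np.linalg.qr(np.hstack([one, V]))                 # basis of span{1, V}

print(np.max(np.sum(U ** 2, axis=1)), d / p)                # (B.1): row norms of U vs d/p
print(np.max(np.sum(Vtil ** 2, axis=1)), d / k)             # (B.2): row norms of V~ vs d/k
print(np.max(np.abs(U @ V.T)), np.sqrt(d / (p * k)))        # (B.3): entries vs sqrt(d/(pk))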
In Lemma B.1.4, we prove that in our setting, where each $X_i$ is drawn from a manifold $\mathcal{M}\subseteq\mathbb{R}^p$ independently and uniformly, with high probability, for all $i=1,2,\cdots,n$, $k_i$ is some integer within the range $[qn/2,2qn]$. Now we use that to prove Theorem B.1.1; the result is stated in the following lemma.

Lemma B.1.3. If for every local patch $i=1,2,\cdots,n$ there exists a pair $(W_i,F_i)$ obeying (B.4) and (B.5), then the minimizer $\hat{S}$ to (2) with $\lambda_i=\frac{\min\{k_i+1,p\}^{1/2}}{\epsilon_i}$ and $\beta_i=\max\{k_i+1,p\}^{-1/2}$ has the error bound
$$\sum_i\|\mathcal{P}_i(\hat{S})-S^{(i)}\|_{2,1}\le C\sqrt{pn}\,\bar{k}\,\|\epsilon\|_2.$$
Here $\epsilon_i=\|\tilde{X}^{(i)}-X_i1^T-T^{(i)}-S^{(i)}\|_F$ and $\epsilon=[\epsilon_1,\dots,\epsilon_n]$ are defined as in Theorem B.1.1.

Proof: To simplify notation, let us start with the problem for only one local patch:
$$\min_{L,S}\;\lambda\|\tilde{X}-L-S\|_F^2+\|LG\|_*+\beta\|S\|_1.\qquad(B.6)$$
Here $\tilde{X}\in\mathbb{R}^{p\times(k+1)}$, where $k$ denotes the number of neighbors in the local patch, and $G=I-\frac{1}{k+1}11^T$ is the centering matrix. Recall that the noisy data $\tilde{X}$ is $\tilde{X}=X+S+E=L+R+S+E$ with $\|R+E\|_F=\|\tilde{X}-L-S\|_F\le\epsilon$ (to be more accurate, $\epsilon_i$ for patch $i$), where $X$ is the clean data on the manifold, $L$ is the first-order Taylor approximation of $X$, $R$ collects the higher-order terms, and $E$ denotes the random noise. Also denote the solution to problem (B.6) as $\hat{L}=L+H_1$, $\hat{S}=S+H_2$. We choose $\beta=\frac{1}{\sqrt{n_{(1)}}}=\frac{1}{\sqrt{\max\{k+1,p\}}}$.

Since $\hat{L}$, $\hat{S}$ are the solution to (B.6), the following holds:
$$\lambda\|\tilde{X}-L-S\|_F^2+\|LG\|_*+\beta\|S\|_1$$
$$\ge\lambda\|\tilde{X}-(L+H_1)-(S+H_2)\|_F^2+\|(L+H_1)G\|_*+\beta\|S+H_2\|_1$$
$$\ge\lambda\|H_1+H_2-(R+E)\|_F^2+\|LG\|_*+\langle H_1G,UV^*+W_0\rangle+\beta\|S\|_1+\beta\langle H_2,sgn(S)+F_0\rangle$$
$$=\lambda\|H_1+H_2\|_F^2+\lambda\|R+E\|_F^2-2\lambda\langle R+E,H_1+H_2\rangle+\|LG\|_*+\langle H_1G,UV^*\rangle+\beta\|S\|_1+\beta\langle H_2,sgn(S)\rangle+\|\mathcal{P}_{\Pi^\perp}(H_1G)\|_*+\beta\|\mathcal{P}_{\Omega^\perp}H_2\|_1.$$
Here we choose $W_0$ and $F_0$ such that $\langle H_1G,W_0\rangle=\|\mathcal{P}_{\Pi^\perp}(H_1G)\|_*$ and $\langle H_2,F_0\rangle=\|\mathcal{P}_{\Omega^\perp}H_2\|_1$, as in Candes et al. (2011). Note that $LG=U\Sigma V^*$, $G=I-\frac{1}{k+1}11^T$ is an orthogonal projector, and $LG1=0$ implies $V^*1=0$, so we have
$$\langle H_1G,UV^*\rangle=\langle H_1,UV^*G\rangle=\Big\langle H_1,UV^*\Big(I-\frac{1}{k+1}11^T\Big)\Big\rangle=\langle H_1,UV^*\rangle,$$
$$\mathcal{P}_{\Pi^\perp}(H_1G)=(I-UU^*)H_1G(I-\tilde{V}\tilde{V}^*)=(I-UU^*)H_1\Big(I-\frac{1}{k+1}11^T\Big)(I-\tilde{V}\tilde{V}^*)=\mathcal{P}_{\Pi^\perp}H_1.$$
For the second equality we use the fact that $1$ lies in the subspace spanned by $\tilde{V}$, so $(I-\tilde{V}\tilde{V}^*)1=0$, and that for any matrix $M$, $\mathcal{P}_{\Pi^\perp}M=(I-UU^*)M(I-\tilde{V}\tilde{V}^*)$.
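As an aside, problem (B.6) can be solved by alternating proximal steps; here is a minimal sketch for one patch, assuming the standard singular-value and entrywise soft-thresholding proximal operators. The step size, iteration count, and the way the prox of ||LG||_* is split through the projector G are our own illustrative choices, not the algorithm used in Chapter 3:

import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: prox of tau*||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft-thresholding: prox of tau*||.||_1."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def denoise_patch(Xt, lam, beta, iters=300):
    # proximal gradient on  lam*||Xt - L - S||_F^2 + ||L G||_* + beta*||S||_1
    p, k1 = Xt.shape
    G = np.eye(k1) - np.ones((k1, k1)) / k1   # centering projector
    L, S = np.zeros_like(Xt), np.zeros_like(Xt)
    step = 1.0 / (4.0 * lam)                  # gradient of the smooth term is 2*lam*(L+S-Xt)
    for _ in range(iters):
        R = L + S - Xt
        Lg = L - step * 2 * lam * R
        # G is an orthogonal projector, so the prox of ||LG||_* thresholds the
        # centered part Lg @ G and leaves the column-mean part Lg @ (I - G) unchanged
        L = svt(Lg @ G, step) + Lg @ (np.eye(k1) - G)
        S = soft(S - step * 2 * lam * R, step * beta)
    return L, S

This splitting works because L @ G and L @ (I - G) are orthogonal in the Frobenius inner product, so the two parts of the proximal problem decouple.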
Denote ๐ป = ๐ป1 + ๐ป2 , plug in the equations above, we obtain 2๐œ†โŸจ๐‘… + ๐ธ, ๐ปโŸฉ โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น + โŸจ๐ป1 + ๐ป2 ,๐‘ˆ๐‘‰ โˆ— โŸฉ + โŸจ๐ป2 , ๐›ฝ๐‘ ๐‘”๐‘›(๐‘†) โˆ’ ๐‘ˆ๐‘‰ โˆ— โŸฉ + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + ๐›ฝโˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ โˆฅ๐ป โˆฅ ๐น โˆฅ๐‘ˆ๐‘‰ โˆ— โˆฅ ๐น + โŸจ๐ป2 ,๐‘Š โˆ’ ๐›ฝ๐น โˆ’ ๐›ฝPฮฉ ๐ทโŸฉ + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + ๐›ฝโˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 9 9 ๐›ฝ โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ ๐‘› (2) โˆฅ๐ป โˆฅ ๐น โˆ’ โˆฅPฮ  โŠฅ ๐ป2 โˆฅ โˆ— โˆ’ ๐›ฝโˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 โˆ’ โˆฅPฮฉ ๐ป2 โˆฅ ๐น + โˆš๏ธ 10 10 4 โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + ๐›ฝโˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 . In the 3rd inequality we used 9 |โŸจ๐ป2 ,๐‘ŠโŸฉ| = |โŸจ๐ป2 , Pฮ  โŠฅ ๐‘ŠโŸฉ| = |โŸจPฮ  โŠฅ ๐ป2 ,๐‘ŠโŸฉ| โ‰ค โˆฅPฮ  โŠฅ ๐ป2 โˆฅ โˆ— โˆฅ๐‘Š โˆฅ โ‰ค โˆฅP โŠฅ ๐ป โˆฅ โˆ— , 10 ฮ  2 106 9 |โŸจ๐ป2 , ๐นโŸฉ| = |โŸจ๐ป2 , PฮฉโŠฅ ๐นโŸฉ| = |โŸจPฮฉโŠฅ ๐ป2 , ๐นโŸฉ| โ‰ค โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 โˆฅ๐น โˆฅ โˆž โ‰ค โˆฅP โŠฅ ๐ป โˆฅ , 10 ฮฉ 2 1 and 1 |โŸจ๐ป2 , Pฮฉ ๐ทโŸฉ| โ‰ค |โŸจPฮฉ ๐ป2 , Pฮฉ ๐ทโŸฉ| โ‰ค โˆฅP ๐ป โˆฅ . 4 ฮฉ 2 ๐น Assume โˆฅ๐‘… + ๐ธ โˆฅ ๐น โ‰ค ๐œ–, for all ๐‘– = 1, 2, ยท ยท ยท , ๐‘›. Also note that โˆฅPฮฉ ๐ป2 โˆฅ ๐น โ‰ค โˆฅPฮฉ Pฮ  ๐ป2 โˆฅ ๐น + โˆฅPฮฉ Pฮ  โŠฅ ๐ป2 โˆฅ ๐น 1 โ‰ค โˆฅ๐ป โˆฅ + โˆฅPฮ  โŠฅ ๐ป2 โˆฅ ๐น 4 2 ๐น 1 1 โ‰ค โˆฅPฮฉ ๐ป2 โˆฅ ๐น + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ ๐น + โˆฅPฮ  โŠฅ ๐ป2 โˆฅ ๐น , 4 4 then we have 1 4 1 4 โˆฅPฮฉ ๐ป2 โˆฅ ๐น โ‰ค โˆฅPฮฉโŠฅ ๐ป2 โˆฅ ๐น + โˆฅPฮ  โŠฅ ๐ป2 โˆฅ ๐น โ‰ค โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป2 โˆฅ โˆ— . 3 3 3 3 Plug into the previous inequality, also note that for ๐‘› (1) โ‰ฅ 16, ๐›ฝ = โˆš๐‘›1 โ‰ค 14 , it gives (1) 2๐œ†๐œ– โˆฅ๐ป โˆฅ ๐น 9 ๐›ฝ ๐›ฝ โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ ๐‘› (2) โˆฅ๐ป โˆฅ ๐น โˆ’ ( + )โˆฅPฮ  โŠฅ ๐ป2 โˆฅ โˆ— + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— โˆš๏ธ 10 3 60 ๐›ฝ 1 59 โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ ๐‘› (2) โˆฅ๐ป โˆฅ ๐น + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + (โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— โˆ’ โˆฅPฮ  โŠฅ ๐ป2 โˆฅ โˆ— ) โˆš๏ธ 60 60 60 ๐›ฝ 1 59 = ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ ๐‘› (2) โˆฅ๐ป โˆฅ ๐น + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + (โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— โˆ’ โˆฅPฮ  โŠฅ (โˆ’๐ป2 )โˆฅ โˆ— ) โˆš๏ธ 60 60 60 ๐›ฝ 1 59 โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ ๐‘› (2) โˆฅ๐ป โˆฅ ๐น + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— โˆ’ โˆฅPฮ  โŠฅ (๐ป1 + ๐ป2 )โˆฅ โˆ— โˆš๏ธ 60 60 60 ๐›ฝ 1 โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ ๐‘› (2) โˆฅ๐ป โˆฅ ๐น + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— โˆ’ โˆฅ๐ป โˆฅ โˆ— . โˆš๏ธ 60 60 The last inequality is due to โˆฅPฮ  โŠฅ ๐ป โˆฅ โˆ— = sup โŸจPฮ  โŠฅ ๐ป, ๐‘‹โŸฉ = sup โŸจ๐ป, Pฮ  โŠฅ ๐‘‹โŸฉ โˆฅ ๐‘‹ โˆฅ 2 โ‰ค1 โˆฅ ๐‘‹ โˆฅ 2 โ‰ค1 โ‰ค sup โŸจ๐ป, Pฮ  โŠฅ ๐‘‹โŸฉ โ‰ค sup โŸจ๐ป, ๐‘‹โŸฉ = โˆฅ๐ป โˆฅ โˆ— . โˆฅPฮ  โŠฅ ๐‘‹ โˆฅ 2 โ‰ค1 โˆฅ ๐‘‹ โˆฅ 2 โ‰ค1 โˆš Note that โˆฅ๐ป โˆฅ โˆ— โ‰ค ๐‘› (2) โˆฅ๐ป โˆฅ ๐น , then we obtain ๐›ฝ 1 2๐œ†๐œ– โˆฅ๐ป โˆฅ ๐น โ‰ฅ ๐œ†โˆฅ๐ป โˆฅ 2๐น โˆ’ 2 ๐‘› (2) โˆฅ๐ป โˆฅ ๐น + โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— . โˆš๏ธ 60 60 107 Rewrite this inequality gives ๐›ฝ 1 โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— โ‰ค โˆ’๐œ†โˆฅ๐ป โˆฅ 2๐น + 2( ๐‘› (2) + ๐œ†๐œ–)โˆฅ๐ป โˆฅ ๐น โˆš๏ธ 60 60 โˆš โˆš ๐‘› (2) + ๐œ†๐œ– 2 ๐‘› (2) โˆš = โˆ’๐œ†(โˆฅ๐ป โˆฅ ๐น โˆ’ ( )) + ( โˆš + ๐œ†๐œ–) 2 ๐œ† ๐œ† โˆš ๐‘› (2) โˆš 2 โ‰ค ( โˆš + ๐œ†๐œ–) . 
๐œ† Recall that in our original optimization problem, we should consider above inequalities for the summation of all the local patches, denote โ„Ž๐‘– โ‰ก โˆฅ๐ป (๐‘–) โˆฅ ๐น , then ๐‘› ๐‘› โˆ‘๏ธ ๐›ฝ๐‘– (๐‘–) 1 โˆ‘๏ธ (๐‘–) โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— 60 ๐‘– 60 ๐‘– ๐‘–=1 ๐‘–=1 โˆ‘๏ธ๐‘› โˆ’๐œ†๐‘– โˆฅ๐ป (๐‘–) โˆฅ 2๐น + 2 min{๐‘˜ ๐‘– + 1, ๐‘}โˆฅ๐ป (๐‘–) โˆฅ ๐น + 2๐œ†๐‘– ๐œ–๐‘– โˆฅ๐ป (๐‘–) โˆฅ ๐น โˆš๏ธ โ‰ค ๐‘–=1 ๐‘› โˆ‘๏ธ โˆ’๐œ†๐‘– โ„Ž๐‘–2 + 2 min{๐‘˜ ๐‘– + 1, ๐‘}โ„Ž๐‘– + 2๐œ†๐‘– ๐œ–๐‘– โ„Ž๐‘– โˆš๏ธ = ๐‘–=1 ๐‘› โˆš๏ธ โˆš๏ธ โˆ‘๏ธ min{๐‘˜ ๐‘– + 1, ๐‘} + ๐œ†๐‘– ๐œ–๐‘– 2 min{๐‘˜ ๐‘– + 1, ๐‘} โˆš๏ธ = โˆ’๐œ†๐‘– (โ„Ž๐‘– โˆ’ ) +( โˆš + ๐œ†๐‘– ๐œ–๐‘– ) 2 ๐œ†๐‘– ๐œ†๐‘– ๐‘–=1 โˆ‘๏ธ๐‘› โˆš๏ธ โ‰ค4 min{๐‘˜ ๐‘– + 1, ๐‘}๐œ–๐‘– , ๐‘–=1 โˆš min{๐‘˜ ๐‘– +1,๐‘} 1 where we choose ๐œ†๐‘– = ๐œ–๐‘– , and ๐›ฝ๐‘– = โˆš . max{๐‘˜ ๐‘– +1,๐‘} ร๐‘› (๐‘–) ร๐‘› (๐‘–) Then we have the bound for ๐‘–=1 โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— and ๐‘–=1 โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 , ๐‘– ๐‘– ๐‘› ๐‘› โˆš โˆš๏ธƒ โˆš๏ธƒ (๐‘–) โˆ‘๏ธ โˆ‘๏ธ โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— ยฏ โ‰ค ๐ถ min{ ๐‘˜, ๐‘} ๐œ–๐‘– โ‰ค ๐ถ min{ ๐‘˜, ยฏ ๐‘} ๐‘›โˆฅ๐œ– โˆฅ 2 , ๐‘– ๐‘–=1 ๐‘–=1 108 ๐‘› ๐‘› โˆš๏ธ (๐‘–) โˆ‘๏ธ โˆš๏ธƒ โˆ‘๏ธ โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 โ‰ค ๐ถ max max{๐‘˜ ๐‘– , ๐‘} min{๐‘˜ ๐‘– , ๐‘}๐œ–๐‘– ๐‘– ๐‘– ๐‘–=1 ๐‘–=1 โˆš๏ธƒ ๐‘› โˆš๏ธ โˆ‘๏ธ = ๐ถ max{ ๐‘˜, ๐‘} ยฏ min{๐‘˜ ๐‘– , ๐‘}๐œ–๐‘– ๐‘–=1 โˆš๏ธƒ โˆ‘๏ธ๐‘› ยฏ โ‰ค ๐ถ max{ ๐‘˜, ๐‘} min{ ๐‘˜, ๐‘} ยฏ ๐œ–๐‘– ๐‘–=1 โˆš โˆš๏ธƒ โ‰ค ๐ถ ๐‘ ๐‘˜ยฏ ๐‘›โˆฅ๐œ– โˆฅ 2 . (๐‘–) Denote ๐ป2 โ‰ก P๐‘– ( ๐‘†) ห† โˆ’ ๐‘† (๐‘–) , to estimate the error bound of ร๐‘› โˆฅ๐ป (๐‘–) โˆฅ 2,1 , we decompose ๐ป (๐‘–) into ๐‘–=1 2 2 three parts, for each ๐‘– = 1, 2, ยท ยท ยท ๐‘› (๐‘–) (๐‘–) (๐‘–) (๐‘–) โˆฅ๐ป2 โˆฅ ๐น โ‰ค โˆฅ(๐ผ โˆ’ Pฮฉ๐‘– )๐ป2 โˆฅ ๐น + โˆฅ(Pฮฉ๐‘– โˆ’ Pฮฉ๐‘– Pฮ ๐‘– )๐ป2 โˆฅ ๐น + โˆฅPฮฉ๐‘– Pฮ ๐‘– ๐ป2 โˆฅ ๐น (๐‘–) (๐‘–) 1 (๐‘–) โ‰ค โˆฅPฮฉโŠฅ ๐ป2 โˆฅ ๐น + โˆฅPฮ  โŠฅ ๐ป2 โˆฅ ๐น + โˆฅ๐ป2 โˆฅ ๐น , ๐‘– ๐‘– 4 which leads to (๐‘–) 4 (๐‘–) (๐‘–) โˆฅ๐ป2 โˆฅ ๐น โ‰ค (โˆฅPฮฉโŠฅ ๐ป2 โˆฅ ๐น + โˆฅPฮ  โŠฅ ๐ป2 โˆฅ ๐น ) 3 ๐‘– ๐‘– 4 (๐‘–) (๐‘–) = (โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + โˆฅPฮ  โŠฅ ๐ป (๐‘–) โˆฅ ๐น ) 3 ๐‘– ๐‘– ๐‘– 4 (๐‘–) (๐‘–) โ‰ค (โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + โˆฅ๐ป (๐‘–) โˆฅ ๐น ). 
3 ๐‘– ๐‘– โˆš ร๐‘› (๐‘–) min{๐‘˜ ๐‘– +1,๐‘} Next, we need to bound ๐‘–=1 โˆฅ๐ป โˆฅ ๐น , note that ๐œ†๐‘– = ๐œ– and ๐‘– โˆ‘๏ธ๐‘› โˆ’๐œ†๐‘– โ„Ž๐‘–2 + 2 min{๐‘˜ ๐‘– + 1, ๐‘}โ„Ž๐‘– + 2๐œ†๐‘– ๐œ–๐‘– โ„Ž๐‘– โ‰ฅ 0, โˆš๏ธ ๐‘–=1 which gives โˆš๏ธƒ โˆ‘๏ธ๐‘› ๐‘› โˆš๏ธ โˆ‘๏ธ ยฏ 4 min{ ๐‘˜ + 1, ๐‘} โ„Ž๐‘– โ‰ฅ 4 min{๐‘˜ ๐‘– + 1, ๐‘}โ„Ž๐‘– ๐‘–=1 ๐‘–=1 โˆ‘๏ธ๐‘› โˆš๏ธ โ„Ž2 โ‰ฅ min{๐‘˜ ๐‘– + 1, ๐‘} ๐‘– ๐œ–๐‘– ๐‘–=1 โˆš๏ธƒ ๐‘› โ„Ž2 โˆ‘๏ธ ๐‘– โ‰ฅ min{๐‘˜ + 1, ๐‘} , ๐œ–๐‘– ๐‘–=1 109 by Cauchy inequality ๐‘› โ„Ž2 โ„Ž๐‘– ) 2 ( ๐‘–=1 โ„Ž๐‘– ) 2 ร๐‘› ร๐‘› โˆ‘๏ธ ๐‘– ( ๐‘–=1 โ‰ฅ ร๐‘› โ‰ฅ โˆš , ๐‘–=1 ๐‘– ๐œ– ๐‘–=1 ๐œ–๐‘– ๐‘›โˆฅ๐œ– โˆฅ 2 then we obtain โˆš๏ธ„ ๐‘› โˆ‘๏ธ min{ ๐‘˜ยฏ + 1, ๐‘} โˆš โˆš โ„Ž๐‘– โ‰ค 4 ๐‘›โˆฅ๐œ– โˆฅ 2 โ‰ค ๐ถ ๐‘›โˆฅ๐œ– โˆฅ 2 , min{๐‘˜ + 1, ๐‘} ๐‘–=1 โ‰ค ๐‘˜ โ‰ค ๐‘˜ยฏ โ‰ค 2๐‘›๐‘ž, which is guaranteed with high probability by Lemma ๐‘›๐‘ž the last inequality is due to 2 B.1.4, thus ๐‘› ๐‘› ๐‘› ๐‘› โˆ‘๏ธ (๐‘–) 4 โˆ‘๏ธ (๐‘–) โˆ‘๏ธ (๐‘–) โˆ‘๏ธ โˆฅ๐ป2 โˆฅ ๐น โ‰ค ( โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + โˆฅ๐ป (๐‘–) โˆฅ ๐น ) 3 ๐‘– ๐‘– ๐‘–=1 ๐‘–=1 ๐‘–=1 ๐‘–=1 ๐‘› ๐‘› ๐‘› 4 โˆ‘๏ธ (๐‘–) โˆ‘๏ธ (๐‘–) โˆ‘๏ธ = ( โˆฅPฮฉโŠฅ ๐ป2 โˆฅ 1 + โˆฅPฮ  โŠฅ ๐ป1 โˆฅ โˆ— + โ„Ž๐‘– ) 3 ๐‘– ๐‘– ๐‘–=1 ๐‘–=1 ๐‘–=1 โˆš๏ธƒ โ‰ค ๐ถ ๐‘ ๐‘˜๐‘›โˆฅ๐œ–ยฏ โˆฅ2. (๐‘–) (๐‘–) Now letโ€™s divide ๐ป2 into columns to get the โ„“2,1 norm error bound, denote (๐ป2 ) ๐‘— as the ๐‘—th (๐‘–) column in ๐ป2 , then we can derive the โ„“2,1 norm error bound in Lemma B.1.3 v u u +1 โˆš๏ธƒ ๐‘› ๐‘› t ๐‘˜โˆ‘๏ธ ๐‘– (๐‘–) (๐‘–) โˆ‘๏ธ โˆ‘๏ธ ๐ถ ๐‘ ๐‘˜๐‘›โˆฅ๐œ–ยฏ โˆฅ2 โ‰ฅ โˆฅ๐ป2 โˆฅ ๐น = โˆฅ(๐ป2 ) ๐‘— โˆฅ 22 ๐‘–=1 ๐‘–=1 ๐‘—=1 v u u ๐‘› โˆ‘๏ธ t ๐‘˜ ๐‘– 1 โˆ‘๏ธ (๐‘–) โ‰ณ ( โˆฅ(๐ป2 ) ๐‘— โˆฅ 2 ) 2 ๐‘˜๐‘– ๐‘–=1 ๐‘—=1 ๐‘› โˆ‘๏ธ ๐‘˜๐‘– 1 โˆ‘๏ธ (๐‘–) โ‰ณโˆš โˆฅ(๐ป2 ) ๐‘— โˆฅ 2 . ๐‘˜ยฏ ๐‘–=1 ๐‘—=1 Then we obtain ๐‘› ๐‘˜โˆ‘๏ธ ๐‘– +1 โˆ‘๏ธ ห† โˆ’ ๐‘† (๐‘–) โˆฅ โˆ‘๏ธ (๐‘–) โˆš โˆฅP๐‘– ( ๐‘†) 2,1 = โˆฅ(๐ป2 ) ๐‘— โˆฅ 2 โ‰ค ๐ถ ๐‘๐‘› ๐‘˜ยฏ โˆฅ๐œ– โˆฅ 2 . ๐‘– ๐‘–=1 ๐‘—=1 โ–ก ๐‘ž๐‘› Lemma B.1.4. If ๐‘ž๐‘› โ‰ฅ 9 log ๐‘›, with probability at least 1 โˆ’ 2 exp(โˆ’๐‘ 3 ๐‘ž๐‘›), 2 โ‰ค ๐‘˜ ๐‘– โ‰ค 2๐‘ž๐‘›, for all ๐‘– = 1, 2, ยท ยท ยท , ๐‘›, here ๐‘ 3 is some constants not related to ๐‘ž and ๐‘›. 110 Proof: Since each ๐‘‹๐‘– is drawn from a manifold M โІ R ๐‘ independently and uniformly, for some fixed (๐œ‚๐‘–0 , ๐‘ž)-neighborhood of ๐‘‹๐‘–0 , for each ๐‘— = {1, 2, ยท ยท ยท , ๐‘›}\{๐‘– 0 }, the probability that ๐‘‹ ๐‘— falls into (๐œ‚๐‘–0 , ๐‘ž)-neighborhood is ๐‘ž. Since {๐‘‹๐‘– }๐‘–=1,2,ยทยทยท๐‘› , ๐‘˜ ๐‘– follows i.i.d binomial distribution ๐ต(๐‘›, ๐‘ž), we can apply large deviations inequalities to derive an upper and lower bound for ๐‘˜ ๐‘– . By Theorem 1 in Janson (2016), we have that for each ๐‘– = 1, 2, ยท ยท ยท , ๐‘› (๐‘ž๐‘›) 2 3 P(๐‘˜ ๐‘– > 2๐‘ž๐‘›) โ‰ค exp(โˆ’ ) โ‰ค exp(โˆ’ ๐‘ž๐‘›), 2(๐‘ž๐‘›(1 โˆ’ ๐‘ž) + ๐‘ž๐‘›/3) 8 ๐‘ž๐‘› (๐‘ž๐‘›/2) 2 1 P(๐‘˜ ๐‘– < ) โ‰ค exp(โˆ’ ) = exp(โˆ’ ๐‘ž๐‘›). 2 2๐‘ž๐‘› 8 Therefore by Union Bound Theorem ๐‘ž๐‘› 3 1 P( โ‰ค ๐‘˜ ๐‘– โ‰ค 2๐‘ž๐‘›, โˆ€๐‘– = 1, 2, ยท ยท ยท ๐‘›) โ‰ฅ 1 โˆ’ ๐‘›(exp(โˆ’ ๐‘ž๐‘›) + exp(โˆ’ ๐‘ž๐‘›)) 2 8 8 1 โ‰ฅ 1 โˆ’ 2๐‘› exp(โˆ’ ๐‘ž๐‘›) 8 1 = 1 โˆ’ 2 exp(โˆ’ ๐‘ž๐‘› + log ๐‘›) 8 1 โ‰ฅ 1 โˆ’ 2 exp(โˆ’ ๐‘ž๐‘›). 
72 โ–ก B.1.2 Removing (B.2) and (B.3) in Assumption B.1.1 We will show that under our assumption that points are uniformly drawn from the manifold, (B.2) and (B.3) in Assumption B.1.1 automatically hold provided (B.1) holds, thus they can be removed from the requirements. Let us again restrict our attention to an individual patch and for the simplicity of notation, ignore the superscript ๐‘– (the treatment for all patches are the same). Recall that C(๐ฟ) = C(๐‘‡) = ๐‘ˆฮฃ๐‘‰ โˆ— , and ๐‘‰หœ is the orthonormal basis of ๐‘ ๐‘๐‘Ž๐‘›([1,๐‘‰]), since 0 = C(๐‘‡)1 = ๐‘ˆฮฃ๐‘‰ โˆ— 1, we have ๐‘‰ โˆ— 1 = 0, then 1 โŠฅ ๐‘ ๐‘๐‘Ž๐‘›(๐‘‰), thus we can write one basis for ๐‘ ๐‘๐‘Ž๐‘›([1,๐‘‰]) as [ โˆš 1 1,๐‘‰], which indicates that ๐‘˜+1 in order to remove (B.2), we only need to show that with high probability, ๐‘‰ has small coherence. Also, recall that ๐‘‡ (๐‘–) = ๐‘ƒ๐‘‡๐‘‹ (๐‘‹ (๐‘–) โˆ’ ๐‘‹๐‘– 1๐‘‡ ), since each ๐‘‹๐‘– is independent, each column in ๐‘‡ (๐‘–) is ๐‘– also independent. In addition, each column is in the span of the tangent space with ๐‘ˆ being 111 an orthonormal basis. Therefore ๐‘‡ = ๐‘ˆฮ› โ‰ก ๐‘ˆ [๐›ผ1 , ๐›ผ2 , ..., ๐›ผ ๐‘˜ , 0], where ๐›ผ๐‘– , ๐‘– = 1, 2, ยท ยท ยท , ๐‘˜ is the ๐‘–th column of ฮ›, which corresponds to the coefficients of the ๐‘–th column of ๐‘‡ under ๐‘ˆ, the last column is zero vector since it corresponds to ๐‘‹๐‘– itself. Since columns of ๐‘‡ are i.i.d, then ๐›ผ๐‘– s are also i.i.d., so they all obey the same distribution as a random vector ๐›ผ. We establish the following lemma for the right singular vectors of ๐‘‡. Lemma B.1.5. Let C(๐‘‡) = ๐‘ˆฮฃ๐‘‰ โˆ— be the reduced singular vector decomposition of C(๐‘‡), assume ๐ถ โ‰ก E((๐›ผ โˆ’ E๐›ผ)(๐›ผ โˆ’ E๐›ผ) โˆ— ) has a finite condition number. Then, with probability at least 1 โˆ’ 2๐‘‘ exp(โˆ’๐‘๐‘˜), the right singular vector ๐‘‰ obeys ๐‘ max โˆฅ๐‘‰ โˆ— e ๐‘— โˆฅ 2 โ‰ค , 1โ‰ค ๐‘— โ‰ค๐‘˜ ๐‘˜ and with (1) in Assumption B.1.1 โˆš๏ธ„ ๐‘๐‘‘ โˆฅ๐‘ˆ๐‘‰ โˆ— โˆฅ โˆž โ‰ค . ๐‘๐‘˜ Proof: As discussed above, C(๐‘‡) has the following representation C(๐‘‡) = ๐‘‡๐บ = ๐‘ˆ [๐›ผ1 , ๐›ผ2 , ยท ยท ยท , ๐›ผ ๐‘˜ , 0]๐บ , where ๐‘ˆ โˆˆ R ๐‘ร—๐‘‘ is an orthonormal basis of the tangent space, and ฮ› = [๐›ผ1 , ๐›ผ2 , ..., ๐›ผ ๐‘˜ , 0] โˆˆ R๐‘‘ร—(๐‘˜+1) is the coefficients of randomly drawn points in a neighbourhood projected to the tangent space. Since points are randomly drawn from an neighbourhood contained in a ball of radius at most ๐œ‚, one can easily verify that โˆฅ๐›ผ ๐‘— โˆฅ 2 โ‰ค ๐œ‚ for each ๐‘— = 1, ..., ๐‘˜. Assume ๐‘‡๐บ and ฮ› have the reduced SVD of the form ๐‘‡๐บ = ๐‘ˆฮฃ๐‘‰ โˆ— , ฮ›๐บ = ๐‘ˆฮ› ฮฃฮ›๐‘‰ฮ›โˆ— , Then ๐‘‡ can be written as ๐‘‡๐บ = ๐‘ˆฮฃ๐‘‰ โˆ— = ๐‘ˆ๐‘ˆฮ› ฮฃฮ›๐‘‰ฮ›โˆ— . It can be verified that null(๐‘‡๐บ) is the span of columns in (๐‘‰ฮ› ) ๐ถ , then we have ๐‘ ๐‘๐‘Ž๐‘›(๐‘‰ฮ› ) = ๐‘ ๐‘๐‘Ž๐‘›(๐‘‰), since both ๐‘‰ฮ› and ๐‘‰ are orthonormal, they are equal up to a rotation, i.e. โˆƒ๐‘… โˆˆ R๐‘‘,๐‘‘ , ๐‘… โˆ— ๐‘… = ๐‘…๐‘… โˆ— = ๐ผ, such that ๐‘‰ = ๐‘‰ฮ› ๐‘…. Then max โˆฅ๐‘‰ โˆ— e ๐‘— โˆฅ 2 = max โˆฅ๐‘… โˆ—๐‘‰ฮ›โˆ— e ๐‘— โˆฅ 2 = max โˆฅ๐‘‰ฮ›โˆ— e ๐‘— โˆฅ 2 . 1โ‰ค ๐‘— โ‰ค๐‘˜ 1โ‰ค ๐‘— โ‰ค๐‘˜ 1โ‰ค ๐‘— โ‰ค๐‘˜ 112 Next we bound the coherence of ๐‘‰ฮ› . 
Since ๐‘‰ฮ›โˆ— = ฮฃฮ› โˆ’1๐‘ˆ โˆ— ฮ›๐บ, we have ฮ› max โˆฅฮฃฮ› โˆ’1๐‘ˆ โˆ— ฮ›๐บe โˆฅ โ‰ค โˆฅฮฃโˆ’1 โˆฅ max โˆฅ๐‘ˆ โˆ— ฮ›๐บe โˆฅ 1โ‰ค ๐‘— โ‰ค๐‘˜ ฮ› ๐‘— ฮ› 1โ‰ค๐‘–โ‰ค๐‘˜ ฮ› ๐‘— = โˆฅฮฃฮ›โˆ’1 โˆฅ max โˆฅฮ›๐บe โˆฅ ๐‘— 1โ‰ค ๐‘— โ‰ค๐‘˜ โ‰ค โˆฅฮฃฮ›โˆ’1 โˆฅ max โˆฅ๐›ผ โˆ’ ๐›ผโˆฅ ๐‘— ยฏ 1โ‰ค ๐‘— โ‰ค๐‘˜ โ‰ค 2๐œ‚โˆฅฮฃฮ› โˆ’1 โˆฅ. Recall that 1 ฮ›๐บ = [๐›ผ1 , ๐›ผ2 , ยท ยท ยท , ๐›ผ ๐‘˜ , 0] (๐ผ โˆ’ 11๐‘‡ ) ๐‘˜ +1 = [๐›ผ1 โˆ’ ๐›ผ, ยฏ ๐›ผ2 โˆ’ ๐›ผ, ยฏ ยท ยท ยท , ๐›ผ ๐‘˜ โˆ’ ๐›ผ, ยฏ โˆ’๐›ผ]ยฏ = [๐›ผ1 โˆ’ E๐›ผ, ๐›ผ2 โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผ ๐‘˜ โˆ’ E๐›ผ, 0] โˆ’ [ ๐›ผยฏ โˆ’ E๐›ผ, ๐›ผยฏ โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผยฏ โˆ’ E๐›ผ, ๐›ผ], ยฏ 1 ร ๐‘˜ ๐›ผ , thus where ๐›ผยฏ = ๐‘˜+1 ๐‘–=1 ๐‘– |๐œŽ๐‘‘ (ฮ›๐บ) โˆ’ ๐œŽ๐‘‘ ([๐›ผ1 โˆ’ E๐›ผ, ๐›ผ2 โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผ ๐‘˜ โˆ’ E๐›ผ, 0])| โ‰คโˆฅ [ ๐›ผยฏ โˆ’ E๐›ผ, ๐›ผยฏ โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผยฏ โˆ’ E๐›ผ, ๐›ผ] ยฏ โˆฅ2 โ‰คโˆฅ [ ๐›ผยฏ โˆ’ E๐›ผ, ๐›ผยฏ โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผยฏ โˆ’ E๐›ผ, ๐›ผยฏ โˆ’ E๐›ผ] โˆฅ 2 + โˆฅE๐›ผโˆฅ 2 (B.7) โˆš 1 โˆ‘๏ธ ๐‘˜ 1 โ‰ค ๐‘˜ + 1โˆฅ (๐›ผ๐‘– โˆ’ E๐›ผ) โˆ’ E๐›ผโˆฅ 2 + ๐œ‚ ๐‘˜ +1 ๐‘˜ +1 ๐‘–=1 ๐‘˜ 1 โˆ‘๏ธ โ‰คโˆฅ โˆš (๐›ผ๐‘– โˆ’ E๐›ผ)โˆฅ 2 + 2๐œ‚ ๐‘˜ + 1 ๐‘–=1 First, we want to use Bernstein Matrix Inequality to bound the โ„“2 -norm in the last inequality. Denote ๐›ฝ๐‘– = โˆš 1 (๐›ผ๐‘– โˆ’ E๐›ผ), ๐‘ = ๐‘–=1 ร๐‘˜ ๐›ฝ๐‘– , then ๐›ฝ๐‘– is independent, we also have ๐‘˜+1 1 2๐œ‚ E๐›ฝ๐‘– = 0, โˆฅ ๐›ฝ๐‘– โˆฅ 2 โ‰ค โˆš (โˆฅ๐›ผ๐‘– โˆฅ 2 + โˆฅE๐›ผโˆฅ 2 ) โ‰ค โˆš , ๐‘˜ +1 ๐‘˜ 113 which means ๐›ฝ๐‘– has mean zero and is uniformly bounded, also ๐œˆ(๐‘) = max{โˆฅE(๐‘ ๐‘ โˆ— )โˆฅ 2 , โˆฅE(๐‘ โˆ— ๐‘)โˆฅ 2 } โˆ‘๏ธ๐‘› ๐‘› โˆ‘๏ธ = max{โˆฅ E(๐›ฝ๐‘– ๐›ฝ๐‘–โˆ— )โˆฅ 2 , โˆฅ E(๐›ฝ๐‘–โˆ— ๐›ฝ๐‘– )โˆฅ 2 } ๐‘–=1 ๐‘–=1 ๐‘˜ = max{โˆฅE(๐›ผ๐‘– โˆ’ E๐›ผ)(๐›ผ๐‘– โˆ’ E๐›ผ)๐‘‡ โˆฅ 2 , E tr((๐›ผ๐‘– โˆ’ E๐›ผ)(๐›ผ๐‘– โˆ’ E๐›ผ)๐‘‡ ))} ๐‘˜ +1 < max{โˆฅ๐ถ โˆฅ 2 , tr(๐ถ)} < ๐‘‘๐œŽ1 (๐ถ). By assumption, ๐ถ has finite condition number, and ๐‘‘ โ‰ช ๐‘˜, by Matrix Bernstein inequality, we are able to bound the spectral norm of ๐‘ โˆ’๐‘ก 2 ๐‘ƒ(โˆฅ๐‘ โˆฅ 2 โ‰ฅ ๐‘ก) โ‰ค (๐‘‘ + 1) exp( ) 2๐œ‚๐‘ก ๐‘‘๐œŽ1 (๐ถ) + โˆš 3 ๐‘˜ โˆš ๐œŽ๐‘‘ (๐ถ)๐‘˜ Let ๐‘ก = 4 ,we have โˆš๏ธ ๐œŽ๐‘‘ (๐ถ)๐‘˜ ๐‘ƒ(โˆฅ๐‘ โˆฅ 2 โ‰ฅ ) โ‰ค ๐‘‘ exp(โˆ’๐‘๐‘˜). (B.8) 4 Next, equipped with Matrix Bernstein inequality again, we can prove that ๐œŽ๐‘‘ ([๐›ผ1 โˆ’ E๐›ผ, ๐›ผ2 โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผ ๐‘˜ โˆ’ E๐›ผ, 0]) concentrates around ๐œŽ๐‘‘ (๐ถ). Note that ๐œŽ๐‘‘2 ([๐›ผ1 โˆ’ E๐›ผ, ๐›ผ2 โˆ’ E๐›ผ, ยท ยท ยท , ๐›ผ ๐‘˜ โˆ’ ร๐‘˜ E๐›ผ, 0]) = ๐œŽ๐‘‘ ( ๐‘–=1 (๐›ผ๐‘– โˆ’ E๐›ผ)(๐›ผ๐‘– โˆ’ E๐›ผ)๐‘‡ ), we consider ๐‘˜ โˆ‘๏ธ โˆ‘๏ธ๐‘› |๐œŽ๐‘‘ ( (๐›ผ๐‘– โˆ’ E๐›ผ)(๐›ผ๐‘– โˆ’ E๐›ผ)๐‘‡ ) โˆ’ ๐‘˜๐œŽ๐‘‘ (๐ถ)| โ‰ค โˆฅ (๐›ผ๐‘– โˆ’ E๐›ผ)(๐›ผ๐‘– โˆ’ E๐›ผ)๐‘‡ โˆ’ ๐‘˜๐ถ โˆฅ 2 ๐‘–=1 ๐‘–=1 Similar as what we discussed above, let ๐‘ ๐‘— = (๐›ผ ๐‘— โˆ’ E๐›ผ)(๐›ผ ๐‘— โˆ’ E๐›ผ)๐‘‡ โˆ’ ๐ถ, ๐‘— = 1, 2, ยท ยท ยท , ๐‘˜. It can be verified that ๐‘ ๐‘— is bounded โˆฅ๐‘ ๐‘— โˆฅ 2 โ‰ค โˆฅ๐›ผ ๐‘— โˆ’ E๐›ผโˆฅ 22 + ๐œŽ1 (๐ถ) โ‰ค 2๐œ‚2 + ๐œŽ1 (๐ถ) โ‰ก ๐‘ 4 . Since ๐‘ ๐‘— follows i.i.d distribution, we also have ๐œˆ(๐‘) โ‰ค ๐‘˜๐‘ 5 for some constant ๐‘ 5 which represents the variance of ๐‘ ๐‘— . 
Applying the matrix Bernstein inequality, we obtain
$$\mathbb{P}\Bigg(\Big\|\sum_{j=1}^k(\alpha_j-\mathbb{E}\alpha)(\alpha_j-\mathbb{E}\alpha)^T-kC\Big\|_2\ge t\Bigg)\le2d\exp\Bigg(\frac{-t^2/2}{kc_5+\frac{c_4t}{3}}\Bigg).$$
Furthermore, taking $t=\frac{3k\sigma_d(C)}{4}$, with probability over $1-2d\exp(-c_6k)$ for some constant $c_6$, the following holds:
$$\Big|\sigma_d\Big(\sum_{i=1}^k(\alpha_i-\mathbb{E}\alpha)(\alpha_i-\mathbb{E}\alpha)^T\Big)-k\sigma_d(C)\Big|\le\Big\|\sum_{i=1}^k(\alpha_i-\mathbb{E}\alpha)(\alpha_i-\mathbb{E}\alpha)^T-kC\Big\|_2<\frac{3k\sigma_d(C)}{4},$$
which leads to
$$\sigma_d^2([\alpha_1-\mathbb{E}\alpha,\alpha_2-\mathbb{E}\alpha,\cdots,\alpha_k-\mathbb{E}\alpha])=\sigma_d\Big(\sum_{i=1}^k(\alpha_i-\mathbb{E}\alpha)(\alpha_i-\mathbb{E}\alpha)^T\Big)>\frac{k\sigma_d(C)}{4},$$
thus
$$\sigma_d([\alpha_1-\mathbb{E}\alpha,\alpha_2-\mathbb{E}\alpha,\cdots,\alpha_k-\mathbb{E}\alpha])>\frac{\sqrt{k\sigma_d(C)}}{2}.\qquad(B.9)$$
Combining (B.7), (B.8) and (B.9), we have proved that with probability at least $1-d\exp(-ck)$, $\sigma_d(\Lambda G)\gtrsim\sqrt{k}$, therefore $\|\Sigma_\Lambda^{-1}\|\lesssim\frac{1}{\sqrt{k}}$, which further gives $\max_{1\le j\le k+1}\|V^*e_j\|^2\lesssim\frac{1}{k}$. Finally, with (B.1) in Assumption B.1.1, (B.3) is also satisfied with the same probability, since
$$\|UV^*\|_\infty\le\max_j\|U^*e_j\|_2\,\max_l\|V^*e_l\|_2\le\sqrt{\frac{cd}{pk}}.$$
Hence (B.3) in Assumption B.1.1 can also be removed. $\Box$

The above discussion is valid for each patch individually, i.e., with probability at least $1-d\exp(-ck_i)\ge1-d\exp(-ck)$, (B.2) and (B.3) hold for any fixed $i=1,2,\cdots,n$. By the union bound, with probability at least $1-nd\exp(-ck)$, (B.2) and (B.3) hold for all the local patches. Note that $1-nd\exp(-ck)=1-\exp(-ck+\log n)$, where we omit $d$ since it is very small. By Lemma B.1.4, with probability at least $1-2\exp(-c_1qn)$, $\frac{nq}{2}\le k_i\le2nq$ for all $i=1,2,\cdots,n$. Using the assumption in Theorem 3.3.1 that $qn\ge c_2\log n$ for some constant $c_2$ large enough, we can see that with probability over $1-\exp(-c_3k)$, the requirements (B.2) and (B.3) automatically hold due to the i.i.d. assumption on the samples, which enables us to remove these assumptions from Theorem 3.3.1.

B.1.3 Proof of the convergence of $\frac{\hat{\lambda}_i-\lambda_i^*}{\lambda_i^*}$ as $k\to\infty$

When $k$ is large enough, $\min\{k+1,p\}=p$, $\hat{\lambda}_i=\frac{\sqrt{p}}{\hat{\epsilon}_i}$, $\lambda_i^*=\frac{\sqrt{p}}{\epsilon_i}$, then
$$\frac{\hat{\lambda}_i-\lambda_i^*}{\lambda_i^*}=\frac{\frac{\sqrt{p}}{\hat{\epsilon}_i}-\frac{\sqrt{p}}{\epsilon_i}}{\frac{\sqrt{p}}{\epsilon_i}}=\frac{\epsilon_i-\hat{\epsilon}_i}{\hat{\epsilon}_i}=\frac{\epsilon_i}{\hat{\epsilon}_i}-1.$$
In order to show $\big|\frac{\hat{\lambda}_i-\lambda_i^*}{\lambda_i^*}\big|\xrightarrow{k\to\infty}0$, it is sufficient to prove that $\frac{\epsilon_i^2-\hat{\epsilon}_i^2}{\hat{\epsilon}_i^2}=\frac{\epsilon_i^2}{\hat{\epsilon}_i^2}-1\xrightarrow{k\to\infty}0$, since this yields $\frac{\epsilon_i}{\hat{\epsilon}_i}\xrightarrow{k\to\infty}1$ and hence $\frac{\hat{\lambda}_i-\lambda_i^*}{\lambda_i^*}\xrightarrow{k\to\infty}0$.
Notice that
$$\frac{\epsilon_i^2-\hat{\epsilon}_i^2}{\hat{\epsilon}_i^2}=\frac{\|R^{(i)}+N^{(i)}\|_F^2-\Big((k+1)p\sigma^2+\bar{\Gamma}^2(X_i)\sum_{j=1}^k\frac{\|X_i-X_{i_j}\|_2^4}{4}\Big)}{(k+1)p\sigma^2+\bar{\Gamma}^2(X_i)\sum_{j=1}^k\frac{\|X_i-X_{i_j}\|_2^4}{4}}$$
$$\le\frac{\Big(\|N^{(i)}\|_F^2-(k+1)p\sigma^2\Big)+\Big(\|R^{(i)}\|_F^2-\bar{\Gamma}^2(X_i)\sum_{j=1}^k\frac{\|X_i-X_{i_j}\|_2^4}{4}\Big)+2\langle N^{(i)},R^{(i)}\rangle}{kp\sigma^2}$$
$$\le\frac{\|N^{(i)}\|_F^2-(k+1)p\sigma^2}{kp\sigma^2}+\frac{\|R^{(i)}\|_F^2-\bar{\Gamma}^2(X_i)\sum_{j=1}^k\frac{\|X_i-X_{i_j}\|_2^4}{4}}{kp\sigma^2}+\frac{2\sum_j\langle N_j^{(i)},R_j^{(i)}\rangle}{(k+1)p\sigma^2}.$$
Since each entry of $N^{(i)}$ is i.i.d. obeying $\mathcal{N}(0,\sigma^2)$, the $\langle N_j^{(i)},R_j^{(i)}\rangle$ are also i.i.d. with $\mathbb{E}\langle N_j^{(i)},R_j^{(i)}\rangle=0$; by the law of large numbers, the first and third terms approach $0$ as $k\to\infty$. Also, by (3.14) and (3.15) in Section 3.4, the second term approaches $0$ as well; thus $\frac{\epsilon_i^2-\hat{\epsilon}_i^2}{\hat{\epsilon}_i^2}\xrightarrow{k\to\infty}0$.

APPENDIX C
APPENDIX FOR CHAPTER 4

In this appendix, we discuss how to bound the separation $sep(\Lambda_1,\Lambda_2)$ between two diagonal matrices in (4.4) and provide an alternative proof of Lemma 4.4.7.

C.0.1 Discussions on bounding $sep(\Lambda_1,\Lambda_2)$

Define an eigen-gap
$$\delta_0=\max_{t_0\in\mathbb{C}}\Big\{\max\Big\{\min_{\lambda\in S(\Lambda_1)}|\lambda-t_0|-\max_{\mu\in S(\Lambda_2)}|\mu-t_0|,\;\min_{\lambda\in S(\Lambda_2)}|\lambda-t_0|-\max_{\mu\in S(\Lambda_1)}|\mu-t_0|\Big\}\Big\}.$$
Here $S(\Lambda_i)$, $i=1,2$ are the sets of eigenvalues contained in $\Lambda_i$, $i=1,2$, respectively, and are called the spectral sets. We can show that $\delta_0$ is a lower bound of $sep(\Lambda_1,\Lambda_2)$. By the definition of $sep$,
$$sep(\Lambda_1,\Lambda_2)=\inf_{\|T\|=1}\|T\Lambda_1-\Lambda_2T\|=\inf_{\|T\|=1}\|T(\Lambda_1-t_0I)-(\Lambda_2-t_0I)T\|$$
$$\ge\inf_{\|T\|=1}\Big(\min_{\lambda\in S(\Lambda_1)}|\lambda-t_0|\,\|T\|-\max_{\mu\in S(\Lambda_2)}|\mu-t_0|\,\|T\|\Big)=\min_{\lambda\in S(\Lambda_1)}|\lambda-t_0|-\max_{\mu\in S(\Lambda_2)}|\mu-t_0|.$$
Similarly,
$$sep(\Lambda_1,\Lambda_2)\ge\min_{\lambda\in S(\Lambda_2)}|\lambda-t_0|-\max_{\mu\in S(\Lambda_1)}|\mu-t_0|.$$
Hence we have $sep(\Lambda_1,\Lambda_2)\ge\delta_0$; plugging into (4.4), it gives
$$\|\tan\Theta(\mathcal{X}_1,\widetilde{\mathcal{X}}_1)\|<\frac{2\kappa_2(X_1)\kappa_2(V_2)\|\Delta A\|}{\big[\delta_0-2\kappa_2(X_1)\kappa_2(V_2)\|\Delta A\|\big]_+}.\qquad(C.1)$$
Let us provide some intuition for $\delta_0$ to assist understanding. $\delta_0>0$ essentially means that there exists a disk in the complex plane that separates $S(\Lambda_1)$ from $S(\Lambda_2)$, i.e., there exist some $t\in\mathbb{C}$ and radius $\rho>0$ such that the disk $B(t,\rho)$ satisfies either (i) $S(\Lambda_2)\subseteq B(t,\rho)$ and $S(\Lambda_1)\subseteq\mathbb{C}\backslash B(t,\rho)$; or (ii) $S(\Lambda_1)\subseteq B(t,\rho)$ and $S(\Lambda_2)\subseteq\mathbb{C}\backslash B(t,\rho)$. That is to say, $\delta_0>0$ is equivalent to the existence of a disk with one of the two spectral sets completely inside it, and the other completely outside it. Comparing $\delta_0$ with another common definition of the gap, $\delta_1:=\min_{\lambda_i\in S(\Lambda_1),\lambda_j\in S(\Lambda_2)}|\lambda_i-\lambda_j|$, we see that $\delta_0\le\delta_1$. As a result, a $\sin\Theta$ bound that requires $\delta_0>0$ is likely to be weaker than one that requires $\delta_1>0$.
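The disk intuition and the inequality $\delta_0\le\delta_1$ are easy to probe numerically. In the sketch below (illustrative only; for diagonal matrices, sep reduces to the pairwise eigenvalue gap $\delta_1$), the set $S(\Lambda_2)$ straddles $S(\Lambda_1)$ on the real line, so no disk separates them, and the grid search finds $\delta_0\le0$ even though $\delta_1=2$:

import numpy as np

S1 = np.array([-2.0, 2.0])      # S(Lambda_1)
S2 = np.array([0.0, 5.0])       # S(Lambda_2): 0 lies between the points of S1

def delta0(A, B, grid):
    best = -np.inf
    for t0 in grid:             # candidate disk centers t0 in the complex plane
        best = max(best,
                   np.min(np.abs(A - t0)) - np.max(np.abs(B - t0)),
                   np.min(np.abs(B - t0)) - np.max(np.abs(A - t0)))
    return best

grid = [x + 1j * y for x in np.linspace(-8, 8, 161) for y in np.linspace(-8, 8, 161)]
d0 = delta0(S1, S2, grid)
d1 = np.min(np.abs(S1[:, None] - S2[None, :]))   # delta_1, which equals sep here
print(d0, d1)   # d0 <= 0: disks are convex, so no disk can separate interleaved sets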
In the literature, despite this usage of the weaker eigen-gap $\delta_0$, (4.4) provides the best known relation between the $\sin\Theta$ distance and the condition numbers.

C.0.2 Alternative proof of Lemma 4.4.7 using complex analysis

Assume $(S(\Lambda_1)\cup S(\widetilde{\Lambda}_1))\cap(S(\Lambda_2)\cup S(\widetilde{\Lambda}_2))=\emptyset$; then there always exists a positively oriented simple closed curve $\Gamma$ in the complex plane enclosing the eigenvalues in $\Lambda_1$ and $\widetilde{\Lambda}_1$ while leaving those in $\Lambda_2$ and $\widetilde{\Lambda}_2$ outside. It has been shown in Kato (2013) that $P_{Q_{X_1}}=\frac{1}{2\pi i}\int_\Gamma(\lambda I-A)^{-1}d\lambda$, where $P_{Q_{X_1}}=Q_{X_1}Q_{X_1}^*$ is the projector matrix onto the subspace spanned by the columns of $Q_{X_1}$. Similarly, $P_{Q_{\widetilde{X}_1}}=Q_{\widetilde{X}_1}Q_{\widetilde{X}_1}^*=\frac{1}{2\pi i}\int_\Gamma(\lambda I-\widetilde{A})^{-1}d\lambda$; then we have
$$Q_{V_2}^*Q_{\widetilde{X}_1}=Q_{V_2}^*\big(P_{Q_{X_1}}-P_{Q_{\widetilde{X}_1}}\big)Q_{\widetilde{X}_1}$$
$$=\frac{1}{2\pi i}Q_{V_2}^*\Big(\int_\Gamma\big((\lambda I-A)^{-1}-(\lambda I-\widetilde{A})^{-1}\big)d\lambda\Big)Q_{\widetilde{X}_1}$$
$$=-\frac{1}{2\pi i}\int_\Gamma Q_{V_2}^*(\lambda I-A)^{-1}\Delta A\,(\lambda I-\widetilde{A})^{-1}Q_{\widetilde{X}_1}\,d\lambda$$
$$=-\frac{1}{2\pi i}\int_\Gamma Q_{V_2}^*X(\lambda I-\Lambda)^{-1}X^{-1}\Delta A\,\widetilde{X}(\lambda I-\widetilde{\Lambda})^{-1}\widetilde{X}^{-1}Q_{\widetilde{X}_1}\,d\lambda$$
$$=-Q_{V_2}^*X_2\Big(\frac{1}{2\pi i}\int_\Gamma(\lambda I-\Lambda_2)^{-1}V_2^*\Delta A\,\widetilde{X}_1(\lambda I-\widetilde{\Lambda}_1)^{-1}d\lambda\Big)\widetilde{V}_1^*Q_{\widetilde{X}_1}$$
$$=-(R_{V_2}^{-1})^*\underbrace{\Big(\frac{1}{2\pi i}\int_\Gamma(\lambda I-\Lambda_2)^{-1}V_2^*\Delta A\,\widetilde{X}_1(\lambda I-\widetilde{\Lambda}_1)^{-1}d\lambda\Big)}_{G}R_{\widetilde{X}_1}^{-1},$$
where the second-to-last equality used the fact that $V^*X=I$ and $Q_{V_2}^*X_1=0$. The contour integral $G=\frac{1}{2\pi i}\int_\Gamma(\lambda I-\Lambda_2)^{-1}V_2^*\Delta A\,\widetilde{X}_1(\lambda I-\widetilde{\Lambda}_1)^{-1}d\lambda$ has poles at $\widetilde{\lambda}_j$, $1\le j\le r$. Hence the $(i,j)$th entry of $G$ can be computed by Cauchy's Residue Theorem as
$$G_{ij}=\frac{1}{2\pi i}\int_\Gamma(\lambda-\widetilde{\lambda}_j)^{-1}(\lambda-\lambda_{i+r})^{-1}(V_2^*\Delta A\,\widetilde{X}_1)_{ij}\,d\lambda=(\widetilde{\lambda}_j-\lambda_{i+r})^{-1}(V_2^*\Delta A\,\widetilde{X}_1)_{ij}.$$
Plugging this back into the expression of $Q_{V_2}^*Q_{\widetilde{X}_1}$ gives (4.9).
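The residue computation behind $G_{ij}$ can be reproduced numerically; the following sketch (with arbitrary sample values for $\widetilde{\lambda}_j$ and $\lambda_{i+r}$, and the unit circle as $\Gamma$, purely for illustration) compares a discretized contour integral with the closed form $(\widetilde{\lambda}_j-\lambda_{i+r})^{-1}$:

import numpy as np

# integrate (lam - mu)^{-1} (lam - lt)^{-1} over a circle enclosing lt but not mu,
# and compare with the residue value 1/(lt - mu)
lt, mu = 0.5 + 0.2j, 3.0 - 1.0j       # lt ~ eigenvalue of tilde-Lambda_1, mu of Lambda_2
theta = np.linspace(0, 2 * np.pi, 20000, endpoint=False)
lam = np.exp(1j * theta)              # contour Gamma: unit circle around the origin
dlam = 1j * lam * (theta[1] - theta[0])
integral = np.sum(dlam / ((lam - mu) * (lam - lt))) / (2j * np.pi)
print(integral, 1.0 / (lt - mu))      # the two values agree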