ROBUST ALGORITHMS ON LOW-RANK APPROXIMATION AND THEIR APPLICATIONS

By

Ningyu Sha

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computational Mathematics, Science and Engineering – Doctor of Philosophy
Statistics – Dual Major

2021

ABSTRACT

ROBUST ALGORITHMS ON LOW-RANK APPROXIMATION AND THEIR APPLICATIONS

By

Ningyu Sha

Low-rank approximation models have been widely developed in computer vision, image analysis, signal processing, web data analysis, bioinformatics, etc. Generally, we assume that the intrinsic data lies in a low-dimensional subspace, and we need to extract the low-rank representation from the given observations. There are many well-known methods, such as principal component analysis (PCA), factor analysis, and least squares. However, their performance may suffer when the data contain outliers. Robust PCA (RPCA) plays an important role in such cases, but RPCA-based methods suffer from expensive computational costs. In this thesis, we discuss how to improve the performance of RPCA in terms of both speed and accuracy, and we compare convex and nonconvex models. Notably, we propose a theory for matrix decomposition with unknown rank. We also propose a nonlinear RPCA approach under the assumption that the data lie on a manifold. Then, we take examples from seismic event detection and 2D image denoising. The numerical experiments show the robustness of our techniques and demonstrate speedups and higher recovery accuracy compared with existing approaches.

It is common in practice for observed data to have missing values, so we need to compute a low-rank approximation from incomplete data. Offline matrix completion may also take a long time, since we need to collect all data first; an online version can offer up-to-date results based on a continuous data stream. Online matrix completion has applications in computer vision and web data analysis, especially in video image transmission and recommendation systems. To better handle color images with three channels, we introduce online quaternion matrix completion, which produces an updated result for every newly observed entry using stochastic gradient descent on the quaternion matrix.

Copyright by
NINGYU SHA
2021

ACKNOWLEDGEMENTS

Firstly, I would like to thank my advisor and committee chair, Dr. Ming Yan. He is very patient and knowledgeable, and he laid out the big picture of optimization for me. Whenever I have any difficulty or questions, he is more than willing to help. I also want to thank my co-advisor, Dr. Yuying Xie. He is humorous and warm-hearted, and he gives me valuable advice from the statistics side. I would also like to thank my other committee members, Dr. Matthew Hirn, Dr. Yuehua Cui, and Dr. Haolei Weng, for their knowledgeable feedback and kind suggestions for my research life. I want to thank our research partners, Dr. Youzuo Lin, Dr. Lei Shi, Dr. Rongrong Wang, He Lye, and Shuyang Qin. Thank you for the brilliant work on our projects. Also, I would like to thank my friends Yuning Hao, Hongnan Wang, Yun Song, Yuejiao Sun, Lijiang Xu, Jieqian He, Binbin Huang, Hao Wang, Jialin Qu, Peide Li, Mi Hu, Ken Lee, Runze Su, Yao Xuan, and Yao Li for their huge support and companionship during my Ph.D. I also want to thank the CMSE community, Lisa Roy, Heather Williams, Dr. Andrew Christlieb, etc. They are always there to offer help. I also want to thank Erica Schmittdiel, who stood by me through the darkest times.
Most importantly, I would like to thank my parents, Chunhua Sha and Hui Peng. Their selfless love is continued support to make me a better person, always. I also need to thank a lot of people, even if I have forgotten their names. iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x CHAPTER 1 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Applications of robust low-rank optimization . . . . . . . . . . . . . . . . . . 1 1.2 Existing work on RPCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Overview of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 CHAPTER 2 ROBUST PRINCIPAL COMPONENT ANALYSIS FOR LOW RANK MATRIX APPROXIMATION . . . . . . . . . . . . . . . . . . . . . 11 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Proposed algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1 Forward-backward . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2.1.1 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . 21 2.2.2 An accelerated algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.1.1 Low-rank matrix recovery . . . . . . . . . . . . . . . . . . . 29 2.3.1.2 Robustness of the model . . . . . . . . . . . . . . . . . . . . 30 2.3.1.3 Low-rank matrix recovery with missing entries . . . . . . . . 31 2.3.2 Real image experiment . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.4.1 Nonconvex penalties on the singular values . . . . . . . . . . . . . . . 35 2.4.2 Other regularization on the sparse component . . . . . . . . . . . . . 36 2.4.3 Constrained problems . . . . . . . . . . . . . . . . . . . . . . . . . . 36 CHAPTER 3 ROBUST PRINCIPAL COMPONENT ANALYSIS FOR SEISMIC EVENT DETECTION . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.2.1 New algorithms with infimal convolution . . . . . . . . . . . . . . . . 41 3.2.1.1 Comparison between PGM and IC-PGM . . . . . . . . . . . 42 3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.1 Synthetic seismic data . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.2 Field seismic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 v CHAPTER 4 MANIFOLD DENOISING BY NONLINEAR ROBUST PRINCI- PAL COMPONENT ANALYSIS . . . . . . . . . . . . . . . . . . . . 49 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3 Geometric explanation . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . 52 4.4 Optimization algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.5 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 CHAPTER 5 ONLINE MATRIX COMPLETION WITH QUATERNION MATRIX 58 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.2 Introduction on Quaternion Matrices . . . . . . . . . . . . . . . . . . . . . . 60 5.2.1 Quaternion Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.2.2 Basic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.3 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . 62 5.2.4 Incoherence Condition . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.2.5 Sampling Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.3 Online Matrix Completion Algorithms and its Theoretical Analysis . . . . . 63 5.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.4.1 The Hermitian Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.4.2 The General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 vi LIST OF TABLES Table 2.1: Comparison of three RPCA algorithms. We compare the relative error of their solutions to the true low-rank matrix and the number of itera- tions. Both Alg. 2.1 and Alg. 2.2 have better performance than (Shen et al., 2019) in terms of the relative error and the number of iterations. Alg. 2.2 has the fewest iterations but the relative error could be large. It is because the true low-rank matrix is not the optimal solution to the optimization problem, and the trajectory of the iterations moves close to L? before it approaches the optimal solution. . . . . . . . . . . . . . . . . 30 Table 2.2: Performance of Alg. 2.2 on low-rank matrix recovery with missing entries. We change the level of sparsity in the sparse noise, standard deviation of the Gaussian noise, and the ratio of missing entries. . . . . . . . . . . 32 Table 3.1: Comparison of six algorithms. IC-ADMM is the fastest, which is the same as synthetic data. The function value for MCP is smaller because of a different model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 vii LIST OF FIGURES Figure 1.1: Image from https://link.medium.com/bAUJGpEl5hb. Height and weight are correlated. If visualized, these two vectors have an acute angle (left figure). After using PCA, height and weight are combined as a new feature ‘size’ (right figure). There is also ‘other’ information left that is orthogonal to ‘size’. The short length means that it is less important. . . 2 Figure 1.2: Image from https://link.medium.com/BbeEPVIl5hb. PCA works well for clean linear data. However, it works poorly for data with outliers. . . 3 Figure 1.3: An application of RPCA. Image from (Zhou et al., 2014). Top row: four frames from a video. Middle row: video background, which is viewed as a low-rank approximation. Bottom row: main objects which correspond to sparse components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
4 Figure 1.4: Image from https://link.medium.com/ERplrHLl5hb. Rating system for Netflix films. Each row represents each user while each column repre- sents each film. Values from 1 to 5 are scores. . . . . . . . . . . . . . . . 5 Figure 2.1: The contour map of the relative error to L? for different parameters. In this experiment, we set r = 25 and s = 20. The upper bound of the rank is set to be p = 30. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Figure 2.2: The relative error to the true low-rank matrix vs the rank p for Shen et al.’s and Alg. 2.2. Alg. 2.2 is robust to p, as long as p is not smaller than the true rank 25. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 2.3: The numerical experiment on the ‘cameraman’ image. (A-C) show that the proposed model performs better than Shen et al.’s both visually and in terms of RE and PSNR. (D) compares the objective values vs time for general SVD, Alg. 2.1, and Alg. 2.2. Here f ? is the value obtained by Alg. 2.2 with more iterations. It shows the fast speed with the Gauss- Newton approach and acceleration. With the Gauss-Newton approach, the computation time for Alg. 2.1 is reduced to about 1/7 of the one with standard SVD (from 65.11s to 8.43s). The accelerated Alg. 2.2 requires 5.2s, though the number of iterations is reduced from 3194 to 360. 33 viii Figure 2.4: The numerical experiment on the ‘Barbara’ image. (A-C) show that the proposed model performs better than Shen et al.’s both visually and in terms of RE and PSNR. (D) compares the objective values vs time for general SVD, Alg. 2.1, and Alg. 2.2. Here f ? is the value obtained by Alg. 2.2 with more iterations. It shows the fast speed with the Gauss- Newton approach and acceleration. With the Gauss-Newton approach, the computation time for Alg. 2.1 is reduced to less than 1/3 of the one with standard SVD (from 148.6s to 43.7s). The accelerated Alg. 2.2 requires 23.3s, though the number of iterations is reduced from 3210 to 300. 34 Figure 3.1: Comparison of recovered results on synthetic seismic data with 500 re- ceivers and 1000 measurements at each receiver. (a) simulated clean data. (b) noisy data (-26.2 dB). (c) recovered data by L1 (13.4 dB). (d) recovered data by MCP (13.9 dB). (e) recovered sparse noise by L1 . (f) recovered sparse noise by MCP. (g) the difference between the clean data and the recovered one by L1 . (h) the difference between the clean data and the recovered one by L1 . (e-h) zoom-in over receivers 150-350 and measurements 1-400. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure 3.2: Comparison of five algorithms (PGM, FISTA, IC-PGM, IC-FISTA, IC- ADMM) for the convex RPCA on synthetic data. IC-ADMM has the fastest convergence rate and smallest computational time. IC technique improves the performance of PGM and FISTA significantly. . . . . . . . 45 Figure 3.3: Noisy data generated in Oklahoma. . . . . . . . . . . . . . . . . . . . . . 46 Figure 3.4: Recovered results of the real data with two models. . . . . . . . . . . . . 47 Figure 4.1: NRPCA applied to the noisy Swiss roll data set. X̃ − Ŝ is the result after subtracting the estimated sparse noise via NRPCA with T = 1; “ X̃ − Ŝ with one neighbor update” is that with T = 2, i.e., patches are reassigned once; X̂ is the denoised data obtained via NRPCA with T = 2; “Patch-wise Robust PCA” refers to the ad-hoc application of the vanilla RPCA to each local patch independently, whose performance is clearly worse than the proposed joint-recovery formulation. . . . . . . 
. 56 Figure 4.2: Laplacian eigenmaps and Isomap results for the original and the NR- PCA denoised digits 4 and 9 from the MNIST dataset. . . . . . . . . . . 57 Figure 5.1: A movie rating system. For a given d1 ×d2 low-rank matrix with missing entries, it can be factorized by a d1 × k user matrix and a k × d2 item matrix where k is the rank of the original matrix. . . . . . . . . . . . . . 59 ix Figure 5.2: Loss function value versus number of iterations for the small Hermitian case. The stepsize is 3e−5 . Within 40000 iterations, the value decreases from nearly 100 to 10−6 . The loss function value tends to keep decreasing after these 40000 iterations. . . . . . . . . . . . . . . . . . . . . . . . . . 84 Figure 5.3: Loss function value versus number of iterations for the small general quaternion matrix case. The stepsize is tuned to be 1e−4 . Within 5000 iterations, the value decreases from around 100 to 10−4 . The loss func- tion value tends to keep decreasing after these 5000 iterations. . . . . . . 85 Figure 5.4: Online image recovery result after 10000 iterations. We randomly sam- pled 1000o observations from (a). We can see that the result for recov- ered image (b) is not good. We expect that the difference between (a) and (b) can be as small as possible. . . . . . . . . . . . . . . . . . . . . . 86 Figure 5.5: Loss function value versus number of iterations for the real color image. The initial stepsize is tuned to be 1e−5 . For every 300 iterations, we multiply the stepsize by 0.95. The loss function value decreases from around 170 to almost 45. At the beginning, the loss function value decreases the most, and it tends to keep decreasing after these 10000 iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 x LIST OF ALGORITHMS 2.1 RPCA for low rank matrix approximation . . . . . . . . . . . . . . . . . . . . 21 2.2 Accelerated RPCA with nonmonotone APG . . . . . . . . . . . . . . . . . . . 28 4.1 Nonlinear RPCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.1 Online learning algorithm for the Hermitian matrix M . . . . . . . . . . . . . 82 5.2 Online learning algorithm for general M (theoretical version) . . . . . . . . . 83 5.3 Online learning algorithm for general M (practical version) . . . . . . . . . . 83 xi CHAPTER 1 BACKGROUND 1.1 Applications of robust low-rank optimization Many high-dimensional data points can be represented as points in a low-dimensional sub- space of a high-dimensional space, and principal component analysis (PCA) is a popular tool to find the low-dimensional subspace. Here, we use a simple example to explain the intuition behind PCA. Height and weight are two measurements (features) for football players, and we say that the original dimension of these measurements is two. That is, the data for each player lies in a two-dimensional space. Obviously, there is a positive correlation between these two measurements. If one person is taller than another one, he/she is usually heavier too. More specifically, these two measurements are linearly correlated, as shown in Fig. 1.1. In this example, we can use one new variable named size to approximately describe the combination of height and weight at a high accuracy. In fact, it is also applied to the size of clothes. Generally, correlations within features happen when there are more than two features. Mathematically, we say that these features are not linearly independent. 
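As a concrete (and purely hypothetical) illustration of this correlation, the short NumPy sketch below simulates height and weight measurements for a group of players and computes their sample correlation; all numbers and names are made up for illustration and are not from this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements for 100 players (units: cm and kg).
height = 175 + 10 * rng.standard_normal(100)
weight = 0.9 * (height - 100) + 4 * rng.standard_normal(100)  # weight tracks height

corr = np.corrcoef(height, weight)[0, 1]
print(f"sample correlation between height and weight: {corr:.2f}")  # strongly positive
```

A correlation close to one means that the two columns of the data matrix are nearly linearly dependent, which is exactly the situation in which a single combined feature such as 'size' captures most of the information.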
PCA finds a lin- ear transform such that the new features after the transformation are linearly independent. In addition, PCA finds the most important features that represent the most information (variance) in the dataset. For example, the first principal component explains the most variance in the dataset, and the second principal component is the vector that is orthogo- nal to the first principal component and explains the second most variance in the dataset. PCA can be calculated from the singular value decomposition (SVD) of the data matrix, and the singular values represent the variance in the data corresponding to the principal components, which are corresponding right eigenvectors. Then, we keep the principal com- ponents corresponding to the largest singular values to represent most information using a 1 Figure 1.1: Image from https://link.medium.com/bAUJGpEl5hb. Height and weight are correlated. If visualized, these two vectors have an acute angle (left figure). After using PCA, height and weight are combined as a new feature ‘size’ (right figure). There is also ‘other’ information left that is orthogonal to ‘size’. The short length means that it is less important. small number of features. PCA has been applied to many different areas, such as computer vision, bioinformatics, finance, psychology, etc. PCA removes the principal components corresponding to smallest singular values because the data points with small random distortions are not exactly on the low-dimensional space. The small random distortions are equivalent to adding small random values to all singular values. Therefore, by setting small singular values as zero, PCA can reduce the effect of these distortions. In the left figure of Figure 1.2, with the small distortions, PCA is able to get the first principal component, shown in red, and it is very close to the true direction shown in blue. However, in many real cases, there are outliers accompanied with the data set. In statistics, an outlier is a data point that differs significantly from other observations. It can be caused by equipment error or the population that has a heavy-tailed distribution. Outliers usually have large distance from normal data points, but the total number of outliers is much smaller than the total number of data points. In the case with outliers, PCA may not perform well. For example, in the right figure of Figure 1.2, there are four outliers, and PCA will give the red direction, which is very different from the true blue one. 2 (a) Yellow dots are sampled from one (b) Yellow dots are sampled from one dimensional subspace (blue line) cor- dimensional subspace (blue line) cor- rupted by Gaussian noise. The red line rupted by large sparse outliers. PCA is is the output of PCA. It can reserve affected by outliers and can not capture most variance. the data well. Figure 1.2: Image from https://link.medium.com/BbeEPVIl5hb. PCA works well for clean linear data. However, it works poorly for data with outliers. Standard PCA can not deal with outliers because it is based on the assumption that the additional noise follows a Gaussian distribution. However, outliers follow heavy-tail distributions and a different model is required. Robust PCA (RPCA) is one approach to remove the outliers while preserving the low-dimensional structure. 
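The failure mode in the right panel of Figure 1.2 is easy to reproduce numerically. The sketch below (an illustrative example, not code from this thesis; all names and sizes are assumptions) samples points near a one-dimensional subspace, corrupts a few of them with large outliers, and compares the leading principal direction computed by standard PCA in both cases.

```python
import numpy as np

rng = np.random.default_rng(0)

# True one-dimensional subspace: the line spanned by this unit vector.
true_dir = np.array([1.0, 0.5])
true_dir /= np.linalg.norm(true_dir)

# 200 points on the line with small Gaussian perturbations (left panel of Fig. 1.2).
t = rng.standard_normal(200)
clean = np.outer(t, true_dir) + 0.05 * rng.standard_normal((200, 2))

# The same points with a few large sparse outliers added (right panel of Fig. 1.2).
corrupted = clean.copy()
idx = rng.choice(200, size=4, replace=False)
corrupted[idx] += rng.uniform(-5.0, 5.0, size=(4, 2))

def leading_direction(X):
    # First principal direction = top right singular vector of the centered data.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

for name, X in [("Gaussian noise only", clean), ("with sparse outliers", corrupted)]:
    d = leading_direction(X)
    angle = np.degrees(np.arccos(min(1.0, abs(float(d @ true_dir)))))
    print(f"{name:>21}: angle to the true direction = {angle:.2f} degrees")
```

On typical runs, the direction estimated from the merely Gaussian-perturbed data stays close to the true one, while the estimate from the outlier-corrupted data deviates noticeably; this sensitivity to a handful of gross errors is the problem that RPCA is designed to fix.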
This approach has been successfully applied in a wide range of areas, including computer vision (De la Torre and Black, 2001), image processing (Liu et al., 2012; Elhamifar and Vidal, 2013), dimensionality reduction (Cunningham and Ghahramani, 2015), bioinformatics data analysis (Da Costa et al., 2009), and web data and services (Koren, 2009). More specifically, RPCA has achieved great success in video surveillance and face recognition (Candès et al., 2011; Bouwmans and Zahzah, 2014). Also, it has been proved by Candès et al. (2011) that the low-rank part can be a good approximation of the original complete matrix if the data satisfies some assumptions. RPCA assumes that we observe all entries of the full noisy matrix. Compared with standard PCA, RPCA utilizes the difference between the low-dimensional data structure and outliers. Also, different from two-stage methods, which first detect the outliers and remove them, RPCA describes the outliers as a sparse matrix and decompose the noisy 3 matrix into the sum of two or three matrices, with or without the Gaussian noise. In some applications, the outliers contain useful information. For example, in video surveillance, the low-rank part preserves the stationary background, whereas the sparse part can capture a moving object or person in the foreground. See Figure 1.3 for an example with four frames in a video. Figure 1.3: An application of RPCA. Image from (Zhou et al., 2014). Top row: four frames from a video. Middle row: video background, which is viewed as a low-rank approximation. Bottom row: main objects which correspond to sparse components. RPCA separates a data matrix into the sum of a low-rank matrix, a sparse matrix (outliers), and a small noise matrix (Gaussian noise). If there is no Gaussian distortions, the small noise matrix is just a zero matrix. However, in practice, we are often confronted with such a situation where the collected data is incomplete. The good thing is that a low-rank matrix can be recovered from only a few entries of the matrix. For example, an m × n rank-one matrix can be recovered by one row and one column. When the given noisy matrix has missing entries, it is a low-rank matrix completion problem, that is, we will remove the outliers and recover the whole low-rank matrix. For example, in a film rating system, as shown in Figure 1.4, each user rates the films they watched. However, this data matrix tends 4 to be incomplete because people cannot watch all film showed in the rating website. Matrix completion techniques can predict the rates of each user for unwatched films and promote to the user the films they may like but did not watch. Figure 1.4: Image from https://link.medium.com/ERplrHLl5hb. Rating system for Netflix films. Each row represents each user while each column represents each film. Values from 1 to 5 are scores. 1.2 Existing work on RPCA Assuming that the true data points lie on a low-dimensional subspace, we use a matrix D ∈ Rn×p to represent the observed noisy data, where n is the number of samples and p is the dimension of each sample. We also assume that the intrinsic dimension of the data is k < p. Therefore, in the linear case, k is also the rank of the matrix L0 consisting of the true samples. In this case, we have the following model, D = L 0 + N0 , (1.1) 5 where L0 is a low-rank matrix with rank k and N0 is a small perturbation matrix (Gaussian noise). Then the classical PCA solves the following problem minimize kD − Lk2F L subject to rank(L) = k. 
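By the Eckart–Young theorem, this rank-constrained least-squares problem is solved by truncating the SVD of D to its k largest singular values. A minimal sketch follows (illustrative code with made-up sizes, not from the thesis):

```python
import numpy as np

def pca_lowrank(D, k):
    # Best rank-k approximation of D in the Frobenius norm (Eckart-Young).
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
L0 = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 60))  # true rank-3 matrix
D = L0 + 0.01 * rng.standard_normal((100, 60))                     # small Gaussian noise N0

L = pca_lowrank(D, k=3)
print("relative error:", np.linalg.norm(L - L0) / np.linalg.norm(L0))
```

With only small Gaussian noise, the truncated SVD recovers L0 to high accuracy; the discussion below explains why this construction breaks down when some entries are corrupted by large outliers.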
The classical PCA assumes that the noise follows a Gaussian distribution. When only some components of the matrix are affected by large noise and the distribution of the noise is unknown, we consider RPCA, which assumes that D = L0 + S0 , with S0 being a sparse matrix. Because we do not have any knowledge about the distribution of the noise, the components can be considered as damaged and should be removed. If we know the locations, then we can remove them and fill-in new values for those locations. However, the locations are not given, and we have to find the sparse matrix S0 and the low-rank matrix L0 together using RPCA algorithms. RPCA is an inverse problem to recover L and S from the matrix D, which can be realized via solving the idealized nonconvex problem minimize rank(L) + λkSk0 , subject to L + S = D, (1.2) L,S where λ is a parameter to balance the two objectives and kSk0 counts the number of non- zero entries in S. However, this problem is NP-hard in general (Amaldi and Kann, 1998). Therefore, much attention is focused on the following convex relaxation: minimize kLk∗ + λkSk1 , subject to L + S = D. (1.3) L,S Here k · k∗ and k · k1 denote the nuclear norm and `1 −norm of a matrix, respectively. This is called principal component pursuit (PCP) (Zhou et al., 2010a). When Gaussian noise is also involved, we consider D = L + S + N, (1.4) 6 where N is the Gaussian noise. We can set the noise level to be  and use the Frobenius norm to measure the Gaussian noise. A relaxed version of PCP is defined as minimize kLk∗ + λkSk1 , subject to kL + S − Dk2F ≤ . (1.5) L,S This constrained optimization problem is also equivalent to the following unconstrained one 1 minimize kL + S − Dk2F + kLk∗ + λkSk1 . (1.6) L,S 2µ Classical optimization algorithms such as proximal gradient method (PGM), accelerated PGM, and alternating direction method of multipliers (ADMM) have been used to solve the unconstraned problem (1.6). Let’s briefly go through these algorithms. For PGM, we can view the loss function as two parts corresponding with L and S respectively. Then the PGM takes two steps. The first step is the gradient descent, and the second step is to calculate the proximal functions. One iteration is described as  L̂k = Lk − t (Lk + Sk − D),  µ     t k    k k k Ŝ = S − (L + S − D),   µ (1.7) 1 Lk+1 = arg min tkLk∗ + kL − L̂k k2F ,        L 2 1   Sk+1 = arg min tλkSk1 + kS − Ŝk k2F .   S 2 To solve the second proximal function, we can directly use soft-thresholding on each entry of the matrix. To solve the first proximal function, we need to use SVD to calculate the singular values and do the soft-thresholding on the singular values. We can set the initial condition L = 0 and S = 0. As for accelerated proximal gradient method, we use FISTA (Beck and Teboulle, 2009b) as one example. The main change compared with the original PGM is that the shrinkage operator is not used on the previous point L̂k but on the linear combination of the previous two points L̂k , L̂k−1 as follows: θk−1 − 1 k L̄k = Lk + (L − Lk−1 ) (1.8) θk 7 √ 1+ 1+4θk2 where θk+1 = 2 . Generally, it updates variables starting with θ−1 = θ0 = 1. It has been proved that FISTA has improved the convergence rate from O(1/k) to O(1/k 2 ) for general convex problems. When it goes to ADMM, we need to convert the original problem to a constrained form 1 minimize kZk2F + λkSk1 + kLk∗ , subject to Z + L + S = D. (1.9) Z,L,S 2µ Unfortunately, its convergence for general three blocks has not been proved to converge. 
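Both proximal steps in (1.7), as well as the S- and L-subproblems inside the ADMM iteration discussed next, reduce to two closed-form shrinkage operations: entrywise soft-thresholding for the l1 penalty and soft-thresholding of the singular values (singular value thresholding) for the nuclear norm. The following is a hedged sketch with illustrative names, sizes, and parameter values, not the thesis code:

```python
import numpy as np

def prox_l1(A, tau):
    # Entrywise soft-thresholding: proximal operator of tau * ||.||_1.
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def prox_nuclear(A, tau):
    # Singular value thresholding: proximal operator of tau * ||.||_* .
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def pgm_step(L, S, D, mu, lam, t):
    # One proximal gradient iteration for (1.6), following (1.7).
    grad = (L + S - D) / mu                    # gradient of the smooth term
    L_hat, S_hat = L - t * grad, S - t * grad  # forward (gradient) step
    return prox_nuclear(L_hat, t), prox_l1(S_hat, t * lam)  # backward (proximal) step

# Example run starting from L = S = 0 on a random matrix (illustrative parameters).
rng = np.random.default_rng(2)
D = rng.standard_normal((50, 40))
L, S = np.zeros_like(D), np.zeros_like(D)
for _ in range(100):
    L, S = pgm_step(L, S, D, mu=1.0, lam=0.1, t=0.4)
```

Every iteration of PGM, FISTA, and ADMM evaluates prox_nuclear, so the full SVD inside it dominates the cost for large matrices.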
But in practice, it usually works very well. In order to solve (1.9), we need to establish the augmented Lagrangian function 1 β Lβ (Z, S, L, Q) = kZk2F +λkSk1 +kLk∗ −βhQ, Z+S+L−Di+ kZ+S+L−Dk2F . (1.10) 2µ 2 Alternatively, we consider 0 1 β Lβ (Z, S, L, Q) = kZk2F + λkSk1 + kLk∗ + kZ + S + L − D − Qk2F , (1.11) 2µ 2 0 with Lβ (Z, S, L, Q) = Lβ (Z, S, L, Q) + β2 kQk2F if we want to get the optimal variables Z, S 0 and L, respectively. Then we minimize Lβ alternatingly.  0    Zk+1 := arg min Lβ (Z, Sk , Lk , Qk );   Z    0 Sk+1 := arg min Lβ (Zk+1 , S, Lk , Qk );   S (1.12) 0 Lk+1 := arg min Lβ (Zk+1 , Sk+1 , Lk , Qk );       L    Qk+1 := Qk − (Zk+1 + Sk+1 + Lk+1 − D).  More specifically, 1 β    Zk+1 := arg min kZk2F + kZ + Sk + Lk − D − Qk k2F ;     Z 2µ 2   β Sk+1 := arg min λkSk1 + kZk+1 + S + Lk − D − Qk k2F ;   S 2 (1.13)  k+1 β := arg min kLk∗ + kZk+1 + Sk+1 + L − D − Qk k2F ;   L 2     S   Qk+1 := Qk − (Zk+1 + Sk+1 + Lk+1 − D).  8 All these approaches need to find the proximal of the nuclear norm, which requires SVD. When the matrix size is large, the SVD computation is very expensive and dominates other computation (Trefethen and Bau III, 1997). 1.3 Overview of this thesis We briefly introduce the following chapters in this thesis. In Chapter 2, we study the theoretical parts of low-rank approximation. There are mainly two types of algorithms for RPCA. The first type of algorithm applies regularization terms on the singular values of a matrix to obtain the low-rank matrix. However, calculating singular values can be very expensive for large matrices. The second type of algorithm replaces the low-rank matrix as the multiplication of two smaller matrices. They are faster than the first type because no SVD is required. However, the rank of the low-rank matrix is required, and an accurate rank estimation is required to obtain a reasonable solution. In this chapter, we propose algorithms that combine both types. Our proposed algorithms require an upper bound of the rank and SVD on small matrices. First, they are faster than the first type because the cost of SVD on small matrices is negligible. Second, they are more robust than the second type because an upper bound of the rank instead of the rank is required. Numerical experiments show the good performance of our proposed algorithms. In Chapter 3, we take examples from seismic data. Seismic events are usually buried in noise. Our goal is to discover the underlying signal from noise. RPCA based seismic denoising approaches yield promising results in separating useful seismic events from noise. However, current RPCA-based methods suffer from expensive computational costs, which hinders their wide applications in seismic data denoising and preprocessing. In this work, we develop a cost-effective denoising technique based on RPCA. Instead of solving for the clean data and noise simultaneously, we alternatively update them. This approach admits a large stepsize and increases the speed. In addition, we improve the model by incorporating a nonconvex term. To verify the effectiveness of our technique, we applied our denoising 9 technique to both synthetic and field reflection seismic data. From the numerical results, we observe that our denoising methods not only produce comparable or better denoising results but also yield efficient computational cost. Through comparison to other RPCA- based denoising methods, our method is at least 4-10x faster. In Chapter 4, we extend RPCA to nonlinear manifolds. 
Suppose that the data matrix contains a sparse component and a component drawn from some low-dimensional manifold. Is it possible to separate both components by using the low dimensionality assumption of the manifold? Is there a benefit to treat the manifold as a whole as opposed to treating each local region individually? We answer these two questions affirmatively by proposing an optimization framework that separates these two components from noisy data. The efficacy of the proposed method is demonstrated on both synthetic and real dataset. The three chapters consider offline algorithms, in which we have access to all data before the algorithm is applied. In the last chapter, we applied low-rank approximation on online matrix completion. Online optimization is more and more useful nowadays in data science and machine learning areas. As we know, for offline setting, we need to get all observations first and use all information to train the model. On the other side, online algorithms aim to make decision sequentially based on sequentially sampled data. For example, in many applications, like film recommendation systems, the user will give one rating for a film each time after watching. In this case, we can only get one observation each time. There are many work that aims to develop algorithms which can give updated result after every new observed entry. In Chapter 5, we consider the completion of a quaternion matrix, whose entries are quaternion numbers. Each quaternion number has one real number and three imaginary number. Therefore, it can be applied to color image with or without depth. We develop and analyze an online matrix completion algorithm for quaternion matrices. We want to find a decomposition of the matrix such that the matrix is a product of two small matrices, In this algorithm, after we receive an entry from the matrix, we update one row and one one column of the two small matrices, respectively. 10 CHAPTER 2 ROBUST PRINCIPAL COMPONENT ANALYSIS FOR LOW RANK MATRIX APPROXIMATION 2.1 Introduction Robust principal component analysis (RPCA) decomposes a data matrix into a low-rank part and a sparse part. It has applications in a wide range of areas, including computer vision (De la Torre and Black, 2001), image processing (Liu et al., 2012; Elhamifar and Vidal, 2013), dimensionality reduction (Cunningham and Ghahramani, 2015), and bioinformatics data analysis (Da Costa et al., 2009). More specifically, the RPCA model has achieved great success in video surveillance and face recognition (Candès et al., 2011; Bouwmans and Zahzah, 2014). For example, in video surveillance, the low-rank part preserves the stationary background, whereas the sparse part can capture a moving object or person in the foreground. We first assume that the data matrix D is obtained by the sum of a low-rank matrix and a spare matrix. That is D = L + S, where L is a low-rank matrix and S is a sparse matrix, which has only a few nonzero components. RPCA is an inverse problem to recover L and S from the matrix D, which can be realized via solving the idealized nonconvex problem minimize rank(L) + λkSk0 , subject to L + S = D, (2.1) L,S where λ is a parameter to balance the two objectives and kSk0 counts the number of non- zero entries in S. However, this problem is NP-hard in general (Amaldi and Kann, 1998). Therefore, much attention is focused on the following convex relaxation minimize kLk∗ + λkSk1 , subject to L + S = D. 
(2.2) L,S 11 Here k · k∗ and k · k1 denote the nuclear norm and `1 −norm of a matrix, respectively. It is shown that under mild conditions, the convex model (2.2) can exactly recover the low-rank and sparse parts with high probabilities (Candès et al., 2011). When additional Gaussian noise is considered, we can set the noise level to be  and use the Frobenius norm k · kF to measure the reconstruction error. Then, the problem becomes minimize kLk∗ + λkSk1 , subject to kL + S − Dk2F ≤ . (2.3) L,S Then, this constrained optimization problem is equivalent to the unconstrained problem 1 minimize kL + S − Dk2F + kLk∗ + λkSk1 (2.4) L,S 2µ with a trade-off parameter µ. There is a correspondence between the two parameters  and µ in (2.3) and (2.4), but the explicit expression does not exist. In this chapter, we will focus on the unconstrained problem (2.4), and the technique introduced in this chapter can be applied to the convex models (2.2) and (2.3). Please see Section 2.4 for more details. There are many existing approaches for solving (2.4) including the augmented Lagrange method (Lin et al., 2010; Bouwmans and Zahzah, 2014; Wright et al., 2009). Some examples are proximal gradient method for (L, S), alternating minimization for L and S (Shen et al., 2019), proximal gradient method for L after S is eliminated (Sha et al., 2019), alternating direction method of multipliers (ADMM) (Yuan and Yang, 2009; Tao and Yuan, 2011). All these approaches need to find the proximal of the nuclear norm, which require the singular value decomposition (SVD). When the matrix size is large, the SVD computation is very expensive and dominates all the computation (Trefethen and Bau III, 1997). Alternative approaches for RPCA use matrix decomposition (Wen et al., 2012) and do not require SVD. Assuming that the rank of L is known as p, we can decompose it as L = XY> , with X ∈ Rm×p and Y ∈ Rn×p . Then the following nonconvex optimization problem 1 minimize kXY> + S − Dk2F + λkSk1 , (2.5) X,Y,S 2 12 is considered. There are infinite many optimal solutions for this problem, since for any invertable matrix A ∈ Rp×p , (X, Y, S) and (XA−1 , YA> , S) have the same function value. In fact, for any matrix L with rank no greater than p, we can find L = XY> and Y> Y = Ip×p . Therefore, we can have an additional constraint Y> Y = Ip×p . The resulting problem still has infinite many optimal solutions, since for any orthogonal matrix A ∈ Rp×p , (X, Y, S) and (XA, YA, S) have the same function value. Though (X, Y) are not unique, the low-rank matrix L that we need is unique at probability one. This resulting problem was discussed in (Shen et al., 2019), and an efficient algorithm by alternatively minimizing XY> and S is provided. In this algorithm, a Gauss-Newton algorithm is applied to update XY> and increase the speed. Though the matrix decomposition problem is fast to solve, it is nonconvex and requires an accurate estimation of the rank of L. Fig. 2.2 in Section 2.3.1.2 demonstrates that a good estimation of the rank is critical. However, in most scenarios, we do not have the exact rank of L, but we can have an upper bound of the true rank. Therefore, we can combine the matrix decomposition and the nuclear norm minimization to have the benefits of both problems. The problem we consider in this chapter is 1 minimize kL + S − Dk2F + µkLk∗ + λkSk1 , subject to rank(L) ≤ p. (2.6) L,S 2 When µ = 0, the problem (2.6) is equivalent to (2.5). 
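To make the combined model concrete, the short sketch below (illustrative only; the matrix sizes are arbitrary and the parameter values are borrowed from the experiments later in this chapter) evaluates the objective of (2.6) for a candidate pair (L, S). Feasibility of the rank constraint is guaranteed by building L from two factors with p columns, and setting mu = 0 recovers the objective of the factorization model (2.5).

```python
import numpy as np

def objective(L, S, D, mu, lam):
    # Objective of (2.6): 0.5*||L + S - D||_F^2 + mu*||L||_* + lam*||S||_1.
    fit = 0.5 * np.linalg.norm(L + S - D, "fro") ** 2
    return fit + mu * np.linalg.norm(L, ord="nuc") + lam * np.abs(S).sum()

rng = np.random.default_rng(3)
m, n, p = 60, 50, 8
D = rng.standard_normal((m, n))

# L = X @ Y.T with X (m x p) and Y (n x p) automatically satisfies rank(L) <= p.
X, Y = rng.standard_normal((m, p)), rng.standard_normal((n, p))
L, S = X @ Y.T, np.zeros((m, n))

print("combined model (2.6):     ", objective(L, S, D, mu=0.6, lam=0.04))
print("factorization model (2.5):", objective(L, S, D, mu=0.0, lam=0.04))
```

The algorithms developed below exploit exactly this factorized parameterization while keeping the nuclear-norm term, so that only an upper bound p on the rank is needed.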
In addition, we consider the following more general problem 1 minimize kA(L) + S − Dk2F + µkLk∗ + λkSk1 , subject to rank(L) ≤ p. (2.7) L,S 2 where D is the measurement of A(L) contaminated with both Gaussian noise and a sparse component. Here A is a bounded linear operator that describes how the measurements are calculated. For example, in robust matrix completion, we let A be the restriction operator on the given components of the matrix L. Note that the alternating minimization algorithm in (Shen et al., 2019) can not be applied to this general problem because the subproblem for L can no longer be solved efficiently by 13 the Gauss-Newton method. We will show the equivalency of the alternating minimization algorithm in (Shen et al., 2019) as a proximal gradient method applied to a problem with L only. Then the subproblem of L in our general problem (2.7) can still be solved efficiently with the Gauss-Newton method. Please see more details in Section 2.2. For simplicity, we use the nuclear norm and `1 −norm for the low-rank and sparse matrices, respectively. The purpose of this chapter is to introduce a fast algorithm to solve a type of RPCA algorithm, while the comparison of different penalties is out of the scopes of this chapter. The contributions of this chapter are: • We propose a new model for RPCA, which combines the nuclear norm minimization and the matrix decomposition. The matrix decomposition brings efficient algorithms, and the nuclear norm miminization on a smaller matrix removes the requirement of the rank of the low-rank matrix. Note that the nuclear norm minimization can be replaced by other nonconvex penalties, and the results in this chapter are still valid. • We develop efficient algorithms to solve this problem and show its convergence. 2.1.1 Notation Throughout this chapter, matrices are denoted by bold capital letters (e.g., A), and operators are denoted by calligraphic letters (e.g., A). In particular, I denotes the identity matrix, 0 denotes the zero matrix (all entries equal to zero), and I denotes the identity operator. If there is potential for confusion, we indicate the dimension of the matrix with subscripts. For a matrix A, A> represents its transpose and A(:, j : k) denotes the matrix composed by the columns of A indexing from j to k. Let Ai,j be the (i, j) entry of A. The `1 −norm of A is given by kAk1 = i,j |Ai,j |. Denote the ith singular value of A by σi (A). The nuclear P norm of A is given by kAk∗ = i σi (A). We will use ∂k · k1 and ∂k · k∗ to denote the P subgradients of the `1 −norm and nuclear norm, respectively. The linear space of all m × n real matrices is denoted by Rm×n . For A, B ∈ Rm×n , the inner product of A, B is defined by 14 hA, Bi = Tr(A> B), which induces the Frobenius norm kAkF = Tr(A> A) = p pP 2 i σi (A). Let A be a linear bounded operator on Rm×n . The operator norm of A is given by kAk = sup{kA(A)kF : A ∈ Rm×n , kAkF = 1}. The adjoint operator of A denoted by A∗ is also linear and bounded on Rm×n such that hA(A), Bi = hA, A∗ (B)i. The notation is used to denote the component-wise multiplication. Additionally, for a function f : R → R, without further reference, f acting on a matrix A ∈ Rm×n specifies that f is evaluated on each entry of A, i.e., f (A) ∈ Rm×n with f (A)i,j = f (Ai,j ). For example, if f (x) = |x| − λ, we can denote f (A) ∈ Rm×n by |A| − λ with (|A| − λ)i,j = |Ai,j | − λ. 2.1.2 Organization The rest of the chapter is organized as follows. 
We introduce our proposed algorithms and show their convergence in Section 2.2. Then we conduct numerical experiments to compare the performance of our proposed algorithms with existing approaches in Section 2.3. In Section 2.4, we introduce some potential extension from this chapter. We end this chapter with a short conclusion. 2.2 Proposed algorithms The problem (2.6) is nonconvex because of the constraint rank(L) ≤ p. It has several equivalent formulations. E.g., it is equivalent to the following nonconvex weighted nuclear norm minimization problem: p min(m,n) 1 X X minimize kL + S − Dk2F + µ σi (L) + C σi (L) + λkSk1 , L,S 2 i=1 i=p+1 where C is a sufficiently large number such that the optimal L has at most p nonzero singular values. However, this formulation also requires the singular value decomposition of an m × n matrix in each iteration, which is expensive when m and n are large. We consider another equivalent problem with matrix decomposition in the following theorem. 15 Theorem 2.2.1. Problem (2.6) is equivalent to 1 minimize kXY> + S − Dk2F + µkXk∗ + λkSk1 , subject to Y> Y = Ip×p . (2.8) X,Y,S 2 More specifically, if (X, Y, S) is an optimal solution to (2.8), then (XY> , S) is an optimal solution to (2.6). If (L, S) is an optimal solution to (2.6) and we have the decomposition L = XY> with Y> Y = Ip×p , then (X, Y, S) is an optimal solution to (2.8). Proof. For any matrix L ∈ Rm×n with rank no greater than p, we can have the decomposition L = XY> , with Y> Y = Ip×p . This decomposition is not unique, and one decomposition can be easily obtained from the compact SVD of L. Let L = Up Σp Vp> be the SVD of L with a square p × p matrix Σp , we have Vp> Vp = Ip×p . Thus, problem (2.6) is equivalent to 1 minimize kXY> + S − Dk2F + µkXY> k∗ + λkSk1 , subject to Y> Y = Ip×p . X,Y,S 2 For any X ∈ Rm×p , let X = UΣV> be its SVD with U ∈ Rm×p and V ∈ Rp×p . We have XY> = UΣV> Y> = UΣ(YV)> . Since (YV)> (YV) = V> Y> YV = V> V = Ip×p . The SVD of XY> is UΣ(YV)> , and kXY> k∗ = pi=1 Σii = kXk∗ . Thus, problem (2.6) is equivalent to (2.8). P Next, we consider problem (2.8) with S fixed. When S is fixed, it becomes a problem of L = XY> , and solving this problem is to find the proximal operator of the corresponding nonconvex weighted nuclear norm, which is denoted as 1 minimize kL − Mk2F + µkLk∗ , subject to rank(L) ≤ p, (2.9) L 2 or equivalently 1 minimize kXY> − Mk2F + µkXk∗ , subject to Y> Y = Ip×p , (2.10) X,Y 2 where M = D − S. 16 Theorem 2.2.2. Let q = min(m, n). Problem (2.9) can be solved in two steps: 1. Find the compact SVD of M = UΣV> , with Σ = diag(σ1 (M), · · · , σq (M)) satisfying σ1 (M) ≥ σ2 (M) ≥ · · · ≥ σq (M); 2. Construct a diagonal matrix Σ̂µ ∈ Rp×p with (Σ̂µ )ii = max(Σii −µ, 0), then one solution of (2.9) is U(:, 1 : p)Σ̂µ V(:, 1 : p)> . In addition, for any orthogonal matrix A ∈ Rp×p , (U(:, 1 : p)Σ̂µ A, V(:, 1 : p)A) is an optimal solution of (2.10). Proof. Given any L ∈ Rm×n with rank(L) ≤ p, let σ1 , σ2 , · · · , σq be its singular values in the decreasing order such that σp+1 = · · · = σq = 0. Note that the main diagonal entries of Σ are the singular values of M. According to the von-Neumann trace inequality (Horn and Johnson, 2012, Theorem 7.4.1.1), one can bound the matrix inner product by the singular values, i.e., hL, Mi ≤ qi=1 σi Σii . 
Then we have P 1 1 1 kL − Mk2F + µkLk∗ = kLk2F + kMk2F − hL, Mi + µkLk∗ 2 2 2 q q q q 1 X 2 1X 2 X X ≥ σ + Σ − σi Σii + µ σi (2.11) 2 i=1 i 2 i=1 ii i=1 i=1 p q p p 1X 2 1X 2 X X = σi + Σii − σi Σii + µ σi , 2 i=1 2 i=1 i=1 i=1 where the equality is satisfied when L has a simultaneous SVD with M through U and V. Therefore, the optimal L minimizing 21 kL − Mk2F + µkLk∗ can be selected from the matrices that have a simultaneous SVD with M through U and V. Then we can assume that the optimal L satisfies L = Udiag(σ1 , · · · , σp , σp+1 , · · · , σq )V> = U(:, 1 : p)diag(σ1 , · · · , σp )V(:, 1 : p)> , where the last equality holds because of the fact that σp+1 = · · · = σq = 0. Next, one can construct an optimal L of the above form by letting σi = max(Σii − µ, 0) for i = 1, 2, · · · , p, which minimizes the last equation in (2.11). Thus U(:, 1 : p)Σ̂µ V(:, 1 : p)> minimizes the objective function of (2.9) over all L ∈ Rm×n with rank no greater than p. 17 By the same argument in the proof of Theorem 2.2.1, we see that problem (2.10) is equivalent to problem (2.9). Since for any orthogonal matrix A ∈ Rp×p , there hold L = (U(:, 1 : p)Σ̂µ A)(V(:, 1 : p)A)> and (V(:, 1 : p)A)> (V(:, 1 : p)A) = A> A = Ip×p . Therefore, (U(:, 1 : p)Σ̂µ A, V(:, 1 : p)A) is an optimal solution of problem (2.10). The first step to solve problem (2.10) in the previous theorem requires the truncated SVD of an m × n matrix M. Since we only need the first p (p < q = min(m, n)) singular values, we use the Gauss-Newton algorithm to find (X, Y) alternatively. In this approach, we require the SVD of a m × p matrix, which is much faster than the truncated SVD of a m × n matrix when p is small. In addition, we use the previous X as the initial guess in the next iteration to reduce the number of inner iterations for the Gauss-Newton algorithm. Lemma 2.2.3. If the rank of M ∈ Rm×n is larger than p, problem (2.10) can be solved in the following three steps: 1. Find X̂ ∈ Rm×p (p < m) by solving the following optimization problem 1 minimize kXX> − MM> k2F ; X 2 2. Y = M> X̂(X̂> X̂)−1 ; 3. Let X̂ = Up Σ̂A be its thin SVD with Σ̂ ∈ Rp×p and choose X as X = Up Σ̂µ A with (Σ̂µ )ii = max(0, Σ̂ii − µ) for i = 1, . . . , p. Then (X, Y) is a solution of problem (2.10). Proof. Given any X ∈ Rm×p , let λ1 , λ2 , · · · , λm be the non-negative eigenvalues of the matrix XX> . Since rank(X) ≤ p < m, we have λp+1 = · · · = λm = 0. Recall that the compact SVD of M given in Theorem 2.2.2 is UΣV> with Σ ∈ Rq×q (here q = min(m, n)). Then 18 Σ211 ≥ Σ211 ≥ · · · ≥ Σ2qq are the largest q eigenvalues of the matrix MM> , and if q < m, the remaining eigenvalues of MM> are all zeros. Then we have p q p X X X > kXX − MM> k2F ≥ λ2i + Σ4ii −2 λi Σ2ii i=1 i=1 i=1 p q q X X X = (λi − Σ2ii )2 + Σ4ii ≥ Σ4ii , i=1 i=p+1 i=p+1 where the equality is satisfied when we choose X = U(:, 1 : p)diag(Σ11 , · · · , Σpp ). Let Σ̂ = diag(Σ11 , · · · , Σpp ). The matrix Σ̂ is invertible as the rank of M is larger than p. Then for any orthogonal matrix A ∈ Rp×p , X̂ = U(:, 1 : p)Σ̂A minimizes the objective function 1 2 kXX> − MM> k2F . After we find X̂ = U(:, 1 : p)Σ̂A for a certain orthogonal matrix A, we have Y = M> X̂(X̂> X̂)−1 =VΣU> U(:, 1 : p)Σ̂A((U(:, 1 : p)Σ̂A)> U(:, 1 : p)Σ̂A)−1 =VΣU> U(:, 1 : p)Σ̂−1 A =V(:, 1 : p)Σ̂Σ̂−1 A = V(:, 1 : p)A, where the third equality is due to the fact that    Σ̂p×p ΣU> U(:, 1 : p) =  .  0(q−p)×p According to Theorem 2.2.2, (X̂, Y) is an optimal solution of problem (2.10) if µ = 0. 
Note that X̂ = U(:, 1 : p)Σ̂A is the thin SVD with Σ̂ ∈ Rp×p . Then, the third step gives X = U(:, 1 : p)Σ̂µ A. Theorem 2.2.2 shows that (X, Y) is an optimal solution of problem (2.10). Remark: To find X̂ in the first step, we apply the Gauss-Newton algorithm from (Liu et al., 2015), which is previously used for RPCA in (Shen et al., 2019). The iteration is X ← MM> X(X> X)−1 − X((X> X)−1 X> MM> X(X> X)−1 − I)/2. When p is small, computing the inverse of X> X is fast. Though an iterative algorithm is required to solve this subproblem at each outer iteration, we can use the output from the previous outer iteration 19 as the initial and the number of inner iterations is reduced significantly. Therefore, the computational time can be reduced significantly, as shown in Section 2.3. In the numerical experiments, the first Gauss-Newton algorithm requires several hundred iterations, while the number for following Gauss-Newton algorithms reduces to less than ten. From Theorem 2.2.2, we say that we solve the proximal operator of the nonconvex func- tion kLk∗ + ιrank(L)≤p (L) exactly. Here the indicator function is defined as   0, if rank(L) ≤ p;  ιrank(L)≤p (L) =  +∞, otherwise.  With these theorems, we are ready to develop optimization algorithms for the general prob- lem (2.7). 2.2.1 Forward-backward First, we eliminate S, and it becomes the following problem with L only: 1 minimize min kA(L) + S − Dk2F + λkSk1 + µkLk∗ L:rank(L)≤p S 2   1 = minimize min 2 kA(L) + S − DkF + λkSk1 + µkLk∗ (2.12) L:rank(L)≤p S 2 = minimize fλ (D − A(L)) + µkLk∗ . L:rank(L)≤p Here fλ is the Moreau envelope of λ| · | defined by fλ (x) = miny∈R {λ|y| + 21 (y − x)2 }. So it is differential and has a 1-Lipschitz continuous gradient. Then we can apply the proximal- gradient method (or forward-backward operator splitting). We take the gradient of fλ , which is given by fλ0 (x) = x − sign(x) max(0, |x| − λ) = sign(x) min(λ, |x|). (2.13) The forward-backward iteration for L with stepsize t is Lk+1 = proxtµ Lk − tA∗ fλ0 (A(Lk ) − D) , (2.14)  where the proximal operator is defined by 1 proxµ (A) = arg min kL − Ak2F + µkLk∗ . (2.15) L:rank(L)≤p 2 20 The algorithm is summarized in Alg. 2.1. Algorithm 2.1: RPCA for low rank matrix approximation Input: D, µ, λ, p, A, stepsize t, stopping criteria , maximum number of iterations M ax_Iter, initialization L0 = 0 Output: L, S 1 for k = 0, 1, 2, 3, . . . , Max_Iter do 2 S = sign(D − A(Lk )) max(0, |D − A(Lk )| − λ) ; 3 Lk+1 = proxtµ (Lk − tA∗ (A(Lk ) − D + S) using Gauss-Newton; 4 if kLk+1 − Lk kF /kLk kF <  then 5 break 6 end 7 end Connection to (Shen et al., 2019). Consider the special case with A = I and µ = 0. We let t = 1 in (2.14) and obtain the following iteration 1 Lk+1 = prox0 (Lk − fλ0 (Lk − D)) = arg min kL + Sk+1 − Dk2 , L:rank(L)≤p 2 where Sk+1 = sign(D − Lk ) max(0, |D − Lk | − λ). This is exactly the algorithm in (Shen et al., 2019) for solving (2.5). It alternates between finding the best S with L fixed and the best L (or (X, Y)) with S fixed. Recently, the work (Cai et al., 2019) proposed a novel RPCA algorithm with linear con- vergence. It projects matrices to special manifolds of low-rank matrices, and their truncated SVD can be computed efficiently. Our matrix does not have this property in our algorithm, and a good initial guess from the previous iteration is necessary to reduce the computation in the Gauss-Newton method. 2.2.1.1 Convergence analysis From the discussion above, problem (2.7) can be solved by an iteration process of forward- backward splitting. 
In each iteration, we reduce the value of the objective function 1 E(L, S) = kA(L) + S − Dk2F + λkSk1 + µkLk∗ (2.16) 2 21 by applying proximal operators to L and S alternatively. The resulting iteration sequence {(Lk , Sk )}k≥1 with some initial (L0 , S0 ) is explicitly given by Sk = sign(D − A(Lk−1 )) max(0, |D − A(Lk−1 )| − λ), (2.17) Lk = proxtµ Lk−1 − tA∗ (A(Lk−1 ) + Sk − D) ,  where the proximal operator proxtµ (·) for updating L is defined by (2.15). Here we use (2.13) to derive fλ0 (A(Lk−1 ) − D) = A(Lk−1 ) − D + sign(D − A(Lk−1 )) max(0, |D − A(Lk−1 )| − λ) = A(Lk−1 ) + Sk − D. In this subsection, we establish the convergence results for {(Lk , Sk )}k≥1 . We will show that every limit point of {(Lk , Sk )}k≥1 , denoted by (L? , S? ), is a fixed point of the proximal operator, i.e., S? = sign(D − A(L? )) max(0, |D − A(L? )| − λ), (2.18) ? ? ∗ ? ? L = proxtµ (L − tA (A(L ) + S − D)) . In practical execution, one can efficiently solve the proximal operator for L by solving (Xk , Yk ) through 1 minimize kXY> − Lk−1 + tA∗ (A(Lk−1 ) + Sk − D)k2F + µkXk∗ , X,Y 2 (2.19) > subject to Y Y = Ip×p , and letting Lk = Xk (Yk )> . We will also prove that if (X? , Y? , S? ) is a limit point of {(Xk , Yk , Sk )}k≥1 , then (X? (Y? )> , S? ) is a limit point of {(Lk , Sk )}k≥1 , and the limit point (X? , Y? , S? ) is a stationary point of 1 E(XY> , S) = kA(XY> ) + S − Dk2F + λkSk1 + µkXY> k∗ , 2 22 i.e., (X? , Y? , S? ) satisfies the first-order optimality condition 0 ∈ [A∗ (A(X? (Y? )> ) + S? − D) + µ∂kX? (Y? )> k∗ ]Y? , 0 ∈ (X? )> [A∗ (A(X? (Y? )> ) + S? − D) + µ∂kX? (Y? )> k∗ ], (2.20) 0 ∈ A(X? (Y? )> ) + S? − D + λ∂kS? k1 . We summarize these results in the following theorem. Theorem 2.2.4. Define the objective function E(L, S) as (2.16). Let {(Lk , Sk )}k≥1 be a sequence generated by (2.17) with initial (L0 , S0 ) and stepsize t < 1 kAk2 , where Lk = Xk (Yk )> with (Xk , Yk ) being solved from (2.19). We have the following statements: 1. The objective values {E(Lk , Sk )}k≥1 are non-increasing along {(Lk , Sk )}k≥1 . 2. The sequence {(Lk , Sk )}k≥1 is bounded and thus has limit points. 3. Every limit point (L? , S? ) of {(Lk , Sk )}k≥1 satisfies (2.18). 4. The sequence {(Xk , Yk , Sk )}k≥1 is also bounded. In addition, for any limit point (X? , Y? , S? ) of {(Xk , Yk , Sk )}k≥1 , (X? (Y? )> , S? ) is a limit point of {(Lk , Sk )}k≥1 . 5. Every limit point (X? , Y? , S? ) of {(Xk , Yk , Sk )}k≥1 is a stationary point of E(XY> , S), which satisfies the first-order optimality condition in (2.20). In addition, if A = I, we can take the stepsize t = 1, and all the statements above still hold. 23 Proof. We start by verifying the first two statements. For k ≥ 0 and t < 1 kAk2 , we have E(Lk+1 , Sk+1 ) 1 = kA(Lk+1 ) − A(Lk )k2F + hA(Lk+1 ) − A(Lk ), A(Lk ) + Sk+1 − Di 2 1 + kA(Lk ) + Sk+1 − Dk2F + λkSk+1 k1 + µkLk+1 k∗ 2 1 k+1 ≤ kL − Lk k2F + hLk+1 − Lk , A∗ fλ0 (A(Lk ) − D)i + µkLk+1 k∗ 2t kAk2 (2.21)   1 k k+1 2 k+1 1 + kA(L ) + S − DkF + λkS k1 + − kLk+1 − Lk k2F 2 2 2t   1 1 k+1 k ∗ 0 k 2 k+1 = kL − L + tA fλ (A(L ) − D)kF + tµkL k∗ t 2 kAk2   t ∗ 0 1 k − kA fλ (A(L ) − D)kF + 2 − kLk+1 − Lk k2F 2 2 2t 1 + kA(Lk ) + Sk+1 − Dk2F + λkSk+1 k1 , 2 where the inequality is due to the facts that kA(Lk+1 ) − A(Lk )k2F ≤ kAk2 kLk+1 − Lk k2F and A(Lk ) + Sk+1 − D = fλ0 (A(Lk ) − D). Note that Lk+1 = proxtµ Lk − tA∗ fλ0 (A(Lk ) − D) , which solves  1 minimize kL − Lk + tA∗ fλ0 (A(Lk ) − D)k2F + tµkLk∗ . 
L:rank(L)≤p 2 Since rank(Lk ) ≤ p, we have 1 k+1 kL − Lk + tA∗ fλ0 (A(Lk ) − D)k2F + tµkLk+1 k∗ 2 1 ≤ kLk − Lk + tA∗ fλ0 (A(Lk ) − D)k2F + tµkLk k∗ 2 t2 ∗ 0 = kA fλ (A(Lk ) − D)k2F + tµkLk k∗ . 2 Substituting the above estimate to (2.21) yields kAk2   k+1 k+1 1 E(L , S ) ≤ − kLk+1 − Lk k2F 2 2t (2.22) 1 + kA(Lk ) + Sk+1 − Dk2F + µkLk k∗ + λkSk+1 k1 . 2 24 Moreover, we see that 1 Sk+1 = arg min kS − (D − A(Lk ))k2F + λkSk1 . S 2 Then from (Lou and Yan, 2018, Lemma 2), there holds 1 k+1 kS − (D − A(Lk ))k2F + λkSk+1 k1 2 (2.23) 1 1 ≤ kSk − (D − A(Lk ))k2F + λkSk k1 − kSk+1 − Sk k2F . 2 2 Combining estimates (2.22) and (2.23), we find that kAk2   1 1 E(L k+1 ,S k+1 k k ) ≤ E(L , S ) + − kLk+1 − Lk k2F − kSk+1 − Sk k2F . (2.24) 2 2t 2 kAk2 Since 2 − 1 2t < 0, the estimate above implies E(Lk+1 , Sk+1 ) ≤ E(Lk , Sk ) for any k ≥ 0, which verifies the first statement. Note that the target function E(L, S) is coercive, i.e., E(L, S) → +∞ when kLkF + kSkF → +∞. Since E(Lk , Sk ) ≤ E(L0 , S0 ) < +∞, ∀k ≥ 1, this property guarantees that both {Lk }k≥1 and {Sk }k≥1 are bounded sequences, and thus the second statement holds. For any limit point (L? , S? ) of {(Lk , Sk )}k≥1 , there exists a convergent subsequence {(Lki , Ski )}i≥1 such that Lki → L? and Ski → S? . On the other hand, we see that Ski +1 = sign(D − A(Lki )) max(0, |D − A(Lki )| − λ), (2.25) ki +1 Lki − tA∗ (A(Lki ) + Ski +1 − D) .  L = proxtµ Summing both sides of (2.24) from k = 0 to ∞, we obtain  X ∞ ∞ 1 2 k+1 k 2 X − kAk kL − L kF + kSk+1 − Sk k2F ≤ 2E(L0 , S0 ) < ∞. t k=0 k=0 This inequality guarantees that {Ski +1 }i≥1 has the same limit point S? as that of {Ski }i≥1 , and {Lki +1 }i≥1 has the same limit point L? as that of {Lki }i≥1 . Then by taking limits in both sides of the two equations in (2.25), we obtain the third statement. Next we will prove the last two statements. As kXk k2F = kLk k2F and kYk k2F = p, we know that the sequence {(Xk , Yk , Sk )}k≥1 is also bounded. Let (X? , Y? , S? ) be a limit point 25 of {(Xk , Yk , Sk )}k≥1 , which is the limitation of a subsequence {(Xki , Yki , Ski )}i≥1 . Then we have Lki = Xki (Yki )> → X? (Y? )> and Ski → S? , i.e., (X? (Y? )> , S? ) is the limit point of the subsequence {(Lki , Ski )}i≥1 . Thus the fourth statement is verified. Now we are in the position to prove the fifth statement. Due to the third and fourth statements, if (X? , Y? , S? ) is a limit point of {(Xk , Yk , Sk )}k≥1 , i.e., (X? (Y? )> , S? ) should satisfy (2.18) S? = sign(D − A(X? (Y? )> )) max(0, |D − A(X? (Y? )> )| − λ), (2.26) X? (Y? )> = proxtµ X? (Y? )> − tA∗ (A(X? (Y? )> ) + S? − D) .  The first condition in (2.26) implies that the limit point S? minimizes 1 kA(X? (Y? )> ) + S − Dk2F + λkSk1 + µkX? (Y? )> k∗ 2 over all S ∈ Rm×n . Thus, S? should satisfy the third condition in (2.20). Moreover, since rank(X? (Y? )> ) ≤ p, the second condition in (2.26) actually implies that (X? , Y? ) is an optimal solution of the problem 1 minimize kXY> − X? (Y? )> + tA∗ (A(X? (Y? )> ) + S? − D)k2F + tµkXY> k∗ . X,Y 2 Therefore, (X? , Y? ) should satisfy the first-order optimality condition for X, which gives [X? (Y? )> − X? (Y? )> + tA∗ (A(X? (Y? )> ) + S? − D)]Y? + tµ∂kX? (X? (Y? )> )> k∗ Y? =t[A∗ (A(X? (Y? )> ) + S? − D) + µ∂kX? (Y? )> k∗ ]Y? 3 0. Similarly, from the first-order opitmality condition for Y, one can verify that 0 ∈ (X? )> [A∗ (A(X? (Y? )> ) + S? − D) + µ∂kX? (Y? )> k∗ ]. We thus derive the first two conditions in (2.20). 
26 We will complete our proof by verifying the convergence results for the special case of A = I and t = 1. In this case, by the same method, one can derive a similar inequality as (2.24), which is 1 E(Lk+1 , Sk+1 ) ≤ E(Lk , Sk ) − kSk+1 − Sk k2F . 2 Then {E(Lk , Sk )}k≥1 are non-increasing along {(Lk , Sk )}k≥1 , and {(Lk , Sk )}k≥1 is bounded due to the coerciveness of E(L, S). Let (L? , S? ) be the limit point of {(Lk , Sk )}k≥1 achieved by the subsequence {(Lki , Ski )}i≥1 . Recall the iterations for updating Ski +1 and Lki given by Ski +1 = sign(D − Lki ) max(0, |D − Lki | − λ), (2.27) ki D − Ski .  L = proxµ P∞ Since k=0 kSk+1 − Sk k2F ≤ 2E(L0 , S0 ) < +∞, {Ski +1 }i≥1 has the same limit point S? as that of {Ski }i≥1 . Taking limits in both sides of equations (2.27) yields the condition (2.18) for A = I and t = 1. The last two statements can be verified by exactly the same arguments for the general case. We thus complete the proof. 2.2.2 An accelerated algorithm We show in the previous subsection that Alg. 2.1 is a forward-backward splitting or proximal gradient algorithm for a nonconvex problem. Recently, accelerated proximal gradient (APG) algorithms are proposed for nonconvex problems to reduce the computational time without sacrificing convergence (Li and Pong, 2015; Li and Lin, 2015). In this chapter, we adopt the nonmonotone APG (Li and Lin, 2015, Alg. 2) because of its better performance shown in (Li and Lin, 2015). The algorithm is described in Alg. 2.2. We let δ = 1 and η = 0.6 in the numerical experiments. 2.3 Numerical experiments In this section, we use synthetic data and real images to demonstrate the performance of our proposed model and algorithms. The code to reproduce the results in this section can be found at https://github.com/mingyan08/RPCA_Rank_Bound. 27 Algorithm 2.2: Accelerated RPCA with nonmonotone APG Input: D, µ, λ, p, A, stepsize t, η ∈ [0, 1), δ > 0, stopping criteria , maximum number of iterations M ax_Iter, initialization: L0 = L1 = Z1 = 0, t0 = 0, t1 = q 1 = 1, c1 = F (L1 ) Output: L, S 1 for k = 1, 2, 3, .., Max_Iter do k−1 k−1 2 L = Lk + t tk (Zk − Lk ) + t tk−1 (Lk − Lk−1 ); 3 S = sign(D − A(L)) max(0, |D − A(L)| − λ); 4 Zk+1 = proxtµ (L − tA∗ (A(L) − D + S)); 5 if F (Zk+1 ) ≤ ck − δkZk+1 − Lk2 then 6 Lk+1 = Zk+1 ; 7 else 8 Sk = sign(D − A(Lk )) max(0, |D − A(Lk )| − λ); 9 Vk+1 = ( proxtµ (Lk − tA∗ (A(Lk ) − D + Sk )); k+1 Zk+1 if F (Zk+1 ) ≤ F (Vk+1 ); 10 L = Vk+1 otherwise; 11 end 12 if kLk − Lk−1 kF /kLk−1 kF <  then 13 break 14 end √ k2 4(t ) +1+1 15 tk+1 = 2 ; k+1 k 16 q = ηq + 1; ηq k ck +F (Lk+1 ) 17 ck+1 = q k+1 ; 18 end 2.3.1 Synthetic data We would like to recover the low-rank matrix from a noisy matrix that is contaminated by a sparse matrix and Gaussian noise. We create a true low-rank 500 × 500 matrix L? by multiplying a random 500 × r matrix and a random r × 500 matrix, where their components are generated from standard normal distribution independently. We calculate the mean of the absolute values of all the components in L? and denote it as c. Then we randomly select s% of the components and replace their values with uniformly distributed random values from [−3c, 3c]. After that, we add small Gaussian noise N (0, σ 2 ) to all components of the matrix. We let t = 1.7 in the experiments because of fast convergence, though the convergence results in Theorem 2.2.4 require t < 1. 28 2.3.1.1 Low-rank matrix recovery We fix σ = 0.05 for the Gaussian noise and set the upper bound of the rank to be p = r + 5. 
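For concreteness, the data-generation protocol above can be summarized in a short NumPy sketch. The function name make_synthetic_rpca_data and the random seed are ours; the parameters n, r, s, and sigma follow the description in this subsection.

```python
import numpy as np

def make_synthetic_rpca_data(n=500, r=25, s=20, sigma=0.05, seed=0):
    """D = low-rank L* corrupted by s% sparse outliers plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    # True low-rank matrix: product of two standard-normal factors.
    L_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
    c = np.mean(np.abs(L_star))              # mean absolute value of the entries
    D = L_star.copy()
    # Replace s% of the entries with uniform values from [-3c, 3c].
    num_outliers = int(n * n * s / 100)
    idx = rng.choice(n * n, size=num_outliers, replace=False)
    D.flat[idx] = rng.uniform(-3 * c, 3 * c, size=num_outliers)
    # Add small Gaussian noise N(0, sigma^2) to every entry.
    D += sigma * rng.standard_normal((n, n))
    return D, L_star

D, L_star = make_synthetic_rpca_data(r=25, s=20, sigma=0.05)
```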
We stop all algorithms when the relative error at the k-th iteration, which is defined as kLk+1 − Lk kF RE(Lk+1 , Lk ) := , kLk kF is less than 10−4 . We use the relative error to L? , which is defined as kL − L? kF RE(L, L∗ ) := , kL? kF to evaluate the performance of our proposed model and that in (Shen et al., 2019). First, we consider the case with r = 25 and s = 20. We plot a contour map of the relative error to L? for different parameters µ and λ in Fig. 2.1. From this contour map, we can see that the best parameter does not happen when µ = 0, which corresponds to the model in (Shen et al., 2019). It verifies the better performance of our proposed model with appropriate parameters. In this subsection, we set λ = 0.02 for Shen et al.’s and (µ = 0.6, λ = 0.04) for our proposed algorithms. In addition, we consider another two settings for (r, s), and the comparison with different algorithms is shown in Table 2.1. In this table, we also compare the number of iterations for three algorithms: Shen et al.’s, Alg. 2.1, and Alg. 2.2. From this table, we can see that both Alg. 2.1 and Alg. 2.2 have better performance and fewer iterations than (Shen et al., 2019). The accelerated Alg. 2.2 has the fewest iterations, but its performance in terms of RE(L, L? ) is not as good as Alg. 2.1 for the last case. It is because we stop both algorithms when the stopping criteria is satisfied, and the algorithms are not converged yet. We checked the objective function values for both algorithms, and the value for Alg. 2.2 is smaller than that for Alg. 2.1 in this case. Therefore, if we want a solution close to the true low-rank matrix L? , we may need to stop early before the convergence, which is the same as many models for inverse problems. 29 0.2 4 0.002 0.15 0.0 0. 8 0.1 4 0.0.02 01 0.0 01 0. 08 0.05 08 00..00248 0.0 0. 0.0 04 00..02 0.01 00248 00...0 0 0.5 1 1.5 2 Figure 2.1: The contour map of the relative error to L? for different parameters. In this experiment, we set r = 25 and s = 20. The upper bound of the rank is set to be p = 30. Shen et al.’s (Shen et al., 2019) Alg. 1 Alg.2 r s RE(L, L? ) # iter RE(L, L? ) # iter RE(L, L? ) # iter 25 20 0.0745 1318 0.0075 296 0.0075 68 50 20 0.0496 1434 0.0101 473 0.0088 77 25 40 0.0990 2443 0.0635 796 0.0915 187 Table 2.1: Comparison of three RPCA algorithms. We compare the relative error of their solutions to the true low-rank matrix and the number of iterations. Both Alg. 2.1 and Alg. 2.2 have better performance than (Shen et al., 2019) in terms of the relative error and the number of iterations. Alg. 2.2 has the fewest iterations but the relative error could be large. It is because the true low-rank matrix is not the optimal solution to the optimization problem, and the trajectory of the iterations moves close to L? before it approaches the optimal solution. 2.3.1.2 Robustness of the model In this experiment, we compare the robustness of our proposed model with that of (Shen et al., 2019). We let r = 25 and s = 20. Then we run both models for p from 15 to 35. The comparison of the relative error to L? is shown in Fig. 2.2. We let λ = 0.02 for Shen et al.’s and (µ = 0.6, λ = 0.04) for Alg. 2.2. It shows that our proposed model is robust to the 30 parameter p, as long as it is not smaller than the true rank r. 0.6 Shen et al. Alg. 2 0.5 relative error of L 0.4 0.3 0.2 0.1 0 15 20 25 30 35 rank (p) Figure 2.2: The relative error to the true low-rank matrix vs the rank p for Shen et al.’s and Alg. 2.2. Alg. 
2.2 is robust to p, as long as p is not smaller than the true rank 25. 2.3.1.3 Low-rank matrix recovery with missing entries In this experiment, we try to recover the low-rank matrix when there are missing entries in the matrix. Therefore, the operator A is not the identity I. We randomly select the missing entries from all the entries. We let r = 25 and add both the sparse noise with parameter s and the Gaussian noise with parameter σ to the true matrix L? . Then we apply Alg. 2.2 to recover the low-rank matrix, and the relative error to L? is used to evaluate the performance. The results for different settings are in Table 2.2. For the first three cases with s = 20, we choose (µ = 0.5, λ = 0.04), while we let (µ = 0.1, λ = 0.01) for the last case with s = 5. Note that, even with missing entries, Alg. 2.2 can reconstruct the low-rank matrix accurately. 31 s σ ratio of missing entries RE(L, L? ) by Alg. 2.2 20 0.05 10% 0.0079 20 0.05 20% 0.0088 20 0.05 50% 0.0201 5 0.01 50% 0.0015 Table 2.2: Performance of Alg. 2.2 on low-rank matrix recovery with missing entries. We change the level of sparsity in the sparse noise, standard deviation of the Gaussian noise, and the ratio of missing entries. 2.3.2 Real image experiment In this section, we consider the three algorithms applied to image processing problems. Since natural images are not low-rank essentially, we consider two cases on two different images (‘cameraman’ and ‘Barbara’). For the 256 × 256 cameraman image (the pixel values are from 0 to 255), we create an image with rank 37 from a low-rank approximation of the original image. Then we add 20% salt and pepper impulse noise and Gaussian noise with standard variance 4. We set 42 as the upper bound of the rank of the low-rank image for all algorithms. We let λ = 0.03 for Shen et al. and (µ = 0.5, λ = 0.06) for our model. To compare the performance of both models, we use the relative error defined in the last subsection and peak signal to noise ratio (PSNR) defined as Peak_Val2 PSNR := 10 log10 . MSE Here Peak_Val is the largest value allowed at a pixel (255 in our case), and MSE is the mean squared error between the recovered image and the true image. The numerical results are shown in Fig. 2.3. From Fig. 2.3(A-C), we can see that our proposed model performs better than Shen et al. (Shen et al., 2019). For the proposed model, we also compare the speed of three algorithms: Alg. 2.1, Alg. 2.1 with standard SVD, and Alg. 2.2 in Fig. 2.3(D). For both plots, we can see that the Gauss-Newton approach increases the speed comparing to the standard SVD approach. From the decrease of the objective function value, we can see that the accelerated algorithm Alg. 2.2 is faster than the nonaccelerated Alg. 2.1. 32 (a) Corrupted image (b) Recovered by Shen et al. RE: 0.4760, PSNR: 12.76 RE: 0.1736, PSNR: 21.52 general SVD Alg. 1 105 Alg. 2 104 103 0 10 20 30 40 50 60 time (s) (c) Recovered by Alg. 2.2 (d) Comparison of the objective function RE: 0.0457, PSNR:33.11 value vs time for three algorithms Figure 2.3: The numerical experiment on the ‘cameraman’ image. (A-C) show that the proposed model performs better than Shen et al.’s both visually and in terms of RE and PSNR. (D) compares the objective values vs time for general SVD, Alg. 2.1, and Alg. 2.2. Here f ? is the value obtained by Alg. 2.2 with more iterations. It shows the fast speed with the Gauss-Newton approach and acceleration. With the Gauss-Newton approach, the computation time for Alg. 
2.1 is reduced to about 1/7 of the one with standard SVD (from 65.11s to 8.43s). The accelerated Alg. 2.2 requires 5.2s, though the number of iterations is reduced from 3194 to 360. Next, we use the original 512 × 512 barbara image (the pixel values are from 0 to 255) without modification and add the same two types of noise as in the cameraman image. Because the original image is not low-rank, we choose the upper bound of rank p = 50. We let λ = 0.03 for Shen et al. and (µ = 0.5, λ = 0.06) for our model. The comparison 33 result is shown in Fig. 2.4, and it is similar to the cameraman image. We also applied the acceleration to Shen et al.’s algorithm and obtained a better image with RE = 0.1447 and PSNR = 22.37. (a) Corrupted image (b) Recovered by Shen et al RE: 0.4821, PSNR: 11.91 RE: 0.3368, PSNR: 15.03 106 general SVD Alg. 1 Alg. 2 105 104 0 50 100 150 time (s) (c) Recovered by Alg. 2.2 (d) Comparison of the objective function RE: 0.1317, PSNR: 23.18 value vs time for three algorithms Figure 2.4: The numerical experiment on the ‘Barbara’ image. (A-C) show that the proposed model performs better than Shen et al.’s both visually and in terms of RE and PSNR. (D) compares the objective values vs time for general SVD, Alg. 2.1, and Alg. 2.2. Here f ? is the value obtained by Alg. 2.2 with more iterations. It shows the fast speed with the Gauss- Newton approach and acceleration. With the Gauss-Newton approach, the computation time for Alg. 2.1 is reduced to less than 1/3 of the one with standard SVD (from 148.6s to 43.7s). The accelerated Alg. 2.2 requires 23.3s, though the number of iterations is reduced from 3210 to 300. 34 2.4 Concluding remarks In this chapter, we introduced a new model for RPCA when an upper bound of the rank is provided. For the unconstrained RPCA problem, we formulate it as the sum of one smooth function and one nonsmooth nonconvex function. Then we derive an algorithm based on proximal-gradient. This proposed algorithm has the alternating minimization algo- rithm (Shen et al., 2019) as a special case. Because of the connection between this algorithm and proximal gradient, we adopted an acceleration approach and proposed an accelerated algorithm. Both proposed algorithms have two advantages comparing to existing algorithms. First, different from algorithms that require accurate rank estimations, the proposed algo- rithms are robust to the upper bound of the rank. Second, we apply the Gauss-Newton algorithm to avoid the computation of singular values for large matrices, so our algorithm is faster than those algorithms that require SVD. Except for problem (2.7), this algorithm can be generalized to solve many other variants. 2.4.1 Nonconvex penalties on the singular values In the problem (2.7), we choose the convex nuclear norm for the low-rank component in the objective function, which is the `1 norm on the singular values. The `1 norm pushes all singular values toward zero for the same amount, bringing bias in the solution. To promote the low-rankness of the low-rank component (or sparsity of its singular values), we can choose nonconvex regularization terms for the singular values. The idea for nonconvex regularization is to reduce the bias by pushing less on larger singular values. Some examples of nonconvex regularization are `p (0 ≤ p < 1) (Chartrand, 2007), smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001), minimax concave penalty (MCP) (Zhang et al., 2010), nonconvex weighted `1 (Huang et al., 2015), etc. 
When these regularization terms are applied, the only difference is in the third step for finding X in Lemma 2.2.3. Currently, we have to apply the soft thresholding on the singular values. When nonconvex regularization is 35 used, we apply the corresponding thresholding on the singular values. In this case, all the convergence results stay valid. 2.4.2 Other regularization on the sparse component We can also replace the `1 norm of the sparse component with other regularization terms. Similarly to the penalty on the singular values, the `1 norm on the sparse component brings bias, and we can use nonconvex regularization terms. Wen et al. (2019) uses both nonconvex regularization terms for the low-rank and sparse components. When different regularization terms are used on the sparse component, the new function fλ (see (2.12) for the definition) may not be differentiable any more. In this case, the convergence results do not hold. 2.4.3 Constrained problems When there is no noise in the measurements, the problem becomes constrained, and the pre- vious algorithm can not be applied directly. Shen et al. (2019) uses the penalty method and gradually increases the weight for the penalization to approximate the constrained problem. Here, we introduce a new method based on ADMM. We consider the following constrained problem minimize µkLk∗ + kSk1 , subject to rank(L) ≤ p, D = L + S. (2.28) L,S When we apply ADMM, the steps are α Zk 2 Lk+1 = arg min µkLk∗ + kD − L − Sk + kF ; (2.29a) L:rank(L)≤p 2 α α Zk 2 Sk+1 = arg min kSk1 + kD − Lk+1 − S + k ; (2.29b) S 2 α F Zk+1 = Zk − α(Lk+1 + Sk+1 − D). (2.29c) The first step is exactly the proximal operator that can be solved from Lemma 2.2.3. The other two steps are easy to compute. This algorithm has only one parameter α, while penalty 36 methods, such as that in (Shen et al., 2019), require additional parameters to increase the weight for the penalization. 37 CHAPTER 3 ROBUST PRINCIPAL COMPONENT ANALYSIS FOR SEISMIC EVENT DETECTION 3.1 Introduction Reflected seismic data is contaminated by both random and coherence noise. Random noise is usually caused by environmental inferences, and coherent noise is mostly generated by the source. Denoising is an important preprocessing step because noisy seismic data may lead to unrealistic artifacts in the inversion or imaging results. But, it is challenging to effectively and efficiently eliminate noise from noisy seismic data. Various seismic denoising methods have been developed to remove random noise (Yu et al., 2015; Fomel and Liu, 2013; Kreimer and Sacchi, 2012) and coherent noise (Weglein, 2016; Liu and Fomel, 2013; Herman and Perkins, 2006). Our technique belongs to the sparse-transform-based methods. In this type of methods, it was shown that the signal could be sparsely represented by a basis/dictionary or it is sparse after some transform such as wavelet transform. In particular, different methods based on sparsity are proposed (Chen et al., 2016; Rubinstein et al., 2010). Since the signal received at different receivers are correlated, the measurements lie in a low dimensional space. Dimension reduction techniques are also applied to model the signal. They include empirical mode decomposition based methods (Kopsinis and McLaughlin, 2009; Chen et al., 2017), singular spectral analysis (Qiao et al., 2017), RPCA (Candès et al., 2011; Cheng et al., 2015), and Cadzow filtering. 
Though RPCA-based denoising methods achieve promising results in many applica- tions (Cheng et al., 2015; Sun et al., 2014; Duarte et al., 2012), existing algorithms are slow and yield a large amount of computational time, especially for large-scale data set. Therefore, we develop new algorithms for RPCA and its nonconvex variants. To verify the performance of these algorithms, we apply them to both synthetic and field data. Through 38 the numerical experiments, our proposed algorithms significantly improve computational efficiency and yield comparable or better denoising results. 3.2 Theory In this section, we focus on RPCA, though the technique can be applied to other methods. Given the noisy data denoted as an n1 × n2 matrix D, where n1 and n2 are the number of receivers and measurements at each receiver, respectively. The goal of RPCA is to get a low-rank matrix L and a sparse matrix S from this noisy matrix such that L + S = D. Ma and Aybat (2018) reviewed several forms of RPCA and efficient algorithms. One formula is minimize rank(L) + λkSk0 , subject to L + S = D, (3.1) L,S where rank(L) is the rank of the matrix L, kSk0 is the number of nonzero elements in the matrix S, and λ is a parameter to balance these two terms. This model assumes that the data is corrupted by sparse noise, which is modeled as a sparse matrix S. However, in practice, especially in the collected seismic data, D often includes other types of noise such as Gaussian noise, denoted as N. We set the random noise level to be σ, namely, kNk2F ≤ σ, where k · kF is the Frobenious norm. Hence, the corresponding formula becomes minimize rank(L) + λkSk0 , subject to kD − L − Sk2F ≤ σ. (3.2) L,S This problem is NP-hard (Ma and Aybat, 2018), and direct numerical calculation is impos- sible. One direction is convexification. It has been proved in (Zhou et al., 2010b) that a relaxed version –principal component pursuit– can be defined as minimize kLk∗ + λkSk1 subject to kD − L − Sk2F ≤ σ. (3.3) L,S The above minimization problem yields a stable estimate of the low-rank matrix and the sparse matrix under some conditions for D. Here, k · k∗ is the nuclear norm defined as the sum of all singular values, and k · k1 is the sum of the absolute values of all elements. 39 The constrained problem (3.3) is equivalent to the following unconstrained one minimize kLk∗ + λkSk1 + 1 2µ kD − L − Sk2F . (3.4) L,S That is, given σ in (3.3), we can find µ such that the optimal solutions for both (3.3) and (3.4) are equivalent, and vice versa. Therefore, we focus on solving Eq. (3.4) instead of Eq. (3.3). This problem is convex, and many existing convex optimization algorithms are applied. Some examples are proximal gradient method (PGM), accelerated PGM, and alternating direction methods of multipliers (ADMM). We provide some brief introduction over those methods. We consider the two matrices L and S together. The last function in (3.4) is differentiable with respect to L and S, while the first two functions are separable. One iteration of PGM can be expressed as  L̄k = Lk − µt (Lk + Sk − D),         k k t k k S̄ = S − µ (L + S − D),   (3.5)    Lk+1 = arg min tkLk∗ + 21 kL − L̄k k2F ,   L   Sk+1 = arg min tλkSk1 + 21 kS − S̄k k2F ,    L where t ∈ (0, µ) is the stepsize. PGM has the convergence rate of O(1/k) for general convex functions. Acceleration tech- niques improve the convergence rate to O(1/k 2 ). Some examples are provided in (Nesterov, 2013, 2005; Kim and Fessler, 2016; Beck and Teboulle, 2009a,b). 
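To make the iteration (3.5) concrete, the following is a minimal NumPy sketch of one PGM step for the unconstrained problem (3.4); the helper names soft_threshold, svt, and pgm_step are ours, and the two proximal updates are the standard entrywise and singular-value soft-thresholding operators.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise soft-thresholding: proximal operator of tau * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: proximal operator of tau * ||.||_* ."""
    U, sig, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(sig - tau, 0.0)) @ Vt

def pgm_step(L, S, D, mu, lam, t):
    """One iteration of (3.5): simultaneous gradient step on (L, S),
    then the two proximal updates. The stepsize must satisfy t in (0, mu)."""
    G = (L + S - D) / mu                     # gradient of (1/(2 mu))||D - L - S||_F^2 w.r.t. L (and S)
    L_bar = L - t * G
    S_bar = S - t * G
    L_new = svt(L_bar, t)                    # prox of t * ||.||_*
    S_new = soft_threshold(S_bar, t * lam)   # prox of t * lam * ||.||_1
    return L_new, S_new
```

The accelerated variant described next reuses the same step, applied at an extrapolated point.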
Beck and Teboulle (2009a) develop fast iterative shrinkage-thresholding algorithm (FISTA) by applying the proximal operator on an extrapolated point. By denoting one iteration of PGM as (Lk+1 , Sk+1 ) = PGM(Lk , Sk ), the iteration of FISTA is equivalent to θk−1 −1 (L̂k , Ŝk ) = (Lk , Sk ) + θk (Lk − Lk−1 , Sk − Sk−1 ), (Lk+1 , Sk+1 ) = PGM(L̂k , Ŝk ), q where θk = (1 + 1 + 4θk−1 2 )/2 and θ0 = 1. FISTA requires fewer iterations than PGM with similar per-iteration cost, which is demonstrated in our numerical experiments. 40 ADMM (Boyd et al., 2011; Yuan and Yang, 2013) solves a constrained problem. By introducing a new matrix Z, an equivalent formulation can be obtained minimize kLk∗ + λkSk1 + 1 2µ kZk2F s.t. Z + S + L = D. (3.6) Z,L,S Though the convergence of three-block ADMM is not guaranteed for general problems, this ADMM for RPCA converges with an appropriately chosen parameter (Wang et al., 2019). These algorithms are computationally efficient for small matrices, but they suffer slow convergent rates when handling medium to large-scale data sets, such as seismic data. The efficiency of the algorithms are directly related to the number of the unknown variables. PGM and FISTA have two matrices as unknown variables, while ADMM has three. In this abstract, we use infimal convolution to reduce unknown variables to be one matrix and obtain faster algorithms than existing ones. Besides solving the convex formula (3.4), people also solve nonconvex ones (Zhou and Tao, 2011, 2013) because of their better performance. In this abstract, we also consider a nonconvex model by replacing the L1 term with a nonconvex term to show the robustness of our algorithms. Methodology and Algorithms 3.2.1 New algorithms with infimal convolution The problem (3.4) has two unknown matrices L and S, and only the last two terms have S. So we can eliminate S by finding the optimal S with a given L h(D − L) := min λkSk1 + 1 2µ kD − L − Sk2F . (3.7) S By defining f (S) = λkSk1 and g(X) = 1 2µ kXk2F , we will have h as the infimal convolution of f and g, which is defined as f g : x → min f (y) + g(x − y). Hence, the problem (3.4) can y be reduced to minimize kLk∗ + f g(D − L), (3.8) L 41 which contains only one unknown matrix L. The infimal convolution f g(D − L) is differ- entiable with respect to L, and its gradient is 1 µ -Lipschitz continuous. So we apply PGM and FISTA to solve the problem (3.8) with L only. We name the corresponding algorithms as IC-PGM and IC-FISTA. We also apply ADMM with two blocks and name the new algo- rithm as IC-ADMM. Its convergence is well studied, and there is no restriction in choosing its parameters. For simplicity, we only provide the iteration of IC-PGM as below  S̄k = Sk − (Lk + Sk − D),        k+1 = arg min µλkSk1 + 21 kS − S̄k k2F ,  S   L (3.9) L̄k = Lk − µt (Lk + Sk+1 − D),        Lk+1 = arg min tkLk∗ + 12 kL − L̄k k2F .    L Here, the stepsize of t ∈ (0, 2µ), which is larger than PGM. The derivation of IC-FISTA is similar to IC-PGM. 3.2.1.1 Comparison between PGM and IC-PGM Comparing the steps in (3.5) and (3.9), we notice that PGM updates L and S simultane- ously, while IC-PGM updates S first and use the updated S to update L. The improved performance of IC-PGM over conventional PGM comes from two folds. Firstly, alternative update is faster than simultaneous update, which is similar to the improvement of Gauss- Seidel over Jacobian methods for solving linear equations. 
Secondly, IC-PGM essentially solves the problem with L only, which allows a larger stepsize than the conventional PGM. A new nonconvex model Though IC-FISTA solves RPCA efficiently, RPCA is still a convex relaxed model. In order to obtain better performance, we consider nonconvex models that we can apply the infimal convolution technique to get fast algorithms. There are many nonconvex penalties, please see (Huang and Yan, 2018; Wen et al., 2018). In this abstract, we choose Minimax Concave 42 Penalty (MCP) (Zhang et al., 2010) to replace the L1 term λkSk1 in RPCA. When we apply IC-PGM or IC-FISTA to solve this nonconvex problem, we just need to replace the step for updating S: Sk+1 = arg min µr(S) + 12 kS − S̄k k2F , L where r(S) = gλ,b (Si,j ) is the MCP function with Pn1 Pn2 i=1 j=1  S2i,j λ|Si,j | −  , |Si,j | ≤ bλ, 2b gλ,b (Si,j ) = (3.10)  bλ2 ,  |Si,j | > bλ. 2 Here b is a parameter. When b goes to infinity, r(S) becomes the L1 term. 3.3 Results 3.3.1 Synthetic seismic data We first test our denoising algorithms on synthetic seismic data. The seismic measurements are collections of synthetic seismograms obtained by implementing forward modeling on a velocity model with a few layers. One common-shot gather of synthetic seismic data with 500 receivers is posed at the top surface of the model. The interval between two receivers is 5 m. We use a Ricker wavelet with a center frequency of 25 Hz as the source time function and a staggered-grid finite-difference scheme with a perfectly matched layered absorbing boundary condition to generate 2D synthetic seismic reflection data (Tan and Huang, 2014). The synthetic trace at each receiver is a collection of time-series data of length 1, 000. We add two types of noise onto the seismic data. First, we add 25.2 dB Gaussian noise. Then, we choose 2% of the data and reset their values with random numbers uniformly distributed in [−u, u] with u being three multiply the largest value in the clean data. The overall signal to noise ratio (SNR) is -26.2 dB. 43 100 100 200 200 300 300 400 400 Time Step Time Step 500 500 600 600 700 700 800 800 900 900 1000 1000 50 100 150 200 250 300 350 400 450 500 50 100 150 200 250 300 350 400 450 500 Receivers Receivers (a) clean data (b) noisy data 100 100 200 200 300 300 400 400 Time Step Time Step 500 500 600 600 700 700 800 800 900 900 1000 1000 50 100 150 200 250 300 350 400 450 500 50 100 150 200 250 300 350 400 450 500 Receivers Receivers (c) recovered signal by L1 (d) recovered signal by MCP 50 50 100 100 150 150 Time Step Time Step 200 200 250 250 300 300 350 350 400 400 150 200 250 300 350 150 200 250 300 350 Receivers Receivers (e) recovered sparse noise by L1 (f) recovered sparse noise by MCP 50 50 100 100 150 150 Time Step Time Step 200 200 250 250 300 300 350 350 400 400 150 200 250 300 350 150 200 250 300 350 Receivers Receivers (g) difference between (c) and (a) (h) difference between (d) and (a) Figure 3.1: Comparison of recovered results on synthetic seismic data with 500 receivers and 1000 measurements at each receiver. (a) simulated clean data. (b) noisy data (-26.2 dB). (c) recovered data by L1 (13.4 dB). (d) recovered data by MCP (13.9 dB). (e) recovered sparse noise by L1 . (f) recovered sparse noise by MCP. (g) the difference between the clean data and the recovered one by L1 . (h) the difference between the clean data and the recovered one by L1 . (e-h) zoom-in over receivers 150-350 and measurements 1-400. We first compare the recovery results for convex and nonconvex models in Fig. 
3.1. We 44 manually tune the parameters to obtain the best denoising results for both models. For the L1 penalty, we choose µ = 3 × 10−5 and λ = 0.12, while for MCP, we choose µ = 1 × 10−5 , λ = 0.135, and b = 10. Both models can remove noise from the noisy data. The SNR values of recovered data for both L1 and MCP are 13.4 dB and 13.9 dB, respectively. The figure confirms that the nonconvex model performs slightly better than the convex one. Next, we compare the efficiency of all algorithms on the convex RPCA model. We first run IC-ADMM for 1,000 iterations to obtain an estimation for the minimal objective value. Then we compare their performance in function values with respect to the iteration number and time in Fig 3.2. IC-ADMM has the fastest convergent rate and smallest computational time among all five algorithms. By applying our IC technique, traditional algorithms (PGM and FISTA) yield better convergent rates. Comparing Fig. 3.2(a) and Fig. 3.2(b), we note that the time for each iteration is similar for all algorithms. 105 105 PGM PGM FISTA FISTA IC-PGM IC-PGM IC-FISTA IC-FISTA 100 IC-ADMM 100 IC-ADMM Function Value(f - f*) Function Value (f - f*) 10-5 10-5 10-10 10-10 10-15 10-15 0 100 200 300 400 500 600 700 800 900 1000 0 10 20 30 40 50 60 70 80 90 # Iters Time(s) (a) performance-iteration (b) performance-time Figure 3.2: Comparison of five algorithms (PGM, FISTA, IC-PGM, IC-FISTA, IC-ADMM) for the convex RPCA on synthetic data. IC-ADMM has the fastest convergence rate and smallest computational time. IC technique improves the performance of PGM and FISTA significantly. 3.3.2 Field seismic data In this section, we apply our denoising algorithms to the field data collected from the IRIS Community Wavefield Experiment in Oklahoma (Kent et al., 2016). Fig. 3.3 shows the data collected by 220 major seismic sensors to detect earthquakes. Before applying RPCA to 45 20 40 60 80 Station Number 100 120 140 160 180 200 220 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Time(s) Figure 3.3: Noisy data generated in Oklahoma. recover the data, we applied discrete cosine transform on the time domain. The comparison between L1 and MCP is in Fig. 3.4. We set µ = 3 × 104 , λ = 3 × 10−3 , and b = 3 × 106 . The stopping criteria is max(kSk+1 − Sk k2F /kSk k2F , kLk+1 − Lk k2F /kLk k2F ) < 10−3 . As shown in Fig. 3.4, both models successfully separate the horizontal signal (we do not want) and vertical signal (we want). In addition, we compare the computation time and total numbers of iterations in Ta- ble 2.1. IC technique improves the performance of conventional algorithms. They are 4-10x Algorithm # Iters Time (s) Function vlaue PGM 940 9.23 × 103 4.92 × 107 FISTA 166 1.63 × 103 4.92 × 107 IC-PGM 81 1.07 × 103 4.92 × 107 IC-FISTA 43 6.41 × 102 4.92 × 107 IC-ADMM 25 2.69 × 102 4.92 × 107 MCP 49 5.74 × 102 3.78 × 107 Table 3.1: Comparison of six algorithms. IC-ADMM is the fastest, which is the same as synthetic data. The function value for MCP is smaller because of a different model. faster than non-IC algorithms. It also shows that the non-convex MCP model has compa- rable performance as the convex one, while the algorithm for this nonconvex model is also fast. 
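For reference, the only algorithmic change when the L1 penalty is replaced by MCP is the entrywise thresholding used in the S-update. The sketch below gives the closed-form minimizer of mu * g_{lambda,b}(x) + (x - z)^2 / 2 for the MCP function in (3.10), assuming b > mu so that the scalar subproblem is strongly convex; the function name mcp_threshold is ours.

```python
import numpy as np

def mcp_threshold(Z, lam, b, mu):
    """Entrywise minimizer of mu * g_{lam,b}(x) + 0.5 * (x - z)^2, assuming b > mu."""
    absZ = np.abs(Z)
    out = np.zeros_like(Z, dtype=float)
    # Region 1: |z| <= mu*lam        -> 0 (already set).
    # Region 2: mu*lam < |z| <= b*lam -> soft-thresholded value, rescaled by 1/(1 - mu/b).
    mid = (absZ > mu * lam) & (absZ <= b * lam)
    out[mid] = np.sign(Z[mid]) * (absZ[mid] - mu * lam) / (1.0 - mu / b)
    # Region 3: |z| > b*lam          -> no shrinkage (MCP is flat there).
    big = absZ > b * lam
    out[big] = Z[big]
    return out
```

As b goes to infinity, this operator reduces to ordinary soft-thresholding with threshold mu * lam, consistent with MCP reducing to the L1 penalty.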
46 20 20 40 40 60 60 80 80 Station Number Station Number 100 100 120 120 140 140 160 160 180 180 200 200 220 220 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Time(s) Time(s) (a) recovered horizontal signal by L1 (b) recovered horizontal signal by MCP 20 20 40 40 60 60 80 80 Station Number Station Number 100 100 120 120 140 140 160 160 180 180 200 200 220 220 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Time(s) Time(s) (c) recovered vertical signal by L1 (d) recovered vertical signal by MCP 20 20 40 40 60 60 80 80 Station Number Station Number 100 100 120 120 140 140 160 160 180 180 200 200 220 220 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Time(s) Time(s) (e) recovered random noise by L1 (f) recovered random noise by MCP Figure 3.4: Recovered results of the real data with two models. 3.4 Conclusion In this section, we developed new seismic denoising algorithms based on RPCA. In particular, we applied infimal convolution to solve the convex and nonconvex optimization problems. Our technique not only allows a large stepsize, but also reduces the number of unknown variables to solve. All these characteristics of our new algorithms result in a significantly im- proved computational efficiency by comparing to other conventional algorithms such as PGM 47 and FISTA. We verified the performance of our algorithms using both synthetic reflection seismic data and field data. We observe at least a speed-up ratio of 4-10x over conventional algorithms with a comparable or better denoised results. 48 CHAPTER 4 MANIFOLD DENOISING BY NONLINEAR ROBUST PRINCIPAL COMPONENT ANALYSIS 4.1 Introduction Manifold and graph learning are nowadays widely used in computer vision, image processing, and biological data analysis on tasks such as classification, anomaly detection, data inter- polation, and denoising. In most applications, graphs are learned from high dimensional data, and successfully learned graphs allow traditional data analysis methods (PCA, Fourier analysis, clustering algorithm, neural networks) to be performed in conjunction with prior knowledge of the graph connectivity (Hammond et al., 2011; Jianbo Shi and Malik, 2000; Jiang et al., 2013; Meila and Shi, 2001). However, the quality of the learned manifold or graph may be greatly jeopardized by outliers, in ways that affect the stability of various manifold learning methods. In recent years, several methods have been proposed to handle outliers in nonlinear data (Li et al., 2009; Zhigang Tang et al., 2010; Du et al., 2013). Despite the success of those methods, they only aim at finding the outliers instead of correcting them. In addition, few theoretical results characterize their statistical performances. In this paper, we propose a novel non task-driven algorithm for the mixed noise model in (4.1) and provide theoretical guarantees to control its estimation error. Specifically, we consider the mixed noise model as X̃i = Xi + Si + i , i = 1, . . . , n, (4.1) where Xi ∈ Rp is the noiseless data independently drawn from some manifold M with an intrinsic dimension d < p, i is the i.i.d. Gaussian noise with small magnitudes, and Si is the sparse noise with possible large magnitudes. If Si has a large entry, then the corresponding X̃i is usually considered as an outlier. The goal of this chapter is to simultaneously recover 49 Xi and Si from X̃i , i = 1, .., n. 
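As an illustration, the mixed noise model (4.1) can be instantiated as follows; this sketch mirrors the Swiss-roll simulation used later in Section 4.5, and the function name sample_mixed_noise_swissroll and the random seed are ours.

```python
import numpy as np

def sample_mixed_noise_swissroll(n=2000, n_sparse=100, seed=0):
    """Draw X~ = X + S + N as in (4.1): clean Swiss-roll points, plus a sparse
    noise matrix with large entries, plus small i.i.d. Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Clean data X on a 2-dimensional manifold embedded in R^3 (Swiss roll).
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n))
    h = 21.0 * rng.random(n)
    X = np.vstack([t * np.cos(t), h, t * np.sin(t)])       # p x n with p = 3
    p = X.shape[0]
    # Small Gaussian noise with variance 0.25 on every entry.
    N = 0.5 * rng.standard_normal((p, n))
    # Sparse noise: n_sparse random entries set to (-1)^y * z with z ~ N(2, 0.09).
    S = np.zeros((p, n))
    idx = rng.choice(p * n, size=n_sparse, replace=False)
    signs = rng.choice([-1.0, 1.0], size=n_sparse)
    S.flat[idx] = signs * rng.normal(2.0, 0.3, size=n_sparse)
    return X + S + N, X, S
```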
There are several benefits in recovering the noise term Si along with the signal Xi . First, the support of Si indicates the locations of the anomaly, which is informative in many applications. For example, if Xi is the gene expression data from the ith patient, the nonzero elements in Si indicate the differentially expressed genes that are the candidates for personalized medicine. Similarly, if Si is a result of malfunctioned hardware, its nonzero elements indicate the locations of the malfunction parts. Secondly, the recovery of Si allows the “outliers” to be pulled back to the data manifold instead of simply being discarded. This prevents waste of information and is especially beneficial in cases where data is insufficient. Thirdly, in some applications, the sparse Si is part of the actual data rather than a noise term, then the algorithm provides a natural decomposition of the data into a sparse and a non-sparse component that may carry different pieces of information. Along a similar line of research, robust principle component analysis (RPCA) (Candes et al., 2011) has received considerable attention and has demonstrated its success in separat- ing data from sparse noise in many applications. However, its assumption that the data lies in a low dimensional subspace is somewhat strict. In this chapter, we generalize the Robust PCA idea to the non-linear manifold setting. The major new components in our algorithm are: 1) an incorporation of the curvature information of the manifold into the optimization framework, and 2) a unified way to apply RPCA to a collection of tangent spaces of the manifold. 4.2 Methodology Let X̃ = [X̃1 , . . . , X̃n ] ∈ Rp×n be the noisy data matrix containing n samples. Each sample is a vector in Rp independently drawn based on (4.1). The overall data matrix X̃ has the representation X̃ = X + S + N 50 where X is the clean data matrix, S is the matrix of the sparse noise, and N is the Gaussian noise. We further assume that the clean data Xi lies on some manifold M embedded in Rp with a small intrinsic dimension d  p and the sample size is sufficient (n ≥ p). The small intrinsic dimension assumption ensures that data is locally low dimensional so that the corresponding local data matrix is of low rank. This property allows the data to be separated from the sparse noise. The key idea behind our method is to handle the data locally. We use the k Nearest Neighbors (kNN) to construct local data matrices, where k is larger than the intrinsic di- mension d. For a data point Xi ∈ Rp , we define the local patch centered at it to be the set consisted of its kNN and itself, and a local data matrix X (i) associated with this patch is X (i) = [Xi1 , Xi2 , . . . , Xik , Xi ], where Xij is the jth-nearest neighbor of Xi . Let Pi be the restriction operator to the ith patch, i.e., Pi (X) = XPi where Pi is the n × (k + 1) matrix that selects the columns of X in the ith patch. Then X (i) = Pi (X). Similarly, we define S (i) = Pi (S), N (i) = Pi (N ) and X̃ (i) = Pi (X̃). 
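A minimal sketch of constructing the local patches is given below. Each patch collects the indices of the kNN of a point followed by the point itself, so that slicing the data matrix with these indices plays the role of the restriction operator P_i; the function name knn_patches is ours.

```python
import numpy as np

def knn_patches(X_tilde, k):
    """For every column i of the p x n data matrix, return the index set of its
    patch: the k nearest neighbors (Euclidean distance) followed by i itself."""
    n = X_tilde.shape[1]
    # Pairwise squared distances between columns.
    sq = np.sum(X_tilde**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X_tilde.T @ X_tilde)
    patches = []
    for i in range(n):
        order = np.argsort(D2[i])                     # nearest columns first
        neighbors = [j for j in order if j != i][:k]  # the k nearest neighbors of i
        patches.append(np.array(neighbors + [i]))
    return patches

# Local data matrix X^(i) = P_i(X~):  X_tilde[:, patches[i]]  is p x (k+1).
```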
Since each local data matrix X (i) is of low rank and S is sparse, we can decompose the noisy data matrix into low-rank parts and sparse parts through solving the following optimization problem (Ŝ, {L̂(i) }ni=1 ) = arg min F (S, {L(i) }ni=1 ) S,L(i) Xn λi kX̃ (i) − L(i) − S (i) k2F + kC(L(i) )k∗ + βkS (i) k1 (4.2)  ≡ arg min S,S (i) ,L(i) i=1 subject to S (i) = Pi (S), here we take β = max{k, p}−1/2 as in RPCA, X̃ (i) = Pi (X̃) is the local data matrix on the ith patch and C is the centering operator that subtract the column mean: C(Z) = Z(I − k1 11T ), where 1 is the (k + 1)-dimensional column vector of all ones. Here we are decomposing the data on each patch into a low-rank part L(i) and a sparse part S (i) by imposing the nuclear norm and entry-wise `1 norm on L(i) and S (i) , respectively. There are two key components in this formulation: First, the patches have overlapping components (for example, X1 may 51 belong to several patches). Thus, the constraint S (i) = Pi (S) is particularly important because it ensures the same point (and the sparse noise on that point) belonging to different patches eventually has all its copies coincide with each other. Secondly, we do not have such a requirement on L(i) because the L(i) s correspond to local tangent spaces, which will be explained in the next section. Although some of the tangent spaces may be close, there is no reason for a point on the manifold to have the same projection onto two different tangent spaces. This seemingly subtle difference has a large impact on the final result. If the data has no Gaussian noise, i.e., N = 0, then X̂ ≡ X̃ − Ŝ is the final estimation for X. If N 6= 0, we can no longer only remove the sparse noise from X̃ and use X̃ − Ŝ to approximate the clean data. Instead, we use the supposedly cleaner (See §3) tangent spaces L̂(i) to construct a final estimate X̂ of X via fitting it to L̂(i) Xn X̂ = arg min λi kPi (Z) − L̂(i) k2F . (4.3) Z∈Rp×n i=1 The following discussion revolves around (4.2) and (4.3), and the structure of the chapter is as follows. In §4.3, we explain the geometric meaning of each term in (4.2). The choice of λ requires the information of the curvature of the manifold. Optimization algorithm is presented is §4.4 and numerical experiments are in §4.5. 4.3 Geometric explanation We provide a geometric intuition for the formulation (4.2). Let us write the local clean data matrix X (i) into its Taylor expansion along the manifold, X (i) = Xi 1T + T (i) + R(i) , (4.4) where the Taylor series is expanded at Xi (the point around which the ith patch is con- structed), T (i) stores the first order term whose columns lie in the tangent space of the manifold at Xi , and R(i) contains all the higher order terms. The sum of the first two terms Xi 1T + T (i) is the linear approximation to X (i) that is unknown if the tangent space is not given. This linear approximation precisely corresponds to the L(i) s in (4.2), i.e., 52 L(i) = Xi 1T + T (i) . Since the tangent space has the same dimensionality d as the manifold, with randomly chosen points, we have with probability one, that rank(T (i) ) = d. As a result, rank(L(i) ) = rank(Xi 1T + T (i) ) ≤ d + 1. By the assumption that d < min{p, k}, we know that L(i) is indeed low rank. Combing (4.4) with X̃ (i) = X (i) + S (i) + N (i) , we find the misfit term X̃ (i) − L(i) − S (i) in (4.2) equals N (i) + R(i) . This implies that the misfit contains the high order residue (i.e., the linear approximation error) and the Gaussian noise. 
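Before turning to the optimization algorithm, the following sketch evaluates the objective in (4.2) for a candidate sparse matrix S and candidate patch matrices L^(i). The centering operator C is implemented as column-mean subtraction within each patch, as described above, and the function names center and nrpca_objective are ours.

```python
import numpy as np

def center(Z):
    """Centering operator C: subtract the mean of the patch columns."""
    return Z - Z.mean(axis=1, keepdims=True)

def nrpca_objective(S, L_list, X_tilde, patches, lam_list, beta):
    """Objective F(S, {L^(i)}) in (4.2); the constraint S^(i) = P_i(S) is
    enforced by slicing the global S with each patch index set."""
    total = 0.0
    for i, patch in enumerate(patches):
        X_loc = X_tilde[:, patch]                 # X~^(i) = P_i(X~)
        S_loc = S[:, patch]                       # S^(i)  = P_i(S)
        L_loc = L_list[i]                         # candidate low-rank part, p x (k+1)
        misfit = X_loc - L_loc - S_loc
        total += (lam_list[i] * np.sum(misfit**2)
                  + np.linalg.norm(center(L_loc), ord='nuc')
                  + beta * np.sum(np.abs(S_loc)))
    return total
```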
4.4 Optimization algorithm To solve the convex optimization problem (4.2) in a memory-economic way, we first write L(i) as a function of S and eliminate them from the problem. We can do so by fixing S and minimizing the objective function with respect to L(i) L̂(i) = arg min λi kX̃ (i) − L(i) − S (i) k2F + kC(L(i) )k∗ L(i) = arg min λi kC(L(i) ) − C(X̃ (i) − S (i) )k2F + kC(L(i) )k∗ (4.5) L(i) + λi k(I − C)(L(i) − (X̃ (i) − S (i) ))k2F . Notice that L(i) can be decomposed as L(i) = C(L(i) ) + (I − C)(L(i) ), set A = C(L(i) ), B = (I − C)(L(i) ), then (4.5) is equivalent to (Â, B̂) = arg min λi kA − C(X̃ (i) − S (i) )k2F + kAk∗ + λi kB − (I − C)(X̃ ∗(i) − S (i) ))k2F , A,B which decouples into  = arg min λi kA − C(X̃ (i) − S (i) )k2F + kAk∗ , A B̂ = arg min λi kB − (I − C)(X̃ (i) − S (i) )k2F . B The problems above have closed form solutions  = T1/2λi (C(X̃ (i) − Pi (S))), B̂ = (I − C)(X̃ (i) − Pi (S)), (4.6) 53 where Tµ is the soft-thresholding operator on the singular values Tµ (Z) = U max{Σ − µI, 0}V ∗ , where U ΣV ∗ is the SVD of Z. Combing  and B̂, we have derived the closed form solution for L̂(i) L̂(i) (S) = T1/2λi (C(X̃ (i) − Pi (S))) + (I − C)(X̃ (i) − Pi (S)). (4.7) Plugging (4.7) into F in (4.2), the resulting optimization problem solely depends on S. Then we apply FISTA (Beck and Teboulle, 2009c; Sha et al., 2019) to find the optimal solution Ŝ with Ŝ = arg min F (L̂(i) (S), S). (4.8) S Once Ŝ is found, if the data has no Gaussian noise, then the final estimation for X is (i) X̂ ≡ X̃ − Ŝ; if there is Gaussian noise, we use the following denoised local patches L̂τ ∗ (i) L̂τ ∗ = Hτ ∗ (C(X̃ (i) − Pi (Ŝ))) + (I − C)(X̃ (i) − Pi (Ŝ)), (4.9) where Hτ ∗ is the singular value hard thresholding Operator with the optimal threshold as defined in (Gavish and Donoho, 2014). This optimal thresholding removes the Gaussian (i) (i) noise from L̂τ ∗ . With the denoised L̂τ ∗ , we solve (4.3) to obtain the denoised data n ! n X (i) X X̂ = λi L̂τ ∗ PiT ( λi Pi PiT )−1 . (4.10) i=1 i=1 The proposed nonlinear robust principle component analysis (NRPCA) algorithm is sum- marized in Algorithm 4.1. There is one caveat in solving (4.2): the strong sparse noise may result in a wrong neighborhood assignment when constructing the local patches. Therefore, once Ŝ is obtained and removed from the data, we update the neighborhood assignment and re-compute Ŝ. This procedure is repeated T times. 4.5 Numerical experiments We evaluate the performance of the proposed algorithm on simulated and real-world data sets. 54 Algorithm 4.1: Nonlinear RPCA Input: Noisy data matrix X̃, k (number of neighbors in each local patch), T (number of neighborhood updates iterations) Output: the denoised data X̂, the estimated sparse noise Ŝ 1 Estimate the curvature; 2 Estimate λi , i = 1, . . . , n as in §5, set β as in (4.2); 3 Ŝ ← 0; 4 for iter = 1: T do 5 Find the kNN for each point using X̃ − Ŝ and construct the restriction operators {Pi }ni=1 ; 6 Construct the local data matrices X̃ (i) = Pi (X̃) using Pi and the noisy data X̃; 7 Ŝ ← minimizer of (4.8) iteratively using FISTA; 8 end (i) 9 Compute each L̂τ ∗ from (4.9) and assign X̂ from (4.10). Simulated Swiss roll: We demonstrate the superior performance of NRPCA on a synthetically generated dataset following the mixed noise model (4.1). We sampled 2000 noiseless data Xi uniformly from a 3D Swiss roll and generated the Gaussian noise matrix with i.i.d. entries obey N (0, 0.25). 
The sparse noise matrix S is generated by randomly replacing 100 entries of a zero p × n matrix with i.i.d. samples generated from (−1)y · z where y ∼ Bernoulli(0.5) and z ∼ N (2, 0.09). We applied NRPCA to the simulated data with patch size k = 16. Figure 4.1 reports the denoising effect in the original space (3D) looking down from above. We observed a visible reduction of the noise. A similar experiment on the high dimensional Swiss roll is in the appendix, where the differences between X̃ − Ŝ and X̂ are much more apparent. MNIST: We observe some interesting dimension reduction results of the MNIST dataset with the help of NRPCA. It is well-known that the handwritten digits 4 and 9 have so high a similarity that the popular dimension reduction methods Isomap and Laplacian Eigenmaps are not able to separate them into two clusters (first column of Figure 4.2). We conjecture that the overlapping parts are caused by personalized writing styles with different beginning or finishing strokes. This type of differences can be better modelled by sparse noise than Gaussian or Poisson noises. The right column of Figure 4.2 confirms this conjecture: after 55 Figure 4.1: NRPCA applied to the noisy Swiss roll data set. X̃ − Ŝ is the result after subtracting the estimated sparse noise via NRPCA with T = 1; “ X̃ − Ŝ with one neighbor update” is that with T = 2, i.e., patches are reassigned once; X̂ is the denoised data obtained via NRPCA with T = 2; “Patch-wise Robust PCA” refers to the ad-hoc application of the vanilla RPCA to each local patch independently, whose performance is clearly worse than the proposed joint-recovery formulation. the NRPCA denoising (with k = 16), we see a much better separability of the two digits using the first two coordinates of Isomap and Laplacian Eigenmaps. In addition, these new embedding results seem to suggest that some trajectory patterns may exist in the data. We provide additional plots in the appendix to support this observation. 4.6 Conclusion In this chapter, we proposed the first outlier correction method for nonlinear data analysis that can correct outliers caused by the addition of large sparse noise. The method is a generalization of the Robust PCA method to the nonlinear setting. We provided procedures to treat the non-linearity by working with overlapping local patches of the data manifold and incorporating the curvature information into the denoising algorithm. We demonstrated that the method works equally well when Gaussian noises are present in the data in addition 56 to the sparse noise. We established a theoretical error bound on the denoised data that holds under conditions only depending on the intrinsic properties of the manifold. We tested our method on both synthetic and real dataset that were known to have nonlinear structures and reported promising results. Figure 4.2: Laplacian eigenmaps and Isomap results for the original and the NRPCA de- noised digits 4 and 9 from the MNIST dataset. 57 CHAPTER 5 ONLINE MATRIX COMPLETION WITH QUATERNION MATRIX 5.1 Introduction Matrix completion problems aim to recover an unknown matrix given that the matrix has a low rank. It has been widely studied and has many applications especially in computer vision (Cabral et al., 2011, 2014), video processing (Ji et al., 2010; Kim et al., 2015), bioin- formatics (Li et al., 2017; Lu et al., 2018), etc. There are both convex and non-convex algorithms for solving it. Convex algorithms tend to simplify the problem and use convex penalties (Recht et al., 2010). 
While it is easier to find an optimal solution, it may cost a lot of computational time. Nonconvex algorithms tend to use matrix factorization. Even if they need less computational time, they may not easily converge to the global optimal solution because of the saddle points and local optimal solution. In this case, a good initialization is very important. Traditional methods for matrix completion have already achieved good performance when dealing with greyscale images. For each image, we can easily view it as a matrix. As for a video, we can convert each frame as a vector in a matrix. However, the problem arises when applied to color images. Color images have three channels (Red, Green, and Blue) that have a mutual connection. Intuitively, traditional matrix-based methods that treat each channel separately do not work well. Some tensor-based methods have been developed to solve this multidimensional data problem. In this case, it is converted as a low-rank tensor approximation problem where three channels are considered as a third-order tensor. There are many well-known models like ANDECOMP/PARAFAC (CP) and Tucker (Zhou et al., 2019; Rauhut et al., 2017). Similar to matrix SVD, they describe the tensor as the sum of the outer products of vectors. In a recent paper (Kilmer and Martin, 2011), it proposed t-SVD, which expresses the tensor 58 as the sum of outer products of matrices. In this chapter, we consider an alternating way and apply quaternion matrices to this problem. A quaternion matrix has four parts: one real part and three imaginary parts representing the three channels. Quaternion matrices have been applied to other models such as Deep Neural Network (Liu et al., 2019; Zhu et al., 2018; Gaudet and Maida, 2018). They also have been applied on many areas like Natural Language Processing (Parcollet et al., 2018; Tay et al., 2019) and image processing (Wang et al., 2018; Ye et al., 2020). Also, there is an increasing trend today that we need to deal with large-scale data. The traditional offline matrix completion tends to give good performance. Offline matrix completion model needs to collect all observation data and gets recovery result at once. However, it is not practical on some applications, such as web data analysis. For example, in Figure 5.1, a movie ranking system can be seen as a matrix. Each column is a movie that needs to be rated, and each column corresponds to a user. We need to offer up-to-date recommendations to users. Moreover, the estimate should be better if we continuously get observations. Figure 5.1: A movie rating system. For a given d1 × d2 low-rank matrix with missing entries, it can be factorized by a d1 × k user matrix and a k × d2 item matrix where k is the rank of the original matrix. This chapter combines quaternion matrices and online matrix completion. We will de- velop an online low-rank quaternion matrix completion model that can be widely used in many cases. 59 5.2 Introduction on Quaternion Matrices We introduce quaternion numbers/matrices and their properties in this section. 5.2.1 Quaternion Numbers A quaternion number q ∈ Qn is defined as q = qr + qi i + qj j + qk k, (5.1) where qr , qi , qj , qk ∈ R and i, j, k are three imaginary units satisfying i2 = j2 = k2 = −1, ij = −ji = k, jk = −kj = i, ki = −ik = j. (5.2) The conjugate and modulus of q are respectively defined by q q∗ = qr − qi i − qj j − qk k and |q| = qr2 + qi2 + qj2 + qk2 . (5.3) Let x = [xi ] ∈ Qn be a quaternion vector. 
We define the following three norms for the quaternion vector x: 1-norm kxk1 := ni=1 |xi |, 2-norm kxk2 := i=1 |xi | , and ∞-norm P pPn 2 kxk∞ := max1≤i≤n |xi |. Let A = [aij ] ∈ Qn1 ×n2 be a quaternion matrix. We define the following norms: 1- qP P norm kAk1 = ni=1 Forbinoes norm n1 n2 Tr(A∗ A), P 1 Pn2 2 = p j=1 |aij |, kAk F = i=1 j=1 |a ij | where A∗ is the conjugate transpose (or Hermitian transpose) of A, and ∞-norm kAk∞ = max1≤i≤n1 ,1≤j≤n2 |aij |. The rank of matrix A is the number of independent rows/columns in A. A square quaternion matrix is unitary if A∗ A = AA∗ = I. An Hermitian quaternion matrix satisfies A∗ = A, which is an extension of symmetric matrices. A color image with R (red), G (green), B (blue) channels can be represented by a quater- nion matrix without the real part. That is, Aij = Rij i + Gij j + Bij k, (5.4) where Rij , Gij , and Bij are the corresponding R, G, and B channels. 60 5.2.2 Basic Properties Some properties of quaternion matrices are the same as real matrices. For example, • The definition of the inner product hA, Bi = TrhA∗ Bi; • Spectral norm and the Frobenius norm kA∗ k = kAk, kA∗ kF = kAkF ; • kABkF ≤ kAkkBkF ; • kABk ≤ kAkkBk; • kAk ≤ kAkF . • Cauchy-Schwarz inequality Re(tr(A∗ B)) ≤ kAkF kBkF . However, we need to be careful about one thing. For real matrices, the matrices in a trace of a product can be switched without changing the result. However for quarternion matrices A and B, we can only have Re(tr(AB)) = Re(tr(BA)). (5.5) For example, given two quaternion matrices         1, 1 1, 2 1, 2 1, 1 A =   +   i +   j +   k, 1, 0 2, 1 2, 1 1, 0 and         1, 2 1, 1 1, 1 1, 2 B =   +   i +   j +   k, 2, 1 1, 0 1, 0 2, 1 we have tr(AB) = −10 + 20i + 6j + 10k, and tr(BA) = −10 + 6i + 20j + 10k. 61 In this case, tr(AB) 6= tr(BA). The reason is that the product of two quaternion numbers may not be the same if the order is changed. For example, let a = 1 + 1i + 2j + 3k and b = 3 + 2i + 2j + 1k, then we have a ∗ b = −6 + 1i + 13j + 8k, b ∗ a = −6 + 9i + 3j + 12k. This difference may cause some difficulties for the theoretical proof, which will be explained in the following sections. 5.2.3 Singular Value Decomposition According to the work by Zhang (1997), we can define singular value decomposition for quaternion matrices. For any quaternion matrix L ∈ Qd1 ×d2 with rank k, there exists an unitary quaternion matrix U = [u1 , . . . , ud1 ] ∈ Qd1 ×d1 and another unitary quaternion matrix V = [v1 , . . . , vd2 ] ∈ Qd2 ×d2 such that L = UΣk V∗ (5.6) where Σk ∈ Rd1 ×d2 consists of all singular values of L, σ1 ≥ σ2 ≥ · · · ≥ σk > 0, on its diagonal entries. Then the spectral norm kLk := max{σ1 , . . . , σk }. The condition number κ max{σ1 ,...,σk } is defined as κ = min{σ1 ,...,σk } . What’s more, for any Hermitian quaternion matrix L ∈ Qd×d with rank k, there exists an unitary quaternion matrix U = [u1 , . . . , ud ], with ui being its i-th column, such that L = UΣk U∗ (5.7) where Σk = diag(σ1 , . . . , σk , 0, . . . , 0) ∈ Rd×d consists of all singular values of L, and σ1 ≥ σ2 ≥ · · · ≥ σk > 0. 62 5.2.4 Incoherence Condition Let W ∈ Qd×k be an orthonormal basis of a subspace of Rd of dimension k, then the projection to the subspace is PW = WW∗ . We define the coherence of W as d d µ(W) = max kPW ei k2 = max ke∗i Wk2 , (5.8) k 1≤i≤d k 1≤i≤d where ei is the vector with the i-th component being 1 and others being 0. Definition 5.2.1. 
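Because the non-commutativity of the quaternion product is exactly what breaks the identity tr(AB) = tr(BA), the scalar example in Section 5.2.2 can be checked with a few lines of code; quaternions are stored as (real, i, j, k) arrays and the function name qmul is ours.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions (real, i, j, k),
    using i^2 = j^2 = k^2 = -1, ij = k, jk = i, ki = j."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return np.array([
        a0*b0 - a1*b1 - a2*b2 - a3*b3,   # real part
        a0*b1 + a1*b0 + a2*b3 - a3*b2,   # i part
        a0*b2 - a1*b3 + a2*b0 + a3*b1,   # j part
        a0*b3 + a1*b2 - a2*b1 + a3*b0,   # k part
    ])

a = np.array([1.0, 1.0, 2.0, 3.0])   # 1 + 1i + 2j + 3k
b = np.array([3.0, 2.0, 2.0, 1.0])   # 3 + 2i + 2j + 1k
print(qmul(a, b))   # [-6.  1. 13.  8.]  ->  -6 + 1i + 13j + 8k
print(qmul(b, a))   # [-6.  9.  3. 12.]  ->  -6 + 9i + 3j + 12k
```

The two products reproduce the values of a*b and b*a given in Section 5.2.2, confirming that only the real parts (and hence only Re(tr(AB)) and Re(tr(BA))) agree.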
We assume M is µ-incoherent, i.e., µk µk max kX∗ ei k2 ≤ , max kY∗ ei k2 ≤ (5.9) i d1 i d2 and r µk ∗ kXY k∞ ≤ , (5.10) d1 d2 where X ∈ Qd1 ×k , Y ∈ Qd2 ×k are the left and right singular vectors of M. 5.2.5 Sampling Scheme We consider the Bernoulli model for uniform sampling. Let Ω ⊂ [d1 ] × [d2 ]. Given a matrix M, we define the matrix PΩ as  if (i, j) ∈ Ω,  Mij  [PΩ (M)]ij = if (i, j) ∈  0  / Ω. Every time (i, j) is uniformly sampled from Ω. Our goal is to recover M given PΩ (M). 5.3 Online Matrix Completion Algorithms and its Theoretical Anal- ysis We first consider a Hermitian quaternion matrix M ∈ Qd×d , i.e., there exists a quaternion matrix U such that M = UU∗ . Also, we write U with its columns as U = [u1 , u2 , . . . , ud ]. 63 Moreover, we define f (U) as f (U) = kM − UU∗ k2F = hM, −UU∗ i + hM, Mi + h−UU∗ , Mi + hUU∗ , UU∗ i. The stochastic gradient of f (U) given the (i, j) component is SG(U) = 2d2 [(UU∗ − M)ij ei e∗j + (UU∗ − M)ji ej e∗i ]U. (5.11) Note that (UU∗ − M)∗ij = (UU∗ − M)ji because both M and UU∗ are Hermitian. The expectation of SG(U) is d,d X 1 2 ESG(U) = 2 2d ((UU∗ − M)ij ei e∗j + (UU∗ − M)ji ej e∗i )U i=1,j=1 d = 4(UU∗ − M)U. Let ∇f (U) := 4(UU∗ − M)U, which is one descent direction of f (U). We assume kMk = 1. Then κ = 1 σmin (M) , namely, σmin (M) = κ1 . We also denote SVD(M) = XSX∗ , SVD(UU∗ ) = WDW∗ . Here X ∈ Qd×k with k orthogonal columns and S ∈ Rk×k is a diagonal square matrix. First, we prepare with a few lemmas about properties of f (U) in a local Frobenious ball around optimal. Lemma 5.3.1. Within the region D = {U|kUk ≤ Γ}, we have the function f (U) satisfying for any U1 , U2 ∈ D: k∇f (U1 ) − ∇f (U2 )kF ≤ βkU1 − U2 kF , (5.12) with β = 16 max{kU1 k2 , kU2 k2 , 1}. 64 Proof. From the definition of ∇f (U), we have k∇f (U1 ) − ∇f (U2 )kF =k4(U1 U∗1 − M)U1 − 4(U2 U∗2 − M)U2 kF ≤4kU1 U∗1 U1 − U2 U∗2 U2 kF + 4kM(U1 − U2 )kF =4kU1 U∗1 (U1 − U2 ) + U1 (U1 − U2 )∗ U2 + (U1 − U2 )U∗2 U2 kF + 4kM(U1 − U2 )kF ≤4kU1 k2 kU1 − U2 kF + 4kU1 kkU2 kkU1 − U2 kF + 4kU2 k2 kU1 − U2 kF + 4kMkkU1 − U2 kF ≤12 max{kU1 k2 , kU2 k2 }kU1 − U2 kF + 4kMkkU1 − U2 kF ≤16 max{kU1 k2 , kU2 k2 , 1}kU1 − U2 kF . We have completed the proof. Lemma 5.3.2. Within the region D = {U|σmin (X∗ U) ≥ γ}, the function f (U) = kM − UU∗ k2F satisfies k∇f (U)k2F ≥ 4γ 2 f (U). (5.13) Proof. Inside the region D, we let SVD(UU∗ ) = WDW∗ , thus we have k∇f (U)k2F =16k(UU∗ − M)Uk2F =16(kPW (UU∗ − M)Uk2F + kPW⊥ (UU∗ − M)Uk2F ) =16 tr(PW (UU∗ − M)UU∗ (UU∗ − M)PW ) + kPW⊥ MUk2F  =16 tr(PW (UU∗ − M)WDW∗ (UU∗ − M)PW ) + kPW⊥ MUk2F  ≥16(σmin(D) kPW (UU∗ − M)PW k2F + kPW⊥ MUk2F ) =16(σmin(D) kUU∗ − PW MPW k2F + kPW⊥ MUk2F ). 65 The inequality holds because tr(PW (UU∗ − M)WDW∗ (UU∗ − M)PW ) ≥σmin(D) tr(PW (UU∗ − M)WW∗ (UU∗ − M)PW ) =σmin(D) tr(PW (UU∗ − M)WW∗ WW∗ (UU∗ − M)PW ) The last equality is true because PW UU∗ PW = WW∗ WDW∗ WW∗ = WDW∗ = UU∗ . On the other hand, we have kPW⊥ MUk2F = kPW⊥ XSX∗ Uk2F ≥ σmin 2 (X∗ U)kPW⊥ XSk2F 2 =σmin (X∗ U)tr(PW⊥ M2 PW⊥ ) = σmin 2 (X∗ U)kPW⊥ Mk2F , and σmin (D) = λmin (UU∗ ) = λmin (U∗ U) ≥λmin (U∗ PX U) = σmin2 (X∗ U). Finally, we can get k∇f (U)k2F ≥16(σmin (D)kUU∗ − PW MPW k2F + σmin (X∗ U)kPW⊥ Mk2F ) 2 ≥16σmin (X∗ U)(kUU∗ − PW MPW k2F + kPW⊥ Mk2F )) 2 =16σmin (X∗ U)(kUU∗ − PW MPW k2F + kPW⊥ M(PW + PW⊥ )k2F )) 2 =16σmin (X∗ U)(kUU∗ − PW MPW k2F + kPW⊥ MPW k2F + kPW⊥ MPW⊥ k2F ) 2 ≥4σmin (X∗ U)(kUU∗ − PW MPW k2F + kPW⊥ MPW k2F + kPW MPW⊥ k2F + kPW⊥ MPW⊥ k2F ) 2 =4σmin (X∗ U)kUU∗ − Mk2F ≥4γ 2 kUU∗ − Mk2F . 
The third inequality holds because we have
\[
\|\mathbf{P}_{\mathbf{W}^\perp}\mathbf{M}\mathbf{P}_{\mathbf{W}}\|_F = \|\mathbf{P}_{\mathbf{W}}\mathbf{M}\mathbf{P}_{\mathbf{W}^\perp}\|_F. \tag{5.14}
\]
The third equality holds because the inner product between each pair of $\mathbf{U}\mathbf{U}^* - \mathbf{P}_{\mathbf{W}}\mathbf{M}\mathbf{P}_{\mathbf{W}}$, $\mathbf{P}_{\mathbf{W}}\mathbf{M}\mathbf{P}_{\mathbf{W}^\perp}$, $\mathbf{P}_{\mathbf{W}^\perp}\mathbf{M}\mathbf{P}_{\mathbf{W}}$, and $\mathbf{P}_{\mathbf{W}^\perp}\mathbf{M}\mathbf{P}_{\mathbf{W}^\perp}$ is 0.

Lemma 5.3.3. Within the region $\mathcal{D} = \{\mathbf{U} \,|\, \|\mathbf{M} - \mathbf{U}\mathbf{U}^*\|_F \le \frac{1}{10}\sigma_k(\mathbf{M})\}$, we have
\[
\|\mathbf{U}\| \le \sqrt{2\|\mathbf{M}\|}, \qquad \sigma_{\min}(\mathbf{X}^*\mathbf{U}) \ge \sqrt{\sigma_k(\mathbf{M})/2}. \tag{5.15}
\]

Proof. From the definition of the spectral norm, we have
\[
\|\mathbf{U}\|^2 = \|\mathbf{U}\mathbf{U}^*\| \le \|\mathbf{M}\| + \|\mathbf{M} - \mathbf{U}\mathbf{U}^*\| \le \|\mathbf{M}\| + \|\mathbf{M} - \mathbf{U}\mathbf{U}^*\|_F \le 2\|\mathbf{M}\|.
\]
We have the following lower bound for the smallest nonzero singular value of $\mathbf{U}^*\mathbf{U}$:
\[
\sigma_{\min}(\mathbf{U}^*\mathbf{U}) = \sigma_k(\mathbf{U}\mathbf{U}^*) = \sigma_k(\mathbf{M} - (\mathbf{M} - \mathbf{U}\mathbf{U}^*)) \ge \sigma_k(\mathbf{M}) - \|\mathbf{M} - \mathbf{U}\mathbf{U}^*\| \ge \frac{9}{10}\sigma_k(\mathbf{M}).
\]
The first inequality holds because for all $i, j \in \mathbb{N}$, we have $\sigma_i(\mathbf{A}) \ge \sigma_{i+j-1}(\mathbf{A} + \mathbf{B}) - \sigma_j(\mathbf{B})$. On the other hand, we denote the orthogonal complement of $\mathbf{X}$ by $\mathbf{X}_\perp$. Then we have
\begin{align*}
\frac{9}{10}\sigma_k(\mathbf{M})\|\mathbf{X}_\perp^*\mathbf{W}\|^2
&\le \sigma_{\min}(\mathbf{U}^*\mathbf{U})\|\mathbf{X}_\perp^*\mathbf{W}\|^2 \le \sigma_{\min}(\mathbf{D})\|\mathbf{X}_\perp^*\mathbf{W}\|^2 \le \|\mathbf{X}_\perp^*\mathbf{W}\mathbf{D}\mathbf{W}^*\mathbf{X}_\perp\| \le \|\mathbf{X}_\perp^*\mathbf{U}\mathbf{U}^*\mathbf{X}_\perp\|_F \\
&= \|\mathbf{P}_{\mathbf{X}_\perp}(\mathbf{M} - \mathbf{U}\mathbf{U}^*)\mathbf{P}_{\mathbf{X}_\perp}\|_F \le \|\mathbf{M} - \mathbf{U}\mathbf{U}^*\|_F \le \frac{1}{10}\sigma_k(\mathbf{M}).
\end{align*}
The last equality is true because
\[
\mathrm{tr}(\mathbf{X}_\perp^*\mathbf{U}\mathbf{U}^*\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{U}\mathbf{U}^*\mathbf{X}_\perp) = \mathrm{tr}(\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{U}\mathbf{U}^*\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{U}\mathbf{U}^*\mathbf{X}_\perp\mathbf{X}_\perp^*)
\]
and
\[
\mathrm{tr}(\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{M}\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{M}\mathbf{X}_\perp\mathbf{X}_\perp^*) = \mathrm{tr}(\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{X}\mathbf{S}\mathbf{X}^*\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{X}\mathbf{S}\mathbf{X}^*\mathbf{X}_\perp\mathbf{X}_\perp^*) = 0.
\]
Let the principal angle between $\mathbf{X}$ and $\mathbf{W}$ be $\theta$. According to the inequality above, $\sin^2\theta = \|\mathbf{X}_\perp^*\mathbf{W}\|^2 \le \frac{1}{9}$, so $\cos^2\theta = \sigma_{\min}^2(\mathbf{X}^*\mathbf{W}) \ge \frac{8}{9}$. We have
\[
\sigma_{\min}^2(\mathbf{X}^*\mathbf{U}) = \sigma_{\min}(\mathbf{X}^*\mathbf{U}\mathbf{U}^*\mathbf{X}) = \sigma_{\min}(\mathbf{X}^*\mathbf{W}\mathbf{D}\mathbf{W}^*\mathbf{X}) \ge \sigma_{\min}(\mathbf{D})\,\sigma_{\min}^2(\mathbf{X}^*\mathbf{W}) \ge \frac{9}{10}\sigma_k(\mathbf{M}) \times \frac{8}{9} \ge \sigma_k(\mathbf{M})/2.
\]
The lemma is proved.

Now we are well prepared for the main theorem.

Theorem 5.3.4. Let $f(\mathbf{U}) = \|\mathbf{U}\mathbf{U}^* - \mathbf{M}\|_F^2$ and $g_i(\mathbf{U}) = \|\mathbf{e}_i^*\mathbf{U}\|^2$. Suppose after initialization we have
\[
f(\mathbf{U}_0) \le \left(\frac{1}{20\kappa}\right)^2, \qquad \max_i g_i(\mathbf{U}_0) \le \frac{10\mu k\kappa^2}{d}.
\]
Then, there exists some absolute constant $c$ such that for any learning rate $\eta < \frac{c}{\mu d k \kappa^3 \log d}$, with probability at least $1 - \frac{T}{d^{10}}$, we have for all $t \le T$ that
\[
f(\mathbf{U}_t) \le \left(1 - \frac{\eta}{2\kappa}\right)^t\left(\frac{1}{10\kappa}\right)^2, \qquad \max_i g_i(\mathbf{U}_t) \le \frac{20\mu k\kappa^2}{d}.
\]

Proof. Let the filtration be $\mathcal{F}_t = \sigma\{\mathrm{SG}(\mathbf{U}_0), \mathrm{SG}(\mathbf{U}_1), \dots, \mathrm{SG}(\mathbf{U}_{t-1})\}$, i.e., an increasing sequence of $\sigma$-fields. We define the event
\[
\mathcal{E}_t = \left\{\forall \tau \le t,\ f(\mathbf{U}_\tau) \le \left(1 - \frac{\eta}{2\kappa}\right)^\tau\left(\frac{1}{10\kappa}\right)^2,\ \max_i g_i(\mathbf{U}_\tau) \le \frac{20\mu k\kappa^2}{d}\right\},
\]
and write $\mathbf{1}_t$ for its indicator $\mathbf{1}_{\mathcal{E}_t}$. We aim to prove that this event happens with high probability. Conditioned on $\mathcal{E}_t$, we have $\|\mathbf{U}_t\| \le \sqrt{2}$, $\sigma_{\min}(\mathbf{X}^*\mathbf{U}_t) \ge \frac{1}{\sqrt{2\kappa}}$, and $\sigma_{\min}(\mathbf{U}_t^*\mathbf{U}_t) \ge \frac{1}{2\kappa}$ based on Lemma 5.3.3.

Construction of the supermartingale $G$: Let $g_i(\mathbf{U}) = \mathbf{e}_i^*\mathbf{U}\mathbf{U}^*\mathbf{e}_i$. For any change $\Delta\mathbf{U}$, we have
\[
g_i(\mathbf{U} + \Delta\mathbf{U}) = \mathbf{e}_i^*(\mathbf{U} + \Delta\mathbf{U})(\mathbf{U} + \Delta\mathbf{U})^*\mathbf{e}_i = g_i(\mathbf{U}) + \mathbf{e}_i^*(\Delta\mathbf{U}\,\mathbf{U}^* + \mathbf{U}\,\Delta\mathbf{U}^*)\mathbf{e}_i + \|\mathbf{e}_i^*\Delta\mathbf{U}\|^2.
\]
For any $l \in [d]$:
\begin{align*}
\mathbb{E}\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U})\|^2\mathbf{1}_t
&= 4d^4\,\mathbb{E}\big\|\mathbf{e}_l^*\big[(\mathbf{U}\mathbf{U}^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^* + (\mathbf{U}\mathbf{U}^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\big]\mathbf{U}\big\|^2\mathbf{1}_t \\
&\le 16d^4\,\mathbb{E}\big\|\mathbf{e}_l^*(\mathbf{U}\mathbf{U}^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}\big\|^2\mathbf{1}_t \\
&\le 16d^4\,\mathbb{E}\,\delta_{il}\,|(\mathbf{U}\mathbf{U}^* - \mathbf{M})_{ij}|^2\max_i\|\mathbf{e}_i^*\mathbf{U}\|^2\mathbf{1}_t \\
&= 16d^4\,\frac{1}{d^2}\sum_{i,j}\delta_{il}\,|(\mathbf{U}\mathbf{U}^* - \mathbf{M})_{ij}|^2\max_i\|\mathbf{e}_i^*\mathbf{U}\|^2\mathbf{1}_t \\
&= 16d^2\,\|\mathbf{e}_l^*(\mathbf{U}\mathbf{U}^* - \mathbf{M})\|^2\max_i\|\mathbf{e}_i^*\mathbf{U}\|^2\mathbf{1}_t \le O(\mu^2k^2\kappa^4).
\end{align*}
The last inequality holds because we have
\[
\|\mathbf{e}_l^*(\mathbf{U}\mathbf{U}^* - \mathbf{M})\| \le \|\mathbf{e}_l^*\mathbf{U}\mathbf{U}^*\| + \|\mathbf{e}_l^*\mathbf{M}\| \le \|\mathbf{U}\|\,\|\mathbf{U}^*\mathbf{e}_l\| + \|\mathbf{e}_l^*\mathbf{X}\| \le \sqrt{2}\sqrt{\frac{20\mu k\kappa^2}{d}} + \sqrt{\frac{\mu k}{d}} \le O\!\left(\sqrt{\frac{\mu k\kappa^2}{d}}\right).
\]
On the other hand, we know
\begin{align*}
\mathbb{E}[g_i(\mathbf{U}_{t+1})\mathbf{1}_t\,|\,\mathcal{F}_t]
&= \mathbb{E}[g_i(\mathbf{U}_t - \eta\,\mathrm{SG}(\mathbf{U}_t))\mathbf{1}_t\,|\,\mathcal{F}_t] \\
&= \big[g_i(\mathbf{U}_t) - \eta\,\mathbf{e}_i^*\mathbb{E}\,\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\mathbf{e}_i - \eta\,\mathbf{e}_i^*\mathbf{U}_t\,\mathbb{E}\,\mathrm{SG}(\mathbf{U}_t)^*\mathbf{e}_i + \eta^2\,\mathbb{E}\,\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\mathbf{e}_i\big]\mathbf{1}_t \\
&= \big[g_i(\mathbf{U}_t) - 2\eta\,\mathrm{Re}(\mathbf{e}_i^*\mathbb{E}\,\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\mathbf{e}_i) + \eta^2\,\mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\mathbf{1}_t \\
&= \big[\mathbf{e}_i^*\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_i - 8\eta\,\mathrm{Re}(\mathbf{e}_i^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_i) + \eta^2\,\mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\mathbf{1}_t \\
&= \big[\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{U}_t) - 8\eta\,\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{U}_t)) + \eta^2\,\mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\mathbf{1}_t \\
&= \big[\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*(\mathbf{I} - 8\eta(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}))\mathbf{U}_t)) + \eta^2\,\mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\mathbf{1}_t \\
&= \big[\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{U}_t(\mathbf{I} - 8\eta\,\mathbf{U}_t^*\mathbf{U}_t))) + 8\eta\,\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t)) + \eta^2\,\mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\mathbf{1}_t \\
&\le \big[(1 - 8\eta\,\sigma_{\min}(\mathbf{U}_t^*\mathbf{U}_t))\,g_i(\mathbf{U}_t) + 8\eta\,\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t)) + \eta^2\,\mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\mathbf{1}_t.
\end{align*}
For the middle term, we have
\begin{align*}
8\eta\,\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t))
&= 4\eta\big(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t) + \mathrm{tr}(\mathbf{U}_t^*\mathbf{M}\mathbf{e}_i\mathbf{e}_i^*\mathbf{U}_t)\big) \\
&\le 8\eta\,\|\mathbf{U}_t^*\mathbf{e}_i\|_2\,\|\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t\|_2 \le 8\eta\,\|\mathbf{U}_t^*\mathbf{e}_i\|_2\,\|\mathbf{U}_t\|\,\|\mathbf{e}_i^*\mathbf{M}\|_2 \\
&\le 8\eta\sqrt{\frac{20\mu k\kappa^2}{d}}\sqrt{2}\,\|\mathbf{e}_i^*\mathbf{X}\mathbf{S}\mathbf{X}^*\|_2 \le 16\eta\sqrt{\frac{10\mu k\kappa^2}{d}}\,\|\mathbf{e}_i^*\mathbf{X}\mathbf{S}\|_2 \\
&\le 16\eta\sqrt{\frac{10\mu k\kappa^2}{d}}\,\|\mathbf{e}_i^*\mathbf{X}\|_2 \le 16\eta\sqrt{\frac{10\mu k\kappa^2}{d}}\sqrt{\frac{\mu k}{d}} = \frac{16\sqrt{10}\,\eta\mu k\kappa}{d}.
\end{align*}
The first inequality holds because
\[
\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t) + \mathrm{tr}(\mathbf{U}_t^*\mathbf{M}\mathbf{e}_i\mathbf{e}_i^*\mathbf{U}_t) = 2\,\mathrm{Re}(\mathrm{tr}(\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_i)) \le 2\,\|\mathbf{U}_t^*\mathbf{e}_i\|_2\,\|\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t\|_2.
\]
The fourth inequality holds because $\|\mathbf{X}\| = 1$. The fifth inequality holds because $\|\mathbf{S}\| = \|\mathbf{M}\| = 1$. The last inequality comes from the incoherence of $\mathbf{M}$: $\|\mathbf{e}_i^*\mathbf{X}\|_2 \le \sqrt{\mu k/d}$. In this case, since $\sigma_{\min}(\mathbf{U}_t^*\mathbf{U}_t) \ge \frac{1}{2\kappa}$ on $\mathcal{E}_t$,
\[
\mathbb{E}[g_i(\mathbf{U}_{t+1})\mathbf{1}_t\,|\,\mathcal{F}_t] \le \Big[\Big(1 - \frac{4\eta}{\kappa}\Big)g_i(\mathbf{U}_t) + \frac{16\sqrt{10}\,\eta\mu k\kappa}{d} + \eta^2 O(\mu^2k^2\kappa^4)\Big]\mathbf{1}_t \le \Big[\Big(1 - \frac{4\eta}{\kappa}\Big)g_i(\mathbf{U}_t) + \frac{60\,\eta\mu k\kappa}{d}\Big]\mathbf{1}_t.
\]
We can get the last inequality if the stepsize $\eta$ is small enough. We let
\[
G_{it} = \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t}\Big(g_i(\mathbf{U}_t)\mathbf{1}_{t-1} - \frac{15\mu k\kappa^2}{d}\Big)
\]
and have
\begin{align*}
\mathbb{E}[G_{i(t+1)}\,|\,\mathcal{F}_t]
&= \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t-1}\Big(\mathbb{E}[g_i(\mathbf{U}_{t+1})\mathbf{1}_t\,|\,\mathcal{F}_t] - \frac{15\mu k\kappa^2}{d}\Big) \\
&\le \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t}g_i(\mathbf{U}_t)\mathbf{1}_t + \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t}\Big(\frac{60\,\eta\mu k\kappa^2}{(\kappa - 4\eta)d} - \frac{15\mu k\kappa^3}{(\kappa - 4\eta)d}\Big) \le G_{it}.
\end{align*}
The last inequality is true because we have $\mathbf{1}_t \le \mathbf{1}_{t-1}$. That is to say, $G_{it}$ is a supermartingale.

Probability 1 bound for $G$: We know that
\begin{align*}
G_{i(t+1)} - \mathbb{E}[G_{i(t+1)}\,|\,\mathcal{F}_t]
&= \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t-1}\big(g_i(\mathbf{U}_{t+1})\mathbf{1}_t - \mathbb{E}[g_i(\mathbf{U}_{t+1})\mathbf{1}_t\,|\,\mathcal{F}_t]\big) \\
&= \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t-1}\Big[-\eta\,\mathbf{e}_i^*\big[\mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* - \big(\mathbf{U}_t\,\mathbb{E}\,\mathrm{SG}(\mathbf{U}_t)^* + \mathbb{E}\,\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\big)\big]\mathbf{e}_i \\
&\qquad\qquad\qquad\quad + \eta^2\big[\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2 - \mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\Big]\mathbf{1}_t.
\end{align*}
Let $l \in [d]$; we need to approximate upper bounds for $\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l$ and $\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^2\mathbf{1}_t$. For the first term, we have
\[
\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l = 2d^2\,\mathbf{e}_l^*\big[(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^* + (\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\big]\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_l + 2d^2\,\mathbf{e}_l^*\mathbf{U}_t\mathbf{U}_t^*\big[(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^* + (\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\big]\mathbf{e}_l.
\]
We consider the upper bound under different conditions.
• If $l \ne i$ and $l \ne j$, we have $\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l = 0$.
• If $l = j \ne i$, we have
\[
\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l = 2d^2\big[(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\,\mathbf{e}_i^*\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_j + \mathbf{e}_j^*\mathbf{U}_t\mathbf{U}_t^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\,\mathbf{e}_i\big] = 4d^2\,\mathrm{Re}\big[(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\,\mathbf{e}_i^*\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_j\big] \le 4d^2\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^2 \le O(\mu^2k^2\kappa^4).
\]
The second-to-last inequality is true because we have
\[
\|\mathbf{U}\mathbf{U}^* - \mathbf{M}\|_\infty \le \|\mathbf{U}\mathbf{U}^*\|_\infty + \|\mathbf{M}\|_\infty \le \max_i\|\mathbf{e}_i^*\mathbf{U}\|^2 + \max_i|\mathbf{e}_i^*\mathbf{X}\mathbf{S}\mathbf{X}^*\mathbf{e}_i| \le \frac{20\mu k\kappa^2}{d} + \frac{\mu k}{d}\|\mathbf{M}\| \le \frac{21\mu k\kappa^2}{d}.
\]
• If $l = j = i$, we have
\[
\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l = 2d^2(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ii}\,\mathbf{e}_i^*\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_i + 2d^2\,\mathbf{e}_i^*\mathbf{U}_t\mathbf{U}_t^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ii}\,\mathbf{e}_i = 4d^2(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ii}\,\|\mathbf{e}_i^*\mathbf{U}_t\|^2 \le 4d^2\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^2 \le O(\mu^2k^2\kappa^4).
\]
Therefore, we always have $\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l\,\mathbf{1}_t \le O(\mu^2k^2\kappa^4)\mathbf{1}_t$. For the second term, we have
\[
\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^2\mathbf{1}_t \le 4d^4\,|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}|^2\big(\|\mathbf{e}_l^*\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t\| + \|\mathbf{e}_l^*\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\|\big)^2\mathbf{1}_t \le 16d^4\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^2\mathbf{1}_t \le O(\mu^3dk^3\kappa^6)\mathbf{1}_t.
\]
For any $l \in [d]$, we thus have $\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l \le O(\mu^2k^2\kappa^4)$ and $\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^2 \le O(\mu^3dk^3\kappa^6)$. Since the $(i, j)$ component is sampled uniformly at random from $\mathbf{M}$,
the same bounds also hold in expectation:
\[
\mathbb{E}\big(\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l\big) \le O(\mu^2k^2\kappa^4), \qquad \mathbb{E}\big(\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^2\mathbf{1}_t\big) \le O(\mu^3dk^3\kappa^6)\mathbf{1}_t.
\]
In fact, we have that $\mathbb{E}[G_{i(t+1)}\,|\,\mathcal{F}_t] \ge 0$. By letting $\eta$ be small enough, we have with probability 1,
\[
G_{i(t+1)} - \mathbb{E}[G_{i(t+1)}\,|\,\mathcal{F}_t] \le \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t-1}\eta\,O(\mu^2k^2\kappa^4)\mathbf{1}_t. \tag{5.16}
\]

Variance bound for $G$: We need to approximate upper bounds for two variances: $\mathrm{Var}\big(\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l\,\mathbf{1}_t\big)$ and $\mathrm{Var}\big(\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^2\mathbf{1}_t\big)$. For the first term, we have
\begin{align*}
\mathrm{Var}\big(\mathrm{Re}(\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)]\mathbf{U}_t^*\mathbf{e}_l)\cdot\mathbf{1}_t\,|\,\mathcal{F}_t\big)
&\le \mathbb{E}\big[(\mathrm{Re}(\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)]\mathbf{U}_t^*\mathbf{e}_l))^2\cdot\mathbf{1}_t\,|\,\mathcal{F}_t\big] \\
&= \mathbb{E}\big[(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)])))^2\cdot\mathbf{1}_t\,|\,\mathcal{F}_t\big] \\
&\le 4d^4\,\mathbb{E}\big[\big(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t)) + \mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t))\big)^2\cdot\mathbf{1}_t\,|\,\mathcal{F}_t\big] \\
&\le 8d^4\,\mathbb{E}\big[(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t)))^2\cdot\mathbf{1}_t\,|\,\mathcal{F}_t\big] + 8d^4\,\mathbb{E}\big[(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t)))^2\cdot\mathbf{1}_t\,|\,\mathcal{F}_t\big] \\
&= 16d^2\sum_{i,j}\big(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t))\big)^2\cdot\mathbf{1}_t \\
&\le 16d^2\sum_j|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{lj}|^2\max_i\|\mathbf{U}_t^*\mathbf{e}_i\|^4\,\mathbf{1}_t \\
&= 16d^2\,\|\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\|^2\max_j\|\mathbf{e}_j^*\mathbf{U}_t\|^4\,\mathbf{1}_t \le O\!\left(\frac{\mu^3k^3\kappa^6}{d}\right)\mathbf{1}_t.
\end{align*}
The second equality is based on the definition of the expectation. For the fourth inequality, we need to consider different conditions.
• If $l \ne i$, then $(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t)))^2 = 0$.
• If $l = i$, we have
\[
(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t)))^2 = (\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{lj}\mathbf{e}_j^*\mathbf{U}_t)))^2 \le \|\mathbf{U}_t^*\mathbf{e}_l\|_2^2\,\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{lj}\mathbf{e}_j^*\mathbf{U}_t\|_2^2 \le |(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{lj}|^2\max_i\|\mathbf{U}_t^*\mathbf{e}_i\|^4.
\]
For the second term, we have
\begin{align*}
\mathrm{Var}\big(\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^2\mathbf{1}_t\,|\,\mathcal{F}_t\big)
&\le \mathbb{E}\big(\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^4\mathbf{1}_t\,|\,\mathcal{F}_t\big) \\
&= 16\,\frac{1}{d^2}\sum_{i,j}d^8\,\|\mathbf{e}_l^*\big((\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^* + (\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\big)\mathbf{U}_t\|^4\,\mathbf{1}_t \\
&\le 128d^6\sum_{i,j}\|\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t\|^4\,\mathbf{1}_t + 128d^6\sum_{i,j}\|\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\|^4\,\mathbf{1}_t \\
&= 256d^6\sum_j|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{lj}|^4\,\|\mathbf{e}_j^*\mathbf{U}_t\|^4\,\mathbf{1}_t \\
&\le O(1)\,d^6\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty^2\,\|\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\|^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^4\,\mathbf{1}_t \le O(\mu^5dk^5\kappa^{10})\mathbf{1}_t.
\end{align*}
The second inequality follows from $(a + b)^4 \le 8a^4 + 8b^4$, and the second equality holds by considering the two cases $l \ne i$ and $l = i$. Therefore, we can choose a small $\eta$ and obtain
\[
\mathrm{Var}(G_{i(t+1)}\,|\,\mathcal{F}_t) \le \Big(1 - \frac{4\eta}{\kappa}\Big)^{-2t-2}\eta^2\,O\!\left(\frac{\mu^3k^3\kappa^6}{d}\right)\mathbf{1}_t. \tag{5.17}
\]

Bernstein's inequality for $G$: Let $\sigma^2 = \sum_{\tau=1}^t\mathrm{Var}(G_{i\tau}\,|\,\mathcal{F}_{\tau-1})$, and let $R$ be such that $|G_{i\tau} - \mathbb{E}[G_{i\tau}\,|\,\mathcal{F}_{\tau-1}]| \le R$ for $\tau = 1, \dots, t$ with probability 1. Then by the standard Bernstein concentration inequality,
\[
P(G_{it} \ge G_{i0} + s) \le \exp\!\left(-\frac{s^2/2}{\sigma^2 + Rs/3}\right). \tag{5.18}
\]
Since $G_{i0} = g_i(\mathbf{U}_0) - \frac{15\mu k\kappa^2}{d}$, letting $\tilde{s} = O(1)\big(1 - \frac{4\eta}{\kappa}\big)^t\big[\sqrt{\sigma^2\log d} + R\log d\big]$, we know
\[
P\!\left(g_i(\mathbf{U}_t)\mathbf{1}_{t-1} \ge \frac{15\mu k\kappa^2}{d} + \Big(1 - \frac{4\eta}{\kappa}\Big)^t\Big(g_i(\mathbf{U}_0) - \frac{15\mu k\kappa^2}{d}\Big) + \tilde{s}\right) \le \frac{1}{2d^{11}}. \tag{5.19}
\]
Based on (5.16), we know that $R = \big(1 - \frac{4\eta}{\kappa}\big)^{-t}\eta\,O(\mu^2k^2\kappa^4)$ satisfies $G_{i\tau} - \mathbb{E}[G_{i\tau}\,|\,\mathcal{F}_{\tau-1}] \le R$ for $\tau = 1, \dots, t$. Also, by the variance bound for $G$ in (5.17), we have
\[
\Big(1 - \frac{4\eta}{\kappa}\Big)^t\sqrt{\sigma^2\log d} \le \eta\,O\!\left(\sqrt{\frac{\mu^3k^3\kappa^6\log d}{d}}\right)\sqrt{\sum_{\tau=1}^t\Big(1 - \frac{4\eta}{\kappa}\Big)^{2t-2\tau}} \le \eta\,O\!\left(\sqrt{\frac{\mu^3k^3\kappa^6\log d}{d}\cdot\frac{\kappa}{\eta}}\right) \le \sqrt{\eta}\,O\!\left(\sqrt{\frac{\mu^3k^3\kappa^7\log d}{d}}\right).
\]
By choosing $\eta < \frac{c}{\mu dk\kappa^3\log d}$ and choosing $c$ to be small enough, we have
\[
\tilde{s} \le \sqrt{\eta}\,O\!\left(\sqrt{\frac{\mu^3k^3\kappa^7\log d}{d}}\right) + \eta\,O(\mu^2k^2\kappa^4\log d) \le O\!\left(\frac{\mu k\kappa^2}{d}\right) + O\!\left(\frac{\mu k\kappa}{d}\right) \le \frac{5\mu k\kappa^2}{d}.
\]
Since we have the initialization $\max_i g_i(\mathbf{U}_0) \le \frac{10\mu k\kappa^2}{d}$, by Bernstein's inequality we have
\[
P\!\left(g_i(\mathbf{U}_t)\mathbf{1}_{t-1} \ge \frac{20\mu k\kappa^2}{d}\right) \le \frac{1}{2d^{11}}.
\]
Namely,
\[
P\!\left(\mathcal{E}_{t-1}\cap\Big\{g_i(\mathbf{U}_t) \ge \frac{20\mu k\kappa^2}{d}\Big\}\right) \le \frac{1}{2d^{11}}.
\]
We also need to construct another supermartingale $F$.

Construction of the supermartingale $F$: From the definition of $\mathrm{SG}(\mathbf{U}_t)$, we have
\[
\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^2\,\mathbf{1}_t \le 16d^4\,\mathbb{E}\,|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}|^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^2\,\mathbf{1}_t \le 16d^2\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_F^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^2\,\mathbf{1}_t \le O(\mu dk\kappa^2)f(\mathbf{U}_t)\mathbf{1}_t. \tag{5.20}
\]
By the update equation
\[
\mathbf{U}_{t+1} = \mathbf{U}_t - \eta\,\mathrm{SG}(\mathbf{U}_t), \tag{5.21}
\]
we can have
\begin{align*}
\mathbb{E}[f(\mathbf{U}_{t+1})\mathbf{1}_t\,|\,\mathcal{F}_t]
&\le \big[f(\mathbf{U}_t) - \mathbb{E}\langle\nabla f(\mathbf{U}_t), \eta\,\mathrm{SG}(\mathbf{U}_t)\rangle + \eta^2\,\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^2\big]\mathbf{1}_t \\
&= \big[f(\mathbf{U}_t) - \eta\|\nabla f(\mathbf{U}_t)\|_F^2 + \eta^2\,\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^2\big]\mathbf{1}_t \\
&\le \Big[\Big(1 - \frac{2\eta}{\kappa}\Big)f(\mathbf{U}_t) + \eta^2\,O(\mu kd\kappa^2)f(\mathbf{U}_t)\Big]\mathbf{1}_t \le \Big(1 - \frac{\eta}{\kappa}\Big)f(\mathbf{U}_t)\mathbf{1}_t.
\end{align*}
The first inequality comes from the second-order Taylor expansion; we choose a small $\eta$ and increase the coefficient of the second-order term from $\eta^2/2$ to $\eta^2$. The second inequality uses Lemma 5.3.2. The last inequality holds when we choose a small $\eta$. Let $F_t = (1 - \frac{\eta}{\kappa})^{-t}f(\mathbf{U}_t)\mathbf{1}_{t-1}$; then
\[
\mathbb{E}[F_{t+1}\,|\,\mathcal{F}_t] = \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}\mathbb{E}[f(\mathbf{U}_{t+1})\mathbf{1}_t\,|\,\mathcal{F}_t] \le \Big(1 - \frac{\eta}{\kappa}\Big)^{-t}f(\mathbf{U}_t)\mathbf{1}_t \le F_t.
\]
Therefore, $F_t$ is a supermartingale.

Probability 1 bound for $F$: From the definition of $F$, we have
\begin{align*}
F_{t+1} &= \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}f(\mathbf{U}_{t+1})\mathbf{1}_t = \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}\|\mathbf{U}_{t+1}\mathbf{U}_{t+1}^* - \mathbf{M}\|_F^2\,\mathbf{1}_t \\
&= \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M} - \eta(\mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*) + \eta^2\,\mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\|_F^2\,\mathbf{1}_t.
\end{align*}
Define $\hat{f}(\eta) := f(\mathbf{U}_t - \eta\,\mathrm{SG}(\mathbf{U}_t))$. By the second-order Taylor expansion with respect to $\eta$, we can have
\[
f(\mathbf{U}_{t+1}) = \hat{f}(0) + \eta\,\nabla\hat{f}(0) + \frac{\eta^2}{2}\nabla^2\hat{f}(\xi)
\]
for some $\xi$ between $0$ and $\eta$, where
\[
\hat{f}(0) = \|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_F^2, \qquad \nabla\hat{f}(0) = -2\,\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle,
\]
and
\begin{align*}
\nabla^2\hat{f}(\xi)
&= 2\langle\mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*,\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle + 4\,\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle \\
&\quad - 12\xi\,\mathrm{Re}\langle\mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*,\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle + 12\xi^2\langle\mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*,\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle \\
&= 4\,\mathrm{Re}\langle(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle \\
&\quad + 2\,\|\mathrm{SG}(\mathbf{U}_t)(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* + (\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))\mathrm{SG}(\mathbf{U}_t)^*\|_F^2.
\end{align*}
Then we need to bound
\[
F_{t+1} - \mathbb{E}[F_{t+1}\,|\,\mathcal{F}_t] = \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}\big[f(\mathbf{U}_{t+1}) - \mathbb{E}(f(\mathbf{U}_{t+1})\,|\,\mathcal{F}_t)\big]\mathbf{1}_t.
\]
Firstly,
\begin{align*}
\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty\mathbf{1}_t
&= \max_{ij}|\mathbf{e}_i^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{e}_j|\mathbf{1}_t = \max_{ij}|\mathbf{e}_i^*(\mathbf{P}_{\mathbf{X}} + \mathbf{P}_{\mathbf{X}_\perp})(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{e}_j|\mathbf{1}_t \\
&\le \max_{ij}|\mathbf{e}_i^*\mathbf{P}_{\mathbf{X}}(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{e}_j|\mathbf{1}_t + \max_{ij}|\mathbf{e}_i^*\mathbf{P}_{\mathbf{X}_\perp}\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_j|\mathbf{1}_t \\
&\le \max_{ij}\|\mathbf{e}_i^*\mathbf{X}\|\,\|\mathbf{X}^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{e}_j\|\mathbf{1}_t + \|\mathbf{e}_j^*\mathbf{W}\|\,\|\mathbf{W}^*\mathbf{W}\mathbf{D}\mathbf{W}^*\mathbf{P}_{\mathbf{X}_\perp}\mathbf{e}_i\| \\
&\le \max_{ij}\|\mathbf{e}_i^*\mathbf{X}\|\,\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{e}_j\| + \|\mathbf{e}_j^*\mathbf{W}\|\,\|\mathbf{W}\mathbf{D}\mathbf{W}^*\mathbf{P}_{\mathbf{X}_\perp}\mathbf{e}_i\| \\
&\le \max_i\|\mathbf{e}_i^*\mathbf{X}\|\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_F + \|\mathbf{e}_j^*\mathbf{W}\|\,\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{P}_{\mathbf{X}_\perp}\mathbf{e}_i\| \\
&\le O\!\left(\sqrt{\frac{\mu k}{d}}\right)\sqrt{f(\mathbf{U}_t)} + O\!\left(\sqrt{\frac{\mu k\kappa^3}{d}}\right)\sqrt{f(\mathbf{U}_t)} \le O\!\left(\sqrt{\frac{\mu k\kappa^3}{d}}\right)\sqrt{f(\mathbf{U}_t)}.
\end{align*}
The fifth inequality comes from
\[
\|\mathbf{e}_i^*\mathbf{W}\| \le \|\mathbf{e}_i^*\mathbf{W}\mathbf{D}^{\frac{1}{2}}\|\,\frac{1}{\lambda_{\min}(\mathbf{D})^{\frac{1}{2}}} = \frac{\|\mathbf{e}_i^*\mathbf{U}_t\|}{\lambda_{\min}(\mathbf{D})^{\frac{1}{2}}} \le \sqrt{2\kappa}\sqrt{\frac{20\mu k\kappa^2}{d}}.
\]
As we know, for the first-order derivative,
\begin{align*}
\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle
&= \mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*\rangle + \mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle \\
&\le 2\sqrt{2}\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_F\,\|\mathrm{SG}(\mathbf{U}_t)\|_F \\
&= 4\sqrt{2}\,d^2\sqrt{f(\mathbf{U}_t)}\,\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t + (\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\|_F \\
&\le 4\sqrt{2}\,d^2\sqrt{f(\mathbf{U}_t)}\big(\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t\|_F + \|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\|_F\big) \\
&\le 4\sqrt{2}\,d^2\sqrt{f(\mathbf{U}_t)}\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty\big(\|\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t\|_F + \|\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\|_F\big) \\
&\le 8\sqrt{2}\,d^2\sqrt{f(\mathbf{U}_t)}\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty\max_i\|\mathbf{e}_i^*\mathbf{U}_t\| \le O(\mu dk\kappa^{2.5})f(\mathbf{U}_t).
\end{align*}
Here, the first inequality comes from $\|\mathbf{U}_t\| \le \sqrt{2}$. For the second-order derivative, with a small enough $\eta$, we have
\begin{align*}
&4\,\mathrm{Re}\langle(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle + 2\,\|\mathrm{SG}(\mathbf{U}_t)(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* + (\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))\mathrm{SG}(\mathbf{U}_t)^*\|_F^2 \\
&\le O(1)\,\|\mathrm{SG}(\mathbf{U}_t)\|_F^2 \le O(1)\,d^4\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^2 \le O(\mu^2d^2k^2\kappa^5)f(\mathbf{U}_t).
\end{align*}
The first inequality holds because, when $\eta$ is small enough, we have a uniform upper bound for $\|\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t)\|$ and $\|(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* - \mathbf{M}\|$. In this case, if we choose $\eta$ to be small enough, we have with probability 1,
\[
|F_{t+1} - \mathbb{E}[F_{t+1}\,|\,\mathcal{F}_t]| \le \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}\eta\,O(\mu dk\kappa^{2.5})f(\mathbf{U}_{t+1})\mathbf{1}_t \le \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}\Big(1 - \frac{\eta}{2\kappa}\Big)^{t+1}\eta\,O(\mu dk\kappa^{0.5}). \tag{5.22}
\]
Variance bound for $F$: For the first-order derivative,
\begin{align*}
\mathrm{Var}\big(\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle\big)
&\le \mathbb{E}\big(\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle\big)^2 \\
&\le 2\,\mathbb{E}\big(\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*\rangle\big)^2 + 2\,\mathbb{E}\big(\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle\big)^2 \\
&\le 2\,\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{U}_t\|_F^2\,\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^2 + 2\,\|\mathbf{U}_t^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\|_F^2\,\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^2 \\
&\le 8\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_F^2\,\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^2 \le O(\mu dk\kappa^2)f^2(\mathbf{U}_t),
\end{align*}
where the last inequality comes from (5.20). For the second-order derivative, when $\eta$ is small enough, we have
\begin{align*}
&\mathrm{Var}\big(4\,\mathrm{Re}\langle(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle + 2\,\|\mathrm{SG}(\mathbf{U}_t)(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* + (\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))\mathrm{SG}(\mathbf{U}_t)^*\|_F^2\big) \\
&\le \mathbb{E}\big(4\,\mathrm{Re}\langle(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle + 2\,\|\mathrm{SG}(\mathbf{U}_t)(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* + (\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))\mathrm{SG}(\mathbf{U}_t)^*\|_F^2\big)^2 \\
&\le 2\,\mathbb{E}\big(4\,\mathrm{Re}\langle(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle\big)^2 + 2\,\mathbb{E}\big(2\,\|\mathrm{SG}(\mathbf{U}_t)(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* + (\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))\mathrm{SG}(\mathbf{U}_t)^*\|_F^2\big)^2 \\
&\le O(1)\,\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^4 = O(d^8)\,\mathbb{E}\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t + (\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\|_F^4 \\
&\le O(d^8)\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty^2\,\mathbb{E}|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}|^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^4 \\
&\le O(d^6)\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty^2\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_F^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^4 \le O(\mu^3d^3k^3\kappa^7)f^2(\mathbf{U}_t).
\end{align*}
Therefore, we can have with probability 1,
\[
\mathrm{Var}(F_{t+1}\,|\,\mathcal{F}_t) \le \Big(1 - \frac{\eta}{\kappa}\Big)^{-2t-2}\eta^2\,O(\mu dk\kappa^2)f^2(\mathbf{U}_t)\mathbf{1}_t \le \Big(1 - \frac{\eta}{\kappa}\Big)^{-2t-2}\Big(1 - \frac{\eta}{2\kappa}\Big)^{2t+2}\eta^2\,O\!\left(\frac{\mu dk}{\kappa^2}\right)\mathbf{1}_t. \tag{5.23}
\]

Bernstein's inequality for $F$: Let $\sigma^2 = \sum_{\tau=1}^t\mathrm{Var}(F_\tau\,|\,\mathcal{F}_{\tau-1})$ and let $R$ satisfy $|F_\tau - \mathbb{E}[F_\tau\,|\,\mathcal{F}_{\tau-1}]| \le R$ according to (5.22), $\tau = 1, \dots, t$. Then by the standard Bernstein concentration inequality, we know
\[
P(F_t \ge F_0 + s) \le \exp\!\left(-\frac{s^2/2}{\sigma^2 + Rs/3}\right).
\]
Let $\tilde{s} = O(1)\big(1 - \frac{\eta}{\kappa}\big)^t\big[\sqrt{\sigma^2\log d} + R\log d\big]$. So when $d \ge 2$, we have
\[
P\Big(f(\mathbf{U}_t)\mathbf{1}_{t-1} \ge \Big(1 - \frac{\eta}{\kappa}\Big)^t f(\mathbf{U}_0) + \tilde{s}\Big) \le \frac{1}{2d^{10}}.
\]
We know that $R = \big(1 - \frac{\eta}{\kappa}\big)^{-t}\big(1 - \frac{\eta}{2\kappa}\big)^t\eta\,O(\mu dk\kappa^{0.5})$. By the variance bound of $F$ in (5.23), we have
\begin{align*}
\Big(1 - \frac{\eta}{\kappa}\Big)^t\sqrt{\sigma^2\log d}
&\le \eta\,O\!\left(\sqrt{\frac{\mu dk\log d}{\kappa^2}}\right)\sqrt{\sum_{\tau=1}^t\Big(1 - \frac{\eta}{\kappa}\Big)^{2t-2\tau}\Big(1 - \frac{\eta}{2\kappa}\Big)^{2\tau}} \\
&\le \Big(1 - \frac{\eta}{2\kappa}\Big)^t\eta\,O\!\left(\sqrt{\frac{\mu dk\log d}{\kappa^2}}\right)\sqrt{\sum_{\tau=1}^t\Big(1 - \frac{\eta}{\kappa}\Big)^{2t-2\tau}\Big(1 - \frac{\eta}{2\kappa}\Big)^{2\tau-2t}} \\
&\le \Big(1 - \frac{\eta}{2\kappa}\Big)^t\sqrt{\eta}\,O\!\left(\sqrt{\frac{\mu dk\log d}{\kappa}}\right).
\end{align*}
The last inequality holds because we have
\[
\sum_{\tau=1}^t\Big(1 - \frac{\eta}{\kappa}\Big)^{2t-2\tau}\Big(1 - \frac{\eta}{2\kappa}\Big)^{2\tau-2t} < \frac{4(\frac{\kappa}{\eta})^2 - 4\frac{\kappa}{\eta} + 1}{4\frac{\kappa}{\eta} - 3} \le \frac{\kappa}{\eta}.
\]
By $\eta < \frac{c}{\mu dk\kappa^3\log d}$ and choosing $c$ to be small enough, we have
\[
\tilde{s} = \Big(1 - \frac{\eta}{2\kappa}\Big)^t\left[\sqrt{\eta}\,O\!\left(\sqrt{\frac{\mu dk\log d}{\kappa}}\right) + \eta\,O(\mu dk\kappa^{0.5}\log d)\right] \le \Big(1 - \frac{\eta}{2\kappa}\Big)^t\left(\frac{1}{20\kappa}\right)^2.
\]
Since $F_0 = f(\mathbf{U}_0) \le \frac{1}{(20\kappa)^2}$, we can have
\[
P\left(f(\mathbf{U}_t)\mathbf{1}_{t-1} \ge \Big(1 - \frac{\eta}{2\kappa}\Big)^t\left(\frac{1}{10\kappa}\right)^2\right) \le \frac{1}{2d^{10}}.
\]
That is to say,
\[
P\left(\mathcal{E}_{t-1}\cap\Big\{f(\mathbf{U}_t) \ge \Big(1 - \frac{\eta}{2\kappa}\Big)^t\Big(\frac{1}{10\kappa}\Big)^2\Big\}\right) \le \frac{1}{2d^{10}}.
\]

Probability of the event $\mathcal{E}_T$: We need to combine the concentration results for the martingales $G$ and $F$. Then we get
\begin{align*}
P(\mathcal{E}_{t-1}\cap\bar{\mathcal{E}}_t)
&= P\left(\mathcal{E}_{t-1}\cap\left(\Big\{\exists\, i:\ g_i(\mathbf{U}_t) \ge \frac{20\mu k\kappa^2}{d}\Big\}\cup\Big\{f(\mathbf{U}_t) \ge \Big(1 - \frac{\eta}{2\kappa}\Big)^t\Big(\frac{1}{10\kappa}\Big)^2\Big\}\right)\right) \\
&\le \sum_{i=1}^d P\left(\mathcal{E}_{t-1}\cap\Big\{g_i(\mathbf{U}_t) \ge \frac{20\mu k\kappa^2}{d}\Big\}\right) + P\left(\mathcal{E}_{t-1}\cap\Big\{f(\mathbf{U}_t) \ge \Big(1 - \frac{\eta}{2\kappa}\Big)^t\Big(\frac{1}{10\kappa}\Big)^2\Big\}\right) \le \frac{1}{d^{10}}.
\end{align*}
The theorem is proved in the Hermitian case.

5.4 Algorithms

In this section, we give algorithms for both the Hermitian and the general case.

5.4.1 The Hermitian Case

For the Hermitian case, we need to find one matrix $\mathbf{U}$ so that $\mathbf{U}\mathbf{U}^* \approx \mathbf{M}$. Given a new observation $(i, j)$, one or two rows of $\mathbf{U}$ are updated in every iteration. The SGD update is given by (5.11), and $\eta$ is the stepsize. For the convergence result in this chapter, $\eta$ has to satisfy $\eta < \frac{c}{\mu dk\kappa^3\log d}$ in our theoretical proof with a small $c$; in practice, however, we may choose a larger stepsize. $T$ is the total number of required observations. The algorithm is described in Algorithm 5.1.
Algorithm 5.1: Online learning algorithm for the Hermitian matrix $\mathbf{M}$
Input: initial observation set $\Omega_0 \subset [d]\times[d]$, learning rate $\eta$, number of iterations $T$, and $\mathbf{U}_0$ with $\mathbf{U}_0\mathbf{U}_0^* \leftarrow$ top-$k$ SVD of $\frac{d^2}{|\Omega_0|}\mathbf{P}_{\Omega_0}(\mathbf{M})$
Output: $\mathbf{U}$ such that $\mathbf{U}\mathbf{U}^* \approx \mathbf{M}$
1 for $t = 0, 1, 2, \dots, T-1$ do
2   Observe $M_{ij}$, where $(i, j)$ is drawn uniformly from $\{1, \dots, d\} \times \{1, \dots, d\}$;
3   $\mathbf{U}_{t+1} = \mathbf{U}_t - 2\eta d^2\big((\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\,\mathbf{e}_i\mathbf{e}_j^* + (\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\,\mathbf{e}_j\mathbf{e}_i^*\big)\mathbf{U}_t$
4 end

5.4.2 The General Case

In the general case with a quaternion matrix $\mathbf{M} \in \mathbb{Q}^{d_1 \times d_2}$, we need to find two matrices $\mathbf{U} \in \mathbb{Q}^{d_1 \times k}$ and $\mathbf{V} \in \mathbb{Q}^{d_2 \times k}$ so that $\mathbf{U}\mathbf{V}^* \approx \mathbf{M}$. Here $\Omega_0$ is the initial set of observed coordinates, from which we construct a good initialization $\mathbf{U}_0$. For each iteration, given one observation $M_{ij}$, we update the $i$-th row of $\mathbf{U}_t$ and the $j$-th row of $\mathbf{V}_t$. Both rows are updated by SGD in a similar way to the Hermitian case. The parameter $\eta$ is the stepsize, and $T$ is the total number of observations. The algorithm is described in Algorithm 5.2. In practice, we use Algorithm 5.3 to increase the speed, just as in (Jin et al., 2016): instead of directly computing an SVD of the $d_1 \times d_2$ quaternion matrix $\mathbf{U}\mathbf{V}^*$, we compute SVDs of the two smaller $k \times k$ quaternion matrices $\mathbf{U}_t^*\mathbf{U}_t$ and $\mathbf{V}_t^*\mathbf{V}_t$.

Algorithm 5.2: Online learning algorithm for general $\mathbf{M}$ (theoretical version)
Input: initial observation set $\Omega_0 \subset [d_1]\times[d_2]$, learning rate $\eta$, number of iterations $T$, and $\mathbf{U}_0\mathbf{V}_0^* \leftarrow$ top-$k$ SVD of $\frac{d_1 d_2}{|\Omega_0|}\mathbf{P}_{\Omega_0}(\mathbf{M})$
Output: $\mathbf{U}, \mathbf{V}$ such that $\mathbf{U}\mathbf{V}^* \approx \mathbf{M}$
1 for $t = 0, 1, 2, \dots, T-1$ do
2   $\mathbf{W}_U\mathbf{D}\mathbf{W}_V^* \leftarrow \mathrm{SVD}(\mathbf{U}_t\mathbf{V}_t^*)$;
3   $\tilde{\mathbf{U}}_t \leftarrow \mathbf{W}_U\mathbf{D}^{\frac{1}{2}}$, $\tilde{\mathbf{V}}_t \leftarrow \mathbf{W}_V\mathbf{D}^{\frac{1}{2}}$;
4   Observe $M_{ij}$, where $(i, j)$ is drawn uniformly from $[d_1] \times [d_2]$;
5   $\mathbf{U}_{t+1} \leftarrow \tilde{\mathbf{U}}_t - 2\eta d_1 d_2(\tilde{\mathbf{U}}_t\tilde{\mathbf{V}}_t^* - \mathbf{M})_{ij}\,\mathbf{e}_i\mathbf{e}_j^*\tilde{\mathbf{V}}_t$;
6   $\mathbf{V}_{t+1} \leftarrow \tilde{\mathbf{V}}_t - 2\eta d_1 d_2(\tilde{\mathbf{U}}_t\tilde{\mathbf{V}}_t^* - \mathbf{M})_{ji}\,\mathbf{e}_j\mathbf{e}_i^*\tilde{\mathbf{U}}_t$
7 end

Algorithm 5.3: Online learning algorithm for general $\mathbf{M}$ (practical version)
Input: initial observation set $\Omega_0 \subset [d_1]\times[d_2]$, learning rate $\eta$, number of iterations $T$, and $\mathbf{U}_0\mathbf{V}_0^* \leftarrow$ top-$k$ SVD of $\frac{d_1 d_2}{|\Omega_0|}\mathbf{P}_{\Omega_0}(\mathbf{M})$
Output: $\mathbf{U}, \mathbf{V}$ such that $\mathbf{U}\mathbf{V}^* \approx \mathbf{M}$
1 for $t = 0, 1, 2, \dots, T-1$ do
2   Observe $M_{ij}$, where $(i, j) \sim \mathrm{Unif}([d_1] \times [d_2])$;
3   $\mathbf{R}_U\mathbf{D}_U\mathbf{R}_U^* \leftarrow \mathrm{SVD}(\mathbf{U}_t^*\mathbf{U}_t)$;
4   $\mathbf{R}_V\mathbf{D}_V\mathbf{R}_V^* \leftarrow \mathrm{SVD}(\mathbf{V}_t^*\mathbf{V}_t)$;
5   $\mathbf{Q}_U\mathbf{D}\mathbf{Q}_V^* \leftarrow \mathrm{SVD}(\mathbf{D}_U^{\frac{1}{2}}\mathbf{R}_U^*\mathbf{R}_V(\mathbf{D}_V^{\frac{1}{2}})^*)$;
6   $\mathbf{U}_{t+1} = \mathbf{U}_t - 2\eta d_1 d_2(\mathbf{U}_t\mathbf{V}_t^* - \mathbf{M})_{ij}\,\mathbf{e}_i\mathbf{e}_j^*\mathbf{V}_t\mathbf{R}_V\mathbf{D}_V^{-\frac{1}{2}}\mathbf{Q}_V\mathbf{Q}_U^*\mathbf{D}_U^{\frac{1}{2}}\mathbf{R}_U^*$;
7   $\mathbf{V}_{t+1} = \mathbf{V}_t - 2\eta d_1 d_2(\mathbf{U}_t\mathbf{V}_t^* - \mathbf{M})_{ji}\,\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\mathbf{R}_U\mathbf{D}_U^{-\frac{1}{2}}\mathbf{Q}_U\mathbf{Q}_V^*\mathbf{D}_V^{\frac{1}{2}}\mathbf{R}_V^*$
8 end

5.5 Numerical Experiments

In this section, we conduct numerical experiments for both the Hermitian and the general case.

Small Hermitian case: We randomly generate a $10 \times 10$ Hermitian quaternion matrix with rank 5. The initialization is obtained from 99% of the entries. The stepsize $\eta$ is chosen as 3e-5, and the total number of iterations is 40000. The total number of iterations is large compared with the size of the matrix, and we use this example to demonstrate the linear convergence of the proposed algorithm. For this example, each component is chosen 400 times on average, and the algorithm converges linearly in Figure 5.2, which confirms the theoretical results.

Figure 5.2: Loss function value versus number of iterations for the small Hermitian case. The stepsize is 3e-5. Within 40000 iterations, the value decreases from nearly 100 to $10^{-6}$. The loss function value tends to keep decreasing after these 40000 iterations.

Small general case: We randomly generate a $10 \times 10$ general quaternion matrix with rank 5. The initialization is obtained from 99% of the entries. The stepsize $\eta$ is chosen as 1e-4, and the total number of iterations is 5000. We also observe the convergence of the proposed algorithm in Figure 5.3.
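To give a concrete feel for Algorithm 5.1 and the small experiments above, here is a minimal sketch of the online update loop. It is restricted to a real symmetric positive semidefinite matrix for brevity (the quaternion Hermitian case replaces transposes by conjugate transposes and real products by quaternion products), so it should be read as an illustration of the structure of the algorithm rather than the implementation behind Figure 5.2; the function name and the synthetic test at the end are our own.

```python
import numpy as np

def online_symmetric_completion(M, k, eta, T, obs_frac=0.99, seed=0):
    """Sketch of Algorithm 5.1 for a real symmetric PSD matrix M of rank k."""
    rng = np.random.default_rng(seed)
    d = M.shape[0]
    # Initialization: top-k eigenpairs of (d^2 / |Omega_0|) * P_{Omega_0}(M).
    mask = rng.random((d, d)) < obs_frac
    M0 = np.where(mask, M, 0.0) * (d * d / mask.sum())
    M0 = (M0 + M0.T) / 2                            # symmetrize before eigendecomposition
    w, V = np.linalg.eigh(M0)
    idx = np.argsort(w)[::-1][:k]                   # k largest eigenvalues
    U = V[:, idx] * np.sqrt(np.maximum(w[idx], 0))  # U0 with U0 U0^T approximating M
    losses = []
    for _ in range(T):
        i, j = rng.integers(d), rng.integers(d)     # uniformly sampled entry (i, j)
        r_ij = U[i] @ U[j] - M[i, j]                # residual (U U^T - M)_{ij}
        r_ji = U[j] @ U[i] - M[j, i]                # equals r_ij in the real symmetric case
        # Update (5.11): only rows i and j of U change.
        Ui_old = U[i].copy()
        U[i] = U[i] - 2 * eta * d * d * r_ij * U[j]
        U[j] = U[j] - 2 * eta * d * d * r_ji * Ui_old
        losses.append(np.linalg.norm(U @ U.T - M, "fro") ** 2)
    return U, losses

# Usage mirroring the scale of the small Hermitian experiment above.
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 5))
M = A @ A.T                                         # rank-5 symmetric PSD test matrix
U, losses = online_symmetric_completion(M, k=5, eta=3e-5, T=40000)
print(losses[0], losses[-1])
```

Each iteration touches only rows $i$ and $j$ of $\mathbf{U}$, so the per-step cost is $O(k)$ apart from the optional loss evaluation; with a small enough stepsize the loss is expected to decay geometrically, mirroring the behavior in Figure 5.2.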
From both examples, we can see that the algorithm converges linearly, although the convergence is slow.

Color image: For a color image with a depth channel, of dimension $259 \times 320 \times 4$, we resize the image to $156 \times 192 \times 4$. There are four channels including the depth, so we can use a quaternion matrix to describe it, with the depth stored in the real part. We normalize the matrix and force its rank to be 30. The initial stepsize is 1e-5, and the total number of iterations is 10000; in this case, the total number of iterations is smaller than the number of entries in the matrix. Every 300 iterations, we multiply the stepsize by 0.95 to help the algorithm converge.

Figure 5.3: Loss function value versus number of iterations for the small general quaternion matrix case. The stepsize is tuned to be 1e-4. Within 5000 iterations, the value decreases from around 100 to $10^{-4}$. The loss function value tends to keep decreasing after these 5000 iterations.

The result is shown in Figure 5.4 and Figure 5.5. We can see that the image cannot be recovered well from only 10000 observations. One possible reason is that a quaternion matrix carries three more components per entry than a real matrix, so it tends to need more observations to converge. Here, only around 1/3 of the pixels are used, each exactly once, which is far below the amount of computation required by offline algorithms: in each iteration of an offline algorithm, all observed pixels are visited, and many iterations are required to obtain a good recovery.

Figure 5.4: Online image recovery result after 10000 iterations ((a) true image, (b) recovered image). We randomly sampled 10000 observations from (a). We can see that the result for the recovered image (b) is not good; we expect the difference between (a) and (b) to be as small as possible.

Figure 5.5: Loss function value versus number of iterations for the real color image. The initial stepsize is tuned to be 1e-5. Every 300 iterations, we multiply the stepsize by 0.95. The loss function value decreases from around 170 to almost 45. At the beginning, the loss function value decreases the most, and it tends to keep decreasing after these 10000 iterations.

5.6 Conclusion

In this chapter, we introduce the quaternion matrix and its properties. Based on previous work on online matrix completion, we set up a provable and efficient framework for online quaternion matrix completion, which can easily be applied to color images. This framework applies nonconvex SGD to the quaternion matrix, and the recovery improves with each newly observed entry. Using martingale theory, we prove that SGD stays away from saddle points and converges linearly if we have a good initialization.

BIBLIOGRAPHY

Amaldi, E. and Kann, V. (1998). On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209(1-2):237–260.

Beck, A. and Teboulle, M. (2009a). Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419–2434.

Beck, A. and Teboulle, M. (2009b). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Beck, A.
and Teboulle, M. (2009c). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Bouwmans, T. and Zahzah, E. H. (2014). Robust PCA via principal component pursuit: A review for a comparative evaluation in video surveillance. Computer Vision and Image Understanding, 122:22–34.

Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122.

Cabral, R., De la Torre, F., Costeira, J. P., and Bernardino, A. (2014). Matrix completion for weakly-supervised multi-label image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):121–135.

Cabral, R. S., Torre, F., Costeira, J. P., and Bernardino, A. (2011). Matrix completion for multi-label image classification. In Advances in Neural Information Processing Systems, pages 190–198.

Cai, H., Cai, J.-F., and Wei, K. (2019). Accelerated alternating projections for robust principal component analysis. The Journal of Machine Learning Research, 20(1):685–717.

Candès, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust principal component analysis? Journal of the ACM (JACM), 58(3):11.

Candes, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust Principal Component Analysis? J. ACM, 58(3):11.

Chartrand, R. (2007). Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Processing Letters, 14(10):707–710.

Chen, Y., Ma, J., and Fomel, S. (2016). Double-sparsity dictionary for seismic noise attenuation. Geophysics, 81(2):V103–V116.

Chen, Y., Zhou, Y., Chen, W., Zu, S., Huang, W., and Zhang, D. (2017). Empirical low-rank approximation for seismic noise attenuation. IEEE Transactions on Geoscience and Remote Sensing, 55(8):4696–4711.

Cheng, J., Chen, K., and Sacchi, M. D. (2015). Robust principle component analysis (RPCA) for seismic data denoising. In GeoConvention 2015.

Cunningham, J. P. and Ghahramani, Z. (2015). Linear dimensionality reduction: Survey, insights, and generalizations. The Journal of Machine Learning Research, 16(1):2859–2900.

Da Costa, J. F. P., Alonso, H., and Roque, L. (2009). A weighted principal component analysis and its application to gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(1):246–252.

De la Torre, F. and Black, M. J. (2001). Robust principal component analysis for computer vision. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 1, pages 362–369. IEEE.

Du, C., Sun, J., Zhou, S., and Zhao, J. (2013). An Outlier Detection Method for Robust Manifold Learning. In Yin, Z., Pan, L., and Fang, X., editors, Proceedings of The Eighth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA), 2013, Advances in Intelligent Systems and Computing, pages 353–360. Springer Berlin Heidelberg.

Duarte, L. T., Nadalin, E. Z., Nose Filho, K., Zanetti, R., Romano, J. M., and Tygel, M. (2012). Seismic wave separation by means of robust principal component analysis. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pages 1494–1498. IEEE.

Elhamifar, E. and Vidal, R. (2013). Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781.

Fan, J. and Li, R. (2001).
Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.

Fomel, S. and Liu, Y. (2013). Seislet transform and seislet frame. Geophysics, 75:V25–V38.

Gaudet, C. J. and Maida, A. S. (2018). Deep quaternion networks. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.

Gavish, M. and Donoho, D. L. (2014). The optimal hard threshold for singular values is $4/\sqrt{3}$. IEEE Transactions on Information Theory, 60(8):5040–5053.

Hammond, D. K., Vandergheynst, P., and Gribonval, R. (2011). Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150.

Herman, G. and Perkins, C. (2006). Predictive removal of scattered noise. Geophysics, 71:V41–V49.

Horn, R. A. and Johnson, C. R. (2012). Matrix analysis. Cambridge University Press.

Huang, X. and Yan, M. (2018). Nonconvex penalties with analytical solutions for one-bit compressive sensing. Signal Processing, 144:341–351.

Huang, X.-L., Shi, L., and Yan, M. (2015). Nonconvex sorted $\ell_1$ minimization for sparse approximation. Journal of the Operations Research Society of China, 3(2):207–229.

Ji, H., Liu, C., Shen, Z., and Xu, Y. (2010). Robust video denoising using low rank matrix completion. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1791–1798. IEEE.

Jianbo Shi and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905.

Jiang, B., Ding, C., Luo, B., and Tang, J. (2013). Graph-Laplacian PCA: Closed-Form Solution and Robustness. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 3492–3498.

Jin, C., Kakade, S. M., and Netrapalli, P. (2016). Provable efficient online matrix completion via non-convex stochastic gradient descent. Advances in Neural Information Processing Systems, 29:4520–4528.

Kent, A., Sweet, J., and Woodward, B. (2016). IRIS community wavefield experiment in Oklahoma. Incorporated Research Institutions for Seismology. Dataset/Seismic Network.

Kilmer, M. E. and Martin, C. D. (2011). Factorization strategies for third-order tensors. Linear Algebra and its Applications, 435(3):641–658.

Kim, D. and Fessler, J. A. (2016). Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1-2):81–107.

Kim, J.-H., Sim, J.-Y., and Kim, C.-S. (2015). Video deraining and desnowing using temporal correlation and low-rank matrix completion. IEEE Transactions on Image Processing, 24(9):2658–2670.

Kopsinis, Y. and McLaughlin, S. (2009). Development of EMD-based denoising methods inspired by wavelet thresholding. IEEE Transactions on Signal Processing, 57(4):1351–1362.

Koren, Y. (2009). The BellKor solution to the Netflix grand prize. Netflix prize documentation, 81(2009):1–10.

Kreimer, N. and Sacchi, M. D. (2012). A tensor higher-order singular value decomposition for prestack seismic data noise reduction and interpolation. Geophysics, 77:V113–V122.

Li, G. and Pong, T. K. (2015). Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434–2460.

Li, H. and Lin, Z. (2015). Accelerated proximal gradient methods for nonconvex programming. Advances in Neural Information Processing Systems, 28:379–387.

Li, J.-Q., Rong, Z.-H., Chen, X., Yan, G.-Y., and You, Z.-H. (2017). MCMDA: Matrix completion for miRNA-disease association prediction. Oncotarget, 8(13):21187.
Li, X.-R., Li, X.-M., Li, H.-L., and Cao, M.-Y. (2009). Rejecting Outliers Based on Correspondence Manifold. Acta Automatica Sinica, 35(1):17–22.

Lin, Z., Chen, M., and Ma, Y. (2010). The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055.

Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., and Ma, Y. (2012). Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171–184.

Liu, X., Wen, Z., and Zhang, Y. (2015). An efficient Gauss–Newton algorithm for symmetric low-rank product matrix approximations. SIAM Journal on Optimization, 25(3):1571–1608.

Liu, Y. and Fomel, S. (2013). Seismic data analysis using local time-frequency decomposition. Geophysical Prospecting, 61(3):516–525.

Liu, Y., Zheng, Y., Lu, J., Cao, J., and Rutkowski, L. (2019). Constrained quaternion-variable convex optimization: a quaternion-valued recurrent neural network approach. IEEE Transactions on Neural Networks and Learning Systems, 31(3):1022–1035.

Lou, Y. and Yan, M. (2018). Fast l1–l2 minimization via a proximal operator. Journal of Scientific Computing, 74(2):767–785.

Lu, C., Yang, M., Luo, F., Wu, F.-X., Li, M., Pan, Y., Li, Y., and Wang, J. (2018). Prediction of lncRNA–disease associations based on inductive matrix completion. Bioinformatics, 34(19):3357–3364.

Ma, S. and Aybat, N. S. (2018). Efficient optimization algorithms for robust principal component analysis and its variants. Proceedings of the IEEE, 106(8):1411–1426.

Meila, M. and Shi, J. (2001). Learning Segmentation by Random Walks. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 873–879. MIT Press.

Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152.

Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161.

Parcollet, T., Zhang, Y., Morchid, M., Trabelsi, C., Linarès, G., De Mori, R., and Bengio, Y. (2018). Quaternion convolutional neural networks for end-to-end automatic speech recognition. arXiv preprint arXiv:1806.07789.

Qiao, T., Ren, J., Wang, Z., Zabalza, J., Sun, M., Zhao, H., Li, S., Benediktsson, J. A., Dai, Q., and Marshall, S. (2017). Effective denoising and classification of hyperspectral images using curvelet transform and singular spectrum analysis. IEEE Transactions on Geoscience and Remote Sensing, 55(1):119–133.

Rauhut, H., Schneider, R., and Stojanac, Ž. (2017). Low rank tensor recovery via iterative hard thresholding. Linear Algebra and its Applications, 523:220–262.

Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501.

Rubinstein, R., Zibulevsky, M., and Elad, M. (2010). Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing, 58(3):1553–1564.

Sha, N., Yan, M., and Lin, Y. (2019). Efficient seismic denoising techniques using robust principal component analysis. In SEG Technical Program Expanded Abstracts 2019, pages 2543–2547. Society of Exploration Geophysicists.

Shen, Y., Xu, H., and Liu, X. (2019). An alternating minimization method for robust principal component analysis. Optimization Methods and Software, 34(6):1251–1276.

Sun, C., Zhang, Q., Wang, J., and Xie, J. (2014).
Noise reduction based on robust principal component analysis. Journal of Computational Information Systems, 10(10):4403–4410.

Tan, S. and Huang, L. (2014). An efficient finite-difference method with high-order accuracy in both time and space domains for modelling scalar-wave propagation. Geophysical Journal International, 197(2):1250–1267.

Tao, M. and Yuan, X. (2011). Recovering low-rank and sparse components of matrices from incomplete and noisy observations. SIAM Journal on Optimization, 21(1):57–81.

Tay, Y., Zhang, A., Tuan, L. A., Rao, J., Zhang, S., Wang, S., Fu, J., and Hui, S. C. (2019). Lightweight and efficient neural natural language processing with quaternion networks. arXiv preprint arXiv:1906.04393.

Trefethen, L. N. and Bau III, D. (1997). Numerical linear algebra, volume 50. SIAM.

Wang, C., Wang, X., Li, Y., Xia, Z., and Zhang, C. (2018). Quaternion polar harmonic Fourier moments for color images. Information Sciences, 450:141–156.

Wang, Y., Yin, W., and Zeng, J. (2019). Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1):29–63.

Weglein, A. B. (2016). Multiples: Signal or noise? Geophysics, 81:V283–V302.

Wen, F., Chu, L., Liu, P., and Qiu, R. C. (2018). A survey on nonconvex regularization-based sparse and low-rank recovery in signal processing, statistics, and machine learning. IEEE Access, 6:69883–69906.

Wen, F., Ying, R., Liu, P., and Truong, T.-K. (2019). Nonconvex regularized robust PCA using the proximal block coordinate descent algorithm. IEEE Transactions on Signal Processing, 67(20):5402–5416.

Wen, Z., Yin, W., and Zhang, Y. (2012). Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation, 4(4):333–361.

Wright, J., Ganesh, A., Rao, S., and Ma, Y. (2009). Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. Coordinated Science Laboratory Report no. UILU-ENG-09-2210, DC-243.

Ye, H.-S., Zhou, N.-R., and Gong, L.-H. (2020). Multi-image compression-encryption scheme based on quaternion discrete fractional Hartley transform and improved pixel adaptive diffusion. Signal Processing, 175:107652.

Yu, S., Ma, J., Zhang, X., and Sacchi, M. D. (2015). Interpolation and denoising of high-dimensional seismic data by learning a tight frame. Geophysics, 80:V119–V132.

Yuan, X. and Yang, J. (2009). Sparse and low-rank matrix decomposition via alternating direction methods. preprint, 12(2).

Yuan, X. and Yang, J. (2013). Sparse and low-rank matrix decomposition via alternating direction methods. Pacific Journal of Optimization, 9:167–180.

Zhang, C.-H. et al. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942.

Zhang, F. (1997). Quaternions and matrices of quaternions. Linear Algebra and its Applications, 251:21–57.

Zhigang Tang, Jun Yang, and Bingru Yang (2010). A new outlier detection algorithm based on manifold learning. In 2010 Chinese Control and Decision Conference, pages 452–457.

Zhou, M., Liu, Y., Long, Z., Chen, L., and Zhu, C. (2019). Tensor rank learning in CP decomposition via convolutional neural network. Signal Processing: Image Communication, 73:12–21.

Zhou, T. and Tao, D. (2011). GoDec: Randomized low-rank & sparse matrix decomposition in noisy case. In International Conference on Machine Learning, pages 30–40.

Zhou, T. and Tao, D. (2013). Greedy bilateral sketch, completion & smoothing.
In International Conference on Artificial Intelligence and Statistics, pages 650–658. JMLR.org.

Zhou, X., Yang, C., Zhao, H., and Yu, W. (2014). Low-rank modeling and its applications in image analysis. ACM Computing Surveys (CSUR), 47(2):1–33.

Zhou, Z., Li, X., Wright, J., Candes, E., and Ma, Y. (2010a). Stable principal component pursuit. In 2010 IEEE International Symposium on Information Theory, pages 1518–1522. IEEE.

Zhou, Z., Li, X., Wright, J., Candes, E., and Ma, Y. (2010b). Stable principal component pursuit. In IEEE International Symposium on Information Theory, pages 1518–1522. IEEE.

Zhu, X., Xu, Y., Xu, H., and Chen, C. (2018). Quaternion convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631–647.