ROBUST ALGORITHMS ON LOW-RANK APPROXIMATION AND THEIR APPLICATIONS

By

Ningyu Sha

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computational Mathematics, Science and Engineering – Doctor of Philosophy
Statistics – Dual Major

2021

ABSTRACT

ROBUST ALGORITHMS ON LOW-RANK APPROXIMATION AND THEIR APPLICATIONS

By

Ningyu Sha

Low-rank approximation models have been widely developed in computer vision, image analysis, signal processing, web data analysis, bioinformatics, etc. Generally, we assume that the intrinsic data lies in a low-dimensional subspace, and we need to extract the low-rank representation from the given observations. There are many well-known methods, such as principal component analysis (PCA), factor analysis, and least squares. However, their performance may suffer when the data contain outliers. Robust PCA (RPCA) plays an important role in such cases, but RPCA-based methods suffer from expensive computational costs. In this thesis, we discuss how to improve the performance of RPCA in terms of both speed and accuracy, and we compare convex and nonconvex models. Notably, we propose a theory for matrix decomposition with unknown rank. We also propose a nonlinear RPCA approach under the assumption that the data lie on a manifold. Then, we take examples from seismic event detection and 2D image denoising. The numerical experiments show the robustness of our techniques and demonstrate speedups and higher recovery accuracy compared with existing approaches.

It is common in practice for observed data to have missing values, so we need to compute a low-rank approximation from incomplete data. Offline matrix completion may also take a long time, since we need to collect all data first; an online version can offer up-to-date results based on a continuous data stream. Online matrix completion has applications in computer vision and web data analysis, especially in video image transmission and recommendation systems. To better handle color images with three channels, we introduce online quaternion matrix completion, which produces an updated result for every newly observed entry using stochastic gradient descent on the quaternion matrix.

Copyright by
NINGYU SHA
2021

ACKNOWLEDGEMENTS

Firstly, I would like to thank my advisor and committee chair, Dr. Ming Yan. He is very patient and knowledgeable, and he laid out the big picture of optimization for me. Whenever I have any difficulty or questions, he is more than willing to help. I also want to thank my co-advisor, Dr. Yuying Xie. He is humorous and warm-hearted, and he gives me valuable advice from the statistics side. I would also like to thank my other committee members, Dr. Matthew Hirn, Dr. Yuehua Cui, and Dr. Haolei Weng, for their knowledgeable feedback and kind suggestions for my research life. I want to thank our research partners, Dr. Youzuo Lin, Dr. Lei Shi, Dr. Rongrong Wang, He Lye, and Shuyang Qin. Thank you for the brilliant work on our projects. Also, I would like to thank my friends Yuning Hao, Hongnan Wang, Yun Song, Yuejiao Sun, Lijiang Xu, Jieqian He, Binbin Huang, Hao Wang, Jialin Qu, Peide Li, Mi Hu, Ken Lee, Runze Su, Yao Xuan, and Yao Li for their huge support and companionship during my Ph.D. I also want to thank the CMSE community, Lisa Roy, Heather Williams, Dr. Andrew Christlieb, etc. They are always there to offer help. I also want to thank Erica Schmittdiel, who stood by me through the darkest times.
Most importantly, I would like to thank my parents, Chunhua Sha and Hui Peng. Their selfless love is continued support to make me a better person, always. I also need to thank a lot of people, even if I have forgotten their names. iv TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x CHAPTER 1 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Applications of robust low-rank optimization . . . . . . . . . . . . . . . . . . 1 1.2 Existing work on RPCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Overview of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 CHAPTER 2 ROBUST PRINCIPAL COMPONENT ANALYSIS FOR LOW RANK MATRIX APPROXIMATION . . . . . . . . . . . . . . . . . . . . . 11 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Proposed algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1 Forward-backward . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2.1.1 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . 21 2.2.2 An accelerated algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.3.1.1 Low-rank matrix recovery . . . . . . . . . . . . . . . . . . . 29 2.3.1.2 Robustness of the model . . . . . . . . . . . . . . . . . . . . 30 2.3.1.3 Low-rank matrix recovery with missing entries . . . . . . . . 31 2.3.2 Real image experiment . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.4.1 Nonconvex penalties on the singular values . . . . . . . . . . . . . . . 35 2.4.2 Other regularization on the sparse component . . . . . . . . . . . . . 36 2.4.3 Constrained problems . . . . . . . . . . . . . . . . . . . . . . . . . . 36 CHAPTER 3 ROBUST PRINCIPAL COMPONENT ANALYSIS FOR SEISMIC EVENT DETECTION . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.2.1 New algorithms with infimal convolution . . . . . . . . . . . . . . . . 41 3.2.1.1 Comparison between PGM and IC-PGM . . . . . . . . . . . 42 3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.1 Synthetic seismic data . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3.2 Field seismic data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 v CHAPTER 4 MANIFOLD DENOISING BY NONLINEAR ROBUST PRINCI- PAL COMPONENT ANALYSIS . . . . . . . . . . . . . . . . . . . . 49 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3 Geometric explanation . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . 52 4.4 Optimization algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.5 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 CHAPTER 5 ONLINE MATRIX COMPLETION WITH QUATERNION MATRIX 58 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5.2 Introduction on Quaternion Matrices . . . . . . . . . . . . . . . . . . . . . . 60 5.2.1 Quaternion Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.2.2 Basic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.3 Singular Value Decomposition . . . . . . . . . . . . . . . . . . . . . . 62 5.2.4 Incoherence Condition . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.2.5 Sampling Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.3 Online Matrix Completion Algorithms and its Theoretical Analysis . . . . . 63 5.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.4.1 The Hermitian Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.4.2 The General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.5 Numerical Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 vi LIST OF TABLES Table 2.1: Comparison of three RPCA algorithms. We compare the relative error of their solutions to the true low-rank matrix and the number of itera- tions. Both Alg. 2.1 and Alg. 2.2 have better performance than (Shen et al., 2019) in terms of the relative error and the number of iterations. Alg. 2.2 has the fewest iterations but the relative error could be large. It is because the true low-rank matrix is not the optimal solution to the optimization problem, and the trajectory of the iterations moves close to L? before it approaches the optimal solution. . . . . . . . . . . . . . . . . 30 Table 2.2: Performance of Alg. 2.2 on low-rank matrix recovery with missing entries. We change the level of sparsity in the sparse noise, standard deviation of the Gaussian noise, and the ratio of missing entries. . . . . . . . . . . 32 Table 3.1: Comparison of six algorithms. IC-ADMM is the fastest, which is the same as synthetic data. The function value for MCP is smaller because of a different model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 vii LIST OF FIGURES Figure 1.1: Image from https://link.medium.com/bAUJGpEl5hb. Height and weight are correlated. If visualized, these two vectors have an acute angle (left figure). After using PCA, height and weight are combined as a new feature ‘size’ (right figure). There is also ‘other’ information left that is orthogonal to ‘size’. The short length means that it is less important. . . 2 Figure 1.2: Image from https://link.medium.com/BbeEPVIl5hb. PCA works well for clean linear data. However, it works poorly for data with outliers. . . 3 Figure 1.3: An application of RPCA. Image from (Zhou et al., 2014). Top row: four frames from a video. Middle row: video background, which is viewed as a low-rank approximation. Bottom row: main objects which correspond to sparse components. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
4 Figure 1.4: Image from https://link.medium.com/ERplrHLl5hb. Rating system for Netflix films. Each row represents each user while each column repre- sents each film. Values from 1 to 5 are scores. . . . . . . . . . . . . . . . 5 Figure 2.1: The contour map of the relative error to L? for different parameters. In this experiment, we set r = 25 and s = 20. The upper bound of the rank is set to be p = 30. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 Figure 2.2: The relative error to the true low-rank matrix vs the rank p for Shen et al.’s and Alg. 2.2. Alg. 2.2 is robust to p, as long as p is not smaller than the true rank 25. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Figure 2.3: The numerical experiment on the ‘cameraman’ image. (A-C) show that the proposed model performs better than Shen et al.’s both visually and in terms of RE and PSNR. (D) compares the objective values vs time for general SVD, Alg. 2.1, and Alg. 2.2. Here f ? is the value obtained by Alg. 2.2 with more iterations. It shows the fast speed with the Gauss- Newton approach and acceleration. With the Gauss-Newton approach, the computation time for Alg. 2.1 is reduced to about 1/7 of the one with standard SVD (from 65.11s to 8.43s). The accelerated Alg. 2.2 requires 5.2s, though the number of iterations is reduced from 3194 to 360. 33 viii Figure 2.4: The numerical experiment on the ‘Barbara’ image. (A-C) show that the proposed model performs better than Shen et al.’s both visually and in terms of RE and PSNR. (D) compares the objective values vs time for general SVD, Alg. 2.1, and Alg. 2.2. Here f ? is the value obtained by Alg. 2.2 with more iterations. It shows the fast speed with the Gauss- Newton approach and acceleration. With the Gauss-Newton approach, the computation time for Alg. 2.1 is reduced to less than 1/3 of the one with standard SVD (from 148.6s to 43.7s). The accelerated Alg. 2.2 requires 23.3s, though the number of iterations is reduced from 3210 to 300. 34 Figure 3.1: Comparison of recovered results on synthetic seismic data with 500 re- ceivers and 1000 measurements at each receiver. (a) simulated clean data. (b) noisy data (-26.2 dB). (c) recovered data by L1 (13.4 dB). (d) recovered data by MCP (13.9 dB). (e) recovered sparse noise by L1 . (f) recovered sparse noise by MCP. (g) the difference between the clean data and the recovered one by L1 . (h) the difference between the clean data and the recovered one by L1 . (e-h) zoom-in over receivers 150-350 and measurements 1-400. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Figure 3.2: Comparison of five algorithms (PGM, FISTA, IC-PGM, IC-FISTA, IC- ADMM) for the convex RPCA on synthetic data. IC-ADMM has the fastest convergence rate and smallest computational time. IC technique improves the performance of PGM and FISTA significantly. . . . . . . . 45 Figure 3.3: Noisy data generated in Oklahoma. . . . . . . . . . . . . . . . . . . . . . 46 Figure 3.4: Recovered results of the real data with two models. . . . . . . . . . . . . 47 Figure 4.1: NRPCA applied to the noisy Swiss roll data set. X̃ − Ŝ is the result after subtracting the estimated sparse noise via NRPCA with T = 1; “ X̃ − Ŝ with one neighbor update” is that with T = 2, i.e., patches are reassigned once; X̂ is the denoised data obtained via NRPCA with T = 2; “Patch-wise Robust PCA” refers to the ad-hoc application of the vanilla RPCA to each local patch independently, whose performance is clearly worse than the proposed joint-recovery formulation. . . . . . . 
. 56 Figure 4.2: Laplacian eigenmaps and Isomap results for the original and the NR- PCA denoised digits 4 and 9 from the MNIST dataset. . . . . . . . . . . 57 Figure 5.1: A movie rating system. For a given d1 ×d2 low-rank matrix with missing entries, it can be factorized by a d1 × k user matrix and a k × d2 item matrix where k is the rank of the original matrix. . . . . . . . . . . . . . 59 ix Figure 5.2: Loss function value versus number of iterations for the small Hermitian case. The stepsize is 3e−5 . Within 40000 iterations, the value decreases from nearly 100 to 10−6 . The loss function value tends to keep decreasing after these 40000 iterations. . . . . . . . . . . . . . . . . . . . . . . . . . 84 Figure 5.3: Loss function value versus number of iterations for the small general quaternion matrix case. The stepsize is tuned to be 1e−4 . Within 5000 iterations, the value decreases from around 100 to 10−4 . The loss func- tion value tends to keep decreasing after these 5000 iterations. . . . . . . 85 Figure 5.4: Online image recovery result after 10000 iterations. We randomly sam- pled 1000o observations from (a). We can see that the result for recov- ered image (b) is not good. We expect that the difference between (a) and (b) can be as small as possible. . . . . . . . . . . . . . . . . . . . . . 86 Figure 5.5: Loss function value versus number of iterations for the real color image. The initial stepsize is tuned to be 1e−5 . For every 300 iterations, we multiply the stepsize by 0.95. The loss function value decreases from around 170 to almost 45. At the beginning, the loss function value decreases the most, and it tends to keep decreasing after these 10000 iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 x LIST OF ALGORITHMS 2.1 RPCA for low rank matrix approximation . . . . . . . . . . . . . . . . . . . . 21 2.2 Accelerated RPCA with nonmonotone APG . . . . . . . . . . . . . . . . . . . 28 4.1 Nonlinear RPCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.1 Online learning algorithm for the Hermitian matrix M . . . . . . . . . . . . . 82 5.2 Online learning algorithm for general M (theoretical version) . . . . . . . . . 83 5.3 Online learning algorithm for general M (practical version) . . . . . . . . . . 83 xi CHAPTER 1 BACKGROUND 1.1 Applications of robust low-rank optimization Many high-dimensional data points can be represented as points in a low-dimensional sub- space of a high-dimensional space, and principal component analysis (PCA) is a popular tool to find the low-dimensional subspace. Here, we use a simple example to explain the intuition behind PCA. Height and weight are two measurements (features) for football players, and we say that the original dimension of these measurements is two. That is, the data for each player lies in a two-dimensional space. Obviously, there is a positive correlation between these two measurements. If one person is taller than another one, he/she is usually heavier too. More specifically, these two measurements are linearly correlated, as shown in Fig. 1.1. In this example, we can use one new variable named size to approximately describe the combination of height and weight at a high accuracy. In fact, it is also applied to the size of clothes. Generally, correlations within features happen when there are more than two features. Mathematically, we say that these features are not linearly independent. 
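As a concrete (and purely hypothetical) illustration of this correlation, the short NumPy sketch below simulates height and weight measurements for a group of players and computes their sample correlation; all numbers and names are made up for illustration and are not from this thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical measurements for 100 players (units: cm and kg).
height = 175 + 10 * rng.standard_normal(100)
weight = 0.9 * (height - 100) + 4 * rng.standard_normal(100)  # weight tracks height

corr = np.corrcoef(height, weight)[0, 1]
print(f"sample correlation between height and weight: {corr:.2f}")  # strongly positive
```

A correlation close to one means that the two columns of the data matrix are nearly linearly dependent, which is exactly the situation in which a single combined feature such as 'size' captures most of the information.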
PCA finds a lin- ear transform such that the new features after the transformation are linearly independent. In addition, PCA finds the most important features that represent the most information (variance) in the dataset. For example, the first principal component explains the most variance in the dataset, and the second principal component is the vector that is orthogo- nal to the first principal component and explains the second most variance in the dataset. PCA can be calculated from the singular value decomposition (SVD) of the data matrix, and the singular values represent the variance in the data corresponding to the principal components, which are corresponding right eigenvectors. Then, we keep the principal com- ponents corresponding to the largest singular values to represent most information using a 1 Figure 1.1: Image from https://link.medium.com/bAUJGpEl5hb. Height and weight are correlated. If visualized, these two vectors have an acute angle (left figure). After using PCA, height and weight are combined as a new feature ‘size’ (right figure). There is also ‘other’ information left that is orthogonal to ‘size’. The short length means that it is less important. small number of features. PCA has been applied to many different areas, such as computer vision, bioinformatics, finance, psychology, etc. PCA removes the principal components corresponding to smallest singular values because the data points with small random distortions are not exactly on the low-dimensional space. The small random distortions are equivalent to adding small random values to all singular values. Therefore, by setting small singular values as zero, PCA can reduce the effect of these distortions. In the left figure of Figure 1.2, with the small distortions, PCA is able to get the first principal component, shown in red, and it is very close to the true direction shown in blue. However, in many real cases, there are outliers accompanied with the data set. In statistics, an outlier is a data point that differs significantly from other observations. It can be caused by equipment error or the population that has a heavy-tailed distribution. Outliers usually have large distance from normal data points, but the total number of outliers is much smaller than the total number of data points. In the case with outliers, PCA may not perform well. For example, in the right figure of Figure 1.2, there are four outliers, and PCA will give the red direction, which is very different from the true blue one. 2 (a) Yellow dots are sampled from one (b) Yellow dots are sampled from one dimensional subspace (blue line) cor- dimensional subspace (blue line) cor- rupted by Gaussian noise. The red line rupted by large sparse outliers. PCA is is the output of PCA. It can reserve affected by outliers and can not capture most variance. the data well. Figure 1.2: Image from https://link.medium.com/BbeEPVIl5hb. PCA works well for clean linear data. However, it works poorly for data with outliers. Standard PCA can not deal with outliers because it is based on the assumption that the additional noise follows a Gaussian distribution. However, outliers follow heavy-tail distributions and a different model is required. Robust PCA (RPCA) is one approach to remove the outliers while preserving the low-dimensional structure. 
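The failure mode in the right panel of Figure 1.2 is easy to reproduce numerically. The sketch below (an illustrative example, not code from this thesis; all names and sizes are assumptions) samples points near a one-dimensional subspace, corrupts a few of them with large outliers, and compares the leading principal direction computed by standard PCA in both cases.

```python
import numpy as np

rng = np.random.default_rng(0)

# True one-dimensional subspace: the line spanned by this unit vector.
true_dir = np.array([1.0, 0.5])
true_dir /= np.linalg.norm(true_dir)

# 200 points on the line with small Gaussian perturbations (left panel of Fig. 1.2).
t = rng.standard_normal(200)
clean = np.outer(t, true_dir) + 0.05 * rng.standard_normal((200, 2))

# The same points with a few large sparse outliers added (right panel of Fig. 1.2).
corrupted = clean.copy()
idx = rng.choice(200, size=4, replace=False)
corrupted[idx] += rng.uniform(-5.0, 5.0, size=(4, 2))

def leading_direction(X):
    # First principal direction = top right singular vector of the centered data.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

for name, X in [("Gaussian noise only", clean), ("with sparse outliers", corrupted)]:
    d = leading_direction(X)
    angle = np.degrees(np.arccos(min(1.0, abs(float(d @ true_dir)))))
    print(f"{name:>21}: angle to the true direction = {angle:.2f} degrees")
```

On typical runs, the direction estimated from the merely Gaussian-perturbed data stays close to the true one, while the estimate from the outlier-corrupted data deviates noticeably; this sensitivity to a handful of gross errors is the problem that RPCA is designed to fix.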
This approach has been successfully applied in a wide range of areas, including computer vision (De la Torre and Black, 2001), image processing (Liu et al., 2012; Elhamifar and Vidal, 2013), dimensionality reduction (Cunningham and Ghahramani, 2015), bioinformatics data analysis (Da Costa et al., 2009), and web data and services (Koren, 2009). More specifically, RPCA has achieved great success in video surveillance and face recognition (Candès et al., 2011; Bouwmans and Zahzah, 2014). Also, it has been proved by Candès et al. (2011) that the low-rank part can be a good approximation of the original complete matrix if the data satisfies some assumptions. RPCA assumes that we observe all entries of the full noisy matrix. Compared with standard PCA, RPCA utilizes the difference between the low-dimensional data structure and outliers. Also, different from two-stage methods, which first detect the outliers and remove them, RPCA describes the outliers as a sparse matrix and decompose the noisy 3 matrix into the sum of two or three matrices, with or without the Gaussian noise. In some applications, the outliers contain useful information. For example, in video surveillance, the low-rank part preserves the stationary background, whereas the sparse part can capture a moving object or person in the foreground. See Figure 1.3 for an example with four frames in a video. Figure 1.3: An application of RPCA. Image from (Zhou et al., 2014). Top row: four frames from a video. Middle row: video background, which is viewed as a low-rank approximation. Bottom row: main objects which correspond to sparse components. RPCA separates a data matrix into the sum of a low-rank matrix, a sparse matrix (outliers), and a small noise matrix (Gaussian noise). If there is no Gaussian distortions, the small noise matrix is just a zero matrix. However, in practice, we are often confronted with such a situation where the collected data is incomplete. The good thing is that a low-rank matrix can be recovered from only a few entries of the matrix. For example, an m × n rank-one matrix can be recovered by one row and one column. When the given noisy matrix has missing entries, it is a low-rank matrix completion problem, that is, we will remove the outliers and recover the whole low-rank matrix. For example, in a film rating system, as shown in Figure 1.4, each user rates the films they watched. However, this data matrix tends 4 to be incomplete because people cannot watch all film showed in the rating website. Matrix completion techniques can predict the rates of each user for unwatched films and promote to the user the films they may like but did not watch. Figure 1.4: Image from https://link.medium.com/ERplrHLl5hb. Rating system for Netflix films. Each row represents each user while each column represents each film. Values from 1 to 5 are scores. 1.2 Existing work on RPCA Assuming that the true data points lie on a low-dimensional subspace, we use a matrix D ∈ Rn×p to represent the observed noisy data, where n is the number of samples and p is the dimension of each sample. We also assume that the intrinsic dimension of the data is k < p. Therefore, in the linear case, k is also the rank of the matrix L0 consisting of the true samples. In this case, we have the following model, D = L 0 + N0 , (1.1) 5 where L0 is a low-rank matrix with rank k and N0 is a small perturbation matrix (Gaussian noise). Then the classical PCA solves the following problem minimize kD − Lk2F L subject to rank(L) = k. 
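By the Eckart–Young theorem, this rank-constrained least-squares problem is solved by truncating the SVD of D to its k largest singular values. A minimal sketch follows (illustrative code with made-up sizes, not from the thesis):

```python
import numpy as np

def pca_lowrank(D, k):
    # Best rank-k approximation of D in the Frobenius norm (Eckart-Young).
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
L0 = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 60))  # true rank-3 matrix
D = L0 + 0.01 * rng.standard_normal((100, 60))                     # small Gaussian noise N0

L = pca_lowrank(D, k=3)
print("relative error:", np.linalg.norm(L - L0) / np.linalg.norm(L0))
```

With only small Gaussian noise, the truncated SVD recovers L0 to high accuracy; the discussion below explains why this construction breaks down when some entries are corrupted by large outliers.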
The classical PCA assumes that the noise follows a Gaussian distribution. When only some components of the matrix are affected by large noise and the distribution of the noise is unknown, we consider RPCA, which assumes that D = L0 + S0 , with S0 being a sparse matrix. Because we do not have any knowledge about the distribution of the noise, the components can be considered as damaged and should be removed. If we know the locations, then we can remove them and fill-in new values for those locations. However, the locations are not given, and we have to find the sparse matrix S0 and the low-rank matrix L0 together using RPCA algorithms. RPCA is an inverse problem to recover L and S from the matrix D, which can be realized via solving the idealized nonconvex problem minimize rank(L) + λkSk0 , subject to L + S = D, (1.2) L,S where λ is a parameter to balance the two objectives and kSk0 counts the number of non- zero entries in S. However, this problem is NP-hard in general (Amaldi and Kann, 1998). Therefore, much attention is focused on the following convex relaxation: minimize kLk∗ + λkSk1 , subject to L + S = D. (1.3) L,S Here k · k∗ and k · k1 denote the nuclear norm and `1 −norm of a matrix, respectively. This is called principal component pursuit (PCP) (Zhou et al., 2010a). When Gaussian noise is also involved, we consider D = L + S + N, (1.4) 6 where N is the Gaussian noise. We can set the noise level to be  and use the Frobenius norm to measure the Gaussian noise. A relaxed version of PCP is defined as minimize kLk∗ + λkSk1 , subject to kL + S − Dk2F ≤ . (1.5) L,S This constrained optimization problem is also equivalent to the following unconstrained one 1 minimize kL + S − Dk2F + kLk∗ + λkSk1 . (1.6) L,S 2µ Classical optimization algorithms such as proximal gradient method (PGM), accelerated PGM, and alternating direction method of multipliers (ADMM) have been used to solve the unconstraned problem (1.6). Let’s briefly go through these algorithms. For PGM, we can view the loss function as two parts corresponding with L and S respectively. Then the PGM takes two steps. The first step is the gradient descent, and the second step is to calculate the proximal functions. One iteration is described as  L̂k = Lk − t (Lk + Sk − D),  µ     t k    k k k Ŝ = S − (L + S − D),   µ (1.7) 1 Lk+1 = arg min tkLk∗ + kL − L̂k k2F ,        L 2 1   Sk+1 = arg min tλkSk1 + kS − Ŝk k2F .   S 2 To solve the second proximal function, we can directly use soft-thresholding on each entry of the matrix. To solve the first proximal function, we need to use SVD to calculate the singular values and do the soft-thresholding on the singular values. We can set the initial condition L = 0 and S = 0. As for accelerated proximal gradient method, we use FISTA (Beck and Teboulle, 2009b) as one example. The main change compared with the original PGM is that the shrinkage operator is not used on the previous point L̂k but on the linear combination of the previous two points L̂k , L̂k−1 as follows: θk−1 − 1 k L̄k = Lk + (L − Lk−1 ) (1.8) θk 7 √ 1+ 1+4θk2 where θk+1 = 2 . Generally, it updates variables starting with θ−1 = θ0 = 1. It has been proved that FISTA has improved the convergence rate from O(1/k) to O(1/k 2 ) for general convex problems. When it goes to ADMM, we need to convert the original problem to a constrained form 1 minimize kZk2F + λkSk1 + kLk∗ , subject to Z + L + S = D. (1.9) Z,L,S 2µ Unfortunately, its convergence for general three blocks has not been proved to converge. 
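Both proximal steps in (1.7), as well as the S- and L-subproblems inside the ADMM iteration discussed next, reduce to two closed-form shrinkage operations: entrywise soft-thresholding for the l1 penalty and soft-thresholding of the singular values (singular value thresholding) for the nuclear norm. The following is a hedged sketch with illustrative names, sizes, and parameter values, not the thesis code:

```python
import numpy as np

def prox_l1(A, tau):
    # Entrywise soft-thresholding: proximal operator of tau * ||.||_1.
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def prox_nuclear(A, tau):
    # Singular value thresholding: proximal operator of tau * ||.||_* .
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def pgm_step(L, S, D, mu, lam, t):
    # One proximal gradient iteration for (1.6), following (1.7).
    grad = (L + S - D) / mu                    # gradient of the smooth term
    L_hat, S_hat = L - t * grad, S - t * grad  # forward (gradient) step
    return prox_nuclear(L_hat, t), prox_l1(S_hat, t * lam)  # backward (proximal) step

# Example run starting from L = S = 0 on a random matrix (illustrative parameters).
rng = np.random.default_rng(2)
D = rng.standard_normal((50, 40))
L, S = np.zeros_like(D), np.zeros_like(D)
for _ in range(100):
    L, S = pgm_step(L, S, D, mu=1.0, lam=0.1, t=0.4)
```

Every iteration of PGM, FISTA, and ADMM evaluates prox_nuclear, so the full SVD inside it dominates the cost for large matrices.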
But in practice, it usually works very well. In order to solve (1.9), we need to establish the augmented Lagrangian function 1 β Lβ (Z, S, L, Q) = kZk2F +λkSk1 +kLk∗ −βhQ, Z+S+L−Di+ kZ+S+L−Dk2F . (1.10) 2µ 2 Alternatively, we consider 0 1 β Lβ (Z, S, L, Q) = kZk2F + λkSk1 + kLk∗ + kZ + S + L − D − Qk2F , (1.11) 2µ 2 0 with Lβ (Z, S, L, Q) = Lβ (Z, S, L, Q) + β2 kQk2F if we want to get the optimal variables Z, S 0 and L, respectively. Then we minimize Lβ alternatingly.  0    Zk+1 := arg min Lβ (Z, Sk , Lk , Qk );   Z    0 Sk+1 := arg min Lβ (Zk+1 , S, Lk , Qk );   S (1.12) 0 Lk+1 := arg min Lβ (Zk+1 , Sk+1 , Lk , Qk );       L    Qk+1 := Qk − (Zk+1 + Sk+1 + Lk+1 − D).  More specifically, 1 β    Zk+1 := arg min kZk2F + kZ + Sk + Lk − D − Qk k2F ;     Z 2µ 2   β Sk+1 := arg min λkSk1 + kZk+1 + S + Lk − D − Qk k2F ;   S 2 (1.13)  k+1 β := arg min kLk∗ + kZk+1 + Sk+1 + L − D − Qk k2F ;   L 2     S   Qk+1 := Qk − (Zk+1 + Sk+1 + Lk+1 − D).  8 All these approaches need to find the proximal of the nuclear norm, which requires SVD. When the matrix size is large, the SVD computation is very expensive and dominates other computation (Trefethen and Bau III, 1997). 1.3 Overview of this thesis We briefly introduce the following chapters in this thesis. In Chapter 2, we study the theoretical parts of low-rank approximation. There are mainly two types of algorithms for RPCA. The first type of algorithm applies regularization terms on the singular values of a matrix to obtain the low-rank matrix. However, calculating singular values can be very expensive for large matrices. The second type of algorithm replaces the low-rank matrix as the multiplication of two smaller matrices. They are faster than the first type because no SVD is required. However, the rank of the low-rank matrix is required, and an accurate rank estimation is required to obtain a reasonable solution. In this chapter, we propose algorithms that combine both types. Our proposed algorithms require an upper bound of the rank and SVD on small matrices. First, they are faster than the first type because the cost of SVD on small matrices is negligible. Second, they are more robust than the second type because an upper bound of the rank instead of the rank is required. Numerical experiments show the good performance of our proposed algorithms. In Chapter 3, we take examples from seismic data. Seismic events are usually buried in noise. Our goal is to discover the underlying signal from noise. RPCA based seismic denoising approaches yield promising results in separating useful seismic events from noise. However, current RPCA-based methods suffer from expensive computational costs, which hinders their wide applications in seismic data denoising and preprocessing. In this work, we develop a cost-effective denoising technique based on RPCA. Instead of solving for the clean data and noise simultaneously, we alternatively update them. This approach admits a large stepsize and increases the speed. In addition, we improve the model by incorporating a nonconvex term. To verify the effectiveness of our technique, we applied our denoising 9 technique to both synthetic and field reflection seismic data. From the numerical results, we observe that our denoising methods not only produce comparable or better denoising results but also yield efficient computational cost. Through comparison to other RPCA- based denoising methods, our method is at least 4-10x faster. In Chapter 4, we extend RPCA to nonlinear manifolds. 
Suppose that the data matrix contains a sparse component and a component drawn from some low-dimensional manifold. Is it possible to separate both components by using the low dimensionality assumption of the manifold? Is there a benefit to treat the manifold as a whole as opposed to treating each local region individually? We answer these two questions affirmatively by proposing an optimization framework that separates these two components from noisy data. The efficacy of the proposed method is demonstrated on both synthetic and real dataset. The three chapters consider offline algorithms, in which we have access to all data before the algorithm is applied. In the last chapter, we applied low-rank approximation on online matrix completion. Online optimization is more and more useful nowadays in data science and machine learning areas. As we know, for offline setting, we need to get all observations first and use all information to train the model. On the other side, online algorithms aim to make decision sequentially based on sequentially sampled data. For example, in many applications, like film recommendation systems, the user will give one rating for a film each time after watching. In this case, we can only get one observation each time. There are many work that aims to develop algorithms which can give updated result after every new observed entry. In Chapter 5, we consider the completion of a quaternion matrix, whose entries are quaternion numbers. Each quaternion number has one real number and three imaginary number. Therefore, it can be applied to color image with or without depth. We develop and analyze an online matrix completion algorithm for quaternion matrices. We want to find a decomposition of the matrix such that the matrix is a product of two small matrices, In this algorithm, after we receive an entry from the matrix, we update one row and one one column of the two small matrices, respectively. 10 CHAPTER 2 ROBUST PRINCIPAL COMPONENT ANALYSIS FOR LOW RANK MATRIX APPROXIMATION 2.1 Introduction Robust principal component analysis (RPCA) decomposes a data matrix into a low-rank part and a sparse part. It has applications in a wide range of areas, including computer vision (De la Torre and Black, 2001), image processing (Liu et al., 2012; Elhamifar and Vidal, 2013), dimensionality reduction (Cunningham and Ghahramani, 2015), and bioinformatics data analysis (Da Costa et al., 2009). More specifically, the RPCA model has achieved great success in video surveillance and face recognition (Candès et al., 2011; Bouwmans and Zahzah, 2014). For example, in video surveillance, the low-rank part preserves the stationary background, whereas the sparse part can capture a moving object or person in the foreground. We first assume that the data matrix D is obtained by the sum of a low-rank matrix and a spare matrix. That is D = L + S, where L is a low-rank matrix and S is a sparse matrix, which has only a few nonzero components. RPCA is an inverse problem to recover L and S from the matrix D, which can be realized via solving the idealized nonconvex problem minimize rank(L) + λkSk0 , subject to L + S = D, (2.1) L,S where λ is a parameter to balance the two objectives and kSk0 counts the number of non- zero entries in S. However, this problem is NP-hard in general (Amaldi and Kann, 1998). Therefore, much attention is focused on the following convex relaxation minimize kLk∗ + λkSk1 , subject to L + S = D. 
(2.2) L,S 11 Here k · k∗ and k · k1 denote the nuclear norm and `1 −norm of a matrix, respectively. It is shown that under mild conditions, the convex model (2.2) can exactly recover the low-rank and sparse parts with high probabilities (Candès et al., 2011). When additional Gaussian noise is considered, we can set the noise level to be  and use the Frobenius norm k · kF to measure the reconstruction error. Then, the problem becomes minimize kLk∗ + λkSk1 , subject to kL + S − Dk2F ≤ . (2.3) L,S Then, this constrained optimization problem is equivalent to the unconstrained problem 1 minimize kL + S − Dk2F + kLk∗ + λkSk1 (2.4) L,S 2µ with a trade-off parameter µ. There is a correspondence between the two parameters  and µ in (2.3) and (2.4), but the explicit expression does not exist. In this chapter, we will focus on the unconstrained problem (2.4), and the technique introduced in this chapter can be applied to the convex models (2.2) and (2.3). Please see Section 2.4 for more details. There are many existing approaches for solving (2.4) including the augmented Lagrange method (Lin et al., 2010; Bouwmans and Zahzah, 2014; Wright et al., 2009). Some examples are proximal gradient method for (L, S), alternating minimization for L and S (Shen et al., 2019), proximal gradient method for L after S is eliminated (Sha et al., 2019), alternating direction method of multipliers (ADMM) (Yuan and Yang, 2009; Tao and Yuan, 2011). All these approaches need to find the proximal of the nuclear norm, which require the singular value decomposition (SVD). When the matrix size is large, the SVD computation is very expensive and dominates all the computation (Trefethen and Bau III, 1997). Alternative approaches for RPCA use matrix decomposition (Wen et al., 2012) and do not require SVD. Assuming that the rank of L is known as p, we can decompose it as L = XY> , with X ∈ Rm×p and Y ∈ Rn×p . Then the following nonconvex optimization problem 1 minimize kXY> + S − Dk2F + λkSk1 , (2.5) X,Y,S 2 12 is considered. There are infinite many optimal solutions for this problem, since for any invertable matrix A ∈ Rp×p , (X, Y, S) and (XA−1 , YA> , S) have the same function value. In fact, for any matrix L with rank no greater than p, we can find L = XY> and Y> Y = Ip×p . Therefore, we can have an additional constraint Y> Y = Ip×p . The resulting problem still has infinite many optimal solutions, since for any orthogonal matrix A ∈ Rp×p , (X, Y, S) and (XA, YA, S) have the same function value. Though (X, Y) are not unique, the low-rank matrix L that we need is unique at probability one. This resulting problem was discussed in (Shen et al., 2019), and an efficient algorithm by alternatively minimizing XY> and S is provided. In this algorithm, a Gauss-Newton algorithm is applied to update XY> and increase the speed. Though the matrix decomposition problem is fast to solve, it is nonconvex and requires an accurate estimation of the rank of L. Fig. 2.2 in Section 2.3.1.2 demonstrates that a good estimation of the rank is critical. However, in most scenarios, we do not have the exact rank of L, but we can have an upper bound of the true rank. Therefore, we can combine the matrix decomposition and the nuclear norm minimization to have the benefits of both problems. The problem we consider in this chapter is 1 minimize kL + S − Dk2F + µkLk∗ + λkSk1 , subject to rank(L) ≤ p. (2.6) L,S 2 When µ = 0, the problem (2.6) is equivalent to (2.5). 
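To make the combined model concrete, the short sketch below (illustrative only; the matrix sizes are arbitrary and the parameter values are borrowed from the experiments later in this chapter) evaluates the objective of (2.6) for a candidate pair (L, S). Feasibility of the rank constraint is guaranteed by building L from two factors with p columns, and setting mu = 0 recovers the objective of the factorization model (2.5).

```python
import numpy as np

def objective(L, S, D, mu, lam):
    # Objective of (2.6): 0.5*||L + S - D||_F^2 + mu*||L||_* + lam*||S||_1.
    fit = 0.5 * np.linalg.norm(L + S - D, "fro") ** 2
    return fit + mu * np.linalg.norm(L, ord="nuc") + lam * np.abs(S).sum()

rng = np.random.default_rng(3)
m, n, p = 60, 50, 8
D = rng.standard_normal((m, n))

# L = X @ Y.T with X (m x p) and Y (n x p) automatically satisfies rank(L) <= p.
X, Y = rng.standard_normal((m, p)), rng.standard_normal((n, p))
L, S = X @ Y.T, np.zeros((m, n))

print("combined model (2.6):     ", objective(L, S, D, mu=0.6, lam=0.04))
print("factorization model (2.5):", objective(L, S, D, mu=0.0, lam=0.04))
```

The algorithms developed below exploit exactly this factorized parameterization while keeping the nuclear-norm term, so that only an upper bound p on the rank is needed.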
In addition, we consider the following more general problem 1 minimize kA(L) + S − Dk2F + µkLk∗ + λkSk1 , subject to rank(L) ≤ p. (2.7) L,S 2 where D is the measurement of A(L) contaminated with both Gaussian noise and a sparse component. Here A is a bounded linear operator that describes how the measurements are calculated. For example, in robust matrix completion, we let A be the restriction operator on the given components of the matrix L. Note that the alternating minimization algorithm in (Shen et al., 2019) can not be applied to this general problem because the subproblem for L can no longer be solved efficiently by 13 the Gauss-Newton method. We will show the equivalency of the alternating minimization algorithm in (Shen et al., 2019) as a proximal gradient method applied to a problem with L only. Then the subproblem of L in our general problem (2.7) can still be solved efficiently with the Gauss-Newton method. Please see more details in Section 2.2. For simplicity, we use the nuclear norm and `1 −norm for the low-rank and sparse matrices, respectively. The purpose of this chapter is to introduce a fast algorithm to solve a type of RPCA algorithm, while the comparison of different penalties is out of the scopes of this chapter. The contributions of this chapter are: • We propose a new model for RPCA, which combines the nuclear norm minimization and the matrix decomposition. The matrix decomposition brings efficient algorithms, and the nuclear norm miminization on a smaller matrix removes the requirement of the rank of the low-rank matrix. Note that the nuclear norm minimization can be replaced by other nonconvex penalties, and the results in this chapter are still valid. • We develop efficient algorithms to solve this problem and show its convergence. 2.1.1 Notation Throughout this chapter, matrices are denoted by bold capital letters (e.g., A), and operators are denoted by calligraphic letters (e.g., A). In particular, I denotes the identity matrix, 0 denotes the zero matrix (all entries equal to zero), and I denotes the identity operator. If there is potential for confusion, we indicate the dimension of the matrix with subscripts. For a matrix A, A> represents its transpose and A(:, j : k) denotes the matrix composed by the columns of A indexing from j to k. Let Ai,j be the (i, j) entry of A. The `1 −norm of A is given by kAk1 = i,j |Ai,j |. Denote the ith singular value of A by σi (A). The nuclear P norm of A is given by kAk∗ = i σi (A). We will use ∂k · k1 and ∂k · k∗ to denote the P subgradients of the `1 −norm and nuclear norm, respectively. The linear space of all m × n real matrices is denoted by Rm×n . For A, B ∈ Rm×n , the inner product of A, B is defined by 14 hA, Bi = Tr(A> B), which induces the Frobenius norm kAkF = Tr(A> A) = p pP 2 i σi (A). Let A be a linear bounded operator on Rm×n . The operator norm of A is given by kAk = sup{kA(A)kF : A ∈ Rm×n , kAkF = 1}. The adjoint operator of A denoted by A∗ is also linear and bounded on Rm×n such that hA(A), Bi = hA, A∗ (B)i. The notation is used to denote the component-wise multiplication. Additionally, for a function f : R → R, without further reference, f acting on a matrix A ∈ Rm×n specifies that f is evaluated on each entry of A, i.e., f (A) ∈ Rm×n with f (A)i,j = f (Ai,j ). For example, if f (x) = |x| − λ, we can denote f (A) ∈ Rm×n by |A| − λ with (|A| − λ)i,j = |Ai,j | − λ. 2.1.2 Organization The rest of the chapter is organized as follows. 
We introduce our proposed algorithms and show their convergence in Section 2.2. Then we conduct numerical experiments to compare the performance of our proposed algorithms with existing approaches in Section 2.3. In Section 2.4, we introduce some potential extension from this chapter. We end this chapter with a short conclusion. 2.2 Proposed algorithms The problem (2.6) is nonconvex because of the constraint rank(L) ≤ p. It has several equivalent formulations. E.g., it is equivalent to the following nonconvex weighted nuclear norm minimization problem: p min(m,n) 1 X X minimize kL + S − Dk2F + µ σi (L) + C σi (L) + λkSk1 , L,S 2 i=1 i=p+1 where C is a sufficiently large number such that the optimal L has at most p nonzero singular values. However, this formulation also requires the singular value decomposition of an m × n matrix in each iteration, which is expensive when m and n are large. We consider another equivalent problem with matrix decomposition in the following theorem. 15 Theorem 2.2.1. Problem (2.6) is equivalent to 1 minimize kXY> + S − Dk2F + µkXk∗ + λkSk1 , subject to Y> Y = Ip×p . (2.8) X,Y,S 2 More specifically, if (X, Y, S) is an optimal solution to (2.8), then (XY> , S) is an optimal solution to (2.6). If (L, S) is an optimal solution to (2.6) and we have the decomposition L = XY> with Y> Y = Ip×p , then (X, Y, S) is an optimal solution to (2.8). Proof. For any matrix L ∈ Rm×n with rank no greater than p, we can have the decomposition L = XY> , with Y> Y = Ip×p . This decomposition is not unique, and one decomposition can be easily obtained from the compact SVD of L. Let L = Up Σp Vp> be the SVD of L with a square p × p matrix Σp , we have Vp> Vp = Ip×p . Thus, problem (2.6) is equivalent to 1 minimize kXY> + S − Dk2F + µkXY> k∗ + λkSk1 , subject to Y> Y = Ip×p . X,Y,S 2 For any X ∈ Rm×p , let X = UΣV> be its SVD with U ∈ Rm×p and V ∈ Rp×p . We have XY> = UΣV> Y> = UΣ(YV)> . Since (YV)> (YV) = V> Y> YV = V> V = Ip×p . The SVD of XY> is UΣ(YV)> , and kXY> k∗ = pi=1 Σii = kXk∗ . Thus, problem (2.6) is equivalent to (2.8). P Next, we consider problem (2.8) with S fixed. When S is fixed, it becomes a problem of L = XY> , and solving this problem is to find the proximal operator of the corresponding nonconvex weighted nuclear norm, which is denoted as 1 minimize kL − Mk2F + µkLk∗ , subject to rank(L) ≤ p, (2.9) L 2 or equivalently 1 minimize kXY> − Mk2F + µkXk∗ , subject to Y> Y = Ip×p , (2.10) X,Y 2 where M = D − S. 16 Theorem 2.2.2. Let q = min(m, n). Problem (2.9) can be solved in two steps: 1. Find the compact SVD of M = UΣV> , with Σ = diag(σ1 (M), · · · , σq (M)) satisfying σ1 (M) ≥ σ2 (M) ≥ · · · ≥ σq (M); 2. Construct a diagonal matrix Σ̂µ ∈ Rp×p with (Σ̂µ )ii = max(Σii −µ, 0), then one solution of (2.9) is U(:, 1 : p)Σ̂µ V(:, 1 : p)> . In addition, for any orthogonal matrix A ∈ Rp×p , (U(:, 1 : p)Σ̂µ A, V(:, 1 : p)A) is an optimal solution of (2.10). Proof. Given any L ∈ Rm×n with rank(L) ≤ p, let σ1 , σ2 , · · · , σq be its singular values in the decreasing order such that σp+1 = · · · = σq = 0. Note that the main diagonal entries of Σ are the singular values of M. According to the von-Neumann trace inequality (Horn and Johnson, 2012, Theorem 7.4.1.1), one can bound the matrix inner product by the singular values, i.e., hL, Mi ≤ qi=1 σi Σii . 
Then we have P 1 1 1 kL − Mk2F + µkLk∗ = kLk2F + kMk2F − hL, Mi + µkLk∗ 2 2 2 q q q q 1 X 2 1X 2 X X ≥ σ + Σ − σi Σii + µ σi (2.11) 2 i=1 i 2 i=1 ii i=1 i=1 p q p p 1X 2 1X 2 X X = σi + Σii − σi Σii + µ σi , 2 i=1 2 i=1 i=1 i=1 where the equality is satisfied when L has a simultaneous SVD with M through U and V. Therefore, the optimal L minimizing 21 kL − Mk2F + µkLk∗ can be selected from the matrices that have a simultaneous SVD with M through U and V. Then we can assume that the optimal L satisfies L = Udiag(σ1 , · · · , σp , σp+1 , · · · , σq )V> = U(:, 1 : p)diag(σ1 , · · · , σp )V(:, 1 : p)> , where the last equality holds because of the fact that σp+1 = · · · = σq = 0. Next, one can construct an optimal L of the above form by letting σi = max(Σii − µ, 0) for i = 1, 2, · · · , p, which minimizes the last equation in (2.11). Thus U(:, 1 : p)Σ̂µ V(:, 1 : p)> minimizes the objective function of (2.9) over all L ∈ Rm×n with rank no greater than p. 17 By the same argument in the proof of Theorem 2.2.1, we see that problem (2.10) is equivalent to problem (2.9). Since for any orthogonal matrix A ∈ Rp×p , there hold L = (U(:, 1 : p)Σ̂µ A)(V(:, 1 : p)A)> and (V(:, 1 : p)A)> (V(:, 1 : p)A) = A> A = Ip×p . Therefore, (U(:, 1 : p)Σ̂µ A, V(:, 1 : p)A) is an optimal solution of problem (2.10). The first step to solve problem (2.10) in the previous theorem requires the truncated SVD of an m × n matrix M. Since we only need the first p (p < q = min(m, n)) singular values, we use the Gauss-Newton algorithm to find (X, Y) alternatively. In this approach, we require the SVD of a m × p matrix, which is much faster than the truncated SVD of a m × n matrix when p is small. In addition, we use the previous X as the initial guess in the next iteration to reduce the number of inner iterations for the Gauss-Newton algorithm. Lemma 2.2.3. If the rank of M ∈ Rm×n is larger than p, problem (2.10) can be solved in the following three steps: 1. Find X̂ ∈ Rm×p (p < m) by solving the following optimization problem 1 minimize kXX> − MM> k2F ; X 2 2. Y = M> X̂(X̂> X̂)−1 ; 3. Let X̂ = Up Σ̂A be its thin SVD with Σ̂ ∈ Rp×p and choose X as X = Up Σ̂µ A with (Σ̂µ )ii = max(0, Σ̂ii − µ) for i = 1, . . . , p. Then (X, Y) is a solution of problem (2.10). Proof. Given any X ∈ Rm×p , let λ1 , λ2 , · · · , λm be the non-negative eigenvalues of the matrix XX> . Since rank(X) ≤ p < m, we have λp+1 = · · · = λm = 0. Recall that the compact SVD of M given in Theorem 2.2.2 is UΣV> with Σ ∈ Rq×q (here q = min(m, n)). Then 18 Σ211 ≥ Σ211 ≥ · · · ≥ Σ2qq are the largest q eigenvalues of the matrix MM> , and if q < m, the remaining eigenvalues of MM> are all zeros. Then we have p q p X X X > kXX − MM> k2F ≥ λ2i + Σ4ii −2 λi Σ2ii i=1 i=1 i=1 p q q X X X = (λi − Σ2ii )2 + Σ4ii ≥ Σ4ii , i=1 i=p+1 i=p+1 where the equality is satisfied when we choose X = U(:, 1 : p)diag(Σ11 , · · · , Σpp ). Let Σ̂ = diag(Σ11 , · · · , Σpp ). The matrix Σ̂ is invertible as the rank of M is larger than p. Then for any orthogonal matrix A ∈ Rp×p , X̂ = U(:, 1 : p)Σ̂A minimizes the objective function 1 2 kXX> − MM> k2F . After we find X̂ = U(:, 1 : p)Σ̂A for a certain orthogonal matrix A, we have Y = M> X̂(X̂> X̂)−1 =VΣU> U(:, 1 : p)Σ̂A((U(:, 1 : p)Σ̂A)> U(:, 1 : p)Σ̂A)−1 =VΣU> U(:, 1 : p)Σ̂−1 A =V(:, 1 : p)Σ̂Σ̂−1 A = V(:, 1 : p)A, where the third equality is due to the fact that    Σ̂p×p ΣU> U(:, 1 : p) =  .  0(q−p)×p According to Theorem 2.2.2, (X̂, Y) is an optimal solution of problem (2.10) if µ = 0. 
Note that X̂ = U(:, 1 : p)Σ̂A is the thin SVD with Σ̂ ∈ Rp×p . Then, the third step gives X = U(:, 1 : p)Σ̂µ A. Theorem 2.2.2 shows that (X, Y) is an optimal solution of problem (2.10). Remark: To find X̂ in the first step, we apply the Gauss-Newton algorithm from (Liu et al., 2015), which is previously used for RPCA in (Shen et al., 2019). The iteration is X ← MM> X(X> X)−1 − X((X> X)−1 X> MM> X(X> X)−1 − I)/2. When p is small, computing the inverse of X> X is fast. Though an iterative algorithm is required to solve this subproblem at each outer iteration, we can use the output from the previous outer iteration 19 as the initial and the number of inner iterations is reduced significantly. Therefore, the computational time can be reduced significantly, as shown in Section 2.3. In the numerical experiments, the first Gauss-Newton algorithm requires several hundred iterations, while the number for following Gauss-Newton algorithms reduces to less than ten. From Theorem 2.2.2, we say that we solve the proximal operator of the nonconvex func- tion kLk∗ + ιrank(L)≤p (L) exactly. Here the indicator function is defined as   0, if rank(L) ≤ p;  ιrank(L)≤p (L) =  +∞, otherwise.  With these theorems, we are ready to develop optimization algorithms for the general prob- lem (2.7). 2.2.1 Forward-backward First, we eliminate S, and it becomes the following problem with L only: 1 minimize min kA(L) + S − Dk2F + λkSk1 + µkLk∗ L:rank(L)≤p S 2   1 = minimize min 2 kA(L) + S − DkF + λkSk1 + µkLk∗ (2.12) L:rank(L)≤p S 2 = minimize fλ (D − A(L)) + µkLk∗ . L:rank(L)≤p Here fλ is the Moreau envelope of λ| · | defined by fλ (x) = miny∈R {λ|y| + 21 (y − x)2 }. So it is differential and has a 1-Lipschitz continuous gradient. Then we can apply the proximal- gradient method (or forward-backward operator splitting). We take the gradient of fλ , which is given by fλ0 (x) = x − sign(x) max(0, |x| − λ) = sign(x) min(λ, |x|). (2.13) The forward-backward iteration for L with stepsize t is Lk+1 = proxtµ Lk − tA∗ fλ0 (A(Lk ) − D) , (2.14)  where the proximal operator is defined by 1 proxµ (A) = arg min kL − Ak2F + µkLk∗ . (2.15) L:rank(L)≤p 2 20 The algorithm is summarized in Alg. 2.1. Algorithm 2.1: RPCA for low rank matrix approximation Input: D, µ, λ, p, A, stepsize t, stopping criteria , maximum number of iterations M ax_Iter, initialization L0 = 0 Output: L, S 1 for k = 0, 1, 2, 3, . . . , Max_Iter do 2 S = sign(D − A(Lk )) max(0, |D − A(Lk )| − λ) ; 3 Lk+1 = proxtµ (Lk − tA∗ (A(Lk ) − D + S) using Gauss-Newton; 4 if kLk+1 − Lk kF /kLk kF <  then 5 break 6 end 7 end Connection to (Shen et al., 2019). Consider the special case with A = I and µ = 0. We let t = 1 in (2.14) and obtain the following iteration 1 Lk+1 = prox0 (Lk − fλ0 (Lk − D)) = arg min kL + Sk+1 − Dk2 , L:rank(L)≤p 2 where Sk+1 = sign(D − Lk ) max(0, |D − Lk | − λ). This is exactly the algorithm in (Shen et al., 2019) for solving (2.5). It alternates between finding the best S with L fixed and the best L (or (X, Y)) with S fixed. Recently, the work (Cai et al., 2019) proposed a novel RPCA algorithm with linear con- vergence. It projects matrices to special manifolds of low-rank matrices, and their truncated SVD can be computed efficiently. Our matrix does not have this property in our algorithm, and a good initial guess from the previous iteration is necessary to reduce the computation in the Gauss-Newton method. 2.2.1.1 Convergence analysis From the discussion above, problem (2.7) can be solved by an iteration process of forward- backward splitting. 
In each iteration, we reduce the value of the objective function 1 E(L, S) = kA(L) + S − Dk2F + λkSk1 + µkLk∗ (2.16) 2 21 by applying proximal operators to L and S alternatively. The resulting iteration sequence {(Lk , Sk )}k≥1 with some initial (L0 , S0 ) is explicitly given by Sk = sign(D − A(Lk−1 )) max(0, |D − A(Lk−1 )| − λ), (2.17) Lk = proxtµ Lk−1 − tA∗ (A(Lk−1 ) + Sk − D) ,  where the proximal operator proxtµ (·) for updating L is defined by (2.15). Here we use (2.13) to derive fλ0 (A(Lk−1 ) − D) = A(Lk−1 ) − D + sign(D − A(Lk−1 )) max(0, |D − A(Lk−1 )| − λ) = A(Lk−1 ) + Sk − D. In this subsection, we establish the convergence results for {(Lk , Sk )}k≥1 . We will show that every limit point of {(Lk , Sk )}k≥1 , denoted by (L? , S? ), is a fixed point of the proximal operator, i.e., S? = sign(D − A(L? )) max(0, |D − A(L? )| − λ), (2.18) ? ? ∗ ? ? L = proxtµ (L − tA (A(L ) + S − D)) . In practical execution, one can efficiently solve the proximal operator for L by solving (Xk , Yk ) through 1 minimize kXY> − Lk−1 + tA∗ (A(Lk−1 ) + Sk − D)k2F + µkXk∗ , X,Y 2 (2.19) > subject to Y Y = Ip×p , and letting Lk = Xk (Yk )> . We will also prove that if (X? , Y? , S? ) is a limit point of {(Xk , Yk , Sk )}k≥1 , then (X? (Y? )> , S? ) is a limit point of {(Lk , Sk )}k≥1 , and the limit point (X? , Y? , S? ) is a stationary point of 1 E(XY> , S) = kA(XY> ) + S − Dk2F + λkSk1 + µkXY> k∗ , 2 22 i.e., (X? , Y? , S? ) satisfies the first-order optimality condition 0 ∈ [A∗ (A(X? (Y? )> ) + S? − D) + µ∂kX? (Y? )> k∗ ]Y? , 0 ∈ (X? )> [A∗ (A(X? (Y? )> ) + S? − D) + µ∂kX? (Y? )> k∗ ], (2.20) 0 ∈ A(X? (Y? )> ) + S? − D + λ∂kS? k1 . We summarize these results in the following theorem. Theorem 2.2.4. Define the objective function E(L, S) as (2.16). Let {(Lk , Sk )}k≥1 be a sequence generated by (2.17) with initial (L0 , S0 ) and stepsize t < 1 kAk2 , where Lk = Xk (Yk )> with (Xk , Yk ) being solved from (2.19). We have the following statements: 1. The objective values {E(Lk , Sk )}k≥1 are non-increasing along {(Lk , Sk )}k≥1 . 2. The sequence {(Lk , Sk )}k≥1 is bounded and thus has limit points. 3. Every limit point (L? , S? ) of {(Lk , Sk )}k≥1 satisfies (2.18). 4. The sequence {(Xk , Yk , Sk )}k≥1 is also bounded. In addition, for any limit point (X? , Y? , S? ) of {(Xk , Yk , Sk )}k≥1 , (X? (Y? )> , S? ) is a limit point of {(Lk , Sk )}k≥1 . 5. Every limit point (X? , Y? , S? ) of {(Xk , Yk , Sk )}k≥1 is a stationary point of E(XY> , S), which satisfies the first-order optimality condition in (2.20). In addition, if A = I, we can take the stepsize t = 1, and all the statements above still hold. 23 Proof. We start by verifying the first two statements. For k ≥ 0 and t < 1 kAk2 , we have E(Lk+1 , Sk+1 ) 1 = kA(Lk+1 ) − A(Lk )k2F + hA(Lk+1 ) − A(Lk ), A(Lk ) + Sk+1 − Di 2 1 + kA(Lk ) + Sk+1 − Dk2F + λkSk+1 k1 + µkLk+1 k∗ 2 1 k+1 ≤ kL − Lk k2F + hLk+1 − Lk , A∗ fλ0 (A(Lk ) − D)i + µkLk+1 k∗ 2t kAk2 (2.21)   1 k k+1 2 k+1 1 + kA(L ) + S − DkF + λkS k1 + − kLk+1 − Lk k2F 2 2 2t   1 1 k+1 k ∗ 0 k 2 k+1 = kL − L + tA fλ (A(L ) − D)kF + tµkL k∗ t 2 kAk2   t ∗ 0 1 k − kA fλ (A(L ) − D)kF + 2 − kLk+1 − Lk k2F 2 2 2t 1 + kA(Lk ) + Sk+1 − Dk2F + λkSk+1 k1 , 2 where the inequality is due to the facts that kA(Lk+1 ) − A(Lk )k2F ≤ kAk2 kLk+1 − Lk k2F and A(Lk ) + Sk+1 − D = fλ0 (A(Lk ) − D). Note that Lk+1 = proxtµ Lk − tA∗ fλ0 (A(Lk ) − D) , which solves  1 minimize kL − Lk + tA∗ fλ0 (A(Lk ) − D)k2F + tµkLk∗ . 
L:rank(L)≤p 2 Since rank(Lk ) ≤ p, we have 1 k+1 kL − Lk + tA∗ fλ0 (A(Lk ) − D)k2F + tµkLk+1 k∗ 2 1 ≤ kLk − Lk + tA∗ fλ0 (A(Lk ) − D)k2F + tµkLk k∗ 2 t2 ∗ 0 = kA fλ (A(Lk ) − D)k2F + tµkLk k∗ . 2 Substituting the above estimate to (2.21) yields kAk2   k+1 k+1 1 E(L , S ) ≤ − kLk+1 − Lk k2F 2 2t (2.22) 1 + kA(Lk ) + Sk+1 − Dk2F + µkLk k∗ + λkSk+1 k1 . 2 24 Moreover, we see that 1 Sk+1 = arg min kS − (D − A(Lk ))k2F + λkSk1 . S 2 Then from (Lou and Yan, 2018, Lemma 2), there holds 1 k+1 kS − (D − A(Lk ))k2F + λkSk+1 k1 2 (2.23) 1 1 ≤ kSk − (D − A(Lk ))k2F + λkSk k1 − kSk+1 − Sk k2F . 2 2 Combining estimates (2.22) and (2.23), we find that kAk2   1 1 E(L k+1 ,S k+1 k k ) ≤ E(L , S ) + − kLk+1 − Lk k2F − kSk+1 − Sk k2F . (2.24) 2 2t 2 kAk2 Since 2 − 1 2t < 0, the estimate above implies E(Lk+1 , Sk+1 ) ≤ E(Lk , Sk ) for any k ≥ 0, which verifies the first statement. Note that the target function E(L, S) is coercive, i.e., E(L, S) → +∞ when kLkF + kSkF → +∞. Since E(Lk , Sk ) ≤ E(L0 , S0 ) < +∞, ∀k ≥ 1, this property guarantees that both {Lk }k≥1 and {Sk }k≥1 are bounded sequences, and thus the second statement holds. For any limit point (L? , S? ) of {(Lk , Sk )}k≥1 , there exists a convergent subsequence {(Lki , Ski )}i≥1 such that Lki → L? and Ski → S? . On the other hand, we see that Ski +1 = sign(D − A(Lki )) max(0, |D − A(Lki )| − λ), (2.25) ki +1 Lki − tA∗ (A(Lki ) + Ski +1 − D) .  L = proxtµ Summing both sides of (2.24) from k = 0 to ∞, we obtain  X ∞ ∞ 1 2 k+1 k 2 X − kAk kL − L kF + kSk+1 − Sk k2F ≤ 2E(L0 , S0 ) < ∞. t k=0 k=0 This inequality guarantees that {Ski +1 }i≥1 has the same limit point S? as that of {Ski }i≥1 , and {Lki +1 }i≥1 has the same limit point L? as that of {Lki }i≥1 . Then by taking limits in both sides of the two equations in (2.25), we obtain the third statement. Next we will prove the last two statements. As kXk k2F = kLk k2F and kYk k2F = p, we know that the sequence {(Xk , Yk , Sk )}k≥1 is also bounded. Let (X? , Y? , S? ) be a limit point 25 of {(Xk , Yk , Sk )}k≥1 , which is the limitation of a subsequence {(Xki , Yki , Ski )}i≥1 . Then we have Lki = Xki (Yki )> → X? (Y? )> and Ski → S? , i.e., (X? (Y? )> , S? ) is the limit point of the subsequence {(Lki , Ski )}i≥1 . Thus the fourth statement is verified. Now we are in the position to prove the fifth statement. Due to the third and fourth statements, if (X? , Y? , S? ) is a limit point of {(Xk , Yk , Sk )}k≥1 , i.e., (X? (Y? )> , S? ) should satisfy (2.18) S? = sign(D − A(X? (Y? )> )) max(0, |D − A(X? (Y? )> )| − λ), (2.26) X? (Y? )> = proxtµ X? (Y? )> − tA∗ (A(X? (Y? )> ) + S? − D) .  The first condition in (2.26) implies that the limit point S? minimizes 1 kA(X? (Y? )> ) + S − Dk2F + λkSk1 + µkX? (Y? )> k∗ 2 over all S ∈ Rm×n . Thus, S? should satisfy the third condition in (2.20). Moreover, since rank(X? (Y? )> ) ≤ p, the second condition in (2.26) actually implies that (X? , Y? ) is an optimal solution of the problem 1 minimize kXY> − X? (Y? )> + tA∗ (A(X? (Y? )> ) + S? − D)k2F + tµkXY> k∗ . X,Y 2 Therefore, (X? , Y? ) should satisfy the first-order optimality condition for X, which gives [X? (Y? )> − X? (Y? )> + tA∗ (A(X? (Y? )> ) + S? − D)]Y? + tµ∂kX? (X? (Y? )> )> k∗ Y? =t[A∗ (A(X? (Y? )> ) + S? − D) + µ∂kX? (Y? )> k∗ ]Y? 3 0. Similarly, from the first-order opitmality condition for Y, one can verify that 0 ∈ (X? )> [A∗ (A(X? (Y? )> ) + S? − D) + µ∂kX? (Y? )> k∗ ]. We thus derive the first two conditions in (2.20). 
26 We will complete our proof by verifying the convergence results for the special case of A = I and t = 1. In this case, by the same method, one can derive a similar inequality as (2.24), which is 1 E(Lk+1 , Sk+1 ) ≤ E(Lk , Sk ) − kSk+1 − Sk k2F . 2 Then {E(Lk , Sk )}k≥1 are non-increasing along {(Lk , Sk )}k≥1 , and {(Lk , Sk )}k≥1 is bounded due to the coerciveness of E(L, S). Let (L? , S? ) be the limit point of {(Lk , Sk )}k≥1 achieved by the subsequence {(Lki , Ski )}i≥1 . Recall the iterations for updating Ski +1 and Lki given by Ski +1 = sign(D − Lki ) max(0, |D − Lki | − λ), (2.27) ki D − Ski .  L = proxµ P∞ Since k=0 kSk+1 − Sk k2F ≤ 2E(L0 , S0 ) < +∞, {Ski +1 }i≥1 has the same limit point S? as that of {Ski }i≥1 . Taking limits in both sides of equations (2.27) yields the condition (2.18) for A = I and t = 1. The last two statements can be verified by exactly the same arguments for the general case. We thus complete the proof. 2.2.2 An accelerated algorithm We show in the previous subsection that Alg. 2.1 is a forward-backward splitting or proximal gradient algorithm for a nonconvex problem. Recently, accelerated proximal gradient (APG) algorithms are proposed for nonconvex problems to reduce the computational time without sacrificing convergence (Li and Pong, 2015; Li and Lin, 2015). In this chapter, we adopt the nonmonotone APG (Li and Lin, 2015, Alg. 2) because of its better performance shown in (Li and Lin, 2015). The algorithm is described in Alg. 2.2. We let δ = 1 and η = 0.6 in the numerical experiments. 2.3 Numerical experiments In this section, we use synthetic data and real images to demonstrate the performance of our proposed model and algorithms. The code to reproduce the results in this section can be found at https://github.com/mingyan08/RPCA_Rank_Bound. 27 Algorithm 2.2: Accelerated RPCA with nonmonotone APG Input: D, µ, λ, p, A, stepsize t, η ∈ [0, 1), δ > 0, stopping criteria , maximum number of iterations M ax_Iter, initialization: L0 = L1 = Z1 = 0, t0 = 0, t1 = q 1 = 1, c1 = F (L1 ) Output: L, S 1 for k = 1, 2, 3, .., Max_Iter do k−1 k−1 2 L = Lk + t tk (Zk − Lk ) + t tk−1 (Lk − Lk−1 ); 3 S = sign(D − A(L)) max(0, |D − A(L)| − λ); 4 Zk+1 = proxtµ (L − tA∗ (A(L) − D + S)); 5 if F (Zk+1 ) ≤ ck − δkZk+1 − Lk2 then 6 Lk+1 = Zk+1 ; 7 else 8 Sk = sign(D − A(Lk )) max(0, |D − A(Lk )| − λ); 9 Vk+1 = ( proxtµ (Lk − tA∗ (A(Lk ) − D + Sk )); k+1 Zk+1 if F (Zk+1 ) ≤ F (Vk+1 ); 10 L = Vk+1 otherwise; 11 end 12 if kLk − Lk−1 kF /kLk−1 kF <  then 13 break 14 end √ k2 4(t ) +1+1 15 tk+1 = 2 ; k+1 k 16 q = ηq + 1; ηq k ck +F (Lk+1 ) 17 ck+1 = q k+1 ; 18 end 2.3.1 Synthetic data We would like to recover the low-rank matrix from a noisy matrix that is contaminated by a sparse matrix and Gaussian noise. We create a true low-rank 500 × 500 matrix L? by multiplying a random 500 × r matrix and a random r × 500 matrix, where their components are generated from standard normal distribution independently. We calculate the mean of the absolute values of all the components in L? and denote it as c. Then we randomly select s% of the components and replace their values with uniformly distributed random values from [−3c, 3c]. After that, we add small Gaussian noise N (0, σ 2 ) to all components of the matrix. We let t = 1.7 in the experiments because of fast convergence, though the convergence results in Theorem 2.2.4 require t < 1. 28 2.3.1.1 Low-rank matrix recovery We fix σ = 0.05 for the Gaussian noise and set the upper bound of the rank to be p = r + 5. 
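For concreteness, the data-generation protocol above can be summarized in a short NumPy sketch. The function name make_synthetic_rpca_data and the random seed are ours; the parameters n, r, s, and sigma follow the description in this subsection.

```python
import numpy as np

def make_synthetic_rpca_data(n=500, r=25, s=20, sigma=0.05, seed=0):
    """D = low-rank L* corrupted by s% sparse outliers plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    # True low-rank matrix: product of two standard-normal factors.
    L_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
    c = np.mean(np.abs(L_star))              # mean absolute value of the entries
    D = L_star.copy()
    # Replace s% of the entries with uniform values from [-3c, 3c].
    num_outliers = int(n * n * s / 100)
    idx = rng.choice(n * n, size=num_outliers, replace=False)
    D.flat[idx] = rng.uniform(-3 * c, 3 * c, size=num_outliers)
    # Add small Gaussian noise N(0, sigma^2) to every entry.
    D += sigma * rng.standard_normal((n, n))
    return D, L_star

D, L_star = make_synthetic_rpca_data(r=25, s=20, sigma=0.05)
```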
We stop all algorithms when the relative error at the k-th iteration, which is defined as kLk+1 − Lk kF RE(Lk+1 , Lk ) := , kLk kF is less than 10−4 . We use the relative error to L? , which is defined as kL − L? kF RE(L, L∗ ) := , kL? kF to evaluate the performance of our proposed model and that in (Shen et al., 2019). First, we consider the case with r = 25 and s = 20. We plot a contour map of the relative error to L? for different parameters µ and λ in Fig. 2.1. From this contour map, we can see that the best parameter does not happen when µ = 0, which corresponds to the model in (Shen et al., 2019). It verifies the better performance of our proposed model with appropriate parameters. In this subsection, we set λ = 0.02 for Shen et al.’s and (µ = 0.6, λ = 0.04) for our proposed algorithms. In addition, we consider another two settings for (r, s), and the comparison with different algorithms is shown in Table 2.1. In this table, we also compare the number of iterations for three algorithms: Shen et al.’s, Alg. 2.1, and Alg. 2.2. From this table, we can see that both Alg. 2.1 and Alg. 2.2 have better performance and fewer iterations than (Shen et al., 2019). The accelerated Alg. 2.2 has the fewest iterations, but its performance in terms of RE(L, L? ) is not as good as Alg. 2.1 for the last case. It is because we stop both algorithms when the stopping criteria is satisfied, and the algorithms are not converged yet. We checked the objective function values for both algorithms, and the value for Alg. 2.2 is smaller than that for Alg. 2.1 in this case. Therefore, if we want a solution close to the true low-rank matrix L? , we may need to stop early before the convergence, which is the same as many models for inverse problems. 29 0.2 4 0.002 0.15 0.0 0. 8 0.1 4 0.0.02 01 0.0 01 0. 08 0.05 08 00..00248 0.0 0. 0.0 04 00..02 0.01 00248 00...0 0 0.5 1 1.5 2 Figure 2.1: The contour map of the relative error to L? for different parameters. In this experiment, we set r = 25 and s = 20. The upper bound of the rank is set to be p = 30. Shen et al.’s (Shen et al., 2019) Alg. 1 Alg.2 r s RE(L, L? ) # iter RE(L, L? ) # iter RE(L, L? ) # iter 25 20 0.0745 1318 0.0075 296 0.0075 68 50 20 0.0496 1434 0.0101 473 0.0088 77 25 40 0.0990 2443 0.0635 796 0.0915 187 Table 2.1: Comparison of three RPCA algorithms. We compare the relative error of their solutions to the true low-rank matrix and the number of iterations. Both Alg. 2.1 and Alg. 2.2 have better performance than (Shen et al., 2019) in terms of the relative error and the number of iterations. Alg. 2.2 has the fewest iterations but the relative error could be large. It is because the true low-rank matrix is not the optimal solution to the optimization problem, and the trajectory of the iterations moves close to L? before it approaches the optimal solution. 2.3.1.2 Robustness of the model In this experiment, we compare the robustness of our proposed model with that of (Shen et al., 2019). We let r = 25 and s = 20. Then we run both models for p from 15 to 35. The comparison of the relative error to L? is shown in Fig. 2.2. We let λ = 0.02 for Shen et al.’s and (µ = 0.6, λ = 0.04) for Alg. 2.2. It shows that our proposed model is robust to the 30 parameter p, as long as it is not smaller than the true rank r. 0.6 Shen et al. Alg. 2 0.5 relative error of L 0.4 0.3 0.2 0.1 0 15 20 25 30 35 rank (p) Figure 2.2: The relative error to the true low-rank matrix vs the rank p for Shen et al.’s and Alg. 2.2. Alg. 
2.2 is robust to p, as long as p is not smaller than the true rank 25. 2.3.1.3 Low-rank matrix recovery with missing entries In this experiment, we try to recover the low-rank matrix when there are missing entries in the matrix. Therefore, the operator A is not the identity I. We randomly select the missing entries from all the entries. We let r = 25 and add both the sparse noise with parameter s and the Gaussian noise with parameter σ to the true matrix L? . Then we apply Alg. 2.2 to recover the low-rank matrix, and the relative error to L? is used to evaluate the performance. The results for different settings are in Table 2.2. For the first three cases with s = 20, we choose (µ = 0.5, λ = 0.04), while we let (µ = 0.1, λ = 0.01) for the last case with s = 5. Note that, even with missing entries, Alg. 2.2 can reconstruct the low-rank matrix accurately. 31 s σ ratio of missing entries RE(L, L? ) by Alg. 2.2 20 0.05 10% 0.0079 20 0.05 20% 0.0088 20 0.05 50% 0.0201 5 0.01 50% 0.0015 Table 2.2: Performance of Alg. 2.2 on low-rank matrix recovery with missing entries. We change the level of sparsity in the sparse noise, standard deviation of the Gaussian noise, and the ratio of missing entries. 2.3.2 Real image experiment In this section, we consider the three algorithms applied to image processing problems. Since natural images are not low-rank essentially, we consider two cases on two different images (‘cameraman’ and ‘Barbara’). For the 256 × 256 cameraman image (the pixel values are from 0 to 255), we create an image with rank 37 from a low-rank approximation of the original image. Then we add 20% salt and pepper impulse noise and Gaussian noise with standard variance 4. We set 42 as the upper bound of the rank of the low-rank image for all algorithms. We let λ = 0.03 for Shen et al. and (µ = 0.5, λ = 0.06) for our model. To compare the performance of both models, we use the relative error defined in the last subsection and peak signal to noise ratio (PSNR) defined as Peak_Val2 PSNR := 10 log10 . MSE Here Peak_Val is the largest value allowed at a pixel (255 in our case), and MSE is the mean squared error between the recovered image and the true image. The numerical results are shown in Fig. 2.3. From Fig. 2.3(A-C), we can see that our proposed model performs better than Shen et al. (Shen et al., 2019). For the proposed model, we also compare the speed of three algorithms: Alg. 2.1, Alg. 2.1 with standard SVD, and Alg. 2.2 in Fig. 2.3(D). For both plots, we can see that the Gauss-Newton approach increases the speed comparing to the standard SVD approach. From the decrease of the objective function value, we can see that the accelerated algorithm Alg. 2.2 is faster than the nonaccelerated Alg. 2.1. 32 (a) Corrupted image (b) Recovered by Shen et al. RE: 0.4760, PSNR: 12.76 RE: 0.1736, PSNR: 21.52 general SVD Alg. 1 105 Alg. 2 104 103 0 10 20 30 40 50 60 time (s) (c) Recovered by Alg. 2.2 (d) Comparison of the objective function RE: 0.0457, PSNR:33.11 value vs time for three algorithms Figure 2.3: The numerical experiment on the ‘cameraman’ image. (A-C) show that the proposed model performs better than Shen et al.’s both visually and in terms of RE and PSNR. (D) compares the objective values vs time for general SVD, Alg. 2.1, and Alg. 2.2. Here f ? is the value obtained by Alg. 2.2 with more iterations. It shows the fast speed with the Gauss-Newton approach and acceleration. With the Gauss-Newton approach, the computation time for Alg. 
2.1 is reduced to about 1/7 of the one with standard SVD (from 65.11s to 8.43s). The accelerated Alg. 2.2 requires 5.2s, though the number of iterations is reduced from 3194 to 360. Next, we use the original 512 × 512 barbara image (the pixel values are from 0 to 255) without modification and add the same two types of noise as in the cameraman image. Because the original image is not low-rank, we choose the upper bound of rank p = 50. We let λ = 0.03 for Shen et al. and (µ = 0.5, λ = 0.06) for our model. The comparison 33 result is shown in Fig. 2.4, and it is similar to the cameraman image. We also applied the acceleration to Shen et al.’s algorithm and obtained a better image with RE = 0.1447 and PSNR = 22.37. (a) Corrupted image (b) Recovered by Shen et al RE: 0.4821, PSNR: 11.91 RE: 0.3368, PSNR: 15.03 106 general SVD Alg. 1 Alg. 2 105 104 0 50 100 150 time (s) (c) Recovered by Alg. 2.2 (d) Comparison of the objective function RE: 0.1317, PSNR: 23.18 value vs time for three algorithms Figure 2.4: The numerical experiment on the ‘Barbara’ image. (A-C) show that the proposed model performs better than Shen et al.’s both visually and in terms of RE and PSNR. (D) compares the objective values vs time for general SVD, Alg. 2.1, and Alg. 2.2. Here f ? is the value obtained by Alg. 2.2 with more iterations. It shows the fast speed with the Gauss- Newton approach and acceleration. With the Gauss-Newton approach, the computation time for Alg. 2.1 is reduced to less than 1/3 of the one with standard SVD (from 148.6s to 43.7s). The accelerated Alg. 2.2 requires 23.3s, though the number of iterations is reduced from 3210 to 300. 34 2.4 Concluding remarks In this chapter, we introduced a new model for RPCA when an upper bound of the rank is provided. For the unconstrained RPCA problem, we formulate it as the sum of one smooth function and one nonsmooth nonconvex function. Then we derive an algorithm based on proximal-gradient. This proposed algorithm has the alternating minimization algo- rithm (Shen et al., 2019) as a special case. Because of the connection between this algorithm and proximal gradient, we adopted an acceleration approach and proposed an accelerated algorithm. Both proposed algorithms have two advantages comparing to existing algorithms. First, different from algorithms that require accurate rank estimations, the proposed algo- rithms are robust to the upper bound of the rank. Second, we apply the Gauss-Newton algorithm to avoid the computation of singular values for large matrices, so our algorithm is faster than those algorithms that require SVD. Except for problem (2.7), this algorithm can be generalized to solve many other variants. 2.4.1 Nonconvex penalties on the singular values In the problem (2.7), we choose the convex nuclear norm for the low-rank component in the objective function, which is the `1 norm on the singular values. The `1 norm pushes all singular values toward zero for the same amount, bringing bias in the solution. To promote the low-rankness of the low-rank component (or sparsity of its singular values), we can choose nonconvex regularization terms for the singular values. The idea for nonconvex regularization is to reduce the bias by pushing less on larger singular values. Some examples of nonconvex regularization are `p (0 ≤ p < 1) (Chartrand, 2007), smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001), minimax concave penalty (MCP) (Zhang et al., 2010), nonconvex weighted `1 (Huang et al., 2015), etc. 
When these regularization terms are applied, the only difference is in the third step for finding X in Lemma 2.2.3. Currently, we have to apply the soft thresholding on the singular values. When nonconvex regularization is 35 used, we apply the corresponding thresholding on the singular values. In this case, all the convergence results stay valid. 2.4.2 Other regularization on the sparse component We can also replace the `1 norm of the sparse component with other regularization terms. Similarly to the penalty on the singular values, the `1 norm on the sparse component brings bias, and we can use nonconvex regularization terms. Wen et al. (2019) uses both nonconvex regularization terms for the low-rank and sparse components. When different regularization terms are used on the sparse component, the new function fλ (see (2.12) for the definition) may not be differentiable any more. In this case, the convergence results do not hold. 2.4.3 Constrained problems When there is no noise in the measurements, the problem becomes constrained, and the pre- vious algorithm can not be applied directly. Shen et al. (2019) uses the penalty method and gradually increases the weight for the penalization to approximate the constrained problem. Here, we introduce a new method based on ADMM. We consider the following constrained problem minimize µkLk∗ + kSk1 , subject to rank(L) ≤ p, D = L + S. (2.28) L,S When we apply ADMM, the steps are α Zk 2 Lk+1 = arg min µkLk∗ + kD − L − Sk + kF ; (2.29a) L:rank(L)≤p 2 α α Zk 2 Sk+1 = arg min kSk1 + kD − Lk+1 − S + k ; (2.29b) S 2 α F Zk+1 = Zk − α(Lk+1 + Sk+1 − D). (2.29c) The first step is exactly the proximal operator that can be solved from Lemma 2.2.3. The other two steps are easy to compute. This algorithm has only one parameter α, while penalty 36 methods, such as that in (Shen et al., 2019), require additional parameters to increase the weight for the penalization. 37 CHAPTER 3 ROBUST PRINCIPAL COMPONENT ANALYSIS FOR SEISMIC EVENT DETECTION 3.1 Introduction Reflected seismic data is contaminated by both random and coherence noise. Random noise is usually caused by environmental inferences, and coherent noise is mostly generated by the source. Denoising is an important preprocessing step because noisy seismic data may lead to unrealistic artifacts in the inversion or imaging results. But, it is challenging to effectively and efficiently eliminate noise from noisy seismic data. Various seismic denoising methods have been developed to remove random noise (Yu et al., 2015; Fomel and Liu, 2013; Kreimer and Sacchi, 2012) and coherent noise (Weglein, 2016; Liu and Fomel, 2013; Herman and Perkins, 2006). Our technique belongs to the sparse-transform-based methods. In this type of methods, it was shown that the signal could be sparsely represented by a basis/dictionary or it is sparse after some transform such as wavelet transform. In particular, different methods based on sparsity are proposed (Chen et al., 2016; Rubinstein et al., 2010). Since the signal received at different receivers are correlated, the measurements lie in a low dimensional space. Dimension reduction techniques are also applied to model the signal. They include empirical mode decomposition based methods (Kopsinis and McLaughlin, 2009; Chen et al., 2017), singular spectral analysis (Qiao et al., 2017), RPCA (Candès et al., 2011; Cheng et al., 2015), and Cadzow filtering. 
Though RPCA-based denoising methods achieve promising results in many applica- tions (Cheng et al., 2015; Sun et al., 2014; Duarte et al., 2012), existing algorithms are slow and yield a large amount of computational time, especially for large-scale data set. Therefore, we develop new algorithms for RPCA and its nonconvex variants. To verify the performance of these algorithms, we apply them to both synthetic and field data. Through 38 the numerical experiments, our proposed algorithms significantly improve computational efficiency and yield comparable or better denoising results. 3.2 Theory In this section, we focus on RPCA, though the technique can be applied to other methods. Given the noisy data denoted as an n1 × n2 matrix D, where n1 and n2 are the number of receivers and measurements at each receiver, respectively. The goal of RPCA is to get a low-rank matrix L and a sparse matrix S from this noisy matrix such that L + S = D. Ma and Aybat (2018) reviewed several forms of RPCA and efficient algorithms. One formula is minimize rank(L) + λkSk0 , subject to L + S = D, (3.1) L,S where rank(L) is the rank of the matrix L, kSk0 is the number of nonzero elements in the matrix S, and λ is a parameter to balance these two terms. This model assumes that the data is corrupted by sparse noise, which is modeled as a sparse matrix S. However, in practice, especially in the collected seismic data, D often includes other types of noise such as Gaussian noise, denoted as N. We set the random noise level to be σ, namely, kNk2F ≤ σ, where k · kF is the Frobenious norm. Hence, the corresponding formula becomes minimize rank(L) + λkSk0 , subject to kD − L − Sk2F ≤ σ. (3.2) L,S This problem is NP-hard (Ma and Aybat, 2018), and direct numerical calculation is impos- sible. One direction is convexification. It has been proved in (Zhou et al., 2010b) that a relaxed version –principal component pursuit– can be defined as minimize kLk∗ + λkSk1 subject to kD − L − Sk2F ≤ σ. (3.3) L,S The above minimization problem yields a stable estimate of the low-rank matrix and the sparse matrix under some conditions for D. Here, k · k∗ is the nuclear norm defined as the sum of all singular values, and k · k1 is the sum of the absolute values of all elements. 39 The constrained problem (3.3) is equivalent to the following unconstrained one minimize kLk∗ + λkSk1 + 1 2µ kD − L − Sk2F . (3.4) L,S That is, given σ in (3.3), we can find µ such that the optimal solutions for both (3.3) and (3.4) are equivalent, and vice versa. Therefore, we focus on solving Eq. (3.4) instead of Eq. (3.3). This problem is convex, and many existing convex optimization algorithms are applied. Some examples are proximal gradient method (PGM), accelerated PGM, and alternating direction methods of multipliers (ADMM). We provide some brief introduction over those methods. We consider the two matrices L and S together. The last function in (3.4) is differentiable with respect to L and S, while the first two functions are separable. One iteration of PGM can be expressed as  L̄k = Lk − µt (Lk + Sk − D),         k k t k k S̄ = S − µ (L + S − D),   (3.5)    Lk+1 = arg min tkLk∗ + 21 kL − L̄k k2F ,   L   Sk+1 = arg min tλkSk1 + 21 kS − S̄k k2F ,    L where t ∈ (0, µ) is the stepsize. PGM has the convergence rate of O(1/k) for general convex functions. Acceleration tech- niques improve the convergence rate to O(1/k 2 ). Some examples are provided in (Nesterov, 2013, 2005; Kim and Fessler, 2016; Beck and Teboulle, 2009a,b). 
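To make the iteration (3.5) concrete, the following is a minimal NumPy sketch of one PGM step for the unconstrained problem (3.4); the helper names soft_threshold, svt, and pgm_step are ours, and the two proximal updates are the standard entrywise and singular-value soft-thresholding operators.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise soft-thresholding: proximal operator of tau * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: proximal operator of tau * ||.||_* ."""
    U, sig, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(sig - tau, 0.0)) @ Vt

def pgm_step(L, S, D, mu, lam, t):
    """One iteration of (3.5): simultaneous gradient step on (L, S),
    then the two proximal updates. The stepsize must satisfy t in (0, mu)."""
    G = (L + S - D) / mu                     # gradient of (1/(2 mu))||D - L - S||_F^2 w.r.t. L (and S)
    L_bar = L - t * G
    S_bar = S - t * G
    L_new = svt(L_bar, t)                    # prox of t * ||.||_*
    S_new = soft_threshold(S_bar, t * lam)   # prox of t * lam * ||.||_1
    return L_new, S_new
```

The accelerated variant described next reuses the same step, applied at an extrapolated point.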
Beck and Teboulle (2009a) develop fast iterative shrinkage-thresholding algorithm (FISTA) by applying the proximal operator on an extrapolated point. By denoting one iteration of PGM as (Lk+1 , Sk+1 ) = PGM(Lk , Sk ), the iteration of FISTA is equivalent to θk−1 −1 (L̂k , Ŝk ) = (Lk , Sk ) + θk (Lk − Lk−1 , Sk − Sk−1 ), (Lk+1 , Sk+1 ) = PGM(L̂k , Ŝk ), q where θk = (1 + 1 + 4θk−1 2 )/2 and θ0 = 1. FISTA requires fewer iterations than PGM with similar per-iteration cost, which is demonstrated in our numerical experiments. 40 ADMM (Boyd et al., 2011; Yuan and Yang, 2013) solves a constrained problem. By introducing a new matrix Z, an equivalent formulation can be obtained minimize kLk∗ + λkSk1 + 1 2µ kZk2F s.t. Z + S + L = D. (3.6) Z,L,S Though the convergence of three-block ADMM is not guaranteed for general problems, this ADMM for RPCA converges with an appropriately chosen parameter (Wang et al., 2019). These algorithms are computationally efficient for small matrices, but they suffer slow convergent rates when handling medium to large-scale data sets, such as seismic data. The efficiency of the algorithms are directly related to the number of the unknown variables. PGM and FISTA have two matrices as unknown variables, while ADMM has three. In this abstract, we use infimal convolution to reduce unknown variables to be one matrix and obtain faster algorithms than existing ones. Besides solving the convex formula (3.4), people also solve nonconvex ones (Zhou and Tao, 2011, 2013) because of their better performance. In this abstract, we also consider a nonconvex model by replacing the L1 term with a nonconvex term to show the robustness of our algorithms. Methodology and Algorithms 3.2.1 New algorithms with infimal convolution The problem (3.4) has two unknown matrices L and S, and only the last two terms have S. So we can eliminate S by finding the optimal S with a given L h(D − L) := min λkSk1 + 1 2µ kD − L − Sk2F . (3.7) S By defining f (S) = λkSk1 and g(X) = 1 2µ kXk2F , we will have h as the infimal convolution of f and g, which is defined as f g : x → min f (y) + g(x − y). Hence, the problem (3.4) can y be reduced to minimize kLk∗ + f g(D − L), (3.8) L 41 which contains only one unknown matrix L. The infimal convolution f g(D − L) is differ- entiable with respect to L, and its gradient is 1 µ -Lipschitz continuous. So we apply PGM and FISTA to solve the problem (3.8) with L only. We name the corresponding algorithms as IC-PGM and IC-FISTA. We also apply ADMM with two blocks and name the new algo- rithm as IC-ADMM. Its convergence is well studied, and there is no restriction in choosing its parameters. For simplicity, we only provide the iteration of IC-PGM as below  S̄k = Sk − (Lk + Sk − D),        k+1 = arg min µλkSk1 + 21 kS − S̄k k2F ,  S   L (3.9) L̄k = Lk − µt (Lk + Sk+1 − D),        Lk+1 = arg min tkLk∗ + 12 kL − L̄k k2F .    L Here, the stepsize of t ∈ (0, 2µ), which is larger than PGM. The derivation of IC-FISTA is similar to IC-PGM. 3.2.1.1 Comparison between PGM and IC-PGM Comparing the steps in (3.5) and (3.9), we notice that PGM updates L and S simultane- ously, while IC-PGM updates S first and use the updated S to update L. The improved performance of IC-PGM over conventional PGM comes from two folds. Firstly, alternative update is faster than simultaneous update, which is similar to the improvement of Gauss- Seidel over Jacobian methods for solving linear equations. 
Secondly, IC-PGM essentially solves the problem with L only, which allows a larger stepsize than the conventional PGM. A new nonconvex model Though IC-FISTA solves RPCA efficiently, RPCA is still a convex relaxed model. In order to obtain better performance, we consider nonconvex models that we can apply the infimal convolution technique to get fast algorithms. There are many nonconvex penalties, please see (Huang and Yan, 2018; Wen et al., 2018). In this abstract, we choose Minimax Concave 42 Penalty (MCP) (Zhang et al., 2010) to replace the L1 term λkSk1 in RPCA. When we apply IC-PGM or IC-FISTA to solve this nonconvex problem, we just need to replace the step for updating S: Sk+1 = arg min µr(S) + 12 kS − S̄k k2F , L where r(S) = gλ,b (Si,j ) is the MCP function with Pn1 Pn2 i=1 j=1  S2i,j λ|Si,j | −  , |Si,j | ≤ bλ, 2b gλ,b (Si,j ) = (3.10)  bλ2 ,  |Si,j | > bλ. 2 Here b is a parameter. When b goes to infinity, r(S) becomes the L1 term. 3.3 Results 3.3.1 Synthetic seismic data We first test our denoising algorithms on synthetic seismic data. The seismic measurements are collections of synthetic seismograms obtained by implementing forward modeling on a velocity model with a few layers. One common-shot gather of synthetic seismic data with 500 receivers is posed at the top surface of the model. The interval between two receivers is 5 m. We use a Ricker wavelet with a center frequency of 25 Hz as the source time function and a staggered-grid finite-difference scheme with a perfectly matched layered absorbing boundary condition to generate 2D synthetic seismic reflection data (Tan and Huang, 2014). The synthetic trace at each receiver is a collection of time-series data of length 1, 000. We add two types of noise onto the seismic data. First, we add 25.2 dB Gaussian noise. Then, we choose 2% of the data and reset their values with random numbers uniformly distributed in [−u, u] with u being three multiply the largest value in the clean data. The overall signal to noise ratio (SNR) is -26.2 dB. 43 100 100 200 200 300 300 400 400 Time Step Time Step 500 500 600 600 700 700 800 800 900 900 1000 1000 50 100 150 200 250 300 350 400 450 500 50 100 150 200 250 300 350 400 450 500 Receivers Receivers (a) clean data (b) noisy data 100 100 200 200 300 300 400 400 Time Step Time Step 500 500 600 600 700 700 800 800 900 900 1000 1000 50 100 150 200 250 300 350 400 450 500 50 100 150 200 250 300 350 400 450 500 Receivers Receivers (c) recovered signal by L1 (d) recovered signal by MCP 50 50 100 100 150 150 Time Step Time Step 200 200 250 250 300 300 350 350 400 400 150 200 250 300 350 150 200 250 300 350 Receivers Receivers (e) recovered sparse noise by L1 (f) recovered sparse noise by MCP 50 50 100 100 150 150 Time Step Time Step 200 200 250 250 300 300 350 350 400 400 150 200 250 300 350 150 200 250 300 350 Receivers Receivers (g) difference between (c) and (a) (h) difference between (d) and (a) Figure 3.1: Comparison of recovered results on synthetic seismic data with 500 receivers and 1000 measurements at each receiver. (a) simulated clean data. (b) noisy data (-26.2 dB). (c) recovered data by L1 (13.4 dB). (d) recovered data by MCP (13.9 dB). (e) recovered sparse noise by L1 . (f) recovered sparse noise by MCP. (g) the difference between the clean data and the recovered one by L1 . (h) the difference between the clean data and the recovered one by L1 . (e-h) zoom-in over receivers 150-350 and measurements 1-400. We first compare the recovery results for convex and nonconvex models in Fig. 
3.1. We 44 manually tune the parameters to obtain the best denoising results for both models. For the L1 penalty, we choose µ = 3 × 10−5 and λ = 0.12, while for MCP, we choose µ = 1 × 10−5 , λ = 0.135, and b = 10. Both models can remove noise from the noisy data. The SNR values of recovered data for both L1 and MCP are 13.4 dB and 13.9 dB, respectively. The figure confirms that the nonconvex model performs slightly better than the convex one. Next, we compare the efficiency of all algorithms on the convex RPCA model. We first run IC-ADMM for 1,000 iterations to obtain an estimation for the minimal objective value. Then we compare their performance in function values with respect to the iteration number and time in Fig 3.2. IC-ADMM has the fastest convergent rate and smallest computational time among all five algorithms. By applying our IC technique, traditional algorithms (PGM and FISTA) yield better convergent rates. Comparing Fig. 3.2(a) and Fig. 3.2(b), we note that the time for each iteration is similar for all algorithms. 105 105 PGM PGM FISTA FISTA IC-PGM IC-PGM IC-FISTA IC-FISTA 100 IC-ADMM 100 IC-ADMM Function Value(f - f*) Function Value (f - f*) 10-5 10-5 10-10 10-10 10-15 10-15 0 100 200 300 400 500 600 700 800 900 1000 0 10 20 30 40 50 60 70 80 90 # Iters Time(s) (a) performance-iteration (b) performance-time Figure 3.2: Comparison of five algorithms (PGM, FISTA, IC-PGM, IC-FISTA, IC-ADMM) for the convex RPCA on synthetic data. IC-ADMM has the fastest convergence rate and smallest computational time. IC technique improves the performance of PGM and FISTA significantly. 3.3.2 Field seismic data In this section, we apply our denoising algorithms to the field data collected from the IRIS Community Wavefield Experiment in Oklahoma (Kent et al., 2016). Fig. 3.3 shows the data collected by 220 major seismic sensors to detect earthquakes. Before applying RPCA to 45 20 40 60 80 Station Number 100 120 140 160 180 200 220 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Time(s) Figure 3.3: Noisy data generated in Oklahoma. recover the data, we applied discrete cosine transform on the time domain. The comparison between L1 and MCP is in Fig. 3.4. We set µ = 3 × 104 , λ = 3 × 10−3 , and b = 3 × 106 . The stopping criteria is max(kSk+1 − Sk k2F /kSk k2F , kLk+1 − Lk k2F /kLk k2F ) < 10−3 . As shown in Fig. 3.4, both models successfully separate the horizontal signal (we do not want) and vertical signal (we want). In addition, we compare the computation time and total numbers of iterations in Ta- ble 2.1. IC technique improves the performance of conventional algorithms. They are 4-10x Algorithm # Iters Time (s) Function vlaue PGM 940 9.23 × 103 4.92 × 107 FISTA 166 1.63 × 103 4.92 × 107 IC-PGM 81 1.07 × 103 4.92 × 107 IC-FISTA 43 6.41 × 102 4.92 × 107 IC-ADMM 25 2.69 × 102 4.92 × 107 MCP 49 5.74 × 102 3.78 × 107 Table 3.1: Comparison of six algorithms. IC-ADMM is the fastest, which is the same as synthetic data. The function value for MCP is smaller because of a different model. faster than non-IC algorithms. It also shows that the non-convex MCP model has compa- rable performance as the convex one, while the algorithm for this nonconvex model is also fast. 
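For reference, the only algorithmic change when the L1 penalty is replaced by MCP is the entrywise thresholding used in the S-update. The sketch below gives the closed-form minimizer of mu * g_{lambda,b}(x) + (x - z)^2 / 2 for the MCP function in (3.10), assuming b > mu so that the scalar subproblem is strongly convex; the function name mcp_threshold is ours.

```python
import numpy as np

def mcp_threshold(Z, lam, b, mu):
    """Entrywise minimizer of mu * g_{lam,b}(x) + 0.5 * (x - z)^2, assuming b > mu."""
    absZ = np.abs(Z)
    out = np.zeros_like(Z, dtype=float)
    # Region 1: |z| <= mu*lam        -> 0 (already set).
    # Region 2: mu*lam < |z| <= b*lam -> soft-thresholded value, rescaled by 1/(1 - mu/b).
    mid = (absZ > mu * lam) & (absZ <= b * lam)
    out[mid] = np.sign(Z[mid]) * (absZ[mid] - mu * lam) / (1.0 - mu / b)
    # Region 3: |z| > b*lam          -> no shrinkage (MCP is flat there).
    big = absZ > b * lam
    out[big] = Z[big]
    return out
```

As b goes to infinity, this operator reduces to ordinary soft-thresholding with threshold mu * lam, consistent with MCP reducing to the L1 penalty.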
46 20 20 40 40 60 60 80 80 Station Number Station Number 100 100 120 120 140 140 160 160 180 180 200 200 220 220 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Time(s) Time(s) (a) recovered horizontal signal by L1 (b) recovered horizontal signal by MCP 20 20 40 40 60 60 80 80 Station Number Station Number 100 100 120 120 140 140 160 160 180 180 200 200 220 220 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Time(s) Time(s) (c) recovered vertical signal by L1 (d) recovered vertical signal by MCP 20 20 40 40 60 60 80 80 Station Number Station Number 100 100 120 120 140 140 160 160 180 180 200 200 220 220 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 0 2000 4000 6000 8000 10000 12000 14000 16000 18000 Time(s) Time(s) (e) recovered random noise by L1 (f) recovered random noise by MCP Figure 3.4: Recovered results of the real data with two models. 3.4 Conclusion In this section, we developed new seismic denoising algorithms based on RPCA. In particular, we applied infimal convolution to solve the convex and nonconvex optimization problems. Our technique not only allows a large stepsize, but also reduces the number of unknown variables to solve. All these characteristics of our new algorithms result in a significantly im- proved computational efficiency by comparing to other conventional algorithms such as PGM 47 and FISTA. We verified the performance of our algorithms using both synthetic reflection seismic data and field data. We observe at least a speed-up ratio of 4-10x over conventional algorithms with a comparable or better denoised results. 48 CHAPTER 4 MANIFOLD DENOISING BY NONLINEAR ROBUST PRINCIPAL COMPONENT ANALYSIS 4.1 Introduction Manifold and graph learning are nowadays widely used in computer vision, image processing, and biological data analysis on tasks such as classification, anomaly detection, data inter- polation, and denoising. In most applications, graphs are learned from high dimensional data, and successfully learned graphs allow traditional data analysis methods (PCA, Fourier analysis, clustering algorithm, neural networks) to be performed in conjunction with prior knowledge of the graph connectivity (Hammond et al., 2011; Jianbo Shi and Malik, 2000; Jiang et al., 2013; Meila and Shi, 2001). However, the quality of the learned manifold or graph may be greatly jeopardized by outliers, in ways that affect the stability of various manifold learning methods. In recent years, several methods have been proposed to handle outliers in nonlinear data (Li et al., 2009; Zhigang Tang et al., 2010; Du et al., 2013). Despite the success of those methods, they only aim at finding the outliers instead of correcting them. In addition, few theoretical results characterize their statistical performances. In this paper, we propose a novel non task-driven algorithm for the mixed noise model in (4.1) and provide theoretical guarantees to control its estimation error. Specifically, we consider the mixed noise model as X̃i = Xi + Si + i , i = 1, . . . , n, (4.1) where Xi ∈ Rp is the noiseless data independently drawn from some manifold M with an intrinsic dimension d < p, i is the i.i.d. Gaussian noise with small magnitudes, and Si is the sparse noise with possible large magnitudes. If Si has a large entry, then the corresponding X̃i is usually considered as an outlier. The goal of this chapter is to simultaneously recover 49 Xi and Si from X̃i , i = 1, .., n. 
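As an illustration, the mixed noise model (4.1) can be instantiated as follows; this sketch mirrors the Swiss-roll simulation used later in Section 4.5, and the function name sample_mixed_noise_swissroll and the random seed are ours.

```python
import numpy as np

def sample_mixed_noise_swissroll(n=2000, n_sparse=100, seed=0):
    """Draw X~ = X + S + N as in (4.1): clean Swiss-roll points, plus a sparse
    noise matrix with large entries, plus small i.i.d. Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Clean data X on a 2-dimensional manifold embedded in R^3 (Swiss roll).
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n))
    h = 21.0 * rng.random(n)
    X = np.vstack([t * np.cos(t), h, t * np.sin(t)])       # p x n with p = 3
    p = X.shape[0]
    # Small Gaussian noise with variance 0.25 on every entry.
    N = 0.5 * rng.standard_normal((p, n))
    # Sparse noise: n_sparse random entries set to (-1)^y * z with z ~ N(2, 0.09).
    S = np.zeros((p, n))
    idx = rng.choice(p * n, size=n_sparse, replace=False)
    signs = rng.choice([-1.0, 1.0], size=n_sparse)
    S.flat[idx] = signs * rng.normal(2.0, 0.3, size=n_sparse)
    return X + S + N, X, S
```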
There are several benefits in recovering the noise term Si along with the signal Xi . First, the support of Si indicates the locations of the anomaly, which is informative in many applications. For example, if Xi is the gene expression data from the ith patient, the nonzero elements in Si indicate the differentially expressed genes that are the candidates for personalized medicine. Similarly, if Si is a result of malfunctioned hardware, its nonzero elements indicate the locations of the malfunction parts. Secondly, the recovery of Si allows the “outliers” to be pulled back to the data manifold instead of simply being discarded. This prevents waste of information and is especially beneficial in cases where data is insufficient. Thirdly, in some applications, the sparse Si is part of the actual data rather than a noise term, then the algorithm provides a natural decomposition of the data into a sparse and a non-sparse component that may carry different pieces of information. Along a similar line of research, robust principle component analysis (RPCA) (Candes et al., 2011) has received considerable attention and has demonstrated its success in separat- ing data from sparse noise in many applications. However, its assumption that the data lies in a low dimensional subspace is somewhat strict. In this chapter, we generalize the Robust PCA idea to the non-linear manifold setting. The major new components in our algorithm are: 1) an incorporation of the curvature information of the manifold into the optimization framework, and 2) a unified way to apply RPCA to a collection of tangent spaces of the manifold. 4.2 Methodology Let X̃ = [X̃1 , . . . , X̃n ] ∈ Rp×n be the noisy data matrix containing n samples. Each sample is a vector in Rp independently drawn based on (4.1). The overall data matrix X̃ has the representation X̃ = X + S + N 50 where X is the clean data matrix, S is the matrix of the sparse noise, and N is the Gaussian noise. We further assume that the clean data Xi lies on some manifold M embedded in Rp with a small intrinsic dimension d  p and the sample size is sufficient (n ≥ p). The small intrinsic dimension assumption ensures that data is locally low dimensional so that the corresponding local data matrix is of low rank. This property allows the data to be separated from the sparse noise. The key idea behind our method is to handle the data locally. We use the k Nearest Neighbors (kNN) to construct local data matrices, where k is larger than the intrinsic di- mension d. For a data point Xi ∈ Rp , we define the local patch centered at it to be the set consisted of its kNN and itself, and a local data matrix X (i) associated with this patch is X (i) = [Xi1 , Xi2 , . . . , Xik , Xi ], where Xij is the jth-nearest neighbor of Xi . Let Pi be the restriction operator to the ith patch, i.e., Pi (X) = XPi where Pi is the n × (k + 1) matrix that selects the columns of X in the ith patch. Then X (i) = Pi (X). Similarly, we define S (i) = Pi (S), N (i) = Pi (N ) and X̃ (i) = Pi (X̃). 
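A minimal sketch of constructing the local patches is given below. Each patch collects the indices of the kNN of a point followed by the point itself, so that slicing the data matrix with these indices plays the role of the restriction operator P_i; the function name knn_patches is ours.

```python
import numpy as np

def knn_patches(X_tilde, k):
    """For every column i of the p x n data matrix, return the index set of its
    patch: the k nearest neighbors (Euclidean distance) followed by i itself."""
    n = X_tilde.shape[1]
    # Pairwise squared distances between columns.
    sq = np.sum(X_tilde**2, axis=0)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X_tilde.T @ X_tilde)
    patches = []
    for i in range(n):
        order = np.argsort(D2[i])                     # nearest columns first
        neighbors = [j for j in order if j != i][:k]  # the k nearest neighbors of i
        patches.append(np.array(neighbors + [i]))
    return patches

# Local data matrix X^(i) = P_i(X~):  X_tilde[:, patches[i]]  is p x (k+1).
```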
Since each local data matrix X (i) is of low rank and S is sparse, we can decompose the noisy data matrix into low-rank parts and sparse parts through solving the following optimization problem (Ŝ, {L̂(i) }ni=1 ) = arg min F (S, {L(i) }ni=1 ) S,L(i) Xn λi kX̃ (i) − L(i) − S (i) k2F + kC(L(i) )k∗ + βkS (i) k1 (4.2)  ≡ arg min S,S (i) ,L(i) i=1 subject to S (i) = Pi (S), here we take β = max{k, p}−1/2 as in RPCA, X̃ (i) = Pi (X̃) is the local data matrix on the ith patch and C is the centering operator that subtract the column mean: C(Z) = Z(I − k1 11T ), where 1 is the (k + 1)-dimensional column vector of all ones. Here we are decomposing the data on each patch into a low-rank part L(i) and a sparse part S (i) by imposing the nuclear norm and entry-wise `1 norm on L(i) and S (i) , respectively. There are two key components in this formulation: First, the patches have overlapping components (for example, X1 may 51 belong to several patches). Thus, the constraint S (i) = Pi (S) is particularly important because it ensures the same point (and the sparse noise on that point) belonging to different patches eventually has all its copies coincide with each other. Secondly, we do not have such a requirement on L(i) because the L(i) s correspond to local tangent spaces, which will be explained in the next section. Although some of the tangent spaces may be close, there is no reason for a point on the manifold to have the same projection onto two different tangent spaces. This seemingly subtle difference has a large impact on the final result. If the data has no Gaussian noise, i.e., N = 0, then X̂ ≡ X̃ − Ŝ is the final estimation for X. If N 6= 0, we can no longer only remove the sparse noise from X̃ and use X̃ − Ŝ to approximate the clean data. Instead, we use the supposedly cleaner (See §3) tangent spaces L̂(i) to construct a final estimate X̂ of X via fitting it to L̂(i) Xn X̂ = arg min λi kPi (Z) − L̂(i) k2F . (4.3) Z∈Rp×n i=1 The following discussion revolves around (4.2) and (4.3), and the structure of the chapter is as follows. In §4.3, we explain the geometric meaning of each term in (4.2). The choice of λ requires the information of the curvature of the manifold. Optimization algorithm is presented is §4.4 and numerical experiments are in §4.5. 4.3 Geometric explanation We provide a geometric intuition for the formulation (4.2). Let us write the local clean data matrix X (i) into its Taylor expansion along the manifold, X (i) = Xi 1T + T (i) + R(i) , (4.4) where the Taylor series is expanded at Xi (the point around which the ith patch is con- structed), T (i) stores the first order term whose columns lie in the tangent space of the manifold at Xi , and R(i) contains all the higher order terms. The sum of the first two terms Xi 1T + T (i) is the linear approximation to X (i) that is unknown if the tangent space is not given. This linear approximation precisely corresponds to the L(i) s in (4.2), i.e., 52 L(i) = Xi 1T + T (i) . Since the tangent space has the same dimensionality d as the manifold, with randomly chosen points, we have with probability one, that rank(T (i) ) = d. As a result, rank(L(i) ) = rank(Xi 1T + T (i) ) ≤ d + 1. By the assumption that d < min{p, k}, we know that L(i) is indeed low rank. Combing (4.4) with X̃ (i) = X (i) + S (i) + N (i) , we find the misfit term X̃ (i) − L(i) − S (i) in (4.2) equals N (i) + R(i) . This implies that the misfit contains the high order residue (i.e., the linear approximation error) and the Gaussian noise. 
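Before turning to the optimization algorithm, the following sketch evaluates the objective in (4.2) for a candidate sparse matrix S and candidate patch matrices L^(i). The centering operator C is implemented as column-mean subtraction within each patch, as described above, and the function names center and nrpca_objective are ours.

```python
import numpy as np

def center(Z):
    """Centering operator C: subtract the mean of the patch columns."""
    return Z - Z.mean(axis=1, keepdims=True)

def nrpca_objective(S, L_list, X_tilde, patches, lam_list, beta):
    """Objective F(S, {L^(i)}) in (4.2); the constraint S^(i) = P_i(S) is
    enforced by slicing the global S with each patch index set."""
    total = 0.0
    for i, patch in enumerate(patches):
        X_loc = X_tilde[:, patch]                 # X~^(i) = P_i(X~)
        S_loc = S[:, patch]                       # S^(i)  = P_i(S)
        L_loc = L_list[i]                         # candidate low-rank part, p x (k+1)
        misfit = X_loc - L_loc - S_loc
        total += (lam_list[i] * np.sum(misfit**2)
                  + np.linalg.norm(center(L_loc), ord='nuc')
                  + beta * np.sum(np.abs(S_loc)))
    return total
```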
4.4 Optimization algorithm To solve the convex optimization problem (4.2) in a memory-economic way, we first write L(i) as a function of S and eliminate them from the problem. We can do so by fixing S and minimizing the objective function with respect to L(i) L̂(i) = arg min λi kX̃ (i) − L(i) − S (i) k2F + kC(L(i) )k∗ L(i) = arg min λi kC(L(i) ) − C(X̃ (i) − S (i) )k2F + kC(L(i) )k∗ (4.5) L(i) + λi k(I − C)(L(i) − (X̃ (i) − S (i) ))k2F . Notice that L(i) can be decomposed as L(i) = C(L(i) ) + (I − C)(L(i) ), set A = C(L(i) ), B = (I − C)(L(i) ), then (4.5) is equivalent to (Â, B̂) = arg min λi kA − C(X̃ (i) − S (i) )k2F + kAk∗ + λi kB − (I − C)(X̃ ∗(i) − S (i) ))k2F , A,B which decouples into  = arg min λi kA − C(X̃ (i) − S (i) )k2F + kAk∗ , A B̂ = arg min λi kB − (I − C)(X̃ (i) − S (i) )k2F . B The problems above have closed form solutions  = T1/2λi (C(X̃ (i) − Pi (S))), B̂ = (I − C)(X̃ (i) − Pi (S)), (4.6) 53 where Tµ is the soft-thresholding operator on the singular values Tµ (Z) = U max{Σ − µI, 0}V ∗ , where U ΣV ∗ is the SVD of Z. Combing  and B̂, we have derived the closed form solution for L̂(i) L̂(i) (S) = T1/2λi (C(X̃ (i) − Pi (S))) + (I − C)(X̃ (i) − Pi (S)). (4.7) Plugging (4.7) into F in (4.2), the resulting optimization problem solely depends on S. Then we apply FISTA (Beck and Teboulle, 2009c; Sha et al., 2019) to find the optimal solution Ŝ with Ŝ = arg min F (L̂(i) (S), S). (4.8) S Once Ŝ is found, if the data has no Gaussian noise, then the final estimation for X is (i) X̂ ≡ X̃ − Ŝ; if there is Gaussian noise, we use the following denoised local patches L̂τ ∗ (i) L̂τ ∗ = Hτ ∗ (C(X̃ (i) − Pi (Ŝ))) + (I − C)(X̃ (i) − Pi (Ŝ)), (4.9) where Hτ ∗ is the singular value hard thresholding Operator with the optimal threshold as defined in (Gavish and Donoho, 2014). This optimal thresholding removes the Gaussian (i) (i) noise from L̂τ ∗ . With the denoised L̂τ ∗ , we solve (4.3) to obtain the denoised data n ! n X (i) X X̂ = λi L̂τ ∗ PiT ( λi Pi PiT )−1 . (4.10) i=1 i=1 The proposed nonlinear robust principle component analysis (NRPCA) algorithm is sum- marized in Algorithm 4.1. There is one caveat in solving (4.2): the strong sparse noise may result in a wrong neighborhood assignment when constructing the local patches. Therefore, once Ŝ is obtained and removed from the data, we update the neighborhood assignment and re-compute Ŝ. This procedure is repeated T times. 4.5 Numerical experiments We evaluate the performance of the proposed algorithm on simulated and real-world data sets. 54 Algorithm 4.1: Nonlinear RPCA Input: Noisy data matrix X̃, k (number of neighbors in each local patch), T (number of neighborhood updates iterations) Output: the denoised data X̂, the estimated sparse noise Ŝ 1 Estimate the curvature; 2 Estimate λi , i = 1, . . . , n as in §5, set β as in (4.2); 3 Ŝ ← 0; 4 for iter = 1: T do 5 Find the kNN for each point using X̃ − Ŝ and construct the restriction operators {Pi }ni=1 ; 6 Construct the local data matrices X̃ (i) = Pi (X̃) using Pi and the noisy data X̃; 7 Ŝ ← minimizer of (4.8) iteratively using FISTA; 8 end (i) 9 Compute each L̂τ ∗ from (4.9) and assign X̂ from (4.10). Simulated Swiss roll: We demonstrate the superior performance of NRPCA on a synthetically generated dataset following the mixed noise model (4.1). We sampled 2000 noiseless data Xi uniformly from a 3D Swiss roll and generated the Gaussian noise matrix with i.i.d. entries obey N (0, 0.25). 
The sparse noise matrix S is generated by randomly replacing 100 entries of a zero p × n matrix with i.i.d. samples generated from (−1)y · z where y ∼ Bernoulli(0.5) and z ∼ N (2, 0.09). We applied NRPCA to the simulated data with patch size k = 16. Figure 4.1 reports the denoising effect in the original space (3D) looking down from above. We observed a visible reduction of the noise. A similar experiment on the high dimensional Swiss roll is in the appendix, where the differences between X̃ − Ŝ and X̂ are much more apparent. MNIST: We observe some interesting dimension reduction results of the MNIST dataset with the help of NRPCA. It is well-known that the handwritten digits 4 and 9 have so high a similarity that the popular dimension reduction methods Isomap and Laplacian Eigenmaps are not able to separate them into two clusters (first column of Figure 4.2). We conjecture that the overlapping parts are caused by personalized writing styles with different beginning or finishing strokes. This type of differences can be better modelled by sparse noise than Gaussian or Poisson noises. The right column of Figure 4.2 confirms this conjecture: after 55 Figure 4.1: NRPCA applied to the noisy Swiss roll data set. X̃ − Ŝ is the result after subtracting the estimated sparse noise via NRPCA with T = 1; “ X̃ − Ŝ with one neighbor update” is that with T = 2, i.e., patches are reassigned once; X̂ is the denoised data obtained via NRPCA with T = 2; “Patch-wise Robust PCA” refers to the ad-hoc application of the vanilla RPCA to each local patch independently, whose performance is clearly worse than the proposed joint-recovery formulation. the NRPCA denoising (with k = 16), we see a much better separability of the two digits using the first two coordinates of Isomap and Laplacian Eigenmaps. In addition, these new embedding results seem to suggest that some trajectory patterns may exist in the data. We provide additional plots in the appendix to support this observation. 4.6 Conclusion In this chapter, we proposed the first outlier correction method for nonlinear data analysis that can correct outliers caused by the addition of large sparse noise. The method is a generalization of the Robust PCA method to the nonlinear setting. We provided procedures to treat the non-linearity by working with overlapping local patches of the data manifold and incorporating the curvature information into the denoising algorithm. We demonstrated that the method works equally well when Gaussian noises are present in the data in addition 56 to the sparse noise. We established a theoretical error bound on the denoised data that holds under conditions only depending on the intrinsic properties of the manifold. We tested our method on both synthetic and real dataset that were known to have nonlinear structures and reported promising results. Figure 4.2: Laplacian eigenmaps and Isomap results for the original and the NRPCA de- noised digits 4 and 9 from the MNIST dataset. 57 CHAPTER 5 ONLINE MATRIX COMPLETION WITH QUATERNION MATRIX 5.1 Introduction Matrix completion problems aim to recover an unknown matrix given that the matrix has a low rank. It has been widely studied and has many applications especially in computer vision (Cabral et al., 2011, 2014), video processing (Ji et al., 2010; Kim et al., 2015), bioin- formatics (Li et al., 2017; Lu et al., 2018), etc. There are both convex and non-convex algorithms for solving it. Convex algorithms tend to simplify the problem and use convex penalties (Recht et al., 2010). 
While it is easier to find an optimal solution, it may cost a lot of computational time. Nonconvex algorithms tend to use matrix factorization. Even if they need less computational time, they may not easily converge to the global optimal solution because of the saddle points and local optimal solution. In this case, a good initialization is very important. Traditional methods for matrix completion have already achieved good performance when dealing with greyscale images. For each image, we can easily view it as a matrix. As for a video, we can convert each frame as a vector in a matrix. However, the problem arises when applied to color images. Color images have three channels (Red, Green, and Blue) that have a mutual connection. Intuitively, traditional matrix-based methods that treat each channel separately do not work well. Some tensor-based methods have been developed to solve this multidimensional data problem. In this case, it is converted as a low-rank tensor approximation problem where three channels are considered as a third-order tensor. There are many well-known models like ANDECOMP/PARAFAC (CP) and Tucker (Zhou et al., 2019; Rauhut et al., 2017). Similar to matrix SVD, they describe the tensor as the sum of the outer products of vectors. In a recent paper (Kilmer and Martin, 2011), it proposed t-SVD, which expresses the tensor 58 as the sum of outer products of matrices. In this chapter, we consider an alternating way and apply quaternion matrices to this problem. A quaternion matrix has four parts: one real part and three imaginary parts representing the three channels. Quaternion matrices have been applied to other models such as Deep Neural Network (Liu et al., 2019; Zhu et al., 2018; Gaudet and Maida, 2018). They also have been applied on many areas like Natural Language Processing (Parcollet et al., 2018; Tay et al., 2019) and image processing (Wang et al., 2018; Ye et al., 2020). Also, there is an increasing trend today that we need to deal with large-scale data. The traditional offline matrix completion tends to give good performance. Offline matrix completion model needs to collect all observation data and gets recovery result at once. However, it is not practical on some applications, such as web data analysis. For example, in Figure 5.1, a movie ranking system can be seen as a matrix. Each column is a movie that needs to be rated, and each column corresponds to a user. We need to offer up-to-date recommendations to users. Moreover, the estimate should be better if we continuously get observations. Figure 5.1: A movie rating system. For a given d1 × d2 low-rank matrix with missing entries, it can be factorized by a d1 × k user matrix and a k × d2 item matrix where k is the rank of the original matrix. This chapter combines quaternion matrices and online matrix completion. We will de- velop an online low-rank quaternion matrix completion model that can be widely used in many cases. 59 5.2 Introduction on Quaternion Matrices We introduce quaternion numbers/matrices and their properties in this section. 5.2.1 Quaternion Numbers A quaternion number q ∈ Qn is defined as q = qr + qi i + qj j + qk k, (5.1) where qr , qi , qj , qk ∈ R and i, j, k are three imaginary units satisfying i2 = j2 = k2 = −1, ij = −ji = k, jk = −kj = i, ki = −ik = j. (5.2) The conjugate and modulus of q are respectively defined by q q∗ = qr − qi i − qj j − qk k and |q| = qr2 + qi2 + qj2 + qk2 . (5.3) Let x = [xi ] ∈ Qn be a quaternion vector. 
We define the following three norms for the quaternion vector x: 1-norm kxk1 := ni=1 |xi |, 2-norm kxk2 := i=1 |xi | , and ∞-norm P pPn 2 kxk∞ := max1≤i≤n |xi |. Let A = [aij ] ∈ Qn1 ×n2 be a quaternion matrix. We define the following norms: 1- qP P norm kAk1 = ni=1 Forbinoes norm n1 n2 Tr(A∗ A), P 1 Pn2 2 = p j=1 |aij |, kAk F = i=1 j=1 |a ij | where A∗ is the conjugate transpose (or Hermitian transpose) of A, and ∞-norm kAk∞ = max1≤i≤n1 ,1≤j≤n2 |aij |. The rank of matrix A is the number of independent rows/columns in A. A square quaternion matrix is unitary if A∗ A = AA∗ = I. An Hermitian quaternion matrix satisfies A∗ = A, which is an extension of symmetric matrices. A color image with R (red), G (green), B (blue) channels can be represented by a quater- nion matrix without the real part. That is, Aij = Rij i + Gij j + Bij k, (5.4) where Rij , Gij , and Bij are the corresponding R, G, and B channels. 60 5.2.2 Basic Properties Some properties of quaternion matrices are the same as real matrices. For example, • The definition of the inner product hA, Bi = TrhA∗ Bi; • Spectral norm and the Frobenius norm kA∗ k = kAk, kA∗ kF = kAkF ; • kABkF ≤ kAkkBkF ; • kABk ≤ kAkkBk; • kAk ≤ kAkF . • Cauchy-Schwarz inequality Re(tr(A∗ B)) ≤ kAkF kBkF . However, we need to be careful about one thing. For real matrices, the matrices in a trace of a product can be switched without changing the result. However for quarternion matrices A and B, we can only have Re(tr(AB)) = Re(tr(BA)). (5.5) For example, given two quaternion matrices         1, 1 1, 2 1, 2 1, 1 A =   +   i +   j +   k, 1, 0 2, 1 2, 1 1, 0 and         1, 2 1, 1 1, 1 1, 2 B =   +   i +   j +   k, 2, 1 1, 0 1, 0 2, 1 we have tr(AB) = −10 + 20i + 6j + 10k, and tr(BA) = −10 + 6i + 20j + 10k. 61 In this case, tr(AB) 6= tr(BA). The reason is that the product of two quaternion numbers may not be the same if the order is changed. For example, let a = 1 + 1i + 2j + 3k and b = 3 + 2i + 2j + 1k, then we have a ∗ b = −6 + 1i + 13j + 8k, b ∗ a = −6 + 9i + 3j + 12k. This difference may cause some difficulties for the theoretical proof, which will be explained in the following sections. 5.2.3 Singular Value Decomposition According to the work by Zhang (1997), we can define singular value decomposition for quaternion matrices. For any quaternion matrix L ∈ Qd1 ×d2 with rank k, there exists an unitary quaternion matrix U = [u1 , . . . , ud1 ] ∈ Qd1 ×d1 and another unitary quaternion matrix V = [v1 , . . . , vd2 ] ∈ Qd2 ×d2 such that L = UΣk V∗ (5.6) where Σk ∈ Rd1 ×d2 consists of all singular values of L, σ1 ≥ σ2 ≥ · · · ≥ σk > 0, on its diagonal entries. Then the spectral norm kLk := max{σ1 , . . . , σk }. The condition number κ max{σ1 ,...,σk } is defined as κ = min{σ1 ,...,σk } . What’s more, for any Hermitian quaternion matrix L ∈ Qd×d with rank k, there exists an unitary quaternion matrix U = [u1 , . . . , ud ], with ui being its i-th column, such that L = UΣk U∗ (5.7) where Σk = diag(σ1 , . . . , σk , 0, . . . , 0) ∈ Rd×d consists of all singular values of L, and σ1 ≥ σ2 ≥ · · · ≥ σk > 0. 62 5.2.4 Incoherence Condition Let W ∈ Qd×k be an orthonormal basis of a subspace of Rd of dimension k, then the projection to the subspace is PW = WW∗ . We define the coherence of W as d d µ(W) = max kPW ei k2 = max ke∗i Wk2 , (5.8) k 1≤i≤d k 1≤i≤d where ei is the vector with the i-th component being 1 and others being 0. Definition 5.2.1. 
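Because the non-commutativity of the quaternion product is exactly what breaks the identity tr(AB) = tr(BA), the scalar example in Section 5.2.2 can be checked with a few lines of code; quaternions are stored as (real, i, j, k) arrays and the function name qmul is ours.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions (real, i, j, k),
    using i^2 = j^2 = k^2 = -1, ij = k, jk = i, ki = j."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return np.array([
        a0*b0 - a1*b1 - a2*b2 - a3*b3,   # real part
        a0*b1 + a1*b0 + a2*b3 - a3*b2,   # i part
        a0*b2 - a1*b3 + a2*b0 + a3*b1,   # j part
        a0*b3 + a1*b2 - a2*b1 + a3*b0,   # k part
    ])

a = np.array([1.0, 1.0, 2.0, 3.0])   # 1 + 1i + 2j + 3k
b = np.array([3.0, 2.0, 2.0, 1.0])   # 3 + 2i + 2j + 1k
print(qmul(a, b))   # [-6.  1. 13.  8.]  ->  -6 + 1i + 13j + 8k
print(qmul(b, a))   # [-6.  9.  3. 12.]  ->  -6 + 9i + 3j + 12k
```

The two products reproduce the values of a*b and b*a given in Section 5.2.2, confirming that only the real parts (and hence only Re(tr(AB)) and Re(tr(BA))) agree.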
We assume M is µ-incoherent, i.e., µk µk max kX∗ ei k2 ≤ , max kY∗ ei k2 ≤ (5.9) i d1 i d2 and r µk ∗ kXY k∞ ≤ , (5.10) d1 d2 where X ∈ Qd1 ×k , Y ∈ Qd2 ×k are the left and right singular vectors of M. 5.2.5 Sampling Scheme We consider the Bernoulli model for uniform sampling. Let Ω ⊂ [d1 ] × [d2 ]. Given a matrix M, we define the matrix PΩ as  if (i, j) ∈ Ω,  Mij  [PΩ (M)]ij = if (i, j) ∈  0  / Ω. Every time (i, j) is uniformly sampled from Ω. Our goal is to recover M given PΩ (M). 5.3 Online Matrix Completion Algorithms and its Theoretical Anal- ysis We first consider a Hermitian quaternion matrix M ∈ Qd×d , i.e., there exists a quaternion matrix U such that M = UU∗ . Also, we write U with its columns as U = [u1 , u2 , . . . , ud ]. 63 Moreover, we define f (U) as f (U) = kM − UU∗ k2F = hM, −UU∗ i + hM, Mi + h−UU∗ , Mi + hUU∗ , UU∗ i. The stochastic gradient of f (U) given the (i, j) component is SG(U) = 2d2 [(UU∗ − M)ij ei e∗j + (UU∗ − M)ji ej e∗i ]U. (5.11) Note that (UU∗ − M)∗ij = (UU∗ − M)ji because both M and UU∗ are Hermitian. The expectation of SG(U) is d,d X 1 2 ESG(U) = 2 2d ((UU∗ − M)ij ei e∗j + (UU∗ − M)ji ej e∗i )U i=1,j=1 d = 4(UU∗ − M)U. Let ∇f (U) := 4(UU∗ − M)U, which is one descent direction of f (U). We assume kMk = 1. Then κ = 1 σmin (M) , namely, σmin (M) = κ1 . We also denote SVD(M) = XSX∗ , SVD(UU∗ ) = WDW∗ . Here X ∈ Qd×k with k orthogonal columns and S ∈ Rk×k is a diagonal square matrix. First, we prepare with a few lemmas about properties of f (U) in a local Frobenious ball around optimal. Lemma 5.3.1. Within the region D = {U|kUk ≤ Γ}, we have the function f (U) satisfying for any U1 , U2 ∈ D: k∇f (U1 ) − ∇f (U2 )kF ≤ βkU1 − U2 kF , (5.12) with β = 16 max{kU1 k2 , kU2 k2 , 1}. 64 Proof. From the definition of ∇f (U), we have k∇f (U1 ) − ∇f (U2 )kF =k4(U1 U∗1 − M)U1 − 4(U2 U∗2 − M)U2 kF ≤4kU1 U∗1 U1 − U2 U∗2 U2 kF + 4kM(U1 − U2 )kF =4kU1 U∗1 (U1 − U2 ) + U1 (U1 − U2 )∗ U2 + (U1 − U2 )U∗2 U2 kF + 4kM(U1 − U2 )kF ≤4kU1 k2 kU1 − U2 kF + 4kU1 kkU2 kkU1 − U2 kF + 4kU2 k2 kU1 − U2 kF + 4kMkkU1 − U2 kF ≤12 max{kU1 k2 , kU2 k2 }kU1 − U2 kF + 4kMkkU1 − U2 kF ≤16 max{kU1 k2 , kU2 k2 , 1}kU1 − U2 kF . We have completed the proof. Lemma 5.3.2. Within the region D = {U|σmin (X∗ U) ≥ γ}, the function f (U) = kM − UU∗ k2F satisfies k∇f (U)k2F ≥ 4γ 2 f (U). (5.13) Proof. Inside the region D, we let SVD(UU∗ ) = WDW∗ , thus we have k∇f (U)k2F =16k(UU∗ − M)Uk2F =16(kPW (UU∗ − M)Uk2F + kPW⊥ (UU∗ − M)Uk2F ) =16 tr(PW (UU∗ − M)UU∗ (UU∗ − M)PW ) + kPW⊥ MUk2F  =16 tr(PW (UU∗ − M)WDW∗ (UU∗ − M)PW ) + kPW⊥ MUk2F  ≥16(σmin(D) kPW (UU∗ − M)PW k2F + kPW⊥ MUk2F ) =16(σmin(D) kUU∗ − PW MPW k2F + kPW⊥ MUk2F ). 65 The inequality holds because tr(PW (UU∗ − M)WDW∗ (UU∗ − M)PW ) ≥σmin(D) tr(PW (UU∗ − M)WW∗ (UU∗ − M)PW ) =σmin(D) tr(PW (UU∗ − M)WW∗ WW∗ (UU∗ − M)PW ) The last equality is true because PW UU∗ PW = WW∗ WDW∗ WW∗ = WDW∗ = UU∗ . On the other hand, we have kPW⊥ MUk2F = kPW⊥ XSX∗ Uk2F ≥ σmin 2 (X∗ U)kPW⊥ XSk2F 2 =σmin (X∗ U)tr(PW⊥ M2 PW⊥ ) = σmin 2 (X∗ U)kPW⊥ Mk2F , and σmin (D) = λmin (UU∗ ) = λmin (U∗ U) ≥λmin (U∗ PX U) = σmin2 (X∗ U). Finally, we can get k∇f (U)k2F ≥16(σmin (D)kUU∗ − PW MPW k2F + σmin (X∗ U)kPW⊥ Mk2F ) 2 ≥16σmin (X∗ U)(kUU∗ − PW MPW k2F + kPW⊥ Mk2F )) 2 =16σmin (X∗ U)(kUU∗ − PW MPW k2F + kPW⊥ M(PW + PW⊥ )k2F )) 2 =16σmin (X∗ U)(kUU∗ − PW MPW k2F + kPW⊥ MPW k2F + kPW⊥ MPW⊥ k2F ) 2 ≥4σmin (X∗ U)(kUU∗ − PW MPW k2F + kPW⊥ MPW k2F + kPW MPW⊥ k2F + kPW⊥ MPW⊥ k2F ) 2 =4σmin (X∗ U)kUU∗ − Mk2F ≥4γ 2 kUU∗ − Mk2F . 
The third inequality holds because we have
\[
\|\mathbf{P}_{\mathbf{W}^\perp}\mathbf{M}\mathbf{P}_{\mathbf{W}}\|_F = \|\mathbf{P}_{\mathbf{W}}\mathbf{M}\mathbf{P}_{\mathbf{W}^\perp}\|_F. \tag{5.14}
\]
The third equality holds because the inner product between each pair of $\mathbf{U}\mathbf{U}^* - \mathbf{P}_{\mathbf{W}}\mathbf{M}\mathbf{P}_{\mathbf{W}}$, $\mathbf{P}_{\mathbf{W}}\mathbf{M}\mathbf{P}_{\mathbf{W}^\perp}$, $\mathbf{P}_{\mathbf{W}^\perp}\mathbf{M}\mathbf{P}_{\mathbf{W}}$, and $\mathbf{P}_{\mathbf{W}^\perp}\mathbf{M}\mathbf{P}_{\mathbf{W}^\perp}$ is 0.

Lemma 5.3.3. Within the region $\mathcal{D} = \{\mathbf{U} \,|\, \|\mathbf{M} - \mathbf{U}\mathbf{U}^*\|_F \le \frac{1}{10}\sigma_k(\mathbf{M})\}$, we have
\[
\|\mathbf{U}\| \le \sqrt{2\|\mathbf{M}\|}, \qquad \sigma_{\min}(\mathbf{X}^*\mathbf{U}) \ge \sqrt{\sigma_k(\mathbf{M})/2}. \tag{5.15}
\]

Proof. From the definition of the spectral norm, we have
\[
\|\mathbf{U}\|^2 = \|\mathbf{U}\mathbf{U}^*\| \le \|\mathbf{M}\| + \|\mathbf{M} - \mathbf{U}\mathbf{U}^*\| \le \|\mathbf{M}\| + \|\mathbf{M} - \mathbf{U}\mathbf{U}^*\|_F \le 2\|\mathbf{M}\|.
\]
We have the following lower bound for the smallest nonzero singular value of $\mathbf{U}^*\mathbf{U}$:
\[
\sigma_{\min}(\mathbf{U}^*\mathbf{U}) = \sigma_k(\mathbf{U}\mathbf{U}^*) = \sigma_k(\mathbf{M} - (\mathbf{M} - \mathbf{U}\mathbf{U}^*)) \ge \sigma_k(\mathbf{M}) - \|\mathbf{M} - \mathbf{U}\mathbf{U}^*\| \ge \frac{9}{10}\sigma_k(\mathbf{M}).
\]
The first inequality holds because for all $i, j \in \mathbb{N}$, we have $\sigma_i(\mathbf{A}) \ge \sigma_{i+j-1}(\mathbf{A} + \mathbf{B}) - \sigma_j(\mathbf{B})$. On the other hand, we denote the orthogonal complement of $\mathbf{X}$ by $\mathbf{X}_\perp$. Then we have
\begin{align*}
\frac{9}{10}\sigma_k(\mathbf{M})\|\mathbf{X}_\perp^*\mathbf{W}\|^2
&\le \sigma_{\min}(\mathbf{U}^*\mathbf{U})\|\mathbf{X}_\perp^*\mathbf{W}\|^2 \le \sigma_{\min}(\mathbf{D})\|\mathbf{X}_\perp^*\mathbf{W}\|^2 \le \|\mathbf{X}_\perp^*\mathbf{W}\mathbf{D}\mathbf{W}^*\mathbf{X}_\perp\| \le \|\mathbf{X}_\perp^*\mathbf{U}\mathbf{U}^*\mathbf{X}_\perp\|_F \\
&= \|\mathbf{P}_{\mathbf{X}_\perp}(\mathbf{M} - \mathbf{U}\mathbf{U}^*)\mathbf{P}_{\mathbf{X}_\perp}\|_F \le \|\mathbf{M} - \mathbf{U}\mathbf{U}^*\|_F \le \frac{1}{10}\sigma_k(\mathbf{M}).
\end{align*}
The last equality is true because
\[
\mathrm{tr}(\mathbf{X}_\perp^*\mathbf{U}\mathbf{U}^*\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{U}\mathbf{U}^*\mathbf{X}_\perp) = \mathrm{tr}(\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{U}\mathbf{U}^*\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{U}\mathbf{U}^*\mathbf{X}_\perp\mathbf{X}_\perp^*)
\]
and
\[
\mathrm{tr}(\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{M}\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{M}\mathbf{X}_\perp\mathbf{X}_\perp^*) = \mathrm{tr}(\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{X}\mathbf{S}\mathbf{X}^*\mathbf{X}_\perp\mathbf{X}_\perp^*\mathbf{X}\mathbf{S}\mathbf{X}^*\mathbf{X}_\perp\mathbf{X}_\perp^*) = 0.
\]
Let the principal angle between $\mathbf{X}$ and $\mathbf{W}$ be $\theta$. According to the inequality above, $\sin^2\theta = \|\mathbf{X}_\perp^*\mathbf{W}\|^2 \le \frac{1}{9}$, so $\cos^2\theta = \sigma_{\min}^2(\mathbf{X}^*\mathbf{W}) \ge \frac{8}{9}$. We have
\[
\sigma_{\min}^2(\mathbf{X}^*\mathbf{U}) = \sigma_{\min}(\mathbf{X}^*\mathbf{U}\mathbf{U}^*\mathbf{X}) = \sigma_{\min}(\mathbf{X}^*\mathbf{W}\mathbf{D}\mathbf{W}^*\mathbf{X}) \ge \sigma_{\min}(\mathbf{D})\,\sigma_{\min}^2(\mathbf{X}^*\mathbf{W}) \ge \frac{9}{10}\sigma_k(\mathbf{M}) \times \frac{8}{9} \ge \sigma_k(\mathbf{M})/2.
\]
The lemma is proved.

Now we are well prepared for the main theorem.

Theorem 5.3.4. Let $f(\mathbf{U}) = \|\mathbf{U}\mathbf{U}^* - \mathbf{M}\|_F^2$ and $g_i(\mathbf{U}) = \|\mathbf{e}_i^*\mathbf{U}\|^2$. Suppose after initialization we have
\[
f(\mathbf{U}_0) \le \left(\frac{1}{20\kappa}\right)^2, \qquad \max_i g_i(\mathbf{U}_0) \le \frac{10\mu k\kappa^2}{d}.
\]
Then, there exists some absolute constant $c$ such that for any learning rate $\eta < \frac{c}{\mu d k \kappa^3 \log d}$, with probability at least $1 - \frac{T}{d^{10}}$, we have for all $t \le T$ that
\[
f(\mathbf{U}_t) \le \left(1 - \frac{\eta}{2\kappa}\right)^t\left(\frac{1}{10\kappa}\right)^2, \qquad \max_i g_i(\mathbf{U}_t) \le \frac{20\mu k\kappa^2}{d}.
\]

Proof. Let the filtration be $\mathcal{F}_t = \sigma\{\mathrm{SG}(\mathbf{U}_0), \mathrm{SG}(\mathbf{U}_1), \dots, \mathrm{SG}(\mathbf{U}_{t-1})\}$, i.e., an increasing sequence of $\sigma$-fields. We define the event
\[
\mathcal{E}_t = \left\{\forall \tau \le t,\ f(\mathbf{U}_\tau) \le \left(1 - \frac{\eta}{2\kappa}\right)^\tau\left(\frac{1}{10\kappa}\right)^2,\ \max_i g_i(\mathbf{U}_\tau) \le \frac{20\mu k\kappa^2}{d}\right\},
\]
and write $\mathbf{1}_t$ for its indicator $\mathbf{1}_{\mathcal{E}_t}$. We aim to prove that this event happens with high probability. Conditioned on $\mathcal{E}_t$, we have $\|\mathbf{U}_t\| \le \sqrt{2}$, $\sigma_{\min}(\mathbf{X}^*\mathbf{U}_t) \ge \frac{1}{\sqrt{2\kappa}}$, and $\sigma_{\min}(\mathbf{U}_t^*\mathbf{U}_t) \ge \frac{1}{2\kappa}$ based on Lemma 5.3.3.

Construction of the supermartingale $G$: Let $g_i(\mathbf{U}) = \mathbf{e}_i^*\mathbf{U}\mathbf{U}^*\mathbf{e}_i$. For any change $\Delta\mathbf{U}$, we have
\[
g_i(\mathbf{U} + \Delta\mathbf{U}) = \mathbf{e}_i^*(\mathbf{U} + \Delta\mathbf{U})(\mathbf{U} + \Delta\mathbf{U})^*\mathbf{e}_i = g_i(\mathbf{U}) + \mathbf{e}_i^*(\Delta\mathbf{U}\,\mathbf{U}^* + \mathbf{U}\,\Delta\mathbf{U}^*)\mathbf{e}_i + \|\mathbf{e}_i^*\Delta\mathbf{U}\|^2.
\]
For any $l \in [d]$:
\begin{align*}
\mathbb{E}\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U})\|^2\mathbf{1}_t
&= 4d^4\,\mathbb{E}\big\|\mathbf{e}_l^*\big[(\mathbf{U}\mathbf{U}^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^* + (\mathbf{U}\mathbf{U}^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\big]\mathbf{U}\big\|^2\mathbf{1}_t \\
&\le 16d^4\,\mathbb{E}\big\|\mathbf{e}_l^*(\mathbf{U}\mathbf{U}^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}\big\|^2\mathbf{1}_t \\
&\le 16d^4\,\mathbb{E}\,\delta_{il}\,|(\mathbf{U}\mathbf{U}^* - \mathbf{M})_{ij}|^2\max_i\|\mathbf{e}_i^*\mathbf{U}\|^2\mathbf{1}_t \\
&= 16d^4\,\frac{1}{d^2}\sum_{i,j}\delta_{il}\,|(\mathbf{U}\mathbf{U}^* - \mathbf{M})_{ij}|^2\max_i\|\mathbf{e}_i^*\mathbf{U}\|^2\mathbf{1}_t \\
&= 16d^2\,\|\mathbf{e}_l^*(\mathbf{U}\mathbf{U}^* - \mathbf{M})\|^2\max_i\|\mathbf{e}_i^*\mathbf{U}\|^2\mathbf{1}_t \le O(\mu^2k^2\kappa^4).
\end{align*}
The last inequality holds because we have
\[
\|\mathbf{e}_l^*(\mathbf{U}\mathbf{U}^* - \mathbf{M})\| \le \|\mathbf{e}_l^*\mathbf{U}\mathbf{U}^*\| + \|\mathbf{e}_l^*\mathbf{M}\| \le \|\mathbf{U}\|\,\|\mathbf{U}^*\mathbf{e}_l\| + \|\mathbf{e}_l^*\mathbf{X}\| \le \sqrt{2}\sqrt{\frac{20\mu k\kappa^2}{d}} + \sqrt{\frac{\mu k}{d}} \le O\!\left(\sqrt{\frac{\mu k\kappa^2}{d}}\right).
\]
On the other hand, we know
\begin{align*}
\mathbb{E}[g_i(\mathbf{U}_{t+1})\mathbf{1}_t\,|\,\mathcal{F}_t]
&= \mathbb{E}[g_i(\mathbf{U}_t - \eta\,\mathrm{SG}(\mathbf{U}_t))\mathbf{1}_t\,|\,\mathcal{F}_t] \\
&= \big[g_i(\mathbf{U}_t) - \eta\,\mathbf{e}_i^*\mathbb{E}\,\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\mathbf{e}_i - \eta\,\mathbf{e}_i^*\mathbf{U}_t\,\mathbb{E}\,\mathrm{SG}(\mathbf{U}_t)^*\mathbf{e}_i + \eta^2\,\mathbb{E}\,\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\mathbf{e}_i\big]\mathbf{1}_t \\
&= \big[g_i(\mathbf{U}_t) - 2\eta\,\mathrm{Re}(\mathbf{e}_i^*\mathbb{E}\,\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\mathbf{e}_i) + \eta^2\,\mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\mathbf{1}_t \\
&= \big[\mathbf{e}_i^*\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_i - 8\eta\,\mathrm{Re}(\mathbf{e}_i^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_i) + \eta^2\,\mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\mathbf{1}_t \\
&= \big[\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{U}_t) - 8\eta\,\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{U}_t)) + \eta^2\,\mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\mathbf{1}_t \\
&= \big[\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*(\mathbf{I} - 8\eta(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}))\mathbf{U}_t)) + \eta^2\,\mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\mathbf{1}_t \\
&= \big[\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{U}_t(\mathbf{I} - 8\eta\,\mathbf{U}_t^*\mathbf{U}_t))) + 8\eta\,\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t)) + \eta^2\,\mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\mathbf{1}_t \\
&\le \big[(1 - 8\eta\,\sigma_{\min}(\mathbf{U}_t^*\mathbf{U}_t))\,g_i(\mathbf{U}_t) + 8\eta\,\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t)) + \eta^2\,\mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\mathbf{1}_t.
\end{align*}
For the middle term, we have
\begin{align*}
8\eta\,\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t))
&= 4\eta\big(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t) + \mathrm{tr}(\mathbf{U}_t^*\mathbf{M}\mathbf{e}_i\mathbf{e}_i^*\mathbf{U}_t)\big) \\
&\le 8\eta\,\|\mathbf{U}_t^*\mathbf{e}_i\|_2\,\|\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t\|_2 \le 8\eta\,\|\mathbf{U}_t^*\mathbf{e}_i\|_2\,\|\mathbf{U}_t\|\,\|\mathbf{e}_i^*\mathbf{M}\|_2 \\
&\le 8\eta\sqrt{\frac{20\mu k\kappa^2}{d}}\sqrt{2}\,\|\mathbf{e}_i^*\mathbf{X}\mathbf{S}\mathbf{X}^*\|_2 \le 16\eta\sqrt{\frac{10\mu k\kappa^2}{d}}\,\|\mathbf{e}_i^*\mathbf{X}\mathbf{S}\|_2 \\
&\le 16\eta\sqrt{\frac{10\mu k\kappa^2}{d}}\,\|\mathbf{e}_i^*\mathbf{X}\|_2 \le 16\eta\sqrt{\frac{10\mu k\kappa^2}{d}}\sqrt{\frac{\mu k}{d}} = \frac{16\sqrt{10}\,\eta\mu k\kappa}{d}.
\end{align*}
The first inequality holds because
\[
\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_i\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t) + \mathrm{tr}(\mathbf{U}_t^*\mathbf{M}\mathbf{e}_i\mathbf{e}_i^*\mathbf{U}_t) = 2\,\mathrm{Re}(\mathrm{tr}(\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_i)) \le 2\,\|\mathbf{U}_t^*\mathbf{e}_i\|_2\,\|\mathbf{e}_i^*\mathbf{M}\mathbf{U}_t\|_2.
\]
The fourth inequality holds because $\|\mathbf{X}\| = 1$. The fifth inequality holds because $\|\mathbf{S}\| = \|\mathbf{M}\| = 1$. The last inequality comes from the incoherence of $\mathbf{M}$: $\|\mathbf{e}_i^*\mathbf{X}\|_2 \le \sqrt{\mu k/d}$. In this case, since $\sigma_{\min}(\mathbf{U}_t^*\mathbf{U}_t) \ge \frac{1}{2\kappa}$ on $\mathcal{E}_t$,
\[
\mathbb{E}[g_i(\mathbf{U}_{t+1})\mathbf{1}_t\,|\,\mathcal{F}_t] \le \Big[\Big(1 - \frac{4\eta}{\kappa}\Big)g_i(\mathbf{U}_t) + \frac{16\sqrt{10}\,\eta\mu k\kappa}{d} + \eta^2 O(\mu^2k^2\kappa^4)\Big]\mathbf{1}_t \le \Big[\Big(1 - \frac{4\eta}{\kappa}\Big)g_i(\mathbf{U}_t) + \frac{60\,\eta\mu k\kappa}{d}\Big]\mathbf{1}_t.
\]
We can get the last inequality if the stepsize $\eta$ is small enough. We let
\[
G_{it} = \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t}\Big(g_i(\mathbf{U}_t)\mathbf{1}_{t-1} - \frac{15\mu k\kappa^2}{d}\Big)
\]
and have
\begin{align*}
\mathbb{E}[G_{i(t+1)}\,|\,\mathcal{F}_t]
&= \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t-1}\Big(\mathbb{E}[g_i(\mathbf{U}_{t+1})\mathbf{1}_t\,|\,\mathcal{F}_t] - \frac{15\mu k\kappa^2}{d}\Big) \\
&\le \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t}g_i(\mathbf{U}_t)\mathbf{1}_t + \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t}\Big(\frac{60\,\eta\mu k\kappa^2}{(\kappa - 4\eta)d} - \frac{15\mu k\kappa^3}{(\kappa - 4\eta)d}\Big) \le G_{it}.
\end{align*}
The last inequality is true because we have $\mathbf{1}_t \le \mathbf{1}_{t-1}$. That is to say, $G_{it}$ is a supermartingale.

Probability 1 bound for $G$: We know that
\begin{align*}
G_{i(t+1)} - \mathbb{E}[G_{i(t+1)}\,|\,\mathcal{F}_t]
&= \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t-1}\big(g_i(\mathbf{U}_{t+1})\mathbf{1}_t - \mathbb{E}[g_i(\mathbf{U}_{t+1})\mathbf{1}_t\,|\,\mathcal{F}_t]\big) \\
&= \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t-1}\Big[-\eta\,\mathbf{e}_i^*\big[\mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* - \big(\mathbf{U}_t\,\mathbb{E}\,\mathrm{SG}(\mathbf{U}_t)^* + \mathbb{E}\,\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\big)\big]\mathbf{e}_i \\
&\qquad\qquad\qquad\quad + \eta^2\big[\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2 - \mathbb{E}\|\mathbf{e}_i^*\mathrm{SG}(\mathbf{U}_t)\|^2\big]\Big]\mathbf{1}_t.
\end{align*}
Let $l \in [d]$; we need to approximate upper bounds for $\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l$ and $\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^2\mathbf{1}_t$. For the first term, we have
\[
\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l = 2d^2\,\mathbf{e}_l^*\big[(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^* + (\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\big]\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_l + 2d^2\,\mathbf{e}_l^*\mathbf{U}_t\mathbf{U}_t^*\big[(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^* + (\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\big]\mathbf{e}_l.
\]
We consider the upper bound under different conditions.
• If $l \ne i$ and $l \ne j$, we have $\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l = 0$.
• If $l = j \ne i$, we have
\[
\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l = 2d^2\big[(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\,\mathbf{e}_i^*\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_j + \mathbf{e}_j^*\mathbf{U}_t\mathbf{U}_t^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\,\mathbf{e}_i\big] = 4d^2\,\mathrm{Re}\big[(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\,\mathbf{e}_i^*\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_j\big] \le 4d^2\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^2 \le O(\mu^2k^2\kappa^4).
\]
The second-to-last inequality is true because we have
\[
\|\mathbf{U}\mathbf{U}^* - \mathbf{M}\|_\infty \le \|\mathbf{U}\mathbf{U}^*\|_\infty + \|\mathbf{M}\|_\infty \le \max_i\|\mathbf{e}_i^*\mathbf{U}\|^2 + \max_i|\mathbf{e}_i^*\mathbf{X}\mathbf{S}\mathbf{X}^*\mathbf{e}_i| \le \frac{20\mu k\kappa^2}{d} + \frac{\mu k}{d}\|\mathbf{M}\| \le \frac{21\mu k\kappa^2}{d}.
\]
• If $l = j = i$, we have
\[
\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l = 2d^2(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ii}\,\mathbf{e}_i^*\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_i + 2d^2\,\mathbf{e}_i^*\mathbf{U}_t\mathbf{U}_t^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ii}\,\mathbf{e}_i = 4d^2(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ii}\,\|\mathbf{e}_i^*\mathbf{U}_t\|^2 \le 4d^2\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^2 \le O(\mu^2k^2\kappa^4).
\]
Therefore, we always have $\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l\,\mathbf{1}_t \le O(\mu^2k^2\kappa^4)\mathbf{1}_t$. For the second term, we have
\[
\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^2\mathbf{1}_t \le 4d^4\,|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}|^2\big(\|\mathbf{e}_l^*\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t\| + \|\mathbf{e}_l^*\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\|\big)^2\mathbf{1}_t \le 16d^4\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^2\mathbf{1}_t \le O(\mu^3dk^3\kappa^6)\mathbf{1}_t.
\]
For any $l \in [d]$, we thus have $\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l \le O(\mu^2k^2\kappa^4)$ and $\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^2 \le O(\mu^3dk^3\kappa^6)$. Since the $(i, j)$ component is sampled uniformly at random from $\mathbf{M}$,
the same bounds also hold in expectation:
\[
\mathbb{E}\big(\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l\big) \le O(\mu^2k^2\kappa^4), \qquad \mathbb{E}\big(\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^2\mathbf{1}_t\big) \le O(\mu^3dk^3\kappa^6)\mathbf{1}_t.
\]
In fact, we have that $\mathbb{E}[G_{i(t+1)}\,|\,\mathcal{F}_t] \ge 0$. By letting $\eta$ be small enough, we have with probability 1,
\[
G_{i(t+1)} - \mathbb{E}[G_{i(t+1)}\,|\,\mathcal{F}_t] \le \Big(1 - \frac{4\eta}{\kappa}\Big)^{-t-1}\eta\,O(\mu^2k^2\kappa^4)\mathbf{1}_t. \tag{5.16}
\]

Variance bound for $G$: We need to approximate upper bounds for two variances: $\mathrm{Var}\big(\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^* + \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*]\mathbf{e}_l\,\mathbf{1}_t\big)$ and $\mathrm{Var}\big(\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^2\mathbf{1}_t\big)$. For the first term, we have
\begin{align*}
\mathrm{Var}\big(\mathrm{Re}(\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)]\mathbf{U}_t^*\mathbf{e}_l)\cdot\mathbf{1}_t\,|\,\mathcal{F}_t\big)
&\le \mathbb{E}\big[(\mathrm{Re}(\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)]\mathbf{U}_t^*\mathbf{e}_l))^2\cdot\mathbf{1}_t\,|\,\mathcal{F}_t\big] \\
&= \mathbb{E}\big[(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*[\mathrm{SG}(\mathbf{U}_t)])))^2\cdot\mathbf{1}_t\,|\,\mathcal{F}_t\big] \\
&\le 4d^4\,\mathbb{E}\big[\big(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t)) + \mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t))\big)^2\cdot\mathbf{1}_t\,|\,\mathcal{F}_t\big] \\
&\le 8d^4\,\mathbb{E}\big[(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t)))^2\cdot\mathbf{1}_t\,|\,\mathcal{F}_t\big] + 8d^4\,\mathbb{E}\big[(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t)))^2\cdot\mathbf{1}_t\,|\,\mathcal{F}_t\big] \\
&= 16d^2\sum_{i,j}\big(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t))\big)^2\cdot\mathbf{1}_t \\
&\le 16d^2\sum_j|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{lj}|^2\max_i\|\mathbf{U}_t^*\mathbf{e}_i\|^4\,\mathbf{1}_t \\
&= 16d^2\,\|\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\|^2\max_j\|\mathbf{e}_j^*\mathbf{U}_t\|^4\,\mathbf{1}_t \le O\!\left(\frac{\mu^3k^3\kappa^6}{d}\right)\mathbf{1}_t.
\end{align*}
The second equality is based on the definition of the expectation. For the fourth inequality, we need to consider different conditions.
• If $l \ne i$, then $(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t)))^2 = 0$.
• If $l = i$, we have
\[
(\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t)))^2 = (\mathrm{Re}(\mathrm{tr}(\mathbf{U}_t^*\mathbf{e}_l(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{lj}\mathbf{e}_j^*\mathbf{U}_t)))^2 \le \|\mathbf{U}_t^*\mathbf{e}_l\|_2^2\,\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{lj}\mathbf{e}_j^*\mathbf{U}_t\|_2^2 \le |(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{lj}|^2\max_i\|\mathbf{U}_t^*\mathbf{e}_i\|^4.
\]
For the second term, we have
\begin{align*}
\mathrm{Var}\big(\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^2\mathbf{1}_t\,|\,\mathcal{F}_t\big)
&\le \mathbb{E}\big(\|\mathbf{e}_l^*\mathrm{SG}(\mathbf{U}_t)\|^4\mathbf{1}_t\,|\,\mathcal{F}_t\big) \\
&= 16\,\frac{1}{d^2}\sum_{i,j}d^8\,\|\mathbf{e}_l^*\big((\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^* + (\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\big)\mathbf{U}_t\|^4\,\mathbf{1}_t \\
&\le 128d^6\sum_{i,j}\|\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t\|^4\,\mathbf{1}_t + 128d^6\sum_{i,j}\|\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\|^4\,\mathbf{1}_t \\
&= 256d^6\sum_j|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{lj}|^4\,\|\mathbf{e}_j^*\mathbf{U}_t\|^4\,\mathbf{1}_t \\
&\le O(1)\,d^6\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty^2\,\|\mathbf{e}_l^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\|^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^4\,\mathbf{1}_t \le O(\mu^5dk^5\kappa^{10})\mathbf{1}_t.
\end{align*}
The second inequality follows from $(a + b)^4 \le 8a^4 + 8b^4$, and the second equality holds by considering the two cases $l \ne i$ and $l = i$. Therefore, we can choose a small $\eta$ and obtain
\[
\mathrm{Var}(G_{i(t+1)}\,|\,\mathcal{F}_t) \le \Big(1 - \frac{4\eta}{\kappa}\Big)^{-2t-2}\eta^2\,O\!\left(\frac{\mu^3k^3\kappa^6}{d}\right)\mathbf{1}_t. \tag{5.17}
\]

Bernstein's inequality for $G$: Let $\sigma^2 = \sum_{\tau=1}^t\mathrm{Var}(G_{i\tau}\,|\,\mathcal{F}_{\tau-1})$, and let $R$ be such that $|G_{i\tau} - \mathbb{E}[G_{i\tau}\,|\,\mathcal{F}_{\tau-1}]| \le R$ for $\tau = 1, \dots, t$ with probability 1. Then by the standard Bernstein concentration inequality,
\[
P(G_{it} \ge G_{i0} + s) \le \exp\!\left(-\frac{s^2/2}{\sigma^2 + Rs/3}\right). \tag{5.18}
\]
Since $G_{i0} = g_i(\mathbf{U}_0) - \frac{15\mu k\kappa^2}{d}$, letting $\tilde{s} = O(1)\big(1 - \frac{4\eta}{\kappa}\big)^t\big[\sqrt{\sigma^2\log d} + R\log d\big]$, we know
\[
P\!\left(g_i(\mathbf{U}_t)\mathbf{1}_{t-1} \ge \frac{15\mu k\kappa^2}{d} + \Big(1 - \frac{4\eta}{\kappa}\Big)^t\Big(g_i(\mathbf{U}_0) - \frac{15\mu k\kappa^2}{d}\Big) + \tilde{s}\right) \le \frac{1}{2d^{11}}. \tag{5.19}
\]
Based on (5.16), we know that $R = \big(1 - \frac{4\eta}{\kappa}\big)^{-t}\eta\,O(\mu^2k^2\kappa^4)$ satisfies $G_{i\tau} - \mathbb{E}[G_{i\tau}\,|\,\mathcal{F}_{\tau-1}] \le R$ for $\tau = 1, \dots, t$. Also, by the variance bound for $G$ in (5.17), we have
\[
\Big(1 - \frac{4\eta}{\kappa}\Big)^t\sqrt{\sigma^2\log d} \le \eta\,O\!\left(\sqrt{\frac{\mu^3k^3\kappa^6\log d}{d}}\right)\sqrt{\sum_{\tau=1}^t\Big(1 - \frac{4\eta}{\kappa}\Big)^{2t-2\tau}} \le \eta\,O\!\left(\sqrt{\frac{\mu^3k^3\kappa^6\log d}{d}\cdot\frac{\kappa}{\eta}}\right) \le \sqrt{\eta}\,O\!\left(\sqrt{\frac{\mu^3k^3\kappa^7\log d}{d}}\right).
\]
By choosing $\eta < \frac{c}{\mu dk\kappa^3\log d}$ and choosing $c$ to be small enough, we have
\[
\tilde{s} \le \sqrt{\eta}\,O\!\left(\sqrt{\frac{\mu^3k^3\kappa^7\log d}{d}}\right) + \eta\,O(\mu^2k^2\kappa^4\log d) \le O\!\left(\frac{\mu k\kappa^2}{d}\right) + O\!\left(\frac{\mu k\kappa}{d}\right) \le \frac{5\mu k\kappa^2}{d}.
\]
Since we have the initialization $\max_i g_i(\mathbf{U}_0) \le \frac{10\mu k\kappa^2}{d}$, by Bernstein's inequality we have
\[
P\!\left(g_i(\mathbf{U}_t)\mathbf{1}_{t-1} \ge \frac{20\mu k\kappa^2}{d}\right) \le \frac{1}{2d^{11}}.
\]
Namely,
\[
P\!\left(\mathcal{E}_{t-1}\cap\Big\{g_i(\mathbf{U}_t) \ge \frac{20\mu k\kappa^2}{d}\Big\}\right) \le \frac{1}{2d^{11}}.
\]
We also need to construct another supermartingale $F$.

Construction of the supermartingale $F$: From the definition of $\mathrm{SG}(\mathbf{U}_t)$, we have
\[
\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^2\,\mathbf{1}_t \le 16d^4\,\mathbb{E}\,|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}|^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^2\,\mathbf{1}_t \le 16d^2\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_F^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^2\,\mathbf{1}_t \le O(\mu dk\kappa^2)f(\mathbf{U}_t)\mathbf{1}_t. \tag{5.20}
\]
By the update equation
\[
\mathbf{U}_{t+1} = \mathbf{U}_t - \eta\,\mathrm{SG}(\mathbf{U}_t), \tag{5.21}
\]
we can have
\begin{align*}
\mathbb{E}[f(\mathbf{U}_{t+1})\mathbf{1}_t\,|\,\mathcal{F}_t]
&\le \big[f(\mathbf{U}_t) - \mathbb{E}\langle\nabla f(\mathbf{U}_t), \eta\,\mathrm{SG}(\mathbf{U}_t)\rangle + \eta^2\,\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^2\big]\mathbf{1}_t \\
&= \big[f(\mathbf{U}_t) - \eta\|\nabla f(\mathbf{U}_t)\|_F^2 + \eta^2\,\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^2\big]\mathbf{1}_t \\
&\le \Big[\Big(1 - \frac{2\eta}{\kappa}\Big)f(\mathbf{U}_t) + \eta^2\,O(\mu kd\kappa^2)f(\mathbf{U}_t)\Big]\mathbf{1}_t \le \Big(1 - \frac{\eta}{\kappa}\Big)f(\mathbf{U}_t)\mathbf{1}_t.
\end{align*}
The first inequality comes from the second-order Taylor expansion; we choose a small $\eta$ and increase the coefficient of the second-order term from $\eta^2/2$ to $\eta^2$. The second inequality uses Lemma 5.3.2. The last inequality holds when we choose a small $\eta$. Let $F_t = (1 - \frac{\eta}{\kappa})^{-t}f(\mathbf{U}_t)\mathbf{1}_{t-1}$; then
\[
\mathbb{E}[F_{t+1}\,|\,\mathcal{F}_t] = \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}\mathbb{E}[f(\mathbf{U}_{t+1})\mathbf{1}_t\,|\,\mathcal{F}_t] \le \Big(1 - \frac{\eta}{\kappa}\Big)^{-t}f(\mathbf{U}_t)\mathbf{1}_t \le F_t.
\]
Therefore, $F_t$ is a supermartingale.

Probability 1 bound for $F$: From the definition of $F$, we have
\begin{align*}
F_{t+1} &= \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}f(\mathbf{U}_{t+1})\mathbf{1}_t = \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}\|\mathbf{U}_{t+1}\mathbf{U}_{t+1}^* - \mathbf{M}\|_F^2\,\mathbf{1}_t \\
&= \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M} - \eta(\mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*) + \eta^2\,\mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\|_F^2\,\mathbf{1}_t.
\end{align*}
Define $\hat{f}(\eta) := f(\mathbf{U}_t - \eta\,\mathrm{SG}(\mathbf{U}_t))$. By the second-order Taylor expansion with respect to $\eta$, we can have
\[
f(\mathbf{U}_{t+1}) = \hat{f}(0) + \eta\,\nabla\hat{f}(0) + \frac{\eta^2}{2}\nabla^2\hat{f}(\xi)
\]
for some $\xi$ between $0$ and $\eta$, where
\[
\hat{f}(0) = \|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_F^2, \qquad \nabla\hat{f}(0) = -2\,\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle,
\]
and
\begin{align*}
\nabla^2\hat{f}(\xi)
&= 2\langle\mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*,\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle + 4\,\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle \\
&\quad - 12\xi\,\mathrm{Re}\langle\mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*,\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle + 12\xi^2\langle\mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*,\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle \\
&= 4\,\mathrm{Re}\langle(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle \\
&\quad + 2\,\|\mathrm{SG}(\mathbf{U}_t)(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* + (\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))\mathrm{SG}(\mathbf{U}_t)^*\|_F^2.
\end{align*}
Then we need to bound
\[
F_{t+1} - \mathbb{E}[F_{t+1}\,|\,\mathcal{F}_t] = \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}\big[f(\mathbf{U}_{t+1}) - \mathbb{E}(f(\mathbf{U}_{t+1})\,|\,\mathcal{F}_t)\big]\mathbf{1}_t.
\]
Firstly,
\begin{align*}
\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty\mathbf{1}_t
&= \max_{ij}|\mathbf{e}_i^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{e}_j|\mathbf{1}_t = \max_{ij}|\mathbf{e}_i^*(\mathbf{P}_{\mathbf{X}} + \mathbf{P}_{\mathbf{X}_\perp})(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{e}_j|\mathbf{1}_t \\
&\le \max_{ij}|\mathbf{e}_i^*\mathbf{P}_{\mathbf{X}}(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{e}_j|\mathbf{1}_t + \max_{ij}|\mathbf{e}_i^*\mathbf{P}_{\mathbf{X}_\perp}\mathbf{U}_t\mathbf{U}_t^*\mathbf{e}_j|\mathbf{1}_t \\
&\le \max_{ij}\|\mathbf{e}_i^*\mathbf{X}\|\,\|\mathbf{X}^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{e}_j\|\mathbf{1}_t + \|\mathbf{e}_j^*\mathbf{W}\|\,\|\mathbf{W}^*\mathbf{W}\mathbf{D}\mathbf{W}^*\mathbf{P}_{\mathbf{X}_\perp}\mathbf{e}_i\| \\
&\le \max_{ij}\|\mathbf{e}_i^*\mathbf{X}\|\,\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{e}_j\| + \|\mathbf{e}_j^*\mathbf{W}\|\,\|\mathbf{W}\mathbf{D}\mathbf{W}^*\mathbf{P}_{\mathbf{X}_\perp}\mathbf{e}_i\| \\
&\le \max_i\|\mathbf{e}_i^*\mathbf{X}\|\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_F + \|\mathbf{e}_j^*\mathbf{W}\|\,\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{P}_{\mathbf{X}_\perp}\mathbf{e}_i\| \\
&\le O\!\left(\sqrt{\frac{\mu k}{d}}\right)\sqrt{f(\mathbf{U}_t)} + O\!\left(\sqrt{\frac{\mu k\kappa^3}{d}}\right)\sqrt{f(\mathbf{U}_t)} \le O\!\left(\sqrt{\frac{\mu k\kappa^3}{d}}\right)\sqrt{f(\mathbf{U}_t)}.
\end{align*}
The fifth inequality comes from
\[
\|\mathbf{e}_i^*\mathbf{W}\| \le \|\mathbf{e}_i^*\mathbf{W}\mathbf{D}^{\frac{1}{2}}\|\,\frac{1}{\lambda_{\min}(\mathbf{D})^{\frac{1}{2}}} = \frac{\|\mathbf{e}_i^*\mathbf{U}_t\|}{\lambda_{\min}(\mathbf{D})^{\frac{1}{2}}} \le \sqrt{2\kappa}\sqrt{\frac{20\mu k\kappa^2}{d}}.
\]
As we know, for the first-order derivative,
\begin{align*}
\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle
&= \mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*\rangle + \mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle \\
&\le 2\sqrt{2}\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_F\,\|\mathrm{SG}(\mathbf{U}_t)\|_F \\
&= 4\sqrt{2}\,d^2\sqrt{f(\mathbf{U}_t)}\,\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t + (\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\|_F \\
&\le 4\sqrt{2}\,d^2\sqrt{f(\mathbf{U}_t)}\big(\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t\|_F + \|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\|_F\big) \\
&\le 4\sqrt{2}\,d^2\sqrt{f(\mathbf{U}_t)}\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty\big(\|\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t\|_F + \|\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\|_F\big) \\
&\le 8\sqrt{2}\,d^2\sqrt{f(\mathbf{U}_t)}\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty\max_i\|\mathbf{e}_i^*\mathbf{U}_t\| \le O(\mu dk\kappa^{2.5})f(\mathbf{U}_t).
\end{align*}
Here, the first inequality comes from $\|\mathbf{U}_t\| \le \sqrt{2}$. For the second-order derivative, with a small enough $\eta$, we have
\begin{align*}
&4\,\mathrm{Re}\langle(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle + 2\,\|\mathrm{SG}(\mathbf{U}_t)(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* + (\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))\mathrm{SG}(\mathbf{U}_t)^*\|_F^2 \\
&\le O(1)\,\|\mathrm{SG}(\mathbf{U}_t)\|_F^2 \le O(1)\,d^4\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^2 \le O(\mu^2d^2k^2\kappa^5)f(\mathbf{U}_t).
\end{align*}
The first inequality holds because, when $\eta$ is small enough, we have a uniform upper bound for $\|\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t)\|$ and $\|(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* - \mathbf{M}\|$. In this case, if we choose $\eta$ to be small enough, we have with probability 1,
\[
|F_{t+1} - \mathbb{E}[F_{t+1}\,|\,\mathcal{F}_t]| \le \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}\eta\,O(\mu dk\kappa^{2.5})f(\mathbf{U}_{t+1})\mathbf{1}_t \le \Big(1 - \frac{\eta}{\kappa}\Big)^{-t-1}\Big(1 - \frac{\eta}{2\kappa}\Big)^{t+1}\eta\,O(\mu dk\kappa^{0.5}). \tag{5.22}
\]
Variance bound for $F$: For the first-order derivative,
\begin{align*}
\mathrm{Var}\big(\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle\big)
&\le \mathbb{E}\big(\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^* + \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle\big)^2 \\
&\le 2\,\mathbb{E}\big(\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathbf{U}_t\mathrm{SG}(\mathbf{U}_t)^*\rangle\big)^2 + 2\,\mathbb{E}\big(\mathrm{Re}\langle\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathbf{U}_t^*\rangle\big)^2 \\
&\le 2\,\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\mathbf{U}_t\|_F^2\,\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^2 + 2\,\|\mathbf{U}_t^*(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})\|_F^2\,\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^2 \\
&\le 8\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_F^2\,\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^2 \le O(\mu dk\kappa^2)f^2(\mathbf{U}_t),
\end{align*}
where the last inequality comes from (5.20). For the second-order derivative, when $\eta$ is small enough, we have
\begin{align*}
&\mathrm{Var}\big(4\,\mathrm{Re}\langle(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle + 2\,\|\mathrm{SG}(\mathbf{U}_t)(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* + (\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))\mathrm{SG}(\mathbf{U}_t)^*\|_F^2\big) \\
&\le \mathbb{E}\big(4\,\mathrm{Re}\langle(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle + 2\,\|\mathrm{SG}(\mathbf{U}_t)(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* + (\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))\mathrm{SG}(\mathbf{U}_t)^*\|_F^2\big)^2 \\
&\le 2\,\mathbb{E}\big(4\,\mathrm{Re}\langle(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* - \mathbf{M},\ \mathrm{SG}(\mathbf{U}_t)\mathrm{SG}(\mathbf{U}_t)^*\rangle\big)^2 + 2\,\mathbb{E}\big(2\,\|\mathrm{SG}(\mathbf{U}_t)(\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))^* + (\mathbf{U}_t - \xi\,\mathrm{SG}(\mathbf{U}_t))\mathrm{SG}(\mathbf{U}_t)^*\|_F^2\big)^2 \\
&\le O(1)\,\mathbb{E}\|\mathrm{SG}(\mathbf{U}_t)\|_F^4 = O(d^8)\,\mathbb{E}\|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\mathbf{e}_i\mathbf{e}_j^*\mathbf{U}_t + (\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\|_F^4 \\
&\le O(d^8)\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty^2\,\mathbb{E}|(\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}|^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^4 \\
&\le O(d^6)\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_\infty^2\,\|\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M}\|_F^2\max_i\|\mathbf{e}_i^*\mathbf{U}_t\|^4 \le O(\mu^3d^3k^3\kappa^7)f^2(\mathbf{U}_t).
\end{align*}
Therefore, we can have with probability 1,
\[
\mathrm{Var}(F_{t+1}\,|\,\mathcal{F}_t) \le \Big(1 - \frac{\eta}{\kappa}\Big)^{-2t-2}\eta^2\,O(\mu dk\kappa^2)f^2(\mathbf{U}_t)\mathbf{1}_t \le \Big(1 - \frac{\eta}{\kappa}\Big)^{-2t-2}\Big(1 - \frac{\eta}{2\kappa}\Big)^{2t+2}\eta^2\,O\!\left(\frac{\mu dk}{\kappa^2}\right)\mathbf{1}_t. \tag{5.23}
\]

Bernstein's inequality for $F$: Let $\sigma^2 = \sum_{\tau=1}^t\mathrm{Var}(F_\tau\,|\,\mathcal{F}_{\tau-1})$ and let $R$ satisfy $|F_\tau - \mathbb{E}[F_\tau\,|\,\mathcal{F}_{\tau-1}]| \le R$ according to (5.22), $\tau = 1, \dots, t$. Then by the standard Bernstein concentration inequality, we know
\[
P(F_t \ge F_0 + s) \le \exp\!\left(-\frac{s^2/2}{\sigma^2 + Rs/3}\right).
\]
Let $\tilde{s} = O(1)\big(1 - \frac{\eta}{\kappa}\big)^t\big[\sqrt{\sigma^2\log d} + R\log d\big]$. So when $d \ge 2$, we have
\[
P\Big(f(\mathbf{U}_t)\mathbf{1}_{t-1} \ge \Big(1 - \frac{\eta}{\kappa}\Big)^t f(\mathbf{U}_0) + \tilde{s}\Big) \le \frac{1}{2d^{10}}.
\]
We know that $R = \big(1 - \frac{\eta}{\kappa}\big)^{-t}\big(1 - \frac{\eta}{2\kappa}\big)^t\eta\,O(\mu dk\kappa^{0.5})$. By the variance bound of $F$ in (5.23), we have
\begin{align*}
\Big(1 - \frac{\eta}{\kappa}\Big)^t\sqrt{\sigma^2\log d}
&\le \eta\,O\!\left(\sqrt{\frac{\mu dk\log d}{\kappa^2}}\right)\sqrt{\sum_{\tau=1}^t\Big(1 - \frac{\eta}{\kappa}\Big)^{2t-2\tau}\Big(1 - \frac{\eta}{2\kappa}\Big)^{2\tau}} \\
&\le \Big(1 - \frac{\eta}{2\kappa}\Big)^t\eta\,O\!\left(\sqrt{\frac{\mu dk\log d}{\kappa^2}}\right)\sqrt{\sum_{\tau=1}^t\Big(1 - \frac{\eta}{\kappa}\Big)^{2t-2\tau}\Big(1 - \frac{\eta}{2\kappa}\Big)^{2\tau-2t}} \\
&\le \Big(1 - \frac{\eta}{2\kappa}\Big)^t\sqrt{\eta}\,O\!\left(\sqrt{\frac{\mu dk\log d}{\kappa}}\right).
\end{align*}
The last inequality holds because we have
\[
\sum_{\tau=1}^t\Big(1 - \frac{\eta}{\kappa}\Big)^{2t-2\tau}\Big(1 - \frac{\eta}{2\kappa}\Big)^{2\tau-2t} < \frac{4(\frac{\kappa}{\eta})^2 - 4\frac{\kappa}{\eta} + 1}{4\frac{\kappa}{\eta} - 3} \le \frac{\kappa}{\eta}.
\]
By $\eta < \frac{c}{\mu dk\kappa^3\log d}$ and choosing $c$ to be small enough, we have
\[
\tilde{s} = \Big(1 - \frac{\eta}{2\kappa}\Big)^t\left[\sqrt{\eta}\,O\!\left(\sqrt{\frac{\mu dk\log d}{\kappa}}\right) + \eta\,O(\mu dk\kappa^{0.5}\log d)\right] \le \Big(1 - \frac{\eta}{2\kappa}\Big)^t\left(\frac{1}{20\kappa}\right)^2.
\]
Since $F_0 = f(\mathbf{U}_0) \le \frac{1}{(20\kappa)^2}$, we can have
\[
P\left(f(\mathbf{U}_t)\mathbf{1}_{t-1} \ge \Big(1 - \frac{\eta}{2\kappa}\Big)^t\left(\frac{1}{10\kappa}\right)^2\right) \le \frac{1}{2d^{10}}.
\]
That is to say,
\[
P\left(\mathcal{E}_{t-1}\cap\Big\{f(\mathbf{U}_t) \ge \Big(1 - \frac{\eta}{2\kappa}\Big)^t\Big(\frac{1}{10\kappa}\Big)^2\Big\}\right) \le \frac{1}{2d^{10}}.
\]

Probability of the event $\mathcal{E}_T$: We need to combine the concentration results for the martingales $G$ and $F$. Then we get
\begin{align*}
P(\mathcal{E}_{t-1}\cap\bar{\mathcal{E}}_t)
&= P\left(\mathcal{E}_{t-1}\cap\left(\Big\{\exists\, i:\ g_i(\mathbf{U}_t) \ge \frac{20\mu k\kappa^2}{d}\Big\}\cup\Big\{f(\mathbf{U}_t) \ge \Big(1 - \frac{\eta}{2\kappa}\Big)^t\Big(\frac{1}{10\kappa}\Big)^2\Big\}\right)\right) \\
&\le \sum_{i=1}^d P\left(\mathcal{E}_{t-1}\cap\Big\{g_i(\mathbf{U}_t) \ge \frac{20\mu k\kappa^2}{d}\Big\}\right) + P\left(\mathcal{E}_{t-1}\cap\Big\{f(\mathbf{U}_t) \ge \Big(1 - \frac{\eta}{2\kappa}\Big)^t\Big(\frac{1}{10\kappa}\Big)^2\Big\}\right) \le \frac{1}{d^{10}}.
\end{align*}
The theorem is proved in the Hermitian case.

5.4 Algorithms

In this section, we give algorithms for both the Hermitian and the general case.

5.4.1 The Hermitian Case

For the Hermitian case, we need to find one matrix $\mathbf{U}$ so that $\mathbf{U}\mathbf{U}^* \approx \mathbf{M}$. Given a new observation $(i, j)$, one or two rows of $\mathbf{U}$ are updated in every iteration. The SGD update is given by (5.11), and $\eta$ is the stepsize. For the convergence result in this chapter, $\eta$ has to satisfy $\eta < \frac{c}{\mu dk\kappa^3\log d}$ in our theoretical proof with a small $c$; in practice, however, we may choose a larger stepsize. $T$ is the total number of required observations. The algorithm is described in Algorithm 5.1.
Algorithm 5.1: Online learning algorithm for the Hermitian matrix $\mathbf{M}$
Input: initial observation set $\Omega_0 \subset [d]\times[d]$, learning rate $\eta$, number of iterations $T$, and $\mathbf{U}_0$ with $\mathbf{U}_0\mathbf{U}_0^* \leftarrow$ top-$k$ SVD of $\frac{d^2}{|\Omega_0|}\mathbf{P}_{\Omega_0}(\mathbf{M})$
Output: $\mathbf{U}$ such that $\mathbf{U}\mathbf{U}^* \approx \mathbf{M}$
1 for $t = 0, 1, 2, \dots, T-1$ do
2   Observe $M_{ij}$, where $(i, j)$ is drawn uniformly from $\{1, \dots, d\} \times \{1, \dots, d\}$;
3   $\mathbf{U}_{t+1} = \mathbf{U}_t - 2\eta d^2\big((\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ij}\,\mathbf{e}_i\mathbf{e}_j^* + (\mathbf{U}_t\mathbf{U}_t^* - \mathbf{M})_{ji}\,\mathbf{e}_j\mathbf{e}_i^*\big)\mathbf{U}_t$
4 end

5.4.2 The General Case

In the general case with a quaternion matrix $\mathbf{M} \in \mathbb{Q}^{d_1 \times d_2}$, we need to find two matrices $\mathbf{U} \in \mathbb{Q}^{d_1 \times k}$ and $\mathbf{V} \in \mathbb{Q}^{d_2 \times k}$ so that $\mathbf{U}\mathbf{V}^* \approx \mathbf{M}$. Here $\Omega_0$ is the initial set of observed coordinates, from which we construct a good initialization $\mathbf{U}_0$. For each iteration, given one observation $M_{ij}$, we update the $i$-th row of $\mathbf{U}_t$ and the $j$-th row of $\mathbf{V}_t$. Both rows are updated by SGD in a similar way to the Hermitian case. The parameter $\eta$ is the stepsize, and $T$ is the total number of observations. The algorithm is described in Algorithm 5.2. In practice, we use Algorithm 5.3 to increase the speed, just as in (Jin et al., 2016): instead of directly computing an SVD of the $d_1 \times d_2$ quaternion matrix $\mathbf{U}\mathbf{V}^*$, we compute SVDs of the two smaller $k \times k$ quaternion matrices $\mathbf{U}_t^*\mathbf{U}_t$ and $\mathbf{V}_t^*\mathbf{V}_t$.

Algorithm 5.2: Online learning algorithm for general $\mathbf{M}$ (theoretical version)
Input: initial observation set $\Omega_0 \subset [d_1]\times[d_2]$, learning rate $\eta$, number of iterations $T$, and $\mathbf{U}_0\mathbf{V}_0^* \leftarrow$ top-$k$ SVD of $\frac{d_1 d_2}{|\Omega_0|}\mathbf{P}_{\Omega_0}(\mathbf{M})$
Output: $\mathbf{U}, \mathbf{V}$ such that $\mathbf{U}\mathbf{V}^* \approx \mathbf{M}$
1 for $t = 0, 1, 2, \dots, T-1$ do
2   $\mathbf{W}_U\mathbf{D}\mathbf{W}_V^* \leftarrow \mathrm{SVD}(\mathbf{U}_t\mathbf{V}_t^*)$;
3   $\tilde{\mathbf{U}}_t \leftarrow \mathbf{W}_U\mathbf{D}^{\frac{1}{2}}$, $\tilde{\mathbf{V}}_t \leftarrow \mathbf{W}_V\mathbf{D}^{\frac{1}{2}}$;
4   Observe $M_{ij}$, where $(i, j)$ is drawn uniformly from $[d_1] \times [d_2]$;
5   $\mathbf{U}_{t+1} \leftarrow \tilde{\mathbf{U}}_t - 2\eta d_1 d_2(\tilde{\mathbf{U}}_t\tilde{\mathbf{V}}_t^* - \mathbf{M})_{ij}\,\mathbf{e}_i\mathbf{e}_j^*\tilde{\mathbf{V}}_t$;
6   $\mathbf{V}_{t+1} \leftarrow \tilde{\mathbf{V}}_t - 2\eta d_1 d_2(\tilde{\mathbf{U}}_t\tilde{\mathbf{V}}_t^* - \mathbf{M})_{ji}\,\mathbf{e}_j\mathbf{e}_i^*\tilde{\mathbf{U}}_t$
7 end

Algorithm 5.3: Online learning algorithm for general $\mathbf{M}$ (practical version)
Input: initial observation set $\Omega_0 \subset [d_1]\times[d_2]$, learning rate $\eta$, number of iterations $T$, and $\mathbf{U}_0\mathbf{V}_0^* \leftarrow$ top-$k$ SVD of $\frac{d_1 d_2}{|\Omega_0|}\mathbf{P}_{\Omega_0}(\mathbf{M})$
Output: $\mathbf{U}, \mathbf{V}$ such that $\mathbf{U}\mathbf{V}^* \approx \mathbf{M}$
1 for $t = 0, 1, 2, \dots, T-1$ do
2   Observe $M_{ij}$, where $(i, j) \sim \mathrm{Unif}([d_1] \times [d_2])$;
3   $\mathbf{R}_U\mathbf{D}_U\mathbf{R}_U^* \leftarrow \mathrm{SVD}(\mathbf{U}_t^*\mathbf{U}_t)$;
4   $\mathbf{R}_V\mathbf{D}_V\mathbf{R}_V^* \leftarrow \mathrm{SVD}(\mathbf{V}_t^*\mathbf{V}_t)$;
5   $\mathbf{Q}_U\mathbf{D}\mathbf{Q}_V^* \leftarrow \mathrm{SVD}(\mathbf{D}_U^{\frac{1}{2}}\mathbf{R}_U^*\mathbf{R}_V(\mathbf{D}_V^{\frac{1}{2}})^*)$;
6   $\mathbf{U}_{t+1} = \mathbf{U}_t - 2\eta d_1 d_2(\mathbf{U}_t\mathbf{V}_t^* - \mathbf{M})_{ij}\,\mathbf{e}_i\mathbf{e}_j^*\mathbf{V}_t\mathbf{R}_V\mathbf{D}_V^{-\frac{1}{2}}\mathbf{Q}_V\mathbf{Q}_U^*\mathbf{D}_U^{\frac{1}{2}}\mathbf{R}_U^*$;
7   $\mathbf{V}_{t+1} = \mathbf{V}_t - 2\eta d_1 d_2(\mathbf{U}_t\mathbf{V}_t^* - \mathbf{M})_{ji}\,\mathbf{e}_j\mathbf{e}_i^*\mathbf{U}_t\mathbf{R}_U\mathbf{D}_U^{-\frac{1}{2}}\mathbf{Q}_U\mathbf{Q}_V^*\mathbf{D}_V^{\frac{1}{2}}\mathbf{R}_V^*$
8 end

5.5 Numerical Experiments

In this section, we conduct numerical experiments for both the Hermitian and the general case.

Small Hermitian case: We randomly generate a $10 \times 10$ Hermitian quaternion matrix with rank 5. The initialization is obtained from 99% of the entries. The stepsize $\eta$ is chosen as 3e-5, and the total number of iterations is 40000. The total number of iterations is large compared with the size of the matrix, and we use this example to demonstrate the linear convergence of the proposed algorithm. For this example, each component is chosen 400 times on average, and the algorithm converges linearly in Figure 5.2, which confirms the theoretical results.

Figure 5.2: Loss function value versus number of iterations for the small Hermitian case. The stepsize is 3e-5. Within 40000 iterations, the value decreases from nearly 100 to $10^{-6}$. The loss function value tends to keep decreasing after these 40000 iterations.

Small general case: We randomly generate a $10 \times 10$ general quaternion matrix with rank 5. The initialization is obtained from 99% of the entries. The stepsize $\eta$ is chosen as 1e-4, and the total number of iterations is 5000. We also observe the convergence of the proposed algorithm in Figure 5.3.
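To give a concrete feel for Algorithm 5.1 and the small experiments above, here is a minimal sketch of the online update loop. It is restricted to a real symmetric positive semidefinite matrix for brevity (the quaternion Hermitian case replaces transposes by conjugate transposes and real products by quaternion products), so it should be read as an illustration of the structure of the algorithm rather than the implementation behind Figure 5.2; the function name and the synthetic test at the end are our own.

```python
import numpy as np

def online_symmetric_completion(M, k, eta, T, obs_frac=0.99, seed=0):
    """Sketch of Algorithm 5.1 for a real symmetric PSD matrix M of rank k."""
    rng = np.random.default_rng(seed)
    d = M.shape[0]
    # Initialization: top-k eigenpairs of (d^2 / |Omega_0|) * P_{Omega_0}(M).
    mask = rng.random((d, d)) < obs_frac
    M0 = np.where(mask, M, 0.0) * (d * d / mask.sum())
    M0 = (M0 + M0.T) / 2                            # symmetrize before eigendecomposition
    w, V = np.linalg.eigh(M0)
    idx = np.argsort(w)[::-1][:k]                   # k largest eigenvalues
    U = V[:, idx] * np.sqrt(np.maximum(w[idx], 0))  # U0 with U0 U0^T approximating M
    losses = []
    for _ in range(T):
        i, j = rng.integers(d), rng.integers(d)     # uniformly sampled entry (i, j)
        r_ij = U[i] @ U[j] - M[i, j]                # residual (U U^T - M)_{ij}
        r_ji = U[j] @ U[i] - M[j, i]                # equals r_ij in the real symmetric case
        # Update (5.11): only rows i and j of U change.
        Ui_old = U[i].copy()
        U[i] = U[i] - 2 * eta * d * d * r_ij * U[j]
        U[j] = U[j] - 2 * eta * d * d * r_ji * Ui_old
        losses.append(np.linalg.norm(U @ U.T - M, "fro") ** 2)
    return U, losses

# Usage mirroring the scale of the small Hermitian experiment above.
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 5))
M = A @ A.T                                         # rank-5 symmetric PSD test matrix
U, losses = online_symmetric_completion(M, k=5, eta=3e-5, T=40000)
print(losses[0], losses[-1])
```

Each iteration touches only rows $i$ and $j$ of $\mathbf{U}$, so the per-step cost is $O(k)$ apart from the optional loss evaluation; with a small enough stepsize the loss is expected to decay geometrically, mirroring the behavior in Figure 5.2.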
From both examples, we can see that the algorithm converges linearly, although the convergence is slow.

Color image: For a color image with a depth channel, of dimension $259 \times 320 \times 4$, we resize the image to $156 \times 192 \times 4$. There are four channels including the depth, so we can use a quaternion matrix to describe it, with the depth stored in the real part. We normalize the matrix and force its rank to be 30. The initial stepsize is 1e-5, and the total number of iterations is 10000; in this case, the total number of iterations is smaller than the number of entries in the matrix. Every 300 iterations, we multiply the stepsize by 0.95 to help the algorithm converge.

Figure 5.3: Loss function value versus number of iterations for the small general quaternion matrix case. The stepsize is tuned to be 1e-4. Within 5000 iterations, the value decreases from around 100 to $10^{-4}$. The loss function value tends to keep decreasing after these 5000 iterations.

The result is shown in Figure 5.4 and Figure 5.5. We can see that the image cannot be recovered well from only 10000 observations. One possible reason is that a quaternion matrix carries three more components per entry than a real matrix, so it tends to need more observations to converge. Here, only around 1/3 of the pixels are used, each exactly once, which is far below the amount of computation required by offline algorithms: in each iteration of an offline algorithm, all observed pixels are visited, and many iterations are required to obtain a good recovery.

Figure 5.4: Online image recovery result after 10000 iterations ((a) true image, (b) recovered image). We randomly sampled 10000 observations from (a). We can see that the result for the recovered image (b) is not good; we expect the difference between (a) and (b) to be as small as possible.

Figure 5.5: Loss function value versus number of iterations for the real color image. The initial stepsize is tuned to be 1e-5. Every 300 iterations, we multiply the stepsize by 0.95. The loss function value decreases from around 170 to almost 45. At the beginning, the loss function value decreases the most, and it tends to keep decreasing after these 10000 iterations.

5.6 Conclusion

In this chapter, we introduce the quaternion matrix and its properties. Based on previous work on online matrix completion, we set up a provable and efficient framework for online quaternion matrix completion, which can easily be applied to color images. This framework applies nonconvex SGD to the quaternion matrix, and the recovery improves with each newly observed entry. Using martingale theory, we prove that SGD stays away from saddle points and converges linearly if we have a good initialization.

BIBLIOGRAPHY

Amaldi, E. and Kann, V. (1998). On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209(1-2):237–260.

Beck, A. and Teboulle, M. (2009a). Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419–2434.

Beck, A. and Teboulle, M. (2009b). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Beck, A.
and Teboulle, M. (2009c). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202.

Bouwmans, T. and Zahzah, E. H. (2014). Robust PCA via principal component pursuit: A review for a comparative evaluation in video surveillance. Computer Vision and Image Understanding, 122:22–34.

Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122.

Cabral, R., De la Torre, F., Costeira, J. P., and Bernardino, A. (2014). Matrix completion for weakly-supervised multi-label image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):121–135.

Cabral, R. S., Torre, F., Costeira, J. P., and Bernardino, A. (2011). Matrix completion for multi-label image classification. In Advances in Neural Information Processing Systems, pages 190–198.

Cai, H., Cai, J.-F., and Wei, K. (2019). Accelerated alternating projections for robust principal component analysis. The Journal of Machine Learning Research, 20(1):685–717.

Candès, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust principal component analysis? Journal of the ACM (JACM), 58(3):11.

Candes, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust Principal Component Analysis? J. ACM, 58(3):11.

Chartrand, R. (2007). Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Processing Letters, 14(10):707–710.

Chen, Y., Ma, J., and Fomel, S. (2016). Double-sparsity dictionary for seismic noise attenuation. Geophysics, 81(2):V103–V116.

Chen, Y., Zhou, Y., Chen, W., Zu, S., Huang, W., and Zhang, D. (2017). Empirical low-rank approximation for seismic noise attenuation. IEEE Transactions on Geoscience and Remote Sensing, 55(8):4696–4711.

Cheng, J., Chen, K., and Sacchi, M. D. (2015). Robust principle component analysis (RPCA) for seismic data denoising. In GeoConvention 2015.

Cunningham, J. P. and Ghahramani, Z. (2015). Linear dimensionality reduction: Survey, insights, and generalizations. The Journal of Machine Learning Research, 16(1):2859–2900.

Da Costa, J. F. P., Alonso, H., and Roque, L. (2009). A weighted principal component analysis and its application to gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(1):246–252.

De la Torre, F. and Black, M. J. (2001). Robust principal component analysis for computer vision. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 1, pages 362–369. IEEE.

Du, C., Sun, J., Zhou, S., and Zhao, J. (2013). An Outlier Detection Method for Robust Manifold Learning. In Yin, Z., Pan, L., and Fang, X., editors, Proceedings of The Eighth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA), 2013, Advances in Intelligent Systems and Computing, pages 353–360. Springer Berlin Heidelberg.

Duarte, L. T., Nadalin, E. Z., Nose Filho, K., Zanetti, R., Romano, J. M., and Tygel, M. (2012). Seismic wave separation by means of robust principal component analysis. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pages 1494–1498. IEEE.

Elhamifar, E. and Vidal, R. (2013). Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781.

Fan, J. and Li, R. (2001).
Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.

Fomel, S. and Liu, Y. (2013). Seislet transform and seislet frame. Geophysics, 75:V25–V38.

Gaudet, C. J. and Maida, A. S. (2018). Deep quaternion networks. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.

Gavish, M. and Donoho, D. L. (2014). The optimal hard threshold for singular values is $4/\sqrt{3}$. IEEE Transactions on Information Theory, 60(8):5040–5053.

Hammond, D. K., Vandergheynst, P., and Gribonval, R. (2011). Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150.

Herman, G. and Perkins, C. (2006). Predictive removal of scattered noise. Geophysics, 71:V41–V49.

Horn, R. A. and Johnson, C. R. (2012). Matrix analysis. Cambridge University Press.

Huang, X. and Yan, M. (2018). Nonconvex penalties with analytical solutions for one-bit compressive sensing. Signal Processing, 144:341–351.

Huang, X.-L., Shi, L., and Yan, M. (2015). Nonconvex sorted $\ell_1$ minimization for sparse approximation. Journal of the Operations Research Society of China, 3(2):207–229.

Ji, H., Liu, C., Shen, Z., and Xu, Y. (2010). Robust video denoising using low rank matrix completion. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1791–1798. IEEE.

Jianbo Shi and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905.

Jiang, B., Ding, C., Luo, B., and Tang, J. (2013). Graph-Laplacian PCA: Closed-Form Solution and Robustness. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 3492–3498.

Jin, C., Kakade, S. M., and Netrapalli, P. (2016). Provable efficient online matrix completion via non-convex stochastic gradient descent. Advances in Neural Information Processing Systems, 29:4520–4528.

Kent, A., Sweet, J., and Woodward, B. (2016). IRIS community wavefield experiment in Oklahoma. Incorporated Research Institutions for Seismology. Dataset/Seismic Network.

Kilmer, M. E. and Martin, C. D. (2011). Factorization strategies for third-order tensors. Linear Algebra and its Applications, 435(3):641–658.

Kim, D. and Fessler, J. A. (2016). Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1-2):81–107.

Kim, J.-H., Sim, J.-Y., and Kim, C.-S. (2015). Video deraining and desnowing using temporal correlation and low-rank matrix completion. IEEE Transactions on Image Processing, 24(9):2658–2670.

Kopsinis, Y. and McLaughlin, S. (2009). Development of EMD-based denoising methods inspired by wavelet thresholding. IEEE Transactions on Signal Processing, 57(4):1351–1362.

Koren, Y. (2009). The BellKor solution to the Netflix grand prize. Netflix prize documentation, 81(2009):1–10.

Kreimer, N. and Sacchi, M. D. (2012). A tensor higher-order singular value decomposition for prestack seismic data noise reduction and interpolation. Geophysics, 77:V113–V122.

Li, G. and Pong, T. K. (2015). Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434–2460.

Li, H. and Lin, Z. (2015). Accelerated proximal gradient methods for nonconvex programming. Advances in Neural Information Processing Systems, 28:379–387.

Li, J.-Q., Rong, Z.-H., Chen, X., Yan, G.-Y., and You, Z.-H. (2017). MCMDA: Matrix completion for miRNA-disease association prediction. Oncotarget, 8(13):21187.
Li, X.-R., Li, X.-M., Li, H.-L., and Cao, M.-Y. (2009). Rejecting Outliers Based on Correspondence Manifold. Acta Automatica Sinica, 35(1):17–22.

Lin, Z., Chen, M., and Ma, Y. (2010). The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055.

Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., and Ma, Y. (2012). Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171–184.

Liu, X., Wen, Z., and Zhang, Y. (2015). An efficient Gauss–Newton algorithm for symmetric low-rank product matrix approximations. SIAM Journal on Optimization, 25(3):1571–1608.

Liu, Y. and Fomel, S. (2013). Seismic data analysis using local time-frequency decomposition. Geophysical Prospecting, 61(3):516–525.

Liu, Y., Zheng, Y., Lu, J., Cao, J., and Rutkowski, L. (2019). Constrained quaternion-variable convex optimization: a quaternion-valued recurrent neural network approach. IEEE Transactions on Neural Networks and Learning Systems, 31(3):1022–1035.

Lou, Y. and Yan, M. (2018). Fast l1–l2 minimization via a proximal operator. Journal of Scientific Computing, 74(2):767–785.

Lu, C., Yang, M., Luo, F., Wu, F.-X., Li, M., Pan, Y., Li, Y., and Wang, J. (2018). Prediction of lncRNA–disease associations based on inductive matrix completion. Bioinformatics, 34(19):3357–3364.

Ma, S. and Aybat, N. S. (2018). Efficient optimization algorithms for robust principal component analysis and its variants. Proceedings of the IEEE, 106(8):1411–1426.

Meila, M. and Shi, J. (2001). Learning Segmentation by Random Walks. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 873–879. MIT Press.

Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152.

Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161.

Parcollet, T., Zhang, Y., Morchid, M., Trabelsi, C., Linarès, G., De Mori, R., and Bengio, Y. (2018). Quaternion convolutional neural networks for end-to-end automatic speech recognition. arXiv preprint arXiv:1806.07789.

Qiao, T., Ren, J., Wang, Z., Zabalza, J., Sun, M., Zhao, H., Li, S., Benediktsson, J. A., Dai, Q., and Marshall, S. (2017). Effective denoising and classification of hyperspectral images using curvelet transform and singular spectrum analysis. IEEE Transactions on Geoscience and Remote Sensing, 55(1):119–133.

Rauhut, H., Schneider, R., and Stojanac, Ž. (2017). Low rank tensor recovery via iterative hard thresholding. Linear Algebra and its Applications, 523:220–262.

Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501.

Rubinstein, R., Zibulevsky, M., and Elad, M. (2010). Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing, 58(3):1553–1564.

Sha, N., Yan, M., and Lin, Y. (2019). Efficient seismic denoising techniques using robust principal component analysis. In SEG Technical Program Expanded Abstracts 2019, pages 2543–2547. Society of Exploration Geophysicists.

Shen, Y., Xu, H., and Liu, X. (2019). An alternating minimization method for robust principal component analysis. Optimization Methods and Software, 34(6):1251–1276.

Sun, C., Zhang, Q., Wang, J., and Xie, J. (2014).
Noise reduction based on robust principal component analysis. Journal of Computational Information Systems, 10(10):4403–4410.

Tan, S. and Huang, L. (2014). An efficient finite-difference method with high-order accuracy in both time and space domains for modelling scalar-wave propagation. Geophysical Journal International, 197(2):1250–1267.

Tao, M. and Yuan, X. (2011). Recovering low-rank and sparse components of matrices from incomplete and noisy observations. SIAM Journal on Optimization, 21(1):57–81.

Tay, Y., Zhang, A., Tuan, L. A., Rao, J., Zhang, S., Wang, S., Fu, J., and Hui, S. C. (2019). Lightweight and efficient neural natural language processing with quaternion networks. arXiv preprint arXiv:1906.04393.

Trefethen, L. N. and Bau III, D. (1997). Numerical linear algebra, volume 50. SIAM.

Wang, C., Wang, X., Li, Y., Xia, Z., and Zhang, C. (2018). Quaternion polar harmonic Fourier moments for color images. Information Sciences, 450:141–156.

Wang, Y., Yin, W., and Zeng, J. (2019). Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing, 78(1):29–63.

Weglein, A. B. (2016). Multiples: Signal or noise? Geophysics, 81:V283–V302.

Wen, F., Chu, L., Liu, P., and Qiu, R. C. (2018). A survey on nonconvex regularization-based sparse and low-rank recovery in signal processing, statistics, and machine learning. IEEE Access, 6:69883–69906.

Wen, F., Ying, R., Liu, P., and Truong, T.-K. (2019). Nonconvex regularized robust PCA using the proximal block coordinate descent algorithm. IEEE Transactions on Signal Processing, 67(20):5402–5416.

Wen, Z., Yin, W., and Zhang, Y. (2012). Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation, 4(4):333–361.

Wright, J., Ganesh, A., Rao, S., and Ma, Y. (2009). Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. Coordinated Science Laboratory Report no. UILU-ENG-09-2210, DC-243.

Ye, H.-S., Zhou, N.-R., and Gong, L.-H. (2020). Multi-image compression-encryption scheme based on quaternion discrete fractional Hartley transform and improved pixel adaptive diffusion. Signal Processing, 175:107652.

Yu, S., Ma, J., Zhang, X., and Sacchi, M. D. (2015). Interpolation and denoising of high-dimensional seismic data by learning a tight frame. Geophysics, 80:V119–V132.

Yuan, X. and Yang, J. (2009). Sparse and low-rank matrix decomposition via alternating direction methods. preprint, 12(2).

Yuan, X. and Yang, J. (2013). Sparse and low-rank matrix decomposition via alternating direction methods. Pacific Journal of Optimization, 9:167–180.

Zhang, C.-H. et al. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942.

Zhang, F. (1997). Quaternions and matrices of quaternions. Linear Algebra and its Applications, 251:21–57.

Zhigang Tang, Jun Yang, and Bingru Yang (2010). A new outlier detection algorithm based on manifold learning. In 2010 Chinese Control and Decision Conference, pages 452–457.

Zhou, M., Liu, Y., Long, Z., Chen, L., and Zhu, C. (2019). Tensor rank learning in CP decomposition via convolutional neural network. Signal Processing: Image Communication, 73:12–21.

Zhou, T. and Tao, D. (2011). GoDec: Randomized low-rank & sparse matrix decomposition in noisy case. In International Conference on Machine Learning, pages 30–40.

Zhou, T. and Tao, D. (2013). Greedy bilateral sketch, completion & smoothing.
In International Conference on Artificial Intelligence and Statistics, pages 650–658. JMLR.org.

Zhou, X., Yang, C., Zhao, H., and Yu, W. (2014). Low-rank modeling and its applications in image analysis. ACM Computing Surveys (CSUR), 47(2):1–33.

Zhou, Z., Li, X., Wright, J., Candes, E., and Ma, Y. (2010a). Stable principal component pursuit. In 2010 IEEE International Symposium on Information Theory, pages 1518–1522. IEEE.

Zhou, Z., Li, X., Wright, J., Candes, E., and Ma, Y. (2010b). Stable principal component pursuit. In IEEE International Symposium on Information Theory, pages 1518–1522. IEEE.

Zhu, X., Xu, Y., Xu, H., and Chen, C. (2018). Quaternion convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 631–647.