TENSOR LEARNING WITH STRUCTURE, GEOMETRY AND MULTI-MODALITY

By

Seyyid Emre Sofuoglu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Electrical Engineering – Doctor of Philosophy

2022

ABSTRACT

TENSOR LEARNING WITH STRUCTURE, GEOMETRY AND MULTI-MODALITY

By Seyyid Emre Sofuoglu

With the advances in sensing and data acquisition technology, it is now possible to collect data from different modalities and sources simultaneously. Most of these data are multi-dimensional in nature and can be represented by multiway arrays known as tensors. For instance, a color image is a third-order tensor defined by two indices for the spatial variables and one index for the color mode. Other examples include color video, medical imaging data such as EEG and fMRI, and spatiotemporal data encountered in urban traffic monitoring. In the past two decades, tensors have become ubiquitous in signal processing, statistics and computer science.

Traditional unsupervised and supervised learning methods developed for one-dimensional signals do not translate well to higher order data structures, as they become computationally prohibitive with increasing dimensionality. Vectorizing high-dimensional inputs creates problems in nearly all machine learning tasks due to the exponentially increasing dimensionality, the distortion of the data structure and the difficulty of obtaining a sufficiently large training sample size.

In this thesis, we develop tensor-based approaches to various machine learning tasks. Existing tensor-based unsupervised and supervised learning algorithms extend many well-known algorithms, e.g. 2-D component analysis, support vector machines and linear discriminant analysis, with better performance and lower computational and memory costs. Most of these methods rely on Tucker decomposition, which has exponential storage complexity requirements; CANDECOMP/PARAFAC (CP) based methods, which might not have a solution; or Tensor Train (TT) based solutions, which suffer from exponentially increasing ranks. Many tensor-based methods have quadratic (with respect to the size of the data) or higher computational complexity, and similarly high memory complexity. Moreover, they are not always designed with the particular structure of the data in mind. Many of these methods use purely algebraic measures as the objective, which might not capture the local relations within the data. Thus, there is a need to develop new models with better computational and memory efficiency that are designed with the particular structure of the data and the problem in mind. Finally, as tensors represent the data with more faithfulness to the original structure compared to vectorization, they also allow coupling of heterogeneous data sources where the underlying physical relationship is known. Still, most of the work on coupled tensor decompositions does not explore supervised problems.

In order to address the issues around computational and storage complexity of tensor-based machine learning, in Chapter 2 we propose a new tensor train decomposition structure, which is a hybrid between Tucker and Tensor Train decompositions. The proposed structure is used to implement Tensor Train based supervised and unsupervised learning frameworks: linear discriminant analysis (LDA) and graph regularized subspace learning. The algorithm is designed to solve extremal eigenvalue-eigenvector pair computation problems, which can be generalized to many other methods.
The supervised framework, Tensor Train Discriminant Analysis (TTDA), is evaluated on a classification task at varying storage complexities with respect to classification accuracy and training time on four different datasets. The unsupervised approach, Graph Regularized TT, is evaluated on a clustering task with respect to clustering quality and training time at various storage complexities. Both frameworks are compared to discriminant analysis algorithms with similar objectives based on Tucker and TT decompositions.

In Chapter 3, we present an unsupervised anomaly detection algorithm for spatiotemporal tensor data. The algorithm models the anomaly detection problem as a low-rank plus sparse tensor decomposition problem, where the normal activity is assumed to be low-rank and the anomalies are assumed to be sparse and temporally continuous. We present an extension of this algorithm, where we utilize a graph regularization term in our objective function to preserve the underlying geometry of the original data. Finally, we propose a computationally efficient implementation of this framework by approximating the nuclear norm using graph total variation minimization. The proposed approach is evaluated both on simulated data, with varying levels of anomaly strength, anomaly length and number of missing entries in the tensor, and on urban traffic data.

In Chapter 4, we propose a geometric tensor learning framework using product graph structures for the tensor completion problem. Instead of purely algebraic measures such as rank, we use graph smoothness constraints that utilize geometric or topological relations within the data. We prove the equivalence of a Cartesian graph structure to a TT-based graph structure under some conditions. We show empirically that the relaxations introduced by these conditions do not deteriorate the recovery performance. We also outline a fully geometric learning method on product graphs for data completion.

In Chapter 5, we introduce a supervised learning method for heterogeneous data sources such as simultaneous EEG and fMRI. The proposed two-stage method first extracts features taking the coupling across modalities into account and then introduces kernelized support tensor machines for classification. We illustrate the advantages of the proposed method on simulated and real classification tasks with a small number of high-dimensional training samples.

ACKNOWLEDGEMENTS

Bismillah ir-Rahman ir-Rahim (In the name of Allah, the Most Gracious, the Most Merciful).

All praise be to Allah, the Lord of the universe(s). Peace be upon the prophet of Allah, Muhammad (s.a.w.), who is the guide and the best example to all humanity, especially for those who are in search of truth and wisdom. My thesis is but a minuscule result of the guidance of the prophet of Allah in the way of knowledge.

First and foremost, I would like to thank my advisor, Dr. Aviyente, for her guidance, patience, encouragement, and warm support throughout my research. Without her, the quality of my work would not have been a percent of what it is. She has been a role model and a motivation for me due to her deep dedication, firm discipline, and caring attentiveness to the work of her students.

I am deeply grateful to my wife, who sacrificed so much along this process and has been there for me through thick and thin. No amount of thanks could be enough. I would like to also thank my mother, who deserves more credit than any one person for any achievement or success of mine.
Also, to my father, because of whom I came to believe that a PhD was possible for me, who always supported me with his wisdom, and who never lost his belief in me. To my mother-in-law and father-in-law, who have always been there for us, kept us in their prayers, and were sincerely interested in my work. Finally, I thank my little son, who was my motivation, happiness, and emotional support.

There are too many people who supported me in writing this thesis for me to be able to mention each one of them. I ask for the forgiveness of any of my benefactors whom I could not thank by name. May Allah shower all who helped me with success and honour in both their lives.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1 INTRODUCTION
1.1 Background and Notation
1.2 Robust Principal Component Analysis
1.3 Geometric Learning
1.3.1 Robust PCA on Graphs
1.3.2 Spectral Geometric Matrix Completion
1.4 Tensor Decompositions
1.4.1 Tucker Decomposition (TD)
1.4.2 Canonical/Polyadic (CP) Decomposition
1.4.3 Tensor Train Decomposition (TT)
1.5 Robust PCA for Tensors
1.6 Linear Discriminant Analysis (LDA) for Tensors
1.7 Organization and Contributions of the Thesis
CHAPTER 2 MULTI-BRANCH TENSOR TRAIN STRUCTURE FOR SUPERVISED AND UNSUPERVISED LEARNING
2.1 Introduction
2.2 Tensor Train Discriminant Analysis
2.3 Multi-Branch Tensor Train Discriminant Analysis
2.3.1 Two-way Tensor Train Discriminant Analysis (2WTTDA)
2.3.2 Three-way Tensor Train Discriminant Analysis (3WTTDA)
2.4 Analysis of Storage, Training Complexity and Convergence
2.4.1 Computational Complexity
2.4.2 Convergence
2.5 Experiments
2.5.1 Data Sets
2.5.2 Classification Accuracy
2.5.3 Training Complexity
2.5.4 Convergence
2.5.5 Effect of Sample Size on Accuracy
2.5.6 Summary of Experimental Results
2.6 Graph Regularized Tensor Train Decomposition
2.6.1 Problem Statement
2.6.2 Optimization
2.6.3 Convergence
2.7 Experiments
2.7.1 MNIST
2.7.2 COIL
2.8 Conclusions
CHAPTER 3 TENSOR METHODS FOR ANOMALY DETECTION ON SPATIOTEMPORAL DATA
3.1 Introduction
3.2 Related Work
3.3 Robust Low-Rank Tensor Decomposition for Anomaly Detection
3.3.1 Problem Statement
3.3.2 Optimization
3.3.3 Computational Complexity
3.4 Low-rank On Graphs Plus Temporally Smooth Sparse Decomposition
3.4.1 Optimization
3.4.2 Computational Complexity of LOGSS
3.5 Convergence
3.6 Anomaly Scoring
3.7 Experiments
3.7.1 Data Description
3.7.2 Parameter Selection
3.7.3 Experiments on Synthetic Data
3.7.4 Experiments on Real Data
3.8 Conclusions
CHAPTER 4 GEOMETRIC TENSOR LEARNING
4.1 Introduction
4.2 Tensor Train Robust PCA on Graphs
4.2.1 Kronecker Structured Graphs
4.2.2 Optimization
4.2.3 Computation and Memory Complexity for Graphs
4.3 Experiments
4.3.1 Synthetic Data
4.3.2 Real Data
4.4 Conclusions
CHAPTER 5 COUPLED SUPPORT TENSOR MACHINE
5.1 Introduction
5.2 Related Work
5.2.1 Coupled Matrix Tensor Factorization
5.2.2 CP-STM for Tensor Classification
5.2.3 Multiple Kernel Learning
5.3 Methods
5.3.1 Multimodal Tensor Factorization
5.3.2 Coupled Support Tensor Machine (C-STM)
5.4 Model Estimation
5.5 Experiments
5.5.1 Parameter Selection
5.5.1.1 Multimodal Tensor Factorization
5.5.1.2 C-STM
5.5.2 Simulated Data
5.5.2.1 Kernel Selection
5.5.3 EEG-fMRI Data
5.6 Conclusion
CHAPTER 6 CONCLUSIONS
6.1 Future Work
6.1.1 Multi-Branch Tensor Learning
6.1.2 Tensor Methods for Anomaly Detection
6.1.3 Geometric Tensor Learning
6.1.4 Supervised Coupled Tensor Learning
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Storage complexities of different tensor decomposition structures.
Table 2.2: Computational complexities of various algorithms. The number of iterations to find the subspaces is denoted as $t_c$ for CMDA and $t_t$ for TT-based methods. $C_s = 2CK$. ($r \ll I$, $t_t r(r + N/f - 1) \ll C_s$, and $I^{N/f} \gg r^6$.)
Table 2.3: Classification accuracy (top) and training time (bottom) with standard deviation for various methods and datasets.
Table 3.1: Properties of anomaly detection methods used in the experiments. The acronyms refer to the different attributes of the cost function: (LR) low-rank, (SP) sparse, (WLR) weighted low-rank, (SR) smoothness regularization.
Table 3.2: Mean and standard deviation of AUC values for various $c$ and $P$. In the experiments for each variable, the remaining variables are fixed at $c = 2.5$, $P = 0\%$, $l = 7$ and $m = 2.3\%$. The proposed methods outperform the other algorithms in all cases significantly, with $p < 0.001$.
Table 3.3: Mean and standard deviation of run times (seconds) for various methods.
Table 3.4: Events of interest for NYC in 2018.
Table 3.5: Results for 2018 NYC Yellow Taxi data. Columns indicate the percentage of selected points with top anomaly scores. The table entries correspond to the number of events detected at the corresponding percentage.
Table 3.6: Results on 2018 NYC Bike Trip data.
Table 4.1: Denoising performance on synthetic data against varying levels of gross noise $c\%$ for various methods.
Table 4.2: Denoising performance on real data against varying levels of gross noise $c\%$ for various methods.
Table 5.1: Distribution specifications for the simulation study; MVN stands for multivariate normal distribution. $I$ are identity matrices. Bold numbers are vectors whose elements are all equal to the given number.
Table 5.2: Various kernel combination schemes. Note that $K_2^{(2)} = K_1^{(3)}$.
Table 5.3: Classification accuracy using different kernel combinations.
Table 5.4: Real data result: simultaneous EEG-fMRI data trial classification (mean of performance metrics with standard deviations in subscripts).

LIST OF FIGURES

Figure 1.1: Illustration of tensors and tensor merging product using tensor network notations. Each node represents a tensor and each edge represents a mode of the tensor. (a) Tensor $\mathcal{A}$, (b) Tensor $\mathcal{B}$, (c) Tensor merging product between modes $(n, m)$ and $(n+1, m-1)$.
Figure 1.2: Tensor network notation for Tucker decomposition.
Figure 1.3: Tensor Train decomposition of $\mathcal{Y}$ using tensor merging products.
Figure 2.1: Tensor $\mathcal{A}_n$ is formed by first merging $\mathcal{U}_n^R$, $\mathcal{U}_{n-1}^L$ and $\mathcal{S}$, and then applying the trace operation across the 4th and 8th modes of the resulting tensor. The green line at the bottom of the diagram refers to the trace operator.
Figure 2.2: Illustration of the proposed methods (compare (a) and (b) with Figures 1.2 and 1.3): (a) the proposed tensor network structure for 2WTT; (b) the proposed tensor network structure for 3WTT; (c) the flow diagram for 2WTTDA (Algorithm 2.2); (d) the flow diagram for 3WTTDA.
Figure 2.3: Comparisons with a BTT based Ritz pair computation algorithm:
(a) classification accuracy and (b) training time with respect to normalized storage cost.
Figure 2.4: Classification accuracy vs. normalized storage cost of the different methods for: (a) COIL-100, (b) Weizmann Face, (c) Cambridge Hand Gesture and (d) UCF-101. All TD based methods are denoted using 'x', TT based methods using '+' and the proposed methods using '*'. STTM and LDA are denoted using '△' and 'o', respectively.
Figure 2.5: Training complexity vs. normalized storage cost of the different methods for: (a) COIL-100, (b) Weizmann Face, (c) Cambridge Hand Gesture, and (d) UCF-101.
Figure 2.6: Convergence curve for TTDA on COIL-100. Objective value vs. the number of iterations is shown.
Figure 2.7: Comparison of classification accuracy vs. training sample size on the Weizmann Face dataset for different methods.
Figure 2.8: (a) Normalized mutual information vs. storage complexity of different methods for the MNIST dataset. (b) Computation time vs. storage complexity of different methods for the MNIST dataset.
Figure 2.9: (a) Normalized mutual information vs. storage complexity of different methods for the COIL dataset. (b) Computation time vs. storage complexity of different methods for the COIL dataset.
Figure 3.1: Mean AUC values for various choices of: (a) $\lambda$ and $\gamma$, (b) $\theta$ and $\psi_1$, (c) $\psi_2$ and $\psi_4$. Mean AUC values across 10 random experiments are reported for each hyperparameter pair. For each set of experiments, the remaining hyperparameters are fixed.
Figure 3.2: AUC of ROC w.r.t. $l$ and $m$ with $c = 2.5$, $P = 0\%$.
Figure 3.3: ROC curves for various amplitudes of anomalies. Higher amplitude means more separability. $c =$ (a) 1.5, (b) 2, (c) 2.5. ($P = 0\%$, $l = 7$, $m = 2.3\%$)
Figure 3.4: ROC curves for varying percentages of missing data: (a) $P = 20\%$, (b) $P = 40\%$, (c) $P = 60\%$. ($c = 2.5$, $l = 7$, $m = 2.3\%$)
Figure 3.5: Bike activity data, the extracted sparse part and low-rank part for the July 4th celebrations at the Hudson River banks. (a) Real data, where the traffic for 52 Wednesdays is shown along with the traffic on Independence Day and the average traffic; (b) sparse tensor, where the curve corresponding to the anomaly is highlighted; (c) low-rank tensor, with the curve corresponding to Independence Day highlighted.
Figure 4.1: Phase diagrams for missing data recovery.
Figure 5.1: Illustration of the Coupled Tensor Matrix Model.
Figure 5.2: C-STM model pipeline.
Figure 5.3: Simulation result: average accuracy rates shown in bar plots; standard deviation of accuracy rates shown by error bars.
Figure 5.4: Region of Interest (ROI).

LIST OF ALGORITHMS

Algorithm 2.1: Tensor Train Discriminant Analysis (TTDA)
Algorithm 2.2: Two-Way Tensor Train Discriminant Analysis (2WTTDA)
Algorithm 2.3: Graph Regularized Tensor Train-ADMM (GRTT-ADMM)
Algorithm 3.1: GLOSS
Algorithm 3.2: LOGSS
Algorithm 4.1: TTRPCA-G/nG
Algorithm 5.1: ACMTF Decomposition
Algorithm 5.2: Coupled Support Tensor Machine

CHAPTER 1
INTRODUCTION

With the advance of sensing and data acquisition technology, it is now possible to collect data from different modalities and sources simultaneously. Some examples include medical imaging data such as fMRI, hyperspectral images, computer vision data such as multiview camera recordings, liquid-chromatography/mass spectrometry data in chemometrics, and large scale gene expression data in bioinformatics. Most of these data are multi-dimensional in nature and can be represented by multiway arrays known as tensors [48]. For instance, a color image is a third-order tensor defined by two indices for the spatial variables and one index for the color mode. Similarly, a video comprised of color images is a fourth-order tensor, time being the fourth dimension besides the spatial and spectral ones.

In order to efficiently process large datasets, there is an increasing interest in near real-time processing methods, especially in the case of multimedia, remote-sensing and biological data. Another challenge is to come up with methods that are scalable with the size of the data. Although the computational power and memory sizes of modern systems keep increasing, traditional methods are exponentially expensive. As the data collected are not always clean and complete, there is also a need for methods that are robust against missing data and outliers, and that can account for the underlying structure or topology of the data. Finally, there is also an emerging interest in extracting information from multiple heterogeneous modalities simultaneously. As tensor decompositions are natural, sparse and distributed representations of big data, all the aforementioned problems have spurred an interest in efficient tensor algorithms suitable for massive datasets.

Traditional unsupervised and supervised learning methods developed for one-dimensional signals, or vectors, do not translate well to higher order data structures as they become computationally prohibitive with increasing dimensionality, i.e., the 'curse of dimensionality'. Vectorizing high-dimensional inputs also creates problems in nearly all machine learning tasks due to the difficulty of obtaining a sufficiently large training sample size. Moreover, important information about the structure of the high-dimensional space that the data lie in is lost through vectorization, which reduces the effectiveness of these methods. Therefore, there is a need for extracting compact representations from the original tensor data that result in accurate and interpretable models. Tensor decompositions can be defined as structures which represent higher order data as a set of low-order core tensors, thus allowing for better interpretability and computational advantages [45].
Tensor decompositions are natural extensions of matrix decompositions or factorizations, such as the singular value decomposition (SVD), principal component analysis (PCA) and non-negative matrix factorization (NMF), to higher order data. However, there are various advantages of tensor decompositions compared to matrix factorizations, such as the ability to efficiently parameterize high dimensional spaces and to account for intrinsic multidimensional and distributed patterns present in data. Thus, they allow models to capture multiple interactions and couplings in data in a more structured and interpretable manner.

Tensors first started to gain attention in psychometrics [174, 175] and then in chemometrics [27, 160]. After being introduced to many fields such as signal processing [48, 47, 46, 18, 183], statistics and computer science [38, 43, 96], tensors have received great attention. The work on tensors is too diverse and numerous to be covered in this thesis, so we refer the interested reader to the survey papers [93, 44, 45, 159, 81]. Recent years have also seen a growth in the development of tensor decomposition methods for machine learning. Various extensions to tensors of unsupervised methods such as PCA, robust PCA, SVD, NMF, neighborhood preserving embedding (NPE), locally linear embedding (LLE) and canonical correlation analysis (CCA) [48, 70, 154, 106, 78, 91], and of supervised methods such as linear discriminant analysis (LDA), support vector machines (SVM), regression and deep neural networks [195, 196, 168, 112, 104, 74, 38, 97, 137, 96, 95], have been explored. Some applications include face recognition, object classification, dimensionality reduction, clustering, anomaly detection, learning latent variable models and prediction.

Although most tensor based methods are developed as extensions of existing vector based methods, it is possible to develop new approaches with structures specific to tensors. By utilizing tensors, one can introduce structural models in many different ways, such as deep tensor network structures, Kronecker or Cartesian structured product graphs, and multi-modal coupling of data collected from different physical sensors, that are otherwise not relevant or applicable. Tensor based methods also often have milder requirements for theoretical performance analysis such as convergence, sample size, and recovery guarantees [70, 44, 45].

1.1 Background and Notation

In this thesis, we denote numbers and scalars with letters such as $x, y, N$. Vectors are denoted by boldface lowercase letters, e.g. $\mathbf{x}, \mathbf{y}$. Matrices are denoted by italic capital letters such as $X, Y$. Multi-dimensional tensors are denoted by calligraphic capital letters such as $\mathcal{X}, \mathcal{Y}$. The order of a tensor is the number of dimensions of the data hypercube, also known as ways or modes. For example, a scalar can be regarded as a zeroth-order tensor, a vector as a first-order tensor, and a matrix as a second-order tensor. Let $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ be a tensor. Vectors obtained by fixing all indices of the tensor except the one that corresponds to the $n$th mode are called mode-$n$ fibers and are denoted as $\mathbf{a}_{i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N} \in \mathbb{R}^{I_n}$.

Definition 1. (Vectorization, Matricization and Reshaping) $\mathrm{V}(\cdot)$ is a vectorization operator such that $\mathrm{V}(\mathcal{A}) \in \mathbb{R}^{I_1 I_2 \ldots I_N \times 1}$. $\mathrm{T}_n(\cdot)$ is a tensor-to-matrix reshaping operator defined as $\mathrm{T}_n(\mathcal{A}) \in \mathbb{R}^{I_1 \ldots I_n \times I_{n+1} \ldots I_N}$, and the inverse operator is denoted as $\mathrm{T}_n^{-1}(\cdot)$.
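To make the operators of Definition 1 concrete, the following is a minimal numpy sketch (not code from this thesis). It assumes a fixed row-major index ordering; the helper names are ours.

```python
import numpy as np

def vectorize(A):
    """V(A): stack all entries of A into a column vector of length I1*I2*...*IN."""
    return A.reshape(-1, 1)

def unfold_first_n(A, n):
    """T_n(A): reshape A into a matrix of size (I1*...*In) x (I_{n+1}*...*I_N)."""
    I = A.shape
    return A.reshape(int(np.prod(I[:n])), int(np.prod(I[n:])))

def fold_first_n(M, shape):
    """Inverse operator T_n^{-1}(.): reshape the matrix back to the original tensor shape."""
    return M.reshape(shape)

A = np.arange(2 * 3 * 4).reshape(2, 3, 4)                            # a 2 x 3 x 4 tensor
print(vectorize(A).shape)                                            # (24, 1)
print(unfold_first_n(A, 2).shape)                                    # (6, 4)
print(np.allclose(fold_first_n(unfold_first_n(A, 2), A.shape), A))   # True
```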
Definition 2. (Mode-$n$ unfolding and canonical unfolding, [197]) The mode-$n$ unfolding of a tensor $\mathcal{A}$ is defined as $A_{(n)} \in \mathbb{R}^{I_n \times \prod_{n'=1, n' \neq n}^{N} I_{n'}}$, where the mode-$n$ fibers of the tensor $\mathcal{A}$ are the columns of $A_{(n)}$ and the remaining modes are organized accordingly along the rows. The mode-$n$ canonical unfolding, or mode-$1, 2, \ldots, n$ unfolding, of a tensor $\mathcal{Y}$ is defined as $Y_{[n]} \in \mathbb{R}^{\prod_{i=1}^{n} I_i \times \prod_{i=n+1}^{N} I_i}$.

Definition 3. (Left and right unfolding) The left unfolding operator creates a matrix from a tensor by taking all modes except the last mode as row indices and the last mode as column indices, i.e., $\mathrm{L}(\mathcal{A}) \in \mathbb{R}^{I_1 I_2 \ldots I_{N-1} \times I_N}$, which is equivalent to $\mathrm{T}_{N-1}(\mathcal{A})$. Right unfolding transforms a tensor into a matrix by taking all the first-mode fibers as column vectors, i.e., $\mathrm{R}(\mathcal{A}) \in \mathbb{R}^{I_1 \times I_2 I_3 \ldots I_N}$, which is equivalent to $\mathrm{T}_1(\mathcal{A})$. The inverses of these operators are denoted as $\mathrm{L}^{-1}(\cdot)$ and $\mathrm{R}^{-1}(\cdot)$, respectively. A tensor is defined to be left (right)-orthogonal if its left (right) unfolding is orthogonal.

Definition 4. (Tensor trace) The tensor trace is defined on slices of a tensor and contracts them to scalars. Let $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with $I_{k'} = I_k$; then the trace operator on modes $k'$ and $k$ is defined as:
$$\mathcal{D} = \mathrm{tr}_{kk'}(\mathcal{A}) = \sum_{i_{k'} = i_k = 1}^{I_k} \mathcal{A}(:, \ldots, i_{k'}, :, \ldots, i_k, :, \ldots, :),$$
where $\mathcal{D} \in \mathbb{R}^{I_1 \times \cdots \times I_{k'-1} \times I_{k'+1} \times \cdots \times I_{k-1} \times I_{k+1} \times \cdots \times I_N}$ is an $(N-2)$-mode tensor.

Definition 5. (Tensor Merging Product) The tensor merging product connects two tensors along given sets of modes. For two tensors $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\mathcal{B} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_M}$ where $I_n = J_m$ and $I_{n+1} = J_{m-1}$ for some $n$ and $m$, the tensor merging product is given by [45]:
$$\mathcal{C} = \mathcal{A} \times_{n, n+1}^{m, m-1} \mathcal{B}.$$
[Figure 1.1: Illustration of tensors and tensor merging product using tensor network notations. Each node represents a tensor and each edge represents a mode of the tensor. (a) Tensor $\mathcal{A}$, (b) Tensor $\mathcal{B}$, (c) Tensor merging product between modes $(n, m)$ and $(n+1, m-1)$.]
$\mathcal{C} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+2} \times \cdots \times I_N \times J_1 \times \cdots \times J_{m-2} \times J_{m+1} \times \cdots \times J_M}$ is an $(N + M - 4)$-mode tensor that is calculated as:
$$\mathcal{C}(i_1, \ldots, i_{n-1}, i_{n+2}, \ldots, i_N, j_1, \ldots, j_{m-2}, j_{m+1}, \ldots, j_M) = \sum_{t_1=1}^{I_n} \sum_{t_2=1}^{J_{m-1}} \mathcal{A}(i_1, \ldots, i_{n-1}, i_n = t_1, i_{n+1} = t_2, i_{n+2}, \ldots, i_N)\, \mathcal{B}(j_1, \ldots, j_{m-2}, j_{m-1} = t_2, j_m = t_1, j_{m+1}, \ldots, j_M).$$
A graphical representation of the tensors $\mathcal{A}$ and $\mathcal{B}$ and the tensor merging product defined above is given in Figure 1.1. A special case of the tensor merging product can be considered for the case where $I_n = J_m$ for all $n, m \in \{1, \ldots, N-1\}$, $M \geq N$. In this case, the tensor merging product across the first $N-1$ modes is defined as:
$$\mathcal{C}' = \mathcal{A} \times_{1, \ldots, N-1}^{1, \ldots, N-1} \mathcal{B}, \qquad (1.1)$$
where $\mathcal{C}' \in \mathbb{R}^{I_N \times J_N \times \cdots \times J_M}$. This can equivalently be written as:
$$\mathrm{R}(\mathcal{C}') = \mathrm{L}(\mathcal{A})^\top \mathrm{T}_{N-1}(\mathcal{B}), \qquad (1.2)$$
where $\mathrm{R}(\mathcal{C}') \in \mathbb{R}^{I_N \times \prod_{m=N}^{M} J_m}$.

Definition 6. (Mode-$n$ product) The mode-$n$ product of a tensor $\mathcal{A}$ and a matrix $U \in \mathbb{R}^{J \times I_n}$ is denoted as $\mathcal{B} = \mathcal{A} \times_n U$ and defined as $b_{i_1, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_n, \ldots, i_N}\, u_{j, i_n}$, where $\mathcal{B} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$. Notice that this is equivalent to the tensor merging product $\mathcal{B} = \mathcal{A} \times_n^2 U$.
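As an illustration of Definitions 5 and 6, the following sketch computes a mode-$n$ product and a small tensor merging product with numpy; the helper names and shapes are illustrative assumptions, not code from this thesis.

```python
import numpy as np

def mode_n_product(A, U, n):
    """B = A x_n U (0-based mode n here): contract mode n of A with the columns of U (J x I_n)."""
    return np.moveaxis(np.tensordot(U, A, axes=(1, n)), 0, n)

A = np.random.rand(2, 3, 4)      # I1 x I2 x I3
U = np.random.rand(5, 3)         # J x I2
B = mode_n_product(A, U, 1)
print(B.shape)                   # (2, 5, 4)

# Tensor merging product C = A x_{2,3}^{2,1} B_t for A in R^{2x3x4} and B_t in R^{4x3x5}:
# modes (I2, I3) of A are contracted with modes (J2, J1) of B_t, leaving modes (I1, J3).
B_t = np.random.rand(4, 3, 5)
C = np.einsum('iab,bac->ic', A, B_t)
print(C.shape)                   # (2, 5)
```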
Definition 7. (Mode-$n$ Graph Laplacian) The mode-$n$ adjacency matrix $W^n$ corresponding to the mode-$n$ similarity graph of a tensor $\mathcal{Y}$ is defined as:
$$W^n_{ss'} = \begin{cases} e^{-\frac{\|\mathbf{y}_{(n),s} - \mathbf{y}_{(n),s'}\|_F^2}{2\sigma_n^2}}, & \text{if } \mathbf{y}_{(n),s} \in \mathcal{N}_k(\mathbf{y}_{(n),s'}) \text{ or } \mathbf{y}_{(n),s'} \in \mathcal{N}_k(\mathbf{y}_{(n),s}), \\ 0, & \text{otherwise}, \end{cases}$$
where $\mathcal{N}_k(\mathbf{y}_{(n),s})$ is the Euclidean $k$-nearest neighborhood of the $s$th row of $Y_{(n)}$, $\mathbf{y}_{(n),s} \in \mathbb{R}^{\prod_{i=1, i \neq n}^{N} I_i}$. The mode-$n$ graph Laplacian $\Phi_n$ is then defined as $\Phi_n = D_n - W^n$, where $D_n$ is a diagonal degree matrix with $D^n_{i,i} = \sum_{i'=1}^{I_n} W^n_{i,i'}$. The eigendecomposition of $\Phi_n$ can be written as $\Phi_n = P_n \Lambda_n P_n^\top$, where $P_n$ is the matrix of eigenvectors and $\Lambda_n$ is a diagonal matrix with the eigenvalues on the diagonal, in non-descending order.

Definition 8. (Product Graphs) Let $\{\mathcal{G}_n\}$ be a set of $N$ graphs, where $\mathcal{G}_n = \{V_n, E_n\}$ with $V_n$ the vertices and $E_n$ the edges of $\mathcal{G}_n$. Let $\Phi_n \in \mathbb{R}^{I_n \times I_n}$ be the graph Laplacian of $\mathcal{G}_n$. A product graph $\mathcal{G}$ is created by unifying $\{\mathcal{G}_n\}$ in a specific manner. A Kronecker product graph, or Kronecker graph, is defined as [151]:
$$\mathcal{G} = \mathcal{G}_N \otimes \mathcal{G}_{N-1} \otimes \cdots \otimes \mathcal{G}_1, \qquad (1.3)$$
and has a Laplacian defined as:
$$\Phi = \Phi_N \otimes \Phi_{N-1} \otimes \cdots \otimes \Phi_1. \qquad (1.4)$$
A Cartesian product graph, on the other hand, is defined as [151]:
$$\mathcal{G} = \mathcal{G}_N \otimes \mathrm{I}_{\prod_{n=1}^{N-1} I_n} + \mathrm{I}_{I_N} \otimes \mathcal{G}_{N-1} \otimes \mathrm{I}_{\prod_{n=1}^{N-2} I_n} + \cdots + \mathrm{I}_{\prod_{n=2}^{N} I_n} \otimes \mathcal{G}_1, \qquad (1.5)$$
with Laplacian:
$$\Phi = \Phi_N \otimes \mathrm{I}_{\prod_{n=1}^{N-1} I_n} + \mathrm{I}_{I_N} \otimes \Phi_{N-1} \otimes \mathrm{I}_{\prod_{n=1}^{N-2} I_n} + \cdots + \mathrm{I}_{\prod_{n=2}^{N} I_n} \otimes \Phi_1. \qquad (1.6)$$
The graph smoothness objective defined over all modes of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$,
$$\sum_{n=1}^{N} \mathrm{tr}(X_{(n)}^\top \Phi_n X_{(n)}), \qquad (1.7)$$
is equivalent to:
$$\mathrm{V}(\mathcal{X})^\top \Phi\, \mathrm{V}(\mathcal{X}), \qquad (1.8)$$
where $\Phi$ is the graph Laplacian of the Cartesian product graph.
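The equivalence of (1.7) and (1.8) can be checked numerically. The sketch below is our own illustration (simple path-graph Laplacians stand in for the similarity graphs of Definition 7); it assumes a column-major vectorization so that mode 1 corresponds to the right-most Kronecker factor in (1.6).

```python
import numpy as np

def path_laplacian(n):
    # Laplacian of a path graph on n vertices (stand-in for a mode similarity graph).
    W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    return np.diag(W.sum(axis=1)) - W

def cartesian_laplacian(phis):
    # Kronecker-sum Laplacian of the Cartesian product graph, eq. (1.6).
    sizes = [phi.shape[0] for phi in phis]
    Phi = np.zeros((int(np.prod(sizes)),) * 2)
    for n, phi in enumerate(phis):
        left = int(np.prod(sizes[n + 1:]))    # identity over the slower-varying modes
        right = int(np.prod(sizes[:n]))       # identity over the faster-varying modes
        Phi += np.kron(np.eye(left), np.kron(phi, np.eye(right)))
    return Phi

dims = (3, 4, 5)
X = np.random.rand(*dims)
phis = [path_laplacian(i) for i in dims]

# Left-hand side: mode-wise smoothness terms, eq. (1.7).
lhs = 0.0
for n in range(len(dims)):
    Xn = np.moveaxis(X, n, 0).reshape(dims[n], -1)    # mode-n unfolding X_(n)
    lhs += np.trace(Xn.T @ phis[n] @ Xn)

# Right-hand side: vectorized form on the product graph, eq. (1.8).
x = X.flatten(order='F')
rhs = x @ cartesian_laplacian(phis) @ x
print(np.isclose(lhs, rhs))                           # True
```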
Definition 9. (Mode-$n$ Concatenation) Mode-$n$ concatenation unfolds the input tensors along mode $n$ and stacks the unfolded matrices across rows:
$$\mathrm{cat}_n(\mathcal{A}, \mathcal{A}) = \begin{bmatrix} A_{(n)}^\top & A_{(n)}^\top \end{bmatrix}^\top. \qquad (1.9)$$
If the input is a set of tensors, e.g. $\{\mathcal{A}\} = \{\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_M\}$, where $\mathcal{A}_m \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, $\forall m \in \{1, \ldots, M\}$, then $\mathrm{cat}_n(\{\mathcal{A}\})$ stacks all mode-$n$ unfoldings of the tensors $\{\mathcal{A}\}$ across rows into a matrix of size $M I_n \times \prod_{n'=1, n' \neq n}^{N} I_{n'}$.

Definition 10. (Tensor norms) In this thesis, we employ three different tensor norms. The Frobenius norm of a tensor is defined as $\|\mathcal{A}\|_F = \sqrt{\sum_{i_1, i_2, \ldots, i_N} a_{i_1, i_2, \ldots, i_N}^2}$. The $\ell_1$ norm of a tensor is defined as $\|\mathcal{A}\|_1 = \sum_{i_1, i_2, \ldots, i_N} |a_{i_1, i_2, \ldots, i_N}|$. In this thesis, the nuclear norm of a tensor is defined as the weighted sum of the nuclear norms of all mode-$n$ unfoldings of the tensor, namely $\|\mathcal{A}\|_* = \sum_{n=1}^{N} \psi_n \|A_{(n)}\|_*$, where the $\psi_n$ are the weights corresponding to each mode [173, 70].

Definition 11. (Support Set) Let $\Omega$ be a support set defined for a tensor $\mathcal{A}$, i.e. $\Omega \subseteq \{1, \ldots, I_1\} \times \{1, \ldots, I_2\} \times \cdots \times \{1, \ldots, I_N\}$. The projection operator on this support set, $P_\Omega$, is defined as:
$$P_\Omega[\mathcal{A}]_{i_1, i_2, \ldots, i_N} = \begin{cases} \mathcal{A}_{i_1, i_2, \ldots, i_N}, & (i_1, i_2, \ldots, i_N) \in \Omega, \\ 0, & \text{otherwise}. \end{cases} \qquad (1.10)$$
The orthogonal complement of the operator $P_\Omega$ is defined in a similar manner as:
$$P_{\Omega^\perp}[\mathcal{A}]_{i_1, i_2, \ldots, i_N} = \begin{cases} \mathcal{A}_{i_1, i_2, \ldots, i_N}, & (i_1, i_2, \ldots, i_N) \notin \Omega, \\ 0, & \text{otherwise}. \end{cases} \qquad (1.11)$$

1.2 Robust Principal Component Analysis

Principal component analysis (PCA) is one of the most common algorithms for dimensionality reduction [30]. The aim of PCA is to find a subspace of smaller dimension in which the projected data has the highest variance among all possible subspaces with the same dimension. This model is based on the assumption that the data has an i.i.d. Gaussian distribution. Thus, with grossly corrupted or partially observed data, PCA fails to estimate the best subspace, or the best low-rank approximation to the data. To alleviate the issues with these outliers, robust PCA (RPCA) methods [30, 35] have been proposed, where the aim is to separate the data into a low-rank part, corresponding to the clean data, and a sparse part, corresponding to gross corruptions or outliers. In [30, 35], this problem is formulated as:
$$\underset{L, S}{\text{minimize}} \; \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad L + S = Y, \qquad (1.12)$$
where the observed data is $Y$ and $\lambda$ is a regularization parameter. It was proven that, under some conditions on the true $L$ and $S$, an alternating minimization based approach can recover the true $L$. This algorithm can also be used for matrix completion [30].

Although very successful in recovering low-rank data from gross corruptions, RPCA has some shortcomings. First, the algorithm requires an SVD at every step of the optimization, which becomes intractable with increasing data size. Second, the underlying assumption of incoherence is not always easy to satisfy, as in many real datasets the underlying clean data is structured and can be sparse. Finally, the outliers themselves could be structured, which violates the randomness of the support of the sparse part.
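For illustration, a minimal ADMM-style sketch of problem (1.12) (principal component pursuit) is given below, using the standard singular value thresholding and soft thresholding updates; parameter choices follow common defaults, and this is not the solver developed in later chapters.

```python
import numpy as np

def soft_threshold(X, tau):
    # Proximal operator of the l1 norm.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    # Proximal operator of the nuclear norm (singular value thresholding).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def rpca(Y, lam=None, mu=None, n_iter=200, tol=1e-7):
    """Minimize ||L||_* + lam*||S||_1 subject to L + S = Y, eq. (1.12)."""
    m, n = Y.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 0.25 * m * n / np.abs(Y).sum()
    L = np.zeros_like(Y); S = np.zeros_like(Y); Z = np.zeros_like(Y)   # Z: dual variable
    for _ in range(n_iter):
        L = svd_threshold(Y - S + Z / mu, 1.0 / mu)    # nuclear-norm proximal step
        S = soft_threshold(Y - L + Z / mu, lam / mu)   # l1-norm proximal step
        residual = Y - L - S
        Z = Z + mu * residual                          # dual ascent on the constraint
        if np.linalg.norm(residual) <= tol * np.linalg.norm(Y):
            break
    return L, S

# Example: a rank-2 matrix with sparse gross corruptions.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
S0 = 10.0 * (rng.random((50, 40)) < 0.05) * rng.standard_normal((50, 40))
L_hat, S_hat = rpca(L0 + S0)
print(np.linalg.norm(L_hat - L0) / np.linalg.norm(L0))   # relative error of the low-rank part
```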
1.3 Geometric Learning

In this thesis, we explore a variety of approaches to supervised and unsupervised learning that exploit underlying geometric or topological relations between the rows of each mode. Such relations are common in many types of real data, e.g. recommendation systems, drug-target interaction and genomics. Usually, such relations are represented by a graph, or by a set of graphs with relationships across these graphs. We unite these approaches under the term geometric learning. Such geometric structures within data are often not taken into account by many methods, as purely algebraic terms, such as rank, cannot account for them [22]. In many modern signal processing applications, graph-based priors have been used to extract low-dimensional structure from high dimensional data [57, 83, 170]. Representation of a signal on a graph is also motivated by the emerging field of signal processing on graphs, based on notions of spectral graph theory [155, 154, 146]. The underlying assumption is that high-dimensional data samples lie on or close to a smooth low-dimensional manifold, represented by a graph $G$. In this section, we explore several extensions of well-known dimensionality reduction algorithms using geometric learning concepts.

In the specific case of manifold learning, geometric learning is utilized as follows. Tensor samples for a data set with no labels are denoted as $\mathcal{Y}_s \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, where $s \in \{1, \ldots, S\}$. For two tensor samples, $\mathcal{Y}_s$ and $\mathcal{Y}_{s'}$, and their respective low-dimensional projections, $X_s$ and $X_{s'}$, the regularization that is employed to preserve the underlying geometry can be formulated as:
$$\sum_{s=1}^{S} \sum_{\substack{s'=1 \\ s' \neq s}}^{S} \|X_s - X_{s'}\|_F^2\, w_{ss'}, \qquad (1.13)$$
where $X_s$ is the projection of $\mathcal{Y}_s$ onto a lower dimensional manifold and $w_{ss'}$ is the mode-$(N+1)$ adjacency or similarity graph. One can also equivalently rewrite (1.13) as $\mathrm{tr}(X \Phi_{N+1} X^\top)$, where the $s$th column of $X$ is the vectorized projection $\mathrm{V}(X_s)$.

1.3.1 Robust PCA on Graphs

In [155], it was shown that a low-rank approximation, $U$, to a data matrix, $X$, can be obtained by solving the following optimization problem:
$$\min_{U} \|X - U\|_1 + \gamma_1 \mathrm{tr}(U \Phi_1 U^\top) + \gamma_2 \mathrm{tr}(U^\top \Phi_2 U), \qquad (1.14)$$
where $\Phi_1$ and $\Phi_2$ are the graph Laplacians corresponding to graphs connecting the samples (rows) of $X$ and the features (columns) of $X$, respectively. The above formulation assumes that the data is low-rank on graphs, i.e. lies on a smooth low-dimensional manifold. This can be quantified by a graph stationarity measure,
$$sr(\Gamma_n) = \frac{\|\mathrm{diag}(\Gamma_n)\|_2}{\|\Gamma_n\|_F},$$
where $\Gamma_n = P_n^\top C_n P_n$, with $C_n$ being the covariance matrix of each mode-$n$ unfolding, i.e., $n = 1$ or $2$ in the case of data matrices [146, 154].

1.3.2 Spectral Geometric Matrix Completion

In [84, 22], it was shown that using two graphs that encode the relationships between the rows, $\mathcal{G}_1$, and the columns, $\mathcal{G}_2$, of a matrix, it is possible to extract a signal $X$ that is band-limited on the Cartesian product graph $\mathcal{G} = \mathcal{G}_2 \otimes \mathrm{I} + \mathrm{I} \otimes \mathcal{G}_1$ from data $Y$ with missing entries and corruption. The problem is formulated as:
$$\underset{X}{\text{minimize}} \; \|P_\Omega[X - Y]\|_F^2 + \mathrm{tr}(X^\top \Phi_1 X) + \mathrm{tr}(X \Phi_2 X^\top), \qquad (1.15)$$
where the last two terms are called the graph smoothness terms, and $\Phi_1$ and $\Phi_2$ are the graph Laplacians corresponding to the rows and columns of the matrix, respectively. $\Phi = \Phi_2 \otimes \mathrm{I} + \mathrm{I} \otimes \Phi_1$ is the Laplacian matrix associated with $\mathcal{G}$.

1.4 Tensor Decompositions

In many machine learning problems, low-rank decompositions or matrix factorizations are the main computational bottleneck as they are often $\mathcal{O}(n^3)$, i.e. cubic, in computational complexity. Tensor decompositions provide alternative factorization models that preserve the original multiway structure of the data, which decreases both the computational complexity and the storage cost of the factors. In this section, we review three well-known tensor decomposition approaches: Tucker Decomposition (TD), Canonical Decomposition (CANDECOMP/PARAFAC, CP) and Tensor Train Decomposition (TT).

1.4.1 Tucker Decomposition (TD)

Tucker decomposition represents a tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ as:
$$\mathcal{Y} = \mathcal{X} \times_1 U_1 \times_2 U_2 \cdots \times_N U_N,$$
where $U_n \in \mathbb{R}^{I_n \times R_n}$ with $R_n < I_n$ are called the low-rank factor matrices and $\mathcal{X} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ is the core tensor. This decomposition is illustrated in Figure 1.2.

[Figure 1.2: Tensor network notation for Tucker decomposition.]

Instead of the mode-$n$ product, Tucker decomposition can also be represented using the tensor merging product as:
$$\mathcal{Y} = \mathcal{X} \times_1^2 U_1 \times_2^2 U_2 \cdots \times_N^2 U_N.$$
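As a concrete reference point for the Tucker model above, the following is a sketch of a truncated higher-order SVD (HOSVD), one standard way of computing the factors $U_n$ and the core $\mathcal{X}$; the function names are ours, and the discriminative algorithms proposed in Chapter 2 do not use plain HOSVD.

```python
import numpy as np

def unfold(A, n):
    """Mode-n unfolding A_(n): mode-n fibers become columns."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def mode_n_product(A, U, n):
    return np.moveaxis(np.tensordot(U, A, axes=(1, n)), 0, n)

def hosvd(Y, ranks):
    """Truncated HOSVD: U_n from the leading left singular vectors of Y_(n)."""
    factors = []
    for n, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(Y, n), full_matrices=False)
        factors.append(U[:, :r])                      # U_n in R^{I_n x R_n}
    core = Y
    for n, U in enumerate(factors):
        core = mode_n_product(core, U.T, n)           # X = Y x_1 U_1^T ... x_N U_N^T
    return core, factors

Y = np.random.rand(10, 12, 14)
core, factors = hosvd(Y, ranks=(3, 4, 5))
Y_hat = core
for n, U in enumerate(factors):
    Y_hat = mode_n_product(Y_hat, U, n)               # Y ~ X x_1 U_1 ... x_N U_N
print(core.shape, np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y))   # core size and relative error
```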
1.4.2 Canonical/Polyadic (CP) Decomposition

CP is a special case of Tucker decomposition where the core tensor $\mathcal{X}$ is superdiagonal:
$$x_{i_1, i_2, \ldots, i_N} = \begin{cases} r_{i_1, i_2, \ldots, i_N}, & \text{if } i_1 = i_2 = \cdots = i_N, \\ 0, & \text{otherwise}, \end{cases} \qquad (1.16)$$
where $r_{i_1, i_2, \ldots, i_N} \in \mathbb{R}$ is a constant. CP is often defined equivalently as the sum of rank-one tensors computed by the outer products of the columns of the factor matrices $U_n$:
$$\mathcal{Y} = \sum_{i=1}^{R} \mathbf{u}_{1,i} \circ \mathbf{u}_{2,i} \circ \cdots \circ \mathbf{u}_{N,i}, \qquad (1.17)$$
where $R$ is the CP rank of $\mathcal{Y}$. The above equation is also denoted as:
$$\mathcal{Y} = [[U_1, U_2, \ldots, U_N]]. \qquad (1.18)$$
This representation is called a Kruskal tensor (see [99]), which is a convenient representation for CP tensors. We denote a Kruskal tensor by $\mathfrak{Y} = [[U_1, \ldots, U_N]]$ or $\mathfrak{Y} = [[\zeta; U_1, \ldots, U_N]]$, where $\zeta \in \mathbb{R}^R$ is a vector whose entries are the weights of the rank-one tensor components. In the special case of matrices, $\zeta$ corresponds to the singular values of a matrix. In general, it is assumed that the rank $R$ is small, so that equation (1.18) is also called a low-rank approximation of the tensor $\mathcal{Y}$.

1.4.3 Tensor Train Decomposition (TT)

Most of the existing work in both unsupervised and supervised learning has utilized Tucker or CP decompositions. The matrix product state (MPS), or Tensor Train, is one of the best understood tensor networks, or tensor decompositions, for which efficient algorithms have been developed [140, 139]. TT is a special case of a tensor decomposition where a tensor with $N$ indices is factorized into a chain-like product of low-rank, three-mode tensors. The TT model has been employed in various applications such as PCA [18], manifold learning [183] and deep learning [137]. The Tensor Train decomposition represents each element of $\mathcal{Y}$ using a series of matrix products as:
$$\mathcal{Y}(i_1, i_2, \ldots, i_N) = \mathcal{U}_1(1, i_1, :)\, \mathcal{U}_2(:, i_2, :) \ldots \mathcal{U}_N(:, i_N, :)\, \mathbf{x}, \qquad (1.19)$$
where $\mathcal{U}_n \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$ are the three-mode low-rank tensor factors, $R_n < I_n$ are the TT-ranks of the corresponding modes and $\mathbf{x} \in \mathbb{R}^{R_N \times 1}$ is the projected sample vector. Using the tensor merging product form, (1.19) can be rewritten as
$$\mathcal{Y} = \mathcal{U}_1 \times_3^1 \mathcal{U}_2 \times_3^1 \cdots \times_3^1 \mathcal{U}_N \times_3^1 \mathbf{x}. \qquad (1.20)$$
A graphical representation of (1.20) can be seen in Figure 1.3. If $\mathcal{Y}$ is vectorized, another equivalent expression for (1.19) in terms of a matrix projection is obtained as: $\mathrm{V}(\mathcal{Y}) = \mathrm{L}(\mathcal{U}_1 \times_3^1 \mathcal{U}_2 \times_3^1 \cdots \times_3^1 \mathcal{U}_N)\, \mathbf{x}$. Another widely used form of Tensor Train, called Matrix Product States (MPS) [18], is defined as:
$$\mathcal{Y} = \mathcal{U}_1 \times_3^1 \mathcal{U}_2 \times_3^1 \cdots \times_3^1 \mathcal{U}_k \times_3^1 X \times_3^1 \mathcal{U}_{k+1} \times_3^1 \cdots \times_3^1 \mathcal{U}_N. \qquad (1.21)$$
[Figure 1.3: Tensor Train decomposition of $\mathcal{Y}$ using tensor merging products.]
Note that the low dimensional projected samples in MPS are matrices, while for TT they are vectors. Let $U_{\leq k} = \mathrm{L}(\mathcal{U}_1 \times_3^1 \mathcal{U}_2 \times_3^1 \cdots \times_3^1 \mathcal{U}_k) \in \mathbb{R}^{I_1 I_2 \ldots I_k \times r_k}$ and $U_{>k} = \mathrm{R}(\mathcal{U}_{k+1} \times_3^1 \cdots \times_3^1 \mathcal{U}_N) \in \mathbb{R}^{r_{k+1} \times I_{k+1} \ldots I_N}$. When the $\mathrm{L}(\mathcal{U}_n)$ for $n \leq k$ are left orthogonal, i.e. $\mathrm{L}(\mathcal{U}_n)^\top \mathrm{L}(\mathcal{U}_n) = \mathrm{I}_{R_n}$ for all $n \leq k$, then $U_{\leq k}$ is also left orthogonal [80]. Similarly, when the $\mathrm{R}(\mathcal{U}_n)$ for $n > k$ are right orthogonal, $U_{>k}$ is right orthogonal. When $\mathcal{Y}$ is reshaped into a matrix, (1.21) can be equivalently expressed as a matrix projection $\mathrm{T}_k(\mathcal{Y}) = (\mathrm{I}_{I_n} \otimes U_{\leq k}) X U_{>k}$.
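For completeness, a sketch of the standard TT-SVD procedure for computing cores of the form $\mathcal{U}_n \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$ is given below; names are illustrative, and the supervised TT algorithms of Chapter 2 optimize different (discriminative, graph-regularized) objectives rather than plain reconstruction error.

```python
import numpy as np

def tt_svd(Y, max_rank):
    """Decompose Y into TT cores of shape (R_{n-1}, I_n, R_n) by sequential truncated SVDs."""
    dims = Y.shape
    cores, r_prev = [], 1
    C = Y.reshape(r_prev * dims[0], -1)
    for n in range(len(dims) - 1):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(U[:, :r].reshape(r_prev, dims[n], r))        # core U_n
        C = (np.diag(s[:r]) @ Vt[:r]).reshape(r * dims[n + 1], -1)
        r_prev = r
    cores.append(C.reshape(r_prev, dims[-1], 1))                  # last core U_N
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into a full tensor."""
    Y = cores[0]
    for core in cores[1:]:
        Y = np.tensordot(Y, core, axes=(-1, 0))                   # chain contraction over TT ranks
    return Y.squeeze(axis=(0, -1))

Y = np.random.rand(4, 5, 6, 7)
cores = tt_svd(Y, max_rank=50)                 # ranks large enough for an exact decomposition
print([c.shape for c in cores])
print(np.allclose(tt_reconstruct(cores), Y))   # True
```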
1.5 Robust PCA for Tensors

In order to separate the low-rank part from sparse outliers in tensors, many extensions of RPCA to tensors have been proposed [70, 120, 208, 197]. Although utilizing tensor structures has proven to be useful in many tasks, quantifying the low-rankness of tensors is an open-ended problem as there is no single definition of the rank of a tensor. Similarly, the nuclear norm of a tensor, which is a convex relaxation of the rank, is not uniquely defined. Depending on the definition of the rank, different nuclear norms are used, such as the Tucker rank (sum of nuclear norms (SNN) of the mode-$n$ unfoldings), the tubal rank (tensor nuclear norm (TNN)), and the TT rank (Schatten norm (TTNN)). In [120, 70], the Tucker rank was used to quantify the rank of a tensor. This interpretation resulted in an SNN based objective:
$$\underset{\mathcal{L}, \mathcal{S}}{\text{minimize}} \; \sum_{n=1}^{N} \|L_{(n)}\|_* + \|\mathcal{S}\|_1, \qquad (1.22)$$
where $\|\cdot\|_1$ is the $\ell_1$ norm of a tensor, defined as the sum of the absolute values of all entries of the tensor. There are many extensions to this formulation, such as reformulating the objective function for tensor completion, using other definitions of the nuclear norm, and weighting the nuclear norms of different modes.

1.6 Linear Discriminant Analysis (LDA) for Tensors

Let $\mathcal{Y} \in \mathbb{R}^{I_1 \times \cdots \times I_N \times K \times C}$ be the collection of training tensor samples. For a given $\mathcal{Y}$ with $C$ classes and $K$ samples per class, define $\mathcal{Y}_c^k \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ as the sample tensors, where $k \in \{1, \ldots, K\}$ is the sample index and $c \in \{1, \ldots, C\}$ is the class index. LDA for tensor data first applies a vectorization to the tensor samples and then finds an orthogonal projection $U$ that maximizes the discriminability of the projections by solving¹:
$$U = \underset{\hat{U}}{\operatorname{argmin}} \left[ \mathrm{tr}(\hat{U}^\top S_W \hat{U}) - \lambda\, \mathrm{tr}(\hat{U}^\top S_B \hat{U}) \right] = \underset{\hat{U}}{\operatorname{argmin}}\; \mathrm{tr}\big(\hat{U}^\top (S_W - \lambda S_B) \hat{U}\big) = \underset{\hat{U}}{\operatorname{argmin}}\; \mathrm{tr}(\hat{U}^\top S \hat{U}), \qquad (1.23)$$
where $S = S_W - \lambda S_B$ and $\lambda$ is the regularization parameter that controls the trade-off between $S_W$ and $S_B$, which are the within-class and between-class scatter matrices, respectively, defined as:
$$S_W = \sum_{c=1}^{C} \sum_{k=1}^{K} \mathrm{V}(\mathcal{Y}_c^k - \mathcal{M}_c)\, \mathrm{V}(\mathcal{Y}_c^k - \mathcal{M}_c)^\top, \qquad (1.24)$$
$$S_B = \sum_{c=1}^{C} \sum_{k=1}^{K} \mathrm{V}(\mathcal{M}_c - \mathcal{M})\, \mathrm{V}(\mathcal{M}_c - \mathcal{M})^\top, \qquad (1.25)$$
where $\mathcal{M}_c = \frac{1}{K} \sum_{k=1}^{K} \mathcal{Y}_c^k$ is the mean of each class $c$ and $\mathcal{M} = \frac{1}{CK} \sum_{c=1}^{C} \sum_{k=1}^{K} \mathcal{Y}_c^k$ is the total mean of all samples. Since $U$ is an orthogonal projection, (1.23) is equivalent to minimizing the within-class scatter and maximizing the between-class scatter of the projections. This can be solved by the matrix $U \in \mathbb{R}^{\prod_{n=1}^{N} I_n \times R_N}$ whose columns are the eigenvectors of $S \in \mathbb{R}^{\prod_{n=1}^{N} I_n \times \prod_{n=1}^{N} I_n}$ corresponding to the lowest $R_N$ eigenvalues.

¹ The original formulation optimizes the trace ratio. Prior work showed the equivalence of the trace ratio to the trace difference used in this thesis [63].

Multilinear Discriminant Analysis (MDA): MDA extends LDA to tensors using TD by finding a subspace $U_n \in \mathbb{R}^{I_n \times R_n}$ for each mode $n \in \{1, \ldots, N\}$ that maximizes the discriminability along that mode [112, 168, 195]. When the number of modes $N$ is equal to 1, MDA is equivalent to LDA. In the case of MDA, the within-class scatter along each mode $n \in \{1, \ldots, N\}$ is defined as:
$$S_W^{(n)} = \sum_{c=1}^{C} \sum_{k=1}^{K} \Big[ (\mathcal{Y}_c^k - \mathcal{M}_c) \prod_{\substack{m \in \{1, \ldots, N\} \\ m \neq n}} \times_m U_m^\top \Big]_{(n)} \Big[ (\mathcal{Y}_c^k - \mathcal{M}_c) \prod_{\substack{m \in \{1, \ldots, N\} \\ m \neq n}} \times_m U_m^\top \Big]_{(n)}^\top. \qquad (1.26)$$
The between-class scatter $S_B^{(n)}$ is found in a similar manner. Using these definitions, each $U_n$ is found by optimizing [168]:
$$U_n = \underset{\hat{U}_n}{\operatorname{argmin}}\; \mathrm{tr}\big(\hat{U}_n^\top (S_W^{(n)} - \lambda S_B^{(n)}) \hat{U}_n\big). \qquad (1.27)$$
Different implementations of multilinear discriminant analysis have been introduced, including Discriminant Analysis with Tensor Representation (DATER), Direct Generalized Tensor Discriminant Analysis (DGTDA) and Constrained MDA (CMDA). DATER minimizes the ratio $\mathrm{tr}(U_n^\top S_W^{(n)} U_n) / \mathrm{tr}(U_n^\top S_B^{(n)} U_n)$ [195] instead of (1.27). Direct Generalized Tensor Discriminant Analysis (DGTDA), on the other hand, computes the scatter matrices without projecting the inputs onto the $U_m$, $m \neq n$, and finds an optimal $U_n$ [112]. Constrained MDA (CMDA) finds the solution in an iterative fashion [112], where each subspace is found by fixing all other subspaces.
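The vectorized tensor LDA in (1.23)-(1.25) can be summarized in a few lines of numpy; the sketch below is our own illustration and should not be confused with the TT-based discriminant analysis developed in Chapter 2.

```python
import numpy as np

def tensor_lda(samples, lam=1.0, n_components=5):
    """samples: array of shape (C, K, I1, ..., IN). Returns U with n_components columns."""
    C, K = samples.shape[:2]
    X = samples.reshape(C, K, -1)                       # vectorize each tensor sample
    class_means = X.mean(axis=1)                        # M_c, shape (C, d)
    total_mean = X.reshape(C * K, -1).mean(axis=0)      # M, shape (d,)
    d = X.shape[-1]
    S_W = np.zeros((d, d)); S_B = np.zeros((d, d))
    for c in range(C):
        Dc = X[c] - class_means[c]                      # within-class deviations, eq. (1.24)
        S_W += Dc.T @ Dc
        db = (class_means[c] - total_mean)[:, None]     # between-class deviation, eq. (1.25)
        S_B += K * (db @ db.T)
    eigvals, eigvecs = np.linalg.eigh(S_W - lam * S_B)  # eigendecomposition of S = S_W - lam*S_B
    return eigvecs[:, :n_components]                    # eigenvectors of the smallest eigenvalues

# Toy example: 3 classes, 10 samples per class, 4 x 5 tensor samples with class offsets.
rng = np.random.default_rng(1)
Y = rng.standard_normal((3, 10, 4, 5)) + rng.standard_normal((3, 1, 4, 5))
U = tensor_lda(Y, lam=1.0, n_components=2)
print(U.shape)                               # (20, 2)
print(np.allclose(U.T @ U, np.eye(2)))       # orthonormal projection
```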
1.7 Organization and Contributions of the Thesis

In this thesis, we present novel algorithms for learning from tensor data for different applications, including discriminative and multi-modal supervised learning, data recovery, and anomaly detection from spatiotemporal data. In all of the proposed methods, important design challenges such as keeping the storage and computational complexity low while maintaining high accuracy are taken into account. The resulting cost functions are formulated as optimization problems and efficient algorithms are presented to solve them. We also provide analyses of the proposed algorithms in terms of convergence, computational complexity and storage complexity. The algorithms are applied to classification, clustering and anomaly detection tasks.

In Chapter 2, we introduce a tensor decomposition structure that provides computational and storage efficiency while providing high accuracy for both supervised and unsupervised applications. The structure is akin to a hybrid between Tensor Train and Tucker decompositions using the flexibility of tensor networks, and is named the multi-branch tensor train decomposition. This new structure is useful for extending existing machine learning methods to tensor-type data. The structure is employed in a supervised learning setting for extending LDA to tensor data. We also present an unsupervised learning method that relies on the proposed structure, namely graph regularized tensor train decomposition. The proposed supervised and unsupervised algorithms are compared to state-of-the-art tensor decompositions with similar objectives in, respectively, classification and clustering settings on various data sets.

In Chapter 3, we address the problem of unsupervised anomaly detection in spatiotemporal data. In this chapter, we develop an anomaly detection method by taking the structure of the anomalies into account. In particular, we model the spatiotemporal data as a tensor where the underlying data is low-rank and the anomalies are sparse and temporally continuous. In other words, anomalies appear as spatially contiguous groups of locations that show anomalous values consistently for a short duration of time. We present different methods based on these assumptions: LOw-rank plus temporally Smooth Sparse decomposition (LOSS), Graph regularized LOSS (GLOSS) and LOw-rank on Graphs plus temporally Smooth Sparse decomposition (LOGSS). LOSS and GLOSS use the same objective functions except for the graph regularization term included in GLOSS. By including a graph regularization term in the objective, we preserve the local geometry of the data. In LOGSS, different from the previous two, graph regularization is employed to approximate the nuclear norm minimization problem in order to reduce the computational complexity. As the proposed methods use a tensor completion framework, they are robust against missing entries in the observed spatiotemporal data.

In Chapter 4, we study geometric tensor learning, a framework that explores the effectiveness of product graph structures on tensors. This framework is utilized for a tensor recovery task where the aim is to reconstruct the underlying data from grossly corrupted and partially observed data. We present a new method with a Tensor Train structure to model the data. Using TT changes the definition of the product graph as the tensor factorization is not Kronecker structured. We propose two new algorithms to model product graphs for the TT decomposition: in the first approach, we propose using a graph for each canonical mode unfolding, and in the second, we show the equivalence of the canonical unfolding to the mode-$n$ unfolding when each mode-$n$ graph has a Kronecker structure.

In Chapter 5, we present a coupled support tensor machine framework for classification. The framework extends a supervised learning task to multiple, possibly heterogeneous, data sources such as simultaneous EEG and fMRI. Starting from a kernelized support tensor machine (STM) framework formulated for a single modality, we propose a multiple kernel approach. In particular, we estimate latent factors with Advanced Coupled Matrix Tensor Factorization (ACMTF) [3] jointly across modalities.
This algorithm decomposes multiple tensors with CP decomposition, where some sets of factors from different modes are coupled together, i.e., are subject to similarity conditions, to account for the same underlying physical phenomenon that affects these factors. A Coupled Support Tensor Machine (C-STM) combines individual and shared latent factors with multiple kernels and estimates a maximal-margin classifier for coupled matrix and tensor data.

Finally, Chapter 6 summarizes the main contributions of this thesis and discusses potential future research directions.

CHAPTER 2
MULTI-BRANCH TENSOR TRAIN STRUCTURE FOR SUPERVISED AND UNSUPERVISED LEARNING

2.1 Introduction

Tensor decompositions have been proposed for various tasks including dimensionality reduction, classification and clustering. In the area of unsupervised learning, many vector or matrix based methods such as PCA, SVD and NMF have been extended to tensors using various tensor decomposition structures including PARAFAC/CP, Tucker decomposition and TT [159, 42, 93, 26, 48, 49, 43, 106, 18, 181]. Tensors have also gained attention in the supervised learning literature. Vectorizing high dimensional inputs may result in poor classification performance due to overfitting when the training sample size is relatively small compared to the feature vector dimension [159, 112, 169, 168, 196]. Another issue brought forth by vectorization is the small sample size (SSS) problem. This problem is considered to be the main drawback of linear discriminant analysis (LDA), as the objective function becomes ill-defined due to the singularity of the estimated scatter matrices [63]. For these and many other reasons, a variety of supervised tensor learning methods for feature extraction, selection, regression and classification have been proposed [126, 168, 97, 75, 112, 74]. These include extensions of Linear Discriminant Analysis (LDA) to Multilinear Discriminant Analysis (MDA) for face and gait recognition [196, 168, 112]; Discriminant Non-negative Tensor Factorization (DNTF) [202]; and Supervised Tensor Learning (STL), where one projection vector along each mode of a tensor is learned [169, 76]. More recently, the linear regression model has been extended to tensors to learn multilinear mappings from a tensorial input space to a continuous output space [73, 201]. Finally, a framework for tensor-based linear large margin classification was formulated as Support Tensor Machines (STMs), in which the parameters defining the separating hyperplane form a tensor [97, 75, 74].

However, most of these methods are based on CP or Tucker decomposition, which have various problems. CP decomposition is very efficient in terms of storage complexity and has useful properties such as uniqueness and interpretability, which is not true for TD or TT. However, finding the correct CP rank is challenging and is often done by iterating over different values. For noisy tensor data, CP decomposition may find a false rank. Moreover, there may not be a low-rank approximation, which is referred to as degeneracy [93], and the problem is often ill-posed [44]. Tucker decomposition is an extension of SVD and a widely used tensor decomposition. It is also more numerically stable compared to CP and, most of the time, a low-rank approximation exists in Tucker form. However, the Tucker representation can be exponential in storage requirements since, even when each mode is low-rank with rank $r$, the dimensionality of a sample core tensor $\mathcal{X}_c^k$ will be $r^N$ [182, 32].
It was also shown in [17] that learning a low-rank approximation using Tucker Decomposition requires a truncated SVD on a fat matrix, which hinders the quality of the low-rank approximation as fat or tall matrices are usually full rank. TT format exhibits both stability and low storage complexity compared to Tucker format [42]. It was also suggested as a possible solution to the imbalance of unfolding problem encountered in Tucker decomposition in [17]. However, the ranks of the core tensors in the Tensor Train decomposition are frequently not low for real data [45] and increase exponentially with the number of modes. In [204], a solution was proposed by limiting the maximum ranks of tensor factors and optimizing accordingly. This limitation requires the original data to be low TT rank, which is not true for most real data [45, 204], and results in lower approximation accuracy. A widely used solution to these problem with TT or MPS [18] is to start the Tensor Train decomposition from the left, i.e. first mode, and the right, i.e. last mode, simultaneously which decreases the computational complexity significantly. Even though early applications of TT decomposition focused on compression and dimensionality re- duction [139, 140], more recently TT has been used in machine learning applications. In [18], MPS is implemented in an unsupervised manner to first compress the tensor of training samples and then the re- sulting lower dimensional core tensors are used as features for subsequent classification. In [182], TT decomposition is associated with a structured subspace model, namely the Tensor Train subspace. Learning this structured subspace from training data is posed as a non-convex problem referred to as TT-PCA. Once the subspaces are learned from the training data, the resulting low-dimensional subspaces are used to project and classify the test data. [183] extends TT-PCA to manifold learning by proposing a Tensor Train neighborhood preserving embedding (TTNPE). The classification is conducted by first learning a set of tensor subspaces from the training data and then projecting the training and testing data onto the learned subspaces. Apart from employing TT for subspace learning, recent work has also considered the use of TT in classifier design. In [38], a support tensor train machine (STTM) is introduced to replace the rank-1 weight tensor in Support Tensor Machine (STM) [169] by a Tensor Train that can approximate any tensor with a scalable number of parameters. 17 In order to address the issues of exponential storage requirements and high computational complexity, in this chapter, we introduce supervised and unsupervised subspace learning approaches based on the Tensor Train (TT) structure. Building on the idea of starting the decomposition from the first and last modes simultaneously, we show that using a number of branches in the decomposition, a hybrid between TT and Tucker Decomposition structure can be achieved which in turn can provide a balanced structure and low ranks. When the number of branches is equal to one, the decomposition is equivalent to a standard TT decomposition. When the number of branches is equal to the number of modes, the decomposition is equivalent to Tucker Decomposition. As a special case, when the number of branches is two, the structure is still denoted as TT in previous work[98, 204, 18, 45]. However, the formulation is different from one- way TT as the orthogonality requirements of some factors become right-orthogonality instead of left. 
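To make the distinction between left- and right-orthogonal factors concrete, the snippet below orthogonalizes the left and right unfoldings of a random third-order core. It assumes L(·) and R(·) are the usual left and right matricizations of a core of size 𝑅𝑛−1 × 𝐼𝑛 × 𝑅𝑛 ; the dimensions are arbitrary.

```python
import numpy as np

def left_unfold(core):
    """L(U): reshape (R_prev, I, R_next) into (R_prev * I, R_next)."""
    r0, i, r1 = core.shape
    return core.reshape(r0 * i, r1)

def right_unfold(core):
    """R(U): reshape (R_prev, I, R_next) into (R_prev, I * R_next)."""
    r0, i, r1 = core.shape
    return core.reshape(r0, i * r1)

rng = np.random.default_rng(0)
core = rng.standard_normal((3, 8, 4))     # hypothetical core with R_{n-1}=3, I_n=8, R_n=4

# Left-orthogonal core: L(U)^T L(U) = I_{R_n}.
q, _ = np.linalg.qr(left_unfold(core))
left_core = q.reshape(core.shape)
print(np.allclose(left_unfold(left_core).T @ left_unfold(left_core), np.eye(4)))     # True

# Right-orthogonal core: R(U) R(U)^T = I_{R_{n-1}}.
q, _ = np.linalg.qr(right_unfold(core).T)
right_core = q.T.reshape(core.shape)
print(np.allclose(right_unfold(right_core) @ right_unfold(right_core).T, np.eye(3)))  # True
```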
By generalizing the proposed idea to different number of branches, we show that our structure is a hybrid tensor decomposition that includes TT and Tucker decomposition as special cases. The significance of the number of branches becomes more apparent for high dimensional data with large number of modes as for these data, the TT ranks tend to get large. The proposed structure is closely related to recent work in the computation of extremal eigenvalue-eigenvector pairs of large Hermitian matrices [45]. Since the main objective is to compute extremal eigenpairs, i.e. Ritz pairs, of a large-scale structured matrix, the proposed multi-branch approach is similar to the previously introduced block TT (BTT) format [56, 98, 204]. In particular, we present a discriminant subspace learning approach using the TT model, namely the Tensor Train Discriminant Analysis (TTDA). The proposed approach is based on linear discriminant analysis (LDA) and learns a Tensor Train subspace (TT-subspace) [183, 182] that maximizes the linear discriminant function. Although this approach provides an efficient structure for storing the learned subspaces, it is com- putationally prohibitive. For this reason, we propose the multi-branch structure for efficient implementations of TTDA. In the second part of this chapter, we apply the multi-branch structure to an unsupervised learning problem that can be posed as an extremal eigenvalue-eigenvector problem. This problem is a combination of manifold learning and linear subspace learning which is posed as a graph regularized dimensionality reduction algorithm. The contributions of the chapter can be summarized as follows: • A new tensor network structure is introduced to provide a better trade-off between computational complexity and storage cost. The proposed multi-branch structure is akin to a hybrid between tensor- 18 train and Tucker decompositions using the flexibility of tensor networks. This structure can also be utilized within other subspace learning tasks, as was done in both supervised and unsupervised settings in this chapter. • This chapter is the first that uses tensor-train decomposition to formulate LDA for supervised subspace learning. Unlike recent work on TT-subspace learning [17, 18, 182, 183] which focuses on dimension- ality reduction, the proposed work learns discriminative TT-subspaces to extract features that optimize the linear discriminant function. • A theoretical analysis of storage and computational complexity of this new framework is presented. A method to find the optimal implementation of the multi-branch TT model given the dimensions of the input tensor is provided. The rest of the chapter is organized as follows. In Section 2.2, we introduce an optimization problem to learn the TT-subspace structure that maximizes the linear discriminant function, named as Tensor Train Discriminant Analysis (TTDA). In Section 2.3, we introduce multi-branch implementations of TTDA to address the issue of high computational complexity. In Section 2.4, we provide an analysis of storage cost, computational complexity and convergence of the proposed algorithms. We also provide a procedure to determine the optimal TT structure for minimizing storage complexity. In Section 2.5, we compare the proposed methods with state-of-the-art tensor based discriminant analysis and subspace learning methods for classification applications. 
In Section 2.6, we discuss graph regularized unsupervised learning on tensors, then, propose an implementation using the two-way structure using an ADMM based optimization, and discuss the convergence properties. In Section 2.7, we compare the proposed method with unsupervised tensor learning methods for clustering applications on two data sets. 2.2 Tensor Train Discriminant Analysis When the data samples are tensors, traditional LDA first vectorizes them and then finds an optimal projection as shown in (1.23). This creates problems as the intrinsic structure of the data is destroyed. Even though MDA addresses this problem, it is inefficient in terms of storage complexity as it relies on TD [45, 32]. Thus, we propose to solve (1.23) by constraining 𝑈 = L(U1 ×13 U2 ×13 · · · ×13 U𝑁 ) to be a TT-subspace to reduce the computational and storage complexity and obtain a solution that will preserve the inherent data structure. 19 The goal of TTDA is to learn left orthogonal tensor factors U𝑛 ∈ R𝑅𝑛−1 ×𝐼𝑛 ×𝑅𝑛 , 𝑛 ∈ {1, . . . , 𝑁 } using TT-model such that the discriminability of projections x𝑐𝑘 , ∀𝑐, 𝑘 is maximized. First, U𝑛 s can be initialized using TT decomposition proposed in [140]. To optimize U𝑛 s for discriminability, we need to solve (1.23) for each U𝑛 , which can be rewritten using the definition of 𝑈 as:   1 1 1 1 > 1 1 1 1 U𝑛 = argmin 𝑡𝑟 L(U1 ×3 · · · ×3 Û𝑛 ×3 · · · ×3 U𝑁 ) 𝑆L(U1 ×3 · · · ×3 Û𝑛 ×3 · · · ×3 U𝑁 ) . (2.1) Û𝑛 Using the definitions presented in (1.1) and (1.2), we can express (2.1) in terms of tensor merging product:   U𝑛 = argmin 𝑡𝑟 (U1 ×13 · · · ×13 Û𝑛 ×13 · · · ×13 U𝑁 ) ×1,..., 𝑁 1,..., 𝑁 S × 1,..., 𝑁 𝑁 +1,...,2𝑁 (U × 1 3 1 · · · ×1 Û 3 𝑛 3 × 1 · · · ×1 U 3 𝑁 , ) (2.2) Û𝑛 where S = T−1 𝑁 (𝑆) ∈ R 𝐼1 ×···×𝐼 𝑁 ×𝐼1 ×···×𝐼 𝑁 . Let U 1 1 1 ≤𝑛−1 = U1 ×3 U2 ×3 · · · ×3 U𝑛−1 and U>𝑛 = U𝑛+1 ×3 · · · ×3 1 1 U𝑁 . By rearranging the terms in (2.2), we can first compute all merging products and trace operations that do not involve U𝑛 as: "  #   𝑁 +1,..., 𝑁 +𝑛−1 𝑁 +𝑛+2,...,2𝑁 A𝑛 = 𝑡𝑟 48 U ≤𝑛−1 ×1,...,𝑛−1 1,...,𝑛−1 𝑛+1,..., 𝑁 U>𝑛 ×1,..., 𝑁 −𝑛 U ≤𝑛−1 ×1,...,𝑛−1 (U>𝑛 ×1,..., 𝑁 −𝑛 S) , (2.3) where A 𝑛 ∈ R𝑅𝑛−1 ×𝐼𝑛 ×𝑅𝑛 ×𝑅𝑛−1 ×𝐼𝑛 ×𝑅𝑛 (refer to Figure 2.1 for a graphical representation of (2.3)). Then, we can rewrite the optimization in terms of U𝑛 :    U𝑛 = argmin Û𝑛 ×1,2,3 1,2,3 A 𝑛 × 1,2,3 4,5,6 Û𝑛 . (2.4) Û𝑛 Let 𝐴𝑛 = T3 (A 𝑛 ) ∈ R𝑅𝑛−1 𝐼𝑛 𝑅𝑛 ×𝑅𝑛−1 𝐼𝑛 𝑅𝑛 , then (2.4) can be rewritten as: U𝑛 = argmin V( Û𝑛 ) > 𝐴𝑛 V( Û𝑛 ), L( Û𝑛 ) > L( Û𝑛 ) = I𝑅𝑛 . (2.5) Û𝑛 This is a non-convex function due to unitary constraints and can be solved by the algorithm proposed in [189]. The algorithm employs a curvilinear line search method to solve minimizations with orthogonality constraints such as eigenvalue decompositions or matrix rank minimization. The procedure described above to find the subspaces is computationally expensive as the complexity of finding each A 𝑛 [183] is at least quadratic in the number of elements, i.e. O (𝐼 2𝑁 ). When 𝑛 = 𝑁, (2.5) does not apply as U>𝑁 is not defined and the trace operation is defined on the third mode of U𝑁 . To update U𝑁 , the following can be used:    U𝑁 = argmin 𝑡𝑟 Û𝑁 ×1,2 1,2 A 𝑁 × 1,2 3,4 Û 𝑁 , Û𝑁 20 𝐼1 𝐼1 U𝑛−1 𝐿 𝐼2 𝐼2 U𝑛−1 𝐿 ... ... 𝑅𝑛−1 𝑅𝑛−1 𝐼𝑛 S 𝐼𝑛 𝑅𝑛 𝑅𝑛 ... ... U𝑛𝑅 𝐼 𝑁−1 𝐼 𝑁−1 U𝑛𝑅 𝐼𝑁 𝐼𝑁 𝑅𝑁 𝑅𝑁 Figure 2.1: Tensor A 𝑛 is formed by first merging U𝑛𝑅 , U𝑛−1 𝐿 and S and then applying trace operation across 𝑡 ℎ 𝑡 ℎ 4 and 8 modes of the resulting tensor. The green line at the bottom of the diagram refers to the trace operator.   
𝑁 −1 𝑁 +1,...,2𝑁 −1 where A 𝑁 = U ≤ 𝑁 −1 ×1,...,1,..., 𝑁 −1 U ≤ 𝑁 −1 ×1,..., 𝑁 −1 S . Once all of the U𝑛 s are obtained, they can be used to extract low-dimensional, discriminative features by the projection operation 𝑈 > V(Y𝑐𝑘 ). The pseudocode for TTDA is given in Algorithm 1. TTDA algorithm described above is computationally expensive as it requires the computation of tensor A 𝑛 through tensor merging products. Also, the size of A 𝑛 grows significantly with increasing 𝑅𝑛 , e.g. 𝑂 (𝐼 4𝑁 ), which has exponentially growing memory cost. Moreover, since the ranks of TT are bounded by Î𝑛 Î𝑛 2 Î𝑛 2 𝑅𝑛 ≤ 𝑖=1 𝐼𝑖 , the size of 𝐴𝑛 may increase up to 𝑖=1 𝐼𝑖 × 𝑖=1 𝐼𝑖 . Thus, 𝐴𝑛 might end up being larger than the original scatter matrix 𝑆, making TT approximation worse than LDA in terms of computational complexity. 2.3 Multi-Branch Tensor Train Discriminant Analysis In this section, we introduce tensor decomposition structures to reduce computational complexity of TTDA. In particular, instead of using the whole scatter tensor S for the update of each U𝑛 , we aim to partition tensor factors into several sets where each set will approximate a lower dimensional TT subspace. For the update of each factor set, the original data will be projected to the remaining sets, and then the corresponding scatter matrix will be formed from these low dimensional projections. To further reduce the complexity, tensor merging products between sets will be removed. These collection of sets form a Kronecker structure, approximating 𝑈 in (1.23). 21 Algorithm 2.1: Tensor Train Discriminant Analysis (TTDA) Input: Input tensors Y𝑐𝑘 ∈ R𝐼1 ×𝐼2 ×···×𝐼𝑁 where 𝑐 ∈ {1, . . . , 𝐶} and 𝑘 ∈ {1, . . . , 𝐾 }, initial tensor factors U𝑛 , 𝑛 ∈ {1, . . . , 𝑁 }, 𝜆, 𝑅1 , . . . , 𝑅 𝑁 , 𝑀𝑎𝑥𝐼𝑡𝑒𝑟 Output: U𝑛 , 𝑛 ∈ {1, . . . , 𝑁 }, and x𝑐𝑘 , ∀𝑐, 𝑘 1: S ← T−1 𝑁 (𝑆 𝑊 − 𝜆𝑆 𝐵 ), see eqns.(1.24), (1.25). 2: while 𝑖𝑡𝑒𝑟 < 𝑀𝑎𝑥𝐼𝑡𝑒𝑟 do 3: for 𝑛 = 1 : 𝑁 − 1 do 4: Compute A 𝑛 using (2.3). 5: V(U𝑛 ) ← argmin V( Û𝑛 ) > T3 (A 𝑛 )V( Û𝑛 ). 6: end for Û𝑛 ,L( Û𝑛 ) > L( Û𝑛 )=I 𝑅𝑛   1,..., 𝑁 −1 𝑁 +1,...,2𝑁 −1 7: A 𝑁 ← U𝑁𝐿 −1 ×1,..., 𝑁 −1 U 𝐿 × 𝑁 −1 1,..., 𝑁 −1 S   8: L(U𝑁 ) ← argmin 𝑡𝑟 L( Û𝑁 ) > T2 (A 𝑁 )L( Û𝑁 ) . Û𝑁 ,L( Û𝑛 ) > L( Û𝑁 )=I 𝑅 𝑁 9: 𝑖𝑡𝑒𝑟 ← 𝑖𝑡𝑒𝑟 + 1. 10: end while 11: 𝑈 = L(U1 ×13 U2 ×13 · · · ×13 U𝑁 ) 12: x𝑐𝑘 ← 𝑈 > V(Y𝑐𝑘 ), ∀𝑐, 𝑘. 2.3.1 Two-way Tensor Train Discriminant Analysis (2WTTDA) As LDA tries to find a subspace 𝑈 which maximizes discriminability for vector-type data, 2D-LDA tries to find two subspaces 𝑉1 , 𝑉2 such that these subspaces maximize discriminability for matrix-type data [198]. If Î𝑑 Î𝑁 one considers the matricized version of Y𝑐𝑘 along mode 𝑑, i.e. T𝑑 (Y𝑐𝑘 ) ∈ R 𝑖=1 𝐼𝑖 × 𝑖=𝑑+1 𝐼𝑖 , where 1 < 𝑑 < 𝑁, the equivalent orthogonal projection can be written as: T𝑑 (Y𝑐𝑘 ) = 𝑉1 𝑋𝑐𝑘 𝑉2> , (2.6) Î𝑑 Î𝑁 𝑖=1 𝐼𝑖 ×𝑅 𝑑 𝑖=𝑑+1 𝐼𝑖 × 𝑅 𝑑 , 𝑋𝑐𝑘 ∈ R𝑅𝑑 ×𝑅𝑑 . ˆ ˆ where 𝑉1 ∈ R , 𝑉2 ∈ R In TTDA, since the projections x𝑐𝑘 are considered to be vectors, the subspace 𝑈 = L(U1 ×13 U2 ×13 · · · ×13 U𝑁 ) is analogous to the solution of LDA with the constraint that the subspace admits a TT model. If we consider the projections and the input samples as matrices, now we can impose a TT structure to the left and right subspaces analogous to solution of 2D-LDA. In other words, one can find two sets of TT representations corresponding to 𝑉1 and 𝑉2 in (2.6). 
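Before specializing to two branches, it is useful to see how a chain of TT cores defines the full subspace 𝑈 and the projection in lines 11-12 of Algorithm 2.1. The numpy sketch below is only an illustration with random cores and hypothetical dimensions; it assumes 𝑅0 = 1, cores stored as third-order arrays, and row-major vectorization for V(·).

```python
import numpy as np

def tt_subspace_matrix(cores):
    """Merge TT cores U_n of shape (R_{n-1}, I_n, R_n) into the subspace matrix
    U = L(U_1 x ... x U_N) of shape (I_1*...*I_N, R_N).  Assumes R_0 = 1."""
    u = cores[0].reshape(cores[0].shape[1], cores[0].shape[2])       # (I_1, R_1)
    for core in cores[1:]:
        r_prev, i_n, r_n = core.shape
        u = (u @ core.reshape(r_prev, i_n * r_n)).reshape(-1, r_n)   # contract shared rank index
    return u

rng = np.random.default_rng(0)
dims, ranks = (8, 8, 8, 8), (1, 4, 4, 4, 5)                  # e.g. a reshaped 64 x 64 image
cores = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1])) for n in range(len(dims))]

Y = rng.standard_normal(dims)         # one (hypothetical) sample
U = tt_subspace_matrix(cores)         # (4096, 5)
x = U.T @ Y.reshape(-1)               # feature vector x = U^T V(Y), as in line 12 of Algorithm 2.1
print(U.shape, x.shape)               # (4096, 5) (5,)
```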
Using this structural approximation, (2.6) can be rewritten as: T𝑑 (Y𝑐𝑘 ) = L(U1 ×13 · · · ×13 U𝑑 ) 𝑋𝑐𝑘 R(U𝑑+1 ×13 · · · ×13 U𝑁 ), (2.7) which is equivalent to the following representation: Y𝑐𝑘 = U1 ×13 · · · ×13 U𝑑 ×13 𝑋𝑐𝑘 ×12 U𝑑+1 ×13 · · · ×13 U𝑁 . 22 This formulation is graphically represented in Figure 2.2a where the decomposition has two branches, thus we refer to it as Two-way Tensor Train Decomposition (2WTT). In prior work, decompositions with this structure was also called Tensor Train [204] or Matrix Product States [18] but the formulation is different and some properties of Tensor Train such as ranks, computational and storage complexities change when this structure is employed. Thus, this is a different decomposition than the conventional Tensor Train and the differences will be more higlighted when the number of branches increase. The differences between the two-way structure and TT are: • In the two-way structure, factors after mode-𝑑 are not linearly dependent to the ones before mode-𝑑. This can be viewed easily in 2D-LDA problem, which is a two-way structure with only two modes, or from the representation in Figure 2.2a. • In 2WTT, modes before mode-𝑑 are left-orthogonal, similar to TT, but the modes after mode-𝑑 are right orthogonal. This fact with the previous one allow a reduced computational cost in the ALS optimization scheme by a simple linear independence assumption. • In TT, ranks grow from the first mode to the last mode as the upper bound of each rank, 𝑅𝑛 ≤ Î 𝐼𝑛 𝑅𝑛−1 ≤ 𝑛𝑛0=1 𝐼𝑛0 , keeps increasing. On the other hand, for 2WTT ranks with 𝑛 > 𝑑 are bounded Î by 𝑅𝑛 ≤ 𝐼𝑛 𝑅𝑛+1 ≤ 𝑛𝑁0=𝑛+1 𝐼𝑛0 . As a quantitative measure, the upper bound of the maximum rank is q Î𝑁 Î𝑁 𝑛=1 𝐼 𝑛 for TT, and 𝑛=1 𝐼 𝑛 for 2WTT. To maximize discriminability using 2WTT, an optimization scheme that alternates between the two sets of TT-subspaces can be utilized. When forming the scatter matrices for a set, projections of the data to the other set can be used instead of the full data which is similar to (1.26). This will reduce computational complexity as the cost of computing scatter matrices and the number of matrix multiplications to find A 𝑛 in (2.3) will decrease. We propose the procedure given in Algorithm 2.2 to implement this approach and refer to it as Two-way Tensor Train Discriminant Analysis (2WTTDA) as illustrated in Figure 2.2c. To determine Î𝑑 Î the value of 𝑑 in (2.7), we use a center of mass approach and find the 𝑑 that minimizes | 𝑖=1 𝐼𝑖 − 𝑁𝑗=𝑑+1 𝐼 𝑗 |. In this manner, the problem can be separated into two parts which have similar computational complexities. 23 Algorithm 2.2: Two-Way Tensor Train Discriminant Analysis (2WTTDA) Input: Input tensors Y𝑐𝑘 ∈ R𝐼1 ×𝐼2 ×···×𝐼𝑁 where 𝑐 ∈ {1, . . . , 𝐶} and 𝑘 ∈ {1, . . . , 𝐾 }, initial tensor factors U𝑛 , 𝑛 ∈ {1, . . . , 𝑁 }, 𝑑, 𝜆, 𝑅1 , . . . , 𝑅 𝑁 , 𝑀𝑎𝑥𝐼𝑡𝑒𝑟, 𝐿𝑜𝑜 𝑝𝐼𝑡𝑒𝑟 Output: U𝑛 , 𝑛 ∈ {1, . . . , 𝑁 }, and 𝑋𝑐𝑘 , ∀𝑐, 𝑘 1: while 𝑖𝑡𝑒𝑟 < 𝐿𝑜𝑜 𝑝𝐼𝑡𝑒𝑟 do 𝑁 −𝑑+1 2: Y𝐿 ← Y ×2,...,𝑑+1,..., 𝑁 (U𝑑+1 ×13 · · · ×13 U𝑁 ). 3: [U𝑖 ] ← 𝑇𝑇 𝐷 𝐴(Y𝐿 , 𝜆, 𝑅𝑖 , 𝑀𝑎𝑥𝐼𝑡𝑒𝑟)∀𝑖 ∈ {1, . . . , 𝑑}. 4: Y𝑅 ← Y ×2,...,𝑑+1 1,...,𝑑 (U1 ×13 · · · ×13 U𝑑 ). 5: [U𝑖 ] ← 𝑇𝑇 𝐷 𝐴(Y𝑅 , 𝜆, 𝑅𝑖 , 𝑀𝑎𝑥𝐼𝑡𝑒𝑟)∀𝑖 ∈ {𝑑 + 1, . . . , 𝑁 }. 6: 𝑖𝑡𝑒𝑟 = 𝑖𝑡𝑒𝑟 + 1. 7: end while 8: X𝑐𝑘 ← L(U1 ×13 · · · ×13 U𝑑 ) > T𝑑 (Y𝑐𝑘 )R(U𝑑+1 ×13 · · · ×13 U𝑁 ) > . 
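A minimal sketch of the projection in the last line of Algorithm 2.2, together with the center-of-mass rule for choosing 𝑑, is given below. The orthonormal matrices are random stand-ins for the learned left and right branch subspaces, the mode-𝑑 matricization is taken as a plain row-major reshape, and all dimensions are hypothetical.

```python
import numpy as np

def choose_split(dims):
    """Center-of-mass rule: pick d minimizing |prod(dims[:d]) - prod(dims[d:])|."""
    gaps = [abs(int(np.prod(dims[:d])) - int(np.prod(dims[d:]))) for d in range(1, len(dims))]
    return 1 + int(np.argmin(gaps))

def project_2wtt(Y, U_left, U_right, d):
    """X = L(U_1...U_d)^T  T_d(Y)  R(U_{d+1}...U_N)^T for a single sample Y."""
    dims = Y.shape
    Td = Y.reshape(int(np.prod(dims[:d])), int(np.prod(dims[d:])))   # stand-in for T_d(Y)
    return U_left.T @ Td @ U_right.T

rng = np.random.default_rng(1)
dims, r = (8, 8, 8, 8), 4
d = choose_split(dims)                                         # d = 2 here: |64 - 64| = 0
U_left = np.linalg.qr(rng.standard_normal((8 * 8, r)))[0]      # stand-in for L(U_1 x U_2)
U_right = np.linalg.qr(rng.standard_normal((8 * 8, r)))[0].T   # stand-in for R(U_3 x U_4)
X = project_2wtt(rng.standard_normal(dims), U_left, U_right, d)
print(d, X.shape)                                              # 2 (4, 4)
```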
2.3.2 Three-way Tensor Train Discriminant Analysis (3WTTDA) Elaborating on the idea of 2WTTDA, one can increase the number of modes of the projected samples which will increase the number of tensor factor sets, or the number of subspaces to be approximated using TT structure. For example, one may choose the number of modes of the projections as three, i.e. X𝑐𝑘 ∈ R𝑅𝑑1 ×𝑅𝑑2 ×𝑅𝑑2 , where 1 < 𝑑1 < 𝑑2 < 𝑁. This model, named as Three-way Tensor Train ˆ Decomposition (3WTT), is given in (2.8) and represented graphically in Figure 2.2b.   𝑘 𝑘 𝑁 −𝑑2 +2 1 1  𝑑2 −𝑑1 +2 Y𝑐 = X𝑐 ×3 U𝑑2 +1 ×3 · · · ×3 U𝑁 ×2 ! ×1𝑑1 +2 U1 ×13 · · · ×13 U𝑑1 .   U𝑑1 +1 ×13 ··· ×13 U𝑑2 (2.8) To maximize discriminability using 3WTT, one can utilize an iterative approach as in Algorithm 2.2, where inputs are projected on all tensor factor sets except the set to be optimized, then TTDA is applied to the projections. The flowchart for the corresponding algorithm is illustrated in Figure 2.2d. This procedure can be repeated until a convergence criterion is met or a number of iterations is reached. The values of 𝑑1 and Î𝑁 𝑑2 are calculated such that the product of dimensions corresponding to each set is as close to ( 𝑖=1 𝐼𝑖 ) 1/3 as possible. It is important to note that 3WTT will only be meaningful for tensors of order three or higher. For three-mode tensors, 3WTT is equivalent to Tucker Decomposition. When there are more than four modes, the number of branches can be increased accordingly which makes 4W, 5W and 𝑁WTT possible. 2.4 Analysis of Storage, Training Complexity and Convergence In this section, we derive the storage and computational complexities of the aforementioned algorithms as well as providing a convergence analysis for TTDA. 24 𝐼1 𝐼 𝑑1 ... U1 ... U𝑑1 ... ... ... 𝐼 𝑁−1 Y𝑐𝑘 = X𝑐𝑘 ... ... 𝐼 𝑁−1 Y𝑐𝑘 = U1 ... U𝑑 𝑋𝑐𝑘 U𝑑+1 ... U𝑁 𝐼𝑁 𝐼2 U𝑑1 +1 ... U𝑑2 U𝑑2 +1 ... U𝑁 𝐼1 𝐼𝑁 𝐼2 𝐼1 𝐼1 𝐼𝑑 𝐼 𝑑+1 𝐼𝑁 𝐼 𝑑1 +1 𝐼 𝑑2 𝐼 𝑑2 +1 𝐼𝑁 (a) (b) Apply TT in [140] Apply TT in [140] U𝑑1 +1 , . . . , U𝑑2 U1 , . . . , U𝑑 Y, 𝜆, 𝑑 U𝑑+1 , . . . , U𝑁 U1 , . . . , U𝑑1 U𝑑2 +1 , . . . , U𝑁 Y, 𝜆, 𝑑 Y𝑅 Apply Project Y Project Y Project Y Apply Y𝑀 TTDA according according Y𝐿 TTDA Y𝑅 Y𝐿 to line 4. to line 2. (c) (d) Figure 2.2: Illustration of the proposed methods (Compare (a) and (b) with Figures 1.2 and 1.3): (a) The proposed tensor network structure for 2WTT; (b) The proposed tensor network structure for 3WTT; (c) The flow diagram for 2WTTDA (Algorithm 2.2); (d) The flow diagram for 3WTTDA Table 2.1: Storage Complexities of Different Tensor Decomposition Structures Methods Subspaces (U𝑛 s) (O (.)) Projections X𝑐𝑘 (O (.)) TT (𝑁 − 1)𝑟 2 𝐼 + 𝑟 𝐼 𝑟𝐶𝐾 2WTT (𝑁 − 2)𝑟 2 𝐼 + 2𝑟 𝐼 𝑟 2 𝐶𝐾 3WTT (𝑁 − 3)𝑟 2 𝐼 + 3𝑟 𝐼 𝑟 3 𝐶𝐾 TD 𝑁𝑟 𝐼 𝑟 𝑁 𝐶𝐾 Storage Complexity Let 𝐼𝑛 = 𝐼, 𝑛 ∈ {1, 2, . . . , 𝑁 } and 𝑅𝑙 = 𝑟, 𝑙 ∈ {2, . . . , 𝑁 − 1}. Assuming 𝑁 is a multiple of both 2 and 3, total storage complexities are: • O ((𝑁 − 1)𝑟 2 𝐼 + 𝑟 𝐼 + 𝑟𝐶𝐾) for TT Decomposition, where 𝑅1 = 1, 𝑅 𝑁 = 𝑟; • O ((𝑁 − 2)𝑟 2 𝐼 + 2𝑟 𝐼 + 𝑟 2𝐶𝐾) for Two-Way TT Decomposition, where 𝑅1 = 𝑅 𝑁 = 1; • O ((𝑁 − 3)𝑟 2 𝐼 + 3𝑟 𝐼 + 𝑟 3𝐶𝐾) for Three-Way TT Decomposition, where 𝑅1 = 𝑅 𝑑1 = 𝑅 𝑁 = 1; • O (𝑁𝑟 𝐼 + 𝑟 𝑁 𝐶𝐾) for Tucker Decomposition, where 𝑅1 = 𝑅 𝑁 = 𝑟. These results show that when the number of modes for the projected samples is increased, the storage cost increases exponentially for X𝑐𝑘 while the cost of storing U𝑛 s decreases quadratically. Using the above, one 25 Table 2.2: Computational complexities of various algorithms. 
The number of iterations to find the subspaces are denoted as 𝑡 𝑐 for CMDA and 𝑡 𝑡 for TT-based methods. 𝐶𝑠 = 2𝐶𝐾. (𝑟 << 𝐼, 𝑡 𝑡 𝑟 (𝑟 + 𝑁/ 𝑓 − 1) << 𝐶𝑠 , and 𝐼 𝑁 / 𝑓 >> 𝑟 6 ) Methods Order of Complexity (O (.)) LDA 𝐶𝑠 𝐼 2𝑁 + 𝐼 3𝑁 DGTDA 3𝐼 3 + 𝑁𝐶𝑠 𝐼 𝑁 +1 CMDA 2𝑡 𝑐 𝐼 3 + 𝑡 𝑐 𝑁 2 𝐶𝑠 𝐼 𝑁 TTDA 𝐶𝑠 𝐼 2𝑁 2WTTDA (𝑟/2 + 2)𝐶𝑠 𝐼 𝑁 3WTTDA (𝑟 𝐼 /3 /2 + 3)𝐶𝑠 𝐼 2𝑁 /3 𝑁 can easily find the optimal number of modes for the projected samples that minimizes storage complexity. Let the number of modes of X𝑐𝑘 be denoted by 𝑓 . The storage complexity of the decomposition is then O ((𝑁 − 𝑓 )𝑟 2 𝐼 + 𝑓 𝑟 𝐼 + 𝑟 𝑓 𝐶𝐾). The optimal storage complexity is achieved by taking the derivative of the complexity in terms of 𝑓 and equating it to zero. In this case, the optimal 𝑓 is given by    𝑟2𝐼 − 𝑟 𝐼 𝑓ˆ = 𝑟𝑜𝑢𝑛𝑑 log𝑟 , 𝐶𝐾 ln(𝑟) where 𝑟𝑜𝑢𝑛𝑑 (.) is an operator that rounds to the closest positive integer. 2.4.1 Computational Complexity For all of the decompositions mentioned except for DGTDA and LDA, the U𝑛 s and X𝑐𝑘 depend on each other which makes these decompositions iterative. The number of iterations will be denoted as 𝑡 𝑐 and 𝑡 𝑡 for CMDA and TT-based methods, respectively. For the sake of simplicity, we also define 𝐶𝑠 = 2𝐶𝐾. The total cost of finding U𝑛 s and X𝑐𝑘 ∀𝑛, 𝑐, 𝑘, where 𝑟 << 𝐼 is in the order of:    • O 𝐼 𝑁 (𝐶𝑠 + 𝑡 𝑡 𝑟 (𝑟 + 𝑁 − 1))𝐼 𝑁 + 𝑡 𝑡 𝑟 4 (𝐼 + 𝑟 2 𝐼 −1 ) for TTDA;   • O 𝑟 𝐼 𝑁 𝐶2𝑠 + 2𝐼 𝑁 /2 (𝐶𝑠 + 𝑡 𝑡 𝑟 (𝑟 + 𝑁/2 − 1))𝐼 𝑁 /2 + 𝑡 𝑡 𝑟 4 𝐼 + 𝑡 𝑡 𝑟 6 𝐼 −1 for 2WTTDA;    • O 𝑟 𝐼 𝑁 𝐶2𝑠 + 3𝐼 𝑁 /3 (𝐶𝑠 + 𝑡 𝑡 𝑟 (𝑟 + 𝑁/3 − 1))𝐼 𝑁 /3 + 𝑡 𝑡 𝑟 4 𝐼 + 𝑡 𝑡 𝑟 6 𝐼 −1 for 3WTTDA.  If convergence criterion is met with a small number of iterations, i.e. 𝑡 𝑡 𝑟 (𝑟 +𝑁/ 𝑓 −1) << 𝐶𝑠 , and 𝐼 𝑁 / 𝑓 >> 𝑟 6 for all 𝑓 , the reduced complexities are as given in Table 2.2. We can see from Table 2.2 that with increasing number of branches, TT-based methods become more efficient if the algorithm converges in a few number of iterations. This is especially the case if the ranks of tensor factors are low as this reduces the dimensionalities of the search space and the search algorithm finds 26 a solution to (2.5) faster. When this assumption holds true, the complexity is dominated by the formation of scatter matrices. Note that the ranks are assumed to be much lower than dimensionalities and number of modes is assumed to be sufficiently high. When these assumptions do not hold, the complexity of computing A 𝑛 might be dominated by terms with higher powers of 𝑟. This indicates that TT-based methods are more effective when the tensors have higher number of modes and when the TT-ranks of the tensor factors are low. DGTDA has an advantage over all other methods as it is not iterative and the solution for each mode is not dependent on other modes. On the other hand, the solution of DGTDA is not optimal and there are no convergence guarantees except when the ranks and initial dimensions are equal to each other, i.e. when there is no compression. 2.4.2 Convergence To analyze the convergence of TTDA, we must first establish a lower bound for the objective function of LDA, as (2.1) is lower bounded by the objective value of (1.23). Lemma 1. Given that 𝜆 ∈ R+ , i.e. a nonnegative real number, the lower bound of 𝑡𝑟 (𝑈 > 𝑆𝑊 𝑈)−𝜆𝑡𝑟 (𝑈 > 𝑆 𝐵 𝑈) Î𝑁 is achieved when 𝑈 ∈ R 𝑛=1 𝐼𝑛 ×𝑟 satisfies the following two conditions, simultaneously: 1. The columns of 𝑈 are in the null space of 𝑆𝑊 : u 𝑗 ∈ 𝑛𝑢𝑙𝑙 (𝑆𝑊 ), ∀ 𝑗 ∈ {1, . . . , 𝑟 }. 2. {u1 , u2 , . . . , u𝑟 } are the top-𝑟 eigenvectors of 𝑆 𝐵 . 
In this case, the minimum of 𝑡𝑟 (𝑈 > 𝑆𝑊 𝑈) − 𝜆𝑡𝑟 (𝑈 > 𝑆 𝐵 𝑈) = −𝜆 Í𝑟 𝑖=1 𝜎𝑖 , where 𝜎𝑖 s are the eigenvalues of 𝑆𝐵. Proof. Since 𝑆𝑊 is positive semi-definite, 0 ≤ min 𝑡𝑟 (𝑈 > 𝑆𝑊 𝑈), 𝑈 which implies that when the columns of 𝑈 are in the null space of 𝑆𝑊 , i.e. u 𝑗 ∈ 𝑛𝑢𝑙𝑙 (𝑆𝑊 ), ∀ 𝑗 ∈ {1, . . . , 𝑟 }, the minimum value will be achieved for the first part of the objective function. To minimize the trace difference, we need to maximize 𝑡𝑟 (𝑈 > 𝑆 𝐵 𝑈) which is bounded from above as: Õ𝑟 > max 𝑡𝑟 (𝑈 𝑆 𝐵 𝑈) ≤ 𝜎𝑖 . 𝑈 𝑖=1 27 𝑡𝑟 (𝑈 > 𝑆 𝐵 𝑈) is maximized when the columns of 𝑈 are the top-𝑟 eigenvectors of 𝑆 𝐵 . Therefore, the trace difference achieves the lower-bound when u 𝑗 ∈ 𝑛𝑢𝑙𝑙 (𝑆𝑊 ), ∀ 𝑗 ∈ {1, . . . , 𝑟 } and {u1 , u2 , . . . , u𝑟 } are the top-𝑟 Í eigenvectors of 𝑆 𝐵 and this lower-bound is equal to −𝜆 𝑟𝑖=1 𝜎𝑖 . As shown above, the objective function of LDA is lower bounded. Thus, the solution to (2.1) is also lower-bounded. Let 𝑓 (U1 , U2 , . . . , U𝑁 ) = 𝑡𝑟 (𝑈 > 𝑆𝑈), where 𝑈 = L(U1 ×13 · · · ×13 U𝑁 ) and 𝑆 is defined as in (1.23). If the function 𝑓 is non-increasing with each update of U𝑛 s, i.e. 𝑓 (U1𝑡 , U2𝑡 , . . . , U𝑛𝑡−1 , . . . , U𝑁𝑡−1 )≥ 𝑓 (U1𝑡 , U2𝑡 , . . . , U𝑛𝑡 , . . . , U𝑁 𝑡−1 ), ∀𝑡, 𝑛 ∈ {1, 2, . . . , 𝑁 }, then we can claim that Algorithm 2.1 converges to a fixed point as 𝑡 → − ∞ since 𝑓 (.) is lower-bounded. In [189], an approach to regulate the step sizes in the search algorithm was introduced to guarantee global convergence. In this chapter, this approach is used to update U𝑛 s. Thus, (2.5) can be optimized globally, and the objective value is non-increasing. As Multi-Branch extensions utilize TTDA on the update of each branch, proving the convergence of TTDA is sufficient to prove the convergence of 2WTTDA or 3WTTDA. 2.5 Experiments The proposed TT based discriminant analysis methods are evaluated in terms of classification accuracy, storage complexity, training complexity and sample size. We compared our methods1 with both linear supervised tensor learning methods including LDA, DGTDA and CMDA[112]2 as well as other Tensor Train based learning methods such as MPS [18], TTNPE [183]3 and STTM [38]4. The experiments were conducted on four different data sets: COIL-100, Weizmann Face, Cambridge and UCF-101. For all data sets and all methods, we evaluate the classification accuracy and training complexity with respect to storage complexity. In this chapter, classification accuracy is evaluated using a 1-NN classifier and quantified as 𝑁𝑡𝑟 𝑢𝑒 /𝑁𝑡𝑒𝑠𝑡 , where 𝑁𝑡𝑟 𝑢𝑒 is the number of test samples which were assigned the correct label and 𝑁𝑡𝑒𝑠𝑡 is the total number of test samples. Normalized storage complexity is quantified as the ratio of the total number of elements in 1 Our code is in https://github.com/mrsfgl/mbttda 2 https://github.com/laurafroelich/tensor_classification 3 https://github.com/wangwenqi1990/TTNPE. 4 https://github.com/git2cchen/KSTTM 28 the learnt tensor factors (U𝑛 , ∀𝑛) and projections (X𝑐𝑘 , ∀𝑐, 𝑘) of training data, 𝑂 𝑠 , to the size of the original training data (Y𝑐𝑘 , ∀𝑐, 𝑘): 𝑂𝑠 Î𝑁 . 𝐶𝐾 𝑛=1 𝐼 𝑛 Training complexity is the total runtime in seconds for learning the subspaces. All experiments were repeated 10 times with random selection of the training and test sets and average classification accuracies are reported. 
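For reference, the normalized storage cost that appears on the horizontal axes of the following figures can be computed directly from the sizes of the stored factors and projections. The helper below is a small sketch; the factor and projection shapes are hypothetical and only chosen to resemble a 2WTT model on the COIL-100 setup (𝐶𝐾 = 100 × 20 training samples).

```python
import numpy as np

def normalized_storage_cost(factor_shapes, proj_shape, n_samples, sample_shape):
    """O_s / (C*K*prod(I_n)): elements in the learned factors U_n plus all training
    projections X_ck, divided by the size of the original training data."""
    o_s = sum(int(np.prod(s)) for s in factor_shapes)      # tensor factors
    o_s += n_samples * int(np.prod(proj_shape))            # one projection per training sample
    return o_s / (n_samples * int(np.prod(sample_shape)))

factors = [(1, 8, 4), (4, 8, 4), (4, 8, 4), (4, 8, 1)]     # hypothetical 2WTT cores, r = 4
cost = normalized_storage_cost(factors, proj_shape=(4, 4),
                               n_samples=100 * 20, sample_shape=(8, 8, 8, 8))
print(f"normalized storage cost = {cost:.4f}")             # 0.0039
```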
The regularization parameter, 𝜆, for each experiment was selected using a validation set composed of all of the samples in the training set and a small subset of each class from the test set (10 samples for COIL-100, 5 samples for Weizmann, 1 sample for Cambridge, and 10 samples for UCF-101). Utilizing a leave-𝑠-out approach, where 𝑠 is the aforementioned subset size, 5 random experiments were conducted. The optimal 𝜆 was selected as the value that gave the best average classification accuracy among a range of values from 0.1 to 1000 increasing in a logarithmic scale. CMDA, TTNPE and MPS do not utilize the 𝜆 parameter while DGTDA utilizes eigendecomposition to find 𝜆 [112]. STTM has an outlier fraction parameter which was set to 0.02 according to the original paper [38]. 2.5.1 Data Sets COIL-100 The dataset consists of 7,200 RGB images of 100 objects of size 128 × 128. Each object has 72 images, where each image corresponds to a different pose angle ranging from 0 to 360 degrees with increments of 5 degrees [133]. For our experiments, we downsampled the grayscale images of all objects to 64 × 64. Each sample image was reshaped to create a tensor of size 8 × 8 × 8 × 8. Reshaping the inputs into higher order tensors is common practice and was studied in prior work [90, 140, 209, 45, 17]. 20 samples from each class were selected randomly as training data, i.e. Y ∈ R8×8×8×8×20×100 , and the remaining 52 samples were used for testing. Weizmann Face Database The dataset includes RGB face images of size 512×352 belonging to 28 subjects taken from 5 viewpoints, under 3 illumination conditions, with 3 expressions [188]. For our experiments, each image was grayscaled, and downsampled to 64 × 44. The images were then reshaped into 5-mode tensors of size 4 × 4 × 4 × 4 × 11 as in [183]. For each experiment, 20 samples were randomly selected to be the training data, i.e. Y ∈ R4×4×4×4×11×20×28 , and the remaining 25 samples were used in testing. 29 Cambridge Hand-Gesture Database The dataset consists of 900 image sequences of 9 gesture classes, which are combinations of 3 hand shapes and 3 motions. For each class, there are 100 image sequences generated by the combinations of 5 illuminations, 10 motions and 2 subjects [91]. Sequences consist of images of size 240 × 320 and sequence length varies. In our experiments, we used grayscaled versions of the sequences and we downsampled all sequences to length 30. We also included 2 subjects and 5 illuminations as the fourth mode. Thus, we have 10 samples for each of the 9 classes from which we randomly select 4 samples as the training set, i.e. Y ∈ R30×40×30×10×4×9 , and the remaining 6 as test set. UCF-101 Human Action Dataset UCF-101 is an action recognition dataset [162]. There are 13320 videos of 101 actions, where each action category might have different number of samples. Each sample is an RGB image sequence with frame size 240 × 320 × 3. The number of frames differs for each sample. In our experiments, we used grayscaled, downsampled frames of size 30 × 40. From each class, we extracted 100 samples to balance the class sizes where each sample consists of 50 frames obtained by uniformly sampling each video sequence. 60 randomly selected samples from each class were used for training, i.e. Y ∈ R30×40×50×60×101 , and the remaining 40 samples were used for testing. 2.5.2 Classification Accuracy We first evaluate the classification accuracy of the different methods with respect to normalized storage complexity. 
The varying levels of storage cost are obtained by varying the ranks, 𝑅𝑖 s, in the implementation of the tensor decomposition methods. Varying the truncation parameter 𝜏 ∈ (0, 1], the singular values smaller than 𝜏 times the largest singular value are eliminated. The remaining singular values are used to determine the ranks 𝑅𝑖 s for both TT-based and TD-based methods. For TT-based methods, the ranks are selected using TT-decomposition proposed in [140], while for TD-based methods truncated HOSVD was used. We also limit the upper bounds of the ranks to 𝑅𝑛 ≤ 𝐼𝑛 to reduce the memory cost associated with storing 𝐴𝑛 . With this limitation, the maximum dimension of projected samples x𝑐𝑘 becomes 𝐼 𝑁 which is a Î𝑁 fraction of the original sample size 𝑖=1 𝐼𝑖 . In Fig. 2.3, we present comparisons of BTT based eigenpair computation algorithm [56] with the proposed method. The experiments were conducted on COIL-100 data set. BTT performs similarly to 30 (a) (b) Figure 2.3: Comparisons with a BTT based Ritz pair computation algorithm. a) Classification accuracy and b) Training time with respect to normalized storage cost. 31 (a) (b) (c) (d) Figure 2.4: Classification accuracy vs. Normalized storage cost of the different methods for: a) COIL-100, b) Weizmann Face, c) Cambridge Hand Gesture and d) UCF-101. All TD based methods are denoted using ’x’, TT based methods are denoted using ’+’ and proposed methods are denoted using ’*’. STTM and LDA are denoted using ’4’ and ’o’, respectively. TTDA with slightly lower accuracy and higher training time for increasing storage complexity. As BTT has adaptive ranks, the factor sizes increase more compared to TTDA which results in a much higher computational cost. Although the scatter tensor is approximated using a TT structure, BTT does not provide better computational complexity than TTDA as the approximation itself has 𝑂 (𝐼 2𝑁 ) complexity. Multi- branch extensions of TTDA outperform BTT both in terms of classification accuracy and computational complexity. Figure 2.4a illustrates the classification accuracy of the different methods with respect to normalized storage complexity for COIL-100 data set. For this particular dataset, we implemented all of the methods mentioned above. It can be seen that the proposed discriminant analysis framework in its original form, TTDA, gives the highest accuracy results followed by TTNPE. However, these two methods only operate at very low storage complexities since the TT-ranks of tensor factors are constrained to be smaller than the corresponding mode’s input dimensions. We also implemented STTM, which does not provide compression 𝐶 (𝐶−1) rates similar to other TT-based methods. This is due to the fact that STTM needs to learn 2 classifiers 32 (a) (b) (c) (d) Figure 2.5: Training complexity vs. Normalized storage cost of the different methods for: a) COIL-100, b) Weizmann Face, c) Cambridge Hand Gesture, and d) UCF-101. with TT structure. Moreover, these methods have very high computational complexity as will be shown in Section 2.5.3. For this reason, they will not be included in the comparisons for the other datasets. For a wide range of storage complexities, MPS and 2WTTDA perform the best and have similar accuracy. It can also be seen that the storage costs of MPS and 2WTTDA stop increasing after some point due to rank constraints. This is in line with the theoretical storage complexity analysis presented in Section 2.4. 
Tucker based methods, such as CMDA and DGTDA, along with the original vector based LDA have lower classification accuracy. Figure 2.4b similarly illustrates the classification accuracy of the different methods on the Weizmann Face Database. For all storage complexities, the proposed 2WTTDA and 3WTTDA perform better than the other methods, including TT based methods such as MPS. Figure 2.4c illustrates the classification accuracy for the Cambridge hand gesture database. In this case, 3WTTDA performs the best for most storage costs. As the number of samples for training, validation and testing is very low for Cambridge dataset, the classification accuracy fluctuates with respect to the dimensionality of the features at normalized storage cost of 0.02. Similar fluctuations can also be seen in the 33 results of [183]. Finally, we tested the proposed methods on a more realistic, large sized dataset, UCF-101. For this dataset, TT-based methods perform better than the Tucker based methods. In particular, 2WTTDA performs very close to MPS at low storage costs, whereas 3WTTDA performs well for a range of normalized storage costs and provides the highest accuracy overall. Even though our methods outperform MPS for most datasets, the classification accuracies get close for UCF-101 and COIL-100. This is due to the high number of classes in these datasets. As the number of classes increases, the number of scatter matrices that needs to be estimated also increases which results in a larger bias given limited number of training samples. This improved performance of MPS for datasets with large number of classes is also observed when MPS is compared to CMDA. Therefore, the reason that MPS and the proposed methods perform similarly is a limitation of discriminant analysis rather than the proposed tensor network structure. 2.5.3 Training Complexity In order to compute the training complexity, for TT-based methods, each set of tensor factors is optimized until the change in the normalized difference between consecutive tensor factors is less than 0.1 or 200 iterations is completed. After updating the factors in a branch, no further optimizations are done on that branch in each iteration. CMDA iteratively optimizes the subspaces for a given number of iterations (which is set to 20 to increase the speed in our experiments) or until the change in the normalized difference between consecutive subspaces is less than 0.1. Figs. 2.5a, 2.5b, 2.5c, 2.5d illustrate the training complexity of the different methods with respect to normalized storage cost for the four different datasets. In particular, Figure 2.5a illustrates the training complexity of all the methods including TTNPE, TTDA and STTM for COIL-100. It can be seen that STTM has the highest computational complexity among all of the tested methods. This is due to the fact that for a 100-class classification problem, STTM implements (100) (99)/2 one vs. one binary classifiers, increasing the computational complexity. Similarly, TTNPE has high computational complexity as it tries to learn the manifold projections which involves eigendecomposition of the embedded graph. Among the remaining methods, LDA has the highest computational complexity as it is based on learning from vectorized samples which increases the dimensionality of the covariance matrices. For the tensor based methods, the proposed 34 2WTTDA and 3WTTDA have the lowest computational complexity followed by MPS and DGTDA. 
In particular, for large datasets like UCF-101 the difference in computational complexity between our methods and existing TT-based methods such as MPS is more than a factor of 102 . Table 2.3: Classification accuracy (top) and training time (bottom) with standard deviation for various methods and datasets. Accuracy (%) 3WTTDA 2WTTDA MPS[18] CMDA[112] DGTDA[112] COIL-100 95.6 ± 0.4 94.8 ± 0.5 94.2 ± 0.2 86.3 ± 0.7 76.6 ± 0.9 Weizmann 93.6 ± 2 97.6 ± 1.2 87.5 ± 2.3 96.4 ± 1.03 69.9 ± 1.8 Cambridge 98.2 ± 1.7 89.1 ± 16.7 56.2 ± 9.8 95 ± 2.8 35.4 ± 8.7 UCF-101 68.6 ± 0.8 67.7 ± 0.9 67.9 ± 0.6 67.7 ± 0.8 57.3 ± 2.7 (s) COIL-100 0.09 ± 0.005 0.24 ± 0.06 1.4 ± 0.13 12.2 ± 6.6 0.7 ± 0.06 Weizmann 0.05 ± 0.02 0.09 ± 0.02 0.13 ± 0.01 2.6 ± 0.3 0.16 ± 0.02 Cambridge 0.11 ± 0.01 1.7 ± 1.5 2.07 ± 0.25 12.6 ± 0.3 0.7 ± 0.04 UCF-101 0.67 ± 0.02 0.853 ± 0.13 56.4 ± 1.9 413.5 ± 24.1 35.3 ± 2.9 2.5.4 Convergence In this section, we present an empirical study of convergence for TTDA in Figure 2.6 where we report the objective value of TTDA, i.e. the expression inside argmin operator in (2.5), with random initialization of projection tensors. This figure illustrates the convergence of the TTDA algorithm, which is at the core of both 2WTTDA and 3WTTDA, on COIL-100 dataset. It can be seen that even for random initializations of the tensor factors, the algorithm converges in a small number of steps. The convergence rates for 2WTTDA and 3WTTDA are faster than that of TTDA as they update smaller sized projection tensors as shown in Section 2.4. 2.5.5 Effect of Sample Size on Accuracy We also evaluated the effect of training sample size on classification accuracy for Weizmann Dataset. In Figure 2.7, we illustrate the classification accuracy with respect to training sample size for different methods. It can be seen that 3WTTDA provides a high classification accuracy even for small training datasets, i.e., for 15 training samples it provides an accuracy of 96%. This is followed by CMDA and 2WTTDA. It should also be noted that DGTDA is the most sensitive to sample size as it cannot even achieve the local optima and more data allows it to learn better classifiers. 35 Figure 2.6: Convergence curve for TTDA on COIL-100. Objective value vs. the number of iterations is shown. Figure 2.7: Comparison of classification accuracy vs. training sample size for Weizmann Face Dataset for different methods. 2.5.6 Summary of Experimental Results In Table 2.3, we summarize the performance of the different algorithms for the four different datasets considered in this chapter. In the left half of this table, we report the classification accuracy (mean ± std) of the different methods for a fixed normalized storage cost of about 2.10−2 for COIL-100, 6.10−3 for Weizmann Face, 2.10−4 for Cambridge Hand Gesture and 10−3 for UCF-101 datasets. At the given compression rates, for all datasets the proposed 3WTTDA and 2WTTDA perform better than the other tensor based methods. In some cases, the improvement in classification accuracy is significant, e.g. for Weizmann and Cambridge data sets. These results show that the proposed method achieves the best trade-off, i.e. between normalized storage complexity and classification accuracy. Similarly, the right half of Table 2.3 summarizes the average training complexity for the different 36 methods for the same normalized storage cost. From this Table, it can be seen that 3WTTDA is the most computationally efficient method for all datasets. This is followed by 2WTTDA. 
The difference in computational time becomes more significant as the size of the dataset increases, e.g. for UCF-101. Therefore, even if the other methods perform well for some of the datasets, the proposed methods provide higher accuracy at a computational complexity reduced by a factor of 102 . 2.6 Graph Regularized Tensor Train Decomposition The geometric relationship between data samples has been shown to be important for learning low- dimensional structures from high-dimensional data [171, 183]. Recently, motivated by manifold learning, dimensionality reduction of tensor objects has been formulated to incorporate the geometric structure [115]. The goal is to learn a low dimensional representation for tensor objects that incorporates the geometric structure while maintaining a low reconstruction error in the tensor decomposition. This idea of manifold learning for tensors has been mostly implemented for the Tucker method, including Graph Laplacian Tucker Decomposition (GLTD) [82] and nonnegative Tucker factorization (NTF). However, this line of work suffers from the limitations of Tucker decomposition mentioned early in the chapter [45]. Earlier in this chapter, we proposed a TT model, called multi-branch Tensor Train, such that the features can be matrices and higher-order tensors. In this way, the computational efficiency could be improved. Utilizing this structure, specifically the two-way approach, we propose a graph-regularized TT decomposition for unsupervised dimensionality reduction. 2.6.1 Problem Statement Our goal is to find a TT projection such that the geometric structure of the samples Y𝑠 ∈ R𝐼1 ×···×𝐼𝑁 is preserved, i.e. the distance between the samples, Y𝑠 , should be similar to that between the projections 𝑋𝑠 , while the reconstruction error of the low-rank TT decomposition is minimized. This goal can be formulated 37 through the following cost function as: 𝑓𝑂 ({U}, X) = Õ𝑆 kY𝑠 − U1 ×13 · · · ×13 U𝑘 ×13 𝑋𝑠 ×12 U𝑘+1 ×13 · · · ×13 U𝑁 k 2𝐹 𝑠=1 𝑆 𝑆 𝜆 ÕÕ + k 𝑋𝑠 − 𝑋𝑠0 k 2𝐹 𝑤 𝑠𝑠0 , L(U𝑛 ) > L(U𝑛 ) = I𝑟𝑛 , ∀𝑛 2 𝑠=1 0 𝑠 =1 𝑠0 ≠𝑠 where {U} denotes the set of tensor factors U𝑛 , ∀𝑛 ∈ {1, . . . , 𝑁 }, X ∈ R𝑟𝑘 ×𝑆×𝑟𝑘+1 is the tensor whose slices are 𝑋𝑠 and 𝑤 𝑠𝑠0 is the similarity between tensor samples defined by:   if Y𝑠 ∈ N𝑘 (Y𝑠0 ) or Y𝑠0 ∈ N𝑘 (Y𝑠 )   1,   𝑤 𝑠𝑠0 = , (2.9)    0,  otherwise  where N𝑘 (Y𝑠 ) is the k-nearest neighborhood of Y𝑠 . The objective function can equivalently be expressed as:   𝑓𝑂 ({U}, X) = k𝝅 𝑘+1 (Y) − U1 ×13 · · · ×13 U𝑘 ×13 X ×13 U𝑘+1 ×13 · · · ×13 U𝑁 k 2𝐹 + 𝜆tr (X ×12 Φ) ×1,2 1,3 X , L(U𝑛 ) > L(U𝑛 ) = I𝑟𝑛 , for 𝑛 ≤ 𝑘 and R(U𝑛 )R(U𝑛 ) > = I𝑟𝑛 , for 𝑛 > 𝑘, where 𝝅 𝑘+1 (Y) ∈ R𝐼1 ×···×𝐼𝑘 ×𝑆×𝐼𝑘+1 ×···×𝐼𝑁 is the permuted version of Y such that the last mode is moved to the (𝑘 + 1)th mode and all modes larger than 𝑘 are shifted by one mode, 𝑊 ∈ R𝑆×𝑆 is the adjacency matrix Í and Φ = 𝐷 − 𝑊 ∈ R𝑆×𝑆 is the graph Laplacian where 𝐷 is a diagonal degree matrix with, 𝑑 𝑠𝑠 = 𝑆𝑠0=1 𝑤 𝑠𝑠0 . 2.6.2 Optimization The goal of obtaining low-rank tensor train projections that preserve the data geometry can be achieved by minimizing the objective function as follows: argmin 𝑓𝑂 ({U}, X), s.t. L(U𝑛 ) > L(U𝑛 ) = I𝑟𝑛 , for 𝑛 ≤ 𝑘, (2.10) {U }, X and R(U𝑛 )R(U𝑛 ) > = I𝑟𝑛 , for 𝑛 > 𝑘. As we want our tensor factors to be orthogonal, the solutions lie in the Stiefel manifold S𝑛 , i.e. L(U𝑛 ) ∈ S𝑛 for 𝑛 ≤ 𝑘 and R(U𝑛 ) > ∈ S𝑛 for 𝑛 > 𝑘. Although the function 𝑓𝑂 (.) is convex, the optimization problem is nonconvex due to the manifold constraints on U𝑛 s. 
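For completeness, a minimal construction of the similarity graph in (2.9) and the Laplacian Φ = 𝐷 − 𝑊 is sketched below. Sample tensors are compared through the Frobenius distance of their vectorizations, which is one natural reading of the k-nearest neighborhood N𝑘 (·), and the choice 𝑘 = log(𝑆) used in the experiments of Section 2.7 is adopted; all data are random placeholders.

```python
import numpy as np

def knn_graph_laplacian(samples, k):
    """Symmetric k-NN adjacency W as in (2.9) and graph Laplacian Phi = D - W."""
    X = np.stack([np.asarray(s).ravel() for s in samples])        # S x prod(I_n)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)           # pairwise squared distances
    S = X.shape[0]
    W = np.zeros((S, S))
    for s in range(S):
        W[s, np.argsort(d2[s])[1:k + 1]] = 1.0                    # k nearest neighbors, self excluded
    W = np.maximum(W, W.T)                                        # w_ss' = 1 if either is a neighbor
    Phi = np.diag(W.sum(axis=1)) - W
    return W, Phi

samples = np.random.default_rng(4).standard_normal((30, 4, 7, 4, 7))   # 30 hypothetical samples
k = max(int(np.log(len(samples))), 1)                                   # k = log(S), as in Section 2.7
W, Phi = knn_graph_laplacian(samples, k)
print(W.shape, Phi.shape)      # (30, 30) (30, 30)
```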
38 The solution to the optimization problem can be obtained by Alternating Direction Method of Multipliers (ADMM). In order to solve the optimization problem we define {V}, as the set of auxiliary variables V𝑛 , ∀𝑛 ∈ {1, . . . , 𝑁 } and rewrite the objective function as: argmin 𝑓𝑂 ({V}, X) subject to U𝑛 = V𝑛 , ∀𝑛 {U }, {V },X L(U𝑛 ) ∈ S𝑛 , ∀𝑛 ≤ 𝑘 and R(U𝑛 ) > ∈ S𝑛 , ∀𝑛 > 𝑘. The partial augmented Lagrangian is given by: 𝑁 𝑁 Õ 𝛾Õ L ({U}, {V}, X, {Z}) = 𝑓𝑂 ({V}, X) − Z𝑛 ×1,2,3 1,2,3 (V𝑛 − U𝑛 ) + kV𝑛 − U𝑛 k 2𝐹 , (2.11) 𝑛=1 2 𝑛=1 where Z𝑛 s are the Lagrange multipliers and 𝛾 is the penalty parameter. As each tensor factor is independent from the others, we update the variables for each mode 𝑛 using the corresponding part of the augmented Lagrangian: 𝛾 L 𝑛 (U𝑛 , V𝑛 , Z𝑛 ) = 𝑓𝑂 (V𝑛 ) − Z𝑛 ×1,2,3 1,2,3 (V𝑛 − U𝑛 ) + kV𝑛 − U𝑛 k 2𝐹 , (2.12) 2 where 𝑓𝑂 (V𝑛 ) denotes the objective function where all variables other than V𝑛 are fixed. The solution for each variable at iteration 𝑡 + 1 can then be found using a step-by-step approach as:  V𝑛𝑡+1 = argmin L 𝑛 U𝑛𝑡 , V𝑛 , Z𝑛𝑡 , (2.13) V𝑛     L 𝑛 U𝑛 , V𝑛𝑡+1 , Z𝑛𝑡 , L(𝑈𝑛 ) ∈ S𝑛 for 𝑛 ≤ 𝑘,    U𝑛𝑡+1 = argmin U𝑛  L 𝑛 U𝑛 , V𝑛𝑡+1 , Z𝑛𝑡 , R(𝑈𝑛 ) > ∈ S𝑛     for 𝑛 > 𝑘,  Z𝑛𝑡+1 = Z𝑛𝑡 − 𝛾(V𝑛𝑡+1 − U𝑛𝑡+1 ). (2.14) Once V𝑛 , U𝑛 , Z𝑛 are updated for all 𝑛, samples X are computed using:   X 𝑡+1 = argmin L {U 𝑡+1 }, {V 𝑡+1 }, X, {Z 𝑡+1 } . (2.15) X Solution for V𝑛 : For 𝑛 ≤ 𝑘, the solution for V𝑛𝑡+1 can be written explicitly as: V𝑛𝑡+1 = argmin k𝝅 𝑘+1 (Y) − V1𝑡+1 ×13 · · · ×13 V𝑛 ×13 · · · ×13 V𝑘𝑡 ×13 X 𝑡 ×13 V𝑘+1 𝑡 ×13 · · · ×13 V𝑁𝑡 k 2𝐹 V𝑛 1,2,3 𝛾 −Z𝑛𝑡 ×1,2,3 (V𝑛 − U𝑛𝑡 ) + kV𝑛 − U𝑛𝑡 k 2𝐹 . (2.16) 2 39 We can equivalently convert this equation into matrix form as:   𝛾 V𝑛𝑡+1 = argmin k𝐻L(V𝑛 )𝑃 − T𝑛 (𝝅 𝑘+1 (Y))k 2𝐹 − tr L(Z𝑛𝑡 ) > L(V𝑛 − U𝑛𝑡 ) + kL(V𝑛 ) − L(U𝑛𝑡 ) k 2𝐹 , V𝑛 2 (2.17)   𝑡 ×1 · · · ×1 V 𝑡 ×1 X 𝑡 ×1 · · · ×1 V 𝑡 ). The analytical solution is found where 𝐻 = I𝐼𝑛 ⊗ 𝑉 ≤𝑛−1𝑡+1 , 𝑃 = R(V𝑛+1 3 3 𝑘 3 3 3 𝑁 by taking the derivative with respect to L(V𝑛 ) and setting it to zero: ! 2𝐻 > 𝐻L(V𝑛𝑡+1 )𝑃 − 𝐺 𝑃> − L(Z𝑛𝑡 ) + 𝛾L(V𝑛𝑡+1 ) − 𝛾L(U𝑛𝑡 ) = 0,  −1 T3 (V𝑛𝑡+1 ) = 2(𝑃𝑃> ⊗ 𝐻 > 𝐻) + 𝛾I𝑟𝑛−1 𝐼𝑛 𝑟𝑛 T3 (𝛾U𝑛𝑡 + Z𝑛𝑡 ) + 2T2 (𝐻 > 𝐺𝑃> ) ,  (2.18) where 𝐺 = T𝑛 (𝝅 𝑘+1 (Y)). Note that the inverse in the solution will always exist given 𝛾 > 0 as the inverse of a sum of a Hermitian matrix and an identity matrix always exists. When 𝑛 > 𝑘, following (2.17) the solution for V𝑛 can be written in the same manner but with different  𝑡+1  𝐻, 𝐺 and 𝑃, where 𝐻 = 𝑉 ≤𝑛−1 𝑡 , 𝑃 = 𝑉>𝑛 ⊗ I𝐼𝑛 and 𝐺 = T𝑛+1 (𝝅 𝑘+1 (Y)). Solution for U𝑛 : For 𝑛 ≤ 𝑘, we can solve (2.12) for U𝑛 using:   𝛾 U𝑛𝑡+1 = argmin −tr L(Z𝑛𝑡 ) > L(V𝑛𝑡+1 − U𝑛 ) + kL(V𝑛𝑡+1 ) − L(U𝑛 ) k 2𝐹 U𝑛 :L( U𝑛 ) ∈S𝑛 2 1 = argmin kL(V𝑛𝑡+1 ) − L(Z𝑛𝑡 ) − L(U𝑛 ) k 2𝐹 , (2.19) U𝑛 :L( U𝑛 ) ∈S𝑛 𝛾 which is found by applying a singular value decomposition to L(V𝑛𝑡+1 ) − 𝛾1 L(Z𝑛𝑡 ). When 𝑛 > 𝑘, the optimal solution is similarly found by applying SVD to R(V𝑛𝑡+1 ) − 𝛾1 R(Z𝑛𝑡 ). Solution for X: Let 𝝅2 (X) ∈ R𝑟𝑘 ×𝑟𝑘+1 ×𝑆 be the permutation of X, (2.15) can equivalently be rewritten in matrix form as: h i 2 𝑡+1 > + 𝜆tr L(𝝅2 (X))ΦL(𝝅2 (X)) > . 𝑡+1  argmin 𝑉>𝑘 ⊗ 𝑉 ≤𝑘 L(𝝅2 (X)) − T 𝑁 (Y) (2.20) X 𝐹 The solution for X 𝑡+1 does not have any constraints, thus it is solved analytically by setting the derivative of (2.20) to zero: 2𝐻 > (𝐻L(𝝅2 (X 𝑡+1 )) − 𝐺) + 2𝜆L(𝝅2 (X 𝑡+1 ))Φ = 0, 𝐻 > 𝐻L(𝝅2 (X 𝑡+1 )) + 𝜆L(𝝅2 (X 𝑡+1 ))Φ = 𝐻 > 𝐺, (2.21) where 𝐻 = 𝑉>𝑘 𝑡 > ⊗ 𝑉 𝑡+1 and 𝐺 = T (Y). (2.21) is a Sylvester equation which can be solved efficiently ≤𝑘 𝑁 [14]. 
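Two of the sub-problems above have closed-form updates that are easy to verify numerically: the orthogonality-constrained U𝑛 step reduces to finding the nearest matrix with orthonormal columns to L(V𝑛 ) − (1/𝛾)L(Z𝑛 ), which is one standard reading of "applying a singular value decomposition" above, and the X step is the Sylvester equation (2.21). The sketch below uses random stand-in matrices with hypothetical shapes.

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(5)

# --- U_n update: nearest orthonormal-column matrix to L(V_n) - (1/gamma) L(Z_n) ---
def project_orthonormal(A):
    """Closest matrix with orthonormal columns to A in Frobenius norm (thin SVD)."""
    u, _, vt = np.linalg.svd(A, full_matrices=False)
    return u @ vt

gamma = 10.0
LV, LZ = rng.standard_normal((32, 4)), rng.standard_normal((32, 4))
LU = project_orthonormal(LV - LZ / gamma)
print(np.allclose(LU.T @ LU, np.eye(4)))                  # True: L(U_n)^T L(U_n) = I_{r_n}

# --- X update: (H^T H) X + X (lambda * Phi) = H^T G is a Sylvester equation ---
r, S, m, lam = 6, 40, 50, 0.5                             # hypothetical sizes
H = rng.standard_normal((m, r))                           # stand-in for V_{>k} kron V_{<=k}
G = rng.standard_normal((m, S))                           # stand-in for T_N(Y)
W = np.eye(S, k=1) + np.eye(S, k=-1)                      # path-graph adjacency as a stand-in
Phi = np.diag(W.sum(axis=1)) - W                          # its Laplacian D - W
X = solve_sylvester(H.T @ H, lam * Phi, H.T @ G)          # solves A X + X B = Q
print(np.allclose(H.T @ H @ X + lam * X @ Phi, H.T @ G))  # True
```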
Similar to the case for V𝑛 , the solution to this problem always exists. Solution for Z𝑛 : Finally, we update the Lagrange multipliers Z𝑛𝑡 using (2.14). 40 Algorithm 2.3: Graph Regularized Tensor Train-ADMM(GRTT-ADMM) Input: Input tensors Y𝑠 ∈ R𝐼1 ×𝐼2 ×···×𝐼𝑁 where 𝑠 ∈ {1, . . . , 𝑆}, initial tensor factors {U 1 }, 𝑛 ∈ {1, . . . , 𝑁 }, 𝑘, 𝜆, 𝑟 1 , . . . , 𝑟 𝑁 , 𝐿𝑜𝑜 𝑝𝐼𝑡𝑒𝑟, 𝐶𝑜𝑛𝑣𝑇 ℎ𝑟𝑒𝑠ℎ Output: U𝑛 , 𝑛 ∈ {1, . . . , 𝑁 }, and 𝑋𝑠 , ∀𝑠 1: {V 1 } ← {U 1 }. 2: {Z 1 } ← 0. 3: while 𝑡 < 𝐿𝑜𝑜 𝑝𝐼𝑡𝑒𝑟 or 𝑐 > 𝐶𝑜𝑛𝑣𝑇 ℎ𝑟𝑒𝑠 do 4: for 𝑛 = 1 : 𝑁 do 5: Find V𝑛𝑡+1 using (2.18). 6: Find U𝑛𝑡+1 using SVD to solve (2.19). 7: Find Z𝑛𝑡+1 using (2.14). 8: end for 9: Find X 𝑡+1 using (2.21). Í 𝑁 k V𝑛𝑡+1 −V𝑛𝑡 k 2𝐹 10: 𝑐 ← 𝑁1 𝑛=1 k V𝑛𝑡 k 2𝐹 11: 𝑡 = 𝑡 + 1. 12: end while 2.6.3 Convergence Convergence of ADMM is guaranteed for convex functions but there is no theoretical proof of the convergence of ADMM for nonconvex functions. Recent research has provided some theoretical guarantees for the convergence of ADMM for a class of nonconvex problems under some conditions [187]. Our objective function is nonconvex due to unitary constraints. In [187], it has been shown that this type of nonconvex optimization problems, i.e. convex optimization on a Stiefel manifold, converge under some conditions. We show that these conditions hold for each optimization problem corresponding to mode 𝑛. The gradient of 𝑓𝑂 with respect to V𝑛 is Lipschitz continuous with Lipschitz constant 𝐿 ≥ k𝑃𝑃> ⊗ 𝐻 > 𝐻 k 2 , which fulfills the conditions given in [187]. Thus, L 𝑛 convergences to a set of solutions V𝑛𝑡 , U𝑛𝑡 , Z𝑛𝑡 , given that 𝛾 ≥ 2𝐿 + 1. The solution for X is found analytically. As the iterative solutions for each variable converge and the optimization function is nonnegative, i.e. bounded from below, the algorithm converges to a local minimum. 2.7 Experiments The proposed method is evaluated for clustering and compared to existing tensor clustering methods including k-means, MPS [18], TTNPE [183] and GLTD [82] for Weizmann Face Database and MNIST Dataset. Clustering quality is quantified by Normalized Mutual Information (NMI). Average accuracy with respect to both storage complexity and computation time over 20 experiments are reported for all methods. 41 In the following experiments, the storage complexity is quantified as the size of the tensor factors (U𝑛 , ∀𝑛) and projections (X𝑠 , ∀𝑠). The varying levels of storage cost are obtained by varying 𝑟 𝑛 s in the implementation of the tensor decomposition methods. Using varying levels of a truncation parameter 𝜏 ∈ (0, 1], the singular values smaller than 𝜏 times the largest singular value are discarded. The rest are used to determine ranks 𝑟 𝑛 for both TT based and TD based methods. For GRTT and TTNPE, the ranks are selected using TT decomposition proposed in [140], while for GLTD truncated HOSVD was used. Computational complexity is quantified as the time it takes to learn the tensor factors. In order to compute the run time, for TT based methods, each set of tensor factors is optimized until the change in the normalized difference between consecutive tensor factors is less than 0.01 or 50 iterations are completed. The regularization parameter, 𝜆, for each experiment was selected using a validation set composed of a small batch of samples not included in the experiments. 5 random experiments were conducted and optimal 𝜆 was selected as the value that gave the best average NMI for a range of 𝜆 values from 0.001 to 1000 increasing in a logarithmic scale. 
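The 𝜏-based rank selection used in both the classification and clustering experiments can be sketched as follows. For illustration the threshold is applied to sequential unfoldings of a single random sample; in the actual experiments the same threshold is applied inside the TT decomposition of [140] or the truncated HOSVD, so this is only an approximation of that procedure.

```python
import numpy as np

def select_rank(mat, tau):
    """Number of singular values that are at least tau times the largest one."""
    s = np.linalg.svd(mat, compute_uv=False)
    return int(np.sum(s >= tau * s[0]))

def tt_ranks_by_truncation(Y, tau):
    """Pick ranks R_1, ..., R_{N-1} by thresholding the singular values of the
    sequential unfoldings of Y (a simplified stand-in for the rule used in [140])."""
    dims = Y.shape
    return [select_rank(Y.reshape(int(np.prod(dims[:n])), -1), tau)
            for n in range(1, len(dims))]

Y = np.random.default_rng(3).standard_normal((8, 8, 8, 8))   # hypothetical sample
for tau in (0.1, 0.5, 0.9):
    print(tau, tt_ranks_by_truncation(Y, tau))               # larger tau gives smaller ranks
```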
The similarity graphs were constructed using k-nearest neighbor method with 𝑘 = 𝑙𝑜𝑔(𝑆) following [177]. 2.7.1 MNIST MNIST is a database of grayscale handwritten digit images where each image is of size 28 × 28. We transformed each of the images to a 4 × 7 × 4 × 7 tensor. Reshaping the inputs into higher order tensors is common practice and was employed in prior work [90, 140, 209, 45, 17]. In our experiments, we used a subset of 500 images with 50 images from each class. 50 samples with 5 samples from each class are used as validation set to determine 𝜆. (a) (b) Figure 2.8: (a) Normalized Mutual Information vs. Storage Complexity of different methods for MNIST dataset. (b) Computation Time vs. Storage Complexity of different methods for MNIST dataset. 42 In Figure 2.8a, we can see that at all storage complexity levels, our approach gives the best clustering result in terms of NMI. The dotted purple line represents the accuracy of k-means clustering on original tensor data. Even though the performance of TTNPE is the closest to our method, it is computationally inefficient. In Figure 2.8b, we can see that our approach is faster than GLTD and TTNPE at all storage complexities. MPS is the most efficient in terms of speed but it provides poor clustering quality. 2.7.2 COIL The dataset consists of 7,200 RGB images of 100 objects of size 128 × 128. Each object has 72 images, where each image corresponds to a different pose angle ranging from 0 to 360 degrees with increments of 5 degrees [133]. We used a subset of 20 classes and 32 randomly selected, downsampled, grayscale samples from each class. Each image was converted to an 8 × 8 × 8 × 8 tensor. 8 samples from each class are used as validation set. (a) (b) Figure 2.9: (a) Normalized Mutual Information vs Storage Complexity of different methods for COIL dataset. (b) Computation Time vs Storage Complexity of different methods for COIL dataset. From Figure 2.9a, we can see that the proposed method provides the best clustering results compared to all other methods. The results for GLTD seem to deteriorate with increasing ranks, which is a result of using orthonormal tensor factors. TTNPE gives results closest to the proposed method but it is computationally inefficient and gets very slow with increasing 𝑟 𝑛 s. From Figure 2.9b, we can see that MPS provides best results in terms of run time but the proposed method has a similar computational complexity while providing better clustering accuracy. 43 2.8 Conclusions In this chapter, we propose a new Tensor Train implementation structure and two associated learning tasks. In the first part of the chapter, i.e. Tensor Train Disciminant Analysis, we introduce a novel approach for Tensor Train based discriminant analysis for tensor object classification. The proposed approach first formulated linear discriminant analysis such that the learnt subspaces have a TT structure. The resulting framework, TTDA, reduces storage complexity at the expense of high computational complexity. This increase in computational complexity is then addressed by reshaping the projection vector into matrices and third-order tensors, resulting in 2WTTDA and 3WTTDA, respectively. A theoretical analysis of storage and computational complexity illustrated the tradeoff between these two quantities and suggest a way to select the optimal number of modes in the reshaping of TT structure. 
The proposed methods were compared with the state-of-the-art TT based subspace learning methods as well as tensor based discriminant analysis for four datasets. While providing reduced storage and computational costs, the proposed methods also yield higher or similar classification accuracy compared to state-of-the art tensor based learning methods such as CMDA, STTM, TTNPE and MPS. In the second part of the chapter, i.e. Graph Regularized Tensor Train Decomposition we proposed a unsupervised graph regularized tensor train decomposition for dimensionality reduction. To the best of our knowledge, this is the first tensor train based dimensionality reduction method that incorporates manifold information through graph regularization. The proposed method also utilizes a multi-branch structure to implement tensor train decomposition which increases the computational efficiency. An ADMM based algorithm is proposed to solve the resulting optimization problem. The proposed method was compared to GLTD, TTNPE and MPS for unsupervised learning in two different datasets. The proposed multi-branch structure can also be extended to other unsupervised methods such as dictionary learning, subspace learning, denoising, data recovery, and compression applications. The structure can also be optimized by permuting the modes in a way that the dimensions are better balanced than the original order. 44 CHAPTER 3 TENSOR METHODS FOR ANOMALY DETECTION ON SPATIOTEMPORAL DATA 3.1 Introduction Large volumes of spatiotemporal data are ubiquitous in a diverse range of applications including climate science, social sciences, neuroscience, epidemiology [20], transportation systems [55], mobile health, and Earth sciences [207]. One emerging application of interest in spatiotemporal data is anomaly detection. De- tecting anomalies from these large data volumes is important for identifying interesting but rare phenomena, e.g. traffic congestion or irregular crowd movement in urban areas, or hot-spots for monitoring outbreaks in infectious diseases. Traditionally, the problem of anomaly detection has been approached mainly through statistical and machine learning techniques [34, 59, 79]. However, these techniques are not as effective on spatiotemporal data as anomalies are highly correlated across time and space, they can no longer be modeled as i.i.d. The definition of anomaly and the suitability of a particular method is determined by the application. In this chapter, we focus on urban anomaly detection [206, 205, 117, 118, 203], where anomalies correspond to incidental events that occur rarely, such as irregularity in traffic volume, unexpected crowds, etc. Urban data are spatiotemporal data collected by mobile devices or distributed sensors in cities and are usually associated with timestamps and location tags. Detecting and predicting urban anomalies are of great importance to policymakers and governments for understanding city-scale human mobility and activity patterns, inferring land usage and region functions and discovering traffic problems [206, 31]. Urban anomalies are usually modeled as group anomalies within spatiotemporal data. This type of anomalies exhibit themselves as spatially contiguous groups of locations that show anomalous values con- sistently for a short duration of time stamps. 
Early approaches to detecting these anomalies decompose the anomaly detection problem by first treating the spatial and temporal properties of the outliers independently using univariate vector modeling frameworks, and then merging them in a post-processing step [58, 191]. Some examples are dynamic linear models (DLMs) [108], including time-varying autoregressive (TVAR) models [24, 25] and switching Kalman filtering/smoothing (SKF/SKS) [131, 134]. However, as the number of sensors increases, scalability becomes a critical issue and these methods become inefficient and unreliable [59]. For these reasons, one natural approach to address ST anomaly detection has been to use tensor decomposition to capture the multiway structure of the data.

While there has been a growing number of papers on low-rank tensor models for anomaly detection [60, 186, 193, 185, 205, 117], most of the existing methods are not particularly suited to the complex structure of anomalies in urban traffic data and the unique challenges associated with them. These challenges include: 1) scarcity of anomalies; 2) the contextual dependency of what constitutes an anomaly, i.e., the criteria for an anomaly may vary for different regions and time windows due to external influences such as weather patterns; 3) long-term periodicity across time, e.g., one day or one week; and 4) strong short-term temporal correlations.

In this chapter, we address these challenges by proposing temporally regularized, locally consistent, robust low-rank plus sparse tensor models to decompose the urban traffic data into normal and anomalous components. The key contributions of the proposed methods are:

• Modeling Temporal Persistence: The traditional low-rank plus sparse tensor decomposition model is modified to account for the characteristics of urban traffic data. As anomalies tend to last for periods of time, i.e., have strong short-term dependencies, we impose temporal smoothness on the sparse part of the tensor through total variation regularization. This regularization ensures that instantaneous changes in the data, which may be due to errors in sensing, are not mistaken for actual anomalies. This formulation leads to our first algorithm: low-rank plus temporally smooth sparse (LOSS) tensor decomposition.

• Incorporating Local Structure: We introduce additional structure to the solution of LOSS by enforcing the low-rank tensor corresponding to normal activity to be smooth on manifolds across each mode. These smoothness terms are added to LOSS as graph regularization across each mode, yielding GLOSS. While the low-rank structure in LOSS captures the global structure, GLOSS further reduces redundancy in the representation and incorporates the local geometric structure. This new framework exploits the joint structures and correlations across the modes to more accurately model and process ST signals.

• Using Local Structure for Efficiency: We reduce the computational cost incurred by the nuclear norm minimization in GLOSS and LOSS by utilizing the graph smoothness term to estimate normal activity. This allows an efficient algorithm that is still effective in extracting anomalies under some mild constraints on the data.

• Robustness to Missing and Heterogeneous Data: Our optimization framework is formulated in a flexible manner such that it solves a tensor completion problem in conjunction with anomaly detection. This framework takes into account the missing data and fits a structure to the observed part when estimating the underlying structure.
Thus, it provides robustness against missing data. The heterogeneity of traffic data is taken into account through the use of weighted nuclear norm minimization to emphasize the difference in low-rank structure across modes.

The rest of the chapter is organized as follows. In Section 3.2, we review some of the related work in spatiotemporal anomaly detection, in particular tensor based anomaly detection methods. In Section 3.3, we formulate the optimization problems of LOSS and GLOSS, propose ADMM based solutions and analyze the computational complexity of these solutions. In Section 3.4, we formulate the optimization problem of LOGSS, propose an ADMM based solution and analyze its computational complexity. In Section 3.5, we provide an analysis of the convergence of all proposed algorithms. In Section 3.6, we explain how the anomaly scores for each element in the data are generated. In Section 3.7, we describe the experimental settings with both synthetic and real data and compare the proposed methods with baseline anomaly detection methods as well as other tensor based methods. Finally, we conclude the chapter in Section 3.8 by discussing the proposed algorithms, experimental results and future work.

3.2 Related Work

Anomaly detection in spatiotemporal data is usually studied under three categories: point anomalies, trajectory anomalies and group anomalies. Point anomalies are defined as spatiotemporal outliers that break the natural ST autocorrelation structure of the normal points. Most ST point anomaly detection algorithms, such as ST-DBSCAN [100], assume homogeneity in neighborhood properties across space and time, which can be violated in the presence of ST heterogeneity. Trajectory anomalies are usually detected by computing pairwise similarities among trajectories and identifying trajectories that are spatially distant from the others [105]. Finally, group anomalies appear in ST data as spatially contiguous groups of locations that show anomalous values consistently for a short duration of time stamps. Most approaches for detecting group anomalies decompose the anomaly detection problem by first treating the spatial and temporal properties of the outliers independently, and then merging them in a post-processing step [58, 191]. However, as the number of sensors increases, scalability becomes a critical issue and these methods become unreliable. For these reasons, one natural approach to address ST group anomaly detection has been to use tensor decomposition.

Low-rank tensor decomposition and completion have been proposed as suitable approaches to anomaly detection in spatiotemporal data, as these methods are a natural extension of spectral anomaly detection techniques to multi-way data [59, 113, 68, 207, 39, 118, 117, 193]. These models project the original spatiotemporal data into a low-dimensional latent space, in which the normal activity is represented with better spatial and temporal resolution. The learned features, i.e., factor matrices or core tensors, are then used to detect anomalies by monitoring the reconstruction error at each time point [144, 143, 166, 193] or by applying well-known statistical tests to the extracted multivariate features [60, 207]. For example, Zhang et al. [207] proposed a tensor-based method to detect targets in hyperspectral imaging data with both spectral and spatial anomaly characteristics. Shi et al. [158] proposed an incremental tensor decomposition algorithm for online anomaly detection.
In [60], a hybrid model is constructed from a topology tensor and a flow tensor, and Tucker decomposition with an adjustable core size is used to detect anomalies. Xu et al. [193] proposed a sliding-window tensor factorization to detect anomalies. Wang et al. [185] extended the traditional Tucker decomposition in a probabilistic manner to detect abnormal activity behaviors.

Although these low-rank tensor models are powerful in identifying abnormal traffic activity, they have multiple shortcomings. First, they rely on well-known factorization models such as Tucker [60, 207, 193] and CP [117]. These methods obtain a low-dimensional projection of the data without taking the particular structure of anomalies, i.e., sparsity, into account. The proposed method incorporates the sparsity of anomalies into the model by assuming that the anomalies lie in the sparse part of the tensor rather than in the low-rank part. Second, prior work on higher order robust principal component analysis (HoRPCA) within the framework of anomaly detection [113, 68] does not ensure temporal smoothness of the detected anomalies. However, in urban data, anomalies typically last for some time and are not instantaneous. Recently, temporally regularized matrix factorization models [149, 50, 200] and their extensions to tensors [199, 179, 184, 88] have been implemented to capture the temporal dynamics. While the smoothness term included in these papers is very similar to ours, it is enforced on the factor matrices corresponding to the temporal mode, whereas in our approach temporal regularization is applied directly to the sparse tensor.

Finally, the existing tensor based anomaly detection methods are limited to modeling anomalies that lie in linear subspaces. The proposed method utilizes a simultaneously structured model [84] for robust tensor decomposition. This model structures the low-rank tensor by the proximities within each mode and forces the solution to be smooth on manifolds constructed from each mode. This goal is achieved by adding graph regularization across each mode of the tensor. Although graph regularized tensor decomposition has been used before [115, 135, 82, 148, 13, 180, 192, 51], it is usually applied with respect to a single mode of the tensor and it has not been utilized for anomaly detection in urban traffic data. Yet, for many tensors, correlations exist across all modalities. Several recent papers [67, 84, 156, 135, 155, 128, 127] exploit this coupled relationship to co-organize matrices and infer underlying row and column embeddings. The methods GLOSS and LOGSS in this chapter are direct extensions of this work to tensors, where we consider graph regularization across all modes to capture the coupled correlation structure in spatiotemporal data.

3.3 Robust Low-Rank Tensor Decomposition for Anomaly Detection

In the following discussion, we model spatiotemporal data as a four-mode tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times I_3 \times I_4}$. The first mode corresponds to the hours in a day. The second mode corresponds to the days of a week, as urban traffic activity shows highly similar patterns on the same days of different weeks. The third mode corresponds to the different weeks, and the last mode corresponds to the spatial locations, such as stations for metro data, sensors for traffic data or zones for other urban data. A minimal sketch of how such a tensor can be assembled from timestamped records is given below.
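As an illustration, the sketch below aggregates timestamped arrival records into a tensor with this (hour, day of week, week, location) structure. The field names and the use of pandas are assumptions made purely for illustration; they are not part of the datasets described later in this chapter.

```python
import numpy as np
import pandas as pd

def build_traffic_tensor(df, n_zones=81):
    """Aggregate per-event records into a (hour, weekday, week, zone) count tensor.

    `df` is assumed to have a datetime column 'timestamp' and an integer
    column 'zone' in [0, n_zones) -- hypothetical field names for illustration.
    """
    Y = np.zeros((24, 7, 52, n_zones))
    hours = df['timestamp'].dt.hour.to_numpy()
    days = df['timestamp'].dt.dayofweek.to_numpy()                      # 0 = Monday
    weeks = (df['timestamp'].dt.isocalendar().week.astype(int).to_numpy() - 1).clip(0, 51)
    zones = df['zone'].to_numpy()
    np.add.at(Y, (hours, days, weeks, zones), 1)                        # count arrivals per cell
    return Y
```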
3.3.1 Problem Statement

Assuming that anomalies are rare events, our goal is to decompose $\mathcal{Y}$ into a low-rank part, $\mathcal{L}$, that corresponds to normal activity and a sparse part, $\mathcal{S}$, that corresponds to the anomalies. This model relies on the assumption that normal activity can be embedded into a lower dimensional subspace while anomalies are outliers. We also take into account the existence of missing elements in the data, i.e., the observed data is $P_\Omega[\mathcal{Y}]$. This goal can be formulated as:

$$\min_{\mathcal{L},\mathcal{S}} \; \|\mathcal{L}\|_* + \lambda \|\mathcal{S}\|_1, \quad \text{s.t.} \; P_\Omega[\mathcal{L}+\mathcal{S}] = P_\Omega[\mathcal{Y}], \qquad (3.1)$$

where $\lambda$ is the regularization parameter for sparsity.

Since urban anomalies tend to be temporally continuous, i.e., smooth in the first mode, this assumption can be incorporated into the above formulation as:

$$\min_{\mathcal{L},\mathcal{S}} \; \|\mathcal{L}\|_* + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{S} \times_1 \Delta\|_1, \quad \text{s.t.} \; P_\Omega[\mathcal{L}+\mathcal{S}] = P_\Omega[\mathcal{Y}], \qquad (3.2)$$

where $\gamma$ is the regularization parameter for temporal smoothness and $\|\mathcal{S} \times_1 \Delta\|_1$ quantifies the sparsity of the projection of the tensor $\mathcal{S}$ onto the discrete-time differentiation operator along the first mode, where $\Delta$ is defined as:

$$\Delta = \begin{bmatrix} 1 & -1 & 0 & \dots & 0 \\ 0 & 1 & -1 & \dots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \dots & 1 & -1 \\ -1 & 0 & \dots & 0 & 1 \end{bmatrix}. \qquad (3.3)$$

It is common to incorporate the relationships among data points as auxiliary information, in addition to the low-rank assumption, to improve the quality of tensor decomposition and to capture the local data structure [132, 164]. This approach, also known as manifold learning, is an effective dimensionality reduction technique leveraging geometric information. The intuitive idea behind manifold learning is that if two objects are close in the intrinsic geometry of the data manifold, they should be close to each other after dimensionality reduction. For tensors, this usually reduces to forcing two similar objects to behave similarly in the projected low-dimensional space through a graph Laplacian term. In this section, since we are trying to learn anomalies from a single tensor, we do not have tensor samples and their projections. Instead, we preserve the relationships between each mode unfolding of the tensor data, as each mode corresponds to a different attribute of the data. This goal can be achieved through graph regularization across each mode as follows:

$$\min_{\mathcal{L},\mathcal{S}} \; \|\mathcal{L}\|_* + \frac{\theta}{2} \sum_{n=1}^{4} \sum_{i=1}^{I_n} \sum_{\substack{i'=1 \\ i'\neq i}}^{I_n} \|\mathbf{L}_{(n),i} - \mathbf{L}_{(n),i'}\|_F^2 \, w^n_{ii'} + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{S} \times_1 \Delta\|_1, \quad \text{s.t.} \; P_\Omega[\mathcal{L}+\mathcal{S}] = P_\Omega[\mathcal{Y}],$$

where $\theta$ is the weight parameter for graph regularization and $\mathbf{L}_{(n),i}$ denotes the $i$th row of the mode-$n$ unfolding $\mathbf{L}_{(n)}$. The above optimization problem can equivalently be rewritten using the trace norm to represent the graph regularization and the total variation (TV) norm across the temporal mode to describe temporal smoothness as:

$$\min_{\mathcal{L},\mathcal{S}} \; \|\mathcal{L}\|_* + \theta \sum_{n=1}^{4} \mathrm{tr}\!\left(\mathbf{L}_{(n)}^\top \Phi_n \mathbf{L}_{(n)}\right) + \lambda \|\mathcal{S}\|_1 + \gamma \sum_{i_2,i_3,i_4} \|\mathbf{s}_{i_2,i_3,i_4}\|_{\mathrm{TV}}, \quad \text{s.t.} \; P_\Omega[\mathcal{L}+\mathcal{S}] = P_\Omega[\mathcal{Y}], \qquad (3.4)$$

where $\|\cdot\|_{\mathrm{TV}}$ denotes the total variation norm. The solutions to the optimization problems (3.2) and (3.4) are referred to as LOw-rank plus temporally Smooth Sparse (LOSS) decomposition and Graph regularized LOw-rank plus temporally Smooth Sparse (GLOSS) decomposition, respectively.

3.3.2 Optimization

The proposed objective function is convex. In prior work, ADMM has been shown to be effective at solving similar optimization problems in an iterative fashion [70, 7]. Thus, we follow a similar approach for solving the optimization problem given in (3.4). A small sketch of the temporal difference operator $\Delta$ and the associated penalties in (3.2) is given below.
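Before deriving the updates, the following minimal numpy sketch builds the circular first-difference operator $\Delta$ of (3.3) and evaluates the sparsity and temporal smoothness penalties appearing in (3.2); the function names are illustrative.

```python
import numpy as np

def temporal_diff_operator(I1):
    """Circular first-difference matrix Delta of size I1 x I1, as in (3.3):
    row i computes x[i] - x[(i + 1) mod I1]."""
    return np.eye(I1) - np.roll(np.eye(I1), 1, axis=1)

def loss_penalties(S, lam, gamma):
    """l1 sparsity penalty plus temporal total-variation penalty on the mode-1 fibers of S."""
    D = temporal_diff_operator(S.shape[0])
    # mode-1 product S x_1 Delta: apply D along the first (hour-of-day) mode
    DS = np.tensordot(D, S, axes=([1], [0]))
    return lam * np.abs(S).sum() + gamma * np.abs(DS).sum()
```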
To separate the minimization of the TV norm, the $\ell_1$ norm and the graph regularization from each other, we introduce auxiliary variables $\mathcal{Z}$, $\mathcal{W}$, $\{\mathfrak{L}\} := \{\mathfrak{L}^1, \mathfrak{L}^2, \mathfrak{L}^3, \mathfrak{L}^4\}$ and $\{\mathfrak{G}\} := \{\mathfrak{G}^1, \mathfrak{G}^2, \mathfrak{G}^3, \mathfrak{G}^4\}$ such that the optimization problem becomes:

$$\min_{\mathcal{L},\{\mathfrak{L}\},\{\mathfrak{G}\},\mathcal{S},\mathcal{Z},\mathcal{W}} \; \sum_{n=1}^{4} \left( \psi_n \|\mathfrak{L}^n_{(n)}\|_* + \theta\, g(\mathfrak{G}^n, \Phi_n) \right) + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{Z}\|_1,$$
$$\text{s.t.} \; \mathcal{W} = \mathcal{S}, \; \mathcal{Z} = \mathcal{W} \times_1 \Delta, \; P_\Omega[\mathcal{L}+\mathcal{S}] = P_\Omega[\mathcal{Y}], \; \mathfrak{L}^n = \mathcal{L}, \; \mathfrak{G}^n = \mathcal{L}, \; n \in \{1,2,3,4\}, \qquad (3.5)$$

where $g(\mathfrak{G}^n, \Phi_n) = \mathrm{tr}\!\left(\mathfrak{G}^{n\,\top}_{(n)} \Phi_n \mathfrak{G}^n_{(n)}\right)$ is the graph regularization term for each auxiliary variable $\mathfrak{G}^n$. To solve the above optimization problem, we propose using ADMM with the partial augmented Lagrangian:

$$\sum_{n=1}^{4} \left( \psi_n \|\mathfrak{L}^n_{(n)}\|_* + \theta\, g(\mathfrak{G}^n, \Phi_n) \right) + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{Z}\|_1 + \frac{\beta_1}{2} \|P_\Omega[\mathcal{L}+\mathcal{S}-\mathcal{Y}-\Gamma_1]\|_F^2 + \frac{\beta_2}{2} \sum_{n=1}^{4} \|\mathfrak{L}^n - \mathcal{L} - \Gamma_2^n\|_F^2 + \frac{\beta_3}{2} \sum_{n=1}^{4} \|\mathcal{L} - \mathfrak{G}^n - \Gamma_3^n\|_F^2 + \frac{\beta_4}{2} \|\mathcal{W} \times_1 \Delta - \mathcal{Z} - \Gamma_4\|_F^2 + \frac{\beta_5}{2} \|\mathcal{S} - \mathcal{W} - \Gamma_5\|_F^2, \qquad (3.6)$$

where $\Gamma_1, \Gamma_2^n, \Gamma_3^n, \Gamma_4, \Gamma_5 \in \mathbb{R}^{I_1 \times I_2 \times I_3 \times I_4}$ are the Lagrange multipliers.

1. $\mathcal{L}$ update: The low-rank variable $\mathcal{L}$ can be updated using:

$$\mathcal{L}^{t+1} = \operatorname*{argmin}_{\mathcal{L}} \; \frac{\beta_1}{2} \|P_\Omega[\mathcal{L}+\mathcal{S}^t-\mathcal{Y}-\Gamma_1^t]\|_F^2 + \sum_{n=1}^{4} \left( \frac{\beta_2}{2} \|\mathfrak{L}^{n,t} - \mathcal{L} - \Gamma_2^{n,t}\|_F^2 + \frac{\beta_3}{2} \|\mathcal{L} - \mathfrak{G}^{n,t} - \Gamma_3^{n,t}\|_F^2 \right), \qquad (3.7)$$

which has the analytical solution:

$$P_\Omega[\mathcal{L}^{t+1}] = \frac{P_\Omega[\beta_1 \mathcal{T}_1 + \beta_2 \mathcal{T}_2 + \beta_3 \mathcal{T}_3]}{\beta_1 + 4(\beta_2+\beta_3)}, \qquad (3.8)$$
$$P_{\Omega^\perp}[\mathcal{L}^{t+1}] = \frac{P_{\Omega^\perp}[\beta_2 \mathcal{T}_2 + \beta_3 \mathcal{T}_3]}{4(\beta_2+\beta_3)}, \qquad (3.9)$$

where $\mathcal{T}_1 = \mathcal{Y} - \mathcal{S}^t + \Gamma_1^t$, $\mathcal{T}_2 = \sum_{n=1}^{4} (\mathfrak{L}^{n,t} - \Gamma_2^{n,t})$ and $\mathcal{T}_3 = \sum_{n=1}^{4} (\mathfrak{G}^{n,t} + \Gamma_3^{n,t})$.

2. $\mathfrak{L}^n$ update: The variables $\mathfrak{L}^n$ can be updated using:

$$\mathfrak{L}^{n,t+1} = \operatorname*{argmin}_{\mathfrak{L}^n} \; \psi_n \|\mathfrak{L}^n_{(n)}\|_* + \frac{\beta_2}{2} \|\mathfrak{L}^n - \mathcal{L}^{t+1} - \Gamma_2^{n,t}\|_F^2, \qquad (3.10)$$

which is solved by a soft thresholding operator on the singular values of $\left(\mathcal{L}^{t+1} + \Gamma_2^{n,t}\right)_{(n)}$ with a threshold of $\psi_n/\beta_2$.

3. $\mathfrak{G}^n$ update: The variables $\mathfrak{G}^n$ can be updated using:

$$\mathfrak{G}^{n,t+1} = \operatorname*{argmin}_{\mathfrak{G}^n} \; \theta\,\mathrm{tr}\!\left(\mathfrak{G}^{n\,\top}_{(n)} \Phi_n \mathfrak{G}^n_{(n)}\right) + \frac{\beta_3}{2} \|\mathcal{L}^{t+1} - \mathfrak{G}^n - \Gamma_3^{n,t}\|_F^2, \qquad (3.11)$$

which is solved by:

$$\mathfrak{G}^{n,t+1}_{(n)} = \beta_3\, G_{\mathrm{inv}} \left(\mathcal{L}^{t+1} - \Gamma_3^{n,t}\right)_{(n)}, \qquad (3.12)$$

where $G_{\mathrm{inv}} = (2\theta\Phi_n + \beta_3 \mathbf{I})^{-1}$ always exists and can be computed outside the loop for a faster update.

4. $\mathcal{S}$ update: The variable $\mathcal{S}$ can be updated using:

$$\mathcal{S}^{t+1} = \operatorname*{argmin}_{\mathcal{S}} \; \lambda \|\mathcal{S}\|_1 + \frac{\beta_1}{2} \|P_\Omega[\mathcal{S}+\mathcal{L}^{t+1}-\mathcal{Y}-\Gamma_1^t]\|_F^2 + \frac{\beta_5}{2} \|\mathcal{S} - \mathcal{W}^t - \Gamma_5^t\|_F^2, \qquad (3.13)$$

where the Frobenius norm terms can be combined and the expression can be simplified into:

$$P_\Omega[\mathcal{S}^{t+1}] = \operatorname*{argmin}_{P_\Omega[\mathcal{S}]} \; \|P_\Omega[\mathcal{S}]\|_1 + \frac{\beta_1+\beta_5}{2\lambda} \|P_\Omega[\mathcal{S}-\mathcal{T}_s]\|_F^2, \qquad (3.14)$$
$$P_{\Omega^\perp}[\mathcal{S}^{t+1}] = \operatorname*{argmin}_{P_{\Omega^\perp}[\mathcal{S}]} \; \|P_{\Omega^\perp}[\mathcal{S}]\|_1 + \frac{\beta_5}{2\lambda} \|P_{\Omega^\perp}[\mathcal{S}-\mathcal{T}_s]\|_F^2, \qquad (3.15)$$

where

$$P_\Omega[\mathcal{T}_s] = P_\Omega\!\left[\frac{\beta_1(\mathcal{Y}-\mathcal{L}^{t+1}+\Gamma_1^t) + \beta_5(\mathcal{W}^t+\Gamma_5^t)}{\beta_1+\beta_5}\right], \qquad P_{\Omega^\perp}[\mathcal{T}_s] = P_{\Omega^\perp}[\mathcal{W}^t+\Gamma_5^t].$$

The above is solved by setting $P_\Omega[\mathcal{S}^{t+1}] = \mathrm{soft\_thresh}\!\left(P_\Omega[\mathcal{T}_s], \tfrac{\lambda}{\beta_1+\beta_5}\right)$ and $P_{\Omega^\perp}[\mathcal{S}^{t+1}] = \mathrm{soft\_thresh}\!\left(P_{\Omega^\perp}[\mathcal{T}_s], \tfrac{\lambda}{\beta_5}\right)$, where $\mathrm{soft\_thresh}(\mathbf{a},\phi) = \mathrm{sign}(\mathbf{a}) \odot \max(|\mathbf{a}|-\phi, 0)$ and $\odot$ is the elementwise (Hadamard) product.

5. $\mathcal{W}$ update: The auxiliary variable $\mathcal{W}$ can be updated using:

$$\mathcal{W}^{t+1} = \operatorname*{argmin}_{\mathcal{W}} \; \frac{\beta_4}{2} \|\mathcal{W}\times_1\Delta - \mathcal{Z}^t - \Gamma_4^t\|_F^2 + \frac{\beta_5}{2} \|\mathcal{S}^{t+1} - \mathcal{W} - \Gamma_5^t\|_F^2, \qquad (3.16)$$

which is solved analytically by taking the derivative of the expression above and setting it to zero, resulting in:

$$\mathcal{W}^{t+1}_{(1)} = W_{\mathrm{inv}}\!\left[\beta_5(\mathcal{S}^{t+1}-\Gamma_5^t)_{(1)} + \beta_4\Delta^\top(\Gamma_4^t+\mathcal{Z}^t)_{(1)}\right], \qquad (3.17)$$

where $W_{\mathrm{inv}} = (\beta_5\mathbf{I} + \beta_4\Delta^\top\Delta)^{-1}$ always exists and can be computed outside the loop for a faster update.

6. $\mathcal{Z}$ update: The auxiliary variable $\mathcal{Z}$ can be updated using:

$$\mathcal{Z}^{t+1} = \operatorname*{argmin}_{\mathcal{Z}} \; \gamma\|\mathcal{Z}\|_1 + \frac{\beta_4}{2}\|\mathcal{W}^{t+1}\times_1\Delta - \mathcal{Z} - \Gamma_4^t\|_F^2, \qquad (3.18)$$

which is solved by $\mathrm{soft\_thresh}(\mathcal{W}^{t+1}\times_1\Delta - \Gamma_4^t, \gamma/\beta_4)$.

7. Dual updates: Finally, the dual variables $\Gamma_1, \Gamma_2^n, \Gamma_3^n, \Gamma_4, \Gamma_5$ are updated using:

$$\Gamma_1^{t+1} = \Gamma_1^t - P_\Omega[\mathcal{L}^{t+1}+\mathcal{S}^{t+1}-\mathcal{Y}], \qquad (3.19)$$
$$\Gamma_2^{n,t+1} = \Gamma_2^{n,t} - (\mathfrak{L}^{n,t+1}-\mathcal{L}^{t+1}), \qquad (3.20)$$
$$\Gamma_3^{n,t+1} = \Gamma_3^{n,t} - (\mathcal{L}^{t+1}-\mathfrak{G}^{n,t+1}), \qquad (3.21)$$
$$\Gamma_4^{t+1} = \Gamma_4^t - (\mathcal{W}^{t+1}\times_1\Delta - \mathcal{Z}^{t+1}), \qquad (3.22)$$
$$\Gamma_5^{t+1} = \Gamma_5^t - (\mathcal{S}^{t+1}-\mathcal{W}^{t+1}). \qquad (3.23)$$

A minimal sketch of the soft-thresholding and singular value thresholding operators used in steps 2, 4 and 6 is given below.
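The sketch below shows the elementwise soft-thresholding operator and the singular value thresholding used in the $\mathfrak{L}^n$ update, together with generic mode-$n$ unfolding and folding helpers. These are standard operations written here for intuition only; the names are illustrative.

```python
import numpy as np

def soft_thresh(a, phi):
    """Elementwise soft thresholding: sign(a) * max(|a| - phi, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - phi, 0.0)

def svt(M, tau):
    """Singular value thresholding of a matrix M with threshold tau
    (the proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(soft_thresh(s, tau)) @ Vt

def unfold(T, mode):
    """Mode-n unfolding of a tensor (mode is 0-indexed here)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of `unfold` for a tensor of the given shape."""
    full_shape = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full_shape), 0, mode)
```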
The pseudocode for the proposed algorithm, GLOSS, is given in Algorithm 3.1. The optimization for LOSS can be carried out similarly, without the updates on graph regularization and the related variables $\{\mathfrak{G}\}$, $\{\Gamma_3\}$.

Algorithm 3.1: GLOSS
Input: $\mathcal{Y} \in \mathbb{R}^{I_1\times I_2\times I_3\times I_4}$, $\Omega$, $\Phi_n$, parameters $\lambda, \gamma, \theta, \{\psi\}, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5$, max_iter.
Output: $\mathcal{L}$: low-rank tensor; $\mathcal{S}$: sparse tensor.
Initialize $\mathcal{S}^0 = \mathcal{W}^0 = \mathcal{Z}^0 = 0$, $\mathfrak{L}^{n,0} = \mathfrak{G}^{n,0} = 0$, $\Gamma_1^0 = \Gamma_2^{n,0} = \Gamma_3^{n,0} = \Gamma_4^0 = \Gamma_5^0 = 0$, $\forall n \in \{1,\dots,4\}$.
$W_{\mathrm{inv}} \leftarrow (\beta_5 \mathbf{I} + \beta_4 \Delta^\top \Delta)^{-1}$
$G_{\mathrm{inv}}^n \leftarrow (2\theta\Phi_n + \beta_3 \mathbf{I})^{-1}$, $n = 1,\dots,4$
for $t = 0$ to max_iter do
    $\mathcal{T}_1 \leftarrow \mathcal{Y} - \mathcal{S}^t + \Gamma_1^t$; $\mathcal{T}_2 \leftarrow \sum_{n=1}^4 (\mathfrak{L}^{n,t} - \Gamma_2^{n,t})$; $\mathcal{T}_3 \leftarrow \sum_{n=1}^4 (\mathfrak{G}^{n,t} + \Gamma_3^{n,t})$
    $P_\Omega[\mathcal{L}^{t+1}] \leftarrow P_\Omega[\beta_1\mathcal{T}_1 + \beta_2\mathcal{T}_2 + \beta_3\mathcal{T}_3]/(\beta_1 + 4(\beta_2+\beta_3))$
    $P_{\Omega^\perp}[\mathcal{L}^{t+1}] \leftarrow P_{\Omega^\perp}[\beta_2\mathcal{T}_2 + \beta_3\mathcal{T}_3]/(4(\beta_2+\beta_3))$
    for $n = 1$ to 4 do
        $[U, \Sigma, V] \leftarrow \mathrm{SVD}\!\left((\mathcal{L}^{t+1} + \Gamma_2^{n,t})_{(n)}\right)$
        $\hat{\Sigma}_{i,i} \leftarrow \max(\Sigma_{i,i} - \psi_n/\beta_2, \, 0)$, $\forall i \in \{1,\dots,I_n\}$
        $\mathfrak{L}^{n,t+1}_{(n)} \leftarrow U\hat{\Sigma}V^\top$
        $\mathfrak{G}^{n,t+1}_{(n)} \leftarrow \beta_3 G_{\mathrm{inv}}^n (\mathcal{L}^{t+1} - \Gamma_3^{n,t})_{(n)}$
    end for
    $P_\Omega[\mathcal{T}_s] \leftarrow P_\Omega\!\left[(\beta_1(\mathcal{Y}-\mathcal{L}^{t+1}+\Gamma_1^t) + \beta_5(\mathcal{W}^t+\Gamma_5^t))/(\beta_1+\beta_5)\right]$
    $P_{\Omega^\perp}[\mathcal{T}_s] \leftarrow P_{\Omega^\perp}[\mathcal{W}^t+\Gamma_5^t]$
    $P_\Omega[\mathcal{S}^{t+1}] \leftarrow \mathrm{soft\_thresh}(P_\Omega[\mathcal{T}_s], \lambda/(\beta_1+\beta_5))$
    $P_{\Omega^\perp}[\mathcal{S}^{t+1}] \leftarrow \mathrm{soft\_thresh}(P_{\Omega^\perp}[\mathcal{T}_s], \lambda/\beta_5)$
    $\mathcal{W}^{t+1}_{(1)} \leftarrow W_{\mathrm{inv}}\!\left[\beta_5(\mathcal{S}^{t+1}-\Gamma_5^t)_{(1)} + \beta_4\Delta^\top(\Gamma_4^t+\mathcal{Z}^t)_{(1)}\right]$
    $\mathcal{Z}^{t+1} \leftarrow \mathrm{soft\_thresh}(\mathcal{W}^{t+1}\times_1\Delta - \Gamma_4^t, \gamma/\beta_4)$
    $\Gamma_1^{t+1} \leftarrow \Gamma_1^t - P_\Omega[\mathcal{L}^{t+1}+\mathcal{S}^{t+1}-\mathcal{Y}]$
    $\Gamma_4^{t+1} \leftarrow \Gamma_4^t - (\mathcal{W}^{t+1}\times_1\Delta - \mathcal{Z}^{t+1})$
    $\Gamma_5^{t+1} \leftarrow \Gamma_5^t - (\mathcal{S}^{t+1}-\mathcal{W}^{t+1})$
    for $n = 1$ to 4 do
        $\Gamma_2^{n,t+1} \leftarrow \Gamma_2^{n,t} - (\mathfrak{L}^{n,t+1}-\mathcal{L}^{t+1})$
        $\Gamma_3^{n,t+1} \leftarrow \Gamma_3^{n,t} - (\mathcal{L}^{t+1}-\mathfrak{G}^{n,t+1})$
    end for
end for

3.3.3 Computational Complexity

Let $\mathcal{Y}$ be an $N$-mode tensor with dimensions $I_1 = I_2 = \dots = I_N = I$. The computational complexity of each iteration of ADMM is computed as follows:

1. The update of $\mathcal{L}$ only involves element-wise operations; thus its contribution to the computational complexity is not significant.
2. The computational complexity of updating each $\mathfrak{L}^n$ is $O(I^{2N-1})$. Thus, the total computational complexity is $O(NI^{2N-1})$. However, it is possible to reduce the effect of this cost by parallelizing the updates across modes, in which case the cost per worker is $O(I^{2N-1})$, i.e., nearly quadratic in the number of tensor elements.
3. The update of each $\mathfrak{G}^n$ requires the computation of a matrix inverse with complexity $O(I^3)$ and a matrix multiplication with complexity $O(I^{N+1})$. As mentioned before, the inverse always exists and can be computed outside the loop. Similar to the updates of $\mathfrak{L}^n$, the total complexity is $O(NI^{N+1})$, which can be reduced by parallelizing across modes.
4. The update of $\mathcal{S}$ requires a soft thresholding, which has linear complexity, i.e., $O(I^N)$.
5. The computational complexity of updating $\mathcal{W}$ is governed by a matrix multiplication, resulting in $O(I^{N+1})$ complexity.
6. The update of $\mathcal{Z}$ consists of a matrix product followed by soft thresholding, which results in a total complexity of $O((I+1)I^N)$.
7. The updates of the dual variables do not require any additional multiplication operations; thus their computational complexity is negligible.

It can be concluded that the complexity of each loop is governed by the updates of $\mathfrak{L}^n$. Therefore, the total computational complexity of the algorithm is $O(\text{max\_iter}\, N I^{2N-1})$. Since the complexity is driven by nuclear norm minimization, in the following section we propose another method which utilizes graph total variation regularization for the low-rank approximation instead of the nuclear norm.

3.4 Low-rank On Graphs Plus Temporally Smooth Sparse Decomposition

As mentioned in Section 1.3.1, we approximate the low-rank tensor, $\mathcal{L}$, through $N$ graph total variation terms corresponding to each mode, similar to FRPCAG in (1.14).
To this end, the first $J_n$ eigenvectors of $\Phi_n$, denoted $\hat{P}_n$ and corresponding to the $J_n$ lowest eigenvalues, are used to quantify the total variation of the low-rank tensor across mode $n$ with respect to its corresponding similarity graph. As these first $J_n$ eigenvectors capture the low-frequency information of the signal, they can capture the normal activity in the data. Thus, the optimization problem can be written as:

$$\min_{\mathcal{L},\mathcal{S}} \; \theta \sum_{n=1}^{N} \mathrm{tr}\!\left(\mathbf{L}_{(n)}^\top \hat{\Phi}_n \mathbf{L}_{(n)}\right) + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{S}\times_1\Delta\|_1, \quad \text{s.t.} \; P_\Omega[\mathcal{Y}] = P_\Omega[\mathcal{L}+\mathcal{S}], \qquad (3.24)$$

where $\hat{\Phi}_n = \hat{P}_n \hat{\Lambda}_n \hat{P}_n^\top$ and $\hat{\Lambda}_n \in \mathbb{R}^{J_n\times J_n}$ is the leading principal submatrix of $\Lambda_n$. If we define the projections of each mode-$n$ unfolding of $\mathcal{L}$ onto the graph eigenvectors (the low-frequency graph Fourier basis) as $\mathbf{G}^n_{(n)} = \hat{P}_n^\top \mathbf{L}_{(n)}$, then (3.24) can be rewritten as:

$$\min_{\mathcal{L},\{\mathcal{G}\},\mathcal{S}} \; \theta \sum_{n=1}^{N} \mathrm{tr}\!\left(\mathbf{G}^{n\,\top}_{(n)} \hat{\Lambda}_n \mathbf{G}^n_{(n)}\right) + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{S}\times_1\Delta\|_1, \quad \text{s.t.} \; P_\Omega[\mathcal{Y}] = P_\Omega[\mathcal{L}+\mathcal{S}], \; \mathbf{G}^n_{(n)} = \hat{P}_n^\top \mathbf{L}_{(n)}, \qquad (3.25)$$

where $\{\mathcal{G}\} := \{\mathcal{G}^1, \dots, \mathcal{G}^N\}$. The solution to (3.25) will be called LOw-rank on Graphs plus temporally Smooth Sparse decomposition (LOGSS).

3.4.1 Optimization

The optimization problem is solved using ADMM. We introduce auxiliary variables $\mathcal{W}$ and $\mathcal{Z}$, similar to LOSS, to separate the sparsity and temporal smoothness regularizations. The problem is then rewritten as:

$$\min_{\mathcal{L},\{\mathcal{G}\},\mathcal{S},\mathcal{W},\mathcal{Z}} \; \theta \sum_{n=1}^{N} \mathrm{tr}\!\left(\mathbf{G}^{n\,\top}_{(n)} \hat{\Lambda}_n \mathbf{G}^n_{(n)}\right) + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{Z}\|_1, \quad \text{s.t.} \; P_\Omega[\mathcal{Y}] = P_\Omega[\mathcal{L}+\mathcal{S}], \; \mathbf{G}^n_{(n)} = \hat{P}_n^\top \mathbf{L}_{(n)}, \; \mathcal{S} = \mathcal{W}, \; \mathcal{Z} = \mathcal{W}\times_1\Delta. \qquad (3.26)$$

The corresponding augmented Lagrangian is given by:

$$\sum_{n=1}^{N} \theta\,\mathrm{tr}\!\left(\mathbf{G}^{n\,\top}_{(n)} \hat{\Lambda}_n \mathbf{G}^n_{(n)}\right) + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{Z}\|_1 + \frac{\beta_1}{2}\|P_\Omega[\mathcal{L}+\mathcal{S}-\mathcal{Y}-\Gamma_1]\|_F^2 + \frac{\beta_2}{2}\|\mathcal{W}\times_1\Delta - \mathcal{Z} - \Gamma_2\|_F^2 + \frac{\beta_3}{2}\|\mathcal{S}-\mathcal{W}-\Gamma_3\|_F^2 + \frac{\beta_4}{2}\sum_{n=1}^{N}\|\mathcal{L} - \mathcal{G}^n\times_n\hat{P}_n - \Gamma_4^n\|_F^2, \qquad (3.27)$$

where $\Gamma_1, \Gamma_2, \Gamma_3, \Gamma_4^n \in \mathbb{R}^{I_1\times I_2\times I_3\times I_4}$ are the Lagrange multipliers. Using (3.27), each variable can be updated alternately.

1. $\mathcal{L}$ update: The update of the low-rank variable $\mathcal{L}$ is given by:

$$P_\Omega[\mathcal{L}^{t+1}] = P_\Omega[\beta_1\mathcal{T}_1 + \beta_4\mathcal{T}_2]/(\beta_1+4\beta_4), \qquad P_{\Omega^\perp}[\mathcal{L}^{t+1}] = P_{\Omega^\perp}[\mathcal{T}_2]/4, \qquad (3.28)$$

where $\mathcal{T}_1 = \mathcal{Y}-\mathcal{S}^t+\Gamma_1^t$ and $\mathcal{T}_2 = \sum_{n=1}^{N}(\mathcal{G}^{n,t}\times_n\hat{P}_n + \Gamma_4^{n,t})$.

2. $\mathcal{G}^n$ update: The variables $\mathcal{G}^n$ can be updated using:

$$\mathcal{G}^{n,t+1}_{(n)} = \left(2\tfrac{\theta}{\beta_4}\hat{\Lambda}_n + \mathbf{I}\right)^{-1}\hat{P}_n^\top\left(\mathcal{L}^{t+1}-\Gamma_4^{n,t}\right)_{(n)}, \qquad (3.29)$$

where $\mathbf{I}\in\mathbb{R}^{J_n\times J_n}$ is an identity matrix.

3. $\mathcal{S}$ update: The variable $\mathcal{S}$ can be updated using:

$$P_\Omega[\mathcal{S}^{t+1}] = \mathcal{T}_{\frac{\lambda}{\beta_1+\beta_3}}\!\left(P_\Omega\!\left[\frac{\beta_1\mathcal{T}_3+\beta_3\mathcal{T}_4}{\beta_1+\beta_3}\right]\right), \qquad P_{\Omega^\perp}[\mathcal{S}^{t+1}] = \mathcal{T}_{\frac{\lambda}{\beta_3}}\!\left(P_{\Omega^\perp}[\mathcal{T}_4]\right), \qquad (3.30)$$

where $\mathcal{T}_3 = \mathcal{Y}-\mathcal{L}^{t+1}+\Gamma_1^t$, $\mathcal{T}_4 = \mathcal{W}^t+\Gamma_3^t$, and $\mathcal{T}_\phi(\mathbf{a}) = \mathrm{sign}(\mathbf{a})\odot\max(|\mathbf{a}|-\phi,0)$, with $\odot$ the Hadamard product.

4. $\mathcal{W}$ update: The auxiliary variable $\mathcal{W}$ can be updated using:

$$\mathcal{W}^{t+1}_{(1)} = W_{\mathrm{inv}}\!\left[\beta_3(\mathcal{S}^{t+1}-\Gamma_3^t)_{(1)} + \beta_2\Delta^\top(\Gamma_2^t+\mathcal{Z}^t)_{(1)}\right], \qquad (3.31)$$

where $W_{\mathrm{inv}} = (\beta_3\mathbf{I}+\beta_2\Delta^\top\Delta)^{-1}$ always exists and can be computed outside the loop for a faster update.

5. $\mathcal{Z}$ update: The auxiliary variable $\mathcal{Z}$ can be updated using:

$$\mathcal{Z}^{t+1} = \operatorname*{argmin}_{\mathcal{Z}} \; \gamma\|\mathcal{Z}\|_1 + \frac{\beta_2}{2}\|\mathcal{W}^{t+1}\times_1\Delta - \mathcal{Z} - \Gamma_2^t\|_F^2, \qquad (3.32)$$

which is solved by $\mathcal{T}_{\gamma/\beta_2}(\mathcal{W}^{t+1}\times_1\Delta - \Gamma_2^t)$.

6. Dual updates: Finally, the dual variables $\Gamma_1, \Gamma_2, \Gamma_3, \Gamma_4^n$ are updated using:

$$\Gamma_1^{t+1} = \Gamma_1^t - P_\Omega[\mathcal{L}^{t+1}+\mathcal{S}^{t+1}-\mathcal{Y}], \qquad (3.33)$$
$$\Gamma_2^{t+1} = \Gamma_2^t - (\mathcal{W}^{t+1}\times_1\Delta - \mathcal{Z}^{t+1}), \qquad (3.34)$$
$$\Gamma_3^{t+1} = \Gamma_3^t - (\mathcal{S}^{t+1}-\mathcal{W}^{t+1}), \qquad (3.35)$$
$$\Gamma_4^{n,t+1} = \Gamma_4^{n,t} - (\mathcal{L}^{t+1} - \mathcal{G}^{n,t+1}\times_n\hat{P}_n). \qquad (3.36)$$

The pseudocode for the optimization is given in Algorithm 3.2.

Algorithm 3.2: LOGSS
Input: $\mathcal{Y}$, $\Omega$, $\Phi_n$, parameters $\theta, \lambda, \gamma, \beta_1, \beta_2, \beta_3, \beta_4$, max_iter.
Output: $\mathcal{L}$: low-rank tensor; $\mathcal{S}$: sparse tensor.
Initialize $\mathcal{S}^0 = \mathcal{W}^0 = \mathcal{Z}^0 = 0$, $\mathcal{G}^{n,0} = 0$, $\Gamma_1^0 = \Gamma_2^0 = \Gamma_3^0 = 0$, $\Gamma_4^{n,0} = 0$, $\forall n \in \{1,\dots,4\}$; $W_{\mathrm{inv}} = (\beta_3\mathbf{I}+\beta_2\Delta^\top\Delta)^{-1}$.
for $t = 1$ to max_iter do
    Update $\mathcal{L}$ using (3.28).
    Update the $\mathcal{G}^n$ using (3.29).
    Update $\mathcal{S}$ using (3.30).
    Update $\mathcal{W}$ using (3.31).
    Update $\mathcal{Z}$ using (3.32).
    Update the Lagrange multipliers using (3.33), (3.34), (3.35) and (3.36).
end for

3.4.2 Computational Complexity of LOGSS

Assume $I_1 = I_2 = \dots = I_N = I$. The complexity of the proposed algorithm is dominated by matrix multiplications, which occur in the updates of $\mathcal{L}$, $\mathcal{W}$, $\mathcal{Z}$ and $\Gamma_4^n$. The updates of $\mathcal{G}^n$ require only inexpensive multiplications since the matrices $\hat{\Lambda}_n$ are diagonal. The computational complexities of the matrix multiplications are $O(NI^N)$ for the update of $\mathcal{L}$ and $O(I^N)$ for the updates of $\mathcal{W}$, $\mathcal{Z}$ and $\Gamma_4^n$. Since the updates of $\mathcal{L}$ and $\Gamma_4^n$ can be parallelized, the complexity of the algorithm is $O(\text{max\_iter}\, I^N)$, hence linear in the number of elements. Thus, by using graph total variation regularization instead of the nuclear norm, we reduce the complexity of LOSS from quadratic to linear.

3.5 Convergence

In this subsection, we analyze the convergence of the proposed algorithms. First, we show that the proposed optimization problem (3.4) can be written as a two-block ADMM. In previous work, linear and global convergence of ADMM was proven for two-block systems [52] with no dependence on the hyperparameters. We use this result to derive the convergence of GLOSS. Convergence of LOSS and LOGSS follows as they have similar objective functions, and similar operations can be applied to rewrite their objectives as two-block ADMM.

In the following discussion, we will assume that there is no missing data, i.e., $P_\Omega[\mathcal{Y}] = \mathcal{Y}$, to simplify the notation. Let $h(\mathcal{S}) = \lambda\|\mathcal{S}\|_1$, $j(\mathcal{Z}) = \gamma\|\mathcal{Z}\|_1$, $f(\{\mathfrak{L}\}) = \sum_{n=1}^{4}\psi_n\|\mathfrak{L}^n_{(n)}\|_*$ and $g(\{\mathfrak{G}\}) = \sum_{n=1}^{4}\mathrm{tr}\!\left(\mathfrak{G}^{n\,\top}_{(n)}\Phi_n\mathfrak{G}^n_{(n)}\right)$. Then (3.5) can be rewritten as:

$$\min \; h(\mathcal{S}) + j(\mathcal{Z}) + f(\{\mathfrak{L}\}) + \theta g(\{\mathfrak{G}\}), \quad \text{s.t.} \; A_1\mathbf{L}_{(1)} + A_2\mathbf{S}_{(1)} + A_3\,\mathrm{cat}_1(\{\mathfrak{L}\}) + A_4\,\mathrm{cat}_1(\{\mathfrak{G}\}) + A_5\mathbf{W}_{(1)} + A_6\mathbf{Z}_{(1)} = \mathrm{cat}_1(\{\mathcal{Y},0,\dots,0\}), \qquad (3.37)$$

where $A_1,\dots,A_6$ are block matrices whose blocks are identity matrices, zero matrices and the temporal difference operator $\Delta$, arranged so that each block row of (3.37) encodes one of the constraints of (3.5).

From this reformulation, it is easy to see that $A_1^\top A_5 = 0$ and that $A_2$, $A_3$, $A_4$ and $A_6$ are all orthogonal to each other, i.e., $A_i^\top A_j = 0$ for $i, j \in \{2,3,4,6\}$ and $i \neq j$. In this manner, the optimization problem reduces to a special case of two-block ADMM as follows. Define the variables $V_1 = [\mathbf{L}_{(1)}, \mathbf{W}_{(1)}]$ and $V_2 = [\mathbf{S}_{(1)}, \mathrm{cat}_1(\{\mathfrak{L}\}), \mathrm{cat}_1(\{\mathfrak{G}\}), \mathbf{Z}_{(1)}]$, and the matrices $B_1 = [A_1, A_5]$ and $B_2 = [A_2, A_3, A_4, A_6]$. When we create the variable $V_1$, the order of updating $\mathcal{S}$ and $\mathcal{W}$ needs to change in the new formulation, as $A_2$ and $A_5$ are not orthogonal. Updates of $\{\mathfrak{L}\}$ and $\{\mathfrak{G}\}$ have no effect on the update of $\mathcal{W}$, but this is not true for $\mathcal{S}$, so updating $\mathcal{W}$ before $\mathcal{S}$ might affect the solution.
However, it was proven in [194] that the change in order gives the equivalent solution if either one of the functions of the variables S and W is affine. In our formulation, this is true as the function corresponding to W is a constant. Thus, the problem reduces to the two-block form: min 𝑓1 (𝑉1 ) + 𝑓2 (𝑉2 ), s.t. 𝐵1𝑉1 + 𝐵2𝑉2 = 𝐶, (3.38) 𝑉1 ,𝑉2 where 𝑓1 (𝑉1 ) = 0 and 𝑓2 (𝑉2 ) = ℎ(S)+ 𝑗 (Z)+ 𝑓 ({𝔏})+𝜃𝑔({𝔊}) are both convex and 𝐶 = cat1 ({Y, 0, . . . , 0}). It can easily be shown using Kronecker products and vectorizations that the objective functions (3.2), (3.25) can also be converted into a two-block form. Thus, LOSS and LOGSS also converge using the above results. 3.6 Anomaly Scoring The methods proposed in this chapter focus on extracting spatiotemporal features for anomaly detection. After extracting the features, i.e. the sparse part, a baseline anomaly detector can be applied to obtain an anomaly score. In this chapter, we evaluated three anomaly detection methods: Elliptic Envelope (EE) [150], Local Outlier Factor (LOF) [23] and One Class SVM (OCSVM) [153]. These three methods are used to assign an anomaly score to each element of the sparse tensor. Each method was applied to all third mode fibers which correspond to different weeks’ traffic activity. This is equivalent to fitting a univariate distribution to each of the third mode fibers of the tensor. The anomaly scores were used to create an anomaly score tensor. Finally, the elements with the highest anomaly scores were selected as anomalous while the rest were determined to be normal. 3.7 Experiments In this chapter, we evaluated the proposed method on both real and synthetic datasets. We compared our method to regular HoRPCA and weighted HoRPCA (WHoRPCA) where the nuclear norm of each unfolding is weighted. In addition to Tucker based algorithms we compare with two CP based anomaly detection methods: low rank plus sparse CP (LRSCP) [88] and Bayesian augmented tensor factorization (BATF) [40]. In the case of LRSCP, a low rank plus sparse CP model is used for anomaly detection. While the original algorithm is implemented in an online manner, in this chapter, it is modified to be applicable to the whole ST data for comparability against other methods. BATF is a CP based tensor completion/imputation model with smoothness constraints designed for urban traffic data. BATF utilizes bias vectors for all mode-𝑛 slices 60 which enforces smoothness across each mode. To evaluate the effect of graph regularization term in (3.4), we also compared with LOSS corresponding to the objective function in (3.2). As our method is focused on feature extraction for anomaly detection, we also compared our method to baseline anomaly detection methods such as EE, LOF and OCSVM applied to the original tensor. After the feature extraction stage, unless noted otherwise, such as "GLOSS-LOF", EE was used as the default anomaly scoring method for all tensor feature extraction methods, e.g., HoRPCA, WHoRPCA, BATF, etc. LRSCP scores the anomalies by ordering the magnitude of the elements of the sparse tensor [88]. The number of neighbors for LOF is selected as 10 as this is the suggested lower-bound in [207]. The outlier fraction of OCSVM is set to 0.1 as only the anomaly scores, not labels, generated by OCSVM are used in the experiments. The methods used for comparison and their properties are summarized in Table 3.1. Table 3.1: Properties of anomaly detection methods used in the experiments. 
The acronyms refer to the different attributes of the cost function: (LR) low-rank, (SP) sparse, (WLR) weighted low-rank, (SR) smoothness regularization. LR SP WLR SR GLOSS + + + + LOGSS - + - + Tucker Based LOSS + + + + WHoRPCA + + + - HoRPCA + + - - BATF + + - + CP Based LRSCP + + - - EE N/A N/A N/A N/A LOF N/A N/A N/A N/A OCSVM N/A N/A N/A N/A For each data set, a varying 𝐾 percent of the elements with highest anomaly scores are determined to be anomalous. With varying 𝐾, ROC curves were generated for synthetic data and the mean area under the curve (AUC) was computed for 10 random experiments. For real data, number of detected events were reported for varying 𝐾. The number of neighbors for LOF is selected as 10 as this is the suggested lower-bound in [207]. Finally, the outlier fraction of OCSVM is set to 0.1 as only the anomaly scores, not labels, generated by OCSVM is used in the experiments. 61 3.7.1 Data Description To evaluate the proposed framework, we use two publicly available datasets as well as synthetic data. Real Data: The first dataset is NYC yellow taxi trip records1 for 2018. This dataset consists of trip information such as the departure zone and time, arrival zone and time, number of passengers, tips for each yellow taxi trip in NYC. In the following experiments, we only use the arrival zone and time to collect the number of arrivals for each zone aggregated over one hour time intervals. We selected 81 central zones to avoid zones with very low traffic [205]. Thus, we created a tensor Y of size 24 × 7 × 52 × 81 where the first mode corresponds to hours within a day, the second mode corresponds to days of a week, the third mode corresponds to weeks of a year and the last mode corresponds to the zones. The data is suitable for low-rank on graphs model as graph stationarity measure 𝑠𝑟 (Γ), given in Section 1.3.1, for each mode is 0.83, 0.98, 0.99, 0.56, respectively. This implies that the data is mostly low-rank on the temporal modes as there is strong correlation among the different days, hours and weeks, while it is less low-rank across space. The second dataset is Citi Bike NYC bike trips2 for 2018. This dataset contains the departure and arrival station numbers, arrival and departure times and user id. In our experiments, we aggregated bike arrival data for taxi zones imported from the NYC yellow taxi trip records dataset, instead of using the original stations, to reduce the dimensionality and to avoid data sparsity. The resulting data tensor is of size 24 × 7 × 52 × 81. 1 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page 2 https://www.citibikenyc.com/system-data 62 Synthetic Data Generation: Evaluating urban events in a real-world setting is an open challenge, since it is difficult to obtain urban traffic data sets with ground truth information, i.e. anomalies are not known a priori and what constitutes an anomaly depends on the nature of the data and the specific problem. To be able to evaluate our method quantitatively, we generate synthetic data and inject anomalies. Following [205], we generated a synthetic data set by taking the average of the NYC taxi trip tensor, Y, along the third mode, i.e. across weeks of a year. We then repeat the resulting three-mode tensor such that for each zone, average data for a week is repeated 52 times. We multiply each element of the tensor by a Gaussian random variable with mean 1 and variance 0.5 to create variation across weeks. We generate anomalies on randomly selected 𝑚% of the first mode fibers. 
For each fiber, we set a random time interval of length $l$, which corresponds to $l$ hours in a day, as anomalous. We multiply the average value of each randomly selected anomalous interval by a parameter $c$ and then modify the entries by adding or subtracting this value within the interval. When $c$ is low, the anomalies are harder to detect and may be perceived as noise.

3.7.2 Parameter Selection

In this section, we discuss how the different parameters in (3.4) are selected. Following [70], we set $\lambda = 1/\sqrt{\max(I_1,\dots,I_N)}$ for HoRPCA and $\beta_1 = \frac{1}{5\,\mathrm{std}(\mathrm{vec}(\mathcal{Y}))}$, where $\mathrm{vec}(\mathcal{Y}) \in \mathbb{R}^{I_1 I_2 \dots I_N}$ is the vectorization of $\mathcal{Y}$ and $\mathrm{std}(\cdot)$ is the standard deviation. The other $\beta$ parameters for all methods are set equal to $\beta_1$. The selection of the $\beta$ parameters does not affect the algorithm performance but changes the convergence rate, as mentioned in Section 3.5. In GLOSS, the neighborhood size for each of the $k$-NN graphs is selected as $k = \log\!\left(\sum_{n=1}^{N} I_n\right)$ following [177]. The $\sigma$ value in the $k$-NN graphs is selected to be proportional to the Frobenius norm of each mode to ensure similar density levels for all graphs. The ranks for LRSCP and BATF are selected from $\{1, 2, \dots, 11\}$ as the rank with the best result; increasing the rank to higher values does not improve the results while it increases complexity.

An important observation about the selection of the parameters is that, depending on the data, the optimal values of the hyperparameters might differ. This is due to properties of the data such as size, variance and sparsity level. Moreover, the different hyperparameters are dependent on each other. Hence, a search over the whole parameter space may be costly. Thus, we perform a sensitivity analysis for the different hyperparameters.

Figure 3.1: Mean AUC values for various choices of: (a) $\lambda$ and $\gamma$, (b) $\theta$ and $\psi_1$, (c) $\psi_2$ and $\psi_4$. Mean AUC values across 10 random experiments are reported for each hyperparameter pair. For each set of experiments, the remaining hyperparameters are fixed.

In Figure 3.1, we present the average AUC for various ranges of the hyperparameters for GLOSS applied to the synthetic data generated with $c = 2.5$. It can be seen from Figure 3.1a that while low values of $\lambda$ always provide the best results, $\gamma$ values are optimized around $10^{-5}$. Increasing the sparsity penalty $\lambda$ above $10^{-3}$ results in a sparse tensor that is mostly zero. At this $\lambda$ value, the AUC becomes equal to 0.5, which is equivalent to randomly guessing the anomalous points. On the other hand, when $\gamma$ is too large, it smooths out all mode-1 fibers and generates the same anomaly score for each fiber. This is akin to identifying anomalous days, rather than time intervals. From these observations, it can be seen that there is not a strong dependence between $\gamma$ and $\lambda$. For the weight parameters $\psi_n$, when one of them is set to 1 and the others are varied across a wide range of values, as shown in Figures 3.1c and 3.1b, the accuracy does not change significantly. Similarly, changing the value of $\theta$ does not seem to affect the optimal value of $\psi_1$. We repeated this analysis for different $c$ values and observed similar results, indicating that the proposed method is not sensitive to the selection of hyperparameters as long as they are selected following the guidelines given below. In particular, the choice of $\lambda$ affects the accuracy more than any of the other hyperparameters.
This observation reduces the computational complexity of finding a good set of parameters, as the search can be implemented in parallel. Based on these empirical observations, we select $\lambda = \gamma = 1/\|P_\Omega(\mathcal{Y})\|_0$, where $\|P_\Omega(\mathcal{Y})\|_0$ is the number of nonzero elements of $\mathcal{Y}$, for GLOSS, and $\lambda = \gamma = 1/\max(I_1,\dots,I_N)$ for LOSS and WHoRPCA, similar to [70]. Since the ranks across each mode are closely related to the variance of the data within that mode, the weights for each mode in the definition of the nuclear norm, $\psi_n$, are selected to be inversely proportional to the square root of the trace of the mode-$n$ covariance, i.e., $\psi_n = \frac{p}{\sqrt{\mathrm{Tr}(\Sigma_{Y_{(n)}})}}$, where $p$ is selected such that $\min_n(\psi_n) = 1$. The parameter $\theta$ is set to be the geometric mean of the $\psi_n$, i.e., $\theta = \left(\prod_{n=1}^{4}\psi_n\right)^{1/4}$.

3.7.3 Experiments on Synthetic Data

Effect of anomaly length and percentage: First, we evaluated the effect of the length $l$ and the percentage $m$, i.e., the denseness, of anomalies in the synthesized data. For these experiments, we set $c = 2.5$ and $P = 0\%$. From Table 3.2 and Fig. 3.2, it can be seen that as $l$ increases, the performance of LOGSS and LOSS improves, while HoRPCA's performance does not show a significant change. This is due to the fact that the temporal total variation regularization becomes better suited to the observed data as the anomalies become more temporally persistent, i.e., when $l$ increases.

Figure 3.2: AUC of ROC w.r.t. $l$ and $m$ with $c = 2.5$, $P = 0\%$.

Although LOSS performs slightly better than LOGSS when $l$ is large, for low $l$ it drastically underperforms, which is not the case for LOGSS. With increasing $m$, all methods perform worse due to the assumption of sparsity for the anomalies.

Robustness against noise: We first evaluated the effect of $c$, i.e., the strength of the anomaly, on the accuracy of the proposed method. Low $c$ values imply that the amplitudes of anomalies are low and they may be indistinguishable from noise. From Fig. 3.3 and Table 3.2, it can be seen that for varying $c$ values, our method (GLOSS-EE) achieves the highest AUC values compared to both the baseline methods and HoRPCA, WHoRPCA, LRSCP, BATF, LOSS and LOGSS. Among the remaining methods, LOSS performs better than WHoRPCA, especially when the anomaly strength is small. CP based methods generally underperform compared to all other methods. The proposed methods have higher anomaly detection accuracy compared to EE and HoRPCA, which illustrates the benefit of tailoring the optimization problem to the anomaly structure. In fact, HoRPCA does not perform better than EE in most cases, which means that extracting anomalies using HoRPCA does not offer a significant improvement over using the original data. It is also important to note that the choice of the anomaly scoring method does not change the performance of GLOSS significantly.

Figure 3.3: ROC curves for various amplitudes of anomalies. Higher amplitude means more separability. $c$ = (a) 1.5, (b) 2, (c) 2.5. ($P = 0\%$, $l = 7$, $m = 2.3\%$)

In terms of time complexity, it can be seen from Table 3.3 that LOGSS is up to 10 times faster than both GLOSS and LOSS. BATF, on the other hand, has a higher run time than the proposed methods (LOSS and GLOSS) with lower accuracy. Although other methods such as EE, LOF, LRSCP and HoRPCA run faster than LOSS and GLOSS, the proposed methods outperform them in anomaly detection accuracy, as mentioned earlier.

Table 3.2: Mean and standard deviation of AUC values for various $c$ and $P$.
On experiments of each variable, the rest of the variables are fixed at 𝑐 = 2.5, 𝑃 = 0%, 𝑙 = 7 and 𝑚 = 2.3%. The proposed methods, outperform the other algorithms in all cases significantly with 𝑝 < 0.001. 𝑐 = 1.5 𝑐=2 𝑐 = 2.5 𝑃 = 20% 𝑃 = 40% 𝑃 = 60% EE 0.70±0.004 0.81±0.004 0.87±0.003 0.81±0.004 0.61±0.008 0.53±0.01 LOF 0.68±0.003 0.78±0.004 0.84±0.004 0.79±0.005 0.78±0.005 0.73±0.007 OCSVM 0.65±0.005 0.73±0.004 0.77±0.005 0.81±0.004 0.81±0.005 0.76±0.006 HoRPCA 0.7±0.004 0.81±0.004 0.87±0.003 0.8±0.004 0.73±0.006 0.79±0.005 WHoRPCA 0.7±0.004 0.81±0.004 0.87±0.003 0.8±0.003 0.73±0.006 0.8±0.005 LRSCP 0.51±0.02 0.53±0.03 0.54±0.05 0.61±0.007 0.7±0.01 0.8±0.006 BATF 0.59±0.005 0.65±0.005 0.69±0.005 0.67±0.005 0.64±0.006 0.61±0.006 LOSS 0.81±0.005 0.9±0.004 0.94±0.0035 0.85±0.003 0.73±0.009 0.74±0.009 LOGSS 0.8 ± 0.005 0.9±0.003 0.94±0.002 0.86±0.004 0.74±0.008 0.76±0.005 GLOSS-EE 0.83±0.005 0.92±0.003 0.95±0.002 0.88±0.004 0.77±0.008 0.76±0.006 GLOSS-SVM 0.81±0.006 0.9±0.004 0.95±0.002 0.83±0.006 0.81±0.007 0.82±0.007 GLOSS-LOF 0.8±0.006 0.91±0.004 0.95±0.002 0.81±0.01 0.80±0.007 0.86±0.007 Table 3.3: Mean and standard deviation of run times (seconds) for various methods. EE LOF OCSVM HoRPCA WHoRPCA LRSCP BATF LOSS LOGSS GLOSS 12.1 ± 0.3 17.0 ± 0.3 1016.2 ± 56.1 7.5 ± 0.4 13.5 ± 0.5 0.66 ± 0.17 65.8 ± 0.16 44.7 ± 2.7 5.0±0.27 46.7 ± 1.5 Robustness against missing data: In addition to injecting synthetic anomalies, we also remove varying number of days at random from the tensor to evaluate the robustness of the proposed method to missing data. After generating the synthetic data, a percentage 𝑃 of the mode-1 fibers is set to zero to simulate missing data, where the number of mode-1 fibers is equal to the total number of days, i.e. 7 × 52 × 81. The accuracy of anomaly detection for varying 67 (a) (b) (c) Figure 3.4: ROC curves for varying percentage of missing data, (a) 𝑃 = 20%, (b) 𝑃 = 40%, (c) 𝑃 = 60%. (𝑐 = 2.5, 𝑙 = 7, 𝑚 = 2.3%) levels of missing data is illustrated in Fig. 3.4 and the corresponding AUC values (mean ± std) are given in Table 3.2. While the performance of all the methods degrades with increasing levels of missing data, GLOSS provides the best anomaly detection performance and is robust against missing data compared to the rest of the methods. It is interesting to see that with increasing percentage of missing data, the performance of OCSVM does not degrade too much, and some of the methods such as HoRPCA, WHoRPCA, LRSCP, GLOSS-SVM and GLOSS-LOF have a better performance. This phenomenon occurs partially due to increasing percentage of anomalous points. As some of the false positives are replaced by missing data, and anomalous points are higher in percentage, newly identified data might have more true positives with increasing percentage of missing data. Incorporating temporal smoothness for the anomalies in the objective function lowers the false detection rate by penalizing instantaneous changes in traffic that do not constitute an actual anomaly. We note that while LOSS performs comparable to GLOSS for varying anomaly strength, the performance of LOSS degrades quickly with increasing missing data. Therefore, even though both WHoRPCA and LOSS are equipped to handle missing data, GLOSS is more robust as it uses side information in the form of similarity graphs. Similar to the previous experiments, LOGSS provide the best results in terms of computational efficiency with the expense of a small amount of AUC compared to GLOSS. 
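For reference, the evaluation protocol described above (mean and standard deviation of the AUC over several random experiments) can be computed along the lines of the following sketch. It assumes the anomaly scores and the injected-anomaly masks are available as equally shaped arrays, which is an assumption made for illustration; scikit-learn's roc_auc_score is used for the ROC computation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_auc(score_tensors, label_tensors):
    """Mean and std of AUC over a set of random experiments.

    `score_tensors` and `label_tensors` are lists of equally shaped arrays:
    anomaly scores (e.g., derived from the sparse part S) and the binary
    ground-truth masks of injected anomalies, respectively.
    """
    aucs = [roc_auc_score(y.ravel(), s.ravel())
            for s, y in zip(score_tensors, label_tensors)]
    return np.mean(aucs), np.std(aucs)
```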
3.7.4 Experiments on Real Data To evaluate the performance of the proposed methods on real data, we compiled a list of 20 urban events, which are listed in Table 3.4, that took place in the important urban activity centers such as city squares, parks, museums, stadiums and concert halls, during 2018. We used the same set of urban events for both taxi and bike data. To detect the activities, top-𝐾 percent, with varying K, of the highest anomaly scores of the 68 Table 3.4: Events of Interest for NYC in 2018. Event Name Location Date and Time New Year’s Times Square 1/1 12-2AM Blackhawks vs. Rangers Madison Square Garden 1/3 4-8PM Armory Show Piers 92/94 1/14 9AM-5PM Woman’s March Central Park West 1/20 8AM-12PM Big Ten Basketball Final Madison Square Garden 3/4 3-10PM Big East Quarter Finals Madison Square Garden 3/8 3-10PM St. Patricks Day Parade 5th Avenue, btw 44th and 79th 3/17 11AM-5PM Nor’easter Storm Citywide 3/20 11AM-5PM NIT Quarterfinal Utah vs. St.Mary’s Madison Square Garden 3/21 5 PM- 10PM U2 Concert Madison Square Garden 7/1 5-10PM July 4th Celebrations Citywide 7/4 5-11PM UN General Assembly United Nation Headquarters 9/25 12-5PM Comic Con Javits Center 11/4 8 AM-3PM NYC Marathon Colombus Circle 11/04 12-5PM Elton John Concert Madison Square Garden 11/9 7-11PM Macy’s Thanksgiving Parade Herald Square 11/22 9PM-12AM Christmas Tree Lighting Bryant Park 12/4 7PM-12AM Golden Knights vs. Rangers Madison Square Garden 12/16 12-3PM Phish Concert Madison Square Garden 12/28 4-8PM New Year’s Eve Times Square 12/31 8PM-12AM extracted sparse tensors are selected as anomalies and compared against the compiled list of urban events. In previous work, similar case studies were presented for experiments on real data [206, 39, 203, 205]. Detection performance for all methods is given in Tables 3.5 and 3.6 for the taxi and bike data, respectively. From Table 3.5, it can be seen that anomaly scoring methods applied to the spatiotemporal features extracted by GLOSS perform the best for the NYC Taxi data. The performance of GLOSS is followed by LOSS as temporal smoothness allows for detection of events at lower 𝐾 by removing anomalies resulting from noise. LOSS performs the best initially but as more points are considered GLOSS finds more cases. Although LOGSS performs better than baseline methods most of the time it does not perform as good as GLOSS or LOSS. In the case of CP based methods, although the performance of LRSCP is not good, BATF shows results competitive with GLOSS especially for Taxi data. Although BATF extracts the anomalies well, it has a high computational complexity. Among the baseline methods, EE performs the best while LOF performs the worst. However, when the features extracted from GLOSS are input to LOF and EE, their performances 69 become very similar. This shows that GLOSS is effective at separating anomalous entries from noise and normal traffic activity and thus, improves the performance of both LOF and EE. It is important to note that most of the anomalies cannot be detected at low 𝐾 values by most methods because events such as New Year’s Eve or July 4th celebrations change the activity pattern in the whole city and constitute the majority of the anomalies detected at low 𝐾 values for Taxi Data. The performance of all methods is significantly reduced in Bike Data as can be seen from Table 3.6. This is because Bike Data is very noisy with a large number of days, or points that would be considered anomalous. 
Also, some of the selected events do not produce significant changes in Bike Data such as New Year’s Eve as usage of bikes at midnight is low even though it’s a significant event for taxi traffic. Changes in the weather also affect the performance by increasing the variance of the data, especially across the third mode, which corresponds to the weeks of the year. We see some fluctuations in the performance comparisons as well, such as LOGSS performing better than all methods at higher 𝐾, i.e. percentages. Still, the proposed methods are better than HoRPCA, WHoRPCA, and baseline methods overall, which shows the improvements brought by the extracted features. It is important to note that event selection is done manually and the selected events may not correspond to the most significant anomalies. Thus, although it is a widely utilized tool in analyzing the performance on real data, the case study approach might not reflect the true performance of the anomaly detection method as effectively as synthetic data. Table 3.5: Results for 2018 NYC Yellow Taxi Data. Columns indicate the percentage of selected points with top anomaly scores. The table entries correspond to the number of events detected at the corresponding percentage. % 0.014 0.07 0.14 0.3 0.7 1 2 3 EE 0 0 1 3 9 9 16 18 LOF 0 0 0 1 1 2 4 5 OCSVM 0 0 2 5 8 11 15 16 HoRPCA 0 5 11 14 20 20 20 20 WHoRPCA 0 0 1 3 9 9 16 18 LRSCP 0 1 1 1 5 7 10 11 BATF 0 1 5 9 14 15 20 20 LOSS 3 9 12 15 16 17 19 20 LOGSS 0 0 1 6 10 12 18 18 GLOSS-EE 1 8 13 15 18 18 19 20 GLOSS-LOF 2 8 12 14 18 18 19 19 GLOSS-SVM 0 3 3 9 17 18 20 20 In Fig. 3.5, we illustrate the bike data for July 4th at Hudson River banks, and the low-rank and sparse 70 Table 3.6: Results on 2018 NYC Bike Trip Data. % 0.3 1 2 3 4.2 7 9.7 12.5 EE 0 0 0 1 2 2 2 3 LOF 1 1 1 1 2 2 4 6 OCSVM 0 0 1 2 2 2 3 3 HoRPCA 2 3 4 4 6 8 11 14 WHoRPCA 1 1 2 7 11 12 12 12 LRSCP 0 1 2 3 3 4 6 10 BATF 2 3 4 6 9 10 11 13 LOSS 0 1 1 1 4 13 15 16 LOGSS 1 3 4 6 9 11 14 17 GLOSS-EE 0 0 3 7 9 11 15 16 GLOSS-LOF 1 2 2 2 2 6 8 11 GLOSS-SVM 1 2 5 8 10 15 15 15 (a) (b) (c) Figure 3.5: Bike Activity data, the extracted sparse part and low-rank part across for July 4th Celebrations at Hudson River banks. (a) Real Data where the traffic for 52 Wednesdays is shown along with the traffic on Independence Day and average traffic; (b) Sparse tensor where the curve corresponding to the anomaly is highlighted; (c) Low-rank tensor with the curve corresponding to the Independence Day highlighted. parts extracted by GLOSS. It can be seen that as the data varies across different weeks, the low-rank part can explain this variance well by fitting a pattern to days with varying amplitudes. Thus, the proposed method does not get affected by events such as the weather as it can capture both low and high traffic days in the low-rank part which can be seen in Fig 3.5c. The deviations from the daily pattern, rather than the actual traffic volume, is captured by the sparse part, which is then input to the anomaly scoring algorithms. Thus, our method is able to extract the events at a fairly low 𝐾. 71 3.8 Conclusions In this chapter, we proposed robust tensor decomposition based anomaly detection methods for urban traffic data. The proposed methods extract a low-rank component using a weighted nuclear norm and imposes the sparse component to be temporally smooth to better model the anomaly structure. 
In one of the methods, graph regularization is employed to preserve the geometry of the data and to account for nonlinearities in the anomaly structure. In another, low-rank tensor recovery is implemented by minimizing graph total variation on similarity graphs constructed across each mode. This approximation circumvents the need for a computationally expensive nuclear norm minimization. ADMM based, computationally efficient and scalable algorithms are proposed to solve the resulting optimization problems. As the proposed methods focus on spatiotemporal feature extraction, the resulting features can be input to well-known anomaly detection methods such as EE, LOF and OCSVM for anomaly scoring.

The proposed methods are evaluated on both synthetic and real urban traffic data. Results on synthetic data illustrate the robustness of our methods to varying levels of missing data and their sensitivity to even low amplitude anomalies. In particular, our methods outperform HoRPCA and WHoRPCA thanks to the temporal smoothness assumption on the sparse part. Moreover, the graph regularization improves the accuracy further by ensuring that the low-rank projections preserve the local geometry of the data. On real data, our methods begin to detect anomalies earlier than existing methods, i.e., the top anomaly scores usually correspond to events of interest. GLOSS provides further improvement over LOSS, as more events are detected for a given number of selected points. Furthermore, the results from real data show how the extracted sparse component highlights the anomalous activities. For both synthetic and real data, LOGSS outperforms the other methods in terms of computational efficiency, with some loss in AUC compared to GLOSS. Experiments on synthetic data reveal that the proposed methods perform better when anomalies are longer in duration, and that LOGSS outperforms LOSS when anomalies are shorter in duration. Although LOSS and GLOSS perform better on real data, LOGSS shows similar performance with a shorter run time.

In future work, a statistical tensor anomaly scoring method will be explored instead of scoring each fiber individually with a separate algorithm. In recent years, deep learning based methods have also provided a promising new direction for anomaly detection [33, 28]. Some examples include a long short-term memory (LSTM) neural network used to predict traffic flow and detect anomalies [94] and fully connected neural networks used to decompose traffic data into normal and anomalous parts [205]. Future work will consider extensions to tensor type data and the application of 3D-CNNs [33]. Another possible extension of the proposed work is online anomaly detection. The methods proposed in this chapter cannot achieve real-time anomaly detection, as for the current application the data has been collected and stored offline. However, it is possible to extend low-rank tensor models to an online setting. Recently, methods for online tensor subspace tracking have been introduced [87, 141, 8, 110]. Our method can be extended to the online setting using the approaches outlined in these papers. Applications and extensions of the proposed method to network data and other spatiotemporal data with different characteristics, such as fMRI, will also be considered.

CHAPTER 4
GEOMETRIC TENSOR LEARNING

4.1 Introduction

Most high-dimensional data have a low-dimensional structure, and principal component analysis (PCA) is the fundamental approach to extract this low-dimensional structure.
A major drawback of PCA is that it is sensitive to grossly corrupted or outlying observations, which are ubiquitous in real-world data. Robust PCA (RPCA) addresses this issue by decomposing the observed data matrix into a low-rank and a sparse part [30]. However, RPCA can only handle two-way matrix data, while real data are usually multi-way in nature and stored in arrays known as tensors. In recent years, different extensions of RPCA have been introduced to deal with tensor type data based on different tensor models. Some examples include simple low-rank tensor completion (SiLRTC) [120], higher-order RPCA (HoRPCA) based on the Tucker model [70], tensor RPCA (TRPCA) based on the t-SVD model [125] and tensor-train based tensor RPCA [53, 197].

Different tensor models employ different definitions of rank and result in different interpretations of what a low-rank tensor is. Tucker rank, optimized by minimizing the sum of nuclear norms (SNN), cannot appropriately capture the global correlation in a tensor, as each mode represents the matrix row in an unbalanced matricization scheme [17]. Tensor tubal rank, optimized by minimizing the tensor nuclear norm (TNN), on the other hand, characterizes the correlations along the first and second modes, while the correlation along the third mode is encoded by the embedded circular convolution. Moreover, the definition of TNN is usually limited to third-order tensors [210]. Tensor-train rank, optimized using the tensor train nuclear norm (TTNN), can capture the global correlation of all tensor entries by providing the mean of the correlation between two sets of modes, as it is based on a canonical unfolding [17].

While robust low-rank tensor representations capture the global structure of tensor data, they do not preserve the local geometric structure. Manifold learning addresses this issue and has been successfully implemented for tensors [164, 161]. However, current manifold learning approaches typically focus on only one mode of the data. Yet for many data matrices and tensors, correlations exist across all modalities. Several recent papers [67, 156, 155, 128, 127, 157, 22] exploit this coupled relationship to co-organize matrices and infer underlying row and column embeddings.

Inspired by the success of low-rank embedding and manifold learning, in this chapter we propose to integrate them into a unified framework for simultaneously capturing the global low-rank and the local geometric structure. In particular, we propose a graph regularized robust low-rank tensor-train decomposition. The proposed model is based on the robust tensor-train decomposition introduced in [197]. Unlike previous graph regularized tensor decompositions, the proposed method introduces graph regularization across each canonical unfolding to leverage the underlying geometry of each tensor mode. The resulting optimization problem is shown to be computationally expensive due to the size of the graphs across each canonical unfolding. An equivalence between these computationally expensive graph regularization terms and regular tensor unfoldings is derived, and a computationally efficient implementation of graph regularized robust tensor-train decomposition is proposed.

4.2 Tensor Train Robust PCA on Graphs

Given an observed tensor with missing entries and gross corruption $P_\Omega[\mathcal{Y}]$, the objective of tensor RPCA is to extract a low-rank tensor, $\mathcal{X}$, and a sparse tensor, $\mathcal{S}$, corresponding to gross outliers, such that $P_\Omega[\mathcal{Y}] = P_\Omega[\mathcal{X}+\mathcal{S}]$.
Robust tensor-train PCA was proposed in [197] as the solution to:
$$\min_{\mathcal{X},\mathcal{S}} \; \sum_{n=1}^{N-1} \alpha_n \|\mathbf{X}_{[n]}\|_* + \lambda\|\mathcal{S}\|_1, \quad \text{s.t. } P_\Omega[\mathcal{X}+\mathcal{S}] = P_\Omega[\mathcal{Y}].$$
While nuclear norm minimization can capture the global structure of the tensor, incorporating graph regularization can capture the local geometry and non-linear structures within the data. Adding graph regularization on the mode-$n$ canonical unfoldings of the low-rank part, $\mathbf{X}_{[n]} \in \mathbb{R}^{I_1\cdots I_n \times I_{n+1}\cdots I_N}$, the modified objective can be rewritten as:
$$\min_{\mathcal{X},\mathcal{S}} \; \sum_{n=1}^{N-1} \alpha_n \|\mathbf{X}_{[n]}\|_* + \sum_{n=1}^{N} \theta_n \operatorname{tr}(\mathbf{X}_{[n]}^\top \Phi_n \mathbf{X}_{[n]}) + \lambda\|\mathcal{S}\|_1, \quad \text{s.t. } P_\Omega[\mathcal{X}+\mathcal{S}] = P_\Omega[\mathcal{Y}], \qquad (4.1)$$
where $\Phi_n \in \mathbb{R}^{I_1\cdots I_n \times I_1\cdots I_n}$ is the mode-$n$ graph Laplacian. The solution to (4.1) will be referred to as tensor train robust PCA with graph regularization (TTRPCA-G).

4.2.1 Kronecker Structured Graphs

For each mode $n$, the graph regularization proposed in (4.1) computes the similarity over all the modes from the first to the $n$th mode. Thus, it potentially utilizes the same local geometric information multiple times. In this section, we propose modeling each $\Phi_n$ with a Kronecker structure, $\Phi_n = \hat{\Phi}_n \otimes \mathbf{I}$ with $\hat{\Phi}_n \in \mathbb{R}^{I_n\times I_n}$. This structure imposes a manifold on only mode $n$, rather than on the set of all modes $n'$ with $n' \le n$. In the following, we show that solving such a system is equivalent to replacing the graph regularization term on the canonical unfolding with one on the mode-$n$ unfolding. Moreover, this structure greatly reduces the computational complexity of TTRPCA-G.

Lemma 2. Let $B \in \mathbb{R}^{I\times J}$, $A \in \mathbb{R}^{K\times I}$ and $C \in \mathbb{R}^{L\times J}$; then $\operatorname{vec}(ABC^\top) = (C\otimes A)\operatorname{vec}(B)$, where $\operatorname{vec}(\cdot)$ stacks all elements of its argument into a vector.

Lemma 3. Given a third-order tensor $\mathcal{B} \in \mathbb{R}^{I\times J\times M}$ with third-mode slices $B_i \in \mathbb{R}^{I\times J}$, $\|C\mathbf{B}_{(2)}\|_F^2 = \|(C\otimes\mathbf{I})\mathbf{B}_{[2]}\|_F^2$, where $C \in \mathbb{R}^{L\times J}$ is any matrix and $\mathbf{I} \in \mathbb{R}^{I\times I}$ is the identity.

Proof. Let $\mathbf{b}_i = \operatorname{vec}(B_i) \in \mathbb{R}^{IJ\times 1}$. Then,
$$\operatorname{vec}(\mathbf{B}_{(2)}^\top C^\top) = \begin{bmatrix} \operatorname{vec}(B_1 C^\top) \\ \operatorname{vec}(B_2 C^\top) \\ \vdots \\ \operatorname{vec}(B_M C^\top) \end{bmatrix} = \begin{bmatrix} (C\otimes\mathbf{I})\mathbf{b}_1 \\ (C\otimes\mathbf{I})\mathbf{b}_2 \\ \vdots \\ (C\otimes\mathbf{I})\mathbf{b}_M \end{bmatrix} = (\mathbf{I}\otimes(C\otimes\mathbf{I}))[\mathbf{b}_1^\top,\mathbf{b}_2^\top,\ldots,\mathbf{b}_M^\top]^\top = (\mathbf{I}\otimes(C\otimes\mathbf{I}))\operatorname{vec}(\mathbf{B}_{[2]}) = \operatorname{vec}((C\otimes\mathbf{I})\mathbf{B}_{[2]}),$$
where the second equality follows from Lemma 2. Thus,
$$\|C\mathbf{B}_{(2)}\|_F^2 = \|\mathbf{B}_{(2)}^\top C^\top\|_F^2 = \|\operatorname{vec}(\mathbf{B}_{(2)}^\top C^\top)\|_2^2 = \|\operatorname{vec}((C\otimes\mathbf{I})\mathbf{B}_{[2]})\|_2^2 = \|(C\otimes\mathbf{I})\mathbf{B}_{[2]}\|_F^2.$$
By reshaping an $N$-mode tensor into a third-order tensor whose second mode corresponds to mode $n$, this result generalizes to any mode $n$ and order $N$.

Theorem 1. Let $\Phi_n = \hat{\Phi}_n \otimes \mathbf{I}$, where $\hat{\Phi}_n \in \mathbb{R}^{I_n\times I_n}$ and $\mathbf{I} \in \mathbb{R}^{I_1 I_2\cdots I_{n-1}\times I_1 I_2\cdots I_{n-1}}$. The term $\sum_{n=1}^{N}\theta_n\operatorname{tr}(\mathbf{X}_{[n]}^\top\Phi_n\mathbf{X}_{[n]})$ is equivalent to $\sum_{n=1}^{N}\theta_n\operatorname{tr}(\mathbf{X}_{(n)}^\top\hat{\Phi}_n\mathbf{X}_{(n)})$.

Proof. Since $\hat{\Phi}_n$ is a symmetric positive semidefinite graph Laplacian, it can be factored as $\hat{\Phi}_n = \hat{P}_n^\top\hat{P}_n$. Then $\Phi_n = (\hat{P}_n^\top\hat{P}_n)\otimes\mathbf{I} = P_n^\top P_n$, where $P_n = \hat{P}_n\otimes\mathbf{I}$. Thus, the following equalities hold:
$$\operatorname{tr}(\mathbf{X}_{[n]}^\top\Phi_n\mathbf{X}_{[n]}) = \|P_n\mathbf{X}_{[n]}\|_F^2 = \|(\hat{P}_n\otimes\mathbf{I})\mathbf{X}_{[n]}\|_F^2 = \|\hat{P}_n\mathbf{X}_{(n)}\|_F^2 = \operatorname{tr}(\mathbf{X}_{(n)}^\top\hat{P}_n^\top\hat{P}_n\mathbf{X}_{(n)}) = \operatorname{tr}(\mathbf{X}_{(n)}^\top\hat{\Phi}_n\mathbf{X}_{(n)}),$$
where the third equality follows from Lemma 3.

By Theorem 1, the graph regularization term in (4.1) is equivalent to $\sum_{n=1}^{N}\theta_n\operatorname{tr}(\mathbf{X}_{(n)}^\top\hat{\Phi}_n\mathbf{X}_{(n)})$. Thus, we can rewrite (4.1) as:
$$\min_{\mathcal{X},\mathcal{S}} \; \sum_{n=1}^{N-1}\alpha_n\|\mathbf{X}_{[n]}\|_* + \sum_{n=1}^{N}\theta_n\operatorname{tr}(\mathbf{X}_{(n)}^\top\hat{\Phi}_n\mathbf{X}_{(n)}) + \lambda\|\mathcal{S}\|_1, \quad \text{s.t. } P_\Omega[\mathcal{X}+\mathcal{S}] = P_\Omega[\mathcal{Y}]. \qquad (4.2)$$
The solution to (4.2) will be referred to as tensor train robust PCA with mode-$n$ graph regularization (TTRPCA-nG).
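The equivalence in Theorem 1 is easy to check numerically. The following NumPy sketch uses arbitrary small sizes and a random similarity graph (not one from the thesis experiments) and verifies that the Kronecker-structured regularizer on the canonical unfolding equals the mode-n regularizer.

```python
import numpy as np

# Sanity check of Theorem 1 for a third-order tensor and n = 2 (illustrative sizes).
I1, I2, I3 = 3, 4, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((I1, I2, I3))

# Mode-2 graph Laplacian Phi_hat of a random similarity graph on I2 nodes
W = rng.random((I2, I2))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
Phi_hat = np.diag(W.sum(axis=1)) - W

# Canonical (TT) unfolding X_[2]: rows index (i1, i2) with mode 1 varying fastest
X_can = X.reshape(I1 * I2, I3, order='F')
# Mode-2 unfolding X_(2): rows index i2
X_mode = np.moveaxis(X, 1, 0).reshape(I2, I1 * I3)

Phi = np.kron(Phi_hat, np.eye(I1))           # Kronecker-structured Laplacian
lhs = np.trace(X_can.T @ Phi @ X_can)        # canonical-unfolding regularizer
rhs = np.trace(X_mode.T @ Phi_hat @ X_mode)  # mode-n regularizer
print(np.isclose(lhs, rhs))                  # True
```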
4.2.2 Optimization

The objective functions (4.1) and (4.2) are optimized using an Alternating Direction Method of Multipliers (ADMM) scheme, as ADMM has been previously utilized for solving similar convex problems [70, 154, 156]. In this section, we give a detailed derivation of the update steps for optimizing (4.1). While the update steps are similar for (4.2), we note when they differ. To separate the nuclear norm and graph regularization terms and to isolate functions of each mode, we introduce auxiliary variables $\mathfrak{L}^n$ and $\mathfrak{G}^n$. (4.1) is then rewritten as:
$$\min_{\{\mathfrak{L}\},\{\mathfrak{G}\},\mathcal{X},\mathcal{S}} \; \sum_{n=1}^{N-1}\alpha_n\|\mathfrak{L}^n_{[n]}\|_* + \sum_{n=1}^{N}\theta_n\operatorname{tr}(\mathfrak{G}^{n\top}_{[n]}\Phi_n\mathfrak{G}^n_{[n]}) + \lambda\|\mathcal{S}\|_1, \quad \text{s.t. } P_\Omega[\mathcal{X}+\mathcal{S}] = P_\Omega[\mathcal{Y}],\; \mathcal{X}=\mathfrak{L}^n,\; \mathcal{X}=\mathfrak{G}^n.$$
The corresponding augmented Lagrangian is given by:
$$\sum_{n=1}^{N-1}\alpha_n\|\mathfrak{L}^n_{[n]}\|_* + \sum_{n=1}^{N}\theta_n\operatorname{tr}(\mathfrak{G}^{n\top}_{[n]}\Phi_n\mathfrak{G}^n_{[n]}) + \lambda\|\mathcal{S}\|_1 + \frac{\beta_1}{2}\|P_\Omega[\mathcal{Y}-\mathcal{X}-\mathcal{S}-\Lambda_1]\|_F^2 + \frac{\beta_2}{2}\sum_{n=1}^{N-1}\|\mathcal{X}-\mathfrak{L}^n-\Lambda_2^n\|_F^2 + \frac{\beta_3}{2}\sum_{n=1}^{N}\|\mathcal{X}-\mathfrak{G}^n-\Lambda_3^n\|_F^2, \qquad (4.3)$$
where $\Lambda_1,\Lambda_2^n,\Lambda_3^n \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ are the Lagrange multipliers. Using (4.3), each variable can be updated iteratively.

1. $\mathcal{X}$ update: The update of the low-rank variable $\mathcal{X}$ is given by:
$$P_\Omega[\mathcal{X}^{t+1}] = P_\Omega[\beta_1\mathcal{T}_1+\mathcal{T}_2]/(\beta_1+(N-1)\beta_2+N\beta_3), \qquad P_{\Omega^\perp}[\mathcal{X}^{t+1}] = P_{\Omega^\perp}[\mathcal{T}_2]/((N-1)\beta_2+N\beta_3), \qquad (4.4)$$
where $\mathcal{T}_1 = \mathcal{Y}-\mathcal{S}^t-\Lambda_1^t$ and $\mathcal{T}_2 = \beta_2\sum_{n=1}^{N-1}(\mathfrak{L}^{n,t}+\Lambda_2^{n,t}) + \beta_3\sum_{n=1}^{N}(\mathfrak{G}^{n,t}+\Lambda_3^{n,t})$.

2. $\mathfrak{L}^n$ update: The update of $\mathfrak{L}^n$ is solved by a soft thresholding operator on the singular values of $\big(\mathcal{X}^{t+1}-\Lambda_2^{n,t}\big)_{[n]}$ with a threshold of $\alpha_n/\beta_2$.

3. $\mathfrak{G}^n$ update: The variables $\mathfrak{G}^n$ can be updated by:
$$\mathfrak{G}^{n,t+1}_{[n]} = \beta_3 G_{inv}\big(\mathcal{X}^{t+1}-\Lambda_3^{n,t}\big)_{[n]}, \qquad (4.5)$$
where $G_{inv} = (2\theta_n\Phi_n+\beta_3\mathbf{I})^{-1}$ exists for any $\Phi_n$ whose set of eigenvalues does not contain $-\beta_3/(2\theta_n)$, and can be computed outside the loop for faster updates. The update rule for (4.2) is given by:
$$\mathfrak{G}^{n,t+1}_{(n)} = \beta_3 G_{inv}\big(\mathcal{X}^{t+1}-\Lambda_3^{n,t}\big)_{(n)}, \qquad (4.6)$$
where $G_{inv} = (2\theta_n\hat{\Phi}_n+\beta_3\mathbf{I})^{-1}$.

4. $\mathcal{S}$ update: The variable $\mathcal{S}$ can be updated by soft thresholding $P_\Omega[\mathcal{Y}-\mathcal{X}^{t+1}-\Lambda_1^t]$ with a threshold of $\lambda/\beta_1$.

5. Dual updates: Finally, the dual variables $\Lambda_1,\Lambda_2^n,\Lambda_3^n$ are updated using:
$$\Lambda_1^{t+1} = \Lambda_1^t + P_\Omega[\mathcal{X}^{t+1}+\mathcal{S}^{t+1}-\mathcal{Y}], \qquad (4.7)$$
$$\Lambda_2^{n,t+1} = \Lambda_2^{n,t} + (\mathfrak{L}^{n,t+1}-\mathcal{X}^{t+1}), \qquad (4.8)$$
$$\Lambda_3^{n,t+1} = \Lambda_3^{n,t} + (\mathfrak{G}^{n,t+1}-\mathcal{X}^{t+1}). \qquad (4.9)$$
The algorithms for both TTRPCA-G and TTRPCA-nG are outlined in Algorithm 4.1.

Algorithm 4.1: TTRPCA-G/nG
Input: $\mathcal{Y}$, $\Omega$, $\Phi_n$, parameters $\theta_n$, $\alpha_n$, $\gamma$, $\beta_1$, $\beta_2$, $\beta_3$, $T$.
Output: $\mathcal{X}$: low-rank tensor; $\mathcal{S}$: sparse tensor.
Initialize $\mathcal{S}^0=0$, $\mathfrak{L}^{n,0}=0$, $\mathfrak{G}^{n,0}=0$, $\Lambda_1^0=0$, $\Lambda_2^{n,0}=0$, $\Lambda_3^{n,0}=0$, $\forall n\in\{1,\ldots,N\}$,
$G_{inv} \leftarrow (2\theta_n\Phi_n+\beta_3\mathbf{I})^{-1}$ (TTRPCA-G) or $G_{inv} \leftarrow (2\theta_n\hat{\Phi}_n+\beta_3\mathbf{I})^{-1}$ (TTRPCA-nG).
for $t = 1$ to $T$ do
  Update $\mathcal{X}$ using (4.4).
  Update the $\mathfrak{L}^n$s using optimization step 2.
  Update the $\mathfrak{G}^n$s using (4.5) (for TTRPCA-G) or (4.6) (for TTRPCA-nG).
  Update $\mathcal{S}$ using optimization step 4.
  Update the Lagrange multipliers using (4.7), (4.8) and (4.9).
end for

4.2.3 Computation and Memory Complexity for Graphs

In (4.1), the size of each $\Phi_n$ is $\prod_{i=1}^{n}I_i \times \prod_{i=1}^{n}I_i$. Thus, the memory requirement for TTRPCA-G is $O(I^2)$, where $I = \prod_{i=1}^{N}I_i$. On the other hand, TTRPCA-nG requires only $O(I_n^2)$ parameters for each $\hat{\Phi}_n$. Computation of $G_{inv}$ in (4.5) for $n = N$ has $O(I^4)$ complexity. On the other hand, the computational complexity of $G_{inv}$ in TTRPCA-nG is $O(I_n^2)$. Overall, the computational complexities for TTRPCA-G and TTRPCA-nG are $O(I^4)$ and $O(I^{3/2})$, respectively.
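The two proximal steps used repeatedly inside Algorithm 4.1, singular value thresholding for the nuclear norm update of the auxiliary low-rank variables and entrywise soft thresholding for the sparse part, are standard operators. A minimal NumPy sketch is shown below; the matrix, sizes and threshold values are illustrative, not taken from the thesis experiments.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * ||.||_* (L^n update)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft thresholding: proximal operator of tau * ||.||_1 (S update)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

# Example of one L^n-style and one S-style update on a small random matrix
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 8))
L_new = svt(A, tau=0.5)    # threshold plays the role of alpha_n / beta_2
S_new = soft(A, tau=0.1)   # threshold plays the role of lambda / beta_1
```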
4.3 Experiments

The proposed method was compared against the Tucker based HoRPCA [70] and the TT based TTRPCA [197] on data completion and denoising tasks, on both synthetic and real tensor data¹. The results are reported in terms of peak signal to noise ratio (PSNR), structural similarity index measure (SSIM), and residual squared error (RSE), i.e., $\|\hat{\mathcal{Y}}-\mathcal{Y}_0\|_F/\|\mathcal{Y}_0\|_F$, where $\mathcal{Y}_0$ corresponds to the true underlying low-rank data.

Table 4.1: Denoising performance for synthetic data against varying levels of gross noise c% for various methods.

             c = 5                 c = 20                c = 35                c = 50
            RSE   PSNR  SSIM     RSE   PSNR  SSIM     RSE   PSNR  SSIM     RSE   PSNR  SSIM
Observed    1.95  17.78 0.71     3.91  11.73 0.24     5.18   9.30 0.07     6.22   7.71 0.01
HoRPCA      0.44  31.15 0.86     0.63  28.12 0.59     0.86  25.42 0.29     0.98  24.29 0.26
TTRPCA      0.07  47.52 0.99     0.23  36.86 0.96     0.44  31.22 0.81     0.93  24.77 0.30
TTRPCA-G    0.04  50.06 0.99     0.15  40.08 0.97     0.33  32.82 0.87     0.86  25.67 0.35
TTRPCA-nG   0.04  50.11 0.99     0.15  40.07 0.97     0.33  32.81 0.87     0.87  25.67 0.35

Following [17], we set $\alpha_k = \delta_k / \sum_{k'=1}^{N-1}\delta_{k'}$, where $\delta_k = \min\big(\prod_{n=1}^{k}I_n, \prod_{n=k+1}^{N}I_n\big)$. The $\theta_k$s are set proportional to the size of the corresponding mode, $I_k/\prod_{k'=1}^{N}I_{k'}$, for TTRPCA-nG, and proportional to $\alpha_k$ for TTRPCA-G. The remaining parameters are optimized for all methods such that the best results for each method are reported.

¹ You can find our code in github.com/mrsfgl/ttrpca_g

Table 4.2: Denoising performance for real data against varying levels of gross noise c% for various methods.

             c = 5                  c = 20                 c = 35                 c = 50
            RSE    PSNR  SSIM     RSE    PSNR  SSIM     RSE    PSNR  SSIM     RSE    PSNR  SSIM
Observed    0.414  17.57 0.785    0.8276 11.56 0.392    1.09    9.13 0.205    1.306   7.59 0.11
HoRPCA      0.105  29.53 0.928    0.211  23.42 0.791    0.288  20.74 0.667    0.383  18.25 0.53
TTRPCA      0.106  29.42 0.929    0.199  23.49 0.83     0.277  21.07 0.747    0.378  18.37 0.63
TTRPCA-nG   0.103  29.89 0.926    0.164  25.63 0.852    0.237  22.42 0.77     0.331  19.52 0.65

Figure 4.1: Phase diagrams for missing data recovery. (a) HoRPCA, (b) TTRPCA, (c) TTRPCA-G, (d) TTRPCA-nG.

4.3.1 Synthetic Data

Tensors with TT structure were generated by simulating each tensor factor $\mathcal{U}_n$, $\forall n \in \{1,\ldots,N\}$, as i.i.d. Gaussian with mean 0 and covariance $\mathbf{I}$, and merging them. For the sake of simplicity, all modes have the same size and rank, i.e., $I_n = I$, $\forall n$, and $r_n = r$, $\forall n \ne N$. In our experiments, $N$ is selected to be 4 and $I = 10$, i.e., $\mathcal{Y} \in \mathbb{R}^{10\times10\times10\times10}$, and $r$ is varied in the range of 1 to 9. After generating synthetic data, we simulate missing data by setting $m\%$ of the entries to zero. In Fig. 4.1, we demonstrate the phase diagrams for tensor completion with various methods. By varying the rank $r$ of the simulated tensors and $m$, we illustrate the robustness of all of the methods against increasing ranks and missing data. It can be seen that TTRPCA outperforms HoRPCA as the underlying structure is better explained by TTNN. The proposed TTRPCA-G and TTRPCA-nG further improve the performance by incorporating the underlying manifold information.

Next, we evaluate the robustness of the proposed method against sparse outliers by replacing a constant $c\%$ of the entries by random values sampled uniformly from $[0, 1]$. For this set of experiments, $m = 0\%$ and $r = 4$. In Table 4.1, we summarize the results. It can be seen that graph regularization improves the performance at all outlier levels. In particular, TTRPCA is better than HoRPCA as TTNN captures the low-rank structure better than SNN, and graph regularization further improves the results by taking the local geometry into account. Moreover, the results show that the performances of TTRPCA-G and TTRPCA-nG are close to each other, which indicates that the two methods capture the same geometry.
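A TT-structured synthetic tensor of the kind described above can be generated along the following lines. This is an illustrative sketch: the masking convention, seed and rank shown here are assumptions, and the thesis experiments use their own code.

```python
import numpy as np

def random_tt_tensor(dims, rank, rng):
    """Merge i.i.d. Gaussian TT cores G_n of shape (r_{n-1}, I_n, r_n) into a full tensor."""
    ranks = [1] + [rank] * (len(dims) - 1) + [1]
    cores = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1]))
             for n in range(len(dims))]
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))  # contract shared TT rank
    return out.reshape(dims)

rng = np.random.default_rng(0)
Y = random_tt_tensor((10, 10, 10, 10), rank=4, rng=rng)  # N = 4, I = 10, r = 4

# Simulate m% missing entries by zeroing them out (observation mask Omega)
m = 0.3
mask = rng.random(Y.shape) > m   # True where the entry is observed
Y_obs = Y * mask
```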
4.3.2 Real Data The algorithms are also tested on a subset of 40 objects from COIL dataset [133]. The color images are converted to grayscale and downsampled to a size of 16×16. For each object, each sample image corresponds to a different pose angle ranging from 0 to 360 degrees with increments of 10 degrees [133]. Thus, we create tensors of size 16 × 16 × 36 × 40. We then corrupt the tensor by randomly selecting 𝑐% of all entries and setting them to 0 or 1. Note that when using this data the adjacency matrix of the fourth mode canonical graph is of size 368640 × 368640. This makes TTRPCA-G computationally expensive, so we only use TTRPCA-nG. From Table 4.2, it can be seen that the proposed method outperforms other methods in denoising for varying levels of gross noise. TTRPCA-nG can capture the data structure better than the other methods as it simultaneously minimizes TTNN and considers the underlying manifold across each mode. 4.4 Conclusions In this chapter, we proposed two graph regularized robust tensor train principal component analysis methods. In the first method, we utilized canonical unfoldings to construct the mode-𝑛 graphs while in the second method mode-𝑛 unfoldings are used. We derived an equivalence between mode-𝑛 and canonical graphs with the assumption that the canonical graph has a specific Kronecker structure. 81 The proposed methods outperformed both robust Tucker and tensor-train decomposition methods in denoising and completion tasks. Experiments on synthetic data show that the performances of graph regularization with canonical unfolding and mode-𝑛 unfolding were similar to each other while mode- 𝑛 graphs provided much lower memory requirements and computational complexity. Future work will consider capturing low-rank structure through graph total variation minimization as suggested in [154] to reduce the computational complexity of low-rank tensor recovery task. 82 CHAPTER 5 COUPLED SUPPORT TENSOR MACHINE 5.1 Introduction Advances in clinical neuroimaging and computational bioinformatics have dramatically increased our under- standing of various brain functions using multiple modalities such as Magnetic Resonance Imaging (MRI), functional Magnetic Resonance Imaging (fMRI), electroencephalogram (EEG), and Positron Emission To- mography (PET). Their strong connections to the patients’ biological status and disease pathology suggest the great potential of their predictive power in disease diagnostics. Numerous studies using vector- and tensor-based statistical models illustrate how to utilize these imaging data both at the voxel- and Region-of- Interest (ROI) level and develop efficient biomarkers that predict disease status. For example, [9] proposes a classification model using functional connectivity MRI for autism disease and reaches 89% diagnostic accuracy for subjects under 20. [152] utilizes network models and brain imaging data to develop novel biomarkers for Parkinson’s disease. Many works in Alzheimer’s disease research such as [129, 122, 89, 66, 124, 54, 116] use EEG, MRI and PET imaging data to predict patient’s cognition and detect early-stage Alzheimer’s diseases. Although these studies have provided impressive results, utilizing imaging data from a single modality such as individual MRI sequences are known to have limited predictive capacity, especially in the early phases of the disease. For instance, [116] uses brain MRI volumes from regions of interest to identify patients in early-stage Alzheimer’s disease. 
They use one-year MRI data from Alzheimer’s Disease Neuroimaging Initiative (ADNI) and obtain 77% prediction accuracy. Although such a performance is favorable compared to other existing approaches, the diagnostic accuracy is relatively low due to the limited information from MRI data. In recent years, it has been common to acquire multiple neuroimaging modalities in clinical studies such as simultaneous EEG-fMRI or MRI and fMRI. Even though each modality measures different biological signals, they are interdependent and mutually informative. Learning from multimodal neuroimaging data may help integrate information from multiple sources and facilitate biomarker develop- ments in clinical studies. It also raises the need for novel supervised learning techniques for multimodal data in statistical learning literature. Existing statistical approaches to multimodal data science are dominated by unsupervised learning 83 methods. These methods analyze multimodal neuroimaging data by performing joint matrix decomposition and extracting common information across different modalities. During optimization, the decomposed factors bridging two or more modalities are estimated to interpret the connections between different modalities. Examples of these methods include matrix-based joint Independent Component Analysis (jICA) [29, 72, 107, 121, 165, 6] which assume bilinear correlations between factors in different modalities. However, these matrix-vector based models cannot preserve the multilinear nature of original data and the spatiotemporal correlations across modes as most neuroimaging modalities are naturally in tensor format. Recently, various coupled matrix-tensor decomposition methods have been introduced to address this issue [4, 5, 6, 37, 36, 86, 130]. These methods impose different soft or hard multilinear constraints between factors from different modalities providing more flexibility in data modeling. Current supervised learning approaches for multimodal data mostly concatenate data modalities as extra features without exploring their interdependence. For example, [211, 114] build generalized regression mod- els by appending tensor and vector predictors linearly for image prediction and classification. [142] develops a discriminant analysis by including tensor and vector predictors in a linear fashion. [111] proposes an integrative factor regression for multimodal neuroimaging data assuming that data from different modalities can be decomposed into latent factors. More recently, [65] proposed multiple tensor-on-tensor regression for multimodal data, which combines tensor-on-tensor regression from [123] with traditional additive linear model. Another type of integration utilizes kernel tricks and combines information from multimodal data with multiple kernels. [71] provides a survey on various multiple kernel learning techniques for multimodal data fusion and classification with support vector machines. Combining kernels linearly or non-linearly instead of original data in different modalities provides more flexibility in information integration. [12] proposed a multiple kernel regression model with group lasso penalty, which integrates information by multiple kernels and selects the most predictive data modalities. Despite these accomplishments, the current approaches have several shortcomings. First, they mainly focus on exploring the interdependence between multimodal imaging data, ignoring the representative and discriminative power of the learned components. 
Thus, the methods cannot further bridge the imaging data to the patients’ biological status, which is not helpful in biomarker development. Second, the supervised techniques such as integrate information primarily by data or feature concatenation without explicitly consid- ering the possible correlations between different modalities. This lack of consideration of interdependence may cause issues like overfitting and parameter identifiability. Third, even though methods from [111, 84 65] have considered latent structures for multimodal data, these models are designed primarily for linear regression and are not directly applicable to classification problems. Fourth, the aforementioned multimodal analysis methods are mainly vector based methods, which cannot handle large-size multi-dimensional data encountered in contemporary data science. As discussed in [21], tensors provide a powerful tool for analyz- ing multi-dimensional data in statistics. As a result, developing a novel multimodal tensor-based statistical framework for supervised learning can be of great interest. Finally, although many empirical studies demon- strate the success of using multimodal data, there is a lack of mathematical and statistical clarity to the extent of generalizability and associated uncertainties. The absence of a solid statistical framework for multimodal data analysis makes it impossible to interpret the generalization ability of a certain statistical model. In this chapter, we propose a two-stage Coupled Support Tensor Machine (C-STM) for multimodal tensor- based neuroimaging data classification. The proposed model addresses the current issues in multimodal data science and provides a sound statistical framework to interpret the interdependence between modalities and quantify the model consistency and generalization ability. The major contributions of this chapter are: 1. Individual and common latent factors are extracted from multimodal tensor data, for each sample or subject, using Advanced Coupled Matrix Tensor Factorization (ACMTF) [4, 3]. The extracted components are then utilized in a statistical framework. Most of the work on ACMTF do not focus on each subject separately and the extracted factors are utilized for a signal analysis rather than a subsequent statistical learning framework. Specifically, the work on supervised approaches with CMTF is limited. 2. Building a novel Coupled Support Tensor Machine with both the coupled and non-coupled tensor CP factors for classification. In this regard, multiple kernel learning approaches are adopted to integrate components from multi-modal data. 3. For the validation of our work, we provide both theoretical and empirical evidence. We provide theoretical results such as classification consistency for statistical guarantee. A thorough numerical study has been conducted, including a simulation study and experiments on real data to illustrate the usefulness of the proposed methodology. A Matlab package is also provided in the supplemental material, including all functions for C-STM classifi- 85 cation. The source codes are available at our Github repository 1. 5.2 Related Work In this section, we review some background and prior work on coupled tensor decomposition and multiple kernel learning. 
5.2.1 Coupled Matrix Tensor Factorization

Motivated by the fact that joint analysis of data from multiple sources can potentially unveil complex data structures and provide more information, Coupled Matrix Tensor Factorization (CMTF) [2] was proposed for multimodal data fusion. CMTF estimates the underlying latent factors for both tensor and matrix data simultaneously by taking the coupling between tensor and matrix data into account. This feature makes CMTF a promising model in analyzing heterogeneous data, which generally have different structures and modalities. During latent factor estimation, CMTF solves an objective function that approximates a CP decomposition for the tensor modality and a singular value decomposition for the second modality, with the assumption that the factors from one mode of each modality are the same. Given $\mathcal{X}_1 \in \mathbb{R}^{I_1\times I_2\times\ldots\times I_d}$ and $X_2 \in \mathbb{R}^{I_1\times J_2}$, without loss of generality assume that the factors from the first mode of the tensor $\mathcal{X}_1$ span the column space of the matrix $X_2$. CMTF then tries to estimate all factors by minimizing:
$$Q(\mathfrak{U}_1,\mathfrak{U}_2) = \frac{1}{2}\|\mathcal{X}_1 - [[X_1^{(1)}, X_1^{(2)}, \ldots, X_1^{(d)}]]\|_F^2 + \frac{1}{2}\|X_2 - X_2^{(1)}X_2^{(2)\top}\|_F^2, \quad \text{s.t. } X_1^{(1)} = X_2^{(1)}, \qquad (5.1)$$
where $X_p^{(m)}$ are the factor matrices for modality $p$ and mode $m$. The factor matrices $X_1^{(1)} = X_2^{(1)}$ are the coupled factors between the tensor and matrix data. An illustration of this coupling is given in Figure 5.1. These factor matrices can also be represented in Kruskal form, $\mathfrak{U}_1 = [[X_1^{(1)}, X_1^{(2)}, \ldots, X_1^{(d)}]]$ and $\mathfrak{U}_2 = [[X_2^{(1)}, X_2^{(2)}]]$. By minimizing the objective function $Q(\mathfrak{U}_1,\mathfrak{U}_2)$, CMTF estimates latent factors for the tensor and matrix data jointly, which allows it to utilize information from both modalities. [2] uses a gradient descent algorithm to optimize the objective function (5.1). Although this model is formulated for the joint decomposition of a $d$th order tensor and a matrix, extensions to two or more tensors with couplings across multiple modes are possible.

¹ https://github.com/PeterLiPeide/Coupled_MatrixTensor_SupportTensor_Machine

Figure 5.1: Illustration of the Coupled Tensor Matrix Model.

In real data, couplings across different modalities might include shared or modality-specific (individual) components. Shared components correspond to those columns of the factor matrices that contribute to the decomposition of both modalities, while individual components carry information unique to the corresponding modality. Although CMTF provides a successful framework for joint data analysis, it often fails to obtain a unique estimation of shared or individual components. As a result, any further statistical analysis and learning from the CMTF estimation will suffer from the uncertainty in the latent factors. To address this issue, [3] proposed Advanced Coupled Matrix Tensor Factorization (ACMTF) by introducing a sparsity penalty on the weights of the latent factors in the objective function (5.1), and restricting the norms of the columns of the factors to be unity to provide uniqueness up to a permutation. This modification provides a more precise estimation of the latent factors compared to CMTF ([3, 5]). In our framework, we utilize ACMTF to extract the latent factors, which are in turn used to build a classifier for multimodal data.
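To make the coupled model concrete, the following NumPy sketch builds a tensor and a matrix that share a first-mode factor and evaluates the objective in (5.1). Sizes, the rank, and the variable names are assumptions for illustration only; they are not taken from the thesis software.

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3, J2, r = 20, 15, 10, 12, 3     # assumed sizes and rank

A = rng.standard_normal((I1, r))          # X1^(1) = X2^(1): coupled factor
B = rng.standard_normal((I2, r))          # X1^(2)
C = rng.standard_normal((I3, r))          # X1^(3)
V = rng.standard_normal((J2, r))          # X2^(2)

def kruskal_to_tensor(A, B, C):
    """Full tensor of the Kruskal form [[A, B, C]] (sum of rank-1 terms)."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

X1 = kruskal_to_tensor(A, B, C)           # tensor modality
X2 = A @ V.T                              # matrix modality, coupled through A

def cmtf_objective(X1, X2, A, B, C, V):
    """Q(U1, U2) from (5.1) with the coupling X1^(1) = X2^(1) = A enforced."""
    res1 = X1 - kruskal_to_tensor(A, B, C)
    res2 = X2 - A @ V.T
    return 0.5 * np.sum(res1 ** 2) + 0.5 * np.sum(res2 ** 2)

print(cmtf_objective(X1, X2, A, B, C, V))  # essentially 0 at the true factors
```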
5.2.2 CP-STM for Tensor Classification

CP-STM has been previously studied by [169, 76, 77] and uses CP tensors to construct STM types of models. Assume there is a collection of data $T_n = \{(\mathcal{X}_1, y_1), (\mathcal{X}_2, y_2), \ldots, (\mathcal{X}_n, y_n)\}$, where the $\mathcal{X}_t \in \mathbb{X} \subset \mathbb{R}^{I_1\times I_2\times\cdots\times I_d}$ are $d$-way tensors, $\mathbb{X}$ is a compact tensor space which is a subspace of $\mathbb{R}^{I_1\times I_2\times\cdots\times I_d}$, and $y_t \in \{-1, 1\}$ are binary labels. CP-STM assumes the tensor predictors are in CP format, and can be classified by the function which minimizes the objective function
$$\min_{f} \; \lambda\|f\|^2 + \frac{1}{n}\sum_{t=1}^{n} L(f(\mathcal{X}_t), y_t). \qquad (5.2)$$
Using the tensor kernel function
$$K(\mathcal{X}_1,\mathcal{X}_2) = \sum_{l,m=1}^{r}\prod_{j=1}^{d} K^{(j)}(\mathbf{x}_{1,l}^{(j)}, \mathbf{x}_{2,m}^{(j)}), \qquad (5.3)$$
where $\mathcal{X}_1 = \sum_{l=1}^{r}\mathbf{x}_{1l}^{(1)}\circ\cdots\circ\mathbf{x}_{1l}^{(d)}$ and $\mathcal{X}_2 = \sum_{l=1}^{r}\mathbf{x}_{2l}^{(1)}\circ\cdots\circ\mathbf{x}_{2l}^{(d)}$, the STM classifier can be written as
$$f(\mathcal{X}) = \sum_{t=1}^{n}\alpha_t y_t K(\mathcal{X}_t,\mathcal{X}) = \boldsymbol{\alpha}^\top D_y\mathbf{K}(\mathcal{X}), \qquad (5.4)$$
where $\mathcal{X}$ is a new $d$-way rank-$r$ tensor, $\boldsymbol{\alpha} = [\alpha_1,\ldots,\alpha_n]^\top$ is the coefficient vector, $D_y$ is a diagonal matrix whose diagonal elements are $y_1,\ldots,y_n$, and $\mathbf{K}(\mathcal{X}) = [K(\mathcal{X}_1,\mathcal{X}),\ldots,K(\mathcal{X}_n,\mathcal{X})]^\top$ is a column vector whose entries are the kernel values computed between the training and test data. We denote the collection of functions of the form (5.4) by $\mathcal{H}$, which is a functional space also known as a Reproducing Kernel Hilbert Space (RKHS). The optimal CP-STM classifier $f \in \mathcal{H}$ can be estimated by plugging (5.4) into the objective function (5.2) and minimizing it with the Hinge or Squared Hinge loss. The coefficients of the optimal CP-STM model are denoted by $\boldsymbol{\alpha}^*$. The classification model is statistically consistent if the tensor kernel function satisfies the universal approximating property, as shown by [109].

5.2.3 Multiple Kernel Learning

Multiple kernel learning (MKL) creates new kernels using a linear or non-linear combination of single kernels to measure inner products between data. Statistical learning algorithms such as support vector machines and kernel regression can then utilize the new combined kernels instead of single kernels to obtain better learning results and avoid the potential bias of kernel selection ([71]). A more important and related reason for using MKL is that different kernels can take inputs from various data representations, possibly from different sources or modalities. Thus, combining kernels and using MKL is one possible way of integrating multiple information sources. Given a collection of kernel functions $\{K_1(\cdot,\cdot),\ldots,K_m(\cdot,\cdot)\}$, a new kernel function can be constructed by
$$K(\cdot,\cdot) = f_\eta\big(\{K_1(\cdot,\cdot),\ldots,K_m(\cdot,\cdot)\}\,|\,\eta\big), \qquad (5.5)$$
where $f_\eta$ is a linear or non-linear function and $\eta$ is a vector whose elements are the weights for the kernel combination. Linear combination methods are the most popular in multiple kernel learning, where the kernel function is parameterized as
$$K(\cdot,\cdot) = f_\eta\big(\{K_1(\cdot,\cdot),\ldots,K_m(\cdot,\cdot)\}\,|\,\eta\big) = \sum_{l=1}^{m}\eta_l K_l(\cdot,\cdot). \qquad (5.6)$$
The weight parameters $\eta_l$ can simply be assumed to be the same (unweighted) ([145, 16]), or be determined by looking at some performance measures for each kernel or data representation ([167, 147]). There are a few more advanced approaches, such as optimization-based, Bayesian, and boosting approaches, that can also be adopted ([103, 64, 176, 92, 69, 41, 19]). In this chapter, we only consider the linear combination (5.6), and select the weight parameters in a heuristic, data-driven way to construct our C-STM model.
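For concreteness, the CP kernel in (5.3) can be evaluated directly from the factor matrices of the two tensors. The sketch below is illustrative only: the Gaussian factor kernel, the random rank-3 factors, and the function names are assumptions, and this is not the thesis MATLAB package.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """Gaussian kernel between two factor vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def cp_stm_kernel(factors1, factors2, gamma=1.0):
    """Tensor kernel (5.3): sum over rank pairs of products of mode-wise kernels.

    factors1, factors2: lists of d factor matrices of shape (I_j, r),
    one per mode, holding the CP factors of the two tensors.
    """
    r1, r2 = factors1[0].shape[1], factors2[0].shape[1]
    val = 0.0
    for l in range(r1):
        for m in range(r2):
            prod = 1.0
            for U, V in zip(factors1, factors2):   # loop over modes j = 1..d
                prod *= rbf(U[:, l], V[:, m], gamma)
            val += prod
    return val

# Example with two random rank-3 CP tensors of size 30 x 20 x 10
rng = np.random.default_rng(0)
f1 = [rng.standard_normal((I, 3)) for I in (30, 20, 10)]
f2 = [rng.standard_normal((I, 3)) for I in (30, 20, 10)]
print(cp_stm_kernel(f1, f2, gamma=0.1))
```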
5.3 Methods

Let $T_n = \{(\mathcal{X}_{1,1}, X_{1,2}, y_1), \ldots, (\mathcal{X}_{n,1}, X_{n,2}, y_n)\}$ be the training data, where each sample $t \in \{1,\ldots,n\}$ has two data modalities $\mathcal{X}_{t,1}$, $X_{t,2}$ and a corresponding binary label $y_t \in \{1, -1\}$. In this chapter, following [2], we assume that the first data modality is a third-order tensor, $\mathcal{X}_{t,1} \in \mathbb{R}^{I_1\times I_2\times I_3}$, and the other is a matrix, $X_{t,2} \in \mathbb{R}^{I_4\times I_3}$. The third mode of $\mathcal{X}_{t,1}$ and the second mode of $X_{t,2}$ are assumed to be coupled for each $t$, i.e., the factor matrix is assumed to be fully or partially shared across these modes. Utilizing this coupling, one can extract factors that better represent the underlying structure of the data, and preserve and utilize the discriminative power of the factors from both modalities. Our approach, C-STM, consists of two stages: multimodal tensor factorization, i.e., ACMTF, and a coupled support tensor machine, as illustrated in Figure 5.2. In this section, we present both stages and the corresponding procedures.

Figure 5.2: C-STM Model Pipeline.

5.3.1 Multimodal Tensor Factorization

In this chapter, the first aim is to perform a joint factorization across the two modalities for each training sample $t$. Let $\mathfrak{U}_{t,1} = [[\zeta; X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}]]$ denote the Kruskal tensor of $\mathcal{X}_{t,1}$, and $\mathfrak{U}_{t,2} = [[\sigma; X_{t,2}^{(1)}, X_{t,2}^{(2)}]]$ denote the singular value decomposition of $X_{t,2}$. The weights of the columns of each factor matrix $X_{t,p}^{(m)}$, where $p$ is the index for modality and $m$ denotes the mode, are denoted by $\zeta$ and $\sigma$, and the norms of these columns are constrained to be 1 to avoid redundancy. The objective function of ACMTF [4, 3] is then given by:
$$Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2}) = \gamma\|\mathcal{X}_{t,1} - [[\zeta; X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}]]\|_F^2 + \gamma\|X_{t,2} - X_{t,2}^{(1)}\boldsymbol{\Sigma}X_{t,2}^{(2)\top}\|_F^2 + \beta\|\zeta\|_0 + \beta\|\sigma\|_0$$
$$\text{s.t. } X_{t,1}^{(3)} = X_{t,2}^{(2)}, \quad \|\mathbf{x}_{t,1,k}^{(1)}\|_2 = \|\mathbf{x}_{t,1,k}^{(2)}\|_2 = \|\mathbf{x}_{t,1,k}^{(3)}\|_2 = \|\mathbf{x}_{t,2,k}^{(1)}\|_2 = \|\mathbf{x}_{t,2,k}^{(2)}\|_2 = 1, \;\forall k\in\{1,\ldots,r\}, \qquad (5.7)$$
where $\boldsymbol{\Sigma}$ is a diagonal matrix whose elements are the singular values $\sigma$ of the matrix $X_{t,2}$ and $\mathbf{x}_{t,m,k}^{(j)} \in \mathbb{R}^{I_j}$ denotes the columns of the factor matrices for the object $\mathcal{X}_{t,m}$. The objective function in (5.7) includes penalties on the number of non-zero weights in both the tensor and matrix decompositions. Thus, the model identifies the shared and individual components. These factors are then considered as different data representations for the multimodal data, and are used to predict the labels $y_t$ in the C-STM classifier.

5.3.2 Coupled Support Tensor Machine (C-STM)

C-STM uses the idea of multiple kernel learning and considers the coupled and uncoupled factors from the ACMTF decomposition as various data representations. As a result, we use three different kernel functions to measure their similarity, i.e., inner products. One can think of these three kernels as inducing three different feature maps transforming multimodal factors into different feature spaces. In each feature space, the corresponding kernel measures the similarity between factors in this specific data modality. The similarities of multimodal factors are then integrated by combining the kernel measures through a non-linear combination. This combination should be able to take individual and shared components into account separately for better adaptability depending on the size of and corruption on the data, as the coupled modes are likely to be better estimated than the individual modes. Thus, we use tensor kernels for the individual modes of each modality and combine these with the kernels of the coupled modes, as illustrated in Figure 5.2.
The kernel function for C-STM is defined as
$$K\big((\mathcal{X}_{t,1}, X_{t,2}), (\mathcal{X}_{i,1}, X_{i,2})\big) = K\big((\mathfrak{U}_{t,1},\mathfrak{U}_{t,2}), (\mathfrak{U}_{i,1},\mathfrak{U}_{i,2})\big) = \sum_{k,l=1}^{r}\Big[w_1 K_1^{(1)}(\mathbf{x}_{t,1,k}^{(1)},\mathbf{x}_{i,1,l}^{(1)})\,K_1^{(2)}(\mathbf{x}_{t,1,k}^{(2)},\mathbf{x}_{i,1,l}^{(2)}) + w_2 K_1^{(3)}(\mathbf{x}_{t,1,k}^{(3)*},\mathbf{x}_{i,1,l}^{(3)*}) + w_3 K_2^{(1)}(\mathbf{x}_{t,2,k}^{(1)},\mathbf{x}_{i,2,l}^{(1)})\Big] \qquad (5.8)$$
for two pairs of decomposed tensor matrix factors $(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})$ and $(\mathfrak{U}_{i,1},\mathfrak{U}_{i,2})$. Here $\mathbf{x}_{t,1,k}^{(3)*}$ is the average of the estimated shared factors, $\frac{1}{2}[\mathbf{x}_{t,1,k}^{(3)} + \mathbf{x}_{t,2,k}^{(2)}]$. This kernel is inspired by the idea of multiple kernel learning with a linear combination of multiple kernels for multimodal data. A few more details regarding the choice of this kernel combination are provided in Section 5.5.2.1. $w_1$, $w_2$, and $w_3$ are the three weight parameters combining the three kernel functions. As discussed in [71], there is no unique choice for determining these weights; in this chapter, we adopt a cross-validation approach as explained in Appendix C.2. With the kernel function in (5.8), the C-STM model tries to estimate a bivariate decision function $f$ from a collection of functions $\mathcal{H}$ such that
$$f = \arg\min_{f\in\mathcal{H}} \; \lambda\|f\|^2 + \frac{1}{n}\sum_{t=1}^{n}L(f(\mathcal{X}_t), y_t), \qquad (5.9)$$
where $L(\mathcal{X}_t, y_t) = \max\big(0, 1 - f(\mathcal{X}_t)\cdot y_t\big)$ is the Hinge loss. $\mathcal{H}$ is defined as the collection of all functions of the form
$$f(\mathcal{X}_1, X_2) = \sum_{t=1}^{n}\alpha_t y_t K\big((\mathcal{X}_{t,1}, X_{t,2}), (\mathcal{X}_1, X_2)\big) = \boldsymbol{\alpha}^\top D_y\mathbf{K}(\mathcal{X}_1, X_2) \qquad (5.10)$$
due to the well-known representer theorem ([10]), for any pair of test data $(\mathcal{X}_1, X_2)$ and for $\boldsymbol{\alpha} \in \mathbb{R}^n$. For all possible values of $\boldsymbol{\alpha}$, equation (5.10) defines the collection $\mathcal{H}$. $D_y$ is a diagonal matrix whose diagonal elements are the labels from the training data $T_n$. $\mathbf{K}(\mathcal{X}_1, X_2)$ is an $n\times 1$ vector whose $t$-th element is $K\big((\mathcal{X}_{t,1}, X_{t,2}), (\mathcal{X}_1, X_2)\big)$. The optimal C-STM decision function, denoted by $f_n = \boldsymbol{\alpha}^{*\top}D_y\mathbf{K}(\mathcal{X}_1, X_2)$, can be estimated by solving the quadratic programming problem
$$\min_{\boldsymbol{\alpha}\in\mathbb{R}^n} \; \frac{1}{2}\boldsymbol{\alpha}^\top D_y\mathbf{K}D_y\boldsymbol{\alpha} - \mathbf{1}^\top\boldsymbol{\alpha}, \quad \text{s.t. } \boldsymbol{\alpha}^\top\mathbf{y} = 0, \quad \mathbf{0} \preceq \boldsymbol{\alpha} \preceq \frac{1}{2n\lambda}, \qquad (5.11)$$
where $\mathbf{K}$ is the kernel matrix constructed by the function (5.8). Problem (5.11) is the dual problem of (5.9), and its optimal solution $\boldsymbol{\alpha}^*$ also minimizes the objective function (5.9) when plugging in functions of the form (5.10). For a new pair of test points $(\mathcal{X}_1, X_2)$, the class label is predicted as $\operatorname{sign}(f_n(\mathcal{X}_1, X_2))$.

5.4 Model Estimation

In this section, we first present the estimation procedure for the coupled tensor matrix decomposition (5.7), and then combine it with the classification procedure to summarize the algorithm for C-STM. To satisfy the constraints in the objective function (5.7), the function $Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})$ is converted to a differentiable and unconstrained form given by:
$$Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2}) = \gamma\|\mathcal{X}_{t,1} - [[\zeta; X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}]]\|_F^2 + \gamma\|X_{t,2} - X_{t,2}^{(1)}\boldsymbol{\Sigma}X_{t,2}^{(2)\top}\|_F^2 + \xi\|X_{t,1}^{(3)} - X_{t,2}^{(2)}\|_F^2$$
$$+ \sum_{k=1}^{r}\Big[\beta\sqrt{\zeta_k^2+\epsilon} + \beta\sqrt{\sigma_k^2+\epsilon} + \theta\big((\|\mathbf{x}_{t,1,k}^{(1)}\|_2-1)^2 + (\|\mathbf{x}_{t,1,k}^{(2)}\|_2-1)^2 + (\|\mathbf{x}_{t,1,k}^{(3)}\|_2-1)^2 + (\|\mathbf{x}_{t,2,k}^{(1)}\|_2-1)^2 + (\|\mathbf{x}_{t,2,k}^{(2)}\|_2-1)^2\big)\Big], \qquad (5.12)$$
where the $\ell_1$ norm penalties in (5.7) are replaced with differentiable approximations; $\xi$ and $\theta$ are Lagrange multipliers and $\epsilon > 0$ is a very small number. This unconstrained optimization problem can be solved by nonlinear conjugate gradient descent ([2, 4, 130]).
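As a rough illustration of this step, the sketch below hands a coupled factorization objective and its gradient to a general-purpose nonlinear conjugate gradient solver. Several simplifications are assumptions: for brevity it fits the simpler hard-coupled objective (5.1) rather than the full penalized form (5.12), SciPy's 'CG' method uses Polak-Ribiere rather than the Hestenes-Stiefel updates of Algorithm 5.1 below, and the helper names are invented for this example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
I1, I2, I3, J2, r = 15, 12, 10, 8, 3                    # assumed sizes and rank
A0, B0, C0, V0 = (rng.standard_normal(s) for s in [(I1, r), (I2, r), (I3, r), (J2, r)])
X1 = np.einsum('ir,jr,kr->ijk', A0, B0, C0)             # tensor modality
X2 = A0 @ V0.T                                          # matrix modality (coupled via A)

shapes = [(I1, r), (I2, r), (I3, r), (J2, r)]
split = np.cumsum([np.prod(s) for s in shapes])[:-1]

def unpack(theta):
    return [p.reshape(s) for p, s in zip(np.split(theta, split), shapes)]

def fun(theta):
    A, B, C, V = unpack(theta)
    R1 = np.einsum('ir,jr,kr->ijk', A, B, C) - X1
    R2 = A @ V.T - X2
    return 0.5 * np.sum(R1 ** 2) + 0.5 * np.sum(R2 ** 2)

def grad(theta):
    A, B, C, V = unpack(theta)
    R1 = np.einsum('ir,jr,kr->ijk', A, B, C) - X1
    R2 = A @ V.T - X2
    gA = np.einsum('ijk,jr,kr->ir', R1, B, C) + R2 @ V   # coupled factor gets both terms
    gB = np.einsum('ijk,ir,kr->jr', R1, A, C)
    gC = np.einsum('ijk,ir,jr->kr', R1, A, B)
    gV = R2.T @ A
    return np.concatenate([g.ravel() for g in (gA, gB, gC, gV)])

theta0 = rng.standard_normal(split[-1] + np.prod(shapes[-1]))
res = minimize(fun, theta0, jac=grad, method='CG', options={'maxiter': 500})
print(res.fun)   # final objective; small when the coupled factors are well recovered
```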
Let $\mathcal{T}_t$ be the full tensor of $\mathfrak{U}_{t,1}$ (created by converting the Kruskal tensor, i.e., the factor matrices, into multidimensional array form), and let $M_t = X_{t,2}^{(1)}\boldsymbol{\Sigma}X_{t,2}^{(2)\top}$. The partial derivative with respect to each latent factor can be derived as follows:
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta X_{t,1}^{(1)}} = \gamma(\mathcal{T}_t-\mathcal{X}_{t,1})_{(1)}\big(\zeta^\top\odot X_{t,1}^{(3)}\odot X_{t,1}^{(2)}\big) + \theta\big(X_{t,1}^{(1)}-\bar{X}_{t,1}^{(1)}\big), \qquad (5.13)$$
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta X_{t,1}^{(2)}} = \gamma(\mathcal{T}_t-\mathcal{X}_{t,1})_{(2)}\big(\zeta^\top\odot X_{t,1}^{(3)}\odot X_{t,1}^{(1)}\big) + \theta\big(X_{t,1}^{(2)}-\bar{X}_{t,1}^{(2)}\big), \qquad (5.14)$$
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta X_{t,1}^{(3)}} = \gamma(\mathcal{T}_t-\mathcal{X}_{t,1})_{(3)}\big(\zeta^\top\odot X_{t,1}^{(2)}\odot X_{t,1}^{(1)}\big) + \xi\big(X_{t,1}^{(3)}-X_{t,2}^{(2)}\big) + \theta\big(X_{t,1}^{(3)}-\bar{X}_{t,1}^{(3)}\big), \qquad (5.15)$$
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta X_{t,2}^{(1)}} = \gamma(M_t-X_{t,2})X_{t,2}^{(2)}\boldsymbol{\Sigma} + \theta\big(X_{t,2}^{(1)}-\bar{X}_{t,2}^{(1)}\big), \qquad (5.16)$$
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta X_{t,2}^{(2)}} = \gamma(M_t-X_{t,2})^\top X_{t,2}^{(1)}\boldsymbol{\Sigma} + \xi\big(X_{t,2}^{(2)}-X_{t,1}^{(3)}\big) + \theta\big(X_{t,2}^{(2)}-\bar{X}_{t,2}^{(2)}\big), \qquad (5.17)$$
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta\sigma_k} = \mathbf{x}_{t,2,k}^{(1)\top}(M_t-X_{t,2})\mathbf{x}_{t,2,k}^{(2)} + \frac{\beta\,\sigma_k}{2\sqrt{\sigma_k^2+\epsilon}}, \quad k\in\{1,\ldots,r\}, \qquad (5.18)$$
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta\zeta_k} = \operatorname{vec}(\mathcal{T}_t-\mathcal{X}_{t,1})^\top\big(\mathbf{x}_{t,1,k}^{(3)}\odot\mathbf{x}_{t,1,k}^{(2)}\odot\mathbf{x}_{t,1,k}^{(1)}\big) + \frac{\beta\,\zeta_k}{2\sqrt{\zeta_k^2+\epsilon}}, \quad k\in\{1,\ldots,r\}, \qquad (5.19)$$
where $\operatorname{vec}(\cdot)$ is a vectorization operator that stacks all elements of its operand into a column vector, $\mathcal{T}_{(j)}$ denotes the mode-$j$ unfolding of a tensor $\mathcal{T}$, $\odot$ denotes the Khatri-Rao product, and $\bar{M}$ denotes the column-normalized version of a matrix $M$, whose columns have unit $\ell_2$ norms. By combining all of the partial derivatives, the gradient of the objective function is given by:
$$\nabla Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2}) = \Big[\frac{\delta Q}{\delta X_{t,1}^{(1)}}, \frac{\delta Q}{\delta X_{t,1}^{(2)}}, \frac{\delta Q}{\delta X_{t,1}^{(3)}}, \frac{\delta Q}{\delta X_{t,2}^{(1)}}, \frac{\delta Q}{\delta X_{t,2}^{(2)}}, \frac{\delta Q}{\delta\zeta_1},\ldots, \frac{\delta Q}{\delta\sigma_1},\ldots\Big]^\top, \qquad (5.20)$$
which is a $(2r+5)$-dimensional block vector. As mentioned in [4], a nonlinear conjugate gradient method with Hestenes-Stiefel updates is used to optimize (5.12). The procedure is described in Algorithm 5.1.

Algorithm 5.1: ACMTF Decomposition
1: Input: multimodal data $(\mathcal{X}_1, X_2)$, $r$, $\eta$, $S$ (upper limit on the number of iterations)
2: Output: $\mathfrak{U}^*_{t,1}$, $\mathfrak{U}^*_{t,2}$
3: $\mathfrak{U}_{t,1}, \mathfrak{U}_{t,2} = \mathfrak{U}^0_{t,1}, \mathfrak{U}^0_{t,2}$   ⊲ initial value
4: $\Delta_0 = -\nabla Q(\mathfrak{U}^0_{t,1}, \mathfrak{U}^0_{t,2})$
5: $\varphi_0 = \arg\min_\varphi Q\big((\mathfrak{U}^0_{t,1}, \mathfrak{U}^0_{t,2}) + \varphi\Delta_0\big)$
6: $\mathfrak{U}^1_{t,1}, \mathfrak{U}^1_{t,2} = (\mathfrak{U}^0_{t,1}, \mathfrak{U}^0_{t,2}) + \varphi_0\Delta_0$
7: $\mathbf{g}_0 = \Delta_0$
8: while $s < S$ and $\|Q(\mathfrak{U}^s_{t,1}, \mathfrak{U}^s_{t,2}) - Q(\mathfrak{U}^{s-1}_{t,1}, \mathfrak{U}^{s-1}_{t,2})\| > \eta$ do
9:   $\Delta_{s+1} = -\nabla Q(\mathfrak{U}^s_{t,1}, \mathfrak{U}^s_{t,2})$
10:  $\mathbf{g}_{s+1} = \Delta_{s+1} + \dfrac{\Delta_{s+1}^\top(\Delta_{s+1}-\Delta_s)}{-\mathbf{g}_s^\top(\Delta_{s+1}-\Delta_s)}\mathbf{g}_s$
11:  $\varphi_{s+1} = \arg\min_\varphi Q\big((\mathfrak{U}^s_{t,1}, \mathfrak{U}^s_{t,2}) + \varphi\mathbf{g}_{s+1}\big)$
12:  $\mathfrak{U}^{s+1}_{t,1}, \mathfrak{U}^{s+1}_{t,2} = (\mathfrak{U}^s_{t,1}, \mathfrak{U}^s_{t,2}) + \varphi_{s+1}\mathbf{g}_{s+1}$
13: end while

Once the factors for all data pairs in the training set $T_n$ are extracted, we can create the kernel matrix using the kernel function in (5.8). By solving the quadratic programming problem (5.11), we can obtain the optimal decision function $f_n$. This two-stage procedure for C-STM estimation is summarized in Algorithm 5.2.

Algorithm 5.2: Coupled Support Tensor Machine
1: procedure C-STM
2: Input: training set $T_n = \{(\mathcal{X}_{1,1}, X_{1,2}, y_1), \ldots, (\mathcal{X}_{n,1}, X_{n,2}, y_n)\}$, $\mathbf{y}$, kernel function $K$, $r$, $\lambda$, $\eta$, $S$
3: for $t = 1, 2, \ldots, n$ do
4:   $\mathfrak{U}^*_{t,1}, \mathfrak{U}^*_{t,2} = \text{ACMTF}\big((\mathcal{X}_{t,1}, X_{t,2}), r, \eta, S\big)$
5: end for
6: Create the initial matrix $\mathbf{K} \in \mathbb{R}^{n\times n}$
7: for $t = 1, \ldots, n$ do
8:   for $i = 1, \ldots, n$ do
9:     $\mathbf{K}[i, t] = K\big((\mathfrak{U}_{t,1}, \mathfrak{U}_{t,2}), (\mathfrak{U}_{i,1}, \mathfrak{U}_{i,2})\big)$   ⊲ kernel values
10:    $\mathbf{K}[t, i] = \mathbf{K}[i, t]$
11:  end for
12: end for
13: Solve the quadratic programming problem (5.11) and find the optimal $\boldsymbol{\alpha}^*$.
14: Output: 𝛼∗ 15: end procedure 5.5 Experiments 5.5.1 Parameter Selection 5.5.1.1 Multimodal Tensor Factorization The proposed model requires the selection of three different parameters, namely, 𝛾, 𝛽, and rank 𝑟. To select these parameters, we closely follow best practices outlined in previous work on CMTF [2], ACMTF [4] and CCMTF [130]. First of all, one of these parameter can be set to 1 as a pivot, and following previous work, we set 𝛾 = 1. The selection of rank 𝑟 is directly related to the selection of 𝛽. As 𝛽 enforces sparsity over the singular values, it directly minimizes the rank. With sufficiently large 𝑟, we can estimate the low-rank part through optimization. For the selection of 𝑟 in real data, we set 𝑟 = 5 following the work of [130] where it was shown through CORCONDIA tests [85] that 𝑟 = 3 is sufficiently large for oddball data. In the case of the simulation study, 𝑟 = 5 is again sufficiently large as the data were generated from rank 𝑟 = 3 factors. Finally, based on our empirical results and the results presented in [130] we set 𝛽 = 0.001 using k-fold cross validation. 94 5.5.1.2 C-STM The parameters in C-STM include kernel weights 𝑤 1 , 𝑤 2 , 𝑤 3 and regularization parameter 𝜆 in the optimiza- tion. The weight parameters, normalized such that the ℓ2 -norm is equal to 1, and 𝜆 are selected using 5 fold cross-validation. The overall classification accuracy in our validation set serves as the performance metric and helps us determine the best combination of weights and 𝜆. The selection of weight parameters 𝑤 1 , 𝑤 2 , 𝑤 3 is indeed a problem of how to combine kernels from different modalities. It is straightforward to calculate kernels from every data modality, however, combining them appropriately and effectively would be challenging unless we can find out the weight for each kernel. This problem has been widely studied in the literature of Multiple Kernel Learning. In [71], the authors summarize that the existing methods of kernel weight selection can be divided into five categories, including fixed rules, heuristic approaches, optimization approaches, Bayesian approaches, and boosting approaches. As there is no consensus on the best way to choose the weights, we adopt a cross-validation approach as explained in the Appendix of the revised manuscript to identify the kernel weights. The overall classification accuracy in our validation set serves as the performance metric and helps us determine the best combination of weights. The generalization of our method to more than two modalities would be straightforward for tuning the weights. This is because the tuning problem has been widely studied in multiple kernel learning (MKL) research. There is no restriction on the number of kernels one can include in MKL framework. The weight selection techniques in MKL can be adapted to our framework. The optimization problem defined in (5.9) is an ordinary SVM problem once the kernel values are calculated through equation (5.8). Thus, for more information about the estimation procedure, 𝜆 selection, as well as the consistency results readers are referred to existing Support Vector Machine literature [163]. 5.5.2 Simulated Data We present a simulation study to demonstrate the benefit of utilizing C-STM with multimodal data in classification problems. To show the advantage of using multi-modalities in C-STM, we include CP-STM from [76], Constrained Multilinear Discriminant Analysis (CMDA), and Direct General Tensor Discriminant Analysis (DGTDA) from [112] as competitors. 
These existing approaches can only take a single tensor or matrix as the feature for classification. As a result, they are not able to exploit the multi-modality of the simulated data. We apply these approaches on every single data modality in our simulated data, and compare their classification performance with C-STM, which uses the multimodal data. We generate synthetic data using the idea from [61]. Suppose the two data modalities in our classification problems are
$$\mathcal{X}_{t,1} = \sum_{k=1}^{3}\mathbf{x}_{k,t,1}^{(1)}\circ\mathbf{x}_{k,t,1}^{(2)}\circ\mathbf{x}_{k,t,1}^{(3)}, \qquad X_{t,2} = \sum_{k=1}^{3}\mathbf{x}_{k,t,2}^{(1)}\circ\mathbf{x}_{k,t,2}^{(2)}, \qquad (5.21)$$
where the $\mathcal{X}_{t,1}$ are three-way tensors of size 30 by 20 by 10 and the $X_{t,2}$ are matrices of size 50 by 10. Both of them have CP rank equal to 3. To generate data for the simulation study, we first generate the latent factors (vectors) from various multivariate normal distributions, and then convert these factors into full tensors $\mathcal{X}_{t,1}$ and matrices $X_{t,2}$ using equation (5.21). The multivariate normal distributions used to generate the columns of the latent factors in equation (5.21) are specified in Table 5.1 below. In Table 5.1, we use $c = 1, 2$ to denote data from the two different classes.

Table 5.1: Distribution specifications for the simulation study; MVN stands for multivariate normal distribution, I are identity matrices, and bold numbers are vectors whose elements are all equal to that number.

                      Tensor Factors                  Shared Factors                Matrix Factors
Simulation   c   x^(1)_{k,t,1}   x^(2)_{k,t,1}   x^(3)_{k,t,1} = x^(2)_{k,t,2}   x^(1)_{k,t,2}
Case 1       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1.5, I)     MVN(1, I)       MVN(1, I)                       MVN(1.25, I)
Case 2       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1.5, I)     MVN(1, I)       MVN(1, I)                       MVN(1.5, I)
Case 3       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1.5, I)     MVN(1, I)       MVN(1, I)                       MVN(1.75, I)
Case 4       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1.5, I)     MVN(1, I)       MVN(1, I)                       MVN(2, I)
Case 5       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1.5, I)     MVN(1, I)       MVN(1, I)                       MVN(2.25, I)
Case 6       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(2, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
Case 7       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(2, I)
Case 8       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1, I)       MVN(1, I)       MVN(2, I)                       MVN(1, I)

There are eight different cases in our simulation study. In Cases 1-5, one of the tensor factors and the matrix factors are generated from different multivariate normal distributions for data in different classes. This means the tensor and matrix data both contain certain class information (discriminant power), which differs across the data modalities. Notice that the discriminant power in one of the tensor factors remains the same across Cases 1-5, while the power in the matrix factor increases. Cases 6 and 7 assume the class information exists only in a single data modality. In Case 6, only one of the tensor factors is generated from different distributions for data in different classes. This factor then becomes the matrix factor in Case 7. In Case 8, the shared factors are sampled from different distributions, meaning that both the tensor and matrix data modalities carry class information. However, since this class information comes from the shared factors, it is the same across the different modalities. For each simulation case, we generate 50 pairs of tensor and matrix from both classes, collecting 100 pairs of observations in total.
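A sketch of this data-generation scheme is given below. It is illustrative only: the helper name, seed, and sampling code are assumptions, identity covariances are used as in Table 5.1, and Case 1 is shown as an example.

```python
import numpy as np

def make_pair(rng, mean_t1, mean_t2, mean_shared, mean_m1, r=3,
              dims=(30, 20, 10), j=50):
    """Generate one coupled (tensor, matrix) sample following (5.21).

    Column means follow one row of Table 5.1: mean_t1/mean_t2 for the two
    individual tensor factors, mean_shared for the coupled factor, and
    mean_m1 for the individual matrix factor.
    """
    I1, I2, I3 = dims
    U1 = rng.normal(mean_t1, 1.0, (I1, r))       # x^(1)_{k,t,1}
    U2 = rng.normal(mean_t2, 1.0, (I2, r))       # x^(2)_{k,t,1}
    S  = rng.normal(mean_shared, 1.0, (I3, r))   # x^(3)_{k,t,1} = x^(2)_{k,t,2}
    V1 = rng.normal(mean_m1, 1.0, (j, r))        # x^(1)_{k,t,2}
    X1 = np.einsum('ir,jr,kr->ijk', U1, U2, S)   # 30 x 20 x 10 rank-3 tensor
    X2 = V1 @ S.T                                # 50 x 10 coupled matrix
    return X1, X2

rng = np.random.default_rng(0)
# Case 1 of Table 5.1: 50 samples per class
class1 = [make_pair(rng, 1.0, 1.0, 1.0, 1.0)  for _ in range(50)]
class2 = [make_pair(rng, 1.5, 1.0, 1.0, 1.25) for _ in range(50)]
```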
We then perform a random training and testing set separation by randomly choosing 20 samples as the testing set, and use the remaining data as the training set. The random selection of the testing set is conducted in a stratified sampling manner such that the proportion of samples from each class remains the same in both the training and testing sets. For all models, we report the model prediction accuracy, i.e., the proportion of correct predictions over total predictions, on the testing set as the performance metric. The random training and testing set separation is repeated 50 times and the average prediction accuracy of these 50 repetitions for all the cases is reported in Figure 5.3. Additionally, the standard deviations are illustrated by the error bars in the figure. The results of CP-STM, CMDA, and DGTDA with tensor data are denoted by CPSTM1, CMDA1, and DGTDA1, respectively, in the figure. The results using matrix data are denoted by CPSTM2, CMDA2, and DGTDA2.

Figure 5.3: Simulation result: average accuracy rates shown in bar plots; standard deviations of the accuracy rates shown by error bars.

From Figure 5.3, we can conclude that our C-STM has a more favorable performance in this multimodal classification problem compared with the other competitors. Its accuracy rates are significantly larger than those of the other methods in most cases. In particular, we can see that the accuracy rates of C-STM (orange) increase from Case 1 to Case 5, while the accuracy rates of CP-STM using tensor data remain the same. This is because the difference between the class mean vectors for the first tensor factor does not change from Case 1 to Case 5. However, the gap between the class mean vectors in the matrix factor increases. Due to this fact, both C-STM and CP-STM (yellow), which utilize the matrix data, achieve better performance from Case 1 to Case 5. More importantly, C-STM always outperforms CP-STM with matrix data as it enjoys the extra class information from the multiple modalities. In Cases 6 and 7, where the class information is in a single data modality, the advantage of C-STM is not as significant as in the previous cases, though its performance is slightly better than CP-STM. This indicates that C-STM can provide robust classification results even when extra data modalities do not provide any other class information, as it can extract more accurate estimates of the factors in the decomposition step. In Case 8, where the class information comes from the shared factors, C-STM recovers the shared factors accurately and provides significantly better classification accuracy. Through this simulation, we showed that C-STM has a clear advantage when using multimodal data in classification problems, and is robust to redundant data modalities.
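The repeated stratified 80/20 evaluation protocol above can be written compactly as follows. This is an illustrative sketch using scikit-learn rather than the authors' original evaluation script, and the predict_labels hook is a hypothetical stand-in for whichever classifier is being evaluated.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def evaluate(y, predict_labels, n_splits=50, test_size=0.2, seed=0):
    """Repeated stratified train/test evaluation; returns mean and std of accuracy.

    predict_labels(train_idx, test_idx) should fit a model on the training
    indices and return predicted labels for the test indices (hypothetical hook).
    """
    y = np.asarray(y)
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size,
                                      random_state=seed)
    accs = []
    for train_idx, test_idx in splitter.split(np.zeros((len(y), 1)), y):
        y_pred = predict_labels(train_idx, test_idx)
        accs.append(np.mean(y_pred == y[test_idx]))
    return np.mean(accs), np.std(accs)

# Toy usage with a dummy predictor that always outputs class 1 (placeholder)
y_demo = np.array([1] * 50 + [-1] * 50)
mean_acc, std_acc = evaluate(y_demo, lambda tr, te: np.ones(len(te)), n_splits=5)
```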
5.5.2.1 Kernel Selection

In this section, we evaluate and justify the choice of the kernel function presented in (5.8). In this formulation, the individual and coupled modes for each modality are first separated, and then the individual modes of each modality are combined as a tensor kernel function. The kernels from the individual modes are then added to those from the coupled modes to obtain the final form, where the weights for the individual and coupled parts can be optimized as discussed in the Appendix. This kernel formulation separates the coupled and individual information and integrates them as a linear combination. Although this is not the only way to integrate kernels, the relatively simple structure of such a combination provides us with several benefits such as interpretability, convenient parameter tuning, and generalizability for multimodal data. With equation (5.8), it is possible to explain the contribution of different data modalities to the discrimination power by looking at the weight parameters. Further, with a linear combination of kernels, the weight parameters can be tuned with the different approaches introduced in [71], such as Group Lasso. Even though we do not adopt these tuning techniques in this chapter, this still shows the advantage of choosing such a combination and can be the foundation for future work. Lastly, this kernel combination can easily be extended to data with more modalities, since kernels are appended linearly.

Besides the aforementioned reasons, we also provide numerical experiments to illustrate the performance of our choice against other kernel combination choices. In these experiments the factor sizes are the same ($\mathcal{X}_1 \in \mathbb{R}^{40\times40\times40}$ and $X_2 \in \mathbb{R}^{40\times40}$, $r = 3$) so that the kernels are balanced across the modes. We consider two cases, i.e., Case 8 in Table 5.1, and Case 9, where the columns of the latent factors corresponding to all individual and coupled modes of the second class are drawn from the distribution MVN(2, I). Although there can be many different kernel combinations, we select four particular formulations for comparison as they can serve as a basis for other choices. The particular formulations are the weighted combination of individual kernels from all modes (K2), the weighted combination of the tensor kernels corresponding to the two modalities (K3), and the tensor kernel corresponding to all modes across modalities (K4). The formulations for the different kernels are given in Table 5.2. We report average classification accuracy across 50 simulations, where the simulated tensors are randomly initialized.

Table 5.2: Various kernel combination schemes. Note that K2^(2) = K1^(3).

      Combination Scheme
K1    w1 K1^(1) K1^(2) + w2 K1^(3) + w3 K2^(1)
K2    w1 K1^(1) + w2 K1^(2) + w3 K1^(3) + w4 K2^(1)
K3    w1 K1^(1) K1^(2) K1^(3) + w2 K2^(1) K2^(2)
K4    K1^(1) K1^(2) K1^(3) K2^(1)

In Table 5.3, we can see that the kernel selection schemes K1 and K2 perform the best. K3 performs slightly worse as it is not as flexible as the previous two. Finally, K4 performs the worst as it is affected by all modes simultaneously, and cannot generalize well. While the difference in performance between the kernels is not significant, K3 and K4 cannot determine whether the observed class differences are due to an individual mode or a coupled mode. Thus, K1 and K2 are better in terms of explaining the results. For Case 8, in most cases, cross-validation across a range of weight parameters for K1 and K2 yields $w_2 = 1$ and $w_3 = 1$, respectively, and the remaining weights are equal to zero.
This directly identifies the source of discriminability and allows for better interpretability, which is not possible for K3 and K4. Finally, K1 has less number of parameters than K2 and this can be advantageous in cases with high number of modalities. The smaller number of parameters make cross-validation simpler, while still allowing for some interpretability. Table 5.3: Classification accuracy using different kernel combinations. K1 K2 K3 K4 Case 9 0.91 ± 0.036 0.9 ± 0.043 0.87 ± 0.038 0.82 ± 0.045 Case 8 0.90 ± 0.039 0.9 ± 0.038 0.88 ± 0.036 0.83 ± 0.06 5.5.3 EEG-fMRI Data In this section, we present the application of the proposed method on simultaneous EEG-fMRI data. The simultaneous electroencephalography (EEG) with functional magnetic resonance imaging (fMRI) is one of the most popular non-invasive multimodal brain imaging techniques to study human brain function. EEG records electrical activity from the scalp resulting from ionic current within the neurons of the brain. Its millisecond temporal resolution makes it possible to record event-related potentials that occur in response to visual, auditory and sensory stimuli ([172, 1]). While EEG provides high temporal resolution, its spatial resolution is limited by the number of electrodes placed on the scalp and thus provides less spatial resolu- 100 tion compared to other neuroimaging modalities such as magnetic resonance imaging (MRI) and Positron Emission Tomography (PET). As a result, it has been commonplace to record EEG data in conjunction with a high spatial resolution modality. As another powerful tool in studying human brain function, blood oxygenation level dependent (BOLD) functional magnetic resonance imaging (fMRI) provides signals with much higher spatial resolution to reflect hemodynamic changes in blood oxygenation level at all voxels related to neuronal activities ([138, 15, 101, 62]). Recording simultaneous EEG and fMRI can provide high resolution information at both the spatial and temporal dimensions at the same time. Thus, developing novel machine learning techniques to utilize such multimodal data is of great significance. In this application, we apply our C-STM model to a binary trial classification problem on a simultaneous EEG-fMRI data. The data is obtained from the study [178]. In this study, there are seventeen individuals (six females, average age 27.7) participated in three runs each of analogous visual and auditory oddball paradigms. The 375 (125 per run) total stimuli per task were presented for 200 ms each with a 2-3 s uniformly distributed variable inter-trial interval. A trial is defined as a time window in which subjects receive stimuli and make responses. In the visual task, a large red circle on isoluminant gray backgrounds was considered as the target stimuli, and a small green circle were the standard stimuli. For the auditory task, the standard and oddball stimuli were, respectively, 390 Hz pure tones and broadband sounds which sound like "laser guns". During the experiment, the stimuli were presented to all subjects, and their EEG and fMRI data are collected simultaneously and continuously. We obtain the EEG and fMRI data from OpenNeuro website (https://openneuro.org/datasets/ds000116/versions/00003). We utilize both EEG and fMRI in this data set with our C-STM model to class stimulus types in all the trials. Through our numerical study, we want to demonstrate the fact our C-STM model enjoys the advantage of data multimodality and provides more accurate class predictions. 
The data from Subject 4 are dropped since its fMRI data are corrupted. We pre-process both the EEG and fMRI data with Statistical Parametric Mapping (SPM 12) ([11]) and Matlab. The EEG data is collected by a custom built MR-compatible EEG system with 49 channels. [178] provides a version of re-referenced EEG data with 34 channels which are used in our experiment. This version of EEG data are sampled at 1,000 Hz, and are downsampled to 200 Hz at the beginning of pre-processing. We then remove both low-frequency and high-frequency noise in the data using SPM filter functions. As the last step of EEG pre-processing, we define trials from Brain Imaging Data Structure (BIDS) files [136] and extract EEG data epochs recorded within the trial-related time windows. The time window for each trial is considered to go from 100 ms before the stimulus onset until 500 ms after the stimulus. For each trial, 101 we construct a three-mode tensor corresponding to the EEG data for all subjects where the modes represent channel × time × subject. We denote it as X𝑡 ,1 ∈ R34×121×16 . The fMRI data is collected by 3T Philips Achieva MR Scanner with 170 volumes (TR = 2s) per session. Each 3D volume contains 32 slices. The voxel size in the image is 3 x 3 x 4 mm. For each subject, we realign all the fMRI volumes from multiple sessions to the mean volume, and co-register the participant’s T1 weighted anatomical scan to the mean fMRI volume. Next, we normalize all the fMRI volumes to match the MNI brain template ([102]) by creating segments from co-registered T1 weighted scan, and keep the voxel size as 3 x 3 x 4 mm. All normalized fMRI volumes are then smoothed by 3D Gaussian kernels with full width at half maximum (FWHM) parameter being 8 × 8 × 8. After the pre-processing, we further perform a regular statistical analysis ([119, 190]) to extract fMRI volumes from visual and auditory stimulus related voxels. Such data are also known as Region of Interest (ROI) data. We extract fMRI volumes from 178 voxels (in Figure 5.4a) for auditory oddball tasks, and 112 voxels for auditory tasks. As a result, fMRI data are modeled by matrices whose rows and columns stand for voxels and subjects: X𝑡 ,2 ∈ R16×178 for auditory task data, and X𝑡 ,2 ∈ R16×112 for visual task data. There is no time mode in fMRI data because the trial duration is less than the repetition time of fMRI (time for obtaining a single 3D volume fMRI). For each trial, there is only one 3D scan of fMRI collected from a single subject. The ROI data then becomes a vector for this subject in the trial as we extract volumes from the regions of interest. To classify trials with oddball and standard stimulus, we collect 140 multimodal data samples (X𝑡 ,1 , X𝑡 ,2 ) from auditory tasks, and 100 samples from visual tasks. For both types of tasks, the numbers of oddball and standard trials are equal. We consider the trials with oddball stimulus as the positive class, and the trials with standard stimulus as the negative class. Like the procedures in our simulation study, we select 20% of data as testing set, and use the remaining 80% for model estimation and validation. The classification accuracy, precision (positive predictive rate), sensitivity (true positive rate), and specificity (true negative rate) of classifiers are calculated using the test set at each experiment. The experiment is repeated multiple times, and the average accuracy, precision, sensitivity, and specificity, and their standard deviations (in subscripts) are reported in Table 5.4. 
The single mode classifiers CP-STM, CMDA, and DGTDA are also applied to either the EEG or the fMRI data for comparison. Single mode classifiers applied to the EEG data are denoted by appending "1" to their names, and those applied to the fMRI data by appending "2". The area under the curve (AUC) for all classifiers is also reported in Table 5.4.

Table 5.4: Real Data Result: Simultaneous EEG-fMRI Data Trial Classification (Mean of Performance Metrics with Standard Deviations in Parentheses)

Task      Method    Accuracy     Precision    Sensitivity  Specificity  AUC
Auditory  C-STM     0.89 (0.05)  0.83 (0.07)  1.00 (0.00)  0.77 (0.11)  0.89 (0.06)
          CP-STM1   0.80 (0.08)  0.71 (0.11)  1.00 (0.00)  0.60 (0.12)  0.78 (0.06)
          CP-STM2   0.83 (0.06)  0.76 (0.07)  0.99 (0.05)  0.65 (0.11)  0.82 (0.05)
          CMDA1     0.55 (0.10)  0.51 (0.09)  0.96 (0.09)  0.20 (0.21)  0.55 (0.06)
          CMDA2     0.67 (0.09)  0.61 (0.11)  0.92 (0.07)  0.46 (0.14)  0.70 (0.08)
          DGTDA1    0.55 (0.09)  0.51 (0.09)  0.94 (0.07)  0.23 (0.12)  0.59 (0.06)
          DGTDA2    0.67 (0.09)  0.60 (0.10)  0.90 (0.09)  0.46 (0.13)  0.68 (0.08)
Visual    C-STM     0.86 (0.06)  0.82 (0.09)  0.93 (0.07)  0.77 (0.12)  0.86 (0.06)
          CP-STM1   0.76 (0.08)  0.66 (0.11)  1.00 (0.00)  0.54 (0.12)  0.78 (0.05)
          CP-STM2   0.77 (0.08)  0.70 (0.11)  0.98 (0.08)  0.58 (0.17)  0.77 (0.07)
          CMDA1     0.53 (0.12)  0.52 (0.11)  0.94 (0.11)  0.11 (0.18)  0.54 (0.08)
          CMDA2     0.65 (0.13)  0.61 (0.14)  0.91 (0.09)  0.43 (0.19)  0.66 (0.09)
          DGTDA1    0.56 (0.11)  0.54 (0.11)  0.94 (0.06)  0.17 (0.12)  0.56 (0.07)
          DGTDA2    0.64 (0.10)  0.60 (0.13)  0.86 (0.10)  0.44 (0.18)  0.64 (0.07)

The results in Table 5.4 show that the trial classification accuracy of C-STM using multimodal data is better than that of any classifier based on a single modality, with a significant improvement in average accuracy and average AUC. This improvement is observed for both the auditory and the visual task, which agrees with the conclusion from our simulation study. Also consistent with the simulation study, the tensor discriminant analysis methods do not work as well as CP-STM and C-STM. In addition, the performance of tensor discriminant analysis is clearly better on the fMRI data than on the EEG data. This is expected, since the regions extracted from the fMRI data are identified by group level fMRI statistical analysis; the data in these regions have already shown significant differences between trial types in the conventional analysis and are thus easy to classify. On the other hand, no prior analysis or feature extraction is applied to the EEG data, leaving a low signal to noise ratio. Nevertheless, C-STM can still take advantage of the EEG data and further increase the classification accuracy, highlighting its robustness and potential in processing noisy multimodal tensor data.

5.6 Conclusion

In this chapter, we have proposed a novel coupled support tensor machine classifier for multimodal data by combining advanced coupled matrix tensor factorization (ACMTF) and the support tensor machine (STM). The most distinctive feature of this classifier is its ability to integrate features across different modalities and structures. The proposed approach can simultaneously take matrix- and tensor-shaped data for classification and can easily be extended to inputs with more than two modalities. The coupled tensor matrix decomposition unveils the intrinsic correlation structure between data across different modalities, making it possible to integrate information from multiple sources efficiently. Such a decomposition also makes the whole method robust and applicable to large-scale noisy data with missing values.
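As a rough illustration of this two-stage design, the sketch below extracts per-sample factor matrices from each modality, forms a weighted combination of per-modality kernels, and trains a standard SVM on the precomputed kernel. The plain per-sample SVD stands in for the ACMTF factors, and the RBF kernel, weights, and toy data are illustrative assumptions rather than the actual implementation.

```python
import numpy as np
from sklearn.svm import SVC

def sample_factors(x, rank=3):
    # Stand-in feature extraction: leading left singular vectors per sample.
    u, _, _ = np.linalg.svd(x.reshape(x.shape[0], -1), full_matrices=False)
    return u[:, :rank]

def rbf(a, b, gamma=0.1):
    return np.exp(-gamma * np.linalg.norm(a - b) ** 2)

def combined_kernel(feats, weights, gamma=0.1):
    # Weighted sum of per-modality kernels between factor matrices.
    n = len(feats[0])
    K = np.zeros((n, n))
    for w, modality in zip(weights, feats):
        for i in range(n):
            for j in range(n):
                K[i, j] += w * rbf(modality[i], modality[j], gamma)
    return K

rng = np.random.default_rng(0)
eeg  = [rng.standard_normal((34, 121)) for _ in range(40)]   # toy modality 1
fmri = [rng.standard_normal((16, 178)) for _ in range(40)]   # toy modality 2
y = rng.integers(0, 2, 40)
feats = ([sample_factors(x) for x in eeg], [sample_factors(x) for x in fmri])
K = combined_kernel(feats, weights=(0.5, 0.5))
clf = SVC(kernel="precomputed").fit(K, y)                     # second stage: kernel STM analogue
```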
The newly designed kernel functions in C-STM provide feature-level information fusion, combining discriminant information from different modalities. Moreover, the kernel formulation makes it possible to emphasize the most discriminative features from each modality by tuning the weight parameters in the kernel function.

The most important theoretical extension of our current approach would be the development of an excess risk analysis for C-STM. In particular, we are looking for an explicit expression for the excess risk in terms of the data factors from multiple modalities to quantify the contribution of every single modality to minimizing the excess risk. By doing so, we would be able to interpret the importance of each data modality in classification tasks. In addition, quantifying the uncertainty of the tensor and matrix factor estimates and their impact on the excess risk would lay the foundation for further theoretical development. Future work will also focus on learning the weight parameters in the kernel function via optimization. As introduced in [71], the weights in the kernel function can be estimated by including a group lasso penalty in the objective function. This procedure allows the identification of the most significant components and reduces the cost of parameter selection. Finally, the proposed method can be extended to multimodal data with more than two modalities, and to regression problems. In conclusion, we believe C-STM offers many encouraging possibilities for multimodal data integration and analysis. Its capability of handling multimodal tensor inputs makes it appropriate for many advanced data applications in neuroscience and medical research, and we anticipate that this method will play an important role in a variety of applications.

Figure 5.4: Region of Interest (ROI). (a) Auditory Task. (b) Visual Task.

CHAPTER 6

CONCLUSIONS

In this thesis, we introduced new tensor based machine learning models using various data structures. In particular, we utilized tensor network structures, geometric models and multi-modal coupling for efficient tensor based unsupervised and supervised learning. The proposed methods in this thesis contribute significantly to the tensor learning literature by improving existing structures and using tensors to model new problems.

In Chapter 2, we proposed a tensor decomposition structure, the Multi-Branch Tensor Network, that provides improved storage and computational efficiency compared to existing architectures without sacrificing the representation power of the low-dimensional approximations. We explored two applications of the multi-branch structure in supervised and unsupervised settings and provided theoretical analysis of convergence as well as computational and storage complexities. In the supervised setting, a multi-branch implementation of LDA was introduced. The proposed approach reduced the computational complexity by orders of magnitude while improving the classification accuracy compared to vector-, Tucker- or TT-based methods. We also demonstrated that the proposed approach is more general than Tucker based MDA methods and the TT-based BTT and MPS, and provided a way of selecting the optimal structure depending on the data size. The proposed methods show robustness against the small-sample-size (SSS) problem, which is considered to be the main drawback of LDA [63], as they learn a smaller number of parameters thanks to the efficient multi-branch structure. The proposed multi-branch methods can further be utilized in any Ritz pair (extremal eigenvalue-eigenvector pair) computation.
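As a point of reference for this Ritz-pair view, the sketch below solves the classical (vectorized) LDA objective as a generalized eigenvalue problem, where the discriminant direction is the leading generalized eigenvector of the between- and within-class scatter pair. The toy Gaussian data, the ridge term, and the single projection direction are illustrative assumptions; the multi-branch factorization itself is not reproduced here.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(1.5, 1, (50, 10))])
y = np.repeat([0, 1], 50)

mean_all = X.mean(axis=0)
S_W = np.zeros((10, 10))
S_B = np.zeros((10, 10))
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_W += (Xc - mc).T @ (Xc - mc)                            # within-class scatter
    S_B += len(Xc) * np.outer(mc - mean_all, mc - mean_all)   # between-class scatter

# Extremal (Ritz) pair: leading generalized eigenvector of S_B u = lambda * S_W u.
evals, evecs = eigh(S_B, S_W + 1e-6 * np.eye(10))             # small ridge for stability
U = evecs[:, np.argsort(evals)[::-1][:1]]                     # top eigenvector(s)
print(evals.max(), U.shape)                                   # largest Ritz value, (10, 1)
```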
In the unsupervised setting, the proposed structure was used in a graph regularized optimization problem to reduce the computational complexity and to learn more effective subspaces for dimensionality reduction. The low-dimensional projections, or features, of the data are then used for clustering. Using the multi-branch structure reduced the computational complexity of the optimization by orders of magnitude, especially when the projected space was larger in dimension. This is crucial because using only a small number of features often does not provide acceptable clustering quality or acceptable values of other learning objectives. The proposed method outperformed the Tucker-based graph regularized method in terms of clustering quality, suggesting that multi-branch decompositions can learn subspaces that better fit the data than Tucker based methods even in a manifold learning setting.

In Chapter 3, we present a framework for robust tensor decomposition for anomaly detection in spatiotemporal data. In particular, we focus on urban traffic monitoring applications where the anomalies exhibit themselves as temporally contiguous events. Motivated by this application, we model spatiotemporal data as a low-rank plus sparse tensor, where the low-rank part corresponds to the underlying data and the sparse part corresponds to anomalies. The continuous nature of the anomalies is taken into account with the additional constraint that the sparse part is temporally continuous (LOSS). We explore two extensions of this model. First, to capture the underlying local geometric relations in the data, we consider graph regularization across each mode (GLOSS). This type of regularization is motivated by the assumption that the data lie on a product graph, which holds for many real world scenarios since real world data are often generated over graphs with such structures; for traffic data, this structure holds by design. Second, we modify the objective function by minimizing the graph total variation instead of the nuclear norm for the low-rank approximation to reduce the computational complexity (LOGSS). Finally, we incorporate a tensor completion framework into all of these methods to address missing data. The proposed methods are shown to improve anomaly detection performance compared to baseline methods and to have higher accuracy than existing robust tensor decomposition methods. In particular, using graph regularization in the objective improves the results significantly even when the nuclear norm penalties are dropped. This shows that geometric structure, rather than global structure such as rank, can be used to obtain highly accurate and low complexity tensor models.

In Chapter 4, we propose incorporating geometric structure in addition to global structure into tensor denoising and recovery formulations. Geometric tensor learning allows for better modeling of the underlying relations in data compared to purely algebraic measures. We propose a TT based robust PCA model where a graph smoothness penalty is applied to each mode-n canonical unfolding. Since this formulation results in computationally intractable matrix inversion problems, we propose an extension where we impose a Kronecker structure on the mode-n canonical graph. This structure is designed such that the redundancy across the different mode unfoldings is minimized. We prove the equivalence between the proposed graph smoothness penalty and the mode-n unfolding based graph smoothness penalty.
We show the advantage of using geometric approaches by comparing these two methods with purely algebraic objective functions. The proposed methods outperform existing methods in missing data imputation and denoising tasks. Moreover, using the Kronecker structured graph rather than the canonical graph provides similar results with improved computational efficiency. The results in this chapter illustrate that, with additional regularization of the topological structure, TT-based models can be further improved.

Finally, in Chapter 5, we utilize coupled tensor decomposition in a supervised setting. Heterogeneous data collected by sensing the same physical phenomenon are generally coupled. Coupled tensor decomposition methods have been utilized extensively for unsupervised learning tasks such as denoising, recovery and clustering of such heterogeneous data. In this chapter, we showed that these models can also be utilized in supervised settings for robust dimensionality reduction and feature extraction. To combine representations from different sources, we proposed using multiple kernel learning (MKL) methods. MKL uses a combination of single kernels, which can take inputs from various data sources, to obtain better results and to avoid potential bias from kernel selection [71]. We propose a two-step model where, in the first step, factor matrices for all samples are extracted by coupling the heterogeneous data sources. In the next step, we feed the extracted features, i.e., the factor matrices, to a kernelized support tensor machine. The proposed method shows better classification accuracy than supervised learning based on the individual data sources. The coupled decomposition extracts the features of each sample using information from both data sources, which provides better features for the subsequent STM and improves the accuracy compared to an STM trained on the features of individual modalities. The proposed method also outperforms methods that specifically learn discriminative factors from individual modalities, which illustrates the advantage of coupling.

6.1 Future Work

The work in this thesis suggests new research directions. In this section, we suggest areas of future work for each chapter.

6.1.1 Multi-Branch Tensor Learning

The proposed tensor decomposition structure is a hybrid between Tucker and TT decompositions and is hence applicable to any tensor decomposition problem. Thus, the structure can be utilized in many other supervised and unsupervised tasks, such as regression, STMs, dictionary learning, manifold learning, data recovery and compression. It can also be utilized to improve tensor-based deep learning methods, by compressing either the data or the network parameters, as it provides an optimal way of decomposing large tensors. Theoretical aspects of the representation power of this structure are also of interest, as they would lead to a deeper understanding of tensor decompositions. Since Tucker and TT decompositions are special cases of the multi-branch structure, this analysis would also provide concrete reasoning for the choice of decomposition structure depending on the problem and the data at hand.

6.1.2 Tensor Methods for Anomaly Detection

Chapter 3 reveals the utility of tensor based robust learning architectures for the problem of anomaly detection, specifically for spatiotemporal data. However, there are still open questions regarding the theoretical recovery guarantees for the proposed algorithms.
Future work can explore recovery bounds for the anomalies depending on the data structure. Furthermore, the proposed methods rely on the proper selection of the regularization parameters. Although we have shown empirically that the performance is robust over a wide range of parameter choices, these methods still require some tuning for desirable performance. A fully Bayesian extension of the proposed approach could be considered as future work to automatically estimate these parameters from the data. The anomaly detection step in this chapter was implemented by scoring each fiber individually with a separate algorithm, which may result in loss of information. Future work will consider a statistical tensor anomaly scoring method to avoid this simplification. Finally, since the specific application in this chapter is spatiotemporal data, a natural extension would be online anomaly detection. The online subspace tracking or functional data analysis literature may provide the necessary tools for such an extension.

6.1.3 Geometric Tensor Learning

In Chapter 4, we illustrate how using geometric relations within data improves robust tensor learning. Future work would quantify these improvements in terms of theoretical recovery bounds when such structure is utilized. Although we utilize a combination of global and local structure in our methods, it is also of interest to use only the local structure, i.e., the geometric relations, for data recovery. Recovery guarantees for such an approach are also of great interest, as they might allow milder conditions compared to algorithms that use global structure. In this chapter, we estimated the underlying graphs using a k-NN approach. It is known in the graph learning literature that, with noisy and missing data, this generally produces noisy estimates of the underlying graph. As such, future work will focus on simultaneous graph learning and data recovery. Finally, as the use of graphs corresponding to the canonical mode-n unfoldings requires excessive memory and computational resources, a multi-branch structure could be utilized to reduce these costs without approximating the graphs with a Kronecker structure. This would also pave the way for a fully geometric robust multi-branch decomposition.

6.1.4 Supervised Coupled Tensor Learning

In Chapter 5, coupled tensor factorization was utilized as a feature extraction step. However, without using the class label information, the extracted features might not be suitable for the subsequent classification task. To address this, we propose to employ a supervised coupled factorization. Specifically, we extend Multilinear Discriminant Analysis to coupled factorization by solving

\begin{equation}
\underset{U_n^{(i)}\,\forall n,i}{\text{minimize}} \quad \sum_{i=1}^{2}\sum_{n=1}^{d_i-1} \frac{\operatorname{tr}\!\left(U_n^{(i),\top} S_W^{i,n} U_n^{(i)}\right)}{\operatorname{tr}\!\left(U_n^{(i),\top} S_B^{i,n} U_n^{(i)}\right)} \;+\; \sum_{i=1}^{2}\sum_{n\in\mathfrak{N}} \left\| A_n^{i}\,\hat{U}_n^{i} - V_n \right\|_F^{2},
\tag{6.1}
\end{equation}

where S_W^{i,n} and S_B^{i,n} are the within- and between-class scatter matrices of modality i and mode n, 𝔑 is the set of coupled modes, and the A_n^i are the transformations through which the couplings are defined. The transformations account for the differences in properties and resolutions across modalities even when they correspond to the same phenomenon. To classify a test sample, one can use the Mahalanobis distance with respect to the class means and covariance. Another approach would be to train a classifier using the sample mode factors Y that are learned through a least squares optimization, and procedures similar to Approach 1 can be utilized for the test cases.
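To make the structure of (6.1) concrete, the sketch below simply evaluates the objective for a fixed set of factor matrices: a sum of per-mode trace ratios plus a Frobenius-norm coupling penalty over the coupled modes. The mode sizes, scatter matrices, couplings A_n^i, shared factors V_n, and coupled-mode set are toy placeholders; an actual method would alternate minimization over the U_n^(i) rather than just score them.

```python
import numpy as np

def coupled_mda_objective(U, S_W, S_B, A, V, coupled_modes):
    val = 0.0
    for i, factors in enumerate(U):                     # modalities i = 0, 1
        for n, Un in enumerate(factors):                # modes n of modality i
            val += np.trace(Un.T @ S_W[i][n] @ Un) / np.trace(Un.T @ S_B[i][n] @ Un)
    for i, factors in enumerate(U):
        for n in coupled_modes:                         # coupling penalty on coupled modes
            val += np.linalg.norm(A[i][n] @ factors[n] - V[n]) ** 2
    return val

rng = np.random.default_rng(0)
def spd(d):                                             # random SPD scatter matrix
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

dims = [[34, 121], [16]]                                # toy mode sizes per modality
r = 3
U = [[np.linalg.qr(rng.standard_normal((d, r)))[0] for d in dm] for dm in dims]
S_W = [[spd(d) for d in dm] for dm in dims]
S_B = [[spd(d) for d in dm] for dm in dims]
coupled_modes = [0]                                     # couple only the first mode
V = {0: rng.standard_normal((8, r))}                    # shared coupled factor
A = [{0: rng.standard_normal((8, dims[i][0]))} for i in range(2)]
print(coupled_mda_objective(U, S_W, S_B, A, V, coupled_modes))
```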
110 BIBLIOGRAPHY 111 BIBLIOGRAPHY [1] Rodolfo Abreu, Alberto Leal, and Patrícia Figueiredo. “EEG-informed fMRI: a review of data analysis methods”. In: Frontiers in human neuroscience 12 (2018), p. 29. [2] Evrim Acar, Tamara G Kolda, and Daniel M Dunlavy. “All-at-once optimization for coupled matrix and tensor factorizations”. In: arXiv preprint arXiv:1105.3422 (2011). [3] Evrim Acar et al. “ACMTF for fusion of multi-modal neuroimaging data and identification of biomarkers”. In: 2017 25th European Signal Processing Conference (EUSIPCO). IEEE. 2017, pp. 643–647. [4] Evrim Acar et al. “Structure-revealing data fusion”. In: BMC bioinformatics 15.1 (2014), pp. 1–17. [5] Evrim Acar et al. “Tensor-based fusion of EEG and FMRI to understand neurological changes in schizophrenia”. In: 2017 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE. 2017, pp. 1–4. [6] Evrim Acar et al. “Unraveling diagnostic biomarkers of schizophrenia through structure-revealing fusion of multi-modal neuroimaging data”. In: Frontiers in neuroscience 13 (2019), p. 416. [7] Hemant Kumar Aggarwal and Angshul Majumdar. “Hyperspectral image denoising using spatio- spectral total variation”. In: IEEE Geoscience and Remote Sensing Letters 13.3 (2016), pp. 442– 446. [8] Anima Anandkumar et al. “Tensor vs. matrix methods: Robust tensor decomposition under block sparse perturbations”. In: Artificial Intelligence and Statistics. PMLR. 2016, pp. 268–276. [9] Jeffrey S Anderson et al. “Functional connectivity magnetic resonance imaging classification of autism”. In: Brain 134.12 (2011), pp. 3742–3754. [10] Andreas Argyriou, Charles A Micchelli, and Massimiliano Pontil. “When is there a representer theorem? Vector versus matrix regularizers”. In: The Journal of Machine Learning Research 10 (2009), pp. 2507–2529. [11] John Ashburner et al. “SPM12 manual”. In: Wellcome Trust Centre for Neuroimaging, London, UK 2464 (2014). [12] Francis R Bach. “Consistency of the group lasso and multiple kernel learning.” In: Journal of Machine Learning Research 9.6 (2008). [13] Mohammad Taha Bahadori, Qi Rose Yu, and Yan Liu. “Fast Multivariate Spatio-temporal Analysis via Low Rank Tensor Learning.” In: NIPS. Citeseer. 2014, pp. 3491–3499. [14] Richard H. Bartels and George W Stewart. “Solution of the matrix equation AX+ XB= C [F4]”. In: Communications of the ACM 15.9 (1972), pp. 820–826. 112 [15] JW Belliveau et al. “Functional mapping of the human visual cortex by magnetic resonance imaging”. In: Science 254.5032 (1991), pp. 716–719. [16] Asa Ben-Hur and William Stafford Noble. “Kernel methods for predicting protein–protein interac- tions”. In: Bioinformatics 21.suppl_1 (2005), pp. i38–i46. [17] Johann A Bengua et al. “Efficient tensor completion for color image and video recovery: Low-rank tensor train”. In: IEEE Transactions on Image Processing 26.5 (2017), pp. 2466–2479. [18] Johann A Bengua et al. “Matrix product state for higher-order tensor compression and classification”. In: IEEE Transactions on Signal Processing 65.15 (2017), pp. 4019–4030. [19] Kristin P Bennett, Michinari Momma, and Mark J Embrechts. “MARK: A boosting algorithm for heterogeneous kernel models”. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 2002, pp. 24–31. [20] Gouri Sankar Bhunia et al. “Spatial and temporal variation and hotspot detection of kala-azar disease in Vaishali district (Bihar), India”. In: BMC infectious diseases 13.1 (2013), p. 64. [21] Xuan Bi et al. “Tensors in statistics”. 
In: Annual Review of Statistics and Its Application 8 (2020). [22] Amit Boyarski, Sanketh Vedula, and Alex Bronstein. “Deep matrix factorization with spectral geometric regularization”. In: arXiv preprint arXiv: 1911.07255 (2019). [23] Markus M Breunig et al. “LOF: identifying density-based local outliers”. In: Proceedings of the 2000 ACM SIGMOD international conference on Management of data. 2000, pp. 93–104. [24] Laura F Bringmann et al. “Changing dynamics: Time-varying autoregressive models using general- ized additive modeling.” In: Psychological methods 22.3 (2017), p. 409. [25] Laura F Bringmann et al. “Modeling nonstationary emotion dynamics in dyads using a time-varying vector-autoregressive model”. In: Multivariate behavioral research 53.3 (2018), pp. 293–314. [26] Rasmus Bro. “PARAFAC. Tutorial and applications”. In: Chemometrics and Intelligent Laboratory Systems 38.2 (1997), pp. 149–171. [27] Rasmus Bro, Claus A Andersson, and Henk AL Kiers. “PARAFAC2—Part II. Modeling chro- matographic data with retention time shifts”. In: Journal of Chemometrics: A Journal of the Chemometrics Society 13.3-4 (1999), pp. 295–309. [28] Saikiran Bulusu et al. “Anomalous example detection in deep learning: A survey”. In: IEEE Access 8 (2020), pp. 132330–132347. [29] Vince D Calhoun et al. “Method for multimodal analysis of independent source differences in schizophrenia: combining gray matter structural and auditory oddball functional data”. In: Human brain mapping 27.1 (2006), pp. 47–62. [30] E. J. Candès et al. “Robust principal component analysis?” In: Journal of the ACM (JACM) 58.3 113 (2011), p. 11. [31] Clayson Celes, Azzedine Boukerche, and Antonio AF Loureiro. “Crowd Management: A New Challenge for Urban Big Data Analytics”. In: IEEE Communications Magazine 57.4 (2019), pp. 20– 25. [32] Mohammadhossein Chaghazardi and Shuchin Aeron. “Sample, computation vs storage tradeoffs for classification using tensor subspace models”. In: arXiv preprint arXiv:1706.05599 (2017). [33] Raghavendra Chalapathy and Sanjay Chawla. “Deep learning for anomaly detection: A survey”. In: arXiv preprint arXiv:1901.03407 (2019). [34] Varun Chandola, Arindam Banerjee, and Vipin Kumar. “Anomaly detection: A survey”. In: ACM computing surveys (CSUR) 41.3 (2009), pp. 1–58. [35] Venkat Chandrasekaran et al. “Rank-sparsity incoherence for matrix decomposition”. In: SIAM Journal on Optimization 21.2 (2011), pp. 572–596. [36] Christos Chatzichristos et al. “Early soft and flexible fusion of EEG and fMRI via tensor decompo- sitions”. In: arXiv preprint arXiv:2005.07134 (2020). [37] Christos Chatzichristos et al. “Fusion of EEG and fMRI via soft coupled tensor decompositions”. In: 2018 26th European Signal Processing Conference (EUSIPCO). IEEE. 2018, pp. 56–60. [38] Cong Chen et al. “A support tensor train machine”. In: 2019 International Joint Conference on Neural Networks (IJCNN). IEEE. 2019, pp. 1–8. [39] Longbiao Chen et al. “Fine-grained urban event detection and characterization based on tensor cofactorization”. In: IEEE Transactions on Human-Machine Systems 47.3 (2016), pp. 380–391. [40] Xinyu Chen et al. “Missing traffic data imputation and pattern discovery with a Bayesian augmented tensor factorization model”. In: Transportation Research Part C: Emerging Technologies 104 (2019), pp. 66–77. [41] Mario Christoudias, Raquel Urtasun, Trevor Darrell, et al. “Bayesian localized multiple kernel learning”. In: Univ. California Berkeley, Berkeley, CA (2009). [42] A. Cichocki. 
“Era of big data processing: A new approach via tensor networks and tensor decom- positions”. In: arXiv preprint arXiv:1403.2048 (2014). [43] A. Cichocki et al. “Tensor decompositions for signal processing applications: From two-way to multiway component analysis”. In: IEEE Signal Processing Magazine 32.2 (2015), pp. 145–163. [44] Andrzej Cichocki et al. “Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions”. In: Foundations and Trends® in Machine Learning 9.4-5 (2016), pp. 249–429. [45] Andrzej Cichocki et al. “Tensor networks for dimensionality reduction and large-scale optimization: 114 Part 2 applications and future perspectives”. In: Foundations and Trends® in Machine Learning 9.6 (2017), pp. 431–673. [46] Lieven De Lathauwer and Joséphine Castaing. “Blind identification of underdetermined mixtures by simultaneous matrix diagonalization”. In: IEEE Transactions on Signal Processing 56.3 (2008), pp. 1096–1105. [47] Lieven De Lathauwer, Josphine Castaing, and Jean-Franois Cardoso. “Fourth-order cumulant-based blind identification of underdetermined mixtures”. In: IEEE Transactions on Signal Processing 55.6 (2007), pp. 2965–2973. [48] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. “A multilinear singular value decom- position”. In: SIAM journal on Matrix Analysis and Applications 21.4 (2000), pp. 1253–1278. [49] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. “On the best rank-1 and rank-(r 1, r 2,..., rn) approximation of higher-order tensors”. In: SIAM journal on Matrix Analysis and Applications 21.4 (2000), pp. 1324–1342. [50] Dingxiong Deng et al. “Latent space model for road networks to predict time-varying traffic”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, pp. 1525–1534. [51] Lei Deng et al. “Graph Spectral Regularized Tensor Completion for Traffic Data Imputation”. In: IEEE Transactions on Intelligent Transportation Systems (2021). [52] Wei Deng and Wotao Yin. “On the global and linear convergence of the generalized alternating direction method of multipliers”. In: Journal of Scientific Computing 66.3 (2016), pp. 889–916. [53] Renwei Dian, Shutao Li, and Leyuan Fang. “Learning a low tensor-train rank representation for hyperspectral image super-resolution”. In: IEEE transactions on neural networks and learning systems 30.9 (2019), pp. 2672–2683. [54] Yiming Ding et al. “A deep learning model to predict a diagnosis of Alzheimer disease by using 18F-FDG PET of the brain”. In: Radiology 290.2 (2019), pp. 456–464. [55] Youcef Djenouri et al. “A survey on urban traffic anomalies detection algorithms”. In: IEEE Access 7 (2019), pp. 12192–12205. [56] Sergey V Dolgov et al. “Computation of extreme eigenvalues in higher dimensions using block tensor train format”. In: Computer Physics Communications 185.4 (2014), pp. 1207–1216. [57] Haishun Du et al. “Sparse representation-based robust face recognition by graph regularized low-rank sparse representation recovery”. In: Neurocomputing 164 (2015), pp. 220–229. [58] James H Faghmous et al. “A parameter-free spatio-temporal pattern mining model to catalog global ocean dynamics”. In: 2013 IEEE 13th International Conference on Data Mining. IEEE. 2013, pp. 151–160. 115 [59] Hadi Fanaee-T and João Gama. “Tensor-based anomaly detection: An interdisciplinary survey”. In: Knowledge-Based Systems 98 (2016), pp. 130–147. [60] Hadi Fanaee-T and Joao Gama. “Event detection from traffic tensors: A hybrid model”. 
In: Neurocomputing 203 (2016), pp. 22–33. [61] Hadi Fanaee-T and Joao Gama. “SimTensor: A synthetic tensor data generator”. In: arXiv preprint arXiv:1612.03772 (2016). [62] Massimo Filippi, Roland Bammer, et al. MR imaging in white matter diseases of the brain and spinal cord. Springer, 2005. [63] Keinosuke Fukunaga. Introduction to statistical pattern recognition. Elsevier, 2013. [64] Glenn Fung et al. “A fast iterative algorithm for fisher discriminant using heterogeneous kernels”. In: Proceedings of the twenty-first international conference on Machine learning. 2004, p. 40. [65] Mostafa Reisi Gahrooei et al. “Multiple tensor-on-tensor regression: an approach for modeling processes with heterogeneous sources of data”. In: Technometrics 63.2 (2021), pp. 147–159. [66] Giovana Gavidia-Bovadilla et al. “Early prediction of Alzheimer’s disease using null longitudinal model-based classifiers”. In: PloS one 12.1 (2017), e0168011. [67] Matan Gavish and Ronald R Coifman. “Sampling, denoising and compression of matrices by coherent matrix organization”. In: Applied and Computational Harmonic Analysis 33.3 (2012), pp. 354–369. [68] Xiurui Geng et al. “A high-order statistical tensor based algorithm for anomaly detection in hyper- spectral imagery”. In: Scientific reports 4 (2014), p. 6869. [69] Mark Girolami and Mingjun Zhong. “Data Integration for Classification Problems Employing Gaussian Process Priors”. In: Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference. Vol. 19. MIT Press. 2007, p. 465. [70] Donald Goldfarb and Zhiwei Qin. “Robust low-rank tensor recovery: Models and algorithms”. In: SIAM Journal on Matrix Analysis and Applications 35.1 (2014), pp. 225–253. [71] Mehmet Gönen and Ethem Alpaydın. “Multiple kernel learning algorithms”. In: The Journal of Machine Learning Research 12 (2011), pp. 2211–2268. [72] Adrian R Groves et al. “Linked independent component analysis for multimodal data fusion”. In: Neuroimage 54.3 (2011), pp. 2198–2217. [73] Weiwei Guo, Irene Kotsia, and Ioannis Patras. “Tensor learning for regression”. In: IEEE Transac- tions on Image Processing 21.2 (2011), pp. 816–827. [74] Xian Guo et al. “Support tensor machines for classification of hyperspectral remote sensing imagery”. In: IEEE Transactions on Geoscience and Remote Sensing 54.6 (2016), pp. 3248–3264. 116 [75] Zhifeng Hao et al. “A linear support higher-order tensor machine for classification”. In: IEEE Transactions on Image Processing 22.7 (2013), pp. 2911–2920. [76] Lifang He et al. “Dusk: A dual structure-preserving kernel for supervised tensor learning with applications to neuroimages”. In: Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM. 2014, pp. 127–135. [77] Lifang He et al. “Kernelized support tensor machines”. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org. 2017, pp. 1442–1451. [78] Xiaofei He, Deng Cai, and Partha Niyogi. “Tensor subspace analysis”. In: Advances in neural information processing systems. 2006, pp. 499–506. [79] Victoria Hodge and Jim Austin. “A survey of outlier detection methodologies”. In: Artificial intelligence review 22.2 (2004), pp. 85–126. [80] Sebastian Holtz, Thorsten Rohwedder, and Reinhold Schneider. “On manifolds of tensors of fixed TT-rank”. In: Numerische Mathematik 120.4 (2012), pp. 701–731. [81] Yuwang Ji et al. “A Survey on Tensor Techniques and Applications in Machine Learning”. In: IEEE Access 7 (2019), pp. 162950–162990. [82] Bo Jiang et al. 
“Image representation and learning with graph-laplacian tucker tensor decomposition”. In: IEEE transactions on cybernetics 49.4 (2018), pp. 1417–1426. [83] Taisong Jin et al. “Low-rank matrix factorization with multiple hypergraph regularizer”. In: Pattern Recognition 48.3 (2015), pp. 1011–1022. [84] Vassilis Kalofolias et al. “Matrix completion on graphs”. In: arXiv preprint arXiv:1408.1717 (2014). [85] Maja H Kamstrup-Nielsen, Lea G Johnsen, and Rasmus Bro. “Core consistency diagnostic in PARAFAC2”. In: Journal of Chemometrics 27.5 (2013), pp. 99–105. [86] Esin Karahan et al. “Tensor analysis and fusion of multimodal brain images”. In: Proceedings of the IEEE 103.9 (2015), pp. 1531–1559. [87] Hiroyuki Kasai. “Fast online low-rank tensor subspace tracking by CP decomposition using recursive least squares from incomplete observations”. In: Neurocomputing 347 (2019), pp. 177–190. [88] Hiroyuki Kasai, Wolfgang Kellerer, and Martin Kleinsteuber. “Network volume anomaly detection and identification in large-scale networks based on online time-structured traffic tensor tracking”. In: IEEE Transactions on Network and Service Management 13.3 (2016), pp. 636–650. [89] Ali Khazaee, Ata Ebrahimzadeh, and Abbas Babajani-Feremi. “Application of advanced machine learning methods on resting-state fMRI network for identification of mild cognitive impairment and Alzheimer’s disease”. In: Brain imaging and behavior 10.3 (2016), pp. 799–817. [90] Boris N Khoromskij. “O (dlog N)-quantics approximation of N-d tensors in high-dimensional 117 numerical modeling”. In: Constructive Approximation 34.2 (2011), pp. 257–280. [91] Tae-Kyun Kim, Shu-Fai Wong, and Roberto Cipolla. “Tensor canonical correlation analysis for action classification”. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE. 2007, pp. 1–8. [92] Marius Kloft et al. “Efficient and accurate lp-norm multiple kernel learning.” In: NIPS. vol. 22. 22. 2009, pp. 997–1005. [93] Tamara G Kolda and Brett W Bader. “Tensor decompositions and applications”. In: SIAM review 51.3 (2009), pp. 455–500. [94] Xiangjie Kong et al. “HUAD: Hierarchical urban anomaly detection based on spatio-temporal data”. In: IEEE Access 8 (2020), pp. 26573–26582. [95] Jean Kossaifi et al. “T-net: Parametrizing fully convolutional nets with a single high-order tensor”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 7822–7831. [96] Jean Kossaifi et al. “Tensor regression networks”. In: arXiv preprint arXiv:1707.08308 (2017). [97] Irene Kotsia, Weiwei Guo, and Ioannis Patras. “Higher rank support tensor machines for visual recognition”. In: Pattern Recognition 45.12 (2012), pp. 4192–4203. [98] Daniel Kressner, Michael Steinlechner, and André Uschmajew. “Low-rank tensor methods with subspace correction for symmetric eigenvalue problems”. In: SIAM Journal on Scientific Computing 36.5 (2014), A2346–A2368. [99] J. B. Kruskal. “Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics”. In: Linear Algebra and its Applications 18.2 (1977), pp. 95–138. issn: 0024-3795. doi: https://doi.org/10.1016/0024-3795(77)90069-6. url: http://www.sciencedirect.com/science/article/pii/0024379577900696. [100] Alp Kut and Derya Birant. “Spatio-temporal outlier detection in large databases”. In: Journal of computing and information technology 14.4 (2006), pp. 291–297. [101] Kenneth K Kwong et al. 
“Dynamic magnetic resonance imaging of human brain activity during primary sensory stimulation.” In: Proceedings of the National Academy of Sciences 89.12 (1992), pp. 5675–5679. [102] Jack L Lancaster et al. “Bias between MNI and Talairach coordinates analyzed using the ICBM-152 brain template”. In: Human brain mapping 28.11 (2007), pp. 1194–1205. [103] Gert RG Lanckriet et al. “Learning the kernel matrix with semidefinite programming”. In: Journal of Machine learning research 5.Jan (2004), pp. 27–72. [104] Gisela Lechuga et al. “Discriminant analysis for multiway data”. In: International Conference on Partial Least Squares and Related Methods. Springer. 2014, pp. 115–126. 118 [105] Jae-Gil Lee, Jiawei Han, and Kyu-Young Whang. “Trajectory clustering: a partition-and-group framework”. In: Proceedings of the 2007 ACM SIGMOD international conference on Management of data. 2007, pp. 593–604. [106] Namgil Lee et al. “Nonnegative Tensor Train Decompositions for Multi-domain Feature Extraction and Clustering”. In: International Conference on Neural Information Processing. Springer. 2016, pp. 87–95. [107] Xu Lei, Pedro A Valdes-Sosa, and Dezhong Yao. “EEG/fMRI fusion based on independent com- ponent analysis: integration of data-driven and model-driven methods”. In: Journal of integrative neuroscience 11.03 (2012), pp. 313–337. [108] Li Li et al. “Trend modeling for traffic time series analysis: An integrated study”. In: IEEE Transactions on Intelligent Transportation Systems 16.6 (2015), pp. 3430–3439. [109] Peide Li and Taps Maiti. “Universal Consistency of Support Tensor Machine”. In: 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE. 2019, pp. 608– 609. [110] Ping Li et al. “Online robust low-rank tensor modeling for streaming data analysis”. In: IEEE transactions on neural networks and learning systems 30.4 (2018), pp. 1061–1075. [111] Quefeng Li and Lexin Li. “Integrative factor regression and its inference for multimodal data analysis”. In: arXiv preprint arXiv:1911.04056 (2019). [112] Qun Li and Dan Schonfeld. “Multilinear discriminant analysis for higher-order tensor data clas- sification”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 36.12 (2014), pp. 2524–2537. [113] Shuangjiang Li et al. “Low-rank tensor decomposition based anomaly detection for hyperspectral imagery”. In: 2015 IEEE International Conference on Image Processing (ICIP). IEEE. 2015, pp. 4525–4529. [114] Xiaoshan Li et al. “Tucker tensor regression and neuroimaging analysis”. In: Statistics in Biosciences 10.3 (2018), pp. 520–545. [115] Xutao Li et al. “MR-NTD: Manifold regularization nonnegative tucker decomposition for tensor data dimension reduction and representation”. In: IEEE transactions on neural networks and learning systems 28.8 (2016), pp. 1787–1800. [116] Yingjie Li et al. “Early prediction of Alzheimer’s disease using longitudinal volumetric MRI data from ADNI”. in: Health Services and Outcomes Research Methodology 20.1 (2020), pp. 13–39. [117] Ziyue Li et al. “Tensor completion for weakly-dependent data on graph for metro passenger flow prediction”. In: arXiv preprint arXiv:1912.05693 (2019). [118] Chaoguang Lin et al. “Anomaly detection in spatiotemporal data via regularized non-negative tensor analysis”. In: Data Mining and Knowledge Discovery 32.4 (2018), pp. 1056–1073. 119 [119] Martin A Lindquist et al. “The statistical analysis of fMRI data”. In: Statistical science 23.4 (2008), pp. 439–464. [120] Ji Liu et al. 
“Tensor completion for estimating missing values in visual data”. In: IEEE transactions on pattern analysis and machine intelligence 35.1 (2012), pp. 208–220. [121] Jingyu Liu et al. “Combining fMRI and SNP data to investigate connections between brain function and genetics using parallel ICA”. in: Human brain mapping 30.1 (2009), pp. 241–255. [122] Siqi Liu et al. “Early diagnosis of Alzheimer’s disease with deep learning”. In: 2014 IEEE 11th international symposium on biomedical imaging (ISBI). IEEE. 2014, pp. 1015–1018. [123] Eric F Lock. “Tensor-on-tensor regression”. In: Journal of Computational and Graphical Statistics 27.3 (2018), pp. 638–647. [124] Xiaojing Long et al. “Prediction and classification of Alzheimer disease based on quantification of MRI deformation”. In: PloS one 12.3 (2017), e0173372. [125] Canyi Lu et al. “Tensor robust principal component analysis with a new tensor nuclear norm”. In: IEEE transactions on pattern analysis and machine intelligence 42.4 (2019), pp. 925–938. [126] Haiping Lu, Konstantinos N Plataniotis, and Anastasios N Venetsanopoulos. “MPCA: Multilinear principal component analysis of tensor objects”. In: IEEE Transactions on Neural Networks 19.1 (2008), pp. 18–39. [127] Gal Mishne, Eric Chi, and Ronald Coifman. “Co-manifold learning with missing data”. In: International Conference on Machine Learning. PMLR. 2019, pp. 4605–4614. [128] Gal Mishne et al. “Data-driven tree transforms and metrics”. In: IEEE transactions on signal and information processing over networks 4.3 (2017), pp. 451–466. [129] John C Morris et al. “Pittsburgh compound B imaging and prediction of progression from cognitive normality to symptomatic Alzheimer disease”. In: Archives of neurology 66.12 (2009), pp. 1469– 1475. [130] Raziyeh Mosayebi and Gholam-Ali Hossein-Zadeh. “Correlated coupled matrix tensor factorization method for simultaneous EEG-fMRI data fusion”. In: Biomedical Signal Processing and Control 62 (2020), p. 102071. [131] Kevin P Murphy. “Switching kalman filters”. In: (1998). [132] Atsuhiro Narita et al. “Tensor factorization using auxiliary information”. In: Data Mining and Knowledge Discovery 25.2 (2012), pp. 298–324. [133] Sameer A Nene, Shree K Nayar, and Hiroshi Murase. Columbia Object Image Library (COIL-100). [134] Luong Ha Nguyen and James-A Goulet. “Anomaly detection with the switching kalman filter for structural health monitoring”. In: Structural Control and Health Monitoring 25.4 (2018), e2136. 120 [135] Yongming Nie et al. “Graph-regularized tensor robust principal component analysis for hyperspectral image denoising”. In: Applied optics 56.22 (2017), pp. 6094–6102. [136] Guiomar Niso et al. “MEG-BIDS, the brain imaging data structure extended to magnetoencephalog- raphy”. In: Scientific data 5.1 (2018), pp. 1–5. [137] Alexander Novikov et al. “Tensorizing neural networks”. In: Advances in Neural Information Processing Systems. 2015, pp. 442–450. [138] Seiji Ogawa et al. “Brain magnetic resonance imaging with contrast dependent on blood oxygenation”. In: proceedings of the National Academy of Sciences 87.24 (1990), pp. 9868–9872. [139] Ivan V Oseledets. “Approximation of 2ˆd\times2ˆd matrices using tensor decomposition”. In: SIAM Journal on Matrix Analysis and Applications 31.4 (2010), pp. 2130–2145. [140] Ivan V Oseledets. “Tensor-train decomposition”. In: SIAM Journal on Scientific Computing 33.5 (2011), pp. 2295–2317. [141] Alp Ozdemir, Edward M Bernat, and Selin Aviyente. “Recursive tensor subspace tracking for dynamic brain network analysis”. 
In: IEEE Transactions on Signal and Information Processing over Networks 3.4 (2017), pp. 669–682. [142] Yuqing Pan, Qing Mai, and Xin Zhang. “Covariate-Adjusted Tensor Classification in High Dimen- sions”. In: Journal of the American Statistical Association (2018), pp. 1–15. [143] Evangelos Papalexakis, Konstantinos Pelechrinis, and Christos Faloutsos. “Spotting misbehaviors in location-based social networks using tensors”. In: Proceedings of the 23rd International Conference on World Wide Web. 2014, pp. 551–552. [144] Evangelos E Papalexakis, Alex Beutel, and Peter Steenkiste. “Network anomaly detection using co-clustering”. In: 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE. 2012, pp. 403–410. [145] Paul Pavlidis et al. “Gene functional classification from heterogeneous data”. In: Proceedings of the fifth annual international conference on Computational biology. 2001, pp. 249–255. [146] Nathanaël Perraudin and Pierre Vandergheynst. “Stationary signal processing on graphs”. In: IEEE Transactions on Signal Processing 65.13 (2017), pp. 3462–3477. [147] Shibin Qiu and Terran Lane. “A framework for multiple kernel support vector regression and its applications to siRNA efficacy prediction”. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics 6.2 (2008), pp. 190–199. [148] Yuning Qiu et al. “A generalized graph regularized non-negative tucker decomposition framework for tensor data representation”. In: IEEE transactions on cybernetics (2020). [149] Matthew Roughan et al. “Spatio-temporal compressive sensing and internet traffic matrices (extended version)”. In: IEEE/ACM Transactions on Networking 20.3 (2011), pp. 662–676. 121 [150] Peter J Rousseeuw and Katrien Van Driessen. “A fast algorithm for the minimum covariance determinant estimator”. In: Technometrics 41.3 (1999), pp. 212–223. [151] Aliaksei Sandryhaila and Jose MF Moura. “Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure”. In: IEEE Signal Processing Magazine 31.5 (2014), pp. 80–90. [152] Katharina A Schindlbeck and David Eidelberg. “Network imaging biomarkers: insights and clinical applications in Parkinson’s disease”. In: The Lancet Neurology 17.7 (2018), pp. 629–640. [153] Bernhard Schölkopf et al. “Support vector method for novelty detection”. In: Advances in neural information processing systems. 2000, pp. 582–588. [154] Nauman Shahid, Francesco Grassi, and Pierre Vandergheynst. “Tensor Robust PCA on Graphs”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 5406–5410. [155] Nauman Shahid et al. “Fast robust PCA on graphs”. In: IEEE Journal of Selected Topics in Signal Processing 10.4 (2016), pp. 740–756. [156] Nauman Shahid et al. “Robust principal component analysis on graphs”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 2812–2820. [157] Abhishek Sharma and Maks Ovsjanikov. “Geometric Matrix Completion: A Functional View”. In: arXiv preprint arXiv:2009.14343 (2020). [158] Lei Shi, Aryya Gangopadhyay, and Vandana P Janeja. “STenSr: Spatio-temporal tensor streams for anomaly detection and pattern discovery”. In: Knowledge and Information Systems 43.2 (2015), pp. 333–353. [159] Nicholas D Sidiropoulos et al. “Tensor decomposition for signal processing and machine learning”. In: IEEE Transactions on Signal Processing 65.13 (2017), pp. 3551–3582. 
[160] Age Smilde, Rasmus Bro, and Paul Geladi. Multi-way analysis: applications in the chemical sciences. John Wiley & Sons, 2005. [161] Seyyid Emre Sofuoglu and Selin Aviyente. “Graph Regularized Tensor Train Decomposition”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 3912–3916. [162] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. “UCF101: A dataset of 101 human actions classes from videos in the wild”. In: arXiv preprint arXiv:1212.0402 (2012). [163] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008. [164] Yuting Su et al. “Graph regularized low-rank tensor representation for feature selection”. In: Journal of Visual Communication and Image Representation 56 (2018), pp. 234–244. 122 [165] Jing Sui et al. “Discriminating schizophrenia and bipolar disorder by fusing fMRI and DTI in a multimodal CCA+ joint ICA model”. In: Neuroimage 57.3 (2011), pp. 839–855. [166] Jimeng Sun, Dacheng Tao, and Christos Faloutsos. “Beyond streams and graphs: dynamic tensor analysis”. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006, pp. 374–383. [167] Hiroaki Tanabe et al. “Simple but effective methods for combining kernels in computational biology”. In: 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies. IEEE. 2008, pp. 71–78. [168] Dacheng Tao et al. “General tensor discriminant analysis and gabor features for gait recognition”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 29.10 (2007). [169] Dacheng Tao et al. “Supervised tensor learning”. In: Fifth IEEE International Conference on Data Mining (ICDM’05). IEEE. 2005, 8–pp. [170] Liang Tao et al. “Low rank approximation with sparse integration of multiple manifolds for data representation”. In: Applied Intelligence 42.3 (2015), pp. 430–446. [171] Joshua B Tenenbaum, Vin De Silva, and John C Langford. “A global geometric framework for nonlinear dimensionality reduction”. In: science 290.5500 (2000), pp. 2319–2323. [172] Michal Teplan et al. “Fundamentals of EEG measurement”. In: Measurement science review 2.2 (2002), pp. 1–11. [173] Ryota Tomioka, Kohei Hayashi, and Hisashi Kashima. “On the extension of trace norm to tensors”. In: NIPS Workshop on Tensors, Kernels, and Machine Learning. Vol. 7. 2010. [174] Ledyard R Tucker. “Implications of factor analysis of three-way matrices for measurement of change”. In: Problems in measuring change 15 (1963), pp. 122–137. [175] Ledyard R Tucker et al. “The extension of factor analysis to three-dimensional matrices”. In: Contributions to mathematical psychology 110119 (1964). [176] Manik Varma and Debajyoti Ray. “Learning the discriminative power-invariance trade-off”. In: 2007 IEEE 11th International Conference on Computer Vision. IEEE. 2007, pp. 1–8. [177] Ulrike Von Luxburg. “A tutorial on spectral clustering”. In: Statistics and computing 17.4 (2007), pp. 395–416. [178] Jennifer M Walz et al. “Simultaneous EEG-fMRI reveals temporal evolution of coupling between supramodal cortical attention networks and the brainstem”. In: Journal of Neuroscience 33.49 (2013), pp. 19212–19222. [179] Kaidong Wang et al. “Hyperspectral and Multispectral Image Fusion via Nonlocal Low-Rank Tensor Decomposition and Spectral Unmixing”. In: IEEE Transactions on Geoscience and Remote Sensing 58.11 (2020), pp. 7654–7671. 123 [180] Qi Wang et al. 
“Robust bi-stochastic graph regularized matrix factorization for data clustering”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2020). [181] Wenqi Wang, Vaneet Aggarwal, and Shuchin Aeron. “Principal Component Analysis with Tensor Train Subspace”. In: arXiv preprint arXiv:1803.05026 (2018). [182] Wenqi Wang, Vaneet Aggarwal, and Shuchin Aeron. “Principal component analysis with tensor train subspace”. In: Pattern Recognition Letters 122 (2019), pp. 86–91. [183] Wenqi Wang, Vaneet Aggarwal, and Shuchin Aeron. “Tensor train neighborhood preserving embed- ding”. In: IEEE Transactions on Signal Processing 66.10 (2018), pp. 2724–2732. [184] Xudong Wang and Lijun Sun. “Diagnosing spatiotemporal traffic anomalies with low-rank tensor autoregression”. In: IEEE Transactions on Intelligent Transportation Systems (2021). [185] Xudong Wang et al. “A probabilistic tensor factorization approach to detect anomalies in spatiotem- poral traffic activities”. In: 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE. 2019, pp. 1658–1663. [186] Yao Wang et al. “Hyperspectral image restoration via total variation regularized low-rank tensor decomposition”. In: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11.4 (2017), pp. 1227–1243. [187] Yu Wang, Wotao Yin, and Jinshan Zeng. “Global convergence of ADMM in nonconvex nonsmooth optimization”. In: Journal of Scientific Computing 78.1 (2019), pp. 29–63. [188] Weizmann Facebase. http://www.wisdom.weizmann.ac.il/~/vision/FaceBase/. [189] Zaiwen Wen and Wotao Yin. “A feasible method for optimization with orthogonality constraints”. In: Mathematical Programming 142.1-2 (2013), pp. 397–434. [190] Keith J Worsley et al. “A general statistical analysis for fMRI data”. In: Neuroimage 15.1 (2002), pp. 1–15. [191] Elizabeth Wu, Wei Liu, and Sanjay Chawla. “Spatio-temporal outlier detection in precipitation data”. In: International Workshop on Knowledge Discovery from Sensor Data. Springer. 2008, pp. 115–133. [192] Kun Xie et al. “Graph based tensor recovery for accurate internet anomaly detection”. In: IEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE. 2018, pp. 1502–1510. [193] Ming Xu et al. “Anomaly detection in road networks using sliding-window tensor factorization”. In: IEEE Transactions on Intelligent Transportation Systems 20.12 (2019), pp. 4704–4713. [194] Ming Yan and Wotao Yin. “Self equivalence of the alternating direction method of multipliers”. In: Splitting Methods in Communication, Imaging, Science, and Engineering. Springer, 2016, pp. 165– 194. 124 [195] Shuicheng Yan et al. “Discriminant analysis with tensor representation”. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. IEEE. 2005, pp. 526–532. [196] Shuicheng Yan et al. “Multilinear discriminant analysis for face recognition”. In: IEEE Transactions on Image Processing 16.1 (2006), pp. 212–220. [197] Jing-Hua Yang et al. “Low-rank tensor train for tensor robust principal component analysis”. In: Applied Mathematics and Computation 367 (2020), p. 124783. [198] Jieping Ye, Ravi Janardan, and Qi Li. “Two-dimensional linear discriminant analysis”. In: Advances in Neural Information Processing Systems. 2005, pp. 1569–1576. [199] Tatsuya Yokota, Qibin Zhao, and Andrzej Cichocki. “Smooth PARAFAC decomposition for tensor completion”. In: IEEE Transactions on Signal Processing 64.20 (2016), pp. 5423–5436. [200] Hsiang-Fu Yu, Nikhil Rao, and Inderjit S Dhillon. 
“Temporal Regularized Matrix Factorization for High-dimensional Time Series Prediction.” In: NIPS. 2016, pp. 847–855. [201] Rose Yu and Yan Liu. “Learning from multiway data: Simple and efficient tensor regression”. In: International Conference on Machine Learning. PMLR. 2016, pp. 373–381. [202] Stefanos Zafeiriou. “Discriminant nonnegative tensor factorization algorithms”. In: IEEE Transac- tions on Neural Networks 20.2 (2009), pp. 217–235. [203] Huichu Zhang, Yu Zheng, and Yong Yu. “Detecting urban anomalies using multiple spatio-temporal data sources”. In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2.1 (2018), pp. 1–18. [204] Junyu Zhang, Zaiwen Wen, and Yin Zhang. “Subspace methods with local refinements for eigenvalue computation using low-rank tensor-train format”. In: Journal of Scientific Computing 70.2 (2017), pp. 478–499. [205] Mingyang Zhang et al. “A decomposition approach for urban anomaly detection across spatiotem- poral data”. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press. 2019, pp. 6043–6049. [206] Mingyang Zhang et al. “Urban Anomaly Analytics: Description, Detection and Prediction”. In: IEEE Transactions on Big Data (2020). [207] Xing Zhang, Gongjian Wen, and Wei Dai. “A tensor decomposition-based anomaly detection algorithm for hyperspectral image”. In: IEEE Transactions on Geoscience and Remote Sensing 54.10 (2016), pp. 5801–5820. [208] Zemin Zhang et al. “Novel Methods for Multilinear Data Completion and De-noising Based on Tensor-SVD”. in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). July 2014. 125 [209] Qibin Zhao et al. “Tensor ring decomposition”. In: arXiv preprint arXiv:1606.05535 (2016). [210] Yu-Bang Zheng et al. “Tensor N-tubal rank and its convex relaxation for low-rank tensor recovery”. In: Information Sciences 532 (2020), pp. 170–189. [211] Hua Zhou, Lexin Li, and Hongtu Zhu. “Tensor regression with applications in neuroimaging data analysis”. In: Journal of the American Statistical Association 108.502 (2013), pp. 540–552. 126