TENSOR LEARNING WITH STRUCTURE, GEOMETRY AND MULTI-MODALITY

By

Seyyid Emre Sofuoglu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Electrical Engineering – Doctor of Philosophy

2022

ABSTRACT

TENSOR LEARNING WITH STRUCTURE, GEOMETRY AND MULTI-MODALITY

By Seyyid Emre Sofuoglu

With the advances in sensing and data acquisition technology, it is now possible to collect data from different modalities and sources simultaneously. Most of these data are multi-dimensional in nature and can be represented by multiway arrays known as tensors. For instance, a color image is a third-order tensor defined by two indices for the spatial variables and one index for the color mode. Other examples include color video, medical imaging data such as EEG and fMRI, and spatiotemporal data encountered in urban traffic monitoring. In the past two decades, tensors have become ubiquitous in signal processing, statistics and computer science.

Traditional unsupervised and supervised learning methods developed for one-dimensional signals do not translate well to higher order data structures, as they become computationally prohibitive with increasing dimensionality. Vectorizing high-dimensional inputs creates problems in nearly all machine learning tasks due to the exponentially increasing dimensionality, the distortion of the data structure and the difficulty of obtaining a sufficiently large training sample size.

In this thesis, we develop tensor-based approaches to various machine learning tasks. Existing tensor-based unsupervised and supervised learning algorithms extend many well-known algorithms, e.g. 2-D component analysis, support vector machines and linear discriminant analysis, with better performance and lower computational and memory costs. Most of these methods rely on Tucker decomposition, which has exponential storage complexity requirements; CANDECOMP/PARAFAC (CP) based methods, which might not have a solution; or Tensor Train (TT) based solutions, which suffer from exponentially increasing ranks. Many tensor-based methods have quadratic (with respect to the size of the data) or higher computational complexity, and similarly high memory complexity. Moreover, they are not always designed with the particular structure of the data in mind. Many of these methods use purely algebraic measures as the objective, which might not capture the local relations within the data. Thus, there is a need to develop new models with better computational and memory efficiency that are designed with the particular structure of the data and the problem in mind. Finally, as tensors represent the data with more faithfulness to the original structure compared to vectorization, they also allow coupling of heterogeneous data sources where the underlying physical relationship is known. Still, most of the work on coupled tensor decompositions does not explore supervised problems.

In order to address the issues around computational and storage complexity of tensor-based machine learning, in Chapter 2 we propose a new tensor train decomposition structure, which is a hybrid between Tucker and Tensor Train decompositions. The proposed structure is used to implement Tensor Train based supervised and unsupervised learning frameworks: linear discriminant analysis (LDA) and graph regularized subspace learning. The algorithm is designed to solve extremal eigenvalue-eigenvector pair computation problems, which can be generalized to many other methods.
The supervised framework, Tensor Train Discriminant Analysis (TTDA), is evaluated on a classification task at varying storage complexities with respect to classification accuracy and training time on four different datasets. The unsupervised approach, Graph Regularized TT, is evaluated on a clustering task with respect to clustering quality and training time at various storage complexities. Both frameworks are compared to discriminant analysis algorithms with similar objectives based on Tucker and TT decompositions.

In Chapter 3, we present an unsupervised anomaly detection algorithm for spatiotemporal tensor data. The algorithm models the anomaly detection problem as a low-rank plus sparse tensor decomposition problem, where the normal activity is assumed to be low-rank and the anomalies are assumed to be sparse and temporally continuous. We present an extension of this algorithm, where we utilize a graph regularization term in our objective function to preserve the underlying geometry of the original data. Finally, we propose a computationally efficient implementation of this framework by approximating the nuclear norm using graph total variation minimization. The proposed approach is evaluated both on simulated data, with varying levels of anomaly strength, anomaly length and number of missing entries in the tensor, and on urban traffic data.

In Chapter 4, we propose a geometric tensor learning framework using product graph structures for the tensor completion problem. Instead of purely algebraic measures such as rank, we use graph smoothness constraints that utilize geometric or topological relations within the data. We prove the equivalence of a Cartesian graph structure to a TT-based graph structure under some conditions. We show empirically that the relaxations introduced by these conditions do not deteriorate the recovery performance. We also outline a fully geometric learning method on product graphs for data completion.

In Chapter 5, we introduce a supervised learning method for heterogeneous data sources such as simultaneous EEG and fMRI. The proposed two-stage method first extracts features taking the coupling across modalities into account and then introduces kernelized support tensor machines for classification. We illustrate the advantages of the proposed method on simulated and real classification tasks with a small number of high-dimensional training samples.

ACKNOWLEDGEMENTS

Bismillah ir-Rahman ir-Rahim (In the name of Allah, the Most Gracious, the Most Merciful).

All praise be to Allah, the Lord of the universe(s). Peace be upon the prophet of Allah, Muhammad (s.a.w.), who is the guide and the best example to all humanity, especially for those who are in search of truth and wisdom. My thesis is but a minuscule result of the guidance of the prophet of Allah in the way of knowledge.

First and foremost, I would like to thank my advisor, Dr. Aviyente, for her guidance, patience, encouragement, and warm support throughout my research. Without her, the quality of my work would not have been a percent of what it is. She has been a role model and a motivation for me due to her deep dedication, firm discipline, and caring attentiveness to the work of her students.

I am deeply grateful to my wife, who sacrificed so much along this process and has been there for me through thick and thin. No amount of thanks could be enough. I would like to also thank my mother, who deserves more credit than any one person for any achievement or success of mine.
Also, to my father, because of whom I came to believe that a PhD was possible for me, who always supported me with his wisdom, and who never lost his belief in me. To my mother-in-law and father-in-law, who have always been there for us, kept us in their prayers, and were sincerely interested in my work. Finally, I thank my little son, who was my motivation, happiness, and emotional support.

There are too many people who supported me in writing this thesis for me to be able to mention each one of them. I ask for the forgiveness of any of my benefactors whom I could not thank by name. May Allah shower all who helped me with success and honour in both their lives.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
CHAPTER 1 INTRODUCTION
1.1 Background and Notation
1.2 Robust Principal Component Analysis
1.3 Geometric Learning
1.3.1 Robust PCA on Graphs
1.3.2 Spectral Geometric Matrix Completion
1.4 Tensor Decompositions
1.4.1 Tucker Decomposition (TD)
1.4.2 Canonical/Polyadic (CP) Decomposition
1.4.3 Tensor Train Decomposition (TT)
1.5 Robust PCA for Tensors
1.6 Linear Discriminant Analysis (LDA) for Tensors
1.7 Organization and Contributions of the Thesis
CHAPTER 2 MULTI-BRANCH TENSOR TRAIN STRUCTURE FOR SUPERVISED AND UNSUPERVISED LEARNING
2.1 Introduction
2.2 Tensor Train Discriminant Analysis
2.3 Multi-Branch Tensor Train Discriminant Analysis
2.3.1 Two-way Tensor Train Discriminant Analysis (2WTTDA)
2.3.2 Three-way Tensor Train Discriminant Analysis (3WTTDA)
2.4 Analysis of Storage, Training Complexity and Convergence
2.4.1 Computational Complexity
2.4.2 Convergence
2.5 Experiments
2.5.1 Data Sets
2.5.2 Classification Accuracy
2.5.3 Training Complexity
2.5.4 Convergence
2.5.5 Effect of Sample Size on Accuracy
2.5.6 Summary of Experimental Results
2.6 Graph Regularized Tensor Train Decomposition
2.6.1 Problem Statement
2.6.2 Optimization
2.6.3 Convergence
2.7 Experiments
2.7.1 MNIST
2.7.2 COIL
2.8 Conclusions
CHAPTER 3 TENSOR METHODS FOR ANOMALY DETECTION ON SPATIOTEMPORAL DATA
3.1 Introduction
3.2 Related Work
3.3 Robust Low-Rank Tensor Decomposition for Anomaly Detection
3.3.1 Problem Statement
3.3.2 Optimization
3.3.3 Computational Complexity
3.4 Low-rank On Graphs Plus Temporally Smooth Sparse Decomposition
3.4.1 Optimization
3.4.2 Computational Complexity of LOGSS
3.5 Convergence
3.6 Anomaly Scoring
3.7 Experiments
3.7.1 Data Description
3.7.2 Parameter Selection
3.7.3 Experiments on Synthetic Data
3.7.4 Experiments on Real Data
3.8 Conclusions
CHAPTER 4 GEOMETRIC TENSOR LEARNING
4.1 Introduction
4.2 Tensor Train Robust PCA on Graphs
4.2.1 Kronecker Structured Graphs
4.2.2 Optimization
4.2.3 Computation and Memory Complexity for Graphs
4.3 Experiments
4.3.1 Synthetic Data
4.3.2 Real Data
4.4 Conclusions
CHAPTER 5 COUPLED SUPPORT TENSOR MACHINE
5.1 Introduction
5.2 Related Work
5.2.1 Coupled Matrix Tensor Factorization
5.2.2 CP-STM for Tensor Classification
5.2.3 Multiple Kernel Learning
5.3 Methods
5.3.1 Multimodal Tensor Factorization
5.3.2 Coupled Support Tensor Machine (C-STM)
5.4 Model Estimation
5.5 Experiments
5.5.1 Parameter Selection
5.5.1.1 Multimodal Tensor Factorization
5.5.1.2 C-STM
5.5.2 Simulated Data
5.5.2.1 Kernel Selection
5.5.3 EEG-fMRI Data
5.6 Conclusion
CHAPTER 6 CONCLUSIONS
6.1 Future Work
6.1.1 Multi-Branch Tensor Learning
6.1.2 Tensor Methods for Anomaly Detection
6.1.3 Geometric Tensor Learning
6.1.4 Supervised Coupled Tensor Learning
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Storage complexities of different tensor decomposition structures.
Table 2.2: Computational complexities of various algorithms. The number of iterations to find the subspaces is denoted as $t_c$ for CMDA and $t_t$ for TT-based methods. $C_s = 2CK$. ($r \ll I$, $t_t r(r + N/f - 1) \ll C_s$, and $I^{N/f} \gg r^6$.)
Table 2.3: Classification accuracy (top) and training time (bottom) with standard deviation for various methods and datasets.
Table 3.1: Properties of anomaly detection methods used in the experiments. The acronyms refer to the different attributes of the cost function: (LR) low-rank, (SP) sparse, (WLR) weighted low-rank, (SR) smoothness regularization.
Table 3.2: Mean and standard deviation of AUC values for various $c$ and $P$. In the experiments for each variable, the remaining variables are fixed at $c = 2.5$, $P = 0\%$, $l = 7$ and $m = 2.3\%$. The proposed methods outperform the other algorithms in all cases significantly, with $p < 0.001$.
Table 3.3: Mean and standard deviation of run times (seconds) for various methods.
Table 3.4: Events of interest for NYC in 2018.
Table 3.5: Results for 2018 NYC Yellow Taxi data. Columns indicate the percentage of selected points with top anomaly scores. The table entries correspond to the number of events detected at the corresponding percentage.
Table 3.6: Results on 2018 NYC Bike Trip data.
Table 4.1: Denoising performance on synthetic data against varying levels of gross noise $c\%$ for various methods.
Table 4.2: Denoising performance on real data against varying levels of gross noise $c\%$ for various methods.
Table 5.1: Distribution specifications for the simulation study; MVN stands for multivariate normal distribution. $I$ are identity matrices. Bold numbers are vectors whose elements are all equal to the given number.
Table 5.2: Various kernel combination schemes. Note that $K_2^{(2)} = K_1^{(3)}$.
Table 5.3: Classification accuracy using different kernel combinations.
Table 5.4: Real data result: simultaneous EEG-fMRI data trial classification (mean of performance metrics with standard deviations in subscripts).

LIST OF FIGURES

Figure 1.1: Illustration of tensors and tensor merging product using tensor network notations. Each node represents a tensor and each edge represents a mode of the tensor. (a) Tensor $\mathcal{A}$, (b) Tensor $\mathcal{B}$, (c) Tensor merging product between modes $(n, m)$ and $(n+1, m-1)$.
Figure 1.2: Tensor network notation for Tucker decomposition.
Figure 1.3: Tensor Train decomposition of $\mathcal{Y}$ using tensor merging products.
Figure 2.1: Tensor $\mathcal{A}_n$ is formed by first merging $\mathcal{U}_n^R$, $\mathcal{U}_{n-1}^L$ and $\mathcal{S}$, and then applying the trace operation across the 4th and 8th modes of the resulting tensor. The green line at the bottom of the diagram refers to the trace operator.
Figure 2.2: Illustration of the proposed methods (compare (a) and (b) with Figures 1.2 and 1.3): (a) the proposed tensor network structure for 2WTT; (b) the proposed tensor network structure for 3WTT; (c) the flow diagram for 2WTTDA (Algorithm 2.2); (d) the flow diagram for 3WTTDA.
Figure 2.3: Comparisons with a BTT based Ritz pair computation algorithm:
(a) classification accuracy and (b) training time with respect to normalized storage cost.
Figure 2.4: Classification accuracy vs. normalized storage cost of the different methods for: (a) COIL-100, (b) Weizmann Face, (c) Cambridge Hand Gesture and (d) UCF-101. All TD based methods are denoted using 'x', TT based methods using '+' and the proposed methods using '*'. STTM and LDA are denoted using '△' and 'o', respectively.
Figure 2.5: Training complexity vs. normalized storage cost of the different methods for: (a) COIL-100, (b) Weizmann Face, (c) Cambridge Hand Gesture, and (d) UCF-101.
Figure 2.6: Convergence curve for TTDA on COIL-100. Objective value vs. the number of iterations is shown.
Figure 2.7: Comparison of classification accuracy vs. training sample size on the Weizmann Face dataset for different methods.
Figure 2.8: (a) Normalized mutual information vs. storage complexity of different methods for the MNIST dataset. (b) Computation time vs. storage complexity of different methods for the MNIST dataset.
Figure 2.9: (a) Normalized mutual information vs. storage complexity of different methods for the COIL dataset. (b) Computation time vs. storage complexity of different methods for the COIL dataset.
Figure 3.1: Mean AUC values for various choices of: (a) $\lambda$ and $\gamma$, (b) $\theta$ and $\psi_1$, (c) $\psi_2$ and $\psi_4$. Mean AUC values across 10 random experiments are reported for each hyperparameter pair. For each set of experiments, the remaining hyperparameters are fixed.
Figure 3.2: AUC of ROC w.r.t. $l$ and $m$ with $c = 2.5$, $P = 0\%$.
Figure 3.3: ROC curves for various amplitudes of anomalies. Higher amplitude means more separability. $c =$ (a) 1.5, (b) 2, (c) 2.5. ($P = 0\%$, $l = 7$, $m = 2.3\%$)
Figure 3.4: ROC curves for varying percentages of missing data: (a) $P = 20\%$, (b) $P = 40\%$, (c) $P = 60\%$. ($c = 2.5$, $l = 7$, $m = 2.3\%$)
Figure 3.5: Bike activity data, the extracted sparse part and low-rank part for the July 4th celebrations at the Hudson River banks. (a) Real data, where the traffic for 52 Wednesdays is shown along with the traffic on Independence Day and the average traffic; (b) sparse tensor, where the curve corresponding to the anomaly is highlighted; (c) low-rank tensor, with the curve corresponding to Independence Day highlighted.
Figure 4.1: Phase diagrams for missing data recovery.
Figure 5.1: Illustration of the Coupled Tensor Matrix Model.
Figure 5.2: C-STM model pipeline.
Figure 5.3: Simulation result: average accuracy rates shown in bar plots; standard deviation of accuracy rates shown by error bars.
Figure 5.4: Region of Interest (ROI).

LIST OF ALGORITHMS

Algorithm 2.1: Tensor Train Discriminant Analysis (TTDA)
Algorithm 2.2: Two-Way Tensor Train Discriminant Analysis (2WTTDA)
Algorithm 2.3: Graph Regularized Tensor Train-ADMM (GRTT-ADMM)
Algorithm 3.1: GLOSS
Algorithm 3.2: LOGSS
Algorithm 4.1: TTRPCA-G/nG
Algorithm 5.1: ACMTF Decomposition
Algorithm 5.2: Coupled Support Tensor Machine

CHAPTER 1
INTRODUCTION

With the advance of sensing and data acquisition technology, it is now possible to collect data from different modalities and sources simultaneously. Some examples include medical imaging data such as fMRI, hyperspectral images, computer vision data such as multiview camera recordings, liquid-chromatography/mass spectrometry data in chemometrics, and large scale gene expression data in bioinformatics. Most of these data are multi-dimensional in nature and can be represented by multiway arrays known as tensors [48]. For instance, a color image is a third-order tensor defined by two indices for the spatial variables and one index for the color mode. Similarly, a video comprised of color images is a fourth-order tensor, time being the fourth dimension besides the spatial and spectral ones.

In order to efficiently process large datasets, there is an increasing interest in near real-time processing methods, especially in the case of multimedia, remote-sensing and biological data. Another challenge is to come up with methods that are scalable with the size of the data. Although the computational power and memory sizes of modern systems keep increasing, traditional methods are exponentially expensive. As the data collected are not always clean and complete, there is also a need for methods that are robust against missing data and outliers, and that can account for the underlying structure or topology of the data. Finally, there is also an emerging interest in extracting information from multiple heterogeneous modalities simultaneously. As tensor decompositions are natural, sparse and distributed representations of big data, all the aforementioned problems have spurred an interest in efficient tensor algorithms suitable for massive datasets.

Traditional unsupervised and supervised learning methods developed for one-dimensional signals, or vectors, do not translate well to higher order data structures as they become computationally prohibitive with increasing dimensionality, i.e., the 'curse of dimensionality'. Vectorizing high-dimensional inputs also creates problems in nearly all machine learning tasks due to the difficulty of obtaining a sufficiently large training sample size. Moreover, important information about the structure of the high-dimensional space that the data lie in is lost through vectorization, which reduces the effectiveness of these methods. Therefore, there is a need for extracting compact representations from the original tensor data that result in accurate and interpretable models. Tensor decompositions can be defined as structures which represent higher order data as a set of low-order core tensors, thus allowing for better interpretability and computational advantages [45].
Tensor decompositions are natural extensions of matrix decompositions or factorizations, such as the singular value decomposition (SVD), principal component analysis (PCA) and non-negative matrix factorization (NMF), to higher order data. However, there are various advantages of tensor decompositions compared to matrix factorizations, such as the ability to efficiently parameterize high dimensional spaces and to account for intrinsic multidimensional and distributed patterns present in data. Thus, they allow models to capture multiple interactions and couplings in data in a more structured and interpretable manner.

Tensors first started to gain attention in psychometrics [174, 175] and then in chemometrics [27, 160]. After being introduced to many fields such as signal processing [48, 47, 46, 18, 183], statistics and computer science [38, 43, 96], tensors have received great attention. The work on tensors is too diverse and numerous to be covered in this thesis, so we refer the interested reader to the survey papers [93, 44, 45, 159, 81]. Recent years have also seen a growth in the development of tensor decomposition methods for machine learning. Various extensions to tensors of unsupervised methods such as PCA, robust PCA, SVD, NMF, neighborhood preserving embedding (NPE), locally linear embedding (LLE) and canonical correlation analysis (CCA) [48, 70, 154, 106, 78, 91], and of supervised methods such as linear discriminant analysis (LDA), support vector machines (SVM), regression and deep neural networks [195, 196, 168, 112, 104, 74, 38, 97, 137, 96, 95], have been explored. Some applications include face recognition, object classification, dimensionality reduction, clustering, anomaly detection, learning latent variable models and prediction.

Although most tensor based methods are developed as extensions of existing vector based methods, it is possible to develop new approaches with structures specific to tensors. By utilizing tensors, one can introduce structural models in many different ways, such as deep tensor network structures, Kronecker or Cartesian structured product graphs, and multi-modal coupling of data collected from different physical sensors, that are otherwise not relevant or applicable. Tensor based methods also often have milder requirements for theoretical performance analysis such as convergence, sample size, and recovery guarantees [70, 44, 45].

1.1 Background and Notation

In this thesis, we denote numbers and scalars with letters such as $x, y, N$. Vectors are denoted by boldface lowercase letters, e.g. $\mathbf{x}, \mathbf{y}$. Matrices are denoted by italic capital letters such as $X, Y$. Multi-dimensional tensors are denoted by calligraphic capital letters such as $\mathcal{X}, \mathcal{Y}$. The order of a tensor is the number of dimensions of the data hypercube, also known as ways or modes. For example, a scalar can be regarded as a zeroth-order tensor, a vector as a first-order tensor, and a matrix as a second-order tensor. Let $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ be a tensor. Vectors obtained by fixing all indices of the tensor except the one that corresponds to the $n$th mode are called mode-$n$ fibers and are denoted as $\mathbf{a}_{i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N} \in \mathbb{R}^{I_n}$.

Definition 1. (Vectorization, Matricization and Reshaping) $\mathrm{V}(\cdot)$ is a vectorization operator such that $\mathrm{V}(\mathcal{A}) \in \mathbb{R}^{I_1 I_2 \ldots I_N \times 1}$. $\mathrm{T}_n(\cdot)$ is a tensor-to-matrix reshaping operator defined as $\mathrm{T}_n(\mathcal{A}) \in \mathbb{R}^{I_1 \ldots I_n \times I_{n+1} \ldots I_N}$, and the inverse operator is denoted as $\mathrm{T}_n^{-1}(\cdot)$.
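To make the operators of Definition 1 concrete, the following is a minimal numpy sketch (not code from this thesis). It assumes a fixed row-major index ordering; the helper names are ours.

```python
import numpy as np

def vectorize(A):
    """V(A): stack all entries of A into a column vector of length I1*I2*...*IN."""
    return A.reshape(-1, 1)

def unfold_first_n(A, n):
    """T_n(A): reshape A into a matrix of size (I1*...*In) x (I_{n+1}*...*I_N)."""
    I = A.shape
    return A.reshape(int(np.prod(I[:n])), int(np.prod(I[n:])))

def fold_first_n(M, shape):
    """Inverse operator T_n^{-1}(.): reshape the matrix back to the original tensor shape."""
    return M.reshape(shape)

A = np.arange(2 * 3 * 4).reshape(2, 3, 4)                            # a 2 x 3 x 4 tensor
print(vectorize(A).shape)                                            # (24, 1)
print(unfold_first_n(A, 2).shape)                                    # (6, 4)
print(np.allclose(fold_first_n(unfold_first_n(A, 2), A.shape), A))   # True
```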
Definition 2. (Mode-$n$ unfolding and canonical unfolding, [197]) The mode-$n$ unfolding of a tensor $\mathcal{A}$ is defined as $A_{(n)} \in \mathbb{R}^{I_n \times \prod_{n'=1, n' \neq n}^{N} I_{n'}}$, where the mode-$n$ fibers of the tensor $\mathcal{A}$ are the columns of $A_{(n)}$ and the remaining modes are organized accordingly along the rows. The mode-$n$ canonical unfolding, or mode-$1, 2, \ldots, n$ unfolding, of a tensor $\mathcal{Y}$ is defined as $Y_{[n]} \in \mathbb{R}^{\prod_{i=1}^{n} I_i \times \prod_{i=n+1}^{N} I_i}$.

Definition 3. (Left and right unfolding) The left unfolding operator creates a matrix from a tensor by taking all modes except the last mode as row indices and the last mode as column indices, i.e., $\mathrm{L}(\mathcal{A}) \in \mathbb{R}^{I_1 I_2 \ldots I_{N-1} \times I_N}$, which is equivalent to $\mathrm{T}_{N-1}(\mathcal{A})$. Right unfolding transforms a tensor into a matrix by taking all the first-mode fibers as column vectors, i.e., $\mathrm{R}(\mathcal{A}) \in \mathbb{R}^{I_1 \times I_2 I_3 \ldots I_N}$, which is equivalent to $\mathrm{T}_1(\mathcal{A})$. The inverses of these operators are denoted as $\mathrm{L}^{-1}(\cdot)$ and $\mathrm{R}^{-1}(\cdot)$, respectively. A tensor is defined to be left (right)-orthogonal if its left (right) unfolding is orthogonal.

Definition 4. (Tensor trace) The tensor trace is defined on slices of a tensor and contracts them to scalars. Let $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with $I_{k'} = I_k$; then the trace operator on modes $k'$ and $k$ is defined as:
$$\mathcal{D} = \mathrm{tr}_{kk'}(\mathcal{A}) = \sum_{i_{k'} = i_k = 1}^{I_k} \mathcal{A}(:, \ldots, i_{k'}, :, \ldots, i_k, :, \ldots, :),$$
where $\mathcal{D} \in \mathbb{R}^{I_1 \times \cdots \times I_{k'-1} \times I_{k'+1} \times \cdots \times I_{k-1} \times I_{k+1} \times \cdots \times I_N}$ is an $(N-2)$-mode tensor.

Definition 5. (Tensor Merging Product) The tensor merging product connects two tensors along given sets of modes. For two tensors $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\mathcal{B} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_M}$ where $I_n = J_m$ and $I_{n+1} = J_{m-1}$ for some $n$ and $m$, the tensor merging product is given by [45]:
$$\mathcal{C} = \mathcal{A} \times_{n, n+1}^{m, m-1} \mathcal{B}.$$
[Figure 1.1: Illustration of tensors and tensor merging product using tensor network notations. Each node represents a tensor and each edge represents a mode of the tensor. (a) Tensor $\mathcal{A}$, (b) Tensor $\mathcal{B}$, (c) Tensor merging product between modes $(n, m)$ and $(n+1, m-1)$.]
$\mathcal{C} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+2} \times \cdots \times I_N \times J_1 \times \cdots \times J_{m-2} \times J_{m+1} \times \cdots \times J_M}$ is an $(N + M - 4)$-mode tensor that is calculated as:
$$\mathcal{C}(i_1, \ldots, i_{n-1}, i_{n+2}, \ldots, i_N, j_1, \ldots, j_{m-2}, j_{m+1}, \ldots, j_M) = \sum_{t_1=1}^{I_n} \sum_{t_2=1}^{J_{m-1}} \mathcal{A}(i_1, \ldots, i_{n-1}, i_n = t_1, i_{n+1} = t_2, i_{n+2}, \ldots, i_N)\, \mathcal{B}(j_1, \ldots, j_{m-2}, j_{m-1} = t_2, j_m = t_1, j_{m+1}, \ldots, j_M).$$
A graphical representation of the tensors $\mathcal{A}$ and $\mathcal{B}$ and the tensor merging product defined above is given in Figure 1.1. A special case of the tensor merging product can be considered for the case where $I_n = J_m$ for all $n, m \in \{1, \ldots, N-1\}$, $M \geq N$. In this case, the tensor merging product across the first $N-1$ modes is defined as:
$$\mathcal{C}' = \mathcal{A} \times_{1, \ldots, N-1}^{1, \ldots, N-1} \mathcal{B}, \qquad (1.1)$$
where $\mathcal{C}' \in \mathbb{R}^{I_N \times J_N \times \cdots \times J_M}$. This can equivalently be written as:
$$\mathrm{R}(\mathcal{C}') = \mathrm{L}(\mathcal{A})^\top \mathrm{T}_{N-1}(\mathcal{B}), \qquad (1.2)$$
where $\mathrm{R}(\mathcal{C}') \in \mathbb{R}^{I_N \times \prod_{m=N}^{M} J_m}$.

Definition 6. (Mode-$n$ product) The mode-$n$ product of a tensor $\mathcal{A}$ and a matrix $U \in \mathbb{R}^{J \times I_n}$ is denoted as $\mathcal{B} = \mathcal{A} \times_n U$ and defined as $b_{i_1, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_n, \ldots, i_N}\, u_{j, i_n}$, where $\mathcal{B} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$. Notice that this is equivalent to the tensor merging product $\mathcal{B} = \mathcal{A} \times_n^2 U$.
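As an illustration of Definitions 5 and 6, the following sketch computes a mode-$n$ product and a small tensor merging product with numpy; the helper names and shapes are illustrative assumptions, not code from this thesis.

```python
import numpy as np

def mode_n_product(A, U, n):
    """B = A x_n U (0-based mode n here): contract mode n of A with the columns of U (J x I_n)."""
    return np.moveaxis(np.tensordot(U, A, axes=(1, n)), 0, n)

A = np.random.rand(2, 3, 4)      # I1 x I2 x I3
U = np.random.rand(5, 3)         # J x I2
B = mode_n_product(A, U, 1)
print(B.shape)                   # (2, 5, 4)

# Tensor merging product C = A x_{2,3}^{2,1} B_t for A in R^{2x3x4} and B_t in R^{4x3x5}:
# modes (I2, I3) of A are contracted with modes (J2, J1) of B_t, leaving modes (I1, J3).
B_t = np.random.rand(4, 3, 5)
C = np.einsum('iab,bac->ic', A, B_t)
print(C.shape)                   # (2, 5)
```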
Definition 7. (Mode-$n$ Graph Laplacian) The mode-$n$ adjacency matrix $W^n$ corresponding to the mode-$n$ similarity graph of a tensor $\mathcal{Y}$ is defined as:
$$W^n_{ss'} = \begin{cases} e^{-\frac{\|\mathbf{y}_{(n),s} - \mathbf{y}_{(n),s'}\|_F^2}{2\sigma_n^2}}, & \text{if } \mathbf{y}_{(n),s} \in \mathcal{N}_k(\mathbf{y}_{(n),s'}) \text{ or } \mathbf{y}_{(n),s'} \in \mathcal{N}_k(\mathbf{y}_{(n),s}), \\ 0, & \text{otherwise}, \end{cases}$$
where $\mathcal{N}_k(\mathbf{y}_{(n),s})$ is the Euclidean $k$-nearest neighborhood of the $s$th row of $Y_{(n)}$, $\mathbf{y}_{(n),s} \in \mathbb{R}^{\prod_{i=1, i \neq n}^{N} I_i}$. The mode-$n$ graph Laplacian $\Phi_n$ is then defined as $\Phi_n = D_n - W^n$, where $D_n$ is a diagonal degree matrix with $D^n_{i,i} = \sum_{i'=1}^{I_n} W^n_{i,i'}$. The eigendecomposition of $\Phi_n$ can be written as $\Phi_n = P_n \Lambda_n P_n^\top$, where $P_n$ is the matrix of eigenvectors and $\Lambda_n$ is a diagonal matrix with the eigenvalues on the diagonal, in non-descending order.

Definition 8. (Product Graphs) Let $\{\mathcal{G}_n\}$ be a set of $N$ graphs, where $\mathcal{G}_n = \{V_n, E_n\}$ with $V_n$ the vertices and $E_n$ the edges of $\mathcal{G}_n$. Let $\Phi_n \in \mathbb{R}^{I_n \times I_n}$ be the graph Laplacian of $\mathcal{G}_n$. A product graph $\mathcal{G}$ is created by unifying $\{\mathcal{G}_n\}$ in a specific manner. A Kronecker product graph, or Kronecker graph, is defined as [151]:
$$\mathcal{G} = \mathcal{G}_N \otimes \mathcal{G}_{N-1} \otimes \cdots \otimes \mathcal{G}_1, \qquad (1.3)$$
and has a Laplacian defined as:
$$\Phi = \Phi_N \otimes \Phi_{N-1} \otimes \cdots \otimes \Phi_1. \qquad (1.4)$$
A Cartesian product graph, on the other hand, is defined as [151]:
$$\mathcal{G} = \mathcal{G}_N \otimes \mathrm{I}_{\prod_{n=1}^{N-1} I_n} + \mathrm{I}_{I_N} \otimes \mathcal{G}_{N-1} \otimes \mathrm{I}_{\prod_{n=1}^{N-2} I_n} + \cdots + \mathrm{I}_{\prod_{n=2}^{N} I_n} \otimes \mathcal{G}_1, \qquad (1.5)$$
with Laplacian:
$$\Phi = \Phi_N \otimes \mathrm{I}_{\prod_{n=1}^{N-1} I_n} + \mathrm{I}_{I_N} \otimes \Phi_{N-1} \otimes \mathrm{I}_{\prod_{n=1}^{N-2} I_n} + \cdots + \mathrm{I}_{\prod_{n=2}^{N} I_n} \otimes \Phi_1. \qquad (1.6)$$
The graph smoothness objective defined over all modes of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$,
$$\sum_{n=1}^{N} \mathrm{tr}(X_{(n)}^\top \Phi_n X_{(n)}), \qquad (1.7)$$
is equivalent to:
$$\mathrm{V}(\mathcal{X})^\top \Phi\, \mathrm{V}(\mathcal{X}), \qquad (1.8)$$
where $\Phi$ is the graph Laplacian of the Cartesian product graph.
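The equivalence of (1.7) and (1.8) can be checked numerically. The sketch below is our own illustration (simple path-graph Laplacians stand in for the similarity graphs of Definition 7); it assumes a column-major vectorization so that mode 1 corresponds to the right-most Kronecker factor in (1.6).

```python
import numpy as np

def path_laplacian(n):
    # Laplacian of a path graph on n vertices (stand-in for a mode similarity graph).
    W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
    return np.diag(W.sum(axis=1)) - W

def cartesian_laplacian(phis):
    # Kronecker-sum Laplacian of the Cartesian product graph, eq. (1.6).
    sizes = [phi.shape[0] for phi in phis]
    Phi = np.zeros((int(np.prod(sizes)),) * 2)
    for n, phi in enumerate(phis):
        left = int(np.prod(sizes[n + 1:]))    # identity over the slower-varying modes
        right = int(np.prod(sizes[:n]))       # identity over the faster-varying modes
        Phi += np.kron(np.eye(left), np.kron(phi, np.eye(right)))
    return Phi

dims = (3, 4, 5)
X = np.random.rand(*dims)
phis = [path_laplacian(i) for i in dims]

# Left-hand side: mode-wise smoothness terms, eq. (1.7).
lhs = 0.0
for n in range(len(dims)):
    Xn = np.moveaxis(X, n, 0).reshape(dims[n], -1)    # mode-n unfolding X_(n)
    lhs += np.trace(Xn.T @ phis[n] @ Xn)

# Right-hand side: vectorized form on the product graph, eq. (1.8).
x = X.flatten(order='F')
rhs = x @ cartesian_laplacian(phis) @ x
print(np.isclose(lhs, rhs))                           # True
```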
Definition 9. (Mode-$n$ Concatenation) Mode-$n$ concatenation unfolds the input tensors along mode $n$ and stacks the unfolded matrices across rows:
$$\mathrm{cat}_n(\mathcal{A}, \mathcal{A}) = \begin{bmatrix} A_{(n)}^\top & A_{(n)}^\top \end{bmatrix}^\top. \qquad (1.9)$$
If the input is a set of tensors, e.g. $\{\mathcal{A}\} = \{\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_M\}$, where $\mathcal{A}_m \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, $\forall m \in \{1, \ldots, M\}$, then $\mathrm{cat}_n(\{\mathcal{A}\})$ stacks all mode-$n$ unfoldings of the tensors $\{\mathcal{A}\}$ across rows into a matrix of size $M I_n \times \prod_{n'=1, n' \neq n}^{N} I_{n'}$.

Definition 10. (Tensor norms) In this thesis, we employ three different tensor norms. The Frobenius norm of a tensor is defined as $\|\mathcal{A}\|_F = \sqrt{\sum_{i_1, i_2, \ldots, i_N} a_{i_1, i_2, \ldots, i_N}^2}$. The $\ell_1$ norm of a tensor is defined as $\|\mathcal{A}\|_1 = \sum_{i_1, i_2, \ldots, i_N} |a_{i_1, i_2, \ldots, i_N}|$. In this thesis, the nuclear norm of a tensor is defined as the weighted sum of the nuclear norms of all mode-$n$ unfoldings of the tensor, namely $\|\mathcal{A}\|_* = \sum_{n=1}^{N} \psi_n \|A_{(n)}\|_*$, where the $\psi_n$ are the weights corresponding to each mode [173, 70].

Definition 11. (Support Set) Let $\Omega$ be a support set defined for a tensor $\mathcal{A}$, i.e. $\Omega \subseteq \{1, \ldots, I_1\} \times \{1, \ldots, I_2\} \times \cdots \times \{1, \ldots, I_N\}$. The projection operator on this support set, $P_\Omega$, is defined as:
$$P_\Omega[\mathcal{A}]_{i_1, i_2, \ldots, i_N} = \begin{cases} \mathcal{A}_{i_1, i_2, \ldots, i_N}, & (i_1, i_2, \ldots, i_N) \in \Omega, \\ 0, & \text{otherwise}. \end{cases} \qquad (1.10)$$
The orthogonal complement of the operator $P_\Omega$ is defined in a similar manner as:
$$P_{\Omega^\perp}[\mathcal{A}]_{i_1, i_2, \ldots, i_N} = \begin{cases} \mathcal{A}_{i_1, i_2, \ldots, i_N}, & (i_1, i_2, \ldots, i_N) \notin \Omega, \\ 0, & \text{otherwise}. \end{cases} \qquad (1.11)$$

1.2 Robust Principal Component Analysis

Principal component analysis (PCA) is one of the most common algorithms for dimensionality reduction [30]. The aim of PCA is to find a subspace of smaller dimension in which the projected data has the highest variance among all possible subspaces with the same dimension. This model is based on the assumption that the data has an i.i.d. Gaussian distribution. Thus, with grossly corrupted or partially observed data, PCA fails to estimate the best subspace, or the best low-rank approximation to the data. To alleviate the issues with these outliers, robust PCA (RPCA) methods [30, 35] have been proposed, where the aim is to separate the data into a low-rank part, corresponding to the clean data, and a sparse part, corresponding to gross corruptions or outliers. In [30, 35], this problem is formulated as:
$$\underset{L, S}{\text{minimize}} \; \|L\|_* + \lambda \|S\|_1 \quad \text{subject to} \quad L + S = Y, \qquad (1.12)$$
where the observed data is $Y$ and $\lambda$ is a regularization parameter. It was proven that, under some conditions on the true $L$ and $S$, an alternating minimization based approach can recover the true $L$. This algorithm can also be used for matrix completion [30].

Although very successful in recovering low-rank data from gross corruptions, RPCA has some shortcomings. First, the algorithm requires an SVD at every step of the optimization, which becomes intractable with increasing data size. Second, the underlying assumption of incoherence is not always easy to satisfy, as in many real datasets the underlying clean data is structured and can be sparse. Finally, the outliers themselves could be structured, which violates the randomness of the support of the sparse part.
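For illustration, a minimal ADMM-style sketch of problem (1.12) (principal component pursuit) is given below, using the standard singular value thresholding and soft thresholding updates; parameter choices follow common defaults, and this is not the solver developed in later chapters.

```python
import numpy as np

def soft_threshold(X, tau):
    # Proximal operator of the l1 norm.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    # Proximal operator of the nuclear norm (singular value thresholding).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def rpca(Y, lam=None, mu=None, n_iter=200, tol=1e-7):
    """Minimize ||L||_* + lam*||S||_1 subject to L + S = Y, eq. (1.12)."""
    m, n = Y.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 0.25 * m * n / np.abs(Y).sum()
    L = np.zeros_like(Y); S = np.zeros_like(Y); Z = np.zeros_like(Y)   # Z: dual variable
    for _ in range(n_iter):
        L = svd_threshold(Y - S + Z / mu, 1.0 / mu)    # nuclear-norm proximal step
        S = soft_threshold(Y - L + Z / mu, lam / mu)   # l1-norm proximal step
        residual = Y - L - S
        Z = Z + mu * residual                          # dual ascent on the constraint
        if np.linalg.norm(residual) <= tol * np.linalg.norm(Y):
            break
    return L, S

# Example: a rank-2 matrix with sparse gross corruptions.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
S0 = 10.0 * (rng.random((50, 40)) < 0.05) * rng.standard_normal((50, 40))
L_hat, S_hat = rpca(L0 + S0)
print(np.linalg.norm(L_hat - L0) / np.linalg.norm(L0))   # relative error of the low-rank part
```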
1.3 Geometric Learning

In this thesis, we explore a variety of approaches to supervised and unsupervised learning that exploit underlying geometric or topological relations between the rows of each mode. Such relations are common in many types of real data, e.g. recommendation systems, drug-target interaction and genomics. Usually, such relations are represented by a graph, or by a set of graphs with relationships across these graphs. We unite these approaches under the term geometric learning. Such geometric structures within data are often not taken into account by many methods, as purely algebraic terms, such as rank, cannot account for them [22]. In many modern signal processing applications, graph-based priors have been used to extract low-dimensional structure from high dimensional data [57, 83, 170]. Representation of a signal on a graph is also motivated by the emerging field of signal processing on graphs, based on notions of spectral graph theory [155, 154, 146]. The underlying assumption is that high-dimensional data samples lie on or close to a smooth low-dimensional manifold, represented by a graph $G$. In this section, we explore several extensions of well-known dimensionality reduction algorithms using geometric learning concepts.

In the specific case of manifold learning, geometric learning is utilized as follows. Tensor samples for a data set with no labels are denoted as $\mathcal{Y}_s \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, where $s \in \{1, \ldots, S\}$. For two tensor samples, $\mathcal{Y}_s$ and $\mathcal{Y}_{s'}$, and their respective low-dimensional projections, $X_s$ and $X_{s'}$, the regularization that is employed to preserve the underlying geometry can be formulated as:
$$\sum_{s=1}^{S} \sum_{\substack{s'=1 \\ s' \neq s}}^{S} \|X_s - X_{s'}\|_F^2\, w_{ss'}, \qquad (1.13)$$
where $X_s$ is the projection of $\mathcal{Y}_s$ onto a lower dimensional manifold and $w_{ss'}$ is the mode-$(N+1)$ adjacency or similarity graph. One can also equivalently rewrite (1.13) as $\mathrm{tr}(X \Phi_{N+1} X^\top)$, where the $s$th column of $X$ is the vectorized projection $\mathrm{V}(X_s)$.

1.3.1 Robust PCA on Graphs

In [155], it was shown that a low-rank approximation, $U$, to a data matrix, $X$, can be obtained by solving the following optimization problem:
$$\min_{U} \|X - U\|_1 + \gamma_1 \mathrm{tr}(U \Phi_1 U^\top) + \gamma_2 \mathrm{tr}(U^\top \Phi_2 U), \qquad (1.14)$$
where $\Phi_1$ and $\Phi_2$ are the graph Laplacians corresponding to graphs connecting the samples (rows) of $X$ and the features (columns) of $X$, respectively. The above formulation assumes that the data is low-rank on graphs, i.e. lies on a smooth low-dimensional manifold. This can be quantified by a graph stationarity measure,
$$sr(\Gamma_n) = \frac{\|\mathrm{diag}(\Gamma_n)\|_2}{\|\Gamma_n\|_F},$$
where $\Gamma_n = P_n^\top C_n P_n$, with $C_n$ being the covariance matrix of each mode-$n$ unfolding, i.e., $n = 1$ or $2$ in the case of data matrices [146, 154].

1.3.2 Spectral Geometric Matrix Completion

In [84, 22], it was shown that using two graphs that encode the relationships between the rows, $\mathcal{G}_1$, and the columns, $\mathcal{G}_2$, of a matrix, it is possible to extract a signal $X$ that is band-limited on the Cartesian product graph $\mathcal{G} = \mathcal{G}_2 \otimes \mathrm{I} + \mathrm{I} \otimes \mathcal{G}_1$ from data $Y$ with missing entries and corruption. The problem is formulated as:
$$\underset{X}{\text{minimize}} \; \|P_\Omega[X - Y]\|_F^2 + \mathrm{tr}(X^\top \Phi_1 X) + \mathrm{tr}(X \Phi_2 X^\top), \qquad (1.15)$$
where the last two terms are called the graph smoothness terms, and $\Phi_1$ and $\Phi_2$ are the graph Laplacians corresponding to the rows and columns of the matrix, respectively. $\Phi = \Phi_2 \otimes \mathrm{I} + \mathrm{I} \otimes \Phi_1$ is the Laplacian matrix associated with $\mathcal{G}$.

1.4 Tensor Decompositions

In many machine learning problems, low-rank decompositions or matrix factorizations are the main computational bottleneck as they are often $\mathcal{O}(n^3)$, i.e. cubic, in computational complexity. Tensor decompositions provide alternative factorization models that preserve the original multiway structure of the data, which decreases both the computational complexity and the storage cost of the factors. In this section, we review three well-known tensor decomposition approaches: Tucker Decomposition (TD), Canonical Decomposition (CANDECOMP/PARAFAC, CP) and Tensor Train Decomposition (TT).

1.4.1 Tucker Decomposition (TD)

Tucker decomposition represents a tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ as:
$$\mathcal{Y} = \mathcal{X} \times_1 U_1 \times_2 U_2 \cdots \times_N U_N,$$
where $U_n \in \mathbb{R}^{I_n \times R_n}$ with $R_n < I_n$ are called the low-rank factor matrices and $\mathcal{X} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ is the core tensor. This decomposition is illustrated in Figure 1.2.

[Figure 1.2: Tensor network notation for Tucker decomposition.]

Instead of the mode-$n$ product, Tucker decomposition can also be represented using the tensor merging product as:
$$\mathcal{Y} = \mathcal{X} \times_1^2 U_1 \times_2^2 U_2 \cdots \times_N^2 U_N.$$
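As a concrete reference point for the Tucker model above, the following is a sketch of a truncated higher-order SVD (HOSVD), one standard way of computing the factors $U_n$ and the core $\mathcal{X}$; the function names are ours, and the discriminative algorithms proposed in Chapter 2 do not use plain HOSVD.

```python
import numpy as np

def unfold(A, n):
    """Mode-n unfolding A_(n): mode-n fibers become columns."""
    return np.moveaxis(A, n, 0).reshape(A.shape[n], -1)

def mode_n_product(A, U, n):
    return np.moveaxis(np.tensordot(U, A, axes=(1, n)), 0, n)

def hosvd(Y, ranks):
    """Truncated HOSVD: U_n from the leading left singular vectors of Y_(n)."""
    factors = []
    for n, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(Y, n), full_matrices=False)
        factors.append(U[:, :r])                      # U_n in R^{I_n x R_n}
    core = Y
    for n, U in enumerate(factors):
        core = mode_n_product(core, U.T, n)           # X = Y x_1 U_1^T ... x_N U_N^T
    return core, factors

Y = np.random.rand(10, 12, 14)
core, factors = hosvd(Y, ranks=(3, 4, 5))
Y_hat = core
for n, U in enumerate(factors):
    Y_hat = mode_n_product(Y_hat, U, n)               # Y ~ X x_1 U_1 ... x_N U_N
print(core.shape, np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y))   # core size and relative error
```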
1.4.2 Canonical/Polyadic (CP) Decomposition

CP is a special case of Tucker decomposition where the core tensor $\mathcal{X}$ is superdiagonal:
$$x_{i_1, i_2, \ldots, i_N} = \begin{cases} r_{i_1, i_2, \ldots, i_N}, & \text{if } i_1 = i_2 = \cdots = i_N, \\ 0, & \text{otherwise}, \end{cases} \qquad (1.16)$$
where $r_{i_1, i_2, \ldots, i_N} \in \mathbb{R}$ is a constant. CP is often defined equivalently as the sum of rank-one tensors computed by the outer products of the columns of the factor matrices $U_n$:
$$\mathcal{Y} = \sum_{i=1}^{R} \mathbf{u}_{1,i} \circ \mathbf{u}_{2,i} \circ \cdots \circ \mathbf{u}_{N,i}, \qquad (1.17)$$
where $R$ is the CP rank of $\mathcal{Y}$. The above equation is also denoted as:
$$\mathcal{Y} = [[U_1, U_2, \ldots, U_N]]. \qquad (1.18)$$
This representation is called a Kruskal tensor (see [99]), which is a convenient representation for CP tensors. We denote a Kruskal tensor by $\mathfrak{Y} = [[U_1, \ldots, U_N]]$ or $\mathfrak{Y} = [[\zeta; U_1, \ldots, U_N]]$, where $\zeta \in \mathbb{R}^R$ is a vector whose entries are the weights of the rank-one tensor components. In the special case of matrices, $\zeta$ corresponds to the singular values of a matrix. In general, it is assumed that the rank $R$ is small, so that equation (1.18) is also called a low-rank approximation of the tensor $\mathcal{Y}$.

1.4.3 Tensor Train Decomposition (TT)

Most of the existing work in both unsupervised and supervised learning has utilized Tucker or CP decompositions. The matrix product state (MPS), or Tensor Train, is one of the best understood tensor networks, or tensor decompositions, for which efficient algorithms have been developed [140, 139]. TT is a special case of a tensor decomposition where a tensor with $N$ indices is factorized into a chain-like product of low-rank, three-mode tensors. The TT model has been employed in various applications such as PCA [18], manifold learning [183] and deep learning [137]. The Tensor Train decomposition represents each element of $\mathcal{Y}$ using a series of matrix products as:
$$\mathcal{Y}(i_1, i_2, \ldots, i_N) = \mathcal{U}_1(1, i_1, :)\, \mathcal{U}_2(:, i_2, :) \ldots \mathcal{U}_N(:, i_N, :)\, \mathbf{x}, \qquad (1.19)$$
where $\mathcal{U}_n \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$ are the three-mode low-rank tensor factors, $R_n < I_n$ are the TT-ranks of the corresponding modes and $\mathbf{x} \in \mathbb{R}^{R_N \times 1}$ is the projected sample vector. Using the tensor merging product form, (1.19) can be rewritten as
$$\mathcal{Y} = \mathcal{U}_1 \times_3^1 \mathcal{U}_2 \times_3^1 \cdots \times_3^1 \mathcal{U}_N \times_3^1 \mathbf{x}. \qquad (1.20)$$
A graphical representation of (1.20) can be seen in Figure 1.3. If $\mathcal{Y}$ is vectorized, another equivalent expression for (1.19) in terms of a matrix projection is obtained as: $\mathrm{V}(\mathcal{Y}) = \mathrm{L}(\mathcal{U}_1 \times_3^1 \mathcal{U}_2 \times_3^1 \cdots \times_3^1 \mathcal{U}_N)\, \mathbf{x}$. Another widely used form of Tensor Train, called Matrix Product States (MPS) [18], is defined as:
$$\mathcal{Y} = \mathcal{U}_1 \times_3^1 \mathcal{U}_2 \times_3^1 \cdots \times_3^1 \mathcal{U}_k \times_3^1 X \times_3^1 \mathcal{U}_{k+1} \times_3^1 \cdots \times_3^1 \mathcal{U}_N. \qquad (1.21)$$
[Figure 1.3: Tensor Train decomposition of $\mathcal{Y}$ using tensor merging products.]
Note that the low dimensional projected samples in MPS are matrices, while for TT they are vectors. Let $U_{\leq k} = \mathrm{L}(\mathcal{U}_1 \times_3^1 \mathcal{U}_2 \times_3^1 \cdots \times_3^1 \mathcal{U}_k) \in \mathbb{R}^{I_1 I_2 \ldots I_k \times r_k}$ and $U_{>k} = \mathrm{R}(\mathcal{U}_{k+1} \times_3^1 \cdots \times_3^1 \mathcal{U}_N) \in \mathbb{R}^{r_{k+1} \times I_{k+1} \ldots I_N}$. When the $\mathrm{L}(\mathcal{U}_n)$ for $n \leq k$ are left orthogonal, i.e. $\mathrm{L}(\mathcal{U}_n)^\top \mathrm{L}(\mathcal{U}_n) = \mathrm{I}_{R_n}$ for all $n \leq k$, then $U_{\leq k}$ is also left orthogonal [80]. Similarly, when the $\mathrm{R}(\mathcal{U}_n)$ for $n > k$ are right orthogonal, $U_{>k}$ is right orthogonal. When $\mathcal{Y}$ is reshaped into a matrix, (1.21) can be equivalently expressed as a matrix projection $\mathrm{T}_k(\mathcal{Y}) = (\mathrm{I}_{I_n} \otimes U_{\leq k}) X U_{>k}$.
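For completeness, a sketch of the standard TT-SVD procedure for computing cores of the form $\mathcal{U}_n \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$ is given below; names are illustrative, and the supervised TT algorithms of Chapter 2 optimize different (discriminative, graph-regularized) objectives rather than plain reconstruction error.

```python
import numpy as np

def tt_svd(Y, max_rank):
    """Decompose Y into TT cores of shape (R_{n-1}, I_n, R_n) by sequential truncated SVDs."""
    dims = Y.shape
    cores, r_prev = [], 1
    C = Y.reshape(r_prev * dims[0], -1)
    for n in range(len(dims) - 1):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(max_rank, len(s))
        cores.append(U[:, :r].reshape(r_prev, dims[n], r))        # core U_n
        C = (np.diag(s[:r]) @ Vt[:r]).reshape(r * dims[n + 1], -1)
        r_prev = r
    cores.append(C.reshape(r_prev, dims[-1], 1))                  # last core U_N
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into a full tensor."""
    Y = cores[0]
    for core in cores[1:]:
        Y = np.tensordot(Y, core, axes=(-1, 0))                   # chain contraction over TT ranks
    return Y.squeeze(axis=(0, -1))

Y = np.random.rand(4, 5, 6, 7)
cores = tt_svd(Y, max_rank=50)                 # ranks large enough for an exact decomposition
print([c.shape for c in cores])
print(np.allclose(tt_reconstruct(cores), Y))   # True
```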
1.5 Robust PCA for Tensors

In order to separate the low-rank part from sparse outliers in tensors, many extensions of RPCA to tensors have been proposed [70, 120, 208, 197]. Although utilizing tensor structures has proven to be useful in many tasks, quantifying the low-rankness of tensors is an open-ended problem as there is no single definition of the rank of a tensor. Similarly, the nuclear norm of a tensor, which is a convex relaxation of the rank, is not uniquely defined. Depending on the definition of the rank, different nuclear norms are used, such as the Tucker rank (sum of nuclear norms (SNN) of the mode-$n$ unfoldings), the tubal rank (tensor nuclear norm (TNN)), and the TT rank (Schatten norm (TTNN)). In [120, 70], the Tucker rank was used to quantify the rank of a tensor. This interpretation resulted in an SNN based objective:
$$\underset{\mathcal{L}, \mathcal{S}}{\text{minimize}} \; \sum_{n=1}^{N} \|L_{(n)}\|_* + \|\mathcal{S}\|_1, \qquad (1.22)$$
where $\|\cdot\|_1$ is the $\ell_1$ norm of a tensor, defined as the sum of the absolute values of all entries of the tensor. There are many extensions to this formulation, such as reformulating the objective function for tensor completion, using other definitions of the nuclear norm, and weighting the nuclear norms of different modes.

1.6 Linear Discriminant Analysis (LDA) for Tensors

Let $\mathcal{Y} \in \mathbb{R}^{I_1 \times \cdots \times I_N \times K \times C}$ be the collection of training tensor samples. For a given $\mathcal{Y}$ with $C$ classes and $K$ samples per class, define $\mathcal{Y}_c^k \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ as the sample tensors, where $k \in \{1, \ldots, K\}$ is the sample index and $c \in \{1, \ldots, C\}$ is the class index. LDA for tensor data first applies a vectorization to the tensor samples and then finds an orthogonal projection $U$ that maximizes the discriminability of the projections by solving¹:
$$U = \underset{\hat{U}}{\operatorname{argmin}} \left[ \mathrm{tr}(\hat{U}^\top S_W \hat{U}) - \lambda\, \mathrm{tr}(\hat{U}^\top S_B \hat{U}) \right] = \underset{\hat{U}}{\operatorname{argmin}}\; \mathrm{tr}\big(\hat{U}^\top (S_W - \lambda S_B) \hat{U}\big) = \underset{\hat{U}}{\operatorname{argmin}}\; \mathrm{tr}(\hat{U}^\top S \hat{U}), \qquad (1.23)$$
where $S = S_W - \lambda S_B$ and $\lambda$ is the regularization parameter that controls the trade-off between $S_W$ and $S_B$, which are the within-class and between-class scatter matrices, respectively, defined as:
$$S_W = \sum_{c=1}^{C} \sum_{k=1}^{K} \mathrm{V}(\mathcal{Y}_c^k - \mathcal{M}_c)\, \mathrm{V}(\mathcal{Y}_c^k - \mathcal{M}_c)^\top, \qquad (1.24)$$
$$S_B = \sum_{c=1}^{C} \sum_{k=1}^{K} \mathrm{V}(\mathcal{M}_c - \mathcal{M})\, \mathrm{V}(\mathcal{M}_c - \mathcal{M})^\top, \qquad (1.25)$$
where $\mathcal{M}_c = \frac{1}{K} \sum_{k=1}^{K} \mathcal{Y}_c^k$ is the mean of each class $c$ and $\mathcal{M} = \frac{1}{CK} \sum_{c=1}^{C} \sum_{k=1}^{K} \mathcal{Y}_c^k$ is the total mean of all samples. Since $U$ is an orthogonal projection, (1.23) is equivalent to minimizing the within-class scatter and maximizing the between-class scatter of the projections. This can be solved by the matrix $U \in \mathbb{R}^{\prod_{n=1}^{N} I_n \times R_N}$ whose columns are the eigenvectors of $S \in \mathbb{R}^{\prod_{n=1}^{N} I_n \times \prod_{n=1}^{N} I_n}$ corresponding to the lowest $R_N$ eigenvalues.

¹ The original formulation optimizes the trace ratio. Prior work showed the equivalence of the trace ratio to the trace difference used in this thesis [63].

Multilinear Discriminant Analysis (MDA): MDA extends LDA to tensors using TD by finding a subspace $U_n \in \mathbb{R}^{I_n \times R_n}$ for each mode $n \in \{1, \ldots, N\}$ that maximizes the discriminability along that mode [112, 168, 195]. When the number of modes $N$ is equal to 1, MDA is equivalent to LDA. In the case of MDA, the within-class scatter along each mode $n \in \{1, \ldots, N\}$ is defined as:
$$S_W^{(n)} = \sum_{c=1}^{C} \sum_{k=1}^{K} \Big[ (\mathcal{Y}_c^k - \mathcal{M}_c) \prod_{\substack{m \in \{1, \ldots, N\} \\ m \neq n}} \times_m U_m^\top \Big]_{(n)} \Big[ (\mathcal{Y}_c^k - \mathcal{M}_c) \prod_{\substack{m \in \{1, \ldots, N\} \\ m \neq n}} \times_m U_m^\top \Big]_{(n)}^\top. \qquad (1.26)$$
The between-class scatter $S_B^{(n)}$ is found in a similar manner. Using these definitions, each $U_n$ is found by optimizing [168]:
$$U_n = \underset{\hat{U}_n}{\operatorname{argmin}}\; \mathrm{tr}\big(\hat{U}_n^\top (S_W^{(n)} - \lambda S_B^{(n)}) \hat{U}_n\big). \qquad (1.27)$$
Different implementations of multilinear discriminant analysis have been introduced, including Discriminant Analysis with Tensor Representation (DATER), Direct Generalized Tensor Discriminant Analysis (DGTDA) and Constrained MDA (CMDA). DATER minimizes the ratio $\mathrm{tr}(U_n^\top S_W^{(n)} U_n) / \mathrm{tr}(U_n^\top S_B^{(n)} U_n)$ [195] instead of (1.27). Direct Generalized Tensor Discriminant Analysis (DGTDA), on the other hand, computes the scatter matrices without projecting the inputs onto the $U_m$, $m \neq n$, and finds an optimal $U_n$ [112]. Constrained MDA (CMDA) finds the solution in an iterative fashion [112], where each subspace is found by fixing all other subspaces.
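The vectorized tensor LDA in (1.23)-(1.25) can be summarized in a few lines of numpy; the sketch below is our own illustration and should not be confused with the TT-based discriminant analysis developed in Chapter 2.

```python
import numpy as np

def tensor_lda(samples, lam=1.0, n_components=5):
    """samples: array of shape (C, K, I1, ..., IN). Returns U with n_components columns."""
    C, K = samples.shape[:2]
    X = samples.reshape(C, K, -1)                       # vectorize each tensor sample
    class_means = X.mean(axis=1)                        # M_c, shape (C, d)
    total_mean = X.reshape(C * K, -1).mean(axis=0)      # M, shape (d,)
    d = X.shape[-1]
    S_W = np.zeros((d, d)); S_B = np.zeros((d, d))
    for c in range(C):
        Dc = X[c] - class_means[c]                      # within-class deviations, eq. (1.24)
        S_W += Dc.T @ Dc
        db = (class_means[c] - total_mean)[:, None]     # between-class deviation, eq. (1.25)
        S_B += K * (db @ db.T)
    eigvals, eigvecs = np.linalg.eigh(S_W - lam * S_B)  # eigendecomposition of S = S_W - lam*S_B
    return eigvecs[:, :n_components]                    # eigenvectors of the smallest eigenvalues

# Toy example: 3 classes, 10 samples per class, 4 x 5 tensor samples with class offsets.
rng = np.random.default_rng(1)
Y = rng.standard_normal((3, 10, 4, 5)) + rng.standard_normal((3, 1, 4, 5))
U = tensor_lda(Y, lam=1.0, n_components=2)
print(U.shape)                               # (20, 2)
print(np.allclose(U.T @ U, np.eye(2)))       # orthonormal projection
```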
1.7 Organization and Contributions of the Thesis

In this thesis, we present novel algorithms for learning from tensor data for different applications, including discriminative and multi-modal supervised learning, data recovery, and anomaly detection from spatiotemporal data. In all of the proposed methods, important design challenges such as keeping the storage and computational complexity low while maintaining high accuracy are taken into account. The resulting cost functions are formulated as optimization problems and efficient algorithms are presented to solve them. We also provide analyses of the proposed algorithms in terms of convergence, computational complexity and storage complexity. The algorithms are applied to classification, clustering and anomaly detection tasks.

In Chapter 2, we introduce a tensor decomposition structure that provides computational and storage efficiency while providing high accuracy for both supervised and unsupervised applications. The structure is akin to a hybrid between Tensor Train and Tucker decompositions using the flexibility of tensor networks, and is named the multi-branch tensor train decomposition. This new structure is useful for extending existing machine learning methods to tensor-type data. The structure is employed in a supervised learning setting for extending LDA to tensor data. We also present an unsupervised learning method that relies on the proposed structure, namely graph regularized tensor train decomposition. The proposed supervised and unsupervised algorithms are compared to state-of-the-art tensor decompositions with similar objectives in, respectively, classification and clustering settings on various data sets.

In Chapter 3, we address the problem of unsupervised anomaly detection in spatiotemporal data. In this chapter, we develop an anomaly detection method by taking the structure of the anomalies into account. In particular, we model the spatiotemporal data as a tensor where the underlying data is low-rank and the anomalies are sparse and temporally continuous. In other words, anomalies appear as spatially contiguous groups of locations that show anomalous values consistently for a short duration of time. We present different methods based on these assumptions: LOw-rank plus temporally Smooth Sparse decomposition (LOSS), Graph regularized LOSS (GLOSS) and LOw-rank on Graphs plus temporally Smooth Sparse decomposition (LOGSS). LOSS and GLOSS use the same objective functions except for the graph regularization term included in GLOSS. By including a graph regularization term in the objective, we preserve the local geometry of the data. In LOGSS, different from the previous two, graph regularization is employed to approximate the nuclear norm minimization problem in order to reduce the computational complexity. As the proposed methods use a tensor completion framework, they are robust against missing entries in the observed spatiotemporal data.

In Chapter 4, we study geometric tensor learning, a framework that explores the effectiveness of product graph structures on tensors. This framework is utilized for a tensor recovery task where the aim is to reconstruct the underlying data from grossly corrupted and partially observed data. We present a new method with a Tensor Train structure to model the data. Using TT changes the definition of the product graph as the tensor factorization is not Kronecker structured. We propose two new algorithms to model product graphs for the TT decomposition: in the first approach, we propose using a graph for each canonical mode unfolding, and in the second, we show the equivalence of the canonical unfolding to the mode-$n$ unfolding when each mode-$n$ graph has a Kronecker structure.

In Chapter 5, we present a coupled support tensor machine framework for classification. The framework extends a supervised learning task to multiple, possibly heterogeneous, data sources such as simultaneous EEG and fMRI. Starting from a kernelized support tensor machine (STM) framework formulated for a single modality, we propose a multiple kernel approach. In particular, we estimate latent factors with Advanced Coupled Matrix Tensor Factorization (ACMTF) [3] jointly across modalities.
This algorithm decomposes multiple tensors with CP decomposition, where some sets of factors from different modes are coupled together, i.e., are subject to similarity conditions, to account for the same underlying physical phenomenon that affects these factors. A Coupled Support Tensor Machine (C-STM) combines individual and shared latent factors with multiple kernels and estimates a maximal-margin classifier for coupled matrix and tensor data.

Finally, Chapter 6 summarizes the main contributions of this thesis and discusses potential future research directions.

CHAPTER 2
MULTI-BRANCH TENSOR TRAIN STRUCTURE FOR SUPERVISED AND UNSUPERVISED LEARNING

2.1 Introduction

Tensor decompositions have been proposed for various tasks including dimensionality reduction, classification and clustering. In the area of unsupervised learning, many vector or matrix based methods such as PCA, SVD and NMF have been extended to tensors using various tensor decomposition structures including PARAFAC/CP, Tucker decomposition and TT [159, 42, 93, 26, 48, 49, 43, 106, 18, 181]. Tensors have also gained attention in the supervised learning literature. Vectorizing high dimensional inputs may result in poor classification performance due to overfitting when the training sample size is relatively small compared to the feature vector dimension [159, 112, 169, 168, 196]. Another issue brought forth by vectorization is the small sample size (SSS) problem. This problem is considered to be the main drawback of linear discriminant analysis (LDA), as the objective function becomes ill-defined due to the singularity of the estimated scatter matrices [63]. For these and many other reasons, a variety of supervised tensor learning methods for feature extraction, selection, regression and classification have been proposed [126, 168, 97, 75, 112, 74]. These include extensions of Linear Discriminant Analysis (LDA) to Multilinear Discriminant Analysis (MDA) for face and gait recognition [196, 168, 112]; Discriminant Non-negative Tensor Factorization (DNTF) [202]; and Supervised Tensor Learning (STL), where one projection vector along each mode of a tensor is learned [169, 76]. More recently, the linear regression model has been extended to tensors to learn multilinear mappings from a tensorial input space to a continuous output space [73, 201]. Finally, a framework for tensor-based linear large margin classification was formulated as Support Tensor Machines (STMs), in which the parameters defining the separating hyperplane form a tensor [97, 75, 74].

However, most of these methods are based on CP or Tucker decomposition, which have various problems. CP decomposition is very efficient in terms of storage complexity and has useful properties such as uniqueness and interpretability, which is not true for TD or TT. However, finding the correct CP rank is challenging and is often done by iterating over different values. For noisy tensor data, CP decomposition may find a false rank. Moreover, there may not be a low-rank approximation, which is referred to as degeneracy [93], and the problem is often ill-posed [44]. Tucker decomposition is an extension of SVD and a widely used tensor decomposition. It is also more numerically stable compared to CP and, most of the time, a low-rank approximation exists in Tucker form. However, the Tucker representation can be exponential in storage requirements since, even when each mode is low-rank with rank $r$, the dimensionality of a sample core tensor $\mathcal{X}_c^k$ will be $r^N$ [182, 32].
It was also shown in [17] that learning a low-rank approximation using Tucker Decomposition requires a truncated SVD on a fat matrix, which hinders the quality of the low-rank approximation as fat or tall matrices are usually full rank. TT format exhibits both stability and low storage complexity compared to Tucker format [42]. It was also suggested as a possible solution to the imbalance of unfolding problem encountered in Tucker decomposition in [17]. However, the ranks of the core tensors in the Tensor Train decomposition are frequently not low for real data [45] and increase exponentially with the number of modes. In [204], a solution was proposed by limiting the maximum ranks of tensor factors and optimizing accordingly. This limitation requires the original data to be low TT rank, which is not true for most real data [45, 204], and results in lower approximation accuracy. A widely used solution to these problem with TT or MPS [18] is to start the Tensor Train decomposition from the left, i.e. first mode, and the right, i.e. last mode, simultaneously which decreases the computational complexity significantly. Even though early applications of TT decomposition focused on compression and dimensionality re- duction [139, 140], more recently TT has been used in machine learning applications. In [18], MPS is implemented in an unsupervised manner to first compress the tensor of training samples and then the re- sulting lower dimensional core tensors are used as features for subsequent classification. In [182], TT decomposition is associated with a structured subspace model, namely the Tensor Train subspace. Learning this structured subspace from training data is posed as a non-convex problem referred to as TT-PCA. Once the subspaces are learned from the training data, the resulting low-dimensional subspaces are used to project and classify the test data. [183] extends TT-PCA to manifold learning by proposing a Tensor Train neighborhood preserving embedding (TTNPE). The classification is conducted by first learning a set of tensor subspaces from the training data and then projecting the training and testing data onto the learned subspaces. Apart from employing TT for subspace learning, recent work has also considered the use of TT in classifier design. In [38], a support tensor train machine (STTM) is introduced to replace the rank-1 weight tensor in Support Tensor Machine (STM) [169] by a Tensor Train that can approximate any tensor with a scalable number of parameters. 17 In order to address the issues of exponential storage requirements and high computational complexity, in this chapter, we introduce supervised and unsupervised subspace learning approaches based on the Tensor Train (TT) structure. Building on the idea of starting the decomposition from the first and last modes simultaneously, we show that using a number of branches in the decomposition, a hybrid between TT and Tucker Decomposition structure can be achieved which in turn can provide a balanced structure and low ranks. When the number of branches is equal to one, the decomposition is equivalent to a standard TT decomposition. When the number of branches is equal to the number of modes, the decomposition is equivalent to Tucker Decomposition. As a special case, when the number of branches is two, the structure is still denoted as TT in previous work[98, 204, 18, 45]. However, the formulation is different from one- way TT as the orthogonality requirements of some factors become right-orthogonality instead of left. 
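To make the distinction between left- and right-orthogonal factors concrete, the snippet below orthogonalizes the left and right unfoldings of a random third-order core. It assumes L(·) and R(·) are the usual left and right matricizations of a core of size 𝑅𝑛−1 × 𝐼𝑛 × 𝑅𝑛 ; the dimensions are arbitrary.

```python
import numpy as np

def left_unfold(core):
    """L(U): reshape (R_prev, I, R_next) into (R_prev * I, R_next)."""
    r0, i, r1 = core.shape
    return core.reshape(r0 * i, r1)

def right_unfold(core):
    """R(U): reshape (R_prev, I, R_next) into (R_prev, I * R_next)."""
    r0, i, r1 = core.shape
    return core.reshape(r0, i * r1)

rng = np.random.default_rng(0)
core = rng.standard_normal((3, 8, 4))     # hypothetical core with R_{n-1}=3, I_n=8, R_n=4

# Left-orthogonal core: L(U)^T L(U) = I_{R_n}.
q, _ = np.linalg.qr(left_unfold(core))
left_core = q.reshape(core.shape)
print(np.allclose(left_unfold(left_core).T @ left_unfold(left_core), np.eye(4)))     # True

# Right-orthogonal core: R(U) R(U)^T = I_{R_{n-1}}.
q, _ = np.linalg.qr(right_unfold(core).T)
right_core = q.T.reshape(core.shape)
print(np.allclose(right_unfold(right_core) @ right_unfold(right_core).T, np.eye(3)))  # True
```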
By generalizing the proposed idea to different number of branches, we show that our structure is a hybrid tensor decomposition that includes TT and Tucker decomposition as special cases. The significance of the number of branches becomes more apparent for high dimensional data with large number of modes as for these data, the TT ranks tend to get large. The proposed structure is closely related to recent work in the computation of extremal eigenvalue-eigenvector pairs of large Hermitian matrices [45]. Since the main objective is to compute extremal eigenpairs, i.e. Ritz pairs, of a large-scale structured matrix, the proposed multi-branch approach is similar to the previously introduced block TT (BTT) format [56, 98, 204]. In particular, we present a discriminant subspace learning approach using the TT model, namely the Tensor Train Discriminant Analysis (TTDA). The proposed approach is based on linear discriminant analysis (LDA) and learns a Tensor Train subspace (TT-subspace) [183, 182] that maximizes the linear discriminant function. Although this approach provides an efficient structure for storing the learned subspaces, it is com- putationally prohibitive. For this reason, we propose the multi-branch structure for efficient implementations of TTDA. In the second part of this chapter, we apply the multi-branch structure to an unsupervised learning problem that can be posed as an extremal eigenvalue-eigenvector problem. This problem is a combination of manifold learning and linear subspace learning which is posed as a graph regularized dimensionality reduction algorithm. The contributions of the chapter can be summarized as follows: • A new tensor network structure is introduced to provide a better trade-off between computational complexity and storage cost. The proposed multi-branch structure is akin to a hybrid between tensor- 18 train and Tucker decompositions using the flexibility of tensor networks. This structure can also be utilized within other subspace learning tasks, as was done in both supervised and unsupervised settings in this chapter. • This chapter is the first that uses tensor-train decomposition to formulate LDA for supervised subspace learning. Unlike recent work on TT-subspace learning [17, 18, 182, 183] which focuses on dimension- ality reduction, the proposed work learns discriminative TT-subspaces to extract features that optimize the linear discriminant function. • A theoretical analysis of storage and computational complexity of this new framework is presented. A method to find the optimal implementation of the multi-branch TT model given the dimensions of the input tensor is provided. The rest of the chapter is organized as follows. In Section 2.2, we introduce an optimization problem to learn the TT-subspace structure that maximizes the linear discriminant function, named as Tensor Train Discriminant Analysis (TTDA). In Section 2.3, we introduce multi-branch implementations of TTDA to address the issue of high computational complexity. In Section 2.4, we provide an analysis of storage cost, computational complexity and convergence of the proposed algorithms. We also provide a procedure to determine the optimal TT structure for minimizing storage complexity. In Section 2.5, we compare the proposed methods with state-of-the-art tensor based discriminant analysis and subspace learning methods for classification applications. 
In Section 2.6, we discuss graph regularized unsupervised learning on tensors, then, propose an implementation using the two-way structure using an ADMM based optimization, and discuss the convergence properties. In Section 2.7, we compare the proposed method with unsupervised tensor learning methods for clustering applications on two data sets. 2.2 Tensor Train Discriminant Analysis When the data samples are tensors, traditional LDA first vectorizes them and then finds an optimal projection as shown in (1.23). This creates problems as the intrinsic structure of the data is destroyed. Even though MDA addresses this problem, it is inefficient in terms of storage complexity as it relies on TD [45, 32]. Thus, we propose to solve (1.23) by constraining 𝑈 = L(U1 ×13 U2 ×13 · · · ×13 U𝑁 ) to be a TT-subspace to reduce the computational and storage complexity and obtain a solution that will preserve the inherent data structure. 19 The goal of TTDA is to learn left orthogonal tensor factors U𝑛 ∈ R𝑅𝑛−1 ×𝐼𝑛 ×𝑅𝑛 , 𝑛 ∈ {1, . . . , 𝑁 } using TT-model such that the discriminability of projections x𝑐𝑘 , ∀𝑐, 𝑘 is maximized. First, U𝑛 s can be initialized using TT decomposition proposed in [140]. To optimize U𝑛 s for discriminability, we need to solve (1.23) for each U𝑛 , which can be rewritten using the definition of 𝑈 as:   1 1 1 1 > 1 1 1 1 U𝑛 = argmin 𝑡𝑟 L(U1 ×3 · · · ×3 Û𝑛 ×3 · · · ×3 U𝑁 ) 𝑆L(U1 ×3 · · · ×3 Û𝑛 ×3 · · · ×3 U𝑁 ) . (2.1) Û𝑛 Using the definitions presented in (1.1) and (1.2), we can express (2.1) in terms of tensor merging product:   U𝑛 = argmin 𝑡𝑟 (U1 ×13 · · · ×13 Û𝑛 ×13 · · · ×13 U𝑁 ) ×1,..., 𝑁 1,..., 𝑁 S × 1,..., 𝑁 𝑁 +1,...,2𝑁 (U × 1 3 1 · · · ×1 Û 3 𝑛 3 × 1 · · · ×1 U 3 𝑁 , ) (2.2) Û𝑛 where S = T−1 𝑁 (𝑆) ∈ R 𝐼1 ×···×𝐼 𝑁 ×𝐼1 ×···×𝐼 𝑁 . Let U 1 1 1 ≤𝑛−1 = U1 ×3 U2 ×3 · · · ×3 U𝑛−1 and U>𝑛 = U𝑛+1 ×3 · · · ×3 1 1 U𝑁 . By rearranging the terms in (2.2), we can first compute all merging products and trace operations that do not involve U𝑛 as: "  #   𝑁 +1,..., 𝑁 +𝑛−1 𝑁 +𝑛+2,...,2𝑁 A𝑛 = 𝑡𝑟 48 U ≤𝑛−1 ×1,...,𝑛−1 1,...,𝑛−1 𝑛+1,..., 𝑁 U>𝑛 ×1,..., 𝑁 −𝑛 U ≤𝑛−1 ×1,...,𝑛−1 (U>𝑛 ×1,..., 𝑁 −𝑛 S) , (2.3) where A 𝑛 ∈ R𝑅𝑛−1 ×𝐼𝑛 ×𝑅𝑛 ×𝑅𝑛−1 ×𝐼𝑛 ×𝑅𝑛 (refer to Figure 2.1 for a graphical representation of (2.3)). Then, we can rewrite the optimization in terms of U𝑛 :    U𝑛 = argmin Û𝑛 ×1,2,3 1,2,3 A 𝑛 × 1,2,3 4,5,6 Û𝑛 . (2.4) Û𝑛 Let 𝐴𝑛 = T3 (A 𝑛 ) ∈ R𝑅𝑛−1 𝐼𝑛 𝑅𝑛 ×𝑅𝑛−1 𝐼𝑛 𝑅𝑛 , then (2.4) can be rewritten as: U𝑛 = argmin V( Û𝑛 ) > 𝐴𝑛 V( Û𝑛 ), L( Û𝑛 ) > L( Û𝑛 ) = I𝑅𝑛 . (2.5) Û𝑛 This is a non-convex function due to unitary constraints and can be solved by the algorithm proposed in [189]. The algorithm employs a curvilinear line search method to solve minimizations with orthogonality constraints such as eigenvalue decompositions or matrix rank minimization. The procedure described above to find the subspaces is computationally expensive as the complexity of finding each A 𝑛 [183] is at least quadratic in the number of elements, i.e. O (𝐼 2𝑁 ). When 𝑛 = 𝑁, (2.5) does not apply as U>𝑁 is not defined and the trace operation is defined on the third mode of U𝑁 . To update U𝑁 , the following can be used:    U𝑁 = argmin 𝑡𝑟 Û𝑁 ×1,2 1,2 A 𝑁 × 1,2 3,4 Û 𝑁 , Û𝑁 20 𝐼1 𝐼1 U𝑛−1 𝐿 𝐼2 𝐼2 U𝑛−1 𝐿 ... ... 𝑅𝑛−1 𝑅𝑛−1 𝐼𝑛 S 𝐼𝑛 𝑅𝑛 𝑅𝑛 ... ... U𝑛𝑅 𝐼 𝑁−1 𝐼 𝑁−1 U𝑛𝑅 𝐼𝑁 𝐼𝑁 𝑅𝑁 𝑅𝑁 Figure 2.1: Tensor A 𝑛 is formed by first merging U𝑛𝑅 , U𝑛−1 𝐿 and S and then applying trace operation across 𝑡 ℎ 𝑡 ℎ 4 and 8 modes of the resulting tensor. The green line at the bottom of the diagram refers to the trace operator.   
𝑁 −1 𝑁 +1,...,2𝑁 −1 where A 𝑁 = U ≤ 𝑁 −1 ×1,...,1,..., 𝑁 −1 U ≤ 𝑁 −1 ×1,..., 𝑁 −1 S . Once all of the U𝑛 s are obtained, they can be used to extract low-dimensional, discriminative features by the projection operation 𝑈 > V(Y𝑐𝑘 ). The pseudocode for TTDA is given in Algorithm 1. TTDA algorithm described above is computationally expensive as it requires the computation of tensor A 𝑛 through tensor merging products. Also, the size of A 𝑛 grows significantly with increasing 𝑅𝑛 , e.g. 𝑂 (𝐼 4𝑁 ), which has exponentially growing memory cost. Moreover, since the ranks of TT are bounded by Î𝑛 Î𝑛 2 Î𝑛 2 𝑅𝑛 ≤ 𝑖=1 𝐼𝑖 , the size of 𝐴𝑛 may increase up to 𝑖=1 𝐼𝑖 × 𝑖=1 𝐼𝑖 . Thus, 𝐴𝑛 might end up being larger than the original scatter matrix 𝑆, making TT approximation worse than LDA in terms of computational complexity. 2.3 Multi-Branch Tensor Train Discriminant Analysis In this section, we introduce tensor decomposition structures to reduce computational complexity of TTDA. In particular, instead of using the whole scatter tensor S for the update of each U𝑛 , we aim to partition tensor factors into several sets where each set will approximate a lower dimensional TT subspace. For the update of each factor set, the original data will be projected to the remaining sets, and then the corresponding scatter matrix will be formed from these low dimensional projections. To further reduce the complexity, tensor merging products between sets will be removed. These collection of sets form a Kronecker structure, approximating 𝑈 in (1.23). 21 Algorithm 2.1: Tensor Train Discriminant Analysis (TTDA) Input: Input tensors Y𝑐𝑘 ∈ R𝐼1 ×𝐼2 ×···×𝐼𝑁 where 𝑐 ∈ {1, . . . , 𝐶} and 𝑘 ∈ {1, . . . , 𝐾 }, initial tensor factors U𝑛 , 𝑛 ∈ {1, . . . , 𝑁 }, 𝜆, 𝑅1 , . . . , 𝑅 𝑁 , 𝑀𝑎𝑥𝐼𝑡𝑒𝑟 Output: U𝑛 , 𝑛 ∈ {1, . . . , 𝑁 }, and x𝑐𝑘 , ∀𝑐, 𝑘 1: S ← T−1 𝑁 (𝑆 𝑊 − 𝜆𝑆 𝐵 ), see eqns.(1.24), (1.25). 2: while 𝑖𝑡𝑒𝑟 < 𝑀𝑎𝑥𝐼𝑡𝑒𝑟 do 3: for 𝑛 = 1 : 𝑁 − 1 do 4: Compute A 𝑛 using (2.3). 5: V(U𝑛 ) ← argmin V( Û𝑛 ) > T3 (A 𝑛 )V( Û𝑛 ). 6: end for Û𝑛 ,L( Û𝑛 ) > L( Û𝑛 )=I 𝑅𝑛   1,..., 𝑁 −1 𝑁 +1,...,2𝑁 −1 7: A 𝑁 ← U𝑁𝐿 −1 ×1,..., 𝑁 −1 U 𝐿 × 𝑁 −1 1,..., 𝑁 −1 S   8: L(U𝑁 ) ← argmin 𝑡𝑟 L( Û𝑁 ) > T2 (A 𝑁 )L( Û𝑁 ) . Û𝑁 ,L( Û𝑛 ) > L( Û𝑁 )=I 𝑅 𝑁 9: 𝑖𝑡𝑒𝑟 ← 𝑖𝑡𝑒𝑟 + 1. 10: end while 11: 𝑈 = L(U1 ×13 U2 ×13 · · · ×13 U𝑁 ) 12: x𝑐𝑘 ← 𝑈 > V(Y𝑐𝑘 ), ∀𝑐, 𝑘. 2.3.1 Two-way Tensor Train Discriminant Analysis (2WTTDA) As LDA tries to find a subspace 𝑈 which maximizes discriminability for vector-type data, 2D-LDA tries to find two subspaces 𝑉1 , 𝑉2 such that these subspaces maximize discriminability for matrix-type data [198]. If Î𝑑 Î𝑁 one considers the matricized version of Y𝑐𝑘 along mode 𝑑, i.e. T𝑑 (Y𝑐𝑘 ) ∈ R 𝑖=1 𝐼𝑖 × 𝑖=𝑑+1 𝐼𝑖 , where 1 < 𝑑 < 𝑁, the equivalent orthogonal projection can be written as: T𝑑 (Y𝑐𝑘 ) = 𝑉1 𝑋𝑐𝑘 𝑉2> , (2.6) Î𝑑 Î𝑁 𝑖=1 𝐼𝑖 ×𝑅 𝑑 𝑖=𝑑+1 𝐼𝑖 × 𝑅 𝑑 , 𝑋𝑐𝑘 ∈ R𝑅𝑑 ×𝑅𝑑 . ˆ ˆ where 𝑉1 ∈ R , 𝑉2 ∈ R In TTDA, since the projections x𝑐𝑘 are considered to be vectors, the subspace 𝑈 = L(U1 ×13 U2 ×13 · · · ×13 U𝑁 ) is analogous to the solution of LDA with the constraint that the subspace admits a TT model. If we consider the projections and the input samples as matrices, now we can impose a TT structure to the left and right subspaces analogous to solution of 2D-LDA. In other words, one can find two sets of TT representations corresponding to 𝑉1 and 𝑉2 in (2.6). 
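Before specializing to two branches, it is useful to see how a chain of TT cores defines the full subspace 𝑈 and the projection in lines 11-12 of Algorithm 2.1. The numpy sketch below is only an illustration with random cores and hypothetical dimensions; it assumes 𝑅0 = 1, cores stored as third-order arrays, and row-major vectorization for V(·).

```python
import numpy as np

def tt_subspace_matrix(cores):
    """Merge TT cores U_n of shape (R_{n-1}, I_n, R_n) into the subspace matrix
    U = L(U_1 x ... x U_N) of shape (I_1*...*I_N, R_N).  Assumes R_0 = 1."""
    u = cores[0].reshape(cores[0].shape[1], cores[0].shape[2])       # (I_1, R_1)
    for core in cores[1:]:
        r_prev, i_n, r_n = core.shape
        u = (u @ core.reshape(r_prev, i_n * r_n)).reshape(-1, r_n)   # contract shared rank index
    return u

rng = np.random.default_rng(0)
dims, ranks = (8, 8, 8, 8), (1, 4, 4, 4, 5)                  # e.g. a reshaped 64 x 64 image
cores = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1])) for n in range(len(dims))]

Y = rng.standard_normal(dims)         # one (hypothetical) sample
U = tt_subspace_matrix(cores)         # (4096, 5)
x = U.T @ Y.reshape(-1)               # feature vector x = U^T V(Y), as in line 12 of Algorithm 2.1
print(U.shape, x.shape)               # (4096, 5) (5,)
```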
Using this structural approximation, (2.6) can be rewritten as: T𝑑 (Y𝑐𝑘 ) = L(U1 ×13 · · · ×13 U𝑑 ) 𝑋𝑐𝑘 R(U𝑑+1 ×13 · · · ×13 U𝑁 ), (2.7) which is equivalent to the following representation: Y𝑐𝑘 = U1 ×13 · · · ×13 U𝑑 ×13 𝑋𝑐𝑘 ×12 U𝑑+1 ×13 · · · ×13 U𝑁 . 22 This formulation is graphically represented in Figure 2.2a where the decomposition has two branches, thus we refer to it as Two-way Tensor Train Decomposition (2WTT). In prior work, decompositions with this structure was also called Tensor Train [204] or Matrix Product States [18] but the formulation is different and some properties of Tensor Train such as ranks, computational and storage complexities change when this structure is employed. Thus, this is a different decomposition than the conventional Tensor Train and the differences will be more higlighted when the number of branches increase. The differences between the two-way structure and TT are: • In the two-way structure, factors after mode-𝑑 are not linearly dependent to the ones before mode-𝑑. This can be viewed easily in 2D-LDA problem, which is a two-way structure with only two modes, or from the representation in Figure 2.2a. • In 2WTT, modes before mode-𝑑 are left-orthogonal, similar to TT, but the modes after mode-𝑑 are right orthogonal. This fact with the previous one allow a reduced computational cost in the ALS optimization scheme by a simple linear independence assumption. • In TT, ranks grow from the first mode to the last mode as the upper bound of each rank, 𝑅𝑛 ≤ Î 𝐼𝑛 𝑅𝑛−1 ≤ 𝑛𝑛0=1 𝐼𝑛0 , keeps increasing. On the other hand, for 2WTT ranks with 𝑛 > 𝑑 are bounded Î by 𝑅𝑛 ≤ 𝐼𝑛 𝑅𝑛+1 ≤ 𝑛𝑁0=𝑛+1 𝐼𝑛0 . As a quantitative measure, the upper bound of the maximum rank is q Î𝑁 Î𝑁 𝑛=1 𝐼 𝑛 for TT, and 𝑛=1 𝐼 𝑛 for 2WTT. To maximize discriminability using 2WTT, an optimization scheme that alternates between the two sets of TT-subspaces can be utilized. When forming the scatter matrices for a set, projections of the data to the other set can be used instead of the full data which is similar to (1.26). This will reduce computational complexity as the cost of computing scatter matrices and the number of matrix multiplications to find A 𝑛 in (2.3) will decrease. We propose the procedure given in Algorithm 2.2 to implement this approach and refer to it as Two-way Tensor Train Discriminant Analysis (2WTTDA) as illustrated in Figure 2.2c. To determine Î𝑑 Î the value of 𝑑 in (2.7), we use a center of mass approach and find the 𝑑 that minimizes | 𝑖=1 𝐼𝑖 − 𝑁𝑗=𝑑+1 𝐼 𝑗 |. In this manner, the problem can be separated into two parts which have similar computational complexities. 23 Algorithm 2.2: Two-Way Tensor Train Discriminant Analysis (2WTTDA) Input: Input tensors Y𝑐𝑘 ∈ R𝐼1 ×𝐼2 ×···×𝐼𝑁 where 𝑐 ∈ {1, . . . , 𝐶} and 𝑘 ∈ {1, . . . , 𝐾 }, initial tensor factors U𝑛 , 𝑛 ∈ {1, . . . , 𝑁 }, 𝑑, 𝜆, 𝑅1 , . . . , 𝑅 𝑁 , 𝑀𝑎𝑥𝐼𝑡𝑒𝑟, 𝐿𝑜𝑜 𝑝𝐼𝑡𝑒𝑟 Output: U𝑛 , 𝑛 ∈ {1, . . . , 𝑁 }, and 𝑋𝑐𝑘 , ∀𝑐, 𝑘 1: while 𝑖𝑡𝑒𝑟 < 𝐿𝑜𝑜 𝑝𝐼𝑡𝑒𝑟 do 𝑁 −𝑑+1 2: Y𝐿 ← Y ×2,...,𝑑+1,..., 𝑁 (U𝑑+1 ×13 · · · ×13 U𝑁 ). 3: [U𝑖 ] ← 𝑇𝑇 𝐷 𝐴(Y𝐿 , 𝜆, 𝑅𝑖 , 𝑀𝑎𝑥𝐼𝑡𝑒𝑟)∀𝑖 ∈ {1, . . . , 𝑑}. 4: Y𝑅 ← Y ×2,...,𝑑+1 1,...,𝑑 (U1 ×13 · · · ×13 U𝑑 ). 5: [U𝑖 ] ← 𝑇𝑇 𝐷 𝐴(Y𝑅 , 𝜆, 𝑅𝑖 , 𝑀𝑎𝑥𝐼𝑡𝑒𝑟)∀𝑖 ∈ {𝑑 + 1, . . . , 𝑁 }. 6: 𝑖𝑡𝑒𝑟 = 𝑖𝑡𝑒𝑟 + 1. 7: end while 8: X𝑐𝑘 ← L(U1 ×13 · · · ×13 U𝑑 ) > T𝑑 (Y𝑐𝑘 )R(U𝑑+1 ×13 · · · ×13 U𝑁 ) > . 
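A minimal sketch of the projection in the last line of Algorithm 2.2, together with the center-of-mass rule for choosing 𝑑, is given below. The orthonormal matrices are random stand-ins for the learned left and right branch subspaces, the mode-𝑑 matricization is taken as a plain row-major reshape, and all dimensions are hypothetical.

```python
import numpy as np

def choose_split(dims):
    """Center-of-mass rule: pick d minimizing |prod(dims[:d]) - prod(dims[d:])|."""
    gaps = [abs(int(np.prod(dims[:d])) - int(np.prod(dims[d:]))) for d in range(1, len(dims))]
    return 1 + int(np.argmin(gaps))

def project_2wtt(Y, U_left, U_right, d):
    """X = L(U_1...U_d)^T  T_d(Y)  R(U_{d+1}...U_N)^T for a single sample Y."""
    dims = Y.shape
    Td = Y.reshape(int(np.prod(dims[:d])), int(np.prod(dims[d:])))   # stand-in for T_d(Y)
    return U_left.T @ Td @ U_right.T

rng = np.random.default_rng(1)
dims, r = (8, 8, 8, 8), 4
d = choose_split(dims)                                         # d = 2 here: |64 - 64| = 0
U_left = np.linalg.qr(rng.standard_normal((8 * 8, r)))[0]      # stand-in for L(U_1 x U_2)
U_right = np.linalg.qr(rng.standard_normal((8 * 8, r)))[0].T   # stand-in for R(U_3 x U_4)
X = project_2wtt(rng.standard_normal(dims), U_left, U_right, d)
print(d, X.shape)                                              # 2 (4, 4)
```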
2.3.2 Three-way Tensor Train Discriminant Analysis (3WTTDA) Elaborating on the idea of 2WTTDA, one can increase the number of modes of the projected samples which will increase the number of tensor factor sets, or the number of subspaces to be approximated using TT structure. For example, one may choose the number of modes of the projections as three, i.e. X𝑐𝑘 ∈ R𝑅𝑑1 ×𝑅𝑑2 ×𝑅𝑑2 , where 1 < 𝑑1 < 𝑑2 < 𝑁. This model, named as Three-way Tensor Train ˆ Decomposition (3WTT), is given in (2.8) and represented graphically in Figure 2.2b.   𝑘 𝑘 𝑁 −𝑑2 +2 1 1  𝑑2 −𝑑1 +2 Y𝑐 = X𝑐 ×3 U𝑑2 +1 ×3 · · · ×3 U𝑁 ×2 ! ×1𝑑1 +2 U1 ×13 · · · ×13 U𝑑1 .   U𝑑1 +1 ×13 ··· ×13 U𝑑2 (2.8) To maximize discriminability using 3WTT, one can utilize an iterative approach as in Algorithm 2.2, where inputs are projected on all tensor factor sets except the set to be optimized, then TTDA is applied to the projections. The flowchart for the corresponding algorithm is illustrated in Figure 2.2d. This procedure can be repeated until a convergence criterion is met or a number of iterations is reached. The values of 𝑑1 and Î𝑁 𝑑2 are calculated such that the product of dimensions corresponding to each set is as close to ( 𝑖=1 𝐼𝑖 ) 1/3 as possible. It is important to note that 3WTT will only be meaningful for tensors of order three or higher. For three-mode tensors, 3WTT is equivalent to Tucker Decomposition. When there are more than four modes, the number of branches can be increased accordingly which makes 4W, 5W and 𝑁WTT possible. 2.4 Analysis of Storage, Training Complexity and Convergence In this section, we derive the storage and computational complexities of the aforementioned algorithms as well as providing a convergence analysis for TTDA. 24 𝐼1 𝐼 𝑑1 ... U1 ... U𝑑1 ... ... ... 𝐼 𝑁−1 Y𝑐𝑘 = X𝑐𝑘 ... ... 𝐼 𝑁−1 Y𝑐𝑘 = U1 ... U𝑑 𝑋𝑐𝑘 U𝑑+1 ... U𝑁 𝐼𝑁 𝐼2 U𝑑1 +1 ... U𝑑2 U𝑑2 +1 ... U𝑁 𝐼1 𝐼𝑁 𝐼2 𝐼1 𝐼1 𝐼𝑑 𝐼 𝑑+1 𝐼𝑁 𝐼 𝑑1 +1 𝐼 𝑑2 𝐼 𝑑2 +1 𝐼𝑁 (a) (b) Apply TT in [140] Apply TT in [140] U𝑑1 +1 , . . . , U𝑑2 U1 , . . . , U𝑑 Y, 𝜆, 𝑑 U𝑑+1 , . . . , U𝑁 U1 , . . . , U𝑑1 U𝑑2 +1 , . . . , U𝑁 Y, 𝜆, 𝑑 Y𝑅 Apply Project Y Project Y Project Y Apply Y𝑀 TTDA according according Y𝐿 TTDA Y𝑅 Y𝐿 to line 4. to line 2. (c) (d) Figure 2.2: Illustration of the proposed methods (Compare (a) and (b) with Figures 1.2 and 1.3): (a) The proposed tensor network structure for 2WTT; (b) The proposed tensor network structure for 3WTT; (c) The flow diagram for 2WTTDA (Algorithm 2.2); (d) The flow diagram for 3WTTDA Table 2.1: Storage Complexities of Different Tensor Decomposition Structures Methods Subspaces (U𝑛 s) (O (.)) Projections X𝑐𝑘 (O (.)) TT (𝑁 − 1)𝑟 2 𝐼 + 𝑟 𝐼 𝑟𝐶𝐾 2WTT (𝑁 − 2)𝑟 2 𝐼 + 2𝑟 𝐼 𝑟 2 𝐶𝐾 3WTT (𝑁 − 3)𝑟 2 𝐼 + 3𝑟 𝐼 𝑟 3 𝐶𝐾 TD 𝑁𝑟 𝐼 𝑟 𝑁 𝐶𝐾 Storage Complexity Let 𝐼𝑛 = 𝐼, 𝑛 ∈ {1, 2, . . . , 𝑁 } and 𝑅𝑙 = 𝑟, 𝑙 ∈ {2, . . . , 𝑁 − 1}. Assuming 𝑁 is a multiple of both 2 and 3, total storage complexities are: • O ((𝑁 − 1)𝑟 2 𝐼 + 𝑟 𝐼 + 𝑟𝐶𝐾) for TT Decomposition, where 𝑅1 = 1, 𝑅 𝑁 = 𝑟; • O ((𝑁 − 2)𝑟 2 𝐼 + 2𝑟 𝐼 + 𝑟 2𝐶𝐾) for Two-Way TT Decomposition, where 𝑅1 = 𝑅 𝑁 = 1; • O ((𝑁 − 3)𝑟 2 𝐼 + 3𝑟 𝐼 + 𝑟 3𝐶𝐾) for Three-Way TT Decomposition, where 𝑅1 = 𝑅 𝑑1 = 𝑅 𝑁 = 1; • O (𝑁𝑟 𝐼 + 𝑟 𝑁 𝐶𝐾) for Tucker Decomposition, where 𝑅1 = 𝑅 𝑁 = 𝑟. These results show that when the number of modes for the projected samples is increased, the storage cost increases exponentially for X𝑐𝑘 while the cost of storing U𝑛 s decreases quadratically. Using the above, one 25 Table 2.2: Computational complexities of various algorithms. 
The number of iterations to find the subspaces are denoted as 𝑡 𝑐 for CMDA and 𝑡 𝑡 for TT-based methods. 𝐶𝑠 = 2𝐶𝐾. (𝑟 << 𝐼, 𝑡 𝑡 𝑟 (𝑟 + 𝑁/ 𝑓 − 1) << 𝐶𝑠 , and 𝐼 𝑁 / 𝑓 >> 𝑟 6 ) Methods Order of Complexity (O (.)) LDA 𝐶𝑠 𝐼 2𝑁 + 𝐼 3𝑁 DGTDA 3𝐼 3 + 𝑁𝐶𝑠 𝐼 𝑁 +1 CMDA 2𝑡 𝑐 𝐼 3 + 𝑡 𝑐 𝑁 2 𝐶𝑠 𝐼 𝑁 TTDA 𝐶𝑠 𝐼 2𝑁 2WTTDA (𝑟/2 + 2)𝐶𝑠 𝐼 𝑁 3WTTDA (𝑟 𝐼 /3 /2 + 3)𝐶𝑠 𝐼 2𝑁 /3 𝑁 can easily find the optimal number of modes for the projected samples that minimizes storage complexity. Let the number of modes of X𝑐𝑘 be denoted by 𝑓 . The storage complexity of the decomposition is then O ((𝑁 − 𝑓 )𝑟 2 𝐼 + 𝑓 𝑟 𝐼 + 𝑟 𝑓 𝐶𝐾). The optimal storage complexity is achieved by taking the derivative of the complexity in terms of 𝑓 and equating it to zero. In this case, the optimal 𝑓 is given by    𝑟2𝐼 − 𝑟 𝐼 𝑓ˆ = 𝑟𝑜𝑢𝑛𝑑 log𝑟 , 𝐶𝐾 ln(𝑟) where 𝑟𝑜𝑢𝑛𝑑 (.) is an operator that rounds to the closest positive integer. 2.4.1 Computational Complexity For all of the decompositions mentioned except for DGTDA and LDA, the U𝑛 s and X𝑐𝑘 depend on each other which makes these decompositions iterative. The number of iterations will be denoted as 𝑡 𝑐 and 𝑡 𝑡 for CMDA and TT-based methods, respectively. For the sake of simplicity, we also define 𝐶𝑠 = 2𝐶𝐾. The total cost of finding U𝑛 s and X𝑐𝑘 ∀𝑛, 𝑐, 𝑘, where 𝑟 << 𝐼 is in the order of:    • O 𝐼 𝑁 (𝐶𝑠 + 𝑡 𝑡 𝑟 (𝑟 + 𝑁 − 1))𝐼 𝑁 + 𝑡 𝑡 𝑟 4 (𝐼 + 𝑟 2 𝐼 −1 ) for TTDA;   • O 𝑟 𝐼 𝑁 𝐶2𝑠 + 2𝐼 𝑁 /2 (𝐶𝑠 + 𝑡 𝑡 𝑟 (𝑟 + 𝑁/2 − 1))𝐼 𝑁 /2 + 𝑡 𝑡 𝑟 4 𝐼 + 𝑡 𝑡 𝑟 6 𝐼 −1 for 2WTTDA;    • O 𝑟 𝐼 𝑁 𝐶2𝑠 + 3𝐼 𝑁 /3 (𝐶𝑠 + 𝑡 𝑡 𝑟 (𝑟 + 𝑁/3 − 1))𝐼 𝑁 /3 + 𝑡 𝑡 𝑟 4 𝐼 + 𝑡 𝑡 𝑟 6 𝐼 −1 for 3WTTDA.  If convergence criterion is met with a small number of iterations, i.e. 𝑡 𝑡 𝑟 (𝑟 +𝑁/ 𝑓 −1) << 𝐶𝑠 , and 𝐼 𝑁 / 𝑓 >> 𝑟 6 for all 𝑓 , the reduced complexities are as given in Table 2.2. We can see from Table 2.2 that with increasing number of branches, TT-based methods become more efficient if the algorithm converges in a few number of iterations. This is especially the case if the ranks of tensor factors are low as this reduces the dimensionalities of the search space and the search algorithm finds 26 a solution to (2.5) faster. When this assumption holds true, the complexity is dominated by the formation of scatter matrices. Note that the ranks are assumed to be much lower than dimensionalities and number of modes is assumed to be sufficiently high. When these assumptions do not hold, the complexity of computing A 𝑛 might be dominated by terms with higher powers of 𝑟. This indicates that TT-based methods are more effective when the tensors have higher number of modes and when the TT-ranks of the tensor factors are low. DGTDA has an advantage over all other methods as it is not iterative and the solution for each mode is not dependent on other modes. On the other hand, the solution of DGTDA is not optimal and there are no convergence guarantees except when the ranks and initial dimensions are equal to each other, i.e. when there is no compression. 2.4.2 Convergence To analyze the convergence of TTDA, we must first establish a lower bound for the objective function of LDA, as (2.1) is lower bounded by the objective value of (1.23). Lemma 1. Given that 𝜆 ∈ R+ , i.e. a nonnegative real number, the lower bound of 𝑡𝑟 (𝑈 > 𝑆𝑊 𝑈)−𝜆𝑡𝑟 (𝑈 > 𝑆 𝐵 𝑈) Î𝑁 is achieved when 𝑈 ∈ R 𝑛=1 𝐼𝑛 ×𝑟 satisfies the following two conditions, simultaneously: 1. The columns of 𝑈 are in the null space of 𝑆𝑊 : u 𝑗 ∈ 𝑛𝑢𝑙𝑙 (𝑆𝑊 ), ∀ 𝑗 ∈ {1, . . . , 𝑟 }. 2. {u1 , u2 , . . . , u𝑟 } are the top-𝑟 eigenvectors of 𝑆 𝐵 . 
In this case, the minimum of 𝑡𝑟 (𝑈 > 𝑆𝑊 𝑈) − 𝜆𝑡𝑟 (𝑈 > 𝑆 𝐵 𝑈) = −𝜆 Í𝑟 𝑖=1 𝜎𝑖 , where 𝜎𝑖 s are the eigenvalues of 𝑆𝐵. Proof. Since 𝑆𝑊 is positive semi-definite, 0 ≤ min 𝑡𝑟 (𝑈 > 𝑆𝑊 𝑈), 𝑈 which implies that when the columns of 𝑈 are in the null space of 𝑆𝑊 , i.e. u 𝑗 ∈ 𝑛𝑢𝑙𝑙 (𝑆𝑊 ), ∀ 𝑗 ∈ {1, . . . , 𝑟 }, the minimum value will be achieved for the first part of the objective function. To minimize the trace difference, we need to maximize 𝑡𝑟 (𝑈 > 𝑆 𝐵 𝑈) which is bounded from above as: Õ𝑟 > max 𝑡𝑟 (𝑈 𝑆 𝐵 𝑈) ≤ 𝜎𝑖 . 𝑈 𝑖=1 27 𝑡𝑟 (𝑈 > 𝑆 𝐵 𝑈) is maximized when the columns of 𝑈 are the top-𝑟 eigenvectors of 𝑆 𝐵 . Therefore, the trace difference achieves the lower-bound when u 𝑗 ∈ 𝑛𝑢𝑙𝑙 (𝑆𝑊 ), ∀ 𝑗 ∈ {1, . . . , 𝑟 } and {u1 , u2 , . . . , u𝑟 } are the top-𝑟 Í eigenvectors of 𝑆 𝐵 and this lower-bound is equal to −𝜆 𝑟𝑖=1 𝜎𝑖 . As shown above, the objective function of LDA is lower bounded. Thus, the solution to (2.1) is also lower-bounded. Let 𝑓 (U1 , U2 , . . . , U𝑁 ) = 𝑡𝑟 (𝑈 > 𝑆𝑈), where 𝑈 = L(U1 ×13 · · · ×13 U𝑁 ) and 𝑆 is defined as in (1.23). If the function 𝑓 is non-increasing with each update of U𝑛 s, i.e. 𝑓 (U1𝑡 , U2𝑡 , . . . , U𝑛𝑡−1 , . . . , U𝑁𝑡−1 )≥ 𝑓 (U1𝑡 , U2𝑡 , . . . , U𝑛𝑡 , . . . , U𝑁 𝑡−1 ), ∀𝑡, 𝑛 ∈ {1, 2, . . . , 𝑁 }, then we can claim that Algorithm 2.1 converges to a fixed point as 𝑡 → − ∞ since 𝑓 (.) is lower-bounded. In [189], an approach to regulate the step sizes in the search algorithm was introduced to guarantee global convergence. In this chapter, this approach is used to update U𝑛 s. Thus, (2.5) can be optimized globally, and the objective value is non-increasing. As Multi-Branch extensions utilize TTDA on the update of each branch, proving the convergence of TTDA is sufficient to prove the convergence of 2WTTDA or 3WTTDA. 2.5 Experiments The proposed TT based discriminant analysis methods are evaluated in terms of classification accuracy, storage complexity, training complexity and sample size. We compared our methods1 with both linear supervised tensor learning methods including LDA, DGTDA and CMDA[112]2 as well as other Tensor Train based learning methods such as MPS [18], TTNPE [183]3 and STTM [38]4. The experiments were conducted on four different data sets: COIL-100, Weizmann Face, Cambridge and UCF-101. For all data sets and all methods, we evaluate the classification accuracy and training complexity with respect to storage complexity. In this chapter, classification accuracy is evaluated using a 1-NN classifier and quantified as 𝑁𝑡𝑟 𝑢𝑒 /𝑁𝑡𝑒𝑠𝑡 , where 𝑁𝑡𝑟 𝑢𝑒 is the number of test samples which were assigned the correct label and 𝑁𝑡𝑒𝑠𝑡 is the total number of test samples. Normalized storage complexity is quantified as the ratio of the total number of elements in 1 Our code is in https://github.com/mrsfgl/mbttda 2 https://github.com/laurafroelich/tensor_classification 3 https://github.com/wangwenqi1990/TTNPE. 4 https://github.com/git2cchen/KSTTM 28 the learnt tensor factors (U𝑛 , ∀𝑛) and projections (X𝑐𝑘 , ∀𝑐, 𝑘) of training data, 𝑂 𝑠 , to the size of the original training data (Y𝑐𝑘 , ∀𝑐, 𝑘): 𝑂𝑠 Î𝑁 . 𝐶𝐾 𝑛=1 𝐼 𝑛 Training complexity is the total runtime in seconds for learning the subspaces. All experiments were repeated 10 times with random selection of the training and test sets and average classification accuracies are reported. 
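For reference, the normalized storage cost that appears on the horizontal axes of the following figures can be computed directly from the sizes of the stored factors and projections. The helper below is a small sketch; the factor and projection shapes are hypothetical and only chosen to resemble a 2WTT model on the COIL-100 setup (𝐶𝐾 = 100 × 20 training samples).

```python
import numpy as np

def normalized_storage_cost(factor_shapes, proj_shape, n_samples, sample_shape):
    """O_s / (C*K*prod(I_n)): elements in the learned factors U_n plus all training
    projections X_ck, divided by the size of the original training data."""
    o_s = sum(int(np.prod(s)) for s in factor_shapes)      # tensor factors
    o_s += n_samples * int(np.prod(proj_shape))            # one projection per training sample
    return o_s / (n_samples * int(np.prod(sample_shape)))

factors = [(1, 8, 4), (4, 8, 4), (4, 8, 4), (4, 8, 1)]     # hypothetical 2WTT cores, r = 4
cost = normalized_storage_cost(factors, proj_shape=(4, 4),
                               n_samples=100 * 20, sample_shape=(8, 8, 8, 8))
print(f"normalized storage cost = {cost:.4f}")             # 0.0039
```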
The regularization parameter, 𝜆, for each experiment was selected using a validation set composed of all of the samples in the training set and a small subset of each class from the test set (10 samples for COIL-100, 5 samples for Weizmann, 1 sample for Cambridge, and 10 samples for UCF-101). Utilizing a leave-𝑠-out approach, where 𝑠 is the aforementioned subset size, 5 random experiments were conducted. The optimal 𝜆 was selected as the value that gave the best average classification accuracy among a range of values from 0.1 to 1000 increasing in a logarithmic scale. CMDA, TTNPE and MPS do not utilize the 𝜆 parameter while DGTDA utilizes eigendecomposition to find 𝜆 [112]. STTM has an outlier fraction parameter which was set to 0.02 according to the original paper [38]. 2.5.1 Data Sets COIL-100 The dataset consists of 7,200 RGB images of 100 objects of size 128 × 128. Each object has 72 images, where each image corresponds to a different pose angle ranging from 0 to 360 degrees with increments of 5 degrees [133]. For our experiments, we downsampled the grayscale images of all objects to 64 × 64. Each sample image was reshaped to create a tensor of size 8 × 8 × 8 × 8. Reshaping the inputs into higher order tensors is common practice and was studied in prior work [90, 140, 209, 45, 17]. 20 samples from each class were selected randomly as training data, i.e. Y ∈ R8×8×8×8×20×100 , and the remaining 52 samples were used for testing. Weizmann Face Database The dataset includes RGB face images of size 512×352 belonging to 28 subjects taken from 5 viewpoints, under 3 illumination conditions, with 3 expressions [188]. For our experiments, each image was grayscaled, and downsampled to 64 × 44. The images were then reshaped into 5-mode tensors of size 4 × 4 × 4 × 4 × 11 as in [183]. For each experiment, 20 samples were randomly selected to be the training data, i.e. Y ∈ R4×4×4×4×11×20×28 , and the remaining 25 samples were used in testing. 29 Cambridge Hand-Gesture Database The dataset consists of 900 image sequences of 9 gesture classes, which are combinations of 3 hand shapes and 3 motions. For each class, there are 100 image sequences generated by the combinations of 5 illuminations, 10 motions and 2 subjects [91]. Sequences consist of images of size 240 × 320 and sequence length varies. In our experiments, we used grayscaled versions of the sequences and we downsampled all sequences to length 30. We also included 2 subjects and 5 illuminations as the fourth mode. Thus, we have 10 samples for each of the 9 classes from which we randomly select 4 samples as the training set, i.e. Y ∈ R30×40×30×10×4×9 , and the remaining 6 as test set. UCF-101 Human Action Dataset UCF-101 is an action recognition dataset [162]. There are 13320 videos of 101 actions, where each action category might have different number of samples. Each sample is an RGB image sequence with frame size 240 × 320 × 3. The number of frames differs for each sample. In our experiments, we used grayscaled, downsampled frames of size 30 × 40. From each class, we extracted 100 samples to balance the class sizes where each sample consists of 50 frames obtained by uniformly sampling each video sequence. 60 randomly selected samples from each class were used for training, i.e. Y ∈ R30×40×50×60×101 , and the remaining 40 samples were used for testing. 2.5.2 Classification Accuracy We first evaluate the classification accuracy of the different methods with respect to normalized storage complexity. 
The varying levels of storage cost are obtained by varying the ranks, 𝑅𝑖 s, in the implementation of the tensor decomposition methods. Varying the truncation parameter 𝜏 ∈ (0, 1], the singular values smaller than 𝜏 times the largest singular value are eliminated. The remaining singular values are used to determine the ranks 𝑅𝑖 s for both TT-based and TD-based methods. For TT-based methods, the ranks are selected using TT-decomposition proposed in [140], while for TD-based methods truncated HOSVD was used. We also limit the upper bounds of the ranks to 𝑅𝑛 ≤ 𝐼𝑛 to reduce the memory cost associated with storing 𝐴𝑛 . With this limitation, the maximum dimension of projected samples x𝑐𝑘 becomes 𝐼 𝑁 which is a Î𝑁 fraction of the original sample size 𝑖=1 𝐼𝑖 . In Fig. 2.3, we present comparisons of BTT based eigenpair computation algorithm [56] with the proposed method. The experiments were conducted on COIL-100 data set. BTT performs similarly to 30 (a) (b) Figure 2.3: Comparisons with a BTT based Ritz pair computation algorithm. a) Classification accuracy and b) Training time with respect to normalized storage cost. 31 (a) (b) (c) (d) Figure 2.4: Classification accuracy vs. Normalized storage cost of the different methods for: a) COIL-100, b) Weizmann Face, c) Cambridge Hand Gesture and d) UCF-101. All TD based methods are denoted using ’x’, TT based methods are denoted using ’+’ and proposed methods are denoted using ’*’. STTM and LDA are denoted using ’4’ and ’o’, respectively. TTDA with slightly lower accuracy and higher training time for increasing storage complexity. As BTT has adaptive ranks, the factor sizes increase more compared to TTDA which results in a much higher computational cost. Although the scatter tensor is approximated using a TT structure, BTT does not provide better computational complexity than TTDA as the approximation itself has 𝑂 (𝐼 2𝑁 ) complexity. Multi- branch extensions of TTDA outperform BTT both in terms of classification accuracy and computational complexity. Figure 2.4a illustrates the classification accuracy of the different methods with respect to normalized storage complexity for COIL-100 data set. For this particular dataset, we implemented all of the methods mentioned above. It can be seen that the proposed discriminant analysis framework in its original form, TTDA, gives the highest accuracy results followed by TTNPE. However, these two methods only operate at very low storage complexities since the TT-ranks of tensor factors are constrained to be smaller than the corresponding mode’s input dimensions. We also implemented STTM, which does not provide compression 𝐶 (𝐶−1) rates similar to other TT-based methods. This is due to the fact that STTM needs to learn 2 classifiers 32 (a) (b) (c) (d) Figure 2.5: Training complexity vs. Normalized storage cost of the different methods for: a) COIL-100, b) Weizmann Face, c) Cambridge Hand Gesture, and d) UCF-101. with TT structure. Moreover, these methods have very high computational complexity as will be shown in Section 2.5.3. For this reason, they will not be included in the comparisons for the other datasets. For a wide range of storage complexities, MPS and 2WTTDA perform the best and have similar accuracy. It can also be seen that the storage costs of MPS and 2WTTDA stop increasing after some point due to rank constraints. This is in line with the theoretical storage complexity analysis presented in Section 2.4. 
Tucker based methods, such as CMDA and DGTDA, along with the original vector based LDA have lower classification accuracy. Figure 2.4b similarly illustrates the classification accuracy of the different methods on the Weizmann Face Database. For all storage complexities, the proposed 2WTTDA and 3WTTDA perform better than the other methods, including TT based methods such as MPS. Figure 2.4c illustrates the classification accuracy for the Cambridge hand gesture database. In this case, 3WTTDA performs the best for most storage costs. As the number of samples for training, validation and testing is very low for Cambridge dataset, the classification accuracy fluctuates with respect to the dimensionality of the features at normalized storage cost of 0.02. Similar fluctuations can also be seen in the 33 results of [183]. Finally, we tested the proposed methods on a more realistic, large sized dataset, UCF-101. For this dataset, TT-based methods perform better than the Tucker based methods. In particular, 2WTTDA performs very close to MPS at low storage costs, whereas 3WTTDA performs well for a range of normalized storage costs and provides the highest accuracy overall. Even though our methods outperform MPS for most datasets, the classification accuracies get close for UCF-101 and COIL-100. This is due to the high number of classes in these datasets. As the number of classes increases, the number of scatter matrices that needs to be estimated also increases which results in a larger bias given limited number of training samples. This improved performance of MPS for datasets with large number of classes is also observed when MPS is compared to CMDA. Therefore, the reason that MPS and the proposed methods perform similarly is a limitation of discriminant analysis rather than the proposed tensor network structure. 2.5.3 Training Complexity In order to compute the training complexity, for TT-based methods, each set of tensor factors is optimized until the change in the normalized difference between consecutive tensor factors is less than 0.1 or 200 iterations is completed. After updating the factors in a branch, no further optimizations are done on that branch in each iteration. CMDA iteratively optimizes the subspaces for a given number of iterations (which is set to 20 to increase the speed in our experiments) or until the change in the normalized difference between consecutive subspaces is less than 0.1. Figs. 2.5a, 2.5b, 2.5c, 2.5d illustrate the training complexity of the different methods with respect to normalized storage cost for the four different datasets. In particular, Figure 2.5a illustrates the training complexity of all the methods including TTNPE, TTDA and STTM for COIL-100. It can be seen that STTM has the highest computational complexity among all of the tested methods. This is due to the fact that for a 100-class classification problem, STTM implements (100) (99)/2 one vs. one binary classifiers, increasing the computational complexity. Similarly, TTNPE has high computational complexity as it tries to learn the manifold projections which involves eigendecomposition of the embedded graph. Among the remaining methods, LDA has the highest computational complexity as it is based on learning from vectorized samples which increases the dimensionality of the covariance matrices. For the tensor based methods, the proposed 34 2WTTDA and 3WTTDA have the lowest computational complexity followed by MPS and DGTDA. 
In particular, for large datasets like UCF-101 the difference in computational complexity between our methods and existing TT-based methods such as MPS is more than a factor of 102 . Table 2.3: Classification accuracy (top) and training time (bottom) with standard deviation for various methods and datasets. Accuracy (%) 3WTTDA 2WTTDA MPS[18] CMDA[112] DGTDA[112] COIL-100 95.6 ± 0.4 94.8 ± 0.5 94.2 ± 0.2 86.3 ± 0.7 76.6 ± 0.9 Weizmann 93.6 ± 2 97.6 ± 1.2 87.5 ± 2.3 96.4 ± 1.03 69.9 ± 1.8 Cambridge 98.2 ± 1.7 89.1 ± 16.7 56.2 ± 9.8 95 ± 2.8 35.4 ± 8.7 UCF-101 68.6 ± 0.8 67.7 ± 0.9 67.9 ± 0.6 67.7 ± 0.8 57.3 ± 2.7 (s) COIL-100 0.09 ± 0.005 0.24 ± 0.06 1.4 ± 0.13 12.2 ± 6.6 0.7 ± 0.06 Weizmann 0.05 ± 0.02 0.09 ± 0.02 0.13 ± 0.01 2.6 ± 0.3 0.16 ± 0.02 Cambridge 0.11 ± 0.01 1.7 ± 1.5 2.07 ± 0.25 12.6 ± 0.3 0.7 ± 0.04 UCF-101 0.67 ± 0.02 0.853 ± 0.13 56.4 ± 1.9 413.5 ± 24.1 35.3 ± 2.9 2.5.4 Convergence In this section, we present an empirical study of convergence for TTDA in Figure 2.6 where we report the objective value of TTDA, i.e. the expression inside argmin operator in (2.5), with random initialization of projection tensors. This figure illustrates the convergence of the TTDA algorithm, which is at the core of both 2WTTDA and 3WTTDA, on COIL-100 dataset. It can be seen that even for random initializations of the tensor factors, the algorithm converges in a small number of steps. The convergence rates for 2WTTDA and 3WTTDA are faster than that of TTDA as they update smaller sized projection tensors as shown in Section 2.4. 2.5.5 Effect of Sample Size on Accuracy We also evaluated the effect of training sample size on classification accuracy for Weizmann Dataset. In Figure 2.7, we illustrate the classification accuracy with respect to training sample size for different methods. It can be seen that 3WTTDA provides a high classification accuracy even for small training datasets, i.e., for 15 training samples it provides an accuracy of 96%. This is followed by CMDA and 2WTTDA. It should also be noted that DGTDA is the most sensitive to sample size as it cannot even achieve the local optima and more data allows it to learn better classifiers. 35 Figure 2.6: Convergence curve for TTDA on COIL-100. Objective value vs. the number of iterations is shown. Figure 2.7: Comparison of classification accuracy vs. training sample size for Weizmann Face Dataset for different methods. 2.5.6 Summary of Experimental Results In Table 2.3, we summarize the performance of the different algorithms for the four different datasets considered in this chapter. In the left half of this table, we report the classification accuracy (mean ± std) of the different methods for a fixed normalized storage cost of about 2.10−2 for COIL-100, 6.10−3 for Weizmann Face, 2.10−4 for Cambridge Hand Gesture and 10−3 for UCF-101 datasets. At the given compression rates, for all datasets the proposed 3WTTDA and 2WTTDA perform better than the other tensor based methods. In some cases, the improvement in classification accuracy is significant, e.g. for Weizmann and Cambridge data sets. These results show that the proposed method achieves the best trade-off, i.e. between normalized storage complexity and classification accuracy. Similarly, the right half of Table 2.3 summarizes the average training complexity for the different 36 methods for the same normalized storage cost. From this Table, it can be seen that 3WTTDA is the most computationally efficient method for all datasets. This is followed by 2WTTDA. 
The difference in computational time becomes more significant as the size of the dataset increases, e.g. for UCF-101. Therefore, even if the other methods perform well for some of the datasets, the proposed methods provide higher accuracy at a computational complexity reduced by a factor of 102 . 2.6 Graph Regularized Tensor Train Decomposition The geometric relationship between data samples has been shown to be important for learning low- dimensional structures from high-dimensional data [171, 183]. Recently, motivated by manifold learning, dimensionality reduction of tensor objects has been formulated to incorporate the geometric structure [115]. The goal is to learn a low dimensional representation for tensor objects that incorporates the geometric structure while maintaining a low reconstruction error in the tensor decomposition. This idea of manifold learning for tensors has been mostly implemented for the Tucker method, including Graph Laplacian Tucker Decomposition (GLTD) [82] and nonnegative Tucker factorization (NTF). However, this line of work suffers from the limitations of Tucker decomposition mentioned early in the chapter [45]. Earlier in this chapter, we proposed a TT model, called multi-branch Tensor Train, such that the features can be matrices and higher-order tensors. In this way, the computational efficiency could be improved. Utilizing this structure, specifically the two-way approach, we propose a graph-regularized TT decomposition for unsupervised dimensionality reduction. 2.6.1 Problem Statement Our goal is to find a TT projection such that the geometric structure of the samples Y𝑠 ∈ R𝐼1 ×···×𝐼𝑁 is preserved, i.e. the distance between the samples, Y𝑠 , should be similar to that between the projections 𝑋𝑠 , while the reconstruction error of the low-rank TT decomposition is minimized. This goal can be formulated 37 through the following cost function as: 𝑓𝑂 ({U}, X) = Õ𝑆 kY𝑠 − U1 ×13 · · · ×13 U𝑘 ×13 𝑋𝑠 ×12 U𝑘+1 ×13 · · · ×13 U𝑁 k 2𝐹 𝑠=1 𝑆 𝑆 𝜆 ÕÕ + k 𝑋𝑠 − 𝑋𝑠0 k 2𝐹 𝑤 𝑠𝑠0 , L(U𝑛 ) > L(U𝑛 ) = I𝑟𝑛 , ∀𝑛 2 𝑠=1 0 𝑠 =1 𝑠0 ≠𝑠 where {U} denotes the set of tensor factors U𝑛 , ∀𝑛 ∈ {1, . . . , 𝑁 }, X ∈ R𝑟𝑘 ×𝑆×𝑟𝑘+1 is the tensor whose slices are 𝑋𝑠 and 𝑤 𝑠𝑠0 is the similarity between tensor samples defined by:   if Y𝑠 ∈ N𝑘 (Y𝑠0 ) or Y𝑠0 ∈ N𝑘 (Y𝑠 )   1,   𝑤 𝑠𝑠0 = , (2.9)    0,  otherwise  where N𝑘 (Y𝑠 ) is the k-nearest neighborhood of Y𝑠 . The objective function can equivalently be expressed as:   𝑓𝑂 ({U}, X) = k𝝅 𝑘+1 (Y) − U1 ×13 · · · ×13 U𝑘 ×13 X ×13 U𝑘+1 ×13 · · · ×13 U𝑁 k 2𝐹 + 𝜆tr (X ×12 Φ) ×1,2 1,3 X , L(U𝑛 ) > L(U𝑛 ) = I𝑟𝑛 , for 𝑛 ≤ 𝑘 and R(U𝑛 )R(U𝑛 ) > = I𝑟𝑛 , for 𝑛 > 𝑘, where 𝝅 𝑘+1 (Y) ∈ R𝐼1 ×···×𝐼𝑘 ×𝑆×𝐼𝑘+1 ×···×𝐼𝑁 is the permuted version of Y such that the last mode is moved to the (𝑘 + 1)th mode and all modes larger than 𝑘 are shifted by one mode, 𝑊 ∈ R𝑆×𝑆 is the adjacency matrix Í and Φ = 𝐷 − 𝑊 ∈ R𝑆×𝑆 is the graph Laplacian where 𝐷 is a diagonal degree matrix with, 𝑑 𝑠𝑠 = 𝑆𝑠0=1 𝑤 𝑠𝑠0 . 2.6.2 Optimization The goal of obtaining low-rank tensor train projections that preserve the data geometry can be achieved by minimizing the objective function as follows: argmin 𝑓𝑂 ({U}, X), s.t. L(U𝑛 ) > L(U𝑛 ) = I𝑟𝑛 , for 𝑛 ≤ 𝑘, (2.10) {U }, X and R(U𝑛 )R(U𝑛 ) > = I𝑟𝑛 , for 𝑛 > 𝑘. As we want our tensor factors to be orthogonal, the solutions lie in the Stiefel manifold S𝑛 , i.e. L(U𝑛 ) ∈ S𝑛 for 𝑛 ≤ 𝑘 and R(U𝑛 ) > ∈ S𝑛 for 𝑛 > 𝑘. Although the function 𝑓𝑂 (.) is convex, the optimization problem is nonconvex due to the manifold constraints on U𝑛 s. 
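For completeness, a minimal construction of the similarity graph in (2.9) and the Laplacian Φ = 𝐷 − 𝑊 is sketched below. Sample tensors are compared through the Frobenius distance of their vectorizations, which is one natural reading of the k-nearest neighborhood N𝑘 (·), and the choice 𝑘 = log(𝑆) used in the experiments of Section 2.7 is adopted; all data are random placeholders.

```python
import numpy as np

def knn_graph_laplacian(samples, k):
    """Symmetric k-NN adjacency W as in (2.9) and graph Laplacian Phi = D - W."""
    X = np.stack([np.asarray(s).ravel() for s in samples])        # S x prod(I_n)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)           # pairwise squared distances
    S = X.shape[0]
    W = np.zeros((S, S))
    for s in range(S):
        W[s, np.argsort(d2[s])[1:k + 1]] = 1.0                    # k nearest neighbors, self excluded
    W = np.maximum(W, W.T)                                        # w_ss' = 1 if either is a neighbor
    Phi = np.diag(W.sum(axis=1)) - W
    return W, Phi

samples = np.random.default_rng(4).standard_normal((30, 4, 7, 4, 7))   # 30 hypothetical samples
k = max(int(np.log(len(samples))), 1)                                   # k = log(S), as in Section 2.7
W, Phi = knn_graph_laplacian(samples, k)
print(W.shape, Phi.shape)      # (30, 30) (30, 30)
```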
38 The solution to the optimization problem can be obtained by Alternating Direction Method of Multipliers (ADMM). In order to solve the optimization problem we define {V}, as the set of auxiliary variables V𝑛 , ∀𝑛 ∈ {1, . . . , 𝑁 } and rewrite the objective function as: argmin 𝑓𝑂 ({V}, X) subject to U𝑛 = V𝑛 , ∀𝑛 {U }, {V },X L(U𝑛 ) ∈ S𝑛 , ∀𝑛 ≤ 𝑘 and R(U𝑛 ) > ∈ S𝑛 , ∀𝑛 > 𝑘. The partial augmented Lagrangian is given by: 𝑁 𝑁 Õ 𝛾Õ L ({U}, {V}, X, {Z}) = 𝑓𝑂 ({V}, X) − Z𝑛 ×1,2,3 1,2,3 (V𝑛 − U𝑛 ) + kV𝑛 − U𝑛 k 2𝐹 , (2.11) 𝑛=1 2 𝑛=1 where Z𝑛 s are the Lagrange multipliers and 𝛾 is the penalty parameter. As each tensor factor is independent from the others, we update the variables for each mode 𝑛 using the corresponding part of the augmented Lagrangian: 𝛾 L 𝑛 (U𝑛 , V𝑛 , Z𝑛 ) = 𝑓𝑂 (V𝑛 ) − Z𝑛 ×1,2,3 1,2,3 (V𝑛 − U𝑛 ) + kV𝑛 − U𝑛 k 2𝐹 , (2.12) 2 where 𝑓𝑂 (V𝑛 ) denotes the objective function where all variables other than V𝑛 are fixed. The solution for each variable at iteration 𝑡 + 1 can then be found using a step-by-step approach as:  V𝑛𝑡+1 = argmin L 𝑛 U𝑛𝑡 , V𝑛 , Z𝑛𝑡 , (2.13) V𝑛     L 𝑛 U𝑛 , V𝑛𝑡+1 , Z𝑛𝑡 , L(𝑈𝑛 ) ∈ S𝑛 for 𝑛 ≤ 𝑘,    U𝑛𝑡+1 = argmin U𝑛  L 𝑛 U𝑛 , V𝑛𝑡+1 , Z𝑛𝑡 , R(𝑈𝑛 ) > ∈ S𝑛     for 𝑛 > 𝑘,  Z𝑛𝑡+1 = Z𝑛𝑡 − 𝛾(V𝑛𝑡+1 − U𝑛𝑡+1 ). (2.14) Once V𝑛 , U𝑛 , Z𝑛 are updated for all 𝑛, samples X are computed using:   X 𝑡+1 = argmin L {U 𝑡+1 }, {V 𝑡+1 }, X, {Z 𝑡+1 } . (2.15) X Solution for V𝑛 : For 𝑛 ≤ 𝑘, the solution for V𝑛𝑡+1 can be written explicitly as: V𝑛𝑡+1 = argmin k𝝅 𝑘+1 (Y) − V1𝑡+1 ×13 · · · ×13 V𝑛 ×13 · · · ×13 V𝑘𝑡 ×13 X 𝑡 ×13 V𝑘+1 𝑡 ×13 · · · ×13 V𝑁𝑡 k 2𝐹 V𝑛 1,2,3 𝛾 −Z𝑛𝑡 ×1,2,3 (V𝑛 − U𝑛𝑡 ) + kV𝑛 − U𝑛𝑡 k 2𝐹 . (2.16) 2 39 We can equivalently convert this equation into matrix form as:   𝛾 V𝑛𝑡+1 = argmin k𝐻L(V𝑛 )𝑃 − T𝑛 (𝝅 𝑘+1 (Y))k 2𝐹 − tr L(Z𝑛𝑡 ) > L(V𝑛 − U𝑛𝑡 ) + kL(V𝑛 ) − L(U𝑛𝑡 ) k 2𝐹 , V𝑛 2 (2.17)   𝑡 ×1 · · · ×1 V 𝑡 ×1 X 𝑡 ×1 · · · ×1 V 𝑡 ). The analytical solution is found where 𝐻 = I𝐼𝑛 ⊗ 𝑉 ≤𝑛−1𝑡+1 , 𝑃 = R(V𝑛+1 3 3 𝑘 3 3 3 𝑁 by taking the derivative with respect to L(V𝑛 ) and setting it to zero: ! 2𝐻 > 𝐻L(V𝑛𝑡+1 )𝑃 − 𝐺 𝑃> − L(Z𝑛𝑡 ) + 𝛾L(V𝑛𝑡+1 ) − 𝛾L(U𝑛𝑡 ) = 0,  −1 T3 (V𝑛𝑡+1 ) = 2(𝑃𝑃> ⊗ 𝐻 > 𝐻) + 𝛾I𝑟𝑛−1 𝐼𝑛 𝑟𝑛 T3 (𝛾U𝑛𝑡 + Z𝑛𝑡 ) + 2T2 (𝐻 > 𝐺𝑃> ) ,  (2.18) where 𝐺 = T𝑛 (𝝅 𝑘+1 (Y)). Note that the inverse in the solution will always exist given 𝛾 > 0 as the inverse of a sum of a Hermitian matrix and an identity matrix always exists. When 𝑛 > 𝑘, following (2.17) the solution for V𝑛 can be written in the same manner but with different  𝑡+1  𝐻, 𝐺 and 𝑃, where 𝐻 = 𝑉 ≤𝑛−1 𝑡 , 𝑃 = 𝑉>𝑛 ⊗ I𝐼𝑛 and 𝐺 = T𝑛+1 (𝝅 𝑘+1 (Y)). Solution for U𝑛 : For 𝑛 ≤ 𝑘, we can solve (2.12) for U𝑛 using:   𝛾 U𝑛𝑡+1 = argmin −tr L(Z𝑛𝑡 ) > L(V𝑛𝑡+1 − U𝑛 ) + kL(V𝑛𝑡+1 ) − L(U𝑛 ) k 2𝐹 U𝑛 :L( U𝑛 ) ∈S𝑛 2 1 = argmin kL(V𝑛𝑡+1 ) − L(Z𝑛𝑡 ) − L(U𝑛 ) k 2𝐹 , (2.19) U𝑛 :L( U𝑛 ) ∈S𝑛 𝛾 which is found by applying a singular value decomposition to L(V𝑛𝑡+1 ) − 𝛾1 L(Z𝑛𝑡 ). When 𝑛 > 𝑘, the optimal solution is similarly found by applying SVD to R(V𝑛𝑡+1 ) − 𝛾1 R(Z𝑛𝑡 ). Solution for X: Let 𝝅2 (X) ∈ R𝑟𝑘 ×𝑟𝑘+1 ×𝑆 be the permutation of X, (2.15) can equivalently be rewritten in matrix form as: h i 2 𝑡+1 > + 𝜆tr L(𝝅2 (X))ΦL(𝝅2 (X)) > . 𝑡+1  argmin 𝑉>𝑘 ⊗ 𝑉 ≤𝑘 L(𝝅2 (X)) − T 𝑁 (Y) (2.20) X 𝐹 The solution for X 𝑡+1 does not have any constraints, thus it is solved analytically by setting the derivative of (2.20) to zero: 2𝐻 > (𝐻L(𝝅2 (X 𝑡+1 )) − 𝐺) + 2𝜆L(𝝅2 (X 𝑡+1 ))Φ = 0, 𝐻 > 𝐻L(𝝅2 (X 𝑡+1 )) + 𝜆L(𝝅2 (X 𝑡+1 ))Φ = 𝐻 > 𝐺, (2.21) where 𝐻 = 𝑉>𝑘 𝑡 > ⊗ 𝑉 𝑡+1 and 𝐺 = T (Y). (2.21) is a Sylvester equation which can be solved efficiently ≤𝑘 𝑁 [14]. 
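Two of the sub-problems above have closed-form updates that are easy to verify numerically: the orthogonality-constrained U𝑛 step reduces to finding the nearest matrix with orthonormal columns to L(V𝑛 ) − (1/𝛾)L(Z𝑛 ), which is one standard reading of "applying a singular value decomposition" above, and the X step is the Sylvester equation (2.21). The sketch below uses random stand-in matrices with hypothetical shapes.

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(5)

# --- U_n update: nearest orthonormal-column matrix to L(V_n) - (1/gamma) L(Z_n) ---
def project_orthonormal(A):
    """Closest matrix with orthonormal columns to A in Frobenius norm (thin SVD)."""
    u, _, vt = np.linalg.svd(A, full_matrices=False)
    return u @ vt

gamma = 10.0
LV, LZ = rng.standard_normal((32, 4)), rng.standard_normal((32, 4))
LU = project_orthonormal(LV - LZ / gamma)
print(np.allclose(LU.T @ LU, np.eye(4)))                  # True: L(U_n)^T L(U_n) = I_{r_n}

# --- X update: (H^T H) X + X (lambda * Phi) = H^T G is a Sylvester equation ---
r, S, m, lam = 6, 40, 50, 0.5                             # hypothetical sizes
H = rng.standard_normal((m, r))                           # stand-in for V_{>k} kron V_{<=k}
G = rng.standard_normal((m, S))                           # stand-in for T_N(Y)
W = np.eye(S, k=1) + np.eye(S, k=-1)                      # path-graph adjacency as a stand-in
Phi = np.diag(W.sum(axis=1)) - W                          # its Laplacian D - W
X = solve_sylvester(H.T @ H, lam * Phi, H.T @ G)          # solves A X + X B = Q
print(np.allclose(H.T @ H @ X + lam * X @ Phi, H.T @ G))  # True
```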
Similar to the case for V𝑛 , the solution to this problem always exists. Solution for Z𝑛 : Finally, we update the Lagrange multipliers Z𝑛𝑡 using (2.14). 40 Algorithm 2.3: Graph Regularized Tensor Train-ADMM(GRTT-ADMM) Input: Input tensors Y𝑠 ∈ R𝐼1 ×𝐼2 ×···×𝐼𝑁 where 𝑠 ∈ {1, . . . , 𝑆}, initial tensor factors {U 1 }, 𝑛 ∈ {1, . . . , 𝑁 }, 𝑘, 𝜆, 𝑟 1 , . . . , 𝑟 𝑁 , 𝐿𝑜𝑜 𝑝𝐼𝑡𝑒𝑟, 𝐶𝑜𝑛𝑣𝑇 ℎ𝑟𝑒𝑠ℎ Output: U𝑛 , 𝑛 ∈ {1, . . . , 𝑁 }, and 𝑋𝑠 , ∀𝑠 1: {V 1 } ← {U 1 }. 2: {Z 1 } ← 0. 3: while 𝑡 < 𝐿𝑜𝑜 𝑝𝐼𝑡𝑒𝑟 or 𝑐 > 𝐶𝑜𝑛𝑣𝑇 ℎ𝑟𝑒𝑠 do 4: for 𝑛 = 1 : 𝑁 do 5: Find V𝑛𝑡+1 using (2.18). 6: Find U𝑛𝑡+1 using SVD to solve (2.19). 7: Find Z𝑛𝑡+1 using (2.14). 8: end for 9: Find X 𝑡+1 using (2.21). Í 𝑁 k V𝑛𝑡+1 −V𝑛𝑡 k 2𝐹 10: 𝑐 ← 𝑁1 𝑛=1 k V𝑛𝑡 k 2𝐹 11: 𝑡 = 𝑡 + 1. 12: end while 2.6.3 Convergence Convergence of ADMM is guaranteed for convex functions but there is no theoretical proof of the convergence of ADMM for nonconvex functions. Recent research has provided some theoretical guarantees for the convergence of ADMM for a class of nonconvex problems under some conditions [187]. Our objective function is nonconvex due to unitary constraints. In [187], it has been shown that this type of nonconvex optimization problems, i.e. convex optimization on a Stiefel manifold, converge under some conditions. We show that these conditions hold for each optimization problem corresponding to mode 𝑛. The gradient of 𝑓𝑂 with respect to V𝑛 is Lipschitz continuous with Lipschitz constant 𝐿 ≥ k𝑃𝑃> ⊗ 𝐻 > 𝐻 k 2 , which fulfills the conditions given in [187]. Thus, L 𝑛 convergences to a set of solutions V𝑛𝑡 , U𝑛𝑡 , Z𝑛𝑡 , given that 𝛾 ≥ 2𝐿 + 1. The solution for X is found analytically. As the iterative solutions for each variable converge and the optimization function is nonnegative, i.e. bounded from below, the algorithm converges to a local minimum. 2.7 Experiments The proposed method is evaluated for clustering and compared to existing tensor clustering methods including k-means, MPS [18], TTNPE [183] and GLTD [82] for Weizmann Face Database and MNIST Dataset. Clustering quality is quantified by Normalized Mutual Information (NMI). Average accuracy with respect to both storage complexity and computation time over 20 experiments are reported for all methods. 41 In the following experiments, the storage complexity is quantified as the size of the tensor factors (U𝑛 , ∀𝑛) and projections (X𝑠 , ∀𝑠). The varying levels of storage cost are obtained by varying 𝑟 𝑛 s in the implementation of the tensor decomposition methods. Using varying levels of a truncation parameter 𝜏 ∈ (0, 1], the singular values smaller than 𝜏 times the largest singular value are discarded. The rest are used to determine ranks 𝑟 𝑛 for both TT based and TD based methods. For GRTT and TTNPE, the ranks are selected using TT decomposition proposed in [140], while for GLTD truncated HOSVD was used. Computational complexity is quantified as the time it takes to learn the tensor factors. In order to compute the run time, for TT based methods, each set of tensor factors is optimized until the change in the normalized difference between consecutive tensor factors is less than 0.01 or 50 iterations are completed. The regularization parameter, 𝜆, for each experiment was selected using a validation set composed of a small batch of samples not included in the experiments. 5 random experiments were conducted and optimal 𝜆 was selected as the value that gave the best average NMI for a range of 𝜆 values from 0.001 to 1000 increasing in a logarithmic scale. 
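The 𝜏-based rank selection used in both the classification and clustering experiments can be sketched as follows. For illustration the threshold is applied to sequential unfoldings of a single random sample; in the actual experiments the same threshold is applied inside the TT decomposition of [140] or the truncated HOSVD, so this is only an approximation of that procedure.

```python
import numpy as np

def select_rank(mat, tau):
    """Number of singular values that are at least tau times the largest one."""
    s = np.linalg.svd(mat, compute_uv=False)
    return int(np.sum(s >= tau * s[0]))

def tt_ranks_by_truncation(Y, tau):
    """Pick ranks R_1, ..., R_{N-1} by thresholding the singular values of the
    sequential unfoldings of Y (a simplified stand-in for the rule used in [140])."""
    dims = Y.shape
    return [select_rank(Y.reshape(int(np.prod(dims[:n])), -1), tau)
            for n in range(1, len(dims))]

Y = np.random.default_rng(3).standard_normal((8, 8, 8, 8))   # hypothetical sample
for tau in (0.1, 0.5, 0.9):
    print(tau, tt_ranks_by_truncation(Y, tau))               # larger tau gives smaller ranks
```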
The similarity graphs were constructed using k-nearest neighbor method with 𝑘 = 𝑙𝑜𝑔(𝑆) following [177]. 2.7.1 MNIST MNIST is a database of grayscale handwritten digit images where each image is of size 28 × 28. We transformed each of the images to a 4 × 7 × 4 × 7 tensor. Reshaping the inputs into higher order tensors is common practice and was employed in prior work [90, 140, 209, 45, 17]. In our experiments, we used a subset of 500 images with 50 images from each class. 50 samples with 5 samples from each class are used as validation set to determine 𝜆. (a) (b) Figure 2.8: (a) Normalized Mutual Information vs. Storage Complexity of different methods for MNIST dataset. (b) Computation Time vs. Storage Complexity of different methods for MNIST dataset. 42 In Figure 2.8a, we can see that at all storage complexity levels, our approach gives the best clustering result in terms of NMI. The dotted purple line represents the accuracy of k-means clustering on original tensor data. Even though the performance of TTNPE is the closest to our method, it is computationally inefficient. In Figure 2.8b, we can see that our approach is faster than GLTD and TTNPE at all storage complexities. MPS is the most efficient in terms of speed but it provides poor clustering quality. 2.7.2 COIL The dataset consists of 7,200 RGB images of 100 objects of size 128 × 128. Each object has 72 images, where each image corresponds to a different pose angle ranging from 0 to 360 degrees with increments of 5 degrees [133]. We used a subset of 20 classes and 32 randomly selected, downsampled, grayscale samples from each class. Each image was converted to an 8 × 8 × 8 × 8 tensor. 8 samples from each class are used as validation set. (a) (b) Figure 2.9: (a) Normalized Mutual Information vs Storage Complexity of different methods for COIL dataset. (b) Computation Time vs Storage Complexity of different methods for COIL dataset. From Figure 2.9a, we can see that the proposed method provides the best clustering results compared to all other methods. The results for GLTD seem to deteriorate with increasing ranks, which is a result of using orthonormal tensor factors. TTNPE gives results closest to the proposed method but it is computationally inefficient and gets very slow with increasing 𝑟 𝑛 s. From Figure 2.9b, we can see that MPS provides best results in terms of run time but the proposed method has a similar computational complexity while providing better clustering accuracy. 43 2.8 Conclusions In this chapter, we propose a new Tensor Train implementation structure and two associated learning tasks. In the first part of the chapter, i.e. Tensor Train Disciminant Analysis, we introduce a novel approach for Tensor Train based discriminant analysis for tensor object classification. The proposed approach first formulated linear discriminant analysis such that the learnt subspaces have a TT structure. The resulting framework, TTDA, reduces storage complexity at the expense of high computational complexity. This increase in computational complexity is then addressed by reshaping the projection vector into matrices and third-order tensors, resulting in 2WTTDA and 3WTTDA, respectively. A theoretical analysis of storage and computational complexity illustrated the tradeoff between these two quantities and suggest a way to select the optimal number of modes in the reshaping of TT structure. 
The proposed methods were compared with the state-of-the-art TT based subspace learning methods as well as tensor based discriminant analysis for four datasets. While providing reduced storage and computational costs, the proposed methods also yield higher or similar classification accuracy compared to state-of-the art tensor based learning methods such as CMDA, STTM, TTNPE and MPS. In the second part of the chapter, i.e. Graph Regularized Tensor Train Decomposition we proposed a unsupervised graph regularized tensor train decomposition for dimensionality reduction. To the best of our knowledge, this is the first tensor train based dimensionality reduction method that incorporates manifold information through graph regularization. The proposed method also utilizes a multi-branch structure to implement tensor train decomposition which increases the computational efficiency. An ADMM based algorithm is proposed to solve the resulting optimization problem. The proposed method was compared to GLTD, TTNPE and MPS for unsupervised learning in two different datasets. The proposed multi-branch structure can also be extended to other unsupervised methods such as dictionary learning, subspace learning, denoising, data recovery, and compression applications. The structure can also be optimized by permuting the modes in a way that the dimensions are better balanced than the original order. 44 CHAPTER 3 TENSOR METHODS FOR ANOMALY DETECTION ON SPATIOTEMPORAL DATA 3.1 Introduction Large volumes of spatiotemporal data are ubiquitous in a diverse range of applications including climate science, social sciences, neuroscience, epidemiology [20], transportation systems [55], mobile health, and Earth sciences [207]. One emerging application of interest in spatiotemporal data is anomaly detection. De- tecting anomalies from these large data volumes is important for identifying interesting but rare phenomena, e.g. traffic congestion or irregular crowd movement in urban areas, or hot-spots for monitoring outbreaks in infectious diseases. Traditionally, the problem of anomaly detection has been approached mainly through statistical and machine learning techniques [34, 59, 79]. However, these techniques are not as effective on spatiotemporal data as anomalies are highly correlated across time and space, they can no longer be modeled as i.i.d. The definition of anomaly and the suitability of a particular method is determined by the application. In this chapter, we focus on urban anomaly detection [206, 205, 117, 118, 203], where anomalies correspond to incidental events that occur rarely, such as irregularity in traffic volume, unexpected crowds, etc. Urban data are spatiotemporal data collected by mobile devices or distributed sensors in cities and are usually associated with timestamps and location tags. Detecting and predicting urban anomalies are of great importance to policymakers and governments for understanding city-scale human mobility and activity patterns, inferring land usage and region functions and discovering traffic problems [206, 31]. Urban anomalies are usually modeled as group anomalies within spatiotemporal data. This type of anomalies exhibit themselves as spatially contiguous groups of locations that show anomalous values con- sistently for a short duration of time stamps. 
Early approaches to detecting these anomalies decompose the anomaly detection problem by first treating the spatial and temporal properties of the outliers independently using univariate vector modeling frameworks, and then merging them in a post-processing step [58, 191]. Some examples are dynamic linear models (DLMs) [108], including time-varying autoregressive (TVAR) models [24, 25] and switching Kalman filtering/smoothing (SKF/SKS) [131, 134]. However, as the number of sensors increases, scalability becomes a critical issue and these methods become inefficient and unreliable [59]. For these reasons, one natural approach to address ST anomaly detection has been to use tensor decomposition to capture the multiway structure of the data.

While there has been a growing number of papers on low-rank tensor models for anomaly detection [60, 186, 193, 185, 205, 117], most of the existing methods are not particularly suited to the complex structure of anomalies in urban traffic data and the unique challenges associated with them. These challenges include: 1) scarcity of anomalies; 2) the contextual dependency of what constitutes an anomaly, i.e., the criteria for an anomaly may vary for different regions and time windows due to external influences such as weather patterns; 3) long-term periodicity across time, e.g., one day or one week; and 4) strong short-term temporal correlations.

In this chapter, we address these challenges by proposing temporally regularized, locally consistent, robust low-rank plus sparse tensor models to decompose the urban traffic data into normal and anomalous components. The key contributions of the proposed methods are:

• Modeling Temporal Persistence: The traditional low-rank plus sparse tensor decomposition model is modified to account for the characteristics of urban traffic data. As anomalies tend to last for periods of time, i.e., have strong short-term dependencies, we impose temporal smoothness on the sparse part of the tensor through total variation regularization. This regularization ensures that instantaneous changes in the data, which may be due to errors in sensing, are not mistaken for actual anomalies. This formulation leads to our first algorithm: low-rank plus temporally smooth sparse (LOSS) tensor decomposition.

• Incorporating Local Structure: We introduce additional structure to the solution of LOSS by enforcing the low-rank tensor corresponding to normal activity to be smooth on manifolds across each mode. These smoothness terms are added to LOSS as graph regularization across each mode, yielding GLOSS. While the low-rank structure in LOSS captures the global structure, GLOSS further reduces redundancy in the representation and incorporates the local geometric structure. This new framework exploits the joint structures and correlations across the modes to more accurately model and process ST signals.

• Using Local Structure for Efficiency: We reduce the computational cost incurred by the nuclear norm minimization in GLOSS and LOSS by utilizing the graph smoothness term to estimate normal activity. This allows an efficient algorithm that is still effective in extracting anomalies under some mild constraints on the data.

• Robustness to Missing and Heterogeneous Data: Our optimization framework is formulated in a flexible manner such that it solves a tensor completion problem in conjunction with anomaly detection. This framework takes into account the missing data and fits a structure to the observed part when estimating the underlying structure.
Thus, it provides robustness against missing data. The heterogeneity of traffic data is taken into account through the use of weighted nuclear norm minimization to emphasize the difference in low-rank structure across modes.

The rest of the chapter is organized as follows. In Section 3.2, we review some of the related work in spatiotemporal anomaly detection, in particular tensor based anomaly detection methods. In Section 3.3, we formulate the optimization problems of LOSS and GLOSS, propose ADMM based solutions and analyze the computational complexity of these solutions. In Section 3.4, we formulate the optimization problem of LOGSS, propose an ADMM based solution and analyze its computational complexity. In Section 3.5, we provide an analysis of the convergence of all proposed algorithms. In Section 3.6, we explain how the anomaly scores for each element in the data are generated. In Section 3.7, we describe the experimental settings with both synthetic and real data and compare the proposed methods with baseline anomaly detection methods as well as other tensor based methods. Finally, we conclude the chapter in Section 3.8 by discussing the proposed algorithms, experimental results and future work.

3.2 Related Work

Anomaly detection in spatiotemporal data is usually studied under three categories: point anomalies, trajectory anomalies and group anomalies. Point anomalies are defined as spatiotemporal outliers that break the natural ST autocorrelation structure of the normal points. Most ST point anomaly detection algorithms, such as ST-DBSCAN [100], assume homogeneity in neighborhood properties across space and time, which can be violated in the presence of ST heterogeneity. Trajectory anomalies are usually detected by computing pairwise similarities among trajectories and identifying trajectories that are spatially distant from the others [105]. Finally, group anomalies appear in ST data as spatially contiguous groups of locations that show anomalous values consistently for a short duration of time stamps. Most approaches for detecting group anomalies decompose the anomaly detection problem by first treating the spatial and temporal properties of the outliers independently, and then merging them in a post-processing step [58, 191]. However, as the number of sensors increases, scalability becomes a critical issue and these methods become unreliable. For these reasons, one natural approach to address ST group anomaly detection has been to use tensor decomposition.

Low-rank tensor decomposition and completion have been proposed as suitable approaches to anomaly detection in spatiotemporal data, as these methods are a natural extension of spectral anomaly detection techniques to multi-way data [59, 113, 68, 207, 39, 118, 117, 193]. These models project the original spatiotemporal data into a low-dimensional latent space, in which the normal activity is represented with better spatial and temporal resolution. The learned features, i.e., factor matrices or core tensors, are then used to detect anomalies by monitoring the reconstruction error at each time point [144, 143, 166, 193] or by applying well-known statistical tests to the extracted multivariate features [60, 207]. For example, Zhang et al. [207] proposed a tensor-based method to detect targets in hyperspectral imaging data with both spectral and spatial anomaly characteristics. Shi et al. [158] proposed an incremental tensor decomposition algorithm for online anomaly detection.
In [60], a hybrid model is constructed from a topology tensor and a flow tensor, and Tucker decomposition with an adjustable core size is used to detect anomalies. Xu et al. [193] proposed a sliding-window tensor factorization to detect anomalies. Wang et al. [185] extended the traditional Tucker decomposition in a probabilistic manner to detect abnormal activity behaviors.

Although these low-rank tensor models are powerful in identifying abnormal traffic activity, they have multiple shortcomings. First, they rely on well-known factorization models such as Tucker [60, 207, 193] and CP [117]. These methods obtain a low-dimensional projection of the data without taking the particular structure of anomalies, i.e., sparsity, into account. The proposed method incorporates the sparsity of anomalies into the model by assuming that the anomalies lie in the sparse part of the tensor rather than in the low-rank part. Second, prior work on higher order robust principal component analysis (HoRPCA) within the framework of anomaly detection [113, 68] does not ensure temporal smoothness of the detected anomalies. However, in urban data, anomalies typically last for some time and are not instantaneous. Recently, temporally regularized matrix factorization models [149, 50, 200] and their extensions to tensors [199, 179, 184, 88] have been implemented to capture the temporal dynamics. While the smoothness term included in these papers is very similar to ours, it is enforced on the factor matrices corresponding to the temporal mode, whereas in our approach temporal regularization is applied directly to the sparse tensor.

Finally, the existing tensor based anomaly detection methods are limited to modeling anomalies that lie in linear subspaces. The proposed method utilizes a simultaneously structured model [84] for robust tensor decomposition. This model structures the low-rank tensor by the proximities within each mode and forces the solution to be smooth on manifolds constructed from each mode. This goal is achieved by adding graph regularization across each mode of the tensor. Although graph regularized tensor decomposition has been used before [115, 135, 82, 148, 13, 180, 192, 51], it is usually applied with respect to a single mode of the tensor and it has not been utilized for anomaly detection in urban traffic data. Yet, for many tensors, correlations exist across all modalities. Several recent papers [67, 84, 156, 135, 155, 128, 127] exploit this coupled relationship to co-organize matrices and infer underlying row and column embeddings. The methods GLOSS and LOGSS in this chapter are direct extensions of this work to tensors, where we consider graph regularization across all modes to capture the coupled correlation structure in spatiotemporal data.

3.3 Robust Low-Rank Tensor Decomposition for Anomaly Detection

In the following discussion, we model spatiotemporal data as a four-mode tensor $\mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times I_3 \times I_4}$. The first mode corresponds to the hours in a day. The second mode corresponds to the days of a week, as urban traffic activity shows highly similar patterns on the same days of different weeks. The third mode corresponds to the different weeks, and the last mode corresponds to the spatial locations, such as stations for metro data, sensors for traffic data or zones for other urban data. A minimal sketch of how such a tensor can be assembled from timestamped records is given below.
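As an illustration, the sketch below aggregates timestamped arrival records into a tensor with this (hour, day of week, week, location) structure. The field names and the use of pandas are assumptions made purely for illustration; they are not part of the datasets described later in this chapter.

```python
import numpy as np
import pandas as pd

def build_traffic_tensor(df, n_zones=81):
    """Aggregate per-event records into a (hour, weekday, week, zone) count tensor.

    `df` is assumed to have a datetime column 'timestamp' and an integer
    column 'zone' in [0, n_zones) -- hypothetical field names for illustration.
    """
    Y = np.zeros((24, 7, 52, n_zones))
    hours = df['timestamp'].dt.hour.to_numpy()
    days = df['timestamp'].dt.dayofweek.to_numpy()                      # 0 = Monday
    weeks = (df['timestamp'].dt.isocalendar().week.astype(int).to_numpy() - 1).clip(0, 51)
    zones = df['zone'].to_numpy()
    np.add.at(Y, (hours, days, weeks, zones), 1)                        # count arrivals per cell
    return Y
```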
3.3.1 Problem Statement

Assuming that anomalies are rare events, our goal is to decompose $\mathcal{Y}$ into a low-rank part, $\mathcal{L}$, that corresponds to normal activity and a sparse part, $\mathcal{S}$, that corresponds to the anomalies. This model relies on the assumption that normal activity can be embedded into a lower dimensional subspace while anomalies are outliers. We also take into account the existence of missing elements in the data, i.e., the observed data is $P_\Omega[\mathcal{Y}]$. This goal can be formulated as:

$$\min_{\mathcal{L},\mathcal{S}} \; \|\mathcal{L}\|_* + \lambda \|\mathcal{S}\|_1, \quad \text{s.t.} \; P_\Omega[\mathcal{L}+\mathcal{S}] = P_\Omega[\mathcal{Y}], \qquad (3.1)$$

where $\lambda$ is the regularization parameter for sparsity.

Since urban anomalies tend to be temporally continuous, i.e., smooth in the first mode, this assumption can be incorporated into the above formulation as:

$$\min_{\mathcal{L},\mathcal{S}} \; \|\mathcal{L}\|_* + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{S} \times_1 \Delta\|_1, \quad \text{s.t.} \; P_\Omega[\mathcal{L}+\mathcal{S}] = P_\Omega[\mathcal{Y}], \qquad (3.2)$$

where $\gamma$ is the regularization parameter for temporal smoothness and $\|\mathcal{S} \times_1 \Delta\|_1$ quantifies the sparsity of the projection of the tensor $\mathcal{S}$ onto the discrete-time differentiation operator along the first mode, where $\Delta$ is defined as:

$$\Delta = \begin{bmatrix} 1 & -1 & 0 & \dots & 0 \\ 0 & 1 & -1 & \dots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \dots & 1 & -1 \\ -1 & 0 & \dots & 0 & 1 \end{bmatrix}. \qquad (3.3)$$

It is common to incorporate the relationships among data points as auxiliary information, in addition to the low-rank assumption, to improve the quality of tensor decomposition and to capture the local data structure [132, 164]. This approach, also known as manifold learning, is an effective dimensionality reduction technique leveraging geometric information. The intuitive idea behind manifold learning is that if two objects are close in the intrinsic geometry of the data manifold, they should be close to each other after dimensionality reduction. For tensors, this usually reduces to forcing two similar objects to behave similarly in the projected low-dimensional space through a graph Laplacian term. In this section, since we are trying to learn anomalies from a single tensor, we do not have tensor samples and their projections. Instead, we preserve the relationships between each mode unfolding of the tensor data, as each mode corresponds to a different attribute of the data. This goal can be achieved through graph regularization across each mode as follows:

$$\min_{\mathcal{L},\mathcal{S}} \; \|\mathcal{L}\|_* + \frac{\theta}{2} \sum_{n=1}^{4} \sum_{i=1}^{I_n} \sum_{\substack{i'=1 \\ i'\neq i}}^{I_n} \|\mathbf{L}_{(n),i} - \mathbf{L}_{(n),i'}\|_F^2 \, w^n_{ii'} + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{S} \times_1 \Delta\|_1, \quad \text{s.t.} \; P_\Omega[\mathcal{L}+\mathcal{S}] = P_\Omega[\mathcal{Y}],$$

where $\theta$ is the weight parameter for graph regularization and $\mathbf{L}_{(n),i}$ denotes the $i$th row of the mode-$n$ unfolding $\mathbf{L}_{(n)}$. The above optimization problem can equivalently be rewritten using the trace norm to represent the graph regularization and the total variation (TV) norm across the temporal mode to describe temporal smoothness as:

$$\min_{\mathcal{L},\mathcal{S}} \; \|\mathcal{L}\|_* + \theta \sum_{n=1}^{4} \mathrm{tr}\!\left(\mathbf{L}_{(n)}^\top \Phi_n \mathbf{L}_{(n)}\right) + \lambda \|\mathcal{S}\|_1 + \gamma \sum_{i_2,i_3,i_4} \|\mathbf{s}_{i_2,i_3,i_4}\|_{\mathrm{TV}}, \quad \text{s.t.} \; P_\Omega[\mathcal{L}+\mathcal{S}] = P_\Omega[\mathcal{Y}], \qquad (3.4)$$

where $\|\cdot\|_{\mathrm{TV}}$ denotes the total variation norm. The solutions to the optimization problems (3.2) and (3.4) are referred to as LOw-rank plus temporally Smooth Sparse (LOSS) decomposition and Graph regularized LOw-rank plus temporally Smooth Sparse (GLOSS) decomposition, respectively.

3.3.2 Optimization

The proposed objective function is convex. In prior work, ADMM has been shown to be effective at solving similar optimization problems in an iterative fashion [70, 7]. Thus, we follow a similar approach for solving the optimization problem given in (3.4). A small sketch of the temporal difference operator $\Delta$ and the associated penalties in (3.2) is given below.
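Before deriving the updates, the following minimal numpy sketch builds the circular first-difference operator $\Delta$ of (3.3) and evaluates the sparsity and temporal smoothness penalties appearing in (3.2); the function names are illustrative.

```python
import numpy as np

def temporal_diff_operator(I1):
    """Circular first-difference matrix Delta of size I1 x I1, as in (3.3):
    row i computes x[i] - x[(i + 1) mod I1]."""
    return np.eye(I1) - np.roll(np.eye(I1), 1, axis=1)

def loss_penalties(S, lam, gamma):
    """l1 sparsity penalty plus temporal total-variation penalty on the mode-1 fibers of S."""
    D = temporal_diff_operator(S.shape[0])
    # mode-1 product S x_1 Delta: apply D along the first (hour-of-day) mode
    DS = np.tensordot(D, S, axes=([1], [0]))
    return lam * np.abs(S).sum() + gamma * np.abs(DS).sum()
```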
To separate the minimization of the TV norm, the $\ell_1$ norm and the graph regularization from each other, we introduce auxiliary variables $\mathcal{Z}$, $\mathcal{W}$, $\{\mathfrak{L}\} := \{\mathfrak{L}^1, \mathfrak{L}^2, \mathfrak{L}^3, \mathfrak{L}^4\}$ and $\{\mathfrak{G}\} := \{\mathfrak{G}^1, \mathfrak{G}^2, \mathfrak{G}^3, \mathfrak{G}^4\}$ such that the optimization problem becomes:

$$\min_{\mathcal{L},\{\mathfrak{L}\},\{\mathfrak{G}\},\mathcal{S},\mathcal{Z},\mathcal{W}} \; \sum_{n=1}^{4} \left( \psi_n \|\mathfrak{L}^n_{(n)}\|_* + \theta\, g(\mathfrak{G}^n, \Phi_n) \right) + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{Z}\|_1,$$
$$\text{s.t.} \; \mathcal{W} = \mathcal{S}, \; \mathcal{Z} = \mathcal{W} \times_1 \Delta, \; P_\Omega[\mathcal{L}+\mathcal{S}] = P_\Omega[\mathcal{Y}], \; \mathfrak{L}^n = \mathcal{L}, \; \mathfrak{G}^n = \mathcal{L}, \; n \in \{1,2,3,4\}, \qquad (3.5)$$

where $g(\mathfrak{G}^n, \Phi_n) = \mathrm{tr}\!\left(\mathfrak{G}^{n\,\top}_{(n)} \Phi_n \mathfrak{G}^n_{(n)}\right)$ is the graph regularization term for each auxiliary variable $\mathfrak{G}^n$. To solve the above optimization problem, we propose using ADMM with the partial augmented Lagrangian:

$$\sum_{n=1}^{4} \left( \psi_n \|\mathfrak{L}^n_{(n)}\|_* + \theta\, g(\mathfrak{G}^n, \Phi_n) \right) + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{Z}\|_1 + \frac{\beta_1}{2} \|P_\Omega[\mathcal{L}+\mathcal{S}-\mathcal{Y}-\Gamma_1]\|_F^2 + \frac{\beta_2}{2} \sum_{n=1}^{4} \|\mathfrak{L}^n - \mathcal{L} - \Gamma_2^n\|_F^2 + \frac{\beta_3}{2} \sum_{n=1}^{4} \|\mathcal{L} - \mathfrak{G}^n - \Gamma_3^n\|_F^2 + \frac{\beta_4}{2} \|\mathcal{W} \times_1 \Delta - \mathcal{Z} - \Gamma_4\|_F^2 + \frac{\beta_5}{2} \|\mathcal{S} - \mathcal{W} - \Gamma_5\|_F^2, \qquad (3.6)$$

where $\Gamma_1, \Gamma_2^n, \Gamma_3^n, \Gamma_4, \Gamma_5 \in \mathbb{R}^{I_1 \times I_2 \times I_3 \times I_4}$ are the Lagrange multipliers.

1. $\mathcal{L}$ update: The low-rank variable $\mathcal{L}$ can be updated using:

$$\mathcal{L}^{t+1} = \operatorname*{argmin}_{\mathcal{L}} \; \frac{\beta_1}{2} \|P_\Omega[\mathcal{L}+\mathcal{S}^t-\mathcal{Y}-\Gamma_1^t]\|_F^2 + \sum_{n=1}^{4} \left( \frac{\beta_2}{2} \|\mathfrak{L}^{n,t} - \mathcal{L} - \Gamma_2^{n,t}\|_F^2 + \frac{\beta_3}{2} \|\mathcal{L} - \mathfrak{G}^{n,t} - \Gamma_3^{n,t}\|_F^2 \right), \qquad (3.7)$$

which has the analytical solution:

$$P_\Omega[\mathcal{L}^{t+1}] = \frac{P_\Omega[\beta_1 \mathcal{T}_1 + \beta_2 \mathcal{T}_2 + \beta_3 \mathcal{T}_3]}{\beta_1 + 4(\beta_2+\beta_3)}, \qquad (3.8)$$
$$P_{\Omega^\perp}[\mathcal{L}^{t+1}] = \frac{P_{\Omega^\perp}[\beta_2 \mathcal{T}_2 + \beta_3 \mathcal{T}_3]}{4(\beta_2+\beta_3)}, \qquad (3.9)$$

where $\mathcal{T}_1 = \mathcal{Y} - \mathcal{S}^t + \Gamma_1^t$, $\mathcal{T}_2 = \sum_{n=1}^{4} (\mathfrak{L}^{n,t} - \Gamma_2^{n,t})$ and $\mathcal{T}_3 = \sum_{n=1}^{4} (\mathfrak{G}^{n,t} + \Gamma_3^{n,t})$.

2. $\mathfrak{L}^n$ update: The variables $\mathfrak{L}^n$ can be updated using:

$$\mathfrak{L}^{n,t+1} = \operatorname*{argmin}_{\mathfrak{L}^n} \; \psi_n \|\mathfrak{L}^n_{(n)}\|_* + \frac{\beta_2}{2} \|\mathfrak{L}^n - \mathcal{L}^{t+1} - \Gamma_2^{n,t}\|_F^2, \qquad (3.10)$$

which is solved by a soft thresholding operator on the singular values of $\left(\mathcal{L}^{t+1} + \Gamma_2^{n,t}\right)_{(n)}$ with a threshold of $\psi_n/\beta_2$.

3. $\mathfrak{G}^n$ update: The variables $\mathfrak{G}^n$ can be updated using:

$$\mathfrak{G}^{n,t+1} = \operatorname*{argmin}_{\mathfrak{G}^n} \; \theta\,\mathrm{tr}\!\left(\mathfrak{G}^{n\,\top}_{(n)} \Phi_n \mathfrak{G}^n_{(n)}\right) + \frac{\beta_3}{2} \|\mathcal{L}^{t+1} - \mathfrak{G}^n - \Gamma_3^{n,t}\|_F^2, \qquad (3.11)$$

which is solved by:

$$\mathfrak{G}^{n,t+1}_{(n)} = \beta_3\, G_{\mathrm{inv}} \left(\mathcal{L}^{t+1} - \Gamma_3^{n,t}\right)_{(n)}, \qquad (3.12)$$

where $G_{\mathrm{inv}} = (2\theta\Phi_n + \beta_3 \mathbf{I})^{-1}$ always exists and can be computed outside the loop for a faster update.

4. $\mathcal{S}$ update: The variable $\mathcal{S}$ can be updated using:

$$\mathcal{S}^{t+1} = \operatorname*{argmin}_{\mathcal{S}} \; \lambda \|\mathcal{S}\|_1 + \frac{\beta_1}{2} \|P_\Omega[\mathcal{S}+\mathcal{L}^{t+1}-\mathcal{Y}-\Gamma_1^t]\|_F^2 + \frac{\beta_5}{2} \|\mathcal{S} - \mathcal{W}^t - \Gamma_5^t\|_F^2, \qquad (3.13)$$

where the Frobenius norm terms can be combined and the expression can be simplified into:

$$P_\Omega[\mathcal{S}^{t+1}] = \operatorname*{argmin}_{P_\Omega[\mathcal{S}]} \; \|P_\Omega[\mathcal{S}]\|_1 + \frac{\beta_1+\beta_5}{2\lambda} \|P_\Omega[\mathcal{S}-\mathcal{T}_s]\|_F^2, \qquad (3.14)$$
$$P_{\Omega^\perp}[\mathcal{S}^{t+1}] = \operatorname*{argmin}_{P_{\Omega^\perp}[\mathcal{S}]} \; \|P_{\Omega^\perp}[\mathcal{S}]\|_1 + \frac{\beta_5}{2\lambda} \|P_{\Omega^\perp}[\mathcal{S}-\mathcal{T}_s]\|_F^2, \qquad (3.15)$$

where

$$P_\Omega[\mathcal{T}_s] = P_\Omega\!\left[\frac{\beta_1(\mathcal{Y}-\mathcal{L}^{t+1}+\Gamma_1^t) + \beta_5(\mathcal{W}^t+\Gamma_5^t)}{\beta_1+\beta_5}\right], \qquad P_{\Omega^\perp}[\mathcal{T}_s] = P_{\Omega^\perp}[\mathcal{W}^t+\Gamma_5^t].$$

The above is solved by setting $P_\Omega[\mathcal{S}^{t+1}] = \mathrm{soft\_thresh}\!\left(P_\Omega[\mathcal{T}_s], \tfrac{\lambda}{\beta_1+\beta_5}\right)$ and $P_{\Omega^\perp}[\mathcal{S}^{t+1}] = \mathrm{soft\_thresh}\!\left(P_{\Omega^\perp}[\mathcal{T}_s], \tfrac{\lambda}{\beta_5}\right)$, where $\mathrm{soft\_thresh}(\mathbf{a},\phi) = \mathrm{sign}(\mathbf{a}) \odot \max(|\mathbf{a}|-\phi, 0)$ and $\odot$ is the elementwise (Hadamard) product.

5. $\mathcal{W}$ update: The auxiliary variable $\mathcal{W}$ can be updated using:

$$\mathcal{W}^{t+1} = \operatorname*{argmin}_{\mathcal{W}} \; \frac{\beta_4}{2} \|\mathcal{W}\times_1\Delta - \mathcal{Z}^t - \Gamma_4^t\|_F^2 + \frac{\beta_5}{2} \|\mathcal{S}^{t+1} - \mathcal{W} - \Gamma_5^t\|_F^2, \qquad (3.16)$$

which is solved analytically by taking the derivative of the expression above and setting it to zero, resulting in:

$$\mathcal{W}^{t+1}_{(1)} = W_{\mathrm{inv}}\!\left[\beta_5(\mathcal{S}^{t+1}-\Gamma_5^t)_{(1)} + \beta_4\Delta^\top(\Gamma_4^t+\mathcal{Z}^t)_{(1)}\right], \qquad (3.17)$$

where $W_{\mathrm{inv}} = (\beta_5\mathbf{I} + \beta_4\Delta^\top\Delta)^{-1}$ always exists and can be computed outside the loop for a faster update.

6. $\mathcal{Z}$ update: The auxiliary variable $\mathcal{Z}$ can be updated using:

$$\mathcal{Z}^{t+1} = \operatorname*{argmin}_{\mathcal{Z}} \; \gamma\|\mathcal{Z}\|_1 + \frac{\beta_4}{2}\|\mathcal{W}^{t+1}\times_1\Delta - \mathcal{Z} - \Gamma_4^t\|_F^2, \qquad (3.18)$$

which is solved by $\mathrm{soft\_thresh}(\mathcal{W}^{t+1}\times_1\Delta - \Gamma_4^t, \gamma/\beta_4)$.

7. Dual updates: Finally, the dual variables $\Gamma_1, \Gamma_2^n, \Gamma_3^n, \Gamma_4, \Gamma_5$ are updated using:

$$\Gamma_1^{t+1} = \Gamma_1^t - P_\Omega[\mathcal{L}^{t+1}+\mathcal{S}^{t+1}-\mathcal{Y}], \qquad (3.19)$$
$$\Gamma_2^{n,t+1} = \Gamma_2^{n,t} - (\mathfrak{L}^{n,t+1}-\mathcal{L}^{t+1}), \qquad (3.20)$$
$$\Gamma_3^{n,t+1} = \Gamma_3^{n,t} - (\mathcal{L}^{t+1}-\mathfrak{G}^{n,t+1}), \qquad (3.21)$$
$$\Gamma_4^{t+1} = \Gamma_4^t - (\mathcal{W}^{t+1}\times_1\Delta - \mathcal{Z}^{t+1}), \qquad (3.22)$$
$$\Gamma_5^{t+1} = \Gamma_5^t - (\mathcal{S}^{t+1}-\mathcal{W}^{t+1}). \qquad (3.23)$$

A minimal sketch of the soft-thresholding and singular value thresholding operators used in steps 2, 4 and 6 is given below.
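The sketch below shows the elementwise soft-thresholding operator and the singular value thresholding used in the $\mathfrak{L}^n$ update, together with generic mode-$n$ unfolding and folding helpers. These are standard operations written here for intuition only; the names are illustrative.

```python
import numpy as np

def soft_thresh(a, phi):
    """Elementwise soft thresholding: sign(a) * max(|a| - phi, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - phi, 0.0)

def svt(M, tau):
    """Singular value thresholding of a matrix M with threshold tau
    (the proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(soft_thresh(s, tau)) @ Vt

def unfold(T, mode):
    """Mode-n unfolding of a tensor (mode is 0-indexed here)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of `unfold` for a tensor of the given shape."""
    full_shape = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full_shape), 0, mode)
```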
The pseudocode for the proposed algorithm, GLOSS, is given in Algorithm 3.1. The optimization for LOSS can be carried out similarly, without the updates on graph regularization and the related variables $\{\mathfrak{G}\}$, $\{\Gamma_3\}$.

Algorithm 3.1: GLOSS
Input: $\mathcal{Y} \in \mathbb{R}^{I_1\times I_2\times I_3\times I_4}$, $\Omega$, $\Phi_n$, parameters $\lambda, \gamma, \theta, \{\psi\}, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5$, max_iter.
Output: $\mathcal{L}$: low-rank tensor; $\mathcal{S}$: sparse tensor.
Initialize $\mathcal{S}^0 = \mathcal{W}^0 = \mathcal{Z}^0 = 0$, $\mathfrak{L}^{n,0} = \mathfrak{G}^{n,0} = 0$, $\Gamma_1^0 = \Gamma_2^{n,0} = \Gamma_3^{n,0} = \Gamma_4^0 = \Gamma_5^0 = 0$, $\forall n \in \{1,\dots,4\}$.
$W_{\mathrm{inv}} \leftarrow (\beta_5 \mathbf{I} + \beta_4 \Delta^\top \Delta)^{-1}$
$G_{\mathrm{inv}}^n \leftarrow (2\theta\Phi_n + \beta_3 \mathbf{I})^{-1}$, $n = 1,\dots,4$
for $t = 0$ to max_iter do
    $\mathcal{T}_1 \leftarrow \mathcal{Y} - \mathcal{S}^t + \Gamma_1^t$; $\mathcal{T}_2 \leftarrow \sum_{n=1}^4 (\mathfrak{L}^{n,t} - \Gamma_2^{n,t})$; $\mathcal{T}_3 \leftarrow \sum_{n=1}^4 (\mathfrak{G}^{n,t} + \Gamma_3^{n,t})$
    $P_\Omega[\mathcal{L}^{t+1}] \leftarrow P_\Omega[\beta_1\mathcal{T}_1 + \beta_2\mathcal{T}_2 + \beta_3\mathcal{T}_3]/(\beta_1 + 4(\beta_2+\beta_3))$
    $P_{\Omega^\perp}[\mathcal{L}^{t+1}] \leftarrow P_{\Omega^\perp}[\beta_2\mathcal{T}_2 + \beta_3\mathcal{T}_3]/(4(\beta_2+\beta_3))$
    for $n = 1$ to 4 do
        $[U, \Sigma, V] \leftarrow \mathrm{SVD}\!\left((\mathcal{L}^{t+1} + \Gamma_2^{n,t})_{(n)}\right)$
        $\hat{\Sigma}_{i,i} \leftarrow \max(\Sigma_{i,i} - \psi_n/\beta_2, \, 0)$, $\forall i \in \{1,\dots,I_n\}$
        $\mathfrak{L}^{n,t+1}_{(n)} \leftarrow U\hat{\Sigma}V^\top$
        $\mathfrak{G}^{n,t+1}_{(n)} \leftarrow \beta_3 G_{\mathrm{inv}}^n (\mathcal{L}^{t+1} - \Gamma_3^{n,t})_{(n)}$
    end for
    $P_\Omega[\mathcal{T}_s] \leftarrow P_\Omega\!\left[(\beta_1(\mathcal{Y}-\mathcal{L}^{t+1}+\Gamma_1^t) + \beta_5(\mathcal{W}^t+\Gamma_5^t))/(\beta_1+\beta_5)\right]$
    $P_{\Omega^\perp}[\mathcal{T}_s] \leftarrow P_{\Omega^\perp}[\mathcal{W}^t+\Gamma_5^t]$
    $P_\Omega[\mathcal{S}^{t+1}] \leftarrow \mathrm{soft\_thresh}(P_\Omega[\mathcal{T}_s], \lambda/(\beta_1+\beta_5))$
    $P_{\Omega^\perp}[\mathcal{S}^{t+1}] \leftarrow \mathrm{soft\_thresh}(P_{\Omega^\perp}[\mathcal{T}_s], \lambda/\beta_5)$
    $\mathcal{W}^{t+1}_{(1)} \leftarrow W_{\mathrm{inv}}\!\left[\beta_5(\mathcal{S}^{t+1}-\Gamma_5^t)_{(1)} + \beta_4\Delta^\top(\Gamma_4^t+\mathcal{Z}^t)_{(1)}\right]$
    $\mathcal{Z}^{t+1} \leftarrow \mathrm{soft\_thresh}(\mathcal{W}^{t+1}\times_1\Delta - \Gamma_4^t, \gamma/\beta_4)$
    $\Gamma_1^{t+1} \leftarrow \Gamma_1^t - P_\Omega[\mathcal{L}^{t+1}+\mathcal{S}^{t+1}-\mathcal{Y}]$
    $\Gamma_4^{t+1} \leftarrow \Gamma_4^t - (\mathcal{W}^{t+1}\times_1\Delta - \mathcal{Z}^{t+1})$
    $\Gamma_5^{t+1} \leftarrow \Gamma_5^t - (\mathcal{S}^{t+1}-\mathcal{W}^{t+1})$
    for $n = 1$ to 4 do
        $\Gamma_2^{n,t+1} \leftarrow \Gamma_2^{n,t} - (\mathfrak{L}^{n,t+1}-\mathcal{L}^{t+1})$
        $\Gamma_3^{n,t+1} \leftarrow \Gamma_3^{n,t} - (\mathcal{L}^{t+1}-\mathfrak{G}^{n,t+1})$
    end for
end for

3.3.3 Computational Complexity

Let $\mathcal{Y}$ be an $N$-mode tensor with dimensions $I_1 = I_2 = \dots = I_N = I$. The computational complexity of each iteration of ADMM is computed as follows:

1. The update of $\mathcal{L}$ only involves element-wise operations; thus its contribution to the computational complexity is not significant.
2. The computational complexity of updating each $\mathfrak{L}^n$ is $O(I^{2N-1})$. Thus, the total computational complexity is $O(NI^{2N-1})$. However, it is possible to reduce the effect of this cost by parallelizing the updates across modes, in which case the cost per worker is $O(I^{2N-1})$, i.e., nearly quadratic in the number of tensor elements.
3. The update of each $\mathfrak{G}^n$ requires the computation of a matrix inverse with complexity $O(I^3)$ and a matrix multiplication with complexity $O(I^{N+1})$. As mentioned before, the inverse always exists and can be computed outside the loop. Similar to the updates of $\mathfrak{L}^n$, the total complexity is $O(NI^{N+1})$, which can be reduced by parallelizing across modes.
4. The update of $\mathcal{S}$ requires a soft thresholding, which has linear complexity, i.e., $O(I^N)$.
5. The computational complexity of updating $\mathcal{W}$ is governed by a matrix multiplication, resulting in $O(I^{N+1})$ complexity.
6. The update of $\mathcal{Z}$ consists of a matrix product followed by soft thresholding, which results in a total complexity of $O((I+1)I^N)$.
7. The updates of the dual variables do not require any additional multiplication operations; thus their computational complexity is negligible.

It can be concluded that the complexity of each loop is governed by the updates of $\mathfrak{L}^n$. Therefore, the total computational complexity of the algorithm is $O(\text{max\_iter}\, N I^{2N-1})$. Since the complexity is driven by nuclear norm minimization, in the following section we propose another method which utilizes graph total variation regularization for the low-rank approximation instead of the nuclear norm.

3.4 Low-rank On Graphs Plus Temporally Smooth Sparse Decomposition

As mentioned in Section 1.3.1, we approximate the low-rank tensor, $\mathcal{L}$, through $N$ graph total variation terms corresponding to each mode, similar to FRPCAG in (1.14).
To this end, the first $J_n$ eigenvectors of $\Phi_n$, denoted $\hat{P}_n$ and corresponding to the $J_n$ lowest eigenvalues, are used to quantify the total variation of the low-rank tensor across mode $n$ with respect to its corresponding similarity graph. As these first $J_n$ eigenvectors capture the low-frequency information of the signal, they can capture the normal activity in the data. Thus, the optimization problem can be written as:

$$\min_{\mathcal{L},\mathcal{S}} \; \theta \sum_{n=1}^{N} \mathrm{tr}\!\left(\mathbf{L}_{(n)}^\top \hat{\Phi}_n \mathbf{L}_{(n)}\right) + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{S}\times_1\Delta\|_1, \quad \text{s.t.} \; P_\Omega[\mathcal{Y}] = P_\Omega[\mathcal{L}+\mathcal{S}], \qquad (3.24)$$

where $\hat{\Phi}_n = \hat{P}_n \hat{\Lambda}_n \hat{P}_n^\top$ and $\hat{\Lambda}_n \in \mathbb{R}^{J_n\times J_n}$ is the leading principal submatrix of $\Lambda_n$. If we define the projections of each mode-$n$ unfolding of $\mathcal{L}$ onto the graph eigenvectors (the low-frequency graph Fourier basis) as $\mathbf{G}^n_{(n)} = \hat{P}_n^\top \mathbf{L}_{(n)}$, then (3.24) can be rewritten as:

$$\min_{\mathcal{L},\{\mathcal{G}\},\mathcal{S}} \; \theta \sum_{n=1}^{N} \mathrm{tr}\!\left(\mathbf{G}^{n\,\top}_{(n)} \hat{\Lambda}_n \mathbf{G}^n_{(n)}\right) + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{S}\times_1\Delta\|_1, \quad \text{s.t.} \; P_\Omega[\mathcal{Y}] = P_\Omega[\mathcal{L}+\mathcal{S}], \; \mathbf{G}^n_{(n)} = \hat{P}_n^\top \mathbf{L}_{(n)}, \qquad (3.25)$$

where $\{\mathcal{G}\} := \{\mathcal{G}^1, \dots, \mathcal{G}^N\}$. The solution to (3.25) will be called LOw-rank on Graphs plus temporally Smooth Sparse decomposition (LOGSS).

3.4.1 Optimization

The optimization problem is solved using ADMM. We introduce auxiliary variables $\mathcal{W}$ and $\mathcal{Z}$, similar to LOSS, to separate the sparsity and temporal smoothness regularizations. The problem is then rewritten as:

$$\min_{\mathcal{L},\{\mathcal{G}\},\mathcal{S},\mathcal{W},\mathcal{Z}} \; \theta \sum_{n=1}^{N} \mathrm{tr}\!\left(\mathbf{G}^{n\,\top}_{(n)} \hat{\Lambda}_n \mathbf{G}^n_{(n)}\right) + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{Z}\|_1, \quad \text{s.t.} \; P_\Omega[\mathcal{Y}] = P_\Omega[\mathcal{L}+\mathcal{S}], \; \mathbf{G}^n_{(n)} = \hat{P}_n^\top \mathbf{L}_{(n)}, \; \mathcal{S} = \mathcal{W}, \; \mathcal{Z} = \mathcal{W}\times_1\Delta. \qquad (3.26)$$

The corresponding augmented Lagrangian is given by:

$$\sum_{n=1}^{N} \theta\,\mathrm{tr}\!\left(\mathbf{G}^{n\,\top}_{(n)} \hat{\Lambda}_n \mathbf{G}^n_{(n)}\right) + \lambda \|\mathcal{S}\|_1 + \gamma \|\mathcal{Z}\|_1 + \frac{\beta_1}{2}\|P_\Omega[\mathcal{L}+\mathcal{S}-\mathcal{Y}-\Gamma_1]\|_F^2 + \frac{\beta_2}{2}\|\mathcal{W}\times_1\Delta - \mathcal{Z} - \Gamma_2\|_F^2 + \frac{\beta_3}{2}\|\mathcal{S}-\mathcal{W}-\Gamma_3\|_F^2 + \frac{\beta_4}{2}\sum_{n=1}^{N}\|\mathcal{L} - \mathcal{G}^n\times_n\hat{P}_n - \Gamma_4^n\|_F^2, \qquad (3.27)$$

where $\Gamma_1, \Gamma_2, \Gamma_3, \Gamma_4^n \in \mathbb{R}^{I_1\times I_2\times I_3\times I_4}$ are the Lagrange multipliers. Using (3.27), each variable can be updated alternately.

1. $\mathcal{L}$ update: The update of the low-rank variable $\mathcal{L}$ is given by:

$$P_\Omega[\mathcal{L}^{t+1}] = P_\Omega[\beta_1\mathcal{T}_1 + \beta_4\mathcal{T}_2]/(\beta_1+4\beta_4), \qquad P_{\Omega^\perp}[\mathcal{L}^{t+1}] = P_{\Omega^\perp}[\mathcal{T}_2]/4, \qquad (3.28)$$

where $\mathcal{T}_1 = \mathcal{Y}-\mathcal{S}^t+\Gamma_1^t$ and $\mathcal{T}_2 = \sum_{n=1}^{N}(\mathcal{G}^{n,t}\times_n\hat{P}_n + \Gamma_4^{n,t})$.

2. $\mathcal{G}^n$ update: The variables $\mathcal{G}^n$ can be updated using:

$$\mathcal{G}^{n,t+1}_{(n)} = \left(2\tfrac{\theta}{\beta_4}\hat{\Lambda}_n + \mathbf{I}\right)^{-1}\hat{P}_n^\top\left(\mathcal{L}^{t+1}-\Gamma_4^{n,t}\right)_{(n)}, \qquad (3.29)$$

where $\mathbf{I}\in\mathbb{R}^{J_n\times J_n}$ is an identity matrix.

3. $\mathcal{S}$ update: The variable $\mathcal{S}$ can be updated using:

$$P_\Omega[\mathcal{S}^{t+1}] = \mathcal{T}_{\frac{\lambda}{\beta_1+\beta_3}}\!\left(P_\Omega\!\left[\frac{\beta_1\mathcal{T}_3+\beta_3\mathcal{T}_4}{\beta_1+\beta_3}\right]\right), \qquad P_{\Omega^\perp}[\mathcal{S}^{t+1}] = \mathcal{T}_{\frac{\lambda}{\beta_3}}\!\left(P_{\Omega^\perp}[\mathcal{T}_4]\right), \qquad (3.30)$$

where $\mathcal{T}_3 = \mathcal{Y}-\mathcal{L}^{t+1}+\Gamma_1^t$, $\mathcal{T}_4 = \mathcal{W}^t+\Gamma_3^t$, and $\mathcal{T}_\phi(\mathbf{a}) = \mathrm{sign}(\mathbf{a})\odot\max(|\mathbf{a}|-\phi,0)$, with $\odot$ the Hadamard product.

4. $\mathcal{W}$ update: The auxiliary variable $\mathcal{W}$ can be updated using:

$$\mathcal{W}^{t+1}_{(1)} = W_{\mathrm{inv}}\!\left[\beta_3(\mathcal{S}^{t+1}-\Gamma_3^t)_{(1)} + \beta_2\Delta^\top(\Gamma_2^t+\mathcal{Z}^t)_{(1)}\right], \qquad (3.31)$$

where $W_{\mathrm{inv}} = (\beta_3\mathbf{I}+\beta_2\Delta^\top\Delta)^{-1}$ always exists and can be computed outside the loop for a faster update.

5. $\mathcal{Z}$ update: The auxiliary variable $\mathcal{Z}$ can be updated using:

$$\mathcal{Z}^{t+1} = \operatorname*{argmin}_{\mathcal{Z}} \; \gamma\|\mathcal{Z}\|_1 + \frac{\beta_2}{2}\|\mathcal{W}^{t+1}\times_1\Delta - \mathcal{Z} - \Gamma_2^t\|_F^2, \qquad (3.32)$$

which is solved by $\mathcal{T}_{\gamma/\beta_2}(\mathcal{W}^{t+1}\times_1\Delta - \Gamma_2^t)$.

6. Dual updates: Finally, the dual variables $\Gamma_1, \Gamma_2, \Gamma_3, \Gamma_4^n$ are updated using:

$$\Gamma_1^{t+1} = \Gamma_1^t - P_\Omega[\mathcal{L}^{t+1}+\mathcal{S}^{t+1}-\mathcal{Y}], \qquad (3.33)$$
$$\Gamma_2^{t+1} = \Gamma_2^t - (\mathcal{W}^{t+1}\times_1\Delta - \mathcal{Z}^{t+1}), \qquad (3.34)$$
$$\Gamma_3^{t+1} = \Gamma_3^t - (\mathcal{S}^{t+1}-\mathcal{W}^{t+1}), \qquad (3.35)$$
$$\Gamma_4^{n,t+1} = \Gamma_4^{n,t} - (\mathcal{L}^{t+1} - \mathcal{G}^{n,t+1}\times_n\hat{P}_n). \qquad (3.36)$$

The pseudocode for the optimization is given in Algorithm 3.2.

Algorithm 3.2: LOGSS
Input: $\mathcal{Y}$, $\Omega$, $\Phi_n$, parameters $\theta, \lambda, \gamma, \beta_1, \beta_2, \beta_3, \beta_4$, max_iter.
Output: $\mathcal{L}$: low-rank tensor; $\mathcal{S}$: sparse tensor.
Initialize $\mathcal{S}^0 = \mathcal{W}^0 = \mathcal{Z}^0 = 0$, $\mathcal{G}^{n,0} = 0$, $\Gamma_1^0 = \Gamma_2^0 = \Gamma_3^0 = 0$, $\Gamma_4^{n,0} = 0$, $\forall n \in \{1,\dots,4\}$; $W_{\mathrm{inv}} = (\beta_3\mathbf{I}+\beta_2\Delta^\top\Delta)^{-1}$.
for $t = 1$ to max_iter do
    Update $\mathcal{L}$ using (3.28).
    Update the $\mathcal{G}^n$ using (3.29).
    Update $\mathcal{S}$ using (3.30).
    Update $\mathcal{W}$ using (3.31).
    Update $\mathcal{Z}$ using (3.32).
    Update the Lagrange multipliers using (3.33), (3.34), (3.35) and (3.36).
end for

3.4.2 Computational Complexity of LOGSS

Assume $I_1 = I_2 = \dots = I_N = I$. The complexity of the proposed algorithm is dominated by matrix multiplications, which occur in the updates of $\mathcal{L}$, $\mathcal{W}$, $\mathcal{Z}$ and $\Gamma_4^n$. The updates of $\mathcal{G}^n$ require only inexpensive multiplications since the matrices $\hat{\Lambda}_n$ are diagonal. The computational complexities of the matrix multiplications are $O(NI^N)$ for the update of $\mathcal{L}$ and $O(I^N)$ for the updates of $\mathcal{W}$, $\mathcal{Z}$ and $\Gamma_4^n$. Since the updates of $\mathcal{L}$ and $\Gamma_4^n$ can be parallelized, the complexity of the algorithm is $O(\text{max\_iter}\, I^N)$, hence linear in the number of elements. Thus, by using graph total variation regularization instead of the nuclear norm, we reduce the complexity of LOSS from quadratic to linear.

3.5 Convergence

In this subsection, we analyze the convergence of the proposed algorithms. First, we show that the proposed optimization problem (3.4) can be written as a two-block ADMM. In previous work, linear and global convergence of ADMM was proven for two-block systems [52] with no dependence on the hyperparameters. We use this result to derive the convergence of GLOSS. Convergence of LOSS and LOGSS follows as they have similar objective functions, and similar operations can be applied to rewrite their objectives as two-block ADMM.

In the following discussion, we will assume that there is no missing data, i.e., $P_\Omega[\mathcal{Y}] = \mathcal{Y}$, to simplify the notation. Let $h(\mathcal{S}) = \lambda\|\mathcal{S}\|_1$, $j(\mathcal{Z}) = \gamma\|\mathcal{Z}\|_1$, $f(\{\mathfrak{L}\}) = \sum_{n=1}^{4}\psi_n\|\mathfrak{L}^n_{(n)}\|_*$ and $g(\{\mathfrak{G}\}) = \sum_{n=1}^{4}\mathrm{tr}\!\left(\mathfrak{G}^{n\,\top}_{(n)}\Phi_n\mathfrak{G}^n_{(n)}\right)$. Then (3.5) can be rewritten as:

$$\min \; h(\mathcal{S}) + j(\mathcal{Z}) + f(\{\mathfrak{L}\}) + \theta g(\{\mathfrak{G}\}), \quad \text{s.t.} \; A_1\mathbf{L}_{(1)} + A_2\mathbf{S}_{(1)} + A_3\,\mathrm{cat}_1(\{\mathfrak{L}\}) + A_4\,\mathrm{cat}_1(\{\mathfrak{G}\}) + A_5\mathbf{W}_{(1)} + A_6\mathbf{Z}_{(1)} = \mathrm{cat}_1(\{\mathcal{Y},0,\dots,0\}), \qquad (3.37)$$

where $A_1,\dots,A_6$ are block matrices whose blocks are identity matrices, zero matrices and the temporal difference operator $\Delta$, arranged so that each block row of (3.37) encodes one of the constraints of (3.5).

From this reformulation, it is easy to see that $A_1^\top A_5 = 0$ and that $A_2$, $A_3$, $A_4$ and $A_6$ are all orthogonal to each other, i.e., $A_i^\top A_j = 0$ for $i, j \in \{2,3,4,6\}$ and $i \neq j$. In this manner, the optimization problem reduces to a special case of two-block ADMM as follows. Define the variables $V_1 = [\mathbf{L}_{(1)}, \mathbf{W}_{(1)}]$ and $V_2 = [\mathbf{S}_{(1)}, \mathrm{cat}_1(\{\mathfrak{L}\}), \mathrm{cat}_1(\{\mathfrak{G}\}), \mathbf{Z}_{(1)}]$, and the matrices $B_1 = [A_1, A_5]$ and $B_2 = [A_2, A_3, A_4, A_6]$. When we create the variable $V_1$, the order of updating $\mathcal{S}$ and $\mathcal{W}$ needs to change in the new formulation, as $A_2$ and $A_5$ are not orthogonal. Updates of $\{\mathfrak{L}\}$ and $\{\mathfrak{G}\}$ have no effect on the update of $\mathcal{W}$, but this is not true for $\mathcal{S}$, so updating $\mathcal{W}$ before $\mathcal{S}$ might affect the solution.
However, it was proven in [194] that the change in order gives the equivalent solution if either one of the functions of the variables S and W is affine. In our formulation, this is true as the function corresponding to W is a constant. Thus, the problem reduces to the two-block form: min 𝑓1 (𝑉1 ) + 𝑓2 (𝑉2 ), s.t. 𝐵1𝑉1 + 𝐵2𝑉2 = 𝐶, (3.38) 𝑉1 ,𝑉2 where 𝑓1 (𝑉1 ) = 0 and 𝑓2 (𝑉2 ) = ℎ(S)+ 𝑗 (Z)+ 𝑓 ({𝔏})+𝜃𝑔({𝔊}) are both convex and 𝐶 = cat1 ({Y, 0, . . . , 0}). It can easily be shown using Kronecker products and vectorizations that the objective functions (3.2), (3.25) can also be converted into a two-block form. Thus, LOSS and LOGSS also converge using the above results. 3.6 Anomaly Scoring The methods proposed in this chapter focus on extracting spatiotemporal features for anomaly detection. After extracting the features, i.e. the sparse part, a baseline anomaly detector can be applied to obtain an anomaly score. In this chapter, we evaluated three anomaly detection methods: Elliptic Envelope (EE) [150], Local Outlier Factor (LOF) [23] and One Class SVM (OCSVM) [153]. These three methods are used to assign an anomaly score to each element of the sparse tensor. Each method was applied to all third mode fibers which correspond to different weeks’ traffic activity. This is equivalent to fitting a univariate distribution to each of the third mode fibers of the tensor. The anomaly scores were used to create an anomaly score tensor. Finally, the elements with the highest anomaly scores were selected as anomalous while the rest were determined to be normal. 3.7 Experiments In this chapter, we evaluated the proposed method on both real and synthetic datasets. We compared our method to regular HoRPCA and weighted HoRPCA (WHoRPCA) where the nuclear norm of each unfolding is weighted. In addition to Tucker based algorithms we compare with two CP based anomaly detection methods: low rank plus sparse CP (LRSCP) [88] and Bayesian augmented tensor factorization (BATF) [40]. In the case of LRSCP, a low rank plus sparse CP model is used for anomaly detection. While the original algorithm is implemented in an online manner, in this chapter, it is modified to be applicable to the whole ST data for comparability against other methods. BATF is a CP based tensor completion/imputation model with smoothness constraints designed for urban traffic data. BATF utilizes bias vectors for all mode-𝑛 slices 60 which enforces smoothness across each mode. To evaluate the effect of graph regularization term in (3.4), we also compared with LOSS corresponding to the objective function in (3.2). As our method is focused on feature extraction for anomaly detection, we also compared our method to baseline anomaly detection methods such as EE, LOF and OCSVM applied to the original tensor. After the feature extraction stage, unless noted otherwise, such as "GLOSS-LOF", EE was used as the default anomaly scoring method for all tensor feature extraction methods, e.g., HoRPCA, WHoRPCA, BATF, etc. LRSCP scores the anomalies by ordering the magnitude of the elements of the sparse tensor [88]. The number of neighbors for LOF is selected as 10 as this is the suggested lower-bound in [207]. The outlier fraction of OCSVM is set to 0.1 as only the anomaly scores, not labels, generated by OCSVM are used in the experiments. The methods used for comparison and their properties are summarized in Table 3.1. Table 3.1: Properties of anomaly detection methods used in the experiments. 
The acronyms refer to the different attributes of the cost function: (LR) low-rank, (SP) sparse, (WLR) weighted low-rank, (SR) smoothness regularization. LR SP WLR SR GLOSS + + + + LOGSS - + - + Tucker Based LOSS + + + + WHoRPCA + + + - HoRPCA + + - - BATF + + - + CP Based LRSCP + + - - EE N/A N/A N/A N/A LOF N/A N/A N/A N/A OCSVM N/A N/A N/A N/A For each data set, a varying 𝐾 percent of the elements with highest anomaly scores are determined to be anomalous. With varying 𝐾, ROC curves were generated for synthetic data and the mean area under the curve (AUC) was computed for 10 random experiments. For real data, number of detected events were reported for varying 𝐾. The number of neighbors for LOF is selected as 10 as this is the suggested lower-bound in [207]. Finally, the outlier fraction of OCSVM is set to 0.1 as only the anomaly scores, not labels, generated by OCSVM is used in the experiments. 61 3.7.1 Data Description To evaluate the proposed framework, we use two publicly available datasets as well as synthetic data. Real Data: The first dataset is NYC yellow taxi trip records1 for 2018. This dataset consists of trip information such as the departure zone and time, arrival zone and time, number of passengers, tips for each yellow taxi trip in NYC. In the following experiments, we only use the arrival zone and time to collect the number of arrivals for each zone aggregated over one hour time intervals. We selected 81 central zones to avoid zones with very low traffic [205]. Thus, we created a tensor Y of size 24 × 7 × 52 × 81 where the first mode corresponds to hours within a day, the second mode corresponds to days of a week, the third mode corresponds to weeks of a year and the last mode corresponds to the zones. The data is suitable for low-rank on graphs model as graph stationarity measure 𝑠𝑟 (Γ), given in Section 1.3.1, for each mode is 0.83, 0.98, 0.99, 0.56, respectively. This implies that the data is mostly low-rank on the temporal modes as there is strong correlation among the different days, hours and weeks, while it is less low-rank across space. The second dataset is Citi Bike NYC bike trips2 for 2018. This dataset contains the departure and arrival station numbers, arrival and departure times and user id. In our experiments, we aggregated bike arrival data for taxi zones imported from the NYC yellow taxi trip records dataset, instead of using the original stations, to reduce the dimensionality and to avoid data sparsity. The resulting data tensor is of size 24 × 7 × 52 × 81. 1 https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page 2 https://www.citibikenyc.com/system-data 62 Synthetic Data Generation: Evaluating urban events in a real-world setting is an open challenge, since it is difficult to obtain urban traffic data sets with ground truth information, i.e. anomalies are not known a priori and what constitutes an anomaly depends on the nature of the data and the specific problem. To be able to evaluate our method quantitatively, we generate synthetic data and inject anomalies. Following [205], we generated a synthetic data set by taking the average of the NYC taxi trip tensor, Y, along the third mode, i.e. across weeks of a year. We then repeat the resulting three-mode tensor such that for each zone, average data for a week is repeated 52 times. We multiply each element of the tensor by a Gaussian random variable with mean 1 and variance 0.5 to create variation across weeks. We generate anomalies on randomly selected 𝑚% of the first mode fibers. 
For each fiber, we set a random time interval of length $l$, which corresponds to $l$ hours in a day, as anomalous. We multiply the average value of each randomly selected anomalous interval by a parameter $c$ and then modify the entries by adding or subtracting this value within the interval. When $c$ is low, the anomalies are harder to detect and may be perceived as noise.

3.7.2 Parameter Selection

In this section, we discuss how the different parameters in (3.4) are selected. Following [70], we set $\lambda = 1/\sqrt{\max(I_1,\dots,I_N)}$ for HoRPCA and $\beta_1 = \frac{1}{5\,\mathrm{std}(\mathrm{vec}(\mathcal{Y}))}$, where $\mathrm{vec}(\mathcal{Y}) \in \mathbb{R}^{I_1 I_2 \dots I_N}$ is the vectorization of $\mathcal{Y}$ and $\mathrm{std}(\cdot)$ is the standard deviation. The other $\beta$ parameters for all methods are set equal to $\beta_1$. The selection of the $\beta$ parameters does not affect the algorithm performance but changes the convergence rate, as mentioned in Section 3.5. In GLOSS, the neighborhood size for each of the $k$-NN graphs is selected as $k = \log\!\left(\sum_{n=1}^{N} I_n\right)$ following [177]. The $\sigma$ value in the $k$-NN graphs is selected to be proportional to the Frobenius norm of each mode to ensure similar density levels for all graphs. The ranks for LRSCP and BATF are selected from $\{1, 2, \dots, 11\}$ as the rank with the best result; increasing the rank to higher values does not improve the results while it increases complexity.

An important observation about the selection of the parameters is that, depending on the data, the optimal values of the hyperparameters might differ. This is due to properties of the data such as size, variance and sparsity level. Moreover, the different hyperparameters are dependent on each other. Hence, a search over the whole parameter space may be costly. Thus, we perform a sensitivity analysis for the different hyperparameters.

Figure 3.1: Mean AUC values for various choices of: (a) $\lambda$ and $\gamma$, (b) $\theta$ and $\psi_1$, (c) $\psi_2$ and $\psi_4$. Mean AUC values across 10 random experiments are reported for each hyperparameter pair. For each set of experiments, the remaining hyperparameters are fixed.

In Figure 3.1, we present the average AUC for various ranges of the hyperparameters for GLOSS applied to the synthetic data generated with $c = 2.5$. It can be seen from Figure 3.1a that while low values of $\lambda$ always provide the best results, $\gamma$ values are optimized around $10^{-5}$. Increasing the sparsity penalty $\lambda$ above $10^{-3}$ results in a sparse tensor that is mostly zero. At this $\lambda$ value, the AUC becomes equal to 0.5, which is equivalent to randomly guessing the anomalous points. On the other hand, when $\gamma$ is too large, it smooths out all mode-1 fibers and generates the same anomaly score for each fiber. This is akin to identifying anomalous days, rather than time intervals. From these observations, it can be seen that there is not a strong dependence between $\gamma$ and $\lambda$. For the weight parameters $\psi_n$, when one of them is set to 1 and the others are varied across a wide range of values, as shown in Figures 3.1c and 3.1b, the accuracy does not change significantly. Similarly, changing the value of $\theta$ does not seem to affect the optimal value of $\psi_1$. We repeated this analysis for different $c$ values and observed similar results, indicating that the proposed method is not sensitive to the selection of hyperparameters as long as they are selected following the guidelines given below. In particular, the choice of $\lambda$ affects the accuracy more than any of the other hyperparameters.
This observation reduces the computational complexity of finding a good set of parameters, as the search can be implemented in parallel. Based on these empirical observations, we select $\lambda = \gamma = 1/\|P_\Omega(\mathcal{Y})\|_0$, where $\|P_\Omega(\mathcal{Y})\|_0$ is the number of nonzero elements of $\mathcal{Y}$, for GLOSS, and $\lambda = \gamma = 1/\max(I_1,\dots,I_N)$ for LOSS and WHoRPCA, similar to [70]. Since the ranks across each mode are closely related to the variance of the data within that mode, the weights for each mode in the definition of the nuclear norm, $\psi_n$, are selected to be inversely proportional to the square root of the trace of the mode-$n$ covariance, i.e., $\psi_n = \frac{p}{\sqrt{\mathrm{Tr}(\Sigma_{Y_{(n)}})}}$, where $p$ is selected such that $\min_n(\psi_n) = 1$. The parameter $\theta$ is set to be the geometric mean of the $\psi_n$, i.e., $\theta = \left(\prod_{n=1}^{4}\psi_n\right)^{1/4}$.

3.7.3 Experiments on Synthetic Data

Effect of anomaly length and percentage: First, we evaluated the effect of the length $l$ and the percentage $m$, i.e., the denseness, of anomalies in the synthesized data. For these experiments, we set $c = 2.5$ and $P = 0\%$. From Table 3.2 and Fig. 3.2, it can be seen that as $l$ increases, the performance of LOGSS and LOSS improves, while HoRPCA's performance does not show a significant change. This is due to the fact that the temporal total variation regularization becomes better suited to the observed data as the anomalies become more temporally persistent, i.e., when $l$ increases.

Figure 3.2: AUC of ROC w.r.t. $l$ and $m$ with $c = 2.5$, $P = 0\%$.

Although LOSS performs slightly better than LOGSS when $l$ is large, for low $l$ it drastically underperforms, which is not the case for LOGSS. With increasing $m$, all methods perform worse due to the assumption of sparsity for the anomalies.

Robustness against noise: We first evaluated the effect of $c$, i.e., the strength of the anomaly, on the accuracy of the proposed method. Low $c$ values imply that the amplitudes of anomalies are low and they may be indistinguishable from noise. From Fig. 3.3 and Table 3.2, it can be seen that for varying $c$ values, our method (GLOSS-EE) achieves the highest AUC values compared to both the baseline methods and HoRPCA, WHoRPCA, LRSCP, BATF, LOSS and LOGSS. Among the remaining methods, LOSS performs better than WHoRPCA, especially when the anomaly strength is small. CP based methods generally underperform compared to all other methods. The proposed methods have higher anomaly detection accuracy compared to EE and HoRPCA, which illustrates the benefit of tailoring the optimization problem to the anomaly structure. In fact, HoRPCA does not perform better than EE in most cases, which means that extracting anomalies using HoRPCA does not offer a significant improvement over using the original data. It is also important to note that the choice of the anomaly scoring method does not change the performance of GLOSS significantly.

Figure 3.3: ROC curves for various amplitudes of anomalies. Higher amplitude means more separability. $c$ = (a) 1.5, (b) 2, (c) 2.5. ($P = 0\%$, $l = 7$, $m = 2.3\%$)

In terms of time complexity, it can be seen from Table 3.3 that LOGSS is up to 10 times faster than both GLOSS and LOSS. BATF, on the other hand, has a higher run time than the proposed methods (LOSS and GLOSS) with lower accuracy. Although other methods such as EE, LOF, LRSCP and HoRPCA run faster than LOSS and GLOSS, the proposed methods outperform them in anomaly detection accuracy, as mentioned earlier.

Table 3.2: Mean and standard deviation of AUC values for various $c$ and $P$.
On experiments of each variable, the rest of the variables are fixed at 𝑐 = 2.5, 𝑃 = 0%, 𝑙 = 7 and 𝑚 = 2.3%. The proposed methods, outperform the other algorithms in all cases significantly with 𝑝 < 0.001. 𝑐 = 1.5 𝑐=2 𝑐 = 2.5 𝑃 = 20% 𝑃 = 40% 𝑃 = 60% EE 0.70±0.004 0.81±0.004 0.87±0.003 0.81±0.004 0.61±0.008 0.53±0.01 LOF 0.68±0.003 0.78±0.004 0.84±0.004 0.79±0.005 0.78±0.005 0.73±0.007 OCSVM 0.65±0.005 0.73±0.004 0.77±0.005 0.81±0.004 0.81±0.005 0.76±0.006 HoRPCA 0.7±0.004 0.81±0.004 0.87±0.003 0.8±0.004 0.73±0.006 0.79±0.005 WHoRPCA 0.7±0.004 0.81±0.004 0.87±0.003 0.8±0.003 0.73±0.006 0.8±0.005 LRSCP 0.51±0.02 0.53±0.03 0.54±0.05 0.61±0.007 0.7±0.01 0.8±0.006 BATF 0.59±0.005 0.65±0.005 0.69±0.005 0.67±0.005 0.64±0.006 0.61±0.006 LOSS 0.81±0.005 0.9±0.004 0.94±0.0035 0.85±0.003 0.73±0.009 0.74±0.009 LOGSS 0.8 ± 0.005 0.9±0.003 0.94±0.002 0.86±0.004 0.74±0.008 0.76±0.005 GLOSS-EE 0.83±0.005 0.92±0.003 0.95±0.002 0.88±0.004 0.77±0.008 0.76±0.006 GLOSS-SVM 0.81±0.006 0.9±0.004 0.95±0.002 0.83±0.006 0.81±0.007 0.82±0.007 GLOSS-LOF 0.8±0.006 0.91±0.004 0.95±0.002 0.81±0.01 0.80±0.007 0.86±0.007 Table 3.3: Mean and standard deviation of run times (seconds) for various methods. EE LOF OCSVM HoRPCA WHoRPCA LRSCP BATF LOSS LOGSS GLOSS 12.1 ± 0.3 17.0 ± 0.3 1016.2 ± 56.1 7.5 ± 0.4 13.5 ± 0.5 0.66 ± 0.17 65.8 ± 0.16 44.7 ± 2.7 5.0±0.27 46.7 ± 1.5 Robustness against missing data: In addition to injecting synthetic anomalies, we also remove varying number of days at random from the tensor to evaluate the robustness of the proposed method to missing data. After generating the synthetic data, a percentage 𝑃 of the mode-1 fibers is set to zero to simulate missing data, where the number of mode-1 fibers is equal to the total number of days, i.e. 7 × 52 × 81. The accuracy of anomaly detection for varying 67 (a) (b) (c) Figure 3.4: ROC curves for varying percentage of missing data, (a) 𝑃 = 20%, (b) 𝑃 = 40%, (c) 𝑃 = 60%. (𝑐 = 2.5, 𝑙 = 7, 𝑚 = 2.3%) levels of missing data is illustrated in Fig. 3.4 and the corresponding AUC values (mean ± std) are given in Table 3.2. While the performance of all the methods degrades with increasing levels of missing data, GLOSS provides the best anomaly detection performance and is robust against missing data compared to the rest of the methods. It is interesting to see that with increasing percentage of missing data, the performance of OCSVM does not degrade too much, and some of the methods such as HoRPCA, WHoRPCA, LRSCP, GLOSS-SVM and GLOSS-LOF have a better performance. This phenomenon occurs partially due to increasing percentage of anomalous points. As some of the false positives are replaced by missing data, and anomalous points are higher in percentage, newly identified data might have more true positives with increasing percentage of missing data. Incorporating temporal smoothness for the anomalies in the objective function lowers the false detection rate by penalizing instantaneous changes in traffic that do not constitute an actual anomaly. We note that while LOSS performs comparable to GLOSS for varying anomaly strength, the performance of LOSS degrades quickly with increasing missing data. Therefore, even though both WHoRPCA and LOSS are equipped to handle missing data, GLOSS is more robust as it uses side information in the form of similarity graphs. Similar to the previous experiments, LOGSS provide the best results in terms of computational efficiency with the expense of a small amount of AUC compared to GLOSS. 
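For reference, the evaluation protocol described above (mean and standard deviation of the AUC over several random experiments) can be computed along the lines of the following sketch. It assumes the anomaly scores and the injected-anomaly masks are available as equally shaped arrays, which is an assumption made for illustration; scikit-learn's roc_auc_score is used for the ROC computation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_auc(score_tensors, label_tensors):
    """Mean and std of AUC over a set of random experiments.

    `score_tensors` and `label_tensors` are lists of equally shaped arrays:
    anomaly scores (e.g., derived from the sparse part S) and the binary
    ground-truth masks of injected anomalies, respectively.
    """
    aucs = [roc_auc_score(y.ravel(), s.ravel())
            for s, y in zip(score_tensors, label_tensors)]
    return np.mean(aucs), np.std(aucs)
```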
3.7.4 Experiments on Real Data To evaluate the performance of the proposed methods on real data, we compiled a list of 20 urban events, which are listed in Table 3.4, that took place in the important urban activity centers such as city squares, parks, museums, stadiums and concert halls, during 2018. We used the same set of urban events for both taxi and bike data. To detect the activities, top-𝐾 percent, with varying K, of the highest anomaly scores of the 68 Table 3.4: Events of Interest for NYC in 2018. Event Name Location Date and Time New Year’s Times Square 1/1 12-2AM Blackhawks vs. Rangers Madison Square Garden 1/3 4-8PM Armory Show Piers 92/94 1/14 9AM-5PM Woman’s March Central Park West 1/20 8AM-12PM Big Ten Basketball Final Madison Square Garden 3/4 3-10PM Big East Quarter Finals Madison Square Garden 3/8 3-10PM St. Patricks Day Parade 5th Avenue, btw 44th and 79th 3/17 11AM-5PM Nor’easter Storm Citywide 3/20 11AM-5PM NIT Quarterfinal Utah vs. St.Mary’s Madison Square Garden 3/21 5 PM- 10PM U2 Concert Madison Square Garden 7/1 5-10PM July 4th Celebrations Citywide 7/4 5-11PM UN General Assembly United Nation Headquarters 9/25 12-5PM Comic Con Javits Center 11/4 8 AM-3PM NYC Marathon Colombus Circle 11/04 12-5PM Elton John Concert Madison Square Garden 11/9 7-11PM Macy’s Thanksgiving Parade Herald Square 11/22 9PM-12AM Christmas Tree Lighting Bryant Park 12/4 7PM-12AM Golden Knights vs. Rangers Madison Square Garden 12/16 12-3PM Phish Concert Madison Square Garden 12/28 4-8PM New Year’s Eve Times Square 12/31 8PM-12AM extracted sparse tensors are selected as anomalies and compared against the compiled list of urban events. In previous work, similar case studies were presented for experiments on real data [206, 39, 203, 205]. Detection performance for all methods is given in Tables 3.5 and 3.6 for the taxi and bike data, respectively. From Table 3.5, it can be seen that anomaly scoring methods applied to the spatiotemporal features extracted by GLOSS perform the best for the NYC Taxi data. The performance of GLOSS is followed by LOSS as temporal smoothness allows for detection of events at lower 𝐾 by removing anomalies resulting from noise. LOSS performs the best initially but as more points are considered GLOSS finds more cases. Although LOGSS performs better than baseline methods most of the time it does not perform as good as GLOSS or LOSS. In the case of CP based methods, although the performance of LRSCP is not good, BATF shows results competitive with GLOSS especially for Taxi data. Although BATF extracts the anomalies well, it has a high computational complexity. Among the baseline methods, EE performs the best while LOF performs the worst. However, when the features extracted from GLOSS are input to LOF and EE, their performances 69 become very similar. This shows that GLOSS is effective at separating anomalous entries from noise and normal traffic activity and thus, improves the performance of both LOF and EE. It is important to note that most of the anomalies cannot be detected at low 𝐾 values by most methods because events such as New Year’s Eve or July 4th celebrations change the activity pattern in the whole city and constitute the majority of the anomalies detected at low 𝐾 values for Taxi Data. The performance of all methods is significantly reduced in Bike Data as can be seen from Table 3.6. This is because Bike Data is very noisy with a large number of days, or points that would be considered anomalous. 
Also, some of the selected events do not produce significant changes in Bike Data such as New Year’s Eve as usage of bikes at midnight is low even though it’s a significant event for taxi traffic. Changes in the weather also affect the performance by increasing the variance of the data, especially across the third mode, which corresponds to the weeks of the year. We see some fluctuations in the performance comparisons as well, such as LOGSS performing better than all methods at higher 𝐾, i.e. percentages. Still, the proposed methods are better than HoRPCA, WHoRPCA, and baseline methods overall, which shows the improvements brought by the extracted features. It is important to note that event selection is done manually and the selected events may not correspond to the most significant anomalies. Thus, although it is a widely utilized tool in analyzing the performance on real data, the case study approach might not reflect the true performance of the anomaly detection method as effectively as synthetic data. Table 3.5: Results for 2018 NYC Yellow Taxi Data. Columns indicate the percentage of selected points with top anomaly scores. The table entries correspond to the number of events detected at the corresponding percentage. % 0.014 0.07 0.14 0.3 0.7 1 2 3 EE 0 0 1 3 9 9 16 18 LOF 0 0 0 1 1 2 4 5 OCSVM 0 0 2 5 8 11 15 16 HoRPCA 0 5 11 14 20 20 20 20 WHoRPCA 0 0 1 3 9 9 16 18 LRSCP 0 1 1 1 5 7 10 11 BATF 0 1 5 9 14 15 20 20 LOSS 3 9 12 15 16 17 19 20 LOGSS 0 0 1 6 10 12 18 18 GLOSS-EE 1 8 13 15 18 18 19 20 GLOSS-LOF 2 8 12 14 18 18 19 19 GLOSS-SVM 0 3 3 9 17 18 20 20 In Fig. 3.5, we illustrate the bike data for July 4th at Hudson River banks, and the low-rank and sparse 70 Table 3.6: Results on 2018 NYC Bike Trip Data. % 0.3 1 2 3 4.2 7 9.7 12.5 EE 0 0 0 1 2 2 2 3 LOF 1 1 1 1 2 2 4 6 OCSVM 0 0 1 2 2 2 3 3 HoRPCA 2 3 4 4 6 8 11 14 WHoRPCA 1 1 2 7 11 12 12 12 LRSCP 0 1 2 3 3 4 6 10 BATF 2 3 4 6 9 10 11 13 LOSS 0 1 1 1 4 13 15 16 LOGSS 1 3 4 6 9 11 14 17 GLOSS-EE 0 0 3 7 9 11 15 16 GLOSS-LOF 1 2 2 2 2 6 8 11 GLOSS-SVM 1 2 5 8 10 15 15 15 (a) (b) (c) Figure 3.5: Bike Activity data, the extracted sparse part and low-rank part across for July 4th Celebrations at Hudson River banks. (a) Real Data where the traffic for 52 Wednesdays is shown along with the traffic on Independence Day and average traffic; (b) Sparse tensor where the curve corresponding to the anomaly is highlighted; (c) Low-rank tensor with the curve corresponding to the Independence Day highlighted. parts extracted by GLOSS. It can be seen that as the data varies across different weeks, the low-rank part can explain this variance well by fitting a pattern to days with varying amplitudes. Thus, the proposed method does not get affected by events such as the weather as it can capture both low and high traffic days in the low-rank part which can be seen in Fig 3.5c. The deviations from the daily pattern, rather than the actual traffic volume, is captured by the sparse part, which is then input to the anomaly scoring algorithms. Thus, our method is able to extract the events at a fairly low 𝐾. 71 3.8 Conclusions In this chapter, we proposed robust tensor decomposition based anomaly detection methods for urban traffic data. The proposed methods extract a low-rank component using a weighted nuclear norm and imposes the sparse component to be temporally smooth to better model the anomaly structure. 
In one of the methods, graph regularization is employed to preserve the geometry of the data and to account for nonlinearities in the anomaly structure. In another, low-rank tensor recovery is implemented by minimizing graph total variation on similarity graphs constructed across each mode. This approximation circumvents the need for a computationally expensive nuclear norm minimization. ADMM based, computationally efficient and scalable algorithms are proposed to solve the resulting optimization problems. As the proposed methods focus on spatiotemporal feature extraction, the resulting features can be input to well-known anomaly detection methods such as EE, LOF and OCSVM for anomaly scoring.

The proposed methods are evaluated on both synthetic and real urban traffic data. Results on synthetic data illustrate the robustness of our methods to varying levels of missing data and their sensitivity to even low amplitude anomalies. In particular, our methods outperform HoRPCA and WHoRPCA thanks to the temporal smoothness assumption on the sparse part. Moreover, the graph regularization improves the accuracy further by ensuring that the low-rank projections preserve the local geometry of the data. On real data, our methods begin to detect anomalies earlier than existing methods, i.e., the top anomaly scores usually correspond to events of interest. GLOSS provides further improvement over LOSS, as more events are detected for a given number of selected points. Furthermore, the results from real data show how the extracted sparse component highlights the anomalous activities. For both synthetic and real data, LOGSS outperforms the other methods in terms of computational efficiency, with some loss in AUC compared to GLOSS. Experiments on synthetic data reveal that the proposed methods perform better when anomalies are longer in duration, and that LOGSS outperforms LOSS when anomalies are shorter in duration. Although LOSS and GLOSS perform better on real data, LOGSS shows similar performance with a shorter run time.

In future work, a statistical tensor anomaly scoring method will be explored instead of scoring each fiber individually with a separate algorithm. In recent years, deep learning based methods have also provided a promising new direction for anomaly detection [33, 28]. Some examples include a long short-term memory (LSTM) neural network used to predict traffic flow and detect anomalies [94] and fully connected neural networks used to decompose traffic data into normal and anomalous parts [205]. Future work will consider extensions to tensor type data and the application of 3D-CNNs [33]. Another possible extension of the proposed work is online anomaly detection. The methods proposed in this chapter cannot achieve real-time anomaly detection, as for the current application the data has been collected and stored offline. However, it is possible to extend low-rank tensor models to an online setting. Recently, methods for online tensor subspace tracking have been introduced [87, 141, 8, 110]. Our method can be extended to the online setting using the approaches outlined in these papers. Applications and extensions of the proposed method to network data and other spatiotemporal data with different characteristics, such as fMRI, will also be considered.

CHAPTER 4
GEOMETRIC TENSOR LEARNING

4.1 Introduction

Most high-dimensional data have a low-dimensional structure, and principal component analysis (PCA) is the fundamental approach to extract this low-dimensional structure.
A major drawback of PCA is that it is sensitive to grossly corrupted or outlying observations, which are ubiquitous in real-world data. Robust PCA (RPCA) addresses this issue by decomposing the observed data matrix into a low-rank and a sparse part [30]. However, RPCA can only handle two-way matrix data, while real data are usually multi-way in nature and stored in arrays known as tensors. In recent years, different extensions of RPCA have been introduced to deal with tensor type data based on different tensor models. Some examples include simple low-rank tensor completion (SiLRTC) [120], higher-order RPCA (HoRPCA) based on the Tucker model [70], tensor RPCA (TRPCA) based on the t-SVD model [125] and tensor-train based tensor RPCA [53, 197].

Different tensor models employ different definitions of rank and result in different interpretations of what a low-rank tensor is. Tucker rank, optimized by minimizing the sum of nuclear norms (SNN), cannot appropriately capture the global correlation in a tensor, as each mode represents the matrix row in an unbalanced matricization scheme [17]. Tensor tubal rank, optimized by minimizing the tensor nuclear norm (TNN), on the other hand, characterizes the correlations along the first and second modes, while the correlation along the third mode is encoded by the embedded circular convolution. Moreover, the definition of TNN is usually limited to third-order tensors [210]. Tensor-train rank, optimized using the tensor train nuclear norm (TTNN), can capture the global correlation of all tensor entries by providing the mean of the correlation between two sets of modes, as it is based on a canonical unfolding [17].

While robust low-rank tensor representations capture the global structure of tensor data, they do not preserve the local geometric structure. Manifold learning addresses this issue and has been successfully implemented for tensors [164, 161]. However, current manifold learning approaches typically focus on only one mode of the data. Yet for many data matrices and tensors, correlations exist across all modalities. Several recent papers [67, 156, 155, 128, 127, 157, 22] exploit this coupled relationship to co-organize matrices and infer underlying row and column embeddings.

Inspired by the success of low-rank embedding and manifold learning, in this chapter we propose to integrate them into a unified framework for simultaneously capturing the global low-rank and the local geometric structure. In particular, we propose a graph regularized robust low-rank tensor-train decomposition. The proposed model is based on the robust tensor-train decomposition introduced in [197]. Unlike previous graph regularized tensor decompositions, the proposed method introduces graph regularization across each canonical unfolding to leverage the underlying geometry of each tensor mode. The resulting optimization problem is shown to be computationally expensive due to the size of the graphs across each canonical unfolding. An equivalence between these computationally expensive graph regularization terms and regular tensor unfoldings is derived, and a computationally efficient implementation of graph regularized robust tensor-train decomposition is proposed.

4.2 Tensor Train Robust PCA on Graphs

Given an observed tensor with missing entries and gross corruption $P_\Omega[\mathcal{Y}]$, the objective of tensor RPCA is to extract a low-rank tensor, $\mathcal{X}$, and a sparse tensor, $\mathcal{S}$, corresponding to gross outliers, such that $P_\Omega[\mathcal{Y}] = P_\Omega[\mathcal{X}+\mathcal{S}]$.
Robust tensor-train PCA was proposed in [197] as the solution to:
$$\min_{\mathcal{X},\mathcal{S}} \; \sum_{n=1}^{N-1} \alpha_n \|\mathbf{X}_{[n]}\|_* + \lambda\|\mathcal{S}\|_1, \quad \text{s.t. } P_\Omega[\mathcal{X}+\mathcal{S}] = P_\Omega[\mathcal{Y}].$$
While nuclear norm minimization can capture the global structure of the tensor, incorporating graph regularization can capture the local geometry and non-linear structures within the data. Adding graph regularization on the mode-$n$ canonical unfoldings of the low-rank part, $\mathbf{X}_{[n]} \in \mathbb{R}^{I_1\cdots I_n \times I_{n+1}\cdots I_N}$, the modified objective can be rewritten as:
$$\min_{\mathcal{X},\mathcal{S}} \; \sum_{n=1}^{N-1} \alpha_n \|\mathbf{X}_{[n]}\|_* + \sum_{n=1}^{N} \theta_n \operatorname{tr}(\mathbf{X}_{[n]}^\top \Phi_n \mathbf{X}_{[n]}) + \lambda\|\mathcal{S}\|_1, \quad \text{s.t. } P_\Omega[\mathcal{X}+\mathcal{S}] = P_\Omega[\mathcal{Y}], \qquad (4.1)$$
where $\Phi_n \in \mathbb{R}^{I_1\cdots I_n \times I_1\cdots I_n}$ is the mode-$n$ graph Laplacian. The solution to (4.1) will be referred to as tensor train robust PCA with graph regularization (TTRPCA-G).

4.2.1 Kronecker Structured Graphs

For each mode $n$, the graph regularization proposed in (4.1) computes the similarity over all the modes from the first to the $n$th mode. Thus, it potentially utilizes the same local geometric information multiple times. In this section, we propose modeling each $\Phi_n$ with a Kronecker structure, $\Phi_n = \hat{\Phi}_n \otimes \mathbf{I}$ with $\hat{\Phi}_n \in \mathbb{R}^{I_n\times I_n}$. This structure imposes a manifold on only mode $n$, rather than on the set of all modes $n'$ with $n' \le n$. In the following, we show that solving such a system is equivalent to replacing the graph regularization term on the canonical unfolding with one on the mode-$n$ unfolding. Moreover, this structure greatly reduces the computational complexity of TTRPCA-G.

Lemma 2. Let $B \in \mathbb{R}^{I\times J}$, $A \in \mathbb{R}^{K\times I}$ and $C \in \mathbb{R}^{L\times J}$; then $\operatorname{vec}(ABC^\top) = (C\otimes A)\operatorname{vec}(B)$, where $\operatorname{vec}(\cdot)$ stacks all elements of its argument into a vector.

Lemma 3. Given a third-order tensor $\mathcal{B} \in \mathbb{R}^{I\times J\times M}$ with third-mode slices $B_i \in \mathbb{R}^{I\times J}$, $\|C\mathbf{B}_{(2)}\|_F^2 = \|(C\otimes\mathbf{I})\mathbf{B}_{[2]}\|_F^2$, where $C \in \mathbb{R}^{L\times J}$ is any matrix and $\mathbf{I} \in \mathbb{R}^{I\times I}$ is the identity.

Proof. Let $\mathbf{b}_i = \operatorname{vec}(B_i) \in \mathbb{R}^{IJ\times 1}$. Then,
$$\operatorname{vec}(\mathbf{B}_{(2)}^\top C^\top) = \begin{bmatrix} \operatorname{vec}(B_1 C^\top) \\ \operatorname{vec}(B_2 C^\top) \\ \vdots \\ \operatorname{vec}(B_M C^\top) \end{bmatrix} = \begin{bmatrix} (C\otimes\mathbf{I})\mathbf{b}_1 \\ (C\otimes\mathbf{I})\mathbf{b}_2 \\ \vdots \\ (C\otimes\mathbf{I})\mathbf{b}_M \end{bmatrix} = (\mathbf{I}\otimes(C\otimes\mathbf{I}))[\mathbf{b}_1^\top,\mathbf{b}_2^\top,\ldots,\mathbf{b}_M^\top]^\top = (\mathbf{I}\otimes(C\otimes\mathbf{I}))\operatorname{vec}(\mathbf{B}_{[2]}) = \operatorname{vec}((C\otimes\mathbf{I})\mathbf{B}_{[2]}),$$
where the second equality follows from Lemma 2. Thus,
$$\|C\mathbf{B}_{(2)}\|_F^2 = \|\mathbf{B}_{(2)}^\top C^\top\|_F^2 = \|\operatorname{vec}(\mathbf{B}_{(2)}^\top C^\top)\|_2^2 = \|\operatorname{vec}((C\otimes\mathbf{I})\mathbf{B}_{[2]})\|_2^2 = \|(C\otimes\mathbf{I})\mathbf{B}_{[2]}\|_F^2.$$
By reshaping an $N$-mode tensor into a third-order tensor whose second mode corresponds to mode $n$, this result generalizes to any mode $n$ and order $N$.

Theorem 1. Let $\Phi_n = \hat{\Phi}_n \otimes \mathbf{I}$, where $\hat{\Phi}_n \in \mathbb{R}^{I_n\times I_n}$ and $\mathbf{I} \in \mathbb{R}^{I_1 I_2\cdots I_{n-1}\times I_1 I_2\cdots I_{n-1}}$. The term $\sum_{n=1}^{N}\theta_n\operatorname{tr}(\mathbf{X}_{[n]}^\top\Phi_n\mathbf{X}_{[n]})$ is equivalent to $\sum_{n=1}^{N}\theta_n\operatorname{tr}(\mathbf{X}_{(n)}^\top\hat{\Phi}_n\mathbf{X}_{(n)})$.

Proof. Since $\hat{\Phi}_n$ is a symmetric positive semidefinite graph Laplacian, it can be factored as $\hat{\Phi}_n = \hat{P}_n^\top\hat{P}_n$. Then $\Phi_n = (\hat{P}_n^\top\hat{P}_n)\otimes\mathbf{I} = P_n^\top P_n$, where $P_n = \hat{P}_n\otimes\mathbf{I}$. Thus, the following equalities hold:
$$\operatorname{tr}(\mathbf{X}_{[n]}^\top\Phi_n\mathbf{X}_{[n]}) = \|P_n\mathbf{X}_{[n]}\|_F^2 = \|(\hat{P}_n\otimes\mathbf{I})\mathbf{X}_{[n]}\|_F^2 = \|\hat{P}_n\mathbf{X}_{(n)}\|_F^2 = \operatorname{tr}(\mathbf{X}_{(n)}^\top\hat{P}_n^\top\hat{P}_n\mathbf{X}_{(n)}) = \operatorname{tr}(\mathbf{X}_{(n)}^\top\hat{\Phi}_n\mathbf{X}_{(n)}),$$
where the third equality follows from Lemma 3.

By Theorem 1, the graph regularization term in (4.1) is equivalent to $\sum_{n=1}^{N}\theta_n\operatorname{tr}(\mathbf{X}_{(n)}^\top\hat{\Phi}_n\mathbf{X}_{(n)})$. Thus, we can rewrite (4.1) as:
$$\min_{\mathcal{X},\mathcal{S}} \; \sum_{n=1}^{N-1}\alpha_n\|\mathbf{X}_{[n]}\|_* + \sum_{n=1}^{N}\theta_n\operatorname{tr}(\mathbf{X}_{(n)}^\top\hat{\Phi}_n\mathbf{X}_{(n)}) + \lambda\|\mathcal{S}\|_1, \quad \text{s.t. } P_\Omega[\mathcal{X}+\mathcal{S}] = P_\Omega[\mathcal{Y}]. \qquad (4.2)$$
The solution to (4.2) will be referred to as tensor train robust PCA with mode-$n$ graph regularization (TTRPCA-nG).
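The equivalence in Theorem 1 is easy to check numerically. The following NumPy sketch uses arbitrary small sizes and a random similarity graph (not one from the thesis experiments) and verifies that the Kronecker-structured regularizer on the canonical unfolding equals the mode-n regularizer.

```python
import numpy as np

# Sanity check of Theorem 1 for a third-order tensor and n = 2 (illustrative sizes).
I1, I2, I3 = 3, 4, 5
rng = np.random.default_rng(0)
X = rng.standard_normal((I1, I2, I3))

# Mode-2 graph Laplacian Phi_hat of a random similarity graph on I2 nodes
W = rng.random((I2, I2))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
Phi_hat = np.diag(W.sum(axis=1)) - W

# Canonical (TT) unfolding X_[2]: rows index (i1, i2) with mode 1 varying fastest
X_can = X.reshape(I1 * I2, I3, order='F')
# Mode-2 unfolding X_(2): rows index i2
X_mode = np.moveaxis(X, 1, 0).reshape(I2, I1 * I3)

Phi = np.kron(Phi_hat, np.eye(I1))           # Kronecker-structured Laplacian
lhs = np.trace(X_can.T @ Phi @ X_can)        # canonical-unfolding regularizer
rhs = np.trace(X_mode.T @ Phi_hat @ X_mode)  # mode-n regularizer
print(np.isclose(lhs, rhs))                  # True
```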
4.2.2 Optimization

The objective functions (4.1) and (4.2) are optimized using an Alternating Direction Method of Multipliers (ADMM) scheme, as ADMM has been previously utilized for solving similar convex problems [70, 154, 156]. In this section, we give a detailed derivation of the update steps for optimizing (4.1). While the update steps are similar for (4.2), we note when they differ. To separate the nuclear norm and graph regularization terms and to isolate functions of each mode, we introduce auxiliary variables $\mathfrak{L}^n$ and $\mathfrak{G}^n$. (4.1) is then rewritten as:
$$\min_{\{\mathfrak{L}\},\{\mathfrak{G}\},\mathcal{X},\mathcal{S}} \; \sum_{n=1}^{N-1}\alpha_n\|\mathfrak{L}^n_{[n]}\|_* + \sum_{n=1}^{N}\theta_n\operatorname{tr}(\mathfrak{G}^{n\top}_{[n]}\Phi_n\mathfrak{G}^n_{[n]}) + \lambda\|\mathcal{S}\|_1, \quad \text{s.t. } P_\Omega[\mathcal{X}+\mathcal{S}] = P_\Omega[\mathcal{Y}],\; \mathcal{X}=\mathfrak{L}^n,\; \mathcal{X}=\mathfrak{G}^n.$$
The corresponding augmented Lagrangian is given by:
$$\sum_{n=1}^{N-1}\alpha_n\|\mathfrak{L}^n_{[n]}\|_* + \sum_{n=1}^{N}\theta_n\operatorname{tr}(\mathfrak{G}^{n\top}_{[n]}\Phi_n\mathfrak{G}^n_{[n]}) + \lambda\|\mathcal{S}\|_1 + \frac{\beta_1}{2}\|P_\Omega[\mathcal{Y}-\mathcal{X}-\mathcal{S}-\Lambda_1]\|_F^2 + \frac{\beta_2}{2}\sum_{n=1}^{N-1}\|\mathcal{X}-\mathfrak{L}^n-\Lambda_2^n\|_F^2 + \frac{\beta_3}{2}\sum_{n=1}^{N}\|\mathcal{X}-\mathfrak{G}^n-\Lambda_3^n\|_F^2, \qquad (4.3)$$
where $\Lambda_1,\Lambda_2^n,\Lambda_3^n \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ are the Lagrange multipliers. Using (4.3), each variable can be updated iteratively.

1. $\mathcal{X}$ update: The update of the low-rank variable $\mathcal{X}$ is given by:
$$P_\Omega[\mathcal{X}^{t+1}] = P_\Omega[\beta_1\mathcal{T}_1+\mathcal{T}_2]/(\beta_1+(N-1)\beta_2+N\beta_3), \qquad P_{\Omega^\perp}[\mathcal{X}^{t+1}] = P_{\Omega^\perp}[\mathcal{T}_2]/((N-1)\beta_2+N\beta_3), \qquad (4.4)$$
where $\mathcal{T}_1 = \mathcal{Y}-\mathcal{S}^t-\Lambda_1^t$ and $\mathcal{T}_2 = \beta_2\sum_{n=1}^{N-1}(\mathfrak{L}^{n,t}+\Lambda_2^{n,t}) + \beta_3\sum_{n=1}^{N}(\mathfrak{G}^{n,t}+\Lambda_3^{n,t})$.

2. $\mathfrak{L}^n$ update: The update of $\mathfrak{L}^n$ is solved by a soft thresholding operator on the singular values of $\big(\mathcal{X}^{t+1}-\Lambda_2^{n,t}\big)_{[n]}$ with a threshold of $\alpha_n/\beta_2$.

3. $\mathfrak{G}^n$ update: The variables $\mathfrak{G}^n$ can be updated by:
$$\mathfrak{G}^{n,t+1}_{[n]} = \beta_3 G_{inv}\big(\mathcal{X}^{t+1}-\Lambda_3^{n,t}\big)_{[n]}, \qquad (4.5)$$
where $G_{inv} = (2\theta_n\Phi_n+\beta_3\mathbf{I})^{-1}$ exists for any $\Phi_n$ whose set of eigenvalues does not contain $-\beta_3/(2\theta_n)$, and can be computed outside the loop for faster updates. The update rule for (4.2) is given by:
$$\mathfrak{G}^{n,t+1}_{(n)} = \beta_3 G_{inv}\big(\mathcal{X}^{t+1}-\Lambda_3^{n,t}\big)_{(n)}, \qquad (4.6)$$
where $G_{inv} = (2\theta_n\hat{\Phi}_n+\beta_3\mathbf{I})^{-1}$.

4. $\mathcal{S}$ update: The variable $\mathcal{S}$ can be updated by soft thresholding $P_\Omega[\mathcal{Y}-\mathcal{X}^{t+1}-\Lambda_1^t]$ with a threshold of $\lambda/\beta_1$.

5. Dual updates: Finally, the dual variables $\Lambda_1,\Lambda_2^n,\Lambda_3^n$ are updated using:
$$\Lambda_1^{t+1} = \Lambda_1^t + P_\Omega[\mathcal{X}^{t+1}+\mathcal{S}^{t+1}-\mathcal{Y}], \qquad (4.7)$$
$$\Lambda_2^{n,t+1} = \Lambda_2^{n,t} + (\mathfrak{L}^{n,t+1}-\mathcal{X}^{t+1}), \qquad (4.8)$$
$$\Lambda_3^{n,t+1} = \Lambda_3^{n,t} + (\mathfrak{G}^{n,t+1}-\mathcal{X}^{t+1}). \qquad (4.9)$$
The algorithms for both TTRPCA-G and TTRPCA-nG are outlined in Algorithm 4.1.

Algorithm 4.1: TTRPCA-G/nG
Input: $\mathcal{Y}$, $\Omega$, $\Phi_n$, parameters $\theta_n$, $\alpha_n$, $\gamma$, $\beta_1$, $\beta_2$, $\beta_3$, $T$.
Output: $\mathcal{X}$: low-rank tensor; $\mathcal{S}$: sparse tensor.
Initialize $\mathcal{S}^0=0$, $\mathfrak{L}^{n,0}=0$, $\mathfrak{G}^{n,0}=0$, $\Lambda_1^0=0$, $\Lambda_2^{n,0}=0$, $\Lambda_3^{n,0}=0$, $\forall n\in\{1,\ldots,N\}$,
$G_{inv} \leftarrow (2\theta_n\Phi_n+\beta_3\mathbf{I})^{-1}$ (TTRPCA-G) or $G_{inv} \leftarrow (2\theta_n\hat{\Phi}_n+\beta_3\mathbf{I})^{-1}$ (TTRPCA-nG).
for $t = 1$ to $T$ do
  Update $\mathcal{X}$ using (4.4).
  Update the $\mathfrak{L}^n$s using optimization step 2.
  Update the $\mathfrak{G}^n$s using (4.5) (for TTRPCA-G) or (4.6) (for TTRPCA-nG).
  Update $\mathcal{S}$ using optimization step 4.
  Update the Lagrange multipliers using (4.7), (4.8) and (4.9).
end for

4.2.3 Computation and Memory Complexity for Graphs

In (4.1), the size of each $\Phi_n$ is $\prod_{i=1}^{n}I_i \times \prod_{i=1}^{n}I_i$. Thus, the memory requirement for TTRPCA-G is $O(I^2)$, where $I = \prod_{i=1}^{N}I_i$. On the other hand, TTRPCA-nG requires only $O(I_n^2)$ parameters for each $\hat{\Phi}_n$. Computation of $G_{inv}$ in (4.5) for $n = N$ has $O(I^4)$ complexity. On the other hand, the computational complexity of $G_{inv}$ in TTRPCA-nG is $O(I_n^2)$. Overall, the computational complexities for TTRPCA-G and TTRPCA-nG are $O(I^4)$ and $O(I^{3/2})$, respectively.
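The two proximal steps used repeatedly inside Algorithm 4.1, singular value thresholding for the nuclear norm update of the auxiliary low-rank variables and entrywise soft thresholding for the sparse part, are standard operators. A minimal NumPy sketch is shown below; the matrix, sizes and threshold values are illustrative, not taken from the thesis experiments.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of tau * ||.||_* (L^n update)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft thresholding: proximal operator of tau * ||.||_1 (S update)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

# Example of one L^n-style and one S-style update on a small random matrix
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 8))
L_new = svt(A, tau=0.5)    # threshold plays the role of alpha_n / beta_2
S_new = soft(A, tau=0.1)   # threshold plays the role of lambda / beta_1
```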
4.3 Experiments

The proposed method was compared against the Tucker based HoRPCA [70] and the TT based TTRPCA [197] on data completion and denoising tasks, on both synthetic and real tensor data¹. The results are reported in terms of peak signal to noise ratio (PSNR), structural similarity index measure (SSIM), and residual squared error (RSE), i.e., $\|\hat{\mathcal{Y}}-\mathcal{Y}_0\|_F/\|\mathcal{Y}_0\|_F$, where $\mathcal{Y}_0$ corresponds to the true underlying low-rank data.

Table 4.1: Denoising performance for synthetic data against varying levels of gross noise c% for various methods.

             c = 5                 c = 20                c = 35                c = 50
            RSE   PSNR  SSIM     RSE   PSNR  SSIM     RSE   PSNR  SSIM     RSE   PSNR  SSIM
Observed    1.95  17.78 0.71     3.91  11.73 0.24     5.18   9.30 0.07     6.22   7.71 0.01
HoRPCA      0.44  31.15 0.86     0.63  28.12 0.59     0.86  25.42 0.29     0.98  24.29 0.26
TTRPCA      0.07  47.52 0.99     0.23  36.86 0.96     0.44  31.22 0.81     0.93  24.77 0.30
TTRPCA-G    0.04  50.06 0.99     0.15  40.08 0.97     0.33  32.82 0.87     0.86  25.67 0.35
TTRPCA-nG   0.04  50.11 0.99     0.15  40.07 0.97     0.33  32.81 0.87     0.87  25.67 0.35

Following [17], we set $\alpha_k = \delta_k / \sum_{k'=1}^{N-1}\delta_{k'}$, where $\delta_k = \min\big(\prod_{n=1}^{k}I_n, \prod_{n=k+1}^{N}I_n\big)$. The $\theta_k$s are set proportional to the size of the corresponding mode, $I_k/\prod_{k'=1}^{N}I_{k'}$, for TTRPCA-nG, and proportional to $\alpha_k$ for TTRPCA-G. The remaining parameters are optimized for all methods such that the best results for each method are reported.

¹ You can find our code in github.com/mrsfgl/ttrpca_g

Table 4.2: Denoising performance for real data against varying levels of gross noise c% for various methods.

             c = 5                  c = 20                 c = 35                 c = 50
            RSE    PSNR  SSIM     RSE    PSNR  SSIM     RSE    PSNR  SSIM     RSE    PSNR  SSIM
Observed    0.414  17.57 0.785    0.8276 11.56 0.392    1.09    9.13 0.205    1.306   7.59 0.11
HoRPCA      0.105  29.53 0.928    0.211  23.42 0.791    0.288  20.74 0.667    0.383  18.25 0.53
TTRPCA      0.106  29.42 0.929    0.199  23.49 0.83     0.277  21.07 0.747    0.378  18.37 0.63
TTRPCA-nG   0.103  29.89 0.926    0.164  25.63 0.852    0.237  22.42 0.77     0.331  19.52 0.65

Figure 4.1: Phase diagrams for missing data recovery. (a) HoRPCA, (b) TTRPCA, (c) TTRPCA-G, (d) TTRPCA-nG.

4.3.1 Synthetic Data

Tensors with TT structure were generated by simulating each tensor factor $\mathcal{U}_n$, $\forall n \in \{1,\ldots,N\}$, as i.i.d. Gaussian with mean 0 and covariance $\mathbf{I}$, and merging them. For the sake of simplicity, all modes have the same size and rank, i.e., $I_n = I$, $\forall n$, and $r_n = r$, $\forall n \ne N$. In our experiments, $N$ is selected to be 4 and $I = 10$, i.e., $\mathcal{Y} \in \mathbb{R}^{10\times10\times10\times10}$, and $r$ is varied in the range of 1 to 9. After generating synthetic data, we simulate missing data by setting $m\%$ of the entries to zero. In Fig. 4.1, we demonstrate the phase diagrams for tensor completion with various methods. By varying the rank $r$ of the simulated tensors and $m$, we illustrate the robustness of all of the methods against increasing ranks and missing data. It can be seen that TTRPCA outperforms HoRPCA as the underlying structure is better explained by TTNN. The proposed TTRPCA-G and TTRPCA-nG further improve the performance by incorporating the underlying manifold information.

Next, we evaluate the robustness of the proposed method against sparse outliers by replacing a constant $c\%$ of the entries by random values sampled uniformly from $[0, 1]$. For this set of experiments, $m = 0\%$ and $r = 4$. In Table 4.1, we summarize the results. It can be seen that graph regularization improves the performance at all outlier levels. In particular, TTRPCA is better than HoRPCA as TTNN captures the low-rank structure better than SNN, and graph regularization further improves the results by taking the local geometry into account. Moreover, the results show that the performances of TTRPCA-G and TTRPCA-nG are close to each other, which indicates that the two methods capture the same geometry.
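A TT-structured synthetic tensor of the kind described above can be generated along the following lines. This is an illustrative sketch: the masking convention, seed and rank shown here are assumptions, and the thesis experiments use their own code.

```python
import numpy as np

def random_tt_tensor(dims, rank, rng):
    """Merge i.i.d. Gaussian TT cores G_n of shape (r_{n-1}, I_n, r_n) into a full tensor."""
    ranks = [1] + [rank] * (len(dims) - 1) + [1]
    cores = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1]))
             for n in range(len(dims))]
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))  # contract shared TT rank
    return out.reshape(dims)

rng = np.random.default_rng(0)
Y = random_tt_tensor((10, 10, 10, 10), rank=4, rng=rng)  # N = 4, I = 10, r = 4

# Simulate m% missing entries by zeroing them out (observation mask Omega)
m = 0.3
mask = rng.random(Y.shape) > m   # True where the entry is observed
Y_obs = Y * mask
```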
4.3.2 Real Data The algorithms are also tested on a subset of 40 objects from COIL dataset [133]. The color images are converted to grayscale and downsampled to a size of 16×16. For each object, each sample image corresponds to a different pose angle ranging from 0 to 360 degrees with increments of 10 degrees [133]. Thus, we create tensors of size 16 × 16 × 36 × 40. We then corrupt the tensor by randomly selecting 𝑐% of all entries and setting them to 0 or 1. Note that when using this data the adjacency matrix of the fourth mode canonical graph is of size 368640 × 368640. This makes TTRPCA-G computationally expensive, so we only use TTRPCA-nG. From Table 4.2, it can be seen that the proposed method outperforms other methods in denoising for varying levels of gross noise. TTRPCA-nG can capture the data structure better than the other methods as it simultaneously minimizes TTNN and considers the underlying manifold across each mode. 4.4 Conclusions In this chapter, we proposed two graph regularized robust tensor train principal component analysis methods. In the first method, we utilized canonical unfoldings to construct the mode-𝑛 graphs while in the second method mode-𝑛 unfoldings are used. We derived an equivalence between mode-𝑛 and canonical graphs with the assumption that the canonical graph has a specific Kronecker structure. 81 The proposed methods outperformed both robust Tucker and tensor-train decomposition methods in denoising and completion tasks. Experiments on synthetic data show that the performances of graph regularization with canonical unfolding and mode-𝑛 unfolding were similar to each other while mode- 𝑛 graphs provided much lower memory requirements and computational complexity. Future work will consider capturing low-rank structure through graph total variation minimization as suggested in [154] to reduce the computational complexity of low-rank tensor recovery task. 82 CHAPTER 5 COUPLED SUPPORT TENSOR MACHINE 5.1 Introduction Advances in clinical neuroimaging and computational bioinformatics have dramatically increased our under- standing of various brain functions using multiple modalities such as Magnetic Resonance Imaging (MRI), functional Magnetic Resonance Imaging (fMRI), electroencephalogram (EEG), and Positron Emission To- mography (PET). Their strong connections to the patients’ biological status and disease pathology suggest the great potential of their predictive power in disease diagnostics. Numerous studies using vector- and tensor-based statistical models illustrate how to utilize these imaging data both at the voxel- and Region-of- Interest (ROI) level and develop efficient biomarkers that predict disease status. For example, [9] proposes a classification model using functional connectivity MRI for autism disease and reaches 89% diagnostic accuracy for subjects under 20. [152] utilizes network models and brain imaging data to develop novel biomarkers for Parkinson’s disease. Many works in Alzheimer’s disease research such as [129, 122, 89, 66, 124, 54, 116] use EEG, MRI and PET imaging data to predict patient’s cognition and detect early-stage Alzheimer’s diseases. Although these studies have provided impressive results, utilizing imaging data from a single modality such as individual MRI sequences are known to have limited predictive capacity, especially in the early phases of the disease. For instance, [116] uses brain MRI volumes from regions of interest to identify patients in early-stage Alzheimer’s disease. 
They use one-year MRI data from Alzheimer’s Disease Neuroimaging Initiative (ADNI) and obtain 77% prediction accuracy. Although such a performance is favorable compared to other existing approaches, the diagnostic accuracy is relatively low due to the limited information from MRI data. In recent years, it has been common to acquire multiple neuroimaging modalities in clinical studies such as simultaneous EEG-fMRI or MRI and fMRI. Even though each modality measures different biological signals, they are interdependent and mutually informative. Learning from multimodal neuroimaging data may help integrate information from multiple sources and facilitate biomarker develop- ments in clinical studies. It also raises the need for novel supervised learning techniques for multimodal data in statistical learning literature. Existing statistical approaches to multimodal data science are dominated by unsupervised learning 83 methods. These methods analyze multimodal neuroimaging data by performing joint matrix decomposition and extracting common information across different modalities. During optimization, the decomposed factors bridging two or more modalities are estimated to interpret the connections between different modalities. Examples of these methods include matrix-based joint Independent Component Analysis (jICA) [29, 72, 107, 121, 165, 6] which assume bilinear correlations between factors in different modalities. However, these matrix-vector based models cannot preserve the multilinear nature of original data and the spatiotemporal correlations across modes as most neuroimaging modalities are naturally in tensor format. Recently, various coupled matrix-tensor decomposition methods have been introduced to address this issue [4, 5, 6, 37, 36, 86, 130]. These methods impose different soft or hard multilinear constraints between factors from different modalities providing more flexibility in data modeling. Current supervised learning approaches for multimodal data mostly concatenate data modalities as extra features without exploring their interdependence. For example, [211, 114] build generalized regression mod- els by appending tensor and vector predictors linearly for image prediction and classification. [142] develops a discriminant analysis by including tensor and vector predictors in a linear fashion. [111] proposes an integrative factor regression for multimodal neuroimaging data assuming that data from different modalities can be decomposed into latent factors. More recently, [65] proposed multiple tensor-on-tensor regression for multimodal data, which combines tensor-on-tensor regression from [123] with traditional additive linear model. Another type of integration utilizes kernel tricks and combines information from multimodal data with multiple kernels. [71] provides a survey on various multiple kernel learning techniques for multimodal data fusion and classification with support vector machines. Combining kernels linearly or non-linearly instead of original data in different modalities provides more flexibility in information integration. [12] proposed a multiple kernel regression model with group lasso penalty, which integrates information by multiple kernels and selects the most predictive data modalities. Despite these accomplishments, the current approaches have several shortcomings. First, they mainly focus on exploring the interdependence between multimodal imaging data, ignoring the representative and discriminative power of the learned components. 
Thus, the methods cannot further bridge the imaging data to the patients’ biological status, which is not helpful in biomarker development. Second, the supervised techniques such as integrate information primarily by data or feature concatenation without explicitly consid- ering the possible correlations between different modalities. This lack of consideration of interdependence may cause issues like overfitting and parameter identifiability. Third, even though methods from [111, 84 65] have considered latent structures for multimodal data, these models are designed primarily for linear regression and are not directly applicable to classification problems. Fourth, the aforementioned multimodal analysis methods are mainly vector based methods, which cannot handle large-size multi-dimensional data encountered in contemporary data science. As discussed in [21], tensors provide a powerful tool for analyz- ing multi-dimensional data in statistics. As a result, developing a novel multimodal tensor-based statistical framework for supervised learning can be of great interest. Finally, although many empirical studies demon- strate the success of using multimodal data, there is a lack of mathematical and statistical clarity to the extent of generalizability and associated uncertainties. The absence of a solid statistical framework for multimodal data analysis makes it impossible to interpret the generalization ability of a certain statistical model. In this chapter, we propose a two-stage Coupled Support Tensor Machine (C-STM) for multimodal tensor- based neuroimaging data classification. The proposed model addresses the current issues in multimodal data science and provides a sound statistical framework to interpret the interdependence between modalities and quantify the model consistency and generalization ability. The major contributions of this chapter are: 1. Individual and common latent factors are extracted from multimodal tensor data, for each sample or subject, using Advanced Coupled Matrix Tensor Factorization (ACMTF) [4, 3]. The extracted components are then utilized in a statistical framework. Most of the work on ACMTF do not focus on each subject separately and the extracted factors are utilized for a signal analysis rather than a subsequent statistical learning framework. Specifically, the work on supervised approaches with CMTF is limited. 2. Building a novel Coupled Support Tensor Machine with both the coupled and non-coupled tensor CP factors for classification. In this regard, multiple kernel learning approaches are adopted to integrate components from multi-modal data. 3. For the validation of our work, we provide both theoretical and empirical evidence. We provide theoretical results such as classification consistency for statistical guarantee. A thorough numerical study has been conducted, including a simulation study and experiments on real data to illustrate the usefulness of the proposed methodology. A Matlab package is also provided in the supplemental material, including all functions for C-STM classifi- 85 cation. The source codes are available at our Github repository 1. 5.2 Related Work In this section, we review some background and prior work on coupled tensor decomposition and multiple kernel learning. 
5.2.1 Coupled Matrix Tensor Factorization

Motivated by the fact that joint analysis of data from multiple sources can potentially unveil complex data structures and provide more information, Coupled Matrix Tensor Factorization (CMTF) [2] was proposed for multimodal data fusion. CMTF estimates the underlying latent factors for both tensor and matrix data simultaneously by taking the coupling between tensor and matrix data into account. This feature makes CMTF a promising model in analyzing heterogeneous data, which generally have different structures and modalities. During latent factor estimation, CMTF solves an objective function that approximates a CP decomposition for the tensor modality and a singular value decomposition for the second modality, with the assumption that the factors from one mode of each modality are the same. Given $\mathcal{X}_1 \in \mathbb{R}^{I_1\times I_2\times\ldots\times I_d}$ and $X_2 \in \mathbb{R}^{I_1\times J_2}$, without loss of generality assume that the factors from the first mode of the tensor $\mathcal{X}_1$ span the column space of the matrix $X_2$. CMTF then tries to estimate all factors by minimizing:
$$Q(\mathfrak{U}_1,\mathfrak{U}_2) = \frac{1}{2}\|\mathcal{X}_1 - [[X_1^{(1)}, X_1^{(2)}, \ldots, X_1^{(d)}]]\|_F^2 + \frac{1}{2}\|X_2 - X_2^{(1)}X_2^{(2)\top}\|_F^2, \quad \text{s.t. } X_1^{(1)} = X_2^{(1)}, \qquad (5.1)$$
where $X_p^{(m)}$ are the factor matrices for modality $p$ and mode $m$. The factor matrices $X_1^{(1)} = X_2^{(1)}$ are the coupled factors between the tensor and matrix data. An illustration of this coupling is given in Figure 5.1. These factor matrices can also be represented in Kruskal form, $\mathfrak{U}_1 = [[X_1^{(1)}, X_1^{(2)}, \ldots, X_1^{(d)}]]$ and $\mathfrak{U}_2 = [[X_2^{(1)}, X_2^{(2)}]]$. By minimizing the objective function $Q(\mathfrak{U}_1,\mathfrak{U}_2)$, CMTF estimates latent factors for the tensor and matrix data jointly, which allows it to utilize information from both modalities. [2] uses a gradient descent algorithm to optimize the objective function (5.1). Although this model is formulated for the joint decomposition of a $d$th order tensor and a matrix, extensions to two or more tensors with couplings across multiple modes are possible.

¹ https://github.com/PeterLiPeide/Coupled_MatrixTensor_SupportTensor_Machine

Figure 5.1: Illustration of the Coupled Tensor Matrix Model.

In real data, couplings across different modalities might include shared or modality-specific (individual) components. Shared components correspond to those columns of the factor matrices that contribute to the decomposition of both modalities, while individual components carry information unique to the corresponding modality. Although CMTF provides a successful framework for joint data analysis, it often fails to obtain a unique estimation of shared or individual components. As a result, any further statistical analysis and learning from the CMTF estimation will suffer from the uncertainty in the latent factors. To address this issue, [3] proposed Advanced Coupled Matrix Tensor Factorization (ACMTF) by introducing a sparsity penalty on the weights of the latent factors in the objective function (5.1), and restricting the norms of the columns of the factors to be unity to provide uniqueness up to a permutation. This modification provides a more precise estimation of the latent factors compared to CMTF ([3, 5]). In our framework, we utilize ACMTF to extract the latent factors, which are in turn used to build a classifier for multimodal data.
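To make the coupled model concrete, the following NumPy sketch builds a tensor and a matrix that share a first-mode factor and evaluates the objective in (5.1). Sizes, the rank, and the variable names are assumptions for illustration only; they are not taken from the thesis software.

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3, J2, r = 20, 15, 10, 12, 3     # assumed sizes and rank

A = rng.standard_normal((I1, r))          # X1^(1) = X2^(1): coupled factor
B = rng.standard_normal((I2, r))          # X1^(2)
C = rng.standard_normal((I3, r))          # X1^(3)
V = rng.standard_normal((J2, r))          # X2^(2)

def kruskal_to_tensor(A, B, C):
    """Full tensor of the Kruskal form [[A, B, C]] (sum of rank-1 terms)."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

X1 = kruskal_to_tensor(A, B, C)           # tensor modality
X2 = A @ V.T                              # matrix modality, coupled through A

def cmtf_objective(X1, X2, A, B, C, V):
    """Q(U1, U2) from (5.1) with the coupling X1^(1) = X2^(1) = A enforced."""
    res1 = X1 - kruskal_to_tensor(A, B, C)
    res2 = X2 - A @ V.T
    return 0.5 * np.sum(res1 ** 2) + 0.5 * np.sum(res2 ** 2)

print(cmtf_objective(X1, X2, A, B, C, V))  # essentially 0 at the true factors
```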
5.2.2 CP-STM for Tensor Classification

CP-STM has been previously studied by [169, 76, 77] and uses CP tensors to construct STM types of models. Assume there is a collection of data $T_n = \{(\mathcal{X}_1, y_1), (\mathcal{X}_2, y_2), \ldots, (\mathcal{X}_n, y_n)\}$, where the $\mathcal{X}_t \in \mathbb{X} \subset \mathbb{R}^{I_1\times I_2\times\cdots\times I_d}$ are $d$-way tensors, $\mathbb{X}$ is a compact tensor space which is a subspace of $\mathbb{R}^{I_1\times I_2\times\cdots\times I_d}$, and $y_t \in \{-1, 1\}$ are binary labels. CP-STM assumes the tensor predictors are in CP format, and can be classified by the function which minimizes the objective function
$$\min_{f} \; \lambda\|f\|^2 + \frac{1}{n}\sum_{t=1}^{n} L(f(\mathcal{X}_t), y_t). \qquad (5.2)$$
Using the tensor kernel function
$$K(\mathcal{X}_1,\mathcal{X}_2) = \sum_{l,m=1}^{r}\prod_{j=1}^{d} K^{(j)}(\mathbf{x}_{1,l}^{(j)}, \mathbf{x}_{2,m}^{(j)}), \qquad (5.3)$$
where $\mathcal{X}_1 = \sum_{l=1}^{r}\mathbf{x}_{1l}^{(1)}\circ\cdots\circ\mathbf{x}_{1l}^{(d)}$ and $\mathcal{X}_2 = \sum_{l=1}^{r}\mathbf{x}_{2l}^{(1)}\circ\cdots\circ\mathbf{x}_{2l}^{(d)}$, the STM classifier can be written as
$$f(\mathcal{X}) = \sum_{t=1}^{n}\alpha_t y_t K(\mathcal{X}_t,\mathcal{X}) = \boldsymbol{\alpha}^\top D_y\mathbf{K}(\mathcal{X}), \qquad (5.4)$$
where $\mathcal{X}$ is a new $d$-way rank-$r$ tensor, $\boldsymbol{\alpha} = [\alpha_1,\ldots,\alpha_n]^\top$ is the coefficient vector, $D_y$ is a diagonal matrix whose diagonal elements are $y_1,\ldots,y_n$, and $\mathbf{K}(\mathcal{X}) = [K(\mathcal{X}_1,\mathcal{X}),\ldots,K(\mathcal{X}_n,\mathcal{X})]^\top$ is a column vector whose entries are the kernel values computed between the training and test data. We denote the collection of functions of the form (5.4) by $\mathcal{H}$, which is a functional space also known as a Reproducing Kernel Hilbert Space (RKHS). The optimal CP-STM classifier $f \in \mathcal{H}$ can be estimated by plugging (5.4) into the objective function (5.2) and minimizing it with the Hinge or Squared Hinge loss. The coefficients of the optimal CP-STM model are denoted by $\boldsymbol{\alpha}^*$. The classification model is statistically consistent if the tensor kernel function satisfies the universal approximating property, as shown by [109].

5.2.3 Multiple Kernel Learning

Multiple kernel learning (MKL) creates new kernels using a linear or non-linear combination of single kernels to measure inner products between data. Statistical learning algorithms such as support vector machines and kernel regression can then utilize the new combined kernels instead of single kernels to obtain better learning results and avoid the potential bias of kernel selection ([71]). A more important and related reason for using MKL is that different kernels can take inputs from various data representations, possibly from different sources or modalities. Thus, combining kernels and using MKL is one possible way of integrating multiple information sources. Given a collection of kernel functions $\{K_1(\cdot,\cdot),\ldots,K_m(\cdot,\cdot)\}$, a new kernel function can be constructed by
$$K(\cdot,\cdot) = f_\eta\big(\{K_1(\cdot,\cdot),\ldots,K_m(\cdot,\cdot)\}\,|\,\eta\big), \qquad (5.5)$$
where $f_\eta$ is a linear or non-linear function and $\eta$ is a vector whose elements are the weights for the kernel combination. Linear combination methods are the most popular in multiple kernel learning, where the kernel function is parameterized as
$$K(\cdot,\cdot) = f_\eta\big(\{K_1(\cdot,\cdot),\ldots,K_m(\cdot,\cdot)\}\,|\,\eta\big) = \sum_{l=1}^{m}\eta_l K_l(\cdot,\cdot). \qquad (5.6)$$
The weight parameters $\eta_l$ can simply be assumed to be the same (unweighted) ([145, 16]), or be determined by looking at some performance measures for each kernel or data representation ([167, 147]). There are a few more advanced approaches, such as optimization-based, Bayesian, and boosting approaches, that can also be adopted ([103, 64, 176, 92, 69, 41, 19]). In this chapter, we only consider the linear combination (5.6), and select the weight parameters in a heuristic, data-driven way to construct our C-STM model.
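For concreteness, the CP kernel in (5.3) can be evaluated directly from the factor matrices of the two tensors. The sketch below is illustrative only: the Gaussian factor kernel, the random rank-3 factors, and the function names are assumptions, and this is not the thesis MATLAB package.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """Gaussian kernel between two factor vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def cp_stm_kernel(factors1, factors2, gamma=1.0):
    """Tensor kernel (5.3): sum over rank pairs of products of mode-wise kernels.

    factors1, factors2: lists of d factor matrices of shape (I_j, r),
    one per mode, holding the CP factors of the two tensors.
    """
    r1, r2 = factors1[0].shape[1], factors2[0].shape[1]
    val = 0.0
    for l in range(r1):
        for m in range(r2):
            prod = 1.0
            for U, V in zip(factors1, factors2):   # loop over modes j = 1..d
                prod *= rbf(U[:, l], V[:, m], gamma)
            val += prod
    return val

# Example with two random rank-3 CP tensors of size 30 x 20 x 10
rng = np.random.default_rng(0)
f1 = [rng.standard_normal((I, 3)) for I in (30, 20, 10)]
f2 = [rng.standard_normal((I, 3)) for I in (30, 20, 10)]
print(cp_stm_kernel(f1, f2, gamma=0.1))
```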
5.3 Methods

Let $T_n = \{(\mathcal{X}_{1,1}, X_{1,2}, y_1), \ldots, (\mathcal{X}_{n,1}, X_{n,2}, y_n)\}$ be the training data, where each sample $t \in \{1,\ldots,n\}$ has two data modalities $\mathcal{X}_{t,1}$, $X_{t,2}$ and a corresponding binary label $y_t \in \{1, -1\}$. In this chapter, following [2], we assume that the first data modality is a third-order tensor, $\mathcal{X}_{t,1} \in \mathbb{R}^{I_1\times I_2\times I_3}$, and the other is a matrix, $X_{t,2} \in \mathbb{R}^{I_4\times I_3}$. The third mode of $\mathcal{X}_{t,1}$ and the second mode of $X_{t,2}$ are assumed to be coupled for each $t$, i.e., the factor matrix is assumed to be fully or partially shared across these modes. Utilizing this coupling, one can extract factors that better represent the underlying structure of the data, and preserve and utilize the discriminative power of the factors from both modalities. Our approach, C-STM, consists of two stages: multimodal tensor factorization, i.e., ACMTF, and a coupled support tensor machine, as illustrated in Figure 5.2. In this section, we present both stages and the corresponding procedures.

Figure 5.2: C-STM Model Pipeline.

5.3.1 Multimodal Tensor Factorization

In this chapter, the first aim is to perform a joint factorization across the two modalities for each training sample $t$. Let $\mathfrak{U}_{t,1} = [[\zeta; X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}]]$ denote the Kruskal tensor of $\mathcal{X}_{t,1}$, and $\mathfrak{U}_{t,2} = [[\sigma; X_{t,2}^{(1)}, X_{t,2}^{(2)}]]$ denote the singular value decomposition of $X_{t,2}$. The weights of the columns of each factor matrix $X_{t,p}^{(m)}$, where $p$ is the index for modality and $m$ denotes the mode, are denoted by $\zeta$ and $\sigma$, and the norms of these columns are constrained to be 1 to avoid redundancy. The objective function of ACMTF [4, 3] is then given by:
$$Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2}) = \gamma\|\mathcal{X}_{t,1} - [[\zeta; X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}]]\|_F^2 + \gamma\|X_{t,2} - X_{t,2}^{(1)}\boldsymbol{\Sigma}X_{t,2}^{(2)\top}\|_F^2 + \beta\|\zeta\|_0 + \beta\|\sigma\|_0$$
$$\text{s.t. } X_{t,1}^{(3)} = X_{t,2}^{(2)}, \quad \|\mathbf{x}_{t,1,k}^{(1)}\|_2 = \|\mathbf{x}_{t,1,k}^{(2)}\|_2 = \|\mathbf{x}_{t,1,k}^{(3)}\|_2 = \|\mathbf{x}_{t,2,k}^{(1)}\|_2 = \|\mathbf{x}_{t,2,k}^{(2)}\|_2 = 1, \;\forall k\in\{1,\ldots,r\}, \qquad (5.7)$$
where $\boldsymbol{\Sigma}$ is a diagonal matrix whose elements are the singular values $\sigma$ of the matrix $X_{t,2}$ and $\mathbf{x}_{t,m,k}^{(j)} \in \mathbb{R}^{I_j}$ denotes the columns of the factor matrices for the object $\mathcal{X}_{t,m}$. The objective function in (5.7) includes penalties on the number of non-zero weights in both the tensor and matrix decompositions. Thus, the model identifies the shared and individual components. These factors are then considered as different data representations for the multimodal data, and are used to predict the labels $y_t$ in the C-STM classifier.

5.3.2 Coupled Support Tensor Machine (C-STM)

C-STM uses the idea of multiple kernel learning and considers the coupled and uncoupled factors from the ACMTF decomposition as various data representations. As a result, we use three different kernel functions to measure their similarity, i.e., inner products. One can think of these three kernels as inducing three different feature maps transforming multimodal factors into different feature spaces. In each feature space, the corresponding kernel measures the similarity between factors in this specific data modality. The similarities of multimodal factors are then integrated by combining the kernel measures through a non-linear combination. This combination should be able to take individual and shared components into account separately for better adaptability depending on the size of and corruption on the data, as the coupled modes are likely to be better estimated than the individual modes. Thus, we use tensor kernels for the individual modes of each modality and combine these with the kernels of the coupled modes, as illustrated in Figure 5.2.
The kernel function for C-STM is defined as
$$K\big((\mathcal{X}_{t,1}, X_{t,2}), (\mathcal{X}_{i,1}, X_{i,2})\big) = K\big((\mathfrak{U}_{t,1},\mathfrak{U}_{t,2}), (\mathfrak{U}_{i,1},\mathfrak{U}_{i,2})\big) = \sum_{k,l=1}^{r}\Big[w_1 K_1^{(1)}(\mathbf{x}_{t,1,k}^{(1)},\mathbf{x}_{i,1,l}^{(1)})\,K_1^{(2)}(\mathbf{x}_{t,1,k}^{(2)},\mathbf{x}_{i,1,l}^{(2)}) + w_2 K_1^{(3)}(\mathbf{x}_{t,1,k}^{(3)*},\mathbf{x}_{i,1,l}^{(3)*}) + w_3 K_2^{(1)}(\mathbf{x}_{t,2,k}^{(1)},\mathbf{x}_{i,2,l}^{(1)})\Big] \qquad (5.8)$$
for two pairs of decomposed tensor matrix factors $(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})$ and $(\mathfrak{U}_{i,1},\mathfrak{U}_{i,2})$. Here $\mathbf{x}_{t,1,k}^{(3)*}$ is the average of the estimated shared factors, $\frac{1}{2}[\mathbf{x}_{t,1,k}^{(3)} + \mathbf{x}_{t,2,k}^{(2)}]$. This kernel is inspired by the idea of multiple kernel learning with a linear combination of multiple kernels for multimodal data. A few more details regarding the choice of this kernel combination are provided in Section 5.5.2.1. $w_1$, $w_2$, and $w_3$ are the three weight parameters combining the three kernel functions. As discussed in [71], there is no unique choice for determining these weights; in this chapter, we adopt a cross-validation approach as explained in Appendix C.2. With the kernel function in (5.8), the C-STM model tries to estimate a bivariate decision function $f$ from a collection of functions $\mathcal{H}$ such that
$$f = \arg\min_{f\in\mathcal{H}} \; \lambda\|f\|^2 + \frac{1}{n}\sum_{t=1}^{n}L(f(\mathcal{X}_t), y_t), \qquad (5.9)$$
where $L(\mathcal{X}_t, y_t) = \max\big(0, 1 - f(\mathcal{X}_t)\cdot y_t\big)$ is the Hinge loss. $\mathcal{H}$ is defined as the collection of all functions of the form
$$f(\mathcal{X}_1, X_2) = \sum_{t=1}^{n}\alpha_t y_t K\big((\mathcal{X}_{t,1}, X_{t,2}), (\mathcal{X}_1, X_2)\big) = \boldsymbol{\alpha}^\top D_y\mathbf{K}(\mathcal{X}_1, X_2) \qquad (5.10)$$
due to the well-known representer theorem ([10]), for any pair of test data $(\mathcal{X}_1, X_2)$ and for $\boldsymbol{\alpha} \in \mathbb{R}^n$. For all possible values of $\boldsymbol{\alpha}$, equation (5.10) defines the collection $\mathcal{H}$. $D_y$ is a diagonal matrix whose diagonal elements are the labels from the training data $T_n$. $\mathbf{K}(\mathcal{X}_1, X_2)$ is an $n\times 1$ vector whose $t$-th element is $K\big((\mathcal{X}_{t,1}, X_{t,2}), (\mathcal{X}_1, X_2)\big)$. The optimal C-STM decision function, denoted by $f_n = \boldsymbol{\alpha}^{*\top}D_y\mathbf{K}(\mathcal{X}_1, X_2)$, can be estimated by solving the quadratic programming problem
$$\min_{\boldsymbol{\alpha}\in\mathbb{R}^n} \; \frac{1}{2}\boldsymbol{\alpha}^\top D_y\mathbf{K}D_y\boldsymbol{\alpha} - \mathbf{1}^\top\boldsymbol{\alpha}, \quad \text{s.t. } \boldsymbol{\alpha}^\top\mathbf{y} = 0, \quad \mathbf{0} \preceq \boldsymbol{\alpha} \preceq \frac{1}{2n\lambda}, \qquad (5.11)$$
where $\mathbf{K}$ is the kernel matrix constructed by the function (5.8). Problem (5.11) is the dual problem of (5.9), and its optimal solution $\boldsymbol{\alpha}^*$ also minimizes the objective function (5.9) when plugging in functions of the form (5.10). For a new pair of test points $(\mathcal{X}_1, X_2)$, the class label is predicted as $\operatorname{sign}(f_n(\mathcal{X}_1, X_2))$.

5.4 Model Estimation

In this section, we first present the estimation procedure for the coupled tensor matrix decomposition (5.7), and then combine it with the classification procedure to summarize the algorithm for C-STM. To satisfy the constraints in the objective function (5.7), the function $Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})$ is converted to a differentiable and unconstrained form given by:
$$Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2}) = \gamma\|\mathcal{X}_{t,1} - [[\zeta; X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}]]\|_F^2 + \gamma\|X_{t,2} - X_{t,2}^{(1)}\boldsymbol{\Sigma}X_{t,2}^{(2)\top}\|_F^2 + \xi\|X_{t,1}^{(3)} - X_{t,2}^{(2)}\|_F^2$$
$$+ \sum_{k=1}^{r}\Big[\beta\sqrt{\zeta_k^2+\epsilon} + \beta\sqrt{\sigma_k^2+\epsilon} + \theta\big((\|\mathbf{x}_{t,1,k}^{(1)}\|_2-1)^2 + (\|\mathbf{x}_{t,1,k}^{(2)}\|_2-1)^2 + (\|\mathbf{x}_{t,1,k}^{(3)}\|_2-1)^2 + (\|\mathbf{x}_{t,2,k}^{(1)}\|_2-1)^2 + (\|\mathbf{x}_{t,2,k}^{(2)}\|_2-1)^2\big)\Big], \qquad (5.12)$$
where the $\ell_1$ norm penalties in (5.7) are replaced with differentiable approximations; $\xi$ and $\theta$ are Lagrange multipliers and $\epsilon > 0$ is a very small number. This unconstrained optimization problem can be solved by nonlinear conjugate gradient descent ([2, 4, 130]).
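As a rough illustration of this step, the sketch below hands a coupled factorization objective and its gradient to a general-purpose nonlinear conjugate gradient solver. Several simplifications are assumptions: for brevity it fits the simpler hard-coupled objective (5.1) rather than the full penalized form (5.12), SciPy's 'CG' method uses Polak-Ribiere rather than the Hestenes-Stiefel updates of Algorithm 5.1 below, and the helper names are invented for this example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
I1, I2, I3, J2, r = 15, 12, 10, 8, 3                    # assumed sizes and rank
A0, B0, C0, V0 = (rng.standard_normal(s) for s in [(I1, r), (I2, r), (I3, r), (J2, r)])
X1 = np.einsum('ir,jr,kr->ijk', A0, B0, C0)             # tensor modality
X2 = A0 @ V0.T                                          # matrix modality (coupled via A)

shapes = [(I1, r), (I2, r), (I3, r), (J2, r)]
split = np.cumsum([np.prod(s) for s in shapes])[:-1]

def unpack(theta):
    return [p.reshape(s) for p, s in zip(np.split(theta, split), shapes)]

def fun(theta):
    A, B, C, V = unpack(theta)
    R1 = np.einsum('ir,jr,kr->ijk', A, B, C) - X1
    R2 = A @ V.T - X2
    return 0.5 * np.sum(R1 ** 2) + 0.5 * np.sum(R2 ** 2)

def grad(theta):
    A, B, C, V = unpack(theta)
    R1 = np.einsum('ir,jr,kr->ijk', A, B, C) - X1
    R2 = A @ V.T - X2
    gA = np.einsum('ijk,jr,kr->ir', R1, B, C) + R2 @ V   # coupled factor gets both terms
    gB = np.einsum('ijk,ir,kr->jr', R1, A, C)
    gC = np.einsum('ijk,ir,jr->kr', R1, A, B)
    gV = R2.T @ A
    return np.concatenate([g.ravel() for g in (gA, gB, gC, gV)])

theta0 = rng.standard_normal(split[-1] + np.prod(shapes[-1]))
res = minimize(fun, theta0, jac=grad, method='CG', options={'maxiter': 500})
print(res.fun)   # final objective; small when the coupled factors are well recovered
```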
Let $\mathcal{T}_t$ be the full tensor of $\mathfrak{U}_{t,1}$ (created by converting the Kruskal tensor, i.e., the factor matrices, into multidimensional array form), and let $M_t = X_{t,2}^{(1)}\boldsymbol{\Sigma}X_{t,2}^{(2)\top}$. The partial derivative with respect to each latent factor can be derived as follows:
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta X_{t,1}^{(1)}} = \gamma(\mathcal{T}_t-\mathcal{X}_{t,1})_{(1)}\big(\zeta^\top\odot X_{t,1}^{(3)}\odot X_{t,1}^{(2)}\big) + \theta\big(X_{t,1}^{(1)}-\bar{X}_{t,1}^{(1)}\big), \qquad (5.13)$$
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta X_{t,1}^{(2)}} = \gamma(\mathcal{T}_t-\mathcal{X}_{t,1})_{(2)}\big(\zeta^\top\odot X_{t,1}^{(3)}\odot X_{t,1}^{(1)}\big) + \theta\big(X_{t,1}^{(2)}-\bar{X}_{t,1}^{(2)}\big), \qquad (5.14)$$
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta X_{t,1}^{(3)}} = \gamma(\mathcal{T}_t-\mathcal{X}_{t,1})_{(3)}\big(\zeta^\top\odot X_{t,1}^{(2)}\odot X_{t,1}^{(1)}\big) + \xi\big(X_{t,1}^{(3)}-X_{t,2}^{(2)}\big) + \theta\big(X_{t,1}^{(3)}-\bar{X}_{t,1}^{(3)}\big), \qquad (5.15)$$
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta X_{t,2}^{(1)}} = \gamma(M_t-X_{t,2})X_{t,2}^{(2)}\boldsymbol{\Sigma} + \theta\big(X_{t,2}^{(1)}-\bar{X}_{t,2}^{(1)}\big), \qquad (5.16)$$
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta X_{t,2}^{(2)}} = \gamma(M_t-X_{t,2})^\top X_{t,2}^{(1)}\boldsymbol{\Sigma} + \xi\big(X_{t,2}^{(2)}-X_{t,1}^{(3)}\big) + \theta\big(X_{t,2}^{(2)}-\bar{X}_{t,2}^{(2)}\big), \qquad (5.17)$$
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta\sigma_k} = \mathbf{x}_{t,2,k}^{(1)\top}(M_t-X_{t,2})\mathbf{x}_{t,2,k}^{(2)} + \frac{\beta\,\sigma_k}{2\sqrt{\sigma_k^2+\epsilon}}, \quad k\in\{1,\ldots,r\}, \qquad (5.18)$$
$$\frac{\delta Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2})}{\delta\zeta_k} = \operatorname{vec}(\mathcal{T}_t-\mathcal{X}_{t,1})^\top\big(\mathbf{x}_{t,1,k}^{(3)}\odot\mathbf{x}_{t,1,k}^{(2)}\odot\mathbf{x}_{t,1,k}^{(1)}\big) + \frac{\beta\,\zeta_k}{2\sqrt{\zeta_k^2+\epsilon}}, \quad k\in\{1,\ldots,r\}, \qquad (5.19)$$
where $\operatorname{vec}(\cdot)$ is a vectorization operator that stacks all elements of its operand into a column vector, $\mathcal{T}_{(j)}$ denotes the mode-$j$ unfolding of a tensor $\mathcal{T}$, $\odot$ denotes the Khatri-Rao product, and $\bar{M}$ denotes the column-normalized version of a matrix $M$, whose columns have unit $\ell_2$ norms. By combining all of the partial derivatives, the gradient of the objective function is given by:
$$\nabla Q(\mathfrak{U}_{t,1},\mathfrak{U}_{t,2}) = \Big[\frac{\delta Q}{\delta X_{t,1}^{(1)}}, \frac{\delta Q}{\delta X_{t,1}^{(2)}}, \frac{\delta Q}{\delta X_{t,1}^{(3)}}, \frac{\delta Q}{\delta X_{t,2}^{(1)}}, \frac{\delta Q}{\delta X_{t,2}^{(2)}}, \frac{\delta Q}{\delta\zeta_1},\ldots, \frac{\delta Q}{\delta\sigma_1},\ldots\Big]^\top, \qquad (5.20)$$
which is a $(2r+5)$-dimensional block vector. As mentioned in [4], a nonlinear conjugate gradient method with Hestenes-Stiefel updates is used to optimize (5.12). The procedure is described in Algorithm 5.1.

Algorithm 5.1: ACMTF Decomposition
1: Input: multimodal data $(\mathcal{X}_1, X_2)$, $r$, $\eta$, $S$ (upper limit on the number of iterations)
2: Output: $\mathfrak{U}^*_{t,1}$, $\mathfrak{U}^*_{t,2}$
3: $\mathfrak{U}_{t,1}, \mathfrak{U}_{t,2} = \mathfrak{U}^0_{t,1}, \mathfrak{U}^0_{t,2}$   ⊲ initial value
4: $\Delta_0 = -\nabla Q(\mathfrak{U}^0_{t,1}, \mathfrak{U}^0_{t,2})$
5: $\varphi_0 = \arg\min_\varphi Q\big((\mathfrak{U}^0_{t,1}, \mathfrak{U}^0_{t,2}) + \varphi\Delta_0\big)$
6: $\mathfrak{U}^1_{t,1}, \mathfrak{U}^1_{t,2} = (\mathfrak{U}^0_{t,1}, \mathfrak{U}^0_{t,2}) + \varphi_0\Delta_0$
7: $\mathbf{g}_0 = \Delta_0$
8: while $s < S$ and $\|Q(\mathfrak{U}^s_{t,1}, \mathfrak{U}^s_{t,2}) - Q(\mathfrak{U}^{s-1}_{t,1}, \mathfrak{U}^{s-1}_{t,2})\| > \eta$ do
9:   $\Delta_{s+1} = -\nabla Q(\mathfrak{U}^s_{t,1}, \mathfrak{U}^s_{t,2})$
10:  $\mathbf{g}_{s+1} = \Delta_{s+1} + \dfrac{\Delta_{s+1}^\top(\Delta_{s+1}-\Delta_s)}{-\mathbf{g}_s^\top(\Delta_{s+1}-\Delta_s)}\mathbf{g}_s$
11:  $\varphi_{s+1} = \arg\min_\varphi Q\big((\mathfrak{U}^s_{t,1}, \mathfrak{U}^s_{t,2}) + \varphi\mathbf{g}_{s+1}\big)$
12:  $\mathfrak{U}^{s+1}_{t,1}, \mathfrak{U}^{s+1}_{t,2} = (\mathfrak{U}^s_{t,1}, \mathfrak{U}^s_{t,2}) + \varphi_{s+1}\mathbf{g}_{s+1}$
13: end while

Once the factors for all data pairs in the training set $T_n$ are extracted, we can create the kernel matrix using the kernel function in (5.8). By solving the quadratic programming problem (5.11), we can obtain the optimal decision function $f_n$. This two-stage procedure for C-STM estimation is summarized in Algorithm 5.2.

Algorithm 5.2: Coupled Support Tensor Machine
1: procedure C-STM
2: Input: training set $T_n = \{(\mathcal{X}_{1,1}, X_{1,2}, y_1), \ldots, (\mathcal{X}_{n,1}, X_{n,2}, y_n)\}$, $\mathbf{y}$, kernel function $K$, $r$, $\lambda$, $\eta$, $S$
3: for $t = 1, 2, \ldots, n$ do
4:   $\mathfrak{U}^*_{t,1}, \mathfrak{U}^*_{t,2} = \text{ACMTF}\big((\mathcal{X}_{t,1}, X_{t,2}), r, \eta, S\big)$
5: end for
6: Create the initial matrix $\mathbf{K} \in \mathbb{R}^{n\times n}$
7: for $t = 1, \ldots, n$ do
8:   for $i = 1, \ldots, n$ do
9:     $\mathbf{K}[i, t] = K\big((\mathfrak{U}_{t,1}, \mathfrak{U}_{t,2}), (\mathfrak{U}_{i,1}, \mathfrak{U}_{i,2})\big)$   ⊲ kernel values
10:    $\mathbf{K}[t, i] = \mathbf{K}[i, t]$
11:  end for
12: end for
13: Solve the quadratic programming problem (5.11) and find the optimal $\boldsymbol{\alpha}^*$.
14: Output: 𝛼∗ 15: end procedure 5.5 Experiments 5.5.1 Parameter Selection 5.5.1.1 Multimodal Tensor Factorization The proposed model requires the selection of three different parameters, namely, 𝛾, 𝛽, and rank 𝑟. To select these parameters, we closely follow best practices outlined in previous work on CMTF [2], ACMTF [4] and CCMTF [130]. First of all, one of these parameter can be set to 1 as a pivot, and following previous work, we set 𝛾 = 1. The selection of rank 𝑟 is directly related to the selection of 𝛽. As 𝛽 enforces sparsity over the singular values, it directly minimizes the rank. With sufficiently large 𝑟, we can estimate the low-rank part through optimization. For the selection of 𝑟 in real data, we set 𝑟 = 5 following the work of [130] where it was shown through CORCONDIA tests [85] that 𝑟 = 3 is sufficiently large for oddball data. In the case of the simulation study, 𝑟 = 5 is again sufficiently large as the data were generated from rank 𝑟 = 3 factors. Finally, based on our empirical results and the results presented in [130] we set 𝛽 = 0.001 using k-fold cross validation. 94 5.5.1.2 C-STM The parameters in C-STM include kernel weights 𝑤 1 , 𝑤 2 , 𝑤 3 and regularization parameter 𝜆 in the optimiza- tion. The weight parameters, normalized such that the ℓ2 -norm is equal to 1, and 𝜆 are selected using 5 fold cross-validation. The overall classification accuracy in our validation set serves as the performance metric and helps us determine the best combination of weights and 𝜆. The selection of weight parameters 𝑤 1 , 𝑤 2 , 𝑤 3 is indeed a problem of how to combine kernels from different modalities. It is straightforward to calculate kernels from every data modality, however, combining them appropriately and effectively would be challenging unless we can find out the weight for each kernel. This problem has been widely studied in the literature of Multiple Kernel Learning. In [71], the authors summarize that the existing methods of kernel weight selection can be divided into five categories, including fixed rules, heuristic approaches, optimization approaches, Bayesian approaches, and boosting approaches. As there is no consensus on the best way to choose the weights, we adopt a cross-validation approach as explained in the Appendix of the revised manuscript to identify the kernel weights. The overall classification accuracy in our validation set serves as the performance metric and helps us determine the best combination of weights. The generalization of our method to more than two modalities would be straightforward for tuning the weights. This is because the tuning problem has been widely studied in multiple kernel learning (MKL) research. There is no restriction on the number of kernels one can include in MKL framework. The weight selection techniques in MKL can be adapted to our framework. The optimization problem defined in (5.9) is an ordinary SVM problem once the kernel values are calculated through equation (5.8). Thus, for more information about the estimation procedure, 𝜆 selection, as well as the consistency results readers are referred to existing Support Vector Machine literature [163]. 5.5.2 Simulated Data We present a simulation study to demonstrate the benefit of utilizing C-STM with multimodal data in classification problems. To show the advantage of using multi-modalities in C-STM, we include CP-STM from [76], Constrained Multilinear Discriminant Analysis (CMDA), and Direct General Tensor Discriminant Analysis (DGTDA) from [112] as competitors. 
These existing approaches can only take a single tensor or matrix as the feature for classification. As a result, they are not able to exploit the multi-modality of the simulated data. We apply these approaches on every single data modality in our simulated data, and compare their classification performance with C-STM, which uses the multimodal data. We generate synthetic data using the idea from [61]. Suppose the two data modalities in our classification problems are
$$\mathcal{X}_{t,1} = \sum_{k=1}^{3}\mathbf{x}_{k,t,1}^{(1)}\circ\mathbf{x}_{k,t,1}^{(2)}\circ\mathbf{x}_{k,t,1}^{(3)}, \qquad X_{t,2} = \sum_{k=1}^{3}\mathbf{x}_{k,t,2}^{(1)}\circ\mathbf{x}_{k,t,2}^{(2)}, \qquad (5.21)$$
where the $\mathcal{X}_{t,1}$ are three-way tensors of size 30 by 20 by 10 and the $X_{t,2}$ are matrices of size 50 by 10. Both of them have CP rank equal to 3. To generate data for the simulation study, we first generate the latent factors (vectors) from various multivariate normal distributions, and then convert these factors into full tensors $\mathcal{X}_{t,1}$ and matrices $X_{t,2}$ using equation (5.21). The multivariate normal distributions used to generate the columns of the latent factors in equation (5.21) are specified in Table 5.1 below. In Table 5.1, we use $c = 1, 2$ to denote data from the two different classes.

Table 5.1: Distribution specifications for the simulation study; MVN stands for multivariate normal distribution, I are identity matrices, and bold numbers are vectors whose elements are all equal to that number.

                      Tensor Factors                  Shared Factors                Matrix Factors
Simulation   c   x^(1)_{k,t,1}   x^(2)_{k,t,1}   x^(3)_{k,t,1} = x^(2)_{k,t,2}   x^(1)_{k,t,2}
Case 1       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1.5, I)     MVN(1, I)       MVN(1, I)                       MVN(1.25, I)
Case 2       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1.5, I)     MVN(1, I)       MVN(1, I)                       MVN(1.5, I)
Case 3       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1.5, I)     MVN(1, I)       MVN(1, I)                       MVN(1.75, I)
Case 4       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1.5, I)     MVN(1, I)       MVN(1, I)                       MVN(2, I)
Case 5       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1.5, I)     MVN(1, I)       MVN(1, I)                       MVN(2.25, I)
Case 6       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(2, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
Case 7       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(2, I)
Case 8       1   MVN(1, I)       MVN(1, I)       MVN(1, I)                       MVN(1, I)
             2   MVN(1, I)       MVN(1, I)       MVN(2, I)                       MVN(1, I)

There are eight different cases in our simulation study. In Cases 1-5, one of the tensor factors and the matrix factors are generated from different multivariate normal distributions for data in different classes. This means the tensor and matrix data both contain certain class information (discriminant power), which differs across the data modalities. Notice that the discriminant power in one of the tensor factors remains the same across Cases 1-5, while the power in the matrix factor increases. Cases 6 and 7 assume the class information exists only in a single data modality. In Case 6, only one of the tensor factors is generated from different distributions for data in different classes. This factor then becomes the matrix factor in Case 7. In Case 8, the shared factors are sampled from different distributions, meaning that both the tensor and matrix data modalities carry class information. However, since this class information comes from the shared factors, it is the same across the different modalities. For each simulation case, we generate 50 pairs of tensor and matrix from both classes, collecting 100 pairs of observations in total.
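A sketch of this data-generation scheme is given below. It is illustrative only: the helper name, seed, and sampling code are assumptions, identity covariances are used as in Table 5.1, and Case 1 is shown as an example.

```python
import numpy as np

def make_pair(rng, mean_t1, mean_t2, mean_shared, mean_m1, r=3,
              dims=(30, 20, 10), j=50):
    """Generate one coupled (tensor, matrix) sample following (5.21).

    Column means follow one row of Table 5.1: mean_t1/mean_t2 for the two
    individual tensor factors, mean_shared for the coupled factor, and
    mean_m1 for the individual matrix factor.
    """
    I1, I2, I3 = dims
    U1 = rng.normal(mean_t1, 1.0, (I1, r))       # x^(1)_{k,t,1}
    U2 = rng.normal(mean_t2, 1.0, (I2, r))       # x^(2)_{k,t,1}
    S  = rng.normal(mean_shared, 1.0, (I3, r))   # x^(3)_{k,t,1} = x^(2)_{k,t,2}
    V1 = rng.normal(mean_m1, 1.0, (j, r))        # x^(1)_{k,t,2}
    X1 = np.einsum('ir,jr,kr->ijk', U1, U2, S)   # 30 x 20 x 10 rank-3 tensor
    X2 = V1 @ S.T                                # 50 x 10 coupled matrix
    return X1, X2

rng = np.random.default_rng(0)
# Case 1 of Table 5.1: 50 samples per class
class1 = [make_pair(rng, 1.0, 1.0, 1.0, 1.0)  for _ in range(50)]
class2 = [make_pair(rng, 1.5, 1.0, 1.0, 1.25) for _ in range(50)]
```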
We then perform a random training and testing set separation by randomly choosing 20 samples as the testing set, and use the remaining data as the training set. The random selection of the testing set is conducted in a stratified sampling manner such that the proportion of samples from each class remains the same in both the training and testing sets. For all models, we report the model prediction accuracy, i.e., the proportion of correct predictions over total predictions, on the testing set as the performance metric. The random training and testing set separation is repeated 50 times and the average prediction accuracy of these 50 repetitions for all the cases is reported in Figure 5.3. Additionally, the standard deviations are illustrated by the error bars in the figure. The results of CP-STM, CMDA, and DGTDA with tensor data are denoted by CPSTM1, CMDA1, and DGTDA1, respectively, in the figure. The results using matrix data are denoted by CPSTM2, CMDA2, and DGTDA2.

Figure 5.3: Simulation result: average accuracy rates shown in bar plots; standard deviations of the accuracy rates shown by error bars.

From Figure 5.3, we can conclude that our C-STM has a more favorable performance in this multimodal classification problem compared with the other competitors. Its accuracy rates are significantly larger than those of the other methods in most cases. In particular, we can see that the accuracy rates of C-STM (orange) increase from Case 1 to Case 5, while the accuracy rates of CP-STM using tensor data remain the same. This is because the difference between the class mean vectors for the first tensor factor does not change from Case 1 to Case 5. However, the gap between the class mean vectors in the matrix factor increases. Due to this fact, both C-STM and CP-STM (yellow), which utilize the matrix data, achieve better performance from Case 1 to Case 5. More importantly, C-STM always outperforms CP-STM with matrix data as it enjoys the extra class information from the multiple modalities. In Cases 6 and 7, where the class information is in a single data modality, the advantage of C-STM is not as significant as in the previous cases, though its performance is slightly better than CP-STM. This indicates that C-STM can provide robust classification results even when extra data modalities do not provide any other class information, as it can extract more accurate estimates of the factors in the decomposition step. In Case 8, where the class information comes from the shared factors, C-STM recovers the shared factors accurately and provides significantly better classification accuracy. Through this simulation, we showed that C-STM has a clear advantage when using multimodal data in classification problems, and is robust to redundant data modalities.
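The repeated stratified 80/20 evaluation protocol above can be written compactly as follows. This is an illustrative sketch using scikit-learn rather than the authors' original evaluation script, and the predict_labels hook is a hypothetical stand-in for whichever classifier is being evaluated.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def evaluate(y, predict_labels, n_splits=50, test_size=0.2, seed=0):
    """Repeated stratified train/test evaluation; returns mean and std of accuracy.

    predict_labels(train_idx, test_idx) should fit a model on the training
    indices and return predicted labels for the test indices (hypothetical hook).
    """
    y = np.asarray(y)
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size,
                                      random_state=seed)
    accs = []
    for train_idx, test_idx in splitter.split(np.zeros((len(y), 1)), y):
        y_pred = predict_labels(train_idx, test_idx)
        accs.append(np.mean(y_pred == y[test_idx]))
    return np.mean(accs), np.std(accs)

# Toy usage with a dummy predictor that always outputs class 1 (placeholder)
y_demo = np.array([1] * 50 + [-1] * 50)
mean_acc, std_acc = evaluate(y_demo, lambda tr, te: np.ones(len(te)), n_splits=5)
```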
5.5.2.1 Kernel Selection

In this section, we evaluate and justify the choice of the kernel function presented in (5.8). In this formulation, the individual and coupled modes for each modality are first separated, and then the individual modes of each modality are combined as a tensor kernel function. The kernels from the individual modes are then added to those from the coupled modes to obtain the final form, where the weights for the individual and coupled parts can be optimized as discussed in the Appendix. This kernel formulation separates the coupled and individual information and integrates them as a linear combination. Although this is not the only way to integrate kernels, the relatively simple structure of such a combination provides us with several benefits such as interpretability, convenient parameter tuning, and generalizability for multimodal data. With equation (5.8), it is possible to explain the contribution of different data modalities to the discrimination power by looking at the weight parameters. Further, with a linear combination of kernels, the weight parameters can be tuned with the different approaches introduced in [71], such as Group Lasso. Even though we do not adopt these tuning techniques in this chapter, this still shows the advantage of choosing such a combination and can be the foundation for future work. Lastly, this kernel combination can easily be extended to data with more modalities, since kernels are appended linearly.

Besides the aforementioned reasons, we also provide numerical experiments to illustrate the performance of our choice against other kernel combination choices. In these experiments the factor sizes are the same ($\mathcal{X}_1 \in \mathbb{R}^{40\times40\times40}$ and $X_2 \in \mathbb{R}^{40\times40}$, $r = 3$) so that the kernels are balanced across the modes. We consider two cases, i.e., Case 8 in Table 5.1, and Case 9, where the columns of the latent factors corresponding to all individual and coupled modes of the second class are drawn from the distribution MVN(2, I). Although there can be many different kernel combinations, we select four particular formulations for comparison as they can serve as a basis for other choices. The particular formulations are the weighted combination of individual kernels from all modes (K2), the weighted combination of the tensor kernels corresponding to the two modalities (K3), and the tensor kernel corresponding to all modes across modalities (K4). The formulations for the different kernels are given in Table 5.2. We report average classification accuracy across 50 simulations, where the simulated tensors are randomly initialized.

Table 5.2: Various kernel combination schemes. Note that K2^(2) = K1^(3).

      Combination Scheme
K1    w1 K1^(1) K1^(2) + w2 K1^(3) + w3 K2^(1)
K2    w1 K1^(1) + w2 K1^(2) + w3 K1^(3) + w4 K2^(1)
K3    w1 K1^(1) K1^(2) K1^(3) + w2 K2^(1) K2^(2)
K4    K1^(1) K1^(2) K1^(3) K2^(1)

In Table 5.3, we can see that the kernel selection schemes K1 and K2 perform the best. K3 performs slightly worse as it is not as flexible as the previous two. Finally, K4 performs the worst as it is affected by all modes simultaneously, and cannot generalize well. While the difference in performance between the kernels is not significant, K3 and K4 cannot determine whether the observed class differences are due to an individual mode or a coupled mode. Thus, K1 and K2 are better in terms of explaining the results. For Case 8, in most cases, cross-validation across a range of weight parameters for K1 and K2 yields $w_2 = 1$ and $w_3 = 1$, respectively, and the remaining weights are equal to zero.
This directly identifies the source of discriminability and allows for better interpretability, which is not possible for K3 and K4. Finally, K1 has less number of parameters than K2 and this can be advantageous in cases with high number of modalities. The smaller number of parameters make cross-validation simpler, while still allowing for some interpretability. Table 5.3: Classification accuracy using different kernel combinations. K1 K2 K3 K4 Case 9 0.91 ± 0.036 0.9 ± 0.043 0.87 ± 0.038 0.82 ± 0.045 Case 8 0.90 ± 0.039 0.9 ± 0.038 0.88 ± 0.036 0.83 ± 0.06 5.5.3 EEG-fMRI Data In this section, we present the application of the proposed method on simultaneous EEG-fMRI data. The simultaneous electroencephalography (EEG) with functional magnetic resonance imaging (fMRI) is one of the most popular non-invasive multimodal brain imaging techniques to study human brain function. EEG records electrical activity from the scalp resulting from ionic current within the neurons of the brain. Its millisecond temporal resolution makes it possible to record event-related potentials that occur in response to visual, auditory and sensory stimuli ([172, 1]). While EEG provides high temporal resolution, its spatial resolution is limited by the number of electrodes placed on the scalp and thus provides less spatial resolu- 100 tion compared to other neuroimaging modalities such as magnetic resonance imaging (MRI) and Positron Emission Tomography (PET). As a result, it has been commonplace to record EEG data in conjunction with a high spatial resolution modality. As another powerful tool in studying human brain function, blood oxygenation level dependent (BOLD) functional magnetic resonance imaging (fMRI) provides signals with much higher spatial resolution to reflect hemodynamic changes in blood oxygenation level at all voxels related to neuronal activities ([138, 15, 101, 62]). Recording simultaneous EEG and fMRI can provide high resolution information at both the spatial and temporal dimensions at the same time. Thus, developing novel machine learning techniques to utilize such multimodal data is of great significance. In this application, we apply our C-STM model to a binary trial classification problem on a simultaneous EEG-fMRI data. The data is obtained from the study [178]. In this study, there are seventeen individuals (six females, average age 27.7) participated in three runs each of analogous visual and auditory oddball paradigms. The 375 (125 per run) total stimuli per task were presented for 200 ms each with a 2-3 s uniformly distributed variable inter-trial interval. A trial is defined as a time window in which subjects receive stimuli and make responses. In the visual task, a large red circle on isoluminant gray backgrounds was considered as the target stimuli, and a small green circle were the standard stimuli. For the auditory task, the standard and oddball stimuli were, respectively, 390 Hz pure tones and broadband sounds which sound like "laser guns". During the experiment, the stimuli were presented to all subjects, and their EEG and fMRI data are collected simultaneously and continuously. We obtain the EEG and fMRI data from OpenNeuro website (https://openneuro.org/datasets/ds000116/versions/00003). We utilize both EEG and fMRI in this data set with our C-STM model to class stimulus types in all the trials. Through our numerical study, we want to demonstrate the fact our C-STM model enjoys the advantage of data multimodality and provides more accurate class predictions. 
The data from Subject 4 are dropped since its fMRI data are corrupted. We pre-process both the EEG and fMRI data with Statistical Parametric Mapping (SPM 12) ([11]) and Matlab. The EEG data is collected by a custom built MR-compatible EEG system with 49 channels. [178] provides a version of re-referenced EEG data with 34 channels which are used in our experiment. This version of EEG data are sampled at 1,000 Hz, and are downsampled to 200 Hz at the beginning of pre-processing. We then remove both low-frequency and high-frequency noise in the data using SPM filter functions. As the last step of EEG pre-processing, we define trials from Brain Imaging Data Structure (BIDS) files [136] and extract EEG data epochs recorded within the trial-related time windows. The time window for each trial is considered to go from 100 ms before the stimulus onset until 500 ms after the stimulus. For each trial, 101 we construct a three-mode tensor corresponding to the EEG data for all subjects where the modes represent channel × time × subject. We denote it as X𝑡 ,1 ∈ R34×121×16 . The fMRI data is collected by 3T Philips Achieva MR Scanner with 170 volumes (TR = 2s) per session. Each 3D volume contains 32 slices. The voxel size in the image is 3 x 3 x 4 mm. For each subject, we realign all the fMRI volumes from multiple sessions to the mean volume, and co-register the participant’s T1 weighted anatomical scan to the mean fMRI volume. Next, we normalize all the fMRI volumes to match the MNI brain template ([102]) by creating segments from co-registered T1 weighted scan, and keep the voxel size as 3 x 3 x 4 mm. All normalized fMRI volumes are then smoothed by 3D Gaussian kernels with full width at half maximum (FWHM) parameter being 8 × 8 × 8. After the pre-processing, we further perform a regular statistical analysis ([119, 190]) to extract fMRI volumes from visual and auditory stimulus related voxels. Such data are also known as Region of Interest (ROI) data. We extract fMRI volumes from 178 voxels (in Figure 5.4a) for auditory oddball tasks, and 112 voxels for auditory tasks. As a result, fMRI data are modeled by matrices whose rows and columns stand for voxels and subjects: X𝑡 ,2 ∈ R16×178 for auditory task data, and X𝑡 ,2 ∈ R16×112 for visual task data. There is no time mode in fMRI data because the trial duration is less than the repetition time of fMRI (time for obtaining a single 3D volume fMRI). For each trial, there is only one 3D scan of fMRI collected from a single subject. The ROI data then becomes a vector for this subject in the trial as we extract volumes from the regions of interest. To classify trials with oddball and standard stimulus, we collect 140 multimodal data samples (X𝑡 ,1 , X𝑡 ,2 ) from auditory tasks, and 100 samples from visual tasks. For both types of tasks, the numbers of oddball and standard trials are equal. We consider the trials with oddball stimulus as the positive class, and the trials with standard stimulus as the negative class. Like the procedures in our simulation study, we select 20% of data as testing set, and use the remaining 80% for model estimation and validation. The classification accuracy, precision (positive predictive rate), sensitivity (true positive rate), and specificity (true negative rate) of classifiers are calculated using the test set at each experiment. The experiment is repeated multiple times, and the average accuracy, precision, sensitivity, and specificity, and their standard deviations (in subscripts) are reported in Table 5.4. 
The single mode classifiers CP-STM, CMDA, and DGTDA are also applied to either the EEG or the fMRI data for comparison. Single mode classifiers applied to the EEG data are denoted by appending "1" to their names, and those applied to the fMRI data by appending "2". The area under the curve (AUC) for all classifiers is also reported in Table 5.4.

Table 5.4: Real Data Result: Simultaneous EEG-fMRI Data Trial Classification (Mean of Performance Metrics with Standard Deviations in Parentheses)

Task      Method    Accuracy     Precision    Sensitivity  Specificity  AUC
Auditory  C-STM     0.89 (0.05)  0.83 (0.07)  1.00 (0.00)  0.77 (0.11)  0.89 (0.06)
          CP-STM1   0.80 (0.08)  0.71 (0.11)  1.00 (0.00)  0.60 (0.12)  0.78 (0.06)
          CP-STM2   0.83 (0.06)  0.76 (0.07)  0.99 (0.05)  0.65 (0.11)  0.82 (0.05)
          CMDA1     0.55 (0.10)  0.51 (0.09)  0.96 (0.09)  0.20 (0.21)  0.55 (0.06)
          CMDA2     0.67 (0.09)  0.61 (0.11)  0.92 (0.07)  0.46 (0.14)  0.70 (0.08)
          DGTDA1    0.55 (0.09)  0.51 (0.09)  0.94 (0.07)  0.23 (0.12)  0.59 (0.06)
          DGTDA2    0.67 (0.09)  0.60 (0.10)  0.90 (0.09)  0.46 (0.13)  0.68 (0.08)
Visual    C-STM     0.86 (0.06)  0.82 (0.09)  0.93 (0.07)  0.77 (0.12)  0.86 (0.06)
          CP-STM1   0.76 (0.08)  0.66 (0.11)  1.00 (0.00)  0.54 (0.12)  0.78 (0.05)
          CP-STM2   0.77 (0.08)  0.70 (0.11)  0.98 (0.08)  0.58 (0.17)  0.77 (0.07)
          CMDA1     0.53 (0.12)  0.52 (0.11)  0.94 (0.11)  0.11 (0.18)  0.54 (0.08)
          CMDA2     0.65 (0.13)  0.61 (0.14)  0.91 (0.09)  0.43 (0.19)  0.66 (0.09)
          DGTDA1    0.56 (0.11)  0.54 (0.11)  0.94 (0.06)  0.17 (0.12)  0.56 (0.07)
          DGTDA2    0.64 (0.10)  0.60 (0.13)  0.86 (0.10)  0.44 (0.18)  0.64 (0.07)

The results in Table 5.4 show that the trial classification accuracy of C-STM using multimodal data is better than that of any classifier based on a single modality, with a significant improvement in average accuracy and average AUC. This improvement is observed for both the auditory and the visual task, which agrees with the conclusion from our simulation study. Also consistent with the simulation study, the tensor discriminant analysis methods do not work as well as CP-STM and C-STM. In addition, the performance of tensor discriminant analysis is clearly better on the fMRI data than on the EEG data. This is expected, since the regions extracted from the fMRI data are identified by group level fMRI statistical analysis; the data in these regions have already shown significant differences between trial types in the conventional analysis and are thus easy to classify. On the other hand, no prior analysis or feature extraction is applied to the EEG data, leaving a low signal to noise ratio. Nevertheless, C-STM can still take advantage of the EEG data and further increase the classification accuracy, highlighting its robustness and potential in processing noisy multimodal tensor data.

5.6 Conclusion

In this chapter, we have proposed a novel coupled support tensor machine classifier for multimodal data by combining advanced coupled matrix tensor factorization (ACMTF) and the support tensor machine (STM). The most distinctive feature of this classifier is its ability to integrate features across different modalities and structures. The proposed approach can simultaneously take matrix- and tensor-shaped data for classification and can easily be extended to inputs with more than two modalities. The coupled tensor matrix decomposition unveils the intrinsic correlation structure between data across different modalities, making it possible to integrate information from multiple sources efficiently. Such a decomposition also makes the whole method robust and applicable to large-scale noisy data with missing values.
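As a rough illustration of this two-stage design, the sketch below extracts per-sample factor matrices from each modality, forms a weighted combination of per-modality kernels, and trains a standard SVM on the precomputed kernel. The plain per-sample SVD stands in for the ACMTF factors, and the RBF kernel, weights, and toy data are illustrative assumptions rather than the actual implementation.

```python
import numpy as np
from sklearn.svm import SVC

def sample_factors(x, rank=3):
    # Stand-in feature extraction: leading left singular vectors per sample.
    u, _, _ = np.linalg.svd(x.reshape(x.shape[0], -1), full_matrices=False)
    return u[:, :rank]

def rbf(a, b, gamma=0.1):
    return np.exp(-gamma * np.linalg.norm(a - b) ** 2)

def combined_kernel(feats, weights, gamma=0.1):
    # Weighted sum of per-modality kernels between factor matrices.
    n = len(feats[0])
    K = np.zeros((n, n))
    for w, modality in zip(weights, feats):
        for i in range(n):
            for j in range(n):
                K[i, j] += w * rbf(modality[i], modality[j], gamma)
    return K

rng = np.random.default_rng(0)
eeg  = [rng.standard_normal((34, 121)) for _ in range(40)]   # toy modality 1
fmri = [rng.standard_normal((16, 178)) for _ in range(40)]   # toy modality 2
y = rng.integers(0, 2, 40)
feats = ([sample_factors(x) for x in eeg], [sample_factors(x) for x in fmri])
K = combined_kernel(feats, weights=(0.5, 0.5))
clf = SVC(kernel="precomputed").fit(K, y)                     # second stage: kernel STM analogue
```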
The newly designed kernel functions in C-STM provide feature-level information fusion, combining discriminant information from different modalities. Moreover, the kernel formulation makes it possible to emphasize the most discriminative features from each modality by tuning the weight parameters in the kernel function.

The most important theoretical extension of our current approach would be the development of an excess risk analysis for C-STM. In particular, we are looking for an explicit expression for the excess risk in terms of the data factors from multiple modalities to quantify the contribution of every single modality to minimizing the excess risk. By doing so, we would be able to interpret the importance of each data modality in classification tasks. In addition, quantifying the uncertainty of the tensor and matrix factor estimates and their impact on the excess risk would lay the foundation for further theoretical development. Future work will also focus on learning the weight parameters in the kernel function via optimization. As introduced in [71], the weights in the kernel function can be estimated by including a group lasso penalty in the objective function. This procedure allows the identification of the most significant components and reduces the cost of parameter selection. Finally, the proposed method can be extended to multimodal data with more than two modalities, and to regression problems. In conclusion, we believe C-STM offers many encouraging possibilities for multimodal data integration and analysis. Its capability of handling multimodal tensor inputs makes it appropriate for many advanced data applications in neuroscience and medical research, and we anticipate that this method will play an important role in a variety of applications.

Figure 5.4: Region of Interest (ROI). (a) Auditory Task. (b) Visual Task.

CHAPTER 6

CONCLUSIONS

In this thesis, we introduced new tensor based machine learning models using various data structures. In particular, we utilized tensor network structures, geometric models and multi-modal coupling for efficient tensor based unsupervised and supervised learning. The proposed methods in this thesis contribute significantly to the tensor learning literature by improving existing structures and using tensors to model new problems.

In Chapter 2, we proposed a tensor decomposition structure, the Multi-Branch Tensor Network, that provides improved storage and computational efficiency compared to existing architectures without sacrificing the representation power of the low-dimensional approximations. We explored two applications of the multi-branch structure in supervised and unsupervised settings and provided theoretical analysis of convergence as well as computational and storage complexities. In the supervised setting, a multi-branch implementation of LDA was introduced. The proposed approach reduced the computational complexity by orders of magnitude while improving the classification accuracy compared to vector-, Tucker- or TT-based methods. We also demonstrated that the proposed approach is more general than Tucker based MDA methods and the TT-based BTT and MPS, and provided a way of selecting the optimal structure depending on the data size. The proposed methods show robustness against the small-sample-size (SSS) problem, which is considered to be the main drawback of LDA [63], as they learn a smaller number of parameters thanks to the efficient multi-branch structure. The proposed multi-branch methods can further be utilized in any Ritz pair (extremal eigenvalue-eigenvector pair) computation.
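As a point of reference for this Ritz-pair view, the sketch below solves the classical (vectorized) LDA objective as a generalized eigenvalue problem, where the discriminant direction is the leading generalized eigenvector of the between- and within-class scatter pair. The toy Gaussian data, the ridge term, and the single projection direction are illustrative assumptions; the multi-branch factorization itself is not reproduced here.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(1.5, 1, (50, 10))])
y = np.repeat([0, 1], 50)

mean_all = X.mean(axis=0)
S_W = np.zeros((10, 10))
S_B = np.zeros((10, 10))
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_W += (Xc - mc).T @ (Xc - mc)                            # within-class scatter
    S_B += len(Xc) * np.outer(mc - mean_all, mc - mean_all)   # between-class scatter

# Extremal (Ritz) pair: leading generalized eigenvector of S_B u = lambda * S_W u.
evals, evecs = eigh(S_B, S_W + 1e-6 * np.eye(10))             # small ridge for stability
U = evecs[:, np.argsort(evals)[::-1][:1]]                     # top eigenvector(s)
print(evals.max(), U.shape)                                   # largest Ritz value, (10, 1)
```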
In the unsupervised setting, the proposed structure was used in a graph regularized optimization problem to reduce the computational complexity and to learn more effective subspaces for dimensionality reduction. The low-dimensional projections, or features, of the data are then used for clustering. Using the multi-branch structure reduced the computational complexity of the optimization by orders of magnitude, especially when the projected space was larger in dimension. This is crucial because using only a small number of features often does not provide acceptable clustering quality or acceptable values of other learning objectives. The proposed method outperformed the Tucker-based graph regularized method in terms of clustering quality, suggesting that multi-branch decompositions can learn subspaces that better fit the data than Tucker based methods even in a manifold learning setting.

In Chapter 3, we present a framework for robust tensor decomposition for anomaly detection in spatiotemporal data. In particular, we focus on urban traffic monitoring applications where the anomalies exhibit themselves as temporally contiguous events. Motivated by this application, we model spatiotemporal data as a low-rank plus sparse tensor, where the low-rank part corresponds to the underlying data and the sparse part corresponds to anomalies. The continuous nature of the anomalies is taken into account with the additional constraint that the sparse part is temporally continuous (LOSS). We explore two extensions of this model. First, to capture the underlying local geometric relations in the data, we consider graph regularization across each mode (GLOSS). This type of regularization is motivated by the assumption that the data lie on a product graph, which holds for many real world scenarios since real world data are often generated over graphs with such structures; for traffic data, this structure holds by design. Second, we modify the objective function by minimizing the graph total variation instead of the nuclear norm for the low-rank approximation to reduce the computational complexity (LOGSS). Finally, we incorporate a tensor completion framework into all of these methods to address missing data. The proposed methods are shown to improve anomaly detection performance compared to baseline methods and to have higher accuracy than existing robust tensor decomposition methods. In particular, using graph regularization in the objective improves the results significantly even when the nuclear norm penalties are dropped. This shows that geometric structure, rather than global structure such as rank, can be used to obtain highly accurate and low complexity tensor models.

In Chapter 4, we propose incorporating geometric structure in addition to global structure into tensor denoising and recovery formulations. Geometric tensor learning allows for better modeling of the underlying relations in data compared to purely algebraic measures. We propose a TT based robust PCA model where a graph smoothness penalty is applied to each mode-n canonical unfolding. Since this formulation results in computationally intractable matrix inversion problems, we propose an extension where we impose a Kronecker structure on the mode-n canonical graph. This structure is designed such that the redundancy across the different mode unfoldings is minimized. We prove the equivalence between the proposed graph smoothness penalty and the mode-n unfolding based graph smoothness penalty.
We show the advantage of using geometric approaches by comparing these two methods with purely algebraic objective functions. The proposed methods outperform existing methods in missing data imputation and denoising tasks. Moreover, using the Kronecker structured graph rather than the canonical graph provides similar results with improved computational efficiency. The results in this chapter illustrate that, with additional regularization of the topological structure, TT-based models can be further improved.

Finally, in Chapter 5, we utilize coupled tensor decomposition in a supervised setting. Heterogeneous data collected by sensing the same physical phenomenon are generally coupled. Coupled tensor decomposition methods have been utilized extensively for unsupervised learning tasks such as denoising, recovery and clustering of such heterogeneous data. In this chapter, we showed that these models can also be utilized in supervised settings for robust dimensionality reduction and feature extraction. To combine representations from different sources, we proposed using multiple kernel learning (MKL) methods. MKL uses a combination of single kernels, which can take inputs from various data sources, to obtain better results and to avoid potential bias from kernel selection [71]. We propose a two-step model where, in the first step, factor matrices for all samples are extracted by coupling the heterogeneous data sources. In the next step, we feed the extracted features, i.e., the factor matrices, to a kernelized support tensor machine. The proposed method shows better classification accuracy than supervised learning based on the individual data sources. The coupled decomposition extracts the features of each sample using information from both data sources, which provides better features for the subsequent STM and improves the accuracy compared to an STM trained on the features of individual modalities. The proposed method also outperforms methods that specifically learn discriminative factors from individual modalities, which illustrates the advantage of coupling.

6.1 Future Work

The work in this thesis suggests new research directions. In this section, we suggest areas of future work for each chapter.

6.1.1 Multi-Branch Tensor Learning

The proposed tensor decomposition structure is a hybrid between Tucker and TT decompositions and is hence applicable to any tensor decomposition problem. Thus, the structure can be utilized in many other supervised and unsupervised tasks, such as regression, STMs, dictionary learning, manifold learning, data recovery and compression. It can also be utilized to improve tensor-based deep learning methods, by compressing either the data or the network parameters, as it provides an optimal way of decomposing large tensors. Theoretical aspects of the representation power of this structure are also of interest, as they would lead to a deeper understanding of tensor decompositions. Since Tucker and TT decompositions are special cases of the multi-branch structure, this analysis would also provide concrete reasoning for the choice of decomposition structure depending on the problem and the data at hand.

6.1.2 Tensor Methods for Anomaly Detection

Chapter 3 reveals the utility of tensor based robust learning architectures for the problem of anomaly detection, specifically for spatiotemporal data. However, there are still open questions regarding the theoretical recovery guarantees for the proposed algorithms.
Future work can explore recovery bounds for the anomalies depending on the data structure. Furthermore, the proposed methods rely on the proper selection of the regularization parameters. Although we have shown empirically that the performance is robust over a wide range of parameter choices, these methods still require some tuning for desirable performance. A fully Bayesian extension of the proposed approach could be considered as future work to automatically estimate these parameters from the data. The anomaly detection step in this chapter was implemented by scoring each fiber individually with a separate algorithm, which may result in loss of information. Future work will consider a statistical tensor anomaly scoring method to avoid this simplification. Finally, since the specific application in this chapter is spatiotemporal data, a natural extension would be online anomaly detection. The online subspace tracking or functional data analysis literature may provide the necessary tools for such an extension.

6.1.3 Geometric Tensor Learning

In Chapter 4, we illustrate how using geometric relations within data improves robust tensor learning. Future work would quantify these improvements in terms of theoretical recovery bounds when such structure is utilized. Although we utilize a combination of global and local structure in our methods, it is also of interest to use only the local structure, i.e., the geometric relations, for data recovery. Recovery guarantees for such an approach are also of great interest, as they might allow milder conditions compared to algorithms that use global structure. In this chapter, we estimated the underlying graphs using a k-NN approach. It is known in the graph learning literature that, with noisy and missing data, this generally produces noisy estimates of the underlying graph. As such, future work will focus on simultaneous graph learning and data recovery. Finally, as the use of graphs corresponding to the canonical mode-n unfoldings requires excessive memory and computational resources, a multi-branch structure could be utilized to reduce these costs without approximating the graphs with a Kronecker structure. This would also pave the way for a fully geometric robust multi-branch decomposition.

6.1.4 Supervised Coupled Tensor Learning

In Chapter 5, coupled tensor factorization was utilized as a feature extraction step. However, without using the class label information, the extracted features might not be suitable for the subsequent classification task. To address this, we propose to employ a supervised coupled factorization. Specifically, we extend Multilinear Discriminant Analysis to coupled factorization by solving

\begin{equation}
\underset{U_n^{(i)}\,\forall n,i}{\text{minimize}} \quad \sum_{i=1}^{2}\sum_{n=1}^{d_i-1} \frac{\operatorname{tr}\!\left(U_n^{(i),\top} S_W^{i,n} U_n^{(i)}\right)}{\operatorname{tr}\!\left(U_n^{(i),\top} S_B^{i,n} U_n^{(i)}\right)} \;+\; \sum_{i=1}^{2}\sum_{n\in\mathfrak{N}} \left\| A_n^{i}\,\hat{U}_n^{i} - V_n \right\|_F^{2},
\tag{6.1}
\end{equation}

where S_W^{i,n} and S_B^{i,n} are the within- and between-class scatter matrices of modality i and mode n, 𝔑 is the set of coupled modes, and the A_n^i are the transformations through which the couplings are defined. The transformations account for the differences in properties and resolutions across modalities even when they correspond to the same phenomenon. To classify a test sample, one can use the Mahalanobis distance with respect to the class means and covariance. Another approach would be to train a classifier using the sample mode factors Y that are learned through a least squares optimization, and procedures similar to Approach 1 can be utilized for the test cases.
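To make the structure of (6.1) concrete, the sketch below simply evaluates the objective for a fixed set of factor matrices: a sum of per-mode trace ratios plus a Frobenius-norm coupling penalty over the coupled modes. The mode sizes, scatter matrices, couplings A_n^i, shared factors V_n, and coupled-mode set are toy placeholders; an actual method would alternate minimization over the U_n^(i) rather than just score them.

```python
import numpy as np

def coupled_mda_objective(U, S_W, S_B, A, V, coupled_modes):
    val = 0.0
    for i, factors in enumerate(U):                     # modalities i = 0, 1
        for n, Un in enumerate(factors):                # modes n of modality i
            val += np.trace(Un.T @ S_W[i][n] @ Un) / np.trace(Un.T @ S_B[i][n] @ Un)
    for i, factors in enumerate(U):
        for n in coupled_modes:                         # coupling penalty on coupled modes
            val += np.linalg.norm(A[i][n] @ factors[n] - V[n]) ** 2
    return val

rng = np.random.default_rng(0)
def spd(d):                                             # random SPD scatter matrix
    M = rng.standard_normal((d, d))
    return M @ M.T + d * np.eye(d)

dims = [[34, 121], [16]]                                # toy mode sizes per modality
r = 3
U = [[np.linalg.qr(rng.standard_normal((d, r)))[0] for d in dm] for dm in dims]
S_W = [[spd(d) for d in dm] for dm in dims]
S_B = [[spd(d) for d in dm] for dm in dims]
coupled_modes = [0]                                     # couple only the first mode
V = {0: rng.standard_normal((8, r))}                    # shared coupled factor
A = [{0: rng.standard_normal((8, dims[i][0]))} for i in range(2)]
print(coupled_mda_objective(U, S_W, S_B, A, V, coupled_modes))
```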
110 BIBLIOGRAPHY 111 BIBLIOGRAPHY [1] Rodolfo Abreu, Alberto Leal, and Patrícia Figueiredo. “EEG-informed fMRI: a review of data analysis methods”. In: Frontiers in human neuroscience 12 (2018), p. 29. [2] Evrim Acar, Tamara G Kolda, and Daniel M Dunlavy. “All-at-once optimization for coupled matrix and tensor factorizations”. In: arXiv preprint arXiv:1105.3422 (2011). [3] Evrim Acar et al. “ACMTF for fusion of multi-modal neuroimaging data and identification of biomarkers”. In: 2017 25th European Signal Processing Conference (EUSIPCO). IEEE. 2017, pp. 643–647. [4] Evrim Acar et al. “Structure-revealing data fusion”. In: BMC bioinformatics 15.1 (2014), pp. 1–17. [5] Evrim Acar et al. “Tensor-based fusion of EEG and FMRI to understand neurological changes in schizophrenia”. In: 2017 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE. 2017, pp. 1–4. [6] Evrim Acar et al. “Unraveling diagnostic biomarkers of schizophrenia through structure-revealing fusion of multi-modal neuroimaging data”. In: Frontiers in neuroscience 13 (2019), p. 416. [7] Hemant Kumar Aggarwal and Angshul Majumdar. “Hyperspectral image denoising using spatio- spectral total variation”. In: IEEE Geoscience and Remote Sensing Letters 13.3 (2016), pp. 442– 446. [8] Anima Anandkumar et al. “Tensor vs. matrix methods: Robust tensor decomposition under block sparse perturbations”. In: Artificial Intelligence and Statistics. PMLR. 2016, pp. 268–276. [9] Jeffrey S Anderson et al. “Functional connectivity magnetic resonance imaging classification of autism”. In: Brain 134.12 (2011), pp. 3742–3754. [10] Andreas Argyriou, Charles A Micchelli, and Massimiliano Pontil. “When is there a representer theorem? Vector versus matrix regularizers”. In: The Journal of Machine Learning Research 10 (2009), pp. 2507–2529. [11] John Ashburner et al. “SPM12 manual”. In: Wellcome Trust Centre for Neuroimaging, London, UK 2464 (2014). [12] Francis R Bach. “Consistency of the group lasso and multiple kernel learning.” In: Journal of Machine Learning Research 9.6 (2008). [13] Mohammad Taha Bahadori, Qi Rose Yu, and Yan Liu. “Fast Multivariate Spatio-temporal Analysis via Low Rank Tensor Learning.” In: NIPS. Citeseer. 2014, pp. 3491–3499. [14] Richard H. Bartels and George W Stewart. “Solution of the matrix equation AX+ XB= C [F4]”. In: Communications of the ACM 15.9 (1972), pp. 820–826. 112 [15] JW Belliveau et al. “Functional mapping of the human visual cortex by magnetic resonance imaging”. In: Science 254.5032 (1991), pp. 716–719. [16] Asa Ben-Hur and William Stafford Noble. “Kernel methods for predicting protein–protein interac- tions”. In: Bioinformatics 21.suppl_1 (2005), pp. i38–i46. [17] Johann A Bengua et al. “Efficient tensor completion for color image and video recovery: Low-rank tensor train”. In: IEEE Transactions on Image Processing 26.5 (2017), pp. 2466–2479. [18] Johann A Bengua et al. “Matrix product state for higher-order tensor compression and classification”. In: IEEE Transactions on Signal Processing 65.15 (2017), pp. 4019–4030. [19] Kristin P Bennett, Michinari Momma, and Mark J Embrechts. “MARK: A boosting algorithm for heterogeneous kernel models”. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 2002, pp. 24–31. [20] Gouri Sankar Bhunia et al. “Spatial and temporal variation and hotspot detection of kala-azar disease in Vaishali district (Bihar), India”. In: BMC infectious diseases 13.1 (2013), p. 64. [21] Xuan Bi et al. “Tensors in statistics”. 
In: Annual Review of Statistics and Its Application 8 (2020). [22] Amit Boyarski, Sanketh Vedula, and Alex Bronstein. “Deep matrix factorization with spectral geometric regularization”. In: arXiv preprint arXiv: 1911.07255 (2019). [23] Markus M Breunig et al. “LOF: identifying density-based local outliers”. In: Proceedings of the 2000 ACM SIGMOD international conference on Management of data. 2000, pp. 93–104. [24] Laura F Bringmann et al. “Changing dynamics: Time-varying autoregressive models using general- ized additive modeling.” In: Psychological methods 22.3 (2017), p. 409. [25] Laura F Bringmann et al. “Modeling nonstationary emotion dynamics in dyads using a time-varying vector-autoregressive model”. In: Multivariate behavioral research 53.3 (2018), pp. 293–314. [26] Rasmus Bro. “PARAFAC. Tutorial and applications”. In: Chemometrics and Intelligent Laboratory Systems 38.2 (1997), pp. 149–171. [27] Rasmus Bro, Claus A Andersson, and Henk AL Kiers. “PARAFAC2—Part II. Modeling chro- matographic data with retention time shifts”. In: Journal of Chemometrics: A Journal of the Chemometrics Society 13.3-4 (1999), pp. 295–309. [28] Saikiran Bulusu et al. “Anomalous example detection in deep learning: A survey”. In: IEEE Access 8 (2020), pp. 132330–132347. [29] Vince D Calhoun et al. “Method for multimodal analysis of independent source differences in schizophrenia: combining gray matter structural and auditory oddball functional data”. In: Human brain mapping 27.1 (2006), pp. 47–62. [30] E. J. Candès et al. “Robust principal component analysis?” In: Journal of the ACM (JACM) 58.3 113 (2011), p. 11. [31] Clayson Celes, Azzedine Boukerche, and Antonio AF Loureiro. “Crowd Management: A New Challenge for Urban Big Data Analytics”. In: IEEE Communications Magazine 57.4 (2019), pp. 20– 25. [32] Mohammadhossein Chaghazardi and Shuchin Aeron. “Sample, computation vs storage tradeoffs for classification using tensor subspace models”. In: arXiv preprint arXiv:1706.05599 (2017). [33] Raghavendra Chalapathy and Sanjay Chawla. “Deep learning for anomaly detection: A survey”. In: arXiv preprint arXiv:1901.03407 (2019). [34] Varun Chandola, Arindam Banerjee, and Vipin Kumar. “Anomaly detection: A survey”. In: ACM computing surveys (CSUR) 41.3 (2009), pp. 1–58. [35] Venkat Chandrasekaran et al. “Rank-sparsity incoherence for matrix decomposition”. In: SIAM Journal on Optimization 21.2 (2011), pp. 572–596. [36] Christos Chatzichristos et al. “Early soft and flexible fusion of EEG and fMRI via tensor decompo- sitions”. In: arXiv preprint arXiv:2005.07134 (2020). [37] Christos Chatzichristos et al. “Fusion of EEG and fMRI via soft coupled tensor decompositions”. In: 2018 26th European Signal Processing Conference (EUSIPCO). IEEE. 2018, pp. 56–60. [38] Cong Chen et al. “A support tensor train machine”. In: 2019 International Joint Conference on Neural Networks (IJCNN). IEEE. 2019, pp. 1–8. [39] Longbiao Chen et al. “Fine-grained urban event detection and characterization based on tensor cofactorization”. In: IEEE Transactions on Human-Machine Systems 47.3 (2016), pp. 380–391. [40] Xinyu Chen et al. “Missing traffic data imputation and pattern discovery with a Bayesian augmented tensor factorization model”. In: Transportation Research Part C: Emerging Technologies 104 (2019), pp. 66–77. [41] Mario Christoudias, Raquel Urtasun, Trevor Darrell, et al. “Bayesian localized multiple kernel learning”. In: Univ. California Berkeley, Berkeley, CA (2009). [42] A. Cichocki. 
“Era of big data processing: A new approach via tensor networks and tensor decom- positions”. In: arXiv preprint arXiv:1403.2048 (2014). [43] A. Cichocki et al. “Tensor decompositions for signal processing applications: From two-way to multiway component analysis”. In: IEEE Signal Processing Magazine 32.2 (2015), pp. 145–163. [44] Andrzej Cichocki et al. “Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions”. In: Foundations and Trends® in Machine Learning 9.4-5 (2016), pp. 249–429. [45] Andrzej Cichocki et al. “Tensor networks for dimensionality reduction and large-scale optimization: 114 Part 2 applications and future perspectives”. In: Foundations and Trends® in Machine Learning 9.6 (2017), pp. 431–673. [46] Lieven De Lathauwer and Joséphine Castaing. “Blind identification of underdetermined mixtures by simultaneous matrix diagonalization”. In: IEEE Transactions on Signal Processing 56.3 (2008), pp. 1096–1105. [47] Lieven De Lathauwer, Josphine Castaing, and Jean-Franois Cardoso. “Fourth-order cumulant-based blind identification of underdetermined mixtures”. In: IEEE Transactions on Signal Processing 55.6 (2007), pp. 2965–2973. [48] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. “A multilinear singular value decom- position”. In: SIAM journal on Matrix Analysis and Applications 21.4 (2000), pp. 1253–1278. [49] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. “On the best rank-1 and rank-(r 1, r 2,..., rn) approximation of higher-order tensors”. In: SIAM journal on Matrix Analysis and Applications 21.4 (2000), pp. 1324–1342. [50] Dingxiong Deng et al. “Latent space model for road networks to predict time-varying traffic”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, pp. 1525–1534. [51] Lei Deng et al. “Graph Spectral Regularized Tensor Completion for Traffic Data Imputation”. In: IEEE Transactions on Intelligent Transportation Systems (2021). [52] Wei Deng and Wotao Yin. “On the global and linear convergence of the generalized alternating direction method of multipliers”. In: Journal of Scientific Computing 66.3 (2016), pp. 889–916. [53] Renwei Dian, Shutao Li, and Leyuan Fang. “Learning a low tensor-train rank representation for hyperspectral image super-resolution”. In: IEEE transactions on neural networks and learning systems 30.9 (2019), pp. 2672–2683. [54] Yiming Ding et al. “A deep learning model to predict a diagnosis of Alzheimer disease by using 18F-FDG PET of the brain”. In: Radiology 290.2 (2019), pp. 456–464. [55] Youcef Djenouri et al. “A survey on urban traffic anomalies detection algorithms”. In: IEEE Access 7 (2019), pp. 12192–12205. [56] Sergey V Dolgov et al. “Computation of extreme eigenvalues in higher dimensions using block tensor train format”. In: Computer Physics Communications 185.4 (2014), pp. 1207–1216. [57] Haishun Du et al. “Sparse representation-based robust face recognition by graph regularized low-rank sparse representation recovery”. In: Neurocomputing 164 (2015), pp. 220–229. [58] James H Faghmous et al. “A parameter-free spatio-temporal pattern mining model to catalog global ocean dynamics”. In: 2013 IEEE 13th International Conference on Data Mining. IEEE. 2013, pp. 151–160. 115 [59] Hadi Fanaee-T and João Gama. “Tensor-based anomaly detection: An interdisciplinary survey”. In: Knowledge-Based Systems 98 (2016), pp. 130–147. [60] Hadi Fanaee-T and Joao Gama. “Event detection from traffic tensors: A hybrid model”. 
In: Neurocomputing 203 (2016), pp. 22–33. [61] Hadi Fanaee-T and Joao Gama. “SimTensor: A synthetic tensor data generator”. In: arXiv preprint arXiv:1612.03772 (2016). [62] Massimo Filippi, Roland Bammer, et al. MR imaging in white matter diseases of the brain and spinal cord. Springer, 2005. [63] Keinosuke Fukunaga. Introduction to statistical pattern recognition. Elsevier, 2013. [64] Glenn Fung et al. “A fast iterative algorithm for fisher discriminant using heterogeneous kernels”. In: Proceedings of the twenty-first international conference on Machine learning. 2004, p. 40. [65] Mostafa Reisi Gahrooei et al. “Multiple tensor-on-tensor regression: an approach for modeling processes with heterogeneous sources of data”. In: Technometrics 63.2 (2021), pp. 147–159. [66] Giovana Gavidia-Bovadilla et al. “Early prediction of Alzheimer’s disease using null longitudinal model-based classifiers”. In: PloS one 12.1 (2017), e0168011. [67] Matan Gavish and Ronald R Coifman. “Sampling, denoising and compression of matrices by coherent matrix organization”. In: Applied and Computational Harmonic Analysis 33.3 (2012), pp. 354–369. [68] Xiurui Geng et al. “A high-order statistical tensor based algorithm for anomaly detection in hyper- spectral imagery”. In: Scientific reports 4 (2014), p. 6869. [69] Mark Girolami and Mingjun Zhong. “Data Integration for Classification Problems Employing Gaussian Process Priors”. In: Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference. Vol. 19. MIT Press. 2007, p. 465. [70] Donald Goldfarb and Zhiwei Qin. “Robust low-rank tensor recovery: Models and algorithms”. In: SIAM Journal on Matrix Analysis and Applications 35.1 (2014), pp. 225–253. [71] Mehmet Gönen and Ethem Alpaydın. “Multiple kernel learning algorithms”. In: The Journal of Machine Learning Research 12 (2011), pp. 2211–2268. [72] Adrian R Groves et al. “Linked independent component analysis for multimodal data fusion”. In: Neuroimage 54.3 (2011), pp. 2198–2217. [73] Weiwei Guo, Irene Kotsia, and Ioannis Patras. “Tensor learning for regression”. In: IEEE Transac- tions on Image Processing 21.2 (2011), pp. 816–827. [74] Xian Guo et al. “Support tensor machines for classification of hyperspectral remote sensing imagery”. In: IEEE Transactions on Geoscience and Remote Sensing 54.6 (2016), pp. 3248–3264. 116 [75] Zhifeng Hao et al. “A linear support higher-order tensor machine for classification”. In: IEEE Transactions on Image Processing 22.7 (2013), pp. 2911–2920. [76] Lifang He et al. “Dusk: A dual structure-preserving kernel for supervised tensor learning with applications to neuroimages”. In: Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM. 2014, pp. 127–135. [77] Lifang He et al. “Kernelized support tensor machines”. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org. 2017, pp. 1442–1451. [78] Xiaofei He, Deng Cai, and Partha Niyogi. “Tensor subspace analysis”. In: Advances in neural information processing systems. 2006, pp. 499–506. [79] Victoria Hodge and Jim Austin. “A survey of outlier detection methodologies”. In: Artificial intelligence review 22.2 (2004), pp. 85–126. [80] Sebastian Holtz, Thorsten Rohwedder, and Reinhold Schneider. “On manifolds of tensors of fixed TT-rank”. In: Numerische Mathematik 120.4 (2012), pp. 701–731. [81] Yuwang Ji et al. “A Survey on Tensor Techniques and Applications in Machine Learning”. In: IEEE Access 7 (2019), pp. 162950–162990. [82] Bo Jiang et al. 
“Image representation and learning with graph-laplacian tucker tensor decomposition”. In: IEEE transactions on cybernetics 49.4 (2018), pp. 1417–1426. [83] Taisong Jin et al. “Low-rank matrix factorization with multiple hypergraph regularizer”. In: Pattern Recognition 48.3 (2015), pp. 1011–1022. [84] Vassilis Kalofolias et al. “Matrix completion on graphs”. In: arXiv preprint arXiv:1408.1717 (2014). [85] Maja H Kamstrup-Nielsen, Lea G Johnsen, and Rasmus Bro. “Core consistency diagnostic in PARAFAC2”. In: Journal of Chemometrics 27.5 (2013), pp. 99–105. [86] Esin Karahan et al. “Tensor analysis and fusion of multimodal brain images”. In: Proceedings of the IEEE 103.9 (2015), pp. 1531–1559. [87] Hiroyuki Kasai. “Fast online low-rank tensor subspace tracking by CP decomposition using recursive least squares from incomplete observations”. In: Neurocomputing 347 (2019), pp. 177–190. [88] Hiroyuki Kasai, Wolfgang Kellerer, and Martin Kleinsteuber. “Network volume anomaly detection and identification in large-scale networks based on online time-structured traffic tensor tracking”. In: IEEE Transactions on Network and Service Management 13.3 (2016), pp. 636–650. [89] Ali Khazaee, Ata Ebrahimzadeh, and Abbas Babajani-Feremi. “Application of advanced machine learning methods on resting-state fMRI network for identification of mild cognitive impairment and Alzheimer’s disease”. In: Brain imaging and behavior 10.3 (2016), pp. 799–817. [90] Boris N Khoromskij. “O (dlog N)-quantics approximation of N-d tensors in high-dimensional 117 numerical modeling”. In: Constructive Approximation 34.2 (2011), pp. 257–280. [91] Tae-Kyun Kim, Shu-Fai Wong, and Roberto Cipolla. “Tensor canonical correlation analysis for action classification”. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. IEEE. 2007, pp. 1–8. [92] Marius Kloft et al. “Efficient and accurate lp-norm multiple kernel learning.” In: NIPS. vol. 22. 22. 2009, pp. 997–1005. [93] Tamara G Kolda and Brett W Bader. “Tensor decompositions and applications”. In: SIAM review 51.3 (2009), pp. 455–500. [94] Xiangjie Kong et al. “HUAD: Hierarchical urban anomaly detection based on spatio-temporal data”. In: IEEE Access 8 (2020), pp. 26573–26582. [95] Jean Kossaifi et al. “T-net: Parametrizing fully convolutional nets with a single high-order tensor”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 7822–7831. [96] Jean Kossaifi et al. “Tensor regression networks”. In: arXiv preprint arXiv:1707.08308 (2017). [97] Irene Kotsia, Weiwei Guo, and Ioannis Patras. “Higher rank support tensor machines for visual recognition”. In: Pattern Recognition 45.12 (2012), pp. 4192–4203. [98] Daniel Kressner, Michael Steinlechner, and André Uschmajew. “Low-rank tensor methods with subspace correction for symmetric eigenvalue problems”. In: SIAM Journal on Scientific Computing 36.5 (2014), A2346–A2368. [99] J. B. Kruskal. “Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics”. In: Linear Algebra and its Applications 18.2 (1977), pp. 95–138. issn: 0024-3795. doi: https://doi.org/10.1016/0024-3795(77)90069-6. url: http://www.sciencedirect.com/science/article/pii/0024379577900696. [100] Alp Kut and Derya Birant. “Spatio-temporal outlier detection in large databases”. In: Journal of computing and information technology 14.4 (2006), pp. 291–297. [101] Kenneth K Kwong et al. 
“Dynamic magnetic resonance imaging of human brain activity during primary sensory stimulation.” In: Proceedings of the National Academy of Sciences 89.12 (1992), pp. 5675–5679. [102] Jack L Lancaster et al. “Bias between MNI and Talairach coordinates analyzed using the ICBM-152 brain template”. In: Human brain mapping 28.11 (2007), pp. 1194–1205. [103] Gert RG Lanckriet et al. “Learning the kernel matrix with semidefinite programming”. In: Journal of Machine learning research 5.Jan (2004), pp. 27–72. [104] Gisela Lechuga et al. “Discriminant analysis for multiway data”. In: International Conference on Partial Least Squares and Related Methods. Springer. 2014, pp. 115–126. 118 [105] Jae-Gil Lee, Jiawei Han, and Kyu-Young Whang. “Trajectory clustering: a partition-and-group framework”. In: Proceedings of the 2007 ACM SIGMOD international conference on Management of data. 2007, pp. 593–604. [106] Namgil Lee et al. “Nonnegative Tensor Train Decompositions for Multi-domain Feature Extraction and Clustering”. In: International Conference on Neural Information Processing. Springer. 2016, pp. 87–95. [107] Xu Lei, Pedro A Valdes-Sosa, and Dezhong Yao. “EEG/fMRI fusion based on independent com- ponent analysis: integration of data-driven and model-driven methods”. In: Journal of integrative neuroscience 11.03 (2012), pp. 313–337. [108] Li Li et al. “Trend modeling for traffic time series analysis: An integrated study”. In: IEEE Transactions on Intelligent Transportation Systems 16.6 (2015), pp. 3430–3439. [109] Peide Li and Taps Maiti. “Universal Consistency of Support Tensor Machine”. In: 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE. 2019, pp. 608– 609. [110] Ping Li et al. “Online robust low-rank tensor modeling for streaming data analysis”. In: IEEE transactions on neural networks and learning systems 30.4 (2018), pp. 1061–1075. [111] Quefeng Li and Lexin Li. “Integrative factor regression and its inference for multimodal data analysis”. In: arXiv preprint arXiv:1911.04056 (2019). [112] Qun Li and Dan Schonfeld. “Multilinear discriminant analysis for higher-order tensor data clas- sification”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 36.12 (2014), pp. 2524–2537. [113] Shuangjiang Li et al. “Low-rank tensor decomposition based anomaly detection for hyperspectral imagery”. In: 2015 IEEE International Conference on Image Processing (ICIP). IEEE. 2015, pp. 4525–4529. [114] Xiaoshan Li et al. “Tucker tensor regression and neuroimaging analysis”. In: Statistics in Biosciences 10.3 (2018), pp. 520–545. [115] Xutao Li et al. “MR-NTD: Manifold regularization nonnegative tucker decomposition for tensor data dimension reduction and representation”. In: IEEE transactions on neural networks and learning systems 28.8 (2016), pp. 1787–1800. [116] Yingjie Li et al. “Early prediction of Alzheimer’s disease using longitudinal volumetric MRI data from ADNI”. in: Health Services and Outcomes Research Methodology 20.1 (2020), pp. 13–39. [117] Ziyue Li et al. “Tensor completion for weakly-dependent data on graph for metro passenger flow prediction”. In: arXiv preprint arXiv:1912.05693 (2019). [118] Chaoguang Lin et al. “Anomaly detection in spatiotemporal data via regularized non-negative tensor analysis”. In: Data Mining and Knowledge Discovery 32.4 (2018), pp. 1056–1073. 119 [119] Martin A Lindquist et al. “The statistical analysis of fMRI data”. In: Statistical science 23.4 (2008), pp. 439–464. [120] Ji Liu et al. 
“Tensor completion for estimating missing values in visual data”. In: IEEE transactions on pattern analysis and machine intelligence 35.1 (2012), pp. 208–220. [121] Jingyu Liu et al. “Combining fMRI and SNP data to investigate connections between brain function and genetics using parallel ICA”. in: Human brain mapping 30.1 (2009), pp. 241–255. [122] Siqi Liu et al. “Early diagnosis of Alzheimer’s disease with deep learning”. In: 2014 IEEE 11th international symposium on biomedical imaging (ISBI). IEEE. 2014, pp. 1015–1018. [123] Eric F Lock. “Tensor-on-tensor regression”. In: Journal of Computational and Graphical Statistics 27.3 (2018), pp. 638–647. [124] Xiaojing Long et al. “Prediction and classification of Alzheimer disease based on quantification of MRI deformation”. In: PloS one 12.3 (2017), e0173372. [125] Canyi Lu et al. “Tensor robust principal component analysis with a new tensor nuclear norm”. In: IEEE transactions on pattern analysis and machine intelligence 42.4 (2019), pp. 925–938. [126] Haiping Lu, Konstantinos N Plataniotis, and Anastasios N Venetsanopoulos. “MPCA: Multilinear principal component analysis of tensor objects”. In: IEEE Transactions on Neural Networks 19.1 (2008), pp. 18–39. [127] Gal Mishne, Eric Chi, and Ronald Coifman. “Co-manifold learning with missing data”. In: International Conference on Machine Learning. PMLR. 2019, pp. 4605–4614. [128] Gal Mishne et al. “Data-driven tree transforms and metrics”. In: IEEE transactions on signal and information processing over networks 4.3 (2017), pp. 451–466. [129] John C Morris et al. “Pittsburgh compound B imaging and prediction of progression from cognitive normality to symptomatic Alzheimer disease”. In: Archives of neurology 66.12 (2009), pp. 1469– 1475. [130] Raziyeh Mosayebi and Gholam-Ali Hossein-Zadeh. “Correlated coupled matrix tensor factorization method for simultaneous EEG-fMRI data fusion”. In: Biomedical Signal Processing and Control 62 (2020), p. 102071. [131] Kevin P Murphy. “Switching kalman filters”. In: (1998). [132] Atsuhiro Narita et al. “Tensor factorization using auxiliary information”. In: Data Mining and Knowledge Discovery 25.2 (2012), pp. 298–324. [133] Sameer A Nene, Shree K Nayar, and Hiroshi Murase. Columbia Object Image Library (COIL-100). [134] Luong Ha Nguyen and James-A Goulet. “Anomaly detection with the switching kalman filter for structural health monitoring”. In: Structural Control and Health Monitoring 25.4 (2018), e2136. 120 [135] Yongming Nie et al. “Graph-regularized tensor robust principal component analysis for hyperspectral image denoising”. In: Applied optics 56.22 (2017), pp. 6094–6102. [136] Guiomar Niso et al. “MEG-BIDS, the brain imaging data structure extended to magnetoencephalog- raphy”. In: Scientific data 5.1 (2018), pp. 1–5. [137] Alexander Novikov et al. “Tensorizing neural networks”. In: Advances in Neural Information Processing Systems. 2015, pp. 442–450. [138] Seiji Ogawa et al. “Brain magnetic resonance imaging with contrast dependent on blood oxygenation”. In: proceedings of the National Academy of Sciences 87.24 (1990), pp. 9868–9872. [139] Ivan V Oseledets. “Approximation of 2ˆd\times2ˆd matrices using tensor decomposition”. In: SIAM Journal on Matrix Analysis and Applications 31.4 (2010), pp. 2130–2145. [140] Ivan V Oseledets. “Tensor-train decomposition”. In: SIAM Journal on Scientific Computing 33.5 (2011), pp. 2295–2317. [141] Alp Ozdemir, Edward M Bernat, and Selin Aviyente. “Recursive tensor subspace tracking for dynamic brain network analysis”. 
In: IEEE Transactions on Signal and Information Processing over Networks 3.4 (2017), pp. 669–682. [142] Yuqing Pan, Qing Mai, and Xin Zhang. “Covariate-Adjusted Tensor Classification in High Dimen- sions”. In: Journal of the American Statistical Association (2018), pp. 1–15. [143] Evangelos Papalexakis, Konstantinos Pelechrinis, and Christos Faloutsos. “Spotting misbehaviors in location-based social networks using tensors”. In: Proceedings of the 23rd International Conference on World Wide Web. 2014, pp. 551–552. [144] Evangelos E Papalexakis, Alex Beutel, and Peter Steenkiste. “Network anomaly detection using co-clustering”. In: 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE. 2012, pp. 403–410. [145] Paul Pavlidis et al. “Gene functional classification from heterogeneous data”. In: Proceedings of the fifth annual international conference on Computational biology. 2001, pp. 249–255. [146] Nathanaël Perraudin and Pierre Vandergheynst. “Stationary signal processing on graphs”. In: IEEE Transactions on Signal Processing 65.13 (2017), pp. 3462–3477. [147] Shibin Qiu and Terran Lane. “A framework for multiple kernel support vector regression and its applications to siRNA efficacy prediction”. In: IEEE/ACM Transactions on Computational Biology and Bioinformatics 6.2 (2008), pp. 190–199. [148] Yuning Qiu et al. “A generalized graph regularized non-negative tucker decomposition framework for tensor data representation”. In: IEEE transactions on cybernetics (2020). [149] Matthew Roughan et al. “Spatio-temporal compressive sensing and internet traffic matrices (extended version)”. In: IEEE/ACM Transactions on Networking 20.3 (2011), pp. 662–676. 121 [150] Peter J Rousseeuw and Katrien Van Driessen. “A fast algorithm for the minimum covariance determinant estimator”. In: Technometrics 41.3 (1999), pp. 212–223. [151] Aliaksei Sandryhaila and Jose MF Moura. “Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure”. In: IEEE Signal Processing Magazine 31.5 (2014), pp. 80–90. [152] Katharina A Schindlbeck and David Eidelberg. “Network imaging biomarkers: insights and clinical applications in Parkinson’s disease”. In: The Lancet Neurology 17.7 (2018), pp. 629–640. [153] Bernhard Schölkopf et al. “Support vector method for novelty detection”. In: Advances in neural information processing systems. 2000, pp. 582–588. [154] Nauman Shahid, Francesco Grassi, and Pierre Vandergheynst. “Tensor Robust PCA on Graphs”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, pp. 5406–5410. [155] Nauman Shahid et al. “Fast robust PCA on graphs”. In: IEEE Journal of Selected Topics in Signal Processing 10.4 (2016), pp. 740–756. [156] Nauman Shahid et al. “Robust principal component analysis on graphs”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 2812–2820. [157] Abhishek Sharma and Maks Ovsjanikov. “Geometric Matrix Completion: A Functional View”. In: arXiv preprint arXiv:2009.14343 (2020). [158] Lei Shi, Aryya Gangopadhyay, and Vandana P Janeja. “STenSr: Spatio-temporal tensor streams for anomaly detection and pattern discovery”. In: Knowledge and Information Systems 43.2 (2015), pp. 333–353. [159] Nicholas D Sidiropoulos et al. “Tensor decomposition for signal processing and machine learning”. In: IEEE Transactions on Signal Processing 65.13 (2017), pp. 3551–3582. 
[160] Age Smilde, Rasmus Bro, and Paul Geladi. Multi-way analysis: applications in the chemical sciences. John Wiley & Sons, 2005. [161] Seyyid Emre Sofuoglu and Selin Aviyente. “Graph Regularized Tensor Train Decomposition”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 3912–3916. [162] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. “UCF101: A dataset of 101 human actions classes from videos in the wild”. In: arXiv preprint arXiv:1212.0402 (2012). [163] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008. [164] Yuting Su et al. “Graph regularized low-rank tensor representation for feature selection”. In: Journal of Visual Communication and Image Representation 56 (2018), pp. 234–244. 122 [165] Jing Sui et al. “Discriminating schizophrenia and bipolar disorder by fusing fMRI and DTI in a multimodal CCA+ joint ICA model”. In: Neuroimage 57.3 (2011), pp. 839–855. [166] Jimeng Sun, Dacheng Tao, and Christos Faloutsos. “Beyond streams and graphs: dynamic tensor analysis”. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006, pp. 374–383. [167] Hiroaki Tanabe et al. “Simple but effective methods for combining kernels in computational biology”. In: 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies. IEEE. 2008, pp. 71–78. [168] Dacheng Tao et al. “General tensor discriminant analysis and gabor features for gait recognition”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 29.10 (2007). [169] Dacheng Tao et al. “Supervised tensor learning”. In: Fifth IEEE International Conference on Data Mining (ICDM’05). IEEE. 2005, 8–pp. [170] Liang Tao et al. “Low rank approximation with sparse integration of multiple manifolds for data representation”. In: Applied Intelligence 42.3 (2015), pp. 430–446. [171] Joshua B Tenenbaum, Vin De Silva, and John C Langford. “A global geometric framework for nonlinear dimensionality reduction”. In: science 290.5500 (2000), pp. 2319–2323. [172] Michal Teplan et al. “Fundamentals of EEG measurement”. In: Measurement science review 2.2 (2002), pp. 1–11. [173] Ryota Tomioka, Kohei Hayashi, and Hisashi Kashima. “On the extension of trace norm to tensors”. In: NIPS Workshop on Tensors, Kernels, and Machine Learning. Vol. 7. 2010. [174] Ledyard R Tucker. “Implications of factor analysis of three-way matrices for measurement of change”. In: Problems in measuring change 15 (1963), pp. 122–137. [175] Ledyard R Tucker et al. “The extension of factor analysis to three-dimensional matrices”. In: Contributions to mathematical psychology 110119 (1964). [176] Manik Varma and Debajyoti Ray. “Learning the discriminative power-invariance trade-off”. In: 2007 IEEE 11th International Conference on Computer Vision. IEEE. 2007, pp. 1–8. [177] Ulrike Von Luxburg. “A tutorial on spectral clustering”. In: Statistics and computing 17.4 (2007), pp. 395–416. [178] Jennifer M Walz et al. “Simultaneous EEG-fMRI reveals temporal evolution of coupling between supramodal cortical attention networks and the brainstem”. In: Journal of Neuroscience 33.49 (2013), pp. 19212–19222. [179] Kaidong Wang et al. “Hyperspectral and Multispectral Image Fusion via Nonlocal Low-Rank Tensor Decomposition and Spectral Unmixing”. In: IEEE Transactions on Geoscience and Remote Sensing 58.11 (2020), pp. 7654–7671. 123 [180] Qi Wang et al. 
“Robust bi-stochastic graph regularized matrix factorization for data clustering”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2020). [181] Wenqi Wang, Vaneet Aggarwal, and Shuchin Aeron. “Principal Component Analysis with Tensor Train Subspace”. In: arXiv preprint arXiv:1803.05026 (2018). [182] Wenqi Wang, Vaneet Aggarwal, and Shuchin Aeron. “Principal component analysis with tensor train subspace”. In: Pattern Recognition Letters 122 (2019), pp. 86–91. [183] Wenqi Wang, Vaneet Aggarwal, and Shuchin Aeron. “Tensor train neighborhood preserving embed- ding”. In: IEEE Transactions on Signal Processing 66.10 (2018), pp. 2724–2732. [184] Xudong Wang and Lijun Sun. “Diagnosing spatiotemporal traffic anomalies with low-rank tensor autoregression”. In: IEEE Transactions on Intelligent Transportation Systems (2021). [185] Xudong Wang et al. “A probabilistic tensor factorization approach to detect anomalies in spatiotem- poral traffic activities”. In: 2019 IEEE Intelligent Transportation Systems Conference (ITSC). IEEE. 2019, pp. 1658–1663. [186] Yao Wang et al. “Hyperspectral image restoration via total variation regularized low-rank tensor decomposition”. In: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 11.4 (2017), pp. 1227–1243. [187] Yu Wang, Wotao Yin, and Jinshan Zeng. “Global convergence of ADMM in nonconvex nonsmooth optimization”. In: Journal of Scientific Computing 78.1 (2019), pp. 29–63. [188] Weizmann Facebase. http://www.wisdom.weizmann.ac.il/~/vision/FaceBase/. [189] Zaiwen Wen and Wotao Yin. “A feasible method for optimization with orthogonality constraints”. In: Mathematical Programming 142.1-2 (2013), pp. 397–434. [190] Keith J Worsley et al. “A general statistical analysis for fMRI data”. In: Neuroimage 15.1 (2002), pp. 1–15. [191] Elizabeth Wu, Wei Liu, and Sanjay Chawla. “Spatio-temporal outlier detection in precipitation data”. In: International Workshop on Knowledge Discovery from Sensor Data. Springer. 2008, pp. 115–133. [192] Kun Xie et al. “Graph based tensor recovery for accurate internet anomaly detection”. In: IEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE. 2018, pp. 1502–1510. [193] Ming Xu et al. “Anomaly detection in road networks using sliding-window tensor factorization”. In: IEEE Transactions on Intelligent Transportation Systems 20.12 (2019), pp. 4704–4713. [194] Ming Yan and Wotao Yin. “Self equivalence of the alternating direction method of multipliers”. In: Splitting Methods in Communication, Imaging, Science, and Engineering. Springer, 2016, pp. 165– 194. 124 [195] Shuicheng Yan et al. “Discriminant analysis with tensor representation”. In: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. IEEE. 2005, pp. 526–532. [196] Shuicheng Yan et al. “Multilinear discriminant analysis for face recognition”. In: IEEE Transactions on Image Processing 16.1 (2006), pp. 212–220. [197] Jing-Hua Yang et al. “Low-rank tensor train for tensor robust principal component analysis”. In: Applied Mathematics and Computation 367 (2020), p. 124783. [198] Jieping Ye, Ravi Janardan, and Qi Li. “Two-dimensional linear discriminant analysis”. In: Advances in Neural Information Processing Systems. 2005, pp. 1569–1576. [199] Tatsuya Yokota, Qibin Zhao, and Andrzej Cichocki. “Smooth PARAFAC decomposition for tensor completion”. In: IEEE Transactions on Signal Processing 64.20 (2016), pp. 5423–5436. [200] Hsiang-Fu Yu, Nikhil Rao, and Inderjit S Dhillon. 
“Temporal Regularized Matrix Factorization for High-dimensional Time Series Prediction.” In: NIPS. 2016, pp. 847–855. [201] Rose Yu and Yan Liu. “Learning from multiway data: Simple and efficient tensor regression”. In: International Conference on Machine Learning. PMLR. 2016, pp. 373–381. [202] Stefanos Zafeiriou. “Discriminant nonnegative tensor factorization algorithms”. In: IEEE Transac- tions on Neural Networks 20.2 (2009), pp. 217–235. [203] Huichu Zhang, Yu Zheng, and Yong Yu. “Detecting urban anomalies using multiple spatio-temporal data sources”. In: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2.1 (2018), pp. 1–18. [204] Junyu Zhang, Zaiwen Wen, and Yin Zhang. “Subspace methods with local refinements for eigenvalue computation using low-rank tensor-train format”. In: Journal of Scientific Computing 70.2 (2017), pp. 478–499. [205] Mingyang Zhang et al. “A decomposition approach for urban anomaly detection across spatiotem- poral data”. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press. 2019, pp. 6043–6049. [206] Mingyang Zhang et al. “Urban Anomaly Analytics: Description, Detection and Prediction”. In: IEEE Transactions on Big Data (2020). [207] Xing Zhang, Gongjian Wen, and Wei Dai. “A tensor decomposition-based anomaly detection algorithm for hyperspectral image”. In: IEEE Transactions on Geoscience and Remote Sensing 54.10 (2016), pp. 5801–5820. [208] Zemin Zhang et al. “Novel Methods for Multilinear Data Completion and De-noising Based on Tensor-SVD”. in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). July 2014. 125 [209] Qibin Zhao et al. “Tensor ring decomposition”. In: arXiv preprint arXiv:1606.05535 (2016). [210] Yu-Bang Zheng et al. “Tensor N-tubal rank and its convex relaxation for low-rank tensor recovery”. In: Information Sciences 532 (2020), pp. 170–189. [211] Hua Zhou, Lexin Li, and Hongtu Zhu. “Tensor regression with applications in neuroimaging data analysis”. In: Journal of the American Statistical Association 108.502 (2013), pp. 540–552. 126