STATISTICALLY CONSISTENT SUPPORT TENSOR MACHINE FOR MULTI-DIMENSIONAL DATA

By

Peide Li

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics – Doctor of Philosophy

2021

ABSTRACT

STATISTICALLY CONSISTENT SUPPORT TENSOR MACHINE FOR MULTI-DIMENSIONAL DATA

By

Peide Li

Tensors are generalizations of vectors and matrices for multi-dimensional data representation. Fueled by novel computing technologies, tensors have expanded to various domains, including statistics, data science, signal processing, and machine learning. Compared to traditional data representation formats, tensor data representation distinguishes itself with its capability of preserving complex structures and multi-way features of multi-dimensional data. In this dissertation, we explore several tensor-based classification models and their statistical properties. In particular, we propose a few novel support tensor machine methods for huge-size tensor and multimodal tensor classification problems, and study their classification consistency properties. These methods are applied to different applications for validation.

The first piece of work considers classification problems for gigantic multi-dimensional data. Although current tensor-based classification approaches have demonstrated extraordinary performance in empirical studies, they may face additional challenges, such as long processing times and insufficient computer memory, when dealing with big tensors. In chapter 3, we combine tensor-based random projection and support tensor machines, and propose a Tensor Ensemble Classifier (TEC) for ultra-high dimensional tensors, which aggregates multiple support tensor machines estimated from randomly projected CANDECOMP/PARAFAC (CP) tensors. This method utilizes Gaussian and sparse random projections to compress high-dimensional tensor CP factors, and predicts their class labels with support tensor machine classifiers. With the well-celebrated Johnson-Lindenstrauss Lemma and ensemble techniques, TEC methods are shown to be statistically consistent while having high computational efficiency for big tensor data. Simulation studies and real data applications, including Alzheimer's Disease MRI image classification and traffic image classification, are provided as empirical evidence to validate the performance of TEC models.

The second piece of work considers classification problems for multimodal tensor data, which are particularly common in neuroscience and brain imaging analysis. Utilizing multimodal data is of great interest for machine learning and statistics research in these domains, since it is believed that integrating features from multiple sources can potentially increase model performance while unveiling the interdependence between heterogeneous data. In chapter 4, we propose a Coupled Support Tensor Machine (C-STM) which adopts Advanced Coupled Matrix Tensor Factorization (ACMTF) and Multiple Kernel Learning (MKL) techniques for coupled matrix-tensor data classification. The classification risk of C-STM is shown to converge to the optimal Bayes risk, making it a statistically consistent rule. The framework can also be easily extended to multimodal tensors with more than two data modalities. The C-STM is validated through a simulation study as well as a simultaneous EEG-fMRI trial classification problem.
The empirical evidence shows that C-STM can utilize information from multiple sources and provide a better performance comparing to the traditional methods. Copyright by PEIDE LI 2021 To my parents and my grandmother. v ACKNOWLEDGEMENTS I have received support and assistance from many people throughout the writing of this dissertation and my journey toward PhD. I want to take a moment and thank them. First, I would like to express my deepest gratitude to my advisor Dr. Tapabrata Maiti, whose expertise is invaluable in exploring research questions. He always provides me with constructive insights and strong supports that sharpen my thinking and bring my work to a higher level. Without his guidance, I would not have made such a progress in this field. I would also like to thank my dissertation guidance committee members, Dr. Jiayu Zhou, Dr. Ping-shou Zhong, Dr. David Zhu, and Dr. Shrijita Bhattacharya. Their comments and suggestions are extremely beneficial for my research. I want to extend my appreciation to my collaborators, Dr. Selin Aviyente, Dr. Rejaul Karim, and Mr. Emre Sofuoglu. It is a great pleasure to work with them. Their expertise as well as dedication to scientific research help me to extend my work to a much boarder level. I am also grateful to the help I obtained from all the professors and staff members in the Department of Statistics and Probability. I really appreciate the wonderful courses as well as the assistance they provided. During my six years at Michigan State University, I made a lot of friends and met many kind peers. Thanks to them, I did not feel lonely during my PhD journey. I am very grateful to their sincerity and patience. I wish you all have a wonderful future. Finally, I would like to thank my parents and Miss Jialin Qu. Thank you for your accompany and concerns that support me to go through this journey and overcome difficulties especially under the COVID-19 pandemic. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Tensor Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.2 Tensor Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.3 Tensor Product Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.3 The Bayes Error and Classification Consistency . . . . . . . . . . . . . . . . . . . 12 1.3.1 The Bayes Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.3.2 Consistent Classification Rules . . . . . . . . . . . . . . . . . . . . . . . . 13 1.3.3 Surrogate Loss Consistency . . . . . . . . . . . . . . . . . . . . . . . . . 14 CHAPTER 2 TENSOR CLASSIFICATION MODELS . . . . . . . . . . . . . . . . . . . 16 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Tensor Classification Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.1 Support Tensor Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.2 Tensor Discriminant Analysis . . . . . . . . . . . . . . . . 
. . . . . . . . . 21 2.2.3 Tensor Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3 Statistical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.3.1 Universal Tensor Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.3.2 Consistency of CP-STM . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4 Real Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4.1 MRI Classification for Alzheimer’s Disease . . . . . . . . . . . . . . . . . 32 2.4.2 KITTI Traffic Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 CHAPTER 3 TEC: TENSOR ENSEMBLE CLASSIFIER FOR BIG DATA . . . . . . . . 40 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2.1 CP-STM for Tensor Classification . . . . . . . . . . . . . . . . . . . . . . 44 3.2.2 Random Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.3 Methology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.3.1 Tensor-Shaped Random Projection . . . . . . . . . . . . . . . . . . . . . . 47 3.3.2 Random-Projection-Based Support Tensor Machine (RPSTM) . . . . . . . 48 3.3.3 TEC: Ensemble of RPSTM . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4 Model Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.5 Statistical Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 vii 3.5.1 Excess Risk of TEC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.5.2 Excess Risk of RPSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.5.3 Price of Random Projection . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.5.4 Convergence of Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.6 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.7 Real Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.7.1 MRI Classification for Alzheimer’s Disease . . . . . . . . . . . . . . . . . 71 3.7.2 KITTI Traffic Image Classification . . . . . . . . . . . . . . . . . . . . . . 72 3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 CHAPTER 4 COUPLED SUPPORT TENSOR MACHINE FOR MULTIMODAL NEU- ROIMAGING DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2.1 CP Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2.2 CP Support Tensor Machine (CP-STM) . . . . . . . . . . . . . . . . . . . 82 4.2.3 Multiple Kernel Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.3.1 ACMTF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.3.2 Coupled Support Tensor Machine (C-STM) . . . . . . . . . . . . . . . . . 85 4.4 Model Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.5 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
88 4.6 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.7 Trial Classification for Simultaneous EEG-fMRI Data . . . . . . . . . . . . . . . . 94 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 APPENDIX A APPENDIX FOR CHAPTER 2 . . . . . . . . . . . . . . . . . . . . 99 APPENDIX B APPENDIX FOR CHAPTER 3 . . . . . . . . . . . . . . . . . . . . 103 APPENDIX C APPENDIX FOR CHAPTER 4 . . . . . . . . . . . . . . . . . . . . 124 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 viii LIST OF TABLES Table 2.1: Biological Information for Subjects in ADNI Study; MMSE: baseline Mini-Mental State Examination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Table 2.2: Real Data: ADNI Classification Comparison I . . . . . . . . . . . . . . . . . . . . . 34 Table 2.3: Real Data: Traffic Image Classification I . . . . . . . . . . . . . . . . . . . . . . . . 37 Table 3.1: TEC: Comparison of Computational Complexity . . . . . . . . . . . . . . . . . . . 55 Table 3.2: TEC Simulation Results I: Desktop with 32GB RAM . . . . . . . . . . . . . . . . . 69 Table 3.3: Real Data: ADNI Classification Comparison II . . . . . . . . . . . . . . . . . . . . 71 Table 3.4: Real Data: Traffic Image Classification II . . . . . . . . . . . . . . . . . . . . . . . 73 Table 4.1: Distribution Specifications for Simulation; 𝑀𝑉 𝑁 𝑁: multivariate normal distri- bution. 𝐼 : identity matrices. Bold numbers are vectors whose elements are all the same. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Table 4.2: Real Data Result: Simultaneous EEG-fMRI Data Trial Classification (Mean of Performance Metrics with Standard Deviations in Subscripts) . . . . . . . . . 95 Table B.1: TEC Simulation Results II: Cluster with 128GB RAM . . . . . . . . . . . . . . . . . 123 Table C.1: EEG-fMRI Data: Number of Trials per Subject . . . . . . . . . . . . . . . . . . 132 ix LIST OF FIGURES Figure 1.1: Vector, Matrix, Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Figure 1.2: Tensor CP Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Figure 1.3: Tucker Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Figure 2.1: Real Data: ADNI Classification Reults I . . . . . . . . . . . . . . . . . . . . . . . 35 Figure 2.2: Real Data: Examples of Traffic Objects in KITTI Data . . . . . . . . . . . . . . . . 36 Figure 2.3: Real Data: Traffic Classification Result I . . . . . . . . . . . . . . . . . . . . . . . 38 Figure 3.1: Real Data: ADNI Classification Result II . . . . . . . . . . . . . . . . . . . . . . . 72 Figure 3.2: Real Data: Traffic Image Classification Result II . . . . . . . . . . . . . . . . . . . 74 Figure 4.1: C-STM Model Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Figure 4.2: Simulation: Average accuracy(bar plot) with standard deviation (error bar) . . . 92 Figure C.1: Auditory fMRI Group Level Analysis . . . . . . . . . . . . . . . . . . . . . . . 129 Figure C.2: Visual fMRI Group Level Analysis . . . . . . . . . . . . . . . . . . . . . . . . 130 Figure C.3: Region of Interest (ROI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 Figure C.4: EEG Channel Position from [141] . . . . . . . . . . . . . . . . . . . . . . . . . . 
133 Figure C.5: Examples of EEG Latent Factors (Different Trial and Stimulus Types): Topoplot for Channel Factors (left); Plots for Temporal Factors (right) . . . . . . . . . . . 134 x LIST OF ALGORITHMS Algorithm 1: Hinge STM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Algorithm 2: Squared Hinge STM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Algorithm 3: DGTDA Projection Learning . . . . . . . . . . . . . . . . . . . . . . . . . 25 Algorithm 4: CMDA Projection Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Algorithm 5: Tensor Discriminant Analysis Classification . . . . . . . . . . . . . . . . . 27 Algorithm 6: Tensor CP Generalized Linear Model . . . . . . . . . . . . . . . . . . . . . 28 Algorithm 7: Hinge TEC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Algorithm 8: Squared Hinge TEC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Algorithm 9: TEC Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Algorithm 10: ACMTF Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Algorithm 11: Coupled Support Tensor Machine . . . . . . . . . . . . . . . . . . . . . . . 89 xi CHAPTER 1 INTRODUCTION With the development of computer technologies, more and more data with complex structures are observed in various research domains. The high-dimensionality as well as the multi-dimensional structure of the data have raised new challenges in data analysis to the communities of engineering, statistics, and data science. Learning multi-dimensional data with traditional statistical learning methods may not be appropriate, since these methods can suffer from the curses of dimensionality. Moreover, traditional methods are not able to preserve the intrinsic structures for multi-dimensional data, and are not able to utilize their multi-way features. Thus, developing novel statistical learning frameworks and data modeling techniques for multi-dimensional data has become popular in contemporary machine learning and statistics analysis. As a generalization of vectors and matrices for higher-order data, tensor is originally proposed by [66], and becomes an efficient data representation format for multi-dimension data. Fueled by novel computing technologies arises in the past decade, tensors have expanded to many research domains such as statistics, data science, signal processing, and machine learning. Surveys from [78, 70, 18] demonstrate great potentials of using tensor data representation in data mining, statistics, and machine learning. It turns out that using tensor for multi-dimensional data learning can be efficient and appropriate since tensor can help to preserve multi-way structures for the data. Further, advanced operations in tensor algebra can also help to reduce computational cost, and, more importantly, unveil the complex correlation structures for the data. All these benefits make tensor data representation a perfect tool for learning multi-dimensional data. Similar to the traditional machine learning research, current tensor-based machine learning and data mining techniques can be categorized as supervised learning and unsupervised learning. In the category of supervised learning, there are tensor regression and classification models, which usually take tensors as inputs. 
Depending on the types of outputs, tensor regression models are separated into tensor-to-scalar regression [58, 155, 148, 152, 130, 93, 62, 91], and tensor-to-tensor 1 regression models [99, 121, 98, 51]. Moreover, there are tensor Bayes regression [57], tensor quantile regression [105], and tensor regression-based deep neural network model [80]. For tensor- based classification models, there are models which based on discriminant [147, 142, 103, 92]. In addition, many variants of support tensor machine models [136, 61, 63, 127, 64, 28] are also developed under the idea of maximum margin classifier. [114] provides a extension of probabilistic tensor discriminant analysis which extends linear discriminant to tensor data. Comparing to the supervised learning, research on unsupervised learning with tensors are more dominant. The tensor decomposition techniques in [78] and [113] can be applied in many different application fields for multi-way feature extraction, latent factor estimation, and tensor subspace learning. These decomposition methods are later extended to different application fields. For example, [68, 1, 102] use low-rank tensor decomposition and robust tensor principle component analysis for missing data imputation. In spatio-temporal analysis such as traffic or internet data analysis, tensor decomposition are alos adopted in [120, 30, 146] for spatial and temporal feature extraction. Tensor approaches are also widely applied in anomaly detection problems. Survey from [44] reviews multiple anomaly detection algorithms basing on tensor data representation, which include predicting tensor anomalies from multiple tensors [31] and identifying abnormal elements within a single tensor [87, 153, 156]. Since tensor decomposition can be considered as generalizations of spectral decomposition on higher-order data, [67, 150, 131, 16, 132] use various decomposition methods to perform clustering and community detection for mulit-dimensional and heterogeneous data. In graphcial model and network analysis, [40, 138, 12, 111, 149] use tensor to model the correlation structures among different heterogeneous structures. Additionally, tensor decomposition can be applied in research about recommend system such as [17, 154, 104]. Apart from these two major categories, tensor data representation is often time used for the development of efficient algorithms. For example, random projection is a popular dimension reduction technique but is expensive to apply for high dimensional vector data. Saving the projection matrices can also be memory inefficient. [133, 71, 119] show that tensorizing random projection matrices can reduce the memory cost significantly while preserve the asymptotic isometry property 2 of random projection for high dimensional data. Motivated by these existing work using tensor data representation, this dissertation further investigates the performance of tensor-based machine learning models with a focus on classification problems. Particularly, we explore the statistical property of current tensor classification methods as well as their performance in multiple applications. In addition, we propose novel tensor classifiers for big tensor and multimodal tensor data classification. These become two major contributions in this dissertation. 1.1 Overview The dissertation is organized as follow. In the rest part of this introduction chapter, we provide a review about tensor algebra, operations, and decomposition methods. 
We also briefly introduce some important statistical concepts in classification analysis in this chapter. The next three chapters in the dissertation explore tensor classification problems from different aspects. In chapter 2, We provide a survey about few most popular tensor classification methods in statistics and machine learning literature. All the methods are applied to Alzheimer’s Disease MRI Image classification and Traffic Image classification problems to benchmark their performances. We further investigate the classification consistency for a certain type of non-parameteric tensor classifier, which is CP Support Tensor Machine (CP-STM). We show that with certain tensor kernel functions, CP-STM is statistically consistent. Chapter 3 considers a specific tensor classification problem where the input tensors are in high dimension. In contemporary data science research, multi-dimensional observations such as spatial- temporal data, medical imaging data are usually coming with high dimensionality, i.e, the dimension of each mode is high even the data is in tensor shape. This raises extra challenges to the existing tensor-based classification models. To address the issue of high dimensionality, we propose a Tensor Ensemble Classifier (TEC) for ultra-high dimensional tensors, which aggregates multiple support tensor machines estimated from randomly projected CANDECOMP/PARAFAC (CP) tensors. This method utilizes Gaussian and spares random projections to compress high-dimensional tensor 3 CP factors, and predicts their class labels with support tensor machine classifier. With the well celebrated Johnson-Lindenstrauss Lemma and ensemble techniques, TEC methods are shown to be statistically consistent while having memory efficiency for big tensor data. Simulation studies and real data applications including Alzheimer’s Disease MRI Image classification and Traffic Image classification are provided as empirical evidence to validate the performance of TEC models. In the last chapter, we consider classification problems for multimodal tensor data, which are particularly common in neuroscience and brain imaging analysis. Utilizing multimodal data is of great interest for machine learning and statistics research in these domains, since it is believed that integration of features from multiple sources can potentially increase model performance while unveiling the interdependence between heterogeneous data. In chapter 4, we propose a Coupled Support Tensor Machine (C-STM) which adopted Advanced Coupled Matrix Tensor Factorization (ACMTF) and Multiple Kernel Learning (MKL) techniques for coupled matrix tensor data classification. The excess risk of C-STM is shown to be converging to the optimal Bayes risk, making itself a statistically consistent rule. The framework can also be easily extended for multimodal tensors with data modalities greater than two. The C-STM is validated with in a simulation study as well as in a simultaneous EEG-fMRI trial classification application. The empirical evidence shows that C-STM can utilize information from multiple source and provide a better performance comparing to the traditional methods. 1.2 Tensor Algebra In this section, we introduce notations, and review some elementary concepts about tensors. Detailed introduction for tensor algebra, tensor decomposition, and tensor product space can be referred from [77] and [59]. 1.2.1 Notations The mathematical notations in the rest part of the thesis are defined as follow. 
Numbers and scalars are denoted by lowercase and capital letters such as x, N. Vectors are denoted by boldface lowercase letters, e.g. a. Matrices are denoted by boldface capital letters, e.g. A, B. Higher-dimensional tensors are generalizations of vector and matrix representations for higher-order data, and are denoted by boldface Euler script letters such as X, Y. In general, functions and transformations are also denoted by boldface lowercase letters f, g, with a clear description to distinguish them from vectors. The only exception is the kernel function, which will be denoted by K(·, ·). Vector spaces, functional spaces, and tensor spaces are denoted by calligraphic letters such as H, F. Euclidean spaces with one or multiple dimensions are represented by R^{I_1} and R^{I_1 × I_2}, where I_1 and I_2 stand for the size of each dimension. In addition to these notations, we use E and P to denote expectation and probability for short. Other notations may also be used and will be introduced as needed in the following content.

[Figure 1.1: Vector, Matrix, Tensor — a vector a ∈ R^{I_1}, a matrix A ∈ R^{I_1 × I_2}, and a three-way tensor X ∈ R^{I_1 × I_2 × I_3}]

A tensor generalizes vectors and matrices by including multiple indices in its structure, making it possible to represent multi-dimensional data. Figure 1.1 provides a comparison between a vector, a matrix, and a tensor. The tensor X can represent three-dimensional data since it provides three indices. The order of a tensor is its number of dimensions, also known as ways or modes. For example, the vector a in figure 1.1 is a one-way tensor, the matrix A is a two-way tensor, and X is a three-way tensor. In general, a tensor can have d modes for any positive integer d. Entries of tensors are indexed in the same way as for vectors and matrices: the i-th entry of a vector x is x_i, the (i, j)-th element of a matrix X is x_{i,j}, and the (i_1, ..., i_d)-th element of a d-way tensor X is x_{i_1,...,i_d}. The indices i_1, ..., i_d of a tensor range from 1 to their capital versions, e.g. i_k = 1, ..., I_k for every mode k = 1, ..., d.

Sub-arrays of a tensor are formed when a subset of the indices is fixed. Similar to matrices, which have rows and columns, higher-dimensional tensors have various types of sub-arrays. For example, by fixing every index but one in a d-way tensor, we obtain one of its fibers, which are the analogue of matrix rows and columns. Another frequently used type of tensor sub-array is the slice, a two-dimensional section of a tensor, defined by fixing all but two indices. We will use X_{:i_2...i_d} to denote a fiber of a d-way tensor, and X_{::i_3...i_d} to denote one of its slices.

Like the L2 norm for vectors in Euclidean spaces, the L2 norm, also called the Frobenius norm, of a d-way tensor X ∈ R^{I_1 × ... × I_d} is the square root of the sum of the squares of all its elements, i.e.

||X||_{Fro} = \sqrt{< X, X >} = \sqrt{ \sum_{i_1=1}^{I_1} \cdots \sum_{i_d=1}^{I_d} x_{i_1,...,i_d}^2 }    (1.1)

where

< X_1, X_2 > = \sum_{i_1=1}^{I_1} \cdots \sum_{i_d=1}^{I_d} x_{1, i_1,...,i_d} \cdot x_{2, i_1,...,i_d}    (1.2)

is the inner product of two tensors X_1 and X_2. In the following content, we may use different types of inner products induced by kernel functions, and we will specify those inner products as needed.
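To make the notation above concrete, the following is a minimal NumPy sketch (the array shapes and variable names are illustrative assumptions, not part of the original text) showing how a fiber, a slice, the inner product (1.2), and the Frobenius norm (1.1) of a three-way tensor can be computed.

```python
import numpy as np

# A three-way tensor X of (assumed) shape I1 x I2 x I3
rng = np.random.default_rng(0)
I1, I2, I3 = 4, 5, 3
X = rng.standard_normal((I1, I2, I3))
Y = rng.standard_normal((I1, I2, I3))

# Mode-1 fiber: fix every index except the first one
fiber = X[:, 1, 2]            # shape (I1,)

# Frontal slice: fix all but the first two indices
slice_ = X[:, :, 0]           # shape (I1, I2)

# Inner product of two tensors, equation (1.2)
inner = np.sum(X * Y)

# Frobenius norm, equation (1.1): square root of the sum of squared entries
fro = np.sqrt(np.sum(X ** 2))
assert np.isclose(fro, np.linalg.norm(X))   # matches NumPy's built-in norm
```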
In many situations, one may need to transform a tensor into a vector or a matrix for computation. Such transformations are called tensor vectorization and unfolding. In this thesis, the vectorization of a tensor X ∈ R^{I_1 × ... × I_d} is denoted by Vec(X), which has dimension \prod_{j=1}^{d} I_j. Tensor unfolding reorders a tensor into a matrix, putting the mode-k fibers X_{i_1, i_2, ..., i_{k-1}, :, i_{k+1}, ..., i_d} as the columns of the matrix. As a result, the matrix has shape I_k × \prod_{j=1, j≠k}^{d} I_j, and is denoted by X_{(k)}. Although there are multiple ways of performing tensor vectorization and unfolding, the resulting vectors and matrices are equal up to a permutation. As long as the transformations are consistent, algorithms and theoretical analyses remain intact. We follow the tensor vectorization and unfolding rules from [79] in this thesis.

In addition to these basic concepts, we also need some operations on vectors and matrices in order to construct tensors and present our work. The first is the outer product of vectors. Let a ∈ R^p and b ∈ R^q be two column vectors. Their outer product is defined by

a ∘ b = a · b^T    (1.3)

which is a p × q matrix. If A ∈ R^{p×t} is a p by t matrix and b ∈ R^q is a column vector, then

A ∘ b = [A * b_1, ..., A * b_q]    (1.4)

which is a p × t × q array, where "*" stands for the element-wise product. Taking the outer product with a vector increases the order of the result by one dimension.

Another operation is the Kronecker product, which is a version of the outer product for matrices. Let A ∈ R^{I×J} and B ∈ R^{K×L} be two arbitrary matrices. The Kronecker product of A and B is A ⊗ B ∈ R^{(IK)×(JL)}:

A ⊗ B = \begin{bmatrix} a_{11} B & \cdots & a_{1J} B \\ \vdots & \ddots & \vdots \\ a_{I1} B & \cdots & a_{IJ} B \end{bmatrix} = [a_1 ⊗ b_1, a_1 ⊗ b_2, ..., a_J ⊗ b_{L-1}, a_J ⊗ b_L]    (1.5)

Compared with the vector outer product, it restricts the resulting product to be a matrix.

The Khatri-Rao product is the "matching column-wise" Kronecker product between two matrices with the same number of columns. Given matrices A ∈ R^{I×K} and B ∈ R^{J×K}, the product is defined as

A ⊙ B = [a_1 ⊗ b_1, ..., a_K ⊗ b_K]    (1.6)

It requires the two matrices to have the same number of columns, and the resulting product is a matrix as well. The vector outer product, the matrix Kronecker product, and the matrix Khatri-Rao product can all be regarded as tensor products in mathematical analysis. As a result, we may use ⊗ to denote a general tensor product in part of our theoretical development.

The mode-n product is a product operation defined between a tensor and a matrix. Assume X ∈ R^{I_1 × ... × I_n × ... × I_d} is a d-way tensor and U ∈ R^{P_n × I_n} is a matrix. The mode-n product between the tensor X and the matrix U is defined through its mode-n unfolding as

(X ×_n U)_{(n)} = U · X_{(n)}    (1.7)

where X_{(n)} is the n-th mode unfolding matrix of the tensor X with shape I_n by \prod_{j≠n} I_j. The resulting product is still a d-way tensor with shape I_1 × ... × P_n × ... × I_d.
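The following minimal NumPy sketch (all shapes and names are illustrative assumptions) demonstrates the outer product (1.3), the Kronecker product (1.5), the Khatri-Rao product (1.6), and a mode-1 product (1.7) computed through unfolding.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(3), rng.standard_normal(4)
A, B = rng.standard_normal((3, 2)), rng.standard_normal((4, 2))

# Outer product (1.3): a 3 x 4 matrix
outer = np.outer(a, b)

# Kronecker product (1.5): a (3*4) x (2*2) matrix
kron = np.kron(A, B)

# Khatri-Rao product (1.6): column-wise Kronecker, a (3*4) x 2 matrix
khatri_rao = np.vstack([np.kron(A[:, k], B[:, k]) for k in range(A.shape[1])]).T

# Mode-1 product (1.7): unfold, multiply, and fold back
X = rng.standard_normal((3, 4, 5))   # a 3-way tensor
U = rng.standard_normal((6, 3))      # maps mode 1 from size 3 to size 6
X1 = X.reshape(3, -1)                # a mode-1 unfolding (one valid column ordering)
Y = (U @ X1).reshape(6, 4, 5)        # X x_1 U, a tensor of shape 6 x 4 x 5
```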
1.2.2 Tensor Decomposition

The notations and mathematical operations introduced above make it possible to represent tensors in decomposed forms. Tensor decomposition is a way to represent, or approximate, a tensor with various pre-defined forms. With a specially designed structure, such a representation or approximation makes it more flexible to develop novel machine learning models for tensor data and to optimize existing frameworks by simplifying computation steps. In this section, we review three of the most popular tensor decomposition methods: Parafac, Tucker, and Tensor-Train decomposition.

Candecomp/Parafac (CP) Decomposition is an extension of matrix singular value decomposition to higher-order tensors. It represents a tensor as a summation of vector outer products, as shown in figure 1.2. Each product term in the summation is also known as a rank-one tensor. For a d-mode tensor X ∈ R^{I_1 × I_2 × ... × I_d}, its CP decomposition is defined as

X = \sum_{k=1}^{r} α_k \, x_k^{(1)} ∘ x_k^{(2)} ∘ ... ∘ x_k^{(d)}    (1.8)

where x_k^{(j)} ∈ R^{I_j} are called tensor CP components for j = 1, ..., d. The α_k are scalars and are often merged into one of the CP components for simplicity. As a result, the CP decomposition can also be written as

X = \sum_{k=1}^{r} x_k^{(1)} ∘ x_k^{(2)} ∘ ... ∘ x_k^{(d)}    (1.9)

In our presentation, we will merge the scalar weights α_k into the CP components and use equation (1.9) for the CP decomposition unless we specifically mention the weights. r is known as the CP rank of the tensor, which is the number of different outer products that add up to the tensor. For a tensor which cannot be well represented by equation (1.9), i.e. equation (1.9) does not hold exactly, its CP decomposition is defined as the best approximation of that form,

X ≈ X̂ = \sum_{k=1}^{r} x_k^{(1)} ∘ x_k^{(2)} ∘ ... ∘ x_k^{(d)},  where  X̂ = \arg\min_{X̂} ||X − X̂||_{Fro}

[Figure 1.2: Tensor CP Decomposition — a tensor approximated by a sum of rank-one outer products]

For convenience of notation, we follow [78] and denote the tensor CP decomposition (1.9) as

X = ⟦X^{(1)}, X^{(2)}, ..., X^{(d)}⟧  or  𝔘_X = ⟦X^{(1)}, X^{(2)}, ..., X^{(d)}⟧    (1.10)

where X^{(j)} ∈ R^{I_j × r} are called CP factor matrices. The k-th column of X^{(j)} is the vector-shaped CP factor x_k^{(j)} in equation (1.9). This notation is also called the Kruskal tensor. In this thesis, we will use either "tensor CP decomposition" or "CP tensor" to refer to any tensor expressed in the form (1.9) or (1.10).

Tucker Decomposition is a form of Principal Component Analysis for higher-order tensors, often referred to as Higher-order PCA (HOPCA). It factorizes a tensor into a core tensor multiplied by a factor matrix along each of its modes. The Tucker decomposition of a d-mode tensor X ∈ R^{I_1 × I_2 × ... × I_d} is defined as

X = G ×_1 U^{(1)} ×_2 U^{(2)} ... ×_d U^{(d)}    (1.11)

where G ∈ R^{P_1 × P_2 × ... × P_d} is the core tensor of shape P_1 × P_2 × ... × P_d, and U^{(j)} ∈ R^{I_j × P_j}, j = 1, ..., d, are mode-wise factor matrices. In practice, one can restrict the factor matrices to be orthogonal, and thus consider the columns of these matrices as principal components of each mode. The core tensor G measures the interaction across different components. An example of a 3-way tensor Tucker decomposition is demonstrated in figure 1.3. Similar to the CP decomposition, we can define the Tucker decomposition for an arbitrary tensor X even if equation (1.11) does not hold exactly. It is defined as

X ≈ X̂ = G ×_1 U^{(1)} ×_2 U^{(2)} ... ×_d U^{(d)},  where  X̂ = \arg\min_{X̂} ||X − X̂||_{Fro}

[Figure 1.3: Tucker Decomposition — a three-way tensor X approximated by a core tensor G multiplied by factor matrices U^{(1)}, U^{(2)}, U^{(3)}]

Notice that the CP decomposition is actually a special case of the Tucker decomposition, obtained when the core tensor G is super-diagonal and all P_1, ..., P_d are equal. The estimation of the Tucker decomposition can be done with an iterative alternating least squares algorithm introduced in [35]. Although the Tucker decomposition is not as easy to interpret as the CP decomposition, its mode-wise factor matrices can be regarded as bases for the row space of each tensor mode. Thus, it has been widely applied in problems such as image compression and higher-order data feature extraction.
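A short NumPy sketch (shapes, ranks, and variable names are illustrative assumptions) can make the CP form (1.9)-(1.10) and the Tucker form (1.11) concrete; it rebuilds a full tensor from its factors and verifies that a CP tensor is a Tucker tensor with a super-diagonal core.

```python
import numpy as np

rng = np.random.default_rng(2)
I1, I2, I3, r = 6, 5, 4, 3

# CP / Kruskal form (1.10): factor matrices X^(j) of shape I_j x r
X1, X2, X3 = (rng.standard_normal((I1, r)),
              rng.standard_normal((I2, r)),
              rng.standard_normal((I3, r)))

# Reconstruct the full tensor from equation (1.9): a sum of r rank-one outer products
X_cp = np.einsum('ik,jk,lk->ijl', X1, X2, X3)

# Tucker form (1.11): a core tensor G multiplied by a factor matrix on each mode
P1, P2, P3 = 3, 3, 2
G = rng.standard_normal((P1, P2, P3))
U1, U2, U3 = (rng.standard_normal((I1, P1)),
              rng.standard_normal((I2, P2)),
              rng.standard_normal((I3, P3)))
X_tucker = np.einsum('abc,ia,jb,lc->ijl', G, U1, U2, U3)

# CP as a special case of Tucker: super-diagonal core with P1 = P2 = P3 = r
G_diag = np.zeros((r, r, r))
G_diag[np.arange(r), np.arange(r), np.arange(r)] = 1.0
X_check = np.einsum('abc,ia,jb,lc->ijl', G_diag, X1, X2, X3)
assert np.allclose(X_cp, X_check)
```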
1.2.3 Tensor Product Space

Apart from the algebraic notations and operations for tensors, we also include a brief introduction to tensor functionals and tensor product spaces. They are essential in the development of universal tensor kernel functions and statistical consistency. We refer to [59] for the definition of tensor product spaces and tensor calculus. Since we consider the general tensor product in this section, we use ⊗ to denote it in our description. For finite-dimensional vector spaces, the space of their tensor products is called the algebraic tensor space.

Definition 1.2.1. Let V ⊂ R^{I_1} and W ⊂ R^{I_2} be two compact subspaces of the Euclidean spaces R^{I_1} and R^{I_2}, with V = {v : v = \sum_i α_i v_i} and W = {w : w = \sum_j β_j w_j}, where {v_i} and {w_j} are the bases of V and W. The algebraic tensor space of V and W, denoted by T, is the space spanned by the tensor products of the basis vectors:

T = V ⊗ W = { t : t = \sum_{i,j} γ_{i,j} \, v_i ⊗ w_j }    (1.12)

The basis elements of this algebraic tensor space are {v_i ⊗ w_j}. The characteristic algebraic property of the tensor space is bilinearity, meaning that for all t ∈ T and a ∈ R,

a · t = \sum_{i,j} γ_{i,j} (a · v_i) ⊗ w_j = \sum_{i,j} γ_{i,j} \, v_i ⊗ (a · w_j)

We call this algebraic tensor space a second-order algebraic tensor space since its basis elements are tensor products of two vectors. This second-order tensor space is still a vector space, as defined in mathematics. However, if we consider a specific tensor product, the outer product "∘", in definition 1.2.1, the space is indeed isometric to a second-order tensor (matrix) space. The bijection connecting the two spaces is a specific folding and unfolding rule, as introduced earlier. Notice that the algebraic tensor space measures distance by the Euclidean norm, while the norms of multi-dimensional arrays are measured by the Frobenius norm. The equivalence between the Euclidean norm and the Frobenius norm leaves the distances between points unchanged before and after unfolding, making the two spaces isometric. Similarly, if the general tensor product is replaced by the Kronecker or Khatri-Rao product, definition 1.2.1 can be extended to products of matrix spaces. This isometry property also connects the abstract mathematical definition to the more concrete definition of tensors, especially those decomposed into CP forms. As a result, data in the form of multi-dimensional arrays can be considered as points in a tensor product space. In general, we can define the d-th order algebraic tensor space as

X = V^{(1)} ⊗ V^{(2)} ⊗ ... ⊗ V^{(d)} = span\{ ⊗_{j=1}^{d} v_k^{(j)} : v_k^{(j)} ∈ V^{(j)}, j = 1, ..., d \}    (1.13)

for d-way tensors. This makes it feasible for us to develop further statistical analysis on tensors.

In the definition of the algebraic tensor space, we only consider Euclidean subspaces and connect them to the spaces of multi-dimensional arrays. Indeed, definition 1.2.1 can be extended to tensor products of any metric spaces, such as products of inner product spaces, products of functional spaces, and products of Reproducing Kernel Hilbert spaces. We define:

Definition 1.2.2. Let <·, ·>_j be a general inner product defined on V^{(j)} such that V^{(j)} is an inner product space. Then X = ⊗_{j=1}^{d} V^{(j)} is an inner product space with inner product <·, ·>_X, where

X = span\{ ⊗_{j=1}^{d} f_k^{(j)} : f_k^{(j)} ∈ V^{(j)}, j = 1, ..., d \}    (1.14)

For f = \sum_k ⊗_{j=1}^{d} f_k^{(j)} ∈ X and g = \sum_l ⊗_{j=1}^{d} g_l^{(j)} ∈ X, where f_k^{(j)}, g_l^{(j)} ∈ V^{(j)} are basis functions, the inner product is

< f, g >_X = \sum_{k,l} \prod_{j=1}^{d} < f_k^{(j)}, g_l^{(j)} >_j    (1.15)

This definition generalizes tensor product spaces to arbitrary inner product spaces.
For example, if each V^{(j)} is a univariate functional space, then X is a multivariate functional space whose elements are functions mapping vectors to scalars. Moreover, this definition will help us construct the tensor Reproducing Kernel Hilbert space in chapter 2.

1.3 The Bayes Error and Classification Consistency

Statistical analysis of classification problems often tries to validate the performance of a model by checking whether its classification risk is close to the Bayes risk and whether it is statistically consistent. These two components are essential in evaluating the generalization ability of a specific model. We briefly review the definitions of the Bayes error and classification consistency in this section. More details can be found in [36].

1.3.1 The Bayes Problem

Consider a binary classification problem where (X, Y) is a pair of random variables taking their respective values in R^d and {0, 1}. Let

\eta(x) = P(Y = 1 | x) = E(Y | x)    (1.16)

Naturally, any measurable function f : R^d → {0, 1} can be a potential classifier or decision function. If we consider the naive zero-one loss function L(z, y) = 1\{z ≠ y\}, then the expected loss of a classifier f, called the risk of f, is R(f) = E[L(f(X), Y)] = P(f(X) ≠ Y). Let

f^*(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2 \\ 0 & \text{otherwise} \end{cases}

It is easy to show that among all possible decision functions, f^* has the smallest risk, making it the best possible classifier. The Bayes problem is to find the optimal classifier f^*, and f^* itself is called the Bayes classifier or Bayes rule. The classification risk of the Bayes rule, R^* = R(f^*), is defined as the Bayes risk, which is the smallest possible risk one can obtain. Under most circumstances, it is infeasible to estimate the Bayes rule directly, since the distribution of (X, Y) is unknown.
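As a toy illustration of the Bayes rule above (and of the empirical risk defined in the next subsection), the following sketch compares the risk of the Bayes classifier on a finite sample with the true Bayes risk. The one-dimensional Gaussian mixture, the sample size, and all names are illustrative assumptions, not from the original text.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Assumed toy model: P(Y=1) = 0.5, X | Y=0 ~ N(-1, 1), X | Y=1 ~ N(+1, 1)
n = 5000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)

def eta(x):
    """eta(x) = P(Y=1 | x); constants cancel because priors and variances are equal."""
    p1 = np.exp(-0.5 * (x - 1.0) ** 2)
    p0 = np.exp(-0.5 * (x + 1.0) ** 2)
    return p1 / (p0 + p1)

f_star = (eta(x) > 0.5).astype(int)   # Bayes classifier f*(x) = 1{eta(x) > 1/2}

# Empirical risk of the Bayes rule on this sample (zero-one loss)
emp_risk = np.mean(f_star != y)

# True Bayes risk R* for this model: P(N(0,1) > 1) by symmetry
bayes_risk = norm.sf(1.0)
print(emp_risk, bayes_risk)           # the two should be close for large n
```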
1.3.2 Consistent Classification Rules

Instead of searching for the Bayes rule, most of the time we construct a classifier from a limited amount of data. Suppose T_n = {(x_1, y_1), ..., (x_n, y_n)} is a collection of observations of the random variables (X, Y). The empirical estimate of the classification risk for a decision function f is

R_n = \frac{1}{n} \sum_{i=1}^{n} 1\{ f(x_i) ≠ y_i \}

where 1{·} is an indicator function. A "good" classifier can be constructed from the data T_n by searching for the optimizer that minimizes the empirical risk. Such a procedure is called Empirical Risk Minimization (ERM), and T_n is called the training set. If we denote the empirically optimal decision function by f_n, it is a function conditional on the training set T_n. If we use the same strategy with different training sets, we obtain a sequence of decision functions {f_n}. Such a sequence of functions, estimated with the same strategy / rule but different training data, is called a classification rule: a way of finding optimal decision functions from training data. The ERM procedure produces a decision rule, and there are further variants of ERM, such as regularized ERM, in the statistical learning literature.

To show that a classification rule is good from a mathematical standpoint, one possible way is to prove that it is statistically consistent.

Definition 1.3.1. A classification rule is (weakly) consistent for a certain distribution of (X, Y) if R(f_n) → R^* in probability, and is strongly consistent if R(f_n) → R^* a.s. as n → ∞.

A consistent rule, not the specific classifier learned from one training set, guarantees that taking sufficiently large samples can reconstruct the unknown data distribution and eventually identify the optimal classifier. This property says that a classification rule is learning the data in the right way, since it will eventually unveil the whole data distribution. Reconstruction here means that, with sufficient training data, the risk of the estimated classifier will eventually equal the Bayes risk of the classification problem, which thereby becomes known. For most classification models, establishing statistical consistency is of great interest, as it validates that the model is learning the data in the "right" way. In this thesis, our theoretical analysis also mostly focuses on establishing statistical consistency for tensor-based classifiers.

1.3.3 Surrogate Loss Consistency

In binary classification problems, the most intuitive and basic loss function is the "zero-one" loss L(z, y) = 1{z ≠ y}. However, its non-convexity brings many challenges in both computational aspects and statistical properties. Moreover, the zero-one loss may perform worse than surrogate losses in various classification applications. Recent works [13, 122, 151] demonstrate that there are many surrogate loss functions for binary classification that are equipped with convexity and nice statistical properties, making the estimation procedure more tractable. Another motivation for using well-behaved surrogate losses in classification applications is that there is a general quantitative relationship between the approximation and estimation risks associated with surrogate losses and those associated with the zero-one loss. Here we denote the risk of a decision function f associated with a surrogate loss L by R_L(f), and the corresponding risk associated with the zero-one loss by R(f). Further, the Bayes risk under the loss function L is denoted by R_L^*, and the Bayes risk under the zero-one loss is denoted by R^*. These risks are defined as

R_L(f) = E_{X×Y}[ L(f(X), Y) ],   R_L^* = E_{X×Y}[ L(f^*(X), Y) ]

where X × Y is the domain of the random variables (X, Y), and the expectation is taken over the joint distribution of (X, Y). f^* is the Bayes classifier such that f^* = \arg\min R_L(f); it is optimal among all measurable functions which map data in X to labels in Y. Results from [151] show that for any measurable function f,

\psi( R(f) − R^* ) ≤ R_L(f) − R_L^*

for a nondecreasing function ψ : [0, 1] → [0, ∞). This suggests that statistical consistency developed under a surrogate loss function implies consistency under the zero-one loss, as long as the surrogate loss is well-behaved. Thanks to this general relationship, one can develop statistical consistency for decision rules using surrogate losses and enjoy their nice mathematical properties, such as Lipschitz continuity, instead of using the zero-one loss. This reduces the difficulty of the problem significantly. Lastly, using surrogate losses may enable us to develop a uniform upper bound on the risk of a function f_n that minimizes the empirical risk. This may help us further obtain an explicit, uniform bound on the excess risk, R_L(f_n) − R_L^*, highlighting the convergence rate of a specific decision rule. Well-behaved losses are sometimes called self-calibrated or classification-calibrated losses. Examples of such loss functions include the Hinge loss, the Squared Hinge loss, and the exponential loss.
A more detailed discussion about self-calibrated loss is available in [95] and the Section 2 of [128]. We estimate tensor classification models using Hinge and Squared Hinge loss in this thesis, and develop our theoretical results with surrogate loss functions. 15 CHAPTER 2 TENSOR CLASSIFICATION MODELS In this chapter, we provide an introduction to several tensor-based classification models, and compare their performance empirically. Moreover, the statistical consistency of few classifiers are established. 2.1 Introduction In contemporary machine learning and statistics research, tensor has become a popular tool to model multi-dimensional data such as spatio-temporal data, brain imaging, and multimodal data. Comparing to the traditional vector presentation, tensor preserves the multi-way structures of the data, providing more correlations among different modes for data mining and modeling. In addition, the existing tensor decomposition methods from [78] can help to estimate the low- dimensional structure for tensor data, which reduce the computational complexity significantly for tensor-based models. As an essential part of supervised tensor learning, tensor classification problems try to predict data labels from tensors. Current literature about tensor classification can be categorized in several groups. First, since distances between tensors can be easily estimated by Frobenious norm, K- nearest neighbour classifiers can be easily established. However, the Frobenious norm of a tensor is equivalent to the L2 norm of its vectorization. Such extensions are indeed equivalent to the vector-based K-nearest neighbour classifiers, and thus have no computational gain from tensor representation. An improvement [92] then has been made on the tensor K-nearest neighbour classifiers by combining it with a Fisher discriminant analysis. Utilizing the multi-way features preserved by tensors, [92] learns multi-linear transformations projecting tensors to lower multi- dimensional spaces where they are easier to be classified. [114] also develops a probabilistic discriminant analysis for tensors, using density instead of distance to discriminate data. Another type of tensor-based classifiers borrow the separating hyperplane from support vector machine, and 16 build support tensor machine models. With different tensor decomposition and kernel functions, there are models like rank-1 CP-STM [136], CP-STMs [63, 64], Tucker STM [127], and support tensor train machine [28]. These models benefit from the distribution-free assumption for tensor data, and are more flexible in real-data applications. Finally, logistic regression model can also be generalized for tensor data. [155] and [93] develop generalized linear regression models with CP and Tucker tensor coefficients, which can be adopted for classification problems. Although the current approaches have demonstrated impressive performance, not all of them provide theoretical guarantee on their generalization ability. According to [36], a classifier with solid generalization ability should be statistically consistent, having their excess classification risks converge to the optimal Bayes risk. Bayes risk is the minimal risk one can obtain from a classification problem with data confirming a certain type of probability distribution. The difference between the risk of a learned classifier and the optimal Bayes risk quantifies the performance of the classifier theoretically. 
Such results are well established for traditional statistical classification approaches; however, they are not complete for all tensor-based methods. In this chapter, we introduce a few popular tensor-based classifiers from the current literature, including CP-STM [63], tensor discriminant analysis [92], and CP-GLM [155], and investigate their performance through numerical studies. Further, we discuss their statistical consistency and provide a theoretical result which establishes the statistical consistency of CP-STM. For the other methods, the results are introduced as they can be easily extended from the existing literature. The rest of this chapter is organized as follows: Section 2.2 reviews three major types of tensor-based classifiers and their consistency results. Section 2.3 develops the consistency result for the support tensor machine model. In section 2.4, we compare the performance of all reviewed tensor-based methods on two different real data applications. Section 2.5 concludes the chapter.

2.2 Tensor Classification Algorithms

In this section, we introduce five different tensor-based classifiers, which are categorized into three groups depending on their model mechanism.

2.2.1 Support Tensor Machine

The support tensor machine extends the idea of the kernel support vector machine (see e.g. [128]) and constructs a separating hyperplane with support tensors for classification. In this part, we review the Candecomp/Parafac Support Tensor Machine (CP-STM) model from [63], and provide two different model estimation algorithms. Suppose there is a training set T_n = {(X_1, y_1), (X_2, y_2), ..., (X_n, y_n)}, where X_i ∈ X ⊂ R^{I_1 × I_2 × ... × I_d} are d-way tensors, X is a compact tensor space which is a subspace of R^{I_1 × I_2 × ... × I_d}, and y_i ∈ {1, −1} are binary labels. CP-STM, like the traditional kernel support vector machine, tries to estimate a decision function f : X → R that minimizes the objective function

\min_f \; λ ||f||^2 + \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), y_i)    (2.1)

L is a loss function for classification such as the Hinge loss, the squared Hinge loss, or the zero-one loss, and λ is a tuning parameter. ||f||^2 = <f, f> = ∫ f(X)^2 dX is the squared functional norm of f. The kernel functions for tensor data are defined on the CP representation of tensors. Assume two d-way tensors with CP rank r are represented as X_1 = \sum_{k=1}^{r} x_{1,k}^{(1)} ∘ x_{1,k}^{(2)} ∘ ... ∘ x_{1,k}^{(d)} and X_2 = \sum_{k=1}^{r} x_{2,k}^{(1)} ∘ x_{2,k}^{(2)} ∘ ... ∘ x_{2,k}^{(d)}. A tensor kernel function is defined as

K(X_1, X_2) = \sum_{l,k=1}^{r} \prod_{j=1}^{d} K^{(j)}( x_{1,l}^{(j)}, x_{2,k}^{(j)} )    (2.2)

where the K^{(j)} are vector-based kernel functions measuring inner products of factors in different tensor modes. The kernel function (2.2) measures the inner product between two tensors by aggregating kernel values of their CP factors across ranks. With the kernel trick and the representer theorem [9], the optimal decision rule for the optimization problem (2.1) has the form

f(X) = \sum_{i=1}^{n} α_i y_i K(X_i, X) = α^T D_y K(X)    (2.3)

where X is a new d-way rank-r tensor with shape I_1 × I_2 × ... × I_d. α = [α_1, ..., α_n]^T are the coefficients learned by plugging the function (2.3) into the objective function (2.1) and minimizing (2.1). D_y is a diagonal matrix whose diagonal elements are y_1, ..., y_n, and K(X) = [K(X_1, X), ..., K(X_n, X)]^T is a column vector.
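To make the CP kernel (2.2) and the decision function (2.3) concrete, here is a minimal NumPy sketch. The function names, the Gaussian/RBF choice for the mode-wise kernels K^(j), and the bandwidth are illustrative assumptions rather than the thesis's specific implementation.

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    """Assumed mode-wise kernel K^(j): a Gaussian/RBF kernel on two CP factor vectors."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

def cp_kernel(factors1, factors2, gamma=1.0):
    """Tensor kernel (2.2); factorsX is a list of d CP factor matrices of shape I_j x r."""
    r1, r2 = factors1[0].shape[1], factors2[0].shape[1]
    val = 0.0
    for l in range(r1):
        for k in range(r2):
            prod = 1.0
            for F1, F2 in zip(factors1, factors2):   # loop over the d modes
                prod *= rbf(F1[:, l], F2[:, k], gamma)
            val += prod
    return val

def stm_decision(alpha, y, train_factors, new_factors, gamma=1.0):
    """Decision function (2.3): f(X) = sum_i alpha_i * y_i * K(X_i, X)."""
    return sum(a * yi * cp_kernel(f, new_factors, gamma)
               for a, yi, f in zip(alpha, y, train_factors))
```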
If we denote the collection of functions of the form (2.3) by H, such that H = { f : f = α^T D_y K(X), α ∈ R^n }, then the optimal STM classifier is

f_n = \arg\min_{f ∈ H} \; λ ||f||^2 + \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), y_i)    (2.4)

and the class labels are predicted by Sign[f_n]. H is the Reproducing Kernel Hilbert Space generated by the tensor kernel (2.2). Support tensors are the tensors whose corresponding coefficients α_i in f_n are non-zero.

Estimating f_n from the training data T_n can be accomplished in various ways. The estimated classifiers can also differ, and have different excess errors, depending on the type of loss function L. We adopt two different loss functions, the Hinge loss and the squared Hinge loss, and provide the corresponding estimation algorithms. The Hinge loss L(f(X), y) = max(0, 1 − y · f(X)) is a convex and non-differentiable loss designed for support vector / tensor types of classifiers. Compared to the regular zero-one loss in binary classification, the Hinge loss imposes the same level of penalty on misclassified points that are close to the separating hyperplane, while putting a more severe penalty on those that are far away from it. [122] demonstrates the statistical robustness and the fast convergence rate of the Hinge loss in binary classification problems. Minimizing the objective function (2.1) with the Hinge loss can be shown to be equivalent to the optimization problem

\min_{α ∈ R^n} \; \frac{1}{2} α^T D_y K D_y α − 1^T α
\text{s.t. } α^T y = 0,  \; 0 ≤ α ≤ \frac{1}{2nλ} 1    (2.5)

Equation (2.5) is the dual problem of the original STM problem with the Hinge loss; the derivation is provided in [25]. Notice that this problem has a quadratic objective function and inequality constraints, and can be solved by Quadratic Programming (QP) as in [20]. The steps are summarized in Algorithm 1. We use Python-style pseudo-code to denote columns of matrices; for example, X_i^(m)[:, k] stands for the k-th column of the CP factor matrix X_i^(m). We will use such notation in the rest of the thesis.

Algorithm 1 Hinge STM
1: procedure STM Train
2:   Input: Training set T_n = {X_i}, y, kernel function K, tensor rank r, λ
3:   for i = 1, 2, ..., n do
4:     X_i = ⟦X_i^(1), ..., X_i^(d)⟧   ⊲ CP decomposition by the ALS algorithm
5:   Create initial matrix K ∈ R^{n×n}
6:   for i = 1, ..., n do
7:     for j = 1, ..., i do
8:       K_{i,j} = \sum_{k,l=1}^{r} \prod_{m=1}^{d} K(X_i^(m)[:, k], X_j^(m)[:, l])   ⊲ Kernel values
9:       K_{j,i} = K_{i,j}
10:  Solve the quadratic programming problem (2.5) and find the optimal α*.
11:  Output: α*
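The dual problem (2.5) is a standard box-constrained QP with one equality constraint. As a minimal sketch of the solve in step 10 of Algorithm 1 (the use of SciPy's SLSQP solver, and all variable names, are my own assumptions; any QP solver would do), one could write:

```python
import numpy as np
from scipy.optimize import minimize

def solve_hinge_stm_dual(K, y, lam):
    """Solve the dual QP (2.5): min 0.5*a'DKDa - 1'a, s.t. a'y = 0, 0 <= a <= 1/(2*n*lam)."""
    n = len(y)
    Dy = np.diag(y.astype(float))
    Q = Dy @ K @ Dy                                # quadratic term D_y K D_y

    objective = lambda a: 0.5 * a @ Q @ a - np.sum(a)
    grad = lambda a: Q @ a - np.ones(n)

    res = minimize(
        objective, x0=np.zeros(n), jac=grad, method="SLSQP",
        bounds=[(0.0, 1.0 / (2 * n * lam))] * n,               # box constraint on alpha
        constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # alpha' y = 0
    )
    return res.x                                   # the dual coefficients alpha*
```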
Another loss function commonly used for binary classification is the Squared Hinge loss, a convex and differentiable surrogate of the Hinge loss. The Squared Hinge loss squares the Hinge loss, L(f(X), y) = (max(0, 1 − y · f(X)))^2, making it differentiable when y · f(X) = 1. Plugging in the Squared Hinge loss makes the objective function (2.1) differentiable, so it can be minimized by setting its derivative to zero. The objective function (2.1) with the Squared Hinge loss, written in matrix form, is

\min_{α ∈ R^n} \; λ α^T D_y K D_y α + \frac{1}{n} \sum_{i=1}^{n} \big( \max(0, 1 − y_i · α^T D_y k_i) \big)^2    (2.6)

Here K is the n-by-n kernel matrix whose (i, j)-th element is K(X_i, X_j), and k_i = [K(X_i, X_1), ..., K(X_i, X_n)]^T is the i-th column of the kernel matrix K. The derivative of (2.6) with respect to α is

∇ = 2λ D_y K D_y α + \frac{2}{n} \sum_{i=1}^{n} (−y_i) D_y k_i \cdot \max(0, 1 − y_i · α^T D_y k_i)
  = 2λ D_y K D_y α + \frac{2}{n} \sum_{i ∈ s} \big( D_y k_i k_i^T D_y α − D_y k_i y_i \big)
  = 2λ D_y K D_y α + \frac{2}{n} \big( D_y K I_s K^T D_y α − D_y K I_s y \big)
  = 2 D_y K \Big[ \big(λ I + \frac{1}{n} I_s K^T\big) D_y α − \frac{1}{n} I_s y \Big]    (2.7)

where y = [y_1, ..., y_n]^T and s is the collection of indices of the support tensors. Support tensors are those tensors with labels (X_i, y_i) such that y_i · α^T D_y k_i < 1 for a given α. I_s is an identity matrix of size n whose diagonal elements corresponding to non-support tensors are set to zero. To estimate the α at which the derivative (2.7) equals zero, we can use the Gauss-Newton method (see e.g. [47]). Denote the second derivative of the objective function (2.6) with respect to α by H:

H = 2 D_y K \big(λ I + \frac{1}{n} I_s K^T\big) D_y    (2.8)

If α* is the root of ∇ = 0, then the Gauss-Newton algorithm uses the first-order Taylor expansion and assumes that ∇|_{α*} = ∇|_{α} + H|_{α} · (α* − α). Since we assume ∇|_{α*} = 0, the equation reduces to

α* = α − H^{-1} ∇|_{α}
   = α − D_y \big(λ I + \frac{1}{n} I_s K^T\big)^{-1} \Big[ \big(λ I + \frac{1}{n} I_s K^T\big) D_y α − \frac{1}{n} I_s y \Big]
   = \frac{1}{n} D_y \big(λ I + \frac{1}{n} I_s K^T\big)^{-1} I_s y
   = \frac{1}{n} D_y \big(λ I + \frac{1}{n} I_s K\big)^{-1} I_s y    (2.9)

We drop the transpose in the last step since the kernel matrix is symmetric. Notice that the derivation uses the fact that D_y is symmetric and orthogonal, D_y D_y^T = I, since y_i^2 = 1. The algorithm starts with an initial value of α and keeps updating α* with equation (2.9) iteratively until convergence. At each iteration, the new estimate α* replaces α for the next iteration. Although the update rule does not include α in its final explicit form (2.9), the indices of the support tensors are updated at each iteration; thus I_s changes, which makes the estimate of α* different. The algorithmic steps for the Squared Hinge loss STM are summarized in Algorithm 2.

Algorithm 2 Squared Hinge STM
1: procedure STM Train
2:   Input: Training set T = {X_i}, y, kernel function K, tensor rank r, λ, η, maxiter
3:   for i = 1, 2, ..., n do
4:     X_i = ⟦X_i^(1), ..., X_i^(d)⟧   ⊲ CP decomposition by the ALS algorithm
5:   Create initial matrix K ∈ R^{n×n}
6:   for i = 1, ..., n do
7:     for j = 1, ..., i do
8:       K_{i,j} = \sum_{k,l=1}^{r} \prod_{m=1}^{d} K(X_i^(m)[:, k], X_j^(m)[:, l])   ⊲ Kernel values
9:       K_{j,i} = K_{i,j}
10:  Create α* = 1_{n×1}, α = 0_{n×1}   ⊲ Initial values, can be different
11:  Iteration = 0
12:  while ||α* − α||_2 > η and Iteration ≤ maxiter do
13:    α = α*
14:    Find s ∈ R^{n×1}, s_i ∈ {0, 1}, such that s_i = 1 if y_i · α^T D_y k_i < 1   ⊲ Indicating support tensors
15:    I_s = diag(s)   ⊲ Create diagonal matrix with s as diagonal
16:    α* = (1/n) D_y (λ I + (1/n) I_s K)^{-1} I_s y   ⊲ Update
17:  Output: α*

After estimating α from either (2.5) with the Hinge loss or (2.6) with the Squared Hinge loss, the class label of a new tensor X is predicted by Sign[f(X)] = Sign[α^T D_y K(X)]. The CP representations of the tensors in the training data are already available; to calculate K(X), one has to find the CP representation of the testing tensor X and then calculate the kernel values with equation (2.2).

2.2.2 Tensor Discriminant Analysis

The second type of classification method is tensor-based discriminant analysis (TDA). Tensor discriminant analysis combines the tensor-based K-nearest neighbour classifier and multilinear feature
It seeks a tensor-to-tensor projection transforming tensors into a new tensor subspace which maximizes the data separation. To measure the level of data separation in the new tensor subspace, TDA adopted two criteria from Fisher discriminant analysis [108]: the scatter ratio criterion and the scatter difference criterion. The optimal tensor-to-tensor projection is selected to maximize either one of the criterion. The discriminant analysis with tensor representation (DATER) [147] and multilinear discriminant analysis MDA [103] search the optimal projection utilizing the maximum ratio criteria. However, the algorithms are not stable, and do not converge over iterations. The general tensor discriminant analysis (GTDA) from [135] and MDA from [142] use the maximum scatter difference criterion, and provide two convergent algorithms in tensor subspace learning. However, the model classification performance relies heavily on tuning parameters. [92] proposes Direct General Tensor Discriminant Analysis (DGTDA) which maximizes the scatter difference and estimates the global optimal tensor-to-tensor projection without parameter tuning. In addition, they also propose a Constrained Multilinear Discriminant Analysis (CMDA) by maximizing the 22 scatter ratio and restricting the tensor-to-tensor projection matrices to be orthogonal. In this part, we provide a review on DGTDA and CMDA methods. In tensor discriminant analysis (TDA), a tensor-to-tensor projection is defined using tensor mode-wise products introduced in section 1.2. Suppose X is a d-way tensor with size 𝐼1 × ... × 𝐼 𝑑 , 𝑃 ×...×𝑃 then a tensor-to-tensor projection transforms X to Z ∈ R 1 𝑑 Z = X ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) (2.10) where 𝑈 ( 𝑗) are 𝑃 𝑗 × 𝐼 𝑗 projection matrices. The projection is defined uniquely by the collection of projection matrices {𝑈 𝑈 (1) ,𝑈 𝑈 (2) ...𝑈 𝑈 (𝑑) }. Now let’s assume the training data are tensors from binary classes, and are denoted by X 𝑐,𝑖 . 𝑐 = 1, 2 stands for the class of tensor data, and 𝑖 stands for the 𝑖-th sample from class 𝑐. Like traditional statistics, the mean projected tensor for class 𝑐 is defined as 𝑛𝑐 𝑛𝑐 1 Õ 1 Õ Z𝑐 = Z 𝑐,𝑖 = X 𝑐,𝑖 ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) 𝑛𝑐 𝑛𝑐 𝑖=1 𝑖=1 (2.11) = X 𝑐 ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) 𝑛Í𝑐 where X 𝑐 = 𝑛1𝑐 X 𝑐,𝑖 is the class mean of original tensors in class 𝑐. 𝑛𝑐 is the number of samples 𝑖=1 in class 𝑐. Similarly, the overall mean of tensors from both classes is 2 2 𝑛 𝑐 1Õ 1 ÕÕ Z= 𝑛𝑐 · Z 𝑐 = X 𝑐,𝑖 ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) 𝑛 𝑛 (2.12) 𝑐=1 𝑐=1 𝑖=1 = X ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) 2 where X = 𝑛1 𝑛𝑐 · X 𝑐 . 𝑛 = 𝑛1 + 𝑛2 is the total number of samples. As an extension of Fisher Í 𝑐=1 discriminant analysis, TDA looks at mode-wise between-class scatter matrices and within-class scatter matrices in the projected subspace. For mode 𝑗, the between class scatter matrix is defined 23 as 2 Õ > 𝑛𝑐 Z 𝑐 − Z ( 𝑗) · Z 𝑐 − Z ( 𝑗)    𝐵𝑗 = 𝑐=1 2 Õ Ö Ö > (2.13) X𝑐 − X ) × 𝑘𝑈 (𝑘) ( 𝑗) · (XX𝑐 − X ) × 𝑘𝑈 (𝑘) ( 𝑗)    = 𝑛𝑐 (X 𝑐=1 𝑘 𝑘 𝑗¯ = 𝑈 ( 𝑗) 𝐵 𝑗 𝑈 ( 𝑗)> ¯ 2 > 𝑗 X 𝑐 − X ) 𝑘≠ 𝑗 𝑈 (𝑘) ( 𝑗) · (X X 𝑐 − X ) 𝑘≠ 𝑗 × 𝑘𝑈 (𝑘) ( 𝑗) is the between-class scatter Í  Î   Î 𝐵𝑗 = 𝑛𝑐 (X 𝑐=1 matrix in the partially projected subspace (all modes excepts 𝑗-th mode are projected). 𝐵 𝑗 is in ¯ 𝑗 dimension of 𝑃 𝑗 × 𝑃 𝑗 , and 𝐵 𝑗 is in dimension 𝑃 𝑗 × 𝑃 𝑗 since the 𝑗-th mode is not projected. Z 𝑐 − Z ( 𝑗) are the 𝑗-th mode unfolding matrices for tensor Z 𝑐 − Z which is in dimension     Î 𝐼 𝑗 × 𝑘≠ 𝑗 𝐼 𝑘 . The derivation of equation (2.13) is available in [92]. 
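As an aside, the tensor-to-tensor projection (2.10) and the fully projected mode-$j$ between-class scatter in (2.13) can be sketched in a few lines of numpy. The code below is a minimal illustration, not from the dissertation; the helper names are hypothetical and dense numpy arrays are assumed for the tensors.

```python
import numpy as np

def mode_product(T, U, mode):
    """Mode product T x_mode U: the matrix U (P x I_mode) acts on axis `mode` of T."""
    T = np.moveaxis(T, mode, 0)
    out = (U @ T.reshape(T.shape[0], -1)).reshape((U.shape[0],) + T.shape[1:])
    return np.moveaxis(out, 0, mode)

def tensor_projection(X, Us):
    """Tensor-to-tensor projection (2.10): Z = X x_1 U^(1) x_2 U^(2) ... x_d U^(d)."""
    Z = X
    for j, U in enumerate(Us):
        Z = mode_product(Z, U, j)
    return Z

def unfold(T, mode):
    """Mode-`mode` unfolding: an I_mode x (product of the other dimensions) matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def between_class_scatter(class_tensors, Us, mode):
    """Fully projected mode-j between-class scatter, first line of (2.13).
    `class_tensors[c]` is an array of shape (n_c, I_1, ..., I_d) for class c."""
    Z_means = [np.mean([tensor_projection(X, Us) for X in Xc], axis=0)
               for Xc in class_tensors]
    n_c = [len(Xc) for Xc in class_tensors]
    Z_bar = sum(n * Z for n, Z in zip(n_c, Z_means)) / sum(n_c)
    B = np.zeros((Z_bar.shape[mode], Z_bar.shape[mode]))
    for n, Zc in zip(n_c, Z_means):
        D = unfold(Zc - Z_bar, mode)
        B += n * D @ D.T
    return B
```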
With the same idea, the mode-$j$ within-class scatter matrix is defined as
$W_j = \frac{1}{n}\sum_{c=1}^{2}\sum_{i=1}^{n_c}\Big[\big((\mathcal{X}_{c,i}-\bar{\mathcal{X}}_c)\textstyle\prod_k\times_k U^{(k)}\big)_{(j)}\Big]\Big[\big((\mathcal{X}_{c,i}-\bar{\mathcal{X}}_c)\textstyle\prod_k\times_k U^{(k)}\big)_{(j)}\Big]^{\top} = U^{(j)}\bar{W}_j U^{(j)\top} \qquad (2.14)$
where $\bar{W}_j = \frac{1}{n}\sum_{c=1}^{2}\sum_{i=1}^{n_c}\big[\big((\mathcal{X}_{c,i}-\bar{\mathcal{X}}_c)\prod_{k\neq j}\times_k U^{(k)}\big)_{(j)}\big]\big[\big((\mathcal{X}_{c,i}-\bar{\mathcal{X}}_c)\prod_{k\neq j}\times_k U^{(k)}\big)_{(j)}\big]^{\top}$ is the within-class scatter matrix in the partially projected subspace (all modes except the $j$-th mode are projected). Notice that the within-class scatter matrices (2.14) and the between-class scatter matrices (2.13) are analogous to the within-class and between-class covariance matrices in traditional statistics. Thus, for each mode, an optimal projection matrix can be estimated by maximizing either the ratio of (2.13) to (2.14), or the difference between (2.13) and (2.14).

DGTDA learns the projection matrices by maximizing the scatter difference of (2.13) and (2.14). Additionally, it assumes that all the projection matrices are orthogonal. The objective function for DGTDA is
$\max_{U^{(j)}}\ \|B_j\|_{\mathrm{Fro}}^2 - \zeta\|W_j\|_{\mathrm{Fro}}^2 = \mathrm{tr}\big(U^{(j)}\bar{B}_j U^{(j)\top}\big) - \zeta\,\mathrm{tr}\big(U^{(j)}\bar{W}_j U^{(j)\top}\big) \quad \text{s.t.}\ U^{(j)}U^{(j)\top} = I \qquad (2.15)$
For each mode $j$, the projection matrix can be estimated with a singular value decomposition. The whole algorithm estimates $U^{(j)}$ in a single run, without multiple iterations or tuning parameters; $\zeta$ is selected through the singular value decomposition instead of being tuned. We summarize the procedure in Algorithm 3, where $SVD$ denotes the ordinary matrix singular value decomposition, which is standard and used without further introduction.

Algorithm 3 DGTDA Projection Learning
1: procedure DGTDA
2: Input: tensors $\{\mathcal{X}_{c,i};\ i = 1, \ldots, n_c;\ c = 1, 2\}$, target dimensions $P_1, P_2, \ldots, P_d$
3: for $j = 1, 2, \ldots, d$ do
4:   Calculate $\bar{B}_j = \sum_{c=1}^{2} n_c\big[(\mathcal{X}_c - \bar{\mathcal{X}})_{(j)}\big]\big[(\mathcal{X}_c - \bar{\mathcal{X}})_{(j)}\big]^{\top}$
5:   Calculate $\bar{W}_j = \frac{1}{n}\sum_{c=1}^{2}\sum_{i=1}^{n_c}\big[(\mathcal{X}_{c,i} - \bar{\mathcal{X}}_c)_{(j)}\big]\big[(\mathcal{X}_{c,i} - \bar{\mathcal{X}}_c)_{(j)}\big]^{\top}$
6:   $[\,], \Sigma, [\,] = SVD(\bar{W}_j^{-1}\bar{B}_j)$  ⊲ SVD; keep the singular values only
7:   $\zeta = \max(\mathrm{diag}(\Sigma))$  ⊲ Take the maximum singular value
8:   $M = \bar{B}_j - \zeta\,\bar{W}_j$
9:   $U, [\,], [\,] = SVD(M)$  ⊲ SVD; keep the left singular vectors
10:  $\hat{U}^{(j)} = U[:, 1\!:\!P_j]^{\top}$  ⊲ Take the first $P_j$ columns and transpose
11: Return $\{\hat{U}^{(j)},\ j = 1, \ldots, d\}$

Instead of maximizing the scatter difference, CMDA learns the projection matrices by maximizing the ratio of (2.13) to (2.14). Different from the existing TDA work [147, 103] that also uses the scatter ratio, CMDA includes the orthogonality assumption on the projection matrices so that the algorithm converges over iterations. The objective function for CMDA is
$\max_{U^{(j)}}\ \frac{\|B_j\|_{\mathrm{Fro}}^2}{\|W_j\|_{\mathrm{Fro}}^2} = \frac{\mathrm{tr}\big(U^{(j)}\bar{B}_j U^{(j)\top}\big)}{\mathrm{tr}\big(U^{(j)}\bar{W}_j U^{(j)\top}\big)} \quad \text{s.t.}\ U^{(j)}U^{(j)\top} = I \qquad (2.16)$
For each mode $j$, the projection matrix can again be estimated with a singular value decomposition. The whole algorithm estimates $U^{(j)}$ over iterations until all the estimated projection matrices are approximately orthogonal. We summarize this in Algorithm 4.
Algorithm 4 CMDA Projection Learning 1: procedure CMDA 2: Input: Tensors {X X 𝑐,𝑖 ; 𝑖 = 1, 2, ..., 𝑛𝑐 ; 𝑐 = 1, 2}, Target dimension 𝑃1 , 𝑃2 ..., 𝑃 𝑑 , 𝜂, maxitor ( 𝑗) 3: Initialize 𝑈 0 = 1 ; 𝑗 = 1, .., 𝑑 ⊲ Initialize value, matrix with values all equal to 1 4: while 𝑡 6 maxiter do ⊲ Repeat up to max iteration 5: for j = 1 2, 3, ...,d do 𝑗¯ 2 > X 𝑐 − X ) 𝑘≠ 𝑗 𝑈 (𝑘) ( 𝑗) · (X X 𝑐 − X ) 𝑘≠ 𝑗 × 𝑘𝑈 (𝑘) ( 𝑗) Í  Î   Î 6: 𝐵 𝑗,𝑡 = 𝑛𝑐 (X 𝑐=1 𝑗¯ 2 𝑛Í𝑐  > X𝑖,𝑐 − X 𝑐 ) 𝑘≠ 𝑗 𝑈 (𝑘) ( 𝑗) · (X X𝑖,𝑐 − X 𝑐 ) 𝑘≠ 𝑗 × 𝑘𝑈 (𝑘) ( 𝑗)   = 𝑛1 Í Î Î 7: 𝑊 𝑗,𝑡 (X 𝑐=1 𝑖=1 𝑊 𝑗𝑗,𝑡−1 · 𝐵 𝑗𝑗,𝑡 ) ¯ ¯ 8: 𝑈 , [], [] = 𝑆𝑉 𝐷 (𝑊 ⊲ SVD ( 𝑗) 9: 𝑈ˆ 𝑡 = 𝑈 [:, 1 : 𝑃 𝑗 ] > ⊲ Takes the first 𝑃 𝑗 colums and transpose Í𝑑 ( 𝑗) ( 𝑗)> 10: Check 𝑒𝑟𝑟 (𝑡) = ||𝑈ˆ 𝑡 · 𝑈ˆ 𝑡 − 𝐼 || Fro 6 𝜂 𝑗=1 11: if 𝑒𝑟𝑟 (𝑡) 6 𝜂 then 12: Stop Iteration ⊲ Quit loop 13: 𝑡 = 𝑡 + 1 ( 𝑗) 14: Return 𝑈ˆ , 𝑗 = 1, ..., 𝑑 The last step in the TDA classification is assigning class labels to new test points. For both DGTDA and CMDA, a tensor-based K-nearest neighbour method is adopted for this purpose. Recall the Frobenius norm for tensor in section 1.2, one can define distance between two tensors with Frobenius norm. For any two tensors X 1 and X 2 in the same dimension, the distance is defined as 𝑑𝑖𝑠(XX1 , X 2 ) = ||XX1 − X 2 || Fro . A K-Nearest Neighbour classifier can use distance and predict class labels for test point. We combine the subspace learning from DGTDA and CMDA algorithm with 26 this KNN classifier, and summarize the whole classification procedure for DGTDA and CMDA in algorithm 5 at the end of this part. Algorithm 5 Tensor Discriminant Analysis Classification 1: procedure TDA 2: Input: Training set 𝑇𝑛 = {X X 𝑐,𝑖 ; 𝑖 = 1, 2, ..., 𝑛𝑐 ; 𝑐 = 1, 2}, Labels {𝑦 1 , .., 𝑦 𝑛 } ∈ {0, 1}, Target dimension 𝑃1 , 𝑃2 ..., 𝑃 𝑑 , Test set {X X∗1 , ...X X∗𝑚 },𝜂, maxitor, k 3: if DGTDA  (𝑑) then(𝑑) 4: 𝑈 , ...,𝑈 𝑈 = DGTDA({X X 𝑐,𝑖 ; 𝑖 = 1, 2, ..., 𝑛𝑐 ; 𝑐 = 1, 2}, 𝑃1 , 𝑃2 ..., 𝑃 𝑑 ) 5: else 6: 𝑈 (𝑑) , ...,𝑈 𝑈 (𝑑) = CMDA({X X 𝑐,𝑖 ; 𝑖 = 1, 2, ..., 𝑛𝑐 ; 𝑐 = 1, 2}, 𝑃1 , 𝑃2 ..., 𝑃 𝑑 , 𝜂, maxitor) 7: for i = 1,..n do 8: Z 𝑐,𝑖 = X 𝑐,𝑖 ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) 9: for i = 1,.., m do 10: Z𝑖∗ = X𝑖∗ ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) 11: 𝑑 = [𝑑𝑖𝑠(Z Z1 , Z𝑖∗ ), ..., 𝑑𝑖𝑠(Z Z𝑛 , Z𝑖∗ )] 12: 𝑑 0 = arg 𝑠𝑜𝑟𝑡 (𝑑) ⊲ Sort distance in increasing order, and return the index 𝑘 𝑦𝑖∗ = 1{ 1𝑘 𝑦[𝑑𝑑 0 [𝑙]] > 0} Í 13: 𝑙=1 14: Return ∗ 𝑦 1 , ..., 𝑦 ∗𝑚 2.2.3 Tensor Regression The last type of tensor classification model we want to review in this part is tensor regression model. Tensor regression is a collection of statistical models taking tensor-shape predictors. These models can predict tensor or scalar response from the tensor covariates. [155, 93] propose tensor generalized linear regression model by assuming CP and Tucker decomposition structures on regression coefficients. [62] introduces sparse penalty on tensor CP regression models to provide an efficient and scalable model for unit-rank tensor regression problems. More recently, tensor response regression [88, 152, 99], Bayesian tensor regression [57], and tensor regression with variation norm penalty [45] are also developed. In this part, we review the CP tensor generalized linear regression model (CP-GLM) [155] for tensor classification problems. CP-GLM assumes a regression model with tensor covariates. Let 𝑔 (·) be the link function, X ∈ R 𝐼1 ×..×𝐼 𝑑 be a d-way tensor, and B ∈ R 𝐼1 ×..×𝐼 𝑑 be the tensor coefficient. 
The CP-GLM with 27 scalar response is defined as 𝑔 (𝜇) = 𝛼 + 𝛾 >𝑧 + < B , X > (2.17) 𝛾 is a scalar, and 𝑧 ∈ R 𝑝 is a 𝑝 dimensional vector predictor. 𝛾 is a 𝑝 dimensional vector coefficient as well. If B has a low-rank CP decomposition, then we can use Kruskal tensor to denote it as 𝔘B = È𝐵 𝐵 (1) , ..., 𝐵 (𝑑) É. Each 𝐵 ( 𝑗) , 𝑗 = 1, .., 𝑑 is a 𝐼 𝑗 by 𝑟 matrix whose columns are CP factors of tensor B . The rank ofr B is assumed to be 𝑟. With Khatri-rao product, the inner product between tensor coefficient and predictor can be written as < 𝔘B , X > =< 𝐵 ( 𝑗) 𝐵 (𝑑) 𝐵 ( 𝑗+1) 𝐵 ( 𝑗−1) ..., 𝐵 (1) > , X >  ..𝐵 (2.18) 𝐵 ( 𝑗) , X 𝐵 (𝑑) 𝐵 ( 𝑗+1) 𝐵 ( 𝑗−1) ..., 𝐵 (1) >  =< ..𝐵 for 𝑗-th mode CP components. If we write the inner product into a vector form, then 𝐵 ( 𝑗) can be estimated with regular maximum likelihood estimate (MLE) method by fixing 𝛼, 𝛾 , and all other 𝐵 (𝑘) , 𝑘 ≠ 𝑗. A iterative MLE can be adopted to estimate all the CP compoent matrices as well as the regular scalar and vector coefficients in model (2.17). If we denote the likelihood function as ℓ (𝛼, 𝛾 , 𝔘B ), which can be derived in the exactly same ways as ordinary GLM, the iterative MLE algorithm can be summarized in the algorithm 6. In tensor binary classification problems, we can Algorithm 6 Tensor CP Generalized Linear Model 1: procedure TDA 2: Input: {X X1 , ...X X𝑛 }, {𝑧𝑧 1 , ...𝑧𝑧 𝑛 }, 𝑦 , 𝜂, maxitor 3: Initialize 𝔘B ,0 = È00, ..00É ⊲ Initialize Kruskal tensor as 0 matrices 4: Initialize (𝛼0 , 𝑧 0 ) = arg max ℓ (𝛼, 𝛾 , 𝔘B ,0 ) 5: while t < maxitor do 6: for j = 1, ..., d do ( 𝑗) (1) ( 𝑗−1) ( 𝑗) ( 𝑗+1) (𝑑) 7: 𝐵 𝑡+1 = arg max ℓ (𝛼𝑡 , 𝛾 𝑡 , È𝐵 𝐵 𝑡+1 𝐵 𝑡+1 ....𝐵 , 𝐵 , 𝐵𝑡 ..., 𝐵 𝑡 É) (1) ( 𝑗−1) ( 𝑗) ( 𝑗+1) (𝑑) 8: (𝛼𝑡+1 , 𝑧 𝑡+1 ) = arg max ℓ (𝛼, 𝛾 , È𝐵 𝐵 𝑡+1 ....𝐵𝐵 𝑡+1 , 𝐵 𝑡+1 , 𝐵 𝑡 ..., 𝐵 𝑡+1 É) 9: if ℓ𝑡+1 − ℓ𝑡 6 𝜂 then 10: Stop 11: 𝑡 =𝑡+1 ( 𝑗) ( 𝑗+1) 12: Output: 𝛼, 𝛾 , È𝐵 𝐵 (1) ....𝐵 𝐵 ( 𝑗−1) , 𝐵 𝑡+1 𝐵𝑡 ..., 𝐵 (𝑑) É take the link function 𝑔 as the logit function, and predict the probability that P(𝑦 = 1|X X). The class labels are predicted by doing a threshold on the predicted probability. 28 2.3 Statistical Analysis As we mentioned in section 1.3, statistical consistency is one of the most important properties for decision rules as it demonstrates generalization ability of rules from the aspect of prediction risk. The consistency of tensor discriminant analysis can be easy established since DGTDA and CMDA are both converging ([92]) algorithms providing unique tensor-to-tensor projections. In addition, tensor K-nearest neighbour classifier is equivalent to vector K-nearest neighbour, which is consistent ([36]). These two facts make the tensor discriminant analysis a consistent classifier. Tensor CP-GLM models the conditional probabilities for data labels through regression model. Its consistency is guaranteed by the (strong) consistency of regression coefficients ([155]). In this section, we provide few theoretical helping to establish the statistical consistency for CP-STM. The section contains two parts for theory development. One is the universal property establishment for CP-STM kernel functions, the other is consistency proof for CP-STM. 2.3.1 Universal Tensor Kernels Our first result is about the universal property of tensor kernel functions. Kernel Universal property plays a very important role in kernel learning methods such as support vector machine and kernel regression [107]. 
Since kernel-based learning methods always estimate the optimal solution from a Reproducing Kernel Hilbert Space (RKHS), the error of approximating the complete functional space $\{f:\mathcal{X}\to\mathcal{Y},\ f\ \text{measurable}\}$ with the RKHS is critical to the generalization ability of the learned rules. In other words, the larger the approximation error of the RKHS, the more biased the predictions from kernel-based methods will be. A kernel with the universal property guarantees that the approximation error of the RKHS can be made as small as desired on any compact subspace of the input space. To present our result about the universal property for tensor kernels, we first provide a formal definition of the universal property.

Definition 2.3.1. Let $K(\cdot,\cdot)$ be a continuous kernel function defined on $\mathcal{X}\times\mathcal{X}\to\mathbb{R}$. Given a compact subspace $\mathcal{Z}\subset\mathcal{X}$ of the input space, the kernel section of $K(\cdot,\cdot)$ over $\mathcal{Z}$ is $\mathcal{K}(\mathcal{Z}) := \mathrm{span}\{K_x,\ x\in\mathcal{Z}\}$, the RKHS generated by the kernel $K(\cdot,\cdot)$ and the subspace $\mathcal{Z}$. If for every continuous function $f:\mathcal{Z}\to\mathbb{R}$ and every $\epsilon > 0$ there is a function $g\in\mathcal{K}(\mathcal{Z})$ such that
$\|f - g\|_\infty = \sup_{x\in\mathcal{Z}}|f(x) - g(x)| \le \epsilon,$
then $K(\cdot,\cdot)$ has the universal approximating property and is called a universal kernel.

With universal kernels, we immediately see that the optimal function estimated from the RKHS is also optimal among all measurable functions. This is critical in the establishment of CP-STM consistency. Thus, our first result shows that the kernel function in CP-STM can be universal.

Proposition 2.3.1. Consider the d-way CP tensor kernel function
$K(\mathcal{X}_1, \mathcal{X}_2) = \sum_{l,k=1}^{r}\prod_{j=1}^{d} K^{(j)}\big(x_{1,l}^{(j)}, x_{2,k}^{(j)}\big)$
with $\mathcal{X}_1 = \sum_{k=1}^{r} x_{1,k}^{(1)}\circ x_{1,k}^{(2)}\circ\ldots\circ x_{1,k}^{(d)}$ and $\mathcal{X}_2 = \sum_{k=1}^{r} x_{2,k}^{(1)}\circ x_{2,k}^{(2)}\circ\ldots\circ x_{2,k}^{(d)}$. If all mode-wise kernel functions $K^{(j)}(\cdot,\cdot)$ satisfy the universal approximating property in Definition 2.3.1, then $K$ also satisfies the universal approximating property, in the sense that for every continuous function $f$ defined on the compact tensor product space $\mathcal{X} = \otimes_{j=1}^{d}\mathcal{V}^{(j)}$ and every $\epsilon > 0$, there exists a function $g\in\mathcal{K}(\mathcal{X})$ in the tensor kernel section such that
$\|f - g\|_\infty = \sup_{\mathcal{X}\in\mathcal{X}}|f(\mathcal{X}) - g(\mathcal{X})| \le \epsilon.$

The proof of this proposition is provided in appendix A.1. Notice that since distance-based kernel functions such as the Gaussian RBF and polynomial kernels are universal ([107]), we can use either of them, or both, for the $K^{(j)}(\cdot,\cdot)$ to create universal tensor kernel functions.

2.3.2 Consistency of CP-STM

With universal tensor kernels, the classification consistency of CP-STM is established in the following theorem. The notation for classification risks is borrowed from section 1.3.

Theorem 2.3.1. Let $\{f_n : n\in\mathbb{N}\}$ be a sequence of CP-STM classifiers in equation (2.4), estimated from training sets $T_n$ of size $n$. Let $\mathcal{R}^*$ be the Bayes risk of the tensor binary classification problem for data from the joint distribution on $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is a d-way tensor product space with dimension $I_1\times I_2\times\ldots\times I_d$ defined in Definition 1.2.1, and $\mathcal{Y} = \{-1, 1\}$. The CP-STM decision rule $\{f_n : n\in\mathbb{N}\}$ is statistically consistent, and the classification risk of $f_n$, $\mathcal{R}(f_n)$, converges to the optimal Bayes risk,
$\mathcal{R}(f_n)\to\mathcal{R}^* \quad (n\to\infty),$
if the following conditions are satisfied:

Con.1 $\mathcal{X}$ is a compact subspace of $\mathbb{R}^{I_1\times I_2\times\ldots\times I_d}$ such that there is a constant $0 < B_x < \infty$ with $\|x_k^{(j)}\|_2 \le B_x < \infty$ for all $\mathcal{X}\in\mathcal{X}$ with $\mathcal{X} = \sum_{k=1}^{r} x_k^{(1)}\circ x_k^{(2)}\circ\ldots\circ x_k^{(d)}$.
Con.2 The loss function $L$ is self-calibrated (see [128]) and is $C(W)$ locally Lipschitz continuous, in the sense that for $|a| \le W < \infty$ and $|b| \le W < \infty$,
$|L(a, y) - L(b, y)| \le C(W)|a - b|.$
In addition, the loss function is bounded in its second argument, i.e. $\sup_{y\in\{1,-1\}} L(0, y) \le L_0 < \infty$.

Con.3 The kernel functions $K^{(j)}(\cdot,\cdot)$ used to compose the tensor kernel (2.2) are regular vector-based kernels satisfying the universal approximating property in Definition 2.3.1. Additionally, they are all bounded, so that there is a constant $0 < K_{max} < \infty$ with $\sup\sqrt{K^{(j)}(\cdot,\cdot)} \le K_{max}$ for all $j = 1, \ldots, d$.

Con.4 The hyper-parameter $\lambda = \lambda_n$ in the objective function (2.1) satisfies $\lambda_n\to 0$ and $n\lambda_n\to\infty$ as $n\to\infty$.

The proof of this theorem is provided in appendix A.2. At the end of this section, we conclude that all the tensor-based classifiers reviewed in this chapter are statistically consistent, meaning that they all have promising prediction accuracy when certain conditions are satisfied. However, their performance in practice can vary, since their excess risks may converge at different rates depending on the data distribution in specific applications. In the next section, we compare the performance of these classification methods in two real data applications.

2.4 Real Data Analysis

In this section, we provide two examples from neuroimaging and computer vision studies, and apply all the reviewed tensor-based classifiers to image classification problems.

2.4.1 MRI Classification for Alzheimer's Disease

Alzheimer's Disease (AD) is a progressive, irreversible loss of brain function that impacts memory, thinking, language, judgment, and behavior. The disease destroys patients' memory and thinking ability, and eventually makes it difficult for patients to carry out even the simplest tasks of daily living. In AD research, many novel technologies have been developed to collect diagnostic information from patients, including genetic analysis, biological and neurological tests, Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. Utilizing this information to predict patients' biological status is of great interest for early detection and biomarker development for Alzheimer's Disease. In this study, we consider using voxel-level MRI data to predict whether a patient has early AD or is a Normal Control (NC).

Brain MRI technology collects 3D images showing the anatomical structure of the brain. The data are measured in voxels, which are similar to the pixels used to display regular images; voxels, however, are 3D cuboids with specific dimensions. Each voxel contains one value standing for the average signal measured at a fixed position. A standard MRI image, also called a volume, is arranged in a 3D array that reflects brain structure by placing all voxels at their corresponding positions. Thus, voxel-level MRI images are multi-dimensional arrays loaded from Neuroimaging Informatics Technology Initiative (NIfTI) files (see [85]), and are tensors in nature. We can therefore apply the tensor-based classifiers directly to voxel-level MRI data. We collect data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), a large longitudinal study on Alzheimer's Disease.
In ADNI, the structural brain MRI is believed to be highly correlated with patients' cognition, and thus can be utilized to predict patient status, AD vs. NC. The MRI data are collected from the screening session of ADNI-1. During that session, 818 patients were selected to enter the study and received 1.5T MRI scans. These images are pre-processed by ADNI with normalization and bias correction, and are provided in the ADNI-1.5T Screening standardized data set. The data set includes both the MRI scans and the patients' dementia status, labeled as Normal Control (NC), Mild Cognitive Impairment (MCI), and Alzheimer's Disease (AD). In this study, we are particularly interested in predicting whether a patient has AD or is NC; thus, we only collect image data from NC and AD patients. The biological information about the AD group and the NC group is provided in Table 2.1. MMSE stands for the Mini-Mental State Examination, a test of cognitive function for patients with dementia; the MMSE row in Table 2.1 reports the subjects' baseline MMSE scores.

Table 2.1: Biological Information for Subjects in the ADNI Study; MMSE: baseline Mini-Mental State Examination
                          AD              NC
Num Subjects              183             219
Age (Mean ± sd)           75.28 ± 7.55    75.80 ± 4.98
Gender (Female / Male)    88 / 95         110 / 109
MMSE (Mean ± sd)          23.28 ± 2.04    29.11 ± 1.00

There are 402 MRI images in the data collection. We use the Matlab Image Processing Toolbox to register and align all the images for classification and comparison. We also use the image resize function in the toolbox to unify the voxel dimensions in all MRI images to 6mm by 6mm by 9mm. After resizing, all images are of shape 40 by 40 by 21. This step is necessary since ADNI MRI images are acquired from multiple sites; resizing guarantees that all the images can be represented by tensors of the same size. The choice of image size follows other similar statistical analyses such as [155, 114, 45].

We conduct our numerical experiment under the same protocol as the simulation study. We randomly sample 80% of the images from the AD group and 80% from the NC group to form a training set of size 321. AD is labeled as the positive class, and NC as the negative class. The remaining images are used as the test set to evaluate model performance. For each classification model, we evaluate its performance by calculating its accuracy, precision, sensitivity, and specificity on the test set. This procedure is replicated multiple times, and the average accuracy, precision, sensitivity, and specificity are reported in Table 2.2, with their standard deviations given in parentheses. In Table 2.2, CP-STM1 denotes CP-STM with the Hinge loss, and CP-STM2 denotes CP-STM with the squared Hinge loss. The areas under the ROC curves (AUC) are reported in Table 2.2 as well. Figure 2.1 summarizes the comparison in a more illustrative way.

Table 2.2: Real Data: ADNI Classification Comparison I
Models    Accuracy      Precision     Sensitivity   Specificity   AUC
CP-GLM    0.58 (0.04)   0.58 (0.07)   1.00 (0.00)   0.00 (0.00)   0.50 (0.00)
CMDA      0.70 (0.03)   0.69 (0.05)   0.67 (0.09)   0.73 (0.10)   0.65 (0.17)
DGTDA     0.70 (0.02)   0.71 (0.02)   0.59 (0.06)   0.80 (0.01)   0.64 (0.18)
CP-STM1   0.73 (0.03)   1.00 (0.00)   0.41 (0.07)   1.00 (0.00)   0.64 (0.20)
CP-STM2   0.74 (0.04)   1.00 (0.00)   0.43 (0.08)   1.00 (0.00)   0.65 (0.20)

The results in Table 2.2 show that all the tensor-based classifiers have similar prediction accuracy, with CP-STM slightly better than the others. These methods also have fairly low sensitivity, indicating that the chances of correctly detecting AD patients are low. The performance of CP-GLM is much worse than the others, which may be due to the fact that its parametric form and data distribution assumptions are not appropriate for this real data.
Comparing the two CP-STMs, we notice that the one with the squared Hinge loss is better, which may be due to the fact that the squared Hinge loss puts a bigger penalty on points that strongly violate the margin during the training procedure.

Figure 2.1: Real Data: ADNI Classification Results I (bar chart of classification accuracy for CMDA, CP-GLM, CP-STM1, CP-STM2, and DGTDA).

2.4.2 KITTI Traffic Images

The second application is traffic image recognition, an important computer vision problem. We consider the image data from the KITTI Vision Benchmark Suite; [53], [49], and [52] provide a detailed description and some preliminary studies of the data set. In this application, a 2D object detection task asks us to recognize different objects, pointed out by bounding boxes, in pictures captured by a street-facing camera. There are various types of objects in the pictures, most of which are pedestrians and cars. We select images containing only pedestrians or cars to test the performance of the classifiers.

The first step before training the classifiers is image pre-processing, which includes cropping the images and dividing them into different categories. We pick the patterns indicated by the bounding boxes and smooth them into a uniform dimension of 224 × 224 × 4. We then transform all these colored images into grey-scale, dropping the color information in order to avoid potential problems caused by the extreme dimension imbalance among the three different modes. The processed images are of size 224 by 224 and can be modeled by two-mode tensors. Figure 2.2 shows a few examples of processed images of cars and pedestrians.

Figure 2.2: Real Data: Examples of Traffic Objects in KITTI Data (grey-scale 224 × 224 crops of cars and pedestrians).

The total number of images is 33229, among which 4487 are car images and 28742 are pedestrian images. Next, we divide these images into three groups based on their quality and visibility. Images that are more than 40 pixels in height and fully visible go to the easy group. Partly visible images of 25 pixels or more in height are in the moderate group. Images that are difficult to see with bare eyes go to the hard group. These three groups of images define three classification tasks with difficulty levels easy, moderate, and hard. To overcome the class imbalance in all three groups, we randomly select 200 car images and 200 pedestrian images to form a balanced data set in each group for our numerical experiments. Pedestrian images are considered the positive class, and car images the negative class.

The following procedure is repeated 50 times in all three tasks with the balanced data sets. We randomly sample 80% of the images as the training and validation set. Classification models are estimated and tuning parameters (if any) are selected using this part of the data. The models with selected tuning parameters are then applied to the remaining 20% of the data for testing. The sampling is conducted in a stratified way, so that the proportions of pedestrian and car images are approximately the same in the training and testing sets.
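A minimal sketch of this evaluation protocol is given below. It is only an illustration, not the code used for the experiments; it assumes scikit-learn's train_test_split for the stratified split and computes the reported metrics from the standard confusion-matrix definitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_split(X, y, test_size=0.2, seed=0):
    """One stratified 80/20 split: class proportions are preserved in both parts."""
    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=seed)

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, sensitivity (recall), and specificity from the confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp > 0 else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn > 0 else 0.0,
        "specificity": tn / (tn + fp) if tn + fp > 0 else 0.0,
    }
```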
For each repetition, we calculate the same performance metrics, namely accuracy, precision (positive predictive rate), sensitivity (true positive rate), and specificity (true negative rate), for each classification method using the testing set. The average values of these rates and their standard deviations (in parentheses) are reported in Table 2.3. The areas under the ROC curves (AUC) are also reported for all methods. All the classifiers reviewed in section 2.2 are included and are denoted with the same notation as in the ADNI application. The comparison of prediction accuracy is also illustrated in Figure 2.3, in which accuracy rates are shown as bar charts with standard deviations as error bars.

Table 2.3: Real Data: Traffic Image Classification I
Task      Methods   Accuracy      Precision     Sensitivity   Specificity   AUC
Easy      CP-STM1   0.85 (0.03)   0.84 (0.05)   0.85 (0.05)   0.84 (0.06)   0.85 (0.03)
          CP-STM2   0.83 (0.04)   0.83 (0.05)   0.83 (0.05)   0.83 (0.06)   0.83 (0.04)
          CMDA      0.63 (0.07)   0.58 (0.06)   0.95 (0.08)   0.30 (0.16)   0.63 (0.07)
          DGTDA     0.84 (0.04)   0.77 (0.04)   0.96 (0.03)   0.72 (0.07)   0.84 (0.04)
          CP-GLM    0.57 (0.05)   0.57 (0.06)   0.59 (0.07)   0.55 (0.09)   0.57 (0.05)
Moderate  CP-STM1   0.78 (0.05)   0.78 (0.06)   0.77 (0.07)   0.78 (0.07)   0.78 (0.05)
          CP-STM2   0.73 (0.06)   0.75 (0.07)   0.72 (0.08)   0.75 (0.09)   0.73 (0.06)
          CMDA      0.59 (0.05)   0.55 (0.04)   0.89 (0.11)   0.28 (0.13)   0.59 (0.05)
          DGTDA     0.74 (0.06)   0.72 (0.06)   0.79 (0.08)   0.69 (0.08)   0.74 (0.05)
          CP-GLM    0.53 (0.05)   0.53 (0.05)   0.54 (0.07)   0.52 (0.08)   0.53 (0.05)
Hard      CP-STM1   0.76 (0.04)   0.84 (0.06)   0.64 (0.07)   0.87 (0.05)   0.76 (0.04)
          CP-STM2   0.74 (0.04)   0.80 (0.06)   0.63 (0.07)   0.84 (0.06)   0.74 (0.04)
          CMDA      0.53 (0.04)   0.52 (0.02)   0.91 (0.09)   0.16 (0.12)   0.53 (0.04)
          DGTDA     0.72 (0.05)   0.68 (0.04)   0.84 (0.06)   0.60 (0.08)   0.72 (0.05)
          CP-GLM    0.51 (0.06)   0.51 (0.06)   0.54 (0.07)   0.49 (0.07)   0.51 (0.06)

Comparing the prediction accuracy rates and AUC values, the CP-STM models have a significant advantage in classification performance over the other tensor-based classification models. Different from the previous study, CP-STM with the Hinge loss outperforms CP-STM with the squared Hinge loss. The reason might be that there are more tensors sitting close to the margin and weakly violating it, i.e. $y_i f_n(\mathcal{X}_i) < 1$ with $y_i f_n(\mathcal{X}_i) \approx 1$. As a result, the Hinge loss penalizes these points more than the squared Hinge loss in the estimation procedure, providing a better decision function. Together with our previous results in the ADNI study, we can conclude that the relative performance of CP-STM with the Hinge and squared Hinge losses often depends on the data distribution. The two tensor discriminant analyses perform differently in this application. DGTDA outperforms CMDA with much higher accuracy rates in all three tasks; in particular, the accuracy rate of DGTDA is only 1% less than that of CP-STM1 in the easy task. CMDA, in contrast, fails to identify a discriminative tensor-to-tensor projection in this application. The performance of CP-GLM is not as good as the others, which is similar to the results in the ADNI study.

Figure 2.3: Real Data: Traffic Classification Result I (bar chart of classification accuracy by method for the easy, moderate, and hard tasks).

2.5 Conclusion

In this chapter, we explore the possibility of using tensors to model multi-dimensional and structured data, and review several tensor-based classification models. These models utilize tensor algebraic structures and extend traditional classification methods to tensor data. CP-STM and CP-GLM are extensions of the Support Vector Machine and Generalized Linear Regression that use the tensor CP decomposition.
DGTDA and CMDA are generalizations of Fisher discriminant analysis for tensors using tensor subspace learning; such subspace learning is in fact a variant of the tensor Tucker decomposition. These models can be attractive when handling multidimensional data with various structures.

As part of our contribution, we develop the statistical consistency result for CP-STM. All of these tensor-based methods can then be regarded as consistent decision rules with good generalization ability. Our data experiments also provide empirical evidence on model performance. In our numerical study, CP-STM shows the best prediction accuracy in both applications regardless of the data distribution; however, the comparison between CP-STM with the Hinge and squared Hinge losses often depends on the data distribution. Compared with CP-STM, CP-GLM is somewhat restrictive due to its parametric form, and it sometimes fails to approximate the true data distribution. DGTDA and CMDA are both non-parametric and are flexible enough to classify different types of tensors; however, CMDA sometimes fails to converge, in which case its results deteriorate.

Inspired by these existing works, one possible direction for future work is to combine more advanced tensor representations and operations with traditional non-parametric classification models to obtain novel tensor classification methods. For example, various coupled matrix-tensor decomposition methods have been established in recent research for multimodal data integration and heterogeneous data analysis. Those decomposition methods can be adopted to extend CP-STM to multimodal data classification problems. Besides that, tensor compression via random projection or sketching is popular in multidimensional big data analysis, as it provides efficient and scalable ways to process tensors of huge size. Combining these tensor compression methods with CP-STM has great potential for big tensor data classification.

CHAPTER 3
TEC: TENSOR ENSEMBLE CLASSIFIER FOR BIG DATA

In this chapter, we consider classification problems for gigantic multi-dimensional data. Although the tensor-based classification methods mentioned in the previous chapter can analyze multi-dimensional data and preserve data structures, they may face additional challenges, such as long processing time and insufficient computer memory, when dealing with big tensor data. Previously we demonstrated the distribution-free and statistically consistent properties of the CP-STM model, and highlighted its great potential for handling a wide variety of data applications. However, training a CP-STM can be computationally expensive with high-dimensional tensors. To make it feasible for CP-STM to handle large tensors, we introduce a tensor-shaped random projection technique and combine it with CP-STM to reduce the computational time and cost for large tensors. The CP-STM estimated with randomly projected tensors is named the Random Projection-based Support Tensor Machine (RPSTM). We further develop a Tensor Ensemble Classifier (TEC) by aggregating multiple RPSTMs to control the excess classification risk brought by random projections. We demonstrate that TEC balances the computational cost against the excess classification risk, and it provides decent performance in our numerical studies.

3.1 Introduction

With the advancement of information and engineering technology, modern-day data science problems often come with data of gigantic size and increased complexities.
These complexities are often reflected by huge dimensionality and multi-way features in the observed data such as high-resolution brain imaging and spatio-temporal data. Classification problems with such high- dimensional multi-way data raise more challenges to scientists on how to process the gigantic size data while preserving their structures. Even though there are many established studies in the literature that handle the multi-way data structures with tensor representation and tensor-based models [136, 103, 63, 92, 127, 155, 114], the high dimensionality issue for tensors are rarely 40 explored especially for classification problems. High-dimensional tensors, though are already in multi-dimensional structure, can still have huge dimensionalities in different modes. The existences of high-dimensional tensors could make the current tensor-based classification models fail to provide reliable results due to extremely long processing time and huge computational cost. Current tensor-based classification approaches for high-dimensional data are mostly adopting regularization or feature extraction steps into the models. For example, [114] proposes a objective function with Lasso penalty to learn the mode-wise precision matrices in the tensor probabilistic discriminant analysis instead of estimating them directly by taking the inverse from empirical covariance matrices, which mitigates the inconsistency in estimate due to high dimensionality. Other methods such as [103, 92] utilize various higher-order principle component analysis to extract features and reduce the data complexity before applying tensor-based K-nearest neighbour classifiers for classification. However, these techniques have several deficiencies. First of all, the regularization-based methods still face the challenge of huge computational cost. Even though they can provide more consistent and robust model estimate by doing a trade-off between variance and bias, the estimation procedures are still the same as that without regularization. For instance, The estimates for ℓ1 or nuclear norm regularized models are often calculated by doing soft-thresholding on the original estimates. Thus, the computational cost for the procedure remains the same, and could be too huge to be carried out for high-dimensional tensors. Secondly, the feature extraction- based methods integrate unsupervised learning procedures to extract the feature, making it difficult to evaluate the classification consistency for the models. Finally, there is a lack of theoretical results in the current approaches that quantifies the excess risks caused by adding regularization terms and feature extracting procedures, making it difficult to depict the trade-off between the classification accuracy and the computational cost. Novel techniques and statistical frameworks are thus desired to not only optimize the computational procedures, but also integrate the statistical analysis for high-dimensional tensor classification problems. Random projection, comparing to the aforementioned techniques, turns out to be a perfect candi- date for simplifying computational complexity in high-dimensional tensor classification problems, 41 since it is easy to apply and can provide straight forward steps for statistical analysis. It projects data into lower dimensional space with randomly generated transformations to reduce data dimension, and is motivated by the well celebrated Johnson Lindenstrauss Lemma (see e.g. [34]). 
The lemma 8 log 𝑛 says that for arbitrary 𝑘 > , 𝜖 ∈ (0, 1), there is a linear transformation 𝑓 : R 𝑝 → R 𝑘 such 𝜖2 that for any two vectors 𝑥 𝑖 , 𝑥 𝑗 ∈ R 𝑝 , 𝑝 > 𝑘: (1 − 𝜖)||𝑥𝑥 𝑖 − 𝑥 𝑗 || 2 6 || 𝑓 (𝑥𝑥 𝑖 ) − 𝑓 (𝑥𝑥 𝑗 )|| 2 6 (1 + 𝜖)||𝑥𝑥 𝑖 − 𝑥 𝑗 || 2 with large probability for all 𝑖, 𝑗 = 1, ..., 𝑛. The linear transformation 𝑓 preserves the pairwise Euclidean distances between these points. The random projection has been proven to be a decent dimension reduction technique in machine learning literature [19, 46]. A lot of theoretical results on classification consistency are also established for random projection. [41] presents a Vapnik- Chervonenkis type bounds on the generalization error of a linear classifier trained on a single random projection. [29] provides a convergence rate for the classification error of the support vector machine trained with a single random projection. [24] proves that random projection ensemble classifier can reduce the generalization error further. Their results hold several types of basic classifiers such as nearest neighboring and linear/quadratic discriminant analysis. In addition to the computational efficiency and statistical consistency, [133, 71, 119] demonstrate that random projections for tensors can cost low memory, suggesting that the techniques are memory efficient. In this work, we propose a computationally efficient and statistically consistent Tensor Ensemble Classifier (TEC) which aggregates multiple CP-Support Tensor Machines (CP-STM). The new STM distinguishes from the existing works [63, 64] by combining the estimation procedure with a newly proposed tensor-shaped random projection to reduce the size of tensors and simplify the computation, making it extremely useful for high-dimensional tensors. The new STM is named as Random Projection-base Support Tensor Machine (RPSTM). Mutiple RPSTMs are then aggregated to form an ensemble classifier, TEC, to mitigate the potential information loss and reduce the extra classification risk brought by random projections. This idea is motivated by [24] and the well known Random Forest model [19]. Similar to these methods, TEC aggregates base classifiers which are estimated from randomly sampled or projected features, and makes predictions for new 42 test points by majority votes. Results from [60, 101] show that such an aggregation of decision can be a very effective tool for improving unstable classifiers like RPSTM. Our contribution: Our work alleviates the limitations of existing tensor approaches in handling big data classification problems. Specifically, the contribution of this work is threefold. 1. We successfully adopt the well known random-projection technique into high dimensional tensor classification applications and provide an ensemble classifier that can handle extremely big-sized tensor data. The adoption of random projection is shown to be a low-memory cost operation, and makes it feasible to directly classify big tensor data on regular machines efficiently. We further aggregate multiple RPSTM to form our TEC classifier, which can be statistically consistent while remaining computationally efficient. Since the aggregated base classifiers are independent of each other, the model learning procedure can be accelerated in a parallel computing platform. 2. Some theoretical results are established in order to validate the prediction consistency of our classification model. 
Unlike [29] and [24], we adopt the Johnson-Lindenstrauss lemma further for tensor data and show that the CP-STM can be estimated with randomly projected tensors. The average classification risk of the estimated model converges to the optimal Bayes risk under some specific conditions. Thus, the ensemble of multiple RPSTMs can have robust parameter estimation and provide strongly consistent label predictions. The results also highlight the trade-off between classification risk and dimension reduction created by random projections. As a result, one can take a balance between the computational cost and prediction accuracy in practice. 3. We provide an extensive numerical study with synthetic and real tensor data to reveal our ensemble classifier’s decent performance. It performs better than the traditional methods such as linear discriminant analysis and random forest, and other tensor-based methods in applications like brain MRI classification and traffic image recognition. It can also handle large tensors generated from tensor CANDECOMP/PARAFAC (CP) models, which are 43 widely applied in spatial-temporal data analysis. Besides, the computational cost is much lower for the TEC comparing with the existing methods. All these results indicate a great potential for the proposed TEC in big data and multi-modal data applications. The contents in this chapter are organized as follow: Section 3.2 reviews the basic concepts about CP-STM classification problem and tensor random projection. Section 3.3 describes our TEC classifier for high-dimensional tensor data, which includes an introduction to our proposed tensor-shaped random projection, the RPSTM, and the ensemble classifier TEC. We provide two different estimation methods for TEC model in section 3.4. In section 3.5, we establish the statistical consistency for the TEC classifier, and provide an explicit upper bound on the excess classification brought by random projection. Simulation studies and real data experiments are in section 3.7. Section 3.8 concludes the work in this chapter. 3.2 Related Works We briefly review the CP-STM for tensor classification problems and some related works about random projection. 3.2.1 CP-STM for Tensor Classification Assume there is a collection of data 𝑇𝑛 = {(X X1 , 𝑦 1 ), (X X2 , 𝑦 2 ), ..., (XX𝑛 , 𝑦 𝑛 )}, where X𝑖 ∈ X ⊂ R 𝐼1 ×𝐼2 ×...×𝐼 𝑑 are d-way tensors. X is a compact tensor space, which is a subspace of R 𝐼1 ×𝐼2 ×...×𝐼 𝑑 . 𝑦𝑖 ∈ {1, −1} are binary labels. CP-STM assumes the tensor predictors are in CP representation, and can be classified by the function which minimizes the objective function 𝑛 1Õ min 𝜆|| 𝑓 || 2 + L ( 𝑓 (X X𝑖 ), 𝑦𝑖 ) (3.1) 𝑛 𝑖=1 X) 2 𝑑X X ∫ L is a loss function for classification, and 𝜆 is a tuning parameter. || 𝑓 || 2 =< 𝑓 , 𝑓 >= 𝑓 (X is the square of functional norm for 𝑓 . By using tensor kernel function 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) Õ X1 , X 2 ) = 𝐾 (X 𝐾 ( 𝑗) (𝑥𝑥 1,𝑙 , 𝑥 2,𝑚 ) (3.2) 𝑙,𝑚=1 𝑗=1 44 𝑟 (1) (𝑑) 𝑟 (1) (𝑑) where X1 = and X2 = Í Í 𝑥 1𝑙 ◦ .. ◦ 𝑥 1𝑙 𝑥 2𝑙 ◦ .. ◦ 𝑥 2𝑙 are two different tensors. The STM 𝑙=1 𝑙=1 classifier can be written as Õ𝑛 X) = 𝑓 (X 𝛼𝑖 𝑦𝑖 𝐾 (XX𝑖 , X) = 𝛼𝑇 𝐷 𝑦 𝐾 (X X) (3.3) 𝑖=1 where X is a new d-way rank-r tensor with shape 𝐼1 × 𝐼2 × ... × 𝐼 𝑑 . 𝛼 = [𝛼1 , ..., 𝛼𝑛 ] 𝑇 are the coefficients. 𝐷 𝑦 is a diagonal matrix whose diagonal elements are 𝑦 1 , .., 𝑦 𝑛 . 𝐾 (X X) = X1 , X ), ..., 𝐾 (X [𝐾 (X X𝑛 , X )] 𝑇 is a column vector, whose values are kernel values between train- ing data and the new test data. 
We denote the collection of functions in the form of (3.3) with H , which is a functional space also known as Reproducing Kernel Hilbert Space (RKHS). The optimal classifier CP-STM 𝑓 ∈ H can be estimated by plugging function (3.3) into objective function (3.1) and minimize it with Hinge loss and Squared Hinge loss. These steps are reviewed in section 2.2.1, algorithm 1 and 2. The coefficient vector of the optimal CP-STM model is denoted by 𝛼 ∗ . The classification model is statistically consistent if the tensor kernel function satisfying the universal approximating property, as we introduced in the section 2.3. However, one potential issue of CP-STM for big tensor classification problems is the curse of dimensionality. Even a high-dimensional tensor is decomposed into its CP representation, 𝑟 (1) (𝑑) ( 𝑗) X= Í 𝑥 1𝑙 ◦ .. ◦ 𝑥 1𝑙 , 𝑥 𝑙 can still be in high-dimensional form, making it expensive to calculate 𝑙=1 the value of kernel functions. For example, Gaussian RBF kernel computes the ℓ2 norm of the difference between two input tensor CP factors. Such computation can be intensive if the inputs are in high-dimensional. Thus, we propose the RPSTM to simplify the computation with random projection, and avoid the potential issues. 3.2.2 Random Projection As a dimension reduction technique, the traditional random projection transforms vector data into a lower dimensional space via a linear transformation. The linear transformation is usually defined by a randomly generated projection matrix, 𝐴 , whose element 𝑎𝑖, 𝑗 are either from independently and identically Gaussian distribution N (0, 1) [34] or a multinomial distribution with three possible 45 outcomes [7]. The two types of random projection matrices are called Gaussian random matrix and sparse random matrix shown below.  √ 3 P = 16        𝐴 = [𝑎𝑖, 𝑗 ] ∼ N (0, 1) 𝐴 = [𝑎𝑖, 𝑗 ] = 0 P = 23    √  − 3 P = 16    Gaussian Random Matrix Sparse Random Matrix log 𝑛 P stands for probability. With either option of random projection matrices 𝐴 ∈ R 𝑘×𝑝 , 𝑘 > 𝑜( 2 ), 𝜖 𝜖 ∈ (0, 1), any two vectors 𝑥 𝑖 , 𝑥 𝑗 ∈ R 𝑝 , 𝑝 > 𝑘, the random projection satisfies (1 − 𝜖)||𝑥𝑥 𝑖 − 𝑥 𝑗 || 2 6 ||𝐴 𝐴𝑥 𝑖 − 𝐴𝑥 𝑗 || 2 6 (1 + 𝜖)||𝑥𝑥 𝑖 − 𝑥 𝑗 || 2 with large probability for all 𝑖, 𝑗 = 1, ..., 𝑛. This is called Johnson Lindenstrauss (JL) property for the random projection transformation. For higher-order tensor data, random projection is still defined as a mapping 𝑓 TRP : R 𝐼1 ...×𝐼 𝑑 → R𝑃 that transforming a high-dimensional tensor into a vector. In general, the function 𝑓 TRP is considered to be 𝑓 TRP =< A , X > (3.4) A is a projection tensor in the same size as X . To reduce the memory used for A and computa- tional cost, [133] proposes a memory efficient random projection for with the assumption that the projection tensor A is formed by Khatri-rao product product of random matrices. They shows the transformation has the JL property for 2-way tensors (matrices). [71] proposes another random projection satisfying the JL property for rank-one tensors with a projection tensor A defined by Kroncker product of random matrices. More related to our work, [119] develops a tensor random projection by assuming the projection tensor A is in either CP or tensor-train decomposition. The transformations are called CP and tensor-train random projection. Both projections are equipped with the JL property. 
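As a quick, self-contained illustration (not from the dissertation) of the two projection matrices defined above and of the JL property they provide, the following numpy snippet generates a Gaussian and a sparse random matrix and checks empirically that a pairwise distance is approximately preserved. The 1/√k normalization, which is implicit in the statements above, is made explicit here; all names are illustrative.

```python
import numpy as np

def gaussian_rp(k, p, rng):
    """Dense Gaussian random projection matrix with i.i.d. N(0, 1) entries."""
    return rng.standard_normal((k, p))

def sparse_rp(k, p, rng):
    """Sparse random projection matrix: entries +sqrt(3), 0, -sqrt(3)
    with probabilities 1/6, 2/3, 1/6."""
    return rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)],
                      size=(k, p), p=[1/6, 2/3, 1/6])

# Empirical check of the JL property: a pairwise distance is roughly preserved
# after projecting from dimension p to k and rescaling by 1/sqrt(k).
rng = np.random.default_rng(0)
n, p, k = 50, 5000, 400
X = rng.standard_normal((n, p))
for make in (gaussian_rp, sparse_rp):
    A = make(k, p, rng) / np.sqrt(k)
    Z = X @ A.T
    i, j = 3, 17
    ratio = np.linalg.norm(Z[i] - Z[j]) / np.linalg.norm(X[i] - X[j])
    print(make.__name__, round(ratio, 3))   # both ratios should be close to 1
```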
The CP random projection is a multi-linear map $f_{\text{TRP-CP}}:\mathbb{R}^{I_1\times I_2\times\ldots\times I_d}\to\mathbb{R}^P$ whose $p$-th output element is
$[f_{\text{TRP-CP}}(\mathcal{X})]_p = \frac{1}{\sqrt{P}}\langle\mathcal{A}_p, \mathcal{X}\rangle, \qquad \mathcal{A}_p = [\![A_p^{(1)}, A_p^{(2)}, \ldots, A_p^{(d)}]\!] \qquad (3.5)$
for $p = 1, 2, \ldots, P$. Here $[\![A_p^{(1)}, A_p^{(2)}, \ldots, A_p^{(d)}]\!]$ is the CP (Kruskal) form of the projection tensor $\mathcal{A}_p$, and $A_p^{(j)}\in\mathbb{R}^{I_j\times r_a}$, $j = 1, \ldots, d$, are Gaussian random matrices. $r_a$ is the CP rank of the random projection tensor $\mathcal{A}_p$, which is independent of the data tensor $\mathcal{X}$. This CP random projection can be applied efficiently when the input tensor $\mathcal{X}$ is also given in CP form. However, since the projection (3.5) transforms tensors into vectors, it may destroy the multi-way features of tensors. It also requires an element-wise transformation, which is an extra burden when the output dimension $P$ is large. We propose an alternative to the CP random projection and combine it with CP-STM for efficient computation.

3.3 Methodology

In this section, we present the methodology of our TEC classifier for high-dimensional tensors. We first introduce an alternative tensor-shaped random projection and combine it with CP-STM to construct the RPSTM classifier. The ensemble classifier TEC is then developed by aggregating multiple RPSTMs.

3.3.1 Tensor-Shaped Random Projection

We propose an alternative CP tensor-to-tensor random projection, using rank-1 projection tensors $\mathcal{A}$, that preserves the multi-way structure of tensors after the projection. The proposed tensor-to-tensor random projection is shown to be equivalent to the CP random projection (3.5) with rank-1 projection tensors $\mathcal{A}$, up to a folding-unfolding manner.

Definition 3.3.1. Suppose a d-mode CP tensor $\mathcal{X} = [\![X^{(1)}, X^{(2)}, \ldots, X^{(d)}]\!]$ has size $I_1\times I_2\times\ldots\times I_d$ and CP rank $r$, where $X^{(j)}\in\mathbb{R}^{I_j\times r}$ are the CP factors of $\mathcal{X}$ in matrix form. A rank-1 CP tensor-to-tensor random projection $f_{\text{TPR-CP-TT}}:\mathbb{R}^{I_1\times I_2\times\ldots\times I_d}\to\mathbb{R}^{P_1\times P_2\times\ldots\times P_d}$ is defined as
$f_{\text{TPR-CP-TT}}(\mathcal{X}) = \frac{1}{\sqrt{P}}[\![A^{(1)}X^{(1)}, A^{(2)}X^{(2)}, \ldots, A^{(d)}X^{(d)}]\!] \qquad (3.6)$
where $A^{(j)}\in\mathbb{R}^{P_j\times I_j}$ are Gaussian or sparse random projection matrices, and $P = P_1\times P_2\times\ldots\times P_d$. The projection is uniquely defined by the projection tensor $\mathcal{A} = \{A^{(1)}, A^{(2)}, \ldots, A^{(d)}\}$, the collection of random matrices.

Compared with the CP random projection (3.5), $f_{\text{TPR-CP-TT}}$ assumes the projection tensor $\mathcal{A}$ to be rank-1 and performs the projection directly on the CP factor matrices instead of element-wise. Please notice that the $A^{(j)}$ are not CP component matrices of $\mathcal{A}$, and $\mathcal{A} = \{A^{(1)}, A^{(2)}, \ldots, A^{(d)}\}$ is not Kruskal tensor notation. We use this notation for convenience, since the random projection in Definition 3.3.1 with the collection of matrices $\{A^{(1)}, A^{(2)}, \ldots, A^{(d)}\}$ is equivalent to the CP random projection with tensors in equation (3.5). We show this in the following proposition.

Proposition 3.3.1. Let $\pi: P_1\times P_2\times\ldots\times P_d\to P$ be an invertible unfolding rule such that $\mathcal{X}_{p_1,\ldots,p_d} = \mathrm{Vec}(\mathcal{X})_{\pi(p_1,\ldots,p_d)}$, where $1\le p_j\le P_j$ and $1\le p = \pi(p_1, \ldots, p_d)\le P$ are indices. The random projection (3.5) with a rank-1 projection tensor $\mathcal{A}$ is equivalent to the projection (3.6) up to the unfolding rule $\pi$.

The proof is provided in appendix B.1. The projection can reduce the computational cost significantly for tensor-based models, as it transforms the tensor CP components into lower-dimensional spaces. We now introduce the RPSTM classifiers estimated from the output of the tensor random projection (3.6); a short numerical sketch of the projection itself is given first.
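The sketch below is a minimal illustration under the assumption of dense numpy CP factor matrices; the helper names are hypothetical and not part of the dissertation. It generates the collection {A^(1), ..., A^(d)} and applies the rank-1 CP tensor-to-tensor projection (3.6); the overall 1/√P scale is absorbed into the first projected factor so that the Kruskal representation matches (3.6).

```python
import numpy as np

def gaussian_projection_tensor(in_dims, out_dims, seed=0):
    """Collection {A^(1),...,A^(d)} of Gaussian random matrices, A^(j) of shape (P_j, I_j)."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((p, i)) for p, i in zip(out_dims, in_dims)]

def tpr_cp_tt(cp_factors, proj_mats):
    """Rank-1 CP tensor-to-tensor random projection (3.6).
    Each CP factor X^(j) (I_j x r) is mapped to A^(j) X^(j) (P_j x r); the single
    1/sqrt(P) scale of the Kruskal tensor is absorbed into the first projected factor."""
    P = np.prod([A.shape[0] for A in proj_mats])
    out = [A @ X for A, X in zip(proj_mats, cp_factors)]
    out[0] = out[0] / np.sqrt(P)
    return out

def kruskal_to_dense(factors):
    """Reconstruct the dense tensor sum_k f1[:,k] o f2[:,k] o ... (for small shape checks)."""
    dims = [F.shape[0] for F in factors]
    T = np.zeros(dims)
    for k in range(factors[0].shape[1]):
        comp = factors[0][:, k]
        for F in factors[1:]:
            comp = np.multiply.outer(comp, F[:, k])
        T += comp
    return T
```

A quick usage check: projecting CP factors of a 100 × 100 × 100 tensor with target dimensions (10, 10, 10) produces projected factors of shapes (10, r), and `kruskal_to_dense` on them returns a 10 × 10 × 10 tensor.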
3.3.2 Random-Projection-Based Support Tensor Machine (RPSTM) With tensor-shaped CP random projection, we reformulate the model for the tensor classification problem. Let 𝑇𝑛A = {(X XA XA , 𝑦 ), (X 1 1 , 𝑦 ), ..., (X 2 2 XA 𝑛 , 𝑦 𝑛 )} be the random projection of the original training data 𝑇𝑛 such that XA 𝑖 = 𝑓 TPR-CP-TT (X X𝑖 ) (3.7) for all X𝑖 from 𝑇𝑛 . The random projection 𝑓 TPR-CP-TT is uniquely defined by the fixed CP random projection tensor A = {𝐴 𝐴 (1) , 𝐴 (2) , ..., 𝐴 (𝑑) }, where each 𝐴 ( 𝑗) ∈ R𝑃 𝑗 ×𝐼 𝑗 . The original training tensors are transformed into a lower dimensional space with size 𝑃1 × 𝑃2 ... × 𝑃 𝑑 , i.e. 48 XA𝑖 ∈ R 𝑃1 ×𝑃2 ...×𝑃 𝑑 . Similar to CP-STM, RPSTM tries to find an optimal function 𝑓 such that it optimizes the objective function 𝑛 XA 1Õ min 𝜆|| 𝑓 || 2 + L ( 𝑓 (X 𝑖 ), 𝑦𝑖 ) (3.8) 𝑛 𝑖=1 Instead of using the original data 𝑇𝑛 , the objective function measures the empirical classification loss on the randomly projected training data 𝑇𝑛A . A new kernel function for randomly pro- jected tensors is defined as follow: For any pair of CP tensors X 1 = È𝑋 𝑋 1(1) , 𝑋 1(2) , ..., 𝑋 1(𝑑) É, X 2 = È𝑋𝑋 2(1) , 𝑋 2(2) , ..., 𝑋 2(𝑑) É, the kernel function is   𝐾 (X A A X1 , X 2 ) = 𝐾 𝑓 TPR-CP-TT (X X1 ), 𝑓 TPR-CP-TT (X X2 )  1 = 𝐾 √ [𝐴 𝐴 (1) 𝑋 1(1) , 𝐴 (2) 𝑋 1(2) , ..., 𝐴 (𝑑) 𝑋 1(𝑑) ], 𝑃  1 (1) (1) (2) (2) (𝑑) (𝑑) (3.9) √ [𝐴 𝐴 𝑋 2 , 𝐴 𝑋 2 , ..., 𝐴 𝑋 2 ] 𝑃 𝑅 Ö 𝑑 Õ 1 ( 𝑗) 1 ( 𝑗) = 𝐾 ( 𝑗) ( √ 𝐴 ( 𝑗) 𝑥 1,𝑙 , √ 𝐴 ( 𝑗) 𝑥 2,𝑘 ) 𝑙,𝑘=1 𝑗=1 𝑃 𝑃 ( 𝑗) where 𝐴 (1) , ..., 𝐴 (𝑑) are the projection matrices of A that defines the projection 𝑓 TPR-CP-TT . 𝑥 1,𝑙 ( 𝑗) ( 𝑗) ( 𝑗) are the columns of 𝑋 1 , and 𝑥 2,𝑘 are the columns of 𝑋 2 . 𝐾 ( 𝑗) are still vector-based kernel functions measuring inner products for factors in different tensor modes. A Random Projection-based Support Tensor Machine (RPSTM), with the kernel function (3.9), will be in the form of Õ 𝑛   𝑔 (XX) = 𝛽𝑖 𝑦𝑖 𝐾 𝑓 TPR-CP-TT (X X𝑖 ), 𝑓 TPR-CP-TT (X X) 𝑖=1 (3.10)   = 𝛽 𝑇 𝐷 𝑦 𝐾 𝑓 TPR-CP-TT (X X) for a new given tensor X due the representer theorem [9]. Notice that we use 𝑔 to denote functions spanned by tensor kernels to distinguish it from the random projection function 𝑓 𝑇 𝑃𝑅−𝐶𝑃−𝑇𝑇 and   the original STM classifier 𝑓 . 𝐾 𝑓 𝑇 𝑃𝑅−𝐶𝑃−𝑇𝑇 (X X) is again a column vector whose elements are kernel values between projected training data 𝑓 TPR-CP-TT (X X𝑖 ) and the projected new observation 𝑓 TPR-CP-TT (X X). 𝐷 𝑦 is the diagonal matrix whose diagonal is 𝑦 = [𝑦 1 , ..., 𝑦 𝑛 ] 𝑇 . 𝛽 is the coefficient 49 vector and is differentiated from the notation of CP-STM. We denote the collection of functions in the form of (3.10) with H A , which is also a reproducing kernel Hilbert space (RKHS). The optimal classifier can be estimated by plugging the function (3.10) into the objective function (3.8) and minimize it. Let 𝑔𝑛 denote the optimal function satisfying 𝑛 XA 1Õ 𝑔𝑛 = arg min 𝜆|| 𝑓 || 2 + L ( 𝑓 (X 𝑖 ), 𝑦𝑖 ) (3.11) HA 𝑓 ∈H 𝑛 𝑖=1 then 𝑔𝑛 is the RPSTM classifier associated with the fixed random projection 𝑓 𝑇 𝑃𝑅−𝐶𝑃−𝑇𝑇 that we estimated from the training data. The label for new observation tensor X will be predicted by Sgn[𝑔𝑔𝑛 (XX)]. Thanks to the tensor random projection, the estimation of RPSTM can be compu- tationally efficient and feasible for high-dimensional tensors. The computational benefit will be discussed in details in the model estimation part of this paper. 
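To make the projected kernel (3.9) and the decision rule (3.10) concrete, here is a minimal sketch, not the dissertation's implementation: it evaluates mode-wise Gaussian RBF kernels on the projected CP factors, with each mode-wise input scaled by 1/√P as written in (3.9), and forms the RPSTM decision value. All names are illustrative.

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    d = u - v
    return np.exp(-gamma * np.dot(d, d))

def projected_cp_kernel(factors_1, factors_2, proj_mats, gamma=1.0):
    """Kernel (3.9) between two CP tensors after the tensor-shaped random projection:
    mode-wise kernels are evaluated on A^(j) x^(j) / sqrt(P) instead of x^(j)."""
    P = np.prod([A.shape[0] for A in proj_mats])
    Z1 = [A @ X / np.sqrt(P) for A, X in zip(proj_mats, factors_1)]
    Z2 = [A @ X / np.sqrt(P) for A, X in zip(proj_mats, factors_2)]
    val = 0.0
    for k in range(Z1[0].shape[1]):
        for l in range(Z2[0].shape[1]):
            prod = 1.0
            for U, V in zip(Z1, Z2):
                prod *= rbf(U[:, k], V[:, l], gamma)
            val += prod
    return val

def rpstm_decision(beta, y, train_factors, proj_mats, test_factors, gamma=1.0):
    """RPSTM decision value (3.10): g(X) = sum_i beta_i y_i K(f(X_i), f(X))."""
    kvec = np.array([projected_cp_kernel(Fi, test_factors, proj_mats, gamma)
                     for Fi in train_factors])
    return float(np.sum(beta * y * kvec))
```

The predicted label is the sign of `rpstm_decision`, exactly as with CP-STM; the only change relative to the un-projected kernel is that each mode-wise kernel input has dimension P_j instead of I_j.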
3.3.3 TEC: Ensemble of RPSTM While random projection provides extra efficiency by transforming tensor CP components into lower dimension, there is no guarantee that the projected data will preserve the same margin for every single random projection. As a result, the expected excess risk of RPSTM may be larger than the original CP-STM. In order to mitigate the impact of random projection and provide robust class assignments, multiple RPSTMs are aggregated to form a Tensor Ensemble Classifier (TEC). Let 𝑏 1 Õ 𝜏𝑛,𝑏 (XX) = X)] Sgn[𝑔𝑔𝑚,𝑏 (X (3.12) 𝑏 𝑚=1 𝑏 is the number of RPSTM classifiers estimated with different random projections. 𝑔𝑚,𝑏 are RPSTM classifiers learned independently from the training data 𝑇𝑛 . The TEC classifier is then defined as  1 if 𝜏𝑛,𝑏 (X X) > 𝛾   X) =  𝑒 𝑛,𝑏 (X (3.13)  −1  Otherwise  for a new test tensor X in CP form. 𝛾 is the threshold parameter. For simple majority vote and class- balanced binary classification, 𝛾 = 0. However, it can be different values if any prior information is provided. 50 3.4 Model Estimation In this section, We present two estimation procedures for TEC model using two different loss functions. More importantly, we emphasize significant computational efficiency of TEC model by comparing its algorithmic steps and memory costs to the CP-STM, and even to the naive vectorized support vector machine models. Since TEC is an aggregation of multiple independent RPSTMs, we only have to show the details for a single RPSTM estimation and aggregate them. Similar to the estimation of CP-STM, we can use Hinge loss and Squared Hing loss in objective function (3.8) to measure the empirical classification risk and estimate RPSTM classifiers. With Hinge loss, the objective function becomes 𝑛 XA 1Õ  min 𝜆|| 𝑓 || 2 + max 0, 1 − 𝑓 (X ) · 𝑦𝑖 (3.14) HA 𝑛 𝑖 𝑓 ∈H 𝑖=1 Like CP-STM, the optimization problem (3.14) is equivalent to a quadratic programming problem, which is shown in [25]. Once we calculate the kernel matrix 𝐾 A as XA , XA XA , XA XA X A )   𝐾 (X ) 𝐾 (X ) ... 𝐾 (X , 𝑛   1 1 1 2 1 A A XA , XA XA A )  X2 , X 1 ) 𝐾 (X   A 𝐾 (X 2 2 ) ... 𝐾 (X 2 , X 𝑛 𝐾 =   (3.15)  ... ... ... ...      XA A 𝑛 , X 1 ) 𝐾 (X XA A 𝑛 , X2 ) A A X𝑛 , X 𝑛 )    𝐾 (X ... 𝐾 (X   with tensor kernel function (3.9). The quadratic programming problem is defined as 𝛽 𝐷 𝑦 𝐾 A 𝐷 𝑦 𝛽 − 1𝑇 𝛽 1 𝑇 min 𝛽 ∈R𝑛 2 S.T. 𝛽𝑇 𝑦 = 0 (3.16) 1 0𝛽 2𝑛𝜆 Same optimization techniques used in CP-STM can be adopted to solve this problem. A TEC classifier can then be estimated by repeating the procedure for multiple times with different random ( 𝑗) projections. The steps are summarized in the algorithm 7 below. 𝑋 ℎ [:, 𝑙] is the 𝑙-th column of ( 𝑗) the tensor CP factor matrix 𝑋 ℎ . The output of the algorithm contains a list of RPSTM coefficients 51 Algorithm 7 Hinge TEC 1: procedure TEC Train 2: Input: Training set 𝑇𝑛 = {X X𝑖 }, 𝑦 , kernel function 𝐾, tensor rank r, 𝜆, number of ensemble 𝑏 3: for i = 1, 2,...n do 4: X𝑖 = [𝑋 𝑋 𝑖(1) , ..., 𝑋 𝑖(𝑑) ] ⊲ CP decomposition 5: for m = 1, 2, ..., b do (1) (2) (𝑑) 6: Generate random projection tensor A 𝑚 = {𝐴 𝐴𝑚 , 𝐴 𝑚 , ..., 𝐴 𝑚 } 7: Create initial matrix 𝐾 A 𝑚 ∈ R𝑛×𝑛 8: for i = 1,...,n do 9: for h = 1,...,i do 𝑟 Î 𝐾 A 𝑚 [𝑖, ℎ] = Í 𝑑 𝐾 (𝐴 ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) 10: 𝑗=1 𝐴𝑚 𝑋 𝑖 [:, 𝑘], 𝐴 𝑚 𝑋 ℎ [:, 𝑙]) ⊲ Kernel values 𝑘,𝑙=1 11: 𝐾 A𝑚 [ℎ, 𝑖] = 𝐾 A𝑚 [𝑖, ℎ] 12: Solve the quadratic programming problem (3.16) and find the optimal 𝛽 𝑚∗ . Output: 𝑚∗ , A 13:  1∗𝛽 2∗ 𝑚 A1 , ..., A 𝑏 ]  Output: 𝛽 , 𝛽 , ..., 𝛽 𝑏∗ , [A A1 , ..., A 𝑏 ]. 
The stored coefficients and their corresponding random projection tensors are both needed when predicting new test points.

To estimate RPSTM with the Squared Hinge loss, we can use the Gauss–Newton method to optimize the objective function
$$\min_{f \in \mathcal{H}^{\mathcal{A}}} \; \lambda \|f\|^2 + \frac{1}{n} \sum_{i=1}^{n} \max\big(0,\, 1 - f(\mathcal{X}_i^{\mathcal{A}}) \cdot y_i\big)^2 \qquad (3.17)$$
by setting its derivative to zero. Since the procedure is identical to the derivation (2.9) in section 2.2, we provide the updating rule for the parameter $\beta$ directly:
$$\beta^* = \frac{1}{n} D_y \Big(\lambda I + \frac{1}{n} I_s K^{\mathcal{A}}\Big)^{-1} I_s\, y, \qquad (3.18)$$
where $I_s$ is the diagonal matrix whose diagonal elements indicate whether the corresponding tensors are support tensors. Starting from an initial value of $\beta$, we update the parameter iteratively with (3.18) until its value converges. The steps are summarized in Algorithm 8 below, in which $k_i^{\mathcal{A}_m}$ denotes the $i$-th column vector of the kernel matrix $K^{\mathcal{A}_m}$.

Algorithm 8 Squared Hinge TEC
1: procedure TEC Train
2:   Input: training set $T_n = \{\mathcal{X}_i\}$, labels $y$, kernel function $K$, tensor rank $r$, $\lambda$, $\eta$, maxiter, number of ensemble members $b$
3:   for i = 1, 2, ..., n do
4:     $\mathcal{X}_i = [X_i^{(1)}, \dots, X_i^{(d)}]$  ⊲ CP decomposition
5:   for m = 1, ..., b do
6:     Create initial matrix $K^{\mathcal{A}_m} \in \mathbb{R}^{n \times n}$
7:     Generate random projection tensor $\mathcal{A}_m = \{A_m^{(1)}, A_m^{(2)}, \dots, A_m^{(d)}\}$
8:     for i = 1, ..., n do
9:       for h = 1, ..., i do
10:        $K^{\mathcal{A}_m}[i, h] = \sum_{k,l=1}^{r} \prod_{j=1}^{d} K\big(A_m^{(j)} X_i^{(j)}[:, k],\, A_m^{(j)} X_h^{(j)}[:, l]\big)$  ⊲ Kernel values
11:        $K^{\mathcal{A}_m}[h, i] = K^{\mathcal{A}_m}[i, h]$
12:     Set $\beta^{m*} = \mathbf{1}_{n \times 1}$, $\beta^{m} = \mathbf{0}_{n \times 1}$  ⊲ Initial values
13:     Iteration = 0
14:     while $\|\beta^{m*} - \beta^{m}\|_2 > \eta$ and Iteration ≤ maxiter do
15:       $\beta^{m} = \beta^{m*}$
16:       Find $s \in \mathbb{R}^{n \times 1}$ with $s_i \in \{0, 1\}$ such that $s_i = 1$ if $y_i\, k_i^{\mathcal{A}_m T} \beta^{m} < 1$  ⊲ Support tensors
17:       $I_s = \mathrm{diag}(s)$  ⊲ Diagonal matrix with $s$ on the diagonal
18:       $\beta^{m*} = \frac{1}{n} D_y \big(\lambda I + \frac{1}{n} I_s K^{\mathcal{A}_m}\big)^{-1} I_s\, y$  ⊲ Update
19:     Output: $\beta^{m*}$
20:   Output: $\big[\beta^{1*}, \beta^{2*}, \dots, \beta^{b*}\big]$ and $[\mathcal{A}_1, \dots, \mathcal{A}_b]$
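As an illustration of this estimation step, the sketch below implements the fixed-point iteration (3.18) at the core of Algorithm 8, assuming NumPy and that the projected kernel matrix $K^{\mathcal{A}}$ has already been computed; the function name and convergence defaults are illustrative, not code from the dissertation.

```python
import numpy as np

def fit_squared_hinge_rpstm(K, y, lam, eta=1e-6, maxiter=100):
    # K: projected kernel matrix from (3.15); y: +-1 label vector; lam: penalty.
    n = len(y)
    Dy = np.diag(y)
    beta_new, beta_old = np.ones(n), np.zeros(n)
    it = 0
    while np.linalg.norm(beta_new - beta_old) > eta and it < maxiter:
        beta_old = beta_new
        # Support-tensor indicator: s_i = 1 when y_i * k_i^T beta < 1.
        s = (y * (K @ beta_old) < 1).astype(float)
        Is = np.diag(s)
        # Update rule (3.18): beta = (1/n) D_y (lam*I + (1/n) I_s K)^{-1} I_s y.
        beta_new = Dy @ np.linalg.solve(lam * np.eye(n) + Is @ K / n, Is @ y) / n
        it += 1
    return beta_new
```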
With an estimated TEC model, we can make predictions for new test points. The prediction steps are identical whether the model was estimated with the Hinge loss or the Squared Hinge loss; they are stated in Algorithm 9. For convenience, we keep using the notation for the decomposed training tensors $\mathcal{X}_i = [X_i^{(1)}, \dots, X_i^{(d)}]$ from the estimation steps. $k^{\mathcal{A}_m}$ is a column vector of length $n$ whose $i$-th element is the kernel value between the training tensor $\mathcal{X}_i$ and the new test point $\mathcal{X}$.

Algorithm 9 TEC Prediction
1: procedure TEC Predict
2:   Input: TEC coefficients $\big[\beta^{1*}, \beta^{2*}, \dots, \beta^{b*}\big]$, random projection tensors $[\mathcal{A}_1, \dots, \mathcal{A}_b]$, kernel function $K$, tensor rank $r$, new test point $\mathcal{X}$, threshold parameter $\gamma$
3:   $\mathcal{X} = [X^{(1)}, \dots, X^{(d)}]$  ⊲ CP decomposition of the new observation
4:   $\tau_{n,b} = 0$  ⊲ Initial value in equation (3.12)
5:   for m = 1, ..., b do
6:     for i = 1, ..., n do
7:       $k^{\mathcal{A}_m}[i] = \sum_{k,l=1}^{r} \prod_{j=1}^{d} K\big(A_m^{(j)} X_i^{(j)}[:, k],\, A_m^{(j)} X^{(j)}[:, l]\big)$  ⊲ Kernel vector
8:     $\tau_{n,b} = \tau_{n,b} + \mathrm{Sign}\big[k^{\mathcal{A}_m\,T} D_y \beta^{m*}\big]$  ⊲ Update equation (3.12)
9:   If $\tau_{n,b} > \gamma$, the prediction is class 1; otherwise it is $-1$.  ⊲ Equation (3.13)
10:  Output: prediction

Suppose the projected training tensors have shape $P_1 \times P_2 \times \dots \times P_d$. The time complexity of the kernel matrix computation is $O(n^2 r^2 d \sum_{j=1}^{d} P_j)$. Notice that the choices of $P_j$ are free from the original tensor dimensions $I_j$ and are only related to the training data size $n$, following the JL lemma. We can therefore choose relatively small $P_j$'s, so that the total time complexities of Algorithms 7 and 8 become $O(n^2 r^2 d \sum_{j=1}^{d} P_j + l_1)$ and $O(n^2 r^2 d \sum_{j=1}^{d} P_j + l_2)$, respectively, when the training data are given in their projected CP decomposition forms. Here $l_1$ and $l_2$ denote the additional steps required for the quadratic programming in Algorithm 7 and the iterations in Algorithm 8; empirically, both are bounded by the order of $O(n^2)$, as shown in [25]. Since $I_j \gg P_j$ and $I_j \gg n$ for $j = 1, \dots, d$ in high-dimensional tensor problems, the time complexities of RPSTM in Algorithms 7 and 8 are significantly smaller than those of CP-STM, which are $O(n^2 r^2 d \sum_{j=1}^{d} I_j + l_1)$ and $O(n^2 r^2 d \sum_{j=1}^{d} I_j + l_2)$. Because each RPSTM is estimated independently, the TEC model can be fitted in a parallel manner; as a result, the time complexity of TEC is roughly the same as that of a single RPSTM, which is also much smaller than that of CP-STM. As for the memory complexity, CP-STM requires $O(nr \sum_{j=1}^{d} I_j + n)$, which is more prohibitive than the $O(bnr \sum_{j=1}^{d} P_j + bn)$ required by TEC with $b$ aggregated RPSTMs. Since the memory complexity is dominated by the dimension of the projected CP factors, TEC turns out to be more efficient. If we further consider the naive vectorized SVM model, its time and memory complexities are $O(n^2 \prod_{j=1}^{d} I_j + l_1)$ and $O(n \prod_{j=1}^{d} I_j)$. Through this comparison, it is clear that the TEC model is much more computationally efficient than both CP-STM and the traditional vectorized SVM. The comparison is summarized in Table 3.1.

Models | Time Complexity | Memory Complexity
TEC (Parallel) | $O(n^2 r^2 d \sum_{j=1}^{d} P_j + l_1)$ / $O(n^2 r^2 d \sum_{j=1}^{d} P_j + l_2)$ | $O(bnr \sum_{j=1}^{d} P_j + bn)$
RPSTM | $O(n^2 r^2 d \sum_{j=1}^{d} P_j + l_1)$ / $O(n^2 r^2 d \sum_{j=1}^{d} P_j + l_2)$ | $O(nr \sum_{j=1}^{d} P_j + n)$
CP-STM | $O(n^2 r^2 d \sum_{j=1}^{d} I_j + l_1)$ / $O(n^2 r^2 d \sum_{j=1}^{d} I_j + l_2)$ | $O(nr \sum_{j=1}^{d} I_j + n)$
Vectorized SVM | $O(n^2 \prod_{j=1}^{d} I_j + l_1)$ | $O(n \prod_{j=1}^{d} I_j)$
Table 3.1: TEC: Comparison of Computational Complexity

One may notice that the discussion above does not include the complexity of the tensor CP decomposition and of the random projection itself. Since both RPSTM and CP-STM require CP decomposition, subtracting this part of the complexity does not affect the comparison. Moreover, recent CP decomposition methods from [116, 137] reach a time complexity of $O(nr \sum_{j=1}^{d} I_j)$; adding this part makes neither CP-STM nor RPSTM exceed the time complexity of the vectorized SVM. As for the random projection, definition 3.3.1 allows the projection tensor to be composed of sparse projection matrices, so techniques such as low-rank matrix decomposition can be used to reduce the computation and memory costs. In addition, [90] showed that the projection matrices can be very sparse, making the complexity of the random projection trivial compared with the other estimation steps.
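For completeness, the sketch below shows one way the mode-wise projection matrices $A^{(j)} \in \mathbb{R}^{P_j \times I_j}$ could be drawn, either with i.i.d. standard Gaussian entries or with a standard very sparse construction (entries $\pm\sqrt{s}$ with probability $1/(2s)$ each and zero otherwise). The sparse scheme is a common choice in the random projection literature and is an assumption here, not necessarily the exact construction of definition 3.3.1; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_projection(P, I):
    # Dense projection with i.i.d. N(0, 1) entries.
    return rng.standard_normal((P, I))

def sparse_projection(P, I, s=3):
    # Very sparse projection: +-sqrt(s) w.p. 1/(2s) each, 0 w.p. 1 - 1/s
    # (an assumed, standard construction from the sparse-projection literature).
    return rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)], size=(P, I),
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

def cp_random_projection(P_dims, I_dims, sparse=False):
    # One projection tensor A = {A^(1), ..., A^(d)} for a d-mode tensor.
    draw = sparse_projection if sparse else gaussian_projection
    return [draw(P, I) for P, I in zip(P_dims, I_dims)]

# Example: project 50 x 50 x 50 tensors down to 35 x 35 x 35, roughly the
# P_j = int(0.7 * I_j) rule of thumb suggested in the tuning discussion below.
proj = cp_random_projection([35, 35, 35], [50, 50, 50])
```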
We briefly discuss tuning parameter selection at the end of this section. The number of ensemble classifiers, $b$, and the threshold parameter, $\gamma$, are chosen by cross-validation. We first set $\gamma = 0$, the midpoint between the two labels $-1$ and $1$, and search for $b$ in a reasonable range, between 2 and 20. The optimal $b$ is the one that yields the best classification model. In the next step, we fix $b$ and search $\gamma$ between $-1$ and $1$ with a step size of $0.1$, choosing the value with the best classification accuracy. For a simple majority vote, $\gamma$ can be set to zero directly. The choice of the random projection matrices is more complicated. Although the matrices themselves are easy to generate, the appropriate projection dimension remains unclear: our guideline, the JL lemma, only provides a lower bound on the dimension, and only for the vector case. As a result, in practice we choose the dimension based on intuition and cross-validation. Empirically, we suggest choosing the projection dimension $P_j \approx \mathrm{int}(0.7 \times I_j)$ for each mode.

3.5 Statistical Properties

In this section, we develop the statistical consistency of both the TEC and RPSTM models. In addition, we establish an explicit upper bound on the excess risk brought by random projection, highlighting the trade-off between computational efficiency and potential risk. For convenience, we introduce a few more notations. $R_L(f)$ is the classification risk of a specific decision function $f$, defined as
$$R_L(f) = \mathbb{E}_{(\mathcal{X} \times \mathcal{Y})}\, L\big(y, f(\mathcal{X})\big) = \int L\big(y, f(\mathcal{X})\big)\, d\mathbb{P},$$
where $L$ is a loss function. The empirical risk of the decision function $f$ over the training data $T_n$ is
$$R_{L,T_n}(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(\mathcal{X}_i)\big),$$
which is the estimate of $R_L(f)$ on finite training data. The subscript $L$ in the risk notation emphasizes that the risk is calculated with a specific loss function. We also use risk notations without the subscript $L$, such as $R(f)$, to denote the risk of $f$ under the zero-one loss $L(y, z) = \mathbb{1}\{y \neq z\}$, where $\mathbb{1}$ is the indicator function. The definition of classification consistency is initially stated for the zero-one loss. We use $R^*$ to denote the Bayes risk of the tensor classification problem over the joint distribution $\mathcal{X} \times \mathcal{Y}$. It is the optimal risk in the sense that, over all measurable functions $f: \mathcal{X} \to \mathbb{R}$, $R^* = \min_f R(f)$. A decision rule $\{f_n\}$ is said to be consistent if $R(f_n) \to R^*$ as $n \to \infty$, see [36]. We have to show this for TEC and RPSTM in order to establish their consistency properties. However, existing results [13, 128] show that convergence of the surrogate-loss risk implies convergence of the zero-one-loss risk, i.e.
$$R_L(f_n) \to R_L^* \;\Rightarrow\; R(f_n) \to R^*, \qquad R_L^* = \min_f R_L(f),$$
for any decision rule $\{f_n\}$. This conclusion holds as long as the loss function is self-calibrated, and both the Hinge loss and the Squared Hinge loss used in our models are self-calibrated. As a result, we only have to show $R_L(f_n) \to R_L^*$ for TEC and RPSTM.

Recall that in sections 3.2 and 3.3 we use $\mathcal{H}$ and $\mathcal{H}^{\mathcal{A}}$ to denote the reproducing kernel Hilbert spaces spanned by the tensor kernels (3.2) and the projected tensor kernels (3.9). Let $f_n^{\lambda}, f^{\lambda} \in \mathcal{H}$ be such that
$$f_n^{\lambda} = \arg\min_{f \in \mathcal{H}} \; \lambda \|f\|^2 + R_{L,T_n}(f), \qquad f^{\lambda} = \arg\min_{f \in \mathcal{H}} \; \lambda \|f\|^2 + R_L(f).$$
$f_n^{\lambda}$ is the CP-STM classifier estimated from the training data $T_n$, minimizing the objective function on $T_n$, while $f^{\lambda}$ is the optimal CP-STM learned from a training set of infinite size. The superscript $\lambda$ in both functions indicates that they are optimal for the given value of $\lambda$ in the objective functions. Notice that neither $f_n^{\lambda}$ nor $f^{\lambda}$ minimizes the empirical or expected risk itself; they minimize the objective functions, which include regularization terms. We further denote $R_{L,\mathcal{H}}^* = \min_{f \in \mathcal{H}} R_L(f)$, the optimal risk that can be achieved by functions in $\mathcal{H}$.
Similarly, we define 𝑛 , 𝑔𝜆 𝑔𝜆 ∈ H A as 𝑔𝑛𝜆 = arg min 𝜆||𝑔𝑔 || 2 + R (𝑔𝑔 ) 𝑔𝜆 = arg min 𝜆||𝑔𝑔 || 2 + R L (𝑔𝑔 ) L,𝑇𝑛A 𝑔 ∈HHA HA 𝑔 ∈H 𝑛 XA XA )) for a given tensor  = 𝑛1 Í where R (𝑔𝑔 ) L 𝑦𝑖 , 𝑔 (X ) and R L (𝑔𝑔 ) = E (X×Y) L (𝑦, 𝑔 (X L,𝑇𝑛A 𝑖=1 𝑖 random projection defined by A . 𝑔𝑛𝜆 is the optimal RPSTM model estimated from the projected training data 𝑇𝑛A , and 𝑔 𝜆 is the infinite-sample estimate. We derive the proof of consistency for TEC 𝑒 𝑛,𝑏 𝜆 and RPSTM 𝑔 𝜆 models with the regularization 𝑛 parameter 𝜆. Notice that R L (𝑒𝑒𝑛,𝑏 𝜆 ) and R (𝑔 𝜆 L 𝑔𝑛 ) are calculated with a specific random projection tensor A . Thus, we develop the consistency results on the expected risk EA R L (𝑒𝑒𝑛,𝑏  𝜆 )  and   EA R L (𝑔𝑔𝑛𝜆 ) instead to demonstrate the average performance and integrate the impacts of different possible random projections. The expectation is taken over the distribution of random projection tensors. 3.5.1 Excess Risk of TEC We first boud the expected risk of our TEC classifier 𝑒𝜆𝑛,𝑏 by using the result from [24], theorem 2. Theorem 3.5.1. For each 𝑏 ∈ N, 𝑒𝜆𝑛,𝑏 is the TEC classifier aggregating 𝑏 independent RPSTMs 57 𝑔𝑛𝜆 . 𝜆 is the parameter for the functional norm in the objective function (3.8). Then 1 EA R (𝑒𝑒𝜆𝑛,𝑏 ) − R ∗ 6 EA R (𝑔𝑔𝑛𝜆 ) − R ∗       min(𝛾, 1 − 𝛾) This result says that the ensemble model TEC is statistically consistent as long as the base classifier is consistent. With the surrogate loss property, we only need to develop the consistency for RPSTM ∗ converges to zero.   model by showing excessive risk of surrogate loss EA R L (𝑔𝑔𝑛𝜆 ) − R L 3.5.2 Excess Risk of RPSTM To show the consistency of RPSTM, we first use the following proposition to decompose the excess ∗ , into several parts and bound them separately.   risk of RPSTM, EA R L (𝑔𝑔𝑛𝜆 ) − R L Proposition 3.5.1. The excess risk is bounded above:      𝜆  ∗  𝜆 EA R L (𝑔𝑔𝑛 ) − R L 6 EA R L (𝑔𝑔𝑛 ) − R A (𝑔𝑔𝑛 ) + EA R 𝜆   (𝑓𝜆 ) − R L ( 𝑓A𝜆 ,𝑛 )  L,𝑇𝑛 L,𝑇𝑛A A ,𝑛     + R L ( 𝑓𝑛𝜆 ) − R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) + R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 )    𝜆 𝜆 2  𝜆 𝜆 2 + EA R L ( 𝑓A ,𝑛 ) + 𝜆|| 𝑓A ,𝑛 || − R L ( 𝑓 𝑛 ) − 𝜆|| 𝑓 𝑛 || + 𝐷 (𝜆) + R L,H ∗ ∗ H − RL (3.19)   Where 𝐷 (𝜆) = R L ( 𝑓 𝜆) + 𝜆|| 𝑓 𝜆 || ∗ − R L,H . 𝑓𝜆 H A ,𝑛 = 𝛼 ∗𝑇 𝐷 𝑦𝐾 X) , which is a function 𝑓 TPR-CP-TT (X in H A with the coefficient vector being the optimal coefficient estimate from CP-STM. Notice that 𝑓 𝜆 is different from 𝑔𝑛𝜆 since its coefficients are estimated from the CP-STM model A ,𝑛 and original tensor data 𝑇𝑛 . Recall that we use 𝛼 ∗ and 𝛼 in section 3.2 to denote optimal CP-STM and CP-STM coefficients. In other words, the coefficients of 𝑓A𝜆 ,𝑛 are the same as the 𝑓𝑛𝜆 . However, their kernel basis functions will have different values, making them as two different functions. The proof of proposition 3.5.1 is provided in the appendix B.2. The proposition unveils the fact that the excess risk can be bounded by four types of risks: 58 1. Gaps between empirical risk and expected risk:    𝜆 𝜆    EA R L (𝑔𝑔𝑛 ) − R A (𝑔𝑔𝑛 ) R L ( 𝑓𝑛𝜆 ) − R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) L,𝑇𝑛    𝜆    𝜆 EA R A ( 𝑓A ,𝑛 ) − R L ( 𝑓A ,𝑛 ) R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 ) L,𝑇𝑛 2. Extra risk brought by random projection:    𝜆 𝜆 2  𝜆 𝜆 2 E𝐴 R L ( 𝑓A ,𝑛 ) + 𝜆|| 𝑓A ,𝑛 || − R L ( 𝑓𝑛 ) − 𝜆|| 𝑓𝑛 || 3. 𝐷 (𝜆), approximation error between the minimal regularized objective function and the risk of class optimal. This term depicts how regularized objective function approaches to the class optimal risk R L,H ∗ as the parameter 𝜆 vanishes. H ∗ 4. R L,H − RL ∗ measures the approximation error of the reproducing kernel Hilbert space H H . 
Later we show that with "nice" kernel functions, the functions in the RKHS H can approximate any measureable function as close as possible (in terms of infinite norm). Next, we develop explicit bounds all these components. In the following part, we suppose that all the conditions listed below hold. AS.1 The loss function L is 𝐶 (𝑊) local Lipschitz continuous in the sense that for |𝑎| 6 𝑊 < ∞ and |𝑏| 6 𝑊 < ∞ |L (𝑎, 𝑦) − L (𝑏, 𝑦)| 6 𝐶 (𝑊)|𝑎 − 𝑏| In addition, we need sup L (0, 𝑦) 6 𝐿 0 < ∞. 𝑦∈{1,−1} AS.2 The kernel functions 𝐾 ( 𝑗) (·, ·) used to composite the coupled tensor kernel (3.2) are regular vector-based kernels satisfying the universal approximating property, see [107]. A kernel has this property if it satisfies the following condition. Suppose X is a compact subset of the Euclidean space R 𝑝 , and 𝐶 (X X ) = { 𝑓 : X → R} is the collection of all continuous functions defined on X . The kernel function is also defined on X × X , and its reproduction kernel Hilbert space (RKHS) is H . Then ∀𝑔𝑔 ∈ 𝐶 (X X ), ∃ 𝑓 ∈ H such that ∀𝜖 > 0, ||𝑔𝑔 − 𝑓 || ∞ = sup |𝑔𝑔 (𝑥𝑥 ) − 𝑓 (𝑥𝑥 )| 6 𝜖. X 𝑥 ∈X 59 p AS.3 Assume the tensor kernel function (3.2) is bounded, i.e. sup 𝐾 (·, ·) = 𝐾𝑚𝑎𝑥 < ∞. As a result, the projected kernel function (3.9) is also bounded by 𝐾𝑚𝑎𝑥 for any arbitrary random projection. AS.4 For each component 𝐾 ( 𝑗) (·, ·) in the kernel function (3.2), we assume 𝐾 ( 𝑗) (𝑎, 𝑏) = ℎ ( 𝑗) (||𝑎 − ( 𝑗) 𝑏|| 2 ) or ℎ ( 𝑗) (h𝑎, 𝑏i). ℎ : R → R are functions. We assume that all of them are 𝐿 𝐾 -Lipschitz continuous ( 𝑗) |ℎℎ ( 𝑗) (𝑡1 ) − ℎ ( 𝑗) (𝑡 2 )| 6 𝐿 𝐾 |𝑡 1 − 𝑡2 | 𝐼 ( 𝑗) where 𝑡 1 , 𝑡2 ∈ R 𝑗 are different CP components. Further, let 𝐿 𝐾 = max 𝐿 𝐾 . 𝑗=1,..,𝑑 AS.5 For random projection tensors A = {𝐴 𝐴 (1) , ..., 𝐴 (𝑑) }, suppose all the 𝐴 ( 𝑗) have their elements identically independently distributed as N (0, 1). The dimension of 𝐴 ( 𝑗) is 𝑃 𝑗 × 𝐼 𝑗 . For a 𝛿1 ∈ (0, 1) and 𝜖 > 0, we assume 1 [log 𝛿𝑛 ] 𝑑 1 𝑃𝑗 = 𝑂( ), 𝑗 = 1, 2, ..., 𝑑 𝜖2 𝜖 is considered as the error or distortion caused by random projection. AS.6 The hyper-parameter in the regularization term 𝜆 = 𝜆 𝑛 satisfies: 𝜆𝑛 → 0 𝑛𝜆 𝑛 → ∞ 𝑛𝜆2𝑛 → ∞ as 𝑛→∞ 𝑟 AS.7 For all the tensor data X = 𝑥 𝑘(1) ◦ 𝑥 𝑘(2) ... ◦ 𝑥 𝑘(𝑑) , assume ||𝑥𝑥 𝑘( 𝑗) || 2 6 𝐵𝑥 < ∞. Í 𝑘=1 AS.8 Suppose that there is a CP random projection defined by A such that the Bayes risk in the projected data, R L,A ∗ , remains unaltered A ∗ R L,A ∗ A = RL ∗ R L,A = minR L,A A A ( 𝑓 ) where 𝑓 is any measurable function mapping projected data into class 𝑓 assignments. 60 AS.9 For 𝐷 (𝜆), we assume there is a relation between 𝐷 (𝜆) and 𝜆 𝐷 (𝜆) = 𝑐 𝜂 𝜆𝜂 0<𝜂61 𝑐 𝜂 is a constant depending on 𝜂. AS.10 Suppose the projection error ratio for each mode vanishes at rates depending on loss function. 𝜖𝑛 𝑞 →0 𝑛→∞ 𝜆𝑛 Where 𝜖 𝑛 is the 𝜖 in the assumption AS.5. For hinge loss 𝑞 = 1 and square hinge loss 𝑞 = 32 . AS.11 The probability of projection error is diminishing with increase of sample size, 1  𝛿1 = 𝑂 𝑛 exp(−𝑛 𝑑 ) The assumption AS.1, AS.3, AS.6, and AS.7 are commonly used in supervised learning problems with kernel tricks.(see, e.g. [139, 82, 128]) Assumption AS.4 and AS.5 are needed to help to establish the explicit bound on extra errors brought by random projections. Assumption AS.10 further gives out the condition that the extra errors brought by random projections can converge to zero as 𝑛 goes to infinity. Condition AS.8 assumes that it is possible to learn the optimal decision rule from randomly projected tensors. The optimal risk is still achievable after random projection. 
This helps to align our results to the definition of consistency in [36], and guarantees that RPSTM ∗ . A more detailed discussion about this condition is provided   are consistent if EA R L (𝑔𝑔𝑛𝜆 ) → R L in the appendix B.3. Condition AS.2 is the sufficient condition that the tensor kernel function (3.2) ∗ is universal (see proof in [89]), making R L,H − RL ∗ to be zero. Finally, condition AS.9 guarantees H ∗ that R L,H has a minimizer, and 𝑓 𝜆 converges to the minimizer as 𝜆 goes to zero. (See definition H 5.14 and corollary 5.18 in the section 5.4 of [128]) With the assumption AS.7, the gaps between empirical risk and expected risk are easily bounded by the Hoeffding Inequality (see e.g. [36]). Also, result from [89] says R L,H ∗ = RL∗ due to H condition AS.2. There are only two terms in the proposition 3.5.1 left to be bounded. The extra risk from random projection and the approximation error 𝐷 (𝜆). For these two parts, our strategy is 61 proving the convergence or risk under a single random projection first, and then use the dominant convergence theorem to show the convergence of expected risks. Condition AS.11 entails that probability of projection as well as expected risk difference vanishes with increase in sample size (ℓ1 convergence). 3.5.3 Price of Random Projection Applying random projection in the training procedure is indeed doing a trade-off between prediction accuracy and computational cost. We give out an explicit upper bound on the extra risk brought by random projections. Without taking expectation, the following proposition gives out an upper bound on the extra risk when a random projection A is given. Proposition 3.5.2. Assume a tensor CP random projection is defined by A , whose components are generated independently and identically from a standard Gaussian distribution. With the assumptions AS.1, AS.4, AS.5, AS.6, and AS.7, for the 𝜖 𝑑 described in AS.5. With probability (1 − 2𝛿1 ) and 𝑞 = 1 for hinge loss, and 𝑞 = 23 for square hinge loss function respectively. 𝜖𝑑 |R L ( 𝑓A𝜆 ,𝑛 ) + 𝜆|| 𝑓A𝜆 ,𝑛 || 2 − R L ( 𝑓𝑛𝜆 ) − 𝜆|| 𝑓𝑛𝜆 || 2 | = 𝑂 ( 𝑞 ) 𝜆 where 𝑛 is the size of training set, 𝑑 is the number of modes of tensor. The proof of this proposition is provided in the appendix B.4. The value of 𝑞 depends on loss function as well as kernel and geometric configuration of data, which is discussed in the appendix. This proposition highlights the trade-off between dimension reduction and prediction risk. As the reduced dimension 𝑃 𝑗 is related to 𝜖 negatively, small 𝑃 𝑗 can make the term converges at a very slow rate. 3.5.4 Convergence of Risk Now we summarize the previous results and establish the convergence of risk for RPSTM classifier under a single random projection. The following theorem unveils the explicit convergence rate of 62 RPSTM classifier model. Theorem 3.5.2 (RPSTM Convergence Rate). Suppose all the assumptions AS.1 - AS.8 hold. For 4 𝜖 > 0, let the projected dimension 𝑃 𝑗 = d3𝑟 𝑑 𝜖 −2 [𝑙𝑜𝑔(𝑛/𝛿1 )] 2 e + 1 for each 𝑗 = 1, 2, ..𝑑. The excess risk of a RPSTM with a specific random projection is bounded with probability at least (1 − 2𝛿1 ) (1 − 𝛿2 ), i.e., ∗ 6 𝑉 (1) + 𝑉 (2) + 𝑉 (3) R L (𝑔𝑔𝑛𝜆 ) − R L q √ q q 𝐿0 𝐿0 log(2/𝛿2 ) 2 log(2/𝛿2 ) • 𝑉 (1) = 12𝐶 (𝐾𝑚𝑎𝑥 𝜆 ) · 𝐾𝑚𝑎𝑥 √ + 9𝜁𝜆 ˜ 2𝑛 + 2𝜁𝜆 𝑛 𝑛𝜆 • 𝑉 (2) = 𝐷 (𝜆) q 𝐿0 • 𝑉 (3) = 𝐶𝑑,𝑟 Ψ · [𝐶 (𝐾𝑚𝑎𝑥 𝜆 ) + 𝜆Ψ]𝜖 𝑑 where q q 𝐿0 𝐿0 • 𝐶 (𝐾𝑚𝑎𝑥 𝜆 ) is a constant depending on 𝐾𝑚𝑎𝑥 𝜆 . 
• 𝛿1 ∈ (0, 21 ) and 𝛿2 ∈ (0, 1) • 𝜁𝜆 = sup{L ( 𝑓 𝜆 (X X), 𝑦) : (X X, 𝑦) ∈ X × Y , } q 𝐿0 • 𝜁˜𝜆 = sup{L ( 𝑓 (X X), 𝑦) : all 𝑓 : X → R, || 𝑓 || ∞ 6 𝐾𝑚𝑎𝑥 𝜆 , and X, 𝑦) ∈ X × Y } all (X • 𝐶𝑑,𝑟 = (2𝐿 𝐾 𝐵𝑥2 ) 𝑑 𝑟 2 𝑛 • Ψ = sup{||𝛼 X) = 𝛼 𝑇 𝐷 𝑦 𝐾 (X X) ∈ H } Í 𝛼 || 1 = |𝛼𝑖 | : 𝑓 (X 𝑖=1 We prove the theorem, and explain the terms listed above in appendix B.5. The symbol d·e means rounding a value to the nearest integer above its current value. The theorem provides an upper bound controlling the convergence of excessive risk for RPSTM, which holds with probability at least (1 − 2𝛿1 )(1 − 𝛿2 ). This probability is defined on the join distribution of three random variables, tensor data X , labels y, and the random projection A . The component (1 − 𝛿2 ) is from the randomness in sampling the training data 𝑇𝑛 , and (1 − 2𝛿1 ) is caused by random projection. It is clear that the projection error 𝜖, random projection probability parameter 𝛿1 , and projection 63 4 dimension 𝑃 𝑗 are connected by equation 𝑃 𝑗 = d3𝑟 𝑑 𝜖 −2 [𝑙𝑜𝑔(𝑛/𝛿1 )] 2 e + 1. As a result, one can express the projection error as 𝜖 𝑑 = 𝑟 2 𝑙𝑜𝑔( 𝛿𝑛 )/ 𝑑𝑗=1 𝑃 𝑗 , which is a function of 𝛿1 when fixing the Î 1 size of training data 𝑛 and projected dimension 𝑃 𝑗 . To obtain a higher probability of getting upper bounds, one can consider decreasing 𝛿1 and allowing the projection error to increase. Alternative, one can choose a higher projected dimension 𝑃 𝑗 to have the same level of projection error, but a higher chance of bounding the excessive risk. It worth noting that 𝑃 𝑗 grows as sample size 𝑛 goes to infinity. However, huge 𝑃 𝑗 will make our proposed model infeasible and prohibitive. Thus, the projection error 𝜖 should be replaced by 𝜖 𝑛 in assumption AS.10 to guarantee 𝑃 𝑗 << 𝐼 𝑗 . Theorem 3.5.2 provides an upper bound in general to control the excess risk of RPSTM. In the theorem, there are few quantities related to the loss function L. These terms can be further expressed with specific loss functions such as Hinge and Squared Hinge loss. The next two propositions extend the theorem 3.5.2 with Hinge and Squared Hinge loss, and provide explicit upper bound on RPSTM. 𝜇 𝜇 Proposition 3.5.3. For square hinge loss, let 𝜖 = ( 𝑛1 ) 2𝑑 for 0 < 𝜇 < 1, and 𝜆 = 1 ( 𝑛) 2𝜂+3 for some 4 𝜇 0 < 𝜂 6 1. Assume 𝑃 𝑗 = d3𝑟 𝑑 𝑛 𝑑 [𝑙𝑜𝑔(𝑛/𝛿1 )] 2 e + 1 for each mode 𝑗 = 1, 2, ..., 𝑑. For some 𝛿1 ∈ (0, 12 ) and 𝛿2 ∈ (0, 1), with probability (1 − 𝛿2 )(1 − 2𝛿1 ) s 𝜇𝜂 ∗ 6 𝐶 𝑙𝑜𝑔( 2 )( 1 ) 2𝜂+3 R L (𝑔𝑔𝑛𝜆 ) − R L (3.20) 𝛿2 𝑛 Where 𝐶 is a constant. The rate of convergence is faster with increase in sample size, when high value of 𝜇 is chosen. For 𝑑 𝜇 → 1 the risk difference rate becomes ( 𝑛1 ) 5 . The proof of this result is in appendix B.6. 𝜇 𝜇 Proposition 3.5.4. For hinge loss,Let 𝜖 = ( 𝑛1 ) 2𝑑 for 0 < 𝜇 < 1 and 𝜆 = 1 ( 𝑛) 2𝜂+2 for some 4 𝜇 0 < 𝜂 6 1, 𝑃 𝑗 = d3𝑟 𝑑 𝑛 𝑑 [𝑙𝑜𝑔(𝑛/𝛿1 )] 2 e + 1 , For some 𝛿1 ∈ (0, 12 ) and 𝛿2 ∈ (0, 1) with probability (1 − 2𝛿1 ) (1 − 𝛿2 ) s 𝜇𝜂 ∗ 6 𝐶 𝑙𝑜𝑔( 2 1 2𝜂+2 R L (𝑔𝑔𝑛𝜆 ) − R L )( ) (3.21) 𝛿2 𝑛 Where 𝐶 is a constant. 64 The rate of convergence is faster with increase in sample size, when high value of 𝜇 is chosen. For 𝑑 𝜇 → 1 the risk difference rate becomes ( 𝑛1 ) 4 . The proof of this result is in appendix B.6. Finally, we show the convergence of expected risk Theorem 3.5.3 (Convergence of Expected Risk). Suppose assumptions AS.1 - AS.11 hold. 
The excess risk goes to zero in expectation as sample size increases, the EA denote expectation with respect to tensor random projection A , E𝑛 denote expectation with respect to uniform measure on samples ∗ |→0 E𝑛 |EA [R L (𝑔𝑛𝜆 )] − R L This is the expected risk convergence building on top of our previous results. The proof is provided in the appendix B.7. This theorem concludes that the expected risk of RPSTM converges to the optimal Bayes risk under surrogate loss L. With the aforementioned property about L and theorem 3.5.1, the RPSTM and our ensemble model TEC are statistically consistent. 3.6 Simulation Study We provide a simulation study in this section to compare the empirical performance of our TEC model and some other classification methods. Both vector-based and tensor-based methods in the current literature are considered in this comparison. For vector-based methods, we include Gaussian-RBF SVM from [128], BudgetSVM from [38], Linear Discriminant Analysis from [48], and Random Forest from [21]. For tensor-based methods, we select few highly cited models including Direct General Tensor Discriminant Analysis (DGTDA) and Constrained Multilinear Discriminant Analysis (CMDA) from [92]. The synthetic tensor data are generated using the idea from [43], which creates CP tensors by generating random tensor factors and computing the sum of mulit-way outer products. To have a more comprehensive comparison, we also generate Tucker tensors (see [78]) to show how TEC performs when the tensors are not well approximated by CP decomposition. We listed out the data generating models below: 65 1. F1 Model: Low dimensional rank 1 tensor factor model with each components confirming the same distribution. Shape of tensors is 50 × 50 × 50. X1 = 𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) 𝑥 ( 𝑗) ∼ N (00, 𝐼50 ), 𝑗 = 1, 2, 3 X 2 = 𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) 𝑥 ( 𝑗) ∼ N (0.5 0.5, 𝐼50 ), 𝑗 = 1, 2, 3 0.5 2. F2 Model: High dimensional rank 1 tensor with normal distribution in each component. Shape of tensors is 50 × 50 × 50 × 50. X 1 = 𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) ◦ 𝑥 (4) 𝑥 ( 𝑗) ∼ N (00, Σ ( 𝑗) ), 𝑗 = 1, 2, 3, 4 X 2 = 𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) ◦ 𝑥 (4) 𝑥 ( 𝑗) ∼ N (11, Σ ( 𝑗) ), 𝑗 = 1, 2, 3, 4 (4) Σ (1) = 𝐼 , Σ (2) = 𝐴𝑅(0.7), Σ (3) = 𝐴𝑅(0.3), Σ𝑖, 𝑗 = min(𝑖, 𝑗) 3. F3 Model: High dimensional rank 3 tensor factor model. Components confirm different Gaussian distribution. Shape of tensors is 50 × 50 × 50 × 50. 3 (1) (2) (3) (4) ( 𝑗) Õ X1 = 𝑥𝑘 ◦ 𝑥𝑘 ◦ 𝑥𝑘 ◦ 𝑥𝑘 𝑥 𝑘 ∼ N (00, Σ ), 𝑗 = 1, 2, 3, 4 𝑘=1 3 (1) (2) (3) (4) ( 𝑗) Õ X2 = 𝑥𝑘 ◦ 𝑥𝑘 ◦ 𝑥𝑘 ◦ 𝑥𝑘 𝑥 𝑘 ∼ N (11, Σ ), 𝑗 = 1, 2, 3, 4 𝑘=1 (4) Σ (1) = 𝐼 , Σ (2) = 𝐴𝑅(0.7), Σ (3) = 𝐴𝑅(0.3), Σ𝑖, 𝑗 = min(𝑖, 𝑗) 4. F4 Model: Low dimensional rank 1 tensor factor model with components confirming different distributions. Shape of tensor is 50 × 50 × 50. X 1 = 𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) each element of 𝑥 (1) ∼ Γ(4, 2), 𝑥 (2) ∼ N (11, 𝐼 ), each element of 𝑥 (3) ∼ 𝑈 (1, 2) X 2 = 𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) each element of 𝑥 (1) ∼ Γ(6, 2), 𝑥 (2) ∼ N (11, 𝐼 ), each element of 𝑥 (3) ∼ 𝑈 (1, 2) 66 5. F5 Model: A higher dimensional version of F4 model. Tensors are having four modes with dimension 50 × 50 × 50 × 50 X 1 =𝑥𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) ◦ 𝑥 (4) each element of 𝑥 (1) ∼ Γ(4, 2), 𝑥 (2) ∼ N (11, 𝐼 ), each element of 𝑥 (3) ∼ Γ(2, 1), each element of 𝑥 (4) ∼ 𝑈 (3.5, 4.5) X 2 =𝑥𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) ◦ 𝑥 (4) each element of 𝑥 (1) ∼ Γ(5, 2), 𝑥 (2) ∼ N (11, 𝐼 ), each element of 𝑥 (3) ∼ Γ(2, 1), each element of 𝑥 (4) ∼ 𝑈 (4.5, 5.5) 6. T1 Model: A Tucker model. 𝑍 (1) , 𝑍 (2) ∈ R50×50×50 with elements independently and identically distributed. 
The size of factor matrices are all 50 by 50. X 1 = 𝑍 (1) ×1 Σ (1) ×2 Σ (2) ×3 Σ (3) each element of 𝑍 (1) ∼ N (0, 1) X 2 = 𝑍 (2) ×1 Σ (1) ×2 Σ (2) ×3 Σ (3) each element of 𝑍 (2) ∼ N (0.5, 1) Σ (1) = 𝐼 , Σ (2) Random Orthogonal Matrix Σ (3) = 𝐴𝑅(0.7) The models F1 - F5 generates CP tensors, whose components confirm various probability distribu- tions. F3 is a rank-3 CP tensor model. T1 is a Tucker tensor models constructed using mode-wise product (see [78]). The mode-1 factor is an identity matrix, mode-2 factor is a randomly generated orthogonal matrix, and mode-3 factor is an auto-regression matrix. We call each classification problem using tensors generated from these models as classification tasks. Thus, there are six tasks in this simulation study. For each synthetic data, we generate 100 samples from class 1 and another 100 samples from class 2. Each time, we first subsample the data to form a training set of size 160, then use the remaining 40 observations to form the test set. We conduct stratified sampling to form training and test sets so that the percentages of each class are the same in both training and test sets. The training set is used to train and validate classifiers. Then the classifiers with the optimal tuning parameters are evaluated on the test set. We record the percentage of true predictions, Total True Predictions × 100%, Predictions 67 as the accuracy of a classifier on the test set. The experiments are repeated for multiple times, and the mean and the standard deviation of accuracy over all repetitions are reported in the table 3.2 below. For fair comparison, all the computations are done on a desktop with a 12-core CPU and 32GB RAM. We record the average time cost for model estimation over all repetitions, and notate "NA" (Not Available) in the table if a classifier cannot be estimated with the limited resources. We believe this could give an overview about the capabilities of different classifiers when handling big tensor data. More technical details about this simulation study is provided in the appendix B.6. Notice that in the table 3.2, we use TEC1 and TEC2 to denote TEC models estimated with Hinge and Square Hinge Loss. AAM, LLSM, and BSGD are three variants of SVM for scalable and high-dimensional data analysis from BudgetedSVM package. The first thing we observe from the simulation study is that all the vector-based methods fail to deliver result in F2, F3, and F5 due to memory insufficiency. We later test these models using the same simulation data but on a high performance cluster which has 128GB memory. Their performance and the comparison are included in the appendix B.9. Among the tensor-based methods, such space insufficiency would not be an issue since data storing in tensors can better utilize computer memories. However, CMDA method fail to provide results in F2, F3, and F5 as its optimization procedure takes extremely long time. This failure is not due to memory limitation but high time complexity. On the other hand, our TEC models utilize tensors to handle big data with limited memory, and provide results in all tasks with high efficiency. The CP decomposition and random projection techniques in our TEC model further reduces the number of elements to be stored in memory. Notice that although our original tensors have the same number of elements as the vectorized data, the tensor decomposition can be done independently on each data. 
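Before turning to the results, the following is a minimal sketch, assuming NumPy, of how the rank-one CP samples of the F1 model could be generated; the helper names and the use of `einsum` for the three-way outer product are illustrative choices, not code from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

def f1_sample(mean, dims=(50, 50, 50)):
    # One rank-1 sample x^(1) o x^(2) o x^(3); each factor is drawn from N(mean, I).
    factors = [rng.normal(loc=mean, scale=1.0, size=d) for d in dims]
    return np.einsum('i,j,k->ijk', *factors)

# 100 samples per class, with labels +1 / -1 as in the simulation design.
class1 = np.stack([f1_sample(0.0) for _ in range(100)])
class2 = np.stack([f1_sample(0.5) for _ in range(100)])
X = np.concatenate([class1, class2])
y = np.concatenate([np.ones(100), -np.ones(100)])
```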
As a result, when the data is in huge dimension, such as the data from F2, F3, and F5, we can process the tensor decomposition and random projection one by one, storing only the randomly projected CP factors in memory and recycling memory space by deleting the original tensor. This processing pipeline distinguishes itself from other dimension reduction techniques such as principle component analysis as it can process all observations in the data set independently. It does not require to load all the data at one time and then perform 68 Model Methods TEC1 TEC2 RBF-SVM AAM LLSVM Accuracy (%) 83.96 85.70 82.00 73.75 62.81 F1 STD (%) 3.85 3.29 3.12 6.36 12.94 Time (s) 1.05 1.26 2.56 1.15 5.06 Accuracy (%) 98.08 86.70 NA NA NA F2 STD (%) 1.06 1.87 NA NA NA Time (s) 1.44 1.56 NA NA NA Accuracy (%) 96.78 98.63 NA NA NA F3 STD ( %) 2.71 1.72 NA NA NA Time (s) 12 12.6 NA NA NA Accuracy ( %) 93.80 94.10 94.13 46.33 53.13 F4 STD (%) 2.20 2.08 3.65 8.12 14.96 Time (s) 1.75 2.10 1.31 1.94 6.21 Accuracy (%) 89.38 89.78 NA NA NA F5 STD ( %) 2.55 2.25 NA NA NA Time (s) 1.49 1.63 NA NA NA Accuracy ( %) 100 100 100 84.13 100 T1 STD ( %) 0.00 0.00 0.00 5.40 0.00 Time (s) 1.05 1.26 2.55 1.59 6.17 Model Methods BSGD LDA RF CMDA DGTDA Accuracy (%) 79.84 83.75 68.45 55.25 64.25 F1 STD (%) 20.16 4.25 6.15 1.25 11.68 Time (s) 7.62 5.12 1.10 21.50 0.57 Accuracy (%) NA NA NA NA 81.50 F2 STD (%) NA NA NA NA 4.89 Time (s) NA NA NA NA 195 Accuracy (%) NA NA NA NA 93.75 F3 STD ( %) NA NA NA NA 3.23 Time (s) NA NA NA NA 198 Accuracy ( %) 57.75 82.88 84.50 80.75 77.50 F4 STD (%) 7.40 5.20 4.72 5.01 4.61 Time (s) 18.68 5.21 0.89 22.35 0.56 Accuracy (%) NA NA NA NA 77.25 F5 STD ( %) NA NA NA NA 6.56 Time (s) NA NA NA NA 1.93 Accuracy ( %) 100 100 100 85.71 85.00 T1 STD ( %) 0.00 0.00 0.00 22.59 22.91 Time (s) 1.42 5.12 0.45 22.85 0.58 Table 3.2: TEC Simulation Results I: Desktop with 32GB RAM 69 feature extraction and dimension reduction, making it appealing for extremely high-dimensional data analysis. Using only the projected tensor factors also makes the proposed TEC model finishing all the computation in a very short time. Empirical evidence in table 3.2 shows that the processing time is tremendously less than DGTDA and CMDA (no results due to high time complexity) in tasks F2, F3 and F5 where the tensor dimensions are huge. Apart from the efficiency, the results in table 3.2 highlight the promising performance of our TEC models. In tasks F1, F4 and T1 where the data dimensions are low, our TEC models have the similar performance as the RBF-SVM and its variants in BudgetedSVM. In particular, our TEC with Square Hinge loss outperforms RBF-SVM and all other competitors with significantly higher accuracy rates in task F1. Their performances in F4 are still decent, providing much higher accuracy rates than other classifiers except RBF-SVM. Their accuracy rates in F4 are only 0.5% less than that of RBF-SVM in this task. In Tucker tensor classification task T1, our TEC models continuing providing as solid performance as all other classifiers. Although the classification task is relatively easy, it still can demonstrate the capability of our TEC models in handling tensors which are not well approximated by tensor CP decomposition. The performance advantage of TEC models are even more impressive in higher-dimensional tensor classification tasks F2, F3, and F5. Due to the fact that all vector-based classifiers fail to deliver results on the testing platform, we only compare TEC1 and TEC2 with tensor discriminant analysis DGTDA. 
In F2 and F5, TEC models have about 10% more average rates than DGTDA. This advantage reduces to 3% in task F3, however, is still significant. In conclusion, the simulation study demonstrates computational efficiency as well as solid performance for our proposed tensor ensemble classifier TEC. 3.7 Real Data Analysis In this section, we compare the performance of our proposed TEC models with other existing tensor-based classifiers reviewed in chapter 2. We continue using the two real data sets in chapter 2 for experiments. CP-STM with Hinge and Squared Hinge loss from [63], CMDA and DGTDA 70 Models Accuracy Precision Sensitivity Specificity AUC TEC1 0.710.04 0.800.09 0.500.07 0.890.05 0.640.19 TEC2 0.730.03 0.840.04 0.520.09 0.910.03 0.66 0.19 CP-GLM 0.580.04 0.580.07 1.000.00 0.000.00 0.500.00 CMDA 0.700.03 0.690.05 0.670.09 0.730.10 0.650.17 DGTDA 0.700.02 0.710.02 0.590.06 0.800.01 0.640.18 CP-STM1 0.730.03 1.000.00 0.410.07 1.000.00 0.640.20 CP-STM2 0.74 0.04 1.000.00 0.430.08 1.000.00 0.650.20 Table 3.3: Real Data: ADNI Classification Comparison II from [92], and CP-GLM from [155] are included for comparison. 3.7.1 MRI Classification for Alzheimer’s Disease The first data set is ADNI MRI data from ADNI-1 screening session. The introduction of the data is provided earlier in section 2.4, and thus is omitted here. We randomly sample 80% of images from AD group and 80% from NC group to form the training set with size 321. AD is labeled as positive class, and NC is labeled as negative class. The rest images are used as test set to evaluate model performance. For each classification model, we evaluate its performance by calculating its accuracy, precision, sensitivity, and specificity on the test set. Such step is replicated for multiple times, and the average accuracy, precision, sensitivity, and specificity are reported in the table 3.3. The standard deviation of these performance metrics are also provided in the subscripts. Since the image data are already in tensors, we do not destroy their spatial structure and just compare tensor-base classifiers in this study. The results in table 3.3 shows that all the tensor-based classifiers have close accuracy in prediction, while CP-STM2 and TEC2 are slightly better than the others. Although CP-STM with Squared Hinge loss has 0.01 more in accuracy rate than TEC models, the AUC of TEC2 is slightly greater. This comparison unveils that tensor compression through random projections may not affect the model classification accuracy negatively, while it can provide a much better computation efficiency. 71 Classification Accuracy 0.8 0.6 Method CMDA CP−GLM Accuracy 0.4 CPSTM1 CPSTM2 DGTDA TEC1 TEC2 0.2 0.0 CMDA CP−GLM CPSTM1 CPSTM2 DGTDA TEC1 TEC2 Method Figure 3.1: Real Data: ADNI Classification Result II 3.7.2 KITTI Traffic Image Classification Our second real data application use the same traffic image data set from section 2.4. The introduction and data pre-processing is omitted and can be found in section 2.4. After imaging processing, the data set are divided into three groups which defines three classification tasks with levels of difficulties as easy, moderate, and hard. To maintain the balance between car and pedestrian images in the data sets, we randomly select 200 car images and 200 pedestrian images in each group for our numerical experiments. Pedestrian images are considered as the positive class data, and car images are negative class data. The following procedures are repeated for 50 times in all three tasks with the balanced data sets. 
We randomly sample 80% of images as training set, and use the rest 20% data as the testing set. The sampling is conducted in a stratified way so that the proportion of pedestrian and car images are approximately same in both training and test set. Classification models are estimated and validated in training set. Then the models with selected tuning parameters are applied on the testing set. For each repetition, we calculate the same performance metrics, accuracy rates, 72 precision (positive predictive rates), sensitivity (true positive rates), and specificity (true negative rates), for each classification method using the testing set. The average value of these rates and their standard deviations (in subscripts) are reported in the table 3.4. The areas under the ROC curves (AUC) are also reported for all the methods. Figure 3.2 shows the comparison of prediction accuracy rates, in which average accuracy rates of each method is shown by the bar charts and their standard deviations are shown by the error bars. Task Methods Accuracy Precision Sensitivity Specificity AUC CP-STM1 0.850.03 0.840.05 0.850.05 0.840.06 0.850.03 CP-STM2 0.830.04 0.830.05 0.830.05 0.830.06 0.830.04 TEC1 0.88 0.04 0.880.06 0.890.05 0.870.07 0.88 0.04 Easy TEC2 0.740.05 0.750.06 0.730.07 0.760.08 0.740.05 CMDA 0.630.07 0.580.06 0.950.08 0.300.16 0.630.07 DGTDA 0.840.04 0.770.04 0.960.03 0.720.07 0.840.04 CP-GLM 0.570.05 0.570.06 0.590.07 0.550.09 0.570.05 CP-STM1 0.780.05 0.780.06 0.770.07 0.780.07 0.780.05 CP-STM2 0.730.06 0.750.07 0.720.08 0.750.09 0.730.06 TEC1 0.85 0.04 0.850.05 0.840.06 0.850.06 0.85 0.04 Moderate TEC2 0.740.04 0.730.04 0.770.08 0.710.06 0.740.04 CMDA 0.590.05 0.550.04 0.890.11 0.280.13 0.590.05 DGTDA 0.740.06 0.720.06 0.790.08 0.690.08 0.740.05 CP-GLM 0.530.05 0.530.05 0.540.07 0.520.08 0.530.05 CP-STM1 0.760.04 0.840.06 0.640.07 0.870.05 0.760.04 CP-STM2 0.740.04 0.800.06 0.630.07 0.840.06 0.740.04 TEC1 0.77 0.05 0.780.05 0.750.07 0.780.06 0.77 0.05 Hard TEC2 0.670.05 0.700.05 0.670.07 0.660.07 0.670.04 CMDA 0.530.04 0.520.02 0.910.09 0.160.12 0.530.04 DGTDA 0.720.05 0.680.04 0.840.06 0.600.08 0.720.05 CP-GLM 0.510.06 0.510.06 0.540.07 0.490.07 0.510.06 Table 3.4: Real Data: Traffic Image Classification II Besides the conclusion we already obtained from section 2.4, the results in table 3.4 show that TEC1 model outperforms CP-STM1, the winner in our previous study, with a significant advantage in all three classification tasks. In particular, TEC model with Hinge loss has 7% more prediction accuracy rates than CP-STM1 in the moderate level classification task. This key observation can be a strong empirical evidence for our proposed TEC models supporting that the models can provide 73 Classification Accuracy 0.75 Method CMDA CP−GLM Accuracy 0.50 CPSTM1 CPSTM2 DGTDA TEC1 TEC2 0.25 0.00 Easy Hard Moderate Task Figure 3.2: Real Data: Traffic Image Classification Result II significantly better prediction accuracy rates when the data are noisy and the projected data are sufficient for classification. 3.8 Conclusion We have proposed a tensor ensemble classifier with the CP support tensor machine and random projection in this work. The proposed method can handle high-dimensional tensor classification problems much faster comparing with the existing regularization based methods. Thanks to the Johnson-Lindenstrauss lemma and its variants, we have shown that the proposed ensemble classifier has a converging classification risk and can provide consistent predictions under some specific conditions. 
Tests with various synthetic tensor models and real data applications show that the proposed TEC can provide optimistic predictions in most classification problems. Our primary focus in this work is on the classification applications on high-dimensional multi- way data such as images. Support tensor ensemble turns out to be an efficient way of analyzing such data. However, model interpretation has not been considered here. The features in the projected space are not able to provide any information about variable importance. Alternative approaches are possible for constructing explainable tensor classification models, but they are out 74 of this article’s scope. Besides that, selection for the dimension (size) of projected tensor 𝑃 𝑗 s cannot be addressed well at this moment. Although our theoretical result points out the connection between the classification risk and min 𝑃 𝑗 , discussion about how to set 𝑃 𝑗 for each mode of tensor may have to be developed in the future. In conclusion, TEC offers a new option in tensor data analysis. The key features highlighted in work are that TEC can efficiently analyze high-dimensional tensor data without compromising the estimation robustness and classification risk. We anticipate that this method will play a role in future application areas such as neural imaging and multi-modal data analysis. 75 CHAPTER 4 COUPLED SUPPORT TENSOR MACHINE FOR MULTIMODAL NEUROIMAGING DATA In this chapter, we consider a classification problem with multimodal tensor predictors. Multimodal neuroimaging data arise in various applications where information about the same phenomenon is acquired from multiple sensors and across different imaging modalities. Learning from multimodal data is of great interest in machine learning and statistics research as it offers the possibility of capturing complementary information among modalities. Multimodal learning increases model performance, explains the interdependence between heterogeneous data sources, discovers new insights that may not be available from a single modality, and improves decision-making. Recently, coupled matrix-tensor factorization has been introduced for multimodal data fusion to jointly estimate latent factors and identify complex interdependence among the latent factors. However, prior work on coupled matrix-tensor factorization mostly focuses on unsupervised learning, and very few of them utilize the jointly estimated latent factors for supervised learning. This paper considers the multimodal tensor data classification problem and proposes a Coupled Support Tensor Machine (C-STM), which is built upon the latent factors jointly estimated from Advanced Coupled Matrix Tensor Factorization (ACMTF). C-STM combines individual and shared latent factors with multiple kernels and estimates a maximal-margin classifier for coupled matrix tensor data. The classification risk of C-STM is shown to converge to the optimal Bayes risk, making it a statistically consistent rule. C-STM is validated through simulation studies as well as simultaneous EEG-fMRI analysis. The empirical evidence shows that C-STM can utilize information from multiple sources and provide a better classification performance than traditional unimodal classifiers. 
4.1 Introduction Advances in clinical neuroimaging and computational bioinformatics have dramatically in- creased our understanding of various brain functions using multiple modalities such as Magnetic 76 Resonance Imaging (MRI), functional Magnetic Resonance Imaging (fMRI), electroencephalo- gram (EEG), and Positron Emission Tomography (PET). The strong connection of these modalities to the patients’ biological status and disease pathology suggests the great potential of their predictive power in disease diagnostics. Numerous studies using vector- and tensor-based statistical models illustrate how to utilize these imaging data at both the voxel- and Region-of-Interest (ROI) levels to develop efficient biomarkers that predict disease status. For example, [8] propose a classification model using functional connectivity MRI for autism disease with 89% diagnostic accuracy. [123] utilize network models and brain imaging data to develop novel biomarkers for Parkinson’s disease. Many works in Alzheimer’s disease research such as [109, 74, 100, 37, 94] use EEG, MRI and PET imaging data to predict patient’s cognition and detect early-stage Alzheimer’s diseases. Although these studies have provided impressive results, utilizing imaging data from single modality such as individual MRI sequences are known to have limited predictive capacity, especially in the early phases of the disease. For instance, [94] use brain MRI volumes from regions of interest to identify patients in early-stage Alzheimer’s disease with 77% prediction accuracy. In recent years, it has been common to acquire multiple neuroimaging modalities in clinical studies such as simultaneous EEG-fMRI, MRI and fMRI. Even though each modality measures different physiological phenom- ena, they are interdependent and mutually informative. Learning from multimodal neuroimaging data may help integrate information from multiple sources and facilitate biomarker development in clinical studies. It also raises the need for novel supervised learning techniques for multimodal data in statistical learning literature. The existing statistical approaches to multimodal data science are dominated by unsupervised learning methods. These methods analyze multimodal neuroimaging data jointly by performing decomposition, and try to discover how the common information is overlaid across different modali- ties. During optimization, the decomposed factors bridging two or more modalities are estimated to interpret connections between multimodal data. Examples of these methods include matrix-based joint Independent Component Analysis (ICA) [23, 56, 86, 97, 129, 6] which assume bilinear cor- relations between factors in different modalities. When tensors are utilized for multi-dimensional 77 imaging modeling, various coupled matrix-tensor decomposition methods are established such as [5, 4, 6, 26, 27, 73, 110] which impose different types of soft or hard multilinear constrains between factors from different modalities. These methods further extend possible correlations between multimodal data, providing more flexibility in data modeling. Current supervised learning approaches for multimodal data simply concatenate data modalities as extra features without exploring their interdependence. For example, [155, 93] build generalized regression models by appending tensor and vector predictors linearly for image prediction and classification. [114] develop a discriminant analysis by including tensor and vector predictors in a linear fashion. 
[91] propose an integrative factor regression for multimodal neuroimaging data assuming that data from different modalities can be decomposed into common factors. Another type of integration utilizes kernel tricks and combines information from multimodal data with multiple kernels. [55] provide a survey on various multiple kernel learning techniques for multimodal data fusion and classification with support vector machines. Combining kernels linearly or non- linearly in different modalities, instead of original data, provides more flexibility in information integration. [11] proposed a multiple kernel regression model with group lasso penalty, which integrates information by multiple kernels and selects the most predictive data modalities. Despite these accomplishments, the current approaches have several shortcomings. First, they mainly focus on exploring the interdependence between multimodal neuroimaging data, ignoring the representative and discriminative power of the learned components. Thus, the methods cannot further bridge the imaging data to the patients’ biological status, which is not helpful in biomarker development. Second, the supervised techniques integrate information primarily by data or feature concatenation without explicitly considering the possible correlations between different modali- ties. This lack of consideration for interdependence may cause issues like overfitting and parameter identifiability. Third, current multimodal approaches are mostly vector-based. Since many neu- roimaging data are multi-dimensional, these approaches may fail to utilize the multi-way features as well as the multi-way interdependence between different modalities. Finally, although many empirical studies demonstrate the success of using multimodal data, there is a lack of mathematical 78 and statistical clarity to the extent of generalizability and associated uncertainties. The absence of a sound statistical framework for multimodal data analysis makes it impossible to interpret the generalization ability of a certain statistical model. In this paper, we propose a two-stage Coupled Support Tensor Machine (C-STM) for multimodal tensor-based neuroimaging classification. The model accommodates current multimodal data science issues and provides a sound statistical framework to interpret the interdependence between modalities and quantify the model consistency and generalization ability. The major contributions of this work are: 1. To extract individual and common components from multimodal tensor data in the first stage using Advanced Coupled Matrix Tensor Factorization (ACMTF), and identify interdepen- dence between multimodal data through latent factors. 2. To build a novel CP Support Tensor Machine with both the individual and common factors for classification. This new model is named Coupled Support Tensor Machine (C-STM). 3. To show the proposed model is a consistent classification rule. A Matlab package is also provided in the supplemental material, including all functions for C-STM classification and detailed data processing pipeline. The rest part of this chapter is organized as follow. Section 4.2 reviews current approaches about coupled matrix tensor factorization and multiple kernel learning, which are the basis of this work. Section 4.3 introduce our classification model. The model estimation is presented in section 4.4 using nonlinear conjugate gradient descent optimization. 
A simulation study is presented in section 4.6 to compare the performance of multimodal classification with single modal classification, highlighting the benefits of using information from multiple sources. Then we adopt the C-STM model in a simultaneous EEG-fMRI data trial classification problem in section 4.7. The conclusion of this chapter is in section 4.8. 79 4.2 Related Work In this section, we review some backgorund and prior work on tensor decomposition and support tensor machine. In this work, we denote numbers and scalars by letters such as 𝑥, 𝑦, 𝑁. Vectors are denoted by boldface lowercase letters, e.g. 𝑎 , 𝑏 . Matrices are denoted by boldface capital letters like 𝐴 , 𝐵 . Multi-dimensional tensors are denoted by boldface Euler script letters such as X , Y . The order of a tensor is the number of dimensions of the data hypercube, also known as ways or modes. For example, a scalar can be regarded as a zeroth-order tensor, a vector is a first-order tensor, and a matrix is a second-order tensor. Let X ∈ R 𝐼1 ×𝐼2 ×···×𝐼 𝑁 be a tensor of order 𝑁, where 𝑥𝑖1 ,𝑖2 ,...,𝑖 𝑁 denotes the (𝑖1 , 𝑖2 , . . . , 𝑖 𝑁 )th element of the tensor. Vectors obtained by fixing all indices of the tensor except the one that corresponds to 𝑛th mode are called mode-𝑛 fibers and denoted as 𝑥 𝑖1 ,...𝑖𝑛−1 ,𝑖𝑛+1 ,...𝑖 𝑁 ∈ R 𝐼𝑛 . The 𝐼𝑛 × 𝑁0 Î 𝐼 0 mode-𝑛 unfolding of X is defined as X (𝑛) ∈ R 𝑛 =1,𝑛0≠𝑛 𝑛 where the mode-𝑛 fibers of the tensor X are the columns of X (𝑛) and the remaining modes are organized accordingly along the rows. 4.2.1 CP Decomposition Let X ∈ R 𝐼1 ×𝐼2 ×...×𝐼 𝑑 be a tensor with 𝑑 modes. Rank-𝑟 Canonical/Polyadic (CP) decomposition of X is defined as: 𝑟 (1) (2) (𝑑) Õ X≈ 𝜁 𝑘 · 𝑥 𝑘 ◦ 𝑥 𝑘 ... ◦ 𝑥 𝑘 = È𝜁𝜁 ; 𝑋 (1) , ..., 𝑋 (𝑑) É, (4.1) 𝑘=1 𝐼 ×𝑟 ( 𝑗) where 𝑋 ( 𝑗) ∈ R 𝑗 , 𝑗 ∈ {1, .., 𝑑} are defined as factor matrices whose columns are 𝑥 𝑘 and "◦" represents the vector outer product. The right side of (4.1) is called Kruskal tensor, which is a convenient representation for CP tensors, see [81]. We denote a Kruskal tensor by 𝔘 𝑥 = È𝜁𝜁 ; 𝑋 (1) , ..., 𝑋 (𝑑) É where 𝜁 ∈ R𝑟 is a vector holding the weights of rank one components. In the special case of matrices, 𝜁 corresponds to singular values of a matrix. If all the elements in 𝜁 are 1, then 𝜁 can be dropped from the notation. In general, it is assumed that the rank 𝑟 is small so that 80 equation (4.1) is also called low-rank approximation for a tensor X . Such an approximation can be estimated by an Alternating Least Square (ALS) approach, see [78]. Motivated by the fact that joint analysis of data from multiple sources can potentially un- veil complex data structures and provide more information, Coupled Matrix Tensor Factorization (CMTF) ([2]) was proposed for multimodal data fusion. CMTF estimates the underlying latent factors for both tensor and matrix data simultaneously by taking the coupling between tensor and matrix data into account. This feature makes CMTF a promising model in analyzing heterogeneous data, which generally have different structures and modalities. Let X 1 ∈ R 𝐼1 ×𝐼2 ×...×𝐼 𝑑 and 𝑋 2 ∈ R 𝐼1 ×𝐽2 . Assuming the factors from the first mode of the tensor X 1 span the column space of the matrix 𝑋 2 , CMTF tries to estimate all factors by minimizing: 1 1 𝑄 (𝔘𝔘1 , 𝔘 2 ) = X1 − È𝑋 kX 𝑋 1(1) , 𝑋 1(2) , ...𝑋 𝑋 1(𝑑) Ék 2Fro + k𝑋 𝑋 − 𝑋 2(1) 𝑋 2(2)> k 2Fro , (1) (1) s.t. 𝑋 1 = 𝑋 2 , 2 2 2 (4.2) (𝑚) (1) (1) where 𝑋 𝑝 are the factor matrices for modality 𝑝 and mode 𝑚. 
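As a small illustration of the objects involved, the sketch below (assuming NumPy) reconstructs a third-order tensor from its Kruskal representation (4.1) and evaluates the CMTF objective (4.2) for given factor matrices with the first-mode factor shared between the tensor and the matrix; it only evaluates the objective and is not the gradient-based estimation algorithm of [2].

```python
import numpy as np

def kruskal_to_full(weights, factors):
    # X = sum_k zeta_k * x_k^(1) o x_k^(2) o x_k^(3) for a three-way tensor.
    U, V, W = factors
    return np.einsum('k,ik,jk,lk->ijl', weights, U, V, W)

def cmtf_objective(X1, X2, shared, tensor_factors, matrix_factor):
    # 0.5 * ||X1 - [[shared, U2, U3]]||_F^2 + 0.5 * ||X2 - shared @ V2^T||_F^2,
    # where "shared" is the coupled factor X_1^(1) = X_2^(1) in (4.2).
    r = shared.shape[1]
    recon_tensor = kruskal_to_full(np.ones(r), [shared] + list(tensor_factors))
    recon_matrix = shared @ matrix_factor.T
    return (0.5 * np.sum((X1 - recon_tensor) ** 2)
            + 0.5 * np.sum((X2 - recon_matrix) ** 2))
```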
The factor matrices 𝑋 1 = 𝑋 2 are the coupled factors between tensor and matrix data. These factor matrices can also be represented in Kruskal form, 𝔘 1 = È𝑋 𝑋 1(1) , 𝑋 1(2) , ...𝑋 𝑋 1(𝑑) É and 𝔘 2 = È𝑋𝑋 2(1) , 𝑋 2(2) É. By minimizing the objective function 𝑄 (𝔘 𝔘1 , 𝔘 2 ), CMTF estimates latent factors for both the tensor and matrix data jointly which allows it to utilize information from both modalities. [2] uses a gradient descent algorithm to optimize the objective function (4.2). Although CMTF provides a successful framework for joint data analysis, it often fails to obtain a unique estimation when both shared and individual components exist. As a result, any further statistical analysis and learning from CMTF estimation will suffer from the large uncertainty in latent factors. To address this issue, [3] proposed Advanced Coupled Matrix Tensor Factorization (ACMTF) by introducing a sparsity penalty to the weights of latent factors in the objective function (4.2), and restricting the norm of the columns of the factors to be unity to allow unique results up to a permutation. This modification provides a more precise estimation of latent factors compared to CMTF, and makes it possible to develop further stable statistical models upon the estimated factors. 81 4.2.2 CP Support Tensor Machine (CP-STM) CP-STM has been previously studied by [136, 63, 64] and use CP model to construct STMs. Given a collection of data 𝑇𝑛 = {(X X1 , 𝑦 1 ), (X X2 , 𝑦 2 ), ..., (X X𝑛 , 𝑦 𝑛 )}, where X𝑖 ∈ X ⊂ R 𝐼1 ×𝐼2 ×...×𝐼 𝑑 are d-way tensors, X is a compact tensor space which is a subspace of R 𝐼1 ×𝐼2 ×...×𝐼 𝑑 , and 𝑦𝑖 ∈ {1, −1} are binary labels. CP-STM assumes the tensor predictors have a CP structure, and can be classified 𝑛 by the function which minimizes the objective function 𝜆|| 𝑓 || 2 + 𝑛1 X𝑖 ), 𝑦𝑖 ). By using Í L ( 𝑓 (X 𝑖=1 tensor kernel function 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) Õ X1 , X 2 ) = 𝐾 (X 𝐾 ( 𝑗) (𝑥𝑥 1,𝑙 , 𝑥 2,𝑚 ), (4.3) 𝑙,𝑚=1 𝑗=1 𝑟 (1) (𝑑) 𝑟 (1) (𝑑) where X 1 = 𝑥 1,𝑙 ◦ .. ◦ 𝑥 1,𝑙 and X 2 = Í Í 𝑥 2,𝑙 ◦ .. ◦ 𝑥 2,𝑙 , the STM classifier can be written as 𝑙=1 𝑙=1 Õ𝑛 X) = 𝑓 (X 𝛼𝑖 𝑦𝑖 𝐾 (X X𝑖 , X ) = 𝛼 𝑇 𝐷 𝑦 𝐾 (X X), (4.4) 𝑖=1 where X is a new 𝑑-way rank-𝑟 tensor of size 𝐼1 × 𝐼2 × ... × 𝐼 𝑑 . In (4.4), 𝛼 = [𝛼1 , ..., 𝛼𝑛 ] 𝑇 are the coefficients, 𝐷 𝑦 is a diagonal matrix whose diagonal elements are 𝑦 1 , .., 𝑦 𝑛 , and 𝐾 (X X) = X1 , X ), ..., 𝐾 (X [𝐾 (X X𝑛 , X )] 𝑇 is a column vector, whose entries are kernel values computed between training data and the new test data. We denote the collection of functions in the form of (4.4) with H , which is a functional space also known as Reproducing Kernel Hilbert Space (RKHS). The optimal CP-STM classifier, 𝑓 ∈ H , can be estimated by plugging function (4.4) into the objective function, and minimize it with Hinge or Squared Hinge loss. The coefficients of the optimal CP-STM model is denoted by 𝛼 ∗ . The classification model is statistically consistent if the tensor kernel function satisfies the universal approximating property as shown in [89]. 4.2.3 Multiple Kernel Learning Multiple kernel learning (MKL) creates new kernels using a linear or non-linear combination of single kernels to measure inner products between data. Statistical learning algorithms such as support vector machine and kernel regression can then utilize the new combined kernels instead of single kernels to obtain better learning results and avoid the potential bias from kernel selection. 
A more important and more closely related reason for using MKL is that different kernels can take inputs from different data representations, possibly coming from different sources or modalities. Thus, combining kernels through MKL is one possible way of integrating multiple information sources. Given a collection of kernel functions {K_1(·,·), ..., K_m(·,·)}, a new kernel function can be constructed as

K(·,·) = f_η({K_1(·,·), ..., K_m(·,·)} | η),   (4.5)

where f_η is a linear or non-linear function, and η is a vector whose elements are the weight coefficients of the kernel combination. Linear combination methods are the most popular form of multiple kernel learning, where the kernel function is parameterized as

K(·,·) = f_η({K_1(·,·), ..., K_m(·,·)} | η) = ∑_{l=1}^{m} η_l K_l(·,·).   (4.6)

The weight parameters η_l can simply be assumed to be equal (unweighted) ([115, 14]), or be determined by looking at performance measures for each kernel or data representation ([134, 118]). More advanced approaches, such as optimization-based, Bayesian, and boosting approaches, can also be adopted ([84, 50, 140, 75, 76, 54, 32, 15]).

Motivated by the elegant framework and consistency property of CP-STM, we extend it to multimodal tensor classification problems by combining it with the ACMTF decomposition. We further consider the linear combination (4.6) of kernels to integrate latent factors from multimodal data, and select the kernel weight parameters in a heuristic, data-driven way to construct our C-STM model. The preciseness of ACMTF offers the chance to capture the true latent structures of multimodal tensors, resulting in better classification performance.

4.3 Methodology

Let T_n = {(X_{1,1}, X_{1,2}, y_1), ..., (X_{n,1}, X_{n,2}, y_n)} be the training data, where each sample t ∈ {1, ..., n} has two data modalities X_{t,1}, X_{t,2} and a corresponding binary label y_t ∈ {1, −1}. In this work, following [2], we assume that the first data modality is a third-order tensor, X_{t,1} ∈ ℝ^{I_1 × I_2 × I_3}, and the other is a matrix, X_{t,2} ∈ ℝ^{I_4 × I_3}.

[Figure 4.1: C-STM Model Pipeline. The tensor modality X_1 ∈ ℝ^{I_1 × I_2 × I_3} and the matrix modality X_2 ∈ ℝ^{I_4 × I_3} are jointly factorized into individual tensor factors, shared factors, and individual matrix factors; the kernels K_1(·,·), K_2(·,·), and K_3(·,·) computed on these factor groups are combined by C-STM to predict the label y.]

The third mode of X_{t,1} and the second mode of X_{t,2} are assumed to be coupled for each t, i.e., the factor matrix is assumed to be fully or partially shared across these modes. Utilizing this coupling, one can extract factors that better represent the underlying structure of the data, and preserve and utilize the discriminative power of the factors from both modalities. Our approach, C-STM (see Figure 4.1), consists of two stages: multimodal tensor factorization (ACMTF) and the coupled support tensor machine. We present both stages in this section.

4.3.1 ACMTF

In this stage, we aim to perform a joint factorization across the two modalities for each training sample t. Let 𝔘_{t,1} = ⟦ζ; X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}⟧ denote the Kruskal tensor of X_{t,1}, and 𝔘_{t,2} = ⟦σ; X_{t,2}^{(1)}, X_{t,2}^{(2)}⟧ denote the singular value decomposition of X_{t,2}. The weights of the columns of each factor matrix X_{t,p}^{(m)}, where p is the modality index and m is the mode index, are denoted by ζ and σ. The norms of these columns are constrained to be 1 to avoid redundancy.
The objective function of the ACMTF decomposition is then given by:

Q(𝔘_{t,1}, 𝔘_{t,2}) = γ_1 ‖X_{t,1} − ⟦ζ; X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}⟧‖²_Fro + γ_2 ‖X_{t,2} − X_{t,2}^{(1)} Σ X_{t,2}^{(2)⊤}‖²_Fro + β_1 ‖ζ‖_1 + β_2 ‖σ‖_1   (4.7)

s.t. X_{t,1}^{(3)} = X_{t,2}^{(2)},   ‖x_{t,1,k}^{(1)}‖_2 = ‖x_{t,1,k}^{(2)}‖_2 = ‖x_{t,1,k}^{(3)}‖_2 = ‖x_{t,2,k}^{(1)}‖_2 = ‖x_{t,2,k}^{(2)}‖_2 = 1   ∀k ∈ {1, ..., r}.

Here Σ is a diagonal matrix whose elements are the singular values of the matrix X_{t,2}, and x_{t,m,k}^{(j)} ∈ ℝ^{I_j} denotes the k-th column of the mode-j factor matrix of X_{t,m}. The objective function in (4.7) includes ℓ_1 penalties on the weights of both the tensor and matrix decompositions. Thus, the model identifies the shared and individual components. In our experiments, we set γ_1 = γ_2 = 1 and β_1 = β_2 = 0.01. These parameters can also be learned through optimization. The estimated factors are then treated as extracted data representations of the multimodal data, and are used to predict the labels y_t in the C-STM classifier.

4.3.2 Coupled Support Tensor Machine (C-STM)

C-STM uses the idea of multiple kernel learning and treats the coupled and uncoupled factors from the ACMTF decomposition as different data representations. As a result, we use three different kernel functions to measure their inner products. One can think of these three kernels as inducing three different feature maps that transform the multimodal factors into different feature spaces. In each feature space, the corresponding kernel measures the similarity between factors from that specific data modality. The similarities of the multimodal factors are then integrated by combining the kernel functions through a linear combination. This procedure is illustrated in Figure 4.1. In particular, the kernel K_1 is a tensor kernel (equation (4.3)), since the first group of individual factors are tensor CP factors. For two pairs of decomposed factors (𝔘_{t,1}, 𝔘_{t,2}) and (𝔘_{i,1}, 𝔘_{i,2}), the kernel function for C-STM is defined as

K((X_{t,1}, X_{t,2}), (X_{i,1}, X_{i,2})) = K((𝔘_{t,1}, 𝔘_{t,2}), (𝔘_{i,1}, 𝔘_{i,2}))
= ∑_{k,l=1}^{r} [ w_1 K_1(x_{t,1,k}^{(1)}, x_{i,1,l}^{(1)}) K_1(x_{t,1,k}^{(2)}, x_{i,1,l}^{(2)}) + w_2 K_2(x_{t,1,k}^{(3)*}, x_{i,1,l}^{(3)*}) + w_3 K_3(x_{t,2,k}^{(1)}, x_{i,2,l}^{(1)}) ].   (4.8)

Here x_{t,1,k}^{(3)*} is the average of the estimated shared factors, (1/2)[x_{t,1,k}^{(3)} + x_{t,2,k}^{(2)}], since the ACMTF algorithm cannot guarantee x_{t,1,k}^{(3)} = x_{t,2,k}^{(2)} numerically. The weights w_1, w_2, and w_3 combine the three kernel functions and can be tuned by cross-validation. With the kernel function (4.8), the C-STM model estimates a bivariate decision function f from a collection of functions H such that

f = arg min_{f ∈ H} λ‖f‖² + (1/n) ∑_{i=1}^{n} L(f(X_{i,1}, X_{i,2}), y_i),   (4.9)

where L(f(X_{i,1}, X_{i,2}), y_i) = max(0, 1 − y_i · f(X_{i,1}, X_{i,2})) is the Hinge loss. H is defined as the collection of all functions of the form

f(X_1, X_2) = ∑_{t=1}^{n} α_t y_t K((X_{t,1}, X_{t,2}), (X_1, X_2)) = α^⊤ D_y K(X_1, X_2),   (4.10)

for any pair of test data (X_1, X_2) and for α ∈ ℝ^n, due to the well-known representer theorem ([9]). For all possible values of α, equation (4.10) defines the function collection H. D_y is a diagonal matrix whose diagonal elements are the labels from the training data T_n, and K(X_1, X_2) is an n × 1 column vector whose t-th element is K((X_{t,1}, X_{t,2}), (X_1, X_2)). The optimal C-STM decision function, denoted by f_n = α*^⊤ D_y K(X_1, X_2), can be estimated by solving the quadratic programming problem

min_{α ∈ ℝ^n} (1/2) α^⊤ D_y K D_y α − 1^⊤ α   (4.11)
s.t. α^⊤ y = 0,   0 ⪯ α ⪯ (1/(2nλ)) 1,
where K is the kernel matrix constructed from the kernel function (4.8). Problem (4.11) is the dual problem of (4.9), and its optimal solution α* also minimizes the objective function (4.9) when functions of the form (4.10) are plugged in. For a new pair of test points (X_1, X_2), the class label is predicted as Sgn(f_n(X_1, X_2)).

4.4 Model Estimation

In this section, we first present the estimation procedure for the coupled tensor-matrix decomposition (4.7), and then combine it with the classification procedure to summarize the algorithm for C-STM. To satisfy the constraints in the objective function (4.7), we convert the function Q(𝔘_{t,1}, 𝔘_{t,2}) into a differentiable and unconstrained form:

Q(𝔘_{t,1}, 𝔘_{t,2}) = γ_1 ‖X_{t,1} − ⟦ζ; X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}⟧‖²_Fro + γ_2 ‖X_{t,2} − X_{t,2}^{(1)} Σ X_{t,2}^{(2)⊤}‖²_Fro + τ ‖X_{t,1}^{(3)} − X_{t,2}^{(2)}‖²_Fro
+ ∑_{k=1}^{r} [ β √(ζ_k² + ε) + β √(σ_k² + ε) + θ ( (‖x_{t,1,k}^{(1)}‖_2 − 1)² + (‖x_{t,1,k}^{(2)}‖_2 − 1)² + (‖x_{t,1,k}^{(3)}‖_2 − 1)² + (‖x_{t,2,k}^{(1)}‖_2 − 1)² + (‖x_{t,2,k}^{(2)}‖_2 − 1)² ) ].   (4.12)

The ℓ_1 norm penalties in (4.7) are replaced with differentiable approximations, τ and θ are Lagrange multipliers, and ε > 0. This unconstrained optimization problem can be solved by nonlinear conjugate gradient descent ([2, 5, 110]). Let T_t be the full tensor of 𝔘_{t,1} (the Kruskal tensor converted into multi-dimensional array form), and let M_t = X_{t,2}^{(1)} Σ X_{t,2}^{(2)⊤}. The partial derivatives with respect to the latent factors can be derived as follows:

∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂X_{t,1}^{(1)} = γ_1 (T_t − X_{t,1})_{(1)} (ζ^⊤ ⊙ X_{t,1}^{(3)} ⊙ X_{t,1}^{(2)}) + θ (X_{t,1}^{(1)} − X̄_{t,1}^{(1)})   (4.13)
∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂X_{t,1}^{(2)} = γ_1 (T_t − X_{t,1})_{(2)} (ζ^⊤ ⊙ X_{t,1}^{(3)} ⊙ X_{t,1}^{(1)}) + θ (X_{t,1}^{(2)} − X̄_{t,1}^{(2)})   (4.14)
∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂X_{t,1}^{(3)} = γ_1 (T_t − X_{t,1})_{(3)} (ζ^⊤ ⊙ X_{t,1}^{(2)} ⊙ X_{t,1}^{(1)}) + τ (X_{t,1}^{(3)} − X_{t,2}^{(2)}) + θ (X_{t,1}^{(3)} − X̄_{t,1}^{(3)})   (4.15)
∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂X_{t,2}^{(1)} = γ_2 (M_t − X_{t,2}) X_{t,2}^{(2)} Σ + θ (X_{t,2}^{(1)} − X̄_{t,2}^{(1)})   (4.16)
∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂X_{t,2}^{(2)} = γ_2 (M_t − X_{t,2})^⊤ X_{t,2}^{(1)} Σ + τ (X_{t,2}^{(2)} − X_{t,1}^{(3)}) + θ (X_{t,2}^{(2)} − X̄_{t,2}^{(2)})   (4.17)
∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂ζ_k = γ_1 (T_t − X_{t,1}) ×_1 x_{t,1,k}^{(1)} ×_2 x_{t,1,k}^{(2)} ×_3 x_{t,1,k}^{(3)} + β ζ_k / (2 √(ζ_k² + ε)),   k = 1, ..., r   (4.18)
∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂σ_k = γ_2 x_{t,2,k}^{(1)⊤} (M_t − X_{t,2}) x_{t,2,k}^{(2)} + β σ_k / (2 √(σ_k² + ε)),   k = 1, ..., r   (4.19)

Here T_{(j)} denotes the mode-j unfolding of a tensor T, ×_j denotes the mode-j product, and ⊙ denotes the Khatri-Rao product (see Section 1.2). The overline notation M̄ denotes a normalized matrix M whose columns are divided by their respective ℓ_2 norms. Combining all the parts derived above, the gradient of the objective function is

∇Q(𝔘_{t,1}, 𝔘_{t,2}) = [ ∂Q/∂X_{t,1}^{(1)}, ∂Q/∂X_{t,1}^{(2)}, ∂Q/∂X_{t,1}^{(3)}, ∂Q/∂X_{t,2}^{(1)}, ∂Q/∂X_{t,2}^{(2)}, ∂Q/∂ζ_1, ..., ∂Q/∂σ_1, ... ]^⊤,   (4.20)

which is a (5 + 2r)-dimensional vector. If we use (𝔘_{t,1}, 𝔘_{t,2}) = [X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}, X_{t,2}^{(1)}, X_{t,2}^{(2)}, ζ_1, ..., σ_1, ...]^⊤ to denote the latent factors and weights to be estimated, the algorithm uses the negative gradient of Q(𝔘_{t,1}, 𝔘_{t,2}) as the direction to update all the components of (𝔘_{t,1}, 𝔘_{t,2}) simultaneously. We first describe this estimation procedure in Algorithm 10. The algorithm keeps updating (𝔘_{t,1}, 𝔘_{t,2}) until convergence.
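As a simplified illustration of this first estimation stage, the sketch below minimizes a reduced version of (4.12) that keeps only the two reconstruction terms and the quadratic coupling penalty, omitting the sparsity and unit-norm penalties, and it relies on SciPy's conjugate-gradient routine with numerical gradients instead of the analytic gradients (4.13)-(4.19); all dimensions and names are illustrative assumptions, not the original implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Shapes for one training sample: tensor X1 in R^{I1 x I2 x I3}, matrix X2 in R^{I4 x I3},
# coupled along the third tensor mode / second matrix mode, with CP rank r.
I1, I2, I3, I4, r = 8, 6, 5, 7, 2
rng = np.random.default_rng(0)
X1 = rng.standard_normal((I1, I2, I3))
X2 = rng.standard_normal((I4, I3))

sizes = [(I1, r), (I2, r), (I3, r), (I4, r), (I3, r)]  # A1, A2, A3, B1, B2

def unpack(theta):
    factors, start = [], 0
    for (m, n) in sizes:
        factors.append(theta[start:start + m * n].reshape(m, n))
        start += m * n
    return factors

def objective(theta, gamma1=1.0, gamma2=1.0, tau=10.0):
    A1, A2, A3, B1, B2 = unpack(theta)
    # Rank-r CP reconstruction of the tensor: sum_k a1_k o a2_k o a3_k.
    T_hat = np.einsum('ik,jk,lk->ijl', A1, A2, A3)
    M_hat = B1 @ B2.T
    coupling = np.linalg.norm(A3 - B2) ** 2          # soft version of the constraint A3 = B2
    return (gamma1 * np.linalg.norm(X1 - T_hat) ** 2
            + gamma2 * np.linalg.norm(X2 - M_hat) ** 2
            + tau * coupling)

theta0 = rng.standard_normal(sum(m * n for m, n in sizes)) * 0.1
res = minimize(objective, theta0, method='CG', options={'maxiter': 500})
print(res.fun)  # value of the simplified coupled objective at the CG solution
```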
Note that this is a non-convex optimization problem and its convergence properties has been discussed in [112, 117, 143, 144]. Once the factors for all data pairs in the training set 𝑇𝑛 are estimated, we can create the kernel matrix using the kernel function in 4.8. By solving the quadratic programming problem (4.11), we can obtain the optimal decision function 𝑓𝑛 . This two-stage procedure for C-STM estimation is summarized in the algorithm 11 below. 4.5 Theory We discuss the statistical property of C-STM in this section. Let’s assume the risk of a decision X) ≠ 𝑦} , where X ⊂ R 𝐼1 ×..×𝐼 𝑑 is a subspace of R 𝐼1 ×..×𝐼 𝑑 .   function, 𝑓 , is R ( 𝑓 ) = EX ×Y Y 1 { 𝑓 (X Y = {1, −1}. The function 1 {·} is an indicator function measuring the loss of classification 88 Algorithm 10 ACMTF Decomposition 1: procedure ACMTF 2: Input: Multimodal data (X X1 , 𝑋 2 ) tensor rank r, 𝜂, maxiter 3: 0 0 𝔘 𝑡,1 , 𝔘 𝑡,2 = 𝔘 𝑡,1 , 𝔘 𝑡,2 ⊲ Initial value 4: Δ0 = −O𝑄 O𝑄 O𝑄(𝔘 0 , 𝔘0 ) 𝔘𝑡,1  𝑡,20 0 ) + 𝜑Δ  5: 𝜑0 = arg min𝜑 𝑄 (𝔘 𝔘𝑡,1 , 𝔘 𝑡,2 Δ0 6: 1 , 𝔘 1 = (𝔘 𝔘 𝑡,1 0 , 𝔘0 ) + 𝜑 Δ 𝔘𝑡,1 𝑡,2 𝑡,2 0 0 7: 𝑔 0 = Δ0 8: while s < maxiter and |𝑄 𝑄 (𝔘 𝑠 , 𝔘 𝑠 ) − 𝑄 (𝔘 𝔘𝑡,1 𝔘𝑡,1𝑠−1 , 𝔘 𝑠−1 )| > 𝜂 do 𝑡,2 𝑡,2 9: Δ 𝑠+1 = −O𝑄 O𝑄 O𝑄(𝔘 𝑠 , 𝔘𝑠 ) 𝔘𝑡,1 𝑡,2 Δ > Δ 𝑠+1 −Δ (Δ Δ𝑠 ) 10: 𝑔 𝑠+1 = Δ 𝑠+1 + 𝑠+1 > 𝑔 Δ 𝑠+1 −Δ −𝑔𝑔 𝑠 (Δ Δ𝑠 ) 𝑠 𝑠 , 𝔘 𝑠 ) + 𝜑𝑔  11: 𝜑 𝑠+1 = arg min𝜑 𝑄 (𝔘 𝔘𝑡,1 𝑡,2 𝑔 𝑠+1 𝑠+1 , 𝔘 𝑠+1 = (𝔘 𝔘 𝑡,1 𝑠 , 𝔘𝑠 ) + 𝜑 𝔘𝑡,1 12: 𝑡,2 𝑡,2 𝑠+1𝑔 𝑠+1 13: Output: 𝔘 𝑡,1 ∗ , 𝔘∗ 𝑡,2 Algorithm 11 Coupled Support Tensor Machine 1: procedure C-STM 2: Input: Training set 𝑇𝑛 = {(X X1,1 , 𝑋 1,2 , 𝑦 1 ), ..., (X X𝑛,1 , 𝑋 𝑛,2 , 𝑦 𝑛 )}, 𝑦 , kernel function 𝐾, tensor rank r, 𝜆, 𝜂, maxiter 3: for t = 1, 2,...n do 4: ∗ , 𝔘 ∗ = ACMDF((X 𝔘 𝑡,1 𝑡,2 X𝑡,1 , 𝑋 𝑡,2 ), tensor rank r, 𝜂, maxiter) 5: Create initial matrix 𝐾 ∈ R𝑛×𝑛 6: for t = 1,...,n do 7: for i = 1,...,i do  8: 𝐾 [𝑖, 𝑡] = 𝐾 (𝔘 𝔘𝑡,1 , 𝔘 𝑡,2 ), (𝔘 𝔘𝑖,1 , 𝔘𝑖,2 ) ⊲ Kernel values 9: 𝐾 [𝑖, 𝑡] = 𝐾 [𝑡, 𝑖] 10: Solve the quadratic programming problem (4.11) and find the optimal 𝛼 ∗ . 11: Output: 𝛼 ∗ function 𝑓 . If there is a 𝑓 ∗ : X → Y from the collection of all measurable functions such that 𝑓 ∗ = arg min R ( 𝑓 ), its risk is called the Bayes risk for the classification problem with data from X × Y . We denote the Bayes risk as R ∗ = R ( 𝑓 ∗ ). With different training sets 𝑇𝑛 , we can estimate a sequence of decision functions 𝑓𝑛 under the same training procedure. This sequence of decision functions { 𝑓𝑛 } is called a decision rule. A decision rule is statistically consistent if R ( 𝑓𝑛 ) converges to the Bayes risk R ∗ as the size of training data 𝑛 increases, see, e.g., [36]. Our next result shows 89 that C-STM is a statistically consistent decision rule. Proposition 4.5.1. Given the tensor and matrix factors for all data in the domain, the classification risk of C-STM, R ( 𝑓𝑛 ), converges to the optimal Bayes risk almost surely, i.e. R ( 𝑓𝑛 ) → R ∗ 𝑎.𝑠. 𝑛→∞ if the following conditions are satisfied: AS.1 The loss function L is self-calibrated, see [128], and is 𝐶 (𝑊) local Lipschitz continuous in the sense that for |𝑎| 6 𝑊 < ∞ and |𝑏| 6 𝑊 < ∞, |L (𝑎, 𝑦) − L (𝑏, 𝑦)| 6 𝐶 (𝑊)|𝑎 − 𝑏|. In addition, we need sup L (0, 𝑦) 6 𝐿 0 < ∞. 𝑦∈{1,−1} (1) (2) AS.2 The kernel functions 𝐾1 (·, ·), 𝐾1 (·, ·), 𝐾2 (·, ·), and 𝐾3 (·, ·) used to compose the coupled tensor kernel (4.8) are regular vector-based kernels satisfying the universal approximating property. A kernel has this property if it satisfies the following condition. 
Suppose 𝒳 is a compact subset of the Euclidean space ℝ^p, and C(𝒳) = {f : 𝒳 → ℝ} is the collection of all continuous functions defined on 𝒳. The kernel function is defined on 𝒳 × 𝒳, and its reproducing kernel Hilbert space (RKHS) is H. Then for every g ∈ C(𝒳) and every ε > 0, there exists f ∈ H such that ‖g − f‖_∞ = sup_{x ∈ 𝒳} |g(x) − f(x)| ≤ ε.

AS.3 The kernel functions K_1^{(1)}(·,·), K_1^{(2)}(·,·), K_2(·,·), and K_3(·,·) used to compose the coupled tensor kernel (4.8) are all bounded, satisfying sup √(K(·,·)) ≤ K_max < ∞.

AS.4 The hyper-parameter in the regularization term, λ = λ_n, satisfies λ_n → 0 as n → ∞ and nλ_n → ∞ as n → ∞.

This proposition is an extension of our previous result on the statistical consistency of CP-STM. The proof of the proposition is provided in Appendix C.1.

4.6 Simulation Study

We present a simulation study to demonstrate the benefit of using C-STM with multimodal data in classification problems. To show the advantage of using multiple modalities, we compare with CP-STM from [63], and with Constrained Multilinear Discriminant Analysis (CMDA) and Direct General Tensor Discriminant Analysis (DGTDA) from [92]. These existing approaches can only take a single tensor / matrix as the input for classification. Thus, we apply these approaches to each modality separately and compare their classification performance with C-STM. We generate synthetic data with two modalities using the idea from [43] as follows:

X_{t,1} = ∑_{k=1}^{3} x_{k,t,1}^{(1)} ∘ x_{k,t,1}^{(2)} ∘ x_{k,t,1}^{(3)},    X_{t,2} = ∑_{k=1}^{3} x_{k,t,2}^{(1)} ∘ x_{k,t,2}^{(2)},   (4.21)

where X_{t,1} ∈ ℝ^{30 × 20 × 10} and X_{t,2} ∈ ℝ^{50 × 10}, both with rank equal to 3. To generate data for the simulation study, we first generate the latent factors (vectors) from multivariate normal distributions with the parameters given in Table 4.1, and then use equation (4.21) to construct the tensors X_{t,1} and matrices X_{t,2}. In Table 4.1, we use c = 1, 2 to denote data from the two classes. Six different cases are considered in our simulation study. In cases 1 - 3, the discriminative information about the two classes is captured by one of the tensor factors and one of the matrix factors. This means that the tensor and matrix data both contain class information (discriminative power), which may differ between the two modalities. Notice that the discriminative power in the tensor factor remains the same across cases 1 - 3, while the discriminative power in the matrix factor increases. Cases 4 and 5 assume that the class information exists in only a single modality. In case 4, the distribution of one of the tensor factors varies across classes, so the discriminative power between the two classes is carried by the tensor factor. The discriminative factor becomes the matrix factor in case 5. In case 6, the difference between the two classes is captured by the shared factors, meaning that both the tensor and matrix data modalities contain class information. For each simulation case, we generate 50 pairs of tensor and matrix data from each class, collecting 100 pairs of observations in total.
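A minimal sketch of this data-generating scheme, instantiated for Case 1 of Table 4.1, is given below; the function name, random seed, and the way the class-mean shift of 1.5 is applied are read off Table 4.1 and should be taken as an illustrative assumption rather than the original simulation code.

```python
import numpy as np

def generate_sample(c, rng, rank=3, dims=(30, 20, 10), matrix_rows=50):
    """Generate one coupled (tensor, matrix) pair following (4.21) and the Case 1
    settings of Table 4.1: under class c = 2, the first tensor factor and the
    individual matrix factor have mean 1.5 instead of 1."""
    mean_shift = 1.5 if c == 2 else 1.0
    I1, I2, I3 = dims
    tensor = np.zeros(dims)
    matrix = np.zeros((matrix_rows, I3))
    for _ in range(rank):
        a1 = rng.normal(mean_shift, 1.0, I1)            # discriminative tensor factor
        a2 = rng.normal(1.0, 1.0, I2)
        shared = rng.normal(1.0, 1.0, I3)               # shared third-mode / second-mode factor
        b1 = rng.normal(mean_shift, 1.0, matrix_rows)   # discriminative matrix factor
        tensor += np.einsum('i,j,k->ijk', a1, a2, shared)
        matrix += np.outer(b1, shared)
    return tensor, matrix

rng = np.random.default_rng(42)
# 50 pairs per class, 100 coupled observations in total, as in the simulation design.
data = [(generate_sample(c, rng), c) for c in (1, 2) for _ in range(50)]
print(data[0][0][0].shape, data[0][0][1].shape)  # (30, 20, 10) (50, 10)
```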
We then randomly choose 20 samples as the testing set, and use the remaining data as the training set.

Table 4.1: Distribution specifications for the simulation. MVN(c, I) denotes a multivariate normal distribution whose mean vector has all elements equal to c and whose covariance is the identity matrix I.

                  Tensor factors                         Shared factor                       Matrix factor
Case      c       x_{k,t,1}^{(1)}      x_{k,t,1}^{(2)}   x_{k,t,1}^{(3)} = x_{k,t,2}^{(2)}   x_{k,t,2}^{(1)}
Case 1    1       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
          2       MVN(1.5, I)          MVN(1, I)         MVN(1, I)                           MVN(1.5, I)
Case 2    1       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
          2       MVN(1.5, I)          MVN(1, I)         MVN(1, I)                           MVN(1.75, I)
Case 3    1       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
          2       MVN(1.5, I)          MVN(1, I)         MVN(1, I)                           MVN(2, I)
Case 4    1       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
          2       MVN(2, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
Case 5    1       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
          2       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(2, I)
Case 6    1       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
          2       MVN(1, I)            MVN(1, I)         MVN(2, I)                           MVN(1, I)

[Figure 4.2: Simulation: average accuracy (bar plots) with standard deviations (error bars) for Cases 1 - 6, comparing C-STM, CMDA1, CMDA2, CPSTM1, CPSTM2, DGTDA1, and DGTDA2.]

The random selection of the testing set is conducted in a stratified sampling manner so that the proportion of samples from each class remains the same in both the training and testing sets. For all models, we report the prediction accuracy, i.e., the proportion of correct predictions over total predictions, on the testing set as the performance metric. The random selection of training and testing data is repeated 50 times. The average prediction accuracy and standard deviation over these 50 repetitions for all cases are reported in Figure 4.2. The results of CP-STM, CMDA, and DGTDA with tensor data are denoted by CPSTM1, CMDA1, and DGTDA1, respectively. The results using matrix data are denoted by CPSTM2, CMDA2, and DGTDA2. From Figure 4.2, we can conclude that our C-STM has a more favorable performance in this multimodal classification problem compared with the single-modality methods. Its accuracy rates are significantly larger than those of the other methods in most cases. In particular, we can see that the accuracy rates of C-STM (pink) increase from case 1 to case 3, while the accuracy rate of CP-STM using only the tensor data remains the same. This is because the difference between the class mean vectors for the first tensor factor does not change from case 1 to 3, whereas the difference between the class mean vectors for the matrix factor increases. Due to this fact, both C-STM and CP-STM with matrix data (yellow) have better performance in case 3. More importantly, C-STM always outperforms CP-STM with matrix data, as it enjoys the extra class information from multiple modalities.
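Before turning to cases 4 - 6, the repeated stratified train/test protocol used to produce Figure 4.2 can be sketched as follows; the use of scikit-learn's StratifiedShuffleSplit and the placeholder baseline classifier are assumptions of this sketch, not the dissertation's implementation.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def evaluate(predict_fn, samples, labels, n_repeats=50, test_size=0.2, seed=0):
    """Average test accuracy over repeated stratified splits (20% held out each time)."""
    labels = np.asarray(labels)
    splitter = StratifiedShuffleSplit(n_splits=n_repeats, test_size=test_size, random_state=seed)
    accuracies = []
    for train_idx, test_idx in splitter.split(np.zeros((len(labels), 1)), labels):
        y_pred = predict_fn([samples[i] for i in train_idx], labels[train_idx],
                            [samples[i] for i in test_idx])
        accuracies.append(np.mean(y_pred == labels[test_idx]))
    return np.mean(accuracies), np.std(accuracies)

# Example with a trivial majority-class baseline standing in for C-STM.
def majority_baseline(train_samples, train_labels, test_samples):
    majority = 1 if np.mean(train_labels == 1) >= 0.5 else 2
    return np.full(len(test_samples), majority)

labels = np.array([1] * 50 + [2] * 50)
samples = list(range(100))  # placeholders for the coupled (tensor, matrix) pairs
print(evaluate(majority_baseline, samples, labels))
```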
In cases 4 and 5, where the class information is in a single modality, the advantage of C-STM is not as significant as in the previous cases, though its performance is still better than CP-STM. This indicates that C-STM can provide robust classification results even when the additional modalities do not provide any class information. In case 6, where the class information comes from the shared factors, C-STM recovers the shared factors and provides significantly better classification accuracy. Through this simulation, we showed that C-STM has a clear advantage when using multimodal data in classification problems, and is robust to redundant data modalities. The performance of tensor discriminant analysis is not as good as that of C-STM and CP-STM because these methods are not designed for CP tensors.

4.7 Trial Classification for Simultaneous EEG-fMRI Data

In this section, we present an application of the proposed method to simultaneous EEG-fMRI data. The data are obtained from [141]. In this study, seventeen individuals (six females, average age 27.7) participated in three runs each of visual and auditory oddball paradigms. In total, 375 stimuli per task (125 per run) were presented for 200 ms each, with a 2-3 s uniformly distributed variable inter-trial interval. A trial is defined as a time window in which subjects receive the stimulus and give a response. In the visual task, a large red circle on an isoluminant gray background was the target (oddball) stimulus, while a small green circle was the standard stimulus. For the auditory task, the standard and oddball stimuli were, respectively, 390 Hz pure tones and broadband sounds which sound like "laser guns". During the experiment, the stimuli were presented to all subjects, and their EEG and fMRI data were collected simultaneously and continuously. We obtain the data from the OpenNeuro website (https://openneuro.org/datasets/ds000116/versions/00003). We utilize both the EEG and fMRI data in this data set with our C-STM model to classify stimulus types across trials.

We pre-process both the EEG and fMRI data with Statistical Parametric Mapping (SPM 12) ([10]) and Matlab. Details of the data pre-processing are provided in Appendix C.2. For each trial, we construct a three-mode tensor corresponding to the EEG data for all subjects, where the modes represent channel × time × subject, denoted as X_{t,1} ∈ ℝ^{34 × 121 × 16}. For the fMRI data, there is only one 3D scan collected from a single subject during each trial. The time mode does not exist in the fMRI data because the trial duration is less than the repetition time of fMRI (the time needed to obtain a single 3D fMRI volume). We further extract fMRI volumes only from voxels in the regions of interest (ROI) for our study. The ROI selection and data extraction are described in Appendix C.2. We extract fMRI volumes from 178 voxels for the auditory oddball tasks, and 112 voxels for the visual tasks. As a result, the fMRI data for each trial are modeled by matrices whose rows and columns stand for subjects and voxels: X_{t,2} ∈ ℝ^{16 × 178} for the auditory task data, and X_{t,2} ∈ ℝ^{16 × 112} for the visual task data. To classify trials with oddball and standard stimuli, we collect 140 multimodal data samples (X_{t,1}, X_{t,2}) from the auditory tasks, and 100 samples from the visual tasks.
For both types of tasks, the numbers of oddball and standard trials are equal. We consider the trials with the oddball stimulus as the positive class, and the trials with the standard stimulus as the negative class. Similar to the simulation study, we select 20% of the data as the testing set, and use the remaining 80% for model estimation and validation. The classification accuracy, precision (positive predictive rate), sensitivity (true positive rate), specificity (true negative rate), and the area under the curve (AUC) of the classifiers are calculated on the test set for each experiment. The experiment is repeated multiple times, and the average accuracy, precision, sensitivity, and specificity, along with their standard deviations, are reported in Table 4.2. The single-modality classifiers CP-STM, CMDA, and DGTDA are also applied to either the EEG or the fMRI data for comparison. The single-modality classifiers applied to the EEG data are denoted by appending the method name with the number "1", and those applied to the fMRI data are denoted by appending the method name with the number "2".

Table 4.2: Real Data Result: Simultaneous EEG-fMRI Data Trial Classification (Mean of Performance Metrics with Standard Deviations in Parentheses)

Task       Method     Accuracy       Precision      Sensitivity    Specificity    AUC
Auditory   C-STM      0.89 (0.05)    0.83 (0.07)    1.00 (0.00)    0.77 (0.11)    0.89 (0.06)
           CP-STM1    0.80 (0.08)    0.71 (0.11)    1.00 (0.00)    0.60 (0.12)    0.78 (0.06)
           CP-STM2    0.83 (0.06)    0.76 (0.07)    0.99 (0.05)    0.65 (0.11)    0.82 (0.05)
           CMDA1      0.55 (0.10)    0.51 (0.09)    0.96 (0.09)    0.20 (0.21)    0.55 (0.06)
           CMDA2      0.67 (0.09)    0.61 (0.11)    0.92 (0.07)    0.46 (0.14)    0.70 (0.08)
           DGTDA1     0.55 (0.09)    0.51 (0.09)    0.94 (0.07)    0.23 (0.12)    0.59 (0.06)
           DGTDA2     0.67 (0.09)    0.60 (0.10)    0.90 (0.09)    0.46 (0.13)    0.68 (0.08)
Visual     C-STM      0.86 (0.06)    0.82 (0.09)    0.93 (0.07)    0.77 (0.12)    0.86 (0.06)
           CP-STM1    0.76 (0.08)    0.66 (0.11)    1.00 (0.00)    0.54 (0.12)    0.78 (0.05)
           CP-STM2    0.77 (0.08)    0.70 (0.11)    0.98 (0.08)    0.58 (0.17)    0.77 (0.07)
           CMDA1      0.53 (0.12)    0.52 (0.11)    0.94 (0.11)    0.11 (0.18)    0.54 (0.08)
           CMDA2      0.65 (0.13)    0.61 (0.14)    0.91 (0.09)    0.43 (0.19)    0.66 (0.09)
           DGTDA1     0.56 (0.11)    0.54 (0.11)    0.94 (0.06)    0.17 (0.12)    0.56 (0.07)
           DGTDA2     0.64 (0.10)    0.60 (0.13)    0.86 (0.10)    0.44 (0.18)    0.64 (0.07)

It can be seen that the classification accuracy of C-STM using multimodal data is higher than that of any classifier based on a single modality, with a significant improvement in terms of the average accuracy rates and average AUC values. This improvement is observed for both the auditory and visual tasks. In particular, the accuracy rate of C-STM in the visual task is 9% higher than that of CP-STM using the fMRI data, the model with the second best performance. This significant performance improvement demonstrates the clear advantage of our C-STM with multimodal data, which is consistent with the conclusions from the simulation study. Similarly, the tensor discriminant analysis methods do not work as well as CP-STM and C-STM, which also agrees with our observations in the simulation study.

4.8 Conclusion

In this work, we have proposed a novel coupled support tensor machine classifier for multimodal data by combining advanced coupled matrix tensor factorization with the support tensor machine. The most distinctive feature of this classifier is its ability to integrate features across different modalities and structures. The approach can simultaneously process matrix and tensor data for classification and can be extended to more than two modalities. Moreover, the coupled matrix tensor decomposition helps unveil the intrinsic correlation structure across different modalities, making it possible to integrate information from multiple sources efficiently.
The newly designed kernel functions in C-STM serve as a feature-level information fusion, combining discriminant information from different modalities. In addition, the kernel formulation makes it possible to utilize the most discriminative features from each modality by tuning the weight parameters in the function. Our theoretical results demonstrate that the C-STM decision rule is statistically consistent. An important theoretical extension of our approach would be the development of excess risk for C-STM. In particular, we look for an explicit expression for the excess risk in terms of data factors from multiple modalities to quantify the contribution of each modality to minimizing the excess risk. By doing so, we are able to interpret the importance of each data modality in classification tasks. In addition, quantifying the uncertainty of tensor and matrix factor estimation and their impact on the excess risk will build the foundation to the next level statistical inference. Another possible future work can be learning the weight parameters in kernel function via optimization problems in the 96 algorithmic aspect. As [55] introduced, the weights in the kernel function can be further estimated by including a group lasso penalty in the objective function. Such a weight estimation procedure can identify the significant data components and reduce the burden of parameter selection. In conclusion, we believe C-STM offers many encouraging possibilities for multimodal data integration and analysis. Its ability to handle multimodal tensor inputs will make it appropriate in many advanced data applications in neuroscience research. 97 APPENDICES 98 APPENDIX A APPENDIX FOR CHAPTER 2 A.1 Proof of Proposition 2.3.1 Proof. The proof of this theorem is quite straightforward. We need to use the tensor product space defined in section 1.2. In addition, since we are discussing general tensor product, we will use ⊗ to denote it. ⊗ can be replaced with outer product ◦ or Kharti-Rao product when facing specific vector or matrix data. The proof will still holds. Let V ( 𝑗) , 𝑗 = 1, , , .𝑑 be compact subsets of R 𝑗 , 𝑗 = 1, ..., 𝑑. The tensor product of these 𝐼 subsets X = ⊗ 𝑑𝑗=1V ( 𝑗) will be again a compact subspace of the tensor space R 𝐼1 ×...×𝐼 𝑑 . Let K (X X) be the kernel sections of tensor kernels we defined in equation (2.2), and C(X X ) = { 𝑓 : X → R} be the collection of all continuous real-valued functions mapping CP tensors to scalars. We have to prove for any 𝑓 ∈ C(X X ), there exist an approximation in K (X X ). We will show such kinds of approximation exists. If 𝑓 ∈ C(X X ), it has sup | 𝑓 | < ∞ due to continuity. Further, since 𝑓 is defined on X and X continuous, then 𝑓 ∈ C(X X ) and can be written as 𝑟 (1) (2) (𝑑) Õ 𝑓 = 𝜆 𝑘 𝑓 𝑘 ⊗ 𝑓 𝑘 ... ⊗ 𝑓 𝑘 + 𝜖 (A.1) 𝑘=1 ( 𝑗) where 𝑓 𝑘 are continuous function defined on V ( 𝑗) , 𝑗 = 1, , , .𝑑. This decomposition exists due to the fact that 𝑓 is defined on X . As a result, 𝑓 belongs to the functional tensor product space  𝑓 : X → R} in definition 1.2.2. It can also be exlpained by the fact that 𝑓 is continuous on a compact space, thus it is multilinear and has such a decomposition (Lemma 4.30 from [59]). 𝜖 is a reminder here, and can be as small as possible since 𝑓 is uniformly bounded. For simplicity, we shall ignore the 𝜖 in the later proof for a while and mention it at the end. 𝜆 𝑘 are bounded since 𝑓 is bounded. If in every mode of the kernel functions are universal, the kernel functions are universal. 
For 99 ( 𝑗) ( 𝑗) each 𝑓 𝑘 , 𝑘 = 1, ..., 𝑟; 𝑗 = 1, ...𝑑, there is a function 𝑔 𝑘 ∈ 𝑠𝑝𝑎𝑛{𝐾𝑥 : 𝑥 ∈ V ( 𝑗) }, which is from the kernel sections of the corresponding mode, such that ( 𝑗) ( 𝑗) sup | 𝑓 𝑘 − 𝑔 𝑘 | < 𝜖 𝑘 = 1, ..., 𝑟; 𝑗 = 1, ...𝑑 (A.2) V ( 𝑗) for any arbitrary 𝜖 > 0. Then for 𝑟 (1) (2) (𝑑) Õ 𝑔= 𝜆 𝑘 𝑔 𝑘 ⊗ 𝑔 𝑘 ... ⊗ 𝑔 𝑘 (A.3) 𝑘=1 We can have 𝑟 𝑟 (1) (2) (𝑑) (1) (2) (𝑑) Õ Õ sup | 𝑓 − 𝑔 | = sup | 𝜆𝑘 𝑓 𝑘 ⊗ 𝑓 𝑘 ... ⊗ 𝑓𝑘 − 𝜆 𝑘 𝑔 𝑘 ⊗ 𝑔 𝑘 ... ⊗ 𝑔 𝑘 | X ∈XX X ∈X X 𝑘=1 𝑘=1 𝑟 𝑑 𝑑 ( 𝑗) ( 𝑗) Õ Ö Ö = sup |𝜆 𝑘 || 𝑓 𝑘 (𝑥𝑥 ( 𝑗) ) − 𝑔 𝑘 (𝑥𝑥 ( 𝑗) )| (A.4) X ∈X X 𝑘=1 𝑗=1 𝑗=1 6 𝑟𝑑𝜖 · max(|𝜆 𝑘 |) The last step is because of a simple inequality |𝑎 1 𝑎 2 − 𝑏 1 𝑏 2 | 6 |𝑎 1 ||𝑎 2 − 𝑏 2 | + |𝑏 2 ||𝑎 1 − 𝑏 1 |, and universal property in definition 2.3.1. Since r, d, and 𝜆 𝑘 are all bounded, let the 𝜖 becomes as small as possible, we have sup | 𝑓 − 𝑔 | 6 𝜖 ∀ 𝑓 ∈ C(X X) (A.5) X ∈XX for any arbitrary 𝜖 > 0. The proof is completed.  A.2 Proof of Theorem 2.3.1 Proof. The convergence in the theorem can be showed in two steps. Given the parameter 𝜆, we denote 𝑓𝑛𝜆 = arg min 𝜆|| 𝑓 || 2 + R L,𝑇𝑛 ( 𝑓 ) 𝑓 𝜆 = arg min 𝜆|| 𝑓 || 2 + R L ( 𝑓 ) H 𝑓 ∈H H 𝑓 ∈H where ∫ R L ( 𝑓 ) = E (X×Y) L (𝑦, 𝑓 (X X)) = L (𝑦, 𝑓 (XX))𝑑P 100 and 𝑛 1Õ X𝑖 )  R L,𝑇𝑛 ( 𝑓 ) = L 𝑦𝑖 , 𝑓 (X 𝑛 𝑖=1 L is a loss function. 𝑓𝑛𝜆 is the optimal classifier learned from training data, and 𝑓 𝜆 is the optimal from the RKHS H generated by tensor kernel function (2.2). Since the Bayes risk under loss function L is defined as R ∗ = min R ( 𝑓 ) over all functions defined on X , we can immediate X →Y 𝑓 :X Y show that |R ( 𝑓 𝜆 ) − R ∗ | 6 E (X×Y) |L (𝑦, 𝑓 𝜆 (X X)) − L (𝑦, 𝑓 ∗ (X X))| 6 𝐶 (𝐾𝑚𝑎𝑥 ) sup | 𝑓 𝜆 − 𝑓 ∗ | (A.6) 6 𝐶 (𝐾𝑚𝑎𝑥 ) · 𝜖 This is the result of using condition Con.1 and Con.2 in the theorem. 𝑓 𝜆 is in the RKHS and thus bounded by some constant depending on 𝐾𝑚𝑎𝑥 . 𝑓 ∗ is also continuous on compact subspace X (because all the tensor components considered are bounded in condition Con.1) and thus is bounded. The universal approximating property in condition Con.3 makes equation (A.6) vanishes as 𝜖 goes to zero. Thus, the consistency result can be established if we show |R ( 𝑓𝑛𝜆 ) − R ( 𝑓 𝜆 )| converges to zero. This can be done with Hoeffding equality ([36]) and Rademacher complexity (see Theorem B.5.1). From the objective function (2.4), we have R L,𝑇𝑛 ( 𝑓𝑛 ) + 𝜆 𝑛 || 𝑓𝑛 || 2 6 𝐿 0 (A.7) q 𝐿 under condition Con.2 when we simply let 𝑓 = 0 as a naive classifier. Thus, || 𝑓𝑛 || 6 𝜆𝑛0 . Let q 𝐿 𝑀𝑛 = 𝜆𝑛0 . 𝑓𝜖 ∈ H such that R L ( 𝑓𝜖 ) 6 R L ( 𝑓 𝜆 ) + 2𝜖 . || 𝑓𝜖 || 6 𝑀𝑛 when 𝑛 is sufficiently large. Due to condition Con.4, 𝜆 𝑛 → 0, making 𝑀𝑛 → ∞. Further notice that we introduce 𝑓𝜖 since it is independent of 𝑛. As a result, its norm, even though is bounded by 𝑀𝑛 , is a constant and is not changing with respect to 𝑛. 
By Rademacher complexity, the following inequality holds with 101 probability at least 1 − 𝛿, where 0 < 𝛿 < 1 r 2𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 log 2/𝛿 R L ( 𝑓𝑛𝜆 ) 6 R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) + √ + (𝐿 0 + 𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 ) 𝑛 2𝑛 2𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 𝑓𝜖 is not the optimal in training data 6 R L,𝑇𝑛 ( 𝑓𝜖 ) + 𝜆 𝑛 || 𝑓𝜖 || 2 − 𝜆 𝑛 || 𝑓𝑛𝜆 || 2 + √ 𝑛 r log 2/𝛿 + (𝐿 0 + 𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 ) 2𝑛 2𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 Drop (𝜆𝑛 | | 𝑓𝑛𝜆 | | 2 > 0) 6 R L,𝑇𝑛 ( 𝑓 𝜖 ) + 𝜆 𝑛 || 𝑓 𝜖 || 2 + √ 𝑛 r log 2/𝛿 + (𝐿 0 + 𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 ) 2𝑛 4𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 Rademacher Complexity again 6 R L ( 𝑓𝜖 ) + 𝜆 𝑛 || 𝑓𝜖 || 2 + √ 𝑛 r log 2/𝛿 + 2(𝐿 0 + 𝐶 (𝐾𝑚𝑎𝑥 ) 𝑀𝑛 ) 2𝑛 Let 𝛿 = 12 , and 𝑁 large such that for all 𝑛 > 𝑁, 𝑛 r 4𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 log 2/𝛿 𝜖 𝜆 𝑛 || 𝑓𝜖 || 2 + √ + 2(𝐿 0 + 𝐶 (𝐾𝑚𝑎𝑥 ) 𝑀𝑛 ) 6 𝑛 2𝑛 2 The inequality exists because || 𝑓𝜖 || is a constant with respect to 𝑛, and all other terms are converging to zero. Thus 𝜖 R L ( 𝑓𝑛𝜆 ) 6 R L ( 𝑓𝜖 ) + 6 RL ( 𝑓 𝜆 ) + 𝜖 2 with probability 1 − 12 . We conclude that 𝑛 P(|R L ( 𝑓𝑛𝜆 ) − R L ( 𝑓 𝜆 )| > 𝜖) → 0 (A.8) for any arbitrary 𝜖. This establishes the weak consistency of CP-STM. For strong consistency, we consider for each 𝑛 ∞ ∞ Õ Õ 1 P(|R L ( 𝑓𝑛𝜆 ) − RL ( 𝑓 𝜆 )| > 𝜖) 6 𝑁 − 1 + 6∞ 𝑛 2 𝑛=1 𝑛=1 By Borel-Cantelli Lemma ([42]), R L ( 𝑓𝑛𝜆 ) → R L ( 𝑓 𝜆 ) almost surely. The proof is finished.  102 APPENDIX B APPENDIX FOR CHAPTER 3 B.1 Proof for Proposition 3.3.1 Suppose a CP rank-𝑟 tensor X = È𝑋 𝑋 1 , .., 𝑋 𝑑 É is given with size 𝐼1 × 𝐼2 .. × 𝐼 𝑑 . With a rank-1 projection tensor A 𝑝 , the CP tensor random projection (3.5) can be written as (1) (𝑑) [ 𝑓 TPR-CP (X)] 𝑝 =< È𝐴 𝐴 𝑝 , ..., 𝐴 𝑝 É, È𝑋 𝑋 (1) , .., 𝑋 (𝑑) É > 𝑟 (1) (2) (𝑑) (1) (2) (𝑑) Õ =< 𝑎𝑝 ◦ 𝑎𝑝 ◦ ... ◦ 𝑎 𝑝 , 𝑥 𝑘 ◦ 𝑥 𝑘 ◦ ... ◦ 𝑥 𝑘 > (B.1) 𝑘=1 𝑟 (1)𝑇 (1) (2)𝑇 (2) (𝑑)𝑇 (𝑑) Õ = < 𝑎𝑝 ,𝑥𝑘 > ◦ < 𝑎𝑝 ,𝑥𝑘 > ...◦ < 𝑎 𝑝 ,𝑥𝑘 > 𝑘=1 ( 𝑗) ( 𝑗) ( 𝑗) 𝑎 𝑝 , 𝑥 𝑘 ∈ R 𝑗 , 𝑗 = 1, ..., 𝑑 are CP factors for the projection tensor A 𝑝 and CP tensor X. 𝑎 𝑝 ∼ 𝐼 𝑀𝑉 𝑁 (00, 𝜎 2 𝐼 ) is a multivariate random variable whose elements are identically and independently ( 𝑗) distributed. Also, 𝑎 𝑝 are identically and independently distributed with different value of 𝑝 = 1, .., 𝑃 and 𝑗 = 1, ..., 𝑑. Now we consider the tensor-to-tensor random projection (3.6), and let ( 𝑗)𝑇 ( 𝑗)𝑇 𝑃 ×𝐼 𝐴 ( 𝑗) = (𝑎𝑎 1 , ..., 𝑎 𝑃 )𝑇 ∈ R 𝑗 𝑗 be the random projection matrices in (3.6). Notice that 𝑗 ( 𝑗) the rows of matrices 𝐴 ( 𝑗) are identically and independently distributed as the 𝑎 𝑝 , since the elements in the matrices 𝐴 ( 𝑗) are also identically and independently distributed as N (0, 𝜎 2 ). The tensor-to-tensor CP random projection (3.6) is 𝑓 TPR-CP-TT (X) = È𝐴 𝐴 (1) 𝑋 (1) , 𝐴 (2) 𝑋 (2) , ..., 𝐴 (𝑑) 𝑋 (𝑑) É 𝑟 (1) (2) (𝑑) Õ = < 𝐴 (1) , 𝑥 𝑘 > ◦ < 𝐴 (2) , 𝑥 𝑘 > ...◦ < 𝐴 (𝑑) , 𝑥 𝑘 > (B.2) 𝑘=1 𝑟 (1) (2) (𝑑) Õ = 𝑣 𝑘 ◦ 𝑣 𝑘 ◦ ... ◦ 𝑣 𝑘 𝑘=1 ( 𝑗) ( 𝑗) 𝑣 𝑘 =< 𝐴 ( 𝑗) , 𝑥 𝑘 >∈ R 𝑗 , 𝑗 = 1, ..., 𝑑 since it is just matrix vector multiplication. To show 𝑃 the equivalence between (3.5) and (3.6), we have to show that [ 𝑓 TPR-CP-TT (X)] 𝑝 1 ,𝑝 2 ,...,𝑝 𝑑 = Î [ 𝑓 TPR-CP (X)] 𝑝 when a index mapping 𝜋 is given, 𝜋 ( 𝑝 1 , 𝑝 2 , ..., 𝑝 𝑑 ) = 𝑝, and 𝑑𝑗=1 𝑝 𝑗 = 𝑝. 103 Î𝑑 For an arbitrary 𝑝 = 𝑗=1 𝑝 𝑗 and 𝜋 ( 𝑝 1 , 𝑝 2 , ..., 𝑝 𝑑 ) = 𝑝, we can find the element in projected tensor with index 𝑝 1 , 𝑝 2 , ..., 𝑝 𝑑 is 𝑟 (1) (2) (𝑑) Õ [ 𝑓 TPR-CP-TT (X)] 𝑝 1 ,𝑝 2 ,...,𝑝 𝑑 = 𝑣 𝑝 𝑣 𝑝 ...𝑣𝑣 𝑝 1 2 𝑑 𝑘=1 𝑟 (B.3) (1)𝑇 (1) (2)𝑇 (2) (𝑑)𝑇 (𝑑) Õ = < 𝑎𝑝 ,𝑥𝑘 >◦< 𝑎𝑝 ,𝑥𝑘 > ...◦ < 𝑎𝑝 ,𝑥𝑘 > 1 2 𝑑 𝑘=1 ( 𝑗) ( 𝑗) where 𝑎 𝑝 𝑗 ∈ R 𝑗 are rows of matrices 𝐴 ( 𝑗) . Since 𝑎 𝑝 𝑗 are identically and independently distributed 𝐼 ( 𝑗) ( 𝑗) for all 𝑝 𝑗 , 𝑎 𝑝 𝑗 are equivalent to the 𝑎 𝑝 in equation (B.1). 
Thus, the equation (B.3) is equivalent to the equation (B.1). The proof is finished. Indeed, the equivalence can be identified as follow: (1) (𝑑) ( 𝑗) For each projection tensor A 𝑝 in (3.5), A 𝑝 = 𝑎 𝑝 ◦ ... ◦ 𝑎 𝑝 , where 𝑎 𝑝 𝑗 ∈ R 𝑗 are 𝑝 𝑗 -th rows of 𝐼 1 𝑑 matrices 𝐴 ( 𝑗) . The order is decided by the index mapping 𝜋 ( 𝑝 1 , 𝑝 2 , ..., 𝑝 𝑑 ) = 𝑝. B.2 Proof of Proposition 3.5.1 We use the adding and subtraction trick to prove the proposition.   EA R L (𝑔𝑔𝑛𝜆 ) − R L ∗ = [E [R (𝑔 𝜆 𝑔𝑛𝜆 )] A L 𝑔𝑛 ) − R L,𝑇 A (𝑔 𝑛   + EA R A (𝑔𝑔𝑛𝜆 ) − R A ( 𝑓A𝜆 ,𝑛 ) − 𝜆|| 𝑓A𝜆 ,𝑛 || 2 L,𝑇𝑛 L,𝑇𝑛 + EA [R ( 𝑓A𝜆 ,𝑛 ) − R L ( 𝑓A𝜆 ,𝑛 )] L,𝑇𝑛A + [EA [R L ( 𝑓A𝜆 ,𝑛 ) + 𝜆|| 𝑓A𝜆 ,𝑛 || 2 ] − R L ( 𝑓𝑛𝜆 ) − 𝜆|| 𝑓𝑛𝜆 || 2 ] + [R L ( 𝑓𝑛𝜆 ) − R L,𝑇𝑛 ( 𝑓𝑛𝜆 )] + [R𝑇𝑛 ( 𝑓𝑛𝜆 ) + 𝜆|| 𝑓𝑛𝜆 || 2 − R L,𝑇𝑛 ( 𝑓 𝜆 ) − 𝜆|| 𝑓 𝜆 || 2 ] + [R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 )] + [R L ( 𝑓 𝜆 ) − R L,H ∗ ∗ ∗ 𝜆 2 H + R L,H H − R L + 𝜆|| 𝑓 || ]      𝜆 𝜆   𝜆 𝜆  6 EA R L (𝑔𝑔𝑛 ) − R A (𝑔𝑔𝑛 ) + EA R A ( 𝑓A ,𝑛 ) − R L ( 𝑓A ,𝑛 ) L,𝑇𝑛 L,𝑇𝑛     + R L ( 𝑓𝑛𝜆 ) − R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) + R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 )    𝜆 𝜆 2  𝜆 𝜆 2 + EA R L ( 𝑓A,𝑛 ) + 𝜆|| 𝑓A,𝑛 || − R L ( 𝑓𝑛 ) − 𝜆|| 𝑓𝑛 || + 𝐷 (𝜆) + R L,H ∗ ∗ H − RL Since 𝐷 (𝜆) = R L ( 𝑓 𝜆 ) + 𝜆|| 𝑓 𝜆 || − R L,H ∗ , there are only two terms are dropped in the derivation. H 104 • The first term dropped is EA [R (𝑔𝑔𝑛𝜆 ) − R ( 𝑓A𝜆 ,𝑛 ) − 𝜆|| 𝑓A𝜆 ,𝑛 || 2 ]. As we explained in L,𝑇𝑛A L,𝑇𝑛A the paper, 𝑓A𝜆 ,𝑛 is the decision function but with coefficients estimated from CP-STM model. As a result, it is not the optimal of the objective function (3.8). Since 𝑔𝑛𝜆 minimizes the objective function (3.8) and 𝜆||𝑔𝑔𝑛𝜆 || 2 > 0, we get R (𝑔𝑔𝑛𝜆 ) − R ( 𝑓A𝜆 ,𝑛 ) − 𝜆|| 𝑓A𝜆 ,𝑛 || 2 6 R (𝑔𝑔𝑛𝜆 ) + 𝜆||𝑔𝑔𝑛𝜆 || 2 − R ( 𝑓A𝜆 ,𝑛 ) L,𝑇𝑛A L,𝑇𝑛A L,𝑇𝑛A L,𝑇𝑛A − 𝜆|| 𝑓A𝜆 ,𝑛 || 2 (Since 𝑔𝑛𝜆 minimizes (3.8)) 60 The inequality holds for all random projection defined by random tensor A , so EA [R (𝑔𝑔𝑛𝜆 ) − R ( 𝑓A𝜆 ,𝑛 ) − 𝜆|| 𝑓A𝜆 ,𝑛 || 2 ] 6 0 L,𝑇𝑛A L,𝑇𝑛A • The second term dropped is [R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) + 𝜆|| 𝑓𝑛𝜆 || 2 − R L,𝑇𝑛 ( 𝑓 𝜆 ) − 𝜆|| 𝑓 𝜆 || 2 ]. Similar to the previous dropped term, this term is also less or equal to zero. As we defined, 𝑓𝑛𝜆 minimizes the objective function (3.1) that evaluates loss over the training data 𝑇𝑛 . Even though 𝑓 𝜆 is the class optimal with infinite-size training data, its objective function still has a greater value than that of 𝑓𝑛𝜆 . By comparing the values of objective function (3.1) on 𝑓𝑛𝜆 and 𝑓𝑛𝜆 , we can see that [R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) + 𝜆|| 𝑓𝑛𝜆 || 2 − R L,𝑇𝑛 ( 𝑓 𝜆 ) − 𝜆|| 𝑓 𝜆 || 2 ] 6 0 By dropping these two non-positive terms, we prove the proposition. B.3 Discussion on Assumptions AS.8 Assumption AS.8 ensures that the Bayes risk remains unchanged after random projection. This assumption has also been made in [24]. This is a necessary condition to align our results with the definition of classification consistency. For any arbitrary random projection A , it is obvious that R L,A ∗ > RL ∗ . This is because the A optimal Bayes risk is achieved by choosing any measureable function. A function compose random projection A and a decision rule should be measureable, and thus should be considered when searching for Bayes rules. If R L,A ∗ > RL ∗ , then smallest achievable risk in projected data will no A 105 longer be R L∗ . By definition, a decision rule learned from the projected data just have to reach ∗ 𝜆 ) → R ∗ we show in the   to the R L,A A to be consistent. This deviates from the result EA R L (𝑔𝑔𝑛 L ∗  𝜆  paper. 
Thus, we need the condition to guarantees that EA R L (𝑔𝑔𝑛 ) → R L indicating RPSTM’s consistency aligns with the definition. In [33, 24], many examples satisfying this condition are provided. Typically, if there is an random projection A such that E[𝑦| 𝑓 TPR-CP-TT (X X)] and X are independent, then the condition is satisfied. B.4 Proof of Proposition 3.5.2 Johnson-Lindenstrauss lemma gives concentration bound on the error introduced by random projection in a single mode. (e.g. see [72] and [34]) We first show how this property is applied at each mode of the tensor CP components in the following lemma. ( 𝑗) ( 𝑗) 𝐼 ×1 Lemma B.4.1. For each fixed mode 𝑗 = 1, 2, .., 𝑑 and any two tensor CP factors 𝑥 1 , 𝑥 2 ∈ R 𝑗 among 𝑛 training vectors, with probability at least (1 − 𝛿1 ) and the random projection matrices described in AS.5, we have 𝐴 ( 𝑗) 𝑥 1( 𝑗) − 𝐴 ( 𝑗) 𝑥 2( 𝑗) || 22 − ||𝑥𝑥 1( 𝑗) − 𝑥 2( 𝑗) || 22 6 𝜖 ||𝑥𝑥 1( 𝑗) − 𝑥 2( 𝑗) || 22 ||𝐴 log 𝛿𝑛 𝑃 𝑗 ×𝐼 𝑗 Proof. The matrix 𝐴𝑗 ∈R , where 𝑃 𝑗 = 𝑂 ( 2 1 ). Under condition AS.5, the inequality 𝜖 holds due to the JL-property ([72, 34]).  Next, we apply this lemma to multiple modes of tensor CP factors, and derive a bound for the difference between projected tensor kernel function (3.9) and tensor kernel function (3.2). We need the following lemma and corollary to derive the bound. Lemma B.4.2. Consider a 2𝑑 degree polynomial of independent Centered Gaussian or Rademacher random variables as 𝑄 2𝑑 (𝑌 ) = 𝑄 2𝑑 (𝑦𝑖 , .., 𝑦 𝑑 ). Then for some 𝜖 > 0 and 𝜉 > 0 constant.   𝜖 2𝑑  1 P(|𝑄 2𝑑 (𝑌 ) − E(𝑄 2𝑑 (𝑌 ))| > 𝜖 𝑑) 6 𝑒 2 exp − 2𝑑 𝜉Var[𝑄 2𝑑 (𝑌 )] 106 Proof. The proof can be found using hypercontractivity, [69] Thm 6.12 and Thm 6.7. This result is also mentioned in [124].  From this lemma, we can show a corollary about the difference between projected tensor CP components and original CP components. 𝑟 (1) (𝑑) Corollary B.4.2.1. For any two d-mode tensors in rank-r CP form, X 1 = Í 𝑥 1,𝑘 ◦ ... ◦ 𝑥 1,𝑘 and 𝑘=1 𝑟 (1) (𝑑) X2 = Í 𝑥 2,𝑘 ◦ ... ◦ 𝑥 2,𝑘 , concentration bounds of polynomials can be derived below. Given 𝜖 > 0, 𝑘=1 𝜉 > 0 constant, we have following JL type result. 𝑟 Ö 𝑑 𝑟 Ö 𝑑 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) Õ Õ Õ P( 𝐴 ( 𝑗) 𝑥 1,𝑘 |𝐴 − 𝐴 ( 𝑗) 𝑥 2,𝑙 || 22 − ||𝑥𝑥 1,𝑘 − 𝑥 2,𝑙 || 22 > 𝜖𝑑 ||𝑥𝑥 1,𝑘 − 𝑥 2,𝑙 || 22 ) 𝑘,𝑙=1 𝑗=1 𝑘,𝑙=1 𝑗=1 𝑘,𝑙=1 𝑗=1  𝜖 2𝑑 Î𝑑 𝑃 ( 𝑗)  1 𝑗=1 2𝑑 6 𝑒 2 exp(− ) 3𝑑 𝑟 4 Proof. It is known that variable for any vector 𝑥 ( 𝑗) ∈ R 𝑗 and any matrix 𝐴 ( 𝑗) made out of entries 𝐼 ( 𝑗) ( 𝑗) of following independent normal with mean 0 and variance 𝑃1 , linear combination 𝐴 ( 𝑗)𝑥 ∼ 𝑗 k𝑥𝑥 k 2 k𝐴𝐴 ( 𝑗) 𝑥 ( 𝑗) k 2 𝑀𝑉 𝑁 (00, 𝑃1 𝐼 ). So, 𝑃 𝑗 ( 𝑗) 2 follows Chi-square of degree of freedom 𝑃 𝑗 . Using the fact that 𝑗 k𝑥𝑥 k 2 expression, ( 𝑗) ( 𝑗) Õ 𝑟 Ö 𝑑 ||𝐴𝐴 ( 𝑗) 𝑥 1,𝑘 − 𝐴 ( 𝑗) 𝑥 2,𝑙 || 22 ( 𝑗) ( 𝑗) 𝑘,𝑙=1 𝑗=1 ||𝑥𝑥 1,𝑘 − 𝑥 2,𝑙 || 22 is the sum of 𝑟 2 identically distributed random variables with correlation 1. Therefore, the variance of the sum is (𝑟 2 ) 2 times variance of an individual term. Here each element in the summation is a product of 𝑑 independent scaled Chi-Square variables. Thus, each element in the summation is polynomial of degree equals to 2d of Gaussian random variables. This result follows from lemma B.4.2. In view of similar result can be found in [119], the constant 𝜉 can be assumed to be 1. It is worth noting that for polynomial of degree 2 or 𝑑 = 1, sharper bound can be obtained as illustrated in [72, 34].  Now we present the bound for tensor kernels. 107 𝑟 (1) (𝑑) Proposition B.4.1. 
For any two d-mode tensors in rank-r CP form, X1 = Í 𝑥 1,𝑘 ◦ ... ◦ 𝑥 1,𝑘 and 𝑘=1 𝑟 (1) (𝑑) X2 = Í 𝑥 2,𝑘 ◦ ... ◦ 𝑥 2,𝑘 , concentration bounds of polynomials can be derived below. Suppose the 𝑘=1 random projection 𝑓 TPR-CP-TT is defined by projection tensor A , which satisfies the assumption AS.5. Given 𝜖 > 0, 𝛿1 depending on 𝜖 as given in corollary B.4.2.1, 𝐶𝑑,𝑟 constant depending on 𝑑, we have following JL-type result. For a given tensor kernel function 𝐾 (·, ·),   X1 ), 𝑓 TPR-CP-TT (X X2 ) − 𝐾 (X X1 , X 2 ) > 𝐶𝑑,𝑟  P 𝐾 𝑓 TPR-CP-TT (X 𝜖𝑑 6 𝛿1 Proof. X1 ), 𝑓 TPR-CP-TT (X X2 ) − 𝐾 (X X1 , X 2 )|  |𝐾 𝑓 TPR-CP-TT (X 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) Õ =| 𝐾 ( 𝑗) (𝐴𝐴 ( 𝑗) 𝑥 1𝑘 , 𝐴 ( 𝑗) 𝑥 2𝑙 ) − 𝐾 ( 𝑗) (𝑥𝑥 1𝑘 , 𝑥 2𝑙 )| 𝑘,𝑙=1 𝑗=1 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) Õ 6 |𝐾 ( 𝑗) (𝐴 𝐴 ( 𝑗) 𝑥 1𝑘 , 𝐴 ( 𝑗) 𝑥 2𝑙 ) − 𝐾 ( 𝑗) (𝑥𝑥 1𝑘 , 𝑥 2𝑙 )| 𝑘,𝑙=1 𝑗=1 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) Õ (Lipschitz continuity in AS.4) 6 𝐿 𝐾 | ||𝐴 𝐴 ( 𝑗) 𝑥 1𝑘 − 𝐴 ( 𝑗) 𝑥 2𝑙 || 22 − ||𝑥𝑥 1𝑘 − 𝑥 2𝑙 || 22 | 𝑘,𝑙=1 𝑗=1 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) Õ (Max 𝐿𝑘 ( 𝑗) in AS.4) 6 𝐿𝐾 𝑑 𝐴 ( 𝑗) 𝑥 1𝑘 | ||𝐴 − 𝐴 ( 𝑗) 𝑥 2𝑙 || 22 − ||𝑥𝑥 1𝑘 − 𝑥 2𝑙 || 22 | 𝑘,𝑙=1 𝑗=1 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) Õ (Corollary B.4.2.1 ) 6 𝜖 𝑑 𝐿𝐾𝑑 ||𝑥𝑥 1𝑘 − 𝑥 2𝑘 || 22 𝑘,𝑙=1 𝑗=1 Õ 𝑟 (Assumption AS.7) 6 𝑑 𝐵2𝑑 2𝑑 𝜖 𝑑 𝐿 𝐾 𝑥 𝑘,𝑙=1 6 2𝑑 𝑟 2 𝐿 𝐾 𝑑 𝐵2𝑑 𝜖 𝑑 𝑥 (Denote 𝐶𝑑,𝑟 = 2𝑑 𝑟 2 𝐿𝐾 𝑑 2𝑑 𝐵𝑥 ) = 𝐶𝑑,𝑟 𝜖 𝑑 Such part vanishes as 𝜖 𝑑 becomes as small as possible.  The proposition shows that the difference between the projected kernel and original kernel function can be bounded with probability at least 1 − 𝛿1 , when condition AS.4, AS.5, and AS.7 hold. 108 Now we include conditions AS.1, AS.6 together with the previous results to show proposition 3.5.2. With a single random projection defined by A , the extra risk from random projection |R L ( 𝑓A𝜆 ,𝑛 ) + 𝜆|| 𝑓A𝜆 ,𝑛 || 2 − R L ( 𝑓𝑛𝜆 ) − 𝜆|| 𝑓𝑛𝜆 || 2 | contains two parts. They are bounded in separate ways. • Difference between risks can be bounded by the following inequality. With probability at least 1 − 𝛿1 (with respect to random projection), 0 < 𝛿1 < 1, 𝜆 X), 𝑦) − L ( 𝑓 𝜆 (X 𝑛 X), 𝑦)]  R L ( 𝑓A𝜆 ,𝑛 ) − R L ( 𝑓𝑛𝜆 ) = E (X X ×YY ) L ( 𝑓A ,𝑛 (X r 𝐿0 X) − 𝑓𝑛𝜆 (X X)|  𝜆  ( AS.1 and Jensen’s Inequality) 6 𝐶 (𝐾𝑚𝑎𝑥 ) · E (XX ×Y Y ) | 𝑓A ,𝑛 (X 𝜆 r 𝑛 𝐿0  Õ 6 𝐶 (𝐾𝑚𝑎𝑥 )· |𝛼𝑖 |E (X X ×YY ) {|𝑦𝑖 |· 𝜆 𝑖=1 X1 ), 𝑓 TPR-CP-TT (X X2 ) − 𝐾 (X X1 , X 2 )|}   |𝐾 𝑓 TPR-CP-TT (X r 𝑛 𝐿0  Õ 𝑑  ( Proposition B.4.1 and |𝑦𝑖 | = 1) 6 𝐶 (𝐾𝑚𝑎𝑥 )· |𝛼𝑖 | · E (X Y ) [𝐶𝑑,𝑟 𝜖 ] X ×Y 𝜆 𝑖=1 r 𝐿0 ( Expectation over constant) 6 𝐶 (𝐾𝑚𝑎𝑥 ) · Ψ𝐶𝑑,𝑟 𝜖 𝑑 𝜆 𝑛 X) = 𝛼𝑇 𝐷 𝑦 𝐾 (X X) ∈ H }. Í where Ψ = sup{||𝛼 𝛼 || 1 = |𝛼𝑖 | : 𝑓 (X 𝑖=1 • Difference between functional norms can be bounded in a similar way. With probability at least 1 − 𝛿1 (with respect to random projection), 0 < 𝛿1 < 1, Õ𝑛 Õ 𝑛 |𝜆|| 𝑓A𝜆 ,𝑛 || 2 − 𝜆|| 𝑓𝑛𝜆 || 2 | 6𝜆 𝛼𝑖 𝛼𝑙 |𝑦𝑖 ||𝑦 𝑙 |· 𝑖=1 𝑙=1 X1 ), 𝑓 TPR-CP-TT (X X2 ) − 𝐾 (X X1 , X 2 )|  ( Absolute value) |𝐾 𝑓 TPR-CP-TT (X Õ𝑛 Õ𝑛 ( Proposition B.4.1 and |𝑦𝑖 | = 1) 6 𝜆( |𝛼𝑖 |) · ( |𝛼𝑙 |) · 𝐶𝑑,𝑟 𝜖 𝑑 𝑖=1 𝑙=1 6 𝜆Ψ2𝐶 𝑑,𝑟 𝜖𝑑 Each of these two inequalities hold with probability at least 1 − 𝛿1 , then two inequalities hold simultaneously with probability at least 1−2𝛿1 . This can be showed with simple probability theory, 109 since the probability of at least one inequality does not hold is no more than 2𝛿1 . (Probability of union is no more then the sum of probabilities.) 
As a result, we conclude that with probability at least 1 − 𝛿1 with respect to random projection r 𝜆 2 𝜆 𝜆 2 𝐿0 𝜖𝑑 𝜆 |R L ( 𝑓A ,𝑛 ) + 𝜆|| 𝑓A ,𝑛 || − R L ( 𝑓𝑛 ) − 𝜆|| 𝑓𝑛 || | 6 𝐶𝑑 Ψ [𝐶 (𝐾𝑚𝑎𝑥 ) + 𝜆Ψ] 𝜖 𝑑 = 𝑂 ( 𝑞 ) 𝜆 𝜆 The proposition 3.5.2 is proved. B.5 Proof of Theorem 3.5.2 Theorem 3.5.2 establishes an upper bound on the excess risk of RPSTM model under a single random projection. Although it does not give out the statistical consistency of RPSTM we are pursuing, it summarizes the conclusions from proposition 3.5.2 and bound the excess risk in proposition 3.5.1 under a single random projection. The theorem assumes that if all the conditions AS.1 - AS.9 hold, then the excess risk under a single random projection can be bounded with probability at least (1 − 2𝛿1 )(1 − 𝛿2 ) ∗ 6 𝑉 (1) + 𝑉 (2) + 𝑉 (3) R L (𝑔𝑔𝜆𝑛 ) − R L q √ q q 𝐿0 𝐿 log(2/𝛿2 ) 2 log(2/𝛿2 ) • 𝑉 (1) = 12𝐶 (𝐾𝑚𝑎𝑥 𝜆 ) · 𝐾𝑚𝑎𝑥 √ 0 + 9 𝜁𝜆 ˜ 2𝑛 + 2𝜁𝜆 𝑛 𝑛𝜆 • 𝑉 (2) = 𝐷 (𝜆) q 𝐿0 • 𝑉 (3) = 𝐶𝑑,𝑟 Ψ · [𝐶 (𝐾𝑚𝑎𝑥 𝜆 ) + 𝜆Ψ]𝜖 𝑑 (1 − 2𝛿1 ) is the probability with respect to random projections, and (1 − 𝛿2 ) is with respect to the randomness of choosing training data 𝑇𝑛 . As noted in the theorem, 𝜁˜𝜆 can be regarded as the supreme X, 𝑦) → L ( 𝑓 (X X), 𝑦) : 𝑓 ∈ F ,  of the infinity norm of a function in the collection L ◦ F = ℎ : (X i.e. 𝜁˜𝜆 = sup ||ℎℎ || ∞ = sup sup X, 𝑦)| |ℎℎ (X ℎ ∈L◦FF ℎ ∈L◦F F X,𝑦)∈(X (X X ×Y Y) q  𝐿0 F = 𝑓 : || 𝑓 || ∞ 6 𝐾𝑚𝑎𝑥 𝜆 . All functions in the collection L ◦ F are compositing a loss function together with a decision function, and are bi-variate. 𝜁𝜆 is a special case of 𝜁˜𝜆 by letting 110 the decision function to be the optimal CP-STM 𝑓 𝜆 , i.e. 𝜁𝜆 = sup |L ( 𝑓 𝜆 (XX), 𝑦)| X,𝑦)∈(X (X X ×Y Y) As for Ψ, it is the supreme of the L-1 norm for CP-STM coefficient vector. In order to show this theorem, we can use the result from proposition 3.5.1, but without taking expectation over random projections. With a single random projection, the proof of Proposition 3.5.2 in appendix B.4 already shows how the term 𝑉 (3) is developed. The probability component with respect to random projection depicts the chance of 𝑉 (3) term being true, and is explained in the proof. 𝑉 (2) is directly taken from the risk decomposition in Proposition 3.5.1. Thus, we only have to show 𝑉 (1) to establish the theorem. Indeed, our discussion in Proposition 3.5.1 unveils that, except the terms already bounded by 𝑉 (2), 𝑉 (3), and the term vanishes due to universal tensor kernels, 𝑉 (1) only bounds the gaps between empirical risk and expected risk, which are listed below. R L (𝑔𝑔𝑛𝜆 ) − R (𝑔𝑔𝑛𝜆 ), R L ( 𝑓𝑛𝜆 ) − R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) L,𝑇𝑛A R ( 𝑓A𝜆 ,𝑛 ) − R L ( 𝑓A𝜆 ,𝑛 ), R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 ) L,𝑇𝑛A Notice that consider the problem under a single random projection. Thus, the risks are not expectations over all random projections. As we mentioned earlier, one of them can be bounded by Hoeffding equality immediately, and the other three can be bounded by Rademacher Complexity. We list the bound for each term, and explain how the bound is developed below. First, we consider the term R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 ) and get the following result. Proposition B.5.1. With probability at least (1 − 𝛿2 ) for 𝛿2 ∈ (0, 1) s 2𝑙𝑜𝑔 𝛿1 2 |R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 )| < 2𝜁𝜆 𝑛 𝑛 X𝑖 ), 𝑦𝑖 ) as a sum of independent and identically Í Proof. We consider R L,𝑇𝑛 ( 𝑓 𝜆 ) = L ( 𝑓 𝜆 (X 𝑖=1 distributed (i.i.d) random variables since each pair (X X𝑖 , 𝑦𝑖 ) ∈ 𝑇𝑛 are i.i.d distributed. R L ( 𝑓 𝜆 ) is the 111 expectation of R L,𝑇𝑛 ( 𝑓 𝜆 ). 
Since loss function is bounded by 𝜁𝜆 for every term in R L,𝑇𝑛 ( 𝑓 𝜆 ), using Hoeffding’s inequality ([36]), we obtain P[R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 )] > 𝜃) ≤ 𝑒𝑥 𝑝(− 𝑛𝜃2 ). Choosing 8𝜁 𝜆 𝛿2 = 𝑒𝑥 𝑝(− 𝑛𝜃2 ) leads to the above bound.  8𝜁 𝜆 However, the other three terms cannot be bounded in the same way. This is because the decision function in the other three terms 𝑓A𝜆 ,𝑛 , 𝑔𝑛𝜆 , and 𝑓𝑛𝜆 are calculated from the training data and hence conditional on 𝑇𝑛 . As a result, the risk of RPSTM model R A (𝑔𝑔𝑛𝜆 ) is not a sum of L,𝑇𝑛 independent random variables. This violates the assumption of Hoeffding inequality. We need to use Rademacher Complexity, a stronger tool to bound the three terms left. We use this tool to develop a bound between R L (𝑔𝑔𝑛𝜆 ) and R (𝑔𝑔𝑛𝜆 ). L,𝑇𝑛A We use R𝑛 (F F ) to denote the Rademacher complexity of a function class F , and R̂ 𝐷 𝑛 (F F ) to denote the corresponding sample estimate with respect to samples 𝐷 𝑛 = {𝑍1 , .., 𝑍𝑛 }. To make our description more consistent, one may regard each 𝑍𝑖 = (X X𝑖 , 𝑦𝑖 ), so that 𝐷 𝑛 is another representation of the training data 𝑇𝑛 . The reason for doing this is because we need to use composite function in forms of L ◦ F . We shall present few well established results about the Rademacher complexity without proof. One can find details about the proof from [106]. Theorem B.5.1. Consider a collection of classifiers F = { 𝑓 : || 𝑓 || ∞ 6 𝜁˜𝜆 }. ∀𝛿2 > 0, with probability at least 1 − 𝛿2 , we obtain: s 1 Õ𝑛 log 𝛿2 2 sup |E[ 𝑓 (𝑍)] − 𝑓 (𝑍𝑖 )| 6 2R𝑛 (FF ) + 𝜁˜𝜆 F 𝑓 ∈F 𝑛 2𝑛 𝑖=1 The probability is with respect to the draw of 𝐷 𝑛 . This is the general inequality for Rademacher Complexity, and can be applied for all types of data and functions 𝑓 . There is a corollary from the Rademacher Complexity developed especially for controlling classification risks. The necessity of this corollary is due to the fact that 𝑓 is a uni- variate function in the theorem, but loss functions in classification risk measurement are bi-variate. As a result, it is not appropriate to replace 𝑓 with loss function L, and substitute 𝐷 𝑛 with our tensor training data 𝑇𝑛 directly in Theorem B.5.1. 112 q 𝐿0 Corollary B.5.1.1. Let F = { 𝑓 : || 𝑓 || ∞ 6 𝐾𝑚𝑎𝑥 𝜆 }, and its sample Rademacher Complexity is R̂𝑇𝑛 (F F ). Suppose L : R × Y → R is a loss function satisfying condition AS.1. Then for all possible training set 𝑇𝑛 , r 𝐿0 R̂𝑇𝑛 (L ◦ F ) 6 2𝐶 (𝐾𝑚𝑎𝑥 ) · R̂𝑇𝑛 (F F) 𝜆 q 𝐿 where L ◦ F = {(X X, 𝑦) → L ( 𝑓 (X X), 𝑦) : 𝑓 ∈ F }. 𝐶 (𝐾𝑚𝑎𝑥 𝜆0 ) is the Lipschitz continuous constant of loss function L introduced in assumption AS.1. The corollary bridges the Rademacher Complexity for general functions to the classification prob- lems. This corollary is also know as a useful extension of Ledoux-Talagrand Contraction Theorem. Lastly, since Corollary B.5.1.1 uses sample Rademacher Complexity, one more result from [106] about kernel classes and RKHS is needed. Here, we continue using the notations about CP-STM and RPSTM from our main content. Corollary B.5.1.2. Suppose H is the RKHS generated by projected tensor kernel functions (3.2). With assumption AS.3 and training data 𝑇𝑛 , for collection of function any function F (𝑀) ⊂ H  𝑀𝐾𝑚𝑎𝑥 R̂𝑇𝑛 F (𝑀) 6 √ 𝑛  F (𝑀) = 𝑓 ∈ H , || 𝑓 || 6 𝑀, 𝑀 > 0 . To use the inequality in Corollary B.5.1.1, we have to bound the infinite norm of the function 𝑔𝑛𝜆 due to the fact that we have to composite loss function L and decision function 𝑔𝑛𝜆 to measure the risk. Thus, we provide the following proposition. Proposition B.5.2. Let 𝑔𝑛𝜆 be the optimal RPSTM model. 
Under assumption AS.1, we have r 𝜆 𝐿0 ||𝑔𝑔𝑛 || ∞ 6 𝐾𝑚𝑎𝑥 𝜆 113 Proof. Since 𝑔𝑛𝜆 = arg min {R A (𝑔𝑔 ) + 𝜆||𝑔𝑔 || 2 }, we get L,𝑇𝑛 𝑔∈HHA 1  ||𝑔𝑔𝑛𝜆 || 2 6 R A (0) + ||0|| 2 − R A (𝑔𝑔𝑛𝜆 ) 𝜆 L,𝑇𝑛 L,𝑇𝑛 1 ( R L,𝑇𝑛A (𝑔𝑔𝑛𝜆 ) negative and | |0| |2 = 0 ) 6 R A (0) 𝜆 L,𝑇𝑛 𝐿 ( Assumption AS.1) 6 0 𝜆 Since 𝑔𝑛𝜆 ∈ H A , and H A is a RKHS generated by projected tensor kernels. By RKHS property , for any function 𝑔 ∈ H A r XA ) h𝑔𝑔 , 𝐾 (, XA )i p 𝐿0 𝑔 (X = 6 ||𝑔𝑔 || sup 𝐾 (·, ·) 6 𝐾𝑚𝑎𝑥 𝜆 The step use Cauchy–Schwarz inequality. The inequality holds for all XA when the function 𝑔 is replaced with 𝑔𝑛𝜆 .  Inspired by the proof of Proposition B.5.2, we consider a collection of function G𝜆 = {𝑔𝑔 : 𝑔 ∈ q H A , ||𝑔𝑔 || 6 𝜆0 }, which obviously includes 𝑔𝑛𝜆 . Due to Corollary B.5.1.2 and Proposition B.5.2, 𝐿 we have r 𝐿0 R̂𝑛 (G𝜆 ) 6 𝐾𝑚𝑎𝑥 𝑛𝜆 q 𝐿0 and ||𝑔𝑔 || ∞ 6 𝐾𝑚𝑎𝑥 𝜆 for all 𝑔 ∈ G𝜆 . Thus, we can now utilize Theorem B.5.1 to show the bound between R L (𝑔𝑔𝑛𝜆 ) and R (𝑔𝑔𝑛𝜆 ). L,𝑇𝑛A q 𝐿0 Proposition B.5.3. Let G𝜆 = {𝑔𝑔 : ||𝑔𝑔 || 6 𝜆 }. Assume conditions AS.1, AS.3, AS.4, and AS.7. Let 𝛿2 > 0, with probability at least 1 − 𝛿2 and a given random projection defined by A r r r 𝐿 0 𝐿 0 log(2/𝛿2 ) |R L (𝑔𝑔𝑛𝜆 ) − R A (𝑔𝑔𝑛𝜆 )| 6 4𝐶 (𝐾𝑚𝑎𝑥 ) · 𝐾𝑚𝑎𝑥 + 3𝜁˜𝜆 (B.4) L,𝑇𝑛 𝜆 𝑛𝜆 2𝑛 The probability is with respect to the join distribution of X × Y . q 𝐿0 Proof. Since 𝑔𝑛𝜆 ∈ G𝜆 and ||𝑔𝑔 || ∞ 6 𝐾𝑚𝑎𝑥 for all 𝑔 ∈ G𝜆 . Let H𝜆 = L ◦ G𝜆 = ℎ : ℎ =  𝜆 X), 𝑦), 𝑔 ∈ G𝜆 }. Then ||ℎℎ || ∞ 6 𝜁˜𝜆 as we noted in the description of Theorem 3.5.2. Theorem L (𝑔𝑔 (X 114 B.5.1 then suggests given a training data 𝑇𝑛 and its projected counterpart, 𝑛 1Õ |R L (𝑔𝑔𝑛𝜆 ) − R A (𝑔𝑔𝑛𝜆 )| 6 sup EX ×Y Y ℎ (𝑍) − ℎ (𝑍𝑖 ) L,𝑇𝑛 𝑛 ℎ ∈H𝜆 𝑖=1 r log(2/𝛿2 ) ( Theorem B.5.1 and | |ℎℎ | |∞ 6 𝜁˜𝜆 by definition) 6 2R𝑛 (H𝜆 ) + 𝜁˜𝜆 2𝑛  r  r log(2/𝛿 ) log(2/𝛿2 ) ( McDiarmid’s inequality) 6 2 R̂𝑛 (H𝜆 ) + 𝜁˜𝜆 2 + 𝜁˜𝜆 2𝑛 2𝑛 r r 𝐿0 log(2/𝛿2 )  ( Corollary B.5.1.1) 6 2 2𝐶 (𝐾𝑚𝑎𝑥 ) · R̂𝑛 (G𝜆 ) + 𝜁˜𝜆 + 𝜆 2𝑛 r log(2/𝛿2 ) 𝜁˜𝜆 2𝑛 r r r 𝐿0 𝐿0 log(2/𝛿2 ) ( Corollary B.5.1.2) 6 4𝐶 (𝐾𝑚𝑎𝑥 ) · 𝐾𝑚𝑎𝑥 + 3𝜁˜𝜆 𝜆 𝑛𝜆 2𝑛 The proposition is proved.  Finally, we can use the exact same way to control R L ( 𝑓𝑛𝜆 ) − R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) and R A ( 𝑓A𝜆 ,𝑛 ) − L,𝑇𝑛 𝜆 R L ( 𝑓A ,𝑛 ). Since the probability 1 − 𝛿2 is with respect to sampling of training data 𝑇𝑛 . Given a random projection, if we get 𝑇𝑛 , we get 𝑇𝑛A . Hence, the randomness of all the four terms bounded by 𝑉 (1) holds simultaneously. We can conclude that with probability at least 1 − 𝛿2 , r p r r 𝐿 0 𝐾𝑚𝑎𝑥 𝐿 0 log(2/𝛿 2 ) 2 log(2/𝛿2 ) 𝑉 (1) = 12𝐶 (𝐾𝑚𝑎𝑥 ) √ + 9𝜁˜𝜆 + 2𝜁𝜆 (B.5) 𝜆 𝑛𝜆 2𝑛 𝑛 Theorem 3.5.2 is proved. As one can see, the term 𝑉 (1) and 𝑉 (3) converge to zero as we assumed. The 𝐷 (𝜆) term in 𝑉 (2) actually motivates us to make assumption AS.9 to make the excess risk converging. B.6 Convergence Rate of Squared Hinge and Hinge Loss We establish the explicit convergence rate for Squared Hinge and Hinge loss in this section. To do that, we have to utilize some properties and facts that hold for these two loss functions. These properties are listed below without proof since they can be easily verified. One can refer [128] for the proof. For similarity, we use L1 and L2 to denote Hinge and Squared Hinge loss. 1. For both loss function, 𝜆|| 𝑓 𝜆 || 2𝐾 < 𝐷 (𝜆) with a given 𝜆 > 0. 115 q 𝐿0 2. For both loss function, || 𝑓 𝜆 || ∞ = sup X)| | 𝑓 𝜆 (X 6 𝐾𝑚𝑎𝑥 || 𝑓 𝜆 || 6 𝐾𝑚𝑎𝑥 𝜆 . X ∈H H q q q 1 𝐷 (𝜆) 𝐿0 3. For hinge loss, 𝐿 0 = 1, 𝜁𝜆 ≤ 1 + 𝐾𝑚𝑎𝑥 𝜆, and 𝜁˜𝜆 ≤ 1 + 𝐾𝑚𝑎𝑥 𝜆 , 𝐶 (𝐾𝑚𝑎𝑥 𝜆 ) =1 q q 1 2 ˜ 𝐷 (𝜆) 𝐿0 4. 
B.6 Convergence Rate of Squared Hinge and Hinge Loss

We establish explicit convergence rates for the squared hinge and hinge losses in this section. To do so, we utilize several properties of these two loss functions. They are listed below without proof, since they are easy to verify; one can refer to [128] for the proofs. For simplicity, we use $\mathcal{L}_1$ and $\mathcal{L}_2$ to denote the hinge and squared hinge losses, respectively.

1. For both losses, $\lambda\|f^\lambda\|_K^2 < D(\lambda)$ for a given $\lambda > 0$.

2. For both losses, $\|f^\lambda\|_\infty = \sup_{\mathcal{X}\in\mathcal{H}}|f^\lambda(\mathcal{X})| \leq K_{\max}\|f^\lambda\| \leq K_{\max}\sqrt{L_0/\lambda}$.

3. For the hinge loss, $L_0 = 1$, $\zeta_\lambda \leq 1 + K_{\max}\sqrt{D(\lambda)/\lambda}$, $\tilde{\zeta}_\lambda \leq 1 + K_{\max}\sqrt{L_0/\lambda}$, and $C(K_{\max}\sqrt{L_0/\lambda}) = 1$.

4. For the squared hinge loss, $L_0 = 1$, $\zeta_\lambda \leq \big(1 + K_{\max}\sqrt{D(\lambda)/\lambda}\big)^2$, $\tilde{\zeta}_\lambda \leq 2\big(1 + K_{\max}\sqrt{L_0/\lambda}\big)$, and $C(K_{\max}\sqrt{L_0/\lambda}) = 2K_{\max}\sqrt{1/\lambda}$.

5. For the hinge loss, $\Psi_{\mathcal{L}_1} \leq \frac{1}{\lambda}$, since the dual problem of CP-STM restricts $\alpha_i \leq \frac{1}{2n\lambda}$ (see Section 2.2).

6. For the squared hinge loss, $\Psi_{\mathcal{L}_2} = O\big(\frac{1}{\lambda}\big)$, using the lemma below.

For property 6 about the squared hinge loss, we provide some discussion here.

Lemma B.6.1. Let $n_+$ and $n_-$ be the numbers of training samples with labels $+1$ and $-1$, respectively. Define $\Psi = \sup\big\{\sum_i|\beta_i| : f = \sum_{\mathcal{X}_i\in\mathcal{X}}\beta_i K(\mathcal{X}_i, \cdot)\big\}$, the supremum of the absolute sum of coefficients over every possible function in the RKHS. Then
\[
\Psi_{\mathcal{L}_2} \leq \frac{1}{C_K + \dfrac{4 n_+ n_- \lambda}{n_+ + n_-}},
\]
where $C_K = \min_{\beta : \beta^\top \mathbf{1} = 1}\beta^\top K\beta$ depends on the kernel matrix $K$.

The proof of this lemma is available in [39]. Using this lemma, we obtain a corollary that establishes the last property of the squared hinge loss.

Corollary B.6.1.1. For the squared hinge loss, the supremum of the sum of absolute coefficients stays finite as the sample size grows:
\[
\Psi_{\mathcal{L}_2} = O\Big(\frac{1}{\lambda}\Big).
\]

Proof. The sum of the eigenvalues of $K$ is of order $O(n)$, since the trace of $K$ is of order $O(n)$; this follows from $K(\mathcal{X}_i, \mathcal{X}_i) = O(1)$, which is guaranteed by assumption AS.7. By the arithmetic mean and geometric mean inequality, $4n_+n_-/n^2 \leq 1$. Assuming that $K$ is positive definite, $\Psi_{\mathcal{L}_2} = O(1)$; this agrees with the bound of Theorem 3.3 in [29], which is of order $O(1/\lambda)$ up to constants. When $K$ is not positive definite, $C_K = 0$ and the bound in Lemma B.6.1 is of order $O(1/\lambda)$. Combining these two cases for the kernel matrix $K$, the corollary is proved. $\square$

Note that $C_K$ depends on the kernel and on the geometric configuration of the data points, and this quantity influences the error of the projected classifier; the bounds above are consistent with [39]. If the data come from a bounded domain, i.e., $\|\mathcal{X}\|_2 \leq C_{d,r}$, the Gram matrix $K$ can have a strictly positive minimum eigenvalue, so that $C_K > 0$. The key idea is to divide the bounded domain into a minimal increasing sequence of discs $D_n$ formed between rings of radii $R_{n-1}$ and $R_n$ such that $\mathcal{X} \subseteq \cup_{n=1}^{N} D_n$, so that the diameter of $\mathcal{X}$ is at most $2R_N$ (assuming $R_n < \infty$), and then to count the number of points in each $D_n$. Some regularity conditions on the distribution of the data within each disc $D_n$ are therefore necessary to bound the eigenvalues of the Gram matrix. When these are unknown, the Gram matrix can be estimated; [126] discusses the regularization error of such estimation.

B.6.1 Proof of Proposition 3.5.3

For the squared hinge loss, consider the projected dimension $P_j = \big\lceil 3 r^4 d\,(\log(n/\delta_1))^2/\epsilon^2\big\rceil + 1$ for each mode $j = 1, 2, \ldots, d$. Adopting Theorem 3.5.2 together with properties 1, 2, 4, and 6, we have, with probability at least $(1 - 2\delta_1)(1 - \delta_2)$ and some $\eta \in (0, 1]$,
\begin{align*}
\mathcal{R}_{\mathcal{L}}(g_n^\lambda) - \mathcal{R}^*_{\mathcal{L}}
&\leq \frac{24K_{\max}^2}{\lambda\sqrt{n}} + 18\Big(1 + \frac{K_{\max}}{\sqrt{\lambda}}\Big)\sqrt{\frac{\log(2/\delta_2)}{2n}} + 2\Big(1 + K_{\max}\sqrt{\frac{D(\lambda)}{\lambda}}\Big)^2\sqrt{\frac{2\log(2/\delta_2)}{n}} + D(\lambda) + C_D\Big[\frac{2K_{\max}}{\lambda\sqrt{\lambda}} + \frac{1}{\lambda}\Big]\epsilon^d \\
&\leq O\Big(\frac{1}{\sqrt{n\lambda^2}}\Big) + O\Big(\frac{1}{\sqrt{n\lambda}}\Big)\sqrt{\log\frac{2}{\delta_2}} + O\Big(\frac{1}{\sqrt{n\lambda^{2(1-\eta)}}}\Big)\sqrt{\log\frac{2}{\delta_2}} + O(\lambda^{\eta}) + O\Big(\frac{\epsilon^d}{\lambda^{3/2}}\Big),
\end{align*}
where $\delta_1 \in (0, \tfrac12)$ and $\delta_2 \in (0, 1)$. Plugging in assumption AS.10 and replacing the last term above by a quantity expressed in terms of $n$, we obtain
\[
\mathcal{R}_{\mathcal{L}}(g_n^\lambda) - \mathcal{R}^*_{\mathcal{L}} \leq C\sqrt{\log\frac{2}{\delta_2}}\cdot\Big(\frac{1}{n}\Big)^{\frac{\mu\eta}{2\eta+3}},
\]
where $\epsilon = (1/n)^{\mu/(2d)}$ for some $0 < \mu < 1$ and $\lambda = (1/n)^{\mu/(2\eta+3)}$ for some $0 < \eta \leq 1$. The projected dimension then becomes $P_j = \big\lceil 3r^4 d\,(\log(n/\delta_1))^2/\epsilon^2\big\rceil + 1 = \big\lceil 3r^4 d\, n^{\mu/d}(\log(n/\delta_1))^2\big\rceil + 1$.
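To see how these quantities scale with the sample size, the following sketch evaluates the projected dimension $P_j$ and the resulting squared-hinge excess-risk rate $(1/n)^{\mu\eta/(2\eta+3)}$ for a few values of $n$. The choices of $r$, $d$, $\mu$, $\eta$, and $\delta_1$ are illustrative placeholders only, not values used elsewhere in this dissertation.

```python
# Minimal sketch of the scaling in Proposition 3.5.3:
# P_j = ceil(3 r^4 d (log(n/delta1))^2 / eps^2) + 1 with eps = n**(-mu/(2d)),
# and excess risk of order n**(-mu*eta/(2*eta+3)).
import math

def projected_dim(n, r, d, delta1, mu):
    eps = n ** (-mu / (2 * d))
    return math.ceil(3 * r**4 * d * (math.log(n / delta1)) ** 2 / eps**2) + 1

def excess_risk_rate(n, mu, eta):
    return n ** (-mu * eta / (2 * eta + 3))

for n in (10**3, 10**4, 10**5):
    print(n,
          projected_dim(n, r=3, d=3, delta1=0.05, mu=0.5),
          excess_risk_rate(n, mu=0.5, eta=1.0))
```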
B.6.2 Proof of Proposition 3.5.4

The convergence rate for the hinge loss is established in the same way as that of the squared hinge loss. Adopting Theorem 3.5.2 together with properties 1, 2, 3, and 5, we conclude that
\begin{align*}
\mathcal{R}_{\mathcal{L}}(g_n^\lambda) - \mathcal{R}^*_{\mathcal{L}}
&\leq \frac{12K_{\max}}{\sqrt{n\lambda}} + 9\Big(1 + \frac{K_{\max}}{\sqrt{\lambda}}\Big)\sqrt{\frac{\log(2/\delta_2)}{2n}} + 2\Big(1 + K_{\max}\sqrt{\frac{D(\lambda)}{\lambda}}\Big)\sqrt{\frac{2\log(2/\delta_2)}{n}} + D(\lambda) + \frac{2C_D\,\epsilon^d}{\lambda} \\
&\leq O\Big(\frac{1}{\sqrt{n\lambda}}\Big)\sqrt{\log\frac{2}{\delta_2}} + O\Big(\frac{1}{\sqrt{n\lambda^{2(1-\eta)}}}\Big)\sqrt{\log\frac{2}{\delta_2}} + O(\lambda^{\eta}) + O\Big(\frac{\epsilon^d}{\lambda}\Big),
\end{align*}
with $P_j = O\big([\log(n/\delta_1)]^2/\epsilon^2\big)$ for each mode $j = 1, 2, \ldots, d$. The inequality holds with probability at least $(1 - 2\delta_1)(1 - \delta_2)$, for some $\delta_1 \in (0, \tfrac12)$ and $\delta_2 \in (0, 1)$. Using assumption AS.10 again, we get
\[
\mathcal{R}_{\mathcal{L}}(g_n^\lambda) - \mathcal{R}^*_{\mathcal{L}} \leq C\sqrt{\log\frac{2}{\delta_2}}\cdot\Big(\frac{1}{n}\Big)^{\frac{\mu\eta}{2\eta+2}},
\]
where $\epsilon = (1/n)^{\mu/(2d)}$ for $0 < \mu < 1$ and $\lambda = (1/n)^{\mu/(2\eta+2)}$ for some $0 < \eta \leq 1$. The projected dimension becomes $P_j = \big\lceil 3r^4 d\,n^{\mu/d}[\log(n/\delta_1)]^2\big\rceil + 1$.

B.7 Proof of Theorem 3.5.3

Theorem 3.5.3 establishes the rate of convergence of the expected risk difference, showing $\ell_1$ consistency: the error vanishes in expectation as the sample size increases. Theorem 3.5.3 therefore establishes a stronger optimality of our algorithm than the risk difference vanishing in probability. In this subsection, we show that Theorem 3.5.3 holds.

First, we show a corollary about the expected difference between projected tensor CP components and the original CP components. (A numerical illustration of this mode-wise distance preservation appears at the end of this section.)

Corollary B.7.0.1. For any two $d$-mode tensors in rank-$r$ CP form, $\mathcal{X}_1 = \sum_{k=1}^{r} x^{(1)}_{1,k}\circ\cdots\circ x^{(d)}_{1,k}$ and $\mathcal{X}_2 = \sum_{k=1}^{r} x^{(1)}_{2,k}\circ\cdots\circ x^{(d)}_{2,k}$, the expected difference between the pairwise factor distances and their projected counterparts has the upper bound
\[
\mathbb{E}\Bigg|\sum_{k,l=1}^{r}\prod_{j=1}^{d}\big\|A^{(j)}x^{(j)}_{1,k} - A^{(j)}x^{(j)}_{2,l}\big\|_2^2 - \sum_{k,l=1}^{r}\prod_{j=1}^{d}\big\|x^{(j)}_{1,k} - x^{(j)}_{2,l}\big\|_2^2\Bigg|
\leq r^2\sqrt{\frac{3^d}{\prod_{j=1}^{d}P_j}}\;\sum_{k,l=1}^{r}\prod_{j=1}^{d}\big\|x^{(j)}_{1,k} - x^{(j)}_{2,l}\big\|_2^2.
\]

Proof. It is known that for any real random variable $W$, $\mathbb{E}(|W|) \leq \sqrt{\mathbb{E}(W^2)}$. Using this result together with the variance of the projection difference stated in the proof of Corollary B.4.2.1, we obtain the claim. $\square$

Next we derive the expected difference in the tensor kernel due to projection.

Proposition B.7.1. For any two $d$-mode tensors in rank-$r$ CP form, $\mathcal{X}_1 = \sum_{k=1}^{r} x^{(1)}_{1,k}\circ\cdots\circ x^{(d)}_{1,k}$ and $\mathcal{X}_2 = \sum_{k=1}^{r} x^{(1)}_{2,k}\circ\cdots\circ x^{(d)}_{2,k}$, suppose the random projection $f_{\text{TPR-CP-TT}}$ is defined by a projection tensor $\mathcal{A}$ satisfying assumption AS.5. For a given tensor kernel function $K(\cdot,\cdot)$, the expected difference in the tensor kernel due to projection satisfies
\[
\mathbb{E}\Big|K\big(f_{\text{TPR-CP-TT}}(\mathcal{X}_1), f_{\text{TPR-CP-TT}}(\mathcal{X}_2)\big) - K(\mathcal{X}_1, \mathcal{X}_2)\Big| \leq C_{d,r}\,r^2\sqrt{\frac{3^d}{\prod_{j=1}^{d}P_j}},
\]
where the constant $C_{d,r}$ is taken from Proposition B.4.1.

Proof. We proceed as in the proof of Proposition B.4.1. Using the result of Corollary B.7.0.1, we derive the claim. $\square$

Using Proposition B.7.1 and the derivations in the proof of Proposition 3.5.2, we conclude that, in expectation with respect to the random projection,
\[
\mathbb{E}_{\mathcal{A}}\Big|\mathcal{R}_{\mathcal{L}}(f^\lambda_{\mathcal{A},n}) + \lambda\|f^\lambda_{\mathcal{A},n}\|^2 - \mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \lambda\|f_n^\lambda\|^2\Big|
\leq C_{d,r}\,\Psi\Big[C\big(K_{\max}\sqrt{L_0/\lambda}\big) + \lambda\Psi\Big]\sqrt{\frac{3^d}{\prod_{j=1}^{d}P_j}}
= O\Bigg(\frac{1}{\lambda^{q}\sqrt{\prod_{j=1}^{d}P_j}}\Bigg). \tag{B.6}
\]
However, it should be noted that for a fixed sample size $n$, and thus a fixed $\epsilon$, the projected dimension $P_j$ changes only as a function of the probability $\delta_1$. We now prove the following result on the expected risk of the projection error, which vanishes as the sample size increases.

Proposition B.7.2. Under conditions AS.1 to AS.8 and AS.10 to AS.11, the expected risk of the projection error goes to $0$ as $n$ increases:
\[
\mathbb{E}_n\mathbb{E}_{\mathcal{A}}\Big|\mathcal{R}_{\mathcal{L}}(f^\lambda_{\mathcal{A},n}) + \lambda\|f^\lambda_{\mathcal{A},n}\|^2 - \mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \lambda\|f_n^\lambda\|^2\Big| = O\Big(\frac{\epsilon^d}{\lambda^{q}}\Big).
\]
Proof.
From equation (B.6) and condition AS.5, we obtain the following statement: with probability at least $1 - \delta_1$,
\begin{align*}
\mathbb{E}_{\mathcal{A}}\Big|\mathcal{R}_{\mathcal{L}}(f^\lambda_{\mathcal{A},n}) + \lambda\|f^\lambda_{\mathcal{A},n}\|^2 - \mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \lambda\|f_n^\lambda\|^2\Big|
&\leq \frac{1}{\lambda^{q}\sqrt{\prod_{j=1}^{d}P_j}} \\
&\leq \frac{\epsilon^d}{\lambda^{q}}\,O\Big(\frac{1}{[\log(n/\delta_1)]^{d}}\Big) && (\text{AS.5}) \\
&\leq \frac{\epsilon^d}{\lambda^{q}}\,O\Big(\frac{1}{n}\Big) && (\delta_1 \text{ as a function of } n,\ \text{AS.11}).
\end{align*}
Using the above and substituting the value of $\delta_1$ as a function of $n$ from condition AS.11, we establish that, for sufficiently large $n$,
\[
\mathbb{P}_n\Bigg(\mathbb{E}_{\mathcal{A}}\Big|\mathcal{R}_{\mathcal{L}}(f^\lambda_{\mathcal{A},n}) + \lambda\|f^\lambda_{\mathcal{A},n}\|^2 - \mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \lambda\|f_n^\lambda\|^2\Big| > \frac{1}{n}\,O\Big(\frac{\epsilon^d}{\lambda^{q}}\Big)\Bigg) \leq n\exp\big(-n^{\frac{1}{d}}\big). \tag{B.7}
\]
We utilize the following lemma about the expectation of a sequence of random variables with the tail behaviour in equation (B.7).

Lemma B.7.1. For some $d \geq 1$, let $W_n$ be a sequence of positive random variables such that, for sufficiently large $n$,
\[
\mathbb{P}_n\Big(W_n > \frac{1}{n}\Big) \leq n\exp\big(-n^{\frac{1}{d}}\big).
\]
Then the sequence $W_n$ is uniformly bounded almost surely and in expectation ($L_1$ norm).

Proof. One can show that $\sum_{n=1}^{\infty}\mathbb{P}_n(W_n > 1) < \infty$. By the Borel–Cantelli lemma, the sequence $W_n$ is uniformly bounded above by $1$ almost surely. By the dominated convergence theorem, for any sequence of positive random variables that is uniformly bounded almost surely, the expectation is bounded by the same uniform bound, i.e., $\mathbb{E}_n(W_n) \leq 1$ for all sufficiently large $n$. $\square$

Let us denote $\mathbb{E}_{\mathcal{A}}\big|\mathcal{R}_{\mathcal{L}}(f^\lambda_{\mathcal{A},n}) + \lambda\|f^\lambda_{\mathcal{A},n}\|^2 - \mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \lambda\|f_n^\lambda\|^2\big| = W_n\,O(\epsilon^d/\lambda^{q})$. Taking $d$ to be the order of the feature tensors, Lemma B.7.1 gives
\[
\mathbb{E}_n\Big[W_n\,O\Big(\frac{\epsilon^d}{\lambda^{q}}\Big)\Big] \leq O\Big(\frac{\epsilon^d}{\lambda^{q}}\Big).
\]
This concludes the proof of the bound on the risk difference due to projection stated in Proposition B.7.2. $\square$

In the following statements, we bound the expectation of the sampling error of the training data.

Proposition B.7.3. Under conditions AS.1 to AS.4 and AS.7, the expectation of the sampling risk vanishes as the sample size increases:
\[
\mathbb{E}_n\Big|\mathcal{R}_{\mathcal{L}}(g_n^\lambda) - \mathcal{R}^{\mathcal{A}}_{\mathcal{L},T_n}(g_n^\lambda) + \mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \mathcal{R}_{\mathcal{L},T_n}(f_n^\lambda) + \mathcal{R}^{\mathcal{A}}_{\mathcal{L},T_n}(f^\lambda_{\mathcal{A},n}) - \mathcal{R}_{\mathcal{L}}(f^\lambda_{\mathcal{A},n}) + \mathcal{R}_{\mathcal{L},T_n}(f^\lambda) - \mathcal{R}_{\mathcal{L}}(f^\lambda)\Big|
= O\Big(\frac{1}{\sqrt{n\lambda^2}}\Big) + O\Big(\tilde{\zeta}_\lambda\sqrt{\frac{1}{2n}}\Big) + O\Big(\zeta_\lambda\sqrt{\frac{1}{n}}\Big).
\]

Proof. Denote the random variable inside the expectation above by $H_n$. Note that $H_n$ is independent of the random projection; thus taking the expectation $\mathbb{E}_{\mathcal{A}}$ does not change it. Ignoring constants, we derive the following from equation (B.5):
\[
\mathbb{P}_n\Bigg(H_n > O\Big(\frac{1}{\sqrt{n\lambda^2}}\Big) + O\Big(\frac{\tilde{\zeta}_\lambda}{\sqrt{n}}\sqrt{\log(2/\delta_2)}\Big) + O\Big(\frac{\zeta_\lambda}{\sqrt{n}}\sqrt{\log(2/\delta_2)}\Big)\Bigg) \leq \delta_2.
\]
We need the following proposition on sub-Gaussian random variables to complete the proof of Proposition B.7.3.

Proposition B.7.4. For any real random variable $W$ with a sub-Gaussian tail, meaning $\mathbb{P}\big(W > \sqrt{\log(2/\delta_2)}\big) \leq \delta_2$, we have $\mathbb{E}(W) = O(1)$.

Proof. Using the change of variable $\delta_2 = 2e^{-u^2}$, we obtain $\mathbb{P}(W > u) \leq 2e^{-u^2}$. Using the identity that for any positive random variable $W$, $\mathbb{E}(W) = \int_0^{\infty}\mathbb{P}(W > w)\,dw$, we prove the proposition. $\square$

We further split $H_n = O\big(\frac{1}{\sqrt{n\lambda^2}}\big) + O\big(\frac{\tilde{\zeta}_\lambda}{\sqrt{n}}\big)W_1 + O\big(\frac{\zeta_\lambda}{\sqrt{n}}\big)W_2$, where $W_1$ and $W_2$ are random variables with sub-Gaussian tails. Applying Proposition B.7.4 to these sub-Gaussian random variables proves Proposition B.7.3. $\square$

Gathering the results from Theorem 3.5.2, Proposition B.7.2, and Proposition B.7.3, we can now show the following conclusion:
\[
\mathbb{E}_n\big|\mathbb{E}_{\mathcal{A}}[\mathcal{R}_{\mathcal{L}}(g_n^\lambda)] - \mathcal{R}^*_{\mathcal{L}}\big| \leq O\Big(\frac{1}{\sqrt{n\lambda^2}}\Big) + O\Big(\tilde{\zeta}_\lambda\sqrt{\frac{1}{2n}}\Big) + O\Big(\zeta_\lambda\sqrt{\frac{1}{n}}\Big) + O\big(D(\lambda)\big) + O\Big(\frac{\epsilon^d}{\lambda^{q}}\Big). \tag{B.8}
\]
Referring to the values of $\tilde{\zeta}_\lambda$ and $\zeta_\lambda$ given in the proofs of Propositions 3.5.4 and 3.5.3, and under conditions AS.6 and AS.9 to AS.11, the claim follows as a corollary of equation (B.8). Therefore, we complete our proof of Theorem 3.5.3.
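As mentioned after Corollary B.7.0.1, the mode-wise distance preservation underlying the bound (B.6) can be checked numerically. The sketch below uses plain i.i.d. Gaussian projection matrices scaled by $1/\sqrt{P_j}$ as a simplified stand-in for the TPR-CP-TT projection; the CP rank, dimensions, and factor matrices are simulated placeholders rather than quantities from our experiments.

```python
# Minimal sketch: compare the pairwise CP-factor distance sums before and after
# mode-wise Gaussian random projection, as in Corollary B.7.0.1.
import numpy as np

rng = np.random.default_rng(1)
d, r, I, P = 3, 2, 500, 50           # number of modes, CP rank, original and projected dims
X1 = [rng.normal(size=(I, r)) for _ in range(d)]   # factor matrices of tensor X1
X2 = [rng.normal(size=(I, r)) for _ in range(d)]   # factor matrices of tensor X2
A = [rng.normal(scale=1.0 / np.sqrt(P), size=(P, I)) for _ in range(d)]

def factor_distance(U, V):
    # sum over CP components k, l of the product over modes j of ||u_jk - v_jl||^2
    total = 0.0
    for k in range(r):
        for l in range(r):
            prod = 1.0
            for j in range(d):
                prod *= np.sum((U[j][:, k] - V[j][:, l]) ** 2)
            total += prod
    return total

original = factor_distance(X1, X2)
projected = factor_distance([A[j] @ X1[j] for j in range(d)],
                            [A[j] @ X2[j] for j in range(d)])
print("relative distortion:", abs(projected - original) / original)
```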
B.8 Technical Details about Numerical Studies

All the code for our numerical studies is available at our GitHub repository https://github.com/PeterLiPeide/TEC_Tensor_Ensemble_Classifier. In the Simulation folder, one can find all the values of the tuning parameters, the optimal rank of the CP decomposition, and the dimension of the random projection. We also provide code to regenerate our synthetic data, so that the whole simulation study is reproducible with our CP-STM module.

B.9 Simulation Study Discussion

Table B.1: TEC Simulation Results II: Cluster with 128GB RAM

Model  Metric        RBF-SVM  AAM    LLSVM  BSGD   LDA    RF
F2     Accuracy (%)  94.50    75.31  95.94  96.67  89.13  98.50
       STD (%)       2.15     6.18   1.86   3.95   3.64   1.50
       Time (s)      92       120    360    790    205    9.5
F3     Accuracy (%)  100      83.33  98.50  77.5   97.63  100
       STD (%)       0.00     3.54   1.37   14.39  1.90   0.00
       Time (s)      80       120    385    935    175    7.5
F5     Accuracy (%)  89.63    52.92  50     50     83.75  76.50
       STD (%)       2.80     8.28   0.00   0.00   4.50   7.65
       Time (s)      117      140    350    1820   231    8.65

For the completeness of our numerical study, we further apply the vector-based methods of simulation study 3.6 to the high-dimensional classification tasks on a high-performance cluster equipped with a 16-core CPU and 128GB of memory. The classification accuracies of all the vector-based classifiers on the F2, F3, and F5 tasks are reported in Table B.1. When these results from a more powerful machine are taken into account, the advantages of our TEC models become even more apparent. With much more memory, the BSGD model provides the best accuracy of 96.67% in F2, whereas our TEC with hinge loss attains a 98% average accuracy. A similar situation occurs in F5, where the best vector-based method, RBF-SVM, achieves 89.63% accuracy and our TEC with squared hinge loss outperforms it slightly. Only in F3 do RBF-SVM and RF have the best performance, exceeding the TEC models by 2% in accuracy. Since this performance requires considerably more computer memory, we believe that TEC models in general have greater potential than these traditional methods.

APPENDIX C

APPENDIX FOR CHAPTER 4

C.1 Proof of Theorem 4.5.1

Proof. To prove Proposition 4.5.1, we introduce a few more notations. Let $\mathcal{L}$ be a loss function satisfying condition AS.2. We denote the classification risk of an arbitrary decision function $f$ by
\[
\mathcal{R}_{\mathcal{L}}(f) = \mathbb{E}_{\mathcal{X}\times\mathcal{Y}}\,\mathcal{L}\big(y, f(\mathcal{X})\big) = \int \mathcal{L}\big(y, f(\mathcal{X})\big)\,d\mathbb{P},
\]
where the expectation is taken over the joint distribution of $\mathcal{X}\times\mathcal{Y}$. Notice that this risk, $\mathcal{R}_{\mathcal{L}}(f)$, differs from our notation $\mathcal{R}(f)$ in Section 4.5, since we use the Lipschitz continuous loss $\mathcal{L}$ instead of the zero-one loss to measure the classification error. $\mathcal{L}$ is also called a surrogate loss for classification problems; examples of such surrogate losses include the hinge loss and the squared hinge loss, and a comparison of these loss functions and their statistical properties can be found in [151]. If we denote the Bayes risk under the surrogate loss $\mathcal{L}$ by $\mathcal{R}^*_{\mathcal{L}}$, i.e., $\mathcal{R}^*_{\mathcal{L}} = \min \mathcal{R}_{\mathcal{L}}(f)$ over all measurable functions $f$, then the result of [151] says that $\mathcal{R}_{\mathcal{L}}(f_n) \to \mathcal{R}^*_{\mathcal{L}}$ implies $\mathcal{R}(f_n) \to \mathcal{R}^*$ for any sequence of decision rules $\{f_n\}$. This conclusion holds as long as the surrogate loss is self-calibrated; see [128]. Since we use the hinge loss in our problem, and the hinge loss is known to be Lipschitz and self-calibrated, assumption AS.2 holds in our discussion. Thus, we only need to show $\mathcal{R}_{\mathcal{L}}(f_n) \to \mathcal{R}^*_{\mathcal{L}}$ to prove Proposition 4.5.1.
Given the tuning parameter $\lambda$ satisfying condition AS.4, we denote
\[
f_n^\lambda = \arg\min_{f\in\mathcal{H}}\ \lambda\|f\|^2_{\mathcal{H}} + \frac{1}{n}\sum_{i=1}^{n}\mathcal{L}\big(f(\mathcal{X}_i), y_i\big),
\]
where $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) generated by the kernel function (4.8). As mentioned in Section 4.3, $\mathcal{H}$ is also known as the collection of functions of the form of equation (4.10). We further define
\[
f^\lambda = \arg\min_{f\in\mathcal{H}}\ \lambda\|f\|^2_{\mathcal{H}} + \mathcal{R}_{\mathcal{L}}(f).
\]
Then $f^\lambda$ is the optimal decision function in $\mathcal{H}$ minimizing the regularized expected risk. Comparing $f_n^\lambda$ with $f^\lambda$, one can view $f^\lambda$ as the version of $f_n^\lambda$ obtained when the training data size $n$ is as large as possible. If we denote $\mathcal{R}_{\mathcal{L},T_n}(f) = \frac{1}{n}\sum_{i=1}^{n}\mathcal{L}(f(\mathcal{X}_i), y_i)$, then $\mathcal{R}_{\mathcal{L},T_n}(f)$ is a sample estimate of $\mathcal{R}_{\mathcal{L}}(f)$. With $f^\lambda$, the triangle inequality gives
\[
\big|\mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \mathcal{R}^*_{\mathcal{L}}\big| \leq \big|\mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \mathcal{R}_{\mathcal{L}}(f^\lambda)\big| + \big|\mathcal{R}_{\mathcal{L}}(f^\lambda) - \mathcal{R}^*_{\mathcal{L}}\big|.
\]
Since the Bayes risk under the loss $\mathcal{L}$ is defined as $\mathcal{R}^*_{\mathcal{L}} = \min_{f:\mathcal{X}\to\mathcal{Y}}\mathcal{R}_{\mathcal{L}}(f)$ over all functions defined on $\mathcal{X}$, we can immediately show that
\[
\big|\mathcal{R}_{\mathcal{L}}(f^\lambda) - \mathcal{R}^*_{\mathcal{L}}\big| \leq \mathbb{E}_{\mathcal{X}\times\mathcal{Y}}\big|\mathcal{L}\big(y, f^\lambda(\mathcal{X})\big) - \mathcal{L}\big(y, f^*(\mathcal{X})\big)\big| \leq C(K_{\max})\,\sup\big|f^\lambda - f^*\big| \leq C(K_{\max})\cdot\epsilon. \tag{C.1}
\]
This is a consequence of conditions AS.1 and AS.2 in Proposition 4.5.1: $f^\lambda$ lies in the RKHS and is thus bounded by a constant depending on $K_{\max}$, while $f^*$ is continuous on the compact subspace $\mathcal{X}$ (all tensor components considered are bounded by condition AS.1) and is thus bounded as well. The universal approximating property in condition AS.3 makes equation (C.1) vanish as $\epsilon$ goes to zero. Thus, the consistency result is established once we show that $|\mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \mathcal{R}_{\mathcal{L}}(f^\lambda)|$ converges to zero. This can be done with Rademacher complexity; see Chapter 26 in [125]. From the objective function (4.9), we have
\[
\mathcal{R}_{\mathcal{L},T_n}(f_n) + \lambda_n\|f_n\|^2 \leq L_0 \tag{C.2}
\]
under condition AS.2, obtained by simply taking $f = 0$ as a naive classifier. Thus $\|f_n\| \leq \sqrt{L_0/\lambda_n}$. Let $M_n = \sqrt{L_0/\lambda_n}$, and let $f_\epsilon \in \mathcal{H}$ be such that $\mathcal{R}_{\mathcal{L}}(f_\epsilon) \leq \mathcal{R}_{\mathcal{L}}(f^\lambda) + \frac{\epsilon}{2}$. Then $\|f_\epsilon\| \leq M_n$ when $n$ is sufficiently large, because condition AS.4 gives $\lambda_n \to 0$ and hence $M_n \to \infty$. Note that we introduce $f_\epsilon$ because it is independent of $n$; its norm, although bounded by $M_n$, is a constant that does not change with $n$. By Rademacher complexity, the following inequalities hold with probability at least $1 - \delta$, where $0 < \delta < 1$:
\begin{align*}
\mathcal{R}_{\mathcal{L}}(f_n^\lambda)
&\leq \mathcal{R}_{\mathcal{L},T_n}(f_n^\lambda) + \frac{2C(K_{\max})M_n}{\sqrt{n}} + \big(L_0 + C(K_{\max})M_n\big)\sqrt{\frac{\log(2/\delta)}{2n}} \\
&\leq \mathcal{R}_{\mathcal{L},T_n}(f_\epsilon) + \lambda_n\|f_\epsilon\|^2 - \lambda_n\|f_n^\lambda\|^2 + \frac{2C(K_{\max})M_n}{\sqrt{n}} + \big(L_0 + C(K_{\max})M_n\big)\sqrt{\frac{\log(2/\delta)}{2n}} && (f_\epsilon \text{ is not the optimizer on the training data}) \\
&\leq \mathcal{R}_{\mathcal{L},T_n}(f_\epsilon) + \lambda_n\|f_\epsilon\|^2 + \frac{2C(K_{\max})M_n}{\sqrt{n}} + \big(L_0 + C(K_{\max})M_n\big)\sqrt{\frac{\log(2/\delta)}{2n}} && (\text{drop } \lambda_n\|f_n^\lambda\|^2 \geq 0) \\
&\leq \mathcal{R}_{\mathcal{L}}(f_\epsilon) + \lambda_n\|f_\epsilon\|^2 + \frac{4C(K_{\max})M_n}{\sqrt{n}} + 2\big(L_0 + C(K_{\max})M_n\big)\sqrt{\frac{\log(2/\delta)}{2n}} && (\text{Rademacher complexity again}).
\end{align*}
Let $\delta = \frac{1}{n^2}$, and choose $N$ large enough that for all $n \geq N$,
\[
\lambda_n\|f_\epsilon\|^2 + \frac{4C(K_{\max})M_n}{\sqrt{n}} + 2\big(L_0 + C(K_{\max})M_n\big)\sqrt{\frac{\log(2/\delta)}{2n}} \leq \frac{\epsilon}{2}.
\]
Such an $N$ exists because $\|f_\epsilon\|$ is a constant with respect to $n$ and all the other terms converge to zero. Thus
\[
\mathcal{R}_{\mathcal{L}}(f_n^\lambda) \leq \mathcal{R}_{\mathcal{L}}(f_\epsilon) + \frac{\epsilon}{2} \leq \mathcal{R}_{\mathcal{L}}(f^\lambda) + \epsilon
\]
with probability at least $1 - \frac{1}{n^2}$. We conclude that
\[
\mathbb{P}\big(|\mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \mathcal{R}_{\mathcal{L}}(f^\lambda)| > \epsilon\big) \to 0 \tag{C.3}
\]
for arbitrary $\epsilon$. This establishes the weak consistency of CP-STM. For strong consistency, we consider, for each $n$,
\[
\sum_{n=1}^{\infty}\mathbb{P}\big(|\mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \mathcal{R}_{\mathcal{L}}(f^\lambda)| > \epsilon\big) \leq N - 1 + \sum_{n=1}^{\infty}\frac{1}{n^2} < \infty.
\]
By the Borel–Cantelli lemma, $\mathcal{R}_{\mathcal{L}}(f_n^\lambda) \to \mathcal{R}_{\mathcal{L}}(f^\lambda)$ almost surely; see [42]. The proof is finished. $\square$
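Once trial-level kernel matrices from the two modalities are available, a decision rule of the form analyzed above can be prototyped with any kernel SVM solver that accepts a precomputed Gram matrix. The sketch below is schematic only: the simulated factor matrices, the RBF kernels, and the fixed weights $w_1$, $w_2$ are illustrative placeholders, whereas the actual C-STM kernel (4.8) is constructed from coupled ACMTF factors and its kernel weights are chosen by multiple kernel learning.

```python
# Schematic sketch of a two-modality, precomputed-kernel classifier in the spirit of
# C-STM.  K is a fixed convex combination of per-modality Gram matrices; this is not
# the dissertation's implementation, only a minimal stand-in.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
n = 120
F_eeg = rng.normal(size=(n, 10))      # stand-ins for trial-level latent factors (EEG)
F_fmri = rng.normal(size=(n, 8))      # stand-ins for trial-level latent factors (fMRI)
y = rng.choice([-1, 1], size=n)       # synthetic labels

w1, w2 = 0.6, 0.4                     # fixed convex weights, for illustration only
K = w1 * rbf_kernel(F_eeg) + w2 * rbf_kernel(F_fmri)   # combined trial-by-trial kernel

clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
print("training accuracy:", clf.score(K, y))
```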
C.2 Data Pre-processing for Section 4.7

We provide further details about our EEG-fMRI data pre-processing and fMRI data extraction in this section. Most of the processing steps follow [65].

C.2.1 fMRI Data

The fMRI data processing includes three major steps: pre-processing, identification of regions of interest (ROI), and data extraction. We describe all of these steps here. All steps are performed with SPM 12 in Matlab. The image pre-processing part consists of five steps: realignment, co-registration, segmentation, normalization, and smoothing.

• Realignment: This procedure aligns all the 3D BOLD volumes recorded over time to remove artifacts caused by head motion, and also estimates head position. For each task, there are three sessions of fMRI scans, providing 510 scans in total for each subject. These scans are realigned within subject to the average of these 510 scans (the average across time). In SPM, we create three independent sessions to load all the fMRI runs and choose not to reslice the images at this step; reslicing is done in the normalization step, and avoiding an extra reslicing avoids introducing new artifacts. The mean scan created in this step is used for co-registration.

• Co-registration: Since all the fMRI scans are aligned to the mean scan, we have to transform the T1-weighted anatomical scan to match their orientation. The reason is that all the data will finally be transformed into a standardized space, and estimating such a transformation from the T1-weighted scan provides higher accuracy, since anatomical scans have higher resolution. Matching the orientation of the T1-weighted scan with the fMRI scans makes it possible to apply the transformation estimated from the T1 scan directly to the fMRI data. In this step, we keep the mean fMRI scan stationary and move the T1 anatomical scan to match it. A resliced T1-weighted scan is created in this step.

• Segmentation: This step estimates a deformation transformation mapping the data into the MNI 152 template space [83, 22]. A forward deformation field is created in this step.

• Normalization: In this step, the forward deformation is applied to all realigned fMRI scans, transforming all the data into the MNI template space. The voxel size is set to 3 × 3 × 4 mm, the same as the original images.

• Smoothing: All normalized fMRI volumes are then smoothed by a 3D Gaussian kernel with a full width at half maximum (FWHM) of 8 × 8 × 8.

This pre-processing procedure is applied to the auditory and visual fMRI scans separately and independently. For each task, the processed fMRI data are used for the statistical analysis introduced in [96, 145]. These models are basic linear mixed-effects models with an auto-regressive covariance structure. Since these models are standard and are outside the scope of this dissertation, we do not introduce them here. For the first-level (subject-level) analysis, we use the model to estimate two contrast images: standard stimulus over baseline and oddball stimulus over baseline. These are the differences between the average BOLD signal during stimulus time and that during no-stimulus (baseline) time; they can be understood as the estimate $\hat{\beta}$ in a regression model $y = x\beta + \epsilon$. These contrasts are then pooled together in the group-level analysis. For each voxel, the group-level analysis performs a T-test comparing the BOLD signals of the standard and oddball contrasts. For voxels whose test results are significant, SPM highlights them as regions of interest (ROI). The ROIs of the auditory and visual tasks are presented in Figure C.1 and Figure C.2 with P-values, and Figure C.3 shows the exact ROIs in the standard brain template in SPM 12. The coordinates of these activated voxels are also provided in the statistical analysis results. To extract the ROI data, we use the "spm_get_data" function in SPM 12. Since we are classifying trials, we only take one fMRI scan for each trial, because the trial duration (0.6 sec) is shorter than the repetition time (2 sec) of the fMRI data. For each trial, we take the k-th fMRI scan, where k = round(onset / TR) + 1. This choice is also inspired by the SPM code.
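For concreteness, the scan-index rule above amounts to only a few lines of code. The onsets and TR below are made-up examples; in our pipeline the actual voxel values are then read with the spm_get_data function in Matlab.

```python
# Illustrative computation of the fMRI scan index for each trial, k = round(onset/TR) + 1
# (1-based, as in SPM/Matlab).  Note: Matlab's round() rounds .5 ties away from zero,
# while Python's round() rounds .5 ties to even; this only matters at exact ties.
TR = 2.0                      # repetition time in seconds
onsets = [3.1, 7.8, 12.4]     # hypothetical trial onset times in seconds

for onset in onsets:
    k = round(onset / TR) + 1
    print(f"onset {onset:5.1f}s -> scan index {k}")
```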
Figure C.1: Auditory fMRI Group Level Analysis

Figure C.2: Visual fMRI Group Level Analysis

Figure C.3: Region of Interest (ROI); (a) Auditory Task, (b) Visual Task

Tasks        Auditory Oddball  Auditory Standard  Visual Oddball  Visual Standard
Subject 1    75                299                75              299
Subject 2    70                287                70              287
Subject 3    74                296                74              296
Subject 5    74                299                74              299
Subject 6    75                290                75              290
Subject 7    73                295                73              295
Subject 8    72                297                72              297
Subject 9    75                297                75              298
Subject 10   72                298                72              298
Subject 11   70                293                70              293
Subject 12   74                299                74              299
Subject 13   71                297                71              297
Subject 14   75                296                75              296
Subject 15   72                295                72              295
Subject 16   74                293                74              293
Subject 17   73                295                73              295

Table C.1: EEG-fMRI Data: Number of Trials per Subject

C.2.2 EEG Data

The EEG data are collected by a custom-built MR-compatible EEG system with 49 channels. [141] provides a re-referenced version of the EEG data with 34 channels, which is used in our experiment. The original and re-referenced channel positions are shown in Figure C.4. This version of the EEG data is sampled at 1,000 Hz and is downsampled to 200 Hz at the beginning of pre-processing. We use the "resample" function in the Matlab Signal Processing toolbox to downsample the EEG data to 200 Hz. Then, we use the functions "ft_preproc_lowpassfilter" and "ft_preproc_highpassfilter" from the SPM 12 toolbox to filter the data; this step removes both low-frequency and high-frequency noise. Finally, we split the EEG into epochs for each trial, starting 100 ms before the onset and ending 500 ms after the onset. According to [65], such a duration is long enough to capture the event-related potential of each trial in the EEG data. We show a few examples of latent factors estimated from the EEG data by our ACMTF decomposition in Figure C.5. For each trial, the topoplot shows the components from the channel mode, and the other plot shows the factors from the time mode.

Figure C.4: EEG Channel Position from [141]

Figure C.5: Examples of EEG Latent Factors (Different Trial and Stimulus Types): Topoplot for Channel Factors (left); Plots for Temporal Factors (right)

BIBLIOGRAPHY

[1] E. Acar, D. M. Dunlavy, and T. G. Kolda. A scalable optimization approach for fitting canonical tensor decompositions. Journal of Chemometrics, 25(2):67–86, 2011.

[2] Evrim Acar, Tamara G Kolda, and Daniel M Dunlavy. All-at-once optimization for coupled matrix and tensor factorizations. arXiv preprint arXiv:1105.3422, 2011.

[3] Evrim Acar, Yuri Levin-Schwartz, Vince D Calhoun, and Tülay Adali. Acmtf for fusion of multi-modal neuroimaging data and identification of biomarkers. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 643–647. IEEE, 2017.

[4] Evrim Acar, Yuri Levin-Schwartz, Vince D Calhoun, and Tülay Adali.
Tensor-based fu- sion of eeg and fmri to understand neurological changes in schizophrenia. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–4. IEEE, 2017. [5] Evrim Acar, Evangelos E Papalexakis, Gözde Gürdeniz, Morten A Rasmussen, Anders J Lawaetz, Mathias Nilsson, and Rasmus Bro. Structure-revealing data fusion. BMC bioin- formatics, 15(1):1–17, 2014. [6] Evrim Acar, Carla Schenker, Yuri Levin-Schwartz, Vince D Calhoun, and Tülay Adali. Unraveling diagnostic biomarkers of schizophrenia through structure-revealing fusion of multi-modal neuroimaging data. Frontiers in neuroscience, 13:416, 2019. [7] Dimitris Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences, 66(4):671–687, 2003. [8] Jeffrey S Anderson, Jared A Nielsen, Alyson L Froehlich, Molly B DuBray, T Jason Druzgal, Annahir N Cariello, Jason R Cooperrider, Brandon A Zielinski, Caitlin Ravichandran, P Thomas Fletcher, et al. Functional connectivity magnetic resonance imaging classification of autism. Brain, 134(12):3742–3754, 2011. [9] Andreas Argyriou, Charles A Micchelli, and Massimiliano Pontil. When is there a representer theorem? vector versus matrix regularizers. The Journal of Machine Learning Research, 10:2507–2529, 2009. [10] John Ashburner, Gareth Barnes, Chun-Chuan Chen, Jean Daunizeau, Guillaume Flandin, Karl Friston, Stefan Kiebel, James Kilner, Vladimir Litvak, Rosalyn Moran, et al. Spm12 manual. Wellcome Trust Centre for Neuroimaging, London, UK, 2464, 2014. [11] Francis R Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9(6), 2008. [12] Ivana Balažević, Carl Allen, and Timothy M Hospedales. Tucker: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590, 2019. 136 [13] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006. [14] Asa Ben-Hur and William Stafford Noble. Kernel methods for predicting protein–protein interactions. Bioinformatics, 21(suppl_1):i38–i46, 2005. [15] Kristin P Bennett, Michinari Momma, and Mark J Embrechts. Mark: A boosting algorithm for heterogeneous kernel models. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 24–31, 2002. [16] Austin R Benson, David F Gleich, and Jure Leskovec. Tensor spectral clustering for par- titioning higher-order network structures. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 118–126. SIAM, 2015. [17] Xuan Bi, Annie Qu, Xiaotong Shen, et al. Multilayer tensor factorization with applications to recommender systems. Annals of Statistics, 46(6B):3308–3333, 2018. [18] Xuan Bi, Xiwei Tang, Yubai Yuan, Yanqing Zhang, and Annie Qu. Tensors in statistics. Annual Review of Statistics and Its Application, 8, 2020. [19] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: appli- cations to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 245–250, 2001. [20] Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. [21] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001. [22] Matthew Brett, Kalina Christoff, Rhodri Cusack, Jack Lancaster, et al. Using the talairach atlas with the mni template. 
Neuroimage, 13(6):85–85, 2001. [23] Vince D Calhoun, Tulay Adali, NR Giuliani, JJ Pekar, KA Kiehl, and GD Pearlson. Method for multimodal analysis of independent source differences in schizophrenia: combining gray matter structural and auditory oddball functional data. Human brain mapping, 27(1):47–62, 2006. [24] Timothy I Cannings and Richard J Samworth. Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):959–1035, 2017. [25] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1–27, 2011. [26] Christos Chatzichristos, Mike Davies, Javier Escudero, Eleftherios Kofidis, and Sergios Theodoridis. Fusion of eeg and fmri via soft coupled tensor decompositions. In 2018 26th European Signal Processing Conference (EUSIPCO), pages 56–60. IEEE, 2018. 137 [27] Christos Chatzichristos, Eleftherios Kofidis, Lieven De Lathauwer, Sergios Theodoridis, and Sabine Van Huffel. Early soft and flexible fusion of eeg and fmri via tensor decompositions. arXiv preprint arXiv:2005.07134, 2020. [28] Cong Chen, Kim Batselier, Ching-Yun Ko, and Ngai Wong. A support tensor train machine. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019. [29] Di-Rong Chen and Han Li. Convergence rates of learning algorithms by random projection. Applied and Computational Harmonic Analysis, 37(1):36–51, 2014. [30] Xinyu Chen, Zhaocheng He, and Jiawei Wang. Spatial-temporal traffic speed patterns discov- ery and incomplete data recovery via svd-combined tensor decomposition. Transportation research part C: emerging technologies, 86:59–77, 2018. [31] Yanyan Chen, Kuaini Wang, and Ping Zhong. One-class support tensor machine. Knowledge- Based Systems, 96:14–28, 2016. [32] Mario Christoudias, Raquel Urtasun, Trevor Darrell, et al. Bayesian localized multiple kernel learning. Univ. California Berkeley, Berkeley, CA, 2009. [33] R Dennis Cook. Regression graphics: Ideas for studying regressions through graphics, volume 482. John Wiley & Sons, 2009. [34] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson and lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003. [35] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. On the best rank-1 and rank-(r 1, r 2,..., rn) approximation of higher-order tensors. SIAM journal on Matrix Analysis and Applications, 21(4):1324–1342, 2000. [36] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013. [37] Yiming Ding, Jae Ho Sohn, Michael G Kawczynski, Hari Trivedi, Roy Harnish, Nathaniel W Jenkins, Dmytro Lituiev, Timothy P Copeland, Mariam S Aboian, Carina Mari Aparici, et al. A deep learning model to predict a diagnosis of alzheimer disease by using 18f-fdg pet of the brain. Radiology, 290(2):456–464, 2019. [38] Nemanja Djuric, Liang Lan, Slobodan Vucetic, and Zhuang Wang. Budgetedsvm: A toolbox for scalable svm approximations. The Journal of Machine Learning Research, 14(1):3813– 3817, 2013. [39] Leo Doktorski. L2-svm: Dependence on the regularization parameter. Pattern Recognition and Image Analysis, 21(2):254–257, 2011. [40] Olivier Duchenne, Francis Bach, In-So Kweon, and Jean Ponce. A tensor-based algorithm for high-order graph matching. IEEE transactions on pattern analysis and machine intelligence, 33(12):2383–2395, 2011. 138 [41] Robert Durrant and Ata Kabán. 
Sharp generalization error bounds for randomly-projected classifiers. In International Conference on Machine Learning, pages 693–701, 2013. [42] Rick Durrett. Probability: theory and examples, volume 49. Cambridge university press, 2019. [43] Hadi Fanaee-T and Joao Gama. Simtensor: A synthetic tensor data generator. arXiv preprint arXiv:1612.03772, 2016. [44] Hadi Fanaee-T and Joao Gama. Tensor-based anomaly detection: An interdisciplinary survey. Knowledge-Based Systems, 98:130–147, 2016. [45] Long Feng, Xuan Bi, and Heping Zhang. Brain regions identified as being associated with verbal reasoning through the use of imaging regression via internal variation. Journal of the American Statistical Association, pages 1–15, 2020. [46] Xiaoli Z Fern and Carla E Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 186–193, 2003. [47] Roger Fletcher. Practical methods of optimization. John Wiley & Sons, 2013. [48] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001. [49] Jannik Fritsch, Tobias Kuehnl, and Andreas Geiger. A new performance measure and eval- uation benchmark for road detection algorithms. In International Conference on Intelligent Transportation Systems (ITSC), 2013. [50] Glenn Fung, Murat Dundar, Jinbo Bi, and Bharat Rao. A fast iterative algorithm for fisher discriminant using heterogeneous kernels. In Proceedings of the twenty-first international conference on Machine learning, page 40, 2004. [51] Mostafa Reisi Gahrooei, Hao Yan, Kamran Paynabar, and Jianjun Shi. Multiple tensor-on- tensor regression: An approach for modeling processes with heterogeneous sources of data. Technometrics, pages 1–23, 2020. [52] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013. [53] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012. [54] Mark Girolami and Mingjun Zhong. Data integration for classification problems employ- ing gaussian process priors. In Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference, volume 19, page 465. MIT Press, 2007. [55] Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12:2211–2268, 2011. 139 [56] Adrian R Groves, Christian F Beckmann, Steve M Smith, and Mark W Woolrich. Linked independent component analysis for multimodal data fusion. Neuroimage, 54(3):2198–2217, 2011. [57] Rajarshi Guhaniyogi, Shaan Qamar, and David B Dunson. Bayesian tensor regression. The Journal of Machine Learning Research, 18(1):2733–2763, 2017. [58] Weiwei Guo, Irene Kotsia, and Ioannis Patras. Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816–827, 2011. [59] Wolfgang Hackbusch. Tensor spaces and numerical tensor calculus, volume 42. Springer, 2012. [60] Peter Hall and Richard J Samworth. Properties of bagged nearest neighbour classifiers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):363–379, 2005. [61] Zhifeng Hao, Lifang He, Bingqian Chen, and Xiaowei Yang. A linear support higher-order tensor machine for classification. 
IEEE Transactions on Image Processing, 22(7):2911–2920, 2013. [62] Lifang He, Kun Chen, Wanwan Xu, Jiayu Zhou, and Fei Wang. Boosted sparse and low-rank tensor regression. arXiv preprint arXiv:1811.01158, 2018. [63] Lifang He, Xiangnan Kong, Philip S Yu, Xiaowei Yang, Ann B Ragin, and Zhifeng Hao. Dusk: A dual structure-preserving kernel for supervised tensor learning with applications to neuroimages. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 127–135. SIAM, 2014. [64] Lifang He, Chun-Ta Lu, Guixiang Ma, Shen Wang, Linlin Shen, Philip S Yu, and Ann B Ragin. Kernelized support tensor machines. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1442–1451. JMLR. org, 2017. [65] Richard N Henson, Hunar Abdulrahman, Guillaume Flandin, and Vladimir Litvak. Mul- timodal integration of m/eeg and f/mri data in spm12. Frontiers in neuroscience, 13:300, 2019. [66] Frank L Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189, 1927. [67] Heng Huang, Chris Ding, Dijun Luo, and Tao Li. Simultaneous tensor subspace selection and clustering: the equivalence of high order svd and k-means clustering. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge Discovery and Data mining, pages 327–335, 2008. [68] Prateek Jain and Sewoong Oh. Provable tensor factorization with missing data. arXiv preprint arXiv:1406.2784, 2014. [69] Svante Janson et al. Gaussian hilbert spaces, volume 129. Cambridge university press, 1997. 140 [70] Yuwang Ji, Qiang Wang, Xuan Li, and Jie Liu. A survey on tensor techniques and applications in machine learning. IEEE Access, 7:162950–162990, 2019. [71] Ruhui Jin, Tamara G Kolda, and Rachel Ward. Faster johnson-lindenstrauss transforms via kronecker products. arXiv preprint arXiv:1909.04801, 2019. [72] William B Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189-206):1, 1984. [73] Esin Karahan, Pedro A Rojas-Lopez, Maria L Bringas-Vega, Pedro A Valdés-Hernández, and Pedro A Valdes-Sosa. Tensor analysis and fusion of multimodal brain images. Proceedings of the IEEE, 103(9):1531–1559, 2015. [74] Ali Khazaee, Ata Ebrahimzadeh, and Abbas Babajani-Feremi. Application of advanced machine learning methods on resting-state fmri network for identification of mild cognitive impairment and alzheimer’s disease. Brain imaging and behavior, 10(3):799–817, 2016. [75] Fei Yan Krystian Mikolajczyk Josef Kittler and Muhammad Tahir. A comparison of l1 norm and l2 norm multiple kernel svms in image and video classification. [76] Marius Kloft, Ulf Brefeld, Soeren Sonnenburg, Pavel Laskov, Klaus-Robert Müller, and Alexander Zien. Efficient and accurate lp-norm multiple kernel learning. In NIPS, volume 22, pages 997–1005, 2009. [77] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009. [78] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009. [79] Tamara Gibson Kolda. Multilinear operators for higher-order decompositions. Technical report, Sandia National Laboratories, 2006. [80] Jean Kossaifi, Zachary C Lipton, Arinbjörn Kolbeinsson, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. Journal of Machine Learning Research, 21:1–21, 2020. [81] J. B. Kruskal. 
Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95 – 138, 1977. [82] Stephen LaConte, Stephen Strother, Vladimir Cherkassky, Jon Anderson, and Xiaoping Hu. Support vector machines for temporal classification of block design fmri data. NeuroImage, 26(2):317–329, 2005. [83] Jack L Lancaster, Diana Tordesillas-Gutiérrez, Michael Martinez, Felipe Salinas, Alan Evans, Karl Zilles, John C Mazziotta, and Peter T Fox. Bias between mni and talairach coor- dinates analyzed using the icbm-152 brain template. Human brain mapping, 28(11):1194– 1205, 2007. 141 [84] Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine learning research, 5(Jan):27–72, 2004. [85] Michele Larobina and Loredana Murino. Medical image file formats. Journal of digital imaging, 27(2):200–206, 2014. [86] Xu Lei, Pedro A Valdes-Sosa, and Dezhong Yao. Eeg/fmri fusion based on independent component analysis: integration of data-driven and model-driven methods. Journal of integrative neuroscience, 11(03):313–337, 2012. [87] Jie Li, Guan Han, Jing Wen, and Xinbo Gao. Robust tensor subspace learning for anomaly detection. International Journal of Machine Learning and Cybernetics, 2(2):89–98, 2011. [88] Lexin Li and Xin Zhang. Parsimonious tensor response regression. Journal of the American Statistical Association, 112(519):1131–1146, 2017. [89] Peide Li and Taps Maiti. Universal consistency of support tensor machine. In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 608–609. IEEE, 2019. [90] Ping Li, Trevor J Hastie, and Kenneth W Church. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 287–296. ACM, 2006. [91] Quefeng Li and Lexin Li. Integrative factor regression and its inference for multimodal data analysis. arXiv preprint arXiv:1911.04056, 2019. [92] Qun Li and Dan Schonfeld. Multilinear discriminant analysis for higher-order tensor data classification. IEEE transactions on pattern analysis and machine intelligence, 36(12):2524– 2537, 2014. [93] Xiaoshan Li, Da Xu, Hua Zhou, and Lexin Li. Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences, 10(3):520–545, 2018. [94] Yingjie Li, Liangliang Zhang, Andrea Bozoki, David C Zhu, Jongeun Choi, and Taps Maiti. Early prediction of alzheimer’s disease using longitudinal volumetric mri data from adni. Health Services and Outcomes Research Methodology, 20(1):13–39, 2020. [95] Yi Lin. A note on margin-based loss functions in classification. Statistics & probability letters, 68(1):73–82, 2004. [96] Martin A Lindquist et al. The statistical analysis of fmri data. Statistical science, 23(4):439– 464, 2008. [97] Jingyu Liu, Godfrey Pearlson, Andreas Windemuth, Gualberto Ruano, Nora I Perrone- Bizzozero, and Vince Calhoun. Combining fmri and snp data to investigate connections between brain function and genetics using parallel ica. Human brain mapping, 30(1):241– 255, 2009. 142 [98] Yipeng Liu, Jiani Liu, and Ce Zhu. Low-rank tensor train coefficient array estimation for tensor-on-tensor regression. IEEE transactions on neural networks and learning systems, 31(12):5402–5411, 2020. [99] Eric F Lock. Tensor-on-tensor regression. Journal of Computational and Graphical Statis- tics, 27(3):638–647, 2018. 
[100] Xiaojing Long, Lifang Chen, Chunxiang Jiang, Lijuan Zhang, and Alzheimer’s Disease Neu- roimaging Initiative. Prediction and classification of alzheimer disease based on quantifica- tion of mri deformation. PloS one, 12(3):e0173372, 2017. [101] Miles E Lopes. A sharp bound on the computation-accuracy tradeoff for majority voting ensembles. eScholarship, University of California, 2013. [102] Canyi Lu, Jiashi Feng, Yudong Chen, Wei Liu, Zhouchen Lin, and Shuicheng Yan. Tensor robust principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5249–5257, 2016. [103] Haiping Lu, Konstantinos N Plataniotis, and Anastasios N Venetsanopoulos. Mpca: Multi- linear principal component analysis of tensor objects. IEEE transactions on Neural Networks, 19(1):18–39, 2008. [104] Wei Lu, Fu-Lai Chung, Wenhao Jiang, Martin Ester, and Wei Liu. A deep bayesian tensor- based system for video recommendation. ACM Transactions on Information Systems (TOIS), 37(1):1–22, 2018. [105] Wenqi Lu, Zhongyi Zhu, and Heng Lian. High-dimensional quantile tensor regression. Journal of Machine Learning Research, 21(250):1–31, 2020. [106] Ron Meir and Tong Zhang. Generalization error bounds for bayesian mixture algorithms. Journal of Machine Learning Research, 4(Oct):839–860, 2003. [107] Charles A Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006. [108] Sebastian Mika, Gunnar Ratsch, Jason Weston, Bernhard Scholkopf, and Klaus-Robert Mullers. Fisher discriminant analysis with kernels. In Neural networks for signal processing IX: Proceedings of the 1999 IEEE signal processing society workshop (cat. no. 98th8468), pages 41–48. Ieee, 1999. [109] John C Morris, Catherine M Roe, Elizabeth A Grant, Denise Head, Martha Storandt, Alison M Goate, Anne M Fagan, David M Holtzman, and Mark A Mintun. Pittsburgh compound b imaging and prediction of progression from cognitive normality to symptomatic alzheimer disease. Archives of neurology, 66(12):1469–1475, 2009. [110] Raziyeh Mosayebi and Gholam-Ali Hossein-Zadeh. Correlated coupled matrix tensor fac- torization method for simultaneous eeg-fmri data fusion. Biomedical Signal Processing and Control, 62:102071, 2020. 143 [111] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Icml, 2011. [112] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006. [113] Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011. [114] Yuqing Pan, Qing Mai, and Xin Zhang. Covariate-adjusted tensor classification in high dimensions. Journal of the American Statistical Association, pages 1–15, 2018. [115] Paul Pavlidis, Jason Weston, Jinsong Cai, and William Noble Grundy. Gene functional classification from heterogeneous data. In Proceedings of the fifth annual international conference on Computational biology, pages 249–255, 2001. [116] A. H. Phan, P. Tichavsky, and A. Cichocki. Low complexity damped gauss–newton al- gorithms for candecomp/parafac. SIAM Journal on Matrix Analysis and Applications, 34(1):126–147, 2013. [117] Michael JD Powell. Nonconvex minimization calculations and the conjugate gradient method. In Numerical analysis, pages 122–141. Springer, 1984. [118] Shibin Qiu and Terran Lane. 
A framework for multiple kernel support vector regression and its applications to sirna efficacy prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(2):190–199, 2008. [119] Beheshteh T Rakhshan and Guillaume Rabusseau. Tensorized random projections. arXiv preprint arXiv:2003.05101, 2020. [120] Bin Ran, Huachun Tan, Yuankai Wu, and Peter J Jin. Tensor based missing traffic data completion with spatial–temporal correlation. Physica A: Statistical Mechanics and its Applications, 446:54–63, 2016. [121] Garvesh Raskutti, Ming Yuan, Han Chen, et al. Convex regularization for high-dimensional multiresponse tensor regression. The Annals of Statistics, 47(3):1554–1584, 2019. [122] Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana, and Alessandro Verri. Are loss functions all the same? Neural computation, 16(5):1063–1076, 2004. [123] Katharina A Schindlbeck and David Eidelberg. Network imaging biomarkers: insights and clinical applications in parkinson’s disease. The Lancet Neurology, 17(7):629–640, 2018. [124] Warren Schudy and Maxim Sviridenko. Concentration and moment inequalities for poly- nomials of independent random variables. In Proceedings of the twenty-third annual ACM- SIAM symposium on Discrete Algorithms, pages 437–446. SIAM, 2012. [125] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. 144 [126] John Shawe-Taylor, Christopher KI Williams, Nello Cristianini, and Jaz Kandola. On the eigenspectrum of the gram matrix and the generalization error of kernel-pca. IEEE Transactions on Information Theory, 51(7):2510–2522, 2005. [127] Marco Signoretto, Quoc Tran Dinh, Lieven De Lathauwer, and Johan AK Suykens. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303–351, 2014. [128] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008. [129] Jing Sui, Godfrey Pearlson, Arvind Caprihan, Tülay Adali, Kent A Kiehl, Jingyu Liu, Jeremy Yamamoto, and Vince D Calhoun. Discriminating schizophrenia and bipolar disorder by fusing fmri and dti in a multimodal cca+ joint ica model. Neuroimage, 57(3):839–855, 2011. [130] Will Wei Sun and Lexin Li. Store: sparse tensor response regression and neuroimaging analysis. The Journal of Machine Learning Research, 18(1):4908–4944, 2017. [131] Will Wei Sun and Lexin Li. Dynamic tensor clustering. Journal of the American Statistical Association, 114(528):1894–1907, 2019. [132] Yanfeng Sun, Junbin Gao, Xia Hong, Bamdev Mishra, and Baocai Yin. Heterogeneous tensor decomposition for clustering via manifold optimization. IEEE transactions on pattern analysis and machine intelligence, 38(3):476–489, 2015. [133] Yiming Sun, Yang Guo, Joel A Tropp, and Madeleine Udell. Tensor random projection for low memory dimension reduction. In NeurIPS Workshop on Relational Representation Learning, 2018. [134] Hiroaki Tanabe, Tu Bao Ho, Canh Hao Nguyen, and Saori Kawasaki. Simple but effective methods for combining kernels in computational biology. In 2008 IEEE International Con- ference on Research, Innovation and Vision for the Future in Computing and Communication Technologies, pages 71–78. IEEE, 2008. [135] D. Tao, X. Li, X. Wu, and S. J. Maybank. General tensor discriminant analysis and gabor fea- tures for gait recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10), 2007. 
[136] Dacheng Tao, Xuelong Li, Weiming Hu, Stephen Maybank, and Xindong Wu. Supervised tensor learning. In Fifth IEEE International Conference on Data Mining (ICDM’05), pages 8–pp. IEEE, 2005. [137] Petr Tichavsky, Anh Huy Phan, and Zbyněk Koldovsky. Cramér-rao-induced bounds for candecomp/parafac tensor decomposition. IEEE Transactions on Signal Processing, 61(8):1986–1997, 2013. [138] Théo Trouillon, Christopher R Dance, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Knowledge graph completion via complex tensor factorization. arXiv preprint arXiv:1702.06879, 2017. 145 [139] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013. [140] Manik Varma and Debajyoti Ray. Learning the discriminative power-invariance trade-off. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007. [141] Jennifer M Walz, Robin I Goldman, Jordan Muraskin, Bryan Conroy, Truman R Brown, and Paul Sajda. "auditory and visual oddball eeg-fmri", 2018. [142] Huan Wang, Shuicheng Yan, Thomas S Huang, and Xiaoou Tang. A convengent solution to tensor subspace learning. In IJCAI, pages 629–634, 2007. [143] Philip Wolfe. Convergence conditions for ascent methods. SIAM review, 11(2):226–235, 1969. [144] Philip Wolfe. Convergence conditions for ascent methods. ii: Some corrections. SIAM review, 13(2):185–188, 1971. [145] Keith J Worsley, Chien Heng Liao, John Aston, V Petre, GH Duncan, F Morales, and AC Evans. A general statistical analysis for fmri data. Neuroimage, 15(1):1–15, 2002. [146] Kun Xie, Lele Wang, Xin Wang, Gaogang Xie, Jigang Wen, and Guangxing Zhang. Accurate recovery of internet traffic data: A tensor completion approach. In IEEE INFOCOM 2016- The 35th Annual IEEE International Conference on Computer Communications, pages 1–9. IEEE, 2016. [147] Shuicheng Yan, Dong Xu, Qiang Yang, Lei Zhang, Xiaoou Tang, and Hong-Jiang Zhang. Discriminant analysis with tensor representation. In Computer Vision and Pattern Recogni- tion, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 526–532. IEEE, 2005. [148] Rose Yu and Yan Liu. Learning from multiway data: Simple and efficient tensor regression. In International Conference on Machine Learning, pages 373–381. PMLR, 2016. [149] Anru Zhang et al. Cross: Efficient low-rank tensor completion. Annals of Statistics, 47(2):936–964, 2019. [150] Changqing Zhang, Huazhu Fu, Si Liu, Guangcan Liu, and Xiaochun Cao. Low-rank ten- sor constrained multiview subspace clustering. In Proceedings of the IEEE international conference on computer vision, pages 1582–1590, 2015. [151] Tong Zhang et al. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–85, 2004. [152] Xin Zhang and Lexin Li. Tensor envelope partial least-squares regression. Technometrics, 59(4):426–436, 2017. [153] Xing Zhang, Gongjian Wen, and Wei Dai. A tensor decomposition-based anomaly detection algorithm for hyperspectral image. IEEE Transactions on Geoscience and Remote Sensing, 54(10):5801–5820, 2016. 146 [154] Yanqing Zhang, Xuan Bi, Niansheng Tang, and Annie Qu. Dynamic tensor recommender systems. Journal of Machine Learning Research, 22(65):1–35, 2021. [155] Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552, 2013. [156] P. Zhou and J. Feng. Outlier-robust tensor pca. In Proc. IEEE Conf. Comput. Vis. 
Pattern Recognit., pages 1–9, 2017.