STATISTICALLY CONSISTENT SUPPORT TENSOR MACHINE FOR MULTI-DIMENSIONAL DATA

By

Peide Li

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics – Doctor of Philosophy

2021

ABSTRACT

STATISTICALLY CONSISTENT SUPPORT TENSOR MACHINE FOR MULTI-DIMENSIONAL DATA

By

Peide Li

Tensors are generalizations of vectors and matrices for multi-dimensional data representation. Fueled by novel computing technologies, tensors have expanded to various domains, including statistics, data science, signal processing, and machine learning. Compared to traditional data representation formats, tensor data representation distinguishes itself with its capability of preserving complex structures and multi-way features of multi-dimensional data. In this dissertation, we explore several tensor-based classification models and their statistical properties. In particular, we propose a few novel support tensor machine methods for huge-size tensor and multimodal tensor classification problems, and study their classification consistency properties. These methods are applied to different applications for validation.

The first piece of work considers classification problems for gigantic multi-dimensional data. Although current tensor-based classification approaches have demonstrated extraordinary performance in empirical studies, they may face additional challenges, such as long processing times and insufficient computer memory, when dealing with big tensors. In chapter 3, we combine tensor-based random projection and support tensor machines, and propose a Tensor Ensemble Classifier (TEC) for ultra-high dimensional tensors, which aggregates multiple support tensor machines estimated from randomly projected CANDECOMP/PARAFAC (CP) tensors. This method utilizes Gaussian and sparse random projections to compress high-dimensional tensor CP factors, and predicts their class labels with support tensor machine classifiers. With the well-celebrated Johnson-Lindenstrauss Lemma and ensemble techniques, TEC methods are shown to be statistically consistent while having high computational efficiency for big tensor data. Simulation studies and real data applications, including Alzheimer's Disease MRI image classification and traffic image classification, are provided as empirical evidence to validate the performance of TEC models.

The second piece of work considers classification problems for multimodal tensor data, which are particularly common in neuroscience and brain imaging analysis. Utilizing multimodal data is of great interest for machine learning and statistics research in these domains, since it is believed that integrating features from multiple sources can potentially increase model performance while unveiling the interdependence between heterogeneous data. In chapter 4, we propose a Coupled Support Tensor Machine (C-STM) which adopts Advanced Coupled Matrix Tensor Factorization (ACMTF) and Multiple Kernel Learning (MKL) techniques for coupled matrix-tensor data classification. The classification risk of C-STM is shown to converge to the optimal Bayes risk, making it a statistically consistent rule. The framework can also be easily extended to multimodal tensors with more than two data modalities. The C-STM is validated through a simulation study as well as a simultaneous EEG-fMRI trial classification problem.
The empirical evidence shows that C-STM can utilize information from multiple sources and provide a better performance comparing to the traditional methods. Copyright by PEIDE LI 2021 To my parents and my grandmother. v ACKNOWLEDGEMENTS I have received support and assistance from many people throughout the writing of this dissertation and my journey toward PhD. I want to take a moment and thank them. First, I would like to express my deepest gratitude to my advisor Dr. Tapabrata Maiti, whose expertise is invaluable in exploring research questions. He always provides me with constructive insights and strong supports that sharpen my thinking and bring my work to a higher level. Without his guidance, I would not have made such a progress in this field. I would also like to thank my dissertation guidance committee members, Dr. Jiayu Zhou, Dr. Ping-shou Zhong, Dr. David Zhu, and Dr. Shrijita Bhattacharya. Their comments and suggestions are extremely beneficial for my research. I want to extend my appreciation to my collaborators, Dr. Selin Aviyente, Dr. Rejaul Karim, and Mr. Emre Sofuoglu. It is a great pleasure to work with them. Their expertise as well as dedication to scientific research help me to extend my work to a much boarder level. I am also grateful to the help I obtained from all the professors and staff members in the Department of Statistics and Probability. I really appreciate the wonderful courses as well as the assistance they provided. During my six years at Michigan State University, I made a lot of friends and met many kind peers. Thanks to them, I did not feel lonely during my PhD journey. I am very grateful to their sincerity and patience. I wish you all have a wonderful future. Finally, I would like to thank my parents and Miss Jialin Qu. Thank you for your accompany and concerns that support me to go through this journey and overcome difficulties especially under the COVID-19 pandemic. vi TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Tensor Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.2 Tensor Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.3 Tensor Product Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.3 The Bayes Error and Classification Consistency . . . . . . . . . . . . . . . . . . . 12 1.3.1 The Bayes Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.3.2 Consistent Classification Rules . . . . . . . . . . . . . . . . . . . . . . . . 13 1.3.3 Surrogate Loss Consistency . . . . . . . . . . . . . . . . . . . . . . . . . 14 CHAPTER 2 TENSOR CLASSIFICATION MODELS . . . . . . . . . . . . . . . . . . . 16 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Tensor Classification Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.1 Support Tensor Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.2 Tensor Discriminant Analysis . . . . . . . . . . . . . . . . 
. . . . . . . . . 21 2.2.3 Tensor Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3 Statistical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.3.1 Universal Tensor Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.3.2 Consistency of CP-STM . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4 Real Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4.1 MRI Classification for Alzheimer’s Disease . . . . . . . . . . . . . . . . . 32 2.4.2 KITTI Traffic Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 CHAPTER 3 TEC: TENSOR ENSEMBLE CLASSIFIER FOR BIG DATA . . . . . . . . 40 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2.1 CP-STM for Tensor Classification . . . . . . . . . . . . . . . . . . . . . . 44 3.2.2 Random Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.3 Methology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.3.1 Tensor-Shaped Random Projection . . . . . . . . . . . . . . . . . . . . . . 47 3.3.2 Random-Projection-Based Support Tensor Machine (RPSTM) . . . . . . . 48 3.3.3 TEC: Ensemble of RPSTM . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4 Model Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.5 Statistical Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 vii 3.5.1 Excess Risk of TEC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.5.2 Excess Risk of RPSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.5.3 Price of Random Projection . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.5.4 Convergence of Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.6 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 3.7 Real Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.7.1 MRI Classification for Alzheimer’s Disease . . . . . . . . . . . . . . . . . 71 3.7.2 KITTI Traffic Image Classification . . . . . . . . . . . . . . . . . . . . . . 72 3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 CHAPTER 4 COUPLED SUPPORT TENSOR MACHINE FOR MULTIMODAL NEU- ROIMAGING DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2.1 CP Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2.2 CP Support Tensor Machine (CP-STM) . . . . . . . . . . . . . . . . . . . 82 4.2.3 Multiple Kernel Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.3.1 ACMTF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.3.2 Coupled Support Tensor Machine (C-STM) . . . . . . . . . . . . . . . . . 85 4.4 Model Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.5 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
88 4.6 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.7 Trial Classification for Simultaneous EEG-fMRI Data . . . . . . . . . . . . . . . . 94 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 APPENDIX A APPENDIX FOR CHAPTER 2 . . . . . . . . . . . . . . . . . . . . 99 APPENDIX B APPENDIX FOR CHAPTER 3 . . . . . . . . . . . . . . . . . . . . 103 APPENDIX C APPENDIX FOR CHAPTER 4 . . . . . . . . . . . . . . . . . . . . 124 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 viii LIST OF TABLES Table 2.1: Biological Information for Subjects in ADNI Study; MMSE: baseline Mini-Mental State Examination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Table 2.2: Real Data: ADNI Classification Comparison I . . . . . . . . . . . . . . . . . . . . . 34 Table 2.3: Real Data: Traffic Image Classification I . . . . . . . . . . . . . . . . . . . . . . . . 37 Table 3.1: TEC: Comparison of Computational Complexity . . . . . . . . . . . . . . . . . . . 55 Table 3.2: TEC Simulation Results I: Desktop with 32GB RAM . . . . . . . . . . . . . . . . . 69 Table 3.3: Real Data: ADNI Classification Comparison II . . . . . . . . . . . . . . . . . . . . 71 Table 3.4: Real Data: Traffic Image Classification II . . . . . . . . . . . . . . . . . . . . . . . 73 Table 4.1: Distribution Specifications for Simulation; 𝑀𝑉 𝑁 𝑁: multivariate normal distri- bution. 𝐼 : identity matrices. Bold numbers are vectors whose elements are all the same. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 Table 4.2: Real Data Result: Simultaneous EEG-fMRI Data Trial Classification (Mean of Performance Metrics with Standard Deviations in Subscripts) . . . . . . . . . 95 Table B.1: TEC Simulation Results II: Cluster with 128GB RAM . . . . . . . . . . . . . . . . . 123 Table C.1: EEG-fMRI Data: Number of Trials per Subject . . . . . . . . . . . . . . . . . . 132 ix LIST OF FIGURES Figure 1.1: Vector, Matrix, Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Figure 1.2: Tensor CP Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 Figure 1.3: Tucker Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Figure 2.1: Real Data: ADNI Classification Reults I . . . . . . . . . . . . . . . . . . . . . . . 35 Figure 2.2: Real Data: Examples of Traffic Objects in KITTI Data . . . . . . . . . . . . . . . . 36 Figure 2.3: Real Data: Traffic Classification Result I . . . . . . . . . . . . . . . . . . . . . . . 38 Figure 3.1: Real Data: ADNI Classification Result II . . . . . . . . . . . . . . . . . . . . . . . 72 Figure 3.2: Real Data: Traffic Image Classification Result II . . . . . . . . . . . . . . . . . . . 74 Figure 4.1: C-STM Model Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 Figure 4.2: Simulation: Average accuracy(bar plot) with standard deviation (error bar) . . . 92 Figure C.1: Auditory fMRI Group Level Analysis . . . . . . . . . . . . . . . . . . . . . . . 129 Figure C.2: Visual fMRI Group Level Analysis . . . . . . . . . . . . . . . . . . . . . . . . 130 Figure C.3: Region of Interest (ROI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 Figure C.4: EEG Channel Position from [141] . . . . . . . . . . . . . . . . . . . . . . . . . . 
133 Figure C.5: Examples of EEG Latent Factors (Different Trial and Stimulus Types): Topoplot for Channel Factors (left); Plots for Temporal Factors (right) . . . . . . . . . . . 134 x LIST OF ALGORITHMS Algorithm 1: Hinge STM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Algorithm 2: Squared Hinge STM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Algorithm 3: DGTDA Projection Learning . . . . . . . . . . . . . . . . . . . . . . . . . 25 Algorithm 4: CMDA Projection Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Algorithm 5: Tensor Discriminant Analysis Classification . . . . . . . . . . . . . . . . . 27 Algorithm 6: Tensor CP Generalized Linear Model . . . . . . . . . . . . . . . . . . . . . 28 Algorithm 7: Hinge TEC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Algorithm 8: Squared Hinge TEC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Algorithm 9: TEC Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 Algorithm 10: ACMTF Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 Algorithm 11: Coupled Support Tensor Machine . . . . . . . . . . . . . . . . . . . . . . . 89 xi CHAPTER 1 INTRODUCTION With the development of computer technologies, more and more data with complex structures are observed in various research domains. The high-dimensionality as well as the multi-dimensional structure of the data have raised new challenges in data analysis to the communities of engineering, statistics, and data science. Learning multi-dimensional data with traditional statistical learning methods may not be appropriate, since these methods can suffer from the curses of dimensionality. Moreover, traditional methods are not able to preserve the intrinsic structures for multi-dimensional data, and are not able to utilize their multi-way features. Thus, developing novel statistical learning frameworks and data modeling techniques for multi-dimensional data has become popular in contemporary machine learning and statistics analysis. As a generalization of vectors and matrices for higher-order data, tensor is originally proposed by [66], and becomes an efficient data representation format for multi-dimension data. Fueled by novel computing technologies arises in the past decade, tensors have expanded to many research domains such as statistics, data science, signal processing, and machine learning. Surveys from [78, 70, 18] demonstrate great potentials of using tensor data representation in data mining, statistics, and machine learning. It turns out that using tensor for multi-dimensional data learning can be efficient and appropriate since tensor can help to preserve multi-way structures for the data. Further, advanced operations in tensor algebra can also help to reduce computational cost, and, more importantly, unveil the complex correlation structures for the data. All these benefits make tensor data representation a perfect tool for learning multi-dimensional data. Similar to the traditional machine learning research, current tensor-based machine learning and data mining techniques can be categorized as supervised learning and unsupervised learning. In the category of supervised learning, there are tensor regression and classification models, which usually take tensors as inputs. 
Depending on the types of outputs, tensor regression models are separated into tensor-to-scalar regression [58, 155, 148, 152, 130, 93, 62, 91], and tensor-to-tensor 1 regression models [99, 121, 98, 51]. Moreover, there are tensor Bayes regression [57], tensor quantile regression [105], and tensor regression-based deep neural network model [80]. For tensor- based classification models, there are models which based on discriminant [147, 142, 103, 92]. In addition, many variants of support tensor machine models [136, 61, 63, 127, 64, 28] are also developed under the idea of maximum margin classifier. [114] provides a extension of probabilistic tensor discriminant analysis which extends linear discriminant to tensor data. Comparing to the supervised learning, research on unsupervised learning with tensors are more dominant. The tensor decomposition techniques in [78] and [113] can be applied in many different application fields for multi-way feature extraction, latent factor estimation, and tensor subspace learning. These decomposition methods are later extended to different application fields. For example, [68, 1, 102] use low-rank tensor decomposition and robust tensor principle component analysis for missing data imputation. In spatio-temporal analysis such as traffic or internet data analysis, tensor decomposition are alos adopted in [120, 30, 146] for spatial and temporal feature extraction. Tensor approaches are also widely applied in anomaly detection problems. Survey from [44] reviews multiple anomaly detection algorithms basing on tensor data representation, which include predicting tensor anomalies from multiple tensors [31] and identifying abnormal elements within a single tensor [87, 153, 156]. Since tensor decomposition can be considered as generalizations of spectral decomposition on higher-order data, [67, 150, 131, 16, 132] use various decomposition methods to perform clustering and community detection for mulit-dimensional and heterogeneous data. In graphcial model and network analysis, [40, 138, 12, 111, 149] use tensor to model the correlation structures among different heterogeneous structures. Additionally, tensor decomposition can be applied in research about recommend system such as [17, 154, 104]. Apart from these two major categories, tensor data representation is often time used for the development of efficient algorithms. For example, random projection is a popular dimension reduction technique but is expensive to apply for high dimensional vector data. Saving the projection matrices can also be memory inefficient. [133, 71, 119] show that tensorizing random projection matrices can reduce the memory cost significantly while preserve the asymptotic isometry property 2 of random projection for high dimensional data. Motivated by these existing work using tensor data representation, this dissertation further investigates the performance of tensor-based machine learning models with a focus on classification problems. Particularly, we explore the statistical property of current tensor classification methods as well as their performance in multiple applications. In addition, we propose novel tensor classifiers for big tensor and multimodal tensor data classification. These become two major contributions in this dissertation. 1.1 Overview The dissertation is organized as follow. In the rest part of this introduction chapter, we provide a review about tensor algebra, operations, and decomposition methods. 
We also briefly introduce some important statistical concepts in classification analysis in this chapter. The next three chapters in the dissertation explore tensor classification problems from different aspects. In chapter 2, We provide a survey about few most popular tensor classification methods in statistics and machine learning literature. All the methods are applied to Alzheimer’s Disease MRI Image classification and Traffic Image classification problems to benchmark their performances. We further investigate the classification consistency for a certain type of non-parameteric tensor classifier, which is CP Support Tensor Machine (CP-STM). We show that with certain tensor kernel functions, CP-STM is statistically consistent. Chapter 3 considers a specific tensor classification problem where the input tensors are in high dimension. In contemporary data science research, multi-dimensional observations such as spatial- temporal data, medical imaging data are usually coming with high dimensionality, i.e, the dimension of each mode is high even the data is in tensor shape. This raises extra challenges to the existing tensor-based classification models. To address the issue of high dimensionality, we propose a Tensor Ensemble Classifier (TEC) for ultra-high dimensional tensors, which aggregates multiple support tensor machines estimated from randomly projected CANDECOMP/PARAFAC (CP) tensors. This method utilizes Gaussian and spares random projections to compress high-dimensional tensor 3 CP factors, and predicts their class labels with support tensor machine classifier. With the well celebrated Johnson-Lindenstrauss Lemma and ensemble techniques, TEC methods are shown to be statistically consistent while having memory efficiency for big tensor data. Simulation studies and real data applications including Alzheimer’s Disease MRI Image classification and Traffic Image classification are provided as empirical evidence to validate the performance of TEC models. In the last chapter, we consider classification problems for multimodal tensor data, which are particularly common in neuroscience and brain imaging analysis. Utilizing multimodal data is of great interest for machine learning and statistics research in these domains, since it is believed that integration of features from multiple sources can potentially increase model performance while unveiling the interdependence between heterogeneous data. In chapter 4, we propose a Coupled Support Tensor Machine (C-STM) which adopted Advanced Coupled Matrix Tensor Factorization (ACMTF) and Multiple Kernel Learning (MKL) techniques for coupled matrix tensor data classification. The excess risk of C-STM is shown to be converging to the optimal Bayes risk, making itself a statistically consistent rule. The framework can also be easily extended for multimodal tensors with data modalities greater than two. The C-STM is validated with in a simulation study as well as in a simultaneous EEG-fMRI trial classification application. The empirical evidence shows that C-STM can utilize information from multiple source and provide a better performance comparing to the traditional methods. 1.2 Tensor Algebra In this section, we introduce notations, and review some elementary concepts about tensors. Detailed introduction for tensor algebra, tensor decomposition, and tensor product space can be referred from [77] and [59]. 1.2.1 Notations The mathematical notations in the rest part of the thesis are defined as follow. 
Numbers and scalars are denoted by lowercase and capital letters such as x, N. Vectors are denoted by boldface lowercase letters, e.g. a. Matrices are denoted by boldface capital letters, e.g. A, B. Higher-dimensional tensors are generalizations of vector and matrix representations for higher-order data, and are denoted by boldface Euler script letters such as X, Y. In general, functions and transformations are also denoted by boldface lowercase letters f, g, with a clear description to distinguish them from vectors. The only exception is the kernel function, which will be denoted by K(·, ·). Vector spaces, functional spaces, and tensor spaces are denoted by calligraphic letters such as H, F. Euclidean spaces with one or multiple dimensions are represented by R^{I_1} and R^{I_1 × I_2}, where I_1 and I_2 stand for the size of each dimension. In addition to these notations, we use E and P to denote expectation and probability for short. Other notations may also be used and will be introduced as needed in the following content.

[Figure 1.1: Vector, Matrix, Tensor — a vector a ∈ R^{I_1}, a matrix A ∈ R^{I_1 × I_2}, and a three-way tensor X ∈ R^{I_1 × I_2 × I_3}]

A tensor generalizes vectors and matrices by including multiple indices in its structure, making it possible to represent multi-dimensional data. Figure 1.1 provides a comparison between a vector, a matrix, and a tensor. The tensor X can represent three-dimensional data since it provides three indices. The order of a tensor is its number of dimensions, also known as ways or modes. For example, the vector a in figure 1.1 is a one-way tensor, the matrix A is a two-way tensor, and X is a three-way tensor. In general, a tensor can have d modes for any positive integer d. Entries of tensors are indexed in the same way as for vectors and matrices: the i-th entry of a vector x is x_i, the (i, j)-th element of a matrix X is x_{i,j}, and the (i_1, ..., i_d)-th element of a d-way tensor X is x_{i_1,...,i_d}. The indices i_1, ..., i_d of a tensor range from 1 to their capital versions, e.g. i_k = 1, ..., I_k for every mode k = 1, ..., d.

Sub-arrays of a tensor are formed when a subset of the indices is fixed. Similar to matrices, which have rows and columns, higher-dimensional tensors have various types of sub-arrays. For example, by fixing every index but one in a d-way tensor, we obtain one of its fibers, which are the analogue of matrix rows and columns. Another frequently used type of tensor sub-array is the slice, a two-dimensional section of a tensor, defined by fixing all but two indices. We will use X_{:i_2...i_d} to denote a fiber of a d-way tensor, and X_{::i_3...i_d} to denote one of its slices.

Like the L2 norm for vectors in Euclidean spaces, the L2 norm, also called the Frobenius norm, of a d-way tensor X ∈ R^{I_1 × ... × I_d} is the square root of the sum of the squares of all its elements, i.e.

||X||_{Fro} = \sqrt{< X, X >} = \sqrt{ \sum_{i_1=1}^{I_1} \cdots \sum_{i_d=1}^{I_d} x_{i_1,...,i_d}^2 }    (1.1)

where

< X_1, X_2 > = \sum_{i_1=1}^{I_1} \cdots \sum_{i_d=1}^{I_d} x_{1, i_1,...,i_d} \cdot x_{2, i_1,...,i_d}    (1.2)

is the inner product of two tensors X_1 and X_2. In the following content, we may use different types of inner products induced by kernel functions, and we will specify those inner products as needed.
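To make the notation above concrete, the following is a minimal NumPy sketch (the array shapes and variable names are illustrative assumptions, not part of the original text) showing how a fiber, a slice, the inner product (1.2), and the Frobenius norm (1.1) of a three-way tensor can be computed.

```python
import numpy as np

# A three-way tensor X of (assumed) shape I1 x I2 x I3
rng = np.random.default_rng(0)
I1, I2, I3 = 4, 5, 3
X = rng.standard_normal((I1, I2, I3))
Y = rng.standard_normal((I1, I2, I3))

# Mode-1 fiber: fix every index except the first one
fiber = X[:, 1, 2]            # shape (I1,)

# Frontal slice: fix all but the first two indices
slice_ = X[:, :, 0]           # shape (I1, I2)

# Inner product of two tensors, equation (1.2)
inner = np.sum(X * Y)

# Frobenius norm, equation (1.1): square root of the sum of squared entries
fro = np.sqrt(np.sum(X ** 2))
assert np.isclose(fro, np.linalg.norm(X))   # matches NumPy's built-in norm
```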
In many situations, one may need to transform a tensor into a vector or a matrix for computation. Such transformations are called tensor vectorization and unfolding. In this thesis, the vectorization of a tensor X ∈ R^{I_1 × ... × I_d} is denoted by Vec(X), which has dimension \prod_{j=1}^{d} I_j. Tensor unfolding reorders a tensor into a matrix, putting the mode-k fibers X_{i_1, i_2, ..., i_{k-1}, :, i_{k+1}, ..., i_d} as the columns of the matrix. As a result, the matrix has shape I_k × \prod_{j=1, j≠k}^{d} I_j, and is denoted by X_{(k)}. Although there are multiple ways of performing tensor vectorization and unfolding, the resulting vectors and matrices are equal up to a permutation. As long as the transformations are consistent, algorithms and theoretical analyses remain intact. We follow the tensor vectorization and unfolding rules from [79] in this thesis.

In addition to these basic concepts, we also need some operations on vectors and matrices in order to construct tensors and present our work. The first is the outer product of vectors. Let a ∈ R^p and b ∈ R^q be two column vectors. Their outer product is defined by

a ∘ b = a · b^T    (1.3)

which is a p × q matrix. If A ∈ R^{p×t} is a p by t matrix and b ∈ R^q is a column vector, then

A ∘ b = [A * b_1, ..., A * b_q]    (1.4)

which is a p × t × q array, where "*" stands for the element-wise product. Taking the outer product with a vector increases the order of the result by one dimension.

Another operation is the Kronecker product, which is a version of the outer product for matrices. Let A ∈ R^{I×J} and B ∈ R^{K×L} be two arbitrary matrices. The Kronecker product of A and B is A ⊗ B ∈ R^{(IK)×(JL)}:

A ⊗ B = \begin{bmatrix} a_{11} B & \cdots & a_{1J} B \\ \vdots & \ddots & \vdots \\ a_{I1} B & \cdots & a_{IJ} B \end{bmatrix} = [a_1 ⊗ b_1, a_1 ⊗ b_2, ..., a_J ⊗ b_{L-1}, a_J ⊗ b_L]    (1.5)

Compared with the vector outer product, it restricts the resulting product to be a matrix.

The Khatri-Rao product is the "matching column-wise" Kronecker product between two matrices with the same number of columns. Given matrices A ∈ R^{I×K} and B ∈ R^{J×K}, the product is defined as

A ⊙ B = [a_1 ⊗ b_1, ..., a_K ⊗ b_K]    (1.6)

It requires the two matrices to have the same number of columns, and the resulting product is a matrix as well. The vector outer product, the matrix Kronecker product, and the matrix Khatri-Rao product can all be regarded as tensor products in mathematical analysis. As a result, we may use ⊗ to denote a general tensor product in part of our theoretical development.

The mode-n product is a product operation defined between a tensor and a matrix. Assume X ∈ R^{I_1 × ... × I_n × ... × I_d} is a d-way tensor and U ∈ R^{P_n × I_n} is a matrix. The mode-n product between the tensor X and the matrix U is defined through its mode-n unfolding as

(X ×_n U)_{(n)} = U · X_{(n)}    (1.7)

where X_{(n)} is the n-th mode unfolding matrix of the tensor X with shape I_n by \prod_{j≠n} I_j. The resulting product is still a d-way tensor with shape I_1 × ... × P_n × ... × I_d.
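The following minimal NumPy sketch (all shapes and names are illustrative assumptions) demonstrates the outer product (1.3), the Kronecker product (1.5), the Khatri-Rao product (1.6), and a mode-1 product (1.7) computed through unfolding.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(3), rng.standard_normal(4)
A, B = rng.standard_normal((3, 2)), rng.standard_normal((4, 2))

# Outer product (1.3): a 3 x 4 matrix
outer = np.outer(a, b)

# Kronecker product (1.5): a (3*4) x (2*2) matrix
kron = np.kron(A, B)

# Khatri-Rao product (1.6): column-wise Kronecker, a (3*4) x 2 matrix
khatri_rao = np.vstack([np.kron(A[:, k], B[:, k]) for k in range(A.shape[1])]).T

# Mode-1 product (1.7): unfold, multiply, and fold back
X = rng.standard_normal((3, 4, 5))   # a 3-way tensor
U = rng.standard_normal((6, 3))      # maps mode 1 from size 3 to size 6
X1 = X.reshape(3, -1)                # a mode-1 unfolding (one valid column ordering)
Y = (U @ X1).reshape(6, 4, 5)        # X x_1 U, a tensor of shape 6 x 4 x 5
```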
1.2.2 Tensor Decomposition

The notations and mathematical operations introduced above make it possible to represent tensors in decomposed forms. Tensor decomposition is a way to represent, or approximate, a tensor with various pre-defined forms. With a specially designed structure, such a representation or approximation makes it more flexible to develop novel machine learning models for tensor data and to optimize existing frameworks by simplifying computation steps. In this section, we review three of the most popular tensor decomposition methods: Parafac, Tucker, and Tensor-Train decomposition.

Candecomp/Parafac (CP) Decomposition is an extension of matrix singular value decomposition to higher-order tensors. It represents a tensor as a summation of vector outer products, as shown in figure 1.2. Each product term in the summation is also known as a rank-one tensor. For a d-mode tensor X ∈ R^{I_1 × I_2 × ... × I_d}, its CP decomposition is defined as

X = \sum_{k=1}^{r} α_k \, x_k^{(1)} ∘ x_k^{(2)} ∘ ... ∘ x_k^{(d)}    (1.8)

where x_k^{(j)} ∈ R^{I_j} are called tensor CP components for j = 1, ..., d. The α_k are scalars and are often merged into one of the CP components for simplicity. As a result, the CP decomposition can also be written as

X = \sum_{k=1}^{r} x_k^{(1)} ∘ x_k^{(2)} ∘ ... ∘ x_k^{(d)}    (1.9)

In our presentation, we will merge the scalar weights α_k into the CP components and use equation (1.9) for the CP decomposition unless we specifically mention the weights. r is known as the CP rank of the tensor, which is the number of different outer products that add up to the tensor. For a tensor which cannot be well represented by equation (1.9), i.e. equation (1.9) does not hold exactly, its CP decomposition is defined as the best approximation of that form,

X ≈ X̂ = \sum_{k=1}^{r} x_k^{(1)} ∘ x_k^{(2)} ∘ ... ∘ x_k^{(d)},  where  X̂ = \arg\min_{X̂} ||X − X̂||_{Fro}

[Figure 1.2: Tensor CP Decomposition — a tensor approximated by a sum of rank-one outer products]

For convenience of notation, we follow [78] and denote the tensor CP decomposition (1.9) as

X = ⟦X^{(1)}, X^{(2)}, ..., X^{(d)}⟧  or  𝔘_X = ⟦X^{(1)}, X^{(2)}, ..., X^{(d)}⟧    (1.10)

where X^{(j)} ∈ R^{I_j × r} are called CP factor matrices. The k-th column of X^{(j)} is the vector-shaped CP factor x_k^{(j)} in equation (1.9). This notation is also called the Kruskal tensor. In this thesis, we will use either "tensor CP decomposition" or "CP tensor" to refer to any tensor expressed in the form (1.9) or (1.10).

Tucker Decomposition is a form of Principal Component Analysis for higher-order tensors, often referred to as Higher-order PCA (HOPCA). It factorizes a tensor into a core tensor multiplied by a factor matrix along each of its modes. The Tucker decomposition of a d-mode tensor X ∈ R^{I_1 × I_2 × ... × I_d} is defined as

X = G ×_1 U^{(1)} ×_2 U^{(2)} ... ×_d U^{(d)}    (1.11)

where G ∈ R^{P_1 × P_2 × ... × P_d} is the core tensor of shape P_1 × P_2 × ... × P_d, and U^{(j)} ∈ R^{I_j × P_j}, j = 1, ..., d, are mode-wise factor matrices. In practice, one can restrict the factor matrices to be orthogonal, and thus consider the columns of these matrices as principal components of each mode. The core tensor G measures the interaction across different components. An example of a 3-way tensor Tucker decomposition is demonstrated in figure 1.3. Similar to the CP decomposition, we can define the Tucker decomposition for an arbitrary tensor X even if equation (1.11) does not hold exactly. It is defined as

X ≈ X̂ = G ×_1 U^{(1)} ×_2 U^{(2)} ... ×_d U^{(d)},  where  X̂ = \arg\min_{X̂} ||X − X̂||_{Fro}

[Figure 1.3: Tucker Decomposition — a three-way tensor X approximated by a core tensor G multiplied by factor matrices U^{(1)}, U^{(2)}, U^{(3)}]

Notice that the CP decomposition is actually a special case of the Tucker decomposition, obtained when the core tensor G is super-diagonal and all P_1, ..., P_d are equal. The estimation of the Tucker decomposition can be done with an iterative alternating least squares algorithm introduced in [35]. Although the Tucker decomposition is not as easy to interpret as the CP decomposition, its mode-wise factor matrices can be regarded as bases for the row space of each tensor mode. Thus, it has been widely applied in problems such as image compression and higher-order data feature extraction.
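A short NumPy sketch (shapes, ranks, and variable names are illustrative assumptions) can make the CP form (1.9)-(1.10) and the Tucker form (1.11) concrete; it rebuilds a full tensor from its factors and verifies that a CP tensor is a Tucker tensor with a super-diagonal core.

```python
import numpy as np

rng = np.random.default_rng(2)
I1, I2, I3, r = 6, 5, 4, 3

# CP / Kruskal form (1.10): factor matrices X^(j) of shape I_j x r
X1, X2, X3 = (rng.standard_normal((I1, r)),
              rng.standard_normal((I2, r)),
              rng.standard_normal((I3, r)))

# Reconstruct the full tensor from equation (1.9): a sum of r rank-one outer products
X_cp = np.einsum('ik,jk,lk->ijl', X1, X2, X3)

# Tucker form (1.11): a core tensor G multiplied by a factor matrix on each mode
P1, P2, P3 = 3, 3, 2
G = rng.standard_normal((P1, P2, P3))
U1, U2, U3 = (rng.standard_normal((I1, P1)),
              rng.standard_normal((I2, P2)),
              rng.standard_normal((I3, P3)))
X_tucker = np.einsum('abc,ia,jb,lc->ijl', G, U1, U2, U3)

# CP as a special case of Tucker: super-diagonal core with P1 = P2 = P3 = r
G_diag = np.zeros((r, r, r))
G_diag[np.arange(r), np.arange(r), np.arange(r)] = 1.0
X_check = np.einsum('abc,ia,jb,lc->ijl', G_diag, X1, X2, X3)
assert np.allclose(X_cp, X_check)
```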
1.2.3 Tensor Product Space

Apart from the algebraic notations and operations for tensors, we also include a brief introduction to tensor functionals and tensor product spaces. They are essential in the development of universal tensor kernel functions and statistical consistency. We refer to [59] for the definition of tensor product spaces and tensor calculus. Since we consider the general tensor product in this section, we use ⊗ to denote it in our description. For finite-dimensional vector spaces, the space of their tensor products is called the algebraic tensor space.

Definition 1.2.1. Let V ⊂ R^{I_1} and W ⊂ R^{I_2} be two compact subspaces of the Euclidean spaces R^{I_1} and R^{I_2}, with V = {v : v = \sum_i α_i v_i} and W = {w : w = \sum_j β_j w_j}, where {v_i} and {w_j} are the bases of V and W. The algebraic tensor space of V and W, denoted by T, is the space spanned by the tensor products of the basis vectors:

T = V ⊗ W = { t : t = \sum_{i,j} γ_{i,j} \, v_i ⊗ w_j }    (1.12)

The basis elements of this algebraic tensor space are {v_i ⊗ w_j}. The characteristic algebraic property of the tensor space is bilinearity, meaning that for all t ∈ T and a ∈ R,

a · t = \sum_{i,j} γ_{i,j} (a · v_i) ⊗ w_j = \sum_{i,j} γ_{i,j} \, v_i ⊗ (a · w_j)

We call this algebraic tensor space a second-order algebraic tensor space since its basis elements are tensor products of two vectors. This second-order tensor space is still a vector space, as defined in mathematics. However, if we consider a specific tensor product, the outer product "∘", in definition 1.2.1, the space is indeed isometric to a second-order tensor (matrix) space. The bijection connecting the two spaces is a specific folding and unfolding rule, as introduced earlier. Notice that the algebraic tensor space measures distance by the Euclidean norm, while the norms of multi-dimensional arrays are measured by the Frobenius norm. The equivalence between the Euclidean norm and the Frobenius norm leaves the distances between points unchanged before and after unfolding, making the two spaces isometric. Similarly, if the general tensor product is replaced by the Kronecker or Khatri-Rao product, definition 1.2.1 can be extended to products of matrix spaces. This isometry property also connects the abstract mathematical definition to the more concrete definition of tensors, especially those decomposed into CP forms. As a result, data in the form of multi-dimensional arrays can be considered as points in a tensor product space. In general, we can define the d-th order algebraic tensor space as

X = V^{(1)} ⊗ V^{(2)} ⊗ ... ⊗ V^{(d)} = span\{ ⊗_{j=1}^{d} v_k^{(j)} : v_k^{(j)} ∈ V^{(j)}, j = 1, ..., d \}    (1.13)

for d-way tensors. This makes it feasible for us to develop further statistical analysis on tensors.

In the definition of the algebraic tensor space, we only consider Euclidean subspaces and connect them to the spaces of multi-dimensional arrays. Indeed, definition 1.2.1 can be extended to tensor products of any metric spaces, such as products of inner product spaces, products of functional spaces, and products of Reproducing Kernel Hilbert spaces. We define:

Definition 1.2.2. Let <·, ·>_j be a general inner product defined on V^{(j)} such that V^{(j)} is an inner product space. Then X = ⊗_{j=1}^{d} V^{(j)} is an inner product space with inner product <·, ·>_X, where

X = span\{ ⊗_{j=1}^{d} f_k^{(j)} : f_k^{(j)} ∈ V^{(j)}, j = 1, ..., d \}    (1.14)

For f = \sum_k ⊗_{j=1}^{d} f_k^{(j)} ∈ X and g = \sum_l ⊗_{j=1}^{d} g_l^{(j)} ∈ X, where f_k^{(j)}, g_l^{(j)} ∈ V^{(j)} are basis functions, the inner product is

< f, g >_X = \sum_{k,l} \prod_{j=1}^{d} < f_k^{(j)}, g_l^{(j)} >_j    (1.15)

This definition generalizes tensor product spaces to arbitrary inner product spaces.
For example, if each V^{(j)} is a univariate functional space, then X is a multivariate functional space whose elements are functions mapping vectors to scalars. Moreover, this definition will help us construct the tensor Reproducing Kernel Hilbert space in chapter 2.

1.3 The Bayes Error and Classification Consistency

Statistical analysis of classification problems often tries to validate the performance of a model by checking whether its classification risk is close to the Bayes risk and whether it is statistically consistent. These two components are essential in evaluating the generalization ability of a specific model. We briefly review the definitions of the Bayes error and classification consistency in this section. More details can be found in [36].

1.3.1 The Bayes Problem

Consider a binary classification problem where (X, Y) is a pair of random variables taking their respective values in R^d and {0, 1}. Let

\eta(x) = P(Y = 1 | x) = E(Y | x)    (1.16)

Naturally, any measurable function f : R^d → {0, 1} can be a potential classifier or decision function. If we consider the naive zero-one loss function L(z, y) = 1\{z ≠ y\}, then the expected loss of a classifier f, called the risk of f, is R(f) = E[L(f(X), Y)] = P(f(X) ≠ Y). Let

f^*(x) = \begin{cases} 1 & \text{if } \eta(x) > 1/2 \\ 0 & \text{otherwise} \end{cases}

It is easy to show that among all possible decision functions, f^* has the smallest risk, making it the best possible classifier. The Bayes problem is to find the optimal classifier f^*, and f^* itself is called the Bayes classifier or Bayes rule. The classification risk of the Bayes rule, R^* = R(f^*), is defined as the Bayes risk, which is the smallest possible risk one can obtain. Under most circumstances, it is infeasible to estimate the Bayes rule directly, since the distribution of (X, Y) is unknown.
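As a toy illustration of the Bayes rule above (and of the empirical risk defined in the next subsection), the following sketch compares the risk of the Bayes classifier on a finite sample with the true Bayes risk. The one-dimensional Gaussian mixture, the sample size, and all names are illustrative assumptions, not from the original text.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Assumed toy model: P(Y=1) = 0.5, X | Y=0 ~ N(-1, 1), X | Y=1 ~ N(+1, 1)
n = 5000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)

def eta(x):
    """eta(x) = P(Y=1 | x); constants cancel because priors and variances are equal."""
    p1 = np.exp(-0.5 * (x - 1.0) ** 2)
    p0 = np.exp(-0.5 * (x + 1.0) ** 2)
    return p1 / (p0 + p1)

f_star = (eta(x) > 0.5).astype(int)   # Bayes classifier f*(x) = 1{eta(x) > 1/2}

# Empirical risk of the Bayes rule on this sample (zero-one loss)
emp_risk = np.mean(f_star != y)

# True Bayes risk R* for this model: P(N(0,1) > 1) by symmetry
bayes_risk = norm.sf(1.0)
print(emp_risk, bayes_risk)           # the two should be close for large n
```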
1.3.2 Consistent Classification Rules

Instead of searching for the Bayes rule, most of the time we construct a classifier from a limited amount of data. Suppose T_n = {(x_1, y_1), ..., (x_n, y_n)} is a collection of observations of the random variables (X, Y). The empirical estimate of the classification risk for a decision function f is

R_n = \frac{1}{n} \sum_{i=1}^{n} 1\{ f(x_i) ≠ y_i \}

where 1{·} is an indicator function. A "good" classifier can be constructed from the data T_n by searching for the optimizer that minimizes the empirical risk. Such a procedure is called Empirical Risk Minimization (ERM), and T_n is called the training set. If we denote the empirically optimal decision function by f_n, it is a function conditional on the training set T_n. If we use the same strategy with different training sets, we obtain a sequence of decision functions {f_n}. Such a sequence of functions, estimated with the same strategy / rule but different training data, is called a classification rule: a way of finding optimal decision functions from training data. The ERM procedure produces a decision rule, and there are further variants of ERM, such as regularized ERM, in the statistical learning literature.

To show that a classification rule is good from a mathematical standpoint, one possible way is to prove that it is statistically consistent.

Definition 1.3.1. A classification rule is (weakly) consistent for a certain distribution of (X, Y) if R(f_n) → R^* in probability, and is strongly consistent if R(f_n) → R^* a.s. as n → ∞.

A consistent rule, not the specific classifier learned from one training set, guarantees that taking sufficiently large samples can reconstruct the unknown data distribution and eventually identify the optimal classifier. This property says that a classification rule is learning the data in the right way, since it will eventually unveil the whole data distribution. Reconstruction here means that, with sufficient training data, the risk of the estimated classifier will eventually equal the Bayes risk of the classification problem, which thereby becomes known. For most classification models, establishing statistical consistency is of great interest, as it validates that the model is learning the data in the "right" way. In this thesis, our theoretical analysis also mostly focuses on establishing statistical consistency for tensor-based classifiers.

1.3.3 Surrogate Loss Consistency

In binary classification problems, the most intuitive and basic loss function is the "zero-one" loss L(z, y) = 1{z ≠ y}. However, its non-convexity brings many challenges in both computational aspects and statistical properties. Moreover, the zero-one loss may perform worse than surrogate losses in various classification applications. Recent works [13, 122, 151] demonstrate that there are many surrogate loss functions for binary classification that are equipped with convexity and nice statistical properties, making the estimation procedure more tractable. Another motivation for using well-behaved surrogate losses in classification applications is that there is a general quantitative relationship between the approximation and estimation risks associated with surrogate losses and those associated with the zero-one loss. Here we denote the risk of a decision function f associated with a surrogate loss L by R_L(f), and the corresponding risk associated with the zero-one loss by R(f). Further, the Bayes risk under the loss function L is denoted by R_L^*, and the Bayes risk under the zero-one loss is denoted by R^*. These risks are defined as

R_L(f) = E_{X×Y}[ L(f(X), Y) ],   R_L^* = E_{X×Y}[ L(f^*(X), Y) ]

where X × Y is the domain of the random variables (X, Y), and the expectation is taken over the joint distribution of (X, Y). f^* is the Bayes classifier such that f^* = \arg\min R_L(f); it is optimal among all measurable functions which map data in X to labels in Y. Results from [151] show that for any measurable function f,

\psi( R(f) − R^* ) ≤ R_L(f) − R_L^*

for a nondecreasing function ψ : [0, 1] → [0, ∞). This suggests that statistical consistency developed under a surrogate loss function implies consistency under the zero-one loss, as long as the surrogate loss is well-behaved. Thanks to this general relationship, one can develop statistical consistency for decision rules using surrogate losses and enjoy their nice mathematical properties, such as Lipschitz continuity, instead of using the zero-one loss. This reduces the difficulty of the problem significantly. Lastly, using surrogate losses may enable us to develop a uniform upper bound on the risk of a function f_n that minimizes the empirical risk. This may help us further obtain an explicit, uniform bound on the excess risk, R_L(f_n) − R_L^*, highlighting the convergence rate of a specific decision rule. Well-behaved losses are sometimes called self-calibrated or classification-calibrated losses. Examples of such loss functions include the Hinge loss, the Squared Hinge loss, and the exponential loss.
A more detailed discussion about self-calibrated loss is available in [95] and the Section 2 of [128]. We estimate tensor classification models using Hinge and Squared Hinge loss in this thesis, and develop our theoretical results with surrogate loss functions. 15 CHAPTER 2 TENSOR CLASSIFICATION MODELS In this chapter, we provide an introduction to several tensor-based classification models, and compare their performance empirically. Moreover, the statistical consistency of few classifiers are established. 2.1 Introduction In contemporary machine learning and statistics research, tensor has become a popular tool to model multi-dimensional data such as spatio-temporal data, brain imaging, and multimodal data. Comparing to the traditional vector presentation, tensor preserves the multi-way structures of the data, providing more correlations among different modes for data mining and modeling. In addition, the existing tensor decomposition methods from [78] can help to estimate the low- dimensional structure for tensor data, which reduce the computational complexity significantly for tensor-based models. As an essential part of supervised tensor learning, tensor classification problems try to predict data labels from tensors. Current literature about tensor classification can be categorized in several groups. First, since distances between tensors can be easily estimated by Frobenious norm, K- nearest neighbour classifiers can be easily established. However, the Frobenious norm of a tensor is equivalent to the L2 norm of its vectorization. Such extensions are indeed equivalent to the vector-based K-nearest neighbour classifiers, and thus have no computational gain from tensor representation. An improvement [92] then has been made on the tensor K-nearest neighbour classifiers by combining it with a Fisher discriminant analysis. Utilizing the multi-way features preserved by tensors, [92] learns multi-linear transformations projecting tensors to lower multi- dimensional spaces where they are easier to be classified. [114] also develops a probabilistic discriminant analysis for tensors, using density instead of distance to discriminate data. Another type of tensor-based classifiers borrow the separating hyperplane from support vector machine, and 16 build support tensor machine models. With different tensor decomposition and kernel functions, there are models like rank-1 CP-STM [136], CP-STMs [63, 64], Tucker STM [127], and support tensor train machine [28]. These models benefit from the distribution-free assumption for tensor data, and are more flexible in real-data applications. Finally, logistic regression model can also be generalized for tensor data. [155] and [93] develop generalized linear regression models with CP and Tucker tensor coefficients, which can be adopted for classification problems. Although the current approaches have demonstrated impressive performance, not all of them provide theoretical guarantee on their generalization ability. According to [36], a classifier with solid generalization ability should be statistically consistent, having their excess classification risks converge to the optimal Bayes risk. Bayes risk is the minimal risk one can obtain from a classification problem with data confirming a certain type of probability distribution. The difference between the risk of a learned classifier and the optimal Bayes risk quantifies the performance of the classifier theoretically. 
Such results are well established for traditional statistical classification approaches; however, they are not complete for all tensor-based methods. In this chapter, we introduce a few popular tensor-based classifiers from the current literature, including CP-STM [63], tensor discriminant analysis [92], and CP-GLM [155], and investigate their performance through numerical studies. Further, we discuss their statistical consistency and provide a theoretical result which establishes the statistical consistency of CP-STM. For the other methods, the results are introduced as they can be easily extended from the existing literature. The rest of this chapter is organized as follows: Section 2.2 reviews three major types of tensor-based classifiers and their consistency results. Section 2.3 develops the consistency result for the support tensor machine model. In section 2.4, we compare the performance of all reviewed tensor-based methods on two different real data applications. Section 2.5 concludes the chapter.

2.2 Tensor Classification Algorithms

In this section, we introduce five different tensor-based classifiers, which are categorized into three groups depending on their model mechanism.

2.2.1 Support Tensor Machine

The support tensor machine extends the idea of the kernel support vector machine (see e.g. [128]) and constructs a separating hyperplane with support tensors for classification. In this part, we review the Candecomp/Parafac Support Tensor Machine (CP-STM) model from [63], and provide two different model estimation algorithms. Suppose there is a training set T_n = {(X_1, y_1), (X_2, y_2), ..., (X_n, y_n)}, where X_i ∈ X ⊂ R^{I_1 × I_2 × ... × I_d} are d-way tensors, X is a compact tensor space which is a subspace of R^{I_1 × I_2 × ... × I_d}, and y_i ∈ {1, −1} are binary labels. CP-STM, like the traditional kernel support vector machine, tries to estimate a decision function f : X → R that minimizes the objective function

\min_f \; λ ||f||^2 + \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), y_i)    (2.1)

L is a loss function for classification such as the Hinge loss, the squared Hinge loss, or the zero-one loss, and λ is a tuning parameter. ||f||^2 = <f, f> = ∫ f(X)^2 dX is the squared functional norm of f. The kernel functions for tensor data are defined on the CP representation of tensors. Assume two d-way tensors with CP rank r are represented as X_1 = \sum_{k=1}^{r} x_{1,k}^{(1)} ∘ x_{1,k}^{(2)} ∘ ... ∘ x_{1,k}^{(d)} and X_2 = \sum_{k=1}^{r} x_{2,k}^{(1)} ∘ x_{2,k}^{(2)} ∘ ... ∘ x_{2,k}^{(d)}. A tensor kernel function is defined as

K(X_1, X_2) = \sum_{l,k=1}^{r} \prod_{j=1}^{d} K^{(j)}( x_{1,l}^{(j)}, x_{2,k}^{(j)} )    (2.2)

where the K^{(j)} are vector-based kernel functions measuring inner products of factors in different tensor modes. The kernel function (2.2) measures the inner product between two tensors by aggregating kernel values of their CP factors across ranks. With the kernel trick and the representer theorem [9], the optimal decision rule for the optimization problem (2.1) has the form

f(X) = \sum_{i=1}^{n} α_i y_i K(X_i, X) = α^T D_y K(X)    (2.3)

where X is a new d-way rank-r tensor with shape I_1 × I_2 × ... × I_d. α = [α_1, ..., α_n]^T are the coefficients learned by plugging the function (2.3) into the objective function (2.1) and minimizing (2.1). D_y is a diagonal matrix whose diagonal elements are y_1, ..., y_n, and K(X) = [K(X_1, X), ..., K(X_n, X)]^T is a column vector.
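To make the CP kernel (2.2) and the decision function (2.3) concrete, here is a minimal NumPy sketch. The function names, the Gaussian/RBF choice for the mode-wise kernels K^(j), and the bandwidth are illustrative assumptions rather than the thesis's specific implementation.

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    """Assumed mode-wise kernel K^(j): a Gaussian/RBF kernel on two CP factor vectors."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

def cp_kernel(factors1, factors2, gamma=1.0):
    """Tensor kernel (2.2); factorsX is a list of d CP factor matrices of shape I_j x r."""
    r1, r2 = factors1[0].shape[1], factors2[0].shape[1]
    val = 0.0
    for l in range(r1):
        for k in range(r2):
            prod = 1.0
            for F1, F2 in zip(factors1, factors2):   # loop over the d modes
                prod *= rbf(F1[:, l], F2[:, k], gamma)
            val += prod
    return val

def stm_decision(alpha, y, train_factors, new_factors, gamma=1.0):
    """Decision function (2.3): f(X) = sum_i alpha_i * y_i * K(X_i, X)."""
    return sum(a * yi * cp_kernel(f, new_factors, gamma)
               for a, yi, f in zip(alpha, y, train_factors))
```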
If we denote the collection of functions of the form (2.3) by H, such that H = { f : f = α^T D_y K(X), α ∈ R^n }, then the optimal STM classifier is

f_n = \arg\min_{f ∈ H} \; λ ||f||^2 + \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), y_i)    (2.4)

and the class labels are predicted by Sign[f_n]. H is the Reproducing Kernel Hilbert Space generated by the tensor kernel (2.2). Support tensors are the tensors whose corresponding coefficients α_i in f_n are non-zero.

Estimating f_n from the training data T_n can be accomplished in various ways. The estimated classifiers can also differ, and have different excess errors, depending on the type of loss function L. We adopt two different loss functions, the Hinge loss and the squared Hinge loss, and provide the corresponding estimation algorithms. The Hinge loss L(f(X), y) = max(0, 1 − y · f(X)) is a convex and non-differentiable loss designed for support vector / tensor types of classifiers. Compared to the regular zero-one loss in binary classification, the Hinge loss imposes the same level of penalty on misclassified points that are close to the separating hyperplane, while putting a more severe penalty on those that are far away from it. [122] demonstrates the statistical robustness and the fast convergence rate of the Hinge loss in binary classification problems. Minimizing the objective function (2.1) with the Hinge loss can be shown to be equivalent to the optimization problem

\min_{α ∈ R^n} \; \frac{1}{2} α^T D_y K D_y α − 1^T α
\text{s.t. } α^T y = 0,  \; 0 ≤ α ≤ \frac{1}{2nλ} 1    (2.5)

Equation (2.5) is the dual problem of the original STM problem with the Hinge loss; the derivation is provided in [25]. Notice that this problem has a quadratic objective function and inequality constraints, and can be solved by Quadratic Programming (QP) as in [20]. The steps are summarized in Algorithm 1. We use Python-style pseudo-code to denote columns of matrices; for example, X_i^(m)[:, k] stands for the k-th column of the CP factor matrix X_i^(m). We will use such notation in the rest of the thesis.

Algorithm 1 Hinge STM
1: procedure STM Train
2:   Input: Training set T_n = {X_i}, y, kernel function K, tensor rank r, λ
3:   for i = 1, 2, ..., n do
4:     X_i = ⟦X_i^(1), ..., X_i^(d)⟧   ⊲ CP decomposition by the ALS algorithm
5:   Create initial matrix K ∈ R^{n×n}
6:   for i = 1, ..., n do
7:     for j = 1, ..., i do
8:       K_{i,j} = \sum_{k,l=1}^{r} \prod_{m=1}^{d} K(X_i^(m)[:, k], X_j^(m)[:, l])   ⊲ Kernel values
9:       K_{j,i} = K_{i,j}
10:  Solve the quadratic programming problem (2.5) and find the optimal α*.
11:  Output: α*
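The dual problem (2.5) is a standard box-constrained QP with one equality constraint. As a minimal sketch of the solve in step 10 of Algorithm 1 (the use of SciPy's SLSQP solver, and all variable names, are my own assumptions; any QP solver would do), one could write:

```python
import numpy as np
from scipy.optimize import minimize

def solve_hinge_stm_dual(K, y, lam):
    """Solve the dual QP (2.5): min 0.5*a'DKDa - 1'a, s.t. a'y = 0, 0 <= a <= 1/(2*n*lam)."""
    n = len(y)
    Dy = np.diag(y.astype(float))
    Q = Dy @ K @ Dy                                # quadratic term D_y K D_y

    objective = lambda a: 0.5 * a @ Q @ a - np.sum(a)
    grad = lambda a: Q @ a - np.ones(n)

    res = minimize(
        objective, x0=np.zeros(n), jac=grad, method="SLSQP",
        bounds=[(0.0, 1.0 / (2 * n * lam))] * n,               # box constraint on alpha
        constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # alpha' y = 0
    )
    return res.x                                   # the dual coefficients alpha*
```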
Another loss function commonly used for binary classification is the Squared Hinge loss, a convex and differentiable surrogate of the Hinge loss. The Squared Hinge loss squares the Hinge loss, L(f(X), y) = (max(0, 1 − y · f(X)))^2, making it differentiable when y · f(X) = 1. Plugging in the Squared Hinge loss makes the objective function (2.1) differentiable, so it can be minimized by setting its derivative to zero. The objective function (2.1) with the Squared Hinge loss, written in matrix form, is

\min_{α ∈ R^n} \; λ α^T D_y K D_y α + \frac{1}{n} \sum_{i=1}^{n} \big( \max(0, 1 − y_i · α^T D_y k_i) \big)^2    (2.6)

Here K is the n-by-n kernel matrix whose (i, j)-th element is K(X_i, X_j), and k_i = [K(X_i, X_1), ..., K(X_i, X_n)]^T is the i-th column of the kernel matrix K. The derivative of (2.6) with respect to α is

∇ = 2λ D_y K D_y α + \frac{2}{n} \sum_{i=1}^{n} (−y_i) D_y k_i \cdot \max(0, 1 − y_i · α^T D_y k_i)
  = 2λ D_y K D_y α + \frac{2}{n} \sum_{i ∈ s} \big( D_y k_i k_i^T D_y α − D_y k_i y_i \big)
  = 2λ D_y K D_y α + \frac{2}{n} \big( D_y K I_s K^T D_y α − D_y K I_s y \big)
  = 2 D_y K \Big[ \big(λ I + \frac{1}{n} I_s K^T\big) D_y α − \frac{1}{n} I_s y \Big]    (2.7)

where y = [y_1, ..., y_n]^T and s is the collection of indices of the support tensors. Support tensors are those tensors with labels (X_i, y_i) such that y_i · α^T D_y k_i < 1 for a given α. I_s is an identity matrix of size n whose diagonal elements corresponding to non-support tensors are set to zero. To estimate the α at which the derivative (2.7) equals zero, we can use the Gauss-Newton method (see e.g. [47]). Denote the second derivative of the objective function (2.6) with respect to α by H:

H = 2 D_y K \big(λ I + \frac{1}{n} I_s K^T\big) D_y    (2.8)

If α* is the root of ∇ = 0, then the Gauss-Newton algorithm uses the first-order Taylor expansion and assumes that ∇|_{α*} = ∇|_{α} + H|_{α} · (α* − α). Since we assume ∇|_{α*} = 0, the equation reduces to

α* = α − H^{-1} ∇|_{α}
   = α − D_y \big(λ I + \frac{1}{n} I_s K^T\big)^{-1} \Big[ \big(λ I + \frac{1}{n} I_s K^T\big) D_y α − \frac{1}{n} I_s y \Big]
   = \frac{1}{n} D_y \big(λ I + \frac{1}{n} I_s K^T\big)^{-1} I_s y
   = \frac{1}{n} D_y \big(λ I + \frac{1}{n} I_s K\big)^{-1} I_s y    (2.9)

We drop the transpose in the last step since the kernel matrix is symmetric. Notice that the derivation uses the fact that D_y is symmetric and orthogonal, D_y D_y^T = I, since y_i^2 = 1. The algorithm starts with an initial value of α and keeps updating α* with equation (2.9) iteratively until convergence. At each iteration, the new estimate α* replaces α for the next iteration. Although the update rule does not include α in its final explicit form (2.9), the indices of the support tensors are updated at each iteration; thus I_s changes, which makes the estimate of α* different. The algorithmic steps for the Squared Hinge loss STM are summarized in Algorithm 2.

Algorithm 2 Squared Hinge STM
1: procedure STM Train
2:   Input: Training set T = {X_i}, y, kernel function K, tensor rank r, λ, η, maxiter
3:   for i = 1, 2, ..., n do
4:     X_i = ⟦X_i^(1), ..., X_i^(d)⟧   ⊲ CP decomposition by the ALS algorithm
5:   Create initial matrix K ∈ R^{n×n}
6:   for i = 1, ..., n do
7:     for j = 1, ..., i do
8:       K_{i,j} = \sum_{k,l=1}^{r} \prod_{m=1}^{d} K(X_i^(m)[:, k], X_j^(m)[:, l])   ⊲ Kernel values
9:       K_{j,i} = K_{i,j}
10:  Create α* = 1_{n×1}, α = 0_{n×1}   ⊲ Initial values, can be different
11:  Iteration = 0
12:  while ||α* − α||_2 > η and Iteration ≤ maxiter do
13:    α = α*
14:    Find s ∈ R^{n×1}, s_i ∈ {0, 1}, such that s_i = 1 if y_i · α^T D_y k_i < 1   ⊲ Indicating support tensors
15:    I_s = diag(s)   ⊲ Create diagonal matrix with s as diagonal
16:    α* = (1/n) D_y (λ I + (1/n) I_s K)^{-1} I_s y   ⊲ Update
17:  Output: α*

After estimating α from either (2.5) with the Hinge loss or (2.6) with the Squared Hinge loss, the class label of a new tensor X is predicted by Sign[f(X)] = Sign[α^T D_y K(X)]. The CP representations of the tensors in the training data are already available; to calculate K(X), one has to find the CP representation of the testing tensor X and then calculate the kernel values with equation (2.2).

2.2.2 Tensor Discriminant Analysis

The second type of classification method is tensor-based discriminant analysis (TDA). Tensor discriminant analysis combines the tensor-based K-nearest neighbour classifier and multilinear feature
It seeks a tensor-to-tensor projection transforming tensors into a new tensor subspace which maximizes the data separation. To measure the level of data separation in the new tensor subspace, TDA adopted two criteria from Fisher discriminant analysis [108]: the scatter ratio criterion and the scatter difference criterion. The optimal tensor-to-tensor projection is selected to maximize either one of the criterion. The discriminant analysis with tensor representation (DATER) [147] and multilinear discriminant analysis MDA [103] search the optimal projection utilizing the maximum ratio criteria. However, the algorithms are not stable, and do not converge over iterations. The general tensor discriminant analysis (GTDA) from [135] and MDA from [142] use the maximum scatter difference criterion, and provide two convergent algorithms in tensor subspace learning. However, the model classification performance relies heavily on tuning parameters. [92] proposes Direct General Tensor Discriminant Analysis (DGTDA) which maximizes the scatter difference and estimates the global optimal tensor-to-tensor projection without parameter tuning. In addition, they also propose a Constrained Multilinear Discriminant Analysis (CMDA) by maximizing the 22 scatter ratio and restricting the tensor-to-tensor projection matrices to be orthogonal. In this part, we provide a review on DGTDA and CMDA methods. In tensor discriminant analysis (TDA), a tensor-to-tensor projection is defined using tensor mode-wise products introduced in section 1.2. Suppose X is a d-way tensor with size 𝐼1 × ... × 𝐼 𝑑 , 𝑃 ×...×𝑃 then a tensor-to-tensor projection transforms X to Z ∈ R 1 𝑑 Z = X ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) (2.10) where 𝑈 ( 𝑗) are 𝑃 𝑗 × 𝐼 𝑗 projection matrices. The projection is defined uniquely by the collection of projection matrices {𝑈 𝑈 (1) ,𝑈 𝑈 (2) ...𝑈 𝑈 (𝑑) }. Now let’s assume the training data are tensors from binary classes, and are denoted by X 𝑐,𝑖 . 𝑐 = 1, 2 stands for the class of tensor data, and 𝑖 stands for the 𝑖-th sample from class 𝑐. Like traditional statistics, the mean projected tensor for class 𝑐 is defined as 𝑛𝑐 𝑛𝑐 1 Õ 1 Õ Z𝑐 = Z 𝑐,𝑖 = X 𝑐,𝑖 ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) 𝑛𝑐 𝑛𝑐 𝑖=1 𝑖=1 (2.11) = X 𝑐 ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) 𝑛Í𝑐 where X 𝑐 = 𝑛1𝑐 X 𝑐,𝑖 is the class mean of original tensors in class 𝑐. 𝑛𝑐 is the number of samples 𝑖=1 in class 𝑐. Similarly, the overall mean of tensors from both classes is 2 2 𝑛 𝑐 1Õ 1 ÕÕ Z= 𝑛𝑐 · Z 𝑐 = X 𝑐,𝑖 ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) 𝑛 𝑛 (2.12) 𝑐=1 𝑐=1 𝑖=1 = X ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) 2 where X = 𝑛1 𝑛𝑐 · X 𝑐 . 𝑛 = 𝑛1 + 𝑛2 is the total number of samples. As an extension of Fisher Í 𝑐=1 discriminant analysis, TDA looks at mode-wise between-class scatter matrices and within-class scatter matrices in the projected subspace. For mode 𝑗, the between class scatter matrix is defined 23 as 2 Õ > 𝑛𝑐 Z 𝑐 − Z ( 𝑗) · Z 𝑐 − Z ( 𝑗)    𝐵𝑗 = 𝑐=1 2 Õ Ö Ö > (2.13) X𝑐 − X ) × 𝑘𝑈 (𝑘) ( 𝑗) · (XX𝑐 − X ) × 𝑘𝑈 (𝑘) ( 𝑗)    = 𝑛𝑐 (X 𝑐=1 𝑘 𝑘 𝑗¯ = 𝑈 ( 𝑗) 𝐵 𝑗 𝑈 ( 𝑗)> ¯ 2 > 𝑗 X 𝑐 − X ) 𝑘≠ 𝑗 𝑈 (𝑘) ( 𝑗) · (X X 𝑐 − X ) 𝑘≠ 𝑗 × 𝑘𝑈 (𝑘) ( 𝑗) is the between-class scatter Í  Î   Î 𝐵𝑗 = 𝑛𝑐 (X 𝑐=1 matrix in the partially projected subspace (all modes excepts 𝑗-th mode are projected). 𝐵 𝑗 is in ¯ 𝑗 dimension of 𝑃 𝑗 × 𝑃 𝑗 , and 𝐵 𝑗 is in dimension 𝑃 𝑗 × 𝑃 𝑗 since the 𝑗-th mode is not projected. Z 𝑐 − Z ( 𝑗) are the 𝑗-th mode unfolding matrices for tensor Z 𝑐 − Z which is in dimension     Î 𝐼 𝑗 × 𝑘≠ 𝑗 𝐼 𝑘 . The derivation of equation (2.13) is available in [92]. 
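As an aside, the tensor-to-tensor projection (2.10) and the fully projected mode-$j$ between-class scatter in (2.13) can be sketched in a few lines of numpy. The code below is a minimal illustration, not from the dissertation; the helper names are hypothetical and dense numpy arrays are assumed for the tensors.

```python
import numpy as np

def mode_product(T, U, mode):
    """Mode product T x_mode U: the matrix U (P x I_mode) acts on axis `mode` of T."""
    T = np.moveaxis(T, mode, 0)
    out = (U @ T.reshape(T.shape[0], -1)).reshape((U.shape[0],) + T.shape[1:])
    return np.moveaxis(out, 0, mode)

def tensor_projection(X, Us):
    """Tensor-to-tensor projection (2.10): Z = X x_1 U^(1) x_2 U^(2) ... x_d U^(d)."""
    Z = X
    for j, U in enumerate(Us):
        Z = mode_product(Z, U, j)
    return Z

def unfold(T, mode):
    """Mode-`mode` unfolding: an I_mode x (product of the other dimensions) matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def between_class_scatter(class_tensors, Us, mode):
    """Fully projected mode-j between-class scatter, first line of (2.13).
    `class_tensors[c]` is an array of shape (n_c, I_1, ..., I_d) for class c."""
    Z_means = [np.mean([tensor_projection(X, Us) for X in Xc], axis=0)
               for Xc in class_tensors]
    n_c = [len(Xc) for Xc in class_tensors]
    Z_bar = sum(n * Z for n, Z in zip(n_c, Z_means)) / sum(n_c)
    B = np.zeros((Z_bar.shape[mode], Z_bar.shape[mode]))
    for n, Zc in zip(n_c, Z_means):
        D = unfold(Zc - Z_bar, mode)
        B += n * D @ D.T
    return B
```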
With the same idea, the mode-$j$ within-class scatter matrix is defined as
$W_j = \frac{1}{n}\sum_{c=1}^{2}\sum_{i=1}^{n_c}\Big[\big((\mathcal{X}_{c,i}-\bar{\mathcal{X}}_c)\textstyle\prod_k\times_k U^{(k)}\big)_{(j)}\Big]\Big[\big((\mathcal{X}_{c,i}-\bar{\mathcal{X}}_c)\textstyle\prod_k\times_k U^{(k)}\big)_{(j)}\Big]^{\top} = U^{(j)}\bar{W}_j U^{(j)\top} \qquad (2.14)$
where $\bar{W}_j = \frac{1}{n}\sum_{c=1}^{2}\sum_{i=1}^{n_c}\big[\big((\mathcal{X}_{c,i}-\bar{\mathcal{X}}_c)\prod_{k\neq j}\times_k U^{(k)}\big)_{(j)}\big]\big[\big((\mathcal{X}_{c,i}-\bar{\mathcal{X}}_c)\prod_{k\neq j}\times_k U^{(k)}\big)_{(j)}\big]^{\top}$ is the within-class scatter matrix in the partially projected subspace (all modes except the $j$-th mode are projected). Notice that the within-class scatter matrices (2.14) and the between-class scatter matrices (2.13) are analogous to the within-class and between-class covariance matrices in traditional statistics. Thus, for each mode, an optimal projection matrix can be estimated by maximizing either the ratio of (2.13) to (2.14), or the difference between (2.13) and (2.14).

DGTDA learns the projection matrices by maximizing the scatter difference of (2.13) and (2.14). Additionally, it assumes that all the projection matrices are orthogonal. The objective function for DGTDA is
$\max_{U^{(j)}}\ \|B_j\|_{\mathrm{Fro}}^2 - \zeta\|W_j\|_{\mathrm{Fro}}^2 = \mathrm{tr}\big(U^{(j)}\bar{B}_j U^{(j)\top}\big) - \zeta\,\mathrm{tr}\big(U^{(j)}\bar{W}_j U^{(j)\top}\big) \quad \text{s.t.}\ U^{(j)}U^{(j)\top} = I \qquad (2.15)$
For each mode $j$, the projection matrix can be estimated with a singular value decomposition. The whole algorithm estimates $U^{(j)}$ in a single run, without multiple iterations or tuning parameters; $\zeta$ is selected through the singular value decomposition instead of being tuned. We summarize the procedure in Algorithm 3, where $SVD$ denotes the ordinary matrix singular value decomposition, which is standard and used without further introduction.

Algorithm 3 DGTDA Projection Learning
1: procedure DGTDA
2: Input: tensors $\{\mathcal{X}_{c,i};\ i = 1, \ldots, n_c;\ c = 1, 2\}$, target dimensions $P_1, P_2, \ldots, P_d$
3: for $j = 1, 2, \ldots, d$ do
4:   Calculate $\bar{B}_j = \sum_{c=1}^{2} n_c\big[(\mathcal{X}_c - \bar{\mathcal{X}})_{(j)}\big]\big[(\mathcal{X}_c - \bar{\mathcal{X}})_{(j)}\big]^{\top}$
5:   Calculate $\bar{W}_j = \frac{1}{n}\sum_{c=1}^{2}\sum_{i=1}^{n_c}\big[(\mathcal{X}_{c,i} - \bar{\mathcal{X}}_c)_{(j)}\big]\big[(\mathcal{X}_{c,i} - \bar{\mathcal{X}}_c)_{(j)}\big]^{\top}$
6:   $[\,], \Sigma, [\,] = SVD(\bar{W}_j^{-1}\bar{B}_j)$  ⊲ SVD; keep the singular values only
7:   $\zeta = \max(\mathrm{diag}(\Sigma))$  ⊲ Take the maximum singular value
8:   $M = \bar{B}_j - \zeta\,\bar{W}_j$
9:   $U, [\,], [\,] = SVD(M)$  ⊲ SVD; keep the left singular vectors
10:  $\hat{U}^{(j)} = U[:, 1\!:\!P_j]^{\top}$  ⊲ Take the first $P_j$ columns and transpose
11: Return $\{\hat{U}^{(j)},\ j = 1, \ldots, d\}$

Instead of maximizing the scatter difference, CMDA learns the projection matrices by maximizing the ratio of (2.13) to (2.14). Different from the existing TDA work [147, 103] that also uses the scatter ratio, CMDA includes the orthogonality assumption on the projection matrices so that the algorithm converges over iterations. The objective function for CMDA is
$\max_{U^{(j)}}\ \frac{\|B_j\|_{\mathrm{Fro}}^2}{\|W_j\|_{\mathrm{Fro}}^2} = \frac{\mathrm{tr}\big(U^{(j)}\bar{B}_j U^{(j)\top}\big)}{\mathrm{tr}\big(U^{(j)}\bar{W}_j U^{(j)\top}\big)} \quad \text{s.t.}\ U^{(j)}U^{(j)\top} = I \qquad (2.16)$
For each mode $j$, the projection matrix can again be estimated with a singular value decomposition. The whole algorithm estimates $U^{(j)}$ over iterations until all the estimated projection matrices are approximately orthogonal. We summarize this in Algorithm 4.
Algorithm 4 CMDA Projection Learning 1: procedure CMDA 2: Input: Tensors {X X 𝑐,𝑖 ; 𝑖 = 1, 2, ..., 𝑛𝑐 ; 𝑐 = 1, 2}, Target dimension 𝑃1 , 𝑃2 ..., 𝑃 𝑑 , 𝜂, maxitor ( 𝑗) 3: Initialize 𝑈 0 = 1 ; 𝑗 = 1, .., 𝑑 ⊲ Initialize value, matrix with values all equal to 1 4: while 𝑡 6 maxiter do ⊲ Repeat up to max iteration 5: for j = 1 2, 3, ...,d do 𝑗¯ 2 > X 𝑐 − X ) 𝑘≠ 𝑗 𝑈 (𝑘) ( 𝑗) · (X X 𝑐 − X ) 𝑘≠ 𝑗 × 𝑘𝑈 (𝑘) ( 𝑗) Í  Î   Î 6: 𝐵 𝑗,𝑡 = 𝑛𝑐 (X 𝑐=1 𝑗¯ 2 𝑛Í𝑐  > X𝑖,𝑐 − X 𝑐 ) 𝑘≠ 𝑗 𝑈 (𝑘) ( 𝑗) · (X X𝑖,𝑐 − X 𝑐 ) 𝑘≠ 𝑗 × 𝑘𝑈 (𝑘) ( 𝑗)   = 𝑛1 Í Î Î 7: 𝑊 𝑗,𝑡 (X 𝑐=1 𝑖=1 𝑊 𝑗𝑗,𝑡−1 · 𝐵 𝑗𝑗,𝑡 ) ¯ ¯ 8: 𝑈 , [], [] = 𝑆𝑉 𝐷 (𝑊 ⊲ SVD ( 𝑗) 9: 𝑈ˆ 𝑡 = 𝑈 [:, 1 : 𝑃 𝑗 ] > ⊲ Takes the first 𝑃 𝑗 colums and transpose Í𝑑 ( 𝑗) ( 𝑗)> 10: Check 𝑒𝑟𝑟 (𝑡) = ||𝑈ˆ 𝑡 · 𝑈ˆ 𝑡 − 𝐼 || Fro 6 𝜂 𝑗=1 11: if 𝑒𝑟𝑟 (𝑡) 6 𝜂 then 12: Stop Iteration ⊲ Quit loop 13: 𝑡 = 𝑡 + 1 ( 𝑗) 14: Return 𝑈ˆ , 𝑗 = 1, ..., 𝑑 The last step in the TDA classification is assigning class labels to new test points. For both DGTDA and CMDA, a tensor-based K-nearest neighbour method is adopted for this purpose. Recall the Frobenius norm for tensor in section 1.2, one can define distance between two tensors with Frobenius norm. For any two tensors X 1 and X 2 in the same dimension, the distance is defined as 𝑑𝑖𝑠(XX1 , X 2 ) = ||XX1 − X 2 || Fro . A K-Nearest Neighbour classifier can use distance and predict class labels for test point. We combine the subspace learning from DGTDA and CMDA algorithm with 26 this KNN classifier, and summarize the whole classification procedure for DGTDA and CMDA in algorithm 5 at the end of this part. Algorithm 5 Tensor Discriminant Analysis Classification 1: procedure TDA 2: Input: Training set 𝑇𝑛 = {X X 𝑐,𝑖 ; 𝑖 = 1, 2, ..., 𝑛𝑐 ; 𝑐 = 1, 2}, Labels {𝑦 1 , .., 𝑦 𝑛 } ∈ {0, 1}, Target dimension 𝑃1 , 𝑃2 ..., 𝑃 𝑑 , Test set {X X∗1 , ...X X∗𝑚 },𝜂, maxitor, k 3: if DGTDA  (𝑑) then(𝑑) 4: 𝑈 , ...,𝑈 𝑈 = DGTDA({X X 𝑐,𝑖 ; 𝑖 = 1, 2, ..., 𝑛𝑐 ; 𝑐 = 1, 2}, 𝑃1 , 𝑃2 ..., 𝑃 𝑑 ) 5: else 6: 𝑈 (𝑑) , ...,𝑈 𝑈 (𝑑) = CMDA({X X 𝑐,𝑖 ; 𝑖 = 1, 2, ..., 𝑛𝑐 ; 𝑐 = 1, 2}, 𝑃1 , 𝑃2 ..., 𝑃 𝑑 , 𝜂, maxitor) 7: for i = 1,..n do 8: Z 𝑐,𝑖 = X 𝑐,𝑖 ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) 9: for i = 1,.., m do 10: Z𝑖∗ = X𝑖∗ ×1 𝑈 (1) ×2 𝑈 (2) ... ×𝑑 𝑈 (𝑑) 11: 𝑑 = [𝑑𝑖𝑠(Z Z1 , Z𝑖∗ ), ..., 𝑑𝑖𝑠(Z Z𝑛 , Z𝑖∗ )] 12: 𝑑 0 = arg 𝑠𝑜𝑟𝑡 (𝑑) ⊲ Sort distance in increasing order, and return the index 𝑘 𝑦𝑖∗ = 1{ 1𝑘 𝑦[𝑑𝑑 0 [𝑙]] > 0} Í 13: 𝑙=1 14: Return ∗ 𝑦 1 , ..., 𝑦 ∗𝑚 2.2.3 Tensor Regression The last type of tensor classification model we want to review in this part is tensor regression model. Tensor regression is a collection of statistical models taking tensor-shape predictors. These models can predict tensor or scalar response from the tensor covariates. [155, 93] propose tensor generalized linear regression model by assuming CP and Tucker decomposition structures on regression coefficients. [62] introduces sparse penalty on tensor CP regression models to provide an efficient and scalable model for unit-rank tensor regression problems. More recently, tensor response regression [88, 152, 99], Bayesian tensor regression [57], and tensor regression with variation norm penalty [45] are also developed. In this part, we review the CP tensor generalized linear regression model (CP-GLM) [155] for tensor classification problems. CP-GLM assumes a regression model with tensor covariates. Let 𝑔 (·) be the link function, X ∈ R 𝐼1 ×..×𝐼 𝑑 be a d-way tensor, and B ∈ R 𝐼1 ×..×𝐼 𝑑 be the tensor coefficient. 
The CP-GLM with 27 scalar response is defined as 𝑔 (𝜇) = 𝛼 + 𝛾 >𝑧 + < B , X > (2.17) 𝛾 is a scalar, and 𝑧 ∈ R 𝑝 is a 𝑝 dimensional vector predictor. 𝛾 is a 𝑝 dimensional vector coefficient as well. If B has a low-rank CP decomposition, then we can use Kruskal tensor to denote it as 𝔘B = È𝐵 𝐵 (1) , ..., 𝐵 (𝑑) É. Each 𝐵 ( 𝑗) , 𝑗 = 1, .., 𝑑 is a 𝐼 𝑗 by 𝑟 matrix whose columns are CP factors of tensor B . The rank ofr B is assumed to be 𝑟. With Khatri-rao product, the inner product between tensor coefficient and predictor can be written as < 𝔘B , X > =< 𝐵 ( 𝑗) 𝐵 (𝑑) 𝐵 ( 𝑗+1) 𝐵 ( 𝑗−1) ..., 𝐵 (1) > , X >  ..𝐵 (2.18) 𝐵 ( 𝑗) , X 𝐵 (𝑑) 𝐵 ( 𝑗+1) 𝐵 ( 𝑗−1) ..., 𝐵 (1) >  =< ..𝐵 for 𝑗-th mode CP components. If we write the inner product into a vector form, then 𝐵 ( 𝑗) can be estimated with regular maximum likelihood estimate (MLE) method by fixing 𝛼, 𝛾 , and all other 𝐵 (𝑘) , 𝑘 ≠ 𝑗. A iterative MLE can be adopted to estimate all the CP compoent matrices as well as the regular scalar and vector coefficients in model (2.17). If we denote the likelihood function as ℓ (𝛼, 𝛾 , 𝔘B ), which can be derived in the exactly same ways as ordinary GLM, the iterative MLE algorithm can be summarized in the algorithm 6. In tensor binary classification problems, we can Algorithm 6 Tensor CP Generalized Linear Model 1: procedure TDA 2: Input: {X X1 , ...X X𝑛 }, {𝑧𝑧 1 , ...𝑧𝑧 𝑛 }, 𝑦 , 𝜂, maxitor 3: Initialize 𝔘B ,0 = È00, ..00É ⊲ Initialize Kruskal tensor as 0 matrices 4: Initialize (𝛼0 , 𝑧 0 ) = arg max ℓ (𝛼, 𝛾 , 𝔘B ,0 ) 5: while t < maxitor do 6: for j = 1, ..., d do ( 𝑗) (1) ( 𝑗−1) ( 𝑗) ( 𝑗+1) (𝑑) 7: 𝐵 𝑡+1 = arg max ℓ (𝛼𝑡 , 𝛾 𝑡 , È𝐵 𝐵 𝑡+1 𝐵 𝑡+1 ....𝐵 , 𝐵 , 𝐵𝑡 ..., 𝐵 𝑡 É) (1) ( 𝑗−1) ( 𝑗) ( 𝑗+1) (𝑑) 8: (𝛼𝑡+1 , 𝑧 𝑡+1 ) = arg max ℓ (𝛼, 𝛾 , È𝐵 𝐵 𝑡+1 ....𝐵𝐵 𝑡+1 , 𝐵 𝑡+1 , 𝐵 𝑡 ..., 𝐵 𝑡+1 É) 9: if ℓ𝑡+1 − ℓ𝑡 6 𝜂 then 10: Stop 11: 𝑡 =𝑡+1 ( 𝑗) ( 𝑗+1) 12: Output: 𝛼, 𝛾 , È𝐵 𝐵 (1) ....𝐵 𝐵 ( 𝑗−1) , 𝐵 𝑡+1 𝐵𝑡 ..., 𝐵 (𝑑) É take the link function 𝑔 as the logit function, and predict the probability that P(𝑦 = 1|X X). The class labels are predicted by doing a threshold on the predicted probability. 28 2.3 Statistical Analysis As we mentioned in section 1.3, statistical consistency is one of the most important properties for decision rules as it demonstrates generalization ability of rules from the aspect of prediction risk. The consistency of tensor discriminant analysis can be easy established since DGTDA and CMDA are both converging ([92]) algorithms providing unique tensor-to-tensor projections. In addition, tensor K-nearest neighbour classifier is equivalent to vector K-nearest neighbour, which is consistent ([36]). These two facts make the tensor discriminant analysis a consistent classifier. Tensor CP-GLM models the conditional probabilities for data labels through regression model. Its consistency is guaranteed by the (strong) consistency of regression coefficients ([155]). In this section, we provide few theoretical helping to establish the statistical consistency for CP-STM. The section contains two parts for theory development. One is the universal property establishment for CP-STM kernel functions, the other is consistency proof for CP-STM. 2.3.1 Universal Tensor Kernels Our first result is about the universal property of tensor kernel functions. Kernel Universal property plays a very important role in kernel learning methods such as support vector machine and kernel regression [107]. 
Since kernel-based learning methods always estimate the optimal solution from a Reproducing Kernel Hilbert Space (RKHS), the error of approximating the complete functional space $\{f:\mathcal{X}\to\mathcal{Y},\ f\ \text{measurable}\}$ with the RKHS is critical to the generalization ability of the learned rules. In other words, the larger the approximation error of the RKHS, the more biased the predictions from kernel-based methods will be. A kernel with the universal property guarantees that the approximation error of the RKHS can be made as small as desired on any compact subspace of the input space. To present our result about the universal property for tensor kernels, we first provide a formal definition of the universal property.

Definition 2.3.1. Let $K(\cdot,\cdot)$ be a continuous kernel function defined on $\mathcal{X}\times\mathcal{X}\to\mathbb{R}$. Given a compact subspace $\mathcal{Z}\subset\mathcal{X}$ of the input space, the kernel section of $K(\cdot,\cdot)$ over $\mathcal{Z}$ is $\mathcal{K}(\mathcal{Z}) := \mathrm{span}\{K_x,\ x\in\mathcal{Z}\}$, the RKHS generated by the kernel $K(\cdot,\cdot)$ and the subspace $\mathcal{Z}$. If for every continuous function $f:\mathcal{Z}\to\mathbb{R}$ and every $\epsilon > 0$ there is a function $g\in\mathcal{K}(\mathcal{Z})$ such that
$\|f - g\|_\infty = \sup_{x\in\mathcal{Z}}|f(x) - g(x)| \le \epsilon,$
then $K(\cdot,\cdot)$ has the universal approximating property and is called a universal kernel.

With universal kernels, we immediately see that the optimal function estimated from the RKHS is also optimal among all measurable functions. This is critical in the establishment of CP-STM consistency. Thus, our first result shows that the kernel function in CP-STM can be universal.

Proposition 2.3.1. Consider the d-way CP tensor kernel function
$K(\mathcal{X}_1, \mathcal{X}_2) = \sum_{l,k=1}^{r}\prod_{j=1}^{d} K^{(j)}\big(x_{1,l}^{(j)}, x_{2,k}^{(j)}\big)$
with $\mathcal{X}_1 = \sum_{k=1}^{r} x_{1,k}^{(1)}\circ x_{1,k}^{(2)}\circ\ldots\circ x_{1,k}^{(d)}$ and $\mathcal{X}_2 = \sum_{k=1}^{r} x_{2,k}^{(1)}\circ x_{2,k}^{(2)}\circ\ldots\circ x_{2,k}^{(d)}$. If all mode-wise kernel functions $K^{(j)}(\cdot,\cdot)$ satisfy the universal approximating property in Definition 2.3.1, then $K$ also satisfies the universal approximating property, in the sense that for every continuous function $f$ defined on the compact tensor product space $\mathcal{X} = \otimes_{j=1}^{d}\mathcal{V}^{(j)}$ and every $\epsilon > 0$, there exists a function $g\in\mathcal{K}(\mathcal{X})$ in the tensor kernel section such that
$\|f - g\|_\infty = \sup_{\mathcal{X}\in\mathcal{X}}|f(\mathcal{X}) - g(\mathcal{X})| \le \epsilon.$

The proof of this proposition is provided in appendix A.1. Notice that since distance-based kernel functions such as the Gaussian RBF and polynomial kernels are universal ([107]), we can use either of them, or both, for the $K^{(j)}(\cdot,\cdot)$ to create universal tensor kernel functions.

2.3.2 Consistency of CP-STM

With universal tensor kernels, the classification consistency of CP-STM is established in the following theorem. The notation for classification risks is borrowed from section 1.3.

Theorem 2.3.1. Let $\{f_n : n\in\mathbb{N}\}$ be a sequence of CP-STM classifiers in equation (2.4), estimated from training sets $T_n$ of size $n$. Let $\mathcal{R}^*$ be the Bayes risk of the tensor binary classification problem for data from the joint distribution on $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}$ is a d-way tensor product space with dimension $I_1\times I_2\times\ldots\times I_d$ defined in Definition 1.2.1, and $\mathcal{Y} = \{-1, 1\}$. The CP-STM decision rule $\{f_n : n\in\mathbb{N}\}$ is statistically consistent, and the classification risk of $f_n$, $\mathcal{R}(f_n)$, converges to the optimal Bayes risk,
$\mathcal{R}(f_n)\to\mathcal{R}^* \quad (n\to\infty),$
if the following conditions are satisfied:

Con.1 $\mathcal{X}$ is a compact subspace of $\mathbb{R}^{I_1\times I_2\times\ldots\times I_d}$ such that there is a constant $0 < B_x < \infty$ with $\|x_k^{(j)}\|_2 \le B_x < \infty$ for all $\mathcal{X}\in\mathcal{X}$ with $\mathcal{X} = \sum_{k=1}^{r} x_k^{(1)}\circ x_k^{(2)}\circ\ldots\circ x_k^{(d)}$.
Con.2 The loss function $L$ is self-calibrated (see [128]) and is $C(W)$ locally Lipschitz continuous, in the sense that for $|a| \le W < \infty$ and $|b| \le W < \infty$,
$|L(a, y) - L(b, y)| \le C(W)|a - b|.$
In addition, the loss function is bounded in its second argument, i.e. $\sup_{y\in\{1,-1\}} L(0, y) \le L_0 < \infty$.

Con.3 The kernel functions $K^{(j)}(\cdot,\cdot)$ used to compose the tensor kernel (2.2) are regular vector-based kernels satisfying the universal approximating property in Definition 2.3.1. Additionally, they are all bounded, so that there is a constant $0 < K_{max} < \infty$ with $\sup\sqrt{K^{(j)}(\cdot,\cdot)} \le K_{max}$ for all $j = 1, \ldots, d$.

Con.4 The hyper-parameter $\lambda = \lambda_n$ in the objective function (2.1) satisfies $\lambda_n\to 0$ and $n\lambda_n\to\infty$ as $n\to\infty$.

The proof of this theorem is provided in appendix A.2. At the end of this section, we conclude that all the tensor-based classifiers reviewed in this chapter are statistically consistent, meaning that they all have promising prediction accuracy when certain conditions are satisfied. However, their performance in practice can vary, since their excess risks may converge at different rates depending on the data distribution in specific applications. In the next section, we compare the performance of these classification methods in two real data applications.

2.4 Real Data Analysis

In this section, we provide two examples from neuroimaging and computer vision studies, and apply all the reviewed tensor-based classifiers to image classification problems.

2.4.1 MRI Classification for Alzheimer's Disease

Alzheimer's Disease (AD) is a progressive, irreversible loss of brain function that impacts memory, thinking, language, judgment, and behavior. The disease destroys patients' memory and thinking ability, and eventually makes it difficult for patients to carry out even the simplest tasks of daily living. In AD research, many novel technologies have been developed to collect diagnostic information from patients, including genetic analysis, biological and neurological tests, Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. Utilizing this information to predict patients' biological status is of great interest for early detection and biomarker development for Alzheimer's Disease. In this study, we consider using voxel-level MRI data to predict whether a patient has early AD or is a Normal Control (NC).

Brain MRI technology collects 3D images showing the anatomical structure of the brain. The data are measured in voxels, which are similar to the pixels used to display regular images; voxels, however, are 3D cuboids with specific dimensions. Each voxel contains one value standing for the average signal measured at a fixed position. A standard MRI image, also called a volume, is arranged in a 3D array that reflects brain structure by placing all voxels at their corresponding positions. Thus, voxel-level MRI images are multi-dimensional arrays loaded from Neuroimaging Informatics Technology Initiative (NIfTI) files (see [85]), and are tensors in nature. We can therefore apply the tensor-based classifiers directly to voxel-level MRI data. We collect data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), a large longitudinal study on Alzheimer's Disease.
In ADNI, the structural brain MRI is believed to be highly correlated with patients' cognition, and thus can be utilized to predict patient status, AD vs. NC. The MRI data are collected from the screening session of ADNI-1. During that session, 818 patients were selected to enter the study and received 1.5T MRI scans. These images are pre-processed by ADNI with normalization and bias correction, and are provided in the ADNI-1.5T Screening standardized data set. The data set includes both the MRI scans and the patients' dementia status, labeled as Normal Control (NC), Mild Cognitive Impairment (MCI), and Alzheimer's Disease (AD). In this study, we are particularly interested in predicting whether a patient has AD or is NC; thus, we only collect image data from NC and AD patients. The biological information about the AD group and the NC group is provided in Table 2.1. MMSE stands for the Mini-Mental State Examination, a test of cognitive function for patients with dementia; the MMSE row in Table 2.1 reports the subjects' baseline MMSE scores.

Table 2.1: Biological Information for Subjects in the ADNI Study; MMSE: baseline Mini-Mental State Examination
                          AD              NC
Num Subjects              183             219
Age (Mean ± sd)           75.28 ± 7.55    75.80 ± 4.98
Gender (Female / Male)    88 / 95         110 / 109
MMSE (Mean ± sd)          23.28 ± 2.04    29.11 ± 1.00

There are 402 MRI images in the data collection. We use the Matlab Image Processing Toolbox to register and align all the images for classification and comparison. We also use the image resize function in the toolbox to unify the voxel dimensions in all MRI images to 6mm by 6mm by 9mm. After resizing, all images are of shape 40 by 40 by 21. This step is necessary since ADNI MRI images are acquired from multiple sites; resizing guarantees that all the images can be represented by tensors of the same size. The choice of image size follows other similar statistical analyses such as [155, 114, 45].

We conduct our numerical experiment under the same protocol as the simulation study. We randomly sample 80% of the images from the AD group and 80% from the NC group to form a training set of size 321. AD is labeled as the positive class, and NC as the negative class. The remaining images are used as the test set to evaluate model performance. For each classification model, we evaluate its performance by calculating its accuracy, precision, sensitivity, and specificity on the test set. This procedure is replicated multiple times, and the average accuracy, precision, sensitivity, and specificity are reported in Table 2.2, with their standard deviations given in parentheses. In Table 2.2, CP-STM1 denotes CP-STM with the Hinge loss, and CP-STM2 denotes CP-STM with the squared Hinge loss. The areas under the ROC curves (AUC) are reported in Table 2.2 as well. Figure 2.1 summarizes the comparison in a more illustrative way.

Table 2.2: Real Data: ADNI Classification Comparison I
Models    Accuracy      Precision     Sensitivity   Specificity   AUC
CP-GLM    0.58 (0.04)   0.58 (0.07)   1.00 (0.00)   0.00 (0.00)   0.50 (0.00)
CMDA      0.70 (0.03)   0.69 (0.05)   0.67 (0.09)   0.73 (0.10)   0.65 (0.17)
DGTDA     0.70 (0.02)   0.71 (0.02)   0.59 (0.06)   0.80 (0.01)   0.64 (0.18)
CP-STM1   0.73 (0.03)   1.00 (0.00)   0.41 (0.07)   1.00 (0.00)   0.64 (0.20)
CP-STM2   0.74 (0.04)   1.00 (0.00)   0.43 (0.08)   1.00 (0.00)   0.65 (0.20)

The results in Table 2.2 show that all the tensor-based classifiers have similar prediction accuracy, with CP-STM slightly better than the others. These methods also have fairly low sensitivity, indicating that the chances of correctly detecting AD patients are low. The performance of CP-GLM is much worse than the others, which may be due to the fact that its parametric form and data distribution assumptions are not appropriate for this real data.
Comparing the two CP-STMs, we notice that the one with the squared Hinge loss is better, which may be due to the fact that the squared Hinge loss puts a bigger penalty on points that strongly violate the margin during the training procedure.

Figure 2.1: Real Data: ADNI Classification Results I (bar chart of classification accuracy for CMDA, CP-GLM, CP-STM1, CP-STM2, and DGTDA).

2.4.2 KITTI Traffic Images

The second application is traffic image recognition, an important computer vision problem. We consider the image data from the KITTI Vision Benchmark Suite; [53], [49], and [52] provide a detailed description and some preliminary studies of the data set. In this application, a 2D object detection task asks us to recognize different objects, pointed out by bounding boxes, in pictures captured by a street-facing camera. There are various types of objects in the pictures, most of which are pedestrians and cars. We select images containing only pedestrians or cars to test the performance of the classifiers.

The first step before training the classifiers is image pre-processing, which includes cropping the images and dividing them into different categories. We pick the patterns indicated by the bounding boxes and smooth them into a uniform dimension of 224 × 224 × 4. We then transform all these colored images into grey-scale, dropping the color information in order to avoid potential problems caused by the extreme dimension imbalance among the three different modes. The processed images are of size 224 by 224 and can be modeled by two-mode tensors. Figure 2.2 shows a few examples of processed images of cars and pedestrians.

Figure 2.2: Real Data: Examples of Traffic Objects in KITTI Data (grey-scale 224 × 224 crops of cars and pedestrians).

The total number of images is 33229, among which 4487 are car images and 28742 are pedestrian images. Next, we divide these images into three groups based on their quality and visibility. Images that are more than 40 pixels in height and fully visible go to the easy group. Partly visible images of 25 pixels or more in height are in the moderate group. Images that are difficult to see with bare eyes go to the hard group. These three groups of images define three classification tasks with difficulty levels easy, moderate, and hard. To overcome the class imbalance in all three groups, we randomly select 200 car images and 200 pedestrian images to form a balanced data set in each group for our numerical experiments. Pedestrian images are considered the positive class, and car images the negative class.

The following procedure is repeated 50 times in all three tasks with the balanced data sets. We randomly sample 80% of the images as the training and validation set. Classification models are estimated and tuning parameters (if any) are selected using this part of the data. The models with selected tuning parameters are then applied to the remaining 20% of the data for testing. The sampling is conducted in a stratified way, so that the proportions of pedestrian and car images are approximately the same in the training and testing sets.
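A minimal sketch of this evaluation protocol is given below. It is only an illustration, not the code used for the experiments; it assumes scikit-learn's train_test_split for the stratified split and computes the reported metrics from the standard confusion-matrix definitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_split(X, y, test_size=0.2, seed=0):
    """One stratified 80/20 split: class proportions are preserved in both parts."""
    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=seed)

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, sensitivity (recall), and specificity from the confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp > 0 else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn > 0 else 0.0,
        "specificity": tn / (tn + fp) if tn + fp > 0 else 0.0,
    }
```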
For each repetition, we calculate the same performance metrics, namely accuracy, precision (positive predictive rate), sensitivity (true positive rate), and specificity (true negative rate), for each classification method using the testing set. The average values of these rates and their standard deviations (in parentheses) are reported in Table 2.3. The areas under the ROC curves (AUC) are also reported for all methods. All the classifiers reviewed in section 2.2 are included and are denoted with the same notation as in the ADNI application. The comparison of prediction accuracy is also illustrated in Figure 2.3, in which accuracy rates are shown as bar charts with standard deviations as error bars.

Table 2.3: Real Data: Traffic Image Classification I
Task      Methods   Accuracy      Precision     Sensitivity   Specificity   AUC
Easy      CP-STM1   0.85 (0.03)   0.84 (0.05)   0.85 (0.05)   0.84 (0.06)   0.85 (0.03)
          CP-STM2   0.83 (0.04)   0.83 (0.05)   0.83 (0.05)   0.83 (0.06)   0.83 (0.04)
          CMDA      0.63 (0.07)   0.58 (0.06)   0.95 (0.08)   0.30 (0.16)   0.63 (0.07)
          DGTDA     0.84 (0.04)   0.77 (0.04)   0.96 (0.03)   0.72 (0.07)   0.84 (0.04)
          CP-GLM    0.57 (0.05)   0.57 (0.06)   0.59 (0.07)   0.55 (0.09)   0.57 (0.05)
Moderate  CP-STM1   0.78 (0.05)   0.78 (0.06)   0.77 (0.07)   0.78 (0.07)   0.78 (0.05)
          CP-STM2   0.73 (0.06)   0.75 (0.07)   0.72 (0.08)   0.75 (0.09)   0.73 (0.06)
          CMDA      0.59 (0.05)   0.55 (0.04)   0.89 (0.11)   0.28 (0.13)   0.59 (0.05)
          DGTDA     0.74 (0.06)   0.72 (0.06)   0.79 (0.08)   0.69 (0.08)   0.74 (0.05)
          CP-GLM    0.53 (0.05)   0.53 (0.05)   0.54 (0.07)   0.52 (0.08)   0.53 (0.05)
Hard      CP-STM1   0.76 (0.04)   0.84 (0.06)   0.64 (0.07)   0.87 (0.05)   0.76 (0.04)
          CP-STM2   0.74 (0.04)   0.80 (0.06)   0.63 (0.07)   0.84 (0.06)   0.74 (0.04)
          CMDA      0.53 (0.04)   0.52 (0.02)   0.91 (0.09)   0.16 (0.12)   0.53 (0.04)
          DGTDA     0.72 (0.05)   0.68 (0.04)   0.84 (0.06)   0.60 (0.08)   0.72 (0.05)
          CP-GLM    0.51 (0.06)   0.51 (0.06)   0.54 (0.07)   0.49 (0.07)   0.51 (0.06)

Comparing the prediction accuracy rates and AUC values, the CP-STM models have a significant advantage in classification performance over the other tensor-based classification models. Different from the previous study, CP-STM with the Hinge loss outperforms CP-STM with the squared Hinge loss. The reason might be that there are more tensors sitting close to the margin and weakly violating it, i.e. $y_i f_n(\mathcal{X}_i) < 1$ with $y_i f_n(\mathcal{X}_i) \approx 1$. As a result, the Hinge loss penalizes these points more than the squared Hinge loss in the estimation procedure, providing a better decision function. Together with our previous results in the ADNI study, we can conclude that the relative performance of CP-STM with the Hinge and squared Hinge losses often depends on the data distribution. The two tensor discriminant analyses perform differently in this application. DGTDA outperforms CMDA with much higher accuracy rates in all three tasks; in particular, the accuracy rate of DGTDA is only 1% less than that of CP-STM1 in the easy task. CMDA, in contrast, fails to identify a discriminative tensor-to-tensor projection in this application. The performance of CP-GLM is not as good as the others, which is similar to the results in the ADNI study.

Figure 2.3: Real Data: Traffic Classification Result I (bar chart of classification accuracy by method for the easy, moderate, and hard tasks).

2.5 Conclusion

In this chapter, we explore the possibility of using tensors to model multi-dimensional and structured data, and review several tensor-based classification models. These models utilize tensor algebraic structures and extend traditional classification methods to tensor data. CP-STM and CP-GLM are extensions of the Support Vector Machine and Generalized Linear Regression that use the tensor CP decomposition.
DGTDA and CMDA are generalizations of Fisher discriminant analysis for tensors using tensor subspace learning; such subspace learning is in fact a variant of the tensor Tucker decomposition. These models can be attractive when handling multidimensional data with various structures.

As part of our contribution, we develop the statistical consistency result for CP-STM. All of these tensor-based methods can then be regarded as consistent decision rules with good generalization ability. Our data experiments also provide empirical evidence on model performance. In our numerical study, CP-STM shows the best prediction accuracy in both applications regardless of the data distribution; however, the comparison between CP-STM with the Hinge and squared Hinge losses often depends on the data distribution. Compared with CP-STM, CP-GLM is somewhat restrictive due to its parametric form, and it sometimes fails to approximate the true data distribution. DGTDA and CMDA are both non-parametric and are flexible enough to classify different types of tensors; however, CMDA sometimes fails to converge, in which case its results deteriorate.

Inspired by these existing works, one possible direction for future work is to combine more advanced tensor representations and operations with traditional non-parametric classification models to obtain novel tensor classification methods. For example, various coupled matrix-tensor decomposition methods have been established in recent research for multimodal data integration and heterogeneous data analysis. Those decomposition methods can be adopted to extend CP-STM to multimodal data classification problems. Besides that, tensor compression via random projection or sketching is popular in multidimensional big data analysis, as it provides efficient and scalable ways to process tensors of huge size. Combining these tensor compression methods with CP-STM has great potential for big tensor data classification.

CHAPTER 3
TEC: TENSOR ENSEMBLE CLASSIFIER FOR BIG DATA

In this chapter, we consider classification problems for gigantic multi-dimensional data. Although the tensor-based classification methods mentioned in the previous chapter can analyze multi-dimensional data and preserve data structures, they may face additional challenges, such as long processing time and insufficient computer memory, when dealing with big tensor data. Previously we demonstrated the distribution-free and statistically consistent properties of the CP-STM model, and highlighted its great potential for handling a wide variety of data applications. However, training a CP-STM can be computationally expensive with high-dimensional tensors. To make it feasible for CP-STM to handle large tensors, we introduce a tensor-shaped random projection technique and combine it with CP-STM to reduce the computational time and cost for large tensors. The CP-STM estimated with randomly projected tensors is named the Random Projection-based Support Tensor Machine (RPSTM). We further develop a Tensor Ensemble Classifier (TEC) by aggregating multiple RPSTMs to control the excess classification risk brought by random projections. We demonstrate that TEC balances the computational cost against the excess classification risk, and it provides decent performance in our numerical studies.

3.1 Introduction

With the advancement of information and engineering technology, modern-day data science problems often come with data of gigantic size and increased complexities.
These complexities are often reflected by huge dimensionality and multi-way features in the observed data such as high-resolution brain imaging and spatio-temporal data. Classification problems with such high- dimensional multi-way data raise more challenges to scientists on how to process the gigantic size data while preserving their structures. Even though there are many established studies in the literature that handle the multi-way data structures with tensor representation and tensor-based models [136, 103, 63, 92, 127, 155, 114], the high dimensionality issue for tensors are rarely 40 explored especially for classification problems. High-dimensional tensors, though are already in multi-dimensional structure, can still have huge dimensionalities in different modes. The existences of high-dimensional tensors could make the current tensor-based classification models fail to provide reliable results due to extremely long processing time and huge computational cost. Current tensor-based classification approaches for high-dimensional data are mostly adopting regularization or feature extraction steps into the models. For example, [114] proposes a objective function with Lasso penalty to learn the mode-wise precision matrices in the tensor probabilistic discriminant analysis instead of estimating them directly by taking the inverse from empirical covariance matrices, which mitigates the inconsistency in estimate due to high dimensionality. Other methods such as [103, 92] utilize various higher-order principle component analysis to extract features and reduce the data complexity before applying tensor-based K-nearest neighbour classifiers for classification. However, these techniques have several deficiencies. First of all, the regularization-based methods still face the challenge of huge computational cost. Even though they can provide more consistent and robust model estimate by doing a trade-off between variance and bias, the estimation procedures are still the same as that without regularization. For instance, The estimates for ℓ1 or nuclear norm regularized models are often calculated by doing soft-thresholding on the original estimates. Thus, the computational cost for the procedure remains the same, and could be too huge to be carried out for high-dimensional tensors. Secondly, the feature extraction- based methods integrate unsupervised learning procedures to extract the feature, making it difficult to evaluate the classification consistency for the models. Finally, there is a lack of theoretical results in the current approaches that quantifies the excess risks caused by adding regularization terms and feature extracting procedures, making it difficult to depict the trade-off between the classification accuracy and the computational cost. Novel techniques and statistical frameworks are thus desired to not only optimize the computational procedures, but also integrate the statistical analysis for high-dimensional tensor classification problems. Random projection, comparing to the aforementioned techniques, turns out to be a perfect candi- date for simplifying computational complexity in high-dimensional tensor classification problems, 41 since it is easy to apply and can provide straight forward steps for statistical analysis. It projects data into lower dimensional space with randomly generated transformations to reduce data dimension, and is motivated by the well celebrated Johnson Lindenstrauss Lemma (see e.g. [34]). 
The lemma 8 log 𝑛 says that for arbitrary 𝑘 > , 𝜖 ∈ (0, 1), there is a linear transformation 𝑓 : R 𝑝 → R 𝑘 such 𝜖2 that for any two vectors 𝑥 𝑖 , 𝑥 𝑗 ∈ R 𝑝 , 𝑝 > 𝑘: (1 − 𝜖)||𝑥𝑥 𝑖 − 𝑥 𝑗 || 2 6 || 𝑓 (𝑥𝑥 𝑖 ) − 𝑓 (𝑥𝑥 𝑗 )|| 2 6 (1 + 𝜖)||𝑥𝑥 𝑖 − 𝑥 𝑗 || 2 with large probability for all 𝑖, 𝑗 = 1, ..., 𝑛. The linear transformation 𝑓 preserves the pairwise Euclidean distances between these points. The random projection has been proven to be a decent dimension reduction technique in machine learning literature [19, 46]. A lot of theoretical results on classification consistency are also established for random projection. [41] presents a Vapnik- Chervonenkis type bounds on the generalization error of a linear classifier trained on a single random projection. [29] provides a convergence rate for the classification error of the support vector machine trained with a single random projection. [24] proves that random projection ensemble classifier can reduce the generalization error further. Their results hold several types of basic classifiers such as nearest neighboring and linear/quadratic discriminant analysis. In addition to the computational efficiency and statistical consistency, [133, 71, 119] demonstrate that random projections for tensors can cost low memory, suggesting that the techniques are memory efficient. In this work, we propose a computationally efficient and statistically consistent Tensor Ensemble Classifier (TEC) which aggregates multiple CP-Support Tensor Machines (CP-STM). The new STM distinguishes from the existing works [63, 64] by combining the estimation procedure with a newly proposed tensor-shaped random projection to reduce the size of tensors and simplify the computation, making it extremely useful for high-dimensional tensors. The new STM is named as Random Projection-base Support Tensor Machine (RPSTM). Mutiple RPSTMs are then aggregated to form an ensemble classifier, TEC, to mitigate the potential information loss and reduce the extra classification risk brought by random projections. This idea is motivated by [24] and the well known Random Forest model [19]. Similar to these methods, TEC aggregates base classifiers which are estimated from randomly sampled or projected features, and makes predictions for new 42 test points by majority votes. Results from [60, 101] show that such an aggregation of decision can be a very effective tool for improving unstable classifiers like RPSTM. Our contribution: Our work alleviates the limitations of existing tensor approaches in handling big data classification problems. Specifically, the contribution of this work is threefold. 1. We successfully adopt the well known random-projection technique into high dimensional tensor classification applications and provide an ensemble classifier that can handle extremely big-sized tensor data. The adoption of random projection is shown to be a low-memory cost operation, and makes it feasible to directly classify big tensor data on regular machines efficiently. We further aggregate multiple RPSTM to form our TEC classifier, which can be statistically consistent while remaining computationally efficient. Since the aggregated base classifiers are independent of each other, the model learning procedure can be accelerated in a parallel computing platform. 2. Some theoretical results are established in order to validate the prediction consistency of our classification model. 
Unlike [29] and [24], we adopt the Johnson-Lindenstrauss lemma further for tensor data and show that the CP-STM can be estimated with randomly projected tensors. The average classification risk of the estimated model converges to the optimal Bayes risk under some specific conditions. Thus, the ensemble of multiple RPSTMs can have robust parameter estimation and provide strongly consistent label predictions. The results also highlight the trade-off between classification risk and dimension reduction created by random projections. As a result, one can take a balance between the computational cost and prediction accuracy in practice. 3. We provide an extensive numerical study with synthetic and real tensor data to reveal our ensemble classifier’s decent performance. It performs better than the traditional methods such as linear discriminant analysis and random forest, and other tensor-based methods in applications like brain MRI classification and traffic image recognition. It can also handle large tensors generated from tensor CANDECOMP/PARAFAC (CP) models, which are 43 widely applied in spatial-temporal data analysis. Besides, the computational cost is much lower for the TEC comparing with the existing methods. All these results indicate a great potential for the proposed TEC in big data and multi-modal data applications. The contents in this chapter are organized as follow: Section 3.2 reviews the basic concepts about CP-STM classification problem and tensor random projection. Section 3.3 describes our TEC classifier for high-dimensional tensor data, which includes an introduction to our proposed tensor-shaped random projection, the RPSTM, and the ensemble classifier TEC. We provide two different estimation methods for TEC model in section 3.4. In section 3.5, we establish the statistical consistency for the TEC classifier, and provide an explicit upper bound on the excess classification brought by random projection. Simulation studies and real data experiments are in section 3.7. Section 3.8 concludes the work in this chapter. 3.2 Related Works We briefly review the CP-STM for tensor classification problems and some related works about random projection. 3.2.1 CP-STM for Tensor Classification Assume there is a collection of data 𝑇𝑛 = {(X X1 , 𝑦 1 ), (X X2 , 𝑦 2 ), ..., (XX𝑛 , 𝑦 𝑛 )}, where X𝑖 ∈ X ⊂ R 𝐼1 ×𝐼2 ×...×𝐼 𝑑 are d-way tensors. X is a compact tensor space, which is a subspace of R 𝐼1 ×𝐼2 ×...×𝐼 𝑑 . 𝑦𝑖 ∈ {1, −1} are binary labels. CP-STM assumes the tensor predictors are in CP representation, and can be classified by the function which minimizes the objective function 𝑛 1Õ min 𝜆|| 𝑓 || 2 + L ( 𝑓 (X X𝑖 ), 𝑦𝑖 ) (3.1) 𝑛 𝑖=1 X) 2 𝑑X X ∫ L is a loss function for classification, and 𝜆 is a tuning parameter. || 𝑓 || 2 =< 𝑓 , 𝑓 >= 𝑓 (X is the square of functional norm for 𝑓 . By using tensor kernel function 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) Õ X1 , X 2 ) = 𝐾 (X 𝐾 ( 𝑗) (𝑥𝑥 1,𝑙 , 𝑥 2,𝑚 ) (3.2) 𝑙,𝑚=1 𝑗=1 44 𝑟 (1) (𝑑) 𝑟 (1) (𝑑) where X1 = and X2 = Í Í 𝑥 1𝑙 ◦ .. ◦ 𝑥 1𝑙 𝑥 2𝑙 ◦ .. ◦ 𝑥 2𝑙 are two different tensors. The STM 𝑙=1 𝑙=1 classifier can be written as Õ𝑛 X) = 𝑓 (X 𝛼𝑖 𝑦𝑖 𝐾 (XX𝑖 , X) = 𝛼𝑇 𝐷 𝑦 𝐾 (X X) (3.3) 𝑖=1 where X is a new d-way rank-r tensor with shape 𝐼1 × 𝐼2 × ... × 𝐼 𝑑 . 𝛼 = [𝛼1 , ..., 𝛼𝑛 ] 𝑇 are the coefficients. 𝐷 𝑦 is a diagonal matrix whose diagonal elements are 𝑦 1 , .., 𝑦 𝑛 . 𝐾 (X X) = X1 , X ), ..., 𝐾 (X [𝐾 (X X𝑛 , X )] 𝑇 is a column vector, whose values are kernel values between train- ing data and the new test data. 
We denote the collection of functions in the form of (3.3) with H , which is a functional space also known as Reproducing Kernel Hilbert Space (RKHS). The optimal classifier CP-STM 𝑓 ∈ H can be estimated by plugging function (3.3) into objective function (3.1) and minimize it with Hinge loss and Squared Hinge loss. These steps are reviewed in section 2.2.1, algorithm 1 and 2. The coefficient vector of the optimal CP-STM model is denoted by 𝛼 ∗ . The classification model is statistically consistent if the tensor kernel function satisfying the universal approximating property, as we introduced in the section 2.3. However, one potential issue of CP-STM for big tensor classification problems is the curse of dimensionality. Even a high-dimensional tensor is decomposed into its CP representation, 𝑟 (1) (𝑑) ( 𝑗) X= Í 𝑥 1𝑙 ◦ .. ◦ 𝑥 1𝑙 , 𝑥 𝑙 can still be in high-dimensional form, making it expensive to calculate 𝑙=1 the value of kernel functions. For example, Gaussian RBF kernel computes the ℓ2 norm of the difference between two input tensor CP factors. Such computation can be intensive if the inputs are in high-dimensional. Thus, we propose the RPSTM to simplify the computation with random projection, and avoid the potential issues. 3.2.2 Random Projection As a dimension reduction technique, the traditional random projection transforms vector data into a lower dimensional space via a linear transformation. The linear transformation is usually defined by a randomly generated projection matrix, 𝐴 , whose element 𝑎𝑖, 𝑗 are either from independently and identically Gaussian distribution N (0, 1) [34] or a multinomial distribution with three possible 45 outcomes [7]. The two types of random projection matrices are called Gaussian random matrix and sparse random matrix shown below.  √ 3 P = 16        𝐴 = [𝑎𝑖, 𝑗 ] ∼ N (0, 1) 𝐴 = [𝑎𝑖, 𝑗 ] = 0 P = 23    √  − 3 P = 16    Gaussian Random Matrix Sparse Random Matrix log 𝑛 P stands for probability. With either option of random projection matrices 𝐴 ∈ R 𝑘×𝑝 , 𝑘 > 𝑜( 2 ), 𝜖 𝜖 ∈ (0, 1), any two vectors 𝑥 𝑖 , 𝑥 𝑗 ∈ R 𝑝 , 𝑝 > 𝑘, the random projection satisfies (1 − 𝜖)||𝑥𝑥 𝑖 − 𝑥 𝑗 || 2 6 ||𝐴 𝐴𝑥 𝑖 − 𝐴𝑥 𝑗 || 2 6 (1 + 𝜖)||𝑥𝑥 𝑖 − 𝑥 𝑗 || 2 with large probability for all 𝑖, 𝑗 = 1, ..., 𝑛. This is called Johnson Lindenstrauss (JL) property for the random projection transformation. For higher-order tensor data, random projection is still defined as a mapping 𝑓 TRP : R 𝐼1 ...×𝐼 𝑑 → R𝑃 that transforming a high-dimensional tensor into a vector. In general, the function 𝑓 TRP is considered to be 𝑓 TRP =< A , X > (3.4) A is a projection tensor in the same size as X . To reduce the memory used for A and computa- tional cost, [133] proposes a memory efficient random projection for with the assumption that the projection tensor A is formed by Khatri-rao product product of random matrices. They shows the transformation has the JL property for 2-way tensors (matrices). [71] proposes another random projection satisfying the JL property for rank-one tensors with a projection tensor A defined by Kroncker product of random matrices. More related to our work, [119] develops a tensor random projection by assuming the projection tensor A is in either CP or tensor-train decomposition. The transformations are called CP and tensor-train random projection. Both projections are equipped with the JL property. 
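As a quick, self-contained illustration (not from the dissertation) of the two projection matrices defined above and of the JL property they provide, the following numpy snippet generates a Gaussian and a sparse random matrix and checks empirically that a pairwise distance is approximately preserved. The 1/√k normalization, which is implicit in the statements above, is made explicit here; all names are illustrative.

```python
import numpy as np

def gaussian_rp(k, p, rng):
    """Dense Gaussian random projection matrix with i.i.d. N(0, 1) entries."""
    return rng.standard_normal((k, p))

def sparse_rp(k, p, rng):
    """Sparse random projection matrix: entries +sqrt(3), 0, -sqrt(3)
    with probabilities 1/6, 2/3, 1/6."""
    return rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)],
                      size=(k, p), p=[1/6, 2/3, 1/6])

# Empirical check of the JL property: a pairwise distance is roughly preserved
# after projecting from dimension p to k and rescaling by 1/sqrt(k).
rng = np.random.default_rng(0)
n, p, k = 50, 5000, 400
X = rng.standard_normal((n, p))
for make in (gaussian_rp, sparse_rp):
    A = make(k, p, rng) / np.sqrt(k)
    Z = X @ A.T
    i, j = 3, 17
    ratio = np.linalg.norm(Z[i] - Z[j]) / np.linalg.norm(X[i] - X[j])
    print(make.__name__, round(ratio, 3))   # both ratios should be close to 1
```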
The CP random projection is a multi-linear map $f_{\text{TRP-CP}}:\mathbb{R}^{I_1\times I_2\times\ldots\times I_d}\to\mathbb{R}^P$ whose $p$-th output element is
$[f_{\text{TRP-CP}}(\mathcal{X})]_p = \frac{1}{\sqrt{P}}\langle\mathcal{A}_p, \mathcal{X}\rangle, \qquad \mathcal{A}_p = [\![A_p^{(1)}, A_p^{(2)}, \ldots, A_p^{(d)}]\!] \qquad (3.5)$
for $p = 1, 2, \ldots, P$. Here $[\![A_p^{(1)}, A_p^{(2)}, \ldots, A_p^{(d)}]\!]$ is the CP (Kruskal) form of the projection tensor $\mathcal{A}_p$, and $A_p^{(j)}\in\mathbb{R}^{I_j\times r_a}$, $j = 1, \ldots, d$, are Gaussian random matrices. $r_a$ is the CP rank of the random projection tensor $\mathcal{A}_p$, which is independent of the data tensor $\mathcal{X}$. This CP random projection can be applied efficiently when the input tensor $\mathcal{X}$ is also given in CP form. However, since the projection (3.5) transforms tensors into vectors, it may destroy the multi-way features of tensors. It also requires an element-wise transformation, which is an extra burden when the output dimension $P$ is large. We propose an alternative to the CP random projection and combine it with CP-STM for efficient computation.

3.3 Methodology

In this section, we present the methodology of our TEC classifier for high-dimensional tensors. We first introduce an alternative tensor-shaped random projection and combine it with CP-STM to construct the RPSTM classifier. The ensemble classifier TEC is then developed by aggregating multiple RPSTMs.

3.3.1 Tensor-Shaped Random Projection

We propose an alternative CP tensor-to-tensor random projection, using rank-1 projection tensors $\mathcal{A}$, that preserves the multi-way structure of tensors after the projection. The proposed tensor-to-tensor random projection is shown to be equivalent to the CP random projection (3.5) with rank-1 projection tensors $\mathcal{A}$, up to a folding-unfolding manner.

Definition 3.3.1. Suppose a d-mode CP tensor $\mathcal{X} = [\![X^{(1)}, X^{(2)}, \ldots, X^{(d)}]\!]$ has size $I_1\times I_2\times\ldots\times I_d$ and CP rank $r$, where $X^{(j)}\in\mathbb{R}^{I_j\times r}$ are the CP factors of $\mathcal{X}$ in matrix form. A rank-1 CP tensor-to-tensor random projection $f_{\text{TPR-CP-TT}}:\mathbb{R}^{I_1\times I_2\times\ldots\times I_d}\to\mathbb{R}^{P_1\times P_2\times\ldots\times P_d}$ is defined as
$f_{\text{TPR-CP-TT}}(\mathcal{X}) = \frac{1}{\sqrt{P}}[\![A^{(1)}X^{(1)}, A^{(2)}X^{(2)}, \ldots, A^{(d)}X^{(d)}]\!] \qquad (3.6)$
where $A^{(j)}\in\mathbb{R}^{P_j\times I_j}$ are Gaussian or sparse random projection matrices, and $P = P_1\times P_2\times\ldots\times P_d$. The projection is uniquely defined by the projection tensor $\mathcal{A} = \{A^{(1)}, A^{(2)}, \ldots, A^{(d)}\}$, the collection of random matrices.

Compared with the CP random projection (3.5), $f_{\text{TPR-CP-TT}}$ assumes the projection tensor $\mathcal{A}$ to be rank-1 and performs the projection directly on the CP factor matrices instead of element-wise. Please notice that the $A^{(j)}$ are not CP component matrices of $\mathcal{A}$, and $\mathcal{A} = \{A^{(1)}, A^{(2)}, \ldots, A^{(d)}\}$ is not Kruskal tensor notation. We use this notation for convenience, since the random projection in Definition 3.3.1 with the collection of matrices $\{A^{(1)}, A^{(2)}, \ldots, A^{(d)}\}$ is equivalent to the CP random projection with tensors in equation (3.5). We show this in the following proposition.

Proposition 3.3.1. Let $\pi: P_1\times P_2\times\ldots\times P_d\to P$ be an invertible unfolding rule such that $\mathcal{X}_{p_1,\ldots,p_d} = \mathrm{Vec}(\mathcal{X})_{\pi(p_1,\ldots,p_d)}$, where $1\le p_j\le P_j$ and $1\le p = \pi(p_1, \ldots, p_d)\le P$ are indices. The random projection (3.5) with a rank-1 projection tensor $\mathcal{A}$ is equivalent to the projection (3.6) up to the unfolding rule $\pi$.

The proof is provided in appendix B.1. The projection can reduce the computational cost significantly for tensor-based models, as it transforms the tensor CP components into lower-dimensional spaces. We now introduce the RPSTM classifiers estimated from the output of the tensor random projection (3.6); a short numerical sketch of the projection itself is given first.
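The sketch below is a minimal illustration under the assumption of dense numpy CP factor matrices; the helper names are hypothetical and not part of the dissertation. It generates the collection {A^(1), ..., A^(d)} and applies the rank-1 CP tensor-to-tensor projection (3.6); the overall 1/√P scale is absorbed into the first projected factor so that the Kruskal representation matches (3.6).

```python
import numpy as np

def gaussian_projection_tensor(in_dims, out_dims, seed=0):
    """Collection {A^(1),...,A^(d)} of Gaussian random matrices, A^(j) of shape (P_j, I_j)."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((p, i)) for p, i in zip(out_dims, in_dims)]

def tpr_cp_tt(cp_factors, proj_mats):
    """Rank-1 CP tensor-to-tensor random projection (3.6).
    Each CP factor X^(j) (I_j x r) is mapped to A^(j) X^(j) (P_j x r); the single
    1/sqrt(P) scale of the Kruskal tensor is absorbed into the first projected factor."""
    P = np.prod([A.shape[0] for A in proj_mats])
    out = [A @ X for A, X in zip(proj_mats, cp_factors)]
    out[0] = out[0] / np.sqrt(P)
    return out

def kruskal_to_dense(factors):
    """Reconstruct the dense tensor sum_k f1[:,k] o f2[:,k] o ... (for small shape checks)."""
    dims = [F.shape[0] for F in factors]
    T = np.zeros(dims)
    for k in range(factors[0].shape[1]):
        comp = factors[0][:, k]
        for F in factors[1:]:
            comp = np.multiply.outer(comp, F[:, k])
        T += comp
    return T
```

A quick usage check: projecting CP factors of a 100 × 100 × 100 tensor with target dimensions (10, 10, 10) produces projected factors of shapes (10, r), and `kruskal_to_dense` on them returns a 10 × 10 × 10 tensor.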
3.3.2 Random-Projection-Based Support Tensor Machine (RPSTM) With tensor-shaped CP random projection, we reformulate the model for the tensor classification problem. Let 𝑇𝑛A = {(X XA XA , 𝑦 ), (X 1 1 , 𝑦 ), ..., (X 2 2 XA 𝑛 , 𝑦 𝑛 )} be the random projection of the original training data 𝑇𝑛 such that XA 𝑖 = 𝑓 TPR-CP-TT (X X𝑖 ) (3.7) for all X𝑖 from 𝑇𝑛 . The random projection 𝑓 TPR-CP-TT is uniquely defined by the fixed CP random projection tensor A = {𝐴 𝐴 (1) , 𝐴 (2) , ..., 𝐴 (𝑑) }, where each 𝐴 ( 𝑗) ∈ R𝑃 𝑗 ×𝐼 𝑗 . The original training tensors are transformed into a lower dimensional space with size 𝑃1 × 𝑃2 ... × 𝑃 𝑑 , i.e. 48 XA𝑖 ∈ R 𝑃1 ×𝑃2 ...×𝑃 𝑑 . Similar to CP-STM, RPSTM tries to find an optimal function 𝑓 such that it optimizes the objective function 𝑛 XA 1Õ min 𝜆|| 𝑓 || 2 + L ( 𝑓 (X 𝑖 ), 𝑦𝑖 ) (3.8) 𝑛 𝑖=1 Instead of using the original data 𝑇𝑛 , the objective function measures the empirical classification loss on the randomly projected training data 𝑇𝑛A . A new kernel function for randomly pro- jected tensors is defined as follow: For any pair of CP tensors X 1 = È𝑋 𝑋 1(1) , 𝑋 1(2) , ..., 𝑋 1(𝑑) É, X 2 = È𝑋𝑋 2(1) , 𝑋 2(2) , ..., 𝑋 2(𝑑) É, the kernel function is   𝐾 (X A A X1 , X 2 ) = 𝐾 𝑓 TPR-CP-TT (X X1 ), 𝑓 TPR-CP-TT (X X2 )  1 = 𝐾 √ [𝐴 𝐴 (1) 𝑋 1(1) , 𝐴 (2) 𝑋 1(2) , ..., 𝐴 (𝑑) 𝑋 1(𝑑) ], 𝑃  1 (1) (1) (2) (2) (𝑑) (𝑑) (3.9) √ [𝐴 𝐴 𝑋 2 , 𝐴 𝑋 2 , ..., 𝐴 𝑋 2 ] 𝑃 𝑅 Ö 𝑑 Õ 1 ( 𝑗) 1 ( 𝑗) = 𝐾 ( 𝑗) ( √ 𝐴 ( 𝑗) 𝑥 1,𝑙 , √ 𝐴 ( 𝑗) 𝑥 2,𝑘 ) 𝑙,𝑘=1 𝑗=1 𝑃 𝑃 ( 𝑗) where 𝐴 (1) , ..., 𝐴 (𝑑) are the projection matrices of A that defines the projection 𝑓 TPR-CP-TT . 𝑥 1,𝑙 ( 𝑗) ( 𝑗) ( 𝑗) are the columns of 𝑋 1 , and 𝑥 2,𝑘 are the columns of 𝑋 2 . 𝐾 ( 𝑗) are still vector-based kernel functions measuring inner products for factors in different tensor modes. A Random Projection-based Support Tensor Machine (RPSTM), with the kernel function (3.9), will be in the form of Õ 𝑛   𝑔 (XX) = 𝛽𝑖 𝑦𝑖 𝐾 𝑓 TPR-CP-TT (X X𝑖 ), 𝑓 TPR-CP-TT (X X) 𝑖=1 (3.10)   = 𝛽 𝑇 𝐷 𝑦 𝐾 𝑓 TPR-CP-TT (X X) for a new given tensor X due the representer theorem [9]. Notice that we use 𝑔 to denote functions spanned by tensor kernels to distinguish it from the random projection function 𝑓 𝑇 𝑃𝑅−𝐶𝑃−𝑇𝑇 and   the original STM classifier 𝑓 . 𝐾 𝑓 𝑇 𝑃𝑅−𝐶𝑃−𝑇𝑇 (X X) is again a column vector whose elements are kernel values between projected training data 𝑓 TPR-CP-TT (X X𝑖 ) and the projected new observation 𝑓 TPR-CP-TT (X X). 𝐷 𝑦 is the diagonal matrix whose diagonal is 𝑦 = [𝑦 1 , ..., 𝑦 𝑛 ] 𝑇 . 𝛽 is the coefficient 49 vector and is differentiated from the notation of CP-STM. We denote the collection of functions in the form of (3.10) with H A , which is also a reproducing kernel Hilbert space (RKHS). The optimal classifier can be estimated by plugging the function (3.10) into the objective function (3.8) and minimize it. Let 𝑔𝑛 denote the optimal function satisfying 𝑛 XA 1Õ 𝑔𝑛 = arg min 𝜆|| 𝑓 || 2 + L ( 𝑓 (X 𝑖 ), 𝑦𝑖 ) (3.11) HA 𝑓 ∈H 𝑛 𝑖=1 then 𝑔𝑛 is the RPSTM classifier associated with the fixed random projection 𝑓 𝑇 𝑃𝑅−𝐶𝑃−𝑇𝑇 that we estimated from the training data. The label for new observation tensor X will be predicted by Sgn[𝑔𝑔𝑛 (XX)]. Thanks to the tensor random projection, the estimation of RPSTM can be compu- tationally efficient and feasible for high-dimensional tensors. The computational benefit will be discussed in details in the model estimation part of this paper. 
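To make the projected kernel (3.9) and the decision rule (3.10) concrete, here is a minimal sketch, not the dissertation's implementation: it evaluates mode-wise Gaussian RBF kernels on the projected CP factors, with each mode-wise input scaled by 1/√P as written in (3.9), and forms the RPSTM decision value. All names are illustrative.

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    d = u - v
    return np.exp(-gamma * np.dot(d, d))

def projected_cp_kernel(factors_1, factors_2, proj_mats, gamma=1.0):
    """Kernel (3.9) between two CP tensors after the tensor-shaped random projection:
    mode-wise kernels are evaluated on A^(j) x^(j) / sqrt(P) instead of x^(j)."""
    P = np.prod([A.shape[0] for A in proj_mats])
    Z1 = [A @ X / np.sqrt(P) for A, X in zip(proj_mats, factors_1)]
    Z2 = [A @ X / np.sqrt(P) for A, X in zip(proj_mats, factors_2)]
    val = 0.0
    for k in range(Z1[0].shape[1]):
        for l in range(Z2[0].shape[1]):
            prod = 1.0
            for U, V in zip(Z1, Z2):
                prod *= rbf(U[:, k], V[:, l], gamma)
            val += prod
    return val

def rpstm_decision(beta, y, train_factors, proj_mats, test_factors, gamma=1.0):
    """RPSTM decision value (3.10): g(X) = sum_i beta_i y_i K(f(X_i), f(X))."""
    kvec = np.array([projected_cp_kernel(Fi, test_factors, proj_mats, gamma)
                     for Fi in train_factors])
    return float(np.sum(beta * y * kvec))
```

The predicted label is the sign of `rpstm_decision`, exactly as with CP-STM; the only change relative to the un-projected kernel is that each mode-wise kernel input has dimension P_j instead of I_j.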
3.3.3 TEC: Ensemble of RPSTM While random projection provides extra efficiency by transforming tensor CP components into lower dimension, there is no guarantee that the projected data will preserve the same margin for every single random projection. As a result, the expected excess risk of RPSTM may be larger than the original CP-STM. In order to mitigate the impact of random projection and provide robust class assignments, multiple RPSTMs are aggregated to form a Tensor Ensemble Classifier (TEC). Let 𝑏 1 Õ 𝜏𝑛,𝑏 (XX) = X)] Sgn[𝑔𝑔𝑚,𝑏 (X (3.12) 𝑏 𝑚=1 𝑏 is the number of RPSTM classifiers estimated with different random projections. 𝑔𝑚,𝑏 are RPSTM classifiers learned independently from the training data 𝑇𝑛 . The TEC classifier is then defined as  1 if 𝜏𝑛,𝑏 (X X) > 𝛾   X) =  𝑒 𝑛,𝑏 (X (3.13)  −1  Otherwise  for a new test tensor X in CP form. 𝛾 is the threshold parameter. For simple majority vote and class- balanced binary classification, 𝛾 = 0. However, it can be different values if any prior information is provided. 50 3.4 Model Estimation In this section, We present two estimation procedures for TEC model using two different loss functions. More importantly, we emphasize significant computational efficiency of TEC model by comparing its algorithmic steps and memory costs to the CP-STM, and even to the naive vectorized support vector machine models. Since TEC is an aggregation of multiple independent RPSTMs, we only have to show the details for a single RPSTM estimation and aggregate them. Similar to the estimation of CP-STM, we can use Hinge loss and Squared Hing loss in objective function (3.8) to measure the empirical classification risk and estimate RPSTM classifiers. With Hinge loss, the objective function becomes 𝑛 XA 1Õ  min 𝜆|| 𝑓 || 2 + max 0, 1 − 𝑓 (X ) · 𝑦𝑖 (3.14) HA 𝑛 𝑖 𝑓 ∈H 𝑖=1 Like CP-STM, the optimization problem (3.14) is equivalent to a quadratic programming problem, which is shown in [25]. Once we calculate the kernel matrix 𝐾 A as XA , XA XA , XA XA X A )   𝐾 (X ) 𝐾 (X ) ... 𝐾 (X , 𝑛   1 1 1 2 1 A A XA , XA XA A )  X2 , X 1 ) 𝐾 (X   A 𝐾 (X 2 2 ) ... 𝐾 (X 2 , X 𝑛 𝐾 =   (3.15)  ... ... ... ...      XA A 𝑛 , X 1 ) 𝐾 (X XA A 𝑛 , X2 ) A A X𝑛 , X 𝑛 )    𝐾 (X ... 𝐾 (X   with tensor kernel function (3.9). The quadratic programming problem is defined as 𝛽 𝐷 𝑦 𝐾 A 𝐷 𝑦 𝛽 − 1𝑇 𝛽 1 𝑇 min 𝛽 ∈R𝑛 2 S.T. 𝛽𝑇 𝑦 = 0 (3.16) 1 0𝛽 2𝑛𝜆 Same optimization techniques used in CP-STM can be adopted to solve this problem. A TEC classifier can then be estimated by repeating the procedure for multiple times with different random ( 𝑗) projections. The steps are summarized in the algorithm 7 below. 𝑋 ℎ [:, 𝑙] is the 𝑙-th column of ( 𝑗) the tensor CP factor matrix 𝑋 ℎ . The output of the algorithm contains a list of RPSTM coefficients 51 Algorithm 7 Hinge TEC 1: procedure TEC Train 2: Input: Training set 𝑇𝑛 = {X X𝑖 }, 𝑦 , kernel function 𝐾, tensor rank r, 𝜆, number of ensemble 𝑏 3: for i = 1, 2,...n do 4: X𝑖 = [𝑋 𝑋 𝑖(1) , ..., 𝑋 𝑖(𝑑) ] ⊲ CP decomposition 5: for m = 1, 2, ..., b do (1) (2) (𝑑) 6: Generate random projection tensor A 𝑚 = {𝐴 𝐴𝑚 , 𝐴 𝑚 , ..., 𝐴 𝑚 } 7: Create initial matrix 𝐾 A 𝑚 ∈ R𝑛×𝑛 8: for i = 1,...,n do 9: for h = 1,...,i do 𝑟 Î 𝐾 A 𝑚 [𝑖, ℎ] = Í 𝑑 𝐾 (𝐴 ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) 10: 𝑗=1 𝐴𝑚 𝑋 𝑖 [:, 𝑘], 𝐴 𝑚 𝑋 ℎ [:, 𝑙]) ⊲ Kernel values 𝑘,𝑙=1 11: 𝐾 A𝑚 [ℎ, 𝑖] = 𝐾 A𝑚 [𝑖, ℎ] 12: Solve the quadratic programming problem (3.16) and find the optimal 𝛽 𝑚∗ . Output: 𝑚∗ , A 13:  1∗𝛽 2∗ 𝑚 A1 , ..., A 𝑏 ]  Output: 𝛽 , 𝛽 , ..., 𝛽 𝑏∗ , [A A1 , ..., A 𝑏 ]. 
The stored coefficients and their corresponding random projection tensors are both needed when predicting new test points.

To estimate RPSTM with the Squared Hinge loss, we can use the Gauss–Newton method to optimize the objective function
$$\min_{f \in \mathcal{H}^{\mathcal{A}}} \; \lambda \|f\|^2 + \frac{1}{n} \sum_{i=1}^{n} \max\big(0,\, 1 - f(\mathcal{X}_i^{\mathcal{A}}) \cdot y_i\big)^2 \qquad (3.17)$$
by setting its derivative to zero. Since the procedure is identical to the derivation (2.9) in section 2.2, we provide the updating rule for the parameter $\beta$ directly:
$$\beta^* = \frac{1}{n} D_y \Big(\lambda I + \frac{1}{n} I_s K^{\mathcal{A}}\Big)^{-1} I_s\, y, \qquad (3.18)$$
where $I_s$ is the diagonal matrix whose diagonal elements indicate whether the corresponding tensors are support tensors. Starting from an initial value of $\beta$, we update the parameter iteratively with (3.18) until its value converges. The steps are summarized in Algorithm 8 below, in which $k_i^{\mathcal{A}_m}$ denotes the $i$-th column vector of the kernel matrix $K^{\mathcal{A}_m}$.

Algorithm 8 Squared Hinge TEC
1: procedure TEC Train
2:   Input: training set $T_n = \{\mathcal{X}_i\}$, labels $y$, kernel function $K$, tensor rank $r$, $\lambda$, $\eta$, maxiter, number of ensemble members $b$
3:   for i = 1, 2, ..., n do
4:     $\mathcal{X}_i = [X_i^{(1)}, \dots, X_i^{(d)}]$  ⊲ CP decomposition
5:   for m = 1, ..., b do
6:     Create initial matrix $K^{\mathcal{A}_m} \in \mathbb{R}^{n \times n}$
7:     Generate random projection tensor $\mathcal{A}_m = \{A_m^{(1)}, A_m^{(2)}, \dots, A_m^{(d)}\}$
8:     for i = 1, ..., n do
9:       for h = 1, ..., i do
10:        $K^{\mathcal{A}_m}[i, h] = \sum_{k,l=1}^{r} \prod_{j=1}^{d} K\big(A_m^{(j)} X_i^{(j)}[:, k],\, A_m^{(j)} X_h^{(j)}[:, l]\big)$  ⊲ Kernel values
11:        $K^{\mathcal{A}_m}[h, i] = K^{\mathcal{A}_m}[i, h]$
12:     Set $\beta^{m*} = \mathbf{1}_{n \times 1}$, $\beta^{m} = \mathbf{0}_{n \times 1}$  ⊲ Initial values
13:     Iteration = 0
14:     while $\|\beta^{m*} - \beta^{m}\|_2 > \eta$ and Iteration ≤ maxiter do
15:       $\beta^{m} = \beta^{m*}$
16:       Find $s \in \mathbb{R}^{n \times 1}$ with $s_i \in \{0, 1\}$ such that $s_i = 1$ if $y_i\, k_i^{\mathcal{A}_m T} \beta^{m} < 1$  ⊲ Support tensors
17:       $I_s = \mathrm{diag}(s)$  ⊲ Diagonal matrix with $s$ on the diagonal
18:       $\beta^{m*} = \frac{1}{n} D_y \big(\lambda I + \frac{1}{n} I_s K^{\mathcal{A}_m}\big)^{-1} I_s\, y$  ⊲ Update
19:     Output: $\beta^{m*}$
20:   Output: $\big[\beta^{1*}, \beta^{2*}, \dots, \beta^{b*}\big]$ and $[\mathcal{A}_1, \dots, \mathcal{A}_b]$
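As an illustration of this estimation step, the sketch below implements the fixed-point iteration (3.18) at the core of Algorithm 8, assuming NumPy and that the projected kernel matrix $K^{\mathcal{A}}$ has already been computed; the function name and convergence defaults are illustrative, not code from the dissertation.

```python
import numpy as np

def fit_squared_hinge_rpstm(K, y, lam, eta=1e-6, maxiter=100):
    # K: projected kernel matrix from (3.15); y: +-1 label vector; lam: penalty.
    n = len(y)
    Dy = np.diag(y)
    beta_new, beta_old = np.ones(n), np.zeros(n)
    it = 0
    while np.linalg.norm(beta_new - beta_old) > eta and it < maxiter:
        beta_old = beta_new
        # Support-tensor indicator: s_i = 1 when y_i * k_i^T beta < 1.
        s = (y * (K @ beta_old) < 1).astype(float)
        Is = np.diag(s)
        # Update rule (3.18): beta = (1/n) D_y (lam*I + (1/n) I_s K)^{-1} I_s y.
        beta_new = Dy @ np.linalg.solve(lam * np.eye(n) + Is @ K / n, Is @ y) / n
        it += 1
    return beta_new
```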
With an estimated TEC model, we can make predictions for new test points. The prediction steps are identical whether the model was estimated with the Hinge loss or the Squared Hinge loss; they are stated in Algorithm 9. For convenience, we keep using the notation for the decomposed training tensors $\mathcal{X}_i = [X_i^{(1)}, \dots, X_i^{(d)}]$ from the estimation steps. $k^{\mathcal{A}_m}$ is a column vector of length $n$ whose $i$-th element is the kernel value between the training tensor $\mathcal{X}_i$ and the new test point $\mathcal{X}$.

Algorithm 9 TEC Prediction
1: procedure TEC Predict
2:   Input: TEC coefficients $\big[\beta^{1*}, \beta^{2*}, \dots, \beta^{b*}\big]$, random projection tensors $[\mathcal{A}_1, \dots, \mathcal{A}_b]$, kernel function $K$, tensor rank $r$, new test point $\mathcal{X}$, threshold parameter $\gamma$
3:   $\mathcal{X} = [X^{(1)}, \dots, X^{(d)}]$  ⊲ CP decomposition of the new observation
4:   $\tau_{n,b} = 0$  ⊲ Initial value in equation (3.12)
5:   for m = 1, ..., b do
6:     for i = 1, ..., n do
7:       $k^{\mathcal{A}_m}[i] = \sum_{k,l=1}^{r} \prod_{j=1}^{d} K\big(A_m^{(j)} X_i^{(j)}[:, k],\, A_m^{(j)} X^{(j)}[:, l]\big)$  ⊲ Kernel vector
8:     $\tau_{n,b} = \tau_{n,b} + \mathrm{Sign}\big[k^{\mathcal{A}_m\,T} D_y \beta^{m*}\big]$  ⊲ Update equation (3.12)
9:   If $\tau_{n,b} > \gamma$, the prediction is class 1; otherwise it is $-1$.  ⊲ Equation (3.13)
10:  Output: prediction

Suppose the projected training tensors have shape $P_1 \times P_2 \times \dots \times P_d$. The time complexity of the kernel matrix computation is $O(n^2 r^2 d \sum_{j=1}^{d} P_j)$. Notice that the choices of $P_j$ are free from the original tensor dimensions $I_j$ and are only related to the training data size $n$, following the JL lemma. We can therefore choose relatively small $P_j$'s, so that the total time complexities of Algorithms 7 and 8 become $O(n^2 r^2 d \sum_{j=1}^{d} P_j + l_1)$ and $O(n^2 r^2 d \sum_{j=1}^{d} P_j + l_2)$, respectively, when the training data are given in their projected CP decomposition forms. Here $l_1$ and $l_2$ denote the additional steps required for the quadratic programming in Algorithm 7 and the iterations in Algorithm 8; empirically, both are bounded by the order of $O(n^2)$, as shown in [25]. Since $I_j \gg P_j$ and $I_j \gg n$ for $j = 1, \dots, d$ in high-dimensional tensor problems, the time complexities of RPSTM in Algorithms 7 and 8 are significantly smaller than those of CP-STM, which are $O(n^2 r^2 d \sum_{j=1}^{d} I_j + l_1)$ and $O(n^2 r^2 d \sum_{j=1}^{d} I_j + l_2)$. Because each RPSTM is estimated independently, the TEC model can be fitted in a parallel manner; as a result, the time complexity of TEC is roughly the same as that of a single RPSTM, which is also much smaller than that of CP-STM. As for the memory complexity, CP-STM requires $O(nr \sum_{j=1}^{d} I_j + n)$, which is more prohibitive than the $O(bnr \sum_{j=1}^{d} P_j + bn)$ required by TEC with $b$ aggregated RPSTMs. Since the memory complexity is dominated by the dimension of the projected CP factors, TEC turns out to be more efficient. If we further consider the naive vectorized SVM model, its time and memory complexities are $O(n^2 \prod_{j=1}^{d} I_j + l_1)$ and $O(n \prod_{j=1}^{d} I_j)$. Through this comparison, it is clear that the TEC model is much more computationally efficient than both CP-STM and the traditional vectorized SVM. The comparison is summarized in Table 3.1.

Models | Time Complexity | Memory Complexity
TEC (Parallel) | $O(n^2 r^2 d \sum_{j=1}^{d} P_j + l_1)$ / $O(n^2 r^2 d \sum_{j=1}^{d} P_j + l_2)$ | $O(bnr \sum_{j=1}^{d} P_j + bn)$
RPSTM | $O(n^2 r^2 d \sum_{j=1}^{d} P_j + l_1)$ / $O(n^2 r^2 d \sum_{j=1}^{d} P_j + l_2)$ | $O(nr \sum_{j=1}^{d} P_j + n)$
CP-STM | $O(n^2 r^2 d \sum_{j=1}^{d} I_j + l_1)$ / $O(n^2 r^2 d \sum_{j=1}^{d} I_j + l_2)$ | $O(nr \sum_{j=1}^{d} I_j + n)$
Vectorized SVM | $O(n^2 \prod_{j=1}^{d} I_j + l_1)$ | $O(n \prod_{j=1}^{d} I_j)$
Table 3.1: TEC: Comparison of Computational Complexity

One may notice that the discussion above does not include the complexity of the tensor CP decomposition and of the random projection itself. Since both RPSTM and CP-STM require CP decomposition, subtracting this part of the complexity does not affect the comparison. Moreover, recent CP decomposition methods from [116, 137] reach a time complexity of $O(nr \sum_{j=1}^{d} I_j)$; adding this part makes neither CP-STM nor RPSTM exceed the time complexity of the vectorized SVM. As for the random projection, definition 3.3.1 allows the projection tensor to be composed of sparse projection matrices, so techniques such as low-rank matrix decomposition can be used to reduce the computation and memory costs. In addition, [90] showed that the projection matrices can be very sparse, making the complexity of the random projection trivial compared with the other estimation steps.
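For completeness, the sketch below shows one way the mode-wise projection matrices $A^{(j)} \in \mathbb{R}^{P_j \times I_j}$ could be drawn, either with i.i.d. standard Gaussian entries or with a standard very sparse construction (entries $\pm\sqrt{s}$ with probability $1/(2s)$ each and zero otherwise). The sparse scheme is a common choice in the random projection literature and is an assumption here, not necessarily the exact construction of definition 3.3.1; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_projection(P, I):
    # Dense projection with i.i.d. N(0, 1) entries.
    return rng.standard_normal((P, I))

def sparse_projection(P, I, s=3):
    # Very sparse projection: +-sqrt(s) w.p. 1/(2s) each, 0 w.p. 1 - 1/s
    # (an assumed, standard construction from the sparse-projection literature).
    return rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)], size=(P, I),
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

def cp_random_projection(P_dims, I_dims, sparse=False):
    # One projection tensor A = {A^(1), ..., A^(d)} for a d-mode tensor.
    draw = sparse_projection if sparse else gaussian_projection
    return [draw(P, I) for P, I in zip(P_dims, I_dims)]

# Example: project 50 x 50 x 50 tensors down to 35 x 35 x 35, roughly the
# P_j = int(0.7 * I_j) rule of thumb suggested in the tuning discussion below.
proj = cp_random_projection([35, 35, 35], [50, 50, 50])
```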
We briefly discuss tuning parameter selection at the end of this section. The number of ensemble classifiers, $b$, and the threshold parameter, $\gamma$, are chosen by cross-validation. We first set $\gamma = 0$, the midpoint between the two labels $-1$ and $1$, and search for $b$ in a reasonable range, between 2 and 20. The optimal $b$ is the one that yields the best classification model. In the next step, we fix $b$ and search $\gamma$ between $-1$ and $1$ with a step size of $0.1$, choosing the value with the best classification accuracy. For a simple majority vote, $\gamma$ can be set to zero directly. The choice of the random projection matrices is more complicated. Although the matrices themselves are easy to generate, the appropriate projection dimension remains unclear: our guideline, the JL lemma, only provides a lower bound on the dimension, and only for the vector case. As a result, in practice we choose the dimension based on intuition and cross-validation. Empirically, we suggest choosing the projection dimension $P_j \approx \mathrm{int}(0.7 \times I_j)$ for each mode.

3.5 Statistical Properties

In this section, we develop the statistical consistency of both the TEC and RPSTM models. In addition, we establish an explicit upper bound on the excess risk brought by random projection, highlighting the trade-off between computational efficiency and potential risk. For convenience, we introduce a few more notations. $R_L(f)$ is the classification risk of a specific decision function $f$, defined as
$$R_L(f) = \mathbb{E}_{(\mathcal{X} \times \mathcal{Y})}\, L\big(y, f(\mathcal{X})\big) = \int L\big(y, f(\mathcal{X})\big)\, d\mathbb{P},$$
where $L$ is a loss function. The empirical risk of the decision function $f$ over the training data $T_n$ is
$$R_{L,T_n}(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(\mathcal{X}_i)\big),$$
which is the estimate of $R_L(f)$ on finite training data. The subscript $L$ in the risk notation emphasizes that the risk is calculated with a specific loss function. We also use risk notations without the subscript $L$, such as $R(f)$, to denote the risk of $f$ under the zero-one loss $L(y, z) = \mathbb{1}\{y \neq z\}$, where $\mathbb{1}$ is the indicator function. The definition of classification consistency is initially stated for the zero-one loss. We use $R^*$ to denote the Bayes risk of the tensor classification problem over the joint distribution $\mathcal{X} \times \mathcal{Y}$. It is the optimal risk in the sense that, over all measurable functions $f: \mathcal{X} \to \mathbb{R}$, $R^* = \min_f R(f)$. A decision rule $\{f_n\}$ is said to be consistent if $R(f_n) \to R^*$ as $n \to \infty$, see [36]. We have to show this for TEC and RPSTM in order to establish their consistency properties. However, existing results [13, 128] show that convergence of the surrogate-loss risk implies convergence of the zero-one-loss risk, i.e.
$$R_L(f_n) \to R_L^* \;\Rightarrow\; R(f_n) \to R^*, \qquad R_L^* = \min_f R_L(f),$$
for any decision rule $\{f_n\}$. This conclusion holds as long as the loss function is self-calibrated, and both the Hinge loss and the Squared Hinge loss used in our models are self-calibrated. As a result, we only have to show $R_L(f_n) \to R_L^*$ for TEC and RPSTM.

Recall that in sections 3.2 and 3.3 we use $\mathcal{H}$ and $\mathcal{H}^{\mathcal{A}}$ to denote the reproducing kernel Hilbert spaces spanned by the tensor kernels (3.2) and the projected tensor kernels (3.9). Let $f_n^{\lambda}, f^{\lambda} \in \mathcal{H}$ be such that
$$f_n^{\lambda} = \arg\min_{f \in \mathcal{H}} \; \lambda \|f\|^2 + R_{L,T_n}(f), \qquad f^{\lambda} = \arg\min_{f \in \mathcal{H}} \; \lambda \|f\|^2 + R_L(f).$$
$f_n^{\lambda}$ is the CP-STM classifier estimated from the training data $T_n$, minimizing the objective function on $T_n$, while $f^{\lambda}$ is the optimal CP-STM learned from a training set of infinite size. The superscript $\lambda$ in both functions indicates that they are optimal for the given value of $\lambda$ in the objective functions. Notice that neither $f_n^{\lambda}$ nor $f^{\lambda}$ minimizes the empirical or expected risk itself; they minimize the objective functions, which include regularization terms. We further denote $R_{L,\mathcal{H}}^* = \min_{f \in \mathcal{H}} R_L(f)$, the optimal risk that can be achieved by functions in $\mathcal{H}$.
Similarly, we define 𝑛 , 𝑔𝜆 𝑔𝜆 ∈ H A as 𝑔𝑛𝜆 = arg min 𝜆||𝑔𝑔 || 2 + R (𝑔𝑔 ) 𝑔𝜆 = arg min 𝜆||𝑔𝑔 || 2 + R L (𝑔𝑔 ) L,𝑇𝑛A 𝑔 ∈HHA HA 𝑔 ∈H 𝑛 XA XA )) for a given tensor  = 𝑛1 Í where R (𝑔𝑔 ) L 𝑦𝑖 , 𝑔 (X ) and R L (𝑔𝑔 ) = E (X×Y) L (𝑦, 𝑔 (X L,𝑇𝑛A 𝑖=1 𝑖 random projection defined by A . 𝑔𝑛𝜆 is the optimal RPSTM model estimated from the projected training data 𝑇𝑛A , and 𝑔 𝜆 is the infinite-sample estimate. We derive the proof of consistency for TEC 𝑒 𝑛,𝑏 𝜆 and RPSTM 𝑔 𝜆 models with the regularization 𝑛 parameter 𝜆. Notice that R L (𝑒𝑒𝑛,𝑏 𝜆 ) and R (𝑔 𝜆 L 𝑔𝑛 ) are calculated with a specific random projection tensor A . Thus, we develop the consistency results on the expected risk EA R L (𝑒𝑒𝑛,𝑏  𝜆 )  and   EA R L (𝑔𝑔𝑛𝜆 ) instead to demonstrate the average performance and integrate the impacts of different possible random projections. The expectation is taken over the distribution of random projection tensors. 3.5.1 Excess Risk of TEC We first boud the expected risk of our TEC classifier 𝑒𝜆𝑛,𝑏 by using the result from [24], theorem 2. Theorem 3.5.1. For each 𝑏 ∈ N, 𝑒𝜆𝑛,𝑏 is the TEC classifier aggregating 𝑏 independent RPSTMs 57 𝑔𝑛𝜆 . 𝜆 is the parameter for the functional norm in the objective function (3.8). Then 1 EA R (𝑒𝑒𝜆𝑛,𝑏 ) − R ∗ 6 EA R (𝑔𝑔𝑛𝜆 ) − R ∗       min(𝛾, 1 − 𝛾) This result says that the ensemble model TEC is statistically consistent as long as the base classifier is consistent. With the surrogate loss property, we only need to develop the consistency for RPSTM ∗ converges to zero.   model by showing excessive risk of surrogate loss EA R L (𝑔𝑔𝑛𝜆 ) − R L 3.5.2 Excess Risk of RPSTM To show the consistency of RPSTM, we first use the following proposition to decompose the excess ∗ , into several parts and bound them separately.   risk of RPSTM, EA R L (𝑔𝑔𝑛𝜆 ) − R L Proposition 3.5.1. The excess risk is bounded above:      𝜆  ∗  𝜆 EA R L (𝑔𝑔𝑛 ) − R L 6 EA R L (𝑔𝑔𝑛 ) − R A (𝑔𝑔𝑛 ) + EA R 𝜆   (𝑓𝜆 ) − R L ( 𝑓A𝜆 ,𝑛 )  L,𝑇𝑛 L,𝑇𝑛A A ,𝑛     + R L ( 𝑓𝑛𝜆 ) − R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) + R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 )    𝜆 𝜆 2  𝜆 𝜆 2 + EA R L ( 𝑓A ,𝑛 ) + 𝜆|| 𝑓A ,𝑛 || − R L ( 𝑓 𝑛 ) − 𝜆|| 𝑓 𝑛 || + 𝐷 (𝜆) + R L,H ∗ ∗ H − RL (3.19)   Where 𝐷 (𝜆) = R L ( 𝑓 𝜆) + 𝜆|| 𝑓 𝜆 || ∗ − R L,H . 𝑓𝜆 H A ,𝑛 = 𝛼 ∗𝑇 𝐷 𝑦𝐾 X) , which is a function 𝑓 TPR-CP-TT (X in H A with the coefficient vector being the optimal coefficient estimate from CP-STM. Notice that 𝑓 𝜆 is different from 𝑔𝑛𝜆 since its coefficients are estimated from the CP-STM model A ,𝑛 and original tensor data 𝑇𝑛 . Recall that we use 𝛼 ∗ and 𝛼 in section 3.2 to denote optimal CP-STM and CP-STM coefficients. In other words, the coefficients of 𝑓A𝜆 ,𝑛 are the same as the 𝑓𝑛𝜆 . However, their kernel basis functions will have different values, making them as two different functions. The proof of proposition 3.5.1 is provided in the appendix B.2. The proposition unveils the fact that the excess risk can be bounded by four types of risks: 58 1. Gaps between empirical risk and expected risk:    𝜆 𝜆    EA R L (𝑔𝑔𝑛 ) − R A (𝑔𝑔𝑛 ) R L ( 𝑓𝑛𝜆 ) − R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) L,𝑇𝑛    𝜆    𝜆 EA R A ( 𝑓A ,𝑛 ) − R L ( 𝑓A ,𝑛 ) R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 ) L,𝑇𝑛 2. Extra risk brought by random projection:    𝜆 𝜆 2  𝜆 𝜆 2 E𝐴 R L ( 𝑓A ,𝑛 ) + 𝜆|| 𝑓A ,𝑛 || − R L ( 𝑓𝑛 ) − 𝜆|| 𝑓𝑛 || 3. 𝐷 (𝜆), approximation error between the minimal regularized objective function and the risk of class optimal. This term depicts how regularized objective function approaches to the class optimal risk R L,H ∗ as the parameter 𝜆 vanishes. H ∗ 4. R L,H − RL ∗ measures the approximation error of the reproducing kernel Hilbert space H H . 
Later we show that with "nice" kernel functions, the functions in the RKHS H can approximate any measureable function as close as possible (in terms of infinite norm). Next, we develop explicit bounds all these components. In the following part, we suppose that all the conditions listed below hold. AS.1 The loss function L is 𝐶 (𝑊) local Lipschitz continuous in the sense that for |𝑎| 6 𝑊 < ∞ and |𝑏| 6 𝑊 < ∞ |L (𝑎, 𝑦) − L (𝑏, 𝑦)| 6 𝐶 (𝑊)|𝑎 − 𝑏| In addition, we need sup L (0, 𝑦) 6 𝐿 0 < ∞. 𝑦∈{1,−1} AS.2 The kernel functions 𝐾 ( 𝑗) (·, ·) used to composite the coupled tensor kernel (3.2) are regular vector-based kernels satisfying the universal approximating property, see [107]. A kernel has this property if it satisfies the following condition. Suppose X is a compact subset of the Euclidean space R 𝑝 , and 𝐶 (X X ) = { 𝑓 : X → R} is the collection of all continuous functions defined on X . The kernel function is also defined on X × X , and its reproduction kernel Hilbert space (RKHS) is H . Then ∀𝑔𝑔 ∈ 𝐶 (X X ), ∃ 𝑓 ∈ H such that ∀𝜖 > 0, ||𝑔𝑔 − 𝑓 || ∞ = sup |𝑔𝑔 (𝑥𝑥 ) − 𝑓 (𝑥𝑥 )| 6 𝜖. X 𝑥 ∈X 59 p AS.3 Assume the tensor kernel function (3.2) is bounded, i.e. sup 𝐾 (·, ·) = 𝐾𝑚𝑎𝑥 < ∞. As a result, the projected kernel function (3.9) is also bounded by 𝐾𝑚𝑎𝑥 for any arbitrary random projection. AS.4 For each component 𝐾 ( 𝑗) (·, ·) in the kernel function (3.2), we assume 𝐾 ( 𝑗) (𝑎, 𝑏) = ℎ ( 𝑗) (||𝑎 − ( 𝑗) 𝑏|| 2 ) or ℎ ( 𝑗) (h𝑎, 𝑏i). ℎ : R → R are functions. We assume that all of them are 𝐿 𝐾 -Lipschitz continuous ( 𝑗) |ℎℎ ( 𝑗) (𝑡1 ) − ℎ ( 𝑗) (𝑡 2 )| 6 𝐿 𝐾 |𝑡 1 − 𝑡2 | 𝐼 ( 𝑗) where 𝑡 1 , 𝑡2 ∈ R 𝑗 are different CP components. Further, let 𝐿 𝐾 = max 𝐿 𝐾 . 𝑗=1,..,𝑑 AS.5 For random projection tensors A = {𝐴 𝐴 (1) , ..., 𝐴 (𝑑) }, suppose all the 𝐴 ( 𝑗) have their elements identically independently distributed as N (0, 1). The dimension of 𝐴 ( 𝑗) is 𝑃 𝑗 × 𝐼 𝑗 . For a 𝛿1 ∈ (0, 1) and 𝜖 > 0, we assume 1 [log 𝛿𝑛 ] 𝑑 1 𝑃𝑗 = 𝑂( ), 𝑗 = 1, 2, ..., 𝑑 𝜖2 𝜖 is considered as the error or distortion caused by random projection. AS.6 The hyper-parameter in the regularization term 𝜆 = 𝜆 𝑛 satisfies: 𝜆𝑛 → 0 𝑛𝜆 𝑛 → ∞ 𝑛𝜆2𝑛 → ∞ as 𝑛→∞ 𝑟 AS.7 For all the tensor data X = 𝑥 𝑘(1) ◦ 𝑥 𝑘(2) ... ◦ 𝑥 𝑘(𝑑) , assume ||𝑥𝑥 𝑘( 𝑗) || 2 6 𝐵𝑥 < ∞. Í 𝑘=1 AS.8 Suppose that there is a CP random projection defined by A such that the Bayes risk in the projected data, R L,A ∗ , remains unaltered A ∗ R L,A ∗ A = RL ∗ R L,A = minR L,A A A ( 𝑓 ) where 𝑓 is any measurable function mapping projected data into class 𝑓 assignments. 60 AS.9 For 𝐷 (𝜆), we assume there is a relation between 𝐷 (𝜆) and 𝜆 𝐷 (𝜆) = 𝑐 𝜂 𝜆𝜂 0<𝜂61 𝑐 𝜂 is a constant depending on 𝜂. AS.10 Suppose the projection error ratio for each mode vanishes at rates depending on loss function. 𝜖𝑛 𝑞 →0 𝑛→∞ 𝜆𝑛 Where 𝜖 𝑛 is the 𝜖 in the assumption AS.5. For hinge loss 𝑞 = 1 and square hinge loss 𝑞 = 32 . AS.11 The probability of projection error is diminishing with increase of sample size, 1  𝛿1 = 𝑂 𝑛 exp(−𝑛 𝑑 ) The assumption AS.1, AS.3, AS.6, and AS.7 are commonly used in supervised learning problems with kernel tricks.(see, e.g. [139, 82, 128]) Assumption AS.4 and AS.5 are needed to help to establish the explicit bound on extra errors brought by random projections. Assumption AS.10 further gives out the condition that the extra errors brought by random projections can converge to zero as 𝑛 goes to infinity. Condition AS.8 assumes that it is possible to learn the optimal decision rule from randomly projected tensors. The optimal risk is still achievable after random projection. 
This helps to align our results to the definition of consistency in [36], and guarantees that RPSTM ∗ . A more detailed discussion about this condition is provided   are consistent if EA R L (𝑔𝑔𝑛𝜆 ) → R L in the appendix B.3. Condition AS.2 is the sufficient condition that the tensor kernel function (3.2) ∗ is universal (see proof in [89]), making R L,H − RL ∗ to be zero. Finally, condition AS.9 guarantees H ∗ that R L,H has a minimizer, and 𝑓 𝜆 converges to the minimizer as 𝜆 goes to zero. (See definition H 5.14 and corollary 5.18 in the section 5.4 of [128]) With the assumption AS.7, the gaps between empirical risk and expected risk are easily bounded by the Hoeffding Inequality (see e.g. [36]). Also, result from [89] says R L,H ∗ = RL∗ due to H condition AS.2. There are only two terms in the proposition 3.5.1 left to be bounded. The extra risk from random projection and the approximation error 𝐷 (𝜆). For these two parts, our strategy is 61 proving the convergence or risk under a single random projection first, and then use the dominant convergence theorem to show the convergence of expected risks. Condition AS.11 entails that probability of projection as well as expected risk difference vanishes with increase in sample size (ℓ1 convergence). 3.5.3 Price of Random Projection Applying random projection in the training procedure is indeed doing a trade-off between prediction accuracy and computational cost. We give out an explicit upper bound on the extra risk brought by random projections. Without taking expectation, the following proposition gives out an upper bound on the extra risk when a random projection A is given. Proposition 3.5.2. Assume a tensor CP random projection is defined by A , whose components are generated independently and identically from a standard Gaussian distribution. With the assumptions AS.1, AS.4, AS.5, AS.6, and AS.7, for the 𝜖 𝑑 described in AS.5. With probability (1 − 2𝛿1 ) and 𝑞 = 1 for hinge loss, and 𝑞 = 23 for square hinge loss function respectively. 𝜖𝑑 |R L ( 𝑓A𝜆 ,𝑛 ) + 𝜆|| 𝑓A𝜆 ,𝑛 || 2 − R L ( 𝑓𝑛𝜆 ) − 𝜆|| 𝑓𝑛𝜆 || 2 | = 𝑂 ( 𝑞 ) 𝜆 where 𝑛 is the size of training set, 𝑑 is the number of modes of tensor. The proof of this proposition is provided in the appendix B.4. The value of 𝑞 depends on loss function as well as kernel and geometric configuration of data, which is discussed in the appendix. This proposition highlights the trade-off between dimension reduction and prediction risk. As the reduced dimension 𝑃 𝑗 is related to 𝜖 negatively, small 𝑃 𝑗 can make the term converges at a very slow rate. 3.5.4 Convergence of Risk Now we summarize the previous results and establish the convergence of risk for RPSTM classifier under a single random projection. The following theorem unveils the explicit convergence rate of 62 RPSTM classifier model. Theorem 3.5.2 (RPSTM Convergence Rate). Suppose all the assumptions AS.1 - AS.8 hold. For 4 𝜖 > 0, let the projected dimension 𝑃 𝑗 = d3𝑟 𝑑 𝜖 −2 [𝑙𝑜𝑔(𝑛/𝛿1 )] 2 e + 1 for each 𝑗 = 1, 2, ..𝑑. The excess risk of a RPSTM with a specific random projection is bounded with probability at least (1 − 2𝛿1 ) (1 − 𝛿2 ), i.e., ∗ 6 𝑉 (1) + 𝑉 (2) + 𝑉 (3) R L (𝑔𝑔𝑛𝜆 ) − R L q √ q q 𝐿0 𝐿0 log(2/𝛿2 ) 2 log(2/𝛿2 ) • 𝑉 (1) = 12𝐶 (𝐾𝑚𝑎𝑥 𝜆 ) · 𝐾𝑚𝑎𝑥 √ + 9𝜁𝜆 ˜ 2𝑛 + 2𝜁𝜆 𝑛 𝑛𝜆 • 𝑉 (2) = 𝐷 (𝜆) q 𝐿0 • 𝑉 (3) = 𝐶𝑑,𝑟 Ψ · [𝐶 (𝐾𝑚𝑎𝑥 𝜆 ) + 𝜆Ψ]𝜖 𝑑 where q q 𝐿0 𝐿0 • 𝐶 (𝐾𝑚𝑎𝑥 𝜆 ) is a constant depending on 𝐾𝑚𝑎𝑥 𝜆 . 
• 𝛿1 ∈ (0, 21 ) and 𝛿2 ∈ (0, 1) • 𝜁𝜆 = sup{L ( 𝑓 𝜆 (X X), 𝑦) : (X X, 𝑦) ∈ X × Y , } q 𝐿0 • 𝜁˜𝜆 = sup{L ( 𝑓 (X X), 𝑦) : all 𝑓 : X → R, || 𝑓 || ∞ 6 𝐾𝑚𝑎𝑥 𝜆 , and X, 𝑦) ∈ X × Y } all (X • 𝐶𝑑,𝑟 = (2𝐿 𝐾 𝐵𝑥2 ) 𝑑 𝑟 2 𝑛 • Ψ = sup{||𝛼 X) = 𝛼 𝑇 𝐷 𝑦 𝐾 (X X) ∈ H } Í 𝛼 || 1 = |𝛼𝑖 | : 𝑓 (X 𝑖=1 We prove the theorem, and explain the terms listed above in appendix B.5. The symbol d·e means rounding a value to the nearest integer above its current value. The theorem provides an upper bound controlling the convergence of excessive risk for RPSTM, which holds with probability at least (1 − 2𝛿1 )(1 − 𝛿2 ). This probability is defined on the join distribution of three random variables, tensor data X , labels y, and the random projection A . The component (1 − 𝛿2 ) is from the randomness in sampling the training data 𝑇𝑛 , and (1 − 2𝛿1 ) is caused by random projection. It is clear that the projection error 𝜖, random projection probability parameter 𝛿1 , and projection 63 4 dimension 𝑃 𝑗 are connected by equation 𝑃 𝑗 = d3𝑟 𝑑 𝜖 −2 [𝑙𝑜𝑔(𝑛/𝛿1 )] 2 e + 1. As a result, one can express the projection error as 𝜖 𝑑 = 𝑟 2 𝑙𝑜𝑔( 𝛿𝑛 )/ 𝑑𝑗=1 𝑃 𝑗 , which is a function of 𝛿1 when fixing the Î 1 size of training data 𝑛 and projected dimension 𝑃 𝑗 . To obtain a higher probability of getting upper bounds, one can consider decreasing 𝛿1 and allowing the projection error to increase. Alternative, one can choose a higher projected dimension 𝑃 𝑗 to have the same level of projection error, but a higher chance of bounding the excessive risk. It worth noting that 𝑃 𝑗 grows as sample size 𝑛 goes to infinity. However, huge 𝑃 𝑗 will make our proposed model infeasible and prohibitive. Thus, the projection error 𝜖 should be replaced by 𝜖 𝑛 in assumption AS.10 to guarantee 𝑃 𝑗 << 𝐼 𝑗 . Theorem 3.5.2 provides an upper bound in general to control the excess risk of RPSTM. In the theorem, there are few quantities related to the loss function L. These terms can be further expressed with specific loss functions such as Hinge and Squared Hinge loss. The next two propositions extend the theorem 3.5.2 with Hinge and Squared Hinge loss, and provide explicit upper bound on RPSTM. 𝜇 𝜇 Proposition 3.5.3. For square hinge loss, let 𝜖 = ( 𝑛1 ) 2𝑑 for 0 < 𝜇 < 1, and 𝜆 = 1 ( 𝑛) 2𝜂+3 for some 4 𝜇 0 < 𝜂 6 1. Assume 𝑃 𝑗 = d3𝑟 𝑑 𝑛 𝑑 [𝑙𝑜𝑔(𝑛/𝛿1 )] 2 e + 1 for each mode 𝑗 = 1, 2, ..., 𝑑. For some 𝛿1 ∈ (0, 12 ) and 𝛿2 ∈ (0, 1), with probability (1 − 𝛿2 )(1 − 2𝛿1 ) s 𝜇𝜂 ∗ 6 𝐶 𝑙𝑜𝑔( 2 )( 1 ) 2𝜂+3 R L (𝑔𝑔𝑛𝜆 ) − R L (3.20) 𝛿2 𝑛 Where 𝐶 is a constant. The rate of convergence is faster with increase in sample size, when high value of 𝜇 is chosen. For 𝑑 𝜇 → 1 the risk difference rate becomes ( 𝑛1 ) 5 . The proof of this result is in appendix B.6. 𝜇 𝜇 Proposition 3.5.4. For hinge loss,Let 𝜖 = ( 𝑛1 ) 2𝑑 for 0 < 𝜇 < 1 and 𝜆 = 1 ( 𝑛) 2𝜂+2 for some 4 𝜇 0 < 𝜂 6 1, 𝑃 𝑗 = d3𝑟 𝑑 𝑛 𝑑 [𝑙𝑜𝑔(𝑛/𝛿1 )] 2 e + 1 , For some 𝛿1 ∈ (0, 12 ) and 𝛿2 ∈ (0, 1) with probability (1 − 2𝛿1 ) (1 − 𝛿2 ) s 𝜇𝜂 ∗ 6 𝐶 𝑙𝑜𝑔( 2 1 2𝜂+2 R L (𝑔𝑔𝑛𝜆 ) − R L )( ) (3.21) 𝛿2 𝑛 Where 𝐶 is a constant. 64 The rate of convergence is faster with increase in sample size, when high value of 𝜇 is chosen. For 𝑑 𝜇 → 1 the risk difference rate becomes ( 𝑛1 ) 4 . The proof of this result is in appendix B.6. Finally, we show the convergence of expected risk Theorem 3.5.3 (Convergence of Expected Risk). Suppose assumptions AS.1 - AS.11 hold. 
The excess risk goes to zero in expectation as sample size increases, the EA denote expectation with respect to tensor random projection A , E𝑛 denote expectation with respect to uniform measure on samples ∗ |→0 E𝑛 |EA [R L (𝑔𝑛𝜆 )] − R L This is the expected risk convergence building on top of our previous results. The proof is provided in the appendix B.7. This theorem concludes that the expected risk of RPSTM converges to the optimal Bayes risk under surrogate loss L. With the aforementioned property about L and theorem 3.5.1, the RPSTM and our ensemble model TEC are statistically consistent. 3.6 Simulation Study We provide a simulation study in this section to compare the empirical performance of our TEC model and some other classification methods. Both vector-based and tensor-based methods in the current literature are considered in this comparison. For vector-based methods, we include Gaussian-RBF SVM from [128], BudgetSVM from [38], Linear Discriminant Analysis from [48], and Random Forest from [21]. For tensor-based methods, we select few highly cited models including Direct General Tensor Discriminant Analysis (DGTDA) and Constrained Multilinear Discriminant Analysis (CMDA) from [92]. The synthetic tensor data are generated using the idea from [43], which creates CP tensors by generating random tensor factors and computing the sum of mulit-way outer products. To have a more comprehensive comparison, we also generate Tucker tensors (see [78]) to show how TEC performs when the tensors are not well approximated by CP decomposition. We listed out the data generating models below: 65 1. F1 Model: Low dimensional rank 1 tensor factor model with each components confirming the same distribution. Shape of tensors is 50 × 50 × 50. X1 = 𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) 𝑥 ( 𝑗) ∼ N (00, 𝐼50 ), 𝑗 = 1, 2, 3 X 2 = 𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) 𝑥 ( 𝑗) ∼ N (0.5 0.5, 𝐼50 ), 𝑗 = 1, 2, 3 0.5 2. F2 Model: High dimensional rank 1 tensor with normal distribution in each component. Shape of tensors is 50 × 50 × 50 × 50. X 1 = 𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) ◦ 𝑥 (4) 𝑥 ( 𝑗) ∼ N (00, Σ ( 𝑗) ), 𝑗 = 1, 2, 3, 4 X 2 = 𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) ◦ 𝑥 (4) 𝑥 ( 𝑗) ∼ N (11, Σ ( 𝑗) ), 𝑗 = 1, 2, 3, 4 (4) Σ (1) = 𝐼 , Σ (2) = 𝐴𝑅(0.7), Σ (3) = 𝐴𝑅(0.3), Σ𝑖, 𝑗 = min(𝑖, 𝑗) 3. F3 Model: High dimensional rank 3 tensor factor model. Components confirm different Gaussian distribution. Shape of tensors is 50 × 50 × 50 × 50. 3 (1) (2) (3) (4) ( 𝑗) Õ X1 = 𝑥𝑘 ◦ 𝑥𝑘 ◦ 𝑥𝑘 ◦ 𝑥𝑘 𝑥 𝑘 ∼ N (00, Σ ), 𝑗 = 1, 2, 3, 4 𝑘=1 3 (1) (2) (3) (4) ( 𝑗) Õ X2 = 𝑥𝑘 ◦ 𝑥𝑘 ◦ 𝑥𝑘 ◦ 𝑥𝑘 𝑥 𝑘 ∼ N (11, Σ ), 𝑗 = 1, 2, 3, 4 𝑘=1 (4) Σ (1) = 𝐼 , Σ (2) = 𝐴𝑅(0.7), Σ (3) = 𝐴𝑅(0.3), Σ𝑖, 𝑗 = min(𝑖, 𝑗) 4. F4 Model: Low dimensional rank 1 tensor factor model with components confirming different distributions. Shape of tensor is 50 × 50 × 50. X 1 = 𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) each element of 𝑥 (1) ∼ Γ(4, 2), 𝑥 (2) ∼ N (11, 𝐼 ), each element of 𝑥 (3) ∼ 𝑈 (1, 2) X 2 = 𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) each element of 𝑥 (1) ∼ Γ(6, 2), 𝑥 (2) ∼ N (11, 𝐼 ), each element of 𝑥 (3) ∼ 𝑈 (1, 2) 66 5. F5 Model: A higher dimensional version of F4 model. Tensors are having four modes with dimension 50 × 50 × 50 × 50 X 1 =𝑥𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) ◦ 𝑥 (4) each element of 𝑥 (1) ∼ Γ(4, 2), 𝑥 (2) ∼ N (11, 𝐼 ), each element of 𝑥 (3) ∼ Γ(2, 1), each element of 𝑥 (4) ∼ 𝑈 (3.5, 4.5) X 2 =𝑥𝑥 (1) ◦ 𝑥 (2) ◦ 𝑥 (3) ◦ 𝑥 (4) each element of 𝑥 (1) ∼ Γ(5, 2), 𝑥 (2) ∼ N (11, 𝐼 ), each element of 𝑥 (3) ∼ Γ(2, 1), each element of 𝑥 (4) ∼ 𝑈 (4.5, 5.5) 6. T1 Model: A Tucker model. 𝑍 (1) , 𝑍 (2) ∈ R50×50×50 with elements independently and identically distributed. 
The size of factor matrices are all 50 by 50. X 1 = 𝑍 (1) ×1 Σ (1) ×2 Σ (2) ×3 Σ (3) each element of 𝑍 (1) ∼ N (0, 1) X 2 = 𝑍 (2) ×1 Σ (1) ×2 Σ (2) ×3 Σ (3) each element of 𝑍 (2) ∼ N (0.5, 1) Σ (1) = 𝐼 , Σ (2) Random Orthogonal Matrix Σ (3) = 𝐴𝑅(0.7) The models F1 - F5 generates CP tensors, whose components confirm various probability distribu- tions. F3 is a rank-3 CP tensor model. T1 is a Tucker tensor models constructed using mode-wise product (see [78]). The mode-1 factor is an identity matrix, mode-2 factor is a randomly generated orthogonal matrix, and mode-3 factor is an auto-regression matrix. We call each classification problem using tensors generated from these models as classification tasks. Thus, there are six tasks in this simulation study. For each synthetic data, we generate 100 samples from class 1 and another 100 samples from class 2. Each time, we first subsample the data to form a training set of size 160, then use the remaining 40 observations to form the test set. We conduct stratified sampling to form training and test sets so that the percentages of each class are the same in both training and test sets. The training set is used to train and validate classifiers. Then the classifiers with the optimal tuning parameters are evaluated on the test set. We record the percentage of true predictions, Total True Predictions × 100%, Predictions 67 as the accuracy of a classifier on the test set. The experiments are repeated for multiple times, and the mean and the standard deviation of accuracy over all repetitions are reported in the table 3.2 below. For fair comparison, all the computations are done on a desktop with a 12-core CPU and 32GB RAM. We record the average time cost for model estimation over all repetitions, and notate "NA" (Not Available) in the table if a classifier cannot be estimated with the limited resources. We believe this could give an overview about the capabilities of different classifiers when handling big tensor data. More technical details about this simulation study is provided in the appendix B.6. Notice that in the table 3.2, we use TEC1 and TEC2 to denote TEC models estimated with Hinge and Square Hinge Loss. AAM, LLSM, and BSGD are three variants of SVM for scalable and high-dimensional data analysis from BudgetedSVM package. The first thing we observe from the simulation study is that all the vector-based methods fail to deliver result in F2, F3, and F5 due to memory insufficiency. We later test these models using the same simulation data but on a high performance cluster which has 128GB memory. Their performance and the comparison are included in the appendix B.9. Among the tensor-based methods, such space insufficiency would not be an issue since data storing in tensors can better utilize computer memories. However, CMDA method fail to provide results in F2, F3, and F5 as its optimization procedure takes extremely long time. This failure is not due to memory limitation but high time complexity. On the other hand, our TEC models utilize tensors to handle big data with limited memory, and provide results in all tasks with high efficiency. The CP decomposition and random projection techniques in our TEC model further reduces the number of elements to be stored in memory. Notice that although our original tensors have the same number of elements as the vectorized data, the tensor decomposition can be done independently on each data. 
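Before turning to the results, the following is a minimal sketch, assuming NumPy, of how the rank-one CP samples of the F1 model could be generated; the helper names and the use of `einsum` for the three-way outer product are illustrative choices, not code from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

def f1_sample(mean, dims=(50, 50, 50)):
    # One rank-1 sample x^(1) o x^(2) o x^(3); each factor is drawn from N(mean, I).
    factors = [rng.normal(loc=mean, scale=1.0, size=d) for d in dims]
    return np.einsum('i,j,k->ijk', *factors)

# 100 samples per class, with labels +1 / -1 as in the simulation design.
class1 = np.stack([f1_sample(0.0) for _ in range(100)])
class2 = np.stack([f1_sample(0.5) for _ in range(100)])
X = np.concatenate([class1, class2])
y = np.concatenate([np.ones(100), -np.ones(100)])
```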
As a result, when the data is in huge dimension, such as the data from F2, F3, and F5, we can process the tensor decomposition and random projection one by one, storing only the randomly projected CP factors in memory and recycling memory space by deleting the original tensor. This processing pipeline distinguishes itself from other dimension reduction techniques such as principle component analysis as it can process all observations in the data set independently. It does not require to load all the data at one time and then perform 68 Model Methods TEC1 TEC2 RBF-SVM AAM LLSVM Accuracy (%) 83.96 85.70 82.00 73.75 62.81 F1 STD (%) 3.85 3.29 3.12 6.36 12.94 Time (s) 1.05 1.26 2.56 1.15 5.06 Accuracy (%) 98.08 86.70 NA NA NA F2 STD (%) 1.06 1.87 NA NA NA Time (s) 1.44 1.56 NA NA NA Accuracy (%) 96.78 98.63 NA NA NA F3 STD ( %) 2.71 1.72 NA NA NA Time (s) 12 12.6 NA NA NA Accuracy ( %) 93.80 94.10 94.13 46.33 53.13 F4 STD (%) 2.20 2.08 3.65 8.12 14.96 Time (s) 1.75 2.10 1.31 1.94 6.21 Accuracy (%) 89.38 89.78 NA NA NA F5 STD ( %) 2.55 2.25 NA NA NA Time (s) 1.49 1.63 NA NA NA Accuracy ( %) 100 100 100 84.13 100 T1 STD ( %) 0.00 0.00 0.00 5.40 0.00 Time (s) 1.05 1.26 2.55 1.59 6.17 Model Methods BSGD LDA RF CMDA DGTDA Accuracy (%) 79.84 83.75 68.45 55.25 64.25 F1 STD (%) 20.16 4.25 6.15 1.25 11.68 Time (s) 7.62 5.12 1.10 21.50 0.57 Accuracy (%) NA NA NA NA 81.50 F2 STD (%) NA NA NA NA 4.89 Time (s) NA NA NA NA 195 Accuracy (%) NA NA NA NA 93.75 F3 STD ( %) NA NA NA NA 3.23 Time (s) NA NA NA NA 198 Accuracy ( %) 57.75 82.88 84.50 80.75 77.50 F4 STD (%) 7.40 5.20 4.72 5.01 4.61 Time (s) 18.68 5.21 0.89 22.35 0.56 Accuracy (%) NA NA NA NA 77.25 F5 STD ( %) NA NA NA NA 6.56 Time (s) NA NA NA NA 1.93 Accuracy ( %) 100 100 100 85.71 85.00 T1 STD ( %) 0.00 0.00 0.00 22.59 22.91 Time (s) 1.42 5.12 0.45 22.85 0.58 Table 3.2: TEC Simulation Results I: Desktop with 32GB RAM 69 feature extraction and dimension reduction, making it appealing for extremely high-dimensional data analysis. Using only the projected tensor factors also makes the proposed TEC model finishing all the computation in a very short time. Empirical evidence in table 3.2 shows that the processing time is tremendously less than DGTDA and CMDA (no results due to high time complexity) in tasks F2, F3 and F5 where the tensor dimensions are huge. Apart from the efficiency, the results in table 3.2 highlight the promising performance of our TEC models. In tasks F1, F4 and T1 where the data dimensions are low, our TEC models have the similar performance as the RBF-SVM and its variants in BudgetedSVM. In particular, our TEC with Square Hinge loss outperforms RBF-SVM and all other competitors with significantly higher accuracy rates in task F1. Their performances in F4 are still decent, providing much higher accuracy rates than other classifiers except RBF-SVM. Their accuracy rates in F4 are only 0.5% less than that of RBF-SVM in this task. In Tucker tensor classification task T1, our TEC models continuing providing as solid performance as all other classifiers. Although the classification task is relatively easy, it still can demonstrate the capability of our TEC models in handling tensors which are not well approximated by tensor CP decomposition. The performance advantage of TEC models are even more impressive in higher-dimensional tensor classification tasks F2, F3, and F5. Due to the fact that all vector-based classifiers fail to deliver results on the testing platform, we only compare TEC1 and TEC2 with tensor discriminant analysis DGTDA. 
In F2 and F5, TEC models have about 10% more average rates than DGTDA. This advantage reduces to 3% in task F3, however, is still significant. In conclusion, the simulation study demonstrates computational efficiency as well as solid performance for our proposed tensor ensemble classifier TEC. 3.7 Real Data Analysis In this section, we compare the performance of our proposed TEC models with other existing tensor-based classifiers reviewed in chapter 2. We continue using the two real data sets in chapter 2 for experiments. CP-STM with Hinge and Squared Hinge loss from [63], CMDA and DGTDA 70 Models Accuracy Precision Sensitivity Specificity AUC TEC1 0.710.04 0.800.09 0.500.07 0.890.05 0.640.19 TEC2 0.730.03 0.840.04 0.520.09 0.910.03 0.66 0.19 CP-GLM 0.580.04 0.580.07 1.000.00 0.000.00 0.500.00 CMDA 0.700.03 0.690.05 0.670.09 0.730.10 0.650.17 DGTDA 0.700.02 0.710.02 0.590.06 0.800.01 0.640.18 CP-STM1 0.730.03 1.000.00 0.410.07 1.000.00 0.640.20 CP-STM2 0.74 0.04 1.000.00 0.430.08 1.000.00 0.650.20 Table 3.3: Real Data: ADNI Classification Comparison II from [92], and CP-GLM from [155] are included for comparison. 3.7.1 MRI Classification for Alzheimer’s Disease The first data set is ADNI MRI data from ADNI-1 screening session. The introduction of the data is provided earlier in section 2.4, and thus is omitted here. We randomly sample 80% of images from AD group and 80% from NC group to form the training set with size 321. AD is labeled as positive class, and NC is labeled as negative class. The rest images are used as test set to evaluate model performance. For each classification model, we evaluate its performance by calculating its accuracy, precision, sensitivity, and specificity on the test set. Such step is replicated for multiple times, and the average accuracy, precision, sensitivity, and specificity are reported in the table 3.3. The standard deviation of these performance metrics are also provided in the subscripts. Since the image data are already in tensors, we do not destroy their spatial structure and just compare tensor-base classifiers in this study. The results in table 3.3 shows that all the tensor-based classifiers have close accuracy in prediction, while CP-STM2 and TEC2 are slightly better than the others. Although CP-STM with Squared Hinge loss has 0.01 more in accuracy rate than TEC models, the AUC of TEC2 is slightly greater. This comparison unveils that tensor compression through random projections may not affect the model classification accuracy negatively, while it can provide a much better computation efficiency. 71 Classification Accuracy 0.8 0.6 Method CMDA CP−GLM Accuracy 0.4 CPSTM1 CPSTM2 DGTDA TEC1 TEC2 0.2 0.0 CMDA CP−GLM CPSTM1 CPSTM2 DGTDA TEC1 TEC2 Method Figure 3.1: Real Data: ADNI Classification Result II 3.7.2 KITTI Traffic Image Classification Our second real data application use the same traffic image data set from section 2.4. The introduction and data pre-processing is omitted and can be found in section 2.4. After imaging processing, the data set are divided into three groups which defines three classification tasks with levels of difficulties as easy, moderate, and hard. To maintain the balance between car and pedestrian images in the data sets, we randomly select 200 car images and 200 pedestrian images in each group for our numerical experiments. Pedestrian images are considered as the positive class data, and car images are negative class data. The following procedures are repeated for 50 times in all three tasks with the balanced data sets. 
We randomly sample 80% of images as training set, and use the rest 20% data as the testing set. The sampling is conducted in a stratified way so that the proportion of pedestrian and car images are approximately same in both training and test set. Classification models are estimated and validated in training set. Then the models with selected tuning parameters are applied on the testing set. For each repetition, we calculate the same performance metrics, accuracy rates, 72 precision (positive predictive rates), sensitivity (true positive rates), and specificity (true negative rates), for each classification method using the testing set. The average value of these rates and their standard deviations (in subscripts) are reported in the table 3.4. The areas under the ROC curves (AUC) are also reported for all the methods. Figure 3.2 shows the comparison of prediction accuracy rates, in which average accuracy rates of each method is shown by the bar charts and their standard deviations are shown by the error bars. Task Methods Accuracy Precision Sensitivity Specificity AUC CP-STM1 0.850.03 0.840.05 0.850.05 0.840.06 0.850.03 CP-STM2 0.830.04 0.830.05 0.830.05 0.830.06 0.830.04 TEC1 0.88 0.04 0.880.06 0.890.05 0.870.07 0.88 0.04 Easy TEC2 0.740.05 0.750.06 0.730.07 0.760.08 0.740.05 CMDA 0.630.07 0.580.06 0.950.08 0.300.16 0.630.07 DGTDA 0.840.04 0.770.04 0.960.03 0.720.07 0.840.04 CP-GLM 0.570.05 0.570.06 0.590.07 0.550.09 0.570.05 CP-STM1 0.780.05 0.780.06 0.770.07 0.780.07 0.780.05 CP-STM2 0.730.06 0.750.07 0.720.08 0.750.09 0.730.06 TEC1 0.85 0.04 0.850.05 0.840.06 0.850.06 0.85 0.04 Moderate TEC2 0.740.04 0.730.04 0.770.08 0.710.06 0.740.04 CMDA 0.590.05 0.550.04 0.890.11 0.280.13 0.590.05 DGTDA 0.740.06 0.720.06 0.790.08 0.690.08 0.740.05 CP-GLM 0.530.05 0.530.05 0.540.07 0.520.08 0.530.05 CP-STM1 0.760.04 0.840.06 0.640.07 0.870.05 0.760.04 CP-STM2 0.740.04 0.800.06 0.630.07 0.840.06 0.740.04 TEC1 0.77 0.05 0.780.05 0.750.07 0.780.06 0.77 0.05 Hard TEC2 0.670.05 0.700.05 0.670.07 0.660.07 0.670.04 CMDA 0.530.04 0.520.02 0.910.09 0.160.12 0.530.04 DGTDA 0.720.05 0.680.04 0.840.06 0.600.08 0.720.05 CP-GLM 0.510.06 0.510.06 0.540.07 0.490.07 0.510.06 Table 3.4: Real Data: Traffic Image Classification II Besides the conclusion we already obtained from section 2.4, the results in table 3.4 show that TEC1 model outperforms CP-STM1, the winner in our previous study, with a significant advantage in all three classification tasks. In particular, TEC model with Hinge loss has 7% more prediction accuracy rates than CP-STM1 in the moderate level classification task. This key observation can be a strong empirical evidence for our proposed TEC models supporting that the models can provide 73 Classification Accuracy 0.75 Method CMDA CP−GLM Accuracy 0.50 CPSTM1 CPSTM2 DGTDA TEC1 TEC2 0.25 0.00 Easy Hard Moderate Task Figure 3.2: Real Data: Traffic Image Classification Result II significantly better prediction accuracy rates when the data are noisy and the projected data are sufficient for classification. 3.8 Conclusion We have proposed a tensor ensemble classifier with the CP support tensor machine and random projection in this work. The proposed method can handle high-dimensional tensor classification problems much faster comparing with the existing regularization based methods. Thanks to the Johnson-Lindenstrauss lemma and its variants, we have shown that the proposed ensemble classifier has a converging classification risk and can provide consistent predictions under some specific conditions. 
Tests with various synthetic tensor models and real data applications show that the proposed TEC can provide optimistic predictions in most classification problems. Our primary focus in this work is on the classification applications on high-dimensional multi- way data such as images. Support tensor ensemble turns out to be an efficient way of analyzing such data. However, model interpretation has not been considered here. The features in the projected space are not able to provide any information about variable importance. Alternative approaches are possible for constructing explainable tensor classification models, but they are out 74 of this article’s scope. Besides that, selection for the dimension (size) of projected tensor 𝑃 𝑗 s cannot be addressed well at this moment. Although our theoretical result points out the connection between the classification risk and min 𝑃 𝑗 , discussion about how to set 𝑃 𝑗 for each mode of tensor may have to be developed in the future. In conclusion, TEC offers a new option in tensor data analysis. The key features highlighted in work are that TEC can efficiently analyze high-dimensional tensor data without compromising the estimation robustness and classification risk. We anticipate that this method will play a role in future application areas such as neural imaging and multi-modal data analysis. 75 CHAPTER 4 COUPLED SUPPORT TENSOR MACHINE FOR MULTIMODAL NEUROIMAGING DATA In this chapter, we consider a classification problem with multimodal tensor predictors. Multimodal neuroimaging data arise in various applications where information about the same phenomenon is acquired from multiple sensors and across different imaging modalities. Learning from multimodal data is of great interest in machine learning and statistics research as it offers the possibility of capturing complementary information among modalities. Multimodal learning increases model performance, explains the interdependence between heterogeneous data sources, discovers new insights that may not be available from a single modality, and improves decision-making. Recently, coupled matrix-tensor factorization has been introduced for multimodal data fusion to jointly estimate latent factors and identify complex interdependence among the latent factors. However, prior work on coupled matrix-tensor factorization mostly focuses on unsupervised learning, and very few of them utilize the jointly estimated latent factors for supervised learning. This paper considers the multimodal tensor data classification problem and proposes a Coupled Support Tensor Machine (C-STM), which is built upon the latent factors jointly estimated from Advanced Coupled Matrix Tensor Factorization (ACMTF). C-STM combines individual and shared latent factors with multiple kernels and estimates a maximal-margin classifier for coupled matrix tensor data. The classification risk of C-STM is shown to converge to the optimal Bayes risk, making it a statistically consistent rule. C-STM is validated through simulation studies as well as simultaneous EEG-fMRI analysis. The empirical evidence shows that C-STM can utilize information from multiple sources and provide a better classification performance than traditional unimodal classifiers. 
4.1 Introduction Advances in clinical neuroimaging and computational bioinformatics have dramatically in- creased our understanding of various brain functions using multiple modalities such as Magnetic 76 Resonance Imaging (MRI), functional Magnetic Resonance Imaging (fMRI), electroencephalo- gram (EEG), and Positron Emission Tomography (PET). The strong connection of these modalities to the patients’ biological status and disease pathology suggests the great potential of their predictive power in disease diagnostics. Numerous studies using vector- and tensor-based statistical models illustrate how to utilize these imaging data at both the voxel- and Region-of-Interest (ROI) levels to develop efficient biomarkers that predict disease status. For example, [8] propose a classification model using functional connectivity MRI for autism disease with 89% diagnostic accuracy. [123] utilize network models and brain imaging data to develop novel biomarkers for Parkinson’s disease. Many works in Alzheimer’s disease research such as [109, 74, 100, 37, 94] use EEG, MRI and PET imaging data to predict patient’s cognition and detect early-stage Alzheimer’s diseases. Although these studies have provided impressive results, utilizing imaging data from single modality such as individual MRI sequences are known to have limited predictive capacity, especially in the early phases of the disease. For instance, [94] use brain MRI volumes from regions of interest to identify patients in early-stage Alzheimer’s disease with 77% prediction accuracy. In recent years, it has been common to acquire multiple neuroimaging modalities in clinical studies such as simultaneous EEG-fMRI, MRI and fMRI. Even though each modality measures different physiological phenom- ena, they are interdependent and mutually informative. Learning from multimodal neuroimaging data may help integrate information from multiple sources and facilitate biomarker development in clinical studies. It also raises the need for novel supervised learning techniques for multimodal data in statistical learning literature. The existing statistical approaches to multimodal data science are dominated by unsupervised learning methods. These methods analyze multimodal neuroimaging data jointly by performing decomposition, and try to discover how the common information is overlaid across different modali- ties. During optimization, the decomposed factors bridging two or more modalities are estimated to interpret connections between multimodal data. Examples of these methods include matrix-based joint Independent Component Analysis (ICA) [23, 56, 86, 97, 129, 6] which assume bilinear cor- relations between factors in different modalities. When tensors are utilized for multi-dimensional 77 imaging modeling, various coupled matrix-tensor decomposition methods are established such as [5, 4, 6, 26, 27, 73, 110] which impose different types of soft or hard multilinear constrains between factors from different modalities. These methods further extend possible correlations between multimodal data, providing more flexibility in data modeling. Current supervised learning approaches for multimodal data simply concatenate data modalities as extra features without exploring their interdependence. For example, [155, 93] build generalized regression models by appending tensor and vector predictors linearly for image prediction and classification. [114] develop a discriminant analysis by including tensor and vector predictors in a linear fashion. 
[91] propose an integrative factor regression for multimodal neuroimaging data assuming that data from different modalities can be decomposed into common factors. Another type of integration utilizes kernel tricks and combines information from multimodal data with multiple kernels. [55] provide a survey on various multiple kernel learning techniques for multimodal data fusion and classification with support vector machines. Combining kernels linearly or non- linearly in different modalities, instead of original data, provides more flexibility in information integration. [11] proposed a multiple kernel regression model with group lasso penalty, which integrates information by multiple kernels and selects the most predictive data modalities. Despite these accomplishments, the current approaches have several shortcomings. First, they mainly focus on exploring the interdependence between multimodal neuroimaging data, ignoring the representative and discriminative power of the learned components. Thus, the methods cannot further bridge the imaging data to the patients’ biological status, which is not helpful in biomarker development. Second, the supervised techniques integrate information primarily by data or feature concatenation without explicitly considering the possible correlations between different modali- ties. This lack of consideration for interdependence may cause issues like overfitting and parameter identifiability. Third, current multimodal approaches are mostly vector-based. Since many neu- roimaging data are multi-dimensional, these approaches may fail to utilize the multi-way features as well as the multi-way interdependence between different modalities. Finally, although many empirical studies demonstrate the success of using multimodal data, there is a lack of mathematical 78 and statistical clarity to the extent of generalizability and associated uncertainties. The absence of a sound statistical framework for multimodal data analysis makes it impossible to interpret the generalization ability of a certain statistical model. In this paper, we propose a two-stage Coupled Support Tensor Machine (C-STM) for multimodal tensor-based neuroimaging classification. The model accommodates current multimodal data science issues and provides a sound statistical framework to interpret the interdependence between modalities and quantify the model consistency and generalization ability. The major contributions of this work are: 1. To extract individual and common components from multimodal tensor data in the first stage using Advanced Coupled Matrix Tensor Factorization (ACMTF), and identify interdepen- dence between multimodal data through latent factors. 2. To build a novel CP Support Tensor Machine with both the individual and common factors for classification. This new model is named Coupled Support Tensor Machine (C-STM). 3. To show the proposed model is a consistent classification rule. A Matlab package is also provided in the supplemental material, including all functions for C-STM classification and detailed data processing pipeline. The rest part of this chapter is organized as follow. Section 4.2 reviews current approaches about coupled matrix tensor factorization and multiple kernel learning, which are the basis of this work. Section 4.3 introduce our classification model. The model estimation is presented in section 4.4 using nonlinear conjugate gradient descent optimization. 
A simulation study is presented in section 4.6 to compare the performance of multimodal classification with single modal classification, highlighting the benefits of using information from multiple sources. Then we adopt the C-STM model in a simultaneous EEG-fMRI data trial classification problem in section 4.7. The conclusion of this chapter is in section 4.8. 79 4.2 Related Work In this section, we review some backgorund and prior work on tensor decomposition and support tensor machine. In this work, we denote numbers and scalars by letters such as 𝑥, 𝑦, 𝑁. Vectors are denoted by boldface lowercase letters, e.g. 𝑎 , 𝑏 . Matrices are denoted by boldface capital letters like 𝐴 , 𝐵 . Multi-dimensional tensors are denoted by boldface Euler script letters such as X , Y . The order of a tensor is the number of dimensions of the data hypercube, also known as ways or modes. For example, a scalar can be regarded as a zeroth-order tensor, a vector is a first-order tensor, and a matrix is a second-order tensor. Let X ∈ R 𝐼1 ×𝐼2 ×···×𝐼 𝑁 be a tensor of order 𝑁, where 𝑥𝑖1 ,𝑖2 ,...,𝑖 𝑁 denotes the (𝑖1 , 𝑖2 , . . . , 𝑖 𝑁 )th element of the tensor. Vectors obtained by fixing all indices of the tensor except the one that corresponds to 𝑛th mode are called mode-𝑛 fibers and denoted as 𝑥 𝑖1 ,...𝑖𝑛−1 ,𝑖𝑛+1 ,...𝑖 𝑁 ∈ R 𝐼𝑛 . The 𝐼𝑛 × 𝑁0 Î 𝐼 0 mode-𝑛 unfolding of X is defined as X (𝑛) ∈ R 𝑛 =1,𝑛0≠𝑛 𝑛 where the mode-𝑛 fibers of the tensor X are the columns of X (𝑛) and the remaining modes are organized accordingly along the rows. 4.2.1 CP Decomposition Let X ∈ R 𝐼1 ×𝐼2 ×...×𝐼 𝑑 be a tensor with 𝑑 modes. Rank-𝑟 Canonical/Polyadic (CP) decomposition of X is defined as: 𝑟 (1) (2) (𝑑) Õ X≈ 𝜁 𝑘 · 𝑥 𝑘 ◦ 𝑥 𝑘 ... ◦ 𝑥 𝑘 = È𝜁𝜁 ; 𝑋 (1) , ..., 𝑋 (𝑑) É, (4.1) 𝑘=1 𝐼 ×𝑟 ( 𝑗) where 𝑋 ( 𝑗) ∈ R 𝑗 , 𝑗 ∈ {1, .., 𝑑} are defined as factor matrices whose columns are 𝑥 𝑘 and "◦" represents the vector outer product. The right side of (4.1) is called Kruskal tensor, which is a convenient representation for CP tensors, see [81]. We denote a Kruskal tensor by 𝔘 𝑥 = È𝜁𝜁 ; 𝑋 (1) , ..., 𝑋 (𝑑) É where 𝜁 ∈ R𝑟 is a vector holding the weights of rank one components. In the special case of matrices, 𝜁 corresponds to singular values of a matrix. If all the elements in 𝜁 are 1, then 𝜁 can be dropped from the notation. In general, it is assumed that the rank 𝑟 is small so that 80 equation (4.1) is also called low-rank approximation for a tensor X . Such an approximation can be estimated by an Alternating Least Square (ALS) approach, see [78]. Motivated by the fact that joint analysis of data from multiple sources can potentially un- veil complex data structures and provide more information, Coupled Matrix Tensor Factorization (CMTF) ([2]) was proposed for multimodal data fusion. CMTF estimates the underlying latent factors for both tensor and matrix data simultaneously by taking the coupling between tensor and matrix data into account. This feature makes CMTF a promising model in analyzing heterogeneous data, which generally have different structures and modalities. Let X 1 ∈ R 𝐼1 ×𝐼2 ×...×𝐼 𝑑 and 𝑋 2 ∈ R 𝐼1 ×𝐽2 . Assuming the factors from the first mode of the tensor X 1 span the column space of the matrix 𝑋 2 , CMTF tries to estimate all factors by minimizing: 1 1 𝑄 (𝔘𝔘1 , 𝔘 2 ) = X1 − È𝑋 kX 𝑋 1(1) , 𝑋 1(2) , ...𝑋 𝑋 1(𝑑) Ék 2Fro + k𝑋 𝑋 − 𝑋 2(1) 𝑋 2(2)> k 2Fro , (1) (1) s.t. 𝑋 1 = 𝑋 2 , 2 2 2 (4.2) (𝑚) (1) (1) where 𝑋 𝑝 are the factor matrices for modality 𝑝 and mode 𝑚. 
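As a small illustration of the objects involved, the sketch below (assuming NumPy) reconstructs a third-order tensor from its Kruskal representation (4.1) and evaluates the CMTF objective (4.2) for given factor matrices with the first-mode factor shared between the tensor and the matrix; it only evaluates the objective and is not the gradient-based estimation algorithm of [2].

```python
import numpy as np

def kruskal_to_full(weights, factors):
    # X = sum_k zeta_k * x_k^(1) o x_k^(2) o x_k^(3) for a three-way tensor.
    U, V, W = factors
    return np.einsum('k,ik,jk,lk->ijl', weights, U, V, W)

def cmtf_objective(X1, X2, shared, tensor_factors, matrix_factor):
    # 0.5 * ||X1 - [[shared, U2, U3]]||_F^2 + 0.5 * ||X2 - shared @ V2^T||_F^2,
    # where "shared" is the coupled factor X_1^(1) = X_2^(1) in (4.2).
    r = shared.shape[1]
    recon_tensor = kruskal_to_full(np.ones(r), [shared] + list(tensor_factors))
    recon_matrix = shared @ matrix_factor.T
    return (0.5 * np.sum((X1 - recon_tensor) ** 2)
            + 0.5 * np.sum((X2 - recon_matrix) ** 2))
```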
The factor matrices 𝑋 1 = 𝑋 2 are the coupled factors between tensor and matrix data. These factor matrices can also be represented in Kruskal form, 𝔘 1 = È𝑋 𝑋 1(1) , 𝑋 1(2) , ...𝑋 𝑋 1(𝑑) É and 𝔘 2 = È𝑋𝑋 2(1) , 𝑋 2(2) É. By minimizing the objective function 𝑄 (𝔘 𝔘1 , 𝔘 2 ), CMTF estimates latent factors for both the tensor and matrix data jointly which allows it to utilize information from both modalities. [2] uses a gradient descent algorithm to optimize the objective function (4.2). Although CMTF provides a successful framework for joint data analysis, it often fails to obtain a unique estimation when both shared and individual components exist. As a result, any further statistical analysis and learning from CMTF estimation will suffer from the large uncertainty in latent factors. To address this issue, [3] proposed Advanced Coupled Matrix Tensor Factorization (ACMTF) by introducing a sparsity penalty to the weights of latent factors in the objective function (4.2), and restricting the norm of the columns of the factors to be unity to allow unique results up to a permutation. This modification provides a more precise estimation of latent factors compared to CMTF, and makes it possible to develop further stable statistical models upon the estimated factors. 81 4.2.2 CP Support Tensor Machine (CP-STM) CP-STM has been previously studied by [136, 63, 64] and use CP model to construct STMs. Given a collection of data 𝑇𝑛 = {(X X1 , 𝑦 1 ), (X X2 , 𝑦 2 ), ..., (X X𝑛 , 𝑦 𝑛 )}, where X𝑖 ∈ X ⊂ R 𝐼1 ×𝐼2 ×...×𝐼 𝑑 are d-way tensors, X is a compact tensor space which is a subspace of R 𝐼1 ×𝐼2 ×...×𝐼 𝑑 , and 𝑦𝑖 ∈ {1, −1} are binary labels. CP-STM assumes the tensor predictors have a CP structure, and can be classified 𝑛 by the function which minimizes the objective function 𝜆|| 𝑓 || 2 + 𝑛1 X𝑖 ), 𝑦𝑖 ). By using Í L ( 𝑓 (X 𝑖=1 tensor kernel function 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) Õ X1 , X 2 ) = 𝐾 (X 𝐾 ( 𝑗) (𝑥𝑥 1,𝑙 , 𝑥 2,𝑚 ), (4.3) 𝑙,𝑚=1 𝑗=1 𝑟 (1) (𝑑) 𝑟 (1) (𝑑) where X 1 = 𝑥 1,𝑙 ◦ .. ◦ 𝑥 1,𝑙 and X 2 = Í Í 𝑥 2,𝑙 ◦ .. ◦ 𝑥 2,𝑙 , the STM classifier can be written as 𝑙=1 𝑙=1 Õ𝑛 X) = 𝑓 (X 𝛼𝑖 𝑦𝑖 𝐾 (X X𝑖 , X ) = 𝛼 𝑇 𝐷 𝑦 𝐾 (X X), (4.4) 𝑖=1 where X is a new 𝑑-way rank-𝑟 tensor of size 𝐼1 × 𝐼2 × ... × 𝐼 𝑑 . In (4.4), 𝛼 = [𝛼1 , ..., 𝛼𝑛 ] 𝑇 are the coefficients, 𝐷 𝑦 is a diagonal matrix whose diagonal elements are 𝑦 1 , .., 𝑦 𝑛 , and 𝐾 (X X) = X1 , X ), ..., 𝐾 (X [𝐾 (X X𝑛 , X )] 𝑇 is a column vector, whose entries are kernel values computed between training data and the new test data. We denote the collection of functions in the form of (4.4) with H , which is a functional space also known as Reproducing Kernel Hilbert Space (RKHS). The optimal CP-STM classifier, 𝑓 ∈ H , can be estimated by plugging function (4.4) into the objective function, and minimize it with Hinge or Squared Hinge loss. The coefficients of the optimal CP-STM model is denoted by 𝛼 ∗ . The classification model is statistically consistent if the tensor kernel function satisfies the universal approximating property as shown in [89]. 4.2.3 Multiple Kernel Learning Multiple kernel learning (MKL) creates new kernels using a linear or non-linear combination of single kernels to measure inner products between data. Statistical learning algorithms such as support vector machine and kernel regression can then utilize the new combined kernels instead of single kernels to obtain better learning results and avoid the potential bias from kernel selection. 
A more important and more closely related reason for using MKL is that different kernels can take inputs from different data representations, possibly coming from different sources or modalities. Thus, combining kernels through MKL is one possible way of integrating multiple information sources. Given a collection of kernel functions {K_1(·,·), ..., K_m(·,·)}, a new kernel function can be constructed as

K(·,·) = f_η({K_1(·,·), ..., K_m(·,·)} | η),   (4.5)

where f_η is a linear or non-linear function, and η is a vector whose elements are the weight coefficients of the kernel combination. Linear combination methods are the most popular form of multiple kernel learning, where the kernel function is parameterized as

K(·,·) = f_η({K_1(·,·), ..., K_m(·,·)} | η) = ∑_{l=1}^{m} η_l K_l(·,·).   (4.6)

The weight parameters η_l can simply be assumed to be equal (unweighted) ([115, 14]), or be determined by looking at performance measures for each kernel or data representation ([134, 118]). More advanced approaches, such as optimization-based, Bayesian, and boosting approaches, can also be adopted ([84, 50, 140, 75, 76, 54, 32, 15]).

Motivated by the elegant framework and consistency property of CP-STM, we extend it to multimodal tensor classification problems by combining it with the ACMTF decomposition. We further consider the linear combination (4.6) of kernels to integrate latent factors from multimodal data, and select the kernel weight parameters in a heuristic, data-driven way to construct our C-STM model. The preciseness of ACMTF offers the chance to capture the true latent structures of multimodal tensors, resulting in better classification performance.

4.3 Methodology

Let T_n = {(X_{1,1}, X_{1,2}, y_1), ..., (X_{n,1}, X_{n,2}, y_n)} be the training data, where each sample t ∈ {1, ..., n} has two data modalities X_{t,1}, X_{t,2} and a corresponding binary label y_t ∈ {1, −1}. In this work, following [2], we assume that the first data modality is a third-order tensor, X_{t,1} ∈ ℝ^{I_1 × I_2 × I_3}, and the other is a matrix, X_{t,2} ∈ ℝ^{I_4 × I_3}.

[Figure 4.1: C-STM Model Pipeline. The tensor modality X_1 ∈ ℝ^{I_1 × I_2 × I_3} and the matrix modality X_2 ∈ ℝ^{I_4 × I_3} are jointly factorized into individual tensor factors, shared factors, and individual matrix factors; the kernels K_1(·,·), K_2(·,·), and K_3(·,·) computed on these factor groups are combined by C-STM to predict the label y.]

The third mode of X_{t,1} and the second mode of X_{t,2} are assumed to be coupled for each t, i.e., the factor matrix is assumed to be fully or partially shared across these modes. Utilizing this coupling, one can extract factors that better represent the underlying structure of the data, and preserve and utilize the discriminative power of the factors from both modalities. Our approach, C-STM (see Figure 4.1), consists of two stages: multimodal tensor factorization (ACMTF) and the coupled support tensor machine. We present both stages in this section.

4.3.1 ACMTF

In this stage, we aim to perform a joint factorization across the two modalities for each training sample t. Let 𝔘_{t,1} = ⟦ζ; X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}⟧ denote the Kruskal tensor of X_{t,1}, and 𝔘_{t,2} = ⟦σ; X_{t,2}^{(1)}, X_{t,2}^{(2)}⟧ denote the singular value decomposition of X_{t,2}. The weights of the columns of each factor matrix X_{t,p}^{(m)}, where p is the modality index and m is the mode index, are denoted by ζ and σ. The norms of these columns are constrained to be 1 to avoid redundancy.
The objective function of the ACMTF decomposition is then given by:

Q(𝔘_{t,1}, 𝔘_{t,2}) = γ_1 ‖X_{t,1} − ⟦ζ; X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}⟧‖²_Fro + γ_2 ‖X_{t,2} − X_{t,2}^{(1)} Σ X_{t,2}^{(2)⊤}‖²_Fro + β_1 ‖ζ‖_1 + β_2 ‖σ‖_1   (4.7)

s.t. X_{t,1}^{(3)} = X_{t,2}^{(2)},   ‖x_{t,1,k}^{(1)}‖_2 = ‖x_{t,1,k}^{(2)}‖_2 = ‖x_{t,1,k}^{(3)}‖_2 = ‖x_{t,2,k}^{(1)}‖_2 = ‖x_{t,2,k}^{(2)}‖_2 = 1   ∀k ∈ {1, ..., r}.

Here Σ is a diagonal matrix whose elements are the singular values of the matrix X_{t,2}, and x_{t,m,k}^{(j)} ∈ ℝ^{I_j} denotes the k-th column of the mode-j factor matrix of X_{t,m}. The objective function in (4.7) includes ℓ_1 penalties on the weights of both the tensor and matrix decompositions. Thus, the model identifies the shared and individual components. In our experiments, we set γ_1 = γ_2 = 1 and β_1 = β_2 = 0.01. These parameters can also be learned through optimization. The estimated factors are then treated as extracted data representations of the multimodal data, and are used to predict the labels y_t in the C-STM classifier.

4.3.2 Coupled Support Tensor Machine (C-STM)

C-STM uses the idea of multiple kernel learning and treats the coupled and uncoupled factors from the ACMTF decomposition as different data representations. As a result, we use three different kernel functions to measure their inner products. One can think of these three kernels as inducing three different feature maps that transform the multimodal factors into different feature spaces. In each feature space, the corresponding kernel measures the similarity between factors from that specific data modality. The similarities of the multimodal factors are then integrated by combining the kernel functions through a linear combination. This procedure is illustrated in Figure 4.1. In particular, the kernel K_1 is a tensor kernel (equation (4.3)), since the first group of individual factors are tensor CP factors. For two pairs of decomposed factors (𝔘_{t,1}, 𝔘_{t,2}) and (𝔘_{i,1}, 𝔘_{i,2}), the kernel function for C-STM is defined as

K((X_{t,1}, X_{t,2}), (X_{i,1}, X_{i,2})) = K((𝔘_{t,1}, 𝔘_{t,2}), (𝔘_{i,1}, 𝔘_{i,2}))
= ∑_{k,l=1}^{r} [ w_1 K_1(x_{t,1,k}^{(1)}, x_{i,1,l}^{(1)}) K_1(x_{t,1,k}^{(2)}, x_{i,1,l}^{(2)}) + w_2 K_2(x_{t,1,k}^{(3)*}, x_{i,1,l}^{(3)*}) + w_3 K_3(x_{t,2,k}^{(1)}, x_{i,2,l}^{(1)}) ].   (4.8)

Here x_{t,1,k}^{(3)*} is the average of the estimated shared factors, (1/2)[x_{t,1,k}^{(3)} + x_{t,2,k}^{(2)}], since the ACMTF algorithm cannot guarantee x_{t,1,k}^{(3)} = x_{t,2,k}^{(2)} numerically. The weights w_1, w_2, and w_3 combine the three kernel functions and can be tuned by cross-validation. With the kernel function (4.8), the C-STM model estimates a bivariate decision function f from a collection of functions H such that

f = arg min_{f ∈ H} λ‖f‖² + (1/n) ∑_{i=1}^{n} L(f(X_{i,1}, X_{i,2}), y_i),   (4.9)

where L(f(X_{i,1}, X_{i,2}), y_i) = max(0, 1 − y_i · f(X_{i,1}, X_{i,2})) is the Hinge loss. H is defined as the collection of all functions of the form

f(X_1, X_2) = ∑_{t=1}^{n} α_t y_t K((X_{t,1}, X_{t,2}), (X_1, X_2)) = α^⊤ D_y K(X_1, X_2),   (4.10)

for any pair of test data (X_1, X_2) and for α ∈ ℝ^n, due to the well-known representer theorem ([9]). For all possible values of α, equation (4.10) defines the function collection H. D_y is a diagonal matrix whose diagonal elements are the labels from the training data T_n, and K(X_1, X_2) is an n × 1 column vector whose t-th element is K((X_{t,1}, X_{t,2}), (X_1, X_2)). The optimal C-STM decision function, denoted by f_n = α*^⊤ D_y K(X_1, X_2), can be estimated by solving the quadratic programming problem

min_{α ∈ ℝ^n} (1/2) α^⊤ D_y K D_y α − 1^⊤ α   (4.11)
s.t. α^⊤ y = 0,   0 ⪯ α ⪯ (1/(2nλ)) 1,
where K is the kernel matrix constructed from the kernel function (4.8). Problem (4.11) is the dual problem of (4.9), and its optimal solution α* also minimizes the objective function (4.9) when functions of the form (4.10) are plugged in. For a new pair of test points (X_1, X_2), the class label is predicted as Sgn(f_n(X_1, X_2)).

4.4 Model Estimation

In this section, we first present the estimation procedure for the coupled tensor-matrix decomposition (4.7), and then combine it with the classification procedure to summarize the algorithm for C-STM. To satisfy the constraints in the objective function (4.7), we convert the function Q(𝔘_{t,1}, 𝔘_{t,2}) into a differentiable and unconstrained form:

Q(𝔘_{t,1}, 𝔘_{t,2}) = γ_1 ‖X_{t,1} − ⟦ζ; X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}⟧‖²_Fro + γ_2 ‖X_{t,2} − X_{t,2}^{(1)} Σ X_{t,2}^{(2)⊤}‖²_Fro + τ ‖X_{t,1}^{(3)} − X_{t,2}^{(2)}‖²_Fro
+ ∑_{k=1}^{r} [ β √(ζ_k² + ε) + β √(σ_k² + ε) + θ ( (‖x_{t,1,k}^{(1)}‖_2 − 1)² + (‖x_{t,1,k}^{(2)}‖_2 − 1)² + (‖x_{t,1,k}^{(3)}‖_2 − 1)² + (‖x_{t,2,k}^{(1)}‖_2 − 1)² + (‖x_{t,2,k}^{(2)}‖_2 − 1)² ) ].   (4.12)

The ℓ_1 norm penalties in (4.7) are replaced with differentiable approximations, τ and θ are Lagrange multipliers, and ε > 0. This unconstrained optimization problem can be solved by nonlinear conjugate gradient descent ([2, 5, 110]). Let T_t be the full tensor of 𝔘_{t,1} (the Kruskal tensor converted into multi-dimensional array form), and let M_t = X_{t,2}^{(1)} Σ X_{t,2}^{(2)⊤}. The partial derivatives with respect to the latent factors can be derived as follows:

∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂X_{t,1}^{(1)} = γ_1 (T_t − X_{t,1})_{(1)} (ζ^⊤ ⊙ X_{t,1}^{(3)} ⊙ X_{t,1}^{(2)}) + θ (X_{t,1}^{(1)} − X̄_{t,1}^{(1)})   (4.13)
∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂X_{t,1}^{(2)} = γ_1 (T_t − X_{t,1})_{(2)} (ζ^⊤ ⊙ X_{t,1}^{(3)} ⊙ X_{t,1}^{(1)}) + θ (X_{t,1}^{(2)} − X̄_{t,1}^{(2)})   (4.14)
∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂X_{t,1}^{(3)} = γ_1 (T_t − X_{t,1})_{(3)} (ζ^⊤ ⊙ X_{t,1}^{(2)} ⊙ X_{t,1}^{(1)}) + τ (X_{t,1}^{(3)} − X_{t,2}^{(2)}) + θ (X_{t,1}^{(3)} − X̄_{t,1}^{(3)})   (4.15)
∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂X_{t,2}^{(1)} = γ_2 (M_t − X_{t,2}) X_{t,2}^{(2)} Σ + θ (X_{t,2}^{(1)} − X̄_{t,2}^{(1)})   (4.16)
∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂X_{t,2}^{(2)} = γ_2 (M_t − X_{t,2})^⊤ X_{t,2}^{(1)} Σ + τ (X_{t,2}^{(2)} − X_{t,1}^{(3)}) + θ (X_{t,2}^{(2)} − X̄_{t,2}^{(2)})   (4.17)
∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂ζ_k = γ_1 (T_t − X_{t,1}) ×_1 x_{t,1,k}^{(1)} ×_2 x_{t,1,k}^{(2)} ×_3 x_{t,1,k}^{(3)} + β ζ_k / (2 √(ζ_k² + ε)),   k = 1, ..., r   (4.18)
∂Q(𝔘_{t,1}, 𝔘_{t,2}) / ∂σ_k = γ_2 x_{t,2,k}^{(1)⊤} (M_t − X_{t,2}) x_{t,2,k}^{(2)} + β σ_k / (2 √(σ_k² + ε)),   k = 1, ..., r   (4.19)

Here T_{(j)} denotes the mode-j unfolding of a tensor T, ×_j denotes the mode-j product, and ⊙ denotes the Khatri-Rao product (see Section 1.2). The overline notation M̄ denotes a normalized matrix M whose columns are divided by their respective ℓ_2 norms. Combining all the parts derived above, the gradient of the objective function is

∇Q(𝔘_{t,1}, 𝔘_{t,2}) = [ ∂Q/∂X_{t,1}^{(1)}, ∂Q/∂X_{t,1}^{(2)}, ∂Q/∂X_{t,1}^{(3)}, ∂Q/∂X_{t,2}^{(1)}, ∂Q/∂X_{t,2}^{(2)}, ∂Q/∂ζ_1, ..., ∂Q/∂σ_1, ... ]^⊤,   (4.20)

which is a (5 + 2r)-dimensional vector. If we use (𝔘_{t,1}, 𝔘_{t,2}) = [X_{t,1}^{(1)}, X_{t,1}^{(2)}, X_{t,1}^{(3)}, X_{t,2}^{(1)}, X_{t,2}^{(2)}, ζ_1, ..., σ_1, ...]^⊤ to denote the latent factors and weights to be estimated, the algorithm uses the negative gradient of Q(𝔘_{t,1}, 𝔘_{t,2}) as the direction to update all the components of (𝔘_{t,1}, 𝔘_{t,2}) simultaneously. We first describe this estimation procedure in Algorithm 10. The algorithm keeps updating (𝔘_{t,1}, 𝔘_{t,2}) until convergence.
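As a simplified illustration of this first estimation stage, the sketch below minimizes a reduced version of (4.12) that keeps only the two reconstruction terms and the quadratic coupling penalty, omitting the sparsity and unit-norm penalties, and it relies on SciPy's conjugate-gradient routine with numerical gradients instead of the analytic gradients (4.13)-(4.19); all dimensions and names are illustrative assumptions, not the original implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Shapes for one training sample: tensor X1 in R^{I1 x I2 x I3}, matrix X2 in R^{I4 x I3},
# coupled along the third tensor mode / second matrix mode, with CP rank r.
I1, I2, I3, I4, r = 8, 6, 5, 7, 2
rng = np.random.default_rng(0)
X1 = rng.standard_normal((I1, I2, I3))
X2 = rng.standard_normal((I4, I3))

sizes = [(I1, r), (I2, r), (I3, r), (I4, r), (I3, r)]  # A1, A2, A3, B1, B2

def unpack(theta):
    factors, start = [], 0
    for (m, n) in sizes:
        factors.append(theta[start:start + m * n].reshape(m, n))
        start += m * n
    return factors

def objective(theta, gamma1=1.0, gamma2=1.0, tau=10.0):
    A1, A2, A3, B1, B2 = unpack(theta)
    # Rank-r CP reconstruction of the tensor: sum_k a1_k o a2_k o a3_k.
    T_hat = np.einsum('ik,jk,lk->ijl', A1, A2, A3)
    M_hat = B1 @ B2.T
    coupling = np.linalg.norm(A3 - B2) ** 2          # soft version of the constraint A3 = B2
    return (gamma1 * np.linalg.norm(X1 - T_hat) ** 2
            + gamma2 * np.linalg.norm(X2 - M_hat) ** 2
            + tau * coupling)

theta0 = rng.standard_normal(sum(m * n for m, n in sizes)) * 0.1
res = minimize(objective, theta0, method='CG', options={'maxiter': 500})
print(res.fun)  # value of the simplified coupled objective at the CG solution
```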
Note that this is a non-convex optimization problem and its convergence properties has been discussed in [112, 117, 143, 144]. Once the factors for all data pairs in the training set 𝑇𝑛 are estimated, we can create the kernel matrix using the kernel function in 4.8. By solving the quadratic programming problem (4.11), we can obtain the optimal decision function 𝑓𝑛 . This two-stage procedure for C-STM estimation is summarized in the algorithm 11 below. 4.5 Theory We discuss the statistical property of C-STM in this section. Let’s assume the risk of a decision X) ≠ 𝑦} , where X ⊂ R 𝐼1 ×..×𝐼 𝑑 is a subspace of R 𝐼1 ×..×𝐼 𝑑 .   function, 𝑓 , is R ( 𝑓 ) = EX ×Y Y 1 { 𝑓 (X Y = {1, −1}. The function 1 {·} is an indicator function measuring the loss of classification 88 Algorithm 10 ACMTF Decomposition 1: procedure ACMTF 2: Input: Multimodal data (X X1 , 𝑋 2 ) tensor rank r, 𝜂, maxiter 3: 0 0 𝔘 𝑡,1 , 𝔘 𝑡,2 = 𝔘 𝑡,1 , 𝔘 𝑡,2 ⊲ Initial value 4: Δ0 = −O𝑄 O𝑄 O𝑄(𝔘 0 , 𝔘0 ) 𝔘𝑡,1  𝑡,20 0 ) + 𝜑Δ  5: 𝜑0 = arg min𝜑 𝑄 (𝔘 𝔘𝑡,1 , 𝔘 𝑡,2 Δ0 6: 1 , 𝔘 1 = (𝔘 𝔘 𝑡,1 0 , 𝔘0 ) + 𝜑 Δ 𝔘𝑡,1 𝑡,2 𝑡,2 0 0 7: 𝑔 0 = Δ0 8: while s < maxiter and |𝑄 𝑄 (𝔘 𝑠 , 𝔘 𝑠 ) − 𝑄 (𝔘 𝔘𝑡,1 𝔘𝑡,1𝑠−1 , 𝔘 𝑠−1 )| > 𝜂 do 𝑡,2 𝑡,2 9: Δ 𝑠+1 = −O𝑄 O𝑄 O𝑄(𝔘 𝑠 , 𝔘𝑠 ) 𝔘𝑡,1 𝑡,2 Δ > Δ 𝑠+1 −Δ (Δ Δ𝑠 ) 10: 𝑔 𝑠+1 = Δ 𝑠+1 + 𝑠+1 > 𝑔 Δ 𝑠+1 −Δ −𝑔𝑔 𝑠 (Δ Δ𝑠 ) 𝑠 𝑠 , 𝔘 𝑠 ) + 𝜑𝑔  11: 𝜑 𝑠+1 = arg min𝜑 𝑄 (𝔘 𝔘𝑡,1 𝑡,2 𝑔 𝑠+1 𝑠+1 , 𝔘 𝑠+1 = (𝔘 𝔘 𝑡,1 𝑠 , 𝔘𝑠 ) + 𝜑 𝔘𝑡,1 12: 𝑡,2 𝑡,2 𝑠+1𝑔 𝑠+1 13: Output: 𝔘 𝑡,1 ∗ , 𝔘∗ 𝑡,2 Algorithm 11 Coupled Support Tensor Machine 1: procedure C-STM 2: Input: Training set 𝑇𝑛 = {(X X1,1 , 𝑋 1,2 , 𝑦 1 ), ..., (X X𝑛,1 , 𝑋 𝑛,2 , 𝑦 𝑛 )}, 𝑦 , kernel function 𝐾, tensor rank r, 𝜆, 𝜂, maxiter 3: for t = 1, 2,...n do 4: ∗ , 𝔘 ∗ = ACMDF((X 𝔘 𝑡,1 𝑡,2 X𝑡,1 , 𝑋 𝑡,2 ), tensor rank r, 𝜂, maxiter) 5: Create initial matrix 𝐾 ∈ R𝑛×𝑛 6: for t = 1,...,n do 7: for i = 1,...,i do  8: 𝐾 [𝑖, 𝑡] = 𝐾 (𝔘 𝔘𝑡,1 , 𝔘 𝑡,2 ), (𝔘 𝔘𝑖,1 , 𝔘𝑖,2 ) ⊲ Kernel values 9: 𝐾 [𝑖, 𝑡] = 𝐾 [𝑡, 𝑖] 10: Solve the quadratic programming problem (4.11) and find the optimal 𝛼 ∗ . 11: Output: 𝛼 ∗ function 𝑓 . If there is a 𝑓 ∗ : X → Y from the collection of all measurable functions such that 𝑓 ∗ = arg min R ( 𝑓 ), its risk is called the Bayes risk for the classification problem with data from X × Y . We denote the Bayes risk as R ∗ = R ( 𝑓 ∗ ). With different training sets 𝑇𝑛 , we can estimate a sequence of decision functions 𝑓𝑛 under the same training procedure. This sequence of decision functions { 𝑓𝑛 } is called a decision rule. A decision rule is statistically consistent if R ( 𝑓𝑛 ) converges to the Bayes risk R ∗ as the size of training data 𝑛 increases, see, e.g., [36]. Our next result shows 89 that C-STM is a statistically consistent decision rule. Proposition 4.5.1. Given the tensor and matrix factors for all data in the domain, the classification risk of C-STM, R ( 𝑓𝑛 ), converges to the optimal Bayes risk almost surely, i.e. R ( 𝑓𝑛 ) → R ∗ 𝑎.𝑠. 𝑛→∞ if the following conditions are satisfied: AS.1 The loss function L is self-calibrated, see [128], and is 𝐶 (𝑊) local Lipschitz continuous in the sense that for |𝑎| 6 𝑊 < ∞ and |𝑏| 6 𝑊 < ∞, |L (𝑎, 𝑦) − L (𝑏, 𝑦)| 6 𝐶 (𝑊)|𝑎 − 𝑏|. In addition, we need sup L (0, 𝑦) 6 𝐿 0 < ∞. 𝑦∈{1,−1} (1) (2) AS.2 The kernel functions 𝐾1 (·, ·), 𝐾1 (·, ·), 𝐾2 (·, ·), and 𝐾3 (·, ·) used to compose the coupled tensor kernel (4.8) are regular vector-based kernels satisfying the universal approximating property. A kernel has this property if it satisfies the following condition. 
Suppose 𝒳 is a compact subset of the Euclidean space ℝ^p, and C(𝒳) = {f : 𝒳 → ℝ} is the collection of all continuous functions defined on 𝒳. The kernel function is defined on 𝒳 × 𝒳, and its reproducing kernel Hilbert space (RKHS) is H. Then for every g ∈ C(𝒳) and every ε > 0, there exists f ∈ H such that ‖g − f‖_∞ = sup_{x ∈ 𝒳} |g(x) − f(x)| ≤ ε.

AS.3 The kernel functions K_1^{(1)}(·,·), K_1^{(2)}(·,·), K_2(·,·), and K_3(·,·) used to compose the coupled tensor kernel (4.8) are all bounded, satisfying sup √(K(·,·)) ≤ K_max < ∞.

AS.4 The hyper-parameter in the regularization term, λ = λ_n, satisfies λ_n → 0 as n → ∞ and nλ_n → ∞ as n → ∞.

This proposition is an extension of our previous result on the statistical consistency of CP-STM. The proof of the proposition is provided in Appendix C.1.

4.6 Simulation Study

We present a simulation study to demonstrate the benefit of using C-STM with multimodal data in classification problems. To show the advantage of using multiple modalities, we compare with CP-STM from [63], and with Constrained Multilinear Discriminant Analysis (CMDA) and Direct General Tensor Discriminant Analysis (DGTDA) from [92]. These existing approaches can only take a single tensor / matrix as the input for classification. Thus, we apply these approaches to each modality separately and compare their classification performance with C-STM. We generate synthetic data with two modalities using the idea from [43] as follows:

X_{t,1} = ∑_{k=1}^{3} x_{k,t,1}^{(1)} ∘ x_{k,t,1}^{(2)} ∘ x_{k,t,1}^{(3)},    X_{t,2} = ∑_{k=1}^{3} x_{k,t,2}^{(1)} ∘ x_{k,t,2}^{(2)},   (4.21)

where X_{t,1} ∈ ℝ^{30 × 20 × 10} and X_{t,2} ∈ ℝ^{50 × 10}, both with rank equal to 3. To generate data for the simulation study, we first generate the latent factors (vectors) from multivariate normal distributions with the parameters given in Table 4.1, and then use equation (4.21) to construct the tensors X_{t,1} and matrices X_{t,2}. In Table 4.1, we use c = 1, 2 to denote data from the two classes. Six different cases are considered in our simulation study. In cases 1 - 3, the discriminative information about the two classes is captured by one of the tensor factors and one of the matrix factors. This means that the tensor and matrix data both contain class information (discriminative power), which may differ between the two modalities. Notice that the discriminative power in the tensor factor remains the same across cases 1 - 3, while the discriminative power in the matrix factor increases. Cases 4 and 5 assume that the class information exists in only a single modality. In case 4, the distribution of one of the tensor factors varies across classes, so the discriminative power between the two classes is carried by the tensor factor. The discriminative factor becomes the matrix factor in case 5. In case 6, the difference between the two classes is captured by the shared factors, meaning that both the tensor and matrix data modalities contain class information. For each simulation case, we generate 50 pairs of tensor and matrix data from each class, collecting 100 pairs of observations in total.
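A minimal sketch of this data-generating scheme, instantiated for Case 1 of Table 4.1, is given below; the function name, random seed, and the way the class-mean shift of 1.5 is applied are read off Table 4.1 and should be taken as an illustrative assumption rather than the original simulation code.

```python
import numpy as np

def generate_sample(c, rng, rank=3, dims=(30, 20, 10), matrix_rows=50):
    """Generate one coupled (tensor, matrix) pair following (4.21) and the Case 1
    settings of Table 4.1: under class c = 2, the first tensor factor and the
    individual matrix factor have mean 1.5 instead of 1."""
    mean_shift = 1.5 if c == 2 else 1.0
    I1, I2, I3 = dims
    tensor = np.zeros(dims)
    matrix = np.zeros((matrix_rows, I3))
    for _ in range(rank):
        a1 = rng.normal(mean_shift, 1.0, I1)            # discriminative tensor factor
        a2 = rng.normal(1.0, 1.0, I2)
        shared = rng.normal(1.0, 1.0, I3)               # shared third-mode / second-mode factor
        b1 = rng.normal(mean_shift, 1.0, matrix_rows)   # discriminative matrix factor
        tensor += np.einsum('i,j,k->ijk', a1, a2, shared)
        matrix += np.outer(b1, shared)
    return tensor, matrix

rng = np.random.default_rng(42)
# 50 pairs per class, 100 coupled observations in total, as in the simulation design.
data = [(generate_sample(c, rng), c) for c in (1, 2) for _ in range(50)]
print(data[0][0][0].shape, data[0][0][1].shape)  # (30, 20, 10) (50, 10)
```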
We then randomly choose 20 samples as the testing set, and use the remaining data as the training set.

Table 4.1: Distribution specifications for the simulation. MVN(c, I) denotes a multivariate normal distribution whose mean vector has all elements equal to c and whose covariance is the identity matrix I.

                  Tensor factors                         Shared factor                       Matrix factor
Case      c       x_{k,t,1}^{(1)}      x_{k,t,1}^{(2)}   x_{k,t,1}^{(3)} = x_{k,t,2}^{(2)}   x_{k,t,2}^{(1)}
Case 1    1       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
          2       MVN(1.5, I)          MVN(1, I)         MVN(1, I)                           MVN(1.5, I)
Case 2    1       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
          2       MVN(1.5, I)          MVN(1, I)         MVN(1, I)                           MVN(1.75, I)
Case 3    1       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
          2       MVN(1.5, I)          MVN(1, I)         MVN(1, I)                           MVN(2, I)
Case 4    1       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
          2       MVN(2, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
Case 5    1       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
          2       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(2, I)
Case 6    1       MVN(1, I)            MVN(1, I)         MVN(1, I)                           MVN(1, I)
          2       MVN(1, I)            MVN(1, I)         MVN(2, I)                           MVN(1, I)

[Figure 4.2: Simulation: average accuracy (bar plots) with standard deviations (error bars) for Cases 1 - 6, comparing C-STM, CMDA1, CMDA2, CPSTM1, CPSTM2, DGTDA1, and DGTDA2.]

The random selection of the testing set is conducted in a stratified sampling manner so that the proportion of samples from each class remains the same in both the training and testing sets. For all models, we report the prediction accuracy, i.e., the proportion of correct predictions over total predictions, on the testing set as the performance metric. The random selection of training and testing data is repeated 50 times. The average prediction accuracy and standard deviation over these 50 repetitions for all cases are reported in Figure 4.2. The results of CP-STM, CMDA, and DGTDA with tensor data are denoted by CPSTM1, CMDA1, and DGTDA1, respectively. The results using matrix data are denoted by CPSTM2, CMDA2, and DGTDA2. From Figure 4.2, we can conclude that our C-STM has a more favorable performance in this multimodal classification problem compared with the single-modality methods. Its accuracy rates are significantly larger than those of the other methods in most cases. In particular, we can see that the accuracy rates of C-STM (pink) increase from case 1 to case 3, while the accuracy rate of CP-STM using only the tensor data remains the same. This is because the difference between the class mean vectors for the first tensor factor does not change from case 1 to 3, whereas the difference between the class mean vectors for the matrix factor increases. Due to this fact, both C-STM and CP-STM with matrix data (yellow) have better performance in case 3. More importantly, C-STM always outperforms CP-STM with matrix data, as it enjoys the extra class information from multiple modalities.
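Before turning to cases 4 - 6, the repeated stratified train/test protocol used to produce Figure 4.2 can be sketched as follows; the use of scikit-learn's StratifiedShuffleSplit and the placeholder baseline classifier are assumptions of this sketch, not the dissertation's implementation.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def evaluate(predict_fn, samples, labels, n_repeats=50, test_size=0.2, seed=0):
    """Average test accuracy over repeated stratified splits (20% held out each time)."""
    labels = np.asarray(labels)
    splitter = StratifiedShuffleSplit(n_splits=n_repeats, test_size=test_size, random_state=seed)
    accuracies = []
    for train_idx, test_idx in splitter.split(np.zeros((len(labels), 1)), labels):
        y_pred = predict_fn([samples[i] for i in train_idx], labels[train_idx],
                            [samples[i] for i in test_idx])
        accuracies.append(np.mean(y_pred == labels[test_idx]))
    return np.mean(accuracies), np.std(accuracies)

# Example with a trivial majority-class baseline standing in for C-STM.
def majority_baseline(train_samples, train_labels, test_samples):
    majority = 1 if np.mean(train_labels == 1) >= 0.5 else 2
    return np.full(len(test_samples), majority)

labels = np.array([1] * 50 + [2] * 50)
samples = list(range(100))  # placeholders for the coupled (tensor, matrix) pairs
print(evaluate(majority_baseline, samples, labels))
```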
In cases 4 and 5, where the class information is in a single modality, the advantage of C-STM is not as significant as in the previous cases, though its performance is still better than CP-STM. This indicates that C-STM can provide robust classification results even when the additional modalities do not provide any class information. In case 6, where the class information comes from the shared factors, C-STM recovers the shared factors and provides significantly better classification accuracy. Through this simulation, we showed that C-STM has a clear advantage when using multimodal data in classification problems, and is robust to redundant data modalities. The performance of tensor discriminant analysis is not as good as that of C-STM and CP-STM because these methods are not designed for CP tensors.

4.7 Trial Classification for Simultaneous EEG-fMRI Data

In this section, we present an application of the proposed method to simultaneous EEG-fMRI data. The data are obtained from [141]. In this study, seventeen individuals (six females, average age 27.7) participated in three runs each of visual and auditory oddball paradigms. In total, 375 stimuli per task (125 per run) were presented for 200 ms each, with a 2-3 s uniformly distributed variable inter-trial interval. A trial is defined as a time window in which subjects receive the stimulus and give a response. In the visual task, a large red circle on an isoluminant gray background was the target (oddball) stimulus, while a small green circle was the standard stimulus. For the auditory task, the standard and oddball stimuli were, respectively, 390 Hz pure tones and broadband sounds which sound like "laser guns". During the experiment, the stimuli were presented to all subjects, and their EEG and fMRI data were collected simultaneously and continuously. We obtain the data from the OpenNeuro website (https://openneuro.org/datasets/ds000116/versions/00003). We utilize both the EEG and fMRI data in this data set with our C-STM model to classify stimulus types across trials.

We pre-process both the EEG and fMRI data with Statistical Parametric Mapping (SPM 12) ([10]) and Matlab. Details of the data pre-processing are provided in Appendix C.2. For each trial, we construct a three-mode tensor corresponding to the EEG data for all subjects, where the modes represent channel × time × subject, denoted as X_{t,1} ∈ ℝ^{34 × 121 × 16}. For the fMRI data, there is only one 3D scan collected from a single subject during each trial. The time mode does not exist in the fMRI data because the trial duration is less than the repetition time of fMRI (the time needed to obtain a single 3D fMRI volume). We further extract fMRI volumes only from voxels in the regions of interest (ROI) for our study. The ROI selection and data extraction are described in Appendix C.2. We extract fMRI volumes from 178 voxels for the auditory oddball tasks, and 112 voxels for the visual tasks. As a result, the fMRI data for each trial are modeled by matrices whose rows and columns stand for subjects and voxels: X_{t,2} ∈ ℝ^{16 × 178} for the auditory task data, and X_{t,2} ∈ ℝ^{16 × 112} for the visual task data. To classify trials with oddball and standard stimuli, we collect 140 multimodal data samples (X_{t,1}, X_{t,2}) from the auditory tasks, and 100 samples from the visual tasks.
For both types of tasks, the numbers of oddball and standard trials are equal. We consider the trials with the oddball stimulus as the positive class, and the trials with the standard stimulus as the negative class. Similar to the simulation study, we select 20% of the data as the testing set, and use the remaining 80% for model estimation and validation. The classification accuracy, precision (positive predictive rate), sensitivity (true positive rate), specificity (true negative rate), and the area under the curve (AUC) of the classifiers are calculated on the test set for each experiment. The experiment is repeated multiple times, and the average accuracy, precision, sensitivity, and specificity, along with their standard deviations, are reported in Table 4.2. The single-modality classifiers CP-STM, CMDA, and DGTDA are also applied to either the EEG or the fMRI data for comparison. The single-modality classifiers applied to the EEG data are denoted by appending the method name with the number "1", and those applied to the fMRI data are denoted by appending the method name with the number "2".

Table 4.2: Real Data Result: Simultaneous EEG-fMRI Data Trial Classification (Mean of Performance Metrics with Standard Deviations in Parentheses)

Task       Method     Accuracy       Precision      Sensitivity    Specificity    AUC
Auditory   C-STM      0.89 (0.05)    0.83 (0.07)    1.00 (0.00)    0.77 (0.11)    0.89 (0.06)
           CP-STM1    0.80 (0.08)    0.71 (0.11)    1.00 (0.00)    0.60 (0.12)    0.78 (0.06)
           CP-STM2    0.83 (0.06)    0.76 (0.07)    0.99 (0.05)    0.65 (0.11)    0.82 (0.05)
           CMDA1      0.55 (0.10)    0.51 (0.09)    0.96 (0.09)    0.20 (0.21)    0.55 (0.06)
           CMDA2      0.67 (0.09)    0.61 (0.11)    0.92 (0.07)    0.46 (0.14)    0.70 (0.08)
           DGTDA1     0.55 (0.09)    0.51 (0.09)    0.94 (0.07)    0.23 (0.12)    0.59 (0.06)
           DGTDA2     0.67 (0.09)    0.60 (0.10)    0.90 (0.09)    0.46 (0.13)    0.68 (0.08)
Visual     C-STM      0.86 (0.06)    0.82 (0.09)    0.93 (0.07)    0.77 (0.12)    0.86 (0.06)
           CP-STM1    0.76 (0.08)    0.66 (0.11)    1.00 (0.00)    0.54 (0.12)    0.78 (0.05)
           CP-STM2    0.77 (0.08)    0.70 (0.11)    0.98 (0.08)    0.58 (0.17)    0.77 (0.07)
           CMDA1      0.53 (0.12)    0.52 (0.11)    0.94 (0.11)    0.11 (0.18)    0.54 (0.08)
           CMDA2      0.65 (0.13)    0.61 (0.14)    0.91 (0.09)    0.43 (0.19)    0.66 (0.09)
           DGTDA1     0.56 (0.11)    0.54 (0.11)    0.94 (0.06)    0.17 (0.12)    0.56 (0.07)
           DGTDA2     0.64 (0.10)    0.60 (0.13)    0.86 (0.10)    0.44 (0.18)    0.64 (0.07)

It can be seen that the classification accuracy of C-STM using multimodal data is higher than that of any classifier based on a single modality, with a significant improvement in terms of the average accuracy rates and average AUC values. This improvement is observed for both the auditory and visual tasks. In particular, the accuracy rate of C-STM in the visual task is 9% higher than that of CP-STM using the fMRI data, the model with the second best performance. This significant performance improvement demonstrates the clear advantage of our C-STM with multimodal data, which is consistent with the conclusions from the simulation study. Similarly, the tensor discriminant analysis methods do not work as well as CP-STM and C-STM, which also agrees with our observations in the simulation study.

4.8 Conclusion

In this work, we have proposed a novel coupled support tensor machine classifier for multimodal data by combining advanced coupled matrix tensor factorization with the support tensor machine. The most distinctive feature of this classifier is its ability to integrate features across different modalities and structures. The approach can simultaneously process matrix and tensor data for classification and can be extended to more than two modalities. Moreover, the coupled matrix tensor decomposition helps unveil the intrinsic correlation structure across different modalities, making it possible to integrate information from multiple sources efficiently.
The newly designed kernel functions in C-STM serve as a feature-level information fusion, combining discriminant information from different modalities. In addition, the kernel formulation makes it possible to utilize the most discriminative features from each modality by tuning the weight parameters in the function. Our theoretical results demonstrate that the C-STM decision rule is statistically consistent. An important theoretical extension of our approach would be the development of excess risk for C-STM. In particular, we look for an explicit expression for the excess risk in terms of data factors from multiple modalities to quantify the contribution of each modality to minimizing the excess risk. By doing so, we are able to interpret the importance of each data modality in classification tasks. In addition, quantifying the uncertainty of tensor and matrix factor estimation and their impact on the excess risk will build the foundation to the next level statistical inference. Another possible future work can be learning the weight parameters in kernel function via optimization problems in the 96 algorithmic aspect. As [55] introduced, the weights in the kernel function can be further estimated by including a group lasso penalty in the objective function. Such a weight estimation procedure can identify the significant data components and reduce the burden of parameter selection. In conclusion, we believe C-STM offers many encouraging possibilities for multimodal data integration and analysis. Its ability to handle multimodal tensor inputs will make it appropriate in many advanced data applications in neuroscience research. 97 APPENDICES 98 APPENDIX A APPENDIX FOR CHAPTER 2 A.1 Proof of Proposition 2.3.1 Proof. The proof of this theorem is quite straightforward. We need to use the tensor product space defined in section 1.2. In addition, since we are discussing general tensor product, we will use ⊗ to denote it. ⊗ can be replaced with outer product ◦ or Kharti-Rao product when facing specific vector or matrix data. The proof will still holds. Let V ( 𝑗) , 𝑗 = 1, , , .𝑑 be compact subsets of R 𝑗 , 𝑗 = 1, ..., 𝑑. The tensor product of these 𝐼 subsets X = ⊗ 𝑑𝑗=1V ( 𝑗) will be again a compact subspace of the tensor space R 𝐼1 ×...×𝐼 𝑑 . Let K (X X) be the kernel sections of tensor kernels we defined in equation (2.2), and C(X X ) = { 𝑓 : X → R} be the collection of all continuous real-valued functions mapping CP tensors to scalars. We have to prove for any 𝑓 ∈ C(X X ), there exist an approximation in K (X X ). We will show such kinds of approximation exists. If 𝑓 ∈ C(X X ), it has sup | 𝑓 | < ∞ due to continuity. Further, since 𝑓 is defined on X and X continuous, then 𝑓 ∈ C(X X ) and can be written as 𝑟 (1) (2) (𝑑) Õ 𝑓 = 𝜆 𝑘 𝑓 𝑘 ⊗ 𝑓 𝑘 ... ⊗ 𝑓 𝑘 + 𝜖 (A.1) 𝑘=1 ( 𝑗) where 𝑓 𝑘 are continuous function defined on V ( 𝑗) , 𝑗 = 1, , , .𝑑. This decomposition exists due to the fact that 𝑓 is defined on X . As a result, 𝑓 belongs to the functional tensor product space  𝑓 : X → R} in definition 1.2.2. It can also be exlpained by the fact that 𝑓 is continuous on a compact space, thus it is multilinear and has such a decomposition (Lemma 4.30 from [59]). 𝜖 is a reminder here, and can be as small as possible since 𝑓 is uniformly bounded. For simplicity, we shall ignore the 𝜖 in the later proof for a while and mention it at the end. 𝜆 𝑘 are bounded since 𝑓 is bounded. If in every mode of the kernel functions are universal, the kernel functions are universal. 
For 99 ( 𝑗) ( 𝑗) each 𝑓 𝑘 , 𝑘 = 1, ..., 𝑟; 𝑗 = 1, ...𝑑, there is a function 𝑔 𝑘 ∈ 𝑠𝑝𝑎𝑛{𝐾𝑥 : 𝑥 ∈ V ( 𝑗) }, which is from the kernel sections of the corresponding mode, such that ( 𝑗) ( 𝑗) sup | 𝑓 𝑘 − 𝑔 𝑘 | < 𝜖 𝑘 = 1, ..., 𝑟; 𝑗 = 1, ...𝑑 (A.2) V ( 𝑗) for any arbitrary 𝜖 > 0. Then for 𝑟 (1) (2) (𝑑) Õ 𝑔= 𝜆 𝑘 𝑔 𝑘 ⊗ 𝑔 𝑘 ... ⊗ 𝑔 𝑘 (A.3) 𝑘=1 We can have 𝑟 𝑟 (1) (2) (𝑑) (1) (2) (𝑑) Õ Õ sup | 𝑓 − 𝑔 | = sup | 𝜆𝑘 𝑓 𝑘 ⊗ 𝑓 𝑘 ... ⊗ 𝑓𝑘 − 𝜆 𝑘 𝑔 𝑘 ⊗ 𝑔 𝑘 ... ⊗ 𝑔 𝑘 | X ∈XX X ∈X X 𝑘=1 𝑘=1 𝑟 𝑑 𝑑 ( 𝑗) ( 𝑗) Õ Ö Ö = sup |𝜆 𝑘 || 𝑓 𝑘 (𝑥𝑥 ( 𝑗) ) − 𝑔 𝑘 (𝑥𝑥 ( 𝑗) )| (A.4) X ∈X X 𝑘=1 𝑗=1 𝑗=1 6 𝑟𝑑𝜖 · max(|𝜆 𝑘 |) The last step is because of a simple inequality |𝑎 1 𝑎 2 − 𝑏 1 𝑏 2 | 6 |𝑎 1 ||𝑎 2 − 𝑏 2 | + |𝑏 2 ||𝑎 1 − 𝑏 1 |, and universal property in definition 2.3.1. Since r, d, and 𝜆 𝑘 are all bounded, let the 𝜖 becomes as small as possible, we have sup | 𝑓 − 𝑔 | 6 𝜖 ∀ 𝑓 ∈ C(X X) (A.5) X ∈XX for any arbitrary 𝜖 > 0. The proof is completed.  A.2 Proof of Theorem 2.3.1 Proof. The convergence in the theorem can be showed in two steps. Given the parameter 𝜆, we denote 𝑓𝑛𝜆 = arg min 𝜆|| 𝑓 || 2 + R L,𝑇𝑛 ( 𝑓 ) 𝑓 𝜆 = arg min 𝜆|| 𝑓 || 2 + R L ( 𝑓 ) H 𝑓 ∈H H 𝑓 ∈H where ∫ R L ( 𝑓 ) = E (X×Y) L (𝑦, 𝑓 (X X)) = L (𝑦, 𝑓 (XX))𝑑P 100 and 𝑛 1Õ X𝑖 )  R L,𝑇𝑛 ( 𝑓 ) = L 𝑦𝑖 , 𝑓 (X 𝑛 𝑖=1 L is a loss function. 𝑓𝑛𝜆 is the optimal classifier learned from training data, and 𝑓 𝜆 is the optimal from the RKHS H generated by tensor kernel function (2.2). Since the Bayes risk under loss function L is defined as R ∗ = min R ( 𝑓 ) over all functions defined on X , we can immediate X →Y 𝑓 :X Y show that |R ( 𝑓 𝜆 ) − R ∗ | 6 E (X×Y) |L (𝑦, 𝑓 𝜆 (X X)) − L (𝑦, 𝑓 ∗ (X X))| 6 𝐶 (𝐾𝑚𝑎𝑥 ) sup | 𝑓 𝜆 − 𝑓 ∗ | (A.6) 6 𝐶 (𝐾𝑚𝑎𝑥 ) · 𝜖 This is the result of using condition Con.1 and Con.2 in the theorem. 𝑓 𝜆 is in the RKHS and thus bounded by some constant depending on 𝐾𝑚𝑎𝑥 . 𝑓 ∗ is also continuous on compact subspace X (because all the tensor components considered are bounded in condition Con.1) and thus is bounded. The universal approximating property in condition Con.3 makes equation (A.6) vanishes as 𝜖 goes to zero. Thus, the consistency result can be established if we show |R ( 𝑓𝑛𝜆 ) − R ( 𝑓 𝜆 )| converges to zero. This can be done with Hoeffding equality ([36]) and Rademacher complexity (see Theorem B.5.1). From the objective function (2.4), we have R L,𝑇𝑛 ( 𝑓𝑛 ) + 𝜆 𝑛 || 𝑓𝑛 || 2 6 𝐿 0 (A.7) q 𝐿 under condition Con.2 when we simply let 𝑓 = 0 as a naive classifier. Thus, || 𝑓𝑛 || 6 𝜆𝑛0 . Let q 𝐿 𝑀𝑛 = 𝜆𝑛0 . 𝑓𝜖 ∈ H such that R L ( 𝑓𝜖 ) 6 R L ( 𝑓 𝜆 ) + 2𝜖 . || 𝑓𝜖 || 6 𝑀𝑛 when 𝑛 is sufficiently large. Due to condition Con.4, 𝜆 𝑛 → 0, making 𝑀𝑛 → ∞. Further notice that we introduce 𝑓𝜖 since it is independent of 𝑛. As a result, its norm, even though is bounded by 𝑀𝑛 , is a constant and is not changing with respect to 𝑛. 
By Rademacher complexity, the following inequality holds with 101 probability at least 1 − 𝛿, where 0 < 𝛿 < 1 r 2𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 log 2/𝛿 R L ( 𝑓𝑛𝜆 ) 6 R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) + √ + (𝐿 0 + 𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 ) 𝑛 2𝑛 2𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 𝑓𝜖 is not the optimal in training data 6 R L,𝑇𝑛 ( 𝑓𝜖 ) + 𝜆 𝑛 || 𝑓𝜖 || 2 − 𝜆 𝑛 || 𝑓𝑛𝜆 || 2 + √ 𝑛 r log 2/𝛿 + (𝐿 0 + 𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 ) 2𝑛 2𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 Drop (𝜆𝑛 | | 𝑓𝑛𝜆 | | 2 > 0) 6 R L,𝑇𝑛 ( 𝑓 𝜖 ) + 𝜆 𝑛 || 𝑓 𝜖 || 2 + √ 𝑛 r log 2/𝛿 + (𝐿 0 + 𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 ) 2𝑛 4𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 Rademacher Complexity again 6 R L ( 𝑓𝜖 ) + 𝜆 𝑛 || 𝑓𝜖 || 2 + √ 𝑛 r log 2/𝛿 + 2(𝐿 0 + 𝐶 (𝐾𝑚𝑎𝑥 ) 𝑀𝑛 ) 2𝑛 Let 𝛿 = 12 , and 𝑁 large such that for all 𝑛 > 𝑁, 𝑛 r 4𝐶 (𝐾𝑚𝑎𝑥 )𝑀𝑛 log 2/𝛿 𝜖 𝜆 𝑛 || 𝑓𝜖 || 2 + √ + 2(𝐿 0 + 𝐶 (𝐾𝑚𝑎𝑥 ) 𝑀𝑛 ) 6 𝑛 2𝑛 2 The inequality exists because || 𝑓𝜖 || is a constant with respect to 𝑛, and all other terms are converging to zero. Thus 𝜖 R L ( 𝑓𝑛𝜆 ) 6 R L ( 𝑓𝜖 ) + 6 RL ( 𝑓 𝜆 ) + 𝜖 2 with probability 1 − 12 . We conclude that 𝑛 P(|R L ( 𝑓𝑛𝜆 ) − R L ( 𝑓 𝜆 )| > 𝜖) → 0 (A.8) for any arbitrary 𝜖. This establishes the weak consistency of CP-STM. For strong consistency, we consider for each 𝑛 ∞ ∞ Õ Õ 1 P(|R L ( 𝑓𝑛𝜆 ) − RL ( 𝑓 𝜆 )| > 𝜖) 6 𝑁 − 1 + 6∞ 𝑛 2 𝑛=1 𝑛=1 By Borel-Cantelli Lemma ([42]), R L ( 𝑓𝑛𝜆 ) → R L ( 𝑓 𝜆 ) almost surely. The proof is finished.  102 APPENDIX B APPENDIX FOR CHAPTER 3 B.1 Proof for Proposition 3.3.1 Suppose a CP rank-𝑟 tensor X = È𝑋 𝑋 1 , .., 𝑋 𝑑 É is given with size 𝐼1 × 𝐼2 .. × 𝐼 𝑑 . With a rank-1 projection tensor A 𝑝 , the CP tensor random projection (3.5) can be written as (1) (𝑑) [ 𝑓 TPR-CP (X)] 𝑝 =< È𝐴 𝐴 𝑝 , ..., 𝐴 𝑝 É, È𝑋 𝑋 (1) , .., 𝑋 (𝑑) É > 𝑟 (1) (2) (𝑑) (1) (2) (𝑑) Õ =< 𝑎𝑝 ◦ 𝑎𝑝 ◦ ... ◦ 𝑎 𝑝 , 𝑥 𝑘 ◦ 𝑥 𝑘 ◦ ... ◦ 𝑥 𝑘 > (B.1) 𝑘=1 𝑟 (1)𝑇 (1) (2)𝑇 (2) (𝑑)𝑇 (𝑑) Õ = < 𝑎𝑝 ,𝑥𝑘 > ◦ < 𝑎𝑝 ,𝑥𝑘 > ...◦ < 𝑎 𝑝 ,𝑥𝑘 > 𝑘=1 ( 𝑗) ( 𝑗) ( 𝑗) 𝑎 𝑝 , 𝑥 𝑘 ∈ R 𝑗 , 𝑗 = 1, ..., 𝑑 are CP factors for the projection tensor A 𝑝 and CP tensor X. 𝑎 𝑝 ∼ 𝐼 𝑀𝑉 𝑁 (00, 𝜎 2 𝐼 ) is a multivariate random variable whose elements are identically and independently ( 𝑗) distributed. Also, 𝑎 𝑝 are identically and independently distributed with different value of 𝑝 = 1, .., 𝑃 and 𝑗 = 1, ..., 𝑑. Now we consider the tensor-to-tensor random projection (3.6), and let ( 𝑗)𝑇 ( 𝑗)𝑇 𝑃 ×𝐼 𝐴 ( 𝑗) = (𝑎𝑎 1 , ..., 𝑎 𝑃 )𝑇 ∈ R 𝑗 𝑗 be the random projection matrices in (3.6). Notice that 𝑗 ( 𝑗) the rows of matrices 𝐴 ( 𝑗) are identically and independently distributed as the 𝑎 𝑝 , since the elements in the matrices 𝐴 ( 𝑗) are also identically and independently distributed as N (0, 𝜎 2 ). The tensor-to-tensor CP random projection (3.6) is 𝑓 TPR-CP-TT (X) = È𝐴 𝐴 (1) 𝑋 (1) , 𝐴 (2) 𝑋 (2) , ..., 𝐴 (𝑑) 𝑋 (𝑑) É 𝑟 (1) (2) (𝑑) Õ = < 𝐴 (1) , 𝑥 𝑘 > ◦ < 𝐴 (2) , 𝑥 𝑘 > ...◦ < 𝐴 (𝑑) , 𝑥 𝑘 > (B.2) 𝑘=1 𝑟 (1) (2) (𝑑) Õ = 𝑣 𝑘 ◦ 𝑣 𝑘 ◦ ... ◦ 𝑣 𝑘 𝑘=1 ( 𝑗) ( 𝑗) 𝑣 𝑘 =< 𝐴 ( 𝑗) , 𝑥 𝑘 >∈ R 𝑗 , 𝑗 = 1, ..., 𝑑 since it is just matrix vector multiplication. To show 𝑃 the equivalence between (3.5) and (3.6), we have to show that [ 𝑓 TPR-CP-TT (X)] 𝑝 1 ,𝑝 2 ,...,𝑝 𝑑 = Î [ 𝑓 TPR-CP (X)] 𝑝 when a index mapping 𝜋 is given, 𝜋 ( 𝑝 1 , 𝑝 2 , ..., 𝑝 𝑑 ) = 𝑝, and 𝑑𝑗=1 𝑝 𝑗 = 𝑝. 103 Î𝑑 For an arbitrary 𝑝 = 𝑗=1 𝑝 𝑗 and 𝜋 ( 𝑝 1 , 𝑝 2 , ..., 𝑝 𝑑 ) = 𝑝, we can find the element in projected tensor with index 𝑝 1 , 𝑝 2 , ..., 𝑝 𝑑 is 𝑟 (1) (2) (𝑑) Õ [ 𝑓 TPR-CP-TT (X)] 𝑝 1 ,𝑝 2 ,...,𝑝 𝑑 = 𝑣 𝑝 𝑣 𝑝 ...𝑣𝑣 𝑝 1 2 𝑑 𝑘=1 𝑟 (B.3) (1)𝑇 (1) (2)𝑇 (2) (𝑑)𝑇 (𝑑) Õ = < 𝑎𝑝 ,𝑥𝑘 >◦< 𝑎𝑝 ,𝑥𝑘 > ...◦ < 𝑎𝑝 ,𝑥𝑘 > 1 2 𝑑 𝑘=1 ( 𝑗) ( 𝑗) where 𝑎 𝑝 𝑗 ∈ R 𝑗 are rows of matrices 𝐴 ( 𝑗) . Since 𝑎 𝑝 𝑗 are identically and independently distributed 𝐼 ( 𝑗) ( 𝑗) for all 𝑝 𝑗 , 𝑎 𝑝 𝑗 are equivalent to the 𝑎 𝑝 in equation (B.1). 
Thus, the equation (B.3) is equivalent to the equation (B.1). The proof is finished. Indeed, the equivalence can be identified as follow: (1) (𝑑) ( 𝑗) For each projection tensor A 𝑝 in (3.5), A 𝑝 = 𝑎 𝑝 ◦ ... ◦ 𝑎 𝑝 , where 𝑎 𝑝 𝑗 ∈ R 𝑗 are 𝑝 𝑗 -th rows of 𝐼 1 𝑑 matrices 𝐴 ( 𝑗) . The order is decided by the index mapping 𝜋 ( 𝑝 1 , 𝑝 2 , ..., 𝑝 𝑑 ) = 𝑝. B.2 Proof of Proposition 3.5.1 We use the adding and subtraction trick to prove the proposition.   EA R L (𝑔𝑔𝑛𝜆 ) − R L ∗ = [E [R (𝑔 𝜆 𝑔𝑛𝜆 )] A L 𝑔𝑛 ) − R L,𝑇 A (𝑔 𝑛   + EA R A (𝑔𝑔𝑛𝜆 ) − R A ( 𝑓A𝜆 ,𝑛 ) − 𝜆|| 𝑓A𝜆 ,𝑛 || 2 L,𝑇𝑛 L,𝑇𝑛 + EA [R ( 𝑓A𝜆 ,𝑛 ) − R L ( 𝑓A𝜆 ,𝑛 )] L,𝑇𝑛A + [EA [R L ( 𝑓A𝜆 ,𝑛 ) + 𝜆|| 𝑓A𝜆 ,𝑛 || 2 ] − R L ( 𝑓𝑛𝜆 ) − 𝜆|| 𝑓𝑛𝜆 || 2 ] + [R L ( 𝑓𝑛𝜆 ) − R L,𝑇𝑛 ( 𝑓𝑛𝜆 )] + [R𝑇𝑛 ( 𝑓𝑛𝜆 ) + 𝜆|| 𝑓𝑛𝜆 || 2 − R L,𝑇𝑛 ( 𝑓 𝜆 ) − 𝜆|| 𝑓 𝜆 || 2 ] + [R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 )] + [R L ( 𝑓 𝜆 ) − R L,H ∗ ∗ ∗ 𝜆 2 H + R L,H H − R L + 𝜆|| 𝑓 || ]      𝜆 𝜆   𝜆 𝜆  6 EA R L (𝑔𝑔𝑛 ) − R A (𝑔𝑔𝑛 ) + EA R A ( 𝑓A ,𝑛 ) − R L ( 𝑓A ,𝑛 ) L,𝑇𝑛 L,𝑇𝑛     + R L ( 𝑓𝑛𝜆 ) − R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) + R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 )    𝜆 𝜆 2  𝜆 𝜆 2 + EA R L ( 𝑓A,𝑛 ) + 𝜆|| 𝑓A,𝑛 || − R L ( 𝑓𝑛 ) − 𝜆|| 𝑓𝑛 || + 𝐷 (𝜆) + R L,H ∗ ∗ H − RL Since 𝐷 (𝜆) = R L ( 𝑓 𝜆 ) + 𝜆|| 𝑓 𝜆 || − R L,H ∗ , there are only two terms are dropped in the derivation. H 104 • The first term dropped is EA [R (𝑔𝑔𝑛𝜆 ) − R ( 𝑓A𝜆 ,𝑛 ) − 𝜆|| 𝑓A𝜆 ,𝑛 || 2 ]. As we explained in L,𝑇𝑛A L,𝑇𝑛A the paper, 𝑓A𝜆 ,𝑛 is the decision function but with coefficients estimated from CP-STM model. As a result, it is not the optimal of the objective function (3.8). Since 𝑔𝑛𝜆 minimizes the objective function (3.8) and 𝜆||𝑔𝑔𝑛𝜆 || 2 > 0, we get R (𝑔𝑔𝑛𝜆 ) − R ( 𝑓A𝜆 ,𝑛 ) − 𝜆|| 𝑓A𝜆 ,𝑛 || 2 6 R (𝑔𝑔𝑛𝜆 ) + 𝜆||𝑔𝑔𝑛𝜆 || 2 − R ( 𝑓A𝜆 ,𝑛 ) L,𝑇𝑛A L,𝑇𝑛A L,𝑇𝑛A L,𝑇𝑛A − 𝜆|| 𝑓A𝜆 ,𝑛 || 2 (Since 𝑔𝑛𝜆 minimizes (3.8)) 60 The inequality holds for all random projection defined by random tensor A , so EA [R (𝑔𝑔𝑛𝜆 ) − R ( 𝑓A𝜆 ,𝑛 ) − 𝜆|| 𝑓A𝜆 ,𝑛 || 2 ] 6 0 L,𝑇𝑛A L,𝑇𝑛A • The second term dropped is [R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) + 𝜆|| 𝑓𝑛𝜆 || 2 − R L,𝑇𝑛 ( 𝑓 𝜆 ) − 𝜆|| 𝑓 𝜆 || 2 ]. Similar to the previous dropped term, this term is also less or equal to zero. As we defined, 𝑓𝑛𝜆 minimizes the objective function (3.1) that evaluates loss over the training data 𝑇𝑛 . Even though 𝑓 𝜆 is the class optimal with infinite-size training data, its objective function still has a greater value than that of 𝑓𝑛𝜆 . By comparing the values of objective function (3.1) on 𝑓𝑛𝜆 and 𝑓𝑛𝜆 , we can see that [R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) + 𝜆|| 𝑓𝑛𝜆 || 2 − R L,𝑇𝑛 ( 𝑓 𝜆 ) − 𝜆|| 𝑓 𝜆 || 2 ] 6 0 By dropping these two non-positive terms, we prove the proposition. B.3 Discussion on Assumptions AS.8 Assumption AS.8 ensures that the Bayes risk remains unchanged after random projection. This assumption has also been made in [24]. This is a necessary condition to align our results with the definition of classification consistency. For any arbitrary random projection A , it is obvious that R L,A ∗ > RL ∗ . This is because the A optimal Bayes risk is achieved by choosing any measureable function. A function compose random projection A and a decision rule should be measureable, and thus should be considered when searching for Bayes rules. If R L,A ∗ > RL ∗ , then smallest achievable risk in projected data will no A 105 longer be R L∗ . By definition, a decision rule learned from the projected data just have to reach ∗ 𝜆 ) → R ∗ we show in the   to the R L,A A to be consistent. This deviates from the result EA R L (𝑔𝑔𝑛 L ∗  𝜆  paper. 
Thus, we need the condition to guarantees that EA R L (𝑔𝑔𝑛 ) → R L indicating RPSTM’s consistency aligns with the definition. In [33, 24], many examples satisfying this condition are provided. Typically, if there is an random projection A such that E[𝑦| 𝑓 TPR-CP-TT (X X)] and X are independent, then the condition is satisfied. B.4 Proof of Proposition 3.5.2 Johnson-Lindenstrauss lemma gives concentration bound on the error introduced by random projection in a single mode. (e.g. see [72] and [34]) We first show how this property is applied at each mode of the tensor CP components in the following lemma. ( 𝑗) ( 𝑗) 𝐼 ×1 Lemma B.4.1. For each fixed mode 𝑗 = 1, 2, .., 𝑑 and any two tensor CP factors 𝑥 1 , 𝑥 2 ∈ R 𝑗 among 𝑛 training vectors, with probability at least (1 − 𝛿1 ) and the random projection matrices described in AS.5, we have 𝐴 ( 𝑗) 𝑥 1( 𝑗) − 𝐴 ( 𝑗) 𝑥 2( 𝑗) || 22 − ||𝑥𝑥 1( 𝑗) − 𝑥 2( 𝑗) || 22 6 𝜖 ||𝑥𝑥 1( 𝑗) − 𝑥 2( 𝑗) || 22 ||𝐴 log 𝛿𝑛 𝑃 𝑗 ×𝐼 𝑗 Proof. The matrix 𝐴𝑗 ∈R , where 𝑃 𝑗 = 𝑂 ( 2 1 ). Under condition AS.5, the inequality 𝜖 holds due to the JL-property ([72, 34]).  Next, we apply this lemma to multiple modes of tensor CP factors, and derive a bound for the difference between projected tensor kernel function (3.9) and tensor kernel function (3.2). We need the following lemma and corollary to derive the bound. Lemma B.4.2. Consider a 2𝑑 degree polynomial of independent Centered Gaussian or Rademacher random variables as 𝑄 2𝑑 (𝑌 ) = 𝑄 2𝑑 (𝑦𝑖 , .., 𝑦 𝑑 ). Then for some 𝜖 > 0 and 𝜉 > 0 constant.   𝜖 2𝑑  1 P(|𝑄 2𝑑 (𝑌 ) − E(𝑄 2𝑑 (𝑌 ))| > 𝜖 𝑑) 6 𝑒 2 exp − 2𝑑 𝜉Var[𝑄 2𝑑 (𝑌 )] 106 Proof. The proof can be found using hypercontractivity, [69] Thm 6.12 and Thm 6.7. This result is also mentioned in [124].  From this lemma, we can show a corollary about the difference between projected tensor CP components and original CP components. 𝑟 (1) (𝑑) Corollary B.4.2.1. For any two d-mode tensors in rank-r CP form, X 1 = Í 𝑥 1,𝑘 ◦ ... ◦ 𝑥 1,𝑘 and 𝑘=1 𝑟 (1) (𝑑) X2 = Í 𝑥 2,𝑘 ◦ ... ◦ 𝑥 2,𝑘 , concentration bounds of polynomials can be derived below. Given 𝜖 > 0, 𝑘=1 𝜉 > 0 constant, we have following JL type result. 𝑟 Ö 𝑑 𝑟 Ö 𝑑 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) Õ Õ Õ P( 𝐴 ( 𝑗) 𝑥 1,𝑘 |𝐴 − 𝐴 ( 𝑗) 𝑥 2,𝑙 || 22 − ||𝑥𝑥 1,𝑘 − 𝑥 2,𝑙 || 22 > 𝜖𝑑 ||𝑥𝑥 1,𝑘 − 𝑥 2,𝑙 || 22 ) 𝑘,𝑙=1 𝑗=1 𝑘,𝑙=1 𝑗=1 𝑘,𝑙=1 𝑗=1  𝜖 2𝑑 Î𝑑 𝑃 ( 𝑗)  1 𝑗=1 2𝑑 6 𝑒 2 exp(− ) 3𝑑 𝑟 4 Proof. It is known that variable for any vector 𝑥 ( 𝑗) ∈ R 𝑗 and any matrix 𝐴 ( 𝑗) made out of entries 𝐼 ( 𝑗) ( 𝑗) of following independent normal with mean 0 and variance 𝑃1 , linear combination 𝐴 ( 𝑗)𝑥 ∼ 𝑗 k𝑥𝑥 k 2 k𝐴𝐴 ( 𝑗) 𝑥 ( 𝑗) k 2 𝑀𝑉 𝑁 (00, 𝑃1 𝐼 ). So, 𝑃 𝑗 ( 𝑗) 2 follows Chi-square of degree of freedom 𝑃 𝑗 . Using the fact that 𝑗 k𝑥𝑥 k 2 expression, ( 𝑗) ( 𝑗) Õ 𝑟 Ö 𝑑 ||𝐴𝐴 ( 𝑗) 𝑥 1,𝑘 − 𝐴 ( 𝑗) 𝑥 2,𝑙 || 22 ( 𝑗) ( 𝑗) 𝑘,𝑙=1 𝑗=1 ||𝑥𝑥 1,𝑘 − 𝑥 2,𝑙 || 22 is the sum of 𝑟 2 identically distributed random variables with correlation 1. Therefore, the variance of the sum is (𝑟 2 ) 2 times variance of an individual term. Here each element in the summation is a product of 𝑑 independent scaled Chi-Square variables. Thus, each element in the summation is polynomial of degree equals to 2d of Gaussian random variables. This result follows from lemma B.4.2. In view of similar result can be found in [119], the constant 𝜉 can be assumed to be 1. It is worth noting that for polynomial of degree 2 or 𝑑 = 1, sharper bound can be obtained as illustrated in [72, 34].  Now we present the bound for tensor kernels. 107 𝑟 (1) (𝑑) Proposition B.4.1. 
For any two d-mode tensors in rank-r CP form, X1 = Í 𝑥 1,𝑘 ◦ ... ◦ 𝑥 1,𝑘 and 𝑘=1 𝑟 (1) (𝑑) X2 = Í 𝑥 2,𝑘 ◦ ... ◦ 𝑥 2,𝑘 , concentration bounds of polynomials can be derived below. Suppose the 𝑘=1 random projection 𝑓 TPR-CP-TT is defined by projection tensor A , which satisfies the assumption AS.5. Given 𝜖 > 0, 𝛿1 depending on 𝜖 as given in corollary B.4.2.1, 𝐶𝑑,𝑟 constant depending on 𝑑, we have following JL-type result. For a given tensor kernel function 𝐾 (·, ·),   X1 ), 𝑓 TPR-CP-TT (X X2 ) − 𝐾 (X X1 , X 2 ) > 𝐶𝑑,𝑟  P 𝐾 𝑓 TPR-CP-TT (X 𝜖𝑑 6 𝛿1 Proof. X1 ), 𝑓 TPR-CP-TT (X X2 ) − 𝐾 (X X1 , X 2 )|  |𝐾 𝑓 TPR-CP-TT (X 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) Õ =| 𝐾 ( 𝑗) (𝐴𝐴 ( 𝑗) 𝑥 1𝑘 , 𝐴 ( 𝑗) 𝑥 2𝑙 ) − 𝐾 ( 𝑗) (𝑥𝑥 1𝑘 , 𝑥 2𝑙 )| 𝑘,𝑙=1 𝑗=1 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) Õ 6 |𝐾 ( 𝑗) (𝐴 𝐴 ( 𝑗) 𝑥 1𝑘 , 𝐴 ( 𝑗) 𝑥 2𝑙 ) − 𝐾 ( 𝑗) (𝑥𝑥 1𝑘 , 𝑥 2𝑙 )| 𝑘,𝑙=1 𝑗=1 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) Õ (Lipschitz continuity in AS.4) 6 𝐿 𝐾 | ||𝐴 𝐴 ( 𝑗) 𝑥 1𝑘 − 𝐴 ( 𝑗) 𝑥 2𝑙 || 22 − ||𝑥𝑥 1𝑘 − 𝑥 2𝑙 || 22 | 𝑘,𝑙=1 𝑗=1 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) ( 𝑗) ( 𝑗) Õ (Max 𝐿𝑘 ( 𝑗) in AS.4) 6 𝐿𝐾 𝑑 𝐴 ( 𝑗) 𝑥 1𝑘 | ||𝐴 − 𝐴 ( 𝑗) 𝑥 2𝑙 || 22 − ||𝑥𝑥 1𝑘 − 𝑥 2𝑙 || 22 | 𝑘,𝑙=1 𝑗=1 𝑟 Ö 𝑑 ( 𝑗) ( 𝑗) Õ (Corollary B.4.2.1 ) 6 𝜖 𝑑 𝐿𝐾𝑑 ||𝑥𝑥 1𝑘 − 𝑥 2𝑘 || 22 𝑘,𝑙=1 𝑗=1 Õ 𝑟 (Assumption AS.7) 6 𝑑 𝐵2𝑑 2𝑑 𝜖 𝑑 𝐿 𝐾 𝑥 𝑘,𝑙=1 6 2𝑑 𝑟 2 𝐿 𝐾 𝑑 𝐵2𝑑 𝜖 𝑑 𝑥 (Denote 𝐶𝑑,𝑟 = 2𝑑 𝑟 2 𝐿𝐾 𝑑 2𝑑 𝐵𝑥 ) = 𝐶𝑑,𝑟 𝜖 𝑑 Such part vanishes as 𝜖 𝑑 becomes as small as possible.  The proposition shows that the difference between the projected kernel and original kernel function can be bounded with probability at least 1 − 𝛿1 , when condition AS.4, AS.5, and AS.7 hold. 108 Now we include conditions AS.1, AS.6 together with the previous results to show proposition 3.5.2. With a single random projection defined by A , the extra risk from random projection |R L ( 𝑓A𝜆 ,𝑛 ) + 𝜆|| 𝑓A𝜆 ,𝑛 || 2 − R L ( 𝑓𝑛𝜆 ) − 𝜆|| 𝑓𝑛𝜆 || 2 | contains two parts. They are bounded in separate ways. • Difference between risks can be bounded by the following inequality. With probability at least 1 − 𝛿1 (with respect to random projection), 0 < 𝛿1 < 1, 𝜆 X), 𝑦) − L ( 𝑓 𝜆 (X 𝑛 X), 𝑦)]  R L ( 𝑓A𝜆 ,𝑛 ) − R L ( 𝑓𝑛𝜆 ) = E (X X ×YY ) L ( 𝑓A ,𝑛 (X r 𝐿0 X) − 𝑓𝑛𝜆 (X X)|  𝜆  ( AS.1 and Jensen’s Inequality) 6 𝐶 (𝐾𝑚𝑎𝑥 ) · E (XX ×Y Y ) | 𝑓A ,𝑛 (X 𝜆 r 𝑛 𝐿0  Õ 6 𝐶 (𝐾𝑚𝑎𝑥 )· |𝛼𝑖 |E (X X ×YY ) {|𝑦𝑖 |· 𝜆 𝑖=1 X1 ), 𝑓 TPR-CP-TT (X X2 ) − 𝐾 (X X1 , X 2 )|}   |𝐾 𝑓 TPR-CP-TT (X r 𝑛 𝐿0  Õ 𝑑  ( Proposition B.4.1 and |𝑦𝑖 | = 1) 6 𝐶 (𝐾𝑚𝑎𝑥 )· |𝛼𝑖 | · E (X Y ) [𝐶𝑑,𝑟 𝜖 ] X ×Y 𝜆 𝑖=1 r 𝐿0 ( Expectation over constant) 6 𝐶 (𝐾𝑚𝑎𝑥 ) · Ψ𝐶𝑑,𝑟 𝜖 𝑑 𝜆 𝑛 X) = 𝛼𝑇 𝐷 𝑦 𝐾 (X X) ∈ H }. Í where Ψ = sup{||𝛼 𝛼 || 1 = |𝛼𝑖 | : 𝑓 (X 𝑖=1 • Difference between functional norms can be bounded in a similar way. With probability at least 1 − 𝛿1 (with respect to random projection), 0 < 𝛿1 < 1, Õ𝑛 Õ 𝑛 |𝜆|| 𝑓A𝜆 ,𝑛 || 2 − 𝜆|| 𝑓𝑛𝜆 || 2 | 6𝜆 𝛼𝑖 𝛼𝑙 |𝑦𝑖 ||𝑦 𝑙 |· 𝑖=1 𝑙=1 X1 ), 𝑓 TPR-CP-TT (X X2 ) − 𝐾 (X X1 , X 2 )|  ( Absolute value) |𝐾 𝑓 TPR-CP-TT (X Õ𝑛 Õ𝑛 ( Proposition B.4.1 and |𝑦𝑖 | = 1) 6 𝜆( |𝛼𝑖 |) · ( |𝛼𝑙 |) · 𝐶𝑑,𝑟 𝜖 𝑑 𝑖=1 𝑙=1 6 𝜆Ψ2𝐶 𝑑,𝑟 𝜖𝑑 Each of these two inequalities hold with probability at least 1 − 𝛿1 , then two inequalities hold simultaneously with probability at least 1−2𝛿1 . This can be showed with simple probability theory, 109 since the probability of at least one inequality does not hold is no more than 2𝛿1 . (Probability of union is no more then the sum of probabilities.) 
As a result, we conclude that with probability at least 1 − 𝛿1 with respect to random projection r 𝜆 2 𝜆 𝜆 2 𝐿0 𝜖𝑑 𝜆 |R L ( 𝑓A ,𝑛 ) + 𝜆|| 𝑓A ,𝑛 || − R L ( 𝑓𝑛 ) − 𝜆|| 𝑓𝑛 || | 6 𝐶𝑑 Ψ [𝐶 (𝐾𝑚𝑎𝑥 ) + 𝜆Ψ] 𝜖 𝑑 = 𝑂 ( 𝑞 ) 𝜆 𝜆 The proposition 3.5.2 is proved. B.5 Proof of Theorem 3.5.2 Theorem 3.5.2 establishes an upper bound on the excess risk of RPSTM model under a single random projection. Although it does not give out the statistical consistency of RPSTM we are pursuing, it summarizes the conclusions from proposition 3.5.2 and bound the excess risk in proposition 3.5.1 under a single random projection. The theorem assumes that if all the conditions AS.1 - AS.9 hold, then the excess risk under a single random projection can be bounded with probability at least (1 − 2𝛿1 )(1 − 𝛿2 ) ∗ 6 𝑉 (1) + 𝑉 (2) + 𝑉 (3) R L (𝑔𝑔𝜆𝑛 ) − R L q √ q q 𝐿0 𝐿 log(2/𝛿2 ) 2 log(2/𝛿2 ) • 𝑉 (1) = 12𝐶 (𝐾𝑚𝑎𝑥 𝜆 ) · 𝐾𝑚𝑎𝑥 √ 0 + 9 𝜁𝜆 ˜ 2𝑛 + 2𝜁𝜆 𝑛 𝑛𝜆 • 𝑉 (2) = 𝐷 (𝜆) q 𝐿0 • 𝑉 (3) = 𝐶𝑑,𝑟 Ψ · [𝐶 (𝐾𝑚𝑎𝑥 𝜆 ) + 𝜆Ψ]𝜖 𝑑 (1 − 2𝛿1 ) is the probability with respect to random projections, and (1 − 𝛿2 ) is with respect to the randomness of choosing training data 𝑇𝑛 . As noted in the theorem, 𝜁˜𝜆 can be regarded as the supreme X, 𝑦) → L ( 𝑓 (X X), 𝑦) : 𝑓 ∈ F ,  of the infinity norm of a function in the collection L ◦ F = ℎ : (X i.e. 𝜁˜𝜆 = sup ||ℎℎ || ∞ = sup sup X, 𝑦)| |ℎℎ (X ℎ ∈L◦FF ℎ ∈L◦F F X,𝑦)∈(X (X X ×Y Y) q  𝐿0 F = 𝑓 : || 𝑓 || ∞ 6 𝐾𝑚𝑎𝑥 𝜆 . All functions in the collection L ◦ F are compositing a loss function together with a decision function, and are bi-variate. 𝜁𝜆 is a special case of 𝜁˜𝜆 by letting 110 the decision function to be the optimal CP-STM 𝑓 𝜆 , i.e. 𝜁𝜆 = sup |L ( 𝑓 𝜆 (XX), 𝑦)| X,𝑦)∈(X (X X ×Y Y) As for Ψ, it is the supreme of the L-1 norm for CP-STM coefficient vector. In order to show this theorem, we can use the result from proposition 3.5.1, but without taking expectation over random projections. With a single random projection, the proof of Proposition 3.5.2 in appendix B.4 already shows how the term 𝑉 (3) is developed. The probability component with respect to random projection depicts the chance of 𝑉 (3) term being true, and is explained in the proof. 𝑉 (2) is directly taken from the risk decomposition in Proposition 3.5.1. Thus, we only have to show 𝑉 (1) to establish the theorem. Indeed, our discussion in Proposition 3.5.1 unveils that, except the terms already bounded by 𝑉 (2), 𝑉 (3), and the term vanishes due to universal tensor kernels, 𝑉 (1) only bounds the gaps between empirical risk and expected risk, which are listed below. R L (𝑔𝑔𝑛𝜆 ) − R (𝑔𝑔𝑛𝜆 ), R L ( 𝑓𝑛𝜆 ) − R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) L,𝑇𝑛A R ( 𝑓A𝜆 ,𝑛 ) − R L ( 𝑓A𝜆 ,𝑛 ), R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 ) L,𝑇𝑛A Notice that consider the problem under a single random projection. Thus, the risks are not expectations over all random projections. As we mentioned earlier, one of them can be bounded by Hoeffding equality immediately, and the other three can be bounded by Rademacher Complexity. We list the bound for each term, and explain how the bound is developed below. First, we consider the term R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 ) and get the following result. Proposition B.5.1. With probability at least (1 − 𝛿2 ) for 𝛿2 ∈ (0, 1) s 2𝑙𝑜𝑔 𝛿1 2 |R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 )| < 2𝜁𝜆 𝑛 𝑛 X𝑖 ), 𝑦𝑖 ) as a sum of independent and identically Í Proof. We consider R L,𝑇𝑛 ( 𝑓 𝜆 ) = L ( 𝑓 𝜆 (X 𝑖=1 distributed (i.i.d) random variables since each pair (X X𝑖 , 𝑦𝑖 ) ∈ 𝑇𝑛 are i.i.d distributed. R L ( 𝑓 𝜆 ) is the 111 expectation of R L,𝑇𝑛 ( 𝑓 𝜆 ). 
Since loss function is bounded by 𝜁𝜆 for every term in R L,𝑇𝑛 ( 𝑓 𝜆 ), using Hoeffding’s inequality ([36]), we obtain P[R L,𝑇𝑛 ( 𝑓 𝜆 ) − R L ( 𝑓 𝜆 )] > 𝜃) ≤ 𝑒𝑥 𝑝(− 𝑛𝜃2 ). Choosing 8𝜁 𝜆 𝛿2 = 𝑒𝑥 𝑝(− 𝑛𝜃2 ) leads to the above bound.  8𝜁 𝜆 However, the other three terms cannot be bounded in the same way. This is because the decision function in the other three terms 𝑓A𝜆 ,𝑛 , 𝑔𝑛𝜆 , and 𝑓𝑛𝜆 are calculated from the training data and hence conditional on 𝑇𝑛 . As a result, the risk of RPSTM model R A (𝑔𝑔𝑛𝜆 ) is not a sum of L,𝑇𝑛 independent random variables. This violates the assumption of Hoeffding inequality. We need to use Rademacher Complexity, a stronger tool to bound the three terms left. We use this tool to develop a bound between R L (𝑔𝑔𝑛𝜆 ) and R (𝑔𝑔𝑛𝜆 ). L,𝑇𝑛A We use R𝑛 (F F ) to denote the Rademacher complexity of a function class F , and R̂ 𝐷 𝑛 (F F ) to denote the corresponding sample estimate with respect to samples 𝐷 𝑛 = {𝑍1 , .., 𝑍𝑛 }. To make our description more consistent, one may regard each 𝑍𝑖 = (X X𝑖 , 𝑦𝑖 ), so that 𝐷 𝑛 is another representation of the training data 𝑇𝑛 . The reason for doing this is because we need to use composite function in forms of L ◦ F . We shall present few well established results about the Rademacher complexity without proof. One can find details about the proof from [106]. Theorem B.5.1. Consider a collection of classifiers F = { 𝑓 : || 𝑓 || ∞ 6 𝜁˜𝜆 }. ∀𝛿2 > 0, with probability at least 1 − 𝛿2 , we obtain: s 1 Õ𝑛 log 𝛿2 2 sup |E[ 𝑓 (𝑍)] − 𝑓 (𝑍𝑖 )| 6 2R𝑛 (FF ) + 𝜁˜𝜆 F 𝑓 ∈F 𝑛 2𝑛 𝑖=1 The probability is with respect to the draw of 𝐷 𝑛 . This is the general inequality for Rademacher Complexity, and can be applied for all types of data and functions 𝑓 . There is a corollary from the Rademacher Complexity developed especially for controlling classification risks. The necessity of this corollary is due to the fact that 𝑓 is a uni- variate function in the theorem, but loss functions in classification risk measurement are bi-variate. As a result, it is not appropriate to replace 𝑓 with loss function L, and substitute 𝐷 𝑛 with our tensor training data 𝑇𝑛 directly in Theorem B.5.1. 112 q 𝐿0 Corollary B.5.1.1. Let F = { 𝑓 : || 𝑓 || ∞ 6 𝐾𝑚𝑎𝑥 𝜆 }, and its sample Rademacher Complexity is R̂𝑇𝑛 (F F ). Suppose L : R × Y → R is a loss function satisfying condition AS.1. Then for all possible training set 𝑇𝑛 , r 𝐿0 R̂𝑇𝑛 (L ◦ F ) 6 2𝐶 (𝐾𝑚𝑎𝑥 ) · R̂𝑇𝑛 (F F) 𝜆 q 𝐿 where L ◦ F = {(X X, 𝑦) → L ( 𝑓 (X X), 𝑦) : 𝑓 ∈ F }. 𝐶 (𝐾𝑚𝑎𝑥 𝜆0 ) is the Lipschitz continuous constant of loss function L introduced in assumption AS.1. The corollary bridges the Rademacher Complexity for general functions to the classification prob- lems. This corollary is also know as a useful extension of Ledoux-Talagrand Contraction Theorem. Lastly, since Corollary B.5.1.1 uses sample Rademacher Complexity, one more result from [106] about kernel classes and RKHS is needed. Here, we continue using the notations about CP-STM and RPSTM from our main content. Corollary B.5.1.2. Suppose H is the RKHS generated by projected tensor kernel functions (3.2). With assumption AS.3 and training data 𝑇𝑛 , for collection of function any function F (𝑀) ⊂ H  𝑀𝐾𝑚𝑎𝑥 R̂𝑇𝑛 F (𝑀) 6 √ 𝑛  F (𝑀) = 𝑓 ∈ H , || 𝑓 || 6 𝑀, 𝑀 > 0 . To use the inequality in Corollary B.5.1.1, we have to bound the infinite norm of the function 𝑔𝑛𝜆 due to the fact that we have to composite loss function L and decision function 𝑔𝑛𝜆 to measure the risk. Thus, we provide the following proposition. Proposition B.5.2. Let 𝑔𝑛𝜆 be the optimal RPSTM model. 
Under assumption AS.1, we have r 𝜆 𝐿0 ||𝑔𝑔𝑛 || ∞ 6 𝐾𝑚𝑎𝑥 𝜆 113 Proof. Since 𝑔𝑛𝜆 = arg min {R A (𝑔𝑔 ) + 𝜆||𝑔𝑔 || 2 }, we get L,𝑇𝑛 𝑔∈HHA 1  ||𝑔𝑔𝑛𝜆 || 2 6 R A (0) + ||0|| 2 − R A (𝑔𝑔𝑛𝜆 ) 𝜆 L,𝑇𝑛 L,𝑇𝑛 1 ( R L,𝑇𝑛A (𝑔𝑔𝑛𝜆 ) negative and | |0| |2 = 0 ) 6 R A (0) 𝜆 L,𝑇𝑛 𝐿 ( Assumption AS.1) 6 0 𝜆 Since 𝑔𝑛𝜆 ∈ H A , and H A is a RKHS generated by projected tensor kernels. By RKHS property , for any function 𝑔 ∈ H A r XA ) h𝑔𝑔 , 𝐾 (, XA )i p 𝐿0 𝑔 (X = 6 ||𝑔𝑔 || sup 𝐾 (·, ·) 6 𝐾𝑚𝑎𝑥 𝜆 The step use Cauchy–Schwarz inequality. The inequality holds for all XA when the function 𝑔 is replaced with 𝑔𝑛𝜆 .  Inspired by the proof of Proposition B.5.2, we consider a collection of function G𝜆 = {𝑔𝑔 : 𝑔 ∈ q H A , ||𝑔𝑔 || 6 𝜆0 }, which obviously includes 𝑔𝑛𝜆 . Due to Corollary B.5.1.2 and Proposition B.5.2, 𝐿 we have r 𝐿0 R̂𝑛 (G𝜆 ) 6 𝐾𝑚𝑎𝑥 𝑛𝜆 q 𝐿0 and ||𝑔𝑔 || ∞ 6 𝐾𝑚𝑎𝑥 𝜆 for all 𝑔 ∈ G𝜆 . Thus, we can now utilize Theorem B.5.1 to show the bound between R L (𝑔𝑔𝑛𝜆 ) and R (𝑔𝑔𝑛𝜆 ). L,𝑇𝑛A q 𝐿0 Proposition B.5.3. Let G𝜆 = {𝑔𝑔 : ||𝑔𝑔 || 6 𝜆 }. Assume conditions AS.1, AS.3, AS.4, and AS.7. Let 𝛿2 > 0, with probability at least 1 − 𝛿2 and a given random projection defined by A r r r 𝐿 0 𝐿 0 log(2/𝛿2 ) |R L (𝑔𝑔𝑛𝜆 ) − R A (𝑔𝑔𝑛𝜆 )| 6 4𝐶 (𝐾𝑚𝑎𝑥 ) · 𝐾𝑚𝑎𝑥 + 3𝜁˜𝜆 (B.4) L,𝑇𝑛 𝜆 𝑛𝜆 2𝑛 The probability is with respect to the join distribution of X × Y . q 𝐿0 Proof. Since 𝑔𝑛𝜆 ∈ G𝜆 and ||𝑔𝑔 || ∞ 6 𝐾𝑚𝑎𝑥 for all 𝑔 ∈ G𝜆 . Let H𝜆 = L ◦ G𝜆 = ℎ : ℎ =  𝜆 X), 𝑦), 𝑔 ∈ G𝜆 }. Then ||ℎℎ || ∞ 6 𝜁˜𝜆 as we noted in the description of Theorem 3.5.2. Theorem L (𝑔𝑔 (X 114 B.5.1 then suggests given a training data 𝑇𝑛 and its projected counterpart, 𝑛 1Õ |R L (𝑔𝑔𝑛𝜆 ) − R A (𝑔𝑔𝑛𝜆 )| 6 sup EX ×Y Y ℎ (𝑍) − ℎ (𝑍𝑖 ) L,𝑇𝑛 𝑛 ℎ ∈H𝜆 𝑖=1 r log(2/𝛿2 ) ( Theorem B.5.1 and | |ℎℎ | |∞ 6 𝜁˜𝜆 by definition) 6 2R𝑛 (H𝜆 ) + 𝜁˜𝜆 2𝑛  r  r log(2/𝛿 ) log(2/𝛿2 ) ( McDiarmid’s inequality) 6 2 R̂𝑛 (H𝜆 ) + 𝜁˜𝜆 2 + 𝜁˜𝜆 2𝑛 2𝑛 r r 𝐿0 log(2/𝛿2 )  ( Corollary B.5.1.1) 6 2 2𝐶 (𝐾𝑚𝑎𝑥 ) · R̂𝑛 (G𝜆 ) + 𝜁˜𝜆 + 𝜆 2𝑛 r log(2/𝛿2 ) 𝜁˜𝜆 2𝑛 r r r 𝐿0 𝐿0 log(2/𝛿2 ) ( Corollary B.5.1.2) 6 4𝐶 (𝐾𝑚𝑎𝑥 ) · 𝐾𝑚𝑎𝑥 + 3𝜁˜𝜆 𝜆 𝑛𝜆 2𝑛 The proposition is proved.  Finally, we can use the exact same way to control R L ( 𝑓𝑛𝜆 ) − R L,𝑇𝑛 ( 𝑓𝑛𝜆 ) and R A ( 𝑓A𝜆 ,𝑛 ) − L,𝑇𝑛 𝜆 R L ( 𝑓A ,𝑛 ). Since the probability 1 − 𝛿2 is with respect to sampling of training data 𝑇𝑛 . Given a random projection, if we get 𝑇𝑛 , we get 𝑇𝑛A . Hence, the randomness of all the four terms bounded by 𝑉 (1) holds simultaneously. We can conclude that with probability at least 1 − 𝛿2 , r p r r 𝐿 0 𝐾𝑚𝑎𝑥 𝐿 0 log(2/𝛿 2 ) 2 log(2/𝛿2 ) 𝑉 (1) = 12𝐶 (𝐾𝑚𝑎𝑥 ) √ + 9𝜁˜𝜆 + 2𝜁𝜆 (B.5) 𝜆 𝑛𝜆 2𝑛 𝑛 Theorem 3.5.2 is proved. As one can see, the term 𝑉 (1) and 𝑉 (3) converge to zero as we assumed. The 𝐷 (𝜆) term in 𝑉 (2) actually motivates us to make assumption AS.9 to make the excess risk converging. B.6 Convergence Rate of Squared Hinge and Hinge Loss We establish the explicit convergence rate for Squared Hinge and Hinge loss in this section. To do that, we have to utilize some properties and facts that hold for these two loss functions. These properties are listed below without proof since they can be easily verified. One can refer [128] for the proof. For similarity, we use L1 and L2 to denote Hinge and Squared Hinge loss. 1. For both loss function, 𝜆|| 𝑓 𝜆 || 2𝐾 < 𝐷 (𝜆) with a given 𝜆 > 0. 115 q 𝐿0 2. For both loss function, || 𝑓 𝜆 || ∞ = sup X)| | 𝑓 𝜆 (X 6 𝐾𝑚𝑎𝑥 || 𝑓 𝜆 || 6 𝐾𝑚𝑎𝑥 𝜆 . X ∈H H q q q 1 𝐷 (𝜆) 𝐿0 3. For hinge loss, 𝐿 0 = 1, 𝜁𝜆 ≤ 1 + 𝐾𝑚𝑎𝑥 𝜆, and 𝜁˜𝜆 ≤ 1 + 𝐾𝑚𝑎𝑥 𝜆 , 𝐶 (𝐾𝑚𝑎𝑥 𝜆 ) =1 q q 1 2 ˜ 𝐷 (𝜆) 𝐿0 4. 
B.6 Convergence Rate of Squared Hinge and Hinge Loss

We establish explicit convergence rates for the squared hinge and hinge losses in this section. To do so, we utilize several properties of these two loss functions. They are listed below without proof, since they are easy to verify; one can refer to [128] for the proofs. For simplicity, we use $\mathcal{L}_1$ and $\mathcal{L}_2$ to denote the hinge and squared hinge losses, respectively.

1. For both losses, $\lambda\|f^\lambda\|_K^2 < D(\lambda)$ for a given $\lambda > 0$.

2. For both losses, $\|f^\lambda\|_\infty = \sup_{\mathcal{X}\in\mathcal{H}}|f^\lambda(\mathcal{X})| \leq K_{\max}\|f^\lambda\| \leq K_{\max}\sqrt{L_0/\lambda}$.

3. For the hinge loss, $L_0 = 1$, $\zeta_\lambda \leq 1 + K_{\max}\sqrt{D(\lambda)/\lambda}$, $\tilde{\zeta}_\lambda \leq 1 + K_{\max}\sqrt{L_0/\lambda}$, and $C(K_{\max}\sqrt{L_0/\lambda}) = 1$.

4. For the squared hinge loss, $L_0 = 1$, $\zeta_\lambda \leq \big(1 + K_{\max}\sqrt{D(\lambda)/\lambda}\big)^2$, $\tilde{\zeta}_\lambda \leq 2\big(1 + K_{\max}\sqrt{L_0/\lambda}\big)$, and $C(K_{\max}\sqrt{L_0/\lambda}) = 2K_{\max}\sqrt{1/\lambda}$.

5. For the hinge loss, $\Psi_{\mathcal{L}_1} \leq \frac{1}{\lambda}$, since the dual problem of CP-STM restricts $\alpha_i \leq \frac{1}{2n\lambda}$ (see Section 2.2).

6. For the squared hinge loss, $\Psi_{\mathcal{L}_2} = O\big(\frac{1}{\lambda}\big)$, using the lemma below.

For property 6 about the squared hinge loss, we provide some discussion here.

Lemma B.6.1. Let $n_+$ and $n_-$ be the numbers of training samples with labels $+1$ and $-1$, respectively. Define $\Psi = \sup\big\{\sum_i|\beta_i| : f = \sum_{\mathcal{X}_i\in\mathcal{X}}\beta_i K(\mathcal{X}_i, \cdot)\big\}$, the supremum of the absolute sum of coefficients over every possible function in the RKHS. Then
\[
\Psi_{\mathcal{L}_2} \leq \frac{1}{C_K + \dfrac{4 n_+ n_- \lambda}{n_+ + n_-}},
\]
where $C_K = \min_{\beta : \beta^\top \mathbf{1} = 1}\beta^\top K\beta$ depends on the kernel matrix $K$.

The proof of this lemma is available in [39]. Using this lemma, we obtain a corollary that establishes the last property of the squared hinge loss.

Corollary B.6.1.1. For the squared hinge loss, the supremum of the sum of absolute coefficients stays finite as the sample size grows:
\[
\Psi_{\mathcal{L}_2} = O\Big(\frac{1}{\lambda}\Big).
\]

Proof. The sum of the eigenvalues of $K$ is of order $O(n)$, since the trace of $K$ is of order $O(n)$; this follows from $K(\mathcal{X}_i, \mathcal{X}_i) = O(1)$, which is guaranteed by assumption AS.7. By the arithmetic mean and geometric mean inequality, $4n_+n_-/n^2 \leq 1$. Assuming that $K$ is positive definite, $\Psi_{\mathcal{L}_2} = O(1)$; this agrees with the bound of Theorem 3.3 in [29], which is of order $O(1/\lambda)$ up to constants. When $K$ is not positive definite, $C_K = 0$ and the bound in Lemma B.6.1 is of order $O(1/\lambda)$. Combining these two cases for the kernel matrix $K$, the corollary is proved. $\square$

Note that $C_K$ depends on the kernel and on the geometric configuration of the data points, and this quantity influences the error of the projected classifier; the bounds above are consistent with [39]. If the data come from a bounded domain, i.e., $\|\mathcal{X}\|_2 \leq C_{d,r}$, the Gram matrix $K$ can have a strictly positive minimum eigenvalue, so that $C_K > 0$. The key idea is to divide the bounded domain into a minimal increasing sequence of discs $D_n$ formed between rings of radii $R_{n-1}$ and $R_n$ such that $\mathcal{X} \subseteq \cup_{n=1}^{N} D_n$, so that the diameter of $\mathcal{X}$ is at most $2R_N$ (assuming $R_n < \infty$), and then to count the number of points in each $D_n$. Some regularity conditions on the distribution of the data within each disc $D_n$ are therefore necessary to bound the eigenvalues of the Gram matrix. When these are unknown, the Gram matrix can be estimated; [126] discusses the regularization error of such estimation.

B.6.1 Proof of Proposition 3.5.3

For the squared hinge loss, consider the projected dimension $P_j = \big\lceil 3 r^4 d\,(\log(n/\delta_1))^2/\epsilon^2\big\rceil + 1$ for each mode $j = 1, 2, \ldots, d$. Adopting Theorem 3.5.2 together with properties 1, 2, 4, and 6, we have, with probability at least $(1 - 2\delta_1)(1 - \delta_2)$ and some $\eta \in (0, 1]$,
\begin{align*}
\mathcal{R}_{\mathcal{L}}(g_n^\lambda) - \mathcal{R}^*_{\mathcal{L}}
&\leq \frac{24K_{\max}^2}{\lambda\sqrt{n}} + 18\Big(1 + \frac{K_{\max}}{\sqrt{\lambda}}\Big)\sqrt{\frac{\log(2/\delta_2)}{2n}} + 2\Big(1 + K_{\max}\sqrt{\frac{D(\lambda)}{\lambda}}\Big)^2\sqrt{\frac{2\log(2/\delta_2)}{n}} + D(\lambda) + C_D\Big[\frac{2K_{\max}}{\lambda\sqrt{\lambda}} + \frac{1}{\lambda}\Big]\epsilon^d \\
&\leq O\Big(\frac{1}{\sqrt{n\lambda^2}}\Big) + O\Big(\frac{1}{\sqrt{n\lambda}}\Big)\sqrt{\log\frac{2}{\delta_2}} + O\Big(\frac{1}{\sqrt{n\lambda^{2(1-\eta)}}}\Big)\sqrt{\log\frac{2}{\delta_2}} + O(\lambda^{\eta}) + O\Big(\frac{\epsilon^d}{\lambda^{3/2}}\Big),
\end{align*}
where $\delta_1 \in (0, \tfrac12)$ and $\delta_2 \in (0, 1)$. Plugging in assumption AS.10 and replacing the last term above by a quantity expressed in terms of $n$, we obtain
\[
\mathcal{R}_{\mathcal{L}}(g_n^\lambda) - \mathcal{R}^*_{\mathcal{L}} \leq C\sqrt{\log\frac{2}{\delta_2}}\cdot\Big(\frac{1}{n}\Big)^{\frac{\mu\eta}{2\eta+3}},
\]
where $\epsilon = (1/n)^{\mu/(2d)}$ for some $0 < \mu < 1$ and $\lambda = (1/n)^{\mu/(2\eta+3)}$ for some $0 < \eta \leq 1$. The projected dimension then becomes $P_j = \big\lceil 3r^4 d\,(\log(n/\delta_1))^2/\epsilon^2\big\rceil + 1 = \big\lceil 3r^4 d\, n^{\mu/d}(\log(n/\delta_1))^2\big\rceil + 1$.
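To see how these quantities scale with the sample size, the following sketch evaluates the projected dimension $P_j$ and the resulting squared-hinge excess-risk rate $(1/n)^{\mu\eta/(2\eta+3)}$ for a few values of $n$. The choices of $r$, $d$, $\mu$, $\eta$, and $\delta_1$ are illustrative placeholders only, not values used elsewhere in this dissertation.

```python
# Minimal sketch of the scaling in Proposition 3.5.3:
# P_j = ceil(3 r^4 d (log(n/delta1))^2 / eps^2) + 1 with eps = n**(-mu/(2d)),
# and excess risk of order n**(-mu*eta/(2*eta+3)).
import math

def projected_dim(n, r, d, delta1, mu):
    eps = n ** (-mu / (2 * d))
    return math.ceil(3 * r**4 * d * (math.log(n / delta1)) ** 2 / eps**2) + 1

def excess_risk_rate(n, mu, eta):
    return n ** (-mu * eta / (2 * eta + 3))

for n in (10**3, 10**4, 10**5):
    print(n,
          projected_dim(n, r=3, d=3, delta1=0.05, mu=0.5),
          excess_risk_rate(n, mu=0.5, eta=1.0))
```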
B.6.2 Proof of Proposition 3.5.4

The convergence rate for the hinge loss is established in the same way as that of the squared hinge loss. Adopting Theorem 3.5.2 together with properties 1, 2, 3, and 5, we conclude that
\begin{align*}
\mathcal{R}_{\mathcal{L}}(g_n^\lambda) - \mathcal{R}^*_{\mathcal{L}}
&\leq \frac{12K_{\max}}{\sqrt{n\lambda}} + 9\Big(1 + \frac{K_{\max}}{\sqrt{\lambda}}\Big)\sqrt{\frac{\log(2/\delta_2)}{2n}} + 2\Big(1 + K_{\max}\sqrt{\frac{D(\lambda)}{\lambda}}\Big)\sqrt{\frac{2\log(2/\delta_2)}{n}} + D(\lambda) + \frac{2C_D\,\epsilon^d}{\lambda} \\
&\leq O\Big(\frac{1}{\sqrt{n\lambda}}\Big)\sqrt{\log\frac{2}{\delta_2}} + O\Big(\frac{1}{\sqrt{n\lambda^{2(1-\eta)}}}\Big)\sqrt{\log\frac{2}{\delta_2}} + O(\lambda^{\eta}) + O\Big(\frac{\epsilon^d}{\lambda}\Big),
\end{align*}
with $P_j = O\big([\log(n/\delta_1)]^2/\epsilon^2\big)$ for each mode $j = 1, 2, \ldots, d$. The inequality holds with probability at least $(1 - 2\delta_1)(1 - \delta_2)$, for some $\delta_1 \in (0, \tfrac12)$ and $\delta_2 \in (0, 1)$. Using assumption AS.10 again, we get
\[
\mathcal{R}_{\mathcal{L}}(g_n^\lambda) - \mathcal{R}^*_{\mathcal{L}} \leq C\sqrt{\log\frac{2}{\delta_2}}\cdot\Big(\frac{1}{n}\Big)^{\frac{\mu\eta}{2\eta+2}},
\]
where $\epsilon = (1/n)^{\mu/(2d)}$ for $0 < \mu < 1$ and $\lambda = (1/n)^{\mu/(2\eta+2)}$ for some $0 < \eta \leq 1$. The projected dimension becomes $P_j = \big\lceil 3r^4 d\,n^{\mu/d}[\log(n/\delta_1)]^2\big\rceil + 1$.

B.7 Proof of Theorem 3.5.3

Theorem 3.5.3 establishes the rate of convergence of the expected risk difference, showing $\ell_1$ consistency: the error vanishes in expectation as the sample size increases. Theorem 3.5.3 therefore establishes a stronger optimality of our algorithm than the risk difference vanishing in probability. In this subsection, we show that Theorem 3.5.3 holds.

First, we show a corollary about the expected difference between projected tensor CP components and the original CP components. (A numerical illustration of this mode-wise distance preservation appears at the end of this section.)

Corollary B.7.0.1. For any two $d$-mode tensors in rank-$r$ CP form, $\mathcal{X}_1 = \sum_{k=1}^{r} x^{(1)}_{1,k}\circ\cdots\circ x^{(d)}_{1,k}$ and $\mathcal{X}_2 = \sum_{k=1}^{r} x^{(1)}_{2,k}\circ\cdots\circ x^{(d)}_{2,k}$, the expected difference between the pairwise factor distances and their projected counterparts has the upper bound
\[
\mathbb{E}\Bigg|\sum_{k,l=1}^{r}\prod_{j=1}^{d}\big\|A^{(j)}x^{(j)}_{1,k} - A^{(j)}x^{(j)}_{2,l}\big\|_2^2 - \sum_{k,l=1}^{r}\prod_{j=1}^{d}\big\|x^{(j)}_{1,k} - x^{(j)}_{2,l}\big\|_2^2\Bigg|
\leq r^2\sqrt{\frac{3^d}{\prod_{j=1}^{d}P_j}}\;\sum_{k,l=1}^{r}\prod_{j=1}^{d}\big\|x^{(j)}_{1,k} - x^{(j)}_{2,l}\big\|_2^2.
\]

Proof. It is known that for any real random variable $W$, $\mathbb{E}(|W|) \leq \sqrt{\mathbb{E}(W^2)}$. Using this result together with the variance of the projection difference stated in the proof of Corollary B.4.2.1, we obtain the claim. $\square$

Next we derive the expected difference in the tensor kernel due to projection.

Proposition B.7.1. For any two $d$-mode tensors in rank-$r$ CP form, $\mathcal{X}_1 = \sum_{k=1}^{r} x^{(1)}_{1,k}\circ\cdots\circ x^{(d)}_{1,k}$ and $\mathcal{X}_2 = \sum_{k=1}^{r} x^{(1)}_{2,k}\circ\cdots\circ x^{(d)}_{2,k}$, suppose the random projection $f_{\text{TPR-CP-TT}}$ is defined by a projection tensor $\mathcal{A}$ satisfying assumption AS.5. For a given tensor kernel function $K(\cdot,\cdot)$, the expected difference in the tensor kernel due to projection satisfies
\[
\mathbb{E}\Big|K\big(f_{\text{TPR-CP-TT}}(\mathcal{X}_1), f_{\text{TPR-CP-TT}}(\mathcal{X}_2)\big) - K(\mathcal{X}_1, \mathcal{X}_2)\Big| \leq C_{d,r}\,r^2\sqrt{\frac{3^d}{\prod_{j=1}^{d}P_j}},
\]
where the constant $C_{d,r}$ is taken from Proposition B.4.1.

Proof. We proceed as in the proof of Proposition B.4.1. Using the result of Corollary B.7.0.1, we derive the claim. $\square$

Using Proposition B.7.1 and the derivations in the proof of Proposition 3.5.2, we conclude that, in expectation with respect to the random projection,
\[
\mathbb{E}_{\mathcal{A}}\Big|\mathcal{R}_{\mathcal{L}}(f^\lambda_{\mathcal{A},n}) + \lambda\|f^\lambda_{\mathcal{A},n}\|^2 - \mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \lambda\|f_n^\lambda\|^2\Big|
\leq C_{d,r}\,\Psi\Big[C\big(K_{\max}\sqrt{L_0/\lambda}\big) + \lambda\Psi\Big]\sqrt{\frac{3^d}{\prod_{j=1}^{d}P_j}}
= O\Bigg(\frac{1}{\lambda^{q}\sqrt{\prod_{j=1}^{d}P_j}}\Bigg). \tag{B.6}
\]
However, it should be noted that for a fixed sample size $n$, and thus a fixed $\epsilon$, the projected dimension $P_j$ changes only as a function of the probability $\delta_1$. We now prove the following result on the expected risk of the projection error, which vanishes as the sample size increases.

Proposition B.7.2. Under conditions AS.1 to AS.8 and AS.10 to AS.11, the expected risk of the projection error goes to $0$ as $n$ increases:
\[
\mathbb{E}_n\mathbb{E}_{\mathcal{A}}\Big|\mathcal{R}_{\mathcal{L}}(f^\lambda_{\mathcal{A},n}) + \lambda\|f^\lambda_{\mathcal{A},n}\|^2 - \mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \lambda\|f_n^\lambda\|^2\Big| = O\Big(\frac{\epsilon^d}{\lambda^{q}}\Big).
\]
Proof.
From equation (B.6) and condition AS.5, we obtain the following statement: with probability at least $1 - \delta_1$,
\begin{align*}
\mathbb{E}_{\mathcal{A}}\Big|\mathcal{R}_{\mathcal{L}}(f^\lambda_{\mathcal{A},n}) + \lambda\|f^\lambda_{\mathcal{A},n}\|^2 - \mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \lambda\|f_n^\lambda\|^2\Big|
&\leq \frac{1}{\lambda^{q}\sqrt{\prod_{j=1}^{d}P_j}} \\
&\leq \frac{\epsilon^d}{\lambda^{q}}\,O\Big(\frac{1}{[\log(n/\delta_1)]^{d}}\Big) && (\text{AS.5}) \\
&\leq \frac{\epsilon^d}{\lambda^{q}}\,O\Big(\frac{1}{n}\Big) && (\delta_1 \text{ as a function of } n,\ \text{AS.11}).
\end{align*}
Using the above and substituting the value of $\delta_1$ as a function of $n$ from condition AS.11, we establish that, for sufficiently large $n$,
\[
\mathbb{P}_n\Bigg(\mathbb{E}_{\mathcal{A}}\Big|\mathcal{R}_{\mathcal{L}}(f^\lambda_{\mathcal{A},n}) + \lambda\|f^\lambda_{\mathcal{A},n}\|^2 - \mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \lambda\|f_n^\lambda\|^2\Big| > \frac{1}{n}\,O\Big(\frac{\epsilon^d}{\lambda^{q}}\Big)\Bigg) \leq n\exp\big(-n^{\frac{1}{d}}\big). \tag{B.7}
\]
We utilize the following lemma about the expectation of a sequence of random variables with the tail behaviour in equation (B.7).

Lemma B.7.1. For some $d \geq 1$, let $W_n$ be a sequence of positive random variables such that, for sufficiently large $n$,
\[
\mathbb{P}_n\Big(W_n > \frac{1}{n}\Big) \leq n\exp\big(-n^{\frac{1}{d}}\big).
\]
Then the sequence $W_n$ is uniformly bounded almost surely and in expectation ($L_1$ norm).

Proof. One can show that $\sum_{n=1}^{\infty}\mathbb{P}_n(W_n > 1) < \infty$. By the Borel–Cantelli lemma, the sequence $W_n$ is uniformly bounded above by $1$ almost surely. By the dominated convergence theorem, for any sequence of positive random variables that is uniformly bounded almost surely, the expectation is bounded by the same uniform bound, i.e., $\mathbb{E}_n(W_n) \leq 1$ for all sufficiently large $n$. $\square$

Let us denote $\mathbb{E}_{\mathcal{A}}\big|\mathcal{R}_{\mathcal{L}}(f^\lambda_{\mathcal{A},n}) + \lambda\|f^\lambda_{\mathcal{A},n}\|^2 - \mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \lambda\|f_n^\lambda\|^2\big| = W_n\,O(\epsilon^d/\lambda^{q})$. Taking $d$ to be the order of the feature tensors, Lemma B.7.1 gives
\[
\mathbb{E}_n\Big[W_n\,O\Big(\frac{\epsilon^d}{\lambda^{q}}\Big)\Big] \leq O\Big(\frac{\epsilon^d}{\lambda^{q}}\Big).
\]
This concludes the proof of the bound on the risk difference due to projection stated in Proposition B.7.2. $\square$

In the following statements, we bound the expectation of the sampling error of the training data.

Proposition B.7.3. Under conditions AS.1 to AS.4 and AS.7, the expectation of the sampling risk vanishes as the sample size increases:
\[
\mathbb{E}_n\Big|\mathcal{R}_{\mathcal{L}}(g_n^\lambda) - \mathcal{R}^{\mathcal{A}}_{\mathcal{L},T_n}(g_n^\lambda) + \mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \mathcal{R}_{\mathcal{L},T_n}(f_n^\lambda) + \mathcal{R}^{\mathcal{A}}_{\mathcal{L},T_n}(f^\lambda_{\mathcal{A},n}) - \mathcal{R}_{\mathcal{L}}(f^\lambda_{\mathcal{A},n}) + \mathcal{R}_{\mathcal{L},T_n}(f^\lambda) - \mathcal{R}_{\mathcal{L}}(f^\lambda)\Big|
= O\Big(\frac{1}{\sqrt{n\lambda^2}}\Big) + O\Big(\tilde{\zeta}_\lambda\sqrt{\frac{1}{2n}}\Big) + O\Big(\zeta_\lambda\sqrt{\frac{1}{n}}\Big).
\]

Proof. Denote the random variable inside the expectation above by $H_n$. Note that $H_n$ is independent of the random projection; thus taking the expectation $\mathbb{E}_{\mathcal{A}}$ does not change it. Ignoring constants, we derive the following from equation (B.5):
\[
\mathbb{P}_n\Bigg(H_n > O\Big(\frac{1}{\sqrt{n\lambda^2}}\Big) + O\Big(\frac{\tilde{\zeta}_\lambda}{\sqrt{n}}\sqrt{\log(2/\delta_2)}\Big) + O\Big(\frac{\zeta_\lambda}{\sqrt{n}}\sqrt{\log(2/\delta_2)}\Big)\Bigg) \leq \delta_2.
\]
We need the following proposition on sub-Gaussian random variables to complete the proof of Proposition B.7.3.

Proposition B.7.4. For any real random variable $W$ with a sub-Gaussian tail, meaning $\mathbb{P}\big(W > \sqrt{\log(2/\delta_2)}\big) \leq \delta_2$, we have $\mathbb{E}(W) = O(1)$.

Proof. Using the change of variable $\delta_2 = 2e^{-u^2}$, we obtain $\mathbb{P}(W > u) \leq 2e^{-u^2}$. Using the identity that for any positive random variable $W$, $\mathbb{E}(W) = \int_0^{\infty}\mathbb{P}(W > w)\,dw$, we prove the proposition. $\square$

We further split $H_n = O\big(\frac{1}{\sqrt{n\lambda^2}}\big) + O\big(\frac{\tilde{\zeta}_\lambda}{\sqrt{n}}\big)W_1 + O\big(\frac{\zeta_\lambda}{\sqrt{n}}\big)W_2$, where $W_1$ and $W_2$ are random variables with sub-Gaussian tails. Applying Proposition B.7.4 to these sub-Gaussian random variables proves Proposition B.7.3. $\square$

Gathering the results from Theorem 3.5.2, Proposition B.7.2, and Proposition B.7.3, we can now show the following conclusion:
\[
\mathbb{E}_n\big|\mathbb{E}_{\mathcal{A}}[\mathcal{R}_{\mathcal{L}}(g_n^\lambda)] - \mathcal{R}^*_{\mathcal{L}}\big| \leq O\Big(\frac{1}{\sqrt{n\lambda^2}}\Big) + O\Big(\tilde{\zeta}_\lambda\sqrt{\frac{1}{2n}}\Big) + O\Big(\zeta_\lambda\sqrt{\frac{1}{n}}\Big) + O\big(D(\lambda)\big) + O\Big(\frac{\epsilon^d}{\lambda^{q}}\Big). \tag{B.8}
\]
Referring to the values of $\tilde{\zeta}_\lambda$ and $\zeta_\lambda$ given in the proofs of Propositions 3.5.4 and 3.5.3, and under conditions AS.6 and AS.9 to AS.11, the claim follows as a corollary of equation (B.8). Therefore, we complete our proof of Theorem 3.5.3.
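As mentioned after Corollary B.7.0.1, the mode-wise distance preservation underlying the bound (B.6) can be checked numerically. The sketch below uses plain i.i.d. Gaussian projection matrices scaled by $1/\sqrt{P_j}$ as a simplified stand-in for the TPR-CP-TT projection; the CP rank, dimensions, and factor matrices are simulated placeholders rather than quantities from our experiments.

```python
# Minimal sketch: compare the pairwise CP-factor distance sums before and after
# mode-wise Gaussian random projection, as in Corollary B.7.0.1.
import numpy as np

rng = np.random.default_rng(1)
d, r, I, P = 3, 2, 500, 50           # number of modes, CP rank, original and projected dims
X1 = [rng.normal(size=(I, r)) for _ in range(d)]   # factor matrices of tensor X1
X2 = [rng.normal(size=(I, r)) for _ in range(d)]   # factor matrices of tensor X2
A = [rng.normal(scale=1.0 / np.sqrt(P), size=(P, I)) for _ in range(d)]

def factor_distance(U, V):
    # sum over CP components k, l of the product over modes j of ||u_jk - v_jl||^2
    total = 0.0
    for k in range(r):
        for l in range(r):
            prod = 1.0
            for j in range(d):
                prod *= np.sum((U[j][:, k] - V[j][:, l]) ** 2)
            total += prod
    return total

original = factor_distance(X1, X2)
projected = factor_distance([A[j] @ X1[j] for j in range(d)],
                            [A[j] @ X2[j] for j in range(d)])
print("relative distortion:", abs(projected - original) / original)
```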
B.8 Technical Details about Numerical Studies

All the code for our numerical studies is available at our GitHub repository https://github.com/PeterLiPeide/TEC_Tensor_Ensemble_Classifier. In the Simulation folder, one can find all the values of the tuning parameters, the optimal rank of the CP decomposition, and the dimension of the random projection. We also provide code to regenerate our synthetic data, so that the whole simulation study is reproducible with our CP-STM module.

B.9 Simulation Study Discussion

Table B.1: TEC Simulation Results II: Cluster with 128GB RAM

Model  Metric        RBF-SVM  AAM    LLSVM  BSGD   LDA    RF
F2     Accuracy (%)  94.50    75.31  95.94  96.67  89.13  98.50
       STD (%)       2.15     6.18   1.86   3.95   3.64   1.50
       Time (s)      92       120    360    790    205    9.5
F3     Accuracy (%)  100      83.33  98.50  77.5   97.63  100
       STD (%)       0.00     3.54   1.37   14.39  1.90   0.00
       Time (s)      80       120    385    935    175    7.5
F5     Accuracy (%)  89.63    52.92  50     50     83.75  76.50
       STD (%)       2.80     8.28   0.00   0.00   4.50   7.65
       Time (s)      117      140    350    1820   231    8.65

For the completeness of our numerical study, we further apply the vector-based methods of simulation study 3.6 to the high-dimensional classification tasks on a high-performance cluster equipped with a 16-core CPU and 128GB of memory. The classification accuracies of all the vector-based classifiers on the F2, F3, and F5 tasks are reported in Table B.1. When these results from a more powerful machine are taken into account, the advantages of our TEC models become even more apparent. With much more memory, the BSGD model provides the best accuracy of 96.67% in F2, whereas our TEC with hinge loss attains a 98% average accuracy. A similar situation occurs in F5, where the best vector-based method, RBF-SVM, achieves 89.63% accuracy and our TEC with squared hinge loss outperforms it slightly. Only in F3 do RBF-SVM and RF have the best performance, exceeding the TEC models by 2% in accuracy. Since this performance requires considerably more computer memory, we believe that TEC models in general have greater potential than these traditional methods.

APPENDIX C

APPENDIX FOR CHAPTER 4

C.1 Proof of Theorem 4.5.1

Proof. To prove Proposition 4.5.1, we introduce a few more notations. Let $\mathcal{L}$ be a loss function satisfying condition AS.2. We denote the classification risk of an arbitrary decision function $f$ by
\[
\mathcal{R}_{\mathcal{L}}(f) = \mathbb{E}_{\mathcal{X}\times\mathcal{Y}}\,\mathcal{L}\big(y, f(\mathcal{X})\big) = \int \mathcal{L}\big(y, f(\mathcal{X})\big)\,d\mathbb{P},
\]
where the expectation is taken over the joint distribution of $\mathcal{X}\times\mathcal{Y}$. Notice that this risk, $\mathcal{R}_{\mathcal{L}}(f)$, differs from our notation $\mathcal{R}(f)$ in Section 4.5, since we use the Lipschitz continuous loss $\mathcal{L}$ instead of the zero-one loss to measure the classification error. $\mathcal{L}$ is also called a surrogate loss for classification problems; examples of such surrogate losses include the hinge loss and the squared hinge loss, and a comparison of these loss functions and their statistical properties can be found in [151]. If we denote the Bayes risk under the surrogate loss $\mathcal{L}$ by $\mathcal{R}^*_{\mathcal{L}}$, i.e., $\mathcal{R}^*_{\mathcal{L}} = \min \mathcal{R}_{\mathcal{L}}(f)$ over all measurable functions $f$, then the result of [151] says that $\mathcal{R}_{\mathcal{L}}(f_n) \to \mathcal{R}^*_{\mathcal{L}}$ implies $\mathcal{R}(f_n) \to \mathcal{R}^*$ for any sequence of decision rules $\{f_n\}$. This conclusion holds as long as the surrogate loss is self-calibrated; see [128]. Since we use the hinge loss in our problem, and the hinge loss is known to be Lipschitz and self-calibrated, assumption AS.2 holds in our discussion. Thus, we only need to show $\mathcal{R}_{\mathcal{L}}(f_n) \to \mathcal{R}^*_{\mathcal{L}}$ to prove Proposition 4.5.1.
Given the tuning parameter $\lambda$ satisfying condition AS.4, we denote
\[
f_n^\lambda = \arg\min_{f\in\mathcal{H}}\ \lambda\|f\|^2_{\mathcal{H}} + \frac{1}{n}\sum_{i=1}^{n}\mathcal{L}\big(f(\mathcal{X}_i), y_i\big),
\]
where $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) generated by the kernel function (4.8). As mentioned in Section 4.3, $\mathcal{H}$ is also known as the collection of functions of the form of equation (4.10). We further define
\[
f^\lambda = \arg\min_{f\in\mathcal{H}}\ \lambda\|f\|^2_{\mathcal{H}} + \mathcal{R}_{\mathcal{L}}(f).
\]
Then $f^\lambda$ is the optimal decision function in $\mathcal{H}$ minimizing the regularized expected risk. Comparing $f_n^\lambda$ with $f^\lambda$, one can view $f^\lambda$ as the version of $f_n^\lambda$ obtained when the training data size $n$ is as large as possible. If we denote $\mathcal{R}_{\mathcal{L},T_n}(f) = \frac{1}{n}\sum_{i=1}^{n}\mathcal{L}(f(\mathcal{X}_i), y_i)$, then $\mathcal{R}_{\mathcal{L},T_n}(f)$ is a sample estimate of $\mathcal{R}_{\mathcal{L}}(f)$. With $f^\lambda$, the triangle inequality gives
\[
\big|\mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \mathcal{R}^*_{\mathcal{L}}\big| \leq \big|\mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \mathcal{R}_{\mathcal{L}}(f^\lambda)\big| + \big|\mathcal{R}_{\mathcal{L}}(f^\lambda) - \mathcal{R}^*_{\mathcal{L}}\big|.
\]
Since the Bayes risk under the loss $\mathcal{L}$ is defined as $\mathcal{R}^*_{\mathcal{L}} = \min_{f:\mathcal{X}\to\mathcal{Y}}\mathcal{R}_{\mathcal{L}}(f)$ over all functions defined on $\mathcal{X}$, we can immediately show that
\[
\big|\mathcal{R}_{\mathcal{L}}(f^\lambda) - \mathcal{R}^*_{\mathcal{L}}\big| \leq \mathbb{E}_{\mathcal{X}\times\mathcal{Y}}\big|\mathcal{L}\big(y, f^\lambda(\mathcal{X})\big) - \mathcal{L}\big(y, f^*(\mathcal{X})\big)\big| \leq C(K_{\max})\,\sup\big|f^\lambda - f^*\big| \leq C(K_{\max})\cdot\epsilon. \tag{C.1}
\]
This is a consequence of conditions AS.1 and AS.2 in Proposition 4.5.1: $f^\lambda$ lies in the RKHS and is thus bounded by a constant depending on $K_{\max}$, while $f^*$ is continuous on the compact subspace $\mathcal{X}$ (all tensor components considered are bounded by condition AS.1) and is thus bounded as well. The universal approximating property in condition AS.3 makes equation (C.1) vanish as $\epsilon$ goes to zero. Thus, the consistency result is established once we show that $|\mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \mathcal{R}_{\mathcal{L}}(f^\lambda)|$ converges to zero. This can be done with Rademacher complexity; see Chapter 26 in [125]. From the objective function (4.9), we have
\[
\mathcal{R}_{\mathcal{L},T_n}(f_n) + \lambda_n\|f_n\|^2 \leq L_0 \tag{C.2}
\]
under condition AS.2, obtained by simply taking $f = 0$ as a naive classifier. Thus $\|f_n\| \leq \sqrt{L_0/\lambda_n}$. Let $M_n = \sqrt{L_0/\lambda_n}$, and let $f_\epsilon \in \mathcal{H}$ be such that $\mathcal{R}_{\mathcal{L}}(f_\epsilon) \leq \mathcal{R}_{\mathcal{L}}(f^\lambda) + \frac{\epsilon}{2}$. Then $\|f_\epsilon\| \leq M_n$ when $n$ is sufficiently large, because condition AS.4 gives $\lambda_n \to 0$ and hence $M_n \to \infty$. Note that we introduce $f_\epsilon$ because it is independent of $n$; its norm, although bounded by $M_n$, is a constant that does not change with $n$. By Rademacher complexity, the following inequalities hold with probability at least $1 - \delta$, where $0 < \delta < 1$:
\begin{align*}
\mathcal{R}_{\mathcal{L}}(f_n^\lambda)
&\leq \mathcal{R}_{\mathcal{L},T_n}(f_n^\lambda) + \frac{2C(K_{\max})M_n}{\sqrt{n}} + \big(L_0 + C(K_{\max})M_n\big)\sqrt{\frac{\log(2/\delta)}{2n}} \\
&\leq \mathcal{R}_{\mathcal{L},T_n}(f_\epsilon) + \lambda_n\|f_\epsilon\|^2 - \lambda_n\|f_n^\lambda\|^2 + \frac{2C(K_{\max})M_n}{\sqrt{n}} + \big(L_0 + C(K_{\max})M_n\big)\sqrt{\frac{\log(2/\delta)}{2n}} && (f_\epsilon \text{ is not the optimizer on the training data}) \\
&\leq \mathcal{R}_{\mathcal{L},T_n}(f_\epsilon) + \lambda_n\|f_\epsilon\|^2 + \frac{2C(K_{\max})M_n}{\sqrt{n}} + \big(L_0 + C(K_{\max})M_n\big)\sqrt{\frac{\log(2/\delta)}{2n}} && (\text{drop } \lambda_n\|f_n^\lambda\|^2 \geq 0) \\
&\leq \mathcal{R}_{\mathcal{L}}(f_\epsilon) + \lambda_n\|f_\epsilon\|^2 + \frac{4C(K_{\max})M_n}{\sqrt{n}} + 2\big(L_0 + C(K_{\max})M_n\big)\sqrt{\frac{\log(2/\delta)}{2n}} && (\text{Rademacher complexity again}).
\end{align*}
Let $\delta = \frac{1}{n^2}$, and choose $N$ large enough that for all $n \geq N$,
\[
\lambda_n\|f_\epsilon\|^2 + \frac{4C(K_{\max})M_n}{\sqrt{n}} + 2\big(L_0 + C(K_{\max})M_n\big)\sqrt{\frac{\log(2/\delta)}{2n}} \leq \frac{\epsilon}{2}.
\]
Such an $N$ exists because $\|f_\epsilon\|$ is a constant with respect to $n$ and all the other terms converge to zero. Thus
\[
\mathcal{R}_{\mathcal{L}}(f_n^\lambda) \leq \mathcal{R}_{\mathcal{L}}(f_\epsilon) + \frac{\epsilon}{2} \leq \mathcal{R}_{\mathcal{L}}(f^\lambda) + \epsilon
\]
with probability at least $1 - \frac{1}{n^2}$. We conclude that
\[
\mathbb{P}\big(|\mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \mathcal{R}_{\mathcal{L}}(f^\lambda)| > \epsilon\big) \to 0 \tag{C.3}
\]
for arbitrary $\epsilon$. This establishes the weak consistency of CP-STM. For strong consistency, we consider, for each $n$,
\[
\sum_{n=1}^{\infty}\mathbb{P}\big(|\mathcal{R}_{\mathcal{L}}(f_n^\lambda) - \mathcal{R}_{\mathcal{L}}(f^\lambda)| > \epsilon\big) \leq N - 1 + \sum_{n=1}^{\infty}\frac{1}{n^2} < \infty.
\]
By the Borel–Cantelli lemma, $\mathcal{R}_{\mathcal{L}}(f_n^\lambda) \to \mathcal{R}_{\mathcal{L}}(f^\lambda)$ almost surely; see [42]. The proof is finished. $\square$
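Once trial-level kernel matrices from the two modalities are available, a decision rule of the form analyzed above can be prototyped with any kernel SVM solver that accepts a precomputed Gram matrix. The sketch below is schematic only: the simulated factor matrices, the RBF kernels, and the fixed weights $w_1$, $w_2$ are illustrative placeholders, whereas the actual C-STM kernel (4.8) is constructed from coupled ACMTF factors and its kernel weights are chosen by multiple kernel learning.

```python
# Schematic sketch of a two-modality, precomputed-kernel classifier in the spirit of
# C-STM.  K is a fixed convex combination of per-modality Gram matrices; this is not
# the dissertation's implementation, only a minimal stand-in.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
n = 120
F_eeg = rng.normal(size=(n, 10))      # stand-ins for trial-level latent factors (EEG)
F_fmri = rng.normal(size=(n, 8))      # stand-ins for trial-level latent factors (fMRI)
y = rng.choice([-1, 1], size=n)       # synthetic labels

w1, w2 = 0.6, 0.4                     # fixed convex weights, for illustration only
K = w1 * rbf_kernel(F_eeg) + w2 * rbf_kernel(F_fmri)   # combined trial-by-trial kernel

clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
print("training accuracy:", clf.score(K, y))
```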
C.2 Data Pre-processing for Section 4.7

We provide further details about our EEG-fMRI data pre-processing and fMRI data extraction in this section. Most of the processing steps follow [65].

C.2.1 fMRI Data

The fMRI data processing includes three major steps: pre-processing, identification of regions of interest (ROI), and data extraction. We describe all of these steps here. All steps are performed with SPM 12 in Matlab. The image pre-processing part consists of five steps: realignment, co-registration, segmentation, normalization, and smoothing.

• Realignment: This procedure aligns all the 3D BOLD volumes recorded over time to remove artifacts caused by head motion, and also estimates head position. For each task, there are three sessions of fMRI scans, providing 510 scans in total for each subject. These scans are realigned within subject to the average of these 510 scans (the average across time). In SPM, we create three independent sessions to load all the fMRI runs and choose not to reslice the images at this step; reslicing is done in the normalization step, and avoiding an extra reslicing avoids introducing new artifacts. The mean scan created in this step is used for co-registration.

• Co-registration: Since all the fMRI scans are aligned to the mean scan, we have to transform the T1-weighted anatomical scan to match their orientation. The reason is that all the data will finally be transformed into a standardized space, and estimating such a transformation from the T1-weighted scan provides higher accuracy, since anatomical scans have higher resolution. Matching the orientation of the T1-weighted scan with the fMRI scans makes it possible to apply the transformation estimated from the T1 scan directly to the fMRI data. In this step, we keep the mean fMRI scan stationary and move the T1 anatomical scan to match it. A resliced T1-weighted scan is created in this step.

• Segmentation: This step estimates a deformation transformation mapping the data into the MNI 152 template space [83, 22]. A forward deformation field is created in this step.

• Normalization: In this step, the forward deformation is applied to all realigned fMRI scans, transforming all the data into the MNI template space. The voxel size is set to 3 × 3 × 4 mm, the same as the original images.

• Smoothing: All normalized fMRI volumes are then smoothed by a 3D Gaussian kernel with a full width at half maximum (FWHM) of 8 × 8 × 8.

This pre-processing procedure is applied to the auditory and visual fMRI scans separately and independently. For each task, the processed fMRI data are used for the statistical analysis introduced in [96, 145]. These models are basic linear mixed-effects models with an auto-regressive covariance structure. Since these models are standard and are outside the scope of this dissertation, we do not introduce them here. For the first-level (subject-level) analysis, we use the model to estimate two contrast images: standard stimulus over baseline and oddball stimulus over baseline. These are the differences between the average BOLD signal during stimulus time and that during no-stimulus (baseline) time; they can be understood as the estimate $\hat{\beta}$ in a regression model $y = x\beta + \epsilon$. These contrasts are then pooled together in the group-level analysis. For each voxel, the group-level analysis performs a T-test comparing the BOLD signals of the standard and oddball contrasts. For voxels whose test results are significant, SPM highlights them as regions of interest (ROI). The ROIs of the auditory and visual tasks are presented in Figure C.1 and Figure C.2 with P-values, and Figure C.3 shows the exact ROIs in the standard brain template in SPM 12. The coordinates of these activated voxels are also provided in the statistical analysis results. To extract the ROI data, we use the "spm_get_data" function in SPM 12. Since we are classifying trials, we only take one fMRI scan for each trial, because the trial duration (0.6 sec) is shorter than the repetition time (2 sec) of the fMRI data. For each trial, we take the k-th fMRI scan, where k = round(onset / TR) + 1. This choice is also inspired by the SPM code.
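For concreteness, the scan-index rule above amounts to only a few lines of code. The onsets and TR below are made-up examples; in our pipeline the actual voxel values are then read with the spm_get_data function in Matlab.

```python
# Illustrative computation of the fMRI scan index for each trial, k = round(onset/TR) + 1
# (1-based, as in SPM/Matlab).  Note: Matlab's round() rounds .5 ties away from zero,
# while Python's round() rounds .5 ties to even; this only matters at exact ties.
TR = 2.0                      # repetition time in seconds
onsets = [3.1, 7.8, 12.4]     # hypothetical trial onset times in seconds

for onset in onsets:
    k = round(onset / TR) + 1
    print(f"onset {onset:5.1f}s -> scan index {k}")
```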
Figure C.1: Auditory fMRI Group Level Analysis

Figure C.2: Visual fMRI Group Level Analysis

Figure C.3: Region of Interest (ROI); (a) Auditory Task, (b) Visual Task

Tasks        Auditory Oddball  Auditory Standard  Visual Oddball  Visual Standard
Subject 1    75                299                75              299
Subject 2    70                287                70              287
Subject 3    74                296                74              296
Subject 5    74                299                74              299
Subject 6    75                290                75              290
Subject 7    73                295                73              295
Subject 8    72                297                72              297
Subject 9    75                297                75              298
Subject 10   72                298                72              298
Subject 11   70                293                70              293
Subject 12   74                299                74              299
Subject 13   71                297                71              297
Subject 14   75                296                75              296
Subject 15   72                295                72              295
Subject 16   74                293                74              293
Subject 17   73                295                73              295

Table C.1: EEG-fMRI Data: Number of Trials per Subject

C.2.2 EEG Data

The EEG data are collected by a custom-built MR-compatible EEG system with 49 channels. [141] provides a re-referenced version of the EEG data with 34 channels, which is used in our experiment. The original and re-referenced channel positions are shown in Figure C.4. This version of the EEG data is sampled at 1,000 Hz and is downsampled to 200 Hz at the beginning of pre-processing. We use the "resample" function in the Matlab Signal Processing toolbox to downsample the EEG data to 200 Hz. Then, we use the functions "ft_preproc_lowpassfilter" and "ft_preproc_highpassfilter" from the SPM 12 toolbox to filter the data; this step removes both low-frequency and high-frequency noise. Finally, we split the EEG into epochs for each trial, starting 100 ms before the onset and ending 500 ms after the onset. According to [65], such a duration is long enough to capture the event-related potential of each trial in the EEG data. We show a few examples of latent factors estimated from the EEG data by our ACMTF decomposition in Figure C.5. For each trial, the topoplot shows the components from the channel mode, and the other plot shows the factors from the time mode.

Figure C.4: EEG Channel Position from [141]

Figure C.5: Examples of EEG Latent Factors (Different Trial and Stimulus Types): Topoplot for Channel Factors (left); Plots for Temporal Factors (right)

BIBLIOGRAPHY

[1] E. Acar, D. M. Dunlavy, and T. G. Kolda. A scalable optimization approach for fitting canonical tensor decompositions. Journal of Chemometrics, 25(2):67–86, 2011.

[2] Evrim Acar, Tamara G Kolda, and Daniel M Dunlavy. All-at-once optimization for coupled matrix and tensor factorizations. arXiv preprint arXiv:1105.3422, 2011.

[3] Evrim Acar, Yuri Levin-Schwartz, Vince D Calhoun, and Tülay Adali. Acmtf for fusion of multi-modal neuroimaging data and identification of biomarkers. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 643–647. IEEE, 2017.

[4] Evrim Acar, Yuri Levin-Schwartz, Vince D Calhoun, and Tülay Adali.
Tensor-based fu- sion of eeg and fmri to understand neurological changes in schizophrenia. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–4. IEEE, 2017. [5] Evrim Acar, Evangelos E Papalexakis, Gözde Gürdeniz, Morten A Rasmussen, Anders J Lawaetz, Mathias Nilsson, and Rasmus Bro. Structure-revealing data fusion. BMC bioin- formatics, 15(1):1–17, 2014. [6] Evrim Acar, Carla Schenker, Yuri Levin-Schwartz, Vince D Calhoun, and Tülay Adali. Unraveling diagnostic biomarkers of schizophrenia through structure-revealing fusion of multi-modal neuroimaging data. Frontiers in neuroscience, 13:416, 2019. [7] Dimitris Achlioptas. Database-friendly random projections: Johnson-lindenstrauss with binary coins. Journal of computer and System Sciences, 66(4):671–687, 2003. [8] Jeffrey S Anderson, Jared A Nielsen, Alyson L Froehlich, Molly B DuBray, T Jason Druzgal, Annahir N Cariello, Jason R Cooperrider, Brandon A Zielinski, Caitlin Ravichandran, P Thomas Fletcher, et al. Functional connectivity magnetic resonance imaging classification of autism. Brain, 134(12):3742–3754, 2011. [9] Andreas Argyriou, Charles A Micchelli, and Massimiliano Pontil. When is there a representer theorem? vector versus matrix regularizers. The Journal of Machine Learning Research, 10:2507–2529, 2009. [10] John Ashburner, Gareth Barnes, Chun-Chuan Chen, Jean Daunizeau, Guillaume Flandin, Karl Friston, Stefan Kiebel, James Kilner, Vladimir Litvak, Rosalyn Moran, et al. Spm12 manual. Wellcome Trust Centre for Neuroimaging, London, UK, 2464, 2014. [11] Francis R Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9(6), 2008. [12] Ivana Balažević, Carl Allen, and Timothy M Hospedales. Tucker: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590, 2019. 136 [13] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006. [14] Asa Ben-Hur and William Stafford Noble. Kernel methods for predicting protein–protein interactions. Bioinformatics, 21(suppl_1):i38–i46, 2005. [15] Kristin P Bennett, Michinari Momma, and Mark J Embrechts. Mark: A boosting algorithm for heterogeneous kernel models. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 24–31, 2002. [16] Austin R Benson, David F Gleich, and Jure Leskovec. Tensor spectral clustering for par- titioning higher-order network structures. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 118–126. SIAM, 2015. [17] Xuan Bi, Annie Qu, Xiaotong Shen, et al. Multilayer tensor factorization with applications to recommender systems. Annals of Statistics, 46(6B):3308–3333, 2018. [18] Xuan Bi, Xiwei Tang, Yubai Yuan, Yanqing Zhang, and Annie Qu. Tensors in statistics. Annual Review of Statistics and Its Application, 8, 2020. [19] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: appli- cations to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 245–250, 2001. [20] Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. [21] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001. [22] Matthew Brett, Kalina Christoff, Rhodri Cusack, Jack Lancaster, et al. Using the talairach atlas with the mni template. 
Neuroimage, 13(6):85–85, 2001. [23] Vince D Calhoun, Tulay Adali, NR Giuliani, JJ Pekar, KA Kiehl, and GD Pearlson. Method for multimodal analysis of independent source differences in schizophrenia: combining gray matter structural and auditory oddball functional data. Human brain mapping, 27(1):47–62, 2006. [24] Timothy I Cannings and Richard J Samworth. Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):959–1035, 2017. [25] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1–27, 2011. [26] Christos Chatzichristos, Mike Davies, Javier Escudero, Eleftherios Kofidis, and Sergios Theodoridis. Fusion of eeg and fmri via soft coupled tensor decompositions. In 2018 26th European Signal Processing Conference (EUSIPCO), pages 56–60. IEEE, 2018. 137 [27] Christos Chatzichristos, Eleftherios Kofidis, Lieven De Lathauwer, Sergios Theodoridis, and Sabine Van Huffel. Early soft and flexible fusion of eeg and fmri via tensor decompositions. arXiv preprint arXiv:2005.07134, 2020. [28] Cong Chen, Kim Batselier, Ching-Yun Ko, and Ngai Wong. A support tensor train machine. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019. [29] Di-Rong Chen and Han Li. Convergence rates of learning algorithms by random projection. Applied and Computational Harmonic Analysis, 37(1):36–51, 2014. [30] Xinyu Chen, Zhaocheng He, and Jiawei Wang. Spatial-temporal traffic speed patterns discov- ery and incomplete data recovery via svd-combined tensor decomposition. Transportation research part C: emerging technologies, 86:59–77, 2018. [31] Yanyan Chen, Kuaini Wang, and Ping Zhong. One-class support tensor machine. Knowledge- Based Systems, 96:14–28, 2016. [32] Mario Christoudias, Raquel Urtasun, Trevor Darrell, et al. Bayesian localized multiple kernel learning. Univ. California Berkeley, Berkeley, CA, 2009. [33] R Dennis Cook. Regression graphics: Ideas for studying regressions through graphics, volume 482. John Wiley & Sons, 2009. [34] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of johnson and lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003. [35] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. On the best rank-1 and rank-(r 1, r 2,..., rn) approximation of higher-order tensors. SIAM journal on Matrix Analysis and Applications, 21(4):1324–1342, 2000. [36] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition, volume 31. Springer Science & Business Media, 2013. [37] Yiming Ding, Jae Ho Sohn, Michael G Kawczynski, Hari Trivedi, Roy Harnish, Nathaniel W Jenkins, Dmytro Lituiev, Timothy P Copeland, Mariam S Aboian, Carina Mari Aparici, et al. A deep learning model to predict a diagnosis of alzheimer disease by using 18f-fdg pet of the brain. Radiology, 290(2):456–464, 2019. [38] Nemanja Djuric, Liang Lan, Slobodan Vucetic, and Zhuang Wang. Budgetedsvm: A toolbox for scalable svm approximations. The Journal of Machine Learning Research, 14(1):3813– 3817, 2013. [39] Leo Doktorski. L2-svm: Dependence on the regularization parameter. Pattern Recognition and Image Analysis, 21(2):254–257, 2011. [40] Olivier Duchenne, Francis Bach, In-So Kweon, and Jean Ponce. A tensor-based algorithm for high-order graph matching. IEEE transactions on pattern analysis and machine intelligence, 33(12):2383–2395, 2011. 138 [41] Robert Durrant and Ata Kabán. 
Sharp generalization error bounds for randomly-projected classifiers. In International Conference on Machine Learning, pages 693–701, 2013. [42] Rick Durrett. Probability: theory and examples, volume 49. Cambridge university press, 2019. [43] Hadi Fanaee-T and Joao Gama. Simtensor: A synthetic tensor data generator. arXiv preprint arXiv:1612.03772, 2016. [44] Hadi Fanaee-T and Joao Gama. Tensor-based anomaly detection: An interdisciplinary survey. Knowledge-Based Systems, 98:130–147, 2016. [45] Long Feng, Xuan Bi, and Heping Zhang. Brain regions identified as being associated with verbal reasoning through the use of imaging regression via internal variation. Journal of the American Statistical Association, pages 1–15, 2020. [46] Xiaoli Z Fern and Carla E Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the 20th international conference on machine learning (ICML-03), pages 186–193, 2003. [47] Roger Fletcher. Practical methods of optimization. John Wiley & Sons, 2013. [48] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, 2001. [49] Jannik Fritsch, Tobias Kuehnl, and Andreas Geiger. A new performance measure and eval- uation benchmark for road detection algorithms. In International Conference on Intelligent Transportation Systems (ITSC), 2013. [50] Glenn Fung, Murat Dundar, Jinbo Bi, and Bharat Rao. A fast iterative algorithm for fisher discriminant using heterogeneous kernels. In Proceedings of the twenty-first international conference on Machine learning, page 40, 2004. [51] Mostafa Reisi Gahrooei, Hao Yan, Kamran Paynabar, and Jianjun Shi. Multiple tensor-on- tensor regression: An approach for modeling processes with heterogeneous sources of data. Technometrics, pages 1–23, 2020. [52] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013. [53] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012. [54] Mark Girolami and Mingjun Zhong. Data integration for classification problems employ- ing gaussian process priors. In Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference, volume 19, page 465. MIT Press, 2007. [55] Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. The Journal of Machine Learning Research, 12:2211–2268, 2011. 139 [56] Adrian R Groves, Christian F Beckmann, Steve M Smith, and Mark W Woolrich. Linked independent component analysis for multimodal data fusion. Neuroimage, 54(3):2198–2217, 2011. [57] Rajarshi Guhaniyogi, Shaan Qamar, and David B Dunson. Bayesian tensor regression. The Journal of Machine Learning Research, 18(1):2733–2763, 2017. [58] Weiwei Guo, Irene Kotsia, and Ioannis Patras. Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816–827, 2011. [59] Wolfgang Hackbusch. Tensor spaces and numerical tensor calculus, volume 42. Springer, 2012. [60] Peter Hall and Richard J Samworth. Properties of bagged nearest neighbour classifiers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):363–379, 2005. [61] Zhifeng Hao, Lifang He, Bingqian Chen, and Xiaowei Yang. A linear support higher-order tensor machine for classification. 
IEEE Transactions on Image Processing, 22(7):2911–2920, 2013. [62] Lifang He, Kun Chen, Wanwan Xu, Jiayu Zhou, and Fei Wang. Boosted sparse and low-rank tensor regression. arXiv preprint arXiv:1811.01158, 2018. [63] Lifang He, Xiangnan Kong, Philip S Yu, Xiaowei Yang, Ann B Ragin, and Zhifeng Hao. Dusk: A dual structure-preserving kernel for supervised tensor learning with applications to neuroimages. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 127–135. SIAM, 2014. [64] Lifang He, Chun-Ta Lu, Guixiang Ma, Shen Wang, Linlin Shen, Philip S Yu, and Ann B Ragin. Kernelized support tensor machines. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1442–1451. JMLR. org, 2017. [65] Richard N Henson, Hunar Abdulrahman, Guillaume Flandin, and Vladimir Litvak. Mul- timodal integration of m/eeg and f/mri data in spm12. Frontiers in neuroscience, 13:300, 2019. [66] Frank L Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189, 1927. [67] Heng Huang, Chris Ding, Dijun Luo, and Tao Li. Simultaneous tensor subspace selection and clustering: the equivalence of high order svd and k-means clustering. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge Discovery and Data mining, pages 327–335, 2008. [68] Prateek Jain and Sewoong Oh. Provable tensor factorization with missing data. arXiv preprint arXiv:1406.2784, 2014. [69] Svante Janson et al. Gaussian hilbert spaces, volume 129. Cambridge university press, 1997. 140 [70] Yuwang Ji, Qiang Wang, Xuan Li, and Jie Liu. A survey on tensor techniques and applications in machine learning. IEEE Access, 7:162950–162990, 2019. [71] Ruhui Jin, Tamara G Kolda, and Rachel Ward. Faster johnson-lindenstrauss transforms via kronecker products. arXiv preprint arXiv:1909.04801, 2019. [72] William B Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26(189-206):1, 1984. [73] Esin Karahan, Pedro A Rojas-Lopez, Maria L Bringas-Vega, Pedro A Valdés-Hernández, and Pedro A Valdes-Sosa. Tensor analysis and fusion of multimodal brain images. Proceedings of the IEEE, 103(9):1531–1559, 2015. [74] Ali Khazaee, Ata Ebrahimzadeh, and Abbas Babajani-Feremi. Application of advanced machine learning methods on resting-state fmri network for identification of mild cognitive impairment and alzheimer’s disease. Brain imaging and behavior, 10(3):799–817, 2016. [75] Fei Yan Krystian Mikolajczyk Josef Kittler and Muhammad Tahir. A comparison of l1 norm and l2 norm multiple kernel svms in image and video classification. [76] Marius Kloft, Ulf Brefeld, Soeren Sonnenburg, Pavel Laskov, Klaus-Robert Müller, and Alexander Zien. Efficient and accurate lp-norm multiple kernel learning. In NIPS, volume 22, pages 997–1005, 2009. [77] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009. [78] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009. [79] Tamara Gibson Kolda. Multilinear operators for higher-order decompositions. Technical report, Sandia National Laboratories, 2006. [80] Jean Kossaifi, Zachary C Lipton, Arinbjörn Kolbeinsson, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. Journal of Machine Learning Research, 21:1–21, 2020. [81] J. B. Kruskal. 
Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95 – 138, 1977. [82] Stephen LaConte, Stephen Strother, Vladimir Cherkassky, Jon Anderson, and Xiaoping Hu. Support vector machines for temporal classification of block design fmri data. NeuroImage, 26(2):317–329, 2005. [83] Jack L Lancaster, Diana Tordesillas-Gutiérrez, Michael Martinez, Felipe Salinas, Alan Evans, Karl Zilles, John C Mazziotta, and Peter T Fox. Bias between mni and talairach coor- dinates analyzed using the icbm-152 brain template. Human brain mapping, 28(11):1194– 1205, 2007. 141 [84] Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine learning research, 5(Jan):27–72, 2004. [85] Michele Larobina and Loredana Murino. Medical image file formats. Journal of digital imaging, 27(2):200–206, 2014. [86] Xu Lei, Pedro A Valdes-Sosa, and Dezhong Yao. Eeg/fmri fusion based on independent component analysis: integration of data-driven and model-driven methods. Journal of integrative neuroscience, 11(03):313–337, 2012. [87] Jie Li, Guan Han, Jing Wen, and Xinbo Gao. Robust tensor subspace learning for anomaly detection. International Journal of Machine Learning and Cybernetics, 2(2):89–98, 2011. [88] Lexin Li and Xin Zhang. Parsimonious tensor response regression. Journal of the American Statistical Association, 112(519):1131–1146, 2017. [89] Peide Li and Taps Maiti. Universal consistency of support tensor machine. In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 608–609. IEEE, 2019. [90] Ping Li, Trevor J Hastie, and Kenneth W Church. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 287–296. ACM, 2006. [91] Quefeng Li and Lexin Li. Integrative factor regression and its inference for multimodal data analysis. arXiv preprint arXiv:1911.04056, 2019. [92] Qun Li and Dan Schonfeld. Multilinear discriminant analysis for higher-order tensor data classification. IEEE transactions on pattern analysis and machine intelligence, 36(12):2524– 2537, 2014. [93] Xiaoshan Li, Da Xu, Hua Zhou, and Lexin Li. Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences, 10(3):520–545, 2018. [94] Yingjie Li, Liangliang Zhang, Andrea Bozoki, David C Zhu, Jongeun Choi, and Taps Maiti. Early prediction of alzheimer’s disease using longitudinal volumetric mri data from adni. Health Services and Outcomes Research Methodology, 20(1):13–39, 2020. [95] Yi Lin. A note on margin-based loss functions in classification. Statistics & probability letters, 68(1):73–82, 2004. [96] Martin A Lindquist et al. The statistical analysis of fmri data. Statistical science, 23(4):439– 464, 2008. [97] Jingyu Liu, Godfrey Pearlson, Andreas Windemuth, Gualberto Ruano, Nora I Perrone- Bizzozero, and Vince Calhoun. Combining fmri and snp data to investigate connections between brain function and genetics using parallel ica. Human brain mapping, 30(1):241– 255, 2009. 142 [98] Yipeng Liu, Jiani Liu, and Ce Zhu. Low-rank tensor train coefficient array estimation for tensor-on-tensor regression. IEEE transactions on neural networks and learning systems, 31(12):5402–5411, 2020. [99] Eric F Lock. Tensor-on-tensor regression. Journal of Computational and Graphical Statis- tics, 27(3):638–647, 2018. 
[100] Xiaojing Long, Lifang Chen, Chunxiang Jiang, Lijuan Zhang, and Alzheimer’s Disease Neu- roimaging Initiative. Prediction and classification of alzheimer disease based on quantifica- tion of mri deformation. PloS one, 12(3):e0173372, 2017. [101] Miles E Lopes. A sharp bound on the computation-accuracy tradeoff for majority voting ensembles. eScholarship, University of California, 2013. [102] Canyi Lu, Jiashi Feng, Yudong Chen, Wei Liu, Zhouchen Lin, and Shuicheng Yan. Tensor robust principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5249–5257, 2016. [103] Haiping Lu, Konstantinos N Plataniotis, and Anastasios N Venetsanopoulos. Mpca: Multi- linear principal component analysis of tensor objects. IEEE transactions on Neural Networks, 19(1):18–39, 2008. [104] Wei Lu, Fu-Lai Chung, Wenhao Jiang, Martin Ester, and Wei Liu. A deep bayesian tensor- based system for video recommendation. ACM Transactions on Information Systems (TOIS), 37(1):1–22, 2018. [105] Wenqi Lu, Zhongyi Zhu, and Heng Lian. High-dimensional quantile tensor regression. Journal of Machine Learning Research, 21(250):1–31, 2020. [106] Ron Meir and Tong Zhang. Generalization error bounds for bayesian mixture algorithms. Journal of Machine Learning Research, 4(Oct):839–860, 2003. [107] Charles A Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006. [108] Sebastian Mika, Gunnar Ratsch, Jason Weston, Bernhard Scholkopf, and Klaus-Robert Mullers. Fisher discriminant analysis with kernels. In Neural networks for signal processing IX: Proceedings of the 1999 IEEE signal processing society workshop (cat. no. 98th8468), pages 41–48. Ieee, 1999. [109] John C Morris, Catherine M Roe, Elizabeth A Grant, Denise Head, Martha Storandt, Alison M Goate, Anne M Fagan, David M Holtzman, and Mark A Mintun. Pittsburgh compound b imaging and prediction of progression from cognitive normality to symptomatic alzheimer disease. Archives of neurology, 66(12):1469–1475, 2009. [110] Raziyeh Mosayebi and Gholam-Ali Hossein-Zadeh. Correlated coupled matrix tensor fac- torization method for simultaneous eeg-fmri data fusion. Biomedical Signal Processing and Control, 62:102071, 2020. 143 [111] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In Icml, 2011. [112] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006. [113] Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011. [114] Yuqing Pan, Qing Mai, and Xin Zhang. Covariate-adjusted tensor classification in high dimensions. Journal of the American Statistical Association, pages 1–15, 2018. [115] Paul Pavlidis, Jason Weston, Jinsong Cai, and William Noble Grundy. Gene functional classification from heterogeneous data. In Proceedings of the fifth annual international conference on Computational biology, pages 249–255, 2001. [116] A. H. Phan, P. Tichavsky, and A. Cichocki. Low complexity damped gauss–newton al- gorithms for candecomp/parafac. SIAM Journal on Matrix Analysis and Applications, 34(1):126–147, 2013. [117] Michael JD Powell. Nonconvex minimization calculations and the conjugate gradient method. In Numerical analysis, pages 122–141. Springer, 1984. [118] Shibin Qiu and Terran Lane. 
A framework for multiple kernel support vector regression and its applications to sirna efficacy prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(2):190–199, 2008. [119] Beheshteh T Rakhshan and Guillaume Rabusseau. Tensorized random projections. arXiv preprint arXiv:2003.05101, 2020. [120] Bin Ran, Huachun Tan, Yuankai Wu, and Peter J Jin. Tensor based missing traffic data completion with spatial–temporal correlation. Physica A: Statistical Mechanics and its Applications, 446:54–63, 2016. [121] Garvesh Raskutti, Ming Yuan, Han Chen, et al. Convex regularization for high-dimensional multiresponse tensor regression. The Annals of Statistics, 47(3):1554–1584, 2019. [122] Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana, and Alessandro Verri. Are loss functions all the same? Neural computation, 16(5):1063–1076, 2004. [123] Katharina A Schindlbeck and David Eidelberg. Network imaging biomarkers: insights and clinical applications in parkinson’s disease. The Lancet Neurology, 17(7):629–640, 2018. [124] Warren Schudy and Maxim Sviridenko. Concentration and moment inequalities for poly- nomials of independent random variables. In Proceedings of the twenty-third annual ACM- SIAM symposium on Discrete Algorithms, pages 437–446. SIAM, 2012. [125] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. 144 [126] John Shawe-Taylor, Christopher KI Williams, Nello Cristianini, and Jaz Kandola. On the eigenspectrum of the gram matrix and the generalization error of kernel-pca. IEEE Transactions on Information Theory, 51(7):2510–2522, 2005. [127] Marco Signoretto, Quoc Tran Dinh, Lieven De Lathauwer, and Johan AK Suykens. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303–351, 2014. [128] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008. [129] Jing Sui, Godfrey Pearlson, Arvind Caprihan, Tülay Adali, Kent A Kiehl, Jingyu Liu, Jeremy Yamamoto, and Vince D Calhoun. Discriminating schizophrenia and bipolar disorder by fusing fmri and dti in a multimodal cca+ joint ica model. Neuroimage, 57(3):839–855, 2011. [130] Will Wei Sun and Lexin Li. Store: sparse tensor response regression and neuroimaging analysis. The Journal of Machine Learning Research, 18(1):4908–4944, 2017. [131] Will Wei Sun and Lexin Li. Dynamic tensor clustering. Journal of the American Statistical Association, 114(528):1894–1907, 2019. [132] Yanfeng Sun, Junbin Gao, Xia Hong, Bamdev Mishra, and Baocai Yin. Heterogeneous tensor decomposition for clustering via manifold optimization. IEEE transactions on pattern analysis and machine intelligence, 38(3):476–489, 2015. [133] Yiming Sun, Yang Guo, Joel A Tropp, and Madeleine Udell. Tensor random projection for low memory dimension reduction. In NeurIPS Workshop on Relational Representation Learning, 2018. [134] Hiroaki Tanabe, Tu Bao Ho, Canh Hao Nguyen, and Saori Kawasaki. Simple but effective methods for combining kernels in computational biology. In 2008 IEEE International Con- ference on Research, Innovation and Vision for the Future in Computing and Communication Technologies, pages 71–78. IEEE, 2008. [135] D. Tao, X. Li, X. Wu, and S. J. Maybank. General tensor discriminant analysis and gabor fea- tures for gait recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10), 2007. 
[136] Dacheng Tao, Xuelong Li, Weiming Hu, Stephen Maybank, and Xindong Wu. Supervised tensor learning. In Fifth IEEE International Conference on Data Mining (ICDM’05), pages 8–pp. IEEE, 2005. [137] Petr Tichavsky, Anh Huy Phan, and Zbyněk Koldovsky. Cramér-rao-induced bounds for candecomp/parafac tensor decomposition. IEEE Transactions on Signal Processing, 61(8):1986–1997, 2013. [138] Théo Trouillon, Christopher R Dance, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Knowledge graph completion via complex tensor factorization. arXiv preprint arXiv:1702.06879, 2017. 145 [139] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013. [140] Manik Varma and Debajyoti Ray. Learning the discriminative power-invariance trade-off. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007. [141] Jennifer M Walz, Robin I Goldman, Jordan Muraskin, Bryan Conroy, Truman R Brown, and Paul Sajda. "auditory and visual oddball eeg-fmri", 2018. [142] Huan Wang, Shuicheng Yan, Thomas S Huang, and Xiaoou Tang. A convengent solution to tensor subspace learning. In IJCAI, pages 629–634, 2007. [143] Philip Wolfe. Convergence conditions for ascent methods. SIAM review, 11(2):226–235, 1969. [144] Philip Wolfe. Convergence conditions for ascent methods. ii: Some corrections. SIAM review, 13(2):185–188, 1971. [145] Keith J Worsley, Chien Heng Liao, John Aston, V Petre, GH Duncan, F Morales, and AC Evans. A general statistical analysis for fmri data. Neuroimage, 15(1):1–15, 2002. [146] Kun Xie, Lele Wang, Xin Wang, Gaogang Xie, Jigang Wen, and Guangxing Zhang. Accurate recovery of internet traffic data: A tensor completion approach. In IEEE INFOCOM 2016- The 35th Annual IEEE International Conference on Computer Communications, pages 1–9. IEEE, 2016. [147] Shuicheng Yan, Dong Xu, Qiang Yang, Lei Zhang, Xiaoou Tang, and Hong-Jiang Zhang. Discriminant analysis with tensor representation. In Computer Vision and Pattern Recogni- tion, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 526–532. IEEE, 2005. [148] Rose Yu and Yan Liu. Learning from multiway data: Simple and efficient tensor regression. In International Conference on Machine Learning, pages 373–381. PMLR, 2016. [149] Anru Zhang et al. Cross: Efficient low-rank tensor completion. Annals of Statistics, 47(2):936–964, 2019. [150] Changqing Zhang, Huazhu Fu, Si Liu, Guangcan Liu, and Xiaochun Cao. Low-rank ten- sor constrained multiview subspace clustering. In Proceedings of the IEEE international conference on computer vision, pages 1582–1590, 2015. [151] Tong Zhang et al. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–85, 2004. [152] Xin Zhang and Lexin Li. Tensor envelope partial least-squares regression. Technometrics, 59(4):426–436, 2017. [153] Xing Zhang, Gongjian Wen, and Wei Dai. A tensor decomposition-based anomaly detection algorithm for hyperspectral image. IEEE Transactions on Geoscience and Remote Sensing, 54(10):5801–5820, 2016. 146 [154] Yanqing Zhang, Xuan Bi, Niansheng Tang, and Annie Qu. Dynamic tensor recommender systems. Journal of Machine Learning Research, 22(65):1–35, 2021. [155] Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552, 2013. [156] P. Zhou and J. Feng. Outlier-robust tensor pca. In Proc. IEEE Conf. Comput. Vis. 
Pattern Recognit., pages 1–9, 2017.