FAST AND MEMORY-EFFICIENT SUBSPACE EMBEDDINGS FOR TENSOR DATA WITH APPLICATIONS

By

Ali Zare

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science and Engineering – Doctor of Philosophy

2022

ABSTRACT

The widespread use of multisensor technology and the emergence of big data sets have brought the necessity to develop more versatile tools to represent higher-order data with multiple aspects and high dimensionality. Data in the form of multidimensional arrays, also referred to as tensors, arise in a variety of applications including chemometrics, physics, hyperspectral imaging, high-resolution videos, neuroimaging, biometrics, and social network analysis. Early multiway data analysis approaches reformatted such tensor data as large vectors or matrices and then resorted to dimensionality reduction methods developed for low-dimensional data. However, by vectorizing tensors, the inherent multiway structure of the data and the possible correlations between different dimensions are lost, in some cases resulting in a degradation in the performance of vector-based methods. Moreover, in many cases, vectorizing tensors leads to vectors with extremely high dimensionality that might render most existing methods computationally impractical. In the case of dimension reduction, the enormous amount of memory needed to store the embedding matrix becomes the main obstacle. This highlights the need for approaches that are applied to tensor data in their multi-dimensional form. To reduce the dimension of an $n_1 \times n_2 \times \cdots \times n_d$ tensor to $m_1 \times m_2 \times \cdots \times m_d$ with $m_j \leq n_j$, MPCA (Multilinear Principal Component Analysis) changes the memory requirement from $\prod_{j=1}^{d} m_j n_j$ for vector PCA to $\sum_{j=1}^{d} m_j n_j$, which can be a considerable improvement. On the other hand, tensor dimension reduction methods such as MPCA need training samples for the projection matrices to be learned. This makes such methods time consuming and computationally less efficient than oblivious approaches such as the Johnson-Lindenstrauss embedding. The term oblivious refers to the fact that one does not need any data samples beforehand to learn the embedding that projects a new data sample onto a lower-dimensional space.

In this thesis, first a review of tensor concepts and algebra as well as common tensor decompositions is presented. Next, a modewise JL approach is proposed for compressing tensors without reshaping them into potentially very large vectors. Theoretical guarantees for the norm and inner product approximation errors, as well as theoretical bounds on the embedding dimension, are presented for data with low CP rank, and the corresponding effects of basis coherence assumptions are addressed. Experiments are performed using various choices of embedding matrices. Results verify the validity of one- and two-stage modewise JL embeddings in preserving the norm of MRI and synthesized data constructed from both coherent and incoherent bases. Two novel applications of the proposed modewise JL method are discussed.
(i) Approximate solutions to least squares problems as a computationally efficient way of fitting tensor decompositions: The proposed approach is incorporated as a stage in the fitting procedure, and is tested on relatively low-rank MRI data. Results show improvement in computational complexity at a slight cost in the accuracy of the solution in the Euclidean norm. (ii) Many-Body Perturbation Theory problems involving energy calculations: In large model spaces, the dimension sizes of tensors can grow fast, rendering the direct calculation of perturbative correction terms challenging. The second-order energy correction term as well as the one-body radius correction are formulated and modeled as inner products in such a way that modewise JL can be used to reduce the computational complexity of the calculations. Experiments are performed on data from various nuclei in different model space sizes, and show that in the case of large model spaces, very good compression can be achieved at the price of small errors in the estimated energy values. Copyright by ALI ZARE 2022 ACKNOWLEDGEMENTS I would like to warmly thank my academic advisor, Dr. Mark Iwen, for his supervision and tutelage throughout the course of my Ph.D. studies. I truely appreciate his support, encouragement, and patience in the face of many challenges that I tackled on my academic journey at Michigan State University. My appreciation extends to the dissertation committee members, Dr. Selin Aviyente, Dr. Rongrong Wang, and Dr. Yuying Xie for their experience and knowledge that they supported me with during research and coursework. I also hereby thank Dr. Heiko Hergert who I collaborated with and learned from in the novel application of my thesis work to Many-Body Perturbation Theory problems in physics. Finally, I would like to express my deepest gratitude to my dear parents and sister without whose tremendous and unconditional support throughout my life, and especially during the past few years, earning my Ph.D. degree would not have been possible. v TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2 BACKGROUND: TENSOR BASICS AND ALGEBRA . . . . . . . . . . . 5 CHAPTER 3 TENSOR DECOMPOSITIONS AND RANK . . . . . . . . . . . . . . . . . 14 3.1 The CANDECOMP/PARAFAC Decomposition . . . . . . . . . . . . . . . . . . . 14 3.1.1 Uniquensess of CPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.1.2 Computing CPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2 Tensor Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.1 Low-rank approximation and border rank . . . . . . . . . . . . . . . . . . 18 3.3 Compression and the Tucker Decomposition . . . . . . . . . . . . . . . . . . . . . 18 3.3.1 ๐‘—-rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3.2 Computing the Tucker Decomposition . . . . . . . . . . . . . . . . . . . . 19 3.3.3 Uniqueness of Tucker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4 Tensor-Train Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 CHAPTER 4 DIMENSIONALITY REDUCTION OF TENSOR DATA: MODEWISE RANDOM PROJECTIONS . . . . 
. . . . . . . . . . . . . . . . . . . . . . 24 4.1 Johnson-Lindenstrauss Embeddings for Tensor Data . . . . . . . . . . . . . . . . . 24 4.2 Johnson-Lindenstrauss Embedings for Low-Rank Tensors . . . . . . . . . . . . . . 29 4.2.1 Geometry-Preserving Property of JL Embeddings for Low-Rank Tensors . . 29 4.2.1.1 Computational Complexity of Modewise Johnson-Lindenstrauss Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2.2 Main Theorems: Oblivious Tensor Subspace Embeddings . . . . . . . . . 41 4.2.3 Fast and Memory-Efficient Modewise Johnson-Lindenstrauss Embeddings . 46 4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.3.1 Effect of JL Embeddings on Norm . . . . . . . . . . . . . . . . . . . . . . 50 CHAPTER 5 APPLICATIONS OF MODEWISE JOHNSON-LINDENSTRAUSS EM- BEDDINGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5.1 Application to Least Squares Problems and CPD Fitting . . . . . . . . . . . . . . . 54 5.1.1 Experiments: Effect of JL Embeddings on Least Squares Solutions . . . . . 59 5.1.1.1 CPD Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . 60 5.1.1.2 Compressed Least Squares Performance . . . . . . . . . . . . . . 61 vi 5.2 Application to Many-Body Purturbation Theory Problems . . . . . . . . . . . . . . 64 5.2.1 Second-order energy correction . . . . . . . . . . . . . . . . . . . . . . . 64 5.2.2 Radius Corrections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.2.3 Third-order energy correction . . . . . . . . . . . . . . . . . . . . . . . . 68 5.2.3.1 Particle-Particle . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.2.3.2 Hole-Hole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.2.3.3 Particle-Hole . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.2.4.1 ๐ธ (2) Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.2.4.2 Radius Correction Experiments . . . . . . . . . . . . . . . . . . 76 5.2.4.3 ๐ธ (3) Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 76 CHAPTER 6 EXTENSION OF VECTOR-BASED METHODS TO TENSORS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.1 Multilinear Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . 79 6.1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.1.2 Full Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.1.3 Initialization by Full Projection Truncation (FPT) . . . . . . . . . . . . . . 82 6.1.4 Determination of subspace Dimensions ๐‘ƒ ๐‘— . . . . . . . . . . . . . . . . . . 83 6.1.5 Feature Extraction and Classification . . . . . . . . . . . . . . . . . . . . . 83 6.2 Comparison between PCA, MPCA and MPS . . . . . . . . . . . . . . . . . . . . . 84 6.3 Extension of Support Vector Machine to Tensors . . . . . . . . . . . . . . . . . . . 86 6.3.1 Support Tensor Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.3.2 Support Higher-order Tensor Machine . . . . . . . . . . . . . . . . . . . . 89 6.3.3 Kernelized Support Tensor Machine . . . . . . . . . . . . . . . . . . . . . 89 6.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 APPENDIX A USEFUL NOTIONS, DEFINITIONS, AND RELATIONS . . . . . . . 
98 APPENDIX B MEMORY-EFFICIENT MODE-WISE PROJECTION CALCULA- TIONS OF THE ENERGY TERMS . . . . . . . . . . . . . . . . . . . 106 APPENDIX C FASTER KRONECKER JOHNSON-LINDENTRAUSS TRANSFORM109 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 vii LIST OF TABLES Table 5.1: Basis truncation parameters and mode dimensions for single-particle bases labeled by ๐‘’Max. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 viii LIST OF FIGURES Figure 2.1: An example of a 3 ร— 4 ร— 5 tensor. . . . . . . . . . . . . . . . . . . . . . . . . . 5 Figure 2.2: Visualization of a 5-mode tensor by stacking a 3-mode tensor along its 1st and 2nd modes. The result is a 5-mode tensor of size 3 ร— 4 ร— 5 ร— 3 ร— 2. Note that the elements of the stacked versions are not necessarily the same as they are elements corresponding to different indices in the 5-mode tensor. . . . . . . 6 Figure 2.3: An example of the fibers of a 3-mode tensor. Left: mode-3 fibers. Right: mode-1 fibers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Figure 2.4: An example showing how the mode-3 slices of a 3-mode tensor are formed. . . 7 Figure 2.5: An example showing how the mode-1 unfolding of a 3-mode tensor is formed using its mode-1 fibers. Colors are used to show how this is done in a column-major format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Figure 4.1: An example of 2-stage JL embedding applied to a 3-dimensional tensor X โˆˆ R3ร—4ร—5 . The output of the 1st stage is the projected tensor Y = X ร—1 A (1) ร—2 A (2) ร—3 A (3) , where A ( ๐‘—) are JL matrices for ๐‘— โˆˆ {1, 2, 3}, A (1) โˆˆ R2ร—3 , A (2) โˆˆ R3ร—4 , and A (3) โˆˆ R4ร—5 , resulting in Y โˆˆ R2ร—3ร—4 . Matching colors have been used to show how the rows of A ( ๐‘—) interact with the mode- ๐‘— fibers of X (and the intermediate partially compressed tensors) to generate the elements of the mode- ๐‘— unfolding of the result after each ๐‘—-mode product. Next, the resulting tensor is vectorized (leading to y โˆˆ R24 ), and a 2nd -stage JL is then performed to obtain z = Ay where A โˆˆ R3ร—24 , and z โˆˆ R3 . . . . . . . 45 Figure 4.2: Relative norm of randomly generated 4-dimensional data. Here, the total compression will be ๐‘ ๐‘ก๐‘œ๐‘ก = ๐‘41 . (a) Gaussian data. (b) Coherent data. Note that the modewise approach still preserves norms well for the coherent data indicating that the incoherence assumptions utilized in Section 4.2 can likely be relaxed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Figure 4.3: Simulation results averaged over 1000 trials for 3 MRI data samples, where each sample is 3-dimensional. In the 2-stage cases, ๐‘ 2 = 0.05 has been used. (a) Relative norm. (b) Runtime. . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Figure 5.1: Relative reconstruction error of CPD calculated for different values of rank ๐‘Ÿ for MRI data. As the rank increases, the error becomes smaller. . . . . . . . . . 61 ix Figure 5.2: Effect of JL embeddings on the relative reconstruction error of least squares estimation of CPD coefficients. In the 2-stage cases, ๐‘ 2 = 0.05 has been used. (a) ๐‘Ÿ = 40. (b) ๐‘Ÿ = 75. (c) ๐‘Ÿ = 110. (d) Average runtime for ๐‘Ÿ = 40. The other runtime plots for ๐‘Ÿ = 75 and ๐‘Ÿ = 110 are qualitatively identical. . . . . . . 63 Figure 5.3: A block diagram showing how the approximations to ๐‘…1 and ๐‘…2 are calculated. . 
67 Figure 5.4: ๐ธ (2) experiment results for O16, ๐‘’๐‘€๐‘Ž๐‘ฅ = 2. . . . . . . . . . . . . . . . . . . . 73 Figure 5.5: ๐ธ (2) experiment results for O16, ๐‘’๐‘€๐‘Ž๐‘ฅ = 4. . . . . . . . . . . . . . . . . . . . 74 Figure 5.6: ๐ธ (2) experiment results for O16, ๐‘’๐‘€๐‘Ž๐‘ฅ = 8. . . . . . . . . . . . . . . . . . . . 75 Figure 5.7: Relative error in ๐ธ (2) for total compression values of 0.0009 and 0.0125. (a) O16. (b) Sn132. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Figure 5.8: Radius correction results, for interaction em1.8 โˆ’ 2.0 and eMax= 14. (a) Ca48, particle term. (b) Ca48, hole term. (c) Sn132, particle term. (d) Sn132, hole term. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Figure 5.9: Mean absolute relative error in ๐ธ (3) for hole-hole and ๐‘’๐‘€๐‘Ž๐‘ฅ = 8. . . . . . . . . 77 Figure 5.10: Mean absolute relative error in ๐ธ (3) for particle-particle and ๐‘’๐‘€๐‘Ž๐‘ฅ = 8. . . . . . 77 Figure 5.11: Mean absolute relative error in ๐ธ (3) for particle-hole and ๐‘’๐‘€๐‘Ž๐‘ฅ = 8. . . . . . . 78 Figure 6.1: Gray-scale sample images of five objects in the COIL-100 database. . . . . . . 85 Figure 6.2: A lateral slice of a sample MRI image. . . . . . . . . . . . . . . . . . . . . . . 85 Figure 6.3: Training time. (a) COIL-100. (b) MRI. . . . . . . . . . . . . . . . . . . . . . . 86 Figure 6.4: Classification Success Rate. (a) COIL-100. (b) MRI. . . . . . . . . . . . . . . 86 x LIST OF ALGORITHMS Algorithm 3.1: CPD-ALS [12] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Algorithm 3.2: HOOI-ALS [12] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Algorithm 3.3: Tensor-Train [16] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Algorithm 6.1: MPCA [14] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 xi CHAPTER 1 INTRODUCTION The emergence of big data elicits the development of compression methods to efficiently represent such data without losing much information. One of the most well-known techniques to this end is PCA1 which uses the linear structure of a high-dimensional vector and projects it onto the underlying lower-dimensional subspace [2]. However, when the number of dimensions in the data increases, as is the case with matrices and cubes, reshaping the data into a vector becomes troublesome in the sense that it will require huge amounts of memory to store the matrix that will project data elements onto their corresponding principal components. This problem, for instance, can be observed in the case of MRI2 data. Take, for instance, a 240 ร— 240 ร— 155 cube from an MRI data set, containing 8928000 data elements. To reduce the dimensionalty of this vector to 0.1% of its original size, a 8928 ร— 8928000 projection matrix needs to be generated. This means an approximate 594 Gigabytes of data only to store the matrix. It is obvious how intensive memory requirements could be if higher-dimensional data with larger mode sizes were to be dealt with. This simple example clearly demonstrates the importance of dimension reduction techniques that do not rely on the vectorization of higher-dimensional data, and that deal with such data in their original multidimensional form. In addition to computational considerations, one can intuitively observe that if the multilinear structure of tensor data is changed, e.g. 
by vectorization, many conventional approaches that were initially developed for vector data might not yield the same satisfactory results. Generally speaking, although tensors are natural extensions to vectors and matrices, their multidimensional form adds extra complexity to their structure in a way that the notion low-dimensionality becomes challenging to address, especially simply as an extension of the same notion from matrices to tensors. This 1 Principal Component Analysis 2 Magnetic Resonance Imaging 1 issue will be addressed in Section 3.2 where tensor rank is introduced, and it is discussed that there are various notions of rank for tensor data. In the experiments of Section 6.2, it is shown that for higher-dimensional data, a conventional vector-based method such as PCA becomes both computationally more expensive and less accurate compared to its multilinear counterpart. In this case, an extension of PCA to tensor data, abbreviated to MPCA3, will be presented in Chapter 6. This method performs PCA on fibers of the unfoldings of a tensor one mode at a time, and is specifically useful to compress tensors with low Tucker ranks [14]. An efficient method to compute MPCA would be to compress the data using the general randomized embedding proposed in [9], which constitutes the central body of work in this thesis, before starting the main algorithm. This approach can in general be applied as a preprocessing stage to large scale tensor data to alleviate the computational intensity of the subsequent processing scheme. To make tensor compression even more computationally efficient, randomized streaming mappings are of great value where one will not have to store a big tensor to compute its compressed form. Rather, a sketch of the unfoldings of the tensor along with a sketch of the core in the factorization of interest will be enough to construct an approximate version of the decomposition. Such a method has been proposed to compute of the Tucker approximation in [22], and can also be applied to MPCA. On the other hand, dimension reduction methods such as PCA and its extensions, including MPCA, need a set of training samples (vectors, or tensors in the multilinear case) for the projection matrix or matrices to be found. This makes these methods time consuming and computationally less efficient than oblivious approaches such as the Johnson-Lindenstrauss embedding. The term oblivious refers to the fact that one does not need any data sample beforehand to project a new data sample onto a lower-dimensional subspace. In recent years, many tensor dimension reduction techniques that can be applied to low-rank data have been proposed in the literature. Most of these methods are limited to some degree in 3 Multilinear Principal Component Analysis 2 the sense that they either are limited in the number of data modes and/or lack general theoretical guarantees on the error in the geometry preserving property of the embedding. In [23], such theoretical guarantees have been presented for the vectorized form of a tensor projected using the Khatri-Rao product of individual random projections of smaller sizes, and have been extended to 2-mode tensors. In [18, 19], the CountSketch projection matrix has been extended to tensors, named TensorSketch, although the multilinear structure of the tensor is again not preserved. In a more recent version [20], however, TensorSketch is developed based on the Tucker format to extend CountSketch to tensor data. 
The TensorSketch method is mainly developed for polynomial kernels as a special case of rank-1 tensors [19, 3]. There are other more related methods that assume specific data structure for the input tensors and at the same time respect their multimodal structure. In a closely related method abbreviated to KFJLT4 [10], a very large Fast Johnson-Lindenstrauss embedding matrix, addressed in Section 4.2.3, is used in the form of the Kronecker product of smaller fast embeddings, and is applied to a vector also having Kronecker structure corresponding to the vectorized form of a rank-1 tensor. The implementation of KFJLT, however, is done without actually forming the extremely large embedding matrix or input vector, and the elements of the compressed vector (or tensor, equivalently) can be calculated efficiently using the smaller embeddings implicitly applied to the corresponding tensor modes. In each mode, this is made possible by using random rows of the DFT matrix, a vector consisting of Rademacher random variables, and smaller vectors that form the Kronecker structure of the input data. The rank-1 property of the input tensor naturally draws oneโ€™s attention to KFJLT being suitable for the efficient computation of CP decompositions. This method leads to lower computational cost at a small price in the embedding dimension size, and its computational efficiency originates from both the inherent speed of the fast embedding matrices used and also the Kronecker structure of the input data in the vectorized form. KFJLT is shown to work for tensors having general structure with a reduction in performance. A short summary of 4 Kronecker Faster Johnson-Lindenstrauss Transform. 3 how computational efficiency is achieved in this method works is presented in Appendix C. The work presented in this dissertation contains materials discussed in the following publica- tions. In [27], extensions of PCA and its variants to tensor data are discussed for those who are well familiar with PCA methods for vector-type data. In [17], a multiscale HoSVD5 approach has been developed to compress tensors in multiple scales. In the 0th scale, truncated HoSVD is performed on data. The reconstruction error tensor is then partitioned into subtensors in the 1st scale using a clustering algorithm, and truncated HoSVD is applied to each subtensor. This process can be repeated in higher scales depending on the needed tradeoff between reconstruction error and compression. In [9] which constitues the main body of this thesis presented in Chapter 4, a mod- ewise Johnson-Lindenstrauss embedding has been proposed for compressing tensor data without reshaping the tensor into an extremely large vector. Theoretical guarantees for the approximation error have been presented for data with low CP rank. 5 Higher-order Singular Value Dicomposition 4 CHAPTER 2 BACKGROUND: TENSOR BASICS AND ALGEBRA In this chapter, basic concepts and algebraic relations that are used in the statement of problems and proofs are presented. Notation. The type of letters used for tensors, matrices, vectors and scalars are as follows. Calligraphic boldface capital letters (e.g., X) are used for tensors, boldface capital letters for matrices (e.g., X), boldface lower-case letters for vectors (e.g., x), and regular (lower-case or capital) letters for scalars (e.g., ๐‘ฅ or ๐‘‹). For numbers in parentheses used as subscript or superscript, subscript refers to โ€œunfoldingsโ€ while superscript denotes an object in a sequence of objects. We assume [๐‘‘] := {1, . . . 
, ๐‘‘} for all ๐‘‘ โˆˆ N. Whenever used, the vec(ยท) operator generates the vectorized form of its argument. In the following definitions, we assume X โˆˆ C๐‘› ร—...ร—๐‘› . 1 ๐‘‘ Definition 2.0.1 A โ€œtensorโ€ is a multi-dimensional array. A ๐‘‘-way, ๐‘‘-mode or ๐‘‘ th -order tensor is an element of the tensor product of ๐‘‘ vector spaces. Figure 2.1 shows an example of a 3-mode tensor. A tensor is called โ€œcubicalโ€ is all its modes are of the same size, i.e., X โˆˆ C๐‘›ร—...ร—๐‘› . Figure 2.1: An example of a 3 ร— 4 ร— 5 tensor. One can stack 3-mode tensors along modes 1, 2, and 3 of a 3-mode tensor to visualize higher-order data in the 3-dimensional space, where the elements of the stacked versions can obviously be different. For instance, Figure 2.2 illustrates this idea where the 3-mode tensor of Figure 2.1 is 5 stacked along the 1st and 2nd dimensions to simulate the 4th and 5th modes, respectively, leading to a 3 ร— 4 ร— 5 ร— 3 ร— 2 tensor. Figure 2.2: Visualization of a 5-mode tensor by stacking a 3-mode tensor along its 1st and 2nd modes. The result is a 5-mode tensor of size 3 ร— 4 ร— 5 ร— 3 ร— 2. Note that the elements of the stacked versions are not necessarily the same as they are elements corresponding to different indices in the 5-mode tensor. Definition 2.0.2 (Mode- ๐’‹ Fiber) In a ๐‘‘-mode tensor X, a mode- ๐‘— fiber is obtained by fixing all C but the ๐‘— th index, and is denoted by X๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,:, ๐‘— ๐‘—+1 ,...,๐‘– ๐‘‘ โˆˆ ๐‘› ๐‘— for ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ] and ๐‘— โˆˆ [๐‘‘]. There are รŽ ๐‘› ๐‘— mode- ๐‘— fibers. Figure 2.3 depicts how the fibers of a 3-mode tensor are formed. ๐‘–โ‰  ๐‘— Figure 2.3: An example of the fibers of a 3-mode tensor. Left: mode-3 fibers. Right: mode-1 fibers. Definition 2.0.3 (Mode- ๐’‹ Slice) In a ๐‘‘-mode tensor X, a mode- ๐‘— โ€œsliceโ€ is a (๐‘‘ โˆ’ 1)-mode subtensor obtained by fixing the ๐‘— th index. A mode- ๐‘— slice of X is denoted by X:,...,:,๐‘˜,:,...,: โˆˆ C๐‘› ร—...ร—๐‘› 1 ๐‘—โˆ’1 ร—๐‘› ๐‘—+1 ร—...ร—๐‘› ๐‘‘ for ๐‘– ๐‘— = ๐‘˜ โˆˆ [๐‘› ๐‘— ]. There are ๐‘› ๐‘— mode- ๐‘— slices. 6 Figure 2.4: An example showing how the mode-3 slices of a 3-mode tensor are formed. Definition 2.0.4 (Matricization of a Tensor) The process of reshaping a tensor X into a matrix is called โ€œmatricizationโ€, โ€œflatteningโ€ or โ€œunfoldingโ€. The most common way of doing this is C๐‘› ร— รŽ ๐‘›๐‘– the mode- ๐‘— unfolding, denoted by X ( ๐‘—) โˆˆ ๐‘— ๐‘–โ‰  ๐‘— , which has all the mode- ๐‘— fibers of X as its columns. The process of matricization of tensors is linear, in the sense that for X, Y โˆˆ C๐‘› ร—...ร—๐‘› , 1 ๐‘‘ (๐›ผX + ๐›ฝY) ( ๐‘—) = ๐›ผX ( ๐‘—) + ๐›ฝY ( ๐‘—) for ๐‘— โˆˆ [๐‘‘] and ๐›ผ, ๐›ฝ โˆˆ C. Figure 2.5: An example showing how the mode-1 unfolding of a 3-mode tensor is formed using its mode-1 fibers. Colors are used to show how this is done in a column-major format. C Lemma 2.0.1 Assume we have a tensor X โˆˆ ๐‘›1 ร—๐‘›2 ร—ยทยทยทร—๐‘›๐‘‘ , and for ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ], ๐‘— โˆˆ [๐‘‘] and รŽ โ„“ โˆˆ [ ๐‘‘๐‘š=1 ๐‘›๐‘š ], we want to find the element (๐‘– ๐‘— , โ„“) of the mode- ๐‘— unfolding X ( ๐‘—) corresponding ๐‘šโ‰  ๐‘— to the element (๐‘–1 , ๐‘–2 , . . . , ๐‘– ๐‘‘ ) of X1. 
Then, for a given (๐‘– ๐‘— , โ„“), we have X ( ๐‘—) (๐‘– ๐‘— , โ„“) = X๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘– ๐‘— ,๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ , 1 In this thesis, we always use column-major formatting when reshaping tensors to matrices/vectors and vice versa. This essentially means going from lower to higher modes when moving across dimensions. 7 where ร ๐‘‘ ๐‘˜โˆ’1 รŽ $โ„“ โˆ’ 1 โˆ’ (๐‘– ๐‘˜ โˆ’ 1) ๐‘›๐‘š % ๐‘˜=๐‘™+1 ๐‘š=1 ๐‘˜โ‰  ๐‘— ๐‘šโ‰  ๐‘— ๐‘–๐‘™ = + 1, for ๐‘™ โ‰  ๐‘—, (2.1) ๐‘™โˆ’1 รŽ ๐‘›๐‘š ๐‘š=1 ๐‘šโ‰  ๐‘— starting from ๐‘™ = ๐‘‘ and going down to ๐‘™ = 1. Obviously, for ๐‘™ = ๐‘‘, the numerator is reduced to โ„“ โˆ’ 1, and for ๐‘™ = 1, the denominator becomes 1. It is also clear that ๐‘– ๐‘— will be the same in both the tensor and the unfolding. On the other hand, assume we want to obtain a tensor X from its mode- ๐‘— unfolding X ( ๐‘—) . Given indices (๐‘– 1 , . . . , ๐‘– ๐‘‘ ), we want to find the corresponding coordinates (๐‘– ๐‘— , โ„“) in X ( ๐‘—) . We can do so using โˆ‘๏ธ๐‘‘ ๐‘˜โˆ’1 ร– hร– ๐‘‘ i โ„“ =1+ (๐‘– ๐‘˜ โˆ’ 1) ๐‘›๐‘š , for โ„“ โˆˆ ๐‘›๐‘š , (2.2) ๐‘˜=1 ๐‘š=1 ๐‘š=1 ๐‘˜โ‰  ๐‘— ๐‘šโ‰  ๐‘— ๐‘šโ‰  ๐‘— meaning that X(๐‘– 1 , . . . , ๐‘– ๐‘—โˆ’1 , ๐‘– ๐‘— , ๐‘– ๐‘—+1 , . . . , ๐‘– ๐‘‘ ) = X ( ๐‘—) (๐‘– ๐‘— , โ„“), with โ„“ defined above [12]. Definition 2.0.5 (The Standard Inner Product Space of ๐’…-mode Tensors) The set of all ๐‘‘-mode tensors X โˆˆ C๐‘› ร—...ร—๐‘› 1 ๐‘‘ forms a vector space over the field of complex numbers when equipped with component-wise addition and scalar multiplication. The inner product of X and Y is defined as โˆ‘๏ธ๐‘›1 โˆ‘๏ธ ๐‘›2 โˆ‘๏ธ๐‘›๐‘‘ hX, Yi := ... X๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ Y๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ . (2.3) ๐‘– 1 =1 ๐‘–2 =1 ๐‘– ๐‘‘ =1 The standard Euclidean norm can be deduced from this inner product, as v u tโˆ‘๏ธ ๐‘›1 โˆ‘๏ธ ๐‘›2 โˆ‘๏ธ๐‘›๐‘‘ โˆš๏ธ 2 kXk := hX, Xi = ... X๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ . (2.4) ๐‘–1 =1 ๐‘–2 =1 ๐‘– ๐‘‘ =1 8 Definition 2.0.6 (Tensor Outer Product) The tensor outer product of two tensors X โˆˆ C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› 1 2 ๐‘‘ and Y โˆˆ C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› 0 1 0 2 0 ๐‘‘0 , denoted by X Yโˆˆ C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› ร—๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› 1 2 ๐‘‘ 0 1 2 0 0 ๐‘‘0 , is a (๐‘‘ + ๐‘‘ 0)-mode tensor whose entries are given by (X Y) ๐‘–1 ,...,๐‘– ๐‘‘ ,๐‘– 0 ,...,๐‘– 0 0 = X๐‘–1 ,...,๐‘– ๐‘‘ Y๐‘–10 ,...,๐‘– ๐‘‘0 0 . (2.5) 1 ๐‘‘ When X and Y are both vectors, the tensor outer product will be reduced to the standard outer product. Definition 2.0.7 (Rank-1 Tensor) A ๐‘‘-mode tensor X โˆˆ C๐‘› ร—...ร—๐‘› 1 ๐‘‘ is rank-1 if it can be written as the outer product of ๐‘‘ vectors, i.e., X = x (1) x (2) ... x (๐‘‘) =: ๐‘‘ ๐‘—=1 x , ( ๐‘—) (2.6) where x ( ๐‘—) โˆˆ C๐‘› ๐‘— for ๐‘— โˆˆ [๐‘‘]. Definition 2.0.8 ( ๐’‹-mode Product) The ๐‘—-mode product of a ๐‘‘-mode tensor X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘—โˆ’1 ร—๐‘› ๐‘— ร—๐‘› ๐‘—+1 ร—ยทยทยทร—๐‘› ๐‘‘ with a matrix U โˆˆ C๐‘š ร—๐‘› ๐‘— ๐‘— is another ๐‘‘-mode tensor X ร— ๐‘— U โˆˆ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘—โˆ’1 ร—๐‘š ๐‘— ร—๐‘› ๐‘—+1 ร—ยทยทยทร—๐‘› ๐‘‘ whose entries are given by ๐‘›๐‘— โˆ‘๏ธ (X ร— ๐‘— U)๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,โ„“,๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ = X๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘– ๐‘‘ Uโ„“,๐‘– ๐‘— . (2.7) ๐‘– ๐‘— =1 for all (๐‘–1 , . . . , ๐‘– ๐‘—โˆ’1 , โ„“, ๐‘– ๐‘—+1 , . . . 
, ๐‘– ๐‘‘ ) โˆˆ [๐‘›1 ] ร— ยท ยท ยท ร— [๐‘› ๐‘—โˆ’1 ] ร— [๐‘š ๐‘— ] ร— [๐‘› ๐‘—+1 ] ร— ยท ยท ยท ร— [๐‘› ๐‘‘ ]. In terms of the mode- ๐‘— unfoldings of X ร— ๐‘— U and X, it can be observed that (X ร— ๐‘— U) ( ๐‘—) = UX ( ๐‘—) holds for all ๐‘— โˆˆ [๐‘‘]. Lemma 2.0.2 Let X, Y โˆˆ C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› , A, B โˆˆ C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› 1 2 ๐‘‘ 0 1 2 0 0 ๐‘‘0 , ๐›ผ, ๐›ฝ โˆˆ C, and Uโ„“ , Vโ„“ โˆˆ C๐‘š ร—๐‘› โ„“ โ„“ for all โ„“ โˆˆ [๐‘‘]. The following four properties hold: (a) (๐›ผX + ๐›ฝY) A = ๐›ผX A + ๐›ฝY A=X ๐›ผA + Y ๐›ฝA (b) hX A, Y Bi = hX, Yi hA, Bi 9   (c) (๐›ผX + ๐›ฝY) ร— ๐‘— U ๐‘— = ๐›ผ X ร— ๐‘— U ๐‘— + ๐›ฝ Y ร— ๐‘— U ๐‘— .    (d) X ร— ๐‘— ๐›ผU ๐‘— + ๐›ฝV ๐‘— = ๐›ผ X ร— ๐‘— U ๐‘— + ๐›ฝ X ร— ๐‘— V ๐‘— .  (e) If ๐‘— โ‰  โ„“ then X ร— ๐‘— U ๐‘— ร—โ„“ Vโ„“ = X ร— ๐‘— U ๐‘— ร—โ„“ Vโ„“ = (X ร—โ„“ Vโ„“ ) ร— ๐‘— U ๐‘— = X ร—โ„“ Vโ„“ ร— ๐‘— U ๐‘— . (f) If ๐‘Š โˆˆ C ๐‘ร—๐‘š ๐‘—  then X ร— ๐‘— U ๐‘— ร— ๐‘— W = X ร— ๐‘— U ๐‘— ร— ๐‘— W = X ร— ๐‘— WU ๐‘— = X ร— ๐‘— WU ๐‘— .  Proof The proof of (a) can be done element-wise. ((๐›ผX + ๐›ฝY) A) ๐‘–1 ,...,๐‘– ๐‘‘ ,๐‘– 0 ,...,๐‘– 0 0 = (๐›ผX + ๐›ฝY) ๐‘–1 ,...,๐‘– ๐‘‘ A๐‘–10 ,...,๐‘– ๐‘‘0 0 1 ๐‘‘  = ๐›ผX๐‘–1 ,...,๐‘– ๐‘‘ + ๐›ฝY๐‘–1 ,...,๐‘– ๐‘‘ A๐‘–10 ,...,๐‘– ๐‘‘0 0 . To prove (b), we note that 0 ๐‘›๐‘‘ 00 ๐‘›1 ๐‘› ๐‘‘ โˆ‘๏ธ ๐‘›1 โˆ‘๏ธ โˆ‘๏ธ โˆ‘๏ธ hX A, Y Bi = ยทยทยท ยทยทยท X๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ A๐‘–10 ,...,๐‘– ๐‘‘0 0 Y๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ B๐‘–10 ,...,๐‘– ๐‘‘0 0 ๐‘–1 =1 ๐‘– ๐‘‘ =1 ๐‘–10 =1 ๐‘– ๐‘‘0 =1 ๐‘›1 ๐‘›๐‘‘ ! ๐‘›10 ๐‘› ๐‘‘0 0 โˆ‘๏ธ โˆ‘๏ธ ยฉโˆ‘๏ธ โˆ‘๏ธ 0 0 0 0 ยช = ยทยทยท X๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ Y๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ ยญ ยทยทยท A๐‘–1 ,...,๐‘– ๐‘‘ 0 B๐‘–1 ,...,๐‘– ๐‘‘ 0 ยฎ ๐‘– 1 =1 ๐‘– ๐‘‘ =1 ๐‘– 0 =1 ๐‘– 0 =1 ยซ1 ๐‘‘ ยฌ = hX, Yi hA, Bi . The proof of (c), (d), and (f) can be done using the definition of the mode- ๐‘— unfolding. For (e), suppose that โ„“ > ๐‘— (the case where โ„“ < ๐‘— is similar). Set U := U ๐‘— and V := Vโ„“ to simplify subscript 10 notation. We have for all ๐‘˜ โˆˆ [๐‘š ๐‘— ], ๐‘™ โˆˆ [๐‘š โ„“ ], and ๐‘– ๐‘ž โˆˆ [๐‘›๐‘ž ] with ๐‘ž โˆ‰ { ๐‘—, โ„“} that โˆ‘๏ธ๐‘›โ„“    X ร— ๐‘— U ร—โ„“ V ๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘˜,๐‘– ๐‘—+1 ,...,๐‘–โ„“โˆ’1 ,๐‘™,๐‘–โ„“+1 ,...,๐‘– ๐‘‘ = X ร—๐‘— U ๐‘– 1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘˜,๐‘– ๐‘—+1 ,...,๐‘–โ„“ ,...,๐‘– ๐‘‘ V๐‘™,๐‘–โ„“ ๐‘–โ„“ =1 โˆ‘๏ธ๐‘›โ„“ โˆ‘๏ธ ๐‘›๐‘— = X๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘–โ„“ ,...,๐‘– ๐‘‘ U ๐‘˜,๐‘– ๐‘— ยฎ V๐‘™,๐‘–โ„“ ยฉ ยช ยญ ๐‘–โ„“ =1 ยซ๐‘– ๐‘— =1 ๐‘› ๐‘— โˆ‘๏ธ ๐‘›โ„“ !ยฌ โˆ‘๏ธ = X๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘–โ„“ ,...,๐‘– ๐‘‘ V๐‘™,๐‘–โ„“ U ๐‘˜,๐‘– ๐‘— ๐‘– ๐‘— =1 ๐‘–โ„“ =1 โˆ‘๏ธ๐‘›๐‘— = (X ร—โ„“ V) ๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘–โ„“โˆ’1 ,๐‘™,๐‘–โ„“+1 ,...,๐‘– ๐‘‘ U ๐‘˜,๐‘– ๐‘— ๐‘– ๐‘— =1  = (X ร—โ„“ V) ร— ๐‘— U ๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘˜,๐‘– ๐‘—+1 ,...,๐‘–โ„“โˆ’1 ,๐‘™,๐‘–โ„“+1 ,...,๐‘– ๐‘‘ . ? ๐‘‘ Note 2.0.1 Unfolding the tensor Y = X ร—1 U (1) ร—2 U (2) ... ร—๐‘‘ U (๐‘‘) =: X U ( ๐‘—) along the ๐‘— th ๐‘—=1 mode is equivalent to Y ( ๐‘—) = U ( ๐‘—) X ( ๐‘—) (U (๐‘‘) โŠ— ยท ยท ยท โŠ— U ( ๐‘—+1) โŠ— U ( ๐‘—โˆ’1) โŠ— ยท ยท ยท โŠ— U (1) ) > , (2.8) where โŠ— denotes the matrix Kronecker product. 
If X is a superdiagonal tensor2, then all matrices U ( ๐‘—) must have the same number of columns, and (2.8) will be simplified to Y ( ๐‘—) = U ( ๐‘—) X (U (๐‘‘) ยทยทยท U ( ๐‘—+1) U ( ๐‘—โˆ’1) ยทยทยท U (1) ) > , (2.9) where X is a diagonal matrix with the superdiagonal of X as its diagonal. The symbol denotes the Khatri-Rao product, which is defined as the column-wise matching Kronecker product, i.e., for matrices A = [a1 , . . . , a๐ฝ ] โˆˆ C๐ผร—๐ฝ and B = [b1, . . . , b๐ฝ ] โˆˆ C๐พร—๐ฝ , A B = [a1 โŠ— b1 , . . . , a๐ฝ โŠ— b๐ฝ ] โˆˆ C๐ผ๐พร—๐ฝ . The reason for (2.9) being a simplified form of (2.8) lies in the fact that when a ๐‘‘-mode tensor X โˆˆ C๐‘›ร—...ร—๐‘› is superdiagonal, all columns of X( ๐‘—) are zeros except for ๐‘› of them spread evenly, 2A ๐‘‘-mode superdiagonal tensor X โˆˆ C๐‘›ร—...ร—๐‘› is cubical and has nonzero elements only at indices (๐‘–, . . . , ๐‘–) for ๐‘– โˆˆ [๐‘›]. 11 where in the โ„“ th column, only the entry in position ((โ„“ โˆ’ 1) mod ๐‘›) + 1 is nonzero3. This means that all but matching columns in the Kronecker product will be crossed out in the final result, simplifying the kronecker product to the Khatri-Rao product, and reducing X ( ๐‘—) in (2.8) to a diagonal matrix X in (2.9). Note 2.0.2 Vectorizing the tensor Y = X ร—1 U (1) ร—2 U (2) ... ร—๐‘‘ U (๐‘‘) , it is straightforward to show that   y = U (๐‘‘) โŠ— ยท ยท ยท โŠ— U (2) โŠ— U (1) x, (2.10) where x and y are the vectorized forms of X and Y, respectively. Definition 2.0.9 ( ๐’‹-mode Vector Product) The ๐‘—-mode product of a ๐‘‘-mode tensor X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ with a vector v โˆˆ C๐‘› ๐‘— is a (๐‘‘ โˆ’ 1)-mode tensor, and is denoted by X โ€ข ๐‘— v, whose elements are obtained using โˆ‘๏ธ๐‘›๐‘—  X โ€ข๐‘— v ๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ = X๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘– ๐‘‘ v๐‘– ๐‘— , (2.11) ๐‘– ๐‘— =1 which means that the ๐‘— th mode of X is contracted with v. If we want to keep the ๐‘— th mode with dimension size 1, meaning X โ€ข ๐‘— v โˆˆ C๐‘› ร—๐‘› 1 ๐‘—โˆ’1 ร—1ร—๐‘› ๐‘—+1 ...ร—๐‘› ๐‘‘ , a useful interpretation of this will be X โ€ข ๐‘— v = X ร— ๐‘— v> , (2.12) which is equivalent to  > v> X ( ๐‘—) = X โ€ข ๐‘— v  ( ๐‘—) = vec X โ€ข ๐‘— v . (2.13) The mode- ๐‘— vector product can be used to define eigenvalue problems for tensors. For a super- symmetric tensor4 X โˆˆ C๐‘›ร—...ร—๐‘› , the scalar ๐œ† is an eigenvalue with the corresponding eigenvector vโˆˆ C๐‘› if X โ€ข2 v โ€ข3 v ยท ยท ยท โ€ข๐‘‘ v = ๐œ†v. 3 For integers ๐‘Ž and ๐‘, we assume that ๐‘Ž mod ๐‘ โˆˆ {0, . . . , ๐‘ โˆ’ 1}. 4A cubical tensor is called supersymmetric if its elements remain the same under any permutaion of indices. Obviously, superdiagonal tensors are also supersymmetric. 12 The ๐‘—-mode vector product is also used in developing Support Vector Machines for tensors, where ๐‘‘ optimization problems for ๐‘‘ modes of a tensor are mixed to form one problem for the whole tensor [6]. Definition 2.0.10 (Tensor (๐’Œ, ๐’‹)-Contraction) Consider tensors X โˆˆ C๐‘› ร—...ร—๐‘› ร—...ร—๐‘› 1 ๐‘— ๐‘‘ and Y โˆˆ C๐‘š ร—...ร—๐‘š 1 ๐‘˜โˆ’1 ร—๐‘› ๐‘— ร—๐‘š ๐‘˜+1 ร—...ร—๐‘š ๐‘‘ 0 . 
Then, for each ๐‘— โˆˆ [๐‘‘] and ๐‘˜ โˆˆ [๐‘‘ 0], the (๐‘˜, ๐‘—)-contraction of X and Y, which is the contraction of modes ๐‘— of X and ๐‘˜ of Y, is a (๐‘‘ + ๐‘‘ 0 โˆ’ 2)-dimensional array denoted by Z := X ร— ๐‘˜๐‘— Y โˆˆ C๐‘› ร—...ร—๐‘› 1 ๐‘—โˆ’1 ร—๐‘š โ„“ ร—๐‘› ๐‘—+1 ร—...ร—๐‘› ๐‘‘ ร—๐‘š ๐ฟ 0 ร—...ร—๐‘š ๐ฟ 0 0 1 ๐‘‘ โˆ’2 , where โ„“ = min{[๐‘‘ 0] \ ๐‘˜ }, ๐ฟ 0 := [๐‘‘ 0] \ {๐‘˜, โ„“}, and ๐ฟ 0โ„Ž denotes the โ„Žth element of the set ๐ฟ 0. It is observed that โ„“ = 1 for all choices of ๐‘˜ except for ๐‘˜ = 1 in which case โ„“ = 2. Element-wise, โˆ‘๏ธ๐‘›๐‘— Z๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘žโ„“ ,๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ ,๐‘ž ๐ฟ 0 ,...,๐‘ž ๐ฟ 0 = X๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘– ๐‘‘ Y๐‘ž1 ,...,๐‘ž ๐‘˜โˆ’1 ,๐‘– ๐‘— ,๐‘ž ๐‘˜+1 ,...,๐‘ž ๐‘‘ 0 , 1 ๐‘‘ 0 โˆ’2 ๐‘– ๐‘— =1 (2.14) for ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ], ๐‘— โˆˆ [๐‘‘], ๐‘ž โ„“ โˆˆ [๐‘š โ„“ ], ๐‘ž ๐ฟ ๐‘–0 โˆˆ [๐‘š ๐ฟ โ„Ž0 ], โ„Ž โˆˆ [๐‘‘ 0 โˆ’ 2]. Note 2.0.3 If ๐‘˜ = ๐‘‘ 0 = 2, then โ„“ = min{[2] \ 2} = 1 and ๐ฟ 0 = [2] \ {1, 2} = โˆ…, and the (2, ๐‘—)- contraction of X and Y (which is a matrix now, denoted by Y) is reduced to the familiar ๐‘—-mode product X ร— ๐‘— Y. Note 2.0.4 One can also define the (๐‘˜, ๐‘—)-contraction of X and Y in a way that modes of Y are interleaved right after contracting the ๐‘— th mode of X, i.e., Zโˆˆ C๐‘› ร—...ร—๐‘› 1 ๐‘—โˆ’1 ร—๐‘š โ„“ ร—๐‘š ๐ฟ 0 ร—...ร—๐‘š ๐ฟ 0 0 ร—๐‘› ๐‘—+1 ร—...ร—๐‘› ๐‘‘ 1 ๐‘‘ โˆ’2 , and โˆ‘๏ธ๐‘›๐‘— Z๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘žโ„“ ,๐‘ž ๐ฟ 0 ,...,๐‘ž ๐ฟ 0 ,๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ = X๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘– ๐‘‘ Y๐‘ž1 ,...,๐‘ž ๐‘˜โˆ’1 ,๐‘– ๐‘— ,๐‘ž ๐‘˜+1 ,...,๐‘ž ๐‘‘ 0 , 1 ๐‘‘ 0 โˆ’2 ๐‘– ๐‘— =1 (2.15) for ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ], ๐‘— โˆˆ [๐‘‘], ๐‘ž โ„“ โˆˆ [๐‘š โ„“ ], ๐‘ž ๐ฟ ๐‘–0 โˆˆ [๐‘š ๐ฟ ๐‘–0 ], ๐‘– โˆˆ [๐‘‘ 0 โˆ’ 2]. 13 CHAPTER 3 TENSOR DECOMPOSITIONS AND RANK In this section, a short review on the more commonly used tensor decompositions is presented. For simplicity, the elements of the tensors as well as scalars are defined over the field of real numbers. Results can be extended to the field of complex numbers with slight modifications. This will specifically be the case in Chapter 4. 3.1 The CANDECOMP/PARAFAC Decomposition This factorization, abbreviated to CPD, decomposes a tensor X into the (weighted) sum of rank-1 tensors [12]. For X โˆˆ R๐‘› ร—...ร—๐‘› , 1 ๐‘‘ ๐‘Ÿ โˆ‘๏ธ X โ‰ˆ Xฬ‚ = ๐‘” ๐‘˜ a ๐‘˜(1) a ๐‘˜(2) ยทยทยท a ๐‘˜(๐‘‘) , (3.1) ๐‘˜=1 where denotes the tensor outer product. The vector a ๐‘˜ ( ๐‘—) โˆˆ R๐‘› ๐‘— can be considered as the ๐‘˜ th column in a matrix A ( ๐‘—) โˆˆ R๐‘› ร—๐‘Ÿ for ๐‘— โˆˆ [๐‘‘]. The scalar ๐‘”๐‘˜ can be considered as the ๐‘˜ th element of ๐‘— a vector g. Therefore, if g is set as the superdiagonal of a diagonal tensor G, called the core tensor, then Xฬ‚ = G ร—1 A (1) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) . (3.2) Component-wise, we have ๐‘Ÿ โˆ‘๏ธ Xฬ‚๐‘–1 ,...,๐‘– ๐‘‘ = ๐‘” ๐‘˜ A๐‘–(1) A (2) . . . A๐‘–(๐‘‘) 1 ,๐‘˜ ๐‘– 2 ,๐‘˜ ๐‘‘ ,๐‘˜ . (3.3) ๐‘˜=1 Considering the superdiagonality of G, and according to (2.9), the relation between the unfold- ings of X and G may be written as Xฬ‚ ( ๐‘—) = A ( ๐‘—) G(A (๐‘‘) ยทยทยท A ( ๐‘—+1) A ( ๐‘—โˆ’1) ยทยทยท A (1) ) > , (3.4) where G is a diagonal matrix with g as its diagonal. 14 3.1.1 Uniquensess of CPD CPD is unique under weak conditions; there is a permutation and scaling indeterminacy. 
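Both indeterminacies are straightforward to confirm numerically before stating them formally. The sketch below (Python/NumPy, illustrative only; the helper cp_to_tensor is an ad hoc name rather than thesis code) assembles a rank-r tensor from its factors as in (3.1) and checks that a consistent column permutation of all factor matrices, or a column rescaling in one mode compensated in another, leaves the reconstructed tensor unchanged.

# Illustrative sketch (not thesis code): CP permutation and scaling indeterminacies.
import numpy as np
from functools import reduce

def cp_to_tensor(g, factors):
    # sum_k g_k * a_k^(1) o a_k^(2) o ... o a_k^(d), cf. (3.1)
    X = 0.0
    for k in range(g.size):
        X = X + g[k] * reduce(np.multiply.outer, [A[:, k] for A in factors])
    return X

rng = np.random.default_rng(0)
n, r = (3, 4, 5), 2
A = [rng.standard_normal((nj, r)) for nj in n]
g = rng.standard_normal(r)
X = cp_to_tensor(g, A)

# permutation indeterminacy: permute the columns of every factor (and g) consistently
perm = rng.permutation(r)
assert np.allclose(X, cp_to_tensor(g[perm], [Aj[:, perm] for Aj in A]))

# scaling indeterminacy: rescale the columns of one factor, absorb the inverse elsewhere
s = rng.uniform(0.5, 2.0, size=r)
assert np.allclose(X, cp_to_tensor(g, [A[0] * s, A[1] / s, A[2]]))
print("permutation and scaling indeterminacies verified")

The formal statements of these indeterminacies are as follows.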
For a permutation matrix ฮ  โˆˆ R๐‘Ÿร—๐‘Ÿ ,     Xฬ‚ = G ร—1 A (1) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) = G ร—1 A (1) ฮ  ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) ฮ  , (3.5) implying that as long as the columns of the factor matrices are permuted in the same way, the factorization will not change. This is also evident from (3.1) where the order in which the terms in the summation are added together does not matter. As for the scaling indeterminacy, we observe that โˆ‘๏ธ๐‘Ÿ       Xฬ‚ = ๐‘” ๐‘˜ ๐›ผ ๐‘˜(1) a ๐‘˜(1) ๐›ผ๐‘˜(2) a ๐‘˜(2) ยทยทยท ๐›ผ ๐‘˜(๐‘‘) a ๐‘˜(๐‘‘) , (3.6) ๐‘˜=1 as long as ๐›ผ ๐‘˜(1) ๐›ผ๐‘˜(2) . . . ๐›ผ๐‘˜(๐‘‘) = 1. A sufficient condition for the uniqueness of CPD is [21] ๐‘‘ โˆ‘๏ธ ๐‘˜ A ( ๐‘—) โ‰ฅ 2๐‘Ÿ + ๐‘‘ โˆ’ 1, (3.7) ๐‘—=1 where ๐‘˜ A is the ๐‘˜-rank of a matrix A, and is defined as the largest number ๐‘˜ such that any ๐‘˜ columns of A are linearly independent. Equation (3.7) is also the necessary condition for uniqueness of CPD for ๐‘Ÿ = 2, 3 but not for ๐‘Ÿ โ‰ฅ 4. In its general form, the necessary condition for the uniqueness of CPD is [26]   (1) ( ๐‘—โˆ’1) ( ๐‘—+1) (๐‘‘) min rank A ยทยทยท A A ยทยทยท A = ๐‘Ÿ. (3.8) ๐‘— โˆˆ[๐‘‘] However, noting that rank(A B) โ‰ค rank(A โŠ— B) โ‰ค rank(A) rank(B), then (3.8) can be simplified to ยฉร– ๐‘‘  ยช min ยญ rank A (๐‘š) ยฎ โ‰ฅ ๐‘Ÿ. (3.9) ยญ ยฎ ๐‘— โˆˆ[๐‘‘] ยญ ยฎ ๐‘š=1 ยซ๐‘šโ‰  ๐‘— ยฌ 15 Algorithm 3.1: CPD-ALS [12] R initialize A ( ๐‘—) โˆˆ ๐‘› ๐‘— ร—๐‘Ÿ for ๐‘— โˆˆ [๐‘‘] repeat for ๐‘— = 1, . . . , ๐‘‘ do V โ† A (1)> A(1) โˆ— ยท ยท ยท โˆ— A ( ๐‘—โˆ’1)> A ( ๐‘—โˆ’1) โˆ— A ( ๐‘—+1)> A ( ๐‘—+1) โˆ— ยท ยท ยท โˆ— A (๐‘‘)> A (๐‘‘) A ( ๐‘—) โ† X ( ๐‘—) A (๐‘‘) ยท ยท ยท A ( ๐‘—+1) A ( ๐‘—โˆ’1) ยท ยท ยท A (1) Vโ€  normalize columns of A ( ๐‘—) storing norms as g end for until fit ceases to improve or maximum iterations exhausted return g, A ( ๐‘—) for ๐‘— โˆˆ [๐‘‘] 3.1.2 Computing CPD At first, assume the number of rank-1 tensors is known beforehand. The problem is now the calculation of factor matrices A ( ๐‘—) for ๐‘— โˆˆ [๐‘‘] and g in (3.1), i.e. the solution to โˆ‘๏ธ๐‘Ÿ min kX โˆ’ Xฬ‚k with Xฬ‚ = ๐‘” ๐‘˜ a ๐‘˜(1) a ๐‘˜(2) ยทยทยท a ๐‘˜(๐‘‘) . (3.10) Xฬ‚ ๐‘˜=1 Alternating Least Squares (ALS) is a common way of finding the fit. For instance, assume that X is a 3-mode tensor. Then, in light of (3.10) and given that the Euclidean norm of a tensor is equal to the Frobenius norm of any of its unfoldings, finding A (1) would be done by solving  > A (1) = min X (1) โˆ’ Aฬ‚ A (3) A (2) , (3.11) Aฬ‚ ๐น where Aฬ‚ = AG with G being an ๐‘Ÿ ร— ๐‘Ÿ diagonal matrix with ๐‘” ๐‘˜ forming its diagonal. The optimal  >โ€  solution to (3.11) would be Aฬ‚ = X (1) A (3) A (2) which can be rearranged as   โ€  (3) (2) (3)> (3) (2)> (2) Aฬ‚ = X (1) A A A A โˆ—A A . Here, the symbol โˆ— denotes the Hadamard product, and โ€  represents the pseudo-inverse of a matrix. Extending the same idea to a ๐‘‘-mode tensor is outlined in Algorithm 3.1, and is called CPD-ALS. The initialization for A ( ๐‘—) could be either random or using ๐‘Ÿ leading left singular vectors of X ( ๐‘—) . The remaining question is how to choose ๐‘Ÿ. Most methods fit multiple CP decompositions with different number of components until one is good, i.e., the one that yields an exact representation in the Euclidean norm. 16 For noiseless data, CPD is computed for ๐‘Ÿ = 1, 2, . . . , and the first value of ๐‘Ÿ that gives a 100% fit is chosen as rank. This may indeed not work in the case of degenerate tensors (see 3.2.1). 
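In the noiseless, exact-rank setting just described, Algorithm 3.1 can be sketched and tested in a few lines. The following Python/NumPy sketch is illustrative only: the helpers unfold and khatri_rao are ad hoc and follow the column-major conventions of Chapter 2, no stopping criterion beyond a fixed number of sweeps is implemented, and on a small random tensor of exact rank r the relative reconstruction error typically drops to near machine precision.

# Illustrative sketch (not thesis code): CPD-ALS as in Algorithm 3.1.
import numpy as np

def unfold(X, j):
    # mode-j unfolding, column-major over the remaining modes (Chapter 2)
    return np.reshape(np.moveaxis(X, j, 0), (X.shape[j], -1), order='F')

def khatri_rao(mats):
    # column-wise Kronecker product of matrices that all have r columns
    r = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = (out[:, None, :] * M[None, :, :]).reshape(-1, r)
    return out

def cpd_als(X, r, n_iter=100, seed=0):
    d = X.ndim
    rng = np.random.default_rng(seed)
    A = [rng.standard_normal((X.shape[j], r)) for j in range(d)]
    g = np.ones(r)
    for _ in range(n_iter):
        for j in range(d):
            V = np.ones((r, r))
            for k in range(d):
                if k != j:
                    V *= A[k].T @ A[k]            # Hadamard product of Gram matrices
            KR = khatri_rao([A[k] for k in reversed(range(d)) if k != j])
            A[j] = unfold(X, j) @ KR @ np.linalg.pinv(V)
            g = np.linalg.norm(A[j], axis=0)      # normalize columns, storing norms as g
            A[j] = A[j] / g
    return g, A

# test: recover a random tensor of exact rank 3
rng = np.random.default_rng(1)
n, r = (6, 7, 8), 3
B = [rng.standard_normal((nj, r)) for nj in n]
X = (B[0] @ khatri_rao([B[2], B[1]]).T).reshape(n, order='F')   # fold the mode-1 unfolding
g, A = cpd_als(X, r)
Xhat_1 = (A[0] * g) @ khatri_rao([A[2], A[1]]).T                # reconstruction via (3.4)
print("relative error:", np.linalg.norm(Xhat_1 - unfold(X, 0)) / np.linalg.norm(X))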
For noisy data, which is almost always the case, the fit alone fails to determine rank. A commonly used consistency diagnostic called CORCONDIA1 is employed to determine the proper number of components [4]. Assume the factor matrices A (1) , . . . , A (๐‘‘) are fixed, i.e., they have been obtained using a CP procedure. A Tucker model (see 3.3) is next assumed to represent the data as in X โ‰ˆ G ร—1 A (1) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) . Noting that the Euclidean norm of a tensor is equal to the Frobenius norm of any of its unfoldings as well as the 2-norm of its vectorized form, the core G can be found by solving > 2 1 1 ! 2 ยฉรŒ ยช รŒ ( ๐‘—) (โ„“) ยฎ min X ( ๐‘—) โˆ’ A G ( ๐‘—) ยญ A ยฎ = min vec (X) โˆ’ A (โ„“) vec (G) ยญ , G ( ๐‘—) ยญ ยฎ vec(G) โ„“=๐‘‘ โ„“=๐‘‘ 2 โ„“โ‰  ๐‘— ยซ ยฌ ๐น for ๐‘— โˆˆ [๐‘‘], which can be treated as a least squares problem. Now, the question is how close the core is to a diagonal tensor with a superdiagonal of ones. If there is a 100% match, then the perfect fit has been found. The reason why a diagonal core is sought is that in a perfect CP model, interaction exists only between parallel factors of different modes. 3.2 Tensor Rank The rank of a tensor X is defined as the smallest number of rank-1 tensors that generate X as their sum. In other words, it is the smallest number of components in an exact CP decomposition. Although this definition is similar to the definition of rank in matrices, the properties of tensor rank are very different from matrix rank. The major difference is that there is no straightforward algorithm to compute the rank of a tensor, and in practice, it is determined numerically by fitting various rank-๐‘Ÿ CP models. 1 CORe CONsistency DIAgnostic 17 Other types of rank that are used for tensors are maximum rank and typical rank. Maximum rank is defined as the largest attainable rank of a tensor. Typical rank is defined as any rank that occurs with probability greater than zero when the elements of the tensor are drawn randomly from a uniform continuous distribution. Typical rank and maximum rank are the same for matrices. However, they may be different for tensors, and there might be more than one typical rank. 3.2.1 Low-rank approximation and border rank ร๐‘Ÿ For a matrix A with rank ๐‘Ÿ and a decomposition of the form A = ๐‘–=1 ๐œŽ๐‘– u๐‘– v๐‘‡๐‘– where ๐œŽ1 โ‰ฅ ยท ยท ยท โ‰ฅ ๐œŽ๐‘Ÿ , the best rank-๐‘˜ approximation (๐‘˜ โ‰ค ๐‘Ÿ) will be obtained by keeping the ๐‘˜ leading factors, i.e., ร๐‘˜ Aฬ‚ = ๐‘–=1 ๐œŽ๐‘– u๐‘– v๐‘‡๐‘– . For tensors, this might not be the case; the best rank-๐‘˜ approximation may not even exist, which is a problem of degeneracy. A tensor is degenerate if it can be approximated arbitrarily well by a factorization of lower rank. When a low-rank approximation does not exist for a tensor, it is useful to introduce the concept of border rank. It is defined as the minimum number of rank-one tensors that approximate it with arbitrarily small non-zero error, i.e., rank(X) g = min{ ๐‘Ÿ | โˆ€๐œ€ > 0, โˆƒE; kX โˆ’ E k < ๐œ€, rank(E) = ๐‘Ÿ}. (3.12) Obviously, rank(X) g โ‰ค rank(X). 3.3 Compression and the Tucker Decomposition The Tucker decomposition can be considered as an extension to CPD, as well as a higher-order principal component analysis. A tensor X โˆˆ R๐‘› ร—...ร—๐‘› 1 ๐‘‘ is decomposed in the Tucker format in the following way. 
๐‘Ÿ1 โˆ‘๏ธ โˆ‘๏ธ๐‘Ÿ๐‘‘ X โ‰ˆ Xฬ‚ = ยทยทยท G๐‘˜ 1 ,...,๐‘˜ ๐‘‘ a ๐‘˜(1) 1 ยทยทยท a ๐‘˜(๐‘‘) ๐‘‘ ๐‘˜ 1 =1 ๐‘˜ ๐‘‘ =1 (3.13) = G ร—1 A (1) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) , 18 where G โˆˆ R๐‘Ÿ ร—ยทยทยทร—๐‘Ÿ 1 ๐‘‘ is the core tensor with ๐‘Ÿ ๐‘— โ‰ค ๐‘› ๐‘— for ๐‘— โˆˆ [๐‘‘], and A ( ๐‘—) โˆˆ R๐‘› ร—๐‘Ÿ ๐‘— ๐‘— is the ๐‘— th factor ( ๐‘—) matrix whose ๐‘˜ ๐‘—th column is a ๐‘˜ ๐‘— for ๐‘˜ ๐‘— โˆˆ [๐‘Ÿ ๐‘— ]. Component-wise, we have โˆ‘๏ธ๐‘Ÿ1 โˆ‘๏ธ๐‘Ÿ๐‘‘ Xฬ‚๐‘–1 ,...,๐‘– ๐‘‘ = ยทยทยท G๐‘˜ 1 ,...,๐‘˜ ๐‘‘ A๐‘–(1) A (2) . . . A๐‘–(๐‘‘) 1 ,๐‘˜ 1 ๐‘– 2 ,๐‘˜ 2 ๐‘‘ ,๐‘˜ ๐‘‘ . (3.14) ๐‘˜ 1 =1 ๐‘˜ ๐‘‘ =1 รŽ๐‘‘ As can be seen, Tucker approximates a tensor as the linear combination of ๐‘—=1 ๐‘Ÿ ๐‘— rank-1 tensors. Unlike CPD where the interaction between modes is restricted to matching columns of the factor matrices, in Tucker, this interaction occurs between all possible combinations of columns, and the level of interaction is governed by the elements of G. It is observed that if ๐‘Ÿ ๐‘— < ๐‘› ๐‘— for at least one ๐‘—, the size of the core tensor G will be smaller than the size of X. This means that G can be thought of as a compressed version of X. 3.3.1 ๐‘—-rank The column rank of X ( ๐‘—) is defined as the ๐‘—-rank of X and is denoted by rank ๐‘— (X). If ๐‘Ÿ ๐‘— = rank ๐‘— (X) for ๐‘— โˆˆ [๐‘‘], then X is said to have an exact rank-(๐‘Ÿ 1 , ๐‘Ÿ 2 , . . . , ๐‘Ÿ ๐‘‘ ) Tucker decomposition. Obviously, รŽ ๐‘Ÿ ๐‘— โ‰ค ๐‘› ๐‘— and rank ๐‘— (X) โ‰ค min{๐‘› ๐‘— , โ„“โ‰  ๐‘— ๐‘›โ„“ }. If ๐‘Ÿ ๐‘— โ‰ค rank ๐‘— (X) for at least one ๐‘—, then X cannot be reconstructed exactly from its Tucker representation. 3.3.2 Computing the Tucker Decomposition In one of the first methods developed to compute the Tucker decomposition, the basic idea is to find those components (rank-1 tensors) that capture the most variations in each mode. This method is known as the Higher-order SVD (HoSVD) as a generalization of the matrix SVD, and computes the singular vectors of the mode- ๐‘— unfoldings of a tensor X. For ๐‘Ÿ ๐‘— โ‰ค rank ๐‘— (X), the method is called truncated HoSVD. This method is not optimal, but it can be used as a good starting point in an ALS algorithm whose goal is to compute the Tucker decomposition. 19 Algorithm 3.2: HOOI-ALS [12] R initialize A ( ๐‘—) โˆˆ ๐‘› ๐‘— ร—๐‘Ÿ ๐‘— for ๐‘— โˆˆ [๐‘‘] using HoSVD repeat for ๐‘— = 1, . . . , ๐‘‘ do Y โ† X ร—1 A (1)> ร— ยท ยท ยท ร— ๐‘—โˆ’1 A ( ๐‘—โˆ’1)> ร— ๐‘—+1 A ( ๐‘—+1)> ร—๐‘‘ A (๐‘‘)> A ( ๐‘—) โ† ๐‘Ÿ ๐‘— leading left singular vectors of Y ( ๐‘—) end for until fit ceases to improve or maximum iterations exhausted G โ† X ร—1 A (1)> ยท ยท ยท ร—๐‘‘ A (๐‘‘)> return G, A ( ๐‘—) for ๐‘— โˆˆ [๐‘‘] To compute HoSVD, ๐‘Ÿ ๐‘— leading left singular vectors of X ( ๐‘—) is set as A ( ๐‘—) for all ๐‘— โˆˆ [๐‘‘]. Then the core tensor is computed using G = X ร—1 A (1)> ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘)> . The Higher-Order Orthogonal Iteration, abbreviated to HOOI, is used as the ALS method that takes the result of HoSVD as input. The optimization problem that is solved by HOOI is expressed as min kX โˆ’ G ร—1 A (1) ยท ยท ยท ร—๐‘‘ A (๐‘‘) k. R subject to Gโˆˆ ๐‘Ÿ1 ร—...ร—๐‘Ÿ๐‘‘ (3.15) A โˆˆ( ๐‘—) R ๐‘› ๐‘— ร—๐‘Ÿ ๐‘— and column-wise orthogonal The pseudo-code for HOOI is shown in Algorithm 3.2. 3.3.3 Uniqueness of Tucker The Tucker decomposition is not unique. The core tensor can be modified as long as the inverse modification is applied to the factor matrices. 
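Before illustrating this non-uniqueness for the three-mode case below, the computation of the decomposition itself (Section 3.3.2) can be made concrete. The following Python/NumPy sketch implements the truncated HoSVD initialization and the HOOI sweeps of Algorithm 3.2; it is illustrative only, with ad hoc helper names and no convergence check beyond a fixed number of sweeps.

# Illustrative sketch (not thesis code): truncated HoSVD and HOOI as in Algorithm 3.2.
import numpy as np

def unfold(X, j):
    return np.reshape(np.moveaxis(X, j, 0), (X.shape[j], -1), order='F')

def mode_product(X, U, j):
    return np.moveaxis(np.tensordot(U, X, axes=(1, j)), 0, j)

def multi_mode_product(X, Us, skip=None, transpose=False):
    # X x_1 U_1 x_2 U_2 ..., optionally transposing each U_j and skipping one mode
    for j, U in enumerate(Us):
        if j != skip:
            X = mode_product(X, U.T if transpose else U, j)
    return X

def hosvd(X, ranks):
    # truncated HoSVD: r_j leading left singular vectors of each unfolding
    A = [np.linalg.svd(unfold(X, j), full_matrices=False)[0][:, :ranks[j]]
         for j in range(X.ndim)]
    return multi_mode_product(X, A, transpose=True), A

def hooi(X, ranks, n_iter=20):
    G, A = hosvd(X, ranks)                                    # HoSVD initialization
    for _ in range(n_iter):
        for j in range(X.ndim):
            Y = multi_mode_product(X, A, skip=j, transpose=True)
            A[j] = np.linalg.svd(unfold(Y, j), full_matrices=False)[0][:, :ranks[j]]
    return multi_mode_product(X, A, transpose=True), A

# test on a tensor with exact multilinear rank (2, 3, 2)
rng = np.random.default_rng(0)
ranks, n = (2, 3, 2), (5, 6, 7)
G0 = rng.standard_normal(ranks)
A0 = [np.linalg.qr(rng.standard_normal((n[j], ranks[j])))[0] for j in range(3)]
X = multi_mode_product(G0, A0)
G, A = hooi(X, ranks)
Xhat = multi_mode_product(G, A)
print("relative error:", np.linalg.norm(Xhat - X) / np.linalg.norm(X))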
For instance, consider the 3-mode case. For nonsingular matrices U โˆˆ R๐‘› ร—๐‘› , V โˆˆ R๐‘› ร—๐‘› 1 1 2 2 and W โˆˆ R๐‘› ร—๐‘› , we have 3 3       G ร—1 A (1) ร—2 A (2) ร—3 A (3) = (G ร—1 U ร—2 V ร—3 W) ร—1 A (1) Uโˆ’1 ร—2 A (2) Vโˆ’1 ร—3 A (3) Wโˆ’1 . 20       Proof Let H := (G ร—1 U ร—2 V ร—3 W) ร—1 A (1) Uโˆ’1 ร—2 A (2) Vโˆ’1 ร—3 A (3) Wโˆ’1 , then we have  > H (1) = A (1) Uโˆ’1 (G ร—1 U ร—2 V ร—3 W) (1) A (3) Wโˆ’1 โŠ— A (2) Vโˆ’1  > = A (1) Uโˆ’1 UG (1) (W โŠ— V) > A (3) Wโˆ’1 โŠ— A (2) Vโˆ’1 h  i> (1) (3) โˆ’1 (2) โˆ’1 = A G (1) A W โŠ— A V (W โŠ— V) (3.16)  >  > (1) (3) โˆ’1 (2) โˆ’1 (1) (3) (2) = A G (1) A W W โŠ— A V V = A G (1) A โŠ— A   (1) (2) (3) = G ร—1 A ร—2 A ร—3 A , (1) implying the two tensors are equal. A similar approach can be applied to modes 2 and 3. 3.4 Tensor-Train Decomposition The Tensor-Train decomposition, which is also known as MPS2 in the physics community, decom- poses a ๐‘‘-mode tensor into a chain of ๐‘‘ lower-dimensional tensors of at most 3 modes [16]. A tensor X is approximated by another tensor Xฬ‚ whose elements are expressed as contractions of lower-dimensional tensors G ( ๐‘—) for ๐‘— โˆˆ [๐‘‘], โˆ‘๏ธ๐‘Ÿ 0 โˆ‘๏ธ ๐‘Ÿ1 ๐‘Ÿ๐‘‘ โˆ‘๏ธ Xฬ‚๐‘–1 ,...,๐‘– ๐‘‘ = ยทยทยท G๐›ผ(1) G (2) 0 ,๐‘– 1 ,๐›ผ1 ๐›ผ1 ,๐‘– 2 ,๐›ผ2 . . . G๐›ผ(๐‘‘) ๐‘‘โˆ’1 ,๐‘– ๐‘‘ ,๐›ผ ๐‘‘ , (3.17) ๐›ผ0 =1 ๐›ผ1 =1 ๐›ผ๐‘‘ =1 where G ( ๐‘—) โˆˆ R๐‘Ÿ ๐‘—โˆ’1 ร—๐‘› ๐‘— ร—๐‘Ÿ ๐‘— for ๐‘— โˆˆ [๐‘‘], and ๐‘Ÿ 0 = ๐‘Ÿ ๐‘‘ = 1, i.e., G (1) and G (๐‘‘) are in fact matrices. In other words, elements of Xฬ‚ can be obtained by employing Xฬ‚๐‘–1 ,...,๐‘– ๐‘‘ = G (1) (๐‘–1 ) G (2) (๐‘– 2 ) ยท ยท ยท G (๐‘‘) (๐‘– ๐‘‘ ) , (3.18) where G ( ๐‘—) ๐‘– ๐‘— โˆˆ  R๐‘Ÿ ๐‘—โˆ’1 ร—๐‘Ÿ ๐‘— for ๐‘— โˆˆ [๐‘‘] is the ๐‘– th๐‘— lateral slice of G ( ๐‘—) . The tensor-Train decomposition is calculated by a sequence of SVDโ€™s starting with X (1) . Assuming the Tensor-Train ranks ๐‘Ÿ ๐‘— are known, a simplified version of the Tensor-Train algorithm is presented in Algorithm 3.3. For a more detailed version also including how to choose the ranks ๐‘Ÿ ๐‘— , see [16]. 2 Matrix Product State 21 Algorithm 3.3: Tensor-Train [16] R รŽ ๐‘›1 ร— ๐‘›โ„“ Compute the SVD of X (1) : X (1) = U (1) S (1) V (1)> โˆˆ โ„“โ‰ 1 . Compute G (1) = U (1) (:, 1 : ๐‘Ÿ 1 ) โˆˆ R๐‘› ร—๐‘Ÿ . 1 1 for ๐‘— = 2, . . . , ๐‘‘ โˆ’ 1 do R รŽ ๐‘Ÿ ๐‘—โˆ’1 ร— ๐‘›โ„“ Compute W ( ๐‘—โˆ’1) = S ( ๐‘—โˆ’1) V ( ๐‘—โˆ’1)> โˆˆ โ„“โˆ‰[ ๐‘—โˆ’1] . R รŽ ๐‘Ÿ ๐‘—โˆ’1 ๐‘› ๐‘— ร— ๐‘›โ„“ Reshape W ( ๐‘—โˆ’1) into W ( ๐‘—โˆ’1) โˆˆ โ„“โˆ‰[ ๐‘— ] . Compute W ( ๐‘—โˆ’1) =U S V( ๐‘—) ( ๐‘—) ( ๐‘—)> . Truncate U โˆˆ ( ๐‘—) R ๐‘Ÿ ๐‘—โˆ’1 ๐‘› ๐‘— ร—๐‘Ÿ ๐‘—โˆ’1 ๐‘› ๐‘— to get G ( ๐‘—) = U ( ๐‘—) :, 1 : ๐‘Ÿ ๐‘— .  R Reshape G ( ๐‘—) into G ( ๐‘—) โˆˆ ๐‘Ÿ ๐‘—โˆ’1 ร—๐‘› ๐‘— ร—๐‘Ÿ ๐‘— . end for Compute G (๐‘‘) = W (๐‘‘โˆ’2) = S (๐‘‘โˆ’1) V (๐‘‘โˆ’1)> โˆˆ ๐‘Ÿ ๐‘‘โˆ’1 ร—๐‘›๐‘‘ . R return G ( ๐‘—) for ๐‘— โˆˆ [๐‘‘]. Consider another definition of unfoldings of a tensor X, defined by R รŽ๐‘— รŽ๐‘‘ ๐‘› ร— โ„“= ๐‘—+1 ๐‘›โ„“ (3.19) X ๐‘— = X๐‘–1 ,...,๐‘– ๐‘— ;๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ โˆˆ โ„“=1 โ„“ , where the indices are divided into two row and column groups for ๐‘— โˆˆ [๐‘‘]. Obviously, X1 = X (1) , and X๐‘‘ is the vectorized version of X. The following theorems hold [16]. 
Theorem 3.4.1 If rank(X ๐‘— ) = ๐‘Ÿ ๐‘— for each unfolding X ๐‘— of a tensor X for all ๐‘— โˆˆ [๐‘‘], then there exists a decomposition (3.17) with Tensor-Train ranks no greater than ๐‘Ÿ ๐‘— . Theorem 3.4.2 Consider the ๐›ฟ-truncated SVD of X ๐‘— in the sense that X ๐‘— = USV> + E for kEk ๐น โ‰ค ๐›ฟ and ๐‘Ÿ ๐‘˜ = rank๐›ฟ (X ๐‘— ), where rank๐›ฟ (X ๐‘— ) is the ๐›ฟ-rank of X ๐‘— defined as the minimum rank(B) over all matrices B satisfying kA โˆ’ Bk ๐น โ‰ค ๐›ฟ. If ๐œ€ ๐›ฟ=โˆš kXk, ๐‘‘โˆ’1 then kX โˆ’ Xฬ‚k โ‰ค ๐œ€kXk. 22 Theorem 3.4.3 Suppose that the unfolding matrices X ๐‘— have low ranks ๐‘Ÿ ๐‘— only approximately, i.e., X ๐‘— = R ๐‘— + E ๐‘— for all ๐‘— โˆˆ [๐‘‘], such that rank(๐‘… ๐‘— ) = ๐‘Ÿ ๐‘— and kEโˆš๏ธ„๐‘— k ๐น = ๐œ€ ๐‘— . Then Algorithm 3.3 ๐‘‘ ๐œ€ 2๐‘— . ร computes a tensor Xฬ‚ with Tensor-Train ranks ๐‘Ÿ ๐‘— and kX โˆ’ Xฬ‚k โ‰ค ๐‘—=1 Corollary 1 If a tensor X admits a rank-๐‘Ÿ CP decomposition with accuracy ๐œ€, there exists a โˆš Tensor-Train decomposition with Tensor-Train ranks ๐‘Ÿ ๐‘— โ‰ค ๐‘Ÿ and accuracy ๐‘‘ โˆ’ 1๐œ€. Corollary 2 Given a tensor X and rank bounds ๐‘Ÿ ๐‘— , the best approximation to X in the Euclidean norm with Tensor-Train ranks bounded by ๐‘Ÿ ๐‘— always exists, and the Tensor-Train approximation Xฬ‚ โˆš is quasi-optimal. i.e., kX โˆ’ Xฬ‚k โ‰ค ๐‘‘ โˆ’ 1kX โˆ’ X best k. 23 CHAPTER 4 DIMENSIONALITY REDUCTION OF TENSOR DATA: MODEWISE RANDOM PROJECTIONS Johnson-Lindenstrauss embeddings, called JL from here on for the sake of brevity, provide a simple yet powerful tool for dimension reduction of high-dimensional data using linear random projections. By performing JL on mode- ๐‘— fibers of a tensor X, the dimentionality of all modes can be reduced to yield a projected tensor of much smaller size, without first vectorizing the tensor. It is then expected that the Euclidean norm of the projected tensor remains preserved to within a predictable error. In this chapter, theoretical guarantees for the geometry preserving properties of modewise random projections as JL embeddings are presented. More theorems, detailed discussions and proofs are provided in [9]. 4.1 Johnson-Lindenstrauss Embeddings for Tensor Data In this section, a brief overview of the necessary tools that will used to extend JL embeddings to higher-order data will be presented, as well as the theorems providing the underlying theory of modewie JL embeddings. Definition 4.1.1 (๐œบ-JL embedding) A matrix A โˆˆ C๐‘šร—๐‘ is an ๐œ€-JL embedding of a set ๐‘† โŠ‚ C๐‘ into C๐‘š if kAxk 22 = (1 + ๐œ€x ) kxk 22 , (4.1) for |๐œ€x | โ‰ค ๐œ€ and all x โˆˆ ๐‘†. Assuming the elements of A are subgaussian random variables and that |๐‘†| = ๐‘€, then (4.1) holds log ๐‘€ for all x โˆˆ ๐‘† with probability ๐‘ โ‰ฅ 1 โˆ’ 2 exp โˆ’๐ถ๐‘š๐œ€ 2 if ๐‘š โ‰ฅ ๐ถ ๐œ€2 , where ๐ถ is an absolute  constant [25]. A brief statement of the JL lemma along with its relation with random projections is presented in Appendix A.3. 24 Lemma 4.1.1 Let x, y โˆˆ C๐‘› and suppose that A โˆˆ C๐‘šร—๐‘› is an ๐œ€-JL embedding of the vectors i i {x โˆ’ y, x + y, x โˆ’ y, x + y} โŠ‚ C๐‘› into C๐‘š . Then,   |hAx, Ayi โˆ’ hx, yi| โ‰ค 2๐œ€ kxk 22 + kyk 22 โ‰ค 4๐œ€ max kxk 22 , kyk 22 .  
Proof Using the polarization identity for inner products, we have that 3 3 1 โˆ‘๏ธ i i i 1 โˆ‘๏ธ โ„“ i i  2 2  2 โ„“ โ„“ โ„“ |hAx, Ayi โˆ’ hx, yi| = Ax + Ay 2 โˆ’ x+ y 2 = ๐œ€โ„“ x + โ„“ y 2 4 โ„“=0 4 โ„“=0 3 1 โˆ‘๏ธ โ‰ค ๐œ€ (kxk 2 + kyk 2 ) 2 = ๐œ€ (kxk 2 + kyk 2 ) 2 4 โ„“=0   = ๐œ€ kxk 22 + kyk 22 + 2kxk 2 kyk 2   โ‰ค 2๐œ€ kxk 22 + kyk 22 โ‰ค 4๐œ€ max kxk 22 , kyk 22 ,  where the second to last inequality follows from Youngโ€™s inequality for products. In the second equality, ๐œ€โ„“ denotes the amount of distortion applied to x + โ„“ y i 2 2 by the JL matrix A, where |๐œ€โ„“ | โ‰ค ๐œ€. Extending vectors to tensors, one can define a tensor ๐œ€-JL embedding in a similar way as follows. Definition 4.1.2 (Tensor ๐œบ-JL embedding) A linear operator ๐ฟ : C๐‘› ร—๐‘› ร—...ร—๐‘› 1 2 ๐‘‘ โ†’ C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 is an ๐œ€-JL embedding of a set ๐‘† โŠ‚ C๐‘› ร—๐‘› ร—...ร—๐‘› 1 2 ๐‘‘ into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 if k๐ฟ (X)k 2 = (1 + ๐œ€X ) kXk 2 holds for some ๐œ€X โˆˆ (โˆ’๐œ€, ๐œ€) for all X โˆˆ ๐‘†. 25 The following lemma shows that the Tensor ๐œ€-JL embedding will preserve the pairwise inner product of tensors. Lemma 4.1.2 If X, Y โˆˆ C๐‘› ร—๐‘› ร—...ร—๐‘› 1 2 ๐‘‘ and suppose that ๐ฟ is an ๐œ€-JL embedding of the tensors i {X โˆ’ Y, X + Y, X โˆ’ Y, X + Y} โŠ‚ i C๐‘› ร—๐‘› ร—...ร—๐‘› 1 2 ๐‘‘ into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 . Then,   2 2 โ‰ค 4๐œ€ ยท max kXk 2 , kY k 2 .  |h๐ฟ (X) , ๐ฟ (Y)i โˆ’ hX, Yi| โ‰ค 2๐œ€ kXk + kY k Proof The proof will be similar to what was presented in the proof of Lemma 4.1.1, and by replacing A๐‘ฅ with ๐ฟ (X) and using the linearity of ๐ฟ. When a more general set is being projected using JL embeddings, a discretization scheme can be used in order to embed a finite set (see Appendix A.2, and [25] for more details). This applies to, for instance, a low-rank subspace of tensors. In such cases, due to the linearity of the embedding, discretization can rather be done on the unit ball in that subspace1. In the following lemma, a JL embedding result for a subspace is presented based on a covering argument. Lemma 4.1.3 Fix ๐œ€ โˆˆ (0, 1). Let L be an ๐‘Ÿ-dimensional subspace of C๐‘› , and let C โŠ‚ L be an (๐œ€/16)-net of the (๐‘Ÿ โˆ’ 1)-dimensional Euclidean unit sphere Sโ„“2 โŠ‚ L. Then, if A โˆˆ C๐‘šร—๐‘› is an (๐œ€/2)-JL embedding of C, it will also satisfy (1 โˆ’ ๐œ€)kxk 22 โ‰ค kAxk 22 โ‰ค (1 + ๐œ€)kxk 22 , (4.2)  ๐‘Ÿ 47 for all x โˆˆ L. Furthermore, one can observe that |C| โ‰ค ๐œ€ . 1 Any point in an ๐‘Ÿ-dimensional subspace can be represented by a linear combination of ๐‘Ÿ basis elements in the discretized unit ball of that subspace. 26 Proof Noting that A is a linear embedding, it is enough to prove (4.2) for an arbitrary x โˆˆ Sโ„“2 as any point in the subspace L can be scaled up/down to a point on the unit sphere in L. To prove the upper bound, let ฮ” := kAk 2โ†’2 and choose an element y โˆˆ C such that kx โˆ’ yk 2 โ‰ค ๐œ€/16. Noting kxk 2 = 1, we can write โˆš๏ธ kAxk 2 โˆ’ kxk 2 โ‰ค kAyk 2 + kA(x โˆ’ y)k 2 โˆ’ 1 โ‰ค 1 + ๐œ€/2 โˆ’ 1 + kA(x โˆ’ y)k 2 โˆš๏ธ โ‰ค 1 + ๐œ€/2 โˆ’ 1 + kAk 2โ†’2 kx โˆ’ yk 2 โ‰ค (1 + ๐œ€/4) โˆ’ 1 + ฮ”๐œ€/16 = (๐œ€/4)(1 + ฮ”/4) for all x โˆˆ Sโ„“2 . Since this upper bound holds for all x with kxk 2 = 1, it will also hold for the maximizer of kAxk 2 with kxk 2 = 1, meaning for that x, kAxk 2 = kAk 2โ†’2 so that ฮ”โˆ’1 โ‰ค (๐œ€/4)(1+ 1+๐œ€/4 ฮ”/4) must also hold. Therefore, ฮ” โ‰ค 1 + ๐œ€/4 + ฮ”๐œ€/16 =โ‡’ ฮ” โ‰ค 1โˆ’๐œ€/16 โ‰ค 1 + ๐œ€/3. 
The upper bound now follows as kAxk 2 โ‰ค ฮ” = supzโˆˆS 2 kAzk 2 for all x โˆˆ Sโ„“2 . โ„“ For the lower bound, let ๐›ฟ := inf zโˆˆSโ„“ 2 kAzk 2 โ‰ฅ 0, and we also note that there exists an element of the compact set Sโ„“2 that realizes ๐›ฟ. Similar to the proof of the upper bound, we consider this minimizing vector x โˆˆ Sโ„“2 and choose an element y โˆˆ C with kx โˆ’ yk 2 โ‰ค ๐œ€/16 in order to observe that โˆš๏ธ ๐›ฟ โˆ’ 1 = kAxk 2 โˆ’ kxk 2 โ‰ฅ kAyk 2 โˆ’ kA(x โˆ’ y)k 2 โˆ’ 1 โ‰ฅ 1 โˆ’ ๐œ€/2 โˆ’ 1 โˆ’ kA(x โˆ’ y)k 2 โˆš๏ธ โ‰ฅ 1 โˆ’ ๐œ€/2 โˆ’ 1 โˆ’ kAk 2โ†’2 kx โˆ’ yk 2 โ‰ฅ (1 โˆ’ ๐œ€/3) โˆ’ 1 โˆ’ ฮ”๐œ€/16 โ‰ฅ โˆ’ (๐œ€/3 + ๐œ€/16 (1 + ๐œ€/3)) โ‰ฅ โˆ’ (๐œ€/3 + ๐œ€/16 + ๐œ€/48) = โˆ’5๐œ€/12. Thus, ๐›ฟ โ‰ฅ 1 โˆ’ 5๐œ€/12 โ‰ฅ 1 โˆ’ ๐œ€. This proves the lower bound as kAxk 2 โ‰ฅ ๐›ฟ for all x โˆˆ Sโ„“2 . The proof of the upper bound on |C| can be found in Appendix C of [5]. Note 4.1.1 According to the lower and upper bounds proved above, (4.2) can use a tighter bound, i.e., (1 โˆ’ ๐œ€/2)kxk 22 โ‰ค kAxk 22 โ‰ค (1 + ๐œ€/2)kxk 22 , 27 as ๐›ฟ โ‰ฅ 1 โˆ’ 5๐œ€/12 โ‰ฅ 1 โˆ’ ๐œ€/2 and ฮ” โ‰ค 1 + ๐œ€/3 โ‰ค 1 + ๐œ€/2. This means the (๐œ€/2)-JL property of A assumed to hold for the elements of C is carried over to all elements of the subspace. In Lemma 4.1.3, the norms of vectors in a subspace are preserved by preserving the norms of all points in the discretized unit ball in that subspace. This makes the dependence on the subspace dimension ๐‘Ÿ exponential according to |C| โ‰ค (47/๐œ€) ๐‘Ÿ . The following lemma uses a coarser discretization to improve the dependence on ๐‘Ÿ so that a better target dimension can be achieved for the JL embedding. This is done by preserving the norms of an orthonormal basis to, by linearity, control the norms of all points in the subspace. If the angles between the elements of the orthonormal basis are preserved very accurately, then the projected basis will also be approximately orthonormal, and the norms of the points that are in the span of the orthonormal basis will also be preserved. Requiring the preservation of the aformentioned angles to be accurate imposes, in turn, a more strict bound on the norm-preserving property of the embedding. This concept is presented in the following lemma. Lemma 4.1.4 Fix ๐œ€ โˆˆ (0, 1) and let L be an ๐‘Ÿ-dimensional subspace of C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ spanned by a set of ๐‘Ÿ orthonormal basis tensors {T๐‘˜ } ๐‘˜ โˆˆ[๐‘Ÿ] . If ๐ฟ is an (๐œ€/4๐‘Ÿ)-JL embedding of the 4 2๐‘Ÿ + ๐‘Ÿ = 2๐‘Ÿ 2 โˆ’ ๐‘Ÿ  tensors ! i i ร˜ ร˜ {T๐‘˜ โˆ’ Tโ„Ž , T๐‘˜ + Tโ„Ž , T๐‘˜ โˆ’ Tโ„Ž , T๐‘˜ + Tโ„Ž } {T๐‘˜ } ๐‘˜โˆˆ[๐‘Ÿ] โŠ‚ L 1โ‰คโ„Ž<๐‘˜ โ‰ค๐‘Ÿ into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 , then k๐ฟ (X)k 2 โˆ’ kXk 2 โ‰ค ๐œ€kXk 2 holds for all X โˆˆ L. Proof According to Lemma 4.1.2, one can see that |๐œ€ ๐‘˜,โ„Ž | := |h๐ฟ (T๐‘˜ ) , ๐ฟ (Tโ„Ž )i โˆ’ hT๐‘˜ , Tโ„Ž i| โ‰ค ๐œ€/๐‘Ÿ 28 ร๐‘Ÿ for all โ„Ž, ๐‘˜ โˆˆ [๐‘Ÿ]. Thus, for any X = ๐‘˜=1 ๐›ผ ๐‘˜ T๐‘˜ โˆˆ L, โˆ‘๏ธ๐‘Ÿ โˆ‘๏ธ ๐‘Ÿ โˆ‘๏ธ ๐‘Ÿ โˆ‘๏ธ ๐‘Ÿ 2 2 k๐ฟ (X) k โˆ’ kXk = ๐›ผ ๐‘˜ ๐›ผโ„Ž (h๐ฟ (T๐‘˜ ) , ๐ฟ (Tโ„Ž )i โˆ’ hT๐‘˜ , Tโ„Ž i) = ๐›ผ ๐‘˜ ๐›ผโ„Ž ๐œ€ ๐‘˜,โ„Ž ๐‘˜=1 โ„Ž=1 ๐‘˜=1 โ„Ž=1 ๐‘Ÿ ๐‘Ÿ โˆ‘๏ธ โˆ‘๏ธ ๐œ€ โ‰ค |๐›ผ ๐‘˜ | |๐›ผโ„Ž | |๐œ€ ๐‘˜,โ„Ž | โ‰ค k๐œถk 21 โ‰ค ๐œ€k๐œถk 22 , ๐‘˜=1 โ„Ž=1 ๐‘Ÿ โˆš where we have used the relation k๐œถk 1 โ‰ค ๐‘Ÿ k๐œถk 2 to obtain the last inequality2. To finish the proof, we must show that kXk 2 = k๐œถk 22 . 
Due to the orthonormality of the basis tensors {T๐‘˜ } ๐‘˜โˆˆ[๐‘Ÿ] , one may write ๐‘Ÿ โˆ‘๏ธ โˆ‘๏ธ ๐‘Ÿ ๐‘Ÿ โˆ‘๏ธ 2 kXk = hX, Xi = ๐›ผ๐‘˜ ๐›ผโ„Ž hT๐‘˜ , Tโ„Ž i = |๐›ผ| 2 = k๐œถk 2 . ๐‘˜=1 โ„Ž=1 ๐‘˜=1 4.2 Johnson-Lindenstrauss Embedings for Low-Rank Tensors In this section, JL embeddings are discussed along the lines of Lemmas 4.1.3 and 4.1.4 in the case of low-CP-rank tensors, i.e., tensors that can be expressed as the weighted sum of a number of rank-1 tensors. 4.2.1 Geometry-Preserving Property of JL Embeddings for Low-Rank Tensors The main purpose of this section is to show how employing modewise JL embeddings affect the norm of a low-rank tensor and its inner product with another low-rank tensor. Considering rank-๐‘Ÿ tensors as members of an ๐‘Ÿ-dimensional tensor subspace spanned by ๐‘Ÿ rank-1 basis tensors, we are essentially assuming that the these basis tensors always exist. This, however, is not guaranteed, and we are not even able to guarantee that there always exist a sufficiently incoherent basis of ๐‘Ÿ rank-1 tensors that span any rank-๐‘Ÿ subspace. To incorporate coherence into our analysis, the concepts of modewise and basis coherence are introduced in the following. Next, the norm and 2 Using โˆš the Cauchy-Schwarz ineqquality for vectors ๐œถ and 1, we have k๐œถk 1 = h|๐œถ| , 1i โ‰ค k๐œถk 2 k1k 2 = ๐‘Ÿ k๐œถk 2 , where |๐œถ| is a vector whose elements are the absolute values of the elements of ๐œถ, and 1 is a vector of all ones. 29 inner product preservation property in the Johnson-Lindenstrauss embeddings of low-rank tensors will be discussed. Definition 4.2.1 (Modewise Coherence) Assume that X admits a decomposition of rank ๐‘Ÿ in the โ€œstandardโ€ form, i.e., ๐‘Ÿ โˆ‘๏ธ โˆ‘๏ธ ๐‘Ÿ X= ๐›ผ ๐‘˜ x ๐‘˜(1) x ๐‘˜(2) ยทยทยท x ๐‘˜(๐‘‘) = ๐›ผ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) , (4.3) ๐‘˜=1 ๐‘˜=1 where kx ๐‘˜(โ„“) k 22 = 1 for ๐‘˜ โˆˆ [๐‘Ÿ] and โ„“ โˆˆ [๐‘‘]. The maximum modewise coherence of X is defined as D E ๐œ‡X := max ๐œ‡X,โ„“ := max max x ๐‘˜(โ„“) , xโ„Ž(โ„“) . (4.4) โ„“โˆˆ[๐‘‘] โ„“โˆˆ[๐‘‘] ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] ๐‘˜โ‰ โ„Ž where ๐œ‡X,โ„“ โˆˆ [0, 1] for โ„“ โˆˆ [๐‘‘] is the modewise coherence for mode โ„“. Also obviously, ๐œ‡X โˆˆ [0, 1]. Definition 4.2.2 (Basis Coherence) Let B be a set of ๐‘Ÿ rank-1 tensors, defined as C๐‘› ร—ยทยทยทร—๐‘› n o ๐‘‘ (โ„“) B := โ„“=1 x ๐‘˜ ๐‘˜ โˆˆ [๐‘Ÿ] โŠ‚ 1 ๐‘‘ n o with kx ๐‘˜(โ„“) k 2 = 1 for ๐‘˜ โˆˆ [๐‘Ÿ] and โ„“ โˆˆ [๐‘‘]. Let L := span ๐‘‘ โ„“=1 x ๐‘˜ (โ„“) ๐‘˜ โˆˆ [๐‘Ÿ] be the span of B. For the set B, the basis coherence is defined as D E ร– ๐‘‘ D E (โ„“) (โ„“) ๐œ‡0B := max ๐‘‘ โ„“=1 x ๐‘˜ , ๐‘‘ โ„“=1 x โ„Ž = max x ๐‘˜(โ„“) , xโ„Ž(โ„“) . (4.5) ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] ๐‘˜โ‰ โ„Ž ๐‘˜โ‰ โ„Ž โ„“=1 It is easy to verify that ๐œ‡0B โˆˆ [0, 1]. Note 4.2.1 Looking at (4.5), one can observe that ๐œ‡0B is in fact the maximum absolute inner product3 of the pairs of rank-1 tensors that form the basis B, hence the name basis coherence. We will also use, with some abuse of notation, the (maximum) modewise coherence and basis coherence for any X โˆˆ L = span (B) and the basis B interchangeably, i.e., ๐œ‡X,โ„“ = ๐œ‡ B,โ„“ , ๐œ‡X = ๐œ‡ B , and ๐œ‡0X = ๐œ‡0B . Finally, it can also be inferred that ร– ๐‘‘ ๐œ‡0X โ‰ค ๐œ‡X,โ„“ โ‰ค ๐œ‡X ๐‘‘ . โ„“=1 3 Defined as per (2.3), or equivalently, the inner product of the vectorized form of the tensors. 
30 0 It should also be noted that ๐œ‡X,โ„“ , ๐œ‡X , and ๐œ‡X always depend on the choice of the particular basis tensors forming B. To further prepare grounds for the main JL embedding result, we discuss how modewise JL embeddings affect the structure, modewise coherence, and the norm of a low-rank tensor of the form (4.3), as well as the inner product of two such low-rank tensors. This will be done through the next few lemmas and theorems below. Lemma 4.2.1 Let ๐‘— โˆˆ [๐‘‘], A โˆˆ C๐‘šร—๐‘› , and X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› ๐‘— 1 ๐‘‘ be a rank-๐‘Ÿ tensor as per (4.3) such ( ๐‘—) that min Ax ๐‘˜ > 0. Then X 0 := X ร— ๐‘— A can be written in standard form as ๐‘˜โˆˆ[๐‘Ÿ] 2 ๐‘Ÿ ( ๐‘—) 0 โˆ‘๏ธ ( ๐‘—) ยฉ  ๐‘—โˆ’1 (โ„“)  Ax ๐‘˜  ๐‘‘ (โ„“) ยช X = ๐›ผ ๐‘˜ Ax ๐‘˜ ยญ โ„“=1 x ๐‘˜ โ„“= ๐‘—+1 x ๐‘˜ ยฎ. 2ยญ ( ๐‘—) ยฎ ๐‘˜=1 Ax ๐‘˜ ยซ 2 ยฌ Furthermore, the modewise coherence of X 0 as above will satisfy D E ( ๐‘—) ( ๐‘—) Ax ๐‘˜ , Axโ„Ž ๐œ‡X 0, ๐‘— = max ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] ( ๐‘—) ( ๐‘—) ๐‘˜โ‰ โ„Ž Ax ๐‘˜ Axโ„Ž 2 2 so that D E ๐œ‡X 0 = max ยญ ๐œ‡X 0, ๐‘— , max max x ๐‘˜(โ„“) , xโ„Ž(โ„“) ยฎ . ยฉ ยช โ„“โˆˆ[๐‘‘]\{ ๐‘— } ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] ยซ ๐‘˜โ‰ โ„Ž ยฌ Proof Using Lemma 2.0.2, the linearity of tensor matricization, and (2.8) we can see that the mode- ๐‘— unfolding of X 0 satisfies โˆ‘๏ธ๐‘Ÿ   โˆ‘๏ธ๐‘Ÿ  > (โ„“) ( ๐‘—) X0( ๐‘—) = AX ( ๐‘—) = A ๐›ผ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜ = ๐›ผ ๐‘˜ Ax ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) ( ๐‘—) ๐‘˜=1 ๐‘˜=1 ๐‘Ÿ  ( ๐‘—) โˆ‘๏ธ ( ๐‘—)  Ax ๐‘˜  > = ๐›ผ ๐‘˜ Ax ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) . 2 ( ๐‘—) ๐‘˜=1 Ax ๐‘˜ 2 31 ( ๐‘—+1) ( ๐‘—โˆ’1) where โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) = x ๐‘˜(๐‘‘) โŠ— ยท ยท ยท โŠ— x ๐‘˜ โŠ— x๐‘˜ โŠ— ยท ยท ยท โŠ— x ๐‘˜(1) . Folding X0( ๐‘—) back into a ๐‘‘-mode tensor then gives us our first equality. The second two equalities follow directly from the definition of modewise coherence. The following lemma will be helpful in proving the norm-preserving property of modewise JL embeddings; it provides a useful relation expressing the norm of the ๐‘—-mode product of a tensor that is in the standard form (4.3) in terms of inner products of its individual factor vectors projected by the embedding. Lemma 4.2.2 Let ๐‘— โˆˆ [๐‘‘], A โˆˆ C๐‘šร—๐‘› , and X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› ๐‘— 1 ๐‘‘ be a rank-๐‘Ÿ tensor in standard form as per (4.3). Then, รŽ ๐‘Ÿ โˆ‘๏ธ โ„“โ‰  ๐‘— ๐‘›โ„“ โˆ‘๏ธ     D E ( ๐‘—) ( ๐‘—) kX ร— ๐‘— Ak 2 = ๐›ผ ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) ๐›ผโ„Ž โŠ—โ„“โ‰  ๐‘— xโ„Ž(โ„“) Ax ๐‘˜ , Axโ„Ž . ๐‘Ž ๐‘Ž ๐‘˜,โ„Ž=1 ๐‘Ž=1 where (ยท)๐‘Ž denotes the ๐‘Ž th element of a vector. Proof Using the linearity of ๐‘—-mode products, tensor matricization, and observing that the Eu- clidean norm of a tensor is equal to the Frobenius norm of any of its unfoldings, one can write ๐‘Ÿ 2 ๐‘Ÿ 2 โˆ‘๏ธ    โˆ‘๏ธ    2 ๐‘‘ (โ„“) ๐‘‘ (โ„“) kX ร— ๐‘— Ak = ๐›ผ๐‘˜ โ„“=1 x ๐‘˜ ร—๐‘— A = ๐›ผ๐‘˜ โ„“=1 x ๐‘˜ ร—๐‘— A ( ๐‘—) ๐‘˜=1 ๐‘˜=1 F ๐‘Ÿ 2 โˆ‘๏ธ  > ( ๐‘—) = ๐›ผ ๐‘˜ Ax ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) ๐‘˜=1 F โˆ‘๏ธ๐‘Ÿ   >  > ( ๐‘—) ( ๐‘—) = ๐›ผ ๐‘˜ Ax ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) , ๐›ผโ„Ž Axโ„Ž โŠ—โ„“โ‰  ๐‘— xโ„Ž(โ„“) ๐‘˜,โ„Ž=1 F where kยทk F and hยท, ยทiF denote the Frobenius matrix norm and inner product, respectively. 
Computing รŽ the Frobenius inner products above columnwise, and noting that there are โ„“โ‰  ๐‘— ๐‘›โ„“ columns in the mode- ๐‘— unfolding, one can further see that รŽ ๐‘Ÿ โˆ‘๏ธ โ„“โ‰  ๐‘— ๐‘›โ„“ โˆ‘๏ธ     D E ( ๐‘—) ( ๐‘—) kX ร— ๐‘— Ak = 2 ๐›ผ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) ๐›ผโ„Ž โŠ—โ„“โ‰  ๐‘— xโ„Ž(โ„“) Ax ๐‘˜ , Axโ„Ž , ๐‘Ž ๐‘Ž ๐‘˜,โ„Ž=1 ๐‘Ž=1 32 completing the proof. The following theorem demonstrates that a single modewise JL embedding of any low-rank tensor X of the standard form (4.3) will preserve its norm up to an error depending on the overall โ„“ 2 -norm of its coefficients ๐œถ โˆˆ C๐‘Ÿ . It should also be noted that in establishing proofs for the following lemmas and theorems, we will encounter situations where applying the polarization identity for inner products (recall how it was used in the proofs of Lemmas 4.1.1 and 4.1.2) will lead to the requirement that the JL property of matrices involved must hold for a set of vectors and combinations of them. For low-rank tensors under consideration, this set will include the vectors forming (4.3), and the corresponding set of interest will be ! i i C๐‘› , ร˜ n o ร˜n o ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) S 0๐‘— = x๐‘˜ โˆ’ xโ„Ž , x๐‘˜ + xโ„Ž , x๐‘˜ โˆ’ xโ„Ž , x๐‘˜ + xโ„Ž x๐‘˜ โŠ‚ ๐‘— (4.6) ๐‘˜โˆˆ[๐‘Ÿ] 1โ‰คโ„Ž<๐‘˜ โ‰ค๐‘Ÿ containing 4 2 ๐‘Ÿ + ๐‘Ÿ = 2๐‘Ÿ 2 โˆ’ ๐‘Ÿ vectors for each mode ๐‘— โˆˆ [๐‘‘]. Theorem 4.2.1 Let ๐‘— โˆˆ [๐‘‘] and X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ be a rank-๐‘Ÿ tensor as per (4.3). Suppose that A โˆˆ C๐‘šร—๐‘› ๐‘— is an (๐œ€/4)-JL embedding of the vectors in the set defined in (4.6) into C๐‘š . Let X 0 := X ร— ๐‘— A and rewrite it in standard form so that ๐‘Ÿ ( ๐‘—) 0 โˆ‘๏ธ ยฉ (โ„“)  Ax ๐‘˜   (โ„“) ยช X = ๐›ผ0๐‘˜ ยญ โ„“< ๐‘— x ๐‘˜ x โ„“> ๐‘— ๐‘˜ ยฎ . ยฎ ยญ ( ๐‘—) ๐‘˜=1 Ax ๐‘˜ ยซ 2 ยฌ Then all of the following hold: (a) ๐›ผ0๐‘˜ โˆ’ ๐›ผ๐‘˜ โ‰ค ๐œ€|๐›ผ ๐‘˜ |/4 for all ๐‘˜ โˆˆ [๐‘Ÿ] so that k๐œถ0 k โˆž โ‰ค (1 + ๐œ€/4)k๐œถk โˆž ๐œ‡ X, ๐‘— +๐œ€ (b) ๐œ‡X 0, ๐‘— โ‰ค 1โˆ’๐œ€/4 , and ๐œ‡X 0,โ„“ = ๐œ‡X,โ„“ for all โ„“ โˆˆ [๐‘‘] \ { ๐‘— } โˆš๏ธ ร–   (c) kX 0 k 2 โˆ’ kXk 2 โ‰ค ๐œ€ ยญ1 + ๐‘Ÿ (๐‘Ÿ โˆ’ 1) ๐œ‡X,โ„“ ยฎ k๐œถk 22 โ‰ค ๐œ€ 1 + ๐‘Ÿ ๐œ‡X ๐‘‘โˆ’1 k๐œถk 22 โ‰ค ๐œ€(๐‘Ÿ +1)k๐œถk 22 ยฉ ยช ยซ โ„“โ‰  ๐‘— ยฌ 33 Proof Properties are proved in order below. (a) By Lemma 4.2.1, one can write for all ๐‘˜ โˆˆ [๐‘Ÿ] that ( ๐‘—) ( ๐‘—) ๐›ผ0๐‘˜ โˆ’ ๐›ผ ๐‘˜ = ๐›ผ ๐‘˜ Ax ๐‘˜ โˆ’ ๐›ผ๐‘˜ = |๐›ผ ๐‘˜ | Ax ๐‘˜ โˆ’ 1 โ‰ค ๐œ€|๐›ผ ๐‘˜ |/4. 2 2 (b) Lemma 4.2.1 and the definition of ๐‘—-mode coherence lead to D E D E ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) Ax ๐‘˜ , Axโ„Ž x๐‘˜ , xโ„Ž + ๐œ€ ๐œ‡X, ๐‘— + ๐œ€ ๐œ‡X 0, ๐‘— = max โ‰ค max = , ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] Ax ( ๐‘—) Ax ( ๐‘—) ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] 1 โˆ’ ๐œ€4 1 โˆ’ ๐œ€4 ๐‘˜โ‰ โ„Ž ๐‘˜ 2 โ„Ž 2 ๐‘˜โ‰ โ„Ž where the inequality follows from A being an (๐œ€/4)-JL embedding and Lemma 4.1.1. (c) Applying Lemma 4.2.2 one can observe that รŽ ๐‘Ÿ โˆ‘๏ธ โ„“โ‰  ๐‘— ๐‘›โ„“ โˆ‘๏ธ     D E D E ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) 0 2 kX k โˆ’ kXk = 2 ๐›ผ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) ๐›ผโ„Ž โŠ—โ„“โ‰  ๐‘— xโ„Ž(โ„“) Ax ๐‘˜ , Axโ„Ž โˆ’ x๐‘˜ , xโ„Ž . (4.7) ๐‘Ž ๐‘Ž ๐‘˜,โ„Ž=1 ๐‘Ž=1 Lemma 4.1.1 can be applied to each inner product in (4.7) to get D E D E ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) Ax ๐‘˜ , Axโ„Ž = x๐‘˜ , xโ„Ž + ๐œ€ ๐‘˜,โ„Ž for some ๐œ€ ๐‘˜,โ„Ž โˆˆ C with ๐œ€ ๐‘˜,โ„Ž โ‰ค ๐œ€. 
As a result we have that รŽ ๐‘Ÿ โˆ‘๏ธ โ„“โ‰  ๐‘— ๐‘›โ„“ โˆ‘๏ธ     kX 0 k 2 โˆ’ kXk 2 = ๐›ผ ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) ๐›ผโ„Ž โŠ—โ„“โ‰  ๐‘— xโ„Ž(โ„“) ๐œ€ ๐‘˜,โ„Ž . ๐‘Ž ๐‘Ž ๐‘˜,โ„Ž=1 ๐‘Ž=1 รŽ ๐‘Ÿ โˆ‘๏ธ โ„“โ‰  ๐‘— ๐‘›โ„“  โˆ‘๏ธ    = ๐›ผ ๐‘˜ ๐›ผโ„Ž ๐œ€ ๐‘˜,โ„Ž โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) โŠ—โ„“โ‰  ๐‘— xโ„Ž(โ„“) ๐‘Ž ๐‘Ž ๐‘˜,โ„Ž=1 ๐‘Ž=1 ๐‘Ÿ โˆ‘๏ธ D E โˆ‘๏ธ๐‘Ÿ ร–D E (โ„“) (โ„“) = ๐›ผ ๐‘˜ ๐›ผโ„Ž ๐œ€ ๐‘˜,โ„Ž โ„“โ‰  ๐‘— x ๐‘˜ , โ„“โ‰  ๐‘— x โ„Ž = ๐›ผ ๐‘˜ ๐›ผโ„Ž ๐œ€ ๐‘˜,โ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) ๐‘˜,โ„Ž=1 ๐‘˜,โ„Ž=1 โ„“โ‰  ๐‘— โˆ‘๏ธ๐‘Ÿ ร– 2 โˆ‘๏ธ ร–D E โ‰ค 2 |๐›ผ ๐‘˜ | ๐œ€ ๐‘˜,๐‘˜ x ๐‘˜(โ„“) + ๐›ผ ๐‘˜ ๐›ผโ„Ž ๐œ€ ๐‘˜,โ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) . ๐‘˜=1 โ„“โ‰  ๐‘— ๐‘˜โ‰ โ„Ž โ„“โ‰  ๐‘— 34 2 x ๐‘˜(โ„“) = 1 as x ๐‘˜(โ„“) รŽ Noting that โ„“โ‰  ๐‘— = 1 for all โ„“ โˆˆ [๐‘‘] and ๐‘˜ โˆˆ [๐‘Ÿ], it follows that 2 โˆ‘๏ธ๐‘Ÿ โˆ‘๏ธ ร–D E 0 2 kX k โˆ’ kXk 2 โ‰ค ๐œ€ |๐›ผ ๐‘˜ | 2 + ๐›ผ ๐‘˜ ๐›ผโ„Ž ๐œ€ ๐‘˜,โ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) ๐‘˜=1 ๐‘˜โ‰ โ„Ž โ„“โ‰  ๐‘—   > > = ๐œ€k๐œถk 22 + E ๐œถ, ๐œถ โ‰ค ๐œ€+ E 2โ†’2 k๐œถk 22 , C รŽD E where E โˆˆ ๐‘Ÿร—๐‘Ÿ is zero on its diagonal, ๐ธ ๐‘˜,โ„Ž = ๐œ€ ๐‘˜,โ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) for ๐‘˜ โ‰  โ„Ž, and the operator โ„“โ‰  ๐‘— norm kE> k 2โ†’2 satisfies v u v u u u 2 u u 2 tโˆ‘๏ธ ร– D E tโˆ‘๏ธ ร– D E E> 2โ†’2 โ‰ค kEk ๐น โ‰ค x ๐‘˜(โ„“) , xโ„Ž(โ„“) ๐œ€2 = ๐œ€ ยท x ๐‘˜(โ„“) , xโ„Ž(โ„“) . ๐‘˜โ‰ โ„Ž โ„“โ‰  ๐‘— ๐‘˜โ‰ โ„Ž โ„“โ‰  ๐‘— Finally, the definition of ๐œ‡X implies that โˆš๏ธ ร– ๐‘‘โˆ’1 kEk 2โ†’2 โ‰ค ๐œ€ ๐‘Ÿ (๐‘Ÿ โˆ’ 1) ๐œ‡X,โ„“ โ‰ค ๐œ€๐‘Ÿ ๐œ‡X . โ„“โ‰  ๐‘— Thus, the desired bound can be obtained, i.e., โˆš๏ธ ร–   kX 0 k 2 โˆ’ kXk 2 โ‰ค ๐œ€ ยญ1 + ๐‘Ÿ (๐‘Ÿ โˆ’ 1) ๐œ‡X,โ„“ ยฎ k๐œถk 22 โ‰ค ๐œ€ 1 + ๐‘Ÿ ๐œ‡X ๐‘‘โˆ’1 k๐œถk 22 . ยฉ ยช ยซ โ„“โ‰  ๐‘— ยฌ Theorem 4.2.2 provides an upper bound for the distortion in the norm of a tensor when modewise JL embeddings are applied to all its modes. The following remark provides a useful tool in the proof of the theorem. Remark 4.2.1 Let ๐‘, ๐‘‘ โˆˆ R+. Then, 1+ ๐‘ ๐‘‘ ๐‘‘  โ‰ค e๐‘ . Theorem 4.2.2 Let ๐œ€ โˆˆ (0, 3/4]. Assume that X admits a decomposition of rank ๐‘Ÿ in the standard form as per (4.3). Also, assume that the matrices A ( ๐‘—) โˆˆ ๐‘š ๐‘— ร—๐‘› ๐‘— are 4๐‘‘ C ๐œ€  -JL embeddings of the vectors in S 0๐‘— as per (4.6) into C๐‘š ๐‘— for each ๐‘— โˆˆ [๐‘‘]. If Y = X ร—1 A (1) ร—2 A (2) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) , (4.8) 35 then, e + e2  โˆš๏ธ   kY k 2 โˆ’ kXk 2 โ‰ค ๐œ€ ๐‘Ÿ (๐‘Ÿ โˆ’ 1) max ๐œ€ ๐‘‘โˆ’1 , ๐œ‡X ๐‘‘โˆ’1 k๐œถk 22 (4.9) โ‰ค๐œ€ e 2 (๐‘Ÿ + 1) k๐œถk 22 , and if the maximum modewise coherence of X is zero, i.e., if ๐œ‡X = 0, then e e  โˆš๏ธ  kY k 2 โˆ’ kXk 2 โ‰ค ๐œ€ + ๐‘Ÿ (๐‘Ÿ โˆ’ 1)๐œ€ ๐‘‘ k๐œถk 22 . (4.10) Proof Let X (0) := X and X (๐‘‘) := Y, and for each ๐‘— โˆˆ [๐‘‘] define the partially compressed tensor โˆ‘๏ธ๐‘Ÿ     ( ๐‘—) (1) ( ๐‘—) ๐‘— (โ„“) (โ„“) ๐‘‘ (โ„“) X := X ร—1 A ยท ยท ยท ร—๐‘— A = ๐›ผ๐‘˜ โ„“=1 A x ๐‘˜ โ„“= ๐‘—+1 x ๐‘˜ ๐‘˜=1 ๐‘Ÿ (4.11) โˆ‘๏ธ ๐‘‘ (โ„“) =: ๐›ผ ๐‘—,๐‘˜ โ„“=1 x ๐‘—,๐‘˜ , ๐‘˜=1 expressed in standard form via ๐‘— applications of Lemma 4.2.1. By looking closely at the second and third equalities above, one can observe that for all ๐‘— โˆˆ [๐‘‘], ๐›ผ ๐‘—,๐‘˜ = ๐›ผ ๐‘˜ โ„“=1 A (โ„“) x ๐‘˜(โ„“) , as well รŽ๐‘— as x (โ„“) (โ„“) (โ„“) ๐‘—,๐‘˜ = A x ๐‘˜ / A x ๐‘˜ (โ„“) (โ„“) for โ„“ โˆˆ [ ๐‘—] and x (โ„“) = x (โ„“) for โ„“ > ๐‘—. 
๐‘—,๐‘˜ ๐‘˜ The first two parts of Theorem 4.2.1 can be used to write (๐‘–) ๐›ผ ๐‘—,๐‘˜ โˆ’ ๐›ผ ๐‘—โˆ’1,๐‘˜ โ‰ค ๐œ€|๐›ผ ๐‘—โˆ’1,๐‘˜ |/4๐‘‘ so that |๐›ผ ๐‘—,๐‘˜ | โ‰ค (1 + ๐œ€/4๐‘‘)|๐›ผ ๐‘—โˆ’1,๐‘˜ | holds for all ๐‘˜ โˆˆ [๐‘Ÿ], and (๐‘–๐‘–) ๐œ‡X ( ๐‘—) , ๐‘— โ‰ค (๐œ‡X ( ๐‘—โˆ’1) , ๐‘— + ๐œ€/๐‘‘)/(1 โˆ’ ๐œ€/4๐‘‘), and ๐œ‡X ( ๐‘—) ,โ„“ = ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ for all โ„“ โˆˆ [๐‘‘] \ { ๐‘— }, for ๐‘— โˆˆ [๐‘‘]. Using these facts inductively, it can also be established that both |๐›ผ ๐‘—,๐‘˜ | โ‰ค (1 + ๐œ€/4๐‘‘) ๐‘— |๐›ผ ๐‘˜ |, (4.12) and   ๐‘—โˆ’1 ร– ยฉร– ๐œ‡X,โ„“ + ๐œ€/๐‘‘ ยช ร– ๐œ‡X + ๐œ€/๐‘‘ ๐‘‘โˆ’ ๐‘— ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ โ‰ค ยญ ๐œ‡X,โ„“ โ‰ค ๐œ‡X , (4.13) โˆ’ โˆ’ ยฎ 1 ๐œ€/4๐‘‘ 1 ๐œ€/4๐‘‘ โ„“โ‰  ๐‘— ยซ โ„“< ๐‘— ยฌ โ„“> ๐‘— 36 รŽ รŽ รŽ hold for all ๐‘˜ โˆˆ [๐‘Ÿ] and ๐‘— โˆˆ [๐‘‘]. To prove (4.13), one can write โ„“โ‰  ๐‘— ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ = โ„“< ๐‘— ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ โ„“> ๐‘— ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ . รŽ The second term on the right-hand side is equal to โ„“> ๐‘— ๐œ‡X,โ„“ since โ„“ > ๐‘—. For the first term, we have ๐‘—โˆ’2 ! ๐‘—โˆ’3 ! ร– ร– ร– ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ = ๐œ‡X ( ๐‘—โˆ’1),โ„“ ๐œ‡X ( ๐‘—โˆ’1) , ๐‘—โˆ’1 = ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ ๐œ‡X ( ๐‘—โˆ’1) , ๐‘—โˆ’2 ๐œ‡X ( ๐‘—โˆ’1) , ๐‘—โˆ’1 โ„“< ๐‘— โ„“=1 โ„“=1 ๐‘—โˆ’3 ! ๐‘—โˆ’1 ร– ร– = ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ ๐œ‡X ( ๐‘—โˆ’2) , ๐‘—โˆ’2 ๐œ‡X ( ๐‘—โˆ’1) , ๐‘—โˆ’1 = ยท ยท ยท = ๐œ‡X (โ„“) ,โ„“ (4.14) โ„“=1 โ„“=1 ๐‘—โˆ’1 ๐‘—โˆ’1 ๐œ‡X (โ„“โˆ’1) ,โ„“ + ๐œ€/๐‘‘ ร–   ๐‘—โˆ’1 ร– ๐œ‡X,โ„“ + ๐œ€/๐‘‘ ๐œ‡X + ๐œ€/๐‘‘ โ‰ค = โ‰ค . โ„“=1 1 โˆ’ ๐œ€/4๐‘‘ โ„“=1 1 โˆ’ ๐œ€/4๐‘‘ 1 โˆ’ ๐œ€/4๐‘‘ In (4.13), ๐œ‡X 0 = 1 is assumed even if ๐œ‡ = 0 since this still yields the correct bound in the case X where ๐‘— = ๐‘‘ and ๐œ‡X = 0. To get the desired error bound, we can now see that ๐‘‘โˆ’1 โˆ‘๏ธ 2 2 kXk 2 โˆ’ kY k 2 = X ( ๐‘—) โˆ’ X ( ๐‘—+1) ๐‘—=0 ๐‘‘โˆ’1 ๐œ€ โˆ‘๏ธ ยฉ โˆš๏ธ ร– โ‰ค ยญ1 + ๐‘Ÿ (๐‘Ÿ โˆ’ 1) ๐œ‡X ( ๐‘—) ,โ„“ ยฎ k๐œถ ๐‘— k 22 ยช ๐‘‘ ๐‘—=0 โ„“โ‰  ๐‘—+1 ยซ ยฌ ! ๐‘‘โˆ’1   ๐‘— ๐œ€ โˆ‘๏ธ โˆš๏ธ ๐œ‡X + ๐œ€/๐‘‘ (1 + ๐œ€/4๐‘‘) 2 ๐‘— k๐œถk 22 ๐‘‘โˆ’1โˆ’ ๐‘— โ‰ค 1 + ๐‘Ÿ (๐‘Ÿ โˆ’ 1) ๐œ‡X ๐‘‘ ๐‘—=0 1 โˆ’ ๐œ€/๐‘‘4 ๐‘‘โˆ’1  ๐‘— ! ๐œ€ โˆ‘๏ธ โˆš๏ธ ๐œ‡X + ๐œ€/๐‘‘ (1 + 9๐œ€/16๐‘‘) ๐‘— k๐œถk 22 , ๐‘‘โˆ’1โˆ’ ๐‘— โ‰ค 1 + ๐‘Ÿ (๐‘Ÿ โˆ’ 1) ๐œ‡X ๐‘‘ ๐‘—=0 1 โˆ’ ๐œ€/๐‘‘4 where the third part of Theorem 4.2.1, as well as (4.12) and (4.13) have been used. Considering each term in the upper bound above separately, we have that ๐œ€  โˆš๏ธ  2 2 2 kXk โˆ’ kY k โ‰ค k๐œถk 2 ๐‘‡1 + ๐‘Ÿ (๐‘Ÿ โˆ’ 1)๐‘‡2 ๐‘‘ where ๐‘‘โˆ’1 (1 + 9๐œ€/16๐‘‘) ๐‘‘ โˆ’ 1 e๐‘‘ โˆ‘๏ธ ๐‘‡1 := (1 + 9๐œ€/16๐‘‘) ๐‘— = โ‰ค ๐‘—=0 9๐œ€/16๐‘‘ 37 using 9๐œ€/16 < 1, and where ๐‘‘โˆ’1  ๐‘— ๐‘‘โˆ’1 โˆ‘๏ธ ๐œ‡X + ๐œ€/๐‘‘ ๐‘‘โˆ’1โˆ’ ๐‘— โˆ‘๏ธ ๐‘‘โˆ’1โˆ’ ๐‘— ๐‘‡2 := ๐œ‡X (1 + 9๐œ€/16๐‘‘) ๐‘— โ‰ค (๐œ‡X + ๐œ€/๐‘‘) ๐‘— ๐œ‡X (1 + ๐œ€/๐‘‘) ๐‘— ๐‘—=0 1 โˆ’ ๐œ€/๐‘‘4 ๐‘—=0 for ๐œ€ โ‰ค 3/4. Continuing to bound the second term we will consider three cases. First, if ๐œ‡X = 0 then ๐‘‡2 โ‰ค (๐œ€/๐‘‘) ๐‘‘โˆ’1 (1 + ๐œ€/๐‘‘) ๐‘‘โˆ’1 โ‰ค e (๐œ€/๐‘‘) ๐‘‘โˆ’1 , using Remark 4.2.1 and that ๐œ€ < 1. 
Second, if 0 < ๐œ‡X โ‰ค ๐œ€ then ๐‘‘โˆ’1 โˆ‘๏ธ ๐‘‘โˆ’1 โˆ‘๏ธ ๐‘— ๐‘‡2 โ‰ค (๐œ€ + ๐œ€/๐‘‘) ๐œ€ ๐‘‘โˆ’1โˆ’ ๐‘— ๐‘— (1 + ๐œ€/๐‘‘) = ๐œ€ ๐‘‘โˆ’1 (1 + 1/๐‘‘) ๐‘— (1 + ๐œ€/๐‘‘) ๐‘— ๐‘—=0 ๐‘—=0 e โ‰ค ๐œ€ ๐‘‘โˆ’1 ๐‘‘ (1 + 1/๐‘‘) ๐‘‘ (1 + ๐œ€/๐‘‘) ๐‘‘ โ‰ค ๐‘‘ 2 ๐œ€ ๐‘‘โˆ’1 , using Remark 4.2.1 and that ๐œ€ < 1 once more. If, however, ๐œ‡X > ๐œ€ then we can see that ๐‘‘โˆ’1 โˆ‘๏ธ ๐‘‘โˆ’1 โˆ‘๏ธ ๐‘‡2 โ‰ค ๐œ‡X ๐‘‘โˆ’1 (1 + ๐œ€/๐œ‡X ๐‘‘) ๐‘— (1 + ๐œ€/๐‘‘) ๐‘— โ‰ค ๐œ‡X ๐‘‘โˆ’1 (1 + 1/๐‘‘) ๐‘— (1 + ๐œ€/๐‘‘) ๐‘— ๐‘—=0 ๐‘—=0 โ‰ค ๐œ‡X ๐‘‘โˆ’1 ยท ๐‘‘ (1 + 1/๐‘‘) ๐‘‘ (1 + ๐œ€/๐‘‘) ๐‘‘ โ‰ค ๐œ‡X ๐‘‘โˆ’1 ๐‘‘ e1+๐œ€ โ‰ค ๐‘‘ e2 ๐œ‡X๐‘‘โˆ’1, where we have again utilized Remark 4.2.1. The desired result now follows. Theorem 4.2.2 expresses the distortion in the Euclidean norm of a low-rank tensor X after applying modewise JL embeddings in terms of its low-rank expansion coefficients norm k๐œถk 2 . The following lemma helps express the distortion in terms of the norm of a tensor X in with sufficiently small modewise coherence by establishing its relation to the norm of ๐œถ, as this is usually the convention in expressing error guarantees for JL embeddings. Lemma 4.2.3 Let X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ be a rank-๐‘Ÿ tensor in the standard form as per (4.3) with basis coherence ๐œ‡0X < (๐‘Ÿ โˆ’ 1) โˆ’1 . Then,   ! 2 1 1 k๐œถk 2 โ‰ค kXk 2 โ‰ค kXk 2 1 โˆ’ (๐‘Ÿ โˆ’ 1)๐œ‡0X รŽ๐‘‘ 1 โˆ’ (๐‘Ÿ โˆ’ 1) โ„“=1 ๐œ‡X,โ„“ ! (4.15) 1 โ‰ค kXk 2 . 1 โˆ’ (๐‘Ÿ โˆ’ 1)๐œ‡X ๐‘‘ 38 Proof To establish the result, one can write * ๐‘Ÿ ๐‘Ÿ + ๐‘Ÿ โˆ‘๏ธ โˆ‘๏ธ โˆ‘๏ธ D E kXk 2 = hX, Xi = ๐›ผ ๐‘˜ โ„“=1 ๐‘‘ x ๐‘˜(โ„“) , ๐›ผ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) = ๐›ผ๐‘˜ ๐›ผโ„Ž ๐‘‘ (โ„“) โ„“=1 x ๐‘˜ , ๐‘‘ (โ„“) โ„“=1 x โ„Ž ๐‘˜=1 ๐‘˜=1 ๐‘˜,โ„Ž=1 โˆ‘๏ธ๐‘Ÿ ร– ๐‘‘ D E โˆ‘๏ธ๐‘Ÿ โˆ‘๏ธ๐‘Ÿ ร– ๐‘‘ D E = ๐›ผ๐‘˜ ๐›ผโ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) = 2 |๐›ผ ๐‘˜ | + ๐›ผ๐‘˜ ๐›ผโ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) ๐‘˜,โ„Ž=1 โ„“=1 ๐‘˜=1 ๐‘˜โ‰ โ„Ž โ„“=1 ๐‘Ÿ โˆ‘๏ธ ร– ๐‘‘ D E = k๐œถk 2 + ๐›ผ๐‘˜ ๐›ผโ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) . ๐‘˜โ‰ โ„Ž โ„“=1 where (A.8) has been used to get the fourth equality. Therefore, โˆ‘๏ธ๐‘Ÿ ร– ๐‘‘ D E โˆ‘๏ธ๐‘Ÿ ร– ๐‘‘ D E ๐‘Ÿ โˆ‘๏ธ 0 2 kXk โˆ’ k๐œถk 22 = ๐›ผ๐‘˜ ๐›ผโ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) โ‰ค |๐›ผ ๐‘˜ ๐›ผโ„Ž | x ๐‘˜(โ„“) , xโ„Ž(โ„“) โ‰ค ๐œ‡X |๐›ผ๐‘˜ ๐›ผโ„Ž | ๐‘˜โ‰ โ„Ž โ„“=1 ๐‘˜โ‰ โ„Ž โ„“=1 ๐‘˜โ‰ โ„Ž !2 ๐‘Ÿ ๏ฃฏ ๐‘Ÿ ๏ฃฎ โˆ‘๏ธ โˆ‘๏ธ ๏ฃน   0 0 0 |๐›ผ ๐‘˜ | 2 ๏ฃบ๏ฃบ = ๐œ‡X k๐œถk 21 โˆ’ k๐œถk 22 โ‰ค ๐œ‡X (๐‘Ÿ โˆ’ 1) k๐œถk 22 , ๏ฃบ = ๐œ‡X ๏ฃฏ๏ฃฏ |๐›ผ ๐‘˜ | โˆ’ ๏ฃฏ ๐‘˜=1 ๐‘˜=1 ๏ฃบ ๏ฃฐ ๏ฃป 0 yielding the result and implying that ๐œ‡X (๐‘Ÿ โˆ’ 1) < 1 should also hold given that both kXk and k๐œถk 2 โˆš are non-negative numbers. The relation k๐œถk 1 โ‰ค ๐‘Ÿ k๐œถk 2 has been used to get the final inequality. Theorem 4.2.2 guarantees that modewise JL embeddings approximately preserve the norms of all tensors in the form of (4.3). Theorem 4.2.3 below states the inner product preservation property of JL embeddimgs for low-rank tensors, and guarantees that the inner products of all tensors in the C ๐‘›1 ยทยทยทร—๐‘› ๐‘‘ are preserved.  
๐‘‘ โ„“ span of the set B := โ„“=1 x ๐‘˜ ๐‘˜ โˆˆ [๐‘Ÿ] โˆˆ C Theorem 4.2.3 Suppose that X1 , X2 โˆˆ L โŠ‚ ๐‘›1 ร—ยทยทยทร—๐‘›๐‘‘ have standard forms as per (4.3), in terms C ๐‘›1 ยทยทยทร—๐‘› ๐‘‘ , given by  ๐‘‘ โ„“ of the elements of the basis B := โ„“=1 x ๐‘˜ ๐‘˜ โˆˆ [๐‘Ÿ] โˆˆ โˆ‘๏ธ ๐‘Ÿ โˆ‘๏ธ๐‘Ÿ X1 = ๐›ฝ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) , and X2 = ๐›ผ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) . ๐‘˜=1 ๐‘˜=1 C Let ๐œ€ โˆˆ (0, 3/4], and A ( ๐‘—) โˆˆ ๐‘š ๐‘— ร—๐‘› ๐‘— be defined as per Theorem 4.2.2 for each ๐‘— โˆˆ [๐‘‘]. Then, ? ? * ๐‘‘ ๐‘‘ +   A ( ๐‘—) , X2 A ( ๐‘—) โˆ’ hX1 , X2 i โ‰ค 2๐œ€0 k ๐œทk 22 + k๐œถk 22 โ‰ค 4๐œ€0 ยท max k ๐œทk 22 , k๐œถk 22  X1 ๐‘—=1 ๐‘—=1 kX1 k 2 , kX2 k 2  0 max โ‰ค 4๐œ€ ยท , 1 โˆ’ (๐‘Ÿ โˆ’ 1)๐œ‡0B 39 where e e ๏ฃฑ ๏ฃด  โˆš๏ธ  ๐œ€ + ๐‘Ÿ (๐‘Ÿ โˆ’ 1)๐œ€ ๐‘‘ if ๐œ‡ B = 0, ๏ฃด ๏ฃด ๏ฃฒ ๏ฃด ๐œ€0 := (4.16) e e  โˆš๏ธ   ๏ฃด ๐œ€ + 2 ๐‘Ÿ (๐‘Ÿ โˆ’ 1) max ๐œ€ ๐‘‘โˆ’1 , ๐œ‡ B ๏ฃด ๐‘‘โˆ’1 ๏ฃด ๏ฃด otherwise. ๏ฃณ Proof Using the polarization identity in combination with Lemma 2.0.2 and Theorem 4.2.2 we can see that ? ? * ๐‘‘ ๐‘‘ + X1 A ( ๐‘—) , X2 A ( ๐‘—) โˆ’ hX1 , X2 i ๐‘—=1 ๐‘—=1 2 3 ? ๐‘‘ ? ๐‘‘ 1 โˆ‘๏ธ i ( ๐‘—) i A ( ๐‘—) i 2ยช ยฉ โ„“ยญ โ„“ = ยญ X1 A + X2 โˆ’ X1 + โ„“ X2 2 ยฎยฎ 4 โ„“=0 ๐‘—=1 ๐‘—=1 ยซ 2 ยฌ 2 3 ? ๐‘‘ 1 โˆ‘๏ธ i i i ยฉ  2ยช = โ„“ยญ X 1 + โ„“ X 2 A ( ๐‘—) โˆ’ X1 + โ„“ X2 2 ยฎยฎ 4 โ„“=0 ยญ ๐‘—=1 ยซ 2 ยฌ 2 3 ? ๐‘‘ 1 โˆ‘๏ธ i i  2 โ‰ค X1 + โ„“ X2 A ( ๐‘—) โˆ’ X1 + โ„“ X2 2 4 โ„“=0 ๐‘—=1 2 3 3 1 i 2 1 โˆ‘๏ธ โˆ‘๏ธ โ‰ค ๐œ€0 ๐œท + โ„“ ๐œถ 2 โ‰ค ๐œ€0 (k ๐œทk 2 + k๐œถk 2 ) 2 = ๐œ€0 (k ๐œทk 2 + k๐œถk 2 ) 2 4 โ„“=0 4 โ„“=0   โ‰ค 2๐œ€0 k ๐œทk 22 + k๐œถk 22 โ‰ค 4๐œ€0 max k ๐œทk 22 , k๐œถk 22 ,  where the third to last and second to last inequalities follow from the triangle inequality and Youngโ€™s inequality for products, respectively. Applying Lemma 4.2.3 to relate the Euclidean norms of ๐œท and ๐œถ to X1 and X2 , respectively, leads to the final result. To sum up, theorems 4.2.2 and 4.2.3 guarantee that modewise JL embeddings approximately preserve the norms of and inner products between all tensors in the span of the set C๐‘› ร—ยทยทยทร—๐‘› . n o ๐‘‘ (โ„“) B := โ„“=1 x ๐‘˜ | ๐‘˜ โˆˆ [๐‘Ÿ] โŠ‚ 1 ๐‘‘ 40 4.2.1.1 Computational Complexity of Modewise Johnson-Lindenstrauss Embeddings >๐‘‘ Assuming X โˆˆ R๐‘› ร—...ร—๐‘› 1 ๐‘‘ and A๐‘š ๐‘— ร—๐‘› ๐‘— with ๐‘š ๐‘— โ‰ค ๐‘› ๐‘— , we have X ๐‘—=1 A ( ๐‘—) โˆˆ R๐‘š ร—...ร—๐‘š . The total 1 ๐‘‘ operation count4 of the embedding is the sum of the operation counts for each ๐‘—-mode product in each mode. Therefore, one can show that the operations count will be O (๐‘š 1 ๐‘›1 . . . ๐‘› ๐‘‘ ) + O (๐‘š 1 ๐‘š 2 ๐‘›2 . . . ๐‘› ๐‘‘ ) + ยท ยท ยท + O (๐‘š 1 ๐‘š 2 . . . ๐‘š ๐‘‘ ๐‘› ๐‘‘ ) = O (๐‘š 1 ๐‘›1 . . . ๐‘› ๐‘‘ ) . On the other hand, if X is vectorized, to achieve the same compression, it should be left-multiplied by a JL matrix A โˆˆ R๐‘š ...๐‘š ร—๐‘› ...๐‘› . The computational complexity would then be O (๐‘š1๐‘›1 . . . ๐‘š ๐‘‘ ๐‘›๐‘‘ ) 1 ๐‘‘ 1 ๐‘‘ which is significantly higher than the complexity of the modewise approach. 4.2.2 Main Theorems: Oblivious Tensor Subspace Embeddings So far, no assumption has been made about the type of JL embeddings that have been considered for dimension reduction of tensor data. In this section, the main theorems establishing bounds on the embedding dimension are presented in the case where randomness is incorporated into the JL embeddings being used. 
In the case of finite-dimensional bases, the JL embeddings of interest will be matrices drawn from random distributions, or are contructed as the product of matrices some of which have random properties. This section starts with the definition of the family of ๐œ‚-optimal JL embedding distributions, followed by the main theorems.  Definition 4.2.3 Fix ๐œ‚ โˆˆ (0, 1/2) and let D (๐‘š,๐‘›) N N be a family of probability distributions (๐‘š,๐‘›)โˆˆ ร— where each D (๐‘š,๐‘›) is a distribution over ๐‘š ร— ๐‘› matrices. We will refer to any such family of distributions as being an ๐œ‚-optimal family of JL embedding distributions if there exists an absolute constant ๐ถ โˆˆ R+ such that, for any given ๐œ€ โˆˆ (0, 1), ๐‘š, ๐‘› โˆˆ N with ๐‘š < ๐‘›, and nonempty set SโŠ‚ C๐‘› of cardinality ๐œ€2 ๐‘š   |๐‘†| โ‰ค ๐œ‚ exp , ๐ถ 4 Here, the elements of tensors and matrices are assumed to belong to the field of real numbers, for simplicity. The operation count can be updated accordingly when the field of complex numbers is considered. 41 a matrix A โˆผ D (๐‘š,๐‘›) will be an ๐œ€-JL embedding of S into C๐‘š with probability at least 1 โˆ’ ๐œ‚. In fact, many ๐œ‚-optimal families of JL embedding distributions exist for any given ๐œ‚ โˆˆ (0, 1/2), in- cluding, e.g., those associated with random matrices having independent and identically distributed (i.i.d.) sub-Gaussian entries (see Lemma 9.35 in [5]). We are now ready to state the main oblivious subspace property of JL embeddings for low-rank tensors. It should be noted, however, that as Lemma 4.2.3 suggests, an incoherence assumption is necessary to establish a relation between the Euclidean norm of a low-rank tensor and the norm of its expansion coefficients in the standard form. This assumption will be necessary for the proof of Theorem 4.2.4 to work. Theorem 4.2.4 Fix ๐œ€, ๐œ‚ โˆˆ (0, 1/2) and ๐‘‘ โ‰ฅ 2. Let L be an ๐‘Ÿ-dimensional subspace of ๐‘›1 ร—ยทยทยทร—๐‘›๐‘‘ C n o spanned by a basis of rank-1 tensors B := ๐‘‘ x (โ„“) ๐‘˜ โˆˆ [๐‘Ÿ] with modewise coherence as โ„“=1 ๐‘˜ per (4.4) satisfying ๐œ‡ B๐‘‘โˆ’1 < 1/2๐‘Ÿ. For each ๐‘— โˆˆ [๐‘‘] draw A ( ๐‘—) โˆˆ C๐‘š ร—๐‘› ๐‘— ๐‘— with   ๐‘š ๐‘— โ‰ฅ ๐ถหœ ยท ๐‘Ÿ 2/๐‘‘ ๐‘‘ 2 /๐œ€ 2 ยท ln 2๐‘Ÿ 2 ๐‘‘/๐œ‚ (4.17) from an (๐œ‚/๐‘‘)-optimal family of JL embedding distributions, where ๐ถหœ โˆˆ R+ is an absolute constant. Then, with probability at least 1 โˆ’ ๐œ‚ we have 2 X ร—1 A (1) ยท ยท ยท ร—๐‘‘ A (๐‘‘) โˆ’ kXk 2 โ‰ค ๐œ€ kXk 2 , (4.18) for all X โˆˆ L. Proof Let B be a set of ๐‘Ÿ rank-1 tensors, defined as C๐‘› ร—ยทยทยทร—๐‘› n o ๐‘‘ (โ„“) B := โ„“=1 x ๐‘˜ ๐‘˜ โˆˆ [๐‘Ÿ] โŠ‚ 1 ๐‘‘ with kx ๐‘˜(โ„“) k 2 = 1 for ๐‘˜ โˆˆ [๐‘Ÿ] and โ„“ โˆˆ [๐‘‘], and let L := span (B). We first note the coherence assumption ๐œ‡ B ๐‘‘โˆ’1 < 1/2๐‘Ÿ guarantees that ๐œ‡0B โ‰ค ๐œ‡ B ๐‘‘ โ‰ค ๐œ‡B ๐‘‘โˆ’1 < 1/2๐‘Ÿ < 1/2(๐‘Ÿ โˆ’ 1), 42 which can be rearranged as 4/(1 โˆ’ (๐‘Ÿ โˆ’ 1)๐œ‡0B ) โ‰ค 8. e Also, letting ๐›ฟ := (1/๐‘Ÿ) 1/๐‘‘ ๐œ€/16 and according to (4.16), it is enough to have ๐œ€ โ‰ฅ 8๐›ฟ + e e ๐‘‘โˆ’1 ) so that each embedding A ( ๐‘—) will be a (๐›ฟ/4๐‘‘)-JL embedding of the set 8๐›ฟ 2๐‘Ÿ max(๐›ฟ ๐‘‘โˆ’1 , ๐œ‡ B ๐‘†0๐‘— in (4.6) into C๐‘š ๐‘— where ๐œ€0 (๐›ฟ) is defined by (4.16) and ๐œ€ โ‰ฅ 8๐œ€0. Furthermore, if A ( ๐‘—) is taken from an ๐œ‚/๐‘‘-optimal family of JL distributions, it will also be a (๐›ฟ/4๐‘‘)-JL embedding of ๐‘†0๐‘— in (4.6) into C๐‘š ๐‘— with probability 1 โˆ’ ๐œ‚/๐‘‘ if ! 
๐›ฟ 2๐‘š ๐œ‚ |๐‘†0๐‘— | = 2๐‘Ÿ 2 โˆ’ ๐‘Ÿ โ‰ค exp ๐‘— , ๐‘‘ 16๐‘‘ 2๐ถ which is satisfied for each ๐‘š ๐‘— defined in (4.17). Finally, taking union bound over all ๐‘‘ modes concludes the proof. Note 4.2.2 Theorem 4.2.4 can be used in the special case where X is a matrix X โˆˆ C๐‘› ร—๐‘› . 1 2 In this case, the CP-rank is the usual matrix rank, and the CP decomposition becomes the regular SVD decomposition of the matrix, which can be computed efficiently in parallel (see, e.g., [8]). In particular, the basis vectors are orthogonal to each other in this case. The result of Theorem 4.2.4 implies that taking A and B as matrices belonging to the (๐œ‚/2)-JL embedding family and of sizes โˆš ๐‘›1 ร— ๐‘š 1 and ๐‘›2 ร— ๐‘š 2 , respectively, such that ๐‘š ๐‘— & ๐‘Ÿ ln(๐‘Ÿ/ ๐œ‚)/๐œ€ 2 (for ๐‘— = 1, 2), we get the following JL-type result for the Frobenius matrix norm: with probability 1 โˆ’ ๐œ‚, kA๐‘‡ XBk 2๐น = (1 + ๐œ€)kXk หœ 2 ๐น for some | ๐œ€| หœ โ‰ค ๐œ€. Theorem 4.2.5 Fix ๐œ€, ๐œ‚ โˆˆ (0, 1/2) and ๐‘‘ โ‰ฅ 3. Let X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› , ๐‘› := max 1 ๐‘‘ ๐‘— ๐‘› ๐‘— โ‰ฅ 4๐‘Ÿ + 1, and let L C๐‘› ร—ยทยทยทร—๐‘› n o (โ„“) be an ๐‘Ÿ-dimensional subspace of 1 ๐‘‘ spanned by a basis B := ๐‘‘ โ„“=1 x ๐‘˜ ๐‘˜ โˆˆ [๐‘Ÿ] of rank-1 tensors, with modewise coherence satisfying ๐œ‡ B ๐‘‘โˆ’1 < 1/2๐‘Ÿ. For each ๐‘— โˆˆ [๐‘‘], draw A ( ๐‘—) โˆˆ C๐‘š ร—๐‘› ๐‘— ๐‘— with โˆš  ๐‘š ๐‘— โ‰ฅ ๐ถ ๐‘— ยท ๐‘Ÿ๐‘‘ 3 /๐œ€ 2 ยท ln ๐‘›/ ๐‘‘ ๐œ‚ (4.19) 43 from an (๐œ‚/4๐‘‘)-optimal family of JL embedding distributions, where ๐ถ ๐‘— โˆˆ R+ is an absolute C๐‘š ร— 0 รŽ๐‘‘ constant. Furthermore, let A โˆˆ โ„“=1 ๐‘šโ„“ with   0 0 โˆ’2 47 ๐‘š โ‰ฅ๐ถ๐‘Ÿยท๐œ€ ยท ln โˆš๐‘Ÿ ๐œ€ ๐œ‚ be drawn from an (๐œ‚/2)-optimal family of JL embedding distributions, where ๐ถ 0 โˆˆ R+ is an absolute constant. Define ๐ฟหœ : C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ โ†’ C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘ by ๐ฟหœ (Z) = Z ร—1 A (1) ยท ยท ยท ร—๐‘‘ A (๐‘‘) . Then, with probability at least 1 โˆ’ ๐œ‚, the linear operator A โ—ฆ vect โ—ฆ ๐ฟหœ : C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ โ†’ C๐‘š 0 satisfies 2 A vect โ—ฆ ๐ฟหœ (X โˆ’ Y) โˆ’ kX โˆ’ Y k 2 โ‰ค ๐œ€ kX โˆ’ Y k 2  2 for all Y โˆˆ L. Proof To begin, we note that A will satisfy the conditions required by Theorem 5.1.1 with probability at least 1 โˆ’ ๐œ‚/2 as a consequence of Lemma 4.1.3. Thus, if we can also establish that ๐ฟหœ will satisfy the conditions required by Theorem 5.1.1 with probability at least 1 โˆ’ ๐œ‚/2, we will be finished with our proof by Theorem 5.1.1 and the union bound. To establish that ๐ฟหœ satisfies the conditions required by Theorem 5.1.1 with probability at least 1 โˆ’ ๐œ‚/2, it suffices to prove that (a) ๐ฟหœ will be an (๐œ€/6)-JL embedding of all Y โˆˆ L into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘ with probability at least 1 โˆ’ ๐œ‚/4, and that โˆš (b) ๐ฟหœ will be an (๐œ€/24 ๐‘Ÿ)-JL embedding of the 4๐‘Ÿ +1 tensors S 0 โˆช{ PL โŠฅ (X)} โŠ‚ C๐‘› ร—๐‘› ร—...ร—๐‘› 1 2 ๐‘‘ into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘ with probability at least 1 โˆ’ ๐œ‚/4, where the set S 0 is defined as in Theorem 5.1.1, and apply yet another union bound. To show that (a) holds, we will utilize Theorem 4.2.2 and Lemma 4.2.3. 
Since each A ( ๐‘—) matrix is an (๐œ‚/4๐‘‘)-optimal JL embedding and the sets S 0๐‘— defined as in (4.6) are such that |S 0๐‘— | < ๐‘› ๐‘‘ , โˆš  we know that each A ( ๐‘—) is an ๐œ€/480๐‘‘ ๐‘Ÿ -JL embedding of S 0๐‘— into ๐‘š ๐‘— with probability5 at C e holds for all ๐‘‘ > 0 in order to avoid a 5 Here โˆš ๐‘‘ โˆš e โˆš๐‘‘ we also implicitly use the fact that ๐‘‘ โ‰ค ๐‘‘ term appearing inside the logarithm in (4.19). 44 โˆš least 1 โˆ’ ๐œ‚/4๐‘‘. Thus, Theorem 4.2.2 holds with ๐œ€ โ†’ ๐œ€/120 ๐‘Ÿ with probability at least 1 โˆ’ ๐œ‚/4. Note that the modewise coherence assumption that ๐œ‡ B ๐‘‘โˆ’1 < 1/2๐‘Ÿ both allows ๐œ€ ๐‘‘โˆ’1 to reduce the โˆš๏ธ โˆš ๐‘Ÿ (๐‘Ÿ โˆ’ 1) factor in (4.9) to a size less than one for any ๐œ€ โ‰ค 1/ ๐‘Ÿ โ‰ค (1/๐‘Ÿ) 1/(๐‘‘โˆ’1) , and also allows Lemma 4.2.3 to guarantee that k๐œถk 22 < 2 kY k 2 holds for all Y โˆˆ L. Hence, applying Theorem 4.2.2 โˆš C with ๐œ€ โ†’ ๐œ€/120 ๐‘Ÿ will ensure that ๐ฟหœ is an (๐œ€/6)-JL embedding of all Y โˆˆ L into ๐‘š1 ร—ยทยทยทร—๐‘š ๐‘‘ . To show that (b) holds we will utilize Lemma 5.1.1. Note that the S ๐‘— sets defined in Lemma 5.1.1 all have cardinalities S ๐‘— โ‰ค ๐‘๐‘› ๐‘‘โˆ’1 , where ๐‘ = 4๐‘Ÿ + 1 โ‰ค ๐‘› in our current setting. As a consequence, โˆš we can see that the conditions of Lemma 5.1.1 will be satisfied with ๐œ€ โ†’ ๐œ€/24 ๐‘Ÿ for all ๐‘— โˆˆ [๐‘‘] with probability at least 1 โˆ’ ๐œ‚/4 by the union bound. Hence, both (a) and (b) hold and our proof is concluded. Figure 4.1 provides a schematic view of the 2-stage JL embedding introduced in Theorem 4.2.5 on a 3 ร— 4 ร— 5 sample tensor. Figure 4.1: An example of 2-stage JL embedding applied to a 3-dimensional tensor X โˆˆ R3ร—4ร—5 . The output of the 1st stage is the projected tensor Y = X ร—1 A (1) ร—2 A (2) ร—3 A (3) , where A ( ๐‘—) are JL matrices for ๐‘— โˆˆ {1, 2, 3}, A (1) โˆˆ R2ร—3 , A (2) โˆˆ R3ร—4 , and A (3) โˆˆ R4ร—5 , resulting in Y โˆˆ R2ร—3ร—4 . Matching colors have been used to show how the rows of A ( ๐‘—) interact with the mode- ๐‘— fibers of X (and the intermediate partially compressed tensors) to generate the elements of the mode- ๐‘— unfolding of the result after each ๐‘—-mode product. Next, the resulting tensor is vectorized (leading to y โˆˆ R24 ), and a 2nd -stage JL is then performed to obtain z = Ay where A โˆˆ R3ร—24 , and z โˆˆ R3 . 45 Note 4.2.3 (About ๐’“ and ๐œบ Dependence) Fix ๐‘‘, ๐‘›, and ๐œ‚. Looking at Theorem 4.2.5 we can see that itโ€™s intermediate embedding dimension is ร–๐‘‘ ๐‘‘ ๐‘š โ„“ โ‰ค ๐ถ๐‘‘,๐œ‚,๐‘› ๐‘Ÿ ๐‘‘ ๐œ€ โˆ’2๐‘‘ โ„“=1 which effectively determines its overall storage complexity. Hence, Theorem 4.2.5 will only result in an improved memory complexity over the straightforward single-stage vectorization approach if, e.g., the rank ๐‘Ÿ of L is relatively small. The purpose of facultative vectorization and subsequent multiplication by an additional JL transform A in Theorem 4.2.5 is to reduce the resulting final embedding dimension to the near-optimal order O (๐‘Ÿ๐œ€ โˆ’2 ) from total dimension O๐œ‚,๐‘› (๐‘‘ 3๐‘‘ ๐‘Ÿ ๐‘‘ ๐œ€ โˆ’2๐‘‘ ) that we have after the modewise compression. 
4.2.3 Fast and Memory-Efficient Modewise Johnson-Lindenstrauss Embeddings In this section we consider a fast Johnson-Lindenstrauss transform for tensors recently introduced in [10], which is effectively based on applying fast JL transforms [13] in a modewise fashion.6 Given a tensor Z โˆˆ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ the transform takes the form โˆš๏ธ‚ ๐‘    ๐ฟ FJL (Z) := R vect Z ร—1 F (1) D (1) ยท ยท ยท ร—๐‘‘ F (๐‘‘) D (๐‘‘) (4.20) ๐‘š where vect : C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ โ†’ C๐‘ for ๐‘ := รŽโ„“=1 ๐‘‘ ๐‘›โ„“ is the vectorization operator, R โˆˆ {0, 1} ๐‘šร—๐‘ is a matrix containing ๐‘š rows selected randomly from the ๐‘ ร— ๐‘ identity matrix, F (โ„“) โˆˆ C๐‘› ร—๐‘› is a โ„“ โ„“ unitary discrete Fourier transform matrix for all โ„“ โˆˆ [๐‘‘], and D (โ„“) โˆˆ C๐‘› ร—๐‘› is a diagonal matrix โ„“ โ„“ with ๐‘›โ„“ random ยฑ1 entries for all โ„“ โˆˆ [๐‘‘]. The following theorem is proven about this transform in [10, 13]. Theorem 4.2.6 (See Theorem 2.1 and Remark 4 in [10]) Fix ๐‘‘ โ‰ฅ 1, ๐œ€, ๐œ‚ โˆˆ (0, 1), and ๐‘ โ‰ฅ ๐ถ 0/๐œ‚ for a sufficiently large absolute constant ๐ถ 0 โˆˆ R+. Consider a finite set S โŠ‚ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ of cardinality 6 In fact, the fast transform described here differs cosmetically from the form in which it is presented in [10]. However, one can easily see they are equivalent using (2.10). 46 ๐‘ = |S|, and let ๐ฟ FJL : C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ โ†’ C๐‘š be defined as above in (4.20) with   ๏ฃฎ max( ๐‘,๐‘) ๏ฃน ๏ฃฏ  max( ๐‘, ๐‘)  log ๐œ‚ ๏ฃบ ๐‘š โ‰ฅ ๐ถ ๏ฃฏ๐œ€ โˆ’2 ยท log2๐‘‘โˆ’1 ยท log4 ยญยญ ยฉ ยช ๏ฃฏ ยฎ ยท log ๐‘ ๏ฃบ๏ฃบ , ๏ฃฏ ๐œ‚ ๐œ€ ยฎ ๏ฃบ ๏ฃฏ ๏ฃบ ๏ฃฐ ยซ ยฌ ๏ฃป where ๐ถ > 0 is an absolute constant. Then with probability at least 1 โˆ’ ๐œ‚ the linear operator ๐ฟ FJL is an ๐œ€-JL embedding of S into C๐‘š . If ๐‘‘ = 1 then we may replace max( ๐‘, ๐‘) with ๐‘ inside all of the logarithmic factors above (see [13]). ร Note that the fast transform ๐ฟ FJL requires only O (๐‘š log ๐‘ + โ„“ ๐‘›โ„“ ) i.i.d. random bits and memory for storage. Thus, it can be used to produce fast and low memory complexity oblivious subspace embeddings. The next Theorem does so. Theorem 4.2.7 Fix ๐œ€, ๐œ‚ โˆˆ (0, 1/2) and ๐‘‘ โ‰ฅ 2. Let X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› , ๐‘ = รŽโ„“=1 1 ๐‘‘ ๐‘‘ ๐‘›โ„“ โ‰ฅ 4๐ถ 0/๐œ‚ for an absolute constant ๐ถ 0 > 0, L be an ๐‘Ÿ-dimensional subspace of C๐‘› ร—ยทยทยทร—๐‘› for max 2๐‘Ÿ 2 โˆ’ ๐‘Ÿ, 4๐‘Ÿ โ‰ค ๐‘,  1 ๐‘‘ and ๐ฟ FJL : C๐‘› ร—ยทยทยทร—๐‘› โ†’ C๐‘š be defined as above in (4.20) with 1 ๐‘‘ 1   ๏ฃฎ ๐‘ ๏ฃน ๏ฃฏ  ๐‘Ÿ 2   ๐‘ log ๐œ‚ ๏ฃบ 2๐‘‘โˆ’1 4ยญ ๏ฃฏ ๐‘‘ ยฉ ยช โ‰ฅ ๐ถ1 ๏ฃฏ๐ถ2 ยท log ยท log ยญ ยท log ๐‘ ๏ฃบ , ๏ฃบ ๐‘š1 ยฎ ๏ฃฏ ๐œ€ ๐œ‚ ๐œ€ ยฎ ๏ฃบ ๏ฃฏ ๏ฃบ ๏ฃฐ ยซ ยฌ ๏ฃป where ๐ถ1 , ๐ถ2 > 0 are absolute constants. Furthermore, let L0FJL โˆˆ C๐‘š ร—๐‘š 2 1 be defined as above in (4.20) for ๐‘‘ = 1 with   ๏ฃฎ 47 ๏ฃน ๏ฃฏ  47  ๐‘Ÿ log ๐œ€ โˆš ๐‘Ÿ๐œ‚ ยช ๏ฃบ ๐‘š 2 โ‰ฅ ๐ถ3 ๏ฃฏ๐‘Ÿ ยท ๐œ€ โˆ’2 ยท log โˆš๐‘Ÿ ยท log4 ยญยญ ยฉ ๏ฃฏ ยฎ ยท log ๐‘š 1 ๏ฃบ๏ฃบ , ๏ฃฏ ๐œ€ ๐œ‚ ๐œ€ ยฎ ๏ฃบ ๏ฃฏ ๏ฃบ ๏ฃฐ ยซ ยฌ ๏ฃป where ๐ถ3 > 0 is an absolute constant. Then, with probability at least 1 โˆ’ ๐œ‚ it will be the case that 2 L0FJL (๐ฟ FJL (X โˆ’ Y)) 2 โˆ’ kX โˆ’ Y k 2 โ‰ค ๐œ€ kX โˆ’ Y k 2 holds for all Y โˆˆ L. In addition, the L0FJL , ๐ฟ FJL transform pair requires only O (๐‘š 1 log ๐‘ + โ„“ ๐‘›โ„“ ) random bits  ร and memory for storage (assuming w.l.o.g. 
that ๐‘š 2 โ‰ค ๐‘š 1 ), and L0FJL โ—ฆ ๐ฟ FJL : C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ โ†’ C๐‘š 2 can be applied to any tensor in just O (๐‘ log ๐‘)-time. 47 Proof Let {T๐‘˜ } ๐‘˜โˆˆ[๐‘Ÿ] be an orthonormal basis for L (note that these basis tensors need not be low-rank), and PL โŠฅ : C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ โ†’ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ be the orthogonal projection operator onto the orthogonal complement of L. Theorem 5.1.1 combined with Lemmas 4.1.4 and 4.1.3 imply that the result will be proven if all of the following hold: (i) ๐ฟ FJL is an (๐œ€/24๐‘Ÿ)-JL embedding of the 2๐‘Ÿ 2 โˆ’ ๐‘Ÿ tensors ! i i ร˜ ร˜ {T๐‘˜ โˆ’ Tโ„Ž , T๐‘˜ + Tโ„Ž , T๐‘˜ โˆ’ Tโ„Ž , T๐‘˜ + Tโ„Ž } {T๐‘˜ } ๐‘˜โˆˆ[๐‘Ÿ] โŠ‚ L 1โ‰คโ„Ž<๐‘˜ โ‰ค๐‘Ÿ into C๐‘š , 1 (ii) ๐ฟ FJL is an (๐œ€/6)-JL embedding of { PL โŠฅ (X)} into C๐‘š ,1 โˆš (iii) ๐ฟ FJL is an (๐œ€/24 ๐‘Ÿ)-JL embedding of the 4๐‘Ÿ tensors ร˜  PL โŠฅ (X) PL โŠฅ (X) PL โŠฅ (X) i PL โŠฅ (X) i  C๐‘› ร—...ร—๐‘› k PL PL PL PL โˆ’ T๐‘˜ , + T๐‘˜ , โˆ’ T๐‘˜ , + T๐‘˜ โŠ‚ 1 ๐‘‘ โŠฅ (X) k k โŠฅ (X) k k โŠฅ (X) k k โŠฅ (X)k ๐‘˜ โˆˆ[๐‘Ÿ] into C๐‘š , and 1 (iv) L0FJL is an (๐œ€/6)-JL embedding of a minimal (๐œ€/16)-cover, C, of the ๐‘Ÿ-dimensional Euclidean unit sphere in the subspace L 0 โŠ‚ C๐‘š 1 from Theorem 5.1.1 with ๐ฟ = ๐ฟ FJL into C๐‘š . Here we 2  ๐‘Ÿ note that |C| โ‰ค 47 ๐œ€ . Furthermore, if ๐‘š 1 and ๐‘š 2 are chosen as above for sufficiently large absolute constants ๐ถ1 , ๐ถ2 , and ๐ถ3 , then Theorem 4.2.6 implies that each of (๐‘–) โˆ’ (๐‘–๐‘ฃ) above will fail to hold with probability at most ๐œ‚/4. The desired result now follows from the union bound. The number of random bits and storage complexity follows directly form Theorem 4.2.6 after noting that each row of R in (4.20) is determined by O (log ๐‘) bits. The fact that L0FJL โ—ฆ ๐ฟ FJL can be applied to any tensor Z in O (๐‘ log ๐‘)-time again follows from the form of (4.20). Note that each ๐‘—-mode product with F ( ๐‘—) D ( ๐‘—) involves โ„“โ‰  ๐‘— ๐‘›โ„“ multiplications of F ( ๐‘—) D ( ๐‘—) against all the mode- ๐‘— รŽ fibers of the given tensor Z, each of which can be performed in O (๐‘› ๐‘— log(๐‘› ๐‘— ))-time using fast 48 Fourier transform techniques (or approximated even more quickly using sparse Fourier transform techniques if ๐‘› ๐‘— is itself very large). The required vectorization and applications of R can then be performed in just O (๐‘)-time thereafter. Finally, Fourier transform techniques can again be used to also apply L0FJL in O (๐‘š 1 log ๐‘š 1 )-time. 4.3 Experiments In this section, it is shown that the norms of several different types of (approximately) low-rank data can be preserved using JL embeddings. The data sets used in the experiments consist of 1. MRI data: This data set contains three 3-mode MRI images of size 240 ร— 240 ร— 155 [1]. 2. Randomly generated data: This data set contains 10 rank-10 4-mode tensors. Each test tensor is a 100 ร— 100 ร— 100 ร— 100 array that is created by adding 10 randomly generated rank-1 tensors. More specifically, each rank-10 tensor is generated according to ๐‘Ÿ โˆ‘๏ธ ( ๐‘—) X (๐‘š) = ๐‘‘ ๐‘—=1 x ๐‘˜ , ๐‘˜=1 where ๐‘š โˆˆ [10], ๐‘Ÿ = 10, ๐‘‘ = 4 and x ๐‘˜ ( ๐‘—) โˆˆ R100. ( ๐‘—) In the Gaussian case, each entry of x ๐‘˜ is drawn independently from the standard Gaussian distribution N (0, 1). 
In the case of ( ๐‘—) ( ๐‘—) coherent data, low-variance Gaussian noise is added to a constant, i.e., each entry x ๐‘˜,โ„“ of x ๐‘˜ ( ๐‘—) ( ๐‘—) is set as 1 + ๐œŽ๐‘” ๐‘˜,โ„“ with ๐‘” ๐‘˜,โ„“ being an i.i.d. standard Gaussian random variable defined above, โˆš and ๐œŽ 2 denoting the desired variance. In the experiments of this section, ๐œŽ = 0.1 is used. ( ๐‘—) In both cases, the 2-norm of x ๐‘˜ is also normalized to 1. The reason for running experiments on both Gaussian and coherent data is to show that although coherence requirements presented in section 4.2 are used to help get general theo- retical results for a large class of modewise JL embeddings, they do not seem to be necessary in practice. 49 When JL embeddings are applied, experiments are performed using Gaussian JL matrices as well as Fast JL matrices. For Gaussian JL, A ๐‘— = โˆš1 G is used for all ๐‘— โˆˆ [๐‘‘], where ๐‘š is the target ๐‘š dimension and each entry in G is an i.i.d. standard Gaussian random variable G๐‘–, ๐‘— โˆผ N (0, 1). For Fast JL, A ๐‘— = โˆš1 RFD is used for all ๐‘— โˆˆ [๐‘‘], where R denotes the random restriction matrix, F ๐‘š โˆš is the unitary DFT matrix scaled by ๐‘› ๐‘— ,7 and D is a diagonal matrix with Rademacher random variables forming its diagonal [13]. The embedded version of a test tensor X is always denoted by ๐ฟ (X), and is calculated by X ร—1 A (1) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) , ๏ฃฑ 1-stage JL ๏ฃด ๏ฃด ๏ฃด ๏ฃด ๏ฃด ๏ฃฒ ๏ฃด ๐ฟ (X) = (4.21) ๏ฃด ๏ฃด    ๏ฃด ๏ฃด A vec X ร—1 A (1) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) , 2-stage JL ๏ฃด ๏ฃด ๏ฃณ where A is a JL matrix used in the 2nd stage. Obviously, ๐ฟ (X) is a vector in the 2-stage case. 4.3.1 Effect of JL Embeddings on Norm In this section, numerical results have been presented, showing the effect of mode-wise JL embed- ding on the norm of 3 MRI 3-mode images treated as generic tensors, as well as randomly generated data. ( ๐‘—) The compression ratio for the ๐‘— th mode, denoted by ๐‘ 1 , is defined as the compression in the size of each of the mode- ๐‘— fibers, i.e., ( ๐‘—) ๐‘š๐‘— ๐‘1 = . ๐‘›๐‘—   The target dimension ๐‘š ๐‘— in JL matrices is chosen as ๐‘š ๐‘— = ๐‘ 1 ๐‘› ๐‘— for all ๐‘— โˆˆ [๐‘‘], to ensure that at least a fraction ๐‘ 1 of the ambient dimension in each mode is preserved. In the experiments, the ( ๐‘—) compression ratio is set to be the same for all modes, i.e., ๐‘ 1 = ๐‘ 1 for all ๐‘— โˆˆ [๐‘‘]. In the case of a 2-stage JL embedding, the target dimension ๐‘š of the secondary JL embedding is chosen as ๐‘š = d๐‘ 2 ๐‘e , 7 Recall that ๐‘› ๐‘— is the size of the mode- ๐‘— fibers of the input tensor. 50 where ๐‘ 2 is the compression ratio in the 2nd stage, and ๐‘ is the length of the vectorized projected tensor after the modewise JL embedding. The total achieved compression is calculated by ๐‘ ๐‘ก๐‘œ๐‘ก = รŽ  ๐‘‘ ( ๐‘—) nd รŽ๐‘‘ ( ๐‘—) ๐‘2 ๐‘—=1 1 . When the 2 stage embedding is skipped, ๐‘ ๐‘ก๐‘œ๐‘ก = ๐‘ ๐‘—=1 ๐‘ 1 . In all experiments of this section, when a 2-stage embedding is performed, ๐‘ 2 = 0.05. Also, in figure legends, when two JL types are listed together, the first and second terms refer to the first and second stages, respectively. For example, in โ€˜Gaussian+RFDโ€™, Gaussian and RFD JL embeddings were used in the first and second stages, respectively. The term โ€˜vecโ€™ in the legends refers to vectorizing the data. 
Assuming X denotes the original tensor and ๐ฟ (X) is the projected result, the relative norm of X is defined by k๐ฟ (X) k ๐‘ ๐‘›,X = . kXk The results of this section depict the interplay between ๐‘ ๐‘›,X and ๐‘ 1 for randomly generated data, and ๐‘ ๐‘›,X versus ๐‘ ๐‘ก๐‘œ๐‘ก for MRI data, where the numbers have been averaged over 1000 trials, as well as over all samples for each value of ๐‘ 1 or ๐‘ ๐‘ก๐‘œ๐‘ก . In the case of Figure 4.2, 1000 randomly generated JL matrices were applied to each mode of all 10 randomly generated tensors. The results there indicate that the modewise embedding methods proposed herein still work on relatively coherent data despite the incoherence assumptions utilized in their theoretical analysis (recall Section 4.2). In Figure 4.3, 1000 JL embedding choices have been averaged over each of the 3 MRI images as well as the 3 images themselves. As expected, it can be observed in both figures that increasing the compression ratio leads to better norm (and distance) preservation. The MRI data experiments were done using various combinations of JL matrices in the first and second stages, and were compared with the 1-stage (modewise) case and also JL applied to vectorized data. In Figure 4.3b, the runtime plots show that vectorizing the data before applying JL embeddings is the most computationally intensive way of compressing the data, although it preserves norms the best, as Figure 4.3a demonstrates. Due to the small mode sizes of the MRI data used in the experiments, modewise fast JL does not outperform modewise Gaussian JL in terms of computational efficiency in the modewise embeddings as one might initially expect (see the red 51 and blue curves). This is likely due to the fact that the individual mode sizes are too small to benefit from the FFT (recall all modes are โ‰ค 240 in size), together with the need of Fourier methods to use less efficient complex number arithmetic. However, when the 2-stage JL is employed for larger compression ratios, the vectorized data after the first stage compression is large enough to make the efficiency of fast JL over Gaussian JL embeddings clear (compare, e.g., the yellow and purple curves). Also the small sizes of modes make the use of explicitly constructed โˆš1 RFD matrices ๐‘š more efficient than taking the FFT of mode fibers. It should be noted that in the second stage of the 2-stage JL throughout the experiments of this thesis, the matrix โˆš1 RFD is not constructed explicitly as this would be inefficient due to the large ๐‘š size of the vector that โˆš1 RFD is applied to. Instead, the signs of the vector are randomly changed ๐‘š (the effect of D) followed by a Fourier transform (the effect of F). Finally ๐‘š samples of the โˆš resulting vector are picked at random with replacement (the effect of R) after which the scale 1/ ๐‘š is applied. This allows one to notice the computational efficiency of FFT in the fast JL embedding. 1 1 0.9 0.9 JL (Gaussian) JL (Gaussian) Fast JL (RFD) Fast JL (RFD) 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 (a) (b) Figure 4.2: Relative norm of randomly generated 4-dimensional data. Here, the total compression will be ๐‘ ๐‘ก๐‘œ๐‘ก = ๐‘41 . (a) Gaussian data. (b) Coherent data. Note that the modewise approach still preserves norms well for the coherent data indicating that the incoherence assumptions utilized in Section 4.2 can likely be relaxed. 
By looking at Figure 4.2 We observe that the proposed modewise JL approach leads to very good norm preservation for data generated from both coherent and incoherent factors. Specifically, 52 the compression listed on the horizontal axis is for one mode, and given that the synthesized data samples are 4-mode tensors, the total compression is very good. 1.02 100 Gaussian 1 RFD 0.98 Gaussian+Gaussian Gaussian+RFD 0.96 RFD+RFD 0.94 vec+RFD 10-1 Gaussian 0.92 RFD 0.9 Gaussian+Gaussian Gaussian+RFD 0.88 RFD+RFD vec+RFD 0.86 10-2 10-7 10-6 10-5 10-4 10-3 10-2 10-7 10-6 10-5 10-4 10-3 10-2 (a) (b) Figure 4.3: Simulation results averaged over 1000 trials for 3 MRI data samples, where each sample is 3-dimensional. In the 2-stage cases, ๐‘ 2 = 0.05 has been used. (a) Relative norm. (b) Runtime. Figure 4.3b depicts the results of the proposed modewise JL method on three MRI data samples. Although part (a) shows relative superiority of the โ€˜vec+RFDโ€™ in terms of accuracy, part (b) suggests that for good compression values, the 1-stage and 2-stage JL approaches yields much smaller runtimes. The reason โ€˜vec+RFDโ€™ is leading to an almost horizontal line in part (b) lies in the way โˆš1 RFD is applied. As the matrix is not formed explicitly and the only part that determines the ๐‘š compression is the restriction (which does not inflict any computational load if it is simply picking random samples from a vector), choosing various compression values does not alter the runtime. However, if one explicitly forms and applies โˆš1 RFD, the runtime will change with the chosen ๐‘š compression. 53 CHAPTER 5 APPLICATIONS OF MODEWISE JOHNSON-LINDENSTRAUSS EMBEDDINGS In this chapter, two cases are presented where modewise JL embeddings can be used to reduce the computational cost of computationally intensive problems. 5.1 Application to Least Squares Problems and CPD Fitting Tensor decomposition problems usually involve fitting a low-rank approximation to a given tensor that is assumed to have a low rank of some type. In this section, it is shown that modewise JL embeddings offer an efficient way to reduce the computational cost of such fitting problems through dimension reduction at the cost of an approximation error. Consider a tensor X which is assumed to have low CP rank ๐‘Ÿ. We would like to approximate X in the Euclidean norm with a tensor Y expressed in the standard form as per (4.3). As mentioned in Chapter 3, a common fitting method is the Alternating Least Squares, where the factors representing the rank-๐‘Ÿ subspace are solved for one mode at a time. One can start from a random subspace and improve the least squares error mode by mode through multiple iterations. Since the subspace of interest is changing throughout the fitting process, oblivious subspace embeddings would be a natural choice to reduce the fitting problem size. For an arbitrary tensor X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› , the fitting 1 ๐‘‘ process involves solving โˆ‘๏ธ๐‘Ÿ arg min Xโˆ’ ๐›ผ๐‘˜ ๐‘‘ x ๐‘˜(โ„“) (5.1) C๐‘› ๐‘— โ„“=1 ( ๐‘—) ( ๐‘—) xฬƒ1 ,...,xฬƒ๐‘Ÿ โˆˆ ๐‘˜=1 n o ( ๐‘—) ( ๐‘—) ( ๐‘—) for each mode ๐‘— โˆˆ [๐‘‘] after fixing x ๐‘˜(โ„“) . Here, x ๐‘˜ = xฬƒ ๐‘˜ /k xฬƒ ๐‘˜ k 2 โˆ€ ๐‘—, ๐‘˜ and ๐‘˜โˆˆ[๐‘Ÿ],โ„“โˆˆ[๐‘‘]\{ ๐‘— } ๐›ผ ๐‘˜ = โ„“=1 k xฬƒ ๐‘˜(โ„“) k 2 . One then varies ๐‘— through all values in [๐‘‘] solving (5.1) for each ๐‘— in order รŽ๐‘‘ ( ๐‘—) to update x ๐‘˜ โˆ€ ๐‘—, ๐‘˜. 
Sweeping through all modes usually takes place in numerous iterations until convergence is achieved, meaning the fit stops to improve, or the maximum number of iterations is exhausted. This in turn means a high computational load, and makes it particularly important to 54 solve each least squares problem (5.1) efficiently by reducing the problem size. To see how this can be done to solve (5.1), one may write 2 ๐‘Ÿ 2 ๐‘Ÿ รŒ 1 > โˆ‘๏ธ โˆ‘๏ธ ( ๐‘—) Xโˆ’ ๐›ผ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) = X ( ๐‘—) โˆ’ ๐›ผ๐‘˜ x๐‘˜ x ๐‘˜(โ„“) , ๐‘˜=1 ๐‘˜=1 โ„“=๐‘‘ โ„“โ‰  ๐‘— F as the Euclidean norm of a tensor is equal to the Frobenius norm of any of its unfoldings. By looking closely at the right-hand side of the above equation, one can see that the Frobenius norm squared can be calculated row-wise (also note that the Frobenius norm is equivalent with the 2-norm for vectors, i.e., rows or columns of a matrix). Denoting row โ„Ž of X ( ๐‘—) by x ๐‘—,โ„Ž and element โ„Ž of ( ๐‘—) x ๐‘˜(โ„“) by ๐‘ฅ ๐‘˜,โ„Ž , we can write 2 ๐‘Ÿ 2 ๐‘›๐‘— ๐‘Ÿ รŒ 1 > โˆ‘๏ธ โˆ‘๏ธ โˆ‘๏ธ ( ๐‘—) Xโˆ’ ๐›ผ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) = x ๐‘—,โ„Ž โˆ’ ๐›ผ ๐‘˜ ๐‘ฅ ๐‘˜,โ„Ž x ๐‘˜(โ„“) ๐‘˜=1 โ„Ž=1 ๐‘˜=1 โ„“=๐‘‘ โ„“โ‰  ๐‘— 2 (5.2) ๐‘›๐‘— ๐‘Ÿ 2 โˆ‘๏ธ โˆ‘๏ธ = X ( ๐‘—,โ„Ž) โˆ’ ๐›ผ0๐‘—,โ„Ž,๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) , โ„Ž=1 ๐‘˜=1 โ„“โ‰  ๐‘— ( ๐‘—) where ๐›ผ0๐‘—,โ„Ž,๐‘˜ = ๐›ผ๐‘˜ ๐‘ฅ ๐‘˜,โ„Ž with ๐›ผ ๐‘˜ is known for ๐‘˜ โˆˆ [๐‘Ÿ] from (5.1) and X ( ๐‘—,โ„Ž) is the tensorized form of x ๐‘—,โ„Ž which is in fact the โ„Žth mode- ๐‘— slice of X. It is also clear that the original problem (5.1) can be modeled as ๐‘› ๐‘— independent least squares problems that can be solved in parallel if needed, with each least squares problem involving a (๐‘‘ โˆ’ 1)-mode tensor X ( ๐‘—,โ„Ž) . Essentially, this would mean that for mode ๐‘—, one has to solve ๐‘› ๐‘— minimization problems, each of the following form. ๐‘Ÿ โˆ‘๏ธ arg min X ( ๐‘—,โ„Ž) โˆ’ ๐›ผ0๐‘—,โ„Ž,๐‘˜ ๐‘‘ x ๐‘˜(โ„“) . (5.3) ๐œถ 0๐‘—,โ„Ž โˆˆ C๐‘Ÿ ๐‘˜=1 โ„“=1 โ„“โ‰  ๐‘— Now, assuming that for each mode ๐‘—, the factors {x ๐‘˜(โ„“) } are sufficiently incoherent for ๐‘˜ โˆˆ [๐‘Ÿ] and โ„“ โˆˆ [๐‘‘] \ ๐‘—, we can use our modewise JL embedding method to solve a compressed version of each 55 least squares problem in (5.3) in the following way. ? ๐‘‘ โˆ‘๏ธ๐‘Ÿ ? ๐‘‘ 0 (โ„“ 0 ) arg min X ( ๐‘—,โ„Ž) A โˆ’ ๐›ผ0๐‘—,โ„Ž,๐‘˜ ๐‘‘ x ๐‘˜(โ„“) A (โ„“ ) (5.4) ๐œถ 0๐‘—,โ„Ž โˆˆ C๐‘Ÿ โ„“ 0 =1 ๐‘˜=1 โ„“=1 โ„“โ‰  ๐‘— โ„“ 0 =1 โ„“ 0โ‰  ๐‘— โ„“ 0โ‰  ๐‘— ( ๐‘—) ( ๐‘—) We can then update each entry of xฬƒ ๐‘˜ by setting ๐‘ฅหœ ๐‘˜,โ„Ž = ๐›ผ0๐‘—,โ„Ž,๐‘˜ /๐›ผ ๐‘˜ for all โ„Ž โˆˆ [๐‘› ๐‘— ] and ๐‘˜ โˆˆ [๐‘Ÿ]. To show that the solutions to (5.4) and (5.3) are close, we first establish that ? ๐‘‘ X ( ๐‘—,โ„Ž) A (โ„“) โ‰ˆ X ( ๐‘—,โ„Ž) โ„“ 0 =1 โ„“ 0โ‰  ๐‘— can also hold for all ๐‘— โˆˆ [๐‘‘] and โ„Ž โˆˆ [๐‘› ๐‘— ]. This is done in the following lemma. Lemma 5.1.1 Let ๐œ€ โˆˆ (0, 1), Z (1) , . . . , Z ( ๐‘) โˆˆ ๐‘›1 ร—ยทยทยทร—๐‘›๐‘‘ , and A (1) โˆˆ C C๐‘š ร—๐‘› 1 1 e be an (๐œ€/ ๐‘‘)-JL รŽ  ๐‘‘ embedding of the all ๐‘ โ„“=2 โ„“ mode-1 fibers of all ๐‘ of these tensors, ๐‘› C๐‘› , ร˜ n o S1 := Z:,๐‘–(๐‘ก)2 ,...,๐‘– ๐‘‘ | โˆ€๐‘–โ„“ โˆˆ [๐‘›โ„“ ], โ„“ โˆˆ [๐‘‘] \ {1} โŠ‚ 1 ๐‘กโˆˆ[ ๐‘] into C๐‘š . 
Next, set Z (1,๐‘ก) := Z (๐‘ก) ร—1 A(1) โˆˆ C๐‘š ร—๐‘› ร—ยทยทยทร—๐‘› โˆ€๐‘ก โˆˆ [ ๐‘], and then let A(2) โˆˆ C๐‘š ร—๐‘› 1 1 2 ๐‘‘ 2 2 be an (๐œ€/e๐‘‘)-JL embedding of all ๐‘ ๐‘š 1 โ„“=3  รŽ  ๐‘‘ ๐‘›โ„“ mode-2 fibers C๐‘› ร˜ n o S2 := Z๐‘–(1,๐‘ก) 1 ,:,๐‘– 3 ,...,๐‘– ๐‘‘ | โˆ€๐‘–1 โˆˆ [๐‘š 1 ] & ๐‘–โ„“ โˆˆ [๐‘›โ„“ ], โ„“ โˆˆ [๐‘‘] \ [2] โŠ‚ 2 ๐‘กโˆˆ[ ๐‘] into C๐‘š . Continuing inductively, for each ๐‘— โˆˆ [๐‘‘] \ [2] and ๐‘ก โˆˆ [ ๐‘] set Z ( ๐‘—โˆ’1,๐‘ก) := Z ( ๐‘—โˆ’2,๐‘ก) ร— ๐‘—โˆ’1 2 A ( ๐‘—โˆ’1) โˆˆ C๐‘š ร—ยทยทยทร—๐‘š ร—๐‘› ร—ยทยทยทร—๐‘› , and then let A ( ๐‘—) โˆˆ C๐‘š ร—๐‘› be an (๐œ€/e๐‘‘)-JL embedding of all 1 ๐‘—โˆ’1 ๐‘— ๐‘‘ ๐‘— ๐‘— รŽ  รŽ  ๐‘—โˆ’1 ๐‘‘ ๐‘ โ„“=1 ๐‘š โ„“ โ„“= ๐‘—+1 ๐‘›โ„“ mode- ๐‘— fibers C๐‘› ร˜ n o ( ๐‘—โˆ’1,๐‘ก) S ๐‘— := Z๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,:,๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ | โˆ€๐‘–โ„“ โˆˆ [๐‘š โ„“ ], โ„“ โˆˆ [ ๐‘— โˆ’ 1] & ๐‘–โ„“ โˆˆ [๐‘›โ„“ ], โ„“ โˆˆ [๐‘‘] \ [ ๐‘—], โŠ‚ ๐‘— ๐‘กโˆˆ[ ๐‘] into C๐‘š . Then, ๐‘— 2 2 2 Z (๐‘ก) โˆ’ Z (๐‘ก) ร—1 A (1) ยท ยท ยท ร—๐‘‘ A (๐‘‘) โ‰ค ๐œ€ Z (๐‘ก) will hold for all ๐‘ก โˆˆ [ ๐‘]. 56 Proof Fix ๐‘ก โˆˆ [ ๐‘] and let X (0) := Z (๐‘ก) , X ( ๐‘—) := Z ( ๐‘—,๐‘ก) for all ๐‘— โˆˆ [๐‘‘ โˆ’ 1], and X (๐‘‘) := Z (๐‘‘โˆ’1,๐‘ก) ร—๐‘‘ A (๐‘‘) = Z (๐‘ก) ร—1 A (1) ยท ยท ยท ร—๐‘‘ A (๐‘‘) . Choose any ๐‘— โˆˆ [๐‘‘], and let x ๐‘—,โ„Ž โˆˆ C๐‘› ๐‘— denote the โ„Žth ( ๐‘—โˆ’1) column of the mode- ๐‘— unfolding of X ( ๐‘—โˆ’1) , denoted by X ( ๐‘—) . It is easy to see that each x ๐‘—,โ„Ž is a รŽ  รŽ  mode- ๐‘— fiber of X ( ๐‘—โˆ’1) = Z ( ๐‘—โˆ’1,๐‘ก) for each 1 โ‰ค โ„Ž โ‰ค ๐‘ 0๐‘— := ๐‘—โˆ’1 โ„“=1 โ„“ ๐‘š โ„“= ๐‘—+1 โ„“ . Thus, we can ๐‘› see that 2 2 2 2 2 2 ( ๐‘—โˆ’1) ( ๐‘—โˆ’1) X ( ๐‘—โˆ’1) โˆ’ X ( ๐‘—) = X ( ๐‘—โˆ’1) โˆ’ X ( ๐‘—โˆ’1) ร— ๐‘— A ( ๐‘—) = X ( ๐‘—) โˆ’ A ( ๐‘—) X ( ๐‘—) F F ๐‘ 0๐‘— ๐‘ 0๐‘— โˆ‘๏ธ 2 โˆ‘๏ธ = kx ๐‘—,โ„Ž k 22 โˆ’ A ( ๐‘—) x ๐‘—,โ„Ž โ‰ค kx ๐‘—,โ„Ž k 22 โˆ’ kA ( ๐‘—) x ๐‘—,โ„Ž k 22 2 โ„Ž=1 โ„Ž=1 ๐‘ 0๐‘— ๐œ€ โˆ‘๏ธ ๐œ€ 2 ๐œ€ 2 ( ๐‘—โˆ’1) kx ๐‘—,โ„Ž k 22 = X ( ๐‘—โˆ’1) โ‰ค e ๐‘‘ โ„Ž=1 e ๐‘‘ X ( ๐‘—) F = e ๐‘‘ . 2 2 A short induction argument now reveals that X ( ๐‘—) e ๐œ€ ๐‘— X (0)  โ‰ค 1+ ๐‘‘ holds for all ๐‘— โˆˆ [๐‘‘]. As a result we can now see that 2 2 ๐‘‘ 2 2 ๐‘‘ 2 2 ๐‘‘ 2 (0) (๐‘‘) โˆ‘๏ธ ( ๐‘—โˆ’1) ( ๐‘—) โˆ‘๏ธ ( ๐‘—โˆ’1) ( ๐‘—) ๐œ€ โˆ‘๏ธ ( ๐‘—โˆ’1) X โˆ’ X = ๐‘—=1 X โˆ’ X โ‰ค ๐‘—=1 X โˆ’ X โ‰ค e ๐‘‘ ๐‘—=1 X ๐‘‘ 2 2 ๐œ€ โˆ‘๏ธ  ๐œ€  ๐‘—โˆ’1 (0) ๐œ€ ๐œ€ ๐‘‘ X (0) โ‰ค e ๐‘‘ ๐‘—=1 1+ e ๐‘‘ X โ‰ค e 1+ e ๐‘‘ . holds. The desired result now follows from Remark 4.2.1. Now, we can use the result of Lemma 5.1.1 to prove that the solution to (5.4) will be a close approximation to that of (5.3) if the matrices A ( ๐‘—) are chosen appropriately. We have the following >๐‘‘ general result which directly applies to least squares problems as per (5.4) when ๐ฟหœ (Z) := Z A (โ„“) โ„“=1 โ„“โ‰  ๐‘— and A = I. Theorem 5.1.1 (Embeddings for Compressed Least Squares) Let X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› , L be an ๐‘Ÿ- 1 ๐‘‘ dimensional subspace of C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ spanned by a set of orthonormal basis tensors {T๐‘˜ } ๐‘˜โˆˆ[๐‘Ÿ] , and PL : C๐‘› ร—ยทยทยทร—๐‘› โŠฅ 1 ๐‘‘ โ†’ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ be the orthogonal projection operator on the orthogonal complement 57 of L. 
Fix ๐œ€ โˆˆ (0, 1) and suppose that the linear operator ๐ฟหœ : C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› 1 2 ๐‘‘ โ†’ C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 has both of the following properties: (i) ๐ฟหœ is an (๐œ€/6)-JL embedding of all Y โˆˆ L โˆช { PL โŠฅ (X)} into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 , and โˆš (ii) ๐ฟหœ is an (๐œ€/24 ๐‘Ÿ)-JL embedding of the 4๐‘Ÿ tensors 0 ร˜  P L โŠฅ (X) P L โŠฅ (X) PL โŠฅ (X) i PL โŠฅ (X) i  C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› S := P โˆ’ T๐‘˜ , P + T๐‘˜ , PL โˆ’ T๐‘˜ , PL + T๐‘˜ โŠ‚ 1 2 ๐‘‘ k L โŠฅ (X) k k L โŠฅ (X)k k โŠฅ (X) k k โŠฅ (X) k ๐‘˜ โˆˆ[๐‘Ÿ] into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 . C๐‘š ร—ยทยทยทร—๐‘š C รŽ๐‘‘ 0 Furthermore, let vect : 1 ๐‘‘0 โ†’ โ„“=1 ๐‘šโ„“ be a reshaping vectorization operator, and A โˆˆ C๐‘šร— รŽ๐‘‘ 0 โ„“=1 ๐‘šโ„“ be an (๐œ€/3)-JL embedding of the (๐‘Ÿ + 1)-dimensional subspace PL C รŽ๐‘‘ 0 L 0 := span vect โ—ฆ ๐ฟหœ ( (X)) , vect โ—ฆ ๐ฟหœ (T1 ) , . . . , vect โ—ฆ ๐ฟหœ (T๐‘Ÿ ) โŠ‚  ๐‘šโ„“ โŠฅ โ„“=1 into C๐‘š . Then, 2 A vect โ—ฆ ๐ฟหœ (X โˆ’ Y) โˆ’ kX โˆ’ Y k 2 โ‰ค ๐œ€ kX โˆ’ Y k 2  2 holds for all Y โˆˆ L. Proof Note that the theorem will be proven if ๐ฟหœ is an (๐œ€/3)โ€“JL embedding of all tensors of the form C X โˆ’ Y Y โˆˆ L into ๐‘š1 ร—ยทยทยทร—๐‘š ๐‘‘ 0 since any such tensor Xโˆ’Y will also have vectโ—ฆ ๐ฟหœ (X โˆ’ Y) โˆˆ L 0  so that 2 A vect โ—ฆ ๐ฟหœ (X โˆ’ Y) โˆ’ kX โˆ’ Y k 2  2 2 2 2 A vect โ—ฆ ๐ฟหœ (X โˆ’ Y) โˆ’ ๐ฟหœ (X โˆ’ Y) + ๐ฟหœ (X โˆ’ Y) โˆ’ kX โˆ’ Y k 2  โ‰ค 2 2 2 ๐œ€ A vect โ—ฆ ๐ฟหœ (X โˆ’ Y) โˆ’ vect โ—ฆ ๐ฟหœ (X โˆ’ Y) kX โˆ’ Y k 2  โ‰ค 2 2 + 3 ๐œ€ 2 ๐œ€ โ‰ค vect โ—ฆ ๐ฟหœ (X โˆ’ Y) 2 + kX โˆ’ Y k 2 3 3 ๐œ€ 2 ๐œ€ = ๐ฟหœ (X โˆ’ Y) + kX โˆ’ Y k 2 3 3 ๐œ€  ๐œ€  ๐œ€ โ‰ค 1+ kX โˆ’ Y k 2 + kX โˆ’ Y k 2 โ‰ค ๐œ€ kX โˆ’ Y k 2 . 3 3 3 58 Let PL be the orthogonal projection operator onto L. Our first step in establishing that ๐ฟหœ is an (๐œ€/3)โ€“JL embedding of all tensors of the form X โˆ’ Y Y โˆˆ L into C๐‘š ร—ยทยทยทร—๐‘š will be to show  1 ๐‘‘0 that ๐ฟหœ preserves all the angles between PL (X) and L well enough that the Pythagorean theorem โŠฅ kX โˆ’ Y k 2 = k PL โŠฅ (X) + PL (X) โˆ’ Y k 2 = k PL โŠฅ (X)k 2 + k PL (X) โˆ’ Y k 2 still approximately holds for all Y โˆˆ L after ๐ฟหœ is applied. Toward that end, let ๐œธ โˆˆ ๐‘Ÿ be such C P ร P that L (X) โˆ’ Y = ๐‘˜โˆˆ[๐‘Ÿ] ๐›พ ๐‘˜ T๐‘˜ and note that k๐œธk 2 = k L (X) โˆ’ Y k due to the orthonormality of {T๐‘˜ } ๐‘˜โˆˆ[๐‘Ÿ] . Appealing to Lemma 4.1.2 we now have that ๐ฟหœ ( PL (X) โˆ’ Y) , ๐ฟหœ (PL PL โˆ‘๏ธ  ๐›พ ๐‘˜ ๐ฟหœ (T๐‘˜ ) , ๐ฟหœ  PL โŠฅ (X)  k PL โŠฅ (X)) = k โŠฅ (X) k โŠฅ (X)k ๐‘˜โˆˆ[๐‘Ÿ] P PL   โˆ‘๏ธ ๐œ€ ๐œ€ โ‰ค k L โŠฅ (X) k โˆš |๐›พ ๐‘˜ | โ‰ค k โŠฅ (X) k k๐œธk 2 6 ๐‘Ÿ 6 ๐‘˜ โˆˆ[๐‘Ÿ] (5.5) ๐œ€  PL PL (X) โˆ’ Y k 2  ๐œ€ โ‰ค k โŠฅ (X) k 2 + k = kX โˆ’ Y k 2 . 12 12 Using (5.5) we can now see that 2 ๐ฟหœ (X โˆ’ Y) 2 โˆ’ kX โˆ’ Y k 2 = ๐ฟหœ (X โˆ’ Y) 2 2 โˆ’k PL (X)k 2 โˆ’ k PL (X) โˆ’ Y k 2 โŠฅ ๐ฟหœ ( PL (X)) โˆ’ k PL (X)k 2 + ๐ฟหœ ( PL (X) โˆ’ Y) โˆ’ k PL (X) โˆ’ Y k 2 2 2 โ‰ค โŠฅ โŠฅ + 2 ๐ฟหœ ( PL (X) โˆ’ Y) , ๐ฟหœ ( PL (X)) โŠฅ k PL (X)k 2 + k PL (X) โˆ’ Y k 2 + kX โˆ’ Y k 2 = kX โˆ’ Y k 2 . ๐œ€  ๐œ€ โ‰ค โŠฅ 6 3 Thus, ๐ฟหœ has the desired JL-embedding property required to conclude the proof. 5.1.1 Experiments: Effect of JL Embeddings on Least Squares Solutions In this section, trial least squares experiments with compressed tensor data are performed to show the effect of modewise JL embeddings on solutions to least squares problems. 
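As a small numerical preview of these experiments, the following sketch (an added illustration under the assumption of Gaussian JL matrices with i.i.d. N(0, 1/m_j) entries) compresses each mode of a random 4-mode tensor in sequence, exactly as in the construction of Lemma 5.1.1, and reports the resulting relative norm:

import numpy as np

def modewise_jl(X, ms, rng):
    # Compress each mode of X in turn with an independent Gaussian JL matrix
    # A^(j) of size m_j x n_j, i.e. form X x_1 A^(1) ... x_d A^(d).
    Y = X
    for j, m in enumerate(ms):
        n = Y.shape[j]
        A = rng.standard_normal((m, n)) / np.sqrt(m)
        Y = np.moveaxis(np.tensordot(A, Y, axes=([1], [j])), 0, j)
    return Y

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 40, 40, 40))
Y = modewise_jl(X, ms=[20, 20, 20, 20], rng=rng)
print(np.linalg.norm(Y) / np.linalg.norm(X))  # relative norm, typically close to 1

Values close to 1 indicate that the squared norm, and hence the least squares objective in (5.3)-(5.4), is nearly preserved by the modewise compression.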
In the experiments 59 of this section, the first sample of the three MRI data samples used in Section 4.3 is employed. Again, all experiments were carried out in MATLAB. First, it is shown that this MRI sample has a relatively low-rank CP representations by plotting its CP reconstruction error for various choices of rank. Next, the effect of modewise JL on least squares solutions is investigated by solving for the coefficients of the CP decomposition of the MRI sample in a least squares problem. This will be done by performing 1-stage (modewise) and 2-stage JL on the data, which we call compressed least squares, and will be compared with the case where a regular uncompressed least squares problem is solved instead. 5.1.1.1 CPD Reconstruction Before the experimental results, a short description of the basic form of CPD calculation is reviewd. ( ๐‘—) Given a tensor X, assume ๐‘Ÿ is known beforehand. The problem is now the calculation of x ๐‘˜ for ๐‘— โˆˆ [๐‘‘] and ๐‘˜ โˆˆ [๐‘Ÿ] and ๐œถ in (4.3), i.e. the solution to โˆ‘๏ธ๐‘Ÿ min kX โˆ’ Xฬ‚k with Xฬ‚ = ๐›ผ ๐‘˜ x ๐‘˜(1) x ๐‘˜(2) ยทยทยท x ๐‘˜(๐‘‘) . (5.6) Xฬ‚ ๐‘˜=1 As the Euclidean norm a ๐‘‘-mode tensor is equal to the Frobenius norm of its mode- ๐‘— unfoldings ( ๐‘—) for ๐‘— โˆˆ [๐‘‘], by letting x ๐‘˜ be the ๐‘˜ th column of a matrix X ( ๐‘—) โˆˆ C๐‘› ร—๐‘Ÿ , the above minimization ๐‘— problem can be written as  > ( ๐‘—) (๐‘‘) ( ๐‘—+1) ( ๐‘—โˆ’1) (1) min X ( ๐‘—) โˆ’ Xฬ‚ X ยทยทยท X X ยทยทยท X Xฬ‚ ( ๐‘—) F where Xฬ‚ ( ๐‘—) = X ( ๐‘—) diag (๐œถ), and the operator diag(ยท) creates a diagonal matrix with ๐œถ as its diagonal. Once solved for, the columns of Xฬ‚ ( ๐‘—) can then be normalized and used to form the coefficients รŽ ( ๐‘—) ๐›ผ ๐‘˜ = ๐‘‘๐‘—=1 k xฬ‚ ๐‘˜ k 2 for ๐‘˜ โˆˆ [๐‘Ÿ], although this is optional, i.e., if the columns are not normalized, the coefficients ๐›ผ ๐‘˜ in the factorization will all be ones. This procedure is repeated iteratively until the fit ceases to improve (the objective function stops improving with respect to a tolerance) or the maximum number of iterations are exhausted. To choose the rank of the decomposition as well as 60 0.35 0.3 0.25 0.2 0.15 0.1 0.05 20 40 60 80 100 120 140 160 180 200 Figure 5.1: Relative reconstruction error of CPD calculated for different values of rank ๐‘Ÿ for MRI data. As the rank increases, the error becomes smaller. obtaining the best estimates for X ( ๐‘—) , a commonly used consistency diagnostic called CORCONDIA can be employed as explained in Section 3.1.2. Now, the relative reconstruction error of CPD is calculated and plotted for various values of rank ๐‘Ÿ. Assuming X represents the data, this error is defined as kX โˆ’ Xฬ‚k ๐‘’ ๐‘ ๐‘๐‘‘ = , kXk where Xฬ‚ denotes the reconstruction of X. Figure 5.1 displays the results. 5.1.1.2 Compressed Least Squares Performance ( ๐‘—) Let x ๐‘˜ be known in ๐‘Ÿ โˆ‘๏ธ ๐‘‘ ( ๐‘—) Xโ‰ˆ ๐›ผ๐‘˜ ๐‘—=1 x๐‘˜ , ๐‘˜=1 for ๐‘˜ โˆˆ [๐‘Ÿ] and ๐‘— โˆˆ [๐‘‘]. They can be obtained from a previous iteration in the CPD fitting procedure. Here, they come from the CPD of the data calculated in section 5.1.1.1. Also, assume ( ๐‘—) these vectors have unit norms. In general, as stated in section 5.1.1.1, when x ๐‘˜ are obtained using a CPD algorithm, they do not necessarily have unit norms. Therefore, they are normalized and the รŽ ( ๐‘—) norms are absorbed into the coefficients of CPD. In other words, ๐›ผ ๐‘˜ = ๐‘‘๐‘—=1 kx ๐‘˜ k 2 for ๐‘˜ โˆˆ [๐‘Ÿ]. 
If the normalization of the vectors is not performed, ๐›ผ ๐‘˜ = 1 for ๐‘˜ โˆˆ [๐‘Ÿ]. The coefficients of the CPD 61 fit are the solutions to the following least squares problem, โˆ‘๏ธ๐‘Ÿ ๐‘‘ ( ๐‘—) ๐œถ = arg min X โˆ’ ๐›ฝ๐‘˜ ๐‘—=1 x๐‘˜ . ๐œท ๐‘˜=1 ( ๐‘—) As normalization of x ๐‘˜ was not performed when computing the CPD of the data in these experi- ments, the true solution will be ๐œถ = 1. An approximate solution for the coefficients can be obtained by solving for ! โˆ‘๏ธ๐‘Ÿ ๐‘‘ ( ๐‘—) ๐œถ ๐‘ƒ = arg min ๐ฟ (X) โˆ’ ๐ฟ ๐›ฝ๐‘˜ ๐‘—=1 x๐‘˜ , ๐œท ๐‘˜=1 where ๐œถ ๐‘ƒ is the vector ๐œถ estimated for randomly projected data, and ๐ฟ (X) is defined as per (4.21). This is in fact simply another way of demonstrating that solving (5.4) yields an approximate solution to (5.3) for a (๐‘‘ โˆ’ 1)-mode tensor. Of course, both of these problems can be solved using the vectorized versions of the tensors instead. Indeed, for ๐œถ ๐‘ƒ , vectorization should be done after random projection of X and the rank-1 tensors, i.e., ๐œถ ๐‘ƒ = arg min kx๐‘ƒ โˆ’ B๐œทk 2 = (Bโˆ— B) โˆ’1 Bโˆ— x๐‘ƒ = Bโ€  x๐‘ƒ , ๐œท where Bโ€  denotes the pseudo-inverse of B, x๐‘ƒ = vec (๐ฟ (X)), and B is a matrix whose ๐‘˜ th column    is vec ๐ฟ ๐‘‘ x ( ๐‘—) 1 for ๐‘˜ โˆˆ [๐‘Ÿ].2 The error measure used to evaluate the approximate solution ๐‘—=1 ๐‘˜ is defined as ๐‘’ ๐‘ƒ โˆ’ ๐‘’๐‘‡ ๐‘’๐‘Ÿ = , ๐‘’๐‘‡ ร๐‘Ÿ ( ๐‘—) ร ( ๐‘—) where ๐‘’๐‘‡ = X โˆ’ ๐‘˜=1 ๐›ผ ๐‘˜ ๐‘‘ ๐‘—=1 x๐‘˜ and ๐‘’ ๐‘ƒ = X โˆ’ ๐‘Ÿ๐‘˜=1 ๐›ผ๐‘ƒ,๐‘˜ ๐‘‘ ๐‘—=1 x๐‘˜ . This in fact compares the true CPD reconstruction error and the reconstruction error calculated using the approximate solution for the CPD coefficients ๐œถ ๐‘ƒ . The results are shown in Figure 5.2.   1 Again, ( ๐‘—) it is clear that in the 2-stage case, ๐ฟ (X) and ๐ฟ ๐‘‘ x ๐‘—=1 ๐‘˜ are vectors, and therefore, the operator vec (ยท) does not change the result. 2 The backslash operator was used to actually solve the resulting least squares problems in MATLAB. 62 100 Gaussian Gaussian RFD RFD Gaussian+RFD Gaussian+RFD 10-1 RFD+RFD RFD+RFD Vectorize+RFD 10-1 Vectorize+RFD 10-2 10-2 10-3 10-3 10-4 10-3 10-2 10-4 10-3 10-2 (a) (b) Gaussian Gaussian 100 RFD RFD Gaussian+RFD Gaussian+RFD RFD+RFD 101 RFD+RFD Vectorize+RFD Vectorize+RFD 10-1 10-2 100 -3 10 10-4 10-3 10-2 10-6 10-5 10-4 10-3 10-2 (c) (d) Figure 5.2: Effect of JL embeddings on the relative reconstruction error of least squares estimation of CPD coefficients. In the 2-stage cases, ๐‘ 2 = 0.05 has been used. (a) ๐‘Ÿ = 40. (b) ๐‘Ÿ = 75. (c) ๐‘Ÿ = 110. (d) Average runtime for ๐‘Ÿ = 40. The other runtime plots for ๐‘Ÿ = 75 and ๐‘Ÿ = 110 are qualitatively identical. In Figure 5.2, the compressed least squares results for the aforementioned MRI data sample have been plotted. We can observe that as we choose a higher rank for the CPD model, we obtain a smaller error in the estimated coefficients of ๐œถ. As expected, the โ€˜Vectorize+RFDโ€™ case yields the most accurate results by a small margin. However, its runtime is considerably larger due to the huge size of the vectorized tensor, although the it is benefiting from the computational efficiency of the FFT.3 To see why the runtime plot is almost flat regardless of the chosen compression, see 3 For information about how RFD is applied after vectorization, refer to Section 4.3.1. 63 the discussion at the end of Section 4.3.1. 
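A minimal sketch of the compressed coefficient fit just described is given below (added for illustration; `factors` is assumed to hold the known CP factor vectors x_k^(j) as the columns of one matrix per mode, and `mats` holds one JL matrix per mode):

import numpy as np

def rank1(vectors):
    # Outer product of d vectors: the rank-1 tensor  x^(1) o ... o x^(d).
    T = vectors[0]
    for v in vectors[1:]:
        T = np.multiply.outer(T, v)
    return T

def modewise_jl(T, mats):
    # Apply the JL matrix mats[j] to mode j of T for every mode.
    for j, A in enumerate(mats):
        T = np.moveaxis(np.tensordot(A, T, axes=([1], [j])), 0, j)
    return T

def compressed_cp_coefficients(X, factors, mats):
    # Solve the compressed least squares problem
    #   min_beta || L(X) - sum_k beta_k L(x_k^(1) o ... o x_k^(d)) ||
    # where L is the modewise JL map defined by `mats`.
    r = factors[0].shape[1]
    x_P = modewise_jl(X, mats).ravel()
    B = np.column_stack([
        modewise_jl(rank1([F[:, k] for F in factors]), mats).ravel()
        for k in range(r)
    ])
    alpha_P, *_ = np.linalg.lstsq(B, x_P, rcond=None)
    return alpha_P

The relative error e_r of the text can then be computed by comparing the reconstructions obtained with alpha_P and with the uncompressed solution.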
5.2 Application to Many-Body Perturbation Theory Problems

This section provides a framework for modeling the energy correction terms as sums of inner products between tensors, so that each inner product can be approximated according to the geometry-preserving property of JL embeddings outlined in Lemma 4.1.1. The idea is to calculate the inner product of tensors with reduced dimensions to obtain an approximate value of the true energy terms. In doing so, it is assumed that the data lie on a low-rank inner product space of tensors.

5.2.1 Second-order energy correction

The 2nd-order energy correction term is defined as

    E^(2) = Σ_{J=0}^{N_J−1} E^(2)(J),        (5.7)

where N_J is the number of blocks, and

    E^(2)(J) = −(1/4)(2J+1) Σ_{ijkl} H_{ijkl} H_{klij} D_{ijkl} = −(1/4)(2J+1) ⟨H, H̃⟩,        (5.8)

in which the Hamiltonian tensor H ∈ R^{n×n×n×n} should be updated for each value of J, and D has the same dimensions as H and is calculated from single-particle energy values. The tensor H̃ is a permuted version of H multiplied component-wise by D, i.e.,

    H̃_{ijkl} = H_{klij} D_{ijkl}.        (5.9)

Now, an approximation of (5.8) can be computed by randomly projecting H and H̃ onto a lower-dimensional space using modewise Johnson-Lindenstrauss embeddings:

    S = H ×_1 A^(1) ×_2 A^(2) ×_3 A^(3) ×_4 A^(4),        (5.10)

    S̃ = H̃ ×_1 A^(1) ×_2 A^(2) ×_3 A^(3) ×_4 A^(4),        (5.11)

where A^(j) ∈ R^{m_j×n} are JL matrices and m_j ≤ n for j ∈ [4]. Now, with high probability,

    ⟨H, H̃⟩ ≈ ⟨S, S̃⟩,        (5.12)

to within an adjustable error that is related to the target dimension sizes m_j. A 2nd-stage JL embedding can be applied to the vectorized versions of S and S̃ to further compress the projected tensors before computing the approximate inner product. This is done according to

    s_p = A vect(S),        (5.13)

where s_p ∈ R^m and A ∈ R^{m × Π_{j=1}^d m_j}.

Note 5.2.1 Assuming real arithmetic, the operations count to directly calculate the inner product is O(n_1 ··· n_d), while the computational complexity of a one-stage JL embedding is O(m_1 n_1 ··· n_d) as discussed in Section 4.2.1.1, which is obviously higher. However, the same compressed tensors can be reused when calculating many observables, including higher-order perturbative terms such as the third-order energy correction and the radius corrections. As the number of such terms increases, the overall computational complexity becomes much lower when compressed tensors are used to approximate the observables.

5.2.2 Radius Corrections

In what follows, a one-stage (modewise) JL compression scheme is discussed. In both cases shown below, a second-stage JL compression can also be performed after vectorizing the result of the first stage.
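The following sketch shows how (5.9)-(5.12) translate into a few lines of NumPy for a single block J; the synthetic, randomly generated H and D here are purely illustrative assumptions (real Hamiltonian and energy-denominator tensors would come from the many-body code), and the same machinery is reused for the radius terms below:

import numpy as np

def modewise_jl(T, mats):
    # Apply the JL matrix mats[j] to mode j of T, as in (5.10)-(5.11).
    for j, A in enumerate(mats):
        T = np.moveaxis(np.tensordot(A, T, axes=([1], [j])), 0, j)
    return T

def e2_block(H, D, J, mats):
    # Approximate E^(2)(J) = -(2J+1)/4 * <H, H~> with H~_ijkl = H_klij * D_ijkl,
    # using the compressed inner product <S, S~>.
    Ht = np.transpose(H, (2, 3, 0, 1)) * D
    S = modewise_jl(H, mats)
    St = modewise_jl(Ht, mats)
    return -0.25 * (2 * J + 1) * np.vdot(S, St)

rng = np.random.default_rng(2)
n, m = 30, 10
H = rng.standard_normal((n,) * 4)
D = rng.standard_normal((n,) * 4)
mats = [rng.standard_normal((m, n)) / np.sqrt(m) for _ in range(4)]
exact = -0.25 * np.vdot(H, np.transpose(H, (2, 3, 0, 1)) * D)
print(exact, e2_block(H, D, 0, mats))  # direct vs compressed estimate for J = 0

Note that the same JL matrices A^(j) must be applied to both H and H̃ so that the inner product, rather than just the individual norms, is (approximately) preserved.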
Particle Term: 65 The one-body particle term is expressed in the following way 1 โˆ‘๏ธ ๐‘…1 = H๐‘– ๐‘— ๐‘˜๐‘™ D๐‘– ๐‘— ๐‘˜๐‘™ D๐‘š ๐‘— ๐‘˜๐‘™ H๐‘˜๐‘™๐‘š ๐‘— ๐‘…๐‘š๐‘– 2 ๐‘– ๐‘— ๐‘˜๐‘™๐‘š 1 โˆ‘๏ธ โˆ‘๏ธ = H๐‘– ๐‘— ๐‘˜๐‘™ D๐‘– ๐‘— ๐‘˜๐‘™ D๐‘š ๐‘— ๐‘˜๐‘™ H๐‘˜๐‘™๐‘š ๐‘— ๐‘…๐‘š๐‘– (5.14) 2 ๐‘– ๐‘— ๐‘˜๐‘™ ๐‘š 1D E = HฬŒ , Hฬ‚ , 2 where HฬŒ is obtained by the component-wise product of H and D, i.e., HฬŒ๐‘– ๐‘— ๐‘˜๐‘™ = H๐‘– ๐‘— ๐‘˜๐‘™ D๐‘– ๐‘— ๐‘˜๐‘™ , โˆ‘๏ธ โˆ‘๏ธ Hฬ‚๐‘– ๐‘— ๐‘˜๐‘™ = D๐‘š ๐‘— ๐‘˜๐‘™ H๐‘˜๐‘™๐‘š ๐‘— ๐‘…๐‘š๐‘– = Hฬƒ๐‘š ๐‘— ๐‘˜๐‘™ ๐‘…๐‘š๐‘– , (5.15) ๐‘š ๐‘š and R is the radius operator and a square matrix. Here, Hฬƒ is defined in (5.9). We can observe that Hฬ‚ = Hฬƒ ร—1 R> . Therefore, the approximate correction term would be calculated as 1 ๐‘…1 โ‰ˆ H๐‘1 , H๐‘2 , 2 where ? 4 H ๐‘1 = HฬŒ A (โ„“) , (5.16) โ„“=1 and ? 4 ? 4 (โ„“) (1) > H ๐‘2 = Hฬ‚ A = Hฬƒ ร—1 (A R ) A (โ„“) . (5.17) โ„“=1 โ„“=2 Hole Term: Calculations for the one-body hole term are very similar to the first term, as shown below. 1 โˆ‘๏ธ ๐‘…2 = H๐‘– ๐‘— ๐‘˜๐‘™ D๐‘– ๐‘— ๐‘˜๐‘™ H๐‘š๐‘™๐‘– ๐‘— D๐‘– ๐‘— ๐‘š๐‘™ ๐‘… ๐‘˜๐‘š (5.18) 2 ๐‘– ๐‘— ๐‘˜๐‘™๐‘š 1 โˆ‘๏ธ โˆ‘๏ธ = H๐‘– ๐‘— ๐‘˜๐‘™ D๐‘– ๐‘— ๐‘˜๐‘™ D๐‘– ๐‘— ๐‘š๐‘™ H๐‘š๐‘™๐‘– ๐‘— ๐‘… ๐‘˜๐‘š (5.19) 2 ๐‘– ๐‘— ๐‘˜๐‘™ ๐‘š 1D E = HฬŒ , Hฬ„ , (5.20) 2 where โˆ‘๏ธ โˆ‘๏ธ Hฬ„๐‘– ๐‘— ๐‘˜๐‘™ = D๐‘– ๐‘— ๐‘š๐‘™ H๐‘š๐‘™๐‘– ๐‘— ๐‘… ๐‘˜๐‘š = Hฬƒ๐‘– ๐‘— ๐‘š๐‘™ ๐‘… ๐‘˜๐‘š . (5.21) ๐‘š ๐‘š 66 We have that Hฬ„ = Hฬƒ ร—3 R. (5.22) Therefore, 1 ๐‘…2 โ‰ˆ H๐‘1 , H๐‘2 , (5.23) 2 where ? 4 H ๐‘1 = HฬŒ A (โ„“) (5.24) โ„“=1 as in the case of the first correction term, and ? 4 H ๐‘2 = Hฬ„ A (โ„“) โ„“=1 (5.25) = Hฬƒ ร—1 A (1) ร—2 A (2) ร—3 (A (3) R) ร—4 A (4) . It can be observed that all one needs to calculate the approximations to ๐‘…1 and ๐‘…2 is the two tensors HฬŒ and Hฬƒ . This has been depicted in the block diagram of Figure 5.3. In many cases, due to the symmetry in H , we have that HฬŒ = Hฬƒ which further reduces the storage requirements. Figure 5.3: A block diagram showing how the approximations to ๐‘…1 and ๐‘…2 are calculated. 67 5.2.3 Third-order energy correction The 3rd -order energy correction term is defined as ๐ฝ โˆ’1 ๐‘โˆ‘๏ธ (3) ๐ธ = ๐ธ (3) (๐ฝ) , (5.26) ๐ฝ=0 where the term ๐ธ (3) (๐ฝ) is calculated in each of the following settings. 5.2.3.1 Particle-Particle In the particle-particle case, 1 โˆ‘๏ธ ๐ธ (3) (๐ฝ) = (2๐ฝ + 1) H (๐‘–, ๐‘—, ๐‘˜, ๐‘™) H (๐‘˜, ๐‘™, ๐‘š, ๐‘›) H (๐‘š, ๐‘›, ๐‘–, ๐‘—) D (๐‘˜, ๐‘™, ๐‘–, ๐‘—) D (๐‘š, ๐‘›, ๐‘–, ๐‘—). 8 ๐‘–, ๐‘—,๐‘˜,๐‘™,๐‘š,๐‘› (5.27) Again, the Hamiltonian tensor H should be updated for each value of ๐ฝ. To calculate the sum, one can use a scheme similar to the one used in section 5.2.1, but this time with 6 dimensions: generate 6-dimensional tensors by regrouping the terms in (5.27) as H1 and H2 , and then calculate the inner product between H1 and H2 . There are multiple ways to group the terms in (5.27). In the following, two grouping options are listed. 
๏ฃฑ ๏ฃฒ H1 (๐‘–, ๐‘—, ๐‘˜, ๐‘™, ๐‘š, ๐‘›) = 81 H (๐‘–, ๐‘—, ๐‘˜, ๐‘™)H (๐‘˜, ๐‘™, ๐‘š, ๐‘›) ๏ฃด ๏ฃด ๏ฃด ๏ฃด Option 1 : (5.28) ๏ฃด ๏ฃด H2 (๐‘–, ๐‘—, ๐‘˜, ๐‘™, ๐‘š, ๐‘›) = H (๐‘š, ๐‘›, ๐‘–, ๐‘—)D (๐‘˜, ๐‘™, ๐‘–, ๐‘—)D (๐‘š, ๐‘›, ๐‘–, ๐‘—), ๏ฃด ๏ฃด ๏ฃณ ๏ฃฑ ๏ฃฒ H1 (๐‘–, ๐‘—, ๐‘˜, ๐‘™, ๐‘š, ๐‘›) = 18 H (๐‘–, ๐‘—, ๐‘˜, ๐‘™)H (๐‘˜, ๐‘™, ๐‘š, ๐‘›)D (๐‘˜, ๐‘™, ๐‘–, ๐‘—) ๏ฃด ๏ฃด ๏ฃด ๏ฃด Option 2 : (5.29) ๏ฃด ๏ฃด H2 (๐‘–, ๐‘—, ๐‘˜, ๐‘™, ๐‘š, ๐‘›) = H (๐‘š, ๐‘›, ๐‘–, ๐‘—)D (๐‘š, ๐‘›, ๐‘–, ๐‘—). ๏ฃด ๏ฃด ๏ฃณ Now, if S1 and S2 are the projected versions of H1 and H2 , we expect that ๐ธ (3) (๐ฝ) = (2๐ฝ + 1) hH1 , H2 i โ‰ˆ (2๐ฝ + 1) hS1 , S2 i. The problem with this approach lies in the fact that when the dimension sizes increase, the 6-mode tensors become problematic in terms of storage. For instance, for H โˆˆ R100ร—100ร—100ร—100 , 68 7.45 TB of space is needed to store each of H1 and H2 . To overcome this problem, we can reshape the tensors and perform the projections as explained below. It is observed that the indices of the hypothetical 6-mode tensors always appear in groups of two in the inner product summation. Therefore, they can be reshaped into 3-mode tensors by regrouping the indices, and one may perform mode-wise JL on the reshaped tensors. ๏ฃฑ ๏ฃฒ H1 ( ๐‘, ๐‘ž, ๐‘Ÿ) = 18 H ( ๐‘, ๐‘ž)H (๐‘ž, ๐‘Ÿ) ๏ฃด ๏ฃด ๏ฃด ๏ฃด Option 1 : (5.30) ๏ฃด ๏ฃด H2 ( ๐‘, ๐‘ž, ๐‘Ÿ) = H (๐‘Ÿ, ๐‘)D (๐‘ž, ๐‘)D (๐‘Ÿ, ๐‘) = Hฬƒ (๐‘Ÿ, ๐‘)D (๐‘ž, ๐‘), ๏ฃด ๏ฃด ๏ฃณ where Hฬƒ is defined similarly as in (5.9). Here, ๐‘ represents all relevant pairs of ๐‘– and ๐‘—, ๐‘ž encodes all pairs of ๐‘˜ and ๐‘™, and ๐‘Ÿ represents all pairs of ๐‘š and ๐‘› in the grouping operation4. The repetitive patterns existing in 3-mode tensors that now can be formed using combinations of matrices, as well as reducing the size of two modes at once, as a result of combining two indices into one index, will provide the tools to avoid dealing with extremely large tensors when performing the projections. For instance, to project H1 in (5.30), one must calculate โˆ‘๏ธ P1 (๐‘–1 , ๐‘–2 , ๐‘–3 ) = H1 ( ๐‘, ๐‘ž, ๐‘Ÿ)A (1) (๐‘–1 , ๐‘) A (2) (๐‘–2 , ๐‘ž) A (3) (๐‘–3 , ๐‘Ÿ) , ๐‘,๐‘ž,๐‘Ÿ which is the element-wise version of P1 = H1 ร—1 A (1) ร—2 A (2) ร—3 A (3) . According to the way H1 is defined it will be possible to decompose the triple summation into separate sums for two of the mode-wise projections. This way, one can obtain the fully-projected tensor P1 by only dealing with 2-mode (partially) compressed arrays. Algebraic details on how the mode-wise projections can be done in a memory efficient way are presented in Appendix B.1. Due to the resemblance of options 1 and 2 in terms of the method used, only option 1 will be considered for future experiments, and also in the Hole-Hole and Particle-Hole settings, only one option will be discussed. 4 For instance, if column-major formatting is used, and assuming the indices start from 1, the relation between ๐‘, ๐‘– and ๐‘— is ๐‘ = ๐‘– + ( ๐‘— โˆ’ 1) ๐‘, where ๐‘–, ๐‘— โˆˆ [๐‘]. 69 5.2.3.2 Hole-Hole In this case, the energy term for each value of ๐ฝ is expressed by 1 โˆ‘๏ธ ๐ธ (3) (๐ฝ) = (2๐ฝ + 1) H (๐‘–, ๐‘—, ๐‘˜, ๐‘™) H (๐‘˜, ๐‘™, ๐‘š, ๐‘›) H (๐‘š, ๐‘›, ๐‘–, ๐‘—) D (๐‘š, ๐‘›, ๐‘–, ๐‘—) D (๐‘š, ๐‘›, ๐‘˜, ๐‘™). 
8 ๐‘–, ๐‘—,๐‘˜,๐‘™,๐‘š,๐‘› (5.31) For simplicity, only one option will be used to form H1 and H2 , where ๏ฃฑ ๏ฃฒ H1 ( ๐‘, ๐‘ž, ๐‘Ÿ) = 18 H ( ๐‘, ๐‘ž)H (๐‘ž, ๐‘Ÿ) ๏ฃด ๏ฃด ๏ฃด ๏ฃด (5.32) ๏ฃด ๏ฃด H2 ( ๐‘, ๐‘ž, ๐‘Ÿ) = H (๐‘Ÿ, ๐‘)D (๐‘Ÿ, ๐‘)D (๐‘Ÿ, ๐‘ž) = Hฬƒ (๐‘Ÿ, ๐‘)D (๐‘Ÿ, ๐‘ž). ๏ฃด ๏ฃด ๏ฃณ Again, ๐‘– and ๐‘— are combined to form ๐‘, ๐‘˜ and ๐‘™ are grouped to form ๐‘ž, and ๐‘š and ๐‘› are combined to form ๐‘Ÿ, as explained above. Details on the calculations of mode-wise projections are presented in Appendix B.2. 5.2.3.3 Particle-Hole In this case, the energy term for each value of ๐ฝ is calculated by โˆ‘๏ธ ๐ธ (3) (๐ฝ) = (2๐ฝ + 1) H ๐‘ (๐‘–, ๐‘—, ๐‘˜, ๐‘™) H ๐‘ (๐‘˜, ๐‘™, ๐‘š, ๐‘›) H ๐‘ (๐‘š, ๐‘›, ๐‘–, ๐‘—) D (๐‘˜, ๐‘—, ๐‘™, ๐‘–) D ( ๐‘—, ๐‘š, ๐‘–, ๐‘›), ๐‘–, ๐‘—,๐‘˜,๐‘™,๐‘š,๐‘› (5.33) where the Hamiltonians are obtained after a Pandya transform shown by the subscript ๐‘ in the summation. To make the process of reshaping the data into 3-mode tensors possible, the dimensions of D should be permuted to get D1 (๐‘–, ๐‘—, ๐‘˜, ๐‘™) = D (๐‘˜, ๐‘—, ๐‘™, ๐‘–) D2 (๐‘š, ๐‘›, ๐‘–, ๐‘—) = D ( ๐‘—, ๐‘š, ๐‘–, ๐‘›). Then, we can choose ๏ฃฑ ๏ฃด ๏ฃฒ Hฬƒ1 (๐‘–, ๐‘—, ๐‘˜, ๐‘™) = H ๐‘ (๐‘–, ๐‘—, ๐‘˜, ๐‘™)D1 (๐‘–, ๐‘—, ๐‘˜, ๐‘™) ๏ฃด ๏ฃด ๏ฃด (5.34) ๏ฃด ๏ฃด Hฬƒ2 (๐‘š, ๐‘›, ๐‘–, ๐‘—) = H ๐‘ (๐‘š, ๐‘›, ๐‘–, ๐‘—)D2 (๐‘š, ๐‘›, ๐‘–, ๐‘—), ๏ฃด ๏ฃด ๏ฃณ 70 ๐‘’Max 4 6 8 10 12 14 ๐‘› 30 56 90 132 174 216 Table 5.1: Basis truncation parameters and mode dimensions for single-particle bases labeled by ๐‘’Max. leading to the reshaped version ๏ฃฑ ๏ฃด ๏ฃฒ H1 ( ๐‘, ๐‘ž, ๐‘Ÿ) = Hฬƒ1 ( ๐‘, ๐‘ž)H ๐‘ (๐‘ž, ๐‘Ÿ) ๏ฃด ๏ฃด ๏ฃด (5.35) ๏ฃด ๏ฃด H2 ( ๐‘, ๐‘ž, ๐‘Ÿ) = Hฬƒ2 (๐‘Ÿ, ๐‘). ๏ฃด ๏ฃด ๏ฃณ Memory efficient calculations for the mode-wise projections can be found in Appendix B.3. 5.2.4 Experiments In this section, numerical results are provided to demonstrate how mode-wise JL embeddings affect the accuracy of energy calculations. Experiments are done for different data sizes, i.e., for H , D โˆˆ R๐‘›ร—๐‘›ร—๐‘›ร—๐‘› where the dimension size ๐‘› is chosen from the set of number listed in Table 5.1. In each case, the relative error in ๐ธ (2) , ๐ธ (3) , and the radius correction terms ๐‘…1 and ๐‘…2 defined by (5.36), (5.37), and (5.38), and are plotted for various values of compression. ๐ธ ๐‘(2) โˆ’ ๐ธ (2) ! ฮ”๐ธ (2) = mean . (5.36) ๐ธ (2) ๐ธ ๐‘(3) โˆ’ ๐ธ (3) ! ฮ”๐ธ (3) = mean . (5.37) ๐ธ (3)   ๐‘…๐‘ โˆ’ ๐‘… ฮ”๐‘… = mean . (5.38) ๐‘… where the subscript ๐‘ is used to denote the corresponding value calculated after the projection of tensors, and mean(๐‘‹) denotes the mean of ๐‘‹. Compression in mode ๐‘— is defined by ๐‘š๐‘— ๐‘๐‘— = , (5.39) ๐‘ 71 where ๐‘ and ๐‘š ๐‘— denote the size of mode ๐‘— before and after projection, respectively. The target   dimension ๐‘š ๐‘— in JL matrices is chosen as ๐‘š ๐‘— = ๐‘ ๐‘— ๐‘› for all ๐‘—, to ensure that at least a fraction ๐‘ ๐‘— of the ambient dimension in mode ๐‘— is preserved. In the experiments, compression is chosen the same for all modes, i.e., ๐‘ ๐‘— = ๐‘ for all ๐‘—. It should be noted that for the ๐ธ (3) calculations, the size of each dimension in the reshaped data is ๐‘ = ๐‘›2 , while for the ๐ธ (2) and ๐‘… calculations, ๐‘ = ๐‘›. 5.2.4.1 ๐ธ (2) Experiments Experiment results for O16 have been plotted in Figures 5.4, 5.5, and 5.6. 
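To make the memory-efficient projection of Section 5.2.3.1 concrete before turning to the results, the following sketch (illustrative sizes and variable names only, an assumption of this write-up rather than the thesis code) computes P1 = H1 ×_1 A^(1) ×_2 A^(2) ×_3 A^(3) for H1(p,q,r) = (1/8) H(p,q) H(q,r) without ever forming the 3-mode tensor H1, by splitting the triple sum over p, q, r into separate matrix products:

import numpy as np

def project_h1(Hmat, A1, A2, A3):
    # Because H1 factors over its indices, the projection separates as
    #   P1[a,b,c] = (1/8) * sum_q (A1 H)[a,q] * A2[b,q] * (H A3^T)[q,c],
    # so only N x N matrices and the small projected arrays are needed.
    left = A1 @ Hmat      # sums over p, shape (m1, N)
    right = Hmat @ A3.T   # sums over r, shape (N, m3)
    return np.einsum('aq,bq,qc->abc', left, A2, right) / 8.0

rng = np.random.default_rng(3)
N, m = 64, 8              # a real eMax = 8 case would have N = 90^2
Hmat = rng.standard_normal((N, N))
A1, A2, A3 = (rng.standard_normal((m, N)) / np.sqrt(m) for _ in range(3))
P1 = project_h1(Hmat, A1, A2, A3)   # shape (m, m, m)

H2 in (5.30) factors in an analogous way (with the coupling through p instead of q), so the compressed inner product ⟨P1, P2⟩ can likewise be assembled without materializing any 3-mode tensor of side N.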
In Figure 5.7, the relative error in E^(2) has been plotted for two compression levels for O16 and Sn132. These results clearly show that JL embeddings result in smaller error values as the data size increases, and this dependence is almost log-linear.

Figure 5.4: E^(2) experiment results for O16, eMax = 2.

Figure 5.5: E^(2) experiment results for O16, eMax = 4.

Figure 5.6: E^(2) experiment results for O16, eMax = 8.

Figure 5.7: Relative error in E^(2) for total compression values of 0.0009 and 0.0125. (a) O16. (b) Sn132.

5.2.4.2 Radius Correction Experiments

The experiments of this section were done on the data of Tin (Sn132) and Calcium (Ca48) for n = 216, or eMax = 14. The results can be viewed in Figure 5.8.

Figure 5.8: Radius correction results, for interaction em1.8−2.0 and eMax = 14. (a) Ca48, particle term. (b) Ca48, hole term. (c) Sn132, particle term. (d) Sn132, hole term.

5.2.4.3 E^(3) Experiments

The reason the E^(3) experiments do not work as well as the E^(2) ones is that the two tensors forming the inner product become nearly orthogonal in the E^(3) case, in such a way that after projection, their inner product becomes much smaller than their individual norms. In other words, two tensors that are not originally orthogonal are made close to orthogonal after projection onto the lower-dimensional subspace. One can think of a criterion that should be met for the inner product preservation to work, i.e.,

    |⟨Ax, Ay⟩| ≥ ε̄ ‖Ax‖ ‖Ay‖

for some small ε̄, where x and y are fibers in unfoldings of a tensor.

Figure 5.9: Mean absolute relative error in E^(3) for hole-hole and eMax = 8.

Figure 5.10: Mean absolute relative error in E^(3) for particle-particle and eMax = 8.
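The loss of alignment described in Section 5.2.4.3 can be checked directly: the quantity computed below is |⟨Ax, Ay⟩| normalized by ‖Ax‖‖Ay‖, evaluated before and after a Gaussian JL projection (a toy sketch with an artificially weakly correlated pair; in the actual experiments x and y would be fibers of the H1 and H2 tensors):

import numpy as np

def alignment(x, y, A):
    # Ratio |<Ax, Ay>| / (||Ax|| * ||Ay||); the criterion above asks that
    # this quantity not become too small after projection.
    Ax, Ay = A @ x, A @ y
    return abs(np.vdot(Ax, Ay)) / (np.linalg.norm(Ax) * np.linalg.norm(Ay))

rng = np.random.default_rng(4)
n, m = 2000, 100
x = rng.standard_normal(n)
y = 0.05 * x + rng.standard_normal(n)          # weakly correlated pair
A = rng.standard_normal((m, n)) / np.sqrt(m)
before = abs(np.vdot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y))
print(before, alignment(x, y, A))

When the original alignment is already small, the projected inner product is dominated by the embedding's additive error, which is the failure mode observed in the E^(3) experiments.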
Figure 5.11: Mean absolute relative error in E^(3) for particle-hole and eMax = 8.

CHAPTER 6

EXTENSION OF VECTOR-BASED METHODS TO TENSORS AND FUTURE WORK

In this chapter, two of the conventional methods developed for vector data are extended to tensors while keeping the multilinear structure of the data. The computational load and memory requirements of these methods can be mitigated by applying modewise JL embeddings to the input data. For instance, in Algorithm 6.1 of Section 6.1, the training tensors X^(m) can be compressed to a much smaller size by applying modewise JL prior to solving for the projection matrices Ũ^(j). The same set of JL embeddings used to compress all the training samples could also be used for test data, meaning they need to be generated only once. The result is an approximate yet easier to obtain set of projection matrices that is then used to compute the approximate MPCA output. Alternatively, randomized methods can be used to obtain an approximate low-rank representation of the data during the SVD stage of MPCA (in step 2 of Algorithm 6.1 below, the SVD could be used on Σ_{m=1}^M X̃^(m)_(j) instead of the eigen-decomposition of Σ_{m=1}^M X̃^(m)_(j) X̃^(m)⊤_(j)).

In a similar manner, the support vector machine methods discussed in Section 6.3 involve the computation of the CP decomposition of a tensor. Obtaining the CPD requires solving a least squares problem for each mode of the input data, which in turn can be made faster and more memory-efficient by solving the least squares problem using a sketched (compressed) version of the known variables in, e.g., (3.11). This leads to an approximate but computationally more efficient implementation of CPD to be used in the main support vector machine approach.

6.1 Multilinear Principal Component Analysis

Multilinear Principal Component Analysis, abbreviated MPCA, is a dimensionality reduction scheme for tensor objects. As an extension of regular PCA, it is similar in structure to the Tucker decomposition, and its goal is to capture as much of the variation as possible across the modes of a tensor object.

Definition 6.1.1 Let {X^(1), . . . , X^(M)} be a set of tensor objects in R^{n_1×···×n_d}. The total scatter of these tensors is defined as

    Ψ_X = Σ_{m=1}^M ‖X^(m) − X̄‖²,

where X̄ is the mean tensor calculated by

    X̄ = (1/M) Σ_{m=1}^M X^(m).

6.1.1 Problem Statement

Consider {X^(m)}_{m=1}^M for training. The objective is to define a multilinear transformation {Ũ^(j) ∈ R^{n_j×P_j}}_{j=1}^d that maps the tensor space R^{n_1×···×n_d} into a tensor subspace R^{P_1×···×P_d} with P_j ≤ n_j for j ∈ [d], such that the projections

    Y^(m) = X^(m) ×_1 Ũ^(1)⊤ ×_2 ··· ×_d Ũ^(d)⊤ ∈ R^{P_1×···×P_d},   m ∈ [M],

capture most of the variation in {X^(m)}_{m=1}^M as measured by the total scatter Ψ_Y, i.e.,

    {Ũ^(j)}_{j=1}^d = arg max_{Ũ^(1),...,Ũ^(d)} Ψ_Y.        (6.1)

A pseudo-code outlining the steps of MPCA is shown in Algorithm 6.1 [14], with K denoting the maximal number of allowed iterations.

Theorem 6.1.1 Let {Ũ^(j)}_{j=1}^d be the solution to (6.1). Then, given Ũ^(1), . . .
, Uฬƒ ( ๐‘—โˆ’1) , Uฬƒ ( ๐‘—+1) , . . . , Uฬƒ (๐‘‘) , the matrix Uฬƒ ( ๐‘—) consists of ๐‘ƒ ๐‘— eigenvectors corresponding to the ๐‘ƒ ๐‘— largest eigenvalues of the matrix ๐‘€  โˆ‘๏ธ   > ๐šฝ ( ๐‘—) = X (๐‘š) ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) Uฬƒ Uฬƒ > ๐šฝ ( ๐‘—) ๐šฝ ( ๐‘—) X (๐‘š) ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) , (6.2) ๐‘š=1 where  > Uฬƒ๐šฝ ( ๐‘—) = Uฬƒ (๐‘‘) โŠ— ยท ยท ยท โŠ— Uฬƒ ( ๐‘—+1) โŠ— Uฬƒ ( ๐‘—โˆ’1) โŠ— ยท ยท ยท โŠ— Uฬƒ (1) for ๐‘— โˆˆ [๐‘‘]. 80 Algorithm 6.1: MPCA [14] Require: Training samples X (๐‘š) โˆˆ R๐‘›1 ร—...ร—๐‘›๐‘‘ for ๐‘š โˆˆ [๐‘€]. Ensure: Projection matrices Uฬƒ ( ๐‘—) โˆˆ R๐‘ƒ ๐‘— ร—๐‘› ๐‘— for ๐‘› โˆˆ [๐‘‘]. (1) Center input samples: Xฬƒ๐‘š = X๐‘š โˆ’ Xฬ„. ๐‘€ (2) Initialization: Calculate the eigen-decomposition of ๐šฝ ( ๐‘—)โˆ— = Xฬƒ (๐‘š) Xฬƒ (๐‘š)> . Set Uฬƒ ( ๐‘—) to ร ( ๐‘—) ( ๐‘—) ๐‘š=1 consist of the ๐‘ƒ ๐‘— leading eigenvectors for ๐‘— โˆˆ [๐‘‘]. (3) Local optimization: Calculate Yฬƒ (๐‘š) = Xฬƒ (๐‘š) ร—1 Uฬƒ (1)> ร— ยท ยท ยท ร—๐‘‘ Uฬƒ (๐‘‡)> for ๐‘š โˆˆ [๐‘€]. ๐‘€ k Yฬƒ (๐‘š) k 2๐น . ร Calculate ฮจY0 = ๐‘š=1 for ๐‘˜ = 1, . . . , ๐พ do for ๐‘— = 1, . . . , ๐‘‘ do Set the matrix Uฬƒ ( ๐‘—) to consist of the ๐‘ƒ ๐‘— leading eigenvectors of ๐šฝ ( ๐‘—) defined in (6.2). end for Calculate Yฬƒ๐‘š for ๐‘š โˆˆ [๐‘€], and ฮจY๐‘˜ . If ฮจY๐‘˜ โˆ’ ฮจY๐‘˜โˆ’1 < ๐œ‚, break, and go to step 4. end for (4) Projection: Obtain the feature tensors as Y (๐‘š) = X (๐‘š) ร—1 Uฬƒ (1)> ร— ยท ยท ยท ร—๐‘‘ Uฬƒ (๐‘‡)> for ๐‘š โˆˆ [๐‘€]. Proof The Euclidean norm of a tensor is equal to the Frobenius norm of any of its unfoldings. Therefore, the total scatter of the projected samples can be written as ๐‘€ โˆ‘๏ธ โˆ‘๏ธ๐‘€ ฮจY = kY (๐‘š) โˆ’ Yฬ„ k = 2 kY (๐‘š)( ๐‘—) โˆ’ Yฬ„ ( ๐‘—) k 2๐น ๐‘š=1 ๐‘š=1 ๐‘€ โˆ‘๏ธ   = k Uฬƒ ( ๐‘—) X (๐‘š)( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) Uฬƒ๐šฝ ( ๐‘—) k 2๐น ๐‘š=1 ๐‘€ โˆ‘๏ธ     >  = trace Uฬƒ ( ๐‘—) X (๐‘š) ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) Uฬƒ๐šฝ ( ๐‘—) Uฬƒ>๐šฝ ( ๐‘—) X (๐‘š) ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) Uฬƒ ( ๐‘—)> ๐‘š=1 ๐‘€  ! โˆ‘๏ธ   > = trace Uฬƒ ( ๐‘—) X (๐‘š) ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) Uฬƒ๐šฝ ( ๐‘—) Uฬƒ> ๐šฝ ( ๐‘—) X (๐‘š) ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) Uฬƒ ( ๐‘—)> ๐‘š=1   = trace Uฬƒ ( ๐‘—) ๐šฝ ( ๐‘—) Uฬƒ ( ๐‘—)> , which turns into an eigenvalue problem when ฮจY is to be maximized. 81 6.1.2 Full Projection ( ๐‘—) ( ๐‘—)> If ๐‘ƒ ๐‘— = ๐‘› ๐‘— for ๐‘— โˆˆ [๐‘‘], it is easy to show that Uฬƒ๐šฝ ( ๐‘—) Uฬƒ๐šฝ ( ๐‘—) = I, then in the optimal case, ๐šฝ ( ๐‘—)โˆ— = ๐‘€   > X ((๐‘š) (๐‘š) (๐‘š) (๐‘š) . In this case, U ( ๐‘—)โˆ— is the optimal solution for Uฬƒ ( ๐‘—) , and consists ร ๐‘—) โˆ’ Xฬ„ ( ๐‘—) X ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) ๐‘š=1 of the eigenvectors of ๐šฝ ( ๐‘—)โˆ— . The total scatter tensor Yvar โˆ— of the full projection is defined as ๐‘€  โˆ‘๏ธ 2 โˆ— (๐‘š)โˆ— โˆ— Yvar = Y โˆ’ Yฬ„ , (6.3) ๐‘š=1 where the exponentiation is done component-wise, Y (๐‘š)โˆ— is the full projection of the ๐‘š th sample X (๐‘š) and Yฬ„ โˆ— is the mean of the fully projected samples. The following two observations can be made. ( ๐‘—)โˆ— โˆ— , where (a) The ๐‘– th๐‘— mode- ๐‘— eigenvalue ๐œ†๐‘– ๐‘— is the sum of all entries of the ๐‘– th๐‘— mode- ๐‘— slice of Yvar ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ] for ๐‘— โˆˆ [๐‘‘]. (b) Every sample tensor X (๐‘š) can be represented as an expansion in the subspace spanned by rank-1 tensors, called eigentensors. 
This is shown by ๐‘ƒ1 โˆ‘๏ธ โˆ‘๏ธ ๐‘ƒ2 โˆ‘๏ธ๐‘ƒ๐‘‘ X (๐‘š) โ‰ˆ ยทยทยท Y๐‘–1(๐‘š)โˆ— (1) ,๐‘–2 ,...,๐‘– ๐‘‘ uฬƒ๐‘–1 uฬƒ๐‘–(2) 2 ยทยทยท uฬƒ๐‘–(๐‘‘) ๐‘‘ , (6.4) ๐‘–1 =1 ๐‘–2 =1 ๐‘– ๐‘‘ =1 ( ๐‘—) ( ๐‘—) where uฬƒ๐‘– ๐‘— is the ๐‘– th๐‘— column of Uฬƒ๐‘– ๐‘— for ๐‘– ๐‘— โˆˆ [๐‘ƒ ๐‘— ] and ๐‘— โˆˆ [๐‘‘]. 6.1.3 Initialization by Full Projection Truncation (FPT) To initialize the MPCA algorithm, assume the first ๐‘ƒ ๐‘— < ๐‘› ๐‘— leading eigenvectors of ๐šฝ (๐‘›)โˆ— form Uฬƒ ( ๐‘—) . It is shown in [14] that if a nonzero eigenvalue is truncated in one mode, the eigenvalues in all other modes tend to decrease in magnitude, and therefore, the optimality of (6.1) is affected negatively. Thus, the eigen-decomposition needs to be updated in all other modes. If the total scatter of the projected samples in FPT is denoted by ฮจY0 , then the loss of variations due to FPT is 82 bounded by ๐‘›๐‘— โˆ‘๏ธ ๐‘‘ โˆ‘๏ธ ๐‘›๐‘— โˆ‘๏ธ ( ๐‘—)โˆ— ( ๐‘—)โˆ— max ๐œ†๐‘– ๐‘— โ‰ค ฮจX โˆ’ ฮจY0 โ‰ค ๐œ†๐‘– ๐‘— . (6.5) ๐‘— ๐‘– ๐‘— =๐‘ƒ ๐‘— +1 ๐‘—=1 ๐‘– ๐‘— =๐‘ƒ ๐‘— +1 6.1.4 Determination of subspace Dimensions ๐‘ƒ ๐‘— A simple yet commonly used method to choose ๐‘ƒ ๐‘— is to pick the minimum value ๐‘ƒ ๐‘— such that ๐‘ƒ๐‘— ร ( ๐‘—)โˆ— ๐œ†๐‘– ๐‘— ๐‘– ๐‘— =1 ๐‘›๐‘— โ‰ฅ ๐œ, (6.6) ร ( ๐‘—)โˆ— ๐œ†๐‘– ๐‘— ๐‘– ๐‘— =1 where ๐œ is a predetermined threshold set by the user. 6.1.5 Feature Extraction and Classification The set of projection matrices {Uฬƒ ( ๐‘—) } ๐‘‘๐‘—=1 obtained using the training samples can be employed to project any new sample onto the same subspace. The projected samples (training or test) can be either directly used in classification, or they can undergo further processing. In LDA1, the elements of the projected samples Y (๐‘š) are vectorized to yield y (๐‘š) and are ordered according to the Fisher score to maximize the between-class to in-class discriminability. Next, a predetermined number of features with the highest Fisher scores are selected for classification. Or, to maximize between-class to in-class discriminability, y (๐‘š) can be projected onto the LDA space by z (๐‘š) = V> lda y (๐‘š) , where for ๐ถ classes, Vlda consists of ๐‘› ๐‘ง โ‰ค ๐ถ โˆ’ 1 of the leading generalized eigenvectors of ร ๐‘€  (๐‘š)  > (๐‘š) (๐‘š) (๐‘š) and S๐ต = ๐ถ๐‘=1 ๐‘๐‘ ( yฬ„๐‘ โˆ’ yฬ„) ( yฬ„๐‘ โˆ’ yฬ„) > . Here, ๐‘๐‘ is the ร S๐‘Š = ๐‘š=1 y โˆ’ yฬ„๐‘ y โˆ’ yฬ„๐‘ number of samples in class ๐‘ and yฬ„๐‘(๐‘š) denotes the in-class mean for the ๐‘š th training sample, i.e., 1 Linear Discriminant Analysis 83 yฬ„๐‘(๐‘š) โˆˆ {yฬ„1 , . . . , yฬ„๐ถ } where yฬ„๐‘ is the mean of class ๐‘. In fact, the LDA projection matrix is obtained by solving |V> S๐ต V|   Vlda = arg max = v 1 . . . v ๐‘› . V |V> S๐‘Š V| ๐‘ง Now, z (๐‘š) is used for classification as the projected training sample, and Vlda is applied to any vectorized projected test data y to further project it onto the LDA space according to z = V> lda y. 6.2 Comparison between PCA, MPCA and MPS Here, the methods PCA, MPCA and MPS have been compared in terms of computational complex- ity measured by training time, and classification Success Rate (CSR). The data were first projected onto the subspace, and then the features were either directly used in nearest neighbor (1NN) clas- sification, or they were further sorted in descending order according to the Fisher score and 100 features were selected to be used in 1NN. 
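The LDA stage of Section 6.1.5 reduces to a generalized symmetric eigenvalue problem; a minimal sketch is given below (an added illustration assuming the vectorized projected features y^(m) are stacked as the rows of Y and that S_W is nonsingular — in practice a small multiple of the identity can be added to S_W):

import numpy as np
from scipy.linalg import eigh

def lda_projection(Y, labels, n_z):
    # Columns of the returned matrix are the leading generalized eigenvectors
    # of S_B with respect to S_W, i.e. the LDA projection V_lda.
    classes = np.unique(labels)
    mean_all = Y.mean(axis=0)
    dim = Y.shape[1]
    S_W = np.zeros((dim, dim))
    S_B = np.zeros((dim, dim))
    for c in classes:
        Yc = Y[labels == c]
        mc = Yc.mean(axis=0)
        S_W += (Yc - mc).T @ (Yc - mc)                     # within-class scatter
        S_B += len(Yc) * np.outer(mc - mean_all, mc - mean_all)  # between-class scatter
    w, V = eigh(S_B, S_W)              # solves S_B v = lambda S_W v
    order = np.argsort(w)[::-1][:n_z]  # keep the n_z <= C-1 leading directions
    return V[:, order]

The projected training features are then z^(m) = V_lda^T y^(m), and the same V_lda is applied to any vectorized projected test sample before the final classification step.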
The following data sets were used in the experiments. COIL-100 data set: 7200 images collected from 100 objects taken at 5โ—ฆ pose intervals, creating a 3-mode tensor X โˆˆ R128ร—128ร—7200 [15]. Sample images from this library can be viewed in Figure 6.1. 84 Figure 6.1: Gray-scale sample images of five objects in the COIL-100 database. MRI data: A 4-mode tensor X โˆˆ R240ร—240ร—155ร—51 comprised of 51 MRI samples [1]. A lateral slice of a sample MRI image is shown in Figure 6.2. 50 100 150 200 50 100 150 200 250 Figure 6.2: A lateral slice of a sample MRI image. 85 Training time vs model size. Training time vs model size. 7 40 MPCA (1NN) MPCA (1NN) 6 MPCA (LDA) MPCA (LDA) MPS (1NN) 35 MPS (1NN) 5 MPS (LDA) MPS (LDA) PCA (1NN) PCA (1NN) 30 PCA (LDA) time (s) 4 PCA (LDA) time (s) 3 25 2 20 1 0 15 104 105 106 107 102 104 106 108 1010 Model size Model size (a) COIL-100 (b) MRI Figure 6.3: Training time. (a) COIL-100. (b) MRI. Classification rate vs model size. Classification rate vs model size. 97 70 MPCA (1NN) 96 65 MPCA (LDA) MPS (1NN) 95 MPS (LDA) 60 PCA (1NN) 94 CSR (%) PCA (LDA) CSR (%) 93 55 MPCA (1NN) 92 MPCA (LDA) 50 MPS (1NN) 91 MPS (LDA) 90 PCA (1NN) 45 PCA (LDA) 89 40 104 105 106 107 102 104 106 108 1010 Model size Model size (a) (b) Figure 6.4: Classification Success Rate. (a) COIL-100. (b) MRI. 6.3 Extension of Support Vector Machine to Tensors In this section, 3 tensor-based methods used to extend the regular support vector machine to tensor data are summarized. Consider ๐‘€ ๐‘‘-mode training sample tensors {X (๐‘š) } ๐‘š=1 ๐‘€ โˆˆ R๐‘›1 ร—...ร—๐‘›๐‘‘ with corresponding labels {๐‘ฆ ๐‘š } ๐‘š=1 ๐‘€ โˆˆ {โˆ’1, +1}. 86 6.3.1 Support Tensor Machine The soft-margin Support Tensor Machine for binary classification of data is composed of ๐‘‘ quadratic programming problems with inequality constraints, where the ๐‘— th problem is described as [24]:   1 ร–๐‘‘ ๐‘€ โˆ‘๏ธ ( ๐‘—) min ๐ฝ w ( ๐‘—) , ๐‘ ( ๐‘—) , ๐ƒ ( ๐‘—) = kw ( ๐‘—) k 22 kw (๐‘–) k 22 + ๐ถ ๐œ‰๐‘š ( w ,๐‘ ,๐ƒ ๐‘—) ( ๐‘—) ( ๐‘—) 2 ๐‘–=1 ๐‘š=1 ๐‘–โ‰  ๐‘— ยญ ( ๐‘—)> ยญ (๐‘š) ? ( ๐‘—)> ยฎ (6.7) ยฉ ยฉ ๐‘‘ ยช ยช ( ๐‘—) subject to ๐‘ฆ ๐‘š ยญw ยญX w ยฎ + ๐‘ ( ๐‘—) ยฎ โ‰ฅ 1 โˆ’ ๐œ‰๐‘š ยฎ ยญ ยญ ยฎ ยฎ ๐‘–=1 ยซ ยซ ๐‘–โ‰  ๐‘— ยฌ ยฌ ( ๐‘—) ๐œ‰๐‘š โ‰ฅ 0, ๐‘š โˆˆ [๐‘€], for ๐‘— โˆˆ [๐‘‘], where w ( ๐‘—) โˆˆ R๐‘› ๐‘— is the normal to the ๐‘— th hyperplane corresponding to the ๐‘— th ( ๐‘—) mode, ๐‘ ( ๐‘—) is the bias, ๐œ‰๐‘š is error of the ๐‘š th training sample, and ๐ถ is the trade-off between the classification error and the amount of margin violation. These ๐‘‘ optimization problems have no closed-form solution, and need to be solved iteratively using the alternating projection algorithm. All ๐‘‘ normal vectors are randomly initialized. In each iteration, for each mode ๐‘—, {w (๐‘˜) } ๐‘˜โ‰  ๐‘— are fixed and (6.7) is solved for w ( ๐‘—) . Iterations continue until convergence is reached. Convergence criterion considering iterations ๐‘ก and ๐‘ก โˆ’ 1 is set as ( ๐‘—)> ( ๐‘—) ๐‘‘ ! โˆ‘๏ธ w๐‘ก w๐‘กโˆ’1 ( ๐‘—) โˆ’ 1 โ‰ค ๐œ€, ๐‘—=1 kw๐‘ก k 2 for some ๐œ€. Once the STM model has been solved, the binary classifier will determine the class of a test sample X based on the decision rule ? ๐‘‘ ! ๐‘ฆ(X) = sign X w ( ๐‘—)> + ๐‘ . (6.8) ๐‘–=1 ( ๐‘—) Further, assume that ๐œ‰๐‘š = max {๐œ‰๐‘š }, and that the ๐‘‘ normal vectors w ( ๐‘—) form a rank-1 tensor ๐‘— โˆˆ[๐‘‘] W= w (1) w (2) ยทยทยท (๐‘‘) w [6]. 
In this case, the following can be observed. โˆ‘๏ธ โˆ‘๏ธ ร–๐‘‘ kW k 2 = hW, Wi = W๐‘–21 ,...,๐‘– ๐‘‘ = w๐‘–(1)2 1 . . . w๐‘–(๐‘‘)2 ๐‘‘ = kw ( ๐‘—) k 22 , (6.9) ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘—=1 87 and ยฉ ?๐‘‘ ยช ? ๐‘‘ โˆ‘๏ธ D E ( ๐‘—)> ยญ (๐‘š) ( ๐‘—)> ยฎ (๐‘š) ( ๐‘—)> (๐‘š) (1) (๐‘‘) (๐‘š) w ยญX w ยฎ=X w = X๐‘–1 ,...,๐‘– ๐‘‘ w๐‘–1 . . . w๐‘– ๐‘‘ = X , W , (6.10) ยญ ยฎ ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘–=1 ๐‘–=1 ยซ ๐‘–โ‰  ๐‘— ยฌ where (2.11) and (2.12) have been used in the penultimate equality. This result can be used to write the problem as ๐‘€ 1 โˆ‘๏ธ min ๐ฝ (W, ๐‘, ๐ƒ) = kW k 2 + ๐ถ ๐œ‰๐‘š W,๐‘,๐ƒ 2 ๐‘š=1 D E  (6.11) subject to ๐‘ฆ ๐‘š X (๐‘š) , W + ๐‘ โ‰ฅ 1 โˆ’ ๐œ‰๐‘š ๐œ‰๐‘š โ‰ฅ 0, By forming the Lagrangian function with Lagrange multipliers ๐›ผ and ๐œ†, and taking partial ๐›ผ๐‘š ๐‘ฆ ๐‘š X (๐‘š) , ๐‘š=1 ร๐‘€ ร๐‘€ derivatives with respect to W, ๐‘ and ๐œ‰๐‘š , we obtain W = ๐‘š=1 ๐›ผ๐‘š ๐‘ฆ ๐‘š = 0 and ๐›ผ๐‘š + ๐œ† ๐‘š = ๐ถ. Then, the dual problem can be written in the following form. 1 max = 1> ๐œถ โˆ’ ๐œถ> S๐œถ ๐œถ 2 ๐‘€ โˆ‘๏ธ subject to ๐›ผ๐‘š ๐‘ฆ ๐‘š = 0 (6.12) ๐‘š=1 0 โ‰ค ๐›ผ๐‘š โ‰ค ๐ถ, ๐‘š โˆˆ [๐‘€], where S ๐‘๐‘ž = ๐‘ฆ ๐‘ ๐‘ฆ ๐‘ž X ( ๐‘) , X (๐‘ž) . Therefore, after solving the model, the binary classifier will be ๐‘ฆ(X) = sign (hX, Wi + ๐‘) , (6.13) for a test tensor X. This procedure is equivalent to solving the regular soft-margin SVM if the tensors are first vectorized. However, for large tensors vectorized, SVM suffers significantly from the curse of dimensionality and also small training sample count compared to the dimensionality of each sample making it susceptible to new data. 88 6.3.2 Support Higher-order Tensor Machine Support Higher-order Tensor Machine (abbreviated to SHTM) assumes that the data samples admit low-rank CP decompositions, which allows for another form of the objective function in the dual problem. As can be expected, in obtaining the CPD of data, a sketched least squares problem can be solved in each mode of the training and test samples to obtain approximate but more efficient versions of the corresponding CP decompositions. Let the CPD of X ( ๐‘) and X (๐‘ž) be (1) (๐‘‘) ( ๐‘—) ( ๐‘—) X ( ๐‘) โ‰ˆ ๐‘Ÿ๐‘˜=1 x (1) ยท ยท ยท x (๐‘‘) (๐‘ž) โ‰ˆ ร๐‘Ÿ ร ๐‘๐‘˜ ๐‘๐‘˜ and X ๐‘˜=1 x๐‘ž๐‘˜ ยท ยท ยท x๐‘ž๐‘˜ , where x๐‘ž๐‘˜ and x ๐‘๐‘˜ represent the ๐‘˜ th column in the ๐‘— th factor matrix of X ( ๐‘) and X (๐‘ž) , respectively. Therefore, the elements of S in (6.12) will be D E ๐‘Ÿ D โˆ‘๏ธ E S ๐‘๐‘ž = ๐‘ฆ ๐‘ ๐‘ฆ ๐‘ž X ( ๐‘) ,X (๐‘ž) = ๐‘ฆ ๐‘ ๐‘ฆ๐‘ž x (1) ๐‘๐‘˜ ยทยทยท x (๐‘‘) ๐‘๐‘˜ , x๐‘žโ„Ž (1) ยทยทยท x๐‘žโ„Ž (๐‘‘) ๐‘˜,โ„Ž=1 ๐‘Ÿ โˆ‘๏ธ โˆ‘๏ธ     = ๐‘ฆ ๐‘ ๐‘ฆ๐‘ž x (1) ๐‘๐‘˜ ยทยทยท x (๐‘‘) ๐‘๐‘˜ (1) x๐‘žโ„Ž ยทยทยท (๐‘‘) x๐‘žโ„Ž ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘˜,โ„Ž=1 ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘Ÿ ร– โˆ‘๏ธ ๐‘‘ D E ( ๐‘—) ( ๐‘—) = ๐‘ฆ ๐‘ ๐‘ฆ๐‘ž x ๐‘๐‘˜ , x๐‘žโ„Ž . ๐‘˜,โ„Ž=1 ๐‘—=1 Now, (6.12) can be solved using Sequential Minimal Optimization [11] in one iteration. 
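The key computational point above is that S never requires forming the full tensors: it can be assembled directly from the CP factor matrices. A minimal sketch follows (added for illustration; `factors[p]` is assumed to hold the d factor matrices, one column per rank-1 component, of the p-th training sample):

import numpy as np

def shtm_gram(factors, y):
    # Build S of (6.12) from CP factors only:
    #   S_pq = y_p * y_q * sum_{k,h} prod_j < x_pk^(j), x_qh^(j) >.
    M = len(factors)
    S = np.zeros((M, M))
    for p in range(M):
        for q in range(M):
            # G[k, h] = prod_j  < x_pk^(j), x_qh^(j) >
            G = np.ones((factors[p][0].shape[1], factors[q][0].shape[1]))
            for Fp, Fq in zip(factors[p], factors[q]):
                G *= Fp.T @ Fq
            S[p, q] = y[p] * y[q] * G.sum()
    return S

The resulting S can then be handed to any standard quadratic programming or SMO solver for (6.12).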
The binary classifier for a test sample X โ‰ˆ ๐‘Ÿโ„Ž=1 xโ„Ž(1) ยท ยท ยท xโ„Ž(๐‘‘) will decide the class based on ร ๐‘€ ๐‘Ÿ ๐‘Ÿ ๐‘‘ D E ยฉโˆ‘๏ธ โˆ‘๏ธ โˆ‘๏ธ ร– ( ๐‘—) ( ๐‘—) ๐‘ฆ(X) = sign (hX, Wi + ๐‘) = sign ยญ x๐‘š๐‘˜ , xโ„Ž + ๐‘ ยฎ , (6.14) ยช ๐›ผ๐‘š ๐‘ฆ ๐‘š ยซ๐‘š=1 ๐‘˜=1 โ„Ž=1 ๐‘—=1 ยฌ which clearly shows that the curse of dimensionality will not be an issue as only the much smaller factors of the CPD of data are incorporated in the classification operation. 6.3.3 Kernelized Support Tensor Machine Aside from the low-rank structure that is assumed for the weight tensor W and the data samples, STM and SHTM are only taking into account the multilinear structure in tensors, and are missing any nonlinearities that might exist in the data. For this reason, STM and SHTM will yield 89 suboptimal results. A kernelized approach can lead to improved performance by capturing the existing nonlinearities in the data. In [7], the primal problem for Kernelized Support Tensor Machine (KSTM) is stated as arg min ๐ฟ (๐‘ฆ, hW, Xi + ๐‘) + ๐‘ƒ (W) + ฮฉ (X) , (6.15) W,๐‘ where ๐ฟ is a loss function, ๐‘ƒ (W) is a penalty function, and ฮฉ (X) is a specific constraint imposed on the training samples. Before going into details about the terms in (6.15), the tensor Reproducing Kernel Hilbert Space needs to be defined. Note: For a domain D, a reproducing kernel ๐œ… : D ร— D โ†’ R is a kernel function associated with a feature map ๐œ™ : D โ†’ RD , where RD denotes the space of functions from D to R, with the property that any function ๐‘“ โˆˆ RD can be reproduced pointwise by calculating the inner product of the kernel and ๐‘“ , i.e., ๐‘“ (๐‘ฅ) = h ๐‘“ (ยท), ๐œ…(ยท, ๐‘ฅ)i for all ๐‘ฅ โˆˆ D, where ๐œ…(ยท, ๐‘ฅ) = ๐œ™(๐‘ฅ). A direct consequence of this property is that any inner product defined on RD can be calculated by the kernel as h๐œ™(๐‘ฅ 1 ), ๐œ™(๐‘ฅ 2 )i = h๐œ…(ยท, ๐‘ฅ1 ), ๐œ…(ยท, ๐‘ฅ2 )i = ๐œ…(๐‘ฅ 2 , ๐‘ฅ1 ) = ๐œ…(๐‘ฅ 1 , ๐‘ฅ2 ) for all ๐‘ฅ 1 , ๐‘ฅ2 โˆˆ D. Definition 6.3.1 Tensor Product Reproducing Kernel Hilbert Space   For ๐‘— โˆˆ [๐‘‘], let H ๐‘— , h., .i ๐‘— , ๐œ… ( ๐‘—) be a reproducing kernel Hilbert space (RKHS) of functions on a set S ๐‘— with a reproducing kernel ๐œ… ( ๐‘—) : S ๐‘— ร— S ๐‘— โ†’ R and the inner product operator hยท, ยทi ๐‘— . The space H = H1 ยทยทยท H๐‘‘ is called a tensor product RKHS of functions on the domain   S := S1 ร— ยท ยท ยท ร— S๐‘‘ . In particular, assume that ๐‘ฅ = ๐‘ฅ (1) , . . . , ๐‘ฅ (๐‘‘) โˆˆ S is a tuple. Let the tensor product space formed by the linear combinations of the functions ๐‘“ ( ๐‘—) for ๐‘— โˆˆ [๐‘‘] be defined as ร– ๐‘‘   ๐‘“ (1) ยทยทยท ๐‘“ (๐‘‘) : ๐‘ฅ โ†ฆโ†’ ๐‘“ ( ๐‘—) ๐‘ฅ ( ๐‘—) , ๐‘“ ( ๐‘—) โˆˆ H ๐‘— ๐‘—=1 Then for a multi-index ๐’Œ = (๐‘˜ 1 , . . . , ๐‘˜ ๐‘‘ ), it holds that โˆ‘๏ธ   โˆ‘๏ธ ร–๐‘‘   โˆ‘๏ธ ร–๐‘‘ D E ( ๐‘—) ( ๐‘—) ( ๐‘—) G๐’Œ ๐‘“ ๐‘˜(1) 1 ยทยทยท ๐‘“ ๐‘˜(๐‘‘) ๐‘‘ (๐‘ฅ) = G ๐’Œ ๐‘“ ๐‘˜๐‘— ๐‘ฅ ( ๐‘—) = G ๐’Œ ๐‘“ ๐‘˜๐‘— , ๐‘˜ ๐‘ฅ , (6.16) ๐‘— ๐’Œ ๐’Œ ๐‘—=1 ๐’Œ ๐‘—=1 90     ( ๐‘—) where G๐’Œ is the combination coefficient, and ๐‘˜ ๐‘ฅ is the function ๐‘˜ ( ๐‘—) ยท, ๐‘ฅ ( ๐‘—) : ๐‘ก โ†ฆโ†’ ๐‘˜ ( ๐‘—) ๐‘ก, ๐‘ฅ ( ๐‘—) in   D  E ( ๐‘—) ( ๐‘—) ( ๐‘—) the sense that ๐‘“ ๐‘˜ ๐‘— ๐‘ฅ ( ๐‘—) = ๐‘“ ๐‘˜ ๐‘— (ยท) , ๐‘˜ ๐‘ฅ ยท, ๐‘ฅ ( ๐‘—) . 
๐‘—   ( ๐‘—) ( ๐‘—) ๐‘ฅ ( ๐‘—)  By looking closely at (6.16), it is observed that if in a special case, we let ๐‘“๐‘˜ ๐‘— = u๐‘˜ ๐‘— ๐‘– ๐‘— for ๐‘˜ ๐‘— โˆˆ [๐‘Ÿ ๐‘— ] and ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ] for ๐‘— โˆˆ [๐‘‘], then      ๐‘“ ๐‘˜(1) 1 ยทยทยท ๐‘“ ๐‘˜(๐‘‘) ๐‘‘ ๐‘ฅ (1) , . . . , ๐‘ฅ (๐‘‘) = u ๐‘˜(1) 1 ยทยทยท u ๐‘˜(๐‘‘)๐‘‘ (๐‘–1 , . . . , ๐‘– ๐‘‘ ) = u ๐‘˜(1) 1 (๐‘–1 ) . . . u ๐‘˜(๐‘‘) ๐‘‘ (๐‘– ๐‘‘ ) ( ๐‘—)  and (6.16) presents a general form of the Tucker decomposition, in a kernelized form, where u ๐‘˜ ๐‘— ๐‘– ๐‘— ( ๐‘—) can be represented as the inner product of a kernel and u ๐‘˜ ๐‘— , and G plays the role of the core tensor. Therefore (6.16) represents a kernelized tensor factorization. Treating X โˆˆ R๐‘›1 ร—...ร—๐‘›๐‘‘ as an element of the tensor product RKHS H , assume that it has a low-rank structure in H , such that โˆ‘๏ธ   ? ๐‘‘   ( ๐‘—) arg min kX โˆ’ G๐‘˜ 1 ,...,๐‘˜ ๐‘‘ ๐‘‘ ๐‘—=1 u ๐‘˜ ๐‘— k 2 = arg min kX โˆ’ G K ( ๐‘—) U ( ๐‘—) k 2 , G,{U ( ๐‘—) } ๐‘‘๐‘—=1 ๐‘˜ 1 ,...,๐‘˜ ๐‘‘ G,{U ( ๐‘—) } ๐‘‘๐‘—=1 ๐‘—=1 (6.17) ( ๐‘—) where K ( ๐‘—) โˆˆ R๐‘› ๐‘— ร—๐‘› ๐‘— is a symmetric kernel matrix whose ๐‘– th๐‘— row/column is ๐‘˜ ๐‘ฅ  ยท, ๐‘– ๐‘— for ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ], ( ๐‘—) and U ( ๐‘—) โˆˆ R๐‘› ๐‘— ร—๐‘Ÿ ๐‘— has u ๐‘˜ ๐‘— as its ๐‘˜ th๐‘— column for ๐‘˜ ๐‘— โˆˆ [๐‘Ÿ ๐‘— ]. To get from the left-hand side to the right-hand side, one can use the reproducing property of the kernel, i.e.,   ๐‘‘ ๐‘‘ D ๐‘‘ ( ๐‘—) ร– ( ๐‘—)  ร– ( ๐‘—) ( ๐‘—) E ๐‘—=1 u ๐‘˜ ๐‘— = u๐‘˜ ๐‘— ๐‘–๐‘— = u๐‘˜ ๐‘— , ๐‘˜ ๐‘ฅ ยท, ๐‘– ๐‘— , ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘—=1 ๐‘—=1 In a kernelized CP factorization (KCP), the objective function of (6.17) will be the same except that G will be a diagonal tensor, and if the elements of its superdiagonal are absorbed into the factor matrices, then the primal model of KSTM for ๐‘€ training samples {X (๐‘š) , ๐‘ฆ ๐‘š } ๐‘š=1 ๐‘€ is stated as โˆ‘๏ธ๐‘€ ? ๐‘‘   ( ๐‘—) min ๐›พ kX (๐‘š) โˆ’ I K ( ๐‘—) U๐‘š k2 ( ๐‘—) {U๐‘š } ๐‘‘๐‘—=1 ,{K ( ๐‘—) } ๐‘‘๐‘—=1 ,W,๐‘ ๐‘š=1 ๐‘—=1 + hW, Wi (6.18) โˆ‘๏ธ๐‘€ h D E i +๐ถ 1 โˆ’ ๐‘ฆ๐‘š W, Xฬ‚ (๐‘š) + ๐‘ , + ๐‘š=1 91 where I denotes the identity tensor and [1 โˆ’ ๐‘ฅ] + = max (0, 1 โˆ’ ๐‘ฅ) ๐‘ for ๐‘ = 1 or ๐‘ = 2. Also, Xฬ‚ (๐‘š) ( ๐‘—) is the CP reconstruction of X (๐‘š) using {U๐‘š } ๐‘‘๐‘—=1 . Comparing (6.18) with (6.15), the first term is ฮฉ (X) representing the total KCP reconstruction error of training samples penalized by ๐›พ, the second term is ๐‘ƒ (W), and the third term corresponds to ๐ฟ (๐‘ฆ, hW, Xi + ๐‘). All training samples are sharing the same set of kernel matrices for a specific mode ๐‘— โˆˆ [๐‘‘] which makes characterization of tensor data possible by taking into account both common and discriminative features. Solving (6.18) in the dual domain is very complicated due to the inherent coupling between the weight ( ๐‘—) tensor W and factor matrices {U๐‘š } ๐‘‘๐‘—=1 . The kernel trick is used to implicitly capture the nonlinear structures in the data. If W is replaced with a function โˆ‘๏ธ๐‘€ ๐‘“ (ยท) = ๐›ฝ๐‘š ๐œ…(ยท, Xฬ‚ (๐‘š) ), ๐‘š=1 represented as the linear combination of a kernel function ๐œ…(ยท, Xฬ‚ (๐‘š) ) for ๐‘€ reconstructed training data samples, then (6.18) can be transformed to ๐‘€ โˆ‘๏ธ ? 
๐‘‘   ( ๐‘—) min ๐›พ kX (๐‘š) โˆ’ I K ( ๐‘—) U๐‘š k2 ( ๐‘—) {U๐‘š } ๐‘‘๐‘—=1 ,{K ( ๐‘—) } ๐‘‘๐‘—=1 ,๐œท,๐‘ ๐‘š=1 ๐‘—=1 โˆ‘๏ธ๐‘€ +๐œ† ๐›ฝ๐‘– ๐›ฝ ๐‘— ๐œ…( Xฬ‚ (๐‘–) , Xฬ‚ ( ๐‘—) ) (6.19) ๐‘–, ๐‘—=1 ๐‘€ ๏ฃฎ๏ฃฏ โˆ‘๏ธ ๐‘€ โˆ‘๏ธ ๏ฃน (๐‘š) ( ๐‘—) ๏ฃบ + ๏ฃฏ1 โˆ’ ๐‘ฆ ๐‘š ยญ ๐›ฝ ๐‘— ๐œ…( Xฬ‚ , Xฬ‚ ) + ๐‘ ยฎ๏ฃบ๏ฃบ , ยฉ ยช ๏ฃฏ ๐‘š=1 ๏ฃฏ ยซ ๐‘—=1 ๏ฃบ ยฌ๏ฃป + ๏ฃฐ where ๐œ† = 1/๐ถ is the weight between the loss function and the margin, and ๐›พ controls the tradeoff between discriminative components and reconstruction error. Letting Kฬ‚๐‘–, ๐‘— = ๐œ…( Xฬ‚ (๐‘–) , Xฬ‚ ( ๐‘—) ) denote the elements of the symmetric kernel matrix Kฬ‚, the so called dual form of (6.18) is obtained as ๐‘€ โˆ‘๏ธ ? ๐‘‘   ( ๐‘—) min ๐›พ kX (๐‘š) โˆ’ I K ( ๐‘—) U๐‘š k2 ( ๐‘—) {U๐‘š } ๐‘‘๐‘—=1 ,{K ( ๐‘—) } ๐‘‘๐‘—=1 ,๐œท,๐‘ ๐‘š=1 ๐‘—=1 + ๐œ†๐œท> Kฬ‚๐œท (6.20) โˆ‘๏ธ๐‘€ h  i + 1 โˆ’ ๐‘ฆ ๐‘š kฬ‚> ๐‘š ๐œท + ๐‘ , + ๐‘š=1 92 where kฬ‚๐‘š denotes the ๐‘š th row/column of Kฬ‚. This objective function is non-convex and finding a global minimum might not be possible. Therefore, an iterative scheme has been derived in [7] by alternatively finding the local minimum of the objective function for each variable by fixing ( ๐‘—) the others. This is done by updating K ( ๐‘—) , ๐œท, U๐‘š and ๐‘ consecutively in each iteration. Let 2 ๐‘ง ๐‘š = kฬ‚>๐‘š ๐œท + ๐‘ in the following steps whenever it is used. Also, let [1 โˆ’ ๐‘ฅ] + = max (0, 1 โˆ’ ๐‘ฅ) . At each iteration, solving (6.20) includes the following steps. (a) Update K ( ๐‘—) : Since there is no supervised information involving K ( ๐‘—) , the optimization technique in CPD is used by finding the solution to the following system of equations for ๐‘— โˆˆ [๐‘‘]. ๐‘€  โˆ‘๏ธ  ๐‘€  โˆ‘๏ธ  ( ๐‘—) (โˆ’ ๐‘—) (โˆ’ ๐‘—) K ( ๐‘—) U๐‘š W๐‘š = X (๐‘š) V ( ๐‘—) ๐‘š ๐‘š=1 ๐‘š=1   (โˆ’ ๐‘—) ( ๐‘—) (โˆ’ ๐‘—) where X (๐‘š) ร‡1 ( ๐‘—) is the mode- ๐‘— unfolding of X (๐‘š) , V๐‘š = โ„“=๐‘‘ K ( ๐‘—) U๐‘š and W๐‘š = โ„“โ‰  ๐‘—  > (โˆ’ ๐‘—) (โˆ’ ๐‘—) V๐‘š V๐‘š . (b) Update ๐œท: A data point is a support vector (support tensor, in fact) if ๐‘ฆ ๐‘š ๐‘ง ๐‘š < 1 for that point, i.e., if the loss is nonzero for it. After reordering the training samples such that the first ๐‘€๐‘  samples are support tensors, ๐œท can be found by setting the first-order gradient of the objective function to zero, i.e.,    5 ๐œท = 2 ๐œ†Kฬ‚๐œท + Kฬ‚I0 Kฬ‚๐œท โˆ’ y + ๐‘1 = 0, ๏ฃฎ ๏ฃน ๏ฃฏ I๐‘€ 0 ๏ฃบ where I0 = ๏ฃฏ ๏ฃฏ ๐‘  ๏ฃบ and y is the vector of labels. ๏ฃบ ๏ฃฏ 0 0 ๏ฃบ ๏ฃฏ ๏ฃบ ๏ฃฐ ๏ฃป ( ๐‘—) ( ๐‘—) (c) Update U๐‘š : The kernel function ๐œ… is inherently coupled with U๐‘š , which underlines the importance of choosing an appropriate kernel when calculating the gradient of the objective ( ๐‘—) function with respect to U๐‘š . Before that, it should be made clear how the kernel ๐œ… is related ( ๐‘—) to U๐‘š . Mapping the tensors into a tensor Hilbert space of higher dimension, one can retain 93 the multilinear structure of data as well as capture existing nonlinearities. The mapping function is defined as โˆ‘๏ธ๐‘Ÿ โˆ‘๏ธ๐‘Ÿ   ๐‘‘ ( ๐‘—) ๐‘‘ ( ๐‘—) ๐œ™: ๐‘—=1 u ๐‘˜ โ†’ ๐‘—=1 ๐œ™ u๐‘˜ , ๐‘˜=1 ๐‘˜=1 which assumes mapping the tensor data into a tensor Hilbert space and then performing CPD. In other words, the feature map ๐œ™ acts on the individual factors of CPD while retaining the low-rank structure of the data. 
The kernel will now be the standard inner product of tensors in the higher-dimensional space, i.e., one can find a dual structure-preserving kernel function by writing
\[
\begin{aligned}
\kappa\!\left(\widehat{\mathcal{X}}^{(i)},\widehat{\mathcal{X}}^{(j)}\right)
&= \kappa\!\left(\sum_{k=1}^{r}\mathbf{u}_{ik}^{(1)}\circ\cdots\circ\mathbf{u}_{ik}^{(d)},\ \sum_{h=1}^{r}\mathbf{u}_{jh}^{(1)}\circ\cdots\circ\mathbf{u}_{jh}^{(d)}\right)
= \sum_{k,h=1}^{r}\kappa\!\left(\mathbf{u}_{ik}^{(1)}\circ\cdots\circ\mathbf{u}_{ik}^{(d)},\ \mathbf{u}_{jh}^{(1)}\circ\cdots\circ\mathbf{u}_{jh}^{(d)}\right)\\
&= \sum_{k,h=1}^{r}\left\langle\phi\!\left(\mathbf{u}_{ik}^{(1)}\circ\cdots\circ\mathbf{u}_{ik}^{(d)}\right),\ \phi\!\left(\mathbf{u}_{jh}^{(1)}\circ\cdots\circ\mathbf{u}_{jh}^{(d)}\right)\right\rangle
= \sum_{k,h=1}^{r}\left\langle\phi\!\left(\mathbf{u}_{ik}^{(1)}\right)\circ\cdots\circ\phi\!\left(\mathbf{u}_{ik}^{(d)}\right),\ \phi\!\left(\mathbf{u}_{jh}^{(1)}\right)\circ\cdots\circ\phi\!\left(\mathbf{u}_{jh}^{(d)}\right)\right\rangle\\
&= \sum_{k,h=1}^{r}\prod_{\ell=1}^{d}\left\langle\phi\!\left(\mathbf{u}_{ik}^{(\ell)}\right),\ \phi\!\left(\mathbf{u}_{jh}^{(\ell)}\right)\right\rangle
= \sum_{k,h=1}^{r}\prod_{\ell=1}^{d}\kappa\!\left(\mathbf{u}_{ik}^{(\ell)},\ \mathbf{u}_{jh}^{(\ell)}\right),
\end{aligned}\tag{6.21}
\]
implying that, to apply the kernel to two tensors, it suffices to apply it to their factors in their respective CPD representations. The second equality results from the bilinearity of the kernel $\kappa$. Examples of commonly used kernels include $\kappa(\mathbf{x},\mathbf{y}) = \mathbf{x}^{\top}\mathbf{y}$ for the inner product kernel, and $\kappa(\mathbf{x},\mathbf{y}) = \exp\left(-\sigma\|\mathbf{x}-\mathbf{y}\|_2^2\right)$ for the Gaussian kernel, where $\sigma$ controls the width of the kernel.

After choosing a kernel, and thus knowing how it is related to the factor matrices $\mathbf{U}_m^{(j)}$, the gradient of the objective function with respect to $\mathbf{U}_m^{(j)}$ can be calculated and set to zero to solve for the updated $\mathbf{U}_m^{(j)}$. An explicit form of the gradient can be found in [7].

(d) Update $b$: Setting the first-order gradient of the objective function with respect to $b$ to zero yields the solution for $b$, i.e.,
\[
\nabla_b = 2\sum_{m=1}^{M_s}\left(z_m - y_m\right) = 0,
\]
which can be solved since $\boldsymbol{\beta}$ has been estimated and $\widehat{\mathbf{K}}$ is known.

Classification. After solving (6.20) using the training data, the shared kernel matrices $\{\mathbf{K}^{(j)}\}_{j=1}^{d}$ can be utilized to compute the KCP factorization of a test sample $\mathcal{X}$. Computing the CPD of $\mathcal{X}$ yields its CP factor matrices $\{\mathbf{U}^{(j)}\}_{j=1}^{d}$. Letting $\mathbf{K}^{(j)}\mathbf{V}^{(j)} = \mathbf{U}^{(j)}$, and thus $\mathbf{V}^{(j)} = \left(\mathbf{K}^{(j)}\right)^{-1}\mathbf{U}^{(j)}$, the KCP reconstruction of $\mathcal{X}$ is calculated using $\{\mathbf{V}^{(j)}\}_{j=1}^{d}$, i.e., $\widehat{\mathcal{X}} = \mathcal{I}\times_1\mathbf{V}^{(1)}\times_2\cdots\times_d\mathbf{V}^{(d)}$. Finally, the test sample $\mathcal{X}$ can be classified according to
\[
y(\mathcal{X}) = \operatorname{sign}\left(\sum_{m=1}^{M}\beta_m\,\kappa\!\left(\widehat{\mathcal{X}},\widehat{\mathcal{X}}^{(m)}\right) + b\right)
\]
using the reconstructed training samples. In this case, (6.21) can be used to calculate the kernel as
\[
\kappa\!\left(\widehat{\mathcal{X}},\widehat{\mathcal{X}}^{(m)}\right) = \sum_{k,h=1}^{r}\prod_{\ell=1}^{d}\kappa\!\left(\mathbf{v}_k^{(\ell)},\ \mathbf{u}_{mh}^{(\ell)}\right),
\]
instead of dealing with the actual reconstructed data. For the training samples, it is important to note that for $j\in[d]$ and $m\in[M]$, $\mathbf{U}_m^{(j)} = \mathbf{K}^{(j)}\mathbf{U}_m^{(j)}$ according to the reproducing property of $k_x^{(j)}$, which explains why the CP reconstruction of the training data can be used in the kernel instead of their KCP reconstruction, as the KCP and CP reconstructions of the training samples are equivalent. However, the reproducing property of $\mathbf{K}^{(j)}$ will not necessarily hold for test data, and therefore the assumption $\mathbf{K}^{(j)}\mathbf{V}^{(j)} = \mathbf{U}^{(j)}$ is made in this case, where the $\mathbf{U}^{(j)}$ are the corresponding CP factor matrices of the test tensor.
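Because the classification rule depends on the data only through factor-level kernel evaluations, a direct implementation is short. Below is a minimal Python/NumPy sketch of (6.21) and the resulting sign decision, assuming a Gaussian kernel on the factors; the names (cp_kernel, kstm_classify, V_test, U_train_list, beta) are hypothetical, and the sketch is an illustration, not the implementation of [7].

    import numpy as np

    def gaussian_kernel(x, y, sigma=1.0):
        return np.exp(-sigma * np.sum((x - y) ** 2))

    def cp_kernel(factors_a, factors_b, kern=gaussian_kernel):
        # Dual structure-preserving kernel (6.21): sum over pairs of CP components
        # of the product over modes of the factor-level kernel.
        r_a, r_b = factors_a[0].shape[1], factors_b[0].shape[1]
        total = 0.0
        for k in range(r_a):
            for h in range(r_b):
                prod = 1.0
                for A, B in zip(factors_a, factors_b):
                    prod *= kern(A[:, k], B[:, h])
                total += prod
        return total

    def kstm_classify(V_test, U_train_list, beta, b, kern=gaussian_kernel):
        # sign( sum_m beta_m * kappa(X_hat, X_hat^(m)) + b ), using factors only.
        score = sum(b_m * cp_kernel(V_test, U_m, kern)
                    for b_m, U_m in zip(beta, U_train_list))
        return np.sign(score + b)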
6.4 Future Work

Directions for future research based on oblivious subspace embeddings include

• Extending the JL results for tensors with low CP rank to tensors with low Tucker rank without orthogonality constraints. The goal is to obtain theoretical bounds on the deviation of the norms and inner products of randomly projected tensors, as well as lower bounds on the embedding dimensions.

• In settings where data points lie on a low-rank subspace, the following can be investigated.
  – Compressed MPCA. One can use JL embeddings to project tensors before performing MPCA to solve for the principal components in the low-rank subspace.
  – Compressed Tensor SVM. One can use JL embeddings to project tensors with low-rank expansions before SVM classification. In other words, the separating hyperplane can be found in a lower-dimensional subspace.

• Modewise JL in third-order MBPT calculations, where higher-order tensors are formed from lower-order data replicated in certain dimensions. As was seen, in this setting, non-orthogonal tensors sometimes become nearly orthogonal after projection. The role of the repetitive patterns forming the higher-dimensional tensors, as well as sparsity, can be investigated.

APPENDICES

APPENDIX A
USEFUL NOTIONS, DEFINITIONS, AND RELATIONS

In this appendix chapter, some of the terms and mathematical definitions used in derivations throughout the thesis are presented.

A.1 Operations on Rank-1 Tensors

In this section, useful relations are presented that explain how rank-1 tensors are reshaped into vectors or unfoldings, and how $j$-mode products are performed.

A.1.1 Reshaping Rank-1 Tensors

A.1.1.1 Vectorization

For a rank-1 tensor $\mathcal{X} = \mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(d)}$, the vectorized form is obtained according to
\[
\operatorname{vec}(\mathcal{X}) =
\begin{cases}
\bigotimes_{\ell=d}^{1}\mathbf{x}^{(\ell)}, & \text{column-major},\\
\bigotimes_{\ell=1}^{d}\mathbf{x}^{(\ell)}, & \text{row-major}.
\end{cases}\tag{A.1}
\]

A.1.1.2 Mode-$j$ Unfolding

The mode-$j$ unfolding of a rank-1 tensor $\mathcal{X} = \mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(d)}$ can be calculated using
\[
\mathbf{X}_{(j)} = \mathbf{x}^{(j)}\operatorname{vec}\!\left(\mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(j-1)}\circ\mathbf{x}^{(j+1)}\circ\cdots\circ\mathbf{x}^{(d)}\right)^{\top} =
\begin{cases}
\mathbf{x}^{(j)}\left(\mathbf{x}^{(d)}\otimes\cdots\otimes\mathbf{x}^{(j+1)}\otimes\mathbf{x}^{(j-1)}\otimes\cdots\otimes\mathbf{x}^{(1)}\right)^{\top}, & \text{column-major},\\
\mathbf{x}^{(j)}\left(\mathbf{x}^{(1)}\otimes\cdots\otimes\mathbf{x}^{(j-1)}\otimes\mathbf{x}^{(j+1)}\otimes\cdots\otimes\mathbf{x}^{(d)}\right)^{\top}, & \text{row-major}.
\end{cases}\tag{A.2}
\]
In the rest of this section, only the column-major relations will be presented and used, as this has been the case throughout this thesis.

A.1.2 $j$-mode Product

For a rank-1 tensor $\mathcal{X} = \mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(d)}$, the $j$-mode product can be performed by applying the matrix of interest to the mode-$j$ factor $\mathbf{x}^{(j)}$, i.e.,
\[
\mathcal{X}\times_j\mathbf{A}^{(j)} = \mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(j-1)}\circ\left(\mathbf{A}^{(j)}\mathbf{x}^{(j)}\right)\circ\mathbf{x}^{(j+1)}\circ\cdots\circ\mathbf{x}^{(d)}.\tag{A.3}
\]
The proof follows simply by considering the mode-$j$ unfolding:
\[
\left(\mathcal{X}\times_j\mathbf{A}^{(j)}\right)_{(j)} = \mathbf{A}^{(j)}\mathbf{X}_{(j)} = \mathbf{A}^{(j)}\mathbf{x}^{(j)}\operatorname{vec}\!\left(\mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(j-1)}\circ\mathbf{x}^{(j+1)}\circ\cdots\circ\mathbf{x}^{(d)}\right)^{\top}.\tag{A.4}
\]
Reshaping to tensor form, (A.3) is obtained. A more general relation involving all modes can be established through a similar approach. Consider a rank-1 tensor $\mathcal{X} = \mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(d)}$, and let $\mathcal{Y} := \mathcal{X}\times_1\mathbf{A}^{(1)}\times_2\cdots\times_d\mathbf{A}^{(d)}$. Then $\mathcal{Y}$ is also a rank-1 tensor, and
\[
\mathcal{Y} = \left(\mathbf{A}^{(1)}\mathbf{x}^{(1)}\right)\circ\cdots\circ\left(\mathbf{A}^{(d)}\mathbf{x}^{(d)}\right).\tag{A.5}
\]
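The identities (A.1)-(A.5) are easy to verify numerically. The short Python/NumPy sketch below (a check under the column-major conventions used here, with illustrative sizes and names) confirms the vectorization, mode-1 unfolding, and modewise-product relations for a random rank-1 tensor of order three.

    import numpy as np
    from functools import reduce

    # A random rank-1 tensor X = x1 o x2 o x3 (order three for illustration).
    dims = [3, 4, 5]
    xs = [np.random.randn(n) for n in dims]
    X = reduce(np.multiply.outer, xs)

    # (A.1), column-major: vec(X) = x3 kron x2 kron x1.
    assert np.allclose(X.flatten(order='F'), reduce(np.kron, xs[::-1]))

    # (A.2), column-major mode-1 unfolding: X_(1) = x1 (x3 kron x2)^T.
    X1 = X.reshape(dims[0], -1, order='F')
    assert np.allclose(X1, np.outer(xs[0], np.kron(xs[2], xs[1])))

    # (A.3)/(A.5): modewise products act on the individual factors.
    As = [np.random.randn(2, n) for n in dims]
    Y = np.einsum('abc,ia,jb,kc->ijk', X, *As)        # X x_1 A1 x_2 A2 x_3 A3
    Y_rank1 = reduce(np.multiply.outer, [A @ x for A, x in zip(As, xs)])
    assert np.allclose(Y, Y_rank1)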
To show this, the mode-$j$ unfolding of $\mathcal{Y}$ is calculated as follows:
\[
\begin{aligned}
\mathbf{Y}_{(j)} &= \mathbf{A}^{(j)}\mathbf{X}_{(j)}\left(\mathbf{A}^{(d)}\otimes\cdots\otimes\mathbf{A}^{(j+1)}\otimes\mathbf{A}^{(j-1)}\otimes\cdots\otimes\mathbf{A}^{(1)}\right)^{\top}\\
&= \mathbf{A}^{(j)}\mathbf{x}^{(j)}\left(\mathbf{x}^{(d)}\otimes\cdots\otimes\mathbf{x}^{(j+1)}\otimes\mathbf{x}^{(j-1)}\otimes\cdots\otimes\mathbf{x}^{(1)}\right)^{\top}\left(\mathbf{A}^{(d)}\otimes\cdots\otimes\mathbf{A}^{(j+1)}\otimes\mathbf{A}^{(j-1)}\otimes\cdots\otimes\mathbf{A}^{(1)}\right)^{\top}\\
&= \mathbf{A}^{(j)}\mathbf{x}^{(j)}\left[\left(\mathbf{A}^{(d)}\otimes\cdots\otimes\mathbf{A}^{(j+1)}\otimes\mathbf{A}^{(j-1)}\otimes\cdots\otimes\mathbf{A}^{(1)}\right)\left(\mathbf{x}^{(d)}\otimes\cdots\otimes\mathbf{x}^{(j+1)}\otimes\mathbf{x}^{(j-1)}\otimes\cdots\otimes\mathbf{x}^{(1)}\right)\right]^{\top}\\
&= \mathbf{A}^{(j)}\mathbf{x}^{(j)}\left(\mathbf{A}^{(d)}\mathbf{x}^{(d)}\otimes\cdots\otimes\mathbf{A}^{(j+1)}\mathbf{x}^{(j+1)}\otimes\mathbf{A}^{(j-1)}\mathbf{x}^{(j-1)}\otimes\cdots\otimes\mathbf{A}^{(1)}\mathbf{x}^{(1)}\right)^{\top}.
\end{aligned}\tag{A.6}
\]
This is clearly the mode-$j$ unfolding of a rank-1 tensor, which can be reshaped to tensor form to obtain (A.5). To get from the penultimate equality to the last one, the following matrix Kronecker product property was used:
\[
\left(\mathbf{A}^{(1)}\otimes\cdots\otimes\mathbf{A}^{(d)}\right)\left(\mathbf{B}^{(1)}\otimes\cdots\otimes\mathbf{B}^{(d)}\right) = \mathbf{A}^{(1)}\mathbf{B}^{(1)}\otimes\cdots\otimes\mathbf{A}^{(d)}\mathbf{B}^{(d)}.\tag{A.7}
\]

A.1.3 Inner Product

Consider rank-1 tensors $\mathcal{X} = \mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(d)}$ and $\mathcal{Y} = \mathbf{y}^{(1)}\circ\cdots\circ\mathbf{y}^{(d)}$. To simplify the inner product defined in (2.3), one can use the fact that an element of a rank-1 tensor at index set $(i_1, i_2, \dots, i_d)$ is the product of the individual elements of its factor vectors at indices $i_1, i_2, \dots, i_d$. The inner product is then written as
\[
\langle\mathcal{X},\mathcal{Y}\rangle = \sum_{i_1,\dots,i_d}\mathcal{X}_{i_1,\dots,i_d}\,\mathcal{Y}_{i_1,\dots,i_d}
= \sum_{i_1,\dots,i_d}\mathbf{x}_{i_1}^{(1)}\cdots\mathbf{x}_{i_d}^{(d)}\,\mathbf{y}_{i_1}^{(1)}\cdots\mathbf{y}_{i_d}^{(d)}
= \left(\sum_{i_1}\mathbf{x}_{i_1}^{(1)}\mathbf{y}_{i_1}^{(1)}\right)\cdots\left(\sum_{i_d}\mathbf{x}_{i_d}^{(d)}\mathbf{y}_{i_d}^{(d)}\right)
= \prod_{\ell=1}^{d}\left\langle\mathbf{x}^{(\ell)},\mathbf{y}^{(\ell)}\right\rangle.\tag{A.8}
\]
This means the inner product of rank-1 tensors boils down to the inner products of their factor vectors.
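As a quick numerical sanity check of (A.8) (an illustration with arbitrary sizes, not part of the derivation):

    import numpy as np
    from functools import reduce

    dims = [3, 4, 5]
    xs = [np.random.randn(n) for n in dims]
    ys = [np.random.randn(n) for n in dims]
    X = reduce(np.multiply.outer, xs)          # rank-1 tensors
    Y = reduce(np.multiply.outer, ys)

    lhs = np.sum(X * Y)                              # <X, Y> as in (2.3)
    rhs = np.prod([x @ y for x, y in zip(xs, ys)])   # product of factor inner products
    assert np.allclose(lhs, rhs)                     # verifies (A.8)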
A.2 $\varepsilon$-nets as a Means of Set Discretization

This is a useful discretization concept that allows one to obtain large deviation inequalities when assessing the statistical properties of random variables/vectors/matrices/tensors [25].

Definition A.2.1 ($\varepsilon$-net) Let $(\mathcal{S}, d)$ be a metric space, and let $\mathcal{S}_0$ be a subset of $\mathcal{S}$. A subset $\mathcal{N}\subseteq\mathcal{S}_0$ is called an $\varepsilon$-net of $\mathcal{S}_0$ if, for $\varepsilon > 0$, we have
\[
\forall x\in\mathcal{S}_0\ \ \exists x_0\in\mathcal{N}:\ d(x, x_0)\le\varepsilon,\tag{A.9}
\]
meaning any given point in $\mathcal{S}_0$ is within a distance $\varepsilon$ of a point in $\mathcal{N}$. This also means that $\mathcal{N}$ is an $\varepsilon$-net of $\mathcal{S}_0$ if and only if $\mathcal{S}_0$ can be fully covered by balls centered at the elements of $\mathcal{N}$ with radii $\varepsilon$.

It can be observed that the balls of radii $\varepsilon$ in the above definition may overlap. This puts a lower bound on the number of such balls, paving the way for the following definition.

Definition A.2.2 (Covering Number) The smallest possible cardinality of an $\varepsilon$-net $\mathcal{N}$ of $\mathcal{S}_0$ in Definition A.2.1 is called the covering number of $\mathcal{S}_0$ and is denoted by $\mathcal{C}(\mathcal{S}_0,\varepsilon)$. Equivalently, it is the smallest number of closed balls with centers in the elements of $\mathcal{N}$ and radii $\varepsilon$ whose union covers $\mathcal{S}_0$.

Note A.2.1 The covering number of the Euclidean ball $B_2^n$ satisfies the following inequality:
\[
\left(\frac{1}{\varepsilon}\right)^n \le \mathcal{C}\left(B_2^n,\varepsilon\right) \le \left(1 + \frac{2}{\varepsilon}\right)^n.\tag{A.10}
\]

A.3 Random Projections and Johnson-Lindenstrauss Embeddings

In this section, a brief introduction to random projections in the context of JL embeddings is presented. More details can be found in [25].

Suppose that we have $N$ data points $\{\mathbf{x}_i\in\mathbb{R}^n\}_{i=1}^{N}$ and we want to project them onto $\{\mathbf{y}_i\in\mathbb{R}^m\}_{i=1}^{N}$, where $m\ll n$, in a way that the geometry of the data is preserved, i.e., $\|\mathbf{x}_i - \mathbf{x}_j\| \approx \|\mathbf{y}_i - \mathbf{y}_j\|$ for $i\neq j$. We would also like to know the smallest $m$ that makes this possible. First, we review notions that will be useful later. A general class of random variables called sub-Gaussian random variables is of special interest in this discussion.

Definition A.3.1 (Sub-Gaussian Random Variable) For a sub-Gaussian random variable $X$, the following properties hold and are equivalent; the constants $K_i$ differ from each other by at most an absolute constant factor.
1. $\mathbb{P}\{|X| > t\} \le 2\exp\left(-t^2/K_1^2\right)$ for all $t > 0$.
2. $\|X\|_p := \left(\mathbb{E}|X|^p\right)^{1/p} \le K_2\sqrt{p}$ for all $p \ge 1$.
3. The MGF of $X^2$ satisfies $\mathbb{E}\exp\left(\lambda^2 X^2\right) \le \exp\left(K_3^2\lambda^2\right)$ for all $\lambda$ such that $|\lambda| \le 1/K_3$.
4. The MGF of $X^2$ is bounded at some point, namely $\mathbb{E}\exp\left(X^2/K_4^2\right) \le 2$.
5. If $\mathbb{E}X = 0$, the above four properties are equivalent to the MGF of $X$ satisfying $\mathbb{E}\exp(\lambda X) \le \exp\left(K_5^2\lambda^2\right)$ for all $\lambda\in\mathbb{R}$.

All Gaussian random variables and all bounded random variables are sub-Gaussian.

Definition A.3.2 (Sub-Gaussian Norm) The sub-Gaussian norm of a random variable $X$, denoted by $\|X\|_{\psi_2}$, is defined as
\[
\|X\|_{\psi_2} = \inf\left\{t > 0 : \mathbb{E}\exp\left(X^2/t^2\right) \le 2\right\}.
\]
An alternative definition of the sub-Gaussian norm of a random variable $X$ is
\[
\|X\|_{\psi_2} = \sup_{p\ge 1} p^{-1/2}\|X\|_p.
\]

Definition A.3.3 (Random Subspace) Let $G_{n,m}$ denote the Grassmannian (also called the Grassmann manifold), i.e., the set of all $m$-dimensional subspaces of $\mathbb{R}^n$. We say that $E$ is a random $m$-dimensional subspace of $\mathbb{R}^n$ uniformly distributed in $G_{n,m}$ if $E$ is a random $m$-dimensional subspace of $\mathbb{R}^n$ whose distribution is rotation-invariant, i.e.,
\[
\mathbb{P}\{E\in\mathcal{E}\} = \mathbb{P}\{\mathbf{U}(E)\in\mathcal{E}\}
\]
for any fixed subset $\mathcal{E}\subset G_{n,m}$, where $\mathbf{U}$ is an $n\times n$ orthogonal matrix.

Proposition A.3.1 (Random Projection) Let $\mathbf{P}$ be a projection from $\mathbb{R}^n$ onto a random $m$-dimensional subspace of $\mathbb{R}^n$ uniformly distributed in $G_{n,m}$. Let $\mathbf{z}\in\mathbb{R}^n$ be a fixed point and $\varepsilon > 0$. Then,
1. $\left\|\,\|\mathbf{P}\mathbf{z}\|_2\,\right\|_2 := \left(\mathbb{E}\|\mathbf{P}\mathbf{z}\|_2^2\right)^{1/2} = \sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2$.
2. With probability at least $1 - 2\exp\left(-c\varepsilon^2 m\right)$, we have
\[
(1-\varepsilon)\sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2 \le \|\mathbf{P}\mathbf{z}\|_2 \le (1+\varepsilon)\sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2,
\]
where $c$ is an absolute constant.
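Part 1 of Proposition A.3.1 is easy to check empirically before reading the proof. The following Monte Carlo sketch is illustrative only; it relies on the fact that the column space of a Gaussian matrix is uniformly distributed on the Grassmannian, and all sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, trials = 200, 20, 2000
    z = rng.standard_normal(n)                 # a fixed point in R^n

    sq_norms = np.empty(trials)
    for t in range(trials):
        # Orthonormal basis of a uniformly random m-dimensional subspace of R^n:
        # the column space of a Gaussian matrix is rotation-invariant.
        Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
        sq_norms[t] = np.linalg.norm(Q.T @ z) ** 2   # = ||P z||_2^2 for P = Q Q^T

    print(np.mean(sq_norms) / np.linalg.norm(z) ** 2)   # approximately m/n
    print(m / n)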
Proof Since we can normalize by dividing by $\|\mathbf{z}\|_2$, without loss of generality we may assume that $\|\mathbf{z}\|_2 = 1$. Using rotation-invariance, one can show that the random projection of a fixed point is equivalent in distribution to a fixed projection of a random point uniformly distributed on the unit sphere $S^{n-1}\subset\mathbb{R}^n$. This is denoted by $\mathbf{z}\sim\operatorname{Uniform}\left(S^{n-1}\right)$. We choose this fixed projection to be the one that picks the first $m$ entries of $\mathbf{z}\in\mathbb{R}^n$, i.e., $\mathbf{P}\mathbf{z} = [z_1, z_2, \dots, z_m]^{\top}$. In other words, $\mathbf{P} = [\mathbf{I}_m\,|\,\mathbf{0}]\in\mathbb{R}^{m\times n}$, where $\mathbf{0}\in\mathbb{R}^{m\times(n-m)}$ is a matrix of all zeros. Therefore, we have that
\[
\mathbb{E}\|\mathbf{P}\mathbf{z}\|_2^2 = \mathbb{E}\sum_{i=1}^{m}z_i^2 = \sum_{i=1}^{m}\mathbb{E}z_i^2 = m\,\mathbb{E}z_i^2,
\]
since all $z_i$ are drawn from the same distribution. We also know that $\|\mathbf{z}\|_2^2 = 1$. Taking the expectation of both sides of $\sum_{i=1}^{n}z_i^2 = 1$, we can write
\[
\mathbb{E}\sum_{i=1}^{n}z_i^2 = 1,
\]
resulting in $\mathbb{E}z_i^2 = 1/n$. Therefore, $\mathbb{E}\|\mathbf{P}\mathbf{z}\|_2^2 = m/n$. This proves 1.

To prove 2, we may use the large deviation inequality for the concentration of Lipschitz-continuous functions on the unit sphere. In particular, if $f(\mathbf{z})$ is Lipschitz-continuous and $\mathbf{z}\sim\operatorname{Uniform}\left(S^{n-1}\right)$, then
\[
\mathbb{P}\left\{\left|f(\mathbf{z}) - \|f(\mathbf{z})\|_2\right| > t\right\} \le 2\exp\left(-\frac{cnt^2}{L^2}\right),
\]
where $c$ is an absolute constant and $L$ is the Lipschitz constant of $f(\mathbf{z})$. (If $f:\mathbb{R}^n\to\mathbb{R}$ is Lipschitz with Lipschitz constant $L$, then $f(\mathbf{x})$ is sub-Gaussian for $\mathbf{x}\sim\operatorname{Uniform}\left(S^{n-1}\right)$, and $\mathbb{P}\left(|f(\mathbf{x}) - M| > t\right) \le 2\exp\left(-\frac{nt^2}{2L^2}\right)$ for $t > 0$, where $M$ is the median of $f$. This means that, with high probability, $f(\mathbf{x})\approx M$ on the unit sphere. A similar inequality can be obtained when $M$ is replaced with $\mathbb{E}f(\mathbf{z})$, which, in turn, can be substituted by $\|f(\mathbf{z})\|_2$, as has been done in the displayed inequality above. For details, see Chapter 5 in [25].) Here, $\|f(\mathbf{z})\|_2 = \left(\mathbb{E}|f(\mathbf{z})|^2\right)^{1/2}$, as $f(\mathbf{z})$ is a random variable. In our case, $f(\mathbf{z}) = \|\mathbf{P}\mathbf{z}\|_2$. Since $f(\mathbf{z})$ is Lipschitz, and therefore continuous, we can use the mean value theorem to write
\[
|f(\mathbf{z}) - f(\mathbf{y})| = \left|\nabla^{\top}f(\mathbf{z}_0)(\mathbf{z}-\mathbf{y})\right| \le \|\nabla f(\mathbf{z}_0)\|_2\,\|\mathbf{z}-\mathbf{y}\|_2,
\]
where $\mathbf{y},\mathbf{z}\in\mathbb{R}^n$ and $\mathbf{z}_0$ is a point on the line segment between $\mathbf{z}$ and $\mathbf{y}$. We can bound the gradient of $f(\mathbf{z})$ to show that $|f(\mathbf{z}) - f(\mathbf{y})| \le \|\mathbf{z}-\mathbf{y}\|_2$, meaning that we can let $L = 1$. On the other hand, we may rearrange the inequality in 2 as
\[
\left|\,\|\mathbf{P}\mathbf{z}\|_2 - \sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2\,\right| \le \varepsilon\sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2.
\]
Noting that $t = \varepsilon\sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2$ in the large deviation inequality, and using the result from 1, we obtain the desired result.

Lemma A.3.1 (Johnson-Lindenstrauss Lemma for Point Sets) Let $X\subseteq\mathbb{R}^n$ be a set of $N$ points in $\mathbb{R}^n$. Then, for an absolute constant $C$, there exists a linear map $\mathbf{A} = \sqrt{\frac{n}{m}}\,\mathbf{P}\in\mathbb{R}^{m\times n}$, where
\[
m \ge C\,\frac{\log N}{\varepsilon^2},
\]
such that for all $\mathbf{x},\mathbf{y}\in X$,
\[
(1-\varepsilon)\|\mathbf{x}-\mathbf{y}\|_2 \le \|\mathbf{A}(\mathbf{x}-\mathbf{y})\|_2 \le (1+\varepsilon)\|\mathbf{x}-\mathbf{y}\|_2
\]
with probability at least $1 - 2\exp\left(-Cm\varepsilon^2\right)$. This means that $\mathbf{A}$ is an approximate isometry on $X$.

Proof We showed how the random projection $\mathbf{P}$ acts on a vector in $X$. Now, consider the difference set $X - X := \{\mathbf{x}-\mathbf{y}\,|\,\mathbf{x},\mathbf{y}\in X\}$. We want to show that for all $\mathbf{z}\in X - X$,
\[
(1-\varepsilon)\|\mathbf{z}\|_2 \le \|\mathbf{A}\mathbf{z}\|_2 \le (1+\varepsilon)\|\mathbf{z}\|_2.
\]
Replacing $\mathbf{A}$ by $\sqrt{\frac{n}{m}}\,\mathbf{P}$, we have
\[
(1-\varepsilon)\sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2 \le \|\mathbf{P}\mathbf{z}\|_2 \le (1+\varepsilon)\sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2,
\]
which we already know holds with probability at least $1 - 2\exp\left(-cm\varepsilon^2\right)$ from Proposition A.3.1. Now, taking a union bound over all points in $X - X$, we may conclude that the desired result holds with probability at least
\[
1 - 2\,|X - X|\exp\left(-cm\varepsilon^2\right) \ge 1 - 2N^2\exp\left(-cm\varepsilon^2\right) = 1 - \exp\left(-cm\varepsilon^2 + \log\left(2N^2\right)\right).
\]
For this probability to be positive, we need $cm\varepsilon^2 \ge \log\left(2N^2\right)$, and therefore
\[
m \ge C\,\frac{\log N}{\varepsilon^2},
\]
completing the proof.
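To see Lemma A.3.1 in action numerically, the sketch below embeds a small point set and records the worst pairwise distance distortion. It uses a dense Gaussian JL map with i.i.d. N(0, 1/m) entries as a stand-in for the subspace projection $\sqrt{n/m}\,\mathbf{P}$ of the lemma (a standard JL construction with the same qualitative behavior); all sizes and the constant in the choice of m are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    n, N, eps = 1000, 50, 0.3
    m = int(np.ceil(8 * np.log(N) / eps ** 2))      # m ~ C log(N) / eps^2

    X = rng.standard_normal((N, n))                 # N points in R^n
    A = rng.standard_normal((m, n)) / np.sqrt(m)    # JL map, i.i.d. N(0, 1/m) entries
    Y = X @ A.T

    ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
              for i in range(N) for j in range(i + 1, N)]
    print(min(ratios), max(ratios))                 # typically within [1 - eps, 1 + eps]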
APPENDIX B
MEMORY-EFFICIENT MODE-WISE PROJECTION CALCULATIONS OF THE ENERGY TERMS

In the following, it is assumed that $n_j$, $j\in\{1,2,3\}$, is the dimension size of the 3-mode tensors obtained by reshaping the hypothetical 6-mode arrays, and $m_j$ is the corresponding size after projection.

B.1 Particle-Particle

Denoting the projected version of $\mathcal{H}_1$ by $\mathcal{P}_1$, one can calculate its elements by
\[
\begin{aligned}
\mathcal{P}_1(i_1,i_2,i_3) &= \sum_{p,q,r}\mathbf{H}(p,q)\,\mathbf{H}(q,r)\,\mathbf{A}^{(1)}(i_1,p)\,\mathbf{A}^{(2)}(i_2,q)\,\mathbf{A}^{(3)}(i_3,r)\\
&= \sum_{q}\mathbf{A}^{(2)}(i_2,q)\left(\sum_{p}\mathbf{A}^{(1)}(i_1,p)\,\mathbf{H}(p,q)\right)\left(\sum_{r}\mathbf{A}^{(3)}(i_3,r)\,\mathbf{H}^{\top}(r,q)\right),
\end{aligned}\tag{B.1}
\]
for $i_j\in[m_j]$, $j\in\{1,2,3\}$. Now, letting $\mathbf{H}' = \mathbf{A}^{(1)}\mathbf{H}$ and $\mathbf{H}'' = \mathbf{A}^{(3)}\mathbf{H}^{\top}$, it can be observed that $\mathcal{P}_1 = \mathcal{H}'''\times_2\mathbf{A}^{(2)}$, where $\mathcal{H}'''$ is a tensor that is formed element-wise according to
\[
\mathcal{H}'''(i_1, q, i_3) = \mathbf{H}'(i_1, q)\,\mathbf{H}''(i_3, q).
\]
Indeed, the mode-2 unfolding of $\mathcal{H}'''$ is the result of the Hadamard product of the $n_2\times m_1 m_3$ matrices $\mathbf{G}_1$ and $\mathbf{G}_2$, where $\mathbf{G}_1$ is formed by replicating $\mathbf{H}'^{\top}$ across its second dimension $m_3$ times, and $\mathbf{G}_2$ is formed by replicating each column of $\mathbf{H}''^{\top}$ $m_1$ times (in an implementation, one does not have to store the replicated versions of the data, as this would be an inefficient use of memory). Left-multiplying the mode-2 unfolding of $\mathcal{H}'''$ by $\mathbf{A}^{(2)}$ and folding back to tensor shape yields $\mathcal{P}_1$.

Letting $\mathcal{P}_2$ denote the projected version of $\mathcal{H}_2$, it is observed that
\[
\begin{aligned}
\mathcal{P}_2(i_1,i_2,i_3) &= \sum_{p,q,r}\widetilde{\mathbf{H}}(r,p)\,\mathbf{D}(q,p)\,\mathbf{A}^{(1)}(i_1,p)\,\mathbf{A}^{(2)}(i_2,q)\,\mathbf{A}^{(3)}(i_3,r)\\
&= \sum_{p}\mathbf{A}^{(1)}(i_1,p)\left(\sum_{q}\mathbf{A}^{(2)}(i_2,q)\,\mathbf{D}(q,p)\right)\left(\sum_{r}\mathbf{A}^{(3)}(i_3,r)\,\widetilde{\mathbf{H}}(r,p)\right).
\end{aligned}\tag{B.2}
\]
Again, letting $\mathbf{H}' = \mathbf{A}^{(2)}\mathbf{D}$ and $\mathbf{H}'' = \mathbf{A}^{(3)}\widetilde{\mathbf{H}}$, it can be observed that $\mathcal{P}_2 = \mathcal{H}'''\times_1\mathbf{A}^{(1)}$, where $\mathcal{H}'''$ is defined element-wise by
\[
\mathcal{H}'''(p, i_2, i_3) = \mathbf{H}'(i_2, p)\,\mathbf{H}''(i_3, p).
\]
The mode-1 unfolding of $\mathcal{H}'''$ is obtained by the Hadamard product of the $n_1\times m_2 m_3$ matrices $\mathbf{G}_1$ and $\mathbf{G}_2$, where $\mathbf{G}_1$ is formed by replicating $\mathbf{H}'^{\top}$ across its second dimension $m_3$ times, and $\mathbf{G}_2$ is formed by replicating each column of $\mathbf{H}''^{\top}$ $m_2$ times. It now suffices to left-multiply the mode-1 unfolding of $\mathcal{H}'''$ by $\mathbf{A}^{(1)}$ and fold back to tensor form to obtain $\mathcal{P}_2$.

B.2 Hole-Hole

Calculations are done in a way similar to that of Appendix B.1. The projection of $\mathcal{H}_1$ will be exactly the same. For $\mathcal{H}_2$, one can write
\[
\begin{aligned}
\mathcal{P}_2(i_1,i_2,i_3) &= \sum_{p,q,r}\widetilde{\mathbf{H}}(r,p)\,\mathbf{D}(r,q)\,\mathbf{A}^{(1)}(i_1,p)\,\mathbf{A}^{(2)}(i_2,q)\,\mathbf{A}^{(3)}(i_3,r)\\
&= \sum_{r}\mathbf{A}^{(3)}(i_3,r)\left(\sum_{p}\mathbf{A}^{(1)}(i_1,p)\,\widetilde{\mathbf{H}}^{\top}(p,r)\right)\left(\sum_{q}\mathbf{A}^{(2)}(i_2,q)\,\mathbf{D}^{\top}(q,r)\right).
\end{aligned}\tag{B.3}
\]
Letting $\mathbf{H}' = \mathbf{A}^{(1)}\widetilde{\mathbf{H}}^{\top}$ and $\mathbf{H}'' = \mathbf{A}^{(2)}\mathbf{D}^{\top}$, it can be observed that $\mathcal{P}_2 = \mathcal{H}'''\times_3\mathbf{A}^{(3)}$, where $\mathcal{H}'''$ is formed element-wise according to
\[
\mathcal{H}'''(i_1, i_2, r) = \mathbf{H}'(i_1, r)\,\mathbf{H}''(i_2, r).
\]
The mode-3 unfolding of $\mathcal{H}'''$ is obtained by the Hadamard product of the $n_3\times m_1 m_2$ matrices $\mathbf{G}_1$ and $\mathbf{G}_2$, where $\mathbf{G}_1$ is formed by replicating $\mathbf{H}'^{\top}$ across its second dimension $m_2$ times, and $\mathbf{G}_2$ is formed by replicating each column of $\mathbf{H}''^{\top}$ $m_1$ times, where the replication is not performed explicitly, to save time and memory, as mentioned above. The remaining step is to left-multiply the mode-3 unfolding of $\mathcal{H}'''$ by $\mathbf{A}^{(3)}$ and fold back to tensor shape to get $\mathcal{P}_2$.
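The replication-free evaluation described above can be expressed compactly with an einsum contraction. The following Python/NumPy sketch checks the particle-particle case (B.1): it compares a direct evaluation of the triple sum against the memory-efficient route through H' and H''. Sizes and variable names are illustrative.

    import numpy as np

    n, (m1, m2, m3) = 6, (2, 3, 4)              # illustrative sizes
    H = np.random.randn(n, n)
    A1, A2, A3 = (np.random.randn(m, n) for m in (m1, m2, m3))

    # Direct evaluation of the triple sum in (B.1).
    P1_direct = np.einsum('pq,qr,ip,jq,kr->ijk', H, H, A1, A2, A3)

    # Memory-efficient route: H' = A1 H, H'' = A3 H^T, then contract over the
    # shared index q (the Hadamard-product structure) and project with A2.
    Hp = A1 @ H                                  # m1 x n, entries H'(i1, q)
    Hpp = A3 @ H.T                               # m3 x n, entries H''(i3, q)
    P1_fast = np.einsum('iq,jq,kq->ijk', Hp, A2, Hpp)

    assert np.allclose(P1_direct, P1_fast)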
B.3 Particle-Hole

The projection of $\mathcal{H}_1$ can be calculated using
\[
\mathcal{P}_1(i_1,i_2,i_3) = \sum_{p,q,r}\widetilde{\mathbf{H}}_1(p,q)\,\mathbf{H}_p(q,r)\,\mathbf{A}^{(1)}(i_1,p)\,\mathbf{A}^{(2)}(i_2,q)\,\mathbf{A}^{(3)}(i_3,r),\tag{B.4}
\]
which is the same as (B.1) after replacing $\mathbf{H}(q,r)$ by $\mathbf{H}_p(q,r)$ and $\mathbf{H}(p,q)$ by $\widetilde{\mathbf{H}}_1(p,q)$ in (B.1). For $\mathcal{H}_2$, the projected tensor $\mathcal{P}_2$ is calculated using
\[
\begin{aligned}
\mathcal{P}_2(i_1,i_2,i_3) &= \sum_{p,q,r}\widetilde{\mathbf{H}}_2(r,p)\,\mathbf{H}_p\,\mathbf{A}^{(1)}(i_1,p)\,\mathbf{A}^{(2)}(i_2,q)\,\mathbf{A}^{(3)}(i_3,r)\\
&= \sum_{q}\mathbf{A}^{(2)}(i_2,q)\left(\sum_{p}\mathbf{A}^{(1)}(i_1,p)\left(\sum_{r}\mathbf{A}^{(3)}(i_3,r)\,\widetilde{\mathbf{H}}_2(r,p)\right)\right)\\
&= \sum_{q}\mathbf{H}''(i_1,i_3)\,\mathbf{A}^{(2)}(i_2,q),
\end{aligned}\tag{B.5}
\]
where $\mathbf{H}'' = \mathbf{A}^{(1)}\mathbf{H}'^{\top}$ and $\mathbf{H}' = \mathbf{A}^{(3)}\widetilde{\mathbf{H}}_2$.

APPENDIX C
FASTER KRONECKER JOHNSON-LINDENSTRAUSS TRANSFORM

In this appendix, the idea behind the method proposed in [10] is briefly outlined. Here, the notation is kept similar to that of [10], and is slightly different from the notation used in Section 4.2.3. Assume that $N = \prod_{k=1}^{d}n_k$, and consider the JL matrix
\[
\boldsymbol{\Phi} = \sqrt{\frac{N}{m_{\mathrm{kron}}}}\,\mathbf{S}\left(\bigotimes_{k=d}^{1}\mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}\right)\in\mathbb{C}^{m_{\mathrm{kron}}\times N},\tag{C.1}
\]
where $\mathbf{S}\in\mathbb{R}^{m_{\mathrm{kron}}\times N}$ is a random sampling matrix (similar to $\mathbf{R}$ in RFD), $\mathbf{D}_{\xi_{n_k}}\in\mathbb{R}^{n_k\times n_k}$ is a diagonal matrix with Rademacher random variables on its diagonal, and $\mathbf{F}_{n_k}\in\mathbb{C}^{n_k\times n_k}$ is the unitary DFT matrix. If a vector $\mathbf{x}\in\mathbb{C}^{N\times 1}$ has Kronecker structure, meaning it can be written as $\mathbf{x} = \bigotimes_{k=d}^{1}\mathbf{x}_k$, or equivalently as the vectorized form of a rank-1 tensor $\mathcal{X} = \mathbf{x}_1\circ\cdots\circ\mathbf{x}_d$, then one can observe that
\[
\begin{aligned}
\boldsymbol{\Phi}\mathbf{x} &= \sqrt{\frac{N}{m_{\mathrm{kron}}}}\,\mathbf{S}\left(\bigotimes_{k=d}^{1}\mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}\right)\left(\bigotimes_{k=d}^{1}\mathbf{x}_k\right)
= \sqrt{\frac{N}{m_{\mathrm{kron}}}}\,\mathbf{S}\left(\bigotimes_{k=d}^{1}\mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}\mathbf{x}_k\right)\\
&= \sqrt{\frac{N}{m_{\mathrm{kron}}}}\,\mathbf{S}\,\operatorname{vec}\!\left(\mathcal{X}\times_1\mathbf{F}_{n_1}\mathbf{D}_{\xi_{n_1}}\times_2\cdots\times_d\mathbf{F}_{n_d}\mathbf{D}_{\xi_{n_d}}\right).
\end{aligned}\tag{C.2}
\]
It is easy to see that applying $\boldsymbol{\Phi}$ to $\mathbf{x}$ defined above is equivalent to performing a modewise fast JL embedding of a rank-1 tensor whose vectorized form is $\mathbf{x}$, meaning $\mathbf{x} = \operatorname{vec}(\mathcal{X}) = \operatorname{vec}\left(\mathbf{x}_1\circ\cdots\circ\mathbf{x}_d\right)$. The embedding applied to mode $k$ is $\mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}$. Next, the result is vectorized and the random restriction matrix $\mathbf{S}$ is applied, followed by the factor $\sqrt{N/m_{\mathrm{kron}}}$. The computational cost will be low since, if $\mathbf{y}_k := \mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}\mathbf{x}_k$, then one can see that
\[
\mathcal{Y} := \mathbf{y}_1\circ\cdots\circ\mathbf{y}_d = \left(\mathbf{F}_{n_1}\mathbf{D}_{\xi_{n_1}}\mathbf{x}_1\right)\circ\cdots\circ\left(\mathbf{F}_{n_d}\mathbf{D}_{\xi_{n_d}}\mathbf{x}_d\right) = \mathcal{X}\times_1\mathbf{F}_{n_1}\mathbf{D}_{\xi_{n_1}}\times_2\cdots\times_d\mathbf{F}_{n_d}\mathbf{D}_{\xi_{n_d}},
\]
and $\mathbf{y} := \operatorname{vec}(\mathcal{Y}) = \bigotimes_{k=d}^{1}\mathbf{y}_k$. Therefore, when applying the random sampling matrix $\mathbf{S}$, if one knows which indices of $\mathbf{y}$ are being picked, those indices can be translated into indices of the $\mathbf{y}_k$, meaning that calculations will only be done for those specific indices. Assume we are interested in finding element $i_k$ of $\mathbf{y}_k$. For $i_k\in[n_k]$ and $k\in[d]$, we have that
\[
\mathbf{y}_k(i_k) = \sum_{j=1}^{n_k}\left(\mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}\right)_{i_k,j}\left(\mathbf{x}_k\right)_j.
\]
On the other hand,
\[
\left(\mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}\right)_{i_k,j} = \left(\left(\mathbf{F}_{n_k}\right)_{i_k,:}\ast\,\mathbf{d}_{\xi_{n_k}}\right)_j,
\]
where $\ast$ denotes elementwise multiplication and $\mathbf{d}_{\xi_{n_k}}$ denotes the diagonal of $\mathbf{D}_{\xi_{n_k}}$. Therefore,
\[
\mathbf{y}_k(i_k) = \left\langle\left(\mathbf{F}_{n_k}\right)_{i_k,:}\ast\,\mathbf{d}_{\xi_{n_k}},\ \mathbf{x}_k\right\rangle.
\]
This means that, in each mode, all one needs is $\mathbf{d}_{\xi_{n_k}}$ and the $i_k$th row of $\mathbf{F}_{n_k}$. This significantly reduces the computational cost since, in practice, $m_{\mathrm{kron}}\ll N$.
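A compact sketch of the sampled evaluation described above is given below. It is illustrative only: the function name, the unitary-DFT row construction, and the column-major index ordering are assumptions consistent with (A.1) and (C.1)-(C.2), not code from [10].

    import numpy as np

    def sampled_kron_fjlt(xs, kept_idx, m_kron, rng):
        # Entries of Phi x in (C.1)-(C.2) for x = x_d kron ... kron x_1, computed
        # from the factors alone: each kept flat index is translated into per-mode
        # indices i_k, and y_k(i_k) = <(F_{n_k})_{i_k,:} * d_k, x_k> is used.
        dims = [len(x) for x in xs]
        N = int(np.prod(dims))
        d_signs = [rng.choice([-1.0, 1.0], size=n) for n in dims]   # diag of D_{xi}
        out = np.zeros(len(kept_idx), dtype=complex)
        for s, idx in enumerate(kept_idx):
            multi = np.unravel_index(idx, dims, order='F')          # mode 1 fastest
            val = 1.0 + 0.0j
            for x, d, n, i_k in zip(xs, d_signs, dims, multi):
                f_row = np.exp(-2j * np.pi * i_k * np.arange(n) / n) / np.sqrt(n)
                val *= np.dot(f_row * d, x)                         # y_k(i_k)
            out[s] = val
        return np.sqrt(N / m_kron) * out

    rng = np.random.default_rng(0)
    xs = [rng.standard_normal(n) for n in (4, 5, 6)]
    print(sampled_kron_fjlt(xs, kept_idx=[0, 17, 63], m_kron=3, rng=rng))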
BIBLIOGRAPHY

[1] Alzheimer's disease neuroimaging initiative. http://adni.loni.usc.edu/.

[2] A mathematical introduction to fast and memory efficient algorithms for big data. https://users.math.msu.edu/users/iwenmark/Notes_Fall2020_Iwen_Classes.pdf.

[3] T. D. Ahle, M. Kapralov, J. B. Knudsen, R. Pagh, A. Velingker, D. P. Woodruff, and A. Zandieh. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 141–160. SIAM, 2020.

[4] R. Bro and H. A. Kiers. A new efficient method for determining the number of components in PARAFAC models. Journal of Chemometrics: A Journal of the Chemometrics Society, 17(5):274–286, 2003.

[5] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing. Springer, 2013.

[6] Z. Hao, L. He, B. Chen, and X. Yang. A linear support higher-order tensor machine for classification. IEEE Transactions on Image Processing, 22(7):2911–2920, July 2013.

[7] L. He, C.-T. Lu, G. Ma, S. Wang, L. Shen, P. S. Yu, and A. B. Ragin. Kernelized support tensor machines. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1442–1451. JMLR.org, 2017.

[8] M. Iwen and B. Ong. A distributed and incremental SVD algorithm for agglomerative data analysis on large networks. SIAM Journal on Matrix Analysis and Applications, 37(4):1699–1718, 2016.

[9] M. A. Iwen, D. Needell, E. Rebrova, and A. Zare. Lower memory oblivious (tensor) subspace embeddings with fewer random bits: modewise methods for least squares. SIAM Journal on Matrix Analysis and Applications, 42(1):376–416, 2021.

[10] R. Jin, T. G. Kolda, and R. Ward. Faster Johnson–Lindenstrauss transforms via Kronecker products. arXiv preprint arXiv:1909.04801, 2019.

[11] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649, 2001.

[12] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[13] F. Krahmer and R. Ward. New and improved Johnson–Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269–1281, 2011.

[14] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. MPCA: Multilinear principal component analysis of tensor objects. IEEE Transactions on Neural Networks, 19(1):18–39, Jan 2008.

[15] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-100), 1996.

[16] I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.

[17] A. Ozdemir, A. Zare, M. A. Iwen, and S. Aviyente. Multiscale analysis for higher-order tensors. In Wavelets and Sparsity XVIII, volume 11138, page 1113808. International Society for Optics and Photonics, 2019.

[18] R. Pagh. Compressed matrix multiplication. ACM Transactions on Computation Theory (TOCT), 5(3):1–17, 2013.

[19] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 239–247, 2013.

[20] Y. Shi and A. Anandkumar. Higher-order count sketch: Dimensionality reduction that retains efficient tensor operations, 2019.
[21] N. D. Sidiropoulos and R. Bro. On the uniqueness of multilinear decomposition of N-way arrays. Journal of Chemometrics, 14(3):229–239, 2000.

[22] Y. Sun, Y. Guo, C. Luo, J. Tropp, and M. Udell. Low-rank Tucker approximation of a tensor from streaming data. SIAM Journal on Mathematics of Data Science, 2(4):1123–1150, 2020.

[23] Y. Sun, Y. Guo, J. A. Tropp, and M. Udell. Tensor random projection for low memory dimension reduction. In NeurIPS Workshop on Relational Representation Learning, 2018.

[24] D. Tao, X. Li, X. Wu, W. Hu, and S. Maybank. Supervised tensor learning. Knowledge and Information Systems, 2007.

[25] R. Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge Series in Statistical and Probabilistic Mathematics, 2018.

[26] X. Liu and N. D. Sidiropoulos. Cramer-Rao lower bounds for low-rank decomposition of multidimensional arrays. IEEE Transactions on Signal Processing, 49(9):2074–2086, Sep. 2001.

[27] A. Zare, A. Ozdemir, M. A. Iwen, and S. Aviyente. Extension of PCA to higher order data structures: An introduction to tensors, tensor decompositions, and tensor PCA. Proceedings of the IEEE, 106(8):1341–1358, 2018.