FAST AND MEMORY-EFFICIENT SUBSPACE EMBEDDINGS FOR TENSOR DATA WITH APPLICATIONS

By

Ali Zare

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science and Engineering – Doctor of Philosophy

2022

ABSTRACT

The widespread use of multisensor technology and the emergence of big data sets have brought the necessity to develop more versatile tools to represent higher-order data with multiple aspects and high dimensionality. Data in the form of multidimensional arrays, also referred to as tensors, arise in a variety of applications including chemometrics, physics, hyperspectral imaging, high-resolution videos, neuroimaging, biometrics, and social network analysis. Early multiway data analysis approaches reformatted such tensor data as large vectors or matrices and then resorted to dimensionality reduction methods developed for low-dimensional data. However, by vectorizing tensors, the inherent multiway structure of the data and the possible correlations between different dimensions are lost, in some cases resulting in a degradation in the performance of vector-based methods. Moreover, in many cases, vectorizing tensors leads to vectors with extremely high dimensionality that might render most existing methods computationally impractical. In the case of dimension reduction, the enormous amount of memory needed to store the embedding matrix becomes the main obstacle. This highlights the need for approaches that are applied to tensor data in their multi-dimensional form. To reduce the dimension of an $n_1 \times n_2 \times \cdots \times n_d$ tensor to $m_1 \times m_2 \times \cdots \times m_d$ with $m_j \leq n_j$, MPCA (Multilinear Principal Component Analysis) changes the memory requirement from $\prod_{j=1}^{d} m_j n_j$ for vector PCA to $\sum_{j=1}^{d} m_j n_j$, which can be a considerable improvement. On the other hand, tensor dimension reduction methods such as MPCA need training samples for the projection matrices to be learned. This makes such methods time consuming and computationally less efficient than oblivious approaches such as the Johnson-Lindenstrauss embedding. The term oblivious refers to the fact that one does not need any data samples beforehand to learn the embedding that projects a new data sample onto a lower-dimensional space.

In this thesis, first a review of tensor concepts and algebra as well as common tensor decompositions is presented. Next, a modewise JL approach is proposed for compressing tensors without reshaping them into potentially very large vectors. Theoretical guarantees for the norm and inner product approximation errors, as well as theoretical bounds on the embedding dimension, are presented for data with low CP rank, and the corresponding effects of basis coherence assumptions are addressed. Experiments are performed using various choices of embedding matrices. Results verify the validity of one- and two-stage modewise JL embeddings in preserving the norm of MRI and synthesized data constructed from both coherent and incoherent bases. Two novel applications of the proposed modewise JL method are discussed.
(i) Approximate solutions to least squares problems as a computationally efficient way of fitting tensor decompositions: The proposed approach is incorporated as a stage in the fitting procedure, and is tested on relatively low-rank MRI data. Results show improvement in computational complexity at a slight cost in the accuracy of the solution in the Euclidean norm. (ii) Many-Body Perturbation Theory problems involving energy calculations: In large model spaces, the dimension sizes of tensors can grow fast, rendering the direct calculation of perturbative correction terms challenging. The second-order energy correction term as well as the one-body radius correction are formulated and modeled as inner products in such a way that modewise JL can be used to reduce the computational complexity of the calculations. Experiments are performed on data from various nuclei in different model space sizes, and show that in the case of large model spaces, very good compression can be achieved at the price of small errors in the estimated energy values. Copyright by ALI ZARE 2022 ACKNOWLEDGEMENTS I would like to warmly thank my academic advisor, Dr. Mark Iwen, for his supervision and tutelage throughout the course of my Ph.D. studies. I truely appreciate his support, encouragement, and patience in the face of many challenges that I tackled on my academic journey at Michigan State University. My appreciation extends to the dissertation committee members, Dr. Selin Aviyente, Dr. Rongrong Wang, and Dr. Yuying Xie for their experience and knowledge that they supported me with during research and coursework. I also hereby thank Dr. Heiko Hergert who I collaborated with and learned from in the novel application of my thesis work to Many-Body Perturbation Theory problems in physics. Finally, I would like to express my deepest gratitude to my dear parents and sister without whose tremendous and unconditional support throughout my life, and especially during the past few years, earning my Ph.D. degree would not have been possible. v TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2 BACKGROUND: TENSOR BASICS AND ALGEBRA . . . . . . . . . . . 5 CHAPTER 3 TENSOR DECOMPOSITIONS AND RANK . . . . . . . . . . . . . . . . . 14 3.1 The CANDECOMP/PARAFAC Decomposition . . . . . . . . . . . . . . . . . . . 14 3.1.1 Uniquensess of CPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.1.2 Computing CPD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2 Tensor Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.1 Low-rank approximation and border rank . . . . . . . . . . . . . . . . . . 18 3.3 Compression and the Tucker Decomposition . . . . . . . . . . . . . . . . . . . . . 18 3.3.1 ๐‘—-rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3.2 Computing the Tucker Decomposition . . . . . . . . . . . . . . . . . . . . 19 3.3.3 Uniqueness of Tucker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4 Tensor-Train Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 CHAPTER 4 DIMENSIONALITY REDUCTION OF TENSOR DATA: MODEWISE RANDOM PROJECTIONS . . . . 
. . . . . . . . . . . . . . . . . . . . . . 24 4.1 Johnson-Lindenstrauss Embeddings for Tensor Data . . . . . . . . . . . . . . . . . 24 4.2 Johnson-Lindenstrauss Embedings for Low-Rank Tensors . . . . . . . . . . . . . . 29 4.2.1 Geometry-Preserving Property of JL Embeddings for Low-Rank Tensors . . 29 4.2.1.1 Computational Complexity of Modewise Johnson-Lindenstrauss Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2.2 Main Theorems: Oblivious Tensor Subspace Embeddings . . . . . . . . . 41 4.2.3 Fast and Memory-Efficient Modewise Johnson-Lindenstrauss Embeddings . 46 4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.3.1 Effect of JL Embeddings on Norm . . . . . . . . . . . . . . . . . . . . . . 50 CHAPTER 5 APPLICATIONS OF MODEWISE JOHNSON-LINDENSTRAUSS EM- BEDDINGS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5.1 Application to Least Squares Problems and CPD Fitting . . . . . . . . . . . . . . . 54 5.1.1 Experiments: Effect of JL Embeddings on Least Squares Solutions . . . . . 59 5.1.1.1 CPD Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . 60 5.1.1.2 Compressed Least Squares Performance . . . . . . . . . . . . . . 61 vi 5.2 Application to Many-Body Purturbation Theory Problems . . . . . . . . . . . . . . 64 5.2.1 Second-order energy correction . . . . . . . . . . . . . . . . . . . . . . . 64 5.2.2 Radius Corrections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.2.3 Third-order energy correction . . . . . . . . . . . . . . . . . . . . . . . . 68 5.2.3.1 Particle-Particle . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.2.3.2 Hole-Hole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.2.3.3 Particle-Hole . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.2.4.1 ๐ธ (2) Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.2.4.2 Radius Correction Experiments . . . . . . . . . . . . . . . . . . 76 5.2.4.3 ๐ธ (3) Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 76 CHAPTER 6 EXTENSION OF VECTOR-BASED METHODS TO TENSORS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.1 Multilinear Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . 79 6.1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.1.2 Full Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.1.3 Initialization by Full Projection Truncation (FPT) . . . . . . . . . . . . . . 82 6.1.4 Determination of subspace Dimensions ๐‘ƒ ๐‘— . . . . . . . . . . . . . . . . . . 83 6.1.5 Feature Extraction and Classification . . . . . . . . . . . . . . . . . . . . . 83 6.2 Comparison between PCA, MPCA and MPS . . . . . . . . . . . . . . . . . . . . . 84 6.3 Extension of Support Vector Machine to Tensors . . . . . . . . . . . . . . . . . . . 86 6.3.1 Support Tensor Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.3.2 Support Higher-order Tensor Machine . . . . . . . . . . . . . . . . . . . . 89 6.3.3 Kernelized Support Tensor Machine . . . . . . . . . . . . . . . . . . . . . 89 6.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 APPENDIX A USEFUL NOTIONS, DEFINITIONS, AND RELATIONS . . . . . . . 
98 APPENDIX B MEMORY-EFFICIENT MODE-WISE PROJECTION CALCULA- TIONS OF THE ENERGY TERMS . . . . . . . . . . . . . . . . . . . 106 APPENDIX C FASTER KRONECKER JOHNSON-LINDENTRAUSS TRANSFORM109 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 vii LIST OF TABLES Table 5.1: Basis truncation parameters and mode dimensions for single-particle bases labeled by ๐‘’Max. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 viii LIST OF FIGURES Figure 2.1: An example of a 3 ร— 4 ร— 5 tensor. . . . . . . . . . . . . . . . . . . . . . . . . . 5 Figure 2.2: Visualization of a 5-mode tensor by stacking a 3-mode tensor along its 1st and 2nd modes. The result is a 5-mode tensor of size 3 ร— 4 ร— 5 ร— 3 ร— 2. Note that the elements of the stacked versions are not necessarily the same as they are elements corresponding to different indices in the 5-mode tensor. . . . . . . 6 Figure 2.3: An example of the fibers of a 3-mode tensor. Left: mode-3 fibers. Right: mode-1 fibers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 Figure 2.4: An example showing how the mode-3 slices of a 3-mode tensor are formed. . . 7 Figure 2.5: An example showing how the mode-1 unfolding of a 3-mode tensor is formed using its mode-1 fibers. Colors are used to show how this is done in a column-major format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 Figure 4.1: An example of 2-stage JL embedding applied to a 3-dimensional tensor X โˆˆ R3ร—4ร—5 . The output of the 1st stage is the projected tensor Y = X ร—1 A (1) ร—2 A (2) ร—3 A (3) , where A ( ๐‘—) are JL matrices for ๐‘— โˆˆ {1, 2, 3}, A (1) โˆˆ R2ร—3 , A (2) โˆˆ R3ร—4 , and A (3) โˆˆ R4ร—5 , resulting in Y โˆˆ R2ร—3ร—4 . Matching colors have been used to show how the rows of A ( ๐‘—) interact with the mode- ๐‘— fibers of X (and the intermediate partially compressed tensors) to generate the elements of the mode- ๐‘— unfolding of the result after each ๐‘—-mode product. Next, the resulting tensor is vectorized (leading to y โˆˆ R24 ), and a 2nd -stage JL is then performed to obtain z = Ay where A โˆˆ R3ร—24 , and z โˆˆ R3 . . . . . . . 45 Figure 4.2: Relative norm of randomly generated 4-dimensional data. Here, the total compression will be ๐‘ ๐‘ก๐‘œ๐‘ก = ๐‘41 . (a) Gaussian data. (b) Coherent data. Note that the modewise approach still preserves norms well for the coherent data indicating that the incoherence assumptions utilized in Section 4.2 can likely be relaxed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 Figure 4.3: Simulation results averaged over 1000 trials for 3 MRI data samples, where each sample is 3-dimensional. In the 2-stage cases, ๐‘ 2 = 0.05 has been used. (a) Relative norm. (b) Runtime. . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Figure 5.1: Relative reconstruction error of CPD calculated for different values of rank ๐‘Ÿ for MRI data. As the rank increases, the error becomes smaller. . . . . . . . . . 61 ix Figure 5.2: Effect of JL embeddings on the relative reconstruction error of least squares estimation of CPD coefficients. In the 2-stage cases, ๐‘ 2 = 0.05 has been used. (a) ๐‘Ÿ = 40. (b) ๐‘Ÿ = 75. (c) ๐‘Ÿ = 110. (d) Average runtime for ๐‘Ÿ = 40. The other runtime plots for ๐‘Ÿ = 75 and ๐‘Ÿ = 110 are qualitatively identical. . . . . . . 63 Figure 5.3: A block diagram showing how the approximations to ๐‘…1 and ๐‘…2 are calculated. . 
67 Figure 5.4: ๐ธ (2) experiment results for O16, ๐‘’๐‘€๐‘Ž๐‘ฅ = 2. . . . . . . . . . . . . . . . . . . . 73 Figure 5.5: ๐ธ (2) experiment results for O16, ๐‘’๐‘€๐‘Ž๐‘ฅ = 4. . . . . . . . . . . . . . . . . . . . 74 Figure 5.6: ๐ธ (2) experiment results for O16, ๐‘’๐‘€๐‘Ž๐‘ฅ = 8. . . . . . . . . . . . . . . . . . . . 75 Figure 5.7: Relative error in ๐ธ (2) for total compression values of 0.0009 and 0.0125. (a) O16. (b) Sn132. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 Figure 5.8: Radius correction results, for interaction em1.8 โˆ’ 2.0 and eMax= 14. (a) Ca48, particle term. (b) Ca48, hole term. (c) Sn132, particle term. (d) Sn132, hole term. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Figure 5.9: Mean absolute relative error in ๐ธ (3) for hole-hole and ๐‘’๐‘€๐‘Ž๐‘ฅ = 8. . . . . . . . . 77 Figure 5.10: Mean absolute relative error in ๐ธ (3) for particle-particle and ๐‘’๐‘€๐‘Ž๐‘ฅ = 8. . . . . . 77 Figure 5.11: Mean absolute relative error in ๐ธ (3) for particle-hole and ๐‘’๐‘€๐‘Ž๐‘ฅ = 8. . . . . . . 78 Figure 6.1: Gray-scale sample images of five objects in the COIL-100 database. . . . . . . 85 Figure 6.2: A lateral slice of a sample MRI image. . . . . . . . . . . . . . . . . . . . . . . 85 Figure 6.3: Training time. (a) COIL-100. (b) MRI. . . . . . . . . . . . . . . . . . . . . . . 86 Figure 6.4: Classification Success Rate. (a) COIL-100. (b) MRI. . . . . . . . . . . . . . . 86 x LIST OF ALGORITHMS Algorithm 3.1: CPD-ALS [12] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 Algorithm 3.2: HOOI-ALS [12] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Algorithm 3.3: Tensor-Train [16] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Algorithm 6.1: MPCA [14] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 xi CHAPTER 1 INTRODUCTION The emergence of big data elicits the development of compression methods to efficiently represent such data without losing much information. One of the most well-known techniques to this end is PCA1 which uses the linear structure of a high-dimensional vector and projects it onto the underlying lower-dimensional subspace [2]. However, when the number of dimensions in the data increases, as is the case with matrices and cubes, reshaping the data into a vector becomes troublesome in the sense that it will require huge amounts of memory to store the matrix that will project data elements onto their corresponding principal components. This problem, for instance, can be observed in the case of MRI2 data. Take, for instance, a 240 ร— 240 ร— 155 cube from an MRI data set, containing 8928000 data elements. To reduce the dimensionalty of this vector to 0.1% of its original size, a 8928 ร— 8928000 projection matrix needs to be generated. This means an approximate 594 Gigabytes of data only to store the matrix. It is obvious how intensive memory requirements could be if higher-dimensional data with larger mode sizes were to be dealt with. This simple example clearly demonstrates the importance of dimension reduction techniques that do not rely on the vectorization of higher-dimensional data, and that deal with such data in their original multidimensional form. In addition to computational considerations, one can intuitively observe that if the multilinear structure of tensor data is changed, e.g. 
by vectorization, many conventional approaches that were initially developed for vector data might not yield the same satisfactory results. Generally speaking, although tensors are natural extensions to vectors and matrices, their multidimensional form adds extra complexity to their structure in a way that the notion low-dimensionality becomes challenging to address, especially simply as an extension of the same notion from matrices to tensors. This 1 Principal Component Analysis 2 Magnetic Resonance Imaging 1 issue will be addressed in Section 3.2 where tensor rank is introduced, and it is discussed that there are various notions of rank for tensor data. In the experiments of Section 6.2, it is shown that for higher-dimensional data, a conventional vector-based method such as PCA becomes both computationally more expensive and less accurate compared to its multilinear counterpart. In this case, an extension of PCA to tensor data, abbreviated to MPCA3, will be presented in Chapter 6. This method performs PCA on fibers of the unfoldings of a tensor one mode at a time, and is specifically useful to compress tensors with low Tucker ranks [14]. An efficient method to compute MPCA would be to compress the data using the general randomized embedding proposed in [9], which constitutes the central body of work in this thesis, before starting the main algorithm. This approach can in general be applied as a preprocessing stage to large scale tensor data to alleviate the computational intensity of the subsequent processing scheme. To make tensor compression even more computationally efficient, randomized streaming mappings are of great value where one will not have to store a big tensor to compute its compressed form. Rather, a sketch of the unfoldings of the tensor along with a sketch of the core in the factorization of interest will be enough to construct an approximate version of the decomposition. Such a method has been proposed to compute of the Tucker approximation in [22], and can also be applied to MPCA. On the other hand, dimension reduction methods such as PCA and its extensions, including MPCA, need a set of training samples (vectors, or tensors in the multilinear case) for the projection matrix or matrices to be found. This makes these methods time consuming and computationally less efficient than oblivious approaches such as the Johnson-Lindenstrauss embedding. The term oblivious refers to the fact that one does not need any data sample beforehand to project a new data sample onto a lower-dimensional subspace. In recent years, many tensor dimension reduction techniques that can be applied to low-rank data have been proposed in the literature. Most of these methods are limited to some degree in 3 Multilinear Principal Component Analysis 2 the sense that they either are limited in the number of data modes and/or lack general theoretical guarantees on the error in the geometry preserving property of the embedding. In [23], such theoretical guarantees have been presented for the vectorized form of a tensor projected using the Khatri-Rao product of individual random projections of smaller sizes, and have been extended to 2-mode tensors. In [18, 19], the CountSketch projection matrix has been extended to tensors, named TensorSketch, although the multilinear structure of the tensor is again not preserved. In a more recent version [20], however, TensorSketch is developed based on the Tucker format to extend CountSketch to tensor data. 
The TensorSketch method is mainly developed for polynomial kernels as a special case of rank-1 tensors [19, 3]. There are other more related methods that assume specific data structure for the input tensors and at the same time respect their multimodal structure. In a closely related method abbreviated to KFJLT4 [10], a very large Fast Johnson-Lindenstrauss embedding matrix, addressed in Section 4.2.3, is used in the form of the Kronecker product of smaller fast embeddings, and is applied to a vector also having Kronecker structure corresponding to the vectorized form of a rank-1 tensor. The implementation of KFJLT, however, is done without actually forming the extremely large embedding matrix or input vector, and the elements of the compressed vector (or tensor, equivalently) can be calculated efficiently using the smaller embeddings implicitly applied to the corresponding tensor modes. In each mode, this is made possible by using random rows of the DFT matrix, a vector consisting of Rademacher random variables, and smaller vectors that form the Kronecker structure of the input data. The rank-1 property of the input tensor naturally draws oneโ€™s attention to KFJLT being suitable for the efficient computation of CP decompositions. This method leads to lower computational cost at a small price in the embedding dimension size, and its computational efficiency originates from both the inherent speed of the fast embedding matrices used and also the Kronecker structure of the input data in the vectorized form. KFJLT is shown to work for tensors having general structure with a reduction in performance. A short summary of 4 Kronecker Faster Johnson-Lindenstrauss Transform. 3 how computational efficiency is achieved in this method works is presented in Appendix C. The work presented in this dissertation contains materials discussed in the following publica- tions. In [27], extensions of PCA and its variants to tensor data are discussed for those who are well familiar with PCA methods for vector-type data. In [17], a multiscale HoSVD5 approach has been developed to compress tensors in multiple scales. In the 0th scale, truncated HoSVD is performed on data. The reconstruction error tensor is then partitioned into subtensors in the 1st scale using a clustering algorithm, and truncated HoSVD is applied to each subtensor. This process can be repeated in higher scales depending on the needed tradeoff between reconstruction error and compression. In [9] which constitues the main body of this thesis presented in Chapter 4, a mod- ewise Johnson-Lindenstrauss embedding has been proposed for compressing tensor data without reshaping the tensor into an extremely large vector. Theoretical guarantees for the approximation error have been presented for data with low CP rank. 5 Higher-order Singular Value Dicomposition 4 CHAPTER 2 BACKGROUND: TENSOR BASICS AND ALGEBRA In this chapter, basic concepts and algebraic relations that are used in the statement of problems and proofs are presented. Notation. The type of letters used for tensors, matrices, vectors and scalars are as follows. Calligraphic boldface capital letters (e.g., X) are used for tensors, boldface capital letters for matrices (e.g., X), boldface lower-case letters for vectors (e.g., x), and regular (lower-case or capital) letters for scalars (e.g., ๐‘ฅ or ๐‘‹). For numbers in parentheses used as subscript or superscript, subscript refers to โ€œunfoldingsโ€ while superscript denotes an object in a sequence of objects. We assume [๐‘‘] := {1, . . . 
, ๐‘‘} for all ๐‘‘ โˆˆ N. Whenever used, the vec(ยท) operator generates the vectorized form of its argument. In the following definitions, we assume X โˆˆ C๐‘› ร—...ร—๐‘› . 1 ๐‘‘ Definition 2.0.1 A โ€œtensorโ€ is a multi-dimensional array. A ๐‘‘-way, ๐‘‘-mode or ๐‘‘ th -order tensor is an element of the tensor product of ๐‘‘ vector spaces. Figure 2.1 shows an example of a 3-mode tensor. A tensor is called โ€œcubicalโ€ is all its modes are of the same size, i.e., X โˆˆ C๐‘›ร—...ร—๐‘› . Figure 2.1: An example of a 3 ร— 4 ร— 5 tensor. One can stack 3-mode tensors along modes 1, 2, and 3 of a 3-mode tensor to visualize higher-order data in the 3-dimensional space, where the elements of the stacked versions can obviously be different. For instance, Figure 2.2 illustrates this idea where the 3-mode tensor of Figure 2.1 is 5 stacked along the 1st and 2nd dimensions to simulate the 4th and 5th modes, respectively, leading to a 3 ร— 4 ร— 5 ร— 3 ร— 2 tensor. Figure 2.2: Visualization of a 5-mode tensor by stacking a 3-mode tensor along its 1st and 2nd modes. The result is a 5-mode tensor of size 3 ร— 4 ร— 5 ร— 3 ร— 2. Note that the elements of the stacked versions are not necessarily the same as they are elements corresponding to different indices in the 5-mode tensor. Definition 2.0.2 (Mode- ๐’‹ Fiber) In a ๐‘‘-mode tensor X, a mode- ๐‘— fiber is obtained by fixing all C but the ๐‘— th index, and is denoted by X๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,:, ๐‘— ๐‘—+1 ,...,๐‘– ๐‘‘ โˆˆ ๐‘› ๐‘— for ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ] and ๐‘— โˆˆ [๐‘‘]. There are รŽ ๐‘› ๐‘— mode- ๐‘— fibers. Figure 2.3 depicts how the fibers of a 3-mode tensor are formed. ๐‘–โ‰  ๐‘— Figure 2.3: An example of the fibers of a 3-mode tensor. Left: mode-3 fibers. Right: mode-1 fibers. Definition 2.0.3 (Mode- ๐’‹ Slice) In a ๐‘‘-mode tensor X, a mode- ๐‘— โ€œsliceโ€ is a (๐‘‘ โˆ’ 1)-mode subtensor obtained by fixing the ๐‘— th index. A mode- ๐‘— slice of X is denoted by X:,...,:,๐‘˜,:,...,: โˆˆ C๐‘› ร—...ร—๐‘› 1 ๐‘—โˆ’1 ร—๐‘› ๐‘—+1 ร—...ร—๐‘› ๐‘‘ for ๐‘– ๐‘— = ๐‘˜ โˆˆ [๐‘› ๐‘— ]. There are ๐‘› ๐‘— mode- ๐‘— slices. 6 Figure 2.4: An example showing how the mode-3 slices of a 3-mode tensor are formed. Definition 2.0.4 (Matricization of a Tensor) The process of reshaping a tensor X into a matrix is called โ€œmatricizationโ€, โ€œflatteningโ€ or โ€œunfoldingโ€. The most common way of doing this is C๐‘› ร— รŽ ๐‘›๐‘– the mode- ๐‘— unfolding, denoted by X ( ๐‘—) โˆˆ ๐‘— ๐‘–โ‰  ๐‘— , which has all the mode- ๐‘— fibers of X as its columns. The process of matricization of tensors is linear, in the sense that for X, Y โˆˆ C๐‘› ร—...ร—๐‘› , 1 ๐‘‘ (๐›ผX + ๐›ฝY) ( ๐‘—) = ๐›ผX ( ๐‘—) + ๐›ฝY ( ๐‘—) for ๐‘— โˆˆ [๐‘‘] and ๐›ผ, ๐›ฝ โˆˆ C. Figure 2.5: An example showing how the mode-1 unfolding of a 3-mode tensor is formed using its mode-1 fibers. Colors are used to show how this is done in a column-major format. C Lemma 2.0.1 Assume we have a tensor X โˆˆ ๐‘›1 ร—๐‘›2 ร—ยทยทยทร—๐‘›๐‘‘ , and for ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ], ๐‘— โˆˆ [๐‘‘] and รŽ โ„“ โˆˆ [ ๐‘‘๐‘š=1 ๐‘›๐‘š ], we want to find the element (๐‘– ๐‘— , โ„“) of the mode- ๐‘— unfolding X ( ๐‘—) corresponding ๐‘šโ‰  ๐‘— to the element (๐‘–1 , ๐‘–2 , . . . , ๐‘– ๐‘‘ ) of X1. 
Then, for a given (๐‘– ๐‘— , โ„“), we have X ( ๐‘—) (๐‘– ๐‘— , โ„“) = X๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘– ๐‘— ,๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ , 1 In this thesis, we always use column-major formatting when reshaping tensors to matrices/vectors and vice versa. This essentially means going from lower to higher modes when moving across dimensions. 7 where ร ๐‘‘ ๐‘˜โˆ’1 รŽ $โ„“ โˆ’ 1 โˆ’ (๐‘– ๐‘˜ โˆ’ 1) ๐‘›๐‘š % ๐‘˜=๐‘™+1 ๐‘š=1 ๐‘˜โ‰  ๐‘— ๐‘šโ‰  ๐‘— ๐‘–๐‘™ = + 1, for ๐‘™ โ‰  ๐‘—, (2.1) ๐‘™โˆ’1 รŽ ๐‘›๐‘š ๐‘š=1 ๐‘šโ‰  ๐‘— starting from ๐‘™ = ๐‘‘ and going down to ๐‘™ = 1. Obviously, for ๐‘™ = ๐‘‘, the numerator is reduced to โ„“ โˆ’ 1, and for ๐‘™ = 1, the denominator becomes 1. It is also clear that ๐‘– ๐‘— will be the same in both the tensor and the unfolding. On the other hand, assume we want to obtain a tensor X from its mode- ๐‘— unfolding X ( ๐‘—) . Given indices (๐‘– 1 , . . . , ๐‘– ๐‘‘ ), we want to find the corresponding coordinates (๐‘– ๐‘— , โ„“) in X ( ๐‘—) . We can do so using โˆ‘๏ธ๐‘‘ ๐‘˜โˆ’1 ร– hร– ๐‘‘ i โ„“ =1+ (๐‘– ๐‘˜ โˆ’ 1) ๐‘›๐‘š , for โ„“ โˆˆ ๐‘›๐‘š , (2.2) ๐‘˜=1 ๐‘š=1 ๐‘š=1 ๐‘˜โ‰  ๐‘— ๐‘šโ‰  ๐‘— ๐‘šโ‰  ๐‘— meaning that X(๐‘– 1 , . . . , ๐‘– ๐‘—โˆ’1 , ๐‘– ๐‘— , ๐‘– ๐‘—+1 , . . . , ๐‘– ๐‘‘ ) = X ( ๐‘—) (๐‘– ๐‘— , โ„“), with โ„“ defined above [12]. Definition 2.0.5 (The Standard Inner Product Space of ๐’…-mode Tensors) The set of all ๐‘‘-mode tensors X โˆˆ C๐‘› ร—...ร—๐‘› 1 ๐‘‘ forms a vector space over the field of complex numbers when equipped with component-wise addition and scalar multiplication. The inner product of X and Y is defined as โˆ‘๏ธ๐‘›1 โˆ‘๏ธ ๐‘›2 โˆ‘๏ธ๐‘›๐‘‘ hX, Yi := ... X๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ Y๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ . (2.3) ๐‘– 1 =1 ๐‘–2 =1 ๐‘– ๐‘‘ =1 The standard Euclidean norm can be deduced from this inner product, as v u tโˆ‘๏ธ ๐‘›1 โˆ‘๏ธ ๐‘›2 โˆ‘๏ธ๐‘›๐‘‘ โˆš๏ธ 2 kXk := hX, Xi = ... X๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ . (2.4) ๐‘–1 =1 ๐‘–2 =1 ๐‘– ๐‘‘ =1 8 Definition 2.0.6 (Tensor Outer Product) The tensor outer product of two tensors X โˆˆ C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› 1 2 ๐‘‘ and Y โˆˆ C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› 0 1 0 2 0 ๐‘‘0 , denoted by X Yโˆˆ C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› ร—๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› 1 2 ๐‘‘ 0 1 2 0 0 ๐‘‘0 , is a (๐‘‘ + ๐‘‘ 0)-mode tensor whose entries are given by (X Y) ๐‘–1 ,...,๐‘– ๐‘‘ ,๐‘– 0 ,...,๐‘– 0 0 = X๐‘–1 ,...,๐‘– ๐‘‘ Y๐‘–10 ,...,๐‘– ๐‘‘0 0 . (2.5) 1 ๐‘‘ When X and Y are both vectors, the tensor outer product will be reduced to the standard outer product. Definition 2.0.7 (Rank-1 Tensor) A ๐‘‘-mode tensor X โˆˆ C๐‘› ร—...ร—๐‘› 1 ๐‘‘ is rank-1 if it can be written as the outer product of ๐‘‘ vectors, i.e., X = x (1) x (2) ... x (๐‘‘) =: ๐‘‘ ๐‘—=1 x , ( ๐‘—) (2.6) where x ( ๐‘—) โˆˆ C๐‘› ๐‘— for ๐‘— โˆˆ [๐‘‘]. Definition 2.0.8 ( ๐’‹-mode Product) The ๐‘—-mode product of a ๐‘‘-mode tensor X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘—โˆ’1 ร—๐‘› ๐‘— ร—๐‘› ๐‘—+1 ร—ยทยทยทร—๐‘› ๐‘‘ with a matrix U โˆˆ C๐‘š ร—๐‘› ๐‘— ๐‘— is another ๐‘‘-mode tensor X ร— ๐‘— U โˆˆ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘—โˆ’1 ร—๐‘š ๐‘— ร—๐‘› ๐‘—+1 ร—ยทยทยทร—๐‘› ๐‘‘ whose entries are given by ๐‘›๐‘— โˆ‘๏ธ (X ร— ๐‘— U)๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,โ„“,๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ = X๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘– ๐‘‘ Uโ„“,๐‘– ๐‘— . (2.7) ๐‘– ๐‘— =1 for all (๐‘–1 , . . . , ๐‘– ๐‘—โˆ’1 , โ„“, ๐‘– ๐‘—+1 , . . . 
, ๐‘– ๐‘‘ ) โˆˆ [๐‘›1 ] ร— ยท ยท ยท ร— [๐‘› ๐‘—โˆ’1 ] ร— [๐‘š ๐‘— ] ร— [๐‘› ๐‘—+1 ] ร— ยท ยท ยท ร— [๐‘› ๐‘‘ ]. In terms of the mode- ๐‘— unfoldings of X ร— ๐‘— U and X, it can be observed that (X ร— ๐‘— U) ( ๐‘—) = UX ( ๐‘—) holds for all ๐‘— โˆˆ [๐‘‘]. Lemma 2.0.2 Let X, Y โˆˆ C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› , A, B โˆˆ C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› 1 2 ๐‘‘ 0 1 2 0 0 ๐‘‘0 , ๐›ผ, ๐›ฝ โˆˆ C, and Uโ„“ , Vโ„“ โˆˆ C๐‘š ร—๐‘› โ„“ โ„“ for all โ„“ โˆˆ [๐‘‘]. The following four properties hold: (a) (๐›ผX + ๐›ฝY) A = ๐›ผX A + ๐›ฝY A=X ๐›ผA + Y ๐›ฝA (b) hX A, Y Bi = hX, Yi hA, Bi 9   (c) (๐›ผX + ๐›ฝY) ร— ๐‘— U ๐‘— = ๐›ผ X ร— ๐‘— U ๐‘— + ๐›ฝ Y ร— ๐‘— U ๐‘— .    (d) X ร— ๐‘— ๐›ผU ๐‘— + ๐›ฝV ๐‘— = ๐›ผ X ร— ๐‘— U ๐‘— + ๐›ฝ X ร— ๐‘— V ๐‘— .  (e) If ๐‘— โ‰  โ„“ then X ร— ๐‘— U ๐‘— ร—โ„“ Vโ„“ = X ร— ๐‘— U ๐‘— ร—โ„“ Vโ„“ = (X ร—โ„“ Vโ„“ ) ร— ๐‘— U ๐‘— = X ร—โ„“ Vโ„“ ร— ๐‘— U ๐‘— . (f) If ๐‘Š โˆˆ C ๐‘ร—๐‘š ๐‘—  then X ร— ๐‘— U ๐‘— ร— ๐‘— W = X ร— ๐‘— U ๐‘— ร— ๐‘— W = X ร— ๐‘— WU ๐‘— = X ร— ๐‘— WU ๐‘— .  Proof The proof of (a) can be done element-wise. ((๐›ผX + ๐›ฝY) A) ๐‘–1 ,...,๐‘– ๐‘‘ ,๐‘– 0 ,...,๐‘– 0 0 = (๐›ผX + ๐›ฝY) ๐‘–1 ,...,๐‘– ๐‘‘ A๐‘–10 ,...,๐‘– ๐‘‘0 0 1 ๐‘‘  = ๐›ผX๐‘–1 ,...,๐‘– ๐‘‘ + ๐›ฝY๐‘–1 ,...,๐‘– ๐‘‘ A๐‘–10 ,...,๐‘– ๐‘‘0 0 . To prove (b), we note that 0 ๐‘›๐‘‘ 00 ๐‘›1 ๐‘› ๐‘‘ โˆ‘๏ธ ๐‘›1 โˆ‘๏ธ โˆ‘๏ธ โˆ‘๏ธ hX A, Y Bi = ยทยทยท ยทยทยท X๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ A๐‘–10 ,...,๐‘– ๐‘‘0 0 Y๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ B๐‘–10 ,...,๐‘– ๐‘‘0 0 ๐‘–1 =1 ๐‘– ๐‘‘ =1 ๐‘–10 =1 ๐‘– ๐‘‘0 =1 ๐‘›1 ๐‘›๐‘‘ ! ๐‘›10 ๐‘› ๐‘‘0 0 โˆ‘๏ธ โˆ‘๏ธ ยฉโˆ‘๏ธ โˆ‘๏ธ 0 0 0 0 ยช = ยทยทยท X๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ Y๐‘–1 ,๐‘–2 ,...,๐‘– ๐‘‘ ยญ ยทยทยท A๐‘–1 ,...,๐‘– ๐‘‘ 0 B๐‘–1 ,...,๐‘– ๐‘‘ 0 ยฎ ๐‘– 1 =1 ๐‘– ๐‘‘ =1 ๐‘– 0 =1 ๐‘– 0 =1 ยซ1 ๐‘‘ ยฌ = hX, Yi hA, Bi . The proof of (c), (d), and (f) can be done using the definition of the mode- ๐‘— unfolding. For (e), suppose that โ„“ > ๐‘— (the case where โ„“ < ๐‘— is similar). Set U := U ๐‘— and V := Vโ„“ to simplify subscript 10 notation. We have for all ๐‘˜ โˆˆ [๐‘š ๐‘— ], ๐‘™ โˆˆ [๐‘š โ„“ ], and ๐‘– ๐‘ž โˆˆ [๐‘›๐‘ž ] with ๐‘ž โˆ‰ { ๐‘—, โ„“} that โˆ‘๏ธ๐‘›โ„“    X ร— ๐‘— U ร—โ„“ V ๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘˜,๐‘– ๐‘—+1 ,...,๐‘–โ„“โˆ’1 ,๐‘™,๐‘–โ„“+1 ,...,๐‘– ๐‘‘ = X ร—๐‘— U ๐‘– 1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘˜,๐‘– ๐‘—+1 ,...,๐‘–โ„“ ,...,๐‘– ๐‘‘ V๐‘™,๐‘–โ„“ ๐‘–โ„“ =1 โˆ‘๏ธ๐‘›โ„“ โˆ‘๏ธ ๐‘›๐‘— = X๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘–โ„“ ,...,๐‘– ๐‘‘ U ๐‘˜,๐‘– ๐‘— ยฎ V๐‘™,๐‘–โ„“ ยฉ ยช ยญ ๐‘–โ„“ =1 ยซ๐‘– ๐‘— =1 ๐‘› ๐‘— โˆ‘๏ธ ๐‘›โ„“ !ยฌ โˆ‘๏ธ = X๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘–โ„“ ,...,๐‘– ๐‘‘ V๐‘™,๐‘–โ„“ U ๐‘˜,๐‘– ๐‘— ๐‘– ๐‘— =1 ๐‘–โ„“ =1 โˆ‘๏ธ๐‘›๐‘— = (X ร—โ„“ V) ๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘–โ„“โˆ’1 ,๐‘™,๐‘–โ„“+1 ,...,๐‘– ๐‘‘ U ๐‘˜,๐‘– ๐‘— ๐‘– ๐‘— =1  = (X ร—โ„“ V) ร— ๐‘— U ๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘˜,๐‘– ๐‘—+1 ,...,๐‘–โ„“โˆ’1 ,๐‘™,๐‘–โ„“+1 ,...,๐‘– ๐‘‘ . ? ๐‘‘ Note 2.0.1 Unfolding the tensor Y = X ร—1 U (1) ร—2 U (2) ... ร—๐‘‘ U (๐‘‘) =: X U ( ๐‘—) along the ๐‘— th ๐‘—=1 mode is equivalent to Y ( ๐‘—) = U ( ๐‘—) X ( ๐‘—) (U (๐‘‘) โŠ— ยท ยท ยท โŠ— U ( ๐‘—+1) โŠ— U ( ๐‘—โˆ’1) โŠ— ยท ยท ยท โŠ— U (1) ) > , (2.8) where โŠ— denotes the matrix Kronecker product. 
If X is a superdiagonal tensor2, then all matrices U ( ๐‘—) must have the same number of columns, and (2.8) will be simplified to Y ( ๐‘—) = U ( ๐‘—) X (U (๐‘‘) ยทยทยท U ( ๐‘—+1) U ( ๐‘—โˆ’1) ยทยทยท U (1) ) > , (2.9) where X is a diagonal matrix with the superdiagonal of X as its diagonal. The symbol denotes the Khatri-Rao product, which is defined as the column-wise matching Kronecker product, i.e., for matrices A = [a1 , . . . , a๐ฝ ] โˆˆ C๐ผร—๐ฝ and B = [b1, . . . , b๐ฝ ] โˆˆ C๐พร—๐ฝ , A B = [a1 โŠ— b1 , . . . , a๐ฝ โŠ— b๐ฝ ] โˆˆ C๐ผ๐พร—๐ฝ . The reason for (2.9) being a simplified form of (2.8) lies in the fact that when a ๐‘‘-mode tensor X โˆˆ C๐‘›ร—...ร—๐‘› is superdiagonal, all columns of X( ๐‘—) are zeros except for ๐‘› of them spread evenly, 2A ๐‘‘-mode superdiagonal tensor X โˆˆ C๐‘›ร—...ร—๐‘› is cubical and has nonzero elements only at indices (๐‘–, . . . , ๐‘–) for ๐‘– โˆˆ [๐‘›]. 11 where in the โ„“ th column, only the entry in position ((โ„“ โˆ’ 1) mod ๐‘›) + 1 is nonzero3. This means that all but matching columns in the Kronecker product will be crossed out in the final result, simplifying the kronecker product to the Khatri-Rao product, and reducing X ( ๐‘—) in (2.8) to a diagonal matrix X in (2.9). Note 2.0.2 Vectorizing the tensor Y = X ร—1 U (1) ร—2 U (2) ... ร—๐‘‘ U (๐‘‘) , it is straightforward to show that   y = U (๐‘‘) โŠ— ยท ยท ยท โŠ— U (2) โŠ— U (1) x, (2.10) where x and y are the vectorized forms of X and Y, respectively. Definition 2.0.9 ( ๐’‹-mode Vector Product) The ๐‘—-mode product of a ๐‘‘-mode tensor X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ with a vector v โˆˆ C๐‘› ๐‘— is a (๐‘‘ โˆ’ 1)-mode tensor, and is denoted by X โ€ข ๐‘— v, whose elements are obtained using โˆ‘๏ธ๐‘›๐‘—  X โ€ข๐‘— v ๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ = X๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘– ๐‘‘ v๐‘– ๐‘— , (2.11) ๐‘– ๐‘— =1 which means that the ๐‘— th mode of X is contracted with v. If we want to keep the ๐‘— th mode with dimension size 1, meaning X โ€ข ๐‘— v โˆˆ C๐‘› ร—๐‘› 1 ๐‘—โˆ’1 ร—1ร—๐‘› ๐‘—+1 ...ร—๐‘› ๐‘‘ , a useful interpretation of this will be X โ€ข ๐‘— v = X ร— ๐‘— v> , (2.12) which is equivalent to  > v> X ( ๐‘—) = X โ€ข ๐‘— v  ( ๐‘—) = vec X โ€ข ๐‘— v . (2.13) The mode- ๐‘— vector product can be used to define eigenvalue problems for tensors. For a super- symmetric tensor4 X โˆˆ C๐‘›ร—...ร—๐‘› , the scalar ๐œ† is an eigenvalue with the corresponding eigenvector vโˆˆ C๐‘› if X โ€ข2 v โ€ข3 v ยท ยท ยท โ€ข๐‘‘ v = ๐œ†v. 3 For integers ๐‘Ž and ๐‘, we assume that ๐‘Ž mod ๐‘ โˆˆ {0, . . . , ๐‘ โˆ’ 1}. 4A cubical tensor is called supersymmetric if its elements remain the same under any permutaion of indices. Obviously, superdiagonal tensors are also supersymmetric. 12 The ๐‘—-mode vector product is also used in developing Support Vector Machines for tensors, where ๐‘‘ optimization problems for ๐‘‘ modes of a tensor are mixed to form one problem for the whole tensor [6]. Definition 2.0.10 (Tensor (๐’Œ, ๐’‹)-Contraction) Consider tensors X โˆˆ C๐‘› ร—...ร—๐‘› ร—...ร—๐‘› 1 ๐‘— ๐‘‘ and Y โˆˆ C๐‘š ร—...ร—๐‘š 1 ๐‘˜โˆ’1 ร—๐‘› ๐‘— ร—๐‘š ๐‘˜+1 ร—...ร—๐‘š ๐‘‘ 0 . 
Then, for each ๐‘— โˆˆ [๐‘‘] and ๐‘˜ โˆˆ [๐‘‘ 0], the (๐‘˜, ๐‘—)-contraction of X and Y, which is the contraction of modes ๐‘— of X and ๐‘˜ of Y, is a (๐‘‘ + ๐‘‘ 0 โˆ’ 2)-dimensional array denoted by Z := X ร— ๐‘˜๐‘— Y โˆˆ C๐‘› ร—...ร—๐‘› 1 ๐‘—โˆ’1 ร—๐‘š โ„“ ร—๐‘› ๐‘—+1 ร—...ร—๐‘› ๐‘‘ ร—๐‘š ๐ฟ 0 ร—...ร—๐‘š ๐ฟ 0 0 1 ๐‘‘ โˆ’2 , where โ„“ = min{[๐‘‘ 0] \ ๐‘˜ }, ๐ฟ 0 := [๐‘‘ 0] \ {๐‘˜, โ„“}, and ๐ฟ 0โ„Ž denotes the โ„Žth element of the set ๐ฟ 0. It is observed that โ„“ = 1 for all choices of ๐‘˜ except for ๐‘˜ = 1 in which case โ„“ = 2. Element-wise, โˆ‘๏ธ๐‘›๐‘— Z๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘žโ„“ ,๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ ,๐‘ž ๐ฟ 0 ,...,๐‘ž ๐ฟ 0 = X๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘– ๐‘‘ Y๐‘ž1 ,...,๐‘ž ๐‘˜โˆ’1 ,๐‘– ๐‘— ,๐‘ž ๐‘˜+1 ,...,๐‘ž ๐‘‘ 0 , 1 ๐‘‘ 0 โˆ’2 ๐‘– ๐‘— =1 (2.14) for ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ], ๐‘— โˆˆ [๐‘‘], ๐‘ž โ„“ โˆˆ [๐‘š โ„“ ], ๐‘ž ๐ฟ ๐‘–0 โˆˆ [๐‘š ๐ฟ โ„Ž0 ], โ„Ž โˆˆ [๐‘‘ 0 โˆ’ 2]. Note 2.0.3 If ๐‘˜ = ๐‘‘ 0 = 2, then โ„“ = min{[2] \ 2} = 1 and ๐ฟ 0 = [2] \ {1, 2} = โˆ…, and the (2, ๐‘—)- contraction of X and Y (which is a matrix now, denoted by Y) is reduced to the familiar ๐‘—-mode product X ร— ๐‘— Y. Note 2.0.4 One can also define the (๐‘˜, ๐‘—)-contraction of X and Y in a way that modes of Y are interleaved right after contracting the ๐‘— th mode of X, i.e., Zโˆˆ C๐‘› ร—...ร—๐‘› 1 ๐‘—โˆ’1 ร—๐‘š โ„“ ร—๐‘š ๐ฟ 0 ร—...ร—๐‘š ๐ฟ 0 0 ร—๐‘› ๐‘—+1 ร—...ร—๐‘› ๐‘‘ 1 ๐‘‘ โˆ’2 , and โˆ‘๏ธ๐‘›๐‘— Z๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,๐‘žโ„“ ,๐‘ž ๐ฟ 0 ,...,๐‘ž ๐ฟ 0 ,๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ = X๐‘–1 ,...,๐‘– ๐‘— ,...,๐‘– ๐‘‘ Y๐‘ž1 ,...,๐‘ž ๐‘˜โˆ’1 ,๐‘– ๐‘— ,๐‘ž ๐‘˜+1 ,...,๐‘ž ๐‘‘ 0 , 1 ๐‘‘ 0 โˆ’2 ๐‘– ๐‘— =1 (2.15) for ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ], ๐‘— โˆˆ [๐‘‘], ๐‘ž โ„“ โˆˆ [๐‘š โ„“ ], ๐‘ž ๐ฟ ๐‘–0 โˆˆ [๐‘š ๐ฟ ๐‘–0 ], ๐‘– โˆˆ [๐‘‘ 0 โˆ’ 2]. 13 CHAPTER 3 TENSOR DECOMPOSITIONS AND RANK In this section, a short review on the more commonly used tensor decompositions is presented. For simplicity, the elements of the tensors as well as scalars are defined over the field of real numbers. Results can be extended to the field of complex numbers with slight modifications. This will specifically be the case in Chapter 4. 3.1 The CANDECOMP/PARAFAC Decomposition This factorization, abbreviated to CPD, decomposes a tensor X into the (weighted) sum of rank-1 tensors [12]. For X โˆˆ R๐‘› ร—...ร—๐‘› , 1 ๐‘‘ ๐‘Ÿ โˆ‘๏ธ X โ‰ˆ Xฬ‚ = ๐‘” ๐‘˜ a ๐‘˜(1) a ๐‘˜(2) ยทยทยท a ๐‘˜(๐‘‘) , (3.1) ๐‘˜=1 where denotes the tensor outer product. The vector a ๐‘˜ ( ๐‘—) โˆˆ R๐‘› ๐‘— can be considered as the ๐‘˜ th column in a matrix A ( ๐‘—) โˆˆ R๐‘› ร—๐‘Ÿ for ๐‘— โˆˆ [๐‘‘]. The scalar ๐‘”๐‘˜ can be considered as the ๐‘˜ th element of ๐‘— a vector g. Therefore, if g is set as the superdiagonal of a diagonal tensor G, called the core tensor, then Xฬ‚ = G ร—1 A (1) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) . (3.2) Component-wise, we have ๐‘Ÿ โˆ‘๏ธ Xฬ‚๐‘–1 ,...,๐‘– ๐‘‘ = ๐‘” ๐‘˜ A๐‘–(1) A (2) . . . A๐‘–(๐‘‘) 1 ,๐‘˜ ๐‘– 2 ,๐‘˜ ๐‘‘ ,๐‘˜ . (3.3) ๐‘˜=1 Considering the superdiagonality of G, and according to (2.9), the relation between the unfold- ings of X and G may be written as Xฬ‚ ( ๐‘—) = A ( ๐‘—) G(A (๐‘‘) ยทยทยท A ( ๐‘—+1) A ( ๐‘—โˆ’1) ยทยทยท A (1) ) > , (3.4) where G is a diagonal matrix with g as its diagonal. 14 3.1.1 Uniquensess of CPD CPD is unique under weak conditions; there is a permutation and scaling indeterminacy. 
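Both indeterminacies are straightforward to confirm numerically before stating them formally. The sketch below (Python/NumPy, illustrative only; the helper cp_to_tensor is an ad hoc name rather than thesis code) assembles a rank-r tensor from its factors as in (3.1) and checks that a consistent column permutation of all factor matrices, or a column rescaling in one mode compensated in another, leaves the reconstructed tensor unchanged.

# Illustrative sketch (not thesis code): CP permutation and scaling indeterminacies.
import numpy as np
from functools import reduce

def cp_to_tensor(g, factors):
    # sum_k g_k * a_k^(1) o a_k^(2) o ... o a_k^(d), cf. (3.1)
    X = 0.0
    for k in range(g.size):
        X = X + g[k] * reduce(np.multiply.outer, [A[:, k] for A in factors])
    return X

rng = np.random.default_rng(0)
n, r = (3, 4, 5), 2
A = [rng.standard_normal((nj, r)) for nj in n]
g = rng.standard_normal(r)
X = cp_to_tensor(g, A)

# permutation indeterminacy: permute the columns of every factor (and g) consistently
perm = rng.permutation(r)
assert np.allclose(X, cp_to_tensor(g[perm], [Aj[:, perm] for Aj in A]))

# scaling indeterminacy: rescale the columns of one factor, absorb the inverse elsewhere
s = rng.uniform(0.5, 2.0, size=r)
assert np.allclose(X, cp_to_tensor(g, [A[0] * s, A[1] / s, A[2]]))
print("permutation and scaling indeterminacies verified")

The formal statements of these indeterminacies are as follows.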
For a permutation matrix ฮ  โˆˆ R๐‘Ÿร—๐‘Ÿ ,     Xฬ‚ = G ร—1 A (1) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) = G ร—1 A (1) ฮ  ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) ฮ  , (3.5) implying that as long as the columns of the factor matrices are permuted in the same way, the factorization will not change. This is also evident from (3.1) where the order in which the terms in the summation are added together does not matter. As for the scaling indeterminacy, we observe that โˆ‘๏ธ๐‘Ÿ       Xฬ‚ = ๐‘” ๐‘˜ ๐›ผ ๐‘˜(1) a ๐‘˜(1) ๐›ผ๐‘˜(2) a ๐‘˜(2) ยทยทยท ๐›ผ ๐‘˜(๐‘‘) a ๐‘˜(๐‘‘) , (3.6) ๐‘˜=1 as long as ๐›ผ ๐‘˜(1) ๐›ผ๐‘˜(2) . . . ๐›ผ๐‘˜(๐‘‘) = 1. A sufficient condition for the uniqueness of CPD is [21] ๐‘‘ โˆ‘๏ธ ๐‘˜ A ( ๐‘—) โ‰ฅ 2๐‘Ÿ + ๐‘‘ โˆ’ 1, (3.7) ๐‘—=1 where ๐‘˜ A is the ๐‘˜-rank of a matrix A, and is defined as the largest number ๐‘˜ such that any ๐‘˜ columns of A are linearly independent. Equation (3.7) is also the necessary condition for uniqueness of CPD for ๐‘Ÿ = 2, 3 but not for ๐‘Ÿ โ‰ฅ 4. In its general form, the necessary condition for the uniqueness of CPD is [26]   (1) ( ๐‘—โˆ’1) ( ๐‘—+1) (๐‘‘) min rank A ยทยทยท A A ยทยทยท A = ๐‘Ÿ. (3.8) ๐‘— โˆˆ[๐‘‘] However, noting that rank(A B) โ‰ค rank(A โŠ— B) โ‰ค rank(A) rank(B), then (3.8) can be simplified to ยฉร– ๐‘‘  ยช min ยญ rank A (๐‘š) ยฎ โ‰ฅ ๐‘Ÿ. (3.9) ยญ ยฎ ๐‘— โˆˆ[๐‘‘] ยญ ยฎ ๐‘š=1 ยซ๐‘šโ‰  ๐‘— ยฌ 15 Algorithm 3.1: CPD-ALS [12] R initialize A ( ๐‘—) โˆˆ ๐‘› ๐‘— ร—๐‘Ÿ for ๐‘— โˆˆ [๐‘‘] repeat for ๐‘— = 1, . . . , ๐‘‘ do V โ† A (1)> A(1) โˆ— ยท ยท ยท โˆ— A ( ๐‘—โˆ’1)> A ( ๐‘—โˆ’1) โˆ— A ( ๐‘—+1)> A ( ๐‘—+1) โˆ— ยท ยท ยท โˆ— A (๐‘‘)> A (๐‘‘) A ( ๐‘—) โ† X ( ๐‘—) A (๐‘‘) ยท ยท ยท A ( ๐‘—+1) A ( ๐‘—โˆ’1) ยท ยท ยท A (1) Vโ€  normalize columns of A ( ๐‘—) storing norms as g end for until fit ceases to improve or maximum iterations exhausted return g, A ( ๐‘—) for ๐‘— โˆˆ [๐‘‘] 3.1.2 Computing CPD At first, assume the number of rank-1 tensors is known beforehand. The problem is now the calculation of factor matrices A ( ๐‘—) for ๐‘— โˆˆ [๐‘‘] and g in (3.1), i.e. the solution to โˆ‘๏ธ๐‘Ÿ min kX โˆ’ Xฬ‚k with Xฬ‚ = ๐‘” ๐‘˜ a ๐‘˜(1) a ๐‘˜(2) ยทยทยท a ๐‘˜(๐‘‘) . (3.10) Xฬ‚ ๐‘˜=1 Alternating Least Squares (ALS) is a common way of finding the fit. For instance, assume that X is a 3-mode tensor. Then, in light of (3.10) and given that the Euclidean norm of a tensor is equal to the Frobenius norm of any of its unfoldings, finding A (1) would be done by solving  > A (1) = min X (1) โˆ’ Aฬ‚ A (3) A (2) , (3.11) Aฬ‚ ๐น where Aฬ‚ = AG with G being an ๐‘Ÿ ร— ๐‘Ÿ diagonal matrix with ๐‘” ๐‘˜ forming its diagonal. The optimal  >โ€  solution to (3.11) would be Aฬ‚ = X (1) A (3) A (2) which can be rearranged as   โ€  (3) (2) (3)> (3) (2)> (2) Aฬ‚ = X (1) A A A A โˆ—A A . Here, the symbol โˆ— denotes the Hadamard product, and โ€  represents the pseudo-inverse of a matrix. Extending the same idea to a ๐‘‘-mode tensor is outlined in Algorithm 3.1, and is called CPD-ALS. The initialization for A ( ๐‘—) could be either random or using ๐‘Ÿ leading left singular vectors of X ( ๐‘—) . The remaining question is how to choose ๐‘Ÿ. Most methods fit multiple CP decompositions with different number of components until one is good, i.e., the one that yields an exact representation in the Euclidean norm. 16 For noiseless data, CPD is computed for ๐‘Ÿ = 1, 2, . . . , and the first value of ๐‘Ÿ that gives a 100% fit is chosen as rank. This may indeed not work in the case of degenerate tensors (see 3.2.1). 
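In the noiseless, exact-rank setting just described, Algorithm 3.1 can be sketched and tested in a few lines. The following Python/NumPy sketch is illustrative only: the helpers unfold and khatri_rao are ad hoc and follow the column-major conventions of Chapter 2, no stopping criterion beyond a fixed number of sweeps is implemented, and on a small random tensor of exact rank r the relative reconstruction error typically drops to near machine precision.

# Illustrative sketch (not thesis code): CPD-ALS as in Algorithm 3.1.
import numpy as np

def unfold(X, j):
    # mode-j unfolding, column-major over the remaining modes (Chapter 2)
    return np.reshape(np.moveaxis(X, j, 0), (X.shape[j], -1), order='F')

def khatri_rao(mats):
    # column-wise Kronecker product of matrices that all have r columns
    r = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = (out[:, None, :] * M[None, :, :]).reshape(-1, r)
    return out

def cpd_als(X, r, n_iter=100, seed=0):
    d = X.ndim
    rng = np.random.default_rng(seed)
    A = [rng.standard_normal((X.shape[j], r)) for j in range(d)]
    g = np.ones(r)
    for _ in range(n_iter):
        for j in range(d):
            V = np.ones((r, r))
            for k in range(d):
                if k != j:
                    V *= A[k].T @ A[k]            # Hadamard product of Gram matrices
            KR = khatri_rao([A[k] for k in reversed(range(d)) if k != j])
            A[j] = unfold(X, j) @ KR @ np.linalg.pinv(V)
            g = np.linalg.norm(A[j], axis=0)      # normalize columns, storing norms as g
            A[j] = A[j] / g
    return g, A

# test: recover a random tensor of exact rank 3
rng = np.random.default_rng(1)
n, r = (6, 7, 8), 3
B = [rng.standard_normal((nj, r)) for nj in n]
X = (B[0] @ khatri_rao([B[2], B[1]]).T).reshape(n, order='F')   # fold the mode-1 unfolding
g, A = cpd_als(X, r)
Xhat_1 = (A[0] * g) @ khatri_rao([A[2], A[1]]).T                # reconstruction via (3.4)
print("relative error:", np.linalg.norm(Xhat_1 - unfold(X, 0)) / np.linalg.norm(X))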
For noisy data, which is almost always the case, the fit alone fails to determine rank. A commonly used consistency diagnostic called CORCONDIA1 is employed to determine the proper number of components [4]. Assume the factor matrices A (1) , . . . , A (๐‘‘) are fixed, i.e., they have been obtained using a CP procedure. A Tucker model (see 3.3) is next assumed to represent the data as in X โ‰ˆ G ร—1 A (1) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) . Noting that the Euclidean norm of a tensor is equal to the Frobenius norm of any of its unfoldings as well as the 2-norm of its vectorized form, the core G can be found by solving > 2 1 1 ! 2 ยฉรŒ ยช รŒ ( ๐‘—) (โ„“) ยฎ min X ( ๐‘—) โˆ’ A G ( ๐‘—) ยญ A ยฎ = min vec (X) โˆ’ A (โ„“) vec (G) ยญ , G ( ๐‘—) ยญ ยฎ vec(G) โ„“=๐‘‘ โ„“=๐‘‘ 2 โ„“โ‰  ๐‘— ยซ ยฌ ๐น for ๐‘— โˆˆ [๐‘‘], which can be treated as a least squares problem. Now, the question is how close the core is to a diagonal tensor with a superdiagonal of ones. If there is a 100% match, then the perfect fit has been found. The reason why a diagonal core is sought is that in a perfect CP model, interaction exists only between parallel factors of different modes. 3.2 Tensor Rank The rank of a tensor X is defined as the smallest number of rank-1 tensors that generate X as their sum. In other words, it is the smallest number of components in an exact CP decomposition. Although this definition is similar to the definition of rank in matrices, the properties of tensor rank are very different from matrix rank. The major difference is that there is no straightforward algorithm to compute the rank of a tensor, and in practice, it is determined numerically by fitting various rank-๐‘Ÿ CP models. 1 CORe CONsistency DIAgnostic 17 Other types of rank that are used for tensors are maximum rank and typical rank. Maximum rank is defined as the largest attainable rank of a tensor. Typical rank is defined as any rank that occurs with probability greater than zero when the elements of the tensor are drawn randomly from a uniform continuous distribution. Typical rank and maximum rank are the same for matrices. However, they may be different for tensors, and there might be more than one typical rank. 3.2.1 Low-rank approximation and border rank ร๐‘Ÿ For a matrix A with rank ๐‘Ÿ and a decomposition of the form A = ๐‘–=1 ๐œŽ๐‘– u๐‘– v๐‘‡๐‘– where ๐œŽ1 โ‰ฅ ยท ยท ยท โ‰ฅ ๐œŽ๐‘Ÿ , the best rank-๐‘˜ approximation (๐‘˜ โ‰ค ๐‘Ÿ) will be obtained by keeping the ๐‘˜ leading factors, i.e., ร๐‘˜ Aฬ‚ = ๐‘–=1 ๐œŽ๐‘– u๐‘– v๐‘‡๐‘– . For tensors, this might not be the case; the best rank-๐‘˜ approximation may not even exist, which is a problem of degeneracy. A tensor is degenerate if it can be approximated arbitrarily well by a factorization of lower rank. When a low-rank approximation does not exist for a tensor, it is useful to introduce the concept of border rank. It is defined as the minimum number of rank-one tensors that approximate it with arbitrarily small non-zero error, i.e., rank(X) g = min{ ๐‘Ÿ | โˆ€๐œ€ > 0, โˆƒE; kX โˆ’ E k < ๐œ€, rank(E) = ๐‘Ÿ}. (3.12) Obviously, rank(X) g โ‰ค rank(X). 3.3 Compression and the Tucker Decomposition The Tucker decomposition can be considered as an extension to CPD, as well as a higher-order principal component analysis. A tensor X โˆˆ R๐‘› ร—...ร—๐‘› 1 ๐‘‘ is decomposed in the Tucker format in the following way. 
๐‘Ÿ1 โˆ‘๏ธ โˆ‘๏ธ๐‘Ÿ๐‘‘ X โ‰ˆ Xฬ‚ = ยทยทยท G๐‘˜ 1 ,...,๐‘˜ ๐‘‘ a ๐‘˜(1) 1 ยทยทยท a ๐‘˜(๐‘‘) ๐‘‘ ๐‘˜ 1 =1 ๐‘˜ ๐‘‘ =1 (3.13) = G ร—1 A (1) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) , 18 where G โˆˆ R๐‘Ÿ ร—ยทยทยทร—๐‘Ÿ 1 ๐‘‘ is the core tensor with ๐‘Ÿ ๐‘— โ‰ค ๐‘› ๐‘— for ๐‘— โˆˆ [๐‘‘], and A ( ๐‘—) โˆˆ R๐‘› ร—๐‘Ÿ ๐‘— ๐‘— is the ๐‘— th factor ( ๐‘—) matrix whose ๐‘˜ ๐‘—th column is a ๐‘˜ ๐‘— for ๐‘˜ ๐‘— โˆˆ [๐‘Ÿ ๐‘— ]. Component-wise, we have โˆ‘๏ธ๐‘Ÿ1 โˆ‘๏ธ๐‘Ÿ๐‘‘ Xฬ‚๐‘–1 ,...,๐‘– ๐‘‘ = ยทยทยท G๐‘˜ 1 ,...,๐‘˜ ๐‘‘ A๐‘–(1) A (2) . . . A๐‘–(๐‘‘) 1 ,๐‘˜ 1 ๐‘– 2 ,๐‘˜ 2 ๐‘‘ ,๐‘˜ ๐‘‘ . (3.14) ๐‘˜ 1 =1 ๐‘˜ ๐‘‘ =1 รŽ๐‘‘ As can be seen, Tucker approximates a tensor as the linear combination of ๐‘—=1 ๐‘Ÿ ๐‘— rank-1 tensors. Unlike CPD where the interaction between modes is restricted to matching columns of the factor matrices, in Tucker, this interaction occurs between all possible combinations of columns, and the level of interaction is governed by the elements of G. It is observed that if ๐‘Ÿ ๐‘— < ๐‘› ๐‘— for at least one ๐‘—, the size of the core tensor G will be smaller than the size of X. This means that G can be thought of as a compressed version of X. 3.3.1 ๐‘—-rank The column rank of X ( ๐‘—) is defined as the ๐‘—-rank of X and is denoted by rank ๐‘— (X). If ๐‘Ÿ ๐‘— = rank ๐‘— (X) for ๐‘— โˆˆ [๐‘‘], then X is said to have an exact rank-(๐‘Ÿ 1 , ๐‘Ÿ 2 , . . . , ๐‘Ÿ ๐‘‘ ) Tucker decomposition. Obviously, รŽ ๐‘Ÿ ๐‘— โ‰ค ๐‘› ๐‘— and rank ๐‘— (X) โ‰ค min{๐‘› ๐‘— , โ„“โ‰  ๐‘— ๐‘›โ„“ }. If ๐‘Ÿ ๐‘— โ‰ค rank ๐‘— (X) for at least one ๐‘—, then X cannot be reconstructed exactly from its Tucker representation. 3.3.2 Computing the Tucker Decomposition In one of the first methods developed to compute the Tucker decomposition, the basic idea is to find those components (rank-1 tensors) that capture the most variations in each mode. This method is known as the Higher-order SVD (HoSVD) as a generalization of the matrix SVD, and computes the singular vectors of the mode- ๐‘— unfoldings of a tensor X. For ๐‘Ÿ ๐‘— โ‰ค rank ๐‘— (X), the method is called truncated HoSVD. This method is not optimal, but it can be used as a good starting point in an ALS algorithm whose goal is to compute the Tucker decomposition. 19 Algorithm 3.2: HOOI-ALS [12] R initialize A ( ๐‘—) โˆˆ ๐‘› ๐‘— ร—๐‘Ÿ ๐‘— for ๐‘— โˆˆ [๐‘‘] using HoSVD repeat for ๐‘— = 1, . . . , ๐‘‘ do Y โ† X ร—1 A (1)> ร— ยท ยท ยท ร— ๐‘—โˆ’1 A ( ๐‘—โˆ’1)> ร— ๐‘—+1 A ( ๐‘—+1)> ร—๐‘‘ A (๐‘‘)> A ( ๐‘—) โ† ๐‘Ÿ ๐‘— leading left singular vectors of Y ( ๐‘—) end for until fit ceases to improve or maximum iterations exhausted G โ† X ร—1 A (1)> ยท ยท ยท ร—๐‘‘ A (๐‘‘)> return G, A ( ๐‘—) for ๐‘— โˆˆ [๐‘‘] To compute HoSVD, ๐‘Ÿ ๐‘— leading left singular vectors of X ( ๐‘—) is set as A ( ๐‘—) for all ๐‘— โˆˆ [๐‘‘]. Then the core tensor is computed using G = X ร—1 A (1)> ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘)> . The Higher-Order Orthogonal Iteration, abbreviated to HOOI, is used as the ALS method that takes the result of HoSVD as input. The optimization problem that is solved by HOOI is expressed as min kX โˆ’ G ร—1 A (1) ยท ยท ยท ร—๐‘‘ A (๐‘‘) k. R subject to Gโˆˆ ๐‘Ÿ1 ร—...ร—๐‘Ÿ๐‘‘ (3.15) A โˆˆ( ๐‘—) R ๐‘› ๐‘— ร—๐‘Ÿ ๐‘— and column-wise orthogonal The pseudo-code for HOOI is shown in Algorithm 3.2. 3.3.3 Uniqueness of Tucker The Tucker decomposition is not unique. The core tensor can be modified as long as the inverse modification is applied to the factor matrices. 
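Before illustrating this non-uniqueness for the three-mode case below, the computation of the decomposition itself (Section 3.3.2) can be made concrete. The following Python/NumPy sketch implements the truncated HoSVD initialization and the HOOI sweeps of Algorithm 3.2; it is illustrative only, with ad hoc helper names and no convergence check beyond a fixed number of sweeps.

# Illustrative sketch (not thesis code): truncated HoSVD and HOOI as in Algorithm 3.2.
import numpy as np

def unfold(X, j):
    return np.reshape(np.moveaxis(X, j, 0), (X.shape[j], -1), order='F')

def mode_product(X, U, j):
    return np.moveaxis(np.tensordot(U, X, axes=(1, j)), 0, j)

def multi_mode_product(X, Us, skip=None, transpose=False):
    # X x_1 U_1 x_2 U_2 ..., optionally transposing each U_j and skipping one mode
    for j, U in enumerate(Us):
        if j != skip:
            X = mode_product(X, U.T if transpose else U, j)
    return X

def hosvd(X, ranks):
    # truncated HoSVD: r_j leading left singular vectors of each unfolding
    A = [np.linalg.svd(unfold(X, j), full_matrices=False)[0][:, :ranks[j]]
         for j in range(X.ndim)]
    return multi_mode_product(X, A, transpose=True), A

def hooi(X, ranks, n_iter=20):
    G, A = hosvd(X, ranks)                                    # HoSVD initialization
    for _ in range(n_iter):
        for j in range(X.ndim):
            Y = multi_mode_product(X, A, skip=j, transpose=True)
            A[j] = np.linalg.svd(unfold(Y, j), full_matrices=False)[0][:, :ranks[j]]
    return multi_mode_product(X, A, transpose=True), A

# test on a tensor with exact multilinear rank (2, 3, 2)
rng = np.random.default_rng(0)
ranks, n = (2, 3, 2), (5, 6, 7)
G0 = rng.standard_normal(ranks)
A0 = [np.linalg.qr(rng.standard_normal((n[j], ranks[j])))[0] for j in range(3)]
X = multi_mode_product(G0, A0)
G, A = hooi(X, ranks)
Xhat = multi_mode_product(G, A)
print("relative error:", np.linalg.norm(Xhat - X) / np.linalg.norm(X))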
For instance, consider the 3-mode case. For nonsingular matrices U โˆˆ R๐‘› ร—๐‘› , V โˆˆ R๐‘› ร—๐‘› 1 1 2 2 and W โˆˆ R๐‘› ร—๐‘› , we have 3 3       G ร—1 A (1) ร—2 A (2) ร—3 A (3) = (G ร—1 U ร—2 V ร—3 W) ร—1 A (1) Uโˆ’1 ร—2 A (2) Vโˆ’1 ร—3 A (3) Wโˆ’1 . 20       Proof Let H := (G ร—1 U ร—2 V ร—3 W) ร—1 A (1) Uโˆ’1 ร—2 A (2) Vโˆ’1 ร—3 A (3) Wโˆ’1 , then we have  > H (1) = A (1) Uโˆ’1 (G ร—1 U ร—2 V ร—3 W) (1) A (3) Wโˆ’1 โŠ— A (2) Vโˆ’1  > = A (1) Uโˆ’1 UG (1) (W โŠ— V) > A (3) Wโˆ’1 โŠ— A (2) Vโˆ’1 h  i> (1) (3) โˆ’1 (2) โˆ’1 = A G (1) A W โŠ— A V (W โŠ— V) (3.16)  >  > (1) (3) โˆ’1 (2) โˆ’1 (1) (3) (2) = A G (1) A W W โŠ— A V V = A G (1) A โŠ— A   (1) (2) (3) = G ร—1 A ร—2 A ร—3 A , (1) implying the two tensors are equal. A similar approach can be applied to modes 2 and 3. 3.4 Tensor-Train Decomposition The Tensor-Train decomposition, which is also known as MPS2 in the physics community, decom- poses a ๐‘‘-mode tensor into a chain of ๐‘‘ lower-dimensional tensors of at most 3 modes [16]. A tensor X is approximated by another tensor Xฬ‚ whose elements are expressed as contractions of lower-dimensional tensors G ( ๐‘—) for ๐‘— โˆˆ [๐‘‘], โˆ‘๏ธ๐‘Ÿ 0 โˆ‘๏ธ ๐‘Ÿ1 ๐‘Ÿ๐‘‘ โˆ‘๏ธ Xฬ‚๐‘–1 ,...,๐‘– ๐‘‘ = ยทยทยท G๐›ผ(1) G (2) 0 ,๐‘– 1 ,๐›ผ1 ๐›ผ1 ,๐‘– 2 ,๐›ผ2 . . . G๐›ผ(๐‘‘) ๐‘‘โˆ’1 ,๐‘– ๐‘‘ ,๐›ผ ๐‘‘ , (3.17) ๐›ผ0 =1 ๐›ผ1 =1 ๐›ผ๐‘‘ =1 where G ( ๐‘—) โˆˆ R๐‘Ÿ ๐‘—โˆ’1 ร—๐‘› ๐‘— ร—๐‘Ÿ ๐‘— for ๐‘— โˆˆ [๐‘‘], and ๐‘Ÿ 0 = ๐‘Ÿ ๐‘‘ = 1, i.e., G (1) and G (๐‘‘) are in fact matrices. In other words, elements of Xฬ‚ can be obtained by employing Xฬ‚๐‘–1 ,...,๐‘– ๐‘‘ = G (1) (๐‘–1 ) G (2) (๐‘– 2 ) ยท ยท ยท G (๐‘‘) (๐‘– ๐‘‘ ) , (3.18) where G ( ๐‘—) ๐‘– ๐‘— โˆˆ  R๐‘Ÿ ๐‘—โˆ’1 ร—๐‘Ÿ ๐‘— for ๐‘— โˆˆ [๐‘‘] is the ๐‘– th๐‘— lateral slice of G ( ๐‘—) . The tensor-Train decomposition is calculated by a sequence of SVDโ€™s starting with X (1) . Assuming the Tensor-Train ranks ๐‘Ÿ ๐‘— are known, a simplified version of the Tensor-Train algorithm is presented in Algorithm 3.3. For a more detailed version also including how to choose the ranks ๐‘Ÿ ๐‘— , see [16]. 2 Matrix Product State 21 Algorithm 3.3: Tensor-Train [16] R รŽ ๐‘›1 ร— ๐‘›โ„“ Compute the SVD of X (1) : X (1) = U (1) S (1) V (1)> โˆˆ โ„“โ‰ 1 . Compute G (1) = U (1) (:, 1 : ๐‘Ÿ 1 ) โˆˆ R๐‘› ร—๐‘Ÿ . 1 1 for ๐‘— = 2, . . . , ๐‘‘ โˆ’ 1 do R รŽ ๐‘Ÿ ๐‘—โˆ’1 ร— ๐‘›โ„“ Compute W ( ๐‘—โˆ’1) = S ( ๐‘—โˆ’1) V ( ๐‘—โˆ’1)> โˆˆ โ„“โˆ‰[ ๐‘—โˆ’1] . R รŽ ๐‘Ÿ ๐‘—โˆ’1 ๐‘› ๐‘— ร— ๐‘›โ„“ Reshape W ( ๐‘—โˆ’1) into W ( ๐‘—โˆ’1) โˆˆ โ„“โˆ‰[ ๐‘— ] . Compute W ( ๐‘—โˆ’1) =U S V( ๐‘—) ( ๐‘—) ( ๐‘—)> . Truncate U โˆˆ ( ๐‘—) R ๐‘Ÿ ๐‘—โˆ’1 ๐‘› ๐‘— ร—๐‘Ÿ ๐‘—โˆ’1 ๐‘› ๐‘— to get G ( ๐‘—) = U ( ๐‘—) :, 1 : ๐‘Ÿ ๐‘— .  R Reshape G ( ๐‘—) into G ( ๐‘—) โˆˆ ๐‘Ÿ ๐‘—โˆ’1 ร—๐‘› ๐‘— ร—๐‘Ÿ ๐‘— . end for Compute G (๐‘‘) = W (๐‘‘โˆ’2) = S (๐‘‘โˆ’1) V (๐‘‘โˆ’1)> โˆˆ ๐‘Ÿ ๐‘‘โˆ’1 ร—๐‘›๐‘‘ . R return G ( ๐‘—) for ๐‘— โˆˆ [๐‘‘]. Consider another definition of unfoldings of a tensor X, defined by R รŽ๐‘— รŽ๐‘‘ ๐‘› ร— โ„“= ๐‘—+1 ๐‘›โ„“ (3.19) X ๐‘— = X๐‘–1 ,...,๐‘– ๐‘— ;๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ โˆˆ โ„“=1 โ„“ , where the indices are divided into two row and column groups for ๐‘— โˆˆ [๐‘‘]. Obviously, X1 = X (1) , and X๐‘‘ is the vectorized version of X. The following theorems hold [16]. 
Theorem 3.4.1 If rank(X ๐‘— ) = ๐‘Ÿ ๐‘— for each unfolding X ๐‘— of a tensor X for all ๐‘— โˆˆ [๐‘‘], then there exists a decomposition (3.17) with Tensor-Train ranks no greater than ๐‘Ÿ ๐‘— . Theorem 3.4.2 Consider the ๐›ฟ-truncated SVD of X ๐‘— in the sense that X ๐‘— = USV> + E for kEk ๐น โ‰ค ๐›ฟ and ๐‘Ÿ ๐‘˜ = rank๐›ฟ (X ๐‘— ), where rank๐›ฟ (X ๐‘— ) is the ๐›ฟ-rank of X ๐‘— defined as the minimum rank(B) over all matrices B satisfying kA โˆ’ Bk ๐น โ‰ค ๐›ฟ. If ๐œ€ ๐›ฟ=โˆš kXk, ๐‘‘โˆ’1 then kX โˆ’ Xฬ‚k โ‰ค ๐œ€kXk. 22 Theorem 3.4.3 Suppose that the unfolding matrices X ๐‘— have low ranks ๐‘Ÿ ๐‘— only approximately, i.e., X ๐‘— = R ๐‘— + E ๐‘— for all ๐‘— โˆˆ [๐‘‘], such that rank(๐‘… ๐‘— ) = ๐‘Ÿ ๐‘— and kEโˆš๏ธ„๐‘— k ๐น = ๐œ€ ๐‘— . Then Algorithm 3.3 ๐‘‘ ๐œ€ 2๐‘— . ร computes a tensor Xฬ‚ with Tensor-Train ranks ๐‘Ÿ ๐‘— and kX โˆ’ Xฬ‚k โ‰ค ๐‘—=1 Corollary 1 If a tensor X admits a rank-๐‘Ÿ CP decomposition with accuracy ๐œ€, there exists a โˆš Tensor-Train decomposition with Tensor-Train ranks ๐‘Ÿ ๐‘— โ‰ค ๐‘Ÿ and accuracy ๐‘‘ โˆ’ 1๐œ€. Corollary 2 Given a tensor X and rank bounds ๐‘Ÿ ๐‘— , the best approximation to X in the Euclidean norm with Tensor-Train ranks bounded by ๐‘Ÿ ๐‘— always exists, and the Tensor-Train approximation Xฬ‚ โˆš is quasi-optimal. i.e., kX โˆ’ Xฬ‚k โ‰ค ๐‘‘ โˆ’ 1kX โˆ’ X best k. 23 CHAPTER 4 DIMENSIONALITY REDUCTION OF TENSOR DATA: MODEWISE RANDOM PROJECTIONS Johnson-Lindenstrauss embeddings, called JL from here on for the sake of brevity, provide a simple yet powerful tool for dimension reduction of high-dimensional data using linear random projections. By performing JL on mode- ๐‘— fibers of a tensor X, the dimentionality of all modes can be reduced to yield a projected tensor of much smaller size, without first vectorizing the tensor. It is then expected that the Euclidean norm of the projected tensor remains preserved to within a predictable error. In this chapter, theoretical guarantees for the geometry preserving properties of modewise random projections as JL embeddings are presented. More theorems, detailed discussions and proofs are provided in [9]. 4.1 Johnson-Lindenstrauss Embeddings for Tensor Data In this section, a brief overview of the necessary tools that will used to extend JL embeddings to higher-order data will be presented, as well as the theorems providing the underlying theory of modewie JL embeddings. Definition 4.1.1 (๐œบ-JL embedding) A matrix A โˆˆ C๐‘šร—๐‘ is an ๐œ€-JL embedding of a set ๐‘† โŠ‚ C๐‘ into C๐‘š if kAxk 22 = (1 + ๐œ€x ) kxk 22 , (4.1) for |๐œ€x | โ‰ค ๐œ€ and all x โˆˆ ๐‘†. Assuming the elements of A are subgaussian random variables and that |๐‘†| = ๐‘€, then (4.1) holds log ๐‘€ for all x โˆˆ ๐‘† with probability ๐‘ โ‰ฅ 1 โˆ’ 2 exp โˆ’๐ถ๐‘š๐œ€ 2 if ๐‘š โ‰ฅ ๐ถ ๐œ€2 , where ๐ถ is an absolute  constant [25]. A brief statement of the JL lemma along with its relation with random projections is presented in Appendix A.3. 24 Lemma 4.1.1 Let x, y โˆˆ C๐‘› and suppose that A โˆˆ C๐‘šร—๐‘› is an ๐œ€-JL embedding of the vectors i i {x โˆ’ y, x + y, x โˆ’ y, x + y} โŠ‚ C๐‘› into C๐‘š . Then,   |hAx, Ayi โˆ’ hx, yi| โ‰ค 2๐œ€ kxk 22 + kyk 22 โ‰ค 4๐œ€ max kxk 22 , kyk 22 .  
Proof Using the polarization identity for inner products, we have that 3 3 1 โˆ‘๏ธ i i i 1 โˆ‘๏ธ โ„“ i i  2 2  2 โ„“ โ„“ โ„“ |hAx, Ayi โˆ’ hx, yi| = Ax + Ay 2 โˆ’ x+ y 2 = ๐œ€โ„“ x + โ„“ y 2 4 โ„“=0 4 โ„“=0 3 1 โˆ‘๏ธ โ‰ค ๐œ€ (kxk 2 + kyk 2 ) 2 = ๐œ€ (kxk 2 + kyk 2 ) 2 4 โ„“=0   = ๐œ€ kxk 22 + kyk 22 + 2kxk 2 kyk 2   โ‰ค 2๐œ€ kxk 22 + kyk 22 โ‰ค 4๐œ€ max kxk 22 , kyk 22 ,  where the second to last inequality follows from Youngโ€™s inequality for products. In the second equality, ๐œ€โ„“ denotes the amount of distortion applied to x + โ„“ y i 2 2 by the JL matrix A, where |๐œ€โ„“ | โ‰ค ๐œ€. Extending vectors to tensors, one can define a tensor ๐œ€-JL embedding in a similar way as follows. Definition 4.1.2 (Tensor ๐œบ-JL embedding) A linear operator ๐ฟ : C๐‘› ร—๐‘› ร—...ร—๐‘› 1 2 ๐‘‘ โ†’ C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 is an ๐œ€-JL embedding of a set ๐‘† โŠ‚ C๐‘› ร—๐‘› ร—...ร—๐‘› 1 2 ๐‘‘ into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 if k๐ฟ (X)k 2 = (1 + ๐œ€X ) kXk 2 holds for some ๐œ€X โˆˆ (โˆ’๐œ€, ๐œ€) for all X โˆˆ ๐‘†. 25 The following lemma shows that the Tensor ๐œ€-JL embedding will preserve the pairwise inner product of tensors. Lemma 4.1.2 If X, Y โˆˆ C๐‘› ร—๐‘› ร—...ร—๐‘› 1 2 ๐‘‘ and suppose that ๐ฟ is an ๐œ€-JL embedding of the tensors i {X โˆ’ Y, X + Y, X โˆ’ Y, X + Y} โŠ‚ i C๐‘› ร—๐‘› ร—...ร—๐‘› 1 2 ๐‘‘ into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 . Then,   2 2 โ‰ค 4๐œ€ ยท max kXk 2 , kY k 2 .  |h๐ฟ (X) , ๐ฟ (Y)i โˆ’ hX, Yi| โ‰ค 2๐œ€ kXk + kY k Proof The proof will be similar to what was presented in the proof of Lemma 4.1.1, and by replacing A๐‘ฅ with ๐ฟ (X) and using the linearity of ๐ฟ. When a more general set is being projected using JL embeddings, a discretization scheme can be used in order to embed a finite set (see Appendix A.2, and [25] for more details). This applies to, for instance, a low-rank subspace of tensors. In such cases, due to the linearity of the embedding, discretization can rather be done on the unit ball in that subspace1. In the following lemma, a JL embedding result for a subspace is presented based on a covering argument. Lemma 4.1.3 Fix ๐œ€ โˆˆ (0, 1). Let L be an ๐‘Ÿ-dimensional subspace of C๐‘› , and let C โŠ‚ L be an (๐œ€/16)-net of the (๐‘Ÿ โˆ’ 1)-dimensional Euclidean unit sphere Sโ„“2 โŠ‚ L. Then, if A โˆˆ C๐‘šร—๐‘› is an (๐œ€/2)-JL embedding of C, it will also satisfy (1 โˆ’ ๐œ€)kxk 22 โ‰ค kAxk 22 โ‰ค (1 + ๐œ€)kxk 22 , (4.2)  ๐‘Ÿ 47 for all x โˆˆ L. Furthermore, one can observe that |C| โ‰ค ๐œ€ . 1 Any point in an ๐‘Ÿ-dimensional subspace can be represented by a linear combination of ๐‘Ÿ basis elements in the discretized unit ball of that subspace. 26 Proof Noting that A is a linear embedding, it is enough to prove (4.2) for an arbitrary x โˆˆ Sโ„“2 as any point in the subspace L can be scaled up/down to a point on the unit sphere in L. To prove the upper bound, let ฮ” := kAk 2โ†’2 and choose an element y โˆˆ C such that kx โˆ’ yk 2 โ‰ค ๐œ€/16. Noting kxk 2 = 1, we can write โˆš๏ธ kAxk 2 โˆ’ kxk 2 โ‰ค kAyk 2 + kA(x โˆ’ y)k 2 โˆ’ 1 โ‰ค 1 + ๐œ€/2 โˆ’ 1 + kA(x โˆ’ y)k 2 โˆš๏ธ โ‰ค 1 + ๐œ€/2 โˆ’ 1 + kAk 2โ†’2 kx โˆ’ yk 2 โ‰ค (1 + ๐œ€/4) โˆ’ 1 + ฮ”๐œ€/16 = (๐œ€/4)(1 + ฮ”/4) for all x โˆˆ Sโ„“2 . Since this upper bound holds for all x with kxk 2 = 1, it will also hold for the maximizer of kAxk 2 with kxk 2 = 1, meaning for that x, kAxk 2 = kAk 2โ†’2 so that ฮ”โˆ’1 โ‰ค (๐œ€/4)(1+ 1+๐œ€/4 ฮ”/4) must also hold. Therefore, ฮ” โ‰ค 1 + ๐œ€/4 + ฮ”๐œ€/16 =โ‡’ ฮ” โ‰ค 1โˆ’๐œ€/16 โ‰ค 1 + ๐œ€/3. 
The upper bound now follows as kAxk 2 โ‰ค ฮ” = supzโˆˆS 2 kAzk 2 for all x โˆˆ Sโ„“2 . โ„“ For the lower bound, let ๐›ฟ := inf zโˆˆSโ„“ 2 kAzk 2 โ‰ฅ 0, and we also note that there exists an element of the compact set Sโ„“2 that realizes ๐›ฟ. Similar to the proof of the upper bound, we consider this minimizing vector x โˆˆ Sโ„“2 and choose an element y โˆˆ C with kx โˆ’ yk 2 โ‰ค ๐œ€/16 in order to observe that โˆš๏ธ ๐›ฟ โˆ’ 1 = kAxk 2 โˆ’ kxk 2 โ‰ฅ kAyk 2 โˆ’ kA(x โˆ’ y)k 2 โˆ’ 1 โ‰ฅ 1 โˆ’ ๐œ€/2 โˆ’ 1 โˆ’ kA(x โˆ’ y)k 2 โˆš๏ธ โ‰ฅ 1 โˆ’ ๐œ€/2 โˆ’ 1 โˆ’ kAk 2โ†’2 kx โˆ’ yk 2 โ‰ฅ (1 โˆ’ ๐œ€/3) โˆ’ 1 โˆ’ ฮ”๐œ€/16 โ‰ฅ โˆ’ (๐œ€/3 + ๐œ€/16 (1 + ๐œ€/3)) โ‰ฅ โˆ’ (๐œ€/3 + ๐œ€/16 + ๐œ€/48) = โˆ’5๐œ€/12. Thus, ๐›ฟ โ‰ฅ 1 โˆ’ 5๐œ€/12 โ‰ฅ 1 โˆ’ ๐œ€. This proves the lower bound as kAxk 2 โ‰ฅ ๐›ฟ for all x โˆˆ Sโ„“2 . The proof of the upper bound on |C| can be found in Appendix C of [5]. Note 4.1.1 According to the lower and upper bounds proved above, (4.2) can use a tighter bound, i.e., (1 โˆ’ ๐œ€/2)kxk 22 โ‰ค kAxk 22 โ‰ค (1 + ๐œ€/2)kxk 22 , 27 as ๐›ฟ โ‰ฅ 1 โˆ’ 5๐œ€/12 โ‰ฅ 1 โˆ’ ๐œ€/2 and ฮ” โ‰ค 1 + ๐œ€/3 โ‰ค 1 + ๐œ€/2. This means the (๐œ€/2)-JL property of A assumed to hold for the elements of C is carried over to all elements of the subspace. In Lemma 4.1.3, the norms of vectors in a subspace are preserved by preserving the norms of all points in the discretized unit ball in that subspace. This makes the dependence on the subspace dimension ๐‘Ÿ exponential according to |C| โ‰ค (47/๐œ€) ๐‘Ÿ . The following lemma uses a coarser discretization to improve the dependence on ๐‘Ÿ so that a better target dimension can be achieved for the JL embedding. This is done by preserving the norms of an orthonormal basis to, by linearity, control the norms of all points in the subspace. If the angles between the elements of the orthonormal basis are preserved very accurately, then the projected basis will also be approximately orthonormal, and the norms of the points that are in the span of the orthonormal basis will also be preserved. Requiring the preservation of the aformentioned angles to be accurate imposes, in turn, a more strict bound on the norm-preserving property of the embedding. This concept is presented in the following lemma. Lemma 4.1.4 Fix ๐œ€ โˆˆ (0, 1) and let L be an ๐‘Ÿ-dimensional subspace of C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ spanned by a set of ๐‘Ÿ orthonormal basis tensors {T๐‘˜ } ๐‘˜ โˆˆ[๐‘Ÿ] . If ๐ฟ is an (๐œ€/4๐‘Ÿ)-JL embedding of the 4 2๐‘Ÿ + ๐‘Ÿ = 2๐‘Ÿ 2 โˆ’ ๐‘Ÿ  tensors ! i i ร˜ ร˜ {T๐‘˜ โˆ’ Tโ„Ž , T๐‘˜ + Tโ„Ž , T๐‘˜ โˆ’ Tโ„Ž , T๐‘˜ + Tโ„Ž } {T๐‘˜ } ๐‘˜โˆˆ[๐‘Ÿ] โŠ‚ L 1โ‰คโ„Ž<๐‘˜ โ‰ค๐‘Ÿ into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 , then k๐ฟ (X)k 2 โˆ’ kXk 2 โ‰ค ๐œ€kXk 2 holds for all X โˆˆ L. Proof According to Lemma 4.1.2, one can see that |๐œ€ ๐‘˜,โ„Ž | := |h๐ฟ (T๐‘˜ ) , ๐ฟ (Tโ„Ž )i โˆ’ hT๐‘˜ , Tโ„Ž i| โ‰ค ๐œ€/๐‘Ÿ 28 ร๐‘Ÿ for all โ„Ž, ๐‘˜ โˆˆ [๐‘Ÿ]. Thus, for any X = ๐‘˜=1 ๐›ผ ๐‘˜ T๐‘˜ โˆˆ L, โˆ‘๏ธ๐‘Ÿ โˆ‘๏ธ ๐‘Ÿ โˆ‘๏ธ ๐‘Ÿ โˆ‘๏ธ ๐‘Ÿ 2 2 k๐ฟ (X) k โˆ’ kXk = ๐›ผ ๐‘˜ ๐›ผโ„Ž (h๐ฟ (T๐‘˜ ) , ๐ฟ (Tโ„Ž )i โˆ’ hT๐‘˜ , Tโ„Ž i) = ๐›ผ ๐‘˜ ๐›ผโ„Ž ๐œ€ ๐‘˜,โ„Ž ๐‘˜=1 โ„Ž=1 ๐‘˜=1 โ„Ž=1 ๐‘Ÿ ๐‘Ÿ โˆ‘๏ธ โˆ‘๏ธ ๐œ€ โ‰ค |๐›ผ ๐‘˜ | |๐›ผโ„Ž | |๐œ€ ๐‘˜,โ„Ž | โ‰ค k๐œถk 21 โ‰ค ๐œ€k๐œถk 22 , ๐‘˜=1 โ„Ž=1 ๐‘Ÿ โˆš where we have used the relation k๐œถk 1 โ‰ค ๐‘Ÿ k๐œถk 2 to obtain the last inequality2. To finish the proof, we must show that kXk 2 = k๐œถk 22 . 
Due to the orthonormality of the basis tensors {T๐‘˜ } ๐‘˜โˆˆ[๐‘Ÿ] , one may write ๐‘Ÿ โˆ‘๏ธ โˆ‘๏ธ ๐‘Ÿ ๐‘Ÿ โˆ‘๏ธ 2 kXk = hX, Xi = ๐›ผ๐‘˜ ๐›ผโ„Ž hT๐‘˜ , Tโ„Ž i = |๐›ผ| 2 = k๐œถk 2 . ๐‘˜=1 โ„Ž=1 ๐‘˜=1 4.2 Johnson-Lindenstrauss Embedings for Low-Rank Tensors In this section, JL embeddings are discussed along the lines of Lemmas 4.1.3 and 4.1.4 in the case of low-CP-rank tensors, i.e., tensors that can be expressed as the weighted sum of a number of rank-1 tensors. 4.2.1 Geometry-Preserving Property of JL Embeddings for Low-Rank Tensors The main purpose of this section is to show how employing modewise JL embeddings affect the norm of a low-rank tensor and its inner product with another low-rank tensor. Considering rank-๐‘Ÿ tensors as members of an ๐‘Ÿ-dimensional tensor subspace spanned by ๐‘Ÿ rank-1 basis tensors, we are essentially assuming that the these basis tensors always exist. This, however, is not guaranteed, and we are not even able to guarantee that there always exist a sufficiently incoherent basis of ๐‘Ÿ rank-1 tensors that span any rank-๐‘Ÿ subspace. To incorporate coherence into our analysis, the concepts of modewise and basis coherence are introduced in the following. Next, the norm and 2 Using โˆš the Cauchy-Schwarz ineqquality for vectors ๐œถ and 1, we have k๐œถk 1 = h|๐œถ| , 1i โ‰ค k๐œถk 2 k1k 2 = ๐‘Ÿ k๐œถk 2 , where |๐œถ| is a vector whose elements are the absolute values of the elements of ๐œถ, and 1 is a vector of all ones. 29 inner product preservation property in the Johnson-Lindenstrauss embeddings of low-rank tensors will be discussed. Definition 4.2.1 (Modewise Coherence) Assume that X admits a decomposition of rank ๐‘Ÿ in the โ€œstandardโ€ form, i.e., ๐‘Ÿ โˆ‘๏ธ โˆ‘๏ธ ๐‘Ÿ X= ๐›ผ ๐‘˜ x ๐‘˜(1) x ๐‘˜(2) ยทยทยท x ๐‘˜(๐‘‘) = ๐›ผ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) , (4.3) ๐‘˜=1 ๐‘˜=1 where kx ๐‘˜(โ„“) k 22 = 1 for ๐‘˜ โˆˆ [๐‘Ÿ] and โ„“ โˆˆ [๐‘‘]. The maximum modewise coherence of X is defined as D E ๐œ‡X := max ๐œ‡X,โ„“ := max max x ๐‘˜(โ„“) , xโ„Ž(โ„“) . (4.4) โ„“โˆˆ[๐‘‘] โ„“โˆˆ[๐‘‘] ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] ๐‘˜โ‰ โ„Ž where ๐œ‡X,โ„“ โˆˆ [0, 1] for โ„“ โˆˆ [๐‘‘] is the modewise coherence for mode โ„“. Also obviously, ๐œ‡X โˆˆ [0, 1]. Definition 4.2.2 (Basis Coherence) Let B be a set of ๐‘Ÿ rank-1 tensors, defined as C๐‘› ร—ยทยทยทร—๐‘› n o ๐‘‘ (โ„“) B := โ„“=1 x ๐‘˜ ๐‘˜ โˆˆ [๐‘Ÿ] โŠ‚ 1 ๐‘‘ n o with kx ๐‘˜(โ„“) k 2 = 1 for ๐‘˜ โˆˆ [๐‘Ÿ] and โ„“ โˆˆ [๐‘‘]. Let L := span ๐‘‘ โ„“=1 x ๐‘˜ (โ„“) ๐‘˜ โˆˆ [๐‘Ÿ] be the span of B. For the set B, the basis coherence is defined as D E ร– ๐‘‘ D E (โ„“) (โ„“) ๐œ‡0B := max ๐‘‘ โ„“=1 x ๐‘˜ , ๐‘‘ โ„“=1 x โ„Ž = max x ๐‘˜(โ„“) , xโ„Ž(โ„“) . (4.5) ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] ๐‘˜โ‰ โ„Ž ๐‘˜โ‰ โ„Ž โ„“=1 It is easy to verify that ๐œ‡0B โˆˆ [0, 1]. Note 4.2.1 Looking at (4.5), one can observe that ๐œ‡0B is in fact the maximum absolute inner product3 of the pairs of rank-1 tensors that form the basis B, hence the name basis coherence. We will also use, with some abuse of notation, the (maximum) modewise coherence and basis coherence for any X โˆˆ L = span (B) and the basis B interchangeably, i.e., ๐œ‡X,โ„“ = ๐œ‡ B,โ„“ , ๐œ‡X = ๐œ‡ B , and ๐œ‡0X = ๐œ‡0B . Finally, it can also be inferred that ร– ๐‘‘ ๐œ‡0X โ‰ค ๐œ‡X,โ„“ โ‰ค ๐œ‡X ๐‘‘ . โ„“=1 3 Defined as per (2.3), or equivalently, the inner product of the vectorized form of the tensors. 
30 0 It should also be noted that ๐œ‡X,โ„“ , ๐œ‡X , and ๐œ‡X always depend on the choice of the particular basis tensors forming B. To further prepare grounds for the main JL embedding result, we discuss how modewise JL embeddings affect the structure, modewise coherence, and the norm of a low-rank tensor of the form (4.3), as well as the inner product of two such low-rank tensors. This will be done through the next few lemmas and theorems below. Lemma 4.2.1 Let ๐‘— โˆˆ [๐‘‘], A โˆˆ C๐‘šร—๐‘› , and X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› ๐‘— 1 ๐‘‘ be a rank-๐‘Ÿ tensor as per (4.3) such ( ๐‘—) that min Ax ๐‘˜ > 0. Then X 0 := X ร— ๐‘— A can be written in standard form as ๐‘˜โˆˆ[๐‘Ÿ] 2 ๐‘Ÿ ( ๐‘—) 0 โˆ‘๏ธ ( ๐‘—) ยฉ  ๐‘—โˆ’1 (โ„“)  Ax ๐‘˜  ๐‘‘ (โ„“) ยช X = ๐›ผ ๐‘˜ Ax ๐‘˜ ยญ โ„“=1 x ๐‘˜ โ„“= ๐‘—+1 x ๐‘˜ ยฎ. 2ยญ ( ๐‘—) ยฎ ๐‘˜=1 Ax ๐‘˜ ยซ 2 ยฌ Furthermore, the modewise coherence of X 0 as above will satisfy D E ( ๐‘—) ( ๐‘—) Ax ๐‘˜ , Axโ„Ž ๐œ‡X 0, ๐‘— = max ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] ( ๐‘—) ( ๐‘—) ๐‘˜โ‰ โ„Ž Ax ๐‘˜ Axโ„Ž 2 2 so that D E ๐œ‡X 0 = max ยญ ๐œ‡X 0, ๐‘— , max max x ๐‘˜(โ„“) , xโ„Ž(โ„“) ยฎ . ยฉ ยช โ„“โˆˆ[๐‘‘]\{ ๐‘— } ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] ยซ ๐‘˜โ‰ โ„Ž ยฌ Proof Using Lemma 2.0.2, the linearity of tensor matricization, and (2.8) we can see that the mode- ๐‘— unfolding of X 0 satisfies โˆ‘๏ธ๐‘Ÿ   โˆ‘๏ธ๐‘Ÿ  > (โ„“) ( ๐‘—) X0( ๐‘—) = AX ( ๐‘—) = A ๐›ผ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜ = ๐›ผ ๐‘˜ Ax ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) ( ๐‘—) ๐‘˜=1 ๐‘˜=1 ๐‘Ÿ  ( ๐‘—) โˆ‘๏ธ ( ๐‘—)  Ax ๐‘˜  > = ๐›ผ ๐‘˜ Ax ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) . 2 ( ๐‘—) ๐‘˜=1 Ax ๐‘˜ 2 31 ( ๐‘—+1) ( ๐‘—โˆ’1) where โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) = x ๐‘˜(๐‘‘) โŠ— ยท ยท ยท โŠ— x ๐‘˜ โŠ— x๐‘˜ โŠ— ยท ยท ยท โŠ— x ๐‘˜(1) . Folding X0( ๐‘—) back into a ๐‘‘-mode tensor then gives us our first equality. The second two equalities follow directly from the definition of modewise coherence. The following lemma will be helpful in proving the norm-preserving property of modewise JL embeddings; it provides a useful relation expressing the norm of the ๐‘—-mode product of a tensor that is in the standard form (4.3) in terms of inner products of its individual factor vectors projected by the embedding. Lemma 4.2.2 Let ๐‘— โˆˆ [๐‘‘], A โˆˆ C๐‘šร—๐‘› , and X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› ๐‘— 1 ๐‘‘ be a rank-๐‘Ÿ tensor in standard form as per (4.3). Then, รŽ ๐‘Ÿ โˆ‘๏ธ โ„“โ‰  ๐‘— ๐‘›โ„“ โˆ‘๏ธ     D E ( ๐‘—) ( ๐‘—) kX ร— ๐‘— Ak 2 = ๐›ผ ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) ๐›ผโ„Ž โŠ—โ„“โ‰  ๐‘— xโ„Ž(โ„“) Ax ๐‘˜ , Axโ„Ž . ๐‘Ž ๐‘Ž ๐‘˜,โ„Ž=1 ๐‘Ž=1 where (ยท)๐‘Ž denotes the ๐‘Ž th element of a vector. Proof Using the linearity of ๐‘—-mode products, tensor matricization, and observing that the Eu- clidean norm of a tensor is equal to the Frobenius norm of any of its unfoldings, one can write ๐‘Ÿ 2 ๐‘Ÿ 2 โˆ‘๏ธ    โˆ‘๏ธ    2 ๐‘‘ (โ„“) ๐‘‘ (โ„“) kX ร— ๐‘— Ak = ๐›ผ๐‘˜ โ„“=1 x ๐‘˜ ร—๐‘— A = ๐›ผ๐‘˜ โ„“=1 x ๐‘˜ ร—๐‘— A ( ๐‘—) ๐‘˜=1 ๐‘˜=1 F ๐‘Ÿ 2 โˆ‘๏ธ  > ( ๐‘—) = ๐›ผ ๐‘˜ Ax ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) ๐‘˜=1 F โˆ‘๏ธ๐‘Ÿ   >  > ( ๐‘—) ( ๐‘—) = ๐›ผ ๐‘˜ Ax ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) , ๐›ผโ„Ž Axโ„Ž โŠ—โ„“โ‰  ๐‘— xโ„Ž(โ„“) ๐‘˜,โ„Ž=1 F where kยทk F and hยท, ยทiF denote the Frobenius matrix norm and inner product, respectively. 
Computing รŽ the Frobenius inner products above columnwise, and noting that there are โ„“โ‰  ๐‘— ๐‘›โ„“ columns in the mode- ๐‘— unfolding, one can further see that รŽ ๐‘Ÿ โˆ‘๏ธ โ„“โ‰  ๐‘— ๐‘›โ„“ โˆ‘๏ธ     D E ( ๐‘—) ( ๐‘—) kX ร— ๐‘— Ak = 2 ๐›ผ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) ๐›ผโ„Ž โŠ—โ„“โ‰  ๐‘— xโ„Ž(โ„“) Ax ๐‘˜ , Axโ„Ž , ๐‘Ž ๐‘Ž ๐‘˜,โ„Ž=1 ๐‘Ž=1 32 completing the proof. The following theorem demonstrates that a single modewise JL embedding of any low-rank tensor X of the standard form (4.3) will preserve its norm up to an error depending on the overall โ„“ 2 -norm of its coefficients ๐œถ โˆˆ C๐‘Ÿ . It should also be noted that in establishing proofs for the following lemmas and theorems, we will encounter situations where applying the polarization identity for inner products (recall how it was used in the proofs of Lemmas 4.1.1 and 4.1.2) will lead to the requirement that the JL property of matrices involved must hold for a set of vectors and combinations of them. For low-rank tensors under consideration, this set will include the vectors forming (4.3), and the corresponding set of interest will be ! i i C๐‘› , ร˜ n o ร˜n o ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) S 0๐‘— = x๐‘˜ โˆ’ xโ„Ž , x๐‘˜ + xโ„Ž , x๐‘˜ โˆ’ xโ„Ž , x๐‘˜ + xโ„Ž x๐‘˜ โŠ‚ ๐‘— (4.6) ๐‘˜โˆˆ[๐‘Ÿ] 1โ‰คโ„Ž<๐‘˜ โ‰ค๐‘Ÿ containing 4 2 ๐‘Ÿ + ๐‘Ÿ = 2๐‘Ÿ 2 โˆ’ ๐‘Ÿ vectors for each mode ๐‘— โˆˆ [๐‘‘]. Theorem 4.2.1 Let ๐‘— โˆˆ [๐‘‘] and X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ be a rank-๐‘Ÿ tensor as per (4.3). Suppose that A โˆˆ C๐‘šร—๐‘› ๐‘— is an (๐œ€/4)-JL embedding of the vectors in the set defined in (4.6) into C๐‘š . Let X 0 := X ร— ๐‘— A and rewrite it in standard form so that ๐‘Ÿ ( ๐‘—) 0 โˆ‘๏ธ ยฉ (โ„“)  Ax ๐‘˜   (โ„“) ยช X = ๐›ผ0๐‘˜ ยญ โ„“< ๐‘— x ๐‘˜ x โ„“> ๐‘— ๐‘˜ ยฎ . ยฎ ยญ ( ๐‘—) ๐‘˜=1 Ax ๐‘˜ ยซ 2 ยฌ Then all of the following hold: (a) ๐›ผ0๐‘˜ โˆ’ ๐›ผ๐‘˜ โ‰ค ๐œ€|๐›ผ ๐‘˜ |/4 for all ๐‘˜ โˆˆ [๐‘Ÿ] so that k๐œถ0 k โˆž โ‰ค (1 + ๐œ€/4)k๐œถk โˆž ๐œ‡ X, ๐‘— +๐œ€ (b) ๐œ‡X 0, ๐‘— โ‰ค 1โˆ’๐œ€/4 , and ๐œ‡X 0,โ„“ = ๐œ‡X,โ„“ for all โ„“ โˆˆ [๐‘‘] \ { ๐‘— } โˆš๏ธ ร–   (c) kX 0 k 2 โˆ’ kXk 2 โ‰ค ๐œ€ ยญ1 + ๐‘Ÿ (๐‘Ÿ โˆ’ 1) ๐œ‡X,โ„“ ยฎ k๐œถk 22 โ‰ค ๐œ€ 1 + ๐‘Ÿ ๐œ‡X ๐‘‘โˆ’1 k๐œถk 22 โ‰ค ๐œ€(๐‘Ÿ +1)k๐œถk 22 ยฉ ยช ยซ โ„“โ‰  ๐‘— ยฌ 33 Proof Properties are proved in order below. (a) By Lemma 4.2.1, one can write for all ๐‘˜ โˆˆ [๐‘Ÿ] that ( ๐‘—) ( ๐‘—) ๐›ผ0๐‘˜ โˆ’ ๐›ผ ๐‘˜ = ๐›ผ ๐‘˜ Ax ๐‘˜ โˆ’ ๐›ผ๐‘˜ = |๐›ผ ๐‘˜ | Ax ๐‘˜ โˆ’ 1 โ‰ค ๐œ€|๐›ผ ๐‘˜ |/4. 2 2 (b) Lemma 4.2.1 and the definition of ๐‘—-mode coherence lead to D E D E ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) Ax ๐‘˜ , Axโ„Ž x๐‘˜ , xโ„Ž + ๐œ€ ๐œ‡X, ๐‘— + ๐œ€ ๐œ‡X 0, ๐‘— = max โ‰ค max = , ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] Ax ( ๐‘—) Ax ( ๐‘—) ๐‘˜,โ„Žโˆˆ[๐‘Ÿ] 1 โˆ’ ๐œ€4 1 โˆ’ ๐œ€4 ๐‘˜โ‰ โ„Ž ๐‘˜ 2 โ„Ž 2 ๐‘˜โ‰ โ„Ž where the inequality follows from A being an (๐œ€/4)-JL embedding and Lemma 4.1.1. (c) Applying Lemma 4.2.2 one can observe that รŽ ๐‘Ÿ โˆ‘๏ธ โ„“โ‰  ๐‘— ๐‘›โ„“ โˆ‘๏ธ     D E D E ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) 0 2 kX k โˆ’ kXk = 2 ๐›ผ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) ๐›ผโ„Ž โŠ—โ„“โ‰  ๐‘— xโ„Ž(โ„“) Ax ๐‘˜ , Axโ„Ž โˆ’ x๐‘˜ , xโ„Ž . (4.7) ๐‘Ž ๐‘Ž ๐‘˜,โ„Ž=1 ๐‘Ž=1 Lemma 4.1.1 can be applied to each inner product in (4.7) to get D E D E ( ๐‘—) ( ๐‘—) ( ๐‘—) ( ๐‘—) Ax ๐‘˜ , Axโ„Ž = x๐‘˜ , xโ„Ž + ๐œ€ ๐‘˜,โ„Ž for some ๐œ€ ๐‘˜,โ„Ž โˆˆ C with ๐œ€ ๐‘˜,โ„Ž โ‰ค ๐œ€. 
As a result we have that รŽ ๐‘Ÿ โˆ‘๏ธ โ„“โ‰  ๐‘— ๐‘›โ„“ โˆ‘๏ธ     kX 0 k 2 โˆ’ kXk 2 = ๐›ผ ๐‘˜ โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) ๐›ผโ„Ž โŠ—โ„“โ‰  ๐‘— xโ„Ž(โ„“) ๐œ€ ๐‘˜,โ„Ž . ๐‘Ž ๐‘Ž ๐‘˜,โ„Ž=1 ๐‘Ž=1 รŽ ๐‘Ÿ โˆ‘๏ธ โ„“โ‰  ๐‘— ๐‘›โ„“  โˆ‘๏ธ    = ๐›ผ ๐‘˜ ๐›ผโ„Ž ๐œ€ ๐‘˜,โ„Ž โŠ—โ„“โ‰  ๐‘— x ๐‘˜(โ„“) โŠ—โ„“โ‰  ๐‘— xโ„Ž(โ„“) ๐‘Ž ๐‘Ž ๐‘˜,โ„Ž=1 ๐‘Ž=1 ๐‘Ÿ โˆ‘๏ธ D E โˆ‘๏ธ๐‘Ÿ ร–D E (โ„“) (โ„“) = ๐›ผ ๐‘˜ ๐›ผโ„Ž ๐œ€ ๐‘˜,โ„Ž โ„“โ‰  ๐‘— x ๐‘˜ , โ„“โ‰  ๐‘— x โ„Ž = ๐›ผ ๐‘˜ ๐›ผโ„Ž ๐œ€ ๐‘˜,โ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) ๐‘˜,โ„Ž=1 ๐‘˜,โ„Ž=1 โ„“โ‰  ๐‘— โˆ‘๏ธ๐‘Ÿ ร– 2 โˆ‘๏ธ ร–D E โ‰ค 2 |๐›ผ ๐‘˜ | ๐œ€ ๐‘˜,๐‘˜ x ๐‘˜(โ„“) + ๐›ผ ๐‘˜ ๐›ผโ„Ž ๐œ€ ๐‘˜,โ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) . ๐‘˜=1 โ„“โ‰  ๐‘— ๐‘˜โ‰ โ„Ž โ„“โ‰  ๐‘— 34 2 x ๐‘˜(โ„“) = 1 as x ๐‘˜(โ„“) รŽ Noting that โ„“โ‰  ๐‘— = 1 for all โ„“ โˆˆ [๐‘‘] and ๐‘˜ โˆˆ [๐‘Ÿ], it follows that 2 โˆ‘๏ธ๐‘Ÿ โˆ‘๏ธ ร–D E 0 2 kX k โˆ’ kXk 2 โ‰ค ๐œ€ |๐›ผ ๐‘˜ | 2 + ๐›ผ ๐‘˜ ๐›ผโ„Ž ๐œ€ ๐‘˜,โ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) ๐‘˜=1 ๐‘˜โ‰ โ„Ž โ„“โ‰  ๐‘—   > > = ๐œ€k๐œถk 22 + E ๐œถ, ๐œถ โ‰ค ๐œ€+ E 2โ†’2 k๐œถk 22 , C รŽD E where E โˆˆ ๐‘Ÿร—๐‘Ÿ is zero on its diagonal, ๐ธ ๐‘˜,โ„Ž = ๐œ€ ๐‘˜,โ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) for ๐‘˜ โ‰  โ„Ž, and the operator โ„“โ‰  ๐‘— norm kE> k 2โ†’2 satisfies v u v u u u 2 u u 2 tโˆ‘๏ธ ร– D E tโˆ‘๏ธ ร– D E E> 2โ†’2 โ‰ค kEk ๐น โ‰ค x ๐‘˜(โ„“) , xโ„Ž(โ„“) ๐œ€2 = ๐œ€ ยท x ๐‘˜(โ„“) , xโ„Ž(โ„“) . ๐‘˜โ‰ โ„Ž โ„“โ‰  ๐‘— ๐‘˜โ‰ โ„Ž โ„“โ‰  ๐‘— Finally, the definition of ๐œ‡X implies that โˆš๏ธ ร– ๐‘‘โˆ’1 kEk 2โ†’2 โ‰ค ๐œ€ ๐‘Ÿ (๐‘Ÿ โˆ’ 1) ๐œ‡X,โ„“ โ‰ค ๐œ€๐‘Ÿ ๐œ‡X . โ„“โ‰  ๐‘— Thus, the desired bound can be obtained, i.e., โˆš๏ธ ร–   kX 0 k 2 โˆ’ kXk 2 โ‰ค ๐œ€ ยญ1 + ๐‘Ÿ (๐‘Ÿ โˆ’ 1) ๐œ‡X,โ„“ ยฎ k๐œถk 22 โ‰ค ๐œ€ 1 + ๐‘Ÿ ๐œ‡X ๐‘‘โˆ’1 k๐œถk 22 . ยฉ ยช ยซ โ„“โ‰  ๐‘— ยฌ Theorem 4.2.2 provides an upper bound for the distortion in the norm of a tensor when modewise JL embeddings are applied to all its modes. The following remark provides a useful tool in the proof of the theorem. Remark 4.2.1 Let ๐‘, ๐‘‘ โˆˆ R+. Then, 1+ ๐‘ ๐‘‘ ๐‘‘  โ‰ค e๐‘ . Theorem 4.2.2 Let ๐œ€ โˆˆ (0, 3/4]. Assume that X admits a decomposition of rank ๐‘Ÿ in the standard form as per (4.3). Also, assume that the matrices A ( ๐‘—) โˆˆ ๐‘š ๐‘— ร—๐‘› ๐‘— are 4๐‘‘ C ๐œ€  -JL embeddings of the vectors in S 0๐‘— as per (4.6) into C๐‘š ๐‘— for each ๐‘— โˆˆ [๐‘‘]. If Y = X ร—1 A (1) ร—2 A (2) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) , (4.8) 35 then, e + e2  โˆš๏ธ   kY k 2 โˆ’ kXk 2 โ‰ค ๐œ€ ๐‘Ÿ (๐‘Ÿ โˆ’ 1) max ๐œ€ ๐‘‘โˆ’1 , ๐œ‡X ๐‘‘โˆ’1 k๐œถk 22 (4.9) โ‰ค๐œ€ e 2 (๐‘Ÿ + 1) k๐œถk 22 , and if the maximum modewise coherence of X is zero, i.e., if ๐œ‡X = 0, then e e  โˆš๏ธ  kY k 2 โˆ’ kXk 2 โ‰ค ๐œ€ + ๐‘Ÿ (๐‘Ÿ โˆ’ 1)๐œ€ ๐‘‘ k๐œถk 22 . (4.10) Proof Let X (0) := X and X (๐‘‘) := Y, and for each ๐‘— โˆˆ [๐‘‘] define the partially compressed tensor โˆ‘๏ธ๐‘Ÿ     ( ๐‘—) (1) ( ๐‘—) ๐‘— (โ„“) (โ„“) ๐‘‘ (โ„“) X := X ร—1 A ยท ยท ยท ร—๐‘— A = ๐›ผ๐‘˜ โ„“=1 A x ๐‘˜ โ„“= ๐‘—+1 x ๐‘˜ ๐‘˜=1 ๐‘Ÿ (4.11) โˆ‘๏ธ ๐‘‘ (โ„“) =: ๐›ผ ๐‘—,๐‘˜ โ„“=1 x ๐‘—,๐‘˜ , ๐‘˜=1 expressed in standard form via ๐‘— applications of Lemma 4.2.1. By looking closely at the second and third equalities above, one can observe that for all ๐‘— โˆˆ [๐‘‘], ๐›ผ ๐‘—,๐‘˜ = ๐›ผ ๐‘˜ โ„“=1 A (โ„“) x ๐‘˜(โ„“) , as well รŽ๐‘— as x (โ„“) (โ„“) (โ„“) ๐‘—,๐‘˜ = A x ๐‘˜ / A x ๐‘˜ (โ„“) (โ„“) for โ„“ โˆˆ [ ๐‘—] and x (โ„“) = x (โ„“) for โ„“ > ๐‘—. 
๐‘—,๐‘˜ ๐‘˜ The first two parts of Theorem 4.2.1 can be used to write (๐‘–) ๐›ผ ๐‘—,๐‘˜ โˆ’ ๐›ผ ๐‘—โˆ’1,๐‘˜ โ‰ค ๐œ€|๐›ผ ๐‘—โˆ’1,๐‘˜ |/4๐‘‘ so that |๐›ผ ๐‘—,๐‘˜ | โ‰ค (1 + ๐œ€/4๐‘‘)|๐›ผ ๐‘—โˆ’1,๐‘˜ | holds for all ๐‘˜ โˆˆ [๐‘Ÿ], and (๐‘–๐‘–) ๐œ‡X ( ๐‘—) , ๐‘— โ‰ค (๐œ‡X ( ๐‘—โˆ’1) , ๐‘— + ๐œ€/๐‘‘)/(1 โˆ’ ๐œ€/4๐‘‘), and ๐œ‡X ( ๐‘—) ,โ„“ = ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ for all โ„“ โˆˆ [๐‘‘] \ { ๐‘— }, for ๐‘— โˆˆ [๐‘‘]. Using these facts inductively, it can also be established that both |๐›ผ ๐‘—,๐‘˜ | โ‰ค (1 + ๐œ€/4๐‘‘) ๐‘— |๐›ผ ๐‘˜ |, (4.12) and   ๐‘—โˆ’1 ร– ยฉร– ๐œ‡X,โ„“ + ๐œ€/๐‘‘ ยช ร– ๐œ‡X + ๐œ€/๐‘‘ ๐‘‘โˆ’ ๐‘— ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ โ‰ค ยญ ๐œ‡X,โ„“ โ‰ค ๐œ‡X , (4.13) โˆ’ โˆ’ ยฎ 1 ๐œ€/4๐‘‘ 1 ๐œ€/4๐‘‘ โ„“โ‰  ๐‘— ยซ โ„“< ๐‘— ยฌ โ„“> ๐‘— 36 รŽ รŽ รŽ hold for all ๐‘˜ โˆˆ [๐‘Ÿ] and ๐‘— โˆˆ [๐‘‘]. To prove (4.13), one can write โ„“โ‰  ๐‘— ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ = โ„“< ๐‘— ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ โ„“> ๐‘— ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ . รŽ The second term on the right-hand side is equal to โ„“> ๐‘— ๐œ‡X,โ„“ since โ„“ > ๐‘—. For the first term, we have ๐‘—โˆ’2 ! ๐‘—โˆ’3 ! ร– ร– ร– ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ = ๐œ‡X ( ๐‘—โˆ’1),โ„“ ๐œ‡X ( ๐‘—โˆ’1) , ๐‘—โˆ’1 = ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ ๐œ‡X ( ๐‘—โˆ’1) , ๐‘—โˆ’2 ๐œ‡X ( ๐‘—โˆ’1) , ๐‘—โˆ’1 โ„“< ๐‘— โ„“=1 โ„“=1 ๐‘—โˆ’3 ! ๐‘—โˆ’1 ร– ร– = ๐œ‡X ( ๐‘—โˆ’1) ,โ„“ ๐œ‡X ( ๐‘—โˆ’2) , ๐‘—โˆ’2 ๐œ‡X ( ๐‘—โˆ’1) , ๐‘—โˆ’1 = ยท ยท ยท = ๐œ‡X (โ„“) ,โ„“ (4.14) โ„“=1 โ„“=1 ๐‘—โˆ’1 ๐‘—โˆ’1 ๐œ‡X (โ„“โˆ’1) ,โ„“ + ๐œ€/๐‘‘ ร–   ๐‘—โˆ’1 ร– ๐œ‡X,โ„“ + ๐œ€/๐‘‘ ๐œ‡X + ๐œ€/๐‘‘ โ‰ค = โ‰ค . โ„“=1 1 โˆ’ ๐œ€/4๐‘‘ โ„“=1 1 โˆ’ ๐œ€/4๐‘‘ 1 โˆ’ ๐œ€/4๐‘‘ In (4.13), ๐œ‡X 0 = 1 is assumed even if ๐œ‡ = 0 since this still yields the correct bound in the case X where ๐‘— = ๐‘‘ and ๐œ‡X = 0. To get the desired error bound, we can now see that ๐‘‘โˆ’1 โˆ‘๏ธ 2 2 kXk 2 โˆ’ kY k 2 = X ( ๐‘—) โˆ’ X ( ๐‘—+1) ๐‘—=0 ๐‘‘โˆ’1 ๐œ€ โˆ‘๏ธ ยฉ โˆš๏ธ ร– โ‰ค ยญ1 + ๐‘Ÿ (๐‘Ÿ โˆ’ 1) ๐œ‡X ( ๐‘—) ,โ„“ ยฎ k๐œถ ๐‘— k 22 ยช ๐‘‘ ๐‘—=0 โ„“โ‰  ๐‘—+1 ยซ ยฌ ! ๐‘‘โˆ’1   ๐‘— ๐œ€ โˆ‘๏ธ โˆš๏ธ ๐œ‡X + ๐œ€/๐‘‘ (1 + ๐œ€/4๐‘‘) 2 ๐‘— k๐œถk 22 ๐‘‘โˆ’1โˆ’ ๐‘— โ‰ค 1 + ๐‘Ÿ (๐‘Ÿ โˆ’ 1) ๐œ‡X ๐‘‘ ๐‘—=0 1 โˆ’ ๐œ€/๐‘‘4 ๐‘‘โˆ’1  ๐‘— ! ๐œ€ โˆ‘๏ธ โˆš๏ธ ๐œ‡X + ๐œ€/๐‘‘ (1 + 9๐œ€/16๐‘‘) ๐‘— k๐œถk 22 , ๐‘‘โˆ’1โˆ’ ๐‘— โ‰ค 1 + ๐‘Ÿ (๐‘Ÿ โˆ’ 1) ๐œ‡X ๐‘‘ ๐‘—=0 1 โˆ’ ๐œ€/๐‘‘4 where the third part of Theorem 4.2.1, as well as (4.12) and (4.13) have been used. Considering each term in the upper bound above separately, we have that ๐œ€  โˆš๏ธ  2 2 2 kXk โˆ’ kY k โ‰ค k๐œถk 2 ๐‘‡1 + ๐‘Ÿ (๐‘Ÿ โˆ’ 1)๐‘‡2 ๐‘‘ where ๐‘‘โˆ’1 (1 + 9๐œ€/16๐‘‘) ๐‘‘ โˆ’ 1 e๐‘‘ โˆ‘๏ธ ๐‘‡1 := (1 + 9๐œ€/16๐‘‘) ๐‘— = โ‰ค ๐‘—=0 9๐œ€/16๐‘‘ 37 using 9๐œ€/16 < 1, and where ๐‘‘โˆ’1  ๐‘— ๐‘‘โˆ’1 โˆ‘๏ธ ๐œ‡X + ๐œ€/๐‘‘ ๐‘‘โˆ’1โˆ’ ๐‘— โˆ‘๏ธ ๐‘‘โˆ’1โˆ’ ๐‘— ๐‘‡2 := ๐œ‡X (1 + 9๐œ€/16๐‘‘) ๐‘— โ‰ค (๐œ‡X + ๐œ€/๐‘‘) ๐‘— ๐œ‡X (1 + ๐œ€/๐‘‘) ๐‘— ๐‘—=0 1 โˆ’ ๐œ€/๐‘‘4 ๐‘—=0 for ๐œ€ โ‰ค 3/4. Continuing to bound the second term we will consider three cases. First, if ๐œ‡X = 0 then ๐‘‡2 โ‰ค (๐œ€/๐‘‘) ๐‘‘โˆ’1 (1 + ๐œ€/๐‘‘) ๐‘‘โˆ’1 โ‰ค e (๐œ€/๐‘‘) ๐‘‘โˆ’1 , using Remark 4.2.1 and that ๐œ€ < 1. 
Second, if 0 < ๐œ‡X โ‰ค ๐œ€ then ๐‘‘โˆ’1 โˆ‘๏ธ ๐‘‘โˆ’1 โˆ‘๏ธ ๐‘— ๐‘‡2 โ‰ค (๐œ€ + ๐œ€/๐‘‘) ๐œ€ ๐‘‘โˆ’1โˆ’ ๐‘— ๐‘— (1 + ๐œ€/๐‘‘) = ๐œ€ ๐‘‘โˆ’1 (1 + 1/๐‘‘) ๐‘— (1 + ๐œ€/๐‘‘) ๐‘— ๐‘—=0 ๐‘—=0 e โ‰ค ๐œ€ ๐‘‘โˆ’1 ๐‘‘ (1 + 1/๐‘‘) ๐‘‘ (1 + ๐œ€/๐‘‘) ๐‘‘ โ‰ค ๐‘‘ 2 ๐œ€ ๐‘‘โˆ’1 , using Remark 4.2.1 and that ๐œ€ < 1 once more. If, however, ๐œ‡X > ๐œ€ then we can see that ๐‘‘โˆ’1 โˆ‘๏ธ ๐‘‘โˆ’1 โˆ‘๏ธ ๐‘‡2 โ‰ค ๐œ‡X ๐‘‘โˆ’1 (1 + ๐œ€/๐œ‡X ๐‘‘) ๐‘— (1 + ๐œ€/๐‘‘) ๐‘— โ‰ค ๐œ‡X ๐‘‘โˆ’1 (1 + 1/๐‘‘) ๐‘— (1 + ๐œ€/๐‘‘) ๐‘— ๐‘—=0 ๐‘—=0 โ‰ค ๐œ‡X ๐‘‘โˆ’1 ยท ๐‘‘ (1 + 1/๐‘‘) ๐‘‘ (1 + ๐œ€/๐‘‘) ๐‘‘ โ‰ค ๐œ‡X ๐‘‘โˆ’1 ๐‘‘ e1+๐œ€ โ‰ค ๐‘‘ e2 ๐œ‡X๐‘‘โˆ’1, where we have again utilized Remark 4.2.1. The desired result now follows. Theorem 4.2.2 expresses the distortion in the Euclidean norm of a low-rank tensor X after applying modewise JL embeddings in terms of its low-rank expansion coefficients norm k๐œถk 2 . The following lemma helps express the distortion in terms of the norm of a tensor X in with sufficiently small modewise coherence by establishing its relation to the norm of ๐œถ, as this is usually the convention in expressing error guarantees for JL embeddings. Lemma 4.2.3 Let X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ be a rank-๐‘Ÿ tensor in the standard form as per (4.3) with basis coherence ๐œ‡0X < (๐‘Ÿ โˆ’ 1) โˆ’1 . Then,   ! 2 1 1 k๐œถk 2 โ‰ค kXk 2 โ‰ค kXk 2 1 โˆ’ (๐‘Ÿ โˆ’ 1)๐œ‡0X รŽ๐‘‘ 1 โˆ’ (๐‘Ÿ โˆ’ 1) โ„“=1 ๐œ‡X,โ„“ ! (4.15) 1 โ‰ค kXk 2 . 1 โˆ’ (๐‘Ÿ โˆ’ 1)๐œ‡X ๐‘‘ 38 Proof To establish the result, one can write * ๐‘Ÿ ๐‘Ÿ + ๐‘Ÿ โˆ‘๏ธ โˆ‘๏ธ โˆ‘๏ธ D E kXk 2 = hX, Xi = ๐›ผ ๐‘˜ โ„“=1 ๐‘‘ x ๐‘˜(โ„“) , ๐›ผ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) = ๐›ผ๐‘˜ ๐›ผโ„Ž ๐‘‘ (โ„“) โ„“=1 x ๐‘˜ , ๐‘‘ (โ„“) โ„“=1 x โ„Ž ๐‘˜=1 ๐‘˜=1 ๐‘˜,โ„Ž=1 โˆ‘๏ธ๐‘Ÿ ร– ๐‘‘ D E โˆ‘๏ธ๐‘Ÿ โˆ‘๏ธ๐‘Ÿ ร– ๐‘‘ D E = ๐›ผ๐‘˜ ๐›ผโ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) = 2 |๐›ผ ๐‘˜ | + ๐›ผ๐‘˜ ๐›ผโ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) ๐‘˜,โ„Ž=1 โ„“=1 ๐‘˜=1 ๐‘˜โ‰ โ„Ž โ„“=1 ๐‘Ÿ โˆ‘๏ธ ร– ๐‘‘ D E = k๐œถk 2 + ๐›ผ๐‘˜ ๐›ผโ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) . ๐‘˜โ‰ โ„Ž โ„“=1 where (A.8) has been used to get the fourth equality. Therefore, โˆ‘๏ธ๐‘Ÿ ร– ๐‘‘ D E โˆ‘๏ธ๐‘Ÿ ร– ๐‘‘ D E ๐‘Ÿ โˆ‘๏ธ 0 2 kXk โˆ’ k๐œถk 22 = ๐›ผ๐‘˜ ๐›ผโ„Ž x ๐‘˜(โ„“) , xโ„Ž(โ„“) โ‰ค |๐›ผ ๐‘˜ ๐›ผโ„Ž | x ๐‘˜(โ„“) , xโ„Ž(โ„“) โ‰ค ๐œ‡X |๐›ผ๐‘˜ ๐›ผโ„Ž | ๐‘˜โ‰ โ„Ž โ„“=1 ๐‘˜โ‰ โ„Ž โ„“=1 ๐‘˜โ‰ โ„Ž !2 ๐‘Ÿ ๏ฃฏ ๐‘Ÿ ๏ฃฎ โˆ‘๏ธ โˆ‘๏ธ ๏ฃน   0 0 0 |๐›ผ ๐‘˜ | 2 ๏ฃบ๏ฃบ = ๐œ‡X k๐œถk 21 โˆ’ k๐œถk 22 โ‰ค ๐œ‡X (๐‘Ÿ โˆ’ 1) k๐œถk 22 , ๏ฃบ = ๐œ‡X ๏ฃฏ๏ฃฏ |๐›ผ ๐‘˜ | โˆ’ ๏ฃฏ ๐‘˜=1 ๐‘˜=1 ๏ฃบ ๏ฃฐ ๏ฃป 0 yielding the result and implying that ๐œ‡X (๐‘Ÿ โˆ’ 1) < 1 should also hold given that both kXk and k๐œถk 2 โˆš are non-negative numbers. The relation k๐œถk 1 โ‰ค ๐‘Ÿ k๐œถk 2 has been used to get the final inequality. Theorem 4.2.2 guarantees that modewise JL embeddings approximately preserve the norms of all tensors in the form of (4.3). Theorem 4.2.3 below states the inner product preservation property of JL embeddimgs for low-rank tensors, and guarantees that the inner products of all tensors in the C ๐‘›1 ยทยทยทร—๐‘› ๐‘‘ are preserved.  
๐‘‘ โ„“ span of the set B := โ„“=1 x ๐‘˜ ๐‘˜ โˆˆ [๐‘Ÿ] โˆˆ C Theorem 4.2.3 Suppose that X1 , X2 โˆˆ L โŠ‚ ๐‘›1 ร—ยทยทยทร—๐‘›๐‘‘ have standard forms as per (4.3), in terms C ๐‘›1 ยทยทยทร—๐‘› ๐‘‘ , given by  ๐‘‘ โ„“ of the elements of the basis B := โ„“=1 x ๐‘˜ ๐‘˜ โˆˆ [๐‘Ÿ] โˆˆ โˆ‘๏ธ ๐‘Ÿ โˆ‘๏ธ๐‘Ÿ X1 = ๐›ฝ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) , and X2 = ๐›ผ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) . ๐‘˜=1 ๐‘˜=1 C Let ๐œ€ โˆˆ (0, 3/4], and A ( ๐‘—) โˆˆ ๐‘š ๐‘— ร—๐‘› ๐‘— be defined as per Theorem 4.2.2 for each ๐‘— โˆˆ [๐‘‘]. Then, ? ? * ๐‘‘ ๐‘‘ +   A ( ๐‘—) , X2 A ( ๐‘—) โˆ’ hX1 , X2 i โ‰ค 2๐œ€0 k ๐œทk 22 + k๐œถk 22 โ‰ค 4๐œ€0 ยท max k ๐œทk 22 , k๐œถk 22  X1 ๐‘—=1 ๐‘—=1 kX1 k 2 , kX2 k 2  0 max โ‰ค 4๐œ€ ยท , 1 โˆ’ (๐‘Ÿ โˆ’ 1)๐œ‡0B 39 where e e ๏ฃฑ ๏ฃด  โˆš๏ธ  ๐œ€ + ๐‘Ÿ (๐‘Ÿ โˆ’ 1)๐œ€ ๐‘‘ if ๐œ‡ B = 0, ๏ฃด ๏ฃด ๏ฃฒ ๏ฃด ๐œ€0 := (4.16) e e  โˆš๏ธ   ๏ฃด ๐œ€ + 2 ๐‘Ÿ (๐‘Ÿ โˆ’ 1) max ๐œ€ ๐‘‘โˆ’1 , ๐œ‡ B ๏ฃด ๐‘‘โˆ’1 ๏ฃด ๏ฃด otherwise. ๏ฃณ Proof Using the polarization identity in combination with Lemma 2.0.2 and Theorem 4.2.2 we can see that ? ? * ๐‘‘ ๐‘‘ + X1 A ( ๐‘—) , X2 A ( ๐‘—) โˆ’ hX1 , X2 i ๐‘—=1 ๐‘—=1 2 3 ? ๐‘‘ ? ๐‘‘ 1 โˆ‘๏ธ i ( ๐‘—) i A ( ๐‘—) i 2ยช ยฉ โ„“ยญ โ„“ = ยญ X1 A + X2 โˆ’ X1 + โ„“ X2 2 ยฎยฎ 4 โ„“=0 ๐‘—=1 ๐‘—=1 ยซ 2 ยฌ 2 3 ? ๐‘‘ 1 โˆ‘๏ธ i i i ยฉ  2ยช = โ„“ยญ X 1 + โ„“ X 2 A ( ๐‘—) โˆ’ X1 + โ„“ X2 2 ยฎยฎ 4 โ„“=0 ยญ ๐‘—=1 ยซ 2 ยฌ 2 3 ? ๐‘‘ 1 โˆ‘๏ธ i i  2 โ‰ค X1 + โ„“ X2 A ( ๐‘—) โˆ’ X1 + โ„“ X2 2 4 โ„“=0 ๐‘—=1 2 3 3 1 i 2 1 โˆ‘๏ธ โˆ‘๏ธ โ‰ค ๐œ€0 ๐œท + โ„“ ๐œถ 2 โ‰ค ๐œ€0 (k ๐œทk 2 + k๐œถk 2 ) 2 = ๐œ€0 (k ๐œทk 2 + k๐œถk 2 ) 2 4 โ„“=0 4 โ„“=0   โ‰ค 2๐œ€0 k ๐œทk 22 + k๐œถk 22 โ‰ค 4๐œ€0 max k ๐œทk 22 , k๐œถk 22 ,  where the third to last and second to last inequalities follow from the triangle inequality and Youngโ€™s inequality for products, respectively. Applying Lemma 4.2.3 to relate the Euclidean norms of ๐œท and ๐œถ to X1 and X2 , respectively, leads to the final result. To sum up, theorems 4.2.2 and 4.2.3 guarantee that modewise JL embeddings approximately preserve the norms of and inner products between all tensors in the span of the set C๐‘› ร—ยทยทยทร—๐‘› . n o ๐‘‘ (โ„“) B := โ„“=1 x ๐‘˜ | ๐‘˜ โˆˆ [๐‘Ÿ] โŠ‚ 1 ๐‘‘ 40 4.2.1.1 Computational Complexity of Modewise Johnson-Lindenstrauss Embeddings >๐‘‘ Assuming X โˆˆ R๐‘› ร—...ร—๐‘› 1 ๐‘‘ and A๐‘š ๐‘— ร—๐‘› ๐‘— with ๐‘š ๐‘— โ‰ค ๐‘› ๐‘— , we have X ๐‘—=1 A ( ๐‘—) โˆˆ R๐‘š ร—...ร—๐‘š . The total 1 ๐‘‘ operation count4 of the embedding is the sum of the operation counts for each ๐‘—-mode product in each mode. Therefore, one can show that the operations count will be O (๐‘š 1 ๐‘›1 . . . ๐‘› ๐‘‘ ) + O (๐‘š 1 ๐‘š 2 ๐‘›2 . . . ๐‘› ๐‘‘ ) + ยท ยท ยท + O (๐‘š 1 ๐‘š 2 . . . ๐‘š ๐‘‘ ๐‘› ๐‘‘ ) = O (๐‘š 1 ๐‘›1 . . . ๐‘› ๐‘‘ ) . On the other hand, if X is vectorized, to achieve the same compression, it should be left-multiplied by a JL matrix A โˆˆ R๐‘š ...๐‘š ร—๐‘› ...๐‘› . The computational complexity would then be O (๐‘š1๐‘›1 . . . ๐‘š ๐‘‘ ๐‘›๐‘‘ ) 1 ๐‘‘ 1 ๐‘‘ which is significantly higher than the complexity of the modewise approach. 4.2.2 Main Theorems: Oblivious Tensor Subspace Embeddings So far, no assumption has been made about the type of JL embeddings that have been considered for dimension reduction of tensor data. In this section, the main theorems establishing bounds on the embedding dimension are presented in the case where randomness is incorporated into the JL embeddings being used. 
In the case of finite-dimensional bases, the JL embeddings of interest will be matrices drawn from random distributions, or are contructed as the product of matrices some of which have random properties. This section starts with the definition of the family of ๐œ‚-optimal JL embedding distributions, followed by the main theorems.  Definition 4.2.3 Fix ๐œ‚ โˆˆ (0, 1/2) and let D (๐‘š,๐‘›) N N be a family of probability distributions (๐‘š,๐‘›)โˆˆ ร— where each D (๐‘š,๐‘›) is a distribution over ๐‘š ร— ๐‘› matrices. We will refer to any such family of distributions as being an ๐œ‚-optimal family of JL embedding distributions if there exists an absolute constant ๐ถ โˆˆ R+ such that, for any given ๐œ€ โˆˆ (0, 1), ๐‘š, ๐‘› โˆˆ N with ๐‘š < ๐‘›, and nonempty set SโŠ‚ C๐‘› of cardinality ๐œ€2 ๐‘š   |๐‘†| โ‰ค ๐œ‚ exp , ๐ถ 4 Here, the elements of tensors and matrices are assumed to belong to the field of real numbers, for simplicity. The operation count can be updated accordingly when the field of complex numbers is considered. 41 a matrix A โˆผ D (๐‘š,๐‘›) will be an ๐œ€-JL embedding of S into C๐‘š with probability at least 1 โˆ’ ๐œ‚. In fact, many ๐œ‚-optimal families of JL embedding distributions exist for any given ๐œ‚ โˆˆ (0, 1/2), in- cluding, e.g., those associated with random matrices having independent and identically distributed (i.i.d.) sub-Gaussian entries (see Lemma 9.35 in [5]). We are now ready to state the main oblivious subspace property of JL embeddings for low-rank tensors. It should be noted, however, that as Lemma 4.2.3 suggests, an incoherence assumption is necessary to establish a relation between the Euclidean norm of a low-rank tensor and the norm of its expansion coefficients in the standard form. This assumption will be necessary for the proof of Theorem 4.2.4 to work. Theorem 4.2.4 Fix ๐œ€, ๐œ‚ โˆˆ (0, 1/2) and ๐‘‘ โ‰ฅ 2. Let L be an ๐‘Ÿ-dimensional subspace of ๐‘›1 ร—ยทยทยทร—๐‘›๐‘‘ C n o spanned by a basis of rank-1 tensors B := ๐‘‘ x (โ„“) ๐‘˜ โˆˆ [๐‘Ÿ] with modewise coherence as โ„“=1 ๐‘˜ per (4.4) satisfying ๐œ‡ B๐‘‘โˆ’1 < 1/2๐‘Ÿ. For each ๐‘— โˆˆ [๐‘‘] draw A ( ๐‘—) โˆˆ C๐‘š ร—๐‘› ๐‘— ๐‘— with   ๐‘š ๐‘— โ‰ฅ ๐ถหœ ยท ๐‘Ÿ 2/๐‘‘ ๐‘‘ 2 /๐œ€ 2 ยท ln 2๐‘Ÿ 2 ๐‘‘/๐œ‚ (4.17) from an (๐œ‚/๐‘‘)-optimal family of JL embedding distributions, where ๐ถหœ โˆˆ R+ is an absolute constant. Then, with probability at least 1 โˆ’ ๐œ‚ we have 2 X ร—1 A (1) ยท ยท ยท ร—๐‘‘ A (๐‘‘) โˆ’ kXk 2 โ‰ค ๐œ€ kXk 2 , (4.18) for all X โˆˆ L. Proof Let B be a set of ๐‘Ÿ rank-1 tensors, defined as C๐‘› ร—ยทยทยทร—๐‘› n o ๐‘‘ (โ„“) B := โ„“=1 x ๐‘˜ ๐‘˜ โˆˆ [๐‘Ÿ] โŠ‚ 1 ๐‘‘ with kx ๐‘˜(โ„“) k 2 = 1 for ๐‘˜ โˆˆ [๐‘Ÿ] and โ„“ โˆˆ [๐‘‘], and let L := span (B). We first note the coherence assumption ๐œ‡ B ๐‘‘โˆ’1 < 1/2๐‘Ÿ guarantees that ๐œ‡0B โ‰ค ๐œ‡ B ๐‘‘ โ‰ค ๐œ‡B ๐‘‘โˆ’1 < 1/2๐‘Ÿ < 1/2(๐‘Ÿ โˆ’ 1), 42 which can be rearranged as 4/(1 โˆ’ (๐‘Ÿ โˆ’ 1)๐œ‡0B ) โ‰ค 8. e Also, letting ๐›ฟ := (1/๐‘Ÿ) 1/๐‘‘ ๐œ€/16 and according to (4.16), it is enough to have ๐œ€ โ‰ฅ 8๐›ฟ + e e ๐‘‘โˆ’1 ) so that each embedding A ( ๐‘—) will be a (๐›ฟ/4๐‘‘)-JL embedding of the set 8๐›ฟ 2๐‘Ÿ max(๐›ฟ ๐‘‘โˆ’1 , ๐œ‡ B ๐‘†0๐‘— in (4.6) into C๐‘š ๐‘— where ๐œ€0 (๐›ฟ) is defined by (4.16) and ๐œ€ โ‰ฅ 8๐œ€0. Furthermore, if A ( ๐‘—) is taken from an ๐œ‚/๐‘‘-optimal family of JL distributions, it will also be a (๐›ฟ/4๐‘‘)-JL embedding of ๐‘†0๐‘— in (4.6) into C๐‘š ๐‘— with probability 1 โˆ’ ๐œ‚/๐‘‘ if ! 
๐›ฟ 2๐‘š ๐œ‚ |๐‘†0๐‘— | = 2๐‘Ÿ 2 โˆ’ ๐‘Ÿ โ‰ค exp ๐‘— , ๐‘‘ 16๐‘‘ 2๐ถ which is satisfied for each ๐‘š ๐‘— defined in (4.17). Finally, taking union bound over all ๐‘‘ modes concludes the proof. Note 4.2.2 Theorem 4.2.4 can be used in the special case where X is a matrix X โˆˆ C๐‘› ร—๐‘› . 1 2 In this case, the CP-rank is the usual matrix rank, and the CP decomposition becomes the regular SVD decomposition of the matrix, which can be computed efficiently in parallel (see, e.g., [8]). In particular, the basis vectors are orthogonal to each other in this case. The result of Theorem 4.2.4 implies that taking A and B as matrices belonging to the (๐œ‚/2)-JL embedding family and of sizes โˆš ๐‘›1 ร— ๐‘š 1 and ๐‘›2 ร— ๐‘š 2 , respectively, such that ๐‘š ๐‘— & ๐‘Ÿ ln(๐‘Ÿ/ ๐œ‚)/๐œ€ 2 (for ๐‘— = 1, 2), we get the following JL-type result for the Frobenius matrix norm: with probability 1 โˆ’ ๐œ‚, kA๐‘‡ XBk 2๐น = (1 + ๐œ€)kXk หœ 2 ๐น for some | ๐œ€| หœ โ‰ค ๐œ€. Theorem 4.2.5 Fix ๐œ€, ๐œ‚ โˆˆ (0, 1/2) and ๐‘‘ โ‰ฅ 3. Let X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› , ๐‘› := max 1 ๐‘‘ ๐‘— ๐‘› ๐‘— โ‰ฅ 4๐‘Ÿ + 1, and let L C๐‘› ร—ยทยทยทร—๐‘› n o (โ„“) be an ๐‘Ÿ-dimensional subspace of 1 ๐‘‘ spanned by a basis B := ๐‘‘ โ„“=1 x ๐‘˜ ๐‘˜ โˆˆ [๐‘Ÿ] of rank-1 tensors, with modewise coherence satisfying ๐œ‡ B ๐‘‘โˆ’1 < 1/2๐‘Ÿ. For each ๐‘— โˆˆ [๐‘‘], draw A ( ๐‘—) โˆˆ C๐‘š ร—๐‘› ๐‘— ๐‘— with โˆš  ๐‘š ๐‘— โ‰ฅ ๐ถ ๐‘— ยท ๐‘Ÿ๐‘‘ 3 /๐œ€ 2 ยท ln ๐‘›/ ๐‘‘ ๐œ‚ (4.19) 43 from an (๐œ‚/4๐‘‘)-optimal family of JL embedding distributions, where ๐ถ ๐‘— โˆˆ R+ is an absolute C๐‘š ร— 0 รŽ๐‘‘ constant. Furthermore, let A โˆˆ โ„“=1 ๐‘šโ„“ with   0 0 โˆ’2 47 ๐‘š โ‰ฅ๐ถ๐‘Ÿยท๐œ€ ยท ln โˆš๐‘Ÿ ๐œ€ ๐œ‚ be drawn from an (๐œ‚/2)-optimal family of JL embedding distributions, where ๐ถ 0 โˆˆ R+ is an absolute constant. Define ๐ฟหœ : C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ โ†’ C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘ by ๐ฟหœ (Z) = Z ร—1 A (1) ยท ยท ยท ร—๐‘‘ A (๐‘‘) . Then, with probability at least 1 โˆ’ ๐œ‚, the linear operator A โ—ฆ vect โ—ฆ ๐ฟหœ : C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ โ†’ C๐‘š 0 satisfies 2 A vect โ—ฆ ๐ฟหœ (X โˆ’ Y) โˆ’ kX โˆ’ Y k 2 โ‰ค ๐œ€ kX โˆ’ Y k 2  2 for all Y โˆˆ L. Proof To begin, we note that A will satisfy the conditions required by Theorem 5.1.1 with probability at least 1 โˆ’ ๐œ‚/2 as a consequence of Lemma 4.1.3. Thus, if we can also establish that ๐ฟหœ will satisfy the conditions required by Theorem 5.1.1 with probability at least 1 โˆ’ ๐œ‚/2, we will be finished with our proof by Theorem 5.1.1 and the union bound. To establish that ๐ฟหœ satisfies the conditions required by Theorem 5.1.1 with probability at least 1 โˆ’ ๐œ‚/2, it suffices to prove that (a) ๐ฟหœ will be an (๐œ€/6)-JL embedding of all Y โˆˆ L into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘ with probability at least 1 โˆ’ ๐œ‚/4, and that โˆš (b) ๐ฟหœ will be an (๐œ€/24 ๐‘Ÿ)-JL embedding of the 4๐‘Ÿ +1 tensors S 0 โˆช{ PL โŠฅ (X)} โŠ‚ C๐‘› ร—๐‘› ร—...ร—๐‘› 1 2 ๐‘‘ into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘ with probability at least 1 โˆ’ ๐œ‚/4, where the set S 0 is defined as in Theorem 5.1.1, and apply yet another union bound. To show that (a) holds, we will utilize Theorem 4.2.2 and Lemma 4.2.3. 
Since each A ( ๐‘—) matrix is an (๐œ‚/4๐‘‘)-optimal JL embedding and the sets S 0๐‘— defined as in (4.6) are such that |S 0๐‘— | < ๐‘› ๐‘‘ , โˆš  we know that each A ( ๐‘—) is an ๐œ€/480๐‘‘ ๐‘Ÿ -JL embedding of S 0๐‘— into ๐‘š ๐‘— with probability5 at C e holds for all ๐‘‘ > 0 in order to avoid a 5 Here โˆš ๐‘‘ โˆš e โˆš๐‘‘ we also implicitly use the fact that ๐‘‘ โ‰ค ๐‘‘ term appearing inside the logarithm in (4.19). 44 โˆš least 1 โˆ’ ๐œ‚/4๐‘‘. Thus, Theorem 4.2.2 holds with ๐œ€ โ†’ ๐œ€/120 ๐‘Ÿ with probability at least 1 โˆ’ ๐œ‚/4. Note that the modewise coherence assumption that ๐œ‡ B ๐‘‘โˆ’1 < 1/2๐‘Ÿ both allows ๐œ€ ๐‘‘โˆ’1 to reduce the โˆš๏ธ โˆš ๐‘Ÿ (๐‘Ÿ โˆ’ 1) factor in (4.9) to a size less than one for any ๐œ€ โ‰ค 1/ ๐‘Ÿ โ‰ค (1/๐‘Ÿ) 1/(๐‘‘โˆ’1) , and also allows Lemma 4.2.3 to guarantee that k๐œถk 22 < 2 kY k 2 holds for all Y โˆˆ L. Hence, applying Theorem 4.2.2 โˆš C with ๐œ€ โ†’ ๐œ€/120 ๐‘Ÿ will ensure that ๐ฟหœ is an (๐œ€/6)-JL embedding of all Y โˆˆ L into ๐‘š1 ร—ยทยทยทร—๐‘š ๐‘‘ . To show that (b) holds we will utilize Lemma 5.1.1. Note that the S ๐‘— sets defined in Lemma 5.1.1 all have cardinalities S ๐‘— โ‰ค ๐‘๐‘› ๐‘‘โˆ’1 , where ๐‘ = 4๐‘Ÿ + 1 โ‰ค ๐‘› in our current setting. As a consequence, โˆš we can see that the conditions of Lemma 5.1.1 will be satisfied with ๐œ€ โ†’ ๐œ€/24 ๐‘Ÿ for all ๐‘— โˆˆ [๐‘‘] with probability at least 1 โˆ’ ๐œ‚/4 by the union bound. Hence, both (a) and (b) hold and our proof is concluded. Figure 4.1 provides a schematic view of the 2-stage JL embedding introduced in Theorem 4.2.5 on a 3 ร— 4 ร— 5 sample tensor. Figure 4.1: An example of 2-stage JL embedding applied to a 3-dimensional tensor X โˆˆ R3ร—4ร—5 . The output of the 1st stage is the projected tensor Y = X ร—1 A (1) ร—2 A (2) ร—3 A (3) , where A ( ๐‘—) are JL matrices for ๐‘— โˆˆ {1, 2, 3}, A (1) โˆˆ R2ร—3 , A (2) โˆˆ R3ร—4 , and A (3) โˆˆ R4ร—5 , resulting in Y โˆˆ R2ร—3ร—4 . Matching colors have been used to show how the rows of A ( ๐‘—) interact with the mode- ๐‘— fibers of X (and the intermediate partially compressed tensors) to generate the elements of the mode- ๐‘— unfolding of the result after each ๐‘—-mode product. Next, the resulting tensor is vectorized (leading to y โˆˆ R24 ), and a 2nd -stage JL is then performed to obtain z = Ay where A โˆˆ R3ร—24 , and z โˆˆ R3 . 45 Note 4.2.3 (About ๐’“ and ๐œบ Dependence) Fix ๐‘‘, ๐‘›, and ๐œ‚. Looking at Theorem 4.2.5 we can see that itโ€™s intermediate embedding dimension is ร–๐‘‘ ๐‘‘ ๐‘š โ„“ โ‰ค ๐ถ๐‘‘,๐œ‚,๐‘› ๐‘Ÿ ๐‘‘ ๐œ€ โˆ’2๐‘‘ โ„“=1 which effectively determines its overall storage complexity. Hence, Theorem 4.2.5 will only result in an improved memory complexity over the straightforward single-stage vectorization approach if, e.g., the rank ๐‘Ÿ of L is relatively small. The purpose of facultative vectorization and subsequent multiplication by an additional JL transform A in Theorem 4.2.5 is to reduce the resulting final embedding dimension to the near-optimal order O (๐‘Ÿ๐œ€ โˆ’2 ) from total dimension O๐œ‚,๐‘› (๐‘‘ 3๐‘‘ ๐‘Ÿ ๐‘‘ ๐œ€ โˆ’2๐‘‘ ) that we have after the modewise compression. 
4.2.3 Fast and Memory-Efficient Modewise Johnson-Lindenstrauss Embeddings In this section we consider a fast Johnson-Lindenstrauss transform for tensors recently introduced in [10], which is effectively based on applying fast JL transforms [13] in a modewise fashion.6 Given a tensor Z โˆˆ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ the transform takes the form โˆš๏ธ‚ ๐‘    ๐ฟ FJL (Z) := R vect Z ร—1 F (1) D (1) ยท ยท ยท ร—๐‘‘ F (๐‘‘) D (๐‘‘) (4.20) ๐‘š where vect : C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ โ†’ C๐‘ for ๐‘ := รŽโ„“=1 ๐‘‘ ๐‘›โ„“ is the vectorization operator, R โˆˆ {0, 1} ๐‘šร—๐‘ is a matrix containing ๐‘š rows selected randomly from the ๐‘ ร— ๐‘ identity matrix, F (โ„“) โˆˆ C๐‘› ร—๐‘› is a โ„“ โ„“ unitary discrete Fourier transform matrix for all โ„“ โˆˆ [๐‘‘], and D (โ„“) โˆˆ C๐‘› ร—๐‘› is a diagonal matrix โ„“ โ„“ with ๐‘›โ„“ random ยฑ1 entries for all โ„“ โˆˆ [๐‘‘]. The following theorem is proven about this transform in [10, 13]. Theorem 4.2.6 (See Theorem 2.1 and Remark 4 in [10]) Fix ๐‘‘ โ‰ฅ 1, ๐œ€, ๐œ‚ โˆˆ (0, 1), and ๐‘ โ‰ฅ ๐ถ 0/๐œ‚ for a sufficiently large absolute constant ๐ถ 0 โˆˆ R+. Consider a finite set S โŠ‚ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ of cardinality 6 In fact, the fast transform described here differs cosmetically from the form in which it is presented in [10]. However, one can easily see they are equivalent using (2.10). 46 ๐‘ = |S|, and let ๐ฟ FJL : C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ โ†’ C๐‘š be defined as above in (4.20) with   ๏ฃฎ max( ๐‘,๐‘) ๏ฃน ๏ฃฏ  max( ๐‘, ๐‘)  log ๐œ‚ ๏ฃบ ๐‘š โ‰ฅ ๐ถ ๏ฃฏ๐œ€ โˆ’2 ยท log2๐‘‘โˆ’1 ยท log4 ยญยญ ยฉ ยช ๏ฃฏ ยฎ ยท log ๐‘ ๏ฃบ๏ฃบ , ๏ฃฏ ๐œ‚ ๐œ€ ยฎ ๏ฃบ ๏ฃฏ ๏ฃบ ๏ฃฐ ยซ ยฌ ๏ฃป where ๐ถ > 0 is an absolute constant. Then with probability at least 1 โˆ’ ๐œ‚ the linear operator ๐ฟ FJL is an ๐œ€-JL embedding of S into C๐‘š . If ๐‘‘ = 1 then we may replace max( ๐‘, ๐‘) with ๐‘ inside all of the logarithmic factors above (see [13]). ร Note that the fast transform ๐ฟ FJL requires only O (๐‘š log ๐‘ + โ„“ ๐‘›โ„“ ) i.i.d. random bits and memory for storage. Thus, it can be used to produce fast and low memory complexity oblivious subspace embeddings. The next Theorem does so. Theorem 4.2.7 Fix ๐œ€, ๐œ‚ โˆˆ (0, 1/2) and ๐‘‘ โ‰ฅ 2. Let X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› , ๐‘ = รŽโ„“=1 1 ๐‘‘ ๐‘‘ ๐‘›โ„“ โ‰ฅ 4๐ถ 0/๐œ‚ for an absolute constant ๐ถ 0 > 0, L be an ๐‘Ÿ-dimensional subspace of C๐‘› ร—ยทยทยทร—๐‘› for max 2๐‘Ÿ 2 โˆ’ ๐‘Ÿ, 4๐‘Ÿ โ‰ค ๐‘,  1 ๐‘‘ and ๐ฟ FJL : C๐‘› ร—ยทยทยทร—๐‘› โ†’ C๐‘š be defined as above in (4.20) with 1 ๐‘‘ 1   ๏ฃฎ ๐‘ ๏ฃน ๏ฃฏ  ๐‘Ÿ 2   ๐‘ log ๐œ‚ ๏ฃบ 2๐‘‘โˆ’1 4ยญ ๏ฃฏ ๐‘‘ ยฉ ยช โ‰ฅ ๐ถ1 ๏ฃฏ๐ถ2 ยท log ยท log ยญ ยท log ๐‘ ๏ฃบ , ๏ฃบ ๐‘š1 ยฎ ๏ฃฏ ๐œ€ ๐œ‚ ๐œ€ ยฎ ๏ฃบ ๏ฃฏ ๏ฃบ ๏ฃฐ ยซ ยฌ ๏ฃป where ๐ถ1 , ๐ถ2 > 0 are absolute constants. Furthermore, let L0FJL โˆˆ C๐‘š ร—๐‘š 2 1 be defined as above in (4.20) for ๐‘‘ = 1 with   ๏ฃฎ 47 ๏ฃน ๏ฃฏ  47  ๐‘Ÿ log ๐œ€ โˆš ๐‘Ÿ๐œ‚ ยช ๏ฃบ ๐‘š 2 โ‰ฅ ๐ถ3 ๏ฃฏ๐‘Ÿ ยท ๐œ€ โˆ’2 ยท log โˆš๐‘Ÿ ยท log4 ยญยญ ยฉ ๏ฃฏ ยฎ ยท log ๐‘š 1 ๏ฃบ๏ฃบ , ๏ฃฏ ๐œ€ ๐œ‚ ๐œ€ ยฎ ๏ฃบ ๏ฃฏ ๏ฃบ ๏ฃฐ ยซ ยฌ ๏ฃป where ๐ถ3 > 0 is an absolute constant. Then, with probability at least 1 โˆ’ ๐œ‚ it will be the case that 2 L0FJL (๐ฟ FJL (X โˆ’ Y)) 2 โˆ’ kX โˆ’ Y k 2 โ‰ค ๐œ€ kX โˆ’ Y k 2 holds for all Y โˆˆ L. In addition, the L0FJL , ๐ฟ FJL transform pair requires only O (๐‘š 1 log ๐‘ + โ„“ ๐‘›โ„“ ) random bits  ร and memory for storage (assuming w.l.o.g. 
that ๐‘š 2 โ‰ค ๐‘š 1 ), and L0FJL โ—ฆ ๐ฟ FJL : C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ โ†’ C๐‘š 2 can be applied to any tensor in just O (๐‘ log ๐‘)-time. 47 Proof Let {T๐‘˜ } ๐‘˜โˆˆ[๐‘Ÿ] be an orthonormal basis for L (note that these basis tensors need not be low-rank), and PL โŠฅ : C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ โ†’ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ be the orthogonal projection operator onto the orthogonal complement of L. Theorem 5.1.1 combined with Lemmas 4.1.4 and 4.1.3 imply that the result will be proven if all of the following hold: (i) ๐ฟ FJL is an (๐œ€/24๐‘Ÿ)-JL embedding of the 2๐‘Ÿ 2 โˆ’ ๐‘Ÿ tensors ! i i ร˜ ร˜ {T๐‘˜ โˆ’ Tโ„Ž , T๐‘˜ + Tโ„Ž , T๐‘˜ โˆ’ Tโ„Ž , T๐‘˜ + Tโ„Ž } {T๐‘˜ } ๐‘˜โˆˆ[๐‘Ÿ] โŠ‚ L 1โ‰คโ„Ž<๐‘˜ โ‰ค๐‘Ÿ into C๐‘š , 1 (ii) ๐ฟ FJL is an (๐œ€/6)-JL embedding of { PL โŠฅ (X)} into C๐‘š ,1 โˆš (iii) ๐ฟ FJL is an (๐œ€/24 ๐‘Ÿ)-JL embedding of the 4๐‘Ÿ tensors ร˜  PL โŠฅ (X) PL โŠฅ (X) PL โŠฅ (X) i PL โŠฅ (X) i  C๐‘› ร—...ร—๐‘› k PL PL PL PL โˆ’ T๐‘˜ , + T๐‘˜ , โˆ’ T๐‘˜ , + T๐‘˜ โŠ‚ 1 ๐‘‘ โŠฅ (X) k k โŠฅ (X) k k โŠฅ (X) k k โŠฅ (X)k ๐‘˜ โˆˆ[๐‘Ÿ] into C๐‘š , and 1 (iv) L0FJL is an (๐œ€/6)-JL embedding of a minimal (๐œ€/16)-cover, C, of the ๐‘Ÿ-dimensional Euclidean unit sphere in the subspace L 0 โŠ‚ C๐‘š 1 from Theorem 5.1.1 with ๐ฟ = ๐ฟ FJL into C๐‘š . Here we 2  ๐‘Ÿ note that |C| โ‰ค 47 ๐œ€ . Furthermore, if ๐‘š 1 and ๐‘š 2 are chosen as above for sufficiently large absolute constants ๐ถ1 , ๐ถ2 , and ๐ถ3 , then Theorem 4.2.6 implies that each of (๐‘–) โˆ’ (๐‘–๐‘ฃ) above will fail to hold with probability at most ๐œ‚/4. The desired result now follows from the union bound. The number of random bits and storage complexity follows directly form Theorem 4.2.6 after noting that each row of R in (4.20) is determined by O (log ๐‘) bits. The fact that L0FJL โ—ฆ ๐ฟ FJL can be applied to any tensor Z in O (๐‘ log ๐‘)-time again follows from the form of (4.20). Note that each ๐‘—-mode product with F ( ๐‘—) D ( ๐‘—) involves โ„“โ‰  ๐‘— ๐‘›โ„“ multiplications of F ( ๐‘—) D ( ๐‘—) against all the mode- ๐‘— รŽ fibers of the given tensor Z, each of which can be performed in O (๐‘› ๐‘— log(๐‘› ๐‘— ))-time using fast 48 Fourier transform techniques (or approximated even more quickly using sparse Fourier transform techniques if ๐‘› ๐‘— is itself very large). The required vectorization and applications of R can then be performed in just O (๐‘)-time thereafter. Finally, Fourier transform techniques can again be used to also apply L0FJL in O (๐‘š 1 log ๐‘š 1 )-time. 4.3 Experiments In this section, it is shown that the norms of several different types of (approximately) low-rank data can be preserved using JL embeddings. The data sets used in the experiments consist of 1. MRI data: This data set contains three 3-mode MRI images of size 240 ร— 240 ร— 155 [1]. 2. Randomly generated data: This data set contains 10 rank-10 4-mode tensors. Each test tensor is a 100 ร— 100 ร— 100 ร— 100 array that is created by adding 10 randomly generated rank-1 tensors. More specifically, each rank-10 tensor is generated according to ๐‘Ÿ โˆ‘๏ธ ( ๐‘—) X (๐‘š) = ๐‘‘ ๐‘—=1 x ๐‘˜ , ๐‘˜=1 where ๐‘š โˆˆ [10], ๐‘Ÿ = 10, ๐‘‘ = 4 and x ๐‘˜ ( ๐‘—) โˆˆ R100. ( ๐‘—) In the Gaussian case, each entry of x ๐‘˜ is drawn independently from the standard Gaussian distribution N (0, 1). 
In the case of ( ๐‘—) ( ๐‘—) coherent data, low-variance Gaussian noise is added to a constant, i.e., each entry x ๐‘˜,โ„“ of x ๐‘˜ ( ๐‘—) ( ๐‘—) is set as 1 + ๐œŽ๐‘” ๐‘˜,โ„“ with ๐‘” ๐‘˜,โ„“ being an i.i.d. standard Gaussian random variable defined above, โˆš and ๐œŽ 2 denoting the desired variance. In the experiments of this section, ๐œŽ = 0.1 is used. ( ๐‘—) In both cases, the 2-norm of x ๐‘˜ is also normalized to 1. The reason for running experiments on both Gaussian and coherent data is to show that although coherence requirements presented in section 4.2 are used to help get general theo- retical results for a large class of modewise JL embeddings, they do not seem to be necessary in practice. 49 When JL embeddings are applied, experiments are performed using Gaussian JL matrices as well as Fast JL matrices. For Gaussian JL, A ๐‘— = โˆš1 G is used for all ๐‘— โˆˆ [๐‘‘], where ๐‘š is the target ๐‘š dimension and each entry in G is an i.i.d. standard Gaussian random variable G๐‘–, ๐‘— โˆผ N (0, 1). For Fast JL, A ๐‘— = โˆš1 RFD is used for all ๐‘— โˆˆ [๐‘‘], where R denotes the random restriction matrix, F ๐‘š โˆš is the unitary DFT matrix scaled by ๐‘› ๐‘— ,7 and D is a diagonal matrix with Rademacher random variables forming its diagonal [13]. The embedded version of a test tensor X is always denoted by ๐ฟ (X), and is calculated by X ร—1 A (1) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) , ๏ฃฑ 1-stage JL ๏ฃด ๏ฃด ๏ฃด ๏ฃด ๏ฃด ๏ฃฒ ๏ฃด ๐ฟ (X) = (4.21) ๏ฃด ๏ฃด    ๏ฃด ๏ฃด A vec X ร—1 A (1) ร— ยท ยท ยท ร—๐‘‘ A (๐‘‘) , 2-stage JL ๏ฃด ๏ฃด ๏ฃณ where A is a JL matrix used in the 2nd stage. Obviously, ๐ฟ (X) is a vector in the 2-stage case. 4.3.1 Effect of JL Embeddings on Norm In this section, numerical results have been presented, showing the effect of mode-wise JL embed- ding on the norm of 3 MRI 3-mode images treated as generic tensors, as well as randomly generated data. ( ๐‘—) The compression ratio for the ๐‘— th mode, denoted by ๐‘ 1 , is defined as the compression in the size of each of the mode- ๐‘— fibers, i.e., ( ๐‘—) ๐‘š๐‘— ๐‘1 = . ๐‘›๐‘—   The target dimension ๐‘š ๐‘— in JL matrices is chosen as ๐‘š ๐‘— = ๐‘ 1 ๐‘› ๐‘— for all ๐‘— โˆˆ [๐‘‘], to ensure that at least a fraction ๐‘ 1 of the ambient dimension in each mode is preserved. In the experiments, the ( ๐‘—) compression ratio is set to be the same for all modes, i.e., ๐‘ 1 = ๐‘ 1 for all ๐‘— โˆˆ [๐‘‘]. In the case of a 2-stage JL embedding, the target dimension ๐‘š of the secondary JL embedding is chosen as ๐‘š = d๐‘ 2 ๐‘e , 7 Recall that ๐‘› ๐‘— is the size of the mode- ๐‘— fibers of the input tensor. 50 where ๐‘ 2 is the compression ratio in the 2nd stage, and ๐‘ is the length of the vectorized projected tensor after the modewise JL embedding. The total achieved compression is calculated by ๐‘ ๐‘ก๐‘œ๐‘ก = รŽ  ๐‘‘ ( ๐‘—) nd รŽ๐‘‘ ( ๐‘—) ๐‘2 ๐‘—=1 1 . When the 2 stage embedding is skipped, ๐‘ ๐‘ก๐‘œ๐‘ก = ๐‘ ๐‘—=1 ๐‘ 1 . In all experiments of this section, when a 2-stage embedding is performed, ๐‘ 2 = 0.05. Also, in figure legends, when two JL types are listed together, the first and second terms refer to the first and second stages, respectively. For example, in โ€˜Gaussian+RFDโ€™, Gaussian and RFD JL embeddings were used in the first and second stages, respectively. The term โ€˜vecโ€™ in the legends refers to vectorizing the data. 
Assuming X denotes the original tensor and ๐ฟ (X) is the projected result, the relative norm of X is defined by k๐ฟ (X) k ๐‘ ๐‘›,X = . kXk The results of this section depict the interplay between ๐‘ ๐‘›,X and ๐‘ 1 for randomly generated data, and ๐‘ ๐‘›,X versus ๐‘ ๐‘ก๐‘œ๐‘ก for MRI data, where the numbers have been averaged over 1000 trials, as well as over all samples for each value of ๐‘ 1 or ๐‘ ๐‘ก๐‘œ๐‘ก . In the case of Figure 4.2, 1000 randomly generated JL matrices were applied to each mode of all 10 randomly generated tensors. The results there indicate that the modewise embedding methods proposed herein still work on relatively coherent data despite the incoherence assumptions utilized in their theoretical analysis (recall Section 4.2). In Figure 4.3, 1000 JL embedding choices have been averaged over each of the 3 MRI images as well as the 3 images themselves. As expected, it can be observed in both figures that increasing the compression ratio leads to better norm (and distance) preservation. The MRI data experiments were done using various combinations of JL matrices in the first and second stages, and were compared with the 1-stage (modewise) case and also JL applied to vectorized data. In Figure 4.3b, the runtime plots show that vectorizing the data before applying JL embeddings is the most computationally intensive way of compressing the data, although it preserves norms the best, as Figure 4.3a demonstrates. Due to the small mode sizes of the MRI data used in the experiments, modewise fast JL does not outperform modewise Gaussian JL in terms of computational efficiency in the modewise embeddings as one might initially expect (see the red 51 and blue curves). This is likely due to the fact that the individual mode sizes are too small to benefit from the FFT (recall all modes are โ‰ค 240 in size), together with the need of Fourier methods to use less efficient complex number arithmetic. However, when the 2-stage JL is employed for larger compression ratios, the vectorized data after the first stage compression is large enough to make the efficiency of fast JL over Gaussian JL embeddings clear (compare, e.g., the yellow and purple curves). Also the small sizes of modes make the use of explicitly constructed โˆš1 RFD matrices ๐‘š more efficient than taking the FFT of mode fibers. It should be noted that in the second stage of the 2-stage JL throughout the experiments of this thesis, the matrix โˆš1 RFD is not constructed explicitly as this would be inefficient due to the large ๐‘š size of the vector that โˆš1 RFD is applied to. Instead, the signs of the vector are randomly changed ๐‘š (the effect of D) followed by a Fourier transform (the effect of F). Finally ๐‘š samples of the โˆš resulting vector are picked at random with replacement (the effect of R) after which the scale 1/ ๐‘š is applied. This allows one to notice the computational efficiency of FFT in the fast JL embedding. 1 1 0.9 0.9 JL (Gaussian) JL (Gaussian) Fast JL (RFD) Fast JL (RFD) 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 (a) (b) Figure 4.2: Relative norm of randomly generated 4-dimensional data. Here, the total compression will be ๐‘ ๐‘ก๐‘œ๐‘ก = ๐‘41 . (a) Gaussian data. (b) Coherent data. Note that the modewise approach still preserves norms well for the coherent data indicating that the incoherence assumptions utilized in Section 4.2 can likely be relaxed. 
By looking at Figure 4.2 We observe that the proposed modewise JL approach leads to very good norm preservation for data generated from both coherent and incoherent factors. Specifically, 52 the compression listed on the horizontal axis is for one mode, and given that the synthesized data samples are 4-mode tensors, the total compression is very good. 1.02 100 Gaussian 1 RFD 0.98 Gaussian+Gaussian Gaussian+RFD 0.96 RFD+RFD 0.94 vec+RFD 10-1 Gaussian 0.92 RFD 0.9 Gaussian+Gaussian Gaussian+RFD 0.88 RFD+RFD vec+RFD 0.86 10-2 10-7 10-6 10-5 10-4 10-3 10-2 10-7 10-6 10-5 10-4 10-3 10-2 (a) (b) Figure 4.3: Simulation results averaged over 1000 trials for 3 MRI data samples, where each sample is 3-dimensional. In the 2-stage cases, ๐‘ 2 = 0.05 has been used. (a) Relative norm. (b) Runtime. Figure 4.3b depicts the results of the proposed modewise JL method on three MRI data samples. Although part (a) shows relative superiority of the โ€˜vec+RFDโ€™ in terms of accuracy, part (b) suggests that for good compression values, the 1-stage and 2-stage JL approaches yields much smaller runtimes. The reason โ€˜vec+RFDโ€™ is leading to an almost horizontal line in part (b) lies in the way โˆš1 RFD is applied. As the matrix is not formed explicitly and the only part that determines the ๐‘š compression is the restriction (which does not inflict any computational load if it is simply picking random samples from a vector), choosing various compression values does not alter the runtime. However, if one explicitly forms and applies โˆš1 RFD, the runtime will change with the chosen ๐‘š compression. 53 CHAPTER 5 APPLICATIONS OF MODEWISE JOHNSON-LINDENSTRAUSS EMBEDDINGS In this chapter, two cases are presented where modewise JL embeddings can be used to reduce the computational cost of computationally intensive problems. 5.1 Application to Least Squares Problems and CPD Fitting Tensor decomposition problems usually involve fitting a low-rank approximation to a given tensor that is assumed to have a low rank of some type. In this section, it is shown that modewise JL embeddings offer an efficient way to reduce the computational cost of such fitting problems through dimension reduction at the cost of an approximation error. Consider a tensor X which is assumed to have low CP rank ๐‘Ÿ. We would like to approximate X in the Euclidean norm with a tensor Y expressed in the standard form as per (4.3). As mentioned in Chapter 3, a common fitting method is the Alternating Least Squares, where the factors representing the rank-๐‘Ÿ subspace are solved for one mode at a time. One can start from a random subspace and improve the least squares error mode by mode through multiple iterations. Since the subspace of interest is changing throughout the fitting process, oblivious subspace embeddings would be a natural choice to reduce the fitting problem size. For an arbitrary tensor X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› , the fitting 1 ๐‘‘ process involves solving โˆ‘๏ธ๐‘Ÿ arg min Xโˆ’ ๐›ผ๐‘˜ ๐‘‘ x ๐‘˜(โ„“) (5.1) C๐‘› ๐‘— โ„“=1 ( ๐‘—) ( ๐‘—) xฬƒ1 ,...,xฬƒ๐‘Ÿ โˆˆ ๐‘˜=1 n o ( ๐‘—) ( ๐‘—) ( ๐‘—) for each mode ๐‘— โˆˆ [๐‘‘] after fixing x ๐‘˜(โ„“) . Here, x ๐‘˜ = xฬƒ ๐‘˜ /k xฬƒ ๐‘˜ k 2 โˆ€ ๐‘—, ๐‘˜ and ๐‘˜โˆˆ[๐‘Ÿ],โ„“โˆˆ[๐‘‘]\{ ๐‘— } ๐›ผ ๐‘˜ = โ„“=1 k xฬƒ ๐‘˜(โ„“) k 2 . One then varies ๐‘— through all values in [๐‘‘] solving (5.1) for each ๐‘— in order รŽ๐‘‘ ( ๐‘—) to update x ๐‘˜ โˆ€ ๐‘—, ๐‘˜. 
Sweeping through all modes usually takes place in numerous iterations until convergence is achieved, meaning the fit stops to improve, or the maximum number of iterations is exhausted. This in turn means a high computational load, and makes it particularly important to 54 solve each least squares problem (5.1) efficiently by reducing the problem size. To see how this can be done to solve (5.1), one may write 2 ๐‘Ÿ 2 ๐‘Ÿ รŒ 1 > โˆ‘๏ธ โˆ‘๏ธ ( ๐‘—) Xโˆ’ ๐›ผ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) = X ( ๐‘—) โˆ’ ๐›ผ๐‘˜ x๐‘˜ x ๐‘˜(โ„“) , ๐‘˜=1 ๐‘˜=1 โ„“=๐‘‘ โ„“โ‰  ๐‘— F as the Euclidean norm of a tensor is equal to the Frobenius norm of any of its unfoldings. By looking closely at the right-hand side of the above equation, one can see that the Frobenius norm squared can be calculated row-wise (also note that the Frobenius norm is equivalent with the 2-norm for vectors, i.e., rows or columns of a matrix). Denoting row โ„Ž of X ( ๐‘—) by x ๐‘—,โ„Ž and element โ„Ž of ( ๐‘—) x ๐‘˜(โ„“) by ๐‘ฅ ๐‘˜,โ„Ž , we can write 2 ๐‘Ÿ 2 ๐‘›๐‘— ๐‘Ÿ รŒ 1 > โˆ‘๏ธ โˆ‘๏ธ โˆ‘๏ธ ( ๐‘—) Xโˆ’ ๐›ผ๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) = x ๐‘—,โ„Ž โˆ’ ๐›ผ ๐‘˜ ๐‘ฅ ๐‘˜,โ„Ž x ๐‘˜(โ„“) ๐‘˜=1 โ„Ž=1 ๐‘˜=1 โ„“=๐‘‘ โ„“โ‰  ๐‘— 2 (5.2) ๐‘›๐‘— ๐‘Ÿ 2 โˆ‘๏ธ โˆ‘๏ธ = X ( ๐‘—,โ„Ž) โˆ’ ๐›ผ0๐‘—,โ„Ž,๐‘˜ ๐‘‘ โ„“=1 x ๐‘˜(โ„“) , โ„Ž=1 ๐‘˜=1 โ„“โ‰  ๐‘— ( ๐‘—) where ๐›ผ0๐‘—,โ„Ž,๐‘˜ = ๐›ผ๐‘˜ ๐‘ฅ ๐‘˜,โ„Ž with ๐›ผ ๐‘˜ is known for ๐‘˜ โˆˆ [๐‘Ÿ] from (5.1) and X ( ๐‘—,โ„Ž) is the tensorized form of x ๐‘—,โ„Ž which is in fact the โ„Žth mode- ๐‘— slice of X. It is also clear that the original problem (5.1) can be modeled as ๐‘› ๐‘— independent least squares problems that can be solved in parallel if needed, with each least squares problem involving a (๐‘‘ โˆ’ 1)-mode tensor X ( ๐‘—,โ„Ž) . Essentially, this would mean that for mode ๐‘—, one has to solve ๐‘› ๐‘— minimization problems, each of the following form. ๐‘Ÿ โˆ‘๏ธ arg min X ( ๐‘—,โ„Ž) โˆ’ ๐›ผ0๐‘—,โ„Ž,๐‘˜ ๐‘‘ x ๐‘˜(โ„“) . (5.3) ๐œถ 0๐‘—,โ„Ž โˆˆ C๐‘Ÿ ๐‘˜=1 โ„“=1 โ„“โ‰  ๐‘— Now, assuming that for each mode ๐‘—, the factors {x ๐‘˜(โ„“) } are sufficiently incoherent for ๐‘˜ โˆˆ [๐‘Ÿ] and โ„“ โˆˆ [๐‘‘] \ ๐‘—, we can use our modewise JL embedding method to solve a compressed version of each 55 least squares problem in (5.3) in the following way. ? ๐‘‘ โˆ‘๏ธ๐‘Ÿ ? ๐‘‘ 0 (โ„“ 0 ) arg min X ( ๐‘—,โ„Ž) A โˆ’ ๐›ผ0๐‘—,โ„Ž,๐‘˜ ๐‘‘ x ๐‘˜(โ„“) A (โ„“ ) (5.4) ๐œถ 0๐‘—,โ„Ž โˆˆ C๐‘Ÿ โ„“ 0 =1 ๐‘˜=1 โ„“=1 โ„“โ‰  ๐‘— โ„“ 0 =1 โ„“ 0โ‰  ๐‘— โ„“ 0โ‰  ๐‘— ( ๐‘—) ( ๐‘—) We can then update each entry of xฬƒ ๐‘˜ by setting ๐‘ฅหœ ๐‘˜,โ„Ž = ๐›ผ0๐‘—,โ„Ž,๐‘˜ /๐›ผ ๐‘˜ for all โ„Ž โˆˆ [๐‘› ๐‘— ] and ๐‘˜ โˆˆ [๐‘Ÿ]. To show that the solutions to (5.4) and (5.3) are close, we first establish that ? ๐‘‘ X ( ๐‘—,โ„Ž) A (โ„“) โ‰ˆ X ( ๐‘—,โ„Ž) โ„“ 0 =1 โ„“ 0โ‰  ๐‘— can also hold for all ๐‘— โˆˆ [๐‘‘] and โ„Ž โˆˆ [๐‘› ๐‘— ]. This is done in the following lemma. Lemma 5.1.1 Let ๐œ€ โˆˆ (0, 1), Z (1) , . . . , Z ( ๐‘) โˆˆ ๐‘›1 ร—ยทยทยทร—๐‘›๐‘‘ , and A (1) โˆˆ C C๐‘š ร—๐‘› 1 1 e be an (๐œ€/ ๐‘‘)-JL รŽ  ๐‘‘ embedding of the all ๐‘ โ„“=2 โ„“ mode-1 fibers of all ๐‘ of these tensors, ๐‘› C๐‘› , ร˜ n o S1 := Z:,๐‘–(๐‘ก)2 ,...,๐‘– ๐‘‘ | โˆ€๐‘–โ„“ โˆˆ [๐‘›โ„“ ], โ„“ โˆˆ [๐‘‘] \ {1} โŠ‚ 1 ๐‘กโˆˆ[ ๐‘] into C๐‘š . 
Next, set Z (1,๐‘ก) := Z (๐‘ก) ร—1 A(1) โˆˆ C๐‘š ร—๐‘› ร—ยทยทยทร—๐‘› โˆ€๐‘ก โˆˆ [ ๐‘], and then let A(2) โˆˆ C๐‘š ร—๐‘› 1 1 2 ๐‘‘ 2 2 be an (๐œ€/e๐‘‘)-JL embedding of all ๐‘ ๐‘š 1 โ„“=3  รŽ  ๐‘‘ ๐‘›โ„“ mode-2 fibers C๐‘› ร˜ n o S2 := Z๐‘–(1,๐‘ก) 1 ,:,๐‘– 3 ,...,๐‘– ๐‘‘ | โˆ€๐‘–1 โˆˆ [๐‘š 1 ] & ๐‘–โ„“ โˆˆ [๐‘›โ„“ ], โ„“ โˆˆ [๐‘‘] \ [2] โŠ‚ 2 ๐‘กโˆˆ[ ๐‘] into C๐‘š . Continuing inductively, for each ๐‘— โˆˆ [๐‘‘] \ [2] and ๐‘ก โˆˆ [ ๐‘] set Z ( ๐‘—โˆ’1,๐‘ก) := Z ( ๐‘—โˆ’2,๐‘ก) ร— ๐‘—โˆ’1 2 A ( ๐‘—โˆ’1) โˆˆ C๐‘š ร—ยทยทยทร—๐‘š ร—๐‘› ร—ยทยทยทร—๐‘› , and then let A ( ๐‘—) โˆˆ C๐‘š ร—๐‘› be an (๐œ€/e๐‘‘)-JL embedding of all 1 ๐‘—โˆ’1 ๐‘— ๐‘‘ ๐‘— ๐‘— รŽ  รŽ  ๐‘—โˆ’1 ๐‘‘ ๐‘ โ„“=1 ๐‘š โ„“ โ„“= ๐‘—+1 ๐‘›โ„“ mode- ๐‘— fibers C๐‘› ร˜ n o ( ๐‘—โˆ’1,๐‘ก) S ๐‘— := Z๐‘–1 ,...,๐‘– ๐‘—โˆ’1 ,:,๐‘– ๐‘—+1 ,...,๐‘– ๐‘‘ | โˆ€๐‘–โ„“ โˆˆ [๐‘š โ„“ ], โ„“ โˆˆ [ ๐‘— โˆ’ 1] & ๐‘–โ„“ โˆˆ [๐‘›โ„“ ], โ„“ โˆˆ [๐‘‘] \ [ ๐‘—], โŠ‚ ๐‘— ๐‘กโˆˆ[ ๐‘] into C๐‘š . Then, ๐‘— 2 2 2 Z (๐‘ก) โˆ’ Z (๐‘ก) ร—1 A (1) ยท ยท ยท ร—๐‘‘ A (๐‘‘) โ‰ค ๐œ€ Z (๐‘ก) will hold for all ๐‘ก โˆˆ [ ๐‘]. 56 Proof Fix ๐‘ก โˆˆ [ ๐‘] and let X (0) := Z (๐‘ก) , X ( ๐‘—) := Z ( ๐‘—,๐‘ก) for all ๐‘— โˆˆ [๐‘‘ โˆ’ 1], and X (๐‘‘) := Z (๐‘‘โˆ’1,๐‘ก) ร—๐‘‘ A (๐‘‘) = Z (๐‘ก) ร—1 A (1) ยท ยท ยท ร—๐‘‘ A (๐‘‘) . Choose any ๐‘— โˆˆ [๐‘‘], and let x ๐‘—,โ„Ž โˆˆ C๐‘› ๐‘— denote the โ„Žth ( ๐‘—โˆ’1) column of the mode- ๐‘— unfolding of X ( ๐‘—โˆ’1) , denoted by X ( ๐‘—) . It is easy to see that each x ๐‘—,โ„Ž is a รŽ  รŽ  mode- ๐‘— fiber of X ( ๐‘—โˆ’1) = Z ( ๐‘—โˆ’1,๐‘ก) for each 1 โ‰ค โ„Ž โ‰ค ๐‘ 0๐‘— := ๐‘—โˆ’1 โ„“=1 โ„“ ๐‘š โ„“= ๐‘—+1 โ„“ . Thus, we can ๐‘› see that 2 2 2 2 2 2 ( ๐‘—โˆ’1) ( ๐‘—โˆ’1) X ( ๐‘—โˆ’1) โˆ’ X ( ๐‘—) = X ( ๐‘—โˆ’1) โˆ’ X ( ๐‘—โˆ’1) ร— ๐‘— A ( ๐‘—) = X ( ๐‘—) โˆ’ A ( ๐‘—) X ( ๐‘—) F F ๐‘ 0๐‘— ๐‘ 0๐‘— โˆ‘๏ธ 2 โˆ‘๏ธ = kx ๐‘—,โ„Ž k 22 โˆ’ A ( ๐‘—) x ๐‘—,โ„Ž โ‰ค kx ๐‘—,โ„Ž k 22 โˆ’ kA ( ๐‘—) x ๐‘—,โ„Ž k 22 2 โ„Ž=1 โ„Ž=1 ๐‘ 0๐‘— ๐œ€ โˆ‘๏ธ ๐œ€ 2 ๐œ€ 2 ( ๐‘—โˆ’1) kx ๐‘—,โ„Ž k 22 = X ( ๐‘—โˆ’1) โ‰ค e ๐‘‘ โ„Ž=1 e ๐‘‘ X ( ๐‘—) F = e ๐‘‘ . 2 2 A short induction argument now reveals that X ( ๐‘—) e ๐œ€ ๐‘— X (0)  โ‰ค 1+ ๐‘‘ holds for all ๐‘— โˆˆ [๐‘‘]. As a result we can now see that 2 2 ๐‘‘ 2 2 ๐‘‘ 2 2 ๐‘‘ 2 (0) (๐‘‘) โˆ‘๏ธ ( ๐‘—โˆ’1) ( ๐‘—) โˆ‘๏ธ ( ๐‘—โˆ’1) ( ๐‘—) ๐œ€ โˆ‘๏ธ ( ๐‘—โˆ’1) X โˆ’ X = ๐‘—=1 X โˆ’ X โ‰ค ๐‘—=1 X โˆ’ X โ‰ค e ๐‘‘ ๐‘—=1 X ๐‘‘ 2 2 ๐œ€ โˆ‘๏ธ  ๐œ€  ๐‘—โˆ’1 (0) ๐œ€ ๐œ€ ๐‘‘ X (0) โ‰ค e ๐‘‘ ๐‘—=1 1+ e ๐‘‘ X โ‰ค e 1+ e ๐‘‘ . holds. The desired result now follows from Remark 4.2.1. Now, we can use the result of Lemma 5.1.1 to prove that the solution to (5.4) will be a close approximation to that of (5.3) if the matrices A ( ๐‘—) are chosen appropriately. We have the following >๐‘‘ general result which directly applies to least squares problems as per (5.4) when ๐ฟหœ (Z) := Z A (โ„“) โ„“=1 โ„“โ‰  ๐‘— and A = I. Theorem 5.1.1 (Embeddings for Compressed Least Squares) Let X โˆˆ C๐‘› ร—ยทยทยทร—๐‘› , L be an ๐‘Ÿ- 1 ๐‘‘ dimensional subspace of C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ spanned by a set of orthonormal basis tensors {T๐‘˜ } ๐‘˜โˆˆ[๐‘Ÿ] , and PL : C๐‘› ร—ยทยทยทร—๐‘› โŠฅ 1 ๐‘‘ โ†’ C๐‘› ร—ยทยทยทร—๐‘› 1 ๐‘‘ be the orthogonal projection operator on the orthogonal complement 57 of L. 
Fix ๐œ€ โˆˆ (0, 1) and suppose that the linear operator ๐ฟหœ : C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› 1 2 ๐‘‘ โ†’ C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 has both of the following properties: (i) ๐ฟหœ is an (๐œ€/6)-JL embedding of all Y โˆˆ L โˆช { PL โŠฅ (X)} into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 , and โˆš (ii) ๐ฟหœ is an (๐œ€/24 ๐‘Ÿ)-JL embedding of the 4๐‘Ÿ tensors 0 ร˜  P L โŠฅ (X) P L โŠฅ (X) PL โŠฅ (X) i PL โŠฅ (X) i  C๐‘› ร—๐‘› ร—ยทยทยทร—๐‘› S := P โˆ’ T๐‘˜ , P + T๐‘˜ , PL โˆ’ T๐‘˜ , PL + T๐‘˜ โŠ‚ 1 2 ๐‘‘ k L โŠฅ (X) k k L โŠฅ (X)k k โŠฅ (X) k k โŠฅ (X) k ๐‘˜ โˆˆ[๐‘Ÿ] into C๐‘š ร—ยทยทยทร—๐‘š 1 ๐‘‘0 . C๐‘š ร—ยทยทยทร—๐‘š C รŽ๐‘‘ 0 Furthermore, let vect : 1 ๐‘‘0 โ†’ โ„“=1 ๐‘šโ„“ be a reshaping vectorization operator, and A โˆˆ C๐‘šร— รŽ๐‘‘ 0 โ„“=1 ๐‘šโ„“ be an (๐œ€/3)-JL embedding of the (๐‘Ÿ + 1)-dimensional subspace PL C รŽ๐‘‘ 0 L 0 := span vect โ—ฆ ๐ฟหœ ( (X)) , vect โ—ฆ ๐ฟหœ (T1 ) , . . . , vect โ—ฆ ๐ฟหœ (T๐‘Ÿ ) โŠ‚  ๐‘šโ„“ โŠฅ โ„“=1 into C๐‘š . Then, 2 A vect โ—ฆ ๐ฟหœ (X โˆ’ Y) โˆ’ kX โˆ’ Y k 2 โ‰ค ๐œ€ kX โˆ’ Y k 2  2 holds for all Y โˆˆ L. Proof Note that the theorem will be proven if ๐ฟหœ is an (๐œ€/3)โ€“JL embedding of all tensors of the form C X โˆ’ Y Y โˆˆ L into ๐‘š1 ร—ยทยทยทร—๐‘š ๐‘‘ 0 since any such tensor Xโˆ’Y will also have vectโ—ฆ ๐ฟหœ (X โˆ’ Y) โˆˆ L 0  so that 2 A vect โ—ฆ ๐ฟหœ (X โˆ’ Y) โˆ’ kX โˆ’ Y k 2  2 2 2 2 A vect โ—ฆ ๐ฟหœ (X โˆ’ Y) โˆ’ ๐ฟหœ (X โˆ’ Y) + ๐ฟหœ (X โˆ’ Y) โˆ’ kX โˆ’ Y k 2  โ‰ค 2 2 2 ๐œ€ A vect โ—ฆ ๐ฟหœ (X โˆ’ Y) โˆ’ vect โ—ฆ ๐ฟหœ (X โˆ’ Y) kX โˆ’ Y k 2  โ‰ค 2 2 + 3 ๐œ€ 2 ๐œ€ โ‰ค vect โ—ฆ ๐ฟหœ (X โˆ’ Y) 2 + kX โˆ’ Y k 2 3 3 ๐œ€ 2 ๐œ€ = ๐ฟหœ (X โˆ’ Y) + kX โˆ’ Y k 2 3 3 ๐œ€  ๐œ€  ๐œ€ โ‰ค 1+ kX โˆ’ Y k 2 + kX โˆ’ Y k 2 โ‰ค ๐œ€ kX โˆ’ Y k 2 . 3 3 3 58 Let PL be the orthogonal projection operator onto L. Our first step in establishing that ๐ฟหœ is an (๐œ€/3)โ€“JL embedding of all tensors of the form X โˆ’ Y Y โˆˆ L into C๐‘š ร—ยทยทยทร—๐‘š will be to show  1 ๐‘‘0 that ๐ฟหœ preserves all the angles between PL (X) and L well enough that the Pythagorean theorem โŠฅ kX โˆ’ Y k 2 = k PL โŠฅ (X) + PL (X) โˆ’ Y k 2 = k PL โŠฅ (X)k 2 + k PL (X) โˆ’ Y k 2 still approximately holds for all Y โˆˆ L after ๐ฟหœ is applied. Toward that end, let ๐œธ โˆˆ ๐‘Ÿ be such C P ร P that L (X) โˆ’ Y = ๐‘˜โˆˆ[๐‘Ÿ] ๐›พ ๐‘˜ T๐‘˜ and note that k๐œธk 2 = k L (X) โˆ’ Y k due to the orthonormality of {T๐‘˜ } ๐‘˜โˆˆ[๐‘Ÿ] . Appealing to Lemma 4.1.2 we now have that ๐ฟหœ ( PL (X) โˆ’ Y) , ๐ฟหœ (PL PL โˆ‘๏ธ  ๐›พ ๐‘˜ ๐ฟหœ (T๐‘˜ ) , ๐ฟหœ  PL โŠฅ (X)  k PL โŠฅ (X)) = k โŠฅ (X) k โŠฅ (X)k ๐‘˜โˆˆ[๐‘Ÿ] P PL   โˆ‘๏ธ ๐œ€ ๐œ€ โ‰ค k L โŠฅ (X) k โˆš |๐›พ ๐‘˜ | โ‰ค k โŠฅ (X) k k๐œธk 2 6 ๐‘Ÿ 6 ๐‘˜ โˆˆ[๐‘Ÿ] (5.5) ๐œ€  PL PL (X) โˆ’ Y k 2  ๐œ€ โ‰ค k โŠฅ (X) k 2 + k = kX โˆ’ Y k 2 . 12 12 Using (5.5) we can now see that 2 ๐ฟหœ (X โˆ’ Y) 2 โˆ’ kX โˆ’ Y k 2 = ๐ฟหœ (X โˆ’ Y) 2 2 โˆ’k PL (X)k 2 โˆ’ k PL (X) โˆ’ Y k 2 โŠฅ ๐ฟหœ ( PL (X)) โˆ’ k PL (X)k 2 + ๐ฟหœ ( PL (X) โˆ’ Y) โˆ’ k PL (X) โˆ’ Y k 2 2 2 โ‰ค โŠฅ โŠฅ + 2 ๐ฟหœ ( PL (X) โˆ’ Y) , ๐ฟหœ ( PL (X)) โŠฅ k PL (X)k 2 + k PL (X) โˆ’ Y k 2 + kX โˆ’ Y k 2 = kX โˆ’ Y k 2 . ๐œ€  ๐œ€ โ‰ค โŠฅ 6 3 Thus, ๐ฟหœ has the desired JL-embedding property required to conclude the proof. 5.1.1 Experiments: Effect of JL Embeddings on Least Squares Solutions In this section, trial least squares experiments with compressed tensor data are performed to show the effect of modewise JL embeddings on solutions to least squares problems. 
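As a small numerical preview of these experiments, the following sketch (an added illustration under the assumption of Gaussian JL matrices with i.i.d. N(0, 1/m_j) entries) compresses each mode of a random 4-mode tensor in sequence, exactly as in the construction of Lemma 5.1.1, and reports the resulting relative norm:

import numpy as np

def modewise_jl(X, ms, rng):
    # Compress each mode of X in turn with an independent Gaussian JL matrix
    # A^(j) of size m_j x n_j, i.e. form X x_1 A^(1) ... x_d A^(d).
    Y = X
    for j, m in enumerate(ms):
        n = Y.shape[j]
        A = rng.standard_normal((m, n)) / np.sqrt(m)
        Y = np.moveaxis(np.tensordot(A, Y, axes=([1], [j])), 0, j)
    return Y

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 40, 40, 40))
Y = modewise_jl(X, ms=[20, 20, 20, 20], rng=rng)
print(np.linalg.norm(Y) / np.linalg.norm(X))  # relative norm, typically close to 1

Values close to 1 indicate that the squared norm, and hence the least squares objective in (5.3)-(5.4), is nearly preserved by the modewise compression.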
In the experiments 59 of this section, the first sample of the three MRI data samples used in Section 4.3 is employed. Again, all experiments were carried out in MATLAB. First, it is shown that this MRI sample has a relatively low-rank CP representations by plotting its CP reconstruction error for various choices of rank. Next, the effect of modewise JL on least squares solutions is investigated by solving for the coefficients of the CP decomposition of the MRI sample in a least squares problem. This will be done by performing 1-stage (modewise) and 2-stage JL on the data, which we call compressed least squares, and will be compared with the case where a regular uncompressed least squares problem is solved instead. 5.1.1.1 CPD Reconstruction Before the experimental results, a short description of the basic form of CPD calculation is reviewd. ( ๐‘—) Given a tensor X, assume ๐‘Ÿ is known beforehand. The problem is now the calculation of x ๐‘˜ for ๐‘— โˆˆ [๐‘‘] and ๐‘˜ โˆˆ [๐‘Ÿ] and ๐œถ in (4.3), i.e. the solution to โˆ‘๏ธ๐‘Ÿ min kX โˆ’ Xฬ‚k with Xฬ‚ = ๐›ผ ๐‘˜ x ๐‘˜(1) x ๐‘˜(2) ยทยทยท x ๐‘˜(๐‘‘) . (5.6) Xฬ‚ ๐‘˜=1 As the Euclidean norm a ๐‘‘-mode tensor is equal to the Frobenius norm of its mode- ๐‘— unfoldings ( ๐‘—) for ๐‘— โˆˆ [๐‘‘], by letting x ๐‘˜ be the ๐‘˜ th column of a matrix X ( ๐‘—) โˆˆ C๐‘› ร—๐‘Ÿ , the above minimization ๐‘— problem can be written as  > ( ๐‘—) (๐‘‘) ( ๐‘—+1) ( ๐‘—โˆ’1) (1) min X ( ๐‘—) โˆ’ Xฬ‚ X ยทยทยท X X ยทยทยท X Xฬ‚ ( ๐‘—) F where Xฬ‚ ( ๐‘—) = X ( ๐‘—) diag (๐œถ), and the operator diag(ยท) creates a diagonal matrix with ๐œถ as its diagonal. Once solved for, the columns of Xฬ‚ ( ๐‘—) can then be normalized and used to form the coefficients รŽ ( ๐‘—) ๐›ผ ๐‘˜ = ๐‘‘๐‘—=1 k xฬ‚ ๐‘˜ k 2 for ๐‘˜ โˆˆ [๐‘Ÿ], although this is optional, i.e., if the columns are not normalized, the coefficients ๐›ผ ๐‘˜ in the factorization will all be ones. This procedure is repeated iteratively until the fit ceases to improve (the objective function stops improving with respect to a tolerance) or the maximum number of iterations are exhausted. To choose the rank of the decomposition as well as 60 0.35 0.3 0.25 0.2 0.15 0.1 0.05 20 40 60 80 100 120 140 160 180 200 Figure 5.1: Relative reconstruction error of CPD calculated for different values of rank ๐‘Ÿ for MRI data. As the rank increases, the error becomes smaller. obtaining the best estimates for X ( ๐‘—) , a commonly used consistency diagnostic called CORCONDIA can be employed as explained in Section 3.1.2. Now, the relative reconstruction error of CPD is calculated and plotted for various values of rank ๐‘Ÿ. Assuming X represents the data, this error is defined as kX โˆ’ Xฬ‚k ๐‘’ ๐‘ ๐‘๐‘‘ = , kXk where Xฬ‚ denotes the reconstruction of X. Figure 5.1 displays the results. 5.1.1.2 Compressed Least Squares Performance ( ๐‘—) Let x ๐‘˜ be known in ๐‘Ÿ โˆ‘๏ธ ๐‘‘ ( ๐‘—) Xโ‰ˆ ๐›ผ๐‘˜ ๐‘—=1 x๐‘˜ , ๐‘˜=1 for ๐‘˜ โˆˆ [๐‘Ÿ] and ๐‘— โˆˆ [๐‘‘]. They can be obtained from a previous iteration in the CPD fitting procedure. Here, they come from the CPD of the data calculated in section 5.1.1.1. Also, assume ( ๐‘—) these vectors have unit norms. In general, as stated in section 5.1.1.1, when x ๐‘˜ are obtained using a CPD algorithm, they do not necessarily have unit norms. Therefore, they are normalized and the รŽ ( ๐‘—) norms are absorbed into the coefficients of CPD. In other words, ๐›ผ ๐‘˜ = ๐‘‘๐‘—=1 kx ๐‘˜ k 2 for ๐‘˜ โˆˆ [๐‘Ÿ]. 
If the normalization of the vectors is not performed, ๐›ผ ๐‘˜ = 1 for ๐‘˜ โˆˆ [๐‘Ÿ]. The coefficients of the CPD 61 fit are the solutions to the following least squares problem, โˆ‘๏ธ๐‘Ÿ ๐‘‘ ( ๐‘—) ๐œถ = arg min X โˆ’ ๐›ฝ๐‘˜ ๐‘—=1 x๐‘˜ . ๐œท ๐‘˜=1 ( ๐‘—) As normalization of x ๐‘˜ was not performed when computing the CPD of the data in these experi- ments, the true solution will be ๐œถ = 1. An approximate solution for the coefficients can be obtained by solving for ! โˆ‘๏ธ๐‘Ÿ ๐‘‘ ( ๐‘—) ๐œถ ๐‘ƒ = arg min ๐ฟ (X) โˆ’ ๐ฟ ๐›ฝ๐‘˜ ๐‘—=1 x๐‘˜ , ๐œท ๐‘˜=1 where ๐œถ ๐‘ƒ is the vector ๐œถ estimated for randomly projected data, and ๐ฟ (X) is defined as per (4.21). This is in fact simply another way of demonstrating that solving (5.4) yields an approximate solution to (5.3) for a (๐‘‘ โˆ’ 1)-mode tensor. Of course, both of these problems can be solved using the vectorized versions of the tensors instead. Indeed, for ๐œถ ๐‘ƒ , vectorization should be done after random projection of X and the rank-1 tensors, i.e., ๐œถ ๐‘ƒ = arg min kx๐‘ƒ โˆ’ B๐œทk 2 = (Bโˆ— B) โˆ’1 Bโˆ— x๐‘ƒ = Bโ€  x๐‘ƒ , ๐œท where Bโ€  denotes the pseudo-inverse of B, x๐‘ƒ = vec (๐ฟ (X)), and B is a matrix whose ๐‘˜ th column    is vec ๐ฟ ๐‘‘ x ( ๐‘—) 1 for ๐‘˜ โˆˆ [๐‘Ÿ].2 The error measure used to evaluate the approximate solution ๐‘—=1 ๐‘˜ is defined as ๐‘’ ๐‘ƒ โˆ’ ๐‘’๐‘‡ ๐‘’๐‘Ÿ = , ๐‘’๐‘‡ ร๐‘Ÿ ( ๐‘—) ร ( ๐‘—) where ๐‘’๐‘‡ = X โˆ’ ๐‘˜=1 ๐›ผ ๐‘˜ ๐‘‘ ๐‘—=1 x๐‘˜ and ๐‘’ ๐‘ƒ = X โˆ’ ๐‘Ÿ๐‘˜=1 ๐›ผ๐‘ƒ,๐‘˜ ๐‘‘ ๐‘—=1 x๐‘˜ . This in fact compares the true CPD reconstruction error and the reconstruction error calculated using the approximate solution for the CPD coefficients ๐œถ ๐‘ƒ . The results are shown in Figure 5.2.   1 Again, ( ๐‘—) it is clear that in the 2-stage case, ๐ฟ (X) and ๐ฟ ๐‘‘ x ๐‘—=1 ๐‘˜ are vectors, and therefore, the operator vec (ยท) does not change the result. 2 The backslash operator was used to actually solve the resulting least squares problems in MATLAB. 62 100 Gaussian Gaussian RFD RFD Gaussian+RFD Gaussian+RFD 10-1 RFD+RFD RFD+RFD Vectorize+RFD 10-1 Vectorize+RFD 10-2 10-2 10-3 10-3 10-4 10-3 10-2 10-4 10-3 10-2 (a) (b) Gaussian Gaussian 100 RFD RFD Gaussian+RFD Gaussian+RFD RFD+RFD 101 RFD+RFD Vectorize+RFD Vectorize+RFD 10-1 10-2 100 -3 10 10-4 10-3 10-2 10-6 10-5 10-4 10-3 10-2 (c) (d) Figure 5.2: Effect of JL embeddings on the relative reconstruction error of least squares estimation of CPD coefficients. In the 2-stage cases, ๐‘ 2 = 0.05 has been used. (a) ๐‘Ÿ = 40. (b) ๐‘Ÿ = 75. (c) ๐‘Ÿ = 110. (d) Average runtime for ๐‘Ÿ = 40. The other runtime plots for ๐‘Ÿ = 75 and ๐‘Ÿ = 110 are qualitatively identical. In Figure 5.2, the compressed least squares results for the aforementioned MRI data sample have been plotted. We can observe that as we choose a higher rank for the CPD model, we obtain a smaller error in the estimated coefficients of ๐œถ. As expected, the โ€˜Vectorize+RFDโ€™ case yields the most accurate results by a small margin. However, its runtime is considerably larger due to the huge size of the vectorized tensor, although the it is benefiting from the computational efficiency of the FFT.3 To see why the runtime plot is almost flat regardless of the chosen compression, see 3 For information about how RFD is applied after vectorization, refer to Section 4.3.1. 63 the discussion at the end of Section 4.3.1. 
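A minimal sketch of the compressed coefficient fit just described is given below (added for illustration; `factors` is assumed to hold the known CP factor vectors x_k^(j) as the columns of one matrix per mode, and `mats` holds one JL matrix per mode):

import numpy as np

def rank1(vectors):
    # Outer product of d vectors: the rank-1 tensor  x^(1) o ... o x^(d).
    T = vectors[0]
    for v in vectors[1:]:
        T = np.multiply.outer(T, v)
    return T

def modewise_jl(T, mats):
    # Apply the JL matrix mats[j] to mode j of T for every mode.
    for j, A in enumerate(mats):
        T = np.moveaxis(np.tensordot(A, T, axes=([1], [j])), 0, j)
    return T

def compressed_cp_coefficients(X, factors, mats):
    # Solve the compressed least squares problem
    #   min_beta || L(X) - sum_k beta_k L(x_k^(1) o ... o x_k^(d)) ||
    # where L is the modewise JL map defined by `mats`.
    r = factors[0].shape[1]
    x_P = modewise_jl(X, mats).ravel()
    B = np.column_stack([
        modewise_jl(rank1([F[:, k] for F in factors]), mats).ravel()
        for k in range(r)
    ])
    alpha_P, *_ = np.linalg.lstsq(B, x_P, rcond=None)
    return alpha_P

The relative error e_r of the text can then be computed by comparing the reconstructions obtained with alpha_P and with the uncompressed solution.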
5.2 Application to Many-Body Perturbation Theory Problems

This section provides a framework for modeling the energy correction terms as sums of inner products between tensors, so that each inner product can be approximated according to the geometry-preserving property of JL embeddings outlined in Lemma 4.1.1. The idea is to calculate the inner product of tensors with reduced dimensions to obtain an approximate value of the true energy terms. In doing so, it is assumed that the data lie on a low-rank inner product space of tensors.

5.2.1 Second-order energy correction

The 2nd-order energy correction term is defined as

    E^(2) = Σ_{J=0}^{N_J−1} E^(2)(J),        (5.7)

where N_J is the number of blocks, and

    E^(2)(J) = −(1/4)(2J+1) Σ_{ijkl} H_{ijkl} H_{klij} D_{ijkl} = −(1/4)(2J+1) ⟨H, H̃⟩,        (5.8)

in which the Hamiltonian tensor H ∈ R^{n×n×n×n} should be updated for each value of J, and D has the same dimensions as H and is calculated from single-particle energy values. The tensor H̃ is a permuted version of H multiplied component-wise by D, i.e.,

    H̃_{ijkl} = H_{klij} D_{ijkl}.        (5.9)

Now, an approximation of (5.8) can be computed by randomly projecting H and H̃ onto a lower-dimensional space using modewise Johnson-Lindenstrauss embeddings:

    S = H ×_1 A^(1) ×_2 A^(2) ×_3 A^(3) ×_4 A^(4),        (5.10)

    S̃ = H̃ ×_1 A^(1) ×_2 A^(2) ×_3 A^(3) ×_4 A^(4),        (5.11)

where A^(j) ∈ R^{m_j×n} are JL matrices and m_j ≤ n for j ∈ [4]. Now, with high probability,

    ⟨H, H̃⟩ ≈ ⟨S, S̃⟩,        (5.12)

to within an adjustable error that is related to the target dimension sizes m_j. A 2nd-stage JL embedding can be applied to the vectorized versions of S and S̃ to further compress the projected tensors before computing the approximate inner product. This is done according to

    s_p = A vect(S),        (5.13)

where s_p ∈ R^m and A ∈ R^{m × Π_{j=1}^d m_j}.

Note 5.2.1 Assuming real arithmetic, the operations count to directly calculate the inner product is O(n_1 ··· n_d), while the computational complexity of a one-stage JL embedding is O(m_1 n_1 ··· n_d) as discussed in Section 4.2.1.1, which is obviously higher. However, the same compressed tensors can be reused when calculating many observables, including higher-order perturbative terms such as the third-order energy correction and the radius corrections. As the number of such terms increases, the overall computational complexity becomes much lower when compressed tensors are used to approximate the observables.

5.2.2 Radius Corrections

In what follows, a one-stage (modewise) JL compression scheme is discussed. In both cases shown below, a second-stage JL compression can also be performed after vectorizing the result of the first stage.
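The following sketch shows how (5.9)-(5.12) translate into a few lines of NumPy for a single block J; the synthetic, randomly generated H and D here are purely illustrative assumptions (real Hamiltonian and energy-denominator tensors would come from the many-body code), and the same machinery is reused for the radius terms below:

import numpy as np

def modewise_jl(T, mats):
    # Apply the JL matrix mats[j] to mode j of T, as in (5.10)-(5.11).
    for j, A in enumerate(mats):
        T = np.moveaxis(np.tensordot(A, T, axes=([1], [j])), 0, j)
    return T

def e2_block(H, D, J, mats):
    # Approximate E^(2)(J) = -(2J+1)/4 * <H, H~> with H~_ijkl = H_klij * D_ijkl,
    # using the compressed inner product <S, S~>.
    Ht = np.transpose(H, (2, 3, 0, 1)) * D
    S = modewise_jl(H, mats)
    St = modewise_jl(Ht, mats)
    return -0.25 * (2 * J + 1) * np.vdot(S, St)

rng = np.random.default_rng(2)
n, m = 30, 10
H = rng.standard_normal((n,) * 4)
D = rng.standard_normal((n,) * 4)
mats = [rng.standard_normal((m, n)) / np.sqrt(m) for _ in range(4)]
exact = -0.25 * np.vdot(H, np.transpose(H, (2, 3, 0, 1)) * D)
print(exact, e2_block(H, D, 0, mats))  # direct vs compressed estimate for J = 0

Note that the same JL matrices A^(j) must be applied to both H and H̃ so that the inner product, rather than just the individual norms, is (approximately) preserved.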
Particle Term: 65 The one-body particle term is expressed in the following way 1 โˆ‘๏ธ ๐‘…1 = H๐‘– ๐‘— ๐‘˜๐‘™ D๐‘– ๐‘— ๐‘˜๐‘™ D๐‘š ๐‘— ๐‘˜๐‘™ H๐‘˜๐‘™๐‘š ๐‘— ๐‘…๐‘š๐‘– 2 ๐‘– ๐‘— ๐‘˜๐‘™๐‘š 1 โˆ‘๏ธ โˆ‘๏ธ = H๐‘– ๐‘— ๐‘˜๐‘™ D๐‘– ๐‘— ๐‘˜๐‘™ D๐‘š ๐‘— ๐‘˜๐‘™ H๐‘˜๐‘™๐‘š ๐‘— ๐‘…๐‘š๐‘– (5.14) 2 ๐‘– ๐‘— ๐‘˜๐‘™ ๐‘š 1D E = HฬŒ , Hฬ‚ , 2 where HฬŒ is obtained by the component-wise product of H and D, i.e., HฬŒ๐‘– ๐‘— ๐‘˜๐‘™ = H๐‘– ๐‘— ๐‘˜๐‘™ D๐‘– ๐‘— ๐‘˜๐‘™ , โˆ‘๏ธ โˆ‘๏ธ Hฬ‚๐‘– ๐‘— ๐‘˜๐‘™ = D๐‘š ๐‘— ๐‘˜๐‘™ H๐‘˜๐‘™๐‘š ๐‘— ๐‘…๐‘š๐‘– = Hฬƒ๐‘š ๐‘— ๐‘˜๐‘™ ๐‘…๐‘š๐‘– , (5.15) ๐‘š ๐‘š and R is the radius operator and a square matrix. Here, Hฬƒ is defined in (5.9). We can observe that Hฬ‚ = Hฬƒ ร—1 R> . Therefore, the approximate correction term would be calculated as 1 ๐‘…1 โ‰ˆ H๐‘1 , H๐‘2 , 2 where ? 4 H ๐‘1 = HฬŒ A (โ„“) , (5.16) โ„“=1 and ? 4 ? 4 (โ„“) (1) > H ๐‘2 = Hฬ‚ A = Hฬƒ ร—1 (A R ) A (โ„“) . (5.17) โ„“=1 โ„“=2 Hole Term: Calculations for the one-body hole term are very similar to the first term, as shown below. 1 โˆ‘๏ธ ๐‘…2 = H๐‘– ๐‘— ๐‘˜๐‘™ D๐‘– ๐‘— ๐‘˜๐‘™ H๐‘š๐‘™๐‘– ๐‘— D๐‘– ๐‘— ๐‘š๐‘™ ๐‘… ๐‘˜๐‘š (5.18) 2 ๐‘– ๐‘— ๐‘˜๐‘™๐‘š 1 โˆ‘๏ธ โˆ‘๏ธ = H๐‘– ๐‘— ๐‘˜๐‘™ D๐‘– ๐‘— ๐‘˜๐‘™ D๐‘– ๐‘— ๐‘š๐‘™ H๐‘š๐‘™๐‘– ๐‘— ๐‘… ๐‘˜๐‘š (5.19) 2 ๐‘– ๐‘— ๐‘˜๐‘™ ๐‘š 1D E = HฬŒ , Hฬ„ , (5.20) 2 where โˆ‘๏ธ โˆ‘๏ธ Hฬ„๐‘– ๐‘— ๐‘˜๐‘™ = D๐‘– ๐‘— ๐‘š๐‘™ H๐‘š๐‘™๐‘– ๐‘— ๐‘… ๐‘˜๐‘š = Hฬƒ๐‘– ๐‘— ๐‘š๐‘™ ๐‘… ๐‘˜๐‘š . (5.21) ๐‘š ๐‘š 66 We have that Hฬ„ = Hฬƒ ร—3 R. (5.22) Therefore, 1 ๐‘…2 โ‰ˆ H๐‘1 , H๐‘2 , (5.23) 2 where ? 4 H ๐‘1 = HฬŒ A (โ„“) (5.24) โ„“=1 as in the case of the first correction term, and ? 4 H ๐‘2 = Hฬ„ A (โ„“) โ„“=1 (5.25) = Hฬƒ ร—1 A (1) ร—2 A (2) ร—3 (A (3) R) ร—4 A (4) . It can be observed that all one needs to calculate the approximations to ๐‘…1 and ๐‘…2 is the two tensors HฬŒ and Hฬƒ . This has been depicted in the block diagram of Figure 5.3. In many cases, due to the symmetry in H , we have that HฬŒ = Hฬƒ which further reduces the storage requirements. Figure 5.3: A block diagram showing how the approximations to ๐‘…1 and ๐‘…2 are calculated. 67 5.2.3 Third-order energy correction The 3rd -order energy correction term is defined as ๐ฝ โˆ’1 ๐‘โˆ‘๏ธ (3) ๐ธ = ๐ธ (3) (๐ฝ) , (5.26) ๐ฝ=0 where the term ๐ธ (3) (๐ฝ) is calculated in each of the following settings. 5.2.3.1 Particle-Particle In the particle-particle case, 1 โˆ‘๏ธ ๐ธ (3) (๐ฝ) = (2๐ฝ + 1) H (๐‘–, ๐‘—, ๐‘˜, ๐‘™) H (๐‘˜, ๐‘™, ๐‘š, ๐‘›) H (๐‘š, ๐‘›, ๐‘–, ๐‘—) D (๐‘˜, ๐‘™, ๐‘–, ๐‘—) D (๐‘š, ๐‘›, ๐‘–, ๐‘—). 8 ๐‘–, ๐‘—,๐‘˜,๐‘™,๐‘š,๐‘› (5.27) Again, the Hamiltonian tensor H should be updated for each value of ๐ฝ. To calculate the sum, one can use a scheme similar to the one used in section 5.2.1, but this time with 6 dimensions: generate 6-dimensional tensors by regrouping the terms in (5.27) as H1 and H2 , and then calculate the inner product between H1 and H2 . There are multiple ways to group the terms in (5.27). In the following, two grouping options are listed. 
๏ฃฑ ๏ฃฒ H1 (๐‘–, ๐‘—, ๐‘˜, ๐‘™, ๐‘š, ๐‘›) = 81 H (๐‘–, ๐‘—, ๐‘˜, ๐‘™)H (๐‘˜, ๐‘™, ๐‘š, ๐‘›) ๏ฃด ๏ฃด ๏ฃด ๏ฃด Option 1 : (5.28) ๏ฃด ๏ฃด H2 (๐‘–, ๐‘—, ๐‘˜, ๐‘™, ๐‘š, ๐‘›) = H (๐‘š, ๐‘›, ๐‘–, ๐‘—)D (๐‘˜, ๐‘™, ๐‘–, ๐‘—)D (๐‘š, ๐‘›, ๐‘–, ๐‘—), ๏ฃด ๏ฃด ๏ฃณ ๏ฃฑ ๏ฃฒ H1 (๐‘–, ๐‘—, ๐‘˜, ๐‘™, ๐‘š, ๐‘›) = 18 H (๐‘–, ๐‘—, ๐‘˜, ๐‘™)H (๐‘˜, ๐‘™, ๐‘š, ๐‘›)D (๐‘˜, ๐‘™, ๐‘–, ๐‘—) ๏ฃด ๏ฃด ๏ฃด ๏ฃด Option 2 : (5.29) ๏ฃด ๏ฃด H2 (๐‘–, ๐‘—, ๐‘˜, ๐‘™, ๐‘š, ๐‘›) = H (๐‘š, ๐‘›, ๐‘–, ๐‘—)D (๐‘š, ๐‘›, ๐‘–, ๐‘—). ๏ฃด ๏ฃด ๏ฃณ Now, if S1 and S2 are the projected versions of H1 and H2 , we expect that ๐ธ (3) (๐ฝ) = (2๐ฝ + 1) hH1 , H2 i โ‰ˆ (2๐ฝ + 1) hS1 , S2 i. The problem with this approach lies in the fact that when the dimension sizes increase, the 6-mode tensors become problematic in terms of storage. For instance, for H โˆˆ R100ร—100ร—100ร—100 , 68 7.45 TB of space is needed to store each of H1 and H2 . To overcome this problem, we can reshape the tensors and perform the projections as explained below. It is observed that the indices of the hypothetical 6-mode tensors always appear in groups of two in the inner product summation. Therefore, they can be reshaped into 3-mode tensors by regrouping the indices, and one may perform mode-wise JL on the reshaped tensors. ๏ฃฑ ๏ฃฒ H1 ( ๐‘, ๐‘ž, ๐‘Ÿ) = 18 H ( ๐‘, ๐‘ž)H (๐‘ž, ๐‘Ÿ) ๏ฃด ๏ฃด ๏ฃด ๏ฃด Option 1 : (5.30) ๏ฃด ๏ฃด H2 ( ๐‘, ๐‘ž, ๐‘Ÿ) = H (๐‘Ÿ, ๐‘)D (๐‘ž, ๐‘)D (๐‘Ÿ, ๐‘) = Hฬƒ (๐‘Ÿ, ๐‘)D (๐‘ž, ๐‘), ๏ฃด ๏ฃด ๏ฃณ where Hฬƒ is defined similarly as in (5.9). Here, ๐‘ represents all relevant pairs of ๐‘– and ๐‘—, ๐‘ž encodes all pairs of ๐‘˜ and ๐‘™, and ๐‘Ÿ represents all pairs of ๐‘š and ๐‘› in the grouping operation4. The repetitive patterns existing in 3-mode tensors that now can be formed using combinations of matrices, as well as reducing the size of two modes at once, as a result of combining two indices into one index, will provide the tools to avoid dealing with extremely large tensors when performing the projections. For instance, to project H1 in (5.30), one must calculate โˆ‘๏ธ P1 (๐‘–1 , ๐‘–2 , ๐‘–3 ) = H1 ( ๐‘, ๐‘ž, ๐‘Ÿ)A (1) (๐‘–1 , ๐‘) A (2) (๐‘–2 , ๐‘ž) A (3) (๐‘–3 , ๐‘Ÿ) , ๐‘,๐‘ž,๐‘Ÿ which is the element-wise version of P1 = H1 ร—1 A (1) ร—2 A (2) ร—3 A (3) . According to the way H1 is defined it will be possible to decompose the triple summation into separate sums for two of the mode-wise projections. This way, one can obtain the fully-projected tensor P1 by only dealing with 2-mode (partially) compressed arrays. Algebraic details on how the mode-wise projections can be done in a memory efficient way are presented in Appendix B.1. Due to the resemblance of options 1 and 2 in terms of the method used, only option 1 will be considered for future experiments, and also in the Hole-Hole and Particle-Hole settings, only one option will be discussed. 4 For instance, if column-major formatting is used, and assuming the indices start from 1, the relation between ๐‘, ๐‘– and ๐‘— is ๐‘ = ๐‘– + ( ๐‘— โˆ’ 1) ๐‘, where ๐‘–, ๐‘— โˆˆ [๐‘]. 69 5.2.3.2 Hole-Hole In this case, the energy term for each value of ๐ฝ is expressed by 1 โˆ‘๏ธ ๐ธ (3) (๐ฝ) = (2๐ฝ + 1) H (๐‘–, ๐‘—, ๐‘˜, ๐‘™) H (๐‘˜, ๐‘™, ๐‘š, ๐‘›) H (๐‘š, ๐‘›, ๐‘–, ๐‘—) D (๐‘š, ๐‘›, ๐‘–, ๐‘—) D (๐‘š, ๐‘›, ๐‘˜, ๐‘™). 
8 ๐‘–, ๐‘—,๐‘˜,๐‘™,๐‘š,๐‘› (5.31) For simplicity, only one option will be used to form H1 and H2 , where ๏ฃฑ ๏ฃฒ H1 ( ๐‘, ๐‘ž, ๐‘Ÿ) = 18 H ( ๐‘, ๐‘ž)H (๐‘ž, ๐‘Ÿ) ๏ฃด ๏ฃด ๏ฃด ๏ฃด (5.32) ๏ฃด ๏ฃด H2 ( ๐‘, ๐‘ž, ๐‘Ÿ) = H (๐‘Ÿ, ๐‘)D (๐‘Ÿ, ๐‘)D (๐‘Ÿ, ๐‘ž) = Hฬƒ (๐‘Ÿ, ๐‘)D (๐‘Ÿ, ๐‘ž). ๏ฃด ๏ฃด ๏ฃณ Again, ๐‘– and ๐‘— are combined to form ๐‘, ๐‘˜ and ๐‘™ are grouped to form ๐‘ž, and ๐‘š and ๐‘› are combined to form ๐‘Ÿ, as explained above. Details on the calculations of mode-wise projections are presented in Appendix B.2. 5.2.3.3 Particle-Hole In this case, the energy term for each value of ๐ฝ is calculated by โˆ‘๏ธ ๐ธ (3) (๐ฝ) = (2๐ฝ + 1) H ๐‘ (๐‘–, ๐‘—, ๐‘˜, ๐‘™) H ๐‘ (๐‘˜, ๐‘™, ๐‘š, ๐‘›) H ๐‘ (๐‘š, ๐‘›, ๐‘–, ๐‘—) D (๐‘˜, ๐‘—, ๐‘™, ๐‘–) D ( ๐‘—, ๐‘š, ๐‘–, ๐‘›), ๐‘–, ๐‘—,๐‘˜,๐‘™,๐‘š,๐‘› (5.33) where the Hamiltonians are obtained after a Pandya transform shown by the subscript ๐‘ in the summation. To make the process of reshaping the data into 3-mode tensors possible, the dimensions of D should be permuted to get D1 (๐‘–, ๐‘—, ๐‘˜, ๐‘™) = D (๐‘˜, ๐‘—, ๐‘™, ๐‘–) D2 (๐‘š, ๐‘›, ๐‘–, ๐‘—) = D ( ๐‘—, ๐‘š, ๐‘–, ๐‘›). Then, we can choose ๏ฃฑ ๏ฃด ๏ฃฒ Hฬƒ1 (๐‘–, ๐‘—, ๐‘˜, ๐‘™) = H ๐‘ (๐‘–, ๐‘—, ๐‘˜, ๐‘™)D1 (๐‘–, ๐‘—, ๐‘˜, ๐‘™) ๏ฃด ๏ฃด ๏ฃด (5.34) ๏ฃด ๏ฃด Hฬƒ2 (๐‘š, ๐‘›, ๐‘–, ๐‘—) = H ๐‘ (๐‘š, ๐‘›, ๐‘–, ๐‘—)D2 (๐‘š, ๐‘›, ๐‘–, ๐‘—), ๏ฃด ๏ฃด ๏ฃณ 70 ๐‘’Max 4 6 8 10 12 14 ๐‘› 30 56 90 132 174 216 Table 5.1: Basis truncation parameters and mode dimensions for single-particle bases labeled by ๐‘’Max. leading to the reshaped version ๏ฃฑ ๏ฃด ๏ฃฒ H1 ( ๐‘, ๐‘ž, ๐‘Ÿ) = Hฬƒ1 ( ๐‘, ๐‘ž)H ๐‘ (๐‘ž, ๐‘Ÿ) ๏ฃด ๏ฃด ๏ฃด (5.35) ๏ฃด ๏ฃด H2 ( ๐‘, ๐‘ž, ๐‘Ÿ) = Hฬƒ2 (๐‘Ÿ, ๐‘). ๏ฃด ๏ฃด ๏ฃณ Memory efficient calculations for the mode-wise projections can be found in Appendix B.3. 5.2.4 Experiments In this section, numerical results are provided to demonstrate how mode-wise JL embeddings affect the accuracy of energy calculations. Experiments are done for different data sizes, i.e., for H , D โˆˆ R๐‘›ร—๐‘›ร—๐‘›ร—๐‘› where the dimension size ๐‘› is chosen from the set of number listed in Table 5.1. In each case, the relative error in ๐ธ (2) , ๐ธ (3) , and the radius correction terms ๐‘…1 and ๐‘…2 defined by (5.36), (5.37), and (5.38), and are plotted for various values of compression. ๐ธ ๐‘(2) โˆ’ ๐ธ (2) ! ฮ”๐ธ (2) = mean . (5.36) ๐ธ (2) ๐ธ ๐‘(3) โˆ’ ๐ธ (3) ! ฮ”๐ธ (3) = mean . (5.37) ๐ธ (3)   ๐‘…๐‘ โˆ’ ๐‘… ฮ”๐‘… = mean . (5.38) ๐‘… where the subscript ๐‘ is used to denote the corresponding value calculated after the projection of tensors, and mean(๐‘‹) denotes the mean of ๐‘‹. Compression in mode ๐‘— is defined by ๐‘š๐‘— ๐‘๐‘— = , (5.39) ๐‘ 71 where ๐‘ and ๐‘š ๐‘— denote the size of mode ๐‘— before and after projection, respectively. The target   dimension ๐‘š ๐‘— in JL matrices is chosen as ๐‘š ๐‘— = ๐‘ ๐‘— ๐‘› for all ๐‘—, to ensure that at least a fraction ๐‘ ๐‘— of the ambient dimension in mode ๐‘— is preserved. In the experiments, compression is chosen the same for all modes, i.e., ๐‘ ๐‘— = ๐‘ for all ๐‘—. It should be noted that for the ๐ธ (3) calculations, the size of each dimension in the reshaped data is ๐‘ = ๐‘›2 , while for the ๐ธ (2) and ๐‘… calculations, ๐‘ = ๐‘›. 5.2.4.1 ๐ธ (2) Experiments Experiment results for O16 have been plotted in Figures 5.4, 5.5, and 5.6. 
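To make the memory-efficient projection of Section 5.2.3.1 concrete before turning to the results, the following sketch (illustrative sizes and variable names only, an assumption of this write-up rather than the thesis code) computes P1 = H1 ×_1 A^(1) ×_2 A^(2) ×_3 A^(3) for H1(p,q,r) = (1/8) H(p,q) H(q,r) without ever forming the 3-mode tensor H1, by splitting the triple sum over p, q, r into separate matrix products:

import numpy as np

def project_h1(Hmat, A1, A2, A3):
    # Because H1 factors over its indices, the projection separates as
    #   P1[a,b,c] = (1/8) * sum_q (A1 H)[a,q] * A2[b,q] * (H A3^T)[q,c],
    # so only N x N matrices and the small projected arrays are needed.
    left = A1 @ Hmat      # sums over p, shape (m1, N)
    right = Hmat @ A3.T   # sums over r, shape (N, m3)
    return np.einsum('aq,bq,qc->abc', left, A2, right) / 8.0

rng = np.random.default_rng(3)
N, m = 64, 8              # a real eMax = 8 case would have N = 90^2
Hmat = rng.standard_normal((N, N))
A1, A2, A3 = (rng.standard_normal((m, N)) / np.sqrt(m) for _ in range(3))
P1 = project_h1(Hmat, A1, A2, A3)   # shape (m, m, m)

H2 in (5.30) factors in an analogous way (with the coupling through p instead of q), so the compressed inner product ⟨P1, P2⟩ can likewise be assembled without materializing any 3-mode tensor of side N.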
In Figure 5.7, the relative error in E^(2) has been plotted for two compression levels for O16 and Sn132. These results clearly show that JL embeddings result in smaller error values as the data size increases, and this dependence is almost log-linear.

Figure 5.4: E^(2) experiment results for O16, eMax = 2.

Figure 5.5: E^(2) experiment results for O16, eMax = 4.

Figure 5.6: E^(2) experiment results for O16, eMax = 8.

Figure 5.7: Relative error in E^(2) for total compression values of 0.0009 and 0.0125. (a) O16. (b) Sn132.

5.2.4.2 Radius Correction Experiments

The experiments of this section were done on the data of Tin (Sn132) and Calcium (Ca48) for n = 216, or eMax = 14. The results can be viewed in Figure 5.8.

Figure 5.8: Radius correction results, for interaction em1.8−2.0 and eMax = 14. (a) Ca48, particle term. (b) Ca48, hole term. (c) Sn132, particle term. (d) Sn132, hole term.

5.2.4.3 E^(3) Experiments

The reason the E^(3) experiments do not work as well as the E^(2) ones is that the two tensors forming the inner product become nearly orthogonal in the E^(3) case, in such a way that after projection, their inner product becomes much smaller than their individual norms. In other words, two tensors that are not originally orthogonal are made close to orthogonal after projection onto the lower-dimensional subspace. One can think of a criterion that should be met for the inner product preservation to work, i.e.,

    |⟨Ax, Ay⟩| ≥ ε̄ ‖Ax‖ ‖Ay‖

for some small ε̄, where x and y are fibers in unfoldings of a tensor.

Figure 5.9: Mean absolute relative error in E^(3) for hole-hole and eMax = 8.

Figure 5.10: Mean absolute relative error in E^(3) for particle-particle and eMax = 8.
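The loss of alignment described in Section 5.2.4.3 can be checked directly: the quantity computed below is |⟨Ax, Ay⟩| normalized by ‖Ax‖‖Ay‖, evaluated before and after a Gaussian JL projection (a toy sketch with an artificially weakly correlated pair; in the actual experiments x and y would be fibers of the H1 and H2 tensors):

import numpy as np

def alignment(x, y, A):
    # Ratio |<Ax, Ay>| / (||Ax|| * ||Ay||); the criterion above asks that
    # this quantity not become too small after projection.
    Ax, Ay = A @ x, A @ y
    return abs(np.vdot(Ax, Ay)) / (np.linalg.norm(Ax) * np.linalg.norm(Ay))

rng = np.random.default_rng(4)
n, m = 2000, 100
x = rng.standard_normal(n)
y = 0.05 * x + rng.standard_normal(n)          # weakly correlated pair
A = rng.standard_normal((m, n)) / np.sqrt(m)
before = abs(np.vdot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y))
print(before, alignment(x, y, A))

When the original alignment is already small, the projected inner product is dominated by the embedding's additive error, which is the failure mode observed in the E^(3) experiments.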
Figure 5.11: Mean absolute relative error in E^(3) for particle-hole and eMax = 8.

CHAPTER 6

EXTENSION OF VECTOR-BASED METHODS TO TENSORS AND FUTURE WORK

In this chapter, two of the conventional methods developed for vector data are extended to tensors while keeping the multilinear structure of the data. The computational load and memory requirements of these methods can be mitigated by applying modewise JL embeddings to the input data. For instance, in Algorithm 6.1 of Section 6.1, the training tensors X^(m) can be compressed to a much smaller size by applying modewise JL prior to solving for the projection matrices Ũ^(j). The same set of JL embeddings used to compress all the training samples could also be used for test data, meaning they need to be generated only once. The result is an approximate yet easier to obtain set of projection matrices that is then used to compute the approximate MPCA output. Alternatively, randomized methods can be used to obtain an approximate low-rank representation of the data during the SVD stage of MPCA (in step 2 of Algorithm 6.1 below, the SVD could be used on Σ_{m=1}^M X̃^(m)_(j) instead of the eigen-decomposition of Σ_{m=1}^M X̃^(m)_(j) X̃^(m)⊤_(j)).

In a similar manner, the support vector machine methods discussed in Section 6.3 involve the computation of the CP decomposition of a tensor. Obtaining the CPD requires solving a least squares problem for each mode of the input data, which in turn can be made faster and more memory-efficient by solving the least squares problem using a sketched (compressed) version of the known variables in, e.g., (3.11). This leads to an approximate but computationally more efficient implementation of CPD to be used in the main support vector machine approach.

6.1 Multilinear Principal Component Analysis

Multilinear Principal Component Analysis, abbreviated MPCA, is a dimensionality reduction scheme for tensor objects. As an extension of regular PCA, it is similar in structure to the Tucker decomposition, and its goal is to capture as much of the variation as possible across the modes of a tensor object.

Definition 6.1.1 Let {X^(1), . . . , X^(M)} be a set of tensor objects in R^{n_1×···×n_d}. The total scatter of these tensors is defined as

    Ψ_X = Σ_{m=1}^M ‖X^(m) − X̄‖²,

where X̄ is the mean tensor calculated by

    X̄ = (1/M) Σ_{m=1}^M X^(m).

6.1.1 Problem Statement

Consider {X^(m)}_{m=1}^M for training. The objective is to define a multilinear transformation {Ũ^(j) ∈ R^{n_j×P_j}}_{j=1}^d that maps the tensor space R^{n_1×···×n_d} into a tensor subspace R^{P_1×···×P_d} with P_j ≤ n_j for j ∈ [d], such that the projections

    Y^(m) = X^(m) ×_1 Ũ^(1)⊤ ×_2 ··· ×_d Ũ^(d)⊤ ∈ R^{P_1×···×P_d},   m ∈ [M],

capture most of the variation in {X^(m)}_{m=1}^M as measured by the total scatter Ψ_Y, i.e.,

    {Ũ^(j)}_{j=1}^d = arg max_{Ũ^(1),...,Ũ^(d)} Ψ_Y.        (6.1)

A pseudo-code outlining the steps of MPCA is shown in Algorithm 6.1 [14], with K denoting the maximal number of allowed iterations.

Theorem 6.1.1 Let {Ũ^(j)}_{j=1}^d be the solution to (6.1). Then, given Ũ^(1), . . .
, Uฬƒ ( ๐‘—โˆ’1) , Uฬƒ ( ๐‘—+1) , . . . , Uฬƒ (๐‘‘) , the matrix Uฬƒ ( ๐‘—) consists of ๐‘ƒ ๐‘— eigenvectors corresponding to the ๐‘ƒ ๐‘— largest eigenvalues of the matrix ๐‘€  โˆ‘๏ธ   > ๐šฝ ( ๐‘—) = X (๐‘š) ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) Uฬƒ Uฬƒ > ๐šฝ ( ๐‘—) ๐šฝ ( ๐‘—) X (๐‘š) ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) , (6.2) ๐‘š=1 where  > Uฬƒ๐šฝ ( ๐‘—) = Uฬƒ (๐‘‘) โŠ— ยท ยท ยท โŠ— Uฬƒ ( ๐‘—+1) โŠ— Uฬƒ ( ๐‘—โˆ’1) โŠ— ยท ยท ยท โŠ— Uฬƒ (1) for ๐‘— โˆˆ [๐‘‘]. 80 Algorithm 6.1: MPCA [14] Require: Training samples X (๐‘š) โˆˆ R๐‘›1 ร—...ร—๐‘›๐‘‘ for ๐‘š โˆˆ [๐‘€]. Ensure: Projection matrices Uฬƒ ( ๐‘—) โˆˆ R๐‘ƒ ๐‘— ร—๐‘› ๐‘— for ๐‘› โˆˆ [๐‘‘]. (1) Center input samples: Xฬƒ๐‘š = X๐‘š โˆ’ Xฬ„. ๐‘€ (2) Initialization: Calculate the eigen-decomposition of ๐šฝ ( ๐‘—)โˆ— = Xฬƒ (๐‘š) Xฬƒ (๐‘š)> . Set Uฬƒ ( ๐‘—) to ร ( ๐‘—) ( ๐‘—) ๐‘š=1 consist of the ๐‘ƒ ๐‘— leading eigenvectors for ๐‘— โˆˆ [๐‘‘]. (3) Local optimization: Calculate Yฬƒ (๐‘š) = Xฬƒ (๐‘š) ร—1 Uฬƒ (1)> ร— ยท ยท ยท ร—๐‘‘ Uฬƒ (๐‘‡)> for ๐‘š โˆˆ [๐‘€]. ๐‘€ k Yฬƒ (๐‘š) k 2๐น . ร Calculate ฮจY0 = ๐‘š=1 for ๐‘˜ = 1, . . . , ๐พ do for ๐‘— = 1, . . . , ๐‘‘ do Set the matrix Uฬƒ ( ๐‘—) to consist of the ๐‘ƒ ๐‘— leading eigenvectors of ๐šฝ ( ๐‘—) defined in (6.2). end for Calculate Yฬƒ๐‘š for ๐‘š โˆˆ [๐‘€], and ฮจY๐‘˜ . If ฮจY๐‘˜ โˆ’ ฮจY๐‘˜โˆ’1 < ๐œ‚, break, and go to step 4. end for (4) Projection: Obtain the feature tensors as Y (๐‘š) = X (๐‘š) ร—1 Uฬƒ (1)> ร— ยท ยท ยท ร—๐‘‘ Uฬƒ (๐‘‡)> for ๐‘š โˆˆ [๐‘€]. Proof The Euclidean norm of a tensor is equal to the Frobenius norm of any of its unfoldings. Therefore, the total scatter of the projected samples can be written as ๐‘€ โˆ‘๏ธ โˆ‘๏ธ๐‘€ ฮจY = kY (๐‘š) โˆ’ Yฬ„ k = 2 kY (๐‘š)( ๐‘—) โˆ’ Yฬ„ ( ๐‘—) k 2๐น ๐‘š=1 ๐‘š=1 ๐‘€ โˆ‘๏ธ   = k Uฬƒ ( ๐‘—) X (๐‘š)( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) Uฬƒ๐šฝ ( ๐‘—) k 2๐น ๐‘š=1 ๐‘€ โˆ‘๏ธ     >  = trace Uฬƒ ( ๐‘—) X (๐‘š) ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) Uฬƒ๐šฝ ( ๐‘—) Uฬƒ>๐šฝ ( ๐‘—) X (๐‘š) ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) Uฬƒ ( ๐‘—)> ๐‘š=1 ๐‘€  ! โˆ‘๏ธ   > = trace Uฬƒ ( ๐‘—) X (๐‘š) ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) Uฬƒ๐šฝ ( ๐‘—) Uฬƒ> ๐šฝ ( ๐‘—) X (๐‘š) ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) Uฬƒ ( ๐‘—)> ๐‘š=1   = trace Uฬƒ ( ๐‘—) ๐šฝ ( ๐‘—) Uฬƒ ( ๐‘—)> , which turns into an eigenvalue problem when ฮจY is to be maximized. 81 6.1.2 Full Projection ( ๐‘—) ( ๐‘—)> If ๐‘ƒ ๐‘— = ๐‘› ๐‘— for ๐‘— โˆˆ [๐‘‘], it is easy to show that Uฬƒ๐šฝ ( ๐‘—) Uฬƒ๐šฝ ( ๐‘—) = I, then in the optimal case, ๐šฝ ( ๐‘—)โˆ— = ๐‘€   > X ((๐‘š) (๐‘š) (๐‘š) (๐‘š) . In this case, U ( ๐‘—)โˆ— is the optimal solution for Uฬƒ ( ๐‘—) , and consists ร ๐‘—) โˆ’ Xฬ„ ( ๐‘—) X ( ๐‘—) โˆ’ Xฬ„ ( ๐‘—) ๐‘š=1 of the eigenvectors of ๐šฝ ( ๐‘—)โˆ— . The total scatter tensor Yvar โˆ— of the full projection is defined as ๐‘€  โˆ‘๏ธ 2 โˆ— (๐‘š)โˆ— โˆ— Yvar = Y โˆ’ Yฬ„ , (6.3) ๐‘š=1 where the exponentiation is done component-wise, Y (๐‘š)โˆ— is the full projection of the ๐‘š th sample X (๐‘š) and Yฬ„ โˆ— is the mean of the fully projected samples. The following two observations can be made. ( ๐‘—)โˆ— โˆ— , where (a) The ๐‘– th๐‘— mode- ๐‘— eigenvalue ๐œ†๐‘– ๐‘— is the sum of all entries of the ๐‘– th๐‘— mode- ๐‘— slice of Yvar ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ] for ๐‘— โˆˆ [๐‘‘]. (b) Every sample tensor X (๐‘š) can be represented as an expansion in the subspace spanned by rank-1 tensors, called eigentensors. 
This is shown by ๐‘ƒ1 โˆ‘๏ธ โˆ‘๏ธ ๐‘ƒ2 โˆ‘๏ธ๐‘ƒ๐‘‘ X (๐‘š) โ‰ˆ ยทยทยท Y๐‘–1(๐‘š)โˆ— (1) ,๐‘–2 ,...,๐‘– ๐‘‘ uฬƒ๐‘–1 uฬƒ๐‘–(2) 2 ยทยทยท uฬƒ๐‘–(๐‘‘) ๐‘‘ , (6.4) ๐‘–1 =1 ๐‘–2 =1 ๐‘– ๐‘‘ =1 ( ๐‘—) ( ๐‘—) where uฬƒ๐‘– ๐‘— is the ๐‘– th๐‘— column of Uฬƒ๐‘– ๐‘— for ๐‘– ๐‘— โˆˆ [๐‘ƒ ๐‘— ] and ๐‘— โˆˆ [๐‘‘]. 6.1.3 Initialization by Full Projection Truncation (FPT) To initialize the MPCA algorithm, assume the first ๐‘ƒ ๐‘— < ๐‘› ๐‘— leading eigenvectors of ๐šฝ (๐‘›)โˆ— form Uฬƒ ( ๐‘—) . It is shown in [14] that if a nonzero eigenvalue is truncated in one mode, the eigenvalues in all other modes tend to decrease in magnitude, and therefore, the optimality of (6.1) is affected negatively. Thus, the eigen-decomposition needs to be updated in all other modes. If the total scatter of the projected samples in FPT is denoted by ฮจY0 , then the loss of variations due to FPT is 82 bounded by ๐‘›๐‘— โˆ‘๏ธ ๐‘‘ โˆ‘๏ธ ๐‘›๐‘— โˆ‘๏ธ ( ๐‘—)โˆ— ( ๐‘—)โˆ— max ๐œ†๐‘– ๐‘— โ‰ค ฮจX โˆ’ ฮจY0 โ‰ค ๐œ†๐‘– ๐‘— . (6.5) ๐‘— ๐‘– ๐‘— =๐‘ƒ ๐‘— +1 ๐‘—=1 ๐‘– ๐‘— =๐‘ƒ ๐‘— +1 6.1.4 Determination of subspace Dimensions ๐‘ƒ ๐‘— A simple yet commonly used method to choose ๐‘ƒ ๐‘— is to pick the minimum value ๐‘ƒ ๐‘— such that ๐‘ƒ๐‘— ร ( ๐‘—)โˆ— ๐œ†๐‘– ๐‘— ๐‘– ๐‘— =1 ๐‘›๐‘— โ‰ฅ ๐œ, (6.6) ร ( ๐‘—)โˆ— ๐œ†๐‘– ๐‘— ๐‘– ๐‘— =1 where ๐œ is a predetermined threshold set by the user. 6.1.5 Feature Extraction and Classification The set of projection matrices {Uฬƒ ( ๐‘—) } ๐‘‘๐‘—=1 obtained using the training samples can be employed to project any new sample onto the same subspace. The projected samples (training or test) can be either directly used in classification, or they can undergo further processing. In LDA1, the elements of the projected samples Y (๐‘š) are vectorized to yield y (๐‘š) and are ordered according to the Fisher score to maximize the between-class to in-class discriminability. Next, a predetermined number of features with the highest Fisher scores are selected for classification. Or, to maximize between-class to in-class discriminability, y (๐‘š) can be projected onto the LDA space by z (๐‘š) = V> lda y (๐‘š) , where for ๐ถ classes, Vlda consists of ๐‘› ๐‘ง โ‰ค ๐ถ โˆ’ 1 of the leading generalized eigenvectors of ร ๐‘€  (๐‘š)  > (๐‘š) (๐‘š) (๐‘š) and S๐ต = ๐ถ๐‘=1 ๐‘๐‘ ( yฬ„๐‘ โˆ’ yฬ„) ( yฬ„๐‘ โˆ’ yฬ„) > . Here, ๐‘๐‘ is the ร S๐‘Š = ๐‘š=1 y โˆ’ yฬ„๐‘ y โˆ’ yฬ„๐‘ number of samples in class ๐‘ and yฬ„๐‘(๐‘š) denotes the in-class mean for the ๐‘š th training sample, i.e., 1 Linear Discriminant Analysis 83 yฬ„๐‘(๐‘š) โˆˆ {yฬ„1 , . . . , yฬ„๐ถ } where yฬ„๐‘ is the mean of class ๐‘. In fact, the LDA projection matrix is obtained by solving |V> S๐ต V|   Vlda = arg max = v 1 . . . v ๐‘› . V |V> S๐‘Š V| ๐‘ง Now, z (๐‘š) is used for classification as the projected training sample, and Vlda is applied to any vectorized projected test data y to further project it onto the LDA space according to z = V> lda y. 6.2 Comparison between PCA, MPCA and MPS Here, the methods PCA, MPCA and MPS have been compared in terms of computational complex- ity measured by training time, and classification Success Rate (CSR). The data were first projected onto the subspace, and then the features were either directly used in nearest neighbor (1NN) clas- sification, or they were further sorted in descending order according to the Fisher score and 100 features were selected to be used in 1NN. 
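The LDA stage of Section 6.1.5 reduces to a generalized symmetric eigenvalue problem; a minimal sketch is given below (an added illustration assuming the vectorized projected features y^(m) are stacked as the rows of Y and that S_W is nonsingular — in practice a small multiple of the identity can be added to S_W):

import numpy as np
from scipy.linalg import eigh

def lda_projection(Y, labels, n_z):
    # Columns of the returned matrix are the leading generalized eigenvectors
    # of S_B with respect to S_W, i.e. the LDA projection V_lda.
    classes = np.unique(labels)
    mean_all = Y.mean(axis=0)
    dim = Y.shape[1]
    S_W = np.zeros((dim, dim))
    S_B = np.zeros((dim, dim))
    for c in classes:
        Yc = Y[labels == c]
        mc = Yc.mean(axis=0)
        S_W += (Yc - mc).T @ (Yc - mc)                     # within-class scatter
        S_B += len(Yc) * np.outer(mc - mean_all, mc - mean_all)  # between-class scatter
    w, V = eigh(S_B, S_W)              # solves S_B v = lambda S_W v
    order = np.argsort(w)[::-1][:n_z]  # keep the n_z <= C-1 leading directions
    return V[:, order]

The projected training features are then z^(m) = V_lda^T y^(m), and the same V_lda is applied to any vectorized projected test sample before the final classification step.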
The following data sets were used in the experiments. COIL-100 data set: 7200 images collected from 100 objects taken at 5โ—ฆ pose intervals, creating a 3-mode tensor X โˆˆ R128ร—128ร—7200 [15]. Sample images from this library can be viewed in Figure 6.1. 84 Figure 6.1: Gray-scale sample images of five objects in the COIL-100 database. MRI data: A 4-mode tensor X โˆˆ R240ร—240ร—155ร—51 comprised of 51 MRI samples [1]. A lateral slice of a sample MRI image is shown in Figure 6.2. 50 100 150 200 50 100 150 200 250 Figure 6.2: A lateral slice of a sample MRI image. 85 Training time vs model size. Training time vs model size. 7 40 MPCA (1NN) MPCA (1NN) 6 MPCA (LDA) MPCA (LDA) MPS (1NN) 35 MPS (1NN) 5 MPS (LDA) MPS (LDA) PCA (1NN) PCA (1NN) 30 PCA (LDA) time (s) 4 PCA (LDA) time (s) 3 25 2 20 1 0 15 104 105 106 107 102 104 106 108 1010 Model size Model size (a) COIL-100 (b) MRI Figure 6.3: Training time. (a) COIL-100. (b) MRI. Classification rate vs model size. Classification rate vs model size. 97 70 MPCA (1NN) 96 65 MPCA (LDA) MPS (1NN) 95 MPS (LDA) 60 PCA (1NN) 94 CSR (%) PCA (LDA) CSR (%) 93 55 MPCA (1NN) 92 MPCA (LDA) 50 MPS (1NN) 91 MPS (LDA) 90 PCA (1NN) 45 PCA (LDA) 89 40 104 105 106 107 102 104 106 108 1010 Model size Model size (a) (b) Figure 6.4: Classification Success Rate. (a) COIL-100. (b) MRI. 6.3 Extension of Support Vector Machine to Tensors In this section, 3 tensor-based methods used to extend the regular support vector machine to tensor data are summarized. Consider ๐‘€ ๐‘‘-mode training sample tensors {X (๐‘š) } ๐‘š=1 ๐‘€ โˆˆ R๐‘›1 ร—...ร—๐‘›๐‘‘ with corresponding labels {๐‘ฆ ๐‘š } ๐‘š=1 ๐‘€ โˆˆ {โˆ’1, +1}. 86 6.3.1 Support Tensor Machine The soft-margin Support Tensor Machine for binary classification of data is composed of ๐‘‘ quadratic programming problems with inequality constraints, where the ๐‘— th problem is described as [24]:   1 ร–๐‘‘ ๐‘€ โˆ‘๏ธ ( ๐‘—) min ๐ฝ w ( ๐‘—) , ๐‘ ( ๐‘—) , ๐ƒ ( ๐‘—) = kw ( ๐‘—) k 22 kw (๐‘–) k 22 + ๐ถ ๐œ‰๐‘š ( w ,๐‘ ,๐ƒ ๐‘—) ( ๐‘—) ( ๐‘—) 2 ๐‘–=1 ๐‘š=1 ๐‘–โ‰  ๐‘— ยญ ( ๐‘—)> ยญ (๐‘š) ? ( ๐‘—)> ยฎ (6.7) ยฉ ยฉ ๐‘‘ ยช ยช ( ๐‘—) subject to ๐‘ฆ ๐‘š ยญw ยญX w ยฎ + ๐‘ ( ๐‘—) ยฎ โ‰ฅ 1 โˆ’ ๐œ‰๐‘š ยฎ ยญ ยญ ยฎ ยฎ ๐‘–=1 ยซ ยซ ๐‘–โ‰  ๐‘— ยฌ ยฌ ( ๐‘—) ๐œ‰๐‘š โ‰ฅ 0, ๐‘š โˆˆ [๐‘€], for ๐‘— โˆˆ [๐‘‘], where w ( ๐‘—) โˆˆ R๐‘› ๐‘— is the normal to the ๐‘— th hyperplane corresponding to the ๐‘— th ( ๐‘—) mode, ๐‘ ( ๐‘—) is the bias, ๐œ‰๐‘š is error of the ๐‘š th training sample, and ๐ถ is the trade-off between the classification error and the amount of margin violation. These ๐‘‘ optimization problems have no closed-form solution, and need to be solved iteratively using the alternating projection algorithm. All ๐‘‘ normal vectors are randomly initialized. In each iteration, for each mode ๐‘—, {w (๐‘˜) } ๐‘˜โ‰  ๐‘— are fixed and (6.7) is solved for w ( ๐‘—) . Iterations continue until convergence is reached. Convergence criterion considering iterations ๐‘ก and ๐‘ก โˆ’ 1 is set as ( ๐‘—)> ( ๐‘—) ๐‘‘ ! โˆ‘๏ธ w๐‘ก w๐‘กโˆ’1 ( ๐‘—) โˆ’ 1 โ‰ค ๐œ€, ๐‘—=1 kw๐‘ก k 2 for some ๐œ€. Once the STM model has been solved, the binary classifier will determine the class of a test sample X based on the decision rule ? ๐‘‘ ! ๐‘ฆ(X) = sign X w ( ๐‘—)> + ๐‘ . (6.8) ๐‘–=1 ( ๐‘—) Further, assume that ๐œ‰๐‘š = max {๐œ‰๐‘š }, and that the ๐‘‘ normal vectors w ( ๐‘—) form a rank-1 tensor ๐‘— โˆˆ[๐‘‘] W= w (1) w (2) ยทยทยท (๐‘‘) w [6]. 
In this case, the following can be observed. โˆ‘๏ธ โˆ‘๏ธ ร–๐‘‘ kW k 2 = hW, Wi = W๐‘–21 ,...,๐‘– ๐‘‘ = w๐‘–(1)2 1 . . . w๐‘–(๐‘‘)2 ๐‘‘ = kw ( ๐‘—) k 22 , (6.9) ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘—=1 87 and ยฉ ?๐‘‘ ยช ? ๐‘‘ โˆ‘๏ธ D E ( ๐‘—)> ยญ (๐‘š) ( ๐‘—)> ยฎ (๐‘š) ( ๐‘—)> (๐‘š) (1) (๐‘‘) (๐‘š) w ยญX w ยฎ=X w = X๐‘–1 ,...,๐‘– ๐‘‘ w๐‘–1 . . . w๐‘– ๐‘‘ = X , W , (6.10) ยญ ยฎ ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘–=1 ๐‘–=1 ยซ ๐‘–โ‰  ๐‘— ยฌ where (2.11) and (2.12) have been used in the penultimate equality. This result can be used to write the problem as ๐‘€ 1 โˆ‘๏ธ min ๐ฝ (W, ๐‘, ๐ƒ) = kW k 2 + ๐ถ ๐œ‰๐‘š W,๐‘,๐ƒ 2 ๐‘š=1 D E  (6.11) subject to ๐‘ฆ ๐‘š X (๐‘š) , W + ๐‘ โ‰ฅ 1 โˆ’ ๐œ‰๐‘š ๐œ‰๐‘š โ‰ฅ 0, By forming the Lagrangian function with Lagrange multipliers ๐›ผ and ๐œ†, and taking partial ๐›ผ๐‘š ๐‘ฆ ๐‘š X (๐‘š) , ๐‘š=1 ร๐‘€ ร๐‘€ derivatives with respect to W, ๐‘ and ๐œ‰๐‘š , we obtain W = ๐‘š=1 ๐›ผ๐‘š ๐‘ฆ ๐‘š = 0 and ๐›ผ๐‘š + ๐œ† ๐‘š = ๐ถ. Then, the dual problem can be written in the following form. 1 max = 1> ๐œถ โˆ’ ๐œถ> S๐œถ ๐œถ 2 ๐‘€ โˆ‘๏ธ subject to ๐›ผ๐‘š ๐‘ฆ ๐‘š = 0 (6.12) ๐‘š=1 0 โ‰ค ๐›ผ๐‘š โ‰ค ๐ถ, ๐‘š โˆˆ [๐‘€], where S ๐‘๐‘ž = ๐‘ฆ ๐‘ ๐‘ฆ ๐‘ž X ( ๐‘) , X (๐‘ž) . Therefore, after solving the model, the binary classifier will be ๐‘ฆ(X) = sign (hX, Wi + ๐‘) , (6.13) for a test tensor X. This procedure is equivalent to solving the regular soft-margin SVM if the tensors are first vectorized. However, for large tensors vectorized, SVM suffers significantly from the curse of dimensionality and also small training sample count compared to the dimensionality of each sample making it susceptible to new data. 88 6.3.2 Support Higher-order Tensor Machine Support Higher-order Tensor Machine (abbreviated to SHTM) assumes that the data samples admit low-rank CP decompositions, which allows for another form of the objective function in the dual problem. As can be expected, in obtaining the CPD of data, a sketched least squares problem can be solved in each mode of the training and test samples to obtain approximate but more efficient versions of the corresponding CP decompositions. Let the CPD of X ( ๐‘) and X (๐‘ž) be (1) (๐‘‘) ( ๐‘—) ( ๐‘—) X ( ๐‘) โ‰ˆ ๐‘Ÿ๐‘˜=1 x (1) ยท ยท ยท x (๐‘‘) (๐‘ž) โ‰ˆ ร๐‘Ÿ ร ๐‘๐‘˜ ๐‘๐‘˜ and X ๐‘˜=1 x๐‘ž๐‘˜ ยท ยท ยท x๐‘ž๐‘˜ , where x๐‘ž๐‘˜ and x ๐‘๐‘˜ represent the ๐‘˜ th column in the ๐‘— th factor matrix of X ( ๐‘) and X (๐‘ž) , respectively. Therefore, the elements of S in (6.12) will be D E ๐‘Ÿ D โˆ‘๏ธ E S ๐‘๐‘ž = ๐‘ฆ ๐‘ ๐‘ฆ ๐‘ž X ( ๐‘) ,X (๐‘ž) = ๐‘ฆ ๐‘ ๐‘ฆ๐‘ž x (1) ๐‘๐‘˜ ยทยทยท x (๐‘‘) ๐‘๐‘˜ , x๐‘žโ„Ž (1) ยทยทยท x๐‘žโ„Ž (๐‘‘) ๐‘˜,โ„Ž=1 ๐‘Ÿ โˆ‘๏ธ โˆ‘๏ธ     = ๐‘ฆ ๐‘ ๐‘ฆ๐‘ž x (1) ๐‘๐‘˜ ยทยทยท x (๐‘‘) ๐‘๐‘˜ (1) x๐‘žโ„Ž ยทยทยท (๐‘‘) x๐‘žโ„Ž ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘˜,โ„Ž=1 ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘Ÿ ร– โˆ‘๏ธ ๐‘‘ D E ( ๐‘—) ( ๐‘—) = ๐‘ฆ ๐‘ ๐‘ฆ๐‘ž x ๐‘๐‘˜ , x๐‘žโ„Ž . ๐‘˜,โ„Ž=1 ๐‘—=1 Now, (6.12) can be solved using Sequential Minimal Optimization [11] in one iteration. 
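The key computational point above is that S never requires forming the full tensors: it can be assembled directly from the CP factor matrices. A minimal sketch follows (added for illustration; `factors[p]` is assumed to hold the d factor matrices, one column per rank-1 component, of the p-th training sample):

import numpy as np

def shtm_gram(factors, y):
    # Build S of (6.12) from CP factors only:
    #   S_pq = y_p * y_q * sum_{k,h} prod_j < x_pk^(j), x_qh^(j) >.
    M = len(factors)
    S = np.zeros((M, M))
    for p in range(M):
        for q in range(M):
            # G[k, h] = prod_j  < x_pk^(j), x_qh^(j) >
            G = np.ones((factors[p][0].shape[1], factors[q][0].shape[1]))
            for Fp, Fq in zip(factors[p], factors[q]):
                G *= Fp.T @ Fq
            S[p, q] = y[p] * y[q] * G.sum()
    return S

The resulting S can then be handed to any standard quadratic programming or SMO solver for (6.12).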
The binary classifier for a test sample X โ‰ˆ ๐‘Ÿโ„Ž=1 xโ„Ž(1) ยท ยท ยท xโ„Ž(๐‘‘) will decide the class based on ร ๐‘€ ๐‘Ÿ ๐‘Ÿ ๐‘‘ D E ยฉโˆ‘๏ธ โˆ‘๏ธ โˆ‘๏ธ ร– ( ๐‘—) ( ๐‘—) ๐‘ฆ(X) = sign (hX, Wi + ๐‘) = sign ยญ x๐‘š๐‘˜ , xโ„Ž + ๐‘ ยฎ , (6.14) ยช ๐›ผ๐‘š ๐‘ฆ ๐‘š ยซ๐‘š=1 ๐‘˜=1 โ„Ž=1 ๐‘—=1 ยฌ which clearly shows that the curse of dimensionality will not be an issue as only the much smaller factors of the CPD of data are incorporated in the classification operation. 6.3.3 Kernelized Support Tensor Machine Aside from the low-rank structure that is assumed for the weight tensor W and the data samples, STM and SHTM are only taking into account the multilinear structure in tensors, and are missing any nonlinearities that might exist in the data. For this reason, STM and SHTM will yield 89 suboptimal results. A kernelized approach can lead to improved performance by capturing the existing nonlinearities in the data. In [7], the primal problem for Kernelized Support Tensor Machine (KSTM) is stated as arg min ๐ฟ (๐‘ฆ, hW, Xi + ๐‘) + ๐‘ƒ (W) + ฮฉ (X) , (6.15) W,๐‘ where ๐ฟ is a loss function, ๐‘ƒ (W) is a penalty function, and ฮฉ (X) is a specific constraint imposed on the training samples. Before going into details about the terms in (6.15), the tensor Reproducing Kernel Hilbert Space needs to be defined. Note: For a domain D, a reproducing kernel ๐œ… : D ร— D โ†’ R is a kernel function associated with a feature map ๐œ™ : D โ†’ RD , where RD denotes the space of functions from D to R, with the property that any function ๐‘“ โˆˆ RD can be reproduced pointwise by calculating the inner product of the kernel and ๐‘“ , i.e., ๐‘“ (๐‘ฅ) = h ๐‘“ (ยท), ๐œ…(ยท, ๐‘ฅ)i for all ๐‘ฅ โˆˆ D, where ๐œ…(ยท, ๐‘ฅ) = ๐œ™(๐‘ฅ). A direct consequence of this property is that any inner product defined on RD can be calculated by the kernel as h๐œ™(๐‘ฅ 1 ), ๐œ™(๐‘ฅ 2 )i = h๐œ…(ยท, ๐‘ฅ1 ), ๐œ…(ยท, ๐‘ฅ2 )i = ๐œ…(๐‘ฅ 2 , ๐‘ฅ1 ) = ๐œ…(๐‘ฅ 1 , ๐‘ฅ2 ) for all ๐‘ฅ 1 , ๐‘ฅ2 โˆˆ D. Definition 6.3.1 Tensor Product Reproducing Kernel Hilbert Space   For ๐‘— โˆˆ [๐‘‘], let H ๐‘— , h., .i ๐‘— , ๐œ… ( ๐‘—) be a reproducing kernel Hilbert space (RKHS) of functions on a set S ๐‘— with a reproducing kernel ๐œ… ( ๐‘—) : S ๐‘— ร— S ๐‘— โ†’ R and the inner product operator hยท, ยทi ๐‘— . The space H = H1 ยทยทยท H๐‘‘ is called a tensor product RKHS of functions on the domain   S := S1 ร— ยท ยท ยท ร— S๐‘‘ . In particular, assume that ๐‘ฅ = ๐‘ฅ (1) , . . . , ๐‘ฅ (๐‘‘) โˆˆ S is a tuple. Let the tensor product space formed by the linear combinations of the functions ๐‘“ ( ๐‘—) for ๐‘— โˆˆ [๐‘‘] be defined as ร– ๐‘‘   ๐‘“ (1) ยทยทยท ๐‘“ (๐‘‘) : ๐‘ฅ โ†ฆโ†’ ๐‘“ ( ๐‘—) ๐‘ฅ ( ๐‘—) , ๐‘“ ( ๐‘—) โˆˆ H ๐‘— ๐‘—=1 Then for a multi-index ๐’Œ = (๐‘˜ 1 , . . . , ๐‘˜ ๐‘‘ ), it holds that โˆ‘๏ธ   โˆ‘๏ธ ร–๐‘‘   โˆ‘๏ธ ร–๐‘‘ D E ( ๐‘—) ( ๐‘—) ( ๐‘—) G๐’Œ ๐‘“ ๐‘˜(1) 1 ยทยทยท ๐‘“ ๐‘˜(๐‘‘) ๐‘‘ (๐‘ฅ) = G ๐’Œ ๐‘“ ๐‘˜๐‘— ๐‘ฅ ( ๐‘—) = G ๐’Œ ๐‘“ ๐‘˜๐‘— , ๐‘˜ ๐‘ฅ , (6.16) ๐‘— ๐’Œ ๐’Œ ๐‘—=1 ๐’Œ ๐‘—=1 90     ( ๐‘—) where G๐’Œ is the combination coefficient, and ๐‘˜ ๐‘ฅ is the function ๐‘˜ ( ๐‘—) ยท, ๐‘ฅ ( ๐‘—) : ๐‘ก โ†ฆโ†’ ๐‘˜ ( ๐‘—) ๐‘ก, ๐‘ฅ ( ๐‘—) in   D  E ( ๐‘—) ( ๐‘—) ( ๐‘—) the sense that ๐‘“ ๐‘˜ ๐‘— ๐‘ฅ ( ๐‘—) = ๐‘“ ๐‘˜ ๐‘— (ยท) , ๐‘˜ ๐‘ฅ ยท, ๐‘ฅ ( ๐‘—) . 
๐‘—   ( ๐‘—) ( ๐‘—) ๐‘ฅ ( ๐‘—)  By looking closely at (6.16), it is observed that if in a special case, we let ๐‘“๐‘˜ ๐‘— = u๐‘˜ ๐‘— ๐‘– ๐‘— for ๐‘˜ ๐‘— โˆˆ [๐‘Ÿ ๐‘— ] and ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ] for ๐‘— โˆˆ [๐‘‘], then      ๐‘“ ๐‘˜(1) 1 ยทยทยท ๐‘“ ๐‘˜(๐‘‘) ๐‘‘ ๐‘ฅ (1) , . . . , ๐‘ฅ (๐‘‘) = u ๐‘˜(1) 1 ยทยทยท u ๐‘˜(๐‘‘)๐‘‘ (๐‘–1 , . . . , ๐‘– ๐‘‘ ) = u ๐‘˜(1) 1 (๐‘–1 ) . . . u ๐‘˜(๐‘‘) ๐‘‘ (๐‘– ๐‘‘ ) ( ๐‘—)  and (6.16) presents a general form of the Tucker decomposition, in a kernelized form, where u ๐‘˜ ๐‘— ๐‘– ๐‘— ( ๐‘—) can be represented as the inner product of a kernel and u ๐‘˜ ๐‘— , and G plays the role of the core tensor. Therefore (6.16) represents a kernelized tensor factorization. Treating X โˆˆ R๐‘›1 ร—...ร—๐‘›๐‘‘ as an element of the tensor product RKHS H , assume that it has a low-rank structure in H , such that โˆ‘๏ธ   ? ๐‘‘   ( ๐‘—) arg min kX โˆ’ G๐‘˜ 1 ,...,๐‘˜ ๐‘‘ ๐‘‘ ๐‘—=1 u ๐‘˜ ๐‘— k 2 = arg min kX โˆ’ G K ( ๐‘—) U ( ๐‘—) k 2 , G,{U ( ๐‘—) } ๐‘‘๐‘—=1 ๐‘˜ 1 ,...,๐‘˜ ๐‘‘ G,{U ( ๐‘—) } ๐‘‘๐‘—=1 ๐‘—=1 (6.17) ( ๐‘—) where K ( ๐‘—) โˆˆ R๐‘› ๐‘— ร—๐‘› ๐‘— is a symmetric kernel matrix whose ๐‘– th๐‘— row/column is ๐‘˜ ๐‘ฅ  ยท, ๐‘– ๐‘— for ๐‘– ๐‘— โˆˆ [๐‘› ๐‘— ], ( ๐‘—) and U ( ๐‘—) โˆˆ R๐‘› ๐‘— ร—๐‘Ÿ ๐‘— has u ๐‘˜ ๐‘— as its ๐‘˜ th๐‘— column for ๐‘˜ ๐‘— โˆˆ [๐‘Ÿ ๐‘— ]. To get from the left-hand side to the right-hand side, one can use the reproducing property of the kernel, i.e.,   ๐‘‘ ๐‘‘ D ๐‘‘ ( ๐‘—) ร– ( ๐‘—)  ร– ( ๐‘—) ( ๐‘—) E ๐‘—=1 u ๐‘˜ ๐‘— = u๐‘˜ ๐‘— ๐‘–๐‘— = u๐‘˜ ๐‘— , ๐‘˜ ๐‘ฅ ยท, ๐‘– ๐‘— , ๐‘–1 ,...,๐‘– ๐‘‘ ๐‘—=1 ๐‘—=1 In a kernelized CP factorization (KCP), the objective function of (6.17) will be the same except that G will be a diagonal tensor, and if the elements of its superdiagonal are absorbed into the factor matrices, then the primal model of KSTM for ๐‘€ training samples {X (๐‘š) , ๐‘ฆ ๐‘š } ๐‘š=1 ๐‘€ is stated as โˆ‘๏ธ๐‘€ ? ๐‘‘   ( ๐‘—) min ๐›พ kX (๐‘š) โˆ’ I K ( ๐‘—) U๐‘š k2 ( ๐‘—) {U๐‘š } ๐‘‘๐‘—=1 ,{K ( ๐‘—) } ๐‘‘๐‘—=1 ,W,๐‘ ๐‘š=1 ๐‘—=1 + hW, Wi (6.18) โˆ‘๏ธ๐‘€ h D E i +๐ถ 1 โˆ’ ๐‘ฆ๐‘š W, Xฬ‚ (๐‘š) + ๐‘ , + ๐‘š=1 91 where I denotes the identity tensor and [1 โˆ’ ๐‘ฅ] + = max (0, 1 โˆ’ ๐‘ฅ) ๐‘ for ๐‘ = 1 or ๐‘ = 2. Also, Xฬ‚ (๐‘š) ( ๐‘—) is the CP reconstruction of X (๐‘š) using {U๐‘š } ๐‘‘๐‘—=1 . Comparing (6.18) with (6.15), the first term is ฮฉ (X) representing the total KCP reconstruction error of training samples penalized by ๐›พ, the second term is ๐‘ƒ (W), and the third term corresponds to ๐ฟ (๐‘ฆ, hW, Xi + ๐‘). All training samples are sharing the same set of kernel matrices for a specific mode ๐‘— โˆˆ [๐‘‘] which makes characterization of tensor data possible by taking into account both common and discriminative features. Solving (6.18) in the dual domain is very complicated due to the inherent coupling between the weight ( ๐‘—) tensor W and factor matrices {U๐‘š } ๐‘‘๐‘—=1 . The kernel trick is used to implicitly capture the nonlinear structures in the data. If W is replaced with a function โˆ‘๏ธ๐‘€ ๐‘“ (ยท) = ๐›ฝ๐‘š ๐œ…(ยท, Xฬ‚ (๐‘š) ), ๐‘š=1 represented as the linear combination of a kernel function ๐œ…(ยท, Xฬ‚ (๐‘š) ) for ๐‘€ reconstructed training data samples, then (6.18) can be transformed to ๐‘€ โˆ‘๏ธ ? 
๐‘‘   ( ๐‘—) min ๐›พ kX (๐‘š) โˆ’ I K ( ๐‘—) U๐‘š k2 ( ๐‘—) {U๐‘š } ๐‘‘๐‘—=1 ,{K ( ๐‘—) } ๐‘‘๐‘—=1 ,๐œท,๐‘ ๐‘š=1 ๐‘—=1 โˆ‘๏ธ๐‘€ +๐œ† ๐›ฝ๐‘– ๐›ฝ ๐‘— ๐œ…( Xฬ‚ (๐‘–) , Xฬ‚ ( ๐‘—) ) (6.19) ๐‘–, ๐‘—=1 ๐‘€ ๏ฃฎ๏ฃฏ โˆ‘๏ธ ๐‘€ โˆ‘๏ธ ๏ฃน (๐‘š) ( ๐‘—) ๏ฃบ + ๏ฃฏ1 โˆ’ ๐‘ฆ ๐‘š ยญ ๐›ฝ ๐‘— ๐œ…( Xฬ‚ , Xฬ‚ ) + ๐‘ ยฎ๏ฃบ๏ฃบ , ยฉ ยช ๏ฃฏ ๐‘š=1 ๏ฃฏ ยซ ๐‘—=1 ๏ฃบ ยฌ๏ฃป + ๏ฃฐ where ๐œ† = 1/๐ถ is the weight between the loss function and the margin, and ๐›พ controls the tradeoff between discriminative components and reconstruction error. Letting Kฬ‚๐‘–, ๐‘— = ๐œ…( Xฬ‚ (๐‘–) , Xฬ‚ ( ๐‘—) ) denote the elements of the symmetric kernel matrix Kฬ‚, the so called dual form of (6.18) is obtained as ๐‘€ โˆ‘๏ธ ? ๐‘‘   ( ๐‘—) min ๐›พ kX (๐‘š) โˆ’ I K ( ๐‘—) U๐‘š k2 ( ๐‘—) {U๐‘š } ๐‘‘๐‘—=1 ,{K ( ๐‘—) } ๐‘‘๐‘—=1 ,๐œท,๐‘ ๐‘š=1 ๐‘—=1 + ๐œ†๐œท> Kฬ‚๐œท (6.20) โˆ‘๏ธ๐‘€ h  i + 1 โˆ’ ๐‘ฆ ๐‘š kฬ‚> ๐‘š ๐œท + ๐‘ , + ๐‘š=1 92 where kฬ‚๐‘š denotes the ๐‘š th row/column of Kฬ‚. This objective function is non-convex and finding a global minimum might not be possible. Therefore, an iterative scheme has been derived in [7] by alternatively finding the local minimum of the objective function for each variable by fixing ( ๐‘—) the others. This is done by updating K ( ๐‘—) , ๐œท, U๐‘š and ๐‘ consecutively in each iteration. Let 2 ๐‘ง ๐‘š = kฬ‚>๐‘š ๐œท + ๐‘ in the following steps whenever it is used. Also, let [1 โˆ’ ๐‘ฅ] + = max (0, 1 โˆ’ ๐‘ฅ) . At each iteration, solving (6.20) includes the following steps. (a) Update K ( ๐‘—) : Since there is no supervised information involving K ( ๐‘—) , the optimization technique in CPD is used by finding the solution to the following system of equations for ๐‘— โˆˆ [๐‘‘]. ๐‘€  โˆ‘๏ธ  ๐‘€  โˆ‘๏ธ  ( ๐‘—) (โˆ’ ๐‘—) (โˆ’ ๐‘—) K ( ๐‘—) U๐‘š W๐‘š = X (๐‘š) V ( ๐‘—) ๐‘š ๐‘š=1 ๐‘š=1   (โˆ’ ๐‘—) ( ๐‘—) (โˆ’ ๐‘—) where X (๐‘š) ร‡1 ( ๐‘—) is the mode- ๐‘— unfolding of X (๐‘š) , V๐‘š = โ„“=๐‘‘ K ( ๐‘—) U๐‘š and W๐‘š = โ„“โ‰  ๐‘—  > (โˆ’ ๐‘—) (โˆ’ ๐‘—) V๐‘š V๐‘š . (b) Update ๐œท: A data point is a support vector (support tensor, in fact) if ๐‘ฆ ๐‘š ๐‘ง ๐‘š < 1 for that point, i.e., if the loss is nonzero for it. After reordering the training samples such that the first ๐‘€๐‘  samples are support tensors, ๐œท can be found by setting the first-order gradient of the objective function to zero, i.e.,    5 ๐œท = 2 ๐œ†Kฬ‚๐œท + Kฬ‚I0 Kฬ‚๐œท โˆ’ y + ๐‘1 = 0, ๏ฃฎ ๏ฃน ๏ฃฏ I๐‘€ 0 ๏ฃบ where I0 = ๏ฃฏ ๏ฃฏ ๐‘  ๏ฃบ and y is the vector of labels. ๏ฃบ ๏ฃฏ 0 0 ๏ฃบ ๏ฃฏ ๏ฃบ ๏ฃฐ ๏ฃป ( ๐‘—) ( ๐‘—) (c) Update U๐‘š : The kernel function ๐œ… is inherently coupled with U๐‘š , which underlines the importance of choosing an appropriate kernel when calculating the gradient of the objective ( ๐‘—) function with respect to U๐‘š . Before that, it should be made clear how the kernel ๐œ… is related ( ๐‘—) to U๐‘š . Mapping the tensors into a tensor Hilbert space of higher dimension, one can retain 93 the multilinear structure of data as well as capture existing nonlinearities. The mapping function is defined as โˆ‘๏ธ๐‘Ÿ โˆ‘๏ธ๐‘Ÿ   ๐‘‘ ( ๐‘—) ๐‘‘ ( ๐‘—) ๐œ™: ๐‘—=1 u ๐‘˜ โ†’ ๐‘—=1 ๐œ™ u๐‘˜ , ๐‘˜=1 ๐‘˜=1 which assumes mapping the tensor data into a tensor Hilbert space and then performing CPD. In other words, the feature map ๐œ™ acts on the individual factors of CPD while retaining the low-rank structure of the data. 
The kernel will now be the standard inner product of tensors in the higher-dimensional space, i.e., one can find a dual structure-preserving kernel function by writing
\[
\begin{aligned}
\kappa\!\left(\widehat{\mathcal{X}}^{(i)},\widehat{\mathcal{X}}^{(j)}\right)
&= \kappa\!\left(\sum_{k=1}^{r}\mathbf{u}_{ik}^{(1)}\circ\cdots\circ\mathbf{u}_{ik}^{(d)},\ \sum_{h=1}^{r}\mathbf{u}_{jh}^{(1)}\circ\cdots\circ\mathbf{u}_{jh}^{(d)}\right)
= \sum_{k,h=1}^{r}\kappa\!\left(\mathbf{u}_{ik}^{(1)}\circ\cdots\circ\mathbf{u}_{ik}^{(d)},\ \mathbf{u}_{jh}^{(1)}\circ\cdots\circ\mathbf{u}_{jh}^{(d)}\right)\\
&= \sum_{k,h=1}^{r}\left\langle\phi\!\left(\mathbf{u}_{ik}^{(1)}\circ\cdots\circ\mathbf{u}_{ik}^{(d)}\right),\ \phi\!\left(\mathbf{u}_{jh}^{(1)}\circ\cdots\circ\mathbf{u}_{jh}^{(d)}\right)\right\rangle
= \sum_{k,h=1}^{r}\left\langle\phi\!\left(\mathbf{u}_{ik}^{(1)}\right)\circ\cdots\circ\phi\!\left(\mathbf{u}_{ik}^{(d)}\right),\ \phi\!\left(\mathbf{u}_{jh}^{(1)}\right)\circ\cdots\circ\phi\!\left(\mathbf{u}_{jh}^{(d)}\right)\right\rangle\\
&= \sum_{k,h=1}^{r}\prod_{\ell=1}^{d}\left\langle\phi\!\left(\mathbf{u}_{ik}^{(\ell)}\right),\ \phi\!\left(\mathbf{u}_{jh}^{(\ell)}\right)\right\rangle
= \sum_{k,h=1}^{r}\prod_{\ell=1}^{d}\kappa\!\left(\mathbf{u}_{ik}^{(\ell)},\ \mathbf{u}_{jh}^{(\ell)}\right),
\end{aligned}\tag{6.21}
\]
implying that, to apply the kernel to two tensors, it suffices to apply it to their factors in their respective CPD representations. The second equality results from the bilinearity of the kernel $\kappa$. Examples of commonly used kernels include $\kappa(\mathbf{x},\mathbf{y}) = \mathbf{x}^{\top}\mathbf{y}$ for the inner product kernel, and $\kappa(\mathbf{x},\mathbf{y}) = \exp\left(-\sigma\|\mathbf{x}-\mathbf{y}\|_2^2\right)$ for the Gaussian kernel, where $\sigma$ controls the width of the kernel.

After choosing a kernel, and thus knowing how it is related to the factor matrices $\mathbf{U}_m^{(j)}$, the gradient of the objective function with respect to $\mathbf{U}_m^{(j)}$ can be calculated and set to zero to solve for the updated $\mathbf{U}_m^{(j)}$. An explicit form of the gradient can be found in [7].

(d) Update $b$: Setting the first-order gradient of the objective function with respect to $b$ to zero yields the solution for $b$, i.e.,
\[
\nabla_b = 2\sum_{m=1}^{M_s}\left(z_m - y_m\right) = 0,
\]
which can be solved since $\boldsymbol{\beta}$ has been estimated and $\widehat{\mathbf{K}}$ is known.

Classification. After solving (6.20) using the training data, the shared kernel matrices $\{\mathbf{K}^{(j)}\}_{j=1}^{d}$ can be utilized to compute the KCP factorization of a test sample $\mathcal{X}$. Computing the CPD of $\mathcal{X}$ yields its CP factor matrices $\{\mathbf{U}^{(j)}\}_{j=1}^{d}$. Letting $\mathbf{K}^{(j)}\mathbf{V}^{(j)} = \mathbf{U}^{(j)}$, and thus $\mathbf{V}^{(j)} = \left(\mathbf{K}^{(j)}\right)^{-1}\mathbf{U}^{(j)}$, the KCP reconstruction of $\mathcal{X}$ is calculated using $\{\mathbf{V}^{(j)}\}_{j=1}^{d}$, i.e., $\widehat{\mathcal{X}} = \mathcal{I}\times_1\mathbf{V}^{(1)}\times_2\cdots\times_d\mathbf{V}^{(d)}$. Finally, the test sample $\mathcal{X}$ can be classified according to
\[
y(\mathcal{X}) = \operatorname{sign}\left(\sum_{m=1}^{M}\beta_m\,\kappa\!\left(\widehat{\mathcal{X}},\widehat{\mathcal{X}}^{(m)}\right) + b\right)
\]
using the reconstructed training samples. In this case, (6.21) can be used to calculate the kernel as
\[
\kappa\!\left(\widehat{\mathcal{X}},\widehat{\mathcal{X}}^{(m)}\right) = \sum_{k,h=1}^{r}\prod_{\ell=1}^{d}\kappa\!\left(\mathbf{v}_k^{(\ell)},\ \mathbf{u}_{mh}^{(\ell)}\right),
\]
instead of dealing with the actual reconstructed data. For the training samples, it is important to note that for $j\in[d]$ and $m\in[M]$, $\mathbf{U}_m^{(j)} = \mathbf{K}^{(j)}\mathbf{U}_m^{(j)}$ according to the reproducing property of $k_x^{(j)}$, which explains why the CP reconstruction of the training data can be used in the kernel instead of their KCP reconstruction, as the KCP and CP reconstructions of the training samples are equivalent. However, the reproducing property of $\mathbf{K}^{(j)}$ will not necessarily hold for test data, and therefore the assumption $\mathbf{K}^{(j)}\mathbf{V}^{(j)} = \mathbf{U}^{(j)}$ is made in this case, where the $\mathbf{U}^{(j)}$ are the corresponding CP factor matrices of the test tensor.
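Because the classification rule depends on the data only through factor-level kernel evaluations, a direct implementation is short. Below is a minimal Python/NumPy sketch of (6.21) and the resulting sign decision, assuming a Gaussian kernel on the factors; the names (cp_kernel, kstm_classify, V_test, U_train_list, beta) are hypothetical, and the sketch is an illustration, not the implementation of [7].

    import numpy as np

    def gaussian_kernel(x, y, sigma=1.0):
        return np.exp(-sigma * np.sum((x - y) ** 2))

    def cp_kernel(factors_a, factors_b, kern=gaussian_kernel):
        # Dual structure-preserving kernel (6.21): sum over pairs of CP components
        # of the product over modes of the factor-level kernel.
        r_a, r_b = factors_a[0].shape[1], factors_b[0].shape[1]
        total = 0.0
        for k in range(r_a):
            for h in range(r_b):
                prod = 1.0
                for A, B in zip(factors_a, factors_b):
                    prod *= kern(A[:, k], B[:, h])
                total += prod
        return total

    def kstm_classify(V_test, U_train_list, beta, b, kern=gaussian_kernel):
        # sign( sum_m beta_m * kappa(X_hat, X_hat^(m)) + b ), using factors only.
        score = sum(b_m * cp_kernel(V_test, U_m, kern)
                    for b_m, U_m in zip(beta, U_train_list))
        return np.sign(score + b)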
6.4 Future Work

Directions for future research based on oblivious subspace embeddings include

• Extending the JL results for tensors with low CP rank to tensors with low Tucker rank without orthogonality constraints. The goal is to obtain theoretical bounds on the deviation of the norms and inner products of randomly projected tensors, as well as lower bounds on the embedding dimensions.

• In settings where data points lie on a low-rank subspace, the following can be investigated.
  – Compressed MPCA. One can use JL embeddings to project tensors before performing MPCA to solve for the principal components in the low-rank subspace.
  – Compressed Tensor SVM. One can use JL embeddings to project tensors with low-rank expansions before SVM classification. In other words, the separating hyperplane can be found in a lower-dimensional subspace.

• Modewise JL in third-order MBPT calculations, where higher-order tensors are formed from lower-order data replicated in certain dimensions. As was seen, in this setting, non-orthogonal tensors sometimes become nearly orthogonal after projection. The role of the repetitive patterns forming the higher-dimensional tensors, as well as sparsity, can be investigated.

APPENDICES

APPENDIX A
USEFUL NOTIONS, DEFINITIONS, AND RELATIONS

In this appendix chapter, some of the terms and mathematical definitions used in derivations throughout the thesis are presented.

A.1 Operations on Rank-1 Tensors

In this section, useful relations are presented that explain how rank-1 tensors are reshaped into vectors or unfoldings, and how $j$-mode products are performed.

A.1.1 Reshaping Rank-1 Tensors

A.1.1.1 Vectorization

For a rank-1 tensor $\mathcal{X} = \mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(d)}$, the vectorized form is obtained according to
\[
\operatorname{vec}(\mathcal{X}) =
\begin{cases}
\bigotimes_{\ell=d}^{1}\mathbf{x}^{(\ell)}, & \text{column-major},\\
\bigotimes_{\ell=1}^{d}\mathbf{x}^{(\ell)}, & \text{row-major}.
\end{cases}\tag{A.1}
\]

A.1.1.2 Mode-$j$ Unfolding

The mode-$j$ unfolding of a rank-1 tensor $\mathcal{X} = \mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(d)}$ can be calculated using
\[
\mathbf{X}_{(j)} = \mathbf{x}^{(j)}\operatorname{vec}\!\left(\mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(j-1)}\circ\mathbf{x}^{(j+1)}\circ\cdots\circ\mathbf{x}^{(d)}\right)^{\top} =
\begin{cases}
\mathbf{x}^{(j)}\left(\mathbf{x}^{(d)}\otimes\cdots\otimes\mathbf{x}^{(j+1)}\otimes\mathbf{x}^{(j-1)}\otimes\cdots\otimes\mathbf{x}^{(1)}\right)^{\top}, & \text{column-major},\\
\mathbf{x}^{(j)}\left(\mathbf{x}^{(1)}\otimes\cdots\otimes\mathbf{x}^{(j-1)}\otimes\mathbf{x}^{(j+1)}\otimes\cdots\otimes\mathbf{x}^{(d)}\right)^{\top}, & \text{row-major}.
\end{cases}\tag{A.2}
\]
In the rest of this section, only the column-major relations will be presented and used, as this has been the case throughout this thesis.

A.1.2 $j$-mode Product

For a rank-1 tensor $\mathcal{X} = \mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(d)}$, the $j$-mode product can be performed by applying the matrix of interest to the mode-$j$ factor $\mathbf{x}^{(j)}$, i.e.,
\[
\mathcal{X}\times_j\mathbf{A}^{(j)} = \mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(j-1)}\circ\left(\mathbf{A}^{(j)}\mathbf{x}^{(j)}\right)\circ\mathbf{x}^{(j+1)}\circ\cdots\circ\mathbf{x}^{(d)}.\tag{A.3}
\]
The proof follows simply by considering the mode-$j$ unfolding:
\[
\left(\mathcal{X}\times_j\mathbf{A}^{(j)}\right)_{(j)} = \mathbf{A}^{(j)}\mathbf{X}_{(j)} = \mathbf{A}^{(j)}\mathbf{x}^{(j)}\operatorname{vec}\!\left(\mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(j-1)}\circ\mathbf{x}^{(j+1)}\circ\cdots\circ\mathbf{x}^{(d)}\right)^{\top}.\tag{A.4}
\]
Reshaping to tensor form, (A.3) is obtained. A more general relation involving all modes can be established through a similar approach. Consider a rank-1 tensor $\mathcal{X} = \mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(d)}$, and let $\mathcal{Y} := \mathcal{X}\times_1\mathbf{A}^{(1)}\times_2\cdots\times_d\mathbf{A}^{(d)}$. Then $\mathcal{Y}$ is also a rank-1 tensor, and
\[
\mathcal{Y} = \left(\mathbf{A}^{(1)}\mathbf{x}^{(1)}\right)\circ\cdots\circ\left(\mathbf{A}^{(d)}\mathbf{x}^{(d)}\right).\tag{A.5}
\]
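The identities (A.1)-(A.5) are easy to verify numerically. The short Python/NumPy sketch below (a check under the column-major conventions used here, with illustrative sizes and names) confirms the vectorization, mode-1 unfolding, and modewise-product relations for a random rank-1 tensor of order three.

    import numpy as np
    from functools import reduce

    # A random rank-1 tensor X = x1 o x2 o x3 (order three for illustration).
    dims = [3, 4, 5]
    xs = [np.random.randn(n) for n in dims]
    X = reduce(np.multiply.outer, xs)

    # (A.1), column-major: vec(X) = x3 kron x2 kron x1.
    assert np.allclose(X.flatten(order='F'), reduce(np.kron, xs[::-1]))

    # (A.2), column-major mode-1 unfolding: X_(1) = x1 (x3 kron x2)^T.
    X1 = X.reshape(dims[0], -1, order='F')
    assert np.allclose(X1, np.outer(xs[0], np.kron(xs[2], xs[1])))

    # (A.3)/(A.5): modewise products act on the individual factors.
    As = [np.random.randn(2, n) for n in dims]
    Y = np.einsum('abc,ia,jb,kc->ijk', X, *As)        # X x_1 A1 x_2 A2 x_3 A3
    Y_rank1 = reduce(np.multiply.outer, [A @ x for A, x in zip(As, xs)])
    assert np.allclose(Y, Y_rank1)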
To show this, the mode-$j$ unfolding of $\mathcal{Y}$ is calculated as follows:
\[
\begin{aligned}
\mathbf{Y}_{(j)} &= \mathbf{A}^{(j)}\mathbf{X}_{(j)}\left(\mathbf{A}^{(d)}\otimes\cdots\otimes\mathbf{A}^{(j+1)}\otimes\mathbf{A}^{(j-1)}\otimes\cdots\otimes\mathbf{A}^{(1)}\right)^{\top}\\
&= \mathbf{A}^{(j)}\mathbf{x}^{(j)}\left(\mathbf{x}^{(d)}\otimes\cdots\otimes\mathbf{x}^{(j+1)}\otimes\mathbf{x}^{(j-1)}\otimes\cdots\otimes\mathbf{x}^{(1)}\right)^{\top}\left(\mathbf{A}^{(d)}\otimes\cdots\otimes\mathbf{A}^{(j+1)}\otimes\mathbf{A}^{(j-1)}\otimes\cdots\otimes\mathbf{A}^{(1)}\right)^{\top}\\
&= \mathbf{A}^{(j)}\mathbf{x}^{(j)}\left[\left(\mathbf{A}^{(d)}\otimes\cdots\otimes\mathbf{A}^{(j+1)}\otimes\mathbf{A}^{(j-1)}\otimes\cdots\otimes\mathbf{A}^{(1)}\right)\left(\mathbf{x}^{(d)}\otimes\cdots\otimes\mathbf{x}^{(j+1)}\otimes\mathbf{x}^{(j-1)}\otimes\cdots\otimes\mathbf{x}^{(1)}\right)\right]^{\top}\\
&= \mathbf{A}^{(j)}\mathbf{x}^{(j)}\left(\mathbf{A}^{(d)}\mathbf{x}^{(d)}\otimes\cdots\otimes\mathbf{A}^{(j+1)}\mathbf{x}^{(j+1)}\otimes\mathbf{A}^{(j-1)}\mathbf{x}^{(j-1)}\otimes\cdots\otimes\mathbf{A}^{(1)}\mathbf{x}^{(1)}\right)^{\top}.
\end{aligned}\tag{A.6}
\]
This is clearly the mode-$j$ unfolding of a rank-1 tensor, which can be reshaped to tensor form to obtain (A.5). To get from the penultimate equality to the last one, the following matrix Kronecker product property was used:
\[
\left(\mathbf{A}^{(1)}\otimes\cdots\otimes\mathbf{A}^{(d)}\right)\left(\mathbf{B}^{(1)}\otimes\cdots\otimes\mathbf{B}^{(d)}\right) = \mathbf{A}^{(1)}\mathbf{B}^{(1)}\otimes\cdots\otimes\mathbf{A}^{(d)}\mathbf{B}^{(d)}.\tag{A.7}
\]

A.1.3 Inner Product

Consider rank-1 tensors $\mathcal{X} = \mathbf{x}^{(1)}\circ\cdots\circ\mathbf{x}^{(d)}$ and $\mathcal{Y} = \mathbf{y}^{(1)}\circ\cdots\circ\mathbf{y}^{(d)}$. To simplify the inner product defined in (2.3), one can use the fact that an element of a rank-1 tensor at index set $(i_1, i_2, \dots, i_d)$ is the product of the individual elements of its factor vectors at indices $i_1, i_2, \dots, i_d$. The inner product is then written as
\[
\langle\mathcal{X},\mathcal{Y}\rangle = \sum_{i_1,\dots,i_d}\mathcal{X}_{i_1,\dots,i_d}\,\mathcal{Y}_{i_1,\dots,i_d}
= \sum_{i_1,\dots,i_d}\mathbf{x}_{i_1}^{(1)}\cdots\mathbf{x}_{i_d}^{(d)}\,\mathbf{y}_{i_1}^{(1)}\cdots\mathbf{y}_{i_d}^{(d)}
= \left(\sum_{i_1}\mathbf{x}_{i_1}^{(1)}\mathbf{y}_{i_1}^{(1)}\right)\cdots\left(\sum_{i_d}\mathbf{x}_{i_d}^{(d)}\mathbf{y}_{i_d}^{(d)}\right)
= \prod_{\ell=1}^{d}\left\langle\mathbf{x}^{(\ell)},\mathbf{y}^{(\ell)}\right\rangle.\tag{A.8}
\]
This means the inner product of rank-1 tensors boils down to the inner products of their factor vectors.
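As a quick numerical sanity check of (A.8) (an illustration with arbitrary sizes, not part of the derivation):

    import numpy as np
    from functools import reduce

    dims = [3, 4, 5]
    xs = [np.random.randn(n) for n in dims]
    ys = [np.random.randn(n) for n in dims]
    X = reduce(np.multiply.outer, xs)          # rank-1 tensors
    Y = reduce(np.multiply.outer, ys)

    lhs = np.sum(X * Y)                              # <X, Y> as in (2.3)
    rhs = np.prod([x @ y for x, y in zip(xs, ys)])   # product of factor inner products
    assert np.allclose(lhs, rhs)                     # verifies (A.8)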
A.2 $\varepsilon$-nets as a Means of Set Discretization

This is a useful discretization concept that allows one to obtain large deviation inequalities when assessing the statistical properties of random variables/vectors/matrices/tensors [25].

Definition A.2.1 ($\varepsilon$-net) Let $(\mathcal{S}, d)$ be a metric space, and let $\mathcal{S}_0$ be a subset of $\mathcal{S}$. A subset $\mathcal{N}\subseteq\mathcal{S}_0$ is called an $\varepsilon$-net of $\mathcal{S}_0$ if, for $\varepsilon > 0$, we have
\[
\forall x\in\mathcal{S}_0\ \ \exists x_0\in\mathcal{N}:\ d(x, x_0)\le\varepsilon,\tag{A.9}
\]
meaning any given point in $\mathcal{S}_0$ is within a distance $\varepsilon$ of a point in $\mathcal{N}$. This also means that $\mathcal{N}$ is an $\varepsilon$-net of $\mathcal{S}_0$ if and only if $\mathcal{S}_0$ can be fully covered by balls centered at the elements of $\mathcal{N}$ with radii $\varepsilon$.

It can be observed that the balls of radii $\varepsilon$ in the above definition may overlap. This puts a lower bound on the number of such balls, paving the way for the following definition.

Definition A.2.2 (Covering Number) The smallest possible cardinality of an $\varepsilon$-net $\mathcal{N}$ of $\mathcal{S}_0$ in Definition A.2.1 is called the covering number of $\mathcal{S}_0$ and is denoted by $\mathcal{C}(\mathcal{S}_0,\varepsilon)$. Equivalently, it is the smallest number of closed balls with centers in the elements of $\mathcal{N}$ and radii $\varepsilon$ whose union covers $\mathcal{S}_0$.

Note A.2.1 The covering number of the Euclidean ball $B_2^n$ satisfies the following inequality:
\[
\left(\frac{1}{\varepsilon}\right)^n \le \mathcal{C}\left(B_2^n,\varepsilon\right) \le \left(1 + \frac{2}{\varepsilon}\right)^n.\tag{A.10}
\]

A.3 Random Projections and Johnson-Lindenstrauss Embeddings

In this section, a brief introduction to random projections in the context of JL embeddings is presented. More details can be found in [25].

Suppose that we have $N$ data points $\{\mathbf{x}_i\in\mathbb{R}^n\}_{i=1}^{N}$ and we want to project them onto $\{\mathbf{y}_i\in\mathbb{R}^m\}_{i=1}^{N}$, where $m\ll n$, in a way that the geometry of the data is preserved, i.e., $\|\mathbf{x}_i - \mathbf{x}_j\| \approx \|\mathbf{y}_i - \mathbf{y}_j\|$ for $i\neq j$. We would also like to know the smallest $m$ that makes this possible. First, we review notions that will be useful later. A general class of random variables called sub-Gaussian random variables is of special interest in this discussion.

Definition A.3.1 (Sub-Gaussian Random Variable) For a sub-Gaussian random variable $X$, the following properties hold and are equivalent; the constants $K_i$ differ from each other by at most an absolute constant factor.
1. $\mathbb{P}\{|X| > t\} \le 2\exp\left(-t^2/K_1^2\right)$ for all $t > 0$.
2. $\|X\|_p := \left(\mathbb{E}|X|^p\right)^{1/p} \le K_2\sqrt{p}$ for all $p \ge 1$.
3. The MGF of $X^2$ satisfies $\mathbb{E}\exp\left(\lambda^2 X^2\right) \le \exp\left(K_3^2\lambda^2\right)$ for all $\lambda$ such that $|\lambda| \le 1/K_3$.
4. The MGF of $X^2$ is bounded at some point, namely $\mathbb{E}\exp\left(X^2/K_4^2\right) \le 2$.
5. If $\mathbb{E}X = 0$, the above four properties are equivalent to the MGF of $X$ satisfying $\mathbb{E}\exp(\lambda X) \le \exp\left(K_5^2\lambda^2\right)$ for all $\lambda\in\mathbb{R}$.

All Gaussian random variables and all bounded random variables are sub-Gaussian.

Definition A.3.2 (Sub-Gaussian Norm) The sub-Gaussian norm of a random variable $X$, denoted by $\|X\|_{\psi_2}$, is defined as
\[
\|X\|_{\psi_2} = \inf\left\{t > 0 : \mathbb{E}\exp\left(X^2/t^2\right) \le 2\right\}.
\]
An alternative definition of the sub-Gaussian norm of a random variable $X$ is
\[
\|X\|_{\psi_2} = \sup_{p\ge 1} p^{-1/2}\|X\|_p.
\]

Definition A.3.3 (Random Subspace) Let $G_{n,m}$ denote the Grassmannian (also called the Grassmann manifold), i.e., the set of all $m$-dimensional subspaces of $\mathbb{R}^n$. We say that $E$ is a random $m$-dimensional subspace of $\mathbb{R}^n$ uniformly distributed in $G_{n,m}$ if $E$ is a random $m$-dimensional subspace of $\mathbb{R}^n$ whose distribution is rotation-invariant, i.e.,
\[
\mathbb{P}\{E\in\mathcal{E}\} = \mathbb{P}\{\mathbf{U}(E)\in\mathcal{E}\}
\]
for any fixed subset $\mathcal{E}\subset G_{n,m}$, where $\mathbf{U}$ is an $n\times n$ orthogonal matrix.

Proposition A.3.1 (Random Projection) Let $\mathbf{P}$ be a projection from $\mathbb{R}^n$ onto a random $m$-dimensional subspace of $\mathbb{R}^n$ uniformly distributed in $G_{n,m}$. Let $\mathbf{z}\in\mathbb{R}^n$ be a fixed point and $\varepsilon > 0$. Then,
1. $\left\|\,\|\mathbf{P}\mathbf{z}\|_2\,\right\|_2 := \left(\mathbb{E}\|\mathbf{P}\mathbf{z}\|_2^2\right)^{1/2} = \sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2$.
2. With probability at least $1 - 2\exp\left(-c\varepsilon^2 m\right)$, we have
\[
(1-\varepsilon)\sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2 \le \|\mathbf{P}\mathbf{z}\|_2 \le (1+\varepsilon)\sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2,
\]
where $c$ is an absolute constant.
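Part 1 of Proposition A.3.1 is easy to check empirically before reading the proof. The following Monte Carlo sketch is illustrative only; it relies on the fact that the column space of a Gaussian matrix is uniformly distributed on the Grassmannian, and all sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, trials = 200, 20, 2000
    z = rng.standard_normal(n)                 # a fixed point in R^n

    sq_norms = np.empty(trials)
    for t in range(trials):
        # Orthonormal basis of a uniformly random m-dimensional subspace of R^n:
        # the column space of a Gaussian matrix is rotation-invariant.
        Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
        sq_norms[t] = np.linalg.norm(Q.T @ z) ** 2   # = ||P z||_2^2 for P = Q Q^T

    print(np.mean(sq_norms) / np.linalg.norm(z) ** 2)   # approximately m/n
    print(m / n)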
Proof Since we can normalize by dividing by $\|\mathbf{z}\|_2$, without loss of generality we may assume that $\|\mathbf{z}\|_2 = 1$. Using rotation-invariance, one can show that the random projection of a fixed point is equivalent in distribution to a fixed projection of a random point uniformly distributed on the unit sphere $S^{n-1}\subset\mathbb{R}^n$. This is denoted by $\mathbf{z}\sim\operatorname{Uniform}\left(S^{n-1}\right)$. We choose this fixed projection to be the one that picks the first $m$ entries of $\mathbf{z}\in\mathbb{R}^n$, i.e., $\mathbf{P}\mathbf{z} = [z_1, z_2, \dots, z_m]^{\top}$. In other words, $\mathbf{P} = [\mathbf{I}_m\,|\,\mathbf{0}]\in\mathbb{R}^{m\times n}$, where $\mathbf{0}\in\mathbb{R}^{m\times(n-m)}$ is a matrix of all zeros. Therefore, we have that
\[
\mathbb{E}\|\mathbf{P}\mathbf{z}\|_2^2 = \mathbb{E}\sum_{i=1}^{m}z_i^2 = \sum_{i=1}^{m}\mathbb{E}z_i^2 = m\,\mathbb{E}z_i^2,
\]
since all $z_i$ are drawn from the same distribution. We also know that $\|\mathbf{z}\|_2^2 = 1$. Taking the expectation of both sides of $\sum_{i=1}^{n}z_i^2 = 1$, we can write
\[
\mathbb{E}\sum_{i=1}^{n}z_i^2 = 1,
\]
resulting in $\mathbb{E}z_i^2 = 1/n$. Therefore, $\mathbb{E}\|\mathbf{P}\mathbf{z}\|_2^2 = m/n$. This proves 1.

To prove 2, we may use the large deviation inequality for the concentration of Lipschitz-continuous functions on the unit sphere. In particular, if $f(\mathbf{z})$ is Lipschitz-continuous and $\mathbf{z}\sim\operatorname{Uniform}\left(S^{n-1}\right)$, then
\[
\mathbb{P}\left\{\left|f(\mathbf{z}) - \|f(\mathbf{z})\|_2\right| > t\right\} \le 2\exp\left(-\frac{cnt^2}{L^2}\right),
\]
where $c$ is an absolute constant and $L$ is the Lipschitz constant of $f(\mathbf{z})$. (If $f:\mathbb{R}^n\to\mathbb{R}$ is Lipschitz with Lipschitz constant $L$, then $f(\mathbf{x})$ is sub-Gaussian for $\mathbf{x}\sim\operatorname{Uniform}\left(S^{n-1}\right)$, and $\mathbb{P}\left(|f(\mathbf{x}) - M| > t\right) \le 2\exp\left(-\frac{nt^2}{2L^2}\right)$ for $t > 0$, where $M$ is the median of $f$. This means that, with high probability, $f(\mathbf{x})\approx M$ on the unit sphere. A similar inequality can be obtained when $M$ is replaced with $\mathbb{E}f(\mathbf{z})$, which, in turn, can be substituted by $\|f(\mathbf{z})\|_2$, as has been done in the displayed inequality above. For details, see Chapter 5 in [25].) Here, $\|f(\mathbf{z})\|_2 = \left(\mathbb{E}|f(\mathbf{z})|^2\right)^{1/2}$, as $f(\mathbf{z})$ is a random variable. In our case, $f(\mathbf{z}) = \|\mathbf{P}\mathbf{z}\|_2$. Since $f(\mathbf{z})$ is Lipschitz, and therefore continuous, we can use the mean value theorem to write
\[
|f(\mathbf{z}) - f(\mathbf{y})| = \left|\nabla^{\top}f(\mathbf{z}_0)(\mathbf{z}-\mathbf{y})\right| \le \|\nabla f(\mathbf{z}_0)\|_2\,\|\mathbf{z}-\mathbf{y}\|_2,
\]
where $\mathbf{y},\mathbf{z}\in\mathbb{R}^n$ and $\mathbf{z}_0$ is a point on the line segment between $\mathbf{z}$ and $\mathbf{y}$. We can bound the gradient of $f(\mathbf{z})$ to show that $|f(\mathbf{z}) - f(\mathbf{y})| \le \|\mathbf{z}-\mathbf{y}\|_2$, meaning that we can let $L = 1$. On the other hand, we may rearrange the inequality in 2 as
\[
\left|\,\|\mathbf{P}\mathbf{z}\|_2 - \sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2\,\right| \le \varepsilon\sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2.
\]
Noting that $t = \varepsilon\sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2$ in the large deviation inequality, and using the result from 1, we obtain the desired result.

Lemma A.3.1 (Johnson-Lindenstrauss Lemma for Point Sets) Let $X\subseteq\mathbb{R}^n$ be a set of $N$ points in $\mathbb{R}^n$. Then, for an absolute constant $C$, there exists a linear map $\mathbf{A} = \sqrt{\frac{n}{m}}\,\mathbf{P}\in\mathbb{R}^{m\times n}$, where
\[
m \ge C\,\frac{\log N}{\varepsilon^2},
\]
such that for all $\mathbf{x},\mathbf{y}\in X$,
\[
(1-\varepsilon)\|\mathbf{x}-\mathbf{y}\|_2 \le \|\mathbf{A}(\mathbf{x}-\mathbf{y})\|_2 \le (1+\varepsilon)\|\mathbf{x}-\mathbf{y}\|_2
\]
with probability at least $1 - 2\exp\left(-Cm\varepsilon^2\right)$. This means that $\mathbf{A}$ is an approximate isometry on $X$.

Proof We showed how the random projection $\mathbf{P}$ acts on a vector in $X$. Now, consider the difference set $X - X := \{\mathbf{x}-\mathbf{y}\,|\,\mathbf{x},\mathbf{y}\in X\}$. We want to show that for all $\mathbf{z}\in X - X$,
\[
(1-\varepsilon)\|\mathbf{z}\|_2 \le \|\mathbf{A}\mathbf{z}\|_2 \le (1+\varepsilon)\|\mathbf{z}\|_2.
\]
Replacing $\mathbf{A}$ by $\sqrt{\frac{n}{m}}\,\mathbf{P}$, we have
\[
(1-\varepsilon)\sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2 \le \|\mathbf{P}\mathbf{z}\|_2 \le (1+\varepsilon)\sqrt{\frac{m}{n}}\,\|\mathbf{z}\|_2,
\]
which we already know holds with probability at least $1 - 2\exp\left(-cm\varepsilon^2\right)$ from Proposition A.3.1. Now, taking a union bound over all points in $X - X$, we may conclude that the desired result holds with probability at least
\[
1 - 2\,|X - X|\exp\left(-cm\varepsilon^2\right) \ge 1 - 2N^2\exp\left(-cm\varepsilon^2\right) = 1 - \exp\left(-cm\varepsilon^2 + \log\left(2N^2\right)\right).
\]
For this probability to be positive, we need $cm\varepsilon^2 \ge \log\left(2N^2\right)$, and therefore
\[
m \ge C\,\frac{\log N}{\varepsilon^2},
\]
completing the proof.
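To see Lemma A.3.1 in action numerically, the sketch below embeds a small point set and records the worst pairwise distance distortion. It uses a dense Gaussian JL map with i.i.d. N(0, 1/m) entries as a stand-in for the subspace projection $\sqrt{n/m}\,\mathbf{P}$ of the lemma (a standard JL construction with the same qualitative behavior); all sizes and the constant in the choice of m are illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    n, N, eps = 1000, 50, 0.3
    m = int(np.ceil(8 * np.log(N) / eps ** 2))      # m ~ C log(N) / eps^2

    X = rng.standard_normal((N, n))                 # N points in R^n
    A = rng.standard_normal((m, n)) / np.sqrt(m)    # JL map, i.i.d. N(0, 1/m) entries
    Y = X @ A.T

    ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
              for i in range(N) for j in range(i + 1, N)]
    print(min(ratios), max(ratios))                 # typically within [1 - eps, 1 + eps]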
APPENDIX B
MEMORY-EFFICIENT MODE-WISE PROJECTION CALCULATIONS OF THE ENERGY TERMS

In the following, it is assumed that $n_j$, $j\in\{1,2,3\}$, is the dimension size of the 3-mode tensors obtained by reshaping the hypothetical 6-mode arrays, and $m_j$ is the corresponding size after projection.

B.1 Particle-Particle

Denoting the projected version of $\mathcal{H}_1$ by $\mathcal{P}_1$, one can calculate its elements by
\[
\begin{aligned}
\mathcal{P}_1(i_1,i_2,i_3) &= \sum_{p,q,r}\mathbf{H}(p,q)\,\mathbf{H}(q,r)\,\mathbf{A}^{(1)}(i_1,p)\,\mathbf{A}^{(2)}(i_2,q)\,\mathbf{A}^{(3)}(i_3,r)\\
&= \sum_{q}\mathbf{A}^{(2)}(i_2,q)\left(\sum_{p}\mathbf{A}^{(1)}(i_1,p)\,\mathbf{H}(p,q)\right)\left(\sum_{r}\mathbf{A}^{(3)}(i_3,r)\,\mathbf{H}^{\top}(r,q)\right),
\end{aligned}\tag{B.1}
\]
for $i_j\in[m_j]$, $j\in\{1,2,3\}$. Now, letting $\mathbf{H}' = \mathbf{A}^{(1)}\mathbf{H}$ and $\mathbf{H}'' = \mathbf{A}^{(3)}\mathbf{H}^{\top}$, it can be observed that $\mathcal{P}_1 = \mathcal{H}'''\times_2\mathbf{A}^{(2)}$, where $\mathcal{H}'''$ is a tensor that is formed element-wise according to
\[
\mathcal{H}'''(i_1, q, i_3) = \mathbf{H}'(i_1, q)\,\mathbf{H}''(i_3, q).
\]
Indeed, the mode-2 unfolding of $\mathcal{H}'''$ is the result of the Hadamard product of the $n_2\times m_1 m_3$ matrices $\mathbf{G}_1$ and $\mathbf{G}_2$, where $\mathbf{G}_1$ is formed by replicating $\mathbf{H}'^{\top}$ across its second dimension $m_3$ times, and $\mathbf{G}_2$ is formed by replicating each column of $\mathbf{H}''^{\top}$ $m_1$ times (in an implementation, one does not have to store the replicated versions of the data, as this would be an inefficient use of memory). Left-multiplying the mode-2 unfolding of $\mathcal{H}'''$ by $\mathbf{A}^{(2)}$ and folding back to tensor shape yields $\mathcal{P}_1$.

Letting $\mathcal{P}_2$ denote the projected version of $\mathcal{H}_2$, it is observed that
\[
\begin{aligned}
\mathcal{P}_2(i_1,i_2,i_3) &= \sum_{p,q,r}\widetilde{\mathbf{H}}(r,p)\,\mathbf{D}(q,p)\,\mathbf{A}^{(1)}(i_1,p)\,\mathbf{A}^{(2)}(i_2,q)\,\mathbf{A}^{(3)}(i_3,r)\\
&= \sum_{p}\mathbf{A}^{(1)}(i_1,p)\left(\sum_{q}\mathbf{A}^{(2)}(i_2,q)\,\mathbf{D}(q,p)\right)\left(\sum_{r}\mathbf{A}^{(3)}(i_3,r)\,\widetilde{\mathbf{H}}(r,p)\right).
\end{aligned}\tag{B.2}
\]
Again, letting $\mathbf{H}' = \mathbf{A}^{(2)}\mathbf{D}$ and $\mathbf{H}'' = \mathbf{A}^{(3)}\widetilde{\mathbf{H}}$, it can be observed that $\mathcal{P}_2 = \mathcal{H}'''\times_1\mathbf{A}^{(1)}$, where $\mathcal{H}'''$ is defined element-wise by
\[
\mathcal{H}'''(p, i_2, i_3) = \mathbf{H}'(i_2, p)\,\mathbf{H}''(i_3, p).
\]
The mode-1 unfolding of $\mathcal{H}'''$ is obtained by the Hadamard product of the $n_1\times m_2 m_3$ matrices $\mathbf{G}_1$ and $\mathbf{G}_2$, where $\mathbf{G}_1$ is formed by replicating $\mathbf{H}'^{\top}$ across its second dimension $m_3$ times, and $\mathbf{G}_2$ is formed by replicating each column of $\mathbf{H}''^{\top}$ $m_2$ times. It now suffices to left-multiply the mode-1 unfolding of $\mathcal{H}'''$ by $\mathbf{A}^{(1)}$ and fold back to tensor form to obtain $\mathcal{P}_2$.

B.2 Hole-Hole

Calculations are done in a way similar to that of Appendix B.1. The projection of $\mathcal{H}_1$ will be exactly the same. For $\mathcal{H}_2$, one can write
\[
\begin{aligned}
\mathcal{P}_2(i_1,i_2,i_3) &= \sum_{p,q,r}\widetilde{\mathbf{H}}(r,p)\,\mathbf{D}(r,q)\,\mathbf{A}^{(1)}(i_1,p)\,\mathbf{A}^{(2)}(i_2,q)\,\mathbf{A}^{(3)}(i_3,r)\\
&= \sum_{r}\mathbf{A}^{(3)}(i_3,r)\left(\sum_{p}\mathbf{A}^{(1)}(i_1,p)\,\widetilde{\mathbf{H}}^{\top}(p,r)\right)\left(\sum_{q}\mathbf{A}^{(2)}(i_2,q)\,\mathbf{D}^{\top}(q,r)\right).
\end{aligned}\tag{B.3}
\]
Letting $\mathbf{H}' = \mathbf{A}^{(1)}\widetilde{\mathbf{H}}^{\top}$ and $\mathbf{H}'' = \mathbf{A}^{(2)}\mathbf{D}^{\top}$, it can be observed that $\mathcal{P}_2 = \mathcal{H}'''\times_3\mathbf{A}^{(3)}$, where $\mathcal{H}'''$ is formed element-wise according to
\[
\mathcal{H}'''(i_1, i_2, r) = \mathbf{H}'(i_1, r)\,\mathbf{H}''(i_2, r).
\]
The mode-3 unfolding of $\mathcal{H}'''$ is obtained by the Hadamard product of the $n_3\times m_1 m_2$ matrices $\mathbf{G}_1$ and $\mathbf{G}_2$, where $\mathbf{G}_1$ is formed by replicating $\mathbf{H}'^{\top}$ across its second dimension $m_2$ times, and $\mathbf{G}_2$ is formed by replicating each column of $\mathbf{H}''^{\top}$ $m_1$ times, where the replication is not performed explicitly, to save time and memory, as mentioned above. The remaining step is to left-multiply the mode-3 unfolding of $\mathcal{H}'''$ by $\mathbf{A}^{(3)}$ and fold back to tensor shape to get $\mathcal{P}_2$.
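The replication-free evaluation described above can be expressed compactly with an einsum contraction. The following Python/NumPy sketch checks the particle-particle case (B.1): it compares a direct evaluation of the triple sum against the memory-efficient route through H' and H''. Sizes and variable names are illustrative.

    import numpy as np

    n, (m1, m2, m3) = 6, (2, 3, 4)              # illustrative sizes
    H = np.random.randn(n, n)
    A1, A2, A3 = (np.random.randn(m, n) for m in (m1, m2, m3))

    # Direct evaluation of the triple sum in (B.1).
    P1_direct = np.einsum('pq,qr,ip,jq,kr->ijk', H, H, A1, A2, A3)

    # Memory-efficient route: H' = A1 H, H'' = A3 H^T, then contract over the
    # shared index q (the Hadamard-product structure) and project with A2.
    Hp = A1 @ H                                  # m1 x n, entries H'(i1, q)
    Hpp = A3 @ H.T                               # m3 x n, entries H''(i3, q)
    P1_fast = np.einsum('iq,jq,kq->ijk', Hp, A2, Hpp)

    assert np.allclose(P1_direct, P1_fast)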
B.3 Particle-Hole

The projection of $\mathcal{H}_1$ can be calculated using
\[
\mathcal{P}_1(i_1,i_2,i_3) = \sum_{p,q,r}\widetilde{\mathbf{H}}_1(p,q)\,\mathbf{H}_p(q,r)\,\mathbf{A}^{(1)}(i_1,p)\,\mathbf{A}^{(2)}(i_2,q)\,\mathbf{A}^{(3)}(i_3,r),\tag{B.4}
\]
which is the same as (B.1) after replacing $\mathbf{H}(q,r)$ by $\mathbf{H}_p(q,r)$ and $\mathbf{H}(p,q)$ by $\widetilde{\mathbf{H}}_1(p,q)$ in (B.1). For $\mathcal{H}_2$, the projected tensor $\mathcal{P}_2$ is calculated using
\[
\begin{aligned}
\mathcal{P}_2(i_1,i_2,i_3) &= \sum_{p,q,r}\widetilde{\mathbf{H}}_2(r,p)\,\mathbf{H}_p\,\mathbf{A}^{(1)}(i_1,p)\,\mathbf{A}^{(2)}(i_2,q)\,\mathbf{A}^{(3)}(i_3,r)\\
&= \sum_{q}\mathbf{A}^{(2)}(i_2,q)\left(\sum_{p}\mathbf{A}^{(1)}(i_1,p)\left(\sum_{r}\mathbf{A}^{(3)}(i_3,r)\,\widetilde{\mathbf{H}}_2(r,p)\right)\right)\\
&= \sum_{q}\mathbf{H}''(i_1,i_3)\,\mathbf{A}^{(2)}(i_2,q),
\end{aligned}\tag{B.5}
\]
where $\mathbf{H}'' = \mathbf{A}^{(1)}\mathbf{H}'^{\top}$ and $\mathbf{H}' = \mathbf{A}^{(3)}\widetilde{\mathbf{H}}_2$.

APPENDIX C
FASTER KRONECKER JOHNSON-LINDENSTRAUSS TRANSFORM

In this appendix, the idea behind the method proposed in [10] is briefly outlined. Here, the notation is kept similar to that of [10], and is slightly different from the notation used in Section 4.2.3. Assume that $N = \prod_{k=1}^{d}n_k$, and consider the JL matrix
\[
\boldsymbol{\Phi} = \sqrt{\frac{N}{m_{\mathrm{kron}}}}\,\mathbf{S}\left(\bigotimes_{k=d}^{1}\mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}\right)\in\mathbb{C}^{m_{\mathrm{kron}}\times N},\tag{C.1}
\]
where $\mathbf{S}\in\mathbb{R}^{m_{\mathrm{kron}}\times N}$ is a random sampling matrix (similar to $\mathbf{R}$ in RFD), $\mathbf{D}_{\xi_{n_k}}\in\mathbb{R}^{n_k\times n_k}$ is a diagonal matrix with Rademacher random variables on its diagonal, and $\mathbf{F}_{n_k}\in\mathbb{C}^{n_k\times n_k}$ is the unitary DFT matrix. If a vector $\mathbf{x}\in\mathbb{C}^{N\times 1}$ has Kronecker structure, meaning it can be written as $\mathbf{x} = \bigotimes_{k=d}^{1}\mathbf{x}_k$, or equivalently as the vectorized form of a rank-1 tensor $\mathcal{X} = \mathbf{x}_1\circ\cdots\circ\mathbf{x}_d$, then one can observe that
\[
\begin{aligned}
\boldsymbol{\Phi}\mathbf{x} &= \sqrt{\frac{N}{m_{\mathrm{kron}}}}\,\mathbf{S}\left(\bigotimes_{k=d}^{1}\mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}\right)\left(\bigotimes_{k=d}^{1}\mathbf{x}_k\right)
= \sqrt{\frac{N}{m_{\mathrm{kron}}}}\,\mathbf{S}\left(\bigotimes_{k=d}^{1}\mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}\mathbf{x}_k\right)\\
&= \sqrt{\frac{N}{m_{\mathrm{kron}}}}\,\mathbf{S}\,\operatorname{vec}\!\left(\mathcal{X}\times_1\mathbf{F}_{n_1}\mathbf{D}_{\xi_{n_1}}\times_2\cdots\times_d\mathbf{F}_{n_d}\mathbf{D}_{\xi_{n_d}}\right).
\end{aligned}\tag{C.2}
\]
It is easy to see that applying $\boldsymbol{\Phi}$ to $\mathbf{x}$ defined above is equivalent to performing a modewise fast JL embedding of a rank-1 tensor whose vectorized form is $\mathbf{x}$, meaning $\mathbf{x} = \operatorname{vec}(\mathcal{X}) = \operatorname{vec}\left(\mathbf{x}_1\circ\cdots\circ\mathbf{x}_d\right)$. The embedding applied to mode $k$ is $\mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}$. Next, the result is vectorized and the random restriction matrix $\mathbf{S}$ is applied, followed by the factor $\sqrt{N/m_{\mathrm{kron}}}$. The computational cost will be low since, if $\mathbf{y}_k := \mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}\mathbf{x}_k$, then one can see that
\[
\mathcal{Y} := \mathbf{y}_1\circ\cdots\circ\mathbf{y}_d = \left(\mathbf{F}_{n_1}\mathbf{D}_{\xi_{n_1}}\mathbf{x}_1\right)\circ\cdots\circ\left(\mathbf{F}_{n_d}\mathbf{D}_{\xi_{n_d}}\mathbf{x}_d\right) = \mathcal{X}\times_1\mathbf{F}_{n_1}\mathbf{D}_{\xi_{n_1}}\times_2\cdots\times_d\mathbf{F}_{n_d}\mathbf{D}_{\xi_{n_d}},
\]
and $\mathbf{y} := \operatorname{vec}(\mathcal{Y}) = \bigotimes_{k=d}^{1}\mathbf{y}_k$. Therefore, when applying the random sampling matrix $\mathbf{S}$, if one knows which indices of $\mathbf{y}$ are being picked, those indices can be translated into indices of the $\mathbf{y}_k$, meaning that calculations will only be done for those specific indices. Assume we are interested in finding element $i_k$ of $\mathbf{y}_k$. For $i_k\in[n_k]$ and $k\in[d]$, we have that
\[
\mathbf{y}_k(i_k) = \sum_{j=1}^{n_k}\left(\mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}\right)_{i_k,j}\left(\mathbf{x}_k\right)_j.
\]
On the other hand,
\[
\left(\mathbf{F}_{n_k}\mathbf{D}_{\xi_{n_k}}\right)_{i_k,j} = \left(\left(\mathbf{F}_{n_k}\right)_{i_k,:}\ast\,\mathbf{d}_{\xi_{n_k}}\right)_j,
\]
where $\ast$ denotes elementwise multiplication and $\mathbf{d}_{\xi_{n_k}}$ denotes the diagonal of $\mathbf{D}_{\xi_{n_k}}$. Therefore,
\[
\mathbf{y}_k(i_k) = \left\langle\left(\mathbf{F}_{n_k}\right)_{i_k,:}\ast\,\mathbf{d}_{\xi_{n_k}},\ \mathbf{x}_k\right\rangle.
\]
This means that, in each mode, all one needs is $\mathbf{d}_{\xi_{n_k}}$ and the $i_k$th row of $\mathbf{F}_{n_k}$. This significantly reduces the computational cost since, in practice, $m_{\mathrm{kron}}\ll N$.
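A compact sketch of the sampled evaluation described above is given below. It is illustrative only: the function name, the unitary-DFT row construction, and the column-major index ordering are assumptions consistent with (A.1) and (C.1)-(C.2), not code from [10].

    import numpy as np

    def sampled_kron_fjlt(xs, kept_idx, m_kron, rng):
        # Entries of Phi x in (C.1)-(C.2) for x = x_d kron ... kron x_1, computed
        # from the factors alone: each kept flat index is translated into per-mode
        # indices i_k, and y_k(i_k) = <(F_{n_k})_{i_k,:} * d_k, x_k> is used.
        dims = [len(x) for x in xs]
        N = int(np.prod(dims))
        d_signs = [rng.choice([-1.0, 1.0], size=n) for n in dims]   # diag of D_{xi}
        out = np.zeros(len(kept_idx), dtype=complex)
        for s, idx in enumerate(kept_idx):
            multi = np.unravel_index(idx, dims, order='F')          # mode 1 fastest
            val = 1.0 + 0.0j
            for x, d, n, i_k in zip(xs, d_signs, dims, multi):
                f_row = np.exp(-2j * np.pi * i_k * np.arange(n) / n) / np.sqrt(n)
                val *= np.dot(f_row * d, x)                         # y_k(i_k)
            out[s] = val
        return np.sqrt(N / m_kron) * out

    rng = np.random.default_rng(0)
    xs = [rng.standard_normal(n) for n in (4, 5, 6)]
    print(sampled_kron_fjlt(xs, kept_idx=[0, 17, 63], m_kron=3, rng=rng))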
BIBLIOGRAPHY

[1] Alzheimer's disease neuroimaging initiative. http://adni.loni.usc.edu/.

[2] A mathematical introduction to fast and memory efficient algorithms for big data. https://users.math.msu.edu/users/iwenmark/Notes_Fall2020_Iwen_Classes.pdf.

[3] T. D. Ahle, M. Kapralov, J. B. Knudsen, R. Pagh, A. Velingker, D. P. Woodruff, and A. Zandieh. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 141–160. SIAM, 2020.

[4] R. Bro and H. A. Kiers. A new efficient method for determining the number of components in PARAFAC models. Journal of Chemometrics: A Journal of the Chemometrics Society, 17(5):274–286, 2003.

[5] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing. Springer, 2013.

[6] Z. Hao, L. He, B. Chen, and X. Yang. A linear support higher-order tensor machine for classification. IEEE Transactions on Image Processing, 22(7):2911–2920, July 2013.

[7] L. He, C.-T. Lu, G. Ma, S. Wang, L. Shen, P. S. Yu, and A. B. Ragin. Kernelized support tensor machines. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1442–1451. JMLR.org, 2017.

[8] M. Iwen and B. Ong. A distributed and incremental SVD algorithm for agglomerative data analysis on large networks. SIAM Journal on Matrix Analysis and Applications, 37(4):1699–1718, 2016.

[9] M. A. Iwen, D. Needell, E. Rebrova, and A. Zare. Lower memory oblivious (tensor) subspace embeddings with fewer random bits: modewise methods for least squares. SIAM Journal on Matrix Analysis and Applications, 42(1):376–416, 2021.

[10] R. Jin, T. G. Kolda, and R. Ward. Faster Johnson–Lindenstrauss transforms via Kronecker products. arXiv preprint arXiv:1909.04801, 2019.

[11] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649, 2001.

[12] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[13] F. Krahmer and R. Ward. New and improved Johnson–Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269–1281, 2011.

[14] H. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. MPCA: Multilinear principal component analysis of tensor objects. IEEE Transactions on Neural Networks, 19(1):18–39, Jan 2008.

[15] S. A. Nene, S. K. Nayar, and H. Murase. Columbia object image library (COIL-100), 1996.

[16] I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.

[17] A. Ozdemir, A. Zare, M. A. Iwen, and S. Aviyente. Multiscale analysis for higher-order tensors. In Wavelets and Sparsity XVIII, volume 11138, page 1113808. International Society for Optics and Photonics, 2019.

[18] R. Pagh. Compressed matrix multiplication. ACM Transactions on Computation Theory (TOCT), 5(3):1–17, 2013.

[19] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 239–247, 2013.

[20] Y. Shi and A. Anandkumar. Higher-order count sketch: Dimensionality reduction that retains efficient tensor operations, 2019.
[21] N. D. Sidiropoulos and R. Bro. On the uniqueness of multilinear decomposition of N-way arrays. Journal of Chemometrics, 14(3):229–239, 2000.

[22] Y. Sun, Y. Guo, C. Luo, J. Tropp, and M. Udell. Low-rank Tucker approximation of a tensor from streaming data. SIAM Journal on Mathematics of Data Science, 2(4):1123–1150, 2020.

[23] Y. Sun, Y. Guo, J. A. Tropp, and M. Udell. Tensor random projection for low memory dimension reduction. In NeurIPS Workshop on Relational Representation Learning, 2018.

[24] D. Tao, X. Li, X. Wu, W. Hu, and S. Maybank. Supervised tensor learning. Knowledge and Information Systems, 2007.

[25] R. Vershynin. High-dimensional probability: An introduction with applications in data science. Cambridge Series in Statistical and Probabilistic Mathematics, 2018.

[26] X. Liu and N. D. Sidiropoulos. Cramer-Rao lower bounds for low-rank decomposition of multidimensional arrays. IEEE Transactions on Signal Processing, 49(9):2074–2086, Sep. 2001.

[27] A. Zare, A. Ozdemir, M. A. Iwen, and S. Aviyente. Extension of PCA to higher order data structures: An introduction to tensors, tensor decompositions, and tensor PCA. Proceedings of the IEEE, 106(8):1341–1358, 2018.