ON METHODS IN TENSOR RECOVERY AND COMPLETION

By

Cullen Haselby

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Applied Mathematics – Doctor of Philosophy

2023

ABSTRACT

Tensor representations of data hold great promise, since as data grows both in dimensionality and in the number of modes, it becomes increasingly advantageous to employ methods that exploit structure to explain, store, and compute in an efficient manner. Tensors and their decompositions crucially admit the possibility of exploiting the multi-linear relationships between the modes for such purposes. Additionally, multi-modal views of data are natural in many practical contexts; e.g., physical phenomena may have several spatial modes and a temporal mode, in natural language processing data may derive from triplets of words that form subject-verb-object semantics, and in neural networks both the processing of data and the storage of weights between layers have been fruitfully imagined using tensors. Finding or using decompositions of tensors is a fundament of this work, and indeed of any work which seeks to analyze and produce algorithms that make efficient use of tensor representations of data, since these factorizations can radically lower the total number of parameters required to store and perform operations on the tensor.

In this dissertation we consider two related problems involving tensor representations of data. First, we consider the problem of recovering a low-rank Tucker approximation to a large tensor from structured, yet randomized, measurements. We will describe a framework for a class of measurement ensembles that can offer excellent memory and accuracy trade-offs in comparison to alternatives. Furthermore, these ensembles can easily be applied in a single pass over the tensor, and in a linear manner, making them suitable for streaming scenarios. That is, we will propose a compressive sensing framework, and study uses of it which can produce low-rank factorizations from these measurements, where we show that the total number of measurements required scales with the number of parameters in the factorization, rather than with the full, uncompressed tensor. We analyze two recovery approaches along with the necessary specializations of the measurement framework. Unlike prior works on algorithms for low-rank tensor approximation from such compressive measurements, we present a unified analysis that permits several different choices of structured measurement ensembles, and we show how to prove error guarantees comparing the error of our recovery algorithm's approximation of the input tensor to the best possible low-rank Tucker approximation error achievable for the tensor by any possible algorithm. We include empirical and practical studies that verify our theoretical findings and explore various trade-offs of parameters of interest. We discuss the development of the methods, and how theoretical and practical critiques of our earlier work informed and enabled improvements in the sequel.

Next, we consider the related problem of tensor completion, where the goal is to exactly complete a low-rank CP tensor. Our method leverages existing matrix completion techniques and an adaptive sampling scheme, along with a noise-robust version of Jennrich's Algorithm, to complete a tensor using a sample budget which scales, up to a log factor, with the number of parameters in the factorization.
Empirical studies, such as performance in the presence of additive noise on simulated data, as well as several practical applications of the method are included and discussed. Copyright by CULLEN HASELBY 2023 To my father, Ray Haselby. v ACKNOWLEDGEMENTS One could try to explore the frontiers alone, but in my view, precious little would be seen this way. Far better, and far more rewarding, to go there with others. This work and what it represents of my understanding of mathematics and its place in the world was accomplished principally through, and because of, discourse and collaboration with others. The most frequent and immediate target for my graspings for meaning in subjects large and small mathematically is my advisor, Mark Iwen. You taught me much, gave me much, and I never once saw the end of your patience. You invited me into the good and puissant circle of people and ideas you have cultivated, and I was made better for it. I somewhat doubt that your monstrous investment of time and effort into me will ever enrich you personally, but that also never seemed to bother you. Your example is not lost on me. I intend to do what good I can muster with all that was given to me and become worthy of it. Thank you. Next, I would like to thank Michael Perlmutter, Elizaveta Rebrova, Deanna Needell and William Swartworth for first pulling me onboard, and then rowing along with me on our now shared mission to make tensor recovery with modewise measurements a better understood and more useful art, and with whom I wrote [23] and [24] alongside. Your expertise, curiosity, persistence showed me how the work is done. Here at MSU, I was buoyed considerably by the fruitful interdisciplinary collaboration with Heiko Hergert, his student Roland Wirth and my graduate student older sibling so-to-speak Ali Zare as seen in our work [68]. I learned more than I think I let on in those early days from reading your writings, pondering your code and methods, and probing your ideas. Santhosh Karnik with whom I wrote [25]. It was a lot of fun making Tensor Sandwiches with you, and sharpening our ideas together. Thank you. I would like to thank Jianliang Qian for his service on my committee, his helpful advice when I was taking my first steps into applied mathematics research, his work as an instructor and his direction of the Industrial Mathematics program here at MSU; from which I had very memorable collaboration. I can with the confidence of some hindsight and experience say that numerical linear vi algebra and the topics you taught and taught well were used near daily in both my research and work. To Rongrong Wang, thank you for your service on my committee and your insights into my forays into tensor completion algorithms. To the faculty, staff, and fellow students at MSU Mathematics department who strove in so many ways to make our collective contribution to human understanding a success, and maintaining a tradition of excellence, I am also indebted. In particular I would like to acknowledge my fellow student Yuta Hozumi. I did value our several hundred hours of pondering each otherโ€™s many problems and ideas, however it was your friendship that was at times, sustaining. Thank you. vii TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2 TENSOR RESTRICTED ISOMETRY PROPERTY AND TENSOR ITERATIVE HARD THRESHOLDING FOR RECOVERY . . . . . . . 35 CHAPTER 3 LEAVE ONE OUT MEASUREMENTS FOR RECOVERY . . . . . . . 
54 CHAPTER 4 TENSOR SANDWICH: TENSOR COMPLETION FOR LOW CP- RANK TENSORS VIA ADAPTIVE RANDOM SAMPLING . . . . . . 114 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 viii CHAPTER 1 INTRODUCTION 1.1 Introduction The need for efficient methods to acquire, store, reconstruct, and analyze large-scale data is common now in many settings and disciplines. Furthermore, data may often be multi-modal thus making tensor representations natural and useful in both their storage and analysis. The contexts for this sort of data are numerous and extremely varied, from signal processing tasks, imaging, machine learning, and many other data science applications [37, 12, 59, 42, 9]. Tensors are the natural extension of vectors and matrices, allowing for an arbitrary number of modes, instead of only one or two. Despite seemingly straightforward extensions of the matrix concept, many of the standard linear algebra and computational results that are tried and true in the matrix case simply do not cleanly translate beyond two modes. Nevertheless, tensors of interest rarely have unconstrained, unrelated entries. In practical settings, the tensor data may have some implicit low-rank structure that can be exploited for efficient computation and storage, or simply be well approximated by a tensor that does have this structure. Examples of decompositions include Tucker, CANDECOMP/PARAFAC (CP), and tensor train [37]. Since tensor data can be large both in the number of modes and the dimensions within each mode, and arrive in a streaming fashion, efficient methods that deal with these facets are critical for the computation of such factorizations as well as recovery of the data from compressed measurements or using sampling strategies. In this work, we will describe and analyze a framework and methods which relate to this framework that can reduce overall memory requirements, theoretically accommodate many types of measurement schemes for collecting linear measurements of large tensors, apply to several low-rank decompositions, while also remaining robust to non-exact low-rankness, and provide computationally tractable and provable recovery. The framework also applies to the streaming setting, such as when the tensor data is being received sequentially over time in a slice by slice, or even entrywise manner. Efficiency in terms of storage for linear measurements are valuable in 1 streaming algorithms for tensor reconstruction in the big data setting (see, e.g., [60]) since it can be impractical or impossible to have access to the entire, full tensor that one wishes to measure and later reconstruct. 1.2 Motivating Examples The following examples provide context and some indication as to how measurements of tensors and their recovery are of theoretical and practical value. 1.2.1 Example, Technical For example, suppose we wish to aggregate tensor data over time using the additive rule X๐‘ก := X๐‘กโˆ’1 + ฮ”X๐‘ก, X0 = 0 based on updates ฮ”X๐‘ก. After some number of updates, we wish to reconstruct an estimate of X๐‘‡ = (cid:205)๐‘‡ ๐‘ก=1 ฮ”X๐‘ก โˆˆ R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ for some ๐‘‡ > 0, where ๐‘‡ can be large. Storing every intermediate Instead we can store small linear sketches of tensor X๐‘ก in uncompressed form is inefficient. the intermediate tensors using a linear measurement operator L : R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ โ†’ R๐‘š1ร—ยทยทยทร—๐‘š๐‘‘โ€ฒ with (cid:206)๐‘‘โ€ฒ โ„“=1 ๐‘›โ„“. 
This operator can be used to iteratively update the comparatively smaller ๐‘šโ„“ โ‰ช (cid:206)๐‘‘ โ„“=1 sketch L (X๐‘ก) = L (X๐‘กโˆ’1) + L (ฮ”X๐‘ก) over the updates ๐‘ก = 0, . . . , ๐‘‡. So long as the final sketch L (X๐‘‡ ) can then be used to approximately recover X๐‘‡ using a recovery algorithm, of sufficient accuracy, then we can achieve storage savings so long as the number of entries in the measurements are smaller than the number of entries in the original signal. Ideally the recovery algorithm itself would also be efficient in terms of memory and computational complexity, so that the cost of recovering X๐‘‡ from L (X๐‘‡ ) does not unduly outweigh the efficiencies in terms of storage of dealing with only the measurements after the operator has acted on the signal. Naturally, in this situation L itself must be stored or newly re-calculated each time it is to operate on a tensor. This is a storage requirement beyond the measurements themselves, L (X๐‘กโˆ’1) and L (ฮ”X๐‘ก). Perhaps this operator can be used many times on different streams, yet even so, it is 2 truly only memory conserving if the cost of storing L is less than the savings accrued by not storing the tensor X๐‘ก, and instead using the measurements, when such is even possible. Indeed, in the classic compressive sensing scenario where one has say a large sensing matrix acting on a sparse vector, one needs to store the measurement operator (or generate it anew each time it is needed) and without a low memory approach or a specially designed measurement operator, the storage of the matrix would exceed any savings gained by the compression of the data. So, by developing our framework, we will by turns describe how to construct a class of operators L and pair them with recovery procedures so that for signals of a given type we can with confidence recover them to a desired level of accuracy. 1.2.2 Example, Hypothetical and Possibly Non-benign Consider the following hypothetical application. Suppose there is a weapon which seeks to replace the role of the landmine in warfare (victim operated mines violate important international treaties and can be indiscriminate, and thus a potentially great evil, even outside of when they have served their initial intended military purpose). Suppose these new weapons have cameras and each device has a limited amount of computational resources, bandwidth, and storage. A well designed measurement operator L could enable each device to perform clustering and object recognition on segments of video data. Video data is easily conceived of as a tensor as collected by the devicesโ€™ cameras. Since the data can be significantly reduced in size while still retaining enough information to perform the needed tasks, this operation can be performed on the deviceโ€™s limited hardware. However, the machine alone should likely never be able to make the determination that a given segment of video represents a valid target. Once a segment of video data has been identified as significant, these reduced size measurements can be transmitted, and the scene reconstructed so that a human operator can view the scene and make a determination. In this way, one human could potentially supervise many weapons at once, and the system would not be onerous with constant, high volumes of data needing constant transmission back to nodes for computing and storage. 3 1.2.3 Example, Hypothetical and Benign Consider another hypothetical. 
Suppose a firm has designed a method that can, after some training and tuning, reliably identify sections of a TV broadcast which are live sports and sections of the broadcast which are commercial breaks. Moreover, this procedure relies only on pairwise distance information taken from the streaming video and audio of the broadcast, and the categorization can be accomplished, for appropriately measured (and compressed) data, on an inexpensive computer in milliseconds. Given this situation, a large chain of sports bars would be able to deploy the method and accompanying hardware at each of their establishments, across many broadcasts, many sports and many regions, to locally identify whenever an incoming TV broadcast is on commercial break, and automate switching their patron-facing viewscreens to display some other chosen video; for example, the sports bar chain might sell (their own) ads, or play a muted version of the source TV broadcast. The compressed data can be retained, and recovery performed for quality assurance and possible tuning.

1.3 Organization and Contribution

The rest of this work is organized in the following way. First, in Section 1.4.1, we will lay out definitions, fix notation, and state some fundamental results in numerical linear algebra and high dimensional probability that will be useful throughout this work. The final Section 1.5 of Chapter 1 will describe the measurement framework as well as outline its development and relationship to other research on these topics.

Chapter 2 will concern what was our first specialization of the framework and how we interfaced it with prior work that concerned a recovery procedure known as Tensor Iterative Hard Thresholding (TIHT) via a concept known as the Tensor Restricted Isometry Property (TRIP). We will state the main results involving TRIP and TIHT in Section 2.1 and describe the theoretical and practical advantages of this specialization, as well as its limitations in the context of our empirical findings. The proofs of the main theoretical contents of this chapter, as well as the numerical experiments and findings, first appeared in our work [23]. In that work I was principally responsible for the numerical experiments, comparisons, observations and discussion regarding the memory footprint, whereas Michael Perlmutter and Elizaveta Rebrova produced most of the proofs for the main results. Since a critique of the main ideas of that work is germane to the development of the sequel, technical proofs for this chapter are placed in the appendix to that chapter, Section 2.4.

The desire to address these critiques of the main findings as described and discussed in Chapter 2 led us to consider a new specialization of the framework, which we coin the Leave-One-Out alternative, for reasons that will be made evident in Chapter 3. Chapter 3 describes the Leave-One-Out variations of the framework and the paired recovery algorithm. In Section 3.2 we conduct a first-of-its-kind analysis of the method and state and prove recovery guarantees that demonstrate the method has an overall sub-linear dependence on the total size of the tensor data. We also provide experiments on simulated data as well as video data to show how different choices and parameters in the problem contribute to trade-offs of interest. The results of this chapter first appear in our work [24]. In that work I was responsible for most of the proof writing and the numerical experiments.
Technical proofs adapted from other research is included in that chapterโ€™s Appendix, since these involved important but minor specializations to our task and may distract from the main findings. Finally, in Chapter 4 we present a novel method for completing tensors that relies on a (mildly) adaptive sampling procedure and prove that in the noiseless setting this method can exactly complete a low-rank tensor. Sampling can be viewed as a very particular type of measurement operator, and though this sampling procedure does not strictly fit into the framework we describe in Section 1.5, its design is very much in the same spirit. We conclude that chapter again with a number of numerical experiments that demonstrate the main findings as well show empirically that the method can succeed in the presence of noise, as well as a number of applications of the method on real world data-sets. The initial results of this chapter appeared in our work [25], the main result, its proof and the empirical findings were primarily my contribution. We will need to set notation and define common terms and some base results in order to deal with the subject matter. 5 1.4 Preliminaries 1.4.1 Tensor and Measurement Preliminaries Definition 1.1. A ๐‘‘-mode or order-๐‘‘ tensor (or ๐‘‘-th order) is a ๐‘‘ dimensional array of complex values, written as X โˆˆ C๐‘›1ร—๐‘›2ร—ยทยทยทร—๐‘›๐‘‘ A ๐‘‘-mode tensorโ€™s entries are indexed by a vector i โˆˆ [๐‘›1] ร— [๐‘›2] ร— ยท ยท ยท ร— [๐‘›๐‘‘] where (X)i = ๐‘ฅi = ๐‘ฅ๐‘–1,...,๐‘–๐‘‘ โˆˆ C Example 1.1 (1 and 2 mode tensors). (i) A 1-mode tensor, x โˆˆ C๐‘› is a vector with entries ๐‘ฅ๐‘– โˆˆ C, ๐‘– โˆˆ [๐‘›]. We will denote 1-mode tensors, vectors, with bolded lowercase letters. (ii) A 2-mode tensor, ๐ด โˆˆ C๐‘›1ร—๐‘›2 is a matrix with entries ๐‘Ž๐‘–1,๐‘–2 โˆˆ C for ๐‘– โˆˆ [๐‘›1], ๐‘–2 โˆˆ [๐‘›2]. We denote 2-mode tensors (matrices) with capital letters. We introduce some terminology that will be useful when describing tensors Definition 1.2 (Fiber). Fibers are 1-dimensional subsets of a ๐‘‘-mode tensor. They are formed by fixing ๐‘‘ โˆ’ 1 of the dimensions and then ranging over all indices in the remaining dimension. So for any ๐‘˜ โˆˆ [๐‘‘], and X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ then a ๐‘˜-mode fiber would be a vector x โˆˆ C๐‘›๐‘˜ where indices ๐‘–1, . . . , ๐‘–๐‘˜โˆ’1, ๐‘–๐‘˜+1, . . . , ๐‘–๐‘‘ are fixed Using Matlab notation (X)๐‘–1,...,๐‘–๐‘˜โˆ’1,:,๐‘–๐‘˜+1,...,๐‘–๐‘‘ = x๐‘–1,...,๐‘–๐‘˜โˆ’1,:,๐‘–๐‘˜+1,...,๐‘–๐‘‘ So for example, given a matrix ๐‘‹ โˆˆ C๐‘›1ร—๐‘›2 then ๐‘‹๐‘–,: = x๐‘–,: โˆˆ C๐‘›2 is a mode-2 fiber (i.e. row). A mode-1 fiber, x:, ๐‘— is a column of the matrix. See Figure 1.1 for a schematic depiction of these terms. Definition 1.3 (Slice). A matrix formed by varing 2 indices and fixing all other indices of a tensor. That is, suppose ๐‘—, ๐‘˜ โˆˆ [๐‘‘] where ๐‘— โ‰  ๐‘˜ then ๐‘‹ = X๐‘–1,...,๐‘– ๐‘— โˆ’1,:,๐‘– ๐‘—+1,...,๐‘–๐‘˜โˆ’1,:,๐‘–๐‘˜+1,...,๐‘–๐‘‘ โˆˆ C๐‘› ๐‘— ร—๐‘›๐‘˜ Definition 1.4 (Sub-tensor). A ๐‘˜-mode sub-tensor of a ๐‘‘-mode tensor (๐‘˜ โ‰ค ๐‘‘) is denoted by a vector of length ๐‘‘ โˆ’ ๐‘˜ of indices and a set of ๐‘˜ distinct indices from the set [๐‘‘]. That is given 6 (a) Fiber (b) Slice Figure 1.1 A schematic of X โˆˆ R4ร—4ร—4 tensor where the fiber x2,2,: is highlighted in (๐‘Ž) and a frontal slice ๐‘‹:,:,1 is highlighted in ๐‘. distinct ๐‘—1, . . . 
, ๐‘—๐‘˜ โˆˆ [๐‘‘] and vector i โˆˆ (cid:203) ๐‘–โ‰  ๐‘—โ„“ [๐‘›๐‘–] of length ๐‘‘ โˆ’ ๐‘˜, we have a ๐‘˜-mode sub-tensor where X๐‘—1,..., ๐‘—๐‘˜,i โˆˆ C๐‘› ๐‘— 1 ร—ยทยทยทร—๐‘› ๐‘—๐‘˜ Using this sub-tensor notation then an entry of a mode-๐‘˜ fiber is a sub-tensor X๐‘—,i where ๐‘— โˆˆ [๐‘›๐‘˜ ] and i โˆˆ [๐‘›1] ร— . . . [๐‘›๐‘˜โˆ’1] ร— [๐‘›๐‘˜+1] ร— ยท ยท ยท ร— [๐‘›๐‘‘]. There are (cid:206)โ„“โ‰ ๐‘˜ ๐‘›โ„“ different mode-๐‘˜ fibers, one for each possible i. An entry of a slice then is denoted Xโ„“,๐‘˜,i โˆˆ C๐‘›โ„“ ร—๐‘›๐‘˜ . There are (cid:206) ๐‘—โ‰ โ„“,๐‘˜ ๐‘› ๐‘— slices of dimension ๐‘›โ„“ ร— ๐‘›๐‘˜ . Next we will discuss reshaping operators - these all involve different possible ways of changing the dimensions and number of modes of tensors so that they have the same number of entries but are different shapes. Definition 1.5 (Vectorization). For X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ , vec(X) = x where x โˆˆ C(cid:206)๐‘‘ ๐‘˜=1 ๐‘›๐‘˜ . Entries are typically ordered lexicographically by their indices, i.e. the first index sorts all values before the second and so on. However, usually the particular order that entries are moved to between 7 (a) (b) Figure 1.2 A schematic of X โˆˆ R4ร—4ร—4 tensor where the mode-1 fibers x:,๐‘–, ๐‘— are arranged as the columns of the corresponding flattening. The intensity of the color corresponds to the lexiographic order of the fibers. reshapings does not matter, so long as it is consistent across the operations. Definition 1.6 (Mode-๐‘˜ Unfolding). Also known as flattening or matricization. For X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ the ๐‘˜-mode flattening is a matrix ๐‘‹(๐‘˜) โˆˆ C๐‘›๐‘˜ร—(cid:206) ๐‘—โ‰ ๐‘˜ ๐‘› ๐‘— . We have effectively made the ๐‘˜-th dimension into the rows of the matrix, and the columns are then the different mode-๐‘˜ fibers, i.e. the columns are the fibers X๐‘˜,i. When it will not cause any confusion from context, we may omit the decoration (๐‘˜) and simply write the flattening as ๐‘‹. Definition 1.7 (Reshaping). We can reshape a ๐‘‘-mode tensor into any other ๐‘“ -mode tensor with a reshaping operation ๐‘… : C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ โ†’ C๐‘š1ร—ยทยทยทร—๐‘š ๐‘“ provided ๐‘‘ (cid:214) ๐‘“ (cid:214) ๐‘šโ„“ ๐‘› ๐‘— = ๐‘—=1 ๐‘˜-mode flattening and vectorization are in this view two particular reshaping operations. โ„“=1 What is the underlying vector space we can use to study tensors? To answer that, we consider the following norm, inner-product, and operations on tensors: Definition 1.8 (2-norm of a Tensor). For X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ then given any ๐‘˜ โˆˆ [๐‘‘] we have โˆฅXโˆฅ2 2 = โˆฅ ๐‘‹(๐‘˜) โˆฅ2 ๐น = โˆฅxโˆฅ2 2 = |๐‘ฅi|2 โˆ‘๏ธ iโˆˆ๐ผ where ๐ผ = [๐‘›1] ร— ยท ยท ยท ร— [๐‘›๐‘‘] 8 Definition 1.9 (Inner-product). For compatible tensors X, Y โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ then โŸจX, YโŸฉ = โˆ‘๏ธ iโˆˆ๐ผ ๐‘ฅi๐‘ฆ i = โŸจx, yโŸฉ that is, the inner-product of the vectorization of the tensors. Note that this is equivalent to โŸจ๐‘‹(๐‘˜), ๐‘Œ(๐‘˜)โŸฉ๐ป๐‘† = Trace(๐‘‹(๐‘˜) (๐‘Œ(๐‘˜))โˆ—), โˆ€๐‘˜ โˆˆ [๐‘‘], the Hilbert-Schmidt inner product for matrices Addition and scalar multiplication work component-wise, i.e. (X + Y)i = ๐‘ฅi + ๐‘ฆi (๐›ผX)i = ๐›ผ๐‘ฅi, โˆ€๐›ผ โˆˆ C 1.4.2 Modewise and Matrix Products The modewise product will feature prominently in our introduction to the measurement frame- work Section 1.5, as well as Chapters 2 and 3. We state it here as well as some basic algebraic identities related to it. 
Certain matrix products will also quickly become useful when describing tensors and their decompositions. Definition 1.10 (Modewise Product). The ๐‘˜-mode product of X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ and ๐‘ˆ โˆˆ C๐‘š๐‘˜ร—๐‘›๐‘˜ for ๐‘˜ โˆˆ [๐‘‘] is a tensor in C๐‘›1ร—ยทยทยทร—๐‘›๐‘˜โˆ’1ร—๐‘š๐‘˜ร—๐‘›๐‘˜+1ร—ยทยทยทร—๐‘›๐‘‘ element-wise defined as (X ร—๐‘˜ ๐‘ˆ)๐‘–1,...,๐‘–๐‘˜โˆ’1, ๐‘—,๐‘–๐‘˜+1,...,๐‘–๐‘‘ = ๐‘›๐‘˜โˆ‘๏ธ ๐‘–๐‘˜=1 ๐‘ฅ๐‘–1,๐‘–2,...,๐‘–๐‘‘ ๐‘ข ๐‘—๐‘–๐‘˜ (1.1) This is usefully identified with the following vectorized expression (X ร—๐‘˜ ๐‘ˆ)๐‘–1,...,๐‘–๐‘˜โˆ’1,:,๐‘–๐‘˜+1,...,๐‘–๐‘‘ = ๐‘ˆX๐‘–1,...,๐‘–๐‘˜โˆ’1,:,๐‘–๐‘˜+1,...,๐‘–๐‘‘ In other words, the ๐‘˜-mode product applies the matrix ๐‘ˆ to all the mode-๐‘˜ fibers of the tensor X. For example, suppose X โˆˆ C5ร—3ร—2 and ๐‘ˆ โˆˆ C4ร—5. Then X ร—1 ๐‘ˆ โˆˆ C4ร—3ร—2 where each of the mode-1 fibers is now the product of ๐‘ˆX:, ๐‘—,โ„“, for some ๐‘— โˆˆ [3], โ„“ โˆˆ [2]. In the 2-mode tensor case, i.e., matrices, the usual matrix-matrix multiplication can be under- stood in terms of 1-mode tensor product. 9 This also easily establishes that we can identify a mode-wise product with the matrix-matrix multiplication of the matrix with the corresponding ๐‘˜-mode flattening of the tensor. (cid:104) ๐ด๐ต = ๐ด b1 b2 . . . b๐‘› (cid:105) (cid:104) = ๐ดb1 ๐ดb2 . . . ๐ดb๐‘› (cid:105) = ๐ต ร—1 ๐ด Note, modewise products are bilinear. We note some other useful properties. Lemma 1.1 (Properties of ๐‘˜-mode products). Let X, Y โˆˆ C๐‘›1,...,๐‘›๐‘‘ , ๐›ผ, ๐›ฝ โˆˆ C, ๐‘ˆโ„“, ๐‘‰โ„“ โˆˆ C๐‘šโ„“ ร—๐‘›โ„“ , โˆ€โ„“ โˆˆ [๐‘‘]. Then 1. (๐›ผX + ๐›ฝY) ร— ๐‘— ๐‘ˆ ๐‘— = ๐›ผ(X ร— ๐‘— ๐‘ˆ ๐‘— ) + ๐›ฝ(Y ร— ๐‘— ๐‘ˆ ๐‘— ) 2. that is, ๐‘˜-mode product is bilinear. 3. If ๐‘— โ‰  โ„“ then (X ร— ๐‘— ๐‘ˆ ๐‘— ) ร—โ„“ ๐‘‰โ„“ = (X ร—โ„“ ๐‘‰โ„“) ร— ๐‘— ๐‘ˆ ๐‘— 4. If ๐‘Š โˆˆ C๐‘ร—๐‘š ๐‘— then (cid:0)X ร— ๐‘— ๐‘ˆ ๐‘— (cid:1) ร— ๐‘— ๐‘Š = X ร— ๐‘— (๐‘Š๐‘ˆ ๐‘— ) โˆˆ C๐‘›1ร—๐‘› ๐‘— โˆ’1ร—๐‘ร—๐‘› ๐‘—+1ร—...๐‘›๐‘‘ Proof. 1. For any ๐‘–1 โˆˆ ๐‘›1, . . . , ๐‘– ๐‘—โˆ’1 โˆˆ ๐‘› ๐‘—โˆ’1, ๐‘– ๐‘—+1 โˆˆ ๐‘› ๐‘—+1, . . . , ๐‘–๐‘‘ โˆˆ ๐‘›๐‘‘ we have (cid:2)(๐›ผX + ๐›ฝY) ร— ๐‘— ๐‘ˆ(cid:3) ๐‘–1,...,๐‘– ๐‘— โˆ’1,:,๐‘– ๐‘—+1,...,๐‘–๐‘‘ = ๐‘ˆ (๐›ผX + ๐›ฝY)๐‘–1,...,๐‘– ๐‘— โˆ’1,:,๐‘– ๐‘—+1,...,๐‘–๐‘‘ = ๐›ผ๐‘ˆX๐‘–1,...,๐‘– ๐‘— โˆ’1,:,๐‘– ๐‘—+1,...,๐‘–๐‘‘ + ๐›ฝ๐‘ˆY๐‘–1,...,๐‘– ๐‘— โˆ’1,:,๐‘– ๐‘—+1,...,๐‘–๐‘‘ = (X ร— ๐‘— ๐‘ˆ + ๐›ฝ(Y ร— ๐‘— ๐‘ˆ) 2. Similar to 1. 3. Consider the element-wise definition (1.1) ((X ร— ๐‘— ๐‘ˆ ๐‘— ) ร—โ„“ ๐‘‰โ„“)๐‘–1,...,๐‘,๐‘– ๐‘—+1,...,๐‘ž,๐‘–โ„“+1,...,๐‘–๐‘‘ = = ๐‘›๐‘ž โˆ‘๏ธ ๐‘› ๐‘ โˆ‘๏ธ ๐‘–๐‘ž=1 ๐‘› ๐‘ โˆ‘๏ธ ๐‘– ๐‘=1 ๐‘›๐‘ž โˆ‘๏ธ ๐‘– ๐‘=1 ๐‘–๐‘ž=1 ๐‘ฅ๐‘–1,๐‘–2,...,๐‘–๐‘‘ ๐‘ข ๐‘๐‘– ๐‘ ๐‘ฃ๐‘ž๐‘–๐‘ž ๐‘ฅ๐‘–1,๐‘–2,...,๐‘–๐‘‘ ๐‘ข ๐‘,๐‘– ๐‘ ๐‘ฃ๐‘ž,๐‘–๐‘ž = ((X ร—โ„“ ๐‘‰โ„“) ร— ๐‘— ๐‘ˆ ๐‘— )๐‘–1,...,๐‘,๐‘– ๐‘—+1,...,๐‘ž,๐‘–โ„“+1,...,๐‘–๐‘‘ 10 4. Using the same identity as in 3, the proof is similar Note that the run-time complexity is the same regardless of the order one applies the ๐‘˜- or ๐‘—-mode products. Definition 1.11 (Kronecker Product). The Kronecker product of two matrices ๐ด ร— C๐‘š1ร—๐‘›1 and ๐ต โˆˆ C๐‘š2ร—๐‘›2 is a matrix ๐ด โŠ— ๐ต := ๐‘Ž1,1๐ต ๐‘Ž2,1๐ต ... ๐‘Ž1,2๐ต . . . ๐‘Ž2,2๐ต . . . . . . ... ๐ต ๐ต ๐‘Ž1,๐‘›1 ๐‘Ž2,๐‘›1 ... ๐‘Ž๐‘š1,1๐ต ๐‘Ž๐‘š1,2๐ต . . . 
๐‘Ž๐‘š1,๐‘›1 ๐ต ๏ฃฎ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฐ . ๏ฃน ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃป (1.2) where ๐ด โŠ— ๐ต โˆˆ C๐‘š1๐‘š2ร—๐‘›1๐‘›2 Definition 1.12 (Khatri-Rao Product). The Khatri-Rao product of two matrices is the matrix that results from computing the Kronecker product of their matching columns. That is, for ๐ด โˆˆ R๐‘š1ร—๐‘› and ๐ต โˆˆ R๐‘š2ร—๐‘›, their Khatri-Rao product is the matrix ๐ด โŠ™ ๐ต โˆˆ R๐‘š1๐‘š2ร—๐‘› defined by ๐ด โŠ™ ๐ต := (cid:104) a1 โŠ— b1 a2 โŠ— b2 . . . a๐‘› โŠ— b๐‘› (cid:105) . We also use a so-called row-wise Khatri-Rao product (sometimes called the face-splitting product) of ๐ด โˆˆ R๐‘šร—๐‘›1 and ๐ต โˆˆ R๐‘šร—๐‘›2, denoted by ๐ด โ€ข ๐ต โˆˆ R๐‘šร—๐‘›1๐‘›2, and defined by ( ๐ด โ€ข ๐ต)๐‘‡ := ๐ด๐‘‡ โŠ™ ๐ต๐‘‡ . Lemma 1.2. Let X โˆˆ C๐‘›1,...,๐ผ๐‘‘ , ๐‘ˆโ„“ โˆˆ C๐‘šโ„“ ร—๐‘›โ„“ then 1. (cid:0)X ร— ๐‘— ๐‘ˆ ๐‘— (cid:1) ( ๐‘—) = ๐‘ˆ ๐‘— X( ๐‘—) โˆˆ C๐‘š ๐‘— ร—(cid:206)โ„“โ‰  ๐‘— ๐‘›โ„“ 2. (X ร—1 ๐‘ˆ1 ร—2 ๐‘ˆ2 ยท ยท ยท ร—๐‘‘ ๐‘ˆ๐‘‘) [ ๐‘—] = ๐‘ˆ ๐‘— X[ ๐‘—] (cid:0)๐‘ˆ๐‘‘ โŠ— ๐‘ˆ๐‘‘ โŠ— ยท ยท ยท โŠ— ๐‘ˆ ๐‘—+1 โŠ— ๐‘ˆ ๐‘—โˆ’1 โŠ— ยท ยท ยท โŠ— ๐‘ˆ1 (cid:1)๐‘‡ Proof. This result follows from the definition of reshaping operators, which we detail below, assuming a column-major convention, and by using the elementwise definition in (1.1). In order for the identity seen in Lemma 1.2 to hold, we need specify our precise matricization convention. In our convention, the entry at location (๐‘–1, . . . , ๐‘–๐‘‘) in X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ is located in the 11 matrix as entry (๐‘– ๐‘, ๐‘—) where where ๐‘— = 1 + ๐‘‘ โˆ‘๏ธ ๐‘˜=1 ๐‘˜โ‰ ๐‘ (๐‘–๐‘˜ โˆ’ 1)๐ฝ๐‘˜ ๐‘˜โˆ’1 (cid:214) ๐‘›๐‘š ๐ฝ๐‘˜ = ๐‘š=1 ๐‘šโ‰ ๐‘ set ๐ฝ๐‘˜ = 1 if the index of the product above is empty. These conventions can become cumbersome to state, however they are made clear and more natural by an example. Example 1.2 (3-mode). A similar example appears in [37]: Consider a tensor X โˆˆ R3ร—4ร—2, the frontal slices are as follows 1 4 7 10 13 16 19 22 (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) The three different unfoldings are then as follows. Consider ๐‘ = 1, 3 6 9 12 2 5 8 11 , X:,:,2 = X:,:,1 = (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) 14 17 20 23 15 18 21 24 (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) Note ๐‘‹(1) = (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) 1 4 7 10 13 16 19 22 2 5 8 11 14 17 20 23 3 6 9 12 15 18 21 24 (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) ๐ฝ1 = 1โˆ’1 (cid:214) ๐‘š=1 ๐‘šโ‰ 1 (๐‘›๐‘š) = 1, ๐ฝ2 = ๐‘›๐‘š = 1, ๐ฝ3 2โˆ’1 (cid:214) ๐‘š=1 ๐‘šโ‰ 1 = ๐‘›๐‘š = 4 3โˆ’1 (cid:214) ๐‘š=1 ๐‘šโ‰ 1 Now to see how to locate a particular entry, note that X2,3,2 = 20, so in our unfolding (X( ๐‘))(๐‘–, ๐‘—) we can simply copy the index that corresponds to the ๐‘-th mode, i.e. the first here, ๐‘– = 2. To find the column, compute ๐‘— = 1 + (cid:205)๐‘‘ (๐‘–๐‘˜ โˆ’ 1)๐ฝ๐‘˜ = 1 + (๐‘–2 โˆ’ 1)๐ฝ2 + (๐‘–3 โˆ’ 1)๐ฝ3 = 1 + (2) (1) + (1) (4) = 7. So our entry with value 20 is in location (2, 7). 
๐‘˜=1 ๐‘˜โ‰ ๐‘ 12 Consider ๐‘ = 2, X (1) = 1 3 5 2 4 6 13 14 15 (cid:170) (cid:174) (cid:174) 16 17 18 (cid:174) (cid:174) (cid:174) 19 20 21 (cid:174) (cid:174) (cid:174) 10 11 12 22 23 24 (cid:172) 7 8 9 (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) Again, to locate our entry,X2,3,2 = 20, we return the index on the ๐‘-th mode as our row, ๐‘– = 3, and repeat the same calculation for ๐‘—: Note 1โˆ’1 (cid:214) ๐ฝ1 = (๐‘›๐‘š) = 1, ๐ฝ2 = 2โˆ’1 (cid:214) ๐‘›๐‘š = 1, ๐ฝ3 ๐‘š=1 ๐‘šโ‰ 2 This time we leave out the ๐ฝ2 factor: ๐‘— = 1 + (cid:205)๐‘‘โˆ’1 ๐‘˜=1 ๐‘˜โ‰ ๐‘› ๐‘š=1 ๐‘šโ‰ 2 entry with value 20 is located in (3, 5). = ๐‘›๐‘š = 3 3โˆ’1 (cid:214) ๐‘š=1 ๐‘šโ‰ 2 (๐‘–๐‘˜ โˆ’ 1)๐ฝ๐‘˜ = 1 + (1) (1) + (1) (3) = 5. So our Consider ๐‘ = 3, ๐‘‹(3) 1 2 3 4 5 6 7 8 . . . 13 14 15 16 17 18 19 20 . . . (cid:169) (cid:173) (cid:173) (cid:171) (cid:170) (cid:174) (cid:174) (cid:172) Again, to locate our entry,X2,3,2 = 20, we return the index on the ๐‘-th mode as our row, ๐‘– = 1, and repeat the same calculation for ๐‘—: Note ๐ฝ1 = 1โˆ’1 (cid:214) ๐‘š=1 ๐‘šโ‰ 3 (๐‘›๐‘š) = 1, ๐ฝ2 = 2โˆ’1 (cid:214) ๐‘›๐‘š = 3, ๐ฝ3 = ๐‘›๐‘š = 12 3โˆ’1 (cid:214) ๐‘š=1 ๐‘šโ‰ 3 This time we leave out the ๐ฝ3 factor: entry with value 20 is located in (2, 8). ๐‘š=1 ๐‘šโ‰ 3 ๐‘— = 1 + (cid:205)๐‘‘ ๐‘˜=1 ๐‘˜โ‰ ๐‘› (๐‘–๐‘˜ โˆ’ 1)๐ฝ๐‘˜ = 1 + 1(1) + 2(3) = 8. So our 1.4.3 Decomposition and Low Rank Approximation Our next set of definitions concerns ways of factorizing tensors. In the exact case, these describe how we can compute or represent the tensor as a product and sum of appropriately constrained sub-tensors of various types. This is useful for many reasons, such as interpreting the data, storing it, or performing computations more efficiently. Approximation will also be our concern. That 13 is, for situations where exact factorization is not required (or even possible) we wish to find and understand methods that can be used to find approximate tensors that do admit a certain factorization and understand the trade-offs such has how efficiently it is to compute or how accurately we can expect it to resemble the original tensor. To motivate why decompositions are of interest, and provide an example for why there is a need for tensor specific decompositions, we include an example of the storage benefit achieved by applying a standard linear algebra startegy. By tensor specific, we mean doing something other than reshaping tensors into familiar objects (vectors and matrices) and reusing the concepts from basic linear algebra. Suppose we have ๐‘ž tensors of size C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ , X1, . . . , X๐‘ž. We will compress, or reduce the dimension, by performing Principal Component Analysis (PCA) on the vectorized tensors. Our goal then in this case is to solve the minimization problem. Assuming the vectorized tensors have been centered, this amounts to ๐‘ž โˆ‘๏ธ ๐‘—=1 min ๐‘„โˆˆR(cid:206)๐‘‘ ๐‘˜=1 ๐‘„ ๐ป ๐‘„=๐ผ ๐‘›๐‘˜ ร—๐‘š โˆฅx ๐‘— โˆ’ ๐‘„๐‘„๐ปx ๐‘— โˆฅ2 where vec(X๐‘— ) = x ๐‘— โˆˆ C(cid:206)๐‘‘ ๐‘˜=1 ๐‘›๐‘˜ . We can use the Singular Value Decomposition (SVD) of the following data matrix (cid:104) x1 x2 . . . x๐‘ž (cid:105) = ๐‘ˆฮฃ๐‘‰ โˆ— โˆˆ C๐‘žร—(cid:206)๐‘‘ ๐‘˜=1 ๐‘›๐‘˜ As is well known, and typical for computing projections using PCA, this amounts to storing and using the truncated SVD of the data matrix. 
Once we have the (top) singular vectors, we can tensorize their outer product using the inverse of our vectorizing reshaping operation. That is ๐‘š โˆ‘๏ธ ๐œŽ๐‘—,๐‘˜ T๐‘˜ X๐‘— โ‰ˆ ๐‘˜=1 where T๐‘˜ are the tensors obtained from taking the relevant indices from the sum of outer product of the singular vectors in ๐‘ˆ and ๐‘‰, scaled by the singular vectors. What does this achieve in terms of storage? The space required for our original collection of tensors X1, . . . , X๐‘ž โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ is O (๐‘ž (cid:206)๐‘‘ ๐‘˜=1 ๐‘›๐‘˜ ). After PCA, we need keep ๐‘š basis tensors of the 14 same dimension C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ and our coordinates or principal scores will also need to be stored and there are ๐‘š๐‘ž of these. So the space is O (๐‘š๐‘ž + ๐‘š (cid:206)๐‘‘ the dependence on (cid:206)๐‘‘ ๐‘›๐‘˜ is unchanged. Additionally, thereโ€™s no obvious interpretable structure ๐‘›๐‘˜ ) which may be unsatisfactory because ๐‘˜=1 ๐‘˜=1 to the basis tensors T๐‘˜ that relates to the fact the original data was multi-modal. This motivates us then to look to another approach for decomposing (and therefore saving on storage among other things) tensors. Definition 1.13 (Rank one Tensor). Given ๐‘‘-vectors x ๐‘— โˆˆ C๐‘› ๐‘— for ๐‘— โˆˆ [๐‘‘], the outer-product X = ๐‘‘ โƒ ๐‘—=1 x ๐‘— โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ has entries given the product of corresponding entries of the vectors, i.e. X๐‘–1,...,๐‘–๐‘‘ = (cid:19) x ๐‘— (cid:18) ๐‘‘ โƒ ๐‘—=1 ๐‘–1,...,๐‘–๐‘‘ = (x1)๐‘–1 (x2)๐‘–2 . . . (x๐‘‘)๐‘–๐‘‘ any ๐‘‘-mode tensor where it is possible to write it as such an outer product of ๐‘‘ vectors is a rank one tensor. Note that storing a rank one tensor can be accomplished by storing only the vector components, rather than all entries, and computing the entries (as required) using the factors. This definition in the 2-mode case is the familiar rank one matrix case, for u โˆˆ C๐‘š, v โˆˆ C๐‘ ๐ด = u โ—ฆ v = uvโˆ— then matrix ๐ด โˆˆ C๐‘šร—๐‘ is a rank one matrix. Definition 1.14 (CP decomposition). We say X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ is a CP rank-๐‘Ÿ tensor if it can be written as the sum of minimally ๐‘Ÿ rank one tensors. That is X = ๐‘Ÿ โˆ‘๏ธ โ„“=1 (cid:18) ๐‘‘ โƒ ๐‘—=1 (cid:19) a(โ„“) ๐‘— = ๐‘Ÿ โˆ‘๏ธ โ„“=1 ๐’‚ (โ„“) 1 โ—ฆ ยท ยท ยท โ—ฆ ๐’‚ (โ„“) ๐‘‘ for vectors ๐’‚ (โ„“) by ๐ด(๐‘˜) = (cid:104) ๐‘˜ โˆˆ R๐‘›๐‘˜ for ๐‘˜ โˆˆ [๐‘‘] and โ„“ โˆˆ [๐‘Ÿ]. For convenience, define factor matrices ๐ด(๐‘˜) โˆˆ R๐‘›๐‘˜ร—๐‘Ÿ for ๐‘˜ โˆˆ [๐‘‘] so that the entries of the tensor are given by ๐’‚ (1) ๐‘˜ ยท ยท ยท ๐’‚ (๐‘Ÿ) ๐‘˜ (cid:105) X๐‘–1,...,๐‘–๐‘‘ = ๐‘Ÿ โˆ‘๏ธ ๐‘‘ (cid:214) โ„“=1 ๐‘˜=1 ๐ด(๐‘˜) ๐‘–๐‘˜,โ„“. 15 At times it will be convenient to require the CP decompositionโ€™s component vectors to be normalized to have norm of one, and then weight correspondingly each rank one term with a scalar. That is, X = ๐‘Ÿ โˆ‘๏ธ โ„“=1 (cid:18) ๐‘‘ โƒ ๐‘—=1 (cid:19) ๐›ผ ๐‘— a(โ„“) ๐‘— where (cid:13) (cid:13)a(โ„“) (cid:13) ๐‘— (cid:13) (cid:13) (cid:13)2 = 1 for all ๐‘— โˆˆ [๐‘‘], โ„“ โˆˆ [๐‘Ÿ]. and ๐›ผ ๐‘— โˆˆ C Here we also include a Lemma which relates how modewise products interact with rank-1 components. Lemma 1.3. 
Suppose we have a rank one tensor X ∈ C^{n_1×···×n_d} with components x_j ∈ C^{n_j} for all j ∈ [d], and suppose U_k ∈ C^{m_k×n_k} for some k ∈ [d]. Then

(x_1 ∘ ··· ∘ x_d) ×_k U_k = x_1 ∘ ··· ∘ x_{k-1} ∘ (U_k x_k) ∘ x_{k+1} ∘ ··· ∘ x_d.

Proof. Note that the k-mode fibers of the tensor x_1 ∘ ··· ∘ x_d are all scalar multiples of the same vector, x_k. That is, the k-mode fiber indexed by (ℓ_1, ..., ℓ_{k-1}, ℓ_{k+1}, ..., ℓ_d) is

( ∏_{j≠k} (x_j)_{ℓ_j} ) x_k,

where ∏_{j≠k} (x_j)_{ℓ_j} is a scalar. The identity now follows from the definition of the mode-k product: each column of the mode-k unfolding is a scalar multiple of the same vector, and scalars commute with matrix-vector multiplication.

As a remark, many of the correspondences and equivalences involving rank in the matrix case do not extend to tensors. For example, the rank of a matrix is always equal to the dimension of the space spanned by its columns, and also to the dimension of the space spanned by its rows. However, consider the following three-mode tensor X ∈ R^{4×2×2}, given by its two frontal slices:

X_{:,:,1} = [ 1 0 ; 0 1 ; 0 0 ; 0 0 ],   X_{:,:,2} = [ 0 0 ; 0 0 ; 1 0 ; 0 1 ].

Naturally, the four mode-1 fibers span a four dimensional space, whereas the eight different mode-2 fibers span a two dimensional space. In general there is no simple correspondence between the dimensions of the spaces spanned by the different sets of fibers of a tensor.

There are other complications that arise with notions of rank in the case of higher order tensors. In the case of matrices, a rank 3 matrix has a unique, optimal rank 2 approximation, which can be obtained by truncating its SVD. In the case of tensors, it is entirely possible for a rank 3 tensor to be approximated arbitrarily closely by rank 2 tensors, as the simple example below demonstrates:

X_{:,:,1} = [ 0 1 ; 1 0 ],   X_{:,:,2} = [ 1 0 ; 0 0 ].

Now consider

A_{:,:,1} = [ ε^{-1} 1 ; 1 ε ],   A_{:,:,2} = [ 1 ε ; ε ε^2 ];   B_{:,:,1} = [ -ε^{-1} 0 ; 0 0 ],   B_{:,:,2} = [ 0 0 ; 0 0 ].

Both A and B have rank one, so their sum has at most rank two, but we can let A + B get within any distance of X by sending ε → 0; so in this instance, there is no best rank 2 approximation of the rank 3 tensor X.
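The border rank phenomenon above is easy to verify numerically. The short NumPy check below (an illustration added here, not part of the original text) builds the rank one tensors A and B for decreasing ε and confirms that the error ‖X − (A + B)‖ decays like ε, even though X has rank 3.

import numpy as np

def outer3(u, v, w):
    # rank one 3-mode tensor u ∘ v ∘ w
    return np.einsum('i,j,k->ijk', u, v, w)

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# X = e1∘e1∘e2 + e1∘e2∘e1 + e2∘e1∘e1 has CP rank 3.
X = outer3(e1, e1, e2) + outer3(e1, e2, e1) + outer3(e2, e1, e1)

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    u = e1 + eps * e2
    A = (1.0 / eps) * outer3(u, u, u)        # rank one
    B = -(1.0 / eps) * outer3(e1, e1, e1)    # rank one
    err = np.linalg.norm(X - (A + B))        # tensor 2-norm of the difference
    print(f"eps = {eps:.0e}, ||X - (A+B)|| = {err:.2e}")
# The error is O(eps): the rank (at most) two tensors A + B approach the rank 3 tensor X.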
Indeed, even if there is no issue between so called border rank of a tensor and its exact rank, unlike the case of a matrix, there is not necessarily any straightforward relationship between the best ๐‘Ÿ โˆ’ 1 approximation to that tensor and its exact rank ๐‘Ÿ decomposition in the same way as the SVD and the truncated SVD are related in the case of matrices. 1.4.3.1 The Tucker Decomposition and Low-Rank Approximation Next, we define the Tucker decomposition of tensors, which decomposes a tensor into a core tensor that is multiplied using the modewise product by a factor matrix along each of its modes. Definition 1.15 (Tucker decomposition). Let X โˆˆ R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ . We say that X has a Tucker decom- position of rank r = (๐‘Ÿ1, . . . , ๐‘Ÿ๐‘‘) if there exists G โˆˆ R๐‘Ÿ1ร—ยทยทยทร—๐‘Ÿ๐‘‘ , and ๐‘ˆ ๐‘— โˆˆ R๐‘› ๐‘— ร—๐‘Ÿ ๐‘— with orthonormal columns for all ๐‘— โˆˆ [๐‘‘], so that X = G ร—1 ๐‘ˆ1 ร—2 ๐‘ˆ2 ร—3 ยท ยท ยท ร—๐‘‘ ๐‘ˆ๐‘‘ = โˆ‘๏ธ iโˆˆ{(๐‘–1,...,๐‘–๐‘‘) | ๐‘– ๐‘— โˆˆ[๐‘Ÿ ๐‘— ] โˆ€ ๐‘— โˆˆ[๐‘‘]} Gi ยท u(1) ๐‘–1 โ—ฆ u(2) ๐‘–2 โ—ฆ ยท ยท ยท โ—ฆ u(๐‘‘) ๐‘–๐‘‘ , (1.3) 17 where u( ๐‘—) ๐‘– ๐‘— denotes the ๐‘– ๐‘— -th column of ๐‘ˆ ๐‘— , and Gi := G๐‘–1,...,๐‘–๐‘‘ โˆˆ R. We will also use the shorthand X = [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] := G ร—1 ๐‘ˆ1 ร—2 ๐‘ˆ2 ร—3 ยท ยท ยท ร—๐‘‘ ๐‘ˆ๐‘‘ when referring to the rank-r Tucker decomposition of a given rank-r tensor X. There is an equivalent formulation to equation (1.3) using unfoldings of X and of G in terms of matrix-matrix products: ๐‘‹[ ๐‘—] = ๐‘ˆ ๐‘— ๐บ [ ๐‘—] (cid:0)๐‘ˆ๐‘‘ โŠ— ยท ยท ยท โŠ— ๐‘ˆ ๐‘—+1 โŠ— ๐‘ˆ ๐‘—โˆ’1 โŠ— ยท ยท ยท โŠ— ๐‘ˆ1 (cid:1)๐‘‡ (1.4) From this definition several important observations can be made. First, we see that every tensor must admit a Tucker decomposition, since setting the core tensor equal to the full tensor and the factors equal to the appropriately sized identity matrices would ensure the decomposition matched the full tensor exactly, and the rank would in that case be the same as the length of the modes, r = (๐‘›1, . . . , ๐‘›๐‘‘). Naturally when considering efficient ways to store and perform computations with tensors, we will be interested in the cases where the rank along at least one or more modes is less than the original tensorโ€™s side length. And just as in the case of the CP decomposition, what the minimal Tucker rank is and what its relationship is to the exact decomposition is by no means straightforward generically. Second, we can see that the decomposition has no chance of being unique; consider any non-singular matrix ๐‘Œ , and observe: ๐‘‹[ ๐‘—] = ๐‘ˆ ๐‘— ๐บ [ ๐‘—] (cid:0)๐‘ˆ๐‘‘ โŠ— ยท ยท ยท โŠ— ๐‘ˆ ๐‘—+1 โŠ— ๐‘ˆ ๐‘—โˆ’1 โŠ— ยท ยท ยท โŠ— ๐‘ˆ1 (cid:1)๐‘‡ = ๐‘ˆ ๐‘—๐‘Œ๐‘Œ โˆ’1๐บ [ ๐‘—] (cid:0)๐‘ˆ๐‘‘ โŠ— ยท ยท ยท โŠ— ๐‘ˆ ๐‘—+1 โŠ— ๐‘ˆ ๐‘—โˆ’1 โŠ— ยท ยท ยท โŠ— ๐‘ˆ1 (cid:1)๐‘‡ = (๐‘ˆ ๐‘—๐‘Œ )(๐‘Œ โˆ’1๐บ [ ๐‘—]) (cid:0)๐‘ˆ๐‘‘ โŠ— ยท ยท ยท โŠ— ๐‘ˆ ๐‘—+1 โŠ— ๐‘ˆ ๐‘—โˆ’1 โŠ— ยท ยท ยท โŠ— ๐‘ˆ1 (cid:1)๐‘‡ and for this same reason, it is often natural to insist that the factor matrices, ๐‘ˆ๐‘– have orthornormal columns, given a Tucker decomposition. We can always arrive at this convention by means of non-singular transformations in each of the modes, e.g. perform a ๐‘„๐‘… decomposition and have ๐‘… act on the core. There is a specific type of Tucker decomposition that will be of interest, and that is the Higher Order SVD (HOSVD). 
In addition to the convention of having the columns of the factor matrices 18 form an orthonormal basis, in the case of the HOSVD, we impose an additional constraint that all the (๐‘‘ โˆ’ 1)-mode sub-tensors of the core formed by fixing an index along the same mode are orthogonal. Definition 1.16. [Higher Order Singular Value Decomposition (HOSVD)]. The Tucker decomposi- tion of tensor X = [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] is a HOSVD with rank r, when for every i๐‘Ž = (:, :, . . . , ๐‘Ž, . . . , :), i๐‘ = (:, :, . . . , ๐‘, . . . , :), ๐‘Ž, ๐‘ โˆˆ [๐‘Ÿ ๐‘— ], ๐‘Ž โ‰  ๐‘, ๐‘— โˆˆ [๐‘‘] we have that โŸจGi๐‘Ž, Gi๐‘โŸฉ = 0 (1.5) and ๐‘ˆ๐‘˜ have orthornomal columns for all ๐‘˜ โˆˆ [๐‘‘]. See [15] for a thorough description of this particular type of Tucker decomposition, and many of its useful properties. Because of its straightforward way to compute and often good approximation qualities, the HOSVD is frequently used as an initialization for finding Tucker decompositions. A HOSVD of a tensor can be computed using Algorithm 1.1. Algorithm 1.1 Higher Order SVD input : X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ r = (๐‘Ÿ1, ๐‘Ÿ2, . . . , ๐‘Ÿ๐‘‘) the rank of the HOSVD output : ห†X = [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] for ๐‘– โˆˆ [๐‘‘] do # SVDs of flattened tensor Compute SVD of mode- ๐‘— unfolding ๐‘‹( ๐‘—) = ๐‘„ ๐‘— ฮฃ ๐‘—๐‘‰ โˆ— ๐‘— ๐‘ˆ ๐‘— โ† ๐‘„ (๐‘–) ๐‘— [:, : ๐‘Ÿ ๐‘— ] end # Compute Core (cid:62)๐‘‘ G โ† X ๐‘ˆโˆ— ๐‘—=1 ๐‘— โˆˆ C๐‘Ÿ1ร—ยทยทยทร—๐‘Ÿ๐‘‘ Note that for each unfolding, the full SVD has form ๐ด( ๐‘—) = ๐‘ˆ (cid:124)(cid:123)(cid:122)(cid:125) ๐‘› ๐‘— ร—๐‘› ๐‘— ฮฃ (cid:124)(cid:123)(cid:122)(cid:125) ๐‘› ๐‘— ร—(cid:206) ๐‘—โ‰ ๐‘˜ ๐‘›๐‘˜ ๐‘‰ ๐ป (cid:124)(cid:123)(cid:122)(cid:125) (cid:206) ๐‘—โ‰ ๐‘˜ ๐‘›๐‘˜ร—(cid:206) ๐‘—โ‰ ๐‘˜ ๐‘›๐‘˜ the ๐‘Ÿ ๐‘— -truncated SVD has form ๐ด( ๐‘—) โ‰ˆ ๐‘ˆ (cid:124)(cid:123)(cid:122)(cid:125) ๐‘› ๐‘— ร—๐‘Ÿ ๐‘— ฮฃ (cid:124)(cid:123)(cid:122)(cid:125) ๐‘Ÿ ๐‘— ร—๐‘Ÿ ๐‘— ๐‘‰ ๐ป (cid:124)(cid:123)(cid:122)(cid:125) ๐‘Ÿ ๐‘— ร—(cid:206) ๐‘—โ‰ ๐‘˜ ๐‘›๐‘˜ 19 This problem then is repeated for each of the ๐‘‘ different modes. This is potentially a bottleneck computationally and so can likely benefit from fast approximations to the SVD, e.g. see [48]. In order to further improve the decomposition, we can take an alternating least squares approach, successively solving for ๐‘ˆ ๐‘— while holding other modes constant and iterate on this processes. This is called Higher Order Orthogonal Iteration (HOOI), Algorithm 1.2, see [16] Algorithm 1.2 Higher Order Orthogonal Iteration input : X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ r = (๐‘Ÿ1, ๐‘Ÿ2, . . . , ๐‘Ÿ๐‘‘) ๐‘€ the maximum number of iterations output : [[G (0), ๐‘ˆ (0) 1 ห†X = [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] , . . . ๐‘ˆ (0) ๐‘‘ ]] โ† HOSVD(X, r) for ๐‘– โˆˆ [๐‘€] do for ๐‘— โˆˆ [๐‘‘] do # Update each factor matrix (cid:62) (cid:16) ๐‘‹ Compute ๐‘ˆ ๐‘— โ† ๐‘„ (๐‘–) ๐‘˜โ‰  ๐‘— ๐‘— [:, : ๐‘Ÿ ๐‘— ] (cid:17) โˆ—(cid:17) ( ๐‘—) (cid:16) ๐‘ˆ (๐‘–โˆ’1) ๐‘˜ = ๐‘„ (๐‘–) ๐‘— ฮฃ(๐‘–) ๐‘— ๐‘‰ (๐‘–)(cid:17) โˆ— (cid:16) ๐‘— end end # Compute Core (cid:62) G โ† ๐‘‹ ๐‘˜โ‰  ๐‘— (cid:16) ๐‘ˆ (๐‘€) ๐‘˜ (cid:17) โˆ— We also observe that there is a natural correspondence between CP and Tucker decompositions. 
Given a CP decomposition, one can form a Tucker decomposition by arranging the weights of the CP decomposition along the diagonal of the core tensor and placing the component vectors as the columns of the factor matrices, while setting all off diagonal entries in the core equal to zero. The following small result formalizes this correspondence. Lemma 1.4. If X โˆˆ C๐‘›1ร—...๐‘›๐‘‘ has Tucker rank (๐‘Ÿ1, . . . , ๐‘Ÿ๐‘‘) then it has CP rank of at most (cid:206)๐‘‘ ๐‘—=1 ๐‘Ÿ ๐‘— Proof. The tensor has an exact Tucker decomposition, so ๐‘‘(cid:63) X = G ๐‘ˆ ๐‘— ๐‘—=1 Note that we can express any tensor in the standard basis; here the standard basis for tensors is a tensor with only one non-zero entry. 20 So for example for some โ„“ โˆˆ [๐‘Ÿ1] ร— ยท ยท ยท ร— [๐‘Ÿ๐‘‘] the associated standard basis element is ๐‘‘ โƒ ๐‘—=1 eโ„“ ๐‘— where eโ„“ ๐‘— is the usual standard basis vector in C๐‘Ÿ ๐‘— . Denote I = [๐‘Ÿ1] ร— ยท ยท ยท ร— [๐‘Ÿ๐‘‘]. Thus C = Gโ„“ โˆ‘๏ธ โ„“โˆˆI (cid:19) eโ„“ ๐‘— (cid:18) ๐‘‘ โƒ ๐‘—=1 Now use this expression in the Tucker decomposition of X ๐‘‘(cid:63) X = G ๐‘ˆ ๐‘— ๐‘—=1 = = = (cid:34) โˆ‘๏ธ โ„“โˆˆI Gโ„“ (cid:18) ๐‘‘ โƒ ๐‘—=1 eโ„“ ๐‘— (cid:19) (cid:35) ๐‘‘(cid:63) ๐‘ˆ ๐‘— ๐‘—=1 Gโ„“ Gโ„“ โˆ‘๏ธ โ„“โˆˆI โˆ‘๏ธ โ„“โˆˆI (cid:19) ๐‘ˆ ๐‘— eโ„“ ๐‘— (cid:18) ๐‘‘ โƒ ๐‘—=1 ๐‘‘ โƒ ๐‘—=1 (๐‘ˆ ๐‘— )โ„“ ๐‘— where we have used Lemma 1.3, and denoted the โ„“ ๐‘— -th column of ๐‘ˆ ๐‘— as (๐‘ˆ ๐‘— )โ„“ ๐‘— . We therefore have the sum of rank one tensors. There are (cid:206)๐‘‘ ๐‘Ÿ ๐‘— possible values for โ„“ and so we have a CP ๐‘—=1 decomposition of that rank. This provides an upper bound on CP rank, since the decomposition may not be optimal. On the other hand, given a Tucker decomposition of rank-r, one can instead regard this as a rank (cid:206)๐‘‘ ๐‘–=1 ๐‘Ÿ๐‘– CP tensor, where each entry of the core is a weight to a rank-one tensor formed by the outer product of the respective columns of the factor matrices and then these are all summed up. There is generically though no reason that these correspondences would be minimal in those respective decompositions. That is, given a minimal exact rank-๐‘Ÿ CP decomposition, it does not follow that the Tucker decomposition used in Lemma 1.4 would be a minimal in terms of rank for the Tucker decomposition of that particular tensor. Given an arbitrary tensor X โˆˆ R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ of unknown Tucker rank it is often of interest to compute r โˆˆ R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ a good rank-r approximation of X. In fact, an optimal Tucker rank-r minimizer [[X]]opt 21 satisfying (cid:13) X โˆ’ [[X]]opt (cid:13) r (cid:13) (cid:13) (cid:13) (cid:13)2 = inf 1 ร—ยทยทยทร—๐‘Ÿ๐‘‘ , GโˆˆR๐‘Ÿ ๐‘ˆ ๐‘— โˆˆR๐‘› ๐‘— ร—๐‘Ÿ ๐‘— โˆฅX โˆ’ [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] โˆฅ2 (1.6) always exists (see, e.g., Theorems 10.3 and Theorem 10.8 in [22]). However, computing such an [[X]]opt r (which is not unique) is generally a challenging task that is accomplished only approximately via iterative techniques (see, e.g., [37]). As a result, in such situations one usually seeks to instead compute a quasi-optimal rank-r tensor หœX. Definition 1.17. The tensor หœX of rank r is quasi-optimal approximation of X if it satisfies (cid:13)X โˆ’ หœX(cid:13) (cid:13) (cid:13)2 โ‰ค ๐ถ inf GโˆˆR๐‘Ÿ 1 ร—ยทยทยทร—๐‘Ÿ๐‘‘ , ๐‘ˆ ๐‘— โˆˆR๐‘› ๐‘— ร—๐‘Ÿ ๐‘— โˆฅX โˆ’ [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] โˆฅ2 , (1.7) where ๐ถ โˆˆ R+ is a positive constant independent of X. 
The following lemma demonstrates that one can recover a quasi-optimal rank-r approximation of an arbitrary tensor X ∈ R^{n_1×···×n_d} by simply computing the left singular vectors of X_{[j]} for all j ∈ [d]. It is a variant of [22, Theorem 10.3]. We prove it here for the sake of completeness.

Lemma 1.5. Let X ∈ R^{n_1×···×n_d}, denote the i-th largest singular value of X_{[j]} by σ_i(X_{[j]}), and define Δ_{r,j} := ∑_{i=r+1}^{ñ_j} σ_i(X_{[j]})^2 for all j ∈ [d] and r ∈ [ñ_j := min{n_j, ∏_{k≠j} n_k}]. Fix r ∈ [ñ_1] × ··· × [ñ_d] and suppose that [[X]]^opt_r satisfies (1.6). Then,

∑_{j=1}^{d} Δ_{r_j, j} ≤ d ‖X − [[X]]^opt_r‖_2^2 .

Proof. Let P_j ∈ R^{n_j×n_j} be a rank r_j orthogonal projection matrix satisfying Δ_{r_j, j} = ‖X_{[j]} − P_j X_{[j]}‖_F^2 ≤ ‖X_{[j]} − Y‖_F^2 for all rank at most r_j matrices Y. (The Eckart-Young Theorem guarantees that we can compute such a P_j from the left singular vectors of X_{[j]}.) Noting that Y = ([[X]]^opt_r)_{[j]} is a rank at most r_j matrix by (1.4), we then have that

Δ_{r_j, j} ≤ ‖X_{[j]} − ([[X]]^opt_r)_{[j]}‖_F^2 = ‖X − [[X]]^opt_r‖_2^2   ∀ j ∈ [d].

Summing over the d values of j above now finishes the proof.

Note, in order to simplify notation, we will often assume that n_j = n and r_j = r for all j ∈ [d]. That is, our tensors will have equal side lengths and, in the Tucker case, the rank will likewise be the same in every mode, thus suppressing the need for an additional level of subscripting. This will be done when modifying the proofs and results to deal with non-equal side lengths is straightforward; a bit of simplicity is achieved without too much damage to generality. Furthermore, we will principally be concerned with real valued tensors, rather than complex valued tensors.

1.4.4 Randomized Numerical Linear Algebra Preliminaries

Concepts from randomized numerical linear algebra will feature prominently in our analysis of the embedding properties of different measurement operators.

Definition 1.18 ((ε, δ, p)-Johnson-Lindenstrauss (JL) property). Let ε > 0, δ ∈ (0, 1), and p ∈ N. A random matrix Ω ∈ R^{m×n} has the (ε, δ, p)-Johnson-Lindenstrauss (JL) property for an arbitrary set S ⊂ R^n with cardinality at most p if it satisfies

(1 − ε) ‖x‖_2^2 ≤ ‖Ωx‖_2^2 ≤ (1 + ε) ‖x‖_2^2   for all x ∈ S   (1.8)

with probability at least 1 − δ. (Formally, a distribution over m × n matrices has the JL property if a matrix selected from this distribution satisfies (1.8) with probability at least 1 − δ. For brevity, here and in similar cases below, the term "random matrix" will refer to a distribution over matrices.)

Definition 1.19 (Set difference). Let S̃ ⊂ C^n; then the set difference of S̃, denoted S̃ − S̃, is the set {x − y | x, y ∈ S̃} ⊂ C^n.

For all random matrix distributions discussed in this work, only an upper bound on the cardinality of the set S appearing in Definition 1.18 is required in order to ensure with high probability that a given realization satisfies (1.8). No other property of the set S to be embedded will be required to define, generate, or apply Ω.
Such random matrices are referred to as data oblivious, or simply oblivious. The following variant of the famous Johnson-Lindenstrauss Lemma [32] demonstrates the existence of random matrices with the oblivious JL property. First, though, we formally define sub-Gaussian random variables and the sub-Gaussian norm, along with the related notion of sub-exponential random variables.

Definition 1.20 (Sub-exponential Random Variable). We say that X ∈ R is a sub-exponential random variable if there exist β, κ > 0 such that

P[|X| ≥ t] ≤ β e^{−κt},   ∀ t > 0.

We can understand this as saying that the tail of the random variable decays exponentially.

Definition 1.21 (Sub-Gaussian Random Variable). We say that X ∈ R is a sub-Gaussian random variable if there exist β, κ > 0 such that

P[|X| ≥ t] ≤ β e^{−κt^2},   ∀ t > 0.

There are many equivalent characterizations of a random variable being sub-Gaussian; see for example Proposition 2.5.2 in [63].

Definition 1.22 (Sub-Gaussian norm). The sub-Gaussian norm of X, denoted ‖X‖_{ψ_2}, is the quantity

‖X‖_{ψ_2} := inf { t > 0 : E[exp(X^2/t^2)] ≤ 2 }.

Most of these equivalent characterizations are, crucially, bounds related to the sub-Gaussian norm; see again Proposition 2.5.2 in [63].

Theorem 1.1 (Sub-Gaussian random matrices have the JL property). Let S ⊂ R^n be an arbitrary finite subset of R^n, and let δ, ε ∈ (0, 1). Finally, let Ω ∈ R^{m×n} be a matrix with independent, mean zero, variance m^{−1}, sub-Gaussian entries, all with sub-Gaussian norm less than or equal to c ∈ R^+. Then

(1 − ε) ‖x‖_2^2 ≤ ‖Ωx‖_2^2 ≤ (1 + ε) ‖x‖_2^2

will hold simultaneously for all x ∈ S with probability at least 1 − δ, provided that

m ≥ (C/ε^2) ln(|S|/δ),   (1.9)

where C ≤ 8c(16c + 1) is a constant that depends only on the bound c on the sub-Gaussian norms.

Proof. See, e.g., Lemma 9.35 in [18].

Next, we define a similar property for an infinite, yet rank constrained, set.
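Before moving on to subspace embeddings, here is a small numerical illustration of Theorem 1.1 (a sketch added for this text, not from the original): a Gaussian matrix with mean zero, variance 1/m entries approximately preserves the norms of a finite point set once m is on the order of ε^{-2} ln|S|. The constant 8 below is an arbitrary illustrative choice, not the constant C of the theorem.

import numpy as np

rng = np.random.default_rng(3)
n, N, eps = 2000, 200, 0.2                 # ambient dimension, |S|, distortion target

# A finite point set S ⊂ R^n (here: generic Gaussian points, one per column).
S = rng.standard_normal((n, N))

# Sub-Gaussian (here Gaussian) random matrix with mean zero, variance 1/m entries.
m = int(np.ceil(8 * np.log(N) / eps**2))
Omega = rng.standard_normal((m, n)) / np.sqrt(m)

ratios = np.linalg.norm(Omega @ S, axis=0)**2 / np.linalg.norm(S, axis=0)**2
print(m, ratios.min(), ratios.max())
# With high probability every ratio lies in [1 - eps, 1 + eps], as in (1.8).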
Furthermore, since $Q$ has orthonormal columns, so that $\|Q\mathbf{y}\|_2 = \|\mathbf{y}\|_2$ holds, (1.10) is equivalent to
$$(1-\epsilon)\|\mathbf{y}\|_2^2 \leq \|\Omega Q\mathbf{y}\|_2^2 \leq (1+\epsilon)\|\mathbf{y}\|_2^2 \quad \text{for all } \mathbf{y} \in \mathbb{R}^r \tag{1.11}$$
holding in this case. Below we will use the equivalence of (1.10) and (1.11) often, regularly constructing matrices satisfying (1.10) by instead constructing matrices satisfying (1.11).

The next result shows that random matrices with the JL property also have the OSE property. It is a standard result in the compressive sensing and randomized numerical linear algebra literature (see, e.g., [5, Lemma 5.1] or [57, Lemma 10]) that can be proven using an $\epsilon$-cover of the appropriate column space.

Lemma 1.6 (Subspace embeddings from finite embeddings via a cover). Fix $\epsilon \in (0,1)$. Let $\mathcal{L}^r_Y \subset \mathbb{R}^n$ be the $r$-dimensional subspace of $\mathbb{R}^n$ spanned by an orthonormal basis $Y$, and define
$$S^r_Y := \left\{\frac{\mathbf{x}}{\|\mathbf{x}\|_2} \;\middle|\; \mathbf{x} \in \mathcal{L}^r_Y \setminus \{\mathbf{0}\}\right\} \subset \mathcal{L}^r_Y.$$
Furthermore, let $C \subset S^r_Y$ be a minimal $\left(\frac{\epsilon}{16}\right)$-cover of $S^r_Y$. Then if $\Omega \in \mathbb{C}^{m\times n}$ satisfies (1.8) with $S \leftarrow C$ and $\epsilon \leftarrow \frac{\epsilon}{2}$, it will also satisfy
$$(1-\epsilon)\|\mathbf{x}\|_2^2 \leq \|\Omega\mathbf{x}\|_2^2 \leq (1+\epsilon)\|\mathbf{x}\|_2^2 \quad \forall \mathbf{x} \in \mathcal{L}^r_Y. \tag{1.12}$$

Proof. See, e.g., Lemma 3 in [28] for this version. We discuss covering numbers in Section 1.4.4.3.

Using Lemma 1.6 one can now easily prove the following corollary of Theorem 1.1, which demonstrates the existence of random matrices with the OSE property.

Corollary 1.1. A random matrix $\Omega \in \mathbb{R}^{m\times n}$ with mean zero, variance $m^{-1}$, independent sub-Gaussian entries has the $(\epsilon,\delta,r)$-OSE property for an arbitrary rank $r$ matrix $A \in \mathbb{R}^{n\times p}$ provided that
$$m \;\geq\; \frac{C\,r}{\epsilon^2}\ln\left(\frac{C'}{\epsilon\,\delta}\right) \;\geq\; \frac{C}{\epsilon^2}\ln\left(\frac{(47/\epsilon)^r}{\delta}\right),$$
where $C' > 0$ is an absolute constant and $C > 0$ is a constant that only depends on the sub-Gaussian norms of $\Omega$'s entries.

Proof. As is done in Lemma 1.6, suppose $S$ is a minimal $\left(\frac{\epsilon}{16}\right)$-cover of the $r$-dimensional unit sphere in the column span of $A$. The cardinality of this cover is bounded by $(47/\epsilon)^r$. Apply Theorem 1.1 to the finite set $S$.

The next theorem and definition concern another important class of random, but highly structured, matrices that are known to satisfy the JL embedding property.

Theorem 1.2. Let $U \in \mathbb{C}^{n\times n}$ be a unitary matrix with entries bounded by $\max_{i,j\in[n]}|U_{i,j}| \leq \frac{K}{\sqrt{n}}$. Let $R \in \mathbb{C}^{m\times n}$ be a matrix obtained by selecting $m$ rows from the $n\times n$ identity matrix i.i.d. uniformly at random, and let $D \in \mathbb{R}^{n\times n}$ be a diagonal matrix with i.i.d. $\pm 1$ Rademacher random values on its diagonal. Then $\sqrt{\frac{n}{m}}\,RUD$ will be an $\epsilon$-JL map of any given $S \subset \mathbb{R}^n$ into $\mathbb{C}^m$ with probability at least $1 - \delta - n^{-\ln^3 n}$ provided that
$$m \geq \frac{c\,K^2}{\epsilon^2}\log\left(\frac{4|S|}{\delta}\right)\log^4 n,$$
where $c \in \mathbb{R}^+$ is an absolute constant.

Note that $D$ can be applied to a vector $\mathbf{x} \in \mathbb{R}^n$ in $\mathcal{O}(n)$ time, since doing so only involves scanning the entries of the vector and changing the signs of some of them.
The matrix $R$ can be applied to the length-$n$ vector $UD\mathbf{x}$ in $\mathcal{O}(n)$ time, since it involves scanning the entries of the vector and discarding some of them. Applying $U$, since generically it requires at least reading the input, costs at least $\mathcal{O}(n)$. Therefore the overall complexity is governed principally by $U$. If the unitary matrix $U$ admits a fast matrix-vector multiply, for example using the Fast Fourier Transform (FFT) to effect a Discrete Fourier Transform, then $U$ can be applied to $D\mathbf{x}$ in $\mathcal{O}(n\log n)$ time. In practice, the $\log^4 n$ factor in the bound on $m$ is often ignored with no change in performance.

The failure probability bound $\delta + n^{-\ln^3 n}$ is the result of union bounding over two events: $\frac{1}{\sqrt{m}}RF$ failing to have the yet-to-be-defined RIP property (see Definition 2.1) with probability at most $n^{-\ln^3 n}$ (the details of which can be found in Theorem 12.32 in [18]), and the probability of at most $\delta$ that $\sqrt{\frac{n}{m}}RFD$ fails to be an $\frac{\epsilon}{2}$-JL map (details of which can be found in Theorem 3.1 in [38]). Here, since we generally concern ourselves with $n \gg 1$, the failure probability can be made suitably small. We refer to this class of matrices as Subsampled Orthogonal Random Signs (SORS) matrices. The most frequently used type is as follows.

Definition 1.24 (Subsampled Scrambled Fourier Transform (SSFT)). Let $F$ be the unitary discrete Fourier transform matrix,
$$F_{\ell,k} = \frac{1}{\sqrt{n}}\,e^{\frac{2\pi i \ell k}{n}}.$$
If we take $U = F$, and $R$ and $D$ to be the matrices described in Theorem 1.2, then $\sqrt{\frac{n}{m}}RFD$ is a JL map, where
$$\max_{\ell,k}|F_{\ell,k}| = \frac{1}{\sqrt{n}},$$
so $K = 1$. The Fast Fourier Transform (FFT) can be used to apply $F$ to any vector $\mathbf{x} \in \mathbb{R}^n$ in $\mathcal{O}(n\log n)$ time. This is the prototypical fast JL matrix, though other choices for $U$ are naturally possible. We also refer to it simply as $RFD$ when context makes it clear.

Finally, the following lemma demonstrates that matrices satisfying (1.10) for $A$ also approximately preserve the Frobenius norm of $A$. We present its proof here as an illustration of basic notation and techniques, and will use this result later in Chapter 3.

Lemma 1.7. Suppose $\Omega \in \mathbb{R}^{m\times n}$ satisfies (1.10) for a rank $r$ matrix $A \in \mathbb{R}^{n\times p}$. Then,
$$\big|\|A\|_F^2 - \|\Omega A\|_F^2\big| \leq \epsilon\|A\|_F^2.$$

Proof. Let $\mathbf{e}_i \in \mathbb{R}^p$ denote the standard basis vector whose $i$-th entry is 1 and all others are zero (i.e., $\mathbf{e}_i$ is the $i$-th column of the $p\times p$ identity matrix $I_p$). Similarly, let $\mathbf{b}_i$ denote the $i$-th column of any given matrix $B$, and set $\tilde{A} := \Omega A$. By (1.10), we conclude that for all $i \in [p]$ we have
$$\big|\|A\mathbf{e}_i\|_2^2 - \|\Omega A\mathbf{e}_i\|_2^2\big| \leq \epsilon\|A\mathbf{e}_i\|_2^2.$$
To establish the desired result, we represent the squared Frobenius norm of a matrix as the sum of the squared $\ell_2$-norms of its columns. Doing so, we see that
$$\big|\|A\|_F^2 - \|\Omega A\|_F^2\big| = \left|\sum_{i=1}^{p}\big(\|\mathbf{a}_i\|_2^2 - \|\tilde{\mathbf{a}}_i\|_2^2\big)\right| \leq \sum_{i=1}^{p}\big|\|\mathbf{a}_i\|_2^2 - \|\tilde{\mathbf{a}}_i\|_2^2\big| = \sum_{i=1}^{p}\big|\|A\mathbf{e}_i\|_2^2 - \|(\Omega A)\mathbf{e}_i\|_2^2\big| \leq \epsilon\sum_{i=1}^{p}\|A\mathbf{e}_i\|_2^2 = \epsilon\|A\|_F^2.$$
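Returning to Definition 1.24, the following Python sketch applies an $RFD$-style map using the FFT rather than by ever forming an $n\times n$ matrix; only the sign pattern of $D$ and the sampled row indices of $R$ are stored. The sizes and test vector are illustrative assumptions.

```python
import numpy as np

# A small sketch of the RFD / SSFT map of Definition 1.24, sqrt(n/m) * R F D,
# applied via the FFT.  Only the diagonal of D and the m sampled rows of R are
# stored; sizes are illustrative.
rng = np.random.default_rng(1)
n, m = 4096, 256
d = rng.choice([-1.0, 1.0], size=n)          # diagonal of D (Rademacher signs)
rows = rng.integers(0, n, size=m)            # rows of the identity kept by R

def rfd(x):
    # unitary DFT of the sign-flipped vector, then subsample and rescale
    Fx = np.fft.fft(d * x) / np.sqrt(n)
    return np.sqrt(n / m) * Fx[rows]

x = rng.standard_normal(n)
print(np.linalg.norm(rfd(x)) ** 2 / np.linalg.norm(x) ** 2)  # close to 1
```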
We will now define a property of random matrices related to fast approximate matrix multiplication.

1.4.4.2 The Approximate Matrix Multiplication (AMM) Property

First proposed in [57, Lemma 6] (see also, e.g., [46] for other variants), the approximate matrix multiplication property will be crucial to our analysis of the embedding properties of measurement operators.

Definition 1.25 ((ε, δ)-AMM property). Let $\epsilon > 0$ and $\delta \in (0,1)$. A random matrix $\Omega \in \mathbb{R}^{m\times n}$ satisfies the $(\epsilon,\delta)$-Approximate Matrix Multiplication property for two arbitrary matrices $A \in \mathbb{R}^{p\times n}$ and $B \in \mathbb{R}^{n\times q}$ if
$$\|A\Omega^T\Omega B - AB\|_F \leq \epsilon\|A\|_F\|B\|_F \tag{1.13}$$
holds with probability at least $1-\delta$.

The following lemma can be used to construct random matrices with the AMM property from random matrices with the JL property. A slightly generalized version is proven in Appendix 3.5 for the sake of completeness.

Lemma 1.8 (The JL property provides the AMM property). Let $A \in \mathbb{R}^{p\times n}$ and $B \in \mathbb{R}^{n\times q}$. There exists a finite set $S \subset \mathbb{R}^n$ with cardinality $|S| \leq 2(p+q)^2$ (determined entirely by $A$ and $B$) such that the following holds: if a random matrix $\Omega \in \mathbb{R}^{m\times n}$ has the $(\epsilon/2, \delta, 2(p+q)^2)$-JL property for $S$, then $\Omega$ will also have the $(\epsilon,\delta)$-AMM property for $A$ and $B$.

Proof. Combine Lemma 3.12 with Remark 3.5.

Using Lemma 1.8 one can now prove the following corollary of Theorem 1.1, which demonstrates the existence of random matrices with the AMM property for any two fixed matrices.

Corollary 1.2. Fix $A \in \mathbb{R}^{p\times n}$ and $B \in \mathbb{R}^{n\times q}$. A random matrix $\Omega \in \mathbb{R}^{m\times n}$ with mean zero, variance $m^{-1}$, independent sub-Gaussian entries will have the $(\epsilon,\delta)$-Approximate Matrix Multiplication property for $A$ and $B$ provided that
$$m \geq \frac{C}{\epsilon^2}\ln\left(\frac{2(p+q)^2}{\delta}\right),$$
where $C > 0$ is a constant that only depends on the sub-Gaussian norms of $\Omega$'s entries.

Proof. Apply Theorem 1.1 to the finite set $S$ guaranteed by Lemma 1.8.
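As an informal numerical check of (1.13), the short Python sketch below compares $A\Omega^T\Omega B$ with $AB$ for a mean-zero, variance-$1/m$ Gaussian sketch; the sizes are illustrative assumptions only.

```python
import numpy as np

# Numerical illustration of the AMM property (1.13).
rng = np.random.default_rng(2)
p, n, q, m = 30, 2000, 40, 400
A = rng.standard_normal((p, n))
B = rng.standard_normal((n, q))
Omega = rng.standard_normal((m, n)) / np.sqrt(m)

err = np.linalg.norm(A @ Omega.T @ (Omega @ B) - A @ B, "fro")
rel = err / (np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro"))
print("relative AMM error:", rel)   # plays the role of epsilon in (1.13)
```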
1.4.4.3 Covering Numbers

We state some results concerning covering numbers, which are useful for showing that $\epsilon$-JL maps apply to more general infinite sets, and which are a main ingredient in several of the results referred to above, such as Lemma 1.6. We begin with the definitions and then show how to obtain covering number bounds for subsets, which is a main ingredient in the net arguments used for embedding infinite sets.

Definition 1.26 (δ-cover). Let $X$ be a normed vector space over the complex numbers and $T \subseteq \mathbb{C}^N$. A $\delta$-cover of $T$ with respect to the norm $\|\cdot\|_X$ is a subset $S \subseteq T$ such that for all $\mathbf{x} \in T$ there exists $\mathbf{y} \in S$ with $\|\mathbf{x}-\mathbf{y}\|_X < \delta$, so that
$$T \subseteq \bigcup_{\mathbf{y}\in S} B_X(\mathbf{y},\delta).$$
Note that $B_X(\mathbf{y},\delta)$ is the open ball with center $\mathbf{y} \in \mathbb{C}^N$ and radius $\delta$ with respect to the norm $\|\cdot\|_X$. Usually the space and norm will be clear from context, so we will simplify notation and write instead $B(\mathbf{y},\delta)$.

Definition 1.27 (δ-covering Number). The $\delta$-covering number of $T \subseteq \mathbb{C}^N$ with respect to $\|\cdot\|_X$, denoted $C^X_\delta(T)$, is the smallest integer such that there exists a $\delta$-cover $S \subseteq T$ with $|S| = C^X_\delta(T)$. If no such integer exists we say that $C^X_\delta(T) = \infty$.

Definition 1.28 (δ-packing). Let $T \subseteq \mathbb{C}^N$. A $\delta$-packing of $T$ with respect to the norm $\|\cdot\|_X$ is a subset $S \subseteq T$ such that for all $\mathbf{x},\mathbf{y} \in S$ with $\mathbf{x}\neq\mathbf{y}$ we have $\|\mathbf{x}-\mathbf{y}\|_X \geq \delta$, and hence $B_X(\mathbf{x},\delta/2)\cap B_X(\mathbf{y},\delta/2) = \emptyset$.

Definition 1.29 (δ-packing Number). The $\delta$-packing number of $T \subseteq \mathbb{C}^N$ with respect to $\|\cdot\|_X$, denoted $P^X_\delta(T)$, is the largest integer such that there exists a $\delta$-packing $S \subseteq T$ with $|S| = P^X_\delta(T)$. If no such integer exists we say that $P^X_\delta(T) = \infty$.

Lemma 1.9. Let $T \subseteq \mathbb{C}^N$ and $\delta \in (0,\infty)$. Then
$$P^X_{2\delta}(T) \leq C^X_\delta(T) \leq P^X_\delta(T).$$

Proof. Let $P_{2\delta} \subset T$ be a maximal $2\delta$-packing of $T$ and $C_\delta \subseteq T$ be a minimal $\delta$-cover of $T$. Each point $\mathbf{x} \in P_{2\delta}$ is closest to a different point $\mathbf{y} \in C_\delta$. To see this, suppose to the contrary that for $\mathbf{x}_1,\mathbf{x}_2 \in P_{2\delta}$ with $\mathbf{x}_1 \neq \mathbf{x}_2$ there was some point $\mathbf{y} \in C_\delta$ such that $\mathbf{x}_1,\mathbf{x}_2 \in B(\mathbf{y},\delta)$; this implies that $\mathbf{y} \in B_X(\mathbf{x}_1,\delta)\cap B_X(\mathbf{x}_2,\delta)$, which is a contradiction since the $2\delta$-packing property ensures these two balls are disjoint. So, since each point in $P_{2\delta}$ can be identified with a point in $C_\delta$, and no point of $C_\delta$ can serve two distinct points of $P_{2\delta}$, we may define an injection $f: P_{2\delta} \to C_\delta$ with $f(\mathbf{x}) = \mathbf{y}$ where $\mathbf{x} \in B(\mathbf{y},\delta)$. Since $f$ is an injection, the cardinality of $C_\delta$ must be equal to or larger than that of $P_{2\delta}$, which is the left-hand side of the desired inequality.

Next, suppose $P_\delta$ is a maximal $\delta$-packing of $T$, and suppose for eventual contradiction that there exists a point $\mathbf{y} \in T$, $\mathbf{y} \notin P_\delta$, such that $\|\mathbf{x}-\mathbf{y}\|_X \geq \delta$ for all $\mathbf{x} \in P_\delta$. This implies that $B_X(\mathbf{x},\delta/2)\cap B_X(\mathbf{y},\delta/2) = \emptyset$ for all $\mathbf{x} \in P_\delta$, and thus $P_\delta \cup \{\mathbf{y}\}$ is a $\delta$-packing of $T$. This contradicts the maximality of $P_\delta$. So, for every point $\mathbf{y} \in T$ there is an $\mathbf{x} \in P_\delta$ with $\|\mathbf{x}-\mathbf{y}\|_X < \delta$, which is to say that $P_\delta$ is a $\delta$-cover of $T$, and therefore the cardinality of $P_\delta$ is equal to or larger than the $\delta$-covering number of $T$. This is the right-hand side of the desired inequality.

Lemma 1.10. Let $T \subseteq \mathbb{R}^N$ and $\delta \in (0,\infty)$. Furthermore let $B$ denote the unit ball $B_X(\mathbf{0},1)$ in $\mathbb{R}^N$ with respect to some norm $\|\cdot\|_X$. Then
$$\left(\frac{1}{\delta}\right)^N\frac{\mathrm{Vol}(T)}{\mathrm{Vol}(B)} \;\leq\; C^X_\delta(T) \;\leq\; P^X_\delta(T) \;\leq\; \left(\frac{2}{\delta}\right)^N\frac{\mathrm{Vol}\big(T + \big(\frac{\delta}{2}\big)B\big)}{\mathrm{Vol}(B)}$$
holds, where $\mathrm{Vol}(T) = \int_T 1\,dV$ is the Lebesgue measure of $T$ in $\mathbb{R}^N$. Note that the addition of sets here is shorthand for a set difference of certain sets, i.e. $T + S = T - (-S) = \{\mathbf{t}+\mathbf{s} \mid \mathbf{t}\in T,\ \mathbf{s}\in S\}$.

Proof. Suppose $C_\delta$ is a minimal $\delta$-cover of $T$.
By the definition of a $\delta$-cover,
$$T \subseteq \bigcup_{\mathbf{y}\in C_\delta} B(\mathbf{y},\delta).$$
So, using sub-additivity of the measure, translation invariance, and scaling, we have
$$\mathrm{Vol}(T) \leq \mathrm{Vol}\left(\bigcup_{\mathbf{y}\in C_\delta} B(\mathbf{y},\delta)\right) \leq C^X_\delta(T)\,\mathrm{Vol}(B(\mathbf{y},\delta)) = C^X_\delta(T)\,\delta^N\,\mathrm{Vol}(B(\mathbf{0},1)).$$
Rearranging terms, we obtain the left-hand side of the desired inequality,
$$\left(\frac{1}{\delta}\right)^N\frac{\mathrm{Vol}(T)}{\mathrm{Vol}(B)} \leq C^X_\delta(T).$$
Now suppose $P_\delta$ is a maximal $\delta$-packing of $T$. It follows that
$$\bigcup_{\mathbf{y}\in P_\delta} B(\mathbf{y},\delta/2) \subset T + B(\mathbf{0},\delta/2).$$
Since the balls that make up the $\delta$-packing of $T$ are disjoint, their measure is additive. Again, using translation invariance and scaling, this implies
$$\mathrm{Vol}\left(\bigcup_{\mathbf{y}\in P_\delta} B(\mathbf{y},\delta/2)\right) = P^X_\delta(T)\left(\frac{\delta}{2}\right)^N\mathrm{Vol}(B(\mathbf{0},1)) \leq \mathrm{Vol}(T + B(\mathbf{0},\delta/2)),$$
which after rearranging terms matches the right-hand side of the desired inequality.

Corollary 1.3. For all norms $\|\cdot\|_X$ on $\mathbb{R}^N$,
$$\left(\frac{1}{\delta}\right)^N \leq C^X_\delta(B) \leq \left(1 + \frac{2}{\delta}\right)^N.$$

Proof. We can apply Lemma 1.10 with $T = B(\mathbf{0},1)$. Then
$$\left(\frac{1}{\delta}\right)^N\frac{\mathrm{Vol}(B)}{\mathrm{Vol}(B)} = \left(\frac{1}{\delta}\right)^N \leq C^X_\delta(T)$$
yields the first half of the corollary. Next, note that using $T = B(\mathbf{0},1)$ we see that
$$B(\mathbf{0},1) + B\left(\mathbf{0},\frac{\delta}{2}\right) \subseteq B\left(\mathbf{0},1+\frac{\delta}{2}\right).$$
Note that by scaling, we have the following for the volume calculation:
$$\mathrm{Vol}\left(B\left(\mathbf{0},1+\frac{\delta}{2}\right)\right) = \left(1+\frac{\delta}{2}\right)^N\mathrm{Vol}(B(\mathbf{0},1)).$$
Putting this into the previous lemma, we see
$$C^X_\delta(T) \leq \left(\frac{2}{\delta}\right)^N\left(1+\frac{\delta}{2}\right)^N\frac{\mathrm{Vol}(B(\mathbf{0},1))}{\mathrm{Vol}(B)} = \left(1+\frac{2}{\delta}\right)^N,$$
which completes the second inequality.

Note that if we add the assumption in the corollary that $\delta \in (0,1)$, then we can bound $\left(1+\frac{2}{\delta}\right) \leq \left(\frac{1}{\delta}+\frac{2}{\delta}\right) = \frac{3}{\delta}$, for a more concise, though less tight, bound.

Corollary 1.4. If $S \subseteq B_X(\mathbf{0},1) \subset \mathbb{R}^N$ then
$$C^X_\delta(S) \leq \left(1+\frac{2}{\delta}\right)^N.$$

Proof. By observing that
$$\mathrm{Vol}\left(S + B\left(\mathbf{0},\frac{\delta}{2}\right)\right) \leq \mathrm{Vol}\left(B_X(\mathbf{0},1) + B\left(\mathbf{0},\frac{\delta}{2}\right)\right)$$
and applying the same reasoning as in Corollary 1.3, we achieve the result.

1.5 Modewise Measurements Framework

For Chapters 2 and 3, we will be investigating specializations of the following measurement framework for tensor recovery. The overall measurement process we will consider is a collection of one or more measurement procedures that interleave reshaping of the tensor (see Definition 1.7) with modewise measurements, i.e. modewise products with linear maps applied to some number of the (reshaped) modes (see Definition 1.10). Naturally, it is formally possible to layer as many rounds of reshaping and measurement as one wishes; however, we will be concerned with measurement procedures that have at most two stages of reshaping and linear transforms applied to the fibers of the reshaped tensor.
Figure 1.3 (schematic; panels (a) First Stage, Reshaping; (b) First Stage, Measuring; (c) Second Stage, Reshaping; (d) Second Stage, Measuring) Schematic of a two-stage measurement procedure. The first reshaping $R_1$ takes a three-mode tensor and reshapes it into a two-mode tensor. The modewise measurements in the measurement phase of the first stage correspond to left and right multiplication by the measurement maps $A_1$ and $A_2^T$ to form the intermediary tensor $\mathcal{B}_1$. The second stage depicts a vectorization reshaping operation $R_2$ and a final measurement phase where the matrix $B_1$ is applied to the remaining mode to form the final set of measurements, $\mathcal{B}_2$.

For example, a generic two-stage modewise linear operator $\mathcal{L}: \mathbb{R}^{n_1\times\cdots\times n_d} \to \mathbb{R}^{m_1\times\cdots\times m_{d'}}$ could take the form
$$\mathcal{L}(\mathcal{X}) := R_2\big(R_1(\mathcal{X}) \times_1 A_1 \cdots \times_{\tilde{d}} A_{\tilde{d}}\big) \times_1 B_1 \cdots \times_{d'} B_{d'}, \tag{1.14}$$
where $R_1$ is a reshaping operator which rearranges the elements of an $\mathbb{R}^{n_1\times\cdots\times n_d}$ tensor into an $\mathbb{R}^{\tilde{n}_1\times\cdots\times\tilde{n}_{\tilde{d}}}$ tensor. After this reshaping, using the modewise product (Definition 1.10), $A_j \in \mathbb{R}^{\tilde{m}_j\times\tilde{n}_j}$ is applied to the reshaped tensor for $j \in [\tilde{d}]$. This is followed by an additional reshaping via $R_2$ into an $\mathbb{R}^{m'_1\times\cdots\times m'_{d'}}$ tensor, and a final set of $j$-mode products using the matrices $B_j \in \mathbb{R}^{m_j\times m'_j}$ for $j \in [d']$. See Figure 1.3 for a schematic depiction of a two-stage operator of this type.

Clearly, any $q$-stage modewise operator can be defined in this way. This view of measurement operators was first analyzed in [28, 30]. Immediately, we can see that multi-stage modewise compression operators, where several modes are compressed, can offer computational advantages over standard vector-based approaches, provided the measurements are sufficiently sized to retain information about the data. By vector-based approaches, we mean the direct application of classic compressive sensing results which have been applied to vectors, whereby one simply takes the tensor data and, as a first step, rearranges its elements into a vector. This approach fits the framework, where $R_1$ is a vectorization operator as in Definition 1.5. That is, $\tilde{d} = 1$, $A_1 = A \in \mathbb{R}^{m\times\prod_j n_j}$ is a Johnson-Lindenstrauss map for example, and all remaining operators $R_2$, $B_1$, etc., are the identity (see, e.g., [49]).

If instead $R_1$ is a reshaping operation that combines even a few, but not all, modes, then the resulting modewise linear transforms can be formed using significantly fewer random variables (effectively, independent random bits) and stored using less memory by avoiding the use of a single, comparatively large $m\times\prod_j n_j$ measurement matrix as in the vectorize-first approach. Indeed, as we will discuss in Chapter 2, simply forming or storing such a measurement matrix can be impractical, whereas one can achieve accurate recovery by instead using several smaller modewise maps. In addition to the storage of the maps themselves, modewise linear operators of this kind are also straightforward to parallelize and offer the possibility of choosing different maps at a more granular level to better respect the multi-modal structure of the given tensor data.
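A minimal Python sketch of a two-stage operator of the form (1.14), mirroring the schematic of Figure 1.3, is given below. It is only an illustration: the sizes are arbitrary, the maps are dense Gaussian matrices, and no claim is made that these are the ensembles analyzed later.

```python
import numpy as np

# Two-stage modewise operator (1.14): reshape a 3-mode tensor into a matrix by
# grouping its last two modes, compress both (reshaped) modes, vectorize, and
# compress once more.  All sizes are illustrative.
rng = np.random.default_rng(3)
n1, n2, n3 = 20, 15, 10          # original tensor side lengths
m1, m2, m_final = 8, 6, 24       # first-stage and second-stage sketch sizes

X = rng.standard_normal((n1, n2, n3))

# First stage: reshape R1 (group modes 2 and 3), then modewise maps A1, A2
A1 = rng.standard_normal((m1, n1)) / np.sqrt(m1)
A2 = rng.standard_normal((m2, n2 * n3)) / np.sqrt(m2)
X_reshaped = X.reshape(n1, n2 * n3)          # R1(X)
B1_intermediate = A1 @ X_reshaped @ A2.T     # R1(X) x_1 A1 x_2 A2

# Second stage: reshape R2 (vectorize), then the final map B1
B_final = rng.standard_normal((m_final, m1 * m2)) / np.sqrt(m_final)
measurements = B_final @ B1_intermediate.reshape(-1)

print(measurements.shape)   # (m_final,) -- far fewer entries than X itself
```

Note that at no point is a single $m\times n_1 n_2 n_3$ measurement matrix formed; only the small matrices $A_1$, $A_2$, and $B_{\mathrm{final}}$ are ever stored.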
CHAPTER 2
TENSOR RESTRICTED ISOMETRY PROPERTY AND TENSOR ITERATIVE HARD THRESHOLDING FOR RECOVERY

We are now equipped to describe how we applied the measurement framework described in Section 1.5 together with a recovery procedure in order to produce a method that provably yields reliable and accurate results for the tensor recovery problem. The concept of central importance to this endeavor is the Tensor Restricted Isometry Property (TRIP). To explain where this term comes from and why it is associated with the recovery of an object from measurements, we recall the term from which it originates, namely the Restricted Isometry Property (RIP). Classically, RIP is nothing more than the norm-preserving Johnson-Lindenstrauss property applied to the set of sparse vectors. That is, we have the following definition.

Definition 2.1 (RIP(ε, S) property). We say that a linear map $\mathcal{A}$, defined on a normed vector space with norm $\|\cdot\|$, has the RIP(ε, S) property if for all elements $s \in S$
$$(1-\epsilon)\|s\|^2 \leq \|\mathcal{A}(s)\|^2 \leq (1+\epsilon)\|s\|^2. \tag{2.1}$$

As mentioned, classically in compressive sensing the set under consideration is the set of $s$-sparse vectors, so that $S = K_s$, where
$$K_s = \big\{\mathbf{x}\in\mathbb{C}^N \;\big|\; \|\mathbf{x}\|_0 \leq s\big\} = \bigcup_{\substack{S\subseteq[N]\\ |S|\leq s}}\mathrm{span}\big\{\mathbf{e}_j\big\}_{j\in S},$$
though we emphasize that the definition here admits any subset of a normed vector space. Famously, in the case of $K_s$, $\ell_1$-minimization by basis pursuit was shown to be able to recover the original signal vector (exactly or approximately, depending on the situation) from a small number of measurements when the measurement operator satisfies this approximate geometry-preserving property, and moreover common distributions of random matrices provably have RIP with high probability; see, e.g., Chapter 4 of [18]. There are other well-known methods for the recovery part of the measure-and-recover compressive sensing problem, such as orthogonal matching pursuit and iterative hard thresholding (IHT). Extending these results to low-rank matrices was a natural direction for this idea, and was the subject of celebrated research in the field of compressive sensing; see [14] for an overview. In this chapter, however, we shall be concerned with the extension of low-rank recovery to higher-order tensors.

2.1 Two-Stage Modewise Measurements and Tensor Restricted Isometry Property

Prior research, such as [21, 20], has extended the Tensor IHT method (TIHT) to low-rank CP and Tucker tensors. TIHT is an iterative method consisting of alternating updates: the first sub-step applies the adjoint of the measurement operator to the residual of the last iterate, and the second sub-step thresholds that update back to the relevant constraint set, the set of low-rank tensors. That is, we take a gradient step first, and then project back onto the proper constraint set for our problem. This is recognizable as a direct extension of IHT to the case of higher-order tensors. That is, we have the update rule
$$\mathcal{Y}^j = \mathcal{X}^j + \mu^j\,\mathcal{L}^*\big(\mathcal{L}(\mathcal{X}) - \mathcal{L}(\mathcal{X}^j)\big), \qquad \mathcal{X}^{j+1} = H_{\mathbf{r}}\big(\mathcal{Y}^j\big), \tag{2.2}$$
where $j$ denotes the iteration, $\mathcal{L}$ and $\mathcal{L}^*$ denote the measurement operator and its adjoint, $\mathcal{X}^0$ is the initialization (randomly generated, for example), $H_{\mathbf{r}}$ is the thresholding operator that takes the tensor and finds a Tucker decomposition of the specified rank approximating the iterate $\mathcal{Y}^j$, and finally $\mu^j$ is a step-size parameter.
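The update rule (2.2) can be made concrete with a short sketch. The Python code below is only an illustration under simplifying assumptions: the measurement operator is a dense Gaussian matrix acting on the vectorized tensor (so its adjoint is the transpose), the thresholding operator $H_{\mathbf{r}}$ is realized by a truncated HOSVD-style projection as discussed around Lemma 1.5, and the sizes, step size, and iteration count are arbitrary choices rather than anything prescribed by the analysis in this chapter.

```python
import numpy as np

rng = np.random.default_rng(4)
dims, ranks, m, mu, iters = (15, 15, 15), (2, 2, 2), 600, 1.0, 100

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_threshold(T, ranks):
    # project each mode onto its top-r_j left singular vectors (truncated HOSVD)
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        P = U[:, :r] @ U[:, :r].T
        T = np.moveaxis(np.tensordot(P, T, axes=([1], [mode])), 0, mode)
    return T

# ground-truth low-rank tensor and its compressive measurements y = L(X)
G = rng.standard_normal(ranks)
U = [rng.standard_normal((d, r)) for d, r in zip(dims, ranks)]
X = np.einsum("abc,ia,jb,kc->ijk", G, *U)
L = rng.standard_normal((m, np.prod(dims))) / np.sqrt(m)
y = L @ X.reshape(-1)

# TIHT iterations as in (2.2)
Xj = np.zeros(dims)
for _ in range(iters):
    residual = y - L @ Xj.reshape(-1)
    Yj = Xj + mu * (L.T @ residual).reshape(dims)   # gradient step with L*
    Xj = hosvd_threshold(Yj, ranks)                  # threshold back to rank r

print("relative error:", np.linalg.norm(Xj - X) / np.linalg.norm(X))
```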
The optimality of the thresholding operation is something we will have cause to comment on later in this chapter. The following theorem is one of the main results of [56]. It implies that accurate reconstruction of a tensor $\mathcal{X}$ is possible via TIHT when the measurement operator satisfies the yet-to-be-defined TRIP(δ, 3r) property for a sufficiently small $\delta$ and a sufficient number of iterations, with a reasonable initialization, since the update rule is contractive.

Theorem 2.1 ([56], Theorem 1). Let $\mathcal{X} \in \mathbb{R}^{n_1\times\ldots\times n_d}$, let $0 < a < 1$, and let $\mathcal{L}: \mathbb{R}^{n_1\times\ldots\times n_d} \to \mathbb{R}^m$ satisfy TRIP(δ, 3r) with $\delta < a/4$. Assume that $\mathbf{y} = \mathcal{L}(\mathcal{X}) + \mathbf{e}$, where $\mathbf{e} \in \mathbb{R}^m$ is an arbitrary noise vector, and let $\mathcal{X}^j$ and $\mathcal{Y}^j$ be defined as in (2.2). Assume that
$$\|\mathcal{Y}^j - \mathcal{X}^{j+1}\|_2 \leq (1+\xi_a)\|\mathcal{Y}^j - \mathcal{X}\|_2, \quad \text{where } \xi_a = \frac{a^2}{17\big(1+\sqrt{1+\delta}\big)\|\mathcal{L}\|_{2\to 2}}. \tag{2.3}$$
Then
$$\|\mathcal{X}^{j+1} - \mathcal{X}\|_2 \leq a^j\|\mathcal{X}^0 - \mathcal{X}\|_2 + \frac{c_a}{1-a}\|\mathbf{e}\|_2,$$
where $c_a = 2\sqrt{1+\delta} + \sqrt{4\xi_a + 2\xi_a^2}\,\|\mathcal{L}\|_{2\to 2}$.

The assumption (2.3) in Theorem 2.1 concerns the quality of the thresholding operation in (2.2). This assumption is unfortunately stronger than the quasi-optimality guarantee that the truncated Higher Order SVD computed via Algorithm 1.1 is able to satisfy, as shown in Lemma 1.5. However, the lemma is an upper bound, and in practice we have observed that the truncated HOSVD procedure does produce estimates that are very good in some cases; though naturally, in real-world situations it would be impossible to assess this quantitatively without finding a truly optimal rank-$\mathbf{r}$ approximation, which is difficult. Some other choice of thresholding operation, including refinements using iterations of HOOI or some other method, can likewise be employed in practice to shore up one's confidence that the requirement is met. Practically speaking, it is to some degree a matter of how much computational effort is reasonable to budget for the task.

We note that in [56], the authors show that it is possible to draw maps from random distributions which will satisfy the desired TRIP property with high probability. Unfortunately, these maps require first vectorizing the input tensor into an $n^d$-dimensional vector and then multiplying by an $m\times n^d$ matrix, as we have mentioned earlier. Such a map requires $m$ times more memory than the original tensor and thus can be unsuitable in the big-data scenario; indeed, as we shall discuss in our experiments, we were able to recover large tensors with better accuracy using a modewise approach where a vectorized approach failed given the same constraints on memory.

With Theorem 2.1 in mind, we can now move to specify the class of measurement operators $\mathcal{L}$ we wish to use in this chapter. Two additional preliminaries await. The first is the elaboration of the RIP definition alluded to above, so that it includes higher-order tensors with a low-rank decomposition.

Definition 2.2 (TRIP(δ, r) property). We say that a linear map $\mathcal{A}$ has the TRIP(δ, r) property if for all $\mathcal{X}$ with HOSVD rank at most $\mathbf{r}$ we have
$$(1-\delta)\|\mathcal{X}\|_2^2 \leq \|\mathcal{A}(\mathcal{X})\|_2^2 \leq (1+\delta)\|\mathcal{X}\|_2^2. \tag{2.4}$$

Our second remaining piece is to establish how the HOSVD of a tensor is related to the HOSVD of its reshaping. Let $\kappa$ be an integer which divides $d$, the number of modes, and let $d' := d/\kappa$.
Consider the reshaping operator
$$R_1: \bigotimes_{i=1}^{d}\mathbb{R}^n \to \bigotimes_{i=1}^{d'}\mathbb{R}^{n^\kappa}$$
that combines $\kappa$-sized sets of consecutive modes of a tensor into one mode (see Definition 1.7). For example, if the tensor has four modes and $\kappa = 2$, then the reshaping will pair the first and second modes together and the third and fourth together, and thus reshape the four-mode tensor into a matrix of size $n_1n_2\times n_3n_4$. Any procedure to group the modes would be valid, so there is no loss of generality in assuming that consecutive modes are grouped. Formally, $R_1$ is defined to be the unique linear operator that acts on rank-one tensors as
$$R_1\left(\bigcirc_{i=1}^{d}\mathbf{x}^{(i)}\right) := \bigcirc_{i=1}^{d'}\left(\bigotimes_{\ell=1+\kappa(i-1)}^{\kappa i}\mathbf{x}^{(\ell)}\right) =: \bigcirc_{i=1}^{d'}\mathring{\mathbf{x}}^{(i)},$$
where $\bigcirc$ denotes the outer product and, recall, $\otimes$ denotes the Kronecker product (see Definition 1.11).

We observe that if a tensor $\mathcal{X}$ has a Tucker decomposition (see (1.3)), then its reshaping $\mathring{\mathcal{X}} := R_1(\mathcal{X})$ is the $d'$-mode tensor $\mathring{\mathcal{X}} \in \bigotimes_{i=1}^{d'}\mathbb{R}^{n^\kappa}$ with HOSVD rank at most $\mathbf{r}' := (r^\kappa,\ldots,r^\kappa)$ given by
$$\mathring{\mathcal{X}} = \sum_{j_1=1}^{r^\kappa}\cdots\sum_{j_{d'}=1}^{r^\kappa}\mathring{\mathcal{G}}(j_1,\ldots,j_{d'})\bigcirc_{\ell=1}^{d'}\mathring{\mathbf{u}}^{(\ell)}_{j_\ell}, \tag{2.5}$$
where the new component vectors $\mathring{\mathbf{u}}^{(\ell)}_{j_\ell}$ are obtained by taking Kronecker products of the appropriate $\mathbf{u}^{(i)}_{k_i}$, and where $\mathring{\mathcal{G}} \in \mathbb{R}^{r^\kappa\times\cdots\times r^\kappa}$ is a reshaped version of $\mathcal{G}$ from (1.3). Since each of the $\{\mathbf{u}^{(i)}_{k_i}\}_{k_i=1}^{r}$ was an orthonormal basis for the space spanned by $U_i$, it follows that $\{\mathring{\mathbf{u}}^{(i)}_{j_i}\}_{j_i=1}^{r^\kappa}$ is an orthonormal basis for the space $\mathrm{span}\big(\{\mathring{\mathbf{u}}^{(i)}_{j_i}\}_{j_i=1}^{r^\kappa}\big)$. That is, the Kronecker product of orthonormal vectors is itself a set of orthonormal vectors in its respective vector space.

Let $A_i$ be an $m\times n^\kappa$ matrix for $i \in [d']$, and denote by $\mathcal{A}: \mathbb{R}^{n^\kappa\times\ldots\times n^\kappa} \to \mathbb{R}^{m\times\ldots\times m}$ the linear map which acts modewise on $d'$-mode tensors by way of the modewise products
$$\mathcal{A}(\mathcal{Y}) = \mathcal{Y}\times_1 A_1\times_2\cdots\times_{d'} A_{d'}. \tag{2.6}$$
Let $\mathcal{X}$ be a $d$-mode tensor with HOSVD decomposition given by (1.16). By Lemma 1.1 and (2.5), we have that
$$\mathcal{A}(R_1(\mathcal{X})) = \mathcal{A}(\mathring{\mathcal{X}}) = \sum_{j_1=1}^{r^\kappa}\cdots\sum_{j_{d'}=1}^{r^\kappa}\mathring{\mathcal{G}}(j_1,\ldots,j_{d'})\,\big(A_1\mathring{\mathbf{u}}^{(1)}_{j_1}\big)\bigcirc\cdots\bigcirc\big(A_{d'}\mathring{\mathbf{u}}^{(d')}_{j_{d'}}\big). \tag{2.7}$$
Next, we define the second stage of reshaping and modewise operations, similar to what was discussed in the example in Section 1.5. Let $R_2$ denote the reshaping operation
$$R_2: \bigotimes_{i=1}^{d'}\mathbb{R}^m \to \mathbb{R}^{m^{d'}}, \tag{2.8}$$
that is, the reshaping operation which vectorizes the tensor. Next, let $B$ be an $m_{2\mathrm{nd}}\times m^{d'}$ linear map, and denote
$$\mathcal{A}_{2nd}(\mathcal{X}) := B\cdot R_2\big(\mathcal{A}(R_1(\mathcal{X}))\big). \tag{2.9}$$
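As a quick numerical sanity check on the observation in (2.5) — that grouping pairs of modes of a Tucker rank-$(r,\ldots,r)$ four-mode tensor yields a matrix of rank at most $r^\kappa$ — the following Python snippet can be used; the sizes are illustrative and the snippet is not part of the original development.

```python
import numpy as np

# Reshaping a 4-mode tensor of Tucker rank (r, r, r, r) by grouping pairs of
# modes (kappa = 2) yields a matrix of rank at most r^kappa = r^2, cf. (2.5).
rng = np.random.default_rng(5)
n, r, kappa = 12, 3, 2

G = rng.standard_normal((r, r, r, r))
U = [rng.standard_normal((n, r)) for _ in range(4)]
X = np.einsum("abcd,ia,jb,kc,ld->ijkl", G, *U)     # Tucker-form tensor

X_reshaped = X.reshape(n * n, n * n)               # R1: group modes (1,2) and (3,4)
print("matrix rank of reshaping:", np.linalg.matrix_rank(X_reshaped),
      "<=", r ** kappa)
```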
The main theoretical result that this chapter relies on shows that $\mathcal{A}_{2nd}$ satisfies the TRIP property. This relies on showing that the $A_i$ satisfy a restricted isometry property on the set $S_{1,2}$, to be defined below, and on how the gestalt of the measurement operator inherits its properties from its component parts. We will require two more definitions.

Definition 2.3 (The set $S_{1,2}$). Consider the set of vectors in $\mathbb{R}^{n^\kappa}$
$$S_1 := \left\{\mathring{\mathbf{u}} \;\middle|\; \mathring{\mathbf{u}} = \bigotimes_{i=1}^{\kappa}\mathbf{u}^{(i)},\ \mathbf{u}^{(i)}\in\mathbb{S}^{n-1}\right\}, \tag{2.10}$$
and let
$$S_2 := \left\{\frac{\mathbf{x}+\mathbf{y}}{\|\mathbf{x}+\mathbf{y}\|_2} \;\middle|\; \mathbf{x},\mathbf{y}\in S_1 \text{ s.t. } \langle\mathbf{x},\mathbf{y}\rangle = 0\right\}.$$
Denote $S_{1,2} := S_1\cup S_2$, and observe that $S_{1,2}\subseteq\mathbb{S}^{n^\kappa-1}$.

Definition 2.4 (Nearly orthogonal tensors $\mathcal{B}_{R,\mu,\theta,\mathbf{r}}$). Let $\mathcal{B}_{R,\mu,\theta,\mathbf{r}}$ be the set of $d$-mode tensors $\mathcal{X}\in\mathbb{R}^{n\times\ldots\times n}$ that may be written in Tucker form (1.15) such that

(a) $\|\mathbf{u}^{(i)}_{k_i}\|_2 \leq R$ for all $i$ and $k_i$,

(b) $|\langle\mathbf{u}^{(i)}_{k_i},\mathbf{u}^{(i)}_{k'_i}\rangle| \leq \mu$ for all $k_i\neq k'_i$,

(c) the core tensor $\mathcal{G}$ satisfies $\|\mathcal{G}\|_2 = 1$,

(d) $\mathcal{G}$ has orthogonal subtensors in the sense that (1.5) holds for all $1\leq i\leq d$,

(e) $\|\mathcal{X}\|_2 \geq \theta$.

Theorem 2.2. Let $r\geq 2$ and let $\mathbf{r} = (r,r,\ldots,r)\in\mathbb{R}^d$. Suppose that $\mathcal{A}$ and $\mathcal{A}_{2nd}$ are defined as in (2.6) and (2.9). Let $d' = d/\kappa$ and assume that $A_i$ satisfies the RIP$(\epsilon, S_{1,2})$ property for all $i = 1,\ldots,d'$, where $\delta = 12d'r^d\epsilon < 1$. Assume further that the second-stage matrix $B$ satisfies the RIP$(\delta/3, \mathcal{B}_{1+\epsilon,\epsilon,1-\delta/3,\mathbf{r}'})$ property, where $\mathbf{r}' = (r^\kappa,\ldots,r^\kappa)\in\mathbb{R}^{d'}$. Then $\mathcal{A}_{2nd}$ will satisfy the TRIP(δ, r) property, i.e.,
$$(1-\delta)\|\mathcal{X}\|_2^2 \leq \|\mathcal{A}_{2nd}(\mathcal{X})\|_2^2 \leq (1+\delta)\|\mathcal{X}\|_2^2$$
for all $\mathcal{X}$ with HOSVD rank at most $\mathbf{r}$.

The proof of this result depends on several intermediary results, so we delay its presentation to Section 2.4 of this chapter. In this section we instead wish to focus on the following corollary, and on a discussion with critiques based on empirical findings and several observations we made in the course of its investigation.

Corollary 2.1. Assume the operator $\mathcal{L}$ is defined in one of the following ways:

(a) $\mathcal{L} = \mathrm{vec}\circ\mathcal{A}\circ R_1$, where $\mathcal{A}$ is defined as per (2.6), the matrices $A_i$ satisfy the RIP$(\epsilon, S_{1,2})$ property, and $\delta = 4d'(3r)^d\epsilon < a/4$;

(b) $\mathcal{L} = \mathcal{A}_{2nd}$ defined as in (2.9), its component matrices $A_i$ satisfy the RIP$(\epsilon, S_{1,2})$ property with $\delta = 12d'^2(3r)^d\epsilon$, and the final matrix $B$ satisfies the RIP$(\delta/3, \mathcal{B}_{1+\epsilon,\epsilon,1-\delta/3,\mathbf{r}'})$ property.

Consider the recovery problem from the noisy measurements $\mathbf{y} = \mathcal{L}(\mathcal{X}) + \mathbf{e}$, where $\mathbf{e}\in\mathbb{R}^m$ is an arbitrary noise vector. Let $0 < a < 1$, let $\mathcal{X}^j$ and $\mathcal{Y}^j$ be defined as in (2.2), and assume that (2.3) holds. Then
$$\|\mathcal{X}^{j+1} - \mathcal{X}\|_2 \leq a^j\|\mathcal{X}^0 - \mathcal{X}\|_2 + \frac{c_a}{1-a}\|\mathbf{e}\|_2,$$
where $c_a = 2\sqrt{1+\delta} + \sqrt{4\xi_a + 2\xi_a^2}\,\|\mathcal{A}\|_{2\to 2}$.

Corollary 2.1 encapsulates the theoretical basis for why we can expect a two-stage modewise measurement operator like $\mathcal{A}_{2nd}$, whose constituent measurement maps satisfy the RIP properties, to yield a contractive iterative method in the presence of some noise. A particular choice of measurement map, such as i.i.d. sub-Gaussian matrices, then produces row bounds, provided we can bound the size of covers of the sets $S_{1,2}$ and $\mathcal{B}_{R,\mu,\theta,\mathbf{r}}$. The intermediary results that describe how the properties of the various parts of the measurement operator relate are found below in Section 2.4.

Remark 2.1. As we have alluded to, one of the important advantages of our method over a vectorization-based approach is that the maps $\mathcal{A}$ and $\mathcal{A}_{2nd}$ require less memory to store than those considered in [56].
In particular, $\mathcal{A}$ requires $d'mn^\kappa$ entries to be stored in order to sketch to a dimension of $m^{d'}$, and $\mathcal{A}_{2nd}$ requires $d'mn^\kappa + m^{d'}m_{2\mathrm{nd}}$ entries in order to sketch to a dimension of $m_{2\mathrm{nd}}$. By contrast, the map considered in [56] requires $n^d m$ entries to reach a dimension of $m$. To make this more concrete, we consider the case where $d = 4$, $n = 40$, $\kappa = 2$, the final target dimension is 10,000, and our maps are dense matrices with Gaussian entries. In this case, $\mathcal{A}_{2nd}$, with intermediate dimensions of $m_1 = m_2 = 250$, would require $40^2\times 250\times 2 + 250^2\times 10{,}000 = 625{,}800{,}000$ random entries, with a storage cost of about 2.5 gigabytes assuming 32-bit floating point arithmetic and the SI meaning of the prefix giga as $10^9$ bytes. The vectorization-based approach requires $40^4\times 10{,}000 = 25{,}600{,}000{,}000$ random entries, at a storage cost of about 102.4 gigabytes. In Section 2.2, we will show that under these settings tensor recovery experiments using $\mathcal{A}_{2nd}$ have identical recovery reliability in both the Gaussian and SORS cases when compared to those that use vectorization-based compression, despite the 40-times-smaller memory requirement.

2.2 Experiments

In this section, we show empirically that TIHT, as given in (2.2), can be used with modewise measurement maps to successfully reconstruct low-rank tensors. In our experiments we consider randomly generated four-mode tensors in $\mathbb{R}^{n\times n\times n\times n}$ for $n = 40$ and $n = 96$. In the case where $n = 40$, we utilize both modewise Gaussian and $RFD$ measurements (see Definition 1.24). In the case where $n = 96$ we only consider $RFD$ measurements, since generating and storing the dense Gaussian matrices takes up too much memory.

We compare our two-step modewise approach to a vectorization-based method like the one described in [56]. We consider the percentage of successfully recovered tensors from batches of 100 randomly generated low-rank tensors, as well as the average number of iterations used for recovery on the successful runs. In the case where $n = 40$, we consider the algorithm to have successfully recovered the tensor if the relative error falls below 1% in at most 1000 iterations. Similarly, in the case where $n = 96$, we consider the algorithm to have successfully recovered the tensor if the relative error falls below 5% in at most 1000 iterations. Compression, defined here as the ratio of the total number of entries in the measurements to the total number of entries in the true tensor, ranges from about 0.04% to 0.4% depending on the choice of final sketching dimension. In all instances, we initialize the algorithm using a randomly generated low-rank tensor.

Figure 2.1 Fraction of successfully recovered random tensors out of a random sample of 100 tensors in $\mathbb{R}^{40\times40\times40\times40}$ with various intermediate dimensions. A run is considered successful if relative error is below 1% in at most 1000 iterations.

In our experiments, we apply the map from (2.9) with $\kappa = 2$. That is, we reshape a four-mode tensor whose modes are all of length $n$ into an $n^2\times n^2$ matrix, perform modewise measurements reducing each of the two reshaped modes to $m$, the choice of intermediate dimension, and then, in the second stage, vectorize that result and compress it further to $m_{2\mathrm{nd}}$, the final target dimension.
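The storage comparison quoted in Remark 2.1 can be reproduced with a few lines of arithmetic; the variable names below are ours, and the byte counts assume 32-bit floats with giga taken as $10^9$ bytes, as in the remark.

```python
# Checking the entry counts and storage costs quoted in Remark 2.1.
d, n, kappa, m1, m_final = 4, 40, 2, 250, 10_000
d_prime = d // kappa

modewise_entries = d_prime * m1 * n**kappa + m1**d_prime * m_final
vectorized_entries = n**d * m_final

print(modewise_entries, "entries ->", 4 * modewise_entries / 1e9, "GB")      # 625,800,000 -> ~2.5 GB
print(vectorized_entries, "entries ->", 4 * vectorized_entries / 1e9, "GB")  # 25,600,000,000 -> ~102.4 GB
```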
In our experiments, we consider a variety of intermediate dimensions to demonstrate the stability of the advantage of the modewise measurements over the vectorized ones. For a fair comparison, we set the final embedding dimension of our two-step process, $m_{2\mathrm{nd}}$, equal to the final embedding dimension of the vectorized method, $m_{\mathrm{final}}$. For a unified presentation, we will refer to the final dimension as $m_0$ in either case, modewise or vectorized.

Figure 2.2 Average number of iterations until convergence among the successful runs out of a random sample of 100 tensors in $\mathbb{R}^{40\times40\times40\times40}$ with various intermediate dimensions. A run is considered successful if relative error is below 1% in at most 1000 iterations.

As noted in Remark 2.1, our proposed two-step method offers significant storage savings compared to the vectorization-based approach when looking at the size of the measurement maps involved. For example, when we use Gaussian measurements with $m_0 = 10{,}000$ and $n = 40$, the vectorized computations had to be carried out using four NVIDIA V100 GPUs in parallel, whereas the two-step method easily fits in the memory of one GPU. In the case where $n = 96$, generating, storing, and applying Gaussian random matrices is impractical. For instance, in the scenario we consider in Figure 2.3, where $n = 96$ and $m_0 = 65{,}536$, we would need more than 22.25 terabytes to store the Gaussian map required for the vectorized approach. A two-step approach would require about 4.4 terabytes to store the measurement matrices.

We also note that we may apply $RFD$ measurements more quickly to larger tensors than Gaussian measurements, since the FFT enables $RFD$ measurement matrices to be applied efficiently to the tensor without the need to explicitly form all of the measurement maps. In particular, we need only store the sign changes which form the diagonal of the matrix $D$, and the choice of which rows are sampled from $F$ to form the matrix $RF$ (see Definition 1.24). For this reason, the same size of problem requires about 67 megabytes of storage for the two-step method and 340 megabytes for the vectorized approach when using $RFD$ and an appropriately implemented FFT. It is for these reasons that we restricted ourselves to $RFD$ measurements in the $n = 96$ case for both the vectorized and two-step approaches.

Figure 2.3 Top row shows recovery reliability and convergence speed for trials consisting of 100 samples of random tensors in $\mathbb{R}^{96\times96\times96\times96}$ at various ranks. A run is successful if relative error reaches below 5% within 1000 iterations. Final target dimensions are 32,768 and 65,536; for the two-step method the intermediate dimensions are 2048 and 4096, respectively. Bottom row is average relative error over 100 samples after 500 iterations.

As shown in Figures 2.1 and 2.2, the two-step approach, when compared to the vectorized approach with the same choice of final sketching dimension, shows reliable recovery and a comparable convergence rate with both Gaussian and $RFD$ measurements. Indeed, for some choices of intermediate and final sketching dimensions, modewise measurements empirically recover low-rank tensors more reliably than vectorized measurements (see Figure 2.1, bottom row). We show that these advantages do not result in the need for a substantially increased number of iterations in order to achieve our convergence criteria.
Across all considered scenarios, the average number of iterations needed to meet the convergence criteria can be bounded by two to eight times the number needed by the corresponding vectorized approach, depending on the choice of intermediate sketching dimension $m$. Interestingly, for some ranges of the intermediate and final target dimension $m_0$, and in the case of $RFD$ measurements in the rank $(5,5,5,5)$ instance, fewer iterations are needed (see Figure 2.2, bottom right). Thus, modewise measurements are an effective, memory-efficient method of dimension reduction, and the choice of intermediate sketching dimension allows us to further balance trade-offs in terms of convergence speed and overall memory requirements, given a particular size and rank.

In Figure 2.3, we investigate the performance of the algorithms as the rank of the tensor increases. A larger tensor, with $n = 96$, enables us to consider a wider range of $\mathbf{r}$'s that are reasonably considered to be low-rank. We maintain the vectorized versus two-step comparison, and also consider two different final sketching dimensions, $m_0 = 32{,}768$ and $65{,}536$. Due to the larger problem size and ranks, we scaled the convergence criterion to be 5% relative error in at most 1000 iterations for the comparison of recovery reliability and convergence speed. We observed a phase change in the performance of the algorithms as the rank of the tensor increases for a fixed final sketching dimension. In particular, for $m_0 = 32{,}768$ the performance of the vectorized and two-step approaches empirically degrades significantly at $r = 6$ and $7$, respectively, for this convergence criterion. When doubling the sketching dimension to $m_0 = 65{,}536$, we see empirically that the drop in performance occurs at $r = 8$. The bottom row of Figure 2.3 shows the shift in terms of average relative error for a fixed number of iterations. For the larger ranks and smaller sketching dimensions we observe that stagnation at non-global optima appears more likely, and the runtimes required for acceptable recovery can become impractical. Both the vectorized and two-step approaches have this feature; however, for a fixed final sketching dimension, for some ranks near the threshold the two-step method performs incrementally better.

2.3 Discussion

As we have outlined in the previous section, both theoretically and empirically, in terms of performance and economizing storage, using a two-stage measurement operator rather than a one-stage, vectorized-only operator has significant advantages. There are a few shortcomings we wish to note, however, that came about as we developed the method. To understand the first one, consider Corollary 2.2, which states that if the modewise transforms applied are assumed to be random matrices with i.i.d. sub-Gaussian entries with at least as many rows as in (2.11) and (2.12) for the first and second stage respectively, then $\mathcal{A}_{2nd}$ will have the desired TRIP(δ, r) property with high probability. That is, this is a result which shows that a certain class of random matrix will likely satisfy the TRIP, and thus will be useful in yielding measurements, provided its sketching dimension satisfies the following bounds.

Corollary 2.2. Let $r \geq 2$ and let $\mathbf{r} = (r,r,\ldots,r)\in\mathbb{R}^d$. Suppose that $\mathcal{A}$ and $\mathcal{A}_{2nd}$ are defined as in (2.6) and (2.9), and that all of the $A_i\in\mathbb{R}^{m\times n^\kappa}$ have i.i.d. $\frac{1}{m}$-sub-Gaussian entries for all $i = 1,\ldots,d'$, where $d' = d/\kappa$, and suppose that $0 < \eta, \delta < 1$.
Let
$$m \geq C\delta^{-2}r^{2d}\max\left\{\frac{nd^2\ln(\kappa)}{\kappa},\ \frac{d^2}{\kappa^2}\ln\left(\frac{2d}{\kappa\eta}\right)\right\}, \tag{2.11}$$
and let the final-stage matrix $B\in\mathbb{R}^{m_{2\mathrm{nd}}\times m^{d'}}$ be a $\frac{1}{m_{2\mathrm{nd}}}$-sub-Gaussian matrix with i.i.d. entries with
$$m_{2\mathrm{nd}} \geq C\delta^{-2}\max\left\{\left(r^d + \frac{dmr^\kappa}{\kappa}\right)\ln\left(\frac{d}{\kappa}+1\right) + \frac{dmr^\kappa}{\kappa}\ln\left(1+\delta r^d\right) + \frac{d^2mr^\kappa\delta}{\kappa^2},\ \ln\left(\frac{2}{\eta}\right)\right\}. \tag{2.12}$$
Then $\mathcal{A}_{2nd}$ satisfies the TRIP(δ, r) property, i.e.,
$$(1-\delta)\|\mathcal{X}\|_2^2 \leq \|\mathcal{A}_{2nd}(\mathcal{X})\|_2^2 \leq (1+\delta)\|\mathcal{X}\|_2^2$$
for all $\mathcal{X}$ with HOSVD rank at most $\mathbf{r}$, with probability at least $1-\eta$.

Observe that if $\kappa = 1$, then (2.11) implies that $m > n$, so the sketching dimension needs to be larger than the original side length to ensure the error guarantee, and thus no compression is accomplished. In other words, the theory we have stated here is silent about the case where we compress using modewise products without first combining at least some of the modes. Furthermore, numerical investigations lent empirical evidence that this may not merely have been an artifact of the analysis, or a lack of tightness in the row bounds for this particular class of measurement operator. This was a disappointing state of affairs, because it is quite natural to wish to apply no reshaping initially, and instead compress along all modes using modewise products before employing subsequent stages of reshaping and modewise linear transforms. Moreover, the necessity of having $\kappa$ be at least two means that there may not be any natural or intuitive way to first reshape a tensor with, say, a prime number of modes; in that case some modes would have to be grouped and treated differently than others, which makes the choice of the operator data dependent in a way we should like to avoid.

The next two observations stem from the use of TIHT as the recovery procedure in the overall method. In the first sub-step of TIHT (2.2), we must calculate and form the intermediary tensor $\mathcal{Y}^j$, which has the same dimensions as our original signal $\mathcal{X}$. Furthermore, $\mathcal{Y}^j$ is emphatically not in a factorized form, nor would we expect it to be exactly low rank. This iterate is a full, un-factored tensor that we have to store and operate on, and in practical terms it is the same size as the uncompressed data $\mathcal{X}$. In turn, in the next sub-step we must find a factorization which approximates $\mathcal{Y}^j$. Naturally, that factorization will be a significantly smaller object to store in memory in the low-rank case.

Secondly, as was alluded to earlier in Section 2.1, Theorem 2.1 relies on an assumption about the quality of fit of the thresholding operator that exceeds the guarantees we are aware of for the usual procedure used to compute the HOSVD. Furthermore, calculating an HOSVD can itself qualify as computationally intensive, and refining a factorization could itself involve many iterations of, say, Algorithm 1.2 within an already iterative step of TIHT.

So the use of TIHT as the recovery procedure, and its reliance on the TRIP property, introduces several significant downsides. It means we need working memory on the order of the size of the uncompressed, unfactorized data to perform the first sub-step, contradicting the expressed goal of a low-memory (at all stages) method for tensor recovery.
Ideally we would have some means to go from the compressed measurements $\mathcal{L}(\mathcal{X})$ directly to the core $\mathcal{G}$ and factors $U_i$ of the estimate without needing, at any point in the procedure, to form the entire uncompressed tensor, and thus remain low-memory throughout the entire process. Next, TIHT relies on an assumption about the thresholding operator $H_{\mathbf{r}}$, and practically any effort taken to shore up that assumption may involve significantly more computational effort at each TIHT update, such as HOOI iterations. Lastly, all known methods to prove TRIP for two-stage measurement operators that use the often-studied types of random modewise maps require an initial reshaping, which may not be desirable or intuitive. The desire to address these problems is what led us to investigate the Leave-One-Out alternative and its accompanying simple recovery procedure, and is the subject of our next chapter.

2.4 Proof of Theorem 2.2

The goal of this section is to outline the proof of Theorem 2.2, which crucially builds on a similar type of guarantee for the first-stage compression operator, together with a lemma which shows that the output of the first stage of compression will likely belong to the set of nearly orthogonal tensors of Definition 2.4 and provides a covering number bound for this set. Once we have these, it is easy to fashion two-stage operators from standard types of random matrices, such as sub-Gaussian random matrices or $RFD$ matrices, as was done in, e.g., Lemma 1.6. For a detailed proof of these results, see Section 5 and Appendices A and B of our work [23].

Theorem 2.3. Suppose that $\mathcal{A}$ is defined as per (2.6) and that each of the $A_i$ has the RIP$(\epsilon, S_{1,2})$ property. Let $r\geq 2$, let $\delta = 4d'r^d\epsilon$, and assume that $\delta < 1$. Then $\mathcal{A}\circ R_1$ satisfies the TRIP(δ, r) property, i.e.,
$$(1-\delta)\|\mathcal{X}\|_2^2 \leq \|\mathcal{A}(R_1(\mathcal{X}))\|_2^2 \leq (1+\delta)\|\mathcal{X}\|_2^2 \tag{2.13}$$
for all $\mathcal{X}$ with HOSVD rank at most $\mathbf{r} = (r,r,\ldots,r)$.

So the goal now is to first prove this result about the first-stage compression operator. To do so, it will be convenient to write $\mathcal{A}$ as a composition of maps
$$\mathcal{A}(\mathcal{Y}) = \mathcal{A}_{d'}(\ldots(\mathcal{A}_1(\mathcal{Y}))), \quad \text{where } \mathcal{A}_i(\mathcal{Y}) = \mathcal{Y}\times_i A_i \text{ for } 1\leq i\leq d'. \tag{2.14}$$
We note that by (2.5), we may still write the reshaped $d'$-mode tensor $\mathring{\mathcal{X}}$ as a sum of $r^d$ orthogonal tensors; that is, by Lemma 1.4 we can write $\mathcal{X}$ as a sum of $r^d$ rank-one orthogonal tensors, and the reshaping operator $R_1$ is linear. This action of a linear operator on orthogonal sums in a vector space plays nicely with an approximation property, as we now state in Lemma 2.1.

Lemma 2.1. Let $V$ be an inner product space and let $\mathcal{L}$ be a linear operator on $V$. Let $\mathcal{U}\subset V$ be a subspace of $V$ spanned by an orthonormal system $\{\mathbf{v}_1,\ldots,\mathbf{v}_K\}\subset V$. Suppose that
$$(1-\epsilon)\|\mathbf{v}_i\|_2^2 \leq \|\mathcal{L}\mathbf{v}_i\|_2^2 \leq (1+\epsilon)\|\mathbf{v}_i\|_2^2 \quad \text{for all } 1\leq i\leq K, \tag{2.15}$$
and also that
$$(1-\epsilon)\|\mathbf{v}_i\pm\mathbf{v}_j\|_2^2 \leq \|\mathcal{L}(\mathbf{v}_i\pm\mathbf{v}_j)\|_2^2 \leq (1+\epsilon)\|\mathbf{v}_i\pm\mathbf{v}_j\|_2^2 \quad \text{for all } 1\leq i,j\leq K. \tag{2.16}$$
Then we have
$$(1-K\epsilon)\|\mathbf{w}\|_2^2 \leq \|\mathcal{L}\mathbf{w}\|_2^2 \leq (1+K\epsilon)\|\mathbf{w}\|_2^2 \quad \text{for all } \mathbf{w}\in\mathcal{U}.$$

Proof. Claim:
$$|\langle\mathcal{L}\mathbf{v}_i,\mathcal{L}\mathbf{v}_j\rangle| \leq \epsilon \quad \text{for all } i\neq j.$$
To prove the claim, let $i\neq j$.
Then,
$$4\langle\mathcal{L}\mathbf{v}_i,\mathcal{L}\mathbf{v}_j\rangle = \|\mathcal{L}(\mathbf{v}_i+\mathbf{v}_j)\|_2^2 - \|\mathcal{L}(\mathbf{v}_i-\mathbf{v}_j)\|_2^2 \leq (1+\epsilon)\|\mathbf{v}_i+\mathbf{v}_j\|_2^2 - (1-\epsilon)\|\mathbf{v}_i-\mathbf{v}_j\|_2^2$$
$$= (1+\epsilon)\big(\|\mathbf{v}_i\|_2^2 + \|\mathbf{v}_j\|_2^2 + 2\langle\mathbf{v}_i,\mathbf{v}_j\rangle\big) - (1-\epsilon)\big(\|\mathbf{v}_i\|_2^2 + \|\mathbf{v}_j\|_2^2 - 2\langle\mathbf{v}_i,\mathbf{v}_j\rangle\big) = 4\langle\mathbf{v}_i,\mathbf{v}_j\rangle + 2\epsilon\big(\|\mathbf{v}_i\|_2^2 + \|\mathbf{v}_j\|_2^2\big) = 4\epsilon,$$
where the last equality follows from the fact that $\|\mathbf{v}_i\|_2^2 = \|\mathbf{v}_j\|_2^2 = 1$ and $\langle\mathbf{v}_i,\mathbf{v}_j\rangle = 0$. Thus, $\langle\mathcal{L}\mathbf{v}_i,\mathcal{L}\mathbf{v}_j\rangle \leq \epsilon$. The reverse inequality is similar.

We now proceed to the lemma itself, and argue by induction on $K$. When $K = 1$, the result is immediate from (2.15) and the fact that $\mathcal{L}$ is linear. Now assume the result is true for $K-1$. An arbitrary element of $\mathcal{U}$ may be written as
$$\mathbf{w} = \sum_{i=1}^{K} c_i\mathbf{v}_i,$$
where $c_1,\ldots,c_K$ are scalars. We will write $\mathbf{w} = \mathbf{w}_{K-1} + c_K\mathbf{v}_K$, where
$$\mathbf{w}_{K-1} := \sum_{i=1}^{K-1} c_i\mathbf{v}_i.$$
By construction, we have
$$\|\mathcal{L}\mathbf{w}\|_2^2 = \|\mathcal{L}\mathbf{w}_{K-1}\|_2^2 + \|c_K\mathcal{L}\mathbf{v}_K\|_2^2 + 2c_K\langle\mathcal{L}\mathbf{w}_{K-1},\mathcal{L}\mathbf{v}_K\rangle.$$
We may use the inequality $2ab\leq a^2+b^2$ along with the claim above to see that
$$2c_K\langle\mathcal{L}\mathbf{w}_{K-1},\mathcal{L}\mathbf{v}_K\rangle = \sum_{i=1}^{K-1}2c_ic_K\langle\mathcal{L}\mathbf{v}_i,\mathcal{L}\mathbf{v}_K\rangle \leq \epsilon\sum_{i=1}^{K-1}2|c_Kc_i| \leq \epsilon\sum_{i=1}^{K-1}\big(c_K^2 + c_i^2\big) \leq (K-1)\epsilon c_K^2 + \epsilon\sum_{i=1}^{K-1}c_i^2.$$
By the inductive assumption,
$$\|\mathcal{L}\mathbf{w}_{K-1}\|_2^2 \leq (1+(K-1)\epsilon)\|\mathbf{w}_{K-1}\|_2^2 = (1+(K-1)\epsilon)\sum_{i=1}^{K-1}c_i^2.$$
Thus,
$$\|\mathcal{L}\mathbf{w}\|_2^2 \leq (1+(K-1)\epsilon)\sum_{i=1}^{K-1}c_i^2 + (1+\epsilon)c_K^2 + (K-1)\epsilon c_K^2 + \epsilon\sum_{i=1}^{K-1}c_i^2 \tag{2.17}$$
$$= (1+K\epsilon)\sum_{i=1}^{K}c_i^2 = (1+K\epsilon)\|\mathbf{w}\|_2^2. \tag{2.18}$$
The reverse inequality is similar.

The next lemma checks that, if the matrix $A_i$ satisfies the RIP$(\epsilon, S_{1,2})$ property for some $1\leq i\leq d'$, then the operator $\mathcal{A}_i$ as in (2.14) will satisfy the conditions of Lemma 2.1 for the system of rank-one component tensors produced by our reshaping procedure.

Lemma 2.2. Let $\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}\subset\mathbb{R}^{n\times\ldots\times n}$ be an orthonormal system of rank-one tensors of the form $\mathcal{V}_k = \bigcirc_{i=1}^{d'}\mathbf{v}^i_k$, where $\|\mathbf{v}^i_k\| = 1$ for all $1\leq i\leq d'$. Let $1\leq i\leq d'$, suppose $A_i$ has the RIP$(\epsilon/2, S_{1,2})$ property, and assume that each $\mathbf{v}^i_k$ is an element of the set $S_1$ defined in Definition 2.3. Then the conditions (2.15) and (2.16) are satisfied for (the vectorizations of) these $\{\mathcal{V}_i\}_{i=1}^{K}$ and $\mathcal{L} = \mathcal{A}_i$ defined via $\mathcal{A}_i(\mathcal{X}) = \mathcal{X}\times_i A_i$.

The next lemma gives a formula for the tensor $\mathcal{Y}_t$ obtained by applying sequentially the first $t$ of the maps $\mathcal{A}_i$. It shows that all intermediary tensors $\mathcal{Y}_t$ can be written as an orthogonal linear combination of $r^{\kappa(d'-t)}$ rank-one tensors of unit norm. Moreover, for each of the terms in this sum, the $(t+1)$-st component vector is $\mathring{\mathbf{u}}^{(t+1)}_{j_{t+1}}$ as defined in (2.5) and therefore is an element of the set $S_1$.

Lemma 2.3. Let $\mathcal{Y}_0 = \mathring{\mathcal{X}}$ and $\mathcal{Y}_t := \mathcal{A}_t(\mathcal{Y}_{t-1}) = \mathcal{Y}_{t-1}\times_t A_t$ for all $t = 1,\ldots,d'$. Then, for each $1\leq t\leq d'-1$, we may write
๐‘Ÿ ๐œ… โˆ‘๏ธ ๐‘—๐‘‘โ€ฒ =1 ๐‘—๐‘ก+1=1 G๐‘ก ( ๐‘—๐‘ก+1,..., ๐‘—๐‘‘โ€ฒ ) (cid:20) (cid:18) ๐‘ก โƒ ๐‘–=1 v(๐‘–) ๐‘—๐‘ก+1,..., ๐‘—๐‘‘โ€ฒ (cid:19) โƒ (cid:18) ๐‘‘โ€ฒ โƒ ๐‘–=๐‘ก+1 (cid:19)(cid:21) , โ—ฆu (๐‘–) ๐‘—๐‘– (2.19) where โˆฅv(๐‘–) ๐‘—๐‘ก+1,..., ๐‘—๐‘‘โ€ฒ โˆฅ = 1 for all valid index subsets. We are now ready to prove Theorem 2.3. Proof of Theorem 2.3. By Lemma 1.4, we can write Y0 = of ๐‘Ÿ ๐‘‘ norm one terms of the form โ—ฆ X as an orthogonal linear combination ๐‘‘โ€ฒ โƒ ๐‘–=1 โ—ฆu (๐‘–) ๐‘—๐‘– , 1 โ‰ค ๐‘—๐‘– โ‰ค ๐‘Ÿ ๐œ…, where each of the vectors โ—ฆu(๐‘–) ๐‘—๐‘– , 1 โ‰ค ๐‘—๐‘– โ‰ค ๐‘Ÿ ๐œ…, are obtained as the vectorization of a rank-one ๐œ…-mode tensor. Therefore, since ๐ด1 satisfies RIP(๐œ–, ๐‘†1,2), Lemma 2.2 allows us to apply Lemma 2.1 to see โˆฅA1( โ—ฆ X)โˆฅ2 โ‰ค (1 + 2๐‘Ÿ ๐‘‘๐œ–) โˆฅ โ—ฆ Xโˆฅ2. (2.20) Next observe that by Lemma 2.3 (๐‘‘โ€ฒโˆ’๐‘ก) sums of ๐‘Ÿ ๐œ… terms each (cid:125)(cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:122) ๐‘Ÿ ๐œ… โˆ‘๏ธ . . . (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123) ๐‘Ÿ ๐œ… โˆ‘๏ธ ๐‘—๐‘‘โ€ฒ =1 ๐‘—๐‘ก+1=1 Y๐‘ก = G๐‘ก ( ๐‘—๐‘ก+1, . . . , ๐‘—๐‘‘โ€ฒ) remains a set of orthonormal vectors (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125)(cid:124) (cid:122) (cid:123) (cid:20) (cid:18) ๐‘ก (cid:19) (cid:19)(cid:21) โƒ ๐‘–=1 v(๐‘–) ๐‘—๐‘ก+1,..., ๐‘—๐‘‘โ€ฒ (cid:18) ๐‘‘โ€ฒ โƒ ๐‘–=๐‘ก+1 (๐‘–) ๐‘—๐‘– โƒ โ—ฆu and note that there are ๐‘Ÿ ๐œ…(๐‘‘โ€ฒโˆ’๐‘ก) terms appearing in the sum in (2.19). Therefore, Lemmas 2.1 and 2.2 allow us to see that โˆฅY๐‘ก+1โˆฅ2 โ‰ค (cid:16) 1 + 2๐‘Ÿ ๐œ…(๐‘‘โ€ฒโˆ’๐‘ก)๐œ– (cid:17) โˆฅY๐‘ก โˆฅ2 (2.21) 51 for 1 โ‰ค ๐‘ก โ‰ค ๐‘‘โ€ฒ โˆ’ 1. Since Y๐‘‘โ€ฒ = A ( โ—ฆ X), combining (2.20) and (2.21) for each of the compositions implies that the overall first stage compression operator A defined in (2.6) satisfies โˆฅA ( โ—ฆ X)โˆฅ2 โ‰ค ๐‘‘โ€ฒโˆ’1 (cid:214) (cid:16) ๐‘ก=0 1 + 2๐‘Ÿ ๐œ…(๐‘‘โ€ฒโˆ’๐‘ก)๐œ– (cid:17) โ—ฆ Xโˆฅ2. โˆฅ To complete the upper bound set ๐›ผ (cid:66) 2๐‘Ÿ ๐‘‘๐œ– and note that 2๐›ผ < 1. Recall ๐‘‘โ€ฒ = ๐‘‘ ๐œ… . Then, since ๐‘Ÿ โ‰ฅ 2 ๐‘‘โ€ฒโˆ’1 (cid:214) (cid:16) ๐‘ก=0 1 + 2๐‘Ÿ ๐œ…(๐‘‘โ€ฒโˆ’๐‘ก)๐œ– (cid:17) = ๐‘‘โ€ฒโˆ’1 (cid:214) ๐‘ก=0 (1 + ๐›ผ๐‘Ÿ โˆ’๐‘ก๐œ…) ๐‘Ÿ โˆ’๐‘ก๐œ… + ๐›ผ2 ๐‘‘โ€ฒโˆ’1 โˆ‘๏ธ ๐‘Ÿ โˆ’(๐‘ก1+๐‘ก2)๐œ… + . . . + ๐›ผ๐‘‘โ€ฒ๐‘Ÿ โˆ’(1+...+(๐‘‘โ€ฒโˆ’1))๐œ… ๐‘‘โ€ฒโˆ’1 โˆ‘๏ธ ๐‘ก=0 ๐‘‘โ€ฒโˆ’1 โˆ‘๏ธ = 1 + ๐›ผ โ‰ค 1 + ๐›ผ โ‰ค 1 + ๐›ผ ๐‘ก1,๐‘ก2=0: ๐‘ก1<๐‘ก2 ๐‘‘โ€ฒโˆ’1 โˆ‘๏ธ ๐‘Ÿ โˆ’๐‘ก๐œ… (cid:33) 2 (cid:32) ๐›ผ ๐‘Ÿ โˆ’๐‘ก๐œ… + (cid:32) + . . . + ๐›ผ (cid:33) ๐‘‘โ€ฒ ๐‘Ÿ โˆ’๐‘ก๐œ… ๐‘‘โ€ฒโˆ’1 โˆ‘๏ธ ๐‘ก=0 ๐‘ก=0 โˆž โˆ‘๏ธ ๐‘ก=0 2โˆ’๐‘ก + ๐‘ก=0 โˆž โˆ‘๏ธ 2โˆ’๐‘ก (cid:32) ๐›ผ (cid:33) 2 ๐‘ก=0 (cid:32) + . . . + ๐›ผ (cid:33) ๐‘‘โ€ฒ โˆž โˆ‘๏ธ ๐‘ก=0 2โˆ’๐‘ก โ‰ค 1 + 2๐›ผ + (2๐›ผ)2 + . . . 
+ (2๐›ผ)๐‘‘โ€ฒ โ‰ค 1 + 2๐‘‘โ€ฒ๐›ผ = 1 + 4๐‘‘โ€ฒ๐‘Ÿ ๐‘‘๐œ– which completes the proof of the upper bound. The proof of the lower bound is nearly identical. Now that we have proved Theorem 2.3, and have now the first stage compression operator, A (R1(ยท)) has a TRIP so long as the constituent linear maps themselves have a RIP, we can use this show that the second stage compression operator also as a TRIP. Unfortunately, we cannot simply inductively apply the same reasoning, since crucially the first stage depended on using that the sum of orthogonal (rank-1) components and a linear operator were compatible with the embedding properties. After the first stage, these sums of rank-1 tensors will almost certainly no longer be exactly orthogonal. Nevertheless, we see that they will be nearly orthogonal, and should we have a covering number for this set, then it will be possible to find a second stage compression operator which is able to satisfy the RIP for this set. This is what motivates the use of Definition 2.4, and, along with a covering number argument for this set of nearly orthogonal tensors, is the main content of the following Lemma. 52 Lemma 2.4. Let X be a unit norm d-mode tensor with HOSVD rank at most r = (๐‘Ÿ, . . . , ๐‘Ÿ). Let A be defined as in (2.6), assume that the matrices ๐ด๐‘– have the ๐‘…๐ผ ๐‘ƒ(๐œ–, ๐‘†1,2) property for all 1 โ‰ค ๐‘– โ‰ค ๐‘‘โ€ฒ, X) โˆˆ R๐‘šร—...ร—๐‘š is an element of the set and that ๐›ฟ = 12๐‘‘โ€ฒ๐‘Ÿ ๐‘‘๐œ– < 1. Then, the ๐‘‘โ€ฒ-mode tensor A ( โ—ฆ B1+๐œ–,๐œ–,1โˆ’๐›ฟ/3,rโ€ฒ (as per Definition 2.4). Additionally, let suppose that ๐‘š โ‰ฅ ๐‘Ÿ ๐‘‘โˆ’1. Then, หœB๐œ–,๐›ฟ,๐‘Ÿ (cid:66) (cid:26) X โˆฅXโˆฅF (cid:12) (cid:12) (cid:12) (cid:12) X โˆˆ B1+๐œ–,๐œ–,1โˆ’๐›ฟ/3,rโ€ฒ and (cid:27) ๐ถ โˆฅยทโˆฅ2 ๐‘ก (cid:0) หœB๐œ–,๐›ฟ,๐‘Ÿ (cid:1) โ‰ค (cid:19)๐‘Ÿ ๐‘‘+๐‘Ÿ ๐œ… ๐‘š๐‘‘/๐œ… (cid:18) 9(๐‘‘/๐œ… + 1) ๐‘ก (cid:16) (1 + ๐œ–)2 + ๐œ–๐‘Ÿ ๐‘‘(cid:17) ๐‘‘๐‘š๐‘Ÿ ๐œ… /๐œ… (1 + ๐œ–)๐‘‘2๐‘š๐‘Ÿ ๐œ… ๐œ…โˆ’2 holds for all ๐‘ก > 0, and 1 > ๐›ฟ > 0. Where ๐ถ โˆฅยทโˆฅ2 ๐‘ก (cid:0) หœB๐œ–,๐›ฟ,๐‘Ÿ (cid:1) refers to the covering number, see Definition 1.27. The proof then of Theorem 2.2 follows from applying the RIP assumption of ๐ด 2๐‘›๐‘‘ after using Theorem 2.3 for A (๐‘…1(X)). 53 CHAPTER 3 LEAVE ONE OUT MEASUREMENTS FOR RECOVERY Our investigation of two-stage measurement operators that satisfy the TRIP was motivated by our desire to employ a result regarding TIHT that could, with important qualifications, solve the tensor recovery problem by using a memory efficient L : R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ โ†’ R๐‘š ๐‘“ ๐‘–๐‘›๐‘Ž๐‘™ measurement operator. In Section 2.3 we noted three significant drawbacks to this overall approach 1. the method was not low-memory at all stages since a sub-step of TIHT involves forming and operating on a tensor the size of the uncompressed original signal 2. the method relies on an effectively unverifiable assumption about the thresholding operator that is also computationally expensive to shore up 3. the apparent necessity of an initial reshaping in order to achieve useful compression The desire to address these issues led us to a useful specialization to the modewise measurement framework, which we refer to as the leave-one-out alternative. We are not the first to propose the idea. 
Most notably, in the work of Hendrikx and De Lathauwer [26], the authors describe an algorithm that recovers a tensor from linear combinations of its entries where the measurement operations themselves are constrained to be Kronecker-structured; this will serve as a particular example which fits very naturally into our framework once we properly situate it. In that work they give necessary and sufficient conditions for perfect recovery of exactly low-rank tensors. They describe some relevant heuristics and provide empirical results for the performance of the recovery when the tensor or its measurements are subject to white noise. Additionally, they demonstrate how to adapt their method to two other tensor formats, namely CP and tensor train.

Furthermore, in the work of Tropp, Udell et al. [60], a recovery procedure comparable to that of [26] was described (in that work, Algorithm 4.1, the Tucker Sketch); however, the measurement ensembles analyzed were, in some sense, less structured. Measurements used to estimate factors of a tensor in that work were conceived as operating on the unfoldings of the tensor, not strictly modewise as is possible in the Kronecker-structured case of [26]. This too we can situate into our framework from Section 1.5. In [60] the authors detail approximation error guarantees for the recovered tensor in terms of quasi-optimality, and notably do not require the signal to be exactly low-rank; both are things we find desirable in a theory of recovery. On the other hand, the error guarantees in that work hold in expectation and are stated under the assumption that all the entries of the measurement matrices are sampled from independent Gaussian random variables, which from a practical viewpoint can be undesirable, as these measurement matrices then require storage that scales with the size of the data tensor itself. The authors do empirically investigate the performance of a more structured and constrained measurement map, specifically when the sketching matrices are constructed using Khatri-Rao products of Gaussian matrices. In those experiments they show that these structured measurements deliver nearly as accurate approximations while potentially using significantly less space, since it is possible in this scenario to store only the component parts and construct the ensemble as it is needed during sketching.

Our contribution in this chapter is to unify and add to these works. Using the same overall strategy as [60], our error analysis applies to the more structured Kronecker and Khatri-Rao measurement ensembles, which are then used in the Canonical Recovery Algorithm 3.1 to recover the factorized form of a tensor. These types of measurement ensembles operate on the tensor in a subtly different modewise manner that we will situate in terms of the framework introduced in Section 1.5. In a similar fashion to the type of measurement operator that was the subject of Chapter 2, this has significant practical advantages, since neither the entire signal tensor nor the entire measurement operator needs to be kept in working memory in order to obtain measurements or, importantly, to recover the factors of the original signal tensor. The improvement compared to [26] is that the error analysis presented in this chapter applies in the non-exactly low-rank case, and quantifies more specifically how the error depends on the relevant parameters of the problem, such as the rank truncation.
Additionally, besides applying the analysis to these more structured measurement ensembles, which was not done in those earlier works, we remove the assumption that the entries of the sketching matrices are drawn independently from a Gaussian distribution, and instead rely on a Johnson-Lindenstrauss property (see Definition 1.18) in order to state probabilistic bounds on the errors of the recovered tensors. The advantage of this is that there are several well known distributions of random matrices which are known to satisfy this property, and it becomes a straightforward exercise to reinterpret the measurement requirements for different choices of sketching matrices. Having these options enables a user to select structured measurement schemes which can, for example, take advantage of fast matrix-vector multiplies; see Definition 1.24.

3.1 Kronecker Products, Khatri-Rao Products, and Leave-One-Out Modewise Measurement Notation and Examples

We now describe and provide some notation and terminology for the type of measurement operators studied throughout this chapter. The general concept, which we are able to specialize to several cases, is the following:

Definition 3.1 (Leave-one-out Measurements). Given a $d$-mode tensor $\mathcal{X}$ with side-lengths $n$, leave-one-out measurements are the result of reducing all but one mode using linear operations on the unfolding of the tensor. That is,
$$B_j = \Omega^{(j,j)} X_{[j]} \Omega_{-j}^T \tag{3.1}$$
are leave-one-out measurements whenever $\Omega^{(j,j)} \in \mathbb{R}^{n \times n}$ is full-rank and $\Omega_{-j} \in \mathbb{R}^{m' \times n^{d-1}}$ where $m' \leq n^{d-1}$.

Any measurement process that can be equivalently described as left multiplication of an unfolded tensor by a full-rank square matrix and right multiplication by some other linear operator that reduces all other modes is a leave-one-out measurement process. We will consider three specific cases for how to construct the overall measurement operator $\Omega_{-j}$ in Definition 3.1: Kronecker structured, Khatri-Rao structured, and unstructured measurement operators, and then situate each within the framework introduced in Section 1.5.

3.1.1 Kronecker Structured Measurements

Kronecker structured leave-one-out measurements are constructed by taking the Kronecker product (see Definition 1.11) of $d-1$ component matrices to fashion $\Omega_{-j}$, in addition to one square, full-rank measurement matrix $\Omega^{(j,j)}$ to be applied to mode $j$, which is left uncompressed. Note this matrix could simply be the identity matrix. We will require leave-one-out measurements of the type in Definition 3.1 for each mode (i.e., for all $j \in [d]$) in order to recover all factors. Collectively, the matrices needed to define this ensemble will form a set $\{\Omega^{(i,j)}\}_{i,j \in [d]}$, where $\Omega^{(i,j)} \in \mathbb{R}^{m \times n}$ when $j \neq i$, and where $\Omega^{(i,i)} \in \mathbb{R}^{n \times n}$ for all $i \in [d]$. So far, our measurements are then
$$\mathcal{B}_j := \mathcal{X} \times_1 \Omega^{(j,1)} \times_2 \Omega^{(j,2)} \times_3 \dots \times_d \Omega^{(j,d)} \tag{3.2}$$
for all $j \in [d]$, using the matrices $\{\Omega^{(j,i)}\}_{i=1}^{d}$. Crucially, we can identify this tensor of measurements $\mathcal{B}_j$ with a flattened version which makes it clear that Kronecker structured modewise measurements do define leave-one-out measurements conforming to Definition 3.1 (see Lemma 1.2):
$$B_j = \Omega^{(j,j)} X_{[j]} \Omega_{-j}^T = \Omega^{(j,j)} X_{[j]} \left( \Omega^{(j,1)} \otimes \Omega^{(j,2)} \otimes \dots \otimes \Omega^{(j,j-1)} \otimes \Omega^{(j,j+1)} \otimes \dots \otimes \Omega^{(j,d)} \right)^T. \tag{3.3}$$
See Algorithm 3.2 for an outline of the sketching procedure. Note that (3.2) makes it clear that we have here a one-stage modewise measurement process which has no initial reshaping, but instead leaves one mode uncompressed. As mentioned, we will require a collection of such measurements - one measurement tensor in one mode will be used only to recover a single factor matrix in the Tucker decomposition of our estimate. We refer to a collection of measurements from different measurement operators as a measurement ensemble.

Since it will be relevant in our analysis, we observe that, conceptually, the Kronecker product of matrices can be rewritten in terms of matrix products of auxiliary matrices. This will be useful if we wish to examine via, e.g., (1.4), how several modewise products change the properties of the resulting factor matrices $U_j$ in the Tucker decomposition of a given tensor. As an illustration, consider the case of a three mode tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$. Let $\mathbf{x} \in \mathbb{R}^{n_1 n_2 n_3}$ denote the vectorization of $\mathcal{X}$, and further suppose that $\Omega_j \in \mathbb{R}^{m_j \times n_j}$ for $j = 1, 2, 3$ are three measurement matrices used to produce modewise measurements of $\mathcal{X}$ given by $\mathcal{X} \times_1 \Omega_1 \times_2 \Omega_2 \times_3 \Omega_3$. Allowing for an additional reshaping, we can identify these three modewise operations equivalently with a single matrix-vector product using a variant of (1.4). That is, one can see that $\mathrm{vec}\left(\mathcal{X} \times_1 \Omega_1 \times_2 \Omega_2 \times_3 \Omega_3\right) = \Omega \mathbf{x}$, where $\Omega = \Omega_3 \otimes \Omega_2 \otimes \Omega_1 \in \mathbb{R}^{m_1 m_2 m_3 \times n_1 n_2 n_3}$. Let $I_n$ denote the $n \times n$ identity matrix. The mixed-product property of the Kronecker product can now be used to further show that in fact $\Omega = \tilde{\Omega}_3 \tilde{\Omega}_2 \tilde{\Omega}_1$, where
$$\tilde{\Omega}_1 = I_{n_3} \otimes I_{n_2} \otimes \Omega_1 \in \mathbb{R}^{m_1 n_2 n_3 \times n_1 n_2 n_3}, \quad \tilde{\Omega}_2 = I_{n_3} \otimes \Omega_2 \otimes I_{m_1} \in \mathbb{R}^{m_1 m_2 n_3 \times m_1 n_2 n_3}, \quad \tilde{\Omega}_3 = \Omega_3 \otimes I_{m_2} \otimes I_{m_1} \in \mathbb{R}^{m_1 m_2 m_3 \times m_1 m_2 n_3}.$$
Hence, $\mathrm{vec}\left(\mathcal{X} \times_1 \Omega_1 \times_2 \Omega_2 \times_3 \Omega_3\right) = \tilde{\Omega}_3 \tilde{\Omega}_2 \tilde{\Omega}_1 \mathrm{vec}(\mathcal{X})$. This example can easily be extended to any number of modes.
3.1.2 Khatri-Rao Structured Measurements

Leave-one-out measurement ensembles in which $\Omega_{-j}$ is a Khatri-Rao product of smaller maps are also possible, and have been considered before in works such as [60]. That is,
$$B_j = \Omega^{(j,j)} X_{[j]} \Omega_{-j}^T = \Omega^{(j,j)} X_{[j]} \left( \Omega^{(j,1)\,T} \odot \Omega^{(j,2)\,T} \odot \dots \odot \Omega^{(j,j-1)\,T} \odot \Omega^{(j,j+1)\,T} \odot \dots \odot \Omega^{(j,d)\,T} \right) = \Omega^{(j,j)} X_{[j]} \left( \Omega^{(j,1)} \bullet \Omega^{(j,2)} \bullet \dots \bullet \Omega^{(j,j-1)} \bullet \Omega^{(j,j+1)} \bullet \dots \bullet \Omega^{(j,d)} \right)^T, \tag{3.4}$$
where $\Omega^{(j,j)}$ is a full-rank square matrix and $\Omega^{(j,i)} \in \mathbb{R}^{m \times n}$ for $j \neq i$. Note that in this case we can consider $\Omega_{-j} \in \mathbb{R}^{m \times n^{d-1}}$ as sketching sub-tensors of size $n^{d-1}$ down to size $m$ (as opposed to $m^{d-1}$, as was done previously with Kronecker structured measurement maps). See Algorithm 3.3 for an outline of the sketching procedure. To situate this measurement process within the framework: unlike the Kronecker structured measurements, there is no simple equivalence of each component map to a single modewise product along a mode. Instead, we conceive of the measurement matrix as acting on sub-tensors of $d-1$ modes, i.e., the sub-tensors that occupy the rows of a mode-$j$ unfolding. That is, there is a reshaping into a mode-$j$ flattening initially, and then a single "modewise" (matrix) product of that reshaped tensor with the measurement operator. Fundamentally, measurements structured this way combine $d-1$ modes initially with a reshaping; however, the full measurement matrix itself need not be fully pre-computed because of its Khatri-Rao structure. One can, at the time of sketching, take the Kronecker product of the columns of the component matrices, trading compute for storage.

3.1.3 Unstructured Measurements

In [50], a type of leave-one-out measurement is described and analyzed; however, in that work the measurement ensembles are not structured as described in the previous two sections, but instead are dense matrices, where $\Omega_{-j} \in \mathbb{R}^{m \times n^{d-1}}$ has entries all drawn independently from a Gaussian distribution and this matrix right-multiplies an unfolding of the tensor:
$$B_j = \Omega^{(j,j)} X_{[j]} \Omega_{-j}^T,$$
where again $\Omega^{(j,j)}$ is some square, full-rank matrix. As an important observation, the storage requirement for the sketching matrix itself in this case is large, comparable in size to the data itself, since the long dimension is $n^{d-1}$, which can be undesirable (though an improvement on the fully vectorized approach, which must have $n^d$ as its long dimension). Furthermore, the error analysis in [50] conducted for this type of measurement, where the unstructured matrix $\Omega_{-j}$ has i.i.d. Gaussian entries, relies on expectation bounds known for Gaussian matrices and their pseudo-inverses. Comparable bounds for other distributions, or for matrices constructed using Kronecker or Khatri-Rao products, are not covered in that analysis.

Note that Kronecker or Khatri-Rao structured measurements alleviate to a large degree the storage problem associated with $\Omega_{-j}$ in the unstructured case, since the entire sketching matrix does not need to be maintained in memory; rather, just the constituent components $\Omega^{(j,i)} \in \mathbb{R}^{m \times n}$ can be stored, and the action of the Kronecker or Khatri-Rao product on the tensor can be computed as part of the sketching phase. This does incur additional operations in the sketching phase, and in our runtime analysis and numerical experiments we detail the trade-off between space and run-time for these various choices of measurement operators.

3.1.4 Core Measurements

Regardless of which type of leave-one-out measurement is used for each of the modes to recover the factors, in order to recover the core factor using only a single access to the data $\mathcal{X}$ we will require an additional set of measurements. These are, in all cases discussed in this chapter, modewise, and compress all modes; i.e., the operator is a single stage, no initial reshaping, modewise measurement operator:
$$\mathcal{B}_c := \mathcal{X} \times_1 \Phi_1 \times_2 \Phi_2 \times_3 \dots \times_d \Phi_d, \tag{3.5}$$
using the matrices $\{\Phi_j\}_{j=1}^{d}$. See Algorithm 3.4 for an outline of the sketching procedure for the core. All together then, these $d+1$ measurement tensors, $d$ of the leave-one-out type (Definition 3.1) and one of the type in (3.5), will be used to recover the parts of the original full tensor. Figure 3.1 is a schematic rendition of the overall measurement procedure in the three mode case ($d = 3$).
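As a complement to the Kronecker-structured example above, the following NumPy sketch (again with illustrative names of our own, under the same row-major unfolding convention) forms Khatri-Rao structured leave-one-out measurements as in (3.4) one row at a time, so that the full $m \times n^{d-1}$ matrix $\Omega_{-j}$ is never stored, and also forms the core measurements (3.5) by compressing every mode.

import numpy as np

rng = np.random.default_rng(1)
n, d, m, m_c = 8, 3, 4, 5
X = rng.standard_normal((n, n, n))

def mode_product(T, A, mode):
    """Multiply tensor T by matrix A along the given mode."""
    T = np.moveaxis(T, mode, 0)
    out = A @ T.reshape(T.shape[0], -1)
    return np.moveaxis(out.reshape((A.shape[0],) + T.shape[1:]), 0, mode)

# Khatri-Rao structured leave-one-out measurement for mode j = 1.
Omega11 = np.eye(n)                          # square full-rank matrix for the kept mode
Omega12 = rng.standard_normal((m, n))
Omega13 = rng.standard_normal((m, n))

X1 = X.reshape(n, n * n)                     # mode-1 unfolding
B1 = np.empty((n, m))
for i in range(m):
    # Row i of Omega_{-1} is the Kronecker product of row i of each component;
    # it is formed on the fly and immediately discarded.
    w = np.kron(Omega12[i], Omega13[i])      # length n^(d-1)
    B1[:, i] = X1 @ w
B1 = Omega11 @ B1                            # final n x m measurement matrix

# Core measurements (3.5): every mode is compressed by a matrix Phi_i.
Phis = [rng.standard_normal((m_c, n)) for _ in range(d)]
B_core = X
for mode, Phi in enumerate(Phis):
    B_core = mode_product(B_core, Phi, mode)

print(B1.shape, B_core.shape)                # (8, 4) and (5, 5, 5)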
Note that the ๐‘›๐‘‘ entries of the original tensor X are compressed into ๐‘‘๐‘›๐‘š๐‘‘โˆ’1 + ๐‘š๐‘‘ ๐‘ entries of the ๐‘‘ + 1 total different measurement tensors; one leave-one-out measurement for each of the ๐‘‘ modes as well as the one measurement tensor for use in estimating the core. Recovery naturally will require the storage of the measurement operators themselves in some fashion along with the measurement tensors. As dense matrices, the measurement ensemble collectively uses ๐‘‘ ((๐‘‘ โˆ’ 1)๐‘š๐‘› + ๐‘›2) + ๐‘‘๐‘š๐‘๐‘› total entries. However the number of random variables required to generate these matrices may be significantly fewer depending on the method employed. For example, when using ๐‘…๐น ๐ท matrices and an FFT algorithm, the number of random bits required to construct an ensemble is linear in ๐‘›. Additionally, some choices of measurement matrices allow for near-linear time matrix-vector multiplication. As we detail later, applying the measurement operators, i.e. the sketching phase, is asymptot- ically the most computationally intensive part of the overall measure and recovery procedure, and thus in settings where it is useful to economize the computational effort to obtain measurements these matrices have significant advantages. Whichever choice for type of measurements are used, however, the efficiency in terms of the run-time of the recovery algorithm is largely dependent on the ratios ๐‘š/๐‘› and ๐‘š๐‘/๐‘›. 60 B1 = X ) 1 , 1 ( ฮฉ ฮฉ (1,3) ฮฉ(1,2) 1 ฮฆ = B๐‘ X ฮฆ2 ฮฆ 3 (a) Leave-one-measurements from Algorithm 3.2 (b) Core measurements from Algorithm 3.4 Figure 3.1 Schematic for the two types of measurement tensors used in recovery Algorithm 3.9 for a three mode tensor. Measurement tensors B๐‘– of the type shown in (๐‘Ž) for ๐‘– = 1 will be used to estimate the factors of the Tucker decomposition, and the measurement tensor B๐‘ in (๐‘) will be used to estimate the core of the Tucker decomposition. 3.1.5 The Canonical One-Pass Recovery Algorithm We can now state Algorithm 3.1 as the canonical algorithm for one-pass recovery the ten- sor using leave-one-out measurements. The algorithm outputs an estimate in factored form X1 = [[H , ๐‘„1, . . . , ๐‘„๐‘‘]], and we say one-pass to emphasize that after measuring the tensor X, no additional access to the data is required to obtain the estimate X1. The inputs to the algorithm are leave-one-out measurements ๐ต๐‘– for each of the modes of any type, as well as measurements B๐‘ for use in recovering the core, some of the related measurement operators ฮฉ(๐‘–,๐‘–), ฮฆ๐‘–, and a target rank r parameter. Algorithm 3.1 consists of two loops. The first loop recovers the factor matrices ๐‘„๐‘– in the Tucker factorization. The second loop depends on the output of the first loop and computes the core H of the Tucker factorization. Importantly, at no point does a full, uncompressed tensor or size X ever need to be formed. The procedure goes directly from measurements to factors. Note that within the pseudo-code for Algorithm 3.1 the โ€œunfoldโ€ function in the body of Algorithm 3.1 takes a tensor and flattens it into the given shape by arranging the specified modeโ€™s fibers as the columns, e.g. ๐ป โ† unfold(B๐‘, ๐‘š๐‘ ร— ๐‘š๐‘‘โˆ’1 , mode = 1) is equivalent to ๐ป โ† (B๐‘) [1] using the notation of Section 1.4.1, the matrix that is obtained by flattening the ๐‘‘-mode tensor B๐‘ ๐‘ along mode-1. 
The โ€œfoldโ€ function is the inverse of โ€œunfoldโ€, and takes a matrix that is an unfolding 61 along the specified mode of a tensor and reshapes it into a tensor with the specified and compatible dimensions, e.g. B โ† fold(๐ป, ๐‘Ÿ ร—๐‘š๐‘ ร— ยท ยท ยท ร— ๐‘š๐‘ (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:124) (cid:123)(cid:122) ๐‘‘โˆ’1 , mode = 1) takes the matrix ๐ป with dimensions ๐‘Ÿ ร— ๐‘š๐‘‘โˆ’1 ๐‘ and reshapes it into a ๐‘‘ mode tensor where the first mode is of length ๐‘Ÿ and the remaining ๐‘‘ โˆ’ 1 modes are of length ๐‘š๐‘. In the scenario where it is possible to access the original tensor X twice (we refer to this as the two-pass scenario), we can compute the optimal core G given our estimated factor matrices ๐‘„๐‘– to obtain an estimate X2 = [[G, ๐‘„1, . . . , ๐‘„๐‘‘]]. See Algorithm 3.10 and Algorithm 3.12 in Section 3.6 for the detailed formulation of the two-pass recovery procedure. Remark 3.1. Note, for a tensor with an exact Tucker decomposition X = [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] of rank the total degrees of freedom in the decomposition is a lower bound on any lossless (๐‘Ÿ, . . . , ๐‘Ÿ) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:125) (cid:124) ๐‘‘ compression scheme in the absence of any other kind of information about the factorization. For example, the leave-one-out measurements ๐ต๐‘– used to recover a factor matrix ๐‘ˆ ๐‘— in Algorithm 3.1 which has ๐‘›๐‘Ÿ โˆ’ ๐‘Ÿ (๐‘Ÿ+1) degrees of freedom, it is necessary in the exact case that the product ๐‘ˆ ๐‘— (cid:1) have column rank at least ๐‘Ÿ and thus (cid:0)ฮฉ๐‘‘๐‘ˆ๐‘‘ โŠ— ยท ยท ยท โŠ— ฮฉ ๐‘—+1๐‘ˆ ๐‘—+1 โŠ— ฮฉ ๐‘—โˆ’1๐‘ˆ ๐‘—โˆ’1 โŠ— ยท ยท ยท โŠ— ฮฉ1๐‘ˆ1 and ๐บ [ ๐‘—] ๐‘š๐‘‘โˆ’1 โ‰ฅ ๐‘Ÿ. Similarly we see that ๐‘š๐‘ โ‰ฅ ๐‘Ÿ is a lower bound for the measurements required to recover 2 the ๐‘Ÿ ๐‘‘ entries of the core. In other words, to recover the rank (๐‘Ÿ, . . . , ๐‘Ÿ) core tensor, our sketch along each mode must have dimension at least ๐‘Ÿ. With the measurement operators described and the canonical algorithm outlined, we can now state a representative main result. The full unabridged results are found in Section 3.2.5. (0, 1 Theorem 3.1 (A Summary Main Result). Let X be a ๐‘‘-mode tensor of side length ๐‘›, ๐œ– โˆˆ (0, 1), ๐›ฟ โˆˆ 3), and ๐‘Ÿ โˆˆ [๐‘›] := {1, . . . , ๐‘›}. Denote an optimal rank r = (๐‘Ÿ, . . . , ๐‘Ÿ) โˆˆ R๐‘‘ (๐‘‘ โ‰ฅ 2) Tucker . There exists a one-pass recovery algorithm (see Algorithm 3.1) approximation of X by [[X]]opt r that outputs a Tucker factorization X1 = [[H , ๐‘„1, . . . , ๐‘„๐‘‘]] of X from leave-one-out linear mea- surements that will satisfy โˆฅX โˆ’ X1โˆฅ2 โ‰ค (1 + ๐‘’๐œ– ) โˆš๏ธ‚ ๐‘‘ (1 + ๐œ–) 1 โˆ’ ๐œ– (cid:13) (cid:13) (cid:13) X โˆ’ [[X]]opt r (cid:13) (cid:13) (cid:13)2 (3.6) with probability at least 1 โˆ’ ๐›ฟ. 
The total number of linear measurements the algorithm will use is 62 for ๐‘– โˆˆ [๐‘‘] leave-one-out measurements Algorithm 3.1 One Pass HOSVD Recovery from Leave-One-Out Measurements input : ๐ต๐‘– โˆˆ R๐‘›ร—๐‘š๐‘‘โˆ’1 ฮฉ(๐‘–,๐‘–) โˆˆ R๐‘›ร—๐‘› for ๐‘– โˆˆ [๐‘‘] full-rank sensing matrix for uncompressed mode B๐‘ a ๐‘‘ mode tensor of measurements with side lengths ๐‘š๐‘ ฮฆ๐‘– โˆˆ R๐‘š๐‘ร—๐‘› for ๐‘– โˆˆ [๐‘‘] measurement matrices for core measurements r = (๐‘Ÿ, ๐‘Ÿ, . . . , ๐‘Ÿ) the rank of the HOSVD output : ห†X = [[H , ๐‘„1, . . . ๐‘„๐‘‘]] # Factor matrix recovery for ๐‘– โˆˆ [๐‘‘] do # Solve ๐‘› ร— ๐‘› linear system Solve ฮฉ(๐‘–,๐‘–) ๐น๐‘– = ๐ต๐‘– for ๐น๐‘– # Compute SVD and keep the ๐‘Ÿ leading singular vectors ๐‘ˆ, ฮฃ, ๐‘‰๐‘‡ โ† SVD(๐น๐‘–) ๐‘„๐‘– โ† ๐‘ˆ:,:๐‘Ÿ end # Core recovery for ๐‘– โˆˆ [๐‘‘] do # unfold measurements, mode-๐‘– fibers are columns, size ๐‘š๐‘ ร— ๐‘Ÿ (๐‘–โˆ’1)๐‘š๐‘‘โˆ’1โˆ’(๐‘–โˆ’1) ๐ป โ† unfold(B๐‘, ๐‘š๐‘ ร— ๐‘Ÿ (๐‘–โˆ’1)๐‘š๐‘‘โˆ’1โˆ’(๐‘–โˆ’1) , mode = ๐‘–) # Undo the mode-๐‘– measurement operator and factorโ€™s action by finding ๐‘ ๐‘ least square solution to ๐‘š๐‘ ร— ๐‘Ÿ over-determined linear system Solve ฮฆ๐‘–๐‘„๐‘–๐ปnew = ๐ป for ๐ป๐‘›๐‘’๐‘ค # reshape the flattened partially solved core into a tensor ร—๐‘š๐‘ ร— ยท ยท ยท ร— ๐‘š๐‘ B๐‘ โ† fold(๐ปnew, ๐‘Ÿ ร— ๐‘Ÿ ยท ยท ยท ร— ๐‘Ÿ (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:125) (cid:125) (cid:123)(cid:122) (cid:124) ๐‘– ๐‘‘โˆ’๐‘– # Each iteration ๐‘š๐‘ โ†’ ๐‘Ÿ in ๐‘–th mode , mode = ๐‘–) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:124) end H โ† B๐‘ bounded above by ๐‘šโ€ฒ(๐‘Ÿ, ๐‘‘, ๐œ–, ๐‘›, ๐›ฟ) = (cid:104) ๐‘›๐‘‘ + ๐ถ๐‘Ÿ๐‘‘2 ๐œ– 2 ln (cid:16) ๐‘‘๐‘›๐‘‘ ๐›ฟ (cid:17)(cid:105) (cid:16) ๐ถ๐‘Ÿ๐‘‘2 ๐œ– 2 (cid:16) ๐‘‘๐‘›๐‘‘ ๐›ฟ ln (cid:17)(cid:17) ๐‘‘โˆ’1 , where ๐ถ > 0 is an absolute constant. Furthermore, the recovery algorithm (Algorithm 3.1) runs in time ๐‘œ(๐‘›๐‘‘) for large ๐‘› โ‰ซ ๐‘Ÿ. Remark 3.2. More generally and more exactly, the number of measurements is always ๐‘›๐‘‘๐‘š๐‘‘โˆ’1 + ๐‘š๐‘‘ ๐‘ where ๐‘š and ๐‘š๐‘ are as in Theorem 3.9 for Kronecker structured measurements, or ๐‘‘ หœ๐‘š๐‘› + ๐‘š๐‘‘ ๐‘ where หœ๐‘š and ๐‘š๐‘ are as in Theorem 3.11 in the case of Khatri-Rao structured measurements. Proof. Combine Theorem 3.9 (or Theorem 3.11) with Theorem 3.10 and Lemma 1.5. Although our bound almost certainly not tight, overall we demonstrate the procedure has errors 63 that depend on the number of degrees of freedom of the sketched tensor, not its full size; as is typical in similar compressive sensing scenarios. Furthermore, in our numerical experiments we demonstrate some of the main trade-offs involved when choosing between Kronecker or Khatri-Rao constructed sketches in terms of approximation error and performance. 3.1.6 Projection Cost Preserving (PCP) Sketches The following definition and related results are needed to conduct our probabilistic error analysis of the various leave-one-out measurements and Algorithm 3.1. 
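For readers who prefer running code to pseudo-code, the following condensed NumPy sketch mirrors the two loops of Algorithm 3.1 for a small three mode example with exactly low Tucker rank, using Kronecker structured leave-one-out measurements and Gaussian core measurements. This is a minimal illustration under those assumptions, not the reference implementation; all names are ours, and the linear-system and least squares steps are handled with generic solvers.

import numpy as np

rng = np.random.default_rng(2)
n, d, r, m, m_c = 10, 3, 2, 4, 4

def mode_product(T, A, mode):
    T = np.moveaxis(T, mode, 0)
    out = A @ T.reshape(T.shape[0], -1)
    return np.moveaxis(out.reshape((A.shape[0],) + T.shape[1:]), 0, mode)

# Build an exactly rank-(r, r, r) test tensor X = [[G; U1, U2, U3]].
G = rng.standard_normal((r, r, r))
Us = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(d)]
X = G
for mode, U in enumerate(Us):
    X = mode_product(X, U, mode)

# --- Sketching phase (single pass over X) ---
Omegas = [[np.eye(n) if i == j else rng.standard_normal((m, n)) for i in range(d)]
          for j in range(d)]                       # Omega^(j,i)
Phis = [rng.standard_normal((m_c, n)) for _ in range(d)]
B = []                                             # leave-one-out measurement tensors B_j
for j in range(d):
    Bj = X
    for i in range(d):
        Bj = mode_product(Bj, Omegas[j][i], i)
    B.append(Bj)
B_core = X
for i in range(d):
    B_core = mode_product(B_core, Phis[i], i)

# --- Recovery phase (Algorithm 3.1; no further access to X) ---
Qs = []
for j in range(d):
    Fj = np.linalg.solve(Omegas[j][j], np.moveaxis(B[j], j, 0).reshape(n, -1))
    U, _, _ = np.linalg.svd(Fj, full_matrices=False)
    Qs.append(U[:, :r])                            # r leading left singular vectors

H = B_core
for i in range(d):
    Hi = np.moveaxis(H, i, 0).reshape(H.shape[i], -1)              # unfold along mode i
    Hi_new = np.linalg.lstsq(Phis[i] @ Qs[i], Hi, rcond=None)[0]   # undo Phi_i Q_i
    new_shape = (r,) + tuple(np.delete(np.array(H.shape), i))
    H = np.moveaxis(Hi_new.reshape(new_shape), 0, i)               # fold back

X_hat = H
for i in range(d):
    X_hat = mode_product(X_hat, Qs[i], i)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))               # ~ 0 for exactly low-rank X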
3.1.6 Projection Cost Preserving (PCP) Sketches

The following definition and related results are needed to conduct our probabilistic error analysis of the various leave-one-out measurements and Algorithm 3.1. We will rely heavily on definitions from randomized numerical linear algebra and compressive sensing that were introduced in Chapter 1.

Definition 3.2 (An $(\epsilon, c, r)$-Projection Cost Preserving (PCP) sketch). Let $\epsilon, c > 0$ and $r \in \mathbb{N}$. A matrix $\tilde{X} \in \mathbb{R}^{n \times m}$ is an $(\epsilon, c, r)$-PCP sketch of $X \in \mathbb{R}^{n \times N}$ if for all orthogonal projection matrices $P \in \mathbb{R}^{n \times n}$ with rank at most $r$,
$$(1-\epsilon)\|X - PX\|_F^2 \leq \left\|\tilde{X} - P\tilde{X}\right\|_F^2 + c \leq (1+\epsilon)\|X - PX\|_F^2 \tag{3.7}$$
holds.

Note that the above property first appeared in this form in [13, Definition 1] (see also, however, [17, Definition 13] for the statement of an equivalent property in a different form that appeared earlier). The next lemma can be used to construct random matrices that are PCP sketches of a given matrix $X$ with high probability. Before the lemma can be stated, however, we will need one additional definition.

Definition 3.3 (Head-Tail Split). For any $A \in \mathbb{R}^{m \times n}$, we can split $A$ into its leading $r$-term and its tail $(n-r)$-term Singular Value Decomposition (SVD) components. That is, consider the SVD $A = U\Sigma V^T$. For any $r \leq \mathrm{rank}(A)$, let $U_r \in \mathbb{R}^{m \times r}$ and $V_r \in \mathbb{R}^{n \times r}$ denote the first $r$ columns of $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$, respectively. We then define $A_r := U_r U_r^T A = A V_r V_r^T$ to be $A$'s best rank $r$ approximation with respect to $\|\cdot\|_F$. Furthermore, we denote the tail term by $A_{\backslash r} := A - A_r$.

We will now detail how matrices with the OSE and AMM properties (recall Definitions 1.23 and 1.25), interacting with matrices derived from $X \in \mathbb{R}^{n \times N}$, yield PCP sketches of $X$ with high probability. Variants of the following result are proven in [11, 53]. We include the proof in Section 3.5.2 for the sake of completeness.

Theorem 3.2 (Projection-Cost-Preservation via the AMM and OSE properties). Let $X \in \mathbb{R}^{n \times N}$ of rank $\tilde{r} \leq \min\{n, N\}$ have the full SVD $X = U\Sigma V^T$, and let $V_{r'} \in \mathbb{R}^{N \times r'}$ denote the first $r'$ columns of $V \in \mathbb{R}^{N \times N}$ for all $r' \in [N]$. Fix $r \in [n]$ and consider the head-tail split $X = X_r + X_{\backslash r}$. If $\Omega \in \mathbb{R}^{m \times N}$ satisfies
1. the subspace embedding property (1.10) with $\epsilon \leftarrow \frac{\epsilon}{3}$ for $A \leftarrow X_r^T$,
2. the approximate multiplication property (1.13) with $\epsilon \leftarrow \frac{\epsilon}{6\sqrt{\min\{r, \tilde{r}\}}}$ for $A \leftarrow X_{\backslash r}$ and $B \leftarrow V_{\min\{r, \tilde{r}\}}$,
3. the JL property (1.8) with $\epsilon \leftarrow \frac{\epsilon}{6}$ for $S \leftarrow \{\text{the } n \text{ columns of } X_{\backslash r}^T\}$, and
4. the approximate multiplication property (1.13) with $\epsilon \leftarrow \frac{\epsilon}{6\sqrt{r}}$ for $A \leftarrow X_{\backslash r}$ and $B \leftarrow X_{\backslash r}^T$,
then $\tilde{X} := X\Omega^T$ is an $(\epsilon, 0, r)$-PCP sketch of $X$.

Proof. See Section 3.5.2.

The following lemma can be used to construct PCP sketches from random matrices with the JL property.

Lemma 3.1 (The JL property provides PCP sketches). Let $X \in \mathbb{R}^{n \times N}$ have rank $\tilde{r} \leq \min\{n, N\}$. Fix $r \in [n]$. There exist finite sets $S_1, S_2 \subset \mathbb{R}^N$ (determined entirely by $X$) with cardinalities $|S_1| \leq \left(\frac{141}{\epsilon}\right)^{\min\{r, \tilde{r}\}}$ and $|S_2| \leq 16n^2 + n$ such that the following holds: if a random matrix $\Omega \in \mathbb{R}^{m \times N}$ has both the $\left(\frac{\epsilon}{6}, \frac{\delta}{2}, \left(\frac{141}{\epsilon}\right)^r\right)$-JL property for $S_1$ and the $\left(\frac{\epsilon}{6\sqrt{r}}, \frac{\delta}{2}, 16n^2 + n\right)$-JL property for $S_2$, then $X\Omega^T$ will be an $(\epsilon, 0, r)$-PCP sketch of $X$ with probability at least $1 - \delta$.

Proof. To ensure property 1 of Theorem 3.2 we can appeal to Lemma 1.6 to see that it suffices for $\Omega$ to be an $(\epsilon/6)$-JL-embedding of a minimal $\left(\frac{\epsilon}{48}\right)$-cover of the at most $\min\{r, \tilde{r}\}$-dimensional unit ball in the column space of $X_r^T$. Letting $S_1$ be this aforementioned $\left(\frac{\epsilon}{48}\right)$-cover, we can further see that $|S_1| \leq (141/\epsilon)^{\min\{r, \tilde{r}\}}$ by the proof of Corollary 1.1. Hence, if $\Omega \in \mathbb{R}^{m \times N}$ has the $\left(\frac{\epsilon}{6}, \frac{\delta}{2}, \left(\frac{141}{\epsilon}\right)^r\right)$-JL property for $S_1$, then property 1 of Theorem 3.2 will be satisfied with probability at least $1 - \frac{\delta}{2}$.

Applying Lemma 1.8, one can see that there exist sets $S_2', S_2'' \subset \mathbb{R}^N$ with $|S_2'| \leq 2(n + \min\{r, \tilde{r}\})^2 \leq 8n^2$ and $|S_2''| \leq 2(n + n)^2 = 8n^2$ such that an $(\epsilon/6\sqrt{r})$-JL-embedding of $S_2' \cup S_2''$ will satisfy both properties 2 and 4 of Theorem 3.2. Hence, since $r \geq 1$, we can see that an $(\epsilon/6\sqrt{r})$-JL-embedding of $S_2 := S_2' \cup S_2'' \cup S$ will satisfy Theorem 3.2's properties 2 - 4, where $S$ is defined as per property 3. Noting that $|S_2| \leq |S_2'| + |S_2''| + |S| \leq 16n^2 + n$, we can now see that $\Omega$ will satisfy all of Theorem 3.2's properties 2 - 4 with probability at least $1 - \frac{\delta}{2}$ if it has the $\left(\frac{\epsilon}{6\sqrt{r}}, \frac{\delta}{2}, 16n^2 + n\right)$-JL property for $S_2$.

Concluding, the prior two paragraphs in combination with the union bound imply that all of Theorem 3.2's properties 1 - 4 will hold with probability at least $1 - \delta$ if $\Omega$ has both the $\left(\frac{\epsilon}{6}, \frac{\delta}{2}, \left(\frac{141}{\epsilon}\right)^r\right)$-JL property for $S_1$ and the $\left(\frac{\epsilon}{6\sqrt{r}}, \frac{\delta}{2}, 16n^2 + n\right)$-JL property for $S_2$. An application of Theorem 3.2 now finishes the proof.

Using Lemma 3.1 one can now prove the following corollary of Theorem 1.1, which demonstrates the existence of a PCP sketch for any fixed matrix $X$.

Corollary 3.1. Fix $X \in \mathbb{R}^{n \times N}$ and $r \in [n]$. Let $\Omega \in \mathbb{R}^{m \times N}$ be a random matrix with mean zero, variance $\frac{1}{m}$, independent sub-Gaussian entries. Then, $X\Omega^T$ will be an $(\epsilon, 0, r)$-PCP sketch of $X$ with probability at least $1 - \delta$ provided that
$$m \geq \frac{C r}{\epsilon^2} \max\left\{ \ln\left(\frac{C_1}{\epsilon\delta}\right),\ \ln\left(\frac{C_2 n}{\delta}\right) \right\},$$
where $C_1, C_2 > 0$ are absolute constants, and $C > 0$ is an absolute constant that depends only on the sub-Gaussian norms (see Definition 1.22) of $\Omega$'s entries.

Proof. Apply Theorem 1.1 to the finite set $S_1$ guaranteed by Lemma 3.1 with $\epsilon \leftarrow \frac{\epsilon}{6}$ and $\delta \leftarrow \frac{\delta}{2}$. Similarly, apply Theorem 1.1 to the finite set $S_2$ guaranteed by Lemma 3.1 with $\epsilon \leftarrow \frac{\epsilon}{6\sqrt{r}}$ and $\delta \leftarrow \frac{\delta}{2}$. The result now follows by Lemma 3.1.
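The PCP property of Definition 3.2 can be probed numerically. The sketch below is illustrative only: it tests a small batch of random rank-$r$ projections rather than all projections, for an arbitrary synthetic matrix, and checks that a sub-Gaussian sketch $X\Omega^T$ of the kind covered by Corollary 3.1 keeps the projection costs $\|X - PX\|_F^2$ within a narrow multiplicative band.

import numpy as np

rng = np.random.default_rng(3)
n, N, r, m = 50, 2000, 3, 400

# A test matrix with decaying spectrum (so the rank-r tail is non-trivial).
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((N, n)))
S = np.diag(1.0 / np.arange(1, n + 1))
X = U @ S @ V.T

# Sub-Gaussian sketching matrix with i.i.d. mean-zero, variance-1/m entries.
Omega = rng.standard_normal((m, N)) / np.sqrt(m)
X_sk = X @ Omega.T

ratios = []
for _ in range(20):
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))   # random rank-r projector P = QQ^T
    P = Q @ Q.T
    cost = np.linalg.norm(X - P @ X, "fro") ** 2
    cost_sk = np.linalg.norm(X_sk - P @ X_sk, "fro") ** 2
    ratios.append(cost_sk / cost)
print(min(ratios), max(ratios))   # both close to 1, i.e. within a (1 +/- epsilon) band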
We finish here by noting that Corollary 3.1 is just one example of a PCP sketching result that one can prove with relative ease using Lemma 3.1. Indeed, Lemma 3.1 can be combined with other standard results concerning more structured matrices with the JL property (see, e.g., [38, 30, 3, 27]) to produce similar theorems where $\Omega$ has a fast matrix-vector multiply.

3.2 Proofs of Our Main Results on Leave-One-Out Measurements and Recovery

The objective is to show that we can retrieve an accurate low rank Tucker approximation of a tensor $\mathcal{X}$ via Algorithm 3.1 from valid sets of leave-one-out and core measurements as described in Section 3.1. We will denote the approximation of $\mathcal{X}$ output by Algorithm 3.1 as $\mathcal{X}_1$ here to emphasize that a single pass over the original data tensor $\mathcal{X}$ suffices in order to compute the linear input measurements required by Algorithm 3.1. Hence, Algorithm 3.1 in this setting is an example of a streaming algorithm which does not need to store a copy of the original uncompressed tensor $\mathcal{X}$ in memory in order to successfully approximate it. Nonetheless, we wish to show that this algorithm still produces a quasi-optimal approximation of $\mathcal{X}$ in the sense of (1.7) with high probability when given such highly compressed linear input measurements. Particular choices of measurement ensembles will make explicit the dependence on the other parameters of the problem (these choices define specializations of Algorithm 3.1, and are summarized as Algorithms 3.9 and 3.11 in Section 3.6).

More specifically, in this section we will show that, with high probability, for a given $d$-mode tensor $\mathcal{X}$, error tolerance $\epsilon > 0$, and chosen rank truncation parameter $r$,
$$\|\mathcal{X} - \mathcal{X}_1\|_2 \leq (1 + e^\epsilon)\sqrt{\frac{1+\epsilon}{1-\epsilon}\sum_{j=1}^{d} \Delta_{r,j}} \tag{3.8}$$
will hold whenever Algorithm 3.1 is provided with sufficiently informative input measurements. Here the $\Delta_{r,j}$ are defined as per Lemma 1.5, and "sufficiently informative" means that $(i)$ the leave-one-out measurements used to form $\mathcal{X}_1$ are of sufficient size to satisfy several PCP properties, and $(ii)$ the core measurements used to form $\mathcal{X}_1$ are of sufficient size to ensure the accurate solution of the least squares problems computed as part of Algorithm 3.1. Finally, we note that one can see from (3.8) together with Lemma 1.5 that Algorithm 3.1 will perfectly recover exactly low Tucker-rank tensors if the rank parameter $r$ is made sufficiently large.

In order to prove that (3.8) holds, we will also need to consider a weaker variant of Algorithm 3.1 which permits a second pass over the data tensor $\mathcal{X}$. These weaker algorithms will first compute the estimates $Q_i$ of the factors of the tensor as Algorithm 3.1 does, but thereafter will be allowed to use those factors to operate on the original tensor $\mathcal{X}$ in order to approximate its core (see Algorithms 3.10 and 3.12). We denote the estimate of the tensor that results from this procedure as $\mathcal{X}_2$ to emphasize that it requires a second access to the original data during core recovery. Note that such two-pass algorithms are of less practical value in the big data and compressive sensing settings, since it is often not possible to directly access the data tensor again after the initial compressed measurements have been taken in these scenarios. Nevertheless, this two-pass estimate will be extremely useful when proving (3.8).
In particular, our proof will result from the following triangle inequality:
$$\|\mathcal{X} - \mathcal{X}_1\|_2 = \|\mathcal{X} - \mathcal{X}_1 + \mathcal{X}_2 - \mathcal{X}_2\|_2 \leq \underbrace{\|\mathcal{X} - \mathcal{X}_2\|_2}_{\text{Term I}} + \underbrace{\|\mathcal{X}_1 - \mathcal{X}_2\|_2}_{\text{Term II}}. \tag{3.9}$$
Bounding Term I in (3.9) will be the subject of Section 3.2.1. As we shall see, bounding Term I is straightforward once we know that the leave-one-out measurements yield various PCP properties, and so the main work of Sections 3.2.2 and 3.2.3 will be to demonstrate how, for the structured choices of measurement operators considered herein, we can ensure that the PCP property is satisfied. Bounding Term II, on the other hand, will require us to apply a bound on the error incurred by solving sketched least squares problems to a carefully partitioned re-expression of the Term II error. That argument is the subject of Section 3.2.4. Finally, we combine our analysis of these two error terms along with particular choices of measurement operators to state the full versions of our main results in Section 3.2.5.

3.2.1 Bounding $\|\mathcal{X} - \mathcal{X}_2\|_2$

In the two-pass scenario, we first compute estimates for the factor matrices $Q_i$ (see Algorithm 3.5) using leave-one-out measurements $B_i$ for each $i \in [d]$. Then, using these factor matrices, we wish to solve
$$\arg\min_{\mathcal{H}} \|\mathcal{X} - [[\mathcal{H}, Q_1, Q_2, \dots, Q_d]]\|_2.$$
One can see that the solution will be
$$\mathcal{G} := \mathcal{X} \times_1 Q_1^T \times_2 Q_2^T \cdots \times_d Q_d^T$$
(see, e.g., [16]). Let
$$\mathcal{X}_2 := \mathcal{G} \times_1 Q_1 \times_2 Q_2 \cdots \times_d Q_d \tag{3.10}$$
denote the estimate obtained from a two-pass recovery procedure (i.e., Algorithm 3.10 or 3.12). Additionally, we note the following fact about modewise products (see, e.g., [67, Lemma 1]): $\mathcal{X} \times_i A \times_i B = \mathcal{X} \times_i (BA)$. As a result, if we are permitted a second pass over $\mathcal{X}$ to compute the core, we have that
$$\mathcal{X}_2 = \left( \mathcal{X} \times_1 Q_1^T \times_2 Q_2^T \cdots \times_d Q_d^T \right) \times_1 Q_1 \times_2 Q_2 \cdots \times_d Q_d = \mathcal{X} \times_1 Q_1 Q_1^T \times_2 \cdots \times_d Q_d Q_d^T.$$
Using this expression we can now bound the two-pass error term $\|\mathcal{X} - \mathcal{X}_2\|_2$.

Theorem 3.3 (Error bound for the Two-Pass Estimate $\|\mathcal{X} - \mathcal{X}_2\|_2$). Suppose $\tilde{X}_{[j]} := X_{[j]}\Omega_{-j}^T \in \mathbb{R}^{n \times m^{d-1}}$ are $(\epsilon, 0, r)$-PCP sketches of $X_{[j]}$ for each $j \in [d]$. If $Q_j \in \mathbb{R}^{n \times r}$ for $r \leq m$ are factor matrices obtained from Algorithm 3.5, then
$$\|\mathcal{X} - \mathcal{X}_2\|_2 = \left\| \mathcal{X} - \mathcal{X} \times_1 Q_1 Q_1^T \times_2 \cdots \times_d Q_d Q_d^T \right\|_2 \leq \sqrt{\frac{1+\epsilon}{1-\epsilon}\sum_{j=1}^{d} \Delta_{r,j}}. \tag{3.11}$$

Proof. Since the $Q_i Q_i^T$ are orthogonal projectors, we have by [62, Theorem 5.1] that
$$\left\| \mathcal{X} - \mathcal{X} \times_1 Q_1 Q_1^T \times_2 \cdots \times_d Q_d Q_d^T \right\|_2^2 \leq \sum_{j=1}^{d} \left\| \mathcal{X} - \mathcal{X} \times_j Q_j Q_j^T \right\|_2^2. \tag{3.12}$$
From Algorithm 3.5 (recalling that $\Omega_{(j,j)}^{-1} B_j = X_{[j]}\Omega_{-j}^T = \tilde{X}_{[j]}$), we have that the $Q_j$'s are the best rank-$r$ approximations for their respective sketched problems, since
$$Q_j = \arg\min_{\substack{\mathrm{rank}(Q) = r \\ Q^T Q = I_r}} \left\| \tilde{X}_{[j]} - Q Q^T \tilde{X}_{[j]} \right\|_F$$
by the Eckart-Young Theorem. Now suppose that each $U_j \in \mathbb{R}^{n \times r}$ forms an optimal rank $r$ approximation of $X_{[j]}$ in the sense that
$$U_j = \arg\min_{\substack{\mathrm{rank}(U) = r \\ U^T U = I_r}} \left\| X_{[j]} - U U^T X_{[j]} \right\|_F.$$
By the hypothesis that $\tilde{X}_{[j]}$ is an $(\epsilon, 0, r)$-PCP sketch of $X_{[j]}$, we have that
$$(1-\epsilon)\left\| \mathcal{X} - \mathcal{X} \times_j Q_j Q_j^T \right\|_2^2 = (1-\epsilon)\left\| X_{[j]} - Q_j Q_j^T X_{[j]} \right\|_F^2 \leq \left\| \tilde{X}_{[j]} - Q_j Q_j^T \tilde{X}_{[j]} \right\|_F^2 \leq \left\| \tilde{X}_{[j]} - U_j U_j^T \tilde{X}_{[j]} \right\|_F^2 \leq (1+\epsilon)\left\| X_{[j]} - U_j U_j^T X_{[j]} \right\|_F^2 = (1+\epsilon)\,\Delta_{r,j},$$
where we have used the definition of $(\epsilon, 0, r)$-PCP sketches in the first and third inequalities. After a rearrangement of terms, substituting the above into (3.12) now yields the inequality in (3.11).

We have now established in Theorem 3.3 a quasi-optimal error bound for Term I in (3.9) whenever our leave-one-out measurement matrices $\Omega_{-j}^T$ yield $(\epsilon, 0, r)$-PCP sketches of all $d$ unfoldings $X_{[j]}$. Next, we will demonstrate how to ensure that Kronecker structured and Khatri-Rao structured leave-one-out measurement matrices provide PCP sketches.

3.2.2 PCP Sketches via Kronecker-Structured Leave-One-Out Measurement Matrices

In this section we study when Kronecker-structured measurement matrices will provide the PCP property. To begin, we will show that the JL and OSE properties are inherited under matrix direct sums and compositions. These are useful facts because our overall leave-one-out matrices can be constructed using these operations. In particular, we will follow the example in the last paragraph of Section 3.1.1 and consider a matrix $\Omega_{-j} \in \mathbb{R}^{m^{d-1} \times n^{d-1}}$ defined as
$$\Omega_{-j} = \bigotimes_{\substack{i=1 \\ i \neq j}}^{d} \Omega_i = \prod_{i'=1}^{d-1} \tilde{\Omega}_{i'} \quad \text{for } \Omega_i \in \mathbb{R}^{m \times n}, \tag{3.13}$$
where
$$\tilde{\Omega}_{i'} := \underbrace{I_n \otimes \cdots \otimes I_n}_{d-1-i'} \otimes\; \Omega_{i_j(i')} \otimes \underbrace{I_m \otimes \cdots \otimes I_m}_{i'-1} \in \mathbb{R}^{m^{i'} n^{d-1-i'} \times m^{i'-1} n^{d-i'}} \tag{3.14}$$
for
$$i_j(i') := \begin{cases} i' & \text{if } i' < j, \\ i' + 1 & \text{if } i' \geq j. \end{cases}$$
Here, $I_n$ denotes an $n \times n$ identity matrix.
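The factorization (3.13)-(3.14) is easy to sanity check numerically; the short NumPy sketch below (illustrative only) builds $\Omega_{-j}$ for a four mode example with $j = 2$ both as a single Kronecker product and as the product of the three auxiliary matrices $\tilde{\Omega}_{i'}$, and confirms that the two agree via the mixed-product property.

import numpy as np

rng = np.random.default_rng(4)
n, m = 4, 2
d, j = 4, 2                      # four modes, mode j = 2 is the one left out

# Component maps for the compressed modes (indices 1, 3, 4 in the text's notation).
Om1 = rng.standard_normal((m, n))
Om3 = rng.standard_normal((m, n))
Om4 = rng.standard_normal((m, n))

# Omega_{-j} as a single Kronecker product (3.13), with factors ordered as in the
# three mode illustration of Section 3.1.1 (highest mode index on the left).
Omega_minus_j = np.kron(Om4, np.kron(Om3, Om1))          # shape (m^3, n^3)

# The same map as the product of the auxiliary matrices (3.14).
I_n, I_m = np.eye(n), np.eye(m)
Om_t1 = np.kron(I_n, np.kron(I_n, Om1))                  # (m n^2) x n^3
Om_t2 = np.kron(I_n, np.kron(Om3, I_m))                  # (m^2 n) x (m n^2)
Om_t3 = np.kron(Om4, np.kron(I_m, I_m))                  # m^3 x (m^2 n)

print(np.allclose(Omega_minus_j, Om_t3 @ Om_t2 @ Om_t1))  # True (mixed-product property)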
The next three lemmas will be used to help show that $\Omega_{-j}$ as defined in (3.13) inherits both the JL and OSE properties from its component $\Omega_i$ matrices. Having established this, we can then use, e.g., Lemma 3.1 to prove PCP sketching results for such Kronecker-structured $\Omega_{-j}$.

Lemma 3.2. Suppose that $\Omega_1 \in \mathbb{R}^{m_1 \times N_1}$ and $\Omega_2 \in \mathbb{R}^{m_2 \times N_2}$ are two random matrices. Denote their matrix direct sum by $\Omega = \Omega_1 \oplus \Omega_2 \in \mathbb{R}^{(m_1 + m_2) \times (N_1 + N_2)}$. Then,
1. If $\Omega_1$ and $\Omega_2$ are $(\epsilon, \delta_1, p)$- and $(\epsilon, \delta_2, p)$-JLs respectively, then $\Omega$ is an $(\epsilon, \delta_1 + \delta_2, p)$-JL.
2. If $\Omega_1$ and $\Omega_2$ are $(\epsilon, \delta_1, r)$- and $(\epsilon, \delta_2, r)$-OSEs respectively, then $\Omega$ is an $(\epsilon, \delta_1 + \delta_2, r)$-OSE.

Proof. Part 1: Consider a set $S \subset \mathbb{R}^{N_1 + N_2}$ with cardinality $p$. Let $\mathbf{z} \in S$. Group the first $N_1$ coordinates of $\mathbf{z}$ into $\mathbf{x} \in \mathbb{R}^{N_1}$ and the last $N_2$ coordinates of $\mathbf{z}$ into $\mathbf{y} \in \mathbb{R}^{N_2}$. Observe that
$$\|\Omega \mathbf{z}\|_2^2 = \left\| \begin{bmatrix} \Omega_1 & 0 \\ 0 & \Omega_2 \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \right\|_2^2 = \|\Omega_1 \mathbf{x}\|_2^2 + \|\Omega_2 \mathbf{y}\|_2^2 \leq (1+\epsilon)\left( \|\mathbf{x}\|_2^2 + \|\mathbf{y}\|_2^2 \right) = (1+\epsilon)\|\mathbf{z}\|_2^2$$
will hold whenever both $\|\Omega_1 \mathbf{x}\|_2^2 \leq (1+\epsilon)\|\mathbf{x}\|_2^2$ and $\|\Omega_2 \mathbf{y}\|_2^2 \leq (1+\epsilon)\|\mathbf{y}\|_2^2$ hold. The $(1-\epsilon)$-distortion lower bound is similar. As a result, we can use the union bound to see that $\Omega$ will have the $(\epsilon, \delta_1 + \delta_2, p)$-JL property.

Part 2: Suppose $X \in \mathbb{R}^{n \times (N_1 + N_2)}$ has rank $r$. Let $X_1$ and $X_2$ denote the sub-matrices of $X$ containing the first $N_1$ and last $N_2$ columns of $X$, respectively. Note that both $X_1$ and $X_2$ have rank at most $r$. Furthermore, note also that
$$\left\| \Omega X^T \mathbf{y} \right\|_2^2 = \left\| \begin{pmatrix} \Omega_1 X_1^T \mathbf{y} \\ \Omega_2 X_2^T \mathbf{y} \end{pmatrix} \right\|_2^2 \leq (1+\epsilon)\left( \left\| X_1^T \mathbf{y} \right\|_2^2 + \left\| X_2^T \mathbf{y} \right\|_2^2 \right) = (1+\epsilon)\left\| X^T \mathbf{y} \right\|_2^2$$
will hold for any arbitrary vector $\mathbf{y} \in \mathbb{R}^n$ whenever both $\|\Omega_1 X_1^T \mathbf{y}\|_2^2 \leq (1+\epsilon)\|X_1^T \mathbf{y}\|_2^2$ and $\|\Omega_2 X_2^T \mathbf{y}\|_2^2 \leq (1+\epsilon)\|X_2^T \mathbf{y}\|_2^2$ hold. The $(1-\epsilon)$-distortion lower bound is similar. As a result, we can see that $\Omega$ will be an $(\epsilon, \delta_1 + \delta_2, r)$-OSE by the union bound.

Note that there is no requirement that $\Omega_1$ and $\Omega_2$ be independent in Lemma 3.2. This is crucial for the next lemma, which will involve many copies of the same measurement matrix.

Lemma 3.3 (Direct Sums Inherit the OSE and JL Properties). For some $i' \in [d-1]$, let $\tilde{\Omega}_{i'}$ be defined as in (3.14) and set $\Omega := \Omega_{i_j(i')} \in \mathbb{R}^{m \times n}$.
1. If $\Omega$ has the $\left( \epsilon, \frac{\delta}{m^{i'-1} n^{d-i'-1}}, r \right)$-OSE property, then $\tilde{\Omega}_{i'}$ will have the $(\epsilon, \delta, r)$-OSE property.
2. If $\Omega$ has the $\left( \epsilon, \frac{\delta}{m^{i'-1} n^{d-i'-1}}, p \right)$-JL property, then $\tilde{\Omega}_{i'}$ will have the $(\epsilon, \delta, p)$-JL property.

Proof. First consider the following rearrangement of $\tilde{\Omega}_{i'}$ in (3.14):
$$\tilde{\Omega} := \underbrace{I_m \otimes I_m \otimes \cdots \otimes I_m}_{i'-1} \otimes \underbrace{I_n \otimes I_n \otimes \cdots \otimes I_n}_{d-1-i'} \otimes\; \Omega.$$
Note that the Kronecker product of two identity matrices is itself an identity matrix. Thus, we can rewrite this as simply
$$\tilde{\Omega} = \underbrace{I_m \otimes \cdots \otimes I_m}_{i'-1} \otimes \underbrace{I_n \otimes \cdots \otimes I_n}_{d-1-i'} \otimes\; \Omega = \begin{bmatrix} \Omega & 0 & \cdots & 0 \\ 0 & \Omega & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \Omega \end{bmatrix}.$$
That is, we have a block diagonal matrix with $\bar{m} = m^{i'-1} n^{d-i'-1}$ copies of $\Omega$ along its diagonal. Thus, if $\Omega$ has either the $(\epsilon, \delta/\bar{m}, r)$-OSE or the $(\epsilon, \delta/\bar{m}, p)$-JL property, repeated applications of Lemma 3.2 will establish the desired OSE or JL property for $\tilde{\Omega}$. Now consider $\tilde{\Omega}_{i'}$ as in (3.14).
There exist unitary (permutation) matrices $L$ and $R$ which interchange rows and columns such that
$$L \tilde{\Omega} R = L\left( \underbrace{I_m \otimes \cdots \otimes I_m}_{i'-1} \otimes \underbrace{I_n \otimes \cdots \otimes I_n}_{d-1-i'} \otimes\; \Omega \right) R = L\left( I_{m^{i'-1} n^{d-1-i'}} \otimes \Omega \right) R = \underbrace{I_n \otimes \cdots \otimes I_n}_{d-1-i'} \otimes\; \Omega \otimes \underbrace{I_m \otimes \cdots \otimes I_m}_{i'-1} = I_{n^{d-1-i'}} \otimes \Omega \otimes I_{m^{i'-1}} = \tilde{\Omega}_{i'}.$$
Noting that both the OSE and JL properties are invariant under unitary transformations of a given random matrix, one can now see that $\tilde{\Omega}_{i'} = L \tilde{\Omega} R$ will indeed have the same desired OSE or JL property as was established for $\tilde{\Omega}$.

Lemma 3.3 allows us to infer JL and OSE properties of the $\tilde{\Omega}_{i'}$ matrices in (3.14) from the properties of the smaller random matrices $\Omega_i \in \mathbb{R}^{m \times n}$ appearing in (3.13). The next lemma will allow us to then use these inferred properties of the $\tilde{\Omega}_{i'}$ matrices to derive OSE and JL properties for $\Omega_{-j}$ from (3.13) in terms of the properties of its components $\Omega_i \in \mathbb{R}^{m \times n}$.

Lemma 3.4 (A Composition Lemma for the OSE and JL Properties). Let $\epsilon \in (0,1)$ and $\tilde{\Omega}_{i'} \in \mathbb{R}^{m^{i'} n^{d-1-i'} \times m^{i'-1} n^{d-i'}}$ for $i' \in [d-1]$.
1. If $\tilde{\Omega}_{i'}$ is an $\left( \frac{\epsilon}{2(d-1)}, \frac{\delta}{d-1}, r \right)$-OSE for all $i' \in [d-1]$, then $\tilde{\Omega} = \prod_{i'=1}^{d-1} \tilde{\Omega}_{i'}$ is an $(\epsilon, \delta, r)$-OSE.
2. If $\tilde{\Omega}_{i'}$ is an $\left( \frac{\epsilon}{2(d-1)}, \frac{\delta}{d-1}, p \right)$-JL for all $i' \in [d-1]$, then $\tilde{\Omega} = \prod_{i'=1}^{d-1} \tilde{\Omega}_{i'}$ is an $(\epsilon, \delta, p)$-JL.

Proof. Part 1: Let $Y^T \in \mathbb{R}^{n^{d-1} \times n}$ be an arbitrary matrix of rank at most $r$. Denote $\tilde{Y}_{i'} = \left( \tilde{\Omega}_{i'} \cdots \tilde{\Omega}_1 \right) Y^T \in \mathbb{R}^{m^{i'} n^{d-1-i'} \times n}$ for $i' \in [d-1]$. Note that each $\tilde{Y}_{i'}$ has rank at most $r$. Fix some $\mathbf{z} \in \mathbb{R}^n$. Supposing for the moment that (1.10) holds for each $i' \in [d-1]$ with $\Omega = \tilde{\Omega}_{i'}$, $A = \tilde{Y}_{i'-1}$, and $\mathbf{x} = \mathbf{z}$, we have that
$$\left\| \tilde{\Omega} Y^T \mathbf{z} \right\|_2^2 = \left\| \tilde{\Omega}_{d-1} \left( \tilde{\Omega}_{d-2} \cdots \tilde{\Omega}_1 \right) Y^T \mathbf{z} \right\|_2^2 = \left\| \tilde{\Omega}_{d-1} \tilde{Y}_{d-2} \mathbf{z} \right\|_2^2 \leq \left( 1 + \frac{\epsilon}{2(d-1)} \right) \left\| \tilde{Y}_{d-2} \mathbf{z} \right\|_2^2 = \left( 1 + \frac{\epsilon}{2(d-1)} \right) \left\| \tilde{\Omega}_{d-2} \tilde{Y}_{d-3} \mathbf{z} \right\|_2^2 \leq \cdots \leq \left( 1 + \frac{\epsilon}{2(d-1)} \right)^{d-1} \left\| Y^T \mathbf{z} \right\|_2^2 \leq \frac{1}{1 - \epsilon/2} \left\| Y^T \mathbf{z} \right\|_2^2 \leq (1+\epsilon)\left\| Y^T \mathbf{z} \right\|_2^2,$$
where we have used the general bound $(1 + k/n)^n \leq e^k \leq (1-k)^{-1}$ for $k \in [0,1)$ in the second to last inequality. Similarly, for a lower bound one can see that
$$\left\| \tilde{\Omega} Y^T \mathbf{z} \right\|_2^2 \geq \left( 1 - \frac{\epsilon}{2(d-1)} \right)^{d-1} \left\| Y^T \mathbf{z} \right\|_2^2 \geq (1-\epsilon)\left\| Y^T \mathbf{z} \right\|_2^2.$$
Union bounding over the failure probability that (1.10) holds for each $i' \in [d-1]$ as supposed above now yields the desired result.

Part 2: An essentially identical argument applies to obtain the desired JL property result.

We now have all the necessary results to show how the component maps $\Omega_i$ of a Kronecker structured measurement ensemble as in (3.13) can guarantee a Kronecker sketch with the projection cost preserving property.

Theorem 3.4 (Kronecker Products of JL Matrices Yield PCP Sketches). Let $\epsilon \in (0,1)$, let $X \in \mathbb{R}^{n \times n^{d-1}}$ have rank $r \in [n]$, and let $\Omega_{-j} \in \mathbb{R}^{m^{d-1} \times n^{d-1}}$ be defined as in (3.13) and (3.14). Furthermore, suppose that the $\Omega_i \in \mathbb{R}^{m \times n}$ in (3.13) have both
1. the $\left( \frac{\epsilon}{12(d-1)}, \frac{\delta}{2(d-1)n^{d-2}}, \left( \frac{141}{\epsilon} \right)^r \right)$-JL property, and
2. the $\left( \frac{\epsilon}{12\sqrt{r}(d-1)}, \frac{\delta}{2(d-1)n^{d-2}}, 16n^2 + n \right)$-JL property
for all $i' \in [d-1]$. Then, $X\Omega_{-j}^T$ will be an $(\epsilon, 0, r)$-PCP sketch of $X$ with probability at least $1 - \delta$.

Proof. By Lemma 3.1 we know that $X\Omega_{-j}^T$ will be an $(\epsilon, 0, r)$-PCP sketch of $X$ with probability at least $1 - \delta$ if $\Omega_{-j}$ has both the $\left( \frac{\epsilon}{6}, \frac{\delta}{2}, \left( \frac{141}{\epsilon} \right)^r \right)$-JL property and the $\left( \frac{\epsilon}{6\sqrt{r}}, \frac{\delta}{2}, 16n^2 + n \right)$-JL property. In fact, by Lemma 3.4 we can further see that it suffices for the $\tilde{\Omega}_{i'}$ from (3.13) and (3.14) to have both
1. the $\left( \frac{\epsilon}{12(d-1)}, \frac{\delta}{2(d-1)}, \left( \frac{141}{\epsilon} \right)^r \right)$-JL property, and
2. the $\left( \frac{\epsilon}{12\sqrt{r}(d-1)}, \frac{\delta}{2(d-1)}, 16n^2 + n \right)$-JL property
for all $i' \in [d-1]$. Finally, looking now at Lemma 3.3 for each $i' \in [d-1]$, we can see that the assumed properties of the $\Omega_i \in \mathbb{R}^{m \times n}$ in (3.13) will guarantee both of these sufficient conditions.

The following corollary of Theorem 3.4 guarantees a Kronecker sketch with the projection cost preserving property when the component matrices $\Omega_i$ in (3.13) are sub-Gaussian random matrices.

Corollary 3.2 (Kronecker Products of Sub-Gaussian Matrices Yield PCP Sketches).
Suppose X is a real valued ๐‘‘-mode tensor with side-lengths all equal to ๐‘›. Let ๐œ– โˆˆ (0, 1), ๐›ฟ โˆˆ (0, 1), ๐‘Ÿ โˆˆ [๐‘›], ๐‘— โˆˆ [๐‘‘]. If ฮฉโˆ’ ๐‘— = (cid:203)๐‘‘ ฮฉ๐‘– โˆˆ R๐‘š๐‘‘โˆ’1ร—๐‘›๐‘‘โˆ’1 defined as in (3.13) with random matrices ฮฉ๐‘– โˆˆ R๐‘šร—๐‘› ๐‘–=1 ๐‘–โ‰  ๐‘— having i.i.d centered variance ๐‘šโˆ’1, sub-Gaussian entries such that ๐‘š โ‰ฅ max (cid:26) ๐ถ1๐‘Ÿ (๐‘‘ โˆ’ 1)2 ๐œ– 2 (cid:18) ๐‘›๐‘‘ (๐‘‘ โˆ’ 1) ๐›ฟ (cid:19) , ๐ถ2(๐‘‘ โˆ’ 1)2 ๐œ– 2 ln (cid:18)(cid:18) 141 ๐œ– (cid:19)๐‘Ÿ ๐‘›๐‘‘โˆ’2(๐‘‘ โˆ’ 1) ๐›ฟ ln (cid:19)(cid:27) for absolute constants ๐ถ1, ๐ถ2 > 0 then the sketched unfolding หœ๐‘‹[ ๐‘—] = ๐‘‹[ ๐‘—]ฮฉ๐‘‡ (๐œ–, 0, ๐‘Ÿ)-PCP sketch of ๐‘‹[ ๐‘—] with probability at least 1 โˆ’ ๐›ฟ. โˆ’ ๐‘— โˆˆ R๐‘›ร—๐‘š๐‘‘โˆ’1 is an Proof. To obtain the first quantity maximized over we apply Theorem 1.1 with ๐œ– โ† ๐œ– ๐‘Ÿ (๐‘‘โˆ’1) , โˆš 2(๐‘‘โˆ’1)๐‘›๐‘‘โˆ’2 , and |๐‘†| โ† 16๐‘›2 + ๐‘›. Similarly, for the second quantity maximized over we apply . The result now follows from Theorem 1.1 with ๐œ– โ† ๐œ– ๐›ฟ โ† (cid:17)๐‘Ÿ 12 ๐›ฟ ๐›ฟ 2(๐‘‘โˆ’1)๐‘›๐‘‘โˆ’2 , and |๐‘†| โ† 12(๐‘‘โˆ’1) , ๐›ฟ โ† (cid:16) 141 ๐œ– Theorem 3.4. Remark 3.3. Note that Theorem 3.4 is quite general, requiring only that the random matrices ฮฉ๐‘– in (3.13) should be drawn from some distribution having a couple JL properties. As a result of this generality, its Corollary 3.2 concerning sub-Gaussian component matrices turns out to be sub-optimal by (at least a) factor of ๐‘‘ in that setting. To obtain a slightly sharper result in ๐‘‘ for sub- Gaussian ฮฉ๐‘– we recommend replacing our implicit use of Lemma 3.3 in the proof of Corollary 3.2 (via Theorem 3.4) with [33, Lemma 14] instead. 3.2.3 PCP Sketches via Khatri-Rao Structured Leave-one-out Measurement Matrices In this section we study how to ensure that Khatri-Rao structured leave-one-out measurement matrices will provide the PCP property. To start we will first show that random Khatri-Rao structured measurement maps, denoted in this section by ฮฉโˆ’ ๐‘— = 1 ๐‘š ฮฉ1 โ€ข ฮฉ2 โ€ข . . . ฮฉ ๐‘—โˆ’1 โ€ข ฮฉ ๐‘—+1 โ€ข ฮฉ๐‘‘ where ฮฉ๐‘– โˆˆ R๐‘šร—๐‘› โˆ€๐‘– โˆˆ [๐‘‘] \ { ๐‘— }, will have the JL property whenever all their component matrices ฮฉ๐‘– have i.i.d. sub-Gaussian entries. Having established this, we can then use, e.g., Lemma 3.1 to prove PCP sketching results for such Khatri-Rao Structured ฮฉโˆ’ ๐‘— . 75 ๐‘š ฮฉ1 โ€ข ฮฉ2 โ€ข . . . ฮฉ ๐‘—โˆ’1 โ€ข ฮฉ ๐‘—+1 โ€ข ฮฉ๐‘‘ where all the Theorem 3.5. Let ๐œ– > 0, 0 < ๐›ฟ โ‰ค ๐‘’โˆ’2, and ฮฉโˆ’ ๐‘— = 1 ฮฉ๐‘– โˆˆ R๐‘šร—๐‘› for ๐‘– โˆˆ [๐‘‘] \ { ๐‘— } have i.i.d. mean zero, variance one, sub-Gaussian entries. Then ฮฉโˆ’ ๐‘— is an (๐œ–, ๐›ฟ, ๐‘˜)-JL whenever ๐‘š โ‰ฅ ๐ถ ๐‘‘โˆ’1 max (cid:40) ๐œ– โˆ’2 log ๐‘˜ ๐›ฟ , ๐œ– โˆ’1 (cid:18) log ๐‘˜ ๐›ฟ (cid:19) ๐‘‘โˆ’1(cid:41) for a constant ๐ถ โˆˆ R+ that depends only on the sub-Gaussian norm of the i.i.d. ฮฉ๐‘–-entries. The proof of Theorem 3.5 largely follows the argument proposed in Section 2 of [1] concerning the so-called ๐‘-moment JL property of Khatri-Rao structured measurements. We note that Kro- necker products of sub-Gaussian vectors are not sub-Gaussian in general, so the general idea is to use Markovโ€™s inequality for higher moments of the norm of ฮฉโˆ’ ๐‘— y with a fixed y โˆˆ R๐‘› to obtain the desired result. To proceed with the argument, we will need the following two concentration results. 
To proceed with the argument, we will need the following two concentration results. Before stating them, we recall the standard definition that the $p$-th root of the $p$-th (absolute) moment of a random variable is referred to as the $L^p$-norm of the random variable; for a random variable $X$ it is denoted
\[
\|X\|_{L^p} := \left( \mathbb{E}|X|^p \right)^{1/p} . \tag{3.15}
\]
This is of course because the expectation of a random variable on a probability space is by definition the Lebesgue integral of a function from the probability space to the reals, and so the usual theory of Lebesgue integration and $L^p$ spaces applies to random variables. Now, onto our needed concentration results.

Lemma 3.5 (Lemma 19 in [33]). Let $\mathcal{Y}$ be a $(d-1)$-mode tensor with side lengths of size $n$, let $p \ge 1$, and let $\Omega_j(i,:) \in \mathbb{R}^n$ for $j \in [d-1]$, $i \in [m]$ be independent random vectors each satisfying the Khintchine inequality $\left\| \langle \Omega_j(i,:), \mathbf{y} \rangle \right\|_{L^p} \le C_p \|\mathbf{y}\|_2$ for any vector $\mathbf{y} \in \mathbb{R}^n$, where $C_p$ is a constant depending only on $p$. Then
\[
\left\| \langle \Omega_1(i,:) \otimes \Omega_2(i,:) \otimes \cdots \otimes \Omega_{d-1}(i,:),\, \operatorname{vec}(\mathcal{Y}) \rangle \right\|_{L^p} \le C_p^{d-1} \|\mathcal{Y}\|_2 .
\]

Lemma 3.6 (Corollary 2 in [41]). If $p \ge 2$ and $Z, Z_1, \ldots, Z_m$ are i.i.d. symmetric random variables, then we have
\[
\left\| \sum_{i=1}^m Z_i \right\|_{L^p} \le C \sup_{s \in [\max\{2,\, p/m\},\, p]} \left\{ \frac{p}{s} \left( \frac{m}{p} \right)^{1/s} \|Z\|_{L^s} \right\}.
\]
Here $C > 0$ is an absolute constant.

In particular, we will utilize the following corollary of Lemma 3.6.

Corollary 3.3. If, under the conditions of Lemma 3.6, we additionally know that $\|Z\|_{L^s} \le (Cs)^{d-1}$, then
\[
\left\| \frac{1}{m} \sum_{i=1}^m Z_i \right\|_{L^p} \le C^{d-1} \max\left\{ 2^{d-1} \sqrt{\frac{p}{m}},\ \frac{(pe)^{d-1}}{m} \right\}.
\]

Proof. The proof of this corollary loosely follows the argument presented in [1]. Since $\|Z\|_{L^s} \le (Cs)^{d-1}$ and $Z_i \sim Z$, the expression over which we are taking the supremum in Lemma 3.6 is bounded (up to constants) by a function of $s$ whose derivative changes sign at most once on the interval of interest. That is, the derivative
\[
\frac{\partial}{\partial s} \left[ \frac{p}{s} \left( \frac{m}{p} \right)^{1/s} s^{d-1} \right]
= p\, s^{d-4} \left( \frac{m}{p} \right)^{1/s} \left( (d-2)s - \log\frac{m}{p} \right)
\]
has at most a single root in the interval of interest, at $s = \frac{\log(m/p)}{d-2}$. Noting the sign change at this root, we conclude that the maximum value must occur at the endpoints of the interval, and cannot occur at the critical point that is interior to the interval. Evaluating the function of interest at $s = 2$ we obtain $2^{d-2}\sqrt{mp}$; we will further upper-bound this by $2^{d-1}\sqrt{mp}$ in order to simplify the analysis of the right endpoint. Additionally, since we are interested only in an upper bound, we need not evaluate the possible endpoint $s = p/m$, since if $2 < p/m$ we are only enlarging the interval over which we are maximizing by instead considering $s = 2$. We will now bound the expression at the right endpoint, $s = p \ge 2$. Clearly $(1/p)^{1/p} \le 1$, and thus
\[
\left( \frac{m}{p} \right)^{1/p} p^{d-1} \le m^{1/p} p^{d-1} .
\]
If the function value at the right endpoint actually dominates $2^{d-1}\sqrt{mp}$ (i.e., our upper bound on the function value at the left endpoint), then $m^{1/p} p^{d-1} \ge 2^{d-1}\sqrt{mp}$ must hold. We will now use this assumption to remove the dependence on $m$ in our current upper bound for the function value at the right endpoint, after noting that doing so will still yield a valid upper bound whenever the function's value at the right endpoint fails to already be bounded by $2^{d-1}\sqrt{mp}$. Proceeding as planned, our assumption yields that
\[
m^{1/2 - 1/p} \le \left( \frac{p}{2} \right)^{d-1} p^{-1/2} .
\]
Rearranging terms, we get that
\[
m^{1/p} \le \left[ \left( \frac{p}{2} \right)^{d-1} p^{-1/2} \right]^{\frac{2}{p-2}}
= \left[ \left( \frac{p}{2} \right)^{2d-2} p^{-1} \right]^{\frac{1}{p-2}} .
\]
The factor $p^{-\frac{1}{p-2}}$ is less than one. A tedious calculation reveals that $\left( \frac{p}{2} \right)^{\frac{2d-2}{p-2}}$ is decreasing for all $p \ge 2$, and therefore $m^{1/p}$ is bounded by $\lim_{p \to 2} \left( \frac{p}{2} \right)^{\frac{2d-2}{p-2}} = e^{d-1}$. Maximizing over our two upper bounds and then dividing by $m$, we obtain the desired inequality.

Now we are ready to give a formal proof of Theorem 3.5.

Proof of Theorem 3.5. Note that the result is unchanged for any choice of mode to leave out, so we will shorten the notation and work with the unnormalized product $\Omega := \sqrt{m}\, \Omega_{-j}$ within this proof. Let $K$ be the sub-Gaussian norm of an entry of the $\Omega_j$'s, as in Definition 1.22. We aim to bound the probability that
\[
\left| \frac{1}{m} \|\Omega \mathbf{x}\|_2^2 - \|\mathbf{x}\|_2^2 \right| \ge \epsilon \|\mathbf{x}\|_2^2
\]
for a fixed $\mathbf{x} \in \mathbb{R}^{n^{d-1}}$. Without loss of generality, assume $\|\mathbf{x}\|_2 = 1$. Furthermore, note that
\[
\frac{1}{m} \|\Omega \mathbf{x}\|_2^2 - 1 = \frac{1}{m} \sum_{i=1}^m \left[ \langle \Omega(i,:), \mathbf{x} \rangle^2 - 1 \right] . \tag{3.16}
\]
Now, since the entries of each $\Omega_j$ are i.i.d. mean zero and variance one sub-Gaussian random variables, they satisfy Khintchine's inequality (see, e.g., [63]) in the form $\left\| \langle \Omega_j(i,:), \mathbf{y} \rangle \right\|_{L^p} \le C K \sqrt{p}\, \|\mathbf{y}\|_2$ for any fixed $\mathbf{y} \in \mathbb{R}^n$. By Lemma 3.5, this implies that the rows of $\Omega$ satisfy a generalized Khintchine inequality in the form
\[
\left\| \langle \Omega(i,:), \mathbf{x} \rangle \right\|_{L^p}
= \left\| \langle \Omega_1(i,:) \otimes \Omega_2(i,:) \otimes \cdots \otimes \Omega_{d-1}(i,:), \mathbf{x} \rangle \right\|_{L^p}
\le (C' p)^{\frac{d-1}{2}} \|\mathbf{x}\|_2 , \tag{3.17}
\]
where $C'$ is a new constant that only depends on $K$. To bound the $L^p$-norm of the sum in (3.16) we will now bound the $L^p$-norm of each summand. Using the centering Lemma 3.13 and continuing to estimate one term, we see that
\[
\begin{aligned}
\left\| \langle \Omega(i,:), \mathbf{x} \rangle^2 - 1 \right\|_{L^p}
&\le 2 \left\| \langle \Omega(i,:), \mathbf{x} \rangle^2 \right\|_{L^p}
 = 2 \left\| \langle \Omega(i,:), \mathbf{x} \rangle \right\|_{L^{2p}}^2 \\
&= 2 \left\| \langle \Omega_1(i,:) \otimes \Omega_2(i,:) \otimes \cdots \otimes \Omega_{d-1}(i,:), \mathbf{x} \rangle \right\|_{L^{2p}}^2
 \overset{(3.17)}{\le} 2 \left( (C' 2p)^{\frac{d-1}{2}} \right)^2 \|\mathbf{x}\|_2^2
 \le (C'' p)^{d-1} \|\mathbf{x}\|_2^2 = (C'' p)^{d-1} .
\end{aligned}
\]
We would now like to apply Corollary 3.3 to help bound the $L^p$-norm of the sum in (3.16). However, we need to symmetrize our random variables first. Toward that end, define
\[
Z_i = \rho_i \left( \langle \Omega(i,:), \mathbf{x} \rangle^2 - 1 \right), \tag{3.18}
\]
where the $\rho_i$ are i.i.d. Rademacher random variables.
Note that โˆฅ๐‘๐‘– โˆฅ ๐ฟ ๐‘ = โˆฅ (cid:0)โŸจฮฉ(๐‘–, :), xโŸฉ2 โˆ’ 1(cid:1) โˆฅ ๐ฟ ๐‘ โ‰ค (๐ถโ€ฒโ€ฒ๐‘)๐‘‘โˆ’1. Appealing now to Corollary 3.3 we have that, (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 ๐‘š ๐‘š โˆ‘๏ธ ๐‘–=1 ๐‘๐‘– (cid:13) (cid:13) (cid:13) (cid:13) (cid:13)๐ฟ ๐‘ โ‰ค (๐ถโ€ฒโ€ฒ)๐‘‘โˆ’1 max (cid:26) 2๐‘‘โˆ’1 โˆš๏ธ‚ ๐‘ ๐‘š , ( ๐‘๐‘’)๐‘‘โˆ’1 ๐‘š (cid:27) (3.19) for any ๐‘ โ‰ฅ 2. The upper bound in [43, Lemma 6.3] then further implies that (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 ๐‘š ๐‘š โˆ‘๏ธ ๐‘–=1 (cid:2)โŸจฮฉ(๐‘–, :), xโŸฉ2 โˆ’ 1(cid:3) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13)๐ฟ ๐‘ โ‰ค 2 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 ๐‘š ๐‘š โˆ‘๏ธ ๐‘–=1 ๐‘๐‘– (cid:13) (cid:13) (cid:13) (cid:13) (cid:13)๐ฟ ๐‘ โ‰ค 2(๐ถโ€ฒโ€ฒ)๐‘‘โˆ’1 max (cid:26) 2๐‘‘โˆ’1 โˆš๏ธ‚ ๐‘ ๐‘š , ( ๐‘๐‘’)๐‘‘โˆ’1 ๐‘š (cid:27) . Employing Markovโ€™s inequality, we finally have that P (cid:40)(cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 1 ๐‘š ๐‘š โˆ‘๏ธ ๐‘–=1 (cid:2)โŸจฮฉ(๐‘–, :), xโŸฉ2 โˆ’ 1(cid:3) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:41) โ‰ฅ ๐œ– = P (cid:40)(cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 1 ๐‘š ๐‘š โˆ‘๏ธ ๐‘–=1 (cid:2)โŸจฮฉ(๐‘–, :), xโŸฉ2 โˆ’ 1(cid:3) (cid:41) โ‰ฅ ๐œ– ๐‘ ๐‘ (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) ๐‘š ) ๐‘/2, ( ๐‘๐‘’) ๐‘ (๐‘‘โˆ’1) ๐‘š ๐‘ ๐‘ (cid:111) . ๐œ– ๐‘ (cid:110) ( max โ‰ค (๐ถโ€ฒโ€ฒโ€ฒ) ๐‘(๐‘‘โˆ’1) Taking ๐‘ = log(๐‘˜/๐›ฟ) and ๐‘š โ‰ฅ หœ๐ถ ๐‘‘โˆ’1 max ๐‘’ (cid:26) ๐œ– โˆ’2 log ๐‘˜ ๐›ฟ , ๐œ– โˆ’1 (cid:16) log ๐‘˜ ๐›ฟ (cid:17) ๐‘‘โˆ’1(cid:27) , the last expression is upper bounded by ๐›ฟ/๐‘˜. Hence, ฮฉ is an (๐œ–, ๐›ฟ, ๐‘˜)-JL by the union bound over ๐‘˜ vectors. 79 Remark 3.4. In order to employ Lemmas 3.5 and 3.6, it is necessary that the rows ฮฉ ๐‘— (๐‘–, :) have independent and identical distributions and that these rows satisfy Khintchineโ€™s inequality. Assum- ing that the matrices ฮฉ๐‘– all have i.i.d sub-Gaussian entries as we have done in Theorem 3.5 implies both these necessary properties of the rows. However, we note that more general (though perhaps less natural) assumptions will also suffice. For example, the distributions of the i.i.d sub-Gaussian entries of the ฮฉ๐‘– may also vary by column. We can now use Theorem 3.5 to derive row bounds that guarantee that our Khatri-Rao structured sub-Gaussian measurement matrices will provide PCP sketches with high probability. Theorem 3.6. Let ๐œ– > 0, 0 < ๐›ฟ โ‰ค ๐‘’โˆ’2, and X be a ๐‘‘ mode tensor with side-lengths equal to ๐‘›. ๐‘š ฮฉ1 โ€ข ฮฉ2 โ€ข . . . ฮฉ ๐‘—โˆ’1 โ€ข ฮฉ ๐‘—+1 โ€ข . . . โ€ข ฮฉ๐‘‘ where the ฮฉ๐‘– โˆˆ R๐‘šร—๐‘› for Furthermore, suppose that ฮฉโˆ’ ๐‘— := 1 ๐‘– โˆˆ [๐‘‘] \ { ๐‘— } are as in Theorem 3.5 with ๐œ– โˆ’2 log ๐‘š โ‰ฅ ๐ถ ๐‘‘โˆ’1 max (cid:170) (cid:174) (cid:174) (cid:172) for a positive constant ๐ถ โˆˆ R+. Then, หœ๐‘‹[ ๐‘—] = ๐‘‹[ ๐‘—]ฮฉ๐‘‡ โˆ’ ๐‘— will be an (๐œ–, 0, ๐‘Ÿ)-PCP sketch of ๐‘‹[ ๐‘—] with , ๐œ– โˆ’1 (cid:169) (cid:173) (cid:173) (cid:171) (3.20) log log log , , 17๐‘›2 ๐›ฟ/2 (cid:18) ๐‘Ÿ ๐œ– 17๐‘›2 ๐›ฟ/2 ๐‘Ÿ ๐œ– 2 (cid:19) ๐‘‘โˆ’1๏ฃผ๏ฃด๏ฃด๏ฃด๏ฃฝ ๏ฃด๏ฃด๏ฃด ๏ฃพ ๏ฃฑ๏ฃด๏ฃด๏ฃด๏ฃฒ ๏ฃด๏ฃด๏ฃด ๏ฃณ ๐‘‘โˆ’1 (cid:17)๐‘Ÿ (cid:16) 141 ๐œ– ๐›ฟ/2 (cid:17)๐‘Ÿ (cid:16) 141 ๐œ– ๐›ฟ/2 probability at least 1 โˆ’ ๐›ฟ. Proof. By Lemma 3.1 we know that it suffices for the measurement matrix ฮฉโˆ’ ๐‘— to have both the (cid:17) (cid:16) ๐œ– 6 -JL property. 
Combining these two required JL properties with Theorem 3.5 yields (3.20) after combining and simplifying constants.

We can now see that both Kronecker and Khatri-Rao structured measurements can satisfy the PCP property required by Theorem 3.3. This demonstrates that such memory-efficient measurements may be used to bound the first term in (3.9) as desired. Given this partial success, we will now turn our attention to the second term in (3.9).

3.2.4 Bounding $\|\mathcal{X}_1 - \mathcal{X}_2\|_2$

Next, we show how to bound Term II in (3.9). Recall that $\mathcal{X}_1$ is the output of Algorithm 3.1, and that $\mathcal{X}_2$ is the tensor recovered by the two-pass algorithm consisting of the first "Factor matrix recovery" phase of Algorithm 3.1 followed by the second-pass core recovery procedure discussed in Section 3.2.1. That is, $\mathcal{X}_1 := [[\mathcal{H}, Q_1, \ldots, Q_d]]$ is the single-pass estimate of the tensor $\mathcal{X}$ output by Algorithm 3.1, and
\[
\mathcal{X}_2 = \mathcal{G} \times_1 Q_1 \times_2 \cdots \times_d Q_d = \mathcal{X} \times_1 Q_1 Q_1^T \times_2 \cdots \times_d Q_d Q_d^T ,
\]
where $\mathcal{G}$ is the core estimate from the two-pass algorithm, and where the $Q_i \in \mathbb{R}^{n \times r}$ have $r$ orthonormal columns. To begin, we note that the one-pass core $\mathcal{H}$ computed by Algorithm 3.1 can be recovered from its input measurements by
\[
\mathcal{H} = \mathcal{B}_c \times_1 (\Phi_1 Q_1)^{\dagger} \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger}
= \left( \mathcal{X} \times_1 \Phi_1 \times_2 \cdots \times_d \Phi_d \right) \times_1 (\Phi_1 Q_1)^{\dagger} \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} . \tag{3.21}
\]
In addition, we note that the norm of the difference between the two estimates is the same as the norm of the difference of their cores, since the factor matrices have orthonormal columns.

Lemma 3.7. In the notation outlined above, $\|\mathcal{X}_1 - \mathcal{X}_2\|_2 = \|\mathcal{H} - \mathcal{G}\|_2$.

Proof.
\[
\begin{aligned}
\|\mathcal{X}_1 - \mathcal{X}_2\|_2
&= \|\mathcal{H} \times_1 Q_1 \times_2 \cdots \times_d Q_d - \mathcal{G} \times_1 Q_1 \times_2 \cdots \times_d Q_d\|_2
 = \|(\mathcal{H} - \mathcal{G}) \times_1 Q_1 \times_2 \cdots \times_d Q_d\|_2 \\
&= \|(Q_1 \otimes \cdots \otimes Q_d)\operatorname{vec}(\mathcal{H} - \mathcal{G})\|_2
 = \|\mathcal{H} - \mathcal{G}\|_2
\end{aligned}
\]
since $(Q_1 \otimes \cdots \otimes Q_d)$ has orthonormal columns.
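As a quick numerical sanity check of (3.21) and Lemma 3.7, the following minimal sketch builds an exactly low-rank tensor, forms Gaussian core measurements, recovers the one-pass core via (3.21), and confirms that $\|\mathcal{X}_1 - \mathcal{X}_2\|_F = \|\mathcal{H} - \mathcal{G}\|_F$. The helper name and the Gaussian choice of $\Phi_i$ are our own assumptions made for illustration.

\begin{verbatim}
import numpy as np

def mode_prod(T, M, k):
    """Mode-k product T x_k M for a numpy array T and matrix M."""
    return np.moveaxis(np.tensordot(M, T, axes=([1], [k])), 0, k)

rng = np.random.default_rng(2)
n, r, m_c, d = 30, 4, 10, 3

# low-rank test tensor X = [[G0, Q1, Q2, Q3]] with orthonormal factors
G0 = rng.random((r,) * d)
Qs = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(d)]
X = G0
for k, Q in enumerate(Qs):
    X = mode_prod(X, Q, k)

# core measurements B_c and the one-pass core estimate H via (3.21)
Phis = [rng.standard_normal((m_c, n)) / np.sqrt(m_c) for _ in range(d)]
Bc = X
for k, Phi in enumerate(Phis):
    Bc = mode_prod(Bc, Phi, k)
H = Bc
for k, (Phi, Q) in enumerate(zip(Phis, Qs)):
    H = mode_prod(H, np.linalg.pinv(Phi @ Q), k)

# two-pass core G = X x_1 Q1^T ... x_d Qd^T
G = X
for k, Q in enumerate(Qs):
    G = mode_prod(G, Q.T, k)

# Lemma 3.7: the two norms below agree
X1, X2 = H, G
for k, Q in enumerate(Qs):
    X1 = mode_prod(X1, Q, k)
    X2 = mode_prod(X2, Q, k)
print(np.linalg.norm(X1 - X2), np.linalg.norm(H - G))
\end{verbatim}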
In order to simplify the presentation of our culminating results, we next state a definition for an Affine Embedding property. In Lemma 3.8 below we then describe how this property relates to the OSE and AMM properties.

Definition 3.4 ($(\epsilon, r, p)$-AE property). Let $\epsilon > 0$ and $r \in \mathbb{N}$. Fix an arbitrary matrix $Q \in \mathbb{R}^{n \times r}$ with orthonormal columns and an arbitrary matrix $B \in \mathbb{R}^{n \times p}$. A matrix $\Phi \in \mathbb{R}^{m \times n}$ is an $(\epsilon, r, p)$-Affine Embedding (AE) for the given matrices $Q$ and $B$ if it satisfies
\[
\left\| (\Phi Q)^{\dagger} \Phi B \right\|_F \le (1 + \epsilon) \|B\|_F . \tag{3.22}
\]

With this definition in hand, we are now able to prove the main theorem of this section. It will allow us to relate Term II of (3.9) to Term I.

Theorem 3.7. Let $\mathcal{X}_2 = [[\mathcal{G}, Q_1, \ldots, Q_d]]$ denote the two-pass tensor estimate, and $\mathcal{X}_1 = [[\mathcal{H}, Q_1, \ldots, Q_d]]$ denote the single-pass tensor estimate for a $d$-mode tensor $\mathcal{X}$. Furthermore, let $\epsilon \in (0,1)$, and let $\Phi_i \in \mathbb{R}^{m_c \times n}$ be an $(\epsilon/d, r, n^{i-1} r^{d-i})$-AE for the matrices $Q_i$ and $\left( X_{[i]} - (X_2)_{[i]} \right) \bigotimes_{j=1}^{i-1} I_n \bigotimes_{j=i+1}^{d} (\Phi_j Q_j)^{\dagger} \Phi_j$ for each $i \in [d]$. Then,
\[
\|\mathcal{X}_1 - \mathcal{X}_2\|_2 \le e^{\epsilon} \|\mathcal{X} - \mathcal{X}_2\|_2 . \tag{3.23}
\]

Proof. By Lemma 3.7 it is enough to estimate the difference between the cores $\mathcal{G}$ and $\mathcal{H}$. We have that
\[
\begin{aligned}
\mathcal{H} - \mathcal{G}
&\overset{(3.21)}{=} \left( \mathcal{X} \times_1 \Phi_1 \times_2 \cdots \times_d \Phi_d \right) \times_1 (\Phi_1 Q_1)^{\dagger} \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} - \mathcal{G} \\
&= \left( (\mathcal{X} - \mathcal{X}_2) \times_1 \Phi_1 \times_2 \cdots \times_d \Phi_d \right) \times_1 (\Phi_1 Q_1)^{\dagger} \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} \\
&\qquad + \left( \mathcal{X}_2 \times_1 \Phi_1 \times_2 \cdots \times_d \Phi_d \right) \times_1 (\Phi_1 Q_1)^{\dagger} \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} - \mathcal{G} \\
&= (\mathcal{X} - \mathcal{X}_2) \times_1 (\Phi_1 Q_1)^{\dagger} \Phi_1 \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} \Phi_d
 + \mathcal{G} \times_1 (\Phi_1 Q_1)^{\dagger} \Phi_1 Q_1 \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} \Phi_d Q_d - \mathcal{G} \\
&= (\mathcal{X} - \mathcal{X}_2) \times_1 (\Phi_1 Q_1)^{\dagger} \Phi_1 \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} \Phi_d .
\end{aligned}
\]
Now consider the following related mode-$i$ unfolding for $i \in [d]$, where $(\Phi_j Q_j)^{\dagger} \Phi_j$ for $j < i$ is replaced with an $n \times n$ identity matrix $I_n$:
\[
X'_i := \left( X_{[i]} - (X_2)_{[i]} \right) \bigotimes_{j=1}^{i-1} I_n \bigotimes_{j=i+1}^{d} (\Phi_j Q_j)^{\dagger} \Phi_j . \tag{3.24}
\]
Each $\Phi_i$ is an $(\epsilon/d, r, n^{i-1} r^{d-i})$-AE, where $Q \leftarrow Q_i$ and $B \leftarrow X'_i$, for $i = 1, 2, \ldots, d$. Thus,
\[
\begin{aligned}
\|\mathcal{H} - \mathcal{G}\|_2
&= \left\| (\Phi_1 Q_1)^{\dagger} \Phi_1 \left( X_{[1]} - (X_2)_{[1]} \right) \bigotimes_{j=2}^{d} (\Phi_j Q_j)^{\dagger} \Phi_j \right\|_F
 \le \left( 1 + \frac{\epsilon}{d} \right) \left\| X'_1 \right\|_F \\
&\;\;\vdots \\
&\le \left( 1 + \frac{\epsilon}{d} \right)^d \left\| X'_d \right\|_F \le e^{\epsilon} \|\mathcal{X} - \mathcal{X}_2\|_2 ,
\end{aligned}
\]
where we have used Definition 3.4 $d$ times together with the bound $\left( 1 + \frac{\epsilon}{d} \right)^d \le e^{\epsilon}$.

3.2.5 Putting it All Together with Row Bounds

In the previous subsections we have demonstrated that we can bound both error terms in (3.9) when the leave-one-out and core measurements satisfy certain embedding properties. In particular, we have shown how the Johnson-Lindenstrauss property (Definition 1.18) can be used to obtain both the Oblivious Subspace Embedding (Definition 1.23) and the Approximate Matrix Multiplication (Definition 1.25) properties. These two properties are then used with compositions and direct sums to show that a tensor unfolding will satisfy a Projection Cost Preserving property (Definition 3.2), which was the essential ingredient in bounding the error term $\|\mathcal{X} - \mathcal{X}_2\|_2$. Next, we introduced Affine Embeddings (Definition 3.4), which are the crucial ingredient needed for bounding the error term $\|\mathcal{X}_1 - \mathcal{X}_2\|_2$. In this section we will show how the JL and OSE properties imply the AE property. All together, these will enable us to verify the requirements of Theorem 3.8 in a straightforward manner once we have specified the type of leave-one-out measurements and the particular type of sensing matrices.
ฮฉ(๐‘–,๐‘–) โˆˆ R๐‘›ร—๐‘› are full-rank matrices for all ๐‘– โˆˆ [๐‘‘], 2. ฮฉโˆ’1 (๐‘–,๐‘–) ๐ต๐‘– = ๐‘‹[๐‘–]ฮฉ๐‘‡ โˆ’ ๐‘— โˆˆ R๐‘›ร—๐‘š๐‘‘โˆ’1 are (๐œ–, 0, ๐‘Ÿ)-PCP sketches of ๐‘‹[๐‘–] for each ๐‘– โˆˆ [๐‘‘], and 3. ฮฆ๐‘– โˆˆ R๐‘š๐‘ร—๐‘› is an (๐œ–/๐‘‘, ๐‘Ÿ, ๐‘›๐‘–โˆ’1๐‘Ÿ ๐‘‘โˆ’๐‘–)-AE for the matrices ๐‘„๐‘– and ๐‘‹โ€ฒ ๐‘– as in (3.24) for all ๐‘– โˆˆ [๐‘‘]. Proof. Recalling (3.9) we have that โˆฅX1 โˆ’ Xโˆฅ2 = โˆฅX1 โˆ’ X + X2 โˆ’ X2โˆฅ2 โ‰ค โˆฅX โˆ’ X2โˆฅ2 (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:124) (cid:123)(cid:122) Term I We know from Theorem 3.3 that โˆฅX โˆ’ X2โˆฅ2 โ‰ค (cid:118)(cid:117)(cid:116) 1 + ๐œ– 1 โˆ’ ๐œ– ๐‘‘ โˆ‘๏ธ ๐‘—=1 ฮ”๐‘Ÿ, ๐‘— (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) + โˆฅX1 โˆ’ X2โˆฅ2 (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) (cid:123)(cid:122) Term II (cid:124) . (3.26) (3.27) whenever ๐‘‹[ ๐‘—]ฮฉ๐‘‡ โˆ’ ๐‘— โˆˆ R๐‘›ร—๐‘š๐‘‘โˆ’1 are (๐œ–, 0, ๐‘Ÿ)-PCP sketches of ๐‘‹[ ๐‘—] for each ๐‘— โˆˆ [๐‘‘]. Furthermore, Theorem 3.7 together with our third assumption implies that โˆฅX1 โˆ’ X2โˆฅ2 โ‰ค ๐‘’๐œ– โˆฅX โˆ’ X2โˆฅ2 . (3.28) Using (3.27) and (3.28) in (3.9) now yields the desired result. We now have need for the lemma that links the AE property to the JL and OSE properties, and thus provides the necessary machinery to fully account for row bounds that guarantee with high probability that the recovered tensor satisfies the stated bound in Theorem 3.8. Lemma 3.8. Let ๐ต โˆˆ R๐‘›ร—๐‘ and suppose that ๐‘„ โˆˆ R๐‘›ร—๐‘Ÿ, ๐‘› โ‰ฅ ๐‘Ÿ, has orthonormal columns. If ฮฆ โˆˆ R๐‘šร—๐‘› is an (cid:17) , ๐›ฟ, ๐‘Ÿ (cid:16) 1 2 -OSE for ๐‘„, and has the (cid:16) ๐œ– โˆš ๐‘Ÿ 2 (cid:17) , ๐›ฟ -AMM property for ๐‘„๐‘‡ and (๐ผ โˆ’ ๐‘„๐‘„๐‘‡ )๐ต, then ฮฆ will be an (๐œ–, ๐‘Ÿ, ๐‘)-AE for the matrices ๐‘„ and ๐ต with probability at least 1 โˆ’ 2๐›ฟ. Proof. Denote หœ๐‘Œ := (ฮฆ๐‘„)โ€ ฮฆ๐ต and ๐‘Œ โ€ฒ := ๐‘„๐‘‡ ๐ต, the solutions to the sketched and un-skechted linear least square problems given by minimizing โˆฅฮฆ๐ต โˆ’ ฮฆ๐‘„๐‘Œ โˆฅ๐น and โˆฅ๐ต โˆ’ ๐‘„๐‘Œ โˆฅ๐น, respectively, with respect to ๐‘Œ . Whenever ฮฆ is a (1/2, ๐›ฟ, ๐‘Ÿ)-OSE for ๐‘„ we know that ฮฆ๐‘„ will be full-rank. It follows 84 that (ฮฆ๐‘„)๐‘‡ ฮฆ(๐ต โˆ’ ๐‘„ หœ๐‘Œ ) = 0 will then also hold because (ฮฆ๐‘„)โ€  = (cid:2)(ฮฆ๐‘„)๐‘‡ (ฮฆ๐‘„)(cid:3) โˆ’1 (ฮฆ๐‘„)๐‘‡ . As a consequence, ๐‘„๐‘‡ ฮฆ๐‘‡ ฮฆ๐‘„( หœ๐‘Œ โˆ’ ๐‘Œ โ€ฒ) = ๐‘„๐‘‡ ฮฆ๐‘‡ ฮฆ๐‘„( หœ๐‘Œ โˆ’ ๐‘Œ โ€ฒ) + (ฮฆ๐‘„)๐‘‡ ฮฆ(๐ต โˆ’ ๐‘„ หœ๐‘Œ ) = ๐‘„๐‘‡ ฮฆ๐‘‡ ฮฆ(๐‘„ หœ๐‘Œ โˆ’ ๐‘„๐‘Œ โ€ฒ + ๐ต โˆ’ ๐‘„ หœ๐‘Œ ) = ๐‘„๐‘‡ ฮฆ๐‘‡ ฮฆ(๐ต โˆ’ ๐‘„๐‘Œ โ€ฒ). When the approximate matrix multiplication property also holds we will now have that (3.29) (cid:13)(ฮฆ๐‘„)๐‘‡ ฮฆ๐‘„( หœ๐‘Œ โˆ’ ๐‘Œ โ€ฒ)(cid:13) (cid:13) (cid:13)๐น = (cid:13) (cid:13)๐‘„๐‘‡ ฮฆ๐‘‡ ฮฆ(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ )๐ต(cid:13) (cid:13)๐น (cid:13)๐น = (cid:13) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ )๐ต(cid:13) (cid:13) (cid:13)๐น (cid:13)๐‘„๐‘‡ ฮฆ๐‘‡ ฮฆ(๐ต โˆ’ ๐‘„๐‘Œ โ€ฒ)(cid:13) ๐œ– โˆš ๐‘Ÿ (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ )๐ต(cid:13) (cid:13) (cid:13)๐‘„๐‘‡ (cid:13) (cid:13) (cid:13)๐น (cid:13)๐น = 2 ๐œ– ๐œ– 2 โˆฅ๐ต โˆ’ ๐‘„๐‘Œ โ€ฒโˆฅ๐น . 
Furthermore, whenever $\Phi$ is a $\left( \frac{1}{2}, \delta, r \right)$-OSE for the column space of $Q$, all the eigenvalues of $Q^T \Phi^T \Phi Q - I$ will lie within the interval $[-1/2, 1/2]$. Thus, we can bound its operator norm by $\left\| Q^T \Phi^T \Phi Q - I \right\| \le 1/2$. We may now combine this operator norm bound with (3.29) to see that
\[
\begin{aligned}
\left\| \tilde{Y} - Y' \right\|_F
&= \left\| (\tilde{Y} - Y') - (\Phi Q)^T \Phi Q (\tilde{Y} - Y') + (\Phi Q)^T \Phi Q (\tilde{Y} - Y') \right\|_F \\
&\le \left\| \left[ (\Phi Q)^T \Phi Q - I \right] (\tilde{Y} - Y') \right\|_F + \left\| (\Phi Q)^T \Phi Q (\tilde{Y} - Y') \right\|_F \\
&\le \left\| (\Phi Q)^T \Phi Q - I \right\| \left\| \tilde{Y} - Y' \right\|_F + \frac{\epsilon}{2} \left\| (I - QQ^T) B \right\|_F
 \le \frac{1}{2} \left\| \tilde{Y} - Y' \right\|_F + \frac{\epsilon}{2} \left\| (I - QQ^T) B \right\|_F .
\end{aligned}
\]
Rearranging the inequality above, while noting the invariance of the Frobenius norm to multiplication by a matrix with orthonormal columns, we learn that
\[
\left\| \tilde{Y} - Y' \right\|_F = \left\| Q (\tilde{Y} - Y') \right\|_F \le \epsilon \left\| (I - QQ^T) B \right\|_F .
\]
To finish, we may now apply the triangle inequality to see that
\[
\left\| \tilde{Y} \right\|_F
\le \left\| \tilde{Y} - Y' \right\|_F + \left\| Y' \right\|_F
\le \epsilon \left\| (I - QQ^T) B \right\|_F + \left\| Q^T B \right\|_F
\le \epsilon \left\| I - QQ^T \right\| \left\| B \right\|_F + \left\| Q^T \right\| \left\| B \right\|_F
= (1 + \epsilon) \left\| B \right\|_F .
\]
In addition, we note that taking a union bound over the two necessary OSE and AMM conditions establishes the stated probability guarantee.

We are now prepared to state how a particular choice of distribution used to generate our measurement matrices, as well as the leave-one-out measurement type (Kronecker or Khatri-Rao), can satisfy the error bound (3.25) with high probability. Note that below, Algorithms 3.9 and 3.11 refer to the specialization of Algorithm 3.1 to the type of leave-one-out measurement (Kronecker or Khatri-Rao, respectively).

Theorem 3.9 (Error bound for one-pass Kronecker-structured sub-Gaussian measurements). Suppose $\mathcal{X}$ is a $d$-mode tensor with side length $n$. Let $\epsilon > 0$, $\delta \in (0, \frac{1}{3})$, $r \in [n]$. Furthermore, let
1. $\Omega_{(i,i)} \in \mathbb{R}^{n \times n}$ be arbitrary full-rank matrices,
2. $\Omega_{(i,j)} \in \mathbb{R}^{m \times n}$ for $i \ne j$ be random matrices with mutually independent, mean zero, variance $m^{-1}$, sub-Gaussian entries with
\[
m \ge \max\left\{ \frac{C_1 r (d-1)^2}{\epsilon^2} \ln\left( \frac{n^d d^2}{\delta} \right),\ \frac{C_2 (d-1)^2}{\epsilon^2} \ln\left( \left(\frac{141}{\epsilon}\right)^r \frac{n^{d-2} d^2}{\delta} \right) \right\},
\]
and
3. $\Phi_i \in \mathbb{R}^{m_c \times n}$ be random matrices with mutually independent, mean zero, variance $m^{-1}$, sub-Gaussian entries with
\[
m_c \ge \max\left\{ C_3 \ln\left( \frac{(94)^r d}{\delta} \right),\ \frac{C_4 r d^2}{\epsilon^2} \ln\left( \frac{d (r + n^{d-1})^2}{\delta} \right) \right\}. \tag{3.30}
\]
Then $\mathcal{X}_1 = [[\mathcal{H}, Q_1, \ldots, Q_d]]$, the output of Algorithm 3.9 (i.e., Algorithm 3.1 specialized to Kronecker sub-Gaussian measurements $B_i$ and $\mathcal{B}_c$), will satisfy (3.25) with probability at least $1 - 3\delta$.

Proof. We verify that the requirements of Theorem 3.8 are satisfied. Note that:
1. $\Omega_{(i,i)} \in \mathbb{R}^{n \times n}$ are full-rank matrices by assumption.
2. We have from Corollary 3.2, with $\delta \leftarrow \frac{\delta}{d}$, that $X_{[j]} \Omega_{-j}^T$ is an $(\epsilon, 0, r)$-PCP sketch of $X_{[j]}$ for all $j \in [d]$ with probability at least $1 - \delta$ when the $\Omega_{(i,j)} \in \mathbb{R}^{m \times n}$ are independent sub-Gaussian random matrices with
\[
m \ge \max\left\{ \frac{C_1 r (d-1)^2}{\epsilon^2} \ln\left( \frac{n^d d^2}{\delta} \right),\ \frac{C_2 (d-1)^2}{\epsilon^2} \ln\left( \left(\frac{141}{\epsilon}\right)^r \frac{n^{d-2} d^2}{\delta} \right) \right\}.
\]
3. A substitution of $\epsilon \leftarrow \frac{1}{2}$, $\delta \leftarrow \frac{\delta}{d}$ into Corollary 1.1 yields
\[
m_c \ge C_3 \ln\left( \frac{(94)^r d}{\delta} \right) \tag{3.31}
\]
in order to ensure that $\Phi_i$ is a $\left( \frac{1}{2}, \frac{\delta}{d}, r \right)$-OSE. Using Corollary 1.2, where $\epsilon \leftarrow \frac{\epsilon}{2d\sqrt{r}}$, $\delta \leftarrow \frac{\delta}{2d}$, and noting that the matrices $Q_i$ and $X'_i$ as in (3.24) are $r \times n$ and $n \times n^{i-1} r^{d-i}$, respectively, we have that when
\[
m_c \ge \frac{C_4 r d^2}{\epsilon^2} \ln\left( \frac{2d (r + n^{d-1})^2}{\delta} \right) \tag{3.32}
\]
then $\Phi_i$ has the $\left( \frac{\epsilon}{2d\sqrt{r}}, \frac{\delta}{d} \right)$-AMM property for each $i \in [d]$. Lemma 3.8 now shows how the OSE and AMM properties ensure that $\Phi_i$ has the desired AE property for each $i \in [d]$ with probability at least $1 - 2\delta/d$. The union bound now implies that the third requirement of Theorem 3.8 will hold with probability at least $1 - 2\delta$.

Taking a maximum over (3.31) and (3.32), after simplifying and adjusting constants, then yields (3.30). A final union bound over the failure probabilities for the requirements on the $\Omega_{(i,j)}$ and $\Phi_i$ now yields the result.

The following runtime analysis demonstrates that instances of Algorithm 3.1 can indeed recover low-rank approximations of $d$-mode tensors of side length $n$ in $o(n^d)$ time. As a result, one can see that Algorithm 3.1 is effectively a sub-linear time recovery algorithm for a large class of low Tucker-rank tensors.

Theorem 3.10. Suppose that $m < n < m^{d-1}$ and that $m > C m_c$ for some absolute constant $C > 0$. Then Algorithm 3.9, when given the $B_i$ and $\mathcal{B}_c$ measurements as input, runs in $\mathcal{O}(d n^2 m^{d-1})$ time and requires the storage of $d n m^{d-1} + m_c^d$ measurement tensor entries, and at most $d(d-1)mn + n^2 + d m_c n$ total measurement matrix entries.

Proof. Inside the factor matrix recovery loop of Algorithm 3.5, called by Algorithm 3.9, the two main sub-tasks are to solve the linear system $\Omega_{(i,i)} F_i = B_i$ and to compute a truncated SVD, $F_i = U \Sigma V^T$. Solving the linear system can be accomplished using a $QR$-factorization computed via Householder orthogonalization. Doing so requires $\frac{4}{3} n^3$ floating point operations to compute the factorization of $\Omega_{(i,i)}$, $2 n^2 m^{d-1}$ operations to form $Q^T B_i$, and $n^2 m^{d-1}$ operations to solve $R F_i = Q^T B_i$ via back substitution. The complexity of computing the SVD is $\mathcal{O}(n m^{d-1} \min(n, m^{d-1}))$. Therefore the factor recovery loop has overall complexity $\mathcal{O}(d n^2 m^{d-1})$ if we assume that $n < m^{d-1}$.

Next we consider the core recovery loop. For the first iteration of the loop of Algorithm 3.8, we must form $\Phi_1 Q_1$ at a cost of $\mathcal{O}(m_c n r)$. Next we solve a linear system $\Phi_1 Q_1 H = B_1$ at a cost of $\mathcal{O}(2 m_c r^2 - \frac{2}{3} r^3 + 3 m_c^d r)$. The first iteration dominates the complexity, since subsequent solves use a smaller right-hand side formed from the solutions of the previous iterations. Furthermore, if we assume $m_c = \mathcal{O}(m)$, we have a core recovery loop with complexity $\mathcal{O}(d m^d r)$. Thus, overall, the recovery algorithm has $\mathcal{O}(d n^2 m^{d-1})$ complexity. In the situation where $\Omega_{(i,i)} = I_n$, the computation of the SVDs in the factor recovery step dominates the run-time of the algorithm. Clearly the sizes of the measurement tensors are $n m^{d-1}$ per factor and $m_c^d$ for the core, which yields the space complexity of the measurements.
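The factor-recovery stage analyzed above is straightforward to prototype. The sketch below is illustrative only (the function name and the details are our own, not those of Algorithm 3.5): it solves $\Omega_{(i,i)} F_i = B_i$ by Householder QR and back substitution, and then truncates the SVD of $F_i$ to obtain a factor estimate.

\begin{verbatim}
import numpy as np
from scipy.linalg import qr, solve_triangular

def recover_factor(Omega_ii, B_i, r):
    """Estimate an n x r factor from the leave-one-out measurements B_i.

    Omega_ii : n x n full-rank matrix (the identity is a common practical choice)
    B_i      : n x m^(d-1) measurement matrix, B_i = Omega_ii @ X_[i] @ Omega_{-i}^T
    r        : target Tucker rank for mode i
    """
    # solve Omega_ii @ F_i = B_i via Householder QR and back substitution
    Qf, Rf = qr(Omega_ii)                     # Omega_ii = Qf @ Rf
    F_i = solve_triangular(Rf, Qf.T @ B_i)    # F_i ~ X_[i] @ Omega_{-i}^T
    # truncated SVD: the leading r left singular vectors give the factor estimate
    U, _, _ = np.linalg.svd(F_i, full_matrices=False)
    return U[:, :r]
\end{verbatim}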
One of the advantages of the structure of the argument in Theorem 3.8 is that once it is known how to ensure that a given random matrix will satisfy the JL property, we can (with the help of Lemma 3.8) account for how to assemble related measurement operators that satisfy the error bound (3.25) in a straightforward way. For example, using Theorem 3.1 in [38] along with bounds appearing in that work on the sketching dimension, we have that a sub-sampled and scrambled Fourier matrix is an $(\epsilon, \delta, p)$-JL of vectors in $\mathbb{R}^n$ provided that
\[
m \ge \frac{C}{\epsilon^2} \log\left( \frac{p}{\delta} \right) \log^4 n .
\]
Using such existing results, one can easily update Theorem 3.9 to instead use sub-sampled and scrambled Fourier measurement matrices $\Omega_{(i,j)}$ and $\Phi_i$ in place of matrices with independent sub-Gaussian entries.

Furthermore, it need not be the case that the distribution is the same for each component map $\Omega_{(i,j)}$ or $\Phi_i$. It is important only that each map satisfies the JL primitive for arbitrary sets. We are free to choose a measurement map for a particular mode to suit some other purpose. For example, in the case that the side lengths of the tensor are unequal, we may prefer to choose a map that admits a fast matrix-vector multiply in order to economize run-time for the modes which are long, and on smaller modes choose maps which have better trade-offs in quality of approximation in terms of $m$ (e.g., we may prefer dense sub-Gaussian random matrices for these modes); a small illustrative sketch of such a mixed Kronecker ensemble is given below.
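For instance, one might pair dense Gaussian maps on short modes with a sub-sampled randomized Fourier-style map on a long mode. The following sketch is an assumption-laden illustration: the RFD-style construction shown is a generic sub-sampled randomized DFT with random sign flips (applied here as a dense matrix for simplicity), not necessarily the exact construction of [38] or [64], and the sizes are arbitrary.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)

def gaussian_map(m, n):
    return rng.standard_normal((m, n)) / np.sqrt(m)

def rfd_map(m, n):
    """Sub-sampled randomized DFT-style map (real part); suitable for long
    modes since it admits an FFT-based multiply when applied implicitly."""
    signs = rng.choice([-1.0, 1.0], size=n)        # random diagonal sign flips
    rows = rng.choice(n, size=m, replace=False)    # sub-sampled frequencies
    F = np.fft.fft(np.eye(n), axis=0)[rows] / np.sqrt(m)
    return np.real(F * signs)                      # m x n

# a 3-mode tensor with one long mode: Gaussian maps on the short modes,
# an RFD-style map on the long mode
n_short, n_long, m = 50, 1024, 20
Omegas = [gaussian_map(m, n_short), gaussian_map(m, n_short), rfd_map(m, n_long)]
\end{verbatim}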
ฮฉ(๐‘–, ๐‘—) โˆˆ R หœ๐‘šร—๐‘› for ๐‘– โ‰  ๐‘— be random matrices with mutually independent, mean zero, variance 89 หœ๐‘šโˆ’1, sub-Gaussian entries with หœ๐‘š โ‰ฅ ๐ถ ๐‘‘โˆ’1 max (cid:26) ๐œ– โˆ’2 ln (cid:169) (cid:173) (cid:173) (cid:171) (cid:17)๐‘Ÿ (cid:16) 141 ๐œ– 2๐‘‘ (cid:170) (cid:174) (cid:174) (cid:172) (cid:18) 34๐‘›2๐‘‘ ๐›ฟ , ๐œ– โˆ’1 (cid:169) (cid:173) (cid:173) (cid:171) ๐‘Ÿ ๐œ– (cid:19) , ln (cid:169) (cid:173) (cid:173) (cid:171) (cid:18) ln ๐›ฟ ๐‘Ÿ ๐œ– 2 ln 2๐‘‘ (cid:17)๐‘Ÿ (cid:16) 141 ๐œ– ๐›ฟ (cid:18) 34๐‘›2๐‘‘ ๐›ฟ ๐‘‘โˆ’1 (cid:170) (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) (cid:172) (cid:19)(cid:19) ๐‘‘โˆ’1(cid:27) , , and 3. ฮฆ๐‘– โˆˆ R๐‘š๐‘ร—๐‘› be random matrices with mutually independent, mean zero, variance ๐‘šโˆ’1, sub-Gaussian entries with ๐‘š๐‘ โ‰ฅ max (cid:26) ๐ถ3 ln (cid:18) (94)๐‘Ÿ ๐‘‘ ๐›ฟ (cid:19) , ๐ถ4๐‘Ÿ๐‘‘2 ๐œ– 2 (cid:18) ๐‘‘ (๐‘Ÿ + ๐‘›๐‘‘โˆ’1)2 ๐›ฟ (cid:19)(cid:27) . ln Then X1 = [[H , ๐‘„1, . . . , ๐‘„๐‘‘]] the output of Algorithm 3.11 (i.e., Algorithm 3.1 specialized to Khatri-Rao sub-Gaussian leave-one-out measurements B๐‘– and Kronecker measurements B๐‘) will satisfy (3.25) with probability at least 1 โˆ’ 3๐›ฟ. Proof. The proof is again based on verifying the requirements of Theorem 3.8. 1. The first requirement is satisfied by assumption. 2. The row requirement for the ฮฉ(๐‘–, ๐‘—) follows from an application of Theorem 3.6 with ๐›ฟ โ† ๐›ฟ/๐‘‘. As a result of doing so, we learn that the second requirement will be satisfied with probability at least 1 โˆ’ ๐›ฟ after a union bound. 3. The third requirement is satisfied identically to the argument in the proof of Theorem 3.9. The proof now concludes identically to the proof of Theorem 3.9. Comparing the Theorem 3.10, we note, e.g., that Theorem 3.11 will require the storage of ๐‘‘ หœ๐‘š๐‘› + ๐‘š๐‘‘ ๐‘ measurement tensor entries (where the หœ๐‘š here is from the second condition of Theo- rem 3.11). Comparing this to the ๐‘‘๐‘›๐‘š๐‘‘โˆ’1 + ๐‘š๐‘‘ ๐‘ measurements from Theorem 3.10 (where ๐‘š here is as in Theorem 3.9) one can see that there are parameter regimes where Khatri-Rao structured measurements will lead to a smaller overall measurement budget. 3.3 Experiments In this section we present numerical results that support our theoretical contributions and address practical trade-offs involved in the different choices for measurement type. Unless otherwise 90 specified, the tensors in the experiments are random three-mode cubic tensors, with side length ๐‘› = 300 and rank ๐‘Ÿ = 10. We use the following procedure (same as in [60]) to generate low- rank tensors; the coreโ€™s entries are uniformly and independently drawn from [0, 1] and the factors are formed by first sampling a standard normal distribution for each entry and then normalized and made orthogonal using ๐‘„๐‘…-factorization. The data points in the plots are the mean of 100 independent trials. The parameter ๐‘š refers to the sketching dimension for maps ฮฉ(๐‘–, ๐‘—) โˆˆ R๐‘šร—๐‘› used in recovering factor ๐‘„๐‘–. For the left-out mode we remove the need to solve the full ๐‘› ร— ๐‘› linear system by setting ฮฉ(๐‘–,๐‘–) to be the ๐‘› ร— ๐‘› identity matrix. The parameter ๐‘š๐‘ refers to the sketching dimension for ฮฆ๐‘– โˆˆ R๐‘š๐‘ร—๐‘› which are used in recovering the core H . 
In experiments with noise, the additive Gaussian noise tensor $\mathcal{N}$ is scaled according to the desired signal-to-noise ratio and added to the true (low-rank) tensor $\mathcal{X}_0$. That is, $\mathcal{X} = \mathcal{X}_0 + \mathcal{N}$ is the observed, noisy tensor. The signal-to-noise ratio (SNR) is calculated as $10 \log_{10}\left( \|\mathcal{X}\|_2 / \|\mathcal{X}_0 - \mathcal{X}\|_2 \right)$. Relative error is calculated as $\|\hat{\mathcal{X}} - \mathcal{X}\|_2 / \|\mathcal{X}_0\|_2$, where $\hat{\mathcal{X}}$ is the full estimated tensor.

3.3.1 Recovering Low-Rank Tensors

In this first simple experiment, we fix the signal-to-noise ratio at 30 decibels (dB) and vary the sketching dimension $m$ to show the dependence of the accuracy of our estimate on the number of measurements. For each $m$ we set $m_c = 2m$. The rank truncation is fixed at $r = 10$, which matches the rank of the true, noiseless tensor $\mathcal{X}_0$; see Figure 3.2. In plot (b) of the figure, we show the maximum principal angle, in degrees, among the three estimated factors and true factors $(Q_i, U_i)$; see [35]. Note that there is no straightforward way to plot the portion of the relative error which is due to the factor estimates versus the core estimate, because the decomposition will in general not be unique. However, since the principal angle is invariant to non-singular transformations, plot (b) provides empirical evidence that the factor estimates alone are improving with the sketching dimension. We note that for these low-rank tensors with noise, we are able to fit at or below the level of the noise (relative error of 0.001) easily, evidently finding good rank-10 approximations to the (full-rank) noisy tensor $\mathcal{X}$.

Figure 3.2 Error plots for different sketching dimensions with a fixed SNR of 30 dB and a fixed rank truncation of 10. Plot (a) compares relative errors for both one-pass and two-pass recovery. Plot (b) shows the maximum principal angle among all estimated subspaces and the true factor matrices.

This perhaps surprising result motivated us to try the method on a class of tensors for which we could be more certain about what quality of rank-10 approximation is achievable. In our second set of experiments, we examine performance on super-diagonal tensors with tail decay. Since we are truncating to rank 10, this tail can be thought of as structured, deterministic noise. These are tensors where all values are zero except for those on the diagonal, and where the magnitudes of the diagonal values decay for indices larger than $r = 10$. In particular we consider two types: exponential tail decay in plot (a), where
\[
\mathcal{X}_{ijk} = \begin{cases} 1 & i = j = k \in [r+1] \\ 10^{-(i-r)} & i = j = k \in [r+2, n] \\ 0 & \text{otherwise,} \end{cases} \tag{3.33}
\]
and polynomial tail decay in plot (b), where
\[
\mathcal{X}_{ijk} = \begin{cases} 1 & i = j = k \in [r+1] \\ (i-r)^{-1} & i = j = k \in [r+2, n] \\ 0 & \text{otherwise.} \end{cases} \tag{3.34}
\]
These highly constrained tensors are clearly not low-rank; however, it is reasonable to suppose that a recovery algorithm for a given rank truncation would output an estimate that is close to the leading $r$ terms of the diagonal. The residual in that case will simply be the norm of the tail-sum, $\sqrt{\sum_{i=r+1}^{n} \mathcal{X}_{iii}^2}$, which we have included as the red horizontal line in Figure 3.3.
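The super-diagonal test tensors of (3.33) and (3.34), along with the tail-sum baseline plotted as the red line in Figure 3.3, can be generated as follows. This is a sketch: the function name is ours, and the snippet simply translates the 1-based index ranges above into 0-based NumPy indexing.

\begin{verbatim}
import numpy as np

def superdiagonal_tensor(n=300, r=10, decay="exp"):
    """Super-diagonal tensor per (3.33)/(3.34): ones on the first r+1
    diagonal entries, then decaying diagonal values."""
    i = np.arange(1, n + 1, dtype=float)       # 1-based diagonal index
    diag = np.ones(n)
    tail = i >= r + 2
    if decay == "exp":                         # (3.33)
        diag[tail] = 10.0 ** (-(i[tail] - r))
    else:                                      # polynomial decay, (3.34)
        diag[tail] = 1.0 / (i[tail] - r)
    X = np.zeros((n, n, n))
    X[np.arange(n), np.arange(n), np.arange(n)] = diag
    return X, diag

X_poly, diag = superdiagonal_tensor(decay="poly")
# tail-sum residual, plotted as the red horizontal line in Figure 3.3
tail_residual = np.sqrt(np.sum(diag[10:] ** 2))   # diagonal indices i > r = 10
\end{verbatim}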
Figure 3.3 Error plots for different sketching dimensions in the noiseless setting with a fixed rank truncation of 10. Plot (a) compares relative errors for both one-pass and two-pass recovery for super-diagonal tensors whose diagonal entries have exponential decay of the type (3.33), and plot (b) for super-diagonal tensors with polynomial decay of the type (3.34).

3.3.2 Allocating Core and Factor Measurements

One question raised by our error analysis is how to weigh the error contribution between the tasks of estimating the factor matrices and estimating the core. In other words, for a given total measurement budget, how should we allocate between the two tasks if we wish to decrease the overall relative error? In the following experiment (see Figure 3.4) we find the relative error under various noise levels for pairs of sketching dimensions $(m, m_c)$. We compare the pairs $(13, 12)$, $(11, 36)$, and $(8, 48)$. These choices of sketching dimensions were chosen since they have nearly equal overall compression ratios of 0.57%, 0.58%, and 0.62%, respectively; however, they vary considerably in whether they emphasize measurements to be used in estimating the factors or the core of the tensor. Note that the two-pass error, which relies only on the factor matrix estimates, is naturally best when the factor sketches are larger, i.e., the $m = 13$ case. However, the relative error of the recovered tensor in the one-pass setting is more than ten times better when more of the total measurement budget is allocated to estimating the core, as shown in Figure 3.4. This shows that in some situations it is preferable to allocate more resources to obtaining measurements for the core than for the factors, up to some threshold. For example, in Figure 3.4 the rank of the true signal is 10, and going below this dimension for the factor sketches does correspond with no longer improving the accuracy in terms of the trade-off between $m$ and $m_c$.

Figure 3.4 Relative error plots at different signal-to-noise ratios for two-pass and one-pass recovery of noisy low-rank tensors. Ordered pairs indicate the choice of sketching dimensions $(m, m_c)$.

3.3.3 Error Bounds Apply to sub-Gaussian Measurement Matrices

In this next experiment we demonstrate, in a similar manner as done in Figure 1 of [60], that recovery performance does not vary greatly for different choices of sub-Gaussian measurement matrices. What is different from that earlier work is that the measurement ensembles here are all Kronecker structured. Plotted in Figure 3.5 are relative errors for Gaussian (g) matrices, sparse matrices with entries drawn from $\{-1, 0, 1\}$ with weights $\frac{1}{6}, \frac{2}{3}, \frac{1}{6}$ (sp0), sub-sampled randomized Fourier transforms as in [64] (rfd), and a mixed measurement ensemble that uses Gaussian-RFD-sparse measurements where the measurement type varies by mode, a scenario that is practically and theoretically not well suited to the Khatri-Rao structured measurement operators used in [60].

Figure 3.5 Relative errors for one-pass and two-pass recovery with Kronecker measurement ensembles made up of different kinds of sub-Gaussian random matrices. The legend entries g, sp0, rfd, and mix correspond to Gaussian, sparse, sub-sampled random Fourier transform, and a mixture of all three for the measurement ensembles.

3.3.4 Comparison to Khatri-Rao Structured Measurements

Figure 3.6 Relative error comparison for Khatri-Rao and Kronecker-structured measurements. The right-hand figure shows the average time for the sketching and recovery phases of the algorithm.
This set of experiments demonstrates that the sketching phase will dominate the run-time of Algorithm 3.1 regardless of the choice of leave-one-out type; however, the Kronecker-structured measurements are able to generate more measurements for a fixed number of operations as compared to Khatri-Rao structured measurements, see Figure 3.6. This means that it is possible to achieve similar or better performance using strictly modewise measurements, and in less overall time, as problems grow in size with respect to the total number of tensor elements, i.e., both the number of modes and the length of those modes. In Figure 3.6, for the Kronecker-structured measurements we sketch to $m = 25$, while for the Khatri-Rao ensemble we sketch pairs of modes to 225. We see that the Kronecker measurements perform incrementally better in terms of relative error, but at less than half the overall run-time. Sketching times are about five times faster for the Kronecker-structured measurements as compared to the Khatri-Rao. Note that this does trade speed for space: the total number of entries in the leave-one-out measurements is nearly three times larger for the Kronecker-structured measurements versus the Khatri-Rao, i.e., the sketches $B_i$ as per (3.2) and (3.4) have sizes $300 \times 25^2$ and $300 \times 225$, respectively.

3.3.5 Application to Video Summary Task

As a practical demonstration, we consider the same video summary task first described in [47] and again in [60]. In this demonstration, the video is taken with a camera in a fixed position. The video is a nature scene, and a person walks in front of the camera at two different time points in the second half of the video. The first 100 and the last 193 frames are removed since they include setup that results in small shifts of the camera. The entire video has been converted to grayscale. This yields a three-mode tensor of dimensions $2200 \times 1080 \times 1980$, which has a size of about 41 GB when stored as an array of doubles. We wish to identify the parts of the scene that include the person walking and distinguish them from the relatively static scene elsewhere. As discussed in [60], there is a third salient time-varying feature in this particular video, which is the light intensity of the scene, since at around frame 940 the scene darkens. Furthermore, there are changes in the light intensity as the camera automatically adjusts after the person walks in and out of the frame. For this reason, we cluster the frames using three centers rather than two.

In all cases, we use $k$-means to cluster the frames; however, we assign features to frames in four different ways (a small sketch of the first approach follows this list):
1. Using the sketch $B_1 \in \mathbb{R}^{2200 \times 20^2}$, as in (3.2), that leaves out the time dimension, then clustering using $k$-means on the rows of the unfolding of the sketch along the first, temporal mode.
2. Unfolding the temporal mode of the reconstructed tensor using a one-pass set of measurements, i.e., $(X_1)_{[1]} \in \mathbb{R}^{2200 \times 2138400}$ (recall that $\mathcal{X}_1$ denotes the output of Algorithm 3.1).
3. Unfolding the reconstructed tensor in the two-pass scenario, $(X_2)_{[1]} \in \mathbb{R}^{2200 \times 2138400}$ (recall that $\mathcal{X}_2$ denotes the output of Algorithm 3.10).
4. Using the estimated temporal factor matrix $U_1 \in \mathbb{R}^{2200 \times 20}$ (see Algorithm 3.1).
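The first of these feature choices amounts to clustering the rows of the temporal-mode sketch. A minimal sketch of that step, assuming scikit-learn's KMeans and a stand-in for a precomputed $B_1$ (variable names are ours):

\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

# B1: temporal-mode sketch of the video tensor, one row per frame
# (here a random stand-in with the shapes used in the text: 2200 frames, 20^2 features)
rng = np.random.default_rng(4)
B1 = rng.standard_normal((2200, 400))

# cluster the frames into three groups (person present, static scene, darkened scene)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(B1)
print(labels.shape)   # (2200,) -- one cluster assignment per frame
\end{verbatim}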
As we can see in Figures 3.7a and 3.7b, the sketch alone yields reliable clustering of the main temporal changes in the video, which verifies the observation in [60] about using the measurements as an effective feature set for clustering, although in that case the measurements were Khatri-Rao structured whereas ours are Kronecker-structured. The unfoldings of the reconstructed tensor also reliably distinguish the main parts of the scene; the reconstruction is useful at least for interpreting the resulting clusters. Although it is certainly natural to wish to cluster on the temporal factor, that method appears inferior to any of the preceding ones.

As an added advantage of using the modewise, Kronecker structured measurements, we can in principle select different measurement maps for different modes. Gaussian measurement maps theoretically have some advantages over other types in terms of accuracy for a fixed number of measurements, whereas applying RFD or other Fourier-like transforms to modes that have longer fibers nets a better payoff in terms of overall run-time because of the faster matrix-vector multiply permitted by these structured matrices. In this demonstration, we use Gaussian matrices along the spatial modes, and RFD matrices for the temporal mode.

In the earlier work [47], the authors describe a variant of Tucker-Alternating Least Squares (a.k.a. Higher Order Orthogonal Iteration, a multi-pass scenario) that employed TensorSketch to produce the necessary measurements used to reconstruct the same video tensor data we have used here. In the subsequent work [60], those authors again perform the same task, but use a single-pass approach which fits the framework we have described as Algorithm 3.1, where the measurement matrices are Khatri-Rao structured and the $\Omega_i$ have entries drawn from the standard Gaussian distribution. Furthermore, analysis of the type afforded by Theorem 3.11 may also explain the discrepancy between the sketching dimensions seen in [47] and [60]. Naturally there are several differences between the approaches, but the CountSketch matrices used in the TensorSketch operators, as shown in [33], have an $\mathcal{O}\left(\frac{1}{\delta}\right)$ dependency in order to ensure the OSE property, whereas the other ensembles, such as dense Gaussians, enjoy an $\mathcal{O}\left(\log\frac{1}{\delta}\right)$ dependence on this parameter.

As was discussed in [60], the video is not especially low-rank in practice, in particular along the spatial dimensions in terms of the relative error of the reconstruction. However, the clusters appear distinct enough that assigning clusters with this summary type of information is still possible.

3.4 Discussion

In this chapter we have described a measurement system that specializes the framework discussed in Section 1.5, together with a simple recovery procedure, that addresses the three main issues we noted about the two-stage measurement operators needed to ensure a TRIP, which was the key to making use of the existing theory regarding TIHT that was the subject of Chapter 2. That is, the leave-one-out alternative discussed in this chapter does not require working memory on the order of the size of the uncompressed signal, because it computes factors of the estimate directly from the measurements. Furthermore, the recovery procedure is non-iterative, provably recovers the tensor in the exact-arithmetic, no-noise, no-rank-truncation scenario, and has no reliance on unverifiable assumptions about the quality of the fit of a thresholding operator, as TIHT had.
On the third issue, whereby TRIP required an initial reshaping in order to achieve useful compression, the leave-one-out alternative trades this constraint for another: namely, that one of the modes goes uncompressed. A direct comparison of which of these constraints is preferable is perhaps best left to a given application rather than to any data-agnostic objective criterion. It is worth mentioning, however, that the need for the measurements to be independent is, based on empirical evidence from synthetic data, to a large extent an artifact of the analysis. That is, practically speaking, one can reuse measurement maps to produce the different measurement tensors, or even share the measurements that were used to estimate the factors and use them to estimate the core, without any change in accuracy.

Figure 3.7 (a) Cluster assignments for the 2200 frames in the video. The top row corresponds to using the measurements for the first mode only, $B_1$; the middle rows use the one- and two-pass approximations of the tensor; and the last row uses the factor matrix $U_1$ for the temporal mode only. We use Gaussian sketching matrices for both spatial modes and the real part of RFD for the temporal mode. Sketching parameters are $m = 20$, $m_c = 40$, and a rank truncation of $r = 10$ in all modes. (b) Three reference frames at 0, 1000, and 1496.

The dependencies introduced, however, make the analysis of such a procedure considerably more complicated. A practitioner could economize on both the storage of the linear transformations and the number of measurements by reusing maps and "recycling" measurements.

Figure 3.8 The 1456th frame of the grayscale video is shown for the original, the one-pass, and the two-pass reconstructions using sketching dimensions of $m = 300$, $m_c = 601$ and $r = 50$ for each mode. Although the reconstructions for this choice of sketching dimension and rank truncation are not particularly accurate for this video, the reconstructions nevertheless provide enough information to perform the summary task of clustering the frames into the major temporal changes that occur during the scene.

Also, in this chapter we have shown how several variations can be unified under the leave-one-out concept, and the results made useful for analyzing all of them. We have used empirical tests on simulated data to highlight important trade-offs in terms of what type of measurements were selected and how the sensing matrices were structured. Finally, we demonstrated a simple but representative application whereby clustering or segmenting video data is handled using the measurements. Recall the example from Section 1.2.2 as to why both the segmentation using the measurements and the recovered tensor could be of practical value.

3.5 Technical Proofs

Herein we provide proofs of some technical lemmas needed to prove our main theorems, including a lemma that we first stated in Chapter 1 regarding the AMM property. The second subsection then addresses the proof of Theorem 3.2.

3.5.1 Proof of Lemma 1.8

Our proof of Lemma 1.8 will utilize several intermediate lemmas. Our first lemma, concerning the approximate preservation of inner products, is a slight generalization of [2, Corollary 2].

Lemma 3.9 (The JL property implies angle preservation). Let $S \subset \mathbb{C}^n$ with cardinality at most $p$ and $\epsilon \in (0,1)$.
If a random matrix ฮฉ โˆˆ C๐‘šร—๐‘› has the (๐œ–/4, ๐›ฟ, 4๐‘2)-JL property for ๐‘†โ€ฒ = (cid:26) x โˆฅxโˆฅ2 + y โˆฅyโˆฅ2 , x โˆฅxโˆฅ2 โˆ’ y โˆฅyโˆฅ2 , x โˆฅxโˆฅ2 + ๐‘– y โˆฅyโˆฅ2 , x โˆฅxโˆฅ2 โˆ’ ๐‘– y โˆฅyโˆฅ2 (cid:12) (cid:12) x, y โˆˆ ๐‘† (cid:27) , then |โŸจฮฉx, ฮฉyโŸฉ โˆ’ โŸจx, yโŸฉ| โ‰ค ๐œ– โˆฅxโˆฅ2โˆฅyโˆฅ2 โˆ€x, y โˆˆ ๐‘† (3.35) 100 will be satisfied with probability at least 1 โˆ’ ๐›ฟ. Proof. Note that if either x = 0 or y = 0, then (3.35) automatically holds because 0 โ‰ค 0. Thus, suppose without loss of generality that x, y โ‰  0. Considering the normalizations u = x โˆฅxโˆฅ2 , v = y โˆฅyโˆฅ2 , one can see that the polarization identity implies that |โŸจฮฉu, ฮฉvโŸฉ โˆ’ โŸจu, vโŸฉ| = โ‰ค (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 1 4 1 4 โ„“=0 3 โˆ‘๏ธ โ„“=0 ๐œ– 4 3 โˆ‘๏ธ ๐‘–โ„“ (cid:16) โˆฅฮฉu + ๐‘–โ„“ฮฉvโˆฅ2 2 โˆ’ โˆฅu + ๐‘–โ„“vโˆฅ2 2 (cid:17) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (โˆฅuโˆฅ2 + โˆฅvโˆฅ2)2 = ๐œ– will hold whenever (1.8) holds with ๐‘† โ† ๐‘†โ€ฒ and ๐œ– โ† ๐œ–/4. The result now follows by renormalizing. Remark 3.5. Note that if ๐‘† โŠ‚ R๐‘› it suffices for a random matrix ฮฉ โˆˆ R๐‘šร—๐‘› to have the (๐œ–/2, ๐›ฟ, 2๐‘2)- JL property for a smaller set ๐‘†โ€ฒ โŠ‚ R๐‘› in Lemma 3.9. This can be seen by using the real version of the polarization identity instead of the complex version. The next lemma constructs a set ๐‘† to utilize in Lemma 3.9 based on two matrices with normalized columns. The end result is an entrywise approximate matrix multiplication property for the two column-normalized matrices in question. Lemma 3.10 (The JL property allows approximate matrix multiplies for unitary matrices). Let ๐‘‰ โˆˆ C๐‘›ร—๐‘ and ๐‘ˆ โˆˆ C๐‘›ร—๐‘ž have unit โ„“2-normalized columns. Suppose that ฮฉ โˆˆ C๐‘šร—๐‘› satisfies (3.35) from Lemma 3.9 where ๐‘† = (cid:8)u ๐‘— |u ๐‘— = ๐‘ˆ [:, ๐‘—](cid:9) โˆช (cid:8)v ๐‘— |v ๐‘— = ๐‘‰ [:, ๐‘—](cid:9). Then (cid:12) (cid:12)(๐‘‰ โˆ—ฮฉโˆ—ฮฉ๐‘ˆ โˆ’ ๐‘‰ โˆ—๐‘ˆ) ๐‘˜, ๐‘— (cid:12) (cid:12) โ‰ค ๐œ–, for all 1 โ‰ค ๐‘˜ โ‰ค ๐‘ and 1 โ‰ค ๐‘— โ‰ค ๐‘ž. Proof. Note that |๐‘†| = ๐‘ + ๐‘ž. Thus, |๐‘†โ€ฒ| โ‰ค 4( ๐‘ + ๐‘ž)2 in Lemma 3.9. Furthermore, . . . . . . ฮฉ๐‘‰ = Hence, (cid:0)(ฮฉ๐‘‰)โˆ— ฮฉ๐‘ˆ(cid:1) . . . ฮฉv๐‘ (cid:169) (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) ฮฉu1 ฮฉu2 ฮฉv1 ฮฉv2 (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) (cid:171) ๐‘˜, ๐‘— = โŸจฮฉu ๐‘— , ฮฉv๐‘˜ โŸฉ. Therefore, given Lemma 3.9, for all choices of ๐‘˜, ๐‘— we have , and ฮฉ๐‘ˆ = . . . ฮฉu๐‘ž (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) . . . . . . . (cid:12) (cid:12)(๐‘‰ โˆ—ฮฉโˆ—ฮฉ๐‘ˆ โˆ’ ๐‘‰ โˆ—๐‘ˆ) ๐‘˜, ๐‘— (cid:12) = (cid:12) (cid:12) (cid:12)โŸจฮฉu ๐‘— , ฮฉv๐‘˜ โŸฉ โˆ’ โŸจu ๐‘— , v๐‘˜ โŸฉ(cid:12) (cid:12) โ‰ค ๐œ– โˆฅv๐‘˜ โˆฅ2โˆฅu ๐‘— โˆฅ2 = ๐œ– . 101 The next lemma constructs a new set ๐‘† to utilize in Lemma 3.9 by selecting a well chosen subset of the singular vectors of both ๐ด and ๐ต. This set will ultimately determine how the finite set ๐‘† promised by Lemma 1.8 depends on ๐ด and ๐ต. As we shall see, itโ€™s proven by applying Lemma 3.10 to two unitary matrices provided by the SVDs of ๐ด and ๐ต. Lemma 3.11 (The JL property implies the AMM property for arbitrary matrices). 
Let ๐ด โˆˆ C๐‘ร—๐‘› 2 , and suppose that ฮฉ โˆˆ C๐‘šร—๐‘› and ๐ต โˆˆ C๐‘›ร—๐‘ž have SVDs given by ๐ด = ๐‘ˆ1ฮฃ1๐‘‰ โˆ— and ๐ต = ๐‘ˆฮฃ2๐‘‰ โˆ— satisfies the conditions of Lemma 3.10 for ๐‘ˆ and ๐‘‰. Then, โˆฅ ๐ดฮฉโˆ—ฮฉ๐ต โˆ’ ๐ด๐ตโˆฅ๐น โ‰ค ๐œ– โˆฅ ๐ดโˆฅ๐น โˆฅ๐ตโˆฅ๐น Proof. We will expand the quantity of interest according the SVD of ๐ด and ๐ต. Doing so we see that โˆฅ ๐ดฮฉโˆ—ฮฉ๐ต โˆ’ ๐ด๐ตโˆฅ๐น = โˆฅ๐‘ˆ1ฮฃ1๐‘‰ โˆ—ฮฉโˆ—ฮฉ๐‘ˆฮฃ2๐‘‰ โˆ— 2 โˆ’ ๐‘ˆ1ฮฃ1๐‘‰ โˆ—๐‘ˆฮฃ2๐‘‰ โˆ— 2 โˆฅ๐น = โˆฅ๐‘ˆ1ฮฃ1 (๐‘‰ โˆ—ฮฉโˆ—ฮฉ๐‘ˆ โˆ’ ๐‘‰ โˆ—๐‘ˆ) ฮฃ2๐‘‰ โˆ— 2 โˆฅ๐น = โˆฅฮฃ1 (๐‘‰ โˆ—ฮฉโˆ—ฮฉ๐‘ˆ โˆ’ ๐‘‰ โˆ—๐‘ˆ) ฮฃ2โˆฅ๐น (cid:118)(cid:117)(cid:116) ๐‘ โˆ‘๏ธ ๐‘ž โˆ‘๏ธ ๐‘˜=1 (cid:118)(cid:117)(cid:116) ๐‘ โˆ‘๏ธ ๐‘—=1 ๐‘ž โˆ‘๏ธ = โ‰ค (ฮฃ1)2 ๐‘˜,๐‘˜ |๐‘‰ โˆ—ฮฉโˆ—ฮฉ๐‘ˆ โˆ’ ๐‘‰ โˆ—๐‘ˆ|2 ๐‘˜, ๐‘— (ฮฃ2)2 ๐‘—, ๐‘— ๐œŽ๐‘˜ ( ๐ด)2๐œ– 2๐œŽ๐‘— (๐ต)2 ๐‘—=1 ๐‘˜=1 (cid:118)(cid:116) ๐‘ โˆ‘๏ธ ๐œŽ๐‘˜ ( ๐ด)2 (cid:118)(cid:117)(cid:116) ๐‘ž โˆ‘๏ธ ๐‘—=1 ๐œŽ๐‘— (๐ต)2 ๐‘˜=1 = ๐œ– = ๐œ– โˆฅ ๐ดโˆฅ๐น โˆฅ๐ตโˆฅ๐น . Lemmas 3.9, 3.10, and 3.11 now collectively prove the following generalized version of Lemma 1.8. Lemma 3.12 (The JL property provides the AMM property). Let ๐ด โˆˆ C๐‘ร—๐‘› and ๐ต โˆˆ C๐‘›ร—๐‘ž. There exists a finite set ๐‘† โŠ‚ C๐‘› with cardinality |๐‘†| โ‰ค 4( ๐‘ + ๐‘ž)2 (determined entirely by ๐ด and ๐ต) such 102 that the following holds: If a random matrix ฮฉ โˆˆ C๐‘šร—๐‘› has the (๐œ–/4, ๐›ฟ, 4( ๐‘ + ๐‘ž)2)-JL property for ๐‘†, then ฮฉ will also have the (๐œ–, ๐›ฟ)-AMM property for ๐ด and ๐ต. We will make use of this simple centering result with regards to ๐ฟ ๐‘ norms of random variables. Lemma 3.13. Suppose ๐‘‹ a real random variable, and let ๐‘ โ‰ฅ 1, Then, โˆฅ ๐‘‹ โˆ’ E[๐‘‹] โˆฅ ๐ฟ ๐‘ โ‰ค 2 โˆฅ ๐‘‹ โˆฅ ๐ฟ ๐‘ Proof. Let ๐œ‡ = E[๐‘‹]. By Jensenโ€™s inequality, โˆฅ๐œ‡โˆฅ ๐ฟ ๐‘ = |๐œ‡| โ‰ค E[|๐‘‹ |] = โˆฅ ๐‘‹ โˆฅ ๐ฟ1 Observe, โˆฅ ๐‘‹ โˆ’ ๐œ‡โˆฅ ๐ฟ ๐‘ โ‰ค โˆฅ ๐‘‹ โˆฅ ๐ฟ ๐‘ + โˆฅ๐œ‡โˆฅ ๐ฟ ๐‘ โ‰ค โˆฅ ๐‘‹ โˆฅ ๐ฟ ๐‘ + โˆฅ ๐‘‹ โˆฅ ๐ฟ1 โ‰ค 2 โˆฅ ๐‘‹ โˆฅ ๐ฟ ๐‘ Where we have used Minkowskiโ€™s inequality in the first line, and Jensenโ€™s inequality in the third. 3.5.2 Proof of Theorem 3.2 A similar proof appears in support of [53, Theorem 2], which was itself simplified from earlier work [11]. We reproduce the proof here for completeness, and to clarify details. We begin by restating Theorem 3.2 for ease of reference. Theorem 3.12 (Restatement of Theorem 3.2). Let ๐‘‹ โˆˆ R๐‘›ร—๐‘ of rank หœ๐‘Ÿ โ‰ค min{๐‘›, ๐‘ } have the full SVD ๐‘‹ = ๐‘ˆฮฃ๐‘‰๐‘‡ , and let ๐‘‰๐‘Ÿ โ€ฒ โˆˆ R๐‘ร—๐‘Ÿ โ€ฒ denote the first ๐‘Ÿโ€ฒ columns of ๐‘‰ โˆˆ R๐‘ร—๐‘ for all ๐‘Ÿโ€ฒ โˆˆ [๐‘]. Fix ๐‘Ÿ โˆˆ [๐‘›] and consider the head-tail split ๐‘‹ = ๐‘‹๐‘Ÿ + ๐‘‹\๐‘Ÿ. If ฮฉ โˆˆ R๐‘šร—๐‘ satisfies 1. subspace embedding property (1.10) with ๐œ– โ† ๐œ– 3 for ๐ด โ† ๐‘‹๐‘‡ ๐‘Ÿ , 2. approximate multiplication property (1.13) with ๐œ– โ† 6 ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ}, ๐œ– โˆš min{๐‘Ÿ, หœ๐‘Ÿ} for ๐ด โ† ๐‘‹\๐‘Ÿ and ๐ต โ† 3. JL property (1.8) with ๐œ– โ† ๐œ– 4. approximate multiplication property (1.13) with ๐œ– โ† ๐œ– โˆš 6 the ๐‘› columns of ๐‘‹๐‘‡ \๐‘Ÿ 6 for ๐‘† โ† (cid:110) (cid:111) , and ๐‘Ÿ for ๐ด โ† ๐‘‹\๐‘Ÿ and ๐ต โ† ๐‘‹๐‘‡ \๐‘Ÿ, then หœ๐‘‹ := ๐‘‹ฮฉ๐‘‡ is an (๐œ–, 0, ๐‘Ÿ)-PCP sketch of ๐‘‹. Proof. 
Let ๐‘„ โˆˆ R๐‘›ร—๐‘Ÿ be a an arbitrary matrix with orthonormal columns so that ๐‘„๐‘„๐‘‡ โˆˆ R๐‘›ร—๐‘› is an orthogonal projection matrix. It suffices to show that 103 (cid:12) (cid:12) (cid:12) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) ๐‘‹(cid:13) (cid:13) (cid:13) 2 ๐น โˆ’ (cid:13) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) ๐‘‹ฮฉ๐‘‡ (cid:13) 2 (cid:13) ๐น (cid:12) (cid:12) (cid:12) โ‰ค ๐œ– (cid:13) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) ๐‘‹(cid:13) 2 (cid:13) ๐น . (3.36) Writing ๐‘‹ in terms of its head-tail split, (3.36) becomes (cid:12) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ )(๐‘‹๐‘Ÿ + ๐‘‹\๐‘Ÿ)(cid:13) (cid:13) (cid:12) (cid:13) (cid:12) 2 ๐น โˆ’ (cid:13) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) (๐‘‹๐‘Ÿ + ๐‘‹\๐‘Ÿ)ฮฉ๐‘‡ (cid:13) (cid:13) 2 ๐น (cid:12) (cid:12) (cid:12) โ‰ค ๐œ– (cid:13) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) ๐‘‹(cid:13) 2 (cid:13) ๐น , (3.37) where ๐‘‹๐‘Ÿ has the thin SVD representation ๐‘‹๐‘Ÿ = ๐‘ˆ๐‘Ÿ ฮฃ๐‘Ÿ๐‘‰๐‘‡ ๐‘Ÿ with ๐‘ˆ๐‘Ÿ and ๐‘‰๐‘Ÿ containing the first ๐‘Ÿ columns of ๐‘ˆ and ๐‘‰ from the full SVD of ๐‘‹, respectively.1 Letting ๐‘ƒ := (๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ), and using that tr(cid:8)๐ด๐ด๐‘‡ (cid:9) = โˆฅ ๐ดโˆฅ2 ๐น, one may expand the left hand side of (3.37). Noting that ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ = 0 and ๐‘ƒ = ๐‘ƒ๐‘‡ while doing so, we can now see that (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ )(๐‘‹๐‘Ÿ + ๐‘‹\๐‘Ÿ)(cid:13) (cid:13) ๐น โˆ’ (cid:13) 2 (cid:13) (cid:17) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) (๐‘‹๐‘Ÿ + ๐‘‹\๐‘Ÿ)ฮฉ๐‘‡ (cid:13) (cid:13) (cid:16) (cid:16) 2 ๐น โˆ’ tr ๐‘ƒ(๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ + ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ + ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:12) (cid:12) (cid:12) = \๐‘Ÿ + ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12)tr ๐‘Ÿ + ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) . Regrouping terms in the last expression while noting the invariance of trace to transposition, one can we can now further see that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ + ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17) (cid:16) โˆ’ tr ๐‘ƒ(๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ = โ‰ค (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:16) ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ โˆ’ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17) (cid:16) + 2 tr (cid:16) ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ โˆ’ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) + 2 (cid:12) (cid:12) (cid:12)tr \๐‘Ÿ + ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ + ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:16) (cid:17) ๐‘Ÿ )๐‘ƒ ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:16) ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ + tr (cid:17)(cid:12) (cid:12) (cid:12) + (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ (cid:16) ๐‘Ÿ + ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) (cid:17)(cid:12) (cid:12) (cid:12) ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) . 
Looking back at (3.37) in the light of this last computation, one can now see that it suffices to prove that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ โˆ’ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) + 2 (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) + (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค ๐œ– โˆฅ๐‘ƒ๐‘‹ โˆฅ2 ๐น (3.38) holds to establish the desired result. Going forward we will therefore aim to prove (3.38) by proving each of the following three bounds: ๐‘Ÿ )๐‘ƒ(cid:1)(cid:12) ๐‘Ÿ โˆ’ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:12)tr (cid:0)๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ (a) (cid:12) (cid:12) โ‰ค ๐œ– 3 โˆฅ๐‘ƒ๐‘‹ โˆฅ2 ๐น, 1If ๐‘Ÿ โ‰ฅ ๐‘ we let ๐‘‰๐‘Ÿ = ๐‘‰. 104 (b) (cid:12) (cid:12)tr (cid:0)๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:12) ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ (cid:12) (cid:12)tr 6 โˆฅ๐‘ƒ๐‘‹ โˆฅ2 (cid:17)(cid:12) \๐‘Ÿ)๐‘ƒ (cid:12) (cid:12) Proving (a) โ€“ (c) will establish (3.38), thereby completing the proof. ๐‘Ÿ )๐‘ƒ(cid:1)(cid:12) (cid:12) โ‰ค ๐œ– \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐น, and โ‰ค ๐œ– 3 โˆฅ๐‘ƒ๐‘‹ โˆฅ2 ๐น. (c) (cid:16) Proof of Bound (a): Using again that tr(cid:8)๐ด๐ด๐‘‡ (cid:9) = tr(cid:8)๐ด๐‘‡ ๐ด(cid:9) = โˆฅ ๐ดโˆฅ2 ๐น and that ๐‘ƒ = ๐‘ƒ๐‘‡ we have (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ โˆ’ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) = (cid:12) (cid:12) (cid:12) (cid:13) (cid:13)๐‘‹๐‘‡ ๐‘Ÿ ๐‘ƒ(cid:13) (cid:13) 2 ๐น โˆ’ (cid:13) (cid:13)ฮฉ๐‘‹๐‘‡ ๐‘Ÿ ๐‘ƒ(cid:13) 2 (cid:13) ๐น . (cid:12) (cid:12) (cid:12) Applying Lemma 1.7 in the light of assumption 1, we now have that (cid:16) (cid:12) (cid:12) (cid:12)tr as desired. ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ โˆ’ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค โ‰ค ๐œ– 3 ๐œ– 3 (cid:13) (cid:13)๐‘‹๐‘‡ ๐‘Ÿ ๐‘ƒ(cid:13) 2 ๐น = (cid:13) (cid:13)๐‘‹๐‘‡ ๐‘ƒ(cid:13) (cid:13) 2 ๐น = (cid:13) ๐œ– 3 ๐œ– 3 (cid:13) (cid:13)๐‘‰๐‘Ÿ๐‘‰๐‘‡ ๐‘Ÿ ๐‘‹๐‘‡ ๐‘ƒ(cid:13) 2 (cid:13) ๐น โˆฅ๐‘ƒ๐‘‹ โˆฅ2 ๐น Proof of Bound (b): Using the invariance of trace to both transposition and permutations, as well as that ๐‘ƒ๐‘‡ = ๐‘ƒ = ๐‘ƒ2, we can see that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:16) (cid:17)(cid:12) (cid:12) (cid:12) = (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:16) (cid:17)(cid:12) (cid:12) (cid:12) = (cid:12) (cid:12) (cid:12)tr ๐‘ƒ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:17)(cid:12) (cid:12) (cid:12) . Recalling the full SVD ๐‘‹ = ๐‘ˆฮฃ๐‘‰๐‘‡ , we note that if หœ๐‘Ÿ := rank(๐‘‹) < min{๐‘›, ๐‘ } we can remove the last ๐‘› โˆ’ หœ๐‘Ÿ columns of ๐‘ˆ, the last ๐‘ โˆ’ หœ๐‘Ÿ columns of ๐‘‰, and the last ๐‘› โˆ’ หœ๐‘Ÿ rows and ๐‘ โˆ’ หœ๐‘Ÿ columns of ฮฃ to form the thin SVD ๐‘‹ = หœ๐‘ˆ หœฮฃ หœ๐‘‰๐‘‡ with หœ๐‘ˆ โˆˆ R๐‘›ร— หœ๐‘Ÿ, หœฮฃ โˆˆ R หœ๐‘Ÿร— หœ๐‘Ÿ, and หœ๐‘‰ โˆˆ R๐‘ร— หœ๐‘Ÿ. 
Having done so we note that หœฮฃ will be invertable and that หœ๐‘ˆ หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿ = ๐‘‹๐‘Ÿ so that we may write (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) = = = = = (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:16) (cid:16) (cid:16) (cid:16) ๐‘ƒ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:17)(cid:12) (cid:12) (cid:12) ๐‘ƒ หœ๐‘ˆ หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:17)(cid:12) (cid:12) (cid:12) (cid:16)(cid:16) ๐‘ƒ หœ๐‘ˆ หœฮฃ หœ๐‘‰๐‘‡ หœ๐‘‰ หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ ๐‘ƒ หœ๐‘ˆ หœฮฃ หœ๐‘‰๐‘‡ (cid:17) (cid:16) (cid:16) หœ๐‘‰ หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:17)(cid:12) (cid:12) (cid:12) หœ๐‘‰ หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:17)(cid:17)(cid:12) (cid:12) (cid:12) (๐‘ƒ๐‘‹) . (cid:17)(cid:17)(cid:12) (cid:12) (cid:12) 105 Recall now that โŸจ๐ด, ๐ตโŸฉ๐น := tr(cid:0)๐ด๐ต๐‘‡ (cid:1) is an inner product on matrices with โˆฅ ๐ดโˆฅ๐น = โˆš๏ธƒ tr(cid:0)๐ด๐ด๐‘‡ (cid:1). Hence, we may apply the Cauchyโ€“Schwarz inequality to our last expression to see that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (๐‘ƒ๐‘‹) (cid:16) (cid:12) (cid:12) (cid:12)tr (cid:17)(cid:12) (cid:12) (cid:12) = โ‰ค โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น = โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น (cid:17)(cid:17)(cid:12) (cid:12) (cid:12) (cid:16) หœ๐‘‰ หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:13) (cid:13) (cid:13)๐น (cid:13) หœ๐‘‰ หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:13) \๐‘Ÿ (cid:13) (cid:13) (cid:13) หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:13) (cid:13) \๐‘Ÿ (cid:13)๐น (cid:13) . Expanding ๐‘‹๐‘Ÿ in terms of its thin SVD representation ๐‘‹๐‘Ÿ = ๐‘ˆ๐‘Ÿ ฮฃ๐‘Ÿ๐‘‰๐‘‡ ๐‘Ÿ we now have that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น = โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น = โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น = โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:13) (cid:13) (cid:13) (cid:13) หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡๐‘ˆ๐‘Ÿ ฮฃ๐‘Ÿ๐‘‰๐‘‡ (cid:13) (cid:13) (cid:13) min{๐‘Ÿ, หœ๐‘Ÿ}ฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘‰๐‘‡ (cid:13) \๐‘Ÿ (cid:13) (cid:13) (cid:13)๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ } (cid:13) (cid:13) (cid:13)๐น ๐‘Ÿ ฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:13) (cid:13) (cid:13)๐น (cid:13) (cid:13)๐น . (cid:13) (cid:13) (cid:13)๐น Finally, we may now use that ๐‘‹\๐‘Ÿ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} = ๐‘‹ (๐ผ๐‘ โˆ’๐‘‰๐‘Ÿ๐‘‰๐‘‡ ๐‘Ÿ )๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} = ๐‘‹ (cid:0)๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} โˆ’ ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} (cid:1) = 0 to see that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น (cid:13) (cid:13)๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} (cid:13) (cid:13)๐น = โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น โ‰ค โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น (cid:13) (cid:13)๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} โˆ’ ๐‘‹\๐‘Ÿ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} (cid:32) (cid:13) (cid:13)๐น (cid:13) (cid:13)๐‘‹\๐‘Ÿ (cid:13) (cid:13)๐น (cid:13) (cid:13)๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} (cid:13) (cid:13)๐น (cid:33) , ๐œ– 6โˆš๏ธmin{๐‘Ÿ, หœ๐‘Ÿ} where we have utilized assumption 2 in the last inequality. 
We are now finished after using that (cid:13)๐น = โˆš๏ธmin{๐‘Ÿ, หœ๐‘Ÿ}, and noting that (cid:13) (cid:13) (cid:13) (cid:13)๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} (cid:13)๐‘‹\๐‘Ÿ orthogonal projections ๐‘ƒ by the definition of ๐‘‹๐‘Ÿ. (cid:13) (cid:13)๐น โ‰ค โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น holds for all rank ๐‘› โˆ’ ๐‘Ÿ or greater Proof of Bound (c): Again using the invariance of trace to permutations as well as ๐‘ƒ = ๐‘ƒ2 = 106 ๐ผ โˆ’ ๐‘„๐‘„๐‘‡ we have that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) = = โ‰ค (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:16) (cid:16) ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ) (cid:17)(cid:12) (cid:12) (cid:12) \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ) (cid:17)(cid:12) (cid:12) (cid:12) (๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) (๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ (cid:16) (cid:16) \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ (cid:12) (cid:12) (cid:12)tr (cid:13) (cid:13)ฮฉ๐‘‹๐‘‡ (cid:13) ๐‘„๐‘„๐‘‡ (๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ (cid:12) (cid:12) (cid:12) (cid:12) (cid:17)(cid:12) (cid:12) (cid:12) \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ) (cid:13) (cid:13)๐‘„๐‘„๐‘‡ (cid:13) + (cid:13) (cid:13) (cid:13)๐น (cid:13) ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ (cid:13) 2 (cid:13) (cid:13) (cid:13) 2 (cid:13) (cid:13) โˆ’ + \๐‘Ÿ ๐น ๐น (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค (cid:12) (cid:13) (cid:12) (cid:13) (cid:12) (cid:13) (cid:12) ๐‘‹๐‘‡ \๐‘Ÿ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:13) (cid:13) (cid:13)๐น , where we have again utilized Cauchyโ€“Schwarz in the last inequality. Utilizing assumption 3, the first term just above can be bounded by ๐œ– ๐น = ๐œ– ๐น using an argument analogous to the 6 โˆฅ ๐‘‹๐‘‡ \๐‘Ÿ โˆฅ2 6 โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 proof of Lemma 1.7. Doing so we see that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค = ๐œ– 6 ๐œ– 6 โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ (cid:13)๐‘„๐‘„๐‘‡ (cid:13) ๐น + (cid:13) (cid:13)๐น โˆš (cid:13) ๐‘Ÿ (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ ๐น + \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:13) (cid:13) (cid:13)๐น . (cid:13) (cid:13) (cid:13)๐น Finally, we may now employ assumption 4 to bound the second term just above. Doing so we obtain that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค โ‰ค = ๐œ– 6 ๐œ– 6 ๐œ– 3 โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 ๐น + โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 ๐น + โˆš ๐‘Ÿ โˆš ๐‘Ÿ ๐น . โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:13) (cid:13) (cid:13) 6 ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ ๐œ– โˆš ๐‘Ÿ โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 ๐น (cid:13) (cid:13) (cid:13)๐น To conclude we note again that (cid:13) (cid:13)๐‘‹\๐‘Ÿ projections ๐‘ƒ by the definition of ๐‘‹๐‘Ÿ. 3.6 Algorithm Variants (cid:13) (cid:13)๐น โ‰ค โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น holds for all rank ๐‘› โˆ’ ๐‘Ÿ or greater orthogonal There are four tasks relevant in Algorithm 3.1 where by making difference choices for these tasks, we get variants of the algorithm. Two of the tasks are related to the measurement process: sketching the tensor in order to produce leave-one-out measurements, and also sketching the tensor without leaving out a mode in order to produce measurements useful in computing the core. 
The second two tasks are then recovering the factor matrices, and recovering the core. 107 Note, if we permit a second pass on the tensor X after estimating factor matrices ๐‘„๐‘–, then core measurements are not necessary, and the optimal core given these factor matrices can be found by computing G = X ร—1 ๐‘„๐‘‡ 1 ร—2 ๐‘„๐‘‡ 2 ยท ยท ยท ร—๐‘‘ ๐‘„๐‘‡ ๐‘‘ See section 4.2 in [37]. This then is the computation used in two-pass versions of the algorithm; it requires first to compute the factor matrices from measurements, and then apply these factors modewise to the original tensor; no separate measurement tensor for the core is required. Crucially however, this relies on a second access to the data; for which we are computing a HOSVD. In this section then, we break apart Algorithm 3.1 into these tasks, and show how combining different choices for these tasks produces the variants of the algorithm considered. Within the pseudo-code, โ€œunfoldโ€ refers to the operation of taking a tensor and flattening it into a matrix of the size listed by arranging the specified modeโ€™s fibers as the columns. The operation โ€œfoldโ€ is the inverse of this, taking a matrix as viewed as an unfolding along the specified mode and reshaping it into a tensor of the given dimensions. Algorithm 3.2 takes measurement matrices (e.g. sub-Gaussian random matrices) and the data tensor X and produces a set of leave-one-out measurements which can be viewed as tensors or as flattened matrices. That is, it is practical to apply the measurement matrices along the modes and obtain a tensor of measurements with ๐‘‘ โˆ’ 1 modes each of length ๐‘š and one mode of length ๐‘›, see Figure 3.1 for a schematic depiction. The measurement process takes slices of the tensor and maps them to (smaller) slices with (mostly) shorter edge lengths. Or we can conceive of the measurements as a matrix by unfolding the measurement tensor along the mode that is uncompressed and thus obtaining a matrix of size ๐‘› ร— ๐‘š๐‘‘โˆ’1, one such matrix for each mode ๐‘– โˆˆ [๐‘‘]. It is Kronecker structured because of the correspondence of modewise products of tensors with matrices to matrix products of unfoldings of a tensor with matrices, see for example (1.4). Alternatively, we can use Algorithm 3.3 which also produces a set of leave-one-out measure- ments of the tensor X which can be viewed as a matrix. It is Khatri-Rao structured because the measurement matrix applied to the unfolding is formed using Khatri-Rao products. Note, unlike 108 : Algorithm 3.2 Leave-One-Out Kronecker Sketching input ฮฉ(๐‘–, ๐‘—) where rank(ฮฉ(๐‘–,๐‘–)) = ๐‘› and ฮฉ(๐‘–, ๐‘—) โˆˆ R๐‘šร—๐‘› for ๐‘–, ๐‘— โˆˆ [๐‘‘], ๐‘– โ‰  ๐‘— X a ๐‘‘ mode tensor with side lengths ๐‘› output : ๐ต๐‘– for ๐‘– โˆˆ [๐‘‘] for ๐‘– โˆˆ [๐‘‘] do B๐‘– โ† X ร—1 ฮฉ(๐‘–,1) ร—2 ฮฉ(๐‘–,2) ร—3 ยท ยท ยท ร—๐‘‘ ฮฉ(๐‘–,๐‘‘) # Now unfold the measurement tensor so the mode-๐‘– fibers are columns, size ๐‘› ร— ๐‘š๐‘‘โˆ’1 ๐ต๐‘– โ† unfold(B๐‘–, ๐‘› ร— ๐‘š๐‘‘โˆ’1, mode = ๐‘–) end Algorithm 3.2, there is not necessarily any natural way to view the measurements ๐ต๐‘– as tensors of ๐‘‘ modes - the sketching process in this case takes slices of the tensor and maps them to vectors and we gather these into matrices ๐ต๐‘– of size ๐‘› ร— ๐‘š for each ๐‘– โˆˆ [๐‘‘]. 
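Before presenting the Khatri-Rao variant in Algorithm 3.3 below, it may help to see the Kronecker-structured leave-one-out measurement process of Algorithm 3.2 in code. The following is a minimal NumPy sketch rather than the implementation used in our experiments: it assumes a three mode tensor with equal side lengths, Gaussian measurement matrices scaled by 1/√m, and takes the full-rank map Ω(i,i) on the left-out mode to be the identity for simplicity. The final two lines anticipate the factor recovery step of Algorithm 3.5 given later in this section.

```python
import numpy as np

def mode_product(X, M, k):
    """Mode-k product X x_k M: multiply the mode-k unfolding of X by M."""
    Xk = np.moveaxis(X, k, 0)                     # bring mode k to the front
    shp = Xk.shape
    Y = M @ Xk.reshape(shp[0], -1)                # act on the mode-k unfolding
    return np.moveaxis(Y.reshape((M.shape[0],) + shp[1:]), 0, k)

def leave_one_out_kronecker_sketch(X, m, rng):
    """Minimal sketch of Algorithm 3.2 with Gaussian measurement matrices."""
    d, n = X.ndim, X.shape[0]
    B = []
    for i in range(d):
        # identity on the left-out mode i (any full-rank n x n map would do),
        # m x n Gaussian maps on the remaining modes
        Omegas = [np.eye(n) if j == i else rng.standard_normal((m, n)) / np.sqrt(m)
                  for j in range(d)]
        Bi = X
        for j, Om in enumerate(Omegas):
            Bi = mode_product(Bi, Om, j)
        # unfold so the (uncompressed) mode-i fibers are the columns: n x m^(d-1)
        B.append(np.moveaxis(Bi, i, 0).reshape(n, -1))
    return B

rng = np.random.default_rng(0)
n, r, m = 40, 3, 10
# a random rank-(r, r, r) Tucker tensor, purely for illustration
G = rng.standard_normal((r, r, r))
U = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(3)]
X = G
for k, Uk in enumerate(U):
    X = mode_product(X, Uk, k)
B1, B2, B3 = leave_one_out_kronecker_sketch(X, m, rng)
print(B1.shape)                                    # (n, m**2) = (40, 100)

# factor recovery (cf. Algorithm 3.5): with Omega(i,i) = I the linear solve is trivial,
# and the top-r left singular vectors of B1 estimate the mode-1 HOSVD factor
Q1 = np.linalg.svd(B1, full_matrices=False)[0][:, :r]
print(np.linalg.norm(Q1 @ Q1.T @ U[0] - U[0]))     # close to 0: the column spans coincide
```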
: Algorithm 3.3 Leave-One-Out Khatri-Rao Sketching input ฮฉ(๐‘–, ๐‘—) where rank(ฮฉ(๐‘–,๐‘–)) = ๐‘› and ฮฉ(๐‘–, ๐‘—) โˆˆ R๐‘šร—๐‘› for ๐‘–, ๐‘— โˆˆ [๐‘‘], ๐‘– โ‰  ๐‘— X a ๐‘‘ mode tensor with side lengths ๐‘› output : ๐ต๐‘– for ๐‘– โˆˆ [๐‘‘] for ๐‘– โˆˆ [๐‘‘] do (cid:0)ฮฉ(๐‘–,1) โ€ข ฮฉ(๐‘–,2) โ€ข . . . โ€ข ฮฉ๐‘–,๐‘–โˆ’1 โ€ข ฮฉ๐‘–,๐‘–+1 โ€ข . . . โ€ข ฮฉ๐‘–,๐‘‘(cid:1)๐‘‡ ๐ต๐‘– โ† ฮฉ(๐‘–,๐‘–) ๐‘‹[๐‘–] end In order to estimate the core, we wish another, independent set of measurements. These are obtained in the manner as Algorithm 3.2, only now we are permitted to compress all the modes, producing a tensor of ๐‘‘ modes all with side length equal to ๐‘š๐‘. : Algorithm 3.4 Core Sketching input ฮฆ๐‘– โˆˆ R๐‘š๐‘ร—๐‘›, ๐‘– โˆˆ [๐‘‘] X a ๐‘‘ mode tensor with side lengths ๐‘› output : B๐‘ B๐‘ โ† X ร—1 ฮฆ1 ร—2 ฮฆ2 ร—3 ยท ยท ยท ร—๐‘‘ ฮฆ๐‘‘ Next, we describe the procedure which takes as input (๐‘Ž) leave-one-out measurements of X (from Algorithm 3.2 or 3.3), (๐‘) the full-rank sensing matrices applied to the uncompressed modes of X, and (๐‘) our desired target rank vector of r, and then outputs a factor matrix ๐‘„๐‘– for each mode. 109 In the case of exact arithmetic, no rank truncation, and no noise, this exactly recovers the factors (see (3.11)). for ๐‘– โˆˆ [๐‘‘] measurements that leave mode ๐‘– uncompressed Algorithm 3.5 Recover HOSVD Factors from Leave-One-Out Measurements input : ๐ต๐‘– โˆˆ ๐‘…๐‘›ร—๐‘š๐‘‘โˆ’1 ฮฉ(๐‘–,๐‘–) โˆˆ ๐‘…๐‘›ร—๐‘› for ๐‘– โˆˆ [๐‘‘] r = (๐‘Ÿ, . . . , ๐‘Ÿ) desired rank for HOSVD output : ๐‘„1, . . . , ๐‘„๐‘‘ โˆˆ R๐‘›ร—๐‘Ÿ # Factor matrix recovery for ๐‘– โˆˆ [๐‘‘] do # Solve ๐‘› ร— ๐‘› linear system Solve ฮฉ(๐‘–,๐‘–) ๐น๐‘– = ๐ต๐‘– for ๐น๐‘– # Compute SVD and keep the top ๐‘Ÿ singular vectors ๐‘ˆ, ฮฃ, ๐‘‰๐‘‡ โ† SVD(๐น๐‘–) ๐‘„๐‘– โ† ๐‘ˆ:,:๐‘Ÿ end Lastly, we consider the task of obtaining the core of the HOSVD of the data tensor. The two ways described in section 3.2 are to either compute the core using a second access to the data tensor - in which case this is a matter of applying the transpose of the factor matrices from Algorithm 3.5 to the data tensor. This is detailed in Algorithm 3.10. : Algorithm 3.6 Compute HOSVD Core with Second Access input ๐‘„1, . . . , ๐‘„๐‘‘ โˆˆ R๐‘›ร—๐‘Ÿ computed factor matrices X a ๐‘‘ mode tensor with side lengths ๐‘›. output : G a ๐‘‘ mode tensor with side lengths ๐‘Ÿ G = X ร—1 ๐‘„๐‘‡ 2 ยท ยท ยท ร—๐‘‘ ๐‘„๐‘‡ 1 ร—2 ๐‘„๐‘‡ ๐‘‘ In the scenario in which a second access to the tensor is not desired, instead we obtain the core of the HOSVD by solving a linear system involving the measurement operators and the factor matrices, see (3.21). This is equivalent to solving the linear system a mode at a time as detailed in Algorithm 3.7, a method of practical value because it does not require as much working memory. A third possibility is to โ€œre-useโ€ leave-one-out measurements to compute the core. Theoretically this involves new dependencies on the errors introduced by estimating the factors and errors introduced by recovering the core that are not addressed in any of our main results. Practically 110 : Algorithm 3.7 Recover HOSVD Core from Measurements input B๐‘ a ๐‘‘ mode tensor with side lengths ๐‘š๐‘ ฮฆ๐‘– โˆˆ R๐‘š๐‘ร—๐‘› ๐‘„1, . . . 
, ๐‘„๐‘‘ โˆˆ R๐‘›ร—๐‘Ÿ computed factor matrices output : H for ๐‘– โˆˆ [๐‘‘] do # unfold measurements, mode-๐‘– fibers are columns, size ๐‘š๐‘ ร— ๐‘Ÿ (๐‘–โˆ’1)๐‘š๐‘‘โˆ’1โˆ’(๐‘–โˆ’1) ๐ป โ† unfold(B๐‘, ๐‘š๐‘ ร— ๐‘Ÿ (๐‘–โˆ’1)๐‘š๐‘‘โˆ’1โˆ’(๐‘–โˆ’1) , mode = ๐‘–) # Undo the mode-๐‘– measurement operator and factorโ€™s action by finding ๐‘ ๐‘ least square solution to ๐‘š๐‘ ร— ๐‘Ÿ over-determined linear system Solve ฮฆ๐‘–๐‘„๐‘–๐ปnew = ๐ป for ๐ป๐‘›๐‘’๐‘ค # reshape the flattened partially solved core into a tensor ร—๐‘š๐‘ ร— ยท ยท ยท ร— ๐‘š๐‘ B๐‘ โ† fold(๐ปnew, ๐‘Ÿ ร— ๐‘Ÿ ยท ยท ยท ร— ๐‘Ÿ (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:125) (cid:125) (cid:123)(cid:122) (cid:124) ๐‘– ๐‘‘โˆ’๐‘– # Each iteration ๐‘š๐‘ โ†’ ๐‘Ÿ in ๐‘–th mode , mode = ๐‘–) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:124) end however, it would be desirable to avoid having to produce the core measurement tensor, and empirically on synthetic data the overall error is not effected. Algorithm 3.8 Recover HOSVD Core from Recycled Leave One Out Measurements input : ๐ต ๐‘— for a fixed ๐‘—. ฮฉ ๐‘—,๐‘– for each ๐‘– โˆˆ [๐‘‘] ๐‘„1, . . . , ๐‘„๐‘‘ โˆˆ R๐‘›ร—๐‘Ÿ computed factor matrices output : H for ๐‘– โˆˆ [๐‘‘] do # unfold measurements, mode-๐‘– fibers are columns, size ๐‘š ร— ๐‘Ÿ (๐‘–โˆ’1)๐‘š๐‘‘โˆ’1โˆ’(๐‘–โˆ’1) ๐ป โ† unfold(B ๐‘— , ๐‘š ร— ๐‘Ÿ (๐‘–โˆ’1)๐‘š๐‘‘โˆ’1โˆ’(๐‘–โˆ’1), mode = ๐‘–) # Undo the mode-๐‘– measurement operator and factorโ€™s action by finding least square solution to ๐‘š ร— ๐‘Ÿ over-determined linear system Solve ฮฉ( ๐‘—,๐‘–)๐‘„๐‘–๐ปnew = ๐ป for ๐ป๐‘›๐‘’๐‘ค # reshape the flattened partially solved core into a tensor ร—๐‘š๐‘ ร— ยท ยท ยท ร— ๐‘š๐‘ B ๐‘— โ† fold(๐ปnew, ๐‘Ÿ ร— ๐‘Ÿ ยท ยท ยท ร— ๐‘Ÿ (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) (cid:123)(cid:122) (cid:125) (cid:124) ๐‘‘โˆ’๐‘– , mode = ๐‘–) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) ๐‘– (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:124) # Each iteration ๐‘š โ†’ ๐‘Ÿ in ๐‘–th mode or ๐‘› โ†’ ๐‘› when ๐‘– = ๐‘— end We are now able to state the variants of the general algorithm. 111 (cid:9) Algorithm 3.9 Recover HOSVD Kronecker One Pass # Obtain measurements for factors with Alg. 3.2 ๐ต1, ๐ต2, . . . , ๐ต๐‘‘ โ† Leave-One-Out Kronecker Sketching((cid:8)ฮฉ(๐‘–, ๐‘—) # Obtain measurement for core with Alg. 3.4 B๐‘ โ† Core Sketching({ฮฆ๐‘–}๐‘–โˆˆ[๐‘‘] , X) # Estimate factor matrices using Alg. 3.5 ๐‘„1, ๐‘„2, . . . , ๐‘„๐‘‘ โ† Recover HOSVD Factors from Leave-One-Out Measurements ((cid:8)ฮฉ(๐‘–,๐‘–) # Estimate core using Alg. 3.8 H โ† Recover HOSVD Core from Measurements({ฮฆ๐‘–}๐‘–โˆˆ[๐‘‘] , B๐‘) output : ห†X = [[H , ๐‘„1, . . . ๐‘„๐‘‘]] , {๐ต๐‘–}๐‘–โˆˆ[๐‘‘] , (๐‘Ÿ, . . . , ๐‘Ÿ)) ๐‘–โˆˆ[๐‘‘] (cid:9) ๐‘–, ๐‘— โˆˆ[๐‘‘] Algorithm 3.10 Recover HOSVD Kronecker Two Pass # Obtain measurements for factors with Alg. 3.2 ๐ต1, ๐ต2, . . . 
, ๐ต๐‘‘ โ† Leave-One-Out Kronecker Sketching((cid:8)ฮฉ(๐‘–, ๐‘—) # Estimate factor matrices using Alg. 3.5 ๐‘„1, ๐‘„2, . . . , ๐‘„๐‘‘ โ† Recover HOSVD Factors from Leave-One-Out Measurements((cid:8)ฮฉ(๐‘–,๐‘–) # Compute core using Alg. 3.6 H โ† Compute HOSVD Core with Second Access({๐‘„๐‘–}๐‘–โˆˆ[๐‘‘] , X) output : ห†X = [[H , ๐‘„1, . . . ๐‘„๐‘‘]] , {๐ต๐‘–}๐‘–โˆˆ[๐‘‘] , (๐‘Ÿ, . . . , ๐‘Ÿ)) ๐‘–โˆˆ[๐‘‘] (cid:9) (cid:9) ๐‘–, ๐‘— โˆˆ[๐‘‘] , X) , X) Algorithm 3.11 Recover HOSVD Khatri-Rao One Pass # Obtain measurements for factors with Alg. 3.3 ๐ต1, ๐ต2, . . . , ๐ต๐‘‘ โ† Leave-One-Out Khatri-Rao Sketching((cid:8)ฮฉ(๐‘–, ๐‘—) # Obtain measurements for core with Alg. 3.4 B๐‘ โ† Core Sketching({ฮฆ๐‘–}๐‘–โˆˆ[๐‘‘] , X) # Estimate factor matrices using Alg. 3.5 ๐‘„1, ๐‘„2, . . . , ๐‘„๐‘‘ โ† Recover HOSVD Factors ((cid:8)ฮฉ(๐‘–,๐‘–) # Estimate core using Alg. 3.8 H โ† Recover HOSVD Core from Measurements({ฮฆ๐‘–}๐‘–โˆˆ[๐‘‘] , B๐‘) output : ห†X = [[H , ๐‘„1, . . . ๐‘„๐‘‘]] , {๐ต๐‘–}๐‘–โˆˆ[๐‘‘] , (๐‘Ÿ, . . . , ๐‘Ÿ)) ๐‘–โˆˆ[๐‘‘] (cid:9) (cid:9) , X) ๐‘–, ๐‘— โˆˆ[๐‘‘] from Leave-One-Out Measurements 112 Algorithm 3.12 Recover HOSVD Khatri-Rao Two Pass # Obtain measurements for factors with Alg. 3.3 ๐ต1, ๐ต2, . . . , ๐ต๐‘‘ โ† Leave-One-Out Khatri-Rao Sketching((cid:8)ฮฉ(๐‘–, ๐‘—) # Estimate factor matrices using Alg. 3.5 ๐‘„1, ๐‘„2, . . . , ๐‘„๐‘‘ โ† Recover HOSVD Factors ((cid:8)ฮฉ(๐‘–,๐‘–) , {๐ต๐‘–}๐‘–โˆˆ[๐‘‘] , (๐‘Ÿ, . . . , ๐‘Ÿ)) # Compute core using Alg. 3.6 H โ† Compute HOSVD Core with Second Access({๐‘„๐‘–}๐‘–โˆˆ[๐‘‘] , X) output : ห†X = [[H , ๐‘„1, . . . ๐‘„๐‘‘]] ๐‘–โˆˆ[๐‘‘] (cid:9) (cid:9) , X) ๐‘–, ๐‘— โˆˆ[๐‘‘] from Leave-One-Out Measurements 113 CHAPTER 4 TENSOR SANDWICH: TENSOR COMPLETION FOR LOW CP-RANK TENSORS VIA ADAPTIVE RANDOM SAMPLING In the proceeding chapters we have described two schemes for the tensor recovery problem. We now will address a method for a different, though related, tensor problem. In tensor completion we consider the problem of how we might complete (or impute, or predict) a tensor from which some entries are missing or unavailable. That is, for a tensor X โˆˆ R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ , we have some set, ฮฉ โŠ‚ [๐‘›1] ร— . . . . . . [๐‘›๐‘‘] of indices which correspond to entries which are sampled, or revealed. Entries with indices outside of the revealed set are then not available and some or all of these missing entries we desire to fill in, and do so in a way which minimizes error from the true tensor. Naturally, if the entries are unrelated to each other, the problem is hopeless, however in the case where the tensor has additional structure such as a low-rank decomposition, then we can use that fact to develop methods which are able to accurately compute (or estimate in the inexact case) given only a small sample of some of the entries. The commonalities with the recovery problem are immediate - the sampling operator can fruitfully be envisioned as linear measurements of the original tensor and from those measurements we wish to recover the tensor, or a good estimate for it. It is often the case that we wish to limit or minimize how many entries are revealed. However, we will quickly leave behind the requirement that this operator act in a modewise manner or with interleaved reshapings as was done in the previous chapters. 
One reason not to insist on an operator conforming to such a scheme is that we will want to discuss sampling schemes which are adaptive, in the sense that after observing some entries, we use computation and information gleaned from those entries to decide where to sample next in order to best complete the tensor. In this case the set Ω is constructed in stages. The degree to which a sampling scheme adapts, and how reasonable it is to assume different levels of control over the sampling pattern, are points we shall discuss.
4.1 Adaptive Tensor Sandwich for Completing Three Mode Tensors
The subject of this chapter, our proposed method for addressing the tensor completion problem, results from combining matrix completion techniques (see, e.g., [8, 40, 10]) applied to a small number of slices (see Definition 1.3) with a modified, noise-robust version of Jennrich's algorithm (see, e.g., [51, Section 3]); an algorithm originally used to prove (constructively) the uniqueness of the CP decomposition when the rank is less than or equal to the smallest side length of the tensor. In the simplest case, as we shall see, this leads to a sampling strategy that densely samples two slices (the bread), and then sparsely samples additional entries throughout the tensor (the filling) for the final completion, hence "Tensor Sandwich". To begin, we will consider three mode tensors with some mild assumptions on the factor matrices of the CP decomposition (see Definition 1.14). Our algorithm then completes an ๐‘› × ๐‘› × ๐‘› tensor with CP-rank ๐‘Ÿ with high probability while using at most O(๐‘›๐‘Ÿ log^2 ๐‘Ÿ) adaptively chosen samples. Empirical experiments further verify that the approach works well in practice, including in the presence of additive noise; see Section 4.3. We will then include details on how the method can be employed for tensors of more than three modes, and how sampling can be modified to be essentially non-adaptive for cases in which a practitioner may not have the level of control needed to conform to the adaptive strategy; see Section 4.5.
Before describing our proposed adaptive method, we make two important observations. First, there is a simple information theoretic lower bound on any completion method, adaptive or otherwise. For example, a ๐‘‘ mode CP-rank ๐‘Ÿ tensor with side lengths ๐‘›, absent any other constraint, has ๐‘Ÿ๐‘‘(๐‘› − 1) + ๐‘Ÿ degrees of freedom, and so no sampling scheme which reveals fewer entries than this can guarantee exact completion. Furthermore, any method which employs randomness should be expected to require even more samples. Although we do not have a proof that our method is strictly optimal in this regard, its sample budget compares favorably to other known methods and scales, up to log factors, with the relevant parameters of the problem.
Second, there is no a priori guarantee that a particular revealed entry of a tensor will be especially informative. For example, consider the rank one tensor T = e1 ◦ e1 ◦ · · · ◦ e1, the ๐‘‘-fold outer product of e1, where e1 is the standard basis vector with a one in position one. All but one of this tensor's ๐‘›^๐‘‘ entries are zero, which is to say that a sampling technique would have to reveal this single non-zero entry in order to complete the tensor; any random sampling pattern would need on the order of the total number of entries of the tensor revealed to succeed in such a case. Fortunately, such "worst" case tensors can be excluded with some natural and useful assumptions.
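To make this second observation concrete, the short sketch below (assuming a uniform-at-random sampling pattern, chosen purely for illustration) constructs the worst case rank one tensor just described and estimates how often a sample budget on the order of ๐‘› log ๐‘› entries happens to reveal its single informative entry.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
budget = int(4 * n * np.log(n))            # a generous O(n log n) sample budget

# T = e1 ∘ e1 ∘ e1 has a single non-zero entry, at index (0, 0, 0)
hits, trials = 0, 1000
for _ in range(trials):
    # draw `budget` entry locations uniformly at random (with replacement for simplicity)
    idx = rng.integers(0, n, size=(budget, d))
    hits += np.any(np.all(idx == 0, axis=1))
print(hits / trials)   # roughly budget / n**3, about 0.006: the informative entry is almost always missed
```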
4.1.1 Statement and Outline of Main Result To proceed with defining the class of tensors for which the method will have a high chance of completing, we will need two additional definitions, motivated by the proceeding discussion of the completion problem. Definition 4.1 (Coherence). The coherence of an ๐‘Ÿ-dimensional subspace ๐‘ˆ โŠ‚ R๐‘› is defined ๐œ‡(๐‘ˆ) := ๐‘› ๐‘Ÿ max ๐‘–โˆˆ[๐‘›] โˆฅP๐‘ˆ ๐’†๐‘– โˆฅ2 2 , where P๐‘ˆ is the orthogonal projection onto ๐‘ˆ, and where ๐’†๐‘– โˆˆ R๐‘› for ๐‘– โˆˆ [๐‘›] denotes the ๐‘–th standard basis vector. From this definition we can see a subspace will have low coherence or be incoherent if, in terms of magnitude and support, its entries are (roughly) equal and spread out among all possible indices. In such a case, it will happen that revealing an entry has a high likelihood of being informative - the concentration of mass will not be constrained to some relatively small support of the total index set, so the chance that a sampling scheme โ€œmissesโ€ informative entries is lower. We will also need the following definition concerning an elaboration on rank of a matrix. Definition 4.2 (Kruskal rank). The Kruskal rank of a matrix is the maximum integer ๐‘Ÿ such that any ๐‘Ÿ columns of the matrix are linearly independent. We observe immediately that Kruskal rank is always bounded by the rank of the matrix. As a simple yet illuminating example, consider the identity matrix in R๐‘›ร—๐‘›, which has rank and Kruskal rank both equal to ๐‘›. Now append a column to this which consists of e1. Clearly the rank of the resulting matrix remains ๐‘Ÿ, but the Kruskal rank is now one, since it is possible to find a subset of 116 two columns among the ๐‘› + 1 columns which are linearly dependent. The class of tensors for which the proposed approach will be guaranteed to succeed will need to satisfy the following assumptions. For positive integers ๐‘›, ๐‘Ÿ, ๐‘  with ๐‘Ÿ, ๐‘  โ‰ค ๐‘› and any ๐œ‡0 โˆˆ [1, ๐‘›/๐‘Ÿ], we define T(๐‘›, ๐‘Ÿ, ๐œ‡0, ๐‘ ) to be the class of all tensors T = (cid:205)๐‘Ÿ (a) the factor matrices ๐ด, ๐ต both have full column rank, ๐‘–=1 a๐‘– โ—ฆ b๐‘– โ—ฆ c๐‘– โˆˆ R๐‘›ร—๐‘›ร—๐‘› for which: (b) the column space of ๐ด has coherence bounded above by ๐œ‡0, and (c) for some ๐‘  โˆˆ [๐‘›], every ๐‘  ร— ๐‘Ÿ submatrix of ๐ถ has Kruskal rank โ‰ฅ 2. We are now able to state our main result of this section. Theorem 4.1. There exists an adaptive random sampling strategy and an associated reconstruction algorithm (see Algorithm 4.2) such that for any ๐›ฟ > 0, after observing at most ๐ถ1๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) entries of a tensor T โˆˆ T(๐‘›, ๐‘Ÿ, ๐œ‡0, ๐‘ ), the algorithm completes T with probability at least 1 โˆ’ ๐‘ ๐›ฟ. Here ๐ถ1 > 0 is absolute constant that is independent of all other quantities. The proof of our sandwich sampling algorithm involves three stages. First, without loss of generality, we pick ๐‘  mode-3 slices and then use a matrix completion algorithm that with high probability recovers these slices and observes at most ๐ถโ€ฒ 1 ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) entries in each slice. Second, if this ๐‘› ร— ๐‘› ร— ๐‘  sub-tensor is correctly completed, we can then use a deterministic variant of Jennrichโ€™s Algorithm1 on the completed sub-tensor to learn the factor matrices ๐ด and ๐ต. 
Third, once we know the factor matrices ๐ด and ๐ต, we can deterministically find ๐‘Ÿ sample locations in each of the ๐‘› mode-3 slices whose values allow a censored least squares problem to solve for the third factor matrix ๐ถ. This three-stage procedure uses at most ๐ถโ€ฒ 1 ๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) samples to complete the ๐‘  initial mode-3 slices in order to learn ๐ด and ๐ต, and then ๐‘›๐‘Ÿ additional samples to learn ๐ถ thereafter, for a total of at most ๐ถโ€ฒ 1 ๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) + ๐‘›๐‘Ÿ โ‰ค ๐ถ1๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) samples. See Figure 4.1 for a schematic illustration of the overall sampling strategy where fibers are sampled through the middle of the sandwich for simplicity. We note that assumptions (a) and (c) are the necessary assumptions for the Jennrichโ€™s step to work with any ๐‘› ร— ๐‘› ร— ๐‘  sub-tensor. Furthermore, if the columns of the factor matrices are 1As discussed in [36] the work of [44] is perhaps a more accurate attribution of the method, however we use the traditional name of Jennrichโ€™s Algorithm in this work. 117 drawn from any continuous distribution, assumption (c) for ๐‘  = 2 holds with probability 1. Finally, something akin to assumption (b) is always required in completion problems, as we have discussed in Section 4.1. Figure 4.1 A schematic depiction of the sampling strategy where ๐‘  = 2 slices have been sampled relatively densely in order to compute ๐ด and ๐ต, and where additional fibers where then sampled elsewhere to use in computing ๐ถ. 4.1.2 Related Work Many prior works on low-rank tensor completion use non-adaptive and uniform sampling [29, 4, 65, 66, 55, 52, 45, 34]. While some of those works [4, 34, 52] can handle CP-ranks up to roughly ๐‘›3/2 instead of ๐‘›, all of them require at least O (๐‘›3/2) samples, even when the rank is ๐‘Ÿ = O (1). Furthermore, [4] shows that completing an ๐‘› ร— ๐‘› ร— ๐‘› rank-๐‘Ÿ tensor from ๐‘›3/2โˆ’๐œ– uniformly 118 random samples is NP-hard by way of showing it is equivalent to the problem of refuting a 3-SAT formula with ๐‘› variables and ๐‘›3/2โˆ’๐œ– clauses. In [39], the authors propose a recursive method for completing CP-rank ๐‘Ÿ โ‰ค ๐‘› tensors using adaptive sampling which, for order-3 tensors, requires O (๐œ‡2 0 ๐‘›๐‘Ÿ 5/2 log2 ๐‘Ÿ) samples. Our algorithm requires a number of samples which has a more favorable dependence on the coherence ๐œ‡0 and rank ๐‘Ÿ. Furthermore, our result only requires coherence assumptions about ๐ด, instead of about ๐ด and ๐ต. These improvements come at the expense of requiring the mild additional assumptions (a) and (c) in our Theorem 4.1 related to Jennrichโ€™s algorithm, however. Additionally, the method of [39], outside the exact case does not output a CP factorization of the tensor, but rather a more complicated and recursive decomposition. The first step in our tensor completion algorithm involves using an adaptive sampling scheme to complete ๐‘  mode-3 slices of the tensor. Our tensor completion algorithm and results are based on the adaptive matrix completion algorithm and results in [40] which, with high probability, uses O (๐œ‡0๐‘›๐‘Ÿ log2 ๐‘Ÿ) samples to complete a rank ๐‘Ÿ matrix. However, our algorithm can be adapted to use other adaptive matrix completion results such as, e.g., [10] with relative ease. In the censored least squares phase of our algorithm, we can sample entire fibers of the tensor as done in Figure 4.1. 
We note that doing so is similar in spirit to the fiber sampling approach of Sรธrensen and De Lathauwer [58]. However, their work focuses on determining algebraic constraints on the factor matrices of a low rank tensor which, when satisfied, allow the tensor to be completed from the sampled fibers. As such, our results cannot be directly compared with [58]. 4.2 Proof of Tensor Sandwich Exact Recovery for Tensors in T(๐‘›, ๐‘Ÿ, ๐œ‡0, ๐‘ ) In this section we follow our three stage proof outline from Section 4.1.1 4.2.1 Completing ๐‘  Mode-3 Slices of T We start by picking any subset of indices ๐‘† โŠ‚ [๐‘›] with |๐‘†| = ๐‘  elements. For each ๐‘˜ โˆˆ ๐‘†, the mode-3 slice T:,:,๐‘˜ โˆˆ R๐‘›ร—๐‘› satisfies T:,:,๐‘˜ = ๐‘Ÿ โˆ‘๏ธ ๐‘–=1 โŸจc๐‘–, e๐‘˜ โŸฉa๐‘–b๐‘‡ ๐‘– , 119 and so, col-span(T:,:,๐‘˜ ) โІ col-span( ๐ด). By assumption, col-span( ๐ด) has coherence bounded by ๐œ‡0. Thus, the assumptions required by Theorem 1 in [40] hold. Therefore, for each slice ๐‘‡:,:,๐‘˜ ๐‘˜ โˆˆ ๐‘†, with probability at least 1 โˆ’ ๐›ฟ, the adaptive sampling procedure in Algorithm 1 uses at most ๐ถโ€ฒ 1 ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) samples for some absolute constant ๐ถโ€ฒ 1 > 0, and completes T:,:,๐‘˜ . By taking a simple union bound over each of the ๐‘  slices ๐‘˜ โˆˆ ๐‘†, we have that with probability at least 1โˆ’๐‘ ๐›ฟ, this strategy will successfully complete all of the ๐‘  slices T:,:,๐‘˜ for ๐‘˜ โˆˆ ๐‘† and use fewer than ๐ถโ€ฒ 1 ๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) samples. Let ฮฉ1 = {(๐‘–, ๐‘—, ๐‘˜)|(๐‘–, ๐‘—) sampled according to [40] for slice ๐‘˜ โˆˆ ๐‘†}, the set of locations of T sampled to complete these ๐‘  slices. 4.2.2 Learning Mode-1 and 2 Factor Matrices via Modified Jennrichโ€™s Algorithm / Simul- taneous Diagonlization Let u, v be random vectors uniformly drawn from the unit sphere S๐‘ โˆ’1. Denote the sub-vector of c with entries indexed in ๐‘† by หœc๐‘– = (c๐‘–)๐‘†, and construct two auxiliary matrices ๐‘‡๐‘ข, ๐‘‡๐‘ฃ using the completed slices by adding up the linear combinations of the completed slices, weighted by the random vectors u, v. In terms of the components, we have: ๐‘‡๐‘ข = ๐‘‡๐‘ฃ = ๐‘Ÿ โˆ‘๏ธ ๐‘–=1 ๐‘Ÿ โˆ‘๏ธ ๐‘–=1 โŸจหœc๐‘–, uโŸฉa๐‘–b๐‘‡ ๐‘– โŸจหœc๐‘–, vโŸฉa๐‘–b๐‘‡ ๐‘– Denote the ๐‘Ÿ ร— ๐‘Ÿ matrix ๐ท๐‘ข which has along its diagonal entries the values โŸจหœc๐‘–, uโŸฉ, and similarly ๐ท๐‘ฃ. Notice the following identity with the product ๐‘‡๐‘ข (๐‘‡๐‘ฃ)โ€  ๐‘‡๐‘ข (๐‘‡๐‘ฃ)โ€  = ๐ด๐ท๐‘ข ๐ต๐‘‡ (cid:16) ๐ด๐ท๐‘ฃ ๐ต๐‘‡ (cid:17) โ€  = ๐ด๐ท๐‘ข ๐ทโˆ’1 ๐‘ฃ ๐ดโ€  Where the matrix ๐ท๐‘ข ๐ทโˆ’1 ๐‘ฃ โŸจหœc๐‘–,vโŸฉ . Clearly the matrix ๐‘‡๐‘ข (๐‘‡๐‘ฃ)โ€  is diagonalizable, and so by computing the eigen-decomposition of ๐‘‡๐‘ข (๐‘‡๐‘ฃ)โ€  we recover the is a diagonal matrix with (๐‘–, ๐‘–)-entry equal to โŸจหœc๐‘–,uโŸฉ columns of ๐ด and the eigenvlaues along the diagonal of ๐ท๐‘ข ๐ทโˆ’1 ๐‘ฃ . We can order the eigenvectors in descending order by magnitude of their corresponding eigenvalue. Note it is here that we employ 120 assumption (c): Since ๐ถ๐‘† (i.e., the rows of ๐ถ indexed by ๐‘†) has ๐‘˜-rank at least 2, the ratios โŸจหœc๐‘–,uโŸฉ โŸจหœc๐‘–,vโŸฉ for each ๐‘– will almost surely be distinct from one another, and thus the ordering of the eigenvalues is unique. 
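The simultaneous diagonalization step just described is short enough to state in code. The sketch below is a minimal NumPy illustration assuming exactly completed, noiseless slices and a factor matrix ๐ด with full column rank; the noise robust variant used in Algorithm 4.2 requires more care. The last line of the function anticipates the matching of the columns of ๐ต via ๐ด†๐‘‡๐‘ข described in the next paragraph (cf. (4.1)).

```python
import numpy as np

def jennrich_from_slices(slices, r, rng):
    """Given s completed frontal slices (each n x n) of a CP-rank-r tensor,
    recover A and a column-rescaled, matched B via simultaneous diagonalization.
    Noiseless slices and distinct eigenvalue ratios are assumed."""
    s = len(slices)
    u = rng.standard_normal(s); u /= np.linalg.norm(u)
    v = rng.standard_normal(s); v /= np.linalg.norm(v)
    Tu = sum(ui * Sk for ui, Sk in zip(u, slices))
    Tv = sum(vi * Sk for vi, Sk in zip(v, slices))
    M = Tu @ np.linalg.pinv(Tv)                 # equals A D_u D_v^{-1} A^dagger here
    vals, vecs = np.linalg.eig(M)
    keep = np.argsort(-np.abs(vals))[:r]        # keep the r dominant eigenpairs
    A_hat = np.real(vecs[:, keep])
    B_hat = (np.linalg.pinv(A_hat) @ Tu).T      # matched, rescaled columns of B
    return A_hat, B_hat

# small synthetic check: n = 30, r = 4, s = 2 slices of a random CP tensor
rng = np.random.default_rng(1)
n, r, s = 30, 4, 2
A, B, C = (rng.standard_normal((n, r)) for _ in range(3))
slices = [sum(C[k, i] * np.outer(A[:, i], B[:, i]) for i in range(r)) for k in range(s)]
A_hat, B_hat = jennrich_from_slices(slices, r, rng)
# each recovered column of A_hat should be parallel to some column of A
cos = np.abs((A / np.linalg.norm(A, axis=0)).T @ (A_hat / np.linalg.norm(A_hat, axis=0)))
print(np.round(cos.max(axis=1), 3))             # approximately [1. 1. 1. 1.]
```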
Let ๐‘ƒ be the permutation matrix that interchanges the columns of ๐ด so that instead of being in ๐ท๐‘ข ๐ทโˆ’1 ๐‘ฃ order, they are in ๐ท๐‘ข order (ordered greatest to least in terms of the magnitude of โŸจหœc๐‘–, uโŸฉ). Now notice that ๐ดโ€ ๐‘‡๐‘ข = ๐ดโ€  ๐ด๐‘ƒ๐ท๐‘ข ๐ต๐‘‡ = ๐‘ƒ๐ท๐‘ข ๐ต๐‘‡ = (๐ต๐ท๐‘ข๐‘ƒ)๐‘‡ . (4.1) This means that the re-scaled columns of ๐ต that are in ๐ท๐‘ข order are now in ๐ท๐‘ข ๐ทโˆ’1 ๐‘ฃ order after we apply the inverse ๐ดโ€  to ๐‘‡๐‘ข. That is, we have found the matching components of ๐ด and ๐ต up to a re-scaling of their outer product! This was achieved by learning the columns of ๐ด from ๐‘‡๐‘ข (๐‘‡๐‘ฃ)โ€  and then, crucially, the matching columns of ๐ต from ๐ดโ€ ๐‘‡๐‘ข. The scaling due to ๐ท๐‘ข will be resolved in the final step of the algorithm, where we solve for the missing third factor (i.e., ๐ต, ๐ถ and ๐ต๐ท, ๐ถ (๐ทโˆ’1) are both valid pairs of factors for any diagonal matrix ๐ท). 4.2.3 Learning the mode-3 factor matrix After obtaining ๐ด and a re-scaled ๐ต in order to find the remaining components, ๐ถ, we will need ฮฉ2, a second set of revealed locations of the entries of T . We will also need the solutions to ๐‘› instances of the following censored least squares problem related to those revealed values ( ๐ด โŠ™ ๐ต)๐พ๐‘˜ c = t๐พ๐‘˜ (4.2) where ๐‘˜ โˆˆ [๐‘›],c = (๐ถ๐‘˜,:)๐‘‡ , t = vec(T:,:,๐‘˜ )๐‘‡ and ๐พ๐‘˜ = {๐‘– + ๐‘›( ๐‘— โˆ’ 1)|(๐‘–, ๐‘—, ๐‘˜) โˆˆ ฮฉ2}. In (4.2), the term t๐พ๐‘˜ denotes the vector of length |๐พ๐‘˜ | which includes only the entries of t which have indices appearing in the set ๐พ๐‘˜ . Similarly ( ๐ด โŠ™ ๐ต)๐พ๐‘˜ is the matrix where we restrict rows of ( ๐ด โŠ™ ๐ต) to only those which have indices appearing in the set. Note ( ๐ด โŠ™ ๐ต) is full rank because ๐ด and ๐ต are assumed to be full rank (see, e.g., [31]). This implies the uncensored system is consistent with a unique solution. The difficulty in the censored 121 case is that unobserved values in a particular column of the right hand side of (4.2) could force the discarding of rows of the matrix ( ๐ด โŠ™ ๐ต) that cause the system to become under-determined. That is, we must ensure that ( ๐ด โŠ™ ๐ต)๐พ๐‘˜ has full column rank for each ๐‘˜. If we do not vary our sampling procedure from frontal slice to frontal slice, then this corresponds to sampling tubes of the original tensor of the form T๐‘–, ๐‘—,: and ๐พ๐‘˜ = ๐พ for all ๐‘˜ โˆˆ [๐‘›]. sampling to accomplish this, consider a ๐‘„๐‘… with column-pivoting factorization of ( ๐ด โŠ™ ๐ต)๐‘‡ , see In order to arrange the algorithm 5.4.1 in [19]). This produces factors and a permutation matrix C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ โˆˆ R๐‘›2ร—๐‘›2 such that ( ๐ด โŠ™ ๐ต)๐‘‡ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ = ๐‘„๐‘…, where (cid:104) ๐‘… = (cid:105) ๐‘…1 ๐‘…2 and ๐‘…1 โˆˆ R๐‘Ÿร—๐‘Ÿ is upper-triangular and non-singular. But this means that the first ๐‘Ÿ columns of C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ select a set of columns of ( ๐ด โŠ™ ๐ต)๐‘‡ which are linearly independent, so for each C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ ๐‘ โˆˆ [๐‘Ÿ] we have that there exists some ๐‘ž โˆˆ [๐‘›2] such that e๐‘ž = C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ column of the identity in R๐‘›2ร—๐‘›2 ๐‘ž ๐‘› โŒ‹. Define then ฮฉ2 all the tuples of the form (๐‘–, ๐‘—, :). That is, we can read off from C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ which ๐‘Ÿ fibers of length ๐‘› to sample in . 
Let ๐‘– = ๐‘ž mod ๐‘›, ๐‘— = โŒŠ , the corresponding :,๐‘ :,๐‘ , order to ensure (4.2) is consistent for each ๐‘˜ โˆˆ [๐‘›]. We have specified at most ๐‘›๐‘Ÿ new sample locations, and thus ฮฉ = ฮฉ1 โˆช ฮฉ2, the set of all samples employed to complete the tensor T is at most |ฮฉ1| + |ฮฉ2| โ‰ค ๐ถโ€ฒ 1 Remark 4.1. Note, C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ is not unique, and indeed we could use different selections corre- ๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) + ๐‘›๐‘Ÿ โ‰ค ๐ถ1๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ). sponding to other valid permutations from frontal slice to frontal slice to vary the sampling pattern when finding ๐ถ. Additionally, including more rows of ( ๐ด โŠ™ ๐ต) beyond the first ๐‘Ÿ as specified by C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ may be numerically advantageous when computing ๐ถ. 4.3 Experiments In this section, we show that tensors sampled according to our overall strategy and completed using Algorithm (4.2) support our theoretical findings. In particular we demonstrate that once sample complexity bounds are satisfied, we can achieve very precise levels of relative error by sampling what is overall a small percentage of the total tensorโ€™s entries. For example, in the rank 20 122 : Algorithm 4.1 Consistency Preserving Fiber Sampler input ๐›พ โ‰ฅ 1, over-sample parameter ๐‘Ÿ rank parameter ๐ด, ๐ต โˆˆ R๐‘›ร—๐‘Ÿ, estimates for factor matrices output : ฮฉ2, set of sample locations Compute ๐‘„๐‘… with column-pivoting, ๐‘„๐‘… = ( ๐ด โŠ™ ๐ต)๐‘‡ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ for ๐‘ โˆˆ [๐›พ๐‘Ÿ] do ) :,๐‘ ๐‘ž โ† nonzero(C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ ๐‘– โ† ๐‘ž mod ๐‘› ๐‘ž ๐‘— โ† โŒŠ ๐‘› โŒ‹ Include (๐‘–, ๐‘—, ๐‘˜) in ฮฉ2 for all ๐‘˜ โˆˆ [๐‘›] end : ๐‘† โІ [๐‘›] slices to complete Algorithm 4.2 Tensor Sandwich input output : ห†๐‘‡, completed tensor # Slice Complete Phase for ๐‘˜ โˆˆ ๐‘† do end # Jennrich Complete Phase Generate random vectors u, v โˆˆ S๐‘ โˆ’1; ๐‘‡๐‘ข โ† (cid:205)๐‘˜โˆˆ๐‘† ๐‘ข๐‘˜๐‘‡::๐‘˜ ๐‘‡๐‘ฃ โ† (cid:205)๐‘˜โˆˆ๐‘† ๐‘ฃ ๐‘˜๐‘‡::๐‘˜ Compute eigen-decomposition ๐ด๐šฒ๐ดโˆ’1 = ๐‘‡๐‘ข (๐‘‡๐‘ข)โ€  ๐ต โ† ๐ดโˆ’1๐‘‡๐‘ข # Censored Least Squares Phase ฮฉ2 โ† fiber sampler( ๐ด, ๐ต, ๐‘Ÿ, ๐›พ) ๐ถ๐‘‡ โ† censored least squares (( ๐ด โŠ™ ๐ต), T๐‘–, ๐‘—,๐‘˜ s.t. (๐‘–, ๐‘—, ๐‘˜) โˆˆ ฮฉ2) ห†T โ† [[ ๐ด, ๐ต, ๐ถ]] Use, e.g., [10], or algorithm 1 in [40] to complete T:,:,๐‘˜ . Algorithm 1 in [40] uses at most ๐ถโ€ฒ 1 ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) adaptively chosen samples. case in Figure 4.2, the median relative error is 0.000168 having sampled 94445 8000000 โ‰ˆ 1.1% of the total entries. Our experiments also verify the linear dependence on rank in the sample complexity bound. Furthermore, in our simulated data and real world data we show empirical evidence the method performs useful completion even in the presence of noise. Additionally we make comparisons to other tensor completion methods. 123 4.3.1 Simulated Data In this section, we show that Tensor Sandwich, as well as its extension to higher orders (see Section 4.5), which we refer to as Tensor Deli in the figures, can complete tensors of simulated data using both adaptive and non-adaptive sampling. Once sample complexity bounds are satisfied, we can achieve precise levels of relative error for completing a tensor by looking at what is overall a small fraction of its total entries. We also demonstrate empirically our method works for four mode tensors. 
In prior works such as [39], results of this kind are given only for two mode tensors; in [45] tensors of three modes. Furthermore we show how using Tensor Sandwich/Deli to initialize a masked-alternating least squares scheme with even a few iterations provides significant improvement in accuracy. We compare our adaptive results to [39] and [45], and show that Tensor Deli can perform well in terms of accuracy and sample complexity while also providing a factorized tensor. In all cases with the simulated data, the tensor is generated by drawing factor matrices of shape R๐‘›ร—๐‘Ÿ with i.i.d standard Gaussian entries and then normalizing the columns. For simulated data, for each mode ๐‘‘, the side-lengths ๐‘› are the same, and thus the tensor has ๐‘›๐‘‘ total entries. We then have the option to weight the components using decaying weights described by the parameter ๐›ผ, i.e. T = ๐‘Ÿ โˆ‘๏ธ โ„“=1 (cid:19) (cid:18) 1 โ„“๐›ผ โ„“ โ—ฆ ยท ยท ยท โ—ฆ ๐’‚ (๐‘‘) ๐’‚ (1) โ„“ (4.3) In each experiment, the median of the errors is taken over ten independent trials. In each Tensor Deli experiment, after selecting frontal slices to complete, within these slices we sample using a budget of ๐‘š = ๐›พ(๐‘›2) samples (adaptively or non-adaptively), where ๐›พ โˆˆ [0, 1] is the proportion of the total ๐‘›2 entries of the slice available for sampling in the matrix completion step. Slices are then completed using the semi-definite programming formulation of the nuclear norm minimization and solved via Douglas-Rachford splitting, see [54]. In this step, convergence is declared once the primal residual, dual residual and duality gap are below 10โˆ’8 for each of the ๐‘  selected slices, or after 10,000 iterations per slice, whichever comes first. We note that the accuracy of this matrix completion step influences numerically what is achievable in terms of overall accuracy for the 124 completed tensor, and that the error for completing these slices is compounded in the subsequent steps (even in the absence of noise). This explains the apparent โ€œleveling offโ€ of the relative error even as sample complexity or signal-to-noise ratios increase. The number of fibers to sample, and which are used to estimate the remaining ๐‘‘ โˆ’ 2 factor matrices is given by ๐›ฟ๐‘Ÿ, where ๐›ฟ is an oversampling factor of size at least one. In Figure 4.2, we see a phase transition of our method. For a given rank, prior to a threshold, the error is dominated by the inability to accurately complete the ๐‘  slices. Once sufficient samples are obtained within these slices, completion reliably succeeds and accuracy approaches the limiting numerical accuracy inherited from the initial slice completion step. We empirically see in Figure 4.2 the linear relationship of the sample complexity on the rank of the tensor - it takes approximately twice as many samples to reach this phase transition in terms of relative error when we compare the rank 5 case with the rank 10 case. In this figure, ๐‘‘ = 3, ๐‘› = 200, ๐›ผ = 2, ๐‘  = 2, ๐›ฟ = 4, the parameter ๐›พ is varied within in the interval (0, 1) and is driving the overall sample complexity. Figure 4.2 Median relative error (log-scaled) of completed tensors of varying rank as sample complexity increases without noise. 125 In Figure 4.3 we repeat a similar experiment, but for ๐‘‘ = 4, ๐‘› = 100, ๐›ผ = 2 and for three different ranks ๐‘Ÿ = 5, 10, 15. These tensors are 12.5 times larger in terms of total number of entries than in the three mode case studied in Figure 4.2. 
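For reference, the synthetic tensors used in these experiments can be generated in a few lines. The sketch below follows (4.3), with unit-norm Gaussian factor columns and decay parameter α; the function name and interface are ours and are not taken from the experimental code.

```python
import numpy as np

def synthetic_cp_tensor(n, d, r, alpha, rng):
    """Random CP tensor with unit-norm Gaussian factor columns and 1/l**alpha component weights, per (4.3)."""
    factors = []
    for _ in range(d):
        F = rng.standard_normal((n, r))
        factors.append(F / np.linalg.norm(F, axis=0))   # normalize the columns
    weights = 1.0 / (np.arange(1, r + 1) ** alpha)
    T = np.zeros((n,) * d)
    for l in range(r):
        comp = weights[l]
        for F in factors:                                # weighted rank-one outer product
            comp = np.multiply.outer(comp, F[:, l])
        T += comp
    return T, factors, weights

T, factors, w = synthetic_cp_tensor(n=50, d=3, r=10, alpha=2, rng=np.random.default_rng(0))
print(T.shape, w[:3])    # (50, 50, 50) [1.  0.25  0.111...]
```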
Our method is able to reach relative errors at or lower than 1% using less than a tenth of one percent of the total entries. In plot (a) of Figure 4.3, we show the median relative error for Tensor Deli and the relative error after ten iteration of masked-ALS is performed on the result, i.e. the second set of experiments depend on the first in that ALS is initialized using the Tensor Deli estimates of the CP factorization, and using the same revealed entries, see [61]. As is well documented now, ALS can be quite sensitive to initialization and prone to โ€œswampsโ€ where accuracy stagnates over a large number of iterations. This figure shows however that Tensor Deli can provide high quality initialization, and with only a small number of iterations we can improve the accuracy of the estimate of the completed tensor by an order of magnitude without even needing to reveal more entries. Empirically we have observed ALS alone is unable to achieve comparable relative errors at this sample complexity for any observed amount of iterations, see Figure 4.5. In plot (b) of Figure 4.3 we show the errors for adaptive vs non-adaptive strategy for each rank. In the non-adaptive case, the ๐‘ -slices are sampled more heavily, but in a uniform fashion. Furthermore, no ๐‘„๐‘…-factorization is performed on the first two estimated factor matrices, and instead entries are sampled uniformly throughout the rest of the tensor for the purposes of estimating the remaining factors. Not surprisingly, in the case of random low-rank tensors, the density of the sampling in the slices is driving the accuracy, and adapting the sampling pattern appears to have little effect - the slices themselves are already highly incoherent, so there is little to gain by adapting. As we discuss in the next section, on real world data with more structure, adaptive patterns do have a more measurable effect. In Figure 4.4, we add mean-zero i.i.d. Gaussian noise to each entry in our tensor and perform the same completion strategy on the noisy tensor as described earlier in this section, ๐‘‘ = 3, ๐‘› = 200, ๐›ผ = 2, ๐‘  = 2, ๐›ฟ = 4. For each trial, the noise tensor N is scaled to the appropriate signal-to-noise ratio, i.e. SNR = 10 log10 โˆฅT โˆฅ โˆฅN โˆฅ . The sample complexity proportions are fixed with respect to slice 126 Figure 4.3 Median relative error (log-scaled) of completed four mode tensors of varying rank as sample complexity increases without noise. Each value is the median of ten trials, ๐‘  = 2, ๐›พ โˆˆ [0.1, 0.8], ๐›ฟ = 8. In plot (a) we compare the relative errors of Tensor Deli before and after ten iterations of masked-ALS. In plot (b) we show error for both adaptive and non-adaptive sampling schemes. completion versus fibers sampled to estimate ๐ถ but do scale linearly according to rank in order to facilitate comparison. In all cases the total number of sampled entries is between 0.27% and 1.10% of the total number of entries, depending on rank. As alluded to earlier, we observed during these experiments that using masked alternating least squares alone to complete our synthetic tensors (using either the Tensor Sandwich sample pattern, or an equivalent proportion of uniformly drawn samples) achieved at best relative errors of about 6%. However, the estimate of the tensor as output by Algorithm 4.2, and its sample pattern can be used as an initialization to an iterative scheme like alternating least squares to further improve the accuracy of the completed tensor. 
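A minimal sketch of one sweep of the masked ALS refinement used here is given below, assuming a three mode CP model: each row of each factor is re-fit by least squares over the observed entries that touch it. In the experiments, the factors are initialized from the Tensor Sandwich estimate and the mask is the indicator of the already revealed set Ω, so the refinement requires no additional samples; this is a sketch of the idea rather than the exact routine used.

```python
import numpy as np

def masked_als_sweep(T, mask, factors):
    """One sweep of masked ALS for a 3-mode CP model: each factor row is re-fit
    by least squares using only the observed entries in the corresponding slice."""
    obs = np.argwhere(mask)                                    # observed index triples
    for mode in range(3):
        F = factors[mode]
        others = [factors[m] for m in range(3) if m != mode]
        for i in range(F.shape[0]):
            sel = obs[obs[:, mode] == i]                       # observed entries in slice i of this mode
            if len(sel) == 0:
                continue
            rest = sel[:, [m for m in range(3) if m != mode]]  # indices into the other two modes
            design = others[0][rest[:, 0]] * others[1][rest[:, 1]]   # rows of the Khatri-Rao design
            rhs = T[tuple(sel.T)]
            F[i], *_ = np.linalg.lstsq(design, rhs, rcond=None)
    return factors

# usage sketch: factors = [A_hat, B_hat, C_hat] from Algorithm 4.2, mask = indicator of Omega,
# then repeat masked_als_sweep(T_observed, mask, factors) for a small number of sweeps
```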
In Figure 4.5, we show three sets of related experiments: the relative error at various ranks achieved by alternating least squares alone (ALS), Tensor Sandwich alone (TS), or alternating least squares initialized by Tensor Sandwich (TS + ALS). In each trial, 100 iterations of alternating least squares are used, weights for components are set to decay according to (4.3), and each of the three variations for completing the tensor is used on the same data. We notice that at this level of sample complexity, for either a sample mask chosen uniformly at random or one chosen according to the adaptive scheme described in this work, ALS alone is limited. Combined with Tensor Sandwich, however, we see roughly an order of magnitude improvement in relative error by performing a hundred iterations of ALS on the TS estimate. This shows Tensor Sandwich can be useful as an initialization strategy for other completion methods, either to save on run time by decreasing the total number of iterations needed, or to improve the accuracy of the final estimate. Here $d = 3$, $n = 200$, $\alpha = 2$, $s = 2$, $\delta = 4$. In all cases the total number of sampled entries is between 0.27% and 1.10% of the total number of entries, with linear scaling in rank to facilitate comparison.

Figure 4.4 Median relative error (log-scaled) of completed tensors of varying rank for different levels of white Gaussian noise. Sample complexity is less than 1.1% for all trials, scaled linearly by rank.

In the next pair of experiments, we compare one of the existing (non-adaptive) completion methods discussed in Subsection 4.1.2 to Tensor Sandwich. We use the same overall sample budget as for Tensor Sandwich, but instead use the algorithm in [51], which we refer to as Tensor Complete (TC). The range of total values sampled is the same as in Figure 4.2. Implementation details make direct comparisons difficult. However, the empirical findings summarized in Figure 4.6 suggest that Tensor Sandwich can produce better estimates for high-rank tensors than Tensor Complete when sample budgets are limited.

Figure 4.5 Median relative error (log-scaled) of completed tensors of varying rank for different levels of white Gaussian noise. Sample complexity is less than 1.1% for all trials, scaled linearly by rank. Tensors are completed using 100 iterations of alternating least squares (ALS), or Tensor Sandwich (TS), or 100 iterations of ALS initialized by Tensor Sandwich (TS+ALS).
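For reference, the noise scaling used for the experiments in Figures 4.4 and 4.5 and the relative error reported throughout can be sketched as follows. This is a minimal NumPy sketch using the SNR convention stated above; the helper names are ours, and this is not the experiment code itself.

```python
import numpy as np

def add_noise_at_snr(T, snr_db, seed=None):
    """Scale an i.i.d. Gaussian noise tensor N so that 10*log10(||T|| / ||N||)
    equals the requested SNR (the convention stated above), and return T + N."""
    rng = np.random.default_rng(seed)
    N = rng.standard_normal(T.shape)
    target_ratio = 10.0 ** (snr_db / 10.0)          # desired ||T|| / ||N||
    N *= np.linalg.norm(T) / (target_ratio * np.linalg.norm(N))
    return T + N

def relative_error(T_hat, T):
    """Relative Frobenius error of a completed tensor against the ground truth."""
    return np.linalg.norm(T_hat - T) / np.linalg.norm(T)
```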
In Figure 4.7 we show a comparison of one of the few adaptive methods we identified in the area, proposed in [39] (denoted KS in the figure), and adaptive Tensor Sandwich for $d = 3$, $n = 40$, $\alpha = 1$ with ten iterations of post-ALS and for three different ranks $r = 4, 6, 8$. In plot (a) the horizontal axis shows, for each method, the parameter that drives its overall sample complexity. For KS, the sample complexity is largely driven by the proportion of entries the method is able to sample in a given fiber to test whether that fiber should be added to the basis. For Tensor Deli, this is $\gamma$, the proportion of entries in a single face the method is able to sample when completing that slice. In the figure we note that Tensor Deli exhibits a phase transition for each rank: beyond a certain threshold value of $\gamma$ the method achieves much better relative error. That this threshold changes in an approximately linear fashion with respect to rank supports our theoretical conclusions. Plot (b) in Figure 4.7 changes the horizontal axis to the total percentage of the tensor sampled, and we can now plainly observe that, with an overall sample budget smaller by a factor of at least four, Tensor Deli is able to complete the data more accurately than KS. It is also the case that Tensor Deli outputs an estimate that is truly a CP tensor of the chosen rank truncation, whereas KS in general will not output a tensor that has a CP decomposition of the selected rank, but instead a more complicated recursive decomposition. CP fitting can be conducted on the KS estimate, for example using iterations of ALS, but this in general will reduce the accuracy of the estimate.

Figure 4.6 Median relative error (log-scaled) of completed tensors of varying rank for different total numbers of entries revealed. Tensors are completed using Tensor Sandwich (TS), or Tensor Complete as described in [51], without noise.

Figure 4.7 Median relative error (log-scaled) of completed three-mode tensors of varying rank without noise for our proposed Tensor Deli with adaptive sampling and the adaptive algorithm proposed in [39]. Each value is the median of ten trials; for Tensor Deli $s = 2$, $\gamma \in [0.2, 0.8]$, $\delta = 8$, and for KS the proportion of entries that can be sampled for a fiber ranges over $[0.2, 0.8]$. In plot (a) we show the relative error against the face or fiber sampling parameter for each of the methods.

4.3.2 Applications

We also apply Tensor Sandwich to datasets in the following application areas: chemometrics and hyperspectral imaging. The chemometrics dataset is as used in [7]. It consists of fluorescence measured from known analytes, and the data is intended for calibration purposes. There are a total of 405 fluorophores of six different types. For each sample an Excitation Emission Matrix (EEM) was measured, i.e., fluorescence intensity was collected for different set levels of excitation wavelength and emission wavelength. The EEM of an unknown sample can be used for its identification and for studying other properties of interest for a given analyte. In this dataset there are 136 emission wavelengths and 19 excitation wavelengths. In the original tensor there are five replicates per sample; however, in our experiment we discarded all but the first set of replicates and so formed a three-mode tensor of size $405 \times 136 \times 19$. We compare the completed tensor for five different choices of rank, $r = 11, 13, 15, 17, 19$. We set $s = 4$, and thus complete four frontal slices. We fix the frontal slices where the third mode is at indices 5, 9, 13, 17 across all experiments in order to facilitate better comparison. In Figure 4.8, we show the recovered tensor for a representative frontal slice at each of the different ranks. Below this in the same figure, we have fixed a representative lateral slice, which corresponds to a completed EEM for a particular sample of an analyte at the different ranks. In all cases, the total number of entries revealed is between 10 and 11% of the tensor's entries; we performed adaptive sampling and ten iterations of ALS on the resulting CP estimate, with sampling parameters $\gamma$ and $\delta$ equal to 0.5 and 10 respectively. Furthermore, in Figure 4.9 we show evidence that, in practical applications unlike the simulated data, there is quite clearly a benefit to sampling adaptively in terms of the trade-off between accuracy and sample complexity.
In this figure, for a fixed rank of 15 and a fixed sample complexity using four slices and sampling parameters of 0.5 and 10 for $\gamma$ and $\delta$, using the same lateral slice as before in Figure 4.8, we show the completed EEM and the corresponding relative error for this slice using adaptive Tensor Deli (Algorithm 4.3), adaptive Tensor Deli with ten iterations of ALS, and non-adaptive Tensor Deli. In this case, the additional ALS iterations show only modest improvement to the overall estimates, whereas the adaptive scheme has a much larger effect on accuracy. This is likely due to the fact that the data's coherence and other parameters related to the problem are far from ideal, especially when compared to the simulated case from earlier.

Figure 4.8 Slices of a completed tensor for fluorescence data. The top row shows the original frontal slice $\mathcal{X}_{:,:,9}$ and the same slice completed at ranks 11, 13, 15, 17, 19 with the corresponding relative error. The bottom row shows a lateral slice $\mathcal{X}_{149,:,:}$, which corresponds to the EEM for analyte number 149 in the dataset.

Figure 4.9 The lateral slice corresponding to the EEM for analyte number 149 in the fluorescence data is compared with the corresponding slice completed by adaptive Tensor Deli, adaptive Tensor Deli with extra iterations of ALS, and non-adaptive Tensor Deli. The relative error for the slice is listed for each method. The rank is fixed at 15 for each, and the sample complexity is fixed at about 10% of the entries using sampling parameters of 0.5 and 10 for $\gamma$ and $\delta$ respectively.

The next application is hyperspectral imaging data. The data is as found in [6]. The hyperspectral sensor data was acquired in June 1992 and consists of aerial images of an approximately two mile by two mile Purdue University Agronomy farm and was originally intended for soils research. The data consists of 200 images at different wavelengths that are 145 by 145 pixels each. We thus form a three-mode, $145 \times 145 \times 200$ data tensor. We complete the tensor using $s = 9$, where we have fixed these frontal slices at indices 20, 40, 60, 80, 100, 120, 140, 160, 180 across all experiments to facilitate comparison. As shown in Figure 4.10, we have completed the tensor for ranks $r = 20, 30, 40, 50, 60, 80$ and displayed a fixed, representative frontal slice at index 48. The first row consists of data completed using Tensor Deli with adaptive sampling, with parameters $\gamma = 0.7$ and $\delta = 10$, for a total sample budget that is about 4.5% of the total entries. The second row then performs ten iterations of masked-ALS using the initialization of the top row and the same revealed entries. The last row consists of ten iterations of masked-ALS initialized using the SVD of the flattenings, e.g., see [61].

Empirically, we observe that the hyperspectral image dataset is imperfectly approximated by a low-rank CP decomposition to begin with. In Figure 4.10, we see that Tensor Deli alone performs, in terms of global relative error, certainly no better than masked-ALS alone on this data. However, it provides a superior initialization to ALS, as we can see in the second row. Moreover, we observe a qualitative difference in the completed slices and their errors. In the ALS-alone case, the error appears to be achieved by a sort of local averaging of pixel intensities, whereas Tensor Deli captures contrasts and local features more distinctly, and its errors are largest on rows and columns it "misses" or imperfectly completes. This can be seen by looking at a heat map of the error slice by slice. Looking at the middle row, we can see there is a distinct advantage to combining these two types of methods to achieve the best completion for this dataset.
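For the SVD-initialized baseline in the last row of Figure 4.10, one common reading of initializing masked-ALS from the SVD of the flattenings (cf. [61]) is to take the leading left singular vectors of each mode unfolding of the zero-filled observed tensor. The sketch below is a hypothetical variant of that idea (NumPy, with a helper name of our own choosing), not necessarily the exact initialization used here.

```python
import numpy as np

def svd_init_cp_factors(T_observed, r):
    """Initialize CP factor matrices from the mode unfoldings of an (already
    zero-filled) observed tensor: factor k is taken to be the top-r left
    singular vectors of the mode-k flattening."""
    factors = []
    for mode in range(T_observed.ndim):
        unfolding = np.moveaxis(T_observed, mode, 0).reshape(T_observed.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :r])
    return factors
```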
4.4 Discussion

We have presented an adaptive sampling approach for low CP-rank tensor completion which completes a CP-rank $r$ tensor of size $n \times n \times n$ using $\mathcal{O}(nr \log^2 r)$ samples with high probability. Our method significantly improves on the tensor completion result in [39] while only making a mild additional assumption on the third factor matrix. We also provided numerical experiments to demonstrate that a version of our tensor completion method is empirically robust to noise, works as a high quality initialization to ALS, and can run efficiently on tensors of more than three modes, in the regime of hundreds of millions of entries.

Figure 4.10 Slices of a completed tensor for hyperspectral data. The first column shows the original frontal slice $\mathcal{X}_{:,:,48}$ and the same slice completed at ranks 20, 30, 40, 50, 60, 80 with the corresponding relative error below, for three different methods. The top row is adaptive Tensor Deli with sampling parameters $\gamma = 0.7$, $\delta = 10$; the middle row shows the slice after ten iterations of ALS using the Deli initialization; and the last row shows SVD-initialized masked-ALS.

A non-adaptive version of our tensor completion algorithm is also possible, as suggested in Section 4.5. Furthermore, it is possible to extend our tensor completion method to higher-order tensors, as we outline there. We show the result for the case where the factor matrices have no zero entries, as this illuminates very clearly which system of equations must be consistent to ensure exact recovery of the factor matrix; however, it is possible to remove this assumption and instead deal with the case of zeros in the factor matrices using (a few) more samples.

4.5 Extending to Higher Order Tensors

The purpose of this section is to detail how the proposed method can be extended to tensors of more than three modes. Within the proof we also see how the method can be made non-adaptive, by increasing the sample complexity to meet the necessary bounds for the matrix completion and censored least squares solves with high probability.

Algorithm 4.3 Adaptive Tensor Deli for Order-$d$ Tensors with No Zeros in $A^{(3)}, \ldots, A^{(d)}$
input: $S \subseteq [n_3]$, slices to complete
output: $\hat{\mathcal{T}}$, the completed tensor
# Slice Complete Phase
for $k \in S$ do
    Use, e.g., [10] or Algorithm 1 in [40] to complete $\mathcal{T}_{:,:,k,1,\ldots,1}$. Algorithm 1 in [40] uses at most $C_1' \mu_0 \max(n_1, n_2)\, r \log^2(r^2/\delta)$ adaptively chosen samples.
end
# Jennrich Complete Phase
Generate random vectors $u, v \in \mathbb{S}^{s-1}$
$T_u \leftarrow \sum_{k \in S} u_k \mathcal{T}_{:,:,k,1,\ldots,1}$, $\quad T_v \leftarrow \sum_{k \in S} v_k \mathcal{T}_{:,:,k,1,\ldots,1}$
Compute the eigendecomposition $A^{(1)} \boldsymbol{\Lambda} (A^{(1)})^{-1} = T_u (T_v)^{\dagger}$
$A^{(2)} \leftarrow (A^{(1)})^{-1} T_u$
# Censored Least Squares Phase (learn modes 3 through $d$)
for $k = 3, 4, \ldots, d$ do
    $\Omega_{k-1} \leftarrow$ fiber sampler$(A^{(1)}, A^{(2)}, r, \gamma)$
    $(\tilde{A}^{(k)})^T \leftarrow$ censored least squares$\big(A^{(1)} \odot A^{(2)},\ \mathcal{T}_{i_1,i_2,1,\ldots,1,i_k,1,\ldots,1}$ s.t. $(i_1, i_2, 1, \ldots, 1, i_k, 1, \ldots, 1) \in \Omega_{k-1}\big)$
    $A^{(k)} \leftarrow \tilde{A}^{(k)}$ with columns normalized to unit length
end
$\alpha \leftarrow$ censored least squares$\big(A^{(1)} \odot A^{(2)} \odot \cdots \odot A^{(d)},\ \mathrm{vec}(\mathcal{T}_{\mathbf{i}})$ s.t. $\mathbf{i} \in \bigcup_{k=1}^{d-1} \Omega_k\big)$
$\hat{\mathcal{T}} \leftarrow [\![\, \alpha;\ A^{(1)}, \ldots, A^{(d)} \,]\!]$
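As an illustration of the Jennrich Complete Phase above, the following minimal NumPy sketch recovers the first two factors from the $s$ completed slices. The function name, the selection of the $r$ largest-magnitude eigenvalues, and the real-part truncation are our own simplifications, rather than the noise-robust variant analyzed earlier in the chapter.

```python
import numpy as np

def jennrich_factors(slices, r, seed=None):
    """Recover the first two CP factors from s completed frontal slices via a
    Jennrich-style step.  slices: array of shape (s, n1, n2) holding the
    completed slices T[:, :, k] for k in S."""
    rng = np.random.default_rng(seed)
    s = slices.shape[0]
    u = rng.standard_normal(s); u /= np.linalg.norm(u)
    v = rng.standard_normal(s); v /= np.linalg.norm(v)
    Tu = np.tensordot(u, slices, axes=1)   # sum_k u_k T[:, :, k]
    Tv = np.tensordot(v, slices, axes=1)
    # Eigenvectors of Tu (Tv)^+ give A^(1) up to scaling and permutation.
    eigvals, eigvecs = np.linalg.eig(Tu @ np.linalg.pinv(Tv))
    # Keep the r eigenvectors with largest-magnitude eigenvalues.
    idx = np.argsort(-np.abs(eigvals))[:r]
    A1 = np.real(eigvecs[:, idx])
    # Second factor (up to column scaling) from a least-squares solve.
    A2 = (np.linalg.pinv(A1) @ Tu).T
    return A1, A2
```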
Theorem 4.2. Suppose that the factor matrices of the CP rank-$r$ tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$ satisfy the following assumptions:
• $\mathrm{rank}(A^{(1)}) = \mathrm{rank}(A^{(2)}) = r$,
• $\mu(A^{(1)}) \leq \mu_0$,
• every $s \times r$ submatrix of $A^{(3)}$ has Kruskal rank $\geq 2$,
• for $k = 3, \ldots, d$, $A^{(k)}$ has no entries which are 0.
Then, with probability at least $1 - s\delta$, Algorithm 4.3 completes $\mathcal{T}$ and uses at most

$$\left( s\, C \mu_0 n r \log^2(r^2/\delta) \right) + r \sum_{k=3}^{d} n_k$$

samples, where $n = \max(n_1, n_2)$.

Proof. We can prove this result by completing well chosen 3-mode sub-tensors. Consider first the 3-mode sub-tensor $\mathcal{T}_{i_1, i_2, i_3, \mathbf{i}}$ for $i_j \in [n_j]$, where $\mathbf{i}$ is the index vector of all ones of length $d - 3$, i.e., $[1]^{d-3}$. By our assumptions and Theorem 4.1, using the corresponding set of entries previously described in the proof of Theorem 4.1, only now with the modes beyond three fixed at index 1, we can define the set $\Omega_1 \subset [n_1] \times [n_2] \times S$, the indices needed to complete the $s$ selected (frontal) slices for matrix completion, where $S \subset [n_3]$ is the set of selected slice indices; these slices are used to find the factor matrices $A^{(1)}$ and $A^{(2)}$. We also define $\Omega_2 \subset [n_1] \times [n_2] \times [n_3] \times [1] \times \cdots \times [1]$, consisting of indices $(i_1, i_2, i_3, 1, \ldots, 1)$ such that $(i_1, i_2)$ is selected from the QR decomposition of $(A^{(1)} \odot A^{(2)})^T$ and $i_3$ ranges over all of $[n_3]$; this is the set of fibers needed. With these we can complete the 3-mode sub-tensor and obtain the factor matrix $\tilde{A}^{(3)}$. That is, using the same procedure as in the three-mode case, we have solved the system of equations for $\{\tilde{A}^{(3)}_{i_3,\ell}\}_{\ell=1}^{r}$,

$$\sum_{\ell=1}^{r} \left( A^{(1)}_{i_1,\ell} A^{(2)}_{i_2,\ell} A^{(3)}_{i_3,\ell} \right) \left( A^{(4)}_{1,\ell} \cdots A^{(d)}_{1,\ell} \right) = \sum_{\ell=1}^{r} A^{(1)}_{i_1,\ell} A^{(2)}_{i_2,\ell} \tilde{A}^{(3)}_{i_3,\ell} = \mathcal{T}_{i_1, i_2, i_3, 1, 1, \ldots, 1},$$

where $\tilde{A}^{(3)}_{i_3,\ell} = A^{(3)}_{i_3,\ell} A^{(4)}_{1,\ell} \cdots A^{(d)}_{1,\ell}$, i.e., the scaled version of $A^{(3)}$. Importantly, because the factor matrices are assumed to be entirely non-zero, $A^{(4)}_{1,\ell}$ and so on are all non-zero, and so we are guaranteed a non-zero, and thus informative, value for $A^{(3)}$ in all of the $r$ selected equations.

We remark that this immediately suggests how we can relax the last assumption concerning the factor matrices for modes 4 and higher: should the factor matrices in fact have zeros, we can with high probability obtain $r$ informative equations by sampling more. We can now permute the indices and solve a similar looking system for the similarly defined $\tilde{A}^{(4)}$, with $(i_1, i_2)$ the same as before and $i_4 \in [n_4]$, so that the indices $(i_1, i_2, 1, i_4, 1, \ldots, 1) \in \Omega_3$ correspond to $r$ mode-4 fibers of the tensor. We have the system

$$\sum_{\ell=1}^{r} \left( A^{(1)}_{i_1,\ell} A^{(2)}_{i_2,\ell} A^{(4)}_{i_4,\ell} \right) \left( A^{(3)}_{1,\ell} A^{(5)}_{1,\ell} \cdots A^{(d)}_{1,\ell} \right) = \sum_{\ell=1}^{r} A^{(1)}_{i_1,\ell} A^{(2)}_{i_2,\ell} \tilde{A}^{(4)}_{i_4,\ell} = \mathcal{T}_{i_1, i_2, 1, i_4, 1, \ldots, 1}.$$

We repeat this procedure for all remaining factor matrices. A final solve rectifies the scaling indeterminacy and obtains the weights of the rank-1 components as the vector $\alpha \in \mathbb{R}^r$; i.e., we solve the censored least squares problem

$$\left( A^{(1)} \odot A^{(2)} \odot \tilde{A}^{(3)} \odot \cdots \odot \tilde{A}^{(d)} \right) \alpha = \mathrm{vec}(\mathcal{T})_{\Omega}, \qquad \Omega = \bigcup_{k=1}^{d-1} \Omega_k.$$

To show the sample complexity bound, we need to bound $|\Omega|$. We have $r$ fibers in each of $d - 2$ modes indexed by $(i_1, i_2, 1, \ldots, i_k, \ldots, 1)$, for each $k = 3, 4, \ldots, d$ and $i_k \in [n_k]$, and these indices are in $\Omega_{k-1}$; i.e., $(d - 2) r$ fibers of length $n_k$, for a total of $r \sum_{k=3}^{d} n_k$ revealed indices from fiber sampling once we have union bounded over all the sets $\Omega_k$. This is in addition to (but again potentially intersecting with) the revealed entries used to complete the $s$ slices, i.e., the elements of the set $\Omega_1$. A final union bound yields the desired sample complexity and failure probability.
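The censored least squares solves appearing in Algorithm 4.3 and in the proof above admit a compact sketch. This is a minimal NumPy illustration under the assumption that the $r$ fiber locations $(i_1, i_2)$ have already been chosen; the function name and bookkeeping are ours, not the dissertation's implementation.

```python
import numpy as np

def censored_least_squares_fibers(A1, A2, fiber_values, pairs):
    """Solve for the (scaled) mode-k factor from r sampled fibers.
    pairs: list of (i1, i2) index pairs chosen for the mode-k fibers;
    fiber_values: array of shape (len(pairs), n_k) holding
    T[i1, i2, 1, ..., i_k, ..., 1] along each sampled fiber.
    Returns an n_k x r estimate of the scaled factor tilde{A}^{(k)}."""
    # Design rows are Hadamard products A1[i1, :] * A2[i2, :], i.e. the
    # corresponding rows of the Khatri-Rao product of A1 and A2.
    design = np.stack([A1[i1, :] * A2[i2, :] for (i1, i2) in pairs])  # (m, r)
    # One joint least-squares solve: design @ X = fiber_values, where X has
    # shape (r, n_k); its transpose is the scaled factor estimate.
    X, *_ = np.linalg.lstsq(design, fiber_values, rcond=None)
    return X.T
```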
BIBLIOGRAPHY

[1] Thomas D. Ahle and Jakob B. T. Knudsen. Almost optimal tensor sketch, 2019.

[2] Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 63:161–182, 2006.

[3] Stefan Bamberger, Felix Krahmer, and Rachel Ward. Johnson–Lindenstrauss embeddings with Kronecker structure. SIAM Journal on Matrix Analysis and Applications, 43(4):1806–1850, 2022.

[4] Boaz Barak and Ankur Moitra. Noisy tensor completion via the sum-of-squares hierarchy. In Conference on Learning Theory, pages 417–445. PMLR, 2016.

[5] Richard G. Baraniuk, Mark A. Davenport, Ronald A. DeVore, and Michael B. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 2007.

[6] Marion F. Baumgardner, Larry L. Biehl, and David A. Landgrebe. 220 band AVIRIS hyperspectral image data set: June 12, 1992 Indian Pine Test Site 3, Sep 2015.

[7] Rasmus Bro, Åsmund Rinnan, and Nicolaas Faber. Standard error of prediction for multilinear PLS. 2. Practical implementation in fluorescence spectroscopy. Chemometrics and Intelligent Laboratory Systems, 75:69–76, 2005.

[8] Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.

[9] Hongyang Chen, Fauzia Ahmad, Sergiy Vorobyov, and Fatih Porikli. Tensor decompositions in wireless communications and MIMO radar. IEEE Journal of Selected Topics in Signal Processing, 15(3):438–453, 2021.

[10] Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, and Rachel Ward. Completing any low-rank matrix, provably. Journal of Machine Learning Research, 16(94):2999–3034, 2015.

[11] Agniva Chowdhury, Jiasen Yang, and Petros Drineas. Structural conditions for projection-cost preservation via randomized matrix multiplication. Linear Algebra and its Applications, 573:144–165, 2019.

[12] Andrzej Cichocki. Tensor decompositions: new concepts in brain data analysis? Journal of the Society of Instrument and Control Engineers, 50(7):507–516, 2011.

[13] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 163–172, 2015.

[14] Mark A. Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.

[15] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.

[16] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. On the best rank-1 and rank-$(r_1, r_2, \ldots, r_n)$ approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342, 2000.
, ๐‘Ÿ๐‘›) approximation of higher-order tensors. SIAM journal on Matrix Analysis and Applications, 21(4):1324โ€“1342, 2000. [17] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca, and projective clustering. SIAM Journal on Comput- ing, 49(3):601โ€“657, 2020. [18] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Springer New York, New York, NY, 2013. [19] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996. [20] Rachel Grotheer, Shuang Li, Anna Ma, Deanna Needell, and Jing Qin. Iterative hard thresh- olding for low CP-rank tensor models. arXiv preprint arXiv:1908.08479, 2019. [21] Rachel Grotheer, Anna Ma, Deanna Needell, Shuang Li, and Jing Qin. Stochastic iterative In Proc. Information Theory and hard thresholding for low Tucker rank tensor recovery. Applications, 2020. [22] Wolfgang Hackbusch. Tensor Spaces and Numerical Tensor Calculus. [electronic resource]. Springer Series in Computational Mathematics: 56. Springer International Publishing, 2019. [23] Cullen Haselby, Mark Iwen, Deanna Needell, Michael Perlmutter, and Elizaveta Rebrova. Modewise operators, the tensor restricted isometry property, and low-rank tensor recovery. Applied and Computational Harmonic Analysis, 66:161โ€“192, 2023. [24] Cullen Haselby, Mark Iwen, Deanna Needell, Elizaveta Rebrova, and William Swartworth. Fast and low-memory compressive sensing algorithms for low tucker-rank tensor approxima- tion from streamed measurements, 2023. [25] Cullen Haselby, Santhosh Karnik, and Mark Iwen. Tensor sandwich: Tensor completion for low CP-rank tensors via adaptive random sampling. In Fourteenth International Conference on Sampling Theory and Applications, 2023. [26] Stฤณn Hendrikx and Lieven De Lathauwer. Block row kronecker-structured linear systems with a low-rank tensor solution. Frontiers in Applied Mathematics and Statistics, 8, 2022. [27] Mikael Mรธller Hรธgsgaard, Lion Kamma, Kasper Green Larsen, Jelani Nelson, and Chris Schwiegelshohn. Sparse dimensionality reduction revisited, 2023. [28] Mark Iwen, Deanna Needell, Elizaveta Rebrova, and Ali Zare. Lower memory oblivious (tensor) subspace embeddings with fewer random bits: modewise methods for least squares. SIAM Journal on Matrix Analysis and Applications, 42(1):376โ€“416, 2021. 141 [29] Prateek Jain and Sewoong Oh. Provable tensor factorization with missing data. Advances in Neural Information Processing Systems, 27, 2014. [30] Ruhui Jin, Tamara G. Kolda, and Rachel Ward. Faster johnson-lindenstrauss transforms via kronecker products. CoRR, abs/1909.04801, 2019. [31] Ten Berge JM. The ๐‘˜-rank of a khatriโ€“rao product. Unpublished Note, Heฤณmans Institute of Psychological Research, University of Groningen, The Netherlands., 2000. [32] William B Johnson. Extensions of lipschitz mappings into a hilbert space. Contemp. Math., 26:189โ€“206, 1984. [33] Michael Kapralov, Rasmus Pagh, Ameya Velingker, David P. Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. CoRR, abs/1909.01410, 2019. [34] Bohdan Kivva and Aaron Potechin. Exact nuclear norm, completion and decomposition for random overcomplete tensors via degree-4 sos. arXiv preprint arXiv:2011.09416, 2020. [35] Andrew V. Knyazev and Merico E. Argentati. Principal angles between subspaces in an a-based scalar product: Algorithms and perturbation estimates. 
[36] Tamara G. Kolda. Will the real Jennrich's algorithm please stand up? https://www.mathsci.ai/post/jennrich/. Accessed: 2023-05-23.

[37] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[38] Felix Krahmer and Rachel Ward. New and improved Johnson–Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269–1281, 2011.

[39] Akshay Krishnamurthy and Aarti Singh. Low-rank matrix and tensor completion via adaptive sampling. Advances in Neural Information Processing Systems, 26, 2013.

[40] Akshay Krishnamurthy and Aarti Singh. On the power of adaptivity in matrix completion and approximation, 2014.

[41] Rafał Latała. Estimation of moments of sums of independent real random variables. The Annals of Probability, 25(3):1502–1513, 1997.

[42] Denis Le Bihan, Jean-François Mangin, Cyril Poupon, Chris A. Clark, Sabina Pappata, Nicolas Molko, and Hughes Chabriat. Diffusion tensor imaging: concepts and applications. Journal of Magnetic Resonance Imaging: An Official Journal of the International Society for Magnetic Resonance in Medicine, 13(4):534–546, 2001.

[43] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Applied and Numerical Harmonic Analysis. Springer New York, 2002.

[44] S. E. Leurgans, R. T. Ross, and R. B. Abel. A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064–1083, 1993.

[45] Allen Liu and Ankur Moitra. Tensor completion made practical. Advances in Neural Information Processing Systems, 33:18905–18916, 2020.

[46] Avner Magen and Anastasios Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1422–1436. SIAM, 2011.

[47] Osman Asif Malik and Stephen Becker. Low-rank Tucker decomposition of large tensors using TensorSketch. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[48] Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. A randomized algorithm for the decomposition of matrices. Applied and Computational Harmonic Analysis, 30(1):47–68, 2011.

[49] Jiří Matoušek. On variants of the Johnson–Lindenstrauss lemma. Random Structures & Algorithms, 33(2):142–156, 2008.

[50] Rachel Minster, Arvind K. Saibaba, and Misha E. Kilmer. Randomized algorithms for low-rank tensor decompositions in the Tucker format. 2019.

[51] Ankur Moitra. Algorithmic Aspects of Machine Learning. Cambridge University Press, 2018.

[52] Andrea Montanari and Nike Sun. Spectral algorithms for tensor completion. Communications on Pure and Applied Mathematics, 71(11):2381–2425, 2018.

[53] Cameron Musco and Christopher Musco. Projection-cost-preserving sketches: Proof strategies and constructions. CoRR, abs/2004.08434, 2020.

[54] Brendan O'Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, 2016.

[55] Aaron Potechin and David Steurer. Exact tensor completion with sum-of-squares. In Conference on Learning Theory, pages 1619–1673. PMLR, 2017.
[56] Holger Rauhut, Reinhold Schneider, and Željka Stojanac. Low rank tensor recovery via iterative hard thresholding. Linear Algebra and its Applications, 523:220–262, 2017.

[57] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 143–152. IEEE, 2006.

[58] Mikael Sørensen and Lieven De Lathauwer. Fiber sampling approach to canonical polyadic decomposition and application to tensor completion. SIAM Journal on Matrix Analysis and Applications, 40(3):888–917, 2019.

[59] Jimeng Sun, Dacheng Tao, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos. Incremental tensor analysis: Theory and applications. ACM Transactions on Knowledge Discovery from Data (TKDD), 2(3):1–37, 2008.

[60] Yiming Sun, Yang Guo, Charlene Luo, Joel Tropp, and Madeleine Udell. Low-rank Tucker approximation of a tensor from streaming data. SIAM Journal on Mathematics of Data Science, 2(4):1123–1150, 2020.

[61] Giorgio Tomasi and Rasmus Bro. PARAFAC and missing values. Chemometrics and Intelligent Laboratory Systems, 75(2):163–180, 2005.

[62] Nick Vannieuwenhoven, Raf Vandebril, and Karl Meerbergen. A new truncation strategy for the higher-order singular value decomposition. SIAM Journal on Scientific Computing, 34(2):A1027–A1052, 2012.

[63] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

[64] Franco Woolfe, Edo Liberty, Vladimir Rokhlin, and Mark Tygert. A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335–366, 2008.

[65] Ming Yuan and Cun-Hui Zhang. On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics, 16(4):1031–1068, 2016.

[66] Ming Yuan and Cun-Hui Zhang. Incoherent tensor norms and their applications in higher order tensor completion. IEEE Transactions on Information Theory, 63(10):6753–6766, 2017.

[67] Ali Zare, Alp Ozdemir, Mark A. Iwen, and Selin Aviyente. Extension of PCA to higher order data structures: An introduction to tensors, tensor decompositions, and tensor PCA. Proceedings of the IEEE, 106(8):1341–1358, 2018.

[68] Ali Zare, R. Wirth, C. Haselby, Heiko Hergert, and M. Iwen. Modewise Johnson–Lindenstrauss embeddings for nuclear many-body theory. The European Physical Journal A, 59, 2023.