ON METHODS IN TENSOR RECOVERY AND COMPLETION

By

Cullen Haselby

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Applied Mathematics – Doctor of Philosophy

2023

ABSTRACT

Tensor representations of data hold great promise, since as data grows both in dimensionality and in the number of modes, it becomes increasingly advantageous to employ methods that exploit structure to explain, store, and compute in an efficient manner. Tensors and their decompositions crucially admit the possibility of exploiting the multi-linear relationships between the modes for such purposes. Additionally, multi-modal views of data are natural in many practical contexts; e.g., physical phenomena may have several spatial modes and a temporal mode, in natural language processing data may derive from triplets of words that form subject-verb-object semantics, and in neural networks both the processing of data and the storage of weights between layers have been fruitfully imagined using tensors. Finding or using decompositions of tensors is a fundament of this work, and indeed of any work which seeks to analyze and produce algorithms that make efficient use of tensor representations of data, since these factorizations can radically lower the total number of parameters required to store and perform operations on the tensor.

In this dissertation we consider two related problems involving tensor representations of data. First, we consider the problem of recovering a low-rank Tucker approximation to a large tensor from structured, yet randomized, measurements. We will describe a framework for a class of measurement ensembles that can offer excellent memory and accuracy trade-offs in comparison to alternatives. Furthermore, these ensembles can easily be applied in a single pass over the tensor, and in a linear manner, making them suitable for streaming scenarios. That is, we will propose a compressive sensing framework, and study uses of it which can produce low-rank factorizations from these measurements, where we show that the total number of measurements required scales with the number of parameters in the factorization, rather than with the full, uncompressed tensor. We analyze two recovery approaches along with the necessary specializations of the measurement framework. Unlike prior works on algorithms for low-rank tensor approximation from such compressive measurements, we present a unified analysis that permits several different choices of structured measurement ensembles, and we show how to prove error guarantees comparing the error of our recovery algorithm's approximation of the input tensor to the best possible low-rank Tucker approximation error achievable for the tensor by any possible algorithm. We include empirical and practical studies that verify our theoretical findings and explore various trade-offs of parameters of interest. We discuss the development of the methods, and how theoretical and practical critiques of our earlier work informed and enabled improvements in the sequel.

Next, we consider the related problem of tensor completion, where the goal is to exactly complete a low-rank CP tensor. Our method leverages existing matrix completion techniques and an adaptive sampling scheme, along with a noise-robust version of Jennrich's Algorithm, to complete a tensor using a sample budget which scales, up to a log factor, with the number of parameters in the factorization.
Empirical studies, such as performance in the presence of additive noise on simulated data, as well as several practical applications of the method are included and discussed. Copyright by CULLEN HASELBY 2023 To my father, Ray Haselby. v ACKNOWLEDGEMENTS One could try to explore the frontiers alone, but in my view, precious little would be seen this way. Far better, and far more rewarding, to go there with others. This work and what it represents of my understanding of mathematics and its place in the world was accomplished principally through, and because of, discourse and collaboration with others. The most frequent and immediate target for my graspings for meaning in subjects large and small mathematically is my advisor, Mark Iwen. You taught me much, gave me much, and I never once saw the end of your patience. You invited me into the good and puissant circle of people and ideas you have cultivated, and I was made better for it. I somewhat doubt that your monstrous investment of time and effort into me will ever enrich you personally, but that also never seemed to bother you. Your example is not lost on me. I intend to do what good I can muster with all that was given to me and become worthy of it. Thank you. Next, I would like to thank Michael Perlmutter, Elizaveta Rebrova, Deanna Needell and William Swartworth for first pulling me onboard, and then rowing along with me on our now shared mission to make tensor recovery with modewise measurements a better understood and more useful art, and with whom I wrote [23] and [24] alongside. Your expertise, curiosity, persistence showed me how the work is done. Here at MSU, I was buoyed considerably by the fruitful interdisciplinary collaboration with Heiko Hergert, his student Roland Wirth and my graduate student older sibling so-to-speak Ali Zare as seen in our work [68]. I learned more than I think I let on in those early days from reading your writings, pondering your code and methods, and probing your ideas. Santhosh Karnik with whom I wrote [25]. It was a lot of fun making Tensor Sandwiches with you, and sharpening our ideas together. Thank you. I would like to thank Jianliang Qian for his service on my committee, his helpful advice when I was taking my first steps into applied mathematics research, his work as an instructor and his direction of the Industrial Mathematics program here at MSU; from which I had very memorable collaboration. I can with the confidence of some hindsight and experience say that numerical linear vi algebra and the topics you taught and taught well were used near daily in both my research and work. To Rongrong Wang, thank you for your service on my committee and your insights into my forays into tensor completion algorithms. To the faculty, staff, and fellow students at MSU Mathematics department who strove in so many ways to make our collective contribution to human understanding a success, and maintaining a tradition of excellence, I am also indebted. In particular I would like to acknowledge my fellow student Yuta Hozumi. I did value our several hundred hours of pondering each otherโ€™s many problems and ideas, however it was your friendship that was at times, sustaining. Thank you. vii TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 CHAPTER 2 TENSOR RESTRICTED ISOMETRY PROPERTY AND TENSOR ITERATIVE HARD THRESHOLDING FOR RECOVERY . . . . . . . 35 CHAPTER 3 LEAVE ONE OUT MEASUREMENTS FOR RECOVERY . . . . . . . 
54 CHAPTER 4 TENSOR SANDWICH: TENSOR COMPLETION FOR LOW CP- RANK TENSORS VIA ADAPTIVE RANDOM SAMPLING . . . . . . 114 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 viii CHAPTER 1 INTRODUCTION 1.1 Introduction The need for efficient methods to acquire, store, reconstruct, and analyze large-scale data is common now in many settings and disciplines. Furthermore, data may often be multi-modal thus making tensor representations natural and useful in both their storage and analysis. The contexts for this sort of data are numerous and extremely varied, from signal processing tasks, imaging, machine learning, and many other data science applications [37, 12, 59, 42, 9]. Tensors are the natural extension of vectors and matrices, allowing for an arbitrary number of modes, instead of only one or two. Despite seemingly straightforward extensions of the matrix concept, many of the standard linear algebra and computational results that are tried and true in the matrix case simply do not cleanly translate beyond two modes. Nevertheless, tensors of interest rarely have unconstrained, unrelated entries. In practical settings, the tensor data may have some implicit low-rank structure that can be exploited for efficient computation and storage, or simply be well approximated by a tensor that does have this structure. Examples of decompositions include Tucker, CANDECOMP/PARAFAC (CP), and tensor train [37]. Since tensor data can be large both in the number of modes and the dimensions within each mode, and arrive in a streaming fashion, efficient methods that deal with these facets are critical for the computation of such factorizations as well as recovery of the data from compressed measurements or using sampling strategies. In this work, we will describe and analyze a framework and methods which relate to this framework that can reduce overall memory requirements, theoretically accommodate many types of measurement schemes for collecting linear measurements of large tensors, apply to several low-rank decompositions, while also remaining robust to non-exact low-rankness, and provide computationally tractable and provable recovery. The framework also applies to the streaming setting, such as when the tensor data is being received sequentially over time in a slice by slice, or even entrywise manner. Efficiency in terms of storage for linear measurements are valuable in 1 streaming algorithms for tensor reconstruction in the big data setting (see, e.g., [60]) since it can be impractical or impossible to have access to the entire, full tensor that one wishes to measure and later reconstruct. 1.2 Motivating Examples The following examples provide context and some indication as to how measurements of tensors and their recovery are of theoretical and practical value. 1.2.1 Example, Technical For example, suppose we wish to aggregate tensor data over time using the additive rule X๐‘ก := X๐‘กโˆ’1 + ฮ”X๐‘ก, X0 = 0 based on updates ฮ”X๐‘ก. After some number of updates, we wish to reconstruct an estimate of X๐‘‡ = (cid:205)๐‘‡ ๐‘ก=1 ฮ”X๐‘ก โˆˆ R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ for some ๐‘‡ > 0, where ๐‘‡ can be large. Storing every intermediate Instead we can store small linear sketches of tensor X๐‘ก in uncompressed form is inefficient. the intermediate tensors using a linear measurement operator L : R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ โ†’ R๐‘š1ร—ยทยทยทร—๐‘š๐‘‘โ€ฒ with (cid:206)๐‘‘โ€ฒ โ„“=1 ๐‘›โ„“. 
This operator can be used to iteratively update the comparatively smaller ๐‘šโ„“ โ‰ช (cid:206)๐‘‘ โ„“=1 sketch L (X๐‘ก) = L (X๐‘กโˆ’1) + L (ฮ”X๐‘ก) over the updates ๐‘ก = 0, . . . , ๐‘‡. So long as the final sketch L (X๐‘‡ ) can then be used to approximately recover X๐‘‡ using a recovery algorithm, of sufficient accuracy, then we can achieve storage savings so long as the number of entries in the measurements are smaller than the number of entries in the original signal. Ideally the recovery algorithm itself would also be efficient in terms of memory and computational complexity, so that the cost of recovering X๐‘‡ from L (X๐‘‡ ) does not unduly outweigh the efficiencies in terms of storage of dealing with only the measurements after the operator has acted on the signal. Naturally, in this situation L itself must be stored or newly re-calculated each time it is to operate on a tensor. This is a storage requirement beyond the measurements themselves, L (X๐‘กโˆ’1) and L (ฮ”X๐‘ก). Perhaps this operator can be used many times on different streams, yet even so, it is 2 truly only memory conserving if the cost of storing L is less than the savings accrued by not storing the tensor X๐‘ก, and instead using the measurements, when such is even possible. Indeed, in the classic compressive sensing scenario where one has say a large sensing matrix acting on a sparse vector, one needs to store the measurement operator (or generate it anew each time it is needed) and without a low memory approach or a specially designed measurement operator, the storage of the matrix would exceed any savings gained by the compression of the data. So, by developing our framework, we will by turns describe how to construct a class of operators L and pair them with recovery procedures so that for signals of a given type we can with confidence recover them to a desired level of accuracy. 1.2.2 Example, Hypothetical and Possibly Non-benign Consider the following hypothetical application. Suppose there is a weapon which seeks to replace the role of the landmine in warfare (victim operated mines violate important international treaties and can be indiscriminate, and thus a potentially great evil, even outside of when they have served their initial intended military purpose). Suppose these new weapons have cameras and each device has a limited amount of computational resources, bandwidth, and storage. A well designed measurement operator L could enable each device to perform clustering and object recognition on segments of video data. Video data is easily conceived of as a tensor as collected by the devicesโ€™ cameras. Since the data can be significantly reduced in size while still retaining enough information to perform the needed tasks, this operation can be performed on the deviceโ€™s limited hardware. However, the machine alone should likely never be able to make the determination that a given segment of video represents a valid target. Once a segment of video data has been identified as significant, these reduced size measurements can be transmitted, and the scene reconstructed so that a human operator can view the scene and make a determination. In this way, one human could potentially supervise many weapons at once, and the system would not be onerous with constant, high volumes of data needing constant transmission back to nodes for computing and storage. 3 1.2.3 Example, Hypothetical and Benign Consider another hypothetical. 
Suppose a firm has designed a method that can, after some training and tuning, reliably identify sections of a TV broadcast which are live sports and sections of the broadcast which are commercial breaks. Moreover, this procedure relies only on pairwise distance information taken from the streaming video and audio of the broadcast, and the categorization can be accomplished, for appropriately measured (and compressed) data, on an inexpensive computer in milliseconds. Given this situation, a large chain of sports bars would be able to deploy the method and accompanying hardware at each of their establishments, across many broadcasts, many sports and many regions, to locally identify whenever an incoming TV broadcast is on commercial break, and automate switching their patron-facing viewscreens to display some other chosen video; for example, the sports bar chain might sell (their own) ads, or play a muted version of the source TV broadcast. The compressed data can be retained, and recovery performed for quality assurance and possible tuning.

1.3 Organization and Contribution

The rest of this work is organized in the following way. First, in Section 1.4.1, we will lay out definitions, fix notation, and state some fundamental results in numerical linear algebra and high dimensional probability that will be useful throughout this work. The final Section 1.5 of Chapter 1 will describe the measurement framework as well as outline its development and relationship to other research on these topics.

Chapter 2 will concern what was our first specialization of the framework and how we interfaced it with prior work that concerned a recovery procedure known as Tensor Iterative Hard Thresholding (TIHT) via a concept known as the Tensor Restricted Isometry Property (TRIP). We will state the main results involving TRIP and TIHT in Section 2.1 and describe the theoretical and practical advantages of this specialization, as well as its limitations in the context of our empirical findings. The proofs of the main theoretical contents of this chapter, as well as the numerical experiments and findings, first appeared in our work [23]. In that work I was principally responsible for the numerical experiments, comparisons, observations and discussion regarding the memory footprint, whereas Michael Perlmutter and Elizaveta Rebrova produced most of the proofs for the main results. Since a critique of the main ideas of that work is germane to the development of the sequel, technical proofs for this chapter are placed in the appendix to that chapter, Section 2.4.

The desire to address these critiques of the main findings as described and discussed in Chapter 2 led us to consider a new specialization of the framework, which we coin the Leave-One-Out alternative, for reasons that will be made evident in Chapter 3. Chapter 3 describes the Leave-One-Out variations of the framework and the paired recovery algorithm. In Section 3.2 we conduct a first-of-its-kind analysis of the method and state and prove recovery guarantees that demonstrate the method has an overall sub-linear dependence on the total size of the tensor data. We also provide experiments on simulated data as well as video data to show how different choices and parameters in the problem contribute to trade-offs of interest. The results of this chapter first appear in our work [24]. In that work I was responsible for most of the proof writing and the numerical experiments.
Technical proofs adapted from other research is included in that chapterโ€™s Appendix, since these involved important but minor specializations to our task and may distract from the main findings. Finally, in Chapter 4 we present a novel method for completing tensors that relies on a (mildly) adaptive sampling procedure and prove that in the noiseless setting this method can exactly complete a low-rank tensor. Sampling can be viewed as a very particular type of measurement operator, and though this sampling procedure does not strictly fit into the framework we describe in Section 1.5, its design is very much in the same spirit. We conclude that chapter again with a number of numerical experiments that demonstrate the main findings as well show empirically that the method can succeed in the presence of noise, as well as a number of applications of the method on real world data-sets. The initial results of this chapter appeared in our work [25], the main result, its proof and the empirical findings were primarily my contribution. We will need to set notation and define common terms and some base results in order to deal with the subject matter. 5 1.4 Preliminaries 1.4.1 Tensor and Measurement Preliminaries Definition 1.1. A ๐‘‘-mode or order-๐‘‘ tensor (or ๐‘‘-th order) is a ๐‘‘ dimensional array of complex values, written as X โˆˆ C๐‘›1ร—๐‘›2ร—ยทยทยทร—๐‘›๐‘‘ A ๐‘‘-mode tensorโ€™s entries are indexed by a vector i โˆˆ [๐‘›1] ร— [๐‘›2] ร— ยท ยท ยท ร— [๐‘›๐‘‘] where (X)i = ๐‘ฅi = ๐‘ฅ๐‘–1,...,๐‘–๐‘‘ โˆˆ C Example 1.1 (1 and 2 mode tensors). (i) A 1-mode tensor, x โˆˆ C๐‘› is a vector with entries ๐‘ฅ๐‘– โˆˆ C, ๐‘– โˆˆ [๐‘›]. We will denote 1-mode tensors, vectors, with bolded lowercase letters. (ii) A 2-mode tensor, ๐ด โˆˆ C๐‘›1ร—๐‘›2 is a matrix with entries ๐‘Ž๐‘–1,๐‘–2 โˆˆ C for ๐‘– โˆˆ [๐‘›1], ๐‘–2 โˆˆ [๐‘›2]. We denote 2-mode tensors (matrices) with capital letters. We introduce some terminology that will be useful when describing tensors Definition 1.2 (Fiber). Fibers are 1-dimensional subsets of a ๐‘‘-mode tensor. They are formed by fixing ๐‘‘ โˆ’ 1 of the dimensions and then ranging over all indices in the remaining dimension. So for any ๐‘˜ โˆˆ [๐‘‘], and X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ then a ๐‘˜-mode fiber would be a vector x โˆˆ C๐‘›๐‘˜ where indices ๐‘–1, . . . , ๐‘–๐‘˜โˆ’1, ๐‘–๐‘˜+1, . . . , ๐‘–๐‘‘ are fixed Using Matlab notation (X)๐‘–1,...,๐‘–๐‘˜โˆ’1,:,๐‘–๐‘˜+1,...,๐‘–๐‘‘ = x๐‘–1,...,๐‘–๐‘˜โˆ’1,:,๐‘–๐‘˜+1,...,๐‘–๐‘‘ So for example, given a matrix ๐‘‹ โˆˆ C๐‘›1ร—๐‘›2 then ๐‘‹๐‘–,: = x๐‘–,: โˆˆ C๐‘›2 is a mode-2 fiber (i.e. row). A mode-1 fiber, x:, ๐‘— is a column of the matrix. See Figure 1.1 for a schematic depiction of these terms. Definition 1.3 (Slice). A matrix formed by varing 2 indices and fixing all other indices of a tensor. That is, suppose ๐‘—, ๐‘˜ โˆˆ [๐‘‘] where ๐‘— โ‰  ๐‘˜ then ๐‘‹ = X๐‘–1,...,๐‘– ๐‘— โˆ’1,:,๐‘– ๐‘—+1,...,๐‘–๐‘˜โˆ’1,:,๐‘–๐‘˜+1,...,๐‘–๐‘‘ โˆˆ C๐‘› ๐‘— ร—๐‘›๐‘˜ Definition 1.4 (Sub-tensor). A ๐‘˜-mode sub-tensor of a ๐‘‘-mode tensor (๐‘˜ โ‰ค ๐‘‘) is denoted by a vector of length ๐‘‘ โˆ’ ๐‘˜ of indices and a set of ๐‘˜ distinct indices from the set [๐‘‘]. That is given 6 (a) Fiber (b) Slice Figure 1.1 A schematic of X โˆˆ R4ร—4ร—4 tensor where the fiber x2,2,: is highlighted in (๐‘Ž) and a frontal slice ๐‘‹:,:,1 is highlighted in ๐‘. distinct ๐‘—1, . . . 
, ๐‘—๐‘˜ โˆˆ [๐‘‘] and vector i โˆˆ (cid:203) ๐‘–โ‰  ๐‘—โ„“ [๐‘›๐‘–] of length ๐‘‘ โˆ’ ๐‘˜, we have a ๐‘˜-mode sub-tensor where X๐‘—1,..., ๐‘—๐‘˜,i โˆˆ C๐‘› ๐‘— 1 ร—ยทยทยทร—๐‘› ๐‘—๐‘˜ Using this sub-tensor notation then an entry of a mode-๐‘˜ fiber is a sub-tensor X๐‘—,i where ๐‘— โˆˆ [๐‘›๐‘˜ ] and i โˆˆ [๐‘›1] ร— . . . [๐‘›๐‘˜โˆ’1] ร— [๐‘›๐‘˜+1] ร— ยท ยท ยท ร— [๐‘›๐‘‘]. There are (cid:206)โ„“โ‰ ๐‘˜ ๐‘›โ„“ different mode-๐‘˜ fibers, one for each possible i. An entry of a slice then is denoted Xโ„“,๐‘˜,i โˆˆ C๐‘›โ„“ ร—๐‘›๐‘˜ . There are (cid:206) ๐‘—โ‰ โ„“,๐‘˜ ๐‘› ๐‘— slices of dimension ๐‘›โ„“ ร— ๐‘›๐‘˜ . Next we will discuss reshaping operators - these all involve different possible ways of changing the dimensions and number of modes of tensors so that they have the same number of entries but are different shapes. Definition 1.5 (Vectorization). For X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ , vec(X) = x where x โˆˆ C(cid:206)๐‘‘ ๐‘˜=1 ๐‘›๐‘˜ . Entries are typically ordered lexicographically by their indices, i.e. the first index sorts all values before the second and so on. However, usually the particular order that entries are moved to between 7 (a) (b) Figure 1.2 A schematic of X โˆˆ R4ร—4ร—4 tensor where the mode-1 fibers x:,๐‘–, ๐‘— are arranged as the columns of the corresponding flattening. The intensity of the color corresponds to the lexiographic order of the fibers. reshapings does not matter, so long as it is consistent across the operations. Definition 1.6 (Mode-๐‘˜ Unfolding). Also known as flattening or matricization. For X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ the ๐‘˜-mode flattening is a matrix ๐‘‹(๐‘˜) โˆˆ C๐‘›๐‘˜ร—(cid:206) ๐‘—โ‰ ๐‘˜ ๐‘› ๐‘— . We have effectively made the ๐‘˜-th dimension into the rows of the matrix, and the columns are then the different mode-๐‘˜ fibers, i.e. the columns are the fibers X๐‘˜,i. When it will not cause any confusion from context, we may omit the decoration (๐‘˜) and simply write the flattening as ๐‘‹. Definition 1.7 (Reshaping). We can reshape a ๐‘‘-mode tensor into any other ๐‘“ -mode tensor with a reshaping operation ๐‘… : C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ โ†’ C๐‘š1ร—ยทยทยทร—๐‘š ๐‘“ provided ๐‘‘ (cid:214) ๐‘“ (cid:214) ๐‘šโ„“ ๐‘› ๐‘— = ๐‘—=1 ๐‘˜-mode flattening and vectorization are in this view two particular reshaping operations. โ„“=1 What is the underlying vector space we can use to study tensors? To answer that, we consider the following norm, inner-product, and operations on tensors: Definition 1.8 (2-norm of a Tensor). For X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ then given any ๐‘˜ โˆˆ [๐‘‘] we have โˆฅXโˆฅ2 2 = โˆฅ ๐‘‹(๐‘˜) โˆฅ2 ๐น = โˆฅxโˆฅ2 2 = |๐‘ฅi|2 โˆ‘๏ธ iโˆˆ๐ผ where ๐ผ = [๐‘›1] ร— ยท ยท ยท ร— [๐‘›๐‘‘] 8 Definition 1.9 (Inner-product). For compatible tensors X, Y โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ then โŸจX, YโŸฉ = โˆ‘๏ธ iโˆˆ๐ผ ๐‘ฅi๐‘ฆ i = โŸจx, yโŸฉ that is, the inner-product of the vectorization of the tensors. Note that this is equivalent to โŸจ๐‘‹(๐‘˜), ๐‘Œ(๐‘˜)โŸฉ๐ป๐‘† = Trace(๐‘‹(๐‘˜) (๐‘Œ(๐‘˜))โˆ—), โˆ€๐‘˜ โˆˆ [๐‘‘], the Hilbert-Schmidt inner product for matrices Addition and scalar multiplication work component-wise, i.e. (X + Y)i = ๐‘ฅi + ๐‘ฆi (๐›ผX)i = ๐›ผ๐‘ฅi, โˆ€๐›ผ โˆˆ C 1.4.2 Modewise and Matrix Products The modewise product will feature prominently in our introduction to the measurement frame- work Section 1.5, as well as Chapters 2 and 3. We state it here as well as some basic algebraic identities related to it. 
Certain matrix products will also quickly become useful when describing tensors and their decompositions. Definition 1.10 (Modewise Product). The ๐‘˜-mode product of X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ and ๐‘ˆ โˆˆ C๐‘š๐‘˜ร—๐‘›๐‘˜ for ๐‘˜ โˆˆ [๐‘‘] is a tensor in C๐‘›1ร—ยทยทยทร—๐‘›๐‘˜โˆ’1ร—๐‘š๐‘˜ร—๐‘›๐‘˜+1ร—ยทยทยทร—๐‘›๐‘‘ element-wise defined as (X ร—๐‘˜ ๐‘ˆ)๐‘–1,...,๐‘–๐‘˜โˆ’1, ๐‘—,๐‘–๐‘˜+1,...,๐‘–๐‘‘ = ๐‘›๐‘˜โˆ‘๏ธ ๐‘–๐‘˜=1 ๐‘ฅ๐‘–1,๐‘–2,...,๐‘–๐‘‘ ๐‘ข ๐‘—๐‘–๐‘˜ (1.1) This is usefully identified with the following vectorized expression (X ร—๐‘˜ ๐‘ˆ)๐‘–1,...,๐‘–๐‘˜โˆ’1,:,๐‘–๐‘˜+1,...,๐‘–๐‘‘ = ๐‘ˆX๐‘–1,...,๐‘–๐‘˜โˆ’1,:,๐‘–๐‘˜+1,...,๐‘–๐‘‘ In other words, the ๐‘˜-mode product applies the matrix ๐‘ˆ to all the mode-๐‘˜ fibers of the tensor X. For example, suppose X โˆˆ C5ร—3ร—2 and ๐‘ˆ โˆˆ C4ร—5. Then X ร—1 ๐‘ˆ โˆˆ C4ร—3ร—2 where each of the mode-1 fibers is now the product of ๐‘ˆX:, ๐‘—,โ„“, for some ๐‘— โˆˆ [3], โ„“ โˆˆ [2]. In the 2-mode tensor case, i.e., matrices, the usual matrix-matrix multiplication can be under- stood in terms of 1-mode tensor product. 9 This also easily establishes that we can identify a mode-wise product with the matrix-matrix multiplication of the matrix with the corresponding ๐‘˜-mode flattening of the tensor. (cid:104) ๐ด๐ต = ๐ด b1 b2 . . . b๐‘› (cid:105) (cid:104) = ๐ดb1 ๐ดb2 . . . ๐ดb๐‘› (cid:105) = ๐ต ร—1 ๐ด Note, modewise products are bilinear. We note some other useful properties. Lemma 1.1 (Properties of ๐‘˜-mode products). Let X, Y โˆˆ C๐‘›1,...,๐‘›๐‘‘ , ๐›ผ, ๐›ฝ โˆˆ C, ๐‘ˆโ„“, ๐‘‰โ„“ โˆˆ C๐‘šโ„“ ร—๐‘›โ„“ , โˆ€โ„“ โˆˆ [๐‘‘]. Then 1. (๐›ผX + ๐›ฝY) ร— ๐‘— ๐‘ˆ ๐‘— = ๐›ผ(X ร— ๐‘— ๐‘ˆ ๐‘— ) + ๐›ฝ(Y ร— ๐‘— ๐‘ˆ ๐‘— ) 2. that is, ๐‘˜-mode product is bilinear. 3. If ๐‘— โ‰  โ„“ then (X ร— ๐‘— ๐‘ˆ ๐‘— ) ร—โ„“ ๐‘‰โ„“ = (X ร—โ„“ ๐‘‰โ„“) ร— ๐‘— ๐‘ˆ ๐‘— 4. If ๐‘Š โˆˆ C๐‘ร—๐‘š ๐‘— then (cid:0)X ร— ๐‘— ๐‘ˆ ๐‘— (cid:1) ร— ๐‘— ๐‘Š = X ร— ๐‘— (๐‘Š๐‘ˆ ๐‘— ) โˆˆ C๐‘›1ร—๐‘› ๐‘— โˆ’1ร—๐‘ร—๐‘› ๐‘—+1ร—...๐‘›๐‘‘ Proof. 1. For any ๐‘–1 โˆˆ ๐‘›1, . . . , ๐‘– ๐‘—โˆ’1 โˆˆ ๐‘› ๐‘—โˆ’1, ๐‘– ๐‘—+1 โˆˆ ๐‘› ๐‘—+1, . . . , ๐‘–๐‘‘ โˆˆ ๐‘›๐‘‘ we have (cid:2)(๐›ผX + ๐›ฝY) ร— ๐‘— ๐‘ˆ(cid:3) ๐‘–1,...,๐‘– ๐‘— โˆ’1,:,๐‘– ๐‘—+1,...,๐‘–๐‘‘ = ๐‘ˆ (๐›ผX + ๐›ฝY)๐‘–1,...,๐‘– ๐‘— โˆ’1,:,๐‘– ๐‘—+1,...,๐‘–๐‘‘ = ๐›ผ๐‘ˆX๐‘–1,...,๐‘– ๐‘— โˆ’1,:,๐‘– ๐‘—+1,...,๐‘–๐‘‘ + ๐›ฝ๐‘ˆY๐‘–1,...,๐‘– ๐‘— โˆ’1,:,๐‘– ๐‘—+1,...,๐‘–๐‘‘ = (X ร— ๐‘— ๐‘ˆ + ๐›ฝ(Y ร— ๐‘— ๐‘ˆ) 2. Similar to 1. 3. Consider the element-wise definition (1.1) ((X ร— ๐‘— ๐‘ˆ ๐‘— ) ร—โ„“ ๐‘‰โ„“)๐‘–1,...,๐‘,๐‘– ๐‘—+1,...,๐‘ž,๐‘–โ„“+1,...,๐‘–๐‘‘ = = ๐‘›๐‘ž โˆ‘๏ธ ๐‘› ๐‘ โˆ‘๏ธ ๐‘–๐‘ž=1 ๐‘› ๐‘ โˆ‘๏ธ ๐‘– ๐‘=1 ๐‘›๐‘ž โˆ‘๏ธ ๐‘– ๐‘=1 ๐‘–๐‘ž=1 ๐‘ฅ๐‘–1,๐‘–2,...,๐‘–๐‘‘ ๐‘ข ๐‘๐‘– ๐‘ ๐‘ฃ๐‘ž๐‘–๐‘ž ๐‘ฅ๐‘–1,๐‘–2,...,๐‘–๐‘‘ ๐‘ข ๐‘,๐‘– ๐‘ ๐‘ฃ๐‘ž,๐‘–๐‘ž = ((X ร—โ„“ ๐‘‰โ„“) ร— ๐‘— ๐‘ˆ ๐‘— )๐‘–1,...,๐‘,๐‘– ๐‘—+1,...,๐‘ž,๐‘–โ„“+1,...,๐‘–๐‘‘ 10 4. Using the same identity as in 3, the proof is similar Note that the run-time complexity is the same regardless of the order one applies the ๐‘˜- or ๐‘—-mode products. Definition 1.11 (Kronecker Product). The Kronecker product of two matrices ๐ด ร— C๐‘š1ร—๐‘›1 and ๐ต โˆˆ C๐‘š2ร—๐‘›2 is a matrix ๐ด โŠ— ๐ต := ๐‘Ž1,1๐ต ๐‘Ž2,1๐ต ... ๐‘Ž1,2๐ต . . . ๐‘Ž2,2๐ต . . . . . . ... ๐ต ๐ต ๐‘Ž1,๐‘›1 ๐‘Ž2,๐‘›1 ... ๐‘Ž๐‘š1,1๐ต ๐‘Ž๐‘š1,2๐ต . . . 
๐‘Ž๐‘š1,๐‘›1 ๐ต ๏ฃฎ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฏ ๏ฃฐ . ๏ฃน ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃบ ๏ฃป (1.2) where ๐ด โŠ— ๐ต โˆˆ C๐‘š1๐‘š2ร—๐‘›1๐‘›2 Definition 1.12 (Khatri-Rao Product). The Khatri-Rao product of two matrices is the matrix that results from computing the Kronecker product of their matching columns. That is, for ๐ด โˆˆ R๐‘š1ร—๐‘› and ๐ต โˆˆ R๐‘š2ร—๐‘›, their Khatri-Rao product is the matrix ๐ด โŠ™ ๐ต โˆˆ R๐‘š1๐‘š2ร—๐‘› defined by ๐ด โŠ™ ๐ต := (cid:104) a1 โŠ— b1 a2 โŠ— b2 . . . a๐‘› โŠ— b๐‘› (cid:105) . We also use a so-called row-wise Khatri-Rao product (sometimes called the face-splitting product) of ๐ด โˆˆ R๐‘šร—๐‘›1 and ๐ต โˆˆ R๐‘šร—๐‘›2, denoted by ๐ด โ€ข ๐ต โˆˆ R๐‘šร—๐‘›1๐‘›2, and defined by ( ๐ด โ€ข ๐ต)๐‘‡ := ๐ด๐‘‡ โŠ™ ๐ต๐‘‡ . Lemma 1.2. Let X โˆˆ C๐‘›1,...,๐ผ๐‘‘ , ๐‘ˆโ„“ โˆˆ C๐‘šโ„“ ร—๐‘›โ„“ then 1. (cid:0)X ร— ๐‘— ๐‘ˆ ๐‘— (cid:1) ( ๐‘—) = ๐‘ˆ ๐‘— X( ๐‘—) โˆˆ C๐‘š ๐‘— ร—(cid:206)โ„“โ‰  ๐‘— ๐‘›โ„“ 2. (X ร—1 ๐‘ˆ1 ร—2 ๐‘ˆ2 ยท ยท ยท ร—๐‘‘ ๐‘ˆ๐‘‘) [ ๐‘—] = ๐‘ˆ ๐‘— X[ ๐‘—] (cid:0)๐‘ˆ๐‘‘ โŠ— ๐‘ˆ๐‘‘ โŠ— ยท ยท ยท โŠ— ๐‘ˆ ๐‘—+1 โŠ— ๐‘ˆ ๐‘—โˆ’1 โŠ— ยท ยท ยท โŠ— ๐‘ˆ1 (cid:1)๐‘‡ Proof. This result follows from the definition of reshaping operators, which we detail below, assuming a column-major convention, and by using the elementwise definition in (1.1). In order for the identity seen in Lemma 1.2 to hold, we need specify our precise matricization convention. In our convention, the entry at location (๐‘–1, . . . , ๐‘–๐‘‘) in X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ is located in the 11 matrix as entry (๐‘– ๐‘, ๐‘—) where where ๐‘— = 1 + ๐‘‘ โˆ‘๏ธ ๐‘˜=1 ๐‘˜โ‰ ๐‘ (๐‘–๐‘˜ โˆ’ 1)๐ฝ๐‘˜ ๐‘˜โˆ’1 (cid:214) ๐‘›๐‘š ๐ฝ๐‘˜ = ๐‘š=1 ๐‘šโ‰ ๐‘ set ๐ฝ๐‘˜ = 1 if the index of the product above is empty. These conventions can become cumbersome to state, however they are made clear and more natural by an example. Example 1.2 (3-mode). A similar example appears in [37]: Consider a tensor X โˆˆ R3ร—4ร—2, the frontal slices are as follows 1 4 7 10 13 16 19 22 (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) The three different unfoldings are then as follows. Consider ๐‘ = 1, 3 6 9 12 2 5 8 11 , X:,:,2 = X:,:,1 = (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) 14 17 20 23 15 18 21 24 (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) Note ๐‘‹(1) = (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) 1 4 7 10 13 16 19 22 2 5 8 11 14 17 20 23 3 6 9 12 15 18 21 24 (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) ๐ฝ1 = 1โˆ’1 (cid:214) ๐‘š=1 ๐‘šโ‰ 1 (๐‘›๐‘š) = 1, ๐ฝ2 = ๐‘›๐‘š = 1, ๐ฝ3 2โˆ’1 (cid:214) ๐‘š=1 ๐‘šโ‰ 1 = ๐‘›๐‘š = 4 3โˆ’1 (cid:214) ๐‘š=1 ๐‘šโ‰ 1 Now to see how to locate a particular entry, note that X2,3,2 = 20, so in our unfolding (X( ๐‘))(๐‘–, ๐‘—) we can simply copy the index that corresponds to the ๐‘-th mode, i.e. the first here, ๐‘– = 2. To find the column, compute ๐‘— = 1 + (cid:205)๐‘‘ (๐‘–๐‘˜ โˆ’ 1)๐ฝ๐‘˜ = 1 + (๐‘–2 โˆ’ 1)๐ฝ2 + (๐‘–3 โˆ’ 1)๐ฝ3 = 1 + (2) (1) + (1) (4) = 7. So our entry with value 20 is in location (2, 7). 
๐‘˜=1 ๐‘˜โ‰ ๐‘ 12 Consider ๐‘ = 2, X (1) = 1 3 5 2 4 6 13 14 15 (cid:170) (cid:174) (cid:174) 16 17 18 (cid:174) (cid:174) (cid:174) 19 20 21 (cid:174) (cid:174) (cid:174) 10 11 12 22 23 24 (cid:172) 7 8 9 (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) Again, to locate our entry,X2,3,2 = 20, we return the index on the ๐‘-th mode as our row, ๐‘– = 3, and repeat the same calculation for ๐‘—: Note 1โˆ’1 (cid:214) ๐ฝ1 = (๐‘›๐‘š) = 1, ๐ฝ2 = 2โˆ’1 (cid:214) ๐‘›๐‘š = 1, ๐ฝ3 ๐‘š=1 ๐‘šโ‰ 2 This time we leave out the ๐ฝ2 factor: ๐‘— = 1 + (cid:205)๐‘‘โˆ’1 ๐‘˜=1 ๐‘˜โ‰ ๐‘› ๐‘š=1 ๐‘šโ‰ 2 entry with value 20 is located in (3, 5). = ๐‘›๐‘š = 3 3โˆ’1 (cid:214) ๐‘š=1 ๐‘šโ‰ 2 (๐‘–๐‘˜ โˆ’ 1)๐ฝ๐‘˜ = 1 + (1) (1) + (1) (3) = 5. So our Consider ๐‘ = 3, ๐‘‹(3) 1 2 3 4 5 6 7 8 . . . 13 14 15 16 17 18 19 20 . . . (cid:169) (cid:173) (cid:173) (cid:171) (cid:170) (cid:174) (cid:174) (cid:172) Again, to locate our entry,X2,3,2 = 20, we return the index on the ๐‘-th mode as our row, ๐‘– = 1, and repeat the same calculation for ๐‘—: Note ๐ฝ1 = 1โˆ’1 (cid:214) ๐‘š=1 ๐‘šโ‰ 3 (๐‘›๐‘š) = 1, ๐ฝ2 = 2โˆ’1 (cid:214) ๐‘›๐‘š = 3, ๐ฝ3 = ๐‘›๐‘š = 12 3โˆ’1 (cid:214) ๐‘š=1 ๐‘šโ‰ 3 This time we leave out the ๐ฝ3 factor: entry with value 20 is located in (2, 8). ๐‘š=1 ๐‘šโ‰ 3 ๐‘— = 1 + (cid:205)๐‘‘ ๐‘˜=1 ๐‘˜โ‰ ๐‘› (๐‘–๐‘˜ โˆ’ 1)๐ฝ๐‘˜ = 1 + 1(1) + 2(3) = 8. So our 1.4.3 Decomposition and Low Rank Approximation Our next set of definitions concerns ways of factorizing tensors. In the exact case, these describe how we can compute or represent the tensor as a product and sum of appropriately constrained sub-tensors of various types. This is useful for many reasons, such as interpreting the data, storing it, or performing computations more efficiently. Approximation will also be our concern. That 13 is, for situations where exact factorization is not required (or even possible) we wish to find and understand methods that can be used to find approximate tensors that do admit a certain factorization and understand the trade-offs such has how efficiently it is to compute or how accurately we can expect it to resemble the original tensor. To motivate why decompositions are of interest, and provide an example for why there is a need for tensor specific decompositions, we include an example of the storage benefit achieved by applying a standard linear algebra startegy. By tensor specific, we mean doing something other than reshaping tensors into familiar objects (vectors and matrices) and reusing the concepts from basic linear algebra. Suppose we have ๐‘ž tensors of size C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ , X1, . . . , X๐‘ž. We will compress, or reduce the dimension, by performing Principal Component Analysis (PCA) on the vectorized tensors. Our goal then in this case is to solve the minimization problem. Assuming the vectorized tensors have been centered, this amounts to ๐‘ž โˆ‘๏ธ ๐‘—=1 min ๐‘„โˆˆR(cid:206)๐‘‘ ๐‘˜=1 ๐‘„ ๐ป ๐‘„=๐ผ ๐‘›๐‘˜ ร—๐‘š โˆฅx ๐‘— โˆ’ ๐‘„๐‘„๐ปx ๐‘— โˆฅ2 where vec(X๐‘— ) = x ๐‘— โˆˆ C(cid:206)๐‘‘ ๐‘˜=1 ๐‘›๐‘˜ . We can use the Singular Value Decomposition (SVD) of the following data matrix (cid:104) x1 x2 . . . x๐‘ž (cid:105) = ๐‘ˆฮฃ๐‘‰ โˆ— โˆˆ C๐‘žร—(cid:206)๐‘‘ ๐‘˜=1 ๐‘›๐‘˜ As is well known, and typical for computing projections using PCA, this amounts to storing and using the truncated SVD of the data matrix. 
Once we have the (top) singular vectors, we can tensorize their outer product using the inverse of our vectorizing reshaping operation. That is ๐‘š โˆ‘๏ธ ๐œŽ๐‘—,๐‘˜ T๐‘˜ X๐‘— โ‰ˆ ๐‘˜=1 where T๐‘˜ are the tensors obtained from taking the relevant indices from the sum of outer product of the singular vectors in ๐‘ˆ and ๐‘‰, scaled by the singular vectors. What does this achieve in terms of storage? The space required for our original collection of tensors X1, . . . , X๐‘ž โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ is O (๐‘ž (cid:206)๐‘‘ ๐‘˜=1 ๐‘›๐‘˜ ). After PCA, we need keep ๐‘š basis tensors of the 14 same dimension C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ and our coordinates or principal scores will also need to be stored and there are ๐‘š๐‘ž of these. So the space is O (๐‘š๐‘ž + ๐‘š (cid:206)๐‘‘ the dependence on (cid:206)๐‘‘ ๐‘›๐‘˜ is unchanged. Additionally, thereโ€™s no obvious interpretable structure ๐‘›๐‘˜ ) which may be unsatisfactory because ๐‘˜=1 ๐‘˜=1 to the basis tensors T๐‘˜ that relates to the fact the original data was multi-modal. This motivates us then to look to another approach for decomposing (and therefore saving on storage among other things) tensors. Definition 1.13 (Rank one Tensor). Given ๐‘‘-vectors x ๐‘— โˆˆ C๐‘› ๐‘— for ๐‘— โˆˆ [๐‘‘], the outer-product X = ๐‘‘ โƒ ๐‘—=1 x ๐‘— โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ has entries given the product of corresponding entries of the vectors, i.e. X๐‘–1,...,๐‘–๐‘‘ = (cid:19) x ๐‘— (cid:18) ๐‘‘ โƒ ๐‘—=1 ๐‘–1,...,๐‘–๐‘‘ = (x1)๐‘–1 (x2)๐‘–2 . . . (x๐‘‘)๐‘–๐‘‘ any ๐‘‘-mode tensor where it is possible to write it as such an outer product of ๐‘‘ vectors is a rank one tensor. Note that storing a rank one tensor can be accomplished by storing only the vector components, rather than all entries, and computing the entries (as required) using the factors. This definition in the 2-mode case is the familiar rank one matrix case, for u โˆˆ C๐‘š, v โˆˆ C๐‘ ๐ด = u โ—ฆ v = uvโˆ— then matrix ๐ด โˆˆ C๐‘šร—๐‘ is a rank one matrix. Definition 1.14 (CP decomposition). We say X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ is a CP rank-๐‘Ÿ tensor if it can be written as the sum of minimally ๐‘Ÿ rank one tensors. That is X = ๐‘Ÿ โˆ‘๏ธ โ„“=1 (cid:18) ๐‘‘ โƒ ๐‘—=1 (cid:19) a(โ„“) ๐‘— = ๐‘Ÿ โˆ‘๏ธ โ„“=1 ๐’‚ (โ„“) 1 โ—ฆ ยท ยท ยท โ—ฆ ๐’‚ (โ„“) ๐‘‘ for vectors ๐’‚ (โ„“) by ๐ด(๐‘˜) = (cid:104) ๐‘˜ โˆˆ R๐‘›๐‘˜ for ๐‘˜ โˆˆ [๐‘‘] and โ„“ โˆˆ [๐‘Ÿ]. For convenience, define factor matrices ๐ด(๐‘˜) โˆˆ R๐‘›๐‘˜ร—๐‘Ÿ for ๐‘˜ โˆˆ [๐‘‘] so that the entries of the tensor are given by ๐’‚ (1) ๐‘˜ ยท ยท ยท ๐’‚ (๐‘Ÿ) ๐‘˜ (cid:105) X๐‘–1,...,๐‘–๐‘‘ = ๐‘Ÿ โˆ‘๏ธ ๐‘‘ (cid:214) โ„“=1 ๐‘˜=1 ๐ด(๐‘˜) ๐‘–๐‘˜,โ„“. 15 At times it will be convenient to require the CP decompositionโ€™s component vectors to be normalized to have norm of one, and then weight correspondingly each rank one term with a scalar. That is, X = ๐‘Ÿ โˆ‘๏ธ โ„“=1 (cid:18) ๐‘‘ โƒ ๐‘—=1 (cid:19) ๐›ผ ๐‘— a(โ„“) ๐‘— where (cid:13) (cid:13)a(โ„“) (cid:13) ๐‘— (cid:13) (cid:13) (cid:13)2 = 1 for all ๐‘— โˆˆ [๐‘‘], โ„“ โˆˆ [๐‘Ÿ]. and ๐›ผ ๐‘— โˆˆ C Here we also include a Lemma which relates how modewise products interact with rank-1 components. Lemma 1.3. 
Suppose we have a rank one tensor X ∈ C^{n_1×···×n_d} with components x_j ∈ C^{n_j} for all j ∈ [d], and suppose U_k ∈ C^{m_k×n_k} for some k ∈ [d]. Then

(x_1 ∘ ··· ∘ x_d) ×_k U_k = x_1 ∘ ··· ∘ x_{k-1} ∘ (U_k x_k) ∘ x_{k+1} ∘ ··· ∘ x_d.

Proof. Note that the k-mode fibers of the tensor x_1 ∘ ··· ∘ x_d are all scalar multiples of the same vector, x_k. That is, the k-mode fiber indexed by (ℓ_1, ..., ℓ_{k-1}, ℓ_{k+1}, ..., ℓ_d) is

( ∏_{j≠k} (x_j)_{ℓ_j} ) x_k,

where ∏_{j≠k} (x_j)_{ℓ_j} is a scalar. The identity now follows from the definition of the mode-k product: each column of the mode-k unfolding is a scalar multiple of the same vector, and scalars commute with matrix-vector multiplication.

As a remark, many of the correspondences and equivalences involving rank in the matrix case do not extend to tensors. For example, the rank of a matrix is always equal to the dimension of the space spanned by its columns, and also to the dimension of the space spanned by its rows. However, consider the following three-mode tensor X ∈ R^{4×2×2}, given by its two frontal slices:

X_{:,:,1} = [ 1 0 ; 0 1 ; 0 0 ; 0 0 ],   X_{:,:,2} = [ 0 0 ; 0 0 ; 1 0 ; 0 1 ].

Naturally, the four mode-1 fibers span a four dimensional space, whereas the eight different mode-2 fibers span a two dimensional space. In general there is no simple correspondence between the dimensions of the spaces spanned by the different sets of fibers of a tensor.

There are other complications that arise with notions of rank in the case of higher order tensors. In the case of matrices, a rank 3 matrix has a unique, optimal rank 2 approximation, which can be obtained by truncating its SVD. In the case of tensors, it is entirely possible for a rank 3 tensor to be approximated arbitrarily closely by rank 2 tensors, as the simple example below demonstrates:

X_{:,:,1} = [ 0 1 ; 1 0 ],   X_{:,:,2} = [ 1 0 ; 0 0 ].

Now consider

A_{:,:,1} = [ ε^{-1} 1 ; 1 ε ],   A_{:,:,2} = [ 1 ε ; ε ε^2 ];   B_{:,:,1} = [ -ε^{-1} 0 ; 0 0 ],   B_{:,:,2} = [ 0 0 ; 0 0 ].

Both A and B have rank one, so their sum has at most rank two, but we can let A + B get within any distance of X by sending ε → 0; so in this instance, there is no best rank 2 approximation of the rank 3 tensor X.
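The border rank phenomenon above is easy to verify numerically. The short NumPy check below (an illustration added here, not part of the original text) builds the rank one tensors A and B for decreasing ε and confirms that the error ‖X − (A + B)‖ decays like ε, even though X has rank 3.

import numpy as np

def outer3(u, v, w):
    # rank one 3-mode tensor u ∘ v ∘ w
    return np.einsum('i,j,k->ijk', u, v, w)

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# X = e1∘e1∘e2 + e1∘e2∘e1 + e2∘e1∘e1 has CP rank 3.
X = outer3(e1, e1, e2) + outer3(e1, e2, e1) + outer3(e2, e1, e1)

for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    u = e1 + eps * e2
    A = (1.0 / eps) * outer3(u, u, u)        # rank one
    B = -(1.0 / eps) * outer3(e1, e1, e1)    # rank one
    err = np.linalg.norm(X - (A + B))        # tensor 2-norm of the difference
    print(f"eps = {eps:.0e}, ||X - (A+B)|| = {err:.2e}")
# The error is O(eps): the rank (at most) two tensors A + B approach the rank 3 tensor X.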
Indeed, even if there is no issue between so called border rank of a tensor and its exact rank, unlike the case of a matrix, there is not necessarily any straightforward relationship between the best ๐‘Ÿ โˆ’ 1 approximation to that tensor and its exact rank ๐‘Ÿ decomposition in the same way as the SVD and the truncated SVD are related in the case of matrices. 1.4.3.1 The Tucker Decomposition and Low-Rank Approximation Next, we define the Tucker decomposition of tensors, which decomposes a tensor into a core tensor that is multiplied using the modewise product by a factor matrix along each of its modes. Definition 1.15 (Tucker decomposition). Let X โˆˆ R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ . We say that X has a Tucker decom- position of rank r = (๐‘Ÿ1, . . . , ๐‘Ÿ๐‘‘) if there exists G โˆˆ R๐‘Ÿ1ร—ยทยทยทร—๐‘Ÿ๐‘‘ , and ๐‘ˆ ๐‘— โˆˆ R๐‘› ๐‘— ร—๐‘Ÿ ๐‘— with orthonormal columns for all ๐‘— โˆˆ [๐‘‘], so that X = G ร—1 ๐‘ˆ1 ร—2 ๐‘ˆ2 ร—3 ยท ยท ยท ร—๐‘‘ ๐‘ˆ๐‘‘ = โˆ‘๏ธ iโˆˆ{(๐‘–1,...,๐‘–๐‘‘) | ๐‘– ๐‘— โˆˆ[๐‘Ÿ ๐‘— ] โˆ€ ๐‘— โˆˆ[๐‘‘]} Gi ยท u(1) ๐‘–1 โ—ฆ u(2) ๐‘–2 โ—ฆ ยท ยท ยท โ—ฆ u(๐‘‘) ๐‘–๐‘‘ , (1.3) 17 where u( ๐‘—) ๐‘– ๐‘— denotes the ๐‘– ๐‘— -th column of ๐‘ˆ ๐‘— , and Gi := G๐‘–1,...,๐‘–๐‘‘ โˆˆ R. We will also use the shorthand X = [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] := G ร—1 ๐‘ˆ1 ร—2 ๐‘ˆ2 ร—3 ยท ยท ยท ร—๐‘‘ ๐‘ˆ๐‘‘ when referring to the rank-r Tucker decomposition of a given rank-r tensor X. There is an equivalent formulation to equation (1.3) using unfoldings of X and of G in terms of matrix-matrix products: ๐‘‹[ ๐‘—] = ๐‘ˆ ๐‘— ๐บ [ ๐‘—] (cid:0)๐‘ˆ๐‘‘ โŠ— ยท ยท ยท โŠ— ๐‘ˆ ๐‘—+1 โŠ— ๐‘ˆ ๐‘—โˆ’1 โŠ— ยท ยท ยท โŠ— ๐‘ˆ1 (cid:1)๐‘‡ (1.4) From this definition several important observations can be made. First, we see that every tensor must admit a Tucker decomposition, since setting the core tensor equal to the full tensor and the factors equal to the appropriately sized identity matrices would ensure the decomposition matched the full tensor exactly, and the rank would in that case be the same as the length of the modes, r = (๐‘›1, . . . , ๐‘›๐‘‘). Naturally when considering efficient ways to store and perform computations with tensors, we will be interested in the cases where the rank along at least one or more modes is less than the original tensorโ€™s side length. And just as in the case of the CP decomposition, what the minimal Tucker rank is and what its relationship is to the exact decomposition is by no means straightforward generically. Second, we can see that the decomposition has no chance of being unique; consider any non-singular matrix ๐‘Œ , and observe: ๐‘‹[ ๐‘—] = ๐‘ˆ ๐‘— ๐บ [ ๐‘—] (cid:0)๐‘ˆ๐‘‘ โŠ— ยท ยท ยท โŠ— ๐‘ˆ ๐‘—+1 โŠ— ๐‘ˆ ๐‘—โˆ’1 โŠ— ยท ยท ยท โŠ— ๐‘ˆ1 (cid:1)๐‘‡ = ๐‘ˆ ๐‘—๐‘Œ๐‘Œ โˆ’1๐บ [ ๐‘—] (cid:0)๐‘ˆ๐‘‘ โŠ— ยท ยท ยท โŠ— ๐‘ˆ ๐‘—+1 โŠ— ๐‘ˆ ๐‘—โˆ’1 โŠ— ยท ยท ยท โŠ— ๐‘ˆ1 (cid:1)๐‘‡ = (๐‘ˆ ๐‘—๐‘Œ )(๐‘Œ โˆ’1๐บ [ ๐‘—]) (cid:0)๐‘ˆ๐‘‘ โŠ— ยท ยท ยท โŠ— ๐‘ˆ ๐‘—+1 โŠ— ๐‘ˆ ๐‘—โˆ’1 โŠ— ยท ยท ยท โŠ— ๐‘ˆ1 (cid:1)๐‘‡ and for this same reason, it is often natural to insist that the factor matrices, ๐‘ˆ๐‘– have orthornormal columns, given a Tucker decomposition. We can always arrive at this convention by means of non-singular transformations in each of the modes, e.g. perform a ๐‘„๐‘… decomposition and have ๐‘… act on the core. There is a specific type of Tucker decomposition that will be of interest, and that is the Higher Order SVD (HOSVD). 
In addition to the convention of having the columns of the factor matrices 18 form an orthonormal basis, in the case of the HOSVD, we impose an additional constraint that all the (๐‘‘ โˆ’ 1)-mode sub-tensors of the core formed by fixing an index along the same mode are orthogonal. Definition 1.16. [Higher Order Singular Value Decomposition (HOSVD)]. The Tucker decomposi- tion of tensor X = [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] is a HOSVD with rank r, when for every i๐‘Ž = (:, :, . . . , ๐‘Ž, . . . , :), i๐‘ = (:, :, . . . , ๐‘, . . . , :), ๐‘Ž, ๐‘ โˆˆ [๐‘Ÿ ๐‘— ], ๐‘Ž โ‰  ๐‘, ๐‘— โˆˆ [๐‘‘] we have that โŸจGi๐‘Ž, Gi๐‘โŸฉ = 0 (1.5) and ๐‘ˆ๐‘˜ have orthornomal columns for all ๐‘˜ โˆˆ [๐‘‘]. See [15] for a thorough description of this particular type of Tucker decomposition, and many of its useful properties. Because of its straightforward way to compute and often good approximation qualities, the HOSVD is frequently used as an initialization for finding Tucker decompositions. A HOSVD of a tensor can be computed using Algorithm 1.1. Algorithm 1.1 Higher Order SVD input : X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ r = (๐‘Ÿ1, ๐‘Ÿ2, . . . , ๐‘Ÿ๐‘‘) the rank of the HOSVD output : ห†X = [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] for ๐‘– โˆˆ [๐‘‘] do # SVDs of flattened tensor Compute SVD of mode- ๐‘— unfolding ๐‘‹( ๐‘—) = ๐‘„ ๐‘— ฮฃ ๐‘—๐‘‰ โˆ— ๐‘— ๐‘ˆ ๐‘— โ† ๐‘„ (๐‘–) ๐‘— [:, : ๐‘Ÿ ๐‘— ] end # Compute Core (cid:62)๐‘‘ G โ† X ๐‘ˆโˆ— ๐‘—=1 ๐‘— โˆˆ C๐‘Ÿ1ร—ยทยทยทร—๐‘Ÿ๐‘‘ Note that for each unfolding, the full SVD has form ๐ด( ๐‘—) = ๐‘ˆ (cid:124)(cid:123)(cid:122)(cid:125) ๐‘› ๐‘— ร—๐‘› ๐‘— ฮฃ (cid:124)(cid:123)(cid:122)(cid:125) ๐‘› ๐‘— ร—(cid:206) ๐‘—โ‰ ๐‘˜ ๐‘›๐‘˜ ๐‘‰ ๐ป (cid:124)(cid:123)(cid:122)(cid:125) (cid:206) ๐‘—โ‰ ๐‘˜ ๐‘›๐‘˜ร—(cid:206) ๐‘—โ‰ ๐‘˜ ๐‘›๐‘˜ the ๐‘Ÿ ๐‘— -truncated SVD has form ๐ด( ๐‘—) โ‰ˆ ๐‘ˆ (cid:124)(cid:123)(cid:122)(cid:125) ๐‘› ๐‘— ร—๐‘Ÿ ๐‘— ฮฃ (cid:124)(cid:123)(cid:122)(cid:125) ๐‘Ÿ ๐‘— ร—๐‘Ÿ ๐‘— ๐‘‰ ๐ป (cid:124)(cid:123)(cid:122)(cid:125) ๐‘Ÿ ๐‘— ร—(cid:206) ๐‘—โ‰ ๐‘˜ ๐‘›๐‘˜ 19 This problem then is repeated for each of the ๐‘‘ different modes. This is potentially a bottleneck computationally and so can likely benefit from fast approximations to the SVD, e.g. see [48]. In order to further improve the decomposition, we can take an alternating least squares approach, successively solving for ๐‘ˆ ๐‘— while holding other modes constant and iterate on this processes. This is called Higher Order Orthogonal Iteration (HOOI), Algorithm 1.2, see [16] Algorithm 1.2 Higher Order Orthogonal Iteration input : X โˆˆ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ r = (๐‘Ÿ1, ๐‘Ÿ2, . . . , ๐‘Ÿ๐‘‘) ๐‘€ the maximum number of iterations output : [[G (0), ๐‘ˆ (0) 1 ห†X = [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] , . . . ๐‘ˆ (0) ๐‘‘ ]] โ† HOSVD(X, r) for ๐‘– โˆˆ [๐‘€] do for ๐‘— โˆˆ [๐‘‘] do # Update each factor matrix (cid:62) (cid:16) ๐‘‹ Compute ๐‘ˆ ๐‘— โ† ๐‘„ (๐‘–) ๐‘˜โ‰  ๐‘— ๐‘— [:, : ๐‘Ÿ ๐‘— ] (cid:17) โˆ—(cid:17) ( ๐‘—) (cid:16) ๐‘ˆ (๐‘–โˆ’1) ๐‘˜ = ๐‘„ (๐‘–) ๐‘— ฮฃ(๐‘–) ๐‘— ๐‘‰ (๐‘–)(cid:17) โˆ— (cid:16) ๐‘— end end # Compute Core (cid:62) G โ† ๐‘‹ ๐‘˜โ‰  ๐‘— (cid:16) ๐‘ˆ (๐‘€) ๐‘˜ (cid:17) โˆ— We also observe that there is a natural correspondence between CP and Tucker decompositions. 
Given a CP decomposition, one can form a Tucker decomposition by arranging the weights of the CP decomposition along the diagonal of the core tensor and placing the component vectors as the columns of the factor matrices, while setting all off diagonal entries in the core equal to zero. The following small result formalizes this correspondence. Lemma 1.4. If X โˆˆ C๐‘›1ร—...๐‘›๐‘‘ has Tucker rank (๐‘Ÿ1, . . . , ๐‘Ÿ๐‘‘) then it has CP rank of at most (cid:206)๐‘‘ ๐‘—=1 ๐‘Ÿ ๐‘— Proof. The tensor has an exact Tucker decomposition, so ๐‘‘(cid:63) X = G ๐‘ˆ ๐‘— ๐‘—=1 Note that we can express any tensor in the standard basis; here the standard basis for tensors is a tensor with only one non-zero entry. 20 So for example for some โ„“ โˆˆ [๐‘Ÿ1] ร— ยท ยท ยท ร— [๐‘Ÿ๐‘‘] the associated standard basis element is ๐‘‘ โƒ ๐‘—=1 eโ„“ ๐‘— where eโ„“ ๐‘— is the usual standard basis vector in C๐‘Ÿ ๐‘— . Denote I = [๐‘Ÿ1] ร— ยท ยท ยท ร— [๐‘Ÿ๐‘‘]. Thus C = Gโ„“ โˆ‘๏ธ โ„“โˆˆI (cid:19) eโ„“ ๐‘— (cid:18) ๐‘‘ โƒ ๐‘—=1 Now use this expression in the Tucker decomposition of X ๐‘‘(cid:63) X = G ๐‘ˆ ๐‘— ๐‘—=1 = = = (cid:34) โˆ‘๏ธ โ„“โˆˆI Gโ„“ (cid:18) ๐‘‘ โƒ ๐‘—=1 eโ„“ ๐‘— (cid:19) (cid:35) ๐‘‘(cid:63) ๐‘ˆ ๐‘— ๐‘—=1 Gโ„“ Gโ„“ โˆ‘๏ธ โ„“โˆˆI โˆ‘๏ธ โ„“โˆˆI (cid:19) ๐‘ˆ ๐‘— eโ„“ ๐‘— (cid:18) ๐‘‘ โƒ ๐‘—=1 ๐‘‘ โƒ ๐‘—=1 (๐‘ˆ ๐‘— )โ„“ ๐‘— where we have used Lemma 1.3, and denoted the โ„“ ๐‘— -th column of ๐‘ˆ ๐‘— as (๐‘ˆ ๐‘— )โ„“ ๐‘— . We therefore have the sum of rank one tensors. There are (cid:206)๐‘‘ ๐‘Ÿ ๐‘— possible values for โ„“ and so we have a CP ๐‘—=1 decomposition of that rank. This provides an upper bound on CP rank, since the decomposition may not be optimal. On the other hand, given a Tucker decomposition of rank-r, one can instead regard this as a rank (cid:206)๐‘‘ ๐‘–=1 ๐‘Ÿ๐‘– CP tensor, where each entry of the core is a weight to a rank-one tensor formed by the outer product of the respective columns of the factor matrices and then these are all summed up. There is generically though no reason that these correspondences would be minimal in those respective decompositions. That is, given a minimal exact rank-๐‘Ÿ CP decomposition, it does not follow that the Tucker decomposition used in Lemma 1.4 would be a minimal in terms of rank for the Tucker decomposition of that particular tensor. Given an arbitrary tensor X โˆˆ R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ of unknown Tucker rank it is often of interest to compute r โˆˆ R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ a good rank-r approximation of X. In fact, an optimal Tucker rank-r minimizer [[X]]opt 21 satisfying (cid:13) X โˆ’ [[X]]opt (cid:13) r (cid:13) (cid:13) (cid:13) (cid:13)2 = inf 1 ร—ยทยทยทร—๐‘Ÿ๐‘‘ , GโˆˆR๐‘Ÿ ๐‘ˆ ๐‘— โˆˆR๐‘› ๐‘— ร—๐‘Ÿ ๐‘— โˆฅX โˆ’ [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] โˆฅ2 (1.6) always exists (see, e.g., Theorems 10.3 and Theorem 10.8 in [22]). However, computing such an [[X]]opt r (which is not unique) is generally a challenging task that is accomplished only approximately via iterative techniques (see, e.g., [37]). As a result, in such situations one usually seeks to instead compute a quasi-optimal rank-r tensor หœX. Definition 1.17. The tensor หœX of rank r is quasi-optimal approximation of X if it satisfies (cid:13)X โˆ’ หœX(cid:13) (cid:13) (cid:13)2 โ‰ค ๐ถ inf GโˆˆR๐‘Ÿ 1 ร—ยทยทยทร—๐‘Ÿ๐‘‘ , ๐‘ˆ ๐‘— โˆˆR๐‘› ๐‘— ร—๐‘Ÿ ๐‘— โˆฅX โˆ’ [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] โˆฅ2 , (1.7) where ๐ถ โˆˆ R+ is a positive constant independent of X. 
The following lemma demonstrates that one can recover a quasi-optimal rank-r approximation of an arbitrary tensor X ∈ R^{n_1×···×n_d} by simply computing the left singular vectors of X_{[j]} for all j ∈ [d]. It is a variant of [22, Theorem 10.3]. We prove it here for the sake of completeness.

Lemma 1.5. Let X ∈ R^{n_1×···×n_d}, denote the i-th largest singular value of X_{[j]} by σ_i(X_{[j]}), and define Δ_{r,j} := ∑_{i=r+1}^{ñ_j} σ_i(X_{[j]})^2 for all j ∈ [d] and r ∈ [ñ_j := min{n_j, ∏_{k≠j} n_k}]. Fix r ∈ [ñ_1] × ··· × [ñ_d] and suppose that [[X]]^opt_r satisfies (1.6). Then,

∑_{j=1}^{d} Δ_{r_j, j} ≤ d ‖X − [[X]]^opt_r‖_2^2 .

Proof. Let P_j ∈ R^{n_j×n_j} be a rank r_j orthogonal projection matrix satisfying Δ_{r_j, j} = ‖X_{[j]} − P_j X_{[j]}‖_F^2 ≤ ‖X_{[j]} − Y‖_F^2 for all rank at most r_j matrices Y. (The Eckart-Young Theorem guarantees that we can compute such a P_j from the left singular vectors of X_{[j]}.) Noting that Y = ([[X]]^opt_r)_{[j]} is a rank at most r_j matrix by (1.4), we then have that

Δ_{r_j, j} ≤ ‖X_{[j]} − ([[X]]^opt_r)_{[j]}‖_F^2 = ‖X − [[X]]^opt_r‖_2^2   ∀ j ∈ [d].

Summing over the d values of j above now finishes the proof.

Note, in order to simplify notation, we will often assume that n_j = n and r_j = r for all j ∈ [d]. That is, our tensors will have equal side lengths and, in the Tucker case, the rank will likewise be the same in every mode, thus suppressing the need for an additional level of subscripting. This will be done when modifying the proofs and results to deal with non-equal side lengths is straightforward; a bit of simplicity is achieved without too much damage to generality. Furthermore, we will principally be concerned with real valued tensors, rather than complex valued tensors.

1.4.4 Randomized Numerical Linear Algebra Preliminaries

Concepts from randomized numerical linear algebra will feature prominently in our analysis of the embedding properties of different measurement operators.

Definition 1.18 ((ε, δ, p)-Johnson-Lindenstrauss (JL) property). Let ε > 0, δ ∈ (0, 1), and p ∈ N. A random matrix Ω ∈ R^{m×n} has the (ε, δ, p)-Johnson-Lindenstrauss (JL) property for an arbitrary set S ⊂ R^n with cardinality at most p if it satisfies

(1 − ε) ‖x‖_2^2 ≤ ‖Ωx‖_2^2 ≤ (1 + ε) ‖x‖_2^2   for all x ∈ S   (1.8)

with probability at least 1 − δ. (Formally, a distribution over m × n matrices has the JL property if a matrix selected from this distribution satisfies (1.8) with probability at least 1 − δ. For brevity, here and in similar cases below, the term "random matrix" will refer to a distribution over matrices.)

Definition 1.19 (Set difference). Let S̃ ⊂ C^n; then the set difference of S̃, denoted S̃ − S̃, is the set {x − y | x, y ∈ S̃} ⊂ C^n.

For all random matrix distributions discussed in this work, only an upper bound on the cardinality of the set S appearing in Definition 1.18 is required in order to ensure with high probability that a given realization satisfies (1.8). No other property of the set S to be embedded will be required to define, generate, or apply Ω.
Such random matrices are referred to as data oblivious, or simply oblivious. The following variant of the famous Johnson-Lindenstrauss Lemma [32] demonstrates the existence of random matrices with the oblivious JL property. First, though, we formally define sub-Gaussian random variables and the sub-Gaussian norm, along with the related notion of sub-exponential random variables.

Definition 1.20 (Sub-exponential Random Variable). We say that X ∈ R is a sub-exponential random variable if there exist β, κ > 0 such that

P[|X| ≥ t] ≤ β e^{−κt},   ∀ t > 0.

We can understand this as saying that the tail of the random variable decays exponentially.

Definition 1.21 (Sub-Gaussian Random Variable). We say that X ∈ R is a sub-Gaussian random variable if there exist β, κ > 0 such that

P[|X| ≥ t] ≤ β e^{−κt^2},   ∀ t > 0.

There are many equivalent characterizations of a random variable being sub-Gaussian; see for example Proposition 2.5.2 in [63].

Definition 1.22 (Sub-Gaussian norm). The sub-Gaussian norm of X, denoted ‖X‖_{ψ_2}, is the quantity

‖X‖_{ψ_2} := inf { t > 0 : E[exp(X^2/t^2)] ≤ 2 }.

Most of these equivalent characterizations are, crucially, bounds related to the sub-Gaussian norm; see again Proposition 2.5.2 in [63].

Theorem 1.1 (Sub-Gaussian random matrices have the JL property). Let S ⊂ R^n be an arbitrary finite subset of R^n, and let δ, ε ∈ (0, 1). Finally, let Ω ∈ R^{m×n} be a matrix with independent, mean zero, variance m^{−1}, sub-Gaussian entries, all with sub-Gaussian norm less than or equal to c ∈ R^+. Then

(1 − ε) ‖x‖_2^2 ≤ ‖Ωx‖_2^2 ≤ (1 + ε) ‖x‖_2^2

will hold simultaneously for all x ∈ S with probability at least 1 − δ, provided that

m ≥ (C/ε^2) ln(|S|/δ),   (1.9)

where C ≤ 8c(16c + 1) is a constant that depends only on the bound c on the sub-Gaussian norms.

Proof. See, e.g., Lemma 9.35 in [18].

Next, we define a similar property for an infinite, yet rank constrained, set.
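Before moving on to subspace embeddings, here is a small numerical illustration of Theorem 1.1 (a sketch added for this text, not from the original): a Gaussian matrix with mean zero, variance 1/m entries approximately preserves the norms of a finite point set once m is on the order of ε^{-2} ln|S|. The constant 8 below is an arbitrary illustrative choice, not the constant C of the theorem.

import numpy as np

rng = np.random.default_rng(3)
n, N, eps = 2000, 200, 0.2                 # ambient dimension, |S|, distortion target

# A finite point set S ⊂ R^n (here: generic Gaussian points, one per column).
S = rng.standard_normal((n, N))

# Sub-Gaussian (here Gaussian) random matrix with mean zero, variance 1/m entries.
m = int(np.ceil(8 * np.log(N) / eps**2))
Omega = rng.standard_normal((m, n)) / np.sqrt(m)

ratios = np.linalg.norm(Omega @ S, axis=0)**2 / np.linalg.norm(S, axis=0)**2
print(m, ratios.min(), ratios.max())
# With high probability every ratio lies in [1 - eps, 1 + eps], as in (1.8).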
Furthermore, since $Q$ has orthonormal columns, so that $\|Q\mathbf{y}\|_2 = \|\mathbf{y}\|_2$ holds, (1.10) is equivalent to
$$(1-\epsilon)\|\mathbf{y}\|_2^2 \leq \|\Omega Q\mathbf{y}\|_2^2 \leq (1+\epsilon)\|\mathbf{y}\|_2^2 \quad \text{for all } \mathbf{y} \in \mathbb{R}^r \tag{1.11}$$
holding in this case. Below we will use the equivalence of (1.10) and (1.11) often, regularly constructing matrices satisfying (1.10) by instead constructing matrices satisfying (1.11).

The next result shows that random matrices with the JL property also have the OSE property. It is a standard result in the compressive sensing and randomized numerical linear algebra literature (see, e.g., [5, Lemma 5.1] or [57, Lemma 10]) that can be proven using an $\epsilon$-cover of the appropriate column space.

Lemma 1.6 (Subspace embeddings from finite embeddings via a cover). Fix $\epsilon \in (0,1)$. Let $\mathcal{L}^r_Y \subset \mathbb{R}^n$ be the $r$-dimensional subspace of $\mathbb{R}^n$ spanned by an orthonormal basis $Y$, and define
$$S^r_Y := \left\{\frac{\mathbf{x}}{\|\mathbf{x}\|_2} \;\middle|\; \mathbf{x} \in \mathcal{L}^r_Y \setminus \{\mathbf{0}\}\right\} \subset \mathcal{L}^r_Y.$$
Furthermore, let $C \subset S^r_Y$ be a minimal $\left(\frac{\epsilon}{16}\right)$-cover of $S^r_Y$. Then if $\Omega \in \mathbb{C}^{m\times n}$ satisfies (1.8) with $S \leftarrow C$ and $\epsilon \leftarrow \frac{\epsilon}{2}$, it will also satisfy
$$(1-\epsilon)\|\mathbf{x}\|_2^2 \leq \|\Omega\mathbf{x}\|_2^2 \leq (1+\epsilon)\|\mathbf{x}\|_2^2 \quad \forall \mathbf{x} \in \mathcal{L}^r_Y. \tag{1.12}$$

Proof. See, e.g., Lemma 3 in [28] for this version. We discuss covering numbers in Section 1.4.4.3.

Using Lemma 1.6 one can now easily prove the following corollary of Theorem 1.1, which demonstrates the existence of random matrices with the OSE property.

Corollary 1.1. A random matrix $\Omega \in \mathbb{R}^{m\times n}$ with mean zero, variance $m^{-1}$, independent sub-Gaussian entries has the $(\epsilon,\delta,r)$-OSE property for an arbitrary rank $r$ matrix $A \in \mathbb{R}^{n\times p}$ provided that
$$m \;\geq\; \frac{C\,r}{\epsilon^2}\ln\left(\frac{C'}{\epsilon\,\delta}\right) \;\geq\; \frac{C}{\epsilon^2}\ln\left(\frac{(47/\epsilon)^r}{\delta}\right),$$
where $C' > 0$ is an absolute constant and $C > 0$ is a constant that only depends on the sub-Gaussian norms of $\Omega$'s entries.

Proof. As is done in Lemma 1.6, suppose $S$ is a minimal $\left(\frac{\epsilon}{16}\right)$-cover of the $r$-dimensional unit sphere in the column span of $A$. The cardinality of this cover is bounded by $(47/\epsilon)^r$. Apply Theorem 1.1 to the finite set $S$.

The next theorem and definition concern another important class of random, but highly structured, matrices that are known to satisfy the JL embedding property.

Theorem 1.2. Let $U \in \mathbb{C}^{n\times n}$ be a unitary matrix with entries bounded by $\max_{i,j\in[n]}|U_{i,j}| \leq \frac{K}{\sqrt{n}}$. Let $R \in \mathbb{C}^{m\times n}$ be a matrix obtained by selecting $m$ rows from the $n\times n$ identity matrix i.i.d. uniformly at random, and let $D \in \mathbb{R}^{n\times n}$ be a diagonal matrix with i.i.d. $\pm 1$ Rademacher random values on its diagonal. Then $\sqrt{\frac{n}{m}}\,RUD$ will be an $\epsilon$-JL map of any given $S \subset \mathbb{R}^n$ into $\mathbb{C}^m$ with probability at least $1 - \delta - n^{-\ln^3 n}$ provided that
$$m \geq \frac{c\,K^2}{\epsilon^2}\log\left(\frac{4|S|}{\delta}\right)\log^4 n,$$
where $c \in \mathbb{R}^+$ is an absolute constant.

Note that $D$ can be applied to a vector $\mathbf{x} \in \mathbb{R}^n$ in $\mathcal{O}(n)$ time, since doing so only involves scanning the entries of the vector and changing the signs of some of them.
The matrix $R$ can be applied to the length-$n$ vector $UD\mathbf{x}$ in $\mathcal{O}(n)$ time, since it involves scanning the entries of the vector and discarding some of them. Applying $U$, since generically it requires at least reading the input, costs at least $\mathcal{O}(n)$. Therefore the overall complexity is governed principally by $U$. If the unitary matrix $U$ admits a fast matrix-vector multiply, for example using the Fast Fourier Transform (FFT) to effect a Discrete Fourier Transform, then $U$ can be applied to $D\mathbf{x}$ in $\mathcal{O}(n\log n)$ time. In practice, the $\log^4 n$ factor in the bound on $m$ is often ignored with no change in performance.

The failure probability bound $\delta + n^{-\ln^3 n}$ is the result of union bounding over two events: $\frac{1}{\sqrt{m}}RF$ failing to have the yet-to-be-defined RIP property (see Definition 2.1) with probability at most $n^{-\ln^3 n}$ (the details of which can be found in Theorem 12.32 in [18]), and the probability of at most $\delta$ that $\sqrt{\frac{n}{m}}RFD$ fails to be an $\frac{\epsilon}{2}$-JL map (details of which can be found in Theorem 3.1 in [38]). Here, since we generally concern ourselves with $n \gg 1$, the failure probability can be made suitably small. We refer to this class of matrices as Subsampled Orthogonal Random Signs (SORS) matrices. The most frequently used type is as follows.

Definition 1.24 (Subsampled Scrambled Fourier Transform (SSFT)). Let $F$ be the unitary discrete Fourier transform matrix,
$$F_{\ell,k} = \frac{1}{\sqrt{n}}\,e^{\frac{2\pi i \ell k}{n}}.$$
If we take $U = F$, and $R$ and $D$ to be the matrices described in Theorem 1.2, then $\sqrt{\frac{n}{m}}RFD$ is a JL map, where
$$\max_{\ell,k}|F_{\ell,k}| = \frac{1}{\sqrt{n}},$$
so $K = 1$. The Fast Fourier Transform (FFT) can be used to apply $F$ to any vector $\mathbf{x} \in \mathbb{R}^n$ in $\mathcal{O}(n\log n)$ time. This is the prototypical fast JL matrix, though other choices for $U$ are naturally possible. We also refer to it simply as $RFD$ when context makes it clear.

Finally, the following lemma demonstrates that matrices satisfying (1.10) for $A$ also approximately preserve the Frobenius norm of $A$. We present its proof here as an illustration of basic notation and techniques, and will use this result later in Chapter 3.

Lemma 1.7. Suppose $\Omega \in \mathbb{R}^{m\times n}$ satisfies (1.10) for a rank $r$ matrix $A \in \mathbb{R}^{n\times p}$. Then,
$$\big|\|A\|_F^2 - \|\Omega A\|_F^2\big| \leq \epsilon\|A\|_F^2.$$

Proof. Let $\mathbf{e}_i \in \mathbb{R}^p$ denote the standard basis vector whose $i$-th entry is 1 and all others are zero (i.e., $\mathbf{e}_i$ is the $i$-th column of the $p\times p$ identity matrix $I_p$). Similarly, let $\mathbf{b}_i$ denote the $i$-th column of any given matrix $B$, and set $\tilde{A} := \Omega A$. By (1.10), we conclude that for all $i \in [p]$ we have
$$\big|\|A\mathbf{e}_i\|_2^2 - \|\Omega A\mathbf{e}_i\|_2^2\big| \leq \epsilon\|A\mathbf{e}_i\|_2^2.$$
To establish the desired result, we represent the squared Frobenius norm of a matrix as the sum of the squared $\ell_2$-norms of its columns. Doing so, we see that
$$\big|\|A\|_F^2 - \|\Omega A\|_F^2\big| = \left|\sum_{i=1}^{p}\big(\|\mathbf{a}_i\|_2^2 - \|\tilde{\mathbf{a}}_i\|_2^2\big)\right| \leq \sum_{i=1}^{p}\big|\|\mathbf{a}_i\|_2^2 - \|\tilde{\mathbf{a}}_i\|_2^2\big| = \sum_{i=1}^{p}\big|\|A\mathbf{e}_i\|_2^2 - \|(\Omega A)\mathbf{e}_i\|_2^2\big| \leq \epsilon\sum_{i=1}^{p}\|A\mathbf{e}_i\|_2^2 = \epsilon\|A\|_F^2.$$
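Returning to Definition 1.24, the following Python sketch applies an $RFD$-style map using the FFT rather than by ever forming an $n\times n$ matrix; only the sign pattern of $D$ and the sampled row indices of $R$ are stored. The sizes and test vector are illustrative assumptions.

```python
import numpy as np

# A small sketch of the RFD / SSFT map of Definition 1.24, sqrt(n/m) * R F D,
# applied via the FFT.  Only the diagonal of D and the m sampled rows of R are
# stored; sizes are illustrative.
rng = np.random.default_rng(1)
n, m = 4096, 256
d = rng.choice([-1.0, 1.0], size=n)          # diagonal of D (Rademacher signs)
rows = rng.integers(0, n, size=m)            # rows of the identity kept by R

def rfd(x):
    # unitary DFT of the sign-flipped vector, then subsample and rescale
    Fx = np.fft.fft(d * x) / np.sqrt(n)
    return np.sqrt(n / m) * Fx[rows]

x = rng.standard_normal(n)
print(np.linalg.norm(rfd(x)) ** 2 / np.linalg.norm(x) ** 2)  # close to 1
```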
We will now define a property of random matrices related to fast approximate matrix multiplication.

1.4.4.2 The Approximate Matrix Multiplication (AMM) Property

First proposed in [57, Lemma 6] (see also, e.g., [46] for other variants), the approximate matrix multiplication property will be crucial to our analysis of the embedding properties of measurement operators.

Definition 1.25 ((ε, δ)-AMM property). Let $\epsilon > 0$ and $\delta \in (0,1)$. A random matrix $\Omega \in \mathbb{R}^{m\times n}$ satisfies the $(\epsilon,\delta)$-Approximate Matrix Multiplication property for two arbitrary matrices $A \in \mathbb{R}^{p\times n}$ and $B \in \mathbb{R}^{n\times q}$ if
$$\|A\Omega^T\Omega B - AB\|_F \leq \epsilon\|A\|_F\|B\|_F \tag{1.13}$$
holds with probability at least $1-\delta$.

The following lemma can be used to construct random matrices with the AMM property from random matrices with the JL property. A slightly generalized version is proven in Appendix 3.5 for the sake of completeness.

Lemma 1.8 (The JL property provides the AMM property). Let $A \in \mathbb{R}^{p\times n}$ and $B \in \mathbb{R}^{n\times q}$. There exists a finite set $S \subset \mathbb{R}^n$ with cardinality $|S| \leq 2(p+q)^2$ (determined entirely by $A$ and $B$) such that the following holds: if a random matrix $\Omega \in \mathbb{R}^{m\times n}$ has the $(\epsilon/2, \delta, 2(p+q)^2)$-JL property for $S$, then $\Omega$ will also have the $(\epsilon,\delta)$-AMM property for $A$ and $B$.

Proof. Combine Lemma 3.12 with Remark 3.5.

Using Lemma 1.8 one can now prove the following corollary of Theorem 1.1, which demonstrates the existence of random matrices with the AMM property for any two fixed matrices.

Corollary 1.2. Fix $A \in \mathbb{R}^{p\times n}$ and $B \in \mathbb{R}^{n\times q}$. A random matrix $\Omega \in \mathbb{R}^{m\times n}$ with mean zero, variance $m^{-1}$, independent sub-Gaussian entries will have the $(\epsilon,\delta)$-Approximate Matrix Multiplication property for $A$ and $B$ provided that
$$m \geq \frac{C}{\epsilon^2}\ln\left(\frac{2(p+q)^2}{\delta}\right),$$
where $C > 0$ is a constant that only depends on the sub-Gaussian norms of $\Omega$'s entries.

Proof. Apply Theorem 1.1 to the finite set $S$ guaranteed by Lemma 1.8.
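As an informal numerical check of (1.13), the short Python sketch below compares $A\Omega^T\Omega B$ with $AB$ for a mean-zero, variance-$1/m$ Gaussian sketch; the sizes are illustrative assumptions only.

```python
import numpy as np

# Numerical illustration of the AMM property (1.13).
rng = np.random.default_rng(2)
p, n, q, m = 30, 2000, 40, 400
A = rng.standard_normal((p, n))
B = rng.standard_normal((n, q))
Omega = rng.standard_normal((m, n)) / np.sqrt(m)

err = np.linalg.norm(A @ Omega.T @ (Omega @ B) - A @ B, "fro")
rel = err / (np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro"))
print("relative AMM error:", rel)   # plays the role of epsilon in (1.13)
```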
1.4.4.3 Covering Numbers

We state some results concerning covering numbers, which are useful for showing that $\epsilon$-JL maps apply to more general infinite sets, and which are a main ingredient in several of the results referred to above, such as Lemma 1.6. We begin with the definitions and then show how to obtain covering number bounds for subsets, which is a main ingredient in the net arguments used for embedding infinite sets.

Definition 1.26 (δ-cover). Let $X$ be a normed vector space over the complex numbers and $T \subseteq \mathbb{C}^N$. A $\delta$-cover of $T$ with respect to the norm $\|\cdot\|_X$ is a subset $S \subseteq T$ such that for all $\mathbf{x} \in T$ there exists $\mathbf{y} \in S$ with $\|\mathbf{x}-\mathbf{y}\|_X < \delta$, so that
$$T \subseteq \bigcup_{\mathbf{y}\in S} B_X(\mathbf{y},\delta).$$
Note that $B_X(\mathbf{y},\delta)$ is the open ball with center $\mathbf{y} \in \mathbb{C}^N$ and radius $\delta$ with respect to the norm $\|\cdot\|_X$. Usually the space and norm will be clear from context, so we will simplify notation and write instead $B(\mathbf{y},\delta)$.

Definition 1.27 (δ-covering Number). The $\delta$-covering number of $T \subseteq \mathbb{C}^N$ with respect to $\|\cdot\|_X$, denoted $C^X_\delta(T)$, is the smallest integer such that there exists a $\delta$-cover $S \subseteq T$ with $|S| = C^X_\delta(T)$. If no such integer exists we say that $C^X_\delta(T) = \infty$.

Definition 1.28 (δ-packing). Let $T \subseteq \mathbb{C}^N$. A $\delta$-packing of $T$ with respect to the norm $\|\cdot\|_X$ is a subset $S \subseteq T$ such that for all $\mathbf{x},\mathbf{y} \in S$ with $\mathbf{x}\neq\mathbf{y}$ we have $\|\mathbf{x}-\mathbf{y}\|_X \geq \delta$, and hence $B_X(\mathbf{x},\delta/2)\cap B_X(\mathbf{y},\delta/2) = \emptyset$.

Definition 1.29 (δ-packing Number). The $\delta$-packing number of $T \subseteq \mathbb{C}^N$ with respect to $\|\cdot\|_X$, denoted $P^X_\delta(T)$, is the largest integer such that there exists a $\delta$-packing $S \subseteq T$ with $|S| = P^X_\delta(T)$. If no such integer exists we say that $P^X_\delta(T) = \infty$.

Lemma 1.9. Let $T \subseteq \mathbb{C}^N$ and $\delta \in (0,\infty)$. Then
$$P^X_{2\delta}(T) \leq C^X_\delta(T) \leq P^X_\delta(T).$$

Proof. Let $P_{2\delta} \subset T$ be a maximal $2\delta$-packing of $T$ and $C_\delta \subseteq T$ be a minimal $\delta$-cover of $T$. Each point $\mathbf{x} \in P_{2\delta}$ is closest to a different point $\mathbf{y} \in C_\delta$. To see this, suppose to the contrary that for $\mathbf{x}_1,\mathbf{x}_2 \in P_{2\delta}$ with $\mathbf{x}_1 \neq \mathbf{x}_2$ there was some point $\mathbf{y} \in C_\delta$ such that $\mathbf{x}_1,\mathbf{x}_2 \in B(\mathbf{y},\delta)$; this implies that $\mathbf{y} \in B_X(\mathbf{x}_1,\delta)\cap B_X(\mathbf{x}_2,\delta)$, which is a contradiction since the $2\delta$-packing property ensures these two balls are disjoint. So, since each point in $P_{2\delta}$ can be identified with a point in $C_\delta$, and no point of $C_\delta$ can serve two distinct points of $P_{2\delta}$, we may define an injection $f: P_{2\delta} \to C_\delta$ with $f(\mathbf{x}) = \mathbf{y}$ where $\mathbf{x} \in B(\mathbf{y},\delta)$. Since $f$ is an injection, the cardinality of $C_\delta$ must be equal to or larger than that of $P_{2\delta}$, which is the left-hand side of the desired inequality.

Next, suppose $P_\delta$ is a maximal $\delta$-packing of $T$, and suppose for eventual contradiction that there exists a point $\mathbf{y} \in T$, $\mathbf{y} \notin P_\delta$, such that $\|\mathbf{x}-\mathbf{y}\|_X \geq \delta$ for all $\mathbf{x} \in P_\delta$. This implies that $B_X(\mathbf{x},\delta/2)\cap B_X(\mathbf{y},\delta/2) = \emptyset$ for all $\mathbf{x} \in P_\delta$, and thus $P_\delta \cup \{\mathbf{y}\}$ is a $\delta$-packing of $T$. This contradicts the maximality of $P_\delta$. So, for every point $\mathbf{y} \in T$ there is an $\mathbf{x} \in P_\delta$ with $\|\mathbf{x}-\mathbf{y}\|_X < \delta$, which is to say that $P_\delta$ is a $\delta$-cover of $T$, and therefore the cardinality of $P_\delta$ is equal to or larger than the $\delta$-covering number of $T$. This is the right-hand side of the desired inequality.

Lemma 1.10. Let $T \subseteq \mathbb{R}^N$ and $\delta \in (0,\infty)$. Furthermore let $B$ denote the unit ball $B_X(\mathbf{0},1)$ in $\mathbb{R}^N$ with respect to some norm $\|\cdot\|_X$. Then
$$\left(\frac{1}{\delta}\right)^N\frac{\mathrm{Vol}(T)}{\mathrm{Vol}(B)} \;\leq\; C^X_\delta(T) \;\leq\; P^X_\delta(T) \;\leq\; \left(\frac{2}{\delta}\right)^N\frac{\mathrm{Vol}\big(T + \big(\frac{\delta}{2}\big)B\big)}{\mathrm{Vol}(B)}$$
holds, where $\mathrm{Vol}(T) = \int_T 1\,dV$ is the Lebesgue measure of $T$ in $\mathbb{R}^N$. Note that the addition of sets here is shorthand for a set difference of certain sets, i.e. $T + S = T - (-S) = \{\mathbf{t}+\mathbf{s} \mid \mathbf{t}\in T,\ \mathbf{s}\in S\}$.

Proof. Suppose $C_\delta$ is a minimal $\delta$-cover of $T$.
By the definition of a $\delta$-cover,
$$T \subseteq \bigcup_{\mathbf{y}\in C_\delta} B(\mathbf{y},\delta).$$
So, using sub-additivity of the measure, translation invariance, and scaling, we have
$$\mathrm{Vol}(T) \leq \mathrm{Vol}\left(\bigcup_{\mathbf{y}\in C_\delta} B(\mathbf{y},\delta)\right) \leq C^X_\delta(T)\,\mathrm{Vol}(B(\mathbf{y},\delta)) = C^X_\delta(T)\,\delta^N\,\mathrm{Vol}(B(\mathbf{0},1)).$$
Rearranging terms, we obtain the left-hand side of the desired inequality,
$$\left(\frac{1}{\delta}\right)^N\frac{\mathrm{Vol}(T)}{\mathrm{Vol}(B)} \leq C^X_\delta(T).$$
Now suppose $P_\delta$ is a maximal $\delta$-packing of $T$. It follows that
$$\bigcup_{\mathbf{y}\in P_\delta} B(\mathbf{y},\delta/2) \subset T + B(\mathbf{0},\delta/2).$$
Since the balls that make up the $\delta$-packing of $T$ are disjoint, their measure is additive. Again, using translation invariance and scaling, this implies
$$\mathrm{Vol}\left(\bigcup_{\mathbf{y}\in P_\delta} B(\mathbf{y},\delta/2)\right) = P^X_\delta(T)\left(\frac{\delta}{2}\right)^N\mathrm{Vol}(B(\mathbf{0},1)) \leq \mathrm{Vol}(T + B(\mathbf{0},\delta/2)),$$
which after rearranging terms matches the right-hand side of the desired inequality.

Corollary 1.3. For all norms $\|\cdot\|_X$ on $\mathbb{R}^N$,
$$\left(\frac{1}{\delta}\right)^N \leq C^X_\delta(B) \leq \left(1 + \frac{2}{\delta}\right)^N.$$

Proof. We can apply Lemma 1.10 with $T = B(\mathbf{0},1)$. Then
$$\left(\frac{1}{\delta}\right)^N\frac{\mathrm{Vol}(B)}{\mathrm{Vol}(B)} = \left(\frac{1}{\delta}\right)^N \leq C^X_\delta(T)$$
yields the first half of the corollary. Next, note that using $T = B(\mathbf{0},1)$ we see that
$$B(\mathbf{0},1) + B\left(\mathbf{0},\frac{\delta}{2}\right) \subseteq B\left(\mathbf{0},1+\frac{\delta}{2}\right).$$
Note that by scaling, we have the following for the volume calculation:
$$\mathrm{Vol}\left(B\left(\mathbf{0},1+\frac{\delta}{2}\right)\right) = \left(1+\frac{\delta}{2}\right)^N\mathrm{Vol}(B(\mathbf{0},1)).$$
Putting this into the previous lemma, we see
$$C^X_\delta(T) \leq \left(\frac{2}{\delta}\right)^N\left(1+\frac{\delta}{2}\right)^N\frac{\mathrm{Vol}(B(\mathbf{0},1))}{\mathrm{Vol}(B)} = \left(1+\frac{2}{\delta}\right)^N,$$
which completes the second inequality.

Note that if we add the assumption in the corollary that $\delta \in (0,1)$, then we can bound $\left(1+\frac{2}{\delta}\right) \leq \left(\frac{1}{\delta}+\frac{2}{\delta}\right) = \frac{3}{\delta}$, for a more concise, though less tight, bound.

Corollary 1.4. If $S \subseteq B_X(\mathbf{0},1) \subset \mathbb{R}^N$ then
$$C^X_\delta(S) \leq \left(1+\frac{2}{\delta}\right)^N.$$

Proof. By observing that
$$\mathrm{Vol}\left(S + B\left(\mathbf{0},\frac{\delta}{2}\right)\right) \leq \mathrm{Vol}\left(B_X(\mathbf{0},1) + B\left(\mathbf{0},\frac{\delta}{2}\right)\right)$$
and applying the same reasoning as in Corollary 1.3, we achieve the result.

1.5 Modewise Measurements Framework

For Chapters 2 and 3, we will be investigating specializations of the following measurement framework for tensor recovery. The overall measurement process we will consider is a collection of one or more measurement procedures that interleave reshaping of the tensor (see Definition 1.7) with modewise measurements, i.e. modewise products with linear maps applied to some number of the (reshaped) modes (see Definition 1.10). Naturally, it is formally possible to layer as many rounds of reshaping and measurement as one wishes; however, we will be concerned with measurement procedures that have at most two stages of reshaping and linear transforms applied to the fibers of the reshaped tensor.
Figure 1.3 (schematic; panels (a) First Stage, Reshaping; (b) First Stage, Measuring; (c) Second Stage, Reshaping; (d) Second Stage, Measuring) Schematic of a two-stage measurement procedure. The first reshaping $R_1$ takes a three-mode tensor and reshapes it into a two-mode tensor. The modewise measurements in the measurement phase of the first stage correspond to left and right multiplication by the measurement maps $A_1$ and $A_2^T$ to form the intermediary tensor $\mathcal{B}_1$. The second stage depicts a vectorization reshaping operation $R_2$ and a final measurement phase where the matrix $B_1$ is applied to the remaining mode to form the final set of measurements, $\mathcal{B}_2$.

For example, a generic two-stage modewise linear operator $\mathcal{L}: \mathbb{R}^{n_1\times\cdots\times n_d} \to \mathbb{R}^{m_1\times\cdots\times m_{d'}}$ could take the form
$$\mathcal{L}(\mathcal{X}) := R_2\big(R_1(\mathcal{X}) \times_1 A_1 \cdots \times_{\tilde{d}} A_{\tilde{d}}\big) \times_1 B_1 \cdots \times_{d'} B_{d'}, \tag{1.14}$$
where $R_1$ is a reshaping operator which rearranges the elements of an $\mathbb{R}^{n_1\times\cdots\times n_d}$ tensor into an $\mathbb{R}^{\tilde{n}_1\times\cdots\times\tilde{n}_{\tilde{d}}}$ tensor. After this reshaping, using the modewise product (Definition 1.10), $A_j \in \mathbb{R}^{\tilde{m}_j\times\tilde{n}_j}$ is applied to the reshaped tensor for $j \in [\tilde{d}]$. This is followed by an additional reshaping via $R_2$ into an $\mathbb{R}^{m'_1\times\cdots\times m'_{d'}}$ tensor, and a final set of $j$-mode products using the matrices $B_j \in \mathbb{R}^{m_j\times m'_j}$ for $j \in [d']$. See Figure 1.3 for a schematic depiction of a two-stage operator of this type.

Clearly, any $q$-stage modewise operator can be defined in this way. This view of measurement operators was first analyzed in [28, 30]. Immediately, we can see that multi-stage modewise compression operators, where several modes are compressed, can offer computational advantages over standard vector-based approaches, provided the measurements are sufficiently sized to retain information about the data. By vector-based approaches, we mean the direct application of classic compressive sensing results which have been applied to vectors, whereby one simply takes the tensor data and, as a first step, rearranges its elements into a vector. This approach fits the framework, where $R_1$ is a vectorization operator as in Definition 1.5. That is, $\tilde{d} = 1$, $A_1 = A \in \mathbb{R}^{m\times\prod_j n_j}$ is a Johnson-Lindenstrauss map for example, and all remaining operators $R_2$, $B_1$, etc., are the identity (see, e.g., [49]).

If instead $R_1$ is a reshaping operation that combines even a few, but not all, modes, then the resulting modewise linear transforms can be formed using significantly fewer random variables (effectively, independent random bits) and stored using less memory by avoiding the use of a single, comparatively large $m\times\prod_j n_j$ measurement matrix as in the vectorize-first approach. Indeed, as we will discuss in Chapter 2, simply forming or storing such a measurement matrix can be impractical, whereas one can achieve accurate recovery by instead using several smaller modewise maps. In addition to the storage of the maps themselves, modewise linear operators of this kind are also straightforward to parallelize and offer the possibility of choosing different maps at a more granular level to better respect the multi-modal structure of the given tensor data.
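A minimal Python sketch of a two-stage operator of the form (1.14), mirroring the schematic of Figure 1.3, is given below. It is only an illustration: the sizes are arbitrary, the maps are dense Gaussian matrices, and no claim is made that these are the ensembles analyzed later.

```python
import numpy as np

# Two-stage modewise operator (1.14): reshape a 3-mode tensor into a matrix by
# grouping its last two modes, compress both (reshaped) modes, vectorize, and
# compress once more.  All sizes are illustrative.
rng = np.random.default_rng(3)
n1, n2, n3 = 20, 15, 10          # original tensor side lengths
m1, m2, m_final = 8, 6, 24       # first-stage and second-stage sketch sizes

X = rng.standard_normal((n1, n2, n3))

# First stage: reshape R1 (group modes 2 and 3), then modewise maps A1, A2
A1 = rng.standard_normal((m1, n1)) / np.sqrt(m1)
A2 = rng.standard_normal((m2, n2 * n3)) / np.sqrt(m2)
X_reshaped = X.reshape(n1, n2 * n3)          # R1(X)
B1_intermediate = A1 @ X_reshaped @ A2.T     # R1(X) x_1 A1 x_2 A2

# Second stage: reshape R2 (vectorize), then the final map B1
B_final = rng.standard_normal((m_final, m1 * m2)) / np.sqrt(m_final)
measurements = B_final @ B1_intermediate.reshape(-1)

print(measurements.shape)   # (m_final,) -- far fewer entries than X itself
```

Note that at no point is a single $m\times n_1 n_2 n_3$ measurement matrix formed; only the small matrices $A_1$, $A_2$, and $B_{\mathrm{final}}$ are ever stored.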
CHAPTER 2
TENSOR RESTRICTED ISOMETRY PROPERTY AND TENSOR ITERATIVE HARD THRESHOLDING FOR RECOVERY

We are now equipped to describe how we applied the measurement framework described in Section 1.5 together with a recovery procedure in order to produce a method that provably yields reliable and accurate results for the tensor recovery problem. The concept of central importance to this endeavor is the Tensor Restricted Isometry Property (TRIP). To explain where this term comes from and why it is associated with the recovery of an object from measurements, we recall the term from which it originates, namely the Restricted Isometry Property (RIP). Classically, RIP is nothing more than the norm-preserving Johnson-Lindenstrauss property applied to the set of sparse vectors. That is, we have the following definition.

Definition 2.1 (RIP(ε, S) property). We say that a linear map $\mathcal{A}$, defined on a normed vector space with norm $\|\cdot\|$, has the RIP(ε, S) property if for all elements $s \in S$
$$(1-\epsilon)\|s\|^2 \leq \|\mathcal{A}(s)\|^2 \leq (1+\epsilon)\|s\|^2. \tag{2.1}$$

As mentioned, classically in compressive sensing the set under consideration is the set of $s$-sparse vectors, so that $S = K_s$, where
$$K_s = \big\{\mathbf{x}\in\mathbb{C}^N \;\big|\; \|\mathbf{x}\|_0 \leq s\big\} = \bigcup_{\substack{S\subseteq[N]\\ |S|\leq s}}\mathrm{span}\big\{\mathbf{e}_j\big\}_{j\in S},$$
though we emphasize that the definition here admits any subset of a normed vector space. Famously, in the case of $K_s$, $\ell_1$-minimization by basis pursuit was shown to be able to recover the original signal vector (exactly or approximately, depending on the situation) from a small number of measurements when the measurement operator satisfies this approximate geometry-preserving property, and moreover common distributions of random matrices provably have RIP with high probability; see, e.g., Chapter 4 of [18]. There are other well-known methods for the recovery part of the measure-and-recover compressive sensing problem, such as orthogonal matching pursuit and iterative hard thresholding (IHT). Extending these results to low-rank matrices was a natural direction for this idea, and was the subject of celebrated research in the field of compressive sensing; see [14] for an overview. In this chapter, however, we shall be concerned with the extension of low-rank recovery to higher-order tensors.

2.1 Two-Stage Modewise Measurements and Tensor Restricted Isometry Property

Prior research, such as [21, 20], has extended the Tensor IHT method (TIHT) to low-rank CP and Tucker tensors. TIHT is an iterative method consisting of alternating updates: the first sub-step applies the adjoint of the measurement operator to the residual of the last iterate, and the second sub-step thresholds that update back to the relevant constraint set, the set of low-rank tensors. That is, we take a gradient step first, and then project back onto the proper constraint set for our problem. This is recognizable as a direct extension of IHT to the case of higher-order tensors. That is, we have the update rule
$$\mathcal{Y}^j = \mathcal{X}^j + \mu^j\,\mathcal{L}^*\big(\mathcal{L}(\mathcal{X}) - \mathcal{L}(\mathcal{X}^j)\big), \qquad \mathcal{X}^{j+1} = H_{\mathbf{r}}\big(\mathcal{Y}^j\big), \tag{2.2}$$
where $j$ denotes the iteration, $\mathcal{L}$ and $\mathcal{L}^*$ denote the measurement operator and its adjoint, $\mathcal{X}^0$ is the initialization (randomly generated, for example), $H_{\mathbf{r}}$ is the thresholding operator that takes the tensor and finds a Tucker decomposition of the specified rank approximating the iterate $\mathcal{Y}^j$, and finally $\mu^j$ is a step-size parameter.
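The update rule (2.2) can be made concrete with a short sketch. The Python code below is only an illustration under simplifying assumptions: the measurement operator is a dense Gaussian matrix acting on the vectorized tensor (so its adjoint is the transpose), the thresholding operator $H_{\mathbf{r}}$ is realized by a truncated HOSVD-style projection as discussed around Lemma 1.5, and the sizes, step size, and iteration count are arbitrary choices rather than anything prescribed by the analysis in this chapter.

```python
import numpy as np

rng = np.random.default_rng(4)
dims, ranks, m, mu, iters = (15, 15, 15), (2, 2, 2), 600, 1.0, 100

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_threshold(T, ranks):
    # project each mode onto its top-r_j left singular vectors (truncated HOSVD)
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        P = U[:, :r] @ U[:, :r].T
        T = np.moveaxis(np.tensordot(P, T, axes=([1], [mode])), 0, mode)
    return T

# ground-truth low-rank tensor and its compressive measurements y = L(X)
G = rng.standard_normal(ranks)
U = [rng.standard_normal((d, r)) for d, r in zip(dims, ranks)]
X = np.einsum("abc,ia,jb,kc->ijk", G, *U)
L = rng.standard_normal((m, np.prod(dims))) / np.sqrt(m)
y = L @ X.reshape(-1)

# TIHT iterations as in (2.2)
Xj = np.zeros(dims)
for _ in range(iters):
    residual = y - L @ Xj.reshape(-1)
    Yj = Xj + mu * (L.T @ residual).reshape(dims)   # gradient step with L*
    Xj = hosvd_threshold(Yj, ranks)                  # threshold back to rank r

print("relative error:", np.linalg.norm(Xj - X) / np.linalg.norm(X))
```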
The optimality of the thresholding operation is something we will have cause to comment on later in this chapter. The following theorem is one of the main results of [56]. It implies that accurate reconstruction of a tensor $\mathcal{X}$ is possible via TIHT when the measurement operator satisfies the yet-to-be-defined TRIP(δ, 3r) property for a sufficiently small $\delta$ and a sufficient number of iterations, with a reasonable initialization, since the update rule is contractive.

Theorem 2.1 ([56], Theorem 1). Let $\mathcal{X} \in \mathbb{R}^{n_1\times\ldots\times n_d}$, let $0 < a < 1$, and let $\mathcal{L}: \mathbb{R}^{n_1\times\ldots\times n_d} \to \mathbb{R}^m$ satisfy TRIP(δ, 3r) with $\delta < a/4$. Assume that $\mathbf{y} = \mathcal{L}(\mathcal{X}) + \mathbf{e}$, where $\mathbf{e} \in \mathbb{R}^m$ is an arbitrary noise vector, and let $\mathcal{X}^j$ and $\mathcal{Y}^j$ be defined as in (2.2). Assume that
$$\|\mathcal{Y}^j - \mathcal{X}^{j+1}\|_2 \leq (1+\xi_a)\|\mathcal{Y}^j - \mathcal{X}\|_2, \quad \text{where } \xi_a = \frac{a^2}{17\big(1+\sqrt{1+\delta}\big)\|\mathcal{L}\|_{2\to 2}}. \tag{2.3}$$
Then
$$\|\mathcal{X}^{j+1} - \mathcal{X}\|_2 \leq a^j\|\mathcal{X}^0 - \mathcal{X}\|_2 + \frac{c_a}{1-a}\|\mathbf{e}\|_2,$$
where $c_a = 2\sqrt{1+\delta} + \sqrt{4\xi_a + 2\xi_a^2}\,\|\mathcal{L}\|_{2\to 2}$.

The assumption (2.3) in Theorem 2.1 concerns the quality of the thresholding operation in (2.2). This assumption is unfortunately stronger than the quasi-optimality guarantee that the truncated Higher Order SVD computed via Algorithm 1.1 is able to satisfy, as shown in Lemma 1.5. However, the lemma is an upper bound, and in practice we have observed that the truncated HOSVD procedure does produce estimates that are very good in some cases; though naturally, in real-world situations it would be impossible to assess this quantitatively without finding a truly optimal rank-$\mathbf{r}$ approximation, which is difficult. Some other choice of thresholding operation, including refinements using iterations of HOOI or some other method, can likewise be employed in practice to shore up one's confidence that the requirement is met. Practically speaking, it is to some degree a matter of how much computational effort is reasonable to budget for the task.

We note that in [56], the authors show that it is possible to draw maps from random distributions which will satisfy the desired TRIP property with high probability. Unfortunately, these maps require first vectorizing the input tensor into an $n^d$-dimensional vector and then multiplying by an $m\times n^d$ matrix, as we have mentioned earlier. Such a map requires $m$ times more memory than the original tensor and thus can be unsuitable in the big-data scenario; indeed, as we shall discuss in our experiments, we were able to recover large tensors with better accuracy using a modewise approach where a vectorized approach failed given the same constraints on memory.

With Theorem 2.1 in mind, we can now move to specify the class of measurement operators $\mathcal{L}$ we wish to use in this chapter. Two additional preliminaries await. The first is the elaboration of the RIP definition alluded to above, so that it includes higher-order tensors with a low-rank decomposition.

Definition 2.2 (TRIP(δ, r) property). We say that a linear map $\mathcal{A}$ has the TRIP(δ, r) property if for all $\mathcal{X}$ with HOSVD rank at most $\mathbf{r}$ we have
$$(1-\delta)\|\mathcal{X}\|_2^2 \leq \|\mathcal{A}(\mathcal{X})\|_2^2 \leq (1+\delta)\|\mathcal{X}\|_2^2. \tag{2.4}$$

Our second remaining piece is to establish how the HOSVD of a tensor is related to the HOSVD of its reshaping. Let $\kappa$ be an integer which divides $d$, the number of modes, and let $d' := d/\kappa$.
Consider the reshaping operator
$$R_1: \bigotimes_{i=1}^{d}\mathbb{R}^n \to \bigotimes_{i=1}^{d'}\mathbb{R}^{n^\kappa}$$
that combines $\kappa$-sized sets of consecutive modes of a tensor into one mode (see Definition 1.7). For example, if the tensor has four modes and $\kappa = 2$, then the reshaping will pair the first and second modes together and the third and fourth together, and thus reshape the four-mode tensor into a matrix of size $n_1n_2\times n_3n_4$. Any procedure to group the modes would be valid, so there is no loss of generality in assuming that consecutive modes are grouped. Formally, $R_1$ is defined to be the unique linear operator that acts on rank-one tensors as
$$R_1\left(\bigcirc_{i=1}^{d}\mathbf{x}^{(i)}\right) := \bigcirc_{i=1}^{d'}\left(\bigotimes_{\ell=1+\kappa(i-1)}^{\kappa i}\mathbf{x}^{(\ell)}\right) =: \bigcirc_{i=1}^{d'}\mathring{\mathbf{x}}^{(i)},$$
where $\bigcirc$ denotes the outer product and, recall, $\otimes$ denotes the Kronecker product (see Definition 1.11).

We observe that if a tensor $\mathcal{X}$ has a Tucker decomposition (see (1.3)), then its reshaping $\mathring{\mathcal{X}} := R_1(\mathcal{X})$ is the $d'$-mode tensor $\mathring{\mathcal{X}} \in \bigotimes_{i=1}^{d'}\mathbb{R}^{n^\kappa}$ with HOSVD rank at most $\mathbf{r}' := (r^\kappa,\ldots,r^\kappa)$ given by
$$\mathring{\mathcal{X}} = \sum_{j_1=1}^{r^\kappa}\cdots\sum_{j_{d'}=1}^{r^\kappa}\mathring{\mathcal{G}}(j_1,\ldots,j_{d'})\bigcirc_{\ell=1}^{d'}\mathring{\mathbf{u}}^{(\ell)}_{j_\ell}, \tag{2.5}$$
where the new component vectors $\mathring{\mathbf{u}}^{(\ell)}_{j_\ell}$ are obtained by taking Kronecker products of the appropriate $\mathbf{u}^{(i)}_{k_i}$, and where $\mathring{\mathcal{G}} \in \mathbb{R}^{r^\kappa\times\cdots\times r^\kappa}$ is a reshaped version of $\mathcal{G}$ from (1.3). Since each of the $\{\mathbf{u}^{(i)}_{k_i}\}_{k_i=1}^{r}$ was an orthonormal basis for the space spanned by $U_i$, it follows that $\{\mathring{\mathbf{u}}^{(i)}_{j_i}\}_{j_i=1}^{r^\kappa}$ is an orthonormal basis for the space $\mathrm{span}\big(\{\mathring{\mathbf{u}}^{(i)}_{j_i}\}_{j_i=1}^{r^\kappa}\big)$. That is, the Kronecker product of orthonormal vectors is itself a set of orthonormal vectors in its respective vector space.

Let $A_i$ be an $m\times n^\kappa$ matrix for $i \in [d']$, and denote by $\mathcal{A}: \mathbb{R}^{n^\kappa\times\ldots\times n^\kappa} \to \mathbb{R}^{m\times\ldots\times m}$ the linear map which acts modewise on $d'$-mode tensors by way of the modewise products
$$\mathcal{A}(\mathcal{Y}) = \mathcal{Y}\times_1 A_1\times_2\cdots\times_{d'} A_{d'}. \tag{2.6}$$
Let $\mathcal{X}$ be a $d$-mode tensor with HOSVD decomposition given by (1.16). By Lemma 1.1 and (2.5), we have that
$$\mathcal{A}(R_1(\mathcal{X})) = \mathcal{A}(\mathring{\mathcal{X}}) = \sum_{j_1=1}^{r^\kappa}\cdots\sum_{j_{d'}=1}^{r^\kappa}\mathring{\mathcal{G}}(j_1,\ldots,j_{d'})\,\big(A_1\mathring{\mathbf{u}}^{(1)}_{j_1}\big)\bigcirc\cdots\bigcirc\big(A_{d'}\mathring{\mathbf{u}}^{(d')}_{j_{d'}}\big). \tag{2.7}$$
Next, we define the second stage of reshaping and modewise operations, similar to what was discussed in the example in Section 1.5. Let $R_2$ denote the reshaping operation
$$R_2: \bigotimes_{i=1}^{d'}\mathbb{R}^m \to \mathbb{R}^{m^{d'}}, \tag{2.8}$$
that is, the reshaping operation which vectorizes the tensor. Next, let $B$ be an $m_{2\mathrm{nd}}\times m^{d'}$ linear map, and denote
$$\mathcal{A}_{2nd}(\mathcal{X}) := B\cdot R_2\big(\mathcal{A}(R_1(\mathcal{X}))\big). \tag{2.9}$$
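As a quick numerical sanity check on the observation in (2.5) — that grouping pairs of modes of a Tucker rank-$(r,\ldots,r)$ four-mode tensor yields a matrix of rank at most $r^\kappa$ — the following Python snippet can be used; the sizes are illustrative and the snippet is not part of the original development.

```python
import numpy as np

# Reshaping a 4-mode tensor of Tucker rank (r, r, r, r) by grouping pairs of
# modes (kappa = 2) yields a matrix of rank at most r^kappa = r^2, cf. (2.5).
rng = np.random.default_rng(5)
n, r, kappa = 12, 3, 2

G = rng.standard_normal((r, r, r, r))
U = [rng.standard_normal((n, r)) for _ in range(4)]
X = np.einsum("abcd,ia,jb,kc,ld->ijkl", G, *U)     # Tucker-form tensor

X_reshaped = X.reshape(n * n, n * n)               # R1: group modes (1,2) and (3,4)
print("matrix rank of reshaping:", np.linalg.matrix_rank(X_reshaped),
      "<=", r ** kappa)
```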
The main theoretical result that this chapter relies on shows that $\mathcal{A}_{2nd}$ satisfies the TRIP property. This relies on showing that the $A_i$ satisfy a restricted isometry property on the set $S_{1,2}$, to be defined below, and on how the gestalt of the measurement operator inherits its properties from its component parts. We will require two more definitions.

Definition 2.3 (The set $S_{1,2}$). Consider the set of vectors in $\mathbb{R}^{n^\kappa}$
$$S_1 := \left\{\mathring{\mathbf{u}} \;\middle|\; \mathring{\mathbf{u}} = \bigotimes_{i=1}^{\kappa}\mathbf{u}^{(i)},\ \mathbf{u}^{(i)}\in\mathbb{S}^{n-1}\right\}, \tag{2.10}$$
and let
$$S_2 := \left\{\frac{\mathbf{x}+\mathbf{y}}{\|\mathbf{x}+\mathbf{y}\|_2} \;\middle|\; \mathbf{x},\mathbf{y}\in S_1 \text{ s.t. } \langle\mathbf{x},\mathbf{y}\rangle = 0\right\}.$$
Denote $S_{1,2} := S_1\cup S_2$, and observe that $S_{1,2}\subseteq\mathbb{S}^{n^\kappa-1}$.

Definition 2.4 (Nearly orthogonal tensors $\mathcal{B}_{R,\mu,\theta,\mathbf{r}}$). Let $\mathcal{B}_{R,\mu,\theta,\mathbf{r}}$ be the set of $d$-mode tensors $\mathcal{X}\in\mathbb{R}^{n\times\ldots\times n}$ that may be written in Tucker form (1.15) such that

(a) $\|\mathbf{u}^{(i)}_{k_i}\|_2 \leq R$ for all $i$ and $k_i$,

(b) $|\langle\mathbf{u}^{(i)}_{k_i},\mathbf{u}^{(i)}_{k'_i}\rangle| \leq \mu$ for all $k_i\neq k'_i$,

(c) the core tensor $\mathcal{G}$ satisfies $\|\mathcal{G}\|_2 = 1$,

(d) $\mathcal{G}$ has orthogonal subtensors in the sense that (1.5) holds for all $1\leq i\leq d$,

(e) $\|\mathcal{X}\|_2 \geq \theta$.

Theorem 2.2. Let $r\geq 2$ and let $\mathbf{r} = (r,r,\ldots,r)\in\mathbb{R}^d$. Suppose that $\mathcal{A}$ and $\mathcal{A}_{2nd}$ are defined as in (2.6) and (2.9). Let $d' = d/\kappa$ and assume that $A_i$ satisfies the RIP$(\epsilon, S_{1,2})$ property for all $i = 1,\ldots,d'$, where $\delta = 12d'r^d\epsilon < 1$. Assume further that the second-stage matrix $B$ satisfies the RIP$(\delta/3, \mathcal{B}_{1+\epsilon,\epsilon,1-\delta/3,\mathbf{r}'})$ property, where $\mathbf{r}' = (r^\kappa,\ldots,r^\kappa)\in\mathbb{R}^{d'}$. Then $\mathcal{A}_{2nd}$ will satisfy the TRIP(δ, r) property, i.e.,
$$(1-\delta)\|\mathcal{X}\|_2^2 \leq \|\mathcal{A}_{2nd}(\mathcal{X})\|_2^2 \leq (1+\delta)\|\mathcal{X}\|_2^2$$
for all $\mathcal{X}$ with HOSVD rank at most $\mathbf{r}$.

The proof of this result depends on several intermediary results, so we delay its presentation to Section 2.4 of this chapter. In this section we instead wish to focus on the following corollary, and on a discussion with critiques based on empirical findings and several observations we made in the course of its investigation.

Corollary 2.1. Assume the operator $\mathcal{L}$ is defined in one of the following ways:

(a) $\mathcal{L} = \mathrm{vec}\circ\mathcal{A}\circ R_1$, where $\mathcal{A}$ is defined as per (2.6), the matrices $A_i$ satisfy the RIP$(\epsilon, S_{1,2})$ property, and $\delta = 4d'(3r)^d\epsilon < a/4$;

(b) $\mathcal{L} = \mathcal{A}_{2nd}$ defined as in (2.9), its component matrices $A_i$ satisfy the RIP$(\epsilon, S_{1,2})$ property with $\delta = 12d'^2(3r)^d\epsilon$, and the final matrix $B$ satisfies the RIP$(\delta/3, \mathcal{B}_{1+\epsilon,\epsilon,1-\delta/3,\mathbf{r}'})$ property.

Consider the recovery problem from the noisy measurements $\mathbf{y} = \mathcal{L}(\mathcal{X}) + \mathbf{e}$, where $\mathbf{e}\in\mathbb{R}^m$ is an arbitrary noise vector. Let $0 < a < 1$, let $\mathcal{X}^j$ and $\mathcal{Y}^j$ be defined as in (2.2), and assume that (2.3) holds. Then
$$\|\mathcal{X}^{j+1} - \mathcal{X}\|_2 \leq a^j\|\mathcal{X}^0 - \mathcal{X}\|_2 + \frac{c_a}{1-a}\|\mathbf{e}\|_2,$$
where $c_a = 2\sqrt{1+\delta} + \sqrt{4\xi_a + 2\xi_a^2}\,\|\mathcal{A}\|_{2\to 2}$.

Corollary 2.1 encapsulates the theoretical basis for why we can expect a two-stage modewise measurement operator like $\mathcal{A}_{2nd}$, whose constituent measurement maps satisfy the RIP properties, to yield a contractive iterative method in the presence of some noise. A particular choice of measurement map, such as i.i.d. sub-Gaussian matrices, then produces row bounds, provided we can bound the size of covers of the sets $S_{1,2}$ and $\mathcal{B}_{R,\mu,\theta,\mathbf{r}}$. The intermediary results that describe how the properties of the various parts of the measurement operator relate are found below in Section 2.4.

Remark 2.1. As we have alluded to, one of the important advantages of our method over a vectorization-based approach is that the maps $\mathcal{A}$ and $\mathcal{A}_{2nd}$ require less memory to store than those considered in [56].
In particular, $\mathcal{A}$ requires $d'mn^\kappa$ entries to be stored in order to sketch to a dimension of $m^{d'}$, and $\mathcal{A}_{2nd}$ requires $d'mn^\kappa + m^{d'}m_{2\mathrm{nd}}$ entries in order to sketch to a dimension of $m_{2\mathrm{nd}}$. By contrast, the map considered in [56] requires $n^d m$ entries to reach a dimension of $m$. To make this more concrete, we consider the case where $d = 4$, $n = 40$, $\kappa = 2$, the final target dimension is 10,000, and our maps are dense matrices with Gaussian entries. In this case, $\mathcal{A}_{2nd}$, with intermediate dimensions of $m_1 = m_2 = 250$, would require $40^2\times 250\times 2 + 250^2\times 10{,}000 = 625{,}800{,}000$ random entries, with a storage cost of about 2.5 gigabytes assuming 32-bit floating point arithmetic and the SI meaning of the prefix giga as $10^9$ bytes. The vectorization-based approach requires $40^4\times 10{,}000 = 25{,}600{,}000{,}000$ random entries, at a storage cost of about 102.4 gigabytes. In Section 2.2, we will show that under these settings tensor recovery experiments using $\mathcal{A}_{2nd}$ have identical recovery reliability in both the Gaussian and SORS cases when compared to those that use vectorization-based compression, despite the 40-times-smaller memory requirement.

2.2 Experiments

In this section, we show empirically that TIHT, as given in (2.2), can be used with modewise measurement maps to successfully reconstruct low-rank tensors. In our experiments we consider randomly generated four-mode tensors in $\mathbb{R}^{n\times n\times n\times n}$ for $n = 40$ and $n = 96$. In the case where $n = 40$, we utilize both modewise Gaussian and $RFD$ measurements (see Definition 1.24). In the case where $n = 96$ we only consider $RFD$ measurements, since generating and storing the dense Gaussian matrices takes up too much memory.

We compare our two-step modewise approach to a vectorization-based method like the one described in [56]. We consider the percentage of successfully recovered tensors from batches of 100 randomly generated low-rank tensors, as well as the average number of iterations used for recovery on the successful runs. In the case where $n = 40$, we consider the algorithm to have successfully recovered the tensor if the relative error falls below 1% in at most 1000 iterations. Similarly, in the case where $n = 96$, we consider the algorithm to have successfully recovered the tensor if the relative error falls below 5% in at most 1000 iterations. Compression, defined here as the ratio of the total number of entries in the measurements to the total number of entries in the true tensor, ranges from about 0.04% to 0.4% depending on the choice of final sketching dimension. In all instances, we initialize the algorithm using a randomly generated low-rank tensor.

Figure 2.1 Fraction of successfully recovered random tensors out of a random sample of 100 tensors in $\mathbb{R}^{40\times40\times40\times40}$ with various intermediate dimensions. A run is considered successful if relative error is below 1% in at most 1000 iterations.

In our experiments, we apply the map from (2.9) with $\kappa = 2$. That is, we reshape a four-mode tensor whose modes are all of length $n$ into an $n^2\times n^2$ matrix, perform modewise measurements reducing each of the two reshaped modes to $m$, the choice of intermediate dimension, and then, in the second stage, vectorize that result and compress it further to $m_{2\mathrm{nd}}$, the final target dimension.
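The storage comparison quoted in Remark 2.1 can be reproduced with a few lines of arithmetic; the variable names below are ours, and the byte counts assume 32-bit floats with giga taken as $10^9$ bytes, as in the remark.

```python
# Checking the entry counts and storage costs quoted in Remark 2.1.
d, n, kappa, m1, m_final = 4, 40, 2, 250, 10_000
d_prime = d // kappa

modewise_entries = d_prime * m1 * n**kappa + m1**d_prime * m_final
vectorized_entries = n**d * m_final

print(modewise_entries, "entries ->", 4 * modewise_entries / 1e9, "GB")      # 625,800,000 -> ~2.5 GB
print(vectorized_entries, "entries ->", 4 * vectorized_entries / 1e9, "GB")  # 25,600,000,000 -> ~102.4 GB
```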
In our experiments, we consider a variety of intermediate dimensions to demonstrate the stability of the advantage of the modewise measurements over the vectorized ones. For a fair comparison, we set the final embedding dimension of our two-step process, $m_{2\mathrm{nd}}$, equal to the final embedding dimension of the vectorized method, $m_{\mathrm{final}}$. For a unified presentation, we will refer to the final dimension as $m_0$ in either case, modewise or vectorized.

Figure 2.2 Average number of iterations until convergence among the successful runs out of a random sample of 100 tensors in $\mathbb{R}^{40\times40\times40\times40}$ with various intermediate dimensions. A run is considered successful if relative error is below 1% in at most 1000 iterations.

As noted in Remark 2.1, our proposed two-step method offers significant storage savings compared to the vectorization-based approach when looking at the size of the measurement maps involved. For example, when we use Gaussian measurements with $m_0 = 10{,}000$ and $n = 40$, the vectorized computations had to be carried out using four NVIDIA V100 GPUs in parallel, whereas the two-step method easily fits in the memory of one GPU. In the case where $n = 96$, generating, storing, and applying Gaussian random matrices is impractical. For instance, in the scenario we consider in Figure 2.3, where $n = 96$ and $m_0 = 65{,}536$, we would need more than 22.25 terabytes to store the Gaussian map required for the vectorized approach. A two-step approach would require about 4.4 terabytes to store the measurement matrices.

We also note that we may apply $RFD$ measurements more quickly to larger tensors than Gaussian measurements, since the FFT enables $RFD$ measurement matrices to be applied efficiently to the tensor without the need to explicitly form all of the measurement maps. In particular, we need only store the sign changes which form the diagonal of the matrix $D$, and the choice of which rows are sampled from $F$ to form the matrix $RF$ (see Definition 1.24). For this reason, the same size of problem requires about 67 megabytes of storage for the two-step method and 340 megabytes for the vectorized approach when using $RFD$ and an appropriately implemented FFT. It is for these reasons that we restricted ourselves to $RFD$ measurements in the $n = 96$ case for both the vectorized and two-step approaches.

Figure 2.3 Top row shows recovery reliability and convergence speed for trials consisting of 100 samples of random tensors in $\mathbb{R}^{96\times96\times96\times96}$ at various ranks. A run is successful if relative error reaches below 5% within 1000 iterations. Final target dimensions are 32,768 and 65,536; for the two-step method the intermediate dimensions are 2048 and 4096, respectively. Bottom row is average relative error over 100 samples after 500 iterations.

As shown in Figures 2.1 and 2.2, the two-step approach, when compared to the vectorized approach with the same choice of final sketching dimension, shows reliable recovery and a comparable convergence rate with both Gaussian and $RFD$ measurements. Indeed, for some choices of intermediate and final sketching dimensions, modewise measurements empirically recover low-rank tensors more reliably than vectorized measurements (see Figure 2.1, bottom row). We show that these advantages do not result in the need for a substantially increased number of iterations in order to achieve our convergence criteria.
Across all considered scenarios, the average number of iterations needed to meet the convergence criteria can be bounded by two to eight times the number needed by the corresponding vectorized approach, depending on the choice of intermediate sketching dimension $m$. Interestingly, for some ranges of the intermediate and final target dimension $m_0$, and in the case of $RFD$ measurements in the rank $(5,5,5,5)$ instance, fewer iterations are needed (see Figure 2.2, bottom right). Thus, modewise measurements are an effective, memory-efficient method of dimension reduction, and the choice of intermediate sketching dimension allows us to further balance trade-offs in terms of convergence speed and overall memory requirements, given a particular size and rank.

In Figure 2.3, we investigate the performance of the algorithms as the rank of the tensor increases. A larger tensor, with $n = 96$, enables us to consider a wider range of $\mathbf{r}$'s that are reasonably considered to be low-rank. We maintain the vectorized versus two-step comparison, and also consider two different final sketching dimensions, $m_0 = 32{,}768$ and $65{,}536$. Due to the larger problem size and ranks, we scaled the convergence criterion to be 5% relative error in at most 1000 iterations for the comparison of recovery reliability and convergence speed. We observed a phase change in the performance of the algorithms as the rank of the tensor increases for a fixed final sketching dimension. In particular, for $m_0 = 32{,}768$ the performance of the vectorized and two-step approaches empirically degrades significantly at $r = 6$ and $7$, respectively, for this convergence criterion. When doubling the sketching dimension to $m_0 = 65{,}536$, we see empirically that the drop in performance occurs at $r = 8$. The bottom row of Figure 2.3 shows the shift in terms of average relative error for a fixed number of iterations. For the larger ranks and smaller sketching dimensions we observe that stagnation at non-global optima appears more likely, and the runtimes required for acceptable recovery can become impractical. Both the vectorized and two-step approaches have this feature; however, for a fixed final sketching dimension, for some ranks near the threshold the two-step method performs incrementally better.

2.3 Discussion

As we have outlined in the previous section, both theoretically and empirically, in terms of performance and economizing storage, using a two-stage measurement operator rather than a one-stage, vectorized-only operator has significant advantages. There are a few shortcomings we wish to note, however, that came about as we developed the method. To understand the first one, consider Corollary 2.2, which states that if the modewise transforms applied are assumed to be random matrices with i.i.d. sub-Gaussian entries with at least as many rows as in (2.11) and (2.12) for the first and second stage respectively, then $\mathcal{A}_{2nd}$ will have the desired TRIP(δ, r) property with high probability. That is, this is a result which shows that a certain class of random matrix will likely satisfy the TRIP, and thus will be useful in yielding measurements, provided its sketching dimension satisfies the following bounds.

Corollary 2.2. Let $r \geq 2$ and let $\mathbf{r} = (r,r,\ldots,r)\in\mathbb{R}^d$. Suppose that $\mathcal{A}$ and $\mathcal{A}_{2nd}$ are defined as in (2.6) and (2.9), and that all of the $A_i\in\mathbb{R}^{m\times n^\kappa}$ have i.i.d. $\frac{1}{m}$-sub-Gaussian entries for all $i = 1,\ldots,d'$, where $d' = d/\kappa$, and suppose that $0 < \eta, \delta < 1$.
Let
$$m \geq C\delta^{-2}r^{2d}\max\left\{\frac{nd^2\ln(\kappa)}{\kappa},\ \frac{d^2}{\kappa^2}\ln\left(\frac{2d}{\kappa\eta}\right)\right\}, \tag{2.11}$$
and let the final-stage matrix $B\in\mathbb{R}^{m_{2\mathrm{nd}}\times m^{d'}}$ be a $\frac{1}{m_{2\mathrm{nd}}}$-sub-Gaussian matrix with i.i.d. entries with
$$m_{2\mathrm{nd}} \geq C\delta^{-2}\max\left\{\left(r^d + \frac{dmr^\kappa}{\kappa}\right)\ln\left(\frac{d}{\kappa}+1\right) + \frac{dmr^\kappa}{\kappa}\ln\left(1+\delta r^d\right) + \frac{d^2mr^\kappa\delta}{\kappa^2},\ \ln\left(\frac{2}{\eta}\right)\right\}. \tag{2.12}$$
Then $\mathcal{A}_{2nd}$ satisfies the TRIP(δ, r) property, i.e.,
$$(1-\delta)\|\mathcal{X}\|_2^2 \leq \|\mathcal{A}_{2nd}(\mathcal{X})\|_2^2 \leq (1+\delta)\|\mathcal{X}\|_2^2$$
for all $\mathcal{X}$ with HOSVD rank at most $\mathbf{r}$, with probability at least $1-\eta$.

Observe that if $\kappa = 1$, then (2.11) implies that $m > n$, so the sketching dimension needs to be larger than the original side length to ensure the error guarantee, and thus no compression is accomplished. In other words, the theory we have stated here is silent about the case where we compress using modewise products without first combining at least some of the modes. Furthermore, numerical investigations lent empirical evidence that this may not merely have been an artifact of the analysis, or a lack of tightness in the row bounds for this particular class of measurement operator. This was a disappointing state of affairs, because it is quite natural to wish to apply no reshaping initially, and instead compress along all modes using modewise products before employing subsequent stages of reshaping and modewise linear transforms. Moreover, the necessity of having $\kappa$ be at least two means that there may not be any natural or intuitive way to first reshape a tensor with, say, a prime number of modes; in that case some modes would have to be grouped and treated differently than others, which makes the choice of the operator data dependent in a way we should like to avoid.

The next two observations stem from the use of TIHT as the recovery procedure in the overall method. In the first sub-step of TIHT (2.2), we must calculate and form the intermediary tensor $\mathcal{Y}^j$, which has the same dimensions as our original signal $\mathcal{X}$. Furthermore, $\mathcal{Y}^j$ is emphatically not in a factorized form, nor would we expect it to be exactly low rank. This iterate is a full, un-factored tensor that we have to store and operate on, and in practical terms it is the same size as the uncompressed data $\mathcal{X}$. In turn, in the next sub-step we must find a factorization which approximates $\mathcal{Y}^j$. Naturally, that factorization will be a significantly smaller object to store in memory in the low-rank case.

Secondly, as was alluded to earlier in Section 2.1, Theorem 2.1 relies on an assumption about the quality of fit of the thresholding operator that exceeds the guarantees we are aware of for the usual procedure used to compute the HOSVD. Furthermore, calculating an HOSVD can itself qualify as computationally intensive, and refining a factorization could itself involve many iterations of, say, Algorithm 1.2 within an already iterative step of TIHT.

So the use of TIHT as the recovery procedure, and its reliance on the TRIP property, introduces several significant downsides. It means we need working memory on the order of the size of the uncompressed, unfactorized data to perform the first sub-step, contradicting the expressed goal of a low-memory (at all stages) method for tensor recovery.
Ideally we would have some means to go from the compressed measurements $\mathcal{L}(\mathcal{X})$ directly to the core $\mathcal{G}$ and factors $U_i$ of the estimate without needing, at any point in the procedure, to form the entire uncompressed tensor, and thus remain low-memory throughout the entire process. Next, TIHT relies on an assumption about the thresholding operator $H_{\mathbf{r}}$, and practically any effort taken to shore up that assumption may involve significantly more computational effort at each TIHT update, such as HOOI iterations. Lastly, all known methods to prove TRIP for two-stage measurement operators that use the often-studied types of random modewise maps require an initial reshaping, which may not be desirable or intuitive. The desire to address these problems is what led us to investigate the Leave-One-Out alternative and its accompanying simple recovery procedure, and is the subject of our next chapter.

2.4 Proof of Theorem 2.2

The goal of this section is to outline the proof of Theorem 2.2, which crucially builds on a similar type of guarantee for the first-stage compression operator, together with a lemma which shows that the output of the first stage of compression will likely belong to the set of nearly orthogonal tensors of Definition 2.4 and provides a covering number bound for this set. Once we have these, it is easy to fashion two-stage operators from standard types of random matrices, such as sub-Gaussian random matrices or $RFD$ matrices, as was done in, e.g., Lemma 1.6. For a detailed proof of these results, see Section 5 and Appendices A and B of our work [23].

Theorem 2.3. Suppose that $\mathcal{A}$ is defined as per (2.6) and that each of the $A_i$ has the RIP$(\epsilon, S_{1,2})$ property. Let $r\geq 2$, let $\delta = 4d'r^d\epsilon$, and assume that $\delta < 1$. Then $\mathcal{A}\circ R_1$ satisfies the TRIP(δ, r) property, i.e.,
$$(1-\delta)\|\mathcal{X}\|_2^2 \leq \|\mathcal{A}(R_1(\mathcal{X}))\|_2^2 \leq (1+\delta)\|\mathcal{X}\|_2^2 \tag{2.13}$$
for all $\mathcal{X}$ with HOSVD rank at most $\mathbf{r} = (r,r,\ldots,r)$.

So the goal now is to first prove this result about the first-stage compression operator. To do so, it will be convenient to write $\mathcal{A}$ as a composition of maps
$$\mathcal{A}(\mathcal{Y}) = \mathcal{A}_{d'}(\ldots(\mathcal{A}_1(\mathcal{Y}))), \quad \text{where } \mathcal{A}_i(\mathcal{Y}) = \mathcal{Y}\times_i A_i \text{ for } 1\leq i\leq d'. \tag{2.14}$$
We note that by (2.5), we may still write the reshaped $d'$-mode tensor $\mathring{\mathcal{X}}$ as a sum of $r^d$ orthogonal tensors; that is, by Lemma 1.4 we can write $\mathcal{X}$ as a sum of $r^d$ rank-one orthogonal tensors, and the reshaping operator $R_1$ is linear. This action of a linear operator on orthogonal sums in a vector space plays nicely with an approximation property, as we now state in Lemma 2.1.

Lemma 2.1. Let $V$ be an inner product space and let $\mathcal{L}$ be a linear operator on $V$. Let $\mathcal{U}\subset V$ be a subspace of $V$ spanned by an orthonormal system $\{\mathbf{v}_1,\ldots,\mathbf{v}_K\}\subset V$. Suppose that
$$(1-\epsilon)\|\mathbf{v}_i\|_2^2 \leq \|\mathcal{L}\mathbf{v}_i\|_2^2 \leq (1+\epsilon)\|\mathbf{v}_i\|_2^2 \quad \text{for all } 1\leq i\leq K, \tag{2.15}$$
and also that
$$(1-\epsilon)\|\mathbf{v}_i\pm\mathbf{v}_j\|_2^2 \leq \|\mathcal{L}(\mathbf{v}_i\pm\mathbf{v}_j)\|_2^2 \leq (1+\epsilon)\|\mathbf{v}_i\pm\mathbf{v}_j\|_2^2 \quad \text{for all } 1\leq i,j\leq K. \tag{2.16}$$
Then we have
$$(1-K\epsilon)\|\mathbf{w}\|_2^2 \leq \|\mathcal{L}\mathbf{w}\|_2^2 \leq (1+K\epsilon)\|\mathbf{w}\|_2^2 \quad \text{for all } \mathbf{w}\in\mathcal{U}.$$

Proof. Claim:
$$|\langle\mathcal{L}\mathbf{v}_i,\mathcal{L}\mathbf{v}_j\rangle| \leq \epsilon \quad \text{for all } i\neq j.$$
To prove the claim, let $i\neq j$.
Then,
$$4\langle\mathcal{L}\mathbf{v}_i,\mathcal{L}\mathbf{v}_j\rangle = \|\mathcal{L}(\mathbf{v}_i+\mathbf{v}_j)\|_2^2 - \|\mathcal{L}(\mathbf{v}_i-\mathbf{v}_j)\|_2^2 \leq (1+\epsilon)\|\mathbf{v}_i+\mathbf{v}_j\|_2^2 - (1-\epsilon)\|\mathbf{v}_i-\mathbf{v}_j\|_2^2$$
$$= (1+\epsilon)\big(\|\mathbf{v}_i\|_2^2 + \|\mathbf{v}_j\|_2^2 + 2\langle\mathbf{v}_i,\mathbf{v}_j\rangle\big) - (1-\epsilon)\big(\|\mathbf{v}_i\|_2^2 + \|\mathbf{v}_j\|_2^2 - 2\langle\mathbf{v}_i,\mathbf{v}_j\rangle\big) = 4\langle\mathbf{v}_i,\mathbf{v}_j\rangle + 2\epsilon\big(\|\mathbf{v}_i\|_2^2 + \|\mathbf{v}_j\|_2^2\big) = 4\epsilon,$$
where the last equality follows from the fact that $\|\mathbf{v}_i\|_2^2 = \|\mathbf{v}_j\|_2^2 = 1$ and $\langle\mathbf{v}_i,\mathbf{v}_j\rangle = 0$. Thus, $\langle\mathcal{L}\mathbf{v}_i,\mathcal{L}\mathbf{v}_j\rangle \leq \epsilon$. The reverse inequality is similar.

We now proceed to the lemma itself, and argue by induction on $K$. When $K = 1$, the result is immediate from (2.15) and the fact that $\mathcal{L}$ is linear. Now assume the result is true for $K-1$. An arbitrary element of $\mathcal{U}$ may be written as
$$\mathbf{w} = \sum_{i=1}^{K} c_i\mathbf{v}_i,$$
where $c_1,\ldots,c_K$ are scalars. We will write $\mathbf{w} = \mathbf{w}_{K-1} + c_K\mathbf{v}_K$, where
$$\mathbf{w}_{K-1} := \sum_{i=1}^{K-1} c_i\mathbf{v}_i.$$
By construction, we have
$$\|\mathcal{L}\mathbf{w}\|_2^2 = \|\mathcal{L}\mathbf{w}_{K-1}\|_2^2 + \|c_K\mathcal{L}\mathbf{v}_K\|_2^2 + 2c_K\langle\mathcal{L}\mathbf{w}_{K-1},\mathcal{L}\mathbf{v}_K\rangle.$$
We may use the inequality $2ab\leq a^2+b^2$ along with the claim above to see that
$$2c_K\langle\mathcal{L}\mathbf{w}_{K-1},\mathcal{L}\mathbf{v}_K\rangle = \sum_{i=1}^{K-1}2c_ic_K\langle\mathcal{L}\mathbf{v}_i,\mathcal{L}\mathbf{v}_K\rangle \leq \epsilon\sum_{i=1}^{K-1}2|c_Kc_i| \leq \epsilon\sum_{i=1}^{K-1}\big(c_K^2 + c_i^2\big) \leq (K-1)\epsilon c_K^2 + \epsilon\sum_{i=1}^{K-1}c_i^2.$$
By the inductive assumption,
$$\|\mathcal{L}\mathbf{w}_{K-1}\|_2^2 \leq (1+(K-1)\epsilon)\|\mathbf{w}_{K-1}\|_2^2 = (1+(K-1)\epsilon)\sum_{i=1}^{K-1}c_i^2.$$
Thus,
$$\|\mathcal{L}\mathbf{w}\|_2^2 \leq (1+(K-1)\epsilon)\sum_{i=1}^{K-1}c_i^2 + (1+\epsilon)c_K^2 + (K-1)\epsilon c_K^2 + \epsilon\sum_{i=1}^{K-1}c_i^2 \tag{2.17}$$
$$= (1+K\epsilon)\sum_{i=1}^{K}c_i^2 = (1+K\epsilon)\|\mathbf{w}\|_2^2. \tag{2.18}$$
The reverse inequality is similar.

The next lemma checks that, if the matrix $A_i$ satisfies the RIP$(\epsilon, S_{1,2})$ property for some $1\leq i\leq d'$, then the operator $\mathcal{A}_i$ as in (2.14) will satisfy the conditions of Lemma 2.1 for the system of rank-one component tensors produced by our reshaping procedure.

Lemma 2.2. Let $\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}\subset\mathbb{R}^{n\times\ldots\times n}$ be an orthonormal system of rank-one tensors of the form $\mathcal{V}_k = \bigcirc_{i=1}^{d'}\mathbf{v}^i_k$, where $\|\mathbf{v}^i_k\| = 1$ for all $1\leq i\leq d'$. Let $1\leq i\leq d'$, suppose $A_i$ has the RIP$(\epsilon/2, S_{1,2})$ property, and assume that each $\mathbf{v}^i_k$ is an element of the set $S_1$ defined in Definition 2.3. Then the conditions (2.15) and (2.16) are satisfied for (the vectorizations of) these $\{\mathcal{V}_i\}_{i=1}^{K}$ and $\mathcal{L} = \mathcal{A}_i$ defined via $\mathcal{A}_i(\mathcal{X}) = \mathcal{X}\times_i A_i$.

The next lemma gives a formula for the tensor $\mathcal{Y}_t$ obtained by applying sequentially the first $t$ of the maps $\mathcal{A}_i$. It shows that all intermediary tensors $\mathcal{Y}_t$ can be written as an orthogonal linear combination of $r^{\kappa(d'-t)}$ rank-one tensors of unit norm. Moreover, for each of the terms in this sum, the $(t+1)$-st component vector is $\mathring{\mathbf{u}}^{(t+1)}_{j_{t+1}}$ as defined in (2.5) and therefore is an element of the set $S_1$.

Lemma 2.3. Let $\mathcal{Y}_0 = \mathring{\mathcal{X}}$ and $\mathcal{Y}_t := \mathcal{A}_t(\mathcal{Y}_{t-1}) = \mathcal{Y}_{t-1}\times_t A_t$ for all $t = 1,\ldots,d'$. Then, for each $1\leq t\leq d'-1$, we may write
๐‘Ÿ ๐œ… โˆ‘๏ธ ๐‘—๐‘‘โ€ฒ =1 ๐‘—๐‘ก+1=1 G๐‘ก ( ๐‘—๐‘ก+1,..., ๐‘—๐‘‘โ€ฒ ) (cid:20) (cid:18) ๐‘ก โƒ ๐‘–=1 v(๐‘–) ๐‘—๐‘ก+1,..., ๐‘—๐‘‘โ€ฒ (cid:19) โƒ (cid:18) ๐‘‘โ€ฒ โƒ ๐‘–=๐‘ก+1 (cid:19)(cid:21) , โ—ฆu (๐‘–) ๐‘—๐‘– (2.19) where โˆฅv(๐‘–) ๐‘—๐‘ก+1,..., ๐‘—๐‘‘โ€ฒ โˆฅ = 1 for all valid index subsets. We are now ready to prove Theorem 2.3. Proof of Theorem 2.3. By Lemma 1.4, we can write Y0 = of ๐‘Ÿ ๐‘‘ norm one terms of the form โ—ฆ X as an orthogonal linear combination ๐‘‘โ€ฒ โƒ ๐‘–=1 โ—ฆu (๐‘–) ๐‘—๐‘– , 1 โ‰ค ๐‘—๐‘– โ‰ค ๐‘Ÿ ๐œ…, where each of the vectors โ—ฆu(๐‘–) ๐‘—๐‘– , 1 โ‰ค ๐‘—๐‘– โ‰ค ๐‘Ÿ ๐œ…, are obtained as the vectorization of a rank-one ๐œ…-mode tensor. Therefore, since ๐ด1 satisfies RIP(๐œ–, ๐‘†1,2), Lemma 2.2 allows us to apply Lemma 2.1 to see โˆฅA1( โ—ฆ X)โˆฅ2 โ‰ค (1 + 2๐‘Ÿ ๐‘‘๐œ–) โˆฅ โ—ฆ Xโˆฅ2. (2.20) Next observe that by Lemma 2.3 (๐‘‘โ€ฒโˆ’๐‘ก) sums of ๐‘Ÿ ๐œ… terms each (cid:125)(cid:124) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:122) ๐‘Ÿ ๐œ… โˆ‘๏ธ . . . (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123) ๐‘Ÿ ๐œ… โˆ‘๏ธ ๐‘—๐‘‘โ€ฒ =1 ๐‘—๐‘ก+1=1 Y๐‘ก = G๐‘ก ( ๐‘—๐‘ก+1, . . . , ๐‘—๐‘‘โ€ฒ) remains a set of orthonormal vectors (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125)(cid:124) (cid:122) (cid:123) (cid:20) (cid:18) ๐‘ก (cid:19) (cid:19)(cid:21) โƒ ๐‘–=1 v(๐‘–) ๐‘—๐‘ก+1,..., ๐‘—๐‘‘โ€ฒ (cid:18) ๐‘‘โ€ฒ โƒ ๐‘–=๐‘ก+1 (๐‘–) ๐‘—๐‘– โƒ โ—ฆu and note that there are ๐‘Ÿ ๐œ…(๐‘‘โ€ฒโˆ’๐‘ก) terms appearing in the sum in (2.19). Therefore, Lemmas 2.1 and 2.2 allow us to see that โˆฅY๐‘ก+1โˆฅ2 โ‰ค (cid:16) 1 + 2๐‘Ÿ ๐œ…(๐‘‘โ€ฒโˆ’๐‘ก)๐œ– (cid:17) โˆฅY๐‘ก โˆฅ2 (2.21) 51 for 1 โ‰ค ๐‘ก โ‰ค ๐‘‘โ€ฒ โˆ’ 1. Since Y๐‘‘โ€ฒ = A ( โ—ฆ X), combining (2.20) and (2.21) for each of the compositions implies that the overall first stage compression operator A defined in (2.6) satisfies โˆฅA ( โ—ฆ X)โˆฅ2 โ‰ค ๐‘‘โ€ฒโˆ’1 (cid:214) (cid:16) ๐‘ก=0 1 + 2๐‘Ÿ ๐œ…(๐‘‘โ€ฒโˆ’๐‘ก)๐œ– (cid:17) โ—ฆ Xโˆฅ2. โˆฅ To complete the upper bound set ๐›ผ (cid:66) 2๐‘Ÿ ๐‘‘๐œ– and note that 2๐›ผ < 1. Recall ๐‘‘โ€ฒ = ๐‘‘ ๐œ… . Then, since ๐‘Ÿ โ‰ฅ 2 ๐‘‘โ€ฒโˆ’1 (cid:214) (cid:16) ๐‘ก=0 1 + 2๐‘Ÿ ๐œ…(๐‘‘โ€ฒโˆ’๐‘ก)๐œ– (cid:17) = ๐‘‘โ€ฒโˆ’1 (cid:214) ๐‘ก=0 (1 + ๐›ผ๐‘Ÿ โˆ’๐‘ก๐œ…) ๐‘Ÿ โˆ’๐‘ก๐œ… + ๐›ผ2 ๐‘‘โ€ฒโˆ’1 โˆ‘๏ธ ๐‘Ÿ โˆ’(๐‘ก1+๐‘ก2)๐œ… + . . . + ๐›ผ๐‘‘โ€ฒ๐‘Ÿ โˆ’(1+...+(๐‘‘โ€ฒโˆ’1))๐œ… ๐‘‘โ€ฒโˆ’1 โˆ‘๏ธ ๐‘ก=0 ๐‘‘โ€ฒโˆ’1 โˆ‘๏ธ = 1 + ๐›ผ โ‰ค 1 + ๐›ผ โ‰ค 1 + ๐›ผ ๐‘ก1,๐‘ก2=0: ๐‘ก1<๐‘ก2 ๐‘‘โ€ฒโˆ’1 โˆ‘๏ธ ๐‘Ÿ โˆ’๐‘ก๐œ… (cid:33) 2 (cid:32) ๐›ผ ๐‘Ÿ โˆ’๐‘ก๐œ… + (cid:32) + . . . + ๐›ผ (cid:33) ๐‘‘โ€ฒ ๐‘Ÿ โˆ’๐‘ก๐œ… ๐‘‘โ€ฒโˆ’1 โˆ‘๏ธ ๐‘ก=0 ๐‘ก=0 โˆž โˆ‘๏ธ ๐‘ก=0 2โˆ’๐‘ก + ๐‘ก=0 โˆž โˆ‘๏ธ 2โˆ’๐‘ก (cid:32) ๐›ผ (cid:33) 2 ๐‘ก=0 (cid:32) + . . . + ๐›ผ (cid:33) ๐‘‘โ€ฒ โˆž โˆ‘๏ธ ๐‘ก=0 2โˆ’๐‘ก โ‰ค 1 + 2๐›ผ + (2๐›ผ)2 + . . . 
+ (2๐›ผ)๐‘‘โ€ฒ โ‰ค 1 + 2๐‘‘โ€ฒ๐›ผ = 1 + 4๐‘‘โ€ฒ๐‘Ÿ ๐‘‘๐œ– which completes the proof of the upper bound. The proof of the lower bound is nearly identical. Now that we have proved Theorem 2.3, and have now the first stage compression operator, A (R1(ยท)) has a TRIP so long as the constituent linear maps themselves have a RIP, we can use this show that the second stage compression operator also as a TRIP. Unfortunately, we cannot simply inductively apply the same reasoning, since crucially the first stage depended on using that the sum of orthogonal (rank-1) components and a linear operator were compatible with the embedding properties. After the first stage, these sums of rank-1 tensors will almost certainly no longer be exactly orthogonal. Nevertheless, we see that they will be nearly orthogonal, and should we have a covering number for this set, then it will be possible to find a second stage compression operator which is able to satisfy the RIP for this set. This is what motivates the use of Definition 2.4, and, along with a covering number argument for this set of nearly orthogonal tensors, is the main content of the following Lemma. 52 Lemma 2.4. Let X be a unit norm d-mode tensor with HOSVD rank at most r = (๐‘Ÿ, . . . , ๐‘Ÿ). Let A be defined as in (2.6), assume that the matrices ๐ด๐‘– have the ๐‘…๐ผ ๐‘ƒ(๐œ–, ๐‘†1,2) property for all 1 โ‰ค ๐‘– โ‰ค ๐‘‘โ€ฒ, X) โˆˆ R๐‘šร—...ร—๐‘š is an element of the set and that ๐›ฟ = 12๐‘‘โ€ฒ๐‘Ÿ ๐‘‘๐œ– < 1. Then, the ๐‘‘โ€ฒ-mode tensor A ( โ—ฆ B1+๐œ–,๐œ–,1โˆ’๐›ฟ/3,rโ€ฒ (as per Definition 2.4). Additionally, let suppose that ๐‘š โ‰ฅ ๐‘Ÿ ๐‘‘โˆ’1. Then, หœB๐œ–,๐›ฟ,๐‘Ÿ (cid:66) (cid:26) X โˆฅXโˆฅF (cid:12) (cid:12) (cid:12) (cid:12) X โˆˆ B1+๐œ–,๐œ–,1โˆ’๐›ฟ/3,rโ€ฒ and (cid:27) ๐ถ โˆฅยทโˆฅ2 ๐‘ก (cid:0) หœB๐œ–,๐›ฟ,๐‘Ÿ (cid:1) โ‰ค (cid:19)๐‘Ÿ ๐‘‘+๐‘Ÿ ๐œ… ๐‘š๐‘‘/๐œ… (cid:18) 9(๐‘‘/๐œ… + 1) ๐‘ก (cid:16) (1 + ๐œ–)2 + ๐œ–๐‘Ÿ ๐‘‘(cid:17) ๐‘‘๐‘š๐‘Ÿ ๐œ… /๐œ… (1 + ๐œ–)๐‘‘2๐‘š๐‘Ÿ ๐œ… ๐œ…โˆ’2 holds for all ๐‘ก > 0, and 1 > ๐›ฟ > 0. Where ๐ถ โˆฅยทโˆฅ2 ๐‘ก (cid:0) หœB๐œ–,๐›ฟ,๐‘Ÿ (cid:1) refers to the covering number, see Definition 1.27. The proof then of Theorem 2.2 follows from applying the RIP assumption of ๐ด 2๐‘›๐‘‘ after using Theorem 2.3 for A (๐‘…1(X)). 53 CHAPTER 3 LEAVE ONE OUT MEASUREMENTS FOR RECOVERY Our investigation of two-stage measurement operators that satisfy the TRIP was motivated by our desire to employ a result regarding TIHT that could, with important qualifications, solve the tensor recovery problem by using a memory efficient L : R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ โ†’ R๐‘š ๐‘“ ๐‘–๐‘›๐‘Ž๐‘™ measurement operator. In Section 2.3 we noted three significant drawbacks to this overall approach 1. the method was not low-memory at all stages since a sub-step of TIHT involves forming and operating on a tensor the size of the uncompressed original signal 2. the method relies on an effectively unverifiable assumption about the thresholding operator that is also computationally expensive to shore up 3. the apparent necessity of an initial reshaping in order to achieve useful compression The desire to address these issues led us to a useful specialization to the modewise measurement framework, which we refer to as the leave-one-out alternative. We are not the first to propose the idea. 
Most notably, in the work of Hendrikx and De Lathauwer [26], the authors describe an algorithm that recovers a tensor from linear combinations of its entries where the measurement operations themselves are constrained to be Kronecker-structured; this will serve as a particular example which fits very naturally into our framework once we properly situate it. In that work they give necessary and sufficient conditions for perfect recovery of exactly low-rank tensors. They describe some relevant heuristics and provide empirical results for the performance of the recovery when the tensor or its measurements are subject to white noise. Additionally, they demonstrate how to adapt their method to two other tensor formats, namely CP and tensor train.

Furthermore, in the work of Tropp, Udell et al. [60], a recovery procedure comparable to that of [26] was described (in that work, Algorithm 4.1, the Tucker Sketch); however, the measurement ensembles analyzed were, in some sense, less structured. Measurements used to estimate factors of a tensor in that work were conceived as operating on the unfoldings of the tensor, not strictly modewise as is possible in the Kronecker-structured case of [26]. This too we can situate into our framework from Section 1.5. In [60] the authors detail approximation error guarantees for the recovered tensor in terms of quasi-optimality, and notably do not require the signal to be exactly low-rank; both are things we find desirable in a theory of recovery. On the other hand, the error guarantees in that work hold in expectation and are stated under the assumption that all the entries of the measurement matrices are sampled from independent Gaussian random variables, which from a practical viewpoint can be undesirable, as these measurement matrices then require storage that scales with the size of the data tensor itself. The authors do empirically investigate the performance of a more structured and constrained measurement map, specifically when the sketching matrices are constructed using Khatri-Rao products of Gaussian matrices. In those experiments they show that these structured measurements deliver nearly as accurate approximations while potentially using significantly less space, since it is possible in this scenario to store only the component parts and construct the ensemble as it is needed during sketching.

Our contribution in this chapter is to unify and add to these works. Using the same overall strategy as [60], our error analysis applies to the more structured Kronecker and Khatri-Rao measurement ensembles, which are then used in the Canonical Recovery Algorithm 3.1 to recover the factorized form of a tensor. These types of measurement ensembles operate on the tensor in a subtly different modewise manner that we will situate in terms of the framework introduced in Section 1.5. In a similar fashion to the type of measurement operator that was the subject of Chapter 2, this has significant practical advantages, since neither the entire signal tensor nor the entire measurement operator needs to be kept in working memory in order to obtain measurements or, importantly, to recover the factors of the original signal tensor. The improvement compared to [26] is that the error analysis presented in this chapter applies in the non-exactly low-rank case, and quantifies more specifically how the error depends on the relevant parameters of the problem, such as the rank truncation.
Additionally, besides applying the analysis to these more structured measurement ensembles, which was not done in those earlier works, we remove the assumption that the entries of the sketching matrices are drawn independently from a Gaussian distribution, and instead rely on a Johnson-Lindenstrauss property (see Definition 1.18) in order to state probabilistic bounds on the errors of the recovered tensors. The advantage of this is that there are several well known distributions of random matrices which are known to satisfy this property, and it becomes a straightforward exercise to reinterpret the measurement requirements for different choices of sketching matrices. Having these options enables a user to select structured measurement schemes which can, for example, take advantage of fast matrix-vector multiplies; see Definition 1.24.

3.1 Kronecker Products, Khatri-Rao Products, and Leave-One-Out Modewise Measurement Notation and Examples

We now describe and provide some notation and terminology for the type of measurement operators studied throughout this chapter. The general concept, which we are able to specialize to several cases, is the following:

Definition 3.1 (Leave-one-out Measurements). Given a $d$-mode tensor $\mathcal{X}$ with side-lengths $n$, leave-one-out measurements are the result of reducing all but one mode using linear operations on the unfolding of the tensor. That is,
$$B_j = \Omega^{(j,j)} X_{[j]} \Omega_{-j}^T \tag{3.1}$$
are leave-one-out measurements whenever $\Omega^{(j,j)} \in \mathbb{R}^{n \times n}$ is full-rank and $\Omega_{-j} \in \mathbb{R}^{m' \times n^{d-1}}$ where $m' \leq n^{d-1}$.

Any measurement process that can be equivalently described as left multiplication of an unfolded tensor by a full-rank square matrix and right multiplication by some other linear operator that reduces all other modes is a leave-one-out measurement process. We will consider three specific cases for how to construct the overall measurement operator $\Omega_{-j}$ in Definition 3.1: Kronecker structured, Khatri-Rao structured, and unstructured measurement operators, and then situate each within the framework introduced in Section 1.5.

3.1.1 Kronecker Structured Measurements

Kronecker structured leave-one-out measurements are constructed by taking the Kronecker product (see Definition 1.11) of $d-1$ component matrices to fashion $\Omega_{-j}$, in addition to one square, full-rank measurement matrix $\Omega^{(j,j)}$ to be applied to mode $j$, which is left uncompressed. Note this matrix could simply be the identity matrix. We will require leave-one-out measurements of the type in Definition 3.1 for each mode (i.e., for all $j \in [d]$) in order to recover all factors. Collectively, the matrices needed to define this ensemble will form a set $\{\Omega^{(i,j)}\}_{i,j \in [d]}$, where $\Omega^{(i,j)} \in \mathbb{R}^{m \times n}$ when $j \neq i$, and where $\Omega^{(i,i)} \in \mathbb{R}^{n \times n}$ for all $i \in [d]$. So far, our measurements are then
$$\mathcal{B}_j := \mathcal{X} \times_1 \Omega^{(j,1)} \times_2 \Omega^{(j,2)} \times_3 \dots \times_d \Omega^{(j,d)} \tag{3.2}$$
for all $j \in [d]$, using the matrices $\{\Omega^{(j,i)}\}_{i=1}^{d}$. Crucially, we can identify this tensor of measurements $\mathcal{B}_j$ with a flattened version which makes it clear that Kronecker structured modewise measurements do define leave-one-out measurements conforming to Definition 3.1 (see Lemma 1.2):
$$B_j = \Omega^{(j,j)} X_{[j]} \Omega_{-j}^T = \Omega^{(j,j)} X_{[j]} \left( \Omega^{(j,1)} \otimes \Omega^{(j,2)} \otimes \dots \otimes \Omega^{(j,j-1)} \otimes \Omega^{(j,j+1)} \otimes \dots \otimes \Omega^{(j,d)} \right)^T. \tag{3.3}$$
See Algorithm 3.2 for an outline of the sketching procedure. Note that (3.2) makes it clear that we have here a one-stage modewise measurement process which has no initial reshaping, but instead leaves one mode uncompressed. As mentioned, we will require a collection of such measurements - one measurement tensor in one mode will be used only to recover a single factor matrix in the Tucker decomposition of our estimate. We refer to a collection of measurements from different measurement operators as a measurement ensemble.

Since it will be relevant in our analysis, we observe that, conceptually, the Kronecker product of matrices can be rewritten in terms of matrix products of auxiliary matrices. This will be useful if we wish to examine via, e.g., (1.4), how several modewise products change the properties of the resulting factor matrices $U_j$ in the Tucker decomposition of a given tensor. As an illustration, consider the case of a three mode tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times n_3}$. Let $\mathbf{x} \in \mathbb{R}^{n_1 n_2 n_3}$ denote the vectorization of $\mathcal{X}$, and further suppose that $\Omega_j \in \mathbb{R}^{m_j \times n_j}$ for $j = 1, 2, 3$ are three measurement matrices used to produce modewise measurements of $\mathcal{X}$ given by $\mathcal{X} \times_1 \Omega_1 \times_2 \Omega_2 \times_3 \Omega_3$. Allowing for an additional reshaping, we can identify these three modewise operations equivalently with a single matrix-vector product using a variant of (1.4). That is, one can see that $\mathrm{vec}\left(\mathcal{X} \times_1 \Omega_1 \times_2 \Omega_2 \times_3 \Omega_3\right) = \Omega \mathbf{x}$, where $\Omega = \Omega_3 \otimes \Omega_2 \otimes \Omega_1 \in \mathbb{R}^{m_1 m_2 m_3 \times n_1 n_2 n_3}$. Let $I_n$ denote the $n \times n$ identity matrix. The mixed-product property of the Kronecker product can now be used to further show that in fact $\Omega = \tilde{\Omega}_3 \tilde{\Omega}_2 \tilde{\Omega}_1$, where
$$\tilde{\Omega}_1 = I_{n_3} \otimes I_{n_2} \otimes \Omega_1 \in \mathbb{R}^{m_1 n_2 n_3 \times n_1 n_2 n_3}, \quad \tilde{\Omega}_2 = I_{n_3} \otimes \Omega_2 \otimes I_{m_1} \in \mathbb{R}^{m_1 m_2 n_3 \times m_1 n_2 n_3}, \quad \tilde{\Omega}_3 = \Omega_3 \otimes I_{m_2} \otimes I_{m_1} \in \mathbb{R}^{m_1 m_2 m_3 \times m_1 m_2 n_3}.$$
Hence, $\mathrm{vec}\left(\mathcal{X} \times_1 \Omega_1 \times_2 \Omega_2 \times_3 \Omega_3\right) = \tilde{\Omega}_3 \tilde{\Omega}_2 \tilde{\Omega}_1 \mathrm{vec}(\mathcal{X})$. This example can easily be extended to any number of modes.
3.1.2 Khatri-Rao Structured Measurements

Leave-one-out measurement ensembles in which $\Omega_{-j}$ is a Khatri-Rao product of smaller maps are also possible, and have been considered before in works such as [60]. That is,
$$B_j = \Omega^{(j,j)} X_{[j]} \Omega_{-j}^T = \Omega^{(j,j)} X_{[j]} \left( \Omega^{(j,1)\,T} \odot \Omega^{(j,2)\,T} \odot \dots \odot \Omega^{(j,j-1)\,T} \odot \Omega^{(j,j+1)\,T} \odot \dots \odot \Omega^{(j,d)\,T} \right) = \Omega^{(j,j)} X_{[j]} \left( \Omega^{(j,1)} \bullet \Omega^{(j,2)} \bullet \dots \bullet \Omega^{(j,j-1)} \bullet \Omega^{(j,j+1)} \bullet \dots \bullet \Omega^{(j,d)} \right)^T, \tag{3.4}$$
where $\Omega^{(j,j)}$ is a full-rank square matrix and $\Omega^{(j,i)} \in \mathbb{R}^{m \times n}$ for $j \neq i$. Note that in this case we can consider $\Omega_{-j} \in \mathbb{R}^{m \times n^{d-1}}$ as sketching sub-tensors of size $n^{d-1}$ down to size $m$ (as opposed to $m^{d-1}$, as was done previously with Kronecker structured measurement maps). See Algorithm 3.3 for an outline of the sketching procedure. To situate this measurement process within the framework: unlike the Kronecker structured measurements, there is no simple equivalence of each component map to a single modewise product along a mode. Instead, we conceive of the measurement matrix as acting on sub-tensors of $d-1$ modes, i.e., the sub-tensors that occupy the rows of a mode-$j$ unfolding. That is, there is a reshaping into a mode-$j$ flattening initially, and then a single "modewise" (matrix) product of that reshaped tensor with the measurement operator. Fundamentally, measurements structured this way combine $d-1$ modes initially with a reshaping; however, the full measurement matrix itself need not be fully pre-computed because of its Khatri-Rao structure. One can, at the time of sketching, take the Kronecker product of the columns of the component matrices, trading compute for storage.

3.1.3 Unstructured Measurements

In [50], a type of leave-one-out measurement is described and analyzed; however, in that work the measurement ensembles are not structured as described in the previous two sections, but instead are dense matrices, where $\Omega_{-j} \in \mathbb{R}^{m \times n^{d-1}}$ has entries all drawn independently from a Gaussian distribution and this matrix right-multiplies an unfolding of the tensor:
$$B_j = \Omega^{(j,j)} X_{[j]} \Omega_{-j}^T,$$
where again $\Omega^{(j,j)}$ is some square, full-rank matrix. As an important observation, the storage requirement for the sketching matrix itself in this case is large, comparable in size to the data itself, since the long dimension is $n^{d-1}$, which can be undesirable (though an improvement on the fully vectorized approach, which must have $n^d$ as its long dimension). Furthermore, the error analysis in [50] conducted for this type of measurement, where the unstructured matrix $\Omega_{-j}$ has i.i.d. Gaussian entries, relies on expectation bounds known for Gaussian matrices and their pseudo-inverses. Comparable bounds for other distributions, or for matrices constructed using Kronecker or Khatri-Rao products, are not covered in that analysis.

Note that Kronecker or Khatri-Rao structured measurements alleviate to a large degree the storage problem associated with $\Omega_{-j}$ in the unstructured case, since the entire sketching matrix does not need to be maintained in memory; rather, just the constituent components $\Omega^{(j,i)} \in \mathbb{R}^{m \times n}$ can be stored, and the action of the Kronecker or Khatri-Rao product on the tensor can be computed as part of the sketching phase. This does incur additional operations in the sketching phase, and in our runtime analysis and numerical experiments we detail the trade-off between space and run-time for these various choices of measurement operators.

3.1.4 Core Measurements

Regardless of which type of leave-one-out measurement is used for each of the modes to recover the factors, in order to recover the core factor using only a single access to the data $\mathcal{X}$ we will require an additional set of measurements. These are, in all cases discussed in this chapter, modewise, and compress all modes; i.e., the operator is a single stage, no initial reshaping, modewise measurement operator:
$$\mathcal{B}_c := \mathcal{X} \times_1 \Phi_1 \times_2 \Phi_2 \times_3 \dots \times_d \Phi_d, \tag{3.5}$$
using the matrices $\{\Phi_j\}_{j=1}^{d}$. See Algorithm 3.4 for an outline of the sketching procedure for the core. All together then, these $d+1$ measurement tensors, $d$ of the leave-one-out type (Definition 3.1) and one of the type in (3.5), will be used to recover the parts of the original full tensor. Figure 3.1 is a schematic rendition of the overall measurement procedure in the three mode case ($d = 3$).
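As a complement to the Kronecker-structured example above, the following NumPy sketch (again with illustrative names of our own, under the same row-major unfolding convention) forms Khatri-Rao structured leave-one-out measurements as in (3.4) one row at a time, so that the full $m \times n^{d-1}$ matrix $\Omega_{-j}$ is never stored, and also forms the core measurements (3.5) by compressing every mode.

import numpy as np

rng = np.random.default_rng(1)
n, d, m, m_c = 8, 3, 4, 5
X = rng.standard_normal((n, n, n))

def mode_product(T, A, mode):
    """Multiply tensor T by matrix A along the given mode."""
    T = np.moveaxis(T, mode, 0)
    out = A @ T.reshape(T.shape[0], -1)
    return np.moveaxis(out.reshape((A.shape[0],) + T.shape[1:]), 0, mode)

# Khatri-Rao structured leave-one-out measurement for mode j = 1.
Omega11 = np.eye(n)                          # square full-rank matrix for the kept mode
Omega12 = rng.standard_normal((m, n))
Omega13 = rng.standard_normal((m, n))

X1 = X.reshape(n, n * n)                     # mode-1 unfolding
B1 = np.empty((n, m))
for i in range(m):
    # Row i of Omega_{-1} is the Kronecker product of row i of each component;
    # it is formed on the fly and immediately discarded.
    w = np.kron(Omega12[i], Omega13[i])      # length n^(d-1)
    B1[:, i] = X1 @ w
B1 = Omega11 @ B1                            # final n x m measurement matrix

# Core measurements (3.5): every mode is compressed by a matrix Phi_i.
Phis = [rng.standard_normal((m_c, n)) for _ in range(d)]
B_core = X
for mode, Phi in enumerate(Phis):
    B_core = mode_product(B_core, Phi, mode)

print(B1.shape, B_core.shape)                # (8, 4) and (5, 5, 5)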
Note that the ๐‘›๐‘‘ entries of the original tensor X are compressed into ๐‘‘๐‘›๐‘š๐‘‘โˆ’1 + ๐‘š๐‘‘ ๐‘ entries of the ๐‘‘ + 1 total different measurement tensors; one leave-one-out measurement for each of the ๐‘‘ modes as well as the one measurement tensor for use in estimating the core. Recovery naturally will require the storage of the measurement operators themselves in some fashion along with the measurement tensors. As dense matrices, the measurement ensemble collectively uses ๐‘‘ ((๐‘‘ โˆ’ 1)๐‘š๐‘› + ๐‘›2) + ๐‘‘๐‘š๐‘๐‘› total entries. However the number of random variables required to generate these matrices may be significantly fewer depending on the method employed. For example, when using ๐‘…๐น ๐ท matrices and an FFT algorithm, the number of random bits required to construct an ensemble is linear in ๐‘›. Additionally, some choices of measurement matrices allow for near-linear time matrix-vector multiplication. As we detail later, applying the measurement operators, i.e. the sketching phase, is asymptot- ically the most computationally intensive part of the overall measure and recovery procedure, and thus in settings where it is useful to economize the computational effort to obtain measurements these matrices have significant advantages. Whichever choice for type of measurements are used, however, the efficiency in terms of the run-time of the recovery algorithm is largely dependent on the ratios ๐‘š/๐‘› and ๐‘š๐‘/๐‘›. 60 B1 = X ) 1 , 1 ( ฮฉ ฮฉ (1,3) ฮฉ(1,2) 1 ฮฆ = B๐‘ X ฮฆ2 ฮฆ 3 (a) Leave-one-measurements from Algorithm 3.2 (b) Core measurements from Algorithm 3.4 Figure 3.1 Schematic for the two types of measurement tensors used in recovery Algorithm 3.9 for a three mode tensor. Measurement tensors B๐‘– of the type shown in (๐‘Ž) for ๐‘– = 1 will be used to estimate the factors of the Tucker decomposition, and the measurement tensor B๐‘ in (๐‘) will be used to estimate the core of the Tucker decomposition. 3.1.5 The Canonical One-Pass Recovery Algorithm We can now state Algorithm 3.1 as the canonical algorithm for one-pass recovery the ten- sor using leave-one-out measurements. The algorithm outputs an estimate in factored form X1 = [[H , ๐‘„1, . . . , ๐‘„๐‘‘]], and we say one-pass to emphasize that after measuring the tensor X, no additional access to the data is required to obtain the estimate X1. The inputs to the algorithm are leave-one-out measurements ๐ต๐‘– for each of the modes of any type, as well as measurements B๐‘ for use in recovering the core, some of the related measurement operators ฮฉ(๐‘–,๐‘–), ฮฆ๐‘–, and a target rank r parameter. Algorithm 3.1 consists of two loops. The first loop recovers the factor matrices ๐‘„๐‘– in the Tucker factorization. The second loop depends on the output of the first loop and computes the core H of the Tucker factorization. Importantly, at no point does a full, uncompressed tensor or size X ever need to be formed. The procedure goes directly from measurements to factors. Note that within the pseudo-code for Algorithm 3.1 the โ€œunfoldโ€ function in the body of Algorithm 3.1 takes a tensor and flattens it into the given shape by arranging the specified modeโ€™s fibers as the columns, e.g. ๐ป โ† unfold(B๐‘, ๐‘š๐‘ ร— ๐‘š๐‘‘โˆ’1 , mode = 1) is equivalent to ๐ป โ† (B๐‘) [1] using the notation of Section 1.4.1, the matrix that is obtained by flattening the ๐‘‘-mode tensor B๐‘ ๐‘ along mode-1. 
The โ€œfoldโ€ function is the inverse of โ€œunfoldโ€, and takes a matrix that is an unfolding 61 along the specified mode of a tensor and reshapes it into a tensor with the specified and compatible dimensions, e.g. B โ† fold(๐ป, ๐‘Ÿ ร—๐‘š๐‘ ร— ยท ยท ยท ร— ๐‘š๐‘ (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:124) (cid:123)(cid:122) ๐‘‘โˆ’1 , mode = 1) takes the matrix ๐ป with dimensions ๐‘Ÿ ร— ๐‘š๐‘‘โˆ’1 ๐‘ and reshapes it into a ๐‘‘ mode tensor where the first mode is of length ๐‘Ÿ and the remaining ๐‘‘ โˆ’ 1 modes are of length ๐‘š๐‘. In the scenario where it is possible to access the original tensor X twice (we refer to this as the two-pass scenario), we can compute the optimal core G given our estimated factor matrices ๐‘„๐‘– to obtain an estimate X2 = [[G, ๐‘„1, . . . , ๐‘„๐‘‘]]. See Algorithm 3.10 and Algorithm 3.12 in Section 3.6 for the detailed formulation of the two-pass recovery procedure. Remark 3.1. Note, for a tensor with an exact Tucker decomposition X = [[G, ๐‘ˆ1, . . . , ๐‘ˆ๐‘‘]] of rank the total degrees of freedom in the decomposition is a lower bound on any lossless (๐‘Ÿ, . . . , ๐‘Ÿ) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:125) (cid:124) ๐‘‘ compression scheme in the absence of any other kind of information about the factorization. For example, the leave-one-out measurements ๐ต๐‘– used to recover a factor matrix ๐‘ˆ ๐‘— in Algorithm 3.1 which has ๐‘›๐‘Ÿ โˆ’ ๐‘Ÿ (๐‘Ÿ+1) degrees of freedom, it is necessary in the exact case that the product ๐‘ˆ ๐‘— (cid:1) have column rank at least ๐‘Ÿ and thus (cid:0)ฮฉ๐‘‘๐‘ˆ๐‘‘ โŠ— ยท ยท ยท โŠ— ฮฉ ๐‘—+1๐‘ˆ ๐‘—+1 โŠ— ฮฉ ๐‘—โˆ’1๐‘ˆ ๐‘—โˆ’1 โŠ— ยท ยท ยท โŠ— ฮฉ1๐‘ˆ1 and ๐บ [ ๐‘—] ๐‘š๐‘‘โˆ’1 โ‰ฅ ๐‘Ÿ. Similarly we see that ๐‘š๐‘ โ‰ฅ ๐‘Ÿ is a lower bound for the measurements required to recover 2 the ๐‘Ÿ ๐‘‘ entries of the core. In other words, to recover the rank (๐‘Ÿ, . . . , ๐‘Ÿ) core tensor, our sketch along each mode must have dimension at least ๐‘Ÿ. With the measurement operators described and the canonical algorithm outlined, we can now state a representative main result. The full unabridged results are found in Section 3.2.5. (0, 1 Theorem 3.1 (A Summary Main Result). Let X be a ๐‘‘-mode tensor of side length ๐‘›, ๐œ– โˆˆ (0, 1), ๐›ฟ โˆˆ 3), and ๐‘Ÿ โˆˆ [๐‘›] := {1, . . . , ๐‘›}. Denote an optimal rank r = (๐‘Ÿ, . . . , ๐‘Ÿ) โˆˆ R๐‘‘ (๐‘‘ โ‰ฅ 2) Tucker . There exists a one-pass recovery algorithm (see Algorithm 3.1) approximation of X by [[X]]opt r that outputs a Tucker factorization X1 = [[H , ๐‘„1, . . . , ๐‘„๐‘‘]] of X from leave-one-out linear mea- surements that will satisfy โˆฅX โˆ’ X1โˆฅ2 โ‰ค (1 + ๐‘’๐œ– ) โˆš๏ธ‚ ๐‘‘ (1 + ๐œ–) 1 โˆ’ ๐œ– (cid:13) (cid:13) (cid:13) X โˆ’ [[X]]opt r (cid:13) (cid:13) (cid:13)2 (3.6) with probability at least 1 โˆ’ ๐›ฟ. 
The total number of linear measurements the algorithm will use is 62 for ๐‘– โˆˆ [๐‘‘] leave-one-out measurements Algorithm 3.1 One Pass HOSVD Recovery from Leave-One-Out Measurements input : ๐ต๐‘– โˆˆ R๐‘›ร—๐‘š๐‘‘โˆ’1 ฮฉ(๐‘–,๐‘–) โˆˆ R๐‘›ร—๐‘› for ๐‘– โˆˆ [๐‘‘] full-rank sensing matrix for uncompressed mode B๐‘ a ๐‘‘ mode tensor of measurements with side lengths ๐‘š๐‘ ฮฆ๐‘– โˆˆ R๐‘š๐‘ร—๐‘› for ๐‘– โˆˆ [๐‘‘] measurement matrices for core measurements r = (๐‘Ÿ, ๐‘Ÿ, . . . , ๐‘Ÿ) the rank of the HOSVD output : ห†X = [[H , ๐‘„1, . . . ๐‘„๐‘‘]] # Factor matrix recovery for ๐‘– โˆˆ [๐‘‘] do # Solve ๐‘› ร— ๐‘› linear system Solve ฮฉ(๐‘–,๐‘–) ๐น๐‘– = ๐ต๐‘– for ๐น๐‘– # Compute SVD and keep the ๐‘Ÿ leading singular vectors ๐‘ˆ, ฮฃ, ๐‘‰๐‘‡ โ† SVD(๐น๐‘–) ๐‘„๐‘– โ† ๐‘ˆ:,:๐‘Ÿ end # Core recovery for ๐‘– โˆˆ [๐‘‘] do # unfold measurements, mode-๐‘– fibers are columns, size ๐‘š๐‘ ร— ๐‘Ÿ (๐‘–โˆ’1)๐‘š๐‘‘โˆ’1โˆ’(๐‘–โˆ’1) ๐ป โ† unfold(B๐‘, ๐‘š๐‘ ร— ๐‘Ÿ (๐‘–โˆ’1)๐‘š๐‘‘โˆ’1โˆ’(๐‘–โˆ’1) , mode = ๐‘–) # Undo the mode-๐‘– measurement operator and factorโ€™s action by finding ๐‘ ๐‘ least square solution to ๐‘š๐‘ ร— ๐‘Ÿ over-determined linear system Solve ฮฆ๐‘–๐‘„๐‘–๐ปnew = ๐ป for ๐ป๐‘›๐‘’๐‘ค # reshape the flattened partially solved core into a tensor ร—๐‘š๐‘ ร— ยท ยท ยท ร— ๐‘š๐‘ B๐‘ โ† fold(๐ปnew, ๐‘Ÿ ร— ๐‘Ÿ ยท ยท ยท ร— ๐‘Ÿ (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:125) (cid:125) (cid:123)(cid:122) (cid:124) ๐‘– ๐‘‘โˆ’๐‘– # Each iteration ๐‘š๐‘ โ†’ ๐‘Ÿ in ๐‘–th mode , mode = ๐‘–) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:124) end H โ† B๐‘ bounded above by ๐‘šโ€ฒ(๐‘Ÿ, ๐‘‘, ๐œ–, ๐‘›, ๐›ฟ) = (cid:104) ๐‘›๐‘‘ + ๐ถ๐‘Ÿ๐‘‘2 ๐œ– 2 ln (cid:16) ๐‘‘๐‘›๐‘‘ ๐›ฟ (cid:17)(cid:105) (cid:16) ๐ถ๐‘Ÿ๐‘‘2 ๐œ– 2 (cid:16) ๐‘‘๐‘›๐‘‘ ๐›ฟ ln (cid:17)(cid:17) ๐‘‘โˆ’1 , where ๐ถ > 0 is an absolute constant. Furthermore, the recovery algorithm (Algorithm 3.1) runs in time ๐‘œ(๐‘›๐‘‘) for large ๐‘› โ‰ซ ๐‘Ÿ. Remark 3.2. More generally and more exactly, the number of measurements is always ๐‘›๐‘‘๐‘š๐‘‘โˆ’1 + ๐‘š๐‘‘ ๐‘ where ๐‘š and ๐‘š๐‘ are as in Theorem 3.9 for Kronecker structured measurements, or ๐‘‘ หœ๐‘š๐‘› + ๐‘š๐‘‘ ๐‘ where หœ๐‘š and ๐‘š๐‘ are as in Theorem 3.11 in the case of Khatri-Rao structured measurements. Proof. Combine Theorem 3.9 (or Theorem 3.11) with Theorem 3.10 and Lemma 1.5. Although our bound almost certainly not tight, overall we demonstrate the procedure has errors 63 that depend on the number of degrees of freedom of the sketched tensor, not its full size; as is typical in similar compressive sensing scenarios. Furthermore, in our numerical experiments we demonstrate some of the main trade-offs involved when choosing between Kronecker or Khatri-Rao constructed sketches in terms of approximation error and performance. 3.1.6 Projection Cost Preserving (PCP) Sketches The following definition and related results are needed to conduct our probabilistic error analysis of the various leave-one-out measurements and Algorithm 3.1. 
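For readers who prefer running code to pseudo-code, the following condensed NumPy sketch mirrors the two loops of Algorithm 3.1 for a small three mode example with exactly low Tucker rank, using Kronecker structured leave-one-out measurements and Gaussian core measurements. This is a minimal illustration under those assumptions, not the reference implementation; all names are ours, and the linear-system and least squares steps are handled with generic solvers.

import numpy as np

rng = np.random.default_rng(2)
n, d, r, m, m_c = 10, 3, 2, 4, 4

def mode_product(T, A, mode):
    T = np.moveaxis(T, mode, 0)
    out = A @ T.reshape(T.shape[0], -1)
    return np.moveaxis(out.reshape((A.shape[0],) + T.shape[1:]), 0, mode)

# Build an exactly rank-(r, r, r) test tensor X = [[G; U1, U2, U3]].
G = rng.standard_normal((r, r, r))
Us = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(d)]
X = G
for mode, U in enumerate(Us):
    X = mode_product(X, U, mode)

# --- Sketching phase (single pass over X) ---
Omegas = [[np.eye(n) if i == j else rng.standard_normal((m, n)) for i in range(d)]
          for j in range(d)]                       # Omega^(j,i)
Phis = [rng.standard_normal((m_c, n)) for _ in range(d)]
B = []                                             # leave-one-out measurement tensors B_j
for j in range(d):
    Bj = X
    for i in range(d):
        Bj = mode_product(Bj, Omegas[j][i], i)
    B.append(Bj)
B_core = X
for i in range(d):
    B_core = mode_product(B_core, Phis[i], i)

# --- Recovery phase (Algorithm 3.1; no further access to X) ---
Qs = []
for j in range(d):
    Fj = np.linalg.solve(Omegas[j][j], np.moveaxis(B[j], j, 0).reshape(n, -1))
    U, _, _ = np.linalg.svd(Fj, full_matrices=False)
    Qs.append(U[:, :r])                            # r leading left singular vectors

H = B_core
for i in range(d):
    Hi = np.moveaxis(H, i, 0).reshape(H.shape[i], -1)              # unfold along mode i
    Hi_new = np.linalg.lstsq(Phis[i] @ Qs[i], Hi, rcond=None)[0]   # undo Phi_i Q_i
    new_shape = (r,) + tuple(np.delete(np.array(H.shape), i))
    H = np.moveaxis(Hi_new.reshape(new_shape), 0, i)               # fold back

X_hat = H
for i in range(d):
    X_hat = mode_product(X_hat, Qs[i], i)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))               # ~ 0 for exactly low-rank X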
3.1.6 Projection Cost Preserving (PCP) Sketches

The following definition and related results are needed to conduct our probabilistic error analysis of the various leave-one-out measurements and Algorithm 3.1. We will rely heavily on definitions from randomized numerical linear algebra and compressive sensing that were introduced in Chapter 1.

Definition 3.2 (An $(\epsilon, c, r)$-Projection Cost Preserving (PCP) sketch). Let $\epsilon, c > 0$ and $r \in \mathbb{N}$. A matrix $\tilde{X} \in \mathbb{R}^{n \times m}$ is an $(\epsilon, c, r)$-PCP sketch of $X \in \mathbb{R}^{n \times N}$ if for all orthogonal projection matrices $P \in \mathbb{R}^{n \times n}$ with rank at most $r$,
$$(1-\epsilon)\|X - PX\|_F^2 \leq \left\|\tilde{X} - P\tilde{X}\right\|_F^2 + c \leq (1+\epsilon)\|X - PX\|_F^2 \tag{3.7}$$
holds.

Note that the above property first appeared in this form in [13, Definition 1] (see also, however, [17, Definition 13] for the statement of an equivalent property in a different form that appeared earlier). The next lemma can be used to construct random matrices that are PCP sketches of a given matrix $X$ with high probability. Before the lemma can be stated, however, we will need one additional definition.

Definition 3.3 (Head-Tail Split). For any $A \in \mathbb{R}^{m \times n}$, we can split $A$ into its leading $r$-term and its tail $(n-r)$-term Singular Value Decomposition (SVD) components. That is, consider the SVD $A = U\Sigma V^T$. For any $r \leq \mathrm{rank}(A)$, let $U_r \in \mathbb{R}^{m \times r}$ and $V_r \in \mathbb{R}^{n \times r}$ denote the first $r$ columns of $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$, respectively. We then define $A_r := U_r U_r^T A = A V_r V_r^T$ to be $A$'s best rank $r$ approximation with respect to $\|\cdot\|_F$. Furthermore, we denote the tail term by $A_{\backslash r} := A - A_r$.

We will now detail how matrices with the OSE and AMM properties (recall Definitions 1.23 and 1.25), interacting with matrices derived from $X \in \mathbb{R}^{n \times N}$, yield PCP sketches of $X$ with high probability. Variants of the following result are proven in [11, 53]. We include the proof in Section 3.5.2 for the sake of completeness.

Theorem 3.2 (Projection-Cost-Preservation via the AMM and OSE properties). Let $X \in \mathbb{R}^{n \times N}$ of rank $\tilde{r} \leq \min\{n, N\}$ have the full SVD $X = U\Sigma V^T$, and let $V_{r'} \in \mathbb{R}^{N \times r'}$ denote the first $r'$ columns of $V \in \mathbb{R}^{N \times N}$ for all $r' \in [N]$. Fix $r \in [n]$ and consider the head-tail split $X = X_r + X_{\backslash r}$. If $\Omega \in \mathbb{R}^{m \times N}$ satisfies
1. the subspace embedding property (1.10) with $\epsilon \leftarrow \frac{\epsilon}{3}$ for $A \leftarrow X_r^T$,
2. the approximate multiplication property (1.13) with $\epsilon \leftarrow \frac{\epsilon}{6\sqrt{\min\{r, \tilde{r}\}}}$ for $A \leftarrow X_{\backslash r}$ and $B \leftarrow V_{\min\{r, \tilde{r}\}}$,
3. the JL property (1.8) with $\epsilon \leftarrow \frac{\epsilon}{6}$ for $S \leftarrow \{\text{the } n \text{ columns of } X_{\backslash r}^T\}$, and
4. the approximate multiplication property (1.13) with $\epsilon \leftarrow \frac{\epsilon}{6\sqrt{r}}$ for $A \leftarrow X_{\backslash r}$ and $B \leftarrow X_{\backslash r}^T$,
then $\tilde{X} := X\Omega^T$ is an $(\epsilon, 0, r)$-PCP sketch of $X$.

Proof. See Section 3.5.2.

The following lemma can be used to construct PCP sketches from random matrices with the JL property.

Lemma 3.1 (The JL property provides PCP sketches). Let $X \in \mathbb{R}^{n \times N}$ have rank $\tilde{r} \leq \min\{n, N\}$. Fix $r \in [n]$. There exist finite sets $S_1, S_2 \subset \mathbb{R}^N$ (determined entirely by $X$) with cardinalities $|S_1| \leq \left(\frac{141}{\epsilon}\right)^{\min\{r, \tilde{r}\}}$ and $|S_2| \leq 16n^2 + n$ such that the following holds: if a random matrix $\Omega \in \mathbb{R}^{m \times N}$ has both the $\left(\frac{\epsilon}{6}, \frac{\delta}{2}, \left(\frac{141}{\epsilon}\right)^r\right)$-JL property for $S_1$ and the $\left(\frac{\epsilon}{6\sqrt{r}}, \frac{\delta}{2}, 16n^2 + n\right)$-JL property for $S_2$, then $X\Omega^T$ will be an $(\epsilon, 0, r)$-PCP sketch of $X$ with probability at least $1 - \delta$.

Proof. To ensure property 1 of Theorem 3.2 we can appeal to Lemma 1.6 to see that it suffices for $\Omega$ to be an $(\epsilon/6)$-JL-embedding of a minimal $\left(\frac{\epsilon}{48}\right)$-cover of the at most $\min\{r, \tilde{r}\}$-dimensional unit ball in the column space of $X_r^T$. Letting $S_1$ be this aforementioned $\left(\frac{\epsilon}{48}\right)$-cover, we can further see that $|S_1| \leq (141/\epsilon)^{\min\{r, \tilde{r}\}}$ by the proof of Corollary 1.1. Hence, if $\Omega \in \mathbb{R}^{m \times N}$ has the $\left(\frac{\epsilon}{6}, \frac{\delta}{2}, \left(\frac{141}{\epsilon}\right)^r\right)$-JL property for $S_1$, then property 1 of Theorem 3.2 will be satisfied with probability at least $1 - \frac{\delta}{2}$.

Applying Lemma 1.8, one can see that there exist sets $S_2', S_2'' \subset \mathbb{R}^N$ with $|S_2'| \leq 2(n + \min\{r, \tilde{r}\})^2 \leq 8n^2$ and $|S_2''| \leq 2(n + n)^2 = 8n^2$ such that an $(\epsilon/6\sqrt{r})$-JL-embedding of $S_2' \cup S_2''$ will satisfy both properties 2 and 4 of Theorem 3.2. Hence, since $r \geq 1$, we can see that an $(\epsilon/6\sqrt{r})$-JL-embedding of $S_2 := S_2' \cup S_2'' \cup S$ will satisfy Theorem 3.2's properties 2 - 4, where $S$ is defined as per property 3. Noting that $|S_2| \leq |S_2'| + |S_2''| + |S| \leq 16n^2 + n$, we can now see that $\Omega$ will satisfy all of Theorem 3.2's properties 2 - 4 with probability at least $1 - \frac{\delta}{2}$ if it has the $\left(\frac{\epsilon}{6\sqrt{r}}, \frac{\delta}{2}, 16n^2 + n\right)$-JL property for $S_2$.

Concluding, the prior two paragraphs in combination with the union bound imply that all of Theorem 3.2's properties 1 - 4 will hold with probability at least $1 - \delta$ if $\Omega$ has both the $\left(\frac{\epsilon}{6}, \frac{\delta}{2}, \left(\frac{141}{\epsilon}\right)^r\right)$-JL property for $S_1$ and the $\left(\frac{\epsilon}{6\sqrt{r}}, \frac{\delta}{2}, 16n^2 + n\right)$-JL property for $S_2$. An application of Theorem 3.2 now finishes the proof.

Using Lemma 3.1 one can now prove the following corollary of Theorem 1.1, which demonstrates the existence of a PCP sketch for any fixed matrix $X$.

Corollary 3.1. Fix $X \in \mathbb{R}^{n \times N}$ and $r \in [n]$. Let $\Omega \in \mathbb{R}^{m \times N}$ be a random matrix with mean zero, variance $\frac{1}{m}$, independent sub-Gaussian entries. Then, $X\Omega^T$ will be an $(\epsilon, 0, r)$-PCP sketch of $X$ with probability at least $1 - \delta$ provided that
$$m \geq \frac{C r}{\epsilon^2} \max\left\{ \ln\left(\frac{C_1}{\epsilon\delta}\right),\ \ln\left(\frac{C_2 n}{\delta}\right) \right\},$$
where $C_1, C_2 > 0$ are absolute constants, and $C > 0$ is an absolute constant that depends only on the sub-Gaussian norms (see Definition 1.22) of $\Omega$'s entries.

Proof. Apply Theorem 1.1 to the finite set $S_1$ guaranteed by Lemma 3.1 with $\epsilon \leftarrow \frac{\epsilon}{6}$ and $\delta \leftarrow \frac{\delta}{2}$. Similarly, apply Theorem 1.1 to the finite set $S_2$ guaranteed by Lemma 3.1 with $\epsilon \leftarrow \frac{\epsilon}{6\sqrt{r}}$ and $\delta \leftarrow \frac{\delta}{2}$. The result now follows by Lemma 3.1.
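The PCP property of Definition 3.2 can be probed numerically. The sketch below is illustrative only: it tests a small batch of random rank-$r$ projections rather than all projections, for an arbitrary synthetic matrix, and checks that a sub-Gaussian sketch $X\Omega^T$ of the kind covered by Corollary 3.1 keeps the projection costs $\|X - PX\|_F^2$ within a narrow multiplicative band.

import numpy as np

rng = np.random.default_rng(3)
n, N, r, m = 50, 2000, 3, 400

# A test matrix with decaying spectrum (so the rank-r tail is non-trivial).
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((N, n)))
S = np.diag(1.0 / np.arange(1, n + 1))
X = U @ S @ V.T

# Sub-Gaussian sketching matrix with i.i.d. mean-zero, variance-1/m entries.
Omega = rng.standard_normal((m, N)) / np.sqrt(m)
X_sk = X @ Omega.T

ratios = []
for _ in range(20):
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))   # random rank-r projector P = QQ^T
    P = Q @ Q.T
    cost = np.linalg.norm(X - P @ X, "fro") ** 2
    cost_sk = np.linalg.norm(X_sk - P @ X_sk, "fro") ** 2
    ratios.append(cost_sk / cost)
print(min(ratios), max(ratios))   # both close to 1, i.e. within a (1 +/- epsilon) band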
We finish here by noting that Corollary 3.1 is just one example of a PCP sketching result that one can prove with relative ease using Lemma 3.1. Indeed, Lemma 3.1 can be combined with other standard results concerning more structured matrices with the JL property (see, e.g., [38, 30, 3, 27]) to produce similar theorems where $\Omega$ has a fast matrix-vector multiply.

3.2 Proofs of Our Main Results on Leave-One-Out Measurements and Recovery

The objective is to show that we can retrieve an accurate low rank Tucker approximation of a tensor $\mathcal{X}$ via Algorithm 3.1 from valid sets of leave-one-out and core measurements as described in Section 3.1. We will denote the approximation of $\mathcal{X}$ output by Algorithm 3.1 as $\mathcal{X}_1$ here to emphasize that a single pass over the original data tensor $\mathcal{X}$ suffices in order to compute the linear input measurements required by Algorithm 3.1. Hence, Algorithm 3.1 in this setting is an example of a streaming algorithm which does not need to store a copy of the original uncompressed tensor $\mathcal{X}$ in memory in order to successfully approximate it. Nonetheless, we wish to show that this algorithm still produces a quasi-optimal approximation of $\mathcal{X}$ in the sense of (1.7) with high probability when given such highly compressed linear input measurements. Particular choices of measurement ensembles will make explicit the dependence on the other parameters of the problem (these choices define specializations of Algorithm 3.1, and are summarized as Algorithms 3.9 and 3.11 in Section 3.6).

More specifically, in this section we will show that, with high probability, for a given $d$-mode tensor $\mathcal{X}$, error tolerance $\epsilon > 0$, and chosen rank truncation parameter $r$,
$$\|\mathcal{X} - \mathcal{X}_1\|_2 \leq (1 + e^\epsilon)\sqrt{\frac{1+\epsilon}{1-\epsilon}\sum_{j=1}^{d} \Delta_{r,j}} \tag{3.8}$$
will hold whenever Algorithm 3.1 is provided with sufficiently informative input measurements. Here the $\Delta_{r,j}$ are defined as per Lemma 1.5, and "sufficiently informative" means that $(i)$ the leave-one-out measurements used to form $\mathcal{X}_1$ are of sufficient size to satisfy several PCP properties, and $(ii)$ the core measurements used to form $\mathcal{X}_1$ are of sufficient size to ensure the accurate solution of the least squares problems computed as part of Algorithm 3.1. Finally, we note that one can see from (3.8) together with Lemma 1.5 that Algorithm 3.1 will perfectly recover exactly low Tucker-rank tensors if the rank parameter $r$ is made sufficiently large.

In order to prove that (3.8) holds, we will also need to consider a weaker variant of Algorithm 3.1 which permits a second pass over the data tensor $\mathcal{X}$. These weaker algorithms will first compute the estimates $Q_i$ of the factors of the tensor as Algorithm 3.1 does, but thereafter will be allowed to use those factors to operate on the original tensor $\mathcal{X}$ in order to approximate its core (see Algorithms 3.10 and 3.12). We denote the estimate of the tensor that results from this procedure as $\mathcal{X}_2$ to emphasize that it requires a second access to the original data during core recovery. Note that such two-pass algorithms are of less practical value in the big data and compressive sensing settings, since it is often not possible to directly access the data tensor again after the initial compressed measurements have been taken in these scenarios. Nevertheless, this two-pass estimate will be extremely useful when proving (3.8).
In particular, our proof will result from the following triangle inequality:
$$\|\mathcal{X} - \mathcal{X}_1\|_2 = \|\mathcal{X} - \mathcal{X}_1 + \mathcal{X}_2 - \mathcal{X}_2\|_2 \leq \underbrace{\|\mathcal{X} - \mathcal{X}_2\|_2}_{\text{Term I}} + \underbrace{\|\mathcal{X}_1 - \mathcal{X}_2\|_2}_{\text{Term II}}. \tag{3.9}$$
Bounding Term I in (3.9) will be the subject of Section 3.2.1. As we shall see, bounding Term I is straightforward once we know that the leave-one-out measurements yield various PCP properties, and so the main work of Sections 3.2.2 and 3.2.3 will be to demonstrate how, for the structured choices of measurement operators considered herein, we can ensure that the PCP property is satisfied. Bounding Term II, on the other hand, will require us to apply a bound on the error incurred by solving sketched least squares problems to a carefully partitioned re-expression of the Term II error. That argument is the subject of Section 3.2.4. Finally, we combine our analysis of these two error terms along with particular choices of measurement operators to state the full versions of our main results in Section 3.2.5.

3.2.1 Bounding $\|\mathcal{X} - \mathcal{X}_2\|_2$

In the two-pass scenario, we first compute estimates for the factor matrices $Q_i$ (see Algorithm 3.5) using leave-one-out measurements $B_i$ for each $i \in [d]$. Then, using these factor matrices, we wish to solve
$$\arg\min_{\mathcal{H}} \|\mathcal{X} - [[\mathcal{H}, Q_1, Q_2, \dots, Q_d]]\|_2.$$
One can see that the solution will be
$$\mathcal{G} := \mathcal{X} \times_1 Q_1^T \times_2 Q_2^T \cdots \times_d Q_d^T$$
(see, e.g., [16]). Let
$$\mathcal{X}_2 := \mathcal{G} \times_1 Q_1 \times_2 Q_2 \cdots \times_d Q_d \tag{3.10}$$
denote the estimate obtained from a two-pass recovery procedure (i.e., Algorithm 3.10 or 3.12). Additionally, we note the following fact about modewise products (see, e.g., [67, Lemma 1]): $\mathcal{X} \times_i A \times_i B = \mathcal{X} \times_i (BA)$. As a result, if we are permitted a second pass over $\mathcal{X}$ to compute the core, we have that
$$\mathcal{X}_2 = \left( \mathcal{X} \times_1 Q_1^T \times_2 Q_2^T \cdots \times_d Q_d^T \right) \times_1 Q_1 \times_2 Q_2 \cdots \times_d Q_d = \mathcal{X} \times_1 Q_1 Q_1^T \times_2 \cdots \times_d Q_d Q_d^T.$$
Using this expression we can now bound the two-pass error term $\|\mathcal{X} - \mathcal{X}_2\|_2$.

Theorem 3.3 (Error bound for the Two-Pass Estimate $\|\mathcal{X} - \mathcal{X}_2\|_2$). Suppose $\tilde{X}_{[j]} := X_{[j]}\Omega_{-j}^T \in \mathbb{R}^{n \times m^{d-1}}$ are $(\epsilon, 0, r)$-PCP sketches of $X_{[j]}$ for each $j \in [d]$. If $Q_j \in \mathbb{R}^{n \times r}$ for $r \leq m$ are factor matrices obtained from Algorithm 3.5, then
$$\|\mathcal{X} - \mathcal{X}_2\|_2 = \left\| \mathcal{X} - \mathcal{X} \times_1 Q_1 Q_1^T \times_2 \cdots \times_d Q_d Q_d^T \right\|_2 \leq \sqrt{\frac{1+\epsilon}{1-\epsilon}\sum_{j=1}^{d} \Delta_{r,j}}. \tag{3.11}$$

Proof. Since the $Q_i Q_i^T$ are orthogonal projectors, we have by [62, Theorem 5.1] that
$$\left\| \mathcal{X} - \mathcal{X} \times_1 Q_1 Q_1^T \times_2 \cdots \times_d Q_d Q_d^T \right\|_2^2 \leq \sum_{j=1}^{d} \left\| \mathcal{X} - \mathcal{X} \times_j Q_j Q_j^T \right\|_2^2. \tag{3.12}$$
From Algorithm 3.5 (recalling that $\Omega_{(j,j)}^{-1} B_j = X_{[j]}\Omega_{-j}^T = \tilde{X}_{[j]}$), we have that the $Q_j$'s are the best rank-$r$ approximations for their respective sketched problems, since
$$Q_j = \arg\min_{\substack{\mathrm{rank}(Q) = r \\ Q^T Q = I_r}} \left\| \tilde{X}_{[j]} - Q Q^T \tilde{X}_{[j]} \right\|_F$$
by the Eckart-Young Theorem. Now suppose that each $U_j \in \mathbb{R}^{n \times r}$ forms an optimal rank $r$ approximation of $X_{[j]}$ in the sense that
$$U_j = \arg\min_{\substack{\mathrm{rank}(U) = r \\ U^T U = I_r}} \left\| X_{[j]} - U U^T X_{[j]} \right\|_F.$$
By the hypothesis that $\tilde{X}_{[j]}$ is an $(\epsilon, 0, r)$-PCP sketch of $X_{[j]}$, we have that
$$(1-\epsilon)\left\| \mathcal{X} - \mathcal{X} \times_j Q_j Q_j^T \right\|_2^2 = (1-\epsilon)\left\| X_{[j]} - Q_j Q_j^T X_{[j]} \right\|_F^2 \leq \left\| \tilde{X}_{[j]} - Q_j Q_j^T \tilde{X}_{[j]} \right\|_F^2 \leq \left\| \tilde{X}_{[j]} - U_j U_j^T \tilde{X}_{[j]} \right\|_F^2 \leq (1+\epsilon)\left\| X_{[j]} - U_j U_j^T X_{[j]} \right\|_F^2 = (1+\epsilon)\,\Delta_{r,j},$$
where we have used the definition of $(\epsilon, 0, r)$-PCP sketches in the first and third inequalities. After a rearrangement of terms, substituting the above into (3.12) now yields the inequality in (3.11).

We have now established in Theorem 3.3 a quasi-optimal error bound for Term I in (3.9) whenever our leave-one-out measurement matrices $\Omega_{-j}^T$ yield $(\epsilon, 0, r)$-PCP sketches of all $d$ unfoldings $X_{[j]}$. Next, we will demonstrate how to ensure that Kronecker structured and Khatri-Rao structured leave-one-out measurement matrices provide PCP sketches.

3.2.2 PCP Sketches via Kronecker-Structured Leave-One-Out Measurement Matrices

In this section we study when Kronecker-structured measurement matrices will provide the PCP property. To begin, we will show that the JL and OSE properties are inherited under matrix direct sums and compositions. These are useful facts because our overall leave-one-out matrices can be constructed using these operations. In particular, we will follow the example in the last paragraph of Section 3.1.1 and consider a matrix $\Omega_{-j} \in \mathbb{R}^{m^{d-1} \times n^{d-1}}$ defined as
$$\Omega_{-j} = \bigotimes_{\substack{i=1 \\ i \neq j}}^{d} \Omega_i = \prod_{i'=1}^{d-1} \tilde{\Omega}_{i'} \quad \text{for } \Omega_i \in \mathbb{R}^{m \times n}, \tag{3.13}$$
where
$$\tilde{\Omega}_{i'} := \underbrace{I_n \otimes \cdots \otimes I_n}_{d-1-i'} \otimes\; \Omega_{i_j(i')} \otimes \underbrace{I_m \otimes \cdots \otimes I_m}_{i'-1} \in \mathbb{R}^{m^{i'} n^{d-1-i'} \times m^{i'-1} n^{d-i'}} \tag{3.14}$$
for
$$i_j(i') := \begin{cases} i' & \text{if } i' < j, \\ i' + 1 & \text{if } i' \geq j. \end{cases}$$
Here, $I_n$ denotes an $n \times n$ identity matrix.
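The factorization (3.13)-(3.14) is easy to sanity check numerically; the short NumPy sketch below (illustrative only) builds $\Omega_{-j}$ for a four mode example with $j = 2$ both as a single Kronecker product and as the product of the three auxiliary matrices $\tilde{\Omega}_{i'}$, and confirms that the two agree via the mixed-product property.

import numpy as np

rng = np.random.default_rng(4)
n, m = 4, 2
d, j = 4, 2                      # four modes, mode j = 2 is the one left out

# Component maps for the compressed modes (indices 1, 3, 4 in the text's notation).
Om1 = rng.standard_normal((m, n))
Om3 = rng.standard_normal((m, n))
Om4 = rng.standard_normal((m, n))

# Omega_{-j} as a single Kronecker product (3.13), with factors ordered as in the
# three mode illustration of Section 3.1.1 (highest mode index on the left).
Omega_minus_j = np.kron(Om4, np.kron(Om3, Om1))          # shape (m^3, n^3)

# The same map as the product of the auxiliary matrices (3.14).
I_n, I_m = np.eye(n), np.eye(m)
Om_t1 = np.kron(I_n, np.kron(I_n, Om1))                  # (m n^2) x n^3
Om_t2 = np.kron(I_n, np.kron(Om3, I_m))                  # (m^2 n) x (m n^2)
Om_t3 = np.kron(Om4, np.kron(I_m, I_m))                  # m^3 x (m^2 n)

print(np.allclose(Omega_minus_j, Om_t3 @ Om_t2 @ Om_t1))  # True (mixed-product property)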
The next three lemmas will be used to help show that $\Omega_{-j}$ as defined in (3.13) inherits both the JL and OSE properties from its component $\Omega_i$ matrices. Having established this, we can then use, e.g., Lemma 3.1 to prove PCP sketching results for such Kronecker-structured $\Omega_{-j}$.

Lemma 3.2. Suppose that $\Omega_1 \in \mathbb{R}^{m_1 \times N_1}$ and $\Omega_2 \in \mathbb{R}^{m_2 \times N_2}$ are two random matrices. Denote their matrix direct sum by $\Omega = \Omega_1 \oplus \Omega_2 \in \mathbb{R}^{(m_1 + m_2) \times (N_1 + N_2)}$. Then,
1. If $\Omega_1$ and $\Omega_2$ are $(\epsilon, \delta_1, p)$- and $(\epsilon, \delta_2, p)$-JLs respectively, then $\Omega$ is an $(\epsilon, \delta_1 + \delta_2, p)$-JL.
2. If $\Omega_1$ and $\Omega_2$ are $(\epsilon, \delta_1, r)$- and $(\epsilon, \delta_2, r)$-OSEs respectively, then $\Omega$ is an $(\epsilon, \delta_1 + \delta_2, r)$-OSE.

Proof. Part 1: Consider a set $S \subset \mathbb{R}^{N_1 + N_2}$ with cardinality $p$. Let $\mathbf{z} \in S$. Group the first $N_1$ coordinates of $\mathbf{z}$ into $\mathbf{x} \in \mathbb{R}^{N_1}$ and the last $N_2$ coordinates of $\mathbf{z}$ into $\mathbf{y} \in \mathbb{R}^{N_2}$. Observe that
$$\|\Omega \mathbf{z}\|_2^2 = \left\| \begin{bmatrix} \Omega_1 & 0 \\ 0 & \Omega_2 \end{bmatrix} \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \right\|_2^2 = \|\Omega_1 \mathbf{x}\|_2^2 + \|\Omega_2 \mathbf{y}\|_2^2 \leq (1+\epsilon)\left( \|\mathbf{x}\|_2^2 + \|\mathbf{y}\|_2^2 \right) = (1+\epsilon)\|\mathbf{z}\|_2^2$$
will hold whenever both $\|\Omega_1 \mathbf{x}\|_2^2 \leq (1+\epsilon)\|\mathbf{x}\|_2^2$ and $\|\Omega_2 \mathbf{y}\|_2^2 \leq (1+\epsilon)\|\mathbf{y}\|_2^2$ hold. The $(1-\epsilon)$-distortion lower bound is similar. As a result, we can use the union bound to see that $\Omega$ will have the $(\epsilon, \delta_1 + \delta_2, p)$-JL property.

Part 2: Suppose $X \in \mathbb{R}^{n \times (N_1 + N_2)}$ has rank $r$. Let $X_1$ and $X_2$ denote the sub-matrices of $X$ containing the first $N_1$ and last $N_2$ columns of $X$, respectively. Note that both $X_1$ and $X_2$ have rank at most $r$. Furthermore, note also that
$$\left\| \Omega X^T \mathbf{y} \right\|_2^2 = \left\| \begin{pmatrix} \Omega_1 X_1^T \mathbf{y} \\ \Omega_2 X_2^T \mathbf{y} \end{pmatrix} \right\|_2^2 \leq (1+\epsilon)\left( \left\| X_1^T \mathbf{y} \right\|_2^2 + \left\| X_2^T \mathbf{y} \right\|_2^2 \right) = (1+\epsilon)\left\| X^T \mathbf{y} \right\|_2^2$$
will hold for any arbitrary vector $\mathbf{y} \in \mathbb{R}^n$ whenever both $\|\Omega_1 X_1^T \mathbf{y}\|_2^2 \leq (1+\epsilon)\|X_1^T \mathbf{y}\|_2^2$ and $\|\Omega_2 X_2^T \mathbf{y}\|_2^2 \leq (1+\epsilon)\|X_2^T \mathbf{y}\|_2^2$ hold. The $(1-\epsilon)$-distortion lower bound is similar. As a result, we can see that $\Omega$ will be an $(\epsilon, \delta_1 + \delta_2, r)$-OSE by the union bound.

Note that there is no requirement that $\Omega_1$ and $\Omega_2$ be independent in Lemma 3.2. This is crucial for the next lemma, which will involve many copies of the same measurement matrix.

Lemma 3.3 (Direct Sums Inherit the OSE and JL Properties). For some $i' \in [d-1]$, let $\tilde{\Omega}_{i'}$ be defined as in (3.14) and set $\Omega := \Omega_{i_j(i')} \in \mathbb{R}^{m \times n}$.
1. If $\Omega$ has the $\left( \epsilon, \frac{\delta}{m^{i'-1} n^{d-i'-1}}, r \right)$-OSE property, then $\tilde{\Omega}_{i'}$ will have the $(\epsilon, \delta, r)$-OSE property.
2. If $\Omega$ has the $\left( \epsilon, \frac{\delta}{m^{i'-1} n^{d-i'-1}}, p \right)$-JL property, then $\tilde{\Omega}_{i'}$ will have the $(\epsilon, \delta, p)$-JL property.

Proof. First consider the following rearrangement of $\tilde{\Omega}_{i'}$ in (3.14):
$$\tilde{\Omega} := \underbrace{I_m \otimes I_m \otimes \cdots \otimes I_m}_{i'-1} \otimes \underbrace{I_n \otimes I_n \otimes \cdots \otimes I_n}_{d-1-i'} \otimes\; \Omega.$$
Note that the Kronecker product of two identity matrices is itself an identity matrix. Thus, we can rewrite this as simply
$$\tilde{\Omega} = \underbrace{I_m \otimes \cdots \otimes I_m}_{i'-1} \otimes \underbrace{I_n \otimes \cdots \otimes I_n}_{d-1-i'} \otimes\; \Omega = \begin{bmatrix} \Omega & 0 & \cdots & 0 \\ 0 & \Omega & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \Omega \end{bmatrix}.$$
That is, we have a block diagonal matrix with $\bar{m} = m^{i'-1} n^{d-i'-1}$ copies of $\Omega$ along its diagonal. Thus, if $\Omega$ has either the $(\epsilon, \delta/\bar{m}, r)$-OSE or the $(\epsilon, \delta/\bar{m}, p)$-JL property, repeated applications of Lemma 3.2 will establish the desired OSE or JL property for $\tilde{\Omega}$. Now consider $\tilde{\Omega}_{i'}$ as in (3.14).
There exist unitary (permutation) matrices $L$ and $R$ which interchange rows and columns such that
$$L \tilde{\Omega} R = L\left( \underbrace{I_m \otimes \cdots \otimes I_m}_{i'-1} \otimes \underbrace{I_n \otimes \cdots \otimes I_n}_{d-1-i'} \otimes\; \Omega \right) R = L\left( I_{m^{i'-1} n^{d-1-i'}} \otimes \Omega \right) R = \underbrace{I_n \otimes \cdots \otimes I_n}_{d-1-i'} \otimes\; \Omega \otimes \underbrace{I_m \otimes \cdots \otimes I_m}_{i'-1} = I_{n^{d-1-i'}} \otimes \Omega \otimes I_{m^{i'-1}} = \tilde{\Omega}_{i'}.$$
Noting that both the OSE and JL properties are invariant under unitary transformations of a given random matrix, one can now see that $\tilde{\Omega}_{i'} = L \tilde{\Omega} R$ will indeed have the same desired OSE or JL property as was established for $\tilde{\Omega}$.

Lemma 3.3 allows us to infer JL and OSE properties of the $\tilde{\Omega}_{i'}$ matrices in (3.14) from the properties of the smaller random matrices $\Omega_i \in \mathbb{R}^{m \times n}$ appearing in (3.13). The next lemma will allow us to then use these inferred properties of the $\tilde{\Omega}_{i'}$ matrices to derive OSE and JL properties for $\Omega_{-j}$ from (3.13) in terms of the properties of its components $\Omega_i \in \mathbb{R}^{m \times n}$.

Lemma 3.4 (A Composition Lemma for the OSE and JL Properties). Let $\epsilon \in (0,1)$ and $\tilde{\Omega}_{i'} \in \mathbb{R}^{m^{i'} n^{d-1-i'} \times m^{i'-1} n^{d-i'}}$ for $i' \in [d-1]$.
1. If $\tilde{\Omega}_{i'}$ is an $\left( \frac{\epsilon}{2(d-1)}, \frac{\delta}{d-1}, r \right)$-OSE for all $i' \in [d-1]$, then $\tilde{\Omega} = \prod_{i'=1}^{d-1} \tilde{\Omega}_{i'}$ is an $(\epsilon, \delta, r)$-OSE.
2. If $\tilde{\Omega}_{i'}$ is an $\left( \frac{\epsilon}{2(d-1)}, \frac{\delta}{d-1}, p \right)$-JL for all $i' \in [d-1]$, then $\tilde{\Omega} = \prod_{i'=1}^{d-1} \tilde{\Omega}_{i'}$ is an $(\epsilon, \delta, p)$-JL.

Proof. Part 1: Let $Y^T \in \mathbb{R}^{n^{d-1} \times n}$ be an arbitrary matrix of rank at most $r$. Denote $\tilde{Y}_{i'} = \left( \tilde{\Omega}_{i'} \cdots \tilde{\Omega}_1 \right) Y^T \in \mathbb{R}^{m^{i'} n^{d-1-i'} \times n}$ for $i' \in [d-1]$. Note that each $\tilde{Y}_{i'}$ has rank at most $r$. Fix some $\mathbf{z} \in \mathbb{R}^n$. Supposing for the moment that (1.10) holds for each $i' \in [d-1]$ with $\Omega = \tilde{\Omega}_{i'}$, $A = \tilde{Y}_{i'-1}$, and $\mathbf{x} = \mathbf{z}$, we have that
$$\left\| \tilde{\Omega} Y^T \mathbf{z} \right\|_2^2 = \left\| \tilde{\Omega}_{d-1} \left( \tilde{\Omega}_{d-2} \cdots \tilde{\Omega}_1 \right) Y^T \mathbf{z} \right\|_2^2 = \left\| \tilde{\Omega}_{d-1} \tilde{Y}_{d-2} \mathbf{z} \right\|_2^2 \leq \left( 1 + \frac{\epsilon}{2(d-1)} \right) \left\| \tilde{Y}_{d-2} \mathbf{z} \right\|_2^2 = \left( 1 + \frac{\epsilon}{2(d-1)} \right) \left\| \tilde{\Omega}_{d-2} \tilde{Y}_{d-3} \mathbf{z} \right\|_2^2 \leq \cdots \leq \left( 1 + \frac{\epsilon}{2(d-1)} \right)^{d-1} \left\| Y^T \mathbf{z} \right\|_2^2 \leq \frac{1}{1 - \epsilon/2} \left\| Y^T \mathbf{z} \right\|_2^2 \leq (1+\epsilon)\left\| Y^T \mathbf{z} \right\|_2^2,$$
where we have used the general bound $(1 + k/n)^n \leq e^k \leq (1-k)^{-1}$ for $k \in [0,1)$ in the second to last inequality. Similarly, for a lower bound one can see that
$$\left\| \tilde{\Omega} Y^T \mathbf{z} \right\|_2^2 \geq \left( 1 - \frac{\epsilon}{2(d-1)} \right)^{d-1} \left\| Y^T \mathbf{z} \right\|_2^2 \geq (1-\epsilon)\left\| Y^T \mathbf{z} \right\|_2^2.$$
Union bounding over the failure probability that (1.10) holds for each $i' \in [d-1]$ as supposed above now yields the desired result.

Part 2: An essentially identical argument applies to obtain the desired JL property result.

We now have all the necessary results to show how the component maps $\Omega_i$ of a Kronecker structured measurement ensemble as in (3.13) can guarantee a Kronecker sketch with the projection cost preserving property.

Theorem 3.4 (Kronecker Products of JL Matrices Yield PCP Sketches). Let $\epsilon \in (0,1)$, let $X \in \mathbb{R}^{n \times n^{d-1}}$ have rank $r \in [n]$, and let $\Omega_{-j} \in \mathbb{R}^{m^{d-1} \times n^{d-1}}$ be defined as in (3.13) and (3.14). Furthermore, suppose that the $\Omega_i \in \mathbb{R}^{m \times n}$ in (3.13) have both
1. the $\left( \frac{\epsilon}{12(d-1)}, \frac{\delta}{2(d-1)n^{d-2}}, \left( \frac{141}{\epsilon} \right)^r \right)$-JL property, and
2. the $\left( \frac{\epsilon}{12\sqrt{r}(d-1)}, \frac{\delta}{2(d-1)n^{d-2}}, 16n^2 + n \right)$-JL property
for all $i' \in [d-1]$. Then, $X\Omega_{-j}^T$ will be an $(\epsilon, 0, r)$-PCP sketch of $X$ with probability at least $1 - \delta$.

Proof. By Lemma 3.1 we know that $X\Omega_{-j}^T$ will be an $(\epsilon, 0, r)$-PCP sketch of $X$ with probability at least $1 - \delta$ if $\Omega_{-j}$ has both the $\left( \frac{\epsilon}{6}, \frac{\delta}{2}, \left( \frac{141}{\epsilon} \right)^r \right)$-JL property and the $\left( \frac{\epsilon}{6\sqrt{r}}, \frac{\delta}{2}, 16n^2 + n \right)$-JL property. In fact, by Lemma 3.4 we can further see that it suffices for the $\tilde{\Omega}_{i'}$ from (3.13) and (3.14) to have both
1. the $\left( \frac{\epsilon}{12(d-1)}, \frac{\delta}{2(d-1)}, \left( \frac{141}{\epsilon} \right)^r \right)$-JL property, and
2. the $\left( \frac{\epsilon}{12\sqrt{r}(d-1)}, \frac{\delta}{2(d-1)}, 16n^2 + n \right)$-JL property
for all $i' \in [d-1]$. Finally, looking now at Lemma 3.3 for each $i' \in [d-1]$, we can see that the assumed properties of the $\Omega_i \in \mathbb{R}^{m \times n}$ in (3.13) will guarantee both of these sufficient conditions.

The following corollary of Theorem 3.4 guarantees a Kronecker sketch with the projection cost preserving property when the component matrices $\Omega_i$ in (3.13) are sub-Gaussian random matrices.

Corollary 3.2 (Kronecker Products of Sub-Gaussian Matrices Yield PCP Sketches).
Suppose X is a real valued ๐‘‘-mode tensor with side-lengths all equal to ๐‘›. Let ๐œ– โˆˆ (0, 1), ๐›ฟ โˆˆ (0, 1), ๐‘Ÿ โˆˆ [๐‘›], ๐‘— โˆˆ [๐‘‘]. If ฮฉโˆ’ ๐‘— = (cid:203)๐‘‘ ฮฉ๐‘– โˆˆ R๐‘š๐‘‘โˆ’1ร—๐‘›๐‘‘โˆ’1 defined as in (3.13) with random matrices ฮฉ๐‘– โˆˆ R๐‘šร—๐‘› ๐‘–=1 ๐‘–โ‰  ๐‘— having i.i.d centered variance ๐‘šโˆ’1, sub-Gaussian entries such that ๐‘š โ‰ฅ max (cid:26) ๐ถ1๐‘Ÿ (๐‘‘ โˆ’ 1)2 ๐œ– 2 (cid:18) ๐‘›๐‘‘ (๐‘‘ โˆ’ 1) ๐›ฟ (cid:19) , ๐ถ2(๐‘‘ โˆ’ 1)2 ๐œ– 2 ln (cid:18)(cid:18) 141 ๐œ– (cid:19)๐‘Ÿ ๐‘›๐‘‘โˆ’2(๐‘‘ โˆ’ 1) ๐›ฟ ln (cid:19)(cid:27) for absolute constants ๐ถ1, ๐ถ2 > 0 then the sketched unfolding หœ๐‘‹[ ๐‘—] = ๐‘‹[ ๐‘—]ฮฉ๐‘‡ (๐œ–, 0, ๐‘Ÿ)-PCP sketch of ๐‘‹[ ๐‘—] with probability at least 1 โˆ’ ๐›ฟ. โˆ’ ๐‘— โˆˆ R๐‘›ร—๐‘š๐‘‘โˆ’1 is an Proof. To obtain the first quantity maximized over we apply Theorem 1.1 with ๐œ– โ† ๐œ– ๐‘Ÿ (๐‘‘โˆ’1) , โˆš 2(๐‘‘โˆ’1)๐‘›๐‘‘โˆ’2 , and |๐‘†| โ† 16๐‘›2 + ๐‘›. Similarly, for the second quantity maximized over we apply . The result now follows from Theorem 1.1 with ๐œ– โ† ๐œ– ๐›ฟ โ† (cid:17)๐‘Ÿ 12 ๐›ฟ ๐›ฟ 2(๐‘‘โˆ’1)๐‘›๐‘‘โˆ’2 , and |๐‘†| โ† 12(๐‘‘โˆ’1) , ๐›ฟ โ† (cid:16) 141 ๐œ– Theorem 3.4. Remark 3.3. Note that Theorem 3.4 is quite general, requiring only that the random matrices ฮฉ๐‘– in (3.13) should be drawn from some distribution having a couple JL properties. As a result of this generality, its Corollary 3.2 concerning sub-Gaussian component matrices turns out to be sub-optimal by (at least a) factor of ๐‘‘ in that setting. To obtain a slightly sharper result in ๐‘‘ for sub- Gaussian ฮฉ๐‘– we recommend replacing our implicit use of Lemma 3.3 in the proof of Corollary 3.2 (via Theorem 3.4) with [33, Lemma 14] instead. 3.2.3 PCP Sketches via Khatri-Rao Structured Leave-one-out Measurement Matrices In this section we study how to ensure that Khatri-Rao structured leave-one-out measurement matrices will provide the PCP property. To start we will first show that random Khatri-Rao structured measurement maps, denoted in this section by ฮฉโˆ’ ๐‘— = 1 ๐‘š ฮฉ1 โ€ข ฮฉ2 โ€ข . . . ฮฉ ๐‘—โˆ’1 โ€ข ฮฉ ๐‘—+1 โ€ข ฮฉ๐‘‘ where ฮฉ๐‘– โˆˆ R๐‘šร—๐‘› โˆ€๐‘– โˆˆ [๐‘‘] \ { ๐‘— }, will have the JL property whenever all their component matrices ฮฉ๐‘– have i.i.d. sub-Gaussian entries. Having established this, we can then use, e.g., Lemma 3.1 to prove PCP sketching results for such Khatri-Rao Structured ฮฉโˆ’ ๐‘— . 75 ๐‘š ฮฉ1 โ€ข ฮฉ2 โ€ข . . . ฮฉ ๐‘—โˆ’1 โ€ข ฮฉ ๐‘—+1 โ€ข ฮฉ๐‘‘ where all the Theorem 3.5. Let ๐œ– > 0, 0 < ๐›ฟ โ‰ค ๐‘’โˆ’2, and ฮฉโˆ’ ๐‘— = 1 ฮฉ๐‘– โˆˆ R๐‘šร—๐‘› for ๐‘– โˆˆ [๐‘‘] \ { ๐‘— } have i.i.d. mean zero, variance one, sub-Gaussian entries. Then ฮฉโˆ’ ๐‘— is an (๐œ–, ๐›ฟ, ๐‘˜)-JL whenever ๐‘š โ‰ฅ ๐ถ ๐‘‘โˆ’1 max (cid:40) ๐œ– โˆ’2 log ๐‘˜ ๐›ฟ , ๐œ– โˆ’1 (cid:18) log ๐‘˜ ๐›ฟ (cid:19) ๐‘‘โˆ’1(cid:41) for a constant ๐ถ โˆˆ R+ that depends only on the sub-Gaussian norm of the i.i.d. ฮฉ๐‘–-entries. The proof of Theorem 3.5 largely follows the argument proposed in Section 2 of [1] concerning the so-called ๐‘-moment JL property of Khatri-Rao structured measurements. We note that Kro- necker products of sub-Gaussian vectors are not sub-Gaussian in general, so the general idea is to use Markovโ€™s inequality for higher moments of the norm of ฮฉโˆ’ ๐‘— y with a fixed y โˆˆ R๐‘› to obtain the desired result. To proceed with the argument, we will need the following two concentration results. 
To proceed with the argument, we will need the following two concentration results. Before stating them, we recall the standard definition that the $p$-th root of the $p$-th (absolute) moment of a random variable is referred to as the $L^p$-norm of the random variable; for a random variable $X$ it is denoted
\[
\|X\|_{L^p} := \left( \mathbb{E}|X|^p \right)^{1/p} . \tag{3.15}
\]
This is of course because the expectation of a random variable on a probability space is by definition the Lebesgue integral of a function from the probability space to the reals, and so the usual theory of Lebesgue integration and $L^p$ spaces applies to random variables. Now, onto our needed concentration results.

Lemma 3.5 (Lemma 19 in [33]). Let $\mathcal{Y}$ be a $(d-1)$-mode tensor with side lengths of size $n$, let $p \ge 1$, and let $\Omega_j(i,:) \in \mathbb{R}^n$ for $j \in [d-1]$, $i \in [m]$ be independent random vectors each satisfying the Khintchine inequality $\left\| \langle \Omega_j(i,:), \mathbf{y} \rangle \right\|_{L^p} \le C_p \|\mathbf{y}\|_2$ for any vector $\mathbf{y} \in \mathbb{R}^n$, where $C_p$ is a constant depending only on $p$. Then
\[
\left\| \langle \Omega_1(i,:) \otimes \Omega_2(i,:) \otimes \cdots \otimes \Omega_{d-1}(i,:),\, \operatorname{vec}(\mathcal{Y}) \rangle \right\|_{L^p} \le C_p^{d-1} \|\mathcal{Y}\|_2 .
\]

Lemma 3.6 (Corollary 2 in [41]). If $p \ge 2$ and $Z, Z_1, \ldots, Z_m$ are i.i.d. symmetric random variables, then we have
\[
\left\| \sum_{i=1}^m Z_i \right\|_{L^p} \le C \sup_{s \in [\max\{2,\, p/m\},\, p]} \left\{ \frac{p}{s} \left( \frac{m}{p} \right)^{1/s} \|Z\|_{L^s} \right\}.
\]
Here $C > 0$ is an absolute constant.

In particular, we will utilize the following corollary of Lemma 3.6.

Corollary 3.3. If, under the conditions of Lemma 3.6, we additionally know that $\|Z\|_{L^s} \le (Cs)^{d-1}$, then
\[
\left\| \frac{1}{m} \sum_{i=1}^m Z_i \right\|_{L^p} \le C^{d-1} \max\left\{ 2^{d-1} \sqrt{\frac{p}{m}},\ \frac{(pe)^{d-1}}{m} \right\}.
\]

Proof. The proof of this corollary loosely follows the argument presented in [1]. Since $\|Z\|_{L^s} \le (Cs)^{d-1}$ and $Z_i \sim Z$, the expression over which we are taking the supremum in Lemma 3.6 is bounded (up to constants) by a function of $s$ whose derivative changes sign at most once on the interval of interest. That is, the derivative
\[
\frac{\partial}{\partial s} \left[ \frac{p}{s} \left( \frac{m}{p} \right)^{1/s} s^{d-1} \right]
= p\, s^{d-4} \left( \frac{m}{p} \right)^{1/s} \left( (d-2)s - \log\frac{m}{p} \right)
\]
has at most a single root in the interval of interest, at $s = \frac{\log(m/p)}{d-2}$. Noting the sign change at this root, we conclude that the maximum value must occur at the endpoints of the interval, and cannot occur at the critical point that is interior to the interval. Evaluating the function of interest at $s = 2$ we obtain $2^{d-2}\sqrt{mp}$; we will further upper-bound this by $2^{d-1}\sqrt{mp}$ in order to simplify the analysis of the right endpoint. Additionally, since we are interested only in an upper bound, we need not evaluate the possible endpoint $s = p/m$, since if $2 < p/m$ we are only enlarging the interval over which we are maximizing by instead considering $s = 2$. We will now bound the expression at the right endpoint, $s = p \ge 2$. Clearly $(1/p)^{1/p} \le 1$, and thus
\[
\left( \frac{m}{p} \right)^{1/p} p^{d-1} \le m^{1/p} p^{d-1} .
\]
If the function value at the right endpoint actually dominates $2^{d-1}\sqrt{mp}$ (i.e., our upper bound on the function value at the left endpoint), then $m^{1/p} p^{d-1} \ge 2^{d-1}\sqrt{mp}$ must hold. We will now use this assumption to remove the dependence on $m$ in our current upper bound for the function value at the right endpoint, after noting that doing so will still yield a valid upper bound whenever the function's value at the right endpoint fails to already be bounded by $2^{d-1}\sqrt{mp}$. Proceeding as planned, our assumption yields that
\[
m^{1/2 - 1/p} \le \left( \frac{p}{2} \right)^{d-1} p^{-1/2} .
\]
Rearranging terms, we get that
\[
m^{1/p} \le \left[ \left( \frac{p}{2} \right)^{d-1} p^{-1/2} \right]^{\frac{2}{p-2}}
= \left[ \left( \frac{p}{2} \right)^{2d-2} p^{-1} \right]^{\frac{1}{p-2}} .
\]
The factor $p^{-\frac{1}{p-2}}$ is less than one. A tedious calculation reveals that $\left( \frac{p}{2} \right)^{\frac{2d-2}{p-2}}$ is decreasing for all $p \ge 2$, and therefore $m^{1/p}$ is bounded by $\lim_{p \to 2} \left( \frac{p}{2} \right)^{\frac{2d-2}{p-2}} = e^{d-1}$. Maximizing over our two upper bounds and then dividing by $m$, we obtain the desired inequality.

Now we are ready to give a formal proof of Theorem 3.5.

Proof of Theorem 3.5. Note that the result is unchanged for any choice of mode to leave out, so we will shorten the notation and work with the unnormalized product $\Omega := \sqrt{m}\, \Omega_{-j}$ within this proof. Let $K$ be the sub-Gaussian norm of an entry of the $\Omega_j$'s, as in Definition 1.22. We aim to bound the probability that
\[
\left| \frac{1}{m} \|\Omega \mathbf{x}\|_2^2 - \|\mathbf{x}\|_2^2 \right| \ge \epsilon \|\mathbf{x}\|_2^2
\]
for a fixed $\mathbf{x} \in \mathbb{R}^{n^{d-1}}$. Without loss of generality, assume $\|\mathbf{x}\|_2 = 1$. Furthermore, note that
\[
\frac{1}{m} \|\Omega \mathbf{x}\|_2^2 - 1 = \frac{1}{m} \sum_{i=1}^m \left[ \langle \Omega(i,:), \mathbf{x} \rangle^2 - 1 \right] . \tag{3.16}
\]
Now, since the entries of each $\Omega_j$ are i.i.d. mean zero and variance one sub-Gaussian random variables, they satisfy Khintchine's inequality (see, e.g., [63]) in the form $\left\| \langle \Omega_j(i,:), \mathbf{y} \rangle \right\|_{L^p} \le C K \sqrt{p}\, \|\mathbf{y}\|_2$ for any fixed $\mathbf{y} \in \mathbb{R}^n$. By Lemma 3.5, this implies that the rows of $\Omega$ satisfy a generalized Khintchine inequality in the form
\[
\left\| \langle \Omega(i,:), \mathbf{x} \rangle \right\|_{L^p}
= \left\| \langle \Omega_1(i,:) \otimes \Omega_2(i,:) \otimes \cdots \otimes \Omega_{d-1}(i,:), \mathbf{x} \rangle \right\|_{L^p}
\le (C' p)^{\frac{d-1}{2}} \|\mathbf{x}\|_2 , \tag{3.17}
\]
where $C'$ is a new constant that only depends on $K$. To bound the $L^p$-norm of the sum in (3.16) we will now bound the $L^p$-norm of each summand. Using the centering Lemma 3.13 and continuing to estimate one term, we see that
\[
\begin{aligned}
\left\| \langle \Omega(i,:), \mathbf{x} \rangle^2 - 1 \right\|_{L^p}
&\le 2 \left\| \langle \Omega(i,:), \mathbf{x} \rangle^2 \right\|_{L^p}
 = 2 \left\| \langle \Omega(i,:), \mathbf{x} \rangle \right\|_{L^{2p}}^2 \\
&= 2 \left\| \langle \Omega_1(i,:) \otimes \Omega_2(i,:) \otimes \cdots \otimes \Omega_{d-1}(i,:), \mathbf{x} \rangle \right\|_{L^{2p}}^2
 \overset{(3.17)}{\le} 2 \left( (C' 2p)^{\frac{d-1}{2}} \right)^2 \|\mathbf{x}\|_2^2
 \le (C'' p)^{d-1} \|\mathbf{x}\|_2^2 = (C'' p)^{d-1} .
\end{aligned}
\]
We would now like to apply Corollary 3.3 to help bound the $L^p$-norm of the sum in (3.16). However, we need to symmetrize our random variables first. Toward that end, define
\[
Z_i = \rho_i \left( \langle \Omega(i,:), \mathbf{x} \rangle^2 - 1 \right), \tag{3.18}
\]
where the $\rho_i$ are i.i.d. Rademacher random variables.
Note that โˆฅ๐‘๐‘– โˆฅ ๐ฟ ๐‘ = โˆฅ (cid:0)โŸจฮฉ(๐‘–, :), xโŸฉ2 โˆ’ 1(cid:1) โˆฅ ๐ฟ ๐‘ โ‰ค (๐ถโ€ฒโ€ฒ๐‘)๐‘‘โˆ’1. Appealing now to Corollary 3.3 we have that, (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 ๐‘š ๐‘š โˆ‘๏ธ ๐‘–=1 ๐‘๐‘– (cid:13) (cid:13) (cid:13) (cid:13) (cid:13)๐ฟ ๐‘ โ‰ค (๐ถโ€ฒโ€ฒ)๐‘‘โˆ’1 max (cid:26) 2๐‘‘โˆ’1 โˆš๏ธ‚ ๐‘ ๐‘š , ( ๐‘๐‘’)๐‘‘โˆ’1 ๐‘š (cid:27) (3.19) for any ๐‘ โ‰ฅ 2. The upper bound in [43, Lemma 6.3] then further implies that (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 ๐‘š ๐‘š โˆ‘๏ธ ๐‘–=1 (cid:2)โŸจฮฉ(๐‘–, :), xโŸฉ2 โˆ’ 1(cid:3) (cid:13) (cid:13) (cid:13) (cid:13) (cid:13)๐ฟ ๐‘ โ‰ค 2 (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) 1 ๐‘š ๐‘š โˆ‘๏ธ ๐‘–=1 ๐‘๐‘– (cid:13) (cid:13) (cid:13) (cid:13) (cid:13)๐ฟ ๐‘ โ‰ค 2(๐ถโ€ฒโ€ฒ)๐‘‘โˆ’1 max (cid:26) 2๐‘‘โˆ’1 โˆš๏ธ‚ ๐‘ ๐‘š , ( ๐‘๐‘’)๐‘‘โˆ’1 ๐‘š (cid:27) . Employing Markovโ€™s inequality, we finally have that P (cid:40)(cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 1 ๐‘š ๐‘š โˆ‘๏ธ ๐‘–=1 (cid:2)โŸจฮฉ(๐‘–, :), xโŸฉ2 โˆ’ 1(cid:3) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:41) โ‰ฅ ๐œ– = P (cid:40)(cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 1 ๐‘š ๐‘š โˆ‘๏ธ ๐‘–=1 (cid:2)โŸจฮฉ(๐‘–, :), xโŸฉ2 โˆ’ 1(cid:3) (cid:41) โ‰ฅ ๐œ– ๐‘ ๐‘ (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) ๐‘š ) ๐‘/2, ( ๐‘๐‘’) ๐‘ (๐‘‘โˆ’1) ๐‘š ๐‘ ๐‘ (cid:111) . ๐œ– ๐‘ (cid:110) ( max โ‰ค (๐ถโ€ฒโ€ฒโ€ฒ) ๐‘(๐‘‘โˆ’1) Taking ๐‘ = log(๐‘˜/๐›ฟ) and ๐‘š โ‰ฅ หœ๐ถ ๐‘‘โˆ’1 max ๐‘’ (cid:26) ๐œ– โˆ’2 log ๐‘˜ ๐›ฟ , ๐œ– โˆ’1 (cid:16) log ๐‘˜ ๐›ฟ (cid:17) ๐‘‘โˆ’1(cid:27) , the last expression is upper bounded by ๐›ฟ/๐‘˜. Hence, ฮฉ is an (๐œ–, ๐›ฟ, ๐‘˜)-JL by the union bound over ๐‘˜ vectors. 79 Remark 3.4. In order to employ Lemmas 3.5 and 3.6, it is necessary that the rows ฮฉ ๐‘— (๐‘–, :) have independent and identical distributions and that these rows satisfy Khintchineโ€™s inequality. Assum- ing that the matrices ฮฉ๐‘– all have i.i.d sub-Gaussian entries as we have done in Theorem 3.5 implies both these necessary properties of the rows. However, we note that more general (though perhaps less natural) assumptions will also suffice. For example, the distributions of the i.i.d sub-Gaussian entries of the ฮฉ๐‘– may also vary by column. We can now use Theorem 3.5 to derive row bounds that guarantee that our Khatri-Rao structured sub-Gaussian measurement matrices will provide PCP sketches with high probability. Theorem 3.6. Let ๐œ– > 0, 0 < ๐›ฟ โ‰ค ๐‘’โˆ’2, and X be a ๐‘‘ mode tensor with side-lengths equal to ๐‘›. ๐‘š ฮฉ1 โ€ข ฮฉ2 โ€ข . . . ฮฉ ๐‘—โˆ’1 โ€ข ฮฉ ๐‘—+1 โ€ข . . . โ€ข ฮฉ๐‘‘ where the ฮฉ๐‘– โˆˆ R๐‘šร—๐‘› for Furthermore, suppose that ฮฉโˆ’ ๐‘— := 1 ๐‘– โˆˆ [๐‘‘] \ { ๐‘— } are as in Theorem 3.5 with ๐œ– โˆ’2 log ๐‘š โ‰ฅ ๐ถ ๐‘‘โˆ’1 max (cid:170) (cid:174) (cid:174) (cid:172) for a positive constant ๐ถ โˆˆ R+. Then, หœ๐‘‹[ ๐‘—] = ๐‘‹[ ๐‘—]ฮฉ๐‘‡ โˆ’ ๐‘— will be an (๐œ–, 0, ๐‘Ÿ)-PCP sketch of ๐‘‹[ ๐‘—] with , ๐œ– โˆ’1 (cid:169) (cid:173) (cid:173) (cid:171) (3.20) log log log , , 17๐‘›2 ๐›ฟ/2 (cid:18) ๐‘Ÿ ๐œ– 17๐‘›2 ๐›ฟ/2 ๐‘Ÿ ๐œ– 2 (cid:19) ๐‘‘โˆ’1๏ฃผ๏ฃด๏ฃด๏ฃด๏ฃฝ ๏ฃด๏ฃด๏ฃด ๏ฃพ ๏ฃฑ๏ฃด๏ฃด๏ฃด๏ฃฒ ๏ฃด๏ฃด๏ฃด ๏ฃณ ๐‘‘โˆ’1 (cid:17)๐‘Ÿ (cid:16) 141 ๐œ– ๐›ฟ/2 (cid:17)๐‘Ÿ (cid:16) 141 ๐œ– ๐›ฟ/2 probability at least 1 โˆ’ ๐›ฟ. Proof. By Lemma 3.1 we know that it suffices for the measurement matrix ฮฉโˆ’ ๐‘— to have both the (cid:17) (cid:16) ๐œ– 6 -JL property. 
Combining these two required JL properties with Theorem 3.5 yields (3.20) after combining and simplifying constants.

We can now see that both Kronecker and Khatri-Rao structured measurements can satisfy the PCP property required by Theorem 3.3. This demonstrates that such memory-efficient measurements may be used to bound the first term in (3.9) as desired. Given this partial success, we will now turn our attention to the second term in (3.9).

3.2.4 Bounding $\|\mathcal{X}_1 - \mathcal{X}_2\|_2$

Next, we show how to bound Term II in (3.9). Recall that $\mathcal{X}_1$ is the output of Algorithm 3.1, and that $\mathcal{X}_2$ is the tensor recovered by the two-pass algorithm consisting of the first "Factor matrix recovery" phase of Algorithm 3.1 followed by the second-pass core recovery procedure discussed in Section 3.2.1. That is, $\mathcal{X}_1 := [[\mathcal{H}, Q_1, \ldots, Q_d]]$ is the single-pass estimate of the tensor $\mathcal{X}$ output by Algorithm 3.1, and
\[
\mathcal{X}_2 = \mathcal{G} \times_1 Q_1 \times_2 \cdots \times_d Q_d = \mathcal{X} \times_1 Q_1 Q_1^T \times_2 \cdots \times_d Q_d Q_d^T ,
\]
where $\mathcal{G}$ is the core estimate from the two-pass algorithm, and where the $Q_i \in \mathbb{R}^{n \times r}$ have $r$ orthonormal columns. To begin, we note that the one-pass core $\mathcal{H}$ computed by Algorithm 3.1 can be recovered from its input measurements by
\[
\mathcal{H} = \mathcal{B}_c \times_1 (\Phi_1 Q_1)^{\dagger} \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger}
= \left( \mathcal{X} \times_1 \Phi_1 \times_2 \cdots \times_d \Phi_d \right) \times_1 (\Phi_1 Q_1)^{\dagger} \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} . \tag{3.21}
\]
In addition, we note that the norm of the difference between the two estimates is the same as the norm of the difference of their cores, since the factor matrices have orthonormal columns.

Lemma 3.7. In the notation outlined above, $\|\mathcal{X}_1 - \mathcal{X}_2\|_2 = \|\mathcal{H} - \mathcal{G}\|_2$.

Proof.
\[
\begin{aligned}
\|\mathcal{X}_1 - \mathcal{X}_2\|_2
&= \|\mathcal{H} \times_1 Q_1 \times_2 \cdots \times_d Q_d - \mathcal{G} \times_1 Q_1 \times_2 \cdots \times_d Q_d\|_2
 = \|(\mathcal{H} - \mathcal{G}) \times_1 Q_1 \times_2 \cdots \times_d Q_d\|_2 \\
&= \|(Q_1 \otimes \cdots \otimes Q_d)\operatorname{vec}(\mathcal{H} - \mathcal{G})\|_2
 = \|\mathcal{H} - \mathcal{G}\|_2
\end{aligned}
\]
since $(Q_1 \otimes \cdots \otimes Q_d)$ has orthonormal columns.
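As a quick numerical sanity check of (3.21) and Lemma 3.7, the following minimal sketch builds an exactly low-rank tensor, forms Gaussian core measurements, recovers the one-pass core via (3.21), and confirms that $\|\mathcal{X}_1 - \mathcal{X}_2\|_F = \|\mathcal{H} - \mathcal{G}\|_F$. The helper name and the Gaussian choice of $\Phi_i$ are our own assumptions made for illustration.

\begin{verbatim}
import numpy as np

def mode_prod(T, M, k):
    """Mode-k product T x_k M for a numpy array T and matrix M."""
    return np.moveaxis(np.tensordot(M, T, axes=([1], [k])), 0, k)

rng = np.random.default_rng(2)
n, r, m_c, d = 30, 4, 10, 3

# low-rank test tensor X = [[G0, Q1, Q2, Q3]] with orthonormal factors
G0 = rng.random((r,) * d)
Qs = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(d)]
X = G0
for k, Q in enumerate(Qs):
    X = mode_prod(X, Q, k)

# core measurements B_c and the one-pass core estimate H via (3.21)
Phis = [rng.standard_normal((m_c, n)) / np.sqrt(m_c) for _ in range(d)]
Bc = X
for k, Phi in enumerate(Phis):
    Bc = mode_prod(Bc, Phi, k)
H = Bc
for k, (Phi, Q) in enumerate(zip(Phis, Qs)):
    H = mode_prod(H, np.linalg.pinv(Phi @ Q), k)

# two-pass core G = X x_1 Q1^T ... x_d Qd^T
G = X
for k, Q in enumerate(Qs):
    G = mode_prod(G, Q.T, k)

# Lemma 3.7: the two norms below agree
X1, X2 = H, G
for k, Q in enumerate(Qs):
    X1 = mode_prod(X1, Q, k)
    X2 = mode_prod(X2, Q, k)
print(np.linalg.norm(X1 - X2), np.linalg.norm(H - G))
\end{verbatim}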
In order to simplify the presentation of our culminating results, we next state a definition for an Affine Embedding property. In Lemma 3.8 below we then describe how this property relates to the OSE and AMM properties.

Definition 3.4 ($(\epsilon, r, p)$-AE property). Let $\epsilon > 0$ and $r \in \mathbb{N}$. Fix an arbitrary matrix $Q \in \mathbb{R}^{n \times r}$ with orthonormal columns and an arbitrary matrix $B \in \mathbb{R}^{n \times p}$. A matrix $\Phi \in \mathbb{R}^{m \times n}$ is an $(\epsilon, r, p)$-Affine Embedding (AE) for the given matrices $Q$ and $B$ if it satisfies
\[
\left\| (\Phi Q)^{\dagger} \Phi B \right\|_F \le (1 + \epsilon) \|B\|_F . \tag{3.22}
\]

With this definition in hand, we are now able to prove the main theorem of this section. It will allow us to relate Term II of (3.9) to Term I.

Theorem 3.7. Let $\mathcal{X}_2 = [[\mathcal{G}, Q_1, \ldots, Q_d]]$ denote the two-pass tensor estimate, and $\mathcal{X}_1 = [[\mathcal{H}, Q_1, \ldots, Q_d]]$ denote the single-pass tensor estimate for a $d$-mode tensor $\mathcal{X}$. Furthermore, let $\epsilon \in (0,1)$, and let $\Phi_i \in \mathbb{R}^{m_c \times n}$ be an $(\epsilon/d, r, n^{i-1} r^{d-i})$-AE for the matrices $Q_i$ and $\left( X_{[i]} - (X_2)_{[i]} \right) \bigotimes_{j=1}^{i-1} I_n \bigotimes_{j=i+1}^{d} (\Phi_j Q_j)^{\dagger} \Phi_j$ for each $i \in [d]$. Then,
\[
\|\mathcal{X}_1 - \mathcal{X}_2\|_2 \le e^{\epsilon} \|\mathcal{X} - \mathcal{X}_2\|_2 . \tag{3.23}
\]

Proof. By Lemma 3.7 it is enough to estimate the difference between the cores $\mathcal{G}$ and $\mathcal{H}$. We have that
\[
\begin{aligned}
\mathcal{H} - \mathcal{G}
&\overset{(3.21)}{=} \left( \mathcal{X} \times_1 \Phi_1 \times_2 \cdots \times_d \Phi_d \right) \times_1 (\Phi_1 Q_1)^{\dagger} \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} - \mathcal{G} \\
&= \left( (\mathcal{X} - \mathcal{X}_2) \times_1 \Phi_1 \times_2 \cdots \times_d \Phi_d \right) \times_1 (\Phi_1 Q_1)^{\dagger} \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} \\
&\qquad + \left( \mathcal{X}_2 \times_1 \Phi_1 \times_2 \cdots \times_d \Phi_d \right) \times_1 (\Phi_1 Q_1)^{\dagger} \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} - \mathcal{G} \\
&= (\mathcal{X} - \mathcal{X}_2) \times_1 (\Phi_1 Q_1)^{\dagger} \Phi_1 \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} \Phi_d
 + \mathcal{G} \times_1 (\Phi_1 Q_1)^{\dagger} \Phi_1 Q_1 \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} \Phi_d Q_d - \mathcal{G} \\
&= (\mathcal{X} - \mathcal{X}_2) \times_1 (\Phi_1 Q_1)^{\dagger} \Phi_1 \times_2 \cdots \times_d (\Phi_d Q_d)^{\dagger} \Phi_d .
\end{aligned}
\]
Now consider the following related mode-$i$ unfolding for $i \in [d]$, where $(\Phi_j Q_j)^{\dagger} \Phi_j$ for $j < i$ is replaced with an $n \times n$ identity matrix $I_n$:
\[
X'_i := \left( X_{[i]} - (X_2)_{[i]} \right) \bigotimes_{j=1}^{i-1} I_n \bigotimes_{j=i+1}^{d} (\Phi_j Q_j)^{\dagger} \Phi_j . \tag{3.24}
\]
Each $\Phi_i$ is an $(\epsilon/d, r, n^{i-1} r^{d-i})$-AE, where $Q \leftarrow Q_i$ and $B \leftarrow X'_i$, for $i = 1, 2, \ldots, d$. Thus,
\[
\begin{aligned}
\|\mathcal{H} - \mathcal{G}\|_2
&= \left\| (\Phi_1 Q_1)^{\dagger} \Phi_1 \left( X_{[1]} - (X_2)_{[1]} \right) \bigotimes_{j=2}^{d} (\Phi_j Q_j)^{\dagger} \Phi_j \right\|_F
 \le \left( 1 + \frac{\epsilon}{d} \right) \left\| X'_1 \right\|_F \\
&\;\;\vdots \\
&\le \left( 1 + \frac{\epsilon}{d} \right)^d \left\| X'_d \right\|_F \le e^{\epsilon} \|\mathcal{X} - \mathcal{X}_2\|_2 ,
\end{aligned}
\]
where we have used Definition 3.4 $d$ times together with the bound $\left( 1 + \frac{\epsilon}{d} \right)^d \le e^{\epsilon}$.

3.2.5 Putting it All Together with Row Bounds

In the previous subsections we have demonstrated that we can bound both error terms in (3.9) when the leave-one-out and core measurements satisfy certain embedding properties. In particular, we have shown how the Johnson-Lindenstrauss property (Definition 1.18) can be used to obtain both the Oblivious Subspace Embedding (Definition 1.23) and the Approximate Matrix Multiplication (Definition 1.25) properties. These two properties are then used with compositions and direct sums to show that a tensor unfolding will satisfy a Projection Cost Preserving property (Definition 3.2), which was the essential ingredient in bounding the error term $\|\mathcal{X} - \mathcal{X}_2\|_2$. Next, we introduced Affine Embeddings (Definition 3.4), which are the crucial ingredient needed for bounding the error term $\|\mathcal{X}_1 - \mathcal{X}_2\|_2$. In this section we will show how the JL and OSE properties imply the AE property. All together, these will enable us to verify the requirements of Theorem 3.8 in a straightforward manner once we have specified the type of leave-one-out measurements and the particular type of sensing matrices.
ฮฉ(๐‘–,๐‘–) โˆˆ R๐‘›ร—๐‘› are full-rank matrices for all ๐‘– โˆˆ [๐‘‘], 2. ฮฉโˆ’1 (๐‘–,๐‘–) ๐ต๐‘– = ๐‘‹[๐‘–]ฮฉ๐‘‡ โˆ’ ๐‘— โˆˆ R๐‘›ร—๐‘š๐‘‘โˆ’1 are (๐œ–, 0, ๐‘Ÿ)-PCP sketches of ๐‘‹[๐‘–] for each ๐‘– โˆˆ [๐‘‘], and 3. ฮฆ๐‘– โˆˆ R๐‘š๐‘ร—๐‘› is an (๐œ–/๐‘‘, ๐‘Ÿ, ๐‘›๐‘–โˆ’1๐‘Ÿ ๐‘‘โˆ’๐‘–)-AE for the matrices ๐‘„๐‘– and ๐‘‹โ€ฒ ๐‘– as in (3.24) for all ๐‘– โˆˆ [๐‘‘]. Proof. Recalling (3.9) we have that โˆฅX1 โˆ’ Xโˆฅ2 = โˆฅX1 โˆ’ X + X2 โˆ’ X2โˆฅ2 โ‰ค โˆฅX โˆ’ X2โˆฅ2 (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:124) (cid:123)(cid:122) Term I We know from Theorem 3.3 that โˆฅX โˆ’ X2โˆฅ2 โ‰ค (cid:118)(cid:117)(cid:116) 1 + ๐œ– 1 โˆ’ ๐œ– ๐‘‘ โˆ‘๏ธ ๐‘—=1 ฮ”๐‘Ÿ, ๐‘— (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) + โˆฅX1 โˆ’ X2โˆฅ2 (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) (cid:123)(cid:122) Term II (cid:124) . (3.26) (3.27) whenever ๐‘‹[ ๐‘—]ฮฉ๐‘‡ โˆ’ ๐‘— โˆˆ R๐‘›ร—๐‘š๐‘‘โˆ’1 are (๐œ–, 0, ๐‘Ÿ)-PCP sketches of ๐‘‹[ ๐‘—] for each ๐‘— โˆˆ [๐‘‘]. Furthermore, Theorem 3.7 together with our third assumption implies that โˆฅX1 โˆ’ X2โˆฅ2 โ‰ค ๐‘’๐œ– โˆฅX โˆ’ X2โˆฅ2 . (3.28) Using (3.27) and (3.28) in (3.9) now yields the desired result. We now have need for the lemma that links the AE property to the JL and OSE properties, and thus provides the necessary machinery to fully account for row bounds that guarantee with high probability that the recovered tensor satisfies the stated bound in Theorem 3.8. Lemma 3.8. Let ๐ต โˆˆ R๐‘›ร—๐‘ and suppose that ๐‘„ โˆˆ R๐‘›ร—๐‘Ÿ, ๐‘› โ‰ฅ ๐‘Ÿ, has orthonormal columns. If ฮฆ โˆˆ R๐‘šร—๐‘› is an (cid:17) , ๐›ฟ, ๐‘Ÿ (cid:16) 1 2 -OSE for ๐‘„, and has the (cid:16) ๐œ– โˆš ๐‘Ÿ 2 (cid:17) , ๐›ฟ -AMM property for ๐‘„๐‘‡ and (๐ผ โˆ’ ๐‘„๐‘„๐‘‡ )๐ต, then ฮฆ will be an (๐œ–, ๐‘Ÿ, ๐‘)-AE for the matrices ๐‘„ and ๐ต with probability at least 1 โˆ’ 2๐›ฟ. Proof. Denote หœ๐‘Œ := (ฮฆ๐‘„)โ€ ฮฆ๐ต and ๐‘Œ โ€ฒ := ๐‘„๐‘‡ ๐ต, the solutions to the sketched and un-skechted linear least square problems given by minimizing โˆฅฮฆ๐ต โˆ’ ฮฆ๐‘„๐‘Œ โˆฅ๐น and โˆฅ๐ต โˆ’ ๐‘„๐‘Œ โˆฅ๐น, respectively, with respect to ๐‘Œ . Whenever ฮฆ is a (1/2, ๐›ฟ, ๐‘Ÿ)-OSE for ๐‘„ we know that ฮฆ๐‘„ will be full-rank. It follows 84 that (ฮฆ๐‘„)๐‘‡ ฮฆ(๐ต โˆ’ ๐‘„ หœ๐‘Œ ) = 0 will then also hold because (ฮฆ๐‘„)โ€  = (cid:2)(ฮฆ๐‘„)๐‘‡ (ฮฆ๐‘„)(cid:3) โˆ’1 (ฮฆ๐‘„)๐‘‡ . As a consequence, ๐‘„๐‘‡ ฮฆ๐‘‡ ฮฆ๐‘„( หœ๐‘Œ โˆ’ ๐‘Œ โ€ฒ) = ๐‘„๐‘‡ ฮฆ๐‘‡ ฮฆ๐‘„( หœ๐‘Œ โˆ’ ๐‘Œ โ€ฒ) + (ฮฆ๐‘„)๐‘‡ ฮฆ(๐ต โˆ’ ๐‘„ หœ๐‘Œ ) = ๐‘„๐‘‡ ฮฆ๐‘‡ ฮฆ(๐‘„ หœ๐‘Œ โˆ’ ๐‘„๐‘Œ โ€ฒ + ๐ต โˆ’ ๐‘„ หœ๐‘Œ ) = ๐‘„๐‘‡ ฮฆ๐‘‡ ฮฆ(๐ต โˆ’ ๐‘„๐‘Œ โ€ฒ). When the approximate matrix multiplication property also holds we will now have that (3.29) (cid:13)(ฮฆ๐‘„)๐‘‡ ฮฆ๐‘„( หœ๐‘Œ โˆ’ ๐‘Œ โ€ฒ)(cid:13) (cid:13) (cid:13)๐น = (cid:13) (cid:13)๐‘„๐‘‡ ฮฆ๐‘‡ ฮฆ(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ )๐ต(cid:13) (cid:13)๐น (cid:13)๐น = (cid:13) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ )๐ต(cid:13) (cid:13) (cid:13)๐น (cid:13)๐‘„๐‘‡ ฮฆ๐‘‡ ฮฆ(๐ต โˆ’ ๐‘„๐‘Œ โ€ฒ)(cid:13) ๐œ– โˆš ๐‘Ÿ (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ )๐ต(cid:13) (cid:13) (cid:13)๐‘„๐‘‡ (cid:13) (cid:13) (cid:13)๐น (cid:13)๐น = 2 ๐œ– ๐œ– 2 โˆฅ๐ต โˆ’ ๐‘„๐‘Œ โ€ฒโˆฅ๐น . 
Furthermore, whenever $\Phi$ is a $\left( \frac{1}{2}, \delta, r \right)$-OSE for the column space of $Q$, all the eigenvalues of $Q^T \Phi^T \Phi Q - I$ will lie within the interval $[-1/2, 1/2]$. Thus, we can bound its operator norm by $\left\| Q^T \Phi^T \Phi Q - I \right\| \le 1/2$. We may now combine this operator norm bound with (3.29) to see that
\[
\begin{aligned}
\left\| \tilde{Y} - Y' \right\|_F
&= \left\| (\tilde{Y} - Y') - (\Phi Q)^T \Phi Q (\tilde{Y} - Y') + (\Phi Q)^T \Phi Q (\tilde{Y} - Y') \right\|_F \\
&\le \left\| \left[ (\Phi Q)^T \Phi Q - I \right] (\tilde{Y} - Y') \right\|_F + \left\| (\Phi Q)^T \Phi Q (\tilde{Y} - Y') \right\|_F \\
&\le \left\| (\Phi Q)^T \Phi Q - I \right\| \left\| \tilde{Y} - Y' \right\|_F + \frac{\epsilon}{2} \left\| (I - QQ^T) B \right\|_F
 \le \frac{1}{2} \left\| \tilde{Y} - Y' \right\|_F + \frac{\epsilon}{2} \left\| (I - QQ^T) B \right\|_F .
\end{aligned}
\]
Rearranging the inequality above, while noting the invariance of the Frobenius norm to multiplication by a matrix with orthonormal columns, we learn that
\[
\left\| \tilde{Y} - Y' \right\|_F = \left\| Q (\tilde{Y} - Y') \right\|_F \le \epsilon \left\| (I - QQ^T) B \right\|_F .
\]
To finish, we may now apply the triangle inequality to see that
\[
\left\| \tilde{Y} \right\|_F
\le \left\| \tilde{Y} - Y' \right\|_F + \left\| Y' \right\|_F
\le \epsilon \left\| (I - QQ^T) B \right\|_F + \left\| Q^T B \right\|_F
\le \epsilon \left\| I - QQ^T \right\| \left\| B \right\|_F + \left\| Q^T \right\| \left\| B \right\|_F
= (1 + \epsilon) \left\| B \right\|_F .
\]
In addition, we note that taking a union bound over the two necessary OSE and AMM conditions establishes the stated probability guarantee.

We are now prepared to state how a particular choice of distribution used to generate our measurement matrices, as well as the leave-one-out measurement type (Kronecker or Khatri-Rao), can satisfy the error bound (3.25) with high probability. Note that below, Algorithms 3.9 and 3.11 refer to the specialization of Algorithm 3.1 to the type of leave-one-out measurement (Kronecker or Khatri-Rao, respectively).

Theorem 3.9 (Error bound for one-pass Kronecker-structured sub-Gaussian measurements). Suppose $\mathcal{X}$ is a $d$-mode tensor with side length $n$. Let $\epsilon > 0$, $\delta \in (0, \frac{1}{3})$, $r \in [n]$. Furthermore, let
1. $\Omega_{(i,i)} \in \mathbb{R}^{n \times n}$ be arbitrary full-rank matrices,
2. $\Omega_{(i,j)} \in \mathbb{R}^{m \times n}$ for $i \ne j$ be random matrices with mutually independent, mean zero, variance $m^{-1}$, sub-Gaussian entries with
\[
m \ge \max\left\{ \frac{C_1 r (d-1)^2}{\epsilon^2} \ln\left( \frac{n^d d^2}{\delta} \right),\ \frac{C_2 (d-1)^2}{\epsilon^2} \ln\left( \left(\frac{141}{\epsilon}\right)^r \frac{n^{d-2} d^2}{\delta} \right) \right\},
\]
and
3. $\Phi_i \in \mathbb{R}^{m_c \times n}$ be random matrices with mutually independent, mean zero, variance $m^{-1}$, sub-Gaussian entries with
\[
m_c \ge \max\left\{ C_3 \ln\left( \frac{(94)^r d}{\delta} \right),\ \frac{C_4 r d^2}{\epsilon^2} \ln\left( \frac{d (r + n^{d-1})^2}{\delta} \right) \right\}. \tag{3.30}
\]
Then $\mathcal{X}_1 = [[\mathcal{H}, Q_1, \ldots, Q_d]]$, the output of Algorithm 3.9 (i.e., Algorithm 3.1 specialized to Kronecker sub-Gaussian measurements $B_i$ and $\mathcal{B}_c$), will satisfy (3.25) with probability at least $1 - 3\delta$.

Proof. We verify that the requirements of Theorem 3.8 are satisfied. Note that:
1. $\Omega_{(i,i)} \in \mathbb{R}^{n \times n}$ are full-rank matrices by assumption.
2. We have from Corollary 3.2, with $\delta \leftarrow \frac{\delta}{d}$, that $X_{[j]} \Omega_{-j}^T$ is an $(\epsilon, 0, r)$-PCP sketch of $X_{[j]}$ for all $j \in [d]$ with probability at least $1 - \delta$ when the $\Omega_{(i,j)} \in \mathbb{R}^{m \times n}$ are independent sub-Gaussian random matrices with
\[
m \ge \max\left\{ \frac{C_1 r (d-1)^2}{\epsilon^2} \ln\left( \frac{n^d d^2}{\delta} \right),\ \frac{C_2 (d-1)^2}{\epsilon^2} \ln\left( \left(\frac{141}{\epsilon}\right)^r \frac{n^{d-2} d^2}{\delta} \right) \right\}.
\]
3. A substitution of $\epsilon \leftarrow \frac{1}{2}$, $\delta \leftarrow \frac{\delta}{d}$ into Corollary 1.1 yields
\[
m_c \ge C_3 \ln\left( \frac{(94)^r d}{\delta} \right) \tag{3.31}
\]
in order to ensure that $\Phi_i$ is a $\left( \frac{1}{2}, \frac{\delta}{d}, r \right)$-OSE. Using Corollary 1.2, where $\epsilon \leftarrow \frac{\epsilon}{2d\sqrt{r}}$, $\delta \leftarrow \frac{\delta}{2d}$, and noting that the matrices $Q_i$ and $X'_i$ as in (3.24) are $r \times n$ and $n \times n^{i-1} r^{d-i}$, respectively, we have that when
\[
m_c \ge \frac{C_4 r d^2}{\epsilon^2} \ln\left( \frac{2d (r + n^{d-1})^2}{\delta} \right) \tag{3.32}
\]
then $\Phi_i$ has the $\left( \frac{\epsilon}{2d\sqrt{r}}, \frac{\delta}{d} \right)$-AMM property for each $i \in [d]$. Lemma 3.8 now shows how the OSE and AMM properties ensure that $\Phi_i$ has the desired AE property for each $i \in [d]$ with probability at least $1 - 2\delta/d$. The union bound now implies that the third requirement of Theorem 3.8 will hold with probability at least $1 - 2\delta$.

Taking a maximum over (3.31) and (3.32), after simplifying and adjusting constants, then yields (3.30). A final union bound over the failure probabilities for the requirements on the $\Omega_{(i,j)}$ and $\Phi_i$ now yields the result.

The following runtime analysis demonstrates that instances of Algorithm 3.1 can indeed recover low-rank approximations of $d$-mode tensors of side length $n$ in $o(n^d)$ time. As a result, one can see that Algorithm 3.1 is effectively a sub-linear time recovery algorithm for a large class of low Tucker-rank tensors.

Theorem 3.10. Suppose that $m < n < m^{d-1}$ and that $m > C m_c$ for some absolute constant $C > 0$. Then Algorithm 3.9, when given the $B_i$ and $\mathcal{B}_c$ measurements as input, runs in $\mathcal{O}(d n^2 m^{d-1})$ time and requires the storage of $d n m^{d-1} + m_c^d$ measurement tensor entries, and at most $d(d-1)mn + n^2 + d m_c n$ total measurement matrix entries.

Proof. Inside the factor matrix recovery loop of Algorithm 3.5, called by Algorithm 3.9, the two main sub-tasks are to solve the linear system $\Omega_{(i,i)} F_i = B_i$ and to compute a truncated SVD, $F_i = U \Sigma V^T$. Solving the linear system can be accomplished using a $QR$-factorization computed via Householder orthogonalization. Doing so requires $\frac{4}{3} n^3$ floating point operations to compute the factorization of $\Omega_{(i,i)}$, $2 n^2 m^{d-1}$ operations to form $Q^T B_i$, and $n^2 m^{d-1}$ operations to solve $R F_i = Q^T B_i$ via back substitution. The complexity of computing the SVD is $\mathcal{O}(n m^{d-1} \min(n, m^{d-1}))$. Therefore the factor recovery loop has overall complexity $\mathcal{O}(d n^2 m^{d-1})$ if we assume that $n < m^{d-1}$.

Next we consider the core recovery loop. For the first iteration of the loop of Algorithm 3.8, we must form $\Phi_1 Q_1$ at a cost of $\mathcal{O}(m_c n r)$. Next we solve a linear system $\Phi_1 Q_1 H = B_1$ at a cost of $\mathcal{O}(2 m_c r^2 - \frac{2}{3} r^3 + 3 m_c^d r)$. The first iteration dominates the complexity, since subsequent solves use a smaller right-hand side formed from the solutions of the previous iterations. Furthermore, if we assume $m_c = \mathcal{O}(m)$, we have a core recovery loop with complexity $\mathcal{O}(d m^d r)$. Thus, overall, the recovery algorithm has $\mathcal{O}(d n^2 m^{d-1})$ complexity. In the situation where $\Omega_{(i,i)} = I_n$, the computation of the SVDs in the factor recovery step dominates the run-time of the algorithm. Clearly the sizes of the measurement tensors are $n m^{d-1}$ per factor and $m_c^d$ for the core, which yields the space complexity of the measurements.
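The factor-recovery stage analyzed above is straightforward to prototype. The sketch below is illustrative only (the function name and the details are our own, not those of Algorithm 3.5): it solves $\Omega_{(i,i)} F_i = B_i$ by Householder QR and back substitution, and then truncates the SVD of $F_i$ to obtain a factor estimate.

\begin{verbatim}
import numpy as np
from scipy.linalg import qr, solve_triangular

def recover_factor(Omega_ii, B_i, r):
    """Estimate an n x r factor from the leave-one-out measurements B_i.

    Omega_ii : n x n full-rank matrix (the identity is a common practical choice)
    B_i      : n x m^(d-1) measurement matrix, B_i = Omega_ii @ X_[i] @ Omega_{-i}^T
    r        : target Tucker rank for mode i
    """
    # solve Omega_ii @ F_i = B_i via Householder QR and back substitution
    Qf, Rf = qr(Omega_ii)                     # Omega_ii = Qf @ Rf
    F_i = solve_triangular(Rf, Qf.T @ B_i)    # F_i ~ X_[i] @ Omega_{-i}^T
    # truncated SVD: the leading r left singular vectors give the factor estimate
    U, _, _ = np.linalg.svd(F_i, full_matrices=False)
    return U[:, :r]
\end{verbatim}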
One of the advantages of the structure of the argument in Theorem 3.8 is that once it is known how to ensure that a given random matrix will satisfy the JL property, we can (with the help of Lemma 3.8) account for how to assemble related measurement operators that satisfy the error bound (3.25) in a straightforward way. For example, using Theorem 3.1 in [38] along with bounds appearing in that work on the sketching dimension, we have that a sub-sampled and scrambled Fourier matrix is an $(\epsilon, \delta, p)$-JL of vectors in $\mathbb{R}^n$ provided that
\[
m \ge \frac{C}{\epsilon^2} \log\left( \frac{p}{\delta} \right) \log^4 n .
\]
Using such existing results, one can easily update Theorem 3.9 to instead use sub-sampled and scrambled Fourier measurement matrices $\Omega_{(i,j)}$ and $\Phi_i$ in place of matrices with independent sub-Gaussian entries.

Furthermore, it need not be the case that the distribution is the same for each component map $\Omega_{(i,j)}$ or $\Phi_i$. It is important only that each map satisfies the JL primitive for arbitrary sets. We are free to choose a measurement map for a particular mode to suit some other purpose. For example, in the case that the side lengths of the tensor are unequal, we may prefer to choose a map that admits a fast matrix-vector multiply in order to economize run-time for the modes which are long, and on smaller modes choose maps which have better trade-offs in quality of approximation in terms of $m$ (e.g., we may prefer dense sub-Gaussian random matrices for these modes); a small illustrative sketch of such a mixed Kronecker ensemble is given below.
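For instance, one might pair dense Gaussian maps on short modes with a sub-sampled randomized Fourier-style map on a long mode. The following sketch is an assumption-laden illustration: the RFD-style construction shown is a generic sub-sampled randomized DFT with random sign flips (applied here as a dense matrix for simplicity), not necessarily the exact construction of [38] or [64], and the sizes are arbitrary.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)

def gaussian_map(m, n):
    return rng.standard_normal((m, n)) / np.sqrt(m)

def rfd_map(m, n):
    """Sub-sampled randomized DFT-style map (real part); suitable for long
    modes since it admits an FFT-based multiply when applied implicitly."""
    signs = rng.choice([-1.0, 1.0], size=n)        # random diagonal sign flips
    rows = rng.choice(n, size=m, replace=False)    # sub-sampled frequencies
    F = np.fft.fft(np.eye(n), axis=0)[rows] / np.sqrt(m)
    return np.real(F * signs)                      # m x n

# a 3-mode tensor with one long mode: Gaussian maps on the short modes,
# an RFD-style map on the long mode
n_short, n_long, m = 50, 1024, 20
Omegas = [gaussian_map(m, n_short), gaussian_map(m, n_short), rfd_map(m, n_long)]
\end{verbatim}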
ฮฉ(๐‘–, ๐‘—) โˆˆ R หœ๐‘šร—๐‘› for ๐‘– โ‰  ๐‘— be random matrices with mutually independent, mean zero, variance 89 หœ๐‘šโˆ’1, sub-Gaussian entries with หœ๐‘š โ‰ฅ ๐ถ ๐‘‘โˆ’1 max (cid:26) ๐œ– โˆ’2 ln (cid:169) (cid:173) (cid:173) (cid:171) (cid:17)๐‘Ÿ (cid:16) 141 ๐œ– 2๐‘‘ (cid:170) (cid:174) (cid:174) (cid:172) (cid:18) 34๐‘›2๐‘‘ ๐›ฟ , ๐œ– โˆ’1 (cid:169) (cid:173) (cid:173) (cid:171) ๐‘Ÿ ๐œ– (cid:19) , ln (cid:169) (cid:173) (cid:173) (cid:171) (cid:18) ln ๐›ฟ ๐‘Ÿ ๐œ– 2 ln 2๐‘‘ (cid:17)๐‘Ÿ (cid:16) 141 ๐œ– ๐›ฟ (cid:18) 34๐‘›2๐‘‘ ๐›ฟ ๐‘‘โˆ’1 (cid:170) (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) (cid:172) (cid:19)(cid:19) ๐‘‘โˆ’1(cid:27) , , and 3. ฮฆ๐‘– โˆˆ R๐‘š๐‘ร—๐‘› be random matrices with mutually independent, mean zero, variance ๐‘šโˆ’1, sub-Gaussian entries with ๐‘š๐‘ โ‰ฅ max (cid:26) ๐ถ3 ln (cid:18) (94)๐‘Ÿ ๐‘‘ ๐›ฟ (cid:19) , ๐ถ4๐‘Ÿ๐‘‘2 ๐œ– 2 (cid:18) ๐‘‘ (๐‘Ÿ + ๐‘›๐‘‘โˆ’1)2 ๐›ฟ (cid:19)(cid:27) . ln Then X1 = [[H , ๐‘„1, . . . , ๐‘„๐‘‘]] the output of Algorithm 3.11 (i.e., Algorithm 3.1 specialized to Khatri-Rao sub-Gaussian leave-one-out measurements B๐‘– and Kronecker measurements B๐‘) will satisfy (3.25) with probability at least 1 โˆ’ 3๐›ฟ. Proof. The proof is again based on verifying the requirements of Theorem 3.8. 1. The first requirement is satisfied by assumption. 2. The row requirement for the ฮฉ(๐‘–, ๐‘—) follows from an application of Theorem 3.6 with ๐›ฟ โ† ๐›ฟ/๐‘‘. As a result of doing so, we learn that the second requirement will be satisfied with probability at least 1 โˆ’ ๐›ฟ after a union bound. 3. The third requirement is satisfied identically to the argument in the proof of Theorem 3.9. The proof now concludes identically to the proof of Theorem 3.9. Comparing the Theorem 3.10, we note, e.g., that Theorem 3.11 will require the storage of ๐‘‘ หœ๐‘š๐‘› + ๐‘š๐‘‘ ๐‘ measurement tensor entries (where the หœ๐‘š here is from the second condition of Theo- rem 3.11). Comparing this to the ๐‘‘๐‘›๐‘š๐‘‘โˆ’1 + ๐‘š๐‘‘ ๐‘ measurements from Theorem 3.10 (where ๐‘š here is as in Theorem 3.9) one can see that there are parameter regimes where Khatri-Rao structured measurements will lead to a smaller overall measurement budget. 3.3 Experiments In this section we present numerical results that support our theoretical contributions and address practical trade-offs involved in the different choices for measurement type. Unless otherwise 90 specified, the tensors in the experiments are random three-mode cubic tensors, with side length ๐‘› = 300 and rank ๐‘Ÿ = 10. We use the following procedure (same as in [60]) to generate low- rank tensors; the coreโ€™s entries are uniformly and independently drawn from [0, 1] and the factors are formed by first sampling a standard normal distribution for each entry and then normalized and made orthogonal using ๐‘„๐‘…-factorization. The data points in the plots are the mean of 100 independent trials. The parameter ๐‘š refers to the sketching dimension for maps ฮฉ(๐‘–, ๐‘—) โˆˆ R๐‘šร—๐‘› used in recovering factor ๐‘„๐‘–. For the left-out mode we remove the need to solve the full ๐‘› ร— ๐‘› linear system by setting ฮฉ(๐‘–,๐‘–) to be the ๐‘› ร— ๐‘› identity matrix. The parameter ๐‘š๐‘ refers to the sketching dimension for ฮฆ๐‘– โˆˆ R๐‘š๐‘ร—๐‘› which are used in recovering the core H . 
In experiments with noise, the additive Gaussian noise tensor $\mathcal{N}$ is scaled according to the desired signal-to-noise ratio and added to the true (low-rank) tensor $\mathcal{X}_0$. That is, $\mathcal{X} = \mathcal{X}_0 + \mathcal{N}$ is the observed, noisy tensor. The signal-to-noise ratio (SNR) is calculated as $10 \log_{10}\left( \|\mathcal{X}\|_2 / \|\mathcal{X}_0 - \mathcal{X}\|_2 \right)$. Relative error is calculated as $\|\hat{\mathcal{X}} - \mathcal{X}\|_2 / \|\mathcal{X}_0\|_2$, where $\hat{\mathcal{X}}$ is the full estimated tensor.

3.3.1 Recovering Low-Rank Tensors

In this first simple experiment, we fix the signal-to-noise ratio at 30 decibels (dB) and vary the sketching dimension $m$ to show the dependence of the accuracy of our estimate on the number of measurements. For each $m$ we set $m_c = 2m$. The rank truncation is fixed at $r = 10$, which matches the rank of the true, noiseless tensor $\mathcal{X}_0$; see Figure 3.2. In plot (b) of the figure, we show the maximum principal angle, in degrees, among the three estimated factors and true factors $(Q_i, U_i)$; see [35]. Note that there is no straightforward way to plot the portion of the relative error which is due to the factor estimates versus the core estimate, because the decomposition will in general not be unique. However, since the principal angle is invariant to non-singular transformations, plot (b) provides empirical evidence that the factor estimates alone are improving with the sketching dimension. We note that for these low-rank tensors with noise, we are able to fit at or below the level of the noise (relative error of 0.001) easily, evidently finding good rank-10 approximations to the (full-rank) noisy tensor $\mathcal{X}$.

Figure 3.2 Error plots for different sketching dimensions with a fixed SNR of 30 dB and a fixed rank truncation of 10. Plot (a) compares relative errors for both one-pass and two-pass recovery. Plot (b) shows the maximum principal angle among all estimated subspaces and the true factor matrices.

This perhaps surprising result motivated us to try the method on a class of tensors for which we could be more certain about what quality of rank-10 approximation is achievable. In our second set of experiments, we examine performance on super-diagonal tensors with tail decay. Since we are truncating to rank 10, this tail can be thought of as structured, deterministic noise. These are tensors where all values are zero except for those on the diagonal, and where the magnitudes of the diagonal values decay for indices larger than $r = 10$. In particular we consider two types: exponential tail decay in plot (a), where
\[
\mathcal{X}_{ijk} = \begin{cases} 1 & i = j = k \in [r+1] \\ 10^{-(i-r)} & i = j = k \in [r+2, n] \\ 0 & \text{otherwise,} \end{cases} \tag{3.33}
\]
and polynomial tail decay in plot (b), where
\[
\mathcal{X}_{ijk} = \begin{cases} 1 & i = j = k \in [r+1] \\ (i-r)^{-1} & i = j = k \in [r+2, n] \\ 0 & \text{otherwise.} \end{cases} \tag{3.34}
\]
These highly constrained tensors are clearly not low-rank; however, it is reasonable to suppose that a recovery algorithm for a given rank truncation would output an estimate that is close to the leading $r$ terms of the diagonal. The residual in that case will simply be the norm of the tail-sum, $\sqrt{\sum_{i=r+1}^{n} \mathcal{X}_{iii}^2}$, which we have included as the red horizontal line in Figure 3.3.
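The super-diagonal test tensors of (3.33) and (3.34), along with the tail-sum baseline plotted as the red line in Figure 3.3, can be generated as follows. This is a sketch: the function name is ours, and the snippet simply translates the 1-based index ranges above into 0-based NumPy indexing.

\begin{verbatim}
import numpy as np

def superdiagonal_tensor(n=300, r=10, decay="exp"):
    """Super-diagonal tensor per (3.33)/(3.34): ones on the first r+1
    diagonal entries, then decaying diagonal values."""
    i = np.arange(1, n + 1, dtype=float)       # 1-based diagonal index
    diag = np.ones(n)
    tail = i >= r + 2
    if decay == "exp":                         # (3.33)
        diag[tail] = 10.0 ** (-(i[tail] - r))
    else:                                      # polynomial decay, (3.34)
        diag[tail] = 1.0 / (i[tail] - r)
    X = np.zeros((n, n, n))
    X[np.arange(n), np.arange(n), np.arange(n)] = diag
    return X, diag

X_poly, diag = superdiagonal_tensor(decay="poly")
# tail-sum residual, plotted as the red horizontal line in Figure 3.3
tail_residual = np.sqrt(np.sum(diag[10:] ** 2))   # diagonal indices i > r = 10
\end{verbatim}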
Figure 3.3 Error plots for different sketching dimensions in the noiseless setting with a fixed rank truncation of 10. Plot (a) compares relative errors for both one-pass and two-pass recovery for super-diagonal tensors whose diagonal entries have exponential decay of the type (3.33), and plot (b) for super-diagonal tensors with polynomial decay of the type (3.34).

3.3.2 Allocating Core and Factor Measurements

One question raised by our error analysis is how to weigh the error contribution between the tasks of estimating the factor matrices and estimating the core. In other words, for a given total measurement budget, how should we allocate between the two tasks if we wish to decrease the overall relative error? In the following experiment (see Figure 3.4) we find the relative error under various noise levels for pairs of sketching dimensions $(m, m_c)$. We compare the pairs $(13, 12)$, $(11, 36)$, and $(8, 48)$. These choices of sketching dimensions were chosen since they have nearly equal overall compression ratios of 0.57%, 0.58%, and 0.62%, respectively; however, they vary considerably in whether they emphasize measurements to be used in estimating the factors or the core of the tensor. Note that the two-pass error, which relies only on the factor matrix estimates, is naturally best when the factor sketches are larger, i.e., the $m = 13$ case. However, the relative error of the recovered tensor in the one-pass setting is more than ten times better when more of the total measurement budget is allocated to estimating the core, as shown in Figure 3.4. This shows that in some situations it is preferable to allocate more resources to obtaining measurements for the core than for the factors, up to some threshold. For example, in Figure 3.4 the rank of the true signal is 10, and going below this dimension for the factor sketches does correspond with no longer improving the accuracy in terms of the trade-off between $m$ and $m_c$.

Figure 3.4 Relative error plots at different signal-to-noise ratios for two-pass and one-pass recovery of noisy low-rank tensors. Ordered pairs indicate the choice of sketching dimensions $(m, m_c)$.

3.3.3 Error Bounds Apply to sub-Gaussian Measurement Matrices

In this next experiment we demonstrate, in a similar manner as done in Figure 1 of [60], that recovery performance does not vary greatly for different choices of sub-Gaussian measurement matrices. What is different from that earlier work is that the measurement ensembles here are all Kronecker structured. Plotted in Figure 3.5 are relative errors for Gaussian (g) matrices, sparse matrices with entries drawn from $\{-1, 0, 1\}$ with weights $\frac{1}{6}, \frac{2}{3}, \frac{1}{6}$ (sp0), sub-sampled randomized Fourier transforms as in [64] (rfd), and a mixed measurement ensemble that uses Gaussian-RFD-sparse measurements where the measurement type varies by mode, a scenario that is practically and theoretically not well suited to the Khatri-Rao structured measurement operators used in [60].

Figure 3.5 Relative errors for one-pass and two-pass recovery with Kronecker measurement ensembles made up of different kinds of sub-Gaussian random matrices. The legend entries g, sp0, rfd, and mix correspond to Gaussian, sparse, sub-sampled random Fourier transform, and a mixture of all three for the measurement ensembles.

3.3.4 Comparison to Khatri-Rao Structured Measurements

Figure 3.6 Relative error comparison for Khatri-Rao and Kronecker-structured measurements. The right-hand figure shows the average time for the sketching and recovery phases of the algorithm.
This set of experiments demonstrates that the sketching phase will dominate the run-time of Algorithm 3.1 regardless of the choice of leave-one-out type; however, the Kronecker-structured measurements are able to generate more measurements for a fixed number of operations as compared to Khatri-Rao structured measurements, see Figure 3.6. This means that it is possible to achieve similar or better performance using strictly modewise measurements, and in less overall time, as problems grow in size with respect to the total number of tensor elements, i.e., both the number of modes and the length of those modes. In Figure 3.6, for the Kronecker-structured measurements we sketch to $m = 25$, while for the Khatri-Rao ensemble we sketch pairs of modes to 225. We see that the Kronecker measurements perform incrementally better in terms of relative error, but at less than half the overall run-time. Sketching times are about five times faster for the Kronecker-structured measurements as compared to the Khatri-Rao. Note that this does trade speed for space: the total number of entries in the leave-one-out measurements is nearly three times larger for the Kronecker-structured measurements versus the Khatri-Rao, i.e., the sketches $B_i$ as per (3.2) and (3.4) have sizes $300 \times 25^2$ and $300 \times 225$, respectively.

3.3.5 Application to Video Summary Task

As a practical demonstration, we consider the same video summary task first described in [47] and again in [60]. In this demonstration, the video is taken with a camera in a fixed position. The video is a nature scene, and a person walks in front of the camera at two different time points in the second half of the video. The first 100 and the last 193 frames are removed since they include setup that results in small shifts of the camera. The entire video has been converted to grayscale. This yields a three-mode tensor of dimensions $2200 \times 1080 \times 1980$, which has a size of about 41 GB when stored as an array of doubles. We wish to identify the parts of the scene that include the person walking and distinguish them from the relatively static scene elsewhere. As discussed in [60], there is a third salient time-varying feature in this particular video, which is the light intensity of the scene, since at around frame 940 the scene darkens. Furthermore, there are changes in the light intensity as the camera automatically adjusts after the person walks in and out of the frame. For this reason, we cluster the frames using three centers rather than two.

In all cases, we use $k$-means to cluster the frames; however, we assign features to frames in four different ways (a small sketch of the first approach follows this list):
1. Using the sketch $B_1 \in \mathbb{R}^{2200 \times 20^2}$, as in (3.2), that leaves out the time dimension, then clustering using $k$-means on the rows of the unfolding of the sketch along the first, temporal mode.
2. Unfolding the temporal mode of the reconstructed tensor using a one-pass set of measurements, i.e., $(X_1)_{[1]} \in \mathbb{R}^{2200 \times 2138400}$ (recall that $\mathcal{X}_1$ denotes the output of Algorithm 3.1).
3. Unfolding the reconstructed tensor in the two-pass scenario, $(X_2)_{[1]} \in \mathbb{R}^{2200 \times 2138400}$ (recall that $\mathcal{X}_2$ denotes the output of Algorithm 3.10).
4. Using the estimated temporal factor matrix $U_1 \in \mathbb{R}^{2200 \times 20}$ (see Algorithm 3.1).
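The first of these feature choices amounts to clustering the rows of the temporal-mode sketch. A minimal sketch of that step, assuming scikit-learn's KMeans and a stand-in for a precomputed $B_1$ (variable names are ours):

\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

# B1: temporal-mode sketch of the video tensor, one row per frame
# (here a random stand-in with the shapes used in the text: 2200 frames, 20^2 features)
rng = np.random.default_rng(4)
B1 = rng.standard_normal((2200, 400))

# cluster the frames into three groups (person present, static scene, darkened scene)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(B1)
print(labels.shape)   # (2200,) -- one cluster assignment per frame
\end{verbatim}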
As we can see in Figures 3.7a and 3.7b, the sketch alone yields reliable clustering of the main temporal changes in the video, which verifies the observation in [60] about using the measurements as an effective feature set for clustering, although in that case the measurements were Khatri-Rao structured whereas ours are Kronecker-structured. The unfoldings of the reconstructed tensor also reliably distinguish the main parts of the scene; the reconstruction is useful at least for interpreting the resulting clusters. Although it is certainly natural to wish to cluster on the temporal factor, that method appears inferior to any of the preceding ones.

As an added advantage of using the modewise, Kronecker structured measurements, we can in principle select different measurement maps for different modes. Gaussian measurement maps theoretically have some advantages over other types in terms of accuracy for a fixed number of measurements, whereas applying RFD or other Fourier-like transforms to modes that have longer fibers nets a better payoff in terms of overall run-time because of the faster matrix-vector multiply permitted by these structured matrices. In this demonstration, we use Gaussian matrices along the spatial modes, and RFD matrices for the temporal mode.

In the earlier work [47], the authors describe a variant of Tucker-Alternating Least Squares (a.k.a. Higher Order Orthogonal Iteration, a multi-pass scenario) that employed TensorSketch to produce the necessary measurements used to reconstruct the same video tensor data we have used here. In the subsequent work [60], those authors again perform the same task, but use a single-pass approach which fits the framework we have described as Algorithm 3.1, where the measurement matrices are Khatri-Rao structured and the $\Omega_i$ have entries drawn from the standard Gaussian distribution. Furthermore, analysis of the type afforded by Theorem 3.11 may also explain the discrepancy between the sketching dimensions seen in [47] and [60]. Naturally there are several differences between the approaches, but the CountSketch matrices used in the TensorSketch operators, as shown in [33], have an $\mathcal{O}\left(\frac{1}{\delta}\right)$ dependency in order to ensure the OSE property, whereas the other ensembles, such as dense Gaussians, enjoy an $\mathcal{O}\left(\log\frac{1}{\delta}\right)$ dependence on this parameter.

As was discussed in [60], the video is not especially low-rank in practice, in particular along the spatial dimensions in terms of the relative error of the reconstruction. However, the clusters appear distinct enough that assigning clusters with this summary type of information is still possible.

3.4 Discussion

In this chapter we have described a measurement system that specializes the framework discussed in Section 1.5, together with a simple recovery procedure, that addresses the three main issues we noted about the two-stage measurement operators needed to ensure a TRIP, which was the key to making use of the existing theory regarding TIHT that was the subject of Chapter 2. That is, the leave-one-out alternative discussed in this chapter does not require working memory on the order of the size of the uncompressed signal, because it computes factors of the estimate directly from the measurements. Furthermore, the recovery procedure is non-iterative, provably recovers the tensor in the exact-arithmetic, no-noise, no-rank-truncation scenario, and has no reliance on unverifiable assumptions about the quality of the fit of a thresholding operator, as TIHT had.
On the third issue, whereby TRIP required an initial reshaping in order to achieve useful compression, the leave-one-out alternative trades this constraint for another: namely, that one of the modes goes uncompressed. A direct comparison of which of these constraints is preferable is perhaps best left to a given application rather than to any data-agnostic objective criterion. It is worth mentioning, however, that the need for the measurements to be independent is, based on empirical evidence from synthetic data, to a large extent an artifact of the analysis. That is, practically speaking, one can reuse measurement maps to produce the different measurement tensors, or even share the measurements that were used to estimate the factors and use them to estimate the core, without any change in accuracy.

Figure 3.7 (a) Cluster assignments for the 2200 frames in the video. The top row corresponds to using the measurements for the first mode only, $B_1$; the middle rows use the one- and two-pass approximations of the tensor; and the last row uses the factor matrix $U_1$ for the temporal mode only. We use Gaussian sketching matrices for both spatial modes and the real part of RFD for the temporal mode. Sketching parameters are $m = 20$, $m_c = 40$, and a rank truncation of $r = 10$ in all modes. (b) Three reference frames at 0, 1000, and 1496.

The dependencies introduced, however, make the analysis of such a procedure considerably more complicated. A practitioner could economize on both the storage of the linear transformations and the number of measurements by reusing maps and "recycling" measurements.

Figure 3.8 The 1456th frame of the grayscale video is shown for the original, the one-pass, and the two-pass reconstructions using sketching dimensions of $m = 300$, $m_c = 601$ and $r = 50$ for each mode. Although the reconstructions for this choice of sketching dimension and rank truncation are not particularly accurate for this video, the reconstructions nevertheless provide enough information to perform the summary task of clustering the frames into the major temporal changes that occur during the scene.

Also, in this chapter we have shown how several variations can be unified under the leave-one-out concept, and the results made useful for analyzing all of them. We have used empirical tests on simulated data to highlight important trade-offs in terms of what type of measurements were selected and how the sensing matrices were structured. Finally, we demonstrated a simple but representative application whereby clustering or segmenting video data is handled using the measurements. Recall the example from Section 1.2.2 as to why both the segmentation using the measurements and the recovered tensor could be of practical value.

3.5 Technical Proofs

Herein we provide proofs of some technical lemmas needed to prove our main theorems, including a lemma that we first stated in Chapter 1 regarding the AMM property. The second subsection then addresses the proof of Theorem 3.2.

3.5.1 Proof of Lemma 1.8

Our proof of Lemma 1.8 will utilize several intermediate lemmas. Our first lemma, concerning the approximate preservation of inner products, is a slight generalization of [2, Corollary 2].

Lemma 3.9 (The JL property implies angle preservation). Let $S \subset \mathbb{C}^n$ with cardinality at most $p$ and $\epsilon \in (0,1)$.
If a random matrix ฮฉ โˆˆ C๐‘šร—๐‘› has the (๐œ–/4, ๐›ฟ, 4๐‘2)-JL property for ๐‘†โ€ฒ = (cid:26) x โˆฅxโˆฅ2 + y โˆฅyโˆฅ2 , x โˆฅxโˆฅ2 โˆ’ y โˆฅyโˆฅ2 , x โˆฅxโˆฅ2 + ๐‘– y โˆฅyโˆฅ2 , x โˆฅxโˆฅ2 โˆ’ ๐‘– y โˆฅyโˆฅ2 (cid:12) (cid:12) x, y โˆˆ ๐‘† (cid:27) , then |โŸจฮฉx, ฮฉyโŸฉ โˆ’ โŸจx, yโŸฉ| โ‰ค ๐œ– โˆฅxโˆฅ2โˆฅyโˆฅ2 โˆ€x, y โˆˆ ๐‘† (3.35) 100 will be satisfied with probability at least 1 โˆ’ ๐›ฟ. Proof. Note that if either x = 0 or y = 0, then (3.35) automatically holds because 0 โ‰ค 0. Thus, suppose without loss of generality that x, y โ‰  0. Considering the normalizations u = x โˆฅxโˆฅ2 , v = y โˆฅyโˆฅ2 , one can see that the polarization identity implies that |โŸจฮฉu, ฮฉvโŸฉ โˆ’ โŸจu, vโŸฉ| = โ‰ค (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 1 4 1 4 โ„“=0 3 โˆ‘๏ธ โ„“=0 ๐œ– 4 3 โˆ‘๏ธ ๐‘–โ„“ (cid:16) โˆฅฮฉu + ๐‘–โ„“ฮฉvโˆฅ2 2 โˆ’ โˆฅu + ๐‘–โ„“vโˆฅ2 2 (cid:17) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (โˆฅuโˆฅ2 + โˆฅvโˆฅ2)2 = ๐œ– will hold whenever (1.8) holds with ๐‘† โ† ๐‘†โ€ฒ and ๐œ– โ† ๐œ–/4. The result now follows by renormalizing. Remark 3.5. Note that if ๐‘† โŠ‚ R๐‘› it suffices for a random matrix ฮฉ โˆˆ R๐‘šร—๐‘› to have the (๐œ–/2, ๐›ฟ, 2๐‘2)- JL property for a smaller set ๐‘†โ€ฒ โŠ‚ R๐‘› in Lemma 3.9. This can be seen by using the real version of the polarization identity instead of the complex version. The next lemma constructs a set ๐‘† to utilize in Lemma 3.9 based on two matrices with normalized columns. The end result is an entrywise approximate matrix multiplication property for the two column-normalized matrices in question. Lemma 3.10 (The JL property allows approximate matrix multiplies for unitary matrices). Let ๐‘‰ โˆˆ C๐‘›ร—๐‘ and ๐‘ˆ โˆˆ C๐‘›ร—๐‘ž have unit โ„“2-normalized columns. Suppose that ฮฉ โˆˆ C๐‘šร—๐‘› satisfies (3.35) from Lemma 3.9 where ๐‘† = (cid:8)u ๐‘— |u ๐‘— = ๐‘ˆ [:, ๐‘—](cid:9) โˆช (cid:8)v ๐‘— |v ๐‘— = ๐‘‰ [:, ๐‘—](cid:9). Then (cid:12) (cid:12)(๐‘‰ โˆ—ฮฉโˆ—ฮฉ๐‘ˆ โˆ’ ๐‘‰ โˆ—๐‘ˆ) ๐‘˜, ๐‘— (cid:12) (cid:12) โ‰ค ๐œ–, for all 1 โ‰ค ๐‘˜ โ‰ค ๐‘ and 1 โ‰ค ๐‘— โ‰ค ๐‘ž. Proof. Note that |๐‘†| = ๐‘ + ๐‘ž. Thus, |๐‘†โ€ฒ| โ‰ค 4( ๐‘ + ๐‘ž)2 in Lemma 3.9. Furthermore, . . . . . . ฮฉ๐‘‰ = Hence, (cid:0)(ฮฉ๐‘‰)โˆ— ฮฉ๐‘ˆ(cid:1) . . . ฮฉv๐‘ (cid:169) (cid:169) (cid:173) (cid:173) (cid:173) (cid:173) ฮฉu1 ฮฉu2 ฮฉv1 ฮฉv2 (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:173) (cid:171) (cid:171) ๐‘˜, ๐‘— = โŸจฮฉu ๐‘— , ฮฉv๐‘˜ โŸฉ. Therefore, given Lemma 3.9, for all choices of ๐‘˜, ๐‘— we have , and ฮฉ๐‘ˆ = . . . ฮฉu๐‘ž (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) (cid:170) (cid:174) (cid:174) (cid:174) (cid:174) (cid:174) (cid:172) . . . . . . . (cid:12) (cid:12)(๐‘‰ โˆ—ฮฉโˆ—ฮฉ๐‘ˆ โˆ’ ๐‘‰ โˆ—๐‘ˆ) ๐‘˜, ๐‘— (cid:12) = (cid:12) (cid:12) (cid:12)โŸจฮฉu ๐‘— , ฮฉv๐‘˜ โŸฉ โˆ’ โŸจu ๐‘— , v๐‘˜ โŸฉ(cid:12) (cid:12) โ‰ค ๐œ– โˆฅv๐‘˜ โˆฅ2โˆฅu ๐‘— โˆฅ2 = ๐œ– . 101 The next lemma constructs a new set ๐‘† to utilize in Lemma 3.9 by selecting a well chosen subset of the singular vectors of both ๐ด and ๐ต. This set will ultimately determine how the finite set ๐‘† promised by Lemma 1.8 depends on ๐ด and ๐ต. As we shall see, itโ€™s proven by applying Lemma 3.10 to two unitary matrices provided by the SVDs of ๐ด and ๐ต. Lemma 3.11 (The JL property implies the AMM property for arbitrary matrices). 
Let ๐ด โˆˆ C๐‘ร—๐‘› 2 , and suppose that ฮฉ โˆˆ C๐‘šร—๐‘› and ๐ต โˆˆ C๐‘›ร—๐‘ž have SVDs given by ๐ด = ๐‘ˆ1ฮฃ1๐‘‰ โˆ— and ๐ต = ๐‘ˆฮฃ2๐‘‰ โˆ— satisfies the conditions of Lemma 3.10 for ๐‘ˆ and ๐‘‰. Then, โˆฅ ๐ดฮฉโˆ—ฮฉ๐ต โˆ’ ๐ด๐ตโˆฅ๐น โ‰ค ๐œ– โˆฅ ๐ดโˆฅ๐น โˆฅ๐ตโˆฅ๐น Proof. We will expand the quantity of interest according the SVD of ๐ด and ๐ต. Doing so we see that โˆฅ ๐ดฮฉโˆ—ฮฉ๐ต โˆ’ ๐ด๐ตโˆฅ๐น = โˆฅ๐‘ˆ1ฮฃ1๐‘‰ โˆ—ฮฉโˆ—ฮฉ๐‘ˆฮฃ2๐‘‰ โˆ— 2 โˆ’ ๐‘ˆ1ฮฃ1๐‘‰ โˆ—๐‘ˆฮฃ2๐‘‰ โˆ— 2 โˆฅ๐น = โˆฅ๐‘ˆ1ฮฃ1 (๐‘‰ โˆ—ฮฉโˆ—ฮฉ๐‘ˆ โˆ’ ๐‘‰ โˆ—๐‘ˆ) ฮฃ2๐‘‰ โˆ— 2 โˆฅ๐น = โˆฅฮฃ1 (๐‘‰ โˆ—ฮฉโˆ—ฮฉ๐‘ˆ โˆ’ ๐‘‰ โˆ—๐‘ˆ) ฮฃ2โˆฅ๐น (cid:118)(cid:117)(cid:116) ๐‘ โˆ‘๏ธ ๐‘ž โˆ‘๏ธ ๐‘˜=1 (cid:118)(cid:117)(cid:116) ๐‘ โˆ‘๏ธ ๐‘—=1 ๐‘ž โˆ‘๏ธ = โ‰ค (ฮฃ1)2 ๐‘˜,๐‘˜ |๐‘‰ โˆ—ฮฉโˆ—ฮฉ๐‘ˆ โˆ’ ๐‘‰ โˆ—๐‘ˆ|2 ๐‘˜, ๐‘— (ฮฃ2)2 ๐‘—, ๐‘— ๐œŽ๐‘˜ ( ๐ด)2๐œ– 2๐œŽ๐‘— (๐ต)2 ๐‘—=1 ๐‘˜=1 (cid:118)(cid:116) ๐‘ โˆ‘๏ธ ๐œŽ๐‘˜ ( ๐ด)2 (cid:118)(cid:117)(cid:116) ๐‘ž โˆ‘๏ธ ๐‘—=1 ๐œŽ๐‘— (๐ต)2 ๐‘˜=1 = ๐œ– = ๐œ– โˆฅ ๐ดโˆฅ๐น โˆฅ๐ตโˆฅ๐น . Lemmas 3.9, 3.10, and 3.11 now collectively prove the following generalized version of Lemma 1.8. Lemma 3.12 (The JL property provides the AMM property). Let ๐ด โˆˆ C๐‘ร—๐‘› and ๐ต โˆˆ C๐‘›ร—๐‘ž. There exists a finite set ๐‘† โŠ‚ C๐‘› with cardinality |๐‘†| โ‰ค 4( ๐‘ + ๐‘ž)2 (determined entirely by ๐ด and ๐ต) such 102 that the following holds: If a random matrix ฮฉ โˆˆ C๐‘šร—๐‘› has the (๐œ–/4, ๐›ฟ, 4( ๐‘ + ๐‘ž)2)-JL property for ๐‘†, then ฮฉ will also have the (๐œ–, ๐›ฟ)-AMM property for ๐ด and ๐ต. We will make use of this simple centering result with regards to ๐ฟ ๐‘ norms of random variables. Lemma 3.13. Suppose ๐‘‹ a real random variable, and let ๐‘ โ‰ฅ 1, Then, โˆฅ ๐‘‹ โˆ’ E[๐‘‹] โˆฅ ๐ฟ ๐‘ โ‰ค 2 โˆฅ ๐‘‹ โˆฅ ๐ฟ ๐‘ Proof. Let ๐œ‡ = E[๐‘‹]. By Jensenโ€™s inequality, โˆฅ๐œ‡โˆฅ ๐ฟ ๐‘ = |๐œ‡| โ‰ค E[|๐‘‹ |] = โˆฅ ๐‘‹ โˆฅ ๐ฟ1 Observe, โˆฅ ๐‘‹ โˆ’ ๐œ‡โˆฅ ๐ฟ ๐‘ โ‰ค โˆฅ ๐‘‹ โˆฅ ๐ฟ ๐‘ + โˆฅ๐œ‡โˆฅ ๐ฟ ๐‘ โ‰ค โˆฅ ๐‘‹ โˆฅ ๐ฟ ๐‘ + โˆฅ ๐‘‹ โˆฅ ๐ฟ1 โ‰ค 2 โˆฅ ๐‘‹ โˆฅ ๐ฟ ๐‘ Where we have used Minkowskiโ€™s inequality in the first line, and Jensenโ€™s inequality in the third. 3.5.2 Proof of Theorem 3.2 A similar proof appears in support of [53, Theorem 2], which was itself simplified from earlier work [11]. We reproduce the proof here for completeness, and to clarify details. We begin by restating Theorem 3.2 for ease of reference. Theorem 3.12 (Restatement of Theorem 3.2). Let ๐‘‹ โˆˆ R๐‘›ร—๐‘ of rank หœ๐‘Ÿ โ‰ค min{๐‘›, ๐‘ } have the full SVD ๐‘‹ = ๐‘ˆฮฃ๐‘‰๐‘‡ , and let ๐‘‰๐‘Ÿ โ€ฒ โˆˆ R๐‘ร—๐‘Ÿ โ€ฒ denote the first ๐‘Ÿโ€ฒ columns of ๐‘‰ โˆˆ R๐‘ร—๐‘ for all ๐‘Ÿโ€ฒ โˆˆ [๐‘]. Fix ๐‘Ÿ โˆˆ [๐‘›] and consider the head-tail split ๐‘‹ = ๐‘‹๐‘Ÿ + ๐‘‹\๐‘Ÿ. If ฮฉ โˆˆ R๐‘šร—๐‘ satisfies 1. subspace embedding property (1.10) with ๐œ– โ† ๐œ– 3 for ๐ด โ† ๐‘‹๐‘‡ ๐‘Ÿ , 2. approximate multiplication property (1.13) with ๐œ– โ† 6 ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ}, ๐œ– โˆš min{๐‘Ÿ, หœ๐‘Ÿ} for ๐ด โ† ๐‘‹\๐‘Ÿ and ๐ต โ† 3. JL property (1.8) with ๐œ– โ† ๐œ– 4. approximate multiplication property (1.13) with ๐œ– โ† ๐œ– โˆš 6 the ๐‘› columns of ๐‘‹๐‘‡ \๐‘Ÿ 6 for ๐‘† โ† (cid:110) (cid:111) , and ๐‘Ÿ for ๐ด โ† ๐‘‹\๐‘Ÿ and ๐ต โ† ๐‘‹๐‘‡ \๐‘Ÿ, then หœ๐‘‹ := ๐‘‹ฮฉ๐‘‡ is an (๐œ–, 0, ๐‘Ÿ)-PCP sketch of ๐‘‹. Proof. 
Let ๐‘„ โˆˆ R๐‘›ร—๐‘Ÿ be a an arbitrary matrix with orthonormal columns so that ๐‘„๐‘„๐‘‡ โˆˆ R๐‘›ร—๐‘› is an orthogonal projection matrix. It suffices to show that 103 (cid:12) (cid:12) (cid:12) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) ๐‘‹(cid:13) (cid:13) (cid:13) 2 ๐น โˆ’ (cid:13) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) ๐‘‹ฮฉ๐‘‡ (cid:13) 2 (cid:13) ๐น (cid:12) (cid:12) (cid:12) โ‰ค ๐œ– (cid:13) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) ๐‘‹(cid:13) 2 (cid:13) ๐น . (3.36) Writing ๐‘‹ in terms of its head-tail split, (3.36) becomes (cid:12) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ )(๐‘‹๐‘Ÿ + ๐‘‹\๐‘Ÿ)(cid:13) (cid:13) (cid:12) (cid:13) (cid:12) 2 ๐น โˆ’ (cid:13) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) (๐‘‹๐‘Ÿ + ๐‘‹\๐‘Ÿ)ฮฉ๐‘‡ (cid:13) (cid:13) 2 ๐น (cid:12) (cid:12) (cid:12) โ‰ค ๐œ– (cid:13) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) ๐‘‹(cid:13) 2 (cid:13) ๐น , (3.37) where ๐‘‹๐‘Ÿ has the thin SVD representation ๐‘‹๐‘Ÿ = ๐‘ˆ๐‘Ÿ ฮฃ๐‘Ÿ๐‘‰๐‘‡ ๐‘Ÿ with ๐‘ˆ๐‘Ÿ and ๐‘‰๐‘Ÿ containing the first ๐‘Ÿ columns of ๐‘ˆ and ๐‘‰ from the full SVD of ๐‘‹, respectively.1 Letting ๐‘ƒ := (๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ), and using that tr(cid:8)๐ด๐ด๐‘‡ (cid:9) = โˆฅ ๐ดโˆฅ2 ๐น, one may expand the left hand side of (3.37). Noting that ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ = 0 and ๐‘ƒ = ๐‘ƒ๐‘‡ while doing so, we can now see that (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ )(๐‘‹๐‘Ÿ + ๐‘‹\๐‘Ÿ)(cid:13) (cid:13) ๐น โˆ’ (cid:13) 2 (cid:13) (cid:17) (cid:13)(๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) (๐‘‹๐‘Ÿ + ๐‘‹\๐‘Ÿ)ฮฉ๐‘‡ (cid:13) (cid:13) (cid:16) (cid:16) 2 ๐น โˆ’ tr ๐‘ƒ(๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ + ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ + ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:12) (cid:12) (cid:12) = \๐‘Ÿ + ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12)tr ๐‘Ÿ + ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) . Regrouping terms in the last expression while noting the invariance of trace to transposition, one can we can now further see that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ + ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17) (cid:16) โˆ’ tr ๐‘ƒ(๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ = โ‰ค (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:16) ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ โˆ’ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17) (cid:16) + 2 tr (cid:16) ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ โˆ’ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) + 2 (cid:12) (cid:12) (cid:12)tr \๐‘Ÿ + ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ + ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:16) (cid:17) ๐‘Ÿ )๐‘ƒ ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:16) ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ + tr (cid:17)(cid:12) (cid:12) (cid:12) + (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ (cid:16) ๐‘Ÿ + ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) (cid:17)(cid:12) (cid:12) (cid:12) ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) . 
Looking back at (3.37) in the light of this last computation, one can now see that it suffices to prove that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ โˆ’ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) + 2 (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) + (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค ๐œ– โˆฅ๐‘ƒ๐‘‹ โˆฅ2 ๐น (3.38) holds to establish the desired result. Going forward we will therefore aim to prove (3.38) by proving each of the following three bounds: ๐‘Ÿ )๐‘ƒ(cid:1)(cid:12) ๐‘Ÿ โˆ’ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:12)tr (cid:0)๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ (a) (cid:12) (cid:12) โ‰ค ๐œ– 3 โˆฅ๐‘ƒ๐‘‹ โˆฅ2 ๐น, 1If ๐‘Ÿ โ‰ฅ ๐‘ we let ๐‘‰๐‘Ÿ = ๐‘‰. 104 (b) (cid:12) (cid:12)tr (cid:0)๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:12) ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ (cid:12) (cid:12)tr 6 โˆฅ๐‘ƒ๐‘‹ โˆฅ2 (cid:17)(cid:12) \๐‘Ÿ)๐‘ƒ (cid:12) (cid:12) Proving (a) โ€“ (c) will establish (3.38), thereby completing the proof. ๐‘Ÿ )๐‘ƒ(cid:1)(cid:12) (cid:12) โ‰ค ๐œ– \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐น, and โ‰ค ๐œ– 3 โˆฅ๐‘ƒ๐‘‹ โˆฅ2 ๐น. (c) (cid:16) Proof of Bound (a): Using again that tr(cid:8)๐ด๐ด๐‘‡ (cid:9) = tr(cid:8)๐ด๐‘‡ ๐ด(cid:9) = โˆฅ ๐ดโˆฅ2 ๐น and that ๐‘ƒ = ๐‘ƒ๐‘‡ we have (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ โˆ’ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) = (cid:12) (cid:12) (cid:12) (cid:13) (cid:13)๐‘‹๐‘‡ ๐‘Ÿ ๐‘ƒ(cid:13) (cid:13) 2 ๐น โˆ’ (cid:13) (cid:13)ฮฉ๐‘‹๐‘‡ ๐‘Ÿ ๐‘ƒ(cid:13) 2 (cid:13) ๐น . (cid:12) (cid:12) (cid:12) Applying Lemma 1.7 in the light of assumption 1, we now have that (cid:16) (cid:12) (cid:12) (cid:12)tr as desired. ๐‘ƒ(๐‘‹๐‘Ÿ ๐‘‹๐‘‡ ๐‘Ÿ โˆ’ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค โ‰ค ๐œ– 3 ๐œ– 3 (cid:13) (cid:13)๐‘‹๐‘‡ ๐‘Ÿ ๐‘ƒ(cid:13) 2 ๐น = (cid:13) (cid:13)๐‘‹๐‘‡ ๐‘ƒ(cid:13) (cid:13) 2 ๐น = (cid:13) ๐œ– 3 ๐œ– 3 (cid:13) (cid:13)๐‘‰๐‘Ÿ๐‘‰๐‘‡ ๐‘Ÿ ๐‘‹๐‘‡ ๐‘ƒ(cid:13) 2 (cid:13) ๐น โˆฅ๐‘ƒ๐‘‹ โˆฅ2 ๐น Proof of Bound (b): Using the invariance of trace to both transposition and permutations, as well as that ๐‘ƒ๐‘‡ = ๐‘ƒ = ๐‘ƒ2, we can see that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:16) (cid:17)(cid:12) (cid:12) (cid:12) = (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:16) (cid:17)(cid:12) (cid:12) (cid:12) = (cid:12) (cid:12) (cid:12)tr ๐‘ƒ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:17)(cid:12) (cid:12) (cid:12) . Recalling the full SVD ๐‘‹ = ๐‘ˆฮฃ๐‘‰๐‘‡ , we note that if หœ๐‘Ÿ := rank(๐‘‹) < min{๐‘›, ๐‘ } we can remove the last ๐‘› โˆ’ หœ๐‘Ÿ columns of ๐‘ˆ, the last ๐‘ โˆ’ หœ๐‘Ÿ columns of ๐‘‰, and the last ๐‘› โˆ’ หœ๐‘Ÿ rows and ๐‘ โˆ’ หœ๐‘Ÿ columns of ฮฃ to form the thin SVD ๐‘‹ = หœ๐‘ˆ หœฮฃ หœ๐‘‰๐‘‡ with หœ๐‘ˆ โˆˆ R๐‘›ร— หœ๐‘Ÿ, หœฮฃ โˆˆ R หœ๐‘Ÿร— หœ๐‘Ÿ, and หœ๐‘‰ โˆˆ R๐‘ร— หœ๐‘Ÿ. 
Having done so we note that หœฮฃ will be invertable and that หœ๐‘ˆ หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿ = ๐‘‹๐‘Ÿ so that we may write (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) = = = = = (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:16) (cid:16) (cid:16) (cid:16) ๐‘ƒ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:17)(cid:12) (cid:12) (cid:12) ๐‘ƒ หœ๐‘ˆ หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:17)(cid:12) (cid:12) (cid:12) (cid:16)(cid:16) ๐‘ƒ หœ๐‘ˆ หœฮฃ หœ๐‘‰๐‘‡ หœ๐‘‰ หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ ๐‘ƒ หœ๐‘ˆ หœฮฃ หœ๐‘‰๐‘‡ (cid:17) (cid:16) (cid:16) หœ๐‘‰ หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:17)(cid:12) (cid:12) (cid:12) หœ๐‘‰ หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:17)(cid:17)(cid:12) (cid:12) (cid:12) (๐‘ƒ๐‘‹) . (cid:17)(cid:17)(cid:12) (cid:12) (cid:12) 105 Recall now that โŸจ๐ด, ๐ตโŸฉ๐น := tr(cid:0)๐ด๐ต๐‘‡ (cid:1) is an inner product on matrices with โˆฅ ๐ดโˆฅ๐น = โˆš๏ธƒ tr(cid:0)๐ด๐ด๐‘‡ (cid:1). Hence, we may apply the Cauchyโ€“Schwarz inequality to our last expression to see that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (๐‘ƒ๐‘‹) (cid:16) (cid:12) (cid:12) (cid:12)tr (cid:17)(cid:12) (cid:12) (cid:12) = โ‰ค โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น = โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น (cid:17)(cid:17)(cid:12) (cid:12) (cid:12) (cid:16) หœ๐‘‰ หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:13) (cid:13) (cid:13)๐น (cid:13) หœ๐‘‰ หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:13) \๐‘Ÿ (cid:13) (cid:13) (cid:13) หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ (cid:13) (cid:13) \๐‘Ÿ (cid:13)๐น (cid:13) . Expanding ๐‘‹๐‘Ÿ in terms of its thin SVD representation ๐‘‹๐‘Ÿ = ๐‘ˆ๐‘Ÿ ฮฃ๐‘Ÿ๐‘‰๐‘‡ ๐‘Ÿ we now have that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น = โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น = โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น = โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡ ๐‘‹๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:13) (cid:13) (cid:13) (cid:13) หœฮฃโˆ’1 หœ๐‘ˆ๐‘‡๐‘ˆ๐‘Ÿ ฮฃ๐‘Ÿ๐‘‰๐‘‡ (cid:13) (cid:13) (cid:13) min{๐‘Ÿ, หœ๐‘Ÿ}ฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘‰๐‘‡ (cid:13) \๐‘Ÿ (cid:13) (cid:13) (cid:13)๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ } (cid:13) (cid:13) (cid:13)๐น ๐‘Ÿ ฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:13) (cid:13) (cid:13)๐น (cid:13) (cid:13)๐น . (cid:13) (cid:13) (cid:13)๐น Finally, we may now use that ๐‘‹\๐‘Ÿ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} = ๐‘‹ (๐ผ๐‘ โˆ’๐‘‰๐‘Ÿ๐‘‰๐‘‡ ๐‘Ÿ )๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} = ๐‘‹ (cid:0)๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} โˆ’ ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} (cid:1) = 0 to see that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘Ÿ )๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น (cid:13) (cid:13)๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} (cid:13) (cid:13)๐น = โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น โ‰ค โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น (cid:13) (cid:13)๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} โˆ’ ๐‘‹\๐‘Ÿ๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} (cid:32) (cid:13) (cid:13)๐น (cid:13) (cid:13)๐‘‹\๐‘Ÿ (cid:13) (cid:13)๐น (cid:13) (cid:13)๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} (cid:13) (cid:13)๐น (cid:33) , ๐œ– 6โˆš๏ธmin{๐‘Ÿ, หœ๐‘Ÿ} where we have utilized assumption 2 in the last inequality. 
We are now finished after using that (cid:13)๐น = โˆš๏ธmin{๐‘Ÿ, หœ๐‘Ÿ}, and noting that (cid:13) (cid:13) (cid:13) (cid:13)๐‘‰min{๐‘Ÿ, หœ๐‘Ÿ} (cid:13)๐‘‹\๐‘Ÿ orthogonal projections ๐‘ƒ by the definition of ๐‘‹๐‘Ÿ. (cid:13) (cid:13)๐น โ‰ค โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น holds for all rank ๐‘› โˆ’ ๐‘Ÿ or greater Proof of Bound (c): Again using the invariance of trace to permutations as well as ๐‘ƒ = ๐‘ƒ2 = 106 ๐ผ โˆ’ ๐‘„๐‘„๐‘‡ we have that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) = = โ‰ค (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:12) (cid:12) (cid:12)tr (cid:16) (cid:16) ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ) (cid:17)(cid:12) (cid:12) (cid:12) \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ) (cid:17)(cid:12) (cid:12) (cid:12) (๐ผ โˆ’ ๐‘„๐‘„๐‘‡ ) (๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ (cid:16) (cid:16) \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ (cid:12) (cid:12) (cid:12)tr (cid:13) (cid:13)ฮฉ๐‘‹๐‘‡ (cid:13) ๐‘„๐‘„๐‘‡ (๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ (cid:12) (cid:12) (cid:12) (cid:12) (cid:17)(cid:12) (cid:12) (cid:12) \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ) (cid:13) (cid:13)๐‘„๐‘„๐‘‡ (cid:13) + (cid:13) (cid:13) (cid:13)๐น (cid:13) ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ (cid:13) 2 (cid:13) (cid:13) (cid:13) 2 (cid:13) (cid:13) โˆ’ + \๐‘Ÿ ๐น ๐น (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค (cid:12) (cid:13) (cid:12) (cid:13) (cid:12) (cid:13) (cid:12) ๐‘‹๐‘‡ \๐‘Ÿ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:13) (cid:13) (cid:13)๐น , where we have again utilized Cauchyโ€“Schwarz in the last inequality. Utilizing assumption 3, the first term just above can be bounded by ๐œ– ๐น = ๐œ– ๐น using an argument analogous to the 6 โˆฅ ๐‘‹๐‘‡ \๐‘Ÿ โˆฅ2 6 โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 proof of Lemma 1.7. Doing so we see that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค = ๐œ– 6 ๐œ– 6 โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ (cid:13)๐‘„๐‘„๐‘‡ (cid:13) ๐น + (cid:13) (cid:13)๐น โˆš (cid:13) ๐‘Ÿ (cid:13) (cid:13) (cid:13) (cid:13) (cid:13) \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ ๐น + \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:13) (cid:13) (cid:13)๐น . (cid:13) (cid:13) (cid:13)๐น Finally, we may now employ assumption 4 to bound the second term just above. Doing so we obtain that (cid:16) (cid:12) (cid:12) (cid:12)tr ๐‘ƒ(๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ)๐‘ƒ (cid:17)(cid:12) (cid:12) (cid:12) โ‰ค โ‰ค = ๐œ– 6 ๐œ– 6 ๐œ– 3 โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 ๐น + โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 ๐น + โˆš ๐‘Ÿ โˆš ๐‘Ÿ ๐น . โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 \๐‘Ÿ โˆ’ ๐‘‹\๐‘Ÿฮฉ๐‘‡ ฮฉ๐‘‹๐‘‡ \๐‘Ÿ (cid:13) (cid:13) (cid:13) 6 ๐‘‹\๐‘Ÿ ๐‘‹๐‘‡ ๐œ– โˆš ๐‘Ÿ โˆฅ ๐‘‹\๐‘Ÿ โˆฅ2 ๐น (cid:13) (cid:13) (cid:13)๐น To conclude we note again that (cid:13) (cid:13)๐‘‹\๐‘Ÿ projections ๐‘ƒ by the definition of ๐‘‹๐‘Ÿ. 3.6 Algorithm Variants (cid:13) (cid:13)๐น โ‰ค โˆฅ๐‘ƒ๐‘‹ โˆฅ๐น holds for all rank ๐‘› โˆ’ ๐‘Ÿ or greater orthogonal There are four tasks relevant in Algorithm 3.1 where by making difference choices for these tasks, we get variants of the algorithm. Two of the tasks are related to the measurement process: sketching the tensor in order to produce leave-one-out measurements, and also sketching the tensor without leaving out a mode in order to produce measurements useful in computing the core. 
The second two tasks are then recovering the factor matrices, and recovering the core. 107 Note, if we permit a second pass on the tensor X after estimating factor matrices ๐‘„๐‘–, then core measurements are not necessary, and the optimal core given these factor matrices can be found by computing G = X ร—1 ๐‘„๐‘‡ 1 ร—2 ๐‘„๐‘‡ 2 ยท ยท ยท ร—๐‘‘ ๐‘„๐‘‡ ๐‘‘ See section 4.2 in [37]. This then is the computation used in two-pass versions of the algorithm; it requires first to compute the factor matrices from measurements, and then apply these factors modewise to the original tensor; no separate measurement tensor for the core is required. Crucially however, this relies on a second access to the data; for which we are computing a HOSVD. In this section then, we break apart Algorithm 3.1 into these tasks, and show how combining different choices for these tasks produces the variants of the algorithm considered. Within the pseudo-code, โ€œunfoldโ€ refers to the operation of taking a tensor and flattening it into a matrix of the size listed by arranging the specified modeโ€™s fibers as the columns. The operation โ€œfoldโ€ is the inverse of this, taking a matrix as viewed as an unfolding along the specified mode and reshaping it into a tensor of the given dimensions. Algorithm 3.2 takes measurement matrices (e.g. sub-Gaussian random matrices) and the data tensor X and produces a set of leave-one-out measurements which can be viewed as tensors or as flattened matrices. That is, it is practical to apply the measurement matrices along the modes and obtain a tensor of measurements with ๐‘‘ โˆ’ 1 modes each of length ๐‘š and one mode of length ๐‘›, see Figure 3.1 for a schematic depiction. The measurement process takes slices of the tensor and maps them to (smaller) slices with (mostly) shorter edge lengths. Or we can conceive of the measurements as a matrix by unfolding the measurement tensor along the mode that is uncompressed and thus obtaining a matrix of size ๐‘› ร— ๐‘š๐‘‘โˆ’1, one such matrix for each mode ๐‘– โˆˆ [๐‘‘]. It is Kronecker structured because of the correspondence of modewise products of tensors with matrices to matrix products of unfoldings of a tensor with matrices, see for example (1.4). Alternatively, we can use Algorithm 3.3 which also produces a set of leave-one-out measure- ments of the tensor X which can be viewed as a matrix. It is Khatri-Rao structured because the measurement matrix applied to the unfolding is formed using Khatri-Rao products. Note, unlike 108 : Algorithm 3.2 Leave-One-Out Kronecker Sketching input ฮฉ(๐‘–, ๐‘—) where rank(ฮฉ(๐‘–,๐‘–)) = ๐‘› and ฮฉ(๐‘–, ๐‘—) โˆˆ R๐‘šร—๐‘› for ๐‘–, ๐‘— โˆˆ [๐‘‘], ๐‘– โ‰  ๐‘— X a ๐‘‘ mode tensor with side lengths ๐‘› output : ๐ต๐‘– for ๐‘– โˆˆ [๐‘‘] for ๐‘– โˆˆ [๐‘‘] do B๐‘– โ† X ร—1 ฮฉ(๐‘–,1) ร—2 ฮฉ(๐‘–,2) ร—3 ยท ยท ยท ร—๐‘‘ ฮฉ(๐‘–,๐‘‘) # Now unfold the measurement tensor so the mode-๐‘– fibers are columns, size ๐‘› ร— ๐‘š๐‘‘โˆ’1 ๐ต๐‘– โ† unfold(B๐‘–, ๐‘› ร— ๐‘š๐‘‘โˆ’1, mode = ๐‘–) end Algorithm 3.2, there is not necessarily any natural way to view the measurements ๐ต๐‘– as tensors of ๐‘‘ modes - the sketching process in this case takes slices of the tensor and maps them to vectors and we gather these into matrices ๐ต๐‘– of size ๐‘› ร— ๐‘š for each ๐‘– โˆˆ [๐‘‘]. 
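Before presenting the Khatri-Rao variant in Algorithm 3.3 below, it may help to see the Kronecker-structured leave-one-out measurement process of Algorithm 3.2 in code. The following is a minimal NumPy sketch rather than the implementation used in our experiments: it assumes a three mode tensor with equal side lengths, Gaussian measurement matrices scaled by 1/√m, and takes the full-rank map Ω(i,i) on the left-out mode to be the identity for simplicity. The final two lines anticipate the factor recovery step of Algorithm 3.5 given later in this section.

```python
import numpy as np

def mode_product(X, M, k):
    """Mode-k product X x_k M: multiply the mode-k unfolding of X by M."""
    Xk = np.moveaxis(X, k, 0)                     # bring mode k to the front
    shp = Xk.shape
    Y = M @ Xk.reshape(shp[0], -1)                # act on the mode-k unfolding
    return np.moveaxis(Y.reshape((M.shape[0],) + shp[1:]), 0, k)

def leave_one_out_kronecker_sketch(X, m, rng):
    """Minimal sketch of Algorithm 3.2 with Gaussian measurement matrices."""
    d, n = X.ndim, X.shape[0]
    B = []
    for i in range(d):
        # identity on the left-out mode i (any full-rank n x n map would do),
        # m x n Gaussian maps on the remaining modes
        Omegas = [np.eye(n) if j == i else rng.standard_normal((m, n)) / np.sqrt(m)
                  for j in range(d)]
        Bi = X
        for j, Om in enumerate(Omegas):
            Bi = mode_product(Bi, Om, j)
        # unfold so the (uncompressed) mode-i fibers are the columns: n x m^(d-1)
        B.append(np.moveaxis(Bi, i, 0).reshape(n, -1))
    return B

rng = np.random.default_rng(0)
n, r, m = 40, 3, 10
# a random rank-(r, r, r) Tucker tensor, purely for illustration
G = rng.standard_normal((r, r, r))
U = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(3)]
X = G
for k, Uk in enumerate(U):
    X = mode_product(X, Uk, k)
B1, B2, B3 = leave_one_out_kronecker_sketch(X, m, rng)
print(B1.shape)                                    # (n, m**2) = (40, 100)

# factor recovery (cf. Algorithm 3.5): with Omega(i,i) = I the linear solve is trivial,
# and the top-r left singular vectors of B1 estimate the mode-1 HOSVD factor
Q1 = np.linalg.svd(B1, full_matrices=False)[0][:, :r]
print(np.linalg.norm(Q1 @ Q1.T @ U[0] - U[0]))     # close to 0: the column spans coincide
```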
: Algorithm 3.3 Leave-One-Out Khatri-Rao Sketching input ฮฉ(๐‘–, ๐‘—) where rank(ฮฉ(๐‘–,๐‘–)) = ๐‘› and ฮฉ(๐‘–, ๐‘—) โˆˆ R๐‘šร—๐‘› for ๐‘–, ๐‘— โˆˆ [๐‘‘], ๐‘– โ‰  ๐‘— X a ๐‘‘ mode tensor with side lengths ๐‘› output : ๐ต๐‘– for ๐‘– โˆˆ [๐‘‘] for ๐‘– โˆˆ [๐‘‘] do (cid:0)ฮฉ(๐‘–,1) โ€ข ฮฉ(๐‘–,2) โ€ข . . . โ€ข ฮฉ๐‘–,๐‘–โˆ’1 โ€ข ฮฉ๐‘–,๐‘–+1 โ€ข . . . โ€ข ฮฉ๐‘–,๐‘‘(cid:1)๐‘‡ ๐ต๐‘– โ† ฮฉ(๐‘–,๐‘–) ๐‘‹[๐‘–] end In order to estimate the core, we wish another, independent set of measurements. These are obtained in the manner as Algorithm 3.2, only now we are permitted to compress all the modes, producing a tensor of ๐‘‘ modes all with side length equal to ๐‘š๐‘. : Algorithm 3.4 Core Sketching input ฮฆ๐‘– โˆˆ R๐‘š๐‘ร—๐‘›, ๐‘– โˆˆ [๐‘‘] X a ๐‘‘ mode tensor with side lengths ๐‘› output : B๐‘ B๐‘ โ† X ร—1 ฮฆ1 ร—2 ฮฆ2 ร—3 ยท ยท ยท ร—๐‘‘ ฮฆ๐‘‘ Next, we describe the procedure which takes as input (๐‘Ž) leave-one-out measurements of X (from Algorithm 3.2 or 3.3), (๐‘) the full-rank sensing matrices applied to the uncompressed modes of X, and (๐‘) our desired target rank vector of r, and then outputs a factor matrix ๐‘„๐‘– for each mode. 109 In the case of exact arithmetic, no rank truncation, and no noise, this exactly recovers the factors (see (3.11)). for ๐‘– โˆˆ [๐‘‘] measurements that leave mode ๐‘– uncompressed Algorithm 3.5 Recover HOSVD Factors from Leave-One-Out Measurements input : ๐ต๐‘– โˆˆ ๐‘…๐‘›ร—๐‘š๐‘‘โˆ’1 ฮฉ(๐‘–,๐‘–) โˆˆ ๐‘…๐‘›ร—๐‘› for ๐‘– โˆˆ [๐‘‘] r = (๐‘Ÿ, . . . , ๐‘Ÿ) desired rank for HOSVD output : ๐‘„1, . . . , ๐‘„๐‘‘ โˆˆ R๐‘›ร—๐‘Ÿ # Factor matrix recovery for ๐‘– โˆˆ [๐‘‘] do # Solve ๐‘› ร— ๐‘› linear system Solve ฮฉ(๐‘–,๐‘–) ๐น๐‘– = ๐ต๐‘– for ๐น๐‘– # Compute SVD and keep the top ๐‘Ÿ singular vectors ๐‘ˆ, ฮฃ, ๐‘‰๐‘‡ โ† SVD(๐น๐‘–) ๐‘„๐‘– โ† ๐‘ˆ:,:๐‘Ÿ end Lastly, we consider the task of obtaining the core of the HOSVD of the data tensor. The two ways described in section 3.2 are to either compute the core using a second access to the data tensor - in which case this is a matter of applying the transpose of the factor matrices from Algorithm 3.5 to the data tensor. This is detailed in Algorithm 3.10. : Algorithm 3.6 Compute HOSVD Core with Second Access input ๐‘„1, . . . , ๐‘„๐‘‘ โˆˆ R๐‘›ร—๐‘Ÿ computed factor matrices X a ๐‘‘ mode tensor with side lengths ๐‘›. output : G a ๐‘‘ mode tensor with side lengths ๐‘Ÿ G = X ร—1 ๐‘„๐‘‡ 2 ยท ยท ยท ร—๐‘‘ ๐‘„๐‘‡ 1 ร—2 ๐‘„๐‘‡ ๐‘‘ In the scenario in which a second access to the tensor is not desired, instead we obtain the core of the HOSVD by solving a linear system involving the measurement operators and the factor matrices, see (3.21). This is equivalent to solving the linear system a mode at a time as detailed in Algorithm 3.7, a method of practical value because it does not require as much working memory. A third possibility is to โ€œre-useโ€ leave-one-out measurements to compute the core. Theoretically this involves new dependencies on the errors introduced by estimating the factors and errors introduced by recovering the core that are not addressed in any of our main results. Practically 110 : Algorithm 3.7 Recover HOSVD Core from Measurements input B๐‘ a ๐‘‘ mode tensor with side lengths ๐‘š๐‘ ฮฆ๐‘– โˆˆ R๐‘š๐‘ร—๐‘› ๐‘„1, . . . 
, ๐‘„๐‘‘ โˆˆ R๐‘›ร—๐‘Ÿ computed factor matrices output : H for ๐‘– โˆˆ [๐‘‘] do # unfold measurements, mode-๐‘– fibers are columns, size ๐‘š๐‘ ร— ๐‘Ÿ (๐‘–โˆ’1)๐‘š๐‘‘โˆ’1โˆ’(๐‘–โˆ’1) ๐ป โ† unfold(B๐‘, ๐‘š๐‘ ร— ๐‘Ÿ (๐‘–โˆ’1)๐‘š๐‘‘โˆ’1โˆ’(๐‘–โˆ’1) , mode = ๐‘–) # Undo the mode-๐‘– measurement operator and factorโ€™s action by finding ๐‘ ๐‘ least square solution to ๐‘š๐‘ ร— ๐‘Ÿ over-determined linear system Solve ฮฆ๐‘–๐‘„๐‘–๐ปnew = ๐ป for ๐ป๐‘›๐‘’๐‘ค # reshape the flattened partially solved core into a tensor ร—๐‘š๐‘ ร— ยท ยท ยท ร— ๐‘š๐‘ B๐‘ โ† fold(๐ปnew, ๐‘Ÿ ร— ๐‘Ÿ ยท ยท ยท ร— ๐‘Ÿ (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) (cid:125) (cid:125) (cid:123)(cid:122) (cid:124) ๐‘– ๐‘‘โˆ’๐‘– # Each iteration ๐‘š๐‘ โ†’ ๐‘Ÿ in ๐‘–th mode , mode = ๐‘–) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:124) end however, it would be desirable to avoid having to produce the core measurement tensor, and empirically on synthetic data the overall error is not effected. Algorithm 3.8 Recover HOSVD Core from Recycled Leave One Out Measurements input : ๐ต ๐‘— for a fixed ๐‘—. ฮฉ ๐‘—,๐‘– for each ๐‘– โˆˆ [๐‘‘] ๐‘„1, . . . , ๐‘„๐‘‘ โˆˆ R๐‘›ร—๐‘Ÿ computed factor matrices output : H for ๐‘– โˆˆ [๐‘‘] do # unfold measurements, mode-๐‘– fibers are columns, size ๐‘š ร— ๐‘Ÿ (๐‘–โˆ’1)๐‘š๐‘‘โˆ’1โˆ’(๐‘–โˆ’1) ๐ป โ† unfold(B ๐‘— , ๐‘š ร— ๐‘Ÿ (๐‘–โˆ’1)๐‘š๐‘‘โˆ’1โˆ’(๐‘–โˆ’1), mode = ๐‘–) # Undo the mode-๐‘– measurement operator and factorโ€™s action by finding least square solution to ๐‘š ร— ๐‘Ÿ over-determined linear system Solve ฮฉ( ๐‘—,๐‘–)๐‘„๐‘–๐ปnew = ๐ป for ๐ป๐‘›๐‘’๐‘ค # reshape the flattened partially solved core into a tensor ร—๐‘š๐‘ ร— ยท ยท ยท ร— ๐‘š๐‘ B ๐‘— โ† fold(๐ปnew, ๐‘Ÿ ร— ๐‘Ÿ ยท ยท ยท ร— ๐‘Ÿ (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:125) (cid:123)(cid:122) (cid:125) (cid:124) ๐‘‘โˆ’๐‘– , mode = ๐‘–) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:123)(cid:122) ๐‘– (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:124) # Each iteration ๐‘š โ†’ ๐‘Ÿ in ๐‘–th mode or ๐‘› โ†’ ๐‘› when ๐‘– = ๐‘— end We are now able to state the variants of the general algorithm. 111 (cid:9) Algorithm 3.9 Recover HOSVD Kronecker One Pass # Obtain measurements for factors with Alg. 3.2 ๐ต1, ๐ต2, . . . , ๐ต๐‘‘ โ† Leave-One-Out Kronecker Sketching((cid:8)ฮฉ(๐‘–, ๐‘—) # Obtain measurement for core with Alg. 3.4 B๐‘ โ† Core Sketching({ฮฆ๐‘–}๐‘–โˆˆ[๐‘‘] , X) # Estimate factor matrices using Alg. 3.5 ๐‘„1, ๐‘„2, . . . , ๐‘„๐‘‘ โ† Recover HOSVD Factors from Leave-One-Out Measurements ((cid:8)ฮฉ(๐‘–,๐‘–) # Estimate core using Alg. 3.8 H โ† Recover HOSVD Core from Measurements({ฮฆ๐‘–}๐‘–โˆˆ[๐‘‘] , B๐‘) output : ห†X = [[H , ๐‘„1, . . . ๐‘„๐‘‘]] , {๐ต๐‘–}๐‘–โˆˆ[๐‘‘] , (๐‘Ÿ, . . . , ๐‘Ÿ)) ๐‘–โˆˆ[๐‘‘] (cid:9) ๐‘–, ๐‘— โˆˆ[๐‘‘] Algorithm 3.10 Recover HOSVD Kronecker Two Pass # Obtain measurements for factors with Alg. 3.2 ๐ต1, ๐ต2, . . . 
, ๐ต๐‘‘ โ† Leave-One-Out Kronecker Sketching((cid:8)ฮฉ(๐‘–, ๐‘—) # Estimate factor matrices using Alg. 3.5 ๐‘„1, ๐‘„2, . . . , ๐‘„๐‘‘ โ† Recover HOSVD Factors from Leave-One-Out Measurements((cid:8)ฮฉ(๐‘–,๐‘–) # Compute core using Alg. 3.6 H โ† Compute HOSVD Core with Second Access({๐‘„๐‘–}๐‘–โˆˆ[๐‘‘] , X) output : ห†X = [[H , ๐‘„1, . . . ๐‘„๐‘‘]] , {๐ต๐‘–}๐‘–โˆˆ[๐‘‘] , (๐‘Ÿ, . . . , ๐‘Ÿ)) ๐‘–โˆˆ[๐‘‘] (cid:9) (cid:9) ๐‘–, ๐‘— โˆˆ[๐‘‘] , X) , X) Algorithm 3.11 Recover HOSVD Khatri-Rao One Pass # Obtain measurements for factors with Alg. 3.3 ๐ต1, ๐ต2, . . . , ๐ต๐‘‘ โ† Leave-One-Out Khatri-Rao Sketching((cid:8)ฮฉ(๐‘–, ๐‘—) # Obtain measurements for core with Alg. 3.4 B๐‘ โ† Core Sketching({ฮฆ๐‘–}๐‘–โˆˆ[๐‘‘] , X) # Estimate factor matrices using Alg. 3.5 ๐‘„1, ๐‘„2, . . . , ๐‘„๐‘‘ โ† Recover HOSVD Factors ((cid:8)ฮฉ(๐‘–,๐‘–) # Estimate core using Alg. 3.8 H โ† Recover HOSVD Core from Measurements({ฮฆ๐‘–}๐‘–โˆˆ[๐‘‘] , B๐‘) output : ห†X = [[H , ๐‘„1, . . . ๐‘„๐‘‘]] , {๐ต๐‘–}๐‘–โˆˆ[๐‘‘] , (๐‘Ÿ, . . . , ๐‘Ÿ)) ๐‘–โˆˆ[๐‘‘] (cid:9) (cid:9) , X) ๐‘–, ๐‘— โˆˆ[๐‘‘] from Leave-One-Out Measurements 112 Algorithm 3.12 Recover HOSVD Khatri-Rao Two Pass # Obtain measurements for factors with Alg. 3.3 ๐ต1, ๐ต2, . . . , ๐ต๐‘‘ โ† Leave-One-Out Khatri-Rao Sketching((cid:8)ฮฉ(๐‘–, ๐‘—) # Estimate factor matrices using Alg. 3.5 ๐‘„1, ๐‘„2, . . . , ๐‘„๐‘‘ โ† Recover HOSVD Factors ((cid:8)ฮฉ(๐‘–,๐‘–) , {๐ต๐‘–}๐‘–โˆˆ[๐‘‘] , (๐‘Ÿ, . . . , ๐‘Ÿ)) # Compute core using Alg. 3.6 H โ† Compute HOSVD Core with Second Access({๐‘„๐‘–}๐‘–โˆˆ[๐‘‘] , X) output : ห†X = [[H , ๐‘„1, . . . ๐‘„๐‘‘]] ๐‘–โˆˆ[๐‘‘] (cid:9) (cid:9) , X) ๐‘–, ๐‘— โˆˆ[๐‘‘] from Leave-One-Out Measurements 113 CHAPTER 4 TENSOR SANDWICH: TENSOR COMPLETION FOR LOW CP-RANK TENSORS VIA ADAPTIVE RANDOM SAMPLING In the proceeding chapters we have described two schemes for the tensor recovery problem. We now will address a method for a different, though related, tensor problem. In tensor completion we consider the problem of how we might complete (or impute, or predict) a tensor from which some entries are missing or unavailable. That is, for a tensor X โˆˆ R๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ , we have some set, ฮฉ โŠ‚ [๐‘›1] ร— . . . . . . [๐‘›๐‘‘] of indices which correspond to entries which are sampled, or revealed. Entries with indices outside of the revealed set are then not available and some or all of these missing entries we desire to fill in, and do so in a way which minimizes error from the true tensor. Naturally, if the entries are unrelated to each other, the problem is hopeless, however in the case where the tensor has additional structure such as a low-rank decomposition, then we can use that fact to develop methods which are able to accurately compute (or estimate in the inexact case) given only a small sample of some of the entries. The commonalities with the recovery problem are immediate - the sampling operator can fruitfully be envisioned as linear measurements of the original tensor and from those measurements we wish to recover the tensor, or a good estimate for it. It is often the case that we wish to limit or minimize how many entries are revealed. However, we will quickly leave behind the requirement that this operator act in a modewise manner or with interleaved reshapings as was done in the previous chapters. 
One reason not to insist on an operator conforming to such a scheme is that we will want to discuss sampling schemes which are adaptive, in the sense that after observing some entries, we use computation and information gleaned from those entries to decide where to sample next in order to best complete the tensor. In this case the set Ω is constructed in stages. The degree to which a sampling scheme adapts, and how reasonable it is to assume different levels of control over the sampling pattern, are points we shall discuss.
4.1 Adaptive Tensor Sandwich for Completing Three Mode Tensors
The subject of this chapter, our proposed method for addressing the tensor completion problem, results from combining matrix completion techniques (see, e.g., [8, 40, 10]) applied to a small number of slices (see Definition 1.3) with a modified, noise-robust version of Jennrich's algorithm (see, e.g., [51, Section 3]); an algorithm originally used to prove (constructively) the uniqueness of the CP decomposition when the rank is less than or equal to the smallest side length of the tensor. In the simplest case, as we shall see, this leads to a sampling strategy that densely samples two slices (the bread), and then sparsely samples additional entries throughout the tensor (the filling) for the final completion, hence "Tensor Sandwich". To begin, we will consider three mode tensors with some mild assumptions on the factor matrices of the CP decomposition (see Definition 1.14). Our algorithm then completes an ๐‘› × ๐‘› × ๐‘› tensor with CP-rank ๐‘Ÿ with high probability while using at most O(๐‘›๐‘Ÿ log^2 ๐‘Ÿ) adaptively chosen samples. Empirical experiments further verify that the approach works well in practice, including in the presence of additive noise; see Section 4.3. We will then include details on how the method can be employed for tensors of more than three modes, and how sampling can be modified to be essentially non-adaptive for cases in which a practitioner may not have the level of control needed to conform to the adaptive strategy; see Section 4.5.
Before describing our proposed adaptive method, we make two important observations. First, there is a simple information theoretic lower bound on any completion method, adaptive or otherwise. For example, a ๐‘‘ mode CP-rank ๐‘Ÿ tensor with side lengths ๐‘›, absent any other constraint, has ๐‘Ÿ๐‘‘(๐‘› − 1) + ๐‘Ÿ degrees of freedom, and so no sampling scheme which reveals fewer entries than this can guarantee exact completion. Furthermore, any method which employs randomness should be expected to require even more samples. Although we do not have a proof that our method is strictly optimal in this regard, its sample budget compares favorably to other known methods and scales, up to log factors, with the relevant parameters of the problem.
Second, there is no a priori guarantee that a particular revealed entry of a tensor will be especially informative. For example, consider the rank one tensor T = e1 ◦ e1 ◦ · · · ◦ e1, the ๐‘‘-fold outer product of e1, where e1 is the standard basis vector with a one in position one. All but one of this tensor's ๐‘›^๐‘‘ entries are zero, which is to say that a sampling technique would have to reveal this single non-zero entry in order to complete the tensor; any random sampling pattern would need on the order of the total number of entries of the tensor revealed to succeed in such a case. Fortunately, such "worst" case tensors can be excluded with some natural and useful assumptions.
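To make this second observation concrete, the short sketch below (assuming a uniform-at-random sampling pattern, chosen purely for illustration) constructs the worst case rank one tensor just described and estimates how often a sample budget on the order of ๐‘› log ๐‘› entries happens to reveal its single informative entry.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
budget = int(4 * n * np.log(n))            # a generous O(n log n) sample budget

# T = e1 ∘ e1 ∘ e1 has a single non-zero entry, at index (0, 0, 0)
hits, trials = 0, 1000
for _ in range(trials):
    # draw `budget` entry locations uniformly at random (with replacement for simplicity)
    idx = rng.integers(0, n, size=(budget, d))
    hits += np.any(np.all(idx == 0, axis=1))
print(hits / trials)   # roughly budget / n**3, about 0.006: the informative entry is almost always missed
```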
4.1.1 Statement and Outline of Main Result To proceed with defining the class of tensors for which the method will have a high chance of completing, we will need two additional definitions, motivated by the proceeding discussion of the completion problem. Definition 4.1 (Coherence). The coherence of an ๐‘Ÿ-dimensional subspace ๐‘ˆ โŠ‚ R๐‘› is defined ๐œ‡(๐‘ˆ) := ๐‘› ๐‘Ÿ max ๐‘–โˆˆ[๐‘›] โˆฅP๐‘ˆ ๐’†๐‘– โˆฅ2 2 , where P๐‘ˆ is the orthogonal projection onto ๐‘ˆ, and where ๐’†๐‘– โˆˆ R๐‘› for ๐‘– โˆˆ [๐‘›] denotes the ๐‘–th standard basis vector. From this definition we can see a subspace will have low coherence or be incoherent if, in terms of magnitude and support, its entries are (roughly) equal and spread out among all possible indices. In such a case, it will happen that revealing an entry has a high likelihood of being informative - the concentration of mass will not be constrained to some relatively small support of the total index set, so the chance that a sampling scheme โ€œmissesโ€ informative entries is lower. We will also need the following definition concerning an elaboration on rank of a matrix. Definition 4.2 (Kruskal rank). The Kruskal rank of a matrix is the maximum integer ๐‘Ÿ such that any ๐‘Ÿ columns of the matrix are linearly independent. We observe immediately that Kruskal rank is always bounded by the rank of the matrix. As a simple yet illuminating example, consider the identity matrix in R๐‘›ร—๐‘›, which has rank and Kruskal rank both equal to ๐‘›. Now append a column to this which consists of e1. Clearly the rank of the resulting matrix remains ๐‘Ÿ, but the Kruskal rank is now one, since it is possible to find a subset of 116 two columns among the ๐‘› + 1 columns which are linearly dependent. The class of tensors for which the proposed approach will be guaranteed to succeed will need to satisfy the following assumptions. For positive integers ๐‘›, ๐‘Ÿ, ๐‘  with ๐‘Ÿ, ๐‘  โ‰ค ๐‘› and any ๐œ‡0 โˆˆ [1, ๐‘›/๐‘Ÿ], we define T(๐‘›, ๐‘Ÿ, ๐œ‡0, ๐‘ ) to be the class of all tensors T = (cid:205)๐‘Ÿ (a) the factor matrices ๐ด, ๐ต both have full column rank, ๐‘–=1 a๐‘– โ—ฆ b๐‘– โ—ฆ c๐‘– โˆˆ R๐‘›ร—๐‘›ร—๐‘› for which: (b) the column space of ๐ด has coherence bounded above by ๐œ‡0, and (c) for some ๐‘  โˆˆ [๐‘›], every ๐‘  ร— ๐‘Ÿ submatrix of ๐ถ has Kruskal rank โ‰ฅ 2. We are now able to state our main result of this section. Theorem 4.1. There exists an adaptive random sampling strategy and an associated reconstruction algorithm (see Algorithm 4.2) such that for any ๐›ฟ > 0, after observing at most ๐ถ1๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) entries of a tensor T โˆˆ T(๐‘›, ๐‘Ÿ, ๐œ‡0, ๐‘ ), the algorithm completes T with probability at least 1 โˆ’ ๐‘ ๐›ฟ. Here ๐ถ1 > 0 is absolute constant that is independent of all other quantities. The proof of our sandwich sampling algorithm involves three stages. First, without loss of generality, we pick ๐‘  mode-3 slices and then use a matrix completion algorithm that with high probability recovers these slices and observes at most ๐ถโ€ฒ 1 ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) entries in each slice. Second, if this ๐‘› ร— ๐‘› ร— ๐‘  sub-tensor is correctly completed, we can then use a deterministic variant of Jennrichโ€™s Algorithm1 on the completed sub-tensor to learn the factor matrices ๐ด and ๐ต. 
Third, once we know the factor matrices ๐ด and ๐ต, we can deterministically find ๐‘Ÿ sample locations in each of the ๐‘› mode-3 slices whose values allow a censored least squares problem to solve for the third factor matrix ๐ถ. This three-stage procedure uses at most ๐ถโ€ฒ 1 ๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) samples to complete the ๐‘  initial mode-3 slices in order to learn ๐ด and ๐ต, and then ๐‘›๐‘Ÿ additional samples to learn ๐ถ thereafter, for a total of at most ๐ถโ€ฒ 1 ๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) + ๐‘›๐‘Ÿ โ‰ค ๐ถ1๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) samples. See Figure 4.1 for a schematic illustration of the overall sampling strategy where fibers are sampled through the middle of the sandwich for simplicity. We note that assumptions (a) and (c) are the necessary assumptions for the Jennrichโ€™s step to work with any ๐‘› ร— ๐‘› ร— ๐‘  sub-tensor. Furthermore, if the columns of the factor matrices are 1As discussed in [36] the work of [44] is perhaps a more accurate attribution of the method, however we use the traditional name of Jennrichโ€™s Algorithm in this work. 117 drawn from any continuous distribution, assumption (c) for ๐‘  = 2 holds with probability 1. Finally, something akin to assumption (b) is always required in completion problems, as we have discussed in Section 4.1. Figure 4.1 A schematic depiction of the sampling strategy where ๐‘  = 2 slices have been sampled relatively densely in order to compute ๐ด and ๐ต, and where additional fibers where then sampled elsewhere to use in computing ๐ถ. 4.1.2 Related Work Many prior works on low-rank tensor completion use non-adaptive and uniform sampling [29, 4, 65, 66, 55, 52, 45, 34]. While some of those works [4, 34, 52] can handle CP-ranks up to roughly ๐‘›3/2 instead of ๐‘›, all of them require at least O (๐‘›3/2) samples, even when the rank is ๐‘Ÿ = O (1). Furthermore, [4] shows that completing an ๐‘› ร— ๐‘› ร— ๐‘› rank-๐‘Ÿ tensor from ๐‘›3/2โˆ’๐œ– uniformly 118 random samples is NP-hard by way of showing it is equivalent to the problem of refuting a 3-SAT formula with ๐‘› variables and ๐‘›3/2โˆ’๐œ– clauses. In [39], the authors propose a recursive method for completing CP-rank ๐‘Ÿ โ‰ค ๐‘› tensors using adaptive sampling which, for order-3 tensors, requires O (๐œ‡2 0 ๐‘›๐‘Ÿ 5/2 log2 ๐‘Ÿ) samples. Our algorithm requires a number of samples which has a more favorable dependence on the coherence ๐œ‡0 and rank ๐‘Ÿ. Furthermore, our result only requires coherence assumptions about ๐ด, instead of about ๐ด and ๐ต. These improvements come at the expense of requiring the mild additional assumptions (a) and (c) in our Theorem 4.1 related to Jennrichโ€™s algorithm, however. Additionally, the method of [39], outside the exact case does not output a CP factorization of the tensor, but rather a more complicated and recursive decomposition. The first step in our tensor completion algorithm involves using an adaptive sampling scheme to complete ๐‘  mode-3 slices of the tensor. Our tensor completion algorithm and results are based on the adaptive matrix completion algorithm and results in [40] which, with high probability, uses O (๐œ‡0๐‘›๐‘Ÿ log2 ๐‘Ÿ) samples to complete a rank ๐‘Ÿ matrix. However, our algorithm can be adapted to use other adaptive matrix completion results such as, e.g., [10] with relative ease. In the censored least squares phase of our algorithm, we can sample entire fibers of the tensor as done in Figure 4.1. 
We note that doing so is similar in spirit to the fiber sampling approach of Sรธrensen and De Lathauwer [58]. However, their work focuses on determining algebraic constraints on the factor matrices of a low rank tensor which, when satisfied, allow the tensor to be completed from the sampled fibers. As such, our results cannot be directly compared with [58]. 4.2 Proof of Tensor Sandwich Exact Recovery for Tensors in T(๐‘›, ๐‘Ÿ, ๐œ‡0, ๐‘ ) In this section we follow our three stage proof outline from Section 4.1.1 4.2.1 Completing ๐‘  Mode-3 Slices of T We start by picking any subset of indices ๐‘† โŠ‚ [๐‘›] with |๐‘†| = ๐‘  elements. For each ๐‘˜ โˆˆ ๐‘†, the mode-3 slice T:,:,๐‘˜ โˆˆ R๐‘›ร—๐‘› satisfies T:,:,๐‘˜ = ๐‘Ÿ โˆ‘๏ธ ๐‘–=1 โŸจc๐‘–, e๐‘˜ โŸฉa๐‘–b๐‘‡ ๐‘– , 119 and so, col-span(T:,:,๐‘˜ ) โІ col-span( ๐ด). By assumption, col-span( ๐ด) has coherence bounded by ๐œ‡0. Thus, the assumptions required by Theorem 1 in [40] hold. Therefore, for each slice ๐‘‡:,:,๐‘˜ ๐‘˜ โˆˆ ๐‘†, with probability at least 1 โˆ’ ๐›ฟ, the adaptive sampling procedure in Algorithm 1 uses at most ๐ถโ€ฒ 1 ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) samples for some absolute constant ๐ถโ€ฒ 1 > 0, and completes T:,:,๐‘˜ . By taking a simple union bound over each of the ๐‘  slices ๐‘˜ โˆˆ ๐‘†, we have that with probability at least 1โˆ’๐‘ ๐›ฟ, this strategy will successfully complete all of the ๐‘  slices T:,:,๐‘˜ for ๐‘˜ โˆˆ ๐‘† and use fewer than ๐ถโ€ฒ 1 ๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) samples. Let ฮฉ1 = {(๐‘–, ๐‘—, ๐‘˜)|(๐‘–, ๐‘—) sampled according to [40] for slice ๐‘˜ โˆˆ ๐‘†}, the set of locations of T sampled to complete these ๐‘  slices. 4.2.2 Learning Mode-1 and 2 Factor Matrices via Modified Jennrichโ€™s Algorithm / Simul- taneous Diagonlization Let u, v be random vectors uniformly drawn from the unit sphere S๐‘ โˆ’1. Denote the sub-vector of c with entries indexed in ๐‘† by หœc๐‘– = (c๐‘–)๐‘†, and construct two auxiliary matrices ๐‘‡๐‘ข, ๐‘‡๐‘ฃ using the completed slices by adding up the linear combinations of the completed slices, weighted by the random vectors u, v. In terms of the components, we have: ๐‘‡๐‘ข = ๐‘‡๐‘ฃ = ๐‘Ÿ โˆ‘๏ธ ๐‘–=1 ๐‘Ÿ โˆ‘๏ธ ๐‘–=1 โŸจหœc๐‘–, uโŸฉa๐‘–b๐‘‡ ๐‘– โŸจหœc๐‘–, vโŸฉa๐‘–b๐‘‡ ๐‘– Denote the ๐‘Ÿ ร— ๐‘Ÿ matrix ๐ท๐‘ข which has along its diagonal entries the values โŸจหœc๐‘–, uโŸฉ, and similarly ๐ท๐‘ฃ. Notice the following identity with the product ๐‘‡๐‘ข (๐‘‡๐‘ฃ)โ€  ๐‘‡๐‘ข (๐‘‡๐‘ฃ)โ€  = ๐ด๐ท๐‘ข ๐ต๐‘‡ (cid:16) ๐ด๐ท๐‘ฃ ๐ต๐‘‡ (cid:17) โ€  = ๐ด๐ท๐‘ข ๐ทโˆ’1 ๐‘ฃ ๐ดโ€  Where the matrix ๐ท๐‘ข ๐ทโˆ’1 ๐‘ฃ โŸจหœc๐‘–,vโŸฉ . Clearly the matrix ๐‘‡๐‘ข (๐‘‡๐‘ฃ)โ€  is diagonalizable, and so by computing the eigen-decomposition of ๐‘‡๐‘ข (๐‘‡๐‘ฃ)โ€  we recover the is a diagonal matrix with (๐‘–, ๐‘–)-entry equal to โŸจหœc๐‘–,uโŸฉ columns of ๐ด and the eigenvlaues along the diagonal of ๐ท๐‘ข ๐ทโˆ’1 ๐‘ฃ . We can order the eigenvectors in descending order by magnitude of their corresponding eigenvalue. Note it is here that we employ 120 assumption (c): Since ๐ถ๐‘† (i.e., the rows of ๐ถ indexed by ๐‘†) has ๐‘˜-rank at least 2, the ratios โŸจหœc๐‘–,uโŸฉ โŸจหœc๐‘–,vโŸฉ for each ๐‘– will almost surely be distinct from one another, and thus the ordering of the eigenvalues is unique. 
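The simultaneous diagonalization step just described is short enough to state in code. The sketch below is a minimal NumPy illustration assuming exactly completed, noiseless slices and a factor matrix ๐ด with full column rank; the noise robust variant used in Algorithm 4.2 requires more care. The last line of the function anticipates the matching of the columns of ๐ต via ๐ด†๐‘‡๐‘ข described in the next paragraph (cf. (4.1)).

```python
import numpy as np

def jennrich_from_slices(slices, r, rng):
    """Given s completed frontal slices (each n x n) of a CP-rank-r tensor,
    recover A and a column-rescaled, matched B via simultaneous diagonalization.
    Noiseless slices and distinct eigenvalue ratios are assumed."""
    s = len(slices)
    u = rng.standard_normal(s); u /= np.linalg.norm(u)
    v = rng.standard_normal(s); v /= np.linalg.norm(v)
    Tu = sum(ui * Sk for ui, Sk in zip(u, slices))
    Tv = sum(vi * Sk for vi, Sk in zip(v, slices))
    M = Tu @ np.linalg.pinv(Tv)                 # equals A D_u D_v^{-1} A^dagger here
    vals, vecs = np.linalg.eig(M)
    keep = np.argsort(-np.abs(vals))[:r]        # keep the r dominant eigenpairs
    A_hat = np.real(vecs[:, keep])
    B_hat = (np.linalg.pinv(A_hat) @ Tu).T      # matched, rescaled columns of B
    return A_hat, B_hat

# small synthetic check: n = 30, r = 4, s = 2 slices of a random CP tensor
rng = np.random.default_rng(1)
n, r, s = 30, 4, 2
A, B, C = (rng.standard_normal((n, r)) for _ in range(3))
slices = [sum(C[k, i] * np.outer(A[:, i], B[:, i]) for i in range(r)) for k in range(s)]
A_hat, B_hat = jennrich_from_slices(slices, r, rng)
# each recovered column of A_hat should be parallel to some column of A
cos = np.abs((A / np.linalg.norm(A, axis=0)).T @ (A_hat / np.linalg.norm(A_hat, axis=0)))
print(np.round(cos.max(axis=1), 3))             # approximately [1. 1. 1. 1.]
```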
Let ๐‘ƒ be the permutation matrix that interchanges the columns of ๐ด so that instead of being in ๐ท๐‘ข ๐ทโˆ’1 ๐‘ฃ order, they are in ๐ท๐‘ข order (ordered greatest to least in terms of the magnitude of โŸจหœc๐‘–, uโŸฉ). Now notice that ๐ดโ€ ๐‘‡๐‘ข = ๐ดโ€  ๐ด๐‘ƒ๐ท๐‘ข ๐ต๐‘‡ = ๐‘ƒ๐ท๐‘ข ๐ต๐‘‡ = (๐ต๐ท๐‘ข๐‘ƒ)๐‘‡ . (4.1) This means that the re-scaled columns of ๐ต that are in ๐ท๐‘ข order are now in ๐ท๐‘ข ๐ทโˆ’1 ๐‘ฃ order after we apply the inverse ๐ดโ€  to ๐‘‡๐‘ข. That is, we have found the matching components of ๐ด and ๐ต up to a re-scaling of their outer product! This was achieved by learning the columns of ๐ด from ๐‘‡๐‘ข (๐‘‡๐‘ฃ)โ€  and then, crucially, the matching columns of ๐ต from ๐ดโ€ ๐‘‡๐‘ข. The scaling due to ๐ท๐‘ข will be resolved in the final step of the algorithm, where we solve for the missing third factor (i.e., ๐ต, ๐ถ and ๐ต๐ท, ๐ถ (๐ทโˆ’1) are both valid pairs of factors for any diagonal matrix ๐ท). 4.2.3 Learning the mode-3 factor matrix After obtaining ๐ด and a re-scaled ๐ต in order to find the remaining components, ๐ถ, we will need ฮฉ2, a second set of revealed locations of the entries of T . We will also need the solutions to ๐‘› instances of the following censored least squares problem related to those revealed values ( ๐ด โŠ™ ๐ต)๐พ๐‘˜ c = t๐พ๐‘˜ (4.2) where ๐‘˜ โˆˆ [๐‘›],c = (๐ถ๐‘˜,:)๐‘‡ , t = vec(T:,:,๐‘˜ )๐‘‡ and ๐พ๐‘˜ = {๐‘– + ๐‘›( ๐‘— โˆ’ 1)|(๐‘–, ๐‘—, ๐‘˜) โˆˆ ฮฉ2}. In (4.2), the term t๐พ๐‘˜ denotes the vector of length |๐พ๐‘˜ | which includes only the entries of t which have indices appearing in the set ๐พ๐‘˜ . Similarly ( ๐ด โŠ™ ๐ต)๐พ๐‘˜ is the matrix where we restrict rows of ( ๐ด โŠ™ ๐ต) to only those which have indices appearing in the set. Note ( ๐ด โŠ™ ๐ต) is full rank because ๐ด and ๐ต are assumed to be full rank (see, e.g., [31]). This implies the uncensored system is consistent with a unique solution. The difficulty in the censored 121 case is that unobserved values in a particular column of the right hand side of (4.2) could force the discarding of rows of the matrix ( ๐ด โŠ™ ๐ต) that cause the system to become under-determined. That is, we must ensure that ( ๐ด โŠ™ ๐ต)๐พ๐‘˜ has full column rank for each ๐‘˜. If we do not vary our sampling procedure from frontal slice to frontal slice, then this corresponds to sampling tubes of the original tensor of the form T๐‘–, ๐‘—,: and ๐พ๐‘˜ = ๐พ for all ๐‘˜ โˆˆ [๐‘›]. sampling to accomplish this, consider a ๐‘„๐‘… with column-pivoting factorization of ( ๐ด โŠ™ ๐ต)๐‘‡ , see In order to arrange the algorithm 5.4.1 in [19]). This produces factors and a permutation matrix C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ โˆˆ R๐‘›2ร—๐‘›2 such that ( ๐ด โŠ™ ๐ต)๐‘‡ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ = ๐‘„๐‘…, where (cid:104) ๐‘… = (cid:105) ๐‘…1 ๐‘…2 and ๐‘…1 โˆˆ R๐‘Ÿร—๐‘Ÿ is upper-triangular and non-singular. But this means that the first ๐‘Ÿ columns of C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ select a set of columns of ( ๐ด โŠ™ ๐ต)๐‘‡ which are linearly independent, so for each C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ ๐‘ โˆˆ [๐‘Ÿ] we have that there exists some ๐‘ž โˆˆ [๐‘›2] such that e๐‘ž = C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ column of the identity in R๐‘›2ร—๐‘›2 ๐‘ž ๐‘› โŒ‹. Define then ฮฉ2 all the tuples of the form (๐‘–, ๐‘—, :). That is, we can read off from C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ which ๐‘Ÿ fibers of length ๐‘› to sample in . 
Let ๐‘– = ๐‘ž mod ๐‘›, ๐‘— = โŒŠ , the corresponding :,๐‘ :,๐‘ , order to ensure (4.2) is consistent for each ๐‘˜ โˆˆ [๐‘›]. We have specified at most ๐‘›๐‘Ÿ new sample locations, and thus ฮฉ = ฮฉ1 โˆช ฮฉ2, the set of all samples employed to complete the tensor T is at most |ฮฉ1| + |ฮฉ2| โ‰ค ๐ถโ€ฒ 1 Remark 4.1. Note, C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ is not unique, and indeed we could use different selections corre- ๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) + ๐‘›๐‘Ÿ โ‰ค ๐ถ1๐‘ ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ). sponding to other valid permutations from frontal slice to frontal slice to vary the sampling pattern when finding ๐ถ. Additionally, including more rows of ( ๐ด โŠ™ ๐ต) beyond the first ๐‘Ÿ as specified by C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ may be numerically advantageous when computing ๐ถ. 4.3 Experiments In this section, we show that tensors sampled according to our overall strategy and completed using Algorithm (4.2) support our theoretical findings. In particular we demonstrate that once sample complexity bounds are satisfied, we can achieve very precise levels of relative error by sampling what is overall a small percentage of the total tensorโ€™s entries. For example, in the rank 20 122 : Algorithm 4.1 Consistency Preserving Fiber Sampler input ๐›พ โ‰ฅ 1, over-sample parameter ๐‘Ÿ rank parameter ๐ด, ๐ต โˆˆ R๐‘›ร—๐‘Ÿ, estimates for factor matrices output : ฮฉ2, set of sample locations Compute ๐‘„๐‘… with column-pivoting, ๐‘„๐‘… = ( ๐ด โŠ™ ๐ต)๐‘‡ C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ for ๐‘ โˆˆ [๐›พ๐‘Ÿ] do ) :,๐‘ ๐‘ž โ† nonzero(C๐‘›1ร—ยทยทยทร—๐‘›๐‘‘ ๐‘– โ† ๐‘ž mod ๐‘› ๐‘ž ๐‘— โ† โŒŠ ๐‘› โŒ‹ Include (๐‘–, ๐‘—, ๐‘˜) in ฮฉ2 for all ๐‘˜ โˆˆ [๐‘›] end : ๐‘† โІ [๐‘›] slices to complete Algorithm 4.2 Tensor Sandwich input output : ห†๐‘‡, completed tensor # Slice Complete Phase for ๐‘˜ โˆˆ ๐‘† do end # Jennrich Complete Phase Generate random vectors u, v โˆˆ S๐‘ โˆ’1; ๐‘‡๐‘ข โ† (cid:205)๐‘˜โˆˆ๐‘† ๐‘ข๐‘˜๐‘‡::๐‘˜ ๐‘‡๐‘ฃ โ† (cid:205)๐‘˜โˆˆ๐‘† ๐‘ฃ ๐‘˜๐‘‡::๐‘˜ Compute eigen-decomposition ๐ด๐šฒ๐ดโˆ’1 = ๐‘‡๐‘ข (๐‘‡๐‘ข)โ€  ๐ต โ† ๐ดโˆ’1๐‘‡๐‘ข # Censored Least Squares Phase ฮฉ2 โ† fiber sampler( ๐ด, ๐ต, ๐‘Ÿ, ๐›พ) ๐ถ๐‘‡ โ† censored least squares (( ๐ด โŠ™ ๐ต), T๐‘–, ๐‘—,๐‘˜ s.t. (๐‘–, ๐‘—, ๐‘˜) โˆˆ ฮฉ2) ห†T โ† [[ ๐ด, ๐ต, ๐ถ]] Use, e.g., [10], or algorithm 1 in [40] to complete T:,:,๐‘˜ . Algorithm 1 in [40] uses at most ๐ถโ€ฒ 1 ๐œ‡0๐‘›๐‘Ÿ log2(๐‘Ÿ 2/๐›ฟ) adaptively chosen samples. case in Figure 4.2, the median relative error is 0.000168 having sampled 94445 8000000 โ‰ˆ 1.1% of the total entries. Our experiments also verify the linear dependence on rank in the sample complexity bound. Furthermore, in our simulated data and real world data we show empirical evidence the method performs useful completion even in the presence of noise. Additionally we make comparisons to other tensor completion methods. 123 4.3.1 Simulated Data In this section, we show that Tensor Sandwich, as well as its extension to higher orders (see Section 4.5), which we refer to as Tensor Deli in the figures, can complete tensors of simulated data using both adaptive and non-adaptive sampling. Once sample complexity bounds are satisfied, we can achieve precise levels of relative error for completing a tensor by looking at what is overall a small fraction of its total entries. We also demonstrate empirically our method works for four mode tensors. 
In prior works such as [39], results of this kind are given only for two mode tensors; in [45] tensors of three modes. Furthermore we show how using Tensor Sandwich/Deli to initialize a masked-alternating least squares scheme with even a few iterations provides significant improvement in accuracy. We compare our adaptive results to [39] and [45], and show that Tensor Deli can perform well in terms of accuracy and sample complexity while also providing a factorized tensor. In all cases with the simulated data, the tensor is generated by drawing factor matrices of shape R๐‘›ร—๐‘Ÿ with i.i.d standard Gaussian entries and then normalizing the columns. For simulated data, for each mode ๐‘‘, the side-lengths ๐‘› are the same, and thus the tensor has ๐‘›๐‘‘ total entries. We then have the option to weight the components using decaying weights described by the parameter ๐›ผ, i.e. T = ๐‘Ÿ โˆ‘๏ธ โ„“=1 (cid:19) (cid:18) 1 โ„“๐›ผ โ„“ โ—ฆ ยท ยท ยท โ—ฆ ๐’‚ (๐‘‘) ๐’‚ (1) โ„“ (4.3) In each experiment, the median of the errors is taken over ten independent trials. In each Tensor Deli experiment, after selecting frontal slices to complete, within these slices we sample using a budget of ๐‘š = ๐›พ(๐‘›2) samples (adaptively or non-adaptively), where ๐›พ โˆˆ [0, 1] is the proportion of the total ๐‘›2 entries of the slice available for sampling in the matrix completion step. Slices are then completed using the semi-definite programming formulation of the nuclear norm minimization and solved via Douglas-Rachford splitting, see [54]. In this step, convergence is declared once the primal residual, dual residual and duality gap are below 10โˆ’8 for each of the ๐‘  selected slices, or after 10,000 iterations per slice, whichever comes first. We note that the accuracy of this matrix completion step influences numerically what is achievable in terms of overall accuracy for the 124 completed tensor, and that the error for completing these slices is compounded in the subsequent steps (even in the absence of noise). This explains the apparent โ€œleveling offโ€ of the relative error even as sample complexity or signal-to-noise ratios increase. The number of fibers to sample, and which are used to estimate the remaining ๐‘‘ โˆ’ 2 factor matrices is given by ๐›ฟ๐‘Ÿ, where ๐›ฟ is an oversampling factor of size at least one. In Figure 4.2, we see a phase transition of our method. For a given rank, prior to a threshold, the error is dominated by the inability to accurately complete the ๐‘  slices. Once sufficient samples are obtained within these slices, completion reliably succeeds and accuracy approaches the limiting numerical accuracy inherited from the initial slice completion step. We empirically see in Figure 4.2 the linear relationship of the sample complexity on the rank of the tensor - it takes approximately twice as many samples to reach this phase transition in terms of relative error when we compare the rank 5 case with the rank 10 case. In this figure, ๐‘‘ = 3, ๐‘› = 200, ๐›ผ = 2, ๐‘  = 2, ๐›ฟ = 4, the parameter ๐›พ is varied within in the interval (0, 1) and is driving the overall sample complexity. Figure 4.2 Median relative error (log-scaled) of completed tensors of varying rank as sample complexity increases without noise. 125 In Figure 4.3 we repeat a similar experiment, but for ๐‘‘ = 4, ๐‘› = 100, ๐›ผ = 2 and for three different ranks ๐‘Ÿ = 5, 10, 15. These tensors are 12.5 times larger in terms of total number of entries than in the three mode case studied in Figure 4.2. 
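For reference, the synthetic tensors used in these experiments can be generated in a few lines. The sketch below follows (4.3), with unit-norm Gaussian factor columns and decay parameter α; the function name and interface are ours and are not taken from the experimental code.

```python
import numpy as np

def synthetic_cp_tensor(n, d, r, alpha, rng):
    """Random CP tensor with unit-norm Gaussian factor columns and 1/l**alpha component weights, per (4.3)."""
    factors = []
    for _ in range(d):
        F = rng.standard_normal((n, r))
        factors.append(F / np.linalg.norm(F, axis=0))   # normalize the columns
    weights = 1.0 / (np.arange(1, r + 1) ** alpha)
    T = np.zeros((n,) * d)
    for l in range(r):
        comp = weights[l]
        for F in factors:                                # weighted rank-one outer product
            comp = np.multiply.outer(comp, F[:, l])
        T += comp
    return T, factors, weights

T, factors, w = synthetic_cp_tensor(n=50, d=3, r=10, alpha=2, rng=np.random.default_rng(0))
print(T.shape, w[:3])    # (50, 50, 50) [1.  0.25  0.111...]
```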
Our method is able to reach relative errors at or lower than 1% using less than a tenth of one percent of the total entries. In plot (a) of Figure 4.3, we show the median relative error for Tensor Deli and the relative error after ten iteration of masked-ALS is performed on the result, i.e. the second set of experiments depend on the first in that ALS is initialized using the Tensor Deli estimates of the CP factorization, and using the same revealed entries, see [61]. As is well documented now, ALS can be quite sensitive to initialization and prone to โ€œswampsโ€ where accuracy stagnates over a large number of iterations. This figure shows however that Tensor Deli can provide high quality initialization, and with only a small number of iterations we can improve the accuracy of the estimate of the completed tensor by an order of magnitude without even needing to reveal more entries. Empirically we have observed ALS alone is unable to achieve comparable relative errors at this sample complexity for any observed amount of iterations, see Figure 4.5. In plot (b) of Figure 4.3 we show the errors for adaptive vs non-adaptive strategy for each rank. In the non-adaptive case, the ๐‘ -slices are sampled more heavily, but in a uniform fashion. Furthermore, no ๐‘„๐‘…-factorization is performed on the first two estimated factor matrices, and instead entries are sampled uniformly throughout the rest of the tensor for the purposes of estimating the remaining factors. Not surprisingly, in the case of random low-rank tensors, the density of the sampling in the slices is driving the accuracy, and adapting the sampling pattern appears to have little effect - the slices themselves are already highly incoherent, so there is little to gain by adapting. As we discuss in the next section, on real world data with more structure, adaptive patterns do have a more measurable effect. In Figure 4.4, we add mean-zero i.i.d. Gaussian noise to each entry in our tensor and perform the same completion strategy on the noisy tensor as described earlier in this section, ๐‘‘ = 3, ๐‘› = 200, ๐›ผ = 2, ๐‘  = 2, ๐›ฟ = 4. For each trial, the noise tensor N is scaled to the appropriate signal-to-noise ratio, i.e. SNR = 10 log10 โˆฅT โˆฅ โˆฅN โˆฅ . The sample complexity proportions are fixed with respect to slice 126 Figure 4.3 Median relative error (log-scaled) of completed four mode tensors of varying rank as sample complexity increases without noise. Each value is the median of ten trials, ๐‘  = 2, ๐›พ โˆˆ [0.1, 0.8], ๐›ฟ = 8. In plot (a) we compare the relative errors of Tensor Deli before and after ten iterations of masked-ALS. In plot (b) we show error for both adaptive and non-adaptive sampling schemes. completion versus fibers sampled to estimate ๐ถ but do scale linearly according to rank in order to facilitate comparison. In all cases the total number of sampled entries is between 0.27% and 1.10% of the total number of entries, depending on rank. As alluded to earlier, we observed during these experiments that using masked alternating least squares alone to complete our synthetic tensors (using either the Tensor Sandwich sample pattern, or an equivalent proportion of uniformly drawn samples) achieved at best relative errors of about 6%. However, the estimate of the tensor as output by Algorithm 4.2, and its sample pattern can be used as an initialization to an iterative scheme like alternating least squares to further improve the accuracy of the completed tensor. 
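A minimal sketch of one sweep of the masked ALS refinement used here is given below, assuming a three mode CP model: each row of each factor is re-fit by least squares over the observed entries that touch it. In the experiments, the factors are initialized from the Tensor Sandwich estimate and the mask is the indicator of the already revealed set Ω, so the refinement requires no additional samples; this is a sketch of the idea rather than the exact routine used.

```python
import numpy as np

def masked_als_sweep(T, mask, factors):
    """One sweep of masked ALS for a 3-mode CP model: each factor row is re-fit
    by least squares using only the observed entries in the corresponding slice."""
    obs = np.argwhere(mask)                                    # observed index triples
    for mode in range(3):
        F = factors[mode]
        others = [factors[m] for m in range(3) if m != mode]
        for i in range(F.shape[0]):
            sel = obs[obs[:, mode] == i]                       # observed entries in slice i of this mode
            if len(sel) == 0:
                continue
            rest = sel[:, [m for m in range(3) if m != mode]]  # indices into the other two modes
            design = others[0][rest[:, 0]] * others[1][rest[:, 1]]   # rows of the Khatri-Rao design
            rhs = T[tuple(sel.T)]
            F[i], *_ = np.linalg.lstsq(design, rhs, rcond=None)
    return factors

# usage sketch: factors = [A_hat, B_hat, C_hat] from Algorithm 4.2, mask = indicator of Omega,
# then repeat masked_als_sweep(T_observed, mask, factors) for a small number of sweeps
```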
In Figure 4.5, we show three sets of related experiments: the relative error at various ranks achieved by alternating least squares alone (ALS), Tensor Sandwich alone (TS), or alternating least squares initialized by Tensor Sandwich (TS + ALS). In each trial, 100 iterations of alternating least squares are used, weights for components are set to decay according to (4.3), and each of the three variations for completing the tensor is used on the same data. We notice that at this level of sample complexity, for either a sample mask chosen uniformly at random or one chosen according to the adaptive scheme described in this work, ALS alone is limited. Combined with Tensor Sandwich, however, we see roughly an order of magnitude improvement in relative error by performing a hundred iterations of ALS on the TS estimate. This shows Tensor Sandwich can be useful as an initialization strategy for other completion methods, either to save on run time by decreasing the total number of iterations needed, or to improve the accuracy of the final estimate. Here $d = 3$, $n = 200$, $\alpha = 2$, $s = 2$, $\delta = 4$. In all cases the total number of sampled entries is between 0.27% and 1.10% of the total number of entries, with linear scaling in rank to facilitate comparison.

Figure 4.4 Median relative error (log-scaled) of completed tensors of varying rank for different levels of white Gaussian noise. Sample complexity is less than 1.1% for all trials, scaled linearly by rank.

In the next pair of experiments, we compare one of the existing (non-adaptive) completion methods discussed in Subsection 4.1.2 to Tensor Sandwich. We use the same overall sample budget as for Tensor Sandwich, but instead use the algorithm in [51], which we refer to as Tensor Complete (TC). The range of total values sampled is the same as in Figure 4.2. Implementation details make direct comparisons difficult. However, the empirical findings summarized in Figure 4.6 suggest that Tensor Sandwich can produce better estimates for high-rank tensors than Tensor Complete when sample budgets are limited.

Figure 4.5 Median relative error (log-scaled) of completed tensors of varying rank for different levels of white Gaussian noise. Sample complexity is less than 1.1% for all trials, scaled linearly by rank. Tensors are completed using 100 iterations of alternating least squares (ALS), or Tensor Sandwich (TS), or 100 iterations of ALS initialized by Tensor Sandwich (TS+ALS).
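For reference, the noise scaling used for the experiments in Figures 4.4 and 4.5 and the relative error reported throughout can be sketched as follows. This is a minimal NumPy sketch using the SNR convention stated above; the helper names are ours, and this is not the experiment code itself.

```python
import numpy as np

def add_noise_at_snr(T, snr_db, seed=None):
    """Scale an i.i.d. Gaussian noise tensor N so that 10*log10(||T|| / ||N||)
    equals the requested SNR (the convention stated above), and return T + N."""
    rng = np.random.default_rng(seed)
    N = rng.standard_normal(T.shape)
    target_ratio = 10.0 ** (snr_db / 10.0)          # desired ||T|| / ||N||
    N *= np.linalg.norm(T) / (target_ratio * np.linalg.norm(N))
    return T + N

def relative_error(T_hat, T):
    """Relative Frobenius error of a completed tensor against the ground truth."""
    return np.linalg.norm(T_hat - T) / np.linalg.norm(T)
```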
In Figure 4.7 we show a comparison of one of the few adaptive methods we identified in the area, proposed in [39] (denoted KS in the figure), and adaptive Tensor Sandwich for $d = 3$, $n = 40$, $\alpha = 1$ with ten iterations of post-ALS and for three different ranks $r = 4, 6, 8$. In plot (a) the horizontal axis shows, for each method, the parameter that drives its overall sample complexity. For KS, the sample complexity is largely driven by the proportion of entries the method is able to sample in a given fiber to test whether that fiber should be added to the basis. For Tensor Deli, this is $\gamma$, the proportion of entries in a single face the method is able to sample when completing that slice. In the figure we note that Tensor Deli exhibits a phase transition for each rank: beyond a certain threshold value of $\gamma$ the method achieves much better relative error. That this threshold changes in an approximately linear fashion with respect to rank supports our theoretical conclusions. Plot (b) in Figure 4.7 changes the horizontal axis to the total percentage of the tensor sampled, and we can now plainly observe that, with an overall sample budget smaller by a factor of at least four, Tensor Deli is able to complete the data more accurately than KS. It is also the case that Tensor Deli outputs an estimate that is truly a CP tensor of the chosen rank truncation, whereas KS in general will not output a tensor that has a CP decomposition of the selected rank, but instead a more complicated recursive decomposition. CP fitting can be conducted on the KS estimate, for example using iterations of ALS, but this in general will reduce the accuracy of the estimate.

Figure 4.6 Median relative error (log-scaled) of completed tensors of varying rank for different total numbers of entries revealed. Tensors are completed using Tensor Sandwich (TS), or Tensor Complete as described in [51], without noise.

Figure 4.7 Median relative error (log-scaled) of completed three-mode tensors of varying rank without noise for our proposed Tensor Deli with adaptive sampling and the adaptive algorithm proposed in [39]. Each value is the median of ten trials; for Tensor Deli $s = 2$, $\gamma \in [0.2, 0.8]$, $\delta = 8$, and for KS the proportion of entries that can be sampled for a fiber ranges over $[0.2, 0.8]$. In plot (a) we show the relative error against the face or fiber sampling parameter for each of the methods.

4.3.2 Applications

We also apply Tensor Sandwich to datasets in the following application areas: chemometrics and hyperspectral imaging. The chemometrics dataset is as used in [7]. It consists of fluorescence measured from known analytes, and the data is intended for calibration purposes. There are a total of 405 fluorophores of six different types. For each sample an Excitation Emission Matrix (EEM) was measured, i.e., fluorescence intensity was collected for different set levels of excitation wavelength and emission wavelength. The EEM of an unknown sample can be used for its identification and for studying other properties of interest for a given analyte. In this dataset there are 136 emission wavelengths and 19 excitation wavelengths. In the original tensor there are five replicates per sample; however, in our experiment we discarded all but the first set of replicates and so formed a three-mode tensor of size $405 \times 136 \times 19$. We compare the completed tensor for five different choices of rank, $r = 11, 13, 15, 17, 19$. We set $s = 4$, and thus complete four frontal slices. We fix the frontal slices where the third mode is at indices 5, 9, 13, 17 across all experiments in order to facilitate better comparison. In Figure 4.8, we show the recovered tensor for a representative frontal slice at each of the different ranks. Below this in the same figure, we have fixed a representative lateral slice, which corresponds to a completed EEM for a particular sample of an analyte at the different ranks. In all cases, the total number of entries revealed is between 10 and 11% of the tensor's entries; we performed adaptive sampling and ten iterations of ALS on the resulting CP estimate, with sampling parameters $\gamma$ and $\delta$ equal to 0.5 and 10 respectively. Furthermore, in Figure 4.9 we show evidence that, in practical applications unlike the simulated data, there is quite clearly a benefit to sampling adaptively in terms of the trade-off between accuracy and sample complexity.
In this figure, for a fixed rank of 15 and a fixed sample complexity using four slices and sampling parameters of 0.5 and 10 for $\gamma$ and $\delta$, using the same lateral slice as before in Figure 4.8, we show the completed EEM and the corresponding relative error for this slice using adaptive Tensor Deli (Algorithm 4.3), adaptive Tensor Deli with ten iterations of ALS, and non-adaptive Tensor Deli. In this case, the additional ALS iterations show only modest improvement to the overall estimates, whereas the adaptive scheme has a much larger effect on accuracy. This is likely due to the fact that the data's coherence and other parameters related to the problem are far from ideal, especially when compared to the simulated case from earlier.

Figure 4.8 Slices of a completed tensor for fluorescence data. The top row shows the original frontal slice $\mathcal{X}_{:,:,9}$ and the same slice completed at ranks 11, 13, 15, 17, 19 with the corresponding relative error. The bottom row shows a lateral slice $\mathcal{X}_{149,:,:}$, which corresponds to the EEM for analyte number 149 in the dataset.

Figure 4.9 The lateral slice corresponding to the EEM for analyte number 149 in the fluorescence data is compared with the corresponding slice completed by adaptive Tensor Deli, adaptive Tensor Deli with extra iterations of ALS, and non-adaptive Tensor Deli. The relative error for the slice is listed for each method. The rank is fixed at 15 for each, and the sample complexity is fixed at about 10% of the entries using sampling parameters of 0.5 and 10 for $\gamma$ and $\delta$ respectively.

The next application is hyperspectral imaging data. The data is as found in [6]. The hyperspectral sensor data was acquired in June 1992 and consists of aerial images of an approximately two mile by two mile Purdue University Agronomy farm and was originally intended for soils research. The data consists of 200 images at different wavelengths that are 145 by 145 pixels each. We thus form a three-mode, $145 \times 145 \times 200$ data tensor. We complete the tensor using $s = 9$, where we have fixed these frontal slices at indices 20, 40, 60, 80, 100, 120, 140, 160, 180 across all experiments to facilitate comparison. As shown in Figure 4.10, we have completed the tensor for ranks $r = 20, 30, 40, 50, 60, 80$ and displayed a fixed, representative frontal slice at index 48. The first row consists of data completed using Tensor Deli with adaptive sampling, with parameters $\gamma = 0.7$ and $\delta = 10$, for a total sample budget that is about 4.5% of the total entries. The second row then performs ten iterations of masked-ALS using the initialization of the top row and the same revealed entries. The last row consists of ten iterations of masked-ALS initialized using the SVD of the flattenings, e.g., see [61].

Empirically, we observe that the hyperspectral image dataset is imperfectly approximated by a low-rank CP decomposition to begin with. In Figure 4.10, we see that Tensor Deli alone performs, in terms of global relative error, certainly no better than masked-ALS alone on this data. However, it provides a superior initialization to ALS, as we can see in the second row. Moreover, we observe a qualitative difference in the completed slices and their errors. In the ALS-alone case, the error appears to be achieved by a sort of local averaging of pixel intensities, whereas Tensor Deli captures contrasts and local features more distinctly, and its errors are largest on rows and columns it "misses" or imperfectly completes. This can be seen by looking at a heat map of the error slice by slice. Looking at the middle row, we can see there is a distinct advantage to combining these two types of methods to achieve the best completion for this dataset.
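For the SVD-initialized baseline in the last row of Figure 4.10, one common reading of initializing masked-ALS from the SVD of the flattenings (cf. [61]) is to take the leading left singular vectors of each mode unfolding of the zero-filled observed tensor. The sketch below is a hypothetical variant of that idea (NumPy, with a helper name of our own choosing), not necessarily the exact initialization used here.

```python
import numpy as np

def svd_init_cp_factors(T_observed, r):
    """Initialize CP factor matrices from the mode unfoldings of an (already
    zero-filled) observed tensor: factor k is taken to be the top-r left
    singular vectors of the mode-k flattening."""
    factors = []
    for mode in range(T_observed.ndim):
        unfolding = np.moveaxis(T_observed, mode, 0).reshape(T_observed.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :r])
    return factors
```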
4.4 Discussion

We have presented an adaptive sampling approach for low CP-rank tensor completion which completes a CP-rank $r$ tensor of size $n \times n \times n$ using $\mathcal{O}(nr \log^2 r)$ samples with high probability. Our method significantly improves on the tensor completion result in [39] while only making a mild additional assumption on the third factor matrix. We also provided numerical experiments to demonstrate that a version of our tensor completion method is empirically robust to noise, works as a high quality initialization to ALS, and can run efficiently on tensors of more than three modes, in the regime of hundreds of millions of entries.

Figure 4.10 Slices of a completed tensor for hyperspectral data. The first column shows the original frontal slice $\mathcal{X}_{:,:,48}$ and the same slice completed at ranks 20, 30, 40, 50, 60, 80 with the corresponding relative error below, for three different methods. The top row is adaptive Tensor Deli with sampling parameters $\gamma = 0.7$, $\delta = 10$; the middle row shows the slice after ten iterations of ALS using the Deli initialization; and the last row shows SVD-initialized masked-ALS.

A non-adaptive version of our tensor completion algorithm is also possible, as suggested in Section 4.5. Furthermore, it is possible to extend our tensor completion method to higher-order tensors, as we outline there. We show the result for the case where the factor matrices have no zero entries, as this illuminates very clearly which system of equations must be consistent to ensure exact recovery of the factor matrix; however, it is possible to remove this assumption and instead deal with the case of zeros in the factor matrices using (a few) more samples.

4.5 Extending to Higher Order Tensors

The purpose of this section is to detail how the proposed method can be extended to tensors of more than three modes. Within the proof we also see how the method can be made non-adaptive, by increasing the sample complexity to meet the necessary bounds for the matrix completion and censored least squares solves with high probability.

Algorithm 4.3 Adaptive Tensor Deli for Order-$d$ Tensors with No Zeros in $A^{(3)}, \ldots, A^{(d)}$
input: $S \subseteq [n_3]$, slices to complete
output: $\hat{\mathcal{T}}$, the completed tensor
# Slice Complete Phase
for $k \in S$ do
    Use, e.g., [10] or Algorithm 1 in [40] to complete $\mathcal{T}_{:,:,k,1,\ldots,1}$. Algorithm 1 in [40] uses at most $C_1' \mu_0 \max(n_1, n_2)\, r \log^2(r^2/\delta)$ adaptively chosen samples.
end
# Jennrich Complete Phase
Generate random vectors $u, v \in \mathbb{S}^{s-1}$
$T_u \leftarrow \sum_{k \in S} u_k \mathcal{T}_{:,:,k,1,\ldots,1}$, $\quad T_v \leftarrow \sum_{k \in S} v_k \mathcal{T}_{:,:,k,1,\ldots,1}$
Compute the eigendecomposition $A^{(1)} \boldsymbol{\Lambda} (A^{(1)})^{-1} = T_u (T_v)^{\dagger}$
$A^{(2)} \leftarrow (A^{(1)})^{-1} T_u$
# Censored Least Squares Phase (learn modes 3 through $d$)
for $k = 3, 4, \ldots, d$ do
    $\Omega_{k-1} \leftarrow$ fiber sampler$(A^{(1)}, A^{(2)}, r, \gamma)$
    $(\tilde{A}^{(k)})^T \leftarrow$ censored least squares$\big(A^{(1)} \odot A^{(2)},\ \mathcal{T}_{i_1,i_2,1,\ldots,1,i_k,1,\ldots,1}$ s.t. $(i_1, i_2, 1, \ldots, 1, i_k, 1, \ldots, 1) \in \Omega_{k-1}\big)$
    $A^{(k)} \leftarrow \tilde{A}^{(k)}$ with columns normalized to unit length
end
$\alpha \leftarrow$ censored least squares$\big(A^{(1)} \odot A^{(2)} \odot \cdots \odot A^{(d)},\ \mathrm{vec}(\mathcal{T}_{\mathbf{i}})$ s.t. $\mathbf{i} \in \bigcup_{k=1}^{d-1} \Omega_k\big)$
$\hat{\mathcal{T}} \leftarrow [\![\, \alpha;\ A^{(1)}, \ldots, A^{(d)} \,]\!]$
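As an illustration of the Jennrich Complete Phase above, the following minimal NumPy sketch recovers the first two factors from the $s$ completed slices. The function name, the selection of the $r$ largest-magnitude eigenvalues, and the real-part truncation are our own simplifications, rather than the noise-robust variant analyzed earlier in the chapter.

```python
import numpy as np

def jennrich_factors(slices, r, seed=None):
    """Recover the first two CP factors from s completed frontal slices via a
    Jennrich-style step.  slices: array of shape (s, n1, n2) holding the
    completed slices T[:, :, k] for k in S."""
    rng = np.random.default_rng(seed)
    s = slices.shape[0]
    u = rng.standard_normal(s); u /= np.linalg.norm(u)
    v = rng.standard_normal(s); v /= np.linalg.norm(v)
    Tu = np.tensordot(u, slices, axes=1)   # sum_k u_k T[:, :, k]
    Tv = np.tensordot(v, slices, axes=1)
    # Eigenvectors of Tu (Tv)^+ give A^(1) up to scaling and permutation.
    eigvals, eigvecs = np.linalg.eig(Tu @ np.linalg.pinv(Tv))
    # Keep the r eigenvectors with largest-magnitude eigenvalues.
    idx = np.argsort(-np.abs(eigvals))[:r]
    A1 = np.real(eigvecs[:, idx])
    # Second factor (up to column scaling) from a least-squares solve.
    A2 = (np.linalg.pinv(A1) @ Tu).T
    return A1, A2
```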
Theorem 4.2. Suppose that the factor matrices of the CP rank-$r$ tensor $\mathcal{T} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$ satisfy the following assumptions:
• $\mathrm{rank}(A^{(1)}) = \mathrm{rank}(A^{(2)}) = r$,
• $\mu(A^{(1)}) \leq \mu_0$,
• every $s \times r$ submatrix of $A^{(3)}$ has Kruskal rank $\geq 2$,
• for $k = 3, \ldots, d$, $A^{(k)}$ has no entries which are 0.
Then, with probability at least $1 - s\delta$, Algorithm 4.3 completes $\mathcal{T}$ and uses at most

$$\left( s\, C \mu_0 n r \log^2(r^2/\delta) \right) + r \sum_{k=3}^{d} n_k$$

samples, where $n = \max(n_1, n_2)$.

Proof. We can prove this result by completing well chosen 3-mode sub-tensors. Consider first the 3-mode sub-tensor $\mathcal{T}_{i_1, i_2, i_3, \mathbf{i}}$ for $i_j \in [n_j]$, where $\mathbf{i}$ is the index vector of all ones of length $d - 3$, i.e., $[1]^{d-3}$. By our assumptions and Theorem 4.1, using the corresponding set of entries previously described in the proof of Theorem 4.1, only now with the modes beyond three fixed at index 1, we can define the set $\Omega_1 \subset [n_1] \times [n_2] \times S$, the indices needed to complete the $s$ selected (frontal) slices for matrix completion, where $S \subset [n_3]$ is the set of selected slice indices; these slices are used to find the factor matrices $A^{(1)}$ and $A^{(2)}$. We also define $\Omega_2 \subset [n_1] \times [n_2] \times [n_3] \times [1] \times \cdots \times [1]$, consisting of indices $(i_1, i_2, i_3, 1, \ldots, 1)$ such that $(i_1, i_2)$ is selected from the QR decomposition of $(A^{(1)} \odot A^{(2)})^T$ and $i_3$ ranges over all of $[n_3]$; this is the set of fibers needed. With these we can complete the 3-mode sub-tensor and obtain the factor matrix $\tilde{A}^{(3)}$. That is, using the same procedure as in the three-mode case, we have solved the system of equations for $\{\tilde{A}^{(3)}_{i_3,\ell}\}_{\ell=1}^{r}$,

$$\sum_{\ell=1}^{r} \left( A^{(1)}_{i_1,\ell} A^{(2)}_{i_2,\ell} A^{(3)}_{i_3,\ell} \right) \left( A^{(4)}_{1,\ell} \cdots A^{(d)}_{1,\ell} \right) = \sum_{\ell=1}^{r} A^{(1)}_{i_1,\ell} A^{(2)}_{i_2,\ell} \tilde{A}^{(3)}_{i_3,\ell} = \mathcal{T}_{i_1, i_2, i_3, 1, 1, \ldots, 1},$$

where $\tilde{A}^{(3)}_{i_3,\ell} = A^{(3)}_{i_3,\ell} A^{(4)}_{1,\ell} \cdots A^{(d)}_{1,\ell}$, i.e., the scaled version of $A^{(3)}$. Importantly, because the factor matrices are assumed to be entirely non-zero, $A^{(4)}_{1,\ell}$ and so on are all non-zero, and so we are guaranteed a non-zero, and thus informative, value for $A^{(3)}$ in all of the $r$ selected equations.

We remark that this immediately suggests how we can relax the last assumption concerning the factor matrices for modes 4 and higher: should the factor matrices in fact have zeros, we can with high probability obtain $r$ informative equations by sampling more. We can now permute the indices and solve a similar looking system for the similarly defined $\tilde{A}^{(4)}$, with $(i_1, i_2)$ the same as before and $i_4 \in [n_4]$, so that the indices $(i_1, i_2, 1, i_4, 1, \ldots, 1) \in \Omega_3$ correspond to $r$ mode-4 fibers of the tensor. We have the system

$$\sum_{\ell=1}^{r} \left( A^{(1)}_{i_1,\ell} A^{(2)}_{i_2,\ell} A^{(4)}_{i_4,\ell} \right) \left( A^{(3)}_{1,\ell} A^{(5)}_{1,\ell} \cdots A^{(d)}_{1,\ell} \right) = \sum_{\ell=1}^{r} A^{(1)}_{i_1,\ell} A^{(2)}_{i_2,\ell} \tilde{A}^{(4)}_{i_4,\ell} = \mathcal{T}_{i_1, i_2, 1, i_4, 1, \ldots, 1}.$$

We repeat this procedure for all remaining factor matrices. A final solve rectifies the scaling indeterminacy and obtains the weights of the rank-1 components as the vector $\alpha \in \mathbb{R}^r$; i.e., we solve the censored least squares problem

$$\left( A^{(1)} \odot A^{(2)} \odot \tilde{A}^{(3)} \odot \cdots \odot \tilde{A}^{(d)} \right) \alpha = \mathrm{vec}(\mathcal{T})_{\Omega}, \qquad \Omega = \bigcup_{k=1}^{d-1} \Omega_k.$$

To show the sample complexity bound, we need to bound $|\Omega|$. We have $r$ fibers in each of $d - 2$ modes indexed by $(i_1, i_2, 1, \ldots, i_k, \ldots, 1)$, for each $k = 3, 4, \ldots, d$ and $i_k \in [n_k]$, and these indices are in $\Omega_{k-1}$; i.e., $(d - 2) r$ fibers of length $n_k$, for a total of $r \sum_{k=3}^{d} n_k$ revealed indices from fiber sampling once we have union bounded over all the sets $\Omega_k$. This is in addition to (but again potentially intersecting with) the revealed entries used to complete the $s$ slices, i.e., the elements of the set $\Omega_1$. A final union bound yields the desired sample complexity and failure probability.
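The censored least squares solves appearing in Algorithm 4.3 and in the proof above admit a compact sketch. This is a minimal NumPy illustration under the assumption that the $r$ fiber locations $(i_1, i_2)$ have already been chosen; the function name and bookkeeping are ours, not the dissertation's implementation.

```python
import numpy as np

def censored_least_squares_fibers(A1, A2, fiber_values, pairs):
    """Solve for the (scaled) mode-k factor from r sampled fibers.
    pairs: list of (i1, i2) index pairs chosen for the mode-k fibers;
    fiber_values: array of shape (len(pairs), n_k) holding
    T[i1, i2, 1, ..., i_k, ..., 1] along each sampled fiber.
    Returns an n_k x r estimate of the scaled factor tilde{A}^{(k)}."""
    # Design rows are Hadamard products A1[i1, :] * A2[i2, :], i.e. the
    # corresponding rows of the Khatri-Rao product of A1 and A2.
    design = np.stack([A1[i1, :] * A2[i2, :] for (i1, i2) in pairs])  # (m, r)
    # One joint least-squares solve: design @ X = fiber_values, where X has
    # shape (r, n_k); its transpose is the scaled factor estimate.
    X, *_ = np.linalg.lstsq(design, fiber_values, rcond=None)
    return X.T
```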
BIBLIOGRAPHY

[1] Thomas D. Ahle and Jakob B. T. Knudsen. Almost optimal tensor sketch, 2019.

[2] Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 63:161–182, 2006.

[3] Stefan Bamberger, Felix Krahmer, and Rachel Ward. Johnson–Lindenstrauss embeddings with Kronecker structure. SIAM Journal on Matrix Analysis and Applications, 43(4):1806–1850, 2022.

[4] Boaz Barak and Ankur Moitra. Noisy tensor completion via the sum-of-squares hierarchy. In Conference on Learning Theory, pages 417–445. PMLR, 2016.

[5] Richard G. Baraniuk, Mark A. Davenport, Ronald A. DeVore, and Michael B. Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 2007.

[6] Marion F. Baumgardner, Larry L. Biehl, and David A. Landgrebe. 220 band AVIRIS hyperspectral image data set: June 12, 1992 Indian Pine Test Site 3, Sep 2015.

[7] Rasmus Bro, Åsmund Rinnan, and Nicolaas Faber. Standard error of prediction for multilinear PLS. 2. Practical implementation in fluorescence spectroscopy. Chemometrics and Intelligent Laboratory Systems, 75:69–76, 2005.

[8] Emmanuel Candes and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111–119, 2012.

[9] Hongyang Chen, Fauzia Ahmad, Sergiy Vorobyov, and Fatih Porikli. Tensor decompositions in wireless communications and MIMO radar. IEEE Journal of Selected Topics in Signal Processing, 15(3):438–453, 2021.

[10] Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, and Rachel Ward. Completing any low-rank matrix, provably. Journal of Machine Learning Research, 16(94):2999–3034, 2015.

[11] Agniva Chowdhury, Jiasen Yang, and Petros Drineas. Structural conditions for projection-cost preservation via randomized matrix multiplication. Linear Algebra and its Applications, 573:144–165, 2019.

[12] Andrzej Cichocki. Tensor decompositions: new concepts in brain data analysis? Journal of the Society of Instrument and Control Engineers, 50(7):507–516, 2011.

[13] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 163–172, 2015.

[14] Mark A. Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.

[15] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.

[16] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. On the best rank-1 and rank-$(r_1, r_2, \ldots, r_n)$ approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342, 2000.
, ๐‘Ÿ๐‘›) approximation of higher-order tensors. SIAM journal on Matrix Analysis and Applications, 21(4):1324โ€“1342, 2000. [17] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca, and projective clustering. SIAM Journal on Comput- ing, 49(3):601โ€“657, 2020. [18] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Springer New York, New York, NY, 2013. [19] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996. [20] Rachel Grotheer, Shuang Li, Anna Ma, Deanna Needell, and Jing Qin. Iterative hard thresh- olding for low CP-rank tensor models. arXiv preprint arXiv:1908.08479, 2019. [21] Rachel Grotheer, Anna Ma, Deanna Needell, Shuang Li, and Jing Qin. Stochastic iterative In Proc. Information Theory and hard thresholding for low Tucker rank tensor recovery. Applications, 2020. [22] Wolfgang Hackbusch. Tensor Spaces and Numerical Tensor Calculus. [electronic resource]. Springer Series in Computational Mathematics: 56. Springer International Publishing, 2019. [23] Cullen Haselby, Mark Iwen, Deanna Needell, Michael Perlmutter, and Elizaveta Rebrova. Modewise operators, the tensor restricted isometry property, and low-rank tensor recovery. Applied and Computational Harmonic Analysis, 66:161โ€“192, 2023. [24] Cullen Haselby, Mark Iwen, Deanna Needell, Elizaveta Rebrova, and William Swartworth. Fast and low-memory compressive sensing algorithms for low tucker-rank tensor approxima- tion from streamed measurements, 2023. [25] Cullen Haselby, Santhosh Karnik, and Mark Iwen. Tensor sandwich: Tensor completion for low CP-rank tensors via adaptive random sampling. In Fourteenth International Conference on Sampling Theory and Applications, 2023. [26] Stฤณn Hendrikx and Lieven De Lathauwer. Block row kronecker-structured linear systems with a low-rank tensor solution. Frontiers in Applied Mathematics and Statistics, 8, 2022. [27] Mikael Mรธller Hรธgsgaard, Lion Kamma, Kasper Green Larsen, Jelani Nelson, and Chris Schwiegelshohn. Sparse dimensionality reduction revisited, 2023. [28] Mark Iwen, Deanna Needell, Elizaveta Rebrova, and Ali Zare. Lower memory oblivious (tensor) subspace embeddings with fewer random bits: modewise methods for least squares. SIAM Journal on Matrix Analysis and Applications, 42(1):376โ€“416, 2021. 141 [29] Prateek Jain and Sewoong Oh. Provable tensor factorization with missing data. Advances in Neural Information Processing Systems, 27, 2014. [30] Ruhui Jin, Tamara G. Kolda, and Rachel Ward. Faster johnson-lindenstrauss transforms via kronecker products. CoRR, abs/1909.04801, 2019. [31] Ten Berge JM. The ๐‘˜-rank of a khatriโ€“rao product. Unpublished Note, Heฤณmans Institute of Psychological Research, University of Groningen, The Netherlands., 2000. [32] William B Johnson. Extensions of lipschitz mappings into a hilbert space. Contemp. Math., 26:189โ€“206, 1984. [33] Michael Kapralov, Rasmus Pagh, Ameya Velingker, David P. Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. CoRR, abs/1909.01410, 2019. [34] Bohdan Kivva and Aaron Potechin. Exact nuclear norm, completion and decomposition for random overcomplete tensors via degree-4 sos. arXiv preprint arXiv:2011.09416, 2020. [35] Andrew V. Knyazev and Merico E. Argentati. Principal angles between subspaces in an a-based scalar product: Algorithms and perturbation estimates. 
[36] Tamara G. Kolda. Will the real Jennrich's algorithm please stand up? https://www.mathsci.ai/post/jennrich/. Accessed: 2023-05-23.

[37] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[38] Felix Krahmer and Rachel Ward. New and improved Johnson–Lindenstrauss embeddings via the restricted isometry property. SIAM Journal on Mathematical Analysis, 43(3):1269–1281, 2011.

[39] Akshay Krishnamurthy and Aarti Singh. Low-rank matrix and tensor completion via adaptive sampling. Advances in Neural Information Processing Systems, 26, 2013.

[40] Akshay Krishnamurthy and Aarti Singh. On the power of adaptivity in matrix completion and approximation, 2014.

[41] Rafał Latała. Estimation of moments of sums of independent real random variables. The Annals of Probability, 25(3):1502–1513, 1997.

[42] Denis Le Bihan, Jean-François Mangin, Cyril Poupon, Chris A. Clark, Sabina Pappata, Nicolas Molko, and Hughes Chabriat. Diffusion tensor imaging: concepts and applications. Journal of Magnetic Resonance Imaging: An Official Journal of the International Society for Magnetic Resonance in Medicine, 13(4):534–546, 2001.

[43] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Applied and Numerical Harmonic Analysis. Springer New York, 2002.

[44] S. E. Leurgans, R. T. Ross, and R. B. Abel. A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064–1083, 1993.

[45] Allen Liu and Ankur Moitra. Tensor completion made practical. Advances in Neural Information Processing Systems, 33:18905–18916, 2020.

[46] Avner Magen and Anastasios Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1422–1436. SIAM, 2011.

[47] Osman Asif Malik and Stephen Becker. Low-rank Tucker decomposition of large tensors using TensorSketch. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

[48] Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. A randomized algorithm for the decomposition of matrices. Applied and Computational Harmonic Analysis, 30(1):47–68, 2011.

[49] Jiří Matoušek. On variants of the Johnson–Lindenstrauss lemma. Random Structures & Algorithms, 33(2):142–156, 2008.

[50] Rachel Minster, Arvind K. Saibaba, and Misha E. Kilmer. Randomized algorithms for low-rank tensor decompositions in the Tucker format. 2019.

[51] Ankur Moitra. Algorithmic Aspects of Machine Learning. Cambridge University Press, 2018.

[52] Andrea Montanari and Nike Sun. Spectral algorithms for tensor completion. Communications on Pure and Applied Mathematics, 71(11):2381–2425, 2018.

[53] Cameron Musco and Christopher Musco. Projection-cost-preserving sketches: Proof strategies and constructions. CoRR, abs/2004.08434, 2020.

[54] Brendan O'Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, 2016.

[55] Aaron Potechin and David Steurer. Exact tensor completion with sum-of-squares. In Conference on Learning Theory, pages 1619–1673. PMLR, 2017.
[56] Holger Rauhut, Reinhold Schneider, and Željka Stojanac. Low rank tensor recovery via iterative hard thresholding. Linear Algebra and its Applications, 523:220–262, 2017.

[57] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 143–152. IEEE, 2006.

[58] Mikael Sørensen and Lieven De Lathauwer. Fiber sampling approach to canonical polyadic decomposition and application to tensor completion. SIAM Journal on Matrix Analysis and Applications, 40(3):888–917, 2019.

[59] Jimeng Sun, Dacheng Tao, Spiros Papadimitriou, Philip S. Yu, and Christos Faloutsos. Incremental tensor analysis: Theory and applications. ACM Transactions on Knowledge Discovery from Data (TKDD), 2(3):1–37, 2008.

[60] Yiming Sun, Yang Guo, Charlene Luo, Joel Tropp, and Madeleine Udell. Low-rank Tucker approximation of a tensor from streaming data. SIAM Journal on Mathematics of Data Science, 2(4):1123–1150, 2020.

[61] Giorgio Tomasi and Rasmus Bro. PARAFAC and missing values. Chemometrics and Intelligent Laboratory Systems, 75(2):163–180, 2005.

[62] Nick Vannieuwenhoven, Raf Vandebril, and Karl Meerbergen. A new truncation strategy for the higher-order singular value decomposition. SIAM Journal on Scientific Computing, 34(2):A1027–A1052, 2012.

[63] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

[64] Franco Woolfe, Edo Liberty, Vladimir Rokhlin, and Mark Tygert. A fast randomized algorithm for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335–366, 2008.

[65] Ming Yuan and Cun-Hui Zhang. On tensor completion via nuclear norm minimization. Foundations of Computational Mathematics, 16(4):1031–1068, 2016.

[66] Ming Yuan and Cun-Hui Zhang. Incoherent tensor norms and their applications in higher order tensor completion. IEEE Transactions on Information Theory, 63(10):6753–6766, 2017.

[67] Ali Zare, Alp Ozdemir, Mark A. Iwen, and Selin Aviyente. Extension of PCA to higher order data structures: An introduction to tensors, tensor decompositions, and tensor PCA. Proceedings of the IEEE, 106(8):1341–1358, 2018.

[68] Ali Zare, R. Wirth, C. Haselby, Heiko Hergert, and M. Iwen. Modewise Johnson–Lindenstrauss embeddings for nuclear many-body theory. The European Physical Journal A, 59, 2023.