SPARSE HARMONIC TRANSFORMS

By

Bosu Choi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Applied Mathematics — Doctor of Philosophy

2018

ABSTRACT

SPARSE HARMONIC TRANSFORMS

By

Bosu Choi

We develop fast function learning algorithms under the assumption that a function can be well approximated by an expansion of a few high-dimensional basis functions. Considering tensorized Fourier basis functions, several versions of high-dimensional sparse Fourier transforms (SFTs) are discussed. The one-dimensional sparse Fourier transforms introduced in [1] and [2] quickly approximate functions represented by only s Fourier basis functions using a few samples (function evaluations), without noise and with noise respectively. These algorithms can be applied directly to compute the Fourier transform of high-dimensional functions; however, they become hard to implement when the dimension gets too large. In this thesis, we introduce two new concepts, partial unwrapping and tilting, which allow us to compute high-dimensional sparse Fourier transforms efficiently using the ideas in [1] and [2]. Furthermore, we develop sublinear-time compressive sensing methods that approximate multivariate functions by an expansion in a few bounded orthonormal product (BOP) basis functions, a class that includes the tensorized Fourier basis. These new methods are obtained from CoSaMP by replacing its usual support identification procedure with a new, faster one inspired by fast SFT techniques. The resulting sublinearized CoSaMP method allows for the rapid approximation of bounded orthonormal product basis (BOPB)-sparse functions of many variables which are too hideously high-dimensional to be learned by other means. Both numerics and theoretical recovery guarantees are presented.

ACKNOWLEDGMENTS

First and foremost, I would like to express my deepest gratitude to my advisors, Professors Andrew Christlieb and Mark Iwen, for their continuous support and encouragement during the six years of my Ph.D. study. They have provided me with tremendous knowledge, motivation, great insight, patient guidance, and a supportive research environment. I also thank Professor Yang Wang at HKUST, Hong Kong, and Professor Felix Krahmer at TUM, Germany, my valuable collaborators, for giving me brilliant research ideas and the chance to visit their institutes. I have been extremely lucky to have the best advisors and collaborators for my Ph.D. study, so that I could finish my Ph.D. research and have the chance to continue my research career after graduation.

I also thank my committee members: Professor Yingda Cheng, for teaching me a great deal of numerical analysis; Professor Matthew Hirn, for his valuable feedback and comments on my thesis; and Professor Zhengfang Zhou, for his enlightening opinions on future research extensions. I also thank my Korean advisor, Professor Jungho Yoon, for his continuous guidance and support in becoming a Ph.D. and a researcher.

I thank all my dear friends at MSU, EWHA Univ., and EWHA High. I am so lucky to have so many good friends who supported me, whether near or far. Especially, I want to thank my friends Dr. Hana Cho and Dr. Hyosun Yang, with whom I spent most of my 20's pursuing bachelor's, master's, and Ph.D. degrees in mathematics. Thanks to them, I have not been alone while struggling to do research and eventually become a mathematician, ever since Professor Jungho Yoon inspired us in our freshman year at EWHA Univ.
I am eternally grateful to my family: my parents, my brother Boksoo, and my sister Heechan, for their unconditional support. I also thank my grandparents, relatives, and cousins, both in South Korea and the United States, who gave me the courage to pursue a Ph.D. degree with special gatherings, so that I could occasionally enjoy being away from my Ph.D. study.

TABLE OF CONTENTS

LIST OF FIGURES

Chapter 1  Introduction
  1.1  Thesis Outline

Chapter 2  High-dimensional Sparse Fourier Transforms
  2.1  Preliminaries
    2.1.1  Review of the One-Dimensional Sparse Fourier Transform
    2.1.2  Multidimensional Problem Setting and Worst Case Scenario
  2.2  Two Dimensional Sublinear Sparse Fourier Algorithm
    2.2.1  Basic Algorithm Using Parallel Projection
    2.2.2  Full Unwrapping Method
    2.2.3  Tilting Method for the Worst Case
  2.3  Partial Unwrapping Method for High Dimensional Algorithm
    2.3.1  Partial Unwrapping Method
      2.3.1.1  Example of 4-D Case
      2.3.1.2  Generalization
    2.3.2  Probability of Worst Case Scenario
  2.4  Analysis
  2.5  Empirical Result
    2.5.1  Algorithm
    2.5.2  Accuracy
    2.5.3  Sampling Complexity
    2.5.4  Runtime Complexity

Chapter 3  Multiscale Sparse Fourier Transforms
  3.1  Preliminaries
    3.1.1  Notation and Review
    3.1.2  Noise Model
  3.2  Multiscale Method
    3.2.1  Description of Frequency Entry Estimation
    3.2.2  Analysis of Multiscale Method
  3.3  Algorithm
    3.3.1  Choice of p
    3.3.2  Collision Detection Tests
    3.3.3  Number of Iterations
    3.3.4  Description of Our Pseudocode
  3.4  Empirical Evaluation
    3.4.1  Accuracy
    3.4.2  Sampling complexity
    3.4.3  Runtime complexity

Chapter 4  General Sparse Harmonic Transforms
  4.1  Preliminaries
    4.1.1  The Compressive Sensing Problem for BOPB-Sparse Functions
    4.1.2  Main Results
    4.1.3  Randomly Constructed Grids with Universal Approximation Properties
    4.1.4  Notation and Problem Setting
    4.1.5  Definitions Required for Support Identification
    4.1.6  The Proposed Method as a Sublinear-Time Compressive Sensing Algorithm
  4.2  The Proposed Method
    4.2.1  Support Identification Step #1: Entry Identification
    4.2.2  Support Identification Step #2: Pairing
      4.2.2.1  An Example of Entry Identification and Pairing to Find Support
    4.2.3  A Theoretical Guarantee for Support Identification
  4.3  Analysis of the Support Identification
    4.3.1  Entry Identification
    4.3.2  Pairing
    4.3.3  Support Identification
  4.4  Empirical Evaluation
    4.4.1  Experimental Setup
    4.4.2  Experiments with the Fourier Basis for D = [0, 1]^D
    4.4.3  Experiments with the Chebyshev Basis for D = [−1, 1]^D
    4.4.4  Experiments for Larger Ranges of Sparsity s and Dimension D
    4.4.5  Recovery of Functions from Noisy Measurements
    4.4.6  Some Additional Implementational Details

Chapter 5  Conclusion

BIBLIOGRAPHY

LIST OF FIGURES

Figure 2.1: Process of 1D sublinear sparse Fourier algorithm
Figure 2.2: Process of the basic algorithm in 2D
Figure 2.3: Worst case scenario in two dimensions and solving it through tilting
Figure 2.4: Average l2 error
Figure 2.5: Average sampling complexity
Figure 2.6: Average CPU TICKS
Figure 3.1: (a) Average l1 error vs. noise level σ in logarithm. (b) Average samples vs. noise level in logarithm. (c) Average runtime vs. noise level in logarithm.
Figure 3.2: (a) Average l1 error vs. sparsity s in logarithm. (b) Average samples vs. sparsity s in logarithm. (c) Average runtime vs. sparsity s in logarithm.
Figure 4.1: Fourier basis, M ∈ {10, 20, 40, 80}, D = 4, s = 5
Figure 4.2: Fourier basis, M = 10, D ∈ {2, 4, 6, 8}, s = 5
Figure 4.3: Fourier basis, M = 20, D = 4, s ∈ {1, 2, 3, ..., 10}
Figure 4.4: Chebyshev basis, M ∈ {10, 20, 40, 80}, D = 4, s = 5
Figure 4.5: Chebyshev basis, M = 20, D ∈ {2, 4, 6}, s = 5
Figure 4.6: Chebyshev basis, M = 10, D = 6, s ∈ {1, 2, 3, ..., 10}
Figure 4.7: Fourier basis, M = 20, D ∈ {5, 10, 15, 20, ..., 75}, s = 5
Figure 4.8: Fourier basis, M = 40, D = 5, s ∈ {5, 10, 20, 40, 80}
Figure 4.9: Chebyshev basis, M = 20, D ∈ {2, 4, 6, ..., 12}, s = 5
Figure 4.10: Chebyshev basis, M = 40, D = 5, s ∈ {2, 4, 6, ..., 20}
Figure 4.11: Algorithm 5, Fourier basis, M = 10, D ∈ {4, 6, 8, 10}, s = 5, SNRdB ∈ {0, 10, 20, ..., 80}
Figure 4.12: Algorithm 5, Chebyshev basis, M = 10, D ∈ {6, 8}, s = 5, SNRdB ∈ {0, 10, 20, ..., 80}

Chapter 1

Introduction

In this thesis, we develop sublinear-time algorithms for finding signals under the assumption of sparsity. We first do this in the Fourier setting and then develop a generalized method which we expect to apply to uncertainty quantification (UQ) with great promise.

In various numerical problems, one encounters the problem of function approximation, interpolation, and learning from finitely many function evaluations (samples). For high-dimensional function domains, there is the danger of the curse of dimensionality, i.e., the problem that the number of function evaluations required for an accurate result grows exponentially as the dimension grows. For example, one of the best known and most frequently used algorithms is the Fast Fourier Transform (FFT). However, when the bandwidth M of the frequencies is large, the sampling size becomes large, as dictated by the Shannon-Nyquist sampling theorem. Specifically, the runtime complexity is O(M log M) and the number of samples is O(M). This issue is only exacerbated in the D-dimensional setting, where the runtime complexity is O(M^D log M^D) and the number of samples is O(M^D) if we assume the dimension is D and the bandwidth in each dimension is M. As another example, it is proven in [3] that the numerical integration of D-dimensional functions in C^r requires more than c_r (1 + γ)^D function evaluations, with γ > 0, in order to achieve a given error. Thus, additional assumptions must be imposed on the functions in order to develop feasible algorithms. Sparsity is one such assumption that we are often able to observe in many problems. That is, many signals or data sets exhibit a sparse (or compressible) structure in various forms.

Due to this curse of dimensionality, many higher-dimensional problems of interest are beyond the current computational capabilities of the traditional FFT. Moreover, in the sparse setting where the number s of significant frequencies is small, it is computationally wasteful to compute all M^D coefficients. In such a setting we refer to the problem as being sparse in the Fourier domain. For such sparse problems, the idea of sublinear sparse Fourier transforms was introduced in [4], and several sublinear algorithms were developed extending the idea [4, 5, 6, 7, 8, 1, 2]. These methods greatly reduce the runtime and sampling complexity of the FFT in the sparse setting. The methods were primarily designed for the one-dimensional setting.
In [9], practical algorithms for data in two dimensions were given for the first time. In this thesis, we develop algorithms designed for higher-dimensional data, which are effective even for dimensions in the hundreds and thousands. To achieve our goal, our approach must address the worst case scenario presented in [9].

It is not straightforward to extend one-dimensional sparse Fourier transform algorithms to multiple dimensions. We face several obstacles. First, we do not have an efficient FFT for multidimensional problems of dimension much higher than three. Using projections onto lower-dimensional spaces solves this problem. However, like all projection methods for sparse FFTs, one needs to match frequencies from one projection with those from another projection. This registration problem is one of the big challenges in the one-dimensional sparse FFT. An equally difficult challenge is that different frequencies may be projected onto the same frequency (the collision problem). All projection methods for sparse FFTs primarily aim to overcome these two challenges. In higher-dimensional sparse FFTs, these problems become even more challenging, as we are now dealing with frequency vectors, not just scalar frequencies.

In order to make our multi-dimensional sparse Fourier algorithm robust to noisy samples, we utilize the idea of a multiscale method from [2]. Suppose f : [0, 1)^D → C is defined as

f(x) = \sum_{n \in S} c_n e^{2\pi i n \cdot x} + z(x),

where x ∈ [0, 1)^D, n ∈ [−M/2, M/2)^D ∩ Z^D, |S| = s ≪ M^D, and z(x) represents a noise function. Our algorithms introduced in Chapter 2 are not robust to noise. This lack of robustness is due to the computation of the ratio between the discrete Fourier transforms of two sample sets from f, at the shifted and unshifted points, in order to get traces of the significant Fourier modes. For example, consider a one-mode sparse function f(x) = c e^{2πinx} in one dimension and two sample sets of length p, where p is taken to be 1. That is, f_0 = (f(0)) and f_ε = (f(ε)), each of whose DFTs yields the exact recovery of the frequency n = (1/2πε) Arg(f(ε)/f(0)), where the function Arg gives the argument falling into [−π, π). If those samples are noisy, however, then (1/2πε) Arg((f(ε) + z(ε))/(f(0) + z(0))) does not guarantee a proper approximation of n. The fraction inside Arg(·) is in general corrupted by a term whose size is O(σ/(c_min √p)), where p is the number of samples, c_min is the nonzero Fourier coefficient smallest in magnitude, and the noise is assumed to be Gaussian ∼ N(0, σ²); details are given in Section 3.1.2. A sufficiently large p > O(σ/c_min)^{2/3} enables us to find the correct Fourier modes by simple rounding, as introduced in [2]. However, when either σ is too big or the shift size ε is too small, p is required to be correspondingly very large. In order to avoid an excessively large p, we make use of a multiscale method in Chapter 3, which makes an initial guess for each frequency using a moderately large ε_0 and p, and adds finitely many (L) correction terms computed using gradually increasing ε_q for q = 1, 2, ..., L.

Our algorithms assume that we have access to an underlying continuous function f, i.e., we can sample anywhere we want. However, samples are sometimes given at the outset, and acquiring extra samples can be very expensive. Accordingly, it is then necessary to approximate extra samples from the given ones.
A fully discrete sparse Fourier transform was introduced in [10], combining periodized Gaussian filters and one-dimensional sparse Fourier transforms in the continuous setting such as [11] and [2]. We expect, as future work, that this approach will allow our multiscale high-dimensional algorithm to be modified to work with fully discrete samples.

Many problems are best represented using bases other than Fourier. In this thesis we develop a generalization of sublinear-time sparse harmonic transforms for such applications. To motivate our generalized sparse harmonic transform we now consider an example from uncertainty quantification (UQ). A common class of examples in the UQ literature [12, 13] is one in which certain quantities of interest (QoI) are affected by a number of parameters, and one can assume that the QoI depend smoothly on these parameters. Consequently, uncertainty in these inputs affects the uncertainty in the QoI outputs. In the process of quantifying these uncertainties, the probability density function (PDF) of the input parameters is estimated through Bayesian inference, and the uncertainty is propagated through the models for sampling points chosen from the PDF in order to obtain prediction intervals for the QoI. A crucial step in this process is then to approximate the QoI from its sample values. This typically requires multivariate function integration and interpolation, usually via quadrature methods, sparse grid approaches, or Monte Carlo methods, depending on the number of parameters.

The first two methods face the curse of dimensionality directly: quadrature methods [14] require O(M^D) samples, where D is the number of dimensions/function variables. Sparse grid methods [15] improve on this by working with function spaces of mixed smoothness. Instead of a full grid using O(M^D) samples, they require only O(M (log M)^{D−1}) samples. Although this is better than quadrature methods for high-dimensional problems, it still exhibits exponential dependence on D, and so sparse grids are only feasible for moderately large dimensional problems. Monte Carlo methods [16], on the other hand, have the advantage of a convergence rate independent of the number of dimensions. However, they exhibit slow convergence, at a rate of O(1/√m), where m is the number of samples. Consequently, such methods are often used to integrate relatively high-dimensional functions. Similarly, quasi-Monte Carlo [17, 16] achieves a convergence rate of O((log m)^D / m) by using low-discrepancy sample sets. Though good for integration, (quasi-)Monte Carlo methods cannot be used for function learning, approximation, or interpolation. For such applications one has few options in large dimensions.

Uncertainty quantification consists of several stages. We first gather experimental observations of some phenomenon, design a model for the observations using ODEs, PDEs, etc., and solve this model using various numerical methods. These three steps are revisited for validation and calibration. Each step contains uncertainty (error), and these errors accumulate into uncertainty in the QoI. Our models usually contain tens or hundreds of unknown parameters. We consider parametric operator equations [18, 19, 20],

A(x) u(x) = b,

where the solution u(x) is in the Sobolev space H^1_0(U), x ∈ D ⊂ R^D is a parameter vector with a large D ∈ N, ν is a probability measure on D, and U is the physical domain of u. The QoI is approximated by a function g : x → G(u(x)), where G is a functional on u ∈ H^1_0(U).
The following class of parametric elliptic partial differential equations is considered in several papers [18, 20, 21]:

−∇ · (a(·, x) ∇u) = b,   u|_{∂U} = 0,

where a(·, x) is a diffusion coefficient. Considering these parameters as variables x, in addition to the physical variables t, by the Karhunen-Loève expansion [18] a stochastic process a(t, x) can be represented as its expectation ā(t) plus an infinite sum of eigenfunctions of the covariance function multiplied by eigenvalues. We look at a simple case where a depends affinely on x as follows:

a(t, x) = ā(t) + \sum_{j ≥ 1} x_j ψ_j(t).

Accordingly, u can be represented by a (possibly) infinite expansion in T_n, typically tensorized Chebyshev or Legendre functions,

u(t, x) = \sum_{n \in F} d_n T_n(x),

where d_n ∈ H^1_0(U). When G is a linear functional,

g(x) = G(u(x)) = \sum_{n \in F} G(d_n) T_n(x),

and if c_n := G(d_n) and we truncate this expansion to a finite sum over the few c_n that are large in magnitude,

g(x) ≈ \sum_{n \in S ⊂ F} c_n T_n(x),

which means that the function g can be approximated by a linear combination of a few tensorized basis functions.

Our generalized method is inspired by this sparse approximation of QoI, which can be considered as functions of many variables in UQ. We consider a D-variate function f(x) : D → C, with D ⊂ R^D, which is compressible in a bounded orthonormal system {T_n}, i.e.,

f(x) = \sum_{n \in [M]^D} c_n T_n(x),

where only s ≪ M^D of the c_n are significant, such n satisfy ‖n‖_0 ≤ d with d ≤ D, and [M] := {0, 1, 2, ..., M − 1}. We aim at recovering the s significant coefficients c_n and the corresponding index vectors n efficiently in terms of both runtime and sampling. Our method is a greedy method motivated by CoSaMP [22], HTP [23], etc. The support identification in these methods requires the construction of a measurement matrix and its multiplication with a sample vector, resulting in huge memory and operation counts for high-dimensional problems. Accordingly, our method introduces a faster support identification procedure that finds the entries of the multi-index n (entry identification) and then integrates the entries to recover the multi-indices (pairing). After estimating the support, the least squares problem restricted to this support is solved to approximate the corresponding coefficients. This process is repeated until we recover the function f with the desired error.

To wrap up, this thesis explores several versions of sublinear-time algorithms for approximating high-dimensional functions which are sparse in some bounded orthonormal product basis (BOPB) expansion. Our algorithm using the CoSaMP approach can also compute the discrete Fourier transform quickly in place of our proposed high-dimensional sparse Fourier transforms. However, the sparse Fourier transforms show faster and more efficient performance, whereas the sublinear CoSaMP works for general BOPBs, including the tensorized Fourier basis.

In the remainder of the introduction, we review the related work. The first sparse Fourier algorithm was proposed in [4]. The authors introduced a randomized algorithm with O(s² log^c M) runtime and O(s² log^c M) samples, where c is a positive number that varies depending on the trade-off between efficiency and accuracy. An algorithm with improved runtime O(s log^c M) and samples O(s log^c M) was given in [5].
The algorithms given in [6] and [7] achieved O(s log M log(M/s)) average-case runtime, and empirical results are given in [7] comparing their algorithms with FFTW and an existing sparse FFT from [24]. The algorithms in [4, 5, 6, 7] are all randomized. The first deterministic algorithm, using a combinatorial approach, was introduced in [8]. In [1], another deterministic algorithm was given whose procedure recognizes frequencies in a similar manner to [6]. The two methods in [1, 6] were published at the same time and both use the idea of working with two sets of samples, one at O(s) points and the second at the same O(s) points plus a small shift. The ratio of the FFTs of the two sets of points, plus extra machinery, leads to fast deterministic algorithms. The first deterministic algorithm [8] has O(s² log⁴ M) runtime and sampling complexity, and in [11] an improved deterministic algorithm was introduced; an extension to higher-dimensional functions was suggested there, but it suffered from exponential dependence of the runtime complexity on the dimension. Another deterministic algorithm was introduced in [1]; it uses a similar idea of recovering frequencies through phase shifts and works for noiseless samples from exactly s-sparse functions. It has O(s log s) runtime and O(s) sampling complexity on average. In [2], a multiscale method was introduced which works when we are given noisy samples and has O(s log s log(M/s)) runtime and O(s log(M/s)) sampling complexity.

Extending the one-dimensional sparse Fourier transform to the multidimensional problem setting is not straightforward. One simple extension is to unwrap the multi-dimensional signal into a one-dimensional signal; however, this suffers from exponentially large runtime complexity due to the curse of dimensionality [11]. The first randomized algorithm for the two-dimensional problem was introduced in [9], using parallel projections of frequencies. In [25], a general D-dimensional deterministic sparse Fourier algorithm achieves O(Ds²M) samples and O(Ds³ + ds²M log(sM)) runtime complexity by using rank-1 lattices and finding frequencies in an entry-wise fashion. To reduce the runtime complexity, the authors also introduced a randomized version of the algorithm with O(Ds + DM) samples and O(Ds³) runtime. A randomized algorithm introduced in [26] requires 2^{O(D²)} (s log M^D log log M^D) samples and 2^{O(D²)} s log^{D+3} M^D runtime.

Many functions in UQ problems exhibit sparsity (or compressibility) in various forms, which inspires us to consider the possible contribution of compressive sensing and sparse approximation. In [27], randomized algorithms approximating sparse polynomials are introduced. In [25, 28, 29, 30], different versions of high-dimensional sparse Fourier transforms are introduced. As sparsity in high-dimensional Chebyshev and Legendre systems is considered in some UQ problems, fast algorithms to recover functions with such sparsity have been developed [31, 20, 19]. These papers often make additional assumptions on the structure of the sparsity, which imply that the degrees of the polynomials with large coefficients are very small. A simple example is the case where a function of a large number of variables actually depends on only a few of them [32]. Hyperbolic crosses [33, 34] and lower sets [19] are further examples with this characteristic, and such structure amounts to bounded mixed smoothness of the objective function instead of arbitrary smoothness.
In [3], it is shown that the number of function evaluations required to approximate a function of arbitrary smoothness cannot avoid the curse of dimensionality, i.e., super-exponential growth with the number of dimensions. Instead, many methods assume structured sparsity in order to avoid the curse of dimensionality in certain problems. One such method is the sparse grid method [15], which is used for both interpolation and integration of multivariate functions. In [20, 18], a compressive sensing method with weighted l1 minimization is combined with a Petrov-Galerkin method to approximate linear functionals of a parametric PDE's solution, which can be represented by a combination of a few tensorized Chebyshev polynomials.

1.1 Thesis Outline

In this thesis, we present three different approaches to quickly approximating functions of many variables that are sparse in BOPBs using a few function evaluations. The tensorized Fourier basis has good properties compared to the other BOPBs, and we develop very efficient sparse FFTs using these advantages. The first approach works for exactly sparse functions using noiseless samples, and the second approach works for functions with moderate noise. Then, general sparse harmonic transforms are established so that they work for any BOPB, including the Fourier basis.

As a first step toward our goal of a high-dimensional sparse FFT, the thesis addresses the case of continuous functions without noise in a high-dimensional setting in Chapter 2. We introduce effective methods to address the registration and collision problems in Section 2.1. In Section 2.2, various methods for resolving these problems in 2D are introduced, which can be extended to methods in the higher-dimensional setting. In particular, we introduce a novel partial unwrapping technique in Section 2.3 that is shown to be highly effective in reducing the registration and collision complexity while maintaining sublinear runtime efficiency even in very high-dimensional problems. We show in Section 2.4 that our algorithm can achieve O(Ds log s) computational complexity and O(Ds) sampling size on average, assuming there is no worst-case scenario. In Section 2.5, we present as examples computational results for sparse FFTs where the dimensions are 100 and 1000, respectively. For comparison, the traditional D-dimensional FFT requires O(M^D log M^D) time complexity and O(M^D) sampling complexity, which is impossible to implement on any computer today.

In Chapter 3, we present an adaptation of the algorithm for noisy data. In Section 3.1 we introduce our problem setting, the necessary notation, and our noise model. Based on these, a multiscale method for frequency entry estimation is introduced and analyzed in Section 3.2. In Section 3.3, the parameters that determine the performance of our algorithm are introduced and the pseudocode is given with a description. The results of the numerical experiments are shown in Section 3.4.

In Chapter 4, a generalized sparse harmonic transform is described. In Section 4.1, we introduce the notation used throughout the chapter and the problem setting, and review the basics of compressive sensing related to the thesis. In Section 4.2, the description and the performance results of our proposed method are presented. The analysis supporting the results of Section 4.2 is elaborated in Section 4.3, and the results of numerical experiments are shown in Section 4.4.
Chapter 2

High-dimensional Sparse Fourier Transforms

2.1 Preliminaries

2.1.1 Review of the One-Dimensional Sparse Fourier Transform

The one-dimensional sublinear sparse Fourier algorithm inspiring our method was developed in [1]. We briefly introduce the idea and notation of the algorithm before developing the multidimensional ones throughout this chapter. We assume a function f : [0, 1) → C with sparsity s of the form

f(x) = \sum_{n \in S} c_n e^{2\pi i n x},    (2.1)

where the bandwidth is M, i.e., each frequency n ∈ S ⊂ [−M/2, M/2) ∩ Z, the cardinality |S| of S is s, and the corresponding nonzero coefficient c_n is in C for all n. We can consider f as a periodic function over R instead of [0, 1). The goal of the algorithm is to recover all coefficients c_n and frequencies n so that we can reconstruct the function f. This algorithm is called the "phase-shift" method since it uses equispaced samples from the function together with samples at the positions shifted by a small positive number ε. To guarantee that the algorithm correctly finds the frequencies within the bandwidth M, ε should be no bigger than 1/M.

We denote a sequence of samples shifted by ε, with sampling rate 1/p where p is a prime number, as

f_{p,ε} = ( f(0 + ε), f(1/p + ε), f(2/p + ε), f(3/p + ε), ..., f((p−1)/p + ε) ).    (2.2)

We skip much of the detail here. In a nutshell, choosing p slightly larger than s is enough to make the algorithm work. In [1], p is set to be roughly 5s, which is much smaller than the Nyquist rate M. The discrete Fourier transform (DFT) is then applied to the sample sequence f_{p,ε}, and the hth element of its result is

F(f_{p,ε})[h] = p \sum_{n ≡ h (mod p)} c_n e^{2\pi i n ε},    (2.3)

where h = 0, 1, 2, ..., p − 1. If there is only one frequency n congruent to h modulo p, then

F(f_{p,ε})[h] = p c_n e^{2\pi i n ε}.    (2.4)

Putting 0 in place of ε, we get the unshifted samples f_{p,0}, and applying the DFT gives

F(f_{p,0})[h] = p c_n.    (2.5)

This process is visualized in Figure 2.1. As long as there is no collision of frequencies modulo p, we can find the frequencies and their corresponding coefficients by the computation

n = (1/2πε) Arg( F(f_{p,ε})[h] / F(f_{p,0})[h] ),    c_n = (1/p) F(f_{p,0})[h],    (2.6)

where the function "Arg" gives the argument falling into [−π, π). Note that n should be the only frequency congruent to h modulo p, i.e., n has no collision with other frequencies modulo p. The test to determine whether a collision occurs is

|F(f_{p,ε})[h]| / |F(f_{p,0})[h]| = 1.    (2.7)

The equality (2.7) holds when there is no collision. If there is a collision, the equality fails for almost all ε, i.e., the test fails to predict a collision only for finitely many ε [1]. Furthermore, it is also shown in [1] that for any ε = a/b, with a, b coprime and b ≥ 2M, equality (2.7) fails for at least one shift in a certain finite set of integer multiples of ε unless there is no collision. In practical implementations, we choose ε to be 1/CM for some positive integer C ≥ 1 and allow a small difference τ between the left and right sides of (2.7), where τ is a very small positive number.

Figure 2.1: Process of 1D sublinear sparse Fourier algorithm

The above process is one loop of the algorithm with a prime number p. To explain it from a different viewpoint, we can imagine that there are p bins and sort all frequencies into these bins according to their remainders modulo p. If more than one frequency lands in a bin, a collision has happened; if there is only one frequency, there is no collision. The test above determines which case has occurred in each bin.
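To make the loop just described concrete, here is a minimal Python sketch of one pass of the phase-shift recovery for an exactly sparse, noiseless signal. It is an illustrative toy under assumptions of my own choosing (function name, tolerance, toy signal), not the thesis implementation, and it scans all p bins instead of only the s* largest ones as the pseudocode in Algorithm 1 does.

import numpy as np

def one_phase_shift_loop(f, p, eps, tol=1e-8):
    # f: callable evaluating the sparse signal on [0, 1); p: prime, roughly 5*s in [1];
    # eps: shift, chosen no larger than 1/M for bandwidth M.
    grid = np.arange(p) / p
    F0 = np.fft.fft(f(grid))           # DFT of the unshifted samples, as in (2.5)
    Fe = np.fft.fft(f(grid + eps))     # DFT of the samples shifted by eps, as in (2.3)
    found = {}
    for h in range(p):
        if abs(F0[h]) < tol:           # empty bin, nothing to recover
            continue
        # collision test (2.7): magnitudes agree only if bin h holds a single frequency
        if abs(abs(Fe[h]) / abs(F0[h]) - 1.0) < tol:
            n = int(round(np.angle(Fe[h] / F0[h]) / (2 * np.pi * eps)))   # (2.6)
            found[n] = F0[h] / p
    return found

# toy signal f(x) = 2 e^{2 pi i 3 x} - e^{-2 pi i 11 x}, bandwidth M = 64
M = 64
f = lambda x: 2 * np.exp(2j * np.pi * 3 * x) - np.exp(-2j * np.pi * 11 * x)
print(one_phase_shift_loop(f, p=11, eps=1.0 / M))   # recovers frequencies 3 and -11 with coefficients 2 and -1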
When the test fails, i.e., the ratio is not 1, we need to use another prime number p′, so we re-sort the frequencies into p′ bins by their remainders modulo p′. Even if two frequencies collide modulo p, it is likely that they do not collide modulo p′. In particular, the Chinese Remainder Theorem guarantees that, with a finite set of prime numbers {p_ℓ}, any frequency within the bandwidth M can be uniquely identified, provided ∏_ℓ p_ℓ ≥ M. Algorithmically, for each loop we choose a different prime number p′ and repeat equations (2.2)–(2.7) with p replaced by p′. In this way we can recover all c_n and n in sublinear time O(s log s) using O(s) samples on average. The overall procedure is shown in Algorithm 1, adapted from [1].

Algorithm 1 Phaseshift

procedure Phaseshift
  Input: f, C, s, M, ε
  Output: R
  R ← ∅
  t ← 1
  while |R| < s do
    s* ← s − |R|
    p ← t-th prime number ≥ Cs*
    Q(x) = Σ_{(n, c_n) ∈ R} c_n e^{2πinx}
    for ℓ = 0 → p − 1 do
      f_{p,0}[ℓ] = f(ℓ/p) − Q(ℓ/p)
      f_{p,ε}[ℓ] = f(ℓ/p + ε) − Q(ℓ/p + ε)
    end for
    F(f_{p,ε}) = FFT(f_{p,ε})
    F(f_{p,0}) = FFT(f_{p,0})
    F^sort(f_{p,0}) = SORT(F(f_{p,0}))  (the same ordering is applied to F(f_{p,ε}))
    for h = 0 → s* − 1 do
      if | |F^sort(f_{p,ε})[h]| / |F^sort(f_{p,0})[h]| − 1 | < τ then
        ñ = (1/2πε) Arg( F^sort(f_{p,ε})[h] / F^sort(f_{p,0})[h] )
        c_ñ = (1/p) F^sort(f_{p,0})[h]
        R ← R ∪ (ñ, c_ñ)
      end if
    end for
    prune small coefficients from R
    t ← t + 1
  end while
end procedure

2.1.2 Multidimensional Problem Setting and Worst Case Scenario

In this section, the multidimensional problem is introduced. Let us consider a function f : R^D → C such that

f(x) = \sum_{n \in S} c_n e^{2\pi i n \cdot x},    (2.8)

where n ∈ [−M/2, M/2)^D ∩ Z^D and c_n ∈ C. That is, compared with (2.1), x is replaced by the D-dimensional phase or time vector x, the frequency n is replaced by the frequency vector n, and thus the operation between n and x is a dot product instead of a simple scalar multiplication. We can see that this is a natural extension of the one-dimensional sparse problem. As in the one-dimensional setting, if we find the c_n and n, we recover the function f.

However, since our time and frequency domains have changed, we cannot apply the previous algorithm directly. If we project the frequencies onto a line, then we can apply the former algorithm and so retain the sublinear time complexity. Since the operation between frequency and time vectors is a dot product, we can convert a projection of the frequencies into a corresponding restriction of the time vectors. For example, consider the projection onto the first axis, that is, we set the last D − 1 elements of the time vectors to 0. If the projection is one-to-one, i.e., there is no collision, then we can apply the algorithm in Section 2.1.1 to this projected function to recover the first element of the frequency vectors. If there is a collision on the first axis, then we can try another projection, onto the k̃th axis for k̃ = 2, 3, ..., D, until there are no collisions. We explain in later sections how to recover the corresponding remaining D − 1 elements by extending the collision test of Section 2.1.1.
Furthermore, to reduce the chance of a collision through projections, we use an "unwrapping method" which unwraps frequencies onto a lower dimension while guaranteeing a one-to-one projection. There are both a "full unwrapping" and a "partial unwrapping" method, explained in later sections. We shall call a projection onto any one of the coordinate axes a parallel projection. The worst case is where there is a collision for every parallel projection. This obviously happens when a subset of frequency vectors forms the vertices of a D-dimensional hypercube, but it can also happen with other configurations that require fewer vertices. Then our method cannot recover any of these frequency vectors via parallel projections. To resolve this problem, we introduce tilted projections: instead of simple projections onto axes, we project frequency vectors onto tilted lines or planes so that there is no collision after the projection. We shall call this the tilting method and provide the details in the next section. After introducing these projection methods, we explore which combination of them is likely to be optimal.

2.2 Two Dimensional Sublinear Sparse Fourier Algorithm

As a means of explanation, we introduce the two-dimensional case in this section and extend it to higher dimensions in Section 2.3. The basic two-dimensional algorithm using a parallel projection is introduced in Section 2.2.1, the full unwrapping method is introduced in Section 2.2.2, and the tilting method for the worst case is discussed in Section 2.2.3.

Figure 2.2: Process of the basic algorithm in 2D

2.2.1 Basic Algorithm Using Parallel Projection

Our basic two-dimensional sublinear algorithm excludes certain worst case scenarios. In most cases, we can recover frequencies in the 2-D plane by projecting them onto the horizontal axis or the vertical axis. Figure 2.2 is a simple illustration. Here we have three frequency vectors, where n1 and n3 collide with each other when they are projected onto the horizontal axis, and n1 and n2 collide when they are projected onto the vertical axis. The first step is to project the frequency vectors onto the horizontal axis and recover only n2 and its corresponding coefficient c_{n2}, since it is not colliding. After subtracting n2 from the data, we project the remaining frequency vectors onto the vertical axis and then find both n1 and n3.

Now let us consider the generalized two-dimensional basic algorithm. Assume that we have a two-dimensional function f with sparsity s:

f(x) = \sum_{n \in S} c_n e^{2\pi i n \cdot x},   c_n ∈ C,   n ∈ [−M/2, M/2)² ∩ Z².    (2.9)

For now, let us focus on one frequency vector ñ which does not collide with any other pair when the frequencies are projected onto the horizontal axis. To clarify, put x = (x_1, 0), with n = (n_1, n_2), into (2.9):

f^1(x_1) := f(x_1, 0) = \sum_{n \in S} c_n e^{2\pi i n_1 x_1},    (2.10)

which gives the same effect as the parallel projection of the frequency vectors. Now we can consider this as a one-dimensional function f^1, so that we can use the original one-dimensional sparse Fourier algorithm to find the first component of ñ. We take the samples f^1_p and f^{1,1}_{p,ε}, without and with the shift by ε respectively, in the form of the sequences in (2.2), apply the DFT to them, and then recover the first component of the frequency pair and its coefficient as follows:

ñ_1 = (1/2πε) Arg( F(f^{1,1}_{p,ε})[h] / F(f^1_p)[h] ),    c_ñ = (1/p) F(f^1_p)[h].    (2.11)
At the same time, we need to find the second component. In (2.10), we replace 0 by ε. Then

f^{1,2}(x_1) := f(x_1, ε) = \sum_{n \in S} c_n e^{2\pi i (n_1 x_1 + n_2 ε)},
F(f^{1,2}_{p,ε})[h] = p c_ñ e^{2\pi i ñ_2 ε},
ñ_2 = (1/2πε) Arg( F(f^{1,2}_{p,ε})[h] / F(f^1_p)[h] ),    (2.12)

where f^{1,2}_{p,ε} are the samples of the function f^{1,2} taken at rate 1/p, i.e., shifted by ε in the vertical sense. The equalities in (2.11) hold only when ñ_1 is the only value congruent to h modulo p among the first components of the s frequency pairs, and (2.12) holds only when the previous condition is satisfied and ñ = (ñ_1, ñ_2) does not collide with other frequency pairs under the parallel projection.

Now we have two kinds of collisions. The first comes from taking modulo p after the parallel projection, and the second comes from the projection itself. Thus we need two tests. To determine whether either kind of collision occurs, we use tests similar to (2.7). If there are at least two different n_1 congruent to h modulo p, then the second equality in the following is not satisfied for almost all ε, just as with (2.7):

|F(f^{1,1}_{p,ε})[h]| / |F(f^1_p)[h]| = | p \sum_{n_1 ≡ h (mod p)} c_n e^{2\pi i n_1 ε} | / | p \sum_{n_1 ≡ h (mod p)} c_n | = 1.    (2.13)

Likewise, if there is a collision from the projection, i.e., the first components n_1 of at least two frequency vectors are identical while the corresponding n_2 differ, the second equality in the following does not hold for almost all ε:

|F(f^{1,2}_{p,ε})[h]| / |F(f^1_p)[h]| = | p \sum_{n_1 ≡ h (mod p)} c_n e^{2\pi i n_2 ε} | / | p \sum_{n_1 ≡ h (mod p)} c_n | = 1.    (2.14)

The two tests above are both satisfied only when there is no collision, either from taking modulo p or from the projection. We use them for the complete recovery of the objective frequencies.

So far we have projected the frequencies onto the horizontal axis. After we find the non-collided frequencies from the first projection, we subtract a function consisting of the found frequencies and their coefficients from the original function f to get a new function. Next we project this new function onto the vertical axis and carry out a similar process; the difference is to exchange 1 and 2 in the superscripts in (2.10) through (2.14). Again, we find the remaining non-collided frequencies, change the axis again, and keep doing this until we recover all of the frequencies.

2.2.2 Full Unwrapping Method

We introduce another kind of projection which is one-to-one. The full unwrapping method uses one-to-one projections onto one-dimensional lines instead of the parallel projections onto axes of the previous method. We consider the s pairs of frequencies n = (n_1, n_2) ∈ S and transform them as follows:

(n_1, n_2) → n_1 + M n_2.    (2.15)

This transformation in frequency space can be realized as a transformation in phase or time space. That is, from the function in (2.9),

g(x) := f(x, Mx) = \sum_{n \in S} c_n e^{2\pi i (n_1 + M n_2) x}.    (2.16)

The function g(x) is a one-dimensional function with sparsity s and bandwidth bounded by M². We can apply the algorithm in Section 2.1.1 to g, so that we recover the s frequencies of the form on the right side of the arrow in (2.15). Whether unwrapped or not, the coefficients are the same, so we can find them easily. In the end we need to wrap the unwrapped frequencies back to obtain the original pairs. Remember that the unwrapping transformation is one-to-one; thus we can wrap them back without any collision. Since the pairs of frequencies are projected directly onto a one-dimensional line, we call this method the "full unwrapping method".
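The wrap/unwrap bookkeeping of (2.15) is easy to make explicit. The following short Python sketch (with hypothetical helper names of my own; a toy, not the thesis code) maps 2-D frequency pairs with entries in [−M/2, M/2) to 1-D frequencies of bandwidth M² and inverts the map exactly.

def unwrap2(n1, n2, M):
    # (n1, n2) with entries in [-M/2, M/2) is sent to n1 + M*n2, as in (2.15)
    return n1 + M * n2

def wrap2(n, M):
    # invert (2.15): n1 is the representative of n mod M lying in [-M/2, M/2)
    n1 = (n + M // 2) % M - M // 2
    n2 = (n - n1) // M
    return n1, n2

M = 16
pairs = [(-8, 7), (3, -5), (0, 0), (7, 7)]
unwrapped = [unwrap2(a, b, M) for (a, b) in pairs]     # e.g. (-8, 7) -> 104
assert [wrap2(u, M) for u in unwrapped] == pairs       # the map is one-to-one
print(unwrapped)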
A problem with this method occurs when the dimension D gets large. From the above description, we see that after the one-to-one unwrapping, the total bandwidth of the two-dimensional signal increases from M in each dimension to M². If the full unwrapping method is applied to a function in D dimensions, then guaranteeing a one-to-one transformation makes the bandwidth M^D. Theoretically this does not matter. However, since ε depends on the bandwidth, when D is large we need to consider the limits of machine precision in practical implementations. As a result, we need to introduce the partial unwrapping method to prevent the bandwidth from becoming too large. The partial unwrapping method is discussed in Section 2.3.

Figure 2.3: Worst case scenario in two dimensions and solving it through tilting

2.2.3 Tilting Method for the Worst Case

Up till now, we have assumed that we do not encounter the worst case, i.e., the case where every frequency pair has a collision under the parallel projection onto each coordinate axis. This makes the algorithm break down. A very simple example of the worst-case scenario is given in Figure 2.3, where the four frequency pairs n1, n2, n3, and n4 form a rectangle, and thus each pair has collisions in both the horizontal and the vertical projections. The following method finds those frequency pairs. Basically, we rotate the axes of the frequency plane and thus use a projection onto a one-dimensional system which is a tilted line, with the tilt chosen so that there are no collisions. If the horizontal and vertical axes are rotated by an angle θ, then the frequency pair n = (n_1, n_2) can be relabeled in the new coordinates as

(n_1, n_2) → (n_1 cos θ − n_2 sin θ, n_1 sin θ + n_2 cos θ).    (2.17)

In the phase sense, this rotation can be written as

g(x̃_1, x̃_2) := f(x̃_1 cos θ + x̃_2 sin θ, −x̃_1 sin θ + x̃_2 cos θ)
  = \sum_{n \in S} c_n e^{2\pi i \{ n_1 (x̃_1 cos θ + x̃_2 sin θ) + n_2 (−x̃_1 sin θ + x̃_2 cos θ) \}}
  = \sum_{n \in S} c_n e^{2\pi i \{ (n_1 cos θ − n_2 sin θ) x̃_1 + (n_1 sin θ + n_2 cos θ) x̃_2 \}}.    (2.18)

We can apply the basic algorithm of Section 2.2.1 to the function g to get the frequency pairs in the form on the right side of the arrow in (2.17). One problem we face is that the components of the projected frequency pairs should be integers in order to apply the method, since we assume integer frequencies in the first place. To guarantee injectivity for both projections, tan θ should be irrational; then, however, the projected frequencies become irrational. Thus, we should try a rational tan θ, and to make the projected frequencies integers it is inevitable to increase the bandwidth by multiplying by the least common multiple of the denominators of sin θ and cos θ. We assume the following, with integers a, b, and c:

sin θ = a/c,   cos θ = b/c,   gcd(a, c) = gcd(b, c) = 1.    (2.19)

Multiplying both inputs on the right-hand side of (2.18) by c, we obtain

ĝ(x̃_1, x̃_2) := f( c(x̃_1 cos θ + x̃_2 sin θ), c(−x̃_1 sin θ + x̃_2 cos θ) )
  = \sum_{n \in S} c_n e^{2\pi i \{ (c n_1 cos θ − c n_2 sin θ) x̃_1 + (c n_1 sin θ + c n_2 cos θ) x̃_2 \}}.    (2.20)

As long as there is no collision for at least one projection, the frequency pairs (c n_1 cos θ − c n_2 sin θ, c n_1 sin θ + c n_2 cos θ) can be found by applying the basic algorithm of Section 2.2.1 to ĝ. Due to machine precision, the integer c should not be too large, or the bandwidth gets too large, resulting in the failure of the algorithm.
If four frequency pairs lie at the vertices of a rectangle aligned with the coordinate axes before the rotation, then they are not aligned after a rotation with 0 < θ < π/2. Thus we can be assured of finding all frequencies whether they are in the worst case or not. The pseudocode of the 2D tilting method is shown in Algorithm 2. In the sampling lines, each frequency pair (n_1, n_2) is in effect rotated by the matrix [cos −sin; sin cos] and scaled to make the rotated components integers. Thus we first find the frequency pairs in the form ñ = (n_1 cos − n_2 sin, n_1 sin + n_2 cos) and, after finding all of them, rotate them back into the original pairs with the matrix [cos sin; −sin cos] in the final line of the algorithm.

Algorithm 2 2D Sparse Fourier Algorithm with Tilting Method

procedure 2DTiltedPhaseshift
  Input: f, C, s, M, D, ε, base, height, hypo
  Output: R
  R ← ∅
  t ← 1
  cos ← base, sin ← height
  while |R| < s do
    s* ← s − |R|
    k̃ ← (t mod 2) + 1
    p ← t-th prime number ≥ Cs*
    Q(x̃) = Σ_{(ñ, c_ñ) ∈ R} c_ñ e^{2πi ñ·x̃}
    m′ ← k̃ mod 2, m″ ← (k̃ + 1) mod 2
    for ℓ = 0 → p − 1 do
      f^{k̃}_p[ℓ] = f( (ℓ/p) m′ cos + (ℓ/p) m″ sin, −(ℓ/p) m′ sin + (ℓ/p) m″ cos ) − Q((ℓ/p) e_k̃)
      for k = 1 → 2 do
        n′ ← k mod 2, n″ ← (k + 1) mod 2
        f^{k̃,k}_{p,ε}[ℓ] = f( ((ℓ/p) m′ + εn′) cos + ((ℓ/p) m″ + εn″) sin, −((ℓ/p) m′ + εn′) sin + ((ℓ/p) m″ + εn″) cos ) − Q((ℓ/p) e_k̃ + ε e_k)
      end for
    end for
    F(f^{k̃}_p) = FFT(f^{k̃}_p), F^sort(f^{k̃}_p) = SORT(F(f^{k̃}_p))
    F(f^{k̃,k}_{p,ε}) = FFT(f^{k̃,k}_{p,ε}) for k = 1, 2
    for h = 0 → s* − 1 do
      ℓ ← 0
      for k = 1 → 2 do
        if | |F^sort(f^{k̃,k}_{p,ε})[h]| / |F^sort(f^{k̃}_p)[h]| − 1 | < τ then
          ñ_k = (1/2πε) Arg( F^sort(f^{k̃,k}_{p,ε})[h] / F^sort(f^{k̃}_p)[h] )
          ℓ ← ℓ + 1
        end if
      end for
      if ℓ == 2 then
        c_ñ = (1/p) F^sort(f^{k̃}_p)[h]
        R ← R ∪ (ñ, c_ñ)
      end if
    end for
    prune small coefficients from R
    t ← t + 1
  end while
  cos ← base/hypo, sin ← height/hypo
  rotate each ñ back to n using the matrix [cos sin; −sin cos] and restore it in R
end procedure

This tilting method is a straightforward way to resolve the worst-case problem. First, we recover as many frequencies as possible with the basic parallel projection method. If we cannot get any frequency pairs over several projections switching among the axes, then, assuming that the worst case has happened, we apply the tilting method with an angle chosen so that all remaining frequency pairs are found. We only introduce the tilting method in the two-dimensional case, but the idea of rotating the axes can be extended to the general D-dimensional case with some effort. On the other hand, we may notice that the probability of this worst case is very low, especially when the number of dimensions D is large. Its details are shown in Section 2.3.
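The following Python sketch illustrates, on the rectangle configuration of Figure 2.3, why the tilted (and integer-scaled) projection removes the collisions. The Pythagorean triple and the bandwidth are hypothetical choices satisfying the conditions of Theorem 1 in Section 2.4; this is a check of the idea, not part of the algorithm's code.

# Worst-case rectangle: every pair collides under the plain horizontal/vertical projection.
M = 16
freqs = [(-3, 2), (5, 2), (-3, -4), (5, -4)]            # vertices of an axis-aligned rectangle

def has_collision(points, axis):
    coords = [pt[axis] for pt in points]
    return len(set(coords)) < len(points)

print(has_collision(freqs, 0), has_collision(freqs, 1))      # True True

# Integer rotation from (2.20): (n1, n2) -> (b*n1 - a*n2, a*n1 + b*n2),
# with sin(theta) = a/c, cos(theta) = b/c, (a, b, c) = (9, 40, 41), and b > 2M.
a, b = 9, 40
rotated = [(b * n1 - a * n2, a * n1 + b * n2) for (n1, n2) in freqs]

print(has_collision(rotated, 0), has_collision(rotated, 1))  # False False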
2.3 Partial Unwrapping Method for High Dimensional Algorithm

In this section we present the partial unwrapping method for a sublinear sparse Fourier algorithm for very high-dimensional data. As we have already mentioned, while full unwrapping converts a multi-dimensional problem into a one-dimensional problem, its viability is severely limited when the dimension is large or when the bandwidth is already high, because of the increased bandwidth. Partial unwrapping is introduced here to overcome this and other problems. In Section 2.3.1 we give a four-dimensional version of the algorithm using the partial unwrapping method as well as its generalization to D dimensions. In Section 2.3.2, the probability of the worst case in D dimensions is analyzed.

2.3.1 Partial Unwrapping Method

To see the benefit of partial unwrapping, we need to examine the main difficulties we may encounter in developing sublinear sparse Fourier algorithms. For this, let us consider a hypothetical case of a sparse FFT where we have s = 100 frequencies in a 20-dimensional Fourier series distributed in [−10, 10)^20. When we perform the parallel projection method, because the bandwidth is small, there will be a lot of collisions after the projections. It is often impossible to separate any frequency after each projection, and the task could thus not be completed. This, ironically, is a curse of small bandwidth for sparse Fourier algorithms. On the other hand, if we do the full unwrapping we would increase the bandwidth to M′ = 20^20, which is impossible to handle within reasonable accuracy because M′ is too large. A partial unwrapping, however, reaps the benefits of both worlds. Let us break the 20 dimensions down into 5 lower 4-dimensional subspaces, namely we write

[−10, 10)^20 = ( [−10, 10)^4 )^5.

In each subspace we perform the full unwrapping, which yields bandwidth M′ = 20^4 = 160,000 in the subspace. This bandwidth M′ is large enough compared with s, so when the projection method is used there is a very good probability that collisions will occur only for a small percentage of the frequencies, allowing them to be reconstructed. On the other hand, M′ is not so large that the phase-shift method incurs significant error. One of the greatest advantages of partial unwrapping is to turn the curse of dimensionality into a blessing of dimensionality. Note that in the above example, the 4 dimensions of any subspace do not have to follow the natural order. By randomizing (if necessary) the order of the dimensions, it may achieve the same goal as the tilting method. Also note that the dimension of each subspace need not be uniform. For example, we can break the above 20-dimensional example down into four 3-dimensional subspaces and two 4-dimensional subspaces, i.e.,

[−10, 10)^20 = ( [−10, 10)^3 )^4 × ( [−10, 10)^4 )^2.

This leads to further flexibility.

2.3.1.1 Example of 4-D Case

Before introducing the generalized partial unwrapping algorithm for dimension D, let us think about the simple case of 4 dimensions. We assume that the s frequency vectors are in 4-dimensional space (D = 4). Then a function f constructed from these frequency vectors is

f(x) = \sum_{n \in S} c_n e^{2\pi i n \cdot x},   c_n ∈ C,   n ∈ [−M/2, M/2)^4 ∩ Z^4.    (2.21)

Since 4 = 2 × 2, the frequency pairs in each of the two two-dimensional subspaces are unwrapped onto one-dimensional spaces. Here, 4 dimensions are projected onto 2 dimensions as follows:

g(x_1, x_2) := f(x_1, Mx_1, x_2, Mx_2) = \sum_{n \in S} c_n e^{2\pi i \{ (n_1 + M n_2) x_1 + (n_3 + M n_4) x_2 \}} = \sum_{n \in S} c_n e^{2\pi i (ñ_1 x_1 + ñ_2 x_2)},    (2.22)

where ñ_1 = n_1 + M n_2 and ñ_2 = n_3 + M n_4.
Note that this projection is one-to-one, which guarantees the inverse transformation. Now we can apply the basic projection method of Section 2.2.1 to this function g, redefined as a 2-dimensional one. To make this algorithm work, ñ = (ñ_1, ñ_2) should not collide with any other frequency pair after the projection onto either the horizontal or the vertical axis. If it does, we can consider using the tilting method. After finding all the frequencies in the form (ñ_1, ñ_2), they can be transformed back to (n_1, n_2, n_3, n_4).

2.3.1.2 Generalization

We introduce the final version of the multidimensional algorithm in this section. Its pseudocode and detailed explanation are given in Algorithm 3 and Section 2.5.1, respectively. We start with a D-dimensional function f,

f(x) = \sum_{n \in S} c_n e^{2\pi i n \cdot x},   c_n ∈ C,   n ∈ [−M/2, M/2)^D ∩ Z^D.    (2.23)

Let us assume that D can be factored into d_1 and d_2; the case of D being a prime number is mentioned at the end of this section. The domain of frequencies can be considered as ([−M/2, M/2) ∩ Z)^D = ( ([−M/2, M/2) ∩ Z)^{d_1} )^{d_2}, and each block ([−M/2, M/2) ∩ Z)^{d_1} is reduced to one dimension, just as the 2-dimensional blocks were in the 4-dimensional case. Each block of d_1 elements of a frequency vector n = (n_1, n_2, ..., n_D) is unwrapped as

(n_{d_1 q+1}, n_{d_1 q+2}, n_{d_1 q+3}, ..., n_{d_1 q+d_1}) → n_{d_1 q+1} + M n_{d_1 q+2} + M² n_{d_1 q+3} + ... + M^{d_1−1} n_{d_1 q+d_1} =: ñ_{q+1},    (2.24)

with q = 0, 1, 2, ..., d_2 − 1, increasing the respective bandwidth from M to M^{d_1} while preserving injectivity. We rewrite this transformation in terms of the phase. With x = (x_1, x_2, ..., x_D), we substitute

x_ℓ = M^{R(ℓ−1, d_1)} x̃_{Q(ℓ−1, d_1)+1}    (2.25)

for all ℓ = 1, 2, ..., D, where R(ℓ−1, d_1) and Q(ℓ−1, d_1) are the remainder and the quotient from dividing ℓ − 1 by d_1, respectively, and x̃ = (x̃_1, x̃_2, ..., x̃_{d_2}) is the phase vector in d_2 dimensions after the projection. Define a function g on d_2 dimensions as

g(x̃) := f(..., M^{R(ℓ−1, d_1)} x̃_{Q(ℓ−1, d_1)+1}, ...) = \sum_{n \in S} c_n e^{2\pi i \sum_{q=0}^{d_2−1} ( \sum_{r=0}^{d_1−1} n_{d_1 q + r + 1} M^r ) x̃_{q+1}},    (2.26)

where M^{R(ℓ−1, d_1)} x̃_{Q(ℓ−1, d_1)+1} is the ℓth element of the input of f. If we project the frequency vectors of g onto the k̃th axis, then the kth element of a frequency vector ñ can be found by the computation

g^{k̃,k}_{p,ε} = ( g(0 e_k̃ + ε e_k), g((1/p) e_k̃ + ε e_k), ..., g(((p−1)/p) e_k̃ + ε e_k) ),
ñ_k = (1/2πε) Arg( F(g^{k̃,k}_{p,ε})[h] / F(g^{k̃}_p)[h] ),    c_ñ = (1/p) F(g^{k̃}_p)[h],    (2.27)

where e_k̃ is the k̃th unit vector of length d_2, i.e., all elements are zero except the k̃th one, which is 1. The last two equalities in (2.27) hold as long as ñ_k̃ is the only value congruent to h modulo p among all the k̃th elements of the frequency vectors, and ñ does not collide with any other frequency vector under the projection onto the k̃th axis. The test for checking whether these conditions are satisfied is

|F(g^{k̃,k}_{p,ε})[h]| / |F(g^{k̃}_p)[h]| = 1    (2.28)

for all 1 ≤ k ≤ d_2. The projections onto the k̃th axis, k̃ = 1, ..., d_2, take turns until we recover all frequency vectors and their coefficients. After that we wrap the unwrapped frequency vectors back up from d_2 to D dimensions. Since the unwrapping transformation is one-to-one, this inverse transformation is well-defined.
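The index bookkeeping of (2.24) and its inverse can be summarized by the small Python sketch below (hypothetical helper names, not taken from the thesis code). It unwraps a D-dimensional frequency vector block-by-block into a d_2-dimensional one and wraps it back exactly.

def partial_unwrap(n, M, d1):
    # unwrap blocks of d1 entries of n (each entry in [-M/2, M/2)) as in (2.24)
    assert len(n) % d1 == 0
    return [sum(n[q * d1 + r] * M**r for r in range(d1))
            for q in range(len(n) // d1)]

def partial_wrap(n_tilde, M, d1):
    # invert (2.24): peel off base-M "digits", each shifted into [-M/2, M/2)
    n = []
    for v in n_tilde:
        for _ in range(d1):
            digit = (v + M // 2) % M - M // 2
            n.append(digit)
            v = (v - digit) // M
    return n

M, d1 = 16, 2
n = [-8, 7, 3, -5, 0, 6]                    # D = 6, unwrapped to d2 = 3 dimensions
nt = partial_unwrap(n, M, d1)               # bandwidth grows from M to M**d1 = 256
assert partial_wrap(nt, M, d1) == n
print(nt)                                   # [104, -77, 96]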
Since the unwrapping transformation is one-to-one, this inverse transformation is well-defined. So far we have assumed that the dimension D can be factored into two integers, $d_1$ and $d_2$. For the case where D is a prime number, or where both $d_1$ and $d_2$ are so large that the unwrapped data has a bandwidth for which the required shift $\epsilon$ falls below machine precision, a divide-and-conquer strategy can be applied. In that case we can apply the partial unwrapping method in such a way that each unwrapped component has a different bandwidth. If D is 3, for example, then we can unwrap the first two components of the frequency vector onto one dimension and leave the last one in its own dimension. The unwrapped data is then two-dimensional, the bandwidth of the first component is bounded by $M^2$, and that of the second component is bounded by $M$. In this case we can choose a shift $\epsilon < 1/M^2$, where $M^2$ is the largest bandwidth. We can extend this to the general case, so the partial unwrapping method offers a variety of choices balancing the bandwidth against machine precision.

2.3.2 Probability of Worst Case Scenario

In this section, we give an upper bound on the probability of the worst case, assuming that we randomly choose a partial unwrapping method. As addressed in Section 2.3.1, there is flexibility in choosing a particular partial unwrapping method. Fixing one such method and imposing a stronger condition that prevents its failure, we can bound from above the probability of the worst case, in which there is a collision in every parallel projection. For simplicity of explanation, consider a two-dimensional problem. Choose the first frequency vector $\mathbf{n}_1 = (n_{11}, n_{12})$ in the two-dimensional plane. If the second frequency vector $\mathbf{n}_2 = (n_{21}, n_{22})$ is not on the vertical line through $(n_{11}, 0)$ or the horizontal line through $(0, n_{12})$, then the projection method works. If the third frequency vector avoids four lines, namely the two lines just mentioned together with the vertical line through $(n_{21}, 0)$ and the horizontal line through $(0, n_{22})$, then again the projection method works. We keep choosing the next frequency vector in this way, excluding the lines containing the previous frequencies. Letting A denote this event, the probability that the projection method fails is bounded above by $1 - \mathbb{P}(A)$. More generally, assume that we randomly choose a partial unwrapping and, without loss of generality, that the total dimension is $D = d_1 + d_2 + \cdots + d_r$, where r is the number of subspaces and $d_1, d_2, \cdots, d_r$ are the dimensions of the subspaces. That is, the partially unwrapped frequency vectors lie in $r < D$ dimensions with bandwidths $M^{d_1}, M^{d_2}, \cdots, M^{d_r}$, respectively, each an integer strictly larger than 1. Then the failure probability of the projection method is bounded above by
$$1 - \prod_{j=1}^{s} \mathbb{P}(A_j) \;\le\; 1 - \prod_{j=1}^{s} \frac{M^D - (j-1)\left(M^{d_1} + M^{d_2} + \cdots + M^{d_r}\right)}{M^D} \;=\; 1 - \frac{1}{\tau^s}\,\frac{\tau!}{(\tau - s)!} \;\sim\; 1 - \frac{1}{e^s}\left(1 - \frac{s}{\tau}\right)^{-\tau + s - \frac{1}{2}} \;\;\text{(by Stirling's formula)}, \qquad \tau := \frac{M^D}{M^{d_1} + M^{d_2} + \cdots + M^{d_r}}, \tag{2.29}$$
where $A_j$ is the event that the jth frequency is chosen off the lines passing through the previously chosen frequency vectors parallel to the coordinate axes. Noting that $M^D = M^{d_1} \times M^{d_2} \times \cdots \times M^{d_r}$, that the sparsity s is relatively small compared to $M^D$, and that $\tau$ is large, we see that the upper bound above approaches 0 as D or M grows to infinity.
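For the running example of Section 2.3.1 (M = 20, D = 20 split into five 4-dimensional subspaces, s = 100), the first upper bound in (2.29) can be evaluated exactly with rational arithmetic. The short sketch below (an illustration only) shows that it is on the order of $10^{-17}$.

```python
# A quick numerical check of the first upper bound in (2.29) for the running example:
# M = 20, s = 100, D = 20 split into r = 5 subspaces of dimension 4.
from fractions import Fraction

M, s = 20, 100
d = [4, 4, 4, 4, 4]                        # subspace dimensions, D = sum(d) = 20
MD = Fraction(M) ** sum(d)                 # M^D
excl = sum(Fraction(M) ** di for di in d)  # M^{d_1} + ... + M^{d_r}

prod = Fraction(1)
for j in range(1, s + 1):
    prod *= (MD - (j - 1) * excl) / MD

print(float(1 - prod))                     # roughly 3.8e-17: the worst case is vanishingly rare
```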
32 Algorithm 3 Multidimensional Sparse Fourier Algorithm Pseudo Code 1: procedure MultiPhaseshift 2: Input:f, C, s, M, D, d1, d2,  3: Output:R R ← ∅ 4: t ← 1 5: while |R| < s do 6: 7: 8: 9: 10: 11: 12: for k = 1 → d2 do p ed1((cid:101)k−1)+j + (cid:80)d1−1 j=0 M jed1(k−1)+j) − Q( (cid:96) p ed1((cid:101)k−1)+j) − Q( (cid:96) p e(cid:101)k) p e(cid:101)k + ek) p, ) p) end for j=0 M j (cid:96) j=0 M j (cid:96) s∗ ← s − |R| (cid:101)k ← (t mod d2) + 1 p ← t th prime number ≥ Cs∗ Q((cid:101)x) =(cid:80) ((cid:101)n,c(cid:101)n)∈R c(cid:101)ne2πi(cid:101)n·(cid:101)x p, [(cid:96)] = f ((cid:80)d1−1 for (cid:96) = 0 → p − 1 do f(cid:101)k,k p [(cid:96)] = f ((cid:80)d1−1 f(cid:101)k p, ) = FFT(f(cid:101)k,k F(f(cid:101)k,k F(f(cid:101)k p) = FFT(f(cid:101)k p) = SORT(F(f(cid:101)k F sort(f(cid:101)k end for for h = 0 → s∗ − 1 do (cid:96) ← 0 (cid:12)(cid:12)(cid:12) |F sort(f(cid:101)k for k = 1 → d2 do p)[h]| |F sort(f(cid:101)k,k p, )[h]| (cid:96) ← (cid:96) + 1 (cid:101)nk = 1 pF sort(f(cid:101)k c(cid:101)n = 1 R ← R ∪ ((cid:101)n, c(cid:101)n) (cid:16)F sort(f(cid:101)k,k F sort(f(cid:101)k p)[h] end for if (cid:96) == d2 then 2π Arg end if if − 1 p)) (cid:12)(cid:12)(cid:12) <  then (cid:17) p, )[h] p)[h] 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: end if 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: end procedure end for prune small coefficients from R t ← t + 1 end while inverse-transform each (cid:101)n in d2-D to n D-D and restore it in R 2.4 Analysis In this section, we analyze the performance of our algorithms suggested. We will prove that the tilting method works well in two dimensions but explain that it is hard to extend the 33 idea to the general high dimensional setting. However, it was shown in Section 2.3.2 that the probability of the worst-case scenario is extremely small. Furthermore, the average-case runtime and sampling complexity is shown in this section under the assumption that there are no frequency vectors forming a worst-case scenario. The following theorem shows that the tilting method in 2D recovers all frequency pairs even though they form a worst-case scenario. Theorem 1. Let n = (n1, n2) ∈ S ⊂ (cid:104)− M (cid:17)2 ∩ Z2. If tan θ = a b such that c > b > a are Pythagorean triples where b > 2M and a are relative primes, then all (cn1 cos θ − 2 , M 2 cn2 sin θ, cn1 sin θ + cn2 cos θ) rotated by θ does not collide with any other pair through the parallel projection. Thus, all rotated pairs can be identified by the parallel projection method. Proof. Suppose that any two frequency pairs n = (n1, n2) and n(cid:48) = (n(cid:48) 1, n(cid:48) 2) cannot be recovered through the tilting method. This implies that the slope of the line crossing n and n(cid:48) is perpendicular to tan θ, and thus those two pairs collide at least once if they are n2−n(cid:48) n1−n(cid:48) 1 2 < M , and a. However, this is a contradiction since −M < n1 − n(cid:48) projected onto each principal axis rotated by the angle θ. This results in the fact that 1, n2 − n(cid:48) is either a b or − b 2 a and b > 2M are relative primes. Theorem 1 implies that all frequency pairs can be distinguished by the tilting method with only one proper angle θ. It is natural to think about extending this idea to the high- dimensional setting. However, it is not as easy to find a proper slope as in two dimensions using the Pythagorean triples. 
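Theorem 1 can also be checked by brute force for a small bandwidth. The sketch below (an illustration only; M = 8 and the Pythagorean triple (9, 40, 41) are assumptions chosen to satisfy the hypotheses) verifies that after the rotation, scaled by the hypotenuse c so that all coordinates stay integral, no two distinct frequency pairs share either coordinate, so the parallel projections separate every pair.

```python
# Brute-force sanity check of Theorem 1: rotating by theta with tan(theta) = a/b,
# where (a, b, c) is a Pythagorean triple, gcd(a, b) = 1 and b > 2M, leaves no two
# distinct frequency pairs sharing either rotated coordinate.
from itertools import product, combinations

M = 8
a, b = 9, 40                      # (9, 40, 41): Pythagorean triple, gcd(9, 40) = 1, 40 > 2M

grid = range(-M // 2, M // 2)
# work with c * (rotated coordinates), which are integers: (b*n1 - a*n2, a*n1 + b*n2)
rotated = {n: (b * n[0] - a * n[1], a * n[0] + b * n[1]) for n in product(grid, repeat=2)}

for n, m in combinations(rotated, 2):
    (u1, v1), (u2, v2) = rotated[n], rotated[m]
    assert u1 != u2 and v1 != v2  # no collision in either parallel projection
print("no collisions among", len(rotated), "rotated frequency pairs")
```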
Moreover, even though we consider choosing a finite set of random angles guaranteeing that it includes the proper one, each line crossing between two arbitrary vectors in general D dimensions has infinitely many lines perpendicular to it so 34 that such a finite set is difficult to find again. It should be noted that there is a variety of the worst-case scenario where the relatively simple tilting method still works. For example, if the frequency vectors are located on the two dimensional subspace, then the tilting method applied on the subspace would recover all the frequencies. There still exist trickier worst-case scenarios, however, such that the frequencies form a D-dimensional cube or more complicated structures. It is still worth exploring further about the tilting method. On the other hand, we also have shown in Section 2.3.2 that the worst-case scenario rarely happens in extremely high dimensions. In order to obtain the average-case runtime and sampling complexity of the parallel projection method under the assumption that there is no worst-case scenario, we utilize the probability recurrence relation as in [1]. We remind readers that each entry of the frequency vectors should be distinguished modulo M due to parallel projection as well as be distinguished modulo p. Since C = 5 determining p is shown in [1] to ensure that 90% of frequencies are isolated modulo p on average, at least 90% of frequencies are isolated modulo M on average if M ≥ p. Taking the union bound of failure probabilities of isolation modulo p and M yields at least 80% of frequencies are isolated both modulo M and p at the same time. Algorithm 3 with d1 = 1 and d2 = D implements the direct parallel projection method, and it takes a(s) = Θ(Ds log s), the runtime on input of size s and m(s) = s/5, the average size of the subproblem. Then, the straightforward application of Theorem 2 in [1] gives the following theorem. Theorem 2. Assume M ≥ 5s and there is no worst-case scenario. Let T (s) denote the runtime of Algorithm 3 on a random signal setting with d1 = 1 and d2 = D. Then E[T (s)] = Θ(Ds log s) and P[T (s) > Θ(Ds log s) + tDs log s] ≤ 5−t. 35 In a similar manner, Algorithm 3 takes a(s) = Θ(Ds), the number of samples on input of size s and m(s) = s/5, the average size of the subproblem. Thus, Theorem 2 in [1] again gives, Theorem 3. Assume M ≥ 5s and there is no worst-case scenario. Let S(s) denote the number of samples used in Algorithm 3 on a random signal setting with d1 = 1 and d2 = D. Then E[S(s)] = Θ(Ds) and P[S(s) > Θ(Ds) + tDs] ≤ 5−t. 2.5 Empirical Result The partial unwrapping method is implemented in the C language. The pseudocode of this algorithm is shown in Algorithm 3. It is explained in detail in Section 2.5.1. In our experiment, dimension D is set to 100 and 1000, d1 is 5 and d2 is 20 and 200, accordingly. Frequency bandwidth M in each dimension is 20 and sparsity s varies as 1, 2, 22,··· , 210. The value of  for shifting is set to 1/2M d1 and the constant number C determining the prime number p is set to 5. We randomly choose s frequency vectors n ∈ [−M/2, M/2)D ∩ ZD and corresponding coefficients cn = e2πiθn ∈ C from randomly chosen angles θn ∈ [0, 1) so that the magnitude of each cn is 1. For each D and s we have 100 trials. We get the result by averaging l2 errors, the number of samples used and CPU TICKS out of 100 trials. 
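The random test signals used in these experiments are straightforward to generate. A minimal Python sketch of the construction is given below (the experiments themselves were run from the C implementation; the seed and the tiny parameter values are only for illustration).

```python
# Random test signals as described above: s frequency vectors drawn from
# [-M/2, M/2)^D ∩ Z^D and unit-modulus coefficients c_n = e^{2*pi*i*theta_n}.
import numpy as np

rng = np.random.default_rng(0)

def random_sparse_signal(s, M, D):
    freqs = rng.integers(-M // 2, M // 2, size=(s, D))   # frequency vectors n
    coeffs = np.exp(2j * np.pi * rng.random(s))          # coefficients on the unit circle
    def f(x):                                            # x is a point in [0, 1)^D
        return np.sum(coeffs * np.exp(2j * np.pi * (freqs @ x)))
    return freqs, coeffs, f

freqs, coeffs, f = random_sparse_signal(s=4, M=20, D=100)
print(f(np.zeros(100)), coeffs.sum())                    # f(0) equals the sum of coefficients
```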
Since it is difficult to implement high dimensional FFT and there is no practical high dimensional sparse Fourier transform with wide range of D and s at the same time it is hard to compare the result of ours with others, as so far no one else was able to do FFT on 36 this large data set. Thus we cannot help but show ours only. From Figure 2.4 we can see that the average l2 errors are below 2−52. Those errors are from all differences of frequency vectors and coefficients of the original and recovered values. Since all frequency components are integers and thus the least difference is 1, we can conclude that our algorithm recover the frequency vectors perfectly. Those errors are only from the coefficients. In Figure 2.5 the average sampling complexity is shown. We can see that the logarithm of the number of samples is almost proportional to that of sparsity. Note that the traditional FFT would show the same sampling complexity even though sparsity s varies since it only depends on the bandwidth M and dimension D. In Figure 2.6 the average CPU TICKS are shown. We can see that the logarithm of CPU TICKS is also almost proportional to that of sparsity. Note that the traditional FFT might show the same CPU TICKS even though sparsity s varies since it also depends on the bandwidth M and dimension D only. 2.5.1 Algorithm In this section, the explanation of Algorithm 3 is given. In [1] several versions of 1D al- gorithms are shown. Among them, non-adaptive and adaptive algorithms are introduced where the input function f is not modified throughout the whole iteration, and is modified by subtracting the function constructed from the data in registry R, respectively. In our multidimensional algorithm, however, the adaptive version is mandatory since excluding the contribution of the currently recovered data is the key of our algorithm to avoid the collision of frequencies through projections, whose simple pictorial description is given in Figure 2.2. In Algorithm 3, the function Q is the one constructed from the data in the registry R. Our algorithm begins with entering inputs, a function f , a constant number C deter- mining p, a sparsity s, a bandwidth M of each dimension, a dimension D, factors d1 and 37 d2 of D and a shifting number  ≤ 1/M d1. For each iteration of the algorithm, the number of frequencies to find is updated as s∗ = s − |R|. It stops when |R| becomes equal to the sparsity s. The prime number p is determined depending on this new s∗ as p ≥ Cs∗ and is chosen as the next larger prime number. The lines 13 and 14 of Algorithm 3 represent the partial unwrapping, and sampling with and without shifting from the function where the contribution of former data is excluded. After applying the FFT on each sequence, sorting them according to the magnitude of F(f(cid:101)k p), we check the ratio between the FFT’s of the unshifted and shifted sequences to determined whether there is a collision, either from mod- ulo p or a parallel projection. If all tests are passed, then we find each frequency component and corresponding coefficient for the data that passed and store them in R. After several iterations, we find all the data and the final wrapping process gives the original frequency vectors in D dimensions. 2.5.2 Accuracy We assume that there is no noise on the data that we want to recover. Figure 2.4 shows that we can find frequencies perfectly and the l2 error from coefficients are significantly small. 
This error is what we average out over 100 trials for each D = 100, 1000 and s = 1, 21, 22,··· , 210 when M is fixed to 20. The horizontal axis represents the logarithm with base 2 of s and the vertical axis represents the logarithm with base 2 of the l2 error. It is increasing as the sparsity s is increasing since the number of nonzero coefficients increases. The red graph in the Figure 2.4 shows the error when the number of dimensions is 100 and the blue one shows the error when the number of dimensions is 1000. Thus, we see that the errors are not substantially impacted by the dimensions. 38 Figure 2.4: Average l2 error 2.5.3 Sampling Complexity Figure 2.5 shows the sampling complexity of our algorithm averaged out from 100 tests for each dimension and sparsity. The horizontal axis means the logarithm with base 2 of s and the vertical axis represents the logarithm with base 2 of the total number of samples from the randomly constructed function which are used to find all frequencies and coefficients. The red graph in the Figure 2.5 shows the sampling complexity when the number of dimensions is 100 and the blue one shows the one when the number of dimensions is 1000. Both graphs increase as s increases. When D is large, we see that it requires more samples since there are more frequency components to find. From the graphs, we see that the scaling seems to 39 Figure 2.5: Average sampling complexity be proportional to D. 2.5.4 Runtime Complexity In Figure 2.6, we plot the runtime complexity of the main part of our algorithm averaged over 100 tests for each dimension and sparsity. “Main part” means that we have excluded the time for constructing a function consisting of frequencies and coefficients and the time associated with getting samples from it. The horizontal axis is the logarithm, base 2, of s and the vertical axis is the logarithm, base 2, of CPU TICKS. The red curve shows the runtime when we set the number of dimensions to 100 and the blue one shows the same 40 Figure 2.6: Average CPU TICKS thing when the number of dimensions to 1000. Both plots increase as s increases. When D is larger, the plots show that it takes more time to run the algorithm. From the graphs we see that the runtime looks proportional to D. Unfortunately, the sampling process of getting the samples from continuous functions dominates the runtime of the whole algorithm instead of the main algorithm. To show the runtime of our main algorithm, however, we showed CPU TICKS without sampling process. Reducing the time for sampling is still a problem. In [10] a one-dimensional fully discrete Fourier transform is introduced that we expect to use to reduce it. Exploring how to use this will be one part of our future work. 41 Chapter 3 Multiscale Sparse Fourier Transforms 3.1 Preliminaries 3.1.1 Notation and Review In this section, we introduce the notation used throughout this chapter. Let s, D and M be natural numbers, s (cid:28) M D, and D := [0, 1)D. We consider a function f : D → C which is s-sparse in the D-dimensional Fourier domain as follows (cid:88) n∈S f (x) = cne2πin·x where each n ∈ [−M/2, M/2)D ∩ ZD and cn ∈ C. We note that f can be regarded as a periodic function defined on RD. The aim of sparse Fourier algorithms is to rapidly reconstruct a function f using a small number of its samples. In Chapter 2, the methods were introduced using several different transformations and parallel projections along coordinate axes of frequencies in order to exploit the one-dimensional sparse Fourier algorithm from [1]. 
These transformations such as partial unwrapping and tilting methods are introduced in order to change the locations of energetic frequencies when the current energetic frequencies 42 are hard or impossible to find through the parallel projections directly. Through those manipulations, each frequency vector n is recovered in an entry-wise fashion. In this way, the linear dependence of runtime and sampling complexities on the dimension D could be shown empirically, which is a great improvement when compared to the D-dimensional FFT with the exponential dependence on D. The transformations and projections occur in the physical domain, which provides the separation of the frequency vectors in the Fourier domain. That is, each n is transformed to n(cid:48) and then projected onto several lines, and these can be done by manipulating the sampling points in the physical domain D. Let u : D(cid:48) → D represent those transformations where D(cid:48) is D or less dimensional space and is determined by each transformation. We assume that D(cid:48) has D(cid:48) dimensions. A new function g : D(cid:48) → C is defined as a composition of f and u, i.e., g(x) := f (u(x)). We note that g is still s-sparse in the D(cid:48)-dimensional Fourier domain, [−M(cid:48)/2, M(cid:48)/2)D(cid:48)∩ZD(cid:48) , whose bandwidth M(cid:48) depends on each transformation. For example, consider a 4-dimensional function f with the Fourier domain, [−M/2, M/2)4∩Z4. If a partial unwrapping is applied to f which unwraps each n = (n1, n2, n3, n4) to n(cid:48) = (n(cid:48) 4) then it implies that u : (x1, x2) → (x1, M x1, x2, M x2) and g has the Fourier domain, [−M 2/2, M 2/2)2∩Z2 where accordingly M(cid:48) = M 2 and D(cid:48) = 2. This is one example and there are variations of partial 3 +M n(cid:48) 1 +M n(cid:48) 2, n(cid:48) unwrapping methods and tilting methods which can be found in Chapter 2. Using samples of g(or f ), the transformed frequency vectors n(cid:48) and corresponding Fourier coefficients cn(cid:48)(=cn) are found through the parallel projection method, and n(cid:48) are transformed back to n. Now, we remind the readers using comprehensive notations how n(cid:48) = (n(cid:48) 1, n(cid:48) 2,··· , n(cid:48) D(cid:48)) can be 43 recovered element-wise using the parallel projection method and the ideas from the one- dimensional sparse Fourier algorithm. Let p be a prime number greater than a constant multiple of s, i.e., p > Cs for some constant C, n be a fixed frequency vector in S, k be fixed among the set {1, 2,··· , D(cid:48)} and ek be a vector with all zero entries but 1 at the index k. Furthermore,  is defined as a positive number ≤ 1/M(cid:48). To recover the kth element of each n(cid:48), we use two sets of p-length equispaced samples g(cid:101)k and g(cid:101)k,k p and g(cid:101)k,k (cid:18) (cid:96) e(cid:101)k p, as follows, (cid:18) (cid:96) p, [(cid:96)] := g p[(cid:96)] := g g(cid:101)k (cid:19) (cid:19) e(cid:101)k p + ek p where (cid:96) = 0, 1,··· , p − 1 and (cid:101)k ∈ {1, 2,··· , D(cid:48)} is the index of coordinate axis where a , different from the (cid:101)kth elements of any other energetic particular (cid:101)n has (cid:101)kth element, (cid:101)n(cid:101)k projection”. At the same time, if there is no collision modulo p, i.e., (cid:101)n(cid:101)k In this case, we refer to the above phenomenon as “no collision from frequency vectors. 
has the unique remainder modulo p from others then the discrete Fourier transform of each sample set is F (cid:17) F(cid:16) g(cid:101)k (cid:19) (cid:18) g(cid:101)k,k p, p (cid:88) (cid:88) ≡h mod p ≡h mod p [h] = p [h] = p n(cid:48) k n(cid:48) k cn(cid:48) = pc(cid:101)n, cn(cid:48)e2πin(cid:48) and k = pc(cid:101)ne2πi(cid:101)nk, (3.1) respectively. The equations above in (3.1) give a unique entry for the kth element assuming there does not exist a collision modulo p of the vetor projected onto the kth axis. Hence, in the 44 above equations, when the second equalities hold, we can recover c(cid:101)n and(cid:101)nk as follows, (cid:19) (cid:18) F g(cid:101)k,k (cid:17) F(cid:16) g(cid:101)k p, p [h] [h]  (cid:101)nk = 1 2π Arg F(cid:16) (cid:17) g(cid:101)k p p and c(cid:101)n = [h] . (3.2) The right choice of the branch and the shift size  ≤ 1 (cid:101)nk. Algorithmically, the two kinds of collisions are guaranteed not to happen using the M(cid:48) make it possible to find the correct following test: (3.3) (cid:18) (cid:19) g(cid:101)k,k (cid:17) F(cid:16) g(cid:101)k p, p F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) [h] [h] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = 1, which is inspired by the test used in [1]. Practically, we put some threshold τ > 0 so that if the difference between the left and right-hand sides in (3.3) is less than τ , then we conclude that there are no collisions of both kinds. In this case, each (cid:101)nk for k = 1, 2,··· , D(cid:48) can be recovered using a pair of sets g(cid:101)k p, , respectively. Otherwise, we take another p and g(cid:101)k,k prime number for sample length, switch the index of coordinate axis for the projection, update the samples by eliminating the influence from previously found Fourier modes and repeat our procedure as before. Switching the coordinate axis and updating samples reduce the occasions of collisions from projection. In Chapter 2, moreover, it is proved that the probability is very low when D(cid:48) is large that all remaining frequency vectors have collisions from projection onto all coordinate axes, which we call the “worst case scenario”. In the “worst case scenario”, the parallel projections we just used do not work and thus, we need a rotation mapping, u(cid:48), defining another function g(cid:48) = f ◦ u(cid:48). 45 3.1.2 Noise Model In practice, the samples from f (or g) are often corrupted by noise. The high-dimenisonal sparse Fourier algorithm introduced in Section 3.1.1 works well when f is exactly s-sparse and the samples from f are not noisy. It is not robust to noise since in order to find entries of energetic frequency vectors we compute the fraction F [h] which is (cid:19) (cid:18) g(cid:101)k,k p; [h](cid:14)F(cid:16) g(cid:101)k p (cid:17) sensitive to noise. In the remaining sections of Chapter 3, we consider the case of noisy measurements of g as follows r(cid:101)k p[(cid:96)] := g(cid:101)k p[(cid:96)] + z(cid:96) = g (cid:19) (cid:18) (cid:96) p e(cid:101)k + z(cid:96) for (cid:96) = 0, 1,··· , p − 1 where z := (z0, z1,··· , zp−1) is a complex Gaussian random variable with mean 0 and variance σ2I. If we apply DFT to this sample set, we get F(cid:16) r(cid:101)k p (cid:17) [h] = F(cid:16) (cid:17) g(cid:101)k p p−1(cid:88) (cid:96)=0 [h] + −2πih (cid:96) p z(cid:96)e (3.4) Since z(cid:96) are i.i.d Gaussian variables, the expectation and variance of the second term in (3.4) are and respectively. 
Accordingly, E (cid:96)=0 p−1(cid:88) p−1(cid:88) E(cid:104)F(cid:16) r(cid:101)k Var (cid:96)=0 p (cid:17) −2πih (cid:96) p z(cid:96)e −2πih (cid:96) p z(cid:96)e  = 0  = pσ2, (cid:17) = F(cid:16) g(cid:101)k [h] p (cid:105) [h] 46 Var and (cid:105) (cid:17) [h] p = pσ2. (cid:104)F(cid:16) r(cid:101)k For noisy shifted samples set r(cid:101)k,k p; := g(cid:101)k,k p; +(cid:101)z with an i.i.d Gaussian random vector (cid:101)z, we (cid:19) (cid:18) (cid:20) r(cid:101)k,k (cid:18) (cid:20) r(cid:101)k,k g(cid:101)k,k have likewise = pσ2. = F (cid:19) (cid:18) E F (cid:19) p, [h] p, [h] Var F (cid:21) (cid:21) and p, [h] In the case of n(cid:48)(cid:101)k h (mod p), not having collisions both from projection and modulo p, and n(cid:48)(cid:101)k ≡ [h] = pcn(cid:48) + O(σ √ p) (3.5) (cid:17) p F(cid:16) r(cid:101)k (cid:19) (cid:18) r(cid:101)k,k p, (cid:19) (cid:18) r(cid:101)k,k F(cid:16) (cid:17) r(cid:101)k p, p F [h] and F [h] = pcn(cid:48)e2πin(cid:48) k + O(σ √ p) for each k = 1, 2,··· , D(cid:48). As a result, we get [h] = e2πin(cid:48) k + O( σ √ c(cid:48) n ) p (3.6) and note that if there were no noise in samples, we only have the first term on the right side of (3.6) which makes it possible to recover n(cid:48) k by taking its argument and dividing it by 2π as (3.2). With noisy samples, however, it is corrupted with noise which is a multiple of 47 c(cid:48) √ σ n p . Defining ˆnk := 1 2π Arg (cid:19) (cid:18) F r(cid:101)k,k F(cid:16) (cid:17) r(cid:101)k p, p [h] [h]  , (3.7) we want to see how far ˆnk is from n(cid:48) k. For this purpose, we introduce the Lee norm associated with a lattice L in R as (cid:107)z(cid:107)L := miny∈L |z − y| for z ∈ R and the related property that under the Lee norm associated with the lattice 2πZ, (cid:107)Arg(γ + ν) − Arg(γ)(cid:107)2πZ = (3.8) (cid:18) (cid:13)(cid:13)(cid:13)(cid:13)Arg 1 + (cid:19)(cid:13)(cid:13)(cid:13)(cid:13)2πZ ν γ (cid:12)(cid:12)(cid:12)(cid:12) ν γ (cid:12)(cid:12)(cid:12)(cid:12) , ≤ π 2 where |γ| ≥ |ν| with γ, ν ∈ C. By choosing the sample length p large enough depending on the least magnitude nonzero cmin and the noise level σ, (3.8) can be applied to (3.6) as follows, Consequently, (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)2πZ k ≤ O (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)Arg [h] p, (cid:18) (cid:19) F r(cid:101)k,k (cid:17) F(cid:16) r(cid:101)k (cid:13)(cid:13)ˆnk − n(cid:48) p [h] k  − 2πn(cid:48) (cid:18) (cid:13)(cid:13)Z ≤ O (cid:19) . (cid:18) p σ |cmin|√ (cid:19) σ 2π |cmin|√ , p (3.9) which implies that the error of our estimate ˆnk to n(cid:48) Thus, p needs to be chosen carefully depending on √ we can approximate the corresponding coefficient cn(cid:48) with the error of size O(σ/ k is controlled by the size of  |cmin|√ p.  |cmin|. On the other hand, from (3.5), p) as σ σ follows cn(cid:48) = 1 p (cid:17) F(cid:16) r(cid:101)k p [h] + O (cid:19) (cid:18) σ√ p . (3.10) 48 3.2 Multiscale Method In [2], rounding and multiscale methods for the one-dimensional sparse Fourier algorithm for noisy data were introduced. Both methods use the fact that the peaks of the DFT are robust to relatively high noise, i.e., h can be correctly found such that n ≡ h(mod p) for an energetic frequency n. The Rounding method is efficient when σ is relatively small, which approximates such n := bp + h for some integer b up to p/2 error and rounds a multiple of p in order to get the correct b. 
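A tiny worked example of the rounding step just described (with hypothetical numbers; b, h and p play the same roles as in the text): given the residue h of the energetic frequency modulo p and a phase estimate whose error is below p/2, rounding to the nearest integer of the form bp + h returns the exact frequency.

```python
# Rounding idea: the residue h = n mod p is known from the DFT bin, and a noisy
# estimate n_est with error less than p/2 is snapped to the nearest b*p + h.
def round_to_residue(n_est, h, p):
    return h + p * round((n_est - h) / p)

p, n_true = 23, -7121
h = n_true % p                      # residue known from the energetic DFT bin
n_est = n_true + 9.3                # noisy estimate, off by less than p/2
assert round_to_residue(n_est, h, p) == n_true
```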
It was shown that p ≥ max{C1s, C2( |cmin|)2/3} for some σ constants C1 and C2 makes it possible to correctly find n through the rounding method. On the other hand, the multiscale method was introduced for relatively large σ. It prevents p from becoming too large, which happens for large σ in the rounding method. In this section, we focus on extending the multiscale method to recover high dimensional frequencies by gradually fixing each entry estimation with several shifts q. 3.2.1 Description of Frequency Entry Estimation Let n(cid:48) ∈ S(cid:48) and k ∈ {1, 2,··· , D(cid:48)} be fixed. The set S(cid:48) is defined to be the set containing all n(cid:48) transformed from n ∈ S. The target frequency entry n(cid:48) both from projection and modulo p. We start with a coarse estimation n(cid:48) k is assumed not to have collisions k,0 of n(cid:48) k defined by n(cid:48) k,0 := 1 2π0 Arg (cid:18) (cid:19) F r(cid:101)k,k (cid:17) F(cid:16) r(cid:101)k p,0 p [h] [h]  , where 0 ≤ 1/M(cid:48). Then n(cid:48) k,0 ≡ n(cid:48) k( mod p) even though it is not guaranteed that n(cid:48) k,0 = n(cid:48) k. Thus, we need to improve the approximation. With each correction, the solution is improved by l digits where l depends on the parameters that are chosen in the method as well as noise. 49 Each correction term is calculated with a choice of growing q > 1/M(cid:48), i.e., q−1 < q for all q ≥ 1. With the initialization bk,0 := 0n(cid:48) k,0, the correction terms are calculated in the following way, and p,q (cid:18) (cid:19) F r(cid:101)k,k (cid:17) F(cid:16) r(cid:101)k (cid:17)(cid:16) p [h] [h]  (cid:104)− 1 q bk,q − qn(cid:48) k,q−1 mod 2 , 1 2 (cid:17)(cid:17) bk,q := 1 2π Arg (cid:16) n(cid:48) k,q := n(cid:48) k,q−1 + for q ≥ 1, where x (mod [−1/2, 1/2)) is defined to be the value a ∈ [−1/2, 1/2) such that x ≡ a (mod 1). Using this fact, (cid:18) (cid:20) (cid:19)(cid:19) , bk,q ≈ qn(cid:48) k mod −1 2 , 1 2 the error n(cid:48) k − n(cid:48) k,q−1 can be estimated as q(n(cid:48) k − n(cid:48) k,q−1) = qn(cid:48) k − qn(cid:48) ≈ (bk,q − qn(cid:48) k,q−1 k,q−1) (cid:18) mod (cid:20) −1 2 , 1 2 (cid:19)(cid:19) and thus, in a similar manner to (3.9), (cid:18) (cid:19) O σ qcmin √ p = (n(cid:48) k − n(cid:48) k,q−1) − n(cid:48) = n(cid:48) k − k,q−1 + = n(cid:48) k − n(cid:48) k,q. (cid:104)− 1 (cid:104)− 1 2 , 1 2 2 , 1 2 (cid:17)(cid:17) (cid:17)(cid:17)  (3.11) (3.12) (3.13) (cid:16) bk,q − qn(cid:48) (cid:16) bk,q − qn(cid:48) (cid:17)(cid:16) (cid:17)(cid:16) q k,q−1 mod k,q−1 mod q 50 As the correction is repeated with larger q, the error of the estimate decreases. In other words, we approximate n(cid:48) k by its most significant bits and the next significant bits repeatedly. The performance of this multiscale method is shown in detail in the next section. 3.2.2 Analysis of Multiscale Method The following theorem shows how the correction term dk,q/q is constructed in each iteration and how large the error of estimate n(cid:48) k after L iterations is in the multiscale frequency entry estimation procedure. Theorem 4. Let n(cid:48) ∈ S(cid:48), k ∈ {1, 2,··· , D(cid:48)} be fixed and n(cid:48) 1 < ··· < L and bk,0, bk,1,··· , bk,L ∈ R such that k ∈(cid:104)− M(cid:48) 2 , M(cid:48) 2 (cid:17) . Let 0 < 0 < (cid:107)qn(cid:48) k − bk,q(cid:107)Z < δ, 0 ≤ q ≤ L ≤ 1−2δ where 0 < δ < 1 dk,0, dk,1,··· , dk,L ∈ R, each computable from {q} and {bk,q} such that 4. Assume that 0 ≤ 1−2δ M(cid:48) and βq := q q−1 2δ . 
Then there exist (cid:12)(cid:12)(cid:101)nk − n(cid:48) k (cid:12)(cid:12) ≤ δ L(cid:89) 0 q=1 β−1 q , where (cid:101)nk := L(cid:88) q=0 dk,q q . Proof. The proof is the same as the proof of Theorem 4.2 in [2] since each entry of frequency vectors is corrected in the same way as each one-dimensional frequency w in [2] is. Each dq is defined as d0 := b0 and 51 dq := bq − qλq−1 (cid:20) (cid:18) mod −1 2 , 1 2 (cid:19)(cid:19) for q ≥ 1, where λq = dq/q for q ≥ 0. Corollary 1. Assume that we let βq = β in Theorem 4 where β ≤ 1−2δ all q ≥ 1. Let p > 0 and L ≥(cid:106) logβ + 1. Then, 2δ , i.e., q = βq0 for 2δ 0 (cid:107) (cid:12)(cid:12)(cid:101)nk − n(cid:48) k (cid:12)(cid:12) ≤ δ 0 β−L < 1 2 . Proof. This is straightforward corollary of Theorem 4. Corollary 1 looks very similar to Corollary 4.3 in [2]. However, it is different in that the iteration number is increased in order to make the error between (cid:101)nk and n(cid:48) k less than 1/2 instead of p/2. In [2], the remainder h modulo p of each energetic n is known by sorting out the s largest DFT components of p-length unshifted sample vector so that p/2 error bound is enough to guarantee the exact recovery of n. On the other hand, in the high-dimensional setting of this section, the remainders of all entries of n(cid:48) remainder of n(cid:48)(cid:101)k is only known where(cid:101)k is the index of coordinate axis where the frequencies k are not known. Instead, the are projected. Thus, we decrease the error further by enlarging the iteration number L and are able to recover the exact n(cid:48) k by rounding when the error is less than 1/2. Remark 1. Similar to Theorem 4.4 in [2], the admissible size of δ can be estimated as follows, δ = min (cid:18)1 − 0M(cid:48) , 1 2β + 2 2 (cid:19) (3.14) under the assumption 0 ≤ 1−2δ M(cid:48) of Corollary 1 together with the assumption β ≤ 1−2δ 2δ of 52 Theorem 4. 3.3 Algorithm In this section, the parameters that determine the performance of our algorithm are intro- duced and the way that they are chosen is shown. Those are affected by the noise level, σ. Furthermore, the pseudocode of the multiscale algorithm and the related explanation are provided. It is important to know how the frequency entry estimation works in the algorithm and how the collision detection tests are modified from the tests in Chapter 2 in order to make them tolerant of noise. 3.3.1 Choice of p The sample length p affects the total runtime complexity since the discrete Fourier transform is applied to all sample sets taken to recover the frequency entries. Due to this, we want to make it as small as possible. At the same time, however, we can see from (3.9) that the error between the target entry and its approximation becomes smaller if a larger p is taken. Thus, as discussed in Section 3.2, if p is large enough, then the rounding method instead of multiscale method can recover the exact frequency by rounding (3.7) to the nearest integer of the form pv + h with an integer v. In this case, p is large enough to diminish the influence of σ, and therefore if σ is large, so is p. Instead, the multiscale method makes it possible to enlarge p moderately. From Theorem 4, we get |n(cid:48) k,q+1 − n(cid:48) k| < δ q+1 (3.15) 53 and (3.13) implies |n(cid:48) k,q+1 − n(cid:48) k| ≤ O (cid:18) (cid:19) . σ 2πq |cmin|√ p (3.16) By putting the right side of (3.16) as Cσ σ q |cmin|√ p with some constant Cσ and equating both right sides of (3.15) and (3.16), β := q+1/q can be estimated as √ 2πδ p cminCσσ . 
β = (3.17) β determines the choice of q for each iteration, i.e., q = βq0 where 0 is chosen to be less than 1/M(cid:48). Combining (3.14) and (3.17), the sample length p can be calculated as (cid:18) β(β + 1)cminCσσ (cid:19)2 p = π when 0 = 1 2M(cid:48) and β > 1, which implies δ = 1 2β+2. Eventually, the sample length p for the multiscale method needs to satisfy (cid:40) p > max C1s, (cid:18) β(β + 1)cminCσσ π (cid:19)2(cid:41) , (3.18) where p > C1s ensures the sample is long enough so that the 90% of all energetic frequencies are not collided modulo p on average which comes from the pigeonhole argument in [1]. 54 3.3.2 Collision Detection Tests As mentioned in Section 3.1.1, frequencies are recovered only when there are no collisions from the projection and the modulo p division. These conditions are satisfied if (cid:18) (cid:19) g(cid:101)k,k (cid:17) F(cid:16) g(cid:101)k p, p F (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) − 1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) < τ, [h] [h] (3.19) for k = 1,··· , D(cid:48) and some small τ > 0 which are the practical tests of (3.3). In our noisy setting, Equation (3.6) implies that the left hand side of (3.19) is bounded above by O(cid:16) √ σ cmin (cid:17) p . Thus, we set our threshold τ as a constant multiple of p. Moreover, since we iteratively update the estimates n(cid:48) k,q for q = 0, 1,··· , L, we reject the estimate after L iterations if the tests fail for more than η(L+1) times for each kth entry with k = 1, 2,··· , D(cid:48) √ σ cmin where η is a fraction< 1. Numerical experiments indicate η = 1 4 is a good number. 1 2δ 0 (cid:107) logβ 3.3.3 Number of Iterations From Corollary 1, L ≥(cid:106) 2 which is required to recover the exact n(cid:48) choice of 0 = 1 (cid:12)(cid:12) < +1 guarantees we get the approximation error,(cid:12)(cid:12)(cid:101)nk − n(cid:48) k by rounding(cid:101)nk to the nearest integer. With our 2M(cid:48) and the fact that δ < 1, L =(cid:4)logβ M(cid:48)(cid:5)+1 suffices to satisfy the 1/2 error (cid:17)D ∩ ZD is partially unwrapped to some value (cid:19)d2 ∩ Zd2 where d1 and d2 are positive integers satisfying D = d1d2, then bound. For example, if each n ∈(cid:104)− M (cid:20) in d2 entries of each energetic frequency are recovered element-wisely after L = O(d1 log M ) − M d1 2 , M d1 2 , M 2 k 2 iterations. 55 3.3.4 Description of Our Pseudocode In this section, we explain the multiscale high-dimensional sparse Fourier transform whose pseudocode is provided in Algorithm 4. The set R contains the identified Fourier frequen- cies and their corresponding coefficients, and it is an empty set initially. Parameter t is the counting number determining the index(cid:101)k of the coordinate axes where the frequencies are projected in line 6. Parameters p, τ and L are determined as discussed in the previous sections. Function Q in line 7 is a function constructed from the previously found Fourier modes which is used in updating our samples in lines 9 and 17. In line 9, the unshifted samples r(cid:101)k p corrupted by random Gaussian noise z(cid:96) are taken and in line 11, DFT is applied to these samples and the transformed vector is sorted in the descending order of magnitude. Only its s∗ largest components are taken into account under the s∗-sparsity assumption. 
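The heart of the loop described below is the per-entry multiscale correction of Section 3.2.1 (lines 24 through 34 of Algorithm 4). The following Python sketch simulates that correction for a single entry, replacing the DFT ratio by a noisy wrapped phase; the noise level, β = 2.5, and M' = 20^5 are assumptions made only for this illustration, and the final rounding recovers the exact integer entry.

```python
# Multiscale entry correction of Section 3.2.1 with simulated phase noise in place
# of actual DFT ratios: a coarse estimate at shift eps_0 is refined with ever larger
# shifts eps_q = beta^q * eps_0 until rounding recovers the exact integer entry.
import numpy as np

rng = np.random.default_rng(3)
Mp = 20 ** 5                      # unwrapped bandwidth M'
beta = 2.5
eps0 = 1 / (2 * Mp)
L = int(np.log(Mp) / np.log(beta)) + 1
n_true = -1_234_567               # the entry n'_k we want to recover

def measured_phase(eps):
    """Stand-in for Arg(F(r_shifted)[h] / F(r)[h]) / (2*pi): eps*n_true plus noise, wrapped."""
    noise = rng.normal(scale=0.02)                  # plays the role of O(sigma/(|c| sqrt(p)))
    return (eps * n_true + noise + 0.5) % 1 - 0.5   # value in [-1/2, 1/2)

n_est = measured_phase(eps0) / eps0                 # coarse estimate n'_{k,0}
for q in range(1, L + 1):
    eps_q = beta ** q * eps0
    b_q = measured_phase(eps_q)
    corr = (b_q - eps_q * n_est + 0.5) % 1 - 0.5     # (b_q - eps_q * n_est) mod [-1/2, 1/2)
    n_est += corr / eps_q
print(round(n_est) == n_true, abs(n_est - n_true))
```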
In the loop from line 12 through 39, the entries of frequencies corresponding to these s∗ components are estimated iteratively. The shift size q is updated in line 14 and we get the D(cid:48) number of length p samples at the points shifted by q along each axis. Each length p sample will be used to approximate each entry. Similar to line 11, DFT is applied to each length-p sample and the transformed vector is sorted again following the index order of the sorted DFT of unshifted samples in line 19. In line 22, we check whether the D(cid:48) tests are passed at the same time or not. If not, the vote is increased by 1. From lines 24 through 34, the estimate n(cid:48) k for kth entry is updated. Except when q = 0, n(cid:48) k is improved by adding the correction term shown in line 29 in each iteration. In the last iteration when q = L, n(cid:48) k is rounded to the nearest integer in order to recover the exact entry, as guaranteed by Corollary 1. Whether this estimate is stored in the set R or not is determined by checking if the vote after L iterations is less than η(L + 1). If it is less, this implies that the failure rate 56 of the collision detection tests is less than η. Accordingly, we estimate cn(cid:48) from the DFT of unshifted samples and store (n(cid:48), cn(cid:48)) in R. The entire while loop repeats until s energetic Fourier modes are all found through switching the projection coordinate. Once we find all s frequency vectors, each n(cid:48) ∈ R is transformed to D-dimensional n = u−1(n(cid:48)). Theorem 5. Let f z(x) = f (x) + z(x), where ˆf (n) is s-sparse with all frequencies satisfying n ∈ S ⊂ [−M/2, M/2)D ∩ ZD and not forming any worst case scenario, and z is complex i.i.d. Gaussian noise of variance σ2. Moreover, suppose that s > C(β(β + 1)cminσ)2 for some constant C. Algorithm 4, given M, D, s, β with M > 5s and access to f z(x) returns a list of s pairs ( ˆn, c ˆn) such that (i) each ˆn ∈ S and (ii) for each ˆn, there is an n ∈ S such √ that |cn − c ˆn| ≤ Cσ/ s. The average-case runtime and sampling complexity are O(sD log s log M ) and O(sD log M ), respectively, over the class of random signals. Proof. The difference of Algorithm 4 from Algorithm 3 appears in lines 13 through 39. Algorithm 4 has the multiscale frequency entry estimation. Thus, the average-case runtime and sampling complexity of Algorithm 3 is increased by a factor of L which is the number of the repetition in the multiscale frequency entry estimation. Corollary 1 ensures that the returned frequency vectors ˆn are correct, and the coefficient c ˆn has the desired error bound from (3.10). 57 3.4 Empirical Evaluation In this section, we show the empirical evaluation of the multiscale high-dimensional sparse Fourier algorithm. The empirical evaluation was done for test functions f (x) which consist of cn randomly chosen from a unit circle in C and n randomly chosen from [−M/2, M/2)D∩ZD. Sparsity s varied from 1 to 210 = 1024 by factor of 2. The noise term added to each sample of f came from the Gaussian distribution N (0, σ2). Standard deviation σ varied from 0.001 to 0.512 by factor of 2. The dimension D was chosen to be 100 and 1000, and M was chosen as 20. 
The transformation u was the one for partial unwrapping that was used in [35], i.e., every 5-dimensional subvector of each frequency vector was unwrapped to a one-dimensional vector, and therefore each 100- and 1000-dimensional function f was unwrapped to a 20- or 200-dimensional function $f \circ u$ whose Fourier domain is $\big([-20^5/2, 20^5/2) \cap \mathbb{Z}\big)^{20}$ or $\big([-20^5/2, 20^5/2) \cap \mathbb{Z}\big)^{200}$, respectively. Input parameters $C_1 = 2$, $C_\sigma = 6$, $\eta = 1/4$ and $\beta = 2.5$ were empirically chosen to balance the runtime and accuracy as in [2]. The initial shift size $\epsilon_0$ was set to $\frac{1}{2 \cdot 20^5}$. All experiments were performed in MATLAB. The three plots in Figure 3.1 show the average over 10 trials of the $l_1$ error, the number of samples, and the runtime in seconds as the noise level σ changes. These values are plotted on a logarithmic scale. Dimension D and sparsity s are fixed at 100 and 256, respectively. The other three plots, in Figure 3.2, show the average over 10 trials of the $l_1$ error, the number of samples, and the runtime in seconds as the sparsity s changes for D = 100 and 1000. These values are likewise plotted on a logarithmic scale, and the noise level is fixed at 0.512.

Figure 3.1: (a) Average $l_1$ error vs. noise level σ in logarithm. (b) Average samples vs. noise level in logarithm. (c) Average runtime vs. noise level in logarithm.

Figure 3.2: (a) Average $l_1$ error vs. sparsity s in logarithm. (b) Average samples vs. sparsity s in logarithm. (c) Average runtime vs. sparsity s in logarithm.

3.4.1 Accuracy

In Figure 3.1a and Figure 3.2a, the $l_1$ errors of the Fourier coefficient vectors are given under various parameter changes. Throughout all trials conducted in these experiments the frequencies were always recovered exactly, even for the noise level σ = 0.512, which is relatively large compared to the true coefficients drawn from the unit circle in $\mathbb{C}$. Thus, the errors we observe come only from the coefficients, whose size is a constant multiple of $\sigma/\sqrt{p}$ by (3.10). Because the multiscale method uses fewer samples than the rounding method, the $l_1$ errors are relatively large by nature. From Figure 3.1a, the $l_1$ error grows roughly linearly as σ increases, which meets our expectation. In Figure 3.2a, the plot does not look exactly linear; between $\log_2 s = 5$ and 6 there is a transition in slope. This is because the sample length p from (3.18) switches between the two branches $C_1 s$ and $\left(\beta(\beta+1) c_{\min} C_\sigma \sigma / \pi\right)^2$ during this transition.

3.4.2 Sampling complexity

The sample count as a function of σ in Figure 3.1b seems irregular at first sight. Looking at the scale of the vertical axis, however, we see that the difference between the maximum and the minimum is less than 0.3. Therefore, the sampling complexity is not much affected by the noise level. In Figure 3.2b, the red graph shows the average sample numbers as the sparsity increases when D = 100 and the blue graph shows the same when D = 1000. Since our multiscale algorithm recovers each frequency entry iteratively using log M sets of O(s)-length samples, the average-case sampling complexity is indeed O(sD log M). The two graphs in Figure 3.2b look close to linear, excluding the transition between $\log_2 s = 5$ and 6, which again is caused by the change of p from (3.18). Moreover, the difference between the values of the red and blue graphs is close to 3, which implies that the sampling number depends linearly on D.
The D-dimensional FFT whose sampling complexity is O(M D) cannot deal with our high-dimensional problem computationally, whereas our algorithm uses only millions to billions of samples for reconstruction. 3.4.3 Runtime complexity Figures 3.1c and 3.2c show the average-case runtime complexity of the algorithm. The time for evaluating the samples from functions is excluded when measuring the runtime. For the main algorithm, we demonstrated that it is O(sD log s log M ) because for each entry recovery, DFT with O(s log s) runtime complexity is applied in D log M iterations. In Figures 3.1c, the runtimes in seconds look irregular but the scale of vertical axis is less than 0.3 so that we can conclude that similar to sample numbers the runtime is not affected by σ very much. Overall, it took less than a second on average. In Figure 3.2c, the red graph represents the runtimes as the sparsity changes when D = 100, and the blue graph represents the ones when D = 1000. Those graphs do not look linear, but considering the average slope we can see that the runtime is increased by around 28 while s is increased by 210. On the other hand, the difference between two graphs implies that the runtime complexity is linear in D. Compared to the FFT with runtime complexity of O(M D log M D) which is impossible to be practical in high-dimensional problems, our algorithm is quite effective, taking only a few seconds. 62 Algorithm 4 Multiscale High-dimensional Sparse Fourier Algorithm Pseudo Code Input:f, u, s, M, D, M(cid:48), D(cid:48), σ, cmin, Cσ, C1, η, β Output:R 5: 4: 1: R ← ∅, t ← 0 C1s∗,(cid:0)β(β + 1)cminCσσ/π(cid:1)2(cid:111) (cid:110) 2: while |R| < s do s∗ ← s − |R| 3: p ← first prime number ≥ max , L ← 1 + (cid:98)logβ M(cid:48)(cid:99) √ τ ← Cσσ (cid:101)k ← (t mod D(cid:48)) + 1 Q (x) ←(cid:80) cmin (n(cid:48),cn(cid:48) )∈R cn(cid:48) e2πin(cid:48)·x (cid:16) (cid:96) (cid:17)(cid:17) (cid:16) (cid:96) (cid:16) r(cid:101)k p e(cid:101)k p e(cid:101)k (cid:19) (cid:18) (cid:19) (cid:18) (cid:19) (cid:18) + z(cid:96) − Q p [(cid:96)] ← f r(cid:101)k r(cid:101)k r(cid:101)k for (cid:96) = 0 → p − 1 do ← SORT (cid:19)(cid:19) , Fsort r(cid:101)k (cid:18) (cid:18) (cid:17) F 8: 6: 7: u p p p p p ← FFT end for F vote ← 0 for q = 0 → L do (cid:16) (cid:96) p e(cid:101)k (cid:17) (cid:18) F p,q p,q (cid:18) + qek (cid:19)(cid:19) ← SORT r(cid:101)k,k + z(cid:48) (cid:96) − Q (cid:18) (cid:19) r(cid:101)k,k (cid:12)(cid:12)(cid:12)(cid:12) > τ for any k = 1, 2,··· , D(cid:48) then vote ← vote + 1 (cid:12)(cid:12)(cid:12)(cid:12) − 1 (cid:19)(cid:14)(cid:18) (cid:19)(cid:19) (cid:19) [h] Fsort [h] (cid:18) r(cid:101)k (cid:17) p (mod [−1/2, 1/2)) /q , Fsort (cid:17)(cid:17) q ← βq 2M(cid:48) for k = 1 → D(cid:48) do u p,q p,q (cid:16) end for + qek end for F for (cid:96) = 0 → p − 1 do r(cid:101)k,k p,q [(cid:96)] ← f (cid:18) (cid:19) r(cid:101)k,k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)Fsort(r(cid:101)k (cid:16) (cid:96) p e(cid:101)k (cid:18) (cid:19) r(cid:101)k,k ← FFT (cid:12)(cid:12)(cid:12)(cid:12)Fsort(r(cid:101)k,k (cid:12)(cid:12)(cid:12)(cid:12)/ for h = 0 → s∗ − 1 do p)[h] (cid:18)(cid:18) (cid:18) r(cid:101)k,k for k = 1 → D(cid:48) do (cid:17) bk ← 1 if q == 0 then Fsort end if p,q else if p,q )[h] (cid:19) 2π Arg n(cid:48) k ← bk/q n(cid:48) k ← n(cid:48) k + (cid:16)(cid:16) (cid:16) n(cid:48) k ← round (cid:18) end if if q == L then end if cn(cid:48) ← 1 pFsort end for if vote ≤ η(L + 1) then k bk − qn(cid:48) (cid:17) n(cid:48) (cid:19) [h], R ← R ∪(cid:16) 
r(cid:101)k k p (cid:17) n(cid:48), c n(cid:48) 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: end if end for 37: 38: 39: end for t ← t + 1 40: 41: end while 42: inverse-transform each n(cid:48) in D(cid:48)-D to n D-D and restore it in R 63 Chapter 4 General Sparse Harmonic Transforms 4.1 Preliminaries In this section, we introduce a completely different approach for high dimensional sparse problems. The approach is a generalized sparse transform for a generic tensor basis. We present the new method and a class of problems that it is ideally suited for. 4.1.1 The Compressive Sensing Problem for BOPB-Sparse Func- tions Toward a more exact problem formulation, let p ∈ N be any natural number and [N ] := {0, 1, 2, . . . , N − 1} for all N ∈ N. The set of functions, {Tk : D → C}k∈[N ] forms a Bounded Orthonormal System (BOS) with respect to a probability measure σ over D ⊂ Rp with BOS constant K := maxk (cid:107)Tk(cid:107)∞ ≥ 1 if K < ∞, and 1 if k = l 0 if k (cid:54)= l (cid:90) D Tk(x)Tl(x)dσ(x) = δk,l = (cid:104)Tk, Tl(cid:105)(D,σ) := 64 holds for all k, l ∈ [N ]. Now let Bj := {Tj,k : Dj → C}k∈[M ] form a BOS with respect to a probability measure νj on Dj ⊂ R, with constant (cid:101)Kj for each j ∈ [D]. Then, the BOPB functions B := {Tn : D → C} n∈[M ]D , defined by Tn(x) := again form a BOS with constant Tj;nj (xj) (4.1) with respect to the probability measure ν := ⊗j∈[D]νj over D := ×j∈[D]Dj ⊂ RD. Herein we consider BOPB-sparse functions f : D → C of the form (cid:89) j∈[D] (cid:88) K := max n ||Tn||∞ = (cid:89) j∈[D] (cid:101)Kj f (x) := cnTn(x) n∈S⊂I⊆[M ]D where |S| = s (cid:28) |I| ≤ |B| = N = M D. Following [32, 20, 18] we will take I to be the subset of [M ]D containing at most d ≤ D nonzero entries. Considering the recovery of f using standard compressive sensing methods [36, 37] when ⊂ D, I = [M ]D, one can simply independently draw m(cid:48) according to ν and then sample f at those points to obtain (cid:26) t1, . . . , tm(cid:48) 1 (cid:27) 1 points, GE := (cid:18) (cid:19)(cid:19)T ∈ Cm(cid:48) tm(cid:48) 1 1. (cid:16)GE(cid:17) (cid:18) yE = f := f (t1) , f (t2) , . . . , f (4.2) Our objective becomes the recovery of f using only the samples yE. 65 Let the m(cid:48) 1 × M D random sampling matrix Φ ∈ Cm(cid:48) 1×M D have entries given by Φ(cid:96),n = Tn(t(cid:96)). (4.3) We can now form the underdetermined linear system  f yE =  =  f (t1) (cid:19) f (t2) (cid:18) ... tm(cid:48) 1 Tn1(t1) Tn2(t1) Tn1(t2) Tn2(t2) ... ... ··· ··· Tn1(tm(cid:48) 1 ) Tn2(tm(cid:48) 1 ··· ) Tn Tn ··· ··· . . . ··· Tn M D (t1) M D (t2) ... M D (tm(cid:48) 1  c = Φc, ) where c ∈ CM D contains the basis coefficients cn of f , and the index vectors n1, . . . , n M D ∈ [M ]D are ordered, e.g., lexicographically. Note that this linear system is woefully underde- termined when m(cid:48) 1 (cid:28) M D. When c has only s (cid:28) M D nonzero entries as it does here, however, the compressive sensing literature tells us that c can still be recovered using sig- nificantly fewer than M D function evaluations as long as the normalized random sampling matrix Φ has the Restricted Isometry Property (RIP) of order 2s [37]. Definition 1 (See Definition 6.1 in [37]). 
The sth restricted isometry constant δs of a matrix (cid:101)Φ ∈ Cm×N is the smallest δ ≥ 0 such that 2 ≤(cid:13)(cid:13)(cid:13)(cid:101)Φc (1 − δ)(cid:107)c(cid:107)2 (cid:13)(cid:13)(cid:13)2 2 ≤ (1 + δ)(cid:107)c(cid:107)2 2 holds for all s-sparse vectors c ∈ CN . The matrix (cid:101)Φ is said to satisfy the RIP of order s if δs ∈ (0, 1). 66 Furthermore, one can show that random sampling matrices have the restricted isometry property with high probability when (cid:12)(cid:12)(cid:12)GE(cid:12)(cid:12)(cid:12) is relatively small. Theorem 6 (See Theorem 12.32 and Remark 12.33 in [37]). Let A ∈ Cm×N be the random sampling matrix associated to a BOS with constant K ≥ 1. If, for δ, p ∈ (0, 1), m ≥ aK2δ−2s · max{log2(4s) log(8N ) log(9m), log(p−1)}, then with probability at least 1− p, the restricted isometry constant δs of (cid:101)A = 1√ m A satisfies δs ≤ δ. The constant a > 0 is universal. Note that Theorem 6 effectively decouples the number of samples that one must acquire/ compute in order to recover any BOPB-sparse f from the overall BOP basis size |B| = M D. 1 = O(K2 · s · D · log4(KM D)) 1) at this point It guarantees that a random sampling set of size suffices. The main obstacle to reducing the sampling complexity (i.e., m(cid:48) (cid:12)(cid:12)(cid:12)GE(cid:12)(cid:12)(cid:12) = m(cid:48) becomes the BOS sampling constant K. To see why, consider, e.g., the cosine BOPB where √ 2 cos (nx) for n ≥ 1 and Tj;0(x) = 1 in (4.1). for all j ∈ [D] in (4.1) we set Tj;n(x) = This leads to a BOS with K = 2D/2 with respect to uniform probability measure ν over D = [0, 2π]D. Now we can see that we still face the curse of dimensionality since K2 = 2D even for this fairly straightforward BOPB. Nonetheless, it expresses itself in a dramatically reduced fashion: 2D is still a vast improvement over M D for even moderately sized M > 2. As previously mentioned, to further reduce the sampling complexity from scaling like 2O(D) previous work has focussed on developing efficient methods for effectively reducing the basis size to a smaller subset of the total basis B (see, e.g., [19, 20]). To see how this 67 might work in the context of our simple cosine BOPB example above, we can note that the BOPB elements in (4.1) can be rewritten as Tn(x) := 2(cid:107)n(cid:107)0/2 cos(cid:0)njxj (cid:1) D−1(cid:89) j=0 in that case. It now becomes obvious that limiting the basis functions to those with indexes in I := {n ∈ [M ]D | (cid:107)n(cid:107)0 ≤ d ≤ D} leads to a reduced BOS constant of K = 2d/2 < 2D/2 for the resulting reduced basis, as well as to a smaller basis cardinality of size (cid:0)D O(cid:16) (cid:1)M d = ( DM d d )d(cid:17) . In particular, the utility of the assumption that the s non-negligible basis indexes of f , S ⊂ [M ]D, also belong to the reduced index set I above is supported in some UQ applications where it is known that, e.g., the solutions of some parametric PDE are not only approximately sparse in some BOP bases such as the Chebyshev or Legendre product bases, but also that most of their significant coefficients correspond to index vectors n ⊂ ND with relatively small (weighted) (cid:96)p-norms [19, 20]. In certain simplified situations this essentially implies that S ⊂ I as discussed above. 
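To get a feeling for the sizes involved, the following sketch (an illustration only, with assumed values M = 10, D = 100, d = 3) counts the reduced index set I exactly, compares it with the $\binom{D}{d} M^d$ bound used above and with the full basis size $M^D$, and evaluates the corresponding cosine-BOPB constants $2^{d/2}$ versus $2^{D/2}$.

```python
# Count of the reduced index set I = { n in [M]^D : ||n||_0 <= d } versus the full
# basis size M^D, and the associated BOS constants for the cosine product basis.
from math import comb

M, D, d = 10, 100, 3

exact = sum(comb(D, j) * (M - 1) ** j for j in range(d + 1))   # exact |I|
bound = comb(D, d) * M ** d                                    # the C(D,d) * M^d bound used above
print(f"|I| = {exact:.3e}  <=  {bound:.3e}  <<  M^D = {float(M) ** D:.3e}")
print("cosine-BOPB constant over I:", 2 ** (d / 2), "vs. over the full basis:", 2 ** (D / 2))
```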
As a result, we will assume throughout this chapter that S ⊂ I so that N = |I| =(cid:0)D (cid:1)M d ≤ M D.1 d Even when s·K2 (cid:28) N so that the number of required samples m(cid:48) 1 is small compared to the reduced basis size |I| = N , however, all existing standard compressive sensing approaches for recovering f still need to compute and store potentially fully populated intermediate coefficient vectors c(cid:48) ∈ CN at some point in the process of recovering f . As a result, all existing approaches are limited in terms of the reduced basis sizes I they can consider 1Additionally, we will occasionally assume that our total grid size |G| below always satisfies |G| ≤ N c for some absolute constant c ≥ 1 in order to simplify some of the logarithmic factors appearing in our big-O notation. This will certainly always be the case for any standardly used (trigonometric) polynomial BOPB (such as Fourier and Chebyshev product bases) whenever sKDM < N . 68 by both their memory needs and runtime complexities. In this chapter we develop new methods that are capable of circumventing these memory and runtime restrictions for a general class of practical BOP bases. As a result, we make it possible to recover a new class of extremely high-dimensional BOPB-sparse functions which are simply too complicated to be approximated by other means. We are now prepared to discuss our main results. 4.1.2 Main Results The proposed sublinear-time algorithm is a greedy pursuit method motivated by CoSaMP[22], HTP[23], and their sublinear-time predecessors [38, 39]. In particular, it is obtained from CoSaMP by replacing CoSaMP’s support identification procedure with a new sublinear-time support identification procedure. See Algorithm 5 in section 4.2 for pseudocode and other details. Our main result demonstrates the existence of a relatively small grid of points G ⊂ D which allows Algorithm 5 to recover any given BOPB-sparse function f in sublinear-time from its evaluations on G. We refer the reader to section 4.1.3 for a detailed description of the grid set G and its use in Algorithm 5. The following theorem is a simplified version of theorem 7 in section 4.2. Theorem (Main Result). Suppose that function Tn is defined as per (4.1). Let Fs be the subset of all functions f ∈ span(cid:8)Tn Tn is a BOS where each basis (cid:12)(cid:12) n ∈ I(cid:9) (cid:110) (cid:12)(cid:12) n ∈ I ⊆ [M ]D(cid:111) whose coefficient vectors are s-sparse, and let cf ∈ CI denote the s-sparse coefficient vec- tor for each f ∈ Fs. Fix p ∈ (0, 1/3), a precision parameter η > 0, 1 ≤ d ≤ D, and (cid:107)Tn(cid:107)∞. Then, one can randomly select a set of i.i.d. Gaussian weights K = W ⊂ R for use in (4.19), and also randomly construct a compressive sensing grid, G ⊂ D, n s.t.(cid:107)n(cid:107)0≤d sup whose total cardinality |G| is O(cid:16) s3DL(cid:48)K4 max d4 log4(s) log4(D2M ), log2( D p ) , such that (cid:111)(cid:17) (cid:110) 69 the following property holds ∀f ∈ Fs with probability greater than 1− 3p: Let y = f (G) con- sist of samples from f ∈ Fs on G. If Algorithm 5 is granted access to y, G, and W, then it will produce an s-sparse approximation a ∈ CI s.t. (cid:107)cf − a(cid:107)2 ≤ Cη, where C > 0 is an absolute constant. Furthermore, the total runtime complexity of Algorithm 5 is always O(cid:16)(cid:16) (cid:110) (cid:111)(cid:17) × log (cid:17) . (cid:107)cf(cid:107)2 d4 log4(s) log4(D2M ), log2( D p ) s5D2LK4 max Note that Algorithm 5 will run in sublinear-time whenever s5DLK4d4 (cid:28) |I| (neglecting logarithmic factors). 
Here and in the theorem above the parameters L and L(cid:48) depend on η your choice of numerical method for computing the inner product between a sparse function in the span of each one-dimensional BOS Bj = (cid:8)Tj;m (cid:12)(cid:12) m ∈ [M ](cid:9). More specifically, let L(cid:48) j represent the number of function evaluations one needs in order to compute all M -inner in O(L)-time for any given function g : Dj → C belonging to the products (cid:110)(cid:104)g, Tj;(cid:101)n(cid:105)(cid:111) (cid:101)n∈[M ] span of Bj that is also s-sparse in Bj. We then set L(cid:48) := maxj∈[D] L(cid:48) j. For example, if each BOS Bj consists of orthonormal polynomials whose degrees are all bounded above by M then quadrature rules such as Gaussian quadrature or Chebyshev quadrature give L = O(M 2) and L(cid:48) = O(M ) [14]. If each Bj is either the standard Fourier, sine, cosine, or Chebyschev basis then the Fast Fourier Transform (FFT) can always be used to give L = O(M log M ) and L(cid:48) = O(M ) [14]. Moreover, there are several sublinear-time sparse Fourier transforms as well as sparse harmonic transforms for other bases which could also be used to give other valid L(cid:48) and L combinations [40, 41, 10, 5, 24, 42, 43, 7, 44, 8, 45, 11, 46]. These typically have O(sc logc(cid:48) M ) 70 runtime and sampling complexities for small positive absolute constants c and c(cid:48). As a result, one can obtain much stronger results than the main theorem above when s (cid:28) M and every one-dimensional BOS Bj is either the Fourier, sine, cosine, or Chebyshev basis. The follow- ing corollary of our main theorem is obtained by using deterministic one-dimensional SFT results from [11] and [44] in order to compute all of the nonzero inner products in lines 6 – 13 of Algorithm 6. They lead to L(cid:48) and L values in section 4.2’s theorem 7 of size O(s2 log4 M ). (cid:110) (cid:12)(cid:12) n ∈ I ⊆ [M ]D(cid:111) Tn Corollary 2. Suppose that is a BOS where each basis function Tn is defined as per (4.1), and where every one-dimensional BOS Bj is either the Fourier, sine, cosine, or Chebyshev basis. Let Fs be the subset of all functions f ∈ span(cid:8)Tn (cid:12)(cid:12) n ∈ I(cid:9) whose coefficient vectors are s-sparse, and let cf ∈ CI denote the s-sparse coefficient vector for each f ∈ Fs. Fix p ∈ (0, 1/3), a precision parameter η > 0, 1 ≤ d ≤ D, and let (cid:107)Tn(cid:107)∞. Then, one can randomly select a set of i.i.d. Gaussian weights K = W ⊂ R for use in (4.19), and also randomly construct a compressive sensing grid, G ⊂ D, whose total cardinality |G| is n s.t.(cid:107)n(cid:107)0≤d sup s3D log4(M )K4 max d4 log4(s) log4(D2M ), log2( D p ) , such that the following property holds ∀f ∈ Fs with probability greater than 1 − 3p: Let y = f (G) consist of samples from f ∈ Fs on G. If Algorithm 5 is granted access to y, G, and W, then it will produce an s-sparse approximation a ∈ CI s.t. O(cid:16) (cid:110) (cid:111)(cid:17) (cid:107)cf − a(cid:107)2 ≤ Cη, where C > 0 is an absolute constant. Furthermore, the total runtime complexity of Algorithm 5 is always 71 O(cid:16)(cid:16) s5D2 log4(M )K4 max (cid:110) d4 log4(s) log4(D2M ), log2( D p ) (cid:111)(cid:17) × log (cid:17) . (cid:107)cf(cid:107)2 η Note that the runtime dependance achieved by the corollary above scales sublinearly with M , quadratically in D, and at most polynomially in the parameter d ≤ D used to determine I. We also remind the reader that the BOS constant K for the Fourier basis is 1. 
As a result, the K dependence in the runtime complexity vanishes entirely when the BOPB in question is the multidimensional Fourier basis.2 Finally, there are also sublinear-time sparse transforms for one-dimensional Legendre polynomial systems [44], though the theoretical results for sparse recovery therein require additional support restrictions beyond simple sparsity. Thus, corollary 2 can also be extended to restricted types of Legendre-sparse functions in order to achieve sublinear-in-M runtimes. A detailed development of such results is left for future work, however. 4.1.3 Randomly Constructed Grids with Universal Approxima- tion Properties Fix a BOP basis B and sparsity level s. We will call any set G ⊂ D a compressive sensing grid if and only if ∃ a set of weights W s.t. ∀ f : D → C that are s-sparse in B Algorithm 5 with weights W can recover f from its evaluations on G is true. As mentioned above, our main results demonstrate the existence of relatively small 2Though the resulting O(cid:16) compressive sensing grids by randomly constructing highly structured sets of points that are s5D2d4polylog(M Ds(cid:107)c(cid:107)2/ηp) -runtime achieved by corollary 2 for the multi- dimensional Fourier basis is strictly worse than the best existing noise robust and deterministic sublinear-time results for that basis [11] (except perhaps when s3d4 (cid:28) D3), we emphasize that it is achieved with a different and significantly less specialized grid G herein. (cid:17) 72 then shown to be compressive sensing grids with high probability. We emphasize that our use of probability in this chapter is entirely constrained to (i) the initial choice of the grid G given a BOP basis B and sparsity level s, and to (ii) the entirely independent and one-time initial choice of a set of random gaussian weights W for use in (4.19) (i.e., as part of the initialization phase for Algorithm 5). Algorithm 5 is entirely deterministic once both G and W have been chosen. The compressing sensing grids G utilized herein will be the union of three distinct sets of points in D. The first set of points is the set GE ⊂ D on which f is evaluated in order to obtain yE in (4.2). This set of points is used in Algorithm 5 in order to estimate the basis coefficients for the basis elements identified by Algorithms 6 and 7. The second and third sets included in G, GI ⊂ D and GP ⊂ D, are utilized by Algorithm 6 and Algorithm 7, GI := GI j = j∈[D] j∈[D] j,(cid:96),k ((cid:96),k)∈[m]×[L(cid:48) j ] . (cid:91) (cid:91) (cid:110) x(cid:48) (cid:111) 73 respectively. They are defined below. (cid:27) (cid:26) (cid:110)(cid:104)g, Tj;(cid:101)n(cid:105)(cid:111) Let Uj := ⊂ Dj be the set of L(cid:48) uj,0, . . . , uj,L(cid:48) j−1 j points at which one can evaluate any given Bj-sparse function g : Dj → C in the span of Bj in order to compute all M -inner in O(L)-time. Also, let I≥j : [D − 1] → {0, 1} be the indicator products function that is zero when κ < j, and one when κ ≥ j. For each j ∈ [D] we will then define j ⊂ D to be the set of mL(cid:48) GI j randomly generated grid points given by (cid:101)n∈[M ] x(cid:48) j,(cid:96),k =(cid:0)(xj,(cid:96))0, (xj,(cid:96))1, . . . , (xj,(cid:96))j−1, uj,k, (xj,(cid:96))j, . . . , (xj,(cid:96))D−2 where each (xj,(cid:96))κ ∈ Dκ+I≥j (κ) is an independent realization of a random variable ∼ νκ+I≥j (κ) for all κ ∈ [D − 1]. 
We now take GI to be the union of these sets so that (cid:1) ∀((cid:96), k) ∈ [m] × [L(cid:48) j], Finally, similar to (4.2), we will also define f ’s evaluations on GI to be yI ∈ Cm(cid:48) (cid:19)(cid:19)T (cid:18) 2 where (cid:17) (cid:16) (cid:17) (cid:18) x(cid:48) D−1,m−1,L(cid:48) yI = f := f 0,0,0 , f x(cid:48) 0,0,1 , . . . , f (cid:16)GI(cid:17) (cid:16) x(cid:48) D−1−1 . To define GP we will again need several different subsets for each j ∈ [D] \ {0}. For i=j+1Di be i=j+1νi, respectively.3 We each fixed (j, (cid:96), k) ∈ [D] \ {0} × [m1] × [m2] let wj,(cid:96) ∈ ×i∈[j+1]Di and zj,k ∈ ×D−1 chosen independently at random according to ⊗i∈[j+1]νi and ⊗D−1 then define GP j ⊂ D to be the set of m1m2 randomly generated grid points given by j :=(cid:8)(wj,(cid:96), zj,k)(cid:12)(cid:12) ((cid:96), k) ∈ [m1] × [m2](cid:9) ∀j ∈ [D] \ {0}. GP As above, we now let GP be the union of these sets so that (cid:91) GP := GP j j∈[D]\{0} and define f ’s evaluations on GP to be yP ∈ Cm(cid:48) (cid:1) , f(cid:0)w1,0, z1,1 f(cid:0)w1,0, z1,0 (cid:16)GP(cid:17) yP = f (cid:16) := 3 where (cid:1) , . . . , f (cid:16) wD−1,m1−1, zD−1,m2−1 (cid:17)(cid:17)T . As we saw in Section 4.1.2, it turns out that G := GE ∪ GI ∪ GP will be a compressive sensing grid with high probability even when each component set is chosen to have a relatively small cardinality. The vast majority of the remainder of this chapter will be dedicated to proving this fact. We will begin the remainder of Section 4.1 by introducing additional notation that is used throughout the rest of the chapter, and by interpreting our function 3When j = D − 1 the vector zj,k is interpreted as a null vector satisfying (wj,(cid:96), zj,k) = wj,(cid:96) ∀((cid:96), k). 74 evaluations on G, y = (yE, yI, yP)T ∈ Cm(cid:48) 3 1+m(cid:48) 2+m(cid:48) (4.4) as standard compressive sensing measurements. Next, in Section 4.2, Algorithm 5 is discussed in detail and the main theorem above is proven with the help of a key technical lemma (i.e., Lemma 2) that guarantees the accuracy of our proposed support identification method. Lemma 2 is then proven in Section 4.3. Finally, a numerical evaluation is carried out in Section 4.4 that demonstrates that Algorithm 5 both behaves as expected, and is robust to noisy function evaluations. 4.1.4 Notation and Problem Setting In this section we introduce the notation that will be used in the rest of this chapter as well as the problem for which we will develop our proposed algorithm. We denote by N the set of natural numbers, R the set of real numbers, and C the set of complex numbers. Let [N ] := {0, 1, 2, . . . , N − 1} for N ∈ N. In this chapter all letters in boldface (other than probability measures such as ν) will always represent vectors. Vectors whose entries are indexed by index vectors in, e.g., [M ]D will be assumed to have their entries ordered lexicographically for the purposes of, e.g., matrix-vector multiplications. Thus, we say either v ∈ C[M ]D or v ∈ CM D when we want to emphasize that each entry vn of v is corresponding to its index vector n, or when we perform, e.g., matrix-vector multiplications, respectively. We further define the (cid:96)0 pseudo- norm of a vector v by (cid:107)v(cid:107)0 := |{i : vi (cid:54)= 0}| where the index i refers to the ith entry of the vector (in lexicographical order). If v ∈ C is a scalar then we will also use the (cid:96)0-notation 75 to correspond to the indicator function defined by 1 if v (cid:54)= 0 0 if v = 0 (cid:107)v(cid:107)0 := . 
(4.5) We will consider functions f : D → C given in a BOS product basis expansion below so that (cid:88) n∈[M ]D f (x) := cnTn(x). (4.6) We will further assume that f is approximately sparse in this BOS product basis. That is, we will assume that there exists some index set S ⊂ I for an a priori known index set I ⊆ [M ]D such that S has the property that both (i) |S| = s (cid:28) |I| ≤ M D, and that (ii) the set of coefficients C := {cn (cid:12)(cid:12) n ∈ S} ⊂ C dominates f ’s (cid:96)2-norm in the sense that |cn|2 (cid:29) (cid:88) (cid:88) |cn|2 =: 2, n∈S n∈[M ]D\S for a relatively small number  > 0. We emphasize here that absolutely nothing about S is known to us in advance beyond the fact that it is a subset of I, and has cardinality at most s. We must learn the identity of its elements ourselves by sampling f . Our analysis herein will focus on the case where I is given by (cid:110) n ∈ [M ]D (cid:12)(cid:12) (cid:107)n(cid:107)0 ≤ d (cid:111) I := for some d ≤ D (cf. Section 4.1.1). Note that this includes, for d = D, the special case where I = [M ]D. We will call the index vectors n ∈ S energetic. Our goal is to recover 76 S and the associated coefficients C as rapidly as possible using only evaluations/samples from f . This will, in turn, necessitate that we sample f at very few locations in D. In this chapter we will mainly focus on providing theoretical guarantees for the case where  = 0 (i.e., for provably recovering f that are exactly s-sparse in a BOS product basis). Numerical experiments in Section 4.4 demonstrate that the method also works when  > 0, however. We leave theoretical guarantees in the case of  > 0 for future consideration. 4.1.5 Definitions Required for Support Identification As with most compressive sensing and sparse approximation problems we will see that iden- tifying the function f ’s support S is the most difficult part of recovering f . As a result our proposed iterative algorithm spends the vast majority of its time in every iteration recov- ering as many energetic n = (n0, n1,··· , nD−1) ∈ S as it can. Only after doing so does it then approximate a sparse vector c ∈ C[M ]D containing nonzero coefficients cn for each discovered n ∈ S. Here, each n will be referred to as an index vector of an entry in c. Let supp(v) ⊆ [M ]D represent the set of index vectors whose corresponding vn entries are nonzero. We introduce the following notation in order to help explain our algorithm in the subsequent sections of the chapter. For a given v ∈ C[M ]D k ∈ [M ]D−1 is defined by (cid:17) (cid:16) vj;(cid:101)n = k , j ∈ [D], and (cid:101)n ∈ [M ] the vector vj;(cid:101)n ∈ C[M ]D−1 indexed by vn, 0 if n = (k0, . . . , kj−1,(cid:101)n, kj, . . . , kD−2) . (4.7) otherwise 77 Note that vj;(cid:101)n will only ever have at most (cid:18) D − 1 d − (cid:107)(cid:101)n(cid:107)0 N(cid:48) := (cid:19) M min{d−(cid:107)(cid:101)n(cid:107)0,D−1} ≤ (cid:18) e(D − 1)M max{d − 1, 1} (cid:19)d (4.8) (cid:1) to be 1 whenever q ≥ p or q < 0 throughout the remainder of the chapter, we define (cid:0)p nonzero entries if vn = 0 for all n ∈ [M ]D with (cid:107)n(cid:107)0 > d by assumption. Here, as (also recall the definition of (cid:107)(cid:101)n(cid:107)0 from (4.5) above).4 Similarly, for a given v ∈ C[M ]D j ∈ [D], and (cid:101)n ∈ [M ]j+1 the vector vj;((cid:101)n,··· ) ∈ C[M ]D−j−1 , indexed by k ∈ [M ]D−j−1 is q defined by (cid:16) (cid:17) = k vj;((cid:101)n,··· ) vn, 0 if n = ((cid:101)n, k) . 
otherwise (4.9) The following lemma bounds the total number of nonzero entries that vj;((cid:101)n,··· ) can have given that vn = 0 whenever (cid:107)n(cid:107)0 > d. Note that for j = 0, vj;(cid:101)n = vj;((cid:101)n,··· ) so that (4.8) follows as a special case. Lemma 1. Let v ∈ C[M ]D whenever (cid:107)n(cid:107)0 > d. Then vj;((cid:101)n,··· ) can have at most M min{d−(cid:107)(cid:101)n(cid:107)0,D−j−1} ≤ , j ∈ [D], and (cid:101)n ∈ [M ]j+1 with (cid:107)(cid:101)n(cid:107)0 ≤ d. Suppose that vn = 0 (cid:18)D − j − 1 (cid:19) d − (cid:107)(cid:101)n(cid:107)0 (cid:18) e(D − j − 1)M max{d − j − 1, 1} (cid:101)Nj := (cid:19)d (4.10) nonzero entries. Proof. Since vn = 0 whenever (cid:107)n(cid:107)0 > d it must be the case that 4The min{d − (cid:107)(cid:101)n(cid:107)0, D − 1} in the exponent of the M in (4.8) handles the case when d = D and(cid:101)n = 0. = 0 whenever n (cid:16) vj;((cid:101)n,··· ) (cid:17) 78 n = ((cid:101)n,(cid:101)n(cid:48)) has (cid:107)(cid:101)n(cid:48)(cid:107)0 > d−(cid:107)(cid:101)n(cid:107)0, where(cid:101)n(cid:48) ∈ CD−j−1. As a result, if D− j − 1 > d−(cid:107)(cid:101)n(cid:107)0 then there are at most(cid:0)D−j−1 (cid:1) entry combinations left in n which can be nonzero, each of d−(cid:107)(cid:101)n(cid:107)0 which can take on M different values. If, on the other hand, d − (cid:107)(cid:101)n(cid:107)0 ≥ D − j − 1 then all of the remaining D − j − 1 values of n can each take on M different values. Motivated by the definition of vj;(cid:101)n in (4.7) we further define Ij;(cid:101)n :=(cid:8)n ∈ I (cid:12)(cid:12) nj =(cid:101)n(cid:9) ⊂ C[M ]D , onto each Ij;(cid:101)n (considered and denote the restriction matrix that projects vectors in C[M ]D . That is, we consider each Pj;(cid:101)n matrix as a subset of C[M ]D−1 to have rows indexed by l ∈ [M ]D−1, columns indexed by k ∈ [M ]D, and entries defined ) by Pj;(cid:101)n ∈ {0, 1}M D−1×M D 1 0 by (Pj;(cid:101)n)l,k := if k = (l0, . . . , lj−1,(cid:101)n, lj, . . . , lD−2) otherwise . (4.11) The fast support identification strategy we will employ in this chapter will effectively boil As a result, we have that Pj;(cid:101)nv = vj;(cid:101)n for all v ∈ C[M ]D down to rapidly approximating the norms of various cj;(cid:101)n and cj;((cid:101)n,··· ) vectors for carefully chosen collections of (cid:101)n ∈ [M ] and (cid:101)n ∈ [M ]j+1. This, in turn, will be done using as few . evaluations of f in (4.6) as possible in order to estimate inner products and norms of other proxy functions constructed from f . As a simple example, note that (cid:107)c(cid:107)2 can be estimated by using samples from f in order to approximate (cid:107)f(cid:107)2 since L2(D,ν) |cn|2. (cid:88) n∈[M ]D (cid:107)f(cid:107)2 L2(D,ν) = 79 A bit less trivially, for j ∈ [D] and(cid:101)n ∈ [M ] one can also define the function D(cid:48) j → C with domain D(cid:48) j := ×k(cid:54)=jDk by having (cid:69) (cid:68) f, Tj;(cid:101)n (w) evaluate to (Dj ,νj ) (cid:69) (cid:68) f, Tj;(cid:101)n : (Dj ,νj ) (cid:90) Dj f (w0, . . . , wj−1, z, wj+1, . . . , wD−2)Tj;(cid:101)n(z) dνj(z) for all w ∈ D(cid:48) j. Let ν(cid:48) j := ⊗k(cid:54)=jνk. It is not too difficult to see that (cid:13)(cid:13)(cid:13)(cid:104)f, Tj;(cid:101)n(cid:105)(Dj ,νj ) (cid:13)(cid:13)(cid:13)2 L2(D(cid:48) j ,ν(cid:48) j ) = (cid:88) n s.t. nj =(cid:101)n |cn|2 = (cid:107)cj;(cid:101)n(cid:107)2 2 (4.12) in this case. 
Similarly, for some j ∈ [D] and (cid:101)n ∈ [M ]j+1 one can define the function (cid:104)f, Tj;(cid:101)n(cid:105)(cid:16)×i∈[j+1]Di,⊗i∈[j+1]νi (cid:104)f, Tj;(cid:101)n(cid:105)(cid:16)×i∈[j+1]Di,⊗i∈[j+1]νi (cid:17) from D(cid:48)(cid:48) (cid:17)(w) equal to j := ×k>jDk into C by letting (cid:90) f (z, w0, . . . , wD−j−2) ×i∈[j+1]Di (cid:89) k∈[j+1] (cid:16)⊗i∈[j+1]νi (cid:17) (z) Tk;(cid:101)nk (zk) d for all w ∈ D(cid:48)(cid:48) j . Let ν(cid:48)(cid:48) j := ⊗k>jνk. Analogously to the situation above we then have (cid:17) = (cid:107)cj;((cid:101)n,··· )(cid:107)2 2. (4.13) As we shall see below, both (4.12) and (4.13) will be implicitly utilized in order to allow the 2 norms, respectively, using just a few nonadaptive that (cid:17)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)2 L2(cid:16)D(cid:48)(cid:48) j ,ν(cid:48)(cid:48) j (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:104)f, Tj;(cid:101)n(cid:105)(cid:16)×i∈[j+1]Di,⊗i∈[j+1]νi 2 and (cid:107)cj;((cid:101)n,··· )(cid:107)2 estimation of such (cid:107)cj;(cid:101)n(cid:107)2 samples from f . 80 4.1.6 The Proposed Method as a Sublinear-Time Compressive Sensing Algorithm Note that c, cj;(cid:101)n and cj;((cid:101)n,··· ) from (4.6) are all at most s-sparse under the assumption that  = 0. As mentioned above, this means that recovering f from a few function eval- uations is essentially equivalent to recovering c using random sampling matrices. Given this, the method we propose in the next section can also be viewed as a sublinear-time compressive sensing algorithm which uses a highly structured measurement matrix A con- sisting of several concatenated random sampling matrices. More explicitly, the measurements y ∈ Cm(cid:48) 1+m(cid:48) 2+m(cid:48) 3 utilized by Algorithm 5 below consist of function evaluations (i.e., recall (4.4)) which can be represented in the concatenated form  =  yE yI yP  Φc AI c AP c  =  c = Ac  Φ AI AP y = (4.14) for subvectors yE ∈ Cm(cid:48) sampling matrices Φ ∈ Cm(cid:48) 1, yI ∈ Cm(cid:48) 1×M D 2, yP ∈ Cm(cid:48) 2×M D , AI ∈ Cm(cid:48) 3 (recall Section 4.1.3), and structured , and AP ∈ Cm(cid:48) 3×M D . In (4.14) the matrix Φ is a standard random sampling matrix with the RIP formed as per (4.3). It and its associated samples yE = Φc are used to estimate the entries of c indexed by the index vectors contained in the identified energetic support set T in line 13 of Algorithm 5. The matrices AI and AP are both used for support identification. In particular, AI and its associated samples yI = AI c are used to try to identify all of the energetic basis functions 81 in each input dimension, i.e., the sets j :=(cid:8)(cid:101)n ∈ [M ](cid:12)(cid:12) ∃n ∈ S with nj =(cid:101)n(cid:9) ⊆ [M ] N (cid:48) for each j ∈ [D] (see Algorithm 6). The matrix AP and its associated samples yP = AP c are then used in Algorithm 7 to help build up the estimated support set T ⊂ [M ]D from the previously identified N (cid:48) The matrix AI above is built using the matrix Kronecker products (cid:101)Aj ⊗ Lj for j ∈ [D], where (cid:101)Aj ∈ Cm×[M ]D−1 is a sampling matrix associated with Uj ⊂ Dj from Section 4.1.3 defined as is the random sampling matrix defined in (4.20), and Lj ∈ CL(cid:48) j-sets. See Section 4.2 below for additional details. j×[M ] (cid:1) (cid:0)Lj q,n := Tj;n(uj,q), q ∈ [L(cid:48) j] and n ∈ [M ]. 
(4.15) The matrix AP , on the other hand, is constructed using Bj ⊗ Cj for all j ∈ [D] \ {0} where each Bj ∈ Cm1×[M ]j+1 Cj ∈ Cm2×[M ]D−j−1 the random sampling matrix defined in (4.34). In particular, we have is the random sampling matrix defined below in (4.35), and each that AI :=  (cid:101)A0 ⊗ L0 (cid:101)AD−1 ⊗ LD−1 ...  , and AP :=   . B1 ⊗ C1 ... BD−1 ⊗ CD−1 (4.16) From a set of random gaussian weights W, a marix G ∈ CL×m is defined as (G)k,(cid:96) := gk (cid:96) , k ∈ [L] and (cid:96) ∈ [m], (4.17) where gk (cid:96) ’s are the gaussian weights from (4.19). 82 Briefly contrasting the proposed approach interpreted as a sublinear-time compressive sensing method via (4.14) against previously existing sublinear-time algorithms for Com- pressive Sensing (CS) (see, e.g., [38, 39, 47, 48, 49]), we note that no previous sublinear-time CS methods exist which utilize measurement matrices solely derived from general BOS ran- dom sampling matrices. This means that the associated recovery algorithms developed herein can not directly take advantage of the standard group testing, hashing, and error correcting code-based techniques which have been regularly employed by such methods, making the development of fast reconstruction techniques and their subsequent analysis quite challeng- ing. Nonetheless, we will see that we can still utilize at least some of the core ideas of these methods by sublinearizing the runtime of one of their well known superlinear-time relatives, CoSaMP [22]. 4.2 The Proposed Method In this section we introduce and discuss our proposed method. Roughly speaking, our algorithm can be considered as a greedy pursuit algorithm (see, e.g., [50, 23, 22, 51, 52]) with a faster support identification technique that takes advantage of the structure of BOS product bases. In particular, we will focus on the CoSaMP algorithm [22] herein. Note that support identification is the most computationally expensive step of the CoSaMP algorithm. Otherwise, CoSaMP is already a sublinear-time method for any type of BOS basis one likes. Our overall strategy, therefore, will be to hijack the CoSaMP algorithm as well as its analysis by removing its superlinear-time support identification procedure and replacing it with a new sublinear-time version that still satisfies the same iteration invariant as the original. See Algorithm 5 for pseudocode of our modified CoSaMP method. Note that most its steps 83 Algorithm 5 Sublinearized CoSaMP 1: procedure SublinearRecoveryAlgorithm 2: Input: Sampling matrices Φ, AI , AP (implicitly, via the samples in G that determine their rows), samples yE = Φc, yI = AI c, yP = AP c, a sparsity estimate s, and a set of i.i.d. Gaussian weights W {Initial approximation} a0 = 0 vI ← yI, vP ← yP t ← 0 repeat 3: Output: s-sparse approximation a of c 4: 5: 6: 7: 8: 9: 10: 11: {The next line calls Algorithm 6 . . .} Nj ∀j ∈ [D] ← EntryIdentification(vI , W) {Support identification step # 1} {The next line calls Algorithm 7 . . .} {Support identification step # 2} Ω ← Pairing(vP , Nj ∀j ∈ [D]) T ← Ω ∪ supp(at) {Merge supports} † bT ← Φ {Local estimation by least-squares} T yE t ← t + 1 {Prune to obtain next approximation} at ← bs {Update current samples} vI ← yI − AI at, vP ← yP − AP at 12: 13: 14: 15: 16: 17: 18: end procedure until halting criterion true are identical to the original CoSaMP algorithm except for the two “Support identification” steps, and the “Update current samples” and “halting criterion” lines. 
Thus, our discussion will mainly focus on these three parts. Like CoSaMP, Algorithm 5 is a greedy approximation technique which makes locally optimal choices during each iteration. In the t-th iteration, it starts with an s-sparse ap- proximation at of c and then tries to approximate the at most 2s-sparse residual vector r := c − at. The two “Support identification” steps begin approximating r by finding a support set Ω ⊂ I of cardinality at most 2s which contains the set n(cid:12)(cid:12) |rn|2 ≥ (cid:107)r(cid:107)2 (cid:26) (cid:27) 2 α2s (i.e., Ω contains the indices of the entries where most of the energy of r is located). These support identification steps constitute the main modification made to CoSaMP in this chapter and are discussed in more detail in sections 4.2.1 and 4.2.2 below. After support identification, in the “Merge supports” step, a new support set T of cardinality at most 3s is then formed from 84 the union of Ω with the support of the current approximation at. At this stage T should contain the overwhelming majority of the important (i.e., energetic) index vectors in S. As a result, restricting the columns of the sampling matrix Φ to those in T (or constructing them on the fly in a low memory setting) in order to solve for bT := argmin u∈C|T|(cid:107)ΦT u − yE(cid:107)2 should yield accurate estimates for the true coefficients of c indexed by the elements of T , cT .5 The vector bs restricting bT to its s largest-magnitude elements then becomes the next approximation of c, at+1. As previously mentioned, the main difference between Algorithm 5 and CoSaMP is in the support identification steps. In the proposed method support identification consists of two parts: “Entry Identification” and “Pairing”. For each of these steps we use a different measurement matrix, AI or AP , respectively, as well as a different set of samples (either vI or vP ) from the current residual vector. Thus, we need to update a total of three estimates every iteration: vI , vP and at. In the next two sections we review each of the two newly proposed support identification steps in more detail. 5In practice, it suffices to approximate the least-squares solution bT by an iterative least-squares approach such as Richardson’s iteration or conjugate gradient [14] since computing the exact least squares solution can be expensive when s is large. The argument of [22] shows that it is enough to take three iterations for Richardson’s iteration or conjugate gradient if the initial condition is set to at, and if Φ has an RIP constant δ2s < 0.025. In fact, both of these methods have similar runtime performance. 85 4.2.1 Support Identification Step # 1: Entry Identification Algorithm 6 Entry identification 1: procedure EntryIdentification 2: Input: vI , and a set of i.i.d. Gaussian weights W for use in the hj;k below (see (4.19)) 3: Output: Nj for j ∈ [D] for j = 0 → D − 1 do 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: Nj ← ∅ for(cid:101)n = 0 → M − 1 do {The method works because mediank∈[L] (cid:12)(cid:12)(cid:12)(cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) (cid:12)(cid:12)(cid:12) ≈ (cid:107)rj;(cid:101)n(cid:107)2. See (4.18) and (4.19) for the definition of hj;k. For exactly s-sparse c one can use τ = 0 below. More generally, one can select the largest s estimates for Nj.} if mediank∈[L] (cid:12)(cid:12)(cid:12)(cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) (cid:12)(cid:12)(cid:12), then Nj ← {(cid:101)n} ∪ Nj. 
end if end for end for 15: end procedure For each j ∈ [D] the entry identification algorithm (see Algorithm 6) tries to find the j-th entry of each energetic index vector n corresponding to a nonzero entry rn in the 2s-sparse residual vector r = c − at. Note that for each j this gives rise to at most 2s index entries j to be the resulting set {nj | n ∈ supp(r)} of size at most in [M ].6 We therefore define N t 6Note that we are generally assuming herein that 2s < M . In the event that 2s ≥ M one can proceed in at least two different ways. The first way is to not change anything, and to simply be at peace with the possibility of, e.g., occasionally returning Nj = [M ]. This is our default approach. The second way is regroup the first g ∈ N variables of f together into a new collective “first variable”, the second g variables together into a new collective “second variable”, etc., for some g satisfying M g > 2s. After such regrouping 86 2s for each j ∈ [D]. Note that (cid:101)n ∈ N t j by approximating (cid:107)rj;(cid:101)n(cid:107)2 using learn N t we know (cid:13)(cid:13)(cid:13)2 j if and only if (cid:107)rj;(cid:101)n(cid:107)2 > 0. As a result we can (cid:13)(cid:13)(cid:13)(cid:104)h, Tj;(cid:101)n(cid:105)(Dj ,νj ) (cid:88) via (4.12) as long as j ,ν(cid:48) j ) L2(D(cid:48) (4.18) h(x) := rnTn(x). n∈supp(r) Whenever ticular choice of j ∈ [D] and(cid:101)n ∈ [M ], we could simply add(cid:101)n to Nj (our estimate of N t is larger than a threshold value (e.g., zero) for a par- j ,ν(cid:48) j ) L2(D(cid:48) j ) in this case. Of course we don’t actually know exactly what h is. However, we do have access to samples from h in each iteration in the form of vI and vP . And, as a result, we are able (cid:13)(cid:13)(cid:13)(cid:104)h, Tj;(cid:101)n(cid:105)(Dj ,νj ) (cid:13)(cid:13)(cid:13)2 (cid:13)(cid:13)(cid:13)(cid:104)h, Tj;(cid:101)n(cid:105)(Dj ,νj ) (cid:13)(cid:13)(cid:13)2 to approximate estimator = (cid:107)rj;(cid:101)n(cid:107)2 for any j ∈ [D] and(cid:101)n ∈ [M ] with the (cid:12)(cid:12)(cid:12)(cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) (cid:12)(cid:12)(cid:12) L2(D(cid:48) j ,ν(cid:48) j ) mediank defined using (4.19). In section 4.3.1 we show that this estimator can be used to accurately using only O(cid:0)s · K2dL(cid:48) · approximate (cid:107)rj;(cid:101)n(cid:107)2 for all 2s-sparse residual vectors r ∈ C[M ]D polylog(D, s, M, K)(cid:1) universal samples from any given r’s associated h-function in (4.18) O(cid:0)s2 · K2d2L · polylog(D, s, M, K)(cid:1)-time. (i.e., the samples in vI ).7 Furthermore, the estimator can always be computed in just At this point it is important to note that managing to find each N t j exactly for all j ∈ [D] still does not provide enough information to allow us to learn supp(r) when d > 1. In general the most we learn from this information is that supp(r) ⊂ (cid:16)×j∈[D]Nj pute (cid:104)g, Tj;(cid:101)n(cid:105) for all (cid:101)n ∈ [M ] 7Recall that L(cid:48) represents the maximum number of function evaluations one needs in order to com- (cid:110) in O(L)-time for any given j ∈ [D], and s-sparse g : Dj → C in (cid:17)(cid:84)I ⊂ [M ]D. In the algorithm can then again effectively be run as is with respect to these new collective variables. (cid:111) (cid:12)(cid:12) m ∈ [M ] . span Tj;m 87 the next “Pairing” step we address this problem by iteratively pruning the candidates in ×j∈[1]Nj,×j∈[2]Nj, . . . ,×j∈[D]Nj down at each stage to the best 2s candidates for being a prefix of some element in supp(r). 
As we shall see, the pruning in each “Pairing” stage involves energy estimates that are computed using only the samples from h in vP . These ideas are discussed in greater detail in the next section. 4.2.2 Support Identification Step # 2: Pairing Algorithm 7 Pairing 1: procedure Pairing 2: Input: vP =(cid:8)h(wj,(cid:96), zj,k)(cid:12)(cid:12) j ∈ [D] \ {0}, (cid:96) ∈ [m1], k ∈ [m2](cid:9), Nj for j ∈ [D] 3: Output: P P0 ← N0 for j = 1 → D − 1 do {This method works because Ej;((cid:101)n... ) ≈ (cid:107)rj;((cid:101)n,··· )(cid:107)2 (cid:12)(cid:12)(cid:12)2 ∀(cid:101)n ∈ Pj−1 × Nj. Ej;((cid:101)n... ) ← 1 Create Pj containing each (cid:101)n whose energy estimate Ej;((cid:101)n... ) is in the 2s-largest. 2 below.} (cid:96)∈[m1] h(wj,(cid:96), zj,k)T(cid:101)n(wj,(cid:96)) (cid:12)(cid:12)(cid:12) 1 k∈[m2] (cid:80) (cid:80) m2 4: 5: 6: 7: 8: m1 9: 10: end for P ← PD−1 11: end procedure Once all the Nj ⊂ [M ] have been identified for all j ∈ [D] it remains to match them together in order learn the true length-D index vectors in supp(r) ⊂ [M ]D. To achieve this we begin by attempting to identify all the prefixes of length two, (cid:101)n = ((cid:101)n0,(cid:101)n1) ∈ N0 × N1, which begin at least one element in the support of r. Similar to the ideas utilized above, we 88 As a result, it suffices for us to use the samples from h (recall (4.18)) in vP in order to 2 > 0. now note that ((cid:101)n, n(cid:48)) ∈ supp(r) for some n(cid:48) ∈ [M ]D−2 if and only if (cid:107)rj;((cid:101)n,··· )(cid:107)2 compute Ej;((cid:101)n... ) ≈ (cid:107)rj;((cid:101)n,··· )(cid:107)2 2 in Algorithm 7 above. The 2s-largest estimates Ej;((cid:101)n... ) are then used to identify all the prefixes of length 2 which begin at least one element of supp(r). Of course, this same idea can then be used again to find all length-3 prefixes of elements in supp(r) by extending the previously identified length-2 prefixes in all possible O(s2) combinations with the elements in N2, and then testing the resulting length-3 prefixes’ energies in order to identify the 2s most energetic such combinations, etc.. See Algorithm 7 above for pseudocode, Section 4.3.2 for analysis of these Ej;((cid:101)n... ) estimators, and Section 4.2.2.1 just below for a concrete example of the pairing process. 4.2.2.1 An Example of Entry Identification and Pairing to Find Support Assume that r ∈ C[M ]3 is three-sparse with a priori unknown energetic index vectors supp(r) = {(3, 5, 6), (4, 7, 8), (11, 5, 100)} ⊂ [M ]3 and corresponding nonzero coefficients r(3,5,6), r(4,7,8), and r(11,5,100). We can further imag- ine that M here is significantly larger than, e.g., 100 so that computing all M 3 coefficients of r using standard numerical methods would be undesirable. In this case, Algorithm 6 aims to output the sets N0 = {3, 4, 11}, N1 = {5, 7}, and N2 = {100, 6, 8} ⊂ [M ], i.e., the first, second, and third entries of each index vector in the support of r, respectively. Note that there are 18 = 3 × 2 × 3 possible index vectors which are consistent with the 89 N0, N1, and N2 above. Algorithm 7 is now tasked with finding out which of these 18 possibilities are truly elements of supp(r) without having to test them all individually.8 To identify supp(r) without having to test all 18 index vectors in N0 × N1 × N2, the pairing process instead starts by estimating the energy of the |N0|·|N1| = 6 length-2 prefixes in N0 × N1 which might begin an index vector in supp(r). 
In the ideal case these energy estimates will reveal that only 3 of these 6 possible length-2 prefixes actually have any energy, P1 = {(3, 5), (4, 7), (11, 5)} ⊂ [M ]2. In its next stage the pairing process now continues by combining these three length-2 prefixes in P1 with N2 in order to produce |P1| × |N2| = 9 final candidate elements potentially belonging to supp(r) ⊂ [M ]3. Estimating the energy of these 9 candidates then finally reveals the true identities of the index vectors in the support of r. Note that instead of computing energy estimates for all 18 possible support candidates in N0×N1×N2, the pairing process allows us to determine the correct support of r using only 15 = 6 + 9 < 18 total energy estimates in this example. Though somewhat underwhelming in this particular example, the improvement provided by Algorithm 7 becomes much more significant as the dimension D of the index vectors grows larger. When |supp(r)| = |Nj| = s for all j ∈ [D], for example, ×j∈[D]Nj will have sD total elements. Nonetheless, Algo- rithm 7 will be able to identify supp(r) using only O(cid:0)s2D(cid:1) energy estimates in the ideal setting.9 8In this simple example we can of course simply estimate the energy for all 18 possible index vectors. The three true index vectors in the support of r with nonzero energy would then be discovered and all would be well. However, this naive approach becomes spectacularly inefficient for larger D (cid:29) 3. 9In less optimal settings one should keep in mind that Algorithm 7 only finds the most energetic entries (cid:41) in general, so that P ⊃ (cid:40) n(cid:12)(cid:12) |rn|2 ≥ (cid:107)r(cid:107)2 2 α2s for a given α > 1. This is why we need to apply it iteratively. 90 4.2.3 A Theoretical Guarantee for Support Identification The following lemma and theorem show that our support identification procedure (i.e., Al- gorithm 6, followed by Algorithm 7) always identifies the indexes of the majority of the energetic entries in r. Consequently, the energy of the residual is guaranteed to decrease from iteration to iteration of Algorithm 5. We want to remind readers that r is always 2s-sparse since c and each at−1 are s-sparse in the present analysis (i.e.,  = 0). (cid:110) (cid:12)(cid:12) n ∈ I ⊆ [M ]D(cid:111) Lemma 2. Suppose that Tn (cid:12)(cid:12) n ∈ I(cid:9) is a BOS where each basis function Tn is (cid:1)M d, and K = defined as per (4.1). Let H2s be the set of all functions, h : D → C, in span(cid:8)Tn for each h ∈ H2s. Fix p ∈ (0, 1/2), 1 ≤ d ≤ D, N =(cid:0)D (cid:12)(cid:12)(cid:12)GP(cid:12)(cid:12)(cid:12) is whose coefficient vectors are 2s-sparse, and let rh ∈ CI denote the 2s-sparse coefficient vector (cid:107)Tn(cid:107)∞. Then, one can randomly select a set of i.i.d. Gaussian weights W ⊂ R for use in (4.19), and also randomly construct entry identification and pairing grids, GI ⊂ D and GP ⊂ D (recall (cid:111)(cid:17) such that the following property holds ∀h ∈ H2s with probability greater than 1 − 2p: Let h ∈ Cm(cid:48) 3 be samples from h ∈ H2s on GI and GP , respectively. If Algorithms vI h , GI , GP , and W then they will find a set Ω ⊂ [M ]D of Section 4.1.3), whose total cardinality sDL(cid:48)K2 max{log2(s) log2(DN ), log( D log4(s) log2(N ) log2(DN ), log2( D p ) , 2 and vP h ∈ Cm(cid:48) 6 and 7 are granted access to vI h, vP sup n s.t.(cid:107)n(cid:107)0≤d d (cid:110) (cid:12)(cid:12)(cid:12)GI(cid:12)(cid:12)(cid:12) + p )}+s3DK4 max cardinality 2s in line 11 of Algorithm 5 such that (cid:107)(rh)Ωc(cid:107)2 ≤ 0.202(cid:107)rh(cid:107)2. 
O(cid:16) O(cid:16) Furthermore, the total runtime complexity of Algorithms 6 and 7 is always s2DLK2 max{log2(s) log2(DN ), log( D p )}+s5D2K4 max 91 (cid:110) log4(s) log2(N ) log2(DN ), log2( D p ) (cid:111)(cid:17) . In Lemma 2 above L(cid:48) denotes the maximum number of points in Dj required in order to determine the value of (cid:104)g, Tj;(cid:101)n(cid:105)(Dj ,νj ) for all (cid:101)n ∈ [M ] in O(L)-time for any given s-sparse (cid:12)(cid:12) m ∈ [M ](cid:9), maximized over all j ∈ [D]. See Section 4.1.2 for g : Dj → C in span(cid:8)Tj;m a more in depth discussion of these quantities. The reader is also referred back to Section 4.1.3 for a discussion of the entry identification and pairing grids, GI and GP , mentioned in Lemma 2. The proof of Lemma 2, which is quite long and technical, is given in Section 4.3.3. Once Lemma 2 has been established, however, it is fairly straightforward to prove that Algorithm 5 will always rapidly recover any function of D-variables which exhibits sparsity in a tensor product basis by building on the results in [22]. We have the following theorem. (cid:110) Tn (cid:12)(cid:12) n ∈ I ⊆ [M ]D(cid:111) Theorem 7. Suppose that is defined as per (4.1). Let Fs be the subset of all functions f ∈ span(cid:8)Tn each f ∈ Fs. Fix p ∈ (0, 1/3), 1 ≤ d ≤ D, N =(cid:0)D coefficient vectors are s-sparse, and let cf ∈ CI denote the s-sparse coefficient vector for (cid:107)Tn(cid:107)∞, and a (cid:12)(cid:12) n ∈ I(cid:9) whose (cid:1)M d, K = is a BOS where each basis function Tn sup d n s.t.(cid:107)n(cid:107)0≤d O(cid:16) precision parameter η > 0. Then, one can randomly select a set of i.i.d. Gaussian weights W ⊂ R for use in (4.19), and also randomly construct a compressive sensing grid, G ⊂ D, (cid:111)(cid:17) whose total cardinality |G| is such that the following property holds ∀f ∈ Fs with probability greater than 1 − 3p: Let y consist of samples from f ∈ Fs on G. If Algorithm 5 is granted access to y, G, and W, then sDL(cid:48)K2 max{log2(s) log2(DN ), log( D log4(s) log2(N ) log2(DN ), log2( D p ) p )} + s3DK4 max (cid:110) , it will produce an s-sparse approximation a such that (cid:107)cf − a(cid:107)2 ≤ Cη. 92 Here C > 0 is an absolute constant. Furthermore, the total runtime complexity of Algorithm 5 is always O(cid:16)(cid:16) × log (cid:17) s2D2LK2 max{log2(s) log2(DN ), log( D when |G| is bounded as above. (cid:107)cf(cid:107)2 η p )(cid:9)(cid:17) p )}+s5D2K4 max(cid:8) log4(s) log2(N ) log2(DN ), log2( D As above, we refer the reader back to Section 4.1.2 for a discussion of the quantities L(cid:48) and L appearing in Theorem 7, as well as to Section 4.1.3 for more information on the compressive sensing grid G ⊂ D mentioned therein. Proof. In order to analyze the support identification step, we replace Lemma 4.2 in [22] by Lemma 2, and then we obtain (cid:107)cf − at+1(cid:107)2 ≤ 0.5(cid:107)cf − at(cid:107)2 for each iteration t ≥ 0, which is the same as in Theorem 2.1 in [22] provided that f is exactly s-sparse and samples are not noisy. Except for the support identification step(s), Algorithm (cid:18) (cid:19) 5 agrees with CoSaMP, so that Lemmas 4.3 – 4.5 in [22] still directly apply to Algorithm 5. After O iterations, we see that the s-sparse approximation a therefore satisfies (cid:107)cf(cid:107)2 log η (cid:107)cf − a(cid:107)2 ≤ Cη. Since the runtime complexity of the support identification steps and the sample update process in each iteration, the total running time arises from multiplying it with the number of iterations. 
The number of sample points, m(cid:48) Φ, (cid:101)Aj, Bj and Cj discussed in Section 4.1.6 are all chosen to ensure that these resulting 1, m, m1 and m2 used to define the matrices measurement matrices have the RIP. Thus, the samples of f in y can be reused over as many 93 iterations as needed. Updating the samples for the entry identification or pairing causes an extra O(sDm(cid:48) 3) computations respectively which causes a D2 factor instead 2) and O(sDm(cid:48) of D in the first term of runtime complexity. The probability of successful recovery for all f ∈ Fs is obtained by taking the union bound over the failure probability p of Φ having δ2s < 0.025 via Theorem 6 together with the failure probability 2p of Lemma 2. We are now prepared to begin the process of proving Lemma 2. 4.3 Analysis of the Support Identification In this section, we analyze the performance of the sublinear-time support identification tech- nique proposed herein. First, we show in Section 4.3.1 the success of the entry identification step. Indeed, Theorems 8 and 9 show under the RIP assumption that certain one-dimensional proxy functions allow us to identify the entry with large corresponding coefficients. Lemma 6 then estimates the necessary sample complexity. In Section 4.3.2, we analyze the pairing step showing that it works uniformly for any 2s-sparse functions in Theorem 10. Finally, in Section 4.3.3, we complete the proof of the Lemma 2 providing the complete result for the proposed support identification method. 4.3.1 Entry Identification In this section, we seek to find Nj containing the j-th entries of the index vectors of the nonzero transform coefficients for all j ∈ [D]. Define [D](cid:48) := [D] \ {j}. Assume without loss of generality that the number L ∈ N of proxy functions is odd. Choose Xj := {xj,(cid:96)}(cid:96)∈[m] where each xj,(cid:96) is chosen independently at random from D(cid:48) according to dν(cid:48) m}k∈[L] where each gk (cid:96) is an i.i.d. standard Gaus- j j. Also, choose {gk 1 ,··· , gk 94 sian variable ∼ N (0, 1), which forms W introduced in Section 4.1.3. We define a function hj;k : Dj → C of one variable in Dj as follows, (cid:88) (cid:88) (cid:88) (cid:89) hj;k(x) := 1√ m (cid:96)∈[m] gk (cid:96) h([x, xj,(cid:96)]) = 1√ m gk (cid:96) (cid:96)∈[m] n∈supp(r) rnTj;nj (x) Ti;ni (xj,(cid:96))i, i∈[D](cid:48) and [x, xj,(cid:96)] is the vector obtained by inserting the variable x between the entries of xj,(cid:96) indexed by [j] and {j, j + 1··· , D − 2}, i.e., (4.19) [x, xj,(cid:96)] :=(cid:0)(xj,(cid:96))0, (xj,(cid:96))1,··· , (xj,(cid:96))j−1, x , (xj,(cid:96))j,··· , (xj,(cid:96))D−2 (cid:1) . For the sake of simplicity, we let T ˇn(ˇx) := (cid:81) (ˇxi) with ˇn ∈ [M ]D−1 and ˇx ∈ We choose m large enough above to form the normalized random sampling matrix (cid:101)Aj ∈ Cm×[M ]D−1 for each j ∈ [D], defined as i∈[D](cid:48) Ti;ˇni D(cid:48) j. (cid:17) (cid:16)(cid:101)Aj := (cid:96), ˇn 1√ m T ˇn(xj,(cid:96)), (cid:96) ∈ [m], ˇn ∈ [M ]D−1, (4.20) so that each one has a restricted isometry constant δ2s of at most δ with high probability in its restricted form. Here, the restricted form is introduced by eliminating the columns of the full (cid:101)Aj indexed by vectors n /∈ Ij;(cid:101)n. To explain further, we denote (cid:101)AjPj;(cid:101)nr = (cid:101)Ajrj;(cid:101)n where Pj;(cid:101)n is a restriction matrix defined in (4.11). 
This comes from the inner product in line 10 of Algorithm 6 which is calculated by using the evaluations of hj,k at Uj defined in (cid:17) Section 4.1.3. Then, the inner product can be written as as in (4.17). In other words, the evaluations of h at [uj,k, xj,(cid:96)] from GI G(cid:101)AjPj;(cid:101)nr k j in Section 4.1.3 are where G is defined (cid:16) 95 utilized to compute the inner product. Then, the matrix-vector multiplication (cid:101)Ajrj;(cid:101)n can be considered in its restricted form by eliminating the columns of (cid:101)Aj and elements of rj;(cid:101)n which are zero due to their corresponding index vectors not belonging to Ij;(cid:101)n. The resulting restricted matrix (cid:101)Aj has the size m× N(cid:48) where N(cid:48) is bounded above in (4.8) so that (cid:101)Aj has the restricted isometry constant δ2s mentioned. The advantage of forming RIP matrices in this fashion is that it allows us to analyze the different iterations of Algorithm 5 repeatedly Theorem 6). with the same RIP matrices. For a discussion about the number of measurements needed to ensure that (cid:101)Aj satisfies the RIP, see the following lemma (which is a simple consequence of Lemma 3. Let (cid:101)Aj ∈ Cm×N(cid:48) (cid:26) be the random sampling matrix as in (4.20) in its restricted form. If, for δ, p ∈ (0, 1), (cid:18)8e(D − 1)DM (cid:18) D (cid:19)(cid:27) (cid:19) m ≥ aK2δ−2s max d log2(4s) log(9m) log d , log p , then with probability at least 1 − p, the restricted isometry constant δs of (cid:101)Aj satisfies δs ≤ δ for all j ∈ [D]. The constant a > 0 is universal. also As we will show, more than half of the proxy functions, {hj;k}k∈[L] are guaranteed with (cid:12)(cid:12)(cid:12)(cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) (cid:12)(cid:12)(cid:12) bounded above and below by (cid:107)rj;(cid:101)n(cid:107)2 up to some constants for all high probability to have (cid:107)hj;k(cid:107)L2(Dj ,νj ) bounded above by (cid:107)r(cid:107)2 up to some constant, and (cid:101)n ∈ [M ] and j ∈ [D]. To show this we consider the indicator variable Eh,j,(cid:101)n,k which is 1 if and only if all three 4(cid:107)rj;(cid:101)n(cid:107)2 ≥(cid:12)(cid:12)(cid:12)(cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) 2 for the absolute constant α(cid:48) defined in Lemma 4, 12 (cid:107)rj;(cid:101)n(cid:107)2, and 1. (cid:107)hj;k(cid:107)2 2. 9 ≤ α(cid:48)(cid:107)r(cid:107)2 (cid:12)(cid:12)(cid:12) ≥ L2(Dj ,νj ) √ of 23 96 3. the vector of Gaussian weights gk ∈ Rm satisfying 1 2 m ≤ (cid:107)gk(cid:107)2 2 ≤ 3 2m, are simultaneously true, and 0 otherwise. The proof will proceed as follows. Lemmas 4 and 5 together with the bound on (cid:107)gk(cid:107)2 through Bernstein’s inequality imply that the probability of each Eh,j,(cid:101)n,k being 1 is greater than 0.5. Combining this with Chernoff bound, the deviation of (cid:80) k∈[L] Eh,j,(cid:101)n,k below proxy functions, i.e., sufficiently large L, the probability that(cid:80) k∈[L] Eh,j,(cid:101)n,k < L/2 for all (h, j,(cid:101)n) ∈ H × [D] × [M ] becomes very small, which is shown in Theorem 8. The number its expectation shows exponential decay in its distribution. Then, with sufficiently many L logarithmically depends on DM|H|. In order to get the desired properties for all 2s- sparse functions satisfying our support assumption, the finite function set H is taken as H with corresponding coefficient vector set R which is an -cover over all normalized 2s- sparse vectors in CN in Theorem 9. 
Thus, with high probability, the desired properties hold uniformly, i.e., for all functions of interest, and for ∀j ∈ [D] and(cid:101)n ∈ [M ]. The following lemma bounds the energy of the proxy functions. Lemma 4. Suppose that r ∈ CN is 2s-sparse, and the restricted isometry constant δ2s of (cid:101)Aj satisfies δ2s ≤ δ for all j ∈ [D] where δ ∈ (0, 7/16). Then, for each k ∈ [L], there exists an absolute constant α(cid:48) ∈ R+ such that (cid:20) P (cid:107)hj;k(cid:107)2 L2(Dj ,νj ) ≥ α(cid:48)||r||2 2 (cid:21) ≤ .025 (4.21) for all j ∈ [D]. Proof. Consider the random sampling point set Xj = (cid:8)xj,(cid:96) | (cid:96) ∈ [m](cid:9) to be fixed for the moment. We begin by noting that 97 dνj(x) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)2 (cid:3)) rnrn(cid:48) (cid:89) |hj;k(x)|2dνj(x) (cid:88) m (cid:96)∈[m] (cid:96) gk gk (cid:96)(cid:48) gk (cid:96) gk (cid:96),(cid:96)(cid:48)∈[m] (cid:96),(cid:96)(cid:48)∈[m] (cid:90) (cid:90) Dj Dj 1 m (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) 1√ (cid:88) (cid:88) = 1 m × (cid:90) gk Dj n,n(cid:48)∈supp(r) (cid:96) h((cid:2)x, xj,(cid:96) (cid:90) (cid:96)(cid:48) (cid:88)  j h([x, xj,(cid:96)])h([x, xj,(cid:96)(cid:48)]) dνj(x) Ti;ni (xj,(cid:96))i i∈[D](cid:48) (cid:89) i(cid:48)∈[D](cid:48) Ti(cid:48);n(cid:48) i(cid:48) (xj,(cid:96)(cid:48))i(cid:48) (4.22) (cid:89) i(cid:48)∈[D](cid:48) Ti(cid:48);n(cid:48) i(cid:48) (xj,(cid:96)(cid:48))i(cid:48) Ti;ni (xj,(cid:96))i  (cid:107)hj;k(cid:107)2 L2(Dj ,νj ) = = = Tj;nj (x)Tj;n(cid:48) (x) dνj(x) Dj rnrn(cid:48) (cid:89) (cid:88) (cid:88) (cid:101)n∈[M ] n,n(cid:48) s.t. j =(cid:101)n (cid:17) (cid:16)(cid:101)Ajrj;(cid:101)n (cid:17) (cid:16)(cid:101)Ajrj;(cid:101)n nj =n(cid:48) (cid:96)(cid:48), (cid:96) i∈[D](cid:48) gk (cid:96) gk (cid:96)(cid:48) (cid:96),(cid:96)(cid:48)∈[m] (cid:88) gk (cid:96) gk (cid:96)(cid:48) = 1 m (cid:88) (cid:88) (cid:101)n∈[M ] where (cid:101)Aj ∈ Cm×N(cid:48) = (4.23) is the restricted vector from (4.7). Thus, we can see that (cid:96),(cid:96)(cid:48)∈[m] is the restricted random sampling matrix from (4.20) and rj;(cid:101)n ∈ CN(cid:48) gk(cid:13)(cid:13)(cid:13)2 (cid:88) (cid:101)n∈[M ] (cid:21) = (cid:107)(cid:101)AjR(cid:107)2 (cid:17)∗ (cid:16)(cid:101)AjR where R ∈ CN(cid:48)×M is the matrix whose (cid:101)nth column is rj;(cid:101)n. This yields results that (cid:17)∗ F. gk ∼ N (0, U Σ2U∗), where U ΣV ∗ is the SVD of (cid:12)(cid:12)(cid:104)(cid:101)Ajrj;(cid:101)n, gk(cid:105)(cid:12)(cid:12)2 = L2(Dj ,νj ) Now observe that (cid:13)(cid:13)(cid:13)(gk)∗(cid:101)AjR (cid:107)hj;k(cid:107)2 L2(Dj ,νj ) (cid:13)(cid:13)(cid:13)(cid:16)(cid:101)AjR (cid:16)(cid:101)AjR (cid:107)hj;k(cid:107)2 (cid:13)(cid:13)(cid:13)2 (cid:17)∗ E gk (cid:20) and = = 2 2 98 Σ ∈ Rmin{M,m}×min{M,m} is the diagonal matrix containing at most min{M, m} nonzero singular values, σ1 ≥ σ2 ≥ ··· ≥ σmin{M,m} ≥ 0, of (cid:101)AjR. Let (cid:101)gk := ΣV ∗gk ∼ N (0, Σ2) and note that (cid:107)U(cid:101)gk(cid:107)2 2. As a consequence, one can see that 2 = (cid:107)(cid:101)gk(cid:107)2 (cid:20) P (cid:107)hj;k(cid:107)2 L2(Dj ,νj ) ≥ t (cid:21) = P(cid:104)(cid:107)(cid:101)gk(cid:107)2 (cid:105) 2 ≥ t = P min{M,m}(cid:88) (cid:96)=1  (cid:96) X(cid:96) ≥ t σ2 holds for all t ∈ R, where each X(cid:96) is an i.i.d χ2 random variable. 
Applying the Bernstein type inequality given in Proposition 5.16 in [53] we deduce that (cid:20)(cid:12)(cid:12)(cid:12)(cid:12)(cid:107)hj;k(cid:107)2 P L2(Dj ,νj ) − (cid:107)(cid:101)AjR(cid:107)2 F (cid:12)(cid:12)(cid:12)(cid:12) ≥ t (cid:21) (cid:41)(cid:33) (4.24) ≤ 2 exp ≤ 2 exp (cid:41)(cid:33) (cid:32) −a(cid:48) min (cid:32) −a(cid:48) min (cid:40) (cid:40) t t2 (cid:107)σ(cid:107)4 4 t2 F , , t F (cid:107)(cid:101)AjR(cid:107)4 (cid:107)σ(cid:107)2∞ (cid:107)(cid:101)AjR(cid:107)2 log 80/a(cid:48),(cid:112)log 80/a(cid:48)(cid:111)(cid:107)(cid:101)AjR(cid:107)2 (cid:110) (cid:40) (cid:107)(cid:101)AjR(cid:107)2 (cid:41)(cid:33) (cid:114) , F log 80 a(cid:48) log 80 a(cid:48) where a(cid:48) ∈ R+ is an absolute constant, and σ is the vector containing the diagonal elements of Σ. An application of (4.24) with t = max F finally tells us that (cid:107)hj;k(cid:107)2 L2(Dj ,νj ) ≥ 1 + max (cid:32) will hold with probability at most 1/40. Turning our attention to (cid:107)(cid:101)AjR(cid:107)2 (cid:88) (cid:101)n∈[M ] (cid:107)(cid:101)AjR(cid:107)2 F = F =(cid:80)(cid:101)n∈[M ] (cid:107)(cid:101)Ajrj;(cid:101)n(cid:107)2 (cid:88) (cid:107)(cid:101)Ajrj;(cid:101)n(cid:107)2 (cid:107)rj;(cid:101)n(cid:107)2 (cid:101)n∈[M ] 2 ≥ 1 2 2, we assert that 2 = (cid:107)r(cid:107)2 2 1 2 holds since the restricted isometry constant δ2s of (cid:101)Aj is assumed to be bounded above by (cid:0)1 + max(cid:8) log 80 16. Therefore, we finally get the desired probability estimate with α(cid:48) := 1 7 2 , a(cid:48) 99 (cid:113) log 80 a(cid:48) (cid:9)(cid:1). The following lemma bounds the estimated inner products. Then, there exists an absolute constant β(cid:48) ∈ R+ such that Lemma 5. Suppose that r ∈ CN is 2s-sparse, and the restricted isometry constant δ2s of (cid:101)Aj satisfies δ2s ≤ δ for all j ∈ [D] where δ ∈ (0, 7/16). Let k ∈ [L], j ∈ [D], and (cid:101)n ∈ [M ]. (cid:34)(cid:12)(cid:12)(cid:12)(cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) (cid:12)(cid:12)(cid:12)(cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) (cid:107)rj;(cid:101)n(cid:107)2 or ≤ 0.273. (4.25) (cid:12)(cid:12)(cid:12) ≥ 9 (cid:107)rj;(cid:101)n(cid:107)2 (cid:12)(cid:12)(cid:12) ≤ √ 23 12 (cid:35) P 4 gk (cid:96) (cid:17) gk (cid:96) (cid:96)∈[m] (4.26) . (cid:96) rn i∈[D](cid:48) Ti;ni (cid:96)∈[m] n s.t. 1√ m (xj,(cid:96))i = (cid:88) (cid:88) (cid:88) nj =(cid:101)n (cid:16)(cid:101)Ajrj;(cid:101)n Proof. Consider the random sampling point set Xj to be fixed for the time being. Recalling the definitions of hj;k and of rj;(cid:101)n, one can see that (cid:89) (cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) = Looking at (4.26) one can see that (cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) ∼ N (0,(cid:107)(cid:101)Ajrj;(cid:101)n(cid:107)2 (cid:34)(cid:12)(cid:12)(cid:12)(cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) (cid:12)(cid:12)(cid:12) ≥ 3(cid:107)(cid:101)Ajrj;(cid:101)n(cid:107)2 16(cid:107)rj;(cid:101)n(cid:107)2 (cid:35) 2 ≤ (cid:107)(cid:101)Ajrj;(cid:101)n(cid:107)2 holds. Combining (4.27) and the assumption on δ2s, which yields 9 (cid:12)(cid:12)(cid:12)(cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) 16(cid:107)rj;(cid:101)n(cid:107)2 Theorem 8. Let H be a finite set of functions h whose BOS coefficient vectors are 2s-sparse, and let rh ∈ CN denote the coefficient vector for each h ∈ H. Suppose that the restricted isometry constant δ2s of (cid:101)Aj satisfies δ2s ≤ δ for all j ∈ [D] where δ ∈ (0, 7/16). 
Furthermore, (cid:12)(cid:12)(cid:12) ≤ (cid:107)(cid:101)Ajrj;(cid:101)n(cid:107)2 2, establishes the desired result. ≤ 0.273 (4.27) 2) and hence, 2 ≤ or 3 P 23 100 [M ] with probability at least 1 − p. That is, with probability at least 1 − p, the following will let p ∈ (0, 1), L ∈ N be odd, and L ≥(cid:101)γ log(DM|H|/p) hold for a sufficiently large absolute constant(cid:101)γ ∈ R+. Then,(cid:80) k∈[L] Eh,j,(cid:101)n,k > L/2 simultaneously for all (h, j,(cid:101)n) ∈ H × [D] × hold simultaneously for each (h, j,(cid:101)n) ∈ H × [D] × [M ]: All three of 1. (cid:107)hj;k(cid:107)2 2. 9 3. the vector of Gaussian weights gk ∈ Rm satisfying 1 4(cid:107)(rh)j;(cid:101)n(cid:107)2 ≥(cid:12)(cid:12)(cid:12)(cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) 2 for the absolute constant α(cid:48) defined in Lemma 4, 12 (cid:107)(rh)j;(cid:101)n(cid:107)2, and ≤ α(cid:48)(cid:107)rh(cid:107)2 (cid:12)(cid:12)(cid:12) ≥ L2(Dj ,νj ) 2 m ≤ (cid:107)gk(cid:107)2 2 ≤ 3 2 m, √ 23 will be simultaneously true for more than half of the k ∈ [L]. Proof. Let h ∈ H, k ∈ [L], and (cid:101)n ∈ [M ]. The probabilities that the first and second properties fail are given in (4.21) and (4.25), respectively. For the third property, applying the Bernstein type inequality given in Proposition 5.16 in [53], one obtains P(cid:104)(cid:12)(cid:12)(cid:12)(cid:107)gk(cid:107)2 2 − m (cid:12)(cid:12)(cid:12) ≥ m 2 (cid:105) ≤ 2e−a(cid:48)(cid:48)m ≤ 0.03, (4.28) (cid:105) ≤ Eh,j,(cid:101)n,k = 0 where a(cid:48)(cid:48) ∈ R+ is an absolute constant. Combining (4.21), (4.25), (4.28) via a union bound now tell us that P(cid:104)  < e−L/¯γ ≤ (cid:88)  = P Eh,j,(cid:101)n,k < L/2 (1 − Eh,j,(cid:101)n,k) > L/2 328/1000. Utilizing the Chernoff bound (see, e.g., [54, 55]) one now sees that (cid:88) P k∈[L] k∈[L] p DM|H| for an absolute constant ¯γ ∈ R+, where the last inequality follows by choosing(cid:101)γ = ¯γ in the assumption. Applying the union bound over all choices of (h, j,(cid:101)n) ∈ H × [D] × [M ] now establishes the desired result. Theorem 9. Let H2s be the set of all functions h whose coefficient vectors are 2s-sparse, 101 and let rh ∈ CN denote the coefficient vector for each h ∈ H2s. Suppose that the restricted isometry constant δ2s of (cid:101)Aj satisfies δ2s ≤ δ for all j ∈ [D] where δ ∈ (0, 7/16). Fur- (cid:33) thermore, let p ∈ (0, 1), L ∈ N be odd, and assume that L ≥ γ(cid:48)s · d · log for sufficiently large absolute constant γ(cid:48) ∈ R+. Then, with probability greater than 1 − p, one has (cid:80) k∈[L] Eh,j,(cid:101)n,k > L/2 simultaneously for all (h, j,(cid:101)n) ∈ H × [D] × [M ]. Consequently, with probability greater than 1−p, it will hold that for all choices of (h, j,(cid:101)n) ∈ H2s×[D]×[M ] DM sd p1/s (cid:32) d(cid:113) both L2(Dj ,νj ) ≤ (α(cid:48) + 1)(cid:107)rh(cid:107)2 1. (cid:107)hj;k(cid:107)2 2. 9 2(cid:107)(rh)j;(cid:101)n(cid:107)2 ≥(cid:12)(cid:12)(cid:12)(cid:104)hj;k, Tj;(cid:101)n(cid:105)(Dj ,νj ) (cid:12)(cid:12)(cid:12) ≥ 1 3(cid:107)(rh)j;(cid:101)n(cid:107)2 are true simultaneously for more than half of the k ∈ [L]. 2 for the absolute constant α(cid:48) defined in Lemma 4, and (cid:17)2s (cid:17)2s(cid:16) Proof. Define R ⊂ CN as a finite -cover of all 2s-sparse coefficient vectors r ∈ CN with (cid:1)M d and  ∈ (0, 1). Such covers exist of cardinality (cid:107)r(cid:107)2 = 1, together with 0, where N =(cid:0)D |R| ≤(cid:16) eN choices of (h, j,(cid:101)n) ∈ H × [D] × [M ], Properties 1 – 3 of Theorem 8 will hold for more than (see, e.g., Appendix C of [37]). 
Define $\mathcal{H}$ as the set of functions corresponding to the $2s$-sparse coefficient vectors in $\mathcal{R}$. Assume that for $H = \mathcal{H}$ and all choices of $(h, j, \tilde{n}) \in \mathcal{H}\times[D]\times[M]$, Properties 1 - 3 of Theorem 8 hold for more than half of the $k \in [L]$. By the theorem, this event will happen with probability at least $1-p$. We will now prove that under this assumption both Properties 1 and 2 above will hold as desired.

Let $\tilde{n} \in [M]$, $j \in [D]$, consider $h \in H_{2s}$ with coefficient vector $r = r_h$, and let $\tau := \|r\|_2$. Then, there exists an $h' \in \mathcal{H}$ with coefficient vector $r' \in \mathcal{R}$ such that both $\|r'\|_2 = 1$ and $\|r - \tau r'\|_2 \le \tau\varepsilon$ hold. Finally, let $k \in [L]$ be one of the values for which Properties 1 - 3 of Theorem 8 are simultaneously true for $(h', j, \tilde{n})$. We will begin by establishing Property 1 above for $h$, $j$ and $k$. Using (4.19) we have that

\[ \|h_{j;k}\|_{L^2(\mathcal{D}_j,\nu_j)} = \Bigg\| \frac{1}{\sqrt{m}}\sum_{\ell\in[m]} g^k_\ell \sum_{n\in\mathrm{supp}(r)} r_n\,T_{j;n_j}(x)\prod_{i\in[D]'} T_{i;n_i}\big((x_{j,\ell})_i\big) \Bigg\|_{L^2(\mathcal{D}_j,\nu_j)} \quad (4.29) \]
\[ \le \tau\,\big\|h'_{j;k}\big\|_{L^2(\mathcal{D}_j,\nu_j)} + \big\|(h-\tau h')_{j;k}\big\|_{L^2(\mathcal{D}_j,\nu_j)} \le \tau\sqrt{\alpha'}\,\|r'\|_2 + \big\|(h-\tau h')_{j;k}\big\|_{L^2(\mathcal{D}_j,\nu_j)} = \tau\sqrt{\alpha'} + \big\|(h-\tau h')_{j;k}\big\|_{L^2(\mathcal{D}_j,\nu_j)}, \quad (4.30) \]

where the last inequality follows from the first property of Theorem 8 holding for $h'$. Repeating the expansion from the proof of Lemma 4 for $\|(h-\tau h')_{j;k}\|^2_{L^2(\mathcal{D}_j,\nu_j)}$, one obtains

\[ \big\|(h-\tau h')_{j;k}\big\|^2_{L^2(\mathcal{D}_j,\nu_j)} = \sum_{\tilde{n}\in[M]} \Big|\big\langle \tilde{A}_j(r-\tau r')_{j;\tilde{n}},\, g^k \big\rangle\Big|^2 \le \sum_{\tilde{n}\in[M]} \big\|\tilde{A}_j(r-\tau r')_{j;\tilde{n}}\big\|_2^2\,\|g^k\|_2^2 \le \frac{9}{4}\,m \sum_{\tilde{n}\in[M]} \big\|(r-\tau r')_{j;\tilde{n}}\big\|_2^2, \]

where the last inequality follows from the third property of Theorem 8, and $\tilde{A}_j$ having $\delta_{2s} \le 7/16$. Continuing, we can see that

\[ \big\|(h-\tau h')_{j;k}\big\|^2_{L^2(\mathcal{D}_j,\nu_j)} \le \frac{9}{4}\,m\,\big\|r-\tau r'\big\|_2^2 \le \frac{9}{4}\,m\,\tau^2\varepsilon^2. \]

Combining this expression with (4.30) we now learn that

\[ \|h_{j;k}\|_{L^2(\mathcal{D}_j,\nu_j)} \le \tau\left(\sqrt{\alpha'} + \frac{3}{2}\sqrt{m}\,\varepsilon\right). \]

Making sure to use, e.g., an $\varepsilon \le \big(6\sqrt{\alpha' m}\big)^{-1}$ ensures Property 1.

Turning our attention to establishing Property 2 above for $h$, $\tilde{n}$, and $k$, we now choose an $h' \in \mathcal{H}$ whose coefficient vector $r' \in \mathcal{R}$ has $\|r'\|_2 = 1$, and also satisfies

\[ \big\|r_{j;\tilde{n}} - \tau' r'\big\|_2 \le \tau'\varepsilon \qquad \text{for } \tau' := \|r_{j;\tilde{n}}\|_2. \quad (4.31) \]
Note that all nonzero entries of $r'_{j;\tilde{n}}$ agree with those of $r'$; the latter vector only has certain additional nonzero entries, in locations where $r'_{j;\tilde{n}}$ vanishes. Consequently, replacing $r'$ by $r'_{j;\tilde{n}}$ makes the left hand side of (4.31) smaller, and one obtains that $\|r_{j;\tilde{n}} - \tau' r'_{j;\tilde{n}}\|_2 \le \tau'\varepsilon$.

From (4.19) one can see that

\[ \big|\langle h_{j;k}, T_{j;\tilde{n}}\rangle_{(\mathcal{D}_j,\nu_j)}\big| \ge \big|\langle \tau' h'_{j;k}, T_{j;\tilde{n}}\rangle_{(\mathcal{D}_j,\nu_j)}\big| - \Big|\big\langle h_{j;k}-\tau' h'_{j;k},\, T_{j;\tilde{n}}\big\rangle_{(\mathcal{D}_j,\nu_j)}\Big| \ge \frac{\sqrt{23}}{12}\,\tau'\big\|r'_{j;\tilde{n}}\big\|_2 - \Bigg|\sum_{\ell\in[m]} \Big(\tilde{A}_j(r-\tau'r')_{j;\tilde{n}}\Big)_\ell g^k_\ell\Bigg|, \quad (4.32) \]

where the last inequality follows from the second property of Theorem 8 holding for $h'$. Continuing using (4.32), together with $\|(r-\tau'r')_{j;\tilde{n}}\|_2 \le \tau'\varepsilon$, the third property of Theorem 8, and $\tilde{A}_j$ having $\delta_{2s} \le 7/16$, we have that

\[ \big|\langle h_{j;k}, T_{j;\tilde{n}}\rangle_{(\mathcal{D}_j,\nu_j)}\big| \ge \frac{\sqrt{23}}{12}\Big(\|r_{j;\tilde{n}}\|_2 - \big\|(\tau'r'-r)_{j;\tilde{n}}\big\|_2\Big) - \Big|\big\langle \tilde{A}_j(r-\tau'r')_{j;\tilde{n}},\, g^k\big\rangle\Big| \ge \frac{\sqrt{23}}{12}\|r_{j;\tilde{n}}\|_2 - \frac{\sqrt{23}}{12}\,\tau'\varepsilon - \big\|\tilde{A}_j(r-\tau'r')_{j;\tilde{n}}\big\|_2\,\|g^k\|_2 \ge \left(\frac{\sqrt{23}}{12} - \frac{\sqrt{23}}{12}\,\varepsilon - \frac{3}{2}\sqrt{m}\,\varepsilon\right)\|r_{j;\tilde{n}}\|_2. \]

On the other hand,

\[ \big|\langle h_{j;k}, T_{j;\tilde{n}}\rangle_{(\mathcal{D}_j,\nu_j)}\big| \le \big|\langle \tau' h'_{j;k}, T_{j;\tilde{n}}\rangle_{(\mathcal{D}_j,\nu_j)}\big| + \Big|\big\langle h_{j;k}-\tau' h'_{j;k},\, T_{j;\tilde{n}}\big\rangle_{(\mathcal{D}_j,\nu_j)}\Big|. \]
As above, we obtain using (4.32) that

\[ \big|\langle h_{j;k}, T_{j;\tilde{n}}\rangle_{(\mathcal{D}_j,\nu_j)}\big| \le \frac{9}{4}\Big(\|r_{j;\tilde{n}}\|_2 + \big\|(\tau'r'-r)_{j;\tilde{n}}\big\|_2\Big) + \Big|\big\langle \tilde{A}_j(r-\tau'r')_{j;\tilde{n}},\, g^k\big\rangle\Big| \le \frac{9}{4}\|r_{j;\tilde{n}}\|_2 + \frac{9}{4}\,\tau'\varepsilon + \big\|\tilde{A}_j(r-\tau'r')_{j;\tilde{n}}\big\|_2\,\|g^k\|_2 \le \frac{9}{4}\left(1 + \varepsilon + \frac{2}{3}\sqrt{m}\,\varepsilon\right)\|r_{j;\tilde{n}}\|_2. \]

Once again, making sure to use, e.g., an $\varepsilon \le \big(6\sqrt{\alpha' m}\big)^{-1}$ will now ensure Property 2 for $h$ as well.

Lemma 6. Let $H_{2s}$ be the set of all functions $h$ whose coefficient vectors are $2s$-sparse, and let $r_h \in \mathbb{C}^N$ denote the coefficient vector for each $h \in H_{2s}$. Furthermore, let $\delta \in (0, 7/16)$, $p \in (0,1)$, $L \in \mathbb{N}$ be odd, and assume that

\[ m \ge \tilde{\beta}' K^2 \delta^{-2}\, s\, \max\Big\{ d\,\log^2(4s)\,\log(9m)\,\log\big(8e(D-1)DM/d\big),\ \log(2D/p) \Big\} \quad\text{and}\quad L \ge \gamma'\, s\cdot d\cdot \log\!\left(\frac{DMsd}{(p/2)^{1/s}}\right) \]

for sufficiently large absolute constants $\tilde{\beta}', \gamma' \in \mathbb{R}^+$. Then, with probability greater than $1-p$, all of the following will hold for all $(h, j, \tilde{n}) \in H_{2s}\times[D]\times[M]$: both

1. $\|h_{j;k}\|^2_{L^2(\mathcal{D}_j,\nu_j)} \le (\alpha'+1)\|r_h\|_2^2$ for the absolute constant $\alpha'$ defined in Lemma 4, and
2. $\frac{9}{2}\|(r_h)_{j;\tilde{n}}\|_2 \ge \big|\langle h_{j;k}, T_{j;\tilde{n}}\rangle_{(\mathcal{D}_j,\nu_j)}\big| \ge \frac{1}{3}\|(r_h)_{j;\tilde{n}}\|_2$

will be simultaneously true for more than half of the $k \in [L]$.

Proof. Let $A$ be the event that for all $(h, j, \tilde{n}) \in H_{2s}\times[D]\times[M]$, Properties 1 and 2 in Theorem 9 simultaneously hold for more than half of the $k \in [L]$, and let $B$ be the event that the restricted isometry constant $\delta_{2s}$ of $\tilde{A}_j$ satisfies $\delta_{2s} \le \delta$ for all $j \in [D]$. By Theorem 9 and Lemma 3 with properly chosen parameters, including $L$ and $m$, both $\mathbb{P}[A\,|\,B]$ and $\mathbb{P}[B]$ are greater than $1 - p/2$. We obtain, by Bayes' theorem,

\[ 1-p \le \Big(1-\frac{p}{2}\Big)\Big(1-\frac{p}{2}\Big) \le \mathbb{P}[A\,|\,B]\,\mathbb{P}[B] = \mathbb{P}[A\cap B] \le \mathbb{P}[A], \]

which establishes the desired result.

The results from Lemma 6 can be exploited in our algorithm as follows. In the entry identification, by taking the median over $k \in [L]$ of $\big|\langle h_{j;k}, T_{j;\tilde{n}}\rangle_{(\mathcal{D}_j,\nu_j)}\big|$, we get a nonzero value if $\|(r_h)_{j;\tilde{n}}\|_2$ is nonzero and a zero value if $\|(r_h)_{j;\tilde{n}}\|_2$ is zero, due in particular to the second property in Lemma 6 being satisfied for more than half of the $k \in [L]$.
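As a rough illustration only, the median test just described can be sketched in MATLAB as follows. The array proxy stands in for the values $|\langle h_{j;k}, T_{j;\tilde{n}}\rangle_{(\mathcal{D}_j,\nu_j)}|$ for one fixed coordinate $j$, and the threshold tol is a user-chosen noise floor; these names are ours and are not part of Algorithm 6.

```matlab
% Sketch of the median-based entry identification test for one fixed j
% (illustrative only; "proxy" stands in for the proxy inner products and
%  "tol" is a user-chosen tolerance for deciding "numerically zero").
L   = 11;            % number of proxy functions (odd)
M   = 20;            % basis size per coordinate
tol = 1e-12;

proxy = randn(L, M); % placeholder for <h_{j;k}, T_{j;n}>, k in [L], n in [M]

medVals = median(abs(proxy), 1);   % median over k for every candidate entry n
Nj      = find(medVals > tol);     % entries with nonzero median are stored in N_j

% The same median values also drive the halting criterion discussed next:
% stop once the sum of squared medians drops below eta^2 / 9.
eta  = 1e-6;
halt = sum(medVals.^2) <= eta^2 / 9;
```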
Thus, we store all those $\tilde{n}$ with nonzero median value in $\mathcal{N}_j$. On the other hand, the summation over $\tilde{n} \in [M]$ of $\mathrm{median}_k\big|\langle h_{j;k}, T_{j;\tilde{n}}\rangle_{(\mathcal{D}_j,\nu_j)}\big|$ can also be used for the halting criterion in our algorithm. Although $O(\|c\|_2/\eta)$ iterations guarantee the desired precision, it is not necessary to repeat the iteration if the residual $r$ already has small energy. For any $j \in [D]$, if

\[ \sum_{\tilde{n}\in[M]} \Big|\,\mathrm{median}_k\big|\langle h_{j;k}, T_{j;\tilde{n}}\rangle_{(\mathcal{D}_j,\nu_j)}\big|\,\Big|^2 \le \frac{1}{9}\eta^2, \]

then $\|r\|_2 \le \eta$ by the lower bound in Property 2 of Lemma 6.

4.3.2 Pairing

In the entry identification, we can find at most $2s$ entries belonging to $\mathcal{N}_j := \{n_j \mid n \in \mathrm{supp}(r)\} \subset [M]$ for each $j \in [D]$. However, we do not know how to combine the entries of each $\mathcal{N}_j$ to identify the elements of $\mathrm{supp}(r)$. In order to do this, in the pairing process briefly introduced in Section 4.2.2, we successively build up the prefix set $\mathcal{P}_j$ of the energetic pairs for all $j \in [D]\setminus\{0\}$ such that

\[ \mathcal{P}_j \supset \left\{ \tilde{n} \in \mathcal{P}_{j-1}\times\mathcal{N}_j \ \Big|\ \|r_{j;(\tilde{n},\cdots)}\|_2^2 \ge \frac{\|r\|_2^2}{\alpha^2 s} \right\}, \]

with the initialization $\mathcal{P}_0 = \mathcal{N}_0$. The prefix set $\mathcal{P}_j$ contains only $2s$ pairs, throwing out the other pairs with smaller energy, for each $j \in [D]\setminus\{0\}$, so that

\[ \mathcal{P}_{D-1} \supset \left\{ \tilde{n} \in \mathrm{supp}(r) \ \Big|\ |r_{\tilde{n}}|^2 \ge \frac{\|r\|_2^2}{\alpha^2 s} \right\}. \]

From (4.13), the energy $\|r_{j;(\tilde{n},\cdots)}\|_2^2$ corresponding to each $\tilde{n} \in [M]^{j+1}$ satisfies the equality

\[ \|r_{j;(\tilde{n},\cdots)}\|_2^2 = \Big\| \langle h, T_{j;\tilde{n}}\rangle_{L^2\left(\times_{i\in[j+1]}\mathcal{D}_i,\,\otimes_{i\in[j+1]}\nu_i\right)} \Big\|^2_{L^2(\mathcal{D}''_j,\nu''_j)}. \quad (4.33) \]

The energy is estimated by using the following estimator $E_{j;(\tilde{n},\cdots)}$ defined as

\[ E_{j;(\tilde{n},\cdots)} := \frac{1}{m_2}\sum_{k\in[m_2]} \Bigg| \frac{1}{m_1}\sum_{\ell\in[m_1]} h(w_{j,\ell}, z_{j,k})\, T_{j;\tilde{n}}(w_{j,\ell}) \Bigg|^2, \]

which approximates the right hand side of (4.33) by using only finitely many evaluations of $h$. Those sampling point sets $\mathcal{W}_j\times\mathcal{Z}_j$ for all $j \in [D]\setminus\{0\}$ are constructed from $\mathcal{W}_j := \{w_{j,\ell}\}_{\ell\in[m_1]}$ and $\mathcal{Z}_j := \{z_{j,k}\}_{k\in[m_2]}$, where $w_{j,\ell}$ and $z_{j,k}$ are chosen independently at random from $\times_{i\in[j+1]}\mathcal{D}_i$ and $\mathcal{D}''_j$ for $j \in [D-1]\setminus\{0\}$, respectively. If $j = D-1$, $\mathcal{W}_{D-1} := \{w_{D-1,\ell}\}_{\ell\in[m_1]}$ is chosen from $\times_{i\in[D]}\mathcal{D}_i$ and $\mathcal{Z}_{D-1} = \emptyset$. Note that $\mathcal{W}_j\times\mathcal{Z}_j = \mathcal{G}^P$ from Section 4.1.3. Furthermore, the sets $\mathcal{W}_j\times\mathcal{Z}_j$ for all $j \in [D]\setminus\{0\}$ build a random sampling matrix $A^P$ in (4.16) which explicitly expresses the samples (evaluations) of the $2s$-sparse $h$ used in the energy estimates $E_{j;(\tilde{n},\cdots)}$ as $A^Pr$. The matrix $A^P$ is broken into smaller matrices $B_j$ and $C_j$ for $j \in [D]\setminus\{0\}$, defined and explained in the next paragraph, for the complete analysis of the pairing process in the upcoming lemmas and theorems of this section.

For all $j \in [D-1]\setminus\{0\}$, the measurement matrix $C_j \in \mathbb{C}^{m_2\times[M]^{D-j-1}}$ is defined as

\[ (C_j)_{k,n_2} := \frac{1}{\sqrt{m_2}}\, T''_{n_2}(z_{j,k}), \qquad k \in [m_2],\ n_2 \in [M]^{D-j-1}, \quad (4.34) \]

where $T''_{n_2}(y)$ is the partial product of the last $D-j-1$ terms of $T_n(x)$ defined in (4.1), i.e.,

\[ T''_{n_2}(y) = \prod_{i\in[D-j-1]} T_{i+j+1;n_{i+j+1}}(y_i), \qquad y \in \mathcal{D}''_j. \]
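Before continuing with the analysis, the energy estimator $E_{j;(\tilde{n},\cdots)}$ above can be computed as in the following MATLAB sketch. The handles h and Tprefix, and the point arrays W and Z, are our own stand-ins for the sampled function, the prefix basis function $T_{j;\tilde{n}}$, and the random point sets $\mathcal{W}_j$ and $\mathcal{Z}_j$; the conjugation of the basis factor is our choice so that the sketch also covers complex bases.

```matlab
% Sketch of the pairing energy estimate E_{j;(n,...)} for one candidate prefix.
% h(w, z)    : assumed function handle returning a sample of the target function,
% Tprefix(w) : assumed handle evaluating the prefix basis function T_{j;n} at w,
% W (m1 x (j+1)), Z (m2 x (D-j-1)) : random sampling points as in the text.
function E = pairing_energy(h, Tprefix, W, Z)
    m1 = size(W, 1);
    m2 = size(Z, 1);
    E  = 0;
    for k = 1:m2
        inner = 0;
        for ell = 1:m1
            inner = inner + h(W(ell, :), Z(k, :)) * conj(Tprefix(W(ell, :)));
        end
        E = E + abs(inner / m1)^2;    % |(1/m1) sum_l h(w_l, z_k) T_{j;n}(w_l)|^2
    end
    E = E / m2;                       % average over the m2 points z_k
end
```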
The matrix $C_j$ can be restricted to a matrix of size $m_2\times\tilde{N}_j$ when $C_j$ is applied to $v_{j,(\tilde{n},\cdots)}$, where $\tilde{N}_j$, estimated in (4.10), is the cardinality of the superset of any possible $\mathrm{supp}(v_{j;\tilde{n}})$ with fixed $j$, $\tilde{n}$, $D$ and $d$, so that it satisfies the RIP for sufficiently large $m_2$. When $j = D-1$, the matrix $C_j$ is defined to be $1$ since $\mathcal{Z}_{D-1} = \emptyset$. For all $j \in [D]\setminus\{0\}$, on the other hand, the matrix $B_j \in \mathbb{C}^{m_1\times[M]^{j+1}}$ is defined as

\[ (B_j)_{\ell,n_1} := \frac{1}{\sqrt{m_1}}\, T'_{n_1}(w_{j,\ell}), \qquad \ell \in [m_1],\ n_1 \in [M]^{j+1}, \quad (4.35) \]

where $T'_{n_1}(z)$ is the partial product of the first $j+1$ terms of $T_n(x)$ in (4.1), i.e.,

\[ T'_{n_1}(z) = \prod_{i\in[j+1]} T_{i;n_i}(z_i), \qquad z \in \times_{i\in[j+1]}\mathcal{D}_i. \]

The matrix $B_j$ can be restricted to a matrix of size $m_1\times\bar{N}_j$, where $\bar{N}_j = \binom{j+1}{d}M^d$ if $j+1 \ge d$, or $M^d$ otherwise. The number $\bar{N}_j$ is the cardinality of the set of all possible prefixes $\tilde{n} \in [M]^{j+1}$ with fixed $j$ and $d$. The sampling numbers $m_1$ and $m_2$ are chosen for all $B_j$ and $C_j$ with any $j \in [D]\setminus\{0\}$ so as to satisfy the RIP with restricted isometry constants $\tilde{\delta}_{2s+1} \le \tilde{\delta}$ and $\delta'_{2s} \le \delta'$, respectively. We again mention that $C_{D-1} = 1$.

Lemma 7. Let $H_{2s}$ be the set of all functions $h$ whose coefficient vectors are $2s$-sparse and let $r_h \in \mathbb{C}^N$ denote the coefficient vector for each $h \in H_{2s}$. Let $j \in [D]\setminus\{0\}$, and let $\tilde{\delta}$ and $\delta'$ be chosen from $(0,1)$. Assume that $B_j$ and $C_j$ satisfy the RIP with $\tilde{\delta}_{2s+1} \le \tilde{\delta}$ and $\delta'_{2s} \le \delta'$, respectively. Then, denoting $r := r_h$ for simplicity,

\[ (1-\delta'_{2s})\Big(\max\big\{0,\ \|r_{j;(\tilde{n},\cdots)}\|_2 - \tilde{\delta}_{2s+1}\|r\|_2\big\}\Big)^2 \le E_{j;(\tilde{n},\cdots)} \le (1+\delta'_{2s})\Big(\|r_{j;(\tilde{n},\cdots)}\|_2 + \tilde{\delta}_{2s+1}\|r\|_2\Big)^2 \quad (4.36) \]

for any $\tilde{n} \in [M]^{j+1}$.

Proof. As $j \in [D]\setminus\{0\}$ is fixed, for simplicity we use the notation $w_\ell$ and $z_k$ for the sampling points, instead of $w_{j,\ell}$ and $z_{j,k}$, constructing the sampling matrices $B_j$ and $C_j$ as in (4.35) and (4.34), respectively. Fix $\tilde{n} \in [M]^{j+1}$. Letting $n = (n_1, n_2)$ with $n_1 \in [M]^{j+1}$ and $n_2 \in [M]^{D-j-1}$, we can rewrite the energy estimate $E_{j;(\tilde{n},\cdots)}$ as follows,

\[ E_{j;(\tilde{n},\cdots)} = \frac{1}{m_2}\sum_{k\in[m_2]}\Bigg|\frac{1}{m_1}\sum_{\ell\in[m_1]} h(w_\ell,z_k)\,T_{j;\tilde{n}}(w_\ell)\Bigg|^2 = \frac{1}{m_2}\sum_{k\in[m_2]}\Bigg|\frac{1}{m_1}\sum_{\ell\in[m_1]}\sum_{n\in\mathrm{supp}(r)} r_n\,T_n(w_\ell,z_k)\,T_{j;\tilde{n}}(w_\ell)\Bigg|^2 = \frac{1}{m_2}\sum_{k\in[m_2]}\Bigg|\frac{1}{m_1}\sum_{\substack{n_2 \text{ s.t. } \exists n_1\\ \text{with } (n_1,n_2)\in\mathrm{supp}(r)}} (\tilde{r}_{\tilde{n}})_{n_2}\, T''_{n_2}(z_k)\Bigg|^2, \quad (4.37) \]

where

\[ (\tilde{r}_{\tilde{n}})_{n_2} := \sum_{n_1 \text{ s.t. } n=(n_1,n_2)\in\mathrm{supp}(r)} r_n \sum_{\ell\in[m_1]} T'_{n_1}(w_\ell)\, T_{j;\tilde{n}}(w_\ell). \quad (4.38) \]
We can construct a vector $\tilde{r} := \big((\tilde{r}_{\tilde{n}})_{n_2}\big) \in \mathbb{C}^{\tilde{N}_j}$ with entries $(\tilde{r}_{\tilde{n}})_{n_2}$ at $n_2$, so that $\tilde{r}$ has a support whose cardinality is at most $2s$ since $h$ is $2s$-sparse. Thus, the energy estimate $E_{j;(\tilde{n},\cdots)}$ in (4.37) can be expressed as $\big\|C_j(\tilde{r}/m_1)\big\|_2^2$. Since the restricted measurement matrix $C_j \in \mathbb{C}^{m_2\times\tilde{N}_j}$ satisfies the RIP,

\[ (1-\delta'_{2s})\left\|\frac{\tilde{r}}{m_1}\right\|_2^2 \le \left\|C_j\!\left(\frac{\tilde{r}}{m_1}\right)\right\|_2^2 \le (1+\delta'_{2s})\left\|\frac{\tilde{r}}{m_1}\right\|_2^2. \quad (4.39) \]

In order to get an upper bound and a lower bound on $\|\tilde{r}/m_1\|_2$, we define $e_{\tilde{n}} \in \mathbb{C}^{\bar{N}_j}$ as the standard basis vector with all $0$ entries except for a $1$ at $\tilde{n}$, and $R_j \in \mathbb{C}^{\tilde{N}_j\times\bar{N}_j}$ as

\[ (R_j)_{n_2,n_1} := \begin{cases} r_{(n_1,n_2)} & \text{if } (n_1,n_2)\in\mathrm{supp}(r) \\ 0 & \text{otherwise.} \end{cases} \]

Note that $R_j$ contains at most $2s$ nonzero elements since $h$ is $2s$-sparse, and $\tilde{n}$ is any element in $[M]^{j+1}$. Set $Q := \{\tilde{n}\}\cup\{n_1 \in [M]^{j+1} \mid \exists n_2 \in [M]^{D-j-1} \text{ such that } (n_1,n_2)\in\mathrm{supp}(r)\}$ with a fixed $\tilde{n} \in [M]^{j+1}$. Thus, the cardinality of $Q$ is at most $2s+1$. The orderings of the indices $n_1$ and $n_2$ depend on the column orderings of $B_j$ and $C_j$, respectively. We note that the $\tilde{n}$-th column of $R_j$ is $r_{j;(\tilde{n},\cdots)}$. Both bounds on $\|\tilde{r}/m_1\|_2$ are found as follows:

\[ \left\|\frac{\tilde{r}}{m_1} - r_{j;(\tilde{n},\cdots)}\right\|_2 = \big\|R_j(B_j)^*_Q(B_j)_Q e_{\tilde{n}} - R_j e_{\tilde{n}}\big\|_2 \le \|R_j\|_{2\to2}\,\big\|(B_j)^*_Q(B_j)_Q - I\big\|_{2\to2}\,\|e_{\tilde{n}}\|_2 \le \tilde{\delta}_{2s+1}\|R_j\|_F = \tilde{\delta}_{2s+1}\|r\|_2, \]

and therefore

\[ \left|\,\left\|\frac{\tilde{r}}{m_1}\right\|_2 - \|r_{j;(\tilde{n},\cdots)}\|_2\right| \le \tilde{\delta}_{2s+1}\|r\|_2, \quad\text{i.e.,}\quad \|r_{j;(\tilde{n},\cdots)}\|_2 - \tilde{\delta}_{2s+1}\|r\|_2 \le \left\|\frac{\tilde{r}}{m_1}\right\|_2 \le \|r_{j;(\tilde{n},\cdots)}\|_2 + \tilde{\delta}_{2s+1}\|r\|_2. \quad (4.40) \]

Combining (4.39) and (4.40), we reach the conclusion in (4.36).

Lemma 8. Let $H_{2s}$ be the set of all functions $h$ whose coefficient vectors are $2s$-sparse and let $r_h \in \mathbb{C}^N$ denote the coefficient vector for each $h \in H_{2s}$. Let $j$ be any integer such that $j \in [D]\setminus\{0\}$, and let $\tilde{\delta}$ and $\delta'$ be chosen from $(0,1)$. Given $\alpha > 1$, assume that $B_j$ and $C_j$ satisfy the RIP with $\tilde{\delta}_{2s+1} \le \tilde{\delta} \le \frac{1}{\check{c}\sqrt{2s}}$ for some $\check{c} > 0$ and $\delta'_{2s} \le \delta' \in (0,1)$, satisfying

\[ \sqrt{\frac{1+\delta'}{1-\delta'}} < \frac{\check{c}}{\alpha} - 1. \]

Then, denoting $r := r_h$ for simplicity, Algorithm 7 produces a set $\mathcal{P}_j$ of cardinality $2s$ with

\[ \mathcal{P}_j \supset \left\{ \tilde{n} \in [M]^{j+1} \ \Big|\ \|r_{j;(\tilde{n},\cdots)}\|_2^2 \ge \frac{\|r\|_2^2}{\alpha^2 s} \right\} \quad\text{whenever}\quad \mathcal{P}_{j-1} \supset \left\{ \hat{n} \in [M]^{j} \ \Big|\ \|r_{j-1;(\hat{n},\cdots)}\|_2^2 \ge \frac{\|r\|_2^2}{\alpha^2 s} \right\}. \]

Proof.
By assumption, $\mathcal{P}_{j-1} \subset \times_{i\in[j]}\mathcal{N}_i$ contains all prefixes in $\big\{\hat{n}\in[M]^j \mid \|r_{j-1;(\hat{n},\cdots)}\|_2^2 \ge \|r\|_2^2/(\alpha^2 s)\big\}$, so that $\mathcal{P}_{j-1}\times\mathcal{N}_j$ contains all possible prefixes $\tilde{n}\in[M]^{j+1}$ with $\|r_{j;(\tilde{n},\cdots)}\|_2^2 \ge \|r\|_2^2/(\alpha^2 s)$, by the definition of $\mathcal{P}_j$ and $\mathcal{N}_j$ for all $j\in[D]$. If $\|r_{j;(\tilde{n},\cdots)}\|_2 = 0$, i.e., there is no $n_2$ such that $(\tilde{n},n_2)\in\mathrm{supp}(r)$, then from Lemma 7,

\[ 0 \le E_{j;(\tilde{n},\cdots)} \le (1+\delta')\big(\tilde{\delta}\|r\|_2\big)^2 \le (1+\delta')\left(\frac{\|r\|_2}{\check{c}\sqrt{2s}}\right)^2. \]

On the other hand, if $\|r_{j;(\tilde{n},\cdots)}\|_2^2 \ge \|r\|_2^2/(\alpha^2 s)$, i.e., there is an $n_2$ such that $(\tilde{n},n_2)\in\mathrm{supp}(r)$, then

\[ (1-\delta')\Big(\|r_{j;(\tilde{n},\cdots)}\|_2 - \tilde{\delta}\|r\|_2\Big)^2 \le E_{j;(\tilde{n},\cdots)} \le (1+\delta')\Big(\|r_{j;(\tilde{n},\cdots)}\|_2 + \tilde{\delta}\|r\|_2\Big)^2. \]

In order to distinguish nonzero $\|r_{j;(\tilde{n},\cdots)}\|_2$ from zero, we should have

\[ (1+\delta')\left(\frac{\|r\|_2}{\check{c}\sqrt{2s}}\right)^2 < (1-\delta')\left(\frac{\|r\|_2}{\alpha\sqrt{s}} - \frac{\|r\|_2}{\check{c}\sqrt{2s}}\right)^2, \]

which is implied by $\sqrt{(1+\delta')/(1-\delta')} < \check{c}/\alpha - 1$, as in the assumption. Since the estimates of zero energy and of energy greater than $\|r\|_2^2/(\alpha^2 s)$ are separated, choosing the $2s$ prefixes with the largest estimates $E_{j;(\tilde{n},\cdots)}$ guarantees that $\mathcal{P}_j$ contains all prefixes with energy greater than $\|r\|_2^2/(\alpha^2 s)$.

Theorem 10. Let $H_{2s}$ be the set of all functions $h$ whose coefficient vectors are $2s$-sparse and let $r_h \in \mathbb{C}^N$ denote the coefficient vector for each $h \in H_{2s}$. We assume that we have $\mathcal{N}_j$ for all $j \in [D]$. Let $\alpha > 1$, $p \in (0,1)$, and let $\tilde{\delta} \le \frac{1}{\check{c}\sqrt{2s}}$ and $\delta' \in (0,1)$ satisfy $\sqrt{(1+\delta')/(1-\delta')} < \check{c}/\alpha - 1$ for some $\check{c} > 0$. If

\[ m_1 \ge \bar{\alpha}\, K^2\, \tilde{\delta}^{-2}\, s\, \max\Big\{ \log^2(4s)\cdot d\cdot\log\big(eMD/d\big)\cdot\log(9m_1),\ \log(2D/p) \Big\} \quad\text{and}\quad m_2 \ge \bar{\beta}\, K^2\, (\delta')^{-2}\, s\, \max\Big\{ \log^2(4s)\cdot d\cdot\log\big(eMD/d\big)\cdot\log(9m_2),\ \log(2D/p) \Big\} \]

for absolute constants $\bar{\alpha}$ and $\bar{\beta}$, then, denoting $r := r_h$ for simplicity, Algorithm 7 finds $\mathcal{P} \supset \big\{ n \in [M]^D \mid |r_n|^2 \ge \|r\|_2^2/(\alpha^2 s) \big\}$ of $|\mathcal{P}| = 2s$ with probability at least $1-p$.

Proof. Given $m_1$ and $m_2$, by Theorem 3, the probability of either $B_j$ or $C_j$ failing to satisfy $\tilde{\delta}_{2s+1} < \tilde{\delta}$ or $\delta'_{2s} < \delta'$, respectively, is at most $\frac{p}{2D}$ for each $j \in [D]\setminus\{0\}$, and thus the union bound over all $j$ yields a failure probability of at most $\frac{p(D-1)}{D} < p$. That is, Theorem 3 ensures that $B_j$ and $C_j$ satisfy the RIP uniformly for all $j \in [D]\setminus\{0\}$ with probability at least $1-p$. Repeatedly applying Lemma 8 then yields the final $\mathcal{P}(= \mathcal{P}_{D-1}) \supset \big\{ n \in [M]^D \mid |r_n|^2 \ge \|r\|_2^2/(\alpha^2 s) \big\}$ of cardinality $2s$, by combining it with the fact that $\mathcal{P}_0(= \mathcal{N}_0) \supset \big\{ \tilde{n} \in [M] \mid \|r_{(\tilde{n},\cdots)}\|_2^2 \ge \|r\|_2^2/(\alpha^2 s) \big\}$.
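The selection step analyzed in Lemma 8 and Theorem 10 amounts to ranking candidate prefixes by their energy estimates and keeping the largest $2s$ of them. The MATLAB sketch below illustrates this one round of pairing under our own stand-in data (the candidate list and energies are filled with dummy values purely so the sketch runs); it is not a verbatim excerpt of Algorithm 7.

```matlab
% Sketch of one pairing round (cf. Lemma 8): keep the 2s candidate prefixes
% with the largest energy estimates.  Dummy data stands in for the candidate
% list P_{j-1} x N_j and its energy estimates E_{j;(n,...)}.
s          = 5;
M          = 20;
candidates = randi(M, 4*s^2, 2) - 1;   % stand-in prefixes of length j+1 (here j+1 = 2)
energies   = rand(4*s^2, 1);           % stand-in for the estimates E_{j;(n,...)}

[~, order] = sort(energies, 'descend');
keep       = order(1:min(2*s, numel(energies)));
Pj         = candidates(keep, :);      % new prefix set P_j with |P_j| <= 2s
```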
4.3.3 Support Identification

In this section it remains to combine the results of the entry identification and pairing processes in order to give the complete support identification algorithm and to prove Lemma 2, which is the main ingredient of Theorem 7 in Section 4.2.3. The support identification starts with the entry identification providing $\mathcal{N}_j$, $j \in [D]$, as outputs; in turn, the pairing takes $\mathcal{N}_j$, $j \in [D]$, as inputs and outputs $\mathcal{P}$ of cardinality $2s$ containing $\big\{ n \in [M]^D \mid |r_n|^2 \ge \|r\|_2^2/(\alpha^2 s) \big\}$. Accordingly, we can get the following result.

Lemma 9. The set $\mathcal{P}$ contains at most $2s$ index vectors and satisfies

\[ \|r_{\mathcal{P}^c}\|_2 \le \sqrt{2s\cdot\frac{\|r\|_2^2}{\alpha^2 s}} = \frac{\sqrt{2}\,\|r\|_2}{\alpha}, \]

where $r_{\mathcal{P}^c}$ is the vector $r$ restricted to the complement of $\mathcal{P}$.

Proof. Note that $r$ is $2s$-sparse and, by Theorem 10, the squared magnitude of each $r_n$ with $n \in \mathcal{P}^c$ is less than $\|r\|_2^2/(\alpha^2 s)$, so that we obtain the desired result.

Finally, we are ready to prove Lemma 2 in order to complete the analysis of support identification.

Proof of Lemma 2. By choosing $\alpha = 7$ and $\mathcal{P} = \Omega$ in Lemma 9, we obtain the desired upper bound on $\|(r_h)_{\Omega^c}\|_2$ with probability at least $1-2p$ given the grids $\mathcal{G}^I$ and $\mathcal{G}^P$. The union bound of the failure probabilities $p$ of Lemma 6 and Theorem 10 gives the desired probability.

It remains to demonstrate the sampling complexity, combining $|\mathcal{G}^I|$ and $|\mathcal{G}^P|$, and the runtime complexity, combining Algorithms 6 and 7. The first term $|\mathcal{G}^I|$ is $mL'D$, where $m$ comes from Lemma 6, $L'$ is defined as in Section 4.2.3, and $D$ is the number of changes in $j \in [D]$ therein. We emphasize that the $m$ samples are utilized repeatedly in order to implicitly construct the $L$ proxy functions combined with different Gaussian weights, so that $L$ does not affect the sampling complexity but does affect the runtime complexity. The second term $|\mathcal{G}^P|$ is $m_1m_2(D-2) + m_1$, where $m_1$ and $m_2$ are from Theorem 10, and $D-2$ is the number of changes in $j \in [D-1]\setminus\{0\}$. We remind readers that $C_{D-1} = 1$, implied by $\mathcal{Z}_{D-1} = \emptyset$, so that $m_1$ samples are utilized instead of $m_1m_2$ when $j = D-1$.

Now we consider the runtime complexity. The first term in the runtime complexity is $O(mLL'D)$ from Algorithm 6, where $mL$ computations are taken to implicitly construct the $L$ proxy functions, $L'$ is defined as in Section 4.2.3, and $D$ comes from the for loop in Algorithm 6. The second term in the runtime complexity is $4s^2\big(m_1m_2(D-2)+m_1\big) + 4s^2m_1\sum_{j=1}^{D-1}(j+1)$ from Algorithm 7, since $4s^2$ energy estimates are calculated using the $m_1m_2$ samples for each $j \in [D-1]\setminus\{0\}$ and $m_1$ samples for $j = D-1$, and the evaluations of $T_{j;\tilde{n}}(w_{j,\ell})$ are calculated for all $\tilde{n} \in \mathcal{P}_j$, $\forall j \in [D]\setminus\{0\}$. Here it is assumed that it takes $O(1)$ runtime to evaluate each $i$th component $T_{i;\tilde{n}_i}\big((w_{j,\ell})_i\big)$ of $T_{j;\tilde{n}}(w_{j,\ell})$.

4.4 Empirical Evaluation

In this section Algorithm 5 is evaluated numerically and compared to CoSaMP [22], its superlinear-time progenitor. Both Algorithm 5 and CoSaMP were implemented in MATLAB for this purpose.

4.4.1 Experimental Setup

We consider two kinds of tensor product basis functions below: Fourier and Chebyshev.
In both cases each parameter, $M$, $D$, and $s$, is changed while the others remain fixed so that we can see how each parameter affects the runtime, sampling number, memory usage, and error of both Algorithm 5 and CoSaMP. For all experiments below $d = D$ so that $\mathcal{I} = [M]^D$. Every data point in every plot below was created using 100 different randomly generated trial signals, $f$, of the form

\[ f(x) = \sum_{n\in S} c_n T_n(x), \quad (4.41) \]

where each function's support set, $S$, contained $s$ index vectors $n \in [M]^D$, each of which was independently chosen uniformly at random from $[M]^D$, and where each function's coefficients $c_n$ were each independently chosen uniformly at random from the unit circle in the complex plane (i.e., each $c_n = e^{i\theta}$ where $\theta$ is chosen uniformly at random from $[0, 2\pi]$). In the Fourier setting the basis functions $T_n(x)$ in (4.41) were chosen as per (4.42), and in the Chebyshev setting as per (4.43). Below, a trial will always refer to the execution of Algorithm 5 and/or CoSaMP on a particular randomly generated trial function $f$ in (4.41). A failed trial will refer to any trial where either CoSaMP or Algorithm 5 failed to recover the correct support set $S$ for $f$. Herein the parameters of both Algorithm 5 and CoSaMP were tuned to keep the number of failed trials down to less than 10 out of the total 100 used to create every data point in every plot. Finally, in all of our plots Algorithm 5 is graphed in red, and CoSaMP in blue.

4.4.2 Experiments with the Fourier Basis for $\mathcal{D} = [0,1]^D$

In this section we consider the Fourier tensor product basis

\[ T_n(x) := \prod_{j=0}^{D-1} e^{2\pi i n_j x_j} \quad (4.42) \]

whose orthogonality measure is the Lebesgue measure on $\mathcal{D} = [0,1]^D$.

[Figure 4.1: Fourier basis, $M \in \{10, 20, 40, 80\}$, $D = 4$, $s = 5$ (panels (a)-(d)).]

In figures 4.1, 4.2 and 4.3, results are shown for approximating Fourier-sparse trial functions (4.41) using noiseless samples $y$. In figure 4.1, the parameter $M$ changes over the set $\{10, 20, 40, 80\}$ while $D = 4$ and $s = 5$ are held constant. In figure 4.1a, the average runtime (in seconds) is shown as $M$ changes. The average here is calculated over all 100 trials at each data point, excluding any failed trials. As we can see, the runtime of Algorithm 5 grows very slowly as $M$ grows, whereas the runtime grows fairly quickly for CoSaMP since its measurement matrix's size increases significantly as $M$ grows.
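For concreteness, the Fourier-sparse trial functions described above can be generated and sampled with a few lines of MATLAB; the sketch below simply mirrors (4.41) and (4.42) and is our own illustration (names such as evalTrial are ours), not an excerpt from the experiment code.

```matlab
% Sketch: build a random Fourier-sparse trial function as in (4.41)-(4.42)
% and evaluate it at arbitrary sample points in [0,1]^D.
M = 20;  D = 4;  s = 5;

S = randi(M, s, D) - 1;                 % s frequency vectors from [M]^D = {0,...,M-1}^D
c = exp(2i*pi*rand(s, 1));              % unimodular coefficients c_n = e^{i*theta}

% f(x) = sum_n c_n * exp(2*pi*i*<n, x>) evaluated at the rows of X
evalTrial = @(X) exp(2i*pi * X * S.') * c;   % returns a (numPoints x 1) vector

X = rand(100, D);                        % 100 random sample points in [0,1]^D
y = evalTrial(X);                        % noiseless samples used by both algorithms
```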
[Figure 4.2: Fourier basis, $M = 10$, $D \in \{2, 4, 6, 8\}$, $s = 5$ (panels (a)-(d)).]

Figure 4.1b shows the number of samples used by both CoSaMP and Algorithm 5. We can see that Algorithm 5 requires more samples, due mainly to its support identification's pairing step. On the other hand, we can see in figure 4.1c that the memory usage of CoSaMP grows very rapidly compared to the slow growth of Algorithm 5's memory usage. This exemplifies the tradeoff between Algorithm 5 and CoSaMP: Algorithm 5 uses more samples than CoSaMP in order to reduce its runtime complexity and memory usage for large $D$ and $M$. Finally, figure 4.1d demonstrates that both methods produce outputs whose average errors (over the trials where they don't fail) are on the order of $10^{-15}$.

[Figure 4.3: Fourier basis, $M = 20$, $D = 4$, $s \in \{1, 2, 3, \cdots, 10\}$ (panels (a)-(d)).]

In figure 4.2, the number of dimensions, $D$, changes while both $M = 10$ and $s = 5$ are held fixed. Here, we can clearly see the advantage of Algorithm 5 for functions of many variables. The runtime and memory usage of CoSaMP blow up quickly as $D$ increases due to the gigantic matrix-vector multiplies it requires to identify support. Algorithm 5, on the other hand, shows much slower growth in runtime and memory usage. When $D = 10$, for example, CoSaMP requires terabytes of memory whereas Algorithm 5 requires only a few gigabytes.

In figure 4.3, $s$ varies in $\{1, 2, 3, \cdots, 10\}$ while $M = 20$ and $D = 4$ are fixed. Since Algorithm 5 has $\tilde{O}(s^5)$ scaling in runtime due to its pairing step, it suffers as sparsity increases more quickly than CoSaMP does. Note that the crossover point is around $s = 8$, so that Algorithm 5 appears to be slower than CoSaMP for all $s > 8$ when $M = 20$ and $D = 4$. Though "only polynomial in $s$", it is clear from these experiments that the runtime scaling of Algorithm 7 in $s$ needs to be improved before the methods proposed herein can become truly useful in practice.

4.4.3 Experiments with the Chebyshev Basis for $\mathcal{D} = [-1,1]^D$

In this section we consider the Chebyshev tensor product basis

\[ T_n(x) := 2^{\frac{1}{2}\|n\|_0} \prod_{j=0}^{D-1} \cos\big(n_j\arccos(x_j)\big) \quad (4.43) \]

whose orthogonality measure is $d\nu = \bigotimes_{j\in[D]}\frac{dx_j}{\pi\sqrt{1-x_j^2}}$ on $\mathcal{D} = [-1,1]^D$. Runtime, memory, sampling complexity, and error graphs are provided in figures 4.4, 4.5 and 4.6 as $M$, $D$, and $s$ vary, respectively. Since this Chebyshev product basis has a BOS constant of $K = 2^{D/2}$, both CoSaMP and Algorithm 5 suffer from a mild exponential growth in sampling, runtime, and memory complexity as $D$ increases (recall that $d = D$ for these experiments). This leads to markedly different overall performance for the Chebyshev basis than what is observed for the Fourier basis, where $K = 1$.

[Figure 4.4: Chebyshev basis, $M \in \{10, 20, 40, 80\}$, $D = 4$, $s = 5$ (panels (a)-(d)).]
[Figure 4.5: Chebyshev basis, $M = 20$, $D \in \{2, 4, 6\}$, $s = 5$ (panels (a)-(d)).]
[Figure 4.6: Chebyshev basis, $M = 10$, $D = 6$, $s \in \{1, 2, 3, \cdots, 10\}$ (panels (a)-(d)).]

A reduction in performance from the Fourier case for both methods is clearly visible, e.g., in figure 4.5. Nonetheless, Algorithm 5 demonstrates the expected reduced runtime and sampling complexity dependence on $M$ and $D$ over CoSaMP in figures 4.4 and 4.5, as well as a striking reduction in its required memory usage over CoSaMP even in figure 4.6, where its runtime complexity is worse. Unfortunately, the $\tilde{O}(s^5)$ runtime dependence of Algorithm 7 on sparsity is again clear in figure 4.6a, leading to a crossover point of Algorithm 5 with CoSaMP at only $s = 3$ when $M = 10$ and $D = 6$. This again clearly marks the pairing process of Algorithm 7 as being in need of improvement.

4.4.4 Experiments for Larger Ranges of Sparsity s and Dimension D

[Figure 4.7: Fourier basis, $M = 20$, $D \in \{5, 10, 15, 20, \cdots, 75\}$, $s = 5$ (panels (a)-(c)).]
[Figure 4.8: Fourier basis, $M = 40$, $D = 5$, $s \in \{5, 10, 20, 40, 80\}$ (panels (a)-(c)).]

Figures 4.7 and 4.8 explore the performance of Algorithm 5 on Fourier-sparse functions for larger ranges of $D$ and $s$, respectively. In figure 4.7, a function of $D = 75$ variables can be recovered in just a few seconds when it is sufficiently sparse in the Fourier basis. It is worth pointing out here that when $D = 75$ the BOS in question contains $20^{75} \sim 10^{97}$ basis functions, significantly more than the approximately $10^{82}$ atoms estimated to be in
We would like to emphasize that Algorithm 5 is solving problems in this setting that are simply too large to be solved efficiently, if at all, using standard superlinear-time compressive sensing approaches due to their memory requirements when dealing with such extremely large bases. Figure 4.8 also shows that functions with larger Fourier sparsities, s, than previously considered (up to s = 160) can be be recovered in about an hour or less from a BOS of size 405 = 102, 400, 000. In figures 4.9 and 4.10 we consider the functions which are sparse in the Chebyshev prod- 123 024681010-21001020246810102104106024681010910101011024681001234510-15 (a) (b) (c) Figure 4.7: Fourier basis, M = 20, D ∈ {5, 10, 15, 20,··· , 75}, s = 5 (a) (b) (c) Figure 4.8: Fourier basis, M = 40, D = 5, s ∈ {5, 10, 20, 40, 80} uct basis. Again, due to the larger BOS constant of the Chebyshev basis, the D and s ranges that our method can deal efficiently are smaller than in the Fourier case. When D is 12 or s is 20 in figures 4.9 and 4.10, respectively, for example, it takes a few hours for Algorithm 5 to finish running. We again remind the readers that standard superlinear-time compressive sensing methods cannot solve with such high dimensional problems at all, however, on any- thing less than a world class supercomputer due to their memory requirements. In the figure 4.9 experiments the Chebyshev BOS contains 2012 ∼ 1015 basis functions when D = 12. In the figure 4.10 experiments the BOS contains just over 100 million basis functions. 124 02040608010-210-1100101Our method0204060801031041050204060801.522.533.5410-1502040608010-110010110210302040608010310410510602040608000.511.522.510-14 (a) (b) (c) Figure 4.9: Chebyshev basis, M = 20, D ∈ {2, 4, 6,··· , 12}, s = 5 (a) (b) (c) Figure 4.10: Chebyshev basis, M = 40, D = 5, s ∈ {2, 4, 6,··· , 20} 4.4.5 Recovery of Functions from Noisy Measurements In figures 4.11 and 4.12 we further consider exactly sparse trial functions (4.41) whose func- tion evaluations are contaminated with Gaussian noise. That is, we provide Algorithm 5 with noisy samples y(cid:48) = y + g(cid:48) = y + σ (cid:107)y(cid:107)2 (cid:107)g(cid:107)2 g where y contains noiseless samples from each f as per (4.4), g ∼ N (0, I), and σ ∈ R+ is used to control the Signal to Noise Ratio (SNR) defined herein by SNRdb := 10 log10 = − 10 log10 (cid:16) σ2(cid:17) . (cid:32) (cid:107)y(cid:107)2 (cid:33) 2 (cid:107)g(cid:48)(cid:107)2 2 125 24681012100102104Our method246810121021041061082468101200.511.5210-140510152010-11001011021030510152010410510610705101520012345610-15 (a) (b) (c) (d) Figure 4.11: Algorithm 5, Fourier basis, M = 10, D ∈ {4, 6, 8, 10}, s = 5, SNRdB ∈ {0, 10, 20,··· , 80} Figures 4.11 and 4.12 show the performance of Algorithm 5 for the Fourier and Chebyshev product bases, respectively, as SNR varies. Figure 4.11a shows the average runtime for each D ∈ {2, 4, 6, 8} as SNRdB changes. When SNRdB is close to 0 (which means that the (cid:96)2- norm of noise vector is the same as the (cid:96)2-norm of sample vector), the runtime gets larger due to Algorithm 5 using a larger number of overall iterations. The runtime also increases mildly as D increases in line with our previous observations. The sampling number in Figure 4.11b is set to be three times larger than the sampling number used in the noiseless cases. 
[Figure 4.12: Algorithm 5, Chebyshev basis, $M = 10$, $D \in \{6, 8\}$, $s = 5$, $\mathrm{SNR}_{\mathrm{dB}} \in \{0, 10, 20, \cdots, 80\}$ (panels (a)-(d)).]

Figure 4.12 shows the results of Algorithm 5 applied to functions which are sparse in the Chebyshev product basis. Similar to the Fourier case, the runtime grows as the noise level gets worse in Figure 4.12a. Also, larger $D$ results in larger runtime, as previously discussed. In Figure 4.12b, the sampling number is also set by tripling the sampling number used in the noiseless cases.

As above, in both figures 4.11 and 4.12 the average $\ell_2$-error is computed by only considering the successful trials where every element of $f$'s support, $S$, is found. Here, however, the percentage of successful trials falls below 90% for lower SNR values. The success rates (i.e., the percentage of successful trials at each data point) are therefore plotted in figures 4.11d and 4.12d. Both figures show that a smaller $\mathrm{SNR}_{\mathrm{dB}}$ results in a smaller success rate, as one might expect. As $\mathrm{SNR}_{\mathrm{dB}}$ increases, however, the $\ell_2$-error decreases linearly for the successful trials.

4.4.6 Some Additional Implementational Details

In line 13 of Algorithm 5, solving the least squares problem can be accelerated by iterative algorithms such as the Richardson method or the conjugate gradient method when the size of the matrix $\Phi_T$ is large [22]. For our range of relatively low sparsities, however, there was not much difference in runtime between using such iterative least squares solvers and simply multiplying $y_E$ by the Moore-Penrose inverse, $\Phi_T^\dagger := (\Phi_T^*\Phi_T)^{-1}\Phi_T^*$. Thus, we simply form and use the Moore-Penrose inverse for both CoSaMP and Algorithm 5 in our implementations below.

Similarly, in our CoSaMP implementation the conjugate transpose of the measurement matrix, $\Phi$, of size $m\times M^D$, is simply directly multiplied by the updated sample vector $v$ in each iteration in order to obtain the signal proxy used for CoSaMP's support identification procedure (recall that $d = D$ in all experiments below so that $\mathcal{I} = [M]^D$). It is important to note that this matrix-vector multiplication can generally be done more efficiently if, e.g., one instead uses nonuniform FFT techniques [56] to evaluate $\Phi^*y$ for the types of high-dimensional Fourier and Chebyshev basis functions considered below. However, such techniques are again not actually faster than a naive direct matrix multiply for the ranges of relatively low sparsities we consider in the experiments herein. [Footnote 10: CoSaMP always uses only $m = O(s\cdot D\log M)$ samples in the experiments herein, which means that its measurement matrix's conjugate transpose, $\Phi^* \in \mathbb{C}^{M^D\times m}$, can be naively multiplied by vectors in only $O(s\cdot D\log M\cdot M^D)$-time. When $s$ is small this is comparable to the $O(D\log M\cdot M^D)$ runtime complexity of a (nonuniform) FFT.] Furthermore, such nonuniform FFT techniques will still exhibit exponential runtime and memory dependence on $D$ in the high-dimensional setting, even for larger sparsity levels. Thus, nonuniform FFTs were not utilized in our MATLAB implementation of CoSaMP.
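The two dense linear-algebra steps just discussed can be sketched as follows. The matrices and vectors (Phi, yE, v, T) are our own stand-ins with arbitrary dimensions, not the actual experiment data, and the sketch is only meant to show the shape of the computations.

```matlab
% Sketch of the two dense steps discussed above (stand-in data throughout).
m = 200;  N = 1000;  s = 5;
Phi = (randn(m, N) + 1i*randn(m, N)) / sqrt(m);  % stand-in for the m x M^D measurement matrix
yE  = randn(m, 1) + 1i*randn(m, 1);              % stand-in for the sample vector
v   = yE;                                        % stand-in for the updated sample (residual) vector
T   = randperm(N, 3*s);                          % stand-in for the current support estimate

% (i) Restricted least squares via the Moore-Penrose inverse,
%     a_T = (Phi_T^* Phi_T)^{-1} Phi_T^* yE, adequate at low sparsities.
PhiT = Phi(:, T);
aT   = (PhiT' * PhiT) \ (PhiT' * yE);

% (ii) Signal proxy for CoSaMP's support identification: a naive dense
%      multiplication by the conjugate transpose of Phi.
proxy = Phi' * v;
```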
Chapter 5

Conclusion

In this thesis, we developed several sublinear-time high-dimensional function learning algorithms under the assumption that the function is sparse in a tensorized basis of bounded orthonormal functions, including the tensorized Fourier basis. When focusing on the Fourier basis, its nice properties make it possible to recover the frequency vectors and the corresponding Fourier coefficients more quickly and efficiently, using far fewer samples.

In Chapter 2 we showed how to extend our 1D sublinear sparse Fourier algorithm to the general $D$-dimensional case. The methods project $D$-dimensional frequency vectors onto lower dimensions. In this process we encounter several obstacles. Thus we introduced the "tilting method" for the worst-case problems and the "partial unwrapping method" to reduce the chance of collisions and to increase the frequency bandwidth within the limits of computation. Through those methods we can overcome the obstacles as well as maintain the advantages of the 1D algorithm. In [1] the average-case sampling complexity is $O(s)$ and the runtime complexity is $O(s\log s)$. Extending this estimate from our 1D algorithm, we have $O(Ds)$ sampling complexity and a runtime complexity of $O(Ds\log s)$ on average, under the assumption that no worst-case scenario occurs.

In Chapter 3 we developed a multiscale high-dimensional sparse Fourier algorithm recovering a few energetic Fourier modes using noisy samples. As the estimation error is controlled by the noise level $\sigma$ and the sample length $p$, a larger $p$ reduces the error. Rather than recovering the frequencies in a single step, however, we chose a multiscale approach in order to make the sample length $p$ increase moderately, improving the estimate iteratively through correction terms determined by a sequence of shifting sizes $q$. We showed that a finite number of correction terms is enough to make the error smaller than $1/2$, so that we can reconstruct each integer frequency entry by rounding. As a result, the algorithm has $O(sD\log M)$ sampling complexity and $O(sD\log s\log M)$ runtime complexity on average, combining the results from Chapter 2 and [2].

In Chapter 4 we have shown that there exist sublinear-time algorithms that approximate multivariate functions which are sparse in a BOS. The approximations provide a uniform error guarantee for any function at a given sparsity level. In addition to the uniform error guarantee for all exactly sparse functions, the numerical experiments demonstrate that our proposed method can also approximate functions contaminated with a certain level of noise well.

All methods in this thesis assume that we can obtain a measurement at any sample point. However, this is not always the case in practice. Our future work will be a modification of the algorithms to make them work for given discrete signals with sparsity in the Fourier domain, using the idea of filtering from [10]. Another direction of future work will be proving a rigorous error guarantee of Algorithm 5 for nearly sparse functions. Moreover, combining the work in Chapter 4 with [32], which sublinearizes the dependence on the ambient dimension $D$ when the functions only depend on $\tilde{d}$ variables instead of $D \ge \tilde{d}$, we expect further acceleration of the approximation of functions of many variables.

BIBLIOGRAPHY

[1] David Lawlor, Yang Wang, and Andrew Christlieb. Adaptive sub-linear time Fourier algorithms. Advances in Adaptive Data Analysis, 5(01):1350003, 2013.

[2] Andrew Christlieb, David Lawlor, and Yang Wang. A multiscale sub-linear time Fourier algorithm for noisy data. Applied and Computational Harmonic Analysis, 40(3):553-574, 2016.
[3] Aicke Hinrichs, Erich Novak, Mario Ullrich, and H. Woźniakowski. The curse of dimensionality for numerical integration of smooth functions. Mathematics of Computation, 83(290):2853-2863, 2014.

[4] Anna C. Gilbert, Sudipto Guha, Piotr Indyk, S. Muthukrishnan, and Martin Strauss. Near-optimal sparse Fourier representations via sampling. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pages 152-161. ACM, 2002.

[5] Anna C. Gilbert, S. Muthukrishnan, and Martin Strauss. Improved time bounds for near-optimal sparse Fourier representations. In Proceedings of SPIE, volume 5914, page 59141A, 2005.

[6] Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Nearly optimal sparse Fourier transform. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pages 563-578. ACM, 2012.

[7] Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Simple and practical algorithm for sparse Fourier transform. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1183-1194. Society for Industrial and Applied Mathematics, 2012.

[8] Mark A. Iwen. Combinatorial sublinear-time Fourier algorithms. Foundations of Computational Mathematics, 10(3):303-338, 2010.

[9] Badih Ghazi, Haitham Hassanieh, Piotr Indyk, Dina Katabi, Eric Price, and Lixin Shi. Sample-optimal average-case sparse Fourier transform in two dimensions. In Communication, Control, and Computing (Allerton), 2013 51st Annual Allerton Conference on, pages 1258-1265. IEEE, 2013.

[10] Sami Merhi, Ruochuan Zhang, Mark A. Iwen, and Andrew Christlieb. A new class of fully discrete sparse Fourier transforms: Faster stable implementations with guarantees. arXiv preprint arXiv:1706.02740, 2017.

[11] Mark A. Iwen. Improved approximation guarantees for sublinear-time Fourier algorithms. Applied and Computational Harmonic Analysis, 34(1):57-82, 2013.

[12] Ralph C. Smith. Uncertainty Quantification: Theory, Implementation, and Applications. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2013.

[13] Dongbin Xiu. Numerical Methods for Stochastic Computations: A Spectral Method Approach. Princeton University Press, Princeton, NJ, USA, 2010.

[14] Germund Dahlquist and Åke Björck. Numerical Methods in Scientific Computing: Volume 1. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008.

[15] Hans-Joachim Bungartz and Michael Griebel. Sparse grids. Acta Numerica, 13:147-269, 2004.

[16] Russel E. Caflisch. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, 7:1-49, 1998.

[17] G. Leobacher and F. Pillichshammer. Introduction to Quasi-Monte Carlo Integration and Applications. Compact Textbooks in Mathematics. Springer International Publishing, 2014.

[18] J.-L. Bouchot, H. Rauhut, and C. Schwab. Multi-level Compressed Sensing Petrov-Galerkin discretization of high-dimensional parametric PDEs. ArXiv e-prints, January 2017.

[19] Abdellah Chkifa, Nick Dexter, Hoang Tran, and Clayton G. Webster. Polynomial approximation via compressed sensing of high-dimensional functions on lower sets. arXiv preprint arXiv:1602.05823, 2016.

[20] Holger Rauhut and Christoph Schwab. Compressive sensing Petrov-Galerkin approximation of high-dimensional parametric operator equations. Mathematics of Computation, 86(304):661-700, 2017.

[21] Christoph Schwab and Radu Alexandru Todor. Karhunen-Loève approximation of random fields by generalized fast multipole methods. Journal of Computational Physics, 217(1):100-122, 2006.
[22] Deanna Needell and Joel A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301-321, 2009.

[23] Simon Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543-2563, 2011.

[24] M. A. Iwen, A. Gilbert, M. Strauss, et al. Empirical evaluation of a sub-linear time sparse DFT algorithm. Communications in Mathematical Sciences, 5(4):981-998, 2007.

[25] Daniel Potts and Toni Volkmer. Sparse high-dimensional FFT based on rank-1 lattice sampling. Applied and Computational Harmonic Analysis, 41(3):713-748, 2016.

[26] Michael Kapralov. Sparse Fourier transform in any constant dimension with nearly-optimal sample complexity in sublinear time. arXiv preprint arXiv:1604.00845, 2016.

[27] Yishay Mansour. Randomized interpolation and approximation of sparse polynomials. In Proceedings of the 19th International Colloquium on Automata, Languages and Programming, ICALP '92, pages 261-272, London, UK, 1992. Springer-Verlag.

[28] Piotr Indyk and Michael Kapralov. Sparse Fourier transform in any constant dimension with nearly-optimal sample complexity in sublinear time. 2014.

[29] Bosu Choi, Andrew Christlieb, and Yang Wang. Multi-dimensional sublinear sparse Fourier algorithm. arXiv preprint arXiv:1606.07407, 2016.

[30] Lucia Morotti. Explicit universal sampling sets in finite vector spaces. Applied and Computational Harmonic Analysis, 43(2):354-369, 2017.

[31] Daniel Potts and Toni Volkmer. Multivariate sparse FFT based on rank-1 Chebyshev lattice sampling. In Sampling Theory and Applications (SampTA), 2017 International Conference on, pages 504-508. IEEE, 2017.

[32] Ronald DeVore, Guergana Petrova, and Przemyslaw Wojtaszczyk. Approximation of functions of few variables in high dimensions. Constructive Approximation, 33(1):125-143, 2011.

[33] Jie Shen and Li-Lian Wang. Sparse spectral approximations of high-dimensional problems based on hyperbolic cross. SIAM Journal on Numerical Analysis, 48(3):1087-1109, 2010.

[34] Dinh Dũng, Vladimir N. Temlyakov, and Tino Ullrich. Hyperbolic cross approximation. arXiv preprint arXiv:1601.03978, 2016.

[35] B. Choi, A. Christlieb, and Y. Wang. Multi-dimensional Sublinear Sparse Fourier Algorithm. ArXiv e-prints, June 2016.

[36] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.

[37] Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing. Springer, 2013.

[38] A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin. Sublinear approximation of compressible signals. Proc. SPIE Intell. Integrated Microsystems (IIM), page 623, 2006.

[39] A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin. One sketch for all: Fast algorithms for compressed sensing. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, STOC '07, pages 237-246, New York, NY, USA, 2007. ACM.

[40] Anna C. Gilbert, Piotr Indyk, Mark Iwen, and Ludwig Schmidt. Recent developments in the sparse Fourier transform: a compressed Fourier transform for big data. IEEE Signal Processing Magazine, 31(5):91-100, 2014.

[41] Sina Bittens, Ruochuan Zhang, and Mark A. Iwen. A deterministic sparse FFT for functions with structured Fourier sparsity. arXiv preprint arXiv:1705.05256, 2017.

[42] Mark A. Iwen. A deterministic sub-linear time sparse Fourier algorithm via non-adaptive compressed sensing methods. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 20-29. Society for Industrial and Applied Mathematics, 2008.
[43] J. Bailey, Mark A. Iwen, and Craig V. Spencer. On the design of deterministic matrices for fast recovery of Fourier compressible functions. SIAM Journal on Matrix Analysis and Applications, 33(1):263-289, 2012.

[44] Xianfeng Hu, Mark Iwen, and Hyejin Kim. Rapidly computing sparse Legendre expansions via sparse Fourier transforms. Numerical Algorithms, pages 1-31, 2015.

[45] I. B. Segal and M. A. Iwen. Improved sparse Fourier approximation results: Faster implementations and stronger guarantees. Numerical Algorithms, 63:239-263, 2013.

[46] Andrew Christlieb, David Lawlor, and Yang Wang. A multiscale sub-linear time Fourier algorithm for noisy data. Applied and Computational Harmonic Analysis, 40:553-574, 2016.

[47] A. Gilbert, Y. Li, E. Porat, and M. Strauss. Approximate sparse recovery: Optimizing time and measurements. SIAM Journal on Computing, 41(2):436-453, 2012.

[48] Mark A. Iwen. Compressed sensing with sparse binary matrices: Instance optimal error guarantees in near-optimal time. Journal of Complexity, 30(1):1-15, 2014.

[49] Anna C. Gilbert, Yi Li, Ely Porat, and Martin J. Strauss. For-all sparse recovery in near-optimal time. ACM Transactions on Algorithms, 13(3):32:1-32:26, March 2017.

[50] Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413-1457, 2004.

[51] Deanna Needell and Roman Vershynin. Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit. IEEE Journal of Selected Topics in Signal Processing, 4(2):310-316, 2010.

[52] Tong Zhang. Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions on Information Theory, 57(9):6215-6221, 2011.

[53] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027v7, 2011.

[54] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

[55] R. Arratia and L. Gordon. Tutorial on large deviations for the binomial distribution. Bulletin of Mathematical Biology, 51(1):125-131, 1989.

[56] Leslie Greengard and June-Yub Lee. Accelerating the nonuniform fast Fourier transform. SIAM Review, 46(3):443-454, 2004.