SUB-LINEAR FOURIER ALGORITHMS: THEORY AND IMPLEMENTATION

By

David Lawlor

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Applied Mathematics

2012

ABSTRACT

SUB-LINEAR FOURIER ALGORITHMS: THEORY AND IMPLEMENTATION

By David Lawlor

We present a new deterministic algorithm for the sparse Fourier transform problem, in which we seek to identify k ≪ N significant Fourier coefficients from a signal of bandwidth N. Previous deterministic algorithms exhibit quadratic runtime scaling in k, while our algorithm scales linearly with k in the average case. Underlying our algorithm are a few simple observations relating the Fourier coefficients of time-shifted samples to unshifted samples of the input function. This allows us to detect when aliasing between two or more frequencies has occurred, as well as to determine the value of unaliased frequencies. We also show that the small time shifts required for our algorithm can be implemented using two much larger shifts, an important point for applications where the sampling rate is limited by hardware capacity. Empirically our algorithm is orders of magnitude faster than competing algorithms.

We study the effect of noise on the performance of our algorithm, and derive a relationship between its magnitude and the rate at which the signal is sampled. This in turn is used to devise modified versions of our basic algorithm that are robust to noise, although with higher runtime and sampling requirements than in the noiseless case. The first of these is a minor change, and exhibits poor performance relative to the noise level. To remedy this we propose a multi-scale algorithm that allows for error correction in the frequency estimates in an iterative fashion. It is shown empirically that this multi-scale algorithm is robust to noise even with coarse sampling rates.

Finally, we discuss the application of our algorithms to the multi-dimensional setting. A straightforward extension, while simple to describe and implement, exhibits exponential scaling with the dimension. A second extension is then proposed that uses number-theoretic techniques to reduce the multi-dimensional problem to the one-dimensional case, which can then be solved efficiently with the algorithm proposed in Chapter 2. An empirical evaluation shows that this second method significantly outperforms highly optimized FFT implementations in two dimensions.

Copyright by DAVID LAWLOR 2012

To my family

ACKNOWLEDGMENTS

I would first like to thank my advisor, Andrew Christlieb, for his advice and guidance over the past five years. In our very first meeting he showed the patience and wisdom necessary to mold my vague notions of research problems into definite plans. He made sure I kept on pace throughout my graduate career, which sometimes meant pushing me outside of my comfort zone. Beyond his scientific guidance, his advice on navigating an academic career has been invaluable. I would like to thank Anna Gilbert for the many hours spent working through ideas in her office, acting at times as a proxy advisor for me in Ann Arbor. She generously invited me to attend seminars, group meetings, and workshops at the University of Michigan and elsewhere that have helped to define my research interests. I would like to thank Yang Wang for providing the spark that would become my dissertation project, and for finding time in his busy schedule as chair of our department to work through problems and provide feedback on drafts.
I would like to thank Jianliang Qian and Di Liu for serving on my defense committee and for ably teaching the core applied mathematics courses during my graduate studies. I would like to thank the support staff of the math department, and Barbara Miller in particular, for all their help with administrative matters.

I would like to thank my fellow math students for their friendship over the past five years. The nights spent playing softball, sharing dinner, or laughing over beers made the experience of graduate school more rewarding. The applied math group deserves special mention for organizing a student seminar where I was able to practice giving research talks in front of a friendly audience. My friends from college and outside the math department have given me many fond memories having nothing to do with academics. Many thanks go to them for helping me to not miss out on life while pursuing my studies.

My parents, Jack and Marian, my brother Patrick, and my grandfather Bill have been incredibly supportive throughout my graduate career, and I dedicate this dissertation to them. Lastly I would like to express my profound gratitude to my partner Melanie, who has shown me the deepest love, kindness, and patience throughout our time together. It is no exaggeration to say that this dissertation would not have been possible without her support and encouragement.

TABLE OF CONTENTS

List of Tables
List of Figures

Chapter 1  Introduction
  1.1 Sparse Fourier approximation
  1.2 Relationship to compressed sensing
  1.3 Related work
  1.4 Thesis outline

Chapter 2  Adaptive sub-linear Fourier algorithms
  2.1 Introduction
  2.2 Mathematical background
    2.2.1 Notation and preliminaries
    2.2.2 Technical lemmas
  2.3 Algorithms
    2.3.1 Non-adaptive
    2.3.2 Adaptive
  2.4 Average-case analysis
    2.4.1 While loop runtime and sampling complexity
    2.4.2 Random signal model
    2.4.3 Markov analysis of collisions
    2.4.4 Average-case runtime and sampling complexity
  2.5 Empirical Evaluation
    2.5.1 Setup
    2.5.2 Sampling Complexity
    2.5.3 Runtime Complexity
  2.6 Multiple Shifts
    2.6.1 Two shifts
    2.6.2 Three or more shifts
  2.7 Conclusion

Chapter 3  Multi-scale sub-linear Fourier algorithms for noisy data
  3.1 Introduction
  3.2 Mathematical background
    3.2.1 Noise model
    3.2.2 Sensitivity to noise
    3.2.3 Earth mover distance
  3.3 A minor modification
    3.3.1 Rounding
    3.3.2 Empirical evaluation
  3.4 Multi-scale methods
    3.4.1 Preliminaries
    3.4.2 Algorithms
    3.4.3 Empirical evaluation
  3.5 Conclusion

Chapter 4  Sub-linear Fourier algorithms in multiple dimensions
  4.1 Introduction
  4.2 Notation
  4.3 Straightforward extension to multiple dimensions
    4.3.1 Preliminaries
    4.3.2 Algorithm
  4.4 Signal unwrapping and the Chinese Remainder Theorem
    4.4.1 Preliminaries
    4.4.2 Algorithm
  4.5 Empirical evaluation
    4.5.1 Setup
    4.5.2 Runtime and sampling complexity
    4.5.3 Sample lengths and aliased frequencies
  4.6 Conclusion

Bibliography

LIST OF TABLES

Table 2.1: Implementations used in the empirical evaluation.

LIST OF FIGURES

Figure 2.1: (a) Sampling complexity with fixed bandwidth N = 2^22 for PS-Det (blue solid line), GFFT-RS (red solid line), AAFFT (black dashed line), and FFTW (magenta dashed line). (b) Runtime complexity with fixed bandwidth N = 2^22 for PS-Det (blue solid line), GFFT-RF (red solid line), AAFFT (black dashed line), and FFTW (magenta dashed line). For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Figure 2.2: (a) Sampling complexity with fixed sparsity k = 60 for PS-Det (blue solid line), GFFT-RS (red solid line), AAFFT (black dashed line), and FFTW (magenta dashed line). (b) Runtime complexity with fixed sparsity k = 60 for PS-Det (blue solid line), GFFT-RF (red solid line), AAFFT (black dashed line), and FFTW (magenta dashed line).

Figure 3.1: (left) Mean phase error without rounding (shaded attribute) vs. noise level σ and sample length p, in logarithmic scale. (right) Mean phase error with rounding (shaded attribute) vs. noise level σ and sample length p, in logarithmic scale. The bandwidth for both plots is N = 2^22.

Figure 3.2: EMD(1) error as a function of the noise level σ with the sparsity and bandwidth fixed at k = 64 and N = 2^22, respectively. Different values of the parameter c_1 are shown: c_1 = 4 (blue solid line), c_1 = 16 (black dashed line), c_1 = 64 (red solid line), and c_1 = 256 (magenta dashed line).

Figure 3.3: EMD(1) error as a function of the noise level σ with the sparsity and bandwidth fixed at k = 64 and N = 2^22, respectively. Different values of the parameter c_1 are shown: c_1 = 4 (blue solid line), c_1 = 16 (black dashed line), c_1 = 64 (red solid line), and c_1 = 256 (magenta dashed line). (a) Non-adaptive shift algorithm (B = 4), (b) adaptive shift algorithm (α = 1).

Figure 3.4: (a) Samples to achieve EMD(1) error σ/√k, as a function of the sparsity k with fixed σ = 0.512, N = 2^22, for rounding only (blue solid line), non-adaptive shift (red dash-dot line), and adaptive shift (magenta dashed line) algorithms. (b) Runtime to achieve the same error bound, as a function of the sparsity k.

Figure 4.1: (a) 2D sampling complexity with fixed bandwidth N_1 = N_2 = 2^11 for Naive (blue solid line), Naive-Sqrt (black dashed line), Unwrapped (magenta dashed line), and FFTW (red dashed line). (b) 2D runtime complexity with fixed bandwidth N_1 = N_2 = 2^11 for Naive (blue solid line), Naive-Sqrt (black dashed line), Unwrapped (magenta dashed line), and FFTW (red dashed line).

Figure 4.2: (a) Average number of sample lengths used in 2D with fixed bandwidth N_1 = N_2 = 2^11 for Naive (blue solid line), Naive-Sqrt (black dashed line), and Unwrapped (magenta dashed line). Also shown for comparison is 1D PS-Det (red solid line) with fixed bandwidth N = 2^22. (b) Average fraction of aliased frequencies in 2D with fixed bandwidth N_1 = N_2 = 2^11 for Naive (blue solid line), Naive-Sqrt (black dashed line), and Unwrapped (magenta dashed line). Also shown for comparison is 1D PS-Det (red solid line) with fixed bandwidth N = 2^22.

Chapter 1

Introduction

1.1 Sparse Fourier approximation

The Fast Fourier Transform (FFT) [12] is arguably the most ubiquitous numerical algorithm in scientific computing. In addition to being named one of the "Top Ten Algorithms" of the past century [15], the FFT is a critical tool in myriad applications, ranging from signal processing to computational PDE and machine learning.
At the time of its introduction, it represented a major leap forward in the size of problems that could be solved on available hardware, as it reduces the runtime complexity of computing the Discrete Fourier Transform (DFT) of a length-N array from O(N^2) to O(N log N). Even with advanced hardware and optimized implementations, for large N this still represents the major computational bottleneck in many applications.

Any algorithm which computes all N Fourier coefficients necessarily has a runtime complexity of Ω(N), since it takes that much time merely to report the output. However, in many applications it is known that the DFT of the signal of interest is highly sparse – that is, only a small number of coefficients are non-zero. In this case it is possible to break the Ω(N) barrier by asking only for the largest k terms in the signal's DFT. When k ≪ N, existing algorithms for this sparse Fourier approximation problem can significantly outperform even highly optimized FFT implementations [30, 28, 26].

Any sub-linear¹ Fourier algorithm cannot read all of the input, and so must either be approximate or succeed with high probability (or both). The discrete uncertainty principle of Donoho and Stark [16] provides a heuristic justification for sub-sampling in the time domain if the signal of interest is highly sparse. One version states that if N_t, N_ω are the number of non-zero values of a length-N signal in the time and frequency domains, respectively, then it must hold that N_t N_ω ≥ N. Thus when N_ω ≪ N, the number of non-zero time samples must be large. This suggests that the few samples taken by a sub-linear algorithm should still carry significant information about the frequency content of the signal.

To date, most sparse Fourier approximation algorithms have exhibited an empirical speedup over the FFT only for limited sparsity regimes due to their complicated structure, and so are sub-linear only in an asymptotic sense.² In contrast, the algorithms presented in this dissertation have a simple structure which permits a straightforward, fast implementation, resulting in improvements over the FFT for a large range of sparsity values.

¹ That is, an algorithm with runtime o(N).
² The recent algorithm [26] has been shown to outperform the FFT for a much larger range of k, comparable to that of the noiseless algorithm presented in Chapter 2 of this dissertation.

1.2 Relationship to compressed sensing

The term "compressed sensing" refers to a new paradigm in signal processing which seeks to recover a compressible signal from a number of linear measurements roughly proportional to its information content, rather than its nominal dimension. There has been tremendous recent progress in this area, as evidenced by the daily updates of a popular research blog devoted to the subject [9]. While this dissertation does not make explicit use of the results or algorithms of compressed sensing, there are parallels in the approaches used. The purpose of this section is to clarify the relationship between the two.

All algorithms for the sparse approximate DFT problem take a small number of samples of the input x, either at random or in a deterministic fashion. These samples are then processed in a highly non-linear, algorithm-dependent manner to produce a k-term Fourier representation of x – that is, a list \{(\omega_\ell, a_\ell)\}_{\ell=1}^{k} of significant frequency/coefficient pairs.
In other words, these algorithms approximately solve the severely underdetermined system R F^* \hat{x} = R x, where R is the restriction to the samples used by the algorithm, F^* is the adjoint of the N × N discrete Fourier matrix with entries

F_{jk} = \frac{1}{\sqrt{N}} e^{-2\pi i jk/N}, \quad 0 \le j, k < N,   (1.1)

and \hat{x} is the discrete Fourier transform of x.

In [8], Candès, Romberg, and Tao considered the dual problem: that of recovering a given signal from highly incomplete Fourier measurements. Specifically, suppose that a signal x of length N is the superposition of k spikes at times t = τ_j:

x[t] = \sum_{j=1}^{k} x[\tau_j]\, \delta(t - \tau_j).   (1.2)

The authors show that, with high probability, x can be recovered exactly from a randomly chosen set Ω of m frequencies from the discrete Fourier transform of x, provided

m \ge C k \log N   (1.3)

for some constant C whose value depends on the desired probability of success. This can be viewed as the severely underdetermined linear system dual to the system described above: R F x = R \hat{x}. The recovery algorithm in this case is the ℓ1 minimization

g^* = \operatorname{argmin} \|g\|_1 \quad \text{subject to} \quad \hat{g}(\omega) = \hat{x}(\omega) \text{ for all } \omega \in \Omega.   (1.4)

The idea of using ℓ1 minimization to recover sparse vectors has been studied extensively in a number of research communities, including seismic imaging [42], image processing [41], and signal processing [10], where it is commonly referred to as basis pursuit. The theoretical foundations of ℓ1 approximation are treated in depth in the monograph [39].

Sparse Fourier approximation and compressed sensing are therefore broadly similar in both their goals (sparse approximation of signals) and methods (in particular, the use of randomization). There are, however, substantial differences between the two, which we now enumerate.

1. Sampling requirements. The compressed sensing model requires measurement matrices to satisfy the Restricted Isometry Property, which has been shown to hold with high probability for random Gaussian, Bernoulli, and Fourier ensembles. Sparse Fourier algorithms, on the other hand, generally require more structure in their sampling sets. This is obviously true for the deterministic algorithms, and also for some of the randomized versions – in particular, [21] requires samples that lie on arithmetic progressions.

2. Reconstruction costs. As mentioned above, the reconstruction of the target signal in the compressed sensing model is achieved by a convex optimization problem (which can be recast as a linear program). This is expected to incur a computational cost of O(N^3) for a signal of length N. Most sparse Fourier approximation algorithms are sub-linear in time, and so dramatically faster.

3. Allocation of resources. The balance between the two previous items is the major point of distinction. Indeed, we view the comparison of the two paradigms as an "apples-to-oranges" scenario: in the seismic imaging environment (where practitioners have recently implemented compressed sensing methods [35], [14]), high acquisition costs make long processing times on the back end more palatable. Sparse Fourier approximation algorithms, however, were developed with data streaming applications in mind. In this arena, low signal acquisition costs and enormous problem sizes necessitate fast algorithms with sparing use of memory resources.

1.3 Related work

The first works to implicitly address the sparse approximate DFT problem appeared in the theoretical computer science literature in the early 1990s.
In [36], a variant of the Fourier transform for Boolean functions was shown to have applications for learnability. A polynomial-time algorithm to find large coefficients in this basis was given in [33], while the interpolation of sparse polynomials over finite fields was considered in [37]. It was later realized [21] that this last algorithm could be considered as an approximate DFT for the special case when N is a power of two.

In the past ten or so years, a number of algorithms have appeared which directly address the problem of computing sparse approximate Fourier transforms. When comparing the results in the literature, care must be taken to identify the class of signals over which a specific algorithm is to perform, as well as to identify the error bounds of a given method. Different algorithms have been devised in different research communities, and so have varying assumptions on the underlying signals as well as different levels of acceptable error.

The first result with sub-linear runtime and sampling requirements appeared in [20]. They give a poly(k, log N, log(1/δ), 1/ε) time algorithm for finding, with probability 1 − δ, an approximation y of the DFT of the input x that is nearly optimal, in the sense that

\|\hat{x} - y\|_2^2 \le (1 + \varepsilon) \|\hat{x} - \hat{x}_{\mathrm{opt}}\|_2^2,   (1.5)

where \hat{x}_{\mathrm{opt}} is the best k-term approximation to \hat{x}. This type of error bound is known as an "ℓ2/ℓ2" guarantee. Here the exponent of k in the runtime is two, so the algorithm is quadratic in the sparsity. Moreover, the algorithm is non-adaptive in the sense that the samples used are independent of the input x.

This algorithm was modified in [21] to bring the dependence on k down to linear.³ This was accomplished mainly by replacing uniform random variables (used to sample the input) by random arithmetic progressions, which allowed the use of the nonequispaced fast Fourier transform to sample from intermediate representations and to estimate the coefficients in near-linear time. The increased overhead of this procedure, however, limited the range of k for which the algorithm outperformed a standard FFT implementation [30].

Around the same time, a similar algorithm was developed in the context of list decoding for proving hard-core predicates for one-way functions [3]. This can be considered an extension of [33], and like [20, 21] is a randomized algorithm. Since the goal in this work was to give a polynomial-time algorithm for list decoding, no effort was made to optimize the dependence on k; it stands at k^{11/2}, considerably higher than [20, 21]. The randomness in this algorithm is used only to construct a sample set on which norms are estimated, and in [2] this set is replaced with a deterministic construction. This construction is based on the notion of ε-approximating the uniform distribution over arithmetic progressions, and relies on existing constructions of ε-biased sets of small size [32, 1]. Depending on the size of the ε-biased sets used, the sampling and runtime complexities are O(k^4 log^c N) and O(k^6 log^c N), respectively, for some c > 4.⁴

³ See [22] for a "user-friendly" description of the improved algorithm.
⁴ Specifically, the runtime is O(k^2 · log N · |S|), where S is the set of samples read by the algorithm. This set takes the form S = \bigcup_{\ell=1}^{\log N} (A - B_\ell), where A has ε-discrepancy on rank-2 Bohr sets, B_\ell ε-approximates the uniform distribution on [0, 2^\ell) ∩ Z, and A - B_\ell is the difference set.
Using constructions from [32] one has |A| = O(ε^{-1} log^4 N) and |B_\ell| = O(ε^{-3} log^4 N); setting ε = Θ(k^{-1}) and noting that |\bigcup_\ell (A - B_\ell)| = O(\sum_\ell |A - B_\ell|) and |A - B_\ell| = O(|A| |B_\ell|) (see, e.g., [45]), one obtains the stated sampling and runtime complexities.

In the series of works [27, 28, 29], a different deterministic algorithm for sparse Fourier approximation was given that relies on the combinatorial properties of aliasing, or collisions among frequencies in sub-sampled DFTs. By taking enough short DFTs of co-prime lengths, and employing the Chinese Remainder Theorem to reconstruct energetic frequencies from their residues modulo these sample lengths, the author is able to prove sampling and runtime bounds of O(k^2 log^4 N). The error bound is of the form

\|\hat{x} - y\|_2 \le \|\hat{x} - \hat{x}_{\mathrm{opt}}\|_2 + \frac{1}{\sqrt{k}} \|\hat{x} - \hat{x}_{\mathrm{opt}}\|_1;   (1.6)

it has been shown that the stronger "ℓ2/ℓ2" guarantee (1.5) cannot hold for a sub-linear, deterministic algorithm [11]. Moreover, the range of k for which this algorithm is faster than the FFT is smaller in practice than that of [21].

Most recently, the authors of [26] presented a randomized algorithm that extends by an order of magnitude the range of sparsity for which it is faster than the FFT. This is accomplished by removing the iterative aspect from [21] and using more efficient filters, which are nearly flat within the passband and which decay exponentially outside. In contrast, the box-car filters used in [21] have a frequency response which oscillates and decays like |ω|^{-1}. In addition, the identification of significant frequencies is done by direct estimation after hashing into a large number of bins, rather than the binary search technique of [21]. These changes give a runtime bound of O(\log N \sqrt{N k \log N}) and a somewhat stronger error bound

\|\hat{x} - y\|_\infty^2 \le \frac{\varepsilon}{k} \|\hat{x} - \hat{x}_{\mathrm{opt}}\|_2^2 + \delta \|x\|_1^2   (1.7)

with probability 1 − 1/N, where ε > 0 and δ = N^{-O(1)} is a precision parameter.

These existing algorithms generally take one of two approaches to the sparse Fourier transform problem. In [20, 3, 21, 26], the spectrum of the input is randomly permuted and then run through a low-pass filter to isolate and identify frequencies which carry a large fraction of the signal's energy. This leads to randomized algorithms that fail on a non-negligible set of possible inputs. On the other hand, [28] takes advantage of the combinatorial properties of aliasing in order to identify the significant frequencies. This leads to a deterministic algorithm with higher runtime and sampling requirements than the randomized algorithms mentioned.

Both of these randomized and deterministic approaches have drawbacks. Randomized algorithms are not suitable for failure-intolerant applications, while the process used to reconstruct significant frequencies in [28] relies on the Chinese Remainder Theorem (CRT), which is highly unstable to errors in the residues. While there do exist algorithms for "noisy Chinese Remaindering" [23, 5, 44], these have thus far not found application to the sparse DFT problem, and we leave this as future work.

As this dissertation was being prepared, the author became aware of an independent work using very similar methods for frequency estimation in the noiseless case [25]. Both methods consider the phase difference between Fourier samples to extract frequency information, but are based on different techniques for binning significant frequencies. The authors of [25] use random dilations and the efficient filters of [26], whereas we use different sample lengths in the spirit of [28].
We believe both contributions are of interest, and reinforce the notion that exploiting phase information is critical for developing fast, robust algorithms for sparse Fourier approximation.

1.4 Thesis outline

The remainder of this dissertation is structured as follows. In Chapter 2 we give an average-case Θ(k log k) algorithm for the sparse Fourier transform in the noiseless setting. The runtime bound is proven over a certain class of random signals, and it is verified empirically with numerical simulations. We also show how to implement the algorithm using lower-resolution hardware, an important consideration for applications. In Chapter 3 we extend the algorithm given in Chapter 2 to handle noisy signals. We first give a minor modification of the algorithm in Chapter 2 to handle low-level noise, and we discuss the merits of measuring the error of approximation using the earth-mover distance (EMD) in place of the standard Euclidean norm. We then give two algorithms to handle more general noise. This is done in a multi-scale manner, where we learn chunks of significant frequencies in an iterative fashion. Finally, in Chapter 4 we present extensions of our algorithm to multiple dimensions. Since the runtime of the FFT is O(N^d log N) in d dimensions, we expect to see the greatest improvement over traditional Fourier transform methods in this setting. The first extension is quite natural and follows from the separability of the Fourier exponentials e^{2πi ω·x} in multiple dimensions, but has runtime and sampling complexities that scale exponentially in the dimension d. We also give a second extension which relies on a clever application of the Chinese Remainder Theorem to construct an equivalent one-dimensional problem that is solved using the methods of Chapter 2. This second algorithm significantly outperforms both the previous extension and highly-optimized FFT implementations, as shown empirically in two dimensions.

Chapter 2

Adaptive sub-linear Fourier algorithms

2.1 Introduction

In this chapter we describe a simple, deterministic algorithm that avoids reconstruction with the Chinese Remainder Theorem. As mentioned in Chapter 1, existing deterministic algorithms for the sparse Fourier approximation problem rely on the CRT for energetic frequency reconstruction, and so are very unstable to errors in the residues. Our method relies on sampling the signal in the time domain at slightly shifted points, and thus it assumes access to an underlying continuous-time signal. The shifted time samples allow us to determine the value of significant frequencies in sub-sampled FFTs and also indicate when two or more frequencies have been aliased in such a sub-sampled FFT. These two key facts allow us to significantly reduce (by up to two orders of magnitude) the average-case sampling and runtime complexity of the sparse FFT over a certain class of random signals. Our worst-case bounds improve by a constant factor those of prior deterministic algorithms. We present both adaptive and non-adaptive versions of our algorithms. If the application allows samples to be acquired adaptively (that is, dependent on previous samples), we are able to improve further on our average-case bounds.

The remainder of this chapter is organized as follows. In section 2.2 we introduce notation and prove the technical lemmas underlying our algorithms. In section 2.3 we introduce randomized and deterministic versions of our algorithm.
In section 2.4 we prove that our algorithm has average-case runtime and sampling complexities of Θ(k log k) and Θ(k), respectively. In section 2.5 we present the results of an empirical evaluation of our algorithm and compare its runtime and sampling requirements to competing algorithms. In section 2.6 we describe how to combine two or more large shifts to reduce the sampling rate requirements of the algorithm, making them more suitable for hardware implementation. Finally in section 2.7 we provide some concluding remarks.

2.2 Mathematical background

In this section we lay the groundwork for the remainder of this chapter, and indeed the rest of this dissertation. In section 2.2.1 we explain the notation used in the sequel and give some elementary results concerning the relationship between Fourier series and the discrete Fourier transform. We also derive two key results concerning the phase of DFT coefficients and the relative magnitude of aliased coefficients that are at the heart of our algorithms. In section 2.2.2 we prove two lemmas which will be useful in deriving tests for aliasing and worst-case bounds for our algorithms.

2.2.1 Notation and preliminaries

Throughout this dissertation we shall be concerned with frequency-sparse band-limited signals S : [0, 1) → C of the form

S(t) = \sum_{j=1}^{k} a_j e^{2\pi i \omega_j t},   (2.1)

where ω_j ∈ [−N/2, N/2) ∩ Z, a_j ∈ C, and k ≪ N. The Fourier series of S is given by

\hat{S}(\omega) = \int_0^1 S(t)\, e^{-2\pi i \omega t}\, dt, \quad \omega \in Z.   (2.2)

For signals of the form (2.1) we have

\hat{S}(\omega_\ell) = \int_0^1 \Big( \sum_{j=1}^{k} a_j e^{2\pi i \omega_j t} \Big) e^{-2\pi i \omega_\ell t}\, dt = \sum_{j=1}^{k} a_j \int_0^1 e^{2\pi i (\omega_j - \omega_\ell) t}\, dt.   (2.3)

Since ω_j and ω_ℓ are integers, the integral in (2.3) is simply the Kronecker delta δ_{jℓ}, so we have \hat{S}(ω_ℓ) = a_ℓ for ℓ = 1, . . . , k and \hat{S}(ω) = 0 for all other ω ∈ [−N/2, N/2) ∩ Z.

We use the "big-oh" asymptotic notation common in computer science, which we summarize here briefly. For two functions f, g : N → R⁺ we say f = O(g) if there is a positive constant C so that for all n large enough we have f(n) ≤ C g(n). We say f = Θ(g) if f = O(g) and g = O(f). Finally, we say f = o(g) if lim_{n→∞} f(n)/g(n) = 0.

We use the notation P[A] to denote the probability of an event A, and the notation E[X] to denote the expected value of a random variable X. The underlying probability measure will always be clear from context.

Given any finite sequence S = (s_0, s_1, . . . , s_{p−1}) of length p we define its discrete Fourier transform (DFT) by

\hat{S}[h] = \sum_{j=0}^{p-1} s_j e^{-2\pi i jh/p} = \sum_{j=0}^{p-1} S[j]\, W_p^{jh},   (2.4)

where h = 0, 1, . . . , p − 1, S[j] := s_j, and W_p := e^{−2πi/p} is the primitive p-th root of unity. For a detailed discussion of the DFT and its relationship to Fourier series see [6]. The Fast Fourier Transform (FFT) [12] allows the computation of \hat{S} in O(p log p) time for any integer p. See [46] for an in-depth treatment of the FFT and its many possible implementations.

The simplest way to determine the non-zero Fourier coefficients of a signal of the form (2.1) is to sample at the Nyquist rate 1/N and compute the full DFT. While guaranteed to give the correct output, this procedure has sampling and runtime complexities of Θ(N) and Θ(N log N), respectively. This is computationally wasteful, since the assumption k ≪ N means that almost all coefficients of the Nyquist-rate DFT are zero. We are thus interested in sub-linear algorithms, for which both the runtime and sampling complexities are o(N).
A frequency-sparse signal with k non-zero Fourier coefficients can be described using only Θ(k) space, since we need only to store the locations of the non-zero Fourier coefficients and the values of those coefficients. In section 2.3 we use the notation

R := \{ (\omega_j, a_j) \}_{j=1}^{k}   (2.5)

to denote such a representation.

We apply the DFT to discrete samples of S(t) to compute the Fourier coefficients a_j of S(t). For an integer p and real ε > 0 we form two discrete arrays S_p and S_{p,ε} by sampling S(t) at rate 1/p starting at t = 0 and t = ε, respectively. That is, for j = 0, 1, . . . , p − 1,

S_p[j] = S(j/p),   (2.6)
S_{p,ε}[j] = S(j/p + ε).   (2.7)

If S(t) is of the form (2.1), we have for the unshifted samples that

\hat{S}_p[h] = \sum_{j=0}^{p-1} S_p[j]\, e^{-2\pi i hj/p}
            = \sum_{j=0}^{p-1} S(j/p)\, e^{-2\pi i hj/p}
            = \sum_{\ell=1}^{k} a_\ell \sum_{j=0}^{p-1} e^{2\pi i j(\omega_\ell - h)/p}
            = p \sum_{\omega_\ell \equiv h \ (\mathrm{mod}\ p)} a_\ell.   (2.8)

Here we used in the last equality the fact that \sum_{j=0}^{p-1} e^{2\pi i j \alpha} equals p if α ∈ Z and zero otherwise. For the shifted samples we have, similarly,

\hat{S}_{p,\varepsilon}[h] = \sum_{j=0}^{p-1} S_{p,\varepsilon}[j]\, e^{-2\pi i hj/p}
                          = \sum_{\ell=1}^{k} a_\ell e^{2\pi i \omega_\ell \varepsilon} \sum_{j=0}^{p-1} e^{2\pi i j(\omega_\ell - h)/p}
                          = p \sum_{\omega_\ell \equiv h \ (\mathrm{mod}\ p)} a_\ell e^{2\pi i \omega_\ell \varepsilon}.   (2.9)

Now assume that all ω_j (mod p), 1 ≤ j ≤ k, are distinct. In this case we have from (2.8) that

\hat{S}_p[h] = p a_j  if h ≡ ω_j (mod p),  and 0 otherwise.

By examining the peaks of \hat{S}_p we can determine ω_j (mod p) for 1 ≤ j ≤ k. Furthermore, from (2.9) we have

\hat{S}_{p,\varepsilon}[h] = p a_j e^{2\pi i \varepsilon \omega_j}  if h ≡ ω_j (mod p),  and 0 otherwise.

Using both the ε-shifted and the unshifted samples, we can recover the frequencies ω_j (mod ε^{-1}) by noting that, for h ≡ ω_j (mod p), we have \hat{S}_{p,ε}[h] / \hat{S}_p[h] = e^{2πi ε ω_j}. Hence

2\pi \varepsilon \omega_j \equiv \mathrm{Arg}\Big( \frac{\hat{S}_{p,\varepsilon}[h]}{\hat{S}_p[h]} \Big) \pmod{2\pi},   (2.10)

where Arg(z) denotes the phase angle of the complex number z in [−π, π). Assume that we take ε ≤ 1/N. Then ω_j is completely determined by (2.10), as there will be no wrap-around aliasing, and we have

\omega_j = \frac{1}{2\pi\varepsilon}\, \mathrm{Arg}\Big( \frac{\hat{S}_{p,\varepsilon}[h]}{\hat{S}_p[h]} \Big).   (2.11)

Remark 2.2.1. In fact, more generally, if we have an estimate of ω_j, say |ω_j| < L/2, then by taking ε ≤ 1/L the same reconstruction formula (2.11) holds; we will use this observation in Chapter 3 when we develop a multi-scale version of our algorithm for the case of noisy input.

The fact that taking slightly shifted samples allows us to identify frequencies in S(t) underlies the algorithms which follow, and the remainder of this chapter analyzes various aspects of the proposed algorithms, such as efficiency and robustness.

Equation (2.8) represents the phenomenon of aliasing, which occurs when two or more frequencies present in S(t) are congruent modulo the sampling rate p. This can happen whenever p < N, and in this case it is impossible to distinguish the aliased frequencies using only samples at t = j/p. When aliasing occurs, reconstruction via (2.11) is no longer valid. The aliasing phenomenon presents a serious challenge for any method with sub-linear sampling complexity. In the next section we develop a simple test to determine whether or not aliasing has occurred in a p-length DFT, which then allows us to effectively overcome this challenge and develop provably correct sub-linear algorithms.
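To make the preceding derivation concrete, here is a minimal NumPy sketch of frequency recovery from shifted sub-samples. It builds a signal of the form (2.1), computes the unshifted and shifted sub-sampled DFTs (2.8)–(2.9), and applies (2.11) at each energetic bin; it also uses the magnitude test developed in the next subsection (cf. (2.17)) to skip aliased bins. All parameter values and the tolerance are illustrative choices, not those of our reference implementation.

```python
import numpy as np

N = 2**16                      # bandwidth
k = 4                          # sparsity
p = 23                         # sub-sampled DFT length, p >= c*k
eps = 1.0 / (2 * N)            # shift; any eps <= 1/N avoids phase wrap-around

rng = np.random.default_rng(0)
omegas = rng.choice(np.arange(-N // 2, N // 2), size=k, replace=False)
coeffs = np.exp(2j * np.pi * rng.random(k))        # unit-modulus coefficients

def S(t):
    """Evaluate the frequency-sparse signal (2.1) at the points t."""
    return (coeffs[:, None] * np.exp(2j * np.pi * np.outer(omegas, t))).sum(axis=0)

t = np.arange(p) / p
Sp = np.fft.fft(S(t))          # unshifted sub-sampled DFT, cf. (2.8)
Spe = np.fft.fft(S(t + eps))   # eps-shifted sub-sampled DFT, cf. (2.9)

tau = p / N                    # aliasing tolerance, cf. (2.17)
for h in np.argsort(-np.abs(Sp))[:k]:
    if np.abs(Sp[h]) < 1e-8:   # empty bin (possible when frequencies collide)
        continue
    ratio = Spe[h] / Sp[h]
    if np.abs(np.abs(ratio) - 1) <= tau:           # bin passes the aliasing test
        omega = round(np.angle(ratio) / (2 * np.pi * eps))   # formula (2.11)
        print(h, omega, omega % p == h, np.round(Sp[h] / p, 3))
```

With these parameters a collision is unlikely, so all k bins typically pass the test, each recovered ω satisfies ω ≡ h (mod p), and the normalized peak value \hat{S}_p[h]/p returns the corresponding coefficient.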
2.2.2 Technical lemmas

To effectively apply the sub-sampling idea in a Fourier algorithm one must first overcome the aliasing challenge. Previous approaches applied the Chinese Remainder Theorem to reconstruct the frequencies ω_j from their residues modulo a suitable number of p's; these approaches must overcome a registration problem, matching up the residues whenever a new p is used. The complexity of this matching problem becomes combinatorial in k, and clever techniques which require more samples from S(t) must be used to maintain sub-linearity; see [28, 29] for details. Using shifted sub-samples gives us a simple yet extremely effective criterion to determine whether or not aliasing has occurred at a given location in a p-length DFT, without resorting to complicated combinatorial techniques.

Denote by I(S, h; p) the set of frequencies present in S(t) that are congruent to h (mod p), so that I(S, h; p) := { j : ω_j ≡ h (mod p) }. From equations (2.8) and (2.9) we have

|\hat{S}_{p,\varepsilon}[h]|^2 - |\hat{S}_p[h]|^2 = p^2 \sum_{j,\ell \in I(S,h;p)} a_j \bar{a}_\ell\, e^{2\pi i \varepsilon (\omega_j - \omega_\ell)} - p^2 \Big| \sum_{j \in I(S,h;p)} a_j \Big|^2.   (2.12)

Lemma 2.2.2. Let p > 1 and h ∈ {0, 1, . . . , p − 1}. Assume that q = |I(S, h; p)| > 1, i.e. ω_j ≡ h (mod p) for more than one j in S(t). Then we have the following:

(A) Let ε > 0 and E := { ω_j − ω_m : j, m ∈ I(S, h; p) }. Suppose that all elements of εE are distinct (mod 1). Then |\hat{S}_{p,mε}[h]| ≠ |\hat{S}_p[h]| for some 1 ≤ m ≤ q² − q.

(B) For almost all ε > 0 we have |\hat{S}_{p,ε}[h]| ≠ |\hat{S}_p[h]|.

Proof. The proof of part (B) is immediate from (2.12). Observe that f(ε) := |\hat{S}_{p,ε}[h]|² − |\hat{S}_p[h]|² is a trigonometric polynomial in ε, and it is not identically 0 given that q = |I(S, h; p)| > 1. Thus it has at most finitely many zeros for ε ∈ [0, 1), and hence (B) is clearly true.

We resort to the Vandermonde matrix to prove part (A). Recall that a Vandermonde matrix V satisfies V_{ij} = x_i^{j−1} for some values x_i. It is well-known that if the x_i are distinct then V is invertible [24]. For simplicity we write f(t) = \sum_{\alpha \in E} c_\alpha e^{2\pi i \alpha t}. Set r_\alpha := e^{2\pi i \alpha \varepsilon}, where ε satisfies the hypothesis of the lemma, which implies that all r_\alpha are distinct. Assume the claim of part (A) is false. Then we have f(mε) = 0 for all 0 ≤ m ≤ q² − q. Here f(0) = 0 is automatic because \hat{S}_{p,0} = \hat{S}_p. Thus we have

\sum_{\alpha \in E} c_\alpha r_\alpha^m = 0, \quad m = 0, 1, \ldots, q^2 - q.   (2.13)

But the cardinality of E is at most q² − q + 1, which means that there are at most q² − q + 1 terms in the sum in (2.13). By hypothesis all r_α are distinct, so the matrix [r_α^m] is a non-singular Vandermonde matrix. Thus for (2.13) to hold all c_α must be zero. This is clearly not the case, and a contradiction.

Remark 2.2.3. Any irrational ε, or ε = a/b with a, b coprime and b ≥ 2N, will satisfy the hypothesis of part (A) of Lemma 2.2.2. For irrational ε, suppose to the contrary that there were a quadruplet (j_1, j_2, j_3, j_4) of indices so that

\varepsilon (\omega_{j_1} - \omega_{j_2}) \equiv \varepsilon (\omega_{j_3} - \omega_{j_4}) \pmod{1}.   (2.14)

Then we would have

\varepsilon (\omega_{j_1} - \omega_{j_2} - \omega_{j_3} + \omega_{j_4}) \in Z,   (2.15)

and since all ω_j are integers this would imply ε ∈ Q, a contradiction. Similarly, in the case ε = a/b in lowest terms with b ≥ 2N, suppose there were a quadruplet of indices so that (2.14) held. Since each |ω_j| ≤ N/2, the term in parentheses in (2.15) is less than 2N in absolute value. Thus we would have

n = \frac{a}{b} (\omega_{j_1} - \omega_{j_2} - \omega_{j_3} + \omega_{j_4})

for some n ∈ Z, but this is impossible since by assumption b ≥ 2N.

Remark 2.2.4. It is also easy to show that in the special case where all coefficients a_j are real and |I(S, h; p)| = 2, we have |\hat{S}_{p,ε}[h]| ≠ |\hat{S}_p[h]| for any ε = a/b with a, b coprime and b ≥ N. In this case, two frequencies are aliased modulo p, so that ω_2 = ω_1 + np for some nonzero n ∈ Z with −N/(2p) ≤ n < N/(2p).
Moreover, for h ≡ ω_1 ≡ ω_2 (mod p), we have

|\hat{S}_p[h]| = p\, |a_1 + a_2|,
|\hat{S}_{p,\varepsilon}[h]| = p\, |a_1 e^{2\pi i \omega_1 \varepsilon} + a_2 e^{2\pi i \omega_2 \varepsilon}| = p\, |a_1 + a_2 e^{2\pi i \varepsilon (\omega_2 - \omega_1)}| = p\, |a_1 + a_2 e^{2\pi i \varepsilon np}|.   (2.16)

Since the coefficients a_j are assumed to be real, for the remark to hold it suffices that εnp be non-integral. This is clear since εnp = anp/b and by assumption −1/2 ≤ np/b < 1/2.

Lemma 2.2.2 allows us to determine whether aliasing has occurred by checking whether |\hat{S}_{p,ε}[h]| / |\hat{S}_p[h]| = 1 for a few values of ε. It offers both a deterministic (part (A)) and a random (part (B)) procedure to identify aliasing in the sub-sampled DFTs. In practice we use a test related to part (B), and need to set a tolerance τ in order to accept or reject frequencies according to the criterion

\Big| \frac{|\hat{S}_{p,\varepsilon}[h]|}{|\hat{S}_p[h]|} - 1 \Big| \le \tau.   (2.17)

In our algorithms we choose ε = 1/(cN) for some small constant c ≥ 1, which would satisfy the hypothesis of part (A) of Lemma 2.2.2. A tolerance τ on the order of p/N works well in general, which is what we use in our experiments in section 2.5 below.

In our algorithms we will take a number of sub-sampled DFTs of an input signal S(t) of the form (2.1), whose lengths we denote p_ℓ. Lemma 2.2.2 allows us to determine whether or not two or more frequencies are aliased (mod p_ℓ), so that we only add non-aliased terms to our representation. Since it is unlikely that two or more frequencies are aliased modulo two different sampling rates, using a different p_ℓ in a subsequent iteration lets us quickly discover all frequencies present in S(t).

Lemma 2.2.6 gives a worst-case bound on the number of p_ℓ's required by our deterministic algorithm to identify all k frequencies in a given Fourier-sparse signal. It is similar to [28, Lemma 1], but with a smaller constant. In its proof we use the CRT, which we quote here for completeness (see, e.g., [38]).

Theorem 2.2.5 (Chinese Remainder Theorem). Any integer n is uniquely specified modulo N by its remainders modulo m pairwise relatively prime numbers p_ℓ, provided \prod_{\ell=1}^{m} p_\ell \ge N.

Lemma 2.2.6. Let M > 1. It suffices to take 1 + (k − 1) log_M N pairwise relatively prime p_ℓ's with p_ℓ ≥ M to ensure that each frequency ω_j is isolated (i.e. not aliased) (mod p_ℓ) for at least one ℓ.

Proof. Assume otherwise, namely that given p_ℓ for ℓ = 1, 2, . . . , L with L ≥ 1 + (k − 1) log_M N, there exists some ω_j that is aliased (mod p_ℓ) for every ℓ. By the Pigeonhole Principle there exists at least one ω_m ≠ ω_j such that ω_j − ω_m ≡ 0 (mod p_ℓ) at least q times, where q > log_M N. Without loss of generality we assume that ω_j − ω_m ≡ 0 (mod p_ℓ) for ℓ = 1, 2, . . . , q. Now by the fact that p_ℓ ≥ M we have

\prod_{\ell=1}^{q} p_\ell \ge M^q \ge N.

By the CRT we would then have ω_j ≡ ω_m (mod N), a contradiction.

Remark 2.2.7. The algorithm in [28] requires taking 1 + 2k log_k N co-prime sample lengths, since that algorithm requires each ω_j to be isolated in at least half of the DFTs of length p_ℓ. This requirement stems from the fact that that algorithm cannot distinguish between aliased and non-aliased frequencies in a given sub-sampled DFT. Our worst-case bound is approximately a factor of two better, though in practice our algorithms never use all those sample lengths on random input. The fact that we can tell which frequencies are "good" for a given p allows us to construct our Fourier representation one term at a time, and quit when we have achieved a prescribed stopping criterion.
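Before turning to the algorithms themselves, a small numerical check of Lemma 2.2.2 may be instructive. The NumPy sketch below (all concrete values are illustrative) aliases two frequencies modulo p and shows that the shifted and unshifted DFT magnitudes at the shared bin differ by far more than the tolerance p/N of (2.17):

```python
import numpy as np

p, N = 11, 2**10
w1, w2 = 7, 7 + 40 * p         # w1 = w2 (mod p): a collision at bin h = 7
a1, a2 = 1.0, 0.5              # real coefficients, as in Remark 2.2.4
eps = 1.0 / N

n = np.arange(p)
unshifted = a1 * np.exp(2j * np.pi * w1 * n / p) \
          + a2 * np.exp(2j * np.pi * w2 * n / p)
shifted = a1 * np.exp(2j * np.pi * w1 * (n / p + eps)) \
        + a2 * np.exp(2j * np.pi * w2 * (n / p + eps))

h = w1 % p
m0 = np.abs(np.fft.fft(unshifted)[h])  # equals p * |a1 + a2|, cf. (2.16)
m1 = np.abs(np.fft.fft(shifted)[h])    # equals p * |a1 + a2 * exp(2*pi*1j*eps*(w2 - w1))|
print(np.abs(m1 / m0 - 1), ">>", p / N)   # the test (2.17) flags the collision
```

Note that by (2.16) the magnitude gap shrinks when the colliding frequencies are close (small n), so the choice of τ trades false negatives against false positives in the aliasing test.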
2.3 Algorithms

Both of our algorithms proceed along a similar course; in fact they differ only in the choice of the sample lengths p_ℓ. We assume that we are given access to the continuous-time signal S(t) whose Fourier coefficients we would like to determine, and further that we can sample from S at arbitrary points t in unit time. This is an appropriate model for analog signals, but not for discrete ones. In the discrete case, one could interpolate between given samples to approximate the required S-values, though we have not implemented or analyzed this case. (The same assumptions hold for the algorithms in [28], while those in [20, 21, 26] are formulated purely in the discrete realm.)

In this chapter we limit ourselves to the noiseless case. Though this is a highly unrealistic assumption, it permits a simple description of the underlying algorithm. In Chapter 3 we address the issue of noise specifically and give a number of modifications to our algorithms to handle noise levels of varying strengths.

2.3.1 Non-adaptive

Our algorithms start by choosing a sample length p_1 such that p_1 ≥ ck for some constant c > 1. For a fixed ε ≤ 1/N, we then compute \hat{S}_{p_1} and \hat{S}_{p_1,ε}, sort the results by magnitude, and compute frequencies ω via (2.11) for the k largest coefficients in absolute value. We then check whether or not each of those frequencies is aliased via (2.17), and if it is not, we add it to our list. The coefficient is given by the unshifted sample value \hat{S}_{p_1}[h] at that frequency. After this, we combine terms with the same frequency and prune small coefficients from the list. We then iterate until a stopping criterion is reached. In the empirical study described in section 2.5, we stopped when the number of distinct frequencies in our list equaled the desired sparsity.

Our deterministic algorithm chooses p_ℓ to be the ℓ-th prime greater than ck. This ensures that all sample lengths are co-prime, at the expense of taking slightly more samples than necessary. By Lemma 2.2.6, 1 + (k − 1) log_{ck} N such p_ℓ's suffice to isolate every ω_j at least once. This gives us worst-case sampling and runtime complexity on the same order as [28], though the results in section 2.5 indicate that on average we significantly outperform those pessimistic bounds.

Our Las Vegas algorithm chooses p_ℓ uniformly at random from the interval [c_1 k, c_2 k] for constants 1 < c_1 < c_2. In this case we cannot make a worst-case guarantee on the number of iterations needed by the algorithm to converge. However, the results in section 2.5 indicate that the Las Vegas version performs similarly to the deterministic version on the class of signals tested.

2.3.2 Adaptive

The algorithms can also be implemented in an adaptive fashion, by which we mean that the size of the current representation is taken into account in subsequent iterations. In particular, if R is our current representation, we let k* = k − |R| and choose the next p_ℓ with respect to k* instead of k. Moreover, before taking DFTs, we subtract off the contribution from the current representation, so that effort is not expended re-identifying portions of the spectrum already discovered. This idea is similar to that in [20, 21], though in our empirical studies the evaluation of the representation is done directly, rather than as an unequally-spaced FFT. This gives our algorithms asymptotically slower runtime, but the effect is negligible for the values of k studied in section 2.5. A formal description appears below in Algorithm 1.

2.4 Average-case analysis

In this section we prove that the average-case runtime and sampling complexity of our algorithm are Θ(k log k) and Θ(k), respectively.
This is shown over a class of random signals described in section 2.4.2. Before giving this result on the expected runtime and sampling complexity, in section 2.4.1 we estimate the cost of a single iteration of the while loop in Algorithm 1, lines 5–25. We then describe in section 2.4.2 the random signal model over which we prove our average-case bounds. In section 2.4.3 we prove that the expected number of iterations of the while loop is constant, and in section 2.4.4 we use this result to prove our average-case bounds.

Algorithm 1 Phaseshift
 1: Input: function pointer S; integers c_1, c_2, k, N; real ε
 2: Output: R, a sparse representation for S
 3: R ← ∅, ε_0 ← 0, ε_1 ← ε, ℓ ← 1
 4: while |R| < k do
 5:   k* ← k − |R|   {or k if non-adaptive}
 6:   p_ℓ ← first prime ≥ c_1 k*   {or Uniform(c_1 k*, c_2 k*) if Las Vegas}
 7:   for m = 0 to 1 do
 8:     for j = 0 to p_ℓ − 1 do
 9:       S_{ℓ,m}[j] ← S(j/p_ℓ + ε_m)
10:       S_rep[j] ← Σ_{(ω, c_ω) ∈ R} c_ω e^{2πiω(j/p_ℓ + ε_m)}   {omit if non-adaptive}
11:     end for
12:     Ŝ_{ℓ,m} ← FFT(S_{ℓ,m} − S_rep)
13:     Ŝ_{ℓ,m}^{sort} ← Sort(Ŝ_{ℓ,m})
14:     for j = 1 to k* do
15:       ω_{j,ℓ} ← (1/2πε) Arg( Ŝ_{ℓ,1}^{sort}[j] / Ŝ_{ℓ,0}^{sort}[j] )
16:     end for
17:   end for
18:   for j = 1 to k* do
19:     if | |Ŝ_{ℓ,0}^{sort}[j]| / |Ŝ_{ℓ,1}^{sort}[j]| − 1 | < p_ℓ/N then
20:       R ← R ∪ { (ω_{j,ℓ}, Ŝ_{ℓ,0}[ω_{j,ℓ}]) }
21:     end if
22:   end for
23:   collect terms in R with same ω
24:   prune small coefficients from R
25:   ℓ ← ℓ + 1
26: end while

2.4.1 While loop runtime and sampling complexity

The computational cost of the while loop in Algorithm 1, lines 5–25, is dominated by three operations. The first is the evaluation of the current representation R of k − k* terms at the O(k*) points j/p_ℓ in line 10. In our implementation, we simply calculated this directly, looping over both the sample points and the terms in the representation. The complexity of this implementation is O(p_ℓ (k − k*)) = O(k* (k − k*)) = O(k²), and while non-equispaced fast Fourier transforms [18, 4] yield an asymptotically faster runtime of O(k log k), they also incur large overhead costs. For the values of k considered in this dissertation, the direct evaluation seems to have little effect on the overall runtime.

The other two dominant computational tasks in the inner loop are the FFTs of O(k) samples and the subsequent sorting of these DFT coefficients. It is well-known that both of these operations can be done in time Θ(k log k) [13]. Thus the inner loop has overall time complexity Θ(k log k), assuming the use of non-equispaced FFTs.

2.4.2 Random signal model

For both the average-case analysis and for the empirical evaluation described in section 2.5 we considered test signals with frequencies drawn uniformly over the bandwidth and coefficients chosen uniformly from the complex unit circle. In other words, given k and N, we choose k frequencies ω_j uniformly at random (without replacement) from [−N/2, N/2) ∩ Z. The corresponding Fourier coefficients a_j are of the form e^{2πiθ_j}, where θ_j is drawn uniformly from [0, 1). The signal is then given by

S(t) = \sum_{j=1}^{k} a_j e^{2\pi i \omega_j t}.   (2.18)

This is the standard signal model considered in previous empirical evaluations of sub-linear Fourier algorithms [30, 28, 26].
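For reference, the following sketch draws a test signal from this model (the function and variable names are ours, not part of any reference implementation):

```python
import numpy as np

def random_sparse_signal(k, N, rng=None):
    """Draw a test signal of the form (2.18): k frequencies chosen uniformly
    without replacement from [-N/2, N/2), with unit-modulus coefficients."""
    rng = rng or np.random.default_rng()
    omegas = rng.choice(np.arange(-N // 2, N // 2), size=k, replace=False)
    coeffs = np.exp(2j * np.pi * rng.random(k))    # a_j = e^{2 pi i theta_j}

    def S(t):
        t = np.atleast_1d(t)
        return (coeffs[:, None] * np.exp(2j * np.pi * np.outer(omegas, t))).sum(axis=0)

    return S, omegas, coeffs
```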
2.4.3 Markov analysis of collisions

In order to analyze the expected runtime and sampling complexity of our algorithms, we must estimate the expected number of collisions among frequencies modulo the sample lengths used by the algorithms. Recall that in the noiseless case, our algorithms are able to detect when a collision between two or more frequencies has occurred, and for those that are not aliased we are able to calculate the value of the frequency. Thus we seek to estimate the expected fraction of frequencies that are aliased modulo a given sample length p, since this determines how many passes the algorithm makes. In this section we derive bounds on the expected value of this quantity.

In the random signal model considered in section 2.4.2, we assume the k frequencies are uniformly distributed over the bandwidth [−N/2, N/2), and so the residues ω_j mod p are also uniformly distributed over [0, p − 1] ∩ Z. Our problem then becomes a classical occupancy problem: the number of collisions among the frequencies is equivalent to the number of multiple-occupancy bins when k balls are thrown uniformly at random into p bins.

Define X_m to be the number of single-occupancy bins after m balls are thrown, Y_m to be the number of multiple-occupancy bins after m balls are thrown, and Z_m to be the number of zero-occupancy bins after m balls are thrown. Since p is constant, we have the trivial relationship Z_m = p − X_m − Y_m, so it suffices to consider only the pair (X_m, Y_m). When the (m + 1)-st ball is thrown, we have the following possibilities:

- it lands in an unoccupied bucket, with probability Z_m/p = 1 − (X_m + Y_m)/p;
- it lands in a single-occupancy bucket, with probability X_m/p;
- it lands in a multiple-occupancy bucket, with probability Y_m/p.

In the first case, we have X_{m+1} = X_m + 1, Y_{m+1} = Y_m; in the second case, we have X_{m+1} = X_m − 1, Y_{m+1} = Y_m + 1; and in the third case, we have X_{m+1} = X_m, Y_{m+1} = Y_m. Conditioning on the values of X_m, Y_m, we have

E\begin{bmatrix} X_{m+1} \\ Y_{m+1} \end{bmatrix} = \begin{bmatrix} 1 - 2/p & -1/p \\ 1/p & 1 \end{bmatrix} \begin{bmatrix} X_m \\ Y_m \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \end{bmatrix},   (2.19)

so that the system forms a Markov chain. By recursively conditioning on the values of X_{m−1}, Y_{m−1}, we can calculate the expected values of X_k, Y_k for any k > 0 using the initial condition X_1 = 1, Y_1 = 0. Denoting by A the matrix in the right-hand side of equation (2.19), we have

E\begin{bmatrix} X_k \\ Y_k \end{bmatrix} = \sum_{m=0}^{k-1} A^m \begin{bmatrix} 1 \\ 0 \end{bmatrix}.   (2.20)

Since ρ(A) = 1 − 1/p < 1, where ρ is the spectral radius, the geometric matrix series can be written

\sum_{m=0}^{k-1} A^m = (I - A)^{-1} (I - A^k).   (2.21)

After some linear algebra, we obtain

E\begin{bmatrix} X_k \\ Y_k \end{bmatrix} = \begin{bmatrix} k (1 - 1/p)^{k-1} \\ p (1 - (1 - 1/p)^k) - k (1 - 1/p)^{k-1} \end{bmatrix}.   (2.22)

Since Z_k = p − X_k − Y_k, we have E[Z_k] = p (1 − 1/p)^k.

In our algorithms we choose p = ck for some small integer c. Using this and the approximation (1 + x/n)^n ≈ e^x, we have

E\begin{bmatrix} X_k \\ Y_k \end{bmatrix} \approx \begin{bmatrix} k e^{-1/c} \\ ck (1 - e^{-1/c}) - k e^{-1/c} \end{bmatrix}.   (2.23)

This gives a nonlinear equation for the expected number of collisions among k frequencies as a function of the parameter c. Newton's method can then be used to determine the value c required to ensure a desired fraction of the frequencies are not aliased. For example, to ensure that 90% of frequencies are isolated on average, it suffices to take c = 5; this value for the parameter c had already been found to give good performance in our empirical evaluation of the algorithms.
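The recursion (2.19) and the closed form (2.22) are straightforward to sanity-check numerically. The sketch below (with illustrative values of k and c) iterates the expectation update once per ball and compares the result against (2.22):

```python
import numpy as np

k, c = 64, 5
p = c * k

A = np.array([[1 - 2 / p, -1 / p],
              [1 / p,      1.0 ]])
b = np.array([1.0, 0.0])

v = np.zeros(2)              # (E[X_0], E[Y_0]) = (0, 0)
for _ in range(k):           # one update of (2.19) per ball thrown
    v = A @ v + b

EX = k * (1 - 1 / p) ** (k - 1)
EY = p * (1 - (1 - 1 / p) ** k) - EX
print(v)                     # iterated expectations...
print(EX, EY)                # ...match the closed form (2.22)
print(v[1] / k)              # expected fraction of multiple-occupancy bins
```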
2.4.4 Average-case runtime and sampling complexity

In this section we will use a probabilistic recurrence relation due to Karp [31, 17] to give average-case performance bounds and concentration results for the case when the algorithm is halted after identifying k or more terms. In particular, we use the following theorem for recurrences of the form

T(k) = a(k) + T(H(k)),   (2.24)

where T(k) denotes the time required to solve an instance of size k, a(k) is the amount of work done on a problem of size k, and 0 ≤ H(k) ≤ k is a random variable denoting the size of the subproblem generated by the algorithm.

Theorem 2.4.1. [31, Theorem 1.2] Suppose a(k) is nondecreasing, continuous, and strictly increasing on {x : a(x) > 0}, and that E[H(k)] ≤ m(k) for a nondecreasing continuous function m(k) such that m(k)/k is also nondecreasing. Denote by u(k) the solution to the deterministic recurrence

u(k) = a(k) + u(m(k)).   (2.25)

Then for k > 0 and t ∈ N,

P[T(k) > u(k) + t\, a(k)] \le \Big( \frac{m(k)}{k} \Big)^t.   (2.26)

In section 2.4.1 we showed that our algorithm does work a(k) = Θ(k log k) on input of size k, while in section 2.4.3 we showed that it generates a subproblem whose average size is m(k) = k/10. (Recall that with the parameter c = 5, on average over 90% of the frequencies were not aliased modulo p = O(ck).) The associated deterministic recurrence is then

u(k) = Θ(k log k) + u(k/10),   (2.27)

whose solution is u(k) = Θ(k log k) (see, e.g., [13]). A straightforward application of Theorem 2.4.1 yields

P[T(k) > Θ(k log k) + t k log k] \le 10^{-t},   (2.28)

so that the runtime is tightly concentrated about its mean Θ(k log k).

The sampling complexity S(k) can be handled in an analogous manner, since in this case a(k) = Θ(k) and m(k) = k/10 as before. The associated deterministic recurrence becomes

u(k) = Θ(k) + u(k/10),   (2.29)

whose solution is u(k) = Θ(k). Applying Theorem 2.4.1 again we have

P[S(k) > Θ(k) + tk] \le 10^{-t},   (2.30)

so that we again have tight concentration around the mean Θ(k).
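As a quick numerical illustration of the deterministic recurrence (2.27), the short script below (constants illustrative) unrolls u(k) = k log₂ k + u(k/10) and confirms that u(k)/(k log₂ k) remains bounded, consistent with u(k) = Θ(k log k):

```python
import math

def u(k, shrink=10):
    """Unroll the deterministic recurrence u(k) = a(k) + u(k/shrink)
    with a(k) = k*log2(k), cf. (2.27)."""
    total = 0.0
    while k > 1:
        total += k * math.log2(k)
        k /= shrink
    return total

for k in [2**10, 2**14, 2**18]:
    print(k, u(k), u(k) / (k * math.log2(k)))   # the ratio stays bounded
```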
2.5 Empirical Evaluation

In this section we describe the results of an empirical evaluation of the adaptive deterministic and Las Vegas variants of the Phaseshift algorithm described above. Both algorithms were implemented in C++ using FFTW 3.0 [19] for the FFTs, using FFTW_ESTIMATE plans since the sample lengths are not known in advance for the Las Vegas variant. For comparison we also ran the same tests on the four variants of GFFT as well as on AAFFT and FFTW itself. The FFTW runs utilized FFTW_PATIENT plans with wisdom enabled, and so are highly optimized. The experiments were run on a single core of an Intel Xeon E5620 CPU with a clock speed of 2.4 GHz and 24 GB of RAM, running SUSE Linux with kernel 2.6.16.60-0.81.2smp for x86_64. All code was compiled with the Intel compiler using the -fast optimization. As in [29], timing is reported in CPU ticks using the cycle.h file included with the source code for FFTW.

In the following sections we refer to our algorithm as "Phaseshift", since by taking shifted time samples of the input signal we also shift the phase of the Fourier coefficients. To keep the plots readable, we only show data for the adaptive, deterministic variant of our algorithm; the other variants perform similarly. The algorithms of [29] are denoted GFFT-XY, where X ∈ {D,R} and Y ∈ {F,S}. The D/R stands for deterministic or randomized, while the F/S stands for fast or slow. The fast variants use more samples but less runtime, while the slow variants use fewer samples but more runtime. In the plots below, we always show the GFFT variant with the most favorable sampling or runtime complexity. Finally, AAFFT denotes the algorithm of [21]. The implementations tested are summarized in Table 2.1 along with their average-case sampling and runtime complexities and the associated references.

Table 2.1: Implementations used in the empirical evaluation.

    Algorithm   R/D   Samples        Runtime        Reference
    PS-Det      D     k              k log k        Section 2.4
    PS-LV       R     k              k log k        Section 2.4
    GFFT-DF     D     k^2 log^4 N    k^2 log^4 N    [29]
    GFFT-DS     D     k^2 log^2 N    N k log^2 N    [29]
    GFFT-RF     R     k log^4 N      k log^5 N      [29]
    GFFT-RS     R     k log^2 N      N log N        [29]
    AAFFT       R     k log^c N      k log^c N      [21]
    FFTW        D     N              N log N        [19]

2.5.1 Setup

Each data point in Figures 2.1–2.2 is the average of 100 independent trials of the associated algorithm for the given values of the bandwidth N and the sparsity k. The lower and upper bars associated with each data point represent the minimum and maximum number of samples or runtime of the algorithm over the 100 test functions. The values of k tested were 2, 4, 8, ..., 4096, while the values of N were 2^17, 2^18, ..., 2^26. For the larger values of k, the slow GFFT variants and AAFFT took too long to complete on our hardware, so we only present partial data for these algorithms. Nevertheless, the trend seen in the plots below continues for higher values of the sparsity.

The test signals were generated according to the signal model described in section 2.4.2. The Phaseshift and deterministic GFFT variants will always recover such signals exactly. The randomized GFFT variants are Monte Carlo algorithms, and so, when they succeed, will also recover the signal exactly. AAFFT, on the other hand, is an approximation algorithm which will fail on a non-negligible set of input signals. However, for the runs depicted in Figures 2.1–2.2, AAFFT always produced an answer with ℓ2 error less than 10^{−4}. The randomized GFFT variants failed a total of 7 times out of 2200 test signals, a relatively small amount that can be reduced by parameter tuning. For the Phaseshift variants, we chose the parameters c1 = 5, c2 = 10, and took the shift ε to be 1/2N. Finally, for the randomized GFFT variants, we chose the Monte Carlo parameter to be 1.2.

2.5.2 Sampling Complexity

In Fig. 2.1(a), we compare the average number of samples of the input signal S required by each algorithm when the bandwidth N is fixed at 2^22. The sparsity of the test signal was varied from 2 to 4096 by powers of two. We can see that the Phaseshift variants require over an order of magnitude fewer samples than GFFT-RS, the GFFT variant with the lowest sampling requirements. Both Phaseshift variants also require over an order of magnitude fewer samples than AAFFT. The comparison with the deterministic GFFT variants is even starker; Phaseshift-Det requires two orders of magnitude fewer samples than GFFT-DS (not shown), and four orders of magnitude fewer samples than GFFT-DF (not shown).

In Fig. 2.2(a), we compare the average number of samples of the input signal S required by each algorithm when the sparsity k is fixed at 60. The bandwidth N was varied from 2^17 to 2^26 by powers of two. Using powers of two for the bandwidth allows the best performance for both FFTW and AAFFT, though this fact is more relevant for the runtime comparisons in the following section. We can see that the Phaseshift variants require many fewer samples than all four GFFT variants as well as AAFFT and FFTW, for all values of N tested. The Phaseshift variants exhibit almost no dependence on the bandwidth for all values of N, a feature not shared by the other deterministic algorithms.

2.5.3 Runtime Complexity

In Fig. 2.1(b), we compare the average runtime of each algorithm over 100 test signals when the bandwidth N is fixed at 2^22.
The range of sparsity k considered is the same as in section 2.5.2. For all values of k the Phaseshift variants are faster than GFFT-RF (the fastest GFFT variant) and AAFFT by more than an order of magnitude. When compared to GFFT-RS (not shown), GFFT-DS (not shown), and FFTW, the difference in runtime is closer to three orders of magnitude.

In Fig. 2.2(b), we compare the average runtime of each algorithm over 100 test signals when the sparsity k is fixed at 60. The range of bandwidth considered is the same as in section 2.5.2. The Phaseshift variants are the only algorithms that outperform FFTW for all values of N tested. The other implementations tested only become competitive with the standard FFT for N ≳ 2^20, while ours are faster even for modest N.

[Figure 2.1: (a) Sampling complexity with fixed bandwidth N = 2^22 for PS-Det (blue solid line), GFFT-RS (red solid line), AAFFT (black dashed line), and FFTW (magenta dashed line). (b) Runtime complexity with fixed bandwidth N = 2^22 for PS-Det (blue solid line), GFFT-RF (red solid line), AAFFT (black dashed line), and FFTW (magenta dashed line). For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.]

[Figure 2.2: (a) Sampling complexity with fixed sparsity k = 60 for PS-Det (blue solid line), GFFT-RS (red solid line), AAFFT (black dashed line), and FFTW (magenta dashed line). (b) Runtime complexity with fixed sparsity k = 60 for PS-Det (blue solid line), GFFT-RF (red solid line), AAFFT (black dashed line), and FFTW (magenta dashed line).]

2.6 Multiple Shifts

Our algorithm requires access to samples of the input function with a very small time shift ε < 1/N, where N is the assumed bandwidth of the signal. For signals with a very large bandwidth N ≳ 10^12, this time shift is impractical to implement with current hardware. Following [47], however, it is possible to combine two or more much larger shifts ε1, ε2 to achieve the same effect.

2.6.1 Two shifts

Let r1, r2 be co-prime integers with r1 r2 ≥ N, and set εi = 1/ri for i = 1, 2. Since |ω| > ri in general, we no longer have the relationship

\[ \omega = \frac{1}{2\pi\varepsilon_i} \operatorname{Arg}\!\left( \frac{\widehat{S_{p,\varepsilon_i}}[h]}{\widehat{S_p}[h]} \right) \qquad (2.31) \]

for h ≡ ω (mod p), due to the wrap-around of the frequencies modulo ri. Rather, we consider the related quantities

\[ \sigma_i \stackrel{\mathrm{def}}{=} \frac{1}{2\pi} \operatorname{Arg}\!\left( \frac{\widehat{S_{p,\varepsilon_i}}[h]}{\widehat{S_p}[h]} \right), \qquad (2.32) \]

so that σi ∈ [−1/2, 1/2). We then have

\[ \sigma_i = \frac{\omega}{r_i} + k_i \qquad (2.33) \]

for some ki ∈ Z. Determining ω is then equivalent to determining the values ki. Note that (2.33) gives two equations, one for each index i. Solving both for ω and setting them equal gives the linear diophantine equation

\[ r_1 k_1 - r_2 k_2 = r_1 \sigma_1 - r_2 \sigma_2, \qquad (2.34) \]

which is solvable for (k1, k2) with any integer right-hand side, since r1, r2 were taken to be co-prime. (See, e.g., [38] for a basic treatment of linear diophantine equations.) Indeed, let b denote the right-hand side of (2.34), and let (x, y) be integers such that

\[ r_1 x + r_2 y = 1. \qquad (2.35) \]

(These are easily furnished by the extended Euclidean algorithm [13].)
The general solution to (2.34) is given by

\[ k_1 = xb + r_2 n, \qquad k_2 = -yb + r_1 n, \qquad (2.36) \]

for any integer n. (Note that the signs of the homogeneous terms must agree for (2.36) and (2.37) to be consistent.) Plugging the first of these into (2.33) and solving for ω, we have

\[ \omega = r_1 \sigma_1 - r_1 b x - r_1 r_2 n. \qquad (2.37) \]

Since r1 r2 ≥ N and different solutions ω differ by at least r1 r2, there is a unique n such that −N/2 ≤ ω < N/2. Knowing this, we can determine the value of ω (mod N) efficiently. Specifically, n must take one of the values

\[ \left\lfloor \frac{r_1 \sigma_1 - r_1 b x}{r_1 r_2} \right\rfloor \qquad \text{or} \qquad \left\lceil \frac{r_1 \sigma_1 - r_1 b x}{r_1 r_2} \right\rceil, \]

so we can simply check whether the first of these puts ω in the correct range, and if not take the second.

The only requirements on r1, r2 are that they be co-prime with product greater than N. In particular, we can take r1 = ⌈N^{1/2}⌉ and r2 = r1 + 1, so that the shifts required by the algorithm are of the order N^{−1/2} instead of N^{−1} when using only one ε. For N = 10^12, say, this allows clock speeds in the megahertz range instead of terahertz. In practice, one would prefer more separation between the values r1 and r2; this can easily be achieved by taking r1 to be an even number greater than N^{1/2}/α and r2 an odd prime greater than αN^{1/2} for some α > 1.
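The following self-contained C++ sketch walks through the two-shift recovery (2.34)–(2.37) in the noiseless case. To keep it free-standing we simulate the measured quantities r_i σ_i directly as the signed residues of the true frequency modulo r_i, rather than computing them from DFT ratios as in (2.32); the chosen values of N, r1, r2, and ω are arbitrary test inputs, not the parameters used in our experiments.

```cpp
#include <cmath>
#include <cstdio>

// Extended Euclidean algorithm: returns gcd(a, b) and x, y with a*x + b*y = gcd.
long long ext_gcd(long long a, long long b, long long &x, long long &y) {
    if (b == 0) { x = 1; y = 0; return a; }
    long long x1, y1, g = ext_gcd(b, a % b, x1, y1);
    x = y1; y = x1 - (a / b) * y1;
    return g;
}

int main() {
    const long long N = 1 << 22;
    const long long r1 = 2048, r2 = 2049;  // co-prime with r1*r2 >= N
    const long long M = r1 * r2;

    long long x, y;
    ext_gcd(r1, r2, x, y);                 // r1*x + r2*y = 1, as in (2.35)

    const long long omega = -1234567;      // "true" frequency in [-N/2, N/2)

    // Simulated measurements: r_i*sigma_i is the signed residue of omega mod r_i,
    // i.e. r_i*sigma_i = omega - k_i*r_i with sigma_i in [-1/2, 1/2), cf. (2.33).
    long long s1 = std::llround(std::remainder((double)omega, (double)r1));
    long long s2 = std::llround(std::remainder((double)omega, (double)r2));

    long long b = s1 - s2;                 // right-hand side of (2.34)
    long long w = s1 - r1 * x * b;         // omega up to a multiple of r1*r2, cf. (2.37)

    w = ((w % M) + M) % M;                 // reduce modulo r1*r2, then pick the
    if (w >= (M + 1) / 2) w -= M;          // unique representative with |w| < N/2

    std::printf("recovered %lld (true %lld)\n", w, omega);
}
```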
2.6.2 Three or more shifts

In the previous section we saw that using two shifts εi reduced the required shift size, and hence the clock rate requirements, from O(N^{−1}) to O(N^{−1/2}). We will show in this section that a similar procedure with m shifts can reduce this to O(N^{−1/m}), at the cost of solving more complicated systems than (2.34). To this end, suppose {r_i}_{i=1}^m are integers such that ∏_{i=1}^m r_i ≥ N, and define ε_i = r_i^{−1}. Using (2.32) to define σ_i, we again have a system of equations of the form (2.33) for i = 1, ..., m. Solving these for ω and summing the resulting equations gives

\[ m\omega = \sum_{i=1}^{m} r_i (\sigma_i - k_i); \qquad (2.38) \]

on the other hand, multiplying any one of the equations by m and solving for ω yields

\[ m\omega = m\, r_i (\sigma_i - k_i). \qquad (2.39) \]

Choosing i = 1 in (2.39) and combining it with (2.38) then gives the linear diophantine equation in the m variables (k1, ..., km)

\[ (m-1)\, r_1 k_1 - \sum_{i=2}^{m} r_i k_i = (m-1)\, r_1 \sigma_1 - \sum_{i=2}^{m} r_i \sigma_i. \qquad (2.40) \]

This is solvable for any integer right-hand side when gcd((m − 1)r1, r2, ..., rm) = 1.

The analogous system (2.34) with two shifts admitted an efficient method to find the unique solution (k1, k2) such that ω ∈ [−N/2, N/2). In the case of three or more shifts the problem is considerably more difficult, since there are m − 1 ≥ 2 degrees of freedom in the solution space. Moreover, distinct solutions do not give rise to values of ω separated by N or more. In fact, for any value of m one can express the solutions to (2.40) in the form

\[ \mathbf{k} = A\mathbf{z} + \mathbf{d}, \qquad (2.41) \]

where k ∈ Z^m and z ∈ Z^{m−1} are variable and A ∈ Q^{m×(m−1)} and d ∈ Z^m are fixed [43]. In addition we have the inequality constraints

\[ -\frac{N}{2 r_i} + \sigma_i < k_i \le \frac{N}{2 r_i} + \sigma_i. \qquad (2.42) \]

Let u_i, l_i denote respectively the upper and lower bounds on k_i from (2.42). Given these and the values d_i, one can consider (2.41) as a feasible integer programming problem: does there exist z ∈ Z^{m−1} such that

\[ \begin{pmatrix} A \\ -A \end{pmatrix} \mathbf{z} \le \begin{pmatrix} \mathbf{u} - \mathbf{d} \\ \mathbf{d} - \mathbf{l} \end{pmatrix} ? \qquad (2.43) \]

For fixed values of m this problem is solvable in polynomial time [34], while the general case is known to be NP-complete [43].

2.7 Conclusion

In this chapter we presented deterministic and Las Vegas algorithms for the sparse Fourier transform problem that empirically outperform existing algorithms in average-case sampling and runtime complexity. While our worst-case bounds do not improve the asymptotic complexity, we proved average-case bounds over a representative class of random signals that significantly improve the computational complexity of our algorithm with respect to its competitors. We are moreover able to extend by an order of magnitude the range of sparsity for which our algorithm is faster than FFTW in the average case.

The improved performance of our algorithm can be attributed to two major factors: adaptivity and the ability to detect aliasing. In particular, we are able to extract more information from a small number of function samples by considering the phase of the DFT coefficients in addition to their magnitudes. This represents a significant improvement over the current state of the art for the sparse Fourier transform problem. Finally, we showed that our algorithm can be implemented efficiently using two large co-prime shifts in place of one very small shift. This has significant implications in practical situations where hardware with high clock rates costs significantly more than lower-rate hardware.

Chapter 3

Multi-scale sub-linear Fourier algorithms for noisy data

3.1 Introduction

In chapter 2 we developed algorithms for approximating the discrete Fourier transform of noiseless signals using much fewer computational resources than a traditional FFT. Unfortunately, any signal recorded from real-world data will contain errors of one sort or another, due to imprecise instrumentation or noisy environments. In this chapter we develop modified versions of our noiseless algorithm to address the challenge of noisy, real-world data.

The remainder of this chapter is organized as follows. In section 3.2 we describe our noise model, discuss some of the problems associated with noisy signals, and argue that in certain applications the ℓ2 error metric is inappropriate and should be replaced with a form of Earth Mover Distance. In section 3.3 we give a simple modification of our algorithm suitable for low-level noise. The runtime complexity for this algorithm is still Θ(k log k), but with a much larger constant than in the noiseless case. We illustrate its performance with an empirical evaluation. In section 3.4 we give another modification of our algorithm which uses multiple shifts εq to determine energetic frequency values in a multi-scale manner. The resulting algorithm has runtime complexity Θ(k² log k), but is robust to higher-level noise. This is confirmed in an empirical evaluation of the two versions of our multi-scale algorithm.

3.2 Mathematical background

In this section we describe the mathematical setting for our study of sub-linear Fourier algorithms in the presence of noise. In section 3.2.1 we describe the specific noise model used in our analysis and our empirical studies, and we discuss its advantages and limitations. In section 3.2.2 we derive some fundamental results on the stability of the operations underpinning our algorithm to noise.

3.2.1 Noise model

In order to derive some analytic estimates for use in our algorithms below, we limit ourselves in this dissertation to noisy signals of a specific type. Namely, we consider i.i.d. complex gaussian noise with covariance Σ = σ²I, where I is the two-by-two identity matrix. If S(t) is a signal of the form (2.1), our discrete measurements are of the form

\[ S_p^n[j] = S\!\left(\frac{j}{p}\right) + \sigma\left(\eta_j^1 + i\,\eta_j^2\right), \qquad (3.1) \]

where η_j^i are i.i.d. standard normal random variables (mean zero, unit variance) for i = 1, 2 and j = 1, ..., p, and the superscript "n" denotes that these are noisy measurements.
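For concreteness, a minimal C++ generator of the measurements (3.1) is sketched below; the test frequencies, coefficients, and noise level are arbitrary placeholder values. (Our actual experiments additionally truncate the noise to two standard deviations, which is omitted here.)

```cpp
#include <cmath>
#include <complex>
#include <cstdio>
#include <random>
#include <vector>

// Noisy equispaced samples of a k-sparse trigonometric polynomial, per (3.1):
// S_p^n[j] = S(j/p) + sigma*(eta_j^1 + i*eta_j^2) with standard normal eta's.
int main() {
    const double PI = std::acos(-1.0);
    const int p = 128;
    const double sigma = 0.1;
    const std::vector<long long> freq = {5, -300, 1021};               // placeholder
    const std::vector<std::complex<double>> coef = {{1, 0}, {0, 1}, {0.5, 0.5}};

    std::mt19937 gen(42);
    std::normal_distribution<double> eta(0.0, 1.0);                    // mean 0, variance 1

    std::vector<std::complex<double>> samples(p);
    for (int j = 0; j < p; ++j) {
        std::complex<double> s = 0.0;
        for (std::size_t l = 0; l < freq.size(); ++l)
            s += coef[l] * std::exp(std::complex<double>(0.0, 2 * PI * freq[l] * j / (double)p));
        samples[j] = s + sigma * std::complex<double>(eta(gen), eta(gen));
    }
    std::printf("S_p^n[0] = %f + %fi\n", samples[0].real(), samples[0].imag());
}
```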
Noise of the form (3.1) has the advantage that it is amenable to analytic estimates, at the expense of perhaps being overly pessimistic in modeling real-world phenomena. To emphasize this last point, we note that since gaussian random variables are unbounded, there is always a small probability that the noise overwhelms the signal in a given measurement. This is not realistic for real-world signals, and in our empirical studies in sections 3.3.2 and 3.4.3 we truncate the noise to two standard deviations. We note further that uncorrelated white noise is only a model for real-world noise; depending on the application area, some dependence or color may be present.

3.2.2 Sensitivity to noise

As in section 2.2.1, we apply the DFT to our samples to approximate the significant Fourier coefficients of the underlying signal S(t). With noisy measurements of the form (3.1), we have

\[ \widehat{S_p^n}[h] = \sum_{j=0}^{p-1} S_p^n[j]\, e^{-2\pi i h j/p} = \sum_{j=0}^{p-1} \left( S\!\left(\frac{j}{p}\right) + \sigma(\eta_j^1 + i\eta_j^2) \right) e^{-2\pi i h j/p} = \widehat{S_p}[h] + \sigma\left(\hat\eta_h^1 + i\,\hat\eta_h^2\right). \qquad (3.2) \]

Here we have written

\[ \hat\eta_h^i \stackrel{\mathrm{def}}{=} \sum_{j=0}^{p-1} \eta_j^i\, e^{-2\pi i h j/p} \qquad (3.3) \]

to denote the DFT of the noise terms η_j^i. Note that E[η̂_h^i] = 0 and E[|η̂_h^i|²] = p for i = 1, 2 and h = 0, ..., p − 1. Since these noise terms have mean zero, we have

\[ \mathbb{E}\,\widehat{S_p^n}[h] = \widehat{S_p}[h]. \qquad (3.4) \]

Moreover, the variance of Ŝ_p^n[h] can be calculated as

\[ \operatorname{Var}\widehat{S_p^n}[h] = \mathbb{E}\left| \widehat{S_p^n}[h] - \widehat{S_p}[h] \right|^2 = \sigma^2\,\mathbb{E}\left( |\hat\eta_h^1|^2 + |\hat\eta_h^2|^2 \right) = 2\sigma^2 p. \qquad (3.5) \]

Thus a typical noisy DFT coefficient Ŝ_p^n[h] will deviate from the true value Ŝ_p[h] by an amount proportional to σ√p. Recall from section 2.2.1 that the discrete Fourier coefficients are given by Ŝ_p[h] = p·a_ℓ for ω_ℓ ≡ h (mod p). A typical noisy DFT coefficient will therefore satisfy

\[ \widehat{S_p^n}[h] = p\left( a_\ell + O\!\left( \frac{\sigma}{\sqrt{p}} \right) \right). \qquad (3.6) \]

The locations of the peaks of Ŝ_p^n[h] will therefore give the correct values of ω (mod p) as long as σ/√p ≪ |a_ℓ|. This will be the case in any reasonable setting, since in general there can be no hope of recovering Fourier coefficients whose magnitude is less than that of the noise.

Our noiseless algorithm relies crucially on the computation of the phase angle of complex numbers to determine the frequency index associated with a significant Fourier coefficient. The numbers whose phase angles we are interested in calculating are the ratios Ŝ_{p,ε}[h]/Ŝ_p[h], which must be replaced in the noisy case by Ŝ^n_{p,ε}[h]/Ŝ^n_p[h]. Suppose that ω ≡ h (mod p) is one of the frequencies appearing in (2.1), and let a be the corresponding Fourier coefficient (so that a = p^{−1} Ŝ_p[h]). Using the binomial approximation (1 + z)^α = 1 + αz + O(z²) and a typical value for the noisy DFT coefficients, we have

\[ \frac{\widehat{S^n_{p,\varepsilon}}[h]}{\widehat{S^n_p}[h]} = \frac{\widehat{S_p}[h]\, e^{2\pi i \omega\varepsilon} + O(\sigma\sqrt{p})}{\widehat{S_p}[h] + O(\sigma\sqrt{p})} = \frac{e^{2\pi i \omega\varepsilon} + O(\sigma/a\sqrt{p})}{1 + O(\sigma/a\sqrt{p})} = e^{2\pi i \omega\varepsilon} + O\!\left( \frac{\sigma}{a\sqrt{p}} \right). \qquad (3.7) \]

Thus the ratio of noisy DFT coefficients agrees with the noiseless ratio up to an error term of order σ/(|a|√p).

Given this estimate for the ratio of noisy DFT coefficients, we can derive bounds for the error in the phase angle computed via Arg(z). Note that Arg(z) is differentiable except across the negative real axis, so assume for the moment that Arg(z) is sufficiently far from ±π. Since Arg(z) is constant on rays from the origin and linear (with unit slope) along circles centered at the origin, we have

\[ |\operatorname{Arg}(z + \eta) - \operatorname{Arg}(z)| \le |\eta|, \qquad (3.8) \]

as long as −π + |η| < Arg(z) < π − |η|.
Combining the bound (3.8) with the estimate (3.7), we have

\[ \left| \operatorname{Arg}\!\left( \frac{\widehat{S^n_{p,\varepsilon}}[h]}{\widehat{S^n_p}[h]} \right) - 2\pi\omega\varepsilon \right| \le O\!\left( \frac{\sigma}{|a|\sqrt{p}} \right); \qquad (3.9) \]

equivalently, if \( \tilde\omega = \frac{1}{2\pi\varepsilon}\operatorname{Arg}\!\left( \widehat{S^n_{p,\varepsilon}}[h] / \widehat{S^n_p}[h] \right) \) is our noisy frequency estimate, we have

\[ |\tilde\omega - \omega| \le O\!\left( \frac{\sigma}{2\pi\varepsilon |a| \sqrt{p}} \right). \qquad (3.10) \]

For fixed ε and |a|, we can see from this last inequality that the ratio σ/√p is critical in determining the sensitivity of our phase calculation to noise. In the remainder of this chapter we examine how the choices of p and ε affect the quality of the frequency approximation in our algorithm.

3.2.3 Earth mover distance

In the existing literature on the sparse Fourier transform, the ℓ2 norm is most often used to assess the quality of approximation. There are many reasons for this choice, the two most convincing perhaps being the completeness of the complex exponentials with respect to the ℓ2 norm and Parseval's theorem. For certain applications, however, this choice of norm is inappropriate. For example, in wide-band spectral estimation and radar applications, one is interested in identifying a set of frequency intervals containing active Fourier modes. In this case, an estimate ω̃ of the true frequency ω with |ω̃ − ω| ≪ N is useful, but unless ω̃ = ω the ℓ2 metric will report an O(1) error.

For these reasons, we propose measuring the approximation error of sparse Fourier transform problems with the Earth Mover Distance (EMD) [40]. Originally developed in the context of content-based image retrieval, EMD measures the minimum cost that must be paid (with a user-specified cost function) to transform one distribution of points into another. EMD can be calculated efficiently as the solution of a linear program corresponding to a certain flow-minimization problem. In our situation, we consider the cost to move a set of k estimated Fourier modes and coefficients (ω̃_j, c̃_{ω̃_j})_{j=1}^k to the true values (ω_j, c_{ω_j})_{j=1}^k under the cost function

\[ d_1\!\left( (\tilde\omega, \tilde c_{\omega}), (\omega, c_{\omega});\, N \right) \stackrel{\mathrm{def}}{=} \frac{|\tilde\omega - \omega|}{N} + |\tilde c_{\omega} - c_{\omega}|. \qquad (3.11) \]

This choice of cost function strikes a balance between the fidelity of the frequency estimate (as a fraction of the bandwidth) and that of the coefficient estimate. We denote the EMD using d1 as the cost function by EMD(1) in our empirical studies in sections 3.3.2 and 3.4.3 below.

3.3 A minor modification

In this section we give a simple modification of our noiseless algorithm to improve the robustness of the frequency estimation in the presence of noise. In section 3.3.1 we describe the modification and study its performance as a function of the noise and the sampling rate p. In section 3.3.2 we report the results of an empirical evaluation of the accuracy of the modified algorithm as a function of the noise level.

3.3.1 Rounding

As noted in section 3.2.2, the frequency reconstruction from noisy measurements is correct up to an error term of size O(σ/(2πε|a|√p)). Moreover, the locations of the peaks in the noisy DFT are robust to noise, so that the values of ω (mod p) given by the indices of the k largest coefficients in Ŝ^n_p[h] can be taken as exact. By combining these two measures we can more reliably estimate the frequency indices of significant coefficients.

Our proposed modification is therefore to simply round the noisy frequency estimate \( \tilde\omega = \frac{1}{2\pi\varepsilon}\operatorname{Arg}\!\left( \widehat{S^n_{p,\varepsilon}}[h] / \widehat{S^n_p}[h] \right) \) to the nearest integer of the form np + h. This improved estimate ω̄ is given by

\[ \bar\omega = p \cdot \operatorname{round}\!\left( \frac{\tilde\omega - h}{p} \right) + h, \qquad (3.12) \]

where round(x) returns the nearest integer to x.
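The rounding step (3.12) amounts to one line of integer arithmetic; the sketch below applies it to a synthetic noisy estimate. All numeric values are illustrative, and the residue h is taken as exact, as discussed above.

```cpp
#include <cmath>
#include <cstdio>

// Rounding step (3.12): snap a noisy phase-based frequency estimate to the
// nearest integer congruent to h modulo p.
long long round_to_residue(double omega_noisy, long long h, long long p) {
    return p * std::llround((omega_noisy - (double)h) / (double)p) + h;
}

int main() {
    const long long p = 320, h = 137;              // h = omega mod p, assumed exact
    const long long omega_true = 42 * p + h;       // 13577
    const double omega_noisy = omega_true + 83.7;  // phase error smaller than p/2
    std::printf("recovered %lld (true %lld)\n",
                round_to_residue(omega_noisy, h, p), omega_true);
}
```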
For low levels of noise this modification will return the true value ω, while for larger noise levels it is possible that ω̃ deviates by more than p/2 from the true frequency ω. In this case the estimate ω̄ will be wrong by a multiple of p. Choosing larger values of p (i.e., increasing the parameter c1) will help to reduce the likelihood of an error in frequency estimation. Furthermore, to ensure that the estimated frequencies are sufficiently far from the branch cut of Arg(·) along the negative real axis, we take the shift ε ≤ 1/2N. The estimated frequencies then satisfy −N ≤ ω̃ < N, so that deviations due to the noise will not push the estimates across the discontinuity.

We saw in the previous section that the error in the phase estimation is proportional to σp^{−1/2}, so we would expect similar behavior for the improved estimate ω̄. To test this, we generated 100 frequencies uniformly at random from [−N/2, N/2) with N = 2^22 for a range of parameters (σ, p), and set the corresponding coefficient to unity. Thus our test signals for this empirical trial were one-term trigonometric polynomials. We then reconstructed the frequencies in two ways: first, simply using the formula (2.11), and second, by combining this estimate with the rounding procedure (3.12). In figure 3.1 we plot the average phase error in logarithmic scale as a function of both σ and p, which were varied from 0.001 to 0.512 and from 10 to 5120, respectively, by powers of two.

In the plot on the left, which corresponds to reconstruction using only (2.11), we can clearly see the contours of constant phase error obeying the relationship log2(p) = 2 log2(σ) + c for some constant c. This confirms our analytic estimate from section 3.2.2 that the phase error is proportional to σ/√p. In the plot on the right, which corresponds to the improved reconstruction using (3.12), we can see that for large values of σ and small values of p the same relationship holds. However, for smaller σ and larger p we see an abrupt transition to exact reconstruction (the white area in the upper-left). The boundary of this region appears to follow the relationship log2(p) = (2/3) log2(σ) + c, indicating that for small enough values of the ratio σ/p^{3/2} the rounding procedure is exact.

[Figure 3.1: (left) Mean phase error without rounding (shaded attribute) vs. noise level σ and sample length p, in logarithmic scale. (right) Mean phase error with rounding (shaded attribute) vs. noise level σ and sample length p, in logarithmic scale. The bandwidth for both plots is N = 2^22.]

3.3.2 Empirical evaluation

In this section we describe the results of an empirical evaluation of the algorithm with the rounding procedure (3.12). This modification does not change the runtime or sampling complexity significantly, so we focus here on the error in the approximation as a function of the noise level σ and the parameter c1.

[Figure 3.2: EMD(1) error as a function of the noise level σ with the sparsity and bandwidth fixed at k = 64 and N = 2^22, respectively. Different values of the parameter c1 are shown: c1 = 4 (blue solid line), c1 = 16 (black dashed line), c1 = 64 (red solid line), and c1 = 256 (magenta dashed line).]
In figure 3.2 we report the average EMD(1) error over 100 test signals as a function of the input noise level σ (note that σ is not given in dB), for various choices of the parameter c1. The test signals were generated in the same manner as in section 2.4.2: k-term trigonometric polynomials with frequencies drawn uniformly at random from [−N/2, N/2) and coefficients drawn uniformly from the complex unit circle, with complex gaussian noise of the form (3.1) added to each measurement. In this experiment, the sparsity and bandwidth are fixed at k = 64 and N = 2^22, respectively. As expected, the error decreases as c1 increases, since the rounding procedure is more likely to result in the true frequency. Moreover, for larger values of c1 the error increases linearly with the noise level, indicating the procedure's robustness in the presence of noise. For smaller values of c1 the EMD(1) error is saturated by the frequency error, so that there is only a mild dependence on σ.

We remark that in the noiseless case the choice c1 = 5 was found to be sufficient, while figure 3.2 indicates that the much larger value c1 ≈ 256 is necessary for good approximation in the EMD(1) metric. The larger sample lengths imply an increase in both the runtime and sampling complexity, and indicate that the rounding procedure should be complemented by other modifications. In section 3.4 we combine the rounding procedure with the use of larger shifts εq in a multi-scale approach to frequency estimation that is robust to higher noise levels even at coarse sampling rates (i.e., small values of c1).

3.4 Multi-scale methods

In this section we describe two approaches to multi-scale frequency estimation using multiple shifts εq. These algorithms exhibit a form of error correction in their estimation of significant frequency values, which allows them to use much coarser sampling rates than the simple rounding method described in section 3.3 to achieve a desired error. This comes, however, at the cost of increased average-case runtime and sampling complexities of Θ(k² log(k) log(N)) and Θ(k² log(N)), respectively.

We note that the algorithms presented below, while simple to describe and implement, are not optimal in terms of computational complexity: the authors of [25] have recently given a Θ(k log(N) log(N/k))-time algorithm for the general (noisy) case. That algorithm, however, is considerably more complicated than ours, and moreover has not yet been shown to empirically outperform standard FFT implementations.

In section 3.4.1 we give some background on our multi-scale methods and introduce the main idea of our algorithms. In section 3.4.2 we describe two variations on the basic multi-scale algorithm, and in section 3.4.3 we report the results of an empirical evaluation of the algorithms.

3.4.1 Preliminaries

In both the noiseless algorithm and the modified version given by rounding the frequencies to the nearest integers with the correct remainder modulo p, we required that ε ≤ 1/N. This was due to the fact that the range of the principal argument function Arg(z) is [−π, π), so to identify a frequency bandlimited to [−N/2, N/2) without any wrap-around aliasing we needed to ensure that 2πεω ∈ [−π, π). Moreover, we noted in remark 2.2.1 that, given an estimate of the magnitude of a particular frequency, |ω| < L, taking ε < 1/(2L) suffices to recover ω without wrap-around aliasing. This is the key observation for our multi-scale algorithms, which build up an estimate for ω in an iterative fashion.
By increasing the shift ε while simultaneously demodulating by our current frequency estimate, we obtain a new estimate on a finer scale that allows us to correct for errors introduced in previous iterations.

3.4.2 Algorithms

Let us describe the general architecture of these algorithms in some detail. We begin by choosing an appropriate p and ε < 1/2N and calculating our initial frequency estimates ω_j^0, 1 ≤ j ≤ k*, as in section 3.3. Thus for ω_j ≡ h (mod p) we have

\[ \omega_j^0 = p \cdot \operatorname{round}\!\left( \frac{1}{2\pi\varepsilon p}\operatorname{Arg}\!\left( \frac{\widehat{S^n_{p,\varepsilon}}[h]}{\widehat{S^n_p}[h]} \right) - \frac{h}{p} \right) + h, \qquad (3.13) \]

which will in general differ from the true frequency ω_j by approximately σ/(2πε√p). For an appropriate choice of p, the difference between the estimated and true frequencies will satisfy

\[ |\omega_j^0 - \omega_j| \ll |\omega_j|. \qquad (3.14) \]

We then begin an iterative frequency estimation procedure that tries to correct for the error in phase. For each of the k* estimated frequencies, we increase the shift ε (say by a multiplicative factor of 2^B, where B is a parameter) and demodulate the input by our first estimate ω_j^0. That is, for 1 ≤ j ≤ k* we collect the samples

\[ S_p^n[q] = e^{-2\pi i \omega_j^0 q/p}\, S\!\left( \frac{q}{p} \right), \qquad (3.15) \]

\[ S_{p,\varepsilon}^n[q] = e^{-2\pi i \omega_j^0 (q/p + \varepsilon 2^B)}\, S\!\left( \frac{q}{p} + \varepsilon 2^B \right), \qquad (3.16) \]

and perform DFTs on them. Note that by elementary properties of Fourier series, the peak location h = ω_j (mod p) will be shifted down by ω_j^0 (mod p). From (3.13) we see that ω_j^0 (mod p) = h, so that the peak corresponding to ω_j will appear at the zero frequency in the demodulated signal.

Moreover, the phase difference between the shifted and unshifted DFT values at the zero frequency will be proportional to ω_j − ω_j^0, and we can estimate this phase error using the principal argument function as before. We therefore have the improved estimate

\[ \omega_j^1 = \omega_j^0 + p \cdot \operatorname{round}\!\left( \frac{1}{2\pi\varepsilon 2^B p}\operatorname{Arg}\!\left( \frac{\widehat{S^n_{p,\varepsilon}}[0]}{\widehat{S^n_p}[0]} \right) \right). \qquad (3.17) \]

Note that increasing the shift by a factor of 2^B ensures that the correction term will lie in the range \(\left[ -\frac{1}{2\varepsilon 2^B}, \frac{1}{2\varepsilon 2^B} \right)\), and will differ from the true error ω_j − ω_j^0 by approximately σ2^{−B}/(2πε√p). Thus the correction term operates on scales a factor of 2^B smaller than the original estimate.

The algorithms then iterate this procedure: demodulate the input samples by the previous frequency estimate ω_j^{i−1}, increase the shift by a factor of 2^B, and estimate the correction term as in (3.17). In this way the estimates "zoom in" on the correct frequency value in a multi-scale manner. Since we must demodulate by different amounts for each of the k frequencies, one iteration of this procedure requires Θ(k²) samples and Θ(k² log k) time. Note that each iteration of the frequency estimation procedure provides a new estimate of a significant frequency's corresponding coefficient in Ŝ^n_p[0]. We use this fact to improve the robustness of our coefficient estimation by taking the median of the real and imaginary parts of the estimates.

Our two versions differ in how the shifts are increased, which in turn affects how many iterations of the estimation procedure are performed. In the "non-adaptive shift" algorithm, B is a user-specified constant, so the number of iterations in the non-adaptive shift algorithm is ⌈log2(N)/B⌉. The "adaptive shift" algorithm uses the size of the previous correction term to estimate an appropriate value for ε_j^i, the shift used in the i-th iteration on the j-th frequency. Specifically, we set

\[ \varepsilon_j^i = \begin{cases} \dfrac{2^{\alpha + \log_2 |\omega_j^{i-1}|}}{N} & \text{if } \omega_j^{i-1} \neq 0, \\[2ex] \dfrac{2^{\alpha + \log_2(p)}}{N} & \text{otherwise.} \end{cases} \qquad (3.18) \]

The "safety parameter" α ≥ 1 guards against underestimating the magnitude of the true error.
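Before describing the stopping criteria, we sketch the core of the non-adaptive correction loop (3.17) in self-contained C++. The DFT ratio is replaced by its ideal value 2πε(ω − estimate) plus a crude deterministic stand-in for the phase noise, so only the multi-scale update logic is illustrated; all constants are placeholder values rather than those of our implementation.

```cpp
#include <cmath>
#include <cstdio>

// Wrap an angle into [-pi, pi), mimicking the range of Arg(.).
double wrap(double t) {
    const double PI = std::acos(-1.0);
    return t - 2 * PI * std::floor((t + PI) / (2 * PI));
}

int main() {
    const double PI = std::acos(-1.0);
    const long long N = 1 << 22, p = 320, h = 137;
    const int B = 4;                           // shift grows by 2^B per iteration
    const long long omega = 3857 * p + h;      // true frequency
    long long est = omega - 12 * p;            // initial estimate, off by 12p

    double eps = 1.0 / (2.0 * N);
    for (int it = 1; it * B <= 22; ++it) {     // roughly log2(N)/B iterations
        eps *= 1 << B;
        // Ideal phase of the demodulated ratio in (3.17), plus mock phase noise.
        double theta = wrap(2 * PI * eps * (omega - est) + 0.02 * std::sin(7.0 * it));
        est += p * std::llround(theta / (2 * PI * eps * p));  // correction: a multiple of p
        std::printf("iteration %d: estimate %lld (true %lld)\n", it, est, omega);
    }
}
```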
In this adaptive version we stop the estimation procedure when the correction term is zero in two consecutive iterations for all frequencies. While we cannot make an analytic claim on the number of iterations the adaptive shift algorithm makes, in practice it performs similarly to the non-adaptive version.

Finally, we note that for small values of the sparsity k, only requiring p ≥ c1 k would allow for very coarse sampling rates. As we saw in section 3.2.2, the frequency error is proportional to σ/(ε√p). If p is too small, this phase error could be very large, and it may not be the case that (3.14) holds. To guard against this we require in addition that p ≥ cσ²2^{2B}, where the constant c is determined from a fit to the data in figure 3.1 (left).

3.4.3 Empirical evaluation

To test both versions of our multi-scale algorithms, we performed an empirical evaluation with test signals identical to those in section 3.3.2. We varied the noise level and the parameter c1 over the same range of values, and set B = 4 in the non-adaptive shift algorithm and α = 1 in the adaptive shift algorithm. In figure 3.3 we report the average EMD(1) error as a function of the noise level σ for different choices of the parameter c1. The sparsity k = 64 and bandwidth N = 2^22 are fixed, as in figure 3.2. Note that for both the non-adaptive shift and adaptive shift algorithms the error is well-behaved for all values of c1, whereas we had to take c1 ≈ 256 to achieve similar behavior when using the rounding-only algorithm of section 3.3. Thus with our multi-scale algorithms we can achieve a prescribed error tolerance using much coarser samplings.

In figure 3.4 we report the number of samples and the runtime required for each of our three algorithms to approximate the true spectrum to within σ/√k in the EMD(1) metric. In these plots, the bandwidth N = 2^22 and noise level σ = 0.512 are fixed, but the parameter c1 varies for each data point. As seen in figures 3.2 and 3.3, the rounding-only algorithm generally needs a larger c1 value to achieve the same error as the multi-scale algorithms. To make a fair comparison, we find the smallest c1 that achieves the error bound σ/√k and plot the corresponding number of samples and runtime.

[Figure 3.3: EMD(1) error as a function of the noise level σ with the sparsity and bandwidth fixed at k = 64 and N = 2^22, respectively. Different values of the parameter c1 are shown: c1 = 4 (blue solid line), c1 = 16 (black dashed line), c1 = 64 (red solid line), and c1 = 256 (magenta dashed line). (a) Non-adaptive shift algorithm (B = 4); (b) adaptive shift algorithm (α = 1).]

From figure 3.4(a) we can see that the rounding-only algorithm requires fewer samples to achieve the error bound than the multi-scale algorithms, even though the parameter c1 is much larger in general. From figure 3.4(b) we can see that the same holds true for the runtime. This is to be expected, since the iterative frequency estimation procedure in the multi-scale algorithms is currently a computational bottleneck.

3.5 Conclusion

In this chapter we extended the algorithm of chapter 2 to handle noisy signals.
The first modification we gave is a minor alteration to the noiseless algorithm that does not change the computational complexity significantly, though it does require a relatively fine sampling rate to produce quality approximations in the EMD(1) metric. We also gave two multi-scale algorithms that exhibit error correction in their frequency estimation procedures. While these algorithms allow the use of much coarser samplings to provide good approximations, they are slower both in theory and in practice than the minor modification.

We remark that our multi-scale algorithms are somewhat wasteful in their frequency estimation procedures, since currently only the values of the samples at the zero frequency are used to estimate the frequency error. It remains an open question whether more information can be extracted from the demodulated samples so as to reduce the complexity of the frequency estimation procedure.

[Figure 3.4: (a) Samples required to achieve EMD(1) error σ/√k, as a function of the sparsity k with fixed σ = 0.512 and N = 2^22, for the rounding-only (blue solid line), non-adaptive shift (red dash-dot line), and adaptive shift (magenta dashed line) algorithms. (b) Runtime required to achieve the same error bound, as a function of the sparsity k.]

Chapter 4

Sub-linear Fourier algorithms in multiple dimensions

4.1 Introduction

In this chapter we shift our focus from one-dimensional signals to the multidimensional case, concentrating (as in chapter 2) on the noiseless case. We give three extensions of our basic one-dimensional noiseless algorithm to the multi-dimensional setting. Two of these are a result of the separability properties of the exponential function, while the third uses a number-theoretic construction to generate an equivalent one-dimensional problem on which our original algorithm can be applied. In this chapter we show that the resulting algorithms require fewer samples of the input and execute in less CPU time than the d-dimensional FFT.

The remainder of this chapter is organized as follows. In section 4.2 we define our notation for the multidimensional case. In section 4.3, we describe a straightforward extension of our one-dimensional algorithm to multiple dimensions. The resulting algorithm is simple to describe and implement, but suffers from increased computational complexity that scales exponentially in the dimension. In order to overcome this barrier, in section 4.4 we propose an alternative method that computes the one-dimensional Fourier transform of an "unwrapped" version of the multi-dimensional signal. Finally, in section 4.5 we describe the results of an empirical evaluation of the proposed methods in two dimensions.

4.2 Notation

We denote the ambient dimension by d, change the independent variable from t to x, and use boldface to emphasize the multidimensional character of the input. We write the components of d-dimensional vectors with subscripts, so that x = (x1, ..., xd). The Euclidean inner product is written

\[ \boldsymbol\omega\cdot\mathbf{x} \stackrel{\mathrm{def}}{=} \sum_{q=1}^{d} \omega_q x_q, \qquad (4.1) \]

and we denote by

\[ Q = Q(N_1, \ldots, N_d) \stackrel{\mathrm{def}}{=} \prod_{q=1}^{d} \left[ -\frac{N_q}{2}, \frac{N_q}{2} \right) \cap \mathbb{Z}^d \qquad (4.2) \]

the d-dimensional orthotope with side lengths N1, ..., Nd centered at the origin. The letter N (without subscripts) denotes the cardinality of Q, so that

\[ N = |Q| = \prod_{q=1}^{d} N_q. \qquad (4.3) \]
Our principal objects of study in this chapter are frequency-sparse band-limited signals S : [0, 1]^d → C of the form

\[ S(\mathbf{x}) = \sum_{j=1}^{k} a_j\, e^{2\pi i \boldsymbol\omega_j\cdot\mathbf{x}}, \qquad (4.4) \]

where a_j ∈ C. In this setting, "band-limited" means that ω_j ∈ Q(N1, ..., Nd) for some integers N1, ..., Nd, and "frequency-sparse" means that k ≪ N = |Q|. The d-dimensional Fourier series Ŝ(ω) of a signal S(x) is defined analogously to (2.2) by

\[ \widehat{S}(\boldsymbol\omega) = \int_{[0,1]^d} S(\mathbf{x})\, e^{-2\pi i \boldsymbol\omega\cdot\mathbf{x}}\, d\mathbf{x}, \qquad \boldsymbol\omega \in \mathbb{Z}^d. \qquad (4.5) \]

For signals of the form (4.4) we have

\[ \widehat{S}(\boldsymbol\omega_\ell) = \int_{[0,1]^d} \left( \sum_{j=1}^{k} a_j\, e^{2\pi i \boldsymbol\omega_j\cdot\mathbf{x}} \right) e^{-2\pi i \boldsymbol\omega_\ell\cdot\mathbf{x}}\, d\mathbf{x} = \sum_{j=1}^{k} a_j \int_{[0,1]^d} e^{2\pi i (\boldsymbol\omega_j - \boldsymbol\omega_\ell)\cdot\mathbf{x}}\, d\mathbf{x}. \qquad (4.6) \]

The separability of the exponential, e^{x_1 + \cdots + x_d} = e^{x_1} \cdots e^{x_d}, implies that the integral in (4.6) factors into the product of d one-dimensional integrals of the form

\[ \int_0^1 e^{2\pi i (\omega_{j,q} - \omega_{\ell,q}) x_q}\, dx_q, \qquad q = 1, \ldots, d. \qquad (4.7) \]

Since all components of the frequency vectors ω_j and ω_ℓ are integers, each of these integrals is just the Kronecker delta δ_{jℓ}, so that Ŝ(ω_ℓ) = a_ℓ for ℓ = 1, ..., k and 0 for all other ω ∈ Q.

Given a finite d-dimensional array S of size p1 × · · · × pd, we define its discrete Fourier transform by

\[ \widehat{S}[\mathbf{h}] = \sum_{\mathbf{j}=0}^{\mathbf{p}-1} S[\mathbf{j}]\, e^{-2\pi i \mathbf{h}\cdot(\mathbf{j}/\mathbf{p})}, \qquad (4.8) \]

where p = (p1, ..., pd), the indices hq and jq range from 0 to pq − 1 for q = 1, ..., d, the division j/p = (j1/p1, ..., jd/pd) is defined elementwise, and the sum over the vector index j is nested and performed over each dimension, so that

\[ \sum_{\mathbf{j}=0}^{\mathbf{p}-1} \stackrel{\mathrm{def}}{=} \sum_{j_1=0}^{p_1-1} \cdots \sum_{j_d=0}^{p_d-1}. \qquad (4.9) \]

For p = p1 · · · pd, multi-dimensional FFT algorithms [46] permit the computation of Ŝ in O(p log p) time.

As in chapter 2, the simplest method to determine the significant frequency/coefficient pairs (ω_j, a_j)_{j=1}^k in the Fourier transform of a signal S of the form (4.4) is to sample at the Nyquist rate in each dimension and compute the full d-dimensional DFT. For a signal bandlimited to Q(N1, ..., Nd) this procedure has sampling and runtime complexities of Θ(N) and Θ(N log N), respectively, where N = |Q|. When k ≪ N this is computationally wasteful, and in the following sections we develop methods which are sub-linear in N.

4.3 Straightforward extension to multiple dimensions

In this section we describe a straightforward extension of the algorithm given in chapter 2 to multiple dimensions. Recall that in the one-dimensional case, we used the fact that the phase modulation in the DFT coefficients of time-shifted samples of the input is linear in the frequency index. In section 4.3.1 we show that the same is true in multiple dimensions. In section 4.3.2 we exploit this fact to develop an algorithm with sampling and runtime complexities of Θ(k^d) and Θ(k^d log k^d), respectively.

4.3.1 Preliminaries

As in chapter 2, we apply the d-dimensional DFT to discrete samples of S(x) to determine the significant frequency/coefficient pairs (ω_j, a_j)_{j=1}^k. Fix integers p1, ..., pd and positive real numbers ε1, ..., εd. We form d + 1 discrete d-dimensional arrays S_p, S_{p,ε1}, ..., S_{p,εd} by sampling S(x) at rate 1/pq in the q-th dimension, starting at xq = 0 and xq = εq respectively. That is, for 0 ≤ jq ≤ pq − 1,

\[ S_{\mathbf{p}}[j_1, \ldots, j_d] = S\!\left( \frac{j_1}{p_1}, \ldots, \frac{j_d}{p_d} \right), \qquad S_{\mathbf{p},\varepsilon_q}[j_1, \ldots, j_d] = S\!\left( \frac{j_1}{p_1}, \ldots, \frac{j_q}{p_q} + \varepsilon_q, \ldots, \frac{j_d}{p_d} \right). \qquad (4.10) \]

If S(x) is of the form (4.4), we have (as in chapter 2)

\[ \widehat{S_{\mathbf{p}}}[\mathbf{h}] = \sum_{\mathbf{j}=0}^{\mathbf{p}-1} S\!\left( \frac{\mathbf{j}}{\mathbf{p}} \right) e^{-2\pi i \mathbf{h}\cdot(\mathbf{j}/\mathbf{p})} = \sum_{\ell=1}^{k} a_\ell \sum_{\mathbf{j}=0}^{\mathbf{p}-1} e^{2\pi i (\boldsymbol\omega_\ell - \mathbf{h})\cdot(\mathbf{j}/\mathbf{p})} = p_1 \cdots p_d \sum_{\boldsymbol\omega_\ell \equiv \mathbf{h}\ (\mathrm{mod}\ \mathbf{p})} a_\ell. \qquad (4.11) \]
Here as before, the division of vectors j/p = (j1/p1, ..., jd/pd) is to be interpreted elementwise, and ω_ℓ ≡ h (mod p) is shorthand for the d simultaneous congruences ω_{ℓ,q} ≡ h_q (mod p_q), q = 1, ..., d. The last equality in (4.11) can be seen as follows: in expanded form, the inner sum in the second-to-last line of (4.11) reads

\[ \sum_{j_1=0}^{p_1-1} \cdots \sum_{j_d=0}^{p_d-1} e^{2\pi i \left( (\omega_{\ell,1} - h_1) j_1/p_1 + \cdots + (\omega_{\ell,d} - h_d) j_d/p_d \right)}, \qquad (4.12) \]

which can be rewritten as the product of d sums

\[ \prod_{q=1}^{d} \sum_{j_q=0}^{p_q-1} e^{2\pi i (\omega_{\ell,q} - h_q) j_q/p_q}. \qquad (4.13) \]

As in chapter 2, each of the sums in (4.13) is equal to p_q if (ω_{ℓ,q} − h_q)/p_q is integral, and zero otherwise. The product is thus equal to p1 · · · pd if ω_{ℓ,q} ≡ h_q (mod p_q) for 1 ≤ q ≤ d and zero otherwise, hence (4.11). Similarly, for the shifted samples we have

\[ \widehat{S_{\mathbf{p},\varepsilon_q}}[\mathbf{h}] = \sum_{\mathbf{j}=0}^{\mathbf{p}-1} S_{\mathbf{p},\varepsilon_q}[\mathbf{j}]\, e^{-2\pi i \mathbf{h}\cdot(\mathbf{j}/\mathbf{p})} = \sum_{\ell=1}^{k} a_\ell\, e^{2\pi i \omega_{\ell,q}\varepsilon_q} \sum_{\mathbf{j}=0}^{\mathbf{p}-1} e^{2\pi i (\boldsymbol\omega_\ell - \mathbf{h})\cdot(\mathbf{j}/\mathbf{p})} = p_1 \cdots p_d \sum_{\boldsymbol\omega_\ell \equiv \mathbf{h}\ (\mathrm{mod}\ \mathbf{p})} a_\ell\, e^{2\pi i \omega_{\ell,q}\varepsilon_q}. \qquad (4.14) \]

Assuming for the moment that the residues (ω_ℓ mod p)_{ℓ=1}^k are distinct (as d-vectors), we then have for the unshifted samples

\[ \widehat{S_{\mathbf{p}}}[\mathbf{h}] = \begin{cases} p_1 \cdots p_d\, a_\ell & \text{if } \boldsymbol\omega_\ell \equiv \mathbf{h}\ (\mathrm{mod}\ \mathbf{p}), \\ 0 & \text{otherwise}, \end{cases} \qquad (4.15) \]

while for the shifted samples we have, for 1 ≤ q ≤ d,

\[ \widehat{S_{\mathbf{p},\varepsilon_q}}[\mathbf{h}] = \begin{cases} p_1 \cdots p_d\, a_\ell\, e^{2\pi i \varepsilon_q \omega_{\ell,q}} & \text{if } \boldsymbol\omega_\ell \equiv \mathbf{h}\ (\mathrm{mod}\ \mathbf{p}), \\ 0 & \text{otherwise}. \end{cases} \qquad (4.16) \]

As in the one-dimensional case, if we take ε_q ≤ 1/N_q, we can recover ω_{ℓ,q} by noting that, for h ≡ ω_ℓ (mod p),

\[ \omega_{\ell,q} = \frac{1}{2\pi\varepsilon_q} \operatorname{Arg}\!\left( \frac{\widehat{S_{\mathbf{p},\varepsilon_q}}[\mathbf{h}]}{\widehat{S_{\mathbf{p}}}[\mathbf{h}]} \right). \qquad (4.17) \]
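A minimal illustration of the per-dimension recovery formula (4.17): for an unaliased bin, each shifted DFT value differs from the unshifted one only by the phase 2πε_q ω_{ℓ,q}, so the common factor p1 · · · pd cancels in the ratio. The C++ sketch below fabricates the two DFT values for a single frequency directly from (4.15)–(4.16) rather than running FFTs; the dimension, bandwidths, frequency, and coefficient are arbitrary test values.

```cpp
#include <cmath>
#include <complex>
#include <cstdio>

int main() {
    const double PI = std::acos(-1.0);
    const int d = 2;
    const long long Nq[2] = {2048, 2048};
    const long long omega[2] = {-731, 402};    // true frequency vector in Q
    const std::complex<double> a(0.6, -0.8);   // its Fourier coefficient

    // DFT values at the bin h = omega (mod p), assuming no aliasing; the factor
    // p1*...*pd from (4.15)-(4.16) cancels in the ratio, so we drop it.
    const std::complex<double> unshifted = a;
    for (int q = 0; q < d; ++q) {
        double eps = 1.0 / Nq[q];              // eps_q <= 1/N_q
        std::complex<double> shifted =
            a * std::exp(std::complex<double>(0.0, 2 * PI * eps * omega[q]));
        // Recovery formula (4.17), rounded to the nearest integer frequency.
        long long w = std::llround(std::arg(shifted / unshifted) / (2 * PI * eps));
        std::printf("omega_%d recovered as %lld (true %lld)\n", q + 1, w, omega[q]);
    }
}
```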
4.3.2 Algorithm

Equation (4.17) allows us to quickly extract the significant frequency values in the multidimensional DFT of an input signal S(x), provided that no aliasing occurs in a given (d-dimensional) frequency bin h. Recall that in the one-dimensional case, we tested whether or not a collision had occurred in a frequency bin by comparing the magnitudes of the shifted and unshifted coefficients for that bin. Moreover, we took the sample lengths p_j to be prime numbers greater than a constant multiple of k (or k* in the adaptive case). In section 2.4 we proved that with this choice of p_j, the expected number of collisions among k frequencies is a constant fraction of k, so that the computational cost of each iteration of Algorithm 1 decreases geometrically in the iteration number. This in turn implied that the total computational cost is dominated by that of the first iteration, which gave us sampling and runtime complexities of Θ(k) and Θ(k log k), respectively.

In multiple dimensions, collisions among frequencies can occur in any of the d dimensions. A naive approach is then to only calculate frequency values via (4.17) for those frequency bins h for which the ratio of shifted to unshifted coefficients is sufficiently close to unity in all dimensions. That is, for given tolerances τ_q, we require all of the inequalities

\[ \left|\; \left| \frac{\widehat{S_{\mathbf{p},\varepsilon_q}}[\mathbf{h}]}{\widehat{S_{\mathbf{p}}}[\mathbf{h}]} \right| - 1 \;\right| \le \tau_q, \qquad 1 \le q \le d, \qquad (4.18) \]

to hold in order to accept the frequency bin h. For the simulations described in section 4.5 we took τ_q = p1 · · · pd / (N1 · · · Nd). In order to apply the average-case analysis of section 2.4 we therefore require the sampling rate in each dimension to be at least a constant multiple of the sparsity; that is, we take p_{j,q} ≥ ck (or ck*) for 1 ≤ q ≤ d.

As discussed in section 2.4.1, the computational costs of our algorithm are dominated by the FFTs and sorts of the sample arrays S_p and S_{p,εq}. The sampling complexity of this naive algorithm is then Θ(k^d), since we require that many signal values on the grid defined by the cartesian product of the p_{j,q}. As the d-dimensional FFT of an array of size Θ(k) × · · · × Θ(k) has time complexity Θ(k^d log k^d), this naive algorithm will execute in time Θ(k^d log k^d) as well.

Remark 4.3.1. The empirical evaluation in section 4.5 indicates that this choice of p_{j,q} values is overly cautious, and that a coarser sampling of the input may suffice for energetic frequency identification. Specifically, it was observed that the naive algorithm described above almost never detected an aliased frequency bin, so that on average only one set of sample rates p was used. The explanation for this observation is that by choosing the sample rates p_{j,q} ≥ ck, we are effectively hashing the k frequencies into O(k^d) d-dimensional bins, since the discrete grid defined by the cartesian product of the p_{j,q} has (ck)^d nodes. For d > 1 we should therefore expect many fewer multiple-occupancy d-dimensional bins than in the d = 1 case.

We therefore tested a variant of the naive algorithm in which we take p_{j,q} ≥ (ck)^{1/d} (or (ck*)^{1/d}) for q = 1, ..., d. In this manner the total number of d-dimensional bins is ck (or ck*), so the sampling and runtime complexities of one iteration of the inner loop are Θ(k) and Θ(k log k), respectively. On average a constant fraction of the frequencies end up in single-occupancy d-dimensional frequency bins, which by (4.14) is sufficient for calculating the frequency index. Since we test for aliasing in every dimension, we would expect this variant of the naive algorithm to reject more frequencies than our one-dimensional algorithm. The results of the empirical evaluation support this observation.

4.4 Signal unwrapping and the Chinese Remainder Theorem

In this section we provide an alternative to the naive algorithm described above that avoids the exponential scaling of the sampling and runtime complexities in the dimension d. We achieve this by using a clever number-theoretic technique to "unwrap" the d-dimensional input signal S(x) into a related one-dimensional signal S^u(x). We then apply the one-dimensional algorithm of chapter 2 and transform back to d-dimensional indices. As we will show in section 4.4.1, this dimension-reduction technique follows from the Chinese Remainder Theorem (Theorem 2.2.5), which we used previously for worst-case bounds on the number of p_j used by the one-dimensional algorithm.

4.4.1 Preliminaries

In this section we provide the details of the dimension-reducing unwrapping technique, based on ideas presented in [29] and [7]. Let S(x) be a signal of the form (4.4), and let Q = Q(N1, ..., Nd) be as in (4.2), so that supp Ŝ ⊂ Q. Next, choose d pairwise relatively prime integers Ñ_q, 1 ≤ q ≤ d, such that

\[ \widetilde{N}_q \ge d N_q, \qquad (4.19) \]

and set

\[ \widetilde{N} = \prod_{q=1}^{d} \widetilde{N}_q. \qquad (4.20) \]

Finally, let y^{−1} (mod Ñ_q) denote the multiplicative inverse of y in Z_{Ñ_q}, so that y · y^{−1} (mod Ñ_q) ≡ 1 (mod Ñ_q). Note that y^{−1} (mod Ñ_q) exists, and is unique, whenever y is relatively prime to Ñ_q. Define the function ι : Q(N1, ..., Nd) → Z_{Ñ} by

\[ \iota(\mathbf{x}) = \left( \sum_{q=1}^{d} \frac{\widetilde{N}}{\widetilde{N}_q}\, x_q \right) \ (\mathrm{mod}\ \widetilde{N}). \qquad (4.21) \]

Since the Ñ_q are coprime, the Chinese Remainder Theorem implies that ι is a bijection, whose inverse x = ι^{−1}(x̃) has components

\[ x_q = \tilde{x} \cdot \left( \frac{\widetilde{N}}{\widetilde{N}_q} \right)^{-1} \ (\mathrm{mod}\ \widetilde{N}_q). \qquad (4.22) \]

Given these parameters, we can define a one-dimensional function S^u : [0, 1] → C (the "unwrapped" version of S) by

\[ S^u(x) = S\!\left( \frac{\widetilde{N}}{\widetilde{N}_1} x, \ldots, \frac{\widetilde{N}}{\widetilde{N}_d} x \right). \qquad (4.23) \]

For notational convenience in (4.24) below, let x denote the vector argument of S in (4.23).
If S is of the form (4.4), we then have

\[ \widehat{S^u}(\omega) = \int_0^1 S^u(x)\, e^{-2\pi i \omega x}\, dx = \sum_{j=1}^{k} \widehat{S}(\boldsymbol\omega_j) \int_0^1 e^{2\pi i \boldsymbol\omega_j\cdot\mathbf{x}}\, e^{-2\pi i \omega x}\, dx = \sum_{j=1}^{k} \widehat{S}(\boldsymbol\omega_j) \int_0^1 \exp\!\left( -2\pi i x \left( \omega - \sum_{q=1}^{d} \frac{\widetilde{N}}{\widetilde{N}_q}\,\omega_{j,q} \right) \right) dx. \qquad (4.24) \]

From (4.24) we see that Ŝ^u is non-zero only at the frequencies

\[ \tilde\omega_j = \sum_{q=1}^{d} \frac{\widetilde{N}}{\widetilde{N}_q}\, \omega_{j,q} = \iota(\boldsymbol\omega_j), \qquad (4.25) \]

and moreover we have Ŝ^u(ω̃) = Ŝ(ι^{−1}(ω̃)). Finally, we note that if Ŝ(ω) is band-limited to Q, then Ŝ^u(ω) is band-limited to [−Ñ/2, Ñ/2). To see this, note that (4.25) implies

\[ |\tilde\omega| \le \sum_{q=1}^{d} \frac{\widetilde{N}}{\widetilde{N}_q}\, |\omega_q| \le \sum_{q=1}^{d} \frac{\widetilde{N}}{\widetilde{N}_q}\, \frac{N_q}{2} \le \sum_{q=1}^{d} \frac{\widetilde{N}}{2d} = \frac{\widetilde{N}}{2}, \qquad (4.26) \]

where we used the band-limitedness of ω_q to [−N_q/2, N_q/2) in the second inequality and the choice of Ñ_q from (4.19) in the third. Thus S^u(x) is frequency-sparse and band-limited whenever S(x) is, so we may apply all the results of chapter 2 to obtain a multi-dimensional algorithm.

4.4.2 Algorithm

Using the bijective unwrapping ι from (4.21), we have a simple prescription for a multidimensional algorithm. Given Q we first precompute Ñ_q, Ñ, and (Ñ/Ñ_q)^{−1} (mod Ñ_q) for 1 ≤ q ≤ d using the extended Euclidean algorithm [13]. This allows us to evaluate S^u at any desired point x using (4.23). We then use the one-dimensional algorithm of chapter 2 to approximate Ŝ^u, and finally transform back to d dimensions using ι^{−1}. The sampling and runtime complexities of this "unwrapped" algorithm are Θ(k) and Θ(k log k), respectively.

Remark 4.4.1. Since this "unwrapped" algorithm merely invokes the one-dimensional algorithm of chapter 2, we would expect it to perform similarly to that algorithm in practice. While it does seem to attain the stated sampling complexity, the results of the empirical evaluation in section 4.5 imply that it performs most similarly to the variation on the naive algorithm mentioned in remark 4.3.1. This is somewhat surprising, and we leave the relationship between the two algorithms as future work.
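The unwrapping map (4.21) and its inverse (4.22) are a few lines of modular arithmetic. The C++ sketch below performs a round trip on a single frequency vector for d = 2; the moduli 4099 and 4097 are our own hypothetical choices satisfying Ñ_q ≥ dN_q for N_q = 2048, and the re-centering at the end maps each residue back into the signed range [−Ñ_q/2, Ñ_q/2).

```cpp
#include <cstdio>

// Extended Euclid: returns gcd(a, b) and x, y with a*x + b*y = gcd.
long long ext_gcd(long long a, long long b, long long &x, long long &y) {
    if (b == 0) { x = 1; y = 0; return a; }
    long long x1, y1, g = ext_gcd(b, a % b, x1, y1);
    x = y1; y = x1 - (a / b) * y1;
    return g;
}

long long mod(long long a, long long m) { return ((a % m) + m) % m; }

long long inv_mod(long long a, long long m) {  // a^{-1} mod m, assuming gcd(a, m) = 1
    long long x, y;
    ext_gcd(mod(a, m), m, x, y);
    return mod(x, m);
}

int main() {
    const int d = 2;
    const long long Nt[2] = {4099, 4097};      // co-prime, Nt_q >= d*N_q for N_q = 2048
    const long long Ntot = Nt[0] * Nt[1];      // unwrapped bandwidth

    const long long omega[2] = {-731, 402};    // frequency vector in Q

    // Unwrap (4.21): u = sum_q (Ntot/Nt_q) * omega_q  (mod Ntot).
    long long u = 0;
    for (int q = 0; q < d; ++q)
        u = mod(u + (Ntot / Nt[q]) * mod(omega[q], Nt[q]), Ntot);

    // Wrap back (4.22): omega_q = u * (Ntot/Nt_q)^{-1}  (mod Nt_q), then re-center.
    for (int q = 0; q < d; ++q) {
        long long r = mod(u * inv_mod(Ntot / Nt[q], Nt[q]), Nt[q]);
        if (r >= (Nt[q] + 1) / 2) r -= Nt[q];
        std::printf("omega_%d = %lld (true %lld)\n", q + 1, r, omega[q]);
    }
}
```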
4.5 Empirical evaluation

In this section we describe the results of an empirical evaluation in two dimensions of the three multi-dimensional algorithms described in sections 4.3 and 4.4. We only tested the adaptive versions of these algorithms, where the sample lengths are chosen with respect to k*, the number of frequencies left in the residual. As in chapter 2, all three variants were implemented in C++ using FFTW 3.0 [19] for the FFTs, and all code was compiled with the Intel compiler using the -fast optimization. The experiments were run on the same machine used in section 2.5. In the figures below we refer to the three algorithms as "Naive" (section 4.3), "Naive-Sqrt" (section 4.3, remark 4.3.1), and "Unwrapped" (section 4.4).

4.5.1 Setup

For all of the experiments below, we fixed the bandwidth in both dimensions to be 2^11, so in the notation of section 4.2 we have Q = [−1024, 1023) × [−1024, 1023). This value was chosen to facilitate comparison with the one-dimensional results in section 2.5, where the (one-dimensional) bandwidth was taken to be 2^22. We varied the sparsity k from 2 to 512 by powers of two. All of the experiments were performed in the absence of noise, and we fixed the parameter c1 = 5. For the Naive algorithm and its variant Naive-Sqrt we took the shifts εq = 1/2Nq, while for the Unwrapped version we took ε = 1/2Ñ. The test signals were of the form (4.4), and were generated by drawing frequencies ωj uniformly from Q and coefficients aj uniformly from the complex unit circle.

Each data point in figures 4.1 and 4.2 represents the average over 100 independent trials, and the upper and lower bars in figure 4.1 represent the maximum and minimum values over all 100 trials. All variants of the algorithms recovered each test signal exactly (to within 10^{−12} in the ℓ2 norm). Finally, we also compared our algorithms with the two-dimensional FFTW routine using FFTW_MEASURE plans.

4.5.2 Runtime and sampling complexity

In figure 4.1(a) we compare the average number of samples of the input S(x) required by each algorithm when the bandwidth Q is fixed. We can see that the Naive algorithm scales approximately like Θ(k²), while the Naive-Sqrt and Unwrapped algorithms scale approximately like Θ(k), as expected. Note that for k ≳ 256, the Naive algorithm samples the input above the Nyquist rate 1/(N1 N2), while for all values of k tested the Naive-Sqrt and Unwrapped algorithms take orders of magnitude fewer samples than either FFTW or the Naive algorithm.

In figure 4.1(b) we compare the average runtime (in CPU ticks) of each algorithm when the bandwidth Q is fixed. The Naive algorithm is faster than FFTW for k ≤ 64, while the Naive-Sqrt and Unwrapped algorithms are faster for k ≤ 256. We can also see that the Naive algorithm scales approximately like Θ(k² log k), and while the Naive-Sqrt and Unwrapped algorithms are generally two orders of magnitude faster, they also appear to scale at the same rate. This is in contrast to the expected Θ(k log k) runtime complexity. A potential explanation for this observation is that, as in the one-dimensional case, we evaluated the intermediate representations by simply looping over the sample points and the terms in the representation. As mentioned in section 2.4.1, this has time complexity Θ(k²), while non-uniform FFTs can be used to reduce this to Θ(k log k). Whereas in the one-dimensional case this did not affect the overall runtime, it appears that in two dimensions it represents a significant fraction of the runtime. We leave the replacement of the direct evaluation with non-uniform FFTs in both the one- and multi-dimensional algorithms in this dissertation as future work.

4.5.3 Sample lengths and aliased frequencies

In figure 4.2(a) we compare the average number of sets of sample lengths p used by each algorithm in the fixed-bandwidth case. We include the one-dimensional algorithm with N = 2^22 for comparison. We can see that the Naive algorithm almost always uses only one set of sample lengths, which implies that the choice of p_{ℓ,q} ≥ ck* for 1 ≤ q ≤ d is overly cautious. When we reduce the number of frequency bins by only requiring p_{ℓ,q} ≥ √(ck*), the Naive-Sqrt algorithm uses many more sample-length sets. Moreover, we see that the Unwrapped algorithm uses approximately the same number of sample lengths as the Naive-Sqrt algorithm, which is unexpected. What is more, both the Naive-Sqrt and Unwrapped algorithms use many more sample lengths than the one-dimensional algorithm on an equivalent bandwidth.

In figure 4.2(b) we compare the average fraction of aliased frequencies encountered by each algorithm over its execution. This quantity was calculated simply by incrementing a counter whenever the aliasing test (4.18) (or its equivalent in the Unwrapped and one-dimensional cases) failed, and then dividing the total number of failures by the sparsity k. Note that by the Markov chain analysis of the one-dimensional algorithm in section 2.4.3 this quantity should be independent of k, and this is indeed observed for the one-dimensional algorithm.
As mentioned in remark 4.3.1, we can see that the Naive algorithm almost never encounters an aliased frequency, while the Naive-Sqrt and Unwrapped algorithms reject many more frequencies than in the equivalent one-dimensional case.

[Figure 4.1: (a) 2D sampling complexity with fixed bandwidth N1 = N2 = 2^11 for Naive (blue solid line), Naive-Sqrt (black dashed line), Unwrapped (magenta dashed line), and FFTW (red dashed line). (b) 2D runtime complexity with fixed bandwidth N1 = N2 = 2^11 for the same algorithms.]

[Figure 4.2: (a) Average number of sample lengths used in 2D with fixed bandwidth N1 = N2 = 2^11 for Naive (blue solid line), Naive-Sqrt (black dashed line), and Unwrapped (magenta dashed line); also shown for comparison is 1D PS-Det (red solid line) with fixed bandwidth N = 2^22. (b) Average fraction of aliased frequencies in 2D with fixed bandwidth N1 = N2 = 2^11 for the same algorithms, with 1D PS-Det again shown for comparison.]

A possible explanation for the increased number of sample-length sets p and the increased fraction of aliased frequencies observed in the Naive-Sqrt algorithm is the test used for aliasing in a given frequency bin. In the one-dimensional case, we tested whether the ratio between the magnitudes of the shifted and unshifted peaks was sufficiently close to unity, whereas in the multi-dimensional case we require that a frequency bin pass this test in all dimensions. The requirements for accepting a frequency bin are therefore more stringent in higher dimensions, and we would expect a commensurate increase in the number of rejected frequencies. As shown in figure 4.2, this is indeed observed in two dimensions for the Naive-Sqrt algorithm.

The effect of this increase in the number of rejected frequencies can also be seen in the runtime of the two-dimensional Naive-Sqrt algorithm. As more frequencies are rejected, more sample lengths p are used to identify the remaining frequencies. Recall that in the adaptive algorithm (which is what we implemented), whenever a new sample rate is used we subtract off the contribution of the already-discovered Fourier terms. Since our implementation currently performs this operation directly, with time complexity Θ(k²), the increased number of rejected frequencies leads to the overall Θ(k²) time complexity observed in figure 4.1(b).

Finally, we note that the preceding discussion only seems to apply to the Naive-Sqrt algorithm, and not the Unwrapped version. The aliasing test in the Unwrapped algorithm is identical to the one-dimensional version, so we would not expect an increase in the number of rejected frequencies. The fact that the Unwrapped algorithm has near-identical empirical performance to the Naive-Sqrt algorithm remains somewhat mysterious, and we leave the elucidation of the relationship between the two as future work.
77 4.6 Conclusion In this chapter we gave three different extensions of our basic algorithm to the multidimensional setting. Two of these rely on the separability properties of the exponential function, while the third relies on a clever number-theoretic construction using the Chinese Remainder Theorem. The resulting algorithms have runtime and sampling requirements that are less stringent than the d-dimensional FFT for sparsity levels up to k = 256. We remark again that the implementation of non-uniform FFTs in place of direct evaluation of intermediate representations and the explanation of the similarity in the observed behavior of the Naive-Sqrt and Unwrapped algorithms are left as possible directions for future research. 78 BIBLIOGRAPHY 79 BIBLIOGRAPHY [1] M. Ajtai, H. Iwaniec, J. Koml´s, J. Pintz, and E. Szemer´di, Construction of a thin set o e with small Fourier coefficients, Bull. London Math. Soc. 22 (1990), no. 6, 583–590. [2] A. Akavia, Deterministic Sparse Fourier Approximation via Fooling Arithmetic Progressions, Conference on Learning Theory (CoLT), 2010. [3] A. Akavia, S. Goldwasser, and S. Safra, Proving hard-core predicates using list decoding, Annual Symposium on Foundations of Computer Science, vol. 44, 2003, pp. 146–159. [4] C. Anderson and M. D. Dahleh, Rapid computation of the discrete Fourier transform, SIAM J. Sci. Comput. 17 (1996), no. 4, 913–919. [5] D. Boneh, Finding smooth integers in short intervals using crt decoding, Journal of Computer and System Sciences 64 (2002), no. 4, 768–784. [6] W. L. Briggs and V. E. Henson, The DFT: an owner’s manual for the discrete Fourier transform, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1995. [7] C. Burrus, Index mappings for multidimensional formulation of the DFT and convolution, IEEE Transactions on Acoustics, Speech and Signal Processing 25 (1977), no. 3, 239–242. [8] E. J. Cand`s, J. K. Romberg, and T. Tao, Robust uncertainty principles: exact sige nal reconstruction from highly incomplete frequency information, IEEE Trans. Inform. Theory 52 (2006), no. 2, 489–509. [9] I. Carron, Nuit blanche, http://nuit-blanche.blogspot.com. [10] S.S. Chen, D.L. Donoho, and M.A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput. 20 (1998), no. 1, 33–61. [11] A. Cohen, W. Dahmen, and R. DeVore, Compressed sensing and best k-term approximation, J. AMS 22 (2009), no. 1, 211–231. 80 [12] J. W. Cooley and J. W. Tukey, An algorithm for the machine calculation of complex Fourier series, Math. Comp. 19 (1965), 297–301. [13] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms, second ed., MIT Press, Cambridge, MA, 2001. [14] L. Demanet and G. Peyr´, e Compressive wave computation, Submitted to Found. Comput. Math. (2008), Preprint available at http://math.mit.edu/ laurent/html/publications.html. [15] J. Dongarra and F. Sullivan, Guest editors’ introduction: The top 10 algorithms, Computing in Science and Engineering (2000), 22–23. [16] D. L. Donoho and P. B. Stark, Uncertainty principles and signal recovery, SIAM J. Appl. Math. 49 (1989), no. 3, 906–931. [17] D. P. Dubhashi and A. Panconesi, Concentration of measure for the analysis of randomized algorithms, Cambridge University Press, Cambridge, 2009. [18] A. Dutt and V. Rokhlin, Fast Fourier transforms for nonequispaced data, SIAM J. Sci. Comput. 14 (1993), no. 6, 1368–1393. [19] M. Frigo and S. G. Johnson, The design and implementation of FFTW3, Proceedings of the IEEE 93 (2005), no. 
2, 216–231, Special issue on “Program Generation, Optimization, and Platform Adaptation”. [20] A. Gilbert, S. Guha, P. Indyk, S. Muthukrishnan, and M. Strauss, Near-optimal sparse Fourier representations via sampling, Symposium on Theory of Computing, 2002, pp. 152–161. [21] A. Gilbert, S. Muthukrishnan, and M. Strauss, Improved time bounds for near-optimal sparse Fourier representations, SPIE Wavelets XI, 2005. [22] A. Gilbert, M. Strauss, and J. Tropp, A tutorial on fast Fourier sampling, IEEE Signal Processing Magazine 25 (2008), no. 2, 57–66. [23] O. Goldreich, D. Ron, and M. Sudan, Chinese remaindering with errors, IEEE Transactions on Information Theory 46 (2000), no. 4, 1330–1338. [24] G. Golub and C. Van Loan, Matrix computations, third ed., Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins University Press, Baltimore, MD, 1996. 81 [25] H. Hassanieh, P. Indyk, D. Katabi, and E. Price, Nearly optimal sparse fourier transform, to appear in ACM Symposium on Theory of Computing (STOC), 2012. [26] , Simple and practical algorithms for sparse Fourier transform, to appear in ACM-SIAM Symposium on Discrete Algorithms (SODA), 2012. [27] M. Iwen, A deterministic sub-linear time sparse Fourier algorithm via non-adaptive compressed sensing methods, Proceedings of the nineteenth annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008, pp. 20–29. [28] , Combinatorial sublinear-time Fourier algorithms, Found. Comput. Math. 10 (2010), no. 3, 303–338. [29] , Improved approximation guarantees for sublinear-time Fourier algorithms, Submitted (2011), Available at http://www.math.duke.edu/ markiwen/. [30] M. Iwen, A. Gilbert, and M. Strauss, Empirical evaluation of a sub-linear time sparse DFT algorithm, Commun. Math. Sci. 5 (2007), no. 4, 981–998. [31] R. M. Karp, Probabilistic recurrence relations, J. Assoc. Comput. Mach. 41 (1994), no. 6, 1136–1150. [32] N. Katz, An estimate for character sums, J. Amer. Math. Soc. 2 (1989), no. 2, 197–200. [33] E. Kushilevitz and Y. Mansour, Learning decision trees using the Fourier spectrum, SIAM J. Comput. 22 (1993), no. 6, 1331–1348. [34] H. W. Lenstra, Jr., Integer programming with a fixed number of variables, Math. Oper. Res. 8 (1983), no. 4, 538–548. [35] T. Lin and F.J. Herrmann, Compressed wavefield extrapolation, Geophysics 72 (2007), no. 5, SM77–SM93. [36] N. Linial, Y. Mansour, and N. Nisan, Constant depth circuits, Fourier transform, and learnability, J. Assoc. Comput. Mach. 40 (1993), no. 3, 607–620. [37] Y. Mansour, Randomized interpolation and approximation of sparse polynomials, SIAM Journal on Computing 24 (1995), no. 2, 357–368. 82 [38] I. Niven, H.S. Zuckerman, and H.L. Montgomery, An introduction to the theory of numbers, fifth ed., John Wiley & Sons Inc., New York, 1991. [39] A.M. Pinkus, On L1 -approximation, Cambridge Tracts in Mathematics, vol. 93, Cambridge University Press, Cambridge, 1989. [40] Y. Rubner, C. Tomasi, and L.J. Guibas, The earth mover’s distance as a metric for image retrieval, International Journal of Computer Vision 40 (2000), no. 2, 99–121. [41] L.I. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: Nonlinear Phenomena 60 (1992), no. 1-4, 259–268. [42] F. Santosa and W.W. Symes, Linear inversion of band-limited reflection seismograms, SIAM J. Sci. Statist. Comput. 7 (1986), no. 4, 1307–1330. [43] A. 
Schrijver, Theory of linear and integer programming, Wiley-Interscience Series in Discrete Mathematics, John Wiley & Sons Ltd., Chichester, 1986. [44] I.E. Shparlinski and R. Steinfeld, Noisy chinese remaindering in the Lee norm, Journal of Complexity 20 (2004), no. 2-3, 423–437. [45] T. Tao and V. Vu, Additive combinatorics, Cambridge Studies in Advanced Mathematics, vol. 105, Cambridge University Press, Cambridge, 2006. [46] C. Van Loan, Computational frameworks for the fast Fourier transform, Frontiers in Applied Mathematics, vol. 10, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, 1992. [47] Y. Wang and G. Zhou, On the use of high-order ambiguity function for multi-component polynomial phase signals, Signal Processing 65 (1998), no. 2, 283–296. 83