GENERATIVE SIGNAL PROCESSING THROUGH MULTILAYER MULTISCALE WAVELET MODELS

By

Jieqian He

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computational Mathematics, Science and Engineering – Doctor of Philosophy
Statistics – Dual Major

2021

ABSTRACT

GENERATIVE SIGNAL PROCESSING THROUGH MULTILAYER MULTISCALE WAVELET MODELS

By

Jieqian He

Wavelet analysis and deep learning are two popular fields for signal processing. The scattering transform from wavelet analysis is a recently proposed mathematical model for convolutional neural networks. Signals with repeated patterns can be analyzed using the statistics from such models. Specifically, signals from certain classes can be recovered from related statistics. We first focus on recovering 1D deterministic Dirac signals from multiscale statistics. We prove a Dirac signal can be recovered from multiscale statistics up to a translation and reflection. Then we switch to a stochastic version, modeled using Poisson point processes, and prove that wavelet statistics at small scales capture the intensity parameter of Poisson point processes. We also design a scattering generative adversarial network (GAN) to generate new Poisson point samples from the statistics of multiple given samples. Next we consider texture images. We successfully synthesize new textures given one sample from the texture class through multiscale, multilayer wavelet models. Finally, we analyze and prove why the multiscale, multilayer model is essential for signal recovery, especially for natural texture images.

Copyright by JIEQIAN HE 2021

ACKNOWLEDGEMENTS

I would like to express the deepest appreciation to my advisor and committee chair, Professor Matthew Hirn. He led me into the area of signal processing and provided a great deal of guidance during my PhD research. He has always been supportive and kind to me during the past five years. This dissertation would not have been possible without his guidance and encouragement. I would also like to thank my committee members, Professor Yuying Xie, Professor Mark Iwen, Professor Yimin Xiao and Professor Vishnu Boddeti, for their insightful comments and support during my research. Thanks to my dear friends Binbin, Hongnan, Ningyu, Yuning, Yuanyi, and Zheng for their consistent help and company. I greatly value their friendship, and they are the treasures I gained from MSU during the past five years. Thanks to my old friends Yilang and Yanjun for their long-distance support and lifelong friendships. Most importantly, I would like to thank my parents Xiaoqiang He and Li Lin for loving me and supporting me at every stage of my life. Many thanks to my fiancé Longhao Jin for always caring for me and standing by my side. And thanks to my little friends Kiwi and Simba for the warmest companionship.

TABLE OF CONTENTS

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Stochastic processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Texture Synthesis using VGG and Style Transfer . . . . . . . . . . . . . . . .
8 2.5 Wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.6 Scattering transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.7 Signal recovery through statistical features . . . . . . . . . . . . . . . . . . . 18 CHAPTER 3 CHARACTERIZING SPARSE SIGNALS THROUGH A HYBRID SCATTERING TRANSFORM . . . . . . . . . . . . . . . . . . . . . 19 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 Theory and proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 CHAPTER 4 SCATTERING STATISTICS OF GENERALIZED SPATIAL POIS- SON POINT PROCESSES . . . . . . . . . . . . . . . . . . . . . . . 34 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.2 Expected Scattering Moments for Random Signed Measures . . . . . . . . . 35 4.3 First-Order Scattering Moments of Generalized Poisson Processes . . . . . . 37 4.3.1 Inhomogeneous, Compound Spatial Poisson Point Processes . . . . . 38 4.3.2 First-order Scattering Asymptotics . . . . . . . . . . . . . . . . . . . 39 4.4 Second-Order Scattering Moments of Generalized Poisson Processes . . . . . 43 4.5 Poisson Point Processes Compared to Self Similar Processes . . . . . . . . . 44 4.6 Numerical Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.6.1 Homogeneous, compound Poisson point processes with the same in- tensities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.6.2 Homogeneous, compound Poisson point processes with different in- tensities and charges . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.6.3 Inhomogeneous, non-compound Poisson point processes . . . . . . . . 48 4.6.4 Homogeneous, non-compound Poisson point process and self similar process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.7 Scattering GAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 v 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 CHAPTER 5 TEXTURE SYNTHESIS VIA PROJECTION ONTO MULTISCALE, MULTILAYER STATISTICS . . . . . . . . . . . . . . . . . . . . . . 53 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.2.1 Wavelet filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.2.2 First layer statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.2.3 Second layer statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.3 Synthesis algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.4.1 Filter comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4.2 Comparison of maximum scale . . . . . . . . . . . . . . . . . . . . . . 71 5.4.3 Layers analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.4.4 Methods comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 
75 5.5 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.5.1 Reduction of second layer statistics . . . . . . . . . . . . . . . . . . . 77 5.5.2 Matching of second layer statistics . . . . . . . . . . . . . . . . . . . 80 5.5.3 Analysis of number of iterations . . . . . . . . . . . . . . . . . . . . . 80 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 CHAPTER 6 MULTILAYER MODEL ANALYSIS . . . . . . . . . . . . . . . . . . 86 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.2 Multilayer multiscale model . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.2.1 Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.2.2 Wavelet transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6.2.3 Gram matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6.3 Texture with multiple straight lines . . . . . . . . . . . . . . . . . . . . . . . 90 6.3.1 Gram matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.3.2 Small J . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.3.3 Deep layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.4 Frame-like texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.5 Multilayer model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.5.1 Multilayer ReLU model . . . . . . . . . . . . . . . . . . . . . . . . . 101 CHAPTER 7 RANDOM FIELDS . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 7.1 Scattering moments of self-similar processes . . . . . . . . . . . . . . . . . . 106 7.2 Power spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 7.3 Power spectrum and scattering equivalence . . . . . . . . . . . . . . . . . . . 112 APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 vi LIST OF FIGURES Figure 2.1 A special pre-trained very deep CNN: VGG[1] . . . . . . . . . . . . . . . 7 Figure 2.2 Structure of the GAN. The generator takes in a random vector z with lower dimensions and tries to generate a new signal G(z) that comes from the same distribution as the given data x. The discriminator takes in a signal G(z) or x and tries to output a probability p of the signal coming from the real distribution. . . . . . . . . . . . . . . . . . . . . . 7 Figure 2.3 Process of using VGG features to synthesize texture images [1] . . . . . . 9 Figure 2.4 Synthesis results with different layers. The last row shows the original images. The first row shows synthesized image by matching statistics from only "conv1-1" layer and the second shows that matching statistics from "conv1-1" and "pool1" layers and so on. With only lower layer statistics, the synthesis lose information of the texture. Adding statistics from deeper layers improves the performance, which proves higher layer statistics are necessary to capture texture information. The last column shows result of a natural image, which has more localized structure and is not stationary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Figure 2.5 Left: A texture image of bricks with edges at direction 0, π/2. Right: The wavelets of the texture computed through Morlet wavelets, shown in Figure 2.6. 
It shows the texture has large response at θ = 0 and θ = π/2, tiny response at other angles, which match the directions of the edges in the texture. Also, as the wavelet scale goes larger, the response gets smoother. . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Figure 2.6 Morlet wavelets (real part) with 8 different scales and 12 different rota- tions in space and frequency. Left: 2D Morlet wavelet family (ψj,θ )j,θ in space. Middle: 2D Morlet wavelet family in frequency, which is the Fourier transform of 2D Morlet wavelet family (ψbj,θ )j,θ . Right: The summation |φ cJ (ω)|2 + P λ∈Λ |ψλ (ω)| which is almost a constant c 2 over all frequencies, making an approximate Littlewood-Paley frame. This figure shows this group of wavelets plus a low pass satisfy the Littlewood-Paley condition on a half plane approximately, except for the boundary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Figure 3.1 A wavelet ψ of 4 vanishing moments. . . . . . . . . . . . . . . . . . . . . 21 Figure 3.2 Wavelets sparsify piecewise polynomials on the interval [0, 1024]. . . . . . 21 vii Figure 3.3 Gabor filters used in the second layer (real parts) . . . . . . . . . . . . . 22 Figure 3.4 Choosing filter scales depending on the pairwise distance between signal spikes. Left: A sparse signal with n = 128, three spikes and minDD = 8. Right: The blue spikes show what are the pairwise distance. The orange spikes show what are the scales we choose. . . . . . . . . . . . . 31 Figure 3.5 Sparse signals reconstructed up to a global reflection and translation. Each row shows the result from a single test. Left column: Original signals. Right column: Synthesized signals. . . . . . . . . . . . . . . 33 Figure 4.1 First-order invariant scattering moments for three types of homoge- neous compound Poisson point processes with the same intensity λ0 . Left: Top: ordinary Poisson point process. Middle: Gaussian com- pound Poisson point process with normally distributed charges. Bot- tom: Rademacher compound Poisson point process with charges drawn from the Rademacher distribution. Middle: Normalized invariant scat- tering moments SY (s,ξ,1)/skwk1 (i.e., p = 1), which all converge to 0.01 as s → 0 (up to numerical errors) since λ0 E[|A1 |] is the same for all three point processes. Right: Normalized invariant scattering moments SY (s,ξ,2)/skwk2 (i.e., p = 2). In this case the ordinary Poisson point pro- 2 cess and the Rademacher Poisson point process still converge to the same value as s → 0 since E[|A1 |2 ] = 1 for both of them. However, the Gaussian Poisson point process converges to a different value since E[|A1 |2 ] = π/2 for this process. . . . . . . . . . . . . . . . . . . . . . . . . 47 Figure 4.2 First-order invariant scattering moments for two homogeneous, Gaus- sian compound Poisson point processes with different intensity and vari- ance. Left: Top: Homogeneous compound Poisson point process with intensity λ1 = 0.01 and charges A1,j ∼ N (0, 1). Bottom: Homogeneous √ compound Poisson point process with intensity λ2 = 0.01/ 2 and charges A2,j ∼ N (0, 2). The two point processes are difficult to distinguish, vi- sually. Middle: Normalized invariant scattering moments SY (s,ξ,1)/skwk1 (i.e., p = 1), which both converge to approximately 0.08 up to nu- merical error, thus indicating that these moments cannot distinguish the two processes. Right: Normalized invariant scattering moments SY (s,ξ,2)/skwk2 (i.e., p = 2). 
The two process are distinguished as s → 0 2 since the values λ1 E[|A1,j |2 ] = 0.01 and λ2 E[|A2,j |2 ] ≈ 0.014 differ by a significant margin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 viii Figure 4.3 First-order invariant scattering moments for inhomogeneous non-compound Poisson point processes. Left: Inhomogeneous non-compound Poisson point process with intensity λ(t) = 0.01(1+0.5 sin(2πt/N )). Right: Scat- tering moments S[γ,p]Y (t)/skwkpp for inhomegeneous non-compound Poisson point process at t1 = N/4, t2 = N/2, t3 = 3N/4. Note that λ(t1 ) = 0.015, λ(t2 ) = 0.01, λ(t3 ) = 0.005. The plots show that for inhomogeneous process, scattering coefficients at time t converges to the intensity at that time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Figure 4.4 First-order invariant scattering moments for Brownian motion and Pois- son point process. Left: Top: Brownian motion with Hurst parameter H = 1/2. Bottom: Ordinary Poisson point process. Middle: Nor- malized scattering moments for Brownian Motion (SXBM (s,ξ,1)/kwk2 E|Z|) and Poisson point process (SYpoisson (s,ξ,1)/λE|A|kwk1 ) √ at p = 1. This shows the convergence rate of normalized scattering is s for Brownian mo- tion and s for Poisson process, indicating the 1st moment can distin- guish Brownian motion and Poisson point process. Right: Normalized scattering moments for Brownian Motion (SXBM (s,ξ,2)/kwk22 ) and Poisson point process (SYpoisson (s,ξ,2)/kwk22 ) at p = 2. Both normalized scattering moments have convergence rate s, so the 2nd moment scattering cannot distinguish the two processes. . . . . . . . . . . . . . . . . . . . . . . . . 49 Figure 4.5 Scattering-GAN to study the capacity of scattering moments on Poisson point process. Similar to ordinary GAN, the generator takes in a random vector z and generates fake data y 0 . Also, the discriminator aims to distinguish fake representations from real. By inserting a scattering module between G and D, the discriminator tries to distinguish S(y 0 ) from S(y). When the model trains successfully, S(y 0 ) has the same distribution as S(y). By checking the similarity between y 0 and y, we learn the capacity of scattering moments. . . . . . . . . . . . . . . . . . . 50 Figure 4.6 Generated signals through scattering GAN. We use realizations with length n = 212 from a homogeneous ordinary Poisson point process, i.e., λ(t) ≡ λ0 and Ai ≡ 1, as training data. We use {2j/2 }22 j=0 as scales for filters and apply a one-layer scattering operator to compute the scattering moments. Sigmoid is applied at the last layer in the generator. The generated signals are sparse, although not as sparse as training data. This is natural since Sigmoid forces A0i ∈ (0, 1), thus E[A0i ] < E[Ai ]. According to theorem 6, λ00 > λ0 , which we verified numerically. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 ix Figure 5.1 Wavelet families. Upper: Even directional wavelets in space and fre- quency (FFT). Middle: Odd directional wavelets in space and fre- quency (FFT). Lower: Omnidirectional wavelets in space and frequency (FFT). Each block shows the wavelet family with different scales and oscillations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Figure 5.2 1D wavelets. From left to right: 1D even wavelet, 1D odd wavelet, FFT (real part) of even wavelet, FFT (imagery part) of odd wavelet. . . . . . 62 Figure 5.3 Dirac functions and wavelet coefficients. Left: Two Dirac functions y1 and −y1 . 
Middle: Wavelet coefficients with the even wavelet. Right: Wavelet coefficients with the odd wavelet. . . . . . . . . . . . . . . . . . 63 Figure 5.4 Covariance matrix for diracs. Upper row from left to right: Cye1 , C−y e 1 , Cy1 − C−y1 . Lower row from left to right: Cy1 , C−y1 , Cy1 − C−y1 . This e e o o o o numerically verified that even wavelet is able to distinguish the two dirac functions from the covariance statistics while odd wavelet cannot. . . . . 63 Figure 5.5 Jump functions and wavelet coefficients. Left: Two jump functions y2 and −y2 . Middle: Wavelet coefficients with the even wavelet. Right: Wavelet coefficients with the odd wavelet. . . . . . . . . . . . . . . . . . 64 Figure 5.6 Upper row from left to right: Cye2 , C−y e 2 , Cye2 − C−y e 2 . Lower row from left to right: Cy2 , C−y2 , Cy2 − C−y2 . This numerically verified that odd o o o o wavelet is able to distinguish the two jump functions from the covariance statistics while the even wavelet cannot. . . . . . . . . . . . . . . . . . . 65 Figure 5.7 Synthesis results from one layer (J = 6) with different types of wavelet filters. Left: Original image. Middle left: First layer synthesis results with only odd wavelets. Middle right: First layer synthesis results with only even wavelets. Right: First layer synthesis results with both even and odd wavelets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Figure 5.8 Synthesis results from two layers (J = 5) with/without omnidirectional wavelets. Left: Original image. Middle: 2nd layer synthesis with even and odd wavelets. Right: 2nd layer synthesis with even, odd and omnidirectional wavelets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Figure 5.9 Synthesis results from two layers with different number of scales. Left: Original image. Middle Left: 2nd layer synthesis results with J = 4. Middle Right: 2nd layer synthesis results with J = 5. Right: 2nd layer synthesis results with J = 6. . . . . . . . . . . . . . . . . . . . . . . 74 x Figure 5.10 Synthesis results with one layer model and two layer model. Left: Orig- inal image. Middle: 1st layer synthesis results. Right: 2nd layer synthesis results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Figure 5.11 Synthesis results compared to other models. Left: Original images. Middle Left: Results from Portilla and Simoncelli [2]. Middle Right: Results from Gatys et al. [3]. Right: Results from our two layer model. 78 Figure 5.12 Synthesis results compared to other models. Left: Original images. Middle Left: Results from Portilla and Simoncelli [2]. Middle Right: Results from Gatys et al. [3]. Right: Results from our two layer model. 79 Figure 5.13 Left: Original images. Middle left: Images synthesized from first layer. Middle right: Images synthesized from second layer, initialized from first layer result. Right: Images synthesized from second layer, initialized from uniform noise. . . . . . . . . . . . . . . . . . . . . . . . 81 Figure 5.14 The synthesis process for different micro-textures plus flowers. . . . . . . 83 Figure 5.15 The synthesis process for different macro-textures with rigid patterns. . . 84 Figure 5.16 The synthesis process when initializing from the first layer synthesis, for a texture with complex patterns. . . . . . . . . . . . . . . . . . . . . . . 84 Figure 6.1 Two line texture images P8 that have the same T summation of height squared. P4 √ x1 (u) = i=1 132iT (u1 ), u ∈ [Z [0, 256)] . Right: x2 (u) = 2 Left: i=1 2 · 1 (u 64i 1 ), u ∈ [Z [0, 256)]2 . . . . . . . . 
. . . . . . . . . . . . 94 Figure 6.2 The gram matrices and the difference for the two line textures in Figure 6.1. Left: The gram matrix Gx1 between ReLU responses for texture x1 . Middle: The gram matrix Gx2 between ReLU responses for texture x2 . Right: The difference between the two gram matrices |Gx1 − Gx2 |. 94 Figure 6.3 Frame texture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 xi LIST OF ALGORITHMS Algorithm 1 Projection algorithm for texture synthesis . . . . . . . . . . . . . . 69 Algorithm 2 Algorithm for simulating inhomogeneous Poisson point process . . 140 xii CHAPTER 1 INTRODUCTION Wavelet analysis [4] and deep learning are two popular fields for signal processing. Wavelet transforms can be used to extract information from many different kinds of signal data and are related to harmonic analysis. In the 90’s, people used wavelets to analyze 2D signals and do classification on texture images [5]. In [2], directional even and odd wavelets are used to extract statistical features of texture images and reconstruct the images from such features. In recent years, image processing has achieved great progress with the development of deep learning, which uses deep neural networks (DNN) to process signals. One popular model is the convolutional neural network (CNN), which achieves great performance on image classification [1, 6, 7]. DNNs are not only used to do classification on high dimensional data, but also to generate new data. A very deep neural network called VGG [1], which is pre- trained over Imagenet [8], successfully synthesizes texture images from only one sample [3]. Another recent model is the Generative Adversarial Network (GAN) [9], which generates signals that are from the same distribution of given data by training two networks in an adversarial way. However, there is not enough theory to explain why deep learning works, so these types of methods remain black boxes. Recently, theory was developed on the scattering transform [10], which combines wavelet transforms and nonlinear operators to capture features of signals. The process is similar to a CNN and it connects wavelet theory to deep neural nets. So a new task is to understand deep neural networks through theory in wavelet analysis. The research contained in this thesis involves wavelet analysis and deep learning in signal processing. Within this field, an interesting and difficult task is signal reconstruction. It requires to extract all important features of signals and to develop efficient algorithms to reconstruct the signals from such features. This thesis describes the use of modified scattering transforms to synthesize 1D sparse signals and 2D texture images. It also reports a new 1 algorithm that combines GANs and scattering networks to generate 1D stochastic processes. The relationship between filter size, model depth and signals is also discussed. Finally, wavelet analysis on random fields is explored. The remainder of this thesis is organized as follows. Chapter 2 introduces background material. Chapter 3 describes work on 1D sparse signal analysis and synthesis. Chapter 4 explores multilayer scattering features on Poisson point processes, especially at small scales. Chapter 5 presents a multiscale, multilayer model for texture image synthesis. Chapter 6 provides theoretical analysis on filter size, model depth and signal reconstruction. Chapter 7 discusses wavelet analysis on random fields. 
2 CHAPTER 2 BACKGROUND This chapter introduces the background of stochastic processes, CNNs, GANs, wavelets, and the scattering transform. In section 2.1, stochastic processes will be defined and properties such as being stationary and ergodic will be discussed. Section 2.2 covers the background of CNNs and a specific architecture VGG, to explain the basic operators needed to define a CNN. Section 2.3 reviews the intuition and architecture of GANs. Section 2.5 introduces the 1D and 2D wavelet family and wavelet transforms. The intuition of wavelet analysis in signal processing will be briefly described. Section 2.6 reviews the definition and properties of the scattering transform, which was recently proposed as a mathematical model for CNNs that uses wavelet theory. 2.1 Stochastic processes Stochastic processes are used to model signals from real life that exhibit randomness. A stochastic process {Xt : t ∈ Rd } is a collection of random variables indexed by t. When d > 1, the stochastic process is referred to as a random field. For d = 2, the process is often used to model grey level texture images. The mean and covariance function of Xt is defined as: µX (t) = EXt , σX (s, t) = Cov(Xs , Xt ) Two processes X = {Xt : t ∈ Rd } and Y = {Yt : t ∈ Rd } are said to be equal in the sense of distribution if for any index set {t1 , t2 , ..., tk } ⊂ Rd , d (Xt1 , Xt2 , ..., Xtk ) = (Yt1 , Yt2 , ..., Ytk ) A stochastic process is stationary if, d {Xt+h : t ∈ Rd } = {Xt : t ∈ Rd }, ∀h ∈ Rd . 3 Stationarity implies the mean function µX (t) = µ is a constant that does not depend on t and the covariance function σX (s, t) = σ(s − t) only depends on the difference between s and t. A stochastic process has stationary increments if, d {Xt+h − Xh : t ∈ Rd } = {Xt − X0 : t ∈ Rd }, ∀h ∈ Rd . Define the time average of X to be, Z T 1 µ bX,T = Xt dt, 2T −T and let σX,T 2 be the variance of µ bX,T . A stationary process is said to be mean-ergodic if 2 lim σX,T = 0, T →∞ which implies lim E|b µX,T − µ|2 = 0. T →∞ From a practical perspective, if xt is a realization of Xt over a large interval [−T, T ], then with high probability Z T 1 xt dt ≈ µ. 2T −T The ergodic property for d moments is defined in a similar way. If X is stationary and ergodic, the statistics of the process can be derived from a single, sufficiently long, random realization of this process. These properties become useful when we discuss the wavelet statistics of texture images in the next few chapters. 2.2 Convolutional Neural Networks CNNs are a special class of deep neural networks for dealing with data that has geometric structure. A deep neural network is usually constructed with an input layer, an output layer and many hidden layers. Each layer consists of a linear or affine operator and a nonlinear operator. Training a network usually corresponds to learning the weights of the linear operators. CNNs use convolution as the linear operator to detect local information and 4 use a pooling function for nonlinear downsampling. Figure 2.1 shows a specially designed and pre-trained CNN called VGG. We now give an explicit definition of the operators in 2D CNNs. Given a discrete input signal x and a filter K in R2 , the ordinary discrete convolution operator is defined as: XX (x ∗ K)(i, i0 ) = x(m, m0 )K(i − m, i0 − m0 ). m m0 In a CNN, an image is always regarded as a three dimensional data, where the third dimension is referred to as the channel dimension. 
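As a quick illustration of the discrete convolution defined above (before channels, activations, and pooling enter the picture), here is a minimal NumPy sketch; the toy image, the 3 × 3 test filter, and the "full" output-size convention are placeholder choices rather than anything taken from the text.

```python
import numpy as np

def conv2d_full(x, K):
    """Ordinary discrete convolution (x * K)(i, i') = sum_{m, m'} x(m, m') K(i - m, i' - m')."""
    H, W = x.shape
    h, w = K.shape
    out = np.zeros((H + h - 1, W + w - 1))
    for m in range(H):
        for mp in range(W):
            # each input sample x(m, m') contributes a shifted, scaled copy of the filter K
            out[m:m + h, mp:mp + w] += x[m, mp] * K
    return out

# toy example: a 6 x 6 image with a vertical edge, filtered by a simple edge detector
x = np.zeros((6, 6)); x[:, 3:] = 1.0
K = np.array([[1.0, 0.0, -1.0]] * 3)
print(conv2d_full(x, K).shape)   # (8, 8)
```

In a full convolution layer, such single-channel responses are computed for every input–output channel pair and summed over the input channels, as described next.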
For example, a grey scale image is regarded of size N × N × 1 and an RGB image is regarded as N × N × 3. For a convolution layer in a CNN, suppose we have the input x with n1 channels denoted as (xi )1≤i≤n1 and we would like the output to have n2 channels; we need n1 × n2 filters which are denoted as (Ki,j )1≤i≤n1 ,1≤j≤n2 . The convolution layer specifically computes: Xn1 C(x)j = xi ∗ Ki,j , 1 ≤ j ≤ n2 i=1 for each output channel j. In practice, the convolution filters to be learned are usually of small size, such as 3 × 3 or 5 × 5. Good things about it include sparse interactions (each pixel at the current layer is computed from localized output from the last layer), parameter sharing (pixels are transformed with the same filters), and equivariance to translation (this operation is commutative with translation). It is also efficient to detect the singularities of a signal, such as the edges in an image. This linear operation C(x) is usually followed by a nonlinear activation function in a CNN. For example, a ReLU (Rectified linear unit) function g(x) = max{0, x} is used in the VGG network. Other CNN networks also use logistic sigmoid and hyperbolic tangent as the activation function. For classification tasks, a softmax function exp(xi0 ) softmax(x)i0 = P i exp(xi ) is used at the output layer to predict the probability of a data that belongs to class i0 . 5 A typical CNN layer may also have a pooling function. A max pooling returns the maximum output of the corresponding rectangular neighborhood. The average pooling is another popular pooling function. Pooling functions are used to represent the data with localized summary statistics which are approximately invariant to local translations. Pooling also increases the receptive fields, i.e., each output pixel contains information of more pixels from the last layer. These properties are important for image processing. Since each operator is defined explicitly, the gradient of the loss function over the weights can be computed explicitly, which makes the training efficient with gradient descent. To compute the gradient over weights and update the weights in gradient descent is known as the back-propagation, where we need the computation of the gradient to flow backwards in the network through the chain rule. CNNs have shown great efficiency and accuracy in practical applications, such as natural language processing, image classification and video recognition. Nowadays, people also use it to reconstruct and generate high dimensional data from an unknown distribution. However, there is not too much theory on why CNNs work so well. We try to interpret CNNs through scattering transforms [10] in Section 2.6. 2.3 Generative Adversarial Networks A very widely used model to learn the distribution of high-dimensional data is the Generative Adversarial Network (GAN) [9] from deep learning. Figure 2.2 shows the structure of the GAN which consists of two models: a generator and a discriminator. The two models are trained in an adversarial way in which the generator generates fake data that looks like real data and the discriminator tries to distinguish between the real data and generated data. If trained successfully, the generator learns the real data distribution. Suppose we have a discriminator D and a generator G. The generator G takes in a random vector z that comes from a prior random distribution pz and outputs data G(z). 
The discriminator D takes in real data x that comes from the real distribution pdata and fake 6 Figure 2.1 A special pre-trained very deep CNN: VGG[1] Figure 2.2 Structure of the GAN. The generator takes in a random vector z with lower dimensions and tries to generate a new signal G(z) that comes from the same distribution as the given data x. The discriminator takes in a signal G(z) or x and tries to output a probability p of the signal coming from the real distribution. 7 data G(z) that comes from the generated distribution pg and outputs D(x) and D(G(z)), which represents the probability the data comes from the real distribution. The loss function is defined through the binary cross entropy: V (D, G) = Ex∼pdata (x) [log(D(x))] + Ez∼pz (z) [log(1 − D(G(z)))] The goal is to find the D and G that solve the following minimax problem: min max V (D, G) G D The problem is solved by using gradient descent to update the parameters in the deep neural networks D and G. The authors prove that the global optimal solution is achieved when pg = pdata and D(x) = 21 . The GAN model does not learn the data distribution explicitly but generates signals that come from the real data distribution, that is, the generated data is not a real data but is similar to the real data. GANs have been widely used to generate natural images. In [11], the authors train a GAN in a progressive manner by gradually adding more layers in the training process and generates high resolution images. In [12], the model takes in sentences and generates images that fit the description. GANs can also be used to repair images with missing parts [13] or generate images with certain artistic style [14]. GANs are useful for many applications. 2.4 Texture Synthesis using VGG and Style Transfer In the previous section, we introduced GANs for generating new images. Another interesting topic from image processing is style transfer, which aims to transfer the style of one image to another. This is motivated by [3], where the authors generate a new texture image by matching the multi-layer statistics of image features computed through VGG [1]. They show the statistics from multiple deep layers is necessary to generate a new realization of the texture. And the loss between such statistics is referred to as the style loss in later research work. 8 Figure 2.3 Process of using VGG features to synthesize texture images [1] VGG is a very deep CNN with 16-19 layers. Figure 2.1 shows the architecture and the definition for each layer is described in Section 2.2 when we reviewed the CNN. This network has been pre-trained and showed great performance on image classification. Figure 2.3 shows the process of texture synthesis in [3] and the details will be described in the following paragraph. Suppose we have an input texture x0 . The output at layer l is a matrix H l (x0 ) ∈ RNl ×Ml where Nl is the number of output channels in layer l and Ml is the size of output from each channel. The authors use the gram matrix between channels as the summary statistics: X Gl (x0 )ij = H l (x0 )ik H l (x0 )jk k Features {G1 (x0 ), G2 (x0 ), ... GL (x0 )} from multiple layers are used to represent the texture 9 image x0 . Define the loss function between two images x0 and x as: X loss(x0 , x) = ||Gl (x0 ) − Gl (x)||22 l which is the summation of the l2 distance between gram matrices from each layer l. The goal is to find x∗ that minimizes the loss: x∗ = arg min loss(x0 , x). 
x By solving the problem above using gradient descent, they get back a new image which has the same statistics but is different from the input. This can be thought of as generating a new realization of one texture class from a given realization. Figure 2.4 shows the synthesis results. The good performance of matching the gram matrix from multiple layers explains that multi-layer statistics from the VGG network might capture general features of a class of texture. This work motivates the work on style transfer and has shown great performance [15]. It also inspired our work on texture synthesis in Chapter 5. However, it has poor performance on reproducing natural images which have more localized geometric structure and are not stationary. 2.5 Wavelets Wavelet theory is an important field developed for signal processing [4]. The Fourier trans- form of a function x ∈ L2 (Rd ) is defined as, Z x b(ω) = x(u)e−iu·ω du, ∀ω ∈ Rd Rd and has the property x[∗y =x b yb . (2.1) One can recover the function x through the inverse Fourier transform: Z 1 x(u) = b(ω)eiu·ω dω, ∀ω ∈ Rd . x (2π)d Rd 10 Figure 2.4 Synthesis results with different layers. The last row shows the original images. The first row shows synthesized image by matching statistics from only "conv1-1" layer and the second shows that matching statistics from "conv1-1" and "pool1" layers and so on. With only lower layer statistics, the synthesis lose information of the texture. Adding statistics from deeper layers improves the performance, which proves higher layer statistics are necessary to capture texture information. The last column shows result of a natural image, which has more localized structure and is not stationary. 11 A wavelet ψ ∈ L2 (Rd ) is defined as a function that is localized in space and frequency and has zero average [4]: Z ψ(u)du = 0 Rd It is normalized so that kψk = 1. We focus on the cases d = 1 and d = 2 in the following. For d = 1, a family of wavelets is obtained by dilating ψ: ψj (u) = 2−j ψ(2−j u), j ∈ Z. The 1D wavelet transform of a signal x ∈ L2 (R) is defined as: W x = (x ∗ ψj (u))u∈R,j∈Z . For d = 2, a family of wavelets is obtained by dilations and rotations:    cosθ −sinθ  ψj,θ (u) = 2−2j ψ(2−j Rθ−1 u), Rθ =   sinθ cosθ A 2D wavelet transform is defined as W x = (x ∗ ψj,θ (u))u∈R2 ,j∈Z,θ∈Θ⊂[0,π] Usaully Θ = { kπ K : k = 0, . . . , K − 1}. Figure 2.5 shows a texture image and its wavelet coefficients computed through Morlet wavelets. Figure 2.6 shows a family of 2D Morlet wavelets in space and frequency. A 2D mother Morlet wavelet is defined as: 1  1 u2 u2  ψ(u1 , u2 ) = exp − ( 12 + 22 ) · exp(2πiξu1 − C), (u1 , u2 ) ∈ R2 2πσ1 σ2 2 σ1 σ2 where σ1 and σ2 determine the scale of the mother wavelet at two directions and (ξ, 0) determines the center frequency of the wavelet. The constant C ensures ψ(u1 , u2 )du1 du2 = R 0. A family of Morlet wavelets is obtained by dilation and rotation described above. As can be seen, the Fourier transform of Morlet wavelets are essentially supported over different bounded regions, which means wavelets are localized in the frequency field. With Equation 2.1, the wavelet transform in frequency can be written as: W dx = (b x ψ d j,θ )j,θ 12 Figure 2.5 Left: A texture image of bricks with edges at direction 0, π/2. Right: The wavelets of the texture computed through Morlet wavelets, shown in Figure 2.6. It shows the texture has large response at θ = 0 and θ = π/2, tiny response at other angles, which match the directions of the edges in the texture. 
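In practice this is how the coefficients are computed: multiply x̂ by each ψ̂_{j,θ} and apply the inverse FFT. The following is a minimal NumPy sketch of such an FFT-based 2D Morlet wavelet transform; the grid size, the parameters (σ₁, σ₂, ξ), and the scale and angle grids are illustrative, and the zero-average requirement is enforced by subtracting a multiple of the Gaussian envelope, which plays the role of the constant C above.

```python
import numpy as np

def morlet(N, sigma1=0.8, sigma2=0.8, xi=0.75 * np.pi, j=0, theta=0.0):
    """Dilated, rotated 2D Morlet wavelet psi_{j,theta} sampled on an N x N grid."""
    u1, u2 = np.meshgrid(np.arange(N) - N // 2, np.arange(N) - N // 2, indexing="ij")
    # rotate by -theta and dilate by 2^j: psi_{j,theta}(u) = 2^{-2j} psi(2^{-j} R_theta^{-1} u)
    r1 = (np.cos(theta) * u1 + np.sin(theta) * u2) / 2 ** j
    r2 = (-np.sin(theta) * u1 + np.cos(theta) * u2) / 2 ** j
    env = np.exp(-0.5 * (r1 ** 2 / sigma1 ** 2 + r2 ** 2 / sigma2 ** 2))
    wave = env * np.exp(1j * xi * r1)
    C = wave.sum() / env.sum()                    # makes the wavelet have zero average
    return (wave - C * env) / (2 ** (2 * j) * 2 * np.pi * sigma1 * sigma2)

def wavelet_transform(x, scales, angles):
    """W x = (x * psi_{j,theta})_{j,theta}, computed as ifft2(fft2(x) * fft2(psi_{j,theta}))."""
    x_hat = np.fft.fft2(x)
    coeffs = {}
    for j in scales:
        for theta in angles:
            psi_hat = np.fft.fft2(np.fft.ifftshift(morlet(x.shape[0], j=j, theta=theta)))
            coeffs[(j, theta)] = np.fft.ifft2(x_hat * psi_hat)
    return coeffs

x = np.random.rand(64, 64)
W = wavelet_transform(x, scales=[0, 1, 2], angles=[k * np.pi / 4 for k in range(4)])
print(len(W), W[(0, 0.0)].shape)   # 12 wavelets, one 64 x 64 response each
```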
Also, as the wavelet scale goes larger, the response gets smoother. Figure 2.6 Morlet wavelets (real part) with 8 different scales and 12 different rotations in space and frequency. Left: 2D Morlet wavelet family (ψj,θ )j,θ in space. Middle: 2D Morlet wavelet family in frequency, which is the Fourier transform of 2D Morlet wavelet family (ψj,θ )j,θ . Right: The summation |φJ (ω)| + λ∈Λ |ψ 2 cλ (ω)|2 which is almost a constant over b c P all frequencies, making an approximate Littlewood-Paley frame. This figure shows this group of wavelets plus a low pass satisfy the Littlewood-Paley condition on a half plane approximately, except for the boundary. By multiplying x b with (ψ dj,θ )j,θ , the wavelet transform can capture localized information of the signal x in the frequency field. Let λ = (j, θ) ∈ ΛJ = {(j, θ) : j ≤ J, θ ∈ Θ ⊂ [0, 2π]} where ΛJ is the index set for wavelets and use ψλ to denote ψj,θ . Let φ be a low pass filter whose Fourier transform is concentrated at low frequencies. Set φJ (u) = 2−2J φ(2−J u). The filters {φJ , (ψλ )λ } are said 13 to be a Littlewood-Paley tight frame if: X cJ (ω)|2 + |φ cλ (ω)|2 = 1, ∀ω ∈ R2 |ψ λ∈ΛJ Define AJ x = x ∗ φJ and W [λ]x = x ∗ ψλ . The wavelet transform can be written as WJ x = {AJ x, (W [λ]x)λ∈ΛJ } and its norm is: X kWJ xk2 = kAJ xk2 + kW [λ]xk2 λ∈ΛJ If {φJ , (ψλ )λ } is a Littlewood-Paley tight frame, then kW xk2 = kxk2 . This shows the wavelet transform preserves the norm of a signal x and the property can be proved using Plancherel formula. We can define the dual wavelets as φeJ (ω) := φbJ (ω), ψeλ (ω) := ψbλ (ω). c c A signal x can be reconstructed from its wavelet transforms using the dual filters and the formula X x = x ∗ φJ ∗ φeJ + x ∗ ψλ ∗ ψeλ . (2.2) λ∈ΛJ 2D Morlet wavelets are used in image processing since they have a strong response to edges. Texture images always have edges distributed in a repeated pattern. Therefore the statistics (especially mean and variance) of wavelet coefficients, which summarize global edge information, may be used to analyze texture images. 2.6 Scattering transform Convolutional neural networks work very well for signal processing, especially image process- ing. However, why they work is not well understood and theory still remains to be developed. In [10], the author comes up with a scattering transform, which can be viewed as a mathe- matical model for CNNs. It has been used to achieve near state of the art results in the fields of audio signal processing [16, 17, 18, 19, 20], computer vision [21, 22, 23, 24, 25, 26], and quantum chemistry [27, 28, 29, 30], amongst others. The scattering transform is provably 14 invariant to local (or global) translations of the input signal and is also Lipschitz stable to the actions of diffeomorphisms on the input. These properties are motivated by the fact that signals with translation or small deformation usually contain similar information (e.g., they are in the same class). On the other hand, high frequency information is needed to distinguish one signal class from another, which motivated the use of nonlinear activation functions and a deep architecture. This non-paramatric model helps us to understand CNNs in a mathematical way. Assume we have a real valued function x ∈ L2 (Rd ). 
Let Lc : L2 (Rd ) → L2 (Rd ), for c ∈ Rd , be a translation operator: Lc x(u) = x(u − c) A map Φ on L2 (Rd ) is translation invariant if: Φ(Lc x) = Φ(x), ∀x ∈ L2 (Rd ), c ∈ Rd Let τ : Rd → Rd be a displacement function and Lτ : L2 (Rd ) → L2 (Rd ) be a diffeomorphism operator: Lτ x(u) = x(u − τ (u)) Let ∇τ (u) and Hτ (u) be the Jacobian and Hessian of τ . Define: kτ k∞ = sup |τ (u)|, k∇τ k∞ = sup k∇τ (u)k, kHτ k∞ = sup kHτ (u)k u∈Rd u∈Rd u∈Rd The map Φ is Lipschitz continuous to diffeomorphisms if there exists a constant C such that kΦ(x) − Φ(Lτ x)k ≤ Ckxk2 (kτ k∞ + k∇τ k∞ + kHτ k∞ ), ∀x ∈ L2 (Rd ) The modulus of the Fourier transform Φ(x) = |b x| is proved to be translation invariant but not stable to diffeomorphisms. So we turn to wavelet transforms. In this paragraph, we will explain the intuition of how the scattering transform is devel- oped to satisfy the translation invariance property while preserving high frequency informa- tion. The paper [10] proves it is also stable to diffeomorphism but we will omit the details. 15 Recall we defined λ = (j, θ) ∈ ΛJ , as well as the operators AJ x = x ∗ φJ and W [λ]x = x ∗ ψλ in section 2.2. Since AJ x and W [λ]x are convolution operators, they commute with transla- tions: AJ (Lc x) = Lc (AJ x), W [λ](Lc x) = Lc (W [λ]x) Then translation invariance can be obtained by integrating x ∗ φJ or x ∗ ψλ . Also note that the operator A is locally translation invariant in the sense that kA(Lc x) − A(x)k ≤ C·|c|·kxk 2J since φJ is a low pass filter. This is essential because sometimes we want local translation invariance instead of full translation invariance. Then instead of x ∗ ψλ , we could try to use R x ∗ ψλ ∗ φJ to obtain local translation invariance. However, since ψλ is a wavelet with zero average, the integral x ∗ ψλ = 0. Also since the support of ψ cλ and φ cJ have small overlap, R x ∗ ψλ ∗ φJ ≈ 0. Therefore we need a nonlinear operator. The scattering transform uses the modulus since it is non-expansive. Then nontrivial global and local translation invariance are obtained by Z |x ∗ ψλ | or |x ∗ ψλ | ∗ φJ . Note that the modulus operator nonlinearly projects x∗ψλ from the high frequency field to the low frequency field. Also note that |x ∗ ψλ | = |x\ ∗ ψλ |(0) and |x ∗\ cJ and R ψλ | ∗ φJ = |x\ ∗ ψλ |φ cJ is supported in low frequency field. However, the support of φ φ cJ is smaller than the support of |x\∗ ψλ |. Therefore, part of the information in |x\ ∗ ψλ | gets lost in the above representation. To recover it, we apply another wavelet transform over |x ∗ ψλ |. The information in |x ∗ ψλ | is preserved by {|x ∗ ψλ | ∗ φJ , (|x ∗ ψλ | ∗ ψλ0 )λ0 }. The term |x ∗ ψλ | ∗ φJ is invariant to translation while |x ∗ ψλ | ∗ ψλ0 recovers the lost high frequency information. By applying another nonlinear operator and a low pass to |x ∗ ψλ | ∗ ψλ0 , it also gets translation invariance. We give the standard definition of scattering transform in the next paragraph. Let p be a path sequence p = {λ1 , λ2 , ..., λm } ∈ Λm J . Define U [λ]x = |x ∗ ψλ | for x ∈ L2 (Rd ). A scattering propagator is defined as the path-ordered product: U [p] = U [λm ]...U [λ2 ]U [λ1 ] 16 Then the windowed scattering transform of x ∈ L2 (Rd ) for a path p is defined as: SJ [p]x(u) = U [p]x ∗ φJ (u) and is invariant to local translation. The scattering transform is defined as: Z Z 1 S̄[p]x = U [p]x(u)du with µp = U [p]δ(u)du (2.3) µp and is invariant to global translation. 
If the infinite set of finite paths p is defined as PJ , we can denote: SJ [PJ ]x = {SJ [p]x}p∈PJ or S̄[PJ ]x = {S̄[p]x}p∈PJ The multi-layer structure preserves high frequency information. The author also proves the scattering transform over PJ is stable to diffeomorphisms, i.e., for all τ ∈ C2 (Rd ) with k∇τ k∞ ≤ 12 , there exists a constant C such that kSJ [PJ ]x − SJ [PJ ](Lτ x)k ≤ CkU [PJ ]xk1 (kτ k∞ + k∇τ k∞ + kHτ k∞ ), ∀x ∈ L2 (Rd ) where kU [PJ ]xk1 = J ]xk. The key point to the proof is the stability of the P+∞ m=0 kU [Λm wavelet transform which we omit here. The scattering transform in L2 (Rd ) can be extended to stationary processes. If X(u) is a stationary process, for all p = {λ1 , λ2 , ..., λm } ∈ Λm J , U [p]X(u) is also stationary, and its expected value does not depend on u. The expected scattering transform is defined as: S̄[p]X = E(U [p]X) = E(||X ∗ ψλ1 ∗ · · · | ∗ ψλm |). This definition replaces the normalized integral of the scattering transform in Equation 2.3 by an expected value. If X is also ergodic, the expected scattering transform is estimated by computing the scattering transform of a realization x0 (u) of X(u): Z 1 S̄[p]X ≈ U [p]x0 (u)du µp 17 2.7 Signal recovery through statistical features This section introduces the general types of problems we will consider in this thesis. Given a deterministic signal x ∈ L2 (Rd ) or a stochastic process X(u) : u ∈ Rd , our goal is to find a representation Φ(x) or Φ(X) that is translation invariant, i.e., Φ(Lc x) = Φ(x), or Φ(Lc X) = Φ(X), ∀c ∈ Rd where Lc x(u) = x(u − c) or Lc X(u) = X(u − c). Typically we are interested in when Φ is a CNN representation or a scattering representation. For a deterministic signal x, we explore under what conditions can we recover x from Φ up to translation, i.e., if Φ(x) = Φ(x0 ), then x0 = Lc x. This aims to recover the exact signal up to translation. Chapter 3 explores this issue for 1D deterministic sparse signals. Similarly, Chapter 6 explores 2D deterministic line texture images and frame-like texture images. For a stochastic process X, we think of Φ as statistics, i.e., Φ(X) = E[Φ̃(X)] for some function Φ̃. When X is ergodic, Φ(X) is usually estimated from a sample x ∼ X: Z 1 Φ(X) ≈ Φ̃(x) (2T )d [−T,T ]d Given a sample x of X, we explore under what conditions can we generate another sample x0 of X from Φ̃, i.e., Z Z 1 1 if 0 Φ̃(x ) = Φ̃(x), then x0 ∼ X. (2T )d [−T,T ]d (2T )d [−T,T ]d Chapter 4 explores the problem for X being a Poisson point process and we try to generate new samples by learning the distribution of Φ(X). Chapter 5 suppose X to be a class of texture images and we try to generate new samples x0 ∼ X given an estimate of Φ(X) from one sample texture image x ∼ X. 18 CHAPTER 3 CHARACTERIZING SPARSE SIGNALS THROUGH A HYBRID SCATTERING TRANSFORM 3.1 Introduction As reviewed in Section 2.6, the scattering transform is a mathematical model for CNNs that is invariant to translation, stable to diffeomorphism and captures high frequency informa- tion. These properties are essential for signal analysis as signals with translations and small deformation are usually from the same class and high frequency information is important to distinguish signals from one class to another. On the other hand, filters learned in the early layers of CNNs typically resemble wavelets. But it usually requires a large dataset to learn a CNN and it remains unclear why such models work well for different types of signals, especially natural images. 
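Chapters 3 and 5 both instantiate the recovery template of Section 2.7: estimate translation-invariant statistics from a given signal, then search for a new signal whose statistics match. A generic, minimal PyTorch sketch of such a matching loop is shown below; it is not the exact algorithm of any chapter, and the statistics Φ̃ used here (mean moduli of responses to a few random zero-mean filters standing in for wavelets), the signal length, and the optimizer settings are all placeholders.

```python
import torch
import torch.nn.functional as F

def phi(x, filters):
    """Translation-invariant statistics: spatial mean of |x * psi| for each filter psi."""
    resp = F.conv1d(x[None, None, :], filters[:, None, :], padding="same")
    return resp.abs().mean(dim=-1).flatten()

torch.manual_seed(0)
filters = torch.randn(4, 33)                             # hypothetical filter bank
filters = filters - filters.mean(dim=1, keepdim=True)    # zero average, like wavelets

x_target = torch.zeros(256)                              # "given" sparse signal
x_target[[40, 90, 200]] = torch.tensor([1.0, -0.5, 2.0])
target_stats = phi(x_target, filters)

x_new = torch.randn(256, requires_grad=True)             # random initialization
opt = torch.optim.Adam([x_new], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    loss = ((phi(x_new, filters) - target_stats) ** 2).sum()
    loss.backward()
    opt.step()
# x_new now approximately matches the statistics of x_target; whether it is close to a
# translate or reflection of x_target depends on how much information phi retains,
# which is exactly the completeness question studied in this chapter.
```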
The scattering transform and CNN both show great performance in image classification, proving their ability to learn essential representations and distinguish different types of signals. Another interesting task from machine learning is to understand how different these representations are with dissimilar signals. This is referred to as the completeness of a model in machine learning. A more complicated task from this topic is to find representations that can be used to recover a signal. Intuitively this requires to learn all important information contained in a signal. Motivated by the structure of the scattering transform and CNNs, we propose a two-layer hybrid scattering model to capture signals with isolated singularities in this chapter. In the first layer, we apply a wavelet transform to sparsify the signal, while in the second layer, we use a Gabor type filter to leverage this sparsity. More specifically, • We propose a new model to capture the singularities of the input signal. • We prove the Gabor measurements used in the second layer determine the locations 19 and heights of a sparse signal. • We provide an algorithm that successfully synthesizes 1D sparse signals using these measurements up to translation and reflection. In Section 3.2 we provide our model in detail, while in Section 3.3 we show our main theorems and proofs. Section 3.4 describes our algorithm for sparse signal synthesis and Section 3.5 shows synthesis results. Section 3.6 summarizes our conclusions. 3.2 Model Let y(t) with t ∈ R be a piecewise polynomial whose knots {ui }ki=1 , ui < ui+1 , are located on the grid hZ = {hn : n ∈ Z} for some h > 0:  X k  pi (t) t ∈ [ui , ui+1 ]  y(t) = yi (t), where yi (t) = i=0  0  elsewhere where {pi }ki=0 are a group of polynomials satisfying pi (ui+1 ) = pi+1 (ui+1 ), ∀i ∈ {0, 1, · · · k−1}. We also assume that each of the piecewise polynomial components yi (t) has degree mi at most m, i.e., 0 ≤ mi ≤ m. Let ψ be a mother wavelet with supp(ψ) ⊆ [−1, 1] that has m + 1 vanishing moments, i.e., Z +∞ tk ψ(t)dt = 0, ∀0 ≤ k ≤ m. −∞ Since ψ has m + 1 vanishing moments, one can show pi ∗ ψ(t) = 0. Let ψ` (t) be the dilated wavelet:   1 t ψ` (t) = ` ψ . 2 2` We have supp(ψ` ) ⊆ [−2` , 2` ]. Figure 3.1 gives an example of a wavelet with 4 vanishing moments. Therefore we have y ∗ ψ` (t) = 0 for t ∈ / ki=1 [ui − 2` , ui + 2` ] as {ui }i are the S singularities of y. Equivalently supp(ψ` ∗ y) is contained in ki=1 [ui − 2` , ui + 2` ]. To further S promote sparsity, we next apply a max-pooling operator:  z(t) if z(t) = maxt0 ∈[ti −2` ,ti +2` ]∩hZ z(t0 )   M P` z(t) = . otherwise  0  20 Figure 3.1 A wavelet ψ of 4 vanishing moments. (a) piecewise polynomial y(t) (b) Convolution against wavelet (c) Sparse Signal from Max Pool- with highest order of 3 ψ` ing Figure 3.2 Wavelets sparsify piecewise polynomials on the interval [0, 1024]. Figure 3.2 shows the process of sparsifying a piecewise polynomial into a sparse signal with wavelet ψ` . As summarized in the following theorem, this yields a linear combination of Dirac delta functions. Theorem 1. Assume that 2`+1 ≤ min1≤i≤k−1 |ui − ui+1 |. Then, X k M P` (|ψ` ∗ y|)(t) = aj δvj (t). j=1 for some a1 , . . . , ak > 0, vj ∈ [uj − 2` , uj + 2` ], 1 ≤ j ≤ k. In our second layer, rather than use another wavelet, we use a Gabor filter   t gs,ξ (t) = w eiξt , (3.1) s where the parameters s and ξ determine the scale and central frequency of the filter and the window function w is supported on an interval of unit length. 
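For concreteness, here is one way to discretize this filter with the indicator window w = 1_{[0,1]} (panel (a) of Figure 3.3; the Gaussian window of panel (b) is the obvious alternative); the grid length and the particular values of s and ξ are arbitrary, and signals are treated as periodic, as in the numerical experiments later in this chapter.

```python
import numpy as np

def gabor(n, s, xi):
    """Discretization of g_{s,xi}(t) = w(t/s) exp(i xi t) with w = 1_{[0,1]},
    so the filter is supported on [0, s)."""
    t = np.arange(n)
    return np.where(t < s, np.exp(1j * xi * t), 0.0)

def second_layer_response(x, s, xi):
    """Circular convolution g_{s,xi} * x, computed with the FFT."""
    g = gabor(len(x), s, xi)
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(g))

# a sparse first-layer output: a few Dirac spikes on a grid of length 128
x = np.zeros(128)
x[[17, 42, 90]] = [1.0, 2.0, 0.5]
resp = second_layer_response(x, s=16, xi=np.pi / 8)
print(resp.shape)   # (128,)
```

The translation-invariant measurements introduced below are simply Lᵖ norms of such responses, taken over a collection of scales s and frequencies ξ of the Gabor filter g_{s,ξ}.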
It differs from a wavelet in that the scale and frequency are separated in a gabor filter, giving more flexibility to adjust 21 (a) Indicator function window (b) Gaussian window Figure 3.3 Gabor filters used in the second layer (real parts) these parameters independently. Note that with an appropriately chosen window function w, Equation 3.1 includes dyadic wavelet families in the case that one selects s = 2` and |ξ| = C/s . However, it also includes many other families of filters, including Gabor filters used in the windowed Fourier transform. Next, we take the Lp norm for some integer p ≥ 1. As a result, we obtain translation invariant hybrid scattering coefficients {kgs,ξ ∗ M P` (|ψ` ∗ y|)kp }s,ξ . By design, these measurements are invariant to translations, reflections, and global sign changes. We aim to investigate the ability of our measurements to characterize y up to these natural ambiguities. The wavelet-modulus is known to be a powerful signal descriptor [31]. Therefore, in light of Theorem 1, we shall analyze the ability of the measurements {kgs,ξ ∗ xkp }s,ξ (3.2) to characterize sparse signals of the form X k x(t) = aj δvj (t). (3.3) j=1 Furthermore, to supplement our theory, we show that the measurements (3.2) can be used to reconstruct a sparse signal of the form (3.3) up to translations and reflections (see Figure 3.2). 22 3.3 Theory and proofs Let X k x(t) = aj δvj (t), v1 < v2 < · · · < vk . (3.4) j=1 We first consider the support set {vj }kj=1 , and let D(x) = {∆i,j = vj − vi : i < j} denote the difference set as the pairwise distances of the Dirac locations {vj }kj=1 . Let fx,ξ (s) = kgs,ξ ∗ xkpp be the gabor measurements. The following theorem shows the singularities of the measure- ments as a function of the scale s are contained in the pairwise distance set D(x). Theorem 2. Let p ≥ 1 be an integer, and assume w(t) = 1[0,1] (t). For i ≤ j, let j X βi,j (ξ) = a` eiξ∆i,` (3.5) `=i Then, for every fixed ξ ∈ R, the function fx,ξ (s) is piecewise linear, and ∂s2 fx,ξ (s) is a grid-free sparse signal whose support is contained in D(x). Specifically,   X X ∂s2 fx,ξ (s) =  ci,j (ξ) δd , d∈D(x) ∆i,j =d where ci,i+1 (ξ) = |βi,i+1 (ξ)|p − |βi+1,i+1 (ξ)|p − |βi,i (ξ)|p (3.6) and ci,j (ξ) = |βi,j (ξ)|p + |βi+1,j−1 (ξ)|p − |βi+1,j (ξ)|p − |βi,j−1 (ξ)|p for j ≥ i + 2. (3.7) Proof. We first note that Xk |(gs,ξ ∗ x)(t)| = ai gs,ξ (t − vi ) i=1 Xk = ai eiξ(t−vi ) 1[vi ,vi +s] (t) i=1 Xk = ai e−iξvi 1[vi ,vi +s] (t) . i=1 23 For every subset I ⊆ {1, . . . , k}, let RI (s) = {t : t ∈ [vi , vi + s] for all i ∈ I, t ∈ / [vi , vi + s] for all i ∈ / I}, i.e. let RI (s) be the set of t for which ai e−iξvi 1[vi ,vi +s] (t) is nonzero if and only if i ∈ I. Then, since w(t) = 1[0,1] (t) it is clear that for t ∈ RI , X |(gs,ξ ∗ x)(t)| = ai e−iξvi := yI (ξ). i∈I Therefore, X fx,ξ (s) = k(gs,ξ ∗ x)(t)kpp = |yI (ξ)|p |RI (s)|. (3.8) I⊆{1,...k} We will show that for all I ⊆ {1, . . . , k}, |RI (s)| is a piecewise linear function as a function whose Dirac locations are contained in D(x). First, we note that RI (s) = ∅ unless I has the form {i, i + 1, . . . , j − 1, j} for some i ≤ j. Therefore, Xk X k fs (ξ) = |βi,j (ξ)|p |Ri,j (s)|, (3.9) i=1 j=i where, as in (3.5), βi,j (ξ) is given by j j X X |βi,j (ξ)| = a` e iξ∆i,` = a` eiξv` , `=i `=i and Ri,j := R{i,...,j} . Now, turning our attention to Ri,j (s), we observe by definition that a point t is in Ri,j (s) if and only if it satisfies the following three conditions: v` ≤t ≤ v` + s for all i ≤ ` ≤ j, t > vi−1 + s, and t < vj+1 . 
Therefore, letting (a ∧ b) := max{a, b} and (a ∨ b) := min{a, b}, we see Ri,j (s) = [xj , xi + s] ∩ [xi−1 + s, xj+1 ] = [xj ∨ (xi−1 + s), (xi + s) ∧ xj+1 ], (3.10) 24 and |Ri,j (s)| = ((xi + s) ∧ xj+1 ) − (xj ∨ (xi−1 + s)), if the above quantity is positive and |Ri,j (s)| = 0 otherwise. It follows from (3.10) that |Ri,j (s)| is a piecewise linear function, and that ∂s2 |Ri,j (s)| is given by ∂s2 |Ri,j(S) | = δ∆i,j (s) + δ∆i−1,j+1 (s) − δ∆i−1,j (s) − δ∆i,j+1 (s). (3.11) In order for this equation to be valid for all 1 ≤ i < j ≤ k, we identify x0 and xk+1 with −∞ and ∞, and therefore, δ∆0,j δ∆i−1,k+1 are interpreted as being the zero function since the domain of f is (0, ∞). Likewise δ∆i,i = δ0 is interpreted as the zero function in the above equation. Combining (3.11) with (3.9) implies that ∂s2 fx,ξ (s) is a sparse signal with support con- tained in D(x), and for d ∈ D(x), X ∂s2 fx,ξ (d) = ci,j (ξ) ∆i,j =d as desired. The following example shows that, in general, the support of ∂s2 fx,ξ (s) may be a proper subset of D(x). Example 1. If p = 2 and x(t) = δ1 (t) + δ2 (t) + δ3 (t) − δ4 (t), then 2 ∈ D(x), but ∂s2 fξ (2) = 0. Proof. For this choice of x, there are two pairs (i, j) such that ∆i,j = 2, namely (1, 3) and (2, 4). Therefore, by Theorem 2, ∂s2 fξ (2) = |β1,3 (ξ)|2 + |β2,2 (ξ)|2 − |β1,2 (ξ)|2 − |β2,3 (ξ)|2 +  |β2,4 (ξ)|2 + |β3,3 (ξ)|2 − |β2,3 (ξ)|2 − |β3,4 (ξ)|2 .  25 Inserting (a1 , a2 , a3 , a4 ) = (1, 1, 1, −1), ∆i,i+1 = 1, and ∆i,i+2 = 2 into (3.5) implies that ∂s2 fξ (2) = |1 + eiξ + e2iξ |2 + 1 − |1 + eiξ |2 − |1 + eiξ |2 +  |1 + eiξ − e2iξ |2 + 1 − |1 + eiξ |2 − |1 − eiξ |2  =|1 + eiξ + e2iξ |2 + |1 + eiξ − e2iξ |2 + 2 − 3|1 + eiξ |2 − |1 − eiξ |2 =0. The last inequality follows from repeatedly applying the the trigonometric identities sin2 (θ)+ cos2 (θ) = 1 and cos(θ) = cos(2θ) cos(θ) + sin(2θ) sin(θ). As illustrated by Example 1, the reason why ∂s2 fx,ξ (2) is equal to zero is because x is not collision free, which is defined as follows. A signal x is collision free if |vi − vj | = 6 |vi0 − vj 0 | unless (i, j) = (i0 , j 0 ). With this assumption, it is known [32] that the support set {vj }kj=1 is determined (up to reflection and translation) by D(x). Specifically in the above example, there are two different pairs of points (1, 3) and (2, 4) in the support set of x that are both distance two from each other and c1,3 (ξ) = −c2,4 (ξ). When x is collision free, this cancellation cannot occur, and as guaranteed by the following theorem the support set of ∂s2 fx,ξ (s) will exactly be D(x). Therefore, our measurements with sufficiently many scales and only one frequency completely characterize the support set of a sparse signal up to translation and reflection. Theorem 3. Assume that x is a collision-free k-sparse signal as in Equation (3.3) and that p ≥ 1 is an integer. Then, for almost every ξ, ∂s2 fx,ξ is a sparse signal whose support is exactly equal to D(x). In order to prove Theorem 3, we will introduce a class of functions which we call Gen- eralized Exponential Laurent Polynomials and state several lemmas about these functions. For the proof of the lemmas in this section, please see the appendix. 26 We call a function q(θ) a Generalized Exponential Laurent Polynomial if it can be written as X N q(θ) = αk eiγk θ , θ ∈ [0, 2π) (3.12) k=1 where N ≥ 1, αk , γk ∈ R, αk 6= 0, and γ1 < γ2 < . . . < γN . We let E be set of all such functions. For q ∈ E, we refer to γN as the degree of q and let E(d) refer to the set of all q ∈ E with degree d. 
Note that we do not assume that the γk ’s are nonnegative or even rational. Therefore, the degree of q may be negative. We let E0 (d) denote the set of q ∈ E(d) such that γ1 ≥ 0. Lemma 1. Let q, q 0 ∈ E, N N 0 X X q(θ) = αk e iγk θ and q (θ) = 0 βk eiηk θ . k=1 k=1 Then q = q 0 if and only if N = N 0 and for all k = 1, . . . , N , αk = βk and γk = ηk . Lemma 1 implies that if q ∈ E(d1 ) and q 0 ∈ E(d2 ) then qq 0 ∈ E(d1 + d2 ). (3.13) In particular, if q ∈ E0 (d) |q|2 = q q̄ ∈ E(d + 0) = E(d). (3.14) Furthermore, if d2 ≤ d1 then (q + q 0 ) ∈ E(d1 ), (3.15) except, of course, if d1 = d2 and the lead coefficients of q and q 0 are negatives of one another. Lemma 2. Let p be an integer. For i = 1, 2, 3, 4, let qi ∈ E0 (di ) assume that d1 > d2 , d3 , d4 . Then the set of points such that |q1 |p + |q2 |p = |q3 |p + |q4 |p (3.16) has measure zero. 27 Proof of Theorem 3. Under the assumption that x(t) is collision free, it suffices to show that for all i ≤ j, ci,j (ξ) 6= 0 for almost every ξ ∈ R, where as in (3.6) and (3.7), ci,i+1 (ξ) = |βi,i+1 (ξ)|p − |βi+1,i+1 (ξ)|p − |βi,i (ξ)|p , ci,j (ξ) = |βi,j (ξ)|p + |βi+1,j−1 (ξ)|p − |βi+1,j (ξ)|p − |βi,j−1 (ξ)|p for j ≥ i + 2, and j X βi,j (ξ) = al eiξ∆i,l . l=i Observe that βi,j is a generalized exponential Laurent polynomial of the form introduced in Equation 3.12, with degree ∆i,j . Therefore, when j ≥ i + 2, it follows from Lemma 2 that ci,j vanishes on a set of measure zero since ci,j (ξ) = 0 implies |βi,j (ξ)|p + |βi+1,j−1 (ξ)|p = |βi+1,j (ξ)|p + |βi,j−1 (ξ)|p . In the case where j = i + 1, we see that ci,i+1 (ξ) = |ai + ai+1 e−iξ∆i,i+1 |p − |ai |p − |ai+1 |p , For any ξ such that ci,i+1 (ξ) = 0, we see that ξ∆i,i+1 is a solution to 2 ai + ai+1 eiθ − (|ai |p + |ai+1 |p )2/p = 0. Therefore, ci,i+1 (ξ) vanishes on a set of measure zero because the left-hand side of the above equation is a trigonometric polynomial. Theorem 3 proves our measurements on a sparse signal x with one frequency and enough scales locate the positions of diracs, i.e., the support of x, up to translation and reflection. The strategy of selecting the scales is described in Section 3.4. The following theorems shows with enough frequencies, our measurements also charac- terize the heights {ai }i of x. Moreover, if the moments p is even, the number of frequencies needed can be reduced. The proofs of the following theorems are described in appendix. 28 Theorem 4. Let p ≥ 1 be an odd integer, let Xk x(t) = aj δvj (t) j=1 be a collision-free sparse signal, and let ~a = (a1 , . . . , ak ). Let ξ1 , . . . , ξL be L frequencies chosen independently at random from some probability distribution which is absolutely continuous with respect to the Lebesgue measure for some L ≥ 4p + 2. Then, with probability one, the following uniqueness result is true. Let Xk y(t) = bj δuj (t) (3.17) j=1 be any other sparse signal such that D(y) = D(x), and let ~b = (b1 , . . . , bk ). If ∂s2 fx,ξ` (d) = ∂s2 fy,ξ` (d) for all d ∈ D(x) and all 1 ≤ ` ≤ L and ki=1 |bi |p = ki=1 |ai |p , then ~b = ±~a and P P therefore y(t) is equivalent to ±x(t) up to translation and reflection. Theorem 5. Let p = 2p0 be an even integer, and let Xk x(t) = aj δvj (t) j=1 be a collision-free sparse signal, and let ~a = (a1 , . . . , ak ). Let ξ1 , . . . , ξL be L frequencies chosen independently at random from some probability distribution which is absolutely continuous with respect to the Lebesgue measure for some L ≥ p + 2. Then, with probability one, the following uniqueness result is true. 
Let Xk y(t) = bj δuj (t) (3.18) j=1 be any other sparse signal such that D(y) = D(x), and let ~b = (b1 , . . . , bk ). If ∂s2 fx,ξ` (d) = ∂s2 fx,ξ` (d) for all d ∈ D(x) and all 1 ≤ ` ≤ L and ki=1 |bi |p = ki=1 |ai |p , then ~b = ±~a and P P therefore y(t) is equivalent to ±x(t) up to translation and reflection. 3.4 Algorithm Theorems 3, 4 and 5 shows our measurements characterize the support of a sparse signal x and the heights of the diracs. In this section, we describe our algorithm that uses such 29 measurements to recover the signal (up to translation and reflection). Let x be a sparse signal that is collision free, X k x= aj δvj (t) j=1 Let fx,ξ (s) = kgs,ξ ∗ xkpp be the measurements. Let Λ = {(ξj , sj )}j be the set of frequencies and scales we choose to define the gabor filters. The strategy to choose different scales is discussed in the next paragraph. With two signals x and x0 , we define the loss between them as: X `(x, x0 ) = (fx,ξ (s) − fx0 ,ξ (s))2 (ξ,s)∈Λ Given a target signal x, our goal is to find a new signal such that x∗ = arg min `(x, x0 ) (3.19) x0 For simplicity, we choose p = 1 in our algorithm for synthesis. Now we present a strategy of choosing the right set of scales. Let D(x) = {∆ij = |vi −vj |} be the pairwise distance between the spikes of x. Since x is collision free, we know ∆ij 6= ∆i0 j 0 unless (i, j) = (i0 , j 0 ) except for ∆ii = 0, ∀i. Therefore, there are k(k−1)/2+1 unique elements in D(x). Without loss of generality, we suppose 0 = d0 < d1 < · · · < dk(k−1)/2 , di ∈ D(x), ∀i in the following context. We also assume x is periodic in numerical experiments. Therefore dk(k−1)/2 ≤ n 2 . As stated in Theorem 3, fx,ξ (s) is a piecewise linear function of s and the singularities locate exactly at D(x). When choosing the scales to compute the scattering statistics, we need to ensure there is at least one scale between di and di+1 , ∀i. Therefore, we compute the minimum pairwise distance of D(x), i.e., minDD = min |di − di+1 |, 0≤i 0 such that for any Borel set B, the family of sets B + T ei = {b + T ei : b ∈ B} satisfies d Y (B) = Y (B + T ei ), ∀1 ≤ i ≤ d, where {ei }i≤d is the standard orthonormal basis for Rd . In this case one can verify, by d approximating gγ with simple functions, that (gγ ∗ Y )(t + T ei ) = (gγ ∗ Y )(t), and therefore S[γ, p]Y (t + T ei ) = S[γ, p]Y (t), ∀ t ∈ Rd . Thus the limit in (4.3) exists, and Z 1 SY (γ, p) = d E[|gγ ∗ Y (t)|p ] dt . (4.4) T QT 36 Note that in the special case when the distribution of Y (B) depends only on the Lebesgue measure of B, then S[γ, p]Y (t) is independent of t and the above limit (4.3) exists with SY (γ, p) = S[γ, p]Y (t) for any t ∈ Rd . First-order scattering moments compute summary statistics of the measure Y based upon its responses against the filters gγ . Higher-order summary statistics can be obtained by computing first-order scattering moments for larger powers p, or by cascading lower-order modulus nonlinearities as in a CNN. This leads us to define second-order scattering moments by h 0 i S[γ, p, γ 0 , p0 ]Y (t) = E ||gγ ∗ Y |p ∗ gγ 0 (t)|p . First-order invariant scattering moments collapse additional information by aggregating the variations of the random measure Y , which removes information related to the intermittency of Y . Second-order invariant scattering moments augment first-order scattering moments by iterating on the cascade of linear filtering operations and nonlinear | · |p operators, thus recovering some of this lost information. 
They are defined (assuming the limit on the right exists) by Z 0 0 1 h p p0 i SY (γ, p, γ , p ) = lim E ||gγ ∗ Y | ∗ gγ 0 (t)| dt . R→∞ (2R)d |ti |≤R The collection of (invariant) scattering moments is a set of non-parametric statistical measurements of the random measure Y . In the following sections, we analyze these moments for arbitrary frequencies ξ and small scales s, thus allowing the filters gγ to serve as a model for the learned filters in CNNs. In particular, we will analyze the asymptotic behavior of the scattering moments as the scale parameter s decreases to zero. 4.3 First-Order Scattering Moments of Generalized Poisson Pro- cesses We consider the case where Y (dt) is an inhomogeneous, compound spatial Poisson point process. Such processes generalize ordinary Poisson point processes by incorporating variable charges (heights) at the points of the process and a non-uniform intensity for the locations 37 of the points. They thus provide a flexible family of point processes that can be used to model many different phenomena. In this section we consider first-order scattering moments of these generalized Poisson processes. In Sec. 4.3.1 we provide a review of such processes, and in Sec. 4.3.2 we show that first-order scattering moments capture a significant amount of statistical information related these processes, particularly when using very localized filters. 4.3.1 Inhomogeneous, Compound Spatial Poisson Point Processes Let λ(t) be a continuous function on Rd such that 0 < λmin := inf λ(t) ≤ kλk∞ < ∞ , (4.5) t and let N (dt) be an inhomogeneous Poisson point process with intensity function λ(t). That is, X∞ N (dt) = δtj (dt) j=1 is a random measure, concentrated on a countable set of points {tj }∞ j=1 , such that for all Borel sets B ⊂ Rd , the number of points of N in B, denoted N (B), is a Poisson random variable with parameter Z Λ(B) = λ(t) dt , (4.6) B i.e., (Λ(B))n P[N (B) = n] = e−Λ(B) , n! and N (B) is independent of N (B 0 ) for all sets B 0 that do not intersect B. Now let (Aj )∞ j=1 be a sequence of i.i.d. random variables independent of N , and let Y (dt) be the random signed measure that gives charge Aj to each point tj of N , i.e., X∞ Y (dt) = Aj δtj (dt) . (4.7) j=1 We refer to Y (dt) as an inhomogeneous, compound Poisson point process. For a Borel set B ⊂ Rd , Y (B) has a compound Poisson distribution and we will (in a slight abuse of 38 notation) write N (B) X Y (B) = Aj . j=1 In many of our proofs, it will be convenient to consider the random measure |Y |p (dt) defined formally by X∞ |Y |p (dt) := |Aj |p δtj (dt). j=1 For a further overview of these processes, and closely related marked point processes, we refer the reader to Section 6.4 of [42]. 4.3.2 First-order Scattering Asymptotics Computing the convolution of gγ with Y (dt) gives Z X∞ (gγ ∗ Y )(t) = gγ (t − u) Y (du) = Aj gγ (t − tj ) , Rd j=1 which can be interpreted as a waveform gγ emitting from each location tj . Invariant scattering moments aggregate the random interference patterns in |gγ ∗ Y |. The results below show that the expectation of these interferences, for small scale waveforms gγ , encode important statistical information related to the point process. For notational convenience, we let Z d  Λs (t) := Λ [t − s, t] = λ(u) du [t−s,t]d denote the expected number of points of N in the support of gγ (t − ·). 
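Before stating the asymptotic results, the following one-dimensional sketch (illustrative only, not the code used for the experiments of Sec. 4.6) estimates S[γ, p]Y(t_0) by Monte Carlo for an inhomogeneous, compound Poisson process with Gaussian charges, using the indicator window w = 1_[0,1] and thinning to simulate the points; the sinusoidal intensity mirrors the one used in the numerical illustrations below. As the scale shrinks, the normalized moment S[γ, p]Y(t_0)/s approaches λ(t_0)E[|A_1|^p], which is the behavior made precise in this section.

import numpy as np

rng = np.random.default_rng(0)

def scattering_moment(t0, s, xi, p, lam, lam_max, charge_sampler, n_mc=20000):
    """Rough Monte Carlo estimate of S[gamma,p]Y(t0) = E|g_{s,xi} * Y(t0)|^p; with
    the window w = 1_[0,1], only points in [t0 - s, t0] contribute."""
    vals = np.empty(n_mc)
    for k in range(n_mc):
        # simulate the process on [t0 - s, t0] by thinning a rate-lam_max proposal
        n = rng.poisson(lam_max * s)
        pts = rng.uniform(t0 - s, t0, n)
        pts = pts[rng.uniform(0, 1, n) < lam(pts) / lam_max]
        chg = charge_sampler(len(pts))
        vals[k] = np.abs(np.sum(chg * np.exp(1j * xi * (t0 - pts)))) ** p
    return vals.mean()

T = 1000.0                                   # period of the intensity
lam = lambda t: 0.01 * (1 + 0.5 * np.sin(2 * np.pi * t / T))
lam_max = 0.015
charges = lambda n: rng.normal(0, 1, n)      # A_j ~ N(0,1), so E|A_1| = sqrt(2/pi)
t0, xi, p = T / 4, 3.0, 1

for s in (8.0, 4.0, 2.0, 1.0, 0.5):
    m = scattering_moment(t0, s, xi, p, lam, lam_max, charges)
    print(f"s = {s:4.1f}   S[gamma,p]Y(t0)/s = {m / s:.5f}")
print("small-scale prediction lam(t0)*E|A_1| =", lam(t0) * np.sqrt(2 / np.pi))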
If λ(t) is a periodic function in each coordinate with period T , then Λs (t) = Λs (t + T ei ) for 1 ≤ i ≤ d and therefore, the invariant scattering coefficients of Y may be defined as in (4.4). Theorem 6. Let 1 ≤ p < ∞ and suppose that Y (dt) is an inhomogeneous, compound Poisson point process as defined above, where (Aj )∞ j=1 is an i.i.d. sequence of random variables, E[|A1 |p ] < ∞ and λ(t) is a continuous intensity function satisfying (4.5). Then for every 39 t ∈ Rd , every γ = (s, ξ) such that sd kλk∞ < 1, and for every m ≥ 1. m " k p# k X (Λs (t)) X S[γ, p]Y (t) = e−Λs (t) E Aj w(Vj )eisξ·Vj + (m, s, ξ, t) , (4.8) k=1 k! j=1 where the error term (m, s, ξ, t) satisfies kλk∞ |(m, s, ξ, t)| ≤ Cm,p kwkpp E[|A1 |p ]kλkm+1 ∞ s d(m+1) (4.9) λmin and V1 , V2 , . . . is an i.i.d. sequence of random variables, independent of the Aj , taking values in the unit cube Q1 and with density sd pV (v) = λ(t − vs) , v ∈ Q1 . Λs (t) The main idea of the proof of Theorem 6 is to condition on N [t − s, t]d , which is the  number of points in the support of gγ , and to use the fact that  m+1  P N [t − s, t]d > m = O sd kλk∞ ∀ sd kλk∞ < 1 .    , Theorem 6 shows that even at small scales the scattering moments S[γ, p]Y (t) depend upon higher-order information related to the distribution of the points, encapsulated by the term (Λs (t))k , regardless of the scattering moment p. However, the influence of the higher-order terms diminishes rapidly as the scale of the filter shrinks, which is indicated by the bound (4.9) on the error function. Theorem 6 also shows that pth scattering moments depend on the pth moments of the charges, (Aj )∞ j=1 . The next result uses Theorem 6 to examine the behavior of scattering moments for small filters in the asymptotic regime as the scale s → 0. Theorem 7. Let 1 ≤ p < ∞, and suppose that Y (dt) is an inhomogeneous, compound Poisson point process satisfying the same assumptions as in Theorem 6. Let γk = (sk , ξk ) be a sequence of scale and frequency pairs such that limk→∞ sk = 0. Then S[γk , p]Y (t) lim d = λ(t)E[|A1 |p ]kwkpp . (4.10) k→∞ sk Furthermore, if λ(t) is periodic with period T along each coordinate, then Z SY (γk , p) 1 lim = m1 (λ)E[|A1 | ]kwkp , where m1 (λ) = d p p λ(t) dt . (4.11) k→∞ sdk T QT 40 Theorem 7 is proved via asymptotic analysis of the m = 1 case of Theorem 6. The key to the proof, which is similar to the technique used to prove Theorem 2.1 of [41], is that in a small cube [t − s, t]d there is at most one point of N with overwhelming probability. Therefore, when s is very small, with very high probability, |gγ ∗ Y |p (t) = (|gγ |p ∗ |Y |p ) (t). This theorem shows that for small scales the scattering moments S[γ, p]Y (t) encode the intensity function λ(t), up to factors depending upon the summary statistics of the charges (Aj )∞j=1 and the window w. Recall that Λ(B), defined in (4.6), determines the concentration of events within the set B. Thus even a one-layer location dependent scattering network yields considerable information regarding the underlying data generation, at least in the case of inhomogeneous Poisson processes. However, it is often the case, e.g., [43], that invariant statistics are utilized. In this case (4.11) shows that invariant scattering statistics mix the mean of λ(t) and the pth moment of the charge magnitudes. However, we can decouple these statistics as we now explain. 
As a special case, Theorem 7 proves that for non-compound inhomogeneous Poisson processes (i.e., Aj = 1 for all j ≥ 1), small scale scattering moments recover λ(t) or m1 (λ), depending on whether one computes invariant or time-dependent scattering moments. For compound processes, we can add an additional nonlinearity, namely the signum function sgn, which when applied to the Poisson point process in (4.7) yields, X ∞ Y (dt) = sgn[Y (dt)] = δtj (dt) . j=1 Thus computing SY (γ, p) and the ratio SY (γ, p)/SY (γ, p) at small scales decouples the mean of λ(t) from the pth moment of |A1 |. We remark that the signum function is a simple perceptron and is closely related to the sigmoid nonlinearity, which is used in many neural networks. We further remark that the computation of SY constitutes a small two-layer network, consisting of the nonlinear sgn function, the linear filtering by the collection of filters gγ , the nonlinear pth modulus | · |p , and the linear integration operator. If Y (dt) is a homogeneous Poisson process, then λ(t) is constant, meaning that (4.10) and 41 (4.11) are equivalent. In the case of ordinary (non-compound) Poisson processes, Theorem 7 recovers the constant intensity. For periodic λ(t) and invariant scattering moments, the effect of higher-order moments of λ(t) can be partially isolated by considering higher-order expansions (e.g., m > 1) in (4.8). The next theorem considers second-order expansions and illustrates their dependence on the second moment of λ(t). Theorem 8. Let 1 ≤ p < ∞, and suppose Y (dt) is an inhomogeneous, compound Poisson point process satisfying the same assumptions as in Theorem 6. If λ(t) is periodic with period T in each coordinate, and if (γk )k≥1 = (sk , ξk )k≥1 , is a sequence of scale and frequency pairs such that limk→∞ sk = 0 and limk→∞ sk ξk = L ∈ Rd , then  Z  SY (γk , p) 1 Λsk (t) lim − d dt k→∞ s2d p p k E[|A1 | ]E[|Vk | ] T QT s2d k p  ! E A1 w(U1 )eiL·U1 + A2 w(U2 )eiL·U2 = m2 (λ) , (4.12) 2kwkpp E[|A1 |p ] where m2 (λ) = T −d QT λ(t)2 dt; U1 , U2 are independent uniform random variables on Q1 ; R and (Vk )k≥1 is a sequence of random variables independent of the Aj taking values in the unit cube Q1 and with respective densities, sdk pVk (v) = λ(t − vsk ) , v ∈ Q1 . Λsk (t) We first remark that the scale normalization on the left hand side of (4.12) is s−2d , compared to a normalization of s−d in Theorem 7. Thus even though (4.12) is written as a small scale limit, intuitively Theorem 8 is capturing information at moderately small scales that are larger than the scales considered in Theorem 7. This is further indicated by the term multiplied against m2 (λ) on the right hand side of (4.12), which depends on two points of the process (as indicated by the presence of two charges A1 and A2 ). Unlike Theorem 7, which gives a way to compute m1 (λ), Theorem 8 does not allow one to compute m2 (λ) since it would require knowledge of Λsk (t) in addition to the distribution from which the charges (Aj )∞ j=1 are drawn. However, Theorem 8 does show that at mod- erately small scales the invariant scattering coefficients depend non-trivially on the second 42 moment of λ(t). This behavior at moderately small scales can be used to distinguish be- tween, for example, an inhomogeneous Poisson point process with intensity function λ(t) and a homogeneous Poisson point process with constant intensity λ0 = m1 (λ), whereas Theorem 7 indicates that at very small scales the two processes will have the same invariant scattering moments. 
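The decoupling described above is easy to illustrate numerically. The following one-dimensional sketch (a grid-based approximation with an indicator window; the parameters are illustrative and not the thesis settings) estimates the invariant moments of a homogeneous Gaussian compound process Y and of the ordinary process obtained by applying sgn, and recovers λ_0 and E[|A_1|^2] separately from their ratio.

import numpy as np

rng = np.random.default_rng(1)

# At small scales, S Y(gamma,p)/s ~ lambda_0 * E|A_1|^p and S Ybar(gamma,p)/s ~ lambda_0,
# so the ratio isolates E|A_1|^p (here p = 2).
def invariant_moment(points, weights, s, xi, p, T, dt=0.1):
    """Grid approximation of the ergodic average of |g_{s,xi} * Y(t)|^p."""
    grid = np.zeros(int(T / dt) + 1, dtype=complex)
    np.add.at(grid, (points / dt).astype(int), weights * np.exp(-1j * xi * points))
    box = np.ones(max(int(round(s / dt)), 1))          # boxcar = indicator window
    conv = np.convolve(grid, box, mode="valid")
    return np.mean(np.abs(conv) ** p)

T, lam0, s, xi, p = 2.0e5, 0.01, 1.0, 3.0, 2
n = rng.poisson(lam0 * T)
points = rng.uniform(0, T, n)
charges = rng.normal(0, np.sqrt(np.pi / 2), n)          # E|A_1| = 1, E|A_1|^2 = pi/2

SY = invariant_moment(points, charges, s, xi, p, T)
SYbar = invariant_moment(points, np.ones(n), s, xi, p, T)   # Ybar = sgn[Y] = sum_j delta_{t_j}
print("SY/s    :", SY / s, "  target lam0*E|A|^2 =", lam0 * np.pi / 2)
print("SYbar/s :", SYbar / s, "  target lam0        =", lam0)
print("ratio   :", SY / SYbar, "  target E|A|^2      =", np.pi / 2)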
4.4 Second-Order Scattering Moments of Generalized Poisson Pro- cesses We prove that second-order scattering moments, in the small scale regime, encode higher- order moment information about the charges (Aj )∞ j=1 . Theorem 9. Let 1 ≤ p, p0 < ∞ and q = pp0 . Suppose that Y (dt) is an inhomogeneous Pois- son point process satisfying the same assumptions as in Theorem 6 as well as the additional assumption that E|A1 |q < ∞. Let γk = (sk , ξk ) and γk0 = (s0k , ξk0 ) be two sequences of scale and frequency pairs such that s0k = csk for some fixed constant c > 0 and limk→∞ sk ξk = L ∈ Rd . Then, S[γk , p, γk0 , p0 ]Y (t) lim d(p0 +1) = Kλ(t)E[|A1 |q ] , (4.13) k→∞ sk where p0 K := gc,L/c ∗ |g1,0 |p p0 , is a constant depending on p, p0 , c, L, and w. Furthermore, if λ(t) is periodic with period T along each coordinate, then SY (γk , p, γk0 , p0 ) lim d(p0 +1) = Km1 (λ)E[|A1 |q ] . (4.14) k→∞ sk 0 Note that the scaling factor s−d(p +1) depends on p0 but not p. Intuitively this corresponds 0 d(p0 +1) to the behavior k|gγk |p ∗ gγk0 kpp0 ≈ sk as sk → 0. Theorem 9 proves that second-order scattering moments capture higher-order moments of the charges (Aj )∞ j=1 via two pairs of lower-order filtering and modulus operators. If p, p0 > 1, then q = pp0 will be larger than 43 either p or p0 and the result above will give us information about the higher order moment E|A1 |q . It is also useful to consider the p = 1 case. Indeed, in Sec. 4.5 below it is shown that first- order invariant scattering moments can distinguish Poisson point processes from fractional Brownian motion and α-stable processes, if p = 1, but may fail to do so for larger values of p. However, Theorem 7 shows that first-order invariant scattering moments for p = 1 will not be able to distinguish between the various different types of Poisson point processes with a one-layer network at very small scales. Theorem 9 shows that a second-order calculation that augments the first-order calculation with p = 1 and p0 > 1, will capture a higher-order moment of the charges (Aj )∞ j=1 . 4.5 Poisson Point Processes Compared to Self Similar Processes For one-dimensional processes (i.e., d = 1), we show that first-order invariant scattering moments can distinguish between inhomogeneous, compound Poisson point processes and certain self-similar processes. In particular, we show that if X(t) is either an α-stable pro- cess or a fractional Brownian motion (fBM), then the corresponding first-order scattering moments will have different asymptotic behavior for infinitesimal scales than in the case of a Poisson point process. Similar results were initially reported in [41]; here we generalize those results to the non-wavelet filters gγ defined in (4.1) and for general pth scattering moments, and further clarify their usefulness in the context of the new results presented in Sec. 4.3 and Sec. 4.4. As in [41], the proof will be based on the scaling relationships of these processes and therefore will not be able to distinguish between α-stable processes and fBM1 . The key will be proving a lemma that says if a stochastic process X has a scaling relation, then that scaling relation is inherited by integrals of deterministic functions against dX. 
More precisely, for a stochastic process X(t), t ∈ R, we consider the convolution of the 1 We note that [41] proves that second-order scattering moments defined with wavelet filters do distinguish between α-stable processes and fBM, but we do not pursue this direction in this project as we are concerned primarily with point processes. 44 filter gγ with the noise dX defined by Z gγ ∗ dX(t) = gγ (t − u) dX(u) , R and define (in a slight abuse of notation) the first-order scattering moments at time t by S[γ, p]X(t) = E[|gγ ∗ dX(t)|p ] . (4.15) In the case where X(t) is a compound, inhomogeneous Poisson (counting) process, Y = dX will be a compound Poisson random measure and the scattering moments defined in (4.15) will coincide with the first-order scattering moments defined in (4.2). The following two theorems analyze the small scale first-order scattering moments when X is either an α-stable process, for 1 < α ≤ 2, or fractional Brownian motion. Thus dX will be stable Lévy noise or fractional Gaussian noise, respectively. These results show that the asymptotic decay of the corresponding scattering moments is guaranteed to differ from Poisson point processes, in the case p = 1. We also note that both α-stable processes and fBM have stationary increments; therefore the scattering moments do not depend on time and Z 1 S[γ, p]X(t) = SX(γ, p) = lim E[|gγ ∗ dX(u)|p ] du , ∀t ∈ R. R→∞ 2R |u|≤R Theorem 10. Let 1 ≤ p < ∞ and suppose X(t) is a symmetric α-stable process for some p < α ≤ 2. Let γk = (sk , ξk ) be a sequence of scale and frequency pairs such that limk→∞ sk = 0 and limk→∞ sk ξk = L ∈ R. Then, Z 1 p SX(γk , p) iLu lim p/α =E w(u)e dX(u) . k→∞ sk 0 Theorem 11. Let 1 ≤ p < ∞, suppose X(t) is a fractional Brownian motion with Hurst parameter 0 < H < 1. Assume that the window function w has bounded variation on [0, 1], and let γk = (sk , ξk ) be a sequence of scale and frequency pairs such that limk→∞ sk = 0 and limk→∞ sk ξk = L ∈ R. Then, Z 1 p SX(γk , p) iLu lim =E w(u)e dX(u) . k→∞ spH k 0 45 The key to proving Theorem 10 and Theorem 11 is the lemma stated in Appendix 7.3, which shows that if X(t) is a self-similar process, then, then stochastic integrals against dX satisfy an identity corresponding to the scaling relation of X(t). Together, these two theorems indicate that first-order invariant scattering moments dis- tinguish inhomogeneous, compound Poisson processes from both α-stable processes and frac- tional Brownian motion except in the cases where p = α or p = 1/H. In particular, if X is a Brownian motion, then SX will distinguish X from a Poisson point process except in the case that p = 2. For this reason, it appears that p = 1 is the best choice of the parameter p for the purposes of distinguishing a Poisson point process from a self-similar process. In the case of a multi-layer network, it is advisable to set p = 1. Larger values of p0 in the second layer can then allow us to determine the higher moments of the arrival heights (Aj )∞ j=1 . 4.6 Numerical Illustrations We carry out several experiments to numerically validate the previously stated results and to illustrate their capacity for distinguishing between different types of random processes. In all of the experirments below, we will hold the frequency ξ constant while we let the scale s decrease to zero. 4.6.1 Homogeneous, compound Poisson point processes with the same intensi- ties We generated three different types of homogeneous compound Poisson point processes, all with the same intensity λ(t) ≡ λ0 = 0.01. 
The three point processes are Y1 (ordinary), Y2 (Gaussian), and Y3 (Rademacher), where the charges are sampled according to (A_{1,j})_{j≥1} ≡ 1, (A_{2,j})_{j≥1} ∼ N(0, π/2), and (A_{3,j})_{j≥1} ∼ Rademacher distribution (i.e., ±1 with equal probability). The charges of the three signals have the same first moment, E[|A_{i,j}|] = 1, and different second moments, with E[|A_{1,j}|^2] = E[|A_{3,j}|^2] = 1 and E[|A_{2,j}|^2] = π/2. Theorem 7 thus predicts that p = 1 invariant first-order scattering moments will not be able to distinguish between the three processes, but p = 2 invariant first-order scattering moments will distinguish the Gaussian Poisson point process from the other two. Figure 4.1 illustrates this point by plotting the normalized invariant scattering moments for p = 1 and p = 2.

Figure 4.1 First-order invariant scattering moments for three types of homogeneous compound Poisson point processes with the same intensity λ_0. Left (top to bottom): ordinary Poisson point process; Gaussian compound Poisson point process with normally distributed charges; Rademacher compound Poisson point process with charges drawn from the Rademacher distribution. Middle: Normalized invariant scattering moments S_Y(s, ξ, 1)/(s‖w‖_1) (i.e., p = 1), which all converge to 0.01 as s → 0 (up to numerical errors) since λ_0 E[|A_1|] is the same for all three point processes. Right: Normalized invariant scattering moments S_Y(s, ξ, 2)/(s‖w‖_2^2) (i.e., p = 2). In this case the ordinary Poisson point process and the Rademacher Poisson point process still converge to the same value as s → 0, since E[|A_1|^2] = 1 for both of them, whereas the Gaussian Poisson point process converges to a different value, since E[|A_1|^2] = π/2 for this process.

4.6.2 Homogeneous, compound Poisson point processes with different intensities and charges

We consider two homogeneous, compound Poisson point processes with different intensities and different charge distributions, but which nevertheless have the same first-order invariant scattering moments with p = 1 due to the mixing of intensity and charge information in (4.11). The first compound Poisson point process has constant intensity λ_1 = 0.01 and charges A_{1,j} ∼ N(0, 1), whereas the second has intensity λ_2 = 0.01/√2 and charges A_{2,j} ∼ N(0, 2). In this way, λ_1 E[|A_{1,j}|] = λ_2 E[|A_{2,j}|] = 0.01 · √(2/π) ≈ 0.008, but λ_1 E[|A_{1,j}|^2] = 0.01 and λ_2 E[|A_{2,j}|^2] = 0.01 · √2 ≈ 0.014. Figure 4.2 plots the normalized invariant scattering moments for p = 1 and p = 2.

Figure 4.2 First-order invariant scattering moments for two homogeneous, Gaussian compound Poisson point processes with different intensity and variance. Left (top to bottom): homogeneous compound Poisson point process with intensity λ_1 = 0.01 and charges A_{1,j} ∼ N(0, 1); homogeneous compound Poisson point process with intensity λ_2 = 0.01/√2 and charges A_{2,j} ∼ N(0, 2). The two point processes are difficult to distinguish visually. Middle: Normalized invariant scattering moments S_Y(s, ξ, 1)/(s‖w‖_1) (i.e., p = 1), which both converge to approximately 0.008 up to numerical error, thus indicating that these moments cannot distinguish the two processes. Right: Normalized invariant scattering moments S_Y(s, ξ, 2)/(s‖w‖_2^2) (i.e., p = 2). The two processes are distinguished as s → 0, since the values λ_1 E[|A_{1,j}|^2] = 0.01 and λ_2 E[|A_{2,j}|^2] ≈ 0.014 differ by a significant margin.

4.6.3 Inhomogeneous, non-compound Poisson point processes

We also consider inhomogeneous Poisson point processes.
We use the intensity function λ(t) = 0.01(1 + 0.5 sin(2πt/N )) to generate inhomogeneous process. To estimate S[γ, p]Y (t), we average the modulus of the scattering transform at time t over 1000 realizations. Figure 4.3 plots the scattering moments for inhomogeneous process at different time. Figure 4.3 First-order invariant scattering moments for inhomogeneous non-compound Pois- son point processes. Left: Inhomogeneous non-compound Poisson point process with inten- sity λ(t) = 0.01(1 + 0.5 sin(2πt/N )). Right: Scattering moments S[γ,p]Y (t)/skwkpp for inhome- geneous non-compound Poisson point process at t1 = N/4, t2 = N/2, t3 = 3N/4. Note that λ(t1 ) = 0.015, λ(t2 ) = 0.01, λ(t3 ) = 0.005. The plots show that for inhomogeneous process, scattering coefficients at time t converges to the intensity at that time. 48 Figure 4.4 First-order invariant scattering moments for Brownian motion and Poisson point process. Left: Top: Brownian motion with Hurst parameter H = 1/2. Bottom: Ordi- nary Poisson point process. Middle: Normalized scattering moments for Brownian Motion (SXBM (s,ξ,1)/kwk2 E|Z|) and Poisson point process (SY√ poisson (s,ξ,1)/λE|A|kwk1 ) at p = 1. This shows the convergence rate of normalized scattering is s for Brownian motion and s for Pois- son process, indicating the 1st moment can distinguish Brownian motion and Poisson point process. Right: Normalized scattering moments for Brownian Motion (SXBM (s,ξ,2)/kwk22 ) and Poisson point process (SYpoisson (s,ξ,2)/kwk22 ) at p = 2. Both normalized scattering moments have convergence rate s, so the 2nd moment scattering cannot distinguish the two processes. 4.6.4 Homogeneous, non-compound Poisson point process and self similar pro- cess We consider Brownian motion with Hurst parameter H = 1 2 and compare it with Poisson point process with intensity λ = 0.01 and charges (A)∞ j=1 ≡ 10. Figure 4.4 shows that the 2nd moments cannot distinguish between Brownian motion and Poisson point process while the 1st moments can. 4.7 Scattering GAN We are also interested in the capacity of scattering moments on Poisson point processes. Ideally, we want to generate new signals through the scattering measurements and show the realizations are from the given Poisson point process. We are less interested in synthesizing one specific realization of the process but more interested in generating new realizations through the distribution of scattering moments. Therefore, instead of minimizing the `2 loss between scattering coefficients, which we did in Chapter 3, we use a GAN model to study the high dimensional distribution of the scattering moments. Figure 4.5 shows the structure of our model, where we insert a scattering propagator S between the generator G and the 49 Figure 4.5 Scattering-GAN to study the capacity of scattering moments on Poisson point process. Similar to ordinary GAN, the generator takes in a random vector z and generates fake data y 0 . Also, the discriminator aims to distinguish fake representations from real. By inserting a scattering module between G and D, the discriminator tries to distinguish S(y 0 ) from S(y). When the model trains successfully, S(y 0 ) has the same distribution as S(y). By checking the similarity between y 0 and y, we learn the capacity of scattering moments. discriminator D. Given realizations {yi }i from a Poisson point process Y , suppose their scattering moments {S(yi )}i are samples from an unknown high dimensional distribution PS . 
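A compact PyTorch sketch of this architecture is given below. The fixed one-layer scattering module, the fully connected generator and discriminator, and the optimization details are illustrative assumptions rather than the exact configuration used in our experiment; the sigmoid output layer matches the generator used in the experiment described below.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Scattering1d(nn.Module):
    """Fixed first-order scattering: y -> mean_t |y * g_scale| over a few scales."""
    def __init__(self, scales=(2, 4, 8, 16, 32)):
        super().__init__()
        self.filters = []
        for s in scales:
            t = torch.arange(s, dtype=torch.float32) / s
            g = torch.stack([torch.cos(2 * math.pi * t), torch.sin(2 * math.pi * t)])
            self.filters.append(g.unsqueeze(1))        # (2, 1, s): real and imaginary parts
    def forward(self, y):                              # y: (batch, 1, n)
        feats = []
        for g in self.filters:
            r = F.conv1d(y, g.to(y.device))
            feats.append(torch.sqrt(r[:, 0] ** 2 + r[:, 1] ** 2 + 1e-12).mean(dim=-1))
        return torch.stack(feats, dim=1)               # (batch, n_scales)

n, latent, n_scales = 1024, 64, 5
G = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                  nn.Linear(256, n), nn.Sigmoid())     # generator ends with a sigmoid
D = nn.Sequential(nn.Linear(n_scales, 64), nn.ReLU(), nn.Linear(64, 1))
S = Scattering1d()
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def train_step(y_real):                                # y_real: (batch, 1, n)
    z = torch.randn(y_real.shape[0], latent)
    s_real, s_fake = S(y_real), S(G(z).unsqueeze(1))
    # discriminator: push real scattering moments toward 1, generated ones toward 0
    d_real, d_fake = D(s_real), D(s_fake.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # generator: make the scattering moments of y' indistinguishable from those of y
    g_logit = D(S(G(torch.randn_like(z)).unsqueeze(1)))
    g_loss = bce(g_logit, torch.ones_like(g_logit))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

Here train_step would be called on mini-batches of real realizations y_real, shaped (batch, 1, n) and scaled to [0, 1].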
In the scattering model, the discriminator tries to distinguish between the real scattering coefficients S(y_i) and fake ones S(y_i'), while the generator produces signals y_i' whose scattering coefficients S(y_i') match S(y_i). Figure 4.6 shows signals generated by the scattering GAN in a numerical experiment.

Figure 4.6 Generated signals through the scattering GAN. We use realizations of length n = 2^12 from a homogeneous, ordinary Poisson point process, i.e., λ(t) ≡ λ_0 and A_i ≡ 1, as training data. We use {2^{j/2}}_{j=0}^{22} as the scales for the filters and apply a one-layer scattering operator to compute the scattering moments. A sigmoid is applied at the last layer of the generator. The generated signals are sparse, although not as sparse as the training data. This is natural, since the sigmoid forces A_i' ∈ (0, 1), and thus E[A_i'] < E[A_i]. According to Theorem 6, λ_0' > λ_0, which we verified numerically.

4.8 Conclusion

We have constructed Gabor-filter scattering transforms for random measures on R^d and stochastic processes on R. Our construction is closely related to [41], but extends that work in several important ways. First, while our Gabor-type filters include dyadic wavelets as a special case, they also include many other families of filters. We also do not assume that the random measure Y is stationary, and we consider compound, possibly inhomogeneous, Poisson random measures on R^d in addition to ordinary Poisson processes on R. We do note, however, that [41] provides a detailed analysis of self-similar processes and multifractal random measures, whereas we have primarily focused on models of random sparse signals.

We believe the results presented here open up several avenues of future research. Firstly, we have assumed throughout most of this chapter that the points of our random measures are distributed according to a possibly inhomogeneous Poisson process. It would be interesting to discover whether our measurements can distinguish these signals from other point processes. Secondly, it would be interesting to explore the use of these measurements for a variety of machine learning tasks, such as synthesizing new signals. In the next chapter, we describe our work on using similar measurements for texture image synthesis.

CHAPTER 5 TEXTURE SYNTHESIS VIA PROJECTION ONTO MULTISCALE, MULTILAYER STATISTICS

5.1 Introduction

In the last chapter, we presented our work on generalized scattering transforms of stochastic processes. In practice, one class of signals that can be modeled by stochastic processes is texture images. This is motivated by the fact that texture images usually contain a type of random repetition of a potentially complex pattern. A natural task involving texture images, and one related to the completeness of a model, is the texture synthesis problem. This task asks one to generate new, perceptually accurate texture images given a limited (often single) realization of the texture class in question. Within the field of image generative models, the texture synthesis problem is appealing because it allows for types of statistical analysis that are not possible in general image generation. Recent works have proposed using generative adversarial networks (GANs) [44] to perform texture synthesis and related tasks [45, 46, 47, 48]. GANs have also been used to expand non-stationary texture images [49], demonstrating their ability to capture large-scale structures.
Classically, texture synthesis models fall into two categories [50]: (i) non-parametric patch rearrangement methods that extract microscopic patterns from the reference image and randomly arrange these patterns in a new image; and (ii) parametric statistic-matching models that extract a set of empirical statistics from the reference texture, and generate a new image by selecting a random image with a similar statistical profile. This chapter addresses the second type of model based on statistical matching. In [51, 52], the authors reproduce micro-textures by randomizing the phase of Fourier coefficients of the input texture. In some other works, the filtered responses of texture images are matched based on the maximum entropy principle [53, 54, 55, 56]. Such statistical models have two 53 challenges: (i) What is the set of statistics needed to characterize a large class of textures? and (ii) Given the statistical profile of a reference image, how does one generate a random image with the same statistical profile? Gatys and collaborators [3] had great success in addressing these two challenges by extracting the covariance statistics of the filter responses at various layers of the VGG-19 network [1], and then generating a new image with matching statistics via back-propagation and stochastic gradient descent, which is reviewed in Section 2.4. This work in turn inspired several subsequent methods, including [15, 57, 58, 59]. Despite the success of [3], though, the model is not perfect and many open questions remain for statistics-based texture synthesis models. Indeed, in a follow up paper [60], it is observed that high quality texture images can be synthesized using only a one-layer network with random, multiscale filters and rectified linear unit (ReLU) nonlinearity. The combina- tion of the two papers [3, 60] raises questions with respect to the trade-off between network depth and the sizes of the receptive fields of the filters in the network. Additionally, putting the use of random filters aside, the use of a single layer of multiscale filters parallels classical work in the field that uses the statistics of multiscale wavelet coefficients to synthesize tex- tures [61, 53, 2, 62]. Multi-scale CNN models are also designed to maintain high resolution in texture synthesis [63, 64]. Among these methods, the algorithm of Portilla and Simoncelli [2] is particularly notable for its use of statistics based on the modulus and phase of com- plex wavelet valued coefficients, in addition to its impressive performance which is often still bench-marked against today. In this chapter we propose a multiscale, multilayer, nonlinear feature extractor for images based upon real-valued wavelet transforms, which in turn yields a set of statistics for use in texture synthesis. In addition to drawing inspiration from Portilla and Simoncelli [2] and Gatys et al. [3], the model presented in this chapter also draws upon ideas from the wavelet scattering transform [10], which itself has shown good results for the synthesis of gray-scale textures [43]. Nevertheless, our algorithm has several novel aspects that we use to investigate the texture synthesis problem, and which provide insight into image feature extraction via 54 convolutional networks. More specifically: • We provide an analysis of the types of filters required to obtain good synthesis results when combined with the ReLU nonlinearity. • We investigate the trade-off between network depth and the maximum scale of the wavelet filters. 
• We propose a CNN architecture that uses the ReLU nonlinearity and is provably invertible at each layer, which in turn allows us to adapt the projection synthesis algorithm of [2] to our setting. • We demonstrate our theoretical findings numerically through example synthesized im- ages, and also compare our results to [2] and [3]. In Section 5.2 we present our statistical model in detail, while Section 5.3 describes our synthesis algorithm. Section 5.4 provides detailed numerical results, and Section 5.5 introduces implementation details. Section 5.6 contains a short conclusion. 5.2 Model Set T2 := [−T, T ]2 and R+ := [0, ∞), and let x : T2 → R+ be a texture image, which we shall assume is in L2 (T2 ). A statistics-based matching algorithm for texture synthesis specifies a family of (nonlinear) functions Uk : L2 (T2 ) → L2 (T2 ) and extracts a family of empirical statistics Sx from x based on Z 1 Sx = (Sk x)k , Sk x := Uk x(u) du . (2T )2 T2 A new texture y ∈ L2 (T2 ) is synthesized by drawing y from the set of images with similar statistical profiles: y ∼ Ix := {z ∈ L2 (T2 ) : kSz − Sxk ≤ ε} . (5.1) If x ∼ X, where (X(u))u∈T2 is a stochastic process, and if Uk X is stationary and ergodic, then Sk x → E[Uk X] as T → ∞. Thus, we can think of Sk x as approximating the statistics E[Uk X] of the unknown process that generated x. 55 The model (5.1) is appealing because the statistical profile Sx determines the texture class. It is thus paramount to determine a good set of functions (Uk )k , and hence statistics (Sk )k , the pursuit of which has ramifications in human and computer vision [65, 66, 67]. The method of Portilla and Simoncelli [2] defines the majority of their statistics by lever- aging a complex valued wavelet transform and extracting statistics from the modulus and phase of the resulting wavelet coefficients. The first layer of our model also uses multiscale wavelet filters, but they are real valued and we replace the modulus and phase nonlinearities with the ReLU nonlinearity. In Sections 5.2.1 and 5.2.2 we explain how the proper selection of such wavelet filters, though, when combined with ReLU, can distinguish between certain types of patterns in the same way that modulus and phase can. On the other hand, Gatys et al. [3] define their statistics using the Gram matrices of the filter responses at various layers in the VGG-19 network. The receptive field of the filters of the VGG-19 network are small, only 3 × 3 pixels, but the depth and pooling of the VGG-19 network allows such statistics to still capture complex multiscale patterns in texture images. Akin to the VGG network, in Section 5.2.3 we expand our set of functions Uk by computing a second wavelet transform and ReLU nonlinearity. Such a procedure is inspired by the wavelet scattering transform [10], but as we will describe differs from the scattering transform in several significant ways. 5.2.1 Wavelet filters Let x b(ω), for frequencies ω ∈ Ω := {πk/T : k ∈ Z2 } ⊂ R2 , denote the Fourier transform of x: x b(ω) := T2 x(u)e−iu·ω du. A wavelet ψ ∈ L2 (T2 ) is an oscillating waveform that is R localized in both space and frequency and has zero average. Inspired by previous work in wavelet based image processing, as well as recent analyses of the filters of the VGG network [68], we make use of three types of wavelets. Figure 5.1 shows the three wavelet families and we will define the wavelets in the following context. The first two are directional wavelet filters. 
We select one even directional filter and one 56 Figure 5.1 Wavelet families. Upper: Even directional wavelets in space and frequency (FFT). Middle: Odd directional wavelets in space and frequency (FFT). Lower: Omnidi- rectional wavelets in space and frequency (FFT). Each block shows the wavelet family with different scales and oscillations. 57 odd directional filter: ψ e (u) := g(u) cos(ξ · u) , ψ o (u) := g(u) sin(ξ · u) , where g is an even window function and ξ ∈ R2 is the central frequency of the wavelets. These wavelets oscillate in the direction ξ and have localized Fourier transforms around ξ and −ξ. These wavelets are rotated to obtain waveforms oscillating in different directions: ψθβ (u) := ψ β (Rθ−1 u) , β ∈ {e, o} , where Rθ ∈ SO(2) is the 2 × 2 rotation matrix about the angle θ ∈ [0, π). We use M angles θ ∈ ΘM := {mπ/M : 0 ≤ m < M }. The third type of wavelet is based on the polar coordinate representation u = (r, ϕ) ∈ [0, ∞) × [0, 2π), and oscillates along the angle parameter ϕ: ψ`p (u) := a` (u) cos(`ϕ) , where ` ∈ Z is the frequency of oscillation along the angle ϕ. If a` (u) = e a` (r), then the function a` determines the frequency of oscillation of the filter along the radial parameter. In this case, |ψbp (ω)| has an essential support in the shape of an annulus and the filter is omnidirectional. We restrict 0 ≤ ` < L and select a` (u) = e a` (r) to be an oscillatory function that oscillates at a frequency approximately proportional to L − 1 − `, ensuring that the overall frequency support of ψ`p is approximately fixed. Directional filters such as ψθe and ψθo are common in image processing and various analyses of CNN filters, e.g., [68], have shown that commonly used CNNs learn directional filters. In Section 5.2.2 we will motivate the seemingly redundant choice of using both an even and odd directional wavelet filter. By examining the filters of the VGG-19 network, though, one also finds omnidirectional filters. In practice (see Section 5.4) we find that such filters improve the quality of synthesized textures in which the image patterns do not align with a small subset of directions. 58 All wavelets are dilated at dyadic scales to obtain a multiscale family of waveforms: β ψj,α (u) := 2−2j ψαβ (2−j u) , j ∈ Z , (α, β) ∈ {(θ, e), (θ, o), (`, p)} . In our experiment, for directional wavelets, we use g = gσ as a gaussian function with variance σ 2 . For the three wavelets, they all have local support both in space and frequency. As j increase from 0 to J − 1 (from left to right in each block in Figure 5.1), the wavelet has larger support in space and smaller support in frequency. For directional wavelets, the wavelet support varies in directions with rotations (from top to bottom in each block) to capture directional oscillations in images. For omnidirectional wavelets, the wavelet either oscillates radially or angularly or both. The total number of oscillations is fixed across the four wavelets. Numerically, we may assume that x b(ω) is supported on frequencies ω contained in [−π, π]2 . By design the collection of wavelet filters have collective frequency support in a ball, which we can assume is the frequency ball of radius π. 
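For concreteness, the following numpy sketch builds small discrete versions of the three wavelet families on an N × N pixel grid. The Gaussian window, the value of σ, the central frequency ξ_0, and the radial envelope a_ℓ are illustrative choices rather than the exact filters used in our experiments; the rotations and the dyadic dilation with the 2^{-2j} normalization follow the definitions above.

import numpy as np

def directional_wavelets(N, j, theta, xi0=3 * np.pi / 4, sigma=0.8):
    """Even/odd directional wavelets psi^e, psi^o at scale 2^j and orientation theta."""
    u = np.arange(N) - N // 2
    X, Y = np.meshgrid(u, u, indexing="ij")
    xr = (np.cos(theta) * X + np.sin(theta) * Y) / 2 ** j      # rotate, then dilate
    yr = (-np.sin(theta) * X + np.cos(theta) * Y) / 2 ** j
    g = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2)) / 2 ** (2 * j)
    return g * np.cos(xi0 * xr), g * np.sin(xi0 * xr)

def omnidirectional_wavelet(N, j, ell, L=4, sigma=0.8):
    """psi^p_ell: oscillates in the angular variable phi, with radial envelope a_ell."""
    u = (np.arange(N) - N // 2) / 2 ** j
    X, Y = np.meshgrid(u, u, indexing="ij")
    r, phi = np.hypot(X, Y), np.arctan2(Y, X)
    radial_freq = L - 1 - ell                      # fewer radial oscillations as ell grows
    a = np.exp(-r ** 2 / (2 * sigma ** 2)) * np.cos(radial_freq * np.pi * r)
    return a * np.cos(ell * phi) / 2 ** (2 * j)

# filter bank for J scales, M orientations, L angular frequencies
N, J, M, L = 64, 4, 4, 4
bank = []
for j in range(J):
    for m in range(M):
        bank.extend(directional_wavelets(N, j, m * np.pi / M))
    for ell in range(L):
        bank.append(omnidirectional_wavelet(N, j, ell, L))
print(len(bank), "wavelet filters of size", bank[0].shape)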
In this case we complement the wavelet filters with two additional filters: (i) a non-negative low pass filter φ ∈ L2 (T2 ) that has Fourier transform essentially supported around the origin (since wavelets have zero average); and (ii) a high pass filter h ∈ L2 (T2 ) that has Fourier transform essentially sup- ported outside of the frequency ball {ω ∈ Ω : |ω| ≤ π} (in other words, b h is supported in the “corners” of [−π, π]2 ). The scales 2j of the wavelet filters are restricted to 0 ≤ j < J, where J ≤ Jmax = O(log2 T ), and we dilate φ to the scale 2J via φJ (u) := 2−2J φ(2−J u). The wavelet transform we use in this chapter computes the convolution of x with all the aforementioned filters: β WJ x := {x ∗ φJ , x ∗ h , x ∗ ψj,α : 0 ≤ j < J, (α, β) ∈ {(θ, e), (θ, o), (`, p)}, θ ∈ ΘM , 0 ≤ ` < L} . A group of filters {fk }k is said to form a frame for signals x such that supp(b x) ⊂ [−π, π]2 if there exists two constants 0 < A ≤ B < +∞ such that: X A≤ |fbk (ω)|2 ≤ B , ∀ ω ∈ Ω ∩ [−π, π]2 . k 59 We can define the dual filters as fek (ω) := fbk (ω) . An image x can be reconstructed from b 2 P k |fk (ω)| b its filtrations by the filters {fk }k using the dual filters {fek }k and the formula: X x= x ∗ fk ∗ fek . (5.2) k For appropriately chosen parameters, the collection of filters used to define the wavelet transform WJ forms a frame. As such, we can recover x from WJ x using (5.2). This property will be important in Section 5.3 for developing an algorithm by which to synthesize a new texture. 5.2.2 First layer statistics We extract directly from the image x the mean, variance, skewness, and kurtosis of the image intensities (x(u))u∈T2 , in addition to the min/max intensities. We then consider the low pass filtering x ∗ φJ . Since x(u) ≥ 0 and φJ (u) ≥ 0, the mean of x ∗ φJ is proportional to the mean of x and does not need to be computed. We do add in the variance of the values (x ∗ φJ (u))u∈T2 to Sx. We also add in the variance of the high pass coefficients (x ∗ h(u))u∈T2 . These statistics are also used by Portilla and Simoncelli, and one can find additional motivation for their usefulness in [2]. In order to simplify notation, let λ = (j, α, β) ∈ Λ be any admissible triplet for the wavelets ψj,αβ described in Section 5.2.1, and denote these wavelets by ψλ . Like the low pass coefficients and the high pass coefficients, we could compute only the variance of the values (x ∗ ψλ (u))u∈T2 , but in doing so we would miss important correlations between patterns in x at different scales, orientations, and angular frequencies, as captured by our wavelets. In fact, results in [43] indicate that the using only the variance of wavelet coefficients does not result in good texture synthesis for certain types of textures. An alternative would be to compute the covariance between x ∗ ψλ and x ∗ ψλ0 , but the frequency localization of the wavelets means that such statistics will be nearly zero, and hence meaningless, for most pairs (λ, λ0 ). One possible solution is to apply a pointwise nonlinear function σ : R → R+ to the wavelet 60 coefficients, effectively pushing the high frequencies of x ∗ ψλ down to the low frequencies for each λ, which in turn generates non-trivial correlations between x ∗ ψλ and x ∗ ψλ0 . In [2], Portilla and Simoncelli decompose complex-valued wavelet coefficients into their modulus and phase (two nonlinear transforms), and compute covariance-type statistics of the wavelet modulus coefficients and of the phase coefficients. 
More recently, Zhang and Mallat [69] developed a wavelet phase harmonic nonlinear transform (also for complex wavelets) and used the resulting covariance statistics for texture synthesis of select gray-scale textures. In this work we set σ(t) := max(0, t), which is the rectified linear unit (ReLU) nonlin- earity. In order to obtain an invertible transform, we also multiply the wavelet coefficients by ±1, thus yielding the nonlinear wavelet transform: UJ1 x := {x ∗ φJ , x ∗ h , σ(γ · x ∗ ψλ ) : γ = ±1, λ ∈ Λ} . Since t = σ(t) − σ(−t), one can recover WJ x from UJ1 x and hence one can recover x using (5.2). We compute the Gram matrix correlation statistics between all pairs of nonlinear coefficients in UJ1 x, Z 1 Cx1 (λ, γ, λ0 , γ 0 ) := σ(γ · x ∗ ψλ (u))σ(γ 0 · x ∗ ψλ0 (u)) du (5.3) (2T )2 T2 with the exception of λ = λ0 and γ = −γ 0 as Cx1 (λ, γ, λ, −γ) = 0. Note that |t| = σ(t) + σ(−t), and hence the statistics (5.3) subsume the covariance statis- tics between wavelet absolute value coefficients, which are similar to the wavelet modulus statistics computed in [2]. It was observed in [69] that the ReLU nonlinearity is related to phase information in complex valued wavelet coefficients. In [2], Portilla and Simoncelli motivate the inclusion of phase by considering two one-dimensional signals, a Dirac function and a step function. These two signals cannot be distinguished by the wavelet modulus coefficients alone. ReLU wavlet coefficients, on the other hand, can distinguish a Dirac func- tion from a step function for either even or odd wavelets. However, the next theorem shows they have trouble distinguishing the relative intensity of these functions unless even and odd wavelets are used together. 61 Theorem 12. Define y1 (t) := δ(t) and y2 (t) := 1[0,∞) (t). Let ψee ∈ L2 (R) be a one- dimensional even wavelet, and let ψeo ∈ L2 (R) be a one-dimensional odd wavelet. Define the 2 × 2 correlation matrices Cyβk as: Z 0 β Cyk (γ, γ ) := σ(γ · yk ∗ ψeβ (t))σ(γ 0 · yk ∗ ψeβ (t)) dt , R for β ∈ {e, o} and γ ∈ {−1, +1}. Then Cye1 6= C−y e 1 and Cye2 = C−y e 2 , while Cyo1 = C−y o 1 and Cyo2 6= C−yo 2 . Figure 5.2 shows examples of the wavelets and corresponding Fourier transforms. Figure 5.2 1D wavelets. From left to right: 1D even wavelet, 1D odd wavelet, FFT (real part) of even wavelet, FFT (imagery part) of odd wavelet. In order to prove Theorem 12, we need the following lemma. Lemma 3. Let f (u) be an odd function, we have Z Z 0 σ(γ · f )σ(γ · f ) = σ(−γ · f )σ(−γ 0 · f ) R R for γ, γ 0 ∈ {−1, +1}. Proof of Theorem 12. We first prove the theorem for the 1D Dirac function y1 (t). First the wavelet transform of y1 is: y1 ∗ ψ o (u) = ψ o (u), y1 ∗ ψ e (u) = ψ e (u) Figure 5.3 shows the wavelet transforms for y1 and −y1 . Since ψ o (u) is an odd function, with Lemma 3 we have Cyo1 = C−y o 1 . However since ψ e (u) is an even function, generally we have Cye1 (+1, +1) 6= C−y e 1 (+1, +1). Therefore Cye1 6= C−y e 1 . Figure 5.4 verifies this conclusion numerically. 62 Figure 5.3 Dirac functions and wavelet coefficients. Left: Two Dirac functions y1 and −y1 . Middle: Wavelet coefficients with the even wavelet. Right: Wavelet coefficients with the odd wavelet. Figure 5.4 Covariance matrix for diracs. Upper row from left to right: Cye1 , C−y e 1 , Cye1 − C−y e 1 . Lower row from left to right: Cy1 , C−y1 , Cy1 − C−y1 . This numerically verified that even o o o o wavelet is able to distinguish the two dirac functions from the covariance statistics while odd wavelet cannot. 
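The Dirac computation above can be reproduced in a few lines; the discrete grid and the particular Gabor-type even and odd wavelets below are illustrative choices.

import numpy as np

t = np.linspace(-8, 8, 1025)
g = np.exp(-t ** 2 / 2)
psi_e = g * np.cos(2 * t)                 # even wavelet
psi_o = g * np.sin(2 * t)                 # odd wavelet
relu = lambda z: np.maximum(z, 0.0)

def corr_matrix(y, psi):
    """C_y(gamma, gamma') = sum_t relu(gamma * (y*psi)(t)) relu(gamma' * (y*psi)(t))."""
    w = np.convolve(y, psi, mode="same")
    feats = [relu(w), relu(-w)]           # gamma = +1, -1
    return np.array([[np.dot(a, b) for b in feats] for a in feats])

y1 = np.zeros_like(t); y1[len(t) // 2] = 1.0        # discrete Dirac at the grid center
print("even, +delta vs -delta:\n", corr_matrix(y1, psi_e), "\n", corr_matrix(-y1, psi_e))
print("odd,  +delta vs -delta:\n", corr_matrix(y1, psi_o), "\n", corr_matrix(-y1, psi_o))
# The odd-wavelet matrices agree (up to discretization), while the even-wavelet
# matrices differ, matching C^o_{y1} = C^o_{-y1} and C^e_{y1} != C^e_{-y1}.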
Now we prove the theorem for the jump function y2 (t). The wavelet transforms satisfy: Z Z u β y2 ∗ ψ (u) = β y2 (u − t)ψ (t)dt = ψ β (t)dt (5.4) R −∞ for β ∈ {e, o}. Figure 5.5 illustrates the convolution of the even and odd wavelet with y2 63 Figure 5.5 Jump functions and wavelet coefficients. Left: Two jump functions y2 and −y2 . Middle: Wavelet coefficients with the even wavelet. Right: Wavelet coefficients with the odd wavelet. and −y2 . Ru Remark 1. Since ψ e is an integrable even function, then f e (u) = −∞ ψ e (t)dt is an odd function. Ru Remark 2. Since ψ o is an integrable odd function, then f o (u) = −∞ ψ o (t)dt is an even function. Remark 1 with Lemma 3 shows we have Cye2 = C−y e 2 . Remark 2 means that generally we also have Cyo2 6= C−y o 2 . Figure 5.6 gives the numerical verification. Theorem 12 shows both even and odd wavelets are necessary in our model. For images x, the Dirac signal y1 is similar to a dividing line that separates two regions of the same shade, which occurs in many types of texture images. This result shows that ReLU nonlinear wavelet coefficient correlations, when computed with an odd wavelet, cannot correctly determine the brightness of the dividing line relative to the regions it separates. Similarly, ReLU nonlinear wavelet coefficients, when computed with an even wavelet, cannot determine whether the color gradient across an edge is positive or negative. Numerical results illustrating these effects are given in Section 5.4.1. 64 Figure 5.6 Upper row from left to right: Cye2 , C−y e 2 , Cye2 − C−y e 2 . Lower row from left to right: Cy2 , C−y2 , Cy2 − C−y2 . This numerically verified that odd wavelet is able to distinguish the o o o o two jump functions from the covariance statistics while the even wavelet cannot. 5.2.3 Second layer statistics ReLU wavelet correlation statistics can be complemented by two-layer statistics that are derived from feature maps that combine image information across scales before iterating the operator UJ1 . In particular, we compute ( J−1 ! X β UJ2 x := UJ1 σ(γ · x ∗ ψj,α ) : (α, β) ∈ {(θ, e), (θ, o), (`, p)}, j=0 ) γ = ±1, θ ∈ ΘM , 0 ≤ ` < L . and recall UJ1 x := {x ∗ φJ , x ∗ h , σ(γ · x ∗ ψλ ) : γ = ±1, λ ∈ Λ} . We then compute the variance statistics of the low and high pass maps of UJ2 x, and the correlation statistics between all pairs of the nonlinear wavelet maps contained in UJ2 x, with 65 the same exceptions as in the first layer. By iterating upon the map UJ1 , the map UJ2 bears some similarity to the wavelet scattering transform [10]. However, there are several important differences. As already discussed, we utilize the ReLU nonlinearity and a family of real valued wavelets, as opposed to complex valued wavelets and the modulus nonlinearity. Furthermore, the map UJ1 defined here is invertible, unlike the scattering propagation operator. Finally, before iterating the map UJ1 , we sum over the scale index j of the nonlinear maps σ(γ · x ∗ ψj,α β ). This operation is akin to a 1 × 1 convolution operation in the VGG-19 CNN (and other CNNs), but unlike in the VGG network in which the filters being summed over have receptive fields with the same size, here we aggregate over nonlinear multiscale wavelet filtrations that allows our network to link together correlated patterns at multiple scales. This operation also has the effect of reducing the number of second layer maps, and hence statistics. 
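A minimal sketch of how these first and second layer correlation statistics can be computed is given below (periodic FFT convolutions; only the ReLU wavelet maps are shown, and the low pass and high pass variance statistics are omitted). The dictionary bank of wavelets indexed by (j, α, β) is assumed to come from Section 5.2.1 and is not constructed here.

import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def conv2d_periodic(x, psi):
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(psi, s=x.shape)))

def layer1_maps(x, bank):
    """Wavelet part of U^1_J x: relu(gamma * x * psi_lambda) for gamma = +1, -1."""
    maps = {}
    for lam, psi in bank.items():
        w = conv2d_periodic(x, psi)
        maps[(lam, +1)], maps[(lam, -1)] = relu(w), relu(-w)
    return maps

def gram(maps):
    """C(lambda, gamma, lambda', gamma') = spatial average of products of the maps."""
    keys = sorted(maps)
    M = np.stack([maps[k].ravel() for k in keys])
    return keys, M @ M.T / M.shape[1]

def layer2_maps(x, bank):
    """Sum the relu maps over the scale index j, then re-apply the wavelet transform."""
    maps2 = {}
    alphas = {(lam[1], lam[2]) for lam in bank}           # lam = (j, alpha, beta)
    for ab in alphas:
        for gamma in (+1, -1):
            summed = sum(relu(gamma * conv2d_periodic(x, psi))
                         for (j, a, b), psi in bank.items() if (a, b) == ab)
            maps2.update({((ab, gamma), k): v
                          for k, v in layer1_maps(summed, bank).items()})
    return maps2

# usage with a hypothetical bank {(j, alpha, beta): psi array} and image x:
#   keys1, C1 = gram(layer1_maps(x, bank))
#   keys2, C2 = gram(layer2_maps(x, bank))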
Perhaps surprisingly, the operator UJ2 is also invertible under appropriate conditions on the wavelets defined in Section 5.2.1. Theorem 13. If {φJ , h, ψj,θ o : 0 ≤ j < J, θ ∈ ΘM } forms a frame and if gb is non-negative, radial, and a decreasing function of |ω|, then x 7→ {x ∗ φJ , x ∗ h , UJ2 x} is invertible. Before proving Theorem 13, we first prove the following lemma. Lemma 4. If {φJ , h, ψj,θ o : 0 ≤ j < J, θ ∈ ΘM } forms a frame and if gb is non-negative, radial, and a decreasing function of |ω|, then {φJ , h, J−1 j=0 ψj,θ : θ ∈ ΘM } also forms a o P frame. Proof of Lemma 4. With the definition of frame, we know there exist two constants 0 < A ≤ B < ∞ such that: X A ≤ |φbJ (ω)|2 + |bh(ω)|2 + o |ψbj,θ (ω)|2 ≤ B, ∀ ω ∈ Ω ∩ [−π, π]2 . j,θ we need to prove there exist two constants 0 < A0 ≤ B 0 < ∞ such that: X X A0 ≤ |φbJ (ω)|2 + |b h(ω)|2 + | o ψbj,θ (ω)|2 ≤ B 0 , ∀ ω ∈ Ω ∩ [−π, π]2 . θ j 66 The upper bound always exists as long as we have a finite number of filters. Therefore we only prove the lower bound. The key point is to prove: X X o |ψbj,θ (ω)|2 ≤ | o ψbj,θ (ω)|2 , ∀θ ∈ ΘM (5.5) j j Without loss of generosity, we set θ = 0 and omit this notation in the following proof. Recall the odd directional wavelet ψ o (u) = g(u) sin(ξ · u), which has Fourier transform: gbσ (ω − ξ) − gbσ (ω + ξ) ψbo (ω) = (5.6) 2i Bringing equation (5.6) into equation (5.5), we need to prove: X X gσ,j (ω − ξ) − gbσ,j (ω + ξ)|2 ≤ | |b gbσ,j (ω − ξ) − gbσ,j (ω + ξ)|2 (5.7) j j If gb is non-negative, radial, and a decreasing function of |ω|, one can prove: • If |ω − ξ| < |ω + ξ|, then gbσ (ω − ξ) − gbσ (ω + ξ) ≥ 0. For any such ω we also have |2−j ω − ξ| < |2−j ω + ξ|, and gbσ (2−j ω − ξ) − gbσ (2−j ω + ξ) ≥ 0. • If |ω − ξ| > |ω + ξ|, then gbσ (ω − ξ) − gbσ (ω + ξ) ≤ 0. For any such ω we also have |2−j ω − ξ| > |2−j ω + ξ|, and gbσ (2−j ω − ξ) − gbσ (2−j ω + ξ) ≤ 0. • If |ω − ξ| = |ω + ξ|, then gbσ (ω − ξ) − gbσ (ω + ξ) = 0. For any such ω we also have |2−j ω − ξ| = |2−j ω + ξ|, and gbσ (2−j ω − ξ) − gbσ (2−j ω + ξ) = 0. Then for all ω, gbσ (ω − ξ) − gbσ (ω + ξ) and gbσ,j (ω − ξ) − gbσ,j (ω + ξ) have the same sign for all j (either non-positive or non-negative). One can prove: ( i ai )2 ≥ i a2i if all the ai are non- P P negative or non-positive. Therefore, equation (5.7) is true, and | j ψbj,θ P o (ω)|2 ≥ j |ψbj,θ P o (ω)|2 for all θ. The lemma is proved. With Lemma 4, now we prove Theorem 13. Proof of Theorem 13. If {φJ , h, ψj,θ o : 0 ≤ j < J, θ ∈ ΘM } forms a frame, then β {φJ , h , ψj,α : 0 ≤ j < J, (α, β) ∈ {(θ, e), (θ, o), (`, p)}, θ ∈ ΘM , 0 ≤ ` < L} . 67 also forms a frame, i.e., XJ−1 X x = x ∗ φJ ∗ φ fJ + x ∗ h ∗ e h+ β x ∗ ψj,α β ∗ ψj,α , ∀x ∈ L2 (T2 ) (5.8) g j=0 α,β Therefore we can reconstruct j=0 ) from UJ2 x. With t = σ(t) − σ(−t), we are PJ−1 β σ(γ · x ∗ ψj,α able to get J−1 j=0 x ∗ ψj,α , which can also be written as x ∗ j=0 ψj,α . At this point, we have P β PJ−1 β the following updated responses XJ−1 β {x ∗ φJ , x ∗ h, x ∗ ψj,α , (α, β) ∈ {(θ, e), (θ, o), (`, p)}} j=0 With Lemma 4, {φJ , h, } forms a frame, then {φJ , h, } also forms a frame. 
PJ−1 o PJ−1 β j=0 ψj,θ j=0 ψj,α Therefore we can reconstruct the image x from the above responses: X J−1 X X^ J−1 x = x ∗ φJ ∗ φ fJ + x ∗ h ∗ e h+ x∗ β ψj,α ∗ β ψj,α (5.9) α,β j=0 j=0 5.3 Synthesis algorithm Since both operator UJ1 and operator UJ2 (combined with the low pass and high pass co- efficients) are invertible, we can adapt the iterative projection algorithm of [2] in order to synthesize a new texture x∗ with approximately the same statistical profile as x. We describe our version of the projection algorithm in this section. Algorithm 1 summarizes the following synthesis process. Let us first collect the statistics described in Section 5.2: • SJ0 x := six pixel intensity statistics, given by the mean, variance, skewness, kurtosis, min, and max of (x(u))u∈T2 . • SJ1 x := {Var(x ∗ φJ ) , Var(x ∗ h) , Cx1 }, which are the statistics derived from the first layer coefficients. 68 Algorithm 1 Projection algorithm for texture synthesis Input reference image x ; Output new texture image x∗ ; compute target statistics SJ x = (SJ0 x, SJ1 x, SJ2 x) ; initialize: x∗ ← x0 (uniform noise) ; x∗ ← mod_intensities(x∗ , SJ0 x) ; while error > ε do compute UJ1 x∗ = {UJ1 x∗φJ , UJ1 x∗h , UJ1 x∗ψ } ; UJ1 x∗φJ ← mod_variance(UJ1 x∗φJ , Var(x ∗ φJ )) ; UJ1 x∗h ← mod_variance(UJ1 x∗h , Var(x ∗ h)) ; UJ1 x∗ψ ← mod_correlation(UJ1 x∗ψ , Cx1 ) ; compute UJ2 x∗ from UJ1 x∗ψ ; UJ2 x∗φJ ← mod_variance(UJ2 x∗φJ , SJ2 xφJ ) ; UJ2 x∗h ← mod_variance(UJ2 x∗h , SJ2 xh ) ; UJ2 x∗ψ ← mod_correlation(UJ2 x∗ψ , SJ2 xψ ) ; reconstruct UJ1 x∗ψ ← reconstruct_layer1(UJ2 x∗φJ , UJ2 x∗h , UJ2 x∗ψ ) ; reconstruct x∗ ← reconstruct_x(UJ1 x∗φJ , UJ1 x∗h , UJ1 x∗ψ ) ; x∗ ← mod_intensities(x∗ , SJ0 x) ; update error kSJ x∗ − SJ xk ; end • SJ2 x := {SJ2 xφJ , SJ2 xh , Cx2 }, which consists of the second layer low pass variances (SJ2 xφJ ) and high pass variances (SJ2 xh ), and the second layer correlation statistics between ReLU wavelet maps (Cx2 ). Now let UJ1 xψ denote the collection of nonlinear ReLU wavelet coefficient maps of UJ1 x; let UJ2 xφJ denote the collection of second layer low pass maps; let UJ2 xh denote the collection of second layer high pass maps; and let UJ2 xψ denote the collection of second layer nonlinear ReLU wavelet coefficient maps. Given a reference image x, we start by computing its statis- tical profile SJ x = (SJ0 x, SJ1 x, SJ2 x). We then initialize our synthesized image with a random noise image x0 where each x0 (u) is an i.i.d. sample from the uniform distribution. Let xt denote the synthesized image after t iterations. The algorithm first updates xt by directly modifying its intensities (xt (u))u∈T2 so that SJ0 xt = SJ0 x. It then computes UJ1 xt using the modified xt . The low pass coefficients xt ∗ φJ and the high pass coefficients xt ∗ h are adjusted so that Var(xt ∗ φJ ) = Var(x ∗ φJ ) and Var(xt ∗ h) = Var(x ∗ h). All of these 69 steps are computed in the exact same fashion as [2]. Finally, the nonlinear coefficient maps UJ1 xtψ are adjusted to match the target correlation matrix so that Cx1t = Cx1 ; this step is performed in a way that is similar to how [2] updates the wavelet modulus coefficients. If the algorithm is using only first layer statistics, it then inverts UJ1 xt to obtain xt+1 , and the process repeats itself. On the other hand, if the algorithm is using second layer statistics, it then decomposes the updated maps UJ1 xtψ further by computing UJ2 xt . 
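The second-layer update is described next; before that, the control flow of Algorithm 1 can be illustrated with a deliberately stripped-down sketch. Only the variance-matching projection is instantiated, and the "wavelet transform" is a trivial invertible two-channel split standing in for $U_J^1$; mod_intensities, mod_correlation, and the two-layer pass all follow the same project-then-invert pattern.

```python
import numpy as np

def mod_variance(y, target_var):
    """Rescale y about its mean so that Var(y) matches target_var."""
    m, v = y.mean(), y.var()
    return m + (y - m) * np.sqrt(target_var / max(v, 1e-12))

def synthesize(x, decompose, reconstruct, n_iter=50):
    """Project each map of decompose(x*) onto the target variances, then invert; repeat."""
    target = {k: m.var() for k, m in decompose(x).items()}
    x_star = np.random.rand(*x.shape)                       # uniform noise initialization
    for _ in range(n_iter):
        maps = {k: mod_variance(m, target[k]) for k, m in decompose(x_star).items()}
        x_star = reconstruct(maps)
    return x_star

# Toy invertible "transform": an orthogonal two-channel split standing in for U_J^1
# (Theorem 13 plays this role for the actual wavelet statistics).
def decompose(z):
    return {"lo": (z[..., ::2] + z[..., 1::2]) / 2,
            "hi": (z[..., ::2] - z[..., 1::2]) / 2}

def reconstruct(m):
    out = np.empty(m["lo"].shape[:-1] + (2 * m["lo"].shape[-1],))
    out[..., ::2] = m["lo"] + m["hi"]
    out[..., 1::2] = m["lo"] - m["hi"]
    return out

x_ref = np.random.rand(64, 64)
x_new = synthesize(x_ref, decompose, reconstruct)
for k in ("lo", "hi"):
    print(k, "target var %.5f, synthesized var %.5f"
          % (decompose(x_ref)[k].var(), decompose(x_new)[k].var()))
```

Because the toy transform is exactly invertible and the constraints do not interact, this loop converges in a single pass; with the real frame and the correlation constraints the projections do interact, which is why the iteration in Algorithm 1 is needed.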
The algorithm updates the collection of second layer maps UJ2 xt by matching SJ2 xt to SJ2 x in a similar fashion as the first layer. At this point, the algorithm inverts the updated collection {xt ∗φJ , xt ∗h , UJ2 xt } using Theorem 13 to obtain xt+1 . 5.4 Numerical Results We implement several texture synthesis experiments with the goals of (i) numerically verify- ing theoretical assertions made in previous sections; (ii) understanding the effect of hyper- parameter choices on the quality of synthesized textures; and (iii) comparing to other com- monly used algorithms. Our texture images are taken from DTD database1 and CG Texture database2 . Every image is resized to 256 × 256 for consistency. In our experiments, the wavelet transform is implemented in the frequency field using the fast Fourier transform. Therefore, we also periodize certain images to avoid border effect [70]. For directional wavelets, we fix the total number of rotations as M = 4. For omnidirectional wavelets, we fix the total number of oscillations at L = 4. The maximum scale is Jmax = 6. In the following subsections, we numerically illustrate the advantages of using all three types of filters (Section 5.4.1); we examine the role of the maximum scale 2J (Section 5.4.2); and we compare the one-layer synthesis to the two-layer synthesis, thus examining the role of network depth (Section 5.4.3). Finally, in Section 5.4.4 we also compare our results to Gatys et al. [3] and Portilla and Simoncelli [2]. 1 https://www.robots.ox.ac.uk/ vgg/data/dtd/ 2 https://www.textures.com/ 70 5.4.1 Filter comparison Theorem 12 proves both even and odd wavelets are important for texture synthesis. Figure 5.7 validates this theory numerically. We see, for example, that synthesis with odd wavelets is prone to blurring edges and even flipping the colors of enclosed regions that should have the same color (e.g., the last row of Figure 5.7). When using only even wavelets, images with banded colors, for example the third row of Figure 5.7, are blurred. Synthesized images using both even and odd wavelets are generally a clear improvement over their single wavelet-type counterparts. Figure 5.8 shows the necessity for using omnidirectional wavelets. The advantages are clearest for rounded shapes or swirls. Such patterns have no clear direction. For example, in the swirly texture in the last row of Figure 5.8, the omnidirectional wavelets do a better job of reproducing the long swirls. The round-shaped pebbles and dots (rows one and three of Figure 5.8) have smoother edges and a cleaner background when adding the omnidirectional wavelets. 5.4.2 Comparison of maximum scale Figure 5.9 shows synthesis results with different numbers of scales from the two layer model. With J = 3, the statistics are not able to capture macroscopic patterns. Therefore the edges of synthesized bricks (rows one and three) and frames (second to last row) are not straight or continuous. The reproduced house (second row) and dots (third last row) are more random compared to the original image. Larger scales can also capture longer swirls (last row). However, J = 5 and J = 6 achieve equivalent perceptual accuracy, proving there is no need to add in larger scales than J = 5. Indeed, the effective receptive field of the two layer synthesis with J = 5 is equivalent to the single layer receptive field of Jmax = 6. 
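Returning briefly to the filter comparison of Section 5.4.1, the one-dimensional mechanism behind it (Remarks 1 and 2 and Figure 5.6) can be reproduced in a few lines. The Gaussian window, carrier frequency, grid, and unit-step form of $y_2$ below are convenience assumptions rather than the settings used in the experiments.

```python
import numpy as np

t = np.linspace(-8, 8, 2048)
dt = t[1] - t[0]
g = np.exp(-t**2 / 2)                    # Gaussian window (assumed)
xi = 6.0                                 # carrier frequency (assumed)
psi_e = g * np.cos(xi * t)               # even wavelet  g(t) cos(xi t)
psi_o = g * np.sin(xi * t)               # odd wavelet   g(t) sin(xi t)
y2 = np.where(t >= 0, 1.0, 0.0)          # jump function (unit step, assumed form)

def relu_corr(y, psi):
    """2x2 correlation matrix of relu(+-(y * psi)) over the two signs."""
    w = np.convolve(y, psi, mode="same") * dt
    feats = [np.maximum(s * w, 0.0) for s in (+1.0, -1.0)]
    return np.array([[np.sum(a * b) * dt for b in feats] for a in feats])

for name, psi in (("even", psi_e), ("odd", psi_o)):
    gap = np.abs(relu_corr(y2, psi) - relu_corr(-y2, psi)).max()
    print(f"{name} wavelet: max |C_y2 - C_(-y2)| = {gap:.2e}")
```

The even-wavelet difference should come out several orders of magnitude smaller than the odd-wavelet one; this is the same asymmetry that produces the blurred bands and flipped shades in Figure 5.7.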
We also observe that using smaller scales gives more variety of the pattern arrangements, while large scale statistics have the potential to duplicate the reference image. 71 Figure 5.7 Synthesis results from one layer (J = 6) with different types of wavelet filters. Left: Original image. Middle left: First layer synthesis results with only odd wavelets. Middle right: First layer synthesis results with only even wavelets. Right: First layer synthesis results with both even and odd wavelets. 72 Figure 5.8 Synthesis results from two layers (J = 5) with/without omnidirectional wavelets. Left: Original image. Middle: 2nd layer synthesis with even and odd wavelets. Right: 2nd layer synthesis with even, odd and omnidirectional wavelets. 73 Figure 5.9 Synthesis results from two layers with different number of scales. Left: Original image. Middle Left: 2nd layer synthesis results with J = 4. Middle Right: 2nd layer synthesis results with J = 5. Right: 2nd layer synthesis results with J = 6. 74 5.4.3 Layers analysis For many texture images, the one-layer model can synthesize images of high quality. However for images with more complicated structures, multiple layers can provide a boost in visual quality. As discussed in [10], deeper layers decompose high frequency information that is aggregated into large frequency bins with a single wavelet transform. Figure 5.10 shows images that achieved better quality with second layer statistics. For most images, the one- layer model captures general structures while the two-layer model refines the details, e.g., reproducing more accurate shapes, preserving long edges and swirls, fixing blurriness. 5.4.4 Methods comparison We use even and odd directional wavelets, along with omnidirectional wavelets as our pre- selected filters in our final model. We also set J = 5 and use the two-layer model. Our results are compared with [2, 3] in Figure 5.11. The textures synthesized by our model are generally equivalent to, or superior than, the images synthesized by Portilla and Simoncelli [2]. In fact, while not depicted in Figure 5.11, this result holds even if we restrict to one layer, indicating the combination of the ReLU and the selection of even, odd, and omnidirectional filters may provide a more complete statistical description of texture images. With respect to Gatys et al. [3], the results are more nuanced. Our results are generally superior for textures with long, rigid edges even though their model is much deeper than our model. Additionally, textures with rigid patterns, but not necessarily long straight lines (e.g., left, last row; right, rows five, eight, nine) also have visually more appealing synthesis results via our method. These results can be attributed to the use of multiscale filters, although, even then, the results of Section 5.4.2 suggest that such an analysis might be too simplistic. For example, it is possible that a depth three wavelet network with J = 4 might also achieve similar performance to our current implementation with two layers and J = 5, which if so would raise questions with respect to other aspects of the VGG network. 75 Figure 5.10 Synthesis results with one layer model and two layer model. Left: Original image. Middle: 1st layer synthesis results. Right: 2nd layer synthesis results. 76 Moving on we see that [3] obtains superior performance for images with long, non-rigid curves, such as the pebbles and onions (left, rows three and nine), fireworks (left, row eight) and swirling type images (right, rows seven and ten). 
Nevertheless, these results are in line with the observations in Section 5.4.3, which suggested that these types of textures require depth in order to capture their complex patterns. Other images exhibit more subtle differences. For example, in the cracked earth image (left, row four), the synthesized image of [3] creates a bold effect on the cracks that is not present in the original. In the crossing image (left, row ten), the background illumination pattern is only correctly preserved by our method. Lastly, the multi-colored dots (right, last row) have a cleaner background with [3], but our method creates dots with colors that are not present in the original image, thus showing greater variability. 5.5 Implementation details In this section, we describe more implementation details and analysis on numerics. 5.5.1 Reduction of second layer statistics Recall at the second layer we compute: ( J−1 ! ) X β 2 1 UJ x := UJ σ(γ · x ∗ ψj,α ) : (α, β) ∈ {(θ, e), (θ, o), (`, p)}, θ ∈ ΘM , 0 ≤ ` < L . j=0 that is: ( J−1 J−1 X X UJ2 x := σ(γ1 · x ∗ ψjβ11,α1 ) ∗ φJ , σ(γ1 · x ∗ ψjβ11,α1 ) ∗ h, j1 =0 j1 =0 J−1 ! ! X σ γ2 · σ(γ1 · x ∗ ψjβ11,α1 ) ∗ ψjβ22,α2 : j1 =0 ) (α1 , β1 ), (α2 , β2 ) ∈ {(θ, e), (θ, o), (`, p)}, γ1 , γ2 ∈ {+1, −1} . In particular for the third item, we apply another layer of the wavelet transform to the first layer responses. In numerical experiments for directional wavelets, we find J−1 P j1 =0 σ(γ1 · x ∗ 77 Figure 5.11 Synthesis results compared to other models. Left: Original images. Middle Left: Results from Portilla and Simoncelli [2]. Middle Right: Results from Gatys et al. [3]. Right: Results from our two layer model. 78 Figure 5.12 Synthesis results compared to other models. Left: Original images. Middle Left: Results from Portilla and Simoncelli [2]. Middle Right: Results from Gatys et al. [3]. Right: Results from our two layer model. 79 ψjβ11,θ1 ) is essentially supported at the direction θ = θ1 . Therefore to reduce the number of total statistics and save computation, we set θ2 = θ1 , i.e., the second layer of directional wavelets has the same direction as the first layer directional responses. We also add a residual wavelet to keep track of the residual frequencies and match the variance. We numerically verified with this restriction, there is little loss in the synthesized image quality. 5.5.2 Matching of second layer statistics When we use the second layer statistics to synthesize texture images, we initialize the image with the first layer result. This is also equivalent to matching the first layer statistics until convergence, then matching both first layer and second layer statistics. In practice, we see the synthesized image has better quality than matching both first and second layer statistics from noise, i.e., matching both from the beginning. Figure 5.13 compares the synthesized images using these two strategies. In particular, the middle right column shows synthesized images initialized from first layer result and the right column shows synthesized images initialized from noise. We notice matching only first layer statistics significantly reduces the first layer loss and the learned images are already well structured. Initialized from such images, second layer statistics refine the details. On the other hand, initializing from noise fails to reduce the first layer loss to a very small value. The synthesized images generally lose large structures and are not as good as results from the other strategy. 
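One further implementation detail worth making concrete is the periodized, FFT-domain wavelet convolution mentioned in Section 5.4. The Gabor-style filter below is only a stand-in for the even/odd directional wavelets of Section 5.2.1, and the parameter values and the $2^{-2j}$ dilation convention are assumptions made for illustration.

```python
import numpy as np

def periodic_conv(x, psi):
    """Circular convolution of an NxN image with an NxN filter via the FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(psi)))

def directional_wavelet(N, j, theta, parity, sigma0=0.85, xi0=3 * np.pi / 4):
    """Gabor-style even/odd wavelet at scale 2**j and angle theta (assumed form)."""
    u1, u2 = np.meshgrid(np.arange(N) - N // 2, np.arange(N) - N // 2, indexing="ij")
    u1, u2 = u1 / 2.0**j, u2 / 2.0**j                   # dilate the coordinates
    along = u1 * np.cos(theta) + u2 * np.sin(theta)     # coordinate along the direction
    g = np.exp(-(u1**2 + u2**2) / (2 * sigma0**2))
    carrier = np.sin(xi0 * along) if parity == "odd" else np.cos(xi0 * along)
    psi = g * carrier / 2.0**(2 * j)                    # 2^{-2j} psi(2^{-j} u)
    return np.fft.ifftshift(psi)                        # center the filter at pixel (0, 0)

N, M, J = 256, 4, 6
x = np.random.rand(N, N)                                # stand-in for a texture image
coeffs = {(j, m, p): periodic_conv(x, directional_wavelet(N, j, np.pi * m / M, p))
          for j in range(J) for m in range(M) for p in ("even", "odd")}
print(len(coeffs), "directional wavelet maps of size", coeffs[0, 0, "even"].shape)
```

Computing the transform in the frequency domain makes every convolution circular, which is why the images are periodized to avoid border effects, as noted in Section 5.4.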
5.5.3 Analysis of number of iterations In this section we discuss the relationship among the number of iterations, loss, and image quality. Figures 5.14, 5.15, 5.16 show the synthesis process of different textures. For all of these tests, we run second layer synthesis and run for 600 iterations. We plot the logarithm kSJi x−SJi x? k of the relative loss in these plots, i.e., log10 (loss) where loss = kSJi xk for i = 1, 2, x is the reference image, and x? is the synthesized image. 80 Figure 5.13 Left: Original images. Middle left: Images synthesized from first layer. Mid- dle right: Images synthesized from second layer, initialized from first layer result. Right: Images synthesized from second layer, initialized from uniform noise. 81 Figure 5.14 shows the synthesis process of micro-textures plus flowers. We start from noise to match the second layer to simplify the analysis. After iteration 0, the images already look much better than noise, although the colors are not fully matched. At iteration 40, the second layer loss is reduced to approximately 10−3.5 and the synthesized images are of high quality. As we continue the algorithm, at iteration 100 and 590, the images smoothly shift from one sample to another sample which are from the same texture class and the loss barely changed. In general, for such micro-textures, the algorithm converges fast and the loss is reduced long before we stop. Figure 5.15 shows the synthesis process for textures that have more structure. Still we start from noise. At iteration 0, the colors are mismatched, similar to Figure 5.14. At iteration 40 and 100, the colors are matched and general structures are learned, but the holes and frames are not aligned perfectly. Also the relative loss is not reduced. Finally at iteration 590, the second layer loss is reduced to around 10−4 and the synthesized images are of good quality. The algorithm on these types of images generally takes longer to converge than for micro-textures. As the number of iterations increases, the image quality is continuously improved. Figure 5.16 shows the synthesis process of a texture that has intricate patterns. Here we start from the first layer result to show how the second layer statistics improve the image quality. At iteration 0, which is essentially the synthesized image from the first layer experiment, there are barely swirls but only curves. As the algorithm goes on, we notice the first layer loss actually increases while the second layer loss is decreasing. The increasing first layer loss generally does not affect image quality so long as the second layer loss decreases. By the last iteration, the synthesized image contains longer and smoother swirls. 5.6 Conclusion We presented a unique texture synthesis algorithm that melds aspects of Portilla and Simon- celli [2], Gatys et al. [3], and Mallat [10], while also incorporating new ideas on filter design, 82 Figure 5.14 The synthesis process for different micro-textures plus flowers. 83 Figure 5.15 The synthesis process for different macro-textures with rigid patterns. Figure 5.16 The synthesis process when initializing from the first layer synthesis, for a texture with complex patterns. 84 multi-layer structure, and the invertibility of CNNs. Our numerical analysis provides insight into the workings of statistics-based texture synthesis algorithms. Synthesized textures are competitive with the state-of-the-art and in some cases superior to [3], thus providing a potential alternative. 
Nevertheless, issues such as the trade-off between network depth and filter scale are not fully resolved and invite future research, which we explore in the next chapter.

CHAPTER 6 MULTILAYER MODEL ANALYSIS

6.1 Introduction

In this chapter, we continue our analysis of the statistical texture synthesis model proposed in Chapter 5. Intuitively, large filters and deep layers both increase the size of the receptive fields in the network and preserve long range dependence in the image, but they do so in different ways. From the numerical tests, we observed that the model with large filters keeps long straight lines, while the multilayer model preserves long curves. Conversely, the one-layer wavelet model does not capture intricate curves, and the multilayer VGG model struggles to reproduce rigid patterns. There has also been discussion of the trade-off between model depth and the size of the convolution filters in [15, 3]. In this chapter, we develop a mathematical analysis of filter size and model depth. We investigate certain types of texture classes and discuss the necessity of large filters. We also discuss the relationship between model depth and the higher order derivatives of a signal. Specifically, we continue the analysis of Chapter 5 in the following ways:

• We propose a model to represent the ReLU architecture.
• We provide a new perspective for understanding the even and odd wavelets.
• We prove that if the largest filter is not large enough to meet certain criteria, the gram matrix fails to distinguish different textures.
• We prove that, with small filters, deep models intricately partition the high order derivatives of the given signal.

6.2 Multilayer multiscale model

6.2.1 Filters

Let $\phi : \mathbb{R}^2 \to \mathbb{R}$ be a low pass filter satisfying the following conditions:

• $\phi(u) \ge 0$ for all $u \in \mathbb{R}^2$.
• $\phi$ is isotropic, i.e., $\phi(u) = \phi(R_\theta u)$ for all $\theta$, where $R_\theta : \mathbb{R}^2 \to \mathbb{R}^2$ is the rotation operator (rotation by $\theta$).
• $\phi$ is compactly supported in $[-a, a]^2 \subset \mathbb{R}^2$.

Define the wavelet family $\{\psi_{k,n}\}_{k,n}$ such that
$$ \psi_{k,n}(u) = D^{(n)}_{\vec{v}_k} \phi(u) \qquad (6.1) $$
is the $n$-th derivative of $\phi$ along the direction $\vec{v}_k$, where $\vec{v}_k = R_\theta(\vec{v}_0)$, $\vec{v}_0 = (1, 0)$, and $\theta = k\pi/K$. Here $K$ is the total number of rotations. The filters can also be dilated:
$$ \psi_{j,k,n}(u) = 2^{-2j}\, \psi_{k,n}(2^{-j} u), \qquad 0 \le j < J. $$
Since
$$ D^{(n)}_{\vec{v}_k} \phi_j(u) = D^{(n)}_{\vec{v}_k}\big(2^{-2j} \phi(2^{-j} u)\big) = 2^{-2j} \cdot 2^{-nj}\, \big(D^{(n)}_{\vec{v}_k}\phi\big)(2^{-j} u) = 2^{-nj}\, \psi_{j,k,n}(u), $$
we also have $\psi_{j,k,n}(u) = 2^{nj}\, D^{(n)}_{\vec{v}_k} \phi_j(u)$.

Remark 3. When $n = 1$, $\psi_{k,n}$ is an odd directional wavelet with direction $\theta = k\pi/K$; when $n = 2$, $\psi_{k,n}$ is an even directional wavelet with direction $\theta = k\pi/K$.

The derivative wavelets defined in Equation 6.1 are not exactly the directional wavelets used in Chapter 5: the definition is different, and the derivative wavelets have compact support. However, since the derivative wavelets at $n = 1, 2$ look similar to the directional wavelets, they can be thought of as a generalized version of the directional wavelets, with additional flexibility gained by adjusting $n$. Therefore, in what follows we analyze the filters from Chapter 5 through the derivative wavelets defined in Equation 6.1.

6.2.2 Wavelet transform

Let $x : \mathbb{R}^2 \to \mathbb{R}$ be a 2D signal. The wavelet transform can be rewritten as
$$ x \ast \psi_{j,k,n}(u) = 2^{nj}\, x \ast D^{(n)}_{\vec{v}_k} \phi_j(u) = 2^{nj}\, D^{(n)}_{\vec{v}_k}(x \ast \phi_j)(u). $$
The above equation shows that the wavelet transform of a signal with a derivative wavelet equals the derivative of the signal convolved with the low pass filter, up to a constant $2^{nj}$ that depends on the derivative order $n$ and the scale parameter $j$.

Lemma 5.
If x is isotropic, i.e., x(u) = x(Rθ u), ∀u, then x ∗ ψj,k1 ,n = x ∗ ψj,k2 ,n , ∀k1 , k2 . Proof. If x is isotropic, then (n) x ∗ ψj,k1 ,n (u) = 2nj D~vk (x ∗ φj )(u) 1 (n) (x, φj are isotropic) = 2nj D~vk (x ∗ φj )(u) 2 = x ∗ ψj,k2 ,n (u) 6.2.3 Gram matrix Recall in Chapter 5, we use the gram matrix between ReLU wavelet response to represent a texture image for synthesis and achieved good performance. Here we generalize the definition 88 for derivative filters. Define gram matrix as Gx(j1 ,j2 , k1 , k2 , n1 , n2 , 1 , 2 ) : = hReLU(1 · x ∗ ψj1 ,k1 ,n1 ), ReLU(2 · x ∗ ψj2 ,k2 ,n2 )i Z = ReLU(1 · x ∗ ψj1 ,k1 ,n1 (u)) · ReLU(2 · x ∗ ψj2 ,k2 ,n2 (u))du R2 Z     n1 j1 +n2 j2 (n ) (n ) =2 ReLU 1 · D~vk 1 (x ∗ φj1 )(u) · ReLU 2 · D~vk 2 (x ∗ φj2 )(u) du 1 2 R2 (6.2) where 1 , 2 ∈ {+1, −1}. In order to better analyze the correlation, we are going to decompose the integral into the correlation and the integral support. Let  (n ) k1 π Ex(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) := u ∈ R2 :1 · D~vk 1 (x ∗ φj1 )(u) ≥ 0, ~vk1 = Rθ1 ~v0 , θ1 = , 1 K  (n2 ) k2 π 2 · D~vk (x ∗ φj2 )(u) ≥ 0, ~vk2 = Rθ2 ~v0 , θ2 = 2 K (6.3) denote the ReLU support that the integral is integrated on. We can rewrite the gram matrix as Gx(j1 ,j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = Z (6.4) n1 j1 +n2 j2 (n ) (n ) 1 2 2 D~vk 1 (x ∗ φj1 )(u) · D~vk 2 (x ∗ φj2 )(u)du 1 2 Ex(j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) The following special cases include the statistical representations we computed at the first layer for texture images in Chapter 5. • Consider n1 = n2 = 0, there is no derivatives or directional derivatives. We can assume k1 = k2 = 0. If x ≥ 0, e.g., natural images, then x ∗ φj ≥ 0. Then Z Gx(j1 , j2 , 0, 0, 0, 0, +1, +1) = x ∗ φj1 · x ∗ φj2 dt R2 Z 1 ∗ = 2 x\∗ φj1 (ω) · x\ ∗ φj2 (ω)dω (2π) C2 Z 1 ∗ = 2 b(ω) · φc x j1 (ω) · xb∗ (ω) · φc j2 (ω)dω (2π) C2 Z 1 (suppose j1 ≥ j2 ) ≈ x(ω)|2 · |φc |b 2 j1 (ω)| dω (2π)2 C2 = kx ∗ φj1 k22 89 Therefore Gx(j1 , j2 , 0, 0, 0, 0, +1, +1) captures the `2 norm of the low pass responses. • Gx(j1 , j2 , k1 , k2 , 1, 1, 1 , 2 ) captures the odd wavelet gram matrix between different angles and scales. • Gx(j1 , j2 , k1 , k2 , 2, 2, 1 , 2 ) captures the even wavelet gram matrix between different angles and scales. • Gx(j1 , j2 , k1 , k2 , 1, 2, 1 , 2 ) captures the gram matrix between even and odd wavelets cross angles and scales. 6.3 Texture with multiple straight lines Recall in Figure 5.9 (fourth row) and Figure 5.11 (right, first row) from Chapter 5, our model with large scale filters preserve long straight lines while both the model with only small scale filters and the model from Gatys fail to do so. In this section, we consider a special class of signals with parallel straight lines. We prove that multiscale, especially large scale filters, are essential for recovering textures with long lines, which we define in the following. Let x(u1 , u2 ) = αi 1δi (u1 ) where δi ∈ ∆ ⊂ R. Then we have the wavelet transform as P i (n) x ∗ ψj,k,n (u) = 2nj D~vk (x ∗ φj )(u) (n) X = 2nj D~vk ( αi · φ1d j (u1 − δi )) δi ∈∆ where u = (u1 , u2 ) and φ1d j is the one dimensional version of φj . Let ~ v0 = (1, 0), ~v−1 = (0, 1). 
Then (n) X dn 1d D~v0 (x ∗ φj )(u) = αi φ (u1 − δi ) δi dun1 j (n) D~v−1 (x ∗ φj )(u) = 0 90 Therefore for any ~vk = Rθk (~v0 ) = (cos θk , sin θk ), (n) (n) D~vk (x ∗ φj )(u) = cosn θk · D~v0 (x ∗ φj )(u) X dn = cosn θk · αi n φ1d j (u1 − δi ) δ du 1 i And also, X dn 1d x ∗ ψj,k,n (u) = 2 nj n · cos θk · αi n φj (u1 − δi ) (6.5) δi du1 6.3.1 Gram matrix The gram matrix for this type of signal is Gx(j1 ,j2 , k1 , k2 , n1 , n2 , 1 , 2 ) : = h(ReLU(1 · x ∗ ψj1 ,k1 ,n1 ), ReLU(2 · x ∗ ψj2 ,k2 ,n2 )i Z X dn1 = 1 2 2n1 j1 +n2 j2 · cosn1 θk1 · cosn2 θk2 αi n1 φ1d j1 (u1 − δi ) Ex(j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) δi du1 X dn2 · αi n2 φ1d j2 (u1 − δi ) du1 δi du 1 Z n1 j1 +n2 j2 n1 n2 = 1 2 2 · cos θk1 · cos θk2 Ex(j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) n1 n2 XX d 1d d αi1 αi2 n1 φj1 (u1 − δi1 ) φ1d (u1 − δi2 ) du1 δi1 δi 2 du1 dun1 2 j2 (6.6) Note dn1 1d  X Ex(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = u ∈ R2 : 1 · cosn1 θk1 αi φ (u1 − δi ) > 0, δi dun1 1 j1 X dn2  (6.7) n2 1d 2 · cos θk2 αi n2 φj2 (u1 − δi ) > 0 δ du1 i Equation 6.7 shows the integral support only depends on u1 . Moreover, it is contained in the union of the band around u1 = δi . Our next section explores more on when J is small and the bands don’t overlap among different δi . 91 Besides the following remark shows, when the scales (j1 , j2 ), derivative orders (n1 , n2 ) and signs (1 , 2 ) are fixed and only vary the filters’ directions (k1 , k2 ), the gram matrix between the two ReLU responses is proportional to a value that depends on k1 , k2 . Gx(j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) cosn1 θk1 ·cosn2 θk2 Remark 4. Gx(j1 ,j2 ,k10 ,k20 ,n1 ,n2 ,1 ,2 ) = cosn1 θk0 ·cosn2 θk0 1 2 Moreover, the following remark summarizes, when the angle θk1 = mπ+ π2 or θk2 = mπ+ π2 , the gram matrix is zero. Remark 5. If cosn1 θk1 = 0 or cosn2 θk2 = 0, then Gx(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = 0, ∀j1 , j2 , n1 , n2 , 1 , 2 . 6.3.2 Small J Now let us assume J is small. Recall φ is supported in [−a, a], then φj has support [−2j a, 2j a] and φJ−1 has support [−2J−1 a, 2J−1 a]. Let dmin demote the smallest pairwise distance be- tween the diracs in x. The following theorem proves when J is small, the gram matrix cannot distinguish one line texture from another under certain conditions. Theorem 14. Let x1 (u) = αi1 1δi1 (u1 ) and x2 (u) = βi2 1δi2 (u1 ). Let d1min and P P δi1 ∈∆1 δi2 ∈∆2 d2min denote the smallest pairwise distance between diracs of x1 and x2 , respectively. Assume φ is supported in [−a, a]. If 2J a < dimin for i ∈ {1, 2} and βi2 , then P 2 P 2 αi1 = δi1 ∈∆1 δi2 ∈∆2 Gx1 = Gx2 . Note the above theorem is related to Theorem 3 in Chapter 3. The 2D signal in this section can be decomposed into a sparse signal along one direction and a constant along another direction. Theorem 14 provides a necessary condition to distinguish two signals of such type while Theorem 3 provides a sufficient condition to identify a 1D sparse signal. Proof of Theorem 17. Consider the signal x1 . When 2J a < d1min , φ1d j1 (u1 −δi ) does not overlap with φ1d j2 (u1 − δi0 ) for any δi , δi0 ∈ ∆1 , i 6= i , 0 ≤ j1 , j2 < J. 
Therefore the nonzero set from 0 92 (6.7) can be decomposed into [  dn1 Ex1 (j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = u ∈ R2 : 1 · cosn1 θk1 αi n1 φ1d j1 (u1 − δi ) > 0, δi ∈∆1 du 1 dn2 1d  n2 2 · cos θk2 αi n2 φj2 (u1 − δi ) > 0 du1 [ : == Exi1 (j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) δi ∈∆1 (6.8) Note 0 Exi1 (j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = Exi1 (j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) − δi0 + δi (6.9) i.e., if 0 t0 ∈ Exi1 (j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ), then t = t0 − δi0 + δi ∈ Exi1 (j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ). Also the argument inside the integral from Equation 6.6 is nonzero only along the diag- onal: XX dn1 1d dn2 1d X 2 d n1 1d dn2 1d αi1 αi2 φ (t 1 − δ i ) φ (t 1 − δ i ) = α φ (t 1 − δ i ) φ (t1 − δi1 ) δi 1 δi 2 dtn1 1 j1 1 dtn1 2 j2 2 δ i1 dtn1 1 j1 1 dtn1 2 j2 i1 (6.10) Inserting (6.8) and (6.10) into (6.6) we get Gx1 (j1 ,j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = 1 2 2n1 j1 +n2 j2 · cosn1 θk1 · cosn2 θk2 X Z dn1 dn2 1d αi21 n1 φ1d (u 1 − δ i ) φ (u1 − δi1 )du1 δ ∈∆ Exi1 (j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) du1 j1 1 dun1 2 j2 i 1 = 1 2 2n1 j1 +n2 j2 · cosn1 θk1 · cosn2 θk2 (6.11) dn1 1d dn2 1d X Z 2 αi1 φ n 1 j1 (u 1 − δ i ) φ (u1 − δi1 )du1 δi ∈∆1 Exi1 (j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) du1 1 dun1 2 j2 ! X (with (6.9)) = 1 2 2n1 j1 +n2 j2 · cosn1 θk1 · cosn2 θk2 αi21 δi ∈∆1 Z n1 n2 d 1d d 1d n1 φj1 (u1 − δ0 ) n2 φj2 (u1 − δ0 )du1 Ex01 (j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) du1 du1 93 Figure 6.1P Two line texture images T that have the same summation P of height √ squared. Left: i=1 132i (u1 ), u ∈ [Z [0, 256)] . Right: 2 · 164i (u1 ), u ∈ 8 2 4 x1 (u) T = x2 (u) = i=1 [Z [0, 256)] . 2 Figure 6.2 The gram matrices and the difference for the two line textures in Figure 6.1. Left: The gram matrix Gx1 between ReLU responses for texture x1 . Middle: The gram matrix Gx2 between ReLU responses for texture x2 . Right: The difference between the two gram matrices |Gx1 − Gx2 |. One can get similar results for x2 with αi21 substituted by βi21 . Then the conclusion P P δi ∈∆1 δi ∈∆2 is proved. Figure 6.1 and 6.2 verify Theorem 14 numerically. The two line textures x1 (u) = P4 √ 1 and 2 · 164i (u1 ), u ∈ [Z [0, 256)]2 both have summation P8 T i=1 32i (u 1 ) x 2 (u) = i=1 of height squared equal to 8. For the selected wavelet family, we choose J = 2 so that 2J a < dmin where dmin = 32 in this example. Figure 6.2 shows there is little difference be- tween the two gram matrices except for the numerical errors, thus verifying our conclusion numerically. 94 6.3.3 Deep layers One may argue that the gram matrix fails in distinguishing the two signals due to the fact that we only have one layer of convolution. In this section, we prove that even with multiple layers, if the filters are tiny, the gram matrix still fails to distinguish different line textures. Recall the first layer wavelet coefficients for the line texture: X dn 1d x ∗ ψj,k,n (u) = 2nj · cosn θk · αi φ (u1 − δi ) (6.12) δi dun1 j dn 1d dn 1d Remind σ(x) = (ReLU )(x). Let dun φj,δi (u) = dun φj (u − δi ) and represent the element- 1 1 wise multiplication. 
The second layer response can be developed in a similar way: σ(x ∗ ψj1 ,k1 ,n1 ) ∗ ψj2 ,k2 ,n2 (u) (n ) =σ(x ∗ ψj1 ,k1 ,n1 ) ∗ 2n2 j2 · D~vk 2 φj2 (u) 2 (n ) =2n2 j2 D~vk 2 [σ(x ∗ ψj1 ,k1 ,n1 )] ∗ φj2 (u) 2 (n ) =2n2 j2 · cosn2 θk2 · D~v0 2 [σ(x ∗ ψj1 ,k1 ,n1 )] ∗ φj2 (u) dn2   n2 j2 n2 0 =2 · cos θk2 · σ (x ∗ ψj1 ,k1 ,n1 ) [x ∗ ψj1 ,k1 ,n1 ] ∗ φj2 (u) dun2  X dn1 =2 n2 j2 · cos θk2 · σ 0 (2n1 j1 · cosn1 θk1 · n2 αi n1 φ1d j1 ,δi ) δ du 1 n2 n1 i d X d n [2n1 j1 · cosn1 θk1 · αi n1 φj1d1 ,δi ] ∗ φj2 (u) du 2 δi du1  X dn1 X dn1 +n2  n1 j1 +n2 j2 n1 n2 0 n1 1d 1d =2 · cos θk1 · cos θk2 · σ (cos θk1 · αi n1 φj1 ,δi ) αi n1 +n2 φj1 ,δi ∗ φj2 (u) δ du 1 δ du1 i i dn 1d When J is small such that dun φj,δi (u) do not overlap at different i, the last line of the above 1 equation equals to dn1 dn1 +n2 1d X   2n1 j1 +n2 j2 n1 n2 · cos θk1 · cos θk2 · αi σ 0 (cosn1 θk1 · n1 φ1d ) φ ∗ φj2 (u). δi du1 j1 ,δi du1n1 +n2 j1 ,δi The term inside the big parenthesis is supported in [δi −2j1 a, δi +2j1 a]. With the convolution with φj2 , the support is [δi − 2j1 +j2 a, δi + 2j1 +j2 a]. This term is equivariant to translations on δi . Therefore when we apply the ReLU and gram matrix at the second layer, we get the 95 same conclusion as the first layer result. Define Z 0 G2 x(λ, λ ) = σ(2 · σ(1 · x ∗ ψj1 ,k1 ,n1 ) ∗ ψj2 ,k2 ,n2 )(u) · σ(02 · σ(01 · x ∗ ψj0 1 ,k1 ,n1 ) ∗ ψj0 2 ,k2 ,n2 )(u)du where λ = (j1 , k1 , n1 , 1 , j2 , k2 , n2 , 2 ). Theorem 15. Let x1 (u) = αi1 1δi1 (u1 ) and x2 (u) = βi2 1δi2 (u1 ). Let d1min and P P δi1 ∈∆1 δi2 ∈∆2 d2min denote the smallest pairwise distance between diracs of x1 and x2 , respectively. Assume φ is supported in [−a, a]. If 22J a < dimin for i ∈ {1, 2} and βi2 , then P 2 P 2 αi1 = δi1 ∈∆1 δi2 ∈∆2 G2 x1 = G2 x2 . More generally, define Gm x to be the m-layer gram matrix. We have Theorem 16. Let x1 (u) = αi1 1δi1 (u1 ) and x2 (u) = βi2 1δi2 (u1 ). Let d1min and P P δi1 ∈∆1 δi2 ∈∆2 d2min denote the smallest pairwise distance between diracs of x1 and x2 , respectively. Assume φ is supported in [−a, a]. If 2mJ a < dimin for i ∈ {1, 2} and βi2 , then P 2 P 2 αi1 = δi1 ∈∆1 δi2 ∈∆2 Gm x1 = Gm x2 . Proof. Let Sm x[λ1 , . . . , λm ](u) = σ(. . . σ(x1 ∗ψλ1 ) · · ·∗ψλm )(u) denote the m-th layer response of x. Let fm [δi , λ1 , . . . , λm ](u) = σ(. . . σ(1δi ∗ ψλ1 ) · · · ∗ ψλm ) be the multilayer response at the m-th layer of the line function. Note 1δi (u) = 1δi (u1 ). Then the m-th layer response of line signal x1 is: X Sm x1 [λ1 , . . . , λm ](u) = αi1 fm [δi1 , λ1 , . . . , λm ](u) δi1 ∈∆1 Since φ is supported in [−a, a]2 , for any 0 ≤ j < J, φj is supported in [−2J−1 a, 2J−1 a]2 . Therefore fm [δi1 , λ1 , . . . , λm ] is supported in the band: {u ∈ R2 : t1 ∈ [−2m(J−1) a, 2m(J−1) a}. The gram matrix for the line function is : Z 1 Gm δi [λ1 , . . . , λm , λ01 , . . . , λ0m ] = fm [δi1 , λ1 , . . . , λm ](u) · fm [δi1 , λ01 , . . . , λ0m ](u)du Note Gm 1δi [λ1 , . . . , λm , λ01 , . . . , λ0m ] remains a constant when changing δi . Since 2mJ a < dimin for i = 1, 2, fm [δi1 , λ1 , . . . , λm ] does not overlap with fm [δi01 , λ1 , . . . , λm ] for any δi1 6= δi01 . 96 With all of these facts, Z X Gm x1 [λ1 , . . . , λm , λ01 , . . . , λ0m ] = αi1 fm [δi1 , λ1 , . . . , λm ](u)· δi1 ∈∆1 X αi1 fm [δi1 , λ01 , . . . , λ0m ](u) δi1 ∈∆1 X Z = αi21 fm [δi1 , λ1 , . . . , λm ](u) · fm [δi1 , λ01 , . . . , λ0m ](u) δi1 ∈∆1 αi21 )Gm 1δ [λ1 , . . . , λm , λ01 , . . . , λ0m ] X =( δi1 ∈∆1 We get similar result for x2 . Therefore, Gm x1 = Gm x2 . 
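Theorem 14 (and, by the same argument, its deeper-layer analogues above) can be checked numerically along the lines of Figures 6.1 and 6.2. The sketch below reduces the computation to one dimension, since the line textures are constant along $u_2$, and uses circular convolutions on $\mathbb{Z}/256\mathbb{Z}$; the bump shape and sizes are assumptions, chosen only so that $2^J a = 16$ stays below the line spacings of 32 and 64.

```python
import numpy as np

N, a, J = 256, 4, 2
x1 = np.zeros(N); x1[32 * np.arange(8)] = 1.0            # 8 lines of height 1
x2 = np.zeros(N); x2[64 * np.arange(4)] = np.sqrt(2.0)   # 4 lines of height sqrt(2)
# both signals have sum of squared heights equal to 8

def circ_conv(x, h):
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))

def derivative_filters():
    """1D version of psi_{j,k,n}: n-th finite difference of a dilated bump, n = 1, 2."""
    out = []
    for j in range(J):
        s = a * 2**j
        t = np.arange(-s, s + 1)
        phi = np.exp(-t**2 / (2 * (s / 3.0)**2)) / 2**j   # truncated bump, width 2s+1
        for n in (1, 2):
            psi = np.zeros(N)
            d = np.diff(phi, n=n)
            psi[:d.size] = d                              # placement only shifts responses
            out.append(psi)
    return out

def relu_gram(x):
    feats = []
    for psi in derivative_filters():
        w = circ_conv(x, psi)
        feats += [np.maximum(w, 0.0), np.maximum(-w, 0.0)]  # epsilon = +1, -1
    F = np.stack(feats)
    return F @ F.T

G1, G2 = relu_gram(x1), relu_gram(x2)
print("max |G(x1) - G(x2)| =", np.abs(G1 - G2).max())
print("typical entry size  =", np.abs(G1).max())
```

The reported difference should sit at the level of floating-point error while the gram entries themselves are not small; enlarging $a$, $J$, or the depth until the (iterated) supports reach the line spacing breaks the equality, exactly as the theorems predict.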
Theorem 16 shows interesting relationship between model depth m, largest filter scale J and the minimum distance dmin between the lines. It also matches the observation of Figure 5.9 (fourth row, sixth row) from Chapter 5. According to Figure 5.9, when J = 4 the model is not able to identify the relative positions of the long lines while J = 5 and J = 6 fix the issue and are able to reproduce the lines. Motivated by Theorem 16, to avoid the inability to distinguish such textures, one can either increase filter size J or increase model depth m to break the condition 2mJ a < dimin . Note the VGG model is deep but is still not able to reproduce long lines (Figure 5.11, left, fifth row; right, first row). We infer this is resulted from the pooling function and invite further research. 6.4 Frame-like texture In this section we analyze another type of textures. Figure 6.3 shows one sample from this class. Let x(u1 , u2 ) = αi 1δi (u1 ) + βi 1ξi (u2 ). The wavelet transform for this signal is: P P δi ∈∆1 ξi ∈∆2 (n) x ∗ ψj,k,n (u) = 2nj D~vk (x ∗ φj (u)) dn 1d dn 1d  X X  nj n n = 2 cos θk αi n φj (u1 − δi ) + sin θk βi n φj (u2 − ξi ) δ ∈∆ du 1 ξ ∈∆ du2 i 1 i 2 Insert it into the gram matrix at Equation 6.2 we get: 97 Figure 6.3 Frame texture. Gx(j1 ,j2 , k1 , k2 , n1 , n2 , 1 , 2 )  n1 j1 +n2 j2 = 1 2 2 · dn1 dn2 Z X X n1 cos θk1 · cos θk2 n2 αi1 n1 φ1d j1 (u1 − δi1 ) · αi2 n2 φ1d (t1 − δi2 ) du1 du2 Ex δ du1 δi2 du1 j2 i1 dn1 dn2 1d Z X X n1 + cos θk1 · sin θk2 n2 αi1 n1 φ1d (u 1 − δ i1 ) · β i2 n2 φj2 (u2 − ξi2 ) du1 du2 Ex δ du1 j1 ξi2 du 2 i1 dn1 dn2 1d Z X X + sinn1 θk1 · cosn2 θk2 βi1 n1 φ1d (u 2 − ξi1 ) · α i2 n2 φj2 (u1 − δi2 ) du1 du2 Ex ξ du1 j1 δi 2 du 2 i1 dn1 1d dn2 1d Z X X  n1 n2 + sin θk1 · sin θk2 βi1 n1 φj1 (u2 − ξi1 ) · βi2 n2 φj2 (u2 − ξi2 ) du1 du2 Ex ξ du2 ξ du2 i1 i2 (6.13) 98 where Ex denotes  Ex(j1 ,j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = u ∈ R2 : dn1 1d dn1 1d  X X  n1 n1 1 cos θk1 αi n1 φj1 (u1 − δi ) + sin θk1 βi n1 φj1 (u2 − ξi ) > 0, δi ∈∆1 du 1 ξi ∈∆2 du2 dn2 1d dn2 1d  X X   n2 n2 2 cos θk2 αi n2 φj2 (u1 − δi ) + sin θk2 βi n2 φj2 (u2 − ξi ) > 0 δ ∈∆ du1 ξ ∈∆ du2 i 1 i 2 (6.14) If θk ∈ {0, π2 }, cos θk = 0 or sin θk = 0. Therefore if θk1 , θk2 ∈ {0, π2 }, the four items in Equation 6.13 only have one that is nonzero: • If θk1 = θk2 = 0, only the first item is nonzero. 
Ex can be simplified as: dn1  X Ex(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = u ∈ R2 :1 cosn1 θk1 αi n1 φ1d j1 (u1 − δi ) > 0, δi ∈∆1 du 1 dn2 1d X  n2 2 cos θk2 αi n2 φj2 (u1 − δi ) > 0 δ ∈∆ du1 i 1 Then with the same analysis as Theorem 14, the gram matrix can be simplified as: X Gx(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) =1 2 2n1 j1 +n2 j2 · cosn1 θk1 · cosn2 θk2 · ( αi2 )· δi (6.15) dn1 1d dn2 1d Z φ (u 1 − δ0 ) · φ (u1 − δ0 ) du1 Ex0u1 dun1 1 j1 dun1 2 j2 n n where Ex0u1 = {u ∈ R2 : 1 cosn1 θk1 α0 du d 1 1d n1 φj (u1 − δ0 ) > 0, 2 cos 1 n2 θk2 α0 dud 2 1d n2 φj (u1 − 2 1 1 δ0 ) > 0 • Similarly if θk1 = θk2 = π2 , the gram matrix is simplified as: X Gx(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) =1 2 2n1 j1 +n2 j2 · sinn1 θk1 · sinn2 θk2 · ( βi2 )· ξi Z n1 n2 (6.16) d 1d d n1 φj1 (u2 − ξ0 ) · φ1d (u2 − ξ0 ) du2 Ex0u2 du2 dun2 2 j2 99 • If θk1 = 0, θk2 = π2 , Ex can be simplified as: dn1  X Ex(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = u ∈ R2 :1 cosn1 θk1 αi n1 φ1d (u1 − δi ) > 0, δi ∈∆1 du1 j1 dn2 1d X  n2 2 sin θk2 αi n2 φj2 (u2 − δi ) > 0 ξ ∈∆ du2 i 2 := Exu1 ⊗Exu2 The gram matrix is simplified as: Gx(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) =1 2 2n1 j1 +n2 j2 · cosn1 θk1 · sinn2 θk2 dn1 dn2 1d Z X X αi1 n1 φ1d (u 1 − δ i ) · β i φ (u2 − ξi2 ) du1 du2 Ex δ du1 j1 1 δ 2 dun2 2 j2 i1 i2 =1 2 2n1 j1 +n2 j2 · cosn1 θk1 · sinn2 θk2 (6.17) dn1 1d dn2 Z X Z X αi1 n1 φj1 (u1 − δi1 ) du1 · βi2 n2 φ1d (u2 − ξi2 ) du2 Exu1 δ du1 Exu2 δ du2 j2 i1 i2 =1 2 2n1 j1 +n2 j2 · cosn1 θk1 · sinn2 θk2 · dn1 1d dn2 1d X Z X Z ( α i1 ) · φ n1 j1 (u 1 − δ 0 ) du 1 · ( β i 2 ) · n2 φj2 (u2 − ξ0 ) du2 δ Ex0t du1 δ Ex0u du2 2 i1 1 i2 • If θk1 = 0, θk2 = π2 , the gram matrix is simplified as: Gx(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) =1 2 2n1 j1 +n2 j2 · sinn1 θk1 · cosn2 θk2 · (6.18) dn1 1d dn2 1d X Z X Z ( βi1 ) · n1 φj1 (u2 − ξ0 ) du2 · ( αi2 ) · n2 φj2 (u1 − ξ0 ) du1 ξ Ex0u du2 2 δ Ex0u du1 1 i1 i2 With the analysis above we have the following theorem: Theorem 17. Let x(u) = αi1 1δi1 (u1 )+ βi2 1δi2 (u2 ) and x0 (u) = αi1 1δi1 (u1 )+ P P P 0 δi1 ∈∆1 δi2 ∈∆2 δi1 ∈∆01 βi2 1δi2 (u2 ). Let dmin and d0min denote the smallest pairwise distance between parallel P 0 δi2 ∈∆02 lines of x and x0 , respectively, either along u1 or u2 . Assume φ is supported in [−a, a]2 and θk ∈ {0, π2 }. If 2J a < dmin and 2J a < d0min , then Gx = Gx0 with the following conditions: 100 • αi21 = αi021 P P δi1 ∈∆1 δi1 ∈∆01 • βi22 = βi022 P P ξi2 ∈∆2 ξi2 ∈∆02 • αi0 1 · βi02 P P P P α i1 · βi2 = δi1 ∈∆1 ξi2 ∈∆2 δi1 ∈∆01 ξi2 ∈∆02 Theorem 17 explains the observation of Figure 5.9 (seventh row) from Chapter 5. When J = 4, the gram representations are not able to reproduce the cross lines in the frame texture while larger J resolves the issue. 6.5 Multilayer model In this section, we discuss the properties of multilayer ReLU model. In the following context we use σ(f ) := ReLU(f ). Let x : R → R and φ : R → R to be a one dimensional low pass filter with φ(u) ≥ 0. Define the wavelets ψn (u) = φ(n) (u), u ∈ R By rescaling we get a wavelet family {ψj,n }j,n : (n) ψj,n (u) = φj (u) n Note ψj,n (u) = 2nj du d n φj (u). Then for the one layer wavelet coefficients we have dn φj dn x ∗ ψj,n = x ∗ 2nj = 2 nj · ( x ∗ φj ) = 2nj · x(n) ∗ φj dun dun If j is small, i.e., φj is a tiny filter which is usually the case in CNNs, the convolution with φj can be regarded as a local smoothing operator. The above wavelet coefficients captures the n-th order of derivative of x. 6.5.1 Multilayer ReLU model Now we consider deeper layers of ReLU responses. 
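Before stating the lemma below, it may help to check numerically the single-layer identity $x \ast \psi_{j,n} = 2^{nj}\,(x^{(n)} \ast \phi_j)$ on which the deeper-layer analysis builds. The Gaussian stand-in for $\phi$ (the true $\phi$ is compactly supported), the $2^{-j}$ dilation normalization, the grid, and the test signal are assumptions made for illustration only.

```python
import numpy as np

t = np.linspace(-10, 10, 4001)
dt = t[1] - t[0]
x = np.exp(-t**2 / 8) * np.cos(2 * t)                  # smooth test signal (assumed)

def conv(a, b):
    return np.convolve(a, b, mode="same") * dt

j, n = 2, 2
phi = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)           # Gaussian stand-in for phi
phi_n = np.gradient(np.gradient(phi, dt), dt)          # phi'' , i.e. psi_n for n = 2
phi_j = np.interp(t / 2**j, t, phi) / 2**j             # phi_j(u)     = 2^{-j} phi(2^{-j} u)
psi_jn = np.interp(t / 2**j, t, phi_n) / 2**j          # psi_{j,n}(u) = 2^{-j} phi^{(n)}(2^{-j} u)

lhs = conv(x, psi_jn)                                  # x * psi_{j,n}
rhs = 2**(n * j) * conv(np.gradient(np.gradient(x, dt), dt), phi_j)   # 2^{nj} (x'' * phi_j)
print("relative error:", np.abs(lhs - rhs).max() / np.abs(rhs).max())
```

The relative error should be small, limited only by the finite-difference approximation of the derivatives; the same identity is what the ReLU lemma below is combined with at deeper layers.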
The following lemma is an essential property needed about the n-th order derivatives of ReLU function. Note d du [σ(f (u))] = 101 σ 0 (f (u)) · f 0 (u) and σ 0 (c · f ) = σ 0 (f ) if c > 0. Since σ (n) (f ) = 0, ∀n > 1, we have the following lemma: dn Lemma 6. dun σ(f (u)) = σ 0 (f (u)) · f (n) (u). Proof. We use induction to prove the statement. First we know d du [σ(f (u))] = σ 0 (f (u))·f 0 (u) which aligns with the statement at n = 1. Suppose for n − 1, we have dn−1 σ(f (u)) = σ 0 (f (u)) · f (n−1) (u) dun−1 Consider n, dn d dn−1   σ(f (u)) = σ(f (u)) dun du dun−1   d 0 (n−1) = σ (f (u)) · f (u) du = σ 00 (f (u)) · f (n−1) (u) + σ 0 (f (u)) · f (n) (u) (since σ 00 (f (u)) = 0) = σ 0 (f (u)) · f (n) (u) We prove the conclusion. 102 Let us first look at the second layer model dn2 σ(1 · x ∗ ψj1 ,n1 ) ∗ ψj2 ,n2 = σ(1 · x ∗ ψj1 ,n1 ) ∗ 2n2 j2 · n2 φj2 du n2   n2 j2 d =2 · n2 σ(1 · x ∗ ψj1 ,n1 ) ∗ φj2 du dn2   n2 j2 =2 · n2 σ(1 · x ∗ ψj1 ,n1 ) ∗ φj2 du dn2   n2 j2 0 = 1 2 · σ (1 · x ∗ ψj1 ,n1 ) x ∗ ψj1 ,n1 ∗ φj2 dun2 n1 dn2   n2 j2 0 n1 j1 d = 1 2 · σ (1 · x ∗ ψj1 ,n1 ) x∗2 φj ∗ φj2 dun2 dun1 1 dn1 +n2   n1 j1 +n2 j2 0 = 1 2 · σ (1 · x ∗ ψj1 ,n1 ) x ∗ φj1 ∗ φj2 dun1 +n2   n1 j1 +n2 j2 0 (n1 +n2 ) = 1 2 · σ (1 · x ∗ ψj1 ,n1 ) x ∗ φj1 ∗ φj2   n1 j1 +n2 j2 0 n1 j1 (n1 ) (n1 +n2 ) = 1 2 · σ (1 · 2 ·x ∗ φj1 ) x ∗ φj1 ∗ φj2   n1 j1 +n2 j2 0 (n1 ) (n1 +n2 ) = 1 2 · σ (1 · x ∗ φj1 ) x ∗ φj1 ∗ φj2 (6.19) With the definition of ReLU, we know σ 0 (1 · x(n1 ) ∗ φj1 ) 6= 0 only when 1 · x(n1 ) ∗ φj1 > 0. If φj1 is tiny enough, the condition is approximately 1 · x(n1 ) > 0. Therefore it can be concluded that σ(1 · x ∗ ψj1 ,n1 ) ∗ ψj2 ,n2 captures the (n1 + n2 )-th order of derivative of x at where 1 · x(n1 ) > 0. With similar logic, we can prove the following theorem for deeper layers of ReLU models. Theorem 18. Let fk be the k-layer ReLU response, i.e., fk = σ(. . . σ(2 · σ(1 x ∗ ψj1 ,n1 ) ∗ ψj2 ,n2 ) . . . ) ∗ ψjk ,nk . Then k−1 k P   k P  Y ni ji ( ni ) fk = ( i ) · 2i=1 0 σk−1 . . . σ10 (x i=1 ∗ φj1 ) · · · ∗ φjk−1 ∗ φjk (6.20) i=1 where σk0 = σ 0 (k · fk ). Proof. We use induction to prove the theorem. First for k = 2, we prove the conclusion in 103 (6.19). Suppose the statement holds for k − 1, let us consider the case for k: fk = σ(k−1 fk−1 ) ∗ ψnk ,jk nk   nk jk d =2 σ(k−1 fk−1 ) ∗ φjk (6.21) dunk dnk   nk jk 0 = k−1 · 2 σ (k−1 fk−1 ) fk−1 ∗ φjk dunk Since dnk dnk fk−1 = n σ(k−2 fk−2 ) ∗ ψjk−1 ,nk−1 dunk du k dnk = σ(k−2 fk−2 ) ∗ n ψjk−1 ,nk−1 du k dnk dnk−1 = σ(k−2 fk−2 ) ∗ n (2nk−1 jk−1 n φjk−1 ) du k du k−1 nk−1 +nk d = σ(k−2 fk−2 ) ∗ (2nk−1 jk−1 n +n φjk−1 ) du k−1 k 1 = j n σ(k−2 fk−2 ) ∗ ψjk−1 ,nk−1 +nk 2 k−1 k Let fk−1 ∗ = σ(k−2 fk−2 ) ∗ ψjk−1 ,nk−1 +nk , the only difference between fk−1 ∗ and fk−1 is that in ∗ fk−1 the last filter substitute nk−1 with nk−1 + nk . Insert this change into (6.20), we get k−2 k−1 P   Pk  Y nk jk−1 + ni ji ( ni ) ∗ 0 fk−1 =( i ) · 2 i=1 σk−2 . . . σ10 (x i=1 ∗ φj1 ) · · · ∗ φjk−2 ∗ φjk−1 i=1 Therefore k−2 k−1 k dnk P   P  Y ni ji ( ni ) 0 n fk−1 = ( i ) · 2 i=1 σk−2 . . . σ10 (x i=1 ∗ φj1 ) · · · ∗ φjk−2 ∗ φjk−1 du k i=1 Insert this into Equation 6.21 we prove the theorem. The above theorem concludes interesting results about very deep neural networks. In most CNNs, the convolution filters are of small size, which aligns with the case when jk are all small. In that case, the convolutions with φj can be regarded as tiny local smoothing operators that we can ignore for now. 
The theorem concludes that under this condition, the l k l P ( ni ) k-layer ReLU model captures ( ni )-th order derivatives of x at where ( i ) · x P Q i=1 > i=1 i=1 0, ∀1 ≤ l < k. High order derivatives are preserved based on fine partitions of low order derivatives. Specifically when the filters are tiny, n usually equals to 0 or 1. If n is exactly 1, 104 the theorem proves that the n-th layer deep model preserves the n-th order derivatives of x based on partitions of i-th order derivatives for 0 < i < n. If n is either 0 or 1, the theorem implies similar conclusions. If n is exactly 0, i.e., all the filters are small low pass filters, the deep model is less related to derivatives of x. In general, deep models smooth out either the signal or the high order derivatives of the signal. Also deep layers increase the receptive fields, thus incorporating information from more neighbor pixels and preserving long range dependence. 105 CHAPTER 7 RANDOM FIELDS A random field refers to a random function over an arbitrary domain, e.g., Rd . The function value at each point in the domain can be think of as a random variable. And the function values at different points usually have some correlation. On the other hand, texture images from the same class usually have similar repeated patterns but generally different, which can be reflected from the correlation and randomness of a random field. Therefore random fields can be used to model texture images. In this chapter, we show more of our work on scattering transform on random fields. We extend the work in [43] to d dimensional space. We also discuss the relationship between scattering moments and power spectrum of a stochastic process. 7.1 Scattering moments of self-similar processes Before we introduce the scattering moments of random fields, we first show and prove the following properties of random fields that are stationary or has stationary increments. Let ψ(u) : u ∈ Rd be a wavelet, i.e., Rd ψ(u)du = 0. R Lemma 7. If {X(u)}u∈Rd has stationary increments, {X ∗ ψ(u)}u∈Rd is stationary. Proof. for all u and v: Z X ∗ ψ(u) = X(u − u0 )ψ(u0 )du0 Z Z (since ψ has zero average) = 0 0 0 X(u − u )ψ(u )du − X(u) ψ(u0 )du0 Z = (X(u − u0 ) − X(u))ψ(u0 )du0 Z d = (X(v − u0 ) − X(v))ψ(u0 )du0 Z = X(v − u0 )ψ(u0 )du0 = X ∗ ψ(v) 106 d Therefore, X ∗ ψ(u) = X ∗ ψ(v) ∀u, v, and thus {X ∗ ψ(u)}u∈Rd is stationary. Remark 6. Similarly if {X(u)}u∈Rd is stationary, {X ∗ ψ(u)}u∈Rd is stationary. Let σ be a modulus operator or a ReLU operator, i.e., σ(x) = |x| or σ(x) = ReLU(x). Let {ψj,θ }j,θ be a family of wavelets where ψj,θ (u) = 2−dj ψ(2−j R−θ u), ∀u ∈ Rd . The following theorem explores the first order scattering moments of self similar processes. Theorem 19. Suppose {X(u)}u∈Rd is a self similar process of order H and has stationary increments. The first order m-th scattering moments of {X(u)}u∈Rd satisfy: E[σ m (X ∗ ψj1 ,θ1 )] = 2mHj1 (7.1) E[σ m (X ∗ ψθ1 )] Moreover, if {X(u)}u∈Rd is isotropic: E[σ m (X ∗ ψj1 ,θ1 )] = 2mHj1 (7.2) E[σ m (X ∗ ψ)] Proof. Z X ∗ ψj1 ,θ1 (u) = X(u0 ) · ψj1 ,θ1 (u − u0 )du0 Rd Z = X(u0 ) · 2−nj1 ψθ1 (2−j1 (u − u0 ))du0 d ZR (let v 0 = 2−j1 u0 ) = X(2j1 v 0 ) · 2−nj1 ψθ1 (2−j1 u − v 0 ))2nj1 dv 0 (7.3) Rd Z d (since X is self similar) = 2Hj1 X(v 0 ) · ψθ1 (2−j1 u − v 0 ))dv 0 Rd = 2Hj1 X ∗ ψθ1 (2−j1 u) Since X is stationary, we have X ∗ ψθ1 (u) is stationary. Therefore E[σ m (X ∗ ψθ1 (2−j1 u))] = E[σ m (X ∗ ψθ1 (u))] and the first result develops naturally. 
107 Moreover, if {X(u)}u∈Rd is isotropic: Z X ∗ ψθ1 (u) = X(u0 ) · ψθ1 (u − u0 )du0 d ZR = X(u0 ) · ψ(Rθ−11 (u − u0 ))du0 d ZR (let v 0 = Rθ−1 1 u0 ) = X(Rθ1 v 0 ) · ψ(Rθ−1 1 u − v 0 )dv 0 d ZR d (since X is isotropic) = X(v 0 ) · ψ(Rθ−1 1 u − v 0 )dv 0 Rd = X ∗ ψ(Rθ−1 1 u) Since X ∗ ψ is stationary, E[σ m (X ∗ ψ(Rθ−1 1 u))] = E[σ m (X ∗ ψ(u))], we verified the second result. Theorem 19 implies that for a self similar process of order H with stationary increments, if we think of E[σ m (X ∗ ψθ1 )] as a normalization term, the first order scattering moments is proportional to a value determined by the scale j of the wavelet, the moments m and H. Moreover if the process is isotropic, the normalization term is substitute by E[σ m (X ∗ψ)] and the scattering moments are invariant to the rotation parameter θ. It provides a necessary condition for identifying a self similar process and also captures the order parameter H of the process through the first order scattering moments. In general, this theorem indicates the pattern of scattering moments of a self similar process. The following theorem explores more about the second order scattering moments of such processes. Since it involves rotation difference, we focus on R2 for simplification in the following theorem. Theorem 20. Suppose (X(u))u∈R2 is a self similar process of order H and has stationary increments. The second order scattering moments satisfy: E[σ(σ(X ∗ ψj1 ,θ1 ) ∗ ψj2 ,θ2 )] E[σ(σ(X ∗ ψθ1 ) ∗ ψj2 −j1 ,θ2 )] = (7.4) E[σ(X ∗ ψj1 ,θ1 )] E[σ(X ∗ ψθ1 )] Moreover, if (X(u))u∈R2 is isotropic: E[σ(σ(X ∗ ψj1 ,θ1 ) ∗ ψj2 ,θ2 )] E[σ(σ(X ∗ ψ) ∗ ψj2 −j1 ,θ2 −θ1 )] = (7.5) E[(X ∗ ψj1 ,θ1 )] E[σ(X ∗ ψ)] 108 d Proof. From Theorem 19 we have X ∗ ψj1 ,θ1 (u) = 2Hj1 X ∗ ψθ1 (2−j1 u). Therefore: Z σ(X ∗ ψj1 ,θ1 ) ∗ ψj2 ,θ2 (u) = σ(X ∗ ψj1 ,θ1 (u0 )) · ψj2 ,θ2 (u − u0 )du0 2 ZR d = σ(2Hj1 X ∗ ψθ1 (2−j1 u0 )) · ψj2 ,θ2 (u − u0 )du0 2 ZR (let v 0 = 2−j1 u0 ) = σ(2Hj1 X ∗ ψθ1 (v 0 )) · ψj2 ,θ2 (u − 2j1 v 0 )2nj1 dv 0 2 ZR = σ(2Hj1 X ∗ ψθ1 (v 0 )) · ψj2 −j1 ,θ2 (2−j1 u − v 0 )dv 0 R2 = 2Hj1 σ(X ∗ ψθ1 ) ∗ ψj2 −j1 ,θ2 (2−j1 u) Since σ(X ∗ ψ1 ) is stationary, from Remark 6 we have σ(X ∗ ψ1 ) ∗ ψ2 is stationary. Therefore: E[σ(σ(X ∗ ψj1 ,θ1 ) ∗ ψj2 ,θ2 )] 2Hj1 E[σ(σ(X ∗ ψθ1 ) ∗ ψj2 −j1 ,θ2 )] = E[σ(X ∗ ψj1 ,θ1 )] 2Hj1 E[σ(X ∗ ψθ1 )] E[σ(σ(X ∗ ψθ1 ) ∗ ψj2 −j1 ,θ2 )] = E[σ(X ∗ ψθ1 )] The first result if verified. If (X(u))u∈R2 is isotropic: Z σ(X ∗ ψθ1 ) ∗ ψj2 −j1 ,θ2 (u) = σ(X ∗ ψθ1 (u0 )) · ψj2 −j1 ,θ2 (u − u0 )du0 2 ZR d = σ(X ∗ ψ(Rθ−1 1 u0 )) · ψj2 −j1 ,θ2 (u − u0 )du0 2 ZR (let v = Rθ1 u ) = 0 −1 0 σ(X ∗ ψ(v 0 )) · ψj2 −j1 ,θ2 (u − Rθ−1 1 v 0 )dv 0 2 ZR = σ(X ∗ ψ(v 0 )) · ψj2 −j1 ,θ2 −θ1 (Rθ1 u − v 0 )dv 0 R2 = σ(X ∗ ψ) ∗ ψj2 −j1 ,θ2 −θ1 (Rθ1 u) Therefore: E[σ(σ(X ∗ ψθ1 ) ∗ ψj2 −j1 ,θ2 )] E[σ(σ(X ∗ ψ) ∗ ψj2 −j1 ,θ2 −θ1 )] = E[σ(X ∗ ψθ1 )] E[σ(X ∗ ψ)] We verified the second result. Theorem 20 implies that, for a self similar process with order H, the second order scattering moments E[σ(σ(X ∗ ψj1 ,θ1 ) ∗ ψj2 ,θ2 )] normalized by the first scattering moments E[σ(X ∗ ψj1 ,θ1 )], is determined by the scale difference j2 − j1 where j1 , j2 are the scales of the two wavelets at the two layers. Moreover, if the process is isotropic, the second order 109 scattering moments is also determined by the rotation difference θ2 − θ1 , where θ1 , θ2 are the rotations of the two wavelets at the two layers. This theorem provides another necessary condition for identifying a self similar process. 
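Both scaling relations are straightforward to probe numerically. The sketch below estimates the first-order relation (7.1) for one-dimensional Brownian motion, which is self-similar of order $H = 1/2$ with stationary increments, taking $\sigma$ to be the modulus and $m = 1$; the derivative-of-Gaussian wavelet, grid, and path length are convenience choices.

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
dt, T, n_paths = 0.05, 400.0, 200
n = int(T / dt)

def psi_at_scale(s):
    """L1-normalized dilation of a derivative-of-Gaussian wavelet."""
    tau = np.arange(-5 * s, 5 * s + dt / 2, dt)
    psi = -tau * np.exp(-tau**2 / (2 * s**2))
    return psi / np.sum(np.abs(psi))

scales = [2.0**j for j in range(4)]
moments = np.zeros(len(scales))
for _ in range(n_paths):
    X = np.cumsum(rng.standard_normal(n)) * np.sqrt(dt)     # Brownian path, H = 1/2
    for k, s in enumerate(scales):
        w = fftconvolve(X, psi_at_scale(s), mode="valid")   # X * psi_j with s = 2^j
        moments[k] += np.mean(np.abs(w)) / n_paths          # estimate E|X * psi_j|

print("log2 of successive moment ratios:",
      np.round(np.log2(moments[1:] / moments[:-1]), 3), "(theory: H = 0.5)")
```

The printed log-ratios should be close to $H = 0.5$ up to Monte-Carlo error. Theorem 20's normalized second-order moments can be estimated the same way, by applying a second wavelet and $\sigma$ before averaging.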
It describes the pattern of the second order moments and the parameter H is also captured in such statistics. 7.2 Power spectrum Definition 1. Suppose {X(u)}u∈Rd is a stationary process with zero mean. The covariance of X is defined as RX (τ ) = EX(0)X(τ ). Then the power spectrum of {X(u)}u∈Rd is defined as: V Z RX (ω) = F(RX )(ω) = RX (τ )e−iτ ω dτ Rd Lemma 8. Suppose {X(u)}u∈Rd is a stationary process and ψ is a wavelet. Let Y (u) = X ∗ ψ(u). It can be proved that Y is also stationary from Remark 1. Then V V V RY (ω) = RX (ω)|ψ (−ω)|2 (7.6) 110 Proof. V Z RY (ω) = RY (τ )e−iτ ω dτ Z = E[Y (0)Y (τ )]e−iτ ω dτ Z = E[X ∗ ψ(0)X ∗ ψ(τ )]e−iτ ω dτ Z Z Z = E[ X(u) · ψ(−u)du · X(v) · ψ(τ − v)dv]e−iτ ω dτ Z Z Z (by Fubini’s Thm.) =E X(u) · ψ(−u) · X(v) · ψ(τ − v) · e−iτ ω dudvdτ Z Z Z =E X(u) · ψ(−u) · X(v) · [ ψ(τ − v) · e−iτ ω dτ ]dvdu Z Z Z 0 (let τ 0 = τ − v, dτ 0 = dτ ) =E X(u) · ψ(−u) · X(v) · [ ψ(τ 0 ) · e−i(τ +v)ω dτ 0 ]dvdu Z Z =E X(u) · ψ(−u) · X(v) · ψ̂(−ω) · e−ivω dvdu Z Z 0 (let v = u + v 0 , dv = dv 0 ) = ψ̂(−ω)E X(u) · ψ(−u) · X(u + v 0 ) · e−i(u+v )ω dv 0 du Z Z 0 (by Fubini’s Thm.) = ψ̂(−ω) ψ(−u)E[X(u) · X(u + v 0 )] · e−i(u+v )ω dv 0 du Z Z 0 = ψ̂(−ω) ψ(−u) · RX (v 0 ) · e−i(u+v )ω dv 0 du V = ψ̂(−ω)ψ̂(−ω)RX (ω) V = |ψ̂(−ω)|2 RX (ω) Lemma 9. Let {X(u)}u∈Rd be a stationary process and let {ψj }j be a group of wavelets that V V are complex analytic. Suppose j |ψ j |2 (ω) + j |ψ j |2 (−ω) = 2, ∀ω. Then P P X E|X ∗ ψj |2 = VarX j Proof. Let Yj = X ∗ ψj . From Lemma 8, V V V RYj (ω) = RX (ω)|ψj (−ω)|2 111 Therefore, X V X V V RYj (ω) = RX (ω)|ψj (−ω)|2 j j V X V = RX (ω) |ψj (−ω)|2 j  V 2RX (ω), ω < 0  = 0, ω > 0  Taking the inverse Fourier transform at τ = 0 of both sides, we get Z ∞ Z 0 X 1 V 2 V RYj (0) = RY (ω)dω = RX (ω)dω = RX (0) j 2π −∞ 2π −∞ Since RX (0) = VarX X Var(Yj ) = VarX j Since ψj is a wavelet, EYj = EX ∗ ψj = 0. We have the result: X X X E|X ∗ ψj |2 = E|Yj |2 = Var(Yj ) = VarX j j j Remark 7. Note if σ(x) = ReLU(x), then |X ∗ ψj |2 = σ 2 (X ∗ ψj ) + σ 2 (−X ∗ ψj ) when X and ψj are both real valued. Then Lemma 9 can be extended to the ReLU operator: X X E[σ 2 (X ∗ ψj )] + E[σ 2 (−X ∗ ψj )] = VarX j j 7.3 Power spectrum and scattering equivalence √ For now let us work on R. Define ψλ (t) = λψ(λt) for any λ ∈ (0, ∞), and assume ψ is admissible, meaning that V ∞ |ψ (ω)|2 Z Cψ := dω < ∞ . 0 ω Let (X(t))t∈R be a stationary process. Define the first order quadratic scattering moments of X as: SX(λ) := E|X ∗ ψλ |2 , ∀ λ ∈ (0, ∞) . 112 Theorem 21. Let X, Y be two real-valued stationary processes. V V RX (ω) = RY (ω) a.e. ω ∈ R ⇐⇒ SX(λ) = SY (λ) ∀ λ ∈ (0, ∞) . To prove the theorem, we need Lemma 2.1 from [71], which we restate in the following. Lemma 10. Let f ∈ L2 (R) be continuous and assume f (ω) = f (−ω). ψb has compact support and satisfy Condition 1. Then Z f (ω)|ψbλ (ω)|2 dω = 0 ∀λ > 0 =⇒ f = 0 a.e. Condition 1. Define |ψbλ+ (ω)|2 = (|ψbλ (ω)|2 + |ψbλ (−ω)|2 ) · 1(ω ≥ 0). If for any finite sequence {ωi }ni=1 of distinct positive frequencies, the collection {|ψbλ+ (ωi )|2 }ni=1 are linearly independent functions of λ, we say the wavelet ψ satisfies the linear independence condition. Proof of Theorem 21. =⇒: If ψ is a wavelet, then from Lemma 8, we have: V V V V RX (ω) = RY (ω) a.e. ω ∈ R =⇒ RX∗ψλ (ω) = RY ∗ψλ (ω) a.e. ω ∈ R, ∀λ Then by taking inverse Fourier tansform of the power spectrum we have: RX∗ψλ (τ ) = RY ∗ψλ (τ ) ∀τ ∈ R, λ Let τ = 0, we have: E|X ∗ ψλ |2 = E|Y ∗ ψλ |2 which proves this direction. 
113 ⇐=: Z Z 2 E|X ∗ ψλ (t)| = E X(u) · ψλ (t − u)du X(v) · ψλ (t − v)dv Z Z =E X(u) · X(v) · ψλ (t − u) · ψλ (t − v)dudv Z Z (Let v = u + τ ) = E X(u) · X(u + τ ) · ψλ (t − u) · ψλ (t − u − τ )dudτ Z Z = EX(u)X(u + τ ) · ψλ (t − u) · ψλ (t − u − τ )dudτ Z Z = RX (τ ) · ψλ (t − u) · ψλ (t − u − τ )dudτ Z Z = ψλ (t − u) · RX (τ ) · ψλ (t − u − τ )dτ du Z = ψλ (t − u) · RX ∗ ψλ (t − u)du Z (Let t = 0 and v = −u) = ψλ (v) · RX ∗ ψλ (v)dv Z V V (By Plancherel Thm.) = RX ∗ ψλ (ω) · ψλ (ω)dω Z V V V = RX (ω) · ψλ (ω) · ψλ (−ω)dω Z V V V = RX (ω) · ψλ (−ω) · ψλ (−ω)dω Z V V = RX (ω) · |ψλ (−ω)|2 dω Z V V = RX (−ω) · |ψλ (ω)|2 dω ^ Z V V (RX is even) = RX (ω) · |ψλ (ω)|2 dω If SX(λ) = SY (λ), we have: Z V V V (RX − RY )(ω) · |ψλ (ω)|2 dω = 0 V V According to Lemma 11 stated in the following, we have RX (ω) = RX (−ω). For stationary, V V V real-valued X, Lemma 8 implies E|X ∗ ψλ (t)|2 = RX (ω) · |ψλ (ω)|2 dω for all ψ. Suppose ψλ R V is an indicator function supported in interval [a, b], we know the integration of RX in interval V [a, b] is positive. By taking arbitrary a and b, we can conclude that RX is positive almost everywhere. With Lemma 10, we prove the result. 114 Lemma 11. If X is stationary and real-valued, we have RX is real-valued and RX (τ ) = V V V RX (−τ ). Then we also have RX is real-valued and RX (ω) = RX (−ω) Proof. Since RX (τ ) = EX(0)X(τ ) = EX(−τ )X(0) = RX (−τ ) RX (τ ) is real-valued and even. V For RX , V Z Z −iτ ω RX (ω) = RX (τ )e dτ = RX (τ ) cos(τ ω) − i · RX (τ ) sin(τ ω)dτ Since RX is even and sin() is odd, we know the imagery part is 0. Therefore, RX is real- valued. V Z Z Z V RX (−ω) = RX (τ )e iτ ω dτ = RX (−τ )e iτ ω dτ = RX (v)e−ivω dv = RX (ω) we proved the result. Theorem 21 demonstrates the power spectral is equivalent to the scattering moments of a real-valued stationary process. In particular, two processes with the same power spectrum is equivalent to, the two processes has the same first order second scattering moments for any scattering parameter λ. Since a stationary process is determined by its power spectrum, this theorem provides a necessary and sufficient condition for identifying such processes. 115 APPENDIX 116 PROOF FOR CHAPTER 3 The Proof of the Lemmas The Proof of Lemma 1. By linearity, it suffices to show that p is not the zero function unless α1 = . . . , αN = 0. We will consider the case where |γN | > |γk | for all k = 1, 2, . . . , N − 1. The other cases, where |γ1 | > |γk | for all 2 ≤ k ≤ N or where |γ1 | = |γN | > |γk | for all 2 ≤ k ≤ N − 1 are similar. The n-th derivative of p is given by N αk γkn eiγk x . X (n) p (x) = k=1 Therefore, p(n) (0) lim n = αN , n→∞ γN so in particular, there exists n such that p(n) (0) 6= 0, and therefore p is not uniformly zero. The proof of Lemma 2. In the case where p = 2m is even, then by (3.13) and (3.14), |pi |2m ∈ E(mdi ) for each i. Therefore, since d1 > d2 , d3 , d4 , it follows from (3.15) that |p1 |2m + |p2 |2m − |p3 |2m − |p4 |2m is an element of E(md1 ) and therefore vanishes on a set of measure zero. Now consider the case where p = 2m + 1 is odd. Squaring both sides of (3.16) implies p5 = 2(|p1 p2 |2m+1 − |p3 p4 |2m+1 ), where p5 := |p1 |4m+2 + |p2 |4m+2 − |p3 |4m+2 − |p4 |4m+2 . Thus, squaring both sides again gives p6 := 8|p1 p2 p3 p4 |2m+1 where p6 = p25 − 4|p1 p2 |4m+2 − 4|p3 p4 |4m+2 . Therefore, squaring both sides one final time implies that p26 − 64|p1 p2 p3 p4 |4m+2 = 0. 
117 However, since d1 > d2 , d3 , d4 , repeatedly applying (3.13), (3.14), and (3.15), we see that (p26 − 64|p1 p2 p3 p4 |2 ) ∈ E(4md1 ) and therefore vanishes on a set of measure zero. In addition to the lemmas from Chapter 3, we also need the following lemmas to prove Theorem 5 and 4. Lemma 12. Let m ≥ 1 be an odd integer, and let a, b, c, d, C ∈ R, a, b, c, d 6= 0. Let p(θ) = a + beiθ , and q(θ) = c + deiθ . If there are more than 4m distinct θ ∈ [0, 2π] such that |p(θ)|m − |q(θ)|m = C, then ab = cd and a2 + b2 = c2 + d2 . Lemma 13. Let m ≥ 1 be an integer (not necessarily odd) and let a, b, c, d, C ∈ R, γ > 0, κ 6= 0, 1. Then the set of θ such that m i(γ+1)θ m 1 iθ a + be + ce − κa + beiθ + κcei(γ+1)θ =C (7) κ has measure zero. The Proof of Lemma 12. If θ is a solution to |p(θ)|p − |q(θ)|p = C, then |p(θ)|2p − |q(θ)|2p − C 2 = 2|q(θ)|p C. Therefore, f (θ) = 0, where f : R → R is the function defined by 2 f (θ) := |p(θ)|2p − |q(θ)|2p − C 2 − 4|q(θ)|2p C 2 . Since |p(θ)|2 = a2 + b2 + 2ab cos(θ) and |q(θ)|2 = c2 + d2 + cd cos(θ), (8) 118 f (θ) is a trigonometric polynomial of degree at most 2p which by assumption has more than 4p zeros in [0, 2π]. This implies that f (θ) is uniformly zero. By (8) 2 f (θ) = (a2 + b2 + 2ab cos(θ))p − (c2 + d2 + 2cd cos(θ))p − C 2 − 4C 2 (c2 + d2 + 2cd cos(θ))p and so setting the cos2p (θ) coefficient equal to zero implies 0 = (2p ap bp − 2p cp dp )2 which implies ab = cd since p is odd. If p ≥ 3, then 2(p − 1) > p. Therefore, using the binomial formula and setting the cos2(p−1) (θ) coefficient of f (θ) equal to zero implies     2 p 2 2 p−1 p 2 2 p−1 (a + b )(2ab) − (c + d )(2cd) = 0, p−1 p−1 but since ab = cd this implies that a2 + b2 = c2 + d2 . On the other hand, if p = 1, using the fact that ab = cd we see that 2 f (θ) = a2 + b2 − (c2 + d2 ) − C = 4C 2 (c2 + d2 + 2cd cos(θ) Therefore, f (θ) can only be uniformly equal to zero if C = 0 and a2 + b2 = c2 + d2 . The Proof of Lemma 13. Let 1 p(θ) = a + beiθ + cei(γ+1)θ and q(θ) = κa + beiθ + κcei(γ+1)θ . κ Then by (7) we see |p(θ)|p = |q(θ)|p + C, Squaring both sides yields, |p(θ)|2p − |q(θ)|2p − C 2 = 2|q(θ)|p C, and therefore if θ satisfies (7) it is a solution to f (θ) = 0, where 2 f (θ) := |p(θ)|2p − |q(θ)|2p − C 2 − 4|q(θ)|2p C 2 = 0. 119 f (θ) is an element of the class E of generalized exponential polynomials introduced earlier. Therefore, it will follow that f vanishes on a set of measure zero as soon as we show that f is not uniformly zero. We will verify that that the lead cofficient of f is nonzero unless c = ±1. Using the trigonometric identities sin2 (x) + cos2 (x) = 1 and cos(x − y) = cos(x) cos(y) + sin(x) sin(y) we see that |p(θ)|2 = a2 + b2 + c2 + 2ab cos(θ) + 2bc cos(γθ) + 2ac cos((γ + 1)θ) and likewise 1 2 |q(θ)|2 = κ2 a2 + b + κ2 c2 + 2ab cos(θ) + 2bc cos(γθ) + 2κ2 ac cos((γ + 1)θ). κ2 Therefore, the lead coefficient of f (θ) vanishes if and only if κ2 = 1. The Proof of Theorem 4 Proof. Choose ξ1 , ξ2 , . . . , ξL i.i.d. from any probability distribution which is absolutely con- tinuous with respect to the Lebesgue measure. Since x is collision free, with probability one, each of the ξ` ∆i,i+1 (x) are distinct modulo 2π, i.e. ξ` ∆i,i+1 (x) 6≡ ξ`0 ∆i0 ,i0 +1 (x) mod 2π (9) for all 1 ≤ i, i0 ≤ k − 1 and 1 ≤ `, `0 ≤ L, except when (i, `) = (i0 , `0 ). Assume these ξ` ∆i,i+1 are distinct, and let y(t) be any signal defined as in 3.17 such that D(y) = D(x) =: D, and ∂s2 fx,ξ` (d) = ∂s2 fy,ξ` (d) for all d ∈ D and for all 1 ≤ ` ≤ L − 1, and i=1 |ai | . Note that y(t) depends on ξ1 , . . . 
Note that $y(t)$ depends on $\xi_1, \ldots, \xi_{L-1}$, but is independent of $\xi_L$. By [CITE TURNPIKE] and the assumption that $x(t)$ and $y(t)$ are collision free, the fact that $D(x) = D(y)$ implies that the support sets of $x$ and $y$ are equivalent up to translation and reflection, so we may assume without loss of generality that $\Delta_{i,j}(x) = \Delta_{i,j}(y) =: \Delta_{i,j}$ for all $1 \le i \le j \le k$. We will show that $\vec{b}$ must be given by
\[
b_i =
\begin{cases}
\frac{1}{c}\, a_i & \text{if } i \text{ is odd} \\
c\, a_i & \text{if } i \text{ is even,}
\end{cases}
\]
where $c = \pm 1$ or
\[
|c|^p = \frac{\sum_{i=1}^{\lfloor (k+1)/2 \rfloor} |a_{2i-1}|^p}{\sum_{i=1}^{\lfloor k/2 \rfloor} |a_{2i}|^p}. \tag{10}
\]
Next, we will show that, with probability one, if $c$ satisfies (10) but $c \neq \pm 1$, then $\partial_s^2 f_{x,\xi_L}(\Delta_{1,3}) \neq \partial_s^2 f_{y,\xi_L}(\Delta_{1,3})$. Since $y(t)$, and therefore $\vec{b}$, was chosen to depend on $\xi_1, \ldots, \xi_{L-1}$ but not on $\xi_L$, these two facts together will imply that, with probability one, if $y(t)$ is any signal such that $D(y) = D(x) =: D$ and $\partial_s^2 f_{x,\xi_\ell}(d) = \partial_s^2 f_{y,\xi_\ell}(d)$ for all $d \in D$ and all $1 \le \ell \le L$, then $\vec{b} = \pm\vec{a}$, and therefore $y$ is equivalent to $\pm x$ up to reflection and translation.

Equation (3.6) implies that for all $1 \le \ell \le L-1$ and all $1 \le i \le k-1$ we have
\[
|a_i + a_{i+1} e^{i\xi_\ell \Delta_{i,i+1}}|^p - |a_{i+1}|^p - |a_i|^p
= \partial_s^2 f_{x,\xi_\ell}(\Delta_{i,i+1})
= \partial_s^2 f_{y,\xi_\ell}(\Delta_{i,i+1})
= |b_i + b_{i+1} e^{i\xi_\ell \Delta_{i,i+1}}|^p - |b_{i+1}|^p - |b_i|^p.
\]
Therefore, for all $1 \le i \le k-1$, $\xi_1 \Delta_{i,i+1}, \ldots, \xi_{L-1} \Delta_{i,i+1}$ are $L-1$ solutions, distinct modulo $2\pi$, to
\[
|a_i + a_{i+1} e^{i\theta}|^p - |b_i + b_{i+1} e^{i\theta}|^p = |b_i|^p + |b_{i+1}|^p - |a_i|^p - |a_{i+1}|^p.
\]
Since $L - 1 \ge 4p$, Lemma 12 implies that
\[
a_i a_{i+1} = b_i b_{i+1} \tag{11}
\]
and
\[
a_i^2 + a_{i+1}^2 = b_i^2 + b_{i+1}^2 \tag{12}
\]
for all $1 \le i \le k-1$. It follows from (11) that
\[
b_i =
\begin{cases}
\frac{1}{c}\, a_i & \text{if } i \text{ is odd} \\
c\, a_i & \text{if } i \text{ is even,}
\end{cases} \tag{13}
\]
where $c := a_1 / b_1$. Combining (13) with the assumption that
\[
\sum_{i=1}^k |a_i|^p = \sum_{i=1}^k |b_i|^p
\]
implies that either $c = \pm 1$ or that $c$ satisfies (10).

Now, we will show that, with probability one, if $c$ satisfies (10) but $c \neq \pm 1$, then $\partial_s^2 f_{\xi_L}[x](\Delta_{1,3}) \neq \partial_s^2 f_{\xi_L}[y](\Delta_{1,3})$. By (3.7), if $\partial_s^2 f_{\xi_L}[x](\Delta_{1,3}) = \partial_s^2 f_{\xi_L}[y](\Delta_{1,3})$, then
\[
\begin{aligned}
&|a_1 + a_2 e^{i\xi_L\Delta_{1,2}} + a_3 e^{i\xi_L\Delta_{1,3}}|^p + |a_2|^p - |a_2 e^{i\xi_L\Delta_{1,2}} + a_3 e^{i\xi_L\Delta_{1,3}}|^p - |a_1 + a_2 e^{i\xi_L\Delta_{1,2}}|^p \\
&\qquad = |b_1 + b_2 e^{i\xi_L\Delta_{1,2}} + b_3 e^{i\xi_L\Delta_{1,3}}|^p + |b_2|^p - |b_2 e^{i\xi_L\Delta_{1,2}} + b_3 e^{i\xi_L\Delta_{1,3}}|^p - |b_1 + b_2 e^{i\xi_L\Delta_{1,2}}|^p.
\end{aligned} \tag{14}
\]
But combining (11) and (12) implies that for all $i$ either $(a_i, a_{i+1}) = \pm(b_i, b_{i+1})$ or $(a_i, a_{i+1}) = \pm(b_{i+1}, b_i)$. In either case, we have that
\[
|a_1 + a_2 e^{i\xi_L\Delta_{1,2}}| = |b_1 + b_2 e^{i\xi_L\Delta_{1,2}}| \quad \text{and} \quad |a_2 e^{i\xi_L\Delta_{1,2}} + a_3 e^{i\xi_L\Delta_{1,3}}| = |b_2 e^{i\xi_L\Delta_{1,2}} + b_3 e^{i\xi_L\Delta_{1,3}}|.
\]
Combining this with (14) gives
\[
|a_1 + a_2 e^{i\xi_L\Delta_{1,2}} + a_3 e^{i\xi_L\Delta_{1,3}}|^p + |a_2|^p = |b_1 + b_2 e^{i\xi_L\Delta_{1,2}} + b_3 e^{i\xi_L\Delta_{1,3}}|^p + |b_2|^p. \tag{15}
\]
However, by Lemma 13 the set of $\xi_L \in \mathbb{R}$ such that (15) holds has measure zero, unless $c = \pm 1$. Since we assumed that the distribution of $\xi_L$ was absolutely continuous with respect to the Lebesgue measure, this completes the proof.

The Proof of Theorem 5. The proof is quite similar to the proof of Theorem 4. Choose $\xi_1, \ldots, \xi_L$ i.i.d. from any probability distribution which is absolutely continuous with respect to the Lebesgue measure, and again note that with probability one the $\xi_\ell \Delta_{i,i+1}$ are distinct modulo $2\pi$. Let $y(t)$ be any signal defined as in (3.18) such that $D(y) = D(x) =: D$, $\partial_s^2 f_{\xi_\ell}[x](d) = \partial_s^2 f_{\xi_\ell}[y](d)$ for all $d \in D$ and for all $1 \le \ell \le L-1$, and $\sum_{i=1}^k |b_i|^p = \sum_{i=1}^k |a_i|^p$. As before, we may assume that $\Delta_{i,j}(x) = \Delta_{i,j}(y) =: \Delta_{i,j}$ for all $1 \le i \le j \le k$.
Since $\partial_s^2 f_{\xi_\ell}[x](\Delta_{i,i+1}) = \partial_s^2 f_{\xi_\ell}[y](\Delta_{i,i+1})$, it follows from (3.6) that for all $1 \le \ell \le L$ and $1 \le i \le k-1$,
\[
|a_i + a_{i+1} e^{i\xi_\ell \Delta_{i,i+1}}|^{2m} - |a_i|^{2m} - |a_{i+1}|^{2m} = |b_i + b_{i+1} e^{i\xi_\ell \Delta_{i,i+1}}|^{2m} - |b_i|^{2m} - |b_{i+1}|^{2m}.
\]
Therefore, for all $1 \le i \le k-1$, $\xi_1\Delta_{i,i+1}, \ldots, \xi_{L-1}\Delta_{i,i+1}$ are $L-1$ solutions, distinct modulo $2\pi$, to
\[
h(\theta) := |a_i + a_{i+1} e^{i\theta}|^{2m} - |b_i + b_{i+1} e^{i\theta}|^{2m} + |b_i|^{2m} + |b_{i+1}|^{2m} - |a_i|^{2m} - |a_{i+1}|^{2m} = 0.
\]
Using the facts that
\[
|a_i + a_{i+1} e^{i\theta}|^2 = a_i^2 + a_{i+1}^2 + 2a_i a_{i+1}\cos(\theta) \quad \text{and} \quad |b_i + b_{i+1} e^{i\theta}|^2 = b_i^2 + b_{i+1}^2 + 2b_i b_{i+1}\cos(\theta),
\]
we see that
\[
h(\theta) = (a_i^2 + a_{i+1}^2 + 2a_i a_{i+1}\cos\theta)^m - (b_i^2 + b_{i+1}^2 + 2b_i b_{i+1}\cos\theta)^m + b_i^{2m} + b_{i+1}^{2m} - a_i^{2m} - a_{i+1}^{2m}
\]
is a trigonometric polynomial of degree at most $m$ with at least $L-1$ distinct roots. Since $L - 1 \ge 2m + 1 > 2m$, this implies that $h$ must be uniformly zero. In particular, setting the lead coefficient equal to zero implies
\[
(a_i a_{i+1})^m = (b_i b_{i+1})^m
\]
for all $1 \le i \le k-1$. Using the binomial theorem and setting the $\cos^{m-1}(\theta)$ coefficient equal to zero gives
\[
(a_i^2 + a_{i+1}^2)(a_i a_{i+1})^{m-1} = (b_i^2 + b_{i+1}^2)(b_i b_{i+1})^{m-1}.
\]
Combining the last two equations gives $a_i^2 + a_{i+1}^2 = b_i^2 + b_{i+1}^2$ and $a_i a_{i+1} = b_i b_{i+1}$. As in the proof of Theorem 4, these two facts imply that
\[
b_i =
\begin{cases}
\frac{1}{c}\, a_i & \text{if } i \text{ is odd} \\
c\, a_i & \text{if } i \text{ is even}
\end{cases}
\]
for $c = a_1 / b_1$, and the assumption that
\[
\sum_{i=1}^k |a_i|^p = \sum_{i=1}^k |b_i|^p
\]
implies that either $c = \pm 1$ or
\[
|c|^p = \frac{\sum_{i=1}^{\lfloor (k+1)/2 \rfloor} |a_{2i-1}|^p}{\sum_{i=1}^{\lfloor k/2 \rfloor} |a_{2i}|^p}. \tag{16}
\]
As in the proof of Theorem 4, it follows from Lemma 13 that if $c$ satisfies (16) but $c \neq \pm 1$, then $\partial_s^2 f_{\xi_L}[x](\Delta_{1,3}) \neq \partial_s^2 f_{\xi_L}[y](\Delta_{1,3})$ with probability one. Therefore, the proof is complete.

PROOF FOR CHAPTER 4

Proof of Theorem 6

To prove Theorem 6 we will need the following lemma.

Lemma 14. Let $Z$ be a Poisson random variable with parameter $\lambda$. Then for all $\alpha \in \mathbb{R}$ and $m \in \mathbb{N}$,
\[
\mathbb{E}\left[ Z^\alpha \mathbb{1}_{\{Z > m\}} \right] = \sum_{k=m+1}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!}\, k^\alpha \le C_{m,\alpha}\, \lambda^{m+1}, \qquad \forall\, 0 < \lambda < 1.
\]

Proof. For $0 < \lambda < 1$ and $k \in \mathbb{N}$, $e^{-\lambda}\lambda^k \le 1$. Therefore,
\[
\mathbb{E}\left[ Z^\alpha \mathbb{1}_{\{Z > m\}} \right]
= \sum_{k=m+1}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!}\, k^\alpha
= \lambda^{m+1} \sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{(k+m+1)!}\, (k+m+1)^\alpha
\le \lambda^{m+1} \sum_{k=0}^{\infty} \frac{(k+m+1)^\alpha}{(k+m+1)!}
= C_{\alpha,m}\, \lambda^{m+1}.
\]

Proof. [Theorem 6] Recalling the definitions of $Y(dt)$ and $S[\gamma, p]Y(t)$, and setting $N_s(t) = N\big([t-s, t]^d\big)$, we see
\[
S[\gamma, p]Y(t)
= \mathbb{E}\left[ \left| \int_{[t-s,t]^d} w\!\left( \frac{t-u}{s} \right) e^{i\xi\cdot(t-u)}\, Y(du) \right|^p \right]
= \mathbb{E}\left[ \left| \sum_{j=1}^{N_s(t)} A_j\, w\!\left( \frac{t-t_j}{s} \right) e^{i\xi\cdot(t-t_j)} \right|^p \right],
\]
where $t_1, t_2, \ldots, t_{N_s(t)}$ are the points of $N$ in $[t-s, t]^d$. Conditioned on the event that $N_s(t) = k$, the locations of the $k$ points in $[t-s, t]^d$ are distributed as i.i.d. random variables $Z_1, \ldots, Z_k$ taking values in $[t-s, t]^d$ with density
\[
p_Z(z) = \frac{\lambda(z)}{\Lambda_s(t)}, \qquad z \in [t-s, t]^d.
\]
Therefore, the random variables
\[
V_i := \frac{t - Z_i}{s}
\]
take values in the unit cube $Q_1 = [0,1]^d$ and have density
\[
p_V(v) = \lambda(t - vs)\, \frac{s^d}{\Lambda_s(t)}, \qquad v \in Q_1.
\]
Note that in the special case that $N$ is homogeneous, i.e. $\lambda(t) \equiv \lambda_0$ is constant, the $V_i$ are uniform random variables on $Q_1$. Our proof will be based off of conditioning on $N_s(t)$. For $N_s(t) = k \ge 1$,
\[
\mathbb{E}\left[ \left| \sum_{j=1}^{N_s(t)} A_j\, w\!\left( \frac{t-t_j}{s} \right) e^{i\xi\cdot(t-t_j)} \right|^p : N_s(t) = k \right]
= \mathbb{E}\left[ \left| \sum_{j=1}^{k} A_j\, w(V_j)\, e^{is\xi\cdot V_j} \right|^p \right]
\le k^p\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \mathbb{E}[|A_1|^p]\, \|w\|_p^p, \tag{17}
\]
where (17) follows from (i) the independence of the random variables $A_j$ and $V_j$; (ii) the fact that for any sequence of i.i.d. random variables $Z_1, Z_2, \ldots$,
\[
\mathbb{E}\left[ \left| \sum_{n=1}^{k} Z_n \right|^p \right] \le k^{p-1}\, \mathbb{E}\left[ \sum_{n=1}^{k} |Z_n|^p \right] = k^p\, \mathbb{E}[|Z_1|^p];
\]
and (iii) the fact that
\[
\mathbb{E}[|w(V_i)|^p] = \int_{Q_1} |w(v)|^p\, p_V(v)\, dv \le \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \|w\|_p^p.
\]
Therefore, since $\mathbb{P}[N_s(t) = k] = e^{-\Lambda_s(t)}\, (\Lambda_s(t))^k / k!$,
\[
\begin{aligned}
\mathbb{E}\left[ \left| \sum_{j=1}^{N_s(t)} A_j\, w\!\left( \frac{t-t_j}{s} \right) e^{i\xi\cdot(t-t_j)} \right|^p \right]
&= \sum_{k=0}^{\infty} e^{-\Lambda_s(t)} \frac{(\Lambda_s(t))^k}{k!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{N_s(t)} A_j\, w\!\left( \frac{t-t_j}{s} \right) e^{i\xi\cdot(t-t_j)} \right|^p : N_s(t) = k \right] \\
&= \sum_{k=1}^{\infty} e^{-\Lambda_s(t)} \frac{(\Lambda_s(t))^k}{k!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{k} A_j\, w(V_j)\, e^{is\xi\cdot V_j} \right|^p \right] \\
&= \sum_{k=1}^{m} e^{-\Lambda_s(t)} \frac{(\Lambda_s(t))^k}{k!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{k} A_j\, w(V_j)\, e^{is\xi\cdot V_j} \right|^p \right] + \epsilon(m, s, \xi, t),
\end{aligned}
\]
where
\[
\epsilon(m, s, \xi, t) := \sum_{k=m+1}^{\infty} e^{-\Lambda_s(t)} \frac{(\Lambda_s(t))^k}{k!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{k} A_j\, w(V_j)\, e^{is\xi\cdot V_j} \right|^p \right].
\]
By (17) and Lemma 14, if $s$ is small enough so that $\Lambda_s(t) \le s^d \|\lambda\|_\infty < 1$, then
\[
\epsilon(m, s, \xi, t)
\le \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \mathbb{E}[|A_1|^p]\, \|w\|_p^p \sum_{k=m+1}^{\infty} e^{-\Lambda_s(t)} \frac{(\Lambda_s(t))^k}{k!}\, k^p
\le C_{m,p}\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \mathbb{E}[|A_1|^p]\, \|w\|_p^p\, (\Lambda_s(t))^{m+1}
\le C_{m,p}\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \mathbb{E}[|A_1|^p]\, \|w\|_p^p\, \|\lambda\|_\infty^{m+1}\, s^{d(m+1)}.
\]

Proof of Theorem 7

Proof. [Theorem 7] Let $(s_k, \xi_k)$ be a sequence of scale and frequency pairs such that $\lim_{k\to\infty} s_k = 0$. Applying Theorem 6 with $m = 1$, we obtain
\[
\frac{S_{\gamma_k, p}Y(t)}{s_k^d}
= e^{-\Lambda_{s_k}(t)}\, \frac{\Lambda_{s_k}(t)}{s_k^d}\, \mathbb{E}\left[ \left| A_1\, w(V_{1,k})\, e^{is_k\xi_k\cdot V_{1,k}} \right|^p \right] + \frac{\epsilon(1, s_k, \xi_k, t)}{s_k^d}
= e^{-\Lambda_{s_k}(t)}\, \frac{\Lambda_{s_k}(t)}{s_k^d}\, \mathbb{E}[|A_1|^p]\, \mathbb{E}[|w(V_{1,k})|^p] + \frac{\epsilon(1, s_k, \xi_k, t)}{s_k^d},
\]
where we write $V_{1,k} = V_1$ to emphasize the fact that the density of $V_{1,k}$ is
\[
p_{V_k}(v) = \lambda(t - v s_k)\, \frac{s_k^d}{\Lambda_{s_k}(t)}.
\]
Using the error bound (4.9), we see that
\[
\lim_{k\to\infty} \frac{\epsilon(1, s_k, \xi_k, t)}{s_k^d} = 0.
\]
Furthermore, since $0 \le \Lambda_{s_k}(t) \le s_k^d \|\lambda\|_\infty$, we observe that
\[
\lim_{k\to\infty} e^{-\Lambda_{s_k}(t)} = 1,
\]
and by the continuity of $\lambda(t)$,
\[
\lim_{k\to\infty} \frac{\Lambda_{s_k}(t)}{s_k^d} = \lim_{k\to\infty} \frac{1}{s_k^d} \int_{[t-s_k, t]^d} \lambda(u)\, du = \lambda(t). \tag{18}
\]
Finally, by the continuity of $\lambda(t)$, we see that
\[
p_{V_k}(v) \le \frac{\|\lambda\|_\infty}{\lambda_{\min}} \quad \text{and} \quad \lim_{k\to\infty} p_{V_k}(v) = 1, \qquad \forall\, v \in Q_1. \tag{19}
\]
Therefore, by the bounded convergence theorem,
\[
\lim_{k\to\infty} \mathbb{E}[|w(V_1)|^p] = \lim_{k\to\infty} \int_{Q_1} |w(v)|^p\, p_{V_k}(v)\, dv = \int_{Q_1} |w(v)|^p \lim_{k\to\infty} p_{V_k}(v)\, dv = \|w\|_p^p.
\]
That completes the proof of (4.10). To prove (4.11), we assume that $\lambda(t)$ is periodic with period $T$ along each coordinate and again use Theorem 6 with $m = 1$ to observe
\[
\frac{SY(s_k, \xi_k, p)}{s_k^d}
= \mathbb{E}[|A_1|^p]\, \frac{1}{T^d} \int_{Q_T} e^{-\Lambda_{s_k}(t)}\, \frac{\Lambda_{s_k}(t)}{s_k^d} \int_{Q_1} |w(v)|^p\, p_{V_k}(v)\, dv\, dt
+ \frac{1}{T^d} \int_{Q_T} \frac{\epsilon(1, s_k, \xi_k, t)}{s_k^d}\, dt.
\]
By (4.9), the second integral converges to zero as $k \to \infty$. Therefore,
\[
\lim_{k\to\infty} \frac{SY(s_k, \xi_k, p)}{s_k^d} = \mathbb{E}[|A_1|^p]\, \|w\|_p^p\, \frac{1}{T^d} \int_{Q_T} \lambda(t)\, dt,
\]
by the continuity of $\lambda(t)$ and the bounded convergence theorem.

Proof of Theorem 8

Proof. [Theorem 8] We apply Theorem 6 with $m = 2$ and obtain
\[
S_{\gamma_k, p}Y(t) = e^{-\Lambda_{s_k}(t)}\, \Lambda_{s_k}(t)\, \mathbb{E}[|A_1|^p]\, \mathbb{E}[|w(V_{1,k})|^p]
+ e^{-\Lambda_{s_k}(t)}\, \frac{(\Lambda_{s_k}(t))^2}{2}\, \mathbb{E}\left[ \left| A_1 w(V_{1,k}) e^{is_k\xi_k\cdot V_{1,k}} + A_2 w(V_{2,k}) e^{is_k\xi_k\cdot V_{2,k}} \right|^p \right]
+ \epsilon(2, s_k, \xi_k, t), \tag{20}
\]
where $V_{i,k}$, $i = 1, 2$, are random variables taking values on the unit cube $Q_1 = [0,1]^d$ with densities
\[
p_{V_k}(v) = \lambda(t - v s_k)\, \frac{s_k^d}{\Lambda_{s_k}(t)}.
\]
Dividing both sides of (20) by $s_k^{2d}\, \|w\|_p^p\, \mathbb{E}[|A_1|^p]$ and subtracting $\frac{\Lambda_{s_k}(t)}{s_k^{2d}} \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p}$ yields
\[
\begin{aligned}
\frac{S_{\gamma_k, p}Y(t)}{s_k^{2d}\, \|w\|_p^p\, \mathbb{E}[|A_1|^p]} - \frac{\Lambda_{s_k}(t)}{s_k^{2d}} \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p}
&= \frac{e^{-\Lambda_{s_k}(t)}\Lambda_{s_k}(t) - \Lambda_{s_k}(t)}{s_k^{2d}} \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p} \\
&\quad + e^{-\Lambda_{s_k}(t)}\, \frac{(\Lambda_{s_k}(t))^2}{s_k^{2d}}\, \frac{\mathbb{E}\left[ \left| A_1 w(V_{1,k}) e^{is_k\xi_k\cdot V_{1,k}} + A_2 w(V_{2,k}) e^{is_k\xi_k\cdot V_{2,k}} \right|^p \right]}{2\|w\|_p^p\, \mathbb{E}[|A_1|^p]} \\
&\quad + \frac{\epsilon(2, s_k, \xi_k, t)}{s_k^{2d}\, \|w\|_p^p\, \mathbb{E}[|A_1|^p]}.
\end{aligned} \tag{21}
\]
Using the error bound (4.9),
\[
\lim_{k\to\infty} \frac{\epsilon(2, s_k, \xi_k, t)}{s_k^{2d}\, \|w\|_p^p\, \mathbb{E}[|A_1|^p]} = 0, \tag{22}
\]
at a rate independent of $t$. Recalling (19) from the proof of Theorem 7, we use the fact that $\lim_{k\to\infty} p_{V_k} \equiv 1$ and the bounded convergence theorem to conclude
\[
\lim_{k\to\infty} \mathbb{E}\left[ \left| A_1 w(V_{1,k}) e^{is_k\xi_k\cdot V_{1,k}} + A_2 w(V_{2,k}) e^{is_k\xi_k\cdot V_{2,k}} \right|^p \right]
= \mathbb{E}\left[ \left| A_1 w(U_1) e^{iL\cdot U_1} + A_2 w(U_2) e^{iL\cdot U_2} \right|^p \right], \tag{23}
\]
where $U_i$, $i = 1, 2$, are uniform random variables on the unit cube and $L = \lim_{k\to\infty} s_k\xi_k$. Similarly,
\[
\lim_{k\to\infty} \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p} = 1. \tag{24}
\]
Lastly, recalling that $s_k \to 0$ as $k \to \infty$ and using (18) from the proof of Theorem 7, we see
\[
\lim_{k\to\infty} \frac{e^{-\Lambda_{s_k}(t)}\Lambda_{s_k}(t) - \Lambda_{s_k}(t)}{s_k^{2d}}
= \left( \lim_{k\to\infty} \frac{\Lambda_{s_k}(t)}{s_k^d} \right) \left( \lim_{k\to\infty} \frac{e^{-\Lambda_{s_k}(t)} - 1}{s_k^d} \right)
= \lambda(t) \lim_{k\to\infty} \frac{e^{-\Lambda_{s_k}(t)} - 1}{s_k^d}
= -\lambda(t)^2. \tag{25}
\]
Now we integrate both sides of (21) over $Q_T$ and divide by $T^d$. Taking the limit as $k \to \infty$, on the left hand side we get
\[
\begin{aligned}
\lim_{k\to\infty} \frac{1}{T^d} \int_{Q_T} \left( \frac{S_{\gamma_k,p}Y(t)}{s_k^{2d}\, \|w\|_p^p\, \mathbb{E}[|A_1|^p]} - \frac{\Lambda_{s_k}(t)}{s_k^{2d}} \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p} \right) dt
&= \lim_{k\to\infty} \left( \frac{SY(s_k, \xi_k, p)}{s_k^{2d}\, \|w\|_p^p\, \mathbb{E}[|A_1|^p]} - \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p}\, \frac{1}{T^d}\int_{Q_T} \frac{\Lambda_{s_k}(t)}{s_k^{2d}}\, dt \right) \\
&= \lim_{k\to\infty} \left( \frac{SY(s_k, \xi_k, p)}{s_k^{2d}\, \mathbb{E}[|w(V_{1,k})|^p]\, \mathbb{E}[|A_1|^p]} - \frac{1}{T^d}\int_{Q_T} \frac{\Lambda_{s_k}(t)}{s_k^{2d}}\, dt \right),
\end{aligned}
\]
where we used the definition of the invariant scattering moments and (24). On the right hand side of (21), we use (24), (25), and the dominated convergence theorem to see that the first term is
\[
\lim_{k\to\infty} \frac{1}{T^d} \int_{Q_T} \frac{e^{-\Lambda_{s_k}(t)}\Lambda_{s_k}(t) - \Lambda_{s_k}(t)}{s_k^{2d}}\, \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p}\, dt
= \lim_{k\to\infty} \frac{1}{T^d} \int_{Q_T} \frac{e^{-\Lambda_{s_k}(t)}\Lambda_{s_k}(t) - \Lambda_{s_k}(t)}{s_k^{2d}}\, dt
= -\frac{1}{T^d}\int_{Q_T} \lambda(t)^2\, dt.
\]
Using (18), (23), and the bounded convergence theorem, the second term of (21) is
\[
\lim_{k\to\infty} \frac{1}{T^d}\int_{Q_T} e^{-\Lambda_{s_k}(t)}\, \frac{(\Lambda_{s_k}(t))^2}{s_k^{2d}}\, \frac{\mathbb{E}\left[ \left| A_1 w(V_{1,k}) e^{is_k\xi_k\cdot V_{1,k}} + A_2 w(V_{2,k}) e^{is_k\xi_k\cdot V_{2,k}} \right|^p \right]}{2\|w\|_p^p\, \mathbb{E}[|A_1|^p]}\, dt
= \frac{\mathbb{E}\left[ \left| A_1 w(U_1) e^{iL\cdot U_1} + A_2 w(U_2) e^{iL\cdot U_2} \right|^p \right]}{2\|w\|_p^p\, \mathbb{E}[|A_1|^p]} \left( \frac{1}{T^d}\int_{Q_T} \lambda(t)^2\, dt \right).
\]
Finally, the third term of (21) goes to zero using the bounded convergence theorem and (22). Putting together the left and right hand sides of (21) with these calculations finishes the proof.

Proof of Theorem 9

Proof. [Theorem 9] As in the proof of Theorem 6, let $N_s(t) = N\big([t-s, t]^d\big)$ denote the number of points in the cube $[t-s, t]^d$. Then since the support of $w$ is contained in $[0,1]^d$,
\[
(g_{\gamma_k} * Y)(t) = \int_{[t-s_k, t]^d} w\!\left( \frac{t-u}{s_k} \right) e^{i\xi_k\cdot(t-u)}\, Y(du) = \sum_{j=1}^{N_{s_k}(t)} A_j\, w\!\left( \frac{t-t_j}{s_k} \right) e^{i\xi_k\cdot(t-t_j)},
\]
where $t_1, t_2, \ldots, t_{N_{s_k}(t)}$ are the points of $N$ in $[t-s_k, t]^d$. Therefore, in the event that $N_{s_k}(t) = 1$,
\[
|(g_{\gamma_k} * Y)(t)|^p = \big( |g_{\gamma_k}|^p * |Y|^p \big)(t),
\]
and so, partitioning the space of possible outcomes based on $N_{s_k}(t)$, we obtain
\[
\begin{aligned}
|(g_{\gamma_k} * Y)(t)|^p
&= \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)=1\}} + (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \right|^p \\
&= \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)=1\}} \right|^p + \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \right|^p \\
&= \big( |g_{\gamma_k}|^p * |Y|^p \big)(t)\, \mathbb{1}_{\{N_{s_k}(t)=1\}} + \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \right|^p \\
&= \big( |g_{\gamma_k}|^p * |Y|^p \big)(t) + \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \right|^p - \big( |g_{\gamma_k}|^p * |Y|^p \big)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \\
&= \big( |g_{\gamma_k}|^p * |Y|^p \big)(t) + e_k(t),
\end{aligned}
\]
where
\[
e_k(t) := \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \right|^p - \big( |g_{\gamma_k}|^p * |Y|^p \big)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}}.
\]
Using the above, we can write the second order convolution term as
\[
\big( g_{\gamma_k'} * |g_{\gamma_k} * Y|^p \big)(t) = \big( g_{\gamma_k'} * |g_{\gamma_k}|^p * |Y|^p \big)(t) + \big( g_{\gamma_k'} * e_k \big)(t).
\]
The following lemma implies that $\big( g_{\gamma_k'} * e_k \big)(t)$ decays rapidly in $L^{p'}$ at a rate independent of $t$.

Lemma 15. There exists $\delta > 0$, independent of $t$, such that if $s_k < \delta$,
\[
\mathbb{E}\left[ \left| \big( g_{\gamma_k'} * e_k \big)(t) \right|^{p'} \right] \le C(p, p', w, c, L)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \|\lambda\|_\infty^2\, s_k^{d(p'+2)}.
\]
Once we have proved Lemma 15, equation (4.13) will follow once we show
\[
\lim_{k\to\infty} \frac{\mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p * |Y|^p \big)(t) \right|^{p'} \right]}{s_k^{d(p'+1)}} = K(p, p', w, c, L)\, \lambda(t)\, \mathbb{E}[|A_1|^q]. \tag{26}
\]
Let us prove (26) first and postpone the proof of Lemma 15. We will use the fact that the support of $g_{\gamma_k'} * |g_{\gamma_k}|^p$ is contained in $[0, s_k + s_k']^d$. Let $\tilde{s}_k := s_k + s_k'$, $N_k(t) := N_{\tilde{s}_k}(t)$, $\Lambda_k(t) := \Lambda_{\tilde{s}_k}(t)$, and let $t_1, t_2, \ldots, t_{N_k(t)}$ be the points of $N$ in the cube $[t - \tilde{s}_k, t]^d$. We have that $\mathbb{P}[N_k(t) = n] = e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}$, and conditioned on the event that $N_k(t) = n$, the locations of the points $t_1, \ldots, t_n$ are distributed as i.i.d. random variables $Z_1(t), \ldots, Z_n(t)$ taking values in $[t - \tilde{s}_k, t]^d$ with density $p_{Z(t)}(z) = \lambda(z)/\Lambda_k(t)$.
Therefore the i.i.d. random variables $\tilde{V}_1(t), \ldots, \tilde{V}_n(t)$ defined by $\tilde{V}_i(t) := t - Z_i(t)$ take values in $[0, \tilde{s}_k]^d$ and have density
\[
p_{\tilde{V}(t)}(v) = \frac{\lambda(t - v)}{\Lambda_k(t)}, \qquad v \in [0, \tilde{s}_k]^d.
\]
Now, we condition on $N_k(t)$ to see that
\[
\begin{aligned}
\mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p * |Y|^p \big)(t) \right|^{p'} \right]
&= \mathbb{E}\left[ \left| \sum_{j=1}^{N_k(t)} |A_j|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(t - t_j) \right|^{p'} \right] \\
&= \sum_{n=1}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{n} |A_j|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(t - t_j) \right|^{p'} : N_k(t) = n \right] \\
&= \sum_{n=1}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{n} |A_j|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_j(t)) \right|^{p'} \right] \tag{27} \\
&= e^{-\Lambda_k(t)}\, \Lambda_k(t)\, \mathbb{E}[|A_1|^q]\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] \tag{28} \\
&\quad + \sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{n} |A_j|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_j(t)) \right|^{p'} \right]. \tag{29}
\end{aligned}
\]
The following lemma will be used to estimate the scaling of the term in (28).

Lemma 16. For all $t \in \mathbb{R}^d$,
\[
\lim_{k\to\infty} \frac{\tilde{s}_k^d}{s_k^{d(p'+1)}}\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] = \left\| g_{c, L/c} * |g_{1,0}|^p \right\|_{p'}^{p'}. \tag{30}
\]
Furthermore, there exists $\delta > 0$, independent of $t$, such that if $s_k < \delta$ then
\[
\frac{\tilde{s}_k^d}{s_k^{d(p'+1)}}\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] \le 2\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, C(p, p', w, c, L). \tag{31}
\]

Proof. Making a change of variables in both $u$ and $v$, and recalling the assumption that $s_k' = c s_k$, we observe that
\[
\begin{aligned}
\frac{\tilde{s}_k^d}{s_k^{d(p'+1)}}\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right]
&= \frac{\tilde{s}_k^d}{s_k^{d(p'+1)}} \int_{\mathbb{R}^d} p_{\tilde{V}(t)}(v) \left| \int_{\mathbb{R}^d} w\!\left( \frac{v-u}{s_k'} \right) e^{i\xi_k'\cdot(v-u)} \left| w\!\left( \frac{u}{s_k} \right) \right|^p du \right|^{p'} dv \\
&= \tilde{s}_k^d \int_{\mathbb{R}^d} p_{\tilde{V}(t)}(s_k v) \left| \int_{\mathbb{R}^d} w\!\left( \frac{s_k(v-u)}{s_k'} \right) e^{is_k\xi_k'\cdot(v-u)}\, |w(u)|^p\, du \right|^{p'} dv \\
&= \int_{\mathbb{R}^d} \frac{\tilde{s}_k^d\, \lambda(t - s_k v)}{\Lambda_k(t)} \left| \int_{\mathbb{R}^d} w\!\left( \frac{u - v}{c} \right) e^{is_k'\xi_k'\cdot(u-v)/c}\, |w(u)|^p\, du \right|^{p'} dv. \tag{32}
\end{aligned}
\]
The continuity of $\lambda(t)$ implies that
\[
\lim_{k\to\infty} \frac{\tilde{s}_k^d\, \lambda(t - s_k v)}{\Lambda_k(t)} = 1, \qquad \forall\, v \in [0, 1+c]^d.
\]
Furthermore, the assumption $0 < \lambda_{\min} \le \|\lambda\|_\infty < \infty$ implies
\[
\frac{\tilde{s}_k^d\, \lambda(t - s_k v)}{\Lambda_k(t)} \le \frac{\|\lambda\|_\infty}{\lambda_{\min}}, \qquad \forall\, k \ge 1. \tag{33}
\]
Therefore, (30) follows from the dominated convergence theorem and the observation that the inner integral of (32) is zero unless $v \in [0, 1+c]^d$. Equation (31) follows from inserting (33) into (32) and sending $k$ to infinity.

Since
\[
\lim_{k\to\infty} \frac{\Lambda_k(t)}{\tilde{s}_k^d} = \lambda(t),
\]
the independence of $\tilde{V}_1(t)$ and $A_1$, the continuity of $\lambda(t)$, and Lemma 16 imply that taking $k \to \infty$ in (28) yields
\[
\begin{aligned}
\lim_{k\to\infty} \frac{e^{-\Lambda_k(t)}\, \Lambda_k(t)\, \mathbb{E}[|A_1|^q]\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right]}{s_k^{d(p'+1)}}
&= \lim_{k\to\infty} e^{-\Lambda_k(t)}\, \frac{\Lambda_k(t)}{\tilde{s}_k^d}\, \mathbb{E}[|A_1|^q]\, \frac{\tilde{s}_k^d}{s_k^{d(p'+1)}}\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] \\
&= K(p, p', c, w, L)\, \lambda(t)\, \mathbb{E}[|A_1|^q].
\end{aligned}
\]
The following lemma shows that (29) is $O\big( s_k^{d(p'+2)} \big)$ (and converges at a rate independent of $t$), and therefore completes the proof of (4.13) subject to proving Lemma 15.

Lemma 17. For all $\alpha \in \mathbb{R}$ there exists $\delta > 0$, independent of $t$, such that if $s_k < \delta$, then
\[
\sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, n^\alpha\, \mathbb{E}\left[ \left| \sum_{j=1}^{n} |A_j|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_j(t)) \right|^{p'} \right]
\le C(p, p', w, c, \alpha, L)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \|\lambda\|_\infty^2\, \mathbb{E}[|A_1|^q]\, s_k^{d(p'+2)}.
\]

Proof. For any sequence of i.i.d. random variables $Z_1, Z_2, \ldots$, it holds that
\[
\mathbb{E}\left[ \left| \sum_{n=1}^{k} Z_n \right|^p \right] \le k^{p-1}\, \mathbb{E}\left[ \sum_{n=1}^{k} |Z_n|^p \right] = k^p\, \mathbb{E}[|Z_1|^p].
\]
Therefore, by Lemma 14, Lemma 16, and the fact that the $\tilde{V}_j(t)$ and $A_i$ are i.i.d. and independent of each other, we see that if $s_k < \delta$, where $\delta$ is as in (31),
\[
\begin{aligned}
\sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, n^\alpha\, \mathbb{E}\left[ \left| \sum_{j=1}^{n} |A_j|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_j(t)) \right|^{p'} \right]
&\le \sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, n^\alpha\, n^{p'}\, \mathbb{E}\left[ \left| |A_1|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] \\
&= \sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, n^{p'+\alpha}\, \mathbb{E}[|A_1|^q]\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] \\
&= \mathbb{E}[|A_1|^q]\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] \sum_{n=2}^{\infty} e^{-\Lambda_k(t)}\, n^{p'+\alpha}\, \frac{(\Lambda_k(t))^n}{n!}
\end{aligned}
\]
\[
\begin{aligned}
&\le C(p, p', w, c, L)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \mathbb{E}[|A_1|^q]\, \frac{s_k^{d(p'+1)}}{\tilde{s}_k^d} \sum_{n=2}^{\infty} e^{-\Lambda_k(t)}\, n^{p'+\alpha}\, \frac{(\Lambda_k(t))^n}{n!} \\
&\le C(p, p', w, c, L, \alpha)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \mathbb{E}[|A_1|^q]\, \frac{s_k^{d(p'+1)}}{\tilde{s}_k^d}\, (\Lambda_k(t))^2 \\
&\le C(p, p', w, c, L, \alpha)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \|\lambda\|_\infty^2\, \mathbb{E}[|A_1|^q]\, s_k^{d(p'+2)},
\end{aligned}
\]
where the last inequality uses the fact that $\Lambda_k(t) \le \tilde{s}_k^d \|\lambda\|_\infty = (1+c)^d s_k^d \|\lambda\|_\infty$.

We will now complete the proof of the theorem by proving Lemma 15.

Proof. [Lemma 15] Since
\[
e_k(t) = \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \right|^p - \big( |g_{\gamma_k}|^p * |Y|^p \big)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}},
\]
we see that
\[
\left| \big( g_{\gamma_k'} * e_k \big)(t) \right|
\le \left| \Big( g_{\gamma_k'} * \big| (g_{\gamma_k} * Y)\, \mathbb{1}_{\{N_{s_k}(\cdot)>1\}} \big|^p \Big)(t) \right|
+ \left| \Big( g_{\gamma_k'} * \big( (|g_{\gamma_k}|^p * |Y|^p)\, \mathbb{1}_{\{N_{s_k}(\cdot)>1\}} \big) \Big)(t) \right|.
\]
First turning our attention to the second term, we note that
\[
\begin{aligned}
\left| \Big( g_{\gamma_k'} * \big( (|g_{\gamma_k}|^p * |Y|^p)\, \mathbb{1}_{\{N_{s_k}(\cdot)>1\}} \big) \Big)(t) \right|
&= \left| \int_{[t-s_k', t]^d} w\!\left( \frac{t-u}{s_k'} \right) e^{i\xi_k'\cdot(t-u)} \big( |g_{\gamma_k}|^p * |Y|^p \big)(u)\, \mathbb{1}_{\{N_{s_k}(u)>1\}}\, du \right| \\
&\le \mathbb{1}_{\{N_k(t)>1\}} \int_{[t-s_k', t]^d} w\!\left( \frac{t-u}{s_k'} \right) \big( |g_{\gamma_k}|^p * |Y|^p \big)(u)\, du \\
&= \mathbb{1}_{\{N_k(t)>1\}} \big( g_{s_k', 0} * |g_{\gamma_k}|^p * |Y|^p \big)(t), \tag{34}
\end{aligned}
\]
since $N_{s_k}(u) \le N_{s_k + s_k'}(t) = N_{\tilde{s}_k}(t) = N_k(t)$ for all $u \in [t - s_k', t]^d$. Therefore, conditioning on $N_k(t)$, if $s_k < \delta$,
\[
\begin{aligned}
\mathbb{E}\left[ \left| \Big( g_{\gamma_k'} * \big( (|g_{\gamma_k}|^p * |Y|^p)\, \mathbb{1}_{\{N_{s_k}(\cdot)>1\}} \big) \Big)(t) \right|^{p'} \right]
&\le \mathbb{E}\left[ \left| \mathbb{1}_{\{N_k(t)>1\}} \big( g_{s_k', 0} * |g_{\gamma_k}|^p * |Y|^p \big)(t) \right|^{p'} \right] \\
&= \sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{n} |A_j|^p \big( g_{s_k', 0} * |g_{\gamma_k}|^p \big)(\tilde{V}_j(t)) \right|^{p'} \right] \\
&\le C(p, p', w, c, L)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \|\lambda\|_\infty^2\, \mathbb{E}[|A_1|^q]\, s_k^{d(p'+2)}
\end{aligned}
\]
by Lemma 17. Now, turning our attention to the first term, note that
\[
|(g_{\gamma_k} * Y)(t)|^p\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \le N_{s_k}(t)^{p-1} \big( |g_{\gamma_k}|^p * |Y|^p \big)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}}.
\]
Therefore, by the same logic as in (34),
\[
\begin{aligned}
\left| \Big( g_{\gamma_k'} * \big| (g_{\gamma_k} * Y)\, \mathbb{1}_{\{N_{s_k}(\cdot)>1\}} \big|^p \Big)(t) \right|
&\le \int_{[t-s_k', t]^d} w\!\left( \frac{t-u}{s_k'} \right) N_{s_k}(u)^{p-1} \big( |g_{\gamma_k}|^p * |Y|^p \big)(u)\, \mathbb{1}_{\{N_{s_k}(u)>1\}}\, du \\
&\le \mathbb{1}_{\{N_k(t)>1\}}\, N_k(t)^{p-1} \int_{[t-s_k', t]^d} w\!\left( \frac{t-u}{s_k'} \right) \big( |g_{\gamma_k}|^p * |Y|^p \big)(u)\, du \\
&\le \mathbb{1}_{\{N_k(t)>1\}}\, N_k(t)^{p-1} \big( g_{s_k', 0} * |g_{\gamma_k}|^p * |Y|^p \big)(t).
\end{aligned}
\]
So again conditioning on $N_k(t)$ and applying Lemma 17 (with $\alpha = p'(p-1)$), we see that if $s_k < \delta$,
\[
\begin{aligned}
\mathbb{E}\left[ \left| \Big( g_{\gamma_k'} * \big| (g_{\gamma_k} * Y)\, \mathbb{1}_{\{N_{s_k}(\cdot)>1\}} \big|^p \Big)(t) \right|^{p'} \right]
&\le \sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, \mathbb{E}\left[ \left| n^{p-1} \sum_{j=1}^{n} |A_j|^p \big( g_{s_k', 0} * |g_{\gamma_k}|^p \big)(\tilde{V}_j(t)) \right|^{p'} \right] \\
&\le C(p, p', w, c, L)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \|\lambda\|_\infty^2\, \mathbb{E}[|A_1|^q]\, s_k^{d(p'+2)}.
\end{aligned}
\]
This completes the proof of (4.13). Line (4.14) follows from integrating with respect to $t$, observing that the error bounds in Lemmas 15 and 16 are independent of $t$, and applying the bounded convergence theorem.

Proofs of Results from Section 4.5

In order to prove Theorems 10 and 11, we will need the following lemma, which shows that the scaling relationship of a self-similar process $X(t)$ induces a similar relationship on stochastic integrals against $dX(t)$.

Lemma 18. If $X$ is a stochastic process that satisfies the scaling relation
\[
X(st) \overset{d}{=} s^\beta X(t) \tag{35}
\]
for some $\beta > 0$, then for any measurable function $f : \mathbb{R} \to \mathbb{R}$,
\[
\int_0^s f(u)\, dX(u) \overset{d}{=} s^\beta \int_0^1 f(su)\, dX(u).
\]

Proof. Let $X = (X(t))_{t\in\mathbb{R}}$ be a stochastic process satisfying (35), and let $P_n = \{0 = t_0^n < t_1^n < \cdots < t_{K_n}^n = 1\}$ be a sequence of partitions of $[0,1]$ such that
\[
\lim_{n\to\infty} \max_k \left| t_k^n - t_{k-1}^n \right| = 0.
\]
Then, by the scaling relation (35),
\[
\int_0^s f(u)\, dX(u)
= \lim_{n\to\infty} \sum_{k=0}^{K_n - 1} f(s t_k^n)\left( X(s t_{k+1}^n) - X(s t_k^n) \right)
\overset{d}{=} s^\beta \lim_{n\to\infty} \sum_{k=0}^{K_n - 1} f(s t_k^n)\left( X(t_{k+1}^n) - X(t_k^n) \right)
= s^\beta \int_0^1 f(su)\, dX(u).
\]
We will now use Lemma 18 to prove Theorems 10 and 11.
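For standard Brownian motion (for which $\beta = 1/2$), Lemma 18 can be sanity-checked by Monte Carlo, since both stochastic integrals are Gaussian and it suffices to compare variances. The following is a minimal sketch, not part of the dissertation; the test function $f$, the scale $s$, and the discretization below are arbitrary illustrative choices.

```python
# Minimal Monte Carlo check of Lemma 18 for Brownian motion (beta = 1/2):
# int_0^s f dX  equals in distribution  s^beta * int_0^1 f(s u) dX.
import numpy as np

rng = np.random.default_rng(1)
n_steps, n_trials, s, beta = 2000, 20000, 0.3, 0.5
f = lambda u: np.cos(5 * u)                      # any bounded test function

# Left side: Riemann-Stieltjes discretization of int_0^s f(u) dX(u).
u_left = np.linspace(0.0, s, n_steps + 1)
dW_left = rng.standard_normal((n_trials, n_steps)) * np.sqrt(np.diff(u_left))
lhs = (f(u_left[:-1]) * dW_left).sum(axis=1)

# Right side: s^beta times the discretization of int_0^1 f(s u) dX(u).
u_right = np.linspace(0.0, 1.0, n_steps + 1)
dW_right = rng.standard_normal((n_trials, n_steps)) * np.sqrt(np.diff(u_right))
rhs = s**beta * (f(s * u_right[:-1]) * dW_right).sum(axis=1)

# Both variances should be close to the exact value int_0^s f(u)^2 du.
print(lhs.var(), rhs.var(), np.trapz(f(u_left) ** 2, u_left))
```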
Proof. [Theorem 10] Let $X = (X(t))_{t\in\mathbb{R}}$ be the $\alpha$-stable process, $p < \alpha \le 2$. Since $X$ has stationary increments, its scattering coefficients do not depend on $t$ and it suffices to analyze
\[
\mathbb{E}\left[ |(g_{\gamma_k} * dX)(0)|^p \right]
= \mathbb{E}\left[ \left| \int_{-s_k}^{0} g_{\gamma_k}(u)\, dX(u) \right|^p \right]
= \mathbb{E}\left[ \left| \int_{0}^{s_k} g_{\gamma_k}(u)\, dX(u) \right|^p \right],
\]
where the second equality uses the fact that the distribution of $X$ does not change if it is run in reverse, i.e.
\[
(X(t))_{t\in\mathbb{R}} \overset{d}{=} (X(-t))_{t\in\mathbb{R}}.
\]
It is well known that $X(t)$ satisfies (35) for $\beta = 1/\alpha$. Therefore, by Lemma 18,
\[
\mathbb{E}\left[ |(g_{\gamma_k} * dX)(0)|^p \right]
= \mathbb{E}\left[ \left| \int_0^{s_k} w\!\left( \frac{u}{s_k} \right) e^{i\xi_k u}\, dX(u) \right|^p \right]
= s_k^{p/\alpha}\, \mathbb{E}\left[ \left| \int_0^1 w(u)\, e^{i\xi_k s_k u}\, dX(u) \right|^p \right].
\]
So,
\[
\frac{\mathbb{E}\left[ |(g_{\gamma_k} * dX)(0)|^p \right]}{s_k^{p/\alpha}} = \mathbb{E}\left[ \left| \int_0^1 w(u)\, e^{i\xi_k s_k u}\, dX(u) \right|^p \right].
\]
The proof will be complete as soon as we show that
\[
\lim_{k\to\infty} \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\, e^{i\xi_k s_k u}\, dX(u) \right|^p \right] \right)^{1/p}
= \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\, e^{iLu}\, dX(u) \right|^p \right] \right)^{1/p}.
\]
By the triangle inequality,
\[
\left| \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\, e^{i\xi_k s_k u}\, dX(u) \right|^p \right] \right)^{1/p} - \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\, e^{iLu}\, dX(u) \right|^p \right] \right)^{1/p} \right|
\le \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^p \right] \right)^{1/p}.
\]
Since $1 \le p < \alpha$, we may choose $p_0$ strictly greater than 1 such that $p \le p_0 < \alpha$, and note that by Jensen's inequality
\[
\left( \mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^p \right] \right)^{1/p}
\le \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^{p_0} \right] \right)^{1/p_0},
\]
and since $X(t)$ is a $p_0$-integrable martingale, the boundedness of martingale transforms [72] (see also [73]) implies
\[
\left( \mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^{p_0} \right] \right)^{1/p_0}
\le C_{p_0} \sup_{0 \le u \le 1} \left| w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) \right| \mathbb{E}\left[ |X_1|^{p_0} \right]
\le C_{p_0}\, |s_k\xi_k - L|\, \|w\|_\infty\, \mathbb{E}\left[ |X_1|^{p_0} \right],
\]
which converges to zero by the continuity of $w$ on $[0,1]$ and the assumption that $s_k\xi_k$ converges to $L$.

Proof. [Theorem 11] Similarly to the proof of Theorem 10, it suffices to show that if $(X(t))_{t\in\mathbb{R}}$ is a fractional Brownian motion with Hurst parameter $H$, then
\[
\lim_{k\to\infty} \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^p \right] \right)^{1/p} = 0.
\]
However, fractional Brownian motion is not a semi-martingale, so we cannot apply Burkholder's theorem as we did in the proof of Theorem 10. Instead, we use the following result, first established in [74] (see also [75], p. 48), which states that if $x(u)$ is any (deterministic) function with bounded variation, and $y(u)$ is any function which is $\alpha$-Hölder continuous, $0 < \alpha < 1$, then
\[
\int_0^1 x(u)\, dy(u)
\]
is well-defined as the limit of Riemann sums and
\[
\left| \int_0^1 x(u)\, dy(u) - x(0)\left( y(1) - y(0) \right) \right| \le C_\alpha\, \|x\|_{BV}\, \|y\|_\alpha,
\]
where $\|\cdot\|_{BV}$ and $\|\cdot\|_\alpha$ are the bounded variation and $\alpha$-Hölder seminorms respectively. For all $k$, the function
\[
h_k(u) := w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) =: w(u)\, f_k(u)
\]
satisfies $h_k(0) = 0$ and
\[
\|h_k\|_{BV} \le \|w\|_\infty\, \|f_k\|_{BV} + \|w\|_{BV}\, \|f_k\|_\infty.
\]
One can check that the fact that $s_k\xi_k$ converges to $L$ implies that $f_k$ converges to zero in both $L^\infty$ and in the bounded variation seminorm, and therefore that $\|h_k\|_{BV}$ converges to zero. It is well known that fractional Brownian motion with Hurst parameter $H$ admits a continuous modification which is $\alpha$-Hölder continuous for any $\alpha < H$. Therefore,
\[
\mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^p \right] \le C_\alpha^p\, \|h_k\|_{BV}^p\, \mathbb{E}\left[ \|X\|_\alpha^p \right].
\]
Lastly, one can use the Garsia-Rodemich-Rumsey inequality [76] to show that $\mathbb{E}[\|X\|_\alpha^p] < \infty$ for all $1 < p < \infty$. For details we refer the reader to the survey article [77]. Therefore,
\[
\lim_{k\to\infty} \mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^p \right] = 0,
\]
as desired.

Remark 8. The assumption that $w$ has bounded variation was used to justify that the stochastic integral against fractional Brownian motion is well defined as the limit of Riemann sums, because of its Hölder continuity and the above mentioned result of [74]. This allowed us to avoid the technical complexities of defining such an integral using either the Malliavin calculus or the Wick product.

Details of Numerical Experiments

Definition of Filters

For all the numerical experiments, we take the window function $w$ to be the smooth bump function
\[
w(t) =
\begin{cases}
\exp\left( -\dfrac{1}{4t - 4t^2} \right), & t \in (0, 1) \\
0, & \text{otherwise.}
\end{cases}
\]
Therefore, for $\gamma = (s, \xi)$, our filters are given by
\[
g_\gamma(t) = e^{i\xi t}\, w(t/s) =
\begin{cases}
e^{i\xi t}\, e^{-s^2/(4ts - 4t^2)}, & t \in (0, s) \\
0, & \text{otherwise.}
\end{cases}
\]
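The following is a minimal sketch (not the dissertation's code) of the window and filter defined above; the function names `window` and `filter_gamma` are illustrative, and only the formulas themselves come from the text.

```python
# Sketch of the smooth bump window w and the filter g_gamma(t) = e^{i xi t} w(t/s).
import numpy as np

def window(t):
    """Smooth bump w(t) = exp(-1/(4t - 4t^2)) on (0, 1), zero elsewhere."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    inside = (t > 0) & (t < 1)
    out[inside] = np.exp(-1.0 / (4.0 * t[inside] - 4.0 * t[inside] ** 2))
    return out

def filter_gamma(t, s, xi):
    """g_gamma(t) for gamma = (s, xi); supported on (0, s)."""
    t = np.asarray(t, dtype=float)
    return np.exp(1j * xi * t) * window(t / s)

# Example: sample a filter at scale s = 0.25 with a frequency drawn from (0, 2*pi).
t = np.linspace(-0.5, 1.0, 1501)
g = filter_gamma(t, s=0.25, xi=2 * np.pi * 0.37)
```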
Frequencies

In all of our experiments, we hold the frequency $\xi$, which we sample uniformly at random from $(0, 2\pi)$, constant while allowing the scale to decrease to zero.

Simulation of Poisson point processes

We use the standard method to generate a realization of a Poisson point process. For a homogeneous Poisson point process with intensity $\lambda$, the time interval between two neighboring jumps follows an exponential distribution:
\[
\Delta_j := t_j - t_{j-1} \sim \operatorname{Exp}(\lambda).
\]
Therefore, inverting the cumulative distribution function, we sample the time interval between two neighboring jumps through
\[
\Delta_j = -\frac{\log U_j}{\lambda},
\]
where the $U_j$ are i.i.d. uniform random variables on $[0,1]$, and we assign the charge $A_j$ to the jump at location $t_j$.

For an inhomogeneous Poisson process with intensity function $\lambda(t)$, we simulate the jump times based on the proposition from [78]. First define the cumulative intensity
\[
\Lambda(t) = \int_0^t \lambda(s)\, ds,
\]
then generate the locations of the jumps $t_j$ by the following algorithm:

Algorithm 2: Simulating an inhomogeneous Poisson point process
  initialize $V = 0$, $t = 0$
  while $t < N$ do
    generate $U \sim \mathcal{U}([0, 1])$
    $V \leftarrow V - \log U$
    $t = \inf\{ v : \Lambda(v) \ge V \}$
    deliver $t$
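The following is a minimal sketch (not the dissertation's code) of the two sampling schemes just described; the function names, the horizon `T_max`, and the example intensities are illustrative assumptions.

```python
# Homogeneous sampling via Exp(lambda) gaps, and inhomogeneous sampling via the
# inversion scheme of Algorithm 2 (t_j = Lambda^{-1}(V_j) for unit-rate arrival times V_j).
import numpy as np

def homogeneous_poisson(lam, T_max, rng):
    """Jump locations on [0, T_max] with i.i.d. Exp(lam) inter-arrival times."""
    times, t = [], 0.0
    while True:
        t += -np.log(rng.uniform()) / lam      # Delta_j = -log(U_j) / lambda
        if t >= T_max:
            return np.array(times)
        times.append(t)

def inhomogeneous_poisson(Lambda_inv, T_max, rng):
    """Jump locations via Algorithm 2, given the inverse cumulative intensity."""
    times, V = [], 0.0
    while True:
        V += -np.log(rng.uniform())            # V <- V - log U
        t = Lambda_inv(V)                      # t = inf{ v : Lambda(v) >= V }
        if t >= T_max:
            return np.array(times)
        times.append(t)

rng = np.random.default_rng(0)
hom = homogeneous_poisson(lam=5.0, T_max=10.0, rng=rng)
# Example intensity lambda(t) = 2t, so Lambda(t) = t^2 and Lambda^{-1}(V) = sqrt(V).
inhom = inhomogeneous_poisson(Lambda_inv=np.sqrt, T_max=10.0, rng=rng)
```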
BIBLIOGRAPHY

[1] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[2] Javier Portilla and Eero Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40(1):49–71, 10 2000.

[3] Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in neural information processing systems, pages 262–270, 2015.

[4] Stephane Mallat. A wavelet tour of signal processing, third edition: The sparse way. Academic Press, 3rd edition, 2008.

[5] B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell., 18(8):837–842, August 1996.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[7] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.

[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[10] Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.

[11] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. ArXiv e-prints, October 2017.

[12] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. ArXiv e-prints, May 2016.

[13] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In Computer Vision and Pattern Recognition (CVPR), 2016.

[14] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.

[15] Leon A. Gatys, Alexander S. Ecker, and M. Bethge. A neural algorithm of artistic style. ArXiv, abs/1508.06576, 2015.

[16] Joakim Andén and Stéphane Mallat. Multiscale scattering for audio classification. In Proceedings of the ISMIR 2011 conference, pages 657–662, 2011.

[17] Joakim Andén and Stéphane Mallat. Deep scattering spectrum. IEEE Transactions on Signal Processing, 62(16):4114–4128, August 2014.

[18] G. Wolf, S. Mallat, and S.A. Shamma. Audio source separation with time-frequency velocities. In 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Reims, France, 2014.

[19] Guy Wolf, Stephane Mallat, and Shihab A. Shamma. Rigid motion model for audio source separation. IEEE Transactions on Signal Processing, 64(7):1822–1831, 2015.

[20] Joakim Andén, Vincent Lostanlen, and Stéphane Mallat. Classification with joint time-frequency scattering. arXiv:1807.08869, 2018.

[21] Joan Bruna and Stéphane Mallat. Classification with scattering operators. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1561–1566, 2011.

[22] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, August 2013.

[23] Laurent Sifre and Stéphane Mallat. Combined scattering for rotation invariant texture analysis. In Proceedings of the ESANN 2012 conference, 2012.

[24] Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.

[25] Laurent Sifre and Stéphane Mallat. Rigid-motion scattering for texture classification. arXiv:1403.1687, 2014.

[26] Edouard Oyallon and Stéphane Mallat. Deep roto-translation scattering for object classification. In Proceedings of the IEEE CVPR 2015 conference, 2015. arXiv:1412.8659.

[27] Matthew Hirn, Stéphane Mallat, and Nicolas Poilvert. Wavelet scattering regression of quantum chemical energies. Multiscale Modeling and Simulation, 15(2):827–863, 2017. arXiv:1605.04654.

[28] Michael Eickenberg, Georgios Exarchakis, Matthew Hirn, and Stéphane Mallat. Solid harmonic wavelet scattering: Predicting quantum molecular energy from invariant descriptors of 3D electronic densities. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 6540–6549, 2017.

[29] Michael Eickenberg, Georgios Exarchakis, Matthew Hirn, Stéphane Mallat, and Louis Thiry. Solid harmonic wavelet scattering for predictions of molecule properties. Journal of Chemical Physics, 148:241732, 2018.

[30] Xavier Brumwell, Paul Sinz, Kwang Jin Kim, Yue Qi, and Matthew Hirn. Steerable wavelet scattering for 3D atomic systems with application to Li-Si energy prediction. In NeurIPS Workshop on Machine Learning for Molecules and Materials, 2018.

[31] Stéphane Mallat and Iréne Waldspurger. Phase retrieval for the Cauchy wavelet transform. Journal of Fourier Analysis and Applications, 21(6):1251–1309, 2015.

[32] Juri Ranieri, Amina Chebira, Yue M. Lu, and Martin Vetterli. Phase retrieval for sparse signals: Uniqueness conditions. CoRR, abs/1308.3058, 2013.

[33] Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28, pages 262–270, 2015.
[34] Joseph Antognini, Matt Hoffman, and Ron J. Weiss. Synthesizing diverse, high-quality audio textures. arXiv:1806.08002, 2018.

[35] Mikolaj Binkowski, Gautier Marti, and Philippe Donnat. Autoregressive convolutional neural networks for asynchronous time series. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 580–589, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[36] Antoine Brochard, Bartłomiej Błaszczyszyn, Stéphane Mallat, and Sixin Zhang. Statistical learning of geometric characteristics of wireless networks. arXiv:1812.08265, 2018.

[37] Martin Haenggi, Jeffrey G. Andrews, François Baccelli, Olivier Dousse, and Massimo Franceschetti. Stochastic geometry and random graphs for the analysis and design of wireless networks. IEEE Journal on Selected Areas in Communications, 27(7):1029–1046, 2009.

[38] Astrid Genet, Pavel Grabarnik, Olga Sekretenko, and David Pothier. Incorporating the mechanisms underlying inter-tree competition into a random point process model to improve spatial tree pattern analysis in forestry. Ecological Modelling, 288:143–154, 09 2014.

[39] Frederic Paik Schoenberg. A note on the consistent estimation of spatial-temporal point process parameters. Statistica Sinica, 2016.

[40] V. Fromion, E. Leoncini, and P. Robert. Stochastic gene expression in cells: A point process approach. SIAM Journal on Applied Mathematics, 73(1):195–211, 2013.

[41] Joan Bruna, Stéphane Mallat, Emmanuel Bacry, and Jean-Francois Muzy. Intermittent process analysis with scattering moments. Annals of Statistics, 43(1):323–351, 2015.

[42] D. J. Daley and D. Vere-Jones. An introduction to the theory of point processes. Vol. I. Probability and its Applications (New York). Springer-Verlag, New York, second edition, 2003. Elementary theory and methods.

[43] Joan Bruna and Stéphane Mallat. Multiscale sparse microcanonical models. Mathematical Statistics and Learning, 1(3/4):257–315, 01 2018.

[44] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.

[45] Urs Bergmann, Nikolay Jetchev, and Roland Vollgraf. Learning texture manifolds with the periodic spatial GAN. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 469–477, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[46] Nikolay Jetchev, Urs Bergmann, and Roland Vollgraf. Texture synthesis with spatial generative adversarial networks. ArXiv, abs/1611.08207, 2016.

[47] Nikolay Jetchev, Urs Bergmann, and C. Seward. Ganosaic: Mosaic creation with generative texture manifolds. ArXiv, abs/1712.00269, 2017.

[48] Wenqi Xian, Patsorn Sangkloy, Jingwan Lu, Chen Fang, F. Yu, and James Hays. TextureGAN: Controlling deep image synthesis with texture patches. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8456–8465, 2018.

[49] Yang Zhou, Zhen Zhu, Xiang Bai, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Non-stationary texture synthesis by adversarial expansion. ACM Transactions on Graphics (Proc. SIGGRAPH), 37(4), 2018.
[50] Adib Akl, Charles Yaacoub, Marc Donias, Jean-Pierre Da Costa, and Christian Germain. A survey of exemplar-based texture synthesis methods. Computer Vision and Image Understanding, 172:12–24, 2018.

[51] Bruno Galerne, Yann Gousseau, and Jean-Michel Morel. Micro-texture synthesis by phase randomization. Image Processing On Line, 1:213–237, 2011.

[52] B. Galerne, Y. Gousseau, and J. Morel. Random phase textures: Theory and synthesis. IEEE Transactions on Image Processing, 20:257–267, 2011.

[53] Song Chun Zhu, Yingnian Wu, and David Mumford. Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. International Journal of Computer Vision, 27(2):107–126, 1998.

[54] Y. Lu, Song-Chun Zhu, and Y. Wu. Learning FRAME models using CNN filters for knowledge visualization. ArXiv, abs/1509.08379, 2015.

[55] Valentin De Bortoli, Agnès Desolneux, Bruno Galerne, and Arthur Leclaire. Macrocanonical models for texture synthesis. In Jan Lellmann, Martin Burger, and Jan Modersitzki, editors, Scale Space and Variational Methods in Computer Vision, pages 13–24, Cham, 2019. Springer International Publishing.

[56] Valentin De Bortoli, A. Desolneux, Alain Durmus, B. Galerne, and A. Leclaire. Maximum entropy methods for texture synthesis: theory and practice. SIAM J. Math. Data Sci., 3:52–82, 2021.

[57] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.

[58] J. Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.

[59] Valentin De Bortoli, A. Desolneux, B. Galerne, and A. Leclaire. Macrocanonical models for texture synthesis. In SSVM, 2019.

[60] Ivan Ustyuzhaninov, Wieland Brendel, Leon Gatys, and Matthias Bethge. What does it take to generate natural textures? In Proceedings of the International Conference on Learning Representations, 2017.

[61] David J. Heeger and James R. Bergen. Pyramid-based texture analysis/synthesis. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 229–238, 1995.

[62] Thibaud Briand, Jonathan Vacher, Bruno Galerne, and Julien Rabin. The Heeger-Bergen pyramid-based texture synthesis algorithm. Image Processing On Line, 4:279–299, 2014.

[63] Nicolas Gonthier, Yann Gousseau, and Saïd Ladjal. High resolution neural texture synthesis with long range constraints. CoRR, abs/2008.01808, 2020.

[64] Xavier Snelgrove. High-resolution multi-scale neural texture synthesis. In SIGGRAPH ASIA 2017 Technical Briefs, SA '17, New York, NY, USA, 2017. ACM.

[65] Benjamin Balas, Lisa Nakano, and Ruth Rosenholtz. A summary statistic representation in peripheral vision explains visual crowding. Journal of Vision, 9:13.1–18, 11 2009.

[66] Jeremy Freeman, Corey M Ziemba, David J Heeger, Eero P Simoncelli, and J Anthony Movshon. A functional and perceptual signature of the second visual area in primates. Nature Neuroscience, pages 974–981, 07 2013.

[67] Gouki Okazawa, Satohiro Tajima, and Hidehiko Komatsu. Image statistics underlying natural texture selectivity of neurons in macaque V4. Proceedings of the National Academy of Sciences, 112(4):E351–E360, 2015.

[68] R. Brüel Gabrielsson and G. Carlsson. Exposition and interpretation of the topology of neural networks. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pages 1069–1076, 2019.

[69] Sixin Zhang and S. Mallat. Maximum entropy models from phase harmonic covariances. ArXiv, abs/1911.10017, 2019.
[70] Lionel Moisan. Periodic plus smooth image decomposition. Working paper or preprint, May 2009.

[71] Matthew Hirn and Anna Little. Wavelet invariants for statistically robust multi-reference alignment. arXiv:1909.11062, 2019.

[72] Donald L. Burkholder. Sharp inequalities for martingales and stochastic integrals. In Colloque Paul Lévy sur les processus stochastiques, number 157-158 in Astérisque, pages 75–94. Société mathématique de France, 1988.

[73] Rodrigo Bañuelos and Gang Wang. Sharp inequalities for martingales with applications to the Beurling-Ahlfors and Riesz transforms. Duke Math. J., 80(3):575–600, 12 1995.

[74] L. C. Young. An inequality of the Hölder type, connected with Stieltjes integration. Acta Math., 67:251–282, 1936.

[75] Peter K Friz and Martin Hairer. A course on rough paths: with an introduction to regularity structures. Springer, 2014.

[76] A. M. Garsia, E. Rodemich, and H. Rumsey Jr. A real variable lemma and the continuity of paths of some Gaussian processes. Indiana University Mathematics Journal, 20(6):565–578, 1970.

[77] G. Shevchenko. Fractional Brownian motion in a nutshell. In International Journal of Modern Physics Conference Series, volume 36, page 1560002, January 2015.

[78] E. (Erhan) Cinlar. Introduction to stochastic processes. Prentice-Hall, Englewood Cliffs, N.J., 1975.