GENERATIVE SIGNAL PROCESSING THROUGH MULTILAYER MULTISCALE WAVELET MODELS

By

Jieqian He

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computational Mathematics, Science and Engineering – Doctor of Philosophy
Statistics – Dual Major

2021

ABSTRACT

GENERATIVE SIGNAL PROCESSING THROUGH MULTILAYER MULTISCALE WAVELET MODELS

By

Jieqian He

Wavelet analysis and deep learning are two popular fields for signal processing. The scattering transform from wavelet analysis is a recently proposed mathematical model for convolutional neural networks. Signals with repeated patterns can be analyzed using the statistics from such models. Specifically, signals from certain classes can be recovered from related statistics. We first focus on recovering 1D deterministic Dirac signals from multiscale statistics. We prove a Dirac signal can be recovered from multiscale statistics up to a translation and reflection. Then we switch to a stochastic version, modeled using Poisson point processes, and prove that wavelet statistics at small scales capture the intensity parameter of Poisson point processes. We also design a scattering generative adversarial network (GAN) to generate new Poisson point samples from the statistics of multiple given samples. Next we consider texture images. We successfully synthesize new textures given one sample from the texture class through multiscale, multilayer wavelet models. Finally, we analyze and prove why the multiscale, multilayer model is essential for signal recovery, especially for natural texture images.

Copyright by JIEQIAN HE 2021

ACKNOWLEDGEMENTS

I would like to express the deepest appreciation to my advisor and committee chair, Professor Matthew Hirn. He led me into the area of signal processing and provided a great deal of guidance during my PhD research. He has always been supportive and kind to me during the past five years. This dissertation would not have been possible without his guidance and encouragement. I would also like to thank my committee members, Professor Yuying Xie, Professor Mark Iwen, Professor Yimin Xiao and Professor Vishnu Boddeti, for their insightful comments and support during my research. Thanks to my dear friends Binbin, Hongnan, Ningyu, Yuning, Yuanyi, and Zheng for their consistent help and company. I greatly value their friendship, and they are the treasures I gained from MSU during the past five years. Thanks to my old friends Yilang and Yanjun for their long-distance support and lifelong friendships. Most importantly, I would like to thank my parents Xiaoqiang He and Li Lin for loving me and supporting me at every stage of my life. Many thanks to my fiancé Longhao Jin for always caring for me and standing by my side. And thanks to my little friends Kiwi and Simba for the warmest companionship.

TABLE OF CONTENTS

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
CHAPTER 2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Stochastic processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Texture Synthesis using VGG and Style Transfer . . . . . . . . . . . . . . . .
8 2.5 Wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.6 Scattering transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.7 Signal recovery through statistical features . . . . . . . . . . . . . . . . . . . 18 CHAPTER 3 CHARACTERIZING SPARSE SIGNALS THROUGH A HYBRID SCATTERING TRANSFORM . . . . . . . . . . . . . . . . . . . . . 19 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 Theory and proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 CHAPTER 4 SCATTERING STATISTICS OF GENERALIZED SPATIAL POIS- SON POINT PROCESSES . . . . . . . . . . . . . . . . . . . . . . . 34 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.2 Expected Scattering Moments for Random Signed Measures . . . . . . . . . 35 4.3 First-Order Scattering Moments of Generalized Poisson Processes . . . . . . 37 4.3.1 Inhomogeneous, Compound Spatial Poisson Point Processes . . . . . 38 4.3.2 First-order Scattering Asymptotics . . . . . . . . . . . . . . . . . . . 39 4.4 Second-Order Scattering Moments of Generalized Poisson Processes . . . . . 43 4.5 Poisson Point Processes Compared to Self Similar Processes . . . . . . . . . 44 4.6 Numerical Illustrations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.6.1 Homogeneous, compound Poisson point processes with the same in- tensities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.6.2 Homogeneous, compound Poisson point processes with different in- tensities and charges . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.6.3 Inhomogeneous, non-compound Poisson point processes . . . . . . . . 48 4.6.4 Homogeneous, non-compound Poisson point process and self similar process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.7 Scattering GAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 v 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 CHAPTER 5 TEXTURE SYNTHESIS VIA PROJECTION ONTO MULTISCALE, MULTILAYER STATISTICS . . . . . . . . . . . . . . . . . . . . . . 53 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.2.1 Wavelet filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.2.2 First layer statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 5.2.3 Second layer statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.3 Synthesis algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.4 Numerical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.4.1 Filter comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.4.2 Comparison of maximum scale . . . . . . . . . . . . . . . . . . . . . . 71 5.4.3 Layers analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.4.4 Methods comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 
75 5.5 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.5.1 Reduction of second layer statistics . . . . . . . . . . . . . . . . . . . 77 5.5.2 Matching of second layer statistics . . . . . . . . . . . . . . . . . . . 80 5.5.3 Analysis of number of iterations . . . . . . . . . . . . . . . . . . . . . 80 5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 CHAPTER 6 MULTILAYER MODEL ANALYSIS . . . . . . . . . . . . . . . . . . 86 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.2 Multilayer multiscale model . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.2.1 Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.2.2 Wavelet transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6.2.3 Gram matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6.3 Texture with multiple straight lines . . . . . . . . . . . . . . . . . . . . . . . 90 6.3.1 Gram matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.3.2 Small J . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.3.3 Deep layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.4 Frame-like texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.5 Multilayer model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.5.1 Multilayer ReLU model . . . . . . . . . . . . . . . . . . . . . . . . . 101 CHAPTER 7 RANDOM FIELDS . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 7.1 Scattering moments of self-similar processes . . . . . . . . . . . . . . . . . . 106 7.2 Power spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 7.3 Power spectrum and scattering equivalence . . . . . . . . . . . . . . . . . . . 112 APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 vi LIST OF FIGURES Figure 2.1 A special pre-trained very deep CNN: VGG[1] . . . . . . . . . . . . . . . 7 Figure 2.2 Structure of the GAN. The generator takes in a random vector z with lower dimensions and tries to generate a new signal G(z) that comes from the same distribution as the given data x. The discriminator takes in a signal G(z) or x and tries to output a probability p of the signal coming from the real distribution. . . . . . . . . . . . . . . . . . . . . . 7 Figure 2.3 Process of using VGG features to synthesize texture images [1] . . . . . . 9 Figure 2.4 Synthesis results with different layers. The last row shows the original images. The first row shows synthesized image by matching statistics from only "conv1-1" layer and the second shows that matching statistics from "conv1-1" and "pool1" layers and so on. With only lower layer statistics, the synthesis lose information of the texture. Adding statistics from deeper layers improves the performance, which proves higher layer statistics are necessary to capture texture information. The last column shows result of a natural image, which has more localized structure and is not stationary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Figure 2.5 Left: A texture image of bricks with edges at direction 0, π/2. Right: The wavelets of the texture computed through Morlet wavelets, shown in Figure 2.6. 
It shows the texture has large response at θ = 0 and θ = π/2, tiny response at other angles, which match the directions of the edges in the texture. Also, as the wavelet scale goes larger, the response gets smoother. . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Figure 2.6 Morlet wavelets (real part) with 8 different scales and 12 different rota- tions in space and frequency. Left: 2D Morlet wavelet family (ψj,θ )j,θ in space. Middle: 2D Morlet wavelet family in frequency, which is the Fourier transform of 2D Morlet wavelet family (ψbj,θ )j,θ . Right: The summation |φ cJ (ω)|2 + P λ∈Λ |ψλ (ω)| which is almost a constant c 2 over all frequencies, making an approximate Littlewood-Paley frame. This figure shows this group of wavelets plus a low pass satisfy the Littlewood-Paley condition on a half plane approximately, except for the boundary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Figure 3.1 A wavelet ψ of 4 vanishing moments. . . . . . . . . . . . . . . . . . . . . 21 Figure 3.2 Wavelets sparsify piecewise polynomials on the interval [0, 1024]. . . . . . 21 vii Figure 3.3 Gabor filters used in the second layer (real parts) . . . . . . . . . . . . . 22 Figure 3.4 Choosing filter scales depending on the pairwise distance between signal spikes. Left: A sparse signal with n = 128, three spikes and minDD = 8. Right: The blue spikes show what are the pairwise distance. The orange spikes show what are the scales we choose. . . . . . . . . . . . . 31 Figure 3.5 Sparse signals reconstructed up to a global reflection and translation. Each row shows the result from a single test. Left column: Original signals. Right column: Synthesized signals. . . . . . . . . . . . . . . 33 Figure 4.1 First-order invariant scattering moments for three types of homoge- neous compound Poisson point processes with the same intensity λ0 . Left: Top: ordinary Poisson point process. Middle: Gaussian com- pound Poisson point process with normally distributed charges. Bot- tom: Rademacher compound Poisson point process with charges drawn from the Rademacher distribution. Middle: Normalized invariant scat- tering moments SY (s,ξ,1)/skwk1 (i.e., p = 1), which all converge to 0.01 as s → 0 (up to numerical errors) since λ0 E[|A1 |] is the same for all three point processes. Right: Normalized invariant scattering moments SY (s,ξ,2)/skwk2 (i.e., p = 2). In this case the ordinary Poisson point pro- 2 cess and the Rademacher Poisson point process still converge to the same value as s → 0 since E[|A1 |2 ] = 1 for both of them. However, the Gaussian Poisson point process converges to a different value since E[|A1 |2 ] = π/2 for this process. . . . . . . . . . . . . . . . . . . . . . . . . 47 Figure 4.2 First-order invariant scattering moments for two homogeneous, Gaus- sian compound Poisson point processes with different intensity and vari- ance. Left: Top: Homogeneous compound Poisson point process with intensity λ1 = 0.01 and charges A1,j ∼ N (0, 1). Bottom: Homogeneous √ compound Poisson point process with intensity λ2 = 0.01/ 2 and charges A2,j ∼ N (0, 2). The two point processes are difficult to distinguish, vi- sually. Middle: Normalized invariant scattering moments SY (s,ξ,1)/skwk1 (i.e., p = 1), which both converge to approximately 0.08 up to nu- merical error, thus indicating that these moments cannot distinguish the two processes. Right: Normalized invariant scattering moments SY (s,ξ,2)/skwk2 (i.e., p = 2). 
The two process are distinguished as s → 0 2 since the values λ1 E[|A1,j |2 ] = 0.01 and λ2 E[|A2,j |2 ] ≈ 0.014 differ by a significant margin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 viii Figure 4.3 First-order invariant scattering moments for inhomogeneous non-compound Poisson point processes. Left: Inhomogeneous non-compound Poisson point process with intensity λ(t) = 0.01(1+0.5 sin(2πt/N )). Right: Scat- tering moments S[γ,p]Y (t)/skwkpp for inhomegeneous non-compound Poisson point process at t1 = N/4, t2 = N/2, t3 = 3N/4. Note that λ(t1 ) = 0.015, λ(t2 ) = 0.01, λ(t3 ) = 0.005. The plots show that for inhomogeneous process, scattering coefficients at time t converges to the intensity at that time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Figure 4.4 First-order invariant scattering moments for Brownian motion and Pois- son point process. Left: Top: Brownian motion with Hurst parameter H = 1/2. Bottom: Ordinary Poisson point process. Middle: Nor- malized scattering moments for Brownian Motion (SXBM (s,ξ,1)/kwk2 E|Z|) and Poisson point process (SYpoisson (s,ξ,1)/λE|A|kwk1 ) √ at p = 1. This shows the convergence rate of normalized scattering is s for Brownian mo- tion and s for Poisson process, indicating the 1st moment can distin- guish Brownian motion and Poisson point process. Right: Normalized scattering moments for Brownian Motion (SXBM (s,ξ,2)/kwk22 ) and Poisson point process (SYpoisson (s,ξ,2)/kwk22 ) at p = 2. Both normalized scattering moments have convergence rate s, so the 2nd moment scattering cannot distinguish the two processes. . . . . . . . . . . . . . . . . . . . . . . . . 49 Figure 4.5 Scattering-GAN to study the capacity of scattering moments on Poisson point process. Similar to ordinary GAN, the generator takes in a random vector z and generates fake data y 0 . Also, the discriminator aims to distinguish fake representations from real. By inserting a scattering module between G and D, the discriminator tries to distinguish S(y 0 ) from S(y). When the model trains successfully, S(y 0 ) has the same distribution as S(y). By checking the similarity between y 0 and y, we learn the capacity of scattering moments. . . . . . . . . . . . . . . . . . . 50 Figure 4.6 Generated signals through scattering GAN. We use realizations with length n = 212 from a homogeneous ordinary Poisson point process, i.e., λ(t) ≡ λ0 and Ai ≡ 1, as training data. We use {2j/2 }22 j=0 as scales for filters and apply a one-layer scattering operator to compute the scattering moments. Sigmoid is applied at the last layer in the generator. The generated signals are sparse, although not as sparse as training data. This is natural since Sigmoid forces A0i ∈ (0, 1), thus E[A0i ] < E[Ai ]. According to theorem 6, λ00 > λ0 , which we verified numerically. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 ix Figure 5.1 Wavelet families. Upper: Even directional wavelets in space and fre- quency (FFT). Middle: Odd directional wavelets in space and fre- quency (FFT). Lower: Omnidirectional wavelets in space and frequency (FFT). Each block shows the wavelet family with different scales and oscillations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Figure 5.2 1D wavelets. From left to right: 1D even wavelet, 1D odd wavelet, FFT (real part) of even wavelet, FFT (imagery part) of odd wavelet. . . . . . 62 Figure 5.3 Dirac functions and wavelet coefficients. Left: Two Dirac functions y1 and −y1 . 
Middle: Wavelet coefficients with the even wavelet. Right: Wavelet coefficients with the odd wavelet. . . . . . . . . . . . . . . . . . 63 Figure 5.4 Covariance matrix for diracs. Upper row from left to right: Cye1 , C−y e 1 , Cy1 − C−y1 . Lower row from left to right: Cy1 , C−y1 , Cy1 − C−y1 . This e e o o o o numerically verified that even wavelet is able to distinguish the two dirac functions from the covariance statistics while odd wavelet cannot. . . . . 63 Figure 5.5 Jump functions and wavelet coefficients. Left: Two jump functions y2 and −y2 . Middle: Wavelet coefficients with the even wavelet. Right: Wavelet coefficients with the odd wavelet. . . . . . . . . . . . . . . . . . 64 Figure 5.6 Upper row from left to right: Cye2 , C−y e 2 , Cye2 − C−y e 2 . Lower row from left to right: Cy2 , C−y2 , Cy2 − C−y2 . This numerically verified that odd o o o o wavelet is able to distinguish the two jump functions from the covariance statistics while the even wavelet cannot. . . . . . . . . . . . . . . . . . . 65 Figure 5.7 Synthesis results from one layer (J = 6) with different types of wavelet filters. Left: Original image. Middle left: First layer synthesis results with only odd wavelets. Middle right: First layer synthesis results with only even wavelets. Right: First layer synthesis results with both even and odd wavelets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Figure 5.8 Synthesis results from two layers (J = 5) with/without omnidirectional wavelets. Left: Original image. Middle: 2nd layer synthesis with even and odd wavelets. Right: 2nd layer synthesis with even, odd and omnidirectional wavelets. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Figure 5.9 Synthesis results from two layers with different number of scales. Left: Original image. Middle Left: 2nd layer synthesis results with J = 4. Middle Right: 2nd layer synthesis results with J = 5. Right: 2nd layer synthesis results with J = 6. . . . . . . . . . . . . . . . . . . . . . . 74 x Figure 5.10 Synthesis results with one layer model and two layer model. Left: Orig- inal image. Middle: 1st layer synthesis results. Right: 2nd layer synthesis results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 Figure 5.11 Synthesis results compared to other models. Left: Original images. Middle Left: Results from Portilla and Simoncelli [2]. Middle Right: Results from Gatys et al. [3]. Right: Results from our two layer model. 78 Figure 5.12 Synthesis results compared to other models. Left: Original images. Middle Left: Results from Portilla and Simoncelli [2]. Middle Right: Results from Gatys et al. [3]. Right: Results from our two layer model. 79 Figure 5.13 Left: Original images. Middle left: Images synthesized from first layer. Middle right: Images synthesized from second layer, initialized from first layer result. Right: Images synthesized from second layer, initialized from uniform noise. . . . . . . . . . . . . . . . . . . . . . . . 81 Figure 5.14 The synthesis process for different micro-textures plus flowers. . . . . . . 83 Figure 5.15 The synthesis process for different macro-textures with rigid patterns. . . 84 Figure 5.16 The synthesis process when initializing from the first layer synthesis, for a texture with complex patterns. . . . . . . . . . . . . . . . . . . . . . . 84 Figure 6.1 Two line texture images P8 that have the same T summation of height squared. P4 √ x1 (u) = i=1 132iT (u1 ), u ∈ [Z [0, 256)] . Right: x2 (u) = 2 Left: i=1 2 · 1 (u 64i 1 ), u ∈ [Z [0, 256)]2 . . . . . . . . 
. . . . . . . . . . . . 94 Figure 6.2 The gram matrices and the difference for the two line textures in Figure 6.1. Left: The gram matrix Gx1 between ReLU responses for texture x1 . Middle: The gram matrix Gx2 between ReLU responses for texture x2 . Right: The difference between the two gram matrices |Gx1 − Gx2 |. 94 Figure 6.3 Frame texture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 xi LIST OF ALGORITHMS Algorithm 1 Projection algorithm for texture synthesis . . . . . . . . . . . . . . 69 Algorithm 2 Algorithm for simulating inhomogeneous Poisson point process . . 140 xii CHAPTER 1 INTRODUCTION Wavelet analysis [4] and deep learning are two popular fields for signal processing. Wavelet transforms can be used to extract information from many different kinds of signal data and are related to harmonic analysis. In the 90’s, people used wavelets to analyze 2D signals and do classification on texture images [5]. In [2], directional even and odd wavelets are used to extract statistical features of texture images and reconstruct the images from such features. In recent years, image processing has achieved great progress with the development of deep learning, which uses deep neural networks (DNN) to process signals. One popular model is the convolutional neural network (CNN), which achieves great performance on image classification [1, 6, 7]. DNNs are not only used to do classification on high dimensional data, but also to generate new data. A very deep neural network called VGG [1], which is pre- trained over Imagenet [8], successfully synthesizes texture images from only one sample [3]. Another recent model is the Generative Adversarial Network (GAN) [9], which generates signals that are from the same distribution of given data by training two networks in an adversarial way. However, there is not enough theory to explain why deep learning works, so these types of methods remain black boxes. Recently, theory was developed on the scattering transform [10], which combines wavelet transforms and nonlinear operators to capture features of signals. The process is similar to a CNN and it connects wavelet theory to deep neural nets. So a new task is to understand deep neural networks through theory in wavelet analysis. The research contained in this thesis involves wavelet analysis and deep learning in signal processing. Within this field, an interesting and difficult task is signal reconstruction. It requires to extract all important features of signals and to develop efficient algorithms to reconstruct the signals from such features. This thesis describes the use of modified scattering transforms to synthesize 1D sparse signals and 2D texture images. It also reports a new 1 algorithm that combines GANs and scattering networks to generate 1D stochastic processes. The relationship between filter size, model depth and signals is also discussed. Finally, wavelet analysis on random fields is explored. The remainder of this thesis is organized as follows. Chapter 2 introduces background material. Chapter 3 describes work on 1D sparse signal analysis and synthesis. Chapter 4 explores multilayer scattering features on Poisson point processes, especially at small scales. Chapter 5 presents a multiscale, multilayer model for texture image synthesis. Chapter 6 provides theoretical analysis on filter size, model depth and signal reconstruction. Chapter 7 discusses wavelet analysis on random fields. 
2 CHAPTER 2 BACKGROUND This chapter introduces the background of stochastic processes, CNNs, GANs, wavelets, and the scattering transform. In section 2.1, stochastic processes will be defined and properties such as being stationary and ergodic will be discussed. Section 2.2 covers the background of CNNs and a specific architecture VGG, to explain the basic operators needed to define a CNN. Section 2.3 reviews the intuition and architecture of GANs. Section 2.5 introduces the 1D and 2D wavelet family and wavelet transforms. The intuition of wavelet analysis in signal processing will be briefly described. Section 2.6 reviews the definition and properties of the scattering transform, which was recently proposed as a mathematical model for CNNs that uses wavelet theory. 2.1 Stochastic processes Stochastic processes are used to model signals from real life that exhibit randomness. A stochastic process {Xt : t ∈ Rd } is a collection of random variables indexed by t. When d > 1, the stochastic process is referred to as a random field. For d = 2, the process is often used to model grey level texture images. The mean and covariance function of Xt is defined as: µX (t) = EXt , σX (s, t) = Cov(Xs , Xt ) Two processes X = {Xt : t ∈ Rd } and Y = {Yt : t ∈ Rd } are said to be equal in the sense of distribution if for any index set {t1 , t2 , ..., tk } ⊂ Rd , d (Xt1 , Xt2 , ..., Xtk ) = (Yt1 , Yt2 , ..., Ytk ) A stochastic process is stationary if, d {Xt+h : t ∈ Rd } = {Xt : t ∈ Rd }, ∀h ∈ Rd . 3 Stationarity implies the mean function µX (t) = µ is a constant that does not depend on t and the covariance function σX (s, t) = σ(s − t) only depends on the difference between s and t. A stochastic process has stationary increments if, d {Xt+h − Xh : t ∈ Rd } = {Xt − X0 : t ∈ Rd }, ∀h ∈ Rd . Define the time average of X to be, Z T 1 µ bX,T = Xt dt, 2T −T and let σX,T 2 be the variance of µ bX,T . A stationary process is said to be mean-ergodic if 2 lim σX,T = 0, T →∞ which implies lim E|b µX,T − µ|2 = 0. T →∞ From a practical perspective, if xt is a realization of Xt over a large interval [−T, T ], then with high probability Z T 1 xt dt ≈ µ. 2T −T The ergodic property for d moments is defined in a similar way. If X is stationary and ergodic, the statistics of the process can be derived from a single, sufficiently long, random realization of this process. These properties become useful when we discuss the wavelet statistics of texture images in the next few chapters. 2.2 Convolutional Neural Networks CNNs are a special class of deep neural networks for dealing with data that has geometric structure. A deep neural network is usually constructed with an input layer, an output layer and many hidden layers. Each layer consists of a linear or affine operator and a nonlinear operator. Training a network usually corresponds to learning the weights of the linear operators. CNNs use convolution as the linear operator to detect local information and 4 use a pooling function for nonlinear downsampling. Figure 2.1 shows a specially designed and pre-trained CNN called VGG. We now give an explicit definition of the operators in 2D CNNs. Given a discrete input signal x and a filter K in R2 , the ordinary discrete convolution operator is defined as: XX (x ∗ K)(i, i0 ) = x(m, m0 )K(i − m, i0 − m0 ). m m0 In a CNN, an image is always regarded as a three dimensional data, where the third dimension is referred to as the channel dimension. 
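As a quick illustration of the discrete convolution defined above (before channels, activations, and pooling enter the picture), here is a minimal NumPy sketch; the toy image, the 3 × 3 test filter, and the "full" output-size convention are placeholder choices rather than anything taken from the text.

```python
import numpy as np

def conv2d_full(x, K):
    """Ordinary discrete convolution (x * K)(i, i') = sum_{m, m'} x(m, m') K(i - m, i' - m')."""
    H, W = x.shape
    h, w = K.shape
    out = np.zeros((H + h - 1, W + w - 1))
    for m in range(H):
        for mp in range(W):
            # each input sample x(m, m') contributes a shifted, scaled copy of the filter K
            out[m:m + h, mp:mp + w] += x[m, mp] * K
    return out

# toy example: a 6 x 6 image with a vertical edge, filtered by a simple edge detector
x = np.zeros((6, 6)); x[:, 3:] = 1.0
K = np.array([[1.0, 0.0, -1.0]] * 3)
print(conv2d_full(x, K).shape)   # (8, 8)
```

In a full convolution layer, such single-channel responses are computed for every input–output channel pair and summed over the input channels, as described next.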
For example, a grey scale image is regarded of size N × N × 1 and an RGB image is regarded as N × N × 3. For a convolution layer in a CNN, suppose we have the input x with n1 channels denoted as (xi )1≤i≤n1 and we would like the output to have n2 channels; we need n1 × n2 filters which are denoted as (Ki,j )1≤i≤n1 ,1≤j≤n2 . The convolution layer specifically computes: Xn1 C(x)j = xi ∗ Ki,j , 1 ≤ j ≤ n2 i=1 for each output channel j. In practice, the convolution filters to be learned are usually of small size, such as 3 × 3 or 5 × 5. Good things about it include sparse interactions (each pixel at the current layer is computed from localized output from the last layer), parameter sharing (pixels are transformed with the same filters), and equivariance to translation (this operation is commutative with translation). It is also efficient to detect the singularities of a signal, such as the edges in an image. This linear operation C(x) is usually followed by a nonlinear activation function in a CNN. For example, a ReLU (Rectified linear unit) function g(x) = max{0, x} is used in the VGG network. Other CNN networks also use logistic sigmoid and hyperbolic tangent as the activation function. For classification tasks, a softmax function exp(xi0 ) softmax(x)i0 = P i exp(xi ) is used at the output layer to predict the probability of a data that belongs to class i0 . 5 A typical CNN layer may also have a pooling function. A max pooling returns the maximum output of the corresponding rectangular neighborhood. The average pooling is another popular pooling function. Pooling functions are used to represent the data with localized summary statistics which are approximately invariant to local translations. Pooling also increases the receptive fields, i.e., each output pixel contains information of more pixels from the last layer. These properties are important for image processing. Since each operator is defined explicitly, the gradient of the loss function over the weights can be computed explicitly, which makes the training efficient with gradient descent. To compute the gradient over weights and update the weights in gradient descent is known as the back-propagation, where we need the computation of the gradient to flow backwards in the network through the chain rule. CNNs have shown great efficiency and accuracy in practical applications, such as natural language processing, image classification and video recognition. Nowadays, people also use it to reconstruct and generate high dimensional data from an unknown distribution. However, there is not too much theory on why CNNs work so well. We try to interpret CNNs through scattering transforms [10] in Section 2.6. 2.3 Generative Adversarial Networks A very widely used model to learn the distribution of high-dimensional data is the Generative Adversarial Network (GAN) [9] from deep learning. Figure 2.2 shows the structure of the GAN which consists of two models: a generator and a discriminator. The two models are trained in an adversarial way in which the generator generates fake data that looks like real data and the discriminator tries to distinguish between the real data and generated data. If trained successfully, the generator learns the real data distribution. Suppose we have a discriminator D and a generator G. The generator G takes in a random vector z that comes from a prior random distribution pz and outputs data G(z). 
The discriminator D takes in real data x that comes from the real distribution pdata and fake 6 Figure 2.1 A special pre-trained very deep CNN: VGG[1] Figure 2.2 Structure of the GAN. The generator takes in a random vector z with lower dimensions and tries to generate a new signal G(z) that comes from the same distribution as the given data x. The discriminator takes in a signal G(z) or x and tries to output a probability p of the signal coming from the real distribution. 7 data G(z) that comes from the generated distribution pg and outputs D(x) and D(G(z)), which represents the probability the data comes from the real distribution. The loss function is defined through the binary cross entropy: V (D, G) = Ex∼pdata (x) [log(D(x))] + Ez∼pz (z) [log(1 − D(G(z)))] The goal is to find the D and G that solve the following minimax problem: min max V (D, G) G D The problem is solved by using gradient descent to update the parameters in the deep neural networks D and G. The authors prove that the global optimal solution is achieved when pg = pdata and D(x) = 21 . The GAN model does not learn the data distribution explicitly but generates signals that come from the real data distribution, that is, the generated data is not a real data but is similar to the real data. GANs have been widely used to generate natural images. In [11], the authors train a GAN in a progressive manner by gradually adding more layers in the training process and generates high resolution images. In [12], the model takes in sentences and generates images that fit the description. GANs can also be used to repair images with missing parts [13] or generate images with certain artistic style [14]. GANs are useful for many applications. 2.4 Texture Synthesis using VGG and Style Transfer In the previous section, we introduced GANs for generating new images. Another interesting topic from image processing is style transfer, which aims to transfer the style of one image to another. This is motivated by [3], where the authors generate a new texture image by matching the multi-layer statistics of image features computed through VGG [1]. They show the statistics from multiple deep layers is necessary to generate a new realization of the texture. And the loss between such statistics is referred to as the style loss in later research work. 8 Figure 2.3 Process of using VGG features to synthesize texture images [1] VGG is a very deep CNN with 16-19 layers. Figure 2.1 shows the architecture and the definition for each layer is described in Section 2.2 when we reviewed the CNN. This network has been pre-trained and showed great performance on image classification. Figure 2.3 shows the process of texture synthesis in [3] and the details will be described in the following paragraph. Suppose we have an input texture x0 . The output at layer l is a matrix H l (x0 ) ∈ RNl ×Ml where Nl is the number of output channels in layer l and Ml is the size of output from each channel. The authors use the gram matrix between channels as the summary statistics: X Gl (x0 )ij = H l (x0 )ik H l (x0 )jk k Features {G1 (x0 ), G2 (x0 ), ... GL (x0 )} from multiple layers are used to represent the texture 9 image x0 . Define the loss function between two images x0 and x as: X loss(x0 , x) = ||Gl (x0 ) − Gl (x)||22 l which is the summation of the l2 distance between gram matrices from each layer l. The goal is to find x∗ that minimizes the loss: x∗ = arg min loss(x0 , x). 
x By solving the problem above using gradient descent, they get back a new image which has the same statistics but is different from the input. This can be thought of as generating a new realization of one texture class from a given realization. Figure 2.4 shows the synthesis results. The good performance of matching the gram matrix from multiple layers explains that multi-layer statistics from the VGG network might capture general features of a class of texture. This work motivates the work on style transfer and has shown great performance [15]. It also inspired our work on texture synthesis in Chapter 5. However, it has poor performance on reproducing natural images which have more localized geometric structure and are not stationary. 2.5 Wavelets Wavelet theory is an important field developed for signal processing [4]. The Fourier trans- form of a function x ∈ L2 (Rd ) is defined as, Z x b(ω) = x(u)e−iu·ω du, ∀ω ∈ Rd Rd and has the property x[∗y =x b yb . (2.1) One can recover the function x through the inverse Fourier transform: Z 1 x(u) = b(ω)eiu·ω dω, ∀ω ∈ Rd . x (2π)d Rd 10 Figure 2.4 Synthesis results with different layers. The last row shows the original images. The first row shows synthesized image by matching statistics from only "conv1-1" layer and the second shows that matching statistics from "conv1-1" and "pool1" layers and so on. With only lower layer statistics, the synthesis lose information of the texture. Adding statistics from deeper layers improves the performance, which proves higher layer statistics are necessary to capture texture information. The last column shows result of a natural image, which has more localized structure and is not stationary. 11 A wavelet ψ ∈ L2 (Rd ) is defined as a function that is localized in space and frequency and has zero average [4]: Z ψ(u)du = 0 Rd It is normalized so that kψk = 1. We focus on the cases d = 1 and d = 2 in the following. For d = 1, a family of wavelets is obtained by dilating ψ: ψj (u) = 2−j ψ(2−j u), j ∈ Z. The 1D wavelet transform of a signal x ∈ L2 (R) is defined as: W x = (x ∗ ψj (u))u∈R,j∈Z . For d = 2, a family of wavelets is obtained by dilations and rotations:    cosθ −sinθ  ψj,θ (u) = 2−2j ψ(2−j Rθ−1 u), Rθ =   sinθ cosθ A 2D wavelet transform is defined as W x = (x ∗ ψj,θ (u))u∈R2 ,j∈Z,θ∈Θ⊂[0,π] Usaully Θ = { kπ K : k = 0, . . . , K − 1}. Figure 2.5 shows a texture image and its wavelet coefficients computed through Morlet wavelets. Figure 2.6 shows a family of 2D Morlet wavelets in space and frequency. A 2D mother Morlet wavelet is defined as: 1  1 u2 u2  ψ(u1 , u2 ) = exp − ( 12 + 22 ) · exp(2πiξu1 − C), (u1 , u2 ) ∈ R2 2πσ1 σ2 2 σ1 σ2 where σ1 and σ2 determine the scale of the mother wavelet at two directions and (ξ, 0) determines the center frequency of the wavelet. The constant C ensures ψ(u1 , u2 )du1 du2 = R 0. A family of Morlet wavelets is obtained by dilation and rotation described above. As can be seen, the Fourier transform of Morlet wavelets are essentially supported over different bounded regions, which means wavelets are localized in the frequency field. With Equation 2.1, the wavelet transform in frequency can be written as: W dx = (b x ψ d j,θ )j,θ 12 Figure 2.5 Left: A texture image of bricks with edges at direction 0, π/2. Right: The wavelets of the texture computed through Morlet wavelets, shown in Figure 2.6. It shows the texture has large response at θ = 0 and θ = π/2, tiny response at other angles, which match the directions of the edges in the texture. 
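In practice this is how the coefficients are computed: multiply x̂ by each ψ̂_{j,θ} and apply the inverse FFT. The following is a minimal NumPy sketch of such an FFT-based 2D Morlet wavelet transform; the grid size, the parameters (σ₁, σ₂, ξ), and the scale and angle grids are illustrative, and the zero-average requirement is enforced by subtracting a multiple of the Gaussian envelope, which plays the role of the constant C above.

```python
import numpy as np

def morlet(N, sigma1=0.8, sigma2=0.8, xi=0.75 * np.pi, j=0, theta=0.0):
    """Dilated, rotated 2D Morlet wavelet psi_{j,theta} sampled on an N x N grid."""
    u1, u2 = np.meshgrid(np.arange(N) - N // 2, np.arange(N) - N // 2, indexing="ij")
    # rotate by -theta and dilate by 2^j: psi_{j,theta}(u) = 2^{-2j} psi(2^{-j} R_theta^{-1} u)
    r1 = (np.cos(theta) * u1 + np.sin(theta) * u2) / 2 ** j
    r2 = (-np.sin(theta) * u1 + np.cos(theta) * u2) / 2 ** j
    env = np.exp(-0.5 * (r1 ** 2 / sigma1 ** 2 + r2 ** 2 / sigma2 ** 2))
    wave = env * np.exp(1j * xi * r1)
    C = wave.sum() / env.sum()                    # makes the wavelet have zero average
    return (wave - C * env) / (2 ** (2 * j) * 2 * np.pi * sigma1 * sigma2)

def wavelet_transform(x, scales, angles):
    """W x = (x * psi_{j,theta})_{j,theta}, computed as ifft2(fft2(x) * fft2(psi_{j,theta}))."""
    x_hat = np.fft.fft2(x)
    coeffs = {}
    for j in scales:
        for theta in angles:
            psi_hat = np.fft.fft2(np.fft.ifftshift(morlet(x.shape[0], j=j, theta=theta)))
            coeffs[(j, theta)] = np.fft.ifft2(x_hat * psi_hat)
    return coeffs

x = np.random.rand(64, 64)
W = wavelet_transform(x, scales=[0, 1, 2], angles=[k * np.pi / 4 for k in range(4)])
print(len(W), W[(0, 0.0)].shape)   # 12 wavelets, one 64 x 64 response each
```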
Also, as the wavelet scale goes larger, the response gets smoother. Figure 2.6 Morlet wavelets (real part) with 8 different scales and 12 different rotations in space and frequency. Left: 2D Morlet wavelet family (ψj,θ )j,θ in space. Middle: 2D Morlet wavelet family in frequency, which is the Fourier transform of 2D Morlet wavelet family (ψj,θ )j,θ . Right: The summation |φJ (ω)| + λ∈Λ |ψ 2 cλ (ω)|2 which is almost a constant over b c P all frequencies, making an approximate Littlewood-Paley frame. This figure shows this group of wavelets plus a low pass satisfy the Littlewood-Paley condition on a half plane approximately, except for the boundary. By multiplying x b with (ψ dj,θ )j,θ , the wavelet transform can capture localized information of the signal x in the frequency field. Let λ = (j, θ) ∈ ΛJ = {(j, θ) : j ≤ J, θ ∈ Θ ⊂ [0, 2π]} where ΛJ is the index set for wavelets and use ψλ to denote ψj,θ . Let φ be a low pass filter whose Fourier transform is concentrated at low frequencies. Set φJ (u) = 2−2J φ(2−J u). The filters {φJ , (ψλ )λ } are said 13 to be a Littlewood-Paley tight frame if: X cJ (ω)|2 + |φ cλ (ω)|2 = 1, ∀ω ∈ R2 |ψ λ∈ΛJ Define AJ x = x ∗ φJ and W [λ]x = x ∗ ψλ . The wavelet transform can be written as WJ x = {AJ x, (W [λ]x)λ∈ΛJ } and its norm is: X kWJ xk2 = kAJ xk2 + kW [λ]xk2 λ∈ΛJ If {φJ , (ψλ )λ } is a Littlewood-Paley tight frame, then kW xk2 = kxk2 . This shows the wavelet transform preserves the norm of a signal x and the property can be proved using Plancherel formula. We can define the dual wavelets as φeJ (ω) := φbJ (ω), ψeλ (ω) := ψbλ (ω). c c A signal x can be reconstructed from its wavelet transforms using the dual filters and the formula X x = x ∗ φJ ∗ φeJ + x ∗ ψλ ∗ ψeλ . (2.2) λ∈ΛJ 2D Morlet wavelets are used in image processing since they have a strong response to edges. Texture images always have edges distributed in a repeated pattern. Therefore the statistics (especially mean and variance) of wavelet coefficients, which summarize global edge information, may be used to analyze texture images. 2.6 Scattering transform Convolutional neural networks work very well for signal processing, especially image process- ing. However, why they work is not well understood and theory still remains to be developed. In [10], the author comes up with a scattering transform, which can be viewed as a mathe- matical model for CNNs. It has been used to achieve near state of the art results in the fields of audio signal processing [16, 17, 18, 19, 20], computer vision [21, 22, 23, 24, 25, 26], and quantum chemistry [27, 28, 29, 30], amongst others. The scattering transform is provably 14 invariant to local (or global) translations of the input signal and is also Lipschitz stable to the actions of diffeomorphisms on the input. These properties are motivated by the fact that signals with translation or small deformation usually contain similar information (e.g., they are in the same class). On the other hand, high frequency information is needed to distinguish one signal class from another, which motivated the use of nonlinear activation functions and a deep architecture. This non-paramatric model helps us to understand CNNs in a mathematical way. Assume we have a real valued function x ∈ L2 (Rd ). 
Let Lc : L2 (Rd ) → L2 (Rd ), for c ∈ Rd , be a translation operator: Lc x(u) = x(u − c) A map Φ on L2 (Rd ) is translation invariant if: Φ(Lc x) = Φ(x), ∀x ∈ L2 (Rd ), c ∈ Rd Let τ : Rd → Rd be a displacement function and Lτ : L2 (Rd ) → L2 (Rd ) be a diffeomorphism operator: Lτ x(u) = x(u − τ (u)) Let ∇τ (u) and Hτ (u) be the Jacobian and Hessian of τ . Define: kτ k∞ = sup |τ (u)|, k∇τ k∞ = sup k∇τ (u)k, kHτ k∞ = sup kHτ (u)k u∈Rd u∈Rd u∈Rd The map Φ is Lipschitz continuous to diffeomorphisms if there exists a constant C such that kΦ(x) − Φ(Lτ x)k ≤ Ckxk2 (kτ k∞ + k∇τ k∞ + kHτ k∞ ), ∀x ∈ L2 (Rd ) The modulus of the Fourier transform Φ(x) = |b x| is proved to be translation invariant but not stable to diffeomorphisms. So we turn to wavelet transforms. In this paragraph, we will explain the intuition of how the scattering transform is devel- oped to satisfy the translation invariance property while preserving high frequency informa- tion. The paper [10] proves it is also stable to diffeomorphism but we will omit the details. 15 Recall we defined λ = (j, θ) ∈ ΛJ , as well as the operators AJ x = x ∗ φJ and W [λ]x = x ∗ ψλ in section 2.2. Since AJ x and W [λ]x are convolution operators, they commute with transla- tions: AJ (Lc x) = Lc (AJ x), W [λ](Lc x) = Lc (W [λ]x) Then translation invariance can be obtained by integrating x ∗ φJ or x ∗ ψλ . Also note that the operator A is locally translation invariant in the sense that kA(Lc x) − A(x)k ≤ C·|c|·kxk 2J since φJ is a low pass filter. This is essential because sometimes we want local translation invariance instead of full translation invariance. Then instead of x ∗ ψλ , we could try to use R x ∗ ψλ ∗ φJ to obtain local translation invariance. However, since ψλ is a wavelet with zero average, the integral x ∗ ψλ = 0. Also since the support of ψ cλ and φ cJ have small overlap, R x ∗ ψλ ∗ φJ ≈ 0. Therefore we need a nonlinear operator. The scattering transform uses the modulus since it is non-expansive. Then nontrivial global and local translation invariance are obtained by Z |x ∗ ψλ | or |x ∗ ψλ | ∗ φJ . Note that the modulus operator nonlinearly projects x∗ψλ from the high frequency field to the low frequency field. Also note that |x ∗ ψλ | = |x\ ∗ ψλ |(0) and |x ∗\ cJ and R ψλ | ∗ φJ = |x\ ∗ ψλ |φ cJ is supported in low frequency field. However, the support of φ φ cJ is smaller than the support of |x\∗ ψλ |. Therefore, part of the information in |x\ ∗ ψλ | gets lost in the above representation. To recover it, we apply another wavelet transform over |x ∗ ψλ |. The information in |x ∗ ψλ | is preserved by {|x ∗ ψλ | ∗ φJ , (|x ∗ ψλ | ∗ ψλ0 )λ0 }. The term |x ∗ ψλ | ∗ φJ is invariant to translation while |x ∗ ψλ | ∗ ψλ0 recovers the lost high frequency information. By applying another nonlinear operator and a low pass to |x ∗ ψλ | ∗ ψλ0 , it also gets translation invariance. We give the standard definition of scattering transform in the next paragraph. Let p be a path sequence p = {λ1 , λ2 , ..., λm } ∈ Λm J . Define U [λ]x = |x ∗ ψλ | for x ∈ L2 (Rd ). A scattering propagator is defined as the path-ordered product: U [p] = U [λm ]...U [λ2 ]U [λ1 ] 16 Then the windowed scattering transform of x ∈ L2 (Rd ) for a path p is defined as: SJ [p]x(u) = U [p]x ∗ φJ (u) and is invariant to local translation. The scattering transform is defined as: Z Z 1 S̄[p]x = U [p]x(u)du with µp = U [p]δ(u)du (2.3) µp and is invariant to global translation. 
If the infinite set of finite paths p is defined as PJ , we can denote: SJ [PJ ]x = {SJ [p]x}p∈PJ or S̄[PJ ]x = {S̄[p]x}p∈PJ The multi-layer structure preserves high frequency information. The author also proves the scattering transform over PJ is stable to diffeomorphisms, i.e., for all τ ∈ C2 (Rd ) with k∇τ k∞ ≤ 12 , there exists a constant C such that kSJ [PJ ]x − SJ [PJ ](Lτ x)k ≤ CkU [PJ ]xk1 (kτ k∞ + k∇τ k∞ + kHτ k∞ ), ∀x ∈ L2 (Rd ) where kU [PJ ]xk1 = J ]xk. The key point to the proof is the stability of the P+∞ m=0 kU [Λm wavelet transform which we omit here. The scattering transform in L2 (Rd ) can be extended to stationary processes. If X(u) is a stationary process, for all p = {λ1 , λ2 , ..., λm } ∈ Λm J , U [p]X(u) is also stationary, and its expected value does not depend on u. The expected scattering transform is defined as: S̄[p]X = E(U [p]X) = E(||X ∗ ψλ1 ∗ · · · | ∗ ψλm |). This definition replaces the normalized integral of the scattering transform in Equation 2.3 by an expected value. If X is also ergodic, the expected scattering transform is estimated by computing the scattering transform of a realization x0 (u) of X(u): Z 1 S̄[p]X ≈ U [p]x0 (u)du µp 17 2.7 Signal recovery through statistical features This section introduces the general types of problems we will consider in this thesis. Given a deterministic signal x ∈ L2 (Rd ) or a stochastic process X(u) : u ∈ Rd , our goal is to find a representation Φ(x) or Φ(X) that is translation invariant, i.e., Φ(Lc x) = Φ(x), or Φ(Lc X) = Φ(X), ∀c ∈ Rd where Lc x(u) = x(u − c) or Lc X(u) = X(u − c). Typically we are interested in when Φ is a CNN representation or a scattering representation. For a deterministic signal x, we explore under what conditions can we recover x from Φ up to translation, i.e., if Φ(x) = Φ(x0 ), then x0 = Lc x. This aims to recover the exact signal up to translation. Chapter 3 explores this issue for 1D deterministic sparse signals. Similarly, Chapter 6 explores 2D deterministic line texture images and frame-like texture images. For a stochastic process X, we think of Φ as statistics, i.e., Φ(X) = E[Φ̃(X)] for some function Φ̃. When X is ergodic, Φ(X) is usually estimated from a sample x ∼ X: Z 1 Φ(X) ≈ Φ̃(x) (2T )d [−T,T ]d Given a sample x of X, we explore under what conditions can we generate another sample x0 of X from Φ̃, i.e., Z Z 1 1 if 0 Φ̃(x ) = Φ̃(x), then x0 ∼ X. (2T )d [−T,T ]d (2T )d [−T,T ]d Chapter 4 explores the problem for X being a Poisson point process and we try to generate new samples by learning the distribution of Φ(X). Chapter 5 suppose X to be a class of texture images and we try to generate new samples x0 ∼ X given an estimate of Φ(X) from one sample texture image x ∼ X. 18 CHAPTER 3 CHARACTERIZING SPARSE SIGNALS THROUGH A HYBRID SCATTERING TRANSFORM 3.1 Introduction As reviewed in Section 2.6, the scattering transform is a mathematical model for CNNs that is invariant to translation, stable to diffeomorphism and captures high frequency informa- tion. These properties are essential for signal analysis as signals with translations and small deformation are usually from the same class and high frequency information is important to distinguish signals from one class to another. On the other hand, filters learned in the early layers of CNNs typically resemble wavelets. But it usually requires a large dataset to learn a CNN and it remains unclear why such models work well for different types of signals, especially natural images. 
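Chapters 3 and 5 both instantiate the recovery template of Section 2.7: estimate translation-invariant statistics from a given signal, then search for a new signal whose statistics match. A generic, minimal PyTorch sketch of such a matching loop is shown below; it is not the exact algorithm of any chapter, and the statistics Φ̃ used here (mean moduli of responses to a few random zero-mean filters standing in for wavelets), the signal length, and the optimizer settings are all placeholders.

```python
import torch
import torch.nn.functional as F

def phi(x, filters):
    """Translation-invariant statistics: spatial mean of |x * psi| for each filter psi."""
    resp = F.conv1d(x[None, None, :], filters[:, None, :], padding="same")
    return resp.abs().mean(dim=-1).flatten()

torch.manual_seed(0)
filters = torch.randn(4, 33)                             # hypothetical filter bank
filters = filters - filters.mean(dim=1, keepdim=True)    # zero average, like wavelets

x_target = torch.zeros(256)                              # "given" sparse signal
x_target[[40, 90, 200]] = torch.tensor([1.0, -0.5, 2.0])
target_stats = phi(x_target, filters)

x_new = torch.randn(256, requires_grad=True)             # random initialization
opt = torch.optim.Adam([x_new], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    loss = ((phi(x_new, filters) - target_stats) ** 2).sum()
    loss.backward()
    opt.step()
# x_new now approximately matches the statistics of x_target; whether it is close to a
# translate or reflection of x_target depends on how much information phi retains,
# which is exactly the completeness question studied in this chapter.
```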
The scattering transform and CNN both show great performance in image classification, proving their ability to learn essential representations and distinguish different types of signals. Another interesting task from machine learning is to understand how different these representations are with dissimilar signals. This is referred to as the completeness of a model in machine learning. A more complicated task from this topic is to find representations that can be used to recover a signal. Intuitively this requires to learn all important information contained in a signal. Motivated by the structure of the scattering transform and CNNs, we propose a two-layer hybrid scattering model to capture signals with isolated singularities in this chapter. In the first layer, we apply a wavelet transform to sparsify the signal, while in the second layer, we use a Gabor type filter to leverage this sparsity. More specifically, • We propose a new model to capture the singularities of the input signal. • We prove the Gabor measurements used in the second layer determine the locations 19 and heights of a sparse signal. • We provide an algorithm that successfully synthesizes 1D sparse signals using these measurements up to translation and reflection. In Section 3.2 we provide our model in detail, while in Section 3.3 we show our main theorems and proofs. Section 3.4 describes our algorithm for sparse signal synthesis and Section 3.5 shows synthesis results. Section 3.6 summarizes our conclusions. 3.2 Model Let y(t) with t ∈ R be a piecewise polynomial whose knots {ui }ki=1 , ui < ui+1 , are located on the grid hZ = {hn : n ∈ Z} for some h > 0:  X k  pi (t) t ∈ [ui , ui+1 ]  y(t) = yi (t), where yi (t) = i=0  0  elsewhere where {pi }ki=0 are a group of polynomials satisfying pi (ui+1 ) = pi+1 (ui+1 ), ∀i ∈ {0, 1, · · · k−1}. We also assume that each of the piecewise polynomial components yi (t) has degree mi at most m, i.e., 0 ≤ mi ≤ m. Let ψ be a mother wavelet with supp(ψ) ⊆ [−1, 1] that has m + 1 vanishing moments, i.e., Z +∞ tk ψ(t)dt = 0, ∀0 ≤ k ≤ m. −∞ Since ψ has m + 1 vanishing moments, one can show pi ∗ ψ(t) = 0. Let ψ` (t) be the dilated wavelet:   1 t ψ` (t) = ` ψ . 2 2` We have supp(ψ` ) ⊆ [−2` , 2` ]. Figure 3.1 gives an example of a wavelet with 4 vanishing moments. Therefore we have y ∗ ψ` (t) = 0 for t ∈ / ki=1 [ui − 2` , ui + 2` ] as {ui }i are the S singularities of y. Equivalently supp(ψ` ∗ y) is contained in ki=1 [ui − 2` , ui + 2` ]. To further S promote sparsity, we next apply a max-pooling operator:  z(t) if z(t) = maxt0 ∈[ti −2` ,ti +2` ]∩hZ z(t0 )   M P` z(t) = . otherwise  0  20 Figure 3.1 A wavelet ψ of 4 vanishing moments. (a) piecewise polynomial y(t) (b) Convolution against wavelet (c) Sparse Signal from Max Pool- with highest order of 3 ψ` ing Figure 3.2 Wavelets sparsify piecewise polynomials on the interval [0, 1024]. Figure 3.2 shows the process of sparsifying a piecewise polynomial into a sparse signal with wavelet ψ` . As summarized in the following theorem, this yields a linear combination of Dirac delta functions. Theorem 1. Assume that 2`+1 ≤ min1≤i≤k−1 |ui − ui+1 |. Then, X k M P` (|ψ` ∗ y|)(t) = aj δvj (t). j=1 for some a1 , . . . , ak > 0, vj ∈ [uj − 2` , uj + 2` ], 1 ≤ j ≤ k. In our second layer, rather than use another wavelet, we use a Gabor filter   t gs,ξ (t) = w eiξt , (3.1) s where the parameters s and ξ determine the scale and central frequency of the filter and the window function w is supported on an interval of unit length. 
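For concreteness, here is one way to discretize this filter with the indicator window w = 1_{[0,1]} (panel (a) of Figure 3.3; the Gaussian window of panel (b) is the obvious alternative); the grid length and the particular values of s and ξ are arbitrary, and signals are treated as periodic, as in the numerical experiments later in this chapter.

```python
import numpy as np

def gabor(n, s, xi):
    """Discretization of g_{s,xi}(t) = w(t/s) exp(i xi t) with w = 1_{[0,1]},
    so the filter is supported on [0, s)."""
    t = np.arange(n)
    return np.where(t < s, np.exp(1j * xi * t), 0.0)

def second_layer_response(x, s, xi):
    """Circular convolution g_{s,xi} * x, computed with the FFT."""
    g = gabor(len(x), s, xi)
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(g))

# a sparse first-layer output: a few Dirac spikes on a grid of length 128
x = np.zeros(128)
x[[17, 42, 90]] = [1.0, 2.0, 0.5]
resp = second_layer_response(x, s=16, xi=np.pi / 8)
print(resp.shape)   # (128,)
```

The translation-invariant measurements introduced below are simply Lᵖ norms of such responses, taken over a collection of scales s and frequencies ξ of the Gabor filter g_{s,ξ}.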
It differs from a wavelet in that the scale and frequency are separated in a gabor filter, giving more flexibility to adjust 21 (a) Indicator function window (b) Gaussian window Figure 3.3 Gabor filters used in the second layer (real parts) these parameters independently. Note that with an appropriately chosen window function w, Equation 3.1 includes dyadic wavelet families in the case that one selects s = 2` and |ξ| = C/s . However, it also includes many other families of filters, including Gabor filters used in the windowed Fourier transform. Next, we take the Lp norm for some integer p ≥ 1. As a result, we obtain translation invariant hybrid scattering coefficients {kgs,ξ ∗ M P` (|ψ` ∗ y|)kp }s,ξ . By design, these measurements are invariant to translations, reflections, and global sign changes. We aim to investigate the ability of our measurements to characterize y up to these natural ambiguities. The wavelet-modulus is known to be a powerful signal descriptor [31]. Therefore, in light of Theorem 1, we shall analyze the ability of the measurements {kgs,ξ ∗ xkp }s,ξ (3.2) to characterize sparse signals of the form X k x(t) = aj δvj (t). (3.3) j=1 Furthermore, to supplement our theory, we show that the measurements (3.2) can be used to reconstruct a sparse signal of the form (3.3) up to translations and reflections (see Figure 3.2). 22 3.3 Theory and proofs Let X k x(t) = aj δvj (t), v1 < v2 < · · · < vk . (3.4) j=1 We first consider the support set {vj }kj=1 , and let D(x) = {∆i,j = vj − vi : i < j} denote the difference set as the pairwise distances of the Dirac locations {vj }kj=1 . Let fx,ξ (s) = kgs,ξ ∗ xkpp be the gabor measurements. The following theorem shows the singularities of the measure- ments as a function of the scale s are contained in the pairwise distance set D(x). Theorem 2. Let p ≥ 1 be an integer, and assume w(t) = 1[0,1] (t). For i ≤ j, let j X βi,j (ξ) = a` eiξ∆i,` (3.5) `=i Then, for every fixed ξ ∈ R, the function fx,ξ (s) is piecewise linear, and ∂s2 fx,ξ (s) is a grid-free sparse signal whose support is contained in D(x). Specifically,   X X ∂s2 fx,ξ (s) =  ci,j (ξ) δd , d∈D(x) ∆i,j =d where ci,i+1 (ξ) = |βi,i+1 (ξ)|p − |βi+1,i+1 (ξ)|p − |βi,i (ξ)|p (3.6) and ci,j (ξ) = |βi,j (ξ)|p + |βi+1,j−1 (ξ)|p − |βi+1,j (ξ)|p − |βi,j−1 (ξ)|p for j ≥ i + 2. (3.7) Proof. We first note that Xk |(gs,ξ ∗ x)(t)| = ai gs,ξ (t − vi ) i=1 Xk = ai eiξ(t−vi ) 1[vi ,vi +s] (t) i=1 Xk = ai e−iξvi 1[vi ,vi +s] (t) . i=1 23 For every subset I ⊆ {1, . . . , k}, let RI (s) = {t : t ∈ [vi , vi + s] for all i ∈ I, t ∈ / [vi , vi + s] for all i ∈ / I}, i.e. let RI (s) be the set of t for which ai e−iξvi 1[vi ,vi +s] (t) is nonzero if and only if i ∈ I. Then, since w(t) = 1[0,1] (t) it is clear that for t ∈ RI , X |(gs,ξ ∗ x)(t)| = ai e−iξvi := yI (ξ). i∈I Therefore, X fx,ξ (s) = k(gs,ξ ∗ x)(t)kpp = |yI (ξ)|p |RI (s)|. (3.8) I⊆{1,...k} We will show that for all I ⊆ {1, . . . , k}, |RI (s)| is a piecewise linear function as a function whose Dirac locations are contained in D(x). First, we note that RI (s) = ∅ unless I has the form {i, i + 1, . . . , j − 1, j} for some i ≤ j. Therefore, Xk X k fs (ξ) = |βi,j (ξ)|p |Ri,j (s)|, (3.9) i=1 j=i where, as in (3.5), βi,j (ξ) is given by j j X X |βi,j (ξ)| = a` e iξ∆i,` = a` eiξv` , `=i `=i and Ri,j := R{i,...,j} . Now, turning our attention to Ri,j (s), we observe by definition that a point t is in Ri,j (s) if and only if it satisfies the following three conditions: v` ≤t ≤ v` + s for all i ≤ ` ≤ j, t > vi−1 + s, and t < vj+1 . 
Therefore, letting (a ∧ b) := max{a, b} and (a ∨ b) := min{a, b}, we see Ri,j (s) = [xj , xi + s] ∩ [xi−1 + s, xj+1 ] = [xj ∨ (xi−1 + s), (xi + s) ∧ xj+1 ], (3.10) 24 and |Ri,j (s)| = ((xi + s) ∧ xj+1 ) − (xj ∨ (xi−1 + s)), if the above quantity is positive and |Ri,j (s)| = 0 otherwise. It follows from (3.10) that |Ri,j (s)| is a piecewise linear function, and that ∂s2 |Ri,j (s)| is given by ∂s2 |Ri,j(S) | = δ∆i,j (s) + δ∆i−1,j+1 (s) − δ∆i−1,j (s) − δ∆i,j+1 (s). (3.11) In order for this equation to be valid for all 1 ≤ i < j ≤ k, we identify x0 and xk+1 with −∞ and ∞, and therefore, δ∆0,j δ∆i−1,k+1 are interpreted as being the zero function since the domain of f is (0, ∞). Likewise δ∆i,i = δ0 is interpreted as the zero function in the above equation. Combining (3.11) with (3.9) implies that ∂s2 fx,ξ (s) is a sparse signal with support con- tained in D(x), and for d ∈ D(x), X ∂s2 fx,ξ (d) = ci,j (ξ) ∆i,j =d as desired. The following example shows that, in general, the support of ∂s2 fx,ξ (s) may be a proper subset of D(x). Example 1. If p = 2 and x(t) = δ1 (t) + δ2 (t) + δ3 (t) − δ4 (t), then 2 ∈ D(x), but ∂s2 fξ (2) = 0. Proof. For this choice of x, there are two pairs (i, j) such that ∆i,j = 2, namely (1, 3) and (2, 4). Therefore, by Theorem 2, ∂s2 fξ (2) = |β1,3 (ξ)|2 + |β2,2 (ξ)|2 − |β1,2 (ξ)|2 − |β2,3 (ξ)|2 +  |β2,4 (ξ)|2 + |β3,3 (ξ)|2 − |β2,3 (ξ)|2 − |β3,4 (ξ)|2 .  25 Inserting (a1 , a2 , a3 , a4 ) = (1, 1, 1, −1), ∆i,i+1 = 1, and ∆i,i+2 = 2 into (3.5) implies that ∂s2 fξ (2) = |1 + eiξ + e2iξ |2 + 1 − |1 + eiξ |2 − |1 + eiξ |2 +  |1 + eiξ − e2iξ |2 + 1 − |1 + eiξ |2 − |1 − eiξ |2  =|1 + eiξ + e2iξ |2 + |1 + eiξ − e2iξ |2 + 2 − 3|1 + eiξ |2 − |1 − eiξ |2 =0. The last inequality follows from repeatedly applying the the trigonometric identities sin2 (θ)+ cos2 (θ) = 1 and cos(θ) = cos(2θ) cos(θ) + sin(2θ) sin(θ). As illustrated by Example 1, the reason why ∂s2 fx,ξ (2) is equal to zero is because x is not collision free, which is defined as follows. A signal x is collision free if |vi − vj | = 6 |vi0 − vj 0 | unless (i, j) = (i0 , j 0 ). With this assumption, it is known [32] that the support set {vj }kj=1 is determined (up to reflection and translation) by D(x). Specifically in the above example, there are two different pairs of points (1, 3) and (2, 4) in the support set of x that are both distance two from each other and c1,3 (ξ) = −c2,4 (ξ). When x is collision free, this cancellation cannot occur, and as guaranteed by the following theorem the support set of ∂s2 fx,ξ (s) will exactly be D(x). Therefore, our measurements with sufficiently many scales and only one frequency completely characterize the support set of a sparse signal up to translation and reflection. Theorem 3. Assume that x is a collision-free k-sparse signal as in Equation (3.3) and that p ≥ 1 is an integer. Then, for almost every ξ, ∂s2 fx,ξ is a sparse signal whose support is exactly equal to D(x). In order to prove Theorem 3, we will introduce a class of functions which we call Gen- eralized Exponential Laurent Polynomials and state several lemmas about these functions. For the proof of the lemmas in this section, please see the appendix. 26 We call a function q(θ) a Generalized Exponential Laurent Polynomial if it can be written as X N q(θ) = αk eiγk θ , θ ∈ [0, 2π) (3.12) k=1 where N ≥ 1, αk , γk ∈ R, αk 6= 0, and γ1 < γ2 < . . . < γN . We let E be set of all such functions. For q ∈ E, we refer to γN as the degree of q and let E(d) refer to the set of all q ∈ E with degree d. 
Note that we do not assume that the γk ’s are nonnegative or even rational. Therefore, the degree of q may be negative. We let E0 (d) denote the set of q ∈ E(d) such that γ1 ≥ 0. Lemma 1. Let q, q 0 ∈ E, N N 0 X X q(θ) = αk e iγk θ and q (θ) = 0 βk eiηk θ . k=1 k=1 Then q = q 0 if and only if N = N 0 and for all k = 1, . . . , N , αk = βk and γk = ηk . Lemma 1 implies that if q ∈ E(d1 ) and q 0 ∈ E(d2 ) then qq 0 ∈ E(d1 + d2 ). (3.13) In particular, if q ∈ E0 (d) |q|2 = q q̄ ∈ E(d + 0) = E(d). (3.14) Furthermore, if d2 ≤ d1 then (q + q 0 ) ∈ E(d1 ), (3.15) except, of course, if d1 = d2 and the lead coefficients of q and q 0 are negatives of one another. Lemma 2. Let p be an integer. For i = 1, 2, 3, 4, let qi ∈ E0 (di ) assume that d1 > d2 , d3 , d4 . Then the set of points such that |q1 |p + |q2 |p = |q3 |p + |q4 |p (3.16) has measure zero. 27 Proof of Theorem 3. Under the assumption that x(t) is collision free, it suffices to show that for all i ≤ j, ci,j (ξ) 6= 0 for almost every ξ ∈ R, where as in (3.6) and (3.7), ci,i+1 (ξ) = |βi,i+1 (ξ)|p − |βi+1,i+1 (ξ)|p − |βi,i (ξ)|p , ci,j (ξ) = |βi,j (ξ)|p + |βi+1,j−1 (ξ)|p − |βi+1,j (ξ)|p − |βi,j−1 (ξ)|p for j ≥ i + 2, and j X βi,j (ξ) = al eiξ∆i,l . l=i Observe that βi,j is a generalized exponential Laurent polynomial of the form introduced in Equation 3.12, with degree ∆i,j . Therefore, when j ≥ i + 2, it follows from Lemma 2 that ci,j vanishes on a set of measure zero since ci,j (ξ) = 0 implies |βi,j (ξ)|p + |βi+1,j−1 (ξ)|p = |βi+1,j (ξ)|p + |βi,j−1 (ξ)|p . In the case where j = i + 1, we see that ci,i+1 (ξ) = |ai + ai+1 e−iξ∆i,i+1 |p − |ai |p − |ai+1 |p , For any ξ such that ci,i+1 (ξ) = 0, we see that ξ∆i,i+1 is a solution to 2 ai + ai+1 eiθ − (|ai |p + |ai+1 |p )2/p = 0. Therefore, ci,i+1 (ξ) vanishes on a set of measure zero because the left-hand side of the above equation is a trigonometric polynomial. Theorem 3 proves our measurements on a sparse signal x with one frequency and enough scales locate the positions of diracs, i.e., the support of x, up to translation and reflection. The strategy of selecting the scales is described in Section 3.4. The following theorems shows with enough frequencies, our measurements also charac- terize the heights {ai }i of x. Moreover, if the moments p is even, the number of frequencies needed can be reduced. The proofs of the following theorems are described in appendix. 28 Theorem 4. Let p ≥ 1 be an odd integer, let Xk x(t) = aj δvj (t) j=1 be a collision-free sparse signal, and let ~a = (a1 , . . . , ak ). Let ξ1 , . . . , ξL be L frequencies chosen independently at random from some probability distribution which is absolutely continuous with respect to the Lebesgue measure for some L ≥ 4p + 2. Then, with probability one, the following uniqueness result is true. Let Xk y(t) = bj δuj (t) (3.17) j=1 be any other sparse signal such that D(y) = D(x), and let ~b = (b1 , . . . , bk ). If ∂s2 fx,ξ` (d) = ∂s2 fy,ξ` (d) for all d ∈ D(x) and all 1 ≤ ` ≤ L and ki=1 |bi |p = ki=1 |ai |p , then ~b = ±~a and P P therefore y(t) is equivalent to ±x(t) up to translation and reflection. Theorem 5. Let p = 2p0 be an even integer, and let Xk x(t) = aj δvj (t) j=1 be a collision-free sparse signal, and let ~a = (a1 , . . . , ak ). Let ξ1 , . . . , ξL be L frequencies chosen independently at random from some probability distribution which is absolutely continuous with respect to the Lebesgue measure for some L ≥ p + 2. Then, with probability one, the following uniqueness result is true. 
Let Xk y(t) = bj δuj (t) (3.18) j=1 be any other sparse signal such that D(y) = D(x), and let ~b = (b1 , . . . , bk ). If ∂s2 fx,ξ` (d) = ∂s2 fx,ξ` (d) for all d ∈ D(x) and all 1 ≤ ` ≤ L and ki=1 |bi |p = ki=1 |ai |p , then ~b = ±~a and P P therefore y(t) is equivalent to ±x(t) up to translation and reflection. 3.4 Algorithm Theorems 3, 4 and 5 shows our measurements characterize the support of a sparse signal x and the heights of the diracs. In this section, we describe our algorithm that uses such 29 measurements to recover the signal (up to translation and reflection). Let x be a sparse signal that is collision free, X k x= aj δvj (t) j=1 Let fx,ξ (s) = kgs,ξ ∗ xkpp be the measurements. Let Λ = {(ξj , sj )}j be the set of frequencies and scales we choose to define the gabor filters. The strategy to choose different scales is discussed in the next paragraph. With two signals x and x0 , we define the loss between them as: X `(x, x0 ) = (fx,ξ (s) − fx0 ,ξ (s))2 (ξ,s)∈Λ Given a target signal x, our goal is to find a new signal such that x∗ = arg min `(x, x0 ) (3.19) x0 For simplicity, we choose p = 1 in our algorithm for synthesis. Now we present a strategy of choosing the right set of scales. Let D(x) = {∆ij = |vi −vj |} be the pairwise distance between the spikes of x. Since x is collision free, we know ∆ij 6= ∆i0 j 0 unless (i, j) = (i0 , j 0 ) except for ∆ii = 0, ∀i. Therefore, there are k(k−1)/2+1 unique elements in D(x). Without loss of generality, we suppose 0 = d0 < d1 < · · · < dk(k−1)/2 , di ∈ D(x), ∀i in the following context. We also assume x is periodic in numerical experiments. Therefore dk(k−1)/2 ≤ n 2 . As stated in Theorem 3, fx,ξ (s) is a piecewise linear function of s and the singularities locate exactly at D(x). When choosing the scales to compute the scattering statistics, we need to ensure there is at least one scale between di and di+1 , ∀i. Therefore, we compute the minimum pairwise distance of D(x), i.e., minDD = min |di − di+1 |, 0≤i 0 such that for any Borel set B, the family of sets B + T ei = {b + T ei : b ∈ B} satisfies d Y (B) = Y (B + T ei ), ∀1 ≤ i ≤ d, where {ei }i≤d is the standard orthonormal basis for Rd . In this case one can verify, by d approximating gγ with simple functions, that (gγ ∗ Y )(t + T ei ) = (gγ ∗ Y )(t), and therefore S[γ, p]Y (t + T ei ) = S[γ, p]Y (t), ∀ t ∈ Rd . Thus the limit in (4.3) exists, and Z 1 SY (γ, p) = d E[|gγ ∗ Y (t)|p ] dt . (4.4) T QT 36 Note that in the special case when the distribution of Y (B) depends only on the Lebesgue measure of B, then S[γ, p]Y (t) is independent of t and the above limit (4.3) exists with SY (γ, p) = S[γ, p]Y (t) for any t ∈ Rd . First-order scattering moments compute summary statistics of the measure Y based upon its responses against the filters gγ . Higher-order summary statistics can be obtained by computing first-order scattering moments for larger powers p, or by cascading lower-order modulus nonlinearities as in a CNN. This leads us to define second-order scattering moments by h 0 i S[γ, p, γ 0 , p0 ]Y (t) = E ||gγ ∗ Y |p ∗ gγ 0 (t)|p . First-order invariant scattering moments collapse additional information by aggregating the variations of the random measure Y , which removes information related to the intermittency of Y . Second-order invariant scattering moments augment first-order scattering moments by iterating on the cascade of linear filtering operations and nonlinear | · |p operators, thus recovering some of this lost information. 
They are defined (assuming the limit on the right exists) by Z 0 0 1 h p p0 i SY (γ, p, γ , p ) = lim E ||gγ ∗ Y | ∗ gγ 0 (t)| dt . R→∞ (2R)d |ti |≤R The collection of (invariant) scattering moments is a set of non-parametric statistical measurements of the random measure Y . In the following sections, we analyze these moments for arbitrary frequencies ξ and small scales s, thus allowing the filters gγ to serve as a model for the learned filters in CNNs. In particular, we will analyze the asymptotic behavior of the scattering moments as the scale parameter s decreases to zero. 4.3 First-Order Scattering Moments of Generalized Poisson Pro- cesses We consider the case where Y (dt) is an inhomogeneous, compound spatial Poisson point process. Such processes generalize ordinary Poisson point processes by incorporating variable charges (heights) at the points of the process and a non-uniform intensity for the locations 37 of the points. They thus provide a flexible family of point processes that can be used to model many different phenomena. In this section we consider first-order scattering moments of these generalized Poisson processes. In Sec. 4.3.1 we provide a review of such processes, and in Sec. 4.3.2 we show that first-order scattering moments capture a significant amount of statistical information related these processes, particularly when using very localized filters. 4.3.1 Inhomogeneous, Compound Spatial Poisson Point Processes Let λ(t) be a continuous function on Rd such that 0 < λmin := inf λ(t) ≤ kλk∞ < ∞ , (4.5) t and let N (dt) be an inhomogeneous Poisson point process with intensity function λ(t). That is, X∞ N (dt) = δtj (dt) j=1 is a random measure, concentrated on a countable set of points {tj }∞ j=1 , such that for all Borel sets B ⊂ Rd , the number of points of N in B, denoted N (B), is a Poisson random variable with parameter Z Λ(B) = λ(t) dt , (4.6) B i.e., (Λ(B))n P[N (B) = n] = e−Λ(B) , n! and N (B) is independent of N (B 0 ) for all sets B 0 that do not intersect B. Now let (Aj )∞ j=1 be a sequence of i.i.d. random variables independent of N , and let Y (dt) be the random signed measure that gives charge Aj to each point tj of N , i.e., X∞ Y (dt) = Aj δtj (dt) . (4.7) j=1 We refer to Y (dt) as an inhomogeneous, compound Poisson point process. For a Borel set B ⊂ Rd , Y (B) has a compound Poisson distribution and we will (in a slight abuse of 38 notation) write N (B) X Y (B) = Aj . j=1 In many of our proofs, it will be convenient to consider the random measure |Y |p (dt) defined formally by X∞ |Y |p (dt) := |Aj |p δtj (dt). j=1 For a further overview of these processes, and closely related marked point processes, we refer the reader to Section 6.4 of [42]. 4.3.2 First-order Scattering Asymptotics Computing the convolution of gγ with Y (dt) gives Z X∞ (gγ ∗ Y )(t) = gγ (t − u) Y (du) = Aj gγ (t − tj ) , Rd j=1 which can be interpreted as a waveform gγ emitting from each location tj . Invariant scattering moments aggregate the random interference patterns in |gγ ∗ Y |. The results below show that the expectation of these interferences, for small scale waveforms gγ , encode important statistical information related to the point process. For notational convenience, we let Z d  Λs (t) := Λ [t − s, t] = λ(u) du [t−s,t]d denote the expected number of points of N in the support of gγ (t − ·). 
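Before stating the asymptotic results, the following one-dimensional sketch (illustrative only, not the code used for the experiments of Sec. 4.6) estimates S[γ, p]Y(t_0) by Monte Carlo for an inhomogeneous, compound Poisson process with Gaussian charges, using the indicator window w = 1_[0,1] and thinning to simulate the points; the sinusoidal intensity mirrors the one used in the numerical illustrations below. As the scale shrinks, the normalized moment S[γ, p]Y(t_0)/s approaches λ(t_0)E[|A_1|^p], which is the behavior made precise in this section.

import numpy as np

rng = np.random.default_rng(0)

def scattering_moment(t0, s, xi, p, lam, lam_max, charge_sampler, n_mc=20000):
    """Rough Monte Carlo estimate of S[gamma,p]Y(t0) = E|g_{s,xi} * Y(t0)|^p; with
    the window w = 1_[0,1], only points in [t0 - s, t0] contribute."""
    vals = np.empty(n_mc)
    for k in range(n_mc):
        # simulate the process on [t0 - s, t0] by thinning a rate-lam_max proposal
        n = rng.poisson(lam_max * s)
        pts = rng.uniform(t0 - s, t0, n)
        pts = pts[rng.uniform(0, 1, n) < lam(pts) / lam_max]
        chg = charge_sampler(len(pts))
        vals[k] = np.abs(np.sum(chg * np.exp(1j * xi * (t0 - pts)))) ** p
    return vals.mean()

T = 1000.0                                   # period of the intensity
lam = lambda t: 0.01 * (1 + 0.5 * np.sin(2 * np.pi * t / T))
lam_max = 0.015
charges = lambda n: rng.normal(0, 1, n)      # A_j ~ N(0,1), so E|A_1| = sqrt(2/pi)
t0, xi, p = T / 4, 3.0, 1

for s in (8.0, 4.0, 2.0, 1.0, 0.5):
    m = scattering_moment(t0, s, xi, p, lam, lam_max, charges)
    print(f"s = {s:4.1f}   S[gamma,p]Y(t0)/s = {m / s:.5f}")
print("small-scale prediction lam(t0)*E|A_1| =", lam(t0) * np.sqrt(2 / np.pi))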
If λ(t) is a periodic function in each coordinate with period T , then Λs (t) = Λs (t + T ei ) for 1 ≤ i ≤ d and therefore, the invariant scattering coefficients of Y may be defined as in (4.4). Theorem 6. Let 1 ≤ p < ∞ and suppose that Y (dt) is an inhomogeneous, compound Poisson point process as defined above, where (Aj )∞ j=1 is an i.i.d. sequence of random variables, E[|A1 |p ] < ∞ and λ(t) is a continuous intensity function satisfying (4.5). Then for every 39 t ∈ Rd , every γ = (s, ξ) such that sd kλk∞ < 1, and for every m ≥ 1. m " k p# k X (Λs (t)) X S[γ, p]Y (t) = e−Λs (t) E Aj w(Vj )eisξ·Vj + (m, s, ξ, t) , (4.8) k=1 k! j=1 where the error term (m, s, ξ, t) satisfies kλk∞ |(m, s, ξ, t)| ≤ Cm,p kwkpp E[|A1 |p ]kλkm+1 ∞ s d(m+1) (4.9) λmin and V1 , V2 , . . . is an i.i.d. sequence of random variables, independent of the Aj , taking values in the unit cube Q1 and with density sd pV (v) = λ(t − vs) , v ∈ Q1 . Λs (t) The main idea of the proof of Theorem 6 is to condition on N [t − s, t]d , which is the  number of points in the support of gγ , and to use the fact that  m+1  P N [t − s, t]d > m = O sd kλk∞ ∀ sd kλk∞ < 1 .    , Theorem 6 shows that even at small scales the scattering moments S[γ, p]Y (t) depend upon higher-order information related to the distribution of the points, encapsulated by the term (Λs (t))k , regardless of the scattering moment p. However, the influence of the higher-order terms diminishes rapidly as the scale of the filter shrinks, which is indicated by the bound (4.9) on the error function. Theorem 6 also shows that pth scattering moments depend on the pth moments of the charges, (Aj )∞ j=1 . The next result uses Theorem 6 to examine the behavior of scattering moments for small filters in the asymptotic regime as the scale s → 0. Theorem 7. Let 1 ≤ p < ∞, and suppose that Y (dt) is an inhomogeneous, compound Poisson point process satisfying the same assumptions as in Theorem 6. Let γk = (sk , ξk ) be a sequence of scale and frequency pairs such that limk→∞ sk = 0. Then S[γk , p]Y (t) lim d = λ(t)E[|A1 |p ]kwkpp . (4.10) k→∞ sk Furthermore, if λ(t) is periodic with period T along each coordinate, then Z SY (γk , p) 1 lim = m1 (λ)E[|A1 | ]kwkp , where m1 (λ) = d p p λ(t) dt . (4.11) k→∞ sdk T QT 40 Theorem 7 is proved via asymptotic analysis of the m = 1 case of Theorem 6. The key to the proof, which is similar to the technique used to prove Theorem 2.1 of [41], is that in a small cube [t − s, t]d there is at most one point of N with overwhelming probability. Therefore, when s is very small, with very high probability, |gγ ∗ Y |p (t) = (|gγ |p ∗ |Y |p ) (t). This theorem shows that for small scales the scattering moments S[γ, p]Y (t) encode the intensity function λ(t), up to factors depending upon the summary statistics of the charges (Aj )∞j=1 and the window w. Recall that Λ(B), defined in (4.6), determines the concentration of events within the set B. Thus even a one-layer location dependent scattering network yields considerable information regarding the underlying data generation, at least in the case of inhomogeneous Poisson processes. However, it is often the case, e.g., [43], that invariant statistics are utilized. In this case (4.11) shows that invariant scattering statistics mix the mean of λ(t) and the pth moment of the charge magnitudes. However, we can decouple these statistics as we now explain. 
As a special case, Theorem 7 proves that for non-compound inhomogeneous Poisson processes (i.e., Aj = 1 for all j ≥ 1), small scale scattering moments recover λ(t) or m1 (λ), depending on whether one computes invariant or time-dependent scattering moments. For compound processes, we can add an additional nonlinearity, namely the signum function sgn, which when applied to the Poisson point process in (4.7) yields, X ∞ Y (dt) = sgn[Y (dt)] = δtj (dt) . j=1 Thus computing SY (γ, p) and the ratio SY (γ, p)/SY (γ, p) at small scales decouples the mean of λ(t) from the pth moment of |A1 |. We remark that the signum function is a simple perceptron and is closely related to the sigmoid nonlinearity, which is used in many neural networks. We further remark that the computation of SY constitutes a small two-layer network, consisting of the nonlinear sgn function, the linear filtering by the collection of filters gγ , the nonlinear pth modulus | · |p , and the linear integration operator. If Y (dt) is a homogeneous Poisson process, then λ(t) is constant, meaning that (4.10) and 41 (4.11) are equivalent. In the case of ordinary (non-compound) Poisson processes, Theorem 7 recovers the constant intensity. For periodic λ(t) and invariant scattering moments, the effect of higher-order moments of λ(t) can be partially isolated by considering higher-order expansions (e.g., m > 1) in (4.8). The next theorem considers second-order expansions and illustrates their dependence on the second moment of λ(t). Theorem 8. Let 1 ≤ p < ∞, and suppose Y (dt) is an inhomogeneous, compound Poisson point process satisfying the same assumptions as in Theorem 6. If λ(t) is periodic with period T in each coordinate, and if (γk )k≥1 = (sk , ξk )k≥1 , is a sequence of scale and frequency pairs such that limk→∞ sk = 0 and limk→∞ sk ξk = L ∈ Rd , then  Z  SY (γk , p) 1 Λsk (t) lim − d dt k→∞ s2d p p k E[|A1 | ]E[|Vk | ] T QT s2d k p  ! E A1 w(U1 )eiL·U1 + A2 w(U2 )eiL·U2 = m2 (λ) , (4.12) 2kwkpp E[|A1 |p ] where m2 (λ) = T −d QT λ(t)2 dt; U1 , U2 are independent uniform random variables on Q1 ; R and (Vk )k≥1 is a sequence of random variables independent of the Aj taking values in the unit cube Q1 and with respective densities, sdk pVk (v) = λ(t − vsk ) , v ∈ Q1 . Λsk (t) We first remark that the scale normalization on the left hand side of (4.12) is s−2d , compared to a normalization of s−d in Theorem 7. Thus even though (4.12) is written as a small scale limit, intuitively Theorem 8 is capturing information at moderately small scales that are larger than the scales considered in Theorem 7. This is further indicated by the term multiplied against m2 (λ) on the right hand side of (4.12), which depends on two points of the process (as indicated by the presence of two charges A1 and A2 ). Unlike Theorem 7, which gives a way to compute m1 (λ), Theorem 8 does not allow one to compute m2 (λ) since it would require knowledge of Λsk (t) in addition to the distribution from which the charges (Aj )∞ j=1 are drawn. However, Theorem 8 does show that at mod- erately small scales the invariant scattering coefficients depend non-trivially on the second 42 moment of λ(t). This behavior at moderately small scales can be used to distinguish be- tween, for example, an inhomogeneous Poisson point process with intensity function λ(t) and a homogeneous Poisson point process with constant intensity λ0 = m1 (λ), whereas Theorem 7 indicates that at very small scales the two processes will have the same invariant scattering moments. 
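The decoupling described above is easy to illustrate numerically. The following one-dimensional sketch (a grid-based approximation with an indicator window; the parameters are illustrative and not the thesis settings) estimates the invariant moments of a homogeneous Gaussian compound process Y and of the ordinary process obtained by applying sgn, and recovers λ_0 and E[|A_1|^2] separately from their ratio.

import numpy as np

rng = np.random.default_rng(1)

# At small scales, S Y(gamma,p)/s ~ lambda_0 * E|A_1|^p and S Ybar(gamma,p)/s ~ lambda_0,
# so the ratio isolates E|A_1|^p (here p = 2).
def invariant_moment(points, weights, s, xi, p, T, dt=0.1):
    """Grid approximation of the ergodic average of |g_{s,xi} * Y(t)|^p."""
    grid = np.zeros(int(T / dt) + 1, dtype=complex)
    np.add.at(grid, (points / dt).astype(int), weights * np.exp(-1j * xi * points))
    box = np.ones(max(int(round(s / dt)), 1))          # boxcar = indicator window
    conv = np.convolve(grid, box, mode="valid")
    return np.mean(np.abs(conv) ** p)

T, lam0, s, xi, p = 2.0e5, 0.01, 1.0, 3.0, 2
n = rng.poisson(lam0 * T)
points = rng.uniform(0, T, n)
charges = rng.normal(0, np.sqrt(np.pi / 2), n)          # E|A_1| = 1, E|A_1|^2 = pi/2

SY = invariant_moment(points, charges, s, xi, p, T)
SYbar = invariant_moment(points, np.ones(n), s, xi, p, T)   # Ybar = sgn[Y] = sum_j delta_{t_j}
print("SY/s    :", SY / s, "  target lam0*E|A|^2 =", lam0 * np.pi / 2)
print("SYbar/s :", SYbar / s, "  target lam0        =", lam0)
print("ratio   :", SY / SYbar, "  target E|A|^2      =", np.pi / 2)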
4.4 Second-Order Scattering Moments of Generalized Poisson Pro- cesses We prove that second-order scattering moments, in the small scale regime, encode higher- order moment information about the charges (Aj )∞ j=1 . Theorem 9. Let 1 ≤ p, p0 < ∞ and q = pp0 . Suppose that Y (dt) is an inhomogeneous Pois- son point process satisfying the same assumptions as in Theorem 6 as well as the additional assumption that E|A1 |q < ∞. Let γk = (sk , ξk ) and γk0 = (s0k , ξk0 ) be two sequences of scale and frequency pairs such that s0k = csk for some fixed constant c > 0 and limk→∞ sk ξk = L ∈ Rd . Then, S[γk , p, γk0 , p0 ]Y (t) lim d(p0 +1) = Kλ(t)E[|A1 |q ] , (4.13) k→∞ sk where p0 K := gc,L/c ∗ |g1,0 |p p0 , is a constant depending on p, p0 , c, L, and w. Furthermore, if λ(t) is periodic with period T along each coordinate, then SY (γk , p, γk0 , p0 ) lim d(p0 +1) = Km1 (λ)E[|A1 |q ] . (4.14) k→∞ sk 0 Note that the scaling factor s−d(p +1) depends on p0 but not p. Intuitively this corresponds 0 d(p0 +1) to the behavior k|gγk |p ∗ gγk0 kpp0 ≈ sk as sk → 0. Theorem 9 proves that second-order scattering moments capture higher-order moments of the charges (Aj )∞ j=1 via two pairs of lower-order filtering and modulus operators. If p, p0 > 1, then q = pp0 will be larger than 43 either p or p0 and the result above will give us information about the higher order moment E|A1 |q . It is also useful to consider the p = 1 case. Indeed, in Sec. 4.5 below it is shown that first- order invariant scattering moments can distinguish Poisson point processes from fractional Brownian motion and α-stable processes, if p = 1, but may fail to do so for larger values of p. However, Theorem 7 shows that first-order invariant scattering moments for p = 1 will not be able to distinguish between the various different types of Poisson point processes with a one-layer network at very small scales. Theorem 9 shows that a second-order calculation that augments the first-order calculation with p = 1 and p0 > 1, will capture a higher-order moment of the charges (Aj )∞ j=1 . 4.5 Poisson Point Processes Compared to Self Similar Processes For one-dimensional processes (i.e., d = 1), we show that first-order invariant scattering moments can distinguish between inhomogeneous, compound Poisson point processes and certain self-similar processes. In particular, we show that if X(t) is either an α-stable pro- cess or a fractional Brownian motion (fBM), then the corresponding first-order scattering moments will have different asymptotic behavior for infinitesimal scales than in the case of a Poisson point process. Similar results were initially reported in [41]; here we generalize those results to the non-wavelet filters gγ defined in (4.1) and for general pth scattering moments, and further clarify their usefulness in the context of the new results presented in Sec. 4.3 and Sec. 4.4. As in [41], the proof will be based on the scaling relationships of these processes and therefore will not be able to distinguish between α-stable processes and fBM1 . The key will be proving a lemma that says if a stochastic process X has a scaling relation, then that scaling relation is inherited by integrals of deterministic functions against dX. 
More precisely, for a stochastic process X(t), t ∈ R, we consider the convolution of the 1 We note that [41] proves that second-order scattering moments defined with wavelet filters do distinguish between α-stable processes and fBM, but we do not pursue this direction in this project as we are concerned primarily with point processes. 44 filter gγ with the noise dX defined by Z gγ ∗ dX(t) = gγ (t − u) dX(u) , R and define (in a slight abuse of notation) the first-order scattering moments at time t by S[γ, p]X(t) = E[|gγ ∗ dX(t)|p ] . (4.15) In the case where X(t) is a compound, inhomogeneous Poisson (counting) process, Y = dX will be a compound Poisson random measure and the scattering moments defined in (4.15) will coincide with the first-order scattering moments defined in (4.2). The following two theorems analyze the small scale first-order scattering moments when X is either an α-stable process, for 1 < α ≤ 2, or fractional Brownian motion. Thus dX will be stable Lévy noise or fractional Gaussian noise, respectively. These results show that the asymptotic decay of the corresponding scattering moments is guaranteed to differ from Poisson point processes, in the case p = 1. We also note that both α-stable processes and fBM have stationary increments; therefore the scattering moments do not depend on time and Z 1 S[γ, p]X(t) = SX(γ, p) = lim E[|gγ ∗ dX(u)|p ] du , ∀t ∈ R. R→∞ 2R |u|≤R Theorem 10. Let 1 ≤ p < ∞ and suppose X(t) is a symmetric α-stable process for some p < α ≤ 2. Let γk = (sk , ξk ) be a sequence of scale and frequency pairs such that limk→∞ sk = 0 and limk→∞ sk ξk = L ∈ R. Then, Z 1 p SX(γk , p) iLu lim p/α =E w(u)e dX(u) . k→∞ sk 0 Theorem 11. Let 1 ≤ p < ∞, suppose X(t) is a fractional Brownian motion with Hurst parameter 0 < H < 1. Assume that the window function w has bounded variation on [0, 1], and let γk = (sk , ξk ) be a sequence of scale and frequency pairs such that limk→∞ sk = 0 and limk→∞ sk ξk = L ∈ R. Then, Z 1 p SX(γk , p) iLu lim =E w(u)e dX(u) . k→∞ spH k 0 45 The key to proving Theorem 10 and Theorem 11 is the lemma stated in Appendix 7.3, which shows that if X(t) is a self-similar process, then, then stochastic integrals against dX satisfy an identity corresponding to the scaling relation of X(t). Together, these two theorems indicate that first-order invariant scattering moments dis- tinguish inhomogeneous, compound Poisson processes from both α-stable processes and frac- tional Brownian motion except in the cases where p = α or p = 1/H. In particular, if X is a Brownian motion, then SX will distinguish X from a Poisson point process except in the case that p = 2. For this reason, it appears that p = 1 is the best choice of the parameter p for the purposes of distinguishing a Poisson point process from a self-similar process. In the case of a multi-layer network, it is advisable to set p = 1. Larger values of p0 in the second layer can then allow us to determine the higher moments of the arrival heights (Aj )∞ j=1 . 4.6 Numerical Illustrations We carry out several experiments to numerically validate the previously stated results and to illustrate their capacity for distinguishing between different types of random processes. In all of the experirments below, we will hold the frequency ξ constant while we let the scale s decrease to zero. 4.6.1 Homogeneous, compound Poisson point processes with the same intensi- ties We generated three different types of homogeneous compound Poisson point processes, all with the same intensity λ(t) ≡ λ0 = 0.01. 
The three point processes are Y1 (ordinary), Y2 (Gaussian), and Y3 (Rademacher), where the charges are sampled according to (A_{1,j})_{j≥1} ≡ 1, (A_{2,j})_{j≥1} ∼ N(0, π/2), and (A_{3,j})_{j≥1} ∼ Rademacher distribution (i.e., ±1 with equal probability). The charges of the three signals have the same first moment, E[|A_{i,j}|] = 1, and different second moments, with E[|A_{1,j}|^2] = E[|A_{3,j}|^2] = 1 and E[|A_{2,j}|^2] = π/2. Theorem 7 thus predicts that p = 1 invariant first-order scattering moments will not be able to distinguish between the three processes, but p = 2 invariant first-order scattering moments will distinguish the Gaussian Poisson point process from the other two. Figure 4.1 illustrates this point by plotting the normalized invariant scattering moments for p = 1 and p = 2.

Figure 4.1 First-order invariant scattering moments for three types of homogeneous compound Poisson point processes with the same intensity λ_0. Left (top to bottom): ordinary Poisson point process; Gaussian compound Poisson point process with normally distributed charges; Rademacher compound Poisson point process with charges drawn from the Rademacher distribution. Middle: Normalized invariant scattering moments S_Y(s, ξ, 1)/(s‖w‖_1) (i.e., p = 1), which all converge to 0.01 as s → 0 (up to numerical errors) since λ_0 E[|A_1|] is the same for all three point processes. Right: Normalized invariant scattering moments S_Y(s, ξ, 2)/(s‖w‖_2^2) (i.e., p = 2). In this case the ordinary Poisson point process and the Rademacher Poisson point process still converge to the same value as s → 0, since E[|A_1|^2] = 1 for both of them, whereas the Gaussian Poisson point process converges to a different value, since E[|A_1|^2] = π/2 for this process.

4.6.2 Homogeneous, compound Poisson point processes with different intensities and charges

We consider two homogeneous, compound Poisson point processes with different intensities and different charge distributions, but which nevertheless have the same first-order invariant scattering moments with p = 1 due to the mixing of intensity and charge information in (4.11). The first compound Poisson point process has constant intensity λ_1 = 0.01 and charges A_{1,j} ∼ N(0, 1), whereas the second has intensity λ_2 = 0.01/√2 and charges A_{2,j} ∼ N(0, 2). In this way, λ_1 E[|A_{1,j}|] = λ_2 E[|A_{2,j}|] = 0.01 · √(2/π) ≈ 0.008, but λ_1 E[|A_{1,j}|^2] = 0.01 and λ_2 E[|A_{2,j}|^2] = 0.01 · √2 ≈ 0.014. Figure 4.2 plots the normalized invariant scattering moments for p = 1 and p = 2.

Figure 4.2 First-order invariant scattering moments for two homogeneous, Gaussian compound Poisson point processes with different intensity and variance. Left (top to bottom): homogeneous compound Poisson point process with intensity λ_1 = 0.01 and charges A_{1,j} ∼ N(0, 1); homogeneous compound Poisson point process with intensity λ_2 = 0.01/√2 and charges A_{2,j} ∼ N(0, 2). The two point processes are difficult to distinguish visually. Middle: Normalized invariant scattering moments S_Y(s, ξ, 1)/(s‖w‖_1) (i.e., p = 1), which both converge to approximately 0.008 up to numerical error, thus indicating that these moments cannot distinguish the two processes. Right: Normalized invariant scattering moments S_Y(s, ξ, 2)/(s‖w‖_2^2) (i.e., p = 2). The two processes are distinguished as s → 0, since the values λ_1 E[|A_{1,j}|^2] = 0.01 and λ_2 E[|A_{2,j}|^2] ≈ 0.014 differ by a significant margin.

4.6.3 Inhomogeneous, non-compound Poisson point processes

We also consider inhomogeneous Poisson point processes.
We use the intensity function λ(t) = 0.01(1 + 0.5 sin(2πt/N )) to generate inhomogeneous process. To estimate S[γ, p]Y (t), we average the modulus of the scattering transform at time t over 1000 realizations. Figure 4.3 plots the scattering moments for inhomogeneous process at different time. Figure 4.3 First-order invariant scattering moments for inhomogeneous non-compound Pois- son point processes. Left: Inhomogeneous non-compound Poisson point process with inten- sity λ(t) = 0.01(1 + 0.5 sin(2πt/N )). Right: Scattering moments S[γ,p]Y (t)/skwkpp for inhome- geneous non-compound Poisson point process at t1 = N/4, t2 = N/2, t3 = 3N/4. Note that λ(t1 ) = 0.015, λ(t2 ) = 0.01, λ(t3 ) = 0.005. The plots show that for inhomogeneous process, scattering coefficients at time t converges to the intensity at that time. 48 Figure 4.4 First-order invariant scattering moments for Brownian motion and Poisson point process. Left: Top: Brownian motion with Hurst parameter H = 1/2. Bottom: Ordi- nary Poisson point process. Middle: Normalized scattering moments for Brownian Motion (SXBM (s,ξ,1)/kwk2 E|Z|) and Poisson point process (SY√ poisson (s,ξ,1)/λE|A|kwk1 ) at p = 1. This shows the convergence rate of normalized scattering is s for Brownian motion and s for Pois- son process, indicating the 1st moment can distinguish Brownian motion and Poisson point process. Right: Normalized scattering moments for Brownian Motion (SXBM (s,ξ,2)/kwk22 ) and Poisson point process (SYpoisson (s,ξ,2)/kwk22 ) at p = 2. Both normalized scattering moments have convergence rate s, so the 2nd moment scattering cannot distinguish the two processes. 4.6.4 Homogeneous, non-compound Poisson point process and self similar pro- cess We consider Brownian motion with Hurst parameter H = 1 2 and compare it with Poisson point process with intensity λ = 0.01 and charges (A)∞ j=1 ≡ 10. Figure 4.4 shows that the 2nd moments cannot distinguish between Brownian motion and Poisson point process while the 1st moments can. 4.7 Scattering GAN We are also interested in the capacity of scattering moments on Poisson point processes. Ideally, we want to generate new signals through the scattering measurements and show the realizations are from the given Poisson point process. We are less interested in synthesizing one specific realization of the process but more interested in generating new realizations through the distribution of scattering moments. Therefore, instead of minimizing the `2 loss between scattering coefficients, which we did in Chapter 3, we use a GAN model to study the high dimensional distribution of the scattering moments. Figure 4.5 shows the structure of our model, where we insert a scattering propagator S between the generator G and the 49 Figure 4.5 Scattering-GAN to study the capacity of scattering moments on Poisson point process. Similar to ordinary GAN, the generator takes in a random vector z and generates fake data y 0 . Also, the discriminator aims to distinguish fake representations from real. By inserting a scattering module between G and D, the discriminator tries to distinguish S(y 0 ) from S(y). When the model trains successfully, S(y 0 ) has the same distribution as S(y). By checking the similarity between y 0 and y, we learn the capacity of scattering moments. discriminator D. Given realizations {yi }i from a Poisson point process Y , suppose their scattering moments {S(yi )}i are samples from an unknown high dimensional distribution PS . 
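A compact PyTorch sketch of this architecture is given below. The fixed one-layer scattering module, the fully connected generator and discriminator, and the optimization details are illustrative assumptions rather than the exact configuration used in our experiment; the sigmoid output layer matches the generator used in the experiment described below.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Scattering1d(nn.Module):
    """Fixed first-order scattering: y -> mean_t |y * g_scale| over a few scales."""
    def __init__(self, scales=(2, 4, 8, 16, 32)):
        super().__init__()
        self.filters = []
        for s in scales:
            t = torch.arange(s, dtype=torch.float32) / s
            g = torch.stack([torch.cos(2 * math.pi * t), torch.sin(2 * math.pi * t)])
            self.filters.append(g.unsqueeze(1))        # (2, 1, s): real and imaginary parts
    def forward(self, y):                              # y: (batch, 1, n)
        feats = []
        for g in self.filters:
            r = F.conv1d(y, g.to(y.device))
            feats.append(torch.sqrt(r[:, 0] ** 2 + r[:, 1] ** 2 + 1e-12).mean(dim=-1))
        return torch.stack(feats, dim=1)               # (batch, n_scales)

n, latent, n_scales = 1024, 64, 5
G = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                  nn.Linear(256, n), nn.Sigmoid())     # generator ends with a sigmoid
D = nn.Sequential(nn.Linear(n_scales, 64), nn.ReLU(), nn.Linear(64, 1))
S = Scattering1d()
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def train_step(y_real):                                # y_real: (batch, 1, n)
    z = torch.randn(y_real.shape[0], latent)
    s_real, s_fake = S(y_real), S(G(z).unsqueeze(1))
    # discriminator: push real scattering moments toward 1, generated ones toward 0
    d_real, d_fake = D(s_real), D(s_fake.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # generator: make the scattering moments of y' indistinguishable from those of y
    g_logit = D(S(G(torch.randn_like(z)).unsqueeze(1)))
    g_loss = bce(g_logit, torch.ones_like(g_logit))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

Here train_step would be called on mini-batches of real realizations y_real, shaped (batch, 1, n) and scaled to [0, 1].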
In the scattering model, the discriminator tries to distinguish between the real scattering coefficients S(y_i) and fake ones S(y_i'), while the generator produces signals y_i' whose scattering coefficients S(y_i') match S(y_i). Figure 4.6 shows signals generated by the scattering GAN in a numerical experiment.

Figure 4.6 Generated signals through the scattering GAN. We use realizations of length n = 2^12 from a homogeneous, ordinary Poisson point process, i.e., λ(t) ≡ λ_0 and A_i ≡ 1, as training data. We use {2^{j/2}}_{j=0}^{22} as the scales for the filters and apply a one-layer scattering operator to compute the scattering moments. A sigmoid is applied at the last layer of the generator. The generated signals are sparse, although not as sparse as the training data. This is natural, since the sigmoid forces A_i' ∈ (0, 1), and thus E[A_i'] < E[A_i]. According to Theorem 6, λ_0' > λ_0, which we verified numerically.

4.8 Conclusion

We have constructed Gabor-filter scattering transforms for random measures on R^d and stochastic processes on R. Our construction is closely related to [41], but extends that work in several important ways. First, while our Gabor-type filters include dyadic wavelets as a special case, they also include many other families of filters. We also do not assume that the random measure Y is stationary, and we consider compound, possibly inhomogeneous, Poisson random measures on R^d in addition to ordinary Poisson processes on R. We do note, however, that [41] provides a detailed analysis of self-similar processes and multifractal random measures, whereas we have primarily focused on models of random sparse signals.

We believe the results presented here open up several avenues of future research. Firstly, we have assumed throughout most of this chapter that the points of our random measures are distributed according to a possibly inhomogeneous Poisson process. It would be interesting to discover whether our measurements can distinguish these signals from other point processes. Secondly, it would be interesting to explore the use of these measurements for a variety of machine learning tasks, such as synthesizing new signals. In the next chapter, we describe our work on using similar measurements for texture image synthesis.

CHAPTER 5 TEXTURE SYNTHESIS VIA PROJECTION ONTO MULTISCALE, MULTILAYER STATISTICS

5.1 Introduction

In the last chapter, we presented our work on generalized scattering transforms of stochastic processes. In practice, one class of signals that can be modeled by stochastic processes is texture images. This is motivated by the fact that texture images usually contain a type of random repetition of a potentially complex pattern. A natural task involving texture images, and one related to the completeness of a model, is the texture synthesis problem. This task asks one to generate new, perceptually accurate texture images given a limited (often single) realization of the texture class in question. Within the field of image generative models, the texture synthesis problem is appealing because it allows for types of statistical analysis that are not possible in general image generation. Recent works have proposed using generative adversarial networks (GANs) [44] to perform texture synthesis and related tasks [45, 46, 47, 48]. GANs have also been used to expand non-stationary texture images [49], demonstrating their ability to capture large-scale structures.
Classically, texture synthesis models fall into two categories [50]: (i) non-parametric patch rearrangement methods that extract microscopic patterns from the reference image and randomly arrange these patterns in a new image; and (ii) parametric statistic-matching models that extract a set of empirical statistics from the reference texture, and generate a new image by selecting a random image with a similar statistical profile. This chapter addresses the second type of model based on statistical matching. In [51, 52], the authors reproduce micro-textures by randomizing the phase of Fourier coefficients of the input texture. In some other works, the filtered responses of texture images are matched based on the maximum entropy principle [53, 54, 55, 56]. Such statistical models have two 53 challenges: (i) What is the set of statistics needed to characterize a large class of textures? and (ii) Given the statistical profile of a reference image, how does one generate a random image with the same statistical profile? Gatys and collaborators [3] had great success in addressing these two challenges by extracting the covariance statistics of the filter responses at various layers of the VGG-19 network [1], and then generating a new image with matching statistics via back-propagation and stochastic gradient descent, which is reviewed in Section 2.4. This work in turn inspired several subsequent methods, including [15, 57, 58, 59]. Despite the success of [3], though, the model is not perfect and many open questions remain for statistics-based texture synthesis models. Indeed, in a follow up paper [60], it is observed that high quality texture images can be synthesized using only a one-layer network with random, multiscale filters and rectified linear unit (ReLU) nonlinearity. The combina- tion of the two papers [3, 60] raises questions with respect to the trade-off between network depth and the sizes of the receptive fields of the filters in the network. Additionally, putting the use of random filters aside, the use of a single layer of multiscale filters parallels classical work in the field that uses the statistics of multiscale wavelet coefficients to synthesize tex- tures [61, 53, 2, 62]. Multi-scale CNN models are also designed to maintain high resolution in texture synthesis [63, 64]. Among these methods, the algorithm of Portilla and Simoncelli [2] is particularly notable for its use of statistics based on the modulus and phase of com- plex wavelet valued coefficients, in addition to its impressive performance which is often still bench-marked against today. In this chapter we propose a multiscale, multilayer, nonlinear feature extractor for images based upon real-valued wavelet transforms, which in turn yields a set of statistics for use in texture synthesis. In addition to drawing inspiration from Portilla and Simoncelli [2] and Gatys et al. [3], the model presented in this chapter also draws upon ideas from the wavelet scattering transform [10], which itself has shown good results for the synthesis of gray-scale textures [43]. Nevertheless, our algorithm has several novel aspects that we use to investigate the texture synthesis problem, and which provide insight into image feature extraction via 54 convolutional networks. More specifically: • We provide an analysis of the types of filters required to obtain good synthesis results when combined with the ReLU nonlinearity. • We investigate the trade-off between network depth and the maximum scale of the wavelet filters. 
• We propose a CNN architecture that uses the ReLU nonlinearity and is provably invertible at each layer, which in turn allows us to adapt the projection synthesis algorithm of [2] to our setting. • We demonstrate our theoretical findings numerically through example synthesized im- ages, and also compare our results to [2] and [3]. In Section 5.2 we present our statistical model in detail, while Section 5.3 describes our synthesis algorithm. Section 5.4 provides detailed numerical results, and Section 5.5 introduces implementation details. Section 5.6 contains a short conclusion. 5.2 Model Set T2 := [−T, T ]2 and R+ := [0, ∞), and let x : T2 → R+ be a texture image, which we shall assume is in L2 (T2 ). A statistics-based matching algorithm for texture synthesis specifies a family of (nonlinear) functions Uk : L2 (T2 ) → L2 (T2 ) and extracts a family of empirical statistics Sx from x based on Z 1 Sx = (Sk x)k , Sk x := Uk x(u) du . (2T )2 T2 A new texture y ∈ L2 (T2 ) is synthesized by drawing y from the set of images with similar statistical profiles: y ∼ Ix := {z ∈ L2 (T2 ) : kSz − Sxk ≤ ε} . (5.1) If x ∼ X, where (X(u))u∈T2 is a stochastic process, and if Uk X is stationary and ergodic, then Sk x → E[Uk X] as T → ∞. Thus, we can think of Sk x as approximating the statistics E[Uk X] of the unknown process that generated x. 55 The model (5.1) is appealing because the statistical profile Sx determines the texture class. It is thus paramount to determine a good set of functions (Uk )k , and hence statistics (Sk )k , the pursuit of which has ramifications in human and computer vision [65, 66, 67]. The method of Portilla and Simoncelli [2] defines the majority of their statistics by lever- aging a complex valued wavelet transform and extracting statistics from the modulus and phase of the resulting wavelet coefficients. The first layer of our model also uses multiscale wavelet filters, but they are real valued and we replace the modulus and phase nonlinearities with the ReLU nonlinearity. In Sections 5.2.1 and 5.2.2 we explain how the proper selection of such wavelet filters, though, when combined with ReLU, can distinguish between certain types of patterns in the same way that modulus and phase can. On the other hand, Gatys et al. [3] define their statistics using the Gram matrices of the filter responses at various layers in the VGG-19 network. The receptive field of the filters of the VGG-19 network are small, only 3 × 3 pixels, but the depth and pooling of the VGG-19 network allows such statistics to still capture complex multiscale patterns in texture images. Akin to the VGG network, in Section 5.2.3 we expand our set of functions Uk by computing a second wavelet transform and ReLU nonlinearity. Such a procedure is inspired by the wavelet scattering transform [10], but as we will describe differs from the scattering transform in several significant ways. 5.2.1 Wavelet filters Let x b(ω), for frequencies ω ∈ Ω := {πk/T : k ∈ Z2 } ⊂ R2 , denote the Fourier transform of x: x b(ω) := T2 x(u)e−iu·ω du. A wavelet ψ ∈ L2 (T2 ) is an oscillating waveform that is R localized in both space and frequency and has zero average. Inspired by previous work in wavelet based image processing, as well as recent analyses of the filters of the VGG network [68], we make use of three types of wavelets. Figure 5.1 shows the three wavelet families and we will define the wavelets in the following context. The first two are directional wavelet filters. 
We select one even directional filter and one 56 Figure 5.1 Wavelet families. Upper: Even directional wavelets in space and frequency (FFT). Middle: Odd directional wavelets in space and frequency (FFT). Lower: Omnidi- rectional wavelets in space and frequency (FFT). Each block shows the wavelet family with different scales and oscillations. 57 odd directional filter: ψ e (u) := g(u) cos(ξ · u) , ψ o (u) := g(u) sin(ξ · u) , where g is an even window function and ξ ∈ R2 is the central frequency of the wavelets. These wavelets oscillate in the direction ξ and have localized Fourier transforms around ξ and −ξ. These wavelets are rotated to obtain waveforms oscillating in different directions: ψθβ (u) := ψ β (Rθ−1 u) , β ∈ {e, o} , where Rθ ∈ SO(2) is the 2 × 2 rotation matrix about the angle θ ∈ [0, π). We use M angles θ ∈ ΘM := {mπ/M : 0 ≤ m < M }. The third type of wavelet is based on the polar coordinate representation u = (r, ϕ) ∈ [0, ∞) × [0, 2π), and oscillates along the angle parameter ϕ: ψ`p (u) := a` (u) cos(`ϕ) , where ` ∈ Z is the frequency of oscillation along the angle ϕ. If a` (u) = e a` (r), then the function a` determines the frequency of oscillation of the filter along the radial parameter. In this case, |ψbp (ω)| has an essential support in the shape of an annulus and the filter is omnidirectional. We restrict 0 ≤ ` < L and select a` (u) = e a` (r) to be an oscillatory function that oscillates at a frequency approximately proportional to L − 1 − `, ensuring that the overall frequency support of ψ`p is approximately fixed. Directional filters such as ψθe and ψθo are common in image processing and various analyses of CNN filters, e.g., [68], have shown that commonly used CNNs learn directional filters. In Section 5.2.2 we will motivate the seemingly redundant choice of using both an even and odd directional wavelet filter. By examining the filters of the VGG-19 network, though, one also finds omnidirectional filters. In practice (see Section 5.4) we find that such filters improve the quality of synthesized textures in which the image patterns do not align with a small subset of directions. 58 All wavelets are dilated at dyadic scales to obtain a multiscale family of waveforms: β ψj,α (u) := 2−2j ψαβ (2−j u) , j ∈ Z , (α, β) ∈ {(θ, e), (θ, o), (`, p)} . In our experiment, for directional wavelets, we use g = gσ as a gaussian function with variance σ 2 . For the three wavelets, they all have local support both in space and frequency. As j increase from 0 to J − 1 (from left to right in each block in Figure 5.1), the wavelet has larger support in space and smaller support in frequency. For directional wavelets, the wavelet support varies in directions with rotations (from top to bottom in each block) to capture directional oscillations in images. For omnidirectional wavelets, the wavelet either oscillates radially or angularly or both. The total number of oscillations is fixed across the four wavelets. Numerically, we may assume that x b(ω) is supported on frequencies ω contained in [−π, π]2 . By design the collection of wavelet filters have collective frequency support in a ball, which we can assume is the frequency ball of radius π. 
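For concreteness, the following numpy sketch builds small discrete versions of the three wavelet families on an N × N pixel grid. The Gaussian window, the value of σ, the central frequency ξ_0, and the radial envelope a_ℓ are illustrative choices rather than the exact filters used in our experiments; the rotations and the dyadic dilation with the 2^{-2j} normalization follow the definitions above.

import numpy as np

def directional_wavelets(N, j, theta, xi0=3 * np.pi / 4, sigma=0.8):
    """Even/odd directional wavelets psi^e, psi^o at scale 2^j and orientation theta."""
    u = np.arange(N) - N // 2
    X, Y = np.meshgrid(u, u, indexing="ij")
    xr = (np.cos(theta) * X + np.sin(theta) * Y) / 2 ** j      # rotate, then dilate
    yr = (-np.sin(theta) * X + np.cos(theta) * Y) / 2 ** j
    g = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2)) / 2 ** (2 * j)
    return g * np.cos(xi0 * xr), g * np.sin(xi0 * xr)

def omnidirectional_wavelet(N, j, ell, L=4, sigma=0.8):
    """psi^p_ell: oscillates in the angular variable phi, with radial envelope a_ell."""
    u = (np.arange(N) - N // 2) / 2 ** j
    X, Y = np.meshgrid(u, u, indexing="ij")
    r, phi = np.hypot(X, Y), np.arctan2(Y, X)
    radial_freq = L - 1 - ell                      # fewer radial oscillations as ell grows
    a = np.exp(-r ** 2 / (2 * sigma ** 2)) * np.cos(radial_freq * np.pi * r)
    return a * np.cos(ell * phi) / 2 ** (2 * j)

# filter bank for J scales, M orientations, L angular frequencies
N, J, M, L = 64, 4, 4, 4
bank = []
for j in range(J):
    for m in range(M):
        bank.extend(directional_wavelets(N, j, m * np.pi / M))
    for ell in range(L):
        bank.append(omnidirectional_wavelet(N, j, ell, L))
print(len(bank), "wavelet filters of size", bank[0].shape)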
In this case we complement the wavelet filters with two additional filters: (i) a non-negative low pass filter φ ∈ L2 (T2 ) that has Fourier transform essentially supported around the origin (since wavelets have zero average); and (ii) a high pass filter h ∈ L2 (T2 ) that has Fourier transform essentially sup- ported outside of the frequency ball {ω ∈ Ω : |ω| ≤ π} (in other words, b h is supported in the “corners” of [−π, π]2 ). The scales 2j of the wavelet filters are restricted to 0 ≤ j < J, where J ≤ Jmax = O(log2 T ), and we dilate φ to the scale 2J via φJ (u) := 2−2J φ(2−J u). The wavelet transform we use in this chapter computes the convolution of x with all the aforementioned filters: β WJ x := {x ∗ φJ , x ∗ h , x ∗ ψj,α : 0 ≤ j < J, (α, β) ∈ {(θ, e), (θ, o), (`, p)}, θ ∈ ΘM , 0 ≤ ` < L} . A group of filters {fk }k is said to form a frame for signals x such that supp(b x) ⊂ [−π, π]2 if there exists two constants 0 < A ≤ B < +∞ such that: X A≤ |fbk (ω)|2 ≤ B , ∀ ω ∈ Ω ∩ [−π, π]2 . k 59 We can define the dual filters as fek (ω) := fbk (ω) . An image x can be reconstructed from b 2 P k |fk (ω)| b its filtrations by the filters {fk }k using the dual filters {fek }k and the formula: X x= x ∗ fk ∗ fek . (5.2) k For appropriately chosen parameters, the collection of filters used to define the wavelet transform WJ forms a frame. As such, we can recover x from WJ x using (5.2). This property will be important in Section 5.3 for developing an algorithm by which to synthesize a new texture. 5.2.2 First layer statistics We extract directly from the image x the mean, variance, skewness, and kurtosis of the image intensities (x(u))u∈T2 , in addition to the min/max intensities. We then consider the low pass filtering x ∗ φJ . Since x(u) ≥ 0 and φJ (u) ≥ 0, the mean of x ∗ φJ is proportional to the mean of x and does not need to be computed. We do add in the variance of the values (x ∗ φJ (u))u∈T2 to Sx. We also add in the variance of the high pass coefficients (x ∗ h(u))u∈T2 . These statistics are also used by Portilla and Simoncelli, and one can find additional motivation for their usefulness in [2]. In order to simplify notation, let λ = (j, α, β) ∈ Λ be any admissible triplet for the wavelets ψj,αβ described in Section 5.2.1, and denote these wavelets by ψλ . Like the low pass coefficients and the high pass coefficients, we could compute only the variance of the values (x ∗ ψλ (u))u∈T2 , but in doing so we would miss important correlations between patterns in x at different scales, orientations, and angular frequencies, as captured by our wavelets. In fact, results in [43] indicate that the using only the variance of wavelet coefficients does not result in good texture synthesis for certain types of textures. An alternative would be to compute the covariance between x ∗ ψλ and x ∗ ψλ0 , but the frequency localization of the wavelets means that such statistics will be nearly zero, and hence meaningless, for most pairs (λ, λ0 ). One possible solution is to apply a pointwise nonlinear function σ : R → R+ to the wavelet 60 coefficients, effectively pushing the high frequencies of x ∗ ψλ down to the low frequencies for each λ, which in turn generates non-trivial correlations between x ∗ ψλ and x ∗ ψλ0 . In [2], Portilla and Simoncelli decompose complex-valued wavelet coefficients into their modulus and phase (two nonlinear transforms), and compute covariance-type statistics of the wavelet modulus coefficients and of the phase coefficients. 
More recently, Zhang and Mallat [69] developed a wavelet phase harmonic nonlinear transform (also for complex wavelets) and used the resulting covariance statistics for texture synthesis of select gray-scale textures. In this work we set σ(t) := max(0, t), which is the rectified linear unit (ReLU) nonlin- earity. In order to obtain an invertible transform, we also multiply the wavelet coefficients by ±1, thus yielding the nonlinear wavelet transform: UJ1 x := {x ∗ φJ , x ∗ h , σ(γ · x ∗ ψλ ) : γ = ±1, λ ∈ Λ} . Since t = σ(t) − σ(−t), one can recover WJ x from UJ1 x and hence one can recover x using (5.2). We compute the Gram matrix correlation statistics between all pairs of nonlinear coefficients in UJ1 x, Z 1 Cx1 (λ, γ, λ0 , γ 0 ) := σ(γ · x ∗ ψλ (u))σ(γ 0 · x ∗ ψλ0 (u)) du (5.3) (2T )2 T2 with the exception of λ = λ0 and γ = −γ 0 as Cx1 (λ, γ, λ, −γ) = 0. Note that |t| = σ(t) + σ(−t), and hence the statistics (5.3) subsume the covariance statis- tics between wavelet absolute value coefficients, which are similar to the wavelet modulus statistics computed in [2]. It was observed in [69] that the ReLU nonlinearity is related to phase information in complex valued wavelet coefficients. In [2], Portilla and Simoncelli motivate the inclusion of phase by considering two one-dimensional signals, a Dirac function and a step function. These two signals cannot be distinguished by the wavelet modulus coefficients alone. ReLU wavlet coefficients, on the other hand, can distinguish a Dirac func- tion from a step function for either even or odd wavelets. However, the next theorem shows they have trouble distinguishing the relative intensity of these functions unless even and odd wavelets are used together. 61 Theorem 12. Define y1 (t) := δ(t) and y2 (t) := 1[0,∞) (t). Let ψee ∈ L2 (R) be a one- dimensional even wavelet, and let ψeo ∈ L2 (R) be a one-dimensional odd wavelet. Define the 2 × 2 correlation matrices Cyβk as: Z 0 β Cyk (γ, γ ) := σ(γ · yk ∗ ψeβ (t))σ(γ 0 · yk ∗ ψeβ (t)) dt , R for β ∈ {e, o} and γ ∈ {−1, +1}. Then Cye1 6= C−y e 1 and Cye2 = C−y e 2 , while Cyo1 = C−y o 1 and Cyo2 6= C−yo 2 . Figure 5.2 shows examples of the wavelets and corresponding Fourier transforms. Figure 5.2 1D wavelets. From left to right: 1D even wavelet, 1D odd wavelet, FFT (real part) of even wavelet, FFT (imagery part) of odd wavelet. In order to prove Theorem 12, we need the following lemma. Lemma 3. Let f (u) be an odd function, we have Z Z 0 σ(γ · f )σ(γ · f ) = σ(−γ · f )σ(−γ 0 · f ) R R for γ, γ 0 ∈ {−1, +1}. Proof of Theorem 12. We first prove the theorem for the 1D Dirac function y1 (t). First the wavelet transform of y1 is: y1 ∗ ψ o (u) = ψ o (u), y1 ∗ ψ e (u) = ψ e (u) Figure 5.3 shows the wavelet transforms for y1 and −y1 . Since ψ o (u) is an odd function, with Lemma 3 we have Cyo1 = C−y o 1 . However since ψ e (u) is an even function, generally we have Cye1 (+1, +1) 6= C−y e 1 (+1, +1). Therefore Cye1 6= C−y e 1 . Figure 5.4 verifies this conclusion numerically. 62 Figure 5.3 Dirac functions and wavelet coefficients. Left: Two Dirac functions y1 and −y1 . Middle: Wavelet coefficients with the even wavelet. Right: Wavelet coefficients with the odd wavelet. Figure 5.4 Covariance matrix for diracs. Upper row from left to right: Cye1 , C−y e 1 , Cye1 − C−y e 1 . Lower row from left to right: Cy1 , C−y1 , Cy1 − C−y1 . This numerically verified that even o o o o wavelet is able to distinguish the two dirac functions from the covariance statistics while odd wavelet cannot. 
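The Dirac computation above can be reproduced in a few lines; the discrete grid and the particular Gabor-type even and odd wavelets below are illustrative choices.

import numpy as np

t = np.linspace(-8, 8, 1025)
g = np.exp(-t ** 2 / 2)
psi_e = g * np.cos(2 * t)                 # even wavelet
psi_o = g * np.sin(2 * t)                 # odd wavelet
relu = lambda z: np.maximum(z, 0.0)

def corr_matrix(y, psi):
    """C_y(gamma, gamma') = sum_t relu(gamma * (y*psi)(t)) relu(gamma' * (y*psi)(t))."""
    w = np.convolve(y, psi, mode="same")
    feats = [relu(w), relu(-w)]           # gamma = +1, -1
    return np.array([[np.dot(a, b) for b in feats] for a in feats])

y1 = np.zeros_like(t); y1[len(t) // 2] = 1.0        # discrete Dirac at the grid center
print("even, +delta vs -delta:\n", corr_matrix(y1, psi_e), "\n", corr_matrix(-y1, psi_e))
print("odd,  +delta vs -delta:\n", corr_matrix(y1, psi_o), "\n", corr_matrix(-y1, psi_o))
# The odd-wavelet matrices agree (up to discretization), while the even-wavelet
# matrices differ, matching C^o_{y1} = C^o_{-y1} and C^e_{y1} != C^e_{-y1}.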
Now we prove the theorem for the jump function y2 (t). The wavelet transforms satisfy: Z Z u β y2 ∗ ψ (u) = β y2 (u − t)ψ (t)dt = ψ β (t)dt (5.4) R −∞ for β ∈ {e, o}. Figure 5.5 illustrates the convolution of the even and odd wavelet with y2 63 Figure 5.5 Jump functions and wavelet coefficients. Left: Two jump functions y2 and −y2 . Middle: Wavelet coefficients with the even wavelet. Right: Wavelet coefficients with the odd wavelet. and −y2 . Ru Remark 1. Since ψ e is an integrable even function, then f e (u) = −∞ ψ e (t)dt is an odd function. Ru Remark 2. Since ψ o is an integrable odd function, then f o (u) = −∞ ψ o (t)dt is an even function. Remark 1 with Lemma 3 shows we have Cye2 = C−y e 2 . Remark 2 means that generally we also have Cyo2 6= C−y o 2 . Figure 5.6 gives the numerical verification. Theorem 12 shows both even and odd wavelets are necessary in our model. For images x, the Dirac signal y1 is similar to a dividing line that separates two regions of the same shade, which occurs in many types of texture images. This result shows that ReLU nonlinear wavelet coefficient correlations, when computed with an odd wavelet, cannot correctly determine the brightness of the dividing line relative to the regions it separates. Similarly, ReLU nonlinear wavelet coefficients, when computed with an even wavelet, cannot determine whether the color gradient across an edge is positive or negative. Numerical results illustrating these effects are given in Section 5.4.1. 64 Figure 5.6 Upper row from left to right: Cye2 , C−y e 2 , Cye2 − C−y e 2 . Lower row from left to right: Cy2 , C−y2 , Cy2 − C−y2 . This numerically verified that odd wavelet is able to distinguish the o o o o two jump functions from the covariance statistics while the even wavelet cannot. 5.2.3 Second layer statistics ReLU wavelet correlation statistics can be complemented by two-layer statistics that are derived from feature maps that combine image information across scales before iterating the operator UJ1 . In particular, we compute ( J−1 ! X β UJ2 x := UJ1 σ(γ · x ∗ ψj,α ) : (α, β) ∈ {(θ, e), (θ, o), (`, p)}, j=0 ) γ = ±1, θ ∈ ΘM , 0 ≤ ` < L . and recall UJ1 x := {x ∗ φJ , x ∗ h , σ(γ · x ∗ ψλ ) : γ = ±1, λ ∈ Λ} . We then compute the variance statistics of the low and high pass maps of UJ2 x, and the correlation statistics between all pairs of the nonlinear wavelet maps contained in UJ2 x, with 65 the same exceptions as in the first layer. By iterating upon the map UJ1 , the map UJ2 bears some similarity to the wavelet scattering transform [10]. However, there are several important differences. As already discussed, we utilize the ReLU nonlinearity and a family of real valued wavelets, as opposed to complex valued wavelets and the modulus nonlinearity. Furthermore, the map UJ1 defined here is invertible, unlike the scattering propagation operator. Finally, before iterating the map UJ1 , we sum over the scale index j of the nonlinear maps σ(γ · x ∗ ψj,α β ). This operation is akin to a 1 × 1 convolution operation in the VGG-19 CNN (and other CNNs), but unlike in the VGG network in which the filters being summed over have receptive fields with the same size, here we aggregate over nonlinear multiscale wavelet filtrations that allows our network to link together correlated patterns at multiple scales. This operation also has the effect of reducing the number of second layer maps, and hence statistics. 
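A minimal sketch of how these first and second layer correlation statistics can be computed is given below (periodic FFT convolutions; only the ReLU wavelet maps are shown, and the low pass and high pass variance statistics are omitted). The dictionary bank of wavelets indexed by (j, α, β) is assumed to come from Section 5.2.1 and is not constructed here.

import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def conv2d_periodic(x, psi):
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(psi, s=x.shape)))

def layer1_maps(x, bank):
    """Wavelet part of U^1_J x: relu(gamma * x * psi_lambda) for gamma = +1, -1."""
    maps = {}
    for lam, psi in bank.items():
        w = conv2d_periodic(x, psi)
        maps[(lam, +1)], maps[(lam, -1)] = relu(w), relu(-w)
    return maps

def gram(maps):
    """C(lambda, gamma, lambda', gamma') = spatial average of products of the maps."""
    keys = sorted(maps)
    M = np.stack([maps[k].ravel() for k in keys])
    return keys, M @ M.T / M.shape[1]

def layer2_maps(x, bank):
    """Sum the relu maps over the scale index j, then re-apply the wavelet transform."""
    maps2 = {}
    alphas = {(lam[1], lam[2]) for lam in bank}           # lam = (j, alpha, beta)
    for ab in alphas:
        for gamma in (+1, -1):
            summed = sum(relu(gamma * conv2d_periodic(x, psi))
                         for (j, a, b), psi in bank.items() if (a, b) == ab)
            maps2.update({((ab, gamma), k): v
                          for k, v in layer1_maps(summed, bank).items()})
    return maps2

# usage with a hypothetical bank {(j, alpha, beta): psi array} and image x:
#   keys1, C1 = gram(layer1_maps(x, bank))
#   keys2, C2 = gram(layer2_maps(x, bank))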
Perhaps surprisingly, the operator UJ2 is also invertible under appropriate conditions on the wavelets defined in Section 5.2.1. Theorem 13. If {φJ , h, ψj,θ o : 0 ≤ j < J, θ ∈ ΘM } forms a frame and if gb is non-negative, radial, and a decreasing function of |ω|, then x 7→ {x ∗ φJ , x ∗ h , UJ2 x} is invertible. Before proving Theorem 13, we first prove the following lemma. Lemma 4. If {φJ , h, ψj,θ o : 0 ≤ j < J, θ ∈ ΘM } forms a frame and if gb is non-negative, radial, and a decreasing function of |ω|, then {φJ , h, J−1 j=0 ψj,θ : θ ∈ ΘM } also forms a o P frame. Proof of Lemma 4. With the definition of frame, we know there exist two constants 0 < A ≤ B < ∞ such that: X A ≤ |φbJ (ω)|2 + |bh(ω)|2 + o |ψbj,θ (ω)|2 ≤ B, ∀ ω ∈ Ω ∩ [−π, π]2 . j,θ we need to prove there exist two constants 0 < A0 ≤ B 0 < ∞ such that: X X A0 ≤ |φbJ (ω)|2 + |b h(ω)|2 + | o ψbj,θ (ω)|2 ≤ B 0 , ∀ ω ∈ Ω ∩ [−π, π]2 . θ j 66 The upper bound always exists as long as we have a finite number of filters. Therefore we only prove the lower bound. The key point is to prove: X X o |ψbj,θ (ω)|2 ≤ | o ψbj,θ (ω)|2 , ∀θ ∈ ΘM (5.5) j j Without loss of generosity, we set θ = 0 and omit this notation in the following proof. Recall the odd directional wavelet ψ o (u) = g(u) sin(ξ · u), which has Fourier transform: gbσ (ω − ξ) − gbσ (ω + ξ) ψbo (ω) = (5.6) 2i Bringing equation (5.6) into equation (5.5), we need to prove: X X gσ,j (ω − ξ) − gbσ,j (ω + ξ)|2 ≤ | |b gbσ,j (ω − ξ) − gbσ,j (ω + ξ)|2 (5.7) j j If gb is non-negative, radial, and a decreasing function of |ω|, one can prove: • If |ω − ξ| < |ω + ξ|, then gbσ (ω − ξ) − gbσ (ω + ξ) ≥ 0. For any such ω we also have |2−j ω − ξ| < |2−j ω + ξ|, and gbσ (2−j ω − ξ) − gbσ (2−j ω + ξ) ≥ 0. • If |ω − ξ| > |ω + ξ|, then gbσ (ω − ξ) − gbσ (ω + ξ) ≤ 0. For any such ω we also have |2−j ω − ξ| > |2−j ω + ξ|, and gbσ (2−j ω − ξ) − gbσ (2−j ω + ξ) ≤ 0. • If |ω − ξ| = |ω + ξ|, then gbσ (ω − ξ) − gbσ (ω + ξ) = 0. For any such ω we also have |2−j ω − ξ| = |2−j ω + ξ|, and gbσ (2−j ω − ξ) − gbσ (2−j ω + ξ) = 0. Then for all ω, gbσ (ω − ξ) − gbσ (ω + ξ) and gbσ,j (ω − ξ) − gbσ,j (ω + ξ) have the same sign for all j (either non-positive or non-negative). One can prove: ( i ai )2 ≥ i a2i if all the ai are non- P P negative or non-positive. Therefore, equation (5.7) is true, and | j ψbj,θ P o (ω)|2 ≥ j |ψbj,θ P o (ω)|2 for all θ. The lemma is proved. With Lemma 4, now we prove Theorem 13. Proof of Theorem 13. If {φJ , h, ψj,θ o : 0 ≤ j < J, θ ∈ ΘM } forms a frame, then β {φJ , h , ψj,α : 0 ≤ j < J, (α, β) ∈ {(θ, e), (θ, o), (`, p)}, θ ∈ ΘM , 0 ≤ ` < L} . 67 also forms a frame, i.e., XJ−1 X x = x ∗ φJ ∗ φ fJ + x ∗ h ∗ e h+ β x ∗ ψj,α β ∗ ψj,α , ∀x ∈ L2 (T2 ) (5.8) g j=0 α,β Therefore we can reconstruct j=0 ) from UJ2 x. With t = σ(t) − σ(−t), we are PJ−1 β σ(γ · x ∗ ψj,α able to get J−1 j=0 x ∗ ψj,α , which can also be written as x ∗ j=0 ψj,α . At this point, we have P β PJ−1 β the following updated responses XJ−1 β {x ∗ φJ , x ∗ h, x ∗ ψj,α , (α, β) ∈ {(θ, e), (θ, o), (`, p)}} j=0 With Lemma 4, {φJ , h, } forms a frame, then {φJ , h, } also forms a frame. 
PJ−1 o PJ−1 β j=0 ψj,θ j=0 ψj,α Therefore we can reconstruct the image x from the above responses: X J−1 X X^ J−1 x = x ∗ φJ ∗ φ fJ + x ∗ h ∗ e h+ x∗ β ψj,α ∗ β ψj,α (5.9) α,β j=0 j=0 5.3 Synthesis algorithm Since both operator UJ1 and operator UJ2 (combined with the low pass and high pass co- efficients) are invertible, we can adapt the iterative projection algorithm of [2] in order to synthesize a new texture x∗ with approximately the same statistical profile as x. We describe our version of the projection algorithm in this section. Algorithm 1 summarizes the following synthesis process. Let us first collect the statistics described in Section 5.2: • SJ0 x := six pixel intensity statistics, given by the mean, variance, skewness, kurtosis, min, and max of (x(u))u∈T2 . • SJ1 x := {Var(x ∗ φJ ) , Var(x ∗ h) , Cx1 }, which are the statistics derived from the first layer coefficients. 68 Algorithm 1 Projection algorithm for texture synthesis Input reference image x ; Output new texture image x∗ ; compute target statistics SJ x = (SJ0 x, SJ1 x, SJ2 x) ; initialize: x∗ ← x0 (uniform noise) ; x∗ ← mod_intensities(x∗ , SJ0 x) ; while error > ε do compute UJ1 x∗ = {UJ1 x∗φJ , UJ1 x∗h , UJ1 x∗ψ } ; UJ1 x∗φJ ← mod_variance(UJ1 x∗φJ , Var(x ∗ φJ )) ; UJ1 x∗h ← mod_variance(UJ1 x∗h , Var(x ∗ h)) ; UJ1 x∗ψ ← mod_correlation(UJ1 x∗ψ , Cx1 ) ; compute UJ2 x∗ from UJ1 x∗ψ ; UJ2 x∗φJ ← mod_variance(UJ2 x∗φJ , SJ2 xφJ ) ; UJ2 x∗h ← mod_variance(UJ2 x∗h , SJ2 xh ) ; UJ2 x∗ψ ← mod_correlation(UJ2 x∗ψ , SJ2 xψ ) ; reconstruct UJ1 x∗ψ ← reconstruct_layer1(UJ2 x∗φJ , UJ2 x∗h , UJ2 x∗ψ ) ; reconstruct x∗ ← reconstruct_x(UJ1 x∗φJ , UJ1 x∗h , UJ1 x∗ψ ) ; x∗ ← mod_intensities(x∗ , SJ0 x) ; update error kSJ x∗ − SJ xk ; end • SJ2 x := {SJ2 xφJ , SJ2 xh , Cx2 }, which consists of the second layer low pass variances (SJ2 xφJ ) and high pass variances (SJ2 xh ), and the second layer correlation statistics between ReLU wavelet maps (Cx2 ). Now let UJ1 xψ denote the collection of nonlinear ReLU wavelet coefficient maps of UJ1 x; let UJ2 xφJ denote the collection of second layer low pass maps; let UJ2 xh denote the collection of second layer high pass maps; and let UJ2 xψ denote the collection of second layer nonlinear ReLU wavelet coefficient maps. Given a reference image x, we start by computing its statis- tical profile SJ x = (SJ0 x, SJ1 x, SJ2 x). We then initialize our synthesized image with a random noise image x0 where each x0 (u) is an i.i.d. sample from the uniform distribution. Let xt denote the synthesized image after t iterations. The algorithm first updates xt by directly modifying its intensities (xt (u))u∈T2 so that SJ0 xt = SJ0 x. It then computes UJ1 xt using the modified xt . The low pass coefficients xt ∗ φJ and the high pass coefficients xt ∗ h are adjusted so that Var(xt ∗ φJ ) = Var(x ∗ φJ ) and Var(xt ∗ h) = Var(x ∗ h). All of these 69 steps are computed in the exact same fashion as [2]. Finally, the nonlinear coefficient maps UJ1 xtψ are adjusted to match the target correlation matrix so that Cx1t = Cx1 ; this step is performed in a way that is similar to how [2] updates the wavelet modulus coefficients. If the algorithm is using only first layer statistics, it then inverts UJ1 xt to obtain xt+1 , and the process repeats itself. On the other hand, if the algorithm is using second layer statistics, it then decomposes the updated maps UJ1 xtψ further by computing UJ2 xt . 
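The second-layer update is described next; before that, the control flow of Algorithm 1 can be illustrated with a deliberately stripped-down sketch. Only the variance-matching projection is instantiated, and the "wavelet transform" is a trivial invertible two-channel split standing in for $U_J^1$; mod_intensities, mod_correlation, and the two-layer pass all follow the same project-then-invert pattern.

```python
import numpy as np

def mod_variance(y, target_var):
    """Rescale y about its mean so that Var(y) matches target_var."""
    m, v = y.mean(), y.var()
    return m + (y - m) * np.sqrt(target_var / max(v, 1e-12))

def synthesize(x, decompose, reconstruct, n_iter=50):
    """Project each map of decompose(x*) onto the target variances, then invert; repeat."""
    target = {k: m.var() for k, m in decompose(x).items()}
    x_star = np.random.rand(*x.shape)                       # uniform noise initialization
    for _ in range(n_iter):
        maps = {k: mod_variance(m, target[k]) for k, m in decompose(x_star).items()}
        x_star = reconstruct(maps)
    return x_star

# Toy invertible "transform": an orthogonal two-channel split standing in for U_J^1
# (Theorem 13 plays this role for the actual wavelet statistics).
def decompose(z):
    return {"lo": (z[..., ::2] + z[..., 1::2]) / 2,
            "hi": (z[..., ::2] - z[..., 1::2]) / 2}

def reconstruct(m):
    out = np.empty(m["lo"].shape[:-1] + (2 * m["lo"].shape[-1],))
    out[..., ::2] = m["lo"] + m["hi"]
    out[..., 1::2] = m["lo"] - m["hi"]
    return out

x_ref = np.random.rand(64, 64)
x_new = synthesize(x_ref, decompose, reconstruct)
for k in ("lo", "hi"):
    print(k, "target var %.5f, synthesized var %.5f"
          % (decompose(x_ref)[k].var(), decompose(x_new)[k].var()))
```

Because the toy transform is exactly invertible and the constraints do not interact, this loop converges in a single pass; with the real frame and the correlation constraints the projections do interact, which is why the iteration in Algorithm 1 is needed.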
The algorithm updates the collection of second layer maps UJ2 xt by matching SJ2 xt to SJ2 x in a similar fashion as the first layer. At this point, the algorithm inverts the updated collection {xt ∗φJ , xt ∗h , UJ2 xt } using Theorem 13 to obtain xt+1 . 5.4 Numerical Results We implement several texture synthesis experiments with the goals of (i) numerically verify- ing theoretical assertions made in previous sections; (ii) understanding the effect of hyper- parameter choices on the quality of synthesized textures; and (iii) comparing to other com- monly used algorithms. Our texture images are taken from DTD database1 and CG Texture database2 . Every image is resized to 256 × 256 for consistency. In our experiments, the wavelet transform is implemented in the frequency field using the fast Fourier transform. Therefore, we also periodize certain images to avoid border effect [70]. For directional wavelets, we fix the total number of rotations as M = 4. For omnidirectional wavelets, we fix the total number of oscillations at L = 4. The maximum scale is Jmax = 6. In the following subsections, we numerically illustrate the advantages of using all three types of filters (Section 5.4.1); we examine the role of the maximum scale 2J (Section 5.4.2); and we compare the one-layer synthesis to the two-layer synthesis, thus examining the role of network depth (Section 5.4.3). Finally, in Section 5.4.4 we also compare our results to Gatys et al. [3] and Portilla and Simoncelli [2]. 1 https://www.robots.ox.ac.uk/ vgg/data/dtd/ 2 https://www.textures.com/ 70 5.4.1 Filter comparison Theorem 12 proves both even and odd wavelets are important for texture synthesis. Figure 5.7 validates this theory numerically. We see, for example, that synthesis with odd wavelets is prone to blurring edges and even flipping the colors of enclosed regions that should have the same color (e.g., the last row of Figure 5.7). When using only even wavelets, images with banded colors, for example the third row of Figure 5.7, are blurred. Synthesized images using both even and odd wavelets are generally a clear improvement over their single wavelet-type counterparts. Figure 5.8 shows the necessity for using omnidirectional wavelets. The advantages are clearest for rounded shapes or swirls. Such patterns have no clear direction. For example, in the swirly texture in the last row of Figure 5.8, the omnidirectional wavelets do a better job of reproducing the long swirls. The round-shaped pebbles and dots (rows one and three of Figure 5.8) have smoother edges and a cleaner background when adding the omnidirectional wavelets. 5.4.2 Comparison of maximum scale Figure 5.9 shows synthesis results with different numbers of scales from the two layer model. With J = 3, the statistics are not able to capture macroscopic patterns. Therefore the edges of synthesized bricks (rows one and three) and frames (second to last row) are not straight or continuous. The reproduced house (second row) and dots (third last row) are more random compared to the original image. Larger scales can also capture longer swirls (last row). However, J = 5 and J = 6 achieve equivalent perceptual accuracy, proving there is no need to add in larger scales than J = 5. Indeed, the effective receptive field of the two layer synthesis with J = 5 is equivalent to the single layer receptive field of Jmax = 6. 
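Returning briefly to the filter comparison of Section 5.4.1, the one-dimensional mechanism behind it (Remarks 1 and 2 and Figure 5.6) can be reproduced in a few lines. The Gaussian window, carrier frequency, grid, and unit-step form of $y_2$ below are convenience assumptions rather than the settings used in the experiments.

```python
import numpy as np

t = np.linspace(-8, 8, 2048)
dt = t[1] - t[0]
g = np.exp(-t**2 / 2)                    # Gaussian window (assumed)
xi = 6.0                                 # carrier frequency (assumed)
psi_e = g * np.cos(xi * t)               # even wavelet  g(t) cos(xi t)
psi_o = g * np.sin(xi * t)               # odd wavelet   g(t) sin(xi t)
y2 = np.where(t >= 0, 1.0, 0.0)          # jump function (unit step, assumed form)

def relu_corr(y, psi):
    """2x2 correlation matrix of relu(+-(y * psi)) over the two signs."""
    w = np.convolve(y, psi, mode="same") * dt
    feats = [np.maximum(s * w, 0.0) for s in (+1.0, -1.0)]
    return np.array([[np.sum(a * b) * dt for b in feats] for a in feats])

for name, psi in (("even", psi_e), ("odd", psi_o)):
    gap = np.abs(relu_corr(y2, psi) - relu_corr(-y2, psi)).max()
    print(f"{name} wavelet: max |C_y2 - C_(-y2)| = {gap:.2e}")
```

The even-wavelet difference should come out several orders of magnitude smaller than the odd-wavelet one; this is the same asymmetry that produces the blurred bands and flipped shades in Figure 5.7.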
We also observe that using smaller scales gives more variety of the pattern arrangements, while large scale statistics have the potential to duplicate the reference image. 71 Figure 5.7 Synthesis results from one layer (J = 6) with different types of wavelet filters. Left: Original image. Middle left: First layer synthesis results with only odd wavelets. Middle right: First layer synthesis results with only even wavelets. Right: First layer synthesis results with both even and odd wavelets. 72 Figure 5.8 Synthesis results from two layers (J = 5) with/without omnidirectional wavelets. Left: Original image. Middle: 2nd layer synthesis with even and odd wavelets. Right: 2nd layer synthesis with even, odd and omnidirectional wavelets. 73 Figure 5.9 Synthesis results from two layers with different number of scales. Left: Original image. Middle Left: 2nd layer synthesis results with J = 4. Middle Right: 2nd layer synthesis results with J = 5. Right: 2nd layer synthesis results with J = 6. 74 5.4.3 Layers analysis For many texture images, the one-layer model can synthesize images of high quality. However for images with more complicated structures, multiple layers can provide a boost in visual quality. As discussed in [10], deeper layers decompose high frequency information that is aggregated into large frequency bins with a single wavelet transform. Figure 5.10 shows images that achieved better quality with second layer statistics. For most images, the one- layer model captures general structures while the two-layer model refines the details, e.g., reproducing more accurate shapes, preserving long edges and swirls, fixing blurriness. 5.4.4 Methods comparison We use even and odd directional wavelets, along with omnidirectional wavelets as our pre- selected filters in our final model. We also set J = 5 and use the two-layer model. Our results are compared with [2, 3] in Figure 5.11. The textures synthesized by our model are generally equivalent to, or superior than, the images synthesized by Portilla and Simoncelli [2]. In fact, while not depicted in Figure 5.11, this result holds even if we restrict to one layer, indicating the combination of the ReLU and the selection of even, odd, and omnidirectional filters may provide a more complete statistical description of texture images. With respect to Gatys et al. [3], the results are more nuanced. Our results are generally superior for textures with long, rigid edges even though their model is much deeper than our model. Additionally, textures with rigid patterns, but not necessarily long straight lines (e.g., left, last row; right, rows five, eight, nine) also have visually more appealing synthesis results via our method. These results can be attributed to the use of multiscale filters, although, even then, the results of Section 5.4.2 suggest that such an analysis might be too simplistic. For example, it is possible that a depth three wavelet network with J = 4 might also achieve similar performance to our current implementation with two layers and J = 5, which if so would raise questions with respect to other aspects of the VGG network. 75 Figure 5.10 Synthesis results with one layer model and two layer model. Left: Original image. Middle: 1st layer synthesis results. Right: 2nd layer synthesis results. 76 Moving on we see that [3] obtains superior performance for images with long, non-rigid curves, such as the pebbles and onions (left, rows three and nine), fireworks (left, row eight) and swirling type images (right, rows seven and ten). 
Nevertheless, these results are in line with the observations in Section 5.4.3, which suggested that these types of textures require depth in order to capture their complex patterns. Other images exhibit more subtle differences. For example, in the cracked earth image (left, row four), the synthesized image of [3] creates a bold effect on the cracks that is not present in the original. In the crossing image (left, row ten), the background illumination pattern is only correctly preserved by our method. Lastly, the multi-colored dots (right, last row) have a cleaner background with [3], but our method creates dots with colors that are not present in the original image, thus showing greater variability. 5.5 Implementation details In this section, we describe more implementation details and analysis on numerics. 5.5.1 Reduction of second layer statistics Recall at the second layer we compute: ( J−1 ! ) X β 2 1 UJ x := UJ σ(γ · x ∗ ψj,α ) : (α, β) ∈ {(θ, e), (θ, o), (`, p)}, θ ∈ ΘM , 0 ≤ ` < L . j=0 that is: ( J−1 J−1 X X UJ2 x := σ(γ1 · x ∗ ψjβ11,α1 ) ∗ φJ , σ(γ1 · x ∗ ψjβ11,α1 ) ∗ h, j1 =0 j1 =0 J−1 ! ! X σ γ2 · σ(γ1 · x ∗ ψjβ11,α1 ) ∗ ψjβ22,α2 : j1 =0 ) (α1 , β1 ), (α2 , β2 ) ∈ {(θ, e), (θ, o), (`, p)}, γ1 , γ2 ∈ {+1, −1} . In particular for the third item, we apply another layer of the wavelet transform to the first layer responses. In numerical experiments for directional wavelets, we find J−1 P j1 =0 σ(γ1 · x ∗ 77 Figure 5.11 Synthesis results compared to other models. Left: Original images. Middle Left: Results from Portilla and Simoncelli [2]. Middle Right: Results from Gatys et al. [3]. Right: Results from our two layer model. 78 Figure 5.12 Synthesis results compared to other models. Left: Original images. Middle Left: Results from Portilla and Simoncelli [2]. Middle Right: Results from Gatys et al. [3]. Right: Results from our two layer model. 79 ψjβ11,θ1 ) is essentially supported at the direction θ = θ1 . Therefore to reduce the number of total statistics and save computation, we set θ2 = θ1 , i.e., the second layer of directional wavelets has the same direction as the first layer directional responses. We also add a residual wavelet to keep track of the residual frequencies and match the variance. We numerically verified with this restriction, there is little loss in the synthesized image quality. 5.5.2 Matching of second layer statistics When we use the second layer statistics to synthesize texture images, we initialize the image with the first layer result. This is also equivalent to matching the first layer statistics until convergence, then matching both first layer and second layer statistics. In practice, we see the synthesized image has better quality than matching both first and second layer statistics from noise, i.e., matching both from the beginning. Figure 5.13 compares the synthesized images using these two strategies. In particular, the middle right column shows synthesized images initialized from first layer result and the right column shows synthesized images initialized from noise. We notice matching only first layer statistics significantly reduces the first layer loss and the learned images are already well structured. Initialized from such images, second layer statistics refine the details. On the other hand, initializing from noise fails to reduce the first layer loss to a very small value. The synthesized images generally lose large structures and are not as good as results from the other strategy. 
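One further implementation detail worth making concrete is the periodized, FFT-domain wavelet convolution mentioned in Section 5.4. The Gabor-style filter below is only a stand-in for the even/odd directional wavelets of Section 5.2.1, and the parameter values and the $2^{-2j}$ dilation convention are assumptions made for illustration.

```python
import numpy as np

def periodic_conv(x, psi):
    """Circular convolution of an NxN image with an NxN filter via the FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(psi)))

def directional_wavelet(N, j, theta, parity, sigma0=0.85, xi0=3 * np.pi / 4):
    """Gabor-style even/odd wavelet at scale 2**j and angle theta (assumed form)."""
    u1, u2 = np.meshgrid(np.arange(N) - N // 2, np.arange(N) - N // 2, indexing="ij")
    u1, u2 = u1 / 2.0**j, u2 / 2.0**j                   # dilate the coordinates
    along = u1 * np.cos(theta) + u2 * np.sin(theta)     # coordinate along the direction
    g = np.exp(-(u1**2 + u2**2) / (2 * sigma0**2))
    carrier = np.sin(xi0 * along) if parity == "odd" else np.cos(xi0 * along)
    psi = g * carrier / 2.0**(2 * j)                    # 2^{-2j} psi(2^{-j} u)
    return np.fft.ifftshift(psi)                        # center the filter at pixel (0, 0)

N, M, J = 256, 4, 6
x = np.random.rand(N, N)                                # stand-in for a texture image
coeffs = {(j, m, p): periodic_conv(x, directional_wavelet(N, j, np.pi * m / M, p))
          for j in range(J) for m in range(M) for p in ("even", "odd")}
print(len(coeffs), "directional wavelet maps of size", coeffs[0, 0, "even"].shape)
```

Computing the transform in the frequency domain makes every convolution circular, which is why the images are periodized to avoid border effects, as noted in Section 5.4.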
5.5.3 Analysis of number of iterations In this section we discuss the relationship among the number of iterations, loss, and image quality. Figures 5.14, 5.15, 5.16 show the synthesis process of different textures. For all of these tests, we run second layer synthesis and run for 600 iterations. We plot the logarithm kSJi x−SJi x? k of the relative loss in these plots, i.e., log10 (loss) where loss = kSJi xk for i = 1, 2, x is the reference image, and x? is the synthesized image. 80 Figure 5.13 Left: Original images. Middle left: Images synthesized from first layer. Mid- dle right: Images synthesized from second layer, initialized from first layer result. Right: Images synthesized from second layer, initialized from uniform noise. 81 Figure 5.14 shows the synthesis process of micro-textures plus flowers. We start from noise to match the second layer to simplify the analysis. After iteration 0, the images already look much better than noise, although the colors are not fully matched. At iteration 40, the second layer loss is reduced to approximately 10−3.5 and the synthesized images are of high quality. As we continue the algorithm, at iteration 100 and 590, the images smoothly shift from one sample to another sample which are from the same texture class and the loss barely changed. In general, for such micro-textures, the algorithm converges fast and the loss is reduced long before we stop. Figure 5.15 shows the synthesis process for textures that have more structure. Still we start from noise. At iteration 0, the colors are mismatched, similar to Figure 5.14. At iteration 40 and 100, the colors are matched and general structures are learned, but the holes and frames are not aligned perfectly. Also the relative loss is not reduced. Finally at iteration 590, the second layer loss is reduced to around 10−4 and the synthesized images are of good quality. The algorithm on these types of images generally takes longer to converge than for micro-textures. As the number of iterations increases, the image quality is continuously improved. Figure 5.16 shows the synthesis process of a texture that has intricate patterns. Here we start from the first layer result to show how the second layer statistics improve the image quality. At iteration 0, which is essentially the synthesized image from the first layer experiment, there are barely swirls but only curves. As the algorithm goes on, we notice the first layer loss actually increases while the second layer loss is decreasing. The increasing first layer loss generally does not affect image quality so long as the second layer loss decreases. By the last iteration, the synthesized image contains longer and smoother swirls. 5.6 Conclusion We presented a unique texture synthesis algorithm that melds aspects of Portilla and Simon- celli [2], Gatys et al. [3], and Mallat [10], while also incorporating new ideas on filter design, 82 Figure 5.14 The synthesis process for different micro-textures plus flowers. 83 Figure 5.15 The synthesis process for different macro-textures with rigid patterns. Figure 5.16 The synthesis process when initializing from the first layer synthesis, for a texture with complex patterns. 84 multi-layer structure, and the invertibility of CNNs. Our numerical analysis provides insight into the workings of statistics-based texture synthesis algorithms. Synthesized textures are competitive with the state-of-the-art and in some cases superior to [3], thus providing a potential alternative. 
Nevertheless, issues such as the trade-off between network depth and filter scale are not fully resolved and invite future research, which we explore in the next chapter.

CHAPTER 6 MULTILAYER MODEL ANALYSIS

6.1 Introduction

In this chapter, we continue our analysis of the statistical texture synthesis model proposed in Chapter 5. Intuitively, large filters and deep layers both increase the size of the receptive fields in the network and preserve long range dependence in the image, but they do so in different ways. From the numerical tests, we observed that the model with large filters keeps long straight lines, while the multilayer model preserves long curves. Conversely, the one-layer wavelet model does not capture intricate curves, and the multilayer VGG model struggles to reproduce rigid patterns. There has also been discussion of the trade-off between model depth and the size of the convolution filters in [15, 3]. In this chapter, we develop a mathematical analysis of filter size and model depth. We investigate certain types of texture classes and discuss the necessity of large filters. We also discuss the relationship between model depth and the higher order derivatives of a signal. Specifically, we continue the analysis of Chapter 5 in the following ways:

• We propose a model to represent the ReLU architecture.
• We provide a new perspective for understanding the even and odd wavelets.
• We prove that if the largest filter is not large enough to meet certain criteria, the gram matrix fails to distinguish different textures.
• We prove that, with small filters, deep models intricately partition the high order derivatives of the given signal.

6.2 Multilayer multiscale model

6.2.1 Filters

Let $\phi : \mathbb{R}^2 \to \mathbb{R}$ be a low pass filter satisfying the following conditions:

• $\phi(u) \ge 0$ for all $u \in \mathbb{R}^2$.
• $\phi$ is isotropic, i.e., $\phi(u) = \phi(R_\theta u)$ for all $\theta$, where $R_\theta : \mathbb{R}^2 \to \mathbb{R}^2$ is the rotation operator (rotation by $\theta$).
• $\phi$ is compactly supported in $[-a, a]^2 \subset \mathbb{R}^2$.

Define the wavelet family $\{\psi_{k,n}\}_{k,n}$ such that
$$ \psi_{k,n}(u) = D^{(n)}_{\vec{v}_k} \phi(u) \qquad (6.1) $$
is the $n$-th derivative of $\phi$ along the direction $\vec{v}_k$, where $\vec{v}_k = R_\theta(\vec{v}_0)$, $\vec{v}_0 = (1, 0)$, and $\theta = k\pi/K$. Here $K$ is the total number of rotations. The filters can also be dilated:
$$ \psi_{j,k,n}(u) = 2^{-2j}\, \psi_{k,n}(2^{-j} u), \qquad 0 \le j < J. $$
Since
$$ D^{(n)}_{\vec{v}_k} \phi_j(u) = D^{(n)}_{\vec{v}_k}\big(2^{-2j} \phi(2^{-j} u)\big) = 2^{-2j} \cdot 2^{-nj}\, \big(D^{(n)}_{\vec{v}_k}\phi\big)(2^{-j} u) = 2^{-nj}\, \psi_{j,k,n}(u), $$
we also have $\psi_{j,k,n}(u) = 2^{nj}\, D^{(n)}_{\vec{v}_k} \phi_j(u)$.

Remark 3. When $n = 1$, $\psi_{k,n}$ is an odd directional wavelet with direction $\theta = k\pi/K$; when $n = 2$, $\psi_{k,n}$ is an even directional wavelet with direction $\theta = k\pi/K$.

The derivative wavelets defined in Equation 6.1 are not exactly the directional wavelets used in Chapter 5: the definition is different, and the derivative wavelets have compact support. However, since the derivative wavelets at $n = 1, 2$ look similar to the directional wavelets, they can be thought of as a generalized version of the directional wavelets, with additional flexibility gained by adjusting $n$. Therefore, in what follows we analyze the filters from Chapter 5 through the derivative wavelets defined in Equation 6.1.

6.2.2 Wavelet transform

Let $x : \mathbb{R}^2 \to \mathbb{R}$ be a 2D signal. The wavelet transform can be rewritten as
$$ x \ast \psi_{j,k,n}(u) = 2^{nj}\, x \ast D^{(n)}_{\vec{v}_k} \phi_j(u) = 2^{nj}\, D^{(n)}_{\vec{v}_k}(x \ast \phi_j)(u). $$
The above equation shows that the wavelet transform of a signal with a derivative wavelet equals the derivative of the signal convolved with the low pass filter, up to a constant $2^{nj}$ that depends on the derivative order $n$ and the scale parameter $j$.

Lemma 5.
If x is isotropic, i.e., x(u) = x(Rθ u), ∀u, then x ∗ ψj,k1 ,n = x ∗ ψj,k2 ,n , ∀k1 , k2 . Proof. If x is isotropic, then (n) x ∗ ψj,k1 ,n (u) = 2nj D~vk (x ∗ φj )(u) 1 (n) (x, φj are isotropic) = 2nj D~vk (x ∗ φj )(u) 2 = x ∗ ψj,k2 ,n (u) 6.2.3 Gram matrix Recall in Chapter 5, we use the gram matrix between ReLU wavelet response to represent a texture image for synthesis and achieved good performance. Here we generalize the definition 88 for derivative filters. Define gram matrix as Gx(j1 ,j2 , k1 , k2 , n1 , n2 , 1 , 2 ) : = hReLU(1 · x ∗ ψj1 ,k1 ,n1 ), ReLU(2 · x ∗ ψj2 ,k2 ,n2 )i Z = ReLU(1 · x ∗ ψj1 ,k1 ,n1 (u)) · ReLU(2 · x ∗ ψj2 ,k2 ,n2 (u))du R2 Z     n1 j1 +n2 j2 (n ) (n ) =2 ReLU 1 · D~vk 1 (x ∗ φj1 )(u) · ReLU 2 · D~vk 2 (x ∗ φj2 )(u) du 1 2 R2 (6.2) where 1 , 2 ∈ {+1, −1}. In order to better analyze the correlation, we are going to decompose the integral into the correlation and the integral support. Let  (n ) k1 π Ex(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) := u ∈ R2 :1 · D~vk 1 (x ∗ φj1 )(u) ≥ 0, ~vk1 = Rθ1 ~v0 , θ1 = , 1 K  (n2 ) k2 π 2 · D~vk (x ∗ φj2 )(u) ≥ 0, ~vk2 = Rθ2 ~v0 , θ2 = 2 K (6.3) denote the ReLU support that the integral is integrated on. We can rewrite the gram matrix as Gx(j1 ,j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = Z (6.4) n1 j1 +n2 j2 (n ) (n ) 1 2 2 D~vk 1 (x ∗ φj1 )(u) · D~vk 2 (x ∗ φj2 )(u)du 1 2 Ex(j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) The following special cases include the statistical representations we computed at the first layer for texture images in Chapter 5. • Consider n1 = n2 = 0, there is no derivatives or directional derivatives. We can assume k1 = k2 = 0. If x ≥ 0, e.g., natural images, then x ∗ φj ≥ 0. Then Z Gx(j1 , j2 , 0, 0, 0, 0, +1, +1) = x ∗ φj1 · x ∗ φj2 dt R2 Z 1 ∗ = 2 x\∗ φj1 (ω) · x\ ∗ φj2 (ω)dω (2π) C2 Z 1 ∗ = 2 b(ω) · φc x j1 (ω) · xb∗ (ω) · φc j2 (ω)dω (2π) C2 Z 1 (suppose j1 ≥ j2 ) ≈ x(ω)|2 · |φc |b 2 j1 (ω)| dω (2π)2 C2 = kx ∗ φj1 k22 89 Therefore Gx(j1 , j2 , 0, 0, 0, 0, +1, +1) captures the `2 norm of the low pass responses. • Gx(j1 , j2 , k1 , k2 , 1, 1, 1 , 2 ) captures the odd wavelet gram matrix between different angles and scales. • Gx(j1 , j2 , k1 , k2 , 2, 2, 1 , 2 ) captures the even wavelet gram matrix between different angles and scales. • Gx(j1 , j2 , k1 , k2 , 1, 2, 1 , 2 ) captures the gram matrix between even and odd wavelets cross angles and scales. 6.3 Texture with multiple straight lines Recall in Figure 5.9 (fourth row) and Figure 5.11 (right, first row) from Chapter 5, our model with large scale filters preserve long straight lines while both the model with only small scale filters and the model from Gatys fail to do so. In this section, we consider a special class of signals with parallel straight lines. We prove that multiscale, especially large scale filters, are essential for recovering textures with long lines, which we define in the following. Let x(u1 , u2 ) = αi 1δi (u1 ) where δi ∈ ∆ ⊂ R. Then we have the wavelet transform as P i (n) x ∗ ψj,k,n (u) = 2nj D~vk (x ∗ φj )(u) (n) X = 2nj D~vk ( αi · φ1d j (u1 − δi )) δi ∈∆ where u = (u1 , u2 ) and φ1d j is the one dimensional version of φj . Let ~ v0 = (1, 0), ~v−1 = (0, 1). 
Then (n) X dn 1d D~v0 (x ∗ φj )(u) = αi φ (u1 − δi ) δi dun1 j (n) D~v−1 (x ∗ φj )(u) = 0 90 Therefore for any ~vk = Rθk (~v0 ) = (cos θk , sin θk ), (n) (n) D~vk (x ∗ φj )(u) = cosn θk · D~v0 (x ∗ φj )(u) X dn = cosn θk · αi n φ1d j (u1 − δi ) δ du 1 i And also, X dn 1d x ∗ ψj,k,n (u) = 2 nj n · cos θk · αi n φj (u1 − δi ) (6.5) δi du1 6.3.1 Gram matrix The gram matrix for this type of signal is Gx(j1 ,j2 , k1 , k2 , n1 , n2 , 1 , 2 ) : = h(ReLU(1 · x ∗ ψj1 ,k1 ,n1 ), ReLU(2 · x ∗ ψj2 ,k2 ,n2 )i Z X dn1 = 1 2 2n1 j1 +n2 j2 · cosn1 θk1 · cosn2 θk2 αi n1 φ1d j1 (u1 − δi ) Ex(j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) δi du1 X dn2 · αi n2 φ1d j2 (u1 − δi ) du1 δi du 1 Z n1 j1 +n2 j2 n1 n2 = 1 2 2 · cos θk1 · cos θk2 Ex(j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) n1 n2 XX d 1d d αi1 αi2 n1 φj1 (u1 − δi1 ) φ1d (u1 − δi2 ) du1 δi1 δi 2 du1 dun1 2 j2 (6.6) Note dn1 1d  X Ex(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = u ∈ R2 : 1 · cosn1 θk1 αi φ (u1 − δi ) > 0, δi dun1 1 j1 X dn2  (6.7) n2 1d 2 · cos θk2 αi n2 φj2 (u1 − δi ) > 0 δ du1 i Equation 6.7 shows the integral support only depends on u1 . Moreover, it is contained in the union of the band around u1 = δi . Our next section explores more on when J is small and the bands don’t overlap among different δi . 91 Besides the following remark shows, when the scales (j1 , j2 ), derivative orders (n1 , n2 ) and signs (1 , 2 ) are fixed and only vary the filters’ directions (k1 , k2 ), the gram matrix between the two ReLU responses is proportional to a value that depends on k1 , k2 . Gx(j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) cosn1 θk1 ·cosn2 θk2 Remark 4. Gx(j1 ,j2 ,k10 ,k20 ,n1 ,n2 ,1 ,2 ) = cosn1 θk0 ·cosn2 θk0 1 2 Moreover, the following remark summarizes, when the angle θk1 = mπ+ π2 or θk2 = mπ+ π2 , the gram matrix is zero. Remark 5. If cosn1 θk1 = 0 or cosn2 θk2 = 0, then Gx(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = 0, ∀j1 , j2 , n1 , n2 , 1 , 2 . 6.3.2 Small J Now let us assume J is small. Recall φ is supported in [−a, a], then φj has support [−2j a, 2j a] and φJ−1 has support [−2J−1 a, 2J−1 a]. Let dmin demote the smallest pairwise distance be- tween the diracs in x. The following theorem proves when J is small, the gram matrix cannot distinguish one line texture from another under certain conditions. Theorem 14. Let x1 (u) = αi1 1δi1 (u1 ) and x2 (u) = βi2 1δi2 (u1 ). Let d1min and P P δi1 ∈∆1 δi2 ∈∆2 d2min denote the smallest pairwise distance between diracs of x1 and x2 , respectively. Assume φ is supported in [−a, a]. If 2J a < dimin for i ∈ {1, 2} and βi2 , then P 2 P 2 αi1 = δi1 ∈∆1 δi2 ∈∆2 Gx1 = Gx2 . Note the above theorem is related to Theorem 3 in Chapter 3. The 2D signal in this section can be decomposed into a sparse signal along one direction and a constant along another direction. Theorem 14 provides a necessary condition to distinguish two signals of such type while Theorem 3 provides a sufficient condition to identify a 1D sparse signal. Proof of Theorem 17. Consider the signal x1 . When 2J a < d1min , φ1d j1 (u1 −δi ) does not overlap with φ1d j2 (u1 − δi0 ) for any δi , δi0 ∈ ∆1 , i 6= i , 0 ≤ j1 , j2 < J. 
Therefore the nonzero set from 0 92 (6.7) can be decomposed into [  dn1 Ex1 (j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = u ∈ R2 : 1 · cosn1 θk1 αi n1 φ1d j1 (u1 − δi ) > 0, δi ∈∆1 du 1 dn2 1d  n2 2 · cos θk2 αi n2 φj2 (u1 − δi ) > 0 du1 [ : == Exi1 (j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) δi ∈∆1 (6.8) Note 0 Exi1 (j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = Exi1 (j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) − δi0 + δi (6.9) i.e., if 0 t0 ∈ Exi1 (j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ), then t = t0 − δi0 + δi ∈ Exi1 (j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ). Also the argument inside the integral from Equation 6.6 is nonzero only along the diag- onal: XX dn1 1d dn2 1d X 2 d n1 1d dn2 1d αi1 αi2 φ (t 1 − δ i ) φ (t 1 − δ i ) = α φ (t 1 − δ i ) φ (t1 − δi1 ) δi 1 δi 2 dtn1 1 j1 1 dtn1 2 j2 2 δ i1 dtn1 1 j1 1 dtn1 2 j2 i1 (6.10) Inserting (6.8) and (6.10) into (6.6) we get Gx1 (j1 ,j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = 1 2 2n1 j1 +n2 j2 · cosn1 θk1 · cosn2 θk2 X Z dn1 dn2 1d αi21 n1 φ1d (u 1 − δ i ) φ (u1 − δi1 )du1 δ ∈∆ Exi1 (j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) du1 j1 1 dun1 2 j2 i 1 = 1 2 2n1 j1 +n2 j2 · cosn1 θk1 · cosn2 θk2 (6.11) dn1 1d dn2 1d X Z 2 αi1 φ n 1 j1 (u 1 − δ i ) φ (u1 − δi1 )du1 δi ∈∆1 Exi1 (j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) du1 1 dun1 2 j2 ! X (with (6.9)) = 1 2 2n1 j1 +n2 j2 · cosn1 θk1 · cosn2 θk2 αi21 δi ∈∆1 Z n1 n2 d 1d d 1d n1 φj1 (u1 − δ0 ) n2 φj2 (u1 − δ0 )du1 Ex01 (j1 ,j2 ,k1 ,k2 ,n1 ,n2 ,1 ,2 ) du1 du1 93 Figure 6.1P Two line texture images T that have the same summation P of height √ squared. Left: i=1 132i (u1 ), u ∈ [Z [0, 256)] . Right: 2 · 164i (u1 ), u ∈ 8 2 4 x1 (u) T = x2 (u) = i=1 [Z [0, 256)] . 2 Figure 6.2 The gram matrices and the difference for the two line textures in Figure 6.1. Left: The gram matrix Gx1 between ReLU responses for texture x1 . Middle: The gram matrix Gx2 between ReLU responses for texture x2 . Right: The difference between the two gram matrices |Gx1 − Gx2 |. One can get similar results for x2 with αi21 substituted by βi21 . Then the conclusion P P δi ∈∆1 δi ∈∆2 is proved. Figure 6.1 and 6.2 verify Theorem 14 numerically. The two line textures x1 (u) = P4 √ 1 and 2 · 164i (u1 ), u ∈ [Z [0, 256)]2 both have summation P8 T i=1 32i (u 1 ) x 2 (u) = i=1 of height squared equal to 8. For the selected wavelet family, we choose J = 2 so that 2J a < dmin where dmin = 32 in this example. Figure 6.2 shows there is little difference be- tween the two gram matrices except for the numerical errors, thus verifying our conclusion numerically. 94 6.3.3 Deep layers One may argue that the gram matrix fails in distinguishing the two signals due to the fact that we only have one layer of convolution. In this section, we prove that even with multiple layers, if the filters are tiny, the gram matrix still fails to distinguish different line textures. Recall the first layer wavelet coefficients for the line texture: X dn 1d x ∗ ψj,k,n (u) = 2nj · cosn θk · αi φ (u1 − δi ) (6.12) δi dun1 j dn 1d dn 1d Remind σ(x) = (ReLU )(x). Let dun φj,δi (u) = dun φj (u − δi ) and represent the element- 1 1 wise multiplication. 
The second layer response can be developed in a similar way: σ(x ∗ ψj1 ,k1 ,n1 ) ∗ ψj2 ,k2 ,n2 (u) (n ) =σ(x ∗ ψj1 ,k1 ,n1 ) ∗ 2n2 j2 · D~vk 2 φj2 (u) 2 (n ) =2n2 j2 D~vk 2 [σ(x ∗ ψj1 ,k1 ,n1 )] ∗ φj2 (u) 2 (n ) =2n2 j2 · cosn2 θk2 · D~v0 2 [σ(x ∗ ψj1 ,k1 ,n1 )] ∗ φj2 (u) dn2   n2 j2 n2 0 =2 · cos θk2 · σ (x ∗ ψj1 ,k1 ,n1 ) [x ∗ ψj1 ,k1 ,n1 ] ∗ φj2 (u) dun2  X dn1 =2 n2 j2 · cos θk2 · σ 0 (2n1 j1 · cosn1 θk1 · n2 αi n1 φ1d j1 ,δi ) δ du 1 n2 n1 i d X d n [2n1 j1 · cosn1 θk1 · αi n1 φj1d1 ,δi ] ∗ φj2 (u) du 2 δi du1  X dn1 X dn1 +n2  n1 j1 +n2 j2 n1 n2 0 n1 1d 1d =2 · cos θk1 · cos θk2 · σ (cos θk1 · αi n1 φj1 ,δi ) αi n1 +n2 φj1 ,δi ∗ φj2 (u) δ du 1 δ du1 i i dn 1d When J is small such that dun φj,δi (u) do not overlap at different i, the last line of the above 1 equation equals to dn1 dn1 +n2 1d X   2n1 j1 +n2 j2 n1 n2 · cos θk1 · cos θk2 · αi σ 0 (cosn1 θk1 · n1 φ1d ) φ ∗ φj2 (u). δi du1 j1 ,δi du1n1 +n2 j1 ,δi The term inside the big parenthesis is supported in [δi −2j1 a, δi +2j1 a]. With the convolution with φj2 , the support is [δi − 2j1 +j2 a, δi + 2j1 +j2 a]. This term is equivariant to translations on δi . Therefore when we apply the ReLU and gram matrix at the second layer, we get the 95 same conclusion as the first layer result. Define Z 0 G2 x(λ, λ ) = σ(2 · σ(1 · x ∗ ψj1 ,k1 ,n1 ) ∗ ψj2 ,k2 ,n2 )(u) · σ(02 · σ(01 · x ∗ ψj0 1 ,k1 ,n1 ) ∗ ψj0 2 ,k2 ,n2 )(u)du where λ = (j1 , k1 , n1 , 1 , j2 , k2 , n2 , 2 ). Theorem 15. Let x1 (u) = αi1 1δi1 (u1 ) and x2 (u) = βi2 1δi2 (u1 ). Let d1min and P P δi1 ∈∆1 δi2 ∈∆2 d2min denote the smallest pairwise distance between diracs of x1 and x2 , respectively. Assume φ is supported in [−a, a]. If 22J a < dimin for i ∈ {1, 2} and βi2 , then P 2 P 2 αi1 = δi1 ∈∆1 δi2 ∈∆2 G2 x1 = G2 x2 . More generally, define Gm x to be the m-layer gram matrix. We have Theorem 16. Let x1 (u) = αi1 1δi1 (u1 ) and x2 (u) = βi2 1δi2 (u1 ). Let d1min and P P δi1 ∈∆1 δi2 ∈∆2 d2min denote the smallest pairwise distance between diracs of x1 and x2 , respectively. Assume φ is supported in [−a, a]. If 2mJ a < dimin for i ∈ {1, 2} and βi2 , then P 2 P 2 αi1 = δi1 ∈∆1 δi2 ∈∆2 Gm x1 = Gm x2 . Proof. Let Sm x[λ1 , . . . , λm ](u) = σ(. . . σ(x1 ∗ψλ1 ) · · ·∗ψλm )(u) denote the m-th layer response of x. Let fm [δi , λ1 , . . . , λm ](u) = σ(. . . σ(1δi ∗ ψλ1 ) · · · ∗ ψλm ) be the multilayer response at the m-th layer of the line function. Note 1δi (u) = 1δi (u1 ). Then the m-th layer response of line signal x1 is: X Sm x1 [λ1 , . . . , λm ](u) = αi1 fm [δi1 , λ1 , . . . , λm ](u) δi1 ∈∆1 Since φ is supported in [−a, a]2 , for any 0 ≤ j < J, φj is supported in [−2J−1 a, 2J−1 a]2 . Therefore fm [δi1 , λ1 , . . . , λm ] is supported in the band: {u ∈ R2 : t1 ∈ [−2m(J−1) a, 2m(J−1) a}. The gram matrix for the line function is : Z 1 Gm δi [λ1 , . . . , λm , λ01 , . . . , λ0m ] = fm [δi1 , λ1 , . . . , λm ](u) · fm [δi1 , λ01 , . . . , λ0m ](u)du Note Gm 1δi [λ1 , . . . , λm , λ01 , . . . , λ0m ] remains a constant when changing δi . Since 2mJ a < dimin for i = 1, 2, fm [δi1 , λ1 , . . . , λm ] does not overlap with fm [δi01 , λ1 , . . . , λm ] for any δi1 6= δi01 . 96 With all of these facts, Z X Gm x1 [λ1 , . . . , λm , λ01 , . . . , λ0m ] = αi1 fm [δi1 , λ1 , . . . , λm ](u)· δi1 ∈∆1 X αi1 fm [δi1 , λ01 , . . . , λ0m ](u) δi1 ∈∆1 X Z = αi21 fm [δi1 , λ1 , . . . , λm ](u) · fm [δi1 , λ01 , . . . , λ0m ](u) δi1 ∈∆1 αi21 )Gm 1δ [λ1 , . . . , λm , λ01 , . . . , λ0m ] X =( δi1 ∈∆1 We get similar result for x2 . Therefore, Gm x1 = Gm x2 . 
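Theorem 14 (and, by the same argument, its deeper-layer analogues above) can be checked numerically along the lines of Figures 6.1 and 6.2. The sketch below reduces the computation to one dimension, since the line textures are constant along $u_2$, and uses circular convolutions on $\mathbb{Z}/256\mathbb{Z}$; the bump shape and sizes are assumptions, chosen only so that $2^J a = 16$ stays below the line spacings of 32 and 64.

```python
import numpy as np

N, a, J = 256, 4, 2
x1 = np.zeros(N); x1[32 * np.arange(8)] = 1.0            # 8 lines of height 1
x2 = np.zeros(N); x2[64 * np.arange(4)] = np.sqrt(2.0)   # 4 lines of height sqrt(2)
# both signals have sum of squared heights equal to 8

def circ_conv(x, h):
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))

def derivative_filters():
    """1D version of psi_{j,k,n}: n-th finite difference of a dilated bump, n = 1, 2."""
    out = []
    for j in range(J):
        s = a * 2**j
        t = np.arange(-s, s + 1)
        phi = np.exp(-t**2 / (2 * (s / 3.0)**2)) / 2**j   # truncated bump, width 2s+1
        for n in (1, 2):
            psi = np.zeros(N)
            d = np.diff(phi, n=n)
            psi[:d.size] = d                              # placement only shifts responses
            out.append(psi)
    return out

def relu_gram(x):
    feats = []
    for psi in derivative_filters():
        w = circ_conv(x, psi)
        feats += [np.maximum(w, 0.0), np.maximum(-w, 0.0)]  # epsilon = +1, -1
    F = np.stack(feats)
    return F @ F.T

G1, G2 = relu_gram(x1), relu_gram(x2)
print("max |G(x1) - G(x2)| =", np.abs(G1 - G2).max())
print("typical entry size  =", np.abs(G1).max())
```

The reported difference should sit at the level of floating-point error while the gram entries themselves are not small; enlarging $a$, $J$, or the depth until the (iterated) supports reach the line spacing breaks the equality, exactly as the theorems predict.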
Theorem 16 shows interesting relationship between model depth m, largest filter scale J and the minimum distance dmin between the lines. It also matches the observation of Figure 5.9 (fourth row, sixth row) from Chapter 5. According to Figure 5.9, when J = 4 the model is not able to identify the relative positions of the long lines while J = 5 and J = 6 fix the issue and are able to reproduce the lines. Motivated by Theorem 16, to avoid the inability to distinguish such textures, one can either increase filter size J or increase model depth m to break the condition 2mJ a < dimin . Note the VGG model is deep but is still not able to reproduce long lines (Figure 5.11, left, fifth row; right, first row). We infer this is resulted from the pooling function and invite further research. 6.4 Frame-like texture In this section we analyze another type of textures. Figure 6.3 shows one sample from this class. Let x(u1 , u2 ) = αi 1δi (u1 ) + βi 1ξi (u2 ). The wavelet transform for this signal is: P P δi ∈∆1 ξi ∈∆2 (n) x ∗ ψj,k,n (u) = 2nj D~vk (x ∗ φj (u)) dn 1d dn 1d  X X  nj n n = 2 cos θk αi n φj (u1 − δi ) + sin θk βi n φj (u2 − ξi ) δ ∈∆ du 1 ξ ∈∆ du2 i 1 i 2 Insert it into the gram matrix at Equation 6.2 we get: 97 Figure 6.3 Frame texture. Gx(j1 ,j2 , k1 , k2 , n1 , n2 , 1 , 2 )  n1 j1 +n2 j2 = 1 2 2 · dn1 dn2 Z X X n1 cos θk1 · cos θk2 n2 αi1 n1 φ1d j1 (u1 − δi1 ) · αi2 n2 φ1d (t1 − δi2 ) du1 du2 Ex δ du1 δi2 du1 j2 i1 dn1 dn2 1d Z X X n1 + cos θk1 · sin θk2 n2 αi1 n1 φ1d (u 1 − δ i1 ) · β i2 n2 φj2 (u2 − ξi2 ) du1 du2 Ex δ du1 j1 ξi2 du 2 i1 dn1 dn2 1d Z X X + sinn1 θk1 · cosn2 θk2 βi1 n1 φ1d (u 2 − ξi1 ) · α i2 n2 φj2 (u1 − δi2 ) du1 du2 Ex ξ du1 j1 δi 2 du 2 i1 dn1 1d dn2 1d Z X X  n1 n2 + sin θk1 · sin θk2 βi1 n1 φj1 (u2 − ξi1 ) · βi2 n2 φj2 (u2 − ξi2 ) du1 du2 Ex ξ du2 ξ du2 i1 i2 (6.13) 98 where Ex denotes  Ex(j1 ,j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = u ∈ R2 : dn1 1d dn1 1d  X X  n1 n1 1 cos θk1 αi n1 φj1 (u1 − δi ) + sin θk1 βi n1 φj1 (u2 − ξi ) > 0, δi ∈∆1 du 1 ξi ∈∆2 du2 dn2 1d dn2 1d  X X   n2 n2 2 cos θk2 αi n2 φj2 (u1 − δi ) + sin θk2 βi n2 φj2 (u2 − ξi ) > 0 δ ∈∆ du1 ξ ∈∆ du2 i 1 i 2 (6.14) If θk ∈ {0, π2 }, cos θk = 0 or sin θk = 0. Therefore if θk1 , θk2 ∈ {0, π2 }, the four items in Equation 6.13 only have one that is nonzero: • If θk1 = θk2 = 0, only the first item is nonzero. 
Ex can be simplified as: dn1  X Ex(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = u ∈ R2 :1 cosn1 θk1 αi n1 φ1d j1 (u1 − δi ) > 0, δi ∈∆1 du 1 dn2 1d X  n2 2 cos θk2 αi n2 φj2 (u1 − δi ) > 0 δ ∈∆ du1 i 1 Then with the same analysis as Theorem 14, the gram matrix can be simplified as: X Gx(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) =1 2 2n1 j1 +n2 j2 · cosn1 θk1 · cosn2 θk2 · ( αi2 )· δi (6.15) dn1 1d dn2 1d Z φ (u 1 − δ0 ) · φ (u1 − δ0 ) du1 Ex0u1 dun1 1 j1 dun1 2 j2 n n where Ex0u1 = {u ∈ R2 : 1 cosn1 θk1 α0 du d 1 1d n1 φj (u1 − δ0 ) > 0, 2 cos 1 n2 θk2 α0 dud 2 1d n2 φj (u1 − 2 1 1 δ0 ) > 0 • Similarly if θk1 = θk2 = π2 , the gram matrix is simplified as: X Gx(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) =1 2 2n1 j1 +n2 j2 · sinn1 θk1 · sinn2 θk2 · ( βi2 )· ξi Z n1 n2 (6.16) d 1d d n1 φj1 (u2 − ξ0 ) · φ1d (u2 − ξ0 ) du2 Ex0u2 du2 dun2 2 j2 99 • If θk1 = 0, θk2 = π2 , Ex can be simplified as: dn1  X Ex(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) = u ∈ R2 :1 cosn1 θk1 αi n1 φ1d (u1 − δi ) > 0, δi ∈∆1 du1 j1 dn2 1d X  n2 2 sin θk2 αi n2 φj2 (u2 − δi ) > 0 ξ ∈∆ du2 i 2 := Exu1 ⊗Exu2 The gram matrix is simplified as: Gx(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) =1 2 2n1 j1 +n2 j2 · cosn1 θk1 · sinn2 θk2 dn1 dn2 1d Z X X αi1 n1 φ1d (u 1 − δ i ) · β i φ (u2 − ξi2 ) du1 du2 Ex δ du1 j1 1 δ 2 dun2 2 j2 i1 i2 =1 2 2n1 j1 +n2 j2 · cosn1 θk1 · sinn2 θk2 (6.17) dn1 1d dn2 Z X Z X αi1 n1 φj1 (u1 − δi1 ) du1 · βi2 n2 φ1d (u2 − ξi2 ) du2 Exu1 δ du1 Exu2 δ du2 j2 i1 i2 =1 2 2n1 j1 +n2 j2 · cosn1 θk1 · sinn2 θk2 · dn1 1d dn2 1d X Z X Z ( α i1 ) · φ n1 j1 (u 1 − δ 0 ) du 1 · ( β i 2 ) · n2 φj2 (u2 − ξ0 ) du2 δ Ex0t du1 δ Ex0u du2 2 i1 1 i2 • If θk1 = 0, θk2 = π2 , the gram matrix is simplified as: Gx(j1 , j2 , k1 , k2 , n1 , n2 , 1 , 2 ) =1 2 2n1 j1 +n2 j2 · sinn1 θk1 · cosn2 θk2 · (6.18) dn1 1d dn2 1d X Z X Z ( βi1 ) · n1 φj1 (u2 − ξ0 ) du2 · ( αi2 ) · n2 φj2 (u1 − ξ0 ) du1 ξ Ex0u du2 2 δ Ex0u du1 1 i1 i2 With the analysis above we have the following theorem: Theorem 17. Let x(u) = αi1 1δi1 (u1 )+ βi2 1δi2 (u2 ) and x0 (u) = αi1 1δi1 (u1 )+ P P P 0 δi1 ∈∆1 δi2 ∈∆2 δi1 ∈∆01 βi2 1δi2 (u2 ). Let dmin and d0min denote the smallest pairwise distance between parallel P 0 δi2 ∈∆02 lines of x and x0 , respectively, either along u1 or u2 . Assume φ is supported in [−a, a]2 and θk ∈ {0, π2 }. If 2J a < dmin and 2J a < d0min , then Gx = Gx0 with the following conditions: 100 • αi21 = αi021 P P δi1 ∈∆1 δi1 ∈∆01 • βi22 = βi022 P P ξi2 ∈∆2 ξi2 ∈∆02 • αi0 1 · βi02 P P P P α i1 · βi2 = δi1 ∈∆1 ξi2 ∈∆2 δi1 ∈∆01 ξi2 ∈∆02 Theorem 17 explains the observation of Figure 5.9 (seventh row) from Chapter 5. When J = 4, the gram representations are not able to reproduce the cross lines in the frame texture while larger J resolves the issue. 6.5 Multilayer model In this section, we discuss the properties of multilayer ReLU model. In the following context we use σ(f ) := ReLU(f ). Let x : R → R and φ : R → R to be a one dimensional low pass filter with φ(u) ≥ 0. Define the wavelets ψn (u) = φ(n) (u), u ∈ R By rescaling we get a wavelet family {ψj,n }j,n : (n) ψj,n (u) = φj (u) n Note ψj,n (u) = 2nj du d n φj (u). Then for the one layer wavelet coefficients we have dn φj dn x ∗ ψj,n = x ∗ 2nj = 2 nj · ( x ∗ φj ) = 2nj · x(n) ∗ φj dun dun If j is small, i.e., φj is a tiny filter which is usually the case in CNNs, the convolution with φj can be regarded as a local smoothing operator. The above wavelet coefficients captures the n-th order of derivative of x. 6.5.1 Multilayer ReLU model Now we consider deeper layers of ReLU responses. 
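Before stating the lemma below, it may help to check numerically the single-layer identity $x \ast \psi_{j,n} = 2^{nj}\,(x^{(n)} \ast \phi_j)$ on which the deeper-layer analysis builds. The Gaussian stand-in for $\phi$ (the true $\phi$ is compactly supported), the $2^{-j}$ dilation normalization, the grid, and the test signal are assumptions made for illustration only.

```python
import numpy as np

t = np.linspace(-10, 10, 4001)
dt = t[1] - t[0]
x = np.exp(-t**2 / 8) * np.cos(2 * t)                  # smooth test signal (assumed)

def conv(a, b):
    return np.convolve(a, b, mode="same") * dt

j, n = 2, 2
phi = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)           # Gaussian stand-in for phi
phi_n = np.gradient(np.gradient(phi, dt), dt)          # phi'' , i.e. psi_n for n = 2
phi_j = np.interp(t / 2**j, t, phi) / 2**j             # phi_j(u)     = 2^{-j} phi(2^{-j} u)
psi_jn = np.interp(t / 2**j, t, phi_n) / 2**j          # psi_{j,n}(u) = 2^{-j} phi^{(n)}(2^{-j} u)

lhs = conv(x, psi_jn)                                  # x * psi_{j,n}
rhs = 2**(n * j) * conv(np.gradient(np.gradient(x, dt), dt), phi_j)   # 2^{nj} (x'' * phi_j)
print("relative error:", np.abs(lhs - rhs).max() / np.abs(rhs).max())
```

The relative error should be small, limited only by the finite-difference approximation of the derivatives; the same identity is what the ReLU lemma below is combined with at deeper layers.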
The following lemma is an essential property needed about the n-th order derivatives of ReLU function. Note d du [σ(f (u))] = 101 σ 0 (f (u)) · f 0 (u) and σ 0 (c · f ) = σ 0 (f ) if c > 0. Since σ (n) (f ) = 0, ∀n > 1, we have the following lemma: dn Lemma 6. dun σ(f (u)) = σ 0 (f (u)) · f (n) (u). Proof. We use induction to prove the statement. First we know d du [σ(f (u))] = σ 0 (f (u))·f 0 (u) which aligns with the statement at n = 1. Suppose for n − 1, we have dn−1 σ(f (u)) = σ 0 (f (u)) · f (n−1) (u) dun−1 Consider n, dn d dn−1   σ(f (u)) = σ(f (u)) dun du dun−1   d 0 (n−1) = σ (f (u)) · f (u) du = σ 00 (f (u)) · f (n−1) (u) + σ 0 (f (u)) · f (n) (u) (since σ 00 (f (u)) = 0) = σ 0 (f (u)) · f (n) (u) We prove the conclusion. 102 Let us first look at the second layer model dn2 σ(1 · x ∗ ψj1 ,n1 ) ∗ ψj2 ,n2 = σ(1 · x ∗ ψj1 ,n1 ) ∗ 2n2 j2 · n2 φj2 du n2   n2 j2 d =2 · n2 σ(1 · x ∗ ψj1 ,n1 ) ∗ φj2 du dn2   n2 j2 =2 · n2 σ(1 · x ∗ ψj1 ,n1 ) ∗ φj2 du dn2   n2 j2 0 = 1 2 · σ (1 · x ∗ ψj1 ,n1 ) x ∗ ψj1 ,n1 ∗ φj2 dun2 n1 dn2   n2 j2 0 n1 j1 d = 1 2 · σ (1 · x ∗ ψj1 ,n1 ) x∗2 φj ∗ φj2 dun2 dun1 1 dn1 +n2   n1 j1 +n2 j2 0 = 1 2 · σ (1 · x ∗ ψj1 ,n1 ) x ∗ φj1 ∗ φj2 dun1 +n2   n1 j1 +n2 j2 0 (n1 +n2 ) = 1 2 · σ (1 · x ∗ ψj1 ,n1 ) x ∗ φj1 ∗ φj2   n1 j1 +n2 j2 0 n1 j1 (n1 ) (n1 +n2 ) = 1 2 · σ (1 · 2 ·x ∗ φj1 ) x ∗ φj1 ∗ φj2   n1 j1 +n2 j2 0 (n1 ) (n1 +n2 ) = 1 2 · σ (1 · x ∗ φj1 ) x ∗ φj1 ∗ φj2 (6.19) With the definition of ReLU, we know σ 0 (1 · x(n1 ) ∗ φj1 ) 6= 0 only when 1 · x(n1 ) ∗ φj1 > 0. If φj1 is tiny enough, the condition is approximately 1 · x(n1 ) > 0. Therefore it can be concluded that σ(1 · x ∗ ψj1 ,n1 ) ∗ ψj2 ,n2 captures the (n1 + n2 )-th order of derivative of x at where 1 · x(n1 ) > 0. With similar logic, we can prove the following theorem for deeper layers of ReLU models. Theorem 18. Let fk be the k-layer ReLU response, i.e., fk = σ(. . . σ(2 · σ(1 x ∗ ψj1 ,n1 ) ∗ ψj2 ,n2 ) . . . ) ∗ ψjk ,nk . Then k−1 k P   k P  Y ni ji ( ni ) fk = ( i ) · 2i=1 0 σk−1 . . . σ10 (x i=1 ∗ φj1 ) · · · ∗ φjk−1 ∗ φjk (6.20) i=1 where σk0 = σ 0 (k · fk ). Proof. We use induction to prove the theorem. First for k = 2, we prove the conclusion in 103 (6.19). Suppose the statement holds for k − 1, let us consider the case for k: fk = σ(k−1 fk−1 ) ∗ ψnk ,jk nk   nk jk d =2 σ(k−1 fk−1 ) ∗ φjk (6.21) dunk dnk   nk jk 0 = k−1 · 2 σ (k−1 fk−1 ) fk−1 ∗ φjk dunk Since dnk dnk fk−1 = n σ(k−2 fk−2 ) ∗ ψjk−1 ,nk−1 dunk du k dnk = σ(k−2 fk−2 ) ∗ n ψjk−1 ,nk−1 du k dnk dnk−1 = σ(k−2 fk−2 ) ∗ n (2nk−1 jk−1 n φjk−1 ) du k du k−1 nk−1 +nk d = σ(k−2 fk−2 ) ∗ (2nk−1 jk−1 n +n φjk−1 ) du k−1 k 1 = j n σ(k−2 fk−2 ) ∗ ψjk−1 ,nk−1 +nk 2 k−1 k Let fk−1 ∗ = σ(k−2 fk−2 ) ∗ ψjk−1 ,nk−1 +nk , the only difference between fk−1 ∗ and fk−1 is that in ∗ fk−1 the last filter substitute nk−1 with nk−1 + nk . Insert this change into (6.20), we get k−2 k−1 P   Pk  Y nk jk−1 + ni ji ( ni ) ∗ 0 fk−1 =( i ) · 2 i=1 σk−2 . . . σ10 (x i=1 ∗ φj1 ) · · · ∗ φjk−2 ∗ φjk−1 i=1 Therefore k−2 k−1 k dnk P   P  Y ni ji ( ni ) 0 n fk−1 = ( i ) · 2 i=1 σk−2 . . . σ10 (x i=1 ∗ φj1 ) · · · ∗ φjk−2 ∗ φjk−1 du k i=1 Insert this into Equation 6.21 we prove the theorem. The above theorem concludes interesting results about very deep neural networks. In most CNNs, the convolution filters are of small size, which aligns with the case when jk are all small. In that case, the convolutions with φj can be regarded as tiny local smoothing operators that we can ignore for now. 
The theorem concludes that under this condition, the l k l P ( ni ) k-layer ReLU model captures ( ni )-th order derivatives of x at where ( i ) · x P Q i=1 > i=1 i=1 0, ∀1 ≤ l < k. High order derivatives are preserved based on fine partitions of low order derivatives. Specifically when the filters are tiny, n usually equals to 0 or 1. If n is exactly 1, 104 the theorem proves that the n-th layer deep model preserves the n-th order derivatives of x based on partitions of i-th order derivatives for 0 < i < n. If n is either 0 or 1, the theorem implies similar conclusions. If n is exactly 0, i.e., all the filters are small low pass filters, the deep model is less related to derivatives of x. In general, deep models smooth out either the signal or the high order derivatives of the signal. Also deep layers increase the receptive fields, thus incorporating information from more neighbor pixels and preserving long range dependence. 105 CHAPTER 7 RANDOM FIELDS A random field refers to a random function over an arbitrary domain, e.g., Rd . The function value at each point in the domain can be think of as a random variable. And the function values at different points usually have some correlation. On the other hand, texture images from the same class usually have similar repeated patterns but generally different, which can be reflected from the correlation and randomness of a random field. Therefore random fields can be used to model texture images. In this chapter, we show more of our work on scattering transform on random fields. We extend the work in [43] to d dimensional space. We also discuss the relationship between scattering moments and power spectrum of a stochastic process. 7.1 Scattering moments of self-similar processes Before we introduce the scattering moments of random fields, we first show and prove the following properties of random fields that are stationary or has stationary increments. Let ψ(u) : u ∈ Rd be a wavelet, i.e., Rd ψ(u)du = 0. R Lemma 7. If {X(u)}u∈Rd has stationary increments, {X ∗ ψ(u)}u∈Rd is stationary. Proof. for all u and v: Z X ∗ ψ(u) = X(u − u0 )ψ(u0 )du0 Z Z (since ψ has zero average) = 0 0 0 X(u − u )ψ(u )du − X(u) ψ(u0 )du0 Z = (X(u − u0 ) − X(u))ψ(u0 )du0 Z d = (X(v − u0 ) − X(v))ψ(u0 )du0 Z = X(v − u0 )ψ(u0 )du0 = X ∗ ψ(v) 106 d Therefore, X ∗ ψ(u) = X ∗ ψ(v) ∀u, v, and thus {X ∗ ψ(u)}u∈Rd is stationary. Remark 6. Similarly if {X(u)}u∈Rd is stationary, {X ∗ ψ(u)}u∈Rd is stationary. Let σ be a modulus operator or a ReLU operator, i.e., σ(x) = |x| or σ(x) = ReLU(x). Let {ψj,θ }j,θ be a family of wavelets where ψj,θ (u) = 2−dj ψ(2−j R−θ u), ∀u ∈ Rd . The following theorem explores the first order scattering moments of self similar processes. Theorem 19. Suppose {X(u)}u∈Rd is a self similar process of order H and has stationary increments. The first order m-th scattering moments of {X(u)}u∈Rd satisfy: E[σ m (X ∗ ψj1 ,θ1 )] = 2mHj1 (7.1) E[σ m (X ∗ ψθ1 )] Moreover, if {X(u)}u∈Rd is isotropic: E[σ m (X ∗ ψj1 ,θ1 )] = 2mHj1 (7.2) E[σ m (X ∗ ψ)] Proof. Z X ∗ ψj1 ,θ1 (u) = X(u0 ) · ψj1 ,θ1 (u − u0 )du0 Rd Z = X(u0 ) · 2−nj1 ψθ1 (2−j1 (u − u0 ))du0 d ZR (let v 0 = 2−j1 u0 ) = X(2j1 v 0 ) · 2−nj1 ψθ1 (2−j1 u − v 0 ))2nj1 dv 0 (7.3) Rd Z d (since X is self similar) = 2Hj1 X(v 0 ) · ψθ1 (2−j1 u − v 0 ))dv 0 Rd = 2Hj1 X ∗ ψθ1 (2−j1 u) Since X is stationary, we have X ∗ ψθ1 (u) is stationary. Therefore E[σ m (X ∗ ψθ1 (2−j1 u))] = E[σ m (X ∗ ψθ1 (u))] and the first result develops naturally. 
107 Moreover, if {X(u)}u∈Rd is isotropic: Z X ∗ ψθ1 (u) = X(u0 ) · ψθ1 (u − u0 )du0 d ZR = X(u0 ) · ψ(Rθ−11 (u − u0 ))du0 d ZR (let v 0 = Rθ−1 1 u0 ) = X(Rθ1 v 0 ) · ψ(Rθ−1 1 u − v 0 )dv 0 d ZR d (since X is isotropic) = X(v 0 ) · ψ(Rθ−1 1 u − v 0 )dv 0 Rd = X ∗ ψ(Rθ−1 1 u) Since X ∗ ψ is stationary, E[σ m (X ∗ ψ(Rθ−1 1 u))] = E[σ m (X ∗ ψ(u))], we verified the second result. Theorem 19 implies that for a self similar process of order H with stationary increments, if we think of E[σ m (X ∗ ψθ1 )] as a normalization term, the first order scattering moments is proportional to a value determined by the scale j of the wavelet, the moments m and H. Moreover if the process is isotropic, the normalization term is substitute by E[σ m (X ∗ψ)] and the scattering moments are invariant to the rotation parameter θ. It provides a necessary condition for identifying a self similar process and also captures the order parameter H of the process through the first order scattering moments. In general, this theorem indicates the pattern of scattering moments of a self similar process. The following theorem explores more about the second order scattering moments of such processes. Since it involves rotation difference, we focus on R2 for simplification in the following theorem. Theorem 20. Suppose (X(u))u∈R2 is a self similar process of order H and has stationary increments. The second order scattering moments satisfy: E[σ(σ(X ∗ ψj1 ,θ1 ) ∗ ψj2 ,θ2 )] E[σ(σ(X ∗ ψθ1 ) ∗ ψj2 −j1 ,θ2 )] = (7.4) E[σ(X ∗ ψj1 ,θ1 )] E[σ(X ∗ ψθ1 )] Moreover, if (X(u))u∈R2 is isotropic: E[σ(σ(X ∗ ψj1 ,θ1 ) ∗ ψj2 ,θ2 )] E[σ(σ(X ∗ ψ) ∗ ψj2 −j1 ,θ2 −θ1 )] = (7.5) E[(X ∗ ψj1 ,θ1 )] E[σ(X ∗ ψ)] 108 d Proof. From Theorem 19 we have X ∗ ψj1 ,θ1 (u) = 2Hj1 X ∗ ψθ1 (2−j1 u). Therefore: Z σ(X ∗ ψj1 ,θ1 ) ∗ ψj2 ,θ2 (u) = σ(X ∗ ψj1 ,θ1 (u0 )) · ψj2 ,θ2 (u − u0 )du0 2 ZR d = σ(2Hj1 X ∗ ψθ1 (2−j1 u0 )) · ψj2 ,θ2 (u − u0 )du0 2 ZR (let v 0 = 2−j1 u0 ) = σ(2Hj1 X ∗ ψθ1 (v 0 )) · ψj2 ,θ2 (u − 2j1 v 0 )2nj1 dv 0 2 ZR = σ(2Hj1 X ∗ ψθ1 (v 0 )) · ψj2 −j1 ,θ2 (2−j1 u − v 0 )dv 0 R2 = 2Hj1 σ(X ∗ ψθ1 ) ∗ ψj2 −j1 ,θ2 (2−j1 u) Since σ(X ∗ ψ1 ) is stationary, from Remark 6 we have σ(X ∗ ψ1 ) ∗ ψ2 is stationary. Therefore: E[σ(σ(X ∗ ψj1 ,θ1 ) ∗ ψj2 ,θ2 )] 2Hj1 E[σ(σ(X ∗ ψθ1 ) ∗ ψj2 −j1 ,θ2 )] = E[σ(X ∗ ψj1 ,θ1 )] 2Hj1 E[σ(X ∗ ψθ1 )] E[σ(σ(X ∗ ψθ1 ) ∗ ψj2 −j1 ,θ2 )] = E[σ(X ∗ ψθ1 )] The first result if verified. If (X(u))u∈R2 is isotropic: Z σ(X ∗ ψθ1 ) ∗ ψj2 −j1 ,θ2 (u) = σ(X ∗ ψθ1 (u0 )) · ψj2 −j1 ,θ2 (u − u0 )du0 2 ZR d = σ(X ∗ ψ(Rθ−1 1 u0 )) · ψj2 −j1 ,θ2 (u − u0 )du0 2 ZR (let v = Rθ1 u ) = 0 −1 0 σ(X ∗ ψ(v 0 )) · ψj2 −j1 ,θ2 (u − Rθ−1 1 v 0 )dv 0 2 ZR = σ(X ∗ ψ(v 0 )) · ψj2 −j1 ,θ2 −θ1 (Rθ1 u − v 0 )dv 0 R2 = σ(X ∗ ψ) ∗ ψj2 −j1 ,θ2 −θ1 (Rθ1 u) Therefore: E[σ(σ(X ∗ ψθ1 ) ∗ ψj2 −j1 ,θ2 )] E[σ(σ(X ∗ ψ) ∗ ψj2 −j1 ,θ2 −θ1 )] = E[σ(X ∗ ψθ1 )] E[σ(X ∗ ψ)] We verified the second result. Theorem 20 implies that, for a self similar process with order H, the second order scattering moments E[σ(σ(X ∗ ψj1 ,θ1 ) ∗ ψj2 ,θ2 )] normalized by the first scattering moments E[σ(X ∗ ψj1 ,θ1 )], is determined by the scale difference j2 − j1 where j1 , j2 are the scales of the two wavelets at the two layers. Moreover, if the process is isotropic, the second order 109 scattering moments is also determined by the rotation difference θ2 − θ1 , where θ1 , θ2 are the rotations of the two wavelets at the two layers. This theorem provides another necessary condition for identifying a self similar process. 
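Both scaling relations are straightforward to probe numerically. The sketch below estimates the first-order relation (7.1) for one-dimensional Brownian motion, which is self-similar of order $H = 1/2$ with stationary increments, taking $\sigma$ to be the modulus and $m = 1$; the derivative-of-Gaussian wavelet, grid, and path length are convenience choices.

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
dt, T, n_paths = 0.05, 400.0, 200
n = int(T / dt)

def psi_at_scale(s):
    """L1-normalized dilation of a derivative-of-Gaussian wavelet."""
    tau = np.arange(-5 * s, 5 * s + dt / 2, dt)
    psi = -tau * np.exp(-tau**2 / (2 * s**2))
    return psi / np.sum(np.abs(psi))

scales = [2.0**j for j in range(4)]
moments = np.zeros(len(scales))
for _ in range(n_paths):
    X = np.cumsum(rng.standard_normal(n)) * np.sqrt(dt)     # Brownian path, H = 1/2
    for k, s in enumerate(scales):
        w = fftconvolve(X, psi_at_scale(s), mode="valid")   # X * psi_j with s = 2^j
        moments[k] += np.mean(np.abs(w)) / n_paths          # estimate E|X * psi_j|

print("log2 of successive moment ratios:",
      np.round(np.log2(moments[1:] / moments[:-1]), 3), "(theory: H = 0.5)")
```

The printed log-ratios should be close to $H = 0.5$ up to Monte-Carlo error. Theorem 20's normalized second-order moments can be estimated the same way, by applying a second wavelet and $\sigma$ before averaging.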
It describes the pattern of the second order moments and the parameter H is also captured in such statistics. 7.2 Power spectrum Definition 1. Suppose {X(u)}u∈Rd is a stationary process with zero mean. The covariance of X is defined as RX (τ ) = EX(0)X(τ ). Then the power spectrum of {X(u)}u∈Rd is defined as: V Z RX (ω) = F(RX )(ω) = RX (τ )e−iτ ω dτ Rd Lemma 8. Suppose {X(u)}u∈Rd is a stationary process and ψ is a wavelet. Let Y (u) = X ∗ ψ(u). It can be proved that Y is also stationary from Remark 1. Then V V V RY (ω) = RX (ω)|ψ (−ω)|2 (7.6) 110 Proof. V Z RY (ω) = RY (τ )e−iτ ω dτ Z = E[Y (0)Y (τ )]e−iτ ω dτ Z = E[X ∗ ψ(0)X ∗ ψ(τ )]e−iτ ω dτ Z Z Z = E[ X(u) · ψ(−u)du · X(v) · ψ(τ − v)dv]e−iτ ω dτ Z Z Z (by Fubini’s Thm.) =E X(u) · ψ(−u) · X(v) · ψ(τ − v) · e−iτ ω dudvdτ Z Z Z =E X(u) · ψ(−u) · X(v) · [ ψ(τ − v) · e−iτ ω dτ ]dvdu Z Z Z 0 (let τ 0 = τ − v, dτ 0 = dτ ) =E X(u) · ψ(−u) · X(v) · [ ψ(τ 0 ) · e−i(τ +v)ω dτ 0 ]dvdu Z Z =E X(u) · ψ(−u) · X(v) · ψ̂(−ω) · e−ivω dvdu Z Z 0 (let v = u + v 0 , dv = dv 0 ) = ψ̂(−ω)E X(u) · ψ(−u) · X(u + v 0 ) · e−i(u+v )ω dv 0 du Z Z 0 (by Fubini’s Thm.) = ψ̂(−ω) ψ(−u)E[X(u) · X(u + v 0 )] · e−i(u+v )ω dv 0 du Z Z 0 = ψ̂(−ω) ψ(−u) · RX (v 0 ) · e−i(u+v )ω dv 0 du V = ψ̂(−ω)ψ̂(−ω)RX (ω) V = |ψ̂(−ω)|2 RX (ω) Lemma 9. Let {X(u)}u∈Rd be a stationary process and let {ψj }j be a group of wavelets that V V are complex analytic. Suppose j |ψ j |2 (ω) + j |ψ j |2 (−ω) = 2, ∀ω. Then P P X E|X ∗ ψj |2 = VarX j Proof. Let Yj = X ∗ ψj . From Lemma 8, V V V RYj (ω) = RX (ω)|ψj (−ω)|2 111 Therefore, X V X V V RYj (ω) = RX (ω)|ψj (−ω)|2 j j V X V = RX (ω) |ψj (−ω)|2 j  V 2RX (ω), ω < 0  = 0, ω > 0  Taking the inverse Fourier transform at τ = 0 of both sides, we get Z ∞ Z 0 X 1 V 2 V RYj (0) = RY (ω)dω = RX (ω)dω = RX (0) j 2π −∞ 2π −∞ Since RX (0) = VarX X Var(Yj ) = VarX j Since ψj is a wavelet, EYj = EX ∗ ψj = 0. We have the result: X X X E|X ∗ ψj |2 = E|Yj |2 = Var(Yj ) = VarX j j j Remark 7. Note if σ(x) = ReLU(x), then |X ∗ ψj |2 = σ 2 (X ∗ ψj ) + σ 2 (−X ∗ ψj ) when X and ψj are both real valued. Then Lemma 9 can be extended to the ReLU operator: X X E[σ 2 (X ∗ ψj )] + E[σ 2 (−X ∗ ψj )] = VarX j j 7.3 Power spectrum and scattering equivalence √ For now let us work on R. Define ψλ (t) = λψ(λt) for any λ ∈ (0, ∞), and assume ψ is admissible, meaning that V ∞ |ψ (ω)|2 Z Cψ := dω < ∞ . 0 ω Let (X(t))t∈R be a stationary process. Define the first order quadratic scattering moments of X as: SX(λ) := E|X ∗ ψλ |2 , ∀ λ ∈ (0, ∞) . 112 Theorem 21. Let X, Y be two real-valued stationary processes. V V RX (ω) = RY (ω) a.e. ω ∈ R ⇐⇒ SX(λ) = SY (λ) ∀ λ ∈ (0, ∞) . To prove the theorem, we need Lemma 2.1 from [71], which we restate in the following. Lemma 10. Let f ∈ L2 (R) be continuous and assume f (ω) = f (−ω). ψb has compact support and satisfy Condition 1. Then Z f (ω)|ψbλ (ω)|2 dω = 0 ∀λ > 0 =⇒ f = 0 a.e. Condition 1. Define |ψbλ+ (ω)|2 = (|ψbλ (ω)|2 + |ψbλ (−ω)|2 ) · 1(ω ≥ 0). If for any finite sequence {ωi }ni=1 of distinct positive frequencies, the collection {|ψbλ+ (ωi )|2 }ni=1 are linearly independent functions of λ, we say the wavelet ψ satisfies the linear independence condition. Proof of Theorem 21. =⇒: If ψ is a wavelet, then from Lemma 8, we have: V V V V RX (ω) = RY (ω) a.e. ω ∈ R =⇒ RX∗ψλ (ω) = RY ∗ψλ (ω) a.e. ω ∈ R, ∀λ Then by taking inverse Fourier tansform of the power spectrum we have: RX∗ψλ (τ ) = RY ∗ψλ (τ ) ∀τ ∈ R, λ Let τ = 0, we have: E|X ∗ ψλ |2 = E|Y ∗ ψλ |2 which proves this direction. 
113 ⇐=: Z Z 2 E|X ∗ ψλ (t)| = E X(u) · ψλ (t − u)du X(v) · ψλ (t − v)dv Z Z =E X(u) · X(v) · ψλ (t − u) · ψλ (t − v)dudv Z Z (Let v = u + τ ) = E X(u) · X(u + τ ) · ψλ (t − u) · ψλ (t − u − τ )dudτ Z Z = EX(u)X(u + τ ) · ψλ (t − u) · ψλ (t − u − τ )dudτ Z Z = RX (τ ) · ψλ (t − u) · ψλ (t − u − τ )dudτ Z Z = ψλ (t − u) · RX (τ ) · ψλ (t − u − τ )dτ du Z = ψλ (t − u) · RX ∗ ψλ (t − u)du Z (Let t = 0 and v = −u) = ψλ (v) · RX ∗ ψλ (v)dv Z V V (By Plancherel Thm.) = RX ∗ ψλ (ω) · ψλ (ω)dω Z V V V = RX (ω) · ψλ (ω) · ψλ (−ω)dω Z V V V = RX (ω) · ψλ (−ω) · ψλ (−ω)dω Z V V = RX (ω) · |ψλ (−ω)|2 dω Z V V = RX (−ω) · |ψλ (ω)|2 dω ^ Z V V (RX is even) = RX (ω) · |ψλ (ω)|2 dω If SX(λ) = SY (λ), we have: Z V V V (RX − RY )(ω) · |ψλ (ω)|2 dω = 0 V V According to Lemma 11 stated in the following, we have RX (ω) = RX (−ω). For stationary, V V V real-valued X, Lemma 8 implies E|X ∗ ψλ (t)|2 = RX (ω) · |ψλ (ω)|2 dω for all ψ. Suppose ψλ R V is an indicator function supported in interval [a, b], we know the integration of RX in interval V [a, b] is positive. By taking arbitrary a and b, we can conclude that RX is positive almost everywhere. With Lemma 10, we prove the result. 114 Lemma 11. If X is stationary and real-valued, we have RX is real-valued and RX (τ ) = V V V RX (−τ ). Then we also have RX is real-valued and RX (ω) = RX (−ω) Proof. Since RX (τ ) = EX(0)X(τ ) = EX(−τ )X(0) = RX (−τ ) RX (τ ) is real-valued and even. V For RX , V Z Z −iτ ω RX (ω) = RX (τ )e dτ = RX (τ ) cos(τ ω) − i · RX (τ ) sin(τ ω)dτ Since RX is even and sin() is odd, we know the imagery part is 0. Therefore, RX is real- valued. V Z Z Z V RX (−ω) = RX (τ )e iτ ω dτ = RX (−τ )e iτ ω dτ = RX (v)e−ivω dv = RX (ω) we proved the result. Theorem 21 demonstrates the power spectral is equivalent to the scattering moments of a real-valued stationary process. In particular, two processes with the same power spectrum is equivalent to, the two processes has the same first order second scattering moments for any scattering parameter λ. Since a stationary process is determined by its power spectrum, this theorem provides a necessary and sufficient condition for identifying such processes. 115 APPENDIX 116 PROOF FOR CHAPTER 3 The Proof of the Lemmas The Proof of Lemma 1. By linearity, it suffices to show that p is not the zero function unless α1 = . . . , αN = 0. We will consider the case where |γN | > |γk | for all k = 1, 2, . . . , N − 1. The other cases, where |γ1 | > |γk | for all 2 ≤ k ≤ N or where |γ1 | = |γN | > |γk | for all 2 ≤ k ≤ N − 1 are similar. The n-th derivative of p is given by N αk γkn eiγk x . X (n) p (x) = k=1 Therefore, p(n) (0) lim n = αN , n→∞ γN so in particular, there exists n such that p(n) (0) 6= 0, and therefore p is not uniformly zero. The proof of Lemma 2. In the case where p = 2m is even, then by (3.13) and (3.14), |pi |2m ∈ E(mdi ) for each i. Therefore, since d1 > d2 , d3 , d4 , it follows from (3.15) that |p1 |2m + |p2 |2m − |p3 |2m − |p4 |2m is an element of E(md1 ) and therefore vanishes on a set of measure zero. Now consider the case where p = 2m + 1 is odd. Squaring both sides of (3.16) implies p5 = 2(|p1 p2 |2m+1 − |p3 p4 |2m+1 ), where p5 := |p1 |4m+2 + |p2 |4m+2 − |p3 |4m+2 − |p4 |4m+2 . Thus, squaring both sides again gives p6 := 8|p1 p2 p3 p4 |2m+1 where p6 = p25 − 4|p1 p2 |4m+2 − 4|p3 p4 |4m+2 . Therefore, squaring both sides one final time implies that p26 − 64|p1 p2 p3 p4 |4m+2 = 0. 
117 However, since d1 > d2 , d3 , d4 , repeatedly applying (3.13), (3.14), and (3.15), we see that (p26 − 64|p1 p2 p3 p4 |2 ) ∈ E(4md1 ) and therefore vanishes on a set of measure zero. In addition to the lemmas from Chapter 3, we also need the following lemmas to prove Theorem 5 and 4. Lemma 12. Let m ≥ 1 be an odd integer, and let a, b, c, d, C ∈ R, a, b, c, d 6= 0. Let p(θ) = a + beiθ , and q(θ) = c + deiθ . If there are more than 4m distinct θ ∈ [0, 2π] such that |p(θ)|m − |q(θ)|m = C, then ab = cd and a2 + b2 = c2 + d2 . Lemma 13. Let m ≥ 1 be an integer (not necessarily odd) and let a, b, c, d, C ∈ R, γ > 0, κ 6= 0, 1. Then the set of θ such that m i(γ+1)θ m 1 iθ a + be + ce − κa + beiθ + κcei(γ+1)θ =C (7) κ has measure zero. The Proof of Lemma 12. If θ is a solution to |p(θ)|p − |q(θ)|p = C, then |p(θ)|2p − |q(θ)|2p − C 2 = 2|q(θ)|p C. Therefore, f (θ) = 0, where f : R → R is the function defined by 2 f (θ) := |p(θ)|2p − |q(θ)|2p − C 2 − 4|q(θ)|2p C 2 . Since |p(θ)|2 = a2 + b2 + 2ab cos(θ) and |q(θ)|2 = c2 + d2 + cd cos(θ), (8) 118 f (θ) is a trigonometric polynomial of degree at most 2p which by assumption has more than 4p zeros in [0, 2π]. This implies that f (θ) is uniformly zero. By (8) 2 f (θ) = (a2 + b2 + 2ab cos(θ))p − (c2 + d2 + 2cd cos(θ))p − C 2 − 4C 2 (c2 + d2 + 2cd cos(θ))p and so setting the cos2p (θ) coefficient equal to zero implies 0 = (2p ap bp − 2p cp dp )2 which implies ab = cd since p is odd. If p ≥ 3, then 2(p − 1) > p. Therefore, using the binomial formula and setting the cos2(p−1) (θ) coefficient of f (θ) equal to zero implies     2 p 2 2 p−1 p 2 2 p−1 (a + b )(2ab) − (c + d )(2cd) = 0, p−1 p−1 but since ab = cd this implies that a2 + b2 = c2 + d2 . On the other hand, if p = 1, using the fact that ab = cd we see that 2 f (θ) = a2 + b2 − (c2 + d2 ) − C = 4C 2 (c2 + d2 + 2cd cos(θ) Therefore, f (θ) can only be uniformly equal to zero if C = 0 and a2 + b2 = c2 + d2 . The Proof of Lemma 13. Let 1 p(θ) = a + beiθ + cei(γ+1)θ and q(θ) = κa + beiθ + κcei(γ+1)θ . κ Then by (7) we see |p(θ)|p = |q(θ)|p + C, Squaring both sides yields, |p(θ)|2p − |q(θ)|2p − C 2 = 2|q(θ)|p C, and therefore if θ satisfies (7) it is a solution to f (θ) = 0, where 2 f (θ) := |p(θ)|2p − |q(θ)|2p − C 2 − 4|q(θ)|2p C 2 = 0. 119 f (θ) is an element of the class E of generalized exponential polynomials introduced earlier. Therefore, it will follow that f vanishes on a set of measure zero as soon as we show that f is not uniformly zero. We will verify that that the lead cofficient of f is nonzero unless c = ±1. Using the trigonometric identities sin2 (x) + cos2 (x) = 1 and cos(x − y) = cos(x) cos(y) + sin(x) sin(y) we see that |p(θ)|2 = a2 + b2 + c2 + 2ab cos(θ) + 2bc cos(γθ) + 2ac cos((γ + 1)θ) and likewise 1 2 |q(θ)|2 = κ2 a2 + b + κ2 c2 + 2ab cos(θ) + 2bc cos(γθ) + 2κ2 ac cos((γ + 1)θ). κ2 Therefore, the lead coefficient of f (θ) vanishes if and only if κ2 = 1. The Proof of Theorem 4 Proof. Choose ξ1 , ξ2 , . . . , ξL i.i.d. from any probability distribution which is absolutely con- tinuous with respect to the Lebesgue measure. Since x is collision free, with probability one, each of the ξ` ∆i,i+1 (x) are distinct modulo 2π, i.e. ξ` ∆i,i+1 (x) 6≡ ξ`0 ∆i0 ,i0 +1 (x) mod 2π (9) for all 1 ≤ i, i0 ≤ k − 1 and 1 ≤ `, `0 ≤ L, except when (i, `) = (i0 , `0 ). Assume these ξ` ∆i,i+1 are distinct, and let y(t) be any signal defined as in 3.17 such that D(y) = D(x) =: D, and ∂s2 fx,ξ` (d) = ∂s2 fy,ξ` (d) for all d ∈ D and for all 1 ≤ ` ≤ L − 1, and i=1 |ai | . Note that y(t) depends on ξ1 , . . . 
Note that $y(t)$ depends on $\xi_1, \ldots, \xi_{L-1}$, but is independent of $\xi_L$. By [CITE TURNPIKE] and the assumption that $x(t)$ and $y(t)$ are collision free, the fact that $D(x) = D(y)$ implies that the support sets of $x$ and $y$ are equivalent up to translation and reflection, so we may assume without loss of generality that $\Delta_{i,j}(x) = \Delta_{i,j}(y) =: \Delta_{i,j}$ for all $1 \le i \le j \le k$. We will show that $\vec{b}$ must be given by
\[
b_i =
\begin{cases}
\frac{1}{c}\, a_i & \text{if } i \text{ is odd} \\
c\, a_i & \text{if } i \text{ is even,}
\end{cases}
\]
where $c = \pm 1$ or
\[
|c|^p = \frac{\sum_{i=1}^{\lfloor (k+1)/2 \rfloor} |a_{2i-1}|^p}{\sum_{i=1}^{\lfloor k/2 \rfloor} |a_{2i}|^p}. \tag{10}
\]
Next, we will show that, with probability one, if $c$ satisfies (10) but $c \neq \pm 1$, then $\partial_s^2 f_{x,\xi_L}(\Delta_{1,3}) \neq \partial_s^2 f_{y,\xi_L}(\Delta_{1,3})$. Since $y(t)$, and therefore $\vec{b}$, was chosen to depend on $\xi_1, \ldots, \xi_{L-1}$ but not on $\xi_L$, these two facts together will imply that, with probability one, if $y(t)$ is any signal such that $D(y) = D(x) =: D$ and $\partial_s^2 f_{x,\xi_\ell}(d) = \partial_s^2 f_{y,\xi_\ell}(d)$ for all $d \in D$ and all $1 \le \ell \le L$, then $\vec{b} = \pm\vec{a}$, and therefore $y$ is equivalent to $\pm x$ up to reflection and translation.

Equation (3.6) implies that for all $1 \le \ell \le L-1$ and all $1 \le i \le k-1$ we have
\[
|a_i + a_{i+1} e^{i\xi_\ell \Delta_{i,i+1}}|^p - |a_{i+1}|^p - |a_i|^p
= \partial_s^2 f_{x,\xi_\ell}(\Delta_{i,i+1})
= \partial_s^2 f_{y,\xi_\ell}(\Delta_{i,i+1})
= |b_i + b_{i+1} e^{i\xi_\ell \Delta_{i,i+1}}|^p - |b_{i+1}|^p - |b_i|^p.
\]
Therefore, for all $1 \le i \le k-1$, $\xi_1 \Delta_{i,i+1}, \ldots, \xi_{L-1} \Delta_{i,i+1}$ are $L-1$ solutions, distinct modulo $2\pi$, to
\[
|a_i + a_{i+1} e^{i\theta}|^p - |b_i + b_{i+1} e^{i\theta}|^p = |b_i|^p + |b_{i+1}|^p - |a_i|^p - |a_{i+1}|^p.
\]
Since $L - 1 \ge 4p$, Lemma 12 implies that
\[
a_i a_{i+1} = b_i b_{i+1} \tag{11}
\]
and
\[
a_i^2 + a_{i+1}^2 = b_i^2 + b_{i+1}^2 \tag{12}
\]
for all $1 \le i \le k-1$. It follows from (11) that
\[
b_i =
\begin{cases}
\frac{1}{c}\, a_i & \text{if } i \text{ is odd} \\
c\, a_i & \text{if } i \text{ is even,}
\end{cases} \tag{13}
\]
where $c := a_1 / b_1$. Combining (13) with the assumption that
\[
\sum_{i=1}^k |a_i|^p = \sum_{i=1}^k |b_i|^p
\]
implies that either $c = \pm 1$ or that $c$ satisfies (10).

Now, we will show that, with probability one, if $c$ satisfies (10) but $c \neq \pm 1$, then $\partial_s^2 f_{\xi_L}[x](\Delta_{1,3}) \neq \partial_s^2 f_{\xi_L}[y](\Delta_{1,3})$. By (3.7), if $\partial_s^2 f_{\xi_L}[x](\Delta_{1,3}) = \partial_s^2 f_{\xi_L}[y](\Delta_{1,3})$, then
\[
\begin{aligned}
&|a_1 + a_2 e^{i\xi_L\Delta_{1,2}} + a_3 e^{i\xi_L\Delta_{1,3}}|^p + |a_2|^p - |a_2 e^{i\xi_L\Delta_{1,2}} + a_3 e^{i\xi_L\Delta_{1,3}}|^p - |a_1 + a_2 e^{i\xi_L\Delta_{1,2}}|^p \\
&\qquad = |b_1 + b_2 e^{i\xi_L\Delta_{1,2}} + b_3 e^{i\xi_L\Delta_{1,3}}|^p + |b_2|^p - |b_2 e^{i\xi_L\Delta_{1,2}} + b_3 e^{i\xi_L\Delta_{1,3}}|^p - |b_1 + b_2 e^{i\xi_L\Delta_{1,2}}|^p.
\end{aligned} \tag{14}
\]
But combining (11) and (12) implies that for all $i$ either $(a_i, a_{i+1}) = \pm(b_i, b_{i+1})$ or $(a_i, a_{i+1}) = \pm(b_{i+1}, b_i)$. In either case, we have that
\[
|a_1 + a_2 e^{i\xi_L\Delta_{1,2}}| = |b_1 + b_2 e^{i\xi_L\Delta_{1,2}}| \quad \text{and} \quad |a_2 e^{i\xi_L\Delta_{1,2}} + a_3 e^{i\xi_L\Delta_{1,3}}| = |b_2 e^{i\xi_L\Delta_{1,2}} + b_3 e^{i\xi_L\Delta_{1,3}}|.
\]
Combining this with (14) gives
\[
|a_1 + a_2 e^{i\xi_L\Delta_{1,2}} + a_3 e^{i\xi_L\Delta_{1,3}}|^p + |a_2|^p = |b_1 + b_2 e^{i\xi_L\Delta_{1,2}} + b_3 e^{i\xi_L\Delta_{1,3}}|^p + |b_2|^p. \tag{15}
\]
However, by Lemma 13 the set of $\xi_L \in \mathbb{R}$ such that (15) holds has measure zero, unless $c = \pm 1$. Since we assumed that the distribution of $\xi_L$ was absolutely continuous with respect to the Lebesgue measure, this completes the proof.

The Proof of Theorem 5. The proof is quite similar to the proof of Theorem 4. Choose $\xi_1, \ldots, \xi_L$ i.i.d. from any probability distribution which is absolutely continuous with respect to the Lebesgue measure, and again note that with probability one the $\xi_\ell \Delta_{i,i+1}$ are distinct modulo $2\pi$. Let $y(t)$ be any signal defined as in (3.18) such that $D(y) = D(x) =: D$, $\partial_s^2 f_{\xi_\ell}[x](d) = \partial_s^2 f_{\xi_\ell}[y](d)$ for all $d \in D$ and for all $1 \le \ell \le L-1$, and $\sum_{i=1}^k |b_i|^p = \sum_{i=1}^k |a_i|^p$. As before, we may assume that $\Delta_{i,j}(x) = \Delta_{i,j}(y) =: \Delta_{i,j}$ for all $1 \le i \le j \le k$.
Since $\partial_s^2 f_{\xi_\ell}[x](\Delta_{i,i+1}) = \partial_s^2 f_{\xi_\ell}[y](\Delta_{i,i+1})$, it follows from (3.6) that for all $1 \le \ell \le L$ and $1 \le i \le k-1$,
\[
|a_i + a_{i+1} e^{i\xi_\ell \Delta_{i,i+1}}|^{2m} - |a_i|^{2m} - |a_{i+1}|^{2m} = |b_i + b_{i+1} e^{i\xi_\ell \Delta_{i,i+1}}|^{2m} - |b_i|^{2m} - |b_{i+1}|^{2m}.
\]
Therefore, for all $1 \le i \le k-1$, $\xi_1\Delta_{i,i+1}, \ldots, \xi_{L-1}\Delta_{i,i+1}$ are $L-1$ solutions, distinct modulo $2\pi$, to
\[
h(\theta) := |a_i + a_{i+1} e^{i\theta}|^{2m} - |b_i + b_{i+1} e^{i\theta}|^{2m} + |b_i|^{2m} + |b_{i+1}|^{2m} - |a_i|^{2m} - |a_{i+1}|^{2m} = 0.
\]
Using the facts that
\[
|a_i + a_{i+1} e^{i\theta}|^2 = a_i^2 + a_{i+1}^2 + 2a_i a_{i+1}\cos(\theta) \quad \text{and} \quad |b_i + b_{i+1} e^{i\theta}|^2 = b_i^2 + b_{i+1}^2 + 2b_i b_{i+1}\cos(\theta),
\]
we see that
\[
h(\theta) = (a_i^2 + a_{i+1}^2 + 2a_i a_{i+1}\cos\theta)^m - (b_i^2 + b_{i+1}^2 + 2b_i b_{i+1}\cos\theta)^m + b_i^{2m} + b_{i+1}^{2m} - a_i^{2m} - a_{i+1}^{2m}
\]
is a trigonometric polynomial of degree at most $m$ with at least $L-1$ distinct roots. Since $L - 1 \ge 2m + 1 > 2m$, this implies that $h$ must be uniformly zero. In particular, setting the lead coefficient equal to zero implies
\[
(a_i a_{i+1})^m = (b_i b_{i+1})^m
\]
for all $1 \le i \le k-1$. Using the binomial theorem and setting the $\cos^{m-1}(\theta)$ coefficient equal to zero gives
\[
(a_i^2 + a_{i+1}^2)(a_i a_{i+1})^{m-1} = (b_i^2 + b_{i+1}^2)(b_i b_{i+1})^{m-1}.
\]
Combining the last two equations gives $a_i^2 + a_{i+1}^2 = b_i^2 + b_{i+1}^2$ and $a_i a_{i+1} = b_i b_{i+1}$. As in the proof of Theorem 4, these two facts imply that
\[
b_i =
\begin{cases}
\frac{1}{c}\, a_i & \text{if } i \text{ is odd} \\
c\, a_i & \text{if } i \text{ is even}
\end{cases}
\]
for $c = a_1 / b_1$, and the assumption that
\[
\sum_{i=1}^k |a_i|^p = \sum_{i=1}^k |b_i|^p
\]
implies that either $c = \pm 1$ or
\[
|c|^p = \frac{\sum_{i=1}^{\lfloor (k+1)/2 \rfloor} |a_{2i-1}|^p}{\sum_{i=1}^{\lfloor k/2 \rfloor} |a_{2i}|^p}. \tag{16}
\]
As in the proof of Theorem 4, it follows from Lemma 13 that if $c$ satisfies (16) but $c \neq \pm 1$, then $\partial_s^2 f_{\xi_L}[x](\Delta_{1,3}) \neq \partial_s^2 f_{\xi_L}[y](\Delta_{1,3})$ with probability one. Therefore, the proof is complete.

PROOF FOR CHAPTER 4

Proof of Theorem 6

To prove Theorem 6 we will need the following lemma.

Lemma 14. Let $Z$ be a Poisson random variable with parameter $\lambda$. Then for all $\alpha \in \mathbb{R}$ and $m \in \mathbb{N}$,
\[
\mathbb{E}\left[ Z^\alpha \mathbb{1}_{\{Z > m\}} \right] = \sum_{k=m+1}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!}\, k^\alpha \le C_{m,\alpha}\, \lambda^{m+1}, \qquad \forall\, 0 < \lambda < 1.
\]

Proof. For $0 < \lambda < 1$ and $k \in \mathbb{N}$, $e^{-\lambda}\lambda^k \le 1$. Therefore,
\[
\mathbb{E}\left[ Z^\alpha \mathbb{1}_{\{Z > m\}} \right]
= \sum_{k=m+1}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!}\, k^\alpha
= \lambda^{m+1} \sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{(k+m+1)!}\, (k+m+1)^\alpha
\le \lambda^{m+1} \sum_{k=0}^{\infty} \frac{(k+m+1)^\alpha}{(k+m+1)!}
= C_{\alpha,m}\, \lambda^{m+1}.
\]

Proof. [Theorem 6] Recalling the definitions of $Y(dt)$ and $S[\gamma, p]Y(t)$, and setting $N_s(t) = N\big([t-s, t]^d\big)$, we see
\[
S[\gamma, p]Y(t)
= \mathbb{E}\left[ \left| \int_{[t-s,t]^d} w\!\left( \frac{t-u}{s} \right) e^{i\xi\cdot(t-u)}\, Y(du) \right|^p \right]
= \mathbb{E}\left[ \left| \sum_{j=1}^{N_s(t)} A_j\, w\!\left( \frac{t-t_j}{s} \right) e^{i\xi\cdot(t-t_j)} \right|^p \right],
\]
where $t_1, t_2, \ldots, t_{N_s(t)}$ are the points of $N$ in $[t-s, t]^d$. Conditioned on the event that $N_s(t) = k$, the locations of the $k$ points in $[t-s, t]^d$ are distributed as i.i.d. random variables $Z_1, \ldots, Z_k$ taking values in $[t-s, t]^d$ with density
\[
p_Z(z) = \frac{\lambda(z)}{\Lambda_s(t)}, \qquad z \in [t-s, t]^d.
\]
Therefore, the random variables
\[
V_i := \frac{t - Z_i}{s}
\]
take values in the unit cube $Q_1 = [0,1]^d$ and have density
\[
p_V(v) = \lambda(t - vs)\, \frac{s^d}{\Lambda_s(t)}, \qquad v \in Q_1.
\]
Note that in the special case that $N$ is homogeneous, i.e. $\lambda(t) \equiv \lambda_0$ is constant, the $V_i$ are uniform random variables on $Q_1$. Our proof will be based off of conditioning on $N_s(t)$. For $N_s(t) = k \ge 1$,
\[
\mathbb{E}\left[ \left| \sum_{j=1}^{N_s(t)} A_j\, w\!\left( \frac{t-t_j}{s} \right) e^{i\xi\cdot(t-t_j)} \right|^p : N_s(t) = k \right]
= \mathbb{E}\left[ \left| \sum_{j=1}^{k} A_j\, w(V_j)\, e^{is\xi\cdot V_j} \right|^p \right]
\le k^p\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \mathbb{E}[|A_1|^p]\, \|w\|_p^p, \tag{17}
\]
where (17) follows from (i) the independence of the random variables $A_j$ and $V_j$; (ii) the fact that for any sequence of i.i.d. random variables $Z_1, Z_2, \ldots$,
\[
\mathbb{E}\left[ \left| \sum_{n=1}^{k} Z_n \right|^p \right] \le k^{p-1}\, \mathbb{E}\left[ \sum_{n=1}^{k} |Z_n|^p \right] = k^p\, \mathbb{E}[|Z_1|^p];
\]
and (iii) the fact that
\[
\mathbb{E}[|w(V_i)|^p] = \int_{Q_1} |w(v)|^p\, p_V(v)\, dv \le \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \|w\|_p^p.
\]
Therefore, since $\mathbb{P}[N_s(t) = k] = e^{-\Lambda_s(t)}\, (\Lambda_s(t))^k / k!$,
\[
\begin{aligned}
\mathbb{E}\left[ \left| \sum_{j=1}^{N_s(t)} A_j\, w\!\left( \frac{t-t_j}{s} \right) e^{i\xi\cdot(t-t_j)} \right|^p \right]
&= \sum_{k=0}^{\infty} e^{-\Lambda_s(t)} \frac{(\Lambda_s(t))^k}{k!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{N_s(t)} A_j\, w\!\left( \frac{t-t_j}{s} \right) e^{i\xi\cdot(t-t_j)} \right|^p : N_s(t) = k \right] \\
&= \sum_{k=1}^{\infty} e^{-\Lambda_s(t)} \frac{(\Lambda_s(t))^k}{k!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{k} A_j\, w(V_j)\, e^{is\xi\cdot V_j} \right|^p \right] \\
&= \sum_{k=1}^{m} e^{-\Lambda_s(t)} \frac{(\Lambda_s(t))^k}{k!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{k} A_j\, w(V_j)\, e^{is\xi\cdot V_j} \right|^p \right] + \epsilon(m, s, \xi, t),
\end{aligned}
\]
where
\[
\epsilon(m, s, \xi, t) := \sum_{k=m+1}^{\infty} e^{-\Lambda_s(t)} \frac{(\Lambda_s(t))^k}{k!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{k} A_j\, w(V_j)\, e^{is\xi\cdot V_j} \right|^p \right].
\]
By (17) and Lemma 14, if $s$ is small enough so that $\Lambda_s(t) \le s^d \|\lambda\|_\infty < 1$, then
\[
\epsilon(m, s, \xi, t)
\le \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \mathbb{E}[|A_1|^p]\, \|w\|_p^p \sum_{k=m+1}^{\infty} e^{-\Lambda_s(t)} \frac{(\Lambda_s(t))^k}{k!}\, k^p
\le C_{m,p}\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \mathbb{E}[|A_1|^p]\, \|w\|_p^p\, (\Lambda_s(t))^{m+1}
\le C_{m,p}\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \mathbb{E}[|A_1|^p]\, \|w\|_p^p\, \|\lambda\|_\infty^{m+1}\, s^{d(m+1)}.
\]

Proof of Theorem 7

Proof. [Theorem 7] Let $(s_k, \xi_k)$ be a sequence of scale and frequency pairs such that $\lim_{k\to\infty} s_k = 0$. Applying Theorem 6 with $m = 1$, we obtain
\[
\frac{S_{\gamma_k, p}Y(t)}{s_k^d}
= e^{-\Lambda_{s_k}(t)}\, \frac{\Lambda_{s_k}(t)}{s_k^d}\, \mathbb{E}\left[ \left| A_1\, w(V_{1,k})\, e^{is_k\xi_k\cdot V_{1,k}} \right|^p \right] + \frac{\epsilon(1, s_k, \xi_k, t)}{s_k^d}
= e^{-\Lambda_{s_k}(t)}\, \frac{\Lambda_{s_k}(t)}{s_k^d}\, \mathbb{E}[|A_1|^p]\, \mathbb{E}[|w(V_{1,k})|^p] + \frac{\epsilon(1, s_k, \xi_k, t)}{s_k^d},
\]
where we write $V_{1,k} = V_1$ to emphasize the fact that the density of $V_{1,k}$ is
\[
p_{V_k}(v) = \lambda(t - v s_k)\, \frac{s_k^d}{\Lambda_{s_k}(t)}.
\]
Using the error bound (4.9), we see that
\[
\lim_{k\to\infty} \frac{\epsilon(1, s_k, \xi_k, t)}{s_k^d} = 0.
\]
Furthermore, since $0 \le \Lambda_{s_k}(t) \le s_k^d \|\lambda\|_\infty$, we observe that
\[
\lim_{k\to\infty} e^{-\Lambda_{s_k}(t)} = 1,
\]
and by the continuity of $\lambda(t)$,
\[
\lim_{k\to\infty} \frac{\Lambda_{s_k}(t)}{s_k^d} = \lim_{k\to\infty} \frac{1}{s_k^d} \int_{[t-s_k, t]^d} \lambda(u)\, du = \lambda(t). \tag{18}
\]
Finally, by the continuity of $\lambda(t)$, we see that
\[
p_{V_k}(v) \le \frac{\|\lambda\|_\infty}{\lambda_{\min}} \quad \text{and} \quad \lim_{k\to\infty} p_{V_k}(v) = 1, \qquad \forall\, v \in Q_1. \tag{19}
\]
Therefore, by the bounded convergence theorem,
\[
\lim_{k\to\infty} \mathbb{E}[|w(V_1)|^p] = \lim_{k\to\infty} \int_{Q_1} |w(v)|^p\, p_{V_k}(v)\, dv = \int_{Q_1} |w(v)|^p \lim_{k\to\infty} p_{V_k}(v)\, dv = \|w\|_p^p.
\]
That completes the proof of (4.10). To prove (4.11), we assume that $\lambda(t)$ is periodic with period $T$ along each coordinate and again use Theorem 6 with $m = 1$ to observe
\[
\frac{SY(s_k, \xi_k, p)}{s_k^d}
= \mathbb{E}[|A_1|^p]\, \frac{1}{T^d} \int_{Q_T} e^{-\Lambda_{s_k}(t)}\, \frac{\Lambda_{s_k}(t)}{s_k^d} \int_{Q_1} |w(v)|^p\, p_{V_k}(v)\, dv\, dt
+ \frac{1}{T^d} \int_{Q_T} \frac{\epsilon(1, s_k, \xi_k, t)}{s_k^d}\, dt.
\]
By (4.9), the second integral converges to zero as $k \to \infty$. Therefore,
\[
\lim_{k\to\infty} \frac{SY(s_k, \xi_k, p)}{s_k^d} = \mathbb{E}[|A_1|^p]\, \|w\|_p^p\, \frac{1}{T^d} \int_{Q_T} \lambda(t)\, dt,
\]
by the continuity of $\lambda(t)$ and the bounded convergence theorem.

Proof of Theorem 8

Proof. [Theorem 8] We apply Theorem 6 with $m = 2$ and obtain
\[
S_{\gamma_k, p}Y(t) = e^{-\Lambda_{s_k}(t)}\, \Lambda_{s_k}(t)\, \mathbb{E}[|A_1|^p]\, \mathbb{E}[|w(V_{1,k})|^p]
+ e^{-\Lambda_{s_k}(t)}\, \frac{(\Lambda_{s_k}(t))^2}{2}\, \mathbb{E}\left[ \left| A_1 w(V_{1,k}) e^{is_k\xi_k\cdot V_{1,k}} + A_2 w(V_{2,k}) e^{is_k\xi_k\cdot V_{2,k}} \right|^p \right]
+ \epsilon(2, s_k, \xi_k, t), \tag{20}
\]
where $V_{i,k}$, $i = 1, 2$, are random variables taking values on the unit cube $Q_1 = [0,1]^d$ with densities
\[
p_{V_k}(v) = \lambda(t - v s_k)\, \frac{s_k^d}{\Lambda_{s_k}(t)}.
\]
Dividing both sides of (20) by $s_k^{2d}\, \|w\|_p^p\, \mathbb{E}[|A_1|^p]$ and subtracting $\frac{\Lambda_{s_k}(t)}{s_k^{2d}} \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p}$ yields
\[
\begin{aligned}
\frac{S_{\gamma_k, p}Y(t)}{s_k^{2d}\, \|w\|_p^p\, \mathbb{E}[|A_1|^p]} - \frac{\Lambda_{s_k}(t)}{s_k^{2d}} \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p}
&= \frac{e^{-\Lambda_{s_k}(t)}\Lambda_{s_k}(t) - \Lambda_{s_k}(t)}{s_k^{2d}} \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p} \\
&\quad + e^{-\Lambda_{s_k}(t)}\, \frac{(\Lambda_{s_k}(t))^2}{s_k^{2d}}\, \frac{\mathbb{E}\left[ \left| A_1 w(V_{1,k}) e^{is_k\xi_k\cdot V_{1,k}} + A_2 w(V_{2,k}) e^{is_k\xi_k\cdot V_{2,k}} \right|^p \right]}{2\|w\|_p^p\, \mathbb{E}[|A_1|^p]} \\
&\quad + \frac{\epsilon(2, s_k, \xi_k, t)}{s_k^{2d}\, \|w\|_p^p\, \mathbb{E}[|A_1|^p]}.
\end{aligned} \tag{21}
\]
Using the error bound (4.9),
\[
\lim_{k\to\infty} \frac{\epsilon(2, s_k, \xi_k, t)}{s_k^{2d}\, \|w\|_p^p\, \mathbb{E}[|A_1|^p]} = 0, \tag{22}
\]
at a rate independent of $t$. Recalling (19) from the proof of Theorem 7, we use the fact that $\lim_{k\to\infty} p_{V_k} \equiv 1$ and the bounded convergence theorem to conclude
\[
\lim_{k\to\infty} \mathbb{E}\left[ \left| A_1 w(V_{1,k}) e^{is_k\xi_k\cdot V_{1,k}} + A_2 w(V_{2,k}) e^{is_k\xi_k\cdot V_{2,k}} \right|^p \right]
= \mathbb{E}\left[ \left| A_1 w(U_1) e^{iL\cdot U_1} + A_2 w(U_2) e^{iL\cdot U_2} \right|^p \right], \tag{23}
\]
where $U_i$, $i = 1, 2$, are uniform random variables on the unit cube and $L = \lim_{k\to\infty} s_k\xi_k$. Similarly,
\[
\lim_{k\to\infty} \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p} = 1. \tag{24}
\]
Lastly, recalling that $s_k \to 0$ as $k \to \infty$ and using (18) from the proof of Theorem 7, we see
\[
\lim_{k\to\infty} \frac{e^{-\Lambda_{s_k}(t)}\Lambda_{s_k}(t) - \Lambda_{s_k}(t)}{s_k^{2d}}
= \left( \lim_{k\to\infty} \frac{\Lambda_{s_k}(t)}{s_k^d} \right) \left( \lim_{k\to\infty} \frac{e^{-\Lambda_{s_k}(t)} - 1}{s_k^d} \right)
= \lambda(t) \lim_{k\to\infty} \frac{e^{-\Lambda_{s_k}(t)} - 1}{s_k^d}
= -\lambda(t)^2. \tag{25}
\]
Now we integrate both sides of (21) over $Q_T$ and divide by $T^d$. Taking the limit as $k \to \infty$, on the left hand side we get
\[
\begin{aligned}
\lim_{k\to\infty} \frac{1}{T^d} \int_{Q_T} \left( \frac{S_{\gamma_k,p}Y(t)}{s_k^{2d}\, \|w\|_p^p\, \mathbb{E}[|A_1|^p]} - \frac{\Lambda_{s_k}(t)}{s_k^{2d}} \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p} \right) dt
&= \lim_{k\to\infty} \left( \frac{SY(s_k, \xi_k, p)}{s_k^{2d}\, \|w\|_p^p\, \mathbb{E}[|A_1|^p]} - \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p}\, \frac{1}{T^d}\int_{Q_T} \frac{\Lambda_{s_k}(t)}{s_k^{2d}}\, dt \right) \\
&= \lim_{k\to\infty} \left( \frac{SY(s_k, \xi_k, p)}{s_k^{2d}\, \mathbb{E}[|w(V_{1,k})|^p]\, \mathbb{E}[|A_1|^p]} - \frac{1}{T^d}\int_{Q_T} \frac{\Lambda_{s_k}(t)}{s_k^{2d}}\, dt \right),
\end{aligned}
\]
where we used the definition of the invariant scattering moments and (24). On the right hand side of (21), we use (24), (25), and the dominated convergence theorem to see that the first term is
\[
\lim_{k\to\infty} \frac{1}{T^d} \int_{Q_T} \frac{e^{-\Lambda_{s_k}(t)}\Lambda_{s_k}(t) - \Lambda_{s_k}(t)}{s_k^{2d}}\, \frac{\mathbb{E}[|w(V_{1,k})|^p]}{\|w\|_p^p}\, dt
= \lim_{k\to\infty} \frac{1}{T^d} \int_{Q_T} \frac{e^{-\Lambda_{s_k}(t)}\Lambda_{s_k}(t) - \Lambda_{s_k}(t)}{s_k^{2d}}\, dt
= -\frac{1}{T^d}\int_{Q_T} \lambda(t)^2\, dt.
\]
Using (18), (23), and the bounded convergence theorem, the second term of (21) is
\[
\lim_{k\to\infty} \frac{1}{T^d}\int_{Q_T} e^{-\Lambda_{s_k}(t)}\, \frac{(\Lambda_{s_k}(t))^2}{s_k^{2d}}\, \frac{\mathbb{E}\left[ \left| A_1 w(V_{1,k}) e^{is_k\xi_k\cdot V_{1,k}} + A_2 w(V_{2,k}) e^{is_k\xi_k\cdot V_{2,k}} \right|^p \right]}{2\|w\|_p^p\, \mathbb{E}[|A_1|^p]}\, dt
= \frac{\mathbb{E}\left[ \left| A_1 w(U_1) e^{iL\cdot U_1} + A_2 w(U_2) e^{iL\cdot U_2} \right|^p \right]}{2\|w\|_p^p\, \mathbb{E}[|A_1|^p]} \left( \frac{1}{T^d}\int_{Q_T} \lambda(t)^2\, dt \right).
\]
Finally, the third term of (21) goes to zero using the bounded convergence theorem and (22). Putting together the left and right hand sides of (21) with these calculations finishes the proof.

Proof of Theorem 9

Proof. [Theorem 9] As in the proof of Theorem 6, let $N_s(t) = N\big([t-s, t]^d\big)$ denote the number of points in the cube $[t-s, t]^d$. Then since the support of $w$ is contained in $[0,1]^d$,
\[
(g_{\gamma_k} * Y)(t) = \int_{[t-s_k, t]^d} w\!\left( \frac{t-u}{s_k} \right) e^{i\xi_k\cdot(t-u)}\, Y(du) = \sum_{j=1}^{N_{s_k}(t)} A_j\, w\!\left( \frac{t-t_j}{s_k} \right) e^{i\xi_k\cdot(t-t_j)},
\]
where $t_1, t_2, \ldots, t_{N_{s_k}(t)}$ are the points of $N$ in $[t-s_k, t]^d$. Therefore, in the event that $N_{s_k}(t) = 1$,
\[
|(g_{\gamma_k} * Y)(t)|^p = \big( |g_{\gamma_k}|^p * |Y|^p \big)(t),
\]
and so, partitioning the space of possible outcomes based on $N_{s_k}(t)$, we obtain
\[
\begin{aligned}
|(g_{\gamma_k} * Y)(t)|^p
&= \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)=1\}} + (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \right|^p \\
&= \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)=1\}} \right|^p + \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \right|^p \\
&= \big( |g_{\gamma_k}|^p * |Y|^p \big)(t)\, \mathbb{1}_{\{N_{s_k}(t)=1\}} + \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \right|^p \\
&= \big( |g_{\gamma_k}|^p * |Y|^p \big)(t) + \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \right|^p - \big( |g_{\gamma_k}|^p * |Y|^p \big)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \\
&= \big( |g_{\gamma_k}|^p * |Y|^p \big)(t) + e_k(t),
\end{aligned}
\]
where
\[
e_k(t) := \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \right|^p - \big( |g_{\gamma_k}|^p * |Y|^p \big)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}}.
\]
Using the above, we can write the second order convolution term as
\[
\big( g_{\gamma_k'} * |g_{\gamma_k} * Y|^p \big)(t) = \big( g_{\gamma_k'} * |g_{\gamma_k}|^p * |Y|^p \big)(t) + \big( g_{\gamma_k'} * e_k \big)(t).
\]
The following lemma implies that $\big( g_{\gamma_k'} * e_k \big)(t)$ decays rapidly in $L^{p'}$ at a rate independent of $t$.

Lemma 15. There exists $\delta > 0$, independent of $t$, such that if $s_k < \delta$,
\[
\mathbb{E}\left[ \left| \big( g_{\gamma_k'} * e_k \big)(t) \right|^{p'} \right] \le C(p, p', w, c, L)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \|\lambda\|_\infty^2\, s_k^{d(p'+2)}.
\]
Once we have proved Lemma 15, equation (4.13) will follow once we show
\[
\lim_{k\to\infty} \frac{\mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p * |Y|^p \big)(t) \right|^{p'} \right]}{s_k^{d(p'+1)}} = K(p, p', w, c, L)\, \lambda(t)\, \mathbb{E}[|A_1|^q]. \tag{26}
\]
Let us prove (26) first and postpone the proof of Lemma 15. We will use the fact that the support of $g_{\gamma_k'} * |g_{\gamma_k}|^p$ is contained in $[0, s_k + s_k']^d$. Let $\tilde{s}_k := s_k + s_k'$, $N_k(t) := N_{\tilde{s}_k}(t)$, $\Lambda_k(t) := \Lambda_{\tilde{s}_k}(t)$, and let $t_1, t_2, \ldots, t_{N_k(t)}$ be the points of $N$ in the cube $[t - \tilde{s}_k, t]^d$. We have that $\mathbb{P}[N_k(t) = n] = e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}$, and conditioned on the event that $N_k(t) = n$, the locations of the points $t_1, \ldots, t_n$ are distributed as i.i.d. random variables $Z_1(t), \ldots, Z_n(t)$ taking values in $[t - \tilde{s}_k, t]^d$ with density $p_{Z(t)}(z) = \lambda(z)/\Lambda_k(t)$.
Therefore the i.i.d. random variables $\tilde{V}_1(t), \ldots, \tilde{V}_n(t)$ defined by $\tilde{V}_i(t) := t - Z_i(t)$ take values in $[0, \tilde{s}_k]^d$ and have density
\[
p_{\tilde{V}(t)}(v) = \frac{\lambda(t - v)}{\Lambda_k(t)}, \qquad v \in [0, \tilde{s}_k]^d.
\]
Now, we condition on $N_k(t)$ to see that
\[
\begin{aligned}
\mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p * |Y|^p \big)(t) \right|^{p'} \right]
&= \mathbb{E}\left[ \left| \sum_{j=1}^{N_k(t)} |A_j|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(t - t_j) \right|^{p'} \right] \\
&= \sum_{n=1}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{n} |A_j|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(t - t_j) \right|^{p'} : N_k(t) = n \right] \\
&= \sum_{n=1}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{n} |A_j|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_j(t)) \right|^{p'} \right] \tag{27} \\
&= e^{-\Lambda_k(t)}\, \Lambda_k(t)\, \mathbb{E}[|A_1|^q]\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] \tag{28} \\
&\quad + \sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{n} |A_j|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_j(t)) \right|^{p'} \right]. \tag{29}
\end{aligned}
\]
The following lemma will be used to estimate the scaling of the term in (28).

Lemma 16. For all $t \in \mathbb{R}^d$,
\[
\lim_{k\to\infty} \frac{\tilde{s}_k^d}{s_k^{d(p'+1)}}\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] = \left\| g_{c, L/c} * |g_{1,0}|^p \right\|_{p'}^{p'}. \tag{30}
\]
Furthermore, there exists $\delta > 0$, independent of $t$, such that if $s_k < \delta$ then
\[
\frac{\tilde{s}_k^d}{s_k^{d(p'+1)}}\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] \le 2\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, C(p, p', w, c, L). \tag{31}
\]

Proof. Making a change of variables in both $u$ and $v$, and recalling the assumption that $s_k' = c s_k$, we observe that
\[
\begin{aligned}
\frac{\tilde{s}_k^d}{s_k^{d(p'+1)}}\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right]
&= \frac{\tilde{s}_k^d}{s_k^{d(p'+1)}} \int_{\mathbb{R}^d} p_{\tilde{V}(t)}(v) \left| \int_{\mathbb{R}^d} w\!\left( \frac{v-u}{s_k'} \right) e^{i\xi_k'\cdot(v-u)} \left| w\!\left( \frac{u}{s_k} \right) \right|^p du \right|^{p'} dv \\
&= \tilde{s}_k^d \int_{\mathbb{R}^d} p_{\tilde{V}(t)}(s_k v) \left| \int_{\mathbb{R}^d} w\!\left( \frac{s_k(v-u)}{s_k'} \right) e^{is_k\xi_k'\cdot(v-u)}\, |w(u)|^p\, du \right|^{p'} dv \\
&= \int_{\mathbb{R}^d} \frac{\tilde{s}_k^d\, \lambda(t - s_k v)}{\Lambda_k(t)} \left| \int_{\mathbb{R}^d} w\!\left( \frac{u - v}{c} \right) e^{is_k'\xi_k'\cdot(u-v)/c}\, |w(u)|^p\, du \right|^{p'} dv. \tag{32}
\end{aligned}
\]
The continuity of $\lambda(t)$ implies that
\[
\lim_{k\to\infty} \frac{\tilde{s}_k^d\, \lambda(t - s_k v)}{\Lambda_k(t)} = 1, \qquad \forall\, v \in [0, 1+c]^d.
\]
Furthermore, the assumption $0 < \lambda_{\min} \le \|\lambda\|_\infty < \infty$ implies
\[
\frac{\tilde{s}_k^d\, \lambda(t - s_k v)}{\Lambda_k(t)} \le \frac{\|\lambda\|_\infty}{\lambda_{\min}}, \qquad \forall\, k \ge 1. \tag{33}
\]
Therefore, (30) follows from the dominated convergence theorem and the observation that the inner integral of (32) is zero unless $v \in [0, 1+c]^d$. Equation (31) follows from inserting (33) into (32) and sending $k$ to infinity.

Since
\[
\lim_{k\to\infty} \frac{\Lambda_k(t)}{\tilde{s}_k^d} = \lambda(t),
\]
the independence of $\tilde{V}_1(t)$ and $A_1$, the continuity of $\lambda(t)$, and Lemma 16 imply that taking $k \to \infty$ in (28) yields
\[
\begin{aligned}
\lim_{k\to\infty} \frac{e^{-\Lambda_k(t)}\, \Lambda_k(t)\, \mathbb{E}[|A_1|^q]\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right]}{s_k^{d(p'+1)}}
&= \lim_{k\to\infty} e^{-\Lambda_k(t)}\, \frac{\Lambda_k(t)}{\tilde{s}_k^d}\, \mathbb{E}[|A_1|^q]\, \frac{\tilde{s}_k^d}{s_k^{d(p'+1)}}\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] \\
&= K(p, p', c, w, L)\, \lambda(t)\, \mathbb{E}[|A_1|^q].
\end{aligned}
\]
The following lemma shows that (29) is $O\big( s_k^{d(p'+2)} \big)$ (and converges at a rate independent of $t$), and therefore completes the proof of (4.13) subject to proving Lemma 15.

Lemma 17. For all $\alpha \in \mathbb{R}$ there exists $\delta > 0$, independent of $t$, such that if $s_k < \delta$, then
\[
\sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, n^\alpha\, \mathbb{E}\left[ \left| \sum_{j=1}^{n} |A_j|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_j(t)) \right|^{p'} \right]
\le C(p, p', w, c, \alpha, L)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \|\lambda\|_\infty^2\, \mathbb{E}[|A_1|^q]\, s_k^{d(p'+2)}.
\]

Proof. For any sequence of i.i.d. random variables $Z_1, Z_2, \ldots$, it holds that
\[
\mathbb{E}\left[ \left| \sum_{n=1}^{k} Z_n \right|^p \right] \le k^{p-1}\, \mathbb{E}\left[ \sum_{n=1}^{k} |Z_n|^p \right] = k^p\, \mathbb{E}[|Z_1|^p].
\]
Therefore, by Lemma 14, Lemma 16, and the fact that the $\tilde{V}_j(t)$ and $A_i$ are i.i.d. and independent of each other, we see that if $s_k < \delta$, where $\delta$ is as in (31),
\[
\begin{aligned}
\sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, n^\alpha\, \mathbb{E}\left[ \left| \sum_{j=1}^{n} |A_j|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_j(t)) \right|^{p'} \right]
&\le \sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, n^\alpha\, n^{p'}\, \mathbb{E}\left[ \left| |A_1|^p \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] \\
&= \sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, n^{p'+\alpha}\, \mathbb{E}[|A_1|^q]\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] \\
&= \mathbb{E}[|A_1|^q]\, \mathbb{E}\left[ \left| \big( g_{\gamma_k'} * |g_{\gamma_k}|^p \big)(\tilde{V}_1(t)) \right|^{p'} \right] \sum_{n=2}^{\infty} e^{-\Lambda_k(t)}\, n^{p'+\alpha}\, \frac{(\Lambda_k(t))^n}{n!}
\end{aligned}
\]
\[
\begin{aligned}
&\le C(p, p', w, c, L)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \mathbb{E}[|A_1|^q]\, \frac{s_k^{d(p'+1)}}{\tilde{s}_k^d} \sum_{n=2}^{\infty} e^{-\Lambda_k(t)}\, n^{p'+\alpha}\, \frac{(\Lambda_k(t))^n}{n!} \\
&\le C(p, p', w, c, L, \alpha)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \mathbb{E}[|A_1|^q]\, \frac{s_k^{d(p'+1)}}{\tilde{s}_k^d}\, (\Lambda_k(t))^2 \\
&\le C(p, p', w, c, L, \alpha)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \|\lambda\|_\infty^2\, \mathbb{E}[|A_1|^q]\, s_k^{d(p'+2)},
\end{aligned}
\]
where the last inequality uses the fact that $\Lambda_k(t) \le \tilde{s}_k^d \|\lambda\|_\infty = (1+c)^d s_k^d \|\lambda\|_\infty$.

We will now complete the proof of the theorem by proving Lemma 15.

Proof. [Lemma 15] Since
\[
e_k(t) = \left| (g_{\gamma_k} * Y)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \right|^p - \big( |g_{\gamma_k}|^p * |Y|^p \big)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}},
\]
we see that
\[
\left| \big( g_{\gamma_k'} * e_k \big)(t) \right|
\le \left| \Big( g_{\gamma_k'} * \big| (g_{\gamma_k} * Y)\, \mathbb{1}_{\{N_{s_k}(\cdot)>1\}} \big|^p \Big)(t) \right|
+ \left| \Big( g_{\gamma_k'} * \big( (|g_{\gamma_k}|^p * |Y|^p)\, \mathbb{1}_{\{N_{s_k}(\cdot)>1\}} \big) \Big)(t) \right|.
\]
First turning our attention to the second term, we note that
\[
\begin{aligned}
\left| \Big( g_{\gamma_k'} * \big( (|g_{\gamma_k}|^p * |Y|^p)\, \mathbb{1}_{\{N_{s_k}(\cdot)>1\}} \big) \Big)(t) \right|
&= \left| \int_{[t-s_k', t]^d} w\!\left( \frac{t-u}{s_k'} \right) e^{i\xi_k'\cdot(t-u)} \big( |g_{\gamma_k}|^p * |Y|^p \big)(u)\, \mathbb{1}_{\{N_{s_k}(u)>1\}}\, du \right| \\
&\le \mathbb{1}_{\{N_k(t)>1\}} \int_{[t-s_k', t]^d} w\!\left( \frac{t-u}{s_k'} \right) \big( |g_{\gamma_k}|^p * |Y|^p \big)(u)\, du \\
&= \mathbb{1}_{\{N_k(t)>1\}} \big( g_{s_k', 0} * |g_{\gamma_k}|^p * |Y|^p \big)(t), \tag{34}
\end{aligned}
\]
since $N_{s_k}(u) \le N_{s_k + s_k'}(t) = N_{\tilde{s}_k}(t) = N_k(t)$ for all $u \in [t - s_k', t]^d$. Therefore, conditioning on $N_k(t)$, if $s_k < \delta$,
\[
\begin{aligned}
\mathbb{E}\left[ \left| \Big( g_{\gamma_k'} * \big( (|g_{\gamma_k}|^p * |Y|^p)\, \mathbb{1}_{\{N_{s_k}(\cdot)>1\}} \big) \Big)(t) \right|^{p'} \right]
&\le \mathbb{E}\left[ \left| \mathbb{1}_{\{N_k(t)>1\}} \big( g_{s_k', 0} * |g_{\gamma_k}|^p * |Y|^p \big)(t) \right|^{p'} \right] \\
&= \sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, \mathbb{E}\left[ \left| \sum_{j=1}^{n} |A_j|^p \big( g_{s_k', 0} * |g_{\gamma_k}|^p \big)(\tilde{V}_j(t)) \right|^{p'} \right] \\
&\le C(p, p', w, c, L)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \|\lambda\|_\infty^2\, \mathbb{E}[|A_1|^q]\, s_k^{d(p'+2)}
\end{aligned}
\]
by Lemma 17. Now, turning our attention to the first term, note that
\[
|(g_{\gamma_k} * Y)(t)|^p\, \mathbb{1}_{\{N_{s_k}(t)>1\}} \le N_{s_k}(t)^{p-1} \big( |g_{\gamma_k}|^p * |Y|^p \big)(t)\, \mathbb{1}_{\{N_{s_k}(t)>1\}}.
\]
Therefore, by the same logic as in (34),
\[
\begin{aligned}
\left| \Big( g_{\gamma_k'} * \big| (g_{\gamma_k} * Y)\, \mathbb{1}_{\{N_{s_k}(\cdot)>1\}} \big|^p \Big)(t) \right|
&\le \int_{[t-s_k', t]^d} w\!\left( \frac{t-u}{s_k'} \right) N_{s_k}(u)^{p-1} \big( |g_{\gamma_k}|^p * |Y|^p \big)(u)\, \mathbb{1}_{\{N_{s_k}(u)>1\}}\, du \\
&\le \mathbb{1}_{\{N_k(t)>1\}}\, N_k(t)^{p-1} \int_{[t-s_k', t]^d} w\!\left( \frac{t-u}{s_k'} \right) \big( |g_{\gamma_k}|^p * |Y|^p \big)(u)\, du \\
&\le \mathbb{1}_{\{N_k(t)>1\}}\, N_k(t)^{p-1} \big( g_{s_k', 0} * |g_{\gamma_k}|^p * |Y|^p \big)(t).
\end{aligned}
\]
So again conditioning on $N_k(t)$ and applying Lemma 17 (with $\alpha = p'(p-1)$), we see that if $s_k < \delta$,
\[
\begin{aligned}
\mathbb{E}\left[ \left| \Big( g_{\gamma_k'} * \big| (g_{\gamma_k} * Y)\, \mathbb{1}_{\{N_{s_k}(\cdot)>1\}} \big|^p \Big)(t) \right|^{p'} \right]
&\le \sum_{n=2}^{\infty} e^{-\Lambda_k(t)} \frac{(\Lambda_k(t))^n}{n!}\, \mathbb{E}\left[ \left| n^{p-1} \sum_{j=1}^{n} |A_j|^p \big( g_{s_k', 0} * |g_{\gamma_k}|^p \big)(\tilde{V}_j(t)) \right|^{p'} \right] \\
&\le C(p, p', w, c, L)\, \frac{\|\lambda\|_\infty}{\lambda_{\min}}\, \|\lambda\|_\infty^2\, \mathbb{E}[|A_1|^q]\, s_k^{d(p'+2)}.
\end{aligned}
\]
This completes the proof of (4.13). Line (4.14) follows from integrating with respect to $t$, observing that the error bounds in Lemmas 15 and 16 are independent of $t$, and applying the bounded convergence theorem.

Proofs of Results from Section 4.5

In order to prove Theorems 10 and 11, we will need the following lemma, which shows that the scaling relationship of a self-similar process $X(t)$ induces a similar relationship on stochastic integrals against $dX(t)$.

Lemma 18. If $X$ is a stochastic process that satisfies the scaling relation
\[
X(st) \overset{d}{=} s^\beta X(t) \tag{35}
\]
for some $\beta > 0$, then for any measurable function $f : \mathbb{R} \to \mathbb{R}$,
\[
\int_0^s f(u)\, dX(u) \overset{d}{=} s^\beta \int_0^1 f(su)\, dX(u).
\]

Proof. Let $X = (X(t))_{t\in\mathbb{R}}$ be a stochastic process satisfying (35), and let $P_n = \{0 = t_0^n < t_1^n < \cdots < t_{K_n}^n = 1\}$ be a sequence of partitions of $[0,1]$ such that
\[
\lim_{n\to\infty} \max_k \left| t_k^n - t_{k-1}^n \right| = 0.
\]
Then, by the scaling relation (35),
\[
\int_0^s f(u)\, dX(u)
= \lim_{n\to\infty} \sum_{k=0}^{K_n - 1} f(s t_k^n)\left( X(s t_{k+1}^n) - X(s t_k^n) \right)
\overset{d}{=} s^\beta \lim_{n\to\infty} \sum_{k=0}^{K_n - 1} f(s t_k^n)\left( X(t_{k+1}^n) - X(t_k^n) \right)
= s^\beta \int_0^1 f(su)\, dX(u).
\]
We will now use Lemma 18 to prove Theorems 10 and 11.
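For standard Brownian motion (for which $\beta = 1/2$), Lemma 18 can be sanity-checked by Monte Carlo, since both stochastic integrals are Gaussian and it suffices to compare variances. The following is a minimal sketch, not part of the dissertation; the test function $f$, the scale $s$, and the discretization below are arbitrary illustrative choices.

```python
# Minimal Monte Carlo check of Lemma 18 for Brownian motion (beta = 1/2):
# int_0^s f dX  equals in distribution  s^beta * int_0^1 f(s u) dX.
import numpy as np

rng = np.random.default_rng(1)
n_steps, n_trials, s, beta = 2000, 20000, 0.3, 0.5
f = lambda u: np.cos(5 * u)                      # any bounded test function

# Left side: Riemann-Stieltjes discretization of int_0^s f(u) dX(u).
u_left = np.linspace(0.0, s, n_steps + 1)
dW_left = rng.standard_normal((n_trials, n_steps)) * np.sqrt(np.diff(u_left))
lhs = (f(u_left[:-1]) * dW_left).sum(axis=1)

# Right side: s^beta times the discretization of int_0^1 f(s u) dX(u).
u_right = np.linspace(0.0, 1.0, n_steps + 1)
dW_right = rng.standard_normal((n_trials, n_steps)) * np.sqrt(np.diff(u_right))
rhs = s**beta * (f(s * u_right[:-1]) * dW_right).sum(axis=1)

# Both variances should be close to the exact value int_0^s f(u)^2 du.
print(lhs.var(), rhs.var(), np.trapz(f(u_left) ** 2, u_left))
```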
Proof. [Theorem 10] Let $X = (X(t))_{t\in\mathbb{R}}$ be the $\alpha$-stable process, $p < \alpha \le 2$. Since $X$ has stationary increments, its scattering coefficients do not depend on $t$ and it suffices to analyze
\[
\mathbb{E}\left[ |(g_{\gamma_k} * dX)(0)|^p \right]
= \mathbb{E}\left[ \left| \int_{-s_k}^{0} g_{\gamma_k}(u)\, dX(u) \right|^p \right]
= \mathbb{E}\left[ \left| \int_{0}^{s_k} g_{\gamma_k}(u)\, dX(u) \right|^p \right],
\]
where the second equality uses the fact that the distribution of $X$ does not change if it is run in reverse, i.e.
\[
(X(t))_{t\in\mathbb{R}} \overset{d}{=} (X(-t))_{t\in\mathbb{R}}.
\]
It is well known that $X(t)$ satisfies (35) for $\beta = 1/\alpha$. Therefore, by Lemma 18,
\[
\mathbb{E}\left[ |(g_{\gamma_k} * dX)(0)|^p \right]
= \mathbb{E}\left[ \left| \int_0^{s_k} w\!\left( \frac{u}{s_k} \right) e^{i\xi_k u}\, dX(u) \right|^p \right]
= s_k^{p/\alpha}\, \mathbb{E}\left[ \left| \int_0^1 w(u)\, e^{i\xi_k s_k u}\, dX(u) \right|^p \right].
\]
So,
\[
\frac{\mathbb{E}\left[ |(g_{\gamma_k} * dX)(0)|^p \right]}{s_k^{p/\alpha}} = \mathbb{E}\left[ \left| \int_0^1 w(u)\, e^{i\xi_k s_k u}\, dX(u) \right|^p \right].
\]
The proof will be complete as soon as we show that
\[
\lim_{k\to\infty} \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\, e^{i\xi_k s_k u}\, dX(u) \right|^p \right] \right)^{1/p}
= \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\, e^{iLu}\, dX(u) \right|^p \right] \right)^{1/p}.
\]
By the triangle inequality,
\[
\left| \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\, e^{i\xi_k s_k u}\, dX(u) \right|^p \right] \right)^{1/p} - \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\, e^{iLu}\, dX(u) \right|^p \right] \right)^{1/p} \right|
\le \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^p \right] \right)^{1/p}.
\]
Since $1 \le p < \alpha$, we may choose $p_0$ strictly greater than 1 such that $p \le p_0 < \alpha$, and note that by Jensen's inequality
\[
\left( \mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^p \right] \right)^{1/p}
\le \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^{p_0} \right] \right)^{1/p_0},
\]
and since $X(t)$ is a $p_0$-integrable martingale, the boundedness of martingale transforms [72] (see also [73]) implies
\[
\left( \mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^{p_0} \right] \right)^{1/p_0}
\le C_{p_0} \sup_{0 \le u \le 1} \left| w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) \right| \mathbb{E}\left[ |X_1|^{p_0} \right]
\le C_{p_0}\, |s_k\xi_k - L|\, \|w\|_\infty\, \mathbb{E}\left[ |X_1|^{p_0} \right],
\]
which converges to zero by the continuity of $w$ on $[0,1]$ and the assumption that $s_k\xi_k$ converges to $L$.

Proof. [Theorem 11] Similarly to the proof of Theorem 10, it suffices to show that if $(X(t))_{t\in\mathbb{R}}$ is a fractional Brownian motion with Hurst parameter $H$, then
\[
\lim_{k\to\infty} \left( \mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^p \right] \right)^{1/p} = 0.
\]
However, fractional Brownian motion is not a semi-martingale, so we cannot apply Burkholder's theorem as we did in the proof of Theorem 10. Instead, we use the following result, first established in [74] (see also [75], p. 48), which states that if $x(u)$ is any (deterministic) function with bounded variation, and $y(u)$ is any function which is $\alpha$-Hölder continuous, $0 < \alpha < 1$, then
\[
\int_0^1 x(u)\, dy(u)
\]
is well-defined as the limit of Riemann sums and
\[
\left| \int_0^1 x(u)\, dy(u) - x(0)\left( y(1) - y(0) \right) \right| \le C_\alpha\, \|x\|_{BV}\, \|y\|_\alpha,
\]
where $\|\cdot\|_{BV}$ and $\|\cdot\|_\alpha$ are the bounded variation and $\alpha$-Hölder seminorms respectively. For all $k$, the function
\[
h_k(u) := w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) =: w(u)\, f_k(u)
\]
satisfies $h_k(0) = 0$ and
\[
\|h_k\|_{BV} \le \|w\|_\infty\, \|f_k\|_{BV} + \|w\|_{BV}\, \|f_k\|_\infty.
\]
One can check that the fact that $s_k\xi_k$ converges to $L$ implies that $f_k$ converges to zero in both $L^\infty$ and in the bounded variation seminorm, and therefore that $\|h_k\|_{BV}$ converges to zero. It is well known that fractional Brownian motion with Hurst parameter $H$ admits a continuous modification which is $\alpha$-Hölder continuous for any $\alpha < H$. Therefore,
\[
\mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^p \right] \le C_\alpha^p\, \|h_k\|_{BV}^p\, \mathbb{E}\left[ \|X\|_\alpha^p \right].
\]
Lastly, one can use the Garsia-Rodemich-Rumsey inequality [76] to show that $\mathbb{E}[\|X\|_\alpha^p] < \infty$ for all $1 < p < \infty$. For details we refer the reader to the survey article [77]. Therefore,
\[
\lim_{k\to\infty} \mathbb{E}\left[ \left| \int_0^1 w(u)\left( e^{i\xi_k s_k u} - e^{iLu} \right) dX(u) \right|^p \right] = 0,
\]
as desired.

Remark 8. The assumption that $w$ has bounded variation was used to justify that the stochastic integral against fractional Brownian motion is well defined as the limit of Riemann sums, because of its Hölder continuity and the above mentioned result of [74]. This allowed us to avoid the technical complexities of defining such an integral using either the Malliavin calculus or the Wick product.

Details of Numerical Experiments

Definition of Filters

For all the numerical experiments, we take the window function $w$ to be the smooth bump function
\[
w(t) =
\begin{cases}
\exp\left( -\dfrac{1}{4t - 4t^2} \right), & t \in (0, 1) \\
0, & \text{otherwise.}
\end{cases}
\]
Therefore, for $\gamma = (s, \xi)$, our filters are given by
\[
g_\gamma(t) = e^{i\xi t}\, w(t/s) =
\begin{cases}
e^{i\xi t}\, e^{-s^2/(4ts - 4t^2)}, & t \in (0, s) \\
0, & \text{otherwise.}
\end{cases}
\]
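The following is a minimal sketch (not the dissertation's code) of the window and filter defined above; the function names `window` and `filter_gamma` are illustrative, and only the formulas themselves come from the text.

```python
# Sketch of the smooth bump window w and the filter g_gamma(t) = e^{i xi t} w(t/s).
import numpy as np

def window(t):
    """Smooth bump w(t) = exp(-1/(4t - 4t^2)) on (0, 1), zero elsewhere."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    inside = (t > 0) & (t < 1)
    out[inside] = np.exp(-1.0 / (4.0 * t[inside] - 4.0 * t[inside] ** 2))
    return out

def filter_gamma(t, s, xi):
    """g_gamma(t) for gamma = (s, xi); supported on (0, s)."""
    t = np.asarray(t, dtype=float)
    return np.exp(1j * xi * t) * window(t / s)

# Example: sample a filter at scale s = 0.25 with a frequency drawn from (0, 2*pi).
t = np.linspace(-0.5, 1.0, 1501)
g = filter_gamma(t, s=0.25, xi=2 * np.pi * 0.37)
```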
Frequencies

In all of our experiments, we hold the frequency $\xi$, which we sample uniformly at random from $(0, 2\pi)$, constant while allowing the scale to decrease to zero.

Simulation of Poisson point processes

We use the standard method to generate a realization of a Poisson point process. For a homogeneous Poisson point process with intensity $\lambda$, the time interval between two neighboring jumps follows an exponential distribution:
\[
\Delta_j := t_j - t_{j-1} \sim \operatorname{Exp}(\lambda).
\]
Therefore, inverting the cumulative distribution function, we sample the time interval between two neighboring jumps through
\[
\Delta_j = -\frac{\log U_j}{\lambda},
\]
where the $U_j$ are i.i.d. uniform random variables on $[0,1]$, and we assign the charge $A_j$ to the jump at location $t_j$.

For an inhomogeneous Poisson process with intensity function $\lambda(t)$, we simulate the jump times based on the proposition from [78]. First define the cumulative intensity
\[
\Lambda(t) = \int_0^t \lambda(s)\, ds,
\]
then generate the locations of the jumps $t_j$ by the following algorithm:

Algorithm 2: Simulating an inhomogeneous Poisson point process
  initialize $V = 0$, $t = 0$
  while $t < N$ do
    generate $U \sim \mathcal{U}([0, 1])$
    $V \leftarrow V - \log U$
    $t = \inf\{ v : \Lambda(v) \ge V \}$
    deliver $t$
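The following is a minimal sketch (not the dissertation's code) of the two sampling schemes just described; the function names, the horizon `T_max`, and the example intensities are illustrative assumptions.

```python
# Homogeneous sampling via Exp(lambda) gaps, and inhomogeneous sampling via the
# inversion scheme of Algorithm 2 (t_j = Lambda^{-1}(V_j) for unit-rate arrival times V_j).
import numpy as np

def homogeneous_poisson(lam, T_max, rng):
    """Jump locations on [0, T_max] with i.i.d. Exp(lam) inter-arrival times."""
    times, t = [], 0.0
    while True:
        t += -np.log(rng.uniform()) / lam      # Delta_j = -log(U_j) / lambda
        if t >= T_max:
            return np.array(times)
        times.append(t)

def inhomogeneous_poisson(Lambda_inv, T_max, rng):
    """Jump locations via Algorithm 2, given the inverse cumulative intensity."""
    times, V = [], 0.0
    while True:
        V += -np.log(rng.uniform())            # V <- V - log U
        t = Lambda_inv(V)                      # t = inf{ v : Lambda(v) >= V }
        if t >= T_max:
            return np.array(times)
        times.append(t)

rng = np.random.default_rng(0)
hom = homogeneous_poisson(lam=5.0, T_max=10.0, rng=rng)
# Example intensity lambda(t) = 2t, so Lambda(t) = t^2 and Lambda^{-1}(V) = sqrt(V).
inhom = inhomogeneous_poisson(Lambda_inv=np.sqrt, T_max=10.0, rng=rng)
```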
BIBLIOGRAPHY

[1] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[2] Javier Portilla and Eero Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. International Journal of Computer Vision, 40(1):49–71, 10 2000.

[3] Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in neural information processing systems, pages 262–270, 2015.

[4] Stephane Mallat. A wavelet tour of signal processing, third edition: The sparse way. Academic Press, 3rd edition, 2008.

[5] B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell., 18(8):837–842, August 1996.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[7] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.

[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[10] Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.

[11] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. ArXiv e-prints, October 2017.

[12] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. ArXiv e-prints, May 2016.

[13] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei Efros. Context encoders: Feature learning by inpainting. In Computer Vision and Pattern Recognition (CVPR), 2016.

[14] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.

[15] Leon A. Gatys, Alexander S. Ecker, and M. Bethge. A neural algorithm of artistic style. ArXiv, abs/1508.06576, 2015.

[16] Joakim Andén and Stéphane Mallat. Multiscale scattering for audio classification. In Proceedings of the ISMIR 2011 conference, pages 657–662, 2011.

[17] Joakim Andén and Stéphane Mallat. Deep scattering spectrum. IEEE Transactions on Signal Processing, 62(16):4114–4128, August 2014.

[18] G. Wolf, S. Mallat, and S.A. Shamma. Audio source separation with time-frequency velocities. In 2014 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Reims, France, 2014.

[19] Guy Wolf, Stephane Mallat, and Shihab A. Shamma. Rigid motion model for audio source separation. IEEE Transactions on Signal Processing, 64(7):1822–1831, 2015.

[20] Joakim Andén, Vincent Lostanlen, and Stéphane Mallat. Classification with joint time-frequency scattering. arXiv:1807.08869, 2018.

[21] Joan Bruna and Stéphane Mallat. Classification with scattering operators. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1561–1566, 2011.

[22] Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, August 2013.

[23] Laurent Sifre and Stéphane Mallat. Combined scattering for rotation invariant texture analysis. In Proceedings of the ESANN 2012 conference, 2012.

[24] Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.

[25] Laurent Sifre and Stéphane Mallat. Rigid-motion scattering for texture classification. arXiv:1403.1687, 2014.

[26] Edouard Oyallon and Stéphane Mallat. Deep roto-translation scattering for object classification. In Proceedings of the IEEE CVPR 2015 conference, 2015. arXiv:1412.8659.

[27] Matthew Hirn, Stéphane Mallat, and Nicolas Poilvert. Wavelet scattering regression of quantum chemical energies. Multiscale Modeling and Simulation, 15(2):827–863, 2017. arXiv:1605.04654.

[28] Michael Eickenberg, Georgios Exarchakis, Matthew Hirn, and Stéphane Mallat. Solid harmonic wavelet scattering: Predicting quantum molecular energy from invariant descriptors of 3D electronic densities. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 6540–6549, 2017.

[29] Michael Eickenberg, Georgios Exarchakis, Matthew Hirn, Stéphane Mallat, and Louis Thiry. Solid harmonic wavelet scattering for predictions of molecule properties. Journal of Chemical Physics, 148:241732, 2018.

[30] Xavier Brumwell, Paul Sinz, Kwang Jin Kim, Yue Qi, and Matthew Hirn. Steerable wavelet scattering for 3D atomic systems with application to Li-Si energy prediction. In NeurIPS Workshop on Machine Learning for Molecules and Materials, 2018.

[31] Stéphane Mallat and Iréne Waldspurger. Phase retrieval for the Cauchy wavelet transform. Journal of Fourier Analysis and Applications, 21(6):1251–1309, 2015.

[32] Juri Ranieri, Amina Chebira, Yue M. Lu, and Martin Vetterli. Phase retrieval for sparse signals: Uniqueness conditions. CoRR, abs/1308.3058, 2013.

[33] Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28, pages 262–270, 2015.
[34] Joseph Antognini, Matt Hoffman, and Ron J. Weiss. Synthesizing diverse, high-quality audio textures. arXiv:1806.08002, 2018.

[35] Mikolaj Binkowski, Gautier Marti, and Philippe Donnat. Autoregressive convolutional neural networks for asynchronous time series. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 580–589, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[36] Antoine Brochard, Bartłomiej Błaszczyszyn, Stéphane Mallat, and Sixin Zhang. Statistical learning of geometric characteristics of wireless networks. arXiv:1812.08265, 2018.

[37] Martin Haenggi, Jeffrey G. Andrews, François Baccelli, Olivier Dousse, and Massimo Franceschetti. Stochastic geometry and random graphs for the analysis and design of wireless networks. IEEE Journal on Selected Areas in Communications, 27(7):1029–1046, 2009.

[38] Astrid Genet, Pavel Grabarnik, Olga Sekretenko, and David Pothier. Incorporating the mechanisms underlying inter-tree competition into a random point process model to improve spatial tree pattern analysis in forestry. Ecological Modelling, 288:143–154, 09 2014.

[39] Frederic Paik Schoenberg. A note on the consistent estimation of spatial-temporal point process parameters. Statistica Sinica, 2016.

[40] V. Fromion, E. Leoncini, and P. Robert. Stochastic gene expression in cells: A point process approach. SIAM Journal on Applied Mathematics, 73(1):195–211, 2013.

[41] Joan Bruna, Stéphane Mallat, Emmanuel Bacry, and Jean-Francois Muzy. Intermittent process analysis with scattering moments. Annals of Statistics, 43(1):323–351, 2015.

[42] D. J. Daley and D. Vere-Jones. An introduction to the theory of point processes. Vol. I. Probability and its Applications (New York). Springer-Verlag, New York, second edition, 2003. Elementary theory and methods.

[43] Joan Bruna and Stéphane Mallat. Multiscale sparse microcanonical models. Mathematical Statistics and Learning, 1(3/4):257–315, 01 2018.

[44] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.

[45] Urs Bergmann, Nikolay Jetchev, and Roland Vollgraf. Learning texture manifolds with the periodic spatial GAN. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 469–477, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[46] Nikolay Jetchev, Urs Bergmann, and Roland Vollgraf. Texture synthesis with spatial generative adversarial networks. ArXiv, abs/1611.08207, 2016.

[47] Nikolay Jetchev, Urs Bergmann, and C. Seward. Ganosaic: Mosaic creation with generative texture manifolds. ArXiv, abs/1712.00269, 2017.

[48] Wenqi Xian, Patsorn Sangkloy, Jingwan Lu, Chen Fang, F. Yu, and James Hays. TextureGAN: Controlling deep image synthesis with texture patches. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8456–8465, 2018.

[49] Yang Zhou, Zhen Zhu, Xiang Bai, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Non-stationary texture synthesis by adversarial expansion. ACM Transactions on Graphics (Proc. SIGGRAPH), 37(4), 2018.
[50] Adib Akl, Charles Yaacoub, Marc Donias, Jean-Pierre Da Costa, and Christian Germain. A survey of exemplar-based texture synthesis methods. Computer Vision and Image Understanding, 172:12–24, 2018.

[51] Bruno Galerne, Yann Gousseau, and Jean-Michel Morel. Micro-texture synthesis by phase randomization. Image Processing On Line, 1:213–237, 2011.

[52] B. Galerne, Y. Gousseau, and J. Morel. Random phase textures: Theory and synthesis. IEEE Transactions on Image Processing, 20:257–267, 2011.

[53] Song Chun Zhu, Yingnian Wu, and David Mumford. Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. International Journal of Computer Vision, 27(2):107–126, 1998.

[54] Y. Lu, Song-Chun Zhu, and Y. Wu. Learning FRAME models using CNN filters for knowledge visualization. ArXiv, abs/1509.08379, 2015.

[55] Valentin De Bortoli, Agnès Desolneux, Bruno Galerne, and Arthur Leclaire. Macrocanonical models for texture synthesis. In Jan Lellmann, Martin Burger, and Jan Modersitzki, editors, Scale Space and Variational Methods in Computer Vision, pages 13–24, Cham, 2019. Springer International Publishing.

[56] Valentin De Bortoli, A. Desolneux, Alain Durmus, B. Galerne, and A. Leclaire. Maximum entropy methods for texture synthesis: theory and practice. SIAM J. Math. Data Sci., 3:52–82, 2021.

[57] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.

[58] J. Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.

[59] Valentin De Bortoli, A. Desolneux, B. Galerne, and A. Leclaire. Macrocanonical models for texture synthesis. In SSVM, 2019.

[60] Ivan Ustyuzhaninov, Wieland Brendel, Leon Gatys, and Matthias Bethge. What does it take to generate natural textures? In Proceedings of the International Conference on Learning Representations, 2017.

[61] David J. Heeger and James R. Bergen. Pyramid-based texture analysis/synthesis. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 229–238, 1995.

[62] Thibaud Briand, Jonathan Vacher, Bruno Galerne, and Julien Rabin. The Heeger-Bergen pyramid-based texture synthesis algorithm. Image Processing On Line, 4:279–299, 2014.

[63] Nicolas Gonthier, Yann Gousseau, and Saïd Ladjal. High resolution neural texture synthesis with long range constraints. CoRR, abs/2008.01808, 2020.

[64] Xavier Snelgrove. High-resolution multi-scale neural texture synthesis. In SIGGRAPH ASIA 2017 Technical Briefs, SA '17, New York, NY, USA, 2017. ACM.

[65] Benjamin Balas, Lisa Nakano, and Ruth Rosenholtz. A summary statistic representation in peripheral vision explains visual crowding. Journal of Vision, 9:13.1–18, 11 2009.

[66] Jeremy Freeman, Corey M Ziemba, David J Heeger, Eero P Simoncelli, and J Anthony Movshon. A functional and perceptual signature of the second visual area in primates. Nature Neuroscience, pages 974–981, 07 2013.

[67] Gouki Okazawa, Satohiro Tajima, and Hidehiko Komatsu. Image statistics underlying natural texture selectivity of neurons in macaque V4. Proceedings of the National Academy of Sciences, 112(4):E351–E360, 2015.

[68] R. Brüel Gabrielsson and G. Carlsson. Exposition and interpretation of the topology of neural networks. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pages 1069–1076, 2019.

[69] Sixin Zhang and S. Mallat. Maximum entropy models from phase harmonic covariances. ArXiv, abs/1911.10017, 2019.
[70] Lionel Moisan. Periodic plus smooth image decomposition. Working paper or preprint, May 2009.

[71] Matthew Hirn and Anna Little. Wavelet invariants for statistically robust multi-reference alignment. arXiv:1909.11062, 2019.

[72] Donald L. Burkholder. Sharp inequalities for martingales and stochastic integrals. In Colloque Paul Lévy sur les processus stochastiques, number 157-158 in Astérisque, pages 75–94. Société mathématique de France, 1988.

[73] Rodrigo Bañuelos and Gang Wang. Sharp inequalities for martingales with applications to the Beurling-Ahlfors and Riesz transforms. Duke Math. J., 80(3):575–600, 12 1995.

[74] L. C. Young. An inequality of the Hölder type, connected with Stieltjes integration. Acta Math., 67:251–282, 1936.

[75] Peter K Friz and Martin Hairer. A course on rough paths: with an introduction to regularity structures. Springer, 2014.

[76] A. M. Garsia, E. Rodemich, and H. Rumsey Jr. A real variable lemma and the continuity of paths of some Gaussian processes. Indiana University Mathematics Journal, 20(6):565–578, 1970.

[77] G. Shevchenko. Fractional Brownian motion in a nutshell. In International Journal of Modern Physics Conference Series, volume 36, page 1560002, January 2015.

[78] E. (Erhan) Cinlar. Introduction to stochastic processes. Prentice-Hall, Englewood Cliffs, N.J., 1975.