STATISTICAL SIGNAL PROCESSING APPROACHES FOR MULTI-REFERENCE
          ALIGNMENT AND NEURAL TEXTURE SYNTHESIS
                                       By
                                   Liping Yin
                              A DISSERTATION
                                  Submitted to
                          Michigan State University
                  in partial fulfillment of the requirements
                               for the degree of
                Applied Mathematics—Doctor of Philosophy
      Computational Mathematics, Science, and Engineering—Dual Major
                                      2023


                                            ABSTRACT
Statistical signal processing plays a crucial role in numerous fields of modern technology and
science. Some of the important applications include extracting signals from noisy data, processing
images and videos for tasks like compression and enhancement, and analyzing time-varying data,
such as climate data and asset prices. In this dissertation, we address two problems related to
statistical signal/image processing.
    The first issue involves a generalized version of the multi-reference alignment problem in one
dimension, inspired by modern data applications such as cryo-electron microscopy. The objective
is to recover an unknown signal 𝑓 : R → R from multiple observations that have been translated,
dilated, and corrupted by additive noise. In the presence of large dilations and corruptions,
observational data do not resemble the underlying signal. Although current approaches in the field
have shown empirical success in the absence of dilations, no approach has successfully provided
convergence guarantees for signal inversion while dilating, translating, and corrupting observational
data all at once. Thus, we propose an unbiased estimator for the bispectrum of the unknown signal
which depends on the corrupted samples and knowledge of the dilation distribution. To validate
our proposed estimator, we use it for bispectrum recovery, and invert the recovered bispectrum to
achieve full signal inversion.
    The second problem concerns neural texture synthesis, which is important for understanding
how humans perceive texture. Current approaches require regularization terms or some type of
supervision to capture long range constraints, such as the alignment of bricks, in images. To remedy
this issue, we propose a new set of statistics for examplar-based neural texture synthesis based on
Sliced Wasserstein Loss, and augment our proposed algorithm with a multi-scale synthesis process.
Based on qualitative and quantitative experiments, our results are comparable or better than current
state of the art methods.


Copyright by
LIPING YIN
2023


                                      ACKNOWLEDGEMENTS
I complete the thesis thanks to the guidance of my committee members, and support from family
and friends.
First of all, I would like to express my appreciation to both of my committee chairs Dr. Di Liu and
Dr. Yuying Xie for supporting me during my PhD studies. It is my greatest luck to meet them
in my life. I am very grateful to Dr. Yuying Xie to the point that I wish I could meet him much
earlier. He offered incredible support in my hardest period of PhD life. He gave me the freedom
to create my research plan and at the same time offered help whenever I need. He also help with
every little thing, including checking over my thesis, applying for fellowships, and much more. I
still remember he modified my slides word by word and checked my thesis writing carefully. I am
impressed by his knowledge and intelligence. The most important thing is that, I am also learning
from his sincerity, sense of responsibility, and perseverance. Dr. Di Liu has been my committee
chair for three years. He has been very supportive and helpful during those years. He was always
there when I need help for anything. I appreciate that he could give me suggestions when I was
having trouble and encouraging me when I was under stress. He is also kind and wise. I would not
be able to continue without his support and help.
I also very appreciate my committee members Dr. Yingda Cheng and Dr. Ekaterina Rapinchuk.
There are always reachable and helpful. They went through my work and gave me helpful feedback.
Their suggestions were very useful to me and I made improvement to my research. They asked
helpful questions to ensure my research was sound.
I would also like to thank Dr. Mathew Hirn who has been my advisor during my third and fourth
PhD life. He introduced both of my projects to me and taught me the background and all the useful
math tools towards my projects. He also funded me for three semesters so that I could focus on
courses and research. I would not be able to grow up without his supervision. He is very creative
and able to help me when I got stuck. I also took two courses with him and I learned a lot from his
courses. I also appreciated that he introduced my another important committee member, Dr. Anna
Little, to me. I would like to send my best wishes to him.
                                                   iv


As I mentioned above. Dr. Anna Little played a critical rule in my dissertation. I would not be able
to finish the MRA project, which is the most important part of my dissertation, without her. She
has done pioneering work on dilation MRA and guided me through the project. I am grateful that
she was very patient and answered all my questions, no matter how simple they are. She read my
dissertation and commented on it so that I could make improvements to it, and I’m very thankful
for that already. Regardless of time zone difference she was always there to discuss with me. I
still remember she checked over my proofs, came up with ideas, and debugged my code. There are
too many things to thank her for. I appreciate her time and effort, and I learned a lot from her. I
not only learned math techniques, but I also learned how to think scientifically. She is the great
mathematician that I have learned from.
Lastly, I would like to thank my parents. Thanks for my mother always encouraging me and
supporting me. And thanks for my father who has left the world along long time ago. The good
memories and love will never fade.
                                                v


                              TABLE OF CONTENTS
CHAPTER 1    THESIS OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . .             1
CHAPTER 2    BISPECTRUM INVERSION FOR
             MULTI-REFERENCE ALIGNMENT . . . . . . . . . . . . . . . . . . .                   3
CHAPTER 3    NONLINEAR HEEGER-BERGEN TEXTURE SYNTHESIS . . . . . . 57
CHAPTER 4    LONG RANGE CONSTRAINTS FOR NEURAL TEXTURE
             SYNTHESIS USING SLICED WASSERSTEIN LOSS . . . . . . . . . . 81
CHAPTER 5    CONCLUDING REMARKS . . . . . . . . . . . . . . . . . . . . . . . . 104
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
                                            vi


                                            CHAPTER 1
                                        THESIS OVERVIEW
We provide a brief outline of this thesis, which focuses on two problems in statistical signal
processing: multi-reference alignment and texture synthesis. The second chapter focuses on a
variation of the multi-reference alignment problem, which is motivated by Cryoelectron microscopy
that won the Nobel prize in 2015. The original multi-reference alignment problem focuses on
recovering a function 𝑓 : R3 → R from observations that have been translated, rotated, or
corrupted by additive noise. The challenge in this problem is the following: in the presence of large
corruptions, observational data do not resemble the underlying signal.
    However, the multi-reference alignment problem is not necessarily an accurate representation
of reality. One expects small physical variations in objects, such as macro-molecular structure in
the context of Cryo-EM or deformation of an object in the context of object registration. Thus, it
is of interest to consider a model where each observation has been deformed by a diffeomorphism
𝜏 ∈ 𝐶 2 (R3 ). However, the class of diffeomorphisms is challenging to study. We address a
simplified one-dimensional version of diffeomorphism MRA where we only consider a specific
class of diffeomorphisms, dilations of the form 𝜏(𝑥) = (1 − 𝑐)𝑥, which we will call dilation MRA.
Under mild assumptions, we propose an algorithm that is able to recover the ground truth signal
using Fourier invariants, namely the power spectrum and bispectrum. Previous results provide a
nonlinear estimator for bispectrum recovery, and we recover the bispectrum via a novel nonlinear
estimator. This is a first step towards a model that can address general classes of diffeomorphisms,
which will be left for future work.
    The third chapter considers the problem of texture synthesis, which focuses on generating a new
texture from a reference texture. The generated texture should have the same perceptual qualities
as the reference texture without being a repetition of the old texture. We consider a modification
the Heeger Bergen texture synthesis algoritm, which generates new textures via matching wavelet
coefficients between a reference texture and white noise. To increase the robustness of the algorithm,
we first create a nonlinear version of the wavelet transform by applying a set of nonlinearities to
                                                   1


each wavelet coefficient after performing a wavelet transform. We then extend our formulation
by creating a modification of Mallat’s invariant scattering transform, the Two Layer Scattering
Pyramid. We attempt to histogram match coefficients of the Two Layer Scattering Pyramid to get
good synthesis results. However, we find that our results are not competitive with state of the art
algorithms, which motivates the last main chapter of this dissertation.
    For chapter four, instead of using an understood and mathematically tested representation like
the scattering transform, we switch to using statistics of a deep convolutional neural network,
VGG19. We propose a new set of statistics for texture synthesis using Sliced Wasserstein Loss,
which is an approximate solution to an optimal transport problem. Our results are compared with
current state-of-the-art algorithms in single texture synthesis that have comparable run-time. We
find our results are competitive with the state-of-the-art, but do require hyperparameter tuning or
user-added spatial constraints. Additionally, we propose a multi-scale algorithm to augment our
results, which improves the results of our synthesis dramatically.
                                                  2


                                                    CHAPTER 2
                                        BISPECTRUM INVERSION FOR
                                     MULTI-REFERENCE ALIGNMENT
2.1    Introduction to MRA
    The multi-reference alignment (MRA) problem, is a problem where we want to recover an
unknown signal 𝑓 : R → R from many observations that have been randomly translated and
corrupted by an additive noise. Is there a way to recover 𝑓 from these observations? A formal
description of the assumptions is given in Model 1.
Model 1. Suppose we have 𝑀 independent observations of a function 𝑓 ∈ 𝐿 2 (R) defined by
                                     𝑓 𝑗 (𝑥) = 𝑓 (𝑥 − 𝑡 𝑗 ) + 𝜀 𝑗 (𝑥), 1 ≤ 𝑗 ≤ 𝑀,
where
     • supp( 𝑓 𝑗 ) ⊂ [− 12 , 12 ] for 1 ≤ 𝑗 ≤ 𝑀.
     • {𝑡 𝑗 } 𝑀
              𝑗=1 are independent samples of a random variable 𝑡 ∈ R.
     • {𝜀 𝑗 (𝑥)} 𝑀                                                         1 1
                  𝑗=1 are independent white noise processes on [− 2 , 2 ] with variance 𝜎 .
                                                                                            2
    The MRA problem is a simplified version of problems in cryo-electron microscopy (cryo-EM),
and is similar to problems in other fields such as structural biology [17, 35, 36, 41, 42, 50], image
registration [11, 18], and image processing [54].
    To solve the problem outlined in Model 1, current methods can be grouped into two categories.
The first of which is synchronization methods [2, 3, 4, 5, 9, 12, 13, 38, 46, 53], which try to recover
each translation factor {𝑡 𝑗 } 𝑀    𝑗=1 , align each of the signals using the recovered translation factor,
and average the aligned signals to get a smoother estimate for the ground truth signal. However,
synchronization methods can be problematic when the signal-to-noise ratio (SNR) is small. To
illustrate this point, consider the following picture:
    Looking at Figure 2.1, with small amounts of noise, one can align the signals and then average
the translated signals to get a cleaner version of the ground truth signal, up to a translation. However,
at high noise levels, these peaks aren’t recognizable, which makes this "synchronization" process
unreliable.
                                                             3


Figure 2.1 Left Column: three ground truth signals that have been translated without any additive
noise. Middle Column: Adding Gaussian noise with mean zero and 𝜎 2 = 0.2 to each of the
signals in the left column. Right Column: Adding Gaussian noise with mean zero and 𝜎 2 = 1.2
to each of the signals in the left column. Reference: [6].
    The second approach involves estimating the signal directly using ideas such as the method of
moments [22, 28, 44]; a subclass of the method of moments technique, is the method of invariants
technique [2, 7, 15, 26, 27]. Additionally, expectation maximization (EM) algorithms [1, 16] have
also shown success for signal recovery.
    One problem with Model 1 is that it is not an accurate representation of real world applications.
In particular, for three dimensional applications such as cryo-EM, molecules are randomly rotated
and one only has access to 2D projections of the molecule like below:
                                                  4


Figure 2.2 An example of the Cryo-EM problem. The 3D object at the top of the figure is the
ground truth object. One has 2D slices of the object that are noisy, rotated, and possibly translated.
Reference: [47].
    However, the model does not consider of physical variations in the objects, such as macro-
molecular structure which “flop" around. One can think of these deformations as a diffeomorphism
acting on the molecule before retrieving the slices. In other words, if 𝑔 is a function extracting
the slices and 𝑥 is a molecule, we retrieve 𝑔(𝜉 (𝑥)), where 𝜉 ∈ 𝐶 2 (R𝑛 ) has a bĳective, continuous
hessian. The problem of adding random diffeomorphisms to Model 1 is much harder though. This is
because the class of all diffeomorphisms encompasses a variety of different possible deformations.
To simplify the problem, we consider a simple subset of all possible diffeomorphisms, the set of
dilations. This leads to the formulation of Model 2 below:
Model 2. Suppose we have 𝑀 independent observations of a function 𝑓 ∈ 𝐿 2 (R) defined by
                 𝑦 𝑗 (𝑥) = 𝑓 ((1 − 𝜏 𝑗 ) −1 (𝑥 − 𝑡 𝑗 )) + 𝜀 𝑗 (𝑥) := 𝑓 𝑗 (𝑥) + 𝜀 𝑗 (𝑥), 1≤ 𝑗 ≤𝑀
Furthermore, assume that
    • supp( 𝑓 𝑗 ) ⊂ [− 12 , 12 ] for 1 ≤ 𝑗 ≤ 𝑀.
    • {𝑡 𝑗 } 𝑀
             𝑗=1 are independent samples of a random variable 𝑡 ∈ R.
    • {𝜏 𝑗 } 𝑀
             𝑗=1 are independent samples of a uniformly distributed random variable 𝜏 ∈ R satisfying
                                            1
       E[𝜏] = 0 and Var(𝜏) = 𝜂2 ≤           12 .
                                                            5


     • {𝜀 𝑗 (𝑥)} 𝑀                                                  1 1                  2
                  𝑗=1 are independent white noise processes on [− 2 , 2 ] with variance 𝜎 .
     Dilations are one of the simplest class of diffeomorphisms other than translations, but Model
2 is significantly more difficult to solve compared to Model 1, even at low noise levels. Consider
Figure 2.3. Intuitively, as we see in Figure 2.3, dilations add another degree of difficulty to the
Figure 2.3 Left Column: three ground truth signals that have been translated without any additive
noise. Middle Left Column: Dilating each of the signals. Middle Right Column: Adding
Gaussian noise with mean zero and 𝜎 2 = 0.5 to each of the signals in the left column. Right
Column: Adding Gaussian noise with mean zero and 𝜎 2 = 2 to each of the signals in the left
column. Reference: [26].
problem. Regarding alignment, the main challenge arises from the additive noise.
     Regarding the method of invariants, it utilizes translation invariant Fourier features, such as the
power spectrum and bispectrum. Before we begin, define 𝐿 𝑞 (R) as the set of functions 𝑓 such that
         ∫
∥ 𝑓 ∥ 𝑞 = R | 𝑓 | 𝑞 𝑑𝑥 < ∞. The Fourier Transform of 𝑓 ∈ 𝐿 1 (R) is
      𝑞
                                                 ∫
                                        𝑓ˆ(𝜔) =     𝑓 (𝑡)𝑒 −𝑖𝑤𝑡 𝑑𝑡,                                (2.1)
                                                  R
                                                    6


the power spectrum is
                                                    (𝑃 𝑓 )(𝜔) = | 𝑓ˆ(𝜔)| 2 ,                           (2.2)
and the bispectrum is
                                      𝐵 𝑓 (𝜔1 , 𝜔2 ) = 𝑓ˆ(𝜔1 ) 𝑓ˆ∗ (𝜔2 ) 𝑓ˆ(𝜔2 − 𝜔1 ).                 (2.3)
However, in the case of Model 2, using an approach with Fourier invariants results in significant
difficulties because the Fourier Transform modulus is unstable to small dilations. Consider the
operator 𝐿 𝑐 𝑓 (𝑡) = 𝑓 ((1 − 𝑐)𝑡). For small 𝑐, this is a minor rescaling of the function 𝑓 . Let
𝜉 (𝑡) = (1 − 𝑐)𝑡; it is clear that 𝜉 is a diffeomorphism, specifically a dilation. Then 𝐿 𝑐 𝑓 (𝑡) = 𝑓 (𝜉 (𝑡))
with ∥𝜉 ′ ∥ ∞ = 1 − 𝑐. Choose 𝑓 (𝑡) = 𝑒𝑖𝛼𝑡 𝜃 (𝑡), where 𝜃 is smooth with fast decay. We see that
𝐿 𝑐 𝑓 (𝑡) = 𝑒𝑖𝛼(1−𝑐)𝑡 𝜃 ((1 − 𝑐)𝑡), and it follows that
                                                     ∫
                                       𝐿d𝑐 𝑓 (𝜔) =        𝜃 ((1 − 𝑐)𝑡)𝑒𝑖𝛼(1−𝑐)𝑡−𝑖𝜔𝑡 𝑑𝑡
                                                        R
                                                              ∫
                                                         1                     𝜔
                                                  =               𝜃 (𝑡)𝑒𝑖𝛼𝑡−𝑖 1−𝑐 𝑡 𝑑𝑡
                                                      1−𝑐 R
                                                         1 ˆ
                                                  =          𝜃 ((1 − 𝑐) −1 𝜔 − 𝛼).
                                                      1−𝑐
Looking at the norm now, for 𝑐 < 1/2, we see that from an argument identical to [34],
                                              ∥| 𝐿d          ˆ 2                   2
                                                  𝑐 𝑓 | − | 𝑓 |∥ 2 ≈ |𝑐𝛼| · ∥ 𝑓 ∥ 2 .                  (2.4)
Since 𝛼 is arbitrary, we see why Fourier invariants are unstable to small dilations.
     While we will address Model 2 later in this thesis, we first address a noiseless model, Model 3,
given by
Model 3. Suppose we have 𝑀 independent observations of a function 𝑓 ∈ 𝐿 2 (R) defined by
                                   𝑓 𝑗 (𝑥) = 𝑓 ((1 − 𝜏 𝑗 ) −1 (𝑥 − 𝑡 𝑗 )),      1≤ 𝑗 ≤𝑀
Furthermore, assume that
     • supp( 𝑓 𝑗 ) ⊂ [− 21 , 12 ] for 1 ≤ 𝑗 ≤ 𝑀.
     • {𝑡 𝑗 } 𝑀
              𝑗=1 are independent samples of a random variable 𝑡 ∈ R.
                                                                 7


    • {𝜏 𝑗 } 𝑀
             𝑗=1 are independent samples of a uniformly distributed random variable 𝜏 ∈ R with
                                       1
       E[𝜏] = 0 and Var(𝜏) = 𝜂2 ≤      12 .
    Note that the lack of additive noise means one can estimate 𝐶 𝑓 = ∥ 𝑓 ∥ 2 and then dilate each
observation to have norm 𝐶 𝑓 , which makes this problem trivial. However we explore a solution to
Model 3 via Fourier invariants to provide intuition on how to approach Model 2.
2.2   Notation
    Let 𝑔 = 𝐵 𝑓 . For the second and third models, let
                                   𝑔𝜂 (𝜔1 , 𝜔2 ) = E𝜏 [(𝐵 𝑓 𝑗 )(𝜔1 , 𝜔2 )].
We will also define the following constants and operations used in the rest of the paper:
                                        √
                                  (1 − 3𝜂)               √                  1
                            𝐶0 =        √      , 𝐶1 = 2 3𝜂 , 𝐶2 =           √ .                  (2.5)
                                  (1 + 3𝜂)                             1 + 3𝜂
Additionally, define the dilation operator 𝐿 𝐶 𝑔(𝜔1 , 𝜔2 ) := 𝐶 4 𝑔(𝐶𝜔1 , 𝐶𝜔2 ). Note that in polar
coordinates (𝑟, 𝜃), we have 𝐿 𝐶 𝑔(𝑟, 𝜃) = 𝐶 4 𝑔(𝐶𝑟, 𝜃).
2.3   Power Spectrum Recovery for Model 3
    Specific work from [26] has produced results for recovery of the power spectrum of a hidden
signal under certain assumptions in Models 2 and 3. We summarize their results below for Model
3 and explain how to adjust the results for Model 2.
    Define
                                                                    
                                        𝑝 𝜂 (𝜔) := E𝜏 (𝑃 𝑓 𝑗 )(𝜔)                                (2.6)
We also let (𝑆𝐶 𝑔)(𝜔) = 𝐶 3 𝑔(𝐶𝜔) be a rescaled dilation operator. In the case of the infinite sample
limit, we have the following theorem:
Theorem 1. Assume 𝑃 𝑓 ∈ C0 (R) and 𝑝 𝜂 as defined in (2.6). Then for 𝜔 ≠ 0:
                          (𝑃 𝑓 )(𝜔) = (𝐼 − 𝑆𝐶0 ) −1𝐶1 𝐿 𝐶2 (3𝑝 𝜂 (𝜔) + 𝜔𝑝′𝜂 (𝜔)) ,
where 𝐶0 , 𝐶1 , 𝐶2 are as defined in (2.5).
                                                     8


     One can now use this to derive an approximation of the infinite sample estimator. In practice,
one can estimate 𝑝 𝜂 by
                                                           𝑀
                                                        1 ∑︁
                                           𝑝e𝜂 (𝜔) :=         (𝑃 𝑓 𝑗 )(𝜔) .                      (2.7)
                                                       𝑀 𝑗=1
Additionally, a natural approximation for the estimator in Theorem 1 is
                            ( 𝑃f𝑓 )(𝜔) := (𝐼 − 𝑆𝐶0 ) −1𝐶1 𝑆𝐶2 (3e  𝑝 𝜂 (𝜔) + 𝜔 𝑝e′𝜂 (𝜔)) ,       (2.8)
     For the next theorem, define the quantity:
                                    (𝑃 𝑓 ) (𝑘) (𝜔) :=     max     |(𝑃 𝑓 ) (𝑘) (𝜉)|               (2.9)
                                                       𝜉∈[𝜔/2,2𝜔]
Using (2.8), the following convergence theorem holds:
Theorem 2. Assume Model 3, the estimator ( 𝑃f𝑓 )(𝜔) defined in (2.8), 𝑃 𝑓 ∈ C3 (R), and that
𝜔 𝑘 (𝑃 𝑓 ) (𝑘) (𝜔) ∈ 𝐿 2 (R) for 𝑘 = 2, 3. Then:
                            h                i    𝜂2
                          E ∥𝑃 𝑓 − 𝑃f𝑓 ∥ 22 ≲          ∥(𝑃 𝑓 )(𝜔)∥22 + ∥𝜔(𝑃 𝑓 ) ′ (𝜔)∥ 22
                                                  𝑀
                                                       + ∥𝜔2 (𝑃 𝑓 ) ′′ (𝜔)∥ 22 + 𝑟 ,
                                                                               
where 𝑟 is a higher-order term satisfying
                                    𝜂4  2          ′′                     ′′′
                                                                                     
                               𝑟≤       ∥𝜔 (𝑃 𝑓 ) (𝜔)∥ 22 + ∥𝜔3 (𝑃 𝑓 ) (𝜔)∥ 22 .
                                    𝑀
                                                                                             2
                                                                                             𝜂
     That is, in expectation, the convergence rate of (2.8) to the true power spectrum is 𝑂  𝑀   .
2.4   Power Spectrum Recovery for Model 2
     Using these results from Model 3, we design estimators for Model 2. The main difficulty in
model 2 is because of the additive noise. First of all, the mean-squared error (MSE) is only finite
on a bounded interval due to the additive noise. Thus we consider a bounded frequency domain Ω,
                                                               h                       i
and consider the MSE of an estimator 𝑃f𝑓 over Ω: E ∥𝑃 𝑓 − 𝑃f𝑓 ∥ L2 (Ω) ∥ 2 .
     Another difficultly is that one cannot use 𝑝e𝜂 , the average of the observations without noise.
Instead, one has access to
                                                𝑀
                                            1 ∑︁
                                                   𝑃𝑦 𝑗 − 𝜎 2 = 𝑝e𝜂 + 𝑝e𝜎
                                            𝑀 𝑗=1
                                                          9


where
                                                 𝑀
                                            1 ∑︁ b ∗ b∗
                                   𝑝e𝜎 :=             𝑓𝑗b        𝜖 𝑗 + 𝑃𝜖 𝑗 − 𝜎 2 .
                                                         𝜖𝑗 + 𝑓𝑗 b
                                           𝑀 𝑗=1
The particular problem is the term 𝑝e𝜎 , which is not continuous due to the additive noise. To remedy
this issue, we smooth the noisy power spectra via a low pass filtering:
                                                  ( 𝑝e𝜂 + 𝑝e𝜎 ) ∗ 𝜙 𝐿                                       (2.10)
                                                                          𝜔2
                                                                     1  −
using a Gaussian filter with width 𝐿, 𝜙 𝐿 (𝜔) = (2𝜋𝐿 2 ) − 2 𝑒            2𝐿 2 . Accordingly, we define a modified
version of (2.7):
                                                                                                ′
      ( 𝑃f𝑓 ) (𝜔) := (𝐼 − 𝑆𝐶0 ) −1𝐶1 𝑆𝐶2 3( 𝑝e𝜂 + 𝑝e𝜎 ) ∗ 𝜙 𝐿 (𝜔) + 𝜔 ( 𝑝e𝜂 + 𝑝e𝜎 ) ∗ 𝜙 𝐿 (𝜔) .
                                                                                                    
                                                                                                            (2.11)
Similar to before, (2.11) is an unbiased estimator of 𝑃 𝑓 when |𝑀 | → ∞ and 𝐿 → 0. Additionally,
we have the following result:
Theorem 3. Assume Model 2, the estimator ( 𝑃f𝑓 )(𝜔) defined in (2.11), 𝑃 𝑓 ∈ C3 (R), and that
𝜔 𝑘 (𝑃 𝑓 ) (𝑘) (𝜔) ∈ 𝐿 2 (R) for 𝑘 = 2, 3. Then
                                                                 2
                                                                                 𝜎2 ∨ 𝜎4
                              h                    i                                        
                                            2                     𝜂        4
                           E ∥𝑃 𝑓 − 𝑃 𝑓 ∥ 𝐿 2 (Ω) ≲ 𝐶 𝑓 ,Ω
                                       f                               +𝐿 +                   .
                                                                  𝑀                𝐿2 𝑀
                                                   1/6
                                                𝜎4
     When 𝜎 ≥ 1, we may choose 𝐿 =              𝑀        . Then we have
                                                                "                      2/3 #
                                                                  𝜂2               𝜎4
                              h                     i                            
                           E ∥𝑃 𝑓 − 𝑃f𝑓 ∥ 2𝐿 2 (Ω) ≲ 𝐶 𝑓 ,Ω            + 𝐿4 +                 .
                                                                  𝑀                𝑀
We will generalize these results for bispectrum recovery in the next sections. In other words, for
proper choice of 𝐿 and 𝑀, the expected MSE converges to 0 as 𝑀 → ∞.
2.5    Bispectrum Recovery for Model 3
     Following [26], we propose a similar process for creating an estimator for the bispectrum. We
will first consider the case where we have an infinite number of samples and find an unbiased
estimator.
                                                           10


Theorem 4. Suppose we have an infinite number of samples and assume that 𝐵 𝑓 ∈ 𝐶 1 (R2 ) An
unbiased estimator for the bispectrum is given by
                                                                                               
                                                     −1                              𝜕𝑔𝜂
                        𝐵 𝑓 (𝑟, 𝜃) = (𝐼 − 𝐿 𝐶0 ) 𝐶1 𝐿 𝐶2           4𝑔𝜂 (𝑟, 𝜃) + 𝑟        (𝑟, 𝜃) .
                                                                                     𝜕𝑟
Proof. Recall that the Bispectrum is given by
                                  𝐵 𝑓 (𝜔1 , 𝜔2 ) = 𝑓ˆ(𝜔1 ) 𝑓ˆ∗ (𝜔2 ) 𝑓ˆ(𝜔2 − 𝜔1 ).
The Fourier Transform of each 𝑓 𝑗 is 𝑒 −𝑖𝜔𝑡 𝑗 (1 − 𝜏 𝑗 ) 𝑓ˆ((1 − 𝜏 𝑗 )𝜔). Making a substitution,
          𝐵 𝑓 𝑗 (𝜔1 , 𝜔2 ) = (1 − 𝜏 𝑗 ) 3 𝑒 −𝑖𝜔1 𝑡 𝑗 𝑓ˆ((1 − 𝜏 𝑗 )𝜔1 )
                           · [𝑒 −𝑖𝜔2 𝑡 𝑗 𝑓ˆ((1 − 𝜏 𝑗 )𝜔1 )] ∗ · 𝑒 −𝑖(𝜔2 −𝜔1 )𝑡 𝑗 𝑓ˆ((1 − 𝜏 𝑗 )(𝜔2 − 𝜔1 ))
                           = (1 − 𝜏 𝑗 ) 3 𝑓ˆ((1 − 𝜏 𝑗 )𝜔1 ) 𝑓ˆ∗ ((1 − 𝜏 𝑗 )𝜔1 ) 𝑓ˆ((1 − 𝜏 𝑗 )(𝜔2 − 𝜔1 )).
So
                       (𝐵 𝑓 𝑗 )(𝜔1 , 𝜔2 ) = (1 − 𝜏 𝑗 ) 3 (𝐵 𝑓 )((1 − 𝜏 𝑗 )𝜔1 , (1 − 𝜏 𝑗 )𝜔2 ).
   Since 𝜏 has uniform distribution with variance 𝜂2 , the pdf of 𝜏 has form 𝑝 𝜏 =                      √1 𝜒 √ √ .
                                                                                                       2 3𝜂 [− 3𝜂, 3𝜂]
Thus:
                      𝑔𝜂 (𝜔1 , 𝜔2 ) = E𝜏 [𝐵 𝑓 𝑗 (𝜔1 , 𝜔2 )]
                                      = E𝜏 [(1 − 𝜏) 3 𝑔((1 − 𝜏)𝑤 1 , (1 − 𝜏)𝑤 2 )]
                                          ∫
                                      =       (1 − 𝜏) 3 𝑔((1 − 𝜏)𝑤 1 , (1 − 𝜏)𝑤 2 ) 𝑝 𝜏 (𝜏) 𝑑𝜏
Now we convert to polar coordinates (𝑟, 𝜃) and let 𝜏˜ = (1 − 𝜏)𝑟:
                                                     ∫ √3𝜂
                                               1                       3
                             𝑔𝜂 (𝑟, 𝜃) = √              √ (1 − 𝜏) 𝑔((1 − 𝜏)𝑟, 𝜃)𝑑𝜏
                                             2 3𝜂 − 3𝜂
                                                        ∫ (1+√3𝜂)𝑟
                                                1
                                          = √                √        𝜏˜ 3 𝑔( 𝜏,
                                                                              ˜ 𝜃) 𝑑 𝜏.
                                                                                      ˜
                                                      4
                                             2 3𝜂𝑟 (1− 3𝜂)𝑟
   Let 𝐻 be the antiderivative in the variable 𝑤 for the function ℎ(𝑤, 𝜃) = 𝑤 3 𝑔(𝑤, 𝜃). In other
words,
                                        𝜕𝐻
                                             (𝑤, 𝜃) = ℎ(𝑤, 𝜃) = 𝑤 3 𝑔(𝑤, 𝜃).
                                        𝜕𝑤
                                                           11


By Fundamental Theorem of Calculus,
                       √                               √                          √
                      2 3𝜂𝑟 4 𝑔𝜂 (𝑟, 𝜃) = 𝐻 ((1 + 3 𝜂)𝑟, 𝜃) − 𝐻 ((1 − 3 𝜂)𝑟, 𝜃).
Now take derivative with respect to 𝑟 and divide both sides by 𝑟 3 to get
   √
                                   
                        𝜕𝑔𝜂                     √               √                       √                 √
  2 3𝜂 4𝑔𝜂 (𝑟, 𝜃) + 𝑟        (𝑟, 𝜃) = (1 + 3 𝜂) 4 𝑔((1 + 3 𝜂)𝑟, 𝜃) − (1 − 3 𝜂) 4 𝑔((1 − 3 𝜂)𝑟, 𝜃).
                         𝜕𝑟
We now apply the dilation operation 𝐿 𝐶2 to both sides, which yields
                                                                          √                √
                                                                    1−3 𝜂 4 1−3 𝜂
                                                                                                    
                                      𝜕𝑔𝜂
           𝐶1 𝐿 𝐶2    4𝑔𝜂 (𝑟, 𝜃) + 𝑟      (𝑟, 𝜃) = 𝑔(𝑟, 𝜃) −              √       𝑔          √ 𝑟, 𝜃 .
                                      𝜕𝑟                            1+3 𝜂              1+3 𝜂
We can also rewrite the right side in terms of 𝐼 and 𝐿 𝐶0 to get
                                                             
                                                   𝜕𝑔𝜂
                         𝐶1 𝐿 𝐶2   4𝑔𝜂 (𝑟, 𝜃) + 𝑟      (𝑟, 𝜃) = (𝐼 − 𝐿 𝐶0 )𝑔(𝑟, 𝜃).
                                                   𝜕𝑟
Thus, an unbiased estimator is
                                                                                         
                                               −1                           𝜕𝑔𝜂
                        𝑔(𝑟, 𝜃) = (𝐼 − 𝐿 𝐶0 ) 𝐶1 𝐿 𝐶2      4𝑔𝜂 (𝑟, 𝜃) + 𝑟         (𝑟, 𝜃) .
                                                                             𝜕𝑟
                                                                                                              □
    However, we are only given a finite number of samples and do not have access to the estimator
above in actual applications. For Model 3, we can approximate 𝑔𝜂 by taking an average of 𝑀
samples using
                                                        𝑀
                                                     1 ∑︁
                                  𝑔𝜂 (𝜔1 , 𝜔2 ) :=
                                  e                         (𝐵 𝑓 𝑗 )(𝜔1 , 𝜔2 ).
                                                     𝑀 𝑗=1
Based on Proposition 4, a good choice for the estimator is
                                                                                           
                                                  −1                           𝜕 𝑔˜ 𝜂
                     (𝐵f𝑓 )(𝑟, 𝜃) := (𝐼 − 𝐿 𝐶0 ) 𝐶1 𝐿 𝐶2      4𝑔˜ 𝜂 (𝑟, 𝜃) + 𝑟        (𝑟, 𝜃) .
                                                                                𝜕𝑟
To show the estimator 𝐵  f𝑓 has a small error, we will need the following lemma.
Lemma 1. Assume that 𝐵 𝑓 ∈ 𝐶 1 (R). Then
                                                                                                          2
                                                                          𝜕𝑔𝜂               𝜕 𝑔˜ 𝜂
         ∥𝐵 𝑓 (𝑟, 𝜃) − 𝐵f𝑓 (𝑟, 𝜃)∥ 2 ≲ ∥𝑔𝜂 (𝑟, 𝜃) − 𝑔˜ 𝜂 (𝑟, 𝜃)∥ 2 + 𝑟         (𝑟, 𝜃) − 𝑟          (𝑟, 𝜃)   .
                                   2                               2
                                                                           𝜕𝑟                𝜕𝑟           2
                                                      12


Proof. We start with
                                                                                                                      
                                          −1                                             𝜕𝑔𝜂              𝜕 𝑔˜ 𝜂
 𝐵 𝑓 (𝑟, 𝜃) − 𝐵f𝑓 (𝑟, 𝜃) = (𝐼 − 𝐿 𝐶0 ) 𝐶1 𝐿 𝐶2         4(𝑔𝜂 (𝑟, 𝜃) − 𝑔˜ 𝜂 (𝑟, 𝜃)) + 𝑟         (𝑟, 𝜃) −           (𝑟, 𝜃)    .
                                                                                         𝜕𝑟                𝜕𝑟
By using the triangle inequality,
         ∥𝐵 𝑓 (𝑟, 𝜃) − 𝐵f𝑓 (𝑟, 𝜃)∥ 2 ≤ 32𝐶 2 ∥(𝐼 − 𝐿 𝐶0 ) −1 ∥ 2 ∥𝐿 𝐶2 ∥ 2 ∥𝑔𝜂 (𝑟, 𝜃) − 𝑔˜ 𝜂 (𝑟, 𝜃)∥ 2
                                    2           1                                                         2
                                                                                                                  2
                                                                                𝜕𝑔𝜂              𝜕 𝑔˜ 𝜂
                                        + 2𝐶12 ∥(𝐼 − 𝐿 𝐶0 ) −1 ∥ 2 ∥𝐿 𝐶2 ∥ 2 𝑟      (𝑟, 𝜃) − 𝑟          (𝑟, 𝜃)      .
                                                                                 𝜕𝑟                𝜕𝑟             2
To compute the spectral norm of 𝐿 𝐶 , we revert back to Cartesian coordinates:
                                                 ∫ ∫
                          ∥𝐿 𝐶𝑚 𝑔𝜂 ∥ 22 =𝐶   8𝑚
                                                         |𝑔𝜂 (𝐶 𝑚 𝜔1 , 𝐶 𝑚 𝜔2 )| 2 𝑑𝜔1 𝑑𝜔2 .
                                                  R    R
Let 𝑢 = 𝐶 𝑚 𝜔1 and 𝑣 = 𝐶 𝑚 𝜔2 :
                                                ∫ ∫
                         ∥𝐿 𝐶𝑚 𝑔𝜂 ∥ 22  =𝐶   6𝑚
                                                         |𝑔𝜂 (𝑢, 𝑣)| 2 𝑑𝑢𝑑𝑣 = 𝐶 6𝑚 ∥𝑔𝜂 ∥ 22 .
                                                  R   R
Thus ∥𝐿 𝐶2 ∥ 2 = 𝐶26𝑚 . Our result also implies that
                                                     ∞                ∞
                                                    ∑︁              ∑︁
                                                                            3𝑗       1
                            ∥(𝐼 − 𝐿 𝐶0 ) −1 ∥ ≤
                                                             𝑗
                                                           𝐿 𝐶0 ≤        𝐶0 =             .
                                                    𝑗=0              𝑗=0
                                                                                  1 − 𝐶03
Putting everything together,
                                                     16 · 12𝜂2
           ∥𝐵 𝑓 (𝑟, 𝜃) − 𝐵f𝑓 (𝑟, 𝜃)∥ 2 ≤
                                        2                       √         ∥𝑔𝜂 (𝑟, 𝜃) − 𝑔˜ 𝜂 (𝑟, 𝜃)∥ 22
                                              (1 − 𝐶06 ) 2 (1 + 3𝜂) 12
                                                                                                             2
                                                        12𝜂2                 𝜕𝑔𝜂              𝜕 𝑔˜ 𝜂
                                          +                    √           𝑟      (𝑟, 𝜃) − 𝑟         (𝑟, 𝜃)     .
                                             (1 − 𝐶06 ) 2 (1 + 3𝜂) 12          𝜕𝑟             𝜕𝑟             2
                                                            13


                                         12𝜂2
Now it suffices to prove that                  √       is bounded by constant:
                                 (1−𝐶06 ) 2 (1+ 3𝜂) 12
                                12𝜂2                           12𝜂2
                                        √           ≤ 
                     (1 − 𝐶06 ) 2 (1 + 3𝜂) 12                  √ 6 2
                                                                1− 3𝜂
                                                         1−        √
                                                                1+ 3𝜂
                                                                    12𝜂2
                                                    = 
                                                            √ 6         √ 6 2
                                                           1+ 3𝜂
                                                             √        − 1−√3𝜂
                                                           1+ 3𝜂           1+ 3𝜂
                                                                               √
                                                                 12𝜂2 (1 + 3𝜂) 12
                                                    =        √               √               √
                                                        (108 3𝜂5 + 120 3𝜂3 + 12 3𝜂) 2
                                                                               √
                                                                  12𝜂2 (1 + 3𝜂) 12
                                                    =           √               √              √
                                                       𝜂2 (108 3𝜂4 + 120 3𝜂2 + 12 3) 2
                                                        𝜂2       √
                                                    ≲ 2 (1 + 3𝜂) 12
                                                        𝜂
                                                    = 𝑂 (1).
                                                                                                              □
     This lemma implies that we only have to bound
                                      E ∥𝑔𝜂 (𝑟, 𝜃) − 𝑔˜ 𝜂 (𝑟, 𝜃)∥ 22 ,
                                                                         
                                           "                                     #
                                                                               2
                                               𝜕𝑔𝜂               𝜕 𝑔˜ 𝜂
                                      E 𝑟            (𝑟, 𝜃) − 𝑟         (𝑟, 𝜃)
                                                𝜕𝑟                𝜕𝑟           2
with an 𝑂 (1/𝑀) bound on 𝑀 for 𝐵           f𝑓 (𝑟, 𝜃) to converge to 𝐵 𝑓 (𝑟, 𝜃) with an 𝑂 (1/𝑀) bound. We
have the following result now.
Theorem 5. Suppose that 𝐵 𝑓 ∈ 𝐶 3 (R2 ) and also assume that 𝑟 2 max𝛼∈[𝑟/2,2𝑟] |𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)| 2 ∈
𝐿 2 (R2 , 𝑑𝑟 × 𝑑𝜃) and 𝑟 3 max𝛾∈[𝑟/2,2𝑟] |𝜕𝛾𝛾𝛾 (𝐵 𝑓 )(𝛾, 𝜃)| 2 ∈ 𝐿 2 (R2 , 𝑑𝑟 × 𝑑𝜃). Then the following
bound holds:
         h
                     2
                       i     𝜂2                     2                           2         2        2
                                                                                                      
      E ∥𝐵 𝑓 − 𝐵 𝑓 ∥ 2 ≲
                 f                ∥(𝐵 𝑓 )(𝑟, 𝜃)∥ 2 + 2∥𝑟𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)∥ 2 + ∥𝑟 𝜕𝑟𝑟 (𝐵 𝑓 )(𝑟, 𝜃)∥ 2
                             𝑀                                                                            !
                                                                         2                              2
                            𝜂4
                         +         𝑟 2 max 𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃) + 𝑟 3 max 𝜕𝛾𝛾𝛾 (𝐵 𝑓 )(𝛾, 𝜃)                     .
                            𝑀         𝛼∈[𝑟/2,2𝑟]                         2        𝛾∈[𝑟/2,2𝑟]            2
with
                                                                                                
                                                      −1                            𝜕 𝑔˜ 𝜂
                      ( 𝐵 𝑓 )(𝑟, 𝜃) = (𝐼 − 𝐿 𝐶0 ) 𝐶1 𝐿 𝐶2 4𝑔˜ 𝜂 (𝑟, 𝜃) + 𝑟
                        f                                                                  (𝑟, 𝜃) .
                                                                                    𝜕𝑟
                                                            14


Proof. First, assume that the 𝐵 𝑓 : R2 → R. The argument will be generalized to the complex case
after. Notice that
                                                                                                 2
                                                                  𝑀
                                                              1 ∑︁
                           | 𝑔˜ 𝜂 (𝑟, 𝜃) − 𝑔𝜂 (𝑟, 𝜃)| 2 =           (𝐵 𝑓 𝑗 )(𝑟, 𝜃) − 𝑔𝜂 (𝑟, 𝜃) .
                                                             𝑀 𝑗=1
Define
                          𝑋 𝑗 = 𝐵 𝑓 𝑗 (𝑟, 𝜃) − 𝑔𝜂 (𝑟, 𝜃) = 𝐵 𝑓 𝑗 (𝑟, 𝜃) − E[(𝐵 𝑓 𝑗 )(𝑟, 𝜃)].
This means each 𝑋 𝑗 is zero centered, so we have
                                                   2
                                       1 ∑︁ 𝑀                 1 𝑀
                                                                 ∑︁           
                                                                                var(𝑋 𝑗 )
                                                       = var            𝑋 𝑗  =
                                                     
                                   E           𝑋𝑗                                          .
                                      𝑀                         𝑀 𝑗=1               𝑀
                                           𝑗=1                
                                                                            
    Write
                        𝑋 𝑗 = 𝐵 𝑓 𝑗 (𝑟, 𝜃) − (𝐵 𝑓 )(𝑟, 𝜃) + (𝐵 𝑓 )(𝑟, 𝜃) − E[(𝐵 𝑓 𝑗 )(𝑟, 𝜃)].
Then
                   𝑋 2𝑗 ≲ (𝐵 𝑓 𝑗 (𝑟, 𝜃) − (𝐵 𝑓 )(𝑟, 𝜃)) 2 + ((𝐵 𝑓 )(𝑟, 𝜃) − E[(𝐵 𝑓 𝑗 )(𝑟, 𝜃)]) 2
and
           E[𝑋 2𝑗 ] ≲ E (𝐵 𝑓 𝑗 (𝑟, 𝜃) − (𝐵 𝑓 )(𝑟, 𝜃)) 2 + E ((𝐵 𝑓 )(𝑟, 𝜃) − E[(𝐵 𝑓 𝑗 )(𝑟, 𝜃)]) 2
                                                                                                             
                    ≲ E (𝐵 𝑓 𝑗 (𝑟, 𝜃) − (𝐵 𝑓 )(𝑟, 𝜃)) 2 .
                                                               
    Each 𝜏 𝑗 has bounded variance and is supported on [−1/2, 1/2]. Taylor expand the dilated
bispectrum in radial variable in interval [𝑟/2, 2𝑟]:
                                                                        1
 (𝐵 𝑓 ) ((1 − 𝜏 𝑗 )𝑟, 𝜃) = (𝐵 𝑓 )(𝑟, 𝜃) − 𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)𝑟𝜏 𝑗 + 𝜕𝑟𝑟 (𝐵 𝑓 )(𝑟, 𝜃)                 𝑟 2 𝜏 2𝑗 , 𝛼 ∈ [𝑟/2, 2𝑟].
                                                                        2                    𝑟=𝛼
Now multiply both sides by (1 − 𝜏 𝑗 ) 3 to get
                      (𝐵 𝑓 𝑗 )(𝑟, 𝜃) = (1 − 𝜏 𝑗 ) 3 (𝐵 𝑓 )((1 − 𝜏 𝑗 )𝑟, 𝜃)
                                         = (1 − 𝜏 𝑗 ) 3 (𝐵 𝑓 )(𝑟, 𝜃) − (1 − 𝜏 𝑗 ) 3 𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)𝑟𝜏 𝑗
                                                       1
                                        + (1 − 𝜏 𝑗 ) 3 𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)𝑟 2 𝜏 2𝑗 .
                                                       2
                                                               15


with 𝛼 ∈ [𝑟/2, 2𝑟]. It now follows that
                          (𝐵 𝑓 𝑗 )(𝑟, 𝜃) − (𝐵 𝑓 )(𝑟, 𝜃) = (3𝜏 2𝑗 − 3𝜏 𝑗 − 𝜏 3𝑗 )(𝐵 𝑓 )(𝑟, 𝜃)
                                                           − (1 − 𝜏 𝑗 ) 3 𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)𝑟𝜏 𝑗
                                                              1
                                                           + (1 − 𝜏 𝑗 ) 3 𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)𝑟 2 𝜏 2𝑗 .
                                                              2
with 𝛼 ∈ [𝑟/2, 2𝑟].
    Square both sides to get:
      ((𝐵 𝑓 𝑗 )(𝑟, 𝜃) − (𝐵 𝑓 )(𝑟, 𝜃)) 2 = (3𝜏 2𝑗 − 𝜏 𝑗 − 𝜏 3𝑗 ) 2 (𝐵 𝑓 ) 2 (𝑟, 𝜃)
                                             − 2(3𝜏 2𝑗 − 𝜏 𝑗 − 𝜏 3𝑗 )(1 − 𝜏 𝑗 ) 3 (𝐵 𝑓 )(𝑟, 𝜃)𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)𝑟𝜏 𝑗
                                             + (3𝜏 2𝑗 − 𝜏 𝑗 − 𝜏 3𝑗 )(1 − 𝜏 𝑗 ) 3 (𝐵 𝑓 )(𝑟, 𝜃)𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)𝑟 2 𝜏 2𝑗
                                             + (1 − 𝜏 𝑗 ) 6 [𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)] 2𝑟 2 𝜏 2𝑗
                                             − (1 − 𝜏 𝑗 ) 6 𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)𝑟 3 𝜏 3𝑗
                                                (1 − 𝜏 𝑗 ) 6
                                             +               [𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)] 2𝑟 4 𝜏 4𝑗 .
                                                     4
Using the inequality 2|𝑎𝑏| ≤ |𝑎| 2 + |𝑏| 2 , it follows that
      2|(𝜏 2𝑗 − 𝜏 𝑗 − 𝜏 3𝑗 )(1 − 𝜏 𝑗 ) 3 (𝐵 𝑓 )(𝑟, 𝜃)𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)𝑟𝜏 𝑗 | ≤ (𝜏 2𝑗 − 𝜏 𝑗 − 𝜏 3𝑗 ) 2 |(𝐵 𝑓 )(𝑟, 𝜃)| 2
                                                                              + (1 − 𝜏 𝑗 ) 6 |𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)𝑟𝜏 𝑗 | 2
and
                                                                                  1
   |(3𝜏 2𝑗 − 𝜏 𝑗 − 𝜏 3𝑗 )(1 − 𝜏 𝑗 ) 3 (𝐵 𝑓 )(𝑟, 𝜃)𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)𝑟 2 𝜏 2𝑗 | ≤      (3𝜏 2𝑗 − 𝜏 𝑗 − 𝜏 3𝑗 ) 2 |(𝐵 𝑓 )(𝑟, 𝜃)| 2
                                                                                  2
                                                                                  1
                                                                              + |1 − 𝜏 𝑗 | 6 |𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)𝑟 2 𝜏 2𝑗 | 2 .
                                                                                  2
and
                                                                         1
            |(1 − 𝜏 𝑗 ) 6 𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)𝑟 3 𝜏 3𝑗 | ≤      (1 − 𝜏 𝑗 ) 6 |𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)𝜏 𝑗 𝑟 | 2
                                                                         2
                                                                        1
                                                                     + |1 − 𝜏 𝑗 | 6 |𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)𝑟 2 𝜏 2𝑗 | 2 .
                                                                        2
                                                              16


    Take the expectation of both sides now. Since the pdf of 𝜏 is supported on [−1/2, 1/2] and zero
centered,
                                      E[(3𝜏 2𝑗 − 𝜏 𝑗 − 𝜏 3𝑗 ) 2 ] ≲ E[𝜏 2𝑗 ] ≲ 𝜂2 .
The other terms with 𝜏 𝑗 are bounded in a similar way. Thus
              E[((𝐵 𝑓 𝑗 )(𝑟, 𝜃) − (𝐵 𝑓 )(𝑟, 𝜃)) 2 ] ≲ 𝜂2 (𝐵 𝑓 ) 2 (𝑟, 𝜃) + 𝑟 2 𝜂2 [𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)] 2
                                                                                                  2
                                                            4 4
                                                       +𝜂 𝑟           max |𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)| .
                                                                    𝛼∈[𝑟/2,2𝑟]
We also have
       var[𝑋 𝑗 ] = E[𝑋 2𝑗 ]
                 ≲ E[((𝐵 𝑓 𝑗 )(𝑟, 𝜃) − (𝐵 𝑓 )(𝑟, 𝜃)) 2 ]
                                                                                                              2
                      2         2          2 2                      2     4 4
                 ≲ 𝜂 (𝐵 𝑓 ) (𝑟, 𝜃) + 𝑟 𝜂 [𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)] + 𝜂 𝑟                   max      |𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)|    ,
                                                                                 𝛼∈[𝑟/2,2𝑟]
and
                                                      𝜂2                          𝜂2
                 E (𝑔𝜂 (𝑟, 𝜃) − 𝑔˜ 𝜂 (𝑟, 𝜃)) 2 ≲ (𝐵 𝑓 ) 2 (𝑟, 𝜃) + 𝑟 2 [𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)] 2
                                               
                                                      𝑀                           𝑀
                                                       4
                                                                                               2
                                                     𝜂 4
                                                   + 𝑟            max |𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)| .
                                                     𝑀          𝛼∈[𝑟/2,2𝑟]
Now we can take the integral and expectation to get
                                                     𝜂2                          𝜂2
                E ∥𝑔𝜂 (𝑟, 𝜃) − 𝑔˜ 𝜂 (𝑟, 𝜃)∥ 22 ≲ ∥(𝐵 𝑓 )(𝑟, 𝜃)∥ 22 + ∥𝑟𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)∥ 22
                                              
                                                      𝑀                          𝑀
                                                      4                                          2
                                                    𝜂       2
                                                  +      𝑟 max |𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)|
                                                    𝑀          𝛼∈[𝑟/2,2𝑟]                        2
The first term is now handled appropriately. We can now repeat a nearly identical argument for the
second term. Let 𝑔 𝑗 = 𝐵 𝑓 𝑗 and
                                                 𝜕𝑔 𝑗                𝜕𝑔𝜂
                                          𝑍𝑗 = 𝑟      (𝑟, 𝜃) − 𝑟          (𝑟, 𝜃).
                                                  𝜕𝑟                  𝜕𝑟
We have
                                                                 𝑀
                        𝜕 𝑔˜ 𝜂            𝜕𝑔𝜂             1 ∑︁ 𝜕𝑔 𝑗                    𝜕𝑔𝜂
                      𝑟        (𝑟, 𝜃) − 𝑟     (𝑟, 𝜃) =              𝑟      (𝑟, 𝜃) − 𝑟        (𝑟, 𝜃).
                        𝜕𝑟                 𝜕𝑟            𝑀 𝑗=1 𝜕𝑟                       𝜕𝑟
                                                           17


By Leibniz Rule, we can take the derivative inside the expectation to get E[𝑍 𝑗 ] = 0, and a similar
argument from before yields
                                                                2                                   2
                                  𝜕𝑔 𝑗              𝜕𝑔                   𝜕𝑔                𝜕𝑔𝜂
                       𝑍 2𝑗  ≲ 𝑟         (𝑟, 𝜃) − 𝑟 (𝑟, 𝜃)          + 𝑟 (𝑟, 𝜃) − 𝑟              (𝑟, 𝜃)
                                   𝜕𝑟               𝜕𝑟                   𝜕𝑟                𝜕𝑟
and                                                 "                                 2#
                                                         𝜕𝑔 𝑗              𝜕𝑔
                                     E[𝑍 2𝑗 ] ≲ E      𝑟       (𝑟, 𝜃) − 𝑟 (𝑟, 𝜃)            .
                                                          𝜕𝑟               𝜕𝑟
Taylor expand 𝜕𝑟 (𝐵 𝑓 )((1 − 𝜏 𝑗 )𝑟, 𝜃) to get
                                                                             1
𝜕𝑟 (𝐵 𝑓 ) ((1 − 𝜏 𝑗 )𝑟, 𝜃) = 𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃) − 𝜕𝑟𝑟 (𝐵 𝑓 )(𝑟, 𝜃)𝑟𝜏 𝑗 + 𝜕𝛾𝛾𝛾 (𝐵 𝑓 )(𝛾, 𝜃)𝑟 2 𝜏 2𝑗 ,          𝛾 ∈ [𝑟/2, 2𝑟].
                                                                             2
          𝜕𝑔 𝑗
Since 𝑟   𝜕𝑟   (𝑟, 𝜃) = 𝑟 (1 − 𝜏 𝑗 ) 4 𝜕𝑟 (𝐵 𝑓 )((1 − 𝜏 𝑗 )𝑟, 𝜃), multiply both sides by (1 − 𝜏 𝑗 ) 4 :
  (1 − 𝜏 𝑗 ) 4𝑟𝜕𝑟 (𝐵 𝑓 )((1 − 𝜏 𝑗 )𝑟, 𝜃) = (1 − 𝜏 𝑗 ) 4𝑟𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)
                                                                                                  1
                                             − (1 − 𝜏 𝑗 ) 4 𝜕𝑟𝑟 (𝐵 𝑓 )(𝑟, 𝜃)𝑟 2 𝜏 𝑗 + (1 − 𝜏 𝑗 ) 4 𝜕𝛾𝛾𝛾 (𝐵 𝑓 )(𝛾, 𝜃)𝑟 3 𝜏 2𝑗
                                                                                                  2
with 𝛾 ∈ [𝑟/2, 2𝑟]. Then
         𝜕𝑔 𝑗               𝜕𝑔
       𝑟       (𝑟, 𝜃) − 𝑟 (𝑟, 𝜃) = (𝜏 4𝑗 − 4𝜏 3𝑗 + 6𝜏 2𝑗 − 4𝜏 𝑗 )𝑟𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)
          𝜕𝑟                𝜕𝑟
                                                                                              1
                                       − (1 − 𝜏 𝑗 ) 4 𝜕𝑟𝑟 (𝐵 𝑓 )(𝑟, 𝜃)𝑟 2 𝜏 𝑗 + (1 − 𝜏 𝑗 ) 4 𝜕𝛾𝛾𝛾 (𝐵 𝑓 )(𝛾, 𝜃)𝑟 3 𝜏 2𝑗
                                                                                              2
with 𝛾 ∈ [𝑟/2, 2𝑟]. By a similar process from above,
    "                                  2#
         𝜕𝑔 𝑗               𝜕𝑔𝜂                  𝜂2
 E 𝑟           (𝑟, 𝜃) − 𝑟       (𝑟, 𝜃)       ≲ ∥𝑟𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)∥ 22
          𝜕𝑟                𝜕𝑟                   𝑀
                                                                                                                           2
                                                𝜂2 2                            𝜂4 3
                                             +     ∥𝑟 𝜕𝑟𝑟 (𝐵 𝑓 )(𝑟, 𝜃)∥ 22 +           𝑟 max |𝜕𝛾𝛾𝛾 (𝐵 𝑓 )(𝛾, 𝜃)|             .
                                                𝑀                               𝑀         𝛾∈[𝑟/2,2𝑟]                       2
Thus
       h                  i      2                                                                             
    E ∥𝐵 𝑓 − 𝐵     f𝑓 ∥ 2 ≲ 𝜂 ∥(𝐵 𝑓 )(𝑟, 𝜃)∥ 2 + 2∥𝑟𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)∥ 2 + +∥𝑟 2 𝜕𝑟𝑟 (𝐵 𝑓 )(𝑟, 𝜃)∥ 2
                        2                             2                            2                          2
                               𝑀                                                                                       !
                                                                           2                                        2
                              𝜂4       2                                            3
                            +        𝑟 max |𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)| + 𝑟 max |𝜕𝛾𝛾𝛾 (𝐵 𝑓 )(𝛾, 𝜃)|                                .
                               𝑀         𝛼∈[𝑟/2,2𝑟]                        2          𝛾∈[𝑟/2,2𝑟]                    2
                                                               18


    For the case where 𝐵 𝑓 : R2 → C, simply write 𝐵 𝑓 = Re(𝐵 𝑓 ) + 𝑖Im(𝐵 𝑓 ) and repeat the
argument above. Then it follows that
               h                              i
            E ∥Re(𝐵 𝑓 ) − Re( 𝐵       f𝑓 )∥ 2
                                            2
                 𝜂2                                                                                 
             ≲         ∥Re(𝐵 𝑓 )(𝑟, 𝜃)∥ 22 + 2∥𝑟𝜕𝑟 Re(𝐵 𝑓 )(𝑟, 𝜃)∥ 22 + +∥𝑟 2 𝜕𝑟𝑟 Re(𝐵 𝑓 )(𝑟, 𝜃)∥ 22
                 𝑀                                                                                      !
                                                       2                                              2
                𝜂4
             +         𝑟 2 max |𝜕𝛼 Re(𝐵 𝑓 )(𝛼, 𝜃)| + 𝑟 3 max |𝜕𝛼𝛼𝛼 Re(𝐵 𝑓 )(𝛾, 𝜃)|
                𝑀          𝛼∈[𝑟/2,2𝑟]                  2         𝛾∈[𝑟/2,2𝑟]                           2
and
              h                              i
          E ∥Im(𝐵 𝑓 ) − Im( 𝐵        f𝑓 )∥ 2
                                           2
                𝜂2  
                                          2                        2        2                    2
                                                                                                    
            ≲         ∥Im(𝐵 𝑓 )(𝑟, 𝜃)∥ 2 + 2∥𝑟𝜕𝑟 Im(𝐵 𝑓 )(𝑟, 𝜃)∥ 2 + +∥𝑟 𝜕𝑟𝑟 Im(𝐵 𝑓 )(𝑟, 𝜃)∥ 2
                𝑀                                                                                      !
                                                       2                                             2
               𝜂4
            +         𝑟 2 max |𝜕𝛼𝛼 Im(𝐵 𝑓 )(𝛼, 𝜃) + 𝑟 3 max |𝜕𝛾𝛾𝛾 Im(𝐵 𝑓 )(𝛾, 𝜃)|                         .
               𝑀          𝛼∈[𝑟/2,2𝑟]                   2         𝛾∈[𝑟/2,2𝑟]                          2
Adding the two inequalities together yields the desired result.                                             □
    Now that we have solved the noiseless case, the goal is to move onto Model 2 and try to adapt
our method to work in the presence of additive noise.
2.6    Bispectrum Recovery for Model 2
    Now we move on to Model 2. This is similar, but the additive noise creates two difficulties.
First, we must restrict ourselves from R2 to some finite domain Ω since the MSE is not well defined
on infinite intervals because of the noise. Second, we don’t necessarily have access to 𝑔˜ 𝜂 like
before. Instead, we only know
                                                      𝑀
                                                   1 ∑︁
                                                          𝐵𝑦 𝑗 .
                                                   𝑀 𝑗=1
                     ∫ 1/2
Writing 𝜀(𝜔)
          ˆ       =    −1/2
                            𝑒 −𝑖𝜔𝑥 𝑑𝐵𝑥 as an integral with respect to a Brownian motion, it is clear that
E[𝐵𝜀 𝑗 ] = 0. Also, notice that E[𝐵𝜀 𝑗 ] = E[ 𝜀ˆ 𝑗 (𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 )], where each 𝜀 𝑗 is white
                                                      19


noise process on [−1/2, 1/2] with variance 𝜎 2 . We see that for each index
        𝐵𝑦 𝑗 (𝜔1 , 𝜔2 ) = 𝐵( 𝑓 𝑗 + 𝜀 𝑗 )(𝜔1 , 𝜔2 )
                        = ( 𝑓ˆ𝑗 (𝜔1 ) + 𝜀ˆ 𝑗 (𝜔1 ))( 𝑓ˆ𝑗∗ (𝜔2 ) + 𝜀ˆ∗𝑗 (𝜔2 ))( 𝑓ˆ𝑗 (𝜔2 − 𝜔1 ) + 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 ))
                        = 𝑓ˆ𝑗 (𝜔1 ) 𝑓ˆ𝑗∗ (𝜔2 ) 𝑓ˆ𝑗 (𝜔2 − 𝜔1 ) + 𝑓ˆ𝑗 (𝜔1 ) 𝑓ˆ𝑗∗ (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 )
                        + 𝑓ˆ𝑗 (𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 ) + 𝑓ˆ𝑗∗ (𝜔2 ) 𝑓ˆ𝑗 (𝜔2 − 𝜔1 ) 𝜀ˆ 𝑗 (𝜔1 )
                        + 𝑓ˆ𝑗∗ (𝜔2 ) 𝜀ˆ 𝑗 (𝜔1 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 ) + 𝜀ˆ 𝑗 (𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) 𝑓ˆ𝑗 (𝜔2 − 𝜔1 )
                        + 𝑓ˆ𝑗 (𝜔1 ) 𝑓ˆ𝑗 (𝜔2 − 𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) + 𝜀ˆ 𝑗 (𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 )
                        := 𝐵 𝑓 𝑗 (𝜔1 , 𝜔2 ) + 𝑅 𝑗 (𝜔1 , 𝜔2 ).
                                                                               1 Í𝑀
Thus, we need to perform a 𝜎-based centering to recover                       𝑀     𝑗=1 𝐵 𝑓 𝑗 (𝜔1 , 𝜔2 ). We see that if we
take the expectation over 𝜀, we have
          E𝜀 [𝐵𝑦 𝑗 (𝜔1 , 𝜔2 )] = 𝐵 𝑓 𝑗 (𝜔1 , 𝜔2 ) + E𝜀 [ 𝑓ˆ𝑗 (𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 )
                                    + 𝑓ˆ𝑗∗ (𝜔2 ) 𝜀ˆ 𝑗 (𝜔1 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 ) + 𝜀ˆ 𝑗 (𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) 𝑓ˆ𝑗 (𝜔2 − 𝜔1 )].
Now we know from Theorem 4.5 of [29]
                                                                                            
                                                                            1
                                                                      sin   2 (𝜔2   − 𝜔1 )
                                         ˆ 1 ) 𝜀ˆ∗ (𝜔2 )] = 2𝜎 2
                                   E𝜀 [ 𝜀(𝜔                                                    .
                                                                            𝜔2 − 𝜔1
                        sin ( 21 𝜔)
Define ℎ(𝜔) = 2𝜎 2          𝜔       . Note that ℎ is an even function since it is the product of two odd
functions. Taking the expectation yields
   E𝜀 [𝐵𝑦 𝑗 (𝜔1 , 𝜔2 )] = 𝐵 𝑓 𝑗 (𝜔1 , 𝜔2 ) + 𝑓ˆ𝑗 (𝜔1 )ℎ(𝜔1 ) + 𝑓ˆ𝑗∗ (𝜔2 )ℎ(𝜔2 ) + 𝑓ˆ𝑗 (𝜔2 − 𝜔1 )ℎ(𝜔2 − 𝜔1 ).
Now take the expectation over the translation and dilation parameters to get
                  E[𝐵𝑦 𝑗 (𝜔1 , 𝜔2 )] = E[𝐵 𝑓 𝑗 (𝜔1 , 𝜔2 )] + E[ 𝑓ˆ𝑗 (𝜔1 )]ℎ(𝜔1 )
                                          + E[ 𝑓ˆ𝑗∗ (𝜔2 )]ℎ(𝜔2 ) + E[ 𝑓ˆ𝑗 (𝜔2 − 𝜔1 )]ℎ(𝜔2 − 𝜔1 ).
                                                                  20


Denote the 𝜇(𝜔) = E[ 𝑓ˆ𝑗 (𝜔)] and let 𝜇(𝜔)              1  Í𝑀
                                             ˜      =  𝑀     𝑗=1   𝑦ˆ 𝑗 (𝜔). We will approximate the expectation
using
                           𝑀
                        1 ∑︁
           E[𝐵 𝑓 𝑗 ] ≈                  ˜ 1 )ℎ(𝜔1 ) − 𝜇˜ ∗ (𝜔2 )ℎ(𝜔2 ) − 𝜇(𝜔
                                𝐵𝑦 𝑗 − 𝜇(𝜔                                         ˜ 2 − 𝜔1 )ℎ(𝜔2 − 𝜔1 )
                        𝑀 𝑗=1
                           𝑀
                        1 ∑︁
                     =          𝐵𝑦 𝑗 − 𝑅𝜎 .
                        𝑀 𝑗=1
After empirical centering by 𝑅𝜎 , we can thus decompose the computable quantity into two pieces:
                                               𝑀
                                          1 ∑︁
                                                  𝐵𝑦 𝑗 − 𝑅𝜎 = 𝑔˜ 𝜂 + 𝑔˜ 𝜎 ,
                                          𝑀 𝑗=1
where
                            𝑀                                               𝑀
                         1 ∑︁                                            1 ∑︁
                 𝑔˜ 𝜎 =         𝑅 𝑗 (𝜔1 , 𝜔2 ) − 𝑅𝜎 (𝜔1 , 𝜔2 ) =               𝑅 𝑗 (𝑟, 𝜃) − 𝑅𝜎 (𝑟, 𝜃).
                        𝑀 𝑗=1                                           𝑀 𝑗=1
The term 𝑔˜ 𝜂 + 𝑔˜ 𝜎 is not smooth due to the additive noise. Thus we need to add some procedure
                                                               2 /(2𝐿 2 )
to smooth the signal. Let 𝜙 𝐿 (𝑟) = (2𝜋𝐿 2 ) −1/2 𝑒 −𝑟                    be a low pass filter. We define a new
estimator for Model 2 as
    f𝑓 ) (𝑟, 𝜃) := (𝐼 − 𝐿 𝐶0 ) −1𝐶1 𝐿 𝐶2 4(( 𝑔˜ 𝜂 + 𝑔˜ 𝜎 ) ∗ 𝜙 𝐿 )(𝑟, 𝜃) + 𝑟𝜕𝑟 (( 𝑔˜ 𝜂 + 𝑔˜ 𝜎 ) ∗ 𝜙 𝐿 )(𝑟, 𝜃) . (2.12)
                                                                                                             
  (𝐵
We will use the following two lemmas, whose proofs are similar to [26].
Lemma 2. Let 𝑞 ∈ 𝐿 2 (R2 ) and assume | 𝑞(𝜔)|     ˆ        decays like |𝑞| −𝛼 for some integer 𝛼 ≥ 2 and 𝜔
such that |𝜔| ≥ 𝜔0 for some 𝜔0 > 0. Then for 𝐿 small enough,
                                    ∥𝑞 − 𝑞 ∗ 𝜙 𝐿 ∥ 22 ≲ ∥𝑞∥ 22 𝐿 4 + 𝐿 4∧(2𝛼−2) .
                                     2  2
Proof. We have 𝜙ˆ𝐿 (|𝜔|) = 𝑒 −𝐿 |𝜔| /2 and
                                                          𝐿 2 |𝜔| 2
                                            ˆ
                                       1 − 𝜙 𝐿 (|𝜔|) =                 + 𝑂 (𝐿 3 ).
                                                              2
                                                          21


Letting |𝜔| = 𝑟, we write the integral in polar coordinates and get
                                          1
                  ∥𝑞 − 𝑞 ∗ 𝜙 𝐿 ∥ 22 =         2
                                                   ˆ − 𝜙ˆ𝐿 )∥ 22
                                                ∥ 𝑞(1
                                        (2𝜋)
                                                ∫ 2𝜋 ∫ ∞
                                          1
                                    =                             ˆ 𝜃)| 2 |1 − 𝜙ˆ𝐿 (𝑟)| 2𝑟𝑑𝑟𝑑𝜃
                                                                | 𝑞(𝑟,
                                        (2𝜋) 2 0           0
                                                ∫ 2𝜋 ∫ 𝜔0
                                          1
                                    =         2
                                                                   ˆ 𝜃)| 2 |1 − 𝜙ˆ𝐿 (𝑟)|
                                                                 | 𝑞(𝑟,                  ˆ 2𝑟𝑑𝑟𝑑𝜃
                                        (2𝜋) 0             0
                                                ∫ 2𝜋 ∫ ∞
                                          1
                                    +                             ˆ 𝜃)| 2 |1 − 𝜙ˆ𝐿 (𝑟)| 2𝑟𝑑𝑟𝑑𝜃
                                                               | 𝑞(𝑟,
                                       (2𝜋) 2 0           𝜔0
                                    := 𝐼1 + 𝐼2 .
For 𝐼1 , we have
                                      2𝜋                                               2
                                                                   𝐿 2 |𝑟 | 2
                                  ∫       ∫  𝜔0                 
                            1                                2                       3
                    𝐼1 =                          | 𝑞(𝑟,
                                                    ˆ 𝜃)|                      + 𝑂 (𝐿 ) 𝑟 𝑑𝑟 𝑑𝜃
                          (2𝜋) 2    0      0                          2
                                                       !∫
                           1      𝐿 4 𝜔40            5
                                                             2𝜋 ∫ 𝜔0
                       ≤                  + 𝑂 (𝐿 )                            ˆ 𝜃)| 2𝑟𝑑𝑟𝑑𝜃
                                                                           | 𝑞(𝑟,
                          2𝜋 2       4                     0        0
                       ≲ 𝜔40 ∥𝑞∥ 22 𝐿 4 + 𝑂 (𝐿 5 ).
For 𝐼2 ,
                                         ∫ 2𝜋 ∫ ∞
                                  𝐶2                                            2 2
                         𝐼2 ≤          2
                                                        𝑟 −2𝛼+1 (1 − 𝑒 −𝐿 𝑟 /2 ) 2 𝑑𝑟 𝑑𝜃
                               (2𝜋) 0               1
                                  2 ∫ ∞
                               𝐶                                  2 2
                            =              𝑟 −2𝛼+1 (1 − 𝑒 −𝐿 𝑟 /2 ) 2 𝑑𝑟.
                               2𝜋 1
Using the same argument as [26], it follows that 𝐼2 ≲ 𝐿 4∧(2𝛼−2) .                                          □
Lemma 3. Let 𝑟𝑞(𝑟, 𝜃) ∈ 𝐿 2 (R2 , 𝑑𝑟 × 𝑑𝜃) and assume its Fourier transform (·)𝑞(·)(𝜔)        \   decays like
|𝑤| −𝛼 for some integer 𝛼 ≥ 2. Then for 𝐿 small enough,
                  ∥𝑟 (𝑞 − 𝑞 ∗ 𝜙 𝐿 )(𝑟, 𝜃)∥ 22 ≲ ∥𝑟𝑞(𝑟, 𝜃)∥ 22 𝐿 4 + 𝐿 4∧(2𝛼−2) + 𝐿 3 ∥𝑞∥ 22 .
                                                          22


Proof. First, we switch back to rectangular coordinates:
         ∥𝑟 (𝑞 − 𝑞 ∗ 𝜙 𝐿 )(𝑟, 𝜃)∥ 22 = ∥(𝜔21 + 𝜔22 ) 1/2 (𝑞 − 𝑞 ∗ 𝜙 𝐿 )(𝜔1 , 𝜔2 )∥ 22
                                         ∫
                                      =         𝑤 21 |(𝑞 − 𝑞 ∗ 𝜙 𝐿 )(𝜔1 , 𝜔2 )| 2 𝑑𝜔1 𝑑𝜔2
                                              2
                                         ∫R
                                     +          𝑤 22 |(𝑞 − 𝑞 ∗ 𝜙 𝐿 )(𝜔1 , 𝜔2 )| 2 𝑑𝜔1 𝑑𝜔2
                                           R 2
                                                     ∫
                                             1
                                      =                   |𝜕𝑡 𝑞(𝑡
                                                                ˆ 1 , 𝑡2 ) − 𝜕𝑡1 ( 𝑞(𝑡      ˆ 1 , 𝑡2 ) 𝜙ˆ𝐿 (𝑡1 , 𝑡2 ))| 2 𝑑𝑡 1 𝑑𝑡2
                                          (2𝜋) 2 R2 1
                                                    ∫
                                             1
                                     +                   |𝜕𝑡 𝑞(𝑡
                                                               ˆ 1 , 𝑡2 ) − 𝜕𝑡2 ( 𝑞(𝑡       ˆ 1 , 𝑡2 ) 𝜙ˆ𝐿 (𝑡1 , 𝑡2 ))| 2 𝑑𝑡 1 𝑑𝑡2
                                         (2𝜋) 2 R2 2
                                      := 𝐼1 + 𝐼2 .
For the first term, take derivatives to get
                       ∫
                  1
         𝐼1 =       2
                           |𝜕𝑡1 𝑞(𝑡
                                 ˆ 1 , 𝑡2 ) − 𝜕𝑡1 𝑞(𝑡 ˆ 1 , 𝑡2 ) 𝜙ˆ𝐿 (𝑡1 , 𝑡2 ) − 𝑞(𝑡   ˆ 1 , 𝑡2 )𝜕𝑡1 𝜙ˆ𝐿 (𝑡1 , 𝑡2 )| 2 𝑑𝑡1 𝑑𝑡 2
               (2𝜋) R2
                       ∫
                  1
            ≲        2
                            |𝜕𝑡1 𝑞(𝑡
                                  ˆ 1 , 𝑡2 ) − 𝜕𝑡1 𝑞(𝑡 ˆ 1 , 𝑡2 ) 𝜙ˆ𝐿 (𝑡1 , 𝑡2 )| 2 𝑑𝑡1 𝑑𝑡 2
                (2𝜋) R2
                      ∫
                 1
            +       2
                            ˆ 1 , 𝑡2 )𝜕𝑡1 𝜙ˆ𝐿 (𝑡1 , 𝑡2 )| 2 𝑑𝑡1 𝑑𝑡2
                          | 𝑞(𝑡
               (2𝜋) R2
            ≲ ∥𝜔1 𝑞(𝜔1 , 𝜔2 ) − (𝜔1 𝑞) ∗ 𝜙 𝐿 (𝜔1 , 𝜔2 )∥ 22 + ∥ 𝑞(𝑡            ˆ 1 , 𝑡2 )𝜕𝑡1 𝜙ˆ𝐿 (𝑡1 , 𝑡2 )∥ 22 .
We use Lemma 2 to get
             4∥𝜔1 𝑞(𝜔1 , 𝜔2 ) − (𝜔1 𝑞) ∗ 𝜙 𝐿 (𝜔1 , 𝜔2 )∥ 22 ≲ ∥𝜔1 𝑞(𝜔1 , 𝜔2 )∥ 22 𝐿 4 + 𝐿 4∧(2𝛼−2) .
                                                                        2 (𝑡 2 +𝑡 2 )/2
By how we defined 𝜙 𝐿 , we also have 𝜙ˆ𝐿 (𝑡1 , 𝑡2 ) = 𝑒 −𝐿                   1 2          . Take the derivative with respect to
𝑡1 :
                                                                               2 (𝑡 2 +𝑡 2 )/2
                                              ˆ 1 , 𝑡2 ) = −𝐿 2 𝑡1 𝑒 −𝐿
                                         𝜕𝑡1 𝜙(𝑡                                    1 2
and get
                                        ˆ 1 , 𝑡2 )𝜕𝑡1 𝜙ˆ𝐿 (𝑡1 , 𝑡2 )∥ 22 ≤ 𝐿 3 ∥𝑞∥ 22 .
                                     ∥ 𝑞(𝑡
     The result is
                              𝐼1 ≲ ∥𝜔1 𝑞(𝜔1 , 𝜔2 )∥ 22 𝐿 4 + 𝐿 4∧(2𝛼−2) + 𝐿 3 ∥𝑞∥ 22 .
                                                               23


An identical argument is used to get
                           𝐼2 ≲ ∥𝜔2 𝑞(𝜔1 , 𝜔2 )∥ 22 𝐿 4 + 𝐿 4∧(2𝛼−2) + 𝐿 3 ∥𝑞∥ 22 .
Now combine 𝐼1 and 𝐼2 . The previous work implies
                          ∥𝜔1 𝑞(𝜔1 , 𝜔2 )∥ 22 + ∥𝜔2 𝑞(𝜔1 , 𝜔2 )∥ 22 = ∥𝑟𝑞(𝑟, 𝜃)∥ 22
and
                   ∥𝑟 (𝑞 − 𝑞 ∗ 𝜙 𝐿 )(𝑟, 𝜃)∥ 22 ≲ ∥𝑟𝑞(𝑟, 𝜃)∥ 22 𝐿 4 + 𝐿 4∧(2𝛼−2) + 𝐿 3 ∥𝑞∥ 22 .
                                                                                                    □
     Along with these two lemmas, we will also need the following lemma.
Lemma 4. Suppose 𝜀 is a mean zero Gaussian white noise supported on [−1/2, 1/2] with variance
𝜎 2 . For all 𝑝 > 0 and 𝜔 ∈ R,
                                            E [| 𝜀(𝜔)|
                                                 ˆ      𝑝
                                                          ] ≲𝑝 𝜎 𝑝 .
Proof. We rewrite 𝜀ˆ as an integral with respect to a Brownian motion:
                                                   ∫   1/2
                                          ˆ
                                          𝜀(𝜔)   =          𝑒 −𝑖𝜔𝑥 𝑑𝐵𝑥 .
                                                    −1/2
Let
                                                           ∫   1/2
                               𝑔1 (𝜔) = Re( 𝜀(𝜔))
                                               ˆ       =           cos(𝜔𝑥)𝑑𝐵𝑥 ,
                                                             −1/2
                                                           ∫ 1/2
                               𝑔2 (𝜔) = Im( 𝜀(𝜔))
                                               ˆ       =           sin(𝜔𝑥)𝑑𝐵𝑥 .
                                                             −1/2
                                   ©𝑔1 (𝜔) ª
For fixed 𝜔, the random vector ­­           ® ∼ 𝑁 (0, Σ(𝜔)) where the covariance matrix is given by
                                            ®
                                     𝑔2 (𝜔)
                                   «        ¬
                                           2­
                                             ©1 + sin 𝜔𝜔cos 𝜔           0         ª
                              Σ(𝜔) = 𝜎 ­                                          ®.
                                                                      sin 𝜔 cos 𝜔
                                                                                  ®
                                                    0            1−        𝜔
                                             «                                    ¬
                                                       24


    We will now prove a bound for 𝑔1 (𝜔). An identical bound applies for 𝑔2 (𝜔). Since 𝑔1 (𝜔) is
normal,
                                                                                     
                                                                                 𝑝+1
                                                Var 𝑝 (𝑔1 (𝜔))        2 𝑝/2 Γ      2
                     E[|𝑔1 (𝜔)| 𝑝 ] = 𝜎 𝑝                                   √
                                                         𝜎 2𝑝                 𝜋
                                                                            𝑝            
                                                 𝜎 2𝑝 1 − sin 𝜔𝜔cos 𝜔 2 𝑝/2 Γ 𝑝+1            2
                                          ≤ 𝜎𝑝                                          √
                                                              𝜎 2𝑝                        𝜋
                                          ≲ 𝜎𝑝.
    Now we have
                                                                𝑝/2 
                                        𝑝                      2
                             ˆ
                        E[| 𝜀(𝜔)|         ]=E       | 𝜀(𝜔)|
                                                       ˆ
                                                 h                               i
                                            = E (𝑔12 (𝜔) + 𝑔22 (𝜔)) 𝑝/2
                                                      h             𝑝
                                                                      i      h            𝑝
                                                                                            i
                                            ≲ 𝑝 (E |𝑔12 (𝜔)| 2 + E |𝑔22 (𝜔)| 2 )
                                            ≲𝑝 𝜎 𝑝 .
                                                                                                                      □
Lemma 5. Assume that the assumptions of Model 2 hold, 𝐵 𝑓 ∈ 𝐶 3 (R2 ), (·) 𝐵                   c𝑓 (·) decays like | · | 𝜅
for some 𝜅 ≥ 2,
                        𝑟2     max         |𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)| ∈ 𝐿 2 (R2 , 𝑑𝑟 × 𝑑𝜃),
                           𝛼∈[𝑟/2,2𝑟]
and
                       𝑟3     max         |𝜕𝛾𝛾𝛾 (𝐵 𝑓 )(𝛾, 𝜃)| ∈ 𝐿 2 (R2 , 𝑑𝑟 × 𝑑𝜃).
                           𝛾∈[𝑟/2,2𝑟]
We have the bound
                                                     𝜎2                𝜎4              𝜎6
                           ∥ 𝑔˜ 𝜎 ∥ 2𝐿 2 (Ω) ≲Ω,𝜏          ∥ 𝑓 ∥ 42 +      ∥ 𝑓 ∥ 22 +     .
                                                      𝑀                 𝑀               𝑀
                                                           25


Proof. We use triangle inequality to get
                                                                       2                                                          2
                        𝑀                                                                  𝑀
                     1 ∑︁ ˆ                                                          1 ∑︁ ˆ∗
∥ 𝑔˜ 𝜎 ∥ 2𝐿 2 (Ω) ≲        𝑓 𝑗 (𝜔1 ) 𝑓ˆ𝑗 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 )                   +              𝑓 (𝜔2 ) 𝑓ˆ𝑗 (𝜔2 − 𝜔1 ) 𝜀ˆ 𝑗 (𝜔1 )
                    𝑀 𝑗=1                                                           𝑀 𝑗=1 𝑗
                                                                       𝐿 2 (Ω)                                                    𝐿 2 (Ω)
                                                                      2                                          2
                       𝑀                                                                 𝑀
                    1 ∑︁ ˆ                                                          1 ∑︁
                  +       𝑓 𝑗 (𝜔1 ) 𝑓ˆ𝑗 (𝜔2 − 𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 )                  +              𝐵𝜀 𝑗 (𝜔1 , 𝜔2 )
                    𝑀 𝑗=1                                                           𝑀 𝑗=1
                                                                      𝐿 2 (Ω)                                    𝐿 2 (Ω)
                                                                                             2
                       𝑀
                    1 ∑︁ ˆ
                  +       𝑓 𝑗 (𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 ) − ℎ(𝜔1 ) 𝜇(𝜔        ˜ 1)
                    𝑀 𝑗=1
                                                                                             𝐿 2 (Ω)
                                                                                                2
                       𝑀
                    1 ∑︁ ˆ∗
                  +       𝑓 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔1 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 ) − ℎ(𝜔2 ) 𝜇˜ ∗ (𝜔2 )
                    𝑀 𝑗=1 𝑗
                                                                                                𝐿 2 (Ω)
                                                                                                            2
                       𝑀
                    1 ∑︁ ˆ
                  +       𝑓 𝑗 (𝜔2 − 𝜔1 ) 𝜀ˆ 𝑗 (𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) − ℎ(𝜔2 − 𝜔1 ) 𝜇(𝜔            ˜ 2 − 𝜔1 )               .
                    𝑀 𝑗=1
                                                                                                            𝐿 2 (Ω)
      The argument for bounding the expectation of the first three terms is similar, so we only provide
                                                                                             2
the proof for the first term 𝑀1 𝑀
                                         Í          ˆ         ˆ∗
                                            𝑗=1 𝑓 𝑗 (𝜔1 ) 𝑓 𝑗 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 ) 2                 . We have
                                                                                             𝐿 (Ω)
                                                                                     2        
                              1 ∑︁   𝑀                                                        
                                           𝑓ˆ𝑗 (𝜔1 ) 𝑓ˆ𝑗∗ (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 )
                                                                                              
                          E                                                                   
                              𝑀                                                               
                                    𝑗=1                                                 2 (Ω) 
                                                                                      𝐿       
                                                                                               2
                              ∫        1 ∑︁      𝑀                                               
                          =
                                      
                                   E 2                 ˆ         ˆ
                                                       𝑓 𝑗 (𝜔1 ) 𝑓 𝑗 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 )  𝑑𝜔1 𝑑𝜔2
                                                                                                  
                                Ω 𝑀  
                                                 𝑗=1
                                                                                                  
                                                                                                  
                                                                                                 
                                 2  ∫ ∑︁   𝑀
                               𝜎
                          = 2                   | 𝑓ˆ𝑗 (𝜔1 ) 𝑓ˆ𝑗 (𝜔2 )| 2 𝑑𝜔1 𝑑𝜔2
                               𝑀 Ω 𝑗=1
                                     𝑀 ∫
                               𝜎 2 ∑︁
                          = 2                   | 𝑓ˆ𝑗 (𝜔1 ) 𝑓ˆ𝑗 (𝜔2 )| 2 𝑑𝜔1 𝑑𝜔2
                               𝑀 𝑗=1 Ω
                                     𝑀 ∫
                               𝜎 2 ∑︁
                                                                       ∫
                                                             2
                          = 2                      ˆ
                                                | 𝑓 𝑗 (𝜔1 )| 𝑑𝜔1            | 𝑓ˆ𝑗 (𝜔2 )| 2 𝑑𝜔2
                               𝑀 𝑗=1 Ω                                    Ω
                               𝜎2 ˆ 4
                          =        ∥ 𝑓 ∥ 𝐿 2 (Ω)
                               𝑀
                               𝜎2
                          ≤        ∥ 𝑓 ∥| 4𝐿 2 (R) ,
                               𝑀
                                                                     26


where the last line follows from Phlancherel Theorem. A similar argument proves that each of the
                                                                            𝜎2
first three terms on the right are bounded on the order of                   𝑀∥   𝑓 ∥ 42 .
     For the fourth term, since E[𝐵𝜖 𝑗 ] = 0, we have
                                             2                ∫                               2               
                       𝑀                                                     𝑀
                 1 ∑︁                                                1   ∑︁                                   
                           𝐵𝜀 𝑗 (𝜔1 , 𝜔2 )              =E                      𝐵𝜀 𝑗 (𝜔1 , 𝜔2 ) 𝑑𝜔1 𝑑𝜔2 
                                                                                                              
               E
                 𝑀                                             Ω 𝑀                                            
                      𝑗=1                                                   𝑗=1
                
                                             𝐿 2 (Ω)         
                                                                
                                                                                                                 
                                                                                                                 
                                                                                                2
                                                            ∫        1 ∑︁   𝑀                      
                                                         =                        𝐵𝜀 𝑗 (𝜔1 , 𝜔2 )  𝑑𝜔1 𝑑𝜔2
                                                                                                   
                                                                  E
                                                               Ω  𝑀 𝑗=1                           
                                                                                                    
                                                                                                   
                                                            ∫           ∑︁
                                                                       1       𝑀                   
                                                                                                    
                                                         =        Var            𝐵𝜀 𝑗 (𝜔1 , 𝜔2 )  𝑑𝜔1 𝑑𝜔2
                                                               Ω        𝑀 𝑗=1                      
                                                                  ∫                                
                                                             1
                                                         =            Var(𝐵𝜀 𝑗 ) 𝑑𝜔1 𝑑𝜔2 .
                                                             𝑀 Ω
We now bound Var(𝐵𝜀 𝑗 ). First, since the expectation is zero, by Holder’s inequality and the Lemma
4 to get
                  Var(𝐵𝜀 𝑗 ) = E[|𝐵𝜀 𝑗 | 2 ]
                                       ˆ 1 )| 2 | 𝜀(𝜔ˆ 2 )| 2 | 𝜀(𝜔
                                                                  ˆ 2 − 𝜔1 )| 2
                                                                                   
                             = E | 𝜀(𝜔
                                                     1/3                  1/3                          1/3
                                        ˆ 1 )| 6                 ˆ 2 )| 6                ˆ 2 − 𝜔1 )| 6
                                    
                             ≤ E | 𝜀(𝜔                   E | 𝜀(𝜔                  E | 𝜀(𝜔
                             ≲ 𝜎6 .
Thus, we obtain
                                                                     2        
                                             𝑀
                                    1 ∑︁                                          𝜎6
                                                  𝐵𝜀 𝑗 (𝜔1 , 𝜔2 )              ≲          |Ω|.
                                                                              
                                E
                                    𝑀                                              𝑀
                                            𝑗=1
                                   
                                                                     𝐿 2 (Ω) 
     Now we bound the last three terms. We start with
                                                                                                   2         
                          1 ∑︁   𝑀                                                                           
                  𝐴=E                𝑓ˆ𝑗 (𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 ) − ℎ(𝜔1 ) 𝜇(𝜔      ˜ 1)
                                                                                                             
                                                                                                              .
                          𝑀                                                                                  
                                 𝑗=1
                         
                                                                                                    𝐿 2 (Ω) 
Consider the random variable
                                       𝐴 𝑗 = 𝑓ˆ𝑗 (𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 ).
                                                              27


Then
                                                    2        
                1 ∑︁   𝑀                                     
        𝐴=E                 𝐴 𝑗 − ℎ(𝜔1 ) 𝜇(𝜔
                                           ˜ 1)
                                                             
                                                              
                𝑀                                            
                      𝑗=1                              2 (Ω) 
                                                    𝐿        
                                                                                                       2       
                1 ∑︁     𝑀                                                                                     
           =E ­              𝐴 𝑗 − ℎ(𝜔1 )𝜇(𝜔1 ) ® + ℎ(𝜔1 )𝜇(𝜔1 ) − ℎ(𝜔1 ) 𝜇(𝜔              ˜ 1)
                ©                                   ª                                                          
                                                                                                                
                𝑀                                                                                              
                «       𝑗=1                                                                              2 (Ω) 
                                                    ¬                                                  𝐿       
                                                     2        
                 1 ∑︁   𝑀                                             h                                                  i
                                                                                                                    2
           ≲E               𝐴 𝑗 − ℎ(𝜔1 )𝜇(𝜔1 )                 + E ∥ℎ(𝜔1 )𝜇(𝜔1 ) − ℎ(𝜔1 ) 𝜇(𝜔              ˜ 1 )∥ 𝐿 2 (Ω) .
                                                              
                 𝑀                                            
                        𝑗=1
                
                                                     𝐿 2 (Ω) 
For the first term at the end of the inequality, we see that ℎ(𝜔1 )𝜇(𝜔1 ) is the mean of 𝐴 𝑗 for fixed
(𝜔1 , 𝜔2 ). Thus,
                                             2         ∫                                                    2
            1 ∑︁  𝑀                                                  1 ∑︁   𝑀                                 
                      𝐴 𝑗 − ℎ(𝜔1 )𝜇(𝜔1 )               =                            𝐴 𝑗 − ℎ(𝜔1 )𝜇(𝜔1 )  𝑑𝜔1 𝑑𝜔2
                                                                                                              
        E                                                         E
            𝑀                                                 Ω  𝑀 𝑗=1
                                                                                                                
                  𝑗=1
           
                                             𝐿 2 (Ω)               
                                                                                                                 
                                                                                                                 
                                                             ∫
                                                                    1          
                                                           =           Var 𝐴 𝑗 𝑑𝜔1 𝑑𝜔2
                                                                Ω 𝑀
We have
                      Var( 𝐴 𝑗 ) = E[| 𝑓ˆ𝑗 (𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 )| 2 ] − ℎ2 (𝜔1 )𝜇2 (𝜔1 )
                                  ≤ E[| 𝑓ˆ𝑗 (𝜔1 ) 𝜀ˆ∗𝑗 (𝜔2 ) 𝜀ˆ 𝑗 (𝜔2 − 𝜔1 )| 2 ]
                                  ≲ 𝜎 4 E𝑡,𝜏 [| 𝑓ˆ𝑗 (𝜔1 )| 2 ].
Now substitute this back into the integral to get
                                     𝜎4                                                   𝜎4
  ∫                                       ∫                                                   ∫
       1
                                              E𝑡,𝜏 [| 𝑓ˆ𝑗 (𝜔1 )| 2 ] 𝑑𝜔1 𝑑𝜔2 =                      E𝜏 | 𝑓ˆ𝑗 (𝜔1 )| 2 𝑑𝜔1 𝑑𝜔2
                                                                                                                   
         Var 𝐴 𝑗 𝑑𝜔1 𝑑𝜔2 ≲
   Ω  𝑀                              𝑀      Ω                                             𝑀      Ω
by translation invariance of the power spectrum. Now we have
             𝜎4                                                  4
                  ∫                                                    ∫                                 
                                      2                      𝜎                            2
                      E𝜏 | 𝑓ˆ𝑗 (𝜔1 )| 𝑑𝜔1 𝑑𝜔2 =                               | 𝑓ˆ𝑗 (𝜔1 )| 𝑑𝜔1 𝑑𝜔2
                                       
                                                                   E𝜏
              𝑀 Ω                                             𝑀           Ω
                                                             𝜎4
                                                                                    ∫ ∫                              
                                                                            1                  ˆ         2
                                                          =        E𝜏                       | 𝑓 ( 𝜔˜ 1 )| 𝑑 𝜔˜ 1 𝑑𝜔2
                                                              𝑀          1 − 𝜏𝑗
                                                                   𝜎4
                                                          ≲𝜏,Ω         ∥ 𝑓 ∥ 22 .
                                                                   𝑀
                                                                28


       For the second term,
          h                                            i    ∫
                                      ˜ 1 )∥ 2𝐿 2 (Ω)             ℎ2 (𝜔1 )E |𝜇(𝜔1 ) − 𝜇(𝜔        ˜ 1 )| 2 𝑑𝜔1 𝑑𝜔2
                                                                                                          
        E ∥ℎ(𝜔1 )𝜇(𝜔1 ) −   ℎ(𝜔1 ) 𝜇(𝜔                   =
                                                               Ω
                                                                  ∫
                                                                4
                                                                                             ˜ 1 )| 2 𝑑𝜔1 𝑑𝜔2
                                                                                                        
                                                         ≲𝜎            E |𝜇(𝜔1 ) − 𝜇(𝜔
                                                                     Ω
                                                                                                            2
                                                                  ∫       1 ∑︁   𝑀                           
                                                         = 𝜎4                          𝑦ˆ 𝑗 (𝜔1 ) − 𝜇(𝜔1 )  𝑑𝜔1 𝑑𝜔2
                                                                                                             
                                                                       E
                                                                    Ω  𝑀 𝑗=1                                
                                                                                                              
                                                                                                             
Define 𝑍 𝑗 = 𝑦ˆ 𝑗 (𝜔) − 𝜇(𝜔). Using similar steps to before, we get
                                                                                                       2
                                                                              ∫              𝑀
                                                                                   © 1 ∑︁ ª
                  h                                                i
                 E ∥ℎ(𝜔1 )𝜇(𝜔1 ) − ℎ(𝜔1 ) 𝜇(𝜔      ˜ 1 )∥ 2𝐿 2 (Ω)    ≲𝜎    4
                                                                                   ­             𝑍 𝑗 ® 𝑑𝜔1 𝑑𝜔2
                                                                                Ω 𝑀 𝑗=1
                                                                                   «                 ¬
                                                                        𝜎   4 ∫
                                                                      =            Var(𝑍 𝑗 ) 𝑑𝜔1 𝑑𝜔2
                                                                         𝑀 Ω
                                                                            𝜎4
                                                                      ≲Ω         (∥ 𝑓 ∥ 22 + 𝜎 2 ).
                                                                            𝑀
This means that we put everything together to conclude
                                                         𝜎2             𝜎4                 𝜎6
                                ∥ 𝑔˜ 𝜎 ∥ 2𝐿 2 (Ω) ≲Ω,𝜏       ∥ 𝑓 ∥ 42 +       ∥ 𝑓 ∥ 22 +       .
                                                         𝑀               𝑀                  𝑀
                                                                                                                              □
Theorem 6. Assume that the assumptions of Model 2 hold, 𝐵 𝑓 ∈ 𝐶 3 (R2 ), (·) 𝐵                               c𝑓 (·) decays like
| · | 𝜅 for some 𝜅 ≥ 2,
                            𝑟2      max         |𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)| ∈ 𝐿 2 (R2 , 𝑑𝑟 × 𝑑𝜃),
                               𝛼∈[𝑟/2,2𝑟]
and
                            𝑟3     max         |𝜕𝛾𝛾𝛾 (𝐵 𝑓 )(𝛾, 𝜃)| ∈ 𝐿 2 (R2 , 𝑑𝑟 × 𝑑𝜃).
                               𝛾∈[𝑟/2,2𝑟]
For the estimator ( 𝐵 f𝑓 )(𝑟, 𝜃) defined for Model 2,
                                                                                       𝜎2 ∨ 𝜎6
                            h                         i               2                            
                                                 2                    𝜂         4
                          E ∥𝐵 𝑓 − 𝐵 𝑓 ∥ 𝐿 2 (Ω) ≤ 𝐶 𝑓 ,Ω
                                          f                               +𝐿 +                         ,
                                                                      𝑀                   𝐿2 𝑀
where 𝐶 𝑓 ,Ω only depends on 𝑓 and Ω.
                                                             29


Proof. First, we see that by an argument as in Lemma 1,
                f𝑓 ∥ 2 2 ≲ 4𝑔𝜂 + 𝑟𝜕𝑟 𝑔𝜂 − 4( 𝑔˜ 𝜂 + 𝑔˜ 𝜎 ) ∗ 𝜙 𝐿 − 𝑟𝜕𝑟 (( 𝑔˜ 𝜂 + 𝑔˜ 𝜎 ) ∗ 𝜙 𝐿 )                       2
      ∥𝐵 𝑓 − 𝐵       𝐿 (Ω)                                                                                            𝐿 2 (Ω)
                                                   2                                  2
                             ≲ 𝑔𝜂 − 𝑔˜ 𝜂           𝐿 2 (Ω)
                                                           + 𝑟𝜕𝑟 𝑔𝜂 − 𝑟𝜕𝑟 𝑔˜ 𝜂         𝐿 2 (Ω)
                                                                                                                     2
                            + 4𝑔˜ 𝜂 + 𝑟𝜕𝑟 𝑔˜ 𝜂 − 4( 𝑔˜ 𝜂 + 𝑔˜ 𝜎 ) ∗ 𝜙 𝐿 − 𝑟𝜕𝑟 (( 𝑔˜ 𝜂 + 𝑔˜ 𝜎 ) ∗ 𝜙 𝐿 )               𝐿 2 (Ω)
                                                   2                                  2
                             ≲ 𝑔𝜂 − 𝑔˜ 𝜂           𝐿 2 (Ω)
                                                           + 𝑟𝜕𝑟 𝑔𝜂 − 𝑟𝜕𝑟 𝑔˜ 𝜂         𝐿 2 (Ω)
                                                                                               + ∥ 𝑔˜ 𝜂 − 𝑔˜ 𝜂 ∗ 𝜙 𝐿 ∥ 2𝐿 2 (Ω)
                                                                      2
                            + 𝑟𝜕𝑟 𝑔˜ 𝜂 − 𝑟𝜕𝑟 ( 𝑔˜ 𝜂 ∗ 𝜙 𝐿 )           𝐿 2 (Ω)
                                                                              + ∥ 𝑔˜ 𝜎 ∗ 𝜙 𝐿 ∥ 2𝐿 2 (Ω) + ∥𝑟𝜕𝑟 ( 𝑔˜ 𝜎 ∗ 𝜙 𝐿 )∥ 2𝐿 2 (Ω)
By Theorem 1,
                  h                                                           i
                                  2                                    2
               E    𝑔𝜂 − 𝑔˜ 𝜂 𝐿 2 (Ω) + 𝑟𝜕𝑟 𝑔𝜂 − 𝑟𝜕𝑟 𝑔˜ 𝜂 𝐿 2 (Ω)
                    𝜂2                         2                               2          2                    2 2
                                                                                                                    
                ≲        ∥(𝐵 𝑓 )(𝑟, 𝜃)∥ 2 + 2∥𝑟𝜕𝑟 (𝐵 𝑓 )(𝑟, 𝜃)∥ 2 + ∥𝑟 𝜕𝑟𝑟 (𝐵 𝑓 )(𝑟, 𝜃)] ∥ 2
                    𝑀                                                                                                         !
                                                                        2                                                   2
                   𝜂4
                +        𝑟 2 max |𝜕𝛼𝛼 (𝐵 𝑓 )(𝛼, 𝜃)| + 𝑟 3 max |𝜕𝛾𝛾𝛾 (𝐵 𝑓 )(𝛾, 𝜃)|                                               .
                   𝑀         𝛼∈[𝑟/2,2𝑟]                                 2          𝛾∈[𝑟/2,2𝑟]                               2
It suffices to bound the other four terms appropriately now. By Lemma 2,
                                       ∥ 𝑔˜ 𝜂 − 𝑔˜ 𝜂 ∗ 𝜙 𝐿 ∥ 2𝐿 2 (Ω) ≲ ∥ 𝑔˜ 𝜂 ∥ 22 𝐿 4 + 𝐿 4∧(2𝜅−2) .
For 𝑔˜ 𝜂 , notice that
                                                      ∫
                     ∥𝐵 𝑓 𝑗 ∥ 22  = (1 − 𝜏 𝑗 )     6
                                                           |(𝐵 𝑓 )((1 − 𝜏 𝑗 )𝜔1 , (1 − 𝜏 𝑗 )𝜔2 )| 2 𝑑𝜔1 𝑑𝜔2
                                                        R2
                                                      ∫
                                                   4
                                  = (1 − 𝜏 𝑗 )             |(𝐵 𝑓 )(𝜔1 , 𝜔2 )| 2 𝑑𝜔1 𝑑𝜔2
                                                        R2
                                  = (1 − 𝜏 𝑗 ) 4 ∥𝐵 𝑓 ∥ 22
                                  ≤ 24 ∥𝐵 𝑓 ∥ 22 .
Triangle inequality implies
                                                        𝑀                      𝑀
                                                   1 ∑︁                    1 ∑︁
                                 ∥ 𝑔˜ 𝜂 ∥ 2 =               𝐵 𝑓𝑗      ≤              ∥𝐵 𝑓 𝑗 ∥ 2 ≲ ∥𝐵 𝑓 ∥ 2
                                                  𝑀 𝑗=1                    𝑀 𝑗=1
                                                                    2
and
                                      ∥ 𝑔˜ 𝜂 − 𝑔˜ 𝜂 ∗ 𝜙 𝐿 ∥ 𝐿 2 (Ω) ≲ ∥𝐵 𝑓 ∥ 22 𝐿 4 + 𝐿 4∧(2𝛼−2) .
                                                                      30


Also, by Lemma 3,
                  ∥𝑟𝜕𝑟 ( 𝑔˜ 𝜂 − ( 𝑔˜ 𝜂 ∗ 𝜙 𝐿 ))∥ 2𝐿 2 (Ω) ≲ ∥𝑟𝜕𝑟 𝑔˜ 𝜂 ∥ 22 𝐿 4 + 𝐿 4∧(2𝛼−2) + 𝐿 3 ∥𝜕𝑟 𝑔˜ 𝜂 ∥ 22 .
     Now consider the error terms. First start with ∥ 𝑔˜ 𝜎 ∗ 𝜙 𝐿 ∥ 2𝐿 2 (Ω) . We have
                             ∥ 𝑔˜ 𝜎 ∗ 𝜙 𝐿 ∥ 2𝐿 2 (Ω) ≤ ∥𝜙 𝐿 ∥ 2𝐿 1 (Ω) ∥ 𝑔˜ 𝜎 ∥ 2𝐿 2 (Ω) ≲ ∥ 𝑔˜ 𝜎 ∥ 2𝐿 2 (Ω) .
     Now we consider ∥𝑟𝜕𝑟 ( 𝑔˜ 𝜎 ∗ 𝜙 𝐿 )∥ 𝐿 2 (Ω) . Let 𝑅Ω = maxΩ |𝑟 |. We have
                                   ∥𝑟𝜕𝑟 ( 𝑔˜ 𝜎 ∗ 𝜙 𝐿 )∥ 2𝐿 2 (Ω) ≤ 𝑅Ω     2
                                                                            ∥ 𝑔˜ 𝜎 ∗ 𝜕𝑟 𝜙 𝐿 ∥ 2𝐿 2 (Ω)
                                                                          2
                                                                    ≤ 𝑅Ω    ∥𝜕𝑟 𝜙 𝐿 ∥ 21 ∥ 𝑔˜ 𝜎 ∥ 2𝐿 2 (Ω) .
Since 𝜙 is a radial filter and we know the filter
                                                            h 2          2
                                                                            i                                2 2
                     𝜕𝑟 𝜙 𝐿 (𝑟) = (2𝜋𝐿 2 ) −1/2 𝜕𝑟 𝑒 −𝑟 /(2𝐿 ) = −(2𝜋𝐿 3 ) −1/2𝑟𝑒 −𝑟 /(2𝐿 ) .
It follows that
                                                               ∫    ∞
                                                          1                  2 /(2𝐿 2 )
                                       ∥𝜕𝑟 𝜙 𝐿 ∥ 1 = 3                 𝑟𝑒 −𝑟             𝑑𝑟 = 𝐿 −1
                                                         𝐿       0
and ∥𝜕𝑟 𝜙 𝐿 ∥ 21 = 𝐿 −2 . We can also use our previous bound to get
                                                                                          𝜎2 ∨ 𝜎6
                                                                                                       
                                                                                  −2
                                    ∥𝑟𝜕𝑟 ( 𝑔˜ 𝜎 ∗   𝜙 𝐿 )∥ 2𝐿 2 (Ω)  ≲𝜏,Ω, 𝑓 𝐿                            .
                                                                                              𝑀
This finishes the proof of the theorem since each term is dependent on 𝐿, 𝑀, and 𝜂2 now.                                     □
                                                                                             
                                                                                           𝜎6                             𝜎6
     Note that we can choose 𝐿 =                  𝜎
                                                𝑀 1/6
                                                       to get a bound of 𝑂                 𝑀    when 𝜎 ≥ 1. We let 𝐿 4 = 𝐿2 𝑀
                                                                                                                              ,
                                                        2
                                                                 𝜎4
and this gives a convergence rate of 𝑂 ( 𝜂𝑀 +                   𝑀 2/3
                                                                      ) on the squared error.
2.7    Numerical Implementation of Bispectrum Recovery
     Now that we have theoretical results to recover the bispectrum, we will use these results for
bispectrum recovery. After showing success for signal recovery, we will use our recovery results for
signal inversion with the phase synchronization algorithm from [7], which will require a recovered
power spectrum and recovered bispectrum, which will motivate our current approach of doing
power spectrum recovery first.
                                                                     31


    We can discuss computing (2.12) numerically now. We cannot compute (𝐼 − 𝐿 𝐶0 ) −1 , but we
can still solve for the bispectrum by means of optimization. Similar to how we constructed the
estimator in the finite sample case, we will consider the infinite sample case first and derive an
optimization procedure. Based on this procedure, we will design another optimization procedure
for the finite sample case that is a good estimator when 𝑀 is large.
    In the infinite sample limit, we have access to the term
                                    𝑑 (𝑟, 𝜃) = 4𝑔𝜂 (𝑟, 𝜃) + 𝑟𝜕𝑟 𝑔𝜂 (𝑟, 𝜃)                     (2.13)
and Proposition 4 implies that we can recover 𝑑 by solving the convex optimization problem
                                𝑔 = argmin𝑔˜ ∥(𝐼 − 𝐿 𝐶0 ) 𝑔˜ − 𝐶1 𝐿 𝐶2 𝑑 ∥ 22 ,               (2.14)
where the constants 𝐶0 , 𝐶1 , 𝐶2 depend on 𝜂 as given in (2.5). Note that for fixed 𝜂, this problem
is convex. Since the variance of the dilations is not necessarily known, 𝜂 is possibly an unknown
parameter and we must actually minimize
                                                                                  2
                             𝔏( 𝑔, ˜ = (𝐼 − 𝐿 𝐶0 (𝜂)˜ ) 𝑔˜ − 𝐶1 ( 𝜂)𝐿
                                ˜ 𝜂)                              ˜ 𝐶2 (𝜂)˜ 𝑑     2
                                                                                    .
    In [26], the authors find 𝜂 via a nonconvex optimization problem while recovering the power
spectrum. We could mimic a similar process for bispectrum recovery, but it is likely this process
would fail. The reason for this is that the optimization problem for bispectrum recovery has a
larger number of variables, so memory limited methods are necessary. In addition, we also need a
recovered power spectrum for our bispectrum inversion algorithm anyway, so we would still need
to perform power spectrum recovery. We will choose to learn 𝜂 and recover the power spectrum
using a modification of the algorithm from [26] and describe the steps below.
    For recovering the power spectrum, our data term is given by
                                     𝑝 data (𝜔) = 3𝑝 𝜂 (𝜔) + 𝜔𝑝′𝜂 (𝜔) ,                       (2.15)
and we can consider minimizer
                                              ∥(𝐼 − 𝑆𝐶0 ) 𝑝˜ − 𝐶1 𝑆𝐶2 𝑝 data ∥ 22
                              𝑝 = argmin 𝑝˜                                       ,           (2.16)
                                                             𝜂2
                                                    32


Since 𝜂 may be unknown, we use the loss function:
                                                                        ˜ 𝐶2 (𝜂)˜ 𝑝 data ∥ 22
                                                           
                                          ∥ 𝐼 − 𝑆𝐶0 (𝜂)˜ 𝑝˜ − 𝐶1 ( 𝜂)𝑆
                            L ( 𝑝,
                                ˜ 𝜂)
                                   ˜ =                                                        .               (2.17)
                                                                𝜂2
Theorem 7. Define the operator 𝐴 = 𝐼 − 𝑆𝐶0 . For the loss function in (2.17), we have
                         2𝐴∗ ( 𝐴 𝑝˜ − 𝐶1 𝑆𝐶2 𝑝 data )
         ∇ 𝑝˜ L ( 𝑝,
                  ˜ 𝜂)
                     ˜ =
                                      𝜂2
                             ∫
                          1                                            𝜕
         ∇𝜂˜ L ( 𝑝,
                  ˜ 𝜂)
                     ˜ = 2       2( 𝐴 𝑝(𝜔)
                                       ˜      − 𝐶1 𝑆𝐶2 𝑝 data (𝜔)) ( 𝐴 𝑝(𝜔)   ˜       − 𝐶1 𝑆𝐶2 𝑝 data (𝜔)) 𝑑𝜔
                         𝜂                                           𝜕 𝜂˜
                         2
                       − 3 ∥ 𝐴 𝑝˜ − 𝐶1 𝑆𝐶2 𝑝 data ∥ 22 .
                         𝜂
Proof. The following computation is almost identical to [26], but we provide the details for com-
pleteness. We start with ∇ 𝑝˜ L ( 𝑝,˜ 𝜂)˜ and fix 𝜂.  ˜ We will ignore the 𝜂˜ dependence for this part of the
computation. The Frechet derivative is
                                             ∥ 𝐴 𝑝˜ − 𝐶1 𝑆𝐶2 𝑝 data ∥ 22
                                 L ( 𝑝)
                                      ˜ =                                = 𝑁 ( 𝐴 𝑝)˜ ,
                                                         𝜂2
where 𝑁 𝑓 = ∥ 𝑓 − 𝐶1 𝑆𝐶2 𝑝 data ∥ 22 . Let ℎ be a test function. Then we have
                        (𝐷L)( 𝑝)ℎ˜ = (𝐷𝑁)( 𝐴 𝑝)       ˜ ◦ 𝐷 ( 𝐴 𝑝)ℎ
                                                                 ˜ = (𝐷𝑁)( 𝐴 𝑝)         ˜ ◦ 𝐴ℎ
since 𝐴 is a linear operator. It follows that
                                                2
                       |𝑁 ( 𝑓 + ℎ) − 𝑁 𝑓 −     𝜂2
                                                  ⟨𝑓   − 𝐶1 𝑆𝐶2 𝑝 data , ℎ⟩|      1 ∥ℎ∥ 22
                                                                             =                  →0
                                              ∥ℎ∥ 2                               𝜂2 ∥ℎ∥ 2
                                       2
as ∥ℎ∥ 2 → 0, and (𝐷𝑁)( 𝑓 )ℎ =        𝜂2
                                         ⟨𝑓  − 𝐶1 𝑆𝐶2 𝑝 data , ℎ⟩. Thus
                                                   2
                               (𝐷L)( 𝑝)ℎ ˜ =          ⟨𝐴 𝑝˜ − 𝐶1 𝑆𝐶2 𝑝 data , 𝐴ℎ⟩
                                                  𝜂2
                                                    2
                                               = ⟨ 2 𝐴∗ ( 𝐴 𝑝˜ − 𝐶1 𝑆𝐶2 𝑝 data ), ℎ⟩.
                                                    𝜂
It now follows that
                                                   2 ∗
                                    ∇L ( 𝑝) ˜ =       𝐴 ( 𝐴 𝑝˜ − 𝐶1 𝑆𝐶2 𝑝 data ) .                            (2.18)
                                                   𝜂2
                                                          33


Additionally, using the work above, we have
                         𝜂2 ∇𝜂 (∥ 𝐴 𝑝˜ − 𝐶1 𝑆𝐶2 𝑝 data ∥ 22 ) − 2𝜂∥ 𝐴 𝑝˜ − 𝐶1 𝑆𝐶2 𝑝 data ∥ 22
         ∇𝜂˜ L ( 𝑝,
                 ˜ 𝜂)
                    ˜ =
                                                           𝜂4
                              ∫
                          1                                            𝜕
                      =          2( 𝐴  ˜
                                      𝑝(𝜔)    −  𝐶 1 𝑆 𝐶2 𝑝 data (𝜔))      ( 𝐴 𝑝(𝜔)
                                                                                ˜    − 𝐶1 𝑆𝐶2 𝑝 data (𝜔)) 𝑑𝜔
                         𝜂2                                           𝜕 𝜂˜
                          2
                      − 3 ∥ 𝐴 𝑝˜ − 𝐶1 𝑆𝐶2 𝑝 data ∥ 22 .
                         𝜂
                                                                                                                 □
    For this implementation, we need to calculate 𝐴∗ , which we do analytically below. We have
                                             ∫
                                ⟨𝐴𝑔, ℎ⟩ =         𝐴𝑔(𝜔)ℎ(𝜔) 𝑑𝜔
                                             ∫
                                          =      (𝑔(𝜔) − 𝐶03 𝑔(𝐶0 𝜔))ℎ(𝜔) 𝑑𝜔
                                             ∫                              
                                                                       2      𝜔˜
                                          =          ˜ ℎ( 𝜔)
                                                 𝑔( 𝜔)         ˜ − 𝐶0 ℎ              𝑑 𝜔˜
                                                                             𝐶0
                                          = ⟨𝑔, 𝐴∗ ℎ⟩.
Thus,
                                                                             
                                          ∗                               𝜔
                                        𝐴 ℎ(𝜔) = ℎ(𝜔)          − 𝐶02 ℎ           .                           (2.19)
                                                                         𝐶0
    For implementations, one only has access to the estimate of 𝑝 𝜂 , so numerical implementations
will use the following estimate for the data term:
                      𝑝edata (𝜔) := 3( 𝑝e𝜂 + 𝑝e𝜎 ) ∗ 𝜙 𝐿 (𝜔) + 𝜔 ( 𝑝e𝜂 + 𝑝e𝜎 ) ∗ 𝜙′𝐿 (𝜔)
                                                                                         
                                                                                  2
                       Le( 𝑝,  ˜ := 𝐼 − 𝑆𝐶0 (𝜂)˚ 𝑝˜ − 𝐶1 ( 𝜂)𝑆
                            ˜ 𝜂)                                  ˜ 𝐶2 (𝜂)˜ 𝑝edata 2 ,
and recovered power spectrum is computed by minimizing L.                  e
    After recovering the power spectrum and an estimate for 𝜂, the problem for recovering the
bispectrum is
                              𝑔 = argmin𝑔˜ ∥(𝐼 − 𝐿 𝐶0 (𝜂) ) 𝑔˜ − 𝐶1 (𝜂)𝐿 𝐶2 (𝜂) 𝑑 ∥ 22 .                     (2.20)
for a fixed estimate 𝜂, ˜ which is a convex optimization problem. We let 𝑌 = 𝐼 − 𝐿 𝐶0 . By a proof
identical to above, we get
                                                ˜ = 2𝑌 ∗ (𝑌 𝑔˜ − 𝐶1 𝐿 𝐶2 𝑑)
                                         ∇𝑔˜ 𝔏( 𝑔)
                                                            34


Additionally, we have
                          𝑌 ∗ ℎ(𝜔1 , 𝜔2 ) = ℎ(𝜔1 , 𝜔2 ) − 𝐶02 ℎ(𝐶0−1 𝑤 1 , 𝐶0−1 𝑤 2 )).              (2.21)
    Like before, for numerical applications, we only have access to a finite number of samples, so
will have to have to modify our data term and loss function based on the estimator provided in
Model 2:
                     𝑑˜(𝑟, 𝜃) = 4( 𝑔˜ 𝜂 + 𝑔˜ 𝜎 ) ∗ 𝜙 𝐿 (𝑟, 𝜃) + 𝑟 [( 𝑔˜ 𝜂 + 𝑔˜ 𝜎 ) ∗ 𝜕𝑟 𝜙 𝐿 ] (𝑟, 𝜃)
                        ˜ 𝑔)                                   2
                       𝔏(  ˜ = (𝐼 − 𝐿 𝐶0 ) 𝑔˜ − 𝐶1 𝐿 𝐶2 𝑑˜ 2 .
With this surrogate loss and data term, we can conduct numerical experiments in the next section.
2.8   Numerical Experiments for Bispectrum Recovery
    We will now test our bispectrum recovery algorithm on the following signals:
                                                     2
                                 𝑓1 (𝑥) = 𝐴1 𝑒 −5𝑥 cos(4𝑥)
                                                     2
                                 𝑓2 (𝑥) = 𝐴2 𝑒 −5𝑥 cos(8𝑥)
                                                     2
                                 𝑓3 (𝑥) = 𝐴3 𝑒 −5𝑥 cos(12𝑥)
                                 𝑓ˆ4 (𝑥) = 𝐴4 1 [−1/8,1/8] (𝑥)
                                 𝑓5 (𝑥) = 𝐴5 sinc(4𝑥)
                                                
                                                 2 − 2|𝑥|,             |𝑥| < 1
                                                
                                                
                                                
                                 𝑓6 (𝑥) = 𝐴6                                          .
                                                
                                                 0,
                                                                        otherwise
                                                
Like in [27], we define our hidden signals on [−𝑁/4, 𝑁/4], and the noisy signals are defined on
[−𝑁/2, 𝑁/2] with 𝑁 = 25 . In frequency, the signals were sampled on the interval [−2ℓ 𝜋, 2ℓ 𝜋] with
ℓ = 4 and sampling rate of 𝜋/𝑁. The constants 𝐴𝑖 with 𝑖 = 1, . . . , 6 were chosen so that the SNR
                                                     ∫                     
                                                        𝑁/2
for each 𝑓𝑖 was set to be 𝜎 −2 , where SNR = −𝑁/2 | 𝑓 (𝑥)| 2 𝑑𝑥 /𝜎 2 . All signals, other than 𝑓4 , were
generated in space. We chose to generate 𝑓4 in frequency. We chose to generate 𝑓4 in frequency
because of aliasing.
    Regarding our signals, we chose our signals so that they could test the robustness of our proposed
method. The signals 𝑓1 to 𝑓3 are smooth with fast decay, which do not fit the assumptions to employ
                                                          35


our proposed estimators. Nonetheless, 𝑓1 to 𝑓3 still perform well, which is most likely because of
their exponential decay rate. However, we note that as the peak of the power spectrum is farther
from the origin, our problem becomes much harder problem because the dilations cause larger
perturbations. This is shown in Figure 2.4.
    We start with the case of oracle 𝜂 (i.e. 𝜂 is known) and consider two cases: 𝜎 = 0.5 and 𝜎 = 1.0.
Error plots are shown in Figures 2.4 and 2.5. All oracle experiments were run with 𝜂 = 12−1/2 and
                                  𝜎 1/6                                                4
with a gaussian width of 𝐿 = 5( 𝑀   ) for bispectrum recovery and width 𝐿 = 10( 𝜎𝑀 ) 1/6 for power
spectrum recovery.
             (a) 𝑓1                                (b) 𝑓2                           (c) 𝑓3
             (d) 𝑓4                                (e) 𝑓5                           (f) 𝑓6
Figure 2.4 Relative error decay with standard error bars for Bispectrum Recovery using Model 2
under the assumption of oracle 𝜂 = 12−1/2 with 𝜎 = 0.5.
                                                    36


             (a) 𝑓1                             (b) 𝑓2                             (c) 𝑓3
             (d) 𝑓4                             (e) 𝑓5                             (f) 𝑓6
Figure 2.5 Relative error decay with standard error bars for Bispectrum Recovery using Model 2
under the assumption of oracle 𝜂 = 12−1/2 and 𝜎 = 1.0.
    For all the signals, in the case where one only does empirical noise centering on the average
bispectra, which is marked in blue, there is a diminishing return in performance for large enough
𝑀. This is most likely because the blue estimator has a dilation bias, and a large sample size will
not be able to overcome this. It will only be able to overcome the additive noise bias, so eventually
the blue line will plateau. In comparison, our inversion unbiasing procedure,which is marked in
red, demonstrates a continual drop in error for both choices of 𝜎, except for 𝑓5 . Additionally, note
that the red decay line does not plateau, but is linear on the log-log plot, supporting the claim of
Theorem 5 that 𝐵   f𝑓 is truly an unbiased estimator. The poor performance of 𝑓5 stems from the
fact that it does not obey the assumptions of Model 2 since sinc is not a compactly supported
function. The same is true for each of the Gabor functions, but their decay is exponential rather
than polynomial like the sinc.
    Bispectrum recovery examples for 𝜎 = 0.5 are given in Figures 2.6 and 2.7. Note that in all
                                                  37


the examples, once can "undilate" the additive noise unbiased bispectrum average. The results are
similar for 𝜎 = 1.0, but they are not provided in this thesis.
Figure 2.6 Example plots of recovery for 𝑓1 , 𝑓2 , and 𝑓3 under the assumption of oracle 𝜂 = 12−1/2
with 𝜎 = 0.5. Left: ground truth. Middle: Only Additive Noise Unbiasing. Our Unbiasing
Procedure.
                                                  38


Figure 2.7 Example plots of recovery for 𝑓4 , 𝑓5 , and 𝑓6 under the assumption of oracle 𝜂 = 12−1/2
with 𝜎 = 0.5. Left: ground truth. Middle: Only Additive Noise Unbiasing. Our Unbiasing
Procedure.
    We now consider the case where 𝜂 is unknown and estimated. Again, we have two cases:
𝜎 = 0.5, and 𝜎 = 1.0. For estimating 𝜎, since all the signals decay away from the origin, the
values of the power spectrum average should be near zero, sans noise. Thus, we find the variance
of the function values on the edge of the signal. All experiments will be run with 𝜂 = 12−1/2 . Since
we now need to estimate the power spectrum, we let the width for the smoothing parameter for
                                 4
power spectrum recovery be 5( 𝜎𝑀 ) 1/6 and use the smoothing parameter 5( 𝑀 𝜎 1/6
                                                                              ) for the bispectrum
                                                  39


recovery. Note that the smoothing parameter for the power spectrum is smaller than the oracle
                            4             4
case. We found that 10( 𝜎𝑀 ) 1/6 and 5( 𝜎𝑀 ) 1/6 yielded similar results, so we decided to not change
the parameter.
    In Figure 2.8, an error plot based on the number of samples is shown with 𝜎 = 0.5; in Figure
2.9, a similar error plot is shown with 𝜎 = 1.0. With regards to the estimation of 𝜂, the process
given in [26] was unreliable for all sample sizes. Our results show that our modified loss function
yields a more reliable estimate of 𝜂 for large 𝑀. See Figures 2.20 and 2.21.
             (a) 𝑓1                               (b) 𝑓2                             (c) 𝑓3
             (d) 𝑓4                               (e) 𝑓5                             (f) 𝑓6
Figure 2.8 Relative error decay with standard error bars for Bispectrum Recovery using Model 2
without prior knowledge of 𝜂 and with 𝜎 = 0.5.
                                                   40


             (a) 𝑓1                             (b) 𝑓2                           (c) 𝑓3
            (d) 𝑓4                              (e) 𝑓5                           (f) 𝑓6
Figure 2.9 Relative error decay with standard error bars for Bispectrum Recovery using Model 2
without prior knowledge of 𝜂 and with 𝜎 = 1.0.
    Our results indicate that we can recovery the bispectrum with a reasonable degree of accuracy.
The question is: do we have enough accuracy for full signal inversion?
2.9   Bispectrum Inversion and Hidden Signal Recovery
    For hidden signal recovery, the tools we will need are a power spectrum recovery algorithm,
a bispectrum recovery algorithm, and a bispectrum inversion algorithm. We use [26] for power
spectrum recovery and 𝜂 estimation, our proposed bispectrum recovery algorithm, and the iterative
phase synrchonization algorithm [7] for bispectrum inversion. The general outline of the recovery
and inversion process is in Algorithm 2.2. The idea behind algorithm 2.2 is that we have already
recovered the magnitudes of the target signal via power spectrum recovery. Thus, if we are able to
recover the corresponding phase measurements, will recover the original signal.
    The algorithm we use for phase recovery is iterative phase synchronization. Suppose that the
                                                 41


bispectrum of our hidden signal is given by 𝐵, and our estimate of the phases for 𝐵 is given by 𝐵.              ˜
Suppose we have an estimate of the phase, 𝑦˜ , say 𝑦˜ 𝑘−1 that is close to the ground truth. Then we
should have 𝐵˜ ◦𝑇 ( 𝑦˜ 𝑘−1 ) ≈ 𝑦˜ 𝑘−1 𝑦˜ ∗𝑘−1 , where 𝑇 (𝑦) is the circulant matrix for the vector 𝑦 and ◦ is the
elementwise product of two matrices. We then approximate the phases via solving the optimization
problem:
                                            argmax𝑧∈C𝑛 Re{𝑧∗ 𝐵 ◦ 𝑇 (𝑦)𝑧}
subject to |𝑧[ℓ]| = 1∀ℓ. However, this solution is incorrect by a global phase. We need additional
information, namely 𝑓ˆ(0), or some estimate of it.
     Algorithm 2.1 describes the iterative phase synchronization algorithm:
Algorithm 2.1 Iterative Phase Synchronization
  1: INPUT: normalized bispectrum 𝐵,            ˆ estimation of phase of 𝑓ˆ(0), given by 𝑦¯ (0).
  2: OUTPUT: Phase of Signal, 𝑦˜ .
  3: Let 𝑘 = 0.
  4: while Stopping Criterion does not occur do
  5:   Increase 𝑘 by 1.
                        𝑦ˆ 𝑘 ←− argmax𝑧∈C𝑛 Re{𝑧∗ 𝐵ˆ ◦ 𝑇 ( 𝑦ˆ 𝑘 )𝑧} subject to 𝑧[ℓ]| = 1 ∀ℓ.
  6:
                                                                     𝑦˜ (0)
                                                      𝑦ˆ 𝑘 ←− 𝑦ˆ 𝑘 ·        .
                                                                     𝑦ˆ (0)
  7:   If the signal is real, symmetrize it.
  8: end while
     Other methods, such as frequency marching [7], local non-convex optimization over the man-
ifold of phases [7], and semidefinite relaxation [4]. Frequency marching and iterative phase
synchronization have been found the have similar empirical performance. However, iterative phase
synchronization requires less assumptions, namely one does not need 𝑓ˆ(1) ≠ 0. We have found
that local non-convex optimization over the manifold of phases did not yield good results in our
experiments, and semidefinite programming was infeasible due to memory requirements.
                                                           42


Algorithm 2.2 Hidden Signal Recovery Algorithm
  1: INPUTS: noisy signals {𝑦 𝑗 }
  2: Calculate 𝑓˜ = 𝑗 𝑦ˆ 𝑗 , 𝑝˜𝜂 + 𝑝˜ 𝜎 , 𝑔˜ 𝜂 + 𝑔˜ 𝜎 .
                        Í
  3: Estimate the additive noise level, 𝜎,          ˜ using the power spectrum on the edge of the signal.
  4: Perform additive noise centering on the power spectrum and bispectrum.
  5: if Eta Known then
  6:    Recover the power spectrum via solving the convex optimization problem argmin 𝑝˜ ∥(𝐼 −
        𝑆𝐶0 ) 𝑝˜ − 𝐶1 𝑆𝐶2 𝑝 data ∥ 22 .
  7: else if Eta Unknown then
  8:    Estimate the    power      spectrum and 𝜂 via solving the nonconvex optimization problem
                      ∥ 𝐼−𝑆𝐶0 ( 𝜂)  ˜ 1 ( 𝜂)𝑆                2
                                ˜ 𝑝−𝐶             ˜ 𝑝 data ∥ 2
                                           ˜ 𝐶2 ( 𝜂)
        argmin 𝑝,˜ 𝜂˜                   𝜂2
                                                               .
  9: end if
10: Recover the bispectrum via solving the convex optimization problem argmin𝑔˜ ∥(𝐼 − 𝐿 𝐶0 ) 𝑔˜ −
     𝐶1 𝐿 𝐶2 𝑑 ∥ 22 .
11: Apply APS to recover the original signal using 𝑓˜(0), the recovered power spectrum, and the
     recovered bispectrum.
     To illustrate the difficulty of this process, we provide a ground truth example for 𝑓5 in Figure
2.10 and four observations of 𝑓5 in Figure 2.11. One can see that the observations do not resemble
the actual signal, even at a low noise levels.
                                        Figure 2.10 Ground Truth Signal for 𝑓5 .
                                                                 43


                       (a) 𝑓3
  Figure 2.11 Corrupted Samples for 𝑓5 . The top row is 𝜎 = 0.5 and the bottom row is 𝜎 = 1.0.
    We test our results on the signals from the previous section with 𝑀 = 220 samples with
𝜂 = 12−1/2 . We start with an oracle 𝜂. The results for 𝜎 = 0.5 are given in Figure 2.12 and example
inversion plots are given in Figure 2.13.
                                                 44


              (a) 𝑓1                              (b) 𝑓2                               (c) 𝑓3
              (d) 𝑓4                              (e) 𝑓5                               (f) 𝑓6
Figure 2.12 Relative error decay with standard error bars for bispectrum inversion using Model 2
under the assumption of oracle 𝜂 = 12−1/2 with 𝜎 = 0.5.
    The blue lines are for inversion using additive noise centered versions of the average bispectrum
and average power spectrum, and the red lines are using our inversion unbiasing procedure. One
can see that there is a huge gain in performance for the first five signals. However, for 𝑓6 , there is
no gain in performance. We hypothesize this occurs because the power spectrum and bispectrum
of 𝑓6 have a peak around the origin. Dilating the power spectrum and bispectrum will not have a
large effect on the shape of the signal, since their support is restricted to very low frequencies. This
behavior is observed in Figure 2.7 as well.
                                                   45


                (a) 𝑓1                            (b) 𝑓2                             (c) 𝑓3
                (d) 𝑓4                            (e) 𝑓5                             (f) 𝑓6
Figure 2.13 Bispectrum inversion results using Model 2 under the assumption of oracle 𝜂 = 12−1/2
with 𝜎 = 0.5. "Rec" is using our inversion unbiasing procedure and "NO UB" is using the the
centered averages. "RE" stands for the relative error.
      Note that while the relative error between the signals is similar in a few cases, such as 𝑓1 and
 𝑓5 , the quality of the recovery results cannot be fully judged by using relative error. For instance,
in 𝑓1 and 𝑓5 , one can see that the general shape of the signals is incorrect without the inversion
unbiasing, but the relative error is low because the recovered signal is relatively smooth. On the
other hand, with the inversion unbiasing, the recovered signal has the right general shape, but is
not very smooth. Also, with regards to 𝑓5 , the assumptions of the model are not met, so the drop
in performance is expected. For this specific run, for the high frequency signal ( 𝑓3 ), the recovery
error decreased from 0.45 without unbiasing to 0.09 with unbiasing. Additionally for 𝑓2 and 𝑓4 ,
the unbiasing procedure decreased the error error by over 50%. For 𝑓1 and 𝑓6 we do not see much
improvement, but that is as expected since they are supported in the low frequencies; for 𝑓5 there is a
moderate improvement, but the bispectrum estimation was less reliable because model assumptions
were not met.
                                                   46


    Now we consider the case where 𝜎 = 1.0. The error decay plots are given in Figure 2.14 and
corresponding inversion plots are given in Figure 2.15.
             (a) 𝑓1                              (b) 𝑓2                          (c) 𝑓3
             (d) 𝑓4                              (e) 𝑓5                          (f) 𝑓6
Figure 2.14 Relative error decay with standard error bars for bispectrum inversion using Model 2
under the assumption of oracle 𝜂 = 12−1/2 with 𝜎 = 1.0.
    Notice that the gap in relative error between not using the unbiasing procedure and using the
unbaising procedure has decreased. This suggests that the bispectrum inversion algorithm we have
used is sensitive to noise. This also explains the slight performance drop we have observed.
                                                  47


               (a) 𝑓1                             (b) 𝑓2                             (c) 𝑓3
               (d) 𝑓4                             (e) 𝑓5                             (f) 𝑓6
Figure 2.15 Bispectrum inversion results using Model 2 under the assumption of oracle 𝜂 = 12−1/2
with 𝜎 = 1.0.
    Surprisingly, our error has not increased too much except in 𝑓4 . For 𝑓4 , we notice that noise in
the signal is most notable around the discontinuities of the function. There is a similar phenomenon
in 𝑓6 . At the end of each triangle and the peak of the triangle, we observe more noise. This suggests
that the bispectrum inversion algorithm has a hard time learning discontinuities of the signal, which
is expected.
    Now, we consider the case where 𝜂 is unknown and estimated again. For 𝜎 = 0.5, the error
decay plots are in Figure 2.16 and the inversion results are in Figure 2.17. For 𝜎 = 1.0, the error
decay plots are in Figure 2.18 and the inversion results are in Figure 2.19. Surprisingly, the results
are similar to the corresponding oracle plots in many cases.
                                                   48


            (a) 𝑓1                            (b) 𝑓2                            (c) 𝑓3
            (d) 𝑓4                            (e) 𝑓5                            (f) 𝑓6
Figure 2.16 Relative error decay with standard error bars for bispectrum inversion using Model 2
with 𝜎 = 0.5 and no prior knowledge of 𝜂.
                                               49


            (a) 𝑓1                            (b) 𝑓2                           (c) 𝑓3
            (d) 𝑓4                            (e) 𝑓5                           (f) 𝑓6
Figure 2.17 Bispectrum inversion results using Model 2 with 𝜎 = 0.5 and no prior knowledge of 𝜂.
                                               50


            (a) 𝑓1                            (b) 𝑓2                            (c) 𝑓3
            (d) 𝑓4                            (e) 𝑓5                            (f) 𝑓6
Figure 2.18 Relative error decay with standard error bars for bispectrum inversion using Model 2
with 𝜎 = 1.0 and no prior knowledge of 𝜂.
                                               51


            (a) 𝑓1                            (b) 𝑓2                           (c) 𝑓3
            (d) 𝑓4                            (e) 𝑓5                           (f) 𝑓6
Figure 2.19 Bispectrum inversion results using Model 2 with 𝜎 = 1.0 and no prior knowledge of 𝜂.
                                               52


    Finally, here are error plots for the estimation of 𝜂 in the empirical case. Figure 2.20 is for
𝜎 = 0.5 and Figure 2.21 is for 𝜎 = 1.0. Note that the improvement tends to plateau once we have
more than 214 samples for most of our cases. Note that the 𝜂 estimation was subpar for 𝑓4 . We
believe this is because the signal is not continuous, but we have not done rigorous tests to confirm
this. Nonetheless, the bispectrum recovery results were still adequate in the oracle case.
               (a) 𝑓1                            (b) 𝑓2                           (c) 𝑓3
              (d) 𝑓4                             (e) 𝑓5                           (f) 𝑓6
                     Figure 2.20 Mean Relative Error in estimating 𝜂 with 𝜎 = 0.5.
                                                  53


(a) 𝑓1                           (b) 𝑓2                           (c) 𝑓3
(d) 𝑓4                           (e) 𝑓5                            (f) 𝑓6
      Figure 2.21 Mean Relative Error in estimating 𝜂 with 𝜎 = 1.0.
                                  54


2.10    Conclusions and Future Work
     In this chapter, we have provided a solution for the dilation MRA model by recovering the
bispectrum of the noisy observations. After recovering the bispectrum, we perform full signal
inversion and observe competitive results compared to other methods without dilations.
     One natural extension is to work in two dimensions. That is, we want to recover a signal
 𝑓 : R2 → R from many noisy observations that have been randomly translated, dilated, rotated,
and corrupted by additive noise. Define the rotation matrix
                                                        ©cos 𝜃 − sin 𝜃 ª
                                                 𝑅𝜃 = ­­                     ®.
                                                                             ®                        (2.22)
                                                           sin 𝜃 cos 𝜃
                                                        «                    ¬
     A formal description is now given by
Model 4. Suppose we have 𝑀 independent observations of a function 𝑓 ∈ 𝐿 2 (R2 ) defined by
                𝑦 𝑗 (𝑥) = 𝑓 (𝑅𝜃−1𝑗 (1 − 𝜏 𝑗 ) −1 (𝑥 − 𝑡 𝑗 )) + 𝜀 𝑗 (𝑥) := 𝑓 𝑗 (𝑥) + 𝜀 𝑗 (𝑥), 1≤ 𝑗 ≤𝑀
Furthermore, assume that
     • supp( 𝑓 𝑗 ) ⊂ [− 12 , 12 ] 2 for 1 ≤ 𝑗 ≤ 𝑀.
     • {𝑡 𝑗 } 𝑀
              𝑗=1 are independent samples of a random variable 𝑡 ∈ R.
     • {𝜃 𝑗 } 𝑀𝑗=1 are independent samples of a uniformly distributed random variable 𝜃 ∈ [−𝜋, 𝜋).
     • {𝜏 𝑗 } 𝑀
              𝑗=1 are independent samples of a uniformly distributed random variable 𝜏 ∈ R satisfying
                                               1
       E[𝜏] = 0 and Var(𝜏) = 𝜂2 ≤             12 .
     • {𝜀 𝑗 (𝑥)} 𝑀                                                               1 1 2
                   𝑗=1 are independent white noise processes on [− 2 , 2 ] with variance 𝜎 .
                                                                                                    2
Can we recover 𝑓 ?
     The main difficulty in this model comes from the addition of the rotations, like mentioned in
previous sections. Even the noiseless case itself is difficult because one cannot simply align the
signals anymore. Intuitively, one has to first rotate all the signals in place, but this is not very
feasible. The corresponding noiseless model is given by
Model 5. Suppose we have 𝑀 independent observations of a function 𝑓 ∈ 𝐿 2 (R2 ) defined by
                                  𝑓 𝑗 (𝑥) = 𝑓 (𝑅𝜃−1𝑗 (1 − 𝜏 𝑗 ) −1 (𝑥 − 𝑡 𝑗 ))  1≤ 𝑗 ≤𝑀
                                                               55


Furthermore, assume that
    • supp( 𝑓 𝑗 ) ⊂ [− 12 , 12 ] 2 for 1 ≤ 𝑗 ≤ 𝑀.
    • {𝑡 𝑗 } 𝑀
             𝑗=1 are independent samples of a random variable 𝑡 ∈ R.
    • {𝜃 𝑗 } 𝑀𝑗=1 are independent samples of a uniformly distributed random variable 𝜃 ∈ [−𝜋, 𝜋).
    • {𝜏 𝑗 } 𝑀
             𝑗=1 are independent samples of a uniformly distributed random variable 𝜏 ∈ R satisfying
                                            1
      E[𝜏] = 0 and Var(𝜏) = 𝜂2 ≤           12 .
Can we recover 𝑓 ?
   We will start with Model 4 and generalize our results to Model 5, like in the one dimensional
case.
                                                   56


                                             CHAPTER 3
                 NONLINEAR HEEGER-BERGEN TEXTURE SYNTHESIS
3.1   Background on Texture Synthesis
    Texture synthesis is the process of generating an image from a reference image by taking
advantage of its statistical properties. The new texture should have similar qualities as the reference
texture, but look different. In other words, for example, the same reference texture, a rotation of
the reference texture, or a rearrangement of the reference texture should not be the result after
synthesis. An example of texture synthesis is given in Figure 3.1.
Figure 3.1 "Example" is the reference texture and "Generated" is the synthesized texture. Note that
the images have the same perceptual qualities, but are not the same image.
    While this task can be phrased simply, one difficulty in this task is that traditional image metrics
do not lead to good synthesis results. For example, if one were to minimize the MSE loss between
a reference image 𝐼 𝑅 and synthesized image 𝐼𝑆 via solving the optimization problem
                                        𝐼𝑆 = argmin 𝐼ˆ𝑆 ∥ 𝐼ˆ𝑆 − 𝐼 𝑅 ∥ 22 ,
the result would be same texture. To illustrate why such a metric is difficult, we will provide some
examples.
    Consider Figure 3.2. The synthesis in the last two columns have similar quality compared to
the reference, and it’s clear that both textures are not simply repetitions of the reference texture.
                                                  57


  Figure 3.2 Left: Original Texture. Middle: Synthesis using [24]. Right: Synthesis using [20].
One could argue that dots are more solid in the middle column, which matches with the reference
texture, so the synthesis is better. However, this is a subjective argument without any quantitative
backing.
    Additionally, consider Figure 3.3. One can see that the middle column has worse alignment
compared to the reference texture on the left, so it’s clear that the synthesis on the right is higher
quality since there is no repetition.
  Figure 3.3 Left: Original Texture. Middle: Synthesis using [24]. Right: Synthesis using [20].
    The synthesis in the right column of Figure 3.3 captures what we will call "long range con-
straints" in an image. That is, it is able to capture macroscopic features that the middle column
of Figure 3.3 cannot capture. To approach this problem, there are two factors to consider. First,
can one find a representation that captures texture? If so, what is a good measure for measuring
similarity between textures? One needs to account for long range constraints without creating a
                                                  58


texture too similar to the original reference texture.
    One approach, which we will generalize in this chapter, is to update a white noise by matching
the histograms of wavelet coefficients between a reference image and the white noise. First, we will
need to introduce the filter bank we used and the wavelet transform.
3.2   Filter Construction
    In this section we describe the filter bank used in the Heeger-Bergen algorithm. This is the
analytic description of the filters that does not rely on down sampling. We are following the
presentation in [10], but some of the notation is changed so care should be taken when comparing
the two. All filters are defined in frequency in the frequency box [−𝜋, 𝜋] 2 := [−𝜋, 𝜋] × [−𝜋, 𝜋]
using polar coordinates.
    Write 𝜔 = (𝜔1 , 𝜔2 ) ∈ [−𝜋, 𝜋) 2 as
                            𝜔 = (𝑟, 𝜃)
                                 √︃
                            𝑟 := 𝜔21 + 𝜔22
                                 
                                                        𝜔1 ≤ 0 and 𝜔2 = 0
                                 
                                  𝜋
                                 
                                 
                            𝜃 :=                                         .
                                                𝜔2
                                 
                                 
                                   2 arctan   𝜔1 +𝑟    otherwise
                                 
We begin by defining some “building block” functions. The first is a low frequency radial function
𝐿 : [0, ∞) → R defined as
                                           
                                              1                   𝑟 ≤ 𝜋/2
                                           
                                           
                                           
                                           
                                           
                                           
                                                          
                      ∀ 𝑟 ≥ 0, 𝐿(𝑟) := cos 𝜋2 log2 2𝑟𝜋             𝜋/2 < 𝑟 ≤ 𝜋 .
                                           
                                           
                                           
                                                                  𝑟≥𝜋
                                           
                                            0
                                           
                                           
The second is a high frequency function 𝐻 : [0, ∞) → R defined as
                                            
                                               0                  𝑟 ≤ 𝜋/2
                                            
                                            
                                            
                                            
                                            
                                            
                                            
                       ∀ 𝑟 ≥ 0, 𝐻 (𝑟) := cos 𝜋2 log2 𝜋𝑟
                                                              
                                                                 𝜋/2 < 𝑟 ≤ 𝜋 .
                                            
                                            
                                             1                   𝑟≥𝜋
                                            
                                            
                                            
One can verify that 𝐿 and 𝐻 satisfy the following important property:
                                  |𝐿(𝑟)| 2 + |𝐻 (𝑟)| 2 = 1,   ∀ 𝑟 ≥ 0.                         (3.1)
                                                     59


                         Figure 3.4 Plot of 𝐿(𝑟), 𝐻 (𝑟), and |𝐿(𝑟)| 2 + |𝐻 (𝑟)| 2 .
In fact, the cosine part for 𝐿 (𝑟) can be replaced with a different decreasing function, and the cosine
part of 𝐻 (𝑟) can be replaced with a different increasing function, so long as (3.1) holds. Figure 3.4
plots 𝐿(𝑟) and 𝐻 (𝑟) and verifies, numerically, equation (3.1).
Remark 1. Looking at the plots in Figure 3.4, indeed we may want to use a different form for 𝐿 (𝑟)
and 𝐻 (𝑟), as they are not very smooth at the point where they equal zero.
    We also define dilations of 𝐿 and 𝐻. They are given by:
                                                 
                                                    1                   𝑟 ≤ 𝜋/2 𝑗+1
                                                 
                                                 
                                                 
                                                 
                                                 
                                                 
                                                              𝑗+1 
             ∀ 𝑗 ∈ Z,   𝐿 𝑗 (𝑟) := 𝐿(2 𝑗 𝑟) = cos 𝜋2 log2 2 𝜋 𝑟          𝜋/2 𝑗+1 < 𝑟 ≤ 𝜋/2 𝑗 ,
                                                 
                                                 
                                                 
                                                                        𝑟 ≥ 𝜋/2 𝑗
                                                 
                                                  0
                                                 
                                                 
and
                                                  
                                                      0                 𝑟 ≤ 𝜋/2 𝑗+1
                                                  
                                                  
                                                  
                                                  
                                                  
                                                  
                                                               𝑗 
             ∀ 𝑗 ∈ Z, 𝐻 𝑗 (𝑟) := 𝐻 (2 𝑗 𝑟) = cos 𝜋2 log2 2𝜋𝑟            𝜋/2 𝑗+1 < 𝑟 ≤ 𝜋/2 𝑗 .
                                                  
                                                  
                                                  
                                                                        𝑟 ≥ 𝜋/2 𝑗
                                                  
                                                   1
                                                  
                                                  
We remark that it follows from (3.1) that
                              |𝐿 𝑗 (𝑟)| 2 + |𝐻 𝑗 (𝑟)| 2 = 1, ∀ 𝑟 ≥ 0, ∀ 𝑗 ∈ Z.
    Now we define a family of functions for the angular variable 𝜃. Let 𝑄 be the number of such
functions, which will later correspond to the number of directional filters at each scale. We define
                                                         60


                                                           |𝐺 𝑞 (𝜃)| 2 for 0 ≤ 𝑞 < 𝑄 = 4.
                                                     Í𝑄−1
                 Figure 3.5 Plots of 𝐺 𝑞 (𝜃) and       𝑞=0
∀ 0 ≤ 𝑞 < 𝑄, ∀ 𝜃 ∈ (−𝜋, 𝜋],
                                𝑄−1                                          𝑄−1                  !
                            𝜋𝑞                                     𝜋(𝑞 − 𝑄)
    𝐺 𝑞 (𝜃) := 𝛼𝑄 cos 𝜃 −             1|𝜃−𝜋𝑞/𝑄|≤𝜋/2 + cos 𝜃 −                        1|𝜃−𝜋(𝑞−𝑄)/𝑄|≤𝜋/2 ,
                             𝑄                                           𝑄
where
                                                        (𝑄 − 1)!
                                      𝛼𝑄 := 2𝑄−1 √︁                    .
                                                      𝑄(2(𝑄 − 1))!
One can verify that
                                  𝑄−1
                                  ∑︁
                                       |𝐺 𝑞 (𝜃)| 2 = 1,   ∀ 𝜃 ∈ (−𝜋, 𝜋].                               (3.2)
                                  𝑞=0
Figure 3.5 plots the functions 𝐺 𝑞 (𝜃) and verifies (3.2) numerically. Similarly to the 𝐿 (𝑟) and 𝐻 (𝑟)
functions, the functions 𝐺 𝑞 (𝜃) can be changed so long as (3.2) is satisfied.
    Now we use the functions 𝐿(𝑟), 𝐻 (𝑟), and 𝐺 𝑞 (𝜃) to build our filters. We construct three types
of filters:
     • A high frequency filter ℎ : R2 → R
     • Directional wavelet filters 𝜓 𝑞 : R2 → R
     • A low frequency filter ℓ : R2 → R
The high pass filter ℎ(𝑢) is defined through its Fourier transform b      ℎ(𝑟, 𝜃) as
                             ∀ 𝑟 ≥ 0, ∀ 𝜃 ∈ (−𝜋, 𝜋],        ℎ(𝑟, 𝜃) := 𝐻 (𝑟),
                                                            b
and the low pass filter ℓ(𝑢) is defined analogously,
                             ∀ 𝑟 ≥ 0, ∀ 𝜃 ∈ (−𝜋, 𝜋],         b 𝜃) := 𝐿(𝑟).
                                                             ℓ(𝑟,
                                                     61


The directional filters 𝜓 𝑞 (𝑢) are defined as
                           ∀ 𝑟 ≥ 0, ∀ 𝜃 ∈ (−𝜋, 𝜋],         𝜓b𝑞 (𝑟, 𝜃) := 𝐿 0 (𝑟)𝐻1 (𝑟)𝐺 𝑞 (𝜃).
One may verify that all the filters are symmetric through the origin (i.e., ℎ(−𝑢) = ℎ(𝑢), ℓ(−𝑢) =
ℓ(𝑢), and 𝜓 𝑞 (−𝑢) = 𝜓 𝑞 (𝑢)) and real valued in space.
     We now define the wavelet transform of an image 𝑥 ∈ L2 (R2 ) with its frequency support
contained in [−𝜋, 𝜋] 2 , i.e., supp(b     𝑥 ) ⊆ [−𝜋, 𝜋] 2 . Let 𝐽 ≥ 1 be the number of scales and let 𝑄 ≥ 1
be the number of orientations. A dilation of a filter 𝑓 : R2 → C is defined as
                                  ∀ 𝑢 ∈ R2 , ∀ 𝑗 ∈ Z,         𝑓 𝑗 (𝑢) := 2−2 𝑗 𝑓 (2− 𝑗 𝑢).
We remark that the Fourier transform of a dilated filter satisfies:
                       𝑓 𝑗 (𝑟, 𝜃) = b
                       b            𝑓 𝑗 (𝜔) = b𝑓 (2 𝑗 𝜔) = b 𝑓 (2 𝑗 𝑟, 𝜃),   ∀ 𝜔 ∈ R2 , ∀ 𝑗 ∈ Z.
Using this notation define the filter bank F𝐽,𝑄 as
                                 F𝐽,𝑄 := {ℎ , ℓ𝐽 , 𝜓 𝑗,𝑞 : 0 ≤ 𝑗 < 𝐽, 0 ≤ 𝑞 < 𝑄},
where ℓ𝐽 (𝑢) is the dilation of the low pass filter ℓ(𝑢), which means:
                  ℓ𝐽 (𝑢) = 2−2𝐽 ℓ(2−𝐽 𝑢)        =⇒                     b 𝐽 𝑟, 𝜃) = 𝐿 (2𝐽 𝑟) = 𝐿 𝐽 (𝑟),
                                                        ℓb𝐽 (𝑟, 𝜃) = ℓ(2
and 𝜓 𝑗,𝑞 (𝑢) is the dilation of the directional filter 𝜓 𝑞 (𝑢), meaning that:
           𝜓 𝑗,𝑞 (𝑢) = 2−2 𝑗 𝜓 𝑞 (2− 𝑗 𝑢)     =⇒      b𝑗,𝑞 (𝑟, 𝜃) = 𝜓
                                                      𝜓                b𝑞 (2 𝑗 𝑟, 𝜃) = 𝐿 𝑗 (𝑟)𝐻 𝑗+1 (𝑟)𝐺 𝑞 (𝜃).
Figure 3.6 plots the high frequency residual filter ℎ(𝑢) and the low frequency filter ℓ𝐽 (𝑢) for
𝐽 = 4, along with their Fourier transforms. Figure 3.7 plots the Fourier transforms 𝜓                   b𝑗,𝑞 (𝜔) of the
directional band-pass wavelets for 0 ≤ 𝑗 < 𝐽 = 4 and 0 ≤ 𝑞 < 𝑄 = 4, while Figure 3.8 plots
𝜓 𝑗,𝑞 (𝑢) in space for 0 ≤ 𝑗 < 𝐽 = 4 and 0 ≤ 𝑞 < 𝑄 = 4.
                                                             62


                          (a) b
                              ℎ(𝜔)                                    (b) ℎ(𝑢)
                    (c) ℓb𝐽 (𝜔) for 𝐽 = 4                        (d) ℓ𝐽 (𝑢) for 𝐽 = 4
Figure 3.6 The high frequency residual filter ℎ(𝑢) and the low frequency filter ℓ𝐽 (𝑢) with 𝐽 = 4,
and their Fourier transforms b   ℎ(𝜔) and ℓb𝐽 (𝜔), respectively.
                                                     63


                                  b𝑗,𝑞 (𝜔) of the directional wavelets for 0 ≤ 𝑗 < 𝐽 = 4 and
Figure 3.7 The Fourier transforms 𝜓
0 ≤ 𝑞 < 𝑄 = 4.
                                              64


Figure 3.8 The directional wavelets 𝜓 𝑗,𝑞 (𝑢) for 0 ≤ 𝑗 < 𝐽 = 4 and 0 ≤ 𝑞 < 𝑄 = 4.
                                         65


    Define the wavelet transform of 𝑥(𝑢) as the map 𝑊𝐽,𝑄 : L2 (R2 ) → L2 (R2 ) 𝐽𝑄+2 , where
                    𝑊𝐽,𝑄 𝑥 := {𝑥 ∗ ℎ , 𝑥 ∗ ℓ𝐽 , 𝑥 ∗ 𝜓 𝑗,𝑞 : 0 ≤ 𝑗 < 𝐽, 0 ≤ 𝑞 < 𝑄}.
Notice that 𝑊𝐽,𝑄 takes an image 𝑥(𝑢) and returns 𝐽𝑄 + 2 images, which we refer to as the wavelet
coefficients of 𝑥. Let ∥𝑥∥ denote the L2 (R2 ) norm of 𝑥,
                                                      ∫
                                                 2
                                             ∥𝑥∥ =       |𝑥(𝑢)| 2 𝑑𝑢.
                                                       R
The norm of 𝑊𝐽,𝑄 𝑥 is defined as
                                                                   𝐽−1 𝑄−1
                                                                   ∑︁   ∑︁
                                  2              2
                       ∥𝑊𝐽,𝑄 𝑥∥ := ∥𝑥 ∗ ℎ∥ + ∥𝑥 ∗ ℓ𝐽 ∥ +       2
                                                                            ∥𝑥 ∗ 𝜓 𝑗,𝑞 ∥ 2 .
                                                                    𝑗=0 𝑞=0
This particular wavelet transform has many nice properties due to the filter construction. We collect
them in the next theorem.
Theorem 8. The filter bank F𝐽,𝑄 satisfies the following Littlewood-Paley condition:
                                              𝐽−1 𝑄−1
                                             ∑︁   ∑︁
                    ℎ(𝜔)| 2 + | ℓb𝐽 (𝜔)| 2 +
                   |b                                  b𝑗,𝑞 (𝜔)| 2 = 1,
                                                      |𝜓                    ∀ 𝜔 ∈ [−𝜋, 𝜋] 2 .          (3.3)
                                              𝑗=0 𝑞=0
Therefore, for any 𝑥 ∈ L2 (R2 ) with supp(b     𝑥 ) ⊆ [−𝜋, 𝜋] 2 , the wavelet transform 𝑊𝐽,𝑄 is an isometry,
                                                ∥𝑊𝐽,𝑄 𝑥∥ = ∥𝑥∥,
and furthermore has the following inverse:
                                                             𝐽−1 𝑄−1
                                                            ∑︁   ∑︁
                         𝑥 = 𝑥 ∗ ℎ ∗ ℎ + 𝑥 ∗ ℓ𝐽 ∗ ℓ𝐽 +                𝑥 ∗ 𝜓 𝑗,𝑞 ∗ 𝜓 𝑗,𝑞 .              (3.4)
                                                             𝑗=0 𝑞=0
Proof. First, convert everything to polar coordinates. Use condition (3.2) to simplify the third
term, condition (3.1) for the other terms, and also use the definition of 𝐿 𝑗 (𝑟) to observe that
                                                        66


𝐿 𝑗 (𝑟)𝐿 𝑗+1 (𝑟) = 𝐿 𝑗+1 (𝑟), which results in a telescoping sum:
                                               𝐽−1 𝑄−1
                                               ∑︁    ∑︁
                     |b      2
                      ℎ(𝜔)| + | ℓb𝐽 (𝜔)| +2              b𝑗,𝑞 (𝜔)| 2
                                                        |𝜓
                                                𝑗=0 𝑞=0
                                                   𝐽−1 𝑄−1
                                                   ∑︁  ∑︁
                                2             2
                      = |𝐻 (𝑟)| + |𝐿 𝐽 (𝑟)| +                |𝐿 𝑗 (𝑟)| 2 |𝐻 𝑗+1 (𝑟)| 2 |𝐺 𝑞 (𝜃)| 2
                                                   𝑗=0 𝑞=0
                                                   𝐽−1                             𝑄−1
                                                   ∑︁                             © ∑︁
                                2
                      = |𝐻 (𝑟)| + |𝐿 𝐽 (𝑟)| + 2                   2            2
                                                       |𝐿 𝑗 (𝑟)| |𝐻 𝑗+1 (𝑟)| | ­          𝐺 𝑞 (𝜃)| 2 ®
                                                                                                     ª
                                                   𝑗=0                            « 𝑞=0              ¬
                                                   𝐽−1
                                                   ∑︁
                      = |𝐻 (𝑟)| 2 + |𝐿 𝐽 (𝑟)| 2 +      |𝐿 𝑗 (𝑟)| 2 |𝐻 𝑗+1 (𝑟)| 2
                                                   𝑗=0
                                                       𝐽−1
                                                       ∑︁
                                    2              2
                      = 1 − |𝐿 (𝑟)| + |𝐿 𝐽 (𝑟)| +           |𝐿 𝑗 (𝑟)| 2 (1 − |𝐿 𝑗+1 (𝑟)| 2 )
                                                       𝑗=0
                                                       𝐽−1
                                                       ∑︁
                                    2              2
                      = 1 − |𝐿 (𝑟)| + |𝐿 𝐽 (𝑟)| +           |𝐿 𝑗 (𝑟)| 2 − |𝐿 𝑗 (𝑟)| 2 |𝐿 𝑗+1 (𝑟)| 2
                                                       𝑗=0
                                                       𝐽−1
                                                       ∑︁
                      = 1 − |𝐿 (𝑟)| 2 + |𝐿 𝐽 (𝑟)| 2 +       |𝐿 𝑗 (𝑟)| 2 − |𝐿 𝑗+1 (𝑟)| 2
                                                       𝑗=0
                      = 1 − |𝐿 (𝑟)| 2 + |𝐿 𝐽 (𝑟)| 2 + |𝐿 (𝑟)| 2 − |𝐿 𝐽 (𝑟)| 2
                      = 1.
To prove the isometry property, multiply both sides of (3.3) by | 𝑥(𝜔)|                 ˆ      2 , integrate, and apply
Plancherel formula.
     To prove the inversion formula, take the Fourier Transform of
                                                               𝐽−1 𝑄−1
                                                               ∑︁   ∑︁
                            𝑔 = 𝑥 ∗ ℎ ∗ ℎ + 𝑥 ∗ ℓ𝐽 ∗ ℓ𝐽 +                 𝑥 ∗ 𝜓 𝑗,𝑞 ∗ 𝜓 𝑗,𝑞 .
                                                                𝑗=0 𝑞=0
                                                         67


Then
                                                                        𝐽−1 𝑄−1
                                                                        ∑︁    ∑︁
               ˆ
               𝑔(𝜔)   = 𝑥(𝜔)
                         ˆ    ˆ
                              ℎ(𝜔)  ˆ
                                    ℎ(𝜔)   + 𝑥(𝜔)
                                               ˆ     ˆ ℓ(𝜔)
                                                     ℓ(𝜔)    ˆ     +               ˆ
                                                                                  𝑥(𝜔)     𝜓ˆ 𝑗,𝑞 (𝜔) 𝜓ˆ 𝑗,𝑞 (𝜔)
                                                                        𝑗=0 𝑞=0
                                                         𝐽−1 𝑄−1
                                                          ∑︁   ∑︁                    
                                        2            2                             2
                                                                                    
                      = 𝑥(𝜔)
                         ˆ        ˆ
                               | ℎ(𝜔)|       ˆ
                                          + | ℓ(𝜔)|    +             | 𝜓ˆ 𝑗,𝑞 (𝜔)| 
                              
                                                         𝑗=0 𝑞=0                    
                                                                                    
                      = 𝑥(𝜔).
                         ˆ
Taking inverse Fourier Transform gives result.                                                                      □
     It turns out that the inverse formula (3.4) is based on the adjoint of the wavelet transform. Let
us explain. For 𝑥, 𝑦 ∈ L2 (R2 ) define their inner product as:
                                                       ∫
                                          ⟨𝑥, 𝑦⟩ :=         𝑥(𝑢)𝑦(𝑢) 𝑑𝑢,                                         (3.5)
                                                        R2
where 𝑥(𝑢) is the complex conjugate of 𝑥(𝑢). Using this inner product we can define an inner
product on L2 (R2 ) 𝐽𝑄+2 . Write 𝐹 ∈ L2 (R2 ) 𝐽𝑄+2 as
                              𝐹 = { 𝑓 ℎ , 𝑓𝐽 , 𝑓 𝑗,𝑞 : 0 ≤ 𝑗 < 𝐽, 0 ≤ 𝑞 < 𝑄},
where 𝑓 ℎ ∈ L2 (R2 ), 𝑓𝐽 ∈ L2 (R2 ), and 𝑓 𝑗,𝑞 ∈ L2 (R2 ) for each 𝑗 and 𝑞. Then for 𝐹, 𝐺 ∈ L2 (R2 ) 𝐽𝑄+2 ,
define their inner product as:
                                                                     𝐽−1 𝑄−1
                                                                    ∑︁      ∑︁
                            ⟨𝐹, 𝐺⟩ := ⟨ 𝑓 ℎ , 𝑔 ℎ ⟩ + ⟨ 𝑓𝐽 , 𝑔𝐽 ⟩ +             ⟨ 𝑓 𝑗,𝑞 , 𝑔 𝑗,𝑞 ⟩,
                                                                      𝑗=0 𝑞=0
where the inner products on the right hand side are defined using (3.5). Notice that with this inner
product we have
                                        ∥𝑊𝐽,𝑄 𝑥∥ 2 = ⟨𝑊𝐽,𝑄 𝑥, 𝑊𝐽,𝑄 𝑥⟩.
Now let us define the adjoint of the wavelet transform. The adjoint of 𝑊𝐽,𝑄 : L2 (R2 ) → L2 (R2 ) 𝐽𝑄+2
is the map 𝑊𝐽,𝑄∗ : L2 (R2 ) 𝐽𝑄+2 → L2 (R2 ) such that the following relation holds:
                                                                                                   ∗
                    ∀ 𝑥 ∈ L2 (R2 ), ∀ 𝐹 ∈ L2 (R2 ) 𝐽𝑄+2 ,           ⟨𝑊𝐽,𝑄 𝑥, 𝐹⟩ = ⟨𝑥, 𝑊𝐽,𝑄            𝐹⟩.        (3.6)
Using (3.6) as the definition of 𝑊𝐽,𝑄 ∗ , one can prove the following theorem.
                                                          68


Theorem 9. The adjoint of 𝑊𝐽,𝑄 is
                                           𝐽−1 𝑄−1
                                           ∑︁  ∑︁
                ∗
              𝑊𝐽,𝑄 𝐹 = 𝑓 ℎ ∗ ℎ + 𝑓𝐽 ∗ ℓ𝐽 +         𝑓 𝑗,𝑞 ∗ 𝜓 𝑗,𝑞 , ∀ 𝐹 ∈ L2 (R2 ) 𝐽𝑄+2 .
                                           𝑗=0 𝑞=0
Proof. Let 𝑥 ∈ L2 (R2 ) and 𝐹 ∈ L2 (R2 ) 𝐽𝑄+2 . The filters are real symmetric. We can also use
                                                69


Fubini’s Theorem.
    ⟨𝑊𝐽,𝑄 𝑥, 𝐹⟩
                                               𝐽−1 𝑄−1
                                               ∑︁   ∑︁
     = ⟨(𝑥 ∗ ℎ), 𝑓 ℎ ⟩ + ⟨(𝑥 ∗ ℓ𝐽 ), 𝑓ℓ ⟩ +             ⟨(𝑥 ∗ 𝜓 𝑗,𝑞 ), 𝑓 𝑗,𝑞 ⟩
                                                𝑗=0 𝑞=0
        ∫                               ∫                                 𝐽−1 𝑄−1
                                                                          ∑︁    ∑︁ ∫
     =      (𝑥 ∗ ℎ)(𝑢) 𝑓 ℎ (𝑢) 𝑑𝑢 +           (𝑥 ∗ ℓ𝐽 )(𝑢) 𝑓ℓ (𝑢) 𝑑𝑢 +                   (𝑥 ∗ 𝜓 𝑗,𝑞 )(𝑢) 𝑓 𝑗,𝑞 (𝑢) 𝑑𝑢
         R2                               R2                               𝑗=0 𝑞=0    R2
        ∫    ∫                                        ∫     ∫                         
     =              𝑥(𝑡)ℎ(𝑢 − 𝑡) 𝑑𝑡 𝑓 ℎ (𝑢) 𝑑𝑢 +                    𝑥(𝑡)ℓ𝐽 (𝑢 − 𝑡) 𝑑𝑡 𝑓ℓ (𝑢) 𝑑𝑢
         R2      R2                                       R2     R2
            𝐽−1 𝑄−1
            ∑︁    ∑︁ ∫      ∫                            
          +                       𝑥(𝑡)𝜓 𝑗,𝑞 (𝑢 − 𝑡) 𝑑𝑡 𝑓 𝑗,𝑞 (𝑢) 𝑑𝑢
             𝑗=0 𝑞=0    R2     R2
        ∫           ∫                                 ∫           ∫                            
     =      𝑥(𝑡)         𝑓 ℎ (𝑢)ℎ(𝑢 − 𝑡) 𝑑𝑢 𝑑𝑡 +              𝑥(𝑡)         𝑓ℓ (𝑢)ℓ𝐽 (𝑢 − 𝑡) 𝑑𝑢 𝑑𝑡
         R2           R2                                   R2          R2
            𝐽−1 𝑄−1
            ∑︁    ∑︁ ∫            ∫                                
          +                 𝑥(𝑡)        𝑓 𝑗,𝑞 (𝑢)𝜓 𝑗,𝑞 (𝑢 − 𝑡) 𝑑𝑢 𝑑𝑡
             𝑗=0 𝑞=0    R2          R2
        ∫           ∫                                 ∫           ∫                            
     =      𝑥(𝑡)         𝑓 ℎ (𝑢)ℎ(𝑡 − 𝑢) 𝑑𝑢 𝑑𝑡 +              𝑥(𝑡)         𝑓ℓ (𝑢)ℓ𝐽 (𝑡 − 𝑢) 𝑑𝑢 𝑑𝑡
         R2           R2                                   R2          R2
            𝐽−1 𝑄−1
            ∑︁    ∑︁ ∫            ∫                                
          +                 𝑥(𝑡)        𝑓 𝑗,𝑞 (𝑢)𝜓 𝑗,𝑞 (𝑡 − 𝑢) 𝑑𝑢 𝑑𝑡
             𝑗=0 𝑞=0    R2          R2
        ∫            ∫                          !       ∫            ∫                            !
     =      𝑥(𝑡)         𝑓 ℎ (𝑢)ℎ(𝑡 − 𝑢) 𝑑𝑢 𝑑𝑡 +              𝑥(𝑡)          𝑓ℓ (𝑢)ℓ𝐽 (𝑡 − 𝑢) 𝑑𝑢 𝑑𝑡
         R2           R2                                   R2          R2
            𝐽−1 𝑄−1
                                                                    !
            ∑︁    ∑︁ ∫             ∫
          +                 𝑥(𝑡)        𝑓 𝑗,𝑞 (𝑢)𝜓 𝑗,𝑞 (𝑡 − 𝑢) 𝑑𝑢 𝑑𝑡
             𝑗=0 𝑞=0    R2          R2
        ∫                            ∫                                𝐽−1 𝑄−1
                                                                      ∑︁     ∑︁ ∫
     =      𝑥(𝑡) · 𝑓 ℎ ∗ ℎ(𝑡) 𝑑𝑡 +          𝑥(𝑡) · 𝑓ℓ ∗ ℓ𝐽 (𝑡) 𝑑𝑡 +                   𝑥(𝑡) · 𝑓 𝑗,𝑞 ∗ 𝜓 𝑗,𝑞 (𝑡) 𝑑𝑡
         R2                             R2                             𝑗=0 𝑞=0     R2
                                               𝐽−1 𝑄−1
                                               ∑︁   ∑︁
     = ⟨𝑥, ( 𝑓 ℎ ∗ ℎ)⟩ + ⟨𝑥, ( 𝑓ℓ ∗ ℓ𝐽 )⟩ +             ⟨𝑥, ( 𝑓 𝑗,𝑞 ∗ 𝜓 𝑗,𝑞 )⟩
                                                𝑗=0 𝑞=0
               ∗
     = ⟨𝑥, 𝑊𝐽,𝑄    𝐹⟩.
                                                                                                                      □
   Looking back at (3.4) we see that the left inverse of 𝑊𝐽,𝑄 is given by 𝑊𝐽,𝑄                 ∗ , i.e.,
                                                            ∗
                                                   𝑥 = 𝑊𝐽,𝑄    𝑊𝐽,𝑄 𝑥.
                                                            70


Having the adjoint of 𝑊𝐽,𝑄 and this relationship will be useful when we get to the Heeger-Bergen
texture synthesis algorithm.
3.3   Heeger Bergen Texture Synthesis
     Using the filter bank from before, we can now state the Heeger-Bergen Texture Synthesis
algorithm. We will need an auxillary histogram matching algorithm taken from [10]:
Algorithm 3.1 Histogram Matching
  1: Start with an input image 𝑢 and reference image 𝑣 both of size 𝑀 × 𝑁.
  2: Define 𝐿 = 𝑀 𝑁 and assume that 𝑢 and 𝑣 are unraveled.
  3: Determine the permutation 𝜏 such that 𝑣 𝜏(1) ≤ 𝑣 𝜏(2) ≤ . . . ≤ 𝑣 𝜏(𝐿) .
  4: Determine the permutation 𝜎 such that 𝑢 𝜎(1) ≤ 𝑢 𝜎(2) ≤ . . . ≤ 𝑢 𝜎(𝐿) .
  5: for 𝑖 = 1 to 𝐿 do
  5:    𝑢 𝜎(𝑖) ← 𝑣 𝜏(𝑘)
  6: end for
     The general idea behind the Heeger Bergen Texture Synthesis algorithm is to match distributions
one wavelet coefficients. One starts with a 𝑀 × 𝑁 white noise, say 𝐼𝑊 , with pixels sampled from
a standard normal distribution and a reference image 𝐼 𝑅 . One then calculates 𝑊𝐽,𝑄 𝐼 𝑅 and 𝑊𝐽,𝑄 𝐼𝑊 ,
histogram matches corresponding coefficients, and inverts the transform. The general idea is that
matching distributions of wavelet coefficients will turn the white noise into an image that is similar
to reference texture. Pseudocode for the algorithm is provided in Algorithm 3.2.
Algorithm 3.2 Heeger Bergen Texture Synthesis Algorithm
  1: Start with an white noise 𝐼𝑊 ∈ R 𝑀×𝑁 , reference image 𝐼 𝑅 ∈ R 𝑀×𝑁 , set of scales 𝐽, number of
     rotations 𝑄, and number of iterations 𝑇.
  2: Calculate 𝑊𝐽,𝑄 𝐼 𝑅 ∈ L2 (R2 ) 𝐽𝑄+2 .
  3: for 𝑡 = 1 to 𝑇 do
  4:    Calculate 𝑊𝐽,𝑄 𝐼𝑊 ∈ L2 (R2 ) 𝐽𝑄+2 .
  5:    Update 𝑊𝐽,𝑄 𝐼𝑊 by histogram matching each element of 𝑊𝐽,𝑄 𝐼𝑊 with the corresponding filter
        in 𝑊𝐽,𝑄 𝐼 𝑅 (using each filter in 𝑊𝐽,𝑄 𝐼 𝑅 as the reference histogram).
        Update 𝐼𝑊 via the formula 𝐼𝑊 = 𝑊𝐽,𝑄   ∗ 𝑊
  6:                                                𝐽,𝑄 𝐼𝑊 .
  7:    Histogram match 𝐼𝑊 with 𝐼 𝑅 using 𝐼 𝑅 as the reference histogram.
  8: end for
     Unfortunately, Algorithm 3.2 does not yield high quality texture synthesis. One possible reason
for this is that it does not use a nonlinear filter bank, so complex features in a texture cannot be
                                                      71


effectively captured.
3.4   One Layer Nonlinear Heeger Bergen Texture Synthesis
    To improve the synthesis quality of Algorithm 3.2, we propose a slight modification. Consider
the nonlinearity ReLU(𝑥) := max{0, 𝑥} and notice that we have the identities
                                     |𝑥| = ReLU(𝑥) + ReLU(−𝑥)                                 (3.7)
and
                                      𝑥 = ReLU(𝑥) − ReLU(−𝑥).                                 (3.8)
Now define the nonlinear filter bank
𝑊𝐽,𝑄,𝜀 𝑥 := {ReLU(𝑥 ∗ 𝜀ℎ) , ReLU(𝑥 ∗ 𝜀ℓ𝐽 ) , ReLU(𝑥 ∗ 𝜀𝜓 𝑗,𝑞 ) : 0 ≤ 𝑗 < 𝐽, 0 ≤ 𝑞 < 𝑄, 𝜀 = ±1}.
                                                                                              (3.9)
In particular, for any Ψ in 𝑊𝐽,𝑄 , we see that
                             | 𝑓 ∗ Ψ| = ReLU( 𝑓 ∗ Ψ) + ReLU( 𝑓 ∗ −Ψ)                         (3.10)
and
                              𝑓 ∗ Ψ = ReLU( 𝑓 ∗ Ψ) − ReLU( 𝑓 ∗ −Ψ).                          (3.11)
Thus, it follows that
                                     𝑊𝐽,𝑄 𝑥 = 𝑊𝐽,𝑄,1 𝑥 − 𝑊𝐽,𝑄,−1 𝑥,                          (3.12)
which means that
                                          ∗
                                    𝑥 = 𝑊𝐽,𝑄 (𝑊𝐽,𝑄,1 𝑥 − 𝑊𝐽,𝑄,−1 𝑥).                         (3.13)
    Using this inverse formula, we can create the following algorithm. Results using Algorithm 3.3
are given in Figure 3.9.
                                                 72


Algorithm 3.3 ReLU Heeger Bergen Texture Synthesis Algorithm
 1: Start with an white noise 𝐼𝑊 ∈ R 𝑀×𝑁 , reference image 𝐼 𝑅 ∈ R 𝑀×𝑁 , set of scales 𝐽, number of
    rotations 𝑄, and number of iterations 𝑇.
 2: Calculate 𝑊𝐽,𝑄,𝜀 𝐼 𝑅 ∈ L2 (R2 ) 2(𝐽𝑄+2) .
 3: for 𝑡 = 1 to 𝑇 do
 4:    Calculate 𝑊𝐽,𝑄,𝜀 𝐼𝑊 ∈ L2 (R2 ) 2(𝐽𝑄+2) .
 5:    Update 𝑊𝐽,𝑄,𝜀 𝐼𝑊 by histogram matching each element of 𝑊𝐽,𝑄,𝜀 𝐼𝑊 with the corresponding
       filter in 𝑊𝐽,𝑄,𝜀 𝐼 𝑅 (using each filter in 𝑊𝐽,𝑄,𝜀 𝐼 𝑅 as the reference histogram).
                                               ∗ (𝑊
 6:    Update 𝐼𝑊 via the formula 𝐼𝑊 = 𝑊𝐽,𝑄           𝐽,𝑄,1 𝐼𝑊 − 𝑊𝐽,𝑄,−1 𝐼𝑊 ).
 7:    Histogram match 𝐼𝑊 with 𝐼 𝑅 using 𝐼 𝑅 as the reference histogram.
 8: end for
Figure 3.9 Left: Reference Texture. Middle: No ReLU. Right: With ReLU. Ran with 𝐽 = 4 and
𝑄 = 6 for 50 iterations.
3.5  An Invertible Windowed Scattering Transform
    Before we describe the next modification to the Heeger Bergen Texture Synthesis algorithm,
we need to motivate the usage of wavelet scattering transforms. We will use the following notation
                                                     73


again:
                                           𝐿 𝑐 𝑓 (𝑥) = 𝑓 (𝑥 − 𝑐)                                (3.14)
                                          𝐿𝜏 𝑓 (𝑥) = 𝑓 (𝑥 − 𝜏(𝑥)).                              (3.15)
The first operator, 𝐿 𝑐 , is a translation operator, and the second operator 𝐿 𝜏 can be thought of as a
deformation operator. In particular, if
                                         ∥∇𝜏∥ ∞ = sup |∇𝜏(𝑥)| < 1                               (3.16)
                                                     𝑥∈R𝑛
and 𝜏 ∈ C2 (R𝑛 ). Suppose that H1 , H2 are a Hilbert spaces and Φ : H1 → H2 is an operator for a
vision-related task, such as classification. We would like Φ to have the following properties:
    • Local Translation Invariance: small translations of an object should not greatly affect the
       output of Φ.
    • Nonexpansiveness: for 𝑓 , 𝑔 ∈ H1 ,
                                           ∥Φ 𝑓 − Φ𝑔∥ H2 ≤ ∥ 𝑓 − 𝑔∥ H1 .                        (3.17)
       In other words, the distance between Φ 𝑓 and Φ𝑔 should not be larger than the original
       distance between 𝑓 and 𝑔 for stability reasons.
    • Stability to Diffeomorphisms: there exists 𝐶 > 0 such that
                           ∥Φ 𝑓 − Φ𝐿 𝜏 𝑓 ∥ H2 ≤ 𝐶 ∥ 𝑓 ∥ H1 (∥𝜏∥ ∞ + ∥∇𝜏∥ ∞ + ∥𝐻𝜏∥ ∞ ).          (3.18)
       for all 𝜏 ∈ C2 (R𝑛 ). That is, the operator "linearizes" small deformations.
    As shown in [34], the Fourier modulus is not stable to small dilations, which are a simple class
of diffeomorphisms. A natural next step is to use wavelets, which produce a representation that is
well localized in space and frequency.
    Let 𝐺 + be the set of "positive" rotations and define
                                                           +
                                               𝑗
                                       {𝜆 = 2 𝑟 : 𝑟 ∈ 𝐺 , 𝑗 < 𝐽} if 𝐽 < ∞
                                      
                                      
                                      
                                Λ𝐽 =                                                            (3.19)
                                       {𝜆 = 2 𝑗 𝑟 : 𝑟 ∈ 𝐺 + , 𝑗 ∈ Z} if 𝐽 = ∞.
                                      
                                      
                                      
                                                       74


     Let 𝜓 be a wavelet. For 𝑥 ∈ 𝐿 2 (R𝑛 ), let 𝜆 ∈ Λ𝐽 and define
                                              𝑈 [𝜆]𝑥 = |𝑥 ∗ 𝜓𝜆 |                                 (3.20)
A path 𝑝 is an ordered tuple 𝑝 = (𝜆 1 , . . . 𝜆ℓ ) with 𝜆𝑖 ∈ Λ∞ for 𝑖 = 1 . . . , ℓ, and 𝑃𝐽 where each
element of the path is from Λ𝐽 . For paths, define the operator
                                 𝑈 [ 𝑝]𝑥 = 𝑈 [𝜆ℓ ] . . . 𝑈 [𝜆 2 ]𝑈 [𝜆1 ]𝑥
                                           = |||𝑥 ∗ 𝜓𝜆1 | ∗ 𝜓𝜆2 | . . . | ∗ 𝜓𝜆ℓ |.               (3.21)
Let 𝜙 𝐽 (𝑢) be a low pass filter. For 𝑝 ∈ 𝑃𝐽 , define the windowed scattering operator
                                     𝑆 𝐽 [ 𝑝]𝑥(𝑢) = (𝑈 [ 𝑝]𝑥 ∗ 𝜙 𝐽 )(𝑢)                          (3.22)
For any set of paths, say Ω, define the path set 𝑆 𝐽 [Ω] = {𝑆 𝐽 [ 𝑝]} 𝑝∈Ω . We now define the windowed
scattering transform as the set 𝑆 𝐽 [𝑃𝐽 ]𝑥 and use the following Hilbert Space norm
                                                        ∑︁
                                      ∥𝑆 𝐽 [Ω]𝑥∥ 2 =          ∥𝑆 𝐽 [ 𝑝]𝑥∥ 22 .                   (3.23)
                                                        𝑝∈Ω
Notably, the windowed scattering transform has many properties we would like in a feature extractor.
Theorem 10 (The Windowed Scattering Norm is Well-defined). Suppose that 𝜓 is a wavelet such
that there 𝜂 ∈ R𝑛 , 𝜌 ≥ 0, | 𝜌(𝜔)|
                             ˆ           ˆ
                                    ≤ | 𝜙(2𝜔)|,     ˆ
                                                    𝜌(0)  = 1, and
                                                       ∞
                                                      ∑︁                             
                         Ψ̂(𝜔) = | 𝜌(𝜔
                                     ˆ − 𝜂)| −    2
                                                                      ˆ −𝑘 𝜔 − 𝜂)| 2
                                                            𝑘 1 − | 𝜌(2                          (3.24)
                                                      𝑘=1
satisfies
                                      ∑︁∞ ∑︁
                         𝛼 − inf                 Ψ̂(2− 𝑗 𝑟 −1 𝜔)| 𝜓(2
                                                                   ˆ − 𝑗 𝑟 −1 𝜔)| 2 > 0.         (3.25)
                               1≤𝜔≤2
                                     𝑗=−∞ 𝑟∈𝐺
We will call this the admissibility condition. Under the admissibility condition, for all 𝑥 ∈ L2 (R𝑛 ),
we have the isometry
                                             ∥𝑆 𝐽 [𝑃𝐽 ]𝑥∥ = ∥𝑥∥.                                 (3.26)
Theorem 11 (Nonexpansive Property of the Windowed Scattering Transform). For all 𝑥, 𝑦 ∈
𝐿 2 (R𝑛 ),
                                   ∥𝑆 𝐽 [𝑃𝐽 ]𝑥 − 𝑆 𝐻 [𝑃𝐽 ]𝑦∥ ≤ ∥𝑥 − 𝑦∥ 2 .                       (3.27)
                                                       75


Theorem 12 (Local Translation Invariance of the Windowed Scattering Transform). For all 𝑐 ∈ R𝑛 ,
𝑥 ∈ 𝐿 2 (R𝑛 ), and an admissible wavelet,
                                     lim ∥𝑆 𝐽 [𝑃𝐽 ]𝑥 − 𝑆 𝐽 [𝑃𝐽 ] 𝐿 𝑐 𝑥∥ = 0.                    (3.28)
                                    𝐽→∞
                                                                                1
Theorem 13 (Diffeomorphism Stability). Let 𝜏 ∈ 𝐶 2 (R𝑛 ) with ∥𝐷𝜏∥ ∞ <         2𝑛 . For an admissible
wavelet and any 𝑥 ∈ 𝐿 2 (R𝑛 ), there exists a constant 𝐶 such that
                                  ∥𝑆 𝐽 [𝑃𝐽 ] 𝐿 𝜏 𝑥 − 𝑆 𝐽 [𝑃𝐽 ]𝑥∥ ≤ 𝐶𝐾 (𝜏)∥𝑥∥,                   (3.29)
where 𝐾 (𝜏) → 0 as
                                       ∥𝜏∥ ∞ + ∥∇𝜏∥ ∞ + ∥𝐻𝜏∥ ∞ → 0.
    That is to say, a windowed scattering operator is a good feature extractor. In practice, it is not
able to compute an infinite cascade of wavelet transforms, but empirical studies have shown that
two layers is enough [33].
    We will now construct a two layer modification of the scattering transform, which we will
denote as the "Two Layer Scattering Pyramid," using the operators 𝑊𝐽,𝑄 and 𝑊𝐽,𝑄,𝜀 :
  𝑆2 𝑥 := {𝑥 ∗ 𝜙 𝐽 , 𝑥 ∗ ℎ, 𝑊𝐽,𝑄 ReLU(𝑥 ∗ 𝜀𝜓 𝑗,𝑞 ) : 0 ≤ 𝑗 ≤ 𝐽 − 1 , 0 ≤ 𝑞 ≤ 𝑄 − 1 , 𝜀 = ±1}. (3.30)
The algorithm is provided in Algorithm 3.4. Note that this can be generalized to multiple layers, but
the computational cost is infeasible; additionally, most of energy should be in the first two layers.
                                                          76


Algorithm 3.4 Two Layer Scattering Pyramid
 1: INPUTS: An image 𝑥 and operators 𝑊𝐽,𝑄 , 𝑊𝐽,𝑄,−1 , 𝑊𝐽,𝑄,1 .
 2: OUTPUT: 𝑆 2 𝑥, a modification of the Two Layer Scattering Pyramid.
 3: Initialize a set of functions 𝑆 2 𝑥.
 4: Calculate 𝑊𝐽,𝑄,−1 𝑥 and 𝑊𝐽,𝑄,1 𝑥 and add
                                 𝑥 ∗ 𝜙 𝐽 = ReLU(𝑥 ∗ 𝜙 𝐽 ) − ReLU(𝑥 ∗ −𝜙 𝐽 )
    and
                                   𝑥 ∗ ℎ = ReLU(𝑥 ∗ ℎ) − ReLU(𝑥 ∗ −ℎ).
 5: for 𝑗 = 0 to 𝐽 − 1 do
 6:    for 𝑞 = 0 to 𝑄 − 1 do
 7:       for 𝜀 = ±1 do
 8:          Calculate 𝑊𝐽,𝑄,𝜖 (𝑥 ∗ 𝜓 𝑗,𝑞 ) and add to 𝑆2 𝑥.
 9:       end for
10:    end for
11: end for
12: Return 𝑆2 𝑥.
Algorithm 3.5 Two Layer Scattering Pyramid Inverse
 1: INPUTS: The two layer pyramid 𝑆 2 𝑥.
 2: OUTPUT: The original image 𝑥.
 3: for 𝑗 = 0 to 𝐽 − 1 do
 4:    for 𝑞 = 0 to 𝑄 − 1 do
                        ∗ (𝑊
 5:       Calculate 𝑊𝐽,𝑄      𝐽,𝑄,1 (𝑥 ∗ 𝜓 𝑗,𝑞 ) − 𝑊𝐽,𝑄,−1 (𝑥 ∗ 𝜓 𝑗,𝑞 )), where corresponding filters are
          subtracted in the subtraction operation.
 6:    end for
 7: end for
 8: Note that this recovers all the bandpass filters in 𝑊𝐽,𝑄 𝑥, and the high and low pass residuals are
    already in 𝑆2 𝑥.
 9: 𝑥 = 𝑊𝐽,𝑄∗ 𝑊
                 𝐽,𝑄 𝑥.
10: Return 𝑥.
    For the next section, we will use the algorithms above to formulate a modified texture synthesis
algorithm.
3.6   Two Layer Nonlinear Heeger Bergen Texture Synthesis
    Using Algorithm 3.4, we can formulate a Heeger Bergen Texture Synthesis algorithm motivated
by the scattering transform.
                                                    77


Algorithm 3.6 Heeger Bergen Scattering Texture Synthesis
 1: Start with an white noise 𝐼𝑊 ∈ R 𝑀×𝑁 , reference image 𝐼 𝑅 ∈ R 𝑀×𝑁 , set of scales 𝐽, number of
    rotations 𝑄, and number of iterations 𝑇.
 2: Calculate 𝑆 2 𝐼 𝑅 and save it.
 3: for 𝑡 = 1 to 𝑇 do
 4:    Calculate 𝑊𝐽,𝑄,1 𝐼𝑊 and 𝑊𝐽,𝑄,−1 𝐼𝑊 .
 5:
 6:    Update 𝑊𝐽,𝑄,1 𝐼𝑊 by histogram matching each element of 𝑊𝐽,𝑄,1 𝐼𝑊 with the corresponding
       filter in 𝑊𝐽,𝑄,1 𝐼 𝑅 (using each filter in 𝑊𝐽,𝑄,1 𝐼 𝑅 as the reference histogram) and update
      𝑊𝐽,𝑄,−1 𝐼𝑊 by histogram matching each element of 𝑊𝐽,𝑄,−1 𝐼𝑊 with the corresponding filter
       in 𝑊𝐽,𝑄,−1 𝐼 𝑅 (using each filter in 𝑊𝐽,𝑄,−1 𝐼 𝑅 as the reference histogram).
 7:    Keep
                                𝐼𝑊 ∗ 𝜙 𝐽 = ReLU(𝐼𝑊 ∗ 𝜙 𝐽 ) − ReLU(𝐼𝑊 ∗ −𝜙 𝐽 )
       and
                                   𝐼𝑊 ∗ ℎ = ReLU(𝐼𝑊 ∗ ℎ) − ReLU(𝐼𝑊 ∗ −ℎ)
       for inversion.
 8:    for 𝑗 = 0 to 𝐽 − 1 do
 9:        for 𝑞 = 0 to 𝑄 − 1 do
10:          for 𝜀 = ±1 do
11:             Calculate 𝑊𝐽,𝑄,𝜖 (𝐼𝑊 ∗ 𝜀𝜓 𝑗,𝑞 ) and histogram match with corresponding filter in 𝑆2 𝐼 𝑅 .
12:          end for
13:        end for
14:    end for
15:    Invert the scattering pyramid 𝑆2 𝐼𝑊 via 3.5.
16: end for
17: Histogram match 𝐼𝑊 with 𝐼 𝑅 using 𝐼 𝑅 as the reference histogram.
                                                    78


Figure 3.10 Left: Reference Texture. Middle: One Layer ReLU Synthesis for 50 iterations. Right:
Two Layer Synthesis for 10 iterations. Ran with 𝐽 = 4 and 𝑄 = 6.
                                               79


3.7    Conclusions and Future Work
    Based on our current experiments, we believe there are multiple reasons why our approach
fails. First, matching wavelet coefficients is an approximation of optimal transport. However, this
approximation seems to get stuck in local minimums quickly. There are two possible reasons
     • Histogram matching wavelet coefficients is bad approximation of optimal transport.
     • Matching each coefficient works, but they "interfere" each other. That is, matches in one
       coefficiens might lead to better synthesis, but a match in a different coefficient leads to worse
       synthesis.
The second point leads to an interesting question. Two representations can have close to the same
histogram for each component, but their structure does not necessarily look the same; what else do
we need to match to obtain structure such as edges and lines?
    One approach, which we will consider for deep neural networks in the next chapter, is to match
𝑛-dimensional histograms. We can regard the set of wavelet coefficients as a 𝐻 × 𝑊 × 𝐶 tensor with
each 𝐻 × 𝑊 slice of the last dimension being a wavelet coefficient. If we unroll the vector into a
𝐻𝑊 × 𝐶 matrix and match 𝐻𝑊 dimension histograms for each pixel, it is a good approximation of
an optimal transport problem. Will this problem yield good results for synthesis? Our preliminary
experiments suggest yes. If this is the case, can we study what type of frames yield good texture
synthesis? Is invertibility/pseudoinvertibility really a necessary requirement?
                                                   80


                                             CHAPTER 4
                   LONG RANGE CONSTRAINTS FOR NEURAL TEXTURE
                      SYNTHESIS USING SLICED WASSERSTEIN LOSS
This work is based on [51], which has been submitted to IEEE ICIP 2023. We introduce an
algorithm for texture synthesis using a modified form of Sliced Wasserstein Loss. The main idea of
both methods is similar: we match 1D histograms between feature maps. However, our approach
for this chapter will use deep convolutional neural networks (CNNs) instead of an invertible wavelet
frame.
4.1    Background on Convolutional Neural Networks
     The contents of this section is based on [21]. A CNN is neural network using series of
convolutions to learn features from data. The general steps of the network for an image classification
task are the following:
     • Initialize random weights for a set of filters and put in an input image 𝑥.
     • Apply a cascade of convolutional layers and nonlinearities using the filters.
     • Subsample the result using a subsampling operation such as max pooling, min pooling, or
       average pooling.
     • Repeat this process multiple times.
     • Use a classifier (usually feedforward neural network) to classify the image.
     • Apply backpropogation to update the filters via gradient descent.
     • Repeat the steps till one reaches desired performance on the desired task.
An example of a convolutional architecture, LeNet, is given in Figure 4.1.
     First, consider the discrete convolution of functions 𝑥(𝑡) and 𝑤(𝑡), which are only defined for
integer 𝑡:
                                                       ∞
                                                      ∑︁
                                𝑠(𝑡) = (𝑥 ∗ 𝑤)(𝑡) =       𝑥(𝑎)𝑤(𝑡 − 𝑎).                          (4.1)
                                                     𝑎=−∞
     Convolutions are done over more than one dimension at a time. In particular, for two dimensions,
                                                   81


    Figure 4.1 LeNet Architecture demonstrating the procedure for the forward pass of a CNN.
for an image two-dimensional image 𝐼 and two dimensional kernel 𝐾, the convolution is
                                                 ∑︁ ∑︁
                       𝑆(𝑖, 𝑗) = (𝐾 ∗ 𝐼)(𝑖, 𝑗) =          𝐼 (𝑖 − 𝑚, 𝑗 − 𝑛)𝐾 (𝑚, 𝑛),          (4.2)
                                                  𝑚    𝑛
which is a commutative operation.
    However, neural network libraries usually use the cross-correlation function
                                                 ∑︁ ∑︁
                       𝑆(𝑖, 𝑗) = (𝐾 ∗ 𝐼)(𝑖, 𝑗) =          𝐼 (𝑖 + 𝑚, 𝑗 + 𝑛)𝐾 (𝑚, 𝑛).          (4.3)
                                                  𝑚    𝑛
due to ease of implementation. While the operation is not commutative, this does not affect the
accuracy of a network because the kernel 𝐾 is learned from the data anyway. In practice, when
ones uses a convolutional neural network, we regard an image or feature map as an 3-tensor. For
example, an 𝑀 × 𝑁 RGB image is an 𝑀 × 𝑁 × 3 tensor, where the last dimension is known as the
channel dimension. Each channel of a the three tensor (i.e. each 𝑀 × 𝑁 matrix) is known as a
feature map.
    We can now provide a mathematical formulation for a convolutional layer of a network. Assume
we have an 𝑀 × 𝑁 × 𝐶1 input. In other words, there are 𝐶1 feature maps in the channel dimension.
If we would like to output 𝐶2 feature maps from our convolutional layer, define the set of filters
{𝐹𝑖, 𝑗 } with 1 ≤ 𝑖 ≤ 𝐶1 and 1 ≤ 𝑗 ≤ 𝐶2 . A the "convolution" step of convolutional layer is
                                                   𝐶1
                                                  ∑︁
                                          𝐶 (𝑥) =      𝐹𝑖, 𝑗 ∗ 𝑥,                            (4.4)
                                                   𝑖=1
which has channel dimension 𝐶2 . After the application of the convolution step above, one applies
                                                   82


a pointwise nonlinearity, 𝜎 to get
                                                     𝐶1
                                                                  !
                                                    ∑︁
                                    𝜎(𝐶 (𝑥)) = 𝜎         𝐹𝑖, 𝑗 ∗ 𝑥 .                           (4.5)
                                                     𝑖=1
Common choices are the sigmoid function, tanh, ReLU, and so on.
    After a convolutional layer, CNNs usually have a subsampling operation to decrease the size
of the feature maps. This makes training networks less computationally intensive and helps with
extracting features such as edges. Usually, a pooling operation, such as max pooling, which involves
sliding a kernel and taking the maximum along a window, is used.
    The reason CNNs work better than, such as feedfoward neural networks, for vision tasks is
that they encode information in a more relevant manner. Specifically, the convolution step of a
convolutional layer is translation equivariant (e.g. applying a translation to an image before and
after the convolution step yields the same result); convolutions are a local operation, and local
information is more relevant for images; a feature map shares parameters with other feature maps,
which captures interaction between different feature maps and lowers the total number of learned
parameters. Parameter sharing and more capacity for larger networks are some of the reasons one
would suspect a deep CNN to work better than a scattering transform for classification tasks.
    Regarding deep CNNs, we are interested in a specific architecture which is commonly used for
neural texture synthesis tasks, VGG19 [45]. See Figures 4.2 and 4.3.
Figure 4.2 Schematic of VGG19 architecture. The blocks "Conv" are a convolution block and
ReLU, "MaxPooling" is a max pooling operation, and "FC+relu" is a linear layer and ReLU.
4.2   Image Quality Metrics
    To evaluate the effectiveness of our texture synthesis algorithms, discuss a few image quality
metrics that we will use to compare the quality of images we generated. Traditional image metrics,
                                                  83


Figure 4.3 VGG19 schematic showing how feature map size decreases as the number of filters
increase.
like PSNR and SSIM, are not perfect image metrics. Consider the following example:
 Figure 4.4 Two examples where human perception does not match with common image metrics.
    In the example, one can clearly see that Patch 1 looks similar to the reference, and is slightly
deformed. However, common image metrics like MSE would not be able to handle the distortions
properly. Additionally, Patch 0, which is a low pass filtering of the reference, would be recognized
as similar by traditional metrics. Note that these metrics we will discuss are not optimal either. In
all cases, if we calculate the similarity between two of the same image, the image metric would
                                                 84


yield the best score. Like we mentioned previously, this is not ideal for texture synthesis.
    The first metric we consider is LPIPS [52], which measures the perceptual similarity between
two images. The idea is to use MSE between feature maps of a deep convolutional neural network.
                      Figure 4.5 LPIPS calculation between an image 𝑥 and 𝑥 0 .
    Suppose we have ℓ = 1, . . . , 𝐿 layers of a neural network. We extract feature stacks from layer
ℓ and unit-normalize in the channel dimension, which we designate 𝑦 ℓ , 𝑦ˆ ℓ ∈ R𝐻ℓ ×𝑊ℓ ×𝐶ℓ . We do a
channel-wise scaling of the activiations by vector 𝑤 ℓ ∈ R𝐶ℓ Finally, we average spatially and over
the channels. This can be represented as
                                           𝐿          𝐻ℓ ∑︁
                                                          𝑊ℓ
                                          ∑︁    1 ∑︁
                            𝑑 (𝑥, 𝑥 0 ) =                     ∥𝑤 ℓ ◦ (𝑦 ℓ − 𝑦ˆ ℓ )∥ 22               (4.6)
                                          ℓ=1
                                              𝐻ℓ 𝑊ℓ 𝑖=1 𝑗=1
Usually, a deep network like VGG or AlexNet are used for feature map extraction.
    The second metric we use is Frechet Inception distance (FID) [25]. The 2-Wasserstein Distance,
or Frechet Distance, is given by
                                                     ∫
                             𝑑 𝐹2 (𝜇, 𝜈)  =   inf            ∥𝑥 − 𝑦∥ 2 𝑑𝛾(𝑥, 𝑦)                      (4.7)
                                            𝛾∈Γ(𝜇,𝜈)  R𝑛 ×R𝑛
where Γ(𝜇, 𝜈) is the set of joint probability measures such that
                     ∫                                                                
                                                              𝑛
                      𝛾:       𝛾(𝑥, 𝑦) = 𝑑𝑦 = 𝜇(𝑥) and R }𝛾(𝑥, 𝑦) = 𝑑𝑥 = 𝜈(𝑦) .
                           R𝑛
For FID, one assumes that the model fits to a Gaussian distribution. Then for two Gaussian
distributions 𝑝 1 with distribution N (𝜇1 , Σ1 ) and 𝑝 2 with distribution N (𝜇2 , Σ2 ), one can write the
                                                      85


FID as
                                                                                              1/2 
                   𝑑 𝐹2 ( 𝑝 1 , 𝑝 2 ) = ∥𝜇1 − 𝜇2 ∥ 22 + tr Σ1 +    Σ1′   −2     Σ11/2 Σ2 Σ11/2         .  (4.8)
Usually, the function to estimate the distribution is a specific layer of a deep convolutional neural
network, similar to LPIPS.
    KID score [8] is similar to FID score, but relaxes the assumption that both distributions are
Gaussian and is defined as
                                         KID( 𝑝 1 , 𝑝 2 ) = MMD( 𝑝 1 , 𝑝 2 ) 2 .                          (4.9)
More detail is given in [8], but it will be omitted from this thesis.
4.3   Texture Synthesis Using VGG19 and Gram Matrices
    The contents of this section are adapted from [19], which has made led to remarkable improve-
ments in the field of texture synthesis since its publication. For notation, assume we have an 𝐿 + 1
layer convolutional neural network; for ℓ = 0, . . . , 𝐿, layer ℓ has 𝑁ℓ distinct feature maps. Each of
these feature maps is a 𝑀ℓ vector after flattening. We reshape feature maps in layer ℓ into a matrix
𝐹ℓ ∈ R𝑁ℓ ×𝑀ℓ , where each row of the matrix is a feature map.
    Assuming that each row of 𝐹ℓ (i.e. each feature map in layer ℓ) is mean centered, we define the
Gram matrix 𝐺 ℓ ∈ R𝑛 × R𝑛 as
                                                          ∑︁
                                                𝐺 𝑖ℓ𝑗 =       𝐹𝑖𝑘ℓ 𝐹 𝑗ℓ𝑘 .                               (4.10)
                                                           𝑘
In matrix form, we have 𝐺 ℓ = 𝐹 ℓ (𝐹 ℓ )𝑇 .
    To generate a new texture from a reference texture, we update a white noise image using
gradient descent by optimizing over the mean-squared error (MSE) between the Gram matrices of
the reference image and the white noise. More formally, let 𝑥® be a reference image and 𝑥®ˆ be the
generated image. Also let the gram matrices for 𝑥® be 𝐺 ℓ and the gram matrices for 𝑥®ˆ be 𝑔ˆ ℓ . Define
                                                    1      ∑︁
                                         𝐸ℓ =                  (𝐺 𝑖ℓ𝑗 − 𝐺ˆ 𝑖ℓ𝑗 ) 2 ,                     (4.11)
                                               4𝑁ℓ2 𝑀ℓ2 𝑖, 𝑗
which is the MSE between the gram matrices of 𝑥® and 𝑥®ˆ in layer ℓ. The full loss function is given
                                                           86


by
                                                           𝐿
                                                          ∑︁
                                         L (® 𝑥 , 𝑥®ˆ ) =      𝑤 ℓ 𝐸ℓ ,                       (4.12)
                                                          ℓ=0
where 𝑤 ℓ are user-defined weights.
    For updating the image, one uses gradient descent. In particular, the following identity holds:
                                                                 
                                    1     ˆ
                                                    
                                                             ˆ          𝐹ˆ𝑖ℓ𝑗 > 0
                                           ℓ  𝑇        ℓ      ℓ
                               𝑁 2 𝑀 2 ( 𝐹 ) 𝐺 − 𝐺 𝑗𝑖 ,
                              
                              
                              
                      𝜕𝐸 ℓ 
                            =      ℓ ℓ                                                        (4.13)
                     𝜕 𝐹ˆ𝑖ℓ𝑗 
                               0,
                                                                       otherwise.
                              
In practice, one does not implement the gradient manually. Instead, one can use open-source deep
learning packages, like PyTorch or Tensorflow, with a built in optimizer (L-BFGS) to update the
white noise image.
    Before we provide a unified workflow, we discuss the specific CNN architecture, which is a
modified version of VGG-19, used for texture synthesis. The following changes were made to the
original version of VGG-19:
    • The max pooling layers are switched to average pooling layers. No retraining of the network
      is done. The authors noticed that this change improved image quality.
    • The weights of the network are transformed so that the mean of each feature map is 1.
Additionally, during the synthesis process the feature maps from the layers ’conv1_1’, ’pool1’,
’pool2’, ’pool3’, and ’pool4’ are used to make Gram matrices. The other convolutional layers and
the fully connected part of the network are not used.A schematic of the general workflow is given
in Figure 4.6 and some synthesis results are given in Figure 4.7.
                                                        87


Figure 4.6 The general synthesis method is given below. The images 𝑥® and 𝑥®ˆ are both passed through
the network. Gram matrices are created from specific feature maps, and the loss is calculated. Using
a backpropogation engine, the 𝑥®ˆ is updated. Picture reference: [19].
                                                 88


Figure 4.7 Results for textures synthesis after each gram matrix is added for synthesis. "Original"
denotes the reference image and "Portilla and Simoncelli" are results from [40]. Picture reference:
[19].
   One notable issue with the results above is that long range constraints are not captured by this
algorithm. In other words, the alignment of objects isn’t capture by this algorithm. This is expected
because CNNs aggregate local information.
                                                  89


4.4   Improvements Via Regularization
    To ameliorate the lack of long range information captured by [19], [31] propose adding regu-
larization (4.12). For an image 𝐼, let
                                          E 𝐼 = { 𝐼˜ : |F (𝐼)| = |F ( 𝐼)|}.
                                                                         ˜                          (4.14)
That is, (4.14) is the set of images with the same power spectrum as 𝐼. We can find the projection
of an image 𝐼ˆ onto E 𝐼 via
                                                        ˆ · F ( 𝐼)
                                                                ˆ∗
                                                                               
                                                   F  ( 𝐼)
                                      𝐼˜ := F                         · F (𝐼) .                     (4.15)
                                                        ˆ · F ( 𝐼)
                                                  |F ( 𝐼)       ˆ ∗|
        ˆ E 𝐼 ) be the distance of 𝐼ˆ to E 𝐼 . If we denote LCNN to be the loss from (4.12), then our new
Let 𝑑 ( 𝐼,
loss function between images is
                                                             𝛽 ˆ
                                           L = LCNN +          𝑑 ( 𝐼, E 𝐼 ) 2 ,                     (4.16)
                                                             2
where 𝛽 is a hyperparameter.
    In Figure 4.8, some examples of synthesis using the spectrum constraint are given. As shown
in Figure 4.8, there is notable improvement in alignment for these textures. However, it is difficult
to find a value for 𝛽 that works on a variety of images. This would require time-consuming
hyperparameter tuning. Additionally, if one required high quality synthesis for multiple textures,
one may have to tune 𝛽 depending on each texture. It would be ideal if one could find a method
that did not require hyperparameter tuning for high quality synthesis results.
                                                           90


Figure 4.8 Results for textures synthesis using a spectrum constraint. The hyperparameter 𝛽 is
105 . Left: Reference Texture. Middle: Without Spectrum Constraint. Right: With Spectrum
Constraint.
4.5   Texture Synthesis Using Sliced Wasserstein Loss
    Before we discuss the method from [51], we will provide some background information about
texture synthesis using Sliced Wasserstein Loss from [24]. Suppose that layer ℓ of an 𝐿 layer
convolutional neural network has 𝑁ℓ channels and 𝑀ℓ pixels in each channel. We denote the
feature vector located at pixel 𝑚 as 𝐹𝑚ℓ ∈ R𝑁ℓ , which is a change in notation from previous sections.
    The authors of [24] propose using a different set of statistics instead of the mean squared error
between gram matrices. In a manner similar to [23, 48], the authors match the distributions between
feature maps, but these feature maps used in [24] are feature maps of VGG19 [45].
    With respect a network architecture, let 𝑝 ℓ and 𝑝ˆℓ be the probability density functions associated
                                                   91


with the set of feature vectors {𝐹𝑚ℓ } and { 𝐹ˆ𝑚ℓ }. Since our network is discrete, we assume that the
probability density functions are always an average of Dirac delta distributions of the form
                                                             𝑀ℓ
                                             ℓ          1 ∑︁
                                           𝑝 (𝑥) =              𝛿 ℓ (𝑥).                         (4.17)
                                                       𝑀ℓ 𝑚=1 𝐹𝑚
Let 𝑉 ∈ S𝑁ℓ be a random direction on the unit sphere of dimension 𝑁ℓ . For the purpose of this
paper, the Sliced Wasserstein Distance between two distributions of features is of the form
                                LSW,ℓ ( 𝑝 ℓ , 𝑝ˆℓ ) = E𝑉 [LSW1D ( 𝑝𝑉ℓ , 𝑝ˆ𝑉ℓ )],                 (4.18)
where
                                                𝑝𝑉ℓ := {⟨𝐹𝑚ℓ , 𝑉⟩}                               (4.19)
is a set consisting of batched projections of the feature maps 𝐹𝑚ℓ onto the directions 𝑉; if we make a
vector 𝑃𝑉ℓ consisting of the elements of 𝑝𝑉ℓ , the 1𝐷 Sliced Wasserstein Loss is the 2-norm between
sets of sorted projections:
                                                       1                                   2
                          LSW1D ( 𝑝𝑉ℓ , 𝑝ˆ𝑉ℓ ) =               sort(𝑃𝑉ℓ ) − sort( 𝑃ˆ𝑉ℓ )   2
                                                                                                 (4.20)
                                                  len(𝑃𝑉 )ℓ
and the full Sliced Wasserstein loss over all the layers is
                                                     ∑︁𝐿
                                                                     ℓ          ℓ
                                LSW (𝐼1 , 𝐼2 ) =          LSW,ℓ ( 𝑝𝑉,𝐼   1
                                                                           , 𝑝𝑉,𝐼   2
                                                                                      ),         (4.21)
                                                     ℓ=1
for images 𝐼1 and 𝐼2 , respectively. For practical applications, one uses a loss of the form
                                                   ∑︁𝐿
                                                                       ℓ          ℓ
                               LSW (𝐼1 , 𝐼2 ) =          𝑤 𝑖 LSW,ℓ ( 𝑝𝑉,𝐼   1
                                                                              , 𝑝𝑉,𝐼  2
                                                                                        ),       (4.22)
                                                    ℓ=1
where 𝑤 𝑖 are weight terms that set to zero for layers that are not used. We will use this formulation
for the rest of the paper.
     An implementation of 4.20 is provided below in pseudocode:
                                                         92


Algorithm 4.1 Implementation of 4.20
  1: Set 𝐼WN as variable to be updated by optimizer.
  2: for 𝑘 = 1, . . . , 𝑀 do
  3:    Calculate Extract(𝐼WN ).
  4:    Calculate Extract(𝐼ref ).
  5:    Calculate LSlicing (𝐼WN , 𝐼ref ).
  6:    Backpropogate and update 𝐼WN .
  7: end for
  8: Return updated 𝐼WN as synthesized texture.
     Pitie et al. [39] showed that Sliced Wasserstein Distance satisfies:
                                         LSW ( 𝑝, 𝑝)
                                                  ˆ = 0 =⇒ 𝑝 = 𝑝.
                                                                ˆ
The same does not hold for other losses used for texture synthesis, such as the gram matrix loss.
Thus, using a Sliced Wasserstein-based loss should capture more stationary statistics compared to
the traditional Gram Loss.
       Figure 4.9 Synthesis results using the gram matrix loss, denoted LGram and using LSW .
     In Figure 4.9, some examples of synthesis using sliced wasserstein loss are given above. While
the alignment is better for the synthesis, the algorithm still has trouble capturing long range
                                                     93


constraints, like alignment. The authors propose adding a spatial tag as fourth channel in an RGB
image to guide synthesis, which is shown in Figure 4.10.
                     Figure 4.10 Synthesis results with and without a spatial tag.
While adding a spatial tag creates visually appealing textures, one needs a spatial tag for each
constraint they wish to impose. However, it’s unlikely to have a prepared spatial tag for each texture
one would like to generate, especially if the texture is highly irregular. Thus, it would be ideal if
there was an algorithm that could capture long range constraints without user guidance and without
tedious hyperparameter tuning.
4.6   New Texture Synthesis Algorithm
    The method proposed in [24] cannot effectively capture long range constraints unless a user-
added spatial tag is added to guide synthesis. This is most likely Sliced Wasserstein Loss does
not fully capture nonstationary statistics in an image. Thus we propose a new set of statistics for
texture synthesis based on Sliced Wasserstein Loss to capture long range constraints without any
                                                  94


supervision or hyperparameter tuning. Our experiments show that our proposed set of statistics
provides competitive results with current approaches. Additionally, we augment our synthesis
results via a coarse-to-fine multi-scale procedure, which yield state-of-the-art results.
    Instead of just matching distributions via slicing over the channel dimension of the feature
maps, we purpose matching more statistics in a very simple way. Consider a set of feature maps
𝐹 ℓ ∈ R𝐻ℓ ×𝑊ℓ ×𝑁ℓ . In the original algorithm, we unravel each 𝐻ℓ × 𝑊ℓ feature map, consider each
pixel 𝑚 to get feature vectors of length 𝑁ℓ , and project onto direction 𝑉 in Eq. (4.21). Another
way to reshape the feature maps is to unravel them into 𝐻ℓ different 𝑊ℓ × 𝑁ℓ feature maps, 𝐹𝐻ℓ
        ℓ being a vector of all 𝑛th pixels of each feature vector), and project them onto 𝑉
(with 𝐹𝐻,𝑛                                                                                          𝐻ℓ ∈ S .
                                                                                                          𝐻ℓ
Analogous to Eq. (4.19), for the distribution 𝑝 ℓ𝐻 associated to feature vectors {𝐹𝐻,𝑛      ℓ } we can define
another set of batched projections given by
                                             𝑝𝑉ℓ 𝐻 = {⟨𝐹𝐻,𝑛 ℓ
                                                              , 𝑉𝐻ℓ ⟩}.                                (4.23)
                                                  ℓ
    The corresponding additional loss term is
                                                    𝐿
                                                   ∑︁                                  
                                                                                ℓ
                             LSW,𝐻 (𝐼1 , 𝐼2 ) =        𝑤 𝑖 LSW,ℓ 𝑝𝑉ℓ 𝐻   ,𝐼1 , 𝑝𝑉𝐻  ,𝐼2   .            (4.24)
                                                                       ℓ          ℓ
                                                   ℓ=1
    Intuitively, this loss term accounts for alignment in an image by slicing over the dimension for
the height of the feature maps rather than the dimension for the channel of the feature maps. The
new loss function we consider is
                             LSlicing (𝐼1 , 𝐼2 ) = LSW (𝐼1 , 𝐼2 ) + LSW,𝐻 (𝐼1 , 𝐼2 ),                  (4.25)
which is the sum of Eq. (4.21) and Eq. (4.24).
    Denote the feature map extraction from VGG19 as Extract(𝐼). For our algorithm, we start with
a reference image 𝐼ref and a white noise 𝐼WN and run for 𝑀 epochs. Our implementation for slicing
synthesis is the same as in [24] for Eq. (4.21) where we perform a batch projection on 𝑁ℓ directions
and sort. For our additional loss term in equation Eq. (4.24), the number of batched projections we
make is 𝐻ℓ . In our next algorithm, assume that the goal is to synthesize an image the same size as
the reference image without any loss of generality.
                                                         95


Algorithm 4.2 Synthesis Algorithm
  1: Set 𝐼WN as variable to be updated by optimizer.
  2: for 𝑘 = 1, . . . , 𝑀 do
  3:    Calculate Extract(𝐼WN ).
  4:    Calculate Extract(𝐼ref ).
  5:    Calculate LSlicing (𝐼WN , 𝐼ref ).
  6:    Backpropogate and update 𝐼WN .
  7: end for
  8: Return updated 𝐼WN as synthesized texture.
     The settings for the slicing loss are to use the first 12 layers of VGG19 for calculating LSW and
the first two convolutions (after the ReLU) in each convolution block for for calculating LSW,𝐻 .
The L-BFGS optimizer [30] is used for optimization with a learning rate of 𝜂 = 1.
     We start by comparing results with [24]. We use the author’s TensorFlow implementation,
which is a previous commit in https://github.com/tchambon/A-Sliced-Wasserstein-Loss-for-Neural-
Texture-Synthesis. without their spatial tag on some relatively periodic textures. For all the
experiments in this paper, our texture sources were the following:
     • https://github.com/omrysendik/DCor/tree/master/Data
     • https://www.robots.ox.ac.uk/ vgg/data/dtd
for generating 256 × 256 textures. The results are given in Fig. 4.11.
                                                    96


Figure 4.11 Original SW loss vs. Eq. (4.25). Left: Reference. Mid: SW Loss. Right: Eq. (4.25)
Loss (Ours).
   However, there are still some textures that are not generated perfectly. See Fig. 4.12 for some
examples.
Figure 4.12 Less successful cases. Left: Reference. Mid: SW Loss. Right: Eq. (4.25) Loss
(Ours).
   Now the results of using Eq. (4.25) is compared to [31]. Note that we use the implementation
from [20] for this comparison. as an additional point of comparison. There is no comparison
with [43] because experiments from [43, 20] have shown the algorithm does not yield much
                                                97


improvement for nonperiodic textures. Take note of the top row of Figure 4.13 in particular. The
Figure 4.13 Gram Matrix + Spectrum vs. Eq. (4.25) Loss. Left: Reference. Mid: Spectrum
Constraint. Right: Eq. (4.25) Loss (Ours).
spectrum constraint produces better results, which suggests that there could be improvement in our
proposed synthesis method.
    Quantitative comparisons between the original SW loss, the spectrum constraint, and our
proposed method using a set of 34 images are provided in this section. The LPIPS [52], FID [25],
crop-based FID. This is done by taking sixty-four 128 × 128 crops of the reference texture and
synthesized texture for each exemplar (the FID/KID score is calculated between these two sets of
images. For the ground truth case, a different set of crops of the reference is used) like [32], KID
[8], and crop-based KID (c-KID) score are provided in Table 4.1. For FID and KID based scores,
the implementation from [37] is used. In Table 4.1, SW stands for the method using the original
SW Loss, Spec. stands for using a spectrum constraint, and GT stands for the Ground Truth.
                                                98


                                    Table 4.1 Quantitative Comparison
                          Method   LPIPS         FID       c-FID      KID      c-KID
                           Ours     0.437     107.220 71.938 −0.014            0.073
                            SW      0.454     101.768 78.683 −0.016            0.083
                           Spec.    0.447      99.615 78.250 −0.016            0.083
                            GT        0            0       18.069 −0.025         0
     From the table, our results are competitive and our proposed set of statistics did not require
searching for a proper hyperparameter to get competitive results. Note that our results for FID and
c-FID vary compared to [32] because FID is a biased estimate [14] and our sample count is very
low.
4.7     Improvements via a Multi-scale Approach
     Since there is room for improvement in our synthesis, we augment our algorithm with a multi-
scale procedure in a manner identical to [20, 49]. For the multi-scale algorithm at 𝐾 scales, let
𝐼ref,𝑖 be the reference image downsampled by a scale factor of 2𝑖 with 𝑖 = 0, . . . , 𝐾, and define the
upsampling operator as Upsample(𝐼). Lastly, define the output of Algorithm 4.2 using the notation
𝐼Synthesis = SWSynthesis(𝐼input , 𝐼ref ), where 𝐼input is the input to be optimized via backpropogation,
𝐼ref is the reference texture, and 𝐼Synthesis is the output after synthesis using Algorithm 1. The results
are given in Figure 4.14.
Algorithm 4.3 Multi-scale Synthesis Algorithm
  1: Initialize 𝐼Synthesis as a white noise that is the same size as the reference texture downsampled
      by 2𝐾 .
  2: for 𝑖 = 0, . . . , 𝐾 do
  3:     𝐼Synthesis ← SWSynthesis(𝐼Synthesis , 𝐼ref,𝐾−𝑖 ).
  4:     𝐼Synthesis ← Upsample(𝐼Synthesis ).
  5: end for
  6: Return 𝐼Synthesis as the synthesized texture.
     In Figure 4.14, note the small improvements in edge generation and general structure when
using 𝐾 = 1 compared to 𝐾 = 0.
                                                       99


Figure 4.14 Multi-scale procedure at different scales. Left: Reference. Mid Left: 𝐾 = 0. Mid
Right: 𝐾 = 1. Right: 𝐾 = 2.
    The intuition for why it works is that it generates the details of the texture in a coarse-to-fine
way; the initial scale generates the general color and macro-scale features and additional scales add
on fine-grain details in an image.
Figure 4.15 Progression of synthesis that lead to repetitions. Left: Reference Texture. Middle
Left: 𝐾 = 0. Middle Right: 𝐾 = 1. Right: 𝐾 = 2.
    However, it is possible to create replica textures for larger values of 𝐾. Of the 34 images
                                                  100


generated for the experiments, there were four repetitions when 𝐾 = 2. See Figure 4.15 for
examples.
Additionally, a quantitative comparison using the same image quality metrics from before is
provided. Unlike in previous comparisons, it would be better to have a higher LPIPS score at a
comparable generative metric; this would mean that our textures are less likely to be replicas, but
still have similar qualities. The results are given in Table 4.2.
                              Table 4.2 Quantitative Score for 𝐾 = 0, 1, 2
                        Scale   LPIPS        FID      c-FID      KID     c-KID
                       𝐾=0       0.437    107.220 71.938 −0.014          0.073
                       𝐾=1       0.381     67.118 53.908 −0.018          0.044
                       𝐾=2       0.250     38.304 40.220 −0.022          0.027
Based on the scores and Fig. 5, 𝐾 = 1 provides a nice mix between diversity and image quality at
our fixed image size.
     In the Figure 4.16, the multi-scale approach is applied without the additional loss term in
Eq. (4.25).
Figure 4.16 Results with SW Loss. Left: Reference. Mid Left: 𝐾 = 0. Mid Right: 𝐾 = 1.
Right: 𝐾 = 2.
                                                  101


From our results, the multi-scale approach by itself is not enough to fully capture nonstationary
statistics or enforce long range constraints. That is to say, the loss term added in Eq. (4.25) is
absolutely necessary to capture long-range constraints in textures.
Figure 4.17 Comparison of results. Left: Reference. Mid Left: SW Loss. Mid Right: Gonthier.
Right: 𝐾 = 1 (Ours).
    Lastly, the results using 𝐾 = 1 are compared with [20] using the default settings from their
experiments (𝐾 = 2) to show the effectiveness our mutli-scale results relative to another multi-scale
algorithm. The results from [24] again for an additional point of reference. Some results are shown
                                                102


in Figure 4.17 and a quantitative study is given in Table 4.3.
                        Table 4.3 Comparison of 𝐾 = 1, SW Loss, Gonthier
                      Method     LPIPS      FID       c-FID      KID     c-KID
                      𝐾=1        0.381     67.118 53.908 −0.018           0.044
                        SW       0.454    101.768 78.683 −0.016           0.083
                     Gonthier    0.415     77.569 67.728 −0.018           0.067
                        GT          0         0      18.069 −0.025          0
4.8   Conclusions
    We present a modification of texture synthesis via Sliced Wasserstein Loss that has the ability to
add long range constraints without user-added spatial tags (supervision). Our additional loss term
can be thought of as a regularization term, but unlike traditional regularization terms, one does not
need to hyperparameter tune to enforce long range constraints. That is to say, the proposed method
requires less user supervision for competitive results. One thing we have not tested is whether the
number of scales is dependent on image size. We believe this is true, and it is probable that one
would need to choose the number of scales based on the size of the image.
                                                 103


                                             CHAPTER 5
                                     CONCLUDING REMARKS
We have addressed two important problems in statistical signal processing: mtuli-reference align-
ment and texture synthesis. For multi-reference alignment, like we mentioned in chapter 2, many
open problems remain. In one dimension, one question to consider is how we can approach general
diffeomorphisms. While considering a specific set of group actions seems to be the most viable
approach, one wonders whether it would be possible to consider limiting the size of ∥𝜏′ ∥ ∞ and
∥𝜏′′ ∥ ∞ , which would make 𝜏 "close" to a translation. Additionally, is it possible to generalize all
our results to two and three dimensions? Both these cases would be more relevant to practitioners
in cryo-EM.
    Regarding texture synthesis, one wonders whether a deep representation is actually needed.
That is, could we use a wavelet transform or scattering transform with sliced wasserstein loss to
generate textures? Our preliminary experiments, which are not available in this thesis, suggest this
is possible. However, the synthesis quality is not as strong. This suggests that we could use a
different filter bank for better synthesis. VGG models have omnidirectional filters and all the filters
are not necessarily positive, which we believe is responsible for strong synthesis. Future work will
focus on testing our hypotheses.
                                                 104


                                      BIBLIOGRAPHY
[1] Emmanuel Abbe, Tamir Bendory, William Leeb, João M Pereira, Nir Sharon, and Amit
     Singer. Multireference alignment is easier with an aperiodic translation distribution. IEEE
     Transactions on Information Theory, 65(6):3565–3584, 2018.
[2] Afonso Bandeira, Yutong Chen, Roy R Lederman, and Amit Singer. Non-unique games over
     compact groups and orientation estimation in cryo-em. Inverse Problems, 2020.
[3] Afonso S Bandeira, Ben Blum-Smith, Joe Kileel, Amelia Perry, Jonathan Weed, and Alexan-
     der S Wein. Estimation under group actions: recovering orbits from invariants. arXiv preprint
     arXiv:1712.10163, 2017.
[4] Afonso S Bandeira, Nicolas Boumal, and Vladislav Voroninski. On the low-rank approach for
     semidefinite programs arising in synchronization and community detection. In Conference
     on learning theory, pages 361–382, 2016.
[5] Afonso S Bandeira, Moses Charikar, Amit Singer, and Andy Zhu. Multireference alignment
     using semidefinite programming. In Proceedings of the 5th conference on Innovations in
     theoretical computer science, pages 459–470. ACM, 2014.
[6] Tamir Bendory, Alberto Bartesaghi, and Amit Singer. Single-particle cryo-electron mi-
     croscopy: Mathematical theory, computational challenges, and opportunities. IEEE Signal
     Processing Magazine, pages 58–76, March 2020.
[7] Tamir Bendory, Nicolas Boumal, Chao Ma, Zhizhen Zhao, and Amit Singer. Bispectrum
     inversion with application to multireference alignment. IEEE Transactions on Signal Pro-
     cessing, 66(4):1037–1050, 2017.
[8] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying
     mmd gans. In International Conference on Learning Representations.
[9] Nicolas Boumal. Nonconvex phase synchronization.           SIAM Journal on Optimization,
     26(4):2355–2377, 2016.
[10] Thibaud Briand, Jonathan Vacher, Bruno Galerne, and Julien Rabin. The heeger & bergen
     pyramid based texture synthesis algorithm. Image processing on line, 4:276–299, 2014.
[11] Lisa Gottesfeld Brown. A survey of image registration techniques. ACM computing surveys
     (CSUR), 24(4):325–376, 1992.
[12] Yuxin Chen and Emmanuel J Candès. The projected power method: An efficient algo-
     rithm for joint alignment from pairwise differences. Communications on Pure and Applied
     Mathematics, 71(8):1648–1714, 2018.
                                              105


[13] Yuxin Chen, Leonidas J Guibas, and Qi-Xing Huang. Near-optimal joint object matching via
     convex relaxation. In Proceedings of the 31st International Conference on Machine Learning,
     volume 32 of Proceedings of Machine Learning Research, pages 100–108, 2014.
[14] Min Jin Chong and David Forsyth. Effectively unbiased fid and inception score and where
     to find them. In Proceedings of the IEEE/CVF conference on computer vision and pattern
     recognition, pages 6070–6079, 2020.
[15] WB Collis, PR White, and JK Hammond. Higher-order spectra: the bispectrum and trispec-
     trum. Mechanical systems and signal processing, 12(3):375–394, 1998.
[16] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete
     data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological),
     39(1):1–22, 1977.
[17] Robert Diamond. On the multiple simultaneous superposition of molecular structures by rigid
     body transformations. Protein Science, 1(10):1279–1287, 1992.
[18] Hassan Foroosh, Josiane B Zerubia, and Marc Berthod. Extension of phase correlation to
     subpixel registration. IEEE transactions on image processing, 11(3):188–200, 2002.
[19] Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional
     neural networks. Advances in neural information processing systems, 28, 2015.
[20] Nicolas Gonthier, Yann Gousseau, and Saïd Ladjal. High-resolution neural texture synthesis
     with long-range constraints. Journal of Mathematical Imaging and Vision, 64(5):478–492,
     2022.
[21] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[22] Lars Peter Hansen. Large sample properties of generalized method of moments estimators.
     Econometrica: Journal of the Econometric Society, pages 1029–1054, 1982.
[23] David J Heeger and James R Bergen. Pyramid-based texture analysis/synthesis. In Proceedings
     of the 22nd annual conference on Computer graphics and interactive techniques, pages 229–
     238, 1995.
[24] Eric Heitz, Kenneth Vanhoey, Thomas Chambon, and Laurent Belcour. A sliced wasserstein
     loss for neural texture synthesis. In Proceedings of the IEEE/CVF Conference on Computer
     Vision and Pattern Recognition, pages 9412–9420, 2021.
[25] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochre-
     iter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.
     Advances in neural information processing systems, 30, 2017.
                                                106


[26] Matthew Hirn and Anna Little. Unbiasing procedures for scale-invariant multi-reference
     alignment. arXiv preprint arXiv:2107.01274, 2021.
[27] Matthew Hirn and Anna Little. Wavelet invariants for statistically robust multi-reference
     alignment. Information and Inference: A Journal of the IMA, 10(4):1287–1351, 2021.
[28] Zvi Kam. The reconstruction of structure from electron micrographs of randomly oriented
     particles. In Electron Microscopy at Molecular Dimensions, pages 270–277. Springer, 1980.
[29] Fima C Klebaner. Introduction to stochastic calculus with applications. World Scientific
     Publishing Company, 2012.
[30] Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimiza-
     tion. Mathematical programming, 45(1-3):503–528, 1989.
[31] Gang Liu, Yann Gousseau, and Gui-Song Xia. Texture synthesis through convolutional
     neural networks and spectrum constraints. In 2016 23rd International Conference on Pattern
     Recognition (ICPR), pages 3234–3239. IEEE, 2016.
[32] Guilin Liu, Rohan Taori, Ting-Chun Wang, Zhiding Yu, Shiqiu Liu, Fitsum A Reda, Karan
     Sapra, Andrew Tao, and Bryan Catanzaro. Transposer: Universal texture synthesis using
     feature maps as transposed convolution filter. arXiv preprint arXiv:2007.07243, 2020.
[33] Stéphane Mallat. Recursive interferometric representations. In 18th European Signal Pro-
     cessing Conference (EUSIPCO-2010), Aalborg, Denmark, 2010.
[34] Stéphane Mallat. Group invariant scattering. Communications on Pure and Applied Mathe-
     matics, 65(10):1331–1398, October 2012.
[35] Wooram Park and Gregory S Chirikjian. An assembly automation approach to alignment of
     noncircular projections in electron microscopy. IEEE Transactions on Automation Science
     and Engineering, 11(3):668–679, 2014.
[36] Wooram Park, Charles R Midgett, Dean R Madden, and Gregory S Chirikjian. A stochastic
     kinematic model of class averaging in single-particle electron microscopy. The International
     journal of robotics research, 30(6):730–754, 2011.
[37] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties
     in gan evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and
     Pattern Recognition, pages 11410–11420, 2022.
[38] Amelia Perry, Alexander S Wein, Afonso S Bandeira, and Ankur Moitra. Message-passing
     algorithms for synchronization problems over compact groups. Communications on Pure and
     Applied Mathematics, 71(11):2275–2322, 2018.
                                               107


[39] F. Pitie, A.C. Kokaram, and R. Dahyot. N-dimensional probability density function transfer
     and its application to color transfer. In Tenth IEEE International Conference on Computer
     Vision (ICCV’05) Volume 1, volume 2, pages 1434–1439 Vol. 2, 2005.
[40] Javier Portilla and Eero P Simoncelli. A parametric texture model based on joint statistics of
     complex wavelet coefficients. International journal of computer vision, 40:49–70, 2000.
[41] Brian M Sadler and Georgios B Giannakis. Shift-and rotation-invariant object reconstruction
     using the bispectrum. JOSA A, 9(1):57–69, 1992.
[42] Sjors HW Scheres, Mikel Valle, Rafael Nuñez, Carlos OS Sorzano, Roberto Marabini, Ga-
     bor T Herman, and Jose-Maria Carazo. Maximum-likelihood multi-reference refinement for
     electron microscopy images. Journal of molecular biology, 348(1):139–149, 2005.
[43] Omry Sendik and Daniel Cohen-Or. Deep correlations for texture synthesis. ACM Transactions
     on Graphics (ToG), 36(5):1–15, 2017.
[44] Nir Sharon, Joe Kileel, Yuehaw Khoo, Boris Landa, and Amit Singer. Method of moments
     for 3-D single particle ab initio modeling with non-uniform distribution of viewing angles.
     Inverse Problems, 36(4):044003, 2020.
[45] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
     image recognition. arXiv preprint arXiv:1409.1556, 2014.
[46] Amit Singer. Angular synchronization by eigenvectors and semidefinite programming. Ap-
     plied and computational harmonic analysis, 30(1):20–36, 2011.
[47] Amit Singer. Mathematics for cryo-electron microscopy. In Proceedings of the International
     Congress of Mathematicians: Rio de Janeiro 2018, pages 3995–4014. World Scientific, 2018.
[48] Guillaume Tartavel, Yann Gousseau, and Gabriel Peyré. Variational texture synthesis with
     sparsity and spectrum constraints. Journal of Mathematical Imaging and Vision, 52:124–144,
     2015.
[49] Guillaume Tartavel, Gabriel Peyré, and Yann Gousseau. Wasserstein loss for image synthesis
     and restoration. SIAM Journal on Imaging Sciences, 9(4):1726–1755, 2016.
[50] Douglas L Theobald and Phillip A Steindel. Optimal simultaneous superpositioning of
     multiple structures with missing data. Bioinformatics, 28(15):1972–1979, 2012.
[51] Liping Yin and Albert Chua. Long range constraints for neural texture synthesis using sliced
     wasserstein loss. arXiv preprint arXiv:2211.11137, 2022.
[52] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea-
     sonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE
                                                108


     conference on computer vision and pattern recognition, pages 586–595, 2018.
[53] Yiqiao Zhong and Nicolas Boumal. Near-optimal bounds for phase synchronization. SIAM
     Journal on Optimization, 28(2):989–1016, 2018.
[54] J Portegies Zwart, René van der Heiden, Sjoerd Gelsema, and Frans Groen. Fast translation
     invariant classification of hrr range profiles in a zero phase representation. IEE Proceedings-
     Radar, Sonar and Navigation, 150(6):411–418, 2003.
                                                  109