ADVANCES IN NEAR-FIELD TO BLIND FAR-FIELD PTYCHOGRAPHY, AND COMPRESSED CLASSIFICATION FROM PHASELESS MEASUREMENTS

By Mark Philip Roach

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Mathematics – Doctor of Philosophy

2023

ABSTRACT

Chapter 1 concerns the introduction of Fourier phase retrieval. In many imaging systems, one can only measure the magnitude-square of the Fourier transform of the underlying signal, known as the power spectral density. At a large enough distance from the imaging plane, the measurements are given by the Fourier transform of the image. Thus, when capturing images at large distances, optical devices essentially measure the Fourier transform magnitude of the object being imaged. The problem of reconstructing a signal from its Fourier magnitude is known as Fourier phase retrieval. This problem arises in many areas of engineering and applied physics, including optics, X-ray crystallography (determining the atomic and molecular structure of a crystal), astronomical imaging, speech processing, and computational biology. Also in Chapter 1, we introduce the concept of dimensionality reduction ([99], [25], [59], [62]), which is a tool used in several disciplines, including statistics, data mining, pattern recognition, machine learning, artificial intelligence, and optimization. Dimensionality reduction refers to the act of transforming data from a high-dimensional space to a low-dimensional space.

Chapter 2 discusses Fourier ptychography, an imaging technique in which a sample is illuminated at different angles of incidence (effectively shifting the sample's Fourier transform), after which a lens acts as a low-pass filter, thereby effectively providing localized Fourier information about the sample around frequencies dictated by each angle of illumination. Near-Field (Fourier) Ptychography (NFP) (see, e.g., [107, 108, 125]) occurs when the sample is placed at a short defocus distance having a large Fresnel number. We prove that certain NFP measurements are robustly invertible (up to an unavoidable global phase ambiguity) for specific Point Spread Functions (PSFs) and physical masks which lead to well-conditioned lifted linear systems. We then apply a block phase retrieval algorithm using weighted angular synchronization and prove that the proposed approach accurately recovers the measured sample for these specific PSF and mask pairs. Finally, we also propose using a Wirtinger Flow for NFP problems and numerically evaluate that alternate approach both against our main proposed approach, as well as with NFP measurements for which our main approach does not apply.

Chapter 3 concerns blind ptychography. Far-field ptychography occurs when there is a large enough defocus distance (when the Fresnel number is $\ll 1$) to obtain magnitude-square Fourier transform measurements. To remove ambiguities, masks are utilized to ensure that the output of any recovery algorithm is unique up to a global phase. In Chapter 3, we assume that both the sample and the mask are unknown, and we apply blind deconvolutional techniques to solve for both. Numerical experiments demonstrate that the technique works well in practice and is robust under noise.

Finally, we have Chapter 4. Let $\mathcal{M}$ be a compact $d$-dimensional submanifold of $\mathbb{R}^N$ with reach $\tau$ and volume $V_{\mathcal{M}}$. Fix $\epsilon \in (0, 1)$.
In this chapter, we prove that a nonlinear function $f : \mathbb{R}^N \rightarrow \mathbb{R}^m$ exists with $m \leq C\,\frac{d}{\epsilon^2}\,\log\!\left(\frac{\sqrt{d}\,V_{\mathcal{M}}}{\tau}\right)$ such that

$$(1 - \epsilon)\|\mathbf{x} - \mathbf{y}\|_2 \leq \|f(\mathbf{x}) - f(\mathbf{y})\|_2 \leq (1 + \epsilon)\|\mathbf{x} - \mathbf{y}\|_2$$

holds for all $\mathbf{x} \in \mathcal{M}$ and $\mathbf{y} \in \mathbb{R}^N$. In effect, $f$ not only serves as a bi-Lipschitz function from $\mathcal{M}$ into $\mathbb{R}^m$ with bi-Lipschitz constants close to one, but also approximately preserves all distances from points not in $\mathcal{M}$ to all points in $\mathcal{M}$ in its image. Furthermore, the proof is constructive and yields an algorithm which works well in practice. In particular, it is empirically demonstrated herein that such nonlinear functions allow for more accurate compressive nearest neighbor classification than standard linear Johnson-Lindenstrauss embeddings do in practice. Furthermore, it is demonstrated that this approach works when the labelled data consists of NFP measurements.

Dedicated to Holly (2004-2018), Gypsy (2007-2021), and Sparkle (2006-2022).

ACKNOWLEDGEMENTS

I would like to thank Mark Iwen for his guidance and support. I would like to thank Guoan Zheng for answering questions about (and providing code for) his work in [125]. I would like to thank Michael Perlmutter for his work and collaboration on the Near-field Ptychography chapter. My work was supported in part by NSF DMS 1912706. Work in Chapter 2 was first published in Sampling Theory, Signal Processing, and Data Analysis [Volume 21, Article 6, 2023] by Springer Nature ([50]).

TABLE OF CONTENTS

LIST OF ABBREVIATIONS

CHAPTER 1 INTRODUCTION TO PHASE RETRIEVAL AND DIMENSIONALITY REDUCTION
1.1 INTRODUCTION
1.2 PHASE RETRIEVAL PROBLEM
1.3 DIMENSIONALITY REDUCTION

CHAPTER 2 TOWARD FAST AND PROVABLY ACCURATE NEAR-FIELD PTYCHOGRAPHIC PHASE RETRIEVAL
2.1 ABSTRACT
2.2 INTRODUCTION
2.3 PRELIMINARIES: PRIOR RESULTS FOR FAR-FIELD PTYCHOGRAPHY USING LOCAL MEASUREMENTS
2.4 NEAR FROM FAR: GUARANTEED NEAR-FIELD PTYCHOGRAPHIC RECOVERY VIA FAR-FIELD RESULTS
2.5 ERROR ANALYSIS FOR ALGORITHM 2.1
2.6 AN ALTERNATE APPROACH: NEAR-FIELD PTYCHOGRAPHY VIA WIRTINGER FLOW
2.7 NUMERICAL SIMULATIONS
2.8 APPLICATION OF ALGORITHM 2.1
2.9 CONCLUSIONS AND FUTURE WORK

CHAPTER 3 BLIND PTYCHOGRAPHY VIA BLIND DECONVOLUTION
3.1 ABSTRACT
3.2 INTRODUCTION
3.3 FAR-FIELD FOURIER PTYCHOGRAPHY
3.4 BLIND DECONVOLUTION
3.5 BLIND PTYCHOGRAPHY
3.6 CONCLUSIONS AND FUTURE WORK
CHAPTER 4 ON OUTER BI-LIPSCHITZ EXTENSIONS OF LINEAR JOHNSON-LINDENSTRAUSS EMBEDDINGS OF LOW-DIMENSIONAL SUBMANIFOLDS OF $\mathbb{R}^N$
4.1 ABSTRACT
4.2 INTRODUCTION
4.3 NOTATION AND PRELIMINARIES
4.4 THE MAIN BI-LIPSCHITZ EXTENSION RESULTS AND THEIR PROOFS
4.5 THE PROOF OF THEOREM 4.2.1
4.6 A NUMERICAL EVALUATION OF TERMINAL EMBEDDINGS
4.7 COMPRESSED CLASSIFICATION FROM PHASELESS MEASUREMENTS

CHAPTER 5 CONTRIBUTIONS AND FUTURE WORK

BIBLIOGRAPHY

APPENDIX A NEAR-FIELD PTYCHOGRAPHY
APPENDIX B FAR-FIELD PTYCHOGRAPHY
APPENDIX C BLIND DECONVOLUTION

LIST OF ABBREVIATIONS

• Let $\mathbf{x}, \mathbf{m} \in \mathbb{C}^d$ denote the specimen and mask
• Far-field Ptychographic Measurements: $|(F_d(\mathbf{x} \circ S_k \mathbf{m}))_\ell|^2$
• Point Spread Function (PSF): $\mathbf{p} \in \mathbb{C}^d$
• Near-field Ptychographic Measurements: $|(\mathbf{p} * (S_k \mathbf{m} \circ \mathbf{x}))_\ell|^2$
• Indexing: $[d] = \{1, 2, \ldots, d\}$, $[d]_0 = \{0, 1, \ldots, d-1\}$
• Support: $supp(\mathbf{x}) := \{n \in [d]_0 \mid x_n \neq 0\}$
• Fourier Transform: $F_d$ - the $d \times d$ DFT matrix
• Reversal: $\tilde{x}_n := x_{-n \bmod d}$, $\forall n \in [d]_0$
• Shift Operator: $(S_\ell \mathbf{x})_n = x_{(\ell+n) \bmod d}$, $\forall n \in [d]_0$
• Circular Convolution: $(\mathbf{x} *_d \mathbf{y})_\ell := \sum_{n=0}^{d-1} x_n y_{(\ell-n) \bmod d}$
• Hadamard Product: $(\mathbf{x} \circ \mathbf{y})_\ell := x_\ell y_\ell$
• Decoupling Lemma: $\left((\mathbf{x} \circ S_{-\ell}\,\mathbf{y}) *_d (\widetilde{\bar{\mathbf{x}}} \circ S_\ell\, \widetilde{\bar{\mathbf{y}}})\right)_k = \left((\mathbf{x} \circ S_{-k}\,\bar{\mathbf{x}}) *_d (\tilde{\mathbf{y}} \circ S_k\, \widetilde{\bar{\mathbf{y}}})\right)_\ell$
• $\bar{\mathbf{A}}$ - complex Gaussian matrix, $\bar{A}_{ij} \sim \mathcal{N}(0, 1/2) + i\,\mathcal{N}(0, 1/2)$
• $\mathbf{a}_\ell$ - $\ell$-th column of $\mathbf{A}^*$
• $\mathbf{B}$ - first $K$ columns of $F_d$, $\mathbf{b}_\ell$ - $\ell$-th column of $\mathbf{B}^*$
• $\mathbf{e}$ - complex Gaussian vector, $\mathbf{e} \sim \mathcal{N}\!\left(0, \frac{\sigma^2 L_0^2}{2d} I_d\right) + i\,\mathcal{N}\!\left(0, \frac{\sigma^2 L_0^2}{2d} I_d\right)$
• $\mathcal{A}$ - linear operator: $\mathcal{A}(Z) := \{\mathbf{b}_\ell^* Z \mathbf{a}_\ell\}_{\ell=1}^{d} \in \mathbb{C}^{d \times 1}$
• $\mathcal{A}^*$ - adjoint linear operator: $\mathcal{A}^*(\mathbf{z}) := \sum_{\ell=1}^{d} z_\ell \mathbf{b}_\ell \mathbf{a}_\ell^* \in \mathbb{C}^{K \times N}$
• $(\mathbf{h}_0, \mathbf{x}_0)$ - underlying truth, $\mathbf{y} = \mathcal{A}(\mathbf{h}_0\mathbf{x}_0^*) + \mathbf{e}$, $\sqrt{L_0} = \|\mathbf{h}_0\|_2 = \|\mathbf{x}_0\|_2$
• $(\mathbf{u}_0, \mathbf{v}_0)$ - initial estimate, $(\mathbf{u}_t, \mathbf{v}_t)$ - estimate during gradient descent
• $(\mathbf{h}, \mathbf{x})$ - obtained estimate after the algorithm, $L = \|\mathbf{h}\| \cdot \|\mathbf{x}\|$
• $\delta = \delta(\mathbf{z}) = \delta(\mathbf{h}, \mathbf{x}) := \frac{\|\mathbf{h}\mathbf{x}^* - \mathbf{h}_0\mathbf{x}_0^*\|_F}{L_0}$; $\mu_h$ - incoherence, $\mu_h^2 = \frac{L\,\|\mathbf{B}\mathbf{h}_0\|_\infty^2}{\|\mathbf{h}_0\|_2^2}$
• $N_{\sqrt{L_0}} := \{(\mathbf{h}, \mathbf{x}) \mid \|\mathbf{h}\| \leq 2\sqrt{L_0},\ \|\mathbf{x}\| \leq 2\sqrt{L_0}\}$
• $N_\mu := \{\mathbf{h} \mid \sqrt{d}\,\|\mathbf{B}\mathbf{h}\|_\infty \leq 4\sqrt{L_0}\,\mu\}$, $\mu_h \leq \mu$
• $N_\epsilon := \{(\mathbf{h}, \mathbf{x}) \mid \|\mathbf{h}\mathbf{x}^* - \mathbf{h}_0\mathbf{x}_0^*\|_F \leq \epsilon L_0\}$, $0 < \epsilon \leq \frac{1}{15}$
• $N_{\tilde{F}} := \{(\mathbf{h}, \mathbf{x}) \mid \tilde{F}(\mathbf{h}, \mathbf{x}) \leq \frac{1}{3}\epsilon^2 L_0^2 + \|\mathbf{e}\|^2\}$
• $F(\mathbf{h}, \mathbf{x}) := \|\mathcal{A}(\mathbf{h}\mathbf{x}^* - \mathbf{h}_0\mathbf{x}_0^*) + \mathbf{e}\|^2$, $F_0(\mathbf{h}, \mathbf{x}) := \|\mathcal{A}(\mathbf{h}\mathbf{x}^* - \mathbf{h}_0\mathbf{x}_0^*)\|^2$
• $F(\mathbf{h}, \mathbf{x}) = \|\mathbf{e}\|^2 + F_0(\mathbf{h}, \mathbf{x}) - 2\,\mathrm{Re}(\langle \mathcal{A}^*(\mathbf{e}),\ \mathbf{h}\mathbf{x}^* - \mathbf{h}_0\mathbf{x}_0^*\rangle)$
• $G_0(z) := \max\{z - 1, 0\}^2 = [z - 1]_+^2$, $\rho \geq d^2 + 2\|\mathbf{e}\|^2$
• $G(\mathbf{h}, \mathbf{x}) := \rho\left[ G_0\!\left(\frac{\|\mathbf{h}\|^2}{2L}\right) + G_0\!\left(\frac{\|\mathbf{x}\|^2}{2L}\right) + \sum_{\ell=1}^{d} G_0\!\left(\frac{d\,|\mathbf{b}_\ell^*\mathbf{h}|^2}{8L\mu^2}\right) \right]$
• $\tilde{F}(\mathbf{h}, \mathbf{x}) := F(\mathbf{h}, \mathbf{x}) + G(\mathbf{h}, \mathbf{x})$
• Tube: $tube(\delta, \mathcal{M}) := \{\mathbf{x} \mid \exists \mathbf{y} \in \mathcal{M} \text{ with } \|\mathbf{x} - \mathbf{y}\|_2 \leq \delta\}$
• Euclidean Ball: $B_{\ell_2^N}(\mathbf{x}, \gamma) := \{\mathbf{y} \in \mathbb{R}^N \mid \|\mathbf{x} - \mathbf{y}\|_2 < \gamma\}$
• $-S := \{-\mathbf{x} \mid \mathbf{x} \in S\}$, $S \pm S := \{\mathbf{x} \pm \mathbf{y} \mid \mathbf{x}, \mathbf{y} \in S\}$, $U(\mathbf{x}) := \mathbf{x}/\|\mathbf{x}\|_2$
• Unit Secants: $S_T := U((T - T) \setminus \{0\}) = \left\{ \frac{\mathbf{x}-\mathbf{y}}{\|\mathbf{x}-\mathbf{y}\|_2} \;\middle|\; \mathbf{x}, \mathbf{y} \in T,\ \mathbf{x} \neq \mathbf{y} \right\}$
• $\epsilon$-JL map: $A \in \mathbb{C}^{m \times N}$ mapping $T \subset \mathbb{R}^N$ into $\mathbb{C}^m$ with $(1-\epsilon)\|\mathbf{x}\|_2^2 \leq \|A\mathbf{x}\|_2^2 \leq (1+\epsilon)\|\mathbf{x}\|_2^2$ for all $\mathbf{x} \in T$
• $\epsilon$-JL embedding: $A \in \mathbb{C}^{m \times n}$ an $\epsilon$-JL map of $T - T := \{\mathbf{x} - \mathbf{y} \mid \mathbf{x}, \mathbf{y} \in T\}$
• Radius: $rad(T) := \sup_{\mathbf{x} \in T} \|\mathbf{x}\|_2$
• Diameter: $diam(T) := rad(T - T) = \sup_{\mathbf{x}, \mathbf{y} \in T} \|\mathbf{x} - \mathbf{y}\|_2$
• $\delta$-cover: $\delta \in \mathbb{R}^+$, $S \subset T$ such that $\forall \mathbf{x} \in T$, $\exists \mathbf{y} \in S$ with $\|\mathbf{x} - \mathbf{y}\|_2 \leq \delta$
• $\delta$-Covering Number: $\mathcal{N}(T, \delta) \in \mathbb{N}$, the smallest achievable cardinality of a $\delta$-cover of $T$
• Gaussian Width: $w(T) := \mathbb{E} \sup_{\mathbf{x} \in T} \langle \mathbf{g}, \mathbf{x} \rangle$, where $\mathbf{g} \in \mathbb{R}^N$ has i.i.d. mean-zero, variance-one Gaussian entries
• Reach: $S \subset \mathbb{R}^N$, $\tau_S := \sup\{t \geq 0 \mid \forall \mathbf{x} \in \mathbb{R}^N \text{ with } d(\mathbf{x}, S) < t,\ \mathbf{x} \text{ has a unique closest point in } S\}$
• Convex Hull: $S \subset \mathbb{C}^N$, $conv(S) := \bigcup_{j=1}^{\infty} \left\{ \sum_{\ell=1}^{j} \alpha_\ell \mathbf{x}_\ell \;\middle|\; \mathbf{x}_\ell \in S,\ \alpha_\ell \in [0,1],\ \sum_{\ell=1}^{j} \alpha_\ell = 1 \right\}$
• $\epsilon$-Convex Hull Distortion: $S \subset \mathbb{R}^N$, $|\|\Phi\mathbf{x}\|_2 - \|\mathbf{x}\|_2| \leq \epsilon$, $\forall \mathbf{x} \in conv(S)$
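Several of the operators above (shift, reversal, circular convolution, and the Hadamard product) recur throughout the later chapters. The following minimal NumPy sketch fixes the indexing conventions used here; the function names are my own, and the Hadamard product is simply the elementwise `*`.

```python
import numpy as np

def shift(x, l):
    """Circulant shift: (S_l x)_n = x_{(n + l) mod d}."""
    return np.roll(x, -l)

def reverse(x):
    """Reversal about the first entry: (x~)_n = x_{-n mod d}."""
    return np.roll(x[::-1], 1)

def circ_conv(x, y):
    """Circular convolution (x *_d y)_l = sum_n x_n y_{(l - n) mod d}, via the DFT."""
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(y))

# Sanity checks of the indexing conventions on a small example.
x = np.arange(5, dtype=complex)
assert np.allclose(shift(x, 2), [2, 3, 4, 0, 1])
assert np.allclose(reverse(x), [0, 4, 3, 2, 1])
```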
CHAPTER 1 INTRODUCTION TO PHASE RETRIEVAL AND DIMENSIONALITY REDUCTION

Figure 1.1 [100] An illustration of a conventional ptychography setup.

1.1 INTRODUCTION

In many imaging systems, one can only measure the magnitude-square of the Fourier transform of the underlying signal, known as the power spectral density. For example, in an optical setting, detection devices like CCD (charge-coupled device) cameras and photosensitive films cannot measure the phase of a light wave; instead, they measure the photon flux (number of photons per second per unit area). At a large enough distance from the imaging plane, the measurements are given by the Fourier transform of the image. Thus, when capturing images at large distances, optical devices essentially measure the Fourier transform magnitude of the object being imaged. However, structural content about the image is contained in the phase, so this important information is lost. The problem of reconstructing a signal from its Fourier magnitude is known as Fourier phase retrieval. This problem arises in many areas of engineering and applied physics, including optics ([57], [36]), X-ray crystallography ([66], [121], [103], [24]), astronomical imaging ([116], [94], [106]), speech processing ([91], [42]), and computational biology, just to name a few.

Figure 1.2 [93] In X-ray crystallography, the goal is to gain an image of the positions of atoms within a molecule by illuminating a crystallized sample with X-rays. The molecular structure is then deduced from the pattern of the radiation diffracted by the sample.

1.2 PHASE RETRIEVAL PROBLEM

To mathematically solve the phase retrieval problem, we first focus on the discretized one-dimensional setting, which can then be generalized.

Definition 1.2.1. (Classical Phase Retrieval Problem) Let $\mathbf{x} \in \mathbb{C}^d$ be the underlying signal we wish to recover. In Fourier phase retrieval, the measurements are given by

$$y_k = \left| \sum_{n=0}^{d-1} x_n e^{-2\pi i k n / N} \right|^2, \quad k \in [N]_0, \quad N = 2d - 1 \quad (1.1)$$

Here we are over-sampling by roughly a factor of two, collecting a number of measurements equal to about twice the length of the signal. Our goal is to recover $\mathbf{x}$.

There are many challenges involved in solving the phase retrieval problem, as discussed in [7]. First among these is the fact that the true signal $\mathbf{x} \in \mathbb{C}^d$ cannot be recovered uniquely. For instance, a rotation, translation, or conjugate reflection does not modify the Fourier magnitudes. Without additional constraints, the unknown signal will only be determined up to what are called classical ambiguities or unavoidable trivial ambiguities, which may not be of concern depending on the application.

There are also non-trivial ambiguities for the classical phase retrieval problem. For example,

$$\mathbf{x}_1 = (1, 0, -2, 0, -2)^T, \quad \mathbf{x}_2 = ((1 - \sqrt{3}), 0, 1, 0, (1 + \sqrt{3}))^T \quad (1.2)$$

yield the same Fourier magnitudes $y_k$. We wish to categorize the number of non-trivial solutions by exploring the relationship between the Fourier magnitudes and the autocorrelation measurements.

Definition 1.2.2. Let $\mathbf{x} \in \mathbb{C}^d$ be the underlying signal with $supp(\mathbf{x}) \subseteq [d]_0$. We define the autocorrelation measurements by

$$a_m = \sum_{n=0}^{d-1} \bar{x}_n x_{n+m}, \quad -d + 1 \leq m \leq d - 1 \quad (1.3)$$

We consider the product of the polynomial $X(z) = \sum_{n=0}^{d-1} x_n z^n$ and the reversed polynomial $\tilde{X}(z) = z^{d-1} \bar{X}(z^{-1})$, where $\bar{X}$ denotes the polynomial with conjugate coefficients. Assuming that $x_0, x_{d-1} \neq 0$, we have that

$$X(z)\tilde{X}(z) = \sum_{n=0}^{d-1} x_n z^n \cdot z^{d-1} \sum_{\ell=0}^{d-1} \bar{x}_\ell z^{-\ell} = \sum_{n=0}^{2d-2} a_{n-d+1} z^n =: A(z) \quad (1.4)$$

where $A(z)$ is the autocorrelation polynomial of degree $2d - 2$. We can then rewrite the Fourier magnitude measurements as

$$y_k = e^{2\pi i k(d-1)/N}\, X(e^{-2\pi i k/N})\, \tilde{X}(e^{-2\pi i k/N}) = e^{2\pi i k(d-1)/N} A(e^{-2\pi i k/N}) \quad (1.5)$$

so that the autocorrelation polynomial is completely determined by the $2d - 1$ samples $y_k$. The phase retrieval problem is thus equivalent to the recovery of $X(z)$ from $A(z) = X(z)\tilde{X}(z)$. Comparing the roots of $X(z)$ and $\tilde{X}(z)$, we note that the roots of $A(z)$ occur in reflected pairs $(\gamma_j, \bar{\gamma}_j^{-1})$ with respect to the unit circle. The main problem in the recovery of $X(z)$ is deciding whether $\gamma_j$ or $\bar{\gamma}_j^{-1}$ is a root of $X(z)$. In [6], this approach is used to show that the number of non-trivial solutions is therefore bounded by $2^{d-2}$.
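The non-trivial ambiguity in (1.2) is easily checked numerically. The following NumPy sketch computes the oversampled measurements (1.1) for both vectors via a zero-padded FFT and confirms that they agree to machine precision, even though $\mathbf{x}_1$ and $\mathbf{x}_2$ are not trivially related.

```python
import numpy as np

d = 5
N = 2 * d - 1                        # oversampled grid, N = 2d - 1
x1 = np.array([1, 0, -2, 0, -2], dtype=complex)
x2 = np.array([1 - np.sqrt(3), 0, 1, 0, 1 + np.sqrt(3)], dtype=complex)

# y_k = |sum_n x_n e^{-2 pi i k n / N}|^2 for k in [N]_0, via zero-padded FFT.
y1 = np.abs(np.fft.fft(x1, n=N)) ** 2
y2 = np.abs(np.fft.fft(x2, n=N)) ** 2

print(np.max(np.abs(y1 - y2)))       # ~1e-14: identical magnitude measurements
```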
1.2.1 PHASE RETRIEVAL USING MASKS

Figure 1.3 [12] An illustration of a masked phase retrieval setup.

We can adapt the phase retrieval problem by using masks to eliminate some of the trivial and non-trivial ambiguities. There are several methods of applying this, some of which are listed below:

(i) Masking: The phase front after the sample is modified by the use of a mask or a phase plate;
(ii) Diffraction grating: The illuminating beam is modulated by the use of optical gratings;
(iii) Oblique illuminations: The illuminating beam is modulated to hit the sample at specific angles.

Figure 1.4 On the left ([87]) is an example of oblique illuminations, and on the right ([74]) is a diffraction grating which displaces the angle of the beam.

The main area of research involving masked phase retrieval is split into two sectors, looking at random masks or deterministic masks (i.e., masks that have been specifically chosen).

Definition 1.2.3. (Phase Retrieval Using Masks) Let $\mathbf{x} \in \mathbb{C}^d$ denote the signal and let $\{\mathbf{m}_\ell \mid \mathbf{m}_\ell \in \mathbb{C}^d,\ \ell \in [L]_0\}$, $L \geq 2$, denote the collection of masks. The masked phase retrieval measurements will be of the form

$$y_{\ell,k} = \left| \sum_{n=0}^{d-1} x_n [\mathbf{m}_\ell]_n e^{-2\pi i n k / d} \right|^2, \quad k \in [d]_0, \quad \ell \in [L]_0. \quad (1.6)$$

The goal is then to recover $\mathbf{x}$.

A further avenue of research is blind phase retrieval ([98], [1], [16], [17]), in which both the signal and mask are unknown, although some partial information may be known. In Chapters 2 and 3, we look at types of masked phase retrieval, with Chapter 3 in particular looking at the blind variant. In both situations, the set of masks is generated by a shift operator, which we will discuss more in the next section.

There is a constraint on the type of signal that we can recover.

Definition 1.2.4. A signal $\mathbf{x}$ is said to be non-vanishing if $x_n \neq 0$ for each $n \in [d]$.

In [55], deterministic masks are considered instead of random masks, and it is shown that two masks are sufficient for the convex relaxation of the problem to uniquely recover non-vanishing signals up to a global phase when over-sampled by a factor of two.

1.2.2 PHASELIFT

In [12], the classical phase retrieval problem is reformulated as a matrix completion problem. First, we need to define the Fourier transform of a vector.

Definition 1.2.5. The Fourier transform of $\mathbf{x} \in \mathbb{C}^d$, denoted $\hat{\mathbf{x}} \in \mathbb{C}^d$, is defined component-wise via

$$\hat{x}_k := (F_d \mathbf{x})_k = \sum_{n=0}^{d-1} x_n e^{-2\pi i n k / d} \quad (1.7)$$

where $F_d \in \mathbb{C}^{d \times d}$ denotes the $d \times d$ discrete Fourier transform (DFT) matrix with entries

$$(F_d)_{\ell,k} = e^{-2\pi i \ell k / d}, \quad \forall (\ell, k) \in [d]_0 \times [d]_0 \quad (1.8)$$

Remark 1.2.1. With this definition, we can rewrite the masked phase retrieval problem as

$$y_{\ell,k} = |(F_d(\mathbf{x} \circ \mathbf{m}_\ell))_k|^2, \quad k \in [d]_0, \quad \ell \in [L]_0. \quad (1.9)$$

Let $\mathbf{X} = \mathbf{x}\mathbf{x}^*$. For $\ell \in [L]_0$, let $\mathbf{D}_\ell$ be the diagonal matrix with the mask $\mathbf{m}_\ell$ on the diagonal, and let $\mathbf{f}_k^*$ be the rows of the DFT matrix. We then have that the measurements in (1.6) can be written as

$$y_{\ell,k} = |\mathbf{f}_k^* \mathbf{D}_\ell^* \mathbf{x}|^2, \quad k \in [d]_0, \quad \ell \in [L]_0. \quad (1.10)$$

Let $\mathcal{A} : \mathcal{S}^{d \times d} \rightarrow \mathbb{R}^{dL}$ denote the linear operator with entries given by

$$\{\mathcal{A}(\mathbf{X})\}_{\ell,k} = \mathrm{tr}(\mathbf{x}^* \mathbf{D}_\ell \mathbf{f}_k \mathbf{f}_k^* \mathbf{D}_\ell^* \mathbf{x}) = \mathrm{tr}(\mathbf{D}_\ell \mathbf{f}_k \mathbf{f}_k^* \mathbf{D}_\ell^* \mathbf{X}), \quad (1.11)$$

where $\mathcal{S}^{d \times d}$ is the space of self-adjoint matrices. Then the phase retrieval problem can be formulated as

$$\text{Find } \mathbf{X}, \quad \text{subject to } \mathcal{A}(\mathbf{X}) = \mathbf{y},\ \mathbf{X} \succeq 0,\ \mathrm{rank}(\mathbf{X}) = 1. \quad (1.12)$$
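As a quick, hedged illustration of the lifting step (the variable names below are mine), the following sketch checks the identity behind (1.10)-(1.11): each magnitude-square measurement is a linear function of the lifted matrix $\mathbf{X} = \mathbf{x}\mathbf{x}^*$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 8, 3
x = rng.normal(size=d) + 1j * rng.normal(size=d)
masks = rng.normal(size=(L, d)) + 1j * rng.normal(size=(L, d))

F = np.exp(-2j * np.pi * np.outer(np.arange(d), np.arange(d)) / d)  # DFT matrix F_d
X = np.outer(x, x.conj())                                           # lifted matrix x x^*

for l in range(L):
    D = np.diag(masks[l])
    for k in range(d):
        fk = F[k].conj()                 # f_k, since f_k^* is the k-th row of F_d
        y_vec = np.abs(fk.conj() @ (D.conj().T @ x)) ** 2                     # (1.10)
        y_lift = np.trace(D @ np.outer(fk, fk.conj()) @ D.conj().T @ X).real  # (1.11)
        assert np.isclose(y_vec, y_lift)  # the measurements are linear in X
```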
When the measurements in (1.6) are injective, this is equivalent to

$$\text{minimize } \mathrm{rank}(\mathbf{X}), \quad \text{subject to } \mathcal{A}(\mathbf{X}) = \mathbf{y},\ \mathbf{X} \succeq 0 \quad (1.13)$$

Due to the complexity of solving this problem, PhaseLift (Section 2.3, [12]) solves the convex surrogate, giving the semi-definite program

$$\text{minimize } \mathrm{Trace}(\mathbf{X}), \quad \text{subject to } \mathcal{A}(\mathbf{X}) = \mathbf{y},\ \mathbf{X} \succeq 0 \quad (1.14)$$

The result will follow from looking at random masks, where the diagonal matrices $\mathbf{D}_\ell$ for $\ell \in [L]_0$ are i.i.d. copies of a matrix $\mathbf{D}$, whose entries are i.i.d. copies of a random variable $b$. These are known as coded diffraction patterns. It is shown in [13] that the solution to the convex relaxation is exact, with high probability, provided that we have sufficiently many coded diffraction patterns. It is further shown in the theorem below that the feasible set of solutions is given by

$$\{\mathbf{X} : \mathbf{X} \succeq 0,\ \mathcal{A}(\mathbf{X}) = \mathbf{y}\} = \{\mathbf{x}\mathbf{x}^*\} \quad (1.15)$$

Before we get to the result, we need a restriction on the random variable $b$ which will allow us to recover $\mathbf{x}$.

Definition 1.2.6. We say that $b$ is admissible if (i) $b$ is symmetric; (ii) $|b| \leq M$ for some fixed $M > 0$; (iii) $\mathbb{E}b = \mathbb{E}b^2 = 0$; (iv) $\mathbb{E}|b|^4 = 2\,\mathbb{E}|b|^2$.

We can now state the theorem for recovering $\mathbf{x}$.

Theorem 1.2.1. (Theorem 1.1, [13]) Suppose that the modulation is admissible (i.e., the random masks are generated from an admissible random variable) and that the number $L$ of coded diffraction patterns obeys $L \geq c\gamma \log^4 d$ for some fixed numerical constant $c$. Then with probability at least $1 - d^{-\gamma}$, the set of solutions to the convex relaxation reduces to $\{\mathbf{x}\mathbf{x}^*\}$, and we thus recover $\mathbf{x}$ up to a global phase.

In practice, physically generating these random masks with known entries is a hard task. In many settings, deterministic masks are more practical and see more use in real-world situations. The next section deals with such masks, in particular taking one mask and shifting it to generate a set of masks.

1.2.3 STFT PHASE RETRIEVAL

STFT phase retrieval is similar to masked phase retrieval in that it partially blocks the signal measurements so that one can attempt to recover the signal at a more local level. The key idea is to introduce redundancy in the magnitude-only measurements by maintaining a substantial overlap between adjacent short-time sections of shifted masked measurements. In effect, we are taking a window and shifting it over time. This could involve physically shifting the specimen/window itself, or one could shift the beam to focus on a different local area of the specimen. First, we define the shift of a vector.

Definition 1.2.7. Given $\ell \in [d]_0$, denote the circulant shift operator $S_\ell : \mathbb{C}^d \rightarrow \mathbb{C}^d$ component-wise via

$$(S_\ell \mathbf{x})_n = x_{(\ell+n) \bmod d}, \quad \forall n \in [d]_0 \quad (1.16)$$

Now we can introduce the STFT phase retrieval problem.

Definition 1.2.8. (STFT Phase Retrieval Problem) Let $\mathbf{x} \in \mathbb{C}^d$ and $\mathbf{w} \in \mathbb{C}^d$ denote the signal and window, respectively. Let $\mathbf{m}_\ell = S_\ell \mathbf{w}$ denote the $\ell$-shift of the window for $\ell \in [L]_0$. The STFT magnitude measurements will be of the form of the masked measurements from (1.6), where each of the masks is a shift of the original mask or window. Our goal is to recover $\mathbf{x}$.

In a similar manner as before, we say that a window $\mathbf{w}$ is non-vanishing if $w_n \neq 0$ for each $n \in [d]$.
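A hedged sketch of the resulting measurement model (all names below are mine): the masks are circular shifts of one locally supported window, producing redundant, overlapping magnitude-only data.

```python
import numpy as np

rng = np.random.default_rng(1)
d, delta, L = 16, 4, 16
x = rng.normal(size=d) + 1j * rng.normal(size=d)

w = np.zeros(d, dtype=complex)               # window supported on its first delta entries
w[:delta] = rng.normal(size=delta) + 1j * rng.normal(size=delta)

Y = np.empty((L, d))
for l in range(L):
    m_l = np.roll(w, -l)                     # m_l = S_l w, with (S_l w)_n = w_{(n+l) mod d}
    Y[l] = np.abs(np.fft.fft(x * m_l)) ** 2  # measurements (1.6) for the l-th shift
```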
In [56], it is shown that, up to a set of measure zero, non-vanishing signals are uniquely identifiable from their STFT magnitude measurements, up to a global phase, if the support of the signal is contained inside the support of the window, and as long as adjacent short-time sections overlap by any amount. In other research ([82], [28]), it is shown that all non-vanishing signals are uniquely identifiable from their STFT magnitude measurements, up to a global phase, for specific choices of $\mathbf{w}$ and $L$. In Chapters 2 and 3, we will explore a couple of ptychographic phase retrieval problems, which can be modeled as STFT phase retrieval problems.

1.2.4 NOISE

In phase retrieval, noise refers to how a signal can be modified in a way that alters the final result. This occurs at all stages of the system, from capture and storage to processing and transmission. Any real-world system is affected by some level of noise. In analog photography or video capture, noise can come in the form of film grain, which is caused by the developing process of silver halide crystals dispersed in photographic emulsion ([39], [86]). In digital photography, noise can come in the form of compression artifacts that occur when the file is compressed to reduce file size ([23]). Background noise is a common occurrence in audio capture. In astronomy, it may result from cosmic background radiation, which is a faint glow of light occurring as a result of remnants from the Big Bang ([120], [8], [102]).

Figure 1.5 [97] Example of an image (left) in which a replication of film grain has been digitally added (right).

Additive White Gaussian Noise (AWGN), or simply additive noise, is the basic noise model which will be applied in Chapters 2 and 3. This model assumes that, in all our phase retrieval models, unknown additive Gaussian noise forms part of the collected measurements, i.e.,

$$\mathbf{Y} = \mathbf{X} + \mathbf{N} \quad (1.17)$$

where $\mathbf{X}$ are the "true" measurements and $\mathbf{N}$ is the additive Gaussian noise. Both $\mathbf{X}$ and $\mathbf{N}$ are then assumed to be unknown, with $\mathbf{Y}$ being the known collected measurements. For example, in the masked phase retrieval model, we would have that

$$y_{\ell,k} = |(F_d(\mathbf{x} \circ \mathbf{m}_\ell))_k|^2 + n_{\ell,k}, \quad k \in [d]_0, \quad \ell \in [L]_0, \quad (1.18)$$

where $\mathbf{N} \in \mathbb{C}^{d \times L}$ has complex Gaussian entries. Although the noise is unknown, what can be modelled is the recovery of the signal against varying levels of noise, with the noise level being measured relative to the signal. This can be measured via the signal-to-noise ratio (SNR), which is the ratio of the power of the signal, $P_s$, to the power of the noise, $P_n$. That is, $SNR = \frac{P_s}{P_n}$. Thus, the higher the signal-to-noise ratio, the smaller the presence of noise relative to the signal. Typically, it is measured in decibels using a base 10 logarithm, that is,

$$SNR_{dB} := 10 \log_{10}(SNR) = 10 \log_{10}\left(\frac{P_s}{P_n}\right). \quad (1.19)$$

Figure 1.6 [97] Example of an image with varying levels of Gaussian noise applied.
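In the simulations of Chapters 2 and 3, noise is generated at a prescribed SNR. A minimal sketch of that procedure (assuming real-valued intensity measurements for simplicity; the helper name is mine):

```python
import numpy as np

def add_awgn(Y, snr_db, rng):
    """Add white Gaussian noise N so that 10*log10(P_s / P_n) is (about) snr_db."""
    p_signal = np.mean(np.abs(Y) ** 2)               # empirical signal power P_s
    p_noise = p_signal / (10 ** (snr_db / 10))       # target noise power P_n
    N = rng.normal(scale=np.sqrt(p_noise), size=Y.shape)
    return Y + N, N

rng = np.random.default_rng(2)
Y = rng.random((8, 8))                               # stand-in "true" measurements
Y_noisy, N = add_awgn(Y, snr_db=40.0, rng=rng)
print(10 * np.log10(np.mean(Y**2) / np.mean(N**2)))  # realized SNR, close to 40 dB
```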
1.2.5 CONDITION NUMBER OF A MATRIX

In Chapter 2, we demonstrate a method for solving a phase retrieval problem that involves rewriting the measurements as a matrix multiplication, after which we invert the generated matrix. However, the presence of noise can affect this approach. To measure how much the noise can affect the outcome, we require the following definition.

Definition 1.2.9. The condition number of a matrix $\mathbf{A}$ is defined by

$$\kappa = \kappa(\mathbf{A}) := \|\mathbf{A}\| \cdot \|\mathbf{A}^{-1}\| \quad (1.20)$$

where $\|\cdot\|$ is the operator norm. In particular, if $\|\cdot\|$ is the $\ell_2$ norm, then

$$\kappa(\mathbf{A}) := \frac{\sigma_{max}(\mathbf{A})}{\sigma_{min}(\mathbf{A})} \quad (1.21)$$

where $\sigma_{max}(\mathbf{A})$ and $\sigma_{min}(\mathbf{A})$ are the maximal and minimal singular values, respectively.

A matrix with a low condition number is said to be well-conditioned, while a matrix with a high condition number is said to be ill-conditioned. We apply these same definitions to the problem or system involving the matrices, i.e., a system is well-conditioned if the resultant matrix is well-conditioned. Informally, the condition number measures how close a matrix is to being non-invertible. In practice, it measures the effect of perturbations. In our context, this perturbation will be the additive noise. To see where the condition number comes into play, suppose we have the system $\mathbf{y} = \mathbf{A}\mathbf{x}$ which has been affected by noise such that

$$\mathbf{y} + \delta\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{A}\delta\mathbf{x} = \mathbf{A}(\mathbf{x} + \delta\mathbf{x}) \quad (1.22)$$

where $\delta\mathbf{y}$ is the noise, which is relatively small compared to $\mathbf{y}$ (that is, it has a relatively large $SNR = \frac{\|\mathbf{y}\|}{\|\delta\mathbf{y}\|}$). Then we have that

$$\|\mathbf{y}\| = \|\mathbf{A}\mathbf{x}\| \leq \|\mathbf{A}\| \cdot \|\mathbf{x}\| \;\Rightarrow\; \frac{1}{\|\mathbf{x}\|} \leq \frac{\|\mathbf{A}\|}{\|\mathbf{y}\|}, \quad (1.23)$$

and

$$\|\delta\mathbf{x}\| = \|\mathbf{A}^{-1}\delta\mathbf{y}\| \leq \|\mathbf{A}^{-1}\| \cdot \|\delta\mathbf{y}\|. \quad (1.24)$$

Combining these inequalities, we get that the relative error between the noise-distributed part of the signal, $\delta\mathbf{x}$, and the true signal, $\mathbf{x}$, is bounded by

$$\frac{\|\delta\mathbf{x}\|}{\|\mathbf{x}\|} \leq \|\mathbf{A}\| \cdot \|\mathbf{A}^{-1}\| \cdot \frac{\|\delta\mathbf{y}\|}{\|\mathbf{y}\|} = \frac{\kappa}{SNR}. \quad (1.25)$$

Thus our hope for a successful recovery, at least to a given margin of error, relies on the $SNR$ being relatively large and $\kappa$ being relatively small. In Chapter 2, Lemma 2.4.2, we demonstrate a choice of matrix which allows for a well-conditioned system, and thus successful recovery of the signal. This is ensured by utilizing the bound on the condition number from [54], in which the maximal singular value is bounded from above whilst the minimal singular value is bounded from below.
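The bound (1.25) can be observed directly. The following sketch solves a perturbed linear system and checks that the relative solution error is controlled by $\kappa$ times the relative data error:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 50
A = rng.normal(size=(d, d))
x = rng.normal(size=d)
y = A @ x

kappa = np.linalg.cond(A)                # sigma_max / sigma_min
dy = 1e-6 * rng.normal(size=d)           # small perturbation of the data
dx = np.linalg.solve(A, dy)              # resulting perturbation of the solution

lhs = np.linalg.norm(dx) / np.linalg.norm(x)
rhs = kappa * np.linalg.norm(dy) / np.linalg.norm(y)
print(lhs <= rhs + 1e-12, kappa)         # the bound (1.25) holds
```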
1.3 DIMENSIONALITY REDUCTION

Dimensionality reduction ([99], [25], [59], [62]) is a general data analysis tool used in several disciplines, including statistics, data mining, pattern recognition, machine learning, artificial intelligence, and optimization. Dimensionality reduction refers to the act of transforming data from a high-dimensional space to a low-dimensional space. The goal is to remove irrelevant or redundant data and to reduce the computational cost while still retaining meaningful properties of the original data. Dimension in this context can refer to many things, such as attributes, variables, features, pixels, etc. Each application utilizes different dimension reduction techniques. In pattern recognition, for example, the problem of dimensionality reduction is to extract a subset of features that recovers most of the variability of the data. In text mining, the problem is defined as selecting a subset of words or terms.

Figure 1.7 [97] Example of the effect of dimensionality reduction on a 3-dimensional spherical shell manifold. The resulting 2-dimensional embedded data is an attempt to unfold the original data.

1.3.1 K-NEAREST NEIGHBORS CLASSIFICATION

Suppose we vectorize training data in $d$ dimensions in such a way that the concept of distance between two points in $\mathbb{R}^d$ makes sense within the problem, and that the data is classified into multiple different classes. The goal of the $k$-nearest neighbors algorithm (k-NN) is to identify a new image by applying the same vectorization and classifying based on the preset classification of its $k$ nearest neighbors in $\mathbb{R}^d$. Generally, $k$ is chosen by testing and evaluating the results for different values.

Figure 1.8 [48] Example of the k-nearest neighbors algorithm being applied on embedded data. The left figure shows the data that is to be categorized. The central and right figures show the k-NN algorithm being applied with $k = 3$.

1.3.2 JOHNSON-LINDENSTRAUSS LEMMA

Algorithms performing, for example, k-nearest neighbor classification can be computationally expensive in high dimensions. Thus it is often advantageous to reduce the dimension of the data before carrying out such tasks. To ensure that applying dimensionality reduction does not highly distort the data, we will utilize the Johnson-Lindenstrauss lemma.

Lemma 1.3.1 (Johnson-Lindenstrauss Lemma). Let $\epsilon \in (0, 1)$ and $X \subset \mathbb{R}^d$ be arbitrary with $|X| = n > 1$. There exists $f : X \rightarrow \mathbb{R}^m$ with $m = \mathcal{O}(\epsilon^{-2} \log n)$ such that

$$(1 - \epsilon)\|x - y\|_2 \leq \|f(x) - f(y)\|_2 \leq (1 + \epsilon)\|x - y\|_2 \quad \forall x \in X,\ \forall y \in X. \quad (1.26)$$

Let $X$ denote the set of training data in $\mathbb{R}^d$. Then Lemma 1.3.1 states that the distances between the training data will be preserved up to a small distortion. However, in order to successfully apply, for example, k-NN methods, we would prefer a stronger guarantee: that distances from points in the training data to any point in $\mathbb{R}^d$ be approximately preserved. Work by Elkin et al. in [29] demonstrated that $y$ can be taken to be an arbitrary point in $\mathbb{R}^d$ with $m = \mathcal{O}(\log n)$ and distortion $\approx \sqrt{10}$. This embedding is called a terminal embedding, with the multiplicative factor on the right-hand side referred to as the terminal distortion. Further work demonstrated that if $m$ is sufficiently large, one may prove a result of the following type.

Theorem 1.3.1 (Lemma 1.1, [80]). Let $\epsilon \in (0, 1)$ and $X \subset \mathbb{R}^d$ be arbitrary with $|X| = n > 1$. There exists $f : \mathbb{R}^d \rightarrow \mathbb{R}^m$ with $m = \mathcal{O}(\epsilon^{-2} \log n)$ such that

$$(1 - \epsilon)\|x - y\|_2 \leq \|f(x) - f(y)\|_2 \leq (1 + \epsilon)\|x - y\|_2 \quad \forall x \in X,\ \forall y \in \mathbb{R}^d. \quad (1.27)$$

Thus if the points in $X$ are mapped to $\mathbb{R}^m$ well, which occurs with high probability, then our final terminal embedding is guaranteed to have low terminal distortion as a map from all of $\mathbb{R}^d$ to $\mathbb{R}^m$. This terminal embedding is required to be non-linear. To see this, let $X \subset \mathbb{R}^d$ be arbitrary. Suppose for contradiction that $f : \mathbb{R}^d \rightarrow \mathbb{R}^m$, $d > m$, is a linear embedding with constant terminal distortion. By the Rank-Nullity theorem, $\dim(\ker(f)) \geq d - m \geq 1$. This means $\exists y \in \ker(f) \setminus \{0\}$. Let $x \in X$ be arbitrary. Since $f$ is a linear embedding and $x - y \in \mathbb{R}^d$, the terminal distortion bounds give (suppressing the constant distortion factors)

$$0 < \|y\|_2 = \|x - (x - y)\|_2 \leq \|f(x) - f(x - y)\|_2 = \|f(x) - f(x) + f(y)\|_2 = \|f(y)\|_2 = 0.$$

Thus we have arrived at a contradiction.

In Chapter 4, we will explore dimensionality reduction of manifolds. We also numerically demonstrate a compressed classification algorithm for labelled data.
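To preview how these pieces fit together, here is a hedged sketch of compressive nearest-neighbor classification with a plain linear JL map (a Gaussian random matrix); the synthetic data, labels, and names below are mine. Chapter 4 replaces the linear map with a nonlinear terminal embedding.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 200, 1000, 40
X = rng.normal(size=(n, d))                    # training points
labels = (X[:, 0] > 0).astype(int)             # a simple two-class labeling
q = rng.normal(size=d)                         # a query point

A = rng.normal(size=(m, d)) / np.sqrt(m)       # linear JL map

def knn_label(train, lab, query, k=3):
    """Majority label among the k nearest training points to the query."""
    idx = np.argsort(np.linalg.norm(train - query, axis=1))[:k]
    return np.bincount(lab[idx]).argmax()

print(knn_label(X, labels, q, k=3))            # k-NN in the ambient space
print(knn_label(X @ A.T, labels, A @ q, k=3))  # k-NN after compression
```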
CHAPTER 2 TOWARD FAST AND PROVABLY ACCURATE NEAR-FIELD PTYCHOGRAPHIC PHASE RETRIEVAL

2.1 ABSTRACT

Ptychography is an imaging technique that involves a sample being illuminated by a coherent, localized probe of illumination. When the probe interacts with the sample, the light is diffracted and a diffraction pattern is detected. Then the sample (or probe) is shifted laterally in space to illuminate a new area of the sample whilst ensuring sufficient overlap. Similarly, in Fourier ptychography a sample is illuminated at different angles of incidence (effectively shifting the sample's Fourier transform), after which a lens acts as a low-pass filter, thereby effectively providing localized Fourier information about the sample around frequencies dictated by each angle of illumination. Mathematically, one therefore obtains a similar set of overlapping measurements of the sample in both Fourier ptychography and ptychography, except in different domains (Fourier for the former, and physical for the latter). In either case, one is then able to reconstruct an image of the sample from the measurements using similar methods. Near-Field (Fourier) Ptychography (NFP) (see, e.g., [107, 108, 125]) occurs when the sample is placed at a short defocus distance having a large Fresnel number. In this chapter, we prove that certain NFP measurements are robustly invertible (up to an unavoidable global phase ambiguity) for specific Point Spread Functions (PSFs) and physical masks which lead to well-conditioned lifted linear systems. We then apply a block phase retrieval algorithm using weighted angular synchronization and prove that the proposed approach accurately recovers the measured sample for these specific PSF and mask pairs. Finally, we also propose using a Wirtinger Flow for NFP problems and numerically evaluate that alternate approach both against our main proposed approach, as well as with NFP measurements for which our main approach does not apply.

2.2 INTRODUCTION

The task of recovering a complex signal $\mathbf{x} \in \mathbb{C}^d$ from phaseless magnitude measurements is called the phase retrieval problem. These types of problems appear in many applications such as optics [3, 119] and X-ray crystallography [9, 73]. Here, we are interested in phase retrieval problems arising from (Fourier) ptychography [96, 126]. Ptychography is an imaging technique involving a sample illuminated by a coherent and often localized probe of illumination. When the probe interacts with the sample, light is diffracted and a diffraction pattern is detected. The probe, or the sample, is then shifted laterally in space to illuminate a new area of the sample while ensuring there is sufficient overlap between each neighboring shift. The intensity of the diffraction pattern detected at position $\ell$ resulting from the $k^{th}$ shift of the probe along the sample takes the general form of

$$\tilde{Y}_{k,\ell} = |(D(S_k \mathbf{m} \circ \mathbf{x}))_\ell|^2, \quad (2.1)$$

where $\mathbf{x} \in \mathbb{C}^d$ is the sample being imaged, $\mathbf{m} \in \mathbb{C}^d$ is a mask which represents the probe's incident illumination on (a portion of) the sample, $\circ$ denotes the Hadamard (pointwise) product, $S_k$ is a shift operator, and $D : \mathbb{C}^d \rightarrow \mathbb{C}^d$ is a function that describes the diffraction of the probe radiation from the sample to the plane of the detector after possibly passing through, e.g., a lens. Similarly, Fourier ptychography ultimately results in the same type of measurements as in (2.1), except with $\mathbf{m}$ and $\mathbf{x}$ replaced by $\hat{\mathbf{m}}$ and $\hat{\mathbf{x}}$, respectively (see, e.g., [128]).
Prior work in the computational mathematics community related to (Fourier) ptychographic imaging has primarily focused on Far-Field¹ Ptychography (FFP), in which $D$ is the action of a discrete (inverse) Fourier transform matrix (see, e.g., [95, 54, 49, 31, 92, 89]) in (2.1). Here, in contrast, we consider the less well studied setting of near-field ptychography (NFP), which describes situations where, e.g., the masked sample is too close to the source/detector to be well described by the FFP model. See, e.g., [107, 108, 125] for such imaging applications as well as for more detailed related discussions. In all of these NFP applications the acquired measurements can again be written in the form of (2.1), where $D$ is now a convolution operator with a given Point Spread Function (PSF) $\mathbf{p} \in \mathbb{C}^d$.

Let $\mathbf{x} \in \mathbb{C}^d$ denote an unknown sample, $\mathbf{m} \in \mathbb{C}^d$ a known mask, and $\mathbf{p} \in \mathbb{C}^d$ a known PSF, respectively. For the remainder of this chapter we will suppose we have noisy discretized NFP measurements of the form

$$Y_{k,\ell} = Y_{k,\ell}(\mathbf{x}) := |(\mathbf{p} * (S_k \mathbf{m} \circ \mathbf{x}))_\ell|^2 + N_{k,\ell}, \quad (k, \ell) \in \mathcal{S} \subseteq [d]_0 \times [d]_0, \quad (2.2)$$

where $S_k$ is a circular shift operator $(S_k \mathbf{x})_n = x_{(n+k) \bmod d}$, $\mathbf{N} = (N_{k,\ell})$ is an additive noise matrix, and $[d]_0 := \{0, \ldots, d-1\}$. Throughout this chapter we will always index vectors and matrices modulo $d$ unless otherwise stated.

¹ Far-field versus near-field measurements are defined based on the Fresnel number of the imaging system. See, e.g., [57] for details.

2.2.1 RESULTS, CONTRIBUTIONS, AND CONTENTS

Our main theorem guarantees the existence of a PSF $\mathbf{p} \in \mathbb{C}^d$ and a locally supported mask $\mathbf{m} \in \mathbb{C}^d$ with $supp(\mathbf{m}) \subseteq [\delta]_0 := \{0, \ldots, \delta - 1\}$, $\delta \ll d$, for which the measurements (2.2) can be inverted up to a global phase factor by a computationally efficient and noise robust algorithm. In particular, we prove the following result, which we believe to be the first theoretical error guarantee for a recovery algorithm in the setting of NFP.

Theorem 2.2.1 (Inversion of NFP Measurements). Choose $\delta \in [d]_0$ such that $2\delta - 1$ divides $d$. Then, there exists a PSF $\mathbf{p} \in \mathbb{C}^d$ and a mask $\mathbf{m} \in \mathbb{C}^d$ with $supp(\mathbf{m}) \subseteq [\delta]_0$ such that Algorithm 2.1 below, when provided with input measurements (2.2), will return an estimate $\mathbf{x}_{est} \in \mathbb{C}^d$ of $\mathbf{x}$ satisfying

$$\min_{\phi \in [0, 2\pi)} \|\mathbf{x}_{est} - e^{i\phi}\mathbf{x}\|_2 \leq C\|\mathbf{x}\|_\infty \left( \frac{\sqrt{d\delta}\left(\|\mathbf{x}_{est}\|_\infty^2 + \|\mathbf{x}_{est}\|_\infty^3\right)}{|\mathbf{x}_{est}|_{min}^2} \cdot \|\mathbf{N}\|_F + \sqrt{d\delta\,\|\mathbf{N}\|_F} \right).$$

Here $C \in \mathbb{R}^+$ is an absolute constant², and $|\mathbf{x}_{est}|_{min}$ denotes the smallest magnitude of any entry in $\mathbf{x}_{est}$.

Looking at Theorem 2.2.1 we can see, e.g., that in the noiseless setting where $\|\mathbf{N}\|_F = 0$ the output $\mathbf{x}_{est}$ of Algorithm 2.1 is guaranteed to match the measured signal $\mathbf{x}$ up to a global phase factor whenever $\mathbf{x}_{est}$ has no zeros.³ Moreover, the method is also robust to small amounts of additive noise.

² In this chapter we will use $C$ to denote absolute constants which may change from line to line.
³ Note that prior work on far-field ptychography assumed that $\mathbf{x}$ itself was non-vanishing (see e.g. [54, 49]). However, requiring $\mathbf{x}_{est}$ to not vanish is more easily verifiable in practice.
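For concreteness, the measurement model (2.2) is straightforward to simulate, since the convolution with $\mathbf{p}$ can be applied via the FFT. A minimal sketch (noiseless, over the full index grid; the names are mine):

```python
import numpy as np

def nfp_measurements(x, m, p):
    """Noiseless NFP data Y_{k,l} = |(p * (S_k m o x))_l|^2 via FFT convolution."""
    d = len(x)
    Y = np.empty((d, d))
    P = np.fft.fft(p)
    for k in range(d):
        masked = np.roll(m, -k) * x            # S_k m o x
        Y[k] = np.abs(np.fft.ifft(P * np.fft.fft(masked))) ** 2
    return Y

rng = np.random.default_rng(5)
d, delta = 12, 3
x = rng.normal(size=d) + 1j * rng.normal(size=d)
m = np.zeros(d, dtype=complex)
m[:delta] = rng.normal(size=delta) + 1j * rng.normal(size=delta)
p = rng.normal(size=d) + 1j * rng.normal(size=d)
Y = nfp_measurements(x, m, p)                  # full (k, l) grid; S selects a subset
```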
The proof of Theorem 2.2.1 consists of two parts. First, in Section 2.4, we show that a specific PSF and mask choice results in NFP measurements (2.2) which are essentially equivalent to far-field ptychographic measurements (2.4) that are known to be robustly invertible by prior work [54, 49, 92]. This guarantees the existence of PSFs and masks which allow for the robust inversion of (2.2) up to a global phase. However, these prior works all prove error bounds on $\min_{\phi \in [0,2\pi)} \|\mathbf{x}_{est} - e^{i\phi}\mathbf{x}\|_2$ which scale quadratically in $d$ (see, e.g., Corollary 3 in [49] and Theorem 1 in [92]). This motivates the second part of the proof in Section 2.5, where we improve these results so that they only depend linearly on $d$. This is achieved by utilizing weighted angular synchronization error bounds from [33] which require, among other things, updated lower bounds for the second smallest eigenvalue of the unnormalized graph Laplacian of a given weighted graph obtained from $\mathbf{x}$ (derived in Section 2.5 with the help of auxiliary results proven in Appendix A.2). We also note that the improved dependence on $d$ proven in Section 2.5 for the FFP methods previously analyzed in [54, 49, 92] may be of potential independent interest.

Theorem 2.2.1 is proven for a specific $(2\delta - 1)$-periodic PSF $\mathbf{p}$ and locally supported mask $\mathbf{m}$ whose induced lifted linear measurement operator (see (2.5) – (2.7) below together with Lemma 2.4.1) is provably well conditioned. See Lemma 2.4.2 for the definition of this particular $\mathbf{p}, \mathbf{m}$ pair as well as for their measurements' related condition number bound. However, we note that Algorithm 2.1 is guaranteed to work well much more generally for any PSF and mask pair which leads to well-conditioned measurements (up to, at worst, potentially having to change the shift and frequency pairs one samples if, e.g., $\mathbf{p}$ is not periodic – see Remark 2.4.1). Indeed, inspecting the proof of Theorem 2.2.1, we see that Lemma 2.5.1 decomposes the total error of Algorithm 2.1, $\min_{\phi \in [0,2\pi)} \|\mathbf{x} - e^{i\phi}\mathbf{x}_{est}\|_2$, into terms involving the phase error, $\min_{\phi \in [0,2\pi)} \|\mathbf{x}_{est}^{(\theta)} - e^{i\phi}\mathbf{x}^{(\theta)}\|_2$, and the magnitude error, $\|\mathbf{x}^{(mag)} - \mathbf{x}_{est}^{(mag)}\|_2$. The phase error is controlled by Theorem 2.5.2 and Lemma 2.5.3. The proof of these results only depends on the choice of $\mathbf{m}$ and $\mathbf{p}$ through $\sigma_{min}(\check{\mathbf{M}}^{(p,m)})$, the minimal singular value of their induced measurement operator. Similarly, the magnitude error is controlled by Lemma 2.5.2 and Theorem 2.5.1, which also only depend on $\mathbf{m}$ and $\mathbf{p}$ through $\sigma_{min}(\check{\mathbf{M}}^{(p,m)})$. Therefore, variants of these results can be derived for any invertible measurement system. Moreover, numerical experiments demonstrate that the proposed method works well for a wide variety of non-vanishing PSF and locally supported mask pairs.

In order to be able to handle even more general PSFs $\mathbf{p}$ which do, however, e.g., vanish, in Section 2.6 we also propose a Wirtinger Flow based algorithm, Algorithm 2.2, for inverting NFP measurements (2.2). Though slower than Algorithm 2.1 and less well supported by theory for the PSF and mask pairs for which both methods work empirically, Algorithm 2.2 generally appears more flexible and, e.g., also requires fewer shifts than Algorithm 2.1 to work well in practice when a given mask is not locally supported. Similar to Algorithm 2.1, Algorithm 2.2 relies on the observation that the NFP measurements (2.2) are essentially equivalent to FFP measurements, as shown in Section 2.4.
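Algorithm 2.2 itself is developed in Section 2.6. As a hedged illustration of the underlying idea only, the following sketch runs a generic Wirtinger-Flow-style gradient descent on the intensity loss $f(\mathbf{z}) = \frac{1}{2m}\sum_i (|\langle \mathbf{a}_i, \mathbf{z}\rangle|^2 - y_i)^2$ for random measurement vectors and a warm start; the NFP version instead uses the near-field measurement maps and a spectral initialization.

```python
import numpy as np

def wirtinger_flow(A, y, z0, mu=0.2, iters=500):
    """Gradient descent on f(z) = (1/(2m)) sum_i (|<a_i, z>|^2 - y_i)^2.

    A is an m x d matrix whose rows are the measurement vectors a_i^*.
    This is a generic sketch, not the NFP-specific Algorithm 2.2.
    """
    m = len(y)
    z = z0.copy()
    for _ in range(iters):
        Az = A @ z
        grad = A.conj().T @ ((np.abs(Az) ** 2 - y) * Az) / m  # Wirtinger gradient
        z = z - (mu / np.linalg.norm(z0) ** 2) * grad
    return z

rng = np.random.default_rng(6)
d, m = 16, 128
x = rng.normal(size=d) + 1j * rng.normal(size=d)
A = (rng.normal(size=(m, d)) + 1j * rng.normal(size=(m, d))) / np.sqrt(2)
y = np.abs(A @ x) ** 2
z0 = x + 0.1 * (rng.normal(size=d) + 1j * rng.normal(size=d))  # warm start
z = wirtinger_flow(A, y, z0)
phase = np.exp(-1j * np.angle(np.vdot(x, z)))   # align the global phase
print(np.linalg.norm(phase * z - x) / np.linalg.norm(x))  # relative error
```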
In Section 2.7, we evaluate Algorithm 2.1 and Algorithm 2.2 numerically, both individually and in comparison to one another in the case of locally supported masks. Finally, in Section 2.9, we conclude with a brief discussion of future work.

2.3 PRELIMINARIES: PRIOR RESULTS FOR FAR-FIELD PTYCHOGRAPHY USING LOCAL MEASUREMENTS

Table 2.1 Notational Reference Table

| Notation | Definition | Notes |
|---|---|---|
| $[n]_0$ | $[n]_0 = \{0, 1, 2, \ldots, n-1\}$ | Zero indexing |
| $(\mathbf{x})_n$ | $\mathbf{x} \in \mathbb{C}^d$, $(\mathbf{x})_n = x_{n \bmod d}$ | Vector circular indexing |
| $(\mathbf{A})_{i,j}$ | $\mathbf{A} \in \mathbb{C}^{m \times n}$, $(\mathbf{A})_{i,j} = A_{i \bmod m,\ j \bmod n}$ | Matrix circular indexing |
| $\langle \mathbf{x}, \mathbf{y} \rangle$ | $\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{n=0}^{d-1} x_n \bar{y}_n = \mathbf{y}^* \mathbf{x}$ | Complex inner product |
| $supp(\mathbf{x})$ | $supp(\mathbf{x}) = \{n \in [d]_0 \mid x_n \neq 0\}$ | Support |
| $F_d$ | $(F_d)_{j,k} = e^{-2\pi i jk/d}$, $\forall (j,k) \in [d]_0 \times [d]_0$ | Discrete Fourier transform matrix |
| $\hat{\mathbf{x}}$ | $\hat{x}_n = (F_d \mathbf{x})_n = \sum_{k=0}^{d-1} x_k e^{-2\pi i nk/d}$ | Discrete Fourier transform |
| $F_d^{-1}\mathbf{x}$ | $(F_d^{-1}\mathbf{x})_n = \frac{1}{d}\sum_{k=0}^{d-1} x_k e^{2\pi i kn/d}$ | Discrete inverse Fourier transform |
| $S_k(\mathbf{x})$ | $(S_k \mathbf{x})_n = x_{(n+k) \bmod d}$, $\forall n \in [d]_0$ | Circular shift |
| $\tilde{\mathbf{x}}$ | $\tilde{x}_n = x_{-n \bmod d}$, $\forall n \in [d]_0$ | Reversal |
| $\mathbf{x} * \mathbf{y}$ | $(\mathbf{x} * \mathbf{y})_n = \sum_{k=0}^{d-1} x_k y_{n-k}$ | Circular convolution |
| $\mathbf{x} \circ \mathbf{y}$ | $(\mathbf{x} \circ \mathbf{y})_n = x_n y_n$ | Hadamard (pointwise) product |

Our method, described in Algorithm 2.1, is based on relating the near-field ptychographic measurements (2.2) to far-field ptychographic measurements of the form

$$\tilde{Y}_{k,\ell} = \tilde{Y}_{k,\ell}(\mathbf{x}) := \left| \sum_{n=0}^{d-1} m'_n x_{n+k} e^{-2\pi i \ell n / d} \right|^2 + N_{k,\ell}, \quad (2.3)$$

where $\mathbf{m}'$ is a compactly supported mask. If we let $(\check{\mathbf{m}}_\ell)_n := m'_n e^{-2\pi i \ell n / d}$, then these measurements can be written as

$$\tilde{Y}_{k,\ell} = |\langle \check{\mathbf{m}}_\ell, S_k \mathbf{x} \rangle|^2 + N_{k,\ell}, \quad (2.4)$$

where, as above, $S_k$ denotes a circular shift of length $k$, i.e., $(S_k \mathbf{x})_n = x_{(n+k) \bmod d}$. In [54], phase retrieval measurements of this form are studied when $\mathbf{m}'$ is supported in an interval of length $\delta$ for some $\delta \ll d$. The fast phase retrieval (fpr) method used there relies on using a lifted linear system involving a block-circulant matrix to recover a portion of the autocorrelation matrix $\mathbf{x}\mathbf{x}^*$. Specifically, letting $D := d(2\delta - 1)$, the authors define a block-circulant matrix $\check{\mathbf{M}} \in \mathbb{C}^{D \times D}$ by

$$\check{\mathbf{M}} := \begin{pmatrix} \check{\mathbf{M}}_0 & \check{\mathbf{M}}_1 & \cdots & \check{\mathbf{M}}_{\delta-1} & 0 & 0 & \cdots & 0 \\ 0 & \check{\mathbf{M}}_0 & \check{\mathbf{M}}_1 & \cdots & \check{\mathbf{M}}_{\delta-1} & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\ \check{\mathbf{M}}_1 & \cdots & \check{\mathbf{M}}_{\delta-1} & 0 & 0 & 0 & \cdots & \check{\mathbf{M}}_0 \end{pmatrix} \quad (2.5)$$

where the matrices $\check{\mathbf{M}}_k \in \mathbb{C}^{(2\delta-1) \times (2\delta-1)}$ are defined entry-wise by

$$(\check{\mathbf{M}}_k)_{\ell j} := \begin{cases} (\check{\mathbf{m}}_\ell)_k \overline{(\check{\mathbf{m}}_\ell)_{j+k}}, & 0 \leq j \leq \delta - k, \\ (\check{\mathbf{m}}_\ell)_k \overline{(\check{\mathbf{m}}_\ell)_{j+k-2\delta+1}}, & 2\delta - 1 - k \leq j \leq 2\delta - 2 \text{ and } k < \delta, \\ 0, & \text{otherwise.} \end{cases} \quad (2.6)$$

Letting $\mathbf{z} \in \mathbb{C}^D$ be a vector obtained by subsampling appropriate entries of $vec(\mathbf{x}\mathbf{x}^*)$, the authors show that, in the noiseless setting,

$$vec(\tilde{\mathbf{Y}}) = \check{\mathbf{M}}\mathbf{z}, \quad \tilde{\mathbf{Y}} \in \mathbb{C}^{d \times (2\delta-1)}. \quad (2.7)$$

(See Equation (9) of [54] for explicit details on the arrangement of the entries.)
For properly chosen $\mathbf{m}$, the matrix $\check{\mathbf{M}}$ is invertible, and therefore one may solve for $\mathbf{z}$ by multiplying by $\check{\mathbf{M}}^{-1}$, i.e., $\mathbf{z} = \check{\mathbf{M}}^{-1} vec(\tilde{\mathbf{Y}})$. Then, one may reshape $\mathbf{z}$ to recover a $d \times d$ matrix $\hat{\mathbf{X}}$ whose non-zero entries are estimates of the autocorrelation matrix $\mathbf{x}\mathbf{x}^*$. One may then obtain a vector $\mathbf{x}_{est}$ which approximates $\mathbf{x}$ by an angular synchronization procedure, such as the eigenvector-based method which we will discuss in Section 2.3.1.

In [54], it is shown that the exponential masks $\check{\mathbf{m}}_\ell^{(fpr)}$ defined by

$$(\check{\mathbf{m}}_\ell^{(fpr)})_n = \begin{cases} \dfrac{e^{-(n+1)/a}}{\sqrt[4]{2\delta - 1}} \cdot e^{\frac{2\pi i n \ell}{2\delta - 1}}, & n \in [\delta]_0 \\ 0, & \text{otherwise} \end{cases}, \quad a := \max\left\{4, \frac{\delta - 1}{2}\right\}, \quad (2.8)$$

lead to a lifted linear system which is well-conditioned, and thus to provable recovery guarantees for the method described above. In particular, we may obtain the following upper bound on the condition number of the block-circulant matrix $\check{\mathbf{M}}^{(fpr)}$ obtained when one sets $\check{\mathbf{m}}_\ell = \check{\mathbf{m}}_\ell^{(fpr)}$.

Theorem 2.3.1 (Theorem 4 and Equation (33) in [54]). The condition number of $\check{\mathbf{M}}^{(fpr)}$, the matrix obtained by setting $\check{\mathbf{m}}_\ell = \check{\mathbf{m}}_\ell^{(fpr)}$ in (2.6), may be bounded by

$$\kappa\left(\check{\mathbf{M}}^{(fpr)}\right) < \max\left\{144e^2, \frac{9e^2(\delta-1)^2}{4}\right\} \leq C\delta^2, \quad C \in \mathbb{R}^+.$$

Furthermore, $\check{\mathbf{M}}^{(fpr)}$ can be inverted in $\mathcal{O}(\delta \cdot d \log d)$-time, and its smallest singular value $\sigma_{min}\left(\check{\mathbf{M}}^{(fpr)}\right)$ is bounded from below by $C/\delta$.

2.3.1 ANGULAR SYNCHRONIZATION

Inverting $\check{\mathbf{M}}$ as described in the previous subsection allows one to obtain a portion of the autocorrelation matrix $\mathbf{x}\mathbf{x}^*$. This motivates us to consider angular synchronization: the process of recovering a vector $\mathbf{x}$ from (a portion of) its autocorrelation matrix $\mathbf{x}\mathbf{x}^*$ (or an estimate $\hat{\mathbf{X}}$). One popular approach, which we discuss below, is based upon first entry-wise normalizing this matrix and then taking the lead eigenvector. Specifically, we define a truncated autocorrelation matrix $\mathbf{X}$ corresponding to the true signal $\mathbf{x}$ by

$$X_{j,k} = \begin{cases} x_j \bar{x}_k, & |j - k| \bmod d < \delta \\ 0, & \text{otherwise.} \end{cases} \quad (2.9)$$

We also define a truncated autocorrelation matrix $\hat{\mathbf{X}}$ corresponding to our estimate, $\mathbf{x}_{est}$, given by

$$\hat{X}_{j,k} = \begin{cases} (x_{est})_j \overline{(x_{est})_k}, & |j - k| \bmod d < \delta \\ 0, & \text{otherwise.} \end{cases} \quad (2.10)$$

The method from [54] is based upon first solving for $\hat{\mathbf{X}}$ and then solving for $\mathbf{x}_{est}$. If $\hat{\mathbf{X}}$ is a good approximation of $\mathbf{X}$, then the results proved in [118] show that $\mathbf{x}_{est}$ will be a good approximation of $\mathbf{x}$. Moving forward, prior works [54, 118] effectively decomposed $\mathbf{X} = \mathbf{X}^{(\theta)} \circ \mathbf{X}^{(mag)}$ into its phase and magnitude matrices by setting $X^{(mag)}_{j,k} = |X_{j,k}|$ and $X^{(\theta)}_{j,k} = X_{j,k}/|X_{j,k}|$ if $|X_{j,k}| \neq 0$, with $X^{(\theta)}_{j,k} = 0$ otherwise. One may then write $\hat{\mathbf{X}} = \hat{\mathbf{X}}^{(\theta)} \circ \hat{\mathbf{X}}^{(mag)}$. Note that by construction, if $\mathbf{x}$ is nonvanishing, then we have $|X^{(\theta)}_{j,k}| = 1$ and $X^{(mag)}_{j,k} > 0$ whenever $|j - k| \bmod d < \delta$. Letting $\mathbf{u} \in \mathbb{C}^d$ be the leading eigenvector of $\hat{\mathbf{X}}^{(\theta)}$ and letting $diag(\hat{\mathbf{X}}) \in \mathbb{C}^d$ be the main diagonal of $\hat{\mathbf{X}}$, the output of the resulting algorithm is then $\mathbf{x}_{est} := \sqrt{diag(\hat{\mathbf{X}})} \circ \mathbf{u}$.
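Before the worked example below, here is a hedged sketch of this eigenvector method (the names are mine), run on an exactly banded $\hat{\mathbf{X}}$ so that recovery is exact up to a global phase:

```python
import numpy as np

rng = np.random.default_rng(7)
d, delta = 8, 3
x = rng.normal(size=d) + 1j * rng.normal(size=d)

# Banded estimate of x x^*: keep entries with circular distance |j - k| < delta.
J, K = np.meshgrid(np.arange(d), np.arange(d), indexing="ij")
dist = np.minimum((J - K) % d, (K - J) % d)
Xhat = np.outer(x, x.conj()) * (dist < delta)

# Entrywise normalization to the phase matrix X^(theta).
T = np.zeros_like(Xhat)
nz = np.abs(Xhat) > 0
T[nz] = Xhat[nz] / np.abs(Xhat[nz])

# The leading eigenvector of the Hermitian matrix T carries the phases e^{i theta_n}.
_, vecs = np.linalg.eigh(T)
u = vecs[:, -1]
u = u / np.abs(u)                                  # normalize each entry to modulus one

x_est = np.sqrt(np.abs(np.diag(Xhat))) * u         # x_est = sqrt(diag(Xhat)) o u
phase = np.exp(-1j * np.angle(np.vdot(x, x_est)))  # align the global phase
print(np.linalg.norm(phase * x_est - x))           # ~0: exact recovery up to phase
```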
Example 2.3.1. Let $d = 4$, $\delta = 2$. Then $\hat{\mathbf{X}}$ defined as in (2.10) is given by

$$\hat{\mathbf{X}} = \begin{pmatrix} |(x_{est})_0|^2 & (x_{est})_0\overline{(x_{est})_1} & 0 & (x_{est})_0\overline{(x_{est})_3} \\ (x_{est})_1\overline{(x_{est})_0} & |(x_{est})_1|^2 & (x_{est})_1\overline{(x_{est})_2} & 0 \\ 0 & (x_{est})_2\overline{(x_{est})_1} & |(x_{est})_2|^2 & (x_{est})_2\overline{(x_{est})_3} \\ (x_{est})_3\overline{(x_{est})_0} & 0 & (x_{est})_3\overline{(x_{est})_2} & |(x_{est})_3|^2 \end{pmatrix}.$$

If we write $(x_{est})_n = |(x_{est})_n| e^{i\theta_n}$, then we may compute

$$\hat{\mathbf{X}}^{(\theta)} = \begin{pmatrix} 1 & e^{i(\theta_0 - \theta_1)} & 0 & e^{i(\theta_0 - \theta_3)} \\ e^{i(\theta_1 - \theta_0)} & 1 & e^{i(\theta_1 - \theta_2)} & 0 \\ 0 & e^{i(\theta_2 - \theta_1)} & 1 & e^{i(\theta_2 - \theta_3)} \\ e^{i(\theta_3 - \theta_0)} & 0 & e^{i(\theta_3 - \theta_2)} & 1 \end{pmatrix}.$$

One may verify that the lead eigenvector is $\mathbf{u} = (e^{i\theta_0}, e^{i\theta_1}, e^{i\theta_2}, e^{i\theta_3})^T$, and therefore

$$\mathbf{x}_{est} = \sqrt{diag(\hat{\mathbf{X}})} \circ \mathbf{u} = (|(x_{est})_0|e^{i\theta_0},\ |(x_{est})_1|e^{i\theta_1},\ |(x_{est})_2|e^{i\theta_2},\ |(x_{est})_3|e^{i\theta_3})^T.$$

In Section 2.5, we will discuss another, slightly more sophisticated way of estimating the phases, based on Algorithm 3 of [93], which involves taking the eigenvector corresponding to the smallest eigenvalue of an appropriately weighted graph Laplacian. Indeed, this new angular synchronization approach is what ultimately allows the NFP error bound in Theorem 2.2.1 to have improved dependence on the signal dimension $d$ over prior FFP error bounds in [54, 49, 92]. The end result will be a more accurate method for computing $\hat{\mathbf{X}}$ in (2.10) from given NFP measurements (2.2).

2.4 NEAR FROM FAR: GUARANTEED NEAR-FIELD PTYCHOGRAPHIC RECOVERY VIA FAR-FIELD RESULTS

In this section, we show how to relate the near-field ptychographic measurements (2.2) to the far-field ptychographic measurements (2.4). This will allow us to recover $\mathbf{x}$ by using methods similar to those introduced in [54]. In order to get nontrivial bounds, we will also need to prove the existence of an admissible PSF and mask pair, $\mathbf{p} \in \mathbb{C}^d$ and $\mathbf{m} \in \mathbb{C}^d$, which lead to a well-conditioned linear system in (2.7). In particular, we will present a PSF and mask pair such that the resulting block-circulant matrix, denoted $\check{\mathbf{M}}^{(p,m)}$, will have the same condition number as the matrix $\check{\mathbf{M}}^{(fpr)}$ constructed from the masks $\check{\mathbf{m}}_\ell^{(fpr)}$ defined in (2.8). Therefore, Theorem 2.3.1 will allow us to obtain convergence guarantees for Algorithm 2.1. Here, we will set the measurement index set $\mathcal{S}$ considered in (2.2) to be $\mathcal{S} = \mathcal{K} \times \mathcal{L}$ where $\mathcal{K} = [d]_0$ and $\mathcal{L} = [2\delta - 1]_0$.

The following lemma proves that we can rewrite NFP measurements from (2.2) as local FFP measurements of the form (2.4), as long as the mask $\mathbf{m}$ has local support and the PSF is periodic. It will be based upon defining masks

$$\check{\mathbf{m}}_\ell^{(p,m)} := S_\ell \tilde{\mathbf{p}} \circ \mathbf{m} \in \mathbb{C}^d, \quad \ell \in [2\delta - 1]_0, \quad (2.11)$$

where $\tilde{\mathbf{p}}$ is the reversal of $\mathbf{p}$ about its first entry modulo $d$, i.e., $\tilde{p}_n = p_{-n \bmod d}$.
Then, we may rearrange the measurements (2.2) into a matrix of FFP-type measurements (๐‘,๐‘š) ๐‘Œe๐‘˜,โ„“ B ๐‘Œโˆ’๐‘˜ ๐‘š๐‘œ๐‘‘ ๐‘‘, ๐‘˜โˆ’โ„“ ๐‘š๐‘œ๐‘‘ 2๐›ฟโˆ’1 = |โŸจmqโ„“ , ๐‘† ๐‘˜ xโŸฉ| 2 , (๐‘˜, โ„“) โˆˆ [๐‘‘]0 ร— [2๐›ฟ โˆ’ 1]0 , (2.12) (๐‘,๐‘š) where m qโ„“ is defined as in (2.11). As a consequence, recovering x is equivalent to inverting a block-circulant matrix as described in (2.5) โ€“ (2.7). Proof. By Lemma A.1.3 part 1 , Lemma A.1.2, Lemma A.1.3 part 2, and Lemma A.1.1 from Appendix A.1, we have that ๐‘Œ๐‘˜,โ„“ = |(p โˆ— (๐‘† ๐‘˜ m โ—ฆ x))โ„“ | 2 = |โŸจ๐‘†โˆ’โ„“e p, ๐‘† ๐‘˜ m โ—ฆ xโŸฉ| 2 = |โŸจ๐‘†โˆ’โ„“e p โ—ฆ ๐‘† ๐‘˜ m, xโŸฉ| 2 = |โŸจ๐‘† ๐‘˜ (๐‘†โˆ’โ„“โˆ’๐‘˜ e p โ—ฆ m), xโŸฉ| 2 = |โŸจ๐‘† ๐‘˜ (๐‘†โˆ’โ„“โˆ’๐‘˜ e p โ—ฆ m), xโŸฉ| 2 = |โŸจ๐‘† ๐‘˜ (๐‘†โˆ’โ„“โˆ’๐‘˜ ๐‘š๐‘œ๐‘‘ 2๐›ฟโˆ’1e pโ—ฆ m), xโŸฉ| 2 , where the last equality uses the fact that p is 2๐›ฟ โˆ’ 1 periodic. We may now apply Lemma A.1.3 part 3 to see that ๐‘Œ๐‘˜,โ„“ = |โŸจ๐‘† ๐‘˜ (๐‘†โˆ’โ„“โˆ’๐‘˜ ๐‘š๐‘œ๐‘‘ 2๐›ฟโˆ’1e pโ—ฆ m), xโŸฉ| 2 = |โŸจ(๐‘†โˆ’โ„“โˆ’๐‘˜ ๐‘š๐‘œ๐‘‘ 2๐›ฟโˆ’1e pโ—ฆ m), ๐‘†โˆ’๐‘˜ xโŸฉ| 2 .. Finally, 26 since m q โ„“(๐‘,๐‘š) = ๐‘†โ„“ep โ—ฆ m, we see that for all ๐‘˜ โˆˆ [๐‘‘]0 and all โ„“ โˆˆ [2๐›ฟ โˆ’ 1]0 , we have ๐‘Œe๐‘˜,โ„“ = ๐‘Œโˆ’๐‘˜ ๐‘š๐‘œ๐‘‘ ๐‘‘, ๐‘˜โˆ’โ„“ ๐‘š๐‘œ๐‘‘ 2๐›ฟโˆ’1 = |โŸจ(๐‘†โˆ’(๐‘˜โˆ’โ„“)โˆ’(โˆ’๐‘˜) ๐‘š๐‘œ๐‘‘ 2๐›ฟโˆ’1e pโ—ฆ m), ๐‘†โˆ’(โˆ’๐‘˜) xโŸฉ| 2 = |โŸจ(๐‘†โ„“ ๐‘š๐‘œ๐‘‘ 2๐›ฟโˆ’1e pโ—ฆ m), ๐‘† ๐‘˜ xโŸฉ| 2 (๐‘,๐‘š) = |โŸจm qโ„“ , ๐‘† ๐‘˜ xโŸฉ| 2 . Remark 2.4.1. If we instead change the pairs S in Lemma 2.4.1 for which we collect NFP measurements (2.2) to be S โ€ฒ := โˆช ๐‘˜ โˆˆ[๐‘‘]0 {๐‘‘ โˆ’ ๐‘˜ } ร— {๐‘˜ โˆ’ 2๐›ฟ + 2, . . . , ๐‘˜ โˆ’ 1, ๐‘˜ } mod ๐‘‘, then we may remove the assumption that p is 2๐›ฟ โˆ’ 1 periodic. In particular, for (๐‘˜, โ„“) โˆˆ S โ€ฒ one may substitute ๐‘˜ = ๐‘‘ โˆ’ ๐‘˜ โ€ฒ and โ„“โ€ฒ = ๐‘˜ โ€ฒ โˆ’ ๐‘– for some 0 โ‰ค ๐‘– โ‰ค 2๐›ฟ โˆ’ 2 to see that then (โˆ’๐‘˜ โ€ฒ โˆ’ โ„“โ€ฒ) mod ๐‘‘ = ๐‘–. Thus, since 0 โ‰ค ๐‘– โ‰ค 2๐›ฟ โˆ’ 2, (โˆ’โ„“โ€ฒ โˆ’ ๐‘˜ โ€ฒ) mod 2๐›ฟ โˆ’ 1 = (โˆ’โ„“โ€ฒ โˆ’ ๐‘˜ โ€ฒ) mod ๐‘‘, and so we may use the same calculation as above without assuming that p is 2๐›ฟ โˆ’ 1 periodic. Note, however, that S has a simple Cartesian product structure whereas S โ€ฒ does not. As a result, the entries of p โˆ— (๐‘† ๐‘˜ m โ—ฆ x) that one must sample varies based on the mask shift ๐‘˜ in the case of S โ€ฒ, potentially complicating the collection of the associated NFP measurements (2.2) in some situations. Next, in Lemma 2.4.2 below, we will show how to choose a mask m and PSF p such that q โ„“(๐‘,๐‘š) defined as in (2.11) and m m q (fpr) 2โ„“ mod 2๐›ฟโˆ’1 defined as in (2.8) will only differ by a global phase for each โ„“ โˆˆ [2๐›ฟ โˆ’ 1]0 . As a consequence, we obtain the desired result that the block-circulant matrix arising from the NFP measurements (2.2) is essentially equivalent (up to a row permutation and global phase shift) to the well-conditioned lifted linear measurement operator M q (fpr) considered in Theorem 2.3.1. Lemma 2.4.2. Let p, m โˆˆ C๐‘‘ have entries given by 2๐œ‹ i๐‘›2 2๐œ‹ i๐‘›2 ๏ฃฑ ๏ฃด ๏ฃด ๐‘’ โˆ’๐‘›+1 /๐‘Ž ยท ๐‘’ 2๐›ฟ โˆ’ 1 , ๏ฃด โˆ’ ๏ฃดโˆš ๏ฃฒ ๏ฃด 4 ๐‘› โˆˆ [๐›ฟ]0 ๐‘ ๐‘› B ๐‘’ 2๐›ฟ โˆ’ 1 , and ๐‘š๐‘› B 2๐›ฟ โˆ’ 1 , ๏ฃด ๏ฃด ๏ฃด ๏ฃด 0, ๏ฃด otherwise ๏ฃณ 27 n ๐›ฟ โˆ’ 1o where ๐‘Ž B max 4, . 
Then for all ℓ ∈ [2δ−1]_0, m̃_ℓ^(p,m) = S_ℓ p̃ ∘ m satisfies

    m̃_ℓ^(p,m) = e^{2πi ℓ²/(2δ−1)} · m̃^(fpr)_{2ℓ mod (2δ−1)},    (2.13)

where m̃_ℓ^(fpr) is defined as in (2.8). As a consequence, if we let M̃^(fpr) and M̃^(p,m) be the lifted linear measurement matrices as per (2.5) obtained by setting each m̃_ℓ in (2.6) equal to m̃_ℓ^(fpr) and m̃_ℓ^(p,m), respectively, then we will have

    M̃^(p,m) = P M̃^(fpr),    (2.14)

where P is a D × D block diagonal permutation matrix. Thus M̃^(p,m) and M̃^(fpr) have the same singular values, and

    κ(M̃^(p,m)) = κ(M̃^(fpr)) ≤ Cδ²,

where κ(·) denotes the condition number of a matrix.

Proof. Using the definition of the Hadamard product ∘, the circulant shift operator S_ℓ, and the reversal operator x ↦ x̃, we see that

    (m̃_ℓ^(p,m))_n = (S_ℓ p̃ ∘ m)_n = (S_ℓ p̃)_n m_n = p̃_{n+ℓ} m_n = p_{−n−ℓ} m_n.

Therefore, inserting the definitions of p and m above shows that for n ∈ [δ]_0,

    (m̃_ℓ^(p,m))_n = e^{2πi(n+ℓ)²/(2δ−1)} · (e^{−(n+1)/a}/(2δ−1)^{1/4}) · e^{−2πi n²/(2δ−1)}
                  = e^{2πi ℓ²/(2δ−1)} · e^{4πi nℓ/(2δ−1)} · (e^{−(n+1)/a}/(2δ−1)^{1/4})
                  = e^{2πi ℓ²/(2δ−1)} · (e^{−(n+1)/a}/(2δ−1)^{1/4}) · e^{2πi n(2ℓ)/(2δ−1)}
                  = e^{2πi ℓ²/(2δ−1)} ( m̃^(fpr)_{2ℓ mod (2δ−1)} )_n.

For n ∉ [δ]_0, we have (m̃_ℓ^(p,m))_n = e^{2πi ℓ²/(2δ−1)} ( m̃^(fpr)_{2ℓ mod (2δ−1)} )_n = 0. Thus (2.13) follows.
To prove (2.14), let M̃^(p,m) and M̃_k^(p,m) be the matrices obtained by using the masks m̃_ℓ^(p,m) in (2.5) and (2.6), and let M̃^(fpr) and M̃_k^(fpr) be the matrices obtained using m̃_ℓ^(fpr) instead. Then combining (2.13) and (2.6) implies that (M̃_k^(p,m))_{i,j} = (M̃_k^(fpr))_{2i mod (2δ−1), j}. For example, when j ∈ [δ−k+1]_0 one may check that

    (M̃_k^(p,m))_{i,j} = e^{2πi i²/(2δ−1)} ( m̃^(fpr)_{2i mod (2δ−1)} )_k · \overline{ e^{2πi i²/(2δ−1)} ( m̃^(fpr)_{2i mod (2δ−1)} )_{j+k} } = (M̃_k^(fpr))_{2i mod (2δ−1), j},    (2.15)

and one may perform similar computations in the other cases. Since each M̃_k^(p,m) and M̃_k^(fpr) has 2δ−1 rows and the mapping i ↦ 2i is a bijection on Z/(2δ−1)Z, we see that each M̃_k^(p,m) may be obtained by permuting the rows of M̃_k^(fpr) (and that the permutation does not depend on k). Therefore, there exists a block diagonal permutation matrix P such that M̃^(p,m) = P M̃^(fpr). Finally, the condition number bound for M̃^(p,m) now follows from Theorem 2.3.1 and the fact that permuting the rows of a matrix does not change its condition number or any of its singular values.  □

Lemma 2.4.1 above demonstrates how to recast NFP problems involving locally supported masks and periodic PSFs as particular types of FFP problems. Lemma 2.4.2 then provides a particular PSF and mask combination for which the resulting FFP problem can be solved by inverting a well-conditioned linear system.
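The phase relation (2.13) is easy to sanity-check numerically. The sketch below builds p and m as in Lemma 2.4.2 and compares m̃_ℓ^(p,m) against the FFP masks, whose entries are assumed here (from the computation in the proof above) to be (m̃_ℓ^(fpr))_n = e^{−(n+1)/a}(2δ−1)^{−1/4} e^{2πinℓ/(2δ−1)} for n ∈ [δ]_0; the constant a is likewise taken from the lemma.

```python
import numpy as np

d, delta = 30, 3                      # 2*delta - 1 = 5 divides d
w = 2 * delta - 1
a = max(4, (delta - 1) / 2)
n = np.arange(d)

p = np.exp(2j * np.pi * n**2 / w)     # (2*delta - 1)-periodic PSF
m = np.zeros(d, dtype=complex)
m[:delta] = np.exp(-(n[:delta] + 1) / a) / w**0.25 \
    * np.exp(-2j * np.pi * n[:delta]**2 / w)

p_rev = np.roll(p[::-1], 1)           # reversal about the first entry
for ell in range(w):
    lhs = np.roll(p_rev, -ell) * m    # S_ell p-tilde o m, with (S_k v)_n = v_{n+k}
    fpr = np.zeros(d, dtype=complex)  # assumed form of (2.8)
    fpr[:delta] = np.exp(-(n[:delta] + 1) / a) / w**0.25 \
        * np.exp(2j * np.pi * n[:delta] * ((2 * ell) % w) / w)
    assert np.allclose(lhs, np.exp(2j * np.pi * ell**2 / w) * fpr)
```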
Together they imply that, for properly chosen m and p, one may robustly invert the measurements given in (2.2) by first recasting the NFP data as modified FFP data, and then using the BlockPR approach from [54, 49, 92]. This is the main idea behind Algorithm 2.1. However, this approach alone leads to theoretical error bounds which scale quadratically in d. To remedy this, the final step of Algorithm 2.1 uses an alternative angular synchronization method (which originally appeared in [93]) based on a weighted graph Laplacian, as opposed to previous works which used methods like those outlined in Section 2.3.1. As we shall see in the next section, this will allow us to obtain bounds in Theorem 2.2.1 which depend linearly on d rather than quadratically.

Algorithm 2.1 NFP-BlockPR
Input: 1) Variables d, δ, D = d(2δ−1). 2) A 2δ−1 periodic PSF p ∈ C^d, and a mask m ∈ C^d with supp(m) ⊆ [δ]_0. 3) A near-field ptychographic measurement matrix Y ∈ C^{d×(2δ−1)}.
Output: x_est with x_est ≈ e^{iθ} x for some θ ∈ [0, 2π].
1) Form masks m̃_ℓ^(p,m) = S_ℓ p̃ ∘ m and matrix M̃^(p,m) as per (2.5) and (2.6).
2) Compute z = (M̃^(p,m))^{−1} vec(Y) ∈ C^D.
3) Reshape z to get X̂ as per Section 2.3.1, containing estimated entries of xx*.
4) Use weighted angular synchronization (Algorithm 3, [93]) to obtain x_est.

2.5 ERROR ANALYSIS FOR ALGORITHM 2.1
In this section, we will prove our main result, Theorem 2.2.1, which provides accuracy and robustness guarantees for Algorithm 2.1. For x ∈ C^d, we write its nth entry as x_n = |x_n| e^{iθ_n} and let x^(mag) := (|x_0|, …, |x_{d−1}|)^T and x^(θ) := (e^{iθ_0}, e^{iθ_1}, …, e^{iθ_{d−1}})^T, so that we may decompose x as

    x = x^(mag) ∘ x^(θ).    (2.16)

The following lemma upper bounds the total estimation error in terms of its phase and magnitude errors. For a proof, please see Appendix A.1.

Lemma 2.5.1. Let x be decomposed as in (2.16), and similarly let x_est be decomposed as x_est = x_est^(mag) ∘ x_est^(θ). Then, we have that

    min_{φ∈[0,2π)} ‖x − e^{iφ} x_est‖_2 ≤ ‖x‖_∞ min_{φ∈[0,2π)} ‖x_est^(θ) − e^{iφ} x^(θ)‖_2 + ‖x^(mag) − x_est^(mag)‖_2.    (2.17)

In light of Lemma 2.5.1, to bound the total error of our algorithm it suffices to consider the phase and magnitude errors separately. In order to bound ‖x^(mag) − x_est^(mag)‖_2, we may utilize the following lemma, which is a restatement of Lemma 3 of [54].

Lemma 2.5.2 (Lemma 3 of [54]). Let σ_min(M̃^(p,m)) denote the smallest singular value of the lifted measurement matrix M̃^(p,m) from line 1 of Algorithm 2.1. Then,

    ‖x^(mag) − x_est^(mag)‖_∞ ≤ C sqrt( ‖N‖_F / σ_min(M̃^(p,m)) ).

Having obtained Lemma 2.5.2, we are now able to prove the following theorem bounding the total estimation error.

Theorem 2.5.1. Let p and m be the admissible PSF and mask pair defined in Lemma 2.4.2. Then, we have that

    min_{φ∈[0,2π)} ‖x − e^{iφ} x_est‖_2 ≤ ‖x‖_∞ min_{φ∈[0,2π)} ‖x_est^(θ) − e^{iφ} x^(θ)‖_2 + C sqrt(dδ ‖N‖_F).

Proof. Combining Lemmas 2.5.1 and 2.5.2 along with the inequality ‖u‖_2 ≤ √d ‖u‖_∞ implies that

    min_{φ∈[0,2π)} ‖x − e^{iφ} x_est‖_2 ≤ ‖x‖_∞ min_{φ∈[0,2π)} ‖x_est^(θ) − e^{iφ} x^(θ)‖_2 + C sqrt( d ‖N‖_F / σ_min(M̃^(p,m)) ).    (2.18)
As noted in Lemma 2.4.2, the singular values of M̃^(p,m) are the same as those of M̃^(fpr). Therefore, applying Theorem 2.3.1 then finishes the proof.  □

Remark 2.5.1. Note that the inequality (2.18) in the proof of Theorem 2.5.1 holds any time m ∈ C^d satisfies supp(m) ⊆ [δ]_0 and either (i) p ∈ C^d is 2δ−1 periodic with 2δ−1 dividing d, or else (ii) one instead collects NFP measurements (2.2) at all (k,ℓ) ∈ S′ as in Remark 2.4.1. Therefore, results analogous to Theorem 2.5.1 may be produced for any such p and m pairs with σ_min(M̃^(p,m)) > 0. Furthermore, the value of this minimal singular value is straightforward to check numerically in practice.

In order to bound ‖x_est^(θ) − e^{iφ} x^(θ)‖_2, we will need a few additional definitions. As in (2.9), let X denote the partial autocorrelation matrix corresponding to the true signal x, and, as in (2.10), let X̂ denote the partial autocorrelation matrix corresponding to x_est, i.e., the matrix obtained in step 3 of Algorithm 2.1. Let G = (V, E, W) be a weighted graph whose vertices are given by V = [d]_0, whose edge set E is taken to be the set of (i,j) such that i ≠ j and |i−j| mod d < δ, and whose weight matrix W is defined entrywise by

    W_{i,j} = { |X̂_{i,j}|²,  0 < |i−j| mod d < δ;  0, otherwise }.    (2.19)

Letting A_G denote the unweighted adjacency matrix of G, we observe that by construction we have X = (I + A_G) ∘ xx* and X̂ = (I + A_G) ∘ x_est x_est*. Letting D denote the weighted degree matrix, we define the unnormalized graph Laplacian by L_G := D − W and the normalized graph Laplacian by L_N := D^{−1/2} L_G D^{−1/2}. It is well known that both L_G and L_N are positive semi-definite with a minimal eigenvalue of zero (see, e.g., Section 3.1, [104]). We will let τ_G denote the spectral gap (second smallest eigenvalue) of L_G. It is well known that if G is connected then τ_G is strictly positive (see, e.g., Lemma 3.1.1, [104]). In [33], the authors used a weighted graph approach to prove the following result, which bounds min_{φ∈[0,2π)} ‖x_est^(θ) − e^{iφ} x^(θ)‖_2.

Theorem 2.5.2 (Corollary 3, [33]). Consider the weighted graph G = (V, E, W) described in the previous paragraph with weight matrix given as in (2.19). Let τ_G denote the spectral gap of the associated unnormalized Laplacian L_G. Then we have that

    min_{φ∈[0,2π)} ‖x_est^(θ) − e^{iφ} x^(θ)‖_2 ≤ C sqrt(1 + ‖x_est‖_∞) · ‖X − X̂‖_F / √τ_G,  C ∈ R^+.

Remark 2.5.2. The sqrt(1 + ‖x_est‖_∞) term is referred to in Theorem 4 of [33] as a tightness penalty, which is incurred when relaxing the non-convex constraint to an eigenvector problem; this relaxation allows us to use the method of angular synchronization involving the weighted Laplacian given in Algorithm 3 of [93].

In order to utilize Theorem 2.5.2, we require both an upper bound on ‖X − X̂‖_F and a lower bound on the spectral gap τ_G. These are provided by the next two lemmas.
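The spectral gap τ_G that controls the phase error is simple to compute directly. The following sketch, under the assumed convention that "|i−j| mod d" denotes the circular distance min((i−j) mod d, (j−i) mod d), builds W from a (noiseless) banded X̂, forms L_G = D − W, and compares its second-smallest eigenvalue against the lower bound of Lemma 2.5.4 below.

```python
import numpy as np

rng = np.random.default_rng(1)
d, delta = 24, 3
x_est = rng.standard_normal(d) + 1j * rng.standard_normal(d)
X_hat = np.outer(x_est, np.conj(x_est))

W = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        dist = min((i - j) % d, (j - i) % d)
        if 0 < dist < delta:
            W[i, j] = np.abs(X_hat[i, j]) ** 2       # weights (2.19)

L_G = np.diag(W.sum(axis=1)) - W                      # unnormalized Laplacian
tau_G = np.linalg.eigvalsh(L_G)[1]                    # spectral gap
lower = (np.min(np.abs(x_est)) ** 4 / np.max(np.abs(x_est)) ** 2
         * 4 * (delta - 1) / d ** 2)                  # Lemma 2.5.4 bound
assert tau_G >= lower
```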
Let ๐‘ฃ๐‘’๐‘ : C๐‘‘ร—๐‘‘ โ†’ C๐ท be the vectorization operator considered in (2.7). It follows from (2.4), (2.7), and Step 2 of Algorithm 2.1, that vec(Y) = Mvec( q b and X) vec(Y โˆ’ N) = Mvec(X).q Therefore,   โˆ’1 โˆฅvec(N)โˆฅ 2 Xโˆ’X b โ‰ค q (๐‘,๐‘š) M vec(N) โ‰ค   โ‰ค ๐ถ๐›ฟโˆฅNโˆฅ ๐น , ๐น 2 ๐œŽ ๐‘š๐‘–๐‘› M q (๐‘,๐‘š) where final inequality again utilizes Lemma 2.4.2 and Theorem 2.3.1. Lemma 2.5.4. For the graph ๐บ considered in Theorem 2.5.2, we have that |xest | 4min 4(๐›ฟ โˆ’ 1) ๐œ๐บ โ‰ฅ . โˆฅxest โˆฅ 2โˆž ๐‘‘2 Proof. Letting ๐‘Šmin and ๐‘Šmax be the minimum and maximum value of any of the (nonzero) entries of W, we have that ๐‘Šmin โ‰ฅ |xest | 2min , ๐‘Šmax โ‰ค โˆฅxest โˆฅ 2โˆž , and diam(๐บ unw ) โ‰ฅ ๐‘‘/(2๐›ฟ โˆ’ 1) (where diam(๐บ unw ) is the diameter of the unweighted version of ๐บ). Therefore, by Theorem A.2.1 in Appendix A.2, we have that |xest | 4min 2 |xest | 4min 4(๐›ฟ โˆ’ 1) ๐œ๐บ โ‰ฅ โ‰ฅ . โˆฅxest โˆฅ 2โˆž (๐‘‘ โˆ’ 1)diam(๐บ) โˆฅxest โˆฅ 2โˆž ๐‘‘ 2 We shall now finally prove our main result. The Proof of Theorem 2.2.1. By Theorem 2.5.1, we have x โˆ’ ๐‘’ i๐œ™ xest x(๐œƒ) i๐œ™ (๐œƒ) โˆš๏ธ min 2 โ‰ค โˆฅxโˆฅ โˆž min est โˆ’ ๐‘’ x + ๐ถ ๐‘‘๐›ฟโˆฅNโˆฅ ๐น . ๐œ™โˆˆ[0,2๐œ‹) ๐œ™โˆˆ[0,2๐œ‹) 2 33 Combining Theorem 2.5.2 with Lemmas 2.5.3 and 2.5.4 yields x(๐œƒ) i๐œƒ (๐œƒ) โˆš๏ธ โˆฅX โˆ’ Xโˆฅ b ๐น ๐‘š๐‘–๐‘› est โˆ’ ๐‘’ x โ‰ค ๐ถ 1 + โˆฅx ๐‘’๐‘ ๐‘ก โˆฅ โˆž ยท โˆš ๐œ™โˆˆ[0,2๐œ‹) 2 ๐œ๐บ โˆš โˆš๏ธ ๐‘‘ ๐›ฟโˆฅxest โˆฅ โˆž โˆฅNโˆฅ ๐น โ‰ค ๐ถ 1 + โˆฅx ๐‘’๐‘ ๐‘ก โˆฅ โˆž ยท . |xest | 2min The result follows. 2.6 AN ALTERNATE APPROACH: NEAR-FIELD PTYCHOGRAPHY VIA WIRTINGER FLOW In the previous sections we have demonstrated a particular point spread function and mask for which NFP measurements are guaranteed to allow image reconstruction via Algorithm 2.1. However, in many real-world scenarios the particular mask and PSF combination considered above are not of the type actually used in practice. For example, in the setting considered in [125] the PSF p ideally behaves like a low-pass filter (so that, e.g., b p is supported in {๐‘˜ โˆˆ Z|โˆ’๐พ < ๐‘˜ mod ๐‘‘ < ๐พ } for some ๐พ โ‰ช ๐‘‘), and the mask ๐‘š is globally supported in [๐‘‘]0 . In contrast, the PSF considered above has its nonzero Discrete Fourier coefficients at frequencies in {๐‘˜ ๐‘‘/(2๐›ฟ โˆ’ 1)} ๐‘˜โˆˆ[2๐›ฟโˆ’1]0 (and thus its Fourier support includes large frequencies), and the mask ๐‘š has small physical support in [๐›ฟ]0 . This motivates us to explore a variant of the well known Wirtinger Flow algorithm [14] in this section. This method, Algorithm 2.2, can be applied to more general set of PSF and mask pairs than Algorithm 2.1 considered in the previous section. Suppose we have noiseless NFP measurements of the form [ ๐‘Œ๐‘˜,โ„“ = |(p โˆ— (๐‘† ๐‘˜ m โ—ฆ x))โ„“ | 2 , (๐‘˜, โ„“) โˆˆ {๐‘‘ โˆ’ ๐‘˜ } ร— {๐พ โˆ’ ๐ฟ + 1, . . . , ๐‘˜ } mod ๐‘‘, 0โ‰ค๐‘˜ โ‰ค๐พโˆ’1 where ๐พ, ๐ฟ โˆˆ [๐‘‘ + 1]0 \ {0}. Then by the same argument used in Lemma 2.4.1 (see also in Remark 2.4.1), we can manipulate the measurements above so that we have ๐‘Œe๐‘˜,โ„“ = |โŸจmq โ„“(๐‘,๐‘š) , ๐‘† ๐‘˜ xโŸฉ| 2 , (๐‘˜, โ„“) โˆˆ [๐พ]0 ร— [๐ฟ]0 , 34 (๐‘,๐‘š) where the masks m qโ„“ are defined as in (2.11). We may then reshape these measurements into a vector y โˆˆ C๐พ ๐ฟ with entries given by (๐‘,๐‘š) ๐‘ฆ ๐‘› := |โŸจmq๐‘› , ๐‘† ๐‘› xโŸฉ| 2 , mod ๐ฟ โŒŠ ๐ฟ โŒ‹ โˆ€๐‘› โˆˆ [๐พ ๐ฟ]0 . 
After this reformulation, we may then apply a standard Wirtinger Flow algorithm with spectral initialization. Full details are given below in Algorithm 2.2.

Algorithm 2.2 NFP Wirtinger Flow
Input: 1) Size d ∈ N, number of iterations T, stepsizes μ_{τ+1} for τ ∈ [T]_0.⁴ 2) PSF p ∈ C^d, mask m ∈ C^d, masks m̃_ℓ^(p,m) = S_ℓ p̃ ∘ m. 3) Noisy measurements Y_{k,ℓ} = |(p ∗ (S_k m ∘ x))_ℓ|² + N_{k,ℓ}.
Output: x_est ∈ C^d with x_est ≈ e^{iθ} x for some θ ∈ [0, 2π].
1) Rearrange the measurement matrix to form the measurement vector y in (2.20).
2) Compute z_0 using the spectral method (Algorithm 1 in [14]).
3) For τ ∈ [T]_0, let z_{τ+1} = z_τ − (μ_{τ+1}/‖z_0‖²) ∇f(z_τ), where

    f(z) := (1/(KL)) Σ_{n∈[KL]_0} ( |(S_{−⌊n/L⌋} m̃^(p,m)_{n mod L})* z|² − y_n )².

4) Return x_est = z_T.

⁴ For our numerical simulations in Section 2.7, we set μ_τ = min(1 − e^{−τ/330}, 0.4) as suggested in [14].

2.7 NUMERICAL SIMULATIONS
In this section, we evaluate Algorithms 2.1 and 2.2 with respect to both noise robustness and runtime. Every data point in the plots below reports an average reconstruction error or runtime over 100 tests. For each test, a new sample x ∈ C^d is randomly generated by choosing each entry to have independent and identically distributed (i.i.d.) mean 0 and variance 1 Gaussian real and imaginary parts. We then attempt to recover this sample from the noisy measurements Y_{k,ℓ}(x) defined as in (2.2), where the additive noise matrices N also have i.i.d. mean 0 Gaussian entries.
In our noise robustness experiments, we plot the reconstruction error as a function of the Signal-to-Noise Ratio (SNR), where we define the reconstruction error by

    Error(x, x_est) := 10 log_10( min_φ ‖x − e^{iφ} x_est‖_2² / ‖x‖_2² ),

and the SNR by

    SNR(Y, N) := 10 log_10( ‖Y − N‖_F / ‖N‖_F ).

In these experiments, we re-scale the noise matrix N in order to achieve each desired SNR level. All simulations were performed using MATLAB R2021b on an Intel desktop with a 2.60GHz i7-10750H CPU and 16GB DDR4 2933MHz memory. All code used to generate the figures below is publicly available at https://github.com/MarkPhilipRoach/NearFieldPtychography.

2.7.1 ALGORITHMS 2.1 AND 2.2 FOR LOCALLY SUPPORTED MASKS AND PERIODIC POINT SPREAD FUNCTIONS
In these experiments, we choose the measurement index set for (2.2) to be S = K × L where K = [d]_0 and L = [2δ−1]_0. As a consequence, we consider all shifts k ∈ [d]_0 of the mask while observing only a portion of each resulting noisy near-field diffraction pattern |p ∗ (S_k m ∘ x)|² for each k. This corresponds to a physical imaging system where, e.g., the sample and (a smaller) detector are fixed while a localized probe with support size δ scans across the sample.
Figure 2.1 evaluates the robustness and runtime of Algorithm 2.1 as a function of the SNR and mask support δ in this setting. Looking at Figure 2.1, one can see that noise robustness increases with the support size of the mask, δ, in exchange for mild increases in runtime.

Figure 2.1 An evaluation of Algorithm 2.1 for the proposed PSF and mask with d = 945. Left: Reconstruction error vs SNR for various δ = |supp(m)|. Right: Runtime as a function of δ.
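For reference, the error and SNR conventions above can be implemented directly; the closed-form phase alignment used below follows from expanding ‖x − e^{iφ}x_est‖₂² and maximizing the cross term. The function names are ours and are not taken from the released MATLAB code.

```python
import numpy as np

def reconstruction_error_db(x, x_est):
    # min over phi of ||x - e^{i phi} x_est||_2 is attained at
    # phi = angle(<x_est, x>), which maximizes the cross term.
    phi = np.angle(np.vdot(x_est, x))
    num = np.linalg.norm(x - np.exp(1j * phi) * x_est) ** 2
    return 10 * np.log10(num / np.linalg.norm(x) ** 2)

def rescale_noise(Y_clean, N, snr_db):
    # Scale N so that 10*log10(||Y - N||_F / ||N||_F) equals snr_db,
    # where Y = Y_clean + (scaled N), so that Y - N is the clean data.
    c = np.linalg.norm(Y_clean) / (np.linalg.norm(N) * 10 ** (snr_db / 10))
    return c * N
```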
Figure 2.2 compares the performance of Algorithm 2.1 and Algorithm 2.2 for the measurements proposed in Lemma 2.4.2. Looking at Figure 2.2, we can see that Algorithm 2.2 takes longer to achieve errors comparable to Algorithm 2.1 for these particular p and m as the SNR increases. More specifically, we see, e.g., that BlockPR achieves a reconstruction error similar to 500 iterations of Wirtinger Flow at an SNR of about 50 in a small fraction of the time. This supports the value of the BlockPR method as a fast initializer for more traditional optimization-based solution approaches.

Figure 2.2 A comparison of Algorithms 2.1 and 2.2 for the proposed PSF and mask with δ = 26 and d = 102. Left: Reconstruction error vs SNR for various numbers of Algorithm 2.2 iterations. Right: The corresponding average runtimes.

2.7.2 ALGORITHM 2.2 FOR GLOBALLY SUPPORTED MASKS
As we saw in the previous section, Algorithm 2.1 is able to invert NFP measurements more efficiently than Algorithm 2.2 in situations where it is applicable. However, Algorithm 2.1 only applies to locally supported masks. In this section, we will show that Algorithm 2.2 remains effective even when the masks are globally supported, such as the masks considered in [125]. In Figure 2.3, we evaluate Algorithm 2.2 using noisy measurements of the form

    Y_{k,ℓ} = |(p ∗ (S_k m ∘ x))_ℓ|² + N_{k,ℓ},  (k,ℓ) ∈ [K]_0 × [d]_0.    (2.21)

Here p ∈ C^d is a low-pass filter with p̂ = S_{−(γ−1)/2} 1_γ, where γ = d/3 + 1 and 1_γ ∈ {0,1}^d is a vector whose first γ entries are 1 and whose last d−γ entries are 0. Here, we choose the mask m to have i.i.d. mean 0, variance 1 Gaussian entries. Thus, the measurements considered in (2.21) differ from those used in Section 2.7.1 in two crucial respects: i) the mask m here has global support; ii) we utilize a small number of mask shifts and observe the entire diffraction pattern resulting from each one (as opposed to observing just a portion of each diffraction pattern from all possible shifts, as above).
Examining Figure 2.3, one can see that Algorithm 2.2 remains effective in this setting. We also observe, as expected, that using more shifts, i.e., collecting more measurements, results in lower reconstruction errors.

Figure 2.3 The reconstruction error of Algorithm 2.2 with d = 102, L = [d]_0, and number of iterations T = 2000. Left: Reconstruction error vs the number of total shifts K for fixed SNR = 80. Right: Reconstruction error vs SNR for various numbers of shifts K.

2.7.3 ALGORITHM 2.1 FOR NON-PERIODIC PSFS VIA REMARK 2.4.1
In these experiments, we choose the measurement index set for (2.2) to be S′ from Remark 2.4.1 and consider two different non-periodic PSFs together with locally supported masks m ∈ C^d having supp(m) ⊆ [δ]_0 for δ = 26 and d = 102. Motivated again by [125], we first consider a PSF given by a low-pass filter (defined as in Section 2.7.2) plus small noise modeling imperfections (here the additive vector has i.i.d. N(0, 10^{−4}) normal entries), in combination with a random symmetric mask. Here the mask's nonzero entries are created by reflecting δ/2 = 13 random entries (chosen via i.i.d. mean 0, variance 1 Gaussians) across the middle of its support.
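A minimal sketch of the globally supported measurement setup (2.21) follows, assuming d divisible by 3 and taking the shift of 1_γ so that the passband is centered at frequency zero; the roll conventions (S_k v)_n = v_{n+k} are assumptions matching the usage above.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 102                                   # divisible by 3, as in Figure 2.3
gamma = d // 3 + 1
one_gamma = np.zeros(d)
one_gamma[:gamma] = 1.0
p_hat = np.roll(one_gamma, -(gamma - 1) // 2)   # passband centered at frequency 0
p = np.fft.ifft(p_hat)                          # low-pass PSF
m = rng.standard_normal(d) + 1j * rng.standard_normal(d)  # globally supported mask
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)

cconv = lambda a, b: np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))
K = 6                                     # small number of mask shifts
# Y[k, l] = |(p * (S_k m o x))_l|^2, as in (2.21)
Y = np.stack([np.abs(cconv(p, np.roll(m, -k) * x)) ** 2 for k in range(K)])
```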
The reconstruction error of Algorithm 2.1, as well as of Algorithm 2.2 initialized with the output of Algorithm 2.1, is plotted on the left in Figure 2.4 as a function of the NFP measurements' SNR for this PSF/mask pair.
For our second non-periodic PSF and locally supported mask pair, we let the PSF be a vector with unit magnitude entries having i.i.d. uniformly random phases, and let our locally supported masks have δ nonzero i.i.d. mean 0, variance 1 Gaussian entries. The reconstruction errors of both Algorithm 2.1 and Algorithm 2.2 are plotted on the right in Figure 2.4 as a function of the NFP measurements' SNR in this case.
In both experiments plotted in Figure 2.4, we note that both random initialization and the spectral initialization method from [14] appear insufficiently accurate to allow Algorithm 2.2 to converge. However, when the output of Algorithm 2.1 is used to compute z_0 in step 2 of Algorithm 2.2, Algorithm 2.2 then converges nicely to an accurate estimate of the true signal. This further reinforces the potential value of Algorithm 2.1 as a fast and accurate initializer for more traditional optimization-based solution approaches.

Figure 2.4 A simulation applying Algorithm 2.1 via Remark 2.4.1 and then using its generated estimate as the initial estimate z_0 in Algorithm 2.2. In both plots Algorithm 2.1 is plotted in solid blue, and Algorithm 2.2 is plotted in dashed red. Left: Reconstruction error vs SNR where the PSF is a low-pass filter with small additive noise, and the locally supported mask is symmetric and random. Here T = 5000 iterations are used in Algorithm 2.2. Right: Reconstruction error vs SNR where the PSF is a vector with randomized unit magnitude entries, and the locally supported mask is random. Here T = 1000 iterations are used in Algorithm 2.2.

2.8 APPLICATION OF ALGORITHM 2.1
We now consider a real-world application of Algorithm 2.1, in which we aim to recover an n×n-pixel color image. The image is first broken down into its three color channels on the RGB scale, converting it into three integer-valued matrices with entries from 0 to 255 based on the intensity of the corresponding color at each pixel. Each of these matrices is then separately reshaped into a column vector in R^{n²}, which is then used as the object x in Algorithm 2.1. Once we obtain our three estimates for the three column vectors, they are reshaped back into n × n matrices and combined to form the color estimate of the original image.
Figure 2.5 shows an example of this process in action. The original image is a 128 × 128 pixel color image. This would mean that d = 128² = 16,384 in Algorithm 2.1; however, to ensure that d is divisible by 2δ−1 for any given δ, we let d_ext be the smallest integer such that d_ext ≥ d and d_ext is divisible by 2δ−1. We then extend the reshaped pixel data so that x ∈ R^{d_ext} by adding ones in the extended part, and we disregard this extension once we recover the estimate. We test this recovery for two values of δ, and for each δ we apply varying levels of noise.
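A sketch of this channel-splitting and padding pipeline is given below, assuming a square image; PIL is used here only for image I/O, and the helper names are illustrative.

```python
import numpy as np
from PIL import Image  # assumed available for image I/O

def channels_to_vectors(img_path, delta):
    # Split an n x n color image into three channel vectors, padded with ones
    # so that the extended length d_ext is divisible by 2*delta - 1.
    rgb = np.asarray(Image.open(img_path).convert("RGB"), dtype=float)
    n = rgb.shape[0]
    d = n * n
    block = 2 * delta - 1
    d_ext = -(-d // block) * block            # smallest multiple of block >= d
    vecs = [np.concatenate([rgb[:, :, c].reshape(d), np.ones(d_ext - d)])
            for c in range(3)]
    return vecs, n, d

def vectors_to_image(vecs, n, d):
    # Discard the padding and reassemble the color estimate.
    chans = [np.clip(np.real(v[:d]).reshape(n, n), 0, 255) for v in vecs]
    return np.stack(chans, axis=-1).astype(np.uint8)
```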
2.9 CONCLUSIONS AND FUTURE WORK
We have introduced two new algorithms for recovering a specimen of interest from near-field ptychographic measurements. Both of these algorithms rely on first reformulating and reshaping our measurements so that they resemble widely-studied far-field ptychographic measurements. We then recover our specimen using either Wirtinger Flow or methods based on [54]. Algorithm 2.1 is computationally efficient and, to the best of our knowledge, is the first algorithm with provable recovery guarantees for measurements of this form. Algorithm 2.2, on the other hand, has the advantage of applying to more general masks with global support. Developing more efficient and provably accurate algorithms for this latter class of measurements remains an interesting avenue for future work.

CHAPTER 3
BLIND PTYCHOGRAPHY VIA BLIND DECONVOLUTION

3.1 ABSTRACT
Ptychography involves a sample being illuminated by a coherent, localised probe of illumination. When the probe interacts with the sample, the light is diffracted and a diffraction pattern is detected. Then the probe or sample is shifted laterally in space to illuminate a new area of the sample, while ensuring there is sufficient overlap. Far-field ptychography occurs when there is a large enough distance (when the Fresnel number is ≪ 1) to obtain magnitude-square Fourier transform measurements. In an attempt to remove ambiguities, masks are utilized to ensure that the outputs of any recovery algorithm are unique up to a global phase. In this chapter, we assume that both the sample and the mask are unknown, and we apply blind deconvolutional techniques to solve for both. Numerical experiments demonstrate that the technique works well in practice and is robust under noise.
This chapter is comprised of three sections. Section 3.3 introduces far-field Fourier ptychography, and an algorithm for solving given noisy ptychographic measurements. Of particular use to us will be Theorem 3.3.2, which reformulates the measurements into a convolution. Section 3.4 explores a method for solving a blind deconvolution problem, given certain appropriate real-world assumptions. Section 3.5 combines the previous two sections, taking the reformulated convolutional measurements, assuming that both components are unknown, and then applying the blind deconvolution recovery algorithm. The full algorithm is stated and numerical simulations are summarized, outlining good recovery which is robust under noise.

3.2 INTRODUCTION
Ptychography involves a sample being illuminated by a coherent, localised probe of illumination. When the probe interacts with the sample, the light is diffracted and a diffraction pattern is detected. Then the probe or sample is shifted laterally in space to illuminate a new area of the sample, while ensuring there is sufficient overlap. Far-field ptychography occurs when there is a large enough distance (when the Fresnel number is ≪ 1) to obtain magnitude-square Fourier transform measurements.
Ptychography was initially studied in the late 1960s ([45]), with the problem solidified in 1970 ([44]). The name "ptychography" was coined in 1972 ([43]), after the Greek word for fold, because the process involves an interference pattern in which the scattered waves fold into one another in the (coherent) Fourier diffraction pattern of the object. Initially developed to study crystalline objects under a scanning transmission electron microscope, the field has since widened to setups using visible light ([19], [47], [85]), x-rays ([26], [115], [90]), or electrons ([123], [37], [58]). It benefits from being unaffected by lens-induced aberrations or diffraction effects, unlike conventional lens imaging. Various types of ptychography are studied based on the optical configuration of the experiments.
For instance, Bragg ptychography ([38], [109], [46], [70]) measures strain in crystalline specimens by shifting the surface of the specimen. Fourier ptychography ([127], [114], [88], [128]) consists of taking multiple images at a wide field-of-view and then computationally synthesizing them into a high-resolution image reconstruction in the Fourier domain. This results in an increased resolution compared to a conventional microscope.

Figure 3.1 [47] Experimental setup for fly-scan ptychography.

3.3 FAR-FIELD FOURIER PTYCHOGRAPHY
Let x, m ∈ C^d denote the unknown sample and known mask, respectively. We suppose that we have d² noisy ptychographic measurements of the form

    Y_{ℓ,k} = |(F(x ∘ S_k m))_ℓ|² + N_{ℓ,k},  (ℓ,k) ∈ [d]_0 × [d]_0,    (3.1)

where S_k, ∘, and F := F_d denote the kth circular shift, the Hadamard product, and the d-dimensional discrete Fourier transform, respectively, and N is the matrix of additive noise. In this section, we will define a discrete Wigner distribution deconvolution method for recovering a discrete signal. A modified Wigner distribution deconvolution approach is used to solve for an estimate of x̂x̂* ∈ C^{d×d}, and then angular synchronization is performed to compute an estimate of x̂, and thus of x.
In Section 3.3.1, we introduce definitions and technical lemmas which will be of use. In particular, the decoupling lemma (Lemma 3.3.2) allows us to effectively 'separate' the mask and object inside a convolution. In Section 3.3.2, these technical lemmas are applied to the ptychographic measurements to write the problem as a decoupled deconvolution problem, the blind variant of which will be studied later on. In Section 3.3.3, an additional Fourier transform is applied, and the measurements are rewritten in a form to which a pointwise division approach can be applied. Sub-sampled versions of this theorem are also given. We then state the full algorithm for recovering the sample.

3.3.1 PROPERTIES OF THE DISCRETE FOURIER TRANSFORM
We first define the modulation operator.

Definition 3.3.1. Given k ∈ [d]_0, define the modulation operator W_k : C^d → C^d component-wise via

    (W_k x)_n = x_n e^{2πikn/d},  ∀n ∈ [d]_0.    (3.2)

From this definition, we can develop some useful equalities which we will use in the main proofs of this section.

Lemma 3.3.1 (Technical Equalities) (Lemma 1.3.1, [78]). The following equalities hold for all x ∈ C^d, ℓ ∈ [d]_0:
(i) F_d x̂ = d · x̃;
(ii) F_d(W_ℓ x) = S_{−ℓ} x̂;
(iii) F_d(S_ℓ x) = W_ℓ x̂;
(iv) W_{−ℓ} F_d(S_ℓ \overline{x̃}) = \overline{x̂};
(v) (S_ℓ x)~ = S_{−ℓ} x̃;
(vi) F_d x̄ = \overline{F_d x̃};
(vii) (x̂)~ = F_d x̃.

We wish to be able to convert between the convolution and the Hadamard product, so we will need the following useful theorem.

Theorem 3.3.1 (Discretized Convolution Theorem) (Lemma 1.3.2, [78]). Let x, y ∈ C^d. We have that
(i) F_d^{−1}(x̂ ∘ ŷ) = x ∗_d y;
(ii) (F_d x) ∗_d (F_d y) = d · F_d(x ∘ y).

Currently, the measurements we are dealing with have the specimen and the mask intertwined. We introduce the decoupling lemma to essentially disentangle the two.

Lemma 3.3.2 (Decoupling Lemma) (Lemma 1.3.3, [78]). Let x, y ∈ C^d and ℓ, k ∈ [d]_0. Then

    ( (x ∘ S_{−ℓ} y) ∗_d (\overline{x̃} ∘ S_ℓ \overline{ỹ}) )_k = ( (x ∘ S_{−k} x̄) ∗_d (ỹ ∘ S_k \overline{ỹ}) )_ℓ.    (3.3)
Proof. Let x, y ∈ C^d and ℓ, k ∈ [d]_0. By the definitions of the circular convolution, Hadamard product, and shift operator, we have that

    ( (x ∘ S_{−ℓ} y) ∗_d (\overline{x̃} ∘ S_ℓ \overline{ỹ}) )_k = Σ_{n=0}^{d−1} (x ∘ S_{−ℓ} y)_n (\overline{x̃} ∘ S_ℓ \overline{ỹ})_{k−n}
        = Σ_{n=0}^{d−1} x_n y_{n−ℓ} \overline{x̃}_{k−n} \overline{ỹ}_{ℓ+k−n}
        = Σ_{n=0}^{d−1} x_n \overline{x}_{n−k} ỹ_{ℓ−n} \overline{ỹ}_{k+ℓ−n}    (3.4)
        = Σ_{n=0}^{d−1} (x ∘ S_{−k} x̄)_n (ỹ ∘ S_k \overline{ỹ})_{ℓ−n}
        = ( (x ∘ S_{−k} x̄) ∗_d (ỹ ∘ S_k \overline{ỹ}) )_ℓ.  □

Lastly, before entering the main part of this subsection, we need a lemma relating the Fourier squared-magnitude measurements to a convolution.

Lemma 3.3.3. Let x ∈ C^d. We have that

    |F_d x|² = F_d(x ∗_d \overline{x̃}).    (3.5)

Proof. Let x ∈ C^d. Then we have that

    |F_d x|² = (F_d x) ∘ \overline{(F_d x)} = (F_d x) ∘ (F_d \overline{x̃}) = F_d(x ∗_d \overline{x̃}).    (3.6)  □

3.3.2 DISCRETIZED WIGNER DISTRIBUTION DECONVOLUTION
We now prove the Discretized Wigner Distribution Deconvolution theorem, which will allow us to convert the measurements into a form which we can solve algorithmically.

Theorem 3.3.2 (Lemma 1.3.5, [78]). Let x, m ∈ C^d denote the unknown specimen and known mask, respectively. Suppose we have d² noisy ptychographic measurements of the form

    (y_ℓ)_k = | Σ_{n=0}^{d−1} x_n m_{n−ℓ} e^{−2πink/d} |² + N_{ℓ,k},  (ℓ,k) ∈ [d]_0 × [d]_0.    (3.7)

Let Y ∈ R^{d×d} and N ∈ C^{d×d} be the matrices whose ℓth columns are y_ℓ and N_ℓ, respectively. Then for any k ∈ [d]_0,

    (Yᵀ F_dᵀ)_k = d · (x ∘ S_k x̄) ∗_d (m̃ ∘ S_{−k} \overline{m̃}) + (Nᵀ F_dᵀ)_k.    (3.8)

Proof. Let ℓ ∈ [d]_0. We have that

    y_ℓ = |F_d(x ∘ S_{−ℓ} m)|² + N_ℓ = F_d( (x ∘ S_{−ℓ} m) ∗_d (\overline{x̃} ∘ S_ℓ \overline{m̃}) ) + N_ℓ.    (3.9)

Taking the Fourier transform of both sides at k ∈ [d]_0 and using that F_d x̂ = d · x̃ yields

    (F_d y_ℓ)_k = d · ( (x ∘ S_{−ℓ} m) ∗_d (\overline{x̃} ∘ S_ℓ \overline{m̃}) )_{−k} + (F_d N_ℓ)_k = d · ( (x ∘ S_k x̄) ∗_d (m̃ ∘ S_{−k} \overline{m̃}) )_ℓ + (F_d N_ℓ)_k,    (3.10)

by the previous lemma. For fixed ℓ ∈ [d]_0, the vector F_d y_ℓ ∈ C^d is the ℓth column of the matrix F_d Y; thus its transpose y_ℓᵀ F_dᵀ ∈ C^d is the ℓth row of the matrix (F_d Y)ᵀ. Similarly, N_ℓᵀ F_dᵀ ∈ C^d is the ℓth row of (F_d N)ᵀ. Thus we have that

    ( (Yᵀ F_dᵀ)_k )_ℓ = d · ( (x ∘ S_k x̄) ∗_d (m̃ ∘ S_{−k} \overline{m̃}) )_ℓ + ( (Nᵀ F_dᵀ)_k )_ℓ,    (3.11)

and therefore

    (Yᵀ F_dᵀ)_k = d · (x ∘ S_k x̄) ∗_d (m̃ ∘ S_{−k} \overline{m̃}) + (Nᵀ F_dᵀ)_k.    (3.12)  □

We note that x ∘ S_k x̄ is a diagonal of xx*.
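Both Lemma 3.3.3 and the decoupling lemma are easy to check numerically. The following sketch does so under the conventions assumed here: F_d is the unnormalized DFT with kernel e^{−2πink/d}, (S_k v)_n = v_{n+k}, and x̃ is the reversal about the first entry.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
y = rng.standard_normal(d) + 1j * rng.standard_normal(d)

F = np.fft.fft                                   # F_d, kernel e^{-2 pi i nk/d}
cconv = lambda a, b: np.fft.ifft(F(a) * F(b))    # circular convolution *_d
rev = lambda v: np.roll(v[::-1], 1)              # reversal about first entry
S = lambda v, k: np.roll(v, -k)                  # (S_k v)_n = v_{n+k}

# Lemma 3.3.3: |F_d x|^2 = F_d(x *_d conj(x-tilde))
assert np.allclose(np.abs(F(x)) ** 2, F(cconv(x, np.conj(rev(x)))))

# Decoupling lemma (3.3), checked at one index pair (k, l):
k, l = 3, 5
lhs = cconv(x * S(y, -l), np.conj(rev(x)) * S(np.conj(rev(y)), l))[k]
rhs = cconv(x * S(np.conj(x), -k), rev(y) * S(np.conj(rev(y)), k))[l]
assert np.allclose(lhs, rhs)
```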
3.3.3 WIGNER DISTRIBUTION DECONVOLUTION ALGORITHM
We suppose that the mask is known and the specimen is unknown. By taking an additional Fourier transform and using the discretized convolution theorem, we have the following variants of the previous lemmas.

Theorem 3.3.3 (Discretized Wigner Distribution Deconvolution). Let x, m ∈ C^d denote the unknown specimen and known mask, respectively. Suppose we have d² noisy spectrogram measurements of the form

    (y_ℓ)_k = | Σ_{n=0}^{d−1} x_n m_{n−ℓ} e^{−2πink/d} |² + N_{ℓ,k},  (ℓ,k) ∈ [d]_0 × [d]_0.    (3.13)

Let Y ∈ R^{d×d} be the matrix whose ℓth column is y_ℓ. Then for any k ∈ [d]_0,

    (F_d Yᵀ F_dᵀ)_k = d · F_d(x ∘ S_k x̄) ∘ F_d(m̃ ∘ S_{−k} \overline{m̃}) + (F_d Nᵀ F_dᵀ)_k.    (3.14)

We also have a similar result based on the work in Appendix B.

Lemma 3.3.4 (Sub-Sampling In Frequency). Suppose that the spectrogram measurements are collected on a subset 𝒦 ⊆ [d]_0 of K equally spaced Fourier modes. Then for any ω ∈ [K]_0,

    ( F_d (Y_{K,d})ᵀ F_Kᵀ )_ω = K Σ_{r=0}^{d/K−1} ( F_d(x ∘ S_{ℓL−α} x̄) ∘ F_d(m̃ ∘ S_{α−ℓL} \overline{m̃}) )_{ω−rK} + ( F_d (N_{K,d})ᵀ F_Kᵀ )_ω,

where Y_{K,d} ∈ C^{K×d} is the matrix of sub-sampled noiseless K·d measurements.

Lemma 3.3.5 (Sub-Sampling In Frequency And Space). Suppose we have spectrogram measurements collected on a subset 𝒦 ⊆ [d]_0 of K equally spaced frequencies and a subset ℒ ⊆ [d]_0 of L equally spaced physical shifts. Then for any ω ∈ [K]_0 and α ∈ [L]_0,

    ( F_L (Y_{K,L})ᵀ (F_Kᵀ)_ω )_α = (KL/d³) Σ_{r=0}^{d/K−1} Σ_{ℓ=0}^{d/L−1} ( F_d(x̂ ∘ S_{ℓL−α} \overline{x̂}) )_{ω−rK} ( F_d(m̂ ∘ S_{α−ℓL} \overline{m̂}) )_{ω−rK} + ( F_L (N_{K,L})ᵀ (F_Kᵀ)_ω )_α,

where Y_{K,L} ∈ C^{K×L} is the matrix of sub-sampled noiseless K·L measurements.

Assume that m is band-limited with supp(m̂) = [δ]_0 for some δ ≪ d. Then the algorithm below allows for the recovery of an estimate of x̂ from spectrogram measurements via Wigner distribution deconvolution and angular synchronization.

Algorithm 3.1 (Algorithm 1, [78]) Wigner Distribution Deconvolution Algorithm
Input: 1) Y_{d,L} ∈ C^{d×L}, matrix of noisy measurements. 2) Mask m ∈ C^d with supp(m̂) = [δ]_0. 3) Integer κ ≤ δ, so that 2κ−1 diagonals of x̂x̂* are estimated, and L = δ + κ − 1.
Output: An estimate x_est of x up to a global phase.
1) Perform pointwise division to compute, for each relevant diagonal index k,

    (1/d) · (F_d Yᵀ F_dᵀ)_k ⊘ F_d(m̃ ∘ S_{−k} \overline{m̃}).    (3.15)

2) Invert the (2κ−1) Fourier transforms above.
3) Organize the values from step 2 to form the diagonals of a banded matrix Y_{2κ−1}.
4) Perform angular synchronization on Y_{2κ−1} to obtain x̂_est.
5) Let x_est = F_d^{−1} x̂_est.

When the mask is known with supp(m̂) = [δ]_0, δ ≪ d, maximum error guarantees (Theorem 2.1.1, [78]) are given depending on x, d, κ, L, ‖N_{d,L}‖_F (the matrix formed by the noise), and the mask-dependent constant μ > 0,

    μ := min_{|p| ≤ κ−1, q ∈ [d]_0} | ( F_d(m̂ ∘ S_p \overline{m̂}) )_q |.    (3.16)
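Algorithm 3.1 is stated for a band-limited mask and recovers the diagonals of x̂x̂* via the sub-sampled dual relations. The following simplified sketch instead illustrates the core deconvolution step in the time domain, using a compactly supported mask and the full-measurement relation (3.14): each column of F_d Yᵀ F_dᵀ is divided pointwise by d · F_d(m̃ ∘ S_{−k} m̃̄), and an inverse FFT returns the kth diagonal x ∘ S_k x̄ of xx* (angular synchronization, as sketched earlier, would then complete the recovery). The conventions match the earlier verification snippet.

```python
import numpy as np

rng = np.random.default_rng(2)
d, delta = 60, 6
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
m = np.zeros(d, dtype=complex)
m[:delta] = rng.standard_normal(delta) + 1j * rng.standard_normal(delta)

S = lambda v, k: np.roll(v, -k)              # (S_k v)_n = v_{n+k}
rev = lambda v: np.roll(v[::-1], 1)          # reversal about the first entry

# noiseless spectrogram measurements: column l is |F_d(x o S_{-l} m)|^2
Y = np.stack([np.abs(np.fft.fft(x * S(m, -l))) ** 2 for l in range(d)], axis=1)

T = np.fft.fft(np.fft.fft(Y, axis=0).T, axis=0)   # F_d Y^T F_d^T
for k in range(-delta + 1, delta):                # only overlapping diagonals
    kk = k % d
    denom = d * np.fft.fft(rev(m) * S(np.conj(rev(m)), -kk))
    diag_k = np.fft.ifft(T[:, kk] / denom)        # pointwise division + IFFT
    assert np.allclose(diag_k, x * S(np.conj(x), kk))  # kth diagonal of x x^*
```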
In the next section, we look at the situation in which both the specimen and the mask are unknown. Since we have already shown that we can rewrite the Fourier squared-magnitude measurements as convolutions between shifted autocorrelations, the natural next step is the setting where both convolution factors are unknown. This is the topic of blind deconvolution, which seeks to recover two vectors from their convolution. In particular, we will look at a couple of approaches which involve making assumptions based on real-world applications.

3.4 BLIND DECONVOLUTION
3.4.1 INTRODUCTION
Blind deconvolution is a problem that has been mathematically considered for decades, from earlier work ([4], [61], [112], [34], [79], [65]) to more recent work ([15], [72], [2], [35], [71]), and it is summarized in [69]. The goal is to recover a sharp image from an initial blurry image. The first application to compressive sensing was considered in [2]. We consider one-dimensional, discrete, noisy measurements of the form

    y = f ∗ g + n,

where f is considered to be an object, signal, or image of interest; g is considered to be a blurring, masking, or point-spread function; n is the noise vector; and ∗ refers to circulant convolution.¹ We consider situations in which both f and g are unknown. The process of recovering the object and blurring function can be generalized to two-dimensional measurements. The problem of estimating the unknown blurring function and unknown object simultaneously is known as blind image restoration ([124], [122], [64], [105]). Although, strictly speaking, blind deconvolution refers to the noiseless model of recovering f and g from y = f ∗ g, the noisy model is commonly referred to as blind deconvolution as well, and this convention will be continued in this chapter. As we will show later, the problem is ill-posed, and ambiguities mean that no approach can produce a unique solution pair.
In Section 3.4.2, we consider the underlying measurements and assumptions that we will work with. We then show how, through manipulation, we can rewrite the original problem as the minimization of a non-convex function. In Section 3.4.3, we demonstrate an iterative approach to the minimization problem, in particular by applying Wirtinger gradient descent. In Section 3.4.5, we outline the initial estimate used for this gradient descent and fully lay out the algorithm that we will apply in our numerical simulations. In Section 3.4.6, we look at the recovery guarantees that currently exist for this approach. Finally, in Section 3.4.7, we consider the key conditions used to generate the main recovery theorem, and where further work could be done to generalize these conditions and, ultimately, allow more guarantees of recovery.

¹ Here ∗ should refer to ordinary convolution, but for the g that will be considered in this chapter, circulant convolution will be sufficient.

3.4.2 BLIND DECONVOLUTION MODEL
We now want to approach the blind ptychography problem, in which both the mask and the specimen are unknown. Using the lemmas in the previous section, we can see that this reduces to solving a blind deconvolution problem.

Definition 3.4.1. We consider the blind deconvolution model

    y′ = f ∗ g + n,  y′, f, g, n ∈ C^d,

where y′ are the blind deconvolutional measurements, f is the unknown blurring function (which serves a role similar to that of our phase retrieval masks), n is the noise, and g is the signal (which serves a role similar to that of our phase retrieval object). Here ∗ denotes circular convolution.
We will base our work on the algorithm suggested in [71], along with the assumptions used there. In [71], the authors impose general conditions on f and g that are not restricted to any particular application but allow for flexibility. They also assume that f and g belong to known linear subspaces.
For the blurring function, it is assumed that f is either compactly supported, or that f decays sufficiently fast so that it can be well approximated by a compactly supported function. Therefore, we make the assumption that f ∈ C^d satisfies

    f := [ h ; 0_{d−K} ],

for some K ≪ d and h ∈ C^K. This again reinforces the notion that the blurring function is analogous to our masking function, since both are compactly supported.

Figure 3.2 [35] An example of image deblurring by solving a blind deconvolution problem.

For the signal, it is assumed that g belongs to a linear subspace spanned by the columns of a known matrix C, i.e., g = Cx̄ for some matrix C ∈ C^{d×N}, N ≪ d. This will lead to an additional restriction that we have to place on our blind ptychography, but one for which there are real-world applications where this assumption makes reasonable sense. In [71], the authors take C to be a Gaussian random matrix for the theoretical guarantees, although they demonstrate in numerical simulations that this assumption is not necessary to obtain good results. In particular, they found good results when C represents a wavelet subspace (suitable for images) or when C is a Hadamard-type matrix (suitable for communications).
We assume the noise is complex Gaussian, i.e., n ∼ N(0, (σ²L₀²/2) I_d) + iN(0, (σ²L₀²/2) I_d) is a complex Gaussian noise vector, where L₀ = ‖h₀‖·‖x₀‖ and h₀, x₀ are the true blurring function and signal. Here σ^{−2} represents the SNR.
The goal is to convert the problem into one which can be solved algorithmically via gradient descent.

Proposition 3.4.1 ([71]). Let F_d ∈ C^{d×d} be the DFT matrix, and let B ∈ C^{d×K} denote the first K columns of F_d. Then we have that

    y = Bh ∘ Ax + e,    (3.17)

where y = (1/√d) ŷ′, Ā = F_d C ∈ C^{d×N}, and e = (1/√d) F_d n represents the noise.

Proof. By the unitarity of (1/√d)F_d, we have B*B = I_K. Applying F_d to both sides of the convolution and using Theorem 3.3.1, we have that

    F_d y′ = (F_d f) ∘ (F_d g) + F_d n.

Additionally, we have that

    F_d f = [B  M] [ h ; 0_{d−K} ] = Bh,

and we let Ā = F_d C ∈ C^{d×N}. Since C is Gaussian, Ā = F_d C is also Gaussian; in particular, Ā_{ij} ∼ N(0, 1/2) + iN(0, 1/2). Thus, dividing by √d, the problem converts to

    (1/√d) ŷ′ = Bh ∘ Ax + e,

where e = (1/√d) F_d n ∼ N(0, (σ²L₀²/(2d)) I_d) + iN(0, (σ²L₀²/(2d)) I_d) serves as complex Gaussian noise. Hence, by letting y = (1/√d) ŷ′, we arrive at

    y = Bh ∘ Ax + e.  □

We have thus transformed the original blind deconvolution model into a Hadamard product. This form of the problem is used in the rest of the section, where y ∈ C^d, B ∈ C^{d×K}, and A ∈ C^{d×N} are given. Our goal is to recover h₀ and x₀. There are inherent ambiguities to the problem, however: if (h₀, x₀) is a solution to the blind deconvolution problem, then so is (αh₀, α^{−1}x₀) for any non-zero constant α. For most real-world applications this is not an issue. Thus, for uniformity, it is assumed that ‖h₀‖ = ‖x₀‖ = √L₀.
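The reformulation in Proposition 3.4.1 can be verified directly. In the sketch below we simply take A := F_d C / √d, so that the identity holds exactly as written; the text's conjugation convention (Ā = F_d C) does not affect the structure of the identity.

```python
import numpy as np

rng = np.random.default_rng(3)
d, K, N = 64, 8, 4
h = rng.standard_normal(K) + 1j * rng.standard_normal(K)
f = np.concatenate([h, np.zeros(d - K)])        # f = [h; 0_{d-K}]
C = rng.standard_normal((d, N))
xp = rng.standard_normal(N) + 1j * rng.standard_normal(N)
g = C @ xp                                      # g lies in a known subspace

Fd = np.fft.fft(np.eye(d))                      # unnormalized DFT matrix
B = Fd[:, :K]                                   # first K columns: F_d f = B h
A = (Fd @ C) / np.sqrt(d)                       # scaled so entries ~ CN(0, 1)

y_conv = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g))   # y' = f * g (noiseless)
y = np.fft.fft(y_conv) / np.sqrt(d)                    # y = (1/sqrt(d)) F_d y'
assert np.allclose(y, (B @ h) * (A @ xp))              # y = Bh o Ax
```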
Definition 3.4.2. We define the matrix-valued linear operator 𝒜 : C^{K×N} → C^d by

    𝒜(Z) := { b_ℓ* Z a_ℓ }_{ℓ=1}^d,

where b_ℓ denotes the ℓ-th column of B* and a_ℓ is the ℓ-th column of A*. We also define the corresponding adjoint operator 𝒜* : C^d → C^{K×N}, given by

    𝒜*(z) := Σ_{ℓ=1}^d z_ℓ b_ℓ a_ℓ*.

We see that this translates to a lifting problem, where

    Σ_{ℓ=1}^d b_ℓ b_ℓ* = B*B = I_K,  E(a_ℓ a_ℓ*) = I_N,  ‖b_ℓ‖² = K/d,  ∀ℓ ∈ [d].

Lemma 3.4.1. Let y be defined as in Proposition 3.4.1. Then

    y = 𝒜(h₀x₀*) + e.    (3.18)

This model, equivalent to Proposition 3.4.1, is the one we will work with for the rest of the chapter. We aim to recover (h₀, x₀) by solving the minimization problem

    min_{(h,x)} F(h, x),  F(h, x) := ‖𝒜(hx*) − y‖² = ‖𝒜(hx* − h₀x₀*) − e‖².

We also define

    F₀(h, x) := ‖𝒜(hx* − h₀x₀*)‖²,  δ = δ(h, x) := ‖hx* − h₀x₀*‖_F / L₀.

F(h, x) is highly non-convex, and thus attempts at direct minimization, such as alternating minimization or plain gradient descent, can easily get trapped in local minima.
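A concrete model of Definition 3.4.2 follows. The conjugate on Ax in the rank-one check reflects folding the conjugation convention into the Hadamard form of (3.17); this is a sketch under those assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
d, K, N = 32, 6, 4
B = np.fft.fft(np.eye(d))[:, :K] / np.sqrt(d)   # normalized so B* B = I_K
A = (rng.standard_normal((d, N)) + 1j * rng.standard_normal((d, N))) / np.sqrt(2)

def calA(Z):
    # A(Z)_l = b_l^* Z a_l (b_l^* = l-th row of B, a_l^* = l-th row of A)
    return np.einsum('lk,kn,ln->l', B, Z, np.conj(A))

def calA_star(z):
    # A*(z) = sum_l z_l b_l a_l^*
    return np.einsum('l,lk,ln->kn', z, np.conj(B), A)

h = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
assert np.allclose(B.conj().T @ B, np.eye(K))
# rank-one lifting: A(h x*) = (Bh) o conj(Ax), the Hadamard form of (3.17)
assert np.allclose(calA(np.outer(h, np.conj(x))), (B @ h) * np.conj(A @ x))
# adjoint identity: <A(Z), z> = <Z, A*(z)>
Z = rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))
z = rng.standard_normal(d) + 1j * rng.standard_normal(d)
assert np.allclose(np.vdot(z, calA(Z)), np.vdot(calA_star(z), Z))
```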
3.4.2.1 MAIN THEOREMS

Theorem 3.4.1 (Existence of a Unique Solution) ([2], Theorem 1). Fix α ≥ 1. Then there exists a constant C_α = O(α) such that if

    max(K · μ²_max, N · μ²_h) ≤ d / (C_α log³ d),

then X₀ = h₀x₀* is the unique solution to our minimization problem with probability 1 − O(d^{−α+1}); thus we can separate y = f ∗ g up to a scalar multiple. When the coherence is low, this is tight within a logarithmic factor, as we always have max(K, N) ≤ d.

Theorem 3.4.2 (Stability Under Noise) ([2], Theorem 2). Let X₀ = h₀x₀* and suppose the condition of the previous theorem holds. We observe y = 𝒜(X₀) + e, where e ∈ R^d is an unknown noise vector with ‖e‖₂ ≤ δ, and estimate X₀ by solving

    min ‖X‖_*  subject to  ‖y − 𝒜(X)‖₂ ≤ δ.

Let λ_min and λ_max be the smallest and largest non-zero eigenvalues of 𝒜𝒜*. Then, with probability 1 − d^{−α+1}, the solution X will obey

    ‖X − X₀‖_F ≤ C (λ_max/λ_min) sqrt(min(K, N)) δ,

for a fixed constant C.

3.4.3 WIRTINGER GRADIENT DESCENT
In [71], the approach is to solve the minimization problem using Wirtinger gradient descent. In this subsection, the algorithm is introduced, as well as the main theorems which establish convergence of the proposed algorithm to the true solution. The algorithm consists of two parts: first an initial guess, and second a variation of gradient descent, starting at the initial guess, that converges to the true solution. Theoretical results are established for avoiding getting stuck in local minima. This is ensured by determining that the iterates remain inside a properly chosen basin of attraction of the true solution.

3.4.4 BASIN OF ATTRACTION
Proposition 3.4.2 (Basin of Attraction) (Section 3.1, [71]). Three neighbourhoods are introduced whose intersection will form the basin of attraction of the solution:
(i) Non-uniqueness: Due to the scale ambiguity, for numerical stability we introduce the following neighbourhood:

    N_{L₀} := {(h, x) | ‖h‖ ≤ 2√L₀, ‖x‖ ≤ 2√L₀},  L₀ = ‖h₀‖·‖x₀‖.

(ii) Incoherence: The number of measurements required for solving the blind deconvolution problem depends on how much h₀ is correlated with the rows of the matrix B; we hope for this correlation to be small. We define the incoherence between the rows of B and h₀ via

    μ²_h = d ‖Bh₀‖²_∞ / ‖h₀‖².

To ensure that the incoherence of the solution is under control, we introduce the neighborhood

    N_μ := {h | √d ‖Bh‖_∞ ≤ 4√L₀ μ},  μ_h ≤ μ.    (3.19)

(iii) Initial guess: A carefully chosen initial guess is required due to the non-convexity of the function we wish to minimize. The distance to the true solution is controlled via the following neighborhood:

    N_ε := {(h, x) | ‖hx* − h₀x₀*‖_F ≤ ε L₀},  0 < ε ≤ 1/15.    (3.20)

Thus the basin of attraction is chosen as N_{L₀} ∩ N_μ ∩ N_ε, where the true solution lies.

Figure 3.3 Basin of attraction: N_{L₀} ∩ N_μ ∩ N_ε.

Our approach consists of two parts: we first construct an initial guess that is inside the basin of attraction N_{L₀} ∩ N_μ ∩ N_ε. We then apply a regularized Wirtinger gradient descent algorithm that ensures all the iterates remain inside N_{L₀} ∩ N_μ ∩ N_ε. To achieve this, we add a regularizing function G(h, x) to the objective function F(h, x) to enforce that the iterates remain inside N_{L₀} ∩ N_μ. Hence, in order to solve the blind deconvolution problem, we aim to minimize the following regularized objective function:

    F̃(h, x) := F(h, x) + G(h, x),

where F(h, x) is defined as before and G(h, x) is the penalty function, of the form

    G(h, x) := ρ [ G₀(‖h‖²/(2L)) + G₀(‖x‖²/(2L)) + Σ_{ℓ=1}^d G₀( d|b_ℓ* h|² / (8Lμ²) ) ],

where G₀(z) := max{z − 1, 0}² and ρ ≥ L² + 2‖e‖². It is assumed that (9/10)L₀ ≤ L ≤ (11/10)L₀ and μ ≥ μ_h.

Remark 3.4.1. The matrix 𝒜*(e) = Σ_{k=1}^d e_k b_k a_k*, as a sum of d rank-1 random matrices, has nice concentration of measure properties. Asymptotically, ‖𝒜*(e)‖ converges to 0 at rate O(d^{−1/2}). Note that

    F(h, x) = ‖e‖² + ‖𝒜(hx* − h₀x₀*)‖²_F − 2Re(⟨𝒜*(e), hx* − h₀x₀*⟩).

If one lets d → ∞, then ‖e‖² ∼ (σ²L₀²/(2d)) χ²_{2d} converges almost surely to σ²L₀² (by the law of large numbers), and the cross term Re(⟨hx* − h₀x₀*, 𝒜*(e)⟩) converges to 0. In other words, asymptotically,

    lim_{d→∞} F(h, x) = F₀(h, x) + σ²L₀²,

for all fixed (h, x). This implies that if the number of measurements is large, then F(h, x) behaves "almost like" F₀(h, x) = ‖𝒜(hx* − h₀x₀*)‖², the noiseless version of F(h, x). So, for large d, we may effectively ignore the noise.

Theorem 3.4.3. For any given Z ∈ C^{K×N}, we have that E(𝒜*(𝒜(Z))) = Z.
Proof. By linearity, and using that Σ_{ℓ=1}^d b_ℓ b_ℓ* = I_K and E(a_ℓ a_ℓ*) = I_N, we have that

    E(𝒜*(𝒜(Z))) = E( Σ_{ℓ=1}^d (b_ℓ* Z a_ℓ) b_ℓ a_ℓ* ) = Σ_{ℓ=1}^d b_ℓ b_ℓ* Z E(a_ℓ a_ℓ*) = Σ_{ℓ=1}^d b_ℓ b_ℓ* Z = Z.  □

Thus we have that

    E(𝒜*(y)) = E(𝒜*(𝒜(h₀x₀*) + e)) = E(𝒜*(𝒜(h₀x₀*))) + E(𝒜*(e)) = h₀x₀*,

since E(𝒜*(e)) = 0 by the definition of e. Hence it makes sense that the leading singular value and singular vectors of 𝒜*(y) would be good approximations of L₀ and (h₀, x₀), respectively.

3.4.5 ALGORITHMS
We can now state the algorithm for generating an initial estimate.

Algorithm 3.2 Blind Deconvolution Initial Estimate
Input: Blind deconvolutional measurements y, K = |supp(f)|.
Output: Initial estimates of the underlying signal and blurring function.
1) Compute 𝒜*(y) and find the leading singular value, left and right singular vectors of 𝒜*(y), denoted by L, h̃₀, and x̃₀, respectively.
2) Solve the following optimization problem:

    u₀ := argmin_z ‖z − √L h̃₀‖₂,  subject to  √d ‖Bz‖_∞ ≤ 2√L μ,

and set v₀ = √L x̃₀.

Since we are dealing with complex variables, Wirtinger derivatives are utilized for the gradient descent. Since F̃ is a real-valued function, we only need to consider the derivatives of F̃ with respect to h̄ and x̄, and the corresponding updates of h and x, since

    ∂F̃/∂h̄ = \overline{∂F̃/∂h},  ∂F̃/∂x̄ = \overline{∂F̃/∂x}.

In particular, we denote

    ∇F̃_h := ∂F̃/∂h̄,  ∇F̃_x := ∂F̃/∂x̄.    (3.21)

We can now state the full algorithm.

Algorithm 3.3 Wirtinger Gradient Descent Blind Deconvolution Algorithm
Input: Blind deconvolutional measurements y, K = |supp(f)|.
Output: Estimates of the underlying signal and blurring function.
1) Compute 𝒜*(y) and find the leading singular value, left and right singular vectors of 𝒜*(y), denoted by L, h̃₀, and x̃₀, respectively.
2) Solve the following optimization problem:

    u₀ := argmin_z ‖z − √L h̃₀‖₂,  subject to  √d ‖Bz‖_∞ ≤ 2√L μ,

and set v₀ = √L x̃₀.
3) Compute Wirtinger gradient descent:
while halting criterion false do
  u_t = u_{t−1} − η ∇F̃_h(u_{t−1}, v_{t−1})
  v_t = v_{t−1} − η ∇F̃_x(u_{t−1}, v_{t−1})
end while
4) Set (h, x) = (u_t, v_t).

In [71], the authors show that, with a carefully chosen initial guess (u₀, v₀), running Wirtinger gradient descent to minimize F̃(h, x) guarantees linear convergence of the sequence (u_t, v_t) to the global minimum (h₀, x₀) in the noiseless case, and also provides robust recovery in the presence of noise. The results are summarized in the following two theorems.

3.4.6 MAIN THEOREMS

Theorem 3.4.4 (Main Theorem 1) ([71], Theorem 3.1). The initialization obtained via Algorithm 3.2 satisfies

    (u₀, v₀) ∈ (1/√3) N_{L₀} ∩ (1/√3) N_μ ∩ N_{(2/5)ε},  and  (9/10)L₀ ≤ L ≤ (11/10)L₀,

with probability at least 1 − d^{−γ} if the number of measurements is sufficiently large, that is,

    d ≥ C_γ (μ²_h + σ²) max{K, N} log² d / ε²,

where ε is a predetermined constant in (0, 1/15], and C_γ is a constant depending only linearly on γ, with γ ≥ 1.
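To illustrate the local convergence behavior that these theorems formalize, the sketch below runs plain (unregularized) Wirtinger gradient descent on the noiseless model, initialized inside the basin of attraction by perturbing the truth; the regularizer G and the spectral initialization of Algorithm 3.2 are omitted for brevity, and the stepsize is an ad hoc choice for this problem size.

```python
import numpy as np

rng = np.random.default_rng(5)
d, K, N = 256, 10, 10
B = np.fft.fft(np.eye(d))[:, :K] / np.sqrt(d)
A = (rng.standard_normal((d, N)) + 1j * rng.standard_normal((d, N))) / np.sqrt(2)
h0 = rng.standard_normal(K) + 1j * rng.standard_normal(K); h0 /= np.linalg.norm(h0)
x0 = rng.standard_normal(N) + 1j * rng.standard_normal(N); x0 /= np.linalg.norm(x0)
y = (B @ h0) * np.conj(A @ x0)              # noiseless y = A(h0 x0*)

# initialize inside the basin of attraction: a small perturbation of the truth
h = h0 + 0.1 * (rng.standard_normal(K) + 1j * rng.standard_normal(K))
x = x0 + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

def lifted_err(h, x):   # scale-invariant error via the lifted rank-one matrices
    return np.linalg.norm(np.outer(h, np.conj(x)) - np.outer(h0, np.conj(x0)))

eta, e0 = 0.02, lifted_err(h, x)
for t in range(3000):
    r = (B @ h) * np.conj(A @ x) - y                  # residual
    grad_h = B.conj().T @ (r * (A @ x))               # Wirtinger gradient in h-bar
    grad_x = A.conj().T @ (np.conj(r) * (B @ h))      # Wirtinger gradient in x-bar
    h, x = h - eta * grad_h, x - eta * grad_x
assert lifted_err(h, x) < 1e-6 * max(e0, 1)           # geometric convergence
```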
The following theorem establishes that, as long as the initial guess lies inside the basin of attraction of the true solution, regularized gradient descent will converge to this solution (or to a nearby solution in the case of noisy data).

Theorem 3.4.5 (Main Theorem 2) ([71], Theorem 3.2). Assume that the initialization (u₀, v₀) ∈ (1/√3)N_{L₀} ∩ (1/√3)N_μ ∩ N_{(2/5)ε}, and that d ≥ C_γ(μ² + σ²) max{K, N} log² d / ε². Then Algorithm 3.3 creates a sequence (u_t, v_t) ∈ N_{L₀} ∩ N_μ ∩ N_ε which converges geometrically to (h₀, x₀), in the sense that, with probability at least 1 − 4d^{−γ} − (1/γ)e^{−(K+N)}, we have

    max{ sin∠(u_t, h₀), sin∠(v_t, x₀) } ≤ (1/L_t) ( (2/3)(1 − ηω)^{t/2} ε L₀ + 50‖𝒜*(e)‖ ),

and

    |L_t − L₀| ≤ (2/3)(1 − ηω)^{t/2} ε L₀ + 50‖𝒜*(e)‖,

where L_t := ‖u_t‖·‖v_t‖, ω > 0, and η is the fixed stepsize. Here,

    ‖𝒜*(e)‖ ≤ C₀ σ L₀ max{ sqrt( (γ+1) max{K, N} log d / d ), ( (γ+1) sqrt(KN) log² d ) / d }

holds with probability 1 − d^{−γ}.

It has thus been shown that, with high probability, as long as the initial guess lies inside the basin of attraction of the true solution, Wirtinger gradient descent will converge towards the solution.

3.4.7 KEY CONDITIONS
Theorem 3.4.6 (Four Key Conditions).
(i) (Local RIP Condition) ([71], Condition 5.1) The following local Restricted Isometry Property (RIP) for 𝒜 holds uniformly for all (h, x) in the basin of attraction N_{L₀} ∩ N_μ ∩ N_ε:

    (3/4) ‖hx* − h₀x₀*‖²_F ≤ ‖𝒜(hx* − h₀x₀*)‖² ≤ (5/4) ‖hx* − h₀x₀*‖²_F.

(ii) (Robustness Condition) ([71], Condition 5.2) For the complex Gaussian noise e, with high probability,

    ‖𝒜*(e)‖ ≤ ε L₀ / (10√2),

for d sufficiently large, that is, d ≥ C_γ (σ²/ε² + σ/ε) max{K, N} log d.
(iii) (Local Regularity Condition) ([71], Condition 5.3) There exists a regularity constant ω = L₀/5000 > 0 such that

    ‖∇F̃(h, x)‖² ≥ ω [F̃(h, x) − c]_+,  c = ‖e‖² + 1700‖𝒜*(e)‖²,

for all (h, x) ∈ N_{L₀} ∩ N_μ ∩ N_ε.
(iv) (Local Smoothness Condition) ([71], Condition 5.4) Denote z := (h, x). There exists a constant C_d such that

    ‖∇f(z + tΔz) − ∇f(z)‖ ≤ C_d t ‖Δz‖,  0 ≤ t ≤ 1,

for all {(z, Δz) | z + tΔz ∈ N_ε ∩ N_{F̃}, ∀ 0 ≤ t ≤ 1}, i.e., the whole segment connecting z and z + Δz belongs to the non-convex set N_ε ∩ N_{F̃}.

3.5 BLIND PTYCHOGRAPHY
3.5.1 INTRODUCTION
A more recent area of study is blind ptychography, in which both the object and the mask are considered unknown, up to reasonable assumptions. The first successful recovery was given in [111, 110], with further study of the sufficient overlap in [10, 77, 76]; the area is summarized in [30].
Let x, m ∈ C^d denote the unknown sample and mask, respectively. We suppose that we have d² noisy ptychographic measurements of the form

    Y_{ℓ,k} = |(F(x ∘ S_k m))_ℓ|² + N_{ℓ,k},  (ℓ,k) ∈ [d]_0 × [d]_0,    (3.22)

where S_k, ∘, and F denote the kth circular shift, the Hadamard product, and the d-dimensional discrete Fourier transform, respectively, and N is the matrix of additive noise.
By Theorem 3.3.2, we can rewrite the measurements as

(Y^T F^T)_k = d · (x ∘ S_k x̄) ∗ (m̃ ∘ S_{−k} m̃̄) + (N^T F^T)_k, (3.23)

where the subscript k denotes the k-th column, ∗ denotes the d-dimensional discrete circular convolution, and m̃ denotes the reversal of m about its first entry. This is now a scaled blind deconvolution problem, which has been studied in [2], [71].

3.5.2 MAIN RESULTS

3.5.2.1 RECOVERING THE SAMPLE

To recover the sample, we will need to assume that x belongs to a known subspace. Initially we solve algorithmically for the zero-shift case (k = 0), and then generalize the method to an estimate which utilizes all of the obtained shifts. Our assumptions are as follows:
• x ∈ C^d unknown, with x = Cx′, C ∈ C^{d×N}, N ≪ d, known, and x′ ∈ C^N or R^N unknown;
• m ∈ C^d unknown, with supp(m) ⊆ [δ]₀, δ (= K) known, and ‖m‖₂ known;
• known noisy measurements Y.

Our first goal is to compute an estimate x_est of x, true up to a global phase. We will use this estimate to then produce an estimate m_est of m, again true up to a global phase.

Firstly, we let y be the first column of (1/√d) · F((FY)^T), and set f = m̃ ∘ m̃̄ (so that ‖f‖₂ is known). We next set g = x ∘ x̄, but to fully utilize the blind deconvolution algorithm we will need a lemma concerning Hadamard products of products of matrices. First, we define some products between matrices.

Definition 3.5.1. Let A = (A_{i,j}) ∈ C^{m×n} and B = (B_{i,j}) ∈ C^{p×q}. Then the Kronecker product A ⊗ B ∈ C^{mp×nq} is defined by

(A ⊗ B)_{p(i−1)+k, q(j−1)+ℓ} = A_{i,j} B_{k,ℓ}.

Definition 3.5.2. Let A ∈ C^{m×n} and B ∈ C^{p×n} have columns aᵢ, bᵢ for i ∈ [n]₀. Then the Khatri–Rao product A • B ∈ C^{mp×n} is defined by

A • B = [a₀ ⊗ b₀  a₁ ⊗ b₁  …  a_{n−1} ⊗ b_{n−1}]. (3.24)

Definition 3.5.3. Let A ∈ C^{m×n} and B ∈ C^{m×p} be matrices with rows aᵢ, bᵢ for i ∈ [m]₀. Then the transposed Khatri–Rao product (or face-splitting product), denoted ⊙, is the matrix whose rows are Kronecker products of the rows of A and B, i.e., the rows of A ⊙ B ∈ C^{m×np} are given by

(A ⊙ B)ᵢ = aᵢ ⊗ bᵢ, i ∈ [m]₀.

We then utilize the following lemma concerning these products.

Lemma 3.5.1 (Theorem 1, [101]). Let A ∈ C^{m×n}, B ∈ C^{n×p}, C ∈ C^{m×q}, D ∈ C^{q×p}. Then we have that

(AB) ∘ (CD) = (A ⊙ C)(B • D),

where ∘ is the Hadamard product, • is the Khatri–Rao product, and ⊙ is the transposed Khatri–Rao (face-splitting) product.

Thus, by Lemma 3.5.1 we have that g = x ∘ x̄ = (Cx′) ∘ (C̄x̄′) = C′x″, where C′ ∈ C^{d×N²} and x″ ∈ C^{N²} are given by

C′ = C ⊙ C̄, x″ = x′ • x̄′.

We now compute RRR blind deconvolution (Algorithm 3.3) with y, f, g, C′, and K = δ as above (with B taken to be the last K columns of the DFT matrix) to obtain an estimate of x′ • x̄′, use angular synchronization to solve for x′, and thus solve for x. A numerical check of the underlying product identity is given below, and the full zero-shift procedure is formalized in Algorithm 3.4.
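Lemma 3.5.1 is easy to verify numerically. The following Python/NumPy sketch implements the Khatri–Rao and face-splitting products of Definitions 3.5.2 and 3.5.3 (the helper function names are ours) and checks the identity on random matrices.

```python
import numpy as np

def khatri_rao(B, D):
    # Column-wise Kronecker product B . D: (n x p), (q x p) -> (nq x p).
    n, p = B.shape
    q, _ = D.shape
    return (B[:, None, :] * D[None, :, :]).reshape(n * q, p)

def face_split(A, C):
    # Row-wise Kronecker (transposed Khatri-Rao) product: (m x n), (m x q) -> (m x nq).
    m, n = A.shape
    _, q = C.shape
    return (A[:, :, None] * C[:, None, :]).reshape(m, n * q)

rng = np.random.default_rng(2)
m, n, p, q = 5, 4, 3, 6
A = rng.standard_normal((m, n)); B = rng.standard_normal((n, p))
C = rng.standard_normal((m, q)); D = rng.standard_normal((q, p))

lhs = (A @ B) * (C @ D)                     # (AB) o (CD)
rhs = face_split(A, C) @ khatri_rao(B, D)   # (A face-split C)(B Khatri-Rao D)
assert np.allclose(lhs, rhs)

# In the blind-ptychography setting, with B = x' and D = conj(x') as columns:
# g = (C x') o (conj(C) conj(x')) = face_split(C, conj(C)) @ khatri_rao(x', conj(x')).
```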
Algorithm 3.4 Blind Ptychography (Zero Shift)
Input:
1) x ∈ C^d unknown, x = Cx′, C ∈ C^{d×N}, N ≪ d known, x′ ∈ C^N or R^N unknown.
2) m ∈ C^d unknown, supp(m) ⊆ [δ]₀, δ known, ‖m‖₂ known.
3) Known noisy measurements Y.
Output: Estimate x_est of x, true up to a global phase.
1) Let y be the first column of (1/√d) · F((FY)^T), and f = m̃ ∘ m̃̄ (so ‖f‖₂ known).
2) Let g = x ∘ x̄ = (Cx′) ∘ (C̄x̄′). Then g = C′x″, where C′ ∈ C^{d×N²}, x″ ∈ C^{N²} are given by C′ = C ⊙ C̄, x″ = x′ • x̄′.
3) Compute RRR blind deconvolution (Algorithms 1 & 2, [71]) with y, f, g, C′, K = δ as above (B the last K columns of the DFT matrix) to obtain an estimate of x′ • x̄′.
4) Use angular synchronization to solve for x′, and thus compute x_est.

3.5.2.2 RECOVERING THE MASK

Once the estimate of x has been found, denoted x_est, we use this estimate to find m_est. We first compute g_est = x_est ∘ x̄_est, and then use point-wise division to find

F(m̃ ∘ m̃̄) = F⁻¹((FY)^T) / F(x_est ∘ x̄_est). (3.25)

We then use an inverse Fourier transform, a reversal, and angular synchronization, similar to the procedure used to obtain x_est.

Algorithm 3.5 Recovering The Mask
Input:
1) x_est generated by Algorithm 3.4.
2) Known noisy measurements Y.
3) supp(m) ⊆ [δ]₀, δ known, ‖m‖₂ known.
Output: Estimate m_est of m, true up to a global phase.
1) Compute g_est = x_est ∘ x̄_est and perform 2δ − 1 point-wise divisions to obtain

F(m̃_est ∘ S_{−k} m̃̄_est) = [F⁻¹((FY)^T)]_k / F(x_est ∘ S_k x̄_est). (3.26)

2) Compute inverse Fourier transforms to obtain m̃_est ∘ S_{−k} m̃̄_est, and use these to form the diagonals of a banded matrix.
3) Use angular synchronization to solve for m̃_est, and then perform a reversal to compute m_est.
4) Let α = ‖m_est‖₂ / ‖m‖₂. Finally, let x_est = α x_est, m_est = α⁻¹ m_est.

3.5.2.3 MULTIPLE SHIFTS

To generalize the setup, we let y^{(k)} denote the k-th column of (1/√d) · F((FY)^T), and set f^{(k)} = m̃ ∘ S_{−k} m̃̄. Let g^{(k)} = x ∘ S_k x̄ = (Cx′) ∘ (S_k C̄ x̄′). Then, by another application of Lemma 3.5.1, g^{(k)} = C′^{(k)} x″, where C′^{(k)} ∈ C^{d×N²} and x″ ∈ C^{N²} are given by

C′^{(k)} = C ⊙ S_k C̄, for 0 ≤ k ≤ δ − 1 and d − δ + 1 ≤ k ≤ d − 1, x″ = x′ • x̄′ = vec(x′(x′)*).

We then perform 2δ − 1 blind deconvolutions to obtain 2δ − 1 estimates each of x and m, labelled x_est^i and m_est^j respectively for i, j ∈ [2δ − 1]₀. Ideally, we would select the estimates which generate the minimum error for each of x and m, but that would require prior knowledge of x and m. Instead, we compute (2δ − 1)² estimates of the Fourier measurements via

(Y_est^{i,j})_{ℓ,k} = |(F(x_est^i ∘ S_k m_est^j))_ℓ|², i, j ∈ [2δ − 1]₀, (3.27)

and then compute the index pair minimizing the associated relative error,

(i′, j′) = argmin_{(i,j)} ‖Y_est^{i,j} − Y‖_F² / ‖Y‖_F², i, j ∈ [2δ − 1]₀. (3.28)

Then we let x_est = x_est^{i′} and m_est = m_est^{j′}, as sketched below; the full procedure is formalized in Algorithm 3.6.
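The selection step (3.27)–(3.28) admits a direct implementation. A Python/NumPy sketch follows (the function name is ours, and it assumes the same DFT and shift conventions used for (3.22) above).

```python
import numpy as np

def argmin_shift(x_cands, m_cands, Y):
    # Given the 2*delta - 1 candidate estimates of x and of m, choose the pair
    # (i', j') whose synthesized measurements (3.27) best match Y, per (3.28).
    d = Y.shape[0]
    best_pair, best_err = None, np.inf
    for i, xc in enumerate(x_cands):
        for j, mc in enumerate(m_cands):
            Yij = np.stack([np.abs(np.fft.fft(xc * np.roll(mc, k))) ** 2
                            for k in range(d)], axis=1)
            err = np.linalg.norm(Yij - Y) ** 2 / np.linalg.norm(Y) ** 2
            if err < best_err:
                best_pair, best_err = (i, j), err
    return best_pair, best_err
```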
Algorithm 3.6 Blind Ptychography (Multiple Shifts)
Input:
1) x ∈ C^d unknown, x = Cx′, C ∈ C^{d×N}, N ≪ d known, x′ ∈ C^N or R^N unknown.
2) m ∈ C^d unknown, supp(m) ⊆ [δ]₀, δ known, ‖m‖₂ known.
3) Known noisy measurements Y.
Output: Estimates x_est of x and m_est of m, true up to a global phase.
1) Let y^{(k)} denote the k-th column of (1/√d) · F((FY)^T), and f^{(k)} = m̃ ∘ S_{−k} m̃̄ (so ‖f^{(k)}‖₂ known).
2) Let g^{(k)} = x ∘ S_k x̄ = (Cx′) ∘ (S_k C̄ x̄′). Then g^{(k)} = C′^{(k)} x″, where C′^{(k)} ∈ C^{d×N²}, x″ ∈ C^{N²} are given by C′^{(k)} = C ⊙ S_k C̄ (for the 2δ − 1 relevant shifts k), x″ = x′ • x̄′.
3) Perform 2δ − 1 RRR blind deconvolutions (Algorithms 1 & 2, [71]) with y^{(k)}, f^{(k)}, g^{(k)}, C′^{(k)} as above to obtain 2δ − 1 estimates of x′ • x̄′.
4) Use angular synchronization to solve for 2δ − 1 estimates x′_est, and thus for the 2δ − 1 estimates x_est^i = Cx′_est, i ∈ [2δ − 1]₀.
5) Use these estimates x_est^i to compute 2δ − 1 estimates m_est^j, j ∈ [2δ − 1]₀.
6) Let α_i = ‖m_est^i‖₂ / ‖m‖₂, and for i ∈ [2δ − 1]₀ let x_est^i = α_i x_est^i, m_est^i = (1/α_i) m_est^i.
7) Compute the (2δ − 1)² estimates of the Fourier measurements given by (3.27), and the minimizing index pair (i′, j′) given by (3.28).
8) Let x_est = x_est^{i′}, m_est = m_est^{j′}.

3.5.3 NUMERICAL SIMULATIONS

All simulations were performed using MATLAB R2021b on an Intel desktop with a 2.60GHz i7-10750H CPU and 16GB DDR4 2933MHz memory. All code used to generate the figures below is publicly available at https://github.com/MarkPhilipRoach/BlindPtychography.

To be more precise, we have defined the immeasurable (in practice, since x and m are both unknown) estimates

Max Shift(x) = argmax_{x_est^i} ‖x − x_est^i‖₂², Max Shift(m) = argmax_{m_est^j} ‖m − m_est^j‖₂², i, j ∈ [2δ − 1]₀,

Min Shift(x) = argmin_{x_est^i} ‖x − x_est^i‖₂², Min Shift(m) = argmin_{m_est^j} ‖m − m_est^j‖₂², i, j ∈ [2δ − 1]₀,

and the measurable estimates. First, No Shift(x) and No Shift(m) refer to the zero-shift estimates outlined in Algorithm 3.4. Secondly, we have the estimates achieved by Algorithm 3.6,

(Argmin Shift(x), Argmin Shift(m)) = (x_est^{i′}, m_est^{j′}), (i′, j′) = argmin_{(i,j)} ‖Y_est^{i,j} − Y‖_F² / ‖Y‖_F², i, j ∈ [2δ − 1]₀.

Figure 3.4: d = 2⁶, K = δ = log₂ d, N = 4, C complex Gaussian. Max Shift refers to the maximum error achieved from a blind deconvolution of a particular shift; Min Shift refers to the minimum such error. Argmin Shift refers to the choice of object and mask made in Step 8 of Algorithm 3.6. Averaged over 100 simulations, 1000 iterations each.

Figure 3.4 demonstrates robust recovery under noise. It also demonstrates the impact of performing the 2δ − 1 blind deconvolutions and taking the Argmin Shift, versus simply taking the non-shifted object and mask.
It also demonstrates how close the reconstruction errors from Argmin Shift and Min Shift are, in particular for the mask. Figure 3.5 demonstrates the impact even more clearly, showing that with a larger known subspace dimension, Argmin Shift and Min Shift become more accurate, and that the gap between Max Shift and Min Shift is large.

Figure 3.5: d = 2⁶, K = δ = log₂ d, N = 6, C complex Gaussian. Max Shift refers to the maximum error achieved from a blind deconvolution of a particular shift; Min Shift refers to the minimum such error. Argmin Shift refers to the choice of object and mask made in Step 8 of Algorithm 3.6. Averaged over 100 simulations, 1000 iterations each.

The following figures demonstrate recovery against additional noise, with varying δ and N.

Figure 3.6: d = 2⁶, N = 4, C complex Gaussian. Application of Algorithm 3.6 with varying K = δ.

Figure 3.7: d = 2⁶, K = δ = 6, C complex Gaussian. Application of Algorithm 3.6 with varying N.

Next, we consider the frequency of the index chosen by the argmin step (Step 7 of Algorithm 3.6), compared to the true minimizing indices for the object and the mask separately. Firstly, we have the frequency of the argmin indices.

Figure 3.8: d = 2⁶, δ = 6, N = 4, C complex Gaussian. 1000 simulations. Frequency of each index being chosen to compute Argmin Shift(x) and Argmin Shift(m).

Secondly, we have the frequency of the min shift for both the object and the mask. Both Figure 3.8 and Figure 3.9 were computed on the same 1000 tests.

Figure 3.9: d = 2⁶, δ = 6, N = 4, C complex Gaussian. 1000 simulations. Frequency of each index being chosen to compute Min Shift(x) and Min Shift(m).

Finally, we plot these choices of indices for both the Argmin Shift and the Min Shift in a two-dimensional plot.

Figure 3.10: d = 2⁶, δ = 6, N = 4, C complex Gaussian. 1000 simulations. Frequency of index pairs being chosen to compute (Argmin Shift(x), Argmin Shift(m)) and (Min Shift(x), Min Shift(m)).

3.6 CONCLUSIONS AND FUTURE WORK

We have introduced an algorithm for recovering a specimen of interest from blind far-field ptychographic measurements. This algorithm relies on reformulating the measurements so that they resemble widely-studied blind deconvolution measurements. This leads to transposed Khatri–Rao product estimates of our specimen, which can then be recovered by angular synchronization. We then use these estimates, applying inverse Fourier transforms, point-wise division, and angular synchronization, to recover estimates of the mask. Finally, we use a best-error sorting step to find the final estimates of both the specimen and the mask.

As shown in the numerical results, Algorithm 3.6 recovers both the sample and the mask within a good margin of error, and it is stable under noise. A further goal for this research would be to adapt the existing recovery guarantee theorems for the selected blind deconvolution recovery algorithm to the case in which the assumed Gaussian matrix C is replaced with the Khatri–Rao-structured matrix C′^{(k)} = C ⊙ S_k C̄. In particular, this would mean providing alternate inequalities for the four key conditions laid out in Theorem 3.4.6.
CHAPTER 4
ON OUTER BI-LIPSCHITZ EXTENSIONS OF LINEAR JOHNSON-LINDENSTRAUSS EMBEDDINGS OF LOW-DIMENSIONAL SUBMANIFOLDS OF R^N

4.1 ABSTRACT

Let M be a compact d-dimensional submanifold of R^N with reach τ and volume V_M. Fix ε ∈ (0, 1). In this chapter, it is proven that a nonlinear function f : R^N → R^m exists with

m ≤ C (d/ε²) log( √d V_M^{1/d} / τ )

such that

(1 − ε)‖x − y‖₂ ≤ ‖f(x) − f(y)‖₂ ≤ (1 + ε)‖x − y‖₂ (4.1)

holds for all x ∈ M and y ∈ R^N. In effect, f not only serves as a bi-Lipschitz function from M into R^m with bi-Lipschitz constants close to one, but also approximately preserves all distances from points not in M to all points in M in its image. Furthermore, the proof is constructive and yields an algorithm which works well in practice. In particular, it is empirically demonstrated herein that such nonlinear functions allow for more accurate compressive nearest neighbor classification than standard linear Johnson–Lindenstrauss embeddings do in practice.

4.2 INTRODUCTION

The classical Kirszbraun theorem [63] ensures that a Lipschitz continuous function f : S → R^m from a subset S ⊂ R^N into R^m can always be extended to a function f̃ : R^N → R^m with the same Lipschitz constant as f. More recently, similar results have been proven for bi-Lipschitz functions f : S → R^m, from S ⊂ R^N into R^m, in the theoretical computer science literature. In particular, it was shown in [75] that outer extensions of such bi-Lipschitz functions f, f̃ : R^N → R^{m+1}, exist which both (i) approximately preserve f's bi-Lipschitz constants, and (ii) satisfy f̃(x) = (f(x), 0) for all x ∈ S. Narayanan and Nelson [81] then applied similar outer extension methods to a special class of the linear bi-Lipschitz maps guaranteed to exist for any given finite set S ⊂ R^N by the Johnson–Lindenstrauss (JL) lemma [60] in order to prove the following remarkable result: for each finite set S ⊂ R^N and ε ∈ (0, 1) there exists a terminal embedding of S, f : R^N → R^{O(log|S| / ε²)}, with the property that

(1 − ε)‖x − y‖₂ ≤ ‖f(x) − f(y)‖₂ ≤ (1 + ε)‖x − y‖₂ (4.2)

holds for all x ∈ S and all y ∈ R^N.

In this chapter, we generalize Narayanan and Nelson's theorem for finite sets to also hold for infinite subsets S ⊂ R^N, and then give a specialized variant for the case where the infinite subset S ⊂ R^N in question is a compact and smooth submanifold of R^N. As we shall see below, generalizing this result requires us both to alter the bi-Lipschitz extension methods of [75] and to replace the use of embedding techniques utilizing cardinality in [81] with different JL-type embedding methods involving alternate measures of set complexity which remain meaningful for infinite sets (i.e., the Gaussian width of the unit secants of the set S in question). In the special case where S is a submanifold of R^N, recent results bounding the Gaussian widths of the unit secants of such sets in terms of other fundamental geometric quantities (e.g., their reach, dimension, volume, etc.) [53] can then be brought to bear in order to produce terminal manifold embeddings of S into R^m satisfying (4.2) with m near-optimally small.
Note that a non-trivial terminal embedding, f, of S satisfying (4.2) for all x ∈ S and y ∈ R^N must be nonlinear. In contrast, prior work on bi-Lipschitz maps of submanifolds of R^N into lower-dimensional Euclidean space in the mathematical data science literature has all utilized linear maps (see, e.g., [5, 27, 53]). As a result, it is impossible for such previously considered linear maps to serve as terminal embeddings of submanifolds of R^N into lower-dimensional Euclidean space without substantial modification. Another way of viewing the work carried out herein is that it constructs outer bi-Lipschitz extensions of such prior linear JL embeddings of manifolds in a way that effectively preserves their near-optimal embedding dimension in the final resulting extension. Motivating applications of terminal embeddings of submanifolds of R^N related to compressive classification via manifold models [21] are discussed next.

4.2.1 UNIVERSALLY ACCURATE COMPRESSIVE CLASSIFICATION VIA NOISY MANIFOLD DATA

It is one of the sad facts of life that most everyone eventually comes to accept: everything living must eventually die, you can't always win, you aren't always right, and, worst of all to the most dedicated of data scientists, there is always noise contaminating your datasets. Nevertheless, there are mitigating circumstances and achievable victories implicit in every statement above. Most pertinently here, there are mountains of empirical evidence that noisy training data still permits accurate learning. In particular, when the noise level is not too large, the mere existence of a low-dimensional data model which only approximately fits your noisy training data can still allow for successful, e.g., nearest-neighbor classification using only a highly compressed version of your original training dataset (even when you know very little about the model specifics) [21]. Better quantifying these empirical observations in the context of low-dimensional manifold models is the primary motivation for our main result below.

For example, let M ⊂ R^N be a d-dimensional submanifold of R^N (our data model), fix δ ∈ R⁺ (our effective noise level), and choose

T ⊆ tube(δ, M) := { x | ∃ y ∈ M with ‖x − y‖₂ ≤ δ }

(our "noisy" and potentially high-dimensional training data). Fix ε ∈ (0, 1). For a terminal embedding f : R^N → R^m of M as per (4.2), one can see that

(1 − ε)‖z − t‖₂ − 2(1 − ε)δ ≤ ‖f(z) − f(t)‖₂ ≤ (1 + ε)‖z − t‖₂ + 2(1 + ε)δ (4.3)

will hold simultaneously for all z ∈ R^N and t ∈ T, where f has an embedding dimension that only depends on the geometric properties of M (and not necessarily on |T|). (One can prove (4.3) by comparing both z and t to a point x_t ∈ M satisfying ‖t − x_t‖₂ ≤ δ via several applications of the (reverse) triangle inequality.) Thus, if T includes a sufficiently dense external cover of M, then f will allow us to approximate the distance of all z ∈ R^N to M in the compressed embedding space via the estimator

d̃(f(z), f(T)) := inf_{t∈T} ‖f(z) − f(t)‖₂ ≈ d(z, M) := inf_{y∈M} ‖z − y‖₂ (4.4)

up to O(δ)-error.
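Once an embedding f is available, the compressed estimator (4.4) and the associated nearest-neighbor classifier are straightforward to express in code. A minimal Python/NumPy sketch follows (the function names are ours, and f may be any terminal embedding, e.g., one computed as in Algorithm 4.1 later in this chapter).

```python
import numpy as np

def compressed_distance(fz, fT):
    # d~(f(z), f(T)) from (4.4): fz is the embedded query f(z), and the rows of
    # fT are the embedded training points f(t) for t in T.
    return np.min(np.linalg.norm(fT - fz[None, :], axis=1))

def compressed_nn_label(fz, fT, labels):
    # Compressive nearest-neighbor classification: the label of the embedded
    # training point closest to f(z).
    return labels[np.argmin(np.linalg.norm(fT - fz[None, :], axis=1))]
```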
As a result, if one has noisy data from two disjoint manifolds M₁, M₂ ⊂ R^N, one can use this compressed estimator d̃ to correctly classify all data z ∈ tube(δ, M₁) ∪ tube(δ, M₂) as being in either T₁ := tube(δ, M₁) (class 1) or T₂ := tube(δ, M₂) (class 2), as long as inf_{x∈T₁, y∈T₂} ‖x − y‖₂ is sufficiently large. In short, terminal manifold embeddings demonstrate that accurate compressive nearest-neighbor classification based on noisy manifold training data is always possible as long as the manifolds in question are sufficiently far apart (though not necessarily separable from one another by, e.g., a hyperplane).

Note that in the discussion above we may in fact take T = tube(δ, M). In that case (4.3) will hold simultaneously for all z ∈ R^N and (t, δ) ∈ R^N × R⁺ with t ∈ tube(δ, M), so that f : R^N → R^m will approximately preserve the distances of all points z ∈ R^N to tube(δ, M) up to errors on the order of O(ε) d(z, tube(δ, M)) + O(δ) for all δ ∈ R⁺. This is in fact rather remarkable when one recalls that the best achievable embedding dimension, m, here only depends on the geometric properties of the low-dimensional manifold M (see Theorem 4.2.1 for a detailed accounting of these dependences).

We further note that alternate applications of Theorem 4.4.2 (on which Theorem 4.2.1 depends) involving other data models are also possible. As a more explicit second example, suppose that M is a union of n d-dimensional affine subspaces, so that its unit secants S_M, defined as per (4.7), are contained in the union of at most (n choose 2) + n unit spheres ⊂ S^{N−1}, each of dimension at most 2d + 1. The Gaussian width (see Definition 4.3.1) of S_M can then be upper-bounded by C√(d + log n) using standard techniques, where C ∈ R⁺ is an absolute constant. An application of Theorem 4.4.2 now guarantees the existence of a terminal embedding f : R^N → R^{O((d + log n)/ε²)} which will allow approximate nearest subspace queries to be answered for any input point z ∈ R^N using only f(z) in the compressed O((d + log n)/ε²)-dimensional space. Even more specifically, if we choose, e.g., M to consist of all at-most-s-sparse vectors in R^N (i.e., so that M is the union of n = (N choose s) subspaces of R^N), we can now see that Theorem 4.4.2 guarantees the existence of a deterministic compressed estimator (4.4) which allows for the accurate approximation of the best s-term approximation error inf_{y s-sparse} ‖z − y‖₂ for all z ∈ R^N, using only f(z) ∈ R^{O(s log(N/s))} as input. Note that this is only possible due to the non-linearity of f herein. In, e.g., the setting of classical compressive sensing theory, where f must be linear, it is known that such good performance is impossible [20, Section 5].

4.2.2 THE MAIN RESULT AND A BRIEF OUTLINE OF ITS PROOF

The following theorem is proven in Section 4.5. Given a low-dimensional submanifold M of R^N, it establishes the existence of a function f : R^N → R^m with m ≪ N that approximately preserves the Euclidean distances from all points in R^N to all points in M.
As a result, it guarantees the existence of a low-dimensional embedding which will, e.g., always allow for the correct compressed nearest-neighbor classification of images living near different well-separated submanifolds of Euclidean space.

Theorem 4.2.1 (The Main Result). Let M ↪ R^N be a compact d-dimensional submanifold of R^N with boundary ∂M, finite reach τ_M (see Definition 4.3.2), and volume V_M. Enumerate the connected components of ∂M and let τᵢ be the reach of the i-th connected component of ∂M as a submanifold of R^N. Set τ := minᵢ {τ_M, τᵢ}, let V_∂M be the volume of ∂M, and denote the volume of the d-dimensional Euclidean ball of radius 1 by ω_d. Next,

1. if d = 1, define α_M := 20 V_M / τ + V_∂M, else
2. if d ≥ 2, define α_M := (V_M / ω_d)(41/τ)^d + (V_∂M / ω_{d−1})(81/τ)^{d−1}.

Finally, fix ε ∈ (0, 1) and define

β_M := (α_M² d + 3) α_M. (4.5)

Then, there exists a map f : R^N → C^m with m ≤ c (ln(β_M) + 4d) / ε² that satisfies

| ‖f(x) − f(y)‖₂² − ‖x − y‖₂² | ≤ ε ‖x − y‖₂² (4.6)

for all x ∈ M and y ∈ R^N. Here c ∈ R⁺ is an absolute constant independent of all other quantities.

Proof. See Section 4.5.

The remainder of the chapter is organized as follows. In Section 4.3 we review notation and state a result from [53] that bounds the Gaussian width of the unit secants of a given submanifold of R^N in terms of geometric quantities of the original submanifold. Next, in Section 4.4 we prove an optimal terminal embedding result for arbitrary subsets of R^N in terms of the Gaussian widths of their unit secants by generalizing results from the computer science literature concerning finite sets [75, 81]; see Theorem 4.4.2 therein. We then combine the results from Sections 4.3 and 4.4 in order to prove our main theorem in Section 4.5. Finally, in Section 4.6 we conclude by demonstrating that terminal embeddings allow for more accurate compressive nearest neighbor classification than standard linear embeddings in practice.

4.3 NOTATION AND PRELIMINARIES

Below, B^N_{ℓ₂}(x, γ) will denote the open Euclidean ball around x of radius γ in R^N. Given an arbitrary subset S ⊂ R^N, we will further define −S := {−x | x ∈ S} and S ± S := {x ± y | x, y ∈ S}. For a given T ⊂ R^N we will also let T̄ denote its closure, and further define the normalization operator U : R^N \ {0} → S^{N−1} by U(x) := x/‖x‖₂. With this notation in hand, we can then define the unit secants of T ⊂ R^N to be

S_T := closure( U((T − T) \ {0}) ) = closure( { (x − y)/‖x − y‖₂ | x, y ∈ T, x ≠ y } ). (4.7)

Note that S_T is always a compact subset of the unit sphere S^{N−1} ⊂ R^N, and that S_T = −S_T.

Herein we will call a matrix A ∈ C^{m×N} an ε-JL map of a set T ⊂ R^N into C^m if

(1 − ε)‖x‖₂² ≤ ‖Ax‖₂² ≤ (1 + ε)‖x‖₂²

holds for all x ∈ T. Note that this is equivalent to A ∈ C^{m×N} having the property that

sup_{x∈T\{0}} | ‖A(x/‖x‖₂)‖₂² − 1 | = sup_{x∈U(T)} | ‖Ax‖₂² − 1 | ≤ ε, (4.8)

where U(T) ⊂ R^N is the normalized version of T \ {0} ⊂ R^N defined as above.
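For a finite set T, both the unit secants (4.7) and the empirical distortion (4.8) of a candidate map can be computed exactly. The following Python/NumPy sketch, with illustrative sizes of our own choosing, checks how good an ε-JL embedding a rescaled Gaussian matrix is for a random point set.

```python
import numpy as np

rng = np.random.default_rng(3)
N, m, n = 100, 40, 200                      # ambient dim, target dim, #points (illustrative)

T = rng.standard_normal((n, N))             # a finite set T, one point per row
diffs = (T[:, None, :] - T[None, :, :]).reshape(-1, N)
diffs = diffs[np.linalg.norm(diffs, axis=1) > 1e-12]
S_T = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)   # unit secants (4.7)

Phi = rng.standard_normal((m, N)) / np.sqrt(m)               # rescaled Gaussian map
# Empirical distortion sup_{x in U((T-T)\{0})} | ||Phi x||_2^2 - 1 |, as in (4.8):
eps_emp = np.max(np.abs(np.sum((S_T @ Phi.T) ** 2, axis=1) - 1.0))
print(f"Phi is (empirically) an {eps_emp:.3f}-JL embedding of T")
```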
Furthermore, we will say that a matrix A ∈ C^{m×N} is an ε-JL embedding of a set T ⊂ R^N into C^m if A is an ε-JL map of T − T := {x − y | x, y ∈ T} into C^m. Here we will be working with random matrices which embed any fixed set T of bounded size (measured with respect to, e.g., Gaussian width [117]) with high probability. Such matrix distributions are often called oblivious, and are discussed as randomized embeddings in the absence of any specific set T, since their embedding quality can be determined independently of any properties of a given set T beyond its size. In particular, the class of oblivious sub-Gaussian random matrices having independent, isotropic, and sub-Gaussian rows will receive special attention below.

4.3.1 SOME COMMON MEASURES OF SET SIZE AND COMPLEXITY WITH ASSOCIATED BOUNDS

We will denote the cardinality of a finite set T by |T|. For a (potentially infinite) set T ⊂ R^N we define its radius and diameter to be

rad(T) := sup_{x∈T} ‖x‖₂ and diam(T) := rad(T − T) = sup_{x,y∈T} ‖x − y‖₂,

respectively. Given a value δ ∈ R⁺, a δ-cover of T (also sometimes called a δ-net of T) will be a subset S ⊂ T such that

∀ x ∈ T, ∃ y ∈ S such that ‖x − y‖₂ ≤ δ.

The δ-covering number of T, denoted by N(T, δ) ∈ N, is then the smallest achievable cardinality of a δ-cover of T. Finally, the Gaussian width of a set T is defined as follows.

Definition 4.3.1 (Gaussian Width [117, Definition 7.5.1]). The Gaussian width of a set T ⊂ R^N is

w(T) := E sup_{x∈T} ⟨g, x⟩,

where g is a random vector with N independent and identically distributed (i.i.d.) mean 0 and variance 1 Gaussian entries.

For a list of useful properties of the Gaussian width we refer the reader to [117, Proposition 7.5.2].
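The Gaussian width of Definition 4.3.1 is easy to estimate by Monte Carlo for any finite point set (or finite approximation of a set); a short Python/NumPy sketch of such an estimator (the function name and trial count are ours) is:

```python
import numpy as np

def gaussian_width_mc(T, trials=2000, seed=0):
    # Monte Carlo estimate of w(T) = E sup_{x in T} <g, x> (Definition 4.3.1),
    # where the rows of T are the points of a finite set in R^N.
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((trials, T.shape[1]))
    return np.mean(np.max(G @ T.T, axis=1))

# Sanity check: sampling many random unit vectors approximates the sphere
# S^{N-1}, whose Gaussian width is E||g||_2, roughly sqrt(N).
```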
Finally, reach is an extrinsic parameter of a subset S of Euclidean space, defined based on how far away points can be from S while still having a unique closest point in S [32, 113]. The following formal definition of reach utilizes the Euclidean distance d between a given point x ∈ R^N and a subset S ⊂ R^N.

Definition 4.3.2 (Reach [32, Definition 4.1]). For a subset S ⊂ R^N of Euclidean space, the reach τ_S is

τ_S := sup { t ≥ 0 | ∀ x ∈ R^N such that d(x, S) < t, x has a unique closest point in S }.

The following theorem is a restatement of Theorem 20 in [53]. It bounds the Gaussian width of the unit secants of a smooth submanifold of R^N in terms of its dimension, reach, and volume.

Theorem 4.3.1 (Gaussian Width of the Unit Secants of a Submanifold of R^N, Potentially with Boundary). Let M ↪ R^N be a compact d-dimensional submanifold of R^N with boundary ∂M, finite reach τ_M, and volume V_M. Enumerate the connected components of ∂M and let τᵢ be the reach of the i-th connected component of ∂M as a submanifold of R^N. Set τ := minᵢ {τ_M, τᵢ}, let V_∂M be the volume of ∂M, and denote the volume of the d-dimensional Euclidean ball of radius 1 by ω_d. Next,

1. if d = 1, define α_M := 20 V_M / τ + V_∂M, else
2. if d ≥ 2, define α_M := (V_M / ω_d)(41/τ)^d + (V_∂M / ω_{d−1})(81/τ)^{d−1}.

Finally, define

β_M := (α_M² d + 3) α_M. (4.9)

Then, the Gaussian width of U((M − M) \ {0}) satisfies

w(S_M) = w( U((M − M) \ {0}) ) ≤ 8√2 √( ln(β_M) + 4d ).

With this Gaussian width bound in hand we can now begin the proof of our main result. The approach will be to combine Theorem 4.3.1 above with general theorems concerning the existence of outer bi-Lipschitz extensions of ε-JL embeddings of arbitrary subsets of R^N into lower-dimensional Euclidean space. These general existence theorems are proven in the next section.

4.4 THE MAIN BI-LIPSCHITZ EXTENSION RESULTS AND THEIR PROOFS

Our first main technical result guarantees that any given JL map Φ of a special subset of S^{N−1} related to M will not only be a bi-Lipschitz map from M ⊂ R^N into a lower-dimensional Euclidean space R^m, but will also have an outer bi-Lipschitz extension into R^{m+1}. It is useful as a means of extending particular (structured) JL maps Φ of special interest in the context of, e.g., saving on memory costs [52].

Theorem 4.4.1. Let M ⊂ R^N, ε ∈ (0, 1), and suppose that Φ ∈ C^{m×N} is an (ε²/2304)-JL map of S_M + S_M into C^m. Then, there exists an outer bi-Lipschitz extension of Φ : M → C^m, f : R^N → C^{m+1}, with the property that

| ‖f(x) − f(y)‖₂² − ‖x − y‖₂² | ≤ ε ‖x − y‖₂²

holds for all x ∈ M and y ∈ R^N.

Proof. See Section 4.4.4.

Looking at Theorem 4.4.1, we can see that an (ε²/2304)-JL map of S_M + S_M is required in order to achieve the outer extension f of interest. This result is sub-optimal in two respects. First, the constant factor 1/2304 is certainly not tight and can likely be improved substantially. More important, though, is the fact that ε is squared in the required map distortion, which means that the terminal embedding dimension, m + 1, will have to scale sub-optimally in ε (see Remark 4.4.1 below for details). Unfortunately, this is impossible to rectify when extending arbitrary maps Φ (see, e.g., [75]). For sub-Gaussian Φ an improvement is in fact possible, however, and this is the subject of our second main technical result just below. Using specialized theory for sub-Gaussian matrices, it demonstrates the existence of terminal JL embeddings for arbitrary subsets of R^N which achieve an optimal terminal embedding dimension up to constants.

Theorem 4.4.2. Let M ⊂ R^N and ε ∈ (0, 1). There exists a map f : R^N → C^m with m ≤ c (w(S_M)/ε)² that satisfies

| ‖f(x) − f(y)‖₂² − ‖x − y‖₂² | ≤ ε ‖x − y‖₂² (4.10)

for all x ∈ M and y ∈ R^N. Here c ∈ R⁺ is an absolute constant independent of all other quantities.

Proof. See Section 4.4.5.

To see the optimality of the terminal embedding dimension m provided by Theorem 4.4.2, we note that the embedding dimension of any function f which satisfies (4.10) for all x, y ∈ M must in fact generally scale quadratically in both w(S_M) and 1/ε (see [51, Theorem 7] and [67]).

We will now begin proving supporting results for both of the main technical theorems above. The first supporting results pertain to the so-called convex hull distortion of a given linear ε-JL map.

4.4.1 ALL LINEAR ε-JL MAPS PROVIDE O(√ε)–CONVEX HULL DISTORTION

A crucial component involved in proving our main results is the approximate norm preservation of all points in the convex hull of a given bounded set S ⊂ R^N.
Recall that the convex hull of S ⊂ C^N is

conv(S) := ⋃_{j=1}^{∞} { Σ_{ℓ=1}^{j} α_ℓ x_ℓ | x₁, …, x_j ∈ S, α₁, …, α_j ∈ [0, 1] s.t. Σ_{ℓ=1}^{j} α_ℓ = 1 }.

The next theorem states that each point in the convex hull of S ⊂ R^N can be expressed as a convex combination of at most N + 1 points from S. Hence, the convex hulls of subsets of R^N are actually a bit simpler than they first appear.

Theorem 4.4.3 (Carathéodory; see, e.g., [11]). Given S ⊂ R^N, for all x ∈ conv(S) there exist y₁, …, y_Ñ, with Ñ = min(|S|, N + 1), such that x = Σ_{ℓ=1}^{Ñ} α_ℓ y_ℓ for some α₁, …, α_Ñ ∈ [0, 1] with Σ_{ℓ=1}^{Ñ} α_ℓ = 1.

Finally, we say that a matrix Φ ∈ C^{m×N} provides ε-convex hull distortion for S ⊂ R^N if

| ‖Φx‖₂ − ‖x‖₂ | ≤ ε

holds for all x ∈ conv(S). The main result of this subsection states that linear ε-JL maps can provide ε-convex hull distortion for the unit secants of any given set. In particular, we have the following theorem, which generalizes arguments in [75] for finite sets to arbitrary and potentially infinite sets.

Theorem 4.4.4. Let M ⊂ R^N, ε ∈ (0, 1), and suppose that Φ ∈ C^{m×N} is an (ε²/4)-JL map of S_M + S_M into C^m. Then Φ will also provide ε-convex hull distortion for S_M.

The proof of Theorem 4.4.4 depends on two intermediate lemmas. The first lemma is a slight modification of Lemma 3 in [52].

Lemma 4.4.1. Let S ⊂ R^N and ε ∈ (0, 1). Then, an ε-JL map Φ ∈ C^{m×N} of the set

S′ = { x/‖x‖₂ + y/‖y‖₂, x/‖x‖₂ − y/‖y‖₂ | x, y ∈ S }

will satisfy

|ℜ(⟨Φx, Φy⟩) − ⟨x, y⟩| ≤ 2ε ‖x‖₂ ‖y‖₂ for all x, y ∈ S.

Proof. If x = 0 or y = 0 the inequality holds trivially. Thus, suppose x, y ≠ 0, and consider the normalizations u = x/‖x‖₂, v = y/‖y‖₂. The polarization identities for complex/real inner products imply that

|ℜ(⟨Φu, Φv⟩) − ⟨u, v⟩| = | (1/4) ℜ( Σ_{ℓ=0}^{3} i^ℓ ‖Φu + i^ℓ Φv‖₂² ) − (1/4)( ‖u + v‖₂² − ‖u − v‖₂² ) |
= (1/4) | ( ‖Φu + Φv‖₂² − ‖Φu − Φv‖₂² ) − ( ‖u + v‖₂² − ‖u − v‖₂² ) |
≤ (1/4) ( | ‖Φu + Φv‖₂² − ‖u + v‖₂² | + | ‖Φu − Φv‖₂² − ‖u − v‖₂² | )
≤ (ε/4) ( ‖u + v‖₂² + ‖u − v‖₂² ) ≤ (ε/2) (‖u‖₂ + ‖v‖₂)² ≤ 2ε.

The result now follows by multiplying the inequality through by ‖x‖₂‖y‖₂.

Next, we see that linear ε-JL maps are capable of preserving the angles between the elements of the convex hull of any bounded subset S ⊂ R^N.

Lemma 4.4.2. Suppose S ⊂ B^N_{ℓ₂}(0, γ) and ε ∈ (0, 1). Let Φ ∈ C^{m×N} be an (ε/(2γ²))-JL map of the set S′, defined as in Lemma 4.4.1, into C^m. Then

|ℜ(⟨Φx, Φy⟩) − ⟨x, y⟩| ≤ ε holds for all x, y ∈ conv(S).

Proof. Let x, y ∈ conv(S). By Theorem 4.4.3, there exist {x_ℓ}_{ℓ=1}^{Ñ}, {y_ℓ}_{ℓ=1}^{Ñ} ⊂ S and {α_ℓ}_{ℓ=1}^{Ñ}, {β_ℓ}_{ℓ=1}^{Ñ} ⊂ [0, 1] with Σ_{ℓ=1}^{Ñ} α_ℓ = Σ_{ℓ=1}^{Ñ} β_ℓ = 1 such that

x = Σ_{ℓ=1}^{Ñ} α_ℓ x_ℓ, and y = Σ_{ℓ=1}^{Ñ} β_ℓ y_ℓ.
Hence, by Lemma 4.4.1 we have that

|ℜ(⟨Φx, Φy⟩) − ⟨x, y⟩| = | Σ_{ℓ=1}^{Ñ} Σ_{j=1}^{Ñ} α_ℓ β_j ( ℜ(⟨Φx_ℓ, Φy_j⟩) − ⟨x_ℓ, y_j⟩ ) |
≤ 2 Σ_{ℓ=1}^{Ñ} Σ_{j=1}^{Ñ} α_ℓ β_j (ε/(2γ²)) ‖x_ℓ‖₂ ‖y_j‖₂
≤ ε ( Σ_{ℓ=1}^{Ñ} α_ℓ )( Σ_{j=1}^{Ñ} β_j ) = ε.

Here we have also used the mapping error ε/(2γ²) and the fact that all norms of vectors in this case will be at most γ.

We are now prepared to prove Theorem 4.4.4.

4.4.1.1 PROOF OF THEOREM 4.4.4

Applying Lemma 4.4.2 with S = S_M = S_M ∪ −S_M, we note that S′ = S_M + S_M = (S_M ∪ −S_M) + (S_M ∪ −S_M) since S ⊂ S^{N−1}. Furthermore, γ = 1 in this case. Hence, Φ ∈ C^{m×N} being an (ε²/4)-JL map of S_M + S_M into C^m implies that

|ℜ(⟨Φx, Φy⟩) − ⟨x, y⟩| ≤ ε²/2 (4.11)

holds for all x, y ∈ conv(S_M) ⊂ B^N_{ℓ₂}(0, 1). In particular, (4.11) with x = y implies that

| ‖Φx‖₂ − ‖x‖₂ | · | ‖Φx‖₂ + ‖x‖₂ | = | ‖Φx‖₂² − ‖x‖₂² | ≤ ε²/2.

Noting that |‖Φx‖₂ + ‖x‖₂| ≥ ‖x‖₂, we can see that the desired result holds automatically if ‖x‖₂ ≥ ε/2. Thus, it suffices to assume that ‖x‖₂ < ε/2, but then we are also finished since

| ‖Φx‖₂ − ‖x‖₂ | ≤ max{ ‖x‖₂, ‖Φx‖₂ } ≤ √( ‖x‖₂² + ε²/2 ) < (√3/2) ε

will hold in that case.

Remark 4.4.1. Though Theorem 4.4.4 holds for arbitrary linear maps, we note that it has sub-optimal dependence on the distortion parameter ε. In particular, a linear (ε²/4)-JL map of an arbitrary set will generally embed that set into C^m with m = Ω(1/ε⁴) [67]. However, it has been shown in [81] that sub-Gaussian matrices will behave better with high probability, allowing for outer bi-Lipschitz extensions of JL-embeddings of finite sets into R^m with m = O(1/ε²). In the next subsection we generalize these better-scaling results for sub-Gaussian random matrices to (potentially) infinite sets.

4.4.2 SUB-GAUSSIAN MATRICES AND ε-CONVEX HULL DISTORTION FOR INFINITE SETS

Motivated by results in [81] for finite sets which achieve optimal dependence on the distortion parameter ε for sub-Gaussian matrices, in this section we will do the same for infinite sets using results from [117]. Our main tool will be the following result (see also [53, Theorem 4]).

Theorem 4.4.5 (See Theorem 9.1.1 and Exercise 9.1.8 in [117]). Let Φ be an m × N matrix whose rows are independent, isotropic, and sub-Gaussian random vectors in R^N. Let p ∈ (0, 1) and S ⊂ R^N. Then there exists a constant c depending only on the distribution of the rows of Φ such that

sup_{x∈S} | ‖Φx‖₂ − √m ‖x‖₂ | ≤ c [ w(S) + √(ln(2/p)) · rad(S) ]

holds with probability at least 1 − p.

The main result of this section is a simple consequence of Theorem 4.4.5 together with standard results concerning Gaussian widths [117, Proposition 7.5.2].

Corollary 4.4.1. Let M ⊂ R^N, ε, p ∈ (0, 1), and let Φ ∈ R^{m×N} be an m × N matrix whose rows are independent, isotropic, and sub-Gaussian random vectors in R^N. Furthermore, suppose that

m ≥ (c′/ε²) ( w(S_M) + √(ln(2/p)) )²,

where c′ is a constant depending only on the distribution of the rows of Φ.
Then, with probability at least 1 − p, the random matrix (1/√m)Φ will simultaneously be both an ε-JL embedding of M into R^m and also provide ε-convex hull distortion for S_M.

Proof. We apply Theorem 4.4.5 to S = conv(S_M). In doing so we note that w(conv(S_M)) = w(S_M) [117, Proposition 7.5.2], and that rad(conv(S_M)) = 1 since conv(S_M) ⊆ B^N_{ℓ₂}(0, 1). The result is that (1/√m)Φ provides ε-convex hull distortion for S_M as long as c′ ≥ c². Next, we note that providing ε-convex hull distortion for S_M implies that (1/√m)Φ will also approximately preserve the ℓ₂-norms of all the unit vectors in S_M ⊂ conv(S_M). In particular, (1/√m)Φ will be a 3ε-JL map of S_M into R^m, which in turn implies that (1/√m)Φ will also be a 3ε-JL embedding of M − M into R^m by linearity/rescaling. Adjusting the constant c′ to account for the additional factor of 3 now yields the stated result.

We are now prepared to prove our general theorems regarding outer bi-Lipschitz extensions of JL-embeddings of potentially infinite sets.

4.4.3 OUTER BI-LIPSCHITZ EXTENSION RESULTS FOR JL EMBEDDINGS OF GENERAL SETS

Before we can prove our final results for general sets we will need two supporting lemmas. They are adapted from the proofs of analogous results in [75, 81] for finite sets.

Lemma 4.4.3. Let M ⊂ R^N, ε ∈ (0, 1), and suppose that Φ ∈ C^{m×N} provides ε-convex hull distortion for S_M. Then, there exists a function g : R^N → C^m such that

|ℜ(⟨g(y), Φx⟩) − ⟨y, x⟩| ≤ 2ε ‖y‖₂ ‖x‖₂ (4.12)

holds for all x ∈ M − M and y ∈ R^N.

Proof. First, we note that (4.12) holds trivially for y = 0 as long as g(0) = 0. Thus, it suffices to consider nonzero y. Second, we claim that it suffices to prove the existence of a function g : R^N → C^m that satisfies both of the following properties for all y ∈ R^N:

1. ‖g(y)‖₂ ≤ ‖y‖₂, and
2. |ℜ(⟨g(y), Φx′⟩) − ⟨y, x′⟩| ≤ ε ‖y‖₂ for all x′ in a finite (ε / (2 max{1, ‖Φ‖_{2→2}}))-cover C of S_M.

To see why, fix y ≠ 0, x ∈ S_M, and let x′ ∈ C ⊂ S_M satisfy ‖x − x′‖₂ ≤ ε / (2 max{1, ‖Φ‖_{2→2}}). We can see that any function g satisfying both of the properties above will have

|ℜ(⟨g(y), Φx⟩) − ⟨y, x⟩| = |ℜ(⟨g(y), Φx′⟩) + ℜ(⟨g(y), Φ(x − x′)⟩) − ⟨y, x − x′⟩ − ⟨y, x′⟩|
≤ |ℜ(⟨g(y), Φx′⟩) − ⟨y, x′⟩| + |⟨g(y), Φ(x − x′)⟩| + |⟨y, x − x′⟩|
≤ ε ‖y‖₂ + ‖g(y)‖₂ ‖Φ‖_{2→2} ‖x − x′‖₂ + ‖y‖₂ ‖x − x′‖₂,

where the second property was used in the last inequality above. Appealing to the first property above, we can now also see that |ℜ(⟨g(y), Φx⟩) − ⟨y, x⟩| ≤ 2ε ‖y‖₂ will hold. Finally, as a consequence of the definition of S_M, we therefore have that (4.12) will hold for all x ∈ M − M and y ∈ R^N whenever Properties 1 and 2 hold above. (Showing that (4.12) holds for all x ∈ M − M more generally can be proven by contradiction using a limiting argument, combined with the fact that both the right- and left-hand sides of (4.12) are continuous in x for fixed y.) Hence, we have reduced the proof to constructing a function g that satisfies both Properties 1 and 2 above.
Let

g(y) := argmin_{v ∈ B^{2m}_{ℓ₂}(0, ‖y‖₂)} max_{λ ∈ B^{|C|}_{ℓ₁}(0, 1)} h_y(v, λ), where (4.13)

h_y(v, λ) := Σ_{u∈C} ( λ_u (⟨y, u⟩ − ℜ(⟨v, Φu⟩)) − ε |λ_u| · ‖y‖₂ ), (4.14)

and where we identify C^m with R^{2m} above. Note that Property 1 above is guaranteed by definition (4.13). Furthermore, we note that if

max_{λ ∈ {±e_j}_{j=1}^{|C|}} h_y(g(y), λ) = max_{u∈C} ( |⟨y, u⟩ − ℜ(⟨g(y), Φu⟩)| − ε ‖y‖₂ ) ≤ max_{λ ∈ B^{|C|}_{ℓ₁}(0, 1)} h_y(g(y), λ) ≤ 0,

then Property 2 above will hold as well. Thus, it suffices to show that

min_{v ∈ B^{2m}_{ℓ₂}(0, ‖y‖₂)} max_{λ ∈ B^{|C|}_{ℓ₁}(0, 1)} h_y(v, λ) ≤ 0

always holds in order to finish the proof. Noting that h_y : R^{2m+|C|} → R defined in (4.14) is continuous, convex (affine) in v, and concave in λ, and further noting that both B^{|C|}_{ℓ₁}(0, 1) and B^{2m}_{ℓ₂}(0, ‖y‖₂) are compact and convex, we may apply Von Neumann's minimax theorem [84] to see that

min_{v ∈ B^{2m}_{ℓ₂}(0, ‖y‖₂)} max_{λ ∈ B^{|C|}_{ℓ₁}(0, 1)} h_y(v, λ) = max_{λ ∈ B^{|C|}_{ℓ₁}(0, 1)} min_{v ∈ B^{2m}_{ℓ₂}(0, ‖y‖₂)} h_y(v, λ)

holds. Thus, we will in fact be finished if we can show that min_{v ∈ B^{2m}_{ℓ₂}(0, ‖y‖₂)} h_y(v, λ) ≤ 0 holds for each λ ∈ B^{|C|}_{ℓ₁}(0, 1). By rescaling, this in turn is implied by showing that

∀ u ∈ conv(C ∪ −C), ∃ v ∈ B^{2m}_{ℓ₂}(0, ‖y‖₂) such that ⟨y, u⟩ − ℜ(⟨v, Φu⟩) − ε ‖y‖₂ ≤ 0 (4.15)

holds. To prove (4.15) for a fixed u ∈ conv(C ∪ −C) ⊆ conv(S_M ∪ −S_M) = conv(S_M), and thereby establish the stated lemma, one may set v = ‖y‖₂ Φu / ‖Φu‖₂. Doing so, we see that the left side of (4.15) simplifies to ⟨y, u⟩ − ‖y‖₂ ‖Φu‖₂ − ε ‖y‖₂. To finish, we note that indeed

⟨y, u⟩ − ‖y‖₂ ‖Φu‖₂ − ε ‖y‖₂ ≤ ‖y‖₂ ‖u‖₂ − ‖y‖₂ ‖Φu‖₂ − ε ‖y‖₂ ≤ ‖y‖₂ ( ‖u‖₂ − ‖Φu‖₂ − ε ) ≤ 0

will then hold since Φ provides ε-convex hull distortion for S_M.

Lemma 4.4.4. Let M ⊂ R^N be non-empty, ε ∈ (0, 1), and suppose that Φ ∈ C^{m×N} provides ε-convex hull distortion for S_M. Then, there exists an outer bi-Lipschitz extension of Φ, f : R^N → C^{m+1}, with the property that

| ‖f(x) − f(y)‖₂² − ‖x − y‖₂² | ≤ 24ε ‖x − y‖₂² (4.16)

holds for all x ∈ M and y ∈ R^N.

Proof. Given y ∈ R^N, let y_M ∈ M satisfy ‖y − y_M‖₂ = inf_{x∈M} ‖y − x‖₂. (One can see that it suffices to approximately compute y_M in order to achieve (4.16) up to a fixed precision.) We define

f(y) := (Φy, 0) if y ∈ M, and
f(y) := ( Φy_M + g(y − y_M), √( ‖y − y_M‖₂² − ‖g(y − y_M)‖₂² ) ) if y ∉ M,

where g is defined as in Lemma 4.4.3. Fix x ∈ M. If y ∈ M then ‖f(x) − f(y)‖₂² = ‖Φ(x − y)‖₂², and so | ‖f(x) − f(y)‖₂² − ‖x − y‖₂² | ≤ 3ε ‖x − y‖₂² will hold since Φ will be a 3ε-JL embedding of M − M (recall the proof of Corollary 4.4.1 and note the linearity of Φ). Thus, it suffices to consider a fixed y ∉ M. In that case we have

‖f(x) − f(y)‖₂² = ‖Φ(x − y_M) − g(y − y_M)‖₂² + ‖y − y_M‖₂² − ‖g(y − y_M)‖₂²
= ‖y − y_M‖₂² + ‖Φ(x − y_M)‖₂² − 2ℜ(⟨g(y − y_M), Φ(x − y_M)⟩) (4.17)

by the polarization identity and parallelogram law.
Similarly, we have that

‖x − y‖₂² = ‖(x − y_M) − (y − y_M)‖₂² = ‖y − y_M‖₂² + ‖x − y_M‖₂² − 2⟨y − y_M, x − y_M⟩. (4.18)

Subtracting (4.18) from (4.17), we can now see that

| ‖f(x) − f(y)‖₂² − ‖x − y‖₂² | ≤ | ‖Φ(x − y_M)‖₂² − ‖x − y_M‖₂² | + 2 |ℜ(⟨g(y − y_M), Φ(x − y_M)⟩) − ⟨y − y_M, x − y_M⟩|
≤ 3ε ‖x − y_M‖₂² + 4ε ‖y − y_M‖₂ ‖x − y_M‖₂
≤ 3ε ‖x − y_M‖₂² + 2ε ( ‖y − y_M‖₂² + ‖x − y_M‖₂² ), (4.19)

where the second inequality again appeals to Φ being a 3ε-JL embedding of M − M, and to Lemma 4.4.3. Considering (4.19), we can see that

• ‖y − y_M‖₂ ≤ ‖y − x‖₂ by the definition of y_M, and so
• ‖x − y_M‖₂ ≤ ‖x − y‖₂ + ‖y − y_M‖₂ ≤ 2‖x − y‖₂, and thus
• ‖y − y_M‖₂² + ‖x − y_M‖₂² ≤ (‖y − y_M‖₂ + ‖x − y_M‖₂)² ≤ 9‖x − y‖₂².

Using the last two inequalities above in (4.19) now yields the stated result.

We are now prepared to prove the two main results of this section.

4.4.4 PROOF OF THEOREM 4.4.1

Apply Theorem 4.4.4 with ε ← ε/24 in order to obtain ε/24-convex hull distortion for S_M via Φ. Then, apply Lemma 4.4.4.

4.4.5 PROOF OF THEOREM 4.4.2

To begin, we apply Corollary 4.4.1 with, e.g., p = 1/2 to demonstrate that an (c″/ε²)( w(S_M) + √(ln 4) )² × N matrix with i.i.d. standard normal random entries can provide (ε/24)-convex hull distortion for S_M, where c″ is an absolute constant. Hence, such a matrix Φ exists. An application of Lemma 4.4.4 now finishes the proof.

4.5 THE PROOF OF THEOREM 4.2.1

We apply Theorem 4.4.2 together with Theorem 4.3.1, which bounds the Gaussian width of S_M.

4.6 A NUMERICAL EVALUATION OF TERMINAL EMBEDDINGS

In this section we consider several variants of the optimization approach mentioned in Section 3.3 of [81] for implementing a terminal embedding f : R^N → R^{m+1} of a finite set X ⊂ R^N. In effect, this requires us to implement a function satisfying two sets of constraints from [81, Section 3.3] that are analogous to the two properties of g : R^N → C^m listed at the beginning of the proof of Lemma 4.4.3. See Lines 1 and 2 of Algorithm 4.1 for a concrete example of one type of constrained minimization problem solved herein to accomplish this task.
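We note in passing that Line 2 of Algorithm 4.1 below is a small convex program (a quadratic objective, one norm constraint, and finitely many constraints that are affine in z inside absolute values), and so is easy to prototype. A hedged Python/cvxpy sketch of the per-point computation follows; the dissertation's released implementation uses MATLAB/CVX, and the function name here is ours.

```python
import numpy as np
import cvxpy as cp

def terminal_embed_point(u, X, Pi, eps=0.1):
    # One evaluation f(u) of a terminal embedding of the rows of X, following
    # the structure of Algorithm 4.1 (a sketch, not the released implementation).
    x_np = X[np.argmin(np.linalg.norm(X - u[None, :], axis=1))]  # Line 1
    r = np.linalg.norm(u - x_np)

    z = cp.Variable(Pi.shape[0])                                 # Line 2
    obj = cp.Minimize(cp.sum_squares(z) + 2 * (Pi @ (u - x_np)) @ z)
    cons = [cp.norm(z, 2) <= r]
    for x in X:
        lhs = (Pi @ (x - x_np)) @ z - np.dot(u - x_np, x - x_np)
        cons.append(cp.abs(lhs) <= eps * r * np.linalg.norm(x - x_np))
    cp.Problem(obj, cons).solve()
    u_prime = z.value

    # Line 3: lift to R^{m+1}.
    last = np.sqrt(max(r**2 - np.linalg.norm(u_prime)**2, 0.0))
    return np.concatenate([Pi @ x_np + u_prime, [last]])
```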
Algorithm 4.1 Terminal Embedding of a Finite Set
Input: ε ∈ (0, 1); X ⊂ R^N with |X| =: n; S ⊂ R^N with |S| =: n′; m ∈ N with m < N; a random matrix Φ ∈ R^{m×N} with i.i.d. standard Gaussian entries, rescaled to perform as a JL embedding matrix Π := (1/√m)Φ.
Output: A terminal embedding of X, f : R^N → R^{m+1}, evaluated on S.
for u ∈ S do
1) Compute x_np := argmin_{x∈X} ‖u − x‖₂
2) Solve the following constrained minimization problem to compute a minimizer u′ ∈ R^m:
  Minimize h_{u,x_np}(z) := ‖z‖₂² + 2⟨Π(u − x_np), z⟩
  subject to ‖z‖₂ ≤ ‖u − x_np‖₂,
  |⟨z, Π(x − x_np)⟩ − ⟨u − x_np, x − x_np⟩| ≤ ε ‖u − x_np‖₂ ‖x − x_np‖₂ for all x ∈ X
3) Compute f : R^N → R^{m+1} at u via
  f(u) := (Πu, 0) if u ∈ X, and f(u) := ( Πx_np + u′, √( ‖u − x_np‖₂² − ‖u′‖₂² ) ) if u ∉ X
end for

Crucially, we note that any choice u′ ∈ R^m of a z satisfying the two sets of constraints in Line 2 of Algorithm 4.1 for a given u ∈ R^N is guaranteed to correspond to an evaluation of a valid terminal embedding of X at u in Line 3. This leaves the choice of the objective function, h_{u,x_np}, minimized in Line 2 of Algorithm 4.1 open to change without affecting its theoretical performance guarantees. Given this setup, several heretofore unexplored practical questions about terminal embeddings immediately present themselves. These include:

1. Repeatedly solving the optimization problem in Line 2 of Algorithm 4.1 to evaluate a terminal embedding of X on S is certainly more computationally expensive than simply evaluating a standard linear Johnson–Lindenstrauss (JL) embedding of X on S instead. How do terminal embeddings empirically compare to standard linear JL embedding matrices on real-world data in the context of, e.g., compressive classification? When, if ever, is their additional computational expense actually justified in practice?

2. Though any choice of objective function h_{u,x_np} in Line 2 of Algorithm 4.1 must result in a terminal embedding f of X based on the available theory, some choices probably lead to better empirical performance than others. What's a good default choice?

3. How much dimensionality reduction are terminal embeddings capable of in the context of, e.g., accurate compressive classification using real-world data?

In keeping with the motivating application discussed in Section 4.2.1 above, we will explore some preliminary answers to these three questions in the context of compressive classification based on real-world data below.

4.6.1 A COMPARISON CRITERION: COMPRESSIVE NEAREST NEIGHBOR CLASSIFICATION

Given a labelled data set D ⊂ R^N with label set L, we let Label : D → L denote the function which assigns the correct label to each element of the data set. To address the three questions above, we will use compressive nearest neighbor classification accuracy as a primary measure of an embedding strategy's quality. See Algorithm 4.2 for a detailed description of how this accuracy can be computed for a given data set D.
Algorithm 4.2 Measuring Compressive Nearest Neighbor Classification Accuracy
Input: ε ∈ (0, 1); a labeled data set D ⊂ R^N split into two disjoint subsets, a training set X ⊂ D with |X| =: n and a test set S ⊂ D with |S| =: n′, such that S ∩ X = ∅; a compressive dimension m < N.
Output: Successful nearest neighbor classification percentage for data embedded in R^{m+1}.
Fix f : R^N → R^{m+1}, an embedding of the training data X ⊂ R^N into R^{m+1} satisfying

(1 − ε)‖x − y‖₂ ≤ ‖f(x) − f(y)‖₂ ≤ (1 + ε)‖x − y‖₂

for all x, y ∈ X. [Note: this can either be a JL-embedding of X, or a stronger terminal embedding of X.]
% Embed the training data into R^{m+1}.
for x ∈ X do
  Compute f(x) using, e.g., Algorithm 4.1.
end for
% Classify the test data using its embedded distance in R^{m+1}.
c = 0
for u ∈ S do
  Compute f(u) using, e.g., Algorithm 4.1
  Compute x = argmin_{y∈X} ‖f(u) − f(y)‖₂
  if Label(u) = Label(x) then
    c = c + 1
  end if
end for
Output the successful classification percentage = (c/n′) × 100%

Note that Algorithm 4.2 can be used to help us compare the quality of different embedding strategies. For example, one can use Algorithm 4.2 to compare different choices of objective function h_{u,x_np} in Line 2 of Algorithm 4.1 against one another by running Algorithm 4.2 multiple times on the same training and test data sets while only varying the implementation of Algorithm 4.1 each time. This is exactly the type of approach we will use below. Of course, before we can begin we must first decide on some labelled data sets D to use in our classification experiments.

4.6.2 OUR CHOICE OF TRAINING AND TESTING DATA SETS

Herein we consider two standard benchmark image data sets which allow for accurate uncompressed Nearest Neighbor (NN) classification. The images in each data set can then be vectorized and embedded using, e.g., Algorithm 4.1 in order to test the accuracies of compressed NN classification variants against both one another and standard uncompressed NN classification. These benchmark data sets are as follows.

Figure 4.1: Example images from the MNIST data set (left), and the COIL-100 data set (right).

The MNIST data set [68, 22] consists of 60,000 training images of 28 × 28-pixel grayscale hand-written images of the digits 0 through 9. Thus, MNIST has 10 labels to correctly classify between, and N = 28² = 784. For all experiments involving the MNIST dataset, n/10 digits of each type are selected uniformly at random to form the training set X, for a total of n vectorized training images in R^784. Then, 100 digits of each type are randomly selected from those not used for training in order to form the test set S, leading to a total of n′ = 1000 vectorized test images in R^784. See the left side of Figure 4.1 for example MNIST images.

The COIL-100 data set [83] is a collection of 128 × 128-pixel color images of 100 objects, each photographed 72 times, with the object rotated by 5 degrees between successive shots to obtain a complete rotation. Only the green color channel of each image is used herein for simplicity. Thus, herein COIL-100 consists of 7,200 total vectorized images in R^N with N = 128² = 16,384, where each image has one of 100 different labels (72 images per label). For all experiments involving this COIL-100 data set, n/100 training images are down-sampled from each of the 100 objects' rotational image sequences. Thus, the training sets each contain n/100 vectorized images of each object, photographed at rotations of ≈ 36000/n degrees (rounded to multiples of 5).
The resulting training data sets therefore all consist of n vectorized images in R^16,384. After forming each training set, 10 images of each object are then randomly selected from those not used for training in order to form the test set S, leading to a total of n′ = 1000 vectorized test images in R^16,384 per experiment. See the right side of Figure 4.1 for example COIL-100 images.
4.6.3 A COMPARISON OF FOUR EMBEDDING STRATEGIES VIA NN CLASSIFICATION
In this section we seek to better understand (i) when terminal embeddings outperform standard JL-embedding matrices in practice with respect to accurate compressive NN classification, (ii) what type of objective functions h_{u,x_NN} in Line 2 of Algorithm 4.1 perform best in practice when computing a terminal embedding, and (iii) how much dimensionality reduction one can achieve with a terminal embedding without appreciably degrading standard NN classification results in practice. To gain insight on these three questions we will compare the following four embedding strategies in the context of NN classification. These strategies begin with the most trivial linear embeddings (i.e., the identity map) and slowly progress toward extremely non-linear terminal embeddings.
(a) Identity: We use the data in its original uncompressed form (i.e., we use the trivial embedding f : R^N → R^N defined by f(u) = u in Algorithm 4.2). Here the embedding dimension m + 1 is always fixed to be N.
(b) Linear: We compressively embed our training data X using a JL embedding. More specifically, we generate an m × N random matrix Φ with i.i.d. standard Gaussian entries and then set f : R^N → R^{m+1} to be f(u) := ((1/√m)Φu, 0) in Algorithm 4.2 for various choices of m. It is then hoped that f will embed the test data S well in addition to the training data X. Note that this embedding choice for f is consistent with Algorithm 4.1 where one lets X = X ∪ S when evaluating Line 3, thereby rendering the minimization problem in Line 2 irrelevant.
(c) A Valid Terminal Embedding That's as Linear as Possible: To minimize the point-wise difference between the terminal embedding f computed by Algorithm 4.1 and the linear map defined above in (b), we may choose the objective function in Line 2 of Algorithm 4.1 to be h_{u,x_NN}(z) := ⟨Π(x_NN − u), z⟩. To see why solving this minimizes the pointwise difference between f and the linear map in (b), let u′ be such that ⟨Π(x_NN − u), z⟩ is minimal subject to the constraints in Line 2 of Algorithm 4.1 when z = u′. Since u and x_NN are fixed here, we note that z = u′ will then also minimize
$$\|\Pi(\mathbf{x}_{NN}-\mathbf{u})\|_2^2 + 2\langle \Pi(\mathbf{x}_{NN}-\mathbf{u}), \mathbf{z}\rangle + \|\mathbf{u}-\mathbf{x}_{NN}\|_2^2$$
$$= \|\Pi(\mathbf{x}_{NN}-\mathbf{u})\|_2^2 + \|\mathbf{z}\|_2^2 + 2\langle \Pi(\mathbf{x}_{NN}-\mathbf{u}), \mathbf{z}\rangle + \|\mathbf{u}-\mathbf{x}_{NN}\|_2^2 - \|\mathbf{z}\|_2^2$$
$$= \|\Pi(\mathbf{x}_{NN}-\mathbf{u}) + \mathbf{z}\|_2^2 + \|\mathbf{u}-\mathbf{x}_{NN}\|_2^2 - \|\mathbf{z}\|_2^2$$
$$= \left\| \left(\Pi\mathbf{x}_{NN} + \mathbf{z},\ \sqrt{\|\mathbf{u}-\mathbf{x}_{NN}\|_2^2 - \|\mathbf{z}\|_2^2}\right) - (\Pi\mathbf{u}, 0) \right\|_2^2$$
subject to the desired constraints. Hence, we can see that choosing z = u′ as above is equivalent to minimizing ∥f(u) − (Πu, 0)∥₂² over all valid choices of terminal embeddings f that satisfy the existing theory.
(d) A Terminal Embedding Computed by Algorithm 4.1 as Presented: This terminal embedding is computed using Algorithm 4.1 exactly as it is formulated above (i.e., with the objective function in Line 2 chosen to be h_{u,x_NN}(z) := ∥z∥₂² + 2⟨Π(u − x_NN), z⟩). Note that this choice of objective function was made to encourage non-linearity in the resulting terminal embedding f computed by Algorithm 4.1. To understand our intuition for making this choice, suppose that ∥z∥₂² + 2⟨Π(u − x_NN), z⟩ is minimal subject to the constraints in Line 2 of Algorithm 4.1 when z = u′. Since u and x_NN are fixed independently of z, this means that z = u′ then also minimizes
$$\|\mathbf{z}\|_2^2 + 2\langle \Pi(\mathbf{u}-\mathbf{x}_{NN}), \mathbf{z}\rangle + \|\Pi(\mathbf{u}-\mathbf{x}_{NN})\|_2^2 = \|\mathbf{z} + \Pi(\mathbf{u}-\mathbf{x}_{NN})\|_2^2.$$
Hence, this objective function encourages u′ to be as close to −Π(u − x_NN) = Π(x_NN − u) as possible subject to satisfying the constraints in Line 2 of Algorithm 4.1. Recalling (c) just above, we can now see that this is exactly encouraging u′ to be a value for which the objective function we seek to minimize in (c) is relatively large. (A solver-level sketch of strategies (c) and (d) follows this list.)
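Since the available theory leaves the objective in Line 2 free, both strategy (c) and strategy (d) can be computed with any convex solver. The experiments reported below used MATLAB and CVX; purely for illustration, a minimal Python/CVXPY translation of Lines 1–3 of Algorithm 4.1 might look as follows. All function and variable names here are hypothetical, and this sketch is not the code used to generate the figures below.

```python
import numpy as np
import cvxpy as cp

def terminal_embed_point(u, X, Pi, eps, objective="default"):
    """Evaluate a terminal embedding at a single point u not in X.

    u   : (N,)  point to embed
    X   : (n,N) training set, one point per row
    Pi  : (m,N) JL matrix (i.i.d. N(0,1) entries scaled by 1/sqrt(m))
    eps : distortion parameter in (0, 1)
    """
    # Line 1: nearest training point to u.
    x_nn = X[np.argmin(np.linalg.norm(X - u, axis=1))]

    # Line 2: constrained convex program defining u'.
    m = Pi.shape[0]
    z = cp.Variable(m)
    diffs = X - x_nn                          # rows are x - x_nn
    rhs = diffs @ (u - x_nn)                  # <u - x_nn, x - x_nn> for each x
    bounds = eps * np.linalg.norm(u - x_nn) * np.linalg.norm(diffs, axis=1)
    constraints = [
        cp.norm(z, 2) <= np.linalg.norm(u - x_nn),
        cp.abs((diffs @ Pi.T) @ z - rhs) <= bounds,
    ]
    if objective == "default":                # strategy (d): ||z||^2 + 2<Pi(u - x_nn), z>
        obj = cp.sum_squares(z) + 2 * (Pi @ (u - x_nn)) @ z
    else:                                     # strategy (c): <Pi(x_nn - u), z>
        obj = (Pi @ (x_nn - u)) @ z
    cp.Problem(cp.Minimize(obj), constraints).solve()
    u_prime = z.value

    # Line 3: assemble the (m+1)-dimensional image of u.
    last = np.sqrt(max(np.linalg.norm(u - x_nn) ** 2
                       - np.linalg.norm(u_prime) ** 2, 0.0))
    return np.concatenate([Pi @ x_nn + u_prime, [last]])
```

For points u ∈ X one simply returns (Πu, 0) directly, as in Line 3 of Algorithm 4.1.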
We are now prepared to empirically compare the four types of embeddings (a) – (d) on the data sets discussed above in Section 4.6.2. To do so, we run Algorithm 4.2 four times for several different choices of embedding dimension m on each data set below, varying the choice of embedding f between (a), (b), (c), and (d) for each value of m. The successful classification percentage is then plotted as a function of m for each different data set and choice of embedding. See Figures 4.2(a) and 4.2(c) for the results.
In addition, to quantify the extent to which the embedding strategies (b) – (d) above are increasingly nonlinear, we also measure the relative distance between where each training-set embedding f maps points in the test sets versus where its associated linear training-set embedding would map them. More specifically, for each embedding f and test point u ∈ S we let
$$\mathrm{Nonlinearity}_f(\mathbf{u}) = \frac{\|f(\mathbf{u}) - (\Pi\mathbf{u}, 0)\|_2}{\|(\Pi\mathbf{u}, 0)\|_2} \times 100\%.$$
See Figures 4.2(b) and 4.2(d) for plots of Mean_{u∈S} Nonlinearity_f(u) for each of the embedding strategies (b) – (d) on the data sets discussed in Section 4.6.2.
To compute solutions to the minimization problem in Line 2 of Algorithm 4.1 below we used the MATLAB package CVX [41, 40] with the initialization z₀ = Π(u − x_NN) and ε = 0.1 in the constraints. All simulations were performed using MATLAB R2021b on an Intel desktop with a 2.60GHz i7-10750H CPU and 16GB DDR4 2933MHz memory. All code used to generate the figures below is publicly available at https://github.com/MarkPhilipRoach/TerminalEmbedding.
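Since the Nonlinearity measure above is just a relative distance between two embeddings of the same test point, it is straightforward to re-implement; the following NumPy sketch (with illustrative names, not taken from the repository above) computes it for one test point.

```python
import numpy as np

def nonlinearity(f_u, Pi, u):
    """Nonlinearity_f(u): relative distance (in percent) between the terminal
    embedding f(u) and the purely linear image (Pi u, 0) of the same point."""
    lin = np.concatenate([Pi @ u, [0.0]])          # (Pi u, 0) in R^{m+1}
    return 100.0 * np.linalg.norm(f_u - lin) / np.linalg.norm(lin)
```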
Figure 4.2 Figures 4.2(a) and 4.2(b) concern the MNIST data set with training set size n = 4000 and test set size n′ = 1000 in all experiments. Similarly, Figures 4.2(c) and 4.2(d) concern the COIL-100 data set with training set size n = 3600 and test set size n′ = 1000 in all experiments. In both Figures 4.2(a) and 4.2(c) the dashed black "NearestNeighbor" line plots the classification accuracy when the Identity map (a) is used in Algorithm 4.2. Note that the "NearestNeighbor" line is independent of m because the identity map involves no compression. Similarly, in all of the Figures 4.2(a) – 4.2(d) the red "TerminalEmbed" curves correspond to the use of Algorithm 4.1 as it is presented to compute highly non-linear terminal embeddings (embedding strategy (d) above), the green "InnerProd" curves correspond to the use of nearly linear terminal embeddings (embedding strategy (c) above), and the blue "Linear" curves correspond to the use of linear JL embedding matrices (embedding strategy (b) above).
Looking at Figure 4.2 one can see that the most non-linear embedding strategy (d) – i.e., Algorithm 4.1 – allows for the best compressed NN classification performance, outperforming standard linear JL embeddings for all choices of m. Perhaps most interestingly, it also quickly converges to the uncompressed NN classification performance, matching it to within 1 percent at the values of m = 24 for MNIST and m = 15 for COIL-100. This corresponds to relative dimensionality reductions of 100(1 − 24/784)% ≈ 96.9% and 100(1 − 15/16384)% ≈ 99.9%, respectively, with negligible loss of NN classification accuracy. As a result, it does indeed appear as if nonlinear terminal embeddings have the potential to allow for improvements in dimensionality reduction in the context of classification beyond what standard linear JL embeddings can achieve.
Of course, challenges remain in the practical application of such nonlinear terminal embeddings. Principally, their computation by, e.g., Algorithm 4.1 is orders of magnitude slower than simply applying a JL embedding matrix to the data one wishes to compressively classify. Nonetheless, if dimension reduction at all costs is one's goal, terminal embeddings appear capable of providing better results than their linear brethren. And, recent theoretical work [18] aimed at lessening their computational deficiencies looks promising.
4.6.4 ADDITIONAL EXPERIMENTS ON EFFECTIVE DISTORTIONS AND RUN TIMES
In this section we further investigate the best performing terminal embedding strategy from the previous section (i.e., Algorithm 4.1) on the MNIST and COIL-100 data sets. In particular, we provide illustrative experiments concerning the improvement of (i) compressive classification accuracy with training set size, and (ii) the effective distortion of the terminal embedding with embedding dimension m + 1. Furthermore, we also investigate (iii) the run time scaling of Algorithm 4.1. To compute the effective distortions of a given (terminal) embedding of training data X, f : R^N → R^{m+1}, over all available test and train data X ∪ S we use
$$\mathrm{MaxDist}_f = \max_{\mathbf{x}\in X}\ \max_{\mathbf{u}\in (S\cup X)\setminus\{\mathbf{x}\}} \frac{\|f(\mathbf{u})-f(\mathbf{x})\|_2}{\|\mathbf{u}-\mathbf{x}\|_2}, \qquad \mathrm{MinDist}_f = \min_{\mathbf{x}\in X}\ \min_{\mathbf{u}\in (S\cup X)\setminus\{\mathbf{x}\}} \frac{\|f(\mathbf{u})-f(\mathbf{x})\|_2}{\|\mathbf{u}-\mathbf{x}\|_2}.$$
Note that these correspond to estimates of the upper and lower multiplicative distortions, respectively, of a given terminal embedding in (4.2). In order to better understand the effect of the minimizer u′ of the minimization problem in Line 2 of Algorithm 4.1 on the final embedding f, we will also separately consider the effective distortions of its component linear JL embedding u ↦ (Πu, 0) below. See Figures 4.3 and 4.4 for such plots using the MNIST and COIL-100 data sets, respectively.
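For concreteness, a small NumPy sketch of these two effective-distortion estimates is given below; the names and the brute-force double loop are illustrative only.

```python
import numpy as np

def effective_distortions(points, f_vals, train_mask):
    """Brute-force estimates of MaxDist_f and MinDist_f over all pairs (u, x)
    with x in the training set X and u ranging over (S union X) minus {x}.

    points     : (n_tot, N)   array holding all of X and S, one point per row
    f_vals     : (n_tot, m+1) array of the corresponding embedded points
    train_mask : (n_tot,)     boolean array flagging the rows belonging to X
    """
    max_d, min_d = 0.0, np.inf
    for i in np.flatnonzero(train_mask):        # x ranges over X
        for j in range(points.shape[0]):        # u ranges over S union X
            if j == i:
                continue
            ratio = (np.linalg.norm(f_vals[j] - f_vals[i])
                     / np.linalg.norm(points[j] - points[i]))
            max_d, min_d = max(max_d, ratio), min(min_d, ratio)
    return max_d, min_d
```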
Figure 4.3 This figure compares (a) compressive NN classification accuracies, and (b) the classification run times of Algorithm 4.2 averaged over all u ∈ S, on the MNIST data set. Three different training data set sizes n = |X| ∈ {1000, 2000, 4000} were fixed as the embedding dimension m + 1 varied for each of the first two subfigures. Recall that the test set size is always fixed to n′ = 1000. In addition, Figure (c) compares MaxDist_f and MinDist_f for the nonlinear f computed by Algorithm 4.1 versus its component linear embedding u ↦ (Πu, 0) as m varies for a fixed embedded training set size of n = 4000.
Figure 4.4 Figures (a) and (b) here are run with identical parameters as for their corresponding subfigures in Figure 4.3, except using the COIL-100 data set. Similarly, Figure (c) compares MaxDist_f and MinDist_f for the nonlinear f computed by Algorithm 4.1 versus its component linear embedding u ↦ (Πu, 0) as m varies for a fixed embedded training set size of n = 3600.
Looking at Figures 4.3 and 4.4 one notes several consistent trends. First, compressive classification accuracy increases with both training set size n and embedding dimension m, as generally expected. Second, compressive classification run times also increase with training set size n (as well as more mildly with embedding dimension m). This is mainly due to the increase in the number of constraints in Line 2 of Algorithm 4.1 with the training set size n. Finally, the distortion plots indicate that the nonlinear terminal embeddings f computed by Algorithm 4.1 tend to preserve the lower distortions of their component linear JL embeddings while simultaneously increasing their upper distortions. As a result, the nonlinear terminal embeddings considered here appear to spread the initially JL embedded data out, perhaps pushing different classes away from one another in the process. If so, it would help explain the increased compressive NN classification accuracy observed for Algorithm 4.1 in Figure 4.2.
4.6.5 ADDITIONAL SIMULATION
Figure 4.5 Example images from the Fashion-MNIST data set.
The Fashion-MNIST data set [68, 22] consists of 60,000 training images of 28 × 28-pixel grayscale images of clothing items, with 10 labels to correctly classify between, and N = 28² = 784. For all experiments involving the Fashion-MNIST dataset, n/10 images of each clothing item are selected uniformly at random to form the training set X, for a total of n vectorized training images in R^784. Then, 100 images of each type are randomly selected from those not used for training in order to form the test set S, leading to a total of n′ = 1000 vectorized test images in R^784. See Figure 4.5 for example Fashion-MNIST images.
Figure 4.6 Figure 4.6(a) compares compressive NN classification accuracies. Three different training data set sizes n = |X| ∈ {1000, 2000, 4000} were fixed as the embedding dimension m + 1 varied. The test set size is again fixed to n′ = 1000. Figure 4.6(b) concerns the algorithmic comparison, as discussed in Figure 4.2, with training set size n = 4000.
4.6.6 DEMONSTRATION OF NON-LINEARITY BY DENSELY APPROXIMATED TERMINAL EMBEDDING OF MANIFOLDS
In this section, we perform numerical simulations to visualize the various ways of embedding from R³ to R². In particular, we will look at the TerminalEmbed, InnerProd, and Linear approaches outlined in Section 4.6.3.
For the set X, we take a randomly generated dense sample of a plane P ⊂ R³, and for S, we take a dense sample of a smooth curve. In all cases, S is formed from evenly spaced samples. Figure 4.7 demonstrates the case where S densely approximates a line passing through this densely sampled plane. Figure 4.8 demonstrates the case where S densely approximates a circle passing through this densely sampled plane. Both examples demonstrate the non-linearity of our TerminalEmbed output in comparison to the InnerProd output, and especially the Linear output.
Figure 4.7 Figure 4.7 demonstrates when S is densely approximating a line. Three different training data set sizes n = |X| ∈ {10², 10³, 10⁴} were used, with n′ = |S| = 100 in all three cases.
Figure 4.8 Figure 4.8 demonstrates when S is densely approximating a circle. Three different training data set sizes n = |X| ∈ {10², 10³, 10⁴} were used, with n′ = |S| = 100 in all three cases.
4.7 COMPRESSED CLASSIFICATION FROM PHASELESS MEASUREMENTS
We now consider applying compressed classification to the measurements generated from Chapters 2 and 3. For near-field ptychographic measurements, we use measurements of the form given in (2.2), using p and m as given in Lemma 2.4.2. We could also perform a similar simulation using far-field ptychographic measurements, where measurements are of the form given in (3.1), using masks given in (2.8). For both of these measurement types, we vectorize the measurements so that we can apply our classification algorithm. For the addition of noise, we apply a Gaussian noise vector to each u ∈ S as given in Algorithm 4.2; that is, we replace u with u + n, n ∈ R^d, before applying the minimization problem from Algorithm 4.1. For our x, we use the grayscale vectorizations of the MNIST and COIL-100 images as in the previous section.
This, however, causes an issue with the size of the space from which we embed. For instance, consider a grayscale image that is P × P pixels. Once vectorized, this is a P²-vector, and thus the matrix of measurements will be P² × P². Once this matrix has itself been vectorized, we finally arrive at a P⁴-vector for our training/testing data. For the MNIST data, this results in vectors of size 28⁴ ≈ 600 thousand, whereas for the COIL-100 data, it results in vectors of size 128⁴ ≈ 268 million. As a consequence, the running times of the algorithm become long. To counteract this, rather than taking the full measurement matrix, we instead sub-sample based on the frequencies. In particular, we simply take the first column for our training/testing data. We will demonstrate that this approach not only allows for successful classification, but also does not impact the results in a significant manner.
First, we demonstrate our results on classifying NFP measurements of the MNIST dataset. Since the vectorized images are of length d = 28² = 784, we choose δ = 25 so that d is divisible by 2δ − 1. (A sketch of how such sub-sampled NFP feature vectors can be formed is given below.)
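Purely as an illustration of the pipeline just described (and not the exact code used for these experiments), the following NumPy sketch forms NFP-style measurements Y_{k,ℓ} = |(p ∗ (S_k m ◦ x))_ℓ|², as written out in Figure 4.11 below, for a vectorized image x and then keeps only the first column as the sub-sampled training/testing feature vector. Here p and m are random stand-ins for the PSF and mask of Lemma 2.4.2.

```python
import numpy as np

def nfp_features(x, p, m, K=1):
    """Sub-sampled NFP feature vector: Y[k, l] = |(p * (S_k m . x))_l|^2,
    circular convolution *, Hadamard product ., keeping frequencies l < K."""
    d = x.size
    Y = np.empty((d, K))
    for k in range(d):
        shifted = np.roll(m, -k) * x          # (S_k m)_n = m_{n+k}, then . x
        conv = np.fft.ifft(np.fft.fft(p) * np.fft.fft(shifted))  # p * (...)
        Y[k] = np.abs(conv[:K]) ** 2
    return Y[:, 0] if K == 1 else Y.ravel()   # K = 1 keeps the first column

rng = np.random.default_rng(0)
d = 784                                       # vectorized 28 x 28 MNIST image
x = rng.random(d)
p, m = rng.random(d), rng.random(d)           # illustrative PSF and mask only
feature = nfp_features(x, p, m, K=1)          # length-d feature vector
```

In the full (non-sub-sampled) setting one would take K up to 2δ − 1, exactly as in the sub-sampling study of Figure 4.11 below.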
Figure 4.9 Figure 4.9(a) compares compressive NN classification accuracies for the MNIST-NFP measurements (δ = 25). Two different training data set sizes n = |X| ∈ {4000, 8000} were fixed as the embedding dimension m + 1 varied. NearestNeighbor refers to the nearest neighbor classification in the original space. Linear refers to the linear embedding described in Section 4.6.3. Figure 4.9(b) concerns the classification in which varying levels of noise are applied, with the training data set size fixed to n = 8000. Noiseless NearestNeighbor refers to the noiseless nearest neighbor classification in the original space. For both figures, the test set size is fixed to n′ = 1000.
For COIL-100 we encounter an issue in choosing a suitable δ, since the vectorized COIL-100 images are of length d = 128² = 16,384 = 2^14; no δ exists for which d is divisible by 2δ − 1, since any 2δ − 1 > 1 would be an odd divisor of a power of two. Instead, we choose a δ that approximates the ratio d/δ used in the MNIST-NFP simulations above. We then artificially extend d using the same process discussed in Section 2.8 to ensure divisibility.
Figure 4.10 Figure 4.10(a) compares compressive NN classification accuracies for the COIL-NFP measurements (δ = 525). Two different training data set sizes n = |X| ∈ {1800, 3600} were fixed as the embedding dimension m + 1 varied. NearestNeighbor refers to the nearest neighbor classification in the original space. Linear refers to the linear embedding described in Section 4.6.3. Figure 4.10(b) concerns the classification in which varying levels of noise are applied, with the training data set size fixed to n = 3600. Noiseless NearestNeighbor refers to the noiseless nearest neighbor classification in the original space. For both figures, the test set size is fixed to n′ = 1000.
Finally, we look at the effect of sub-sampling the frequency index, using simply the nearest neighbor classification in the original space.
Figure 4.11 Figure representing the non-compressed NN classification of the MNIST-NFP measurements (δ = 25) with varying levels of sub-sampling; that is, Y_{k,ℓ} = |(p ∗ (S_k m ◦ x))_ℓ|², (k, ℓ) ∈ [d]₀ × [K]₀, with varying levels for K, up to 2δ − 1.
CHAPTER 5 CONTRIBUTIONS AND FUTURE WORK
Herein we outline a summary of contributions and possible areas for future work. In Chapter 2, we proved that for a certain given point spread function and mask, we could provide a recovery guarantee theorem when using a lifted linear approach. We developed a weighted spectral gap lower bound which we then applied to our specific problem. We introduced two new algorithms for recovering a specimen of interest from near-field ptychographic measurements. Both of these algorithms rely on first reformulating and reshaping our measurements so that they resemble widely-studied far-field ptychographic measurements. We then recover our specimen using either Wirtinger Flow or methods based on [54]. Algorithm 2.1 is computationally efficient and, to the best of our knowledge, is the first algorithm with provable recovery guarantees for measurements of this form. Algorithm 2.2, on the other hand, has the advantage of applying to more general masks with global support. Developing more efficient and provably accurate algorithms for this latter class of measurements remains an interesting avenue for future work.
In Chapter 3, we developed a novel approach for solving blind far-field ptychography. We introduced an algorithm for recovering a specimen of interest from blind far-field ptychographic measurements. This algorithm relies on reformulating the measurements so that they resemble widely-studied blind deconvolutional measurements. This leads to transposed Khatri-Rao product estimates of our specimen which are then able to be recovered by angular synchronization.
We then use these estimates in applying inverse Fourier transforms, point-wise division, and angular synchronization to recover estimates for the mask. Finally, we use a best-error-estimate sorting algorithm to find the final estimate of both the specimen and mask. As shown in the numerical results, Algorithm 3.6 recovers both the sample and mask within a good margin of error. It is also stable under noise. A further goal for this research would be to adapt the existing recovery guarantee theorems for the selected blind deconvolutional recovery algorithm, in which the assumed Gaussian matrix C is replaced with the Khatri-Rao matrix C′(k) = C • S_k C̄. In particular, this would mean providing alternate inequalities for the four key conditions laid out in Theorem 3.4.6.
In Chapter 4, we generalized the Johnson-Lindenstrauss lemma for manifolds. In particular, we let M be a compact d-dimensional submanifold of R^N with reach τ and volume V_M, and we proved that for all ε ∈ (0, 1), a nonlinear function f : R^N → R^m exists with
$$m \le C \frac{d}{\epsilon^2} \log\!\left(\frac{\sqrt{d}\, V_{\mathcal{M}}}{\tau}\right)$$
such that
$$(1-\epsilon)\|\mathbf{x}-\mathbf{y}\|_2 \le \|f(\mathbf{x})-f(\mathbf{y})\|_2 \le (1+\epsilon)\|\mathbf{x}-\mathbf{y}\|_2$$
holds for all x ∈ M and y ∈ R^N. The proof is constructive and yields an algorithm which works well in practice. In particular, we empirically demonstrated herein that such nonlinear functions allow for more accurate compressive nearest neighbor classification than standard linear Johnson-Lindenstrauss embeddings do in practice. Furthermore, it was demonstrated that this approach works when the labelled data consists of NFP measurements. Future work in this area would be to develop more computationally efficient algorithms for computing terminal embeddings. Additionally, one could explore the upper Lipschitz constants of terminal embeddings of sets with reach > 0 in small tubes around the set. Achieving these might allow for new results to be proven in machine learning application contexts.
BIBLIOGRAPHY
[1] Ali Ahmed, Alireza Aghasi, and Paul Hand. Blind deconvolutional phase retrieval via convex programming. Advances in Neural Information Processing Systems, 31, 2018.
[2] Ali Ahmed, Benjamin Recht, and Justin Romberg. Blind deconvolution using convex programming. IEEE Transactions on Information Theory, 60(3):1711–1732, 2013.
[3] Jacopo Antonello and Michel Verhaegen. Modal-based phase retrieval for adaptive optics. JOSA A, 32(6):1160–1170, 2015.
[4] GR Ayers and J Christopher Dainty. Iterative blind deconvolution method and its applications. Optics letters, 13(7):547–549, 1988.
[5] Richard G Baraniuk and Michael B Wakin. Random projections of smooth manifolds. Foundations of computational mathematics, 9(1):51–77, 2009.
[6] Robert Beinert and Gerlind Plonka. Ambiguities in one-dimensional discrete phase retrieval from fourier magnitudes. Journal of Fourier Analysis and Applications, 21(6):1169–1198, 2015.
[7] Tamir Bendory, Robert Beinert, and Yonina C Eldar. Fourier phase retrieval: Uniqueness and algorithms. In Compressed Sensing and its Applications, pages 55–91. Springer, 2017.
[8] JR Bond and G Efstathiou. The statistics of cosmic background radiation fluctuations. Monthly Notices of the Royal Astronomical Society, 226(3):655–687, 1987.
[9] Joe Buhler and Zinovy Reichstein. Symmetric functions and the phase problem in crystallography. Transactions of the American Mathematical Society, 357(6):2353–2377, 2005.
[10] Oliver Bunk, Martin Dierolf, Sรธren Kynde, Ian Johnson, Othmar Marti, and Franz Pfeiffer. Influence of the overlap parameter on the convergence of the ptychographical iterative engine. Ultramicroscopy, 108(5):481โ€“487, 2008. [11] Imre Bรกrรกny. A generalization of carathรฉodoryโ€™s theorem. Discrete Mathematics, 40(2):141โ€“ 152, 1982. [12] Emmanuel J Candรจs, Yonina C Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase retrieval via matrix completion. SIAM review, 57(2):225โ€“251, 2015. [13] Emmanuel J. Candรจs, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval from coded diffraction patterns. Applied and Computational Harmonic Analysis, 39(2):277 โ€“ 299, 2015. [14] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985โ€“2007, 111 2015. [15] Alfred S Carasso. Direct blind deconvolution. SIAM Journal on Applied Mathematics, 61(6):1980โ€“2007, 2001. [16] Huibin Chang, Pablo Enfedaque, and Stefano Marchesini. Blind ptychographic phase re- trieval via convergent alternating direction method of multipliers. SIAM Journal on Imaging Sciences, 12(1):153โ€“185, 2019. [17] Huibin Chang, Li Yang, and Stefano Marchesini. Fast iterative algorithms for blind phase retrieval: A survey. arXiv preprint arXiv:2211.06619, 2022. [18] Yeshwanth Cherapanamjeri and Jelani Nelson. Terminal embeddings in sublinear time. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 1209โ€“1216. IEEE, 2022. [19] Jesse N Clark, Xiaojing Huang, Ross J Harder, and Ian K Robinson. Continuous scanning mode for ptychography. Optics letters, 39(20):6066โ€“6069, 2014. [20] Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Compressed sensing and best ๐‘˜-term approximation. Journal of the American mathematical society, 22(1):211โ€“231, 2009. [21] Mark A Davenport, Marco F Duarte, Michael B Wakin, Jason N Laska, Dharmpal Takhar, Kevin F Kelly, and Richard G Baraniuk. The smashed filter for compressive classification and target recognition. In Computational Imaging V, volume 6498, page 64980H. International Society for Optics and Photonics, 2007. [22] Li Deng. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141โ€“142, 2012. [23] Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE international conference on computer vision, pages 576โ€“584, 2015. [24] Jan Drenth. Principles of protein X-ray crystallography. Springer Science & Business Media, 2007. [25] George H Dunteman. Principal components analysis. Number 69. Sage, 1989. [26] TB Edo, DJ Batey, AM Maiden, C Rau, U Wagner, ZD Pevsiฤ‡, TA Waigh, and JM Rodenburg. Sampling in x-ray ptychography. Physical Review A, 87(5):053850, 2013. [27] Armin Eftekhari and Michael B Wakin. New analysis of manifold embeddings and signal recovery from compressive measurements. Applied and Computational Harmonic Analysis, 39(1):67โ€“109, 2015. 112 [28] Yonina C Eldar, Pavel Sidorenko, Dustin G Mixon, Shaby Barel, and Oren Cohen. Sparse phase retrieval from short-time fourier measurements. IEEE Signal Processing Letters, 22(5):638โ€“642, 2014. [29] Michael Elkin, Arnold Filtser, and Ofer Neiman. Terminal embeddings. arXiv preprint arXiv:1603.02321, 2016. [30] Albert Fannjiang and Pengwen Chen. Blind ptychography: uniqueness and ambiguities. Inverse Problems, 36(4):045005, 2020. 
[31] Albert Fannjiang and Zheqing Zhang. Fixed point analysis of douglasโ€“rachford splitting for ptychography and phase retrieval. SIAM Journal on Imaging Sciences, 13(2):609โ€“650, 2020. [32] Herbert Federer. Curvature measures. Transactions of the American Mathematical Society, 93(3):418โ€“418, March 1959. [33] Frank Filbir, Felix Krahmer, and Oleh Melnyk. On recovery guarantees for angular synchro- nization. Journal of Fourier Analysis and Applications, 27(2):1โ€“26, 2021. [34] DA Fish, AM Brinicombe, ER Pike, and JG Walker. Blind deconvolution by means of the richardsonโ€“lucy algorithm. JOSA A, 12(1):58โ€“65, 1995. [35] Horacio E Fortunato and Manuel M Oliveira. Fast high-quality non-blind deconvolution using sparse adaptive priors. The Visual Computer, 30(6):661โ€“671, 2014. [36] Grant R Fowles. Introduction to modern optics. Courier Corporation, 1989. [37] Si Gao, Peng Wang, Fucai Zhang, Gerardo T Martinez, Peter D Nellist, Xiaoqing Pan, and Angus I Kirkland. Electron ptychographic microscopy for three-dimensional imaging. Nature communications, 8(1):1โ€“8, 2017. [38] Pierre Godard, Marc Allain, and Virginie Chamard. Imaging of highly inhomogeneous strain field in nanocrystals using x-ray bragg ptychography: A numerical study. Physical Review B, 84(14):144109, 2011. [39] JW Goodman. Film-grain noise in wavefront-reconstruction imaging. JOSA, 57(4):493โ€“502, 1967. [40] Michael Grant and Stephen Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95โ€“110. Springer-Verlag Limited, 2008. http://stanford.edu/~boyd/graph_dcp.html. [41] Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex program- 113 ming, version 2.1. http://cvxr.com/cvx, March 2014. [42] D Griffin, D Deadrick, and Jae Lim. Speech synthesis from short-time fourier transform magnitude and its application to speech processing. In ICASSPโ€™84. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 9, pages 61โ€“64. IEEE, 1984. [43] R Hegerl and W Hoppe. Phase evaluation in generalized diffraction (ptychography). Proc. Fifth Eur. Cong. Electron Microscopy, pages 628โ€“629, 1972. [44] Reiner Hegerl and W Hoppe. Dynamische theorie der kristallstrukturanalyse durch elektro- nenbeugung im inhomogenen primรคrstrahlwellenfeld. Berichte der Bunsengesellschaft fรผr physikalische Chemie, 74(11):1148โ€“1154, 1970. [45] W Hoppe. Diffraction in inhomogeneous primary wave fields. Acta Crystallogr. A 25, pages 495โ€“501,508โ€“515, 1969. [46] SO Hruszkewycz, Marc Allain, MV Holt, CE Murray, JR Holt, PH Fuoss, and Virginie Chamard. High-resolution three-dimensional structural microscopy by single-angle bragg ptychography. Nature materials, 16(2):244โ€“251, 2017. [47] Xiaojing Huang, Kenneth Lauer, Jesse N Clark, Weihe Xu, Evgeny Nazaretski, Ross Harder, Ian K Robinson, and Yong S Chu. Fly-scan ptychography. Scientific reports, 5(1):1โ€“5, 2015. [48] IBM. https://www.ibm.com/topics/knn. [49] M. A. Iwen, B. Preskitt, R. Saab, and A. Viswanathan. Phase retrieval from local measure- ments: improved robustness via eigenvector-based angular synchronization. Applied and Computational Harmonic Analysis, 48:415 โ€“ 444, 2020. [50] Mark Iwen, Michael Perlmutter, and Mark Philip Roach. Toward fast and provably accurate near-field ptychographic phase retrieval. Sampling Theory, Signal Processing, and Data Analysis, 21(1):6, 2023. 
[51] Mark Iwen, Arman Tavakoli, and Benjamin Schmidt. Lower bounds on the low-distortion embedding dimension of submanifolds of R๐‘› . arXiv preprint arXiv:2105.13512, 2021. [52] Mark A Iwen, Deanna Needell, Elizaveta Rebrova, and Ali Zare. Lower memory oblivious (tensor) subspace embeddings with fewer random bits: modewise methods for least squares. SIAM Journal on Matrix Analysis and Applications, 42(1):376โ€“416, 2021. [53] Mark A. Iwen, Benjamin Schmidt, and Arman Tavakoli. On fast johnson-lindenstrauss embeddings of compact submanifolds of R๐‘ with boundary. Arxiv, 2110.04193, 2021. [54] Mark A. Iwen, Aditya Viswanathan, and Yang Wang. Fast Phase Retrieval from Local 114 Correlation Measurements. SIAM Journal on Imaging Sciences, 9(4):1655โ€“1688, 2016. [55] Kishore Jaganathan, Yonina Eldar, and Babak Hassibi. Phase retrieval with masks using convex optimization. In 2015 IEEE International Symposium on Information Theory (ISIT), pages 1655โ€“1659. IEEE, 2015. [56] Kishore Jaganathan, Yonina C Eldar, and Babak Hassibi. Stft phase retrieval: Uniqueness guarantees and recovery algorithms. IEEE Journal of selected topics in signal processing, 10(4):770โ€“781, 2016. [57] Francis Arthur Jenkins and Harvey Elliott White. Fundamentals of optics. Indian Journal of Physics, 25:265โ€“266, 1957. [58] Yi Jiang, Zhen Chen, Yimo Han, Pratiti Deb, Hui Gao, Saien Xie, Prafull Purohit, Mark W Tate, Jiwoong Park, Sol M Gruner, et al. Electron ptychography of 2d materials to deep sub-รฅngstrรถm resolution. Nature, 559(7714):343โ€“349, 2018. [59] GH John, R Kohavi, and K Pfleger. Machine learning: proceedings of the eleventh interna- tional conference. Irrelevant features and the subset selection problem, 1994. [60] William B Johnson and Joram Lindenstrauss. Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics, 26:189โ€“206, 1984. [61] Aggelos K Katsaggelos and Kuen-Tsair Lay. Maximum likelihood blur identification and im- age restoration using the em algorithm. IEEE Transactions on Signal Processing, 39(3):729โ€“ 733, 1991. [62] Samina Khalid, Tehmina Khalil, and Shamila Nasreen. A survey of feature selection and fea- ture extraction techniques in machine learning. In 2014 science and information conference, pages 372โ€“378. IEEE, 2014. [63] Mojzesz Kirszbraun. รœber die zusammenziehende und lipschitzsche transformationen. Fun- damenta Mathematicae, 22(1):77โ€“108, 1934. [64] Deepa Kundur and Dimitrios Hatzinakos. Blind image restoration via recursive filtering using deterministic constraints. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 4, pages 2283โ€“2286. IEEE, 1996. [65] Deepa Kundur and Dimitrios Hatzinakos. A novel blind deconvolution scheme for image restoration using recursive filtering. IEEE transactions on signal processing, 46(2):375โ€“390, 1998. [66] Marcus Frederick Charles Ladd, Rex Alfred Palmer, and Rex Alfred Palmer. Structure determination by X-ray crystallography, volume 233. Springer, 1977. 115 [67] Kasper Green Larsen and Jelani Nelson. Optimality of the johnson-lindenstrauss lemma. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 633โ€“638. IEEE, 2017. [68] Y. LeCun, C. Cortes, and C.J.C. Burges. The mnist database of handwritten digits. 1998. [69] Anat Levin, Yair Weiss, Fredo Durand, and William T Freeman. Understanding blind deconvolution algorithms. IEEE transactions on pattern analysis and machine intelligence, 33(12):2354โ€“2367, 2011. 
[70] Peng Li, Nicholas W Phillips, Steven Leake, Marc Allain, Felix Hofmann, and Virginie Chamard. Revealing nano-scale lattice distortions in implanted material with 3d bragg ptychography. Nature communications, 12(1):1โ€“13, 2021. [71] Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and computational harmonic analysis, 47(3):893โ€“934, 2019. [72] Aristidis C Likas and Nikolas P Galatsanos. A variational approach for bayesian blind image deconvolution. IEEE transactions on signal processing, 52(8):2222โ€“2233, 2004. [73] Z-C Liu, Rui Xu, and Y-H Dong. Phase retrieval in protein crystallography. Acta Crystallo- graphica Section A: Foundations of Crystallography, 68(2):256โ€“265, 2012. [74] Deepali Lodhia, Daniel Brown, Frank Brueckner, Ludovico Carbone, Paul Fulda, Keiko Kokeyama, and Andreas Freise. Phase effects due to beam misalignment on diffraction gratings. arXiv preprint arXiv:1303.7016, 2013. [75] Sepideh Mahabadi, Konstantin Makarychev, Yury Makarychev, and Ilya Razenshteyn. Non- linear dimension reduction via outer bi-lipschitz extensions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1088โ€“1101, 2018. [76] Andrew Maiden, Daniel Johnson, and Peng Li. Further improvements to the ptychographical iterative engine. Optica, 4(7):736โ€“745, 2017. [77] Andrew M Maiden and John M Rodenburg. An improved ptychographical phase retrieval algorithm for diffractive imaging. Ultramicroscopy, 109(10):1256โ€“1262, 2009. [78] Sami Eid Merhi. Phase Retrieval from Continuous and Discrete Ptychographic Measure- ments. Michigan State University, 2019. [79] Rafael Molina, Aggelos K Katsaggelos, Javier Abad, and Javier Mateos. A bayesian ap- proach to blind deconvolution based on dirichlet distributions. In 1997 IEEE international conference on acoustics, speech, and signal processing, volume 4, pages 2809โ€“2812. IEEE, 1997. 116 [80] Shyam Narayanan and Jelani Nelson. Optimal terminal dimensionality reduction in euclidean space. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1064โ€“1069, 2019. [81] Shyam Narayanan and Jelani Nelson. Optimal terminal dimensionality reduction in euclidean space. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1064โ€“1069, 2019. [82] S Nawab, T Quatieri, and Jae Lim. Signal reconstruction from short-time fourier transform magnitude. IEEE Transactions on Acoustics, Speech, and Signal Processing, 31(4):986โ€“998, 1983. [83] Sameer A Nene, Shree K Nayar, and Hiroshi Murase. Columbia object image library (coil-100). 1996. [84] J v Neumann. Zur theorie der gesellschaftsspiele. Mathematische annalen, 100(1):295โ€“320, 1928. [85] Michal Odstrvcil, Mirko Holler, and Manuel Guizar-Sicairos. Arbitrary-path fly-scan pty- chography. Optics express, 26(10):12585โ€“12593, 2018. [86] Byung Tae Oh, Shaw-min Lei, and C-C Jay Kuo. Advanced film grain noise extraction and synthesis for high-definition video coding. IEEE transactions on circuits and systems for video technology, 19(12):1717โ€“1729, 2009. [87] Olympus. https://www.olympus-lifescience.com/en/microscope-resource/primer/ techniques/oblique/obliqueintro/. [88] Xiaoze Ou, Roarke Horstmeyer, Guoan Zheng, and Changhuei Yang. High numerical aper- ture fourier ptychography: principle, implementation and characterization. Optics express, 23(3):3472โ€“3491, 2015. [89] Michael Perlmutter, Sami Merhi, Aditya Viswanathan, and Mark Iwen. 
Inverting spectro- gram measurements via aliased wigner distribution deconvolution and angular synchroniza- tion. Information and Inference: A Journal of the IMA, 2020. [90] Franz Pfeiffer. X-ray ptychography. Nature Photonics, 12(1):9โ€“17, 2018. [91] M Portnoff. Short-time fourier analysis of sampled speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(3):364โ€“373, 1981. [92] Brian Preskitt and Rayan Saab. Admissible measurements and robust algorithms for pty- chography. Journal of Fourier Analysis and Applications, 27(2):1โ€“39, 2021. [93] Brian P Preskitt. Phase retrieval from locally supported measurements. University of 117 California, San Diego, 2018. [94] Klaus G Puschmann and Franz Kneer. On super-resolution in astronomical imaging. As- tronomy & Astrophysics, 436(1):373โ€“378, 2005. [95] Jianliang Qian, Chao Yang, A Schirotzek, F Maia, and S Marchesini. Efficient algorithms for ptychographic phase retrieval. Inverse Problems and Applications, Contemp. Math, 615:261โ€“280, 2014. [96] JM Rodenburg. Ptychography and related diffractive imaging methods. Advances in Imaging and Electron Physics, 150:87โ€“184, 2008. [97] P Rosero-Montalvo, P Diaz, Jose Alejandro Salazar-Castro, DF Pena-Unigarro, Andres J Anaya-Isaza, Juan C Alvarado-Pรฉrez, Roberto Therรณn, and Diego Hernรกn Peluffo-Ordรณรฑez. Interactive data visualization using dimensionality reduction and similarity-based repre- sentations. In IberoAmerican Congress on Pattern Recognition, pages 334โ€“342. Springer, 2017. [98] P Yu Rotha and David M Paganin. Blind phase retrieval for aberrated linear shift-invariant imaging systems. New Journal of Physics, 12(7):073040, 2010. [99] John W Sammon. A nonlinear mapping for data structure analysis. IEEE Transactions on computers, 100(5):401โ€“409, 1969. [100] Pavel Sidorenko and Oren Cohen. Single-shot ptychography. Optica, 3(1):9โ€“14, 2016. [101] VI Slyusar. A family of face products of matrices and its properties. Cybernetics and systems analysis, 35(3):379โ€“384, 1999. [102] George F Smoot and Douglas Scott. Cosmic background radiation. The European Physical Journal C-Particles and Fields, 15:145โ€“149, 2000. [103] MS Smyth and JHJ Martin. x ray crystallography. Molecular Pathology, 53(1):8, 2000. [104] D.A. Spielman. Spectral and algebraic graph theory. Incomplete Draft, Yale University, 2019. [105] Filip Sroubek and Jan Flusser. Multichannel blind iterative image restoration. IEEE Trans- actions on Image Processing, 12(9):1094โ€“1106, 2003. [106] J-L Starck and Fionn Murtagh. Astronomical image and data analysis. 2007. [107] Marco Stockmar, Peter Cloetens, Irene Zanette, Bjoern Enders, Martin Dierolf, Franz Pfeif- fer, and Pierre Thibault. Near-field ptychography: phase retrieval for inline holography using a structured illumination. Scientific reports, 3(1):1โ€“6, 2013. 118 [108] Marco Stockmar, Irene Zanette, Martin Dierolf, Bjoern Enders, Richard Clare, Franz Pfeiffer, Peter Cloetens, Anne Bonnin, and Pierre Thibault. X-ray near-field ptychography for optically thick specimens. Physical Review Applied, 3(1):014005, 2015. [109] Yukio Takahashi, Akihiro Suzuki, Shin Furutaku, Kazuto Yamauchi, Yoshiki Kohmura, and Tetsuya Ishikawa. Bragg x-ray ptychography of a silicon crystal: Visualization of the dislo- cation strain field and the production of a vortex beam. Physical Review B, 87(12):121201, 2013. [110] Pierre Thibault, Martin Dierolf, Oliver Bunk, Andreas Menzel, and Franz Pfeiffer. Probe retrieval in ptychographic coherent diffractive imaging. 
Ultramicroscopy, 109(4):338–343, 2009.
[111] Pierre Thibault, Martin Dierolf, Andreas Menzel, Oliver Bunk, Christian David, and Franz Pfeiffer. High-resolution scanning x-ray diffraction microscopy. Science, 321(5887):379–382, 2008.
[112] Eric Thiébaut and J-M Conan. Strict a priori constraints for maximum-likelihood blind deconvolution. JOSA A, 12(3):485–492, 1995.
[113] Christoph Thäle. 50 years sets with positive reach–a survey. Surveys in Mathematics and its Applications, 3:123–165, 2008. Publisher: University Constantin Brancusi.
[114] Lei Tian, Xiao Li, Kannan Ramchandran, and Laura Waller. Multiplexed coded illumination for fourier ptychography with an led array microscope. Biomedical optics express, 5(7):2376–2389, 2014.
[115] Esther HR Tsai, Ivan Usov, Ana Diaz, Andreas Menzel, and Manuel Guizar-Sicairos. X-ray ptychography with extended depth of field. Optics express, 24(25):29089–29108, 2016.
[116] Robert N Tubbs. Lucky exposures: Diffraction limited astronomical imaging through the atmosphere. arXiv preprint astro-ph/0311481, 2003.
[117] Roman Vershynin. High-dimensional probability: an introduction with applications in data science. Number 47 in Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, New York, NY, 2018.
[118] Aditya Viswanathan and Mark Iwen. Fast angular synchronization for phase retrieval via incomplete information. In Wavelets and Sparsity XVI, volume 9597, page 959718. International Society for Optics and Photonics, 2015.
[119] Adriaan Walther. The question of phase retrieval in optics. Optica Acta: International Journal of Optics, 10(1):41–49, 1963.
[120] DP Woody and PL Richards. Spectrum of the cosmic background radiation. Physical Review Letters, 42(14):925, 1979.
[121] Michael M Woolfson and Michael Mark Woolfson. An introduction to X-ray crystallography. Cambridge University Press, 1997.
[122] Shixiang Wu, Chao Dong, and Yu Qiao. Blind image restoration based on cycle-consistent network. IEEE Transactions on Multimedia, 2022.
[123] H Yang, RN Rutte, L Jones, M Simson, R Sagawa, H Ryll, M Huth, TJ Pennycook, MLH Green, H Soltau, et al. Simultaneous atomic-resolution electron ptychography and z-contrast imaging of light and heavy elements in complex nanostructures. Nature Communications, 7(1):1–8, 2016.
[124] Yu-Li You and Mostafa Kaveh. Blind image restoration by anisotropic regularization. IEEE Transactions on Image Processing, 8(3):396–407, 1999.
[125] He Zhang, Shaowei Jiang, Jun Liao, Junjing Deng, Jian Liu, Yongbing Zhang, and Guoan Zheng. Near-field fourier ptychography: super-resolution phase retrieval via speckle illumination. Optics express, 27(5):7498–7512, 2019.
[126] Guoan Zheng. Fourier ptychographic imaging: a MATLAB tutorial. Morgan & Claypool Publishers, 2016.
[127] Guoan Zheng, Roarke Horstmeyer, and Changhuei Yang. Wide-field, high-resolution fourier ptychographic microscopy. Nature photonics, 7(9):739–745, 2013.
[128] Guoan Zheng, Cheng Shen, Shaowei Jiang, Pengming Song, and Changhuei Yang. Concept, implementations and applications of fourier ptychography. Nature Reviews Physics, 3(3):207–223, 2021.
APPENDIX A NEAR-FIELD PTYCHOGRAPHY
A.1 TECHNICAL LEMMAS
We state the following lemmas for the sake of completeness. Note that we index all vectors modulo d, and that the inner product below is conjugate-linear in its second argument.
Lemma A.1.1. Let x, y ∈ C^d. We have that |⟨x, y⟩|² = |⟨x̄, ȳ⟩|².
Lemma A.1.2. Let x, y, z ∈ C^d. We have that ⟨x, y ◦ z⟩ = ⟨x ◦ ȳ, z⟩.
Proof. By the definition of the inner product and the Hadamard product,
$$\langle \mathbf{x}, \mathbf{y} \circ \mathbf{z}\rangle = \sum_{n=0}^{d-1} x_n \overline{(\mathbf{y} \circ \mathbf{z})_n} = \sum_{n=0}^{d-1} x_n \bar{y}_n \bar{z}_n = \sum_{n=0}^{d-1} (\mathbf{x} \circ \bar{\mathbf{y}})_n \bar{z}_n = \langle \mathbf{x} \circ \bar{\mathbf{y}}, \mathbf{z}\rangle.$$
Lemma A.1.3. Let x, y ∈ C^d, k ∈ Z. We have that
1. (x ∗ y)_k = ⟨S₋ₖx̃, ȳ⟩;
2. x ◦ Sₖy = Sₖ(S₋ₖx ◦ y);
3. ⟨Sₖx, y⟩ = ⟨x, S₋ₖy⟩.
Proof. Proof of 1: Let x, y ∈ C^d. By the definition of the circular convolution,
$$(\mathbf{x} \ast \mathbf{y})_k = \sum_{n=0}^{d-1} x_{k-n}\, y_n = \sum_{n=0}^{d-1} (S_{-k}\tilde{\mathbf{x}})_n\, y_n = \langle S_{-k}\tilde{\mathbf{x}}, \bar{\mathbf{y}}\rangle.$$
Proof of 2: Let x, y ∈ C^d, k ∈ Z, and let n ∈ [d]₀ be arbitrary. Then we have that
$$(\mathbf{x} \circ S_k\mathbf{y})_n = x_n (S_k\mathbf{y})_n = (S_{-k}\mathbf{x})_{n+k}\, y_{n+k} = (S_k(S_{-k}\mathbf{x} \circ \mathbf{y}))_n.$$
Proof of 3: Noting that we index modulo d, we have that
$$\langle S_k\mathbf{x}, \mathbf{y}\rangle = \sum_{n=0}^{d-1} x_{n+k}\,\bar{y}_n = \sum_{n=0}^{d-1} x_n\,\bar{y}_{n-k} = \langle \mathbf{x}, S_{-k}\mathbf{y}\rangle.$$
We now give our proof of Lemma 2.5.1.
Proof of Lemma 2.5.1. Fix φ ∈ [0, 2π). By the triangle inequality we have
$$\|\mathbf{x} - e^{i\phi}\mathbf{x}_{\mathrm{est}}\|_2 = \|\mathbf{x}^{(\mathrm{mag})} \circ \mathbf{x}^{(\theta)} - \mathbf{x}_{\mathrm{est}}^{(\mathrm{mag})} \circ e^{i\phi}\mathbf{x}_{\mathrm{est}}^{(\theta)}\|_2 \le \|\mathbf{x}^{(\mathrm{mag})} \circ \mathbf{x}^{(\theta)} - \mathbf{x}^{(\mathrm{mag})} \circ e^{i\phi}\mathbf{x}_{\mathrm{est}}^{(\theta)}\|_2 + \|\mathbf{x}^{(\mathrm{mag})} \circ e^{i\phi}\mathbf{x}_{\mathrm{est}}^{(\theta)} - \mathbf{x}_{\mathrm{est}}^{(\mathrm{mag})} \circ e^{i\phi}\mathbf{x}_{\mathrm{est}}^{(\theta)}\|_2. \quad (\mathrm{A.1})$$
For the first term, we may use the inequality ∥u ◦ v∥₂ ≤ ∥u∥_∞ ∥v∥₂ to see that
$$\|\mathbf{x}^{(\mathrm{mag})} \circ \mathbf{x}^{(\theta)} - \mathbf{x}^{(\mathrm{mag})} \circ e^{i\phi}\mathbf{x}_{\mathrm{est}}^{(\theta)}\|_2 \le \|\mathbf{x}^{(\mathrm{mag})}\|_\infty \|\mathbf{x}^{(\theta)} - e^{i\phi}\mathbf{x}_{\mathrm{est}}^{(\theta)}\|_2 = \|\mathbf{x}\|_\infty \|\mathbf{x}_{\mathrm{est}}^{(\theta)} - e^{-i\phi}\mathbf{x}^{(\theta)}\|_2. \quad (\mathrm{A.2})$$
For the second term, since the entries of the phase vector have unit modulus, we see that
$$\|\mathbf{x}^{(\mathrm{mag})} \circ e^{i\phi}\mathbf{x}_{\mathrm{est}}^{(\theta)} - \mathbf{x}_{\mathrm{est}}^{(\mathrm{mag})} \circ e^{i\phi}\mathbf{x}_{\mathrm{est}}^{(\theta)}\|_2 \le \|e^{i\phi}\mathbf{x}_{\mathrm{est}}^{(\theta)}\|_\infty \cdot \|\mathbf{x}^{(\mathrm{mag})} - \mathbf{x}_{\mathrm{est}}^{(\mathrm{mag})}\|_2 = \|\mathbf{x}^{(\mathrm{mag})} - \mathbf{x}_{\mathrm{est}}^{(\mathrm{mag})}\|_2. \quad (\mathrm{A.3})$$
Combining (A.2) and (A.3) with (A.1) and minimizing over φ completes the proof.
A.2 AUXILIARY RESULTS FROM SPECTRAL GRAPH THEORY
In this section, we will prove several lemmas related to the graph Laplacian and its eigenvalues. The following definition defines a partial ordering on the set of weighted graphs induced by the spectrum of their graph Laplacians.
Definition A.2.1. We say that a symmetric matrix A is positive semi-definite and write A ⪰ 0 if x^T A x ≥ 0 for all x ∈ R^n (or equivalently, if all the eigenvalues of A are non-negative). We define the Loewner order¹ ⪰ by the rule that A ⪰ B if A − B is positive semi-definite (or equivalently, if x^T A x ≥ x^T B x for all x ∈ R^n). For two graphs G and H with the same number of vertices, we will define G ⪰ H if L_G ⪰ L_H. We will also write G ⪰ Σ_{i=0}^{n−1} H_i if L_G ⪰ Σ_{i=0}^{n−1} L_{H_i}, and for a scalar c we will write G ⪰ cH if L_G ⪰ cL_H.
¹The Loewner order is actually a partial ordering since there exist A and B such that neither A ⪰ B nor B ⪰ A holds.
Remark A.2.1. If G ⪰ H and τ_G and τ_H are the smallest non-zero eigenvalues of L_G and L_H, then one can use the fact that τ_G = min_{x∈R^n, x⊥1} (x^T L_G x)/(x^T x) (see [104]) to verify that τ_G ≥ τ_H.
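As a quick numerical illustration of Remark A.2.1, the following NumPy sketch (our own toy example, not taken from the text) computes the spectral gap τ_G as the second-smallest eigenvalue of the weighted graph Laplacian, and checks that adding an edge, which adds a positive semi-definite term to L_G and hence makes the new graph dominate the old one in the Loewner order, can only increase the gap.

```python
import numpy as np

def spectral_gap(W):
    """Return tau_G, the smallest nonzero Laplacian eigenvalue, for a
    connected weighted graph with symmetric weight matrix W (L_G = D - W)."""
    L = np.diag(W.sum(axis=1)) - W
    return np.sort(np.linalg.eigvalsh(L))[1]   # eigenvalue 0 is discarded

# A small weighted graph on 4 vertices.
W = np.zeros((4, 4))
for i, j, w in [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.0), (3, 0, 0.5)]:
    W[i, j] = W[j, i] = w
gap_before = spectral_gap(W)

# Adding an edge adds a PSD rank-one Laplacian term, so the new graph
# dominates the old one in the Loewner order and the gap cannot decrease.
W[0, 2] = W[2, 0] = 1.0
assert spectral_gap(W) >= gap_before - 1e-12
```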
We now define some basic terminology for weighted graphs. (We note that these definitions may also be applied to unweighted graphs by interpreting each edge as having weight one.)
Definition A.2.2. (Weighted Distance Definitions) Let G = (V, E, W) be a weighted graph.
(i) For any subgraph H = (V′, E′) of G, we define the weight of H, denoted w(H), as w(H) := Σ_{(i,j)∈E′} W_{i,j}.
(ii) If P is a path inside G, we will let len(P) := w(P) denote the weighted length of P.
(iii) We define the weighted distance between two vertices u and v, dist_G(u, v), to be the minimal weighted length of any path from u to v.
(iv) The weighted diameter of G, denoted by diam(G), is the maximum distance between any two vertices in G, that is, diam(G) := max{dist_G(u, v) | (u, v) ∈ V × V}.
In some contexts, it will be useful to consider the pointwise inverses of the weights W_{i,j}.
Definition A.2.3. (Inverse Weighted Distance Definitions) Let G = (V, E, W) be a weighted graph.
(i) For any subgraph H = (V′, E′) of G, the inverse weight of H is defined by w^{−1}(H) := Σ_{(i,j)∈E′} 1/W_{i,j}.
(ii) For a path P inside G, we refer to len^{−1}(P) := w^{−1}(P) as the inverse weighted length of P.
(iii) For two vertices u and v we will refer to the minimal value of w^{−1}(P) over all paths P from u to v as the inverse weighted distance, denoted by dist_G^{−1}(u, v).
(iv) The inverse weighted diameter of G, denoted by diam^{−1}(G), is the maximum inverse weighted distance between any two vertices in G, that is, diam^{−1}(G) := max{dist_G^{−1}(u, v) | (u, v) ∈ V × V}.
The proof of Lemma 2.5.4 (and thus Theorem 2.2.1) relies on the following lemma to provide a lower bound for the spectral gap τ_G.
Lemma A.2.1. (Weighted Spectral Bound) Let G = (V, E, W) be a weighted, connected graph with |V| = n, and let W_min and W_max denote the minimum and maximum value of any of the (nonzero) weights of G. Then
$$\tau_G \ge \frac{2 \cdot W_{\min}}{W_{\max}\,(n-1)\cdot \mathrm{diam}^{-1}(G)}.$$
To prove Lemma A.2.1, we recall the following lemma from [104].
Lemma A.2.2. (Weighted Path Inequality) (Lemma 5.6.1, [104]) Let P_n = (v_0, v_1, ..., v_{n−1}) be a path of length n and assume that, for all 0 ≤ i ≤ n − 2, the weight w_i of (v_i, v_{i+1}) is strictly positive. For 0 ≤ i ≤ n − 2, let G_{i,i+1} = (V, (v_i, v_{i+1})) be the graph whose vertex set V is the same as the vertex set of G but which has only the single edge (v_i, v_{i+1}). Similarly, let G_{0,n−1} = (V, (v_0, v_{n−1})) be the graph with only a single edge (v_0, v_{n−1}). Then
$$G_{0,n-1} \preceq \left(\sum_{i=0}^{n-2} \frac{1}{w_i}\right)\left(\sum_{i=0}^{n-2} w_i\, G_{i,i+1}\right) = \mathrm{len}^{-1}(P_n)\cdot P_n,$$
where the final equality is interpreted in the sense of A ⪯ B and B ⪯ A.
The Proof of Lemma A.2.1. For u, v ∈ V, let G_{u,v} = (V, (u, v)) denote the graph with only a single edge from u to v and let P_{u,v} denote a path from u to v with minimal inverse weighted length.
Then, by Lemma A.2.2, we have
$$G_{u,v} \preceq \mathrm{len}^{-1}(P_{u,v}(G)) \cdot P_{u,v}(G) \preceq \mathrm{diam}^{-1}(G) \cdot P_{u,v}(G) \preceq \mathrm{diam}^{-1}(G) \cdot G,$$
where the last inequality holds since H ⪯ G for all subgraphs H of a graph G (Section 5.2, [104]). Let K̃_n be the extended weighted, complete graph on n vertices with weight matrix W̃, where W̃_{i,j} = W_{i,j} if (i,j) ∈ E and W̃_{i,j} = W_min if (i,j) ∉ E. Then, by summing over all pairs of vertices, we have that
$$L_{\tilde{K}_n} = \sum_{0 \le i < j \le n-1} \tilde{W}_{i,j}\, L_{G_{i,j}} \preceq W_{\max} \sum_{0 \le i < j \le n-1} \mathrm{diam}^{-1}(G) \cdot L_G.$$
Since Σ_{0≤i<j≤n−1} 1 = n(n−1)/2, we then have that
$$\tilde{K}_n \preceq \frac{W_{\max}\, n(n-1)}{2}\, \mathrm{diam}^{-1}(G) \cdot G,$$
which, by Remark A.2.1, implies
$$\tau_{\tilde{K}_n} \le \frac{W_{\max}\, n(n-1)}{2}\, \mathrm{diam}^{-1}(G)\, \tau_G, \qquad \text{and therefore} \qquad \tau_G \ge \frac{2\,\tau_{\tilde{K}_n}}{W_{\max}\, n(n-1) \cdot \mathrm{diam}^{-1}(G)}.$$
Under this construction, we see that if we have a weight which is much larger than all of the others, it effectively gets nullified by taking the inverse. Letting K_n be the unweighted complete graph on n vertices, we see that
$$\mathbf{x}^T L_{\tilde{K}_n} \mathbf{x} = \sum_{(a,b)} \tilde{W}_{a,b}\,(x(a)-x(b))^2 \ge W_{\min} \sum_{(a,b)} (x(a)-x(b))^2 = W_{\min}\, \mathbf{x}^T L_{K_n} \mathbf{x}.$$
Therefore, τ_{K̃_n} = min_{x⊥1} (x^T L_{K̃_n} x)/(x^T x) ≥ W_min · min_{x⊥1} (x^T L_{K_n} x)/(x^T x) = W_min · τ_{K_n}. Thus, since τ_{K_n} = n (Section 5.4.1, [104]), we have that
$$\tau_G \ge \frac{2\, W_{\min}}{W_{\max}\,(n-1)\cdot \mathrm{diam}^{-1}(G)}.$$
Our next result uses Lemma A.2.1 to produce a bound for τ_G in terms of the diameter of the underlying unweighted graph.
Theorem A.2.1. Let G = (V, E, W) be a weighted graph and let W_min and W_max be the minimum and maximum value of any of its (nonzero) weights. Then
$$\tau_G \ge \frac{2 \cdot (W_{\min})^2}{W_{\max}\,(n-1)\,\mathrm{diam}(G_{\mathrm{unw}})},$$
where G_unw = (V, E) is the unweighted counterpart of G.
Proof. Let G′ = (V, E, W′), where W′_{i,j} = 1/W_{i,j} if W_{i,j} ≠ 0 and W′_{i,j} = 0 otherwise. Let W′_max be the maximum element of W′. Observe that, by construction, we have W′_max = 1/W_min. Moreover, it follows immediately from Definition A.2.2 that diam^{−1}(G) = diam(G′). Therefore,
$$\mathrm{diam}^{-1}(G) = \mathrm{diam}(G') \le W'_{\max} \cdot \mathrm{diam}(G_{\mathrm{unw}}) = \frac{\mathrm{diam}(G_{\mathrm{unw}})}{W_{\min}}.$$
So by Lemma A.2.1, we have that
$$\tau_G \ge \frac{2 \cdot W_{\min}}{W_{\max}\,(n-1)\cdot \mathrm{diam}^{-1}(G)} \ge \frac{2 \cdot (W_{\min})^2}{W_{\max}\,(n-1)\,\mathrm{diam}(G_{\mathrm{unw}})}.$$
APPENDIX B FAR-FIELD PTYCHOGRAPHY
B.1 SUB-SAMPLING
In this section, we discuss sub-sampling lemmas that can be used in conjunction with Algorithm 3.1. In many cases, an illumination of the sample can cause damage to the sample, and applying the illumination beam (which can be highly irradiative) repeatedly at a single point can destroy it. Considering the risks to the sample and the costs of operating the measurement equipment, there are strong incentives to reduce the number of illuminations applied to any object.
Definition B.1.1. Let s ∈ N be such that s | d. We define the sub-sampling operator Z_s : C^d → C^{d/s} component-wise via
$$(Z_s \mathbf{x})_n := x_{n \cdot s}, \quad \forall n \in [d/s]_0. \quad (\mathrm{B.1})$$
We now have an aliasing lemma which allows us to see the impact of performing the Fourier transform on a sub-sampled specimen.
Lemma B.1.1. (Aliasing) ([78], Lemma 2.0.1.) Let s ∈ N be such that s | d, x ∈ C^d, and ω ∈ [d/s]₀. Then we have that
$$\left(F_{d/s}(Z_s \mathbf{x})\right)_\omega = \frac{1}{s} \sum_{r=0}^{s-1} \hat{x}_{\omega - r\frac{d}{s}}. \quad (\mathrm{B.2})$$
Proof. Let d ∈ N and suppose s ∈ N divides d. Let x ∈ C^d and ω ∈ [d/s]₀ be arbitrary. By the definition of the discrete Fourier transform and the sub-sampling operator, we have that
$$\left(F_{d/s}(Z_s \mathbf{x})\right)_\omega = \sum_{n=0}^{d/s-1} (Z_s \mathbf{x})_n\, e^{-2\pi i n \omega / (d/s)} = \sum_{n=0}^{d/s-1} x_{ns}\, e^{-2\pi i n \omega s / d}. \quad (\mathrm{B.3})$$
By the inverse DFT and by collecting terms, we have that
$$\sum_{n=0}^{d/s-1} x_{ns}\, e^{-2\pi i n \omega s / d} = \frac{1}{d} \sum_{n=0}^{d/s-1} \sum_{r=0}^{d-1} \hat{x}_r\, e^{2\pi i n (r-\omega) s / d}. \quad (\mathrm{B.4})$$
By treating this as a sum of DFTs (the inner geometric sum over n vanishes unless r ≡ ω mod d/s, in which case it equals d/s), we then have that
$$\frac{1}{d} \sum_{n=0}^{d/s-1} \sum_{r=0}^{d-1} \hat{x}_r\, e^{2\pi i n (r-\omega) s / d} = \frac{1}{s} \sum_{r=0}^{s-1} \hat{x}_{\omega + r\frac{d}{s}} = \frac{1}{s} \sum_{r=0}^{s-1} \hat{x}_{\omega - r\frac{d}{s}}. \quad (\mathrm{B.5})$$
Before we start looking at aliased WDD, we need to introduce a lemma which will show the effect of taking a Fourier transform of an autocorrelation.
Lemma B.1.2. (Fourier Transform of Autocorrelation) ([78], Lemma 2.0.2.) Let x ∈ C^d and α, ω ∈ [d]₀. Then
$$\left(F_d(\mathbf{x} \circ S_\omega \bar{\mathbf{x}})\right)_\alpha = \frac{1}{d}\, e^{2\pi i \omega \alpha / d} \left(F_d(\hat{\mathbf{x}} \circ S_{-\alpha} \overline{\hat{\mathbf{x}}})\right)_\omega. \quad (\mathrm{B.6})$$
Proof. Let x ∈ C^d and let α, ω ∈ [d]₀ be arbitrary. By the convolution theorem, we have that
$$\left(F_d(\mathbf{x} \circ S_\omega \bar{\mathbf{x}})\right)_\alpha = \frac{1}{d}\left(\hat{\mathbf{x}} \ast_d F_d(S_\omega \bar{\mathbf{x}})\right)_\alpha. \quad (\mathrm{B.7})$$
By technical equality (iii), we can convert the Fourier transform of the shift into a modulation of the Fourier transform, so that by the definitions of the convolution and the modulation operator W_ω,
$$\frac{1}{d}\left(\hat{\mathbf{x}} \ast_d F_d(S_\omega \bar{\mathbf{x}})\right)_\alpha = \frac{1}{d}\left(\hat{\mathbf{x}} \ast_d (W_\omega F_d(\bar{\mathbf{x}}))\right)_\alpha = \frac{1}{d} \sum_{n=0}^{d-1} \hat{x}_n\, (F_d(\bar{\mathbf{x}}))_{\alpha-n}\, e^{2\pi i \omega (\alpha-n)/d}. \quad (\mathrm{B.8})$$
By applying reversals and using that the reversal of a reversal returns the original vector, we have (F_d(x̄))_{α−n} = conj(x̂_{n−α}), so that
$$\frac{1}{d} \sum_{n=0}^{d-1} \hat{x}_n\, (F_d(\bar{\mathbf{x}}))_{\alpha-n}\, e^{2\pi i \omega (\alpha-n)/d} = \frac{1}{d}\, e^{2\pi i \omega \alpha / d} \sum_{n=0}^{d-1} \hat{x}_n\, \overline{\hat{x}_{n-\alpha}}\, e^{-2\pi i \omega n / d}. \quad (\mathrm{B.9})$$
Finally, by applying technical equality (vi) and using the definitions of the shift operator and the Hadamard product, we have that
$$\frac{1}{d}\, e^{2\pi i \omega \alpha / d} \sum_{n=0}^{d-1} \hat{x}_n\, \overline{\hat{x}_{n-\alpha}}\, e^{-2\pi i \omega n / d} = \frac{1}{d}\, e^{2\pi i \omega \alpha / d} \left(F_d(\hat{\mathbf{x}} \circ S_{-\alpha} \overline{\hat{\mathbf{x}}})\right)_\omega.$$
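Lemma B.1.1 is easy to check numerically. The following short NumPy snippet (illustrative only) verifies (B.2) for a random vector, using the fact that NumPy's unnormalized FFT matches the DFT convention used above.

```python
import numpy as np

d, s = 12, 3
rng = np.random.default_rng(1)
x = rng.standard_normal(d) + 1j * rng.standard_normal(d)

lhs = np.fft.fft(x[::s])        # F_{d/s}(Z_s x), a vector of length d/s
xhat = np.fft.fft(x)            # unnormalized DFT, matching F_d above
# np.roll(xhat, r*d//s)[w] = xhat[(w - r*d/s) % d], i.e. the term in (B.2)
rhs = sum(np.roll(xhat, r * d // s) for r in range(s))[: d // s] / s
assert np.allclose(lhs, rhs)    # (B.2) holds
```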
B.1.1 SUB-SAMPLING IN FREQUENCY

We will first look at sub-sampling in frequency.

Definition B.1.2. Let $K$ be a positive factor of $d$, and assume that the data is measured at $K$ equally spaced Fourier modes. We denote the set of Fourier modes of step-size $\frac{d}{K}$ by
\[
\mathcal{K} = \frac{d}{K}[K]_0 = \left\{0, \frac{d}{K}, \frac{2d}{K}, \dots, d - \frac{d}{K}\right\}. \tag{B.10}
\]

Definition B.1.3. Let $\mathbf{A} \in \mathbb{C}^{d \times d}$ with columns $\mathbf{a}_j$, and let $K \mid d$. We denote by $\mathbf{A}_{K,d} \in \mathbb{C}^{K \times d}$ the sub-matrix of $\mathbf{A}$ whose $\ell$th column is equal to $Z_{d/K}(\mathbf{a}_\ell)$.

With these definitions, we will now convert the sub-sampled measurements into a more solvable form.

Lemma B.1.3 ([78], Lemma 2.1.1). Suppose that the noisy spectrogram measurements are collected on a subset $\mathcal{K} \subseteq [d]_0$ of $K$ equally spaced Fourier modes. Then, for any $\omega \in [K]_0$,
\[
\left((F_K \mathbf{Y}_{K,d})^T\right)_\omega = K \sum_{r=0}^{\frac{d}{K}-1} \left((\mathbf{x} \circ S_{\omega - rK}\bar{\mathbf{x}}) \ast_d (\tilde{\mathbf{m}} \circ S_{rK - \omega}\bar{\tilde{\mathbf{m}}})\right) + \left((F_K \mathbf{N}_{K,d})^T\right)_\omega,
\]
where $\mathbf{Y}_{K,d} - \mathbf{N}_{K,d} \in \mathbb{C}^{K \times d}$ is the matrix of sub-sampled noiseless $K \cdot d$ measurements.

Proof. For $\ell \in [d]_0$, the $\ell$th column of the matrix $\mathbf{Y}$ is
\[
\mathbf{y}_\ell = F_d\left((\mathbf{x} \circ S_{-\ell}\mathbf{m}) \ast_d (\bar{\tilde{\mathbf{x}}} \circ S_\ell \bar{\tilde{\mathbf{m}}})\right) + \boldsymbol{\eta}_\ell, \tag{B.11}
\]
and thus, for any $\alpha \in [K]_0$,
\[
\left(Z_{d/K}(\mathbf{y}_\ell)\right)_\alpha = \left(F_d\left((\mathbf{x} \circ S_{-\ell}\mathbf{m}) \ast_d (\bar{\tilde{\mathbf{x}}} \circ S_\ell \bar{\tilde{\mathbf{m}}})\right)\right)_{\alpha \frac{d}{K}} + \left(Z_{d/K}(\boldsymbol{\eta}_\ell)\right)_\alpha, \tag{B.12}
\]
and, by the aliasing lemma (with $s = \frac{d}{K}$),
\[
\left(F_K(Z_{d/K}(\mathbf{y}_\ell))\right)_\omega = \frac{K}{d}\sum_{r=0}^{\frac{d}{K}-1} (\hat{\mathbf{y}}_\ell)_{\omega - rK} = d \cdot \frac{K}{d}\sum_{r=0}^{\frac{d}{K}-1} \left((\mathbf{x} \circ S_{\omega - rK}\bar{\mathbf{x}}) \ast_d (\tilde{\mathbf{m}} \circ S_{rK - \omega}\bar{\tilde{\mathbf{m}}})\right)_\ell + \left(F_K(Z_{d/K}(\boldsymbol{\eta}_\ell))\right)_\omega.
\]
The $\ell$th column of $\mathbf{Y}_{K,d} \in \mathbb{C}^{K \times d}$ is equal to $Z_{d/K}(\mathbf{y}_\ell)$. Then, for any $\omega \in [K]_0$, the $\omega$th column of $(F_K \mathbf{Y}_{K,d})^T \in \mathbb{C}^{d \times K}$ may be computed as
\[
\left((F_K \mathbf{Y}_{K,d})^T\right)_\omega = K \sum_{r=0}^{\frac{d}{K}-1} \left((\mathbf{x} \circ S_{\omega - rK}\bar{\mathbf{x}}) \ast_d (\tilde{\mathbf{m}} \circ S_{rK - \omega}\bar{\tilde{\mathbf{m}}})\right) + \left((F_K \mathbf{N}_{K,d})^T\right)_\omega. \tag{B.13}
\]

B.1.2 SUB-SAMPLING IN FREQUENCY AND SPACE

We will now look at sub-sampling in both frequency and space.

Definition B.1.4. Let $L$ be a positive factor of $d$, and suppose measurements are collected at $L$ equally spaced physical shifts of step-size $\frac{d}{L}$. We denote the set of shifts by $\mathcal{L}$, that is,
\[
\mathcal{L} = \frac{d}{L}[L]_0 = \left\{0, \frac{d}{L}, \frac{2d}{L}, \dots, d - \frac{d}{L}\right\}. \tag{B.14}
\]

Definition B.1.5. Let $\mathbf{A} \in \mathbb{C}^{d \times d}$ and $L \mid d$. We denote by $\mathbf{A}_{d,L} \in \mathbb{C}^{d \times L}$ the sub-matrix of $\mathbf{A}$ whose rows are those of $\mathbf{A}$, sub-sampled in step-size $\frac{d}{L}$.
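Before moving on, we note that the index sets and sub-matrices of Definitions B.1.2 through B.1.5 are simple strided selections. The following sketch (with arbitrary illustrative sizes) makes the correspondence explicit:

\begin{verbatim}
import numpy as np

d, K, L = 12, 4, 3                    # require K | d and L | d
A = np.arange(d * d).reshape(d, d)    # stand-in for a d x d data matrix

K_set = (d // K) * np.arange(K)       # Definition B.1.2: {0, d/K, ..., d - d/K}
A_Kd = A[::d // K, :]                 # Definition B.1.3: each column is
                                      # Z_{d/K} of a column of A; shape (K, d)

L_set = (d // L) * np.arange(L)       # Definition B.1.4: {0, d/L, ..., d - d/L}
A_dL = A[:, ::d // L]                 # Definition B.1.5: rows of A sub-sampled
                                      # in step-size d/L; shape (d, L)

print(K_set, A_Kd.shape, L_set, A_dL.shape)
\end{verbatim}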
We will now prove a similar lemma as before, but now we sub-sample in both frequency and space.

Lemma B.1.4 ([78], Lemma 2.1.2). Suppose we have noisy spectrogram measurements collected on a subset $\mathcal{K} \subseteq [d]_0$ of $K$ equally spaced frequencies and a subset $\mathcal{L} \subseteq [d]_0$ of $L$ equally spaced physical shifts. Then, for any $\omega \in [K]_0$ and $\alpha \in [L]_0$,
\[
\left(F_L\left(\mathbf{Y}^T_{K,L}(F^T_K)_\omega\right)\right)_\alpha = \frac{KL}{d}\sum_{r=0}^{\frac{d}{K}-1}\sum_{\ell=0}^{\frac{d}{L}-1} \left(F_d\left((\mathbf{x} \circ S_{\omega-rK}\bar{\mathbf{x}}) \ast_d (\tilde{\mathbf{m}} \circ S_{rK-\omega}\bar{\tilde{\mathbf{m}}})\right)\right)_{\alpha - \ell L} + \left(F_L\left(\mathbf{N}^T_{K,L}(F^T_K)_\omega\right)\right)_\alpha,
\]
where $\mathbf{Y}_{K,L} - \mathbf{N}_{K,L} \in \mathbb{C}^{K \times L}$ is the matrix of sub-sampled noiseless $K \cdot L$ measurements.

Proof. For fixed $\ell \in [d]_0$ and $\omega \in [K]_0$, we have already computed that the noiseless portion of the measurements satisfies
\[
\left(F_K(Z_{d/K}(\mathbf{y}_\ell))\right)_\omega = K\sum_{r=0}^{\frac{d}{K}-1}\left((\mathbf{x} \circ S_{\omega-rK}\bar{\mathbf{x}}) \ast_d (\tilde{\mathbf{m}} \circ S_{rK-\omega}\bar{\tilde{\mathbf{m}}})\right)_\ell.
\]
Fix $\omega \in [K]_0$ and define the vector $\mathbf{p}_\omega \in \mathbb{C}^L$ by
\[
(\mathbf{p}_\omega)_\ell := \left(F_K(Z_{d/K}(\mathbf{y}_{\ell \frac{d}{L}}))\right)_\omega, \quad \forall \ell \in [L]_0.
\]
Note that the rows of $\mathbf{Y}_{K,L}, \mathbf{N}_{K,L} \in \mathbb{C}^{K \times L}$ are those of $\mathbf{Y}_{K,d}, \mathbf{N}_{K,d} \in \mathbb{C}^{K \times d}$, sub-sampled in step-size $\frac{d}{L}$. Thus
\[
(\mathbf{p}_\omega)_\ell = \left((\mathbf{Y}_{K,L})^T(F^T_K)_\omega\right)_\ell = \left((\mathbf{Y}_{K,L} - \mathbf{N}_{K,L})^T(F^T_K)_\omega\right)_\ell + \left((\mathbf{N}_{K,L})^T(F^T_K)_\omega\right)_\ell,
\]
where $(F^T_K)_\omega \in \mathbb{C}^K$ is the $\omega$th column of $F^T_K$. Therefore,
\[
\mathbf{p}_\omega = \mathbf{Y}^T_{K,L}(F^T_K)_\omega \in \mathbb{C}^L, \quad \forall \omega \in [K]_0.
\]
For any $\ell \in [L]_0$, we have
\[
(\mathbf{p}_\omega)_\ell = K\sum_{r=0}^{\frac{d}{K}-1}\left((\mathbf{x} \circ S_{\omega-rK}\bar{\mathbf{x}}) \ast_d (\tilde{\mathbf{m}} \circ S_{rK-\omega}\bar{\tilde{\mathbf{m}}})\right)_{\ell\frac{d}{L}} + \left(\mathbf{N}^T_{K,L}(F^T_K)_\omega\right)_\ell
= K \cdot \left(Z_{d/L}\Big(\sum_{r=0}^{\frac{d}{K}-1}(\mathbf{x} \circ S_{\omega-rK}\bar{\mathbf{x}}) \ast_d (\tilde{\mathbf{m}} \circ S_{rK-\omega}\bar{\tilde{\mathbf{m}}})\Big)\right)_\ell + \left(\mathbf{N}^T_{K,L}(F^T_K)_\omega\right)_\ell,
\]
and for any $\alpha \in [L]_0$, by aliasing (Lemma B.1.1 with $s = \frac{d}{L}$), we have that
\[
(F_L \mathbf{p}_\omega)_\alpha = \frac{KL}{d}\sum_{r=0}^{\frac{d}{K}-1}\sum_{\ell=0}^{\frac{d}{L}-1} \left(F_d\left((\mathbf{x} \circ S_{\omega-rK}\bar{\mathbf{x}}) \ast_d (\tilde{\mathbf{m}} \circ S_{rK-\omega}\bar{\tilde{\mathbf{m}}})\right)\right)_{\alpha - \ell L} + \left(F_L\left(\mathbf{N}^T_{K,L}(F^T_K)_\omega\right)\right)_\alpha.
\]

APPENDIX C

BLIND DECONVOLUTION

C.1 ALTERNATIVE APPROACH

In this section, we discuss the convex relaxation approach studied in [2].

C.1.1 CONVEX RELAXATION

In [2], the approach is to solve a convex version of the problem. Given $\mathbf{y} \in \mathbb{C}^L$, the goal is to find $\mathbf{h} \in \mathbb{R}^K$ and $\mathbf{x} \in \mathbb{R}^N$ that are consistent with the observations. Making no assumptions beyond the dimensions, a natural way to choose among the many feasible points is to select the pair of least energy, i.e., to solve the least-squares problem
\[
\underset{\mathbf{u},\mathbf{v}}{\text{minimize}} \quad \|\mathbf{u}\|_2^2 + \|\mathbf{v}\|_2^2 \quad \text{subject to} \quad y(\ell) = \langle \mathbf{c}_\ell, \mathbf{u} \rangle \langle \mathbf{v}, \mathbf{b}_\ell \rangle, \quad 1 \le \ell \le L.
\]
This is a non-convex quadratic optimization problem: the cost function is convex, but the quadratic equality constraints make the feasible set non-convex. The dual of this minimization problem is a semidefinite program (SDP), and taking the dual again gives the convex program
\[
\min_{\mathbf{W}_1, \mathbf{W}_2, \mathbf{X}} \; \frac{1}{2}\operatorname{tr}(\mathbf{W}_1) + \frac{1}{2}\operatorname{tr}(\mathbf{W}_2) \quad \text{subject to} \quad
\begin{bmatrix} \mathbf{W}_1 & \mathbf{X} \\ \mathbf{X}^* & \mathbf{W}_2 \end{bmatrix} \succeq 0, \quad \mathbf{y} = \mathcal{A}(\mathbf{X}),
\]
which is equivalent to
\[
\min \|\mathbf{X}\|_* \quad \text{subject to} \quad \mathbf{y} = \mathcal{A}(\mathbf{X}),
\]
where $\|\mathbf{X}\|_* = \operatorname{tr}(\sqrt{\mathbf{X}^*\mathbf{X}})$ denotes the nuclear norm. In [2], recovery guarantees were achieved for relatively large $K$ and $N$, provided $\mathbf{B}$ is incoherent in the Fourier domain and $\mathbf{C}$ is generic. We can now outline the algorithm from [2].

Algorithm C.1 Convex Relaxed Blind Deconvolution Algorithm
Input: Normalized Fourier measurements $\mathbf{y}$
Output: Estimates of the underlying signal and blurring function
1) Compute $\mathcal{A}^*(\mathbf{y})$.
2) Find the leading singular value of $\mathcal{A}^*(\mathbf{y})$ and its associated left and right singular vectors, denoted by $d$, $\tilde{\mathbf{h}}_0$, and $\tilde{\mathbf{x}}_0$, respectively.
3) Let $\mathbf{X}_0 = \tilde{\mathbf{h}}_0 \tilde{\mathbf{x}}_0^*$ denote the initial estimate, and solve the optimization problem
\[
\min \|\mathbf{X}\|_* \quad \text{subject to} \quad \|\mathbf{y} - \mathcal{A}(\mathbf{X})\| \le \delta, \tag{C.1}
\]
where $\|\cdot\|_*$ denotes the nuclear norm and $\delta$ is an upper bound on the noise level $\|\mathbf{e}\|_2$.
Return $(\mathbf{h}, \mathbf{x})$, where the recovered minimizer is factored as $\mathbf{X} = \mathbf{h}\mathbf{x}^*$.
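To illustrate program (C.1), the following is a small synthetic sketch using the cvxpy modeling package. It is a sketch only: it assumes real-valued generic Gaussian $\mathbf{B}$ and $\mathbf{C}$ rather than the structured Fourier-domain $\mathbf{B}$ analyzed in [2], and the dimensions, noise bound $\delta$, and variable names are illustrative.

\begin{verbatim}
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)
K, N, L = 5, 5, 60                     # unknown sizes, number of measurements
B = rng.normal(size=(L, K))            # known; rows b_ell (illustrative)
C = rng.normal(size=(L, N))            # known; rows c_ell (illustrative)
h_true, x_true = rng.normal(size=K), rng.normal(size=N)
y = (B @ h_true) * (C @ x_true)        # bilinear measurements y(ell)

# Lifted linear operator: A(X)_ell = b_ell^T X c_ell = <X, b_ell c_ell^T>
def A_op(X):
    return cp.hstack([B[l] @ X @ C[l] for l in range(L)])

X = cp.Variable((K, N))
delta = 1e-6                           # noise-level bound, ||e||_2 <= delta
prob = cp.Problem(cp.Minimize(cp.normNuc(X)),
                  [cp.norm(y - A_op(X), 2) <= delta])
prob.solve()

# Extract (h, x) from the leading singular pair of the minimizer X ~ h x^*
U, s, Vt = np.linalg.svd(X.value)
h, x = np.sqrt(s[0]) * U[:, 0], np.sqrt(s[0]) * Vt[0]
corr = abs(h @ h_true) / (np.linalg.norm(h) * np.linalg.norm(h_true))
print(f"correlation with true h (up to scale): {corr:.4f}")
\end{verbatim}

The recovered pair is only determined up to the usual scaling ambiguity $(\mathbf{h}, \mathbf{x}) \mapsto (c\,\mathbf{h}, \mathbf{x}/c)$, which is why the check above uses a normalized correlation.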